Databricks is a notebook-based, unified solution for
performing various types of processing. It supports multiple programming
languages, including Python. Databricks offers a free Community Edition
suitable for educational and training purposes (Databricks, n.d.). Discrete Event Simulation models a
system as a sequence of events that occur at discrete points in time. SimPy is a Python library that
supports process-based discrete event simulation (Team SimPy, 2020). Using
Databricks is a viable alternative to installing Python and Jupyter locally.
Getting Started
The first step is to go to community.cloud.databricks.com
and create an account.
You will likely be asked if you want a free trial on AWS or
Azure or if you want to use the community edition. For educational purposes,
the community edition is sufficient (and free).
Once you’ve selected Getting Started under Community Edition,
you will be notified that an email is being sent to you. Opening the link in that email lands you on a page where
you are asked to reset (in reality, set) your password.
After you’ve assigned your password, you land in the
Databricks environment.
The landing page contains links to common tasks, along with
a left navigation bar. If you are using a library like SimPy, it is a good idea
to import it and configure it to be installed automatically whenever a cluster is created, because the
Community Edition appears to terminate clusters after periods of inactivity.
Importing the SimPy library and selecting the option to always install it
on clusters means you do not have to reimport it each time. Selecting Import Library brings up a screen
named “Create Library.” Don’t be confused: you are actually importing it into
the environment.
To import SimPy, select PyPI for the Library Source and enter
simpy as the package name. Then click Create.
You will likely see a screen saying that there are no
clusters available. Click the option
which states “Install automatically on all clusters.”
You will be prompted to confirm that you really want to enable
this feature. As long as you are using
Databricks Runtime versions below 7, you do.
At this point, you have told Databricks that when you create
a cluster, you want SimPy to always be installed and available. The next step is to use it. To do so, you
need to return to the landing page.
You might think you do this by clicking the Home link in the
left nav; instead, click the Databricks icon above it. The Home icon is used to navigate your
notebooks and libraries. It is also where you can import and export notebooks
and create notebooks, libraries, folders, and ML Experiments.
Creating and Running a Notebook
Databricks uses the notebook paradigm made popular by
Jupyter. From the landing page, click New Notebook. Alternatively, notebooks
can be created using the Home link mentioned above. Clicking New Notebook brings up a dialog
where you supply the name of the notebook and select the language and cluster
to use. In this case, we have not
created any clusters, so you will leave that blank. Since we are exploring the
use of SimPy, ensure that Python is selected.
After clicking Create, you will land in a notebook with one
cell and one line in that cell.
To add more cells, hover in the middle of the area just
above or below the cell.
Alternatively, click the chevron (Edit Menu) and select the
desired location for the new cell.
Most of the time, the first thing you’ll want to do is
import libraries so they are available for use within your code.
While the program is not very interesting, you can run it by
pressing Shift-Enter while in the cell, or by selecting the play button from the
menu at the top. Since we have no
clusters, you will see a prompt asking if you would like to Launch a new
cluster and run your notebook. You most
likely want to check the box that launches and attaches without prompting.
Select Launch and Run.
At this point, Databricks is creating a cluster with a single
driver node and no worker nodes. Since we are using the Community Edition,
there are no worker nodes. Since we are
writing Python that does not use Spark, that does not matter. If we were addressing Big Data problems, then
we would want to use a version of Databricks hosted in AWS or Azure.
After the cluster is created, the command is executed. You can tell the results of the execution by
looking below the cell, where Databricks shows the command output
along with its status and elapsed time.
Now that we have a cluster, we need to write the rest of our
simplistic simulation. Since SimPy uses
the concept of a generator to produce events, we need to write one. Below the
first cell, add another cell and write code similar to the following:
The key instruction in this generator is the yield
statement, which hands control back to SimPy until the scheduled event occurs. Next, we need to use this
generator within SimPy. To do that, add a cell below the one you just created
and put the following in it.
The final step is to run all of the cells. You can do that by pressing Shift-Enter in
each cell or by selecting Run All (the play button) at the top of the screen.
Generally, we would want to add documentation to our
notebook, so that others can understand it better. To do this, add a cell above
your first cell, and start it with %md.
This is a magic string that turns the cell into a markdown cell. Enter a
description using markdown syntax and then run the cell.
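For example, a documentation cell might contain the following (the heading and description text are illustrative):

```
%md
# Die-Rolling Simulation
Uses **SimPy** to roll a six-sided die once per simulated time unit.
```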
After execution, the cell converts to the output associated
with the markdown.
Publishing the Notebook
Often you need to share your finished notebook. To do this, you publish the notebook. This
action results in a shareable link. Note
that anyone who has the link can view it.
The AWS or Azure hosted version of Databricks has a richer sharing model
where users can collaborate on a notebook in real-time.
When you click the publish button you will be prompted to
ensure you really want to publish the notebook.
Also, note that the share links are active for six months.
Once published, you will be given the link which you can
share.
The link for the notebook used in this example is:
Additional Resources
Databricks has a Get Started page, which quickly walks a new
user through the environment.
On that page, the quick start is very useful.
Conclusion
I walked through the process of signing up for Databricks
Community Edition. I then shared the process of importing a library (SimPy),
and then using that library on a cluster.
We created a die-rolling example and talked about how to publish that
example. While this is not exhaustive coverage of the topic, hopefully this
information will help you get started with SimPy in Databricks.
References