Monday, August 03, 2020

Discrete Event Simulation with SimPy using Databricks



Databricks is a notebook-based, unified platform for many types of data processing. It supports multiple programming languages, including Python, and offers a free community edition suitable for educational and training purposes (Databricks, n.d.). Discrete event simulation is a type of simulation that models a system as a sequence of events occurring at discrete points in time. SimPy is a Python library for building discrete event simulations (Team SimPy, 2020). Using Databricks is a viable alternative to installing Python and Jupyter locally.

Getting Started

The first step is to go to community.cloud.databricks.com and create an account.

You will likely be asked if you want a free trial on AWS or Azure or if you want to use the community edition. For educational purposes, the community edition is sufficient (and free).


Once you’ve selected Getting Started under Community Edition, you will receive notification that an email is being sent to you. Opening the link in that email lands you on a page where you are asked to reset (really, set for the first time) your password.

After you’ve assigned your password, you land in the Databricks environment.

The landing page contains links to common tasks, along with a left navigation bar. If you are using a library like SimPy, it is a good idea to import it and configure it to be installed whenever a cluster is created. The community edition deletes clusters after periods of inactivity, so marking SimPy for automatic installation on all clusters means you do not have to reimport it each time. Selecting Import Library brings up a screen named “Create Library.” Don’t be confused: you are actually importing the library into the environment.
To import SimPy, select PyPI for the Library Source, enter simpy as the package name, and then hit Create.

You will likely see a screen saying that there are no clusters available. Click the option labeled “Install automatically on all clusters.”

You will be prompted to confirm that you really want to enable this feature. As long as you are using Databricks Runtime versions below 7, you do.
At this point, you have told Databricks that whenever you create a cluster, you want SimPy to be installed and available. The next step is to use it. To do so, you need to return to the landing page.
You might think you do this by clicking the Home link in the left navigation bar; instead, click the Databricks icon above it. The Home icon is used to navigate your notebooks and libraries. It is also where you can import and export notebooks and create notebooks, libraries, folders, and ML experiments.

Creating and Running a Notebook

Databricks uses the notebook paradigm made popular by Jupyter. From the landing page, click New Notebook. Alternatively, notebooks can be created using the Home link mentioned above. Clicking New Notebook brings up a dialog where you supply the name of the notebook and select the language and cluster to use. In this case, we have not created any clusters, so leave that blank. Since we are exploring the use of SimPy, ensure that Python is selected.

After clicking Create, you will land in a notebook with one cell and one line in that cell.

To add more cells, hover over the middle of the line above or below the cell.

Alternatively, click the chevron (Edit Menu) and select the desired location for the new cell.

Most of the time, the first thing you’ll want to do is import libraries so they are available for use within your code.

While the program is not very interesting yet, you can run it by pressing Shift-Enter while in the cell or by selecting the play button from the menu at the top. Since we have no clusters, you will see a prompt asking if you would like to launch a new cluster and run your notebook. You will most likely want to check the box that launches and attaches without prompting, then select Launch and Run.

At this point, Databricks is creating a cluster with no worker nodes and a single driver node. Since we are using the Community edition, there are no worker nodes. Since we are writing Python that does not use Spark, that does not matter. If we were addressing big data problems, we would want to use a version of Databricks hosted in AWS or Azure.
After the cluster is created, the command is executed. You can see the results of the execution below the cell. You should see something like the following:

Now that we have a cluster, we need to write the rest of our simplistic simulation. Since SimPy uses the concept of a generator to produce events, we need to write one. Below the first cell, add another cell and write code similar to the following:

The key piece of this code is the yield statement, which hands control back to the simulation until the next event is due. Next, we need to use this generator within SimPy. To do that, add a cell below the one you just created and put the following in it:

The final step is to run all of the cells. You can do that by hitting Shift-Enter in each cell or by selecting Run All (the play button) at the top of the screen.

Generally, we would want to add documentation to our notebook so that others can understand it better. To do this, add a cell above your first cell and start it with %md. This is a magic command that turns the cell into a markdown cell. Enter a description using markdown syntax and then run the cell.
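The text of the original documentation cell is not shown, so the content below is illustrative; the only required element is the %md magic on the first line.

```
%md
# Die-Rolling Simulation
Uses **SimPy** to simulate rolling a six-sided die once per simulated time unit.
```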

After execution, the cell converts to the output associated with the markdown.

Publishing the Notebook

Often you need to share your finished work. To do this, you publish the notebook. This action results in a shareable link. Note that anyone who has the link can view it. The AWS or Azure hosted version of Databricks has a richer sharing model where users can collaborate on a notebook in real time.

When you click the publish button, you will be prompted to confirm that you really want to publish the notebook. Also, note that the share links are active for six months.

Once published, you will be given the link which you can share.

The link for the notebook used in this example is:

Additional Resources

Databricks has a Get Started page, which quickly walks a new user through the environment.
On that page, the quick start is very useful.

Conclusion

I walked through the process of signing up for Databricks Community Edition. I then shared the process of importing a library (SimPy) and using that library on a cluster. We created a die-rolling example and talked about how to publish that example. While this is not exhaustive coverage of the topic, hopefully this information will help you get started with SimPy in Databricks.
References
Databricks. (n.d.). Databricks Community Edition. https://community.cloud.databricks.com/

Team SimPy. (2020). SimPy: Discrete event simulation for Python. https://simpy.readthedocs.io/en/latest/


