Databricks is a notebook-based, unified solution for
performing various types of processing. It supports multiple programming
languages, including Python. Databricks offers a free Community Edition
suitable for educational and training purposes (Databricks, n.d.). Discrete Event Simulation models a
system as a sequence of events that occur at discrete points in time. SimPy is a Python library that
supports process-based discrete event simulation (Team SimPy, 2020). Using
Databricks is a viable alternative to installing Python and Jupyter locally.
Getting Started
The first step is to go to community.cloud.databricks.com
and create an account.
You will likely be asked if you want a free trial on AWS or
Azure or if you want to use the community edition. For educational purposes,
the community edition is sufficient (and free).
Once you’ve selected Getting Started under Community Edition,
you will be notified that an email is being sent to you. Opening the link in that email lands you on a page where
you are asked to reset (in reality, set) your password.
After you’ve assigned your password, you land in the
Databricks environment.
The landing page contains links to common tasks, along with
a left navigation bar. If you are using a library like SimPy, it is a good idea
to import it and configure it to be installed automatically whenever a cluster is created, because the
Community Edition appears to terminate clusters after periods of inactivity.
Importing the SimPy library and selecting the option to always install it
on clusters means you do not have to reimport it each time. Selecting Import Library brings up a screen
named “Create Library.” Don’t be confused: you are actually importing it into
the environment.
To import SimPy, select PyPI for the Library Source and enter
simpy as the package name. Then click Create.
You will likely see a screen saying that there are no
clusters available. Click the option
which states “Install automatically on all clusters.”
You will be prompted to confirm that you really want to enable
this feature. As long as you are using
Databricks Runtime versions below 7, you do.
At this point, you have told Databricks that when you create
a cluster, you want SimPy to always be installed and available. The next step is to use it. To do so, you
need to return to the landing page.
You might think you do this by clicking the Home link in the
left nav; instead, click the Databricks icon above it. The Home icon is used to navigate your
notebooks and libraries. It is also where you can import and export notebooks
and create notebooks, libraries, folders, and ML Experiments.
Creating and Running a Notebook
Databricks uses the notebook paradigm made popular by
Jupyter. From the landing page, click New Notebook. Alternatively, notebooks
can be created using the Home link mentioned above. Clicking New Notebook brings up a dialog
where you supply the name of the notebook and select the language and cluster
to use. In this case, we have not
created any clusters, so you will leave that blank. Since we are exploring the
use of SimPy, ensure that Python is selected.
After clicking Create, you will land in a notebook with one
cell and one line in that cell.
To add more cells, hover in the middle of the area just
above or below the cell.
Alternatively, click the chevron (Edit Menu) and select the
desired location for the new cell.
Most of the time, the first thing you’ll want to do is
import libraries so they are available for use within your code.
While the program is not very interesting, you can run it by
pressing Shift-Enter while in the cell, or by selecting the play button from the
menu at the top. Since we have no
clusters, you will see a prompt asking if you would like to Launch a new
cluster and run your notebook. You most
likely want to check the box that launches and attaches without prompting.
Select Launch and Run.
At this point, Databricks is creating a cluster with a single
driver node and no worker nodes. Since we are using the Community Edition,
there are no worker nodes. Since we are
writing Python that does not use Spark, that does not matter. If we were addressing Big Data problems, then
we would want to use a version of Databricks hosted in AWS or Azure.
After the cluster is created, the command is executed. You can tell the results of the execution by
looking below the cell, where Databricks shows the command output
along with its status and elapsed time.
Now that we have a cluster, we need to write the rest of our
simplistic simulation. Since SimPy uses
the concept of a generator to produce events, we need to write one. Below the
first cell, add another cell and write code similar to the following:
The key instruction in this generator is the yield
statement, which hands control back to SimPy until the scheduled event occurs. Next, we need to use this
generator within SimPy. To do that, add a cell below the one you just created
and put the following in it.
The final step is to run all of the cells. You can do that by pressing Shift-Enter in
each cell or by selecting Run All (the play button) at the top of the screen.
Generally, we would want to add documentation to our
notebook, so that others can understand it better. To do this, add a cell above
your first cell, and start it with %md.
This is a magic string that turns the cell into a markdown cell. Enter a
description using markdown syntax and then run the cell.
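For example, a documentation cell might contain the following (the heading and description text are illustrative):

```
%md
# Die-Rolling Simulation
Uses **SimPy** to roll a six-sided die once per simulated time unit.
```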
After execution, the cell converts to the output associated
with the markdown.
Publishing the Notebook
Often you need to share your finished notebook. To do this, you publish the notebook. This
action results in a shareable link. Note
that anyone who has the link can view it.
The AWS or Azure hosted version of Databricks has a richer sharing model
where users can collaborate on a notebook in real-time.
When you click the publish button you will be prompted to
ensure you really want to publish the notebook.
Also, note that the share links are active for six months.
Once published, you will be given the link which you can
share.
The link for the notebook used in this example is:
Additional Resources
Databricks has a Get Started page, which quickly walks a new
user through the environment.
On that page, the quick start is very useful.
Conclusion
I walked through the process of signing up for Databricks
Community Edition. I then shared the process of importing a library (SimPy),
and then using that library on a cluster.
We created a die-rolling example and talked about how to publish that
example. While this is not exhaustive coverage of the topic, hopefully this
information will help you get started with SimPy in Databricks.
References