Tuesday, October 15, 2024

Data Models in the Lakehouse

This is an excerpt from an upcoming whitepaper on Lakehouses.

Data Models

A common question is where do things like common data models reside within the Lakehouse? There are three basic answers. Gold, Silver, and Silver/Gold.

One answer is that all common data model items should be put in the Gold zone. This is because they were created for a specific purpose, namely, to enjoy the benefits of standardization. For example, healthcare has many common data models such as Observation Health Data Sciences’ (OHDSI) Observational Medical Outcomes Partnership (OMOP), Health Level 7’s (HL7) Fast Healthcare Interoperability Resources (FHIR), and Clinical Data Interchange Standards Consortium (CDISC). This is in addition to Data Vault, Snowflake Schema, and other modeling approaches, as shown in Figure 4.


Figure 4: Gold-Focused Data Modeling

The counterargument is that while they are created for a named reason, that reason is vague, so they should be in the Silver zone as they are fit for use but not created for a specific question, as shown in Figure 5.


Figure 5: Silver-Oriented Data Modeling

The first two answers have been discussed; they belong in the Gold or Silver zone. The answer I discussed in my book, Databricks Lakehouse Platform Cookbook (https://www.amazon.com/dp/9355519567), is that the decision depends on whether the model in question was created to be served or to act as an intermediate construct, as shown in Figure 6.



Figure 6: Use-Driven Modeling Approach

We have discussed several approaches to using data modeling techniques within a Lakehouse. This approach can be extended to cover any items created in a specific way, but not necessarily for a specific business purpose.

Data Modeling Recommendation

I recommend a pragmatic approach. From a technical perspective, it does not matter if you place them in Silver or Gold or a mixture of the two. Both Zones are fit for use. Instead, pick an approach (document it) and stick with it. This guidance is also recommended for other design decisions, such as mapping workspaces, environments, business functions, and other criteria to Unity Catalog Catalogs. If you find yourself (and your organization) engaged in multiple meetings belaboring this topic, review the approach I recommended in my book (discussed in the previous section). Gold items are typically more focused on consumption than Silver. As such, if your modeling output is directly consumable in a performant fashion – call it Gold; otherwise, put it in Silver.

Thursday, September 26, 2024

Lakehouse: You’re probably doing it wrong! How the Lakehouse should really work

Databricks has been championing the Lakehouse and Medallion Architectures for some time. While the approach is familiar, people’s understandings often differ from best practices. This discussion aims to clarify how to use the Medallion Architecture and discuss best practices.

Medallion Architecture in the Lakehouse

At this point, most people have heard of the Medallion architecture. The terms Bronze, Silver, and Gold are intuitive and gaining adoption. That said, often people assume bronze maps to raw, silver to refine/curate/etc., and gold to serve. This is normal, but it is important to understand that the Lakehouse is new and different. Figure 1 contains a visual representation of the typical flow. A key concept is that we prefer to skip landing data from a source system in a raw/landing zone. Instead, the preference is to connect to the source system and land the data directly in the Bronze zone.




Figure 1: Medallion Architecture

Bronze is an append-only Delta table. It serves to capture and memorialize the history of a given data asset. Assuming the system being read includes updates and supports incremental extraction, Bronze will contain duplicates based on the table’s primary key. Additionally, the only operation performed during the population of Bronze is restoring the data to the format it was in within the source system. For example, if we are ingesting data from an Event Hub, we typically need to convert the encoded body to a string and then convert that string to a JSON object. We will likely need to change data types from the external system to Delta and Spark alternatives, such as VARCHAR becoming a String. We should not update tables in Bronze or prune and transform columns. Instead, keep it as close as possible to the structure of the source system.

When we refine Bronze tables to Silver, we first apply business rules to filter invalid records. Once we have applied data quality rules, we can remove differences using the MERGE INTO construct. Note that there should be a one-to-one mapping from the source table to a Bronze version of that table and a refined and fit-for-use Silver table, as presented in Figure 2. Additionally, we avoid changing the columns during this transformation.



Figure 2: Zones

Once a table has been refined for the Silver zone, we can use it in data engineering activities. A common activity is to combine silver tables to prepare for their use in Gold items. For example, we may combine normalized tables to reduce effort later. We sometimes refer to this operation as Silver-to-Silver refinement, indicating that one or more Silver tables are used to construct other Silver tables. While these tables are fit-for-use, they were not created for a specific business purpose. When we create a table to address a business question, we place those tables in the Gold zone. Gold tables are often created by applying aggregations or joining multiple Silver or Gold tables, as presented in Figure 3.



Figure 3: Zone Movements

One key thing to consider when developing solutions for this pattern is that the Bronze table will grow to be large. That means that if we attempt to identify new records in the Bronze table by accessing the Bronze table, our search time will increase over time. This challenge is addressed through the use of Delta’s Change Data Feed (CDF).

CDF allows us to identify the records associated with each version quickly. This avoids searching the Bronze table and can greatly improve performance. Note that a record identified by the primary key might have multiple records in a single application of CDF to Silver.  This occurs when source to Bronze is not tied to Bronze to Silver. Once you have applied business rules to the records in the CDF, use a RANK OVER operation to get the most recent valid update to the Silver table.

Summary

When constructing or evaluating a framework to populate the tables in your Lakehouse, ensure that CDF, data quality rules, and RANK OVER operations are being used. If not used, your Lakehouse will likely perform well initially, only to gradually (if not suddenly) start to take longer to move records from Source to Bronze to Silver.