Thursday, September 26, 2024

Lakehouse: You’re probably doing it wrong! How the Lakehouse should really work

Databricks has been championing the Lakehouse and Medallion Architectures for some time. While the approach is familiar, people’s understandings often differ from best practices. This discussion aims to clarify how to use the Medallion Architecture and discuss best practices.

Medallion Architecture in the Lakehouse

At this point, most people have heard of the Medallion architecture. The terms Bronze, Silver, and Gold are intuitive and gaining adoption. That said, often people assume bronze maps to raw, silver to refine/curate/etc., and gold to serve. This is normal, but it is important to understand that the Lakehouse is new and different. Figure 1 contains a visual representation of the typical flow. A key concept is that we prefer to skip landing data from a source system in a raw/landing zone. Instead, the preference is to connect to the source system and land the data directly in the Bronze zone.




Figure 1: Medallion Architecture

Bronze is an append-only Delta table. It serves to capture and memorialize the history of a given data asset. Assuming the system being read includes updates and supports incremental extraction, Bronze will contain duplicates based on the table’s primary key. Additionally, the only operation performed during the population of Bronze is restoring the data to the format it was in within the source system. For example, if we are ingesting data from an Event Hub, we typically need to convert the encoded body to a string and then convert that string to a JSON object. We will likely need to change data types from the external system to Delta and Spark alternatives, such as VARCHAR becoming a String. We should not update tables in Bronze or prune and transform columns. Instead, keep it as close as possible to the structure of the source system.

When we refine Bronze tables to Silver, we first apply business rules to filter invalid records. Once we have applied data quality rules, we can remove differences using the MERGE INTO construct. Note that there should be a one-to-one mapping from the source table to a Bronze version of that table and a refined and fit-for-use Silver table, as presented in Figure 2. Additionally, we avoid changing the columns during this transformation.



Figure 2: Zones

Once a table has been refined for the Silver zone, we can use it in data engineering activities. A common activity is to combine silver tables to prepare for their use in Gold items. For example, we may combine normalized tables to reduce effort later. We sometimes refer to this operation as Silver-to-Silver refinement, indicating that one or more Silver tables are used to construct other Silver tables. While these tables are fit-for-use, they were not created for a specific business purpose. When we create a table to address a business question, we place those tables in the Gold zone. Gold tables are often created by applying aggregations or joining multiple Silver or Gold tables, as presented in Figure 3.



Figure 3: Zone Movements

One key thing to consider when developing solutions for this pattern is that the Bronze table will grow to be large. That means that if we attempt to identify new records in the Bronze table by accessing the Bronze table, our search time will increase over time. This challenge is addressed through the use of Delta’s Change Data Feed (CDF).

CDF allows us to identify the records associated with each version quickly. This avoids searching the Bronze table and can greatly improve performance. Note that a record identified by the primary key might have multiple records in a single application of CDF to Silver.  This occurs when source to Bronze is not tied to Bronze to Silver. Once you have applied business rules to the records in the CDF, use a RANK OVER operation to get the most recent valid update to the Silver table.

Summary

When constructing or evaluating a framework to populate the tables in your Lakehouse, ensure that CDF, data quality rules, and RANK OVER operations are being used. If not used, your Lakehouse will likely perform well initially, only to gradually (if not suddenly) start to take longer to move records from Source to Bronze to Silver.