Databricks has been championing the Lakehouse and Medallion
Architectures for some time. While the approach is familiar, people’s
understandings often differ from best practices. This discussion aims to
clarify how to use the Medallion Architecture and discuss best practices.
Medallion Architecture in the Lakehouse
At this point, most people have heard of the Medallion
architecture. The terms Bronze, Silver, and Gold are intuitive and gaining
adoption. That said, often people assume bronze maps to raw, silver to refine/curate/etc.,
and gold to serve. This is normal, but it is important to understand that the
Lakehouse is new and different. Figure 1 contains a visual representation of
the typical flow. A key concept is that we prefer to skip landing data from a source
system in a raw/landing zone. Instead, the preference is to connect to the
source system and land the data directly in the Bronze zone.
Figure 1: Medallion Architecture
Bronze is an append-only Delta table. It serves to capture
and memorialize the history of a given data asset. Assuming the system being
read includes updates and supports incremental extraction, Bronze will contain
duplicates based on the table’s primary key. Additionally, the only operation
performed during the population of Bronze is restoring the data to the format
it was in within the source system. For example, if we are ingesting data from
an Event Hub, we typically need to convert the encoded body to a string and then
convert that string to a JSON object. We will likely need to change data types
from the external system to Delta and Spark alternatives, such as VARCHAR
becoming a String. We should not update tables in Bronze or prune and transform
columns. Instead, keep it as close as possible to the structure of the source system.
When we refine Bronze tables to Silver, we first apply business
rules to filter invalid records. Once we have applied data quality rules, we
can remove differences using the MERGE INTO construct. Note that there should be
a one-to-one mapping from the source table to a Bronze version of that table and
a refined and fit-for-use Silver table, as presented in Figure 2. Additionally,
we avoid changing the columns during this transformation.
Figure 2: Zones
Once a table has been refined for the Silver zone, we can
use it in data engineering activities. A common activity is to combine silver tables
to prepare for their use in Gold items. For example, we may combine normalized
tables to reduce effort later. We sometimes refer to this operation as Silver-to-Silver
refinement, indicating that one or more Silver tables are used to construct
other Silver tables. While these tables are fit-for-use, they were not created
for a specific business purpose. When we create a table to address a business
question, we place those tables in the Gold zone. Gold tables are often created
by applying aggregations or joining multiple Silver or Gold tables, as presented
in Figure 3.
Figure 3: Zone Movements
One key thing to consider when developing solutions for this
pattern is that the Bronze table will grow to be large. That means that if we attempt
to identify new records in the Bronze table by accessing the Bronze table, our
search time will increase over time. This challenge is addressed through the
use of Delta’s Change Data Feed (CDF).
CDF allows us to identify the records associated with each
version quickly. This avoids searching the Bronze table and can greatly improve
performance. Note that a record identified by the primary key might have multiple
records in a single application of CDF to Silver. This occurs when source to Bronze is not tied
to Bronze to Silver. Once you have applied business rules to the records in the
CDF, use a RANK OVER operation to get the most recent valid update to the Silver
table.
Summary
When constructing or evaluating a framework to populate the
tables in your Lakehouse, ensure that CDF, data quality rules, and RANK OVER operations
are being used. If not used, your Lakehouse will likely perform well initially,
only to gradually (if not suddenly) start to take longer to move records from Source
to Bronze to Silver.
No comments:
Post a Comment