This is an excerpt from an upcoming whitepaper on Lakehouses.
Data Models
A common question is where common data models should reside
within the Lakehouse. There are three basic answers: Gold, Silver, or a mix of
Silver and Gold.
One answer is that all common data model items should be put
in the Gold zone, because they were created for a specific purpose: to gain
the benefits of standardization. For example, healthcare has many
common data models, such as the Observational Medical Outcomes Partnership (OMOP)
model from Observational Health Data Sciences and Informatics (OHDSI), Health
Level 7's (HL7) Fast Healthcare Interoperability Resources (FHIR), and the
standards of the Clinical Data Interchange Standards Consortium (CDISC). The
same reasoning applies to Data Vault, snowflake schema, and other modeling
approaches, as shown in Figure 4.
Figure 4: Gold-Focused Data Modeling
The counterargument is that although these models are created for a
named reason, that reason is vague. They are fit for use but were not built to
answer a specific question, so they should be in the Silver zone, as shown
in Figure 5.
Figure 5: Silver-Oriented Data Modeling
The first two answers place common data models entirely in the Gold zone or entirely in the Silver zone. The third answer, which I discussed in my book, Databricks Lakehouse Platform Cookbook (https://www.amazon.com/dp/9355519567), is that the decision depends on whether the model in question was created to be served or to act as an intermediate construct, as shown in Figure 6.
Figure 6: Use-Driven Modeling Approach
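The use-driven rule can be sketched in a few lines of Python. This is only an illustration, not code from the book: the `ModelOutput` class, its `served_to_consumers` flag, and the dataset names are all hypothetical, and how you decide that a model is "served" is left to your own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOutput:
    """A modeled dataset. All fields are hypothetical and for illustration only."""
    name: str
    served_to_consumers: bool  # True if reports, apps, or analysts query it directly


def zone_for(model: ModelOutput) -> str:
    """Use-driven rule: served models go to Gold, intermediate constructs to Silver."""
    return "gold" if model.served_to_consumers else "silver"


# Hypothetical examples: one model served directly, one used as a stepping stone.
models = [
    ModelOutput("omop_person", served_to_consumers=True),
    ModelOutput("data_vault_hub_patient", served_to_consumers=False),
]
placements = {m.name: zone_for(m) for m in models}
```

Here the same modeling technique could land in either zone; only the intended use of each output changes the result.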
We have discussed several approaches to placing data modeling
outputs within a Lakehouse. The same reasoning can be extended to cover any items
created in a specific way, but not necessarily for a specific business purpose.
Data Modeling Recommendation
I recommend a pragmatic approach. From a technical
perspective, it does not matter whether you place common data models in Silver,
Gold, or a mixture of the two. Both zones are fit for use. Instead, pick an
approach, document it, and stick with it. The same guidance applies to other
design decisions, such as mapping workspaces, environments, business functions,
and other criteria to Unity Catalog catalogs. If you find yourself (and your
organization) engaged in multiple meetings belaboring this topic, review the approach
I recommended in my book (discussed in the previous section). Gold items are
typically more focused on consumption than Silver items. As such, if your modeling
output is directly consumable in a performant fashion, call it Gold;
otherwise, put it in Silver.
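One way to "pick an approach, document it, and stick with it" is to record the chosen policy in code and check existing placements against it. The sketch below is an assumption about how that might look, not a Databricks or Unity Catalog feature: the `Policy` enum, the helper functions, and the dataset names are all hypothetical.

```python
from enum import Enum


class Policy(Enum):
    """The three answers discussed in this section (names are hypothetical)."""
    ALL_GOLD = "all_gold"
    ALL_SILVER = "all_silver"
    USE_DRIVEN = "use_driven"


def expected_zone(policy: Policy, directly_consumable: bool) -> str:
    """Return the zone a modeling output should live in under the given policy."""
    if policy is Policy.ALL_GOLD:
        return "gold"
    if policy is Policy.ALL_SILVER:
        return "silver"
    # Use-driven: directly consumable outputs are Gold, everything else Silver.
    return "gold" if directly_consumable else "silver"


def find_violations(policy, placements):
    """List the names whose actual zone differs from the documented policy.

    placements: iterable of (name, actual_zone, directly_consumable) tuples.
    """
    return [
        name
        for name, zone, consumable in placements
        if zone != expected_zone(policy, consumable)
    ]


# The documented decision, written down once and reused everywhere.
DOCUMENTED_POLICY = Policy.USE_DRIVEN

# Hypothetical inventory of modeling outputs and where they currently live.
placements = [
    ("fhir_observation", "gold", True),
    ("omop_staging_visit", "gold", False),
    ("cdisc_sdtm_dm", "silver", False),
]
violations = find_violations(DOCUMENTED_POLICY, placements)
```

A check like this could run in a review step so that the placement debate happens once, when the policy is chosen, rather than for every new table.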