Tuesday, October 15, 2024

Data Models in the Lakehouse

This is an excerpt from an upcoming whitepaper on Lakehouses.

Data Models

A common question is where common data models should reside within the Lakehouse. There are three basic answers: Gold, Silver, or a mix of Silver and Gold.

One answer is that all common data model items should be put in the Gold zone, because they were created for a specific purpose: to enjoy the benefits of standardization. For example, healthcare has many common data models, such as the Observational Health Data Sciences and Informatics (OHDSI) Observational Medical Outcomes Partnership (OMOP) Common Data Model, Health Level 7’s (HL7) Fast Healthcare Interoperability Resources (FHIR), and the standards published by the Clinical Data Interchange Standards Consortium (CDISC). These are in addition to Data Vault, Snowflake Schema, and other modeling approaches, as shown in Figure 4.


Figure 4: Gold-Focused Data Modeling

The counterargument is that, although these models are created for a named reason, that reason is vague. They are fit for use but were not created to answer a specific question, so they belong in the Silver zone, as shown in Figure 5.


Figure 5: Silver-Oriented Data Modeling

The first two answers place every common data model in a single zone, either Gold or Silver. The third answer, which I discussed in my book, Databricks Lakehouse Platform Cookbook (https://www.amazon.com/dp/9355519567), is that the decision depends on whether the model in question was created to be served or to act as an intermediate construct, as shown in Figure 6.



Figure 6: Use-Driven Modeling Approach

We have discussed several approaches to using data modeling techniques within a Lakehouse. The use-driven approach can be extended to cover any items created in a specific way, but not necessarily for a specific business purpose.

Data Modeling Recommendation

I recommend a pragmatic approach. From a technical perspective, it does not matter whether you place common data models in Silver, in Gold, or in a mixture of the two. Both zones are fit for use. Instead, pick an approach, document it, and stick with it. The same guidance applies to other design decisions, such as mapping workspaces, environments, business functions, and other criteria to Unity Catalog catalogs. If you find yourself (and your organization) engaged in multiple meetings belaboring this topic, review the approach I recommended in my book (discussed in the previous section). Gold items are typically more focused on consumption than Silver items. As such, if your modeling output is directly consumable in a performant fashion, call it Gold; otherwise, put it in Silver.
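The rule of thumb above can be written down as a small, documented function, which is one way to make the chosen approach explicit and repeatable. The following is a minimal sketch in Python, not part of the whitepaper or the book: the `ModelOutput` class, its field names, the `choose_zone` function, and the example table names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOutput:
    """Hypothetical description of one data modeling output."""
    name: str
    served_to_consumers: bool  # assumption: the output is queried directly by consumers
    performant_as_is: bool     # assumption: it performs acceptably without further shaping


def choose_zone(output: ModelOutput) -> str:
    """Apply the rule of thumb: directly consumable and performant means Gold,
    anything else (for example, an intermediate construct) means Silver."""
    if output.served_to_consumers and output.performant_as_is:
        return "gold"
    return "silver"


# Hypothetical examples; the names are illustrative, not real tables.
examples = [
    ModelOutput("sales_star_schema", served_to_consumers=True, performant_as_is=True),
    ModelOutput("omop_staging_tables", served_to_consumers=False, performant_as_is=True),
    ModelOutput("raw_vault_hubs", served_to_consumers=True, performant_as_is=False),
]

for model in examples:
    print(f"{model.name}: {choose_zone(model)}")
```

Whether "performant" means a latency target, a cost ceiling, or something else is a decision for your organization; the point of the sketch is only that the rule is written down once and applied consistently.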
