Data Lakehouse Architecture

In a previous blog post, we discussed the flexible technology approach within togaether. For us, it's very important to understand our customers' needs well, design the right architecture, and, as the cherry on top, choose the right technology. Although our company was only established a few years ago, the core of our team has been working togaether (😉) for much longer, with our origins dating back to 2007. Thanks to that history, we have seen many architectural approaches emerge, which gives us a broad view of the available options.

That is why, in the upcoming blogs, we'll delve deeper into the different architectural options for setting up a (modern) data platform, explaining one concept per blog along with its pros and cons. In this blog, we explore a concept in which we see a lot of potential: the Data Lakehouse.

Over the past few years, we've followed the Data Lakehouse concept with great enthusiasm (partly thanks to Databricks), and it is a setup we regularly propose to clients. To understand the concept properly, we need to take a step back and look at the two fundamental building blocks of a Data Lakehouse:

Data Lake

A data lake is a centralized repository that allows for the storage of structured, semi-structured, and unstructured data at any scale. It is designed to store vast amounts of raw data in its native format, without the need for upfront data transformation or schema definition.
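To make that "store first, structure later" idea concrete, here is a minimal sketch in PySpark of landing semi-structured events in a data lake exactly as they arrive; the bucket names and paths are purely hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("land-raw-events").getOrCreate()

# Read semi-structured events as they are; no schema is defined up front,
# Spark simply infers whatever structure it finds.
raw_events = spark.read.json("s3://acme-landing/clickstream/2024/06/")  # hypothetical source

# Store the data in the lake in its native format, without any transformation.
raw_events.write.mode("append").json("s3://acme-datalake/raw/clickstream/")  # hypothetical target
```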

One major drawback of utilizing a Data Lake is its inherent lack of structure. Without proper governance in place, the data stored in the Data Lake becomes challenging to retrieve, ultimately transforming the Data Lake into a Data Swamp.

Data Warehouse

A data warehouse serves as a centralized storage facility for structured and preprocessed data originating from diverse sources, with the aim of facilitating business intelligence, reporting, and analysis.

In contrast to a data lake, which stores raw and unprocessed data, a data warehouse typically organizes data in a structured manner using predefined schemas and relationships. It undergoes an Extract, Transform, Load (ETL) process, where data is extracted from various sources, transformed into a consistent format, and loaded into the warehouse. Data warehouses prioritize query performance optimization and often employ techniques such as indexing, partitioning, and aggregations to enable efficient data retrieval and analysis.
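By way of contrast, here is a small sketch of what such an ETL flow into a warehouse-style table could look like, again in PySpark; the table, columns, and paths are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-to-dwh").getOrCreate()

# Extract: pull raw orders from a source location (hypothetical path).
orders_raw = spark.read.json("s3://acme-datalake/raw/orders/")

# Transform: enforce a predefined, consistent structure and deduplicate.
orders = (
    orders_raw
    .select(
        F.col("order_id").cast("long"),
        F.col("customer_id").cast("long"),
        F.col("amount").cast("decimal(10,2)"),
        F.to_date("order_ts").alias("order_date"),
    )
    .dropDuplicates(["order_id"])
)

# Load: write into a warehouse-style table, partitioned by date so that
# typical reporting queries only scan the partitions they need.
spark.sql("CREATE DATABASE IF NOT EXISTS dwh")
(orders.write
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("dwh.fact_orders"))
```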

One downside of implementing a Data Warehouse (DWH) is that ownership lies largely with IT: a DWH is the end product of development that starts from an end user's request. However, end users don't always know exactly what they want or what they can do with their data. Typically, insights only become sharp once the first iteration of the data has been built in the DWH.

When we combine the strengths of these two concepts, we arrive at a Data Lakehouse.

In a Data Lakehouse architecture, data is stored in its raw, untransformed form in a data lake, just like in a traditional Data Lake. However, it also introduces a layer of schema enforcement and indexing on top of the Data Lake. This allows for the application of schema-on-read, meaning that data can be queried using different schemas and structures without the need for upfront transformation. By combining Data Lake and Data Warehouse capabilities, a Data Lakehouse provides a more integrated and efficient approach to data management and analytics. It allows organizations to store large volumes of diverse data, perform on-demand transformations and analyses, and enable a wide range of data use cases, including reporting, ad hoc querying, machine learning, and real-time analytics.
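What that extra layer looks like in practice depends on the technology you choose. Below is a minimal sketch assuming Delta Lake on Spark, one common table format in lakehouse setups; the paths and column names are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-layer").getOrCreate()

# Raw files already sitting in the data lake (hypothetical path).
events = spark.read.json("s3://acme-datalake/raw/clickstream/")

# Writing as a Delta table adds a transaction log and schema metadata
# on top of the plain files in the lake.
events.write.format("delta").mode("append").save("s3://acme-lakehouse/events")

# Schema enforcement: an append whose columns do not match the table's schema
# is rejected instead of silently corrupting the data, e.g.
# bad_batch.write.format("delta").mode("append").save("s3://acme-lakehouse/events")  # would fail

# The same files remain queryable much like a warehouse table.
spark.read.format("delta").load("s3://acme-lakehouse/events").createOrReplaceTempView("events")
spark.sql("SELECT page, COUNT(*) AS visits FROM events GROUP BY page").show()
```

The specific format is not the point: the transaction log and schema metadata are what turn plain files in the lake into something that can be queried and governed like a warehouse table.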

Data Lakehouse – 3 layers

The Data Lakehouse combines the advantages of a Data Lake and a Data Warehouse and offers three layers where data can be stored and used (a short code sketch of this flow follows the overview below).

Bronze

  • Data is stored in its raw/original format
  • Backup layer

Silver

  • Data is cleansed
  • Data is enriched
  • Changes in data are captured, history is created
  • Data exploration layer

Gold

  • Data is modeled, prepped for end-user visualization
  • Data reporting layer
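As a rough sketch of how data could flow through these three layers, again assuming Delta Lake on Spark (table names, columns, and paths are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-flow").getOrCreate()

# Bronze: keep the data exactly as it arrived, serving as a backup of the source.
bronze = spark.read.json("s3://acme-landing/sales/")
bronze.write.format("delta").mode("append").save("s3://lakehouse/bronze/sales")

# Silver: cleanse and enrich, and record a load timestamp so changes can be traced.
silver = (
    spark.read.format("delta").load("s3://lakehouse/bronze/sales")
    .dropDuplicates(["sale_id"])
    .filter(F.col("amount").isNotNull())
    .withColumn("loaded_at", F.current_timestamp())
)
silver.write.format("delta").mode("overwrite").save("s3://lakehouse/silver/sales")

# Gold: model the data for end-user reporting, e.g. revenue per day.
gold = (
    spark.read.format("delta").load("s3://lakehouse/silver/sales")
    .groupBy(F.to_date("sale_ts").alias("sale_date"))
    .agg(F.sum("amount").alias("revenue"))
)
gold.write.format("delta").mode("overwrite").save("s3://lakehouse/gold/daily_revenue")
```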

For us, however, the greatest advantage is that it brings IT and business closer together. As discussed earlier, in a traditional DWH the ownership lies largely with IT. A Data Lakehouse narrows the gap between business and IT through its data exploration capabilities: once data is loaded into Silver, we can involve end users in the process and (through workshops) use the Silver layer to explore data togaether. This creates more value in the layer we eventually make available to the entire organization: the Gold layer.

Each type of architecture serves its own purpose, and there's much more to explore about the Data Lakehouse. If you're interested in learning more, we'd love to hear from you and would be happy to arrange an inspirational session to discuss (modern) data platforms togaether!