New trendy technologies and ideas keep coming and going within IT. One of the new ideas is the Data Lakehouse, a combination of a Data Lake and a Data Warehouse, no longer 2 separate systems, but 1 system for all activities. In this blog, we will dive deeper into what the difference is between this Lakehouse and the traditional Data Warehouse and more modern Data Lake. Lakehouses can be provisioned in Azure using Databricks Lakehouse platform and with Azure Synapse analytics. What advantages does it have for your organization and your data specialists?
Legacy systems
Data Warehouses
First, it might be useful to give some background on our legacy systems. The traditional Data Warehouse, on-premise or in the cloud, can handle structured data (data is in defined boxes) extremely well. By using Structured Query Language (SQL), this data can be transformed into something where the end user can also do something with it. The Data Warehouse is primarily focused on providing business insights that can be used for marketing decisions. For example, this data can be shown in dashboards and reports, with nice graphs displayed.
A traditional Data Warehouse is not good at unstructured data, SQL does not know how to work with this. For the more data science-like activities, this type of unstructured data is often needed. A lot of data is also needed to create a good machine learning model. This requires a storage solution which is cheap, allows a lot of data to be stored, and can handle unstructured data.
In addition, a Data Warehouse likes to know in advance what the data looks like, so it knows how to process it. This is called a schema on write, which requires you to write out the schema in advance. This is often a lot of work and not feasible if the data is constantly changing. A Data Lake looks at the data as how it comes in and reads what needs to be done with it (schema on read).
Data Lakes
A Data Lake is a lake of data in which everything can be dumped. In some cases, everything was dumped into this lake and the overview was lost. Nevertheless, it did provide a solution that allowed data scientists to get to work. If you want to use both reports and data science, structured and unstructured data, you have to manage two different systems next to each other, which is not ideal. What if we could turn this into 1 system? This is where Data Lakehouses provide a solution.
Data Lakehouses
What is it?
The Data Lakehouse tries to combine the best of the Data Lake and the Data Warehouse. A Lakehouse gives you the same analytical capabilities you’re used to have in a modern cloud Data Warehouse. Creating business insights and input for dashboards can absolutely be done. At the same time, a lot of data can be stored in a raw format so that the data scientists can continue to develop their machine learning models.
The following points are typical of Lakehouses:
- It can store both structured and unstructured data for data science projects and reporting;
- Both batch and streaming data can be processed. Batch data means that with every interval a load will be done, the data will come in batches. Whereas with streaming data, the data will arrive as a continuous stream. With a Data Lakehouse you can choose what type of processing best suits your use cases;
- It is important that all transactions are ACID (Atomicity, Consistency, Isolation, Durability) because it also support both streaming data and batch data. These are characteristics that ensure that duplications or errors are prevented;
- It can provide for both Data Science projects and analysis and reporting purposes;
- It has a big focus on data quality, so that the insights you create and the analysis you do are actually correct;
- Compute and storage are separate. This is useful because you can fine-tune your configuration exactly to your own needs.
A Lakehouse can be set up by using the Medallion model: Bronze, Silver, and Gold tables:
- The Bronze tables are tables that have not been used yet, these are unprocessed “raw” files;
- The source tables may have many string fields due to inconsistency in the data. For the Silver tables, fields are converted to the correct data type. For example, date time columns are converted to a readable format. A unique key needs to be generated if this is missing. Try not to do too many transformations in this layer, the real work is done in the Gold tables.
- For the Gold tables, these Silver tables are adjusted and optimized so they can be used in reports and dashboards. This is often where a major cleanup of the data is done as well; the data quality for reporting must be of good quality.
With this architecture, the business user can get quick insights into performance with the Gold tables. Based on the Silver tables, the data analyst tries to make new insights and perform in-depth analysis. And the data scientist can use the Bronze or Silver data to train algorithms. All data specialists can work with the same data in the same solution.
The advantages
The Lakehouse makes it possible to do all the data work in one system, instead of having to maintain 2 systems. Besides this, all required data (in whatever form) can be stored, it’s a cost-effective solution and ultimately, it has the same (or more) reporting capabilities compared to Data Warehouses.
Premium Microsoft Consulting with Devoteam
Devoteam M Cloud, an Expert Azure MSP Partner, is one of the world’s leading providers of Microsoft Cloud technologies. Our team of over 1,000 Microsoft experts in EMEA, helps modernize the IT architecture of leading businesses and governmental organizations. With this, we support our clients in their digital journey to the Cloud, make them future proof, and provide the best managed services.