Data Engineering

Data lakes: Idyllic location or swamp monster?

I have always been a fan of history. To know where we come from as a species gives us so many answers about how we function today. We evolve and create new tools to fulfill our needs and we discard old patterns that we don’t need anymore. We grow.

Data, and the way we use data, grows too. In 2020 we reached 64 zettabytes of data; by 2025 that figure is expected to be 180 zettabytes. With growth comes a need to maintain and to harvest. But first, a bit of history.

Data has many uses. It can be created by a variety of providers, such as applications or devices. It comes in all shapes and sizes and is structured in many different ways. We all use it in some way, and we keep finding new ways to use it for our benefit. At Algorhythm we focus on analytics, so we work mainly in data warehousing, reporting and data science.

Data warehousing isn’t new. It has been around for decades, and with good cause: it has supported business decision making for years. But data warehousing is expensive and serves a specific need. Storing data in a data warehouse means provisioning databases, first on-prem and more recently in the cloud. It means transforming the data into information and cleaning it for accuracy, so that reporting gives you the knowledge you need to keep growing your business.

Because turning data into information is expensive, we are selective about what we put in our data warehouse. Through analysis with business users, we collect requirements to see which data is best suited for the task. Storing data in a data warehouse often means storing it in a database, and with the rise of different types of data this becomes more problematic. Databases are not too fond of semi-structured data such as XML or JSON, let alone unstructured data such as images, voice, … and yet there is information there that can be harvested or can be useful.

Data warehousing isn’t the sole user of data. Many other fields, such as data science, machine learning and AI, use data as a source for their operations. These fields use some of the same data, but also very different data, to provide a business with even more information on how to grow.

Enter the data lake: a place where we can store data to serve a variety of uses. The most common technology for building a data lake is a cloud storage service such as Amazon S3. It can also be an on-prem solution, for which several vendors offer products to maintain a data lake at your company. Which to choose depends entirely on your requirements. Cloud providers are generally more mature in their data lake offerings, which are relatively easy to set up.

The goal of the data lake is to provide a central point for data of all sorts to convene and be used. We can build a data platform from this place, providing solutions for data warehousing, machine learning, AI, data science, etc. without duplicating the data. A data lake can contain not only raw data but also transformed data, ready to be used again by other processes.

And this is where we see the great benefit of a well-organized data lake. The data lake is the entry point for all data in a raw format, where we can either transform it for analytical purposes or use it as a source for other applications. It is the main data platform for all data usage. One can argue about the need for a data warehouse within a data lake, but that depends entirely on the business and its maturity in handling data. More on that in a future blog.
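To make the idea of raw and transformed data living side by side a little more concrete, here is a minimal sketch of a naming convention for objects in the lake. The zone names `raw` and `curated`, the key layout and the `lake_key` helper are all assumptions made for illustration; they are not a standard and not necessarily how your lake should be organized.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zone names: "raw" for data as it arrived,
# "curated" for data that has already been transformed.
ZONES = ("raw", "curated")


def lake_key(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key of the form zone/source/dataset/YYYY/MM/DD/filename."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    # Object stores such as S3 use "/" in keys, so a POSIX-style path fits.
    return str(PurePosixPath(zone, source, dataset, f"{day:%Y/%m/%d}", filename))


# The same dataset can exist in both zones without overwriting itself.
raw_key = lake_key("raw", "crm", "customers", date(2024, 1, 15), "export.json")
curated_key = lake_key("curated", "crm", "customers", date(2024, 1, 15), "customers.csv")
```

A consistent layout like this is one way to let a warehouse load, a data science notebook and an ML pipeline all find the same data without each keeping its own copy.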

A data lake is not without faults, however. As with any solution, its benefit greatly depends on how well it is implemented. Without proper metadata management that captures what is put into the data lake, it can easily become a mess where nobody knows what data resides in it. Hence the name data swamp. Data governance is sorely needed, and data owners should be appointed to ensure the validity of the data in the lake. Actions such as storing data definitions and data cleansing rules will help keep your data lake in pristine condition to further grow your business.
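What metadata management can look like at its very simplest is sketched below: a small in-memory catalog that records an owner, a description and a data definition for each dataset, and that can report which objects in the lake nobody has documented. The `CatalogEntry` and `Catalog` classes and their fields are hypothetical names for this example. A real setup would more likely rely on a dedicated catalog service, so treat this as an illustration of the idea only.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata for one dataset in the lake (fields are illustrative)."""
    key: str            # where the data lives in the lake
    owner: str          # the appointed data owner
    description: str    # what the data means in business terms
    schema: dict        # data definition: column name -> type
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class Catalog:
    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Governance rule for this sketch: no owner, no entry.
        if not entry.owner.strip():
            raise ValueError(f"{entry.key} needs a data owner")
        self._entries[entry.key] = entry

    def undocumented(self, keys_in_lake: list[str]) -> list[str]:
        """Return the keys present in the lake but missing from the catalog."""
        return sorted(k for k in keys_in_lake if k not in self._entries)

    def to_json(self) -> str:
        return json.dumps([asdict(e) for e in self._entries.values()], indent=2)


catalog = Catalog()
catalog.register(CatalogEntry(
    key="raw/crm/customers",
    owner="sales-ops",
    description="Daily customer export from the CRM",
    schema={"customer_id": "string", "signup_date": "date"},
))
swamp_candidates = catalog.undocumented(["raw/crm/customers", "raw/misc/dump_final_v2"])
```

Anything that shows up in `swamp_candidates` is data that nobody has described or taken ownership of, which is exactly the kind of content that turns a lake into a swamp.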

We, at Algorhythm, started a sandbox environment to experiment with several use cases to explore and gain further knowledge in the data domain, across multiple cloud providers. You’ve probably read the blog by my colleague where he explains how we used APIs to get data into a data lake. The reason we load the data into the data lake is so other colleagues in our company can use that as a starting point to do predictive analytics, ML or other actions without having to use the API again. We believe in learning by experience and we create challenges for ourselves to become more proficient each day.
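The idea of calling an API once and letting everyone else read from the lake can be sketched as follows. This is not the code from our sandbox. The `load_or_fetch` function is a hypothetical helper, a local folder stands in for cloud storage, and `fetch` is whatever function would normally call the API.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_fetch(lake_root: Path, key: str, fetch: Callable[[], object]) -> object:
    """Return the data stored under `key`, calling `fetch` only if it is missing."""
    path = lake_root / key
    if path.exists():
        # Someone already landed this data: read it from the lake.
        return json.loads(path.read_text(encoding="utf-8"))
    # First request: call the API once and store the raw response.
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

The first colleague to ask for a dataset pays the cost of the API call. Everyone after that reads the stored raw copy, which also keeps their analyses working from the same starting point.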

Should you, as a business, go for a data lake? It depends entirely on your requirements, but it provides a great starting point for using data a lot more and in more varied ways. Even if you only use it for simple analytics, you can still start with a data lake to ingest raw data for your data warehouse and build towards a data platform in the long term. A data lake in the cloud is fairly inexpensive and provides a central starting point for all your data needs. When in doubt, be sure to contact us to see what solutions are possible.