Is the data lake version 2.0 of the data warehouse?
Some people mistakenly believe that the data lake is version 2.0 of the data warehouse. Both act as a centralized store for all the data in an organization, so the underlying concept is similar. Despite that similarity, they are different tools that serve different purposes. Table 1 compares the data lake and the data warehouse.
Table 1. Comparison of the data lake and the data warehouse
| |Data Lake|Data Warehouse|
|---|---|---|
|Data|Stores everything|Focuses only on business processes|
|Cost|Low-cost storage|Expensive storage that gives fast response times|
|Type of data|Unstructured, semi-structured and structured|Mostly structured, in tabular form|
|Benefit|Scalable storage|High performance|
|Agility|Highly agile; can be configured and reconfigured as needed|Less agile, with a fixed configuration|
|Users|Data scientists, experts|Widely used by business users|
|Use cases|Machine learning, predictive analytics, data discovery, profiling|Batch reports, BI, visualizations|
Data warehouses are the traditional way to collect and store large amounts of data. A data warehouse is highly organized and structured, which means it can quickly answer a set of very specific questions. On the other hand, data is not stored in its original form, which makes it difficult to access, so in practice only IT professionals could use it. Compute and storage for the data warehouse were also expensive.
Different users and accessibility
Even if it is used only once, data stored in the warehouse must be complete, of uniform quality, and stored in tables. The data warehouse also contains mostly processed data, which anticipates a business-centric user base and business intelligence applications. Most data never enters the data warehouse. For example, semi-structured and unstructured data cannot be stored in it, and loading data into it takes a long time. All data must go through access, model, source, clean and load processes before it becomes available, so the key reason to have a data lake is to make data available.
The data lake can be applied to a large number and variety of problems, precisely because it lacks structure and organization. The data lake contains raw data and can meet the needs of users across the enterprise, although technically more specialized users often get the most value. The lack of a predefined schema makes the data lake more versatile and flexible.
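The difference between a fixed warehouse schema and the lake's lack of a predefined schema can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the `WAREHOUSE_COLUMNS` tuple and the helper functions are hypothetical and do not belong to any particular product.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "currency": "EUR"}',
    '{"user": "bo", "amount": "5", "note": "gift card"}',
    '{"user": "cy", "clicks": 3}',
]

# Warehouse style (schema-on-write): a fixed set of columns is enforced at load time.
WAREHOUSE_COLUMNS = ("user", "amount", "currency")


def load_into_warehouse(lines):
    """Keep only records that match the fixed schema exactly; reject the rest."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) == set(WAREHOUSE_COLUMNS):
            table.append((record["user"], float(record["amount"]), record["currency"]))
        else:
            rejected.append(line)
    return table, rejected


def store_in_lake(lines):
    """Lake style: keep every record unchanged, in its original form."""
    return list(lines)


def read_from_lake(lake, fields):
    """Schema-on-read: choose the fields only when the data is read.

    Fields that a record does not have come back as None.
    """
    rows = []
    for line in lake:
        record = json.loads(line)
        rows.append({field: record.get(field) for field in fields})
    return rows


table, rejected = load_into_warehouse(RAW_EVENTS)
lake = store_in_lake(RAW_EVENTS)
clicks_view = read_from_lake(lake, ["user", "clicks"])
```

In this sketch the warehouse loader accepts one record and rejects two, while the lake keeps all three and lets each reader decide later which fields matter.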
The term “data lake”
In October 2010, James Dixon coined the term “data lake” and is credited with naming the concept. He used the following analogy:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”
At the current pace, 2.5 quintillion bytes of data are created each day. In this era of growing data, there are many different types of data: structured data (RDBMS tables, columnar data, etc.), semi-structured data (JSON, CSV files, XML, etc.) and unstructured data (video files, images, email messages, etc.). The data lake is like a large container, very similar to a real lake. Just as a lake is fed by multiple tributaries, a data lake has structured data, unstructured data, machine-to-machine data and logs flowing into it in real time.
The characteristics of a data lake
The biggest advantage of a data lake is flexibility. Because data remains in its original format, larger-scale and more timely analysis of data flows can be performed. The data lake is characterized by three key attributes:
1. Collect everything
Data stored in a lake can be anything: raw sources kept over extended periods of time as well as any processed data. It ranges from completely unstructured data (such as text documents or images) to semi-structured data (such as hierarchical web content) to the strictly structured rows and columns of relational databases.
2. Dive in anywhere
A data lake enables users across multiple business units to refine, explore and enrich data on their own terms.
3. Flexible access
A data lake enables multiple data access patterns across a shared infrastructure, including batch, interactive, online, search, in-memory and other processing engines.
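The idea of several access patterns sharing one store can be sketched as follows. The `LAKE` dictionary stands in for shared storage, and the object names and the three functions are hypothetical examples, not the API of any real engine.

```python
import json
from collections import Counter

# Hypothetical shared store: object name -> raw content, kept in its original format.
LAKE = {
    "logs/app.log": "INFO start\nERROR disk full\nINFO stop\n",
    "sales/2024.csv": "region,amount\nnorth,10\nsouth,25\nnorth,5\n",
    "profiles/ana.json": '{"name": "Ana", "team": "finance"}',
}


def batch_sales_by_region(lake):
    """Batch pattern: scan a whole CSV object and aggregate it."""
    totals = Counter()
    lines = lake["sales/2024.csv"].strip().splitlines()[1:]  # skip the header row
    for line in lines:
        region, amount = line.split(",")
        totals[region] += int(amount)
    return dict(totals)


def interactive_lookup(lake, name):
    """Interactive pattern: fetch and parse a single JSON object on demand."""
    return json.loads(lake[name])


def search(lake, term):
    """Search pattern: list the objects whose raw content contains the term."""
    return sorted(name for name, body in lake.items() if term in body)


sales = batch_sales_by_region(LAKE)
profile = interactive_lookup(LAKE, "profiles/ana.json")
error_hits = search(LAKE, "ERROR")
```

All three functions read the same raw objects. None of them requires the data to be reshaped before it is stored, which is the point of the shared infrastructure described above.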
The benefits of a data lake
A data lake is best for businesses that need to provide large amounts of data to different users with diverse skills and needs. Its benefits are as follows:
1. Single source of truth
All departments can store their raw data in the data lake, because the data lake does not require data to be defined up front through a schema, which avoids a difficult modeling process. As a result, everyone can work from the same, most authentic data.
2. Real-time decision analysis
With the large processing power of the data lake, users can apply tools that ensure high data quality and arrive at real-time decision analytics.
3. Data democratization
Data democratization means that everybody has access to data. The data lake makes data available to the entire organization: every user can access any of the organization's data, provided they have the proper privileges.
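The condition "if they have the proper privileges" can be illustrated with a minimal sketch. The roles, the path prefixes and the `read_object` helper are all assumptions made for this example. Real platforms have their own, more detailed permission models.

```python
# Hypothetical mapping from role to the path prefixes that role may read.
PRIVILEGES = {
    "analyst": {"sales/"},
    "data_scientist": {"sales/", "logs/"},
}

# Hypothetical lake contents: object name -> raw content.
LAKE_OBJECTS = {
    "sales/2024.csv": "region,amount\nnorth,10\n",
    "logs/app.log": "INFO start\n",
}


def can_read(role, object_name):
    """Return True if the role holds a privilege covering this object."""
    prefixes = PRIVILEGES.get(role, set())
    return any(object_name.startswith(prefix) for prefix in prefixes)


def read_object(lake, role, object_name):
    """Return the raw object, or raise PermissionError if the role lacks access."""
    if not can_read(role, object_name):
        raise PermissionError(f"{role} may not read {object_name}")
    return lake[object_name]


analyst_sales = read_object(LAKE_OBJECTS, "analyst", "sales/2024.csv")
```

Here the data itself is available to everyone in one place, and the privilege table alone decides who can read which part of it.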
The data lake offers many kinds of benefits to organizations, data managers and data processors. However, many organizations are still unaware of these benefits and of how a data lake can deal with large volumes of data.