Data lakes as a way to tap the potential of big data: architectures and the issue of data discovery

The digital era has led to a massive, unprecedented growth in data. Never before has the value of information as a strategic resource been more evident. Data lakes are essential for managing and optimising the use of these vast stores of data.

Big data and data lakes

The term ‘big data’ refers to large and complex data volumes that exceed the capabilities of conventional data management and analytics technologies. The 5 V’s of big data are:

Volume: Brought about by digitalisation in various spheres of life, big data encompasses huge amounts of data that exceed conventional storage capabilities.
Velocity: Data is generated and updated in real time by sensors, social media platforms, IoT devices and so forth, which makes collecting and processing it a major challenge.
Variety: Big data can include structured, semi-structured and unstructured data, making it necessary to find new approaches to managing and analysing it.
Veracity: The quality and veracity of data varies widely because it comes from different sources. The accuracy of the data is crucial to avoid incorrect results.
Value: The aim of big data analytics is to extract useful information from this data, spot trends and make well-founded decisions in order to generate strategic advantages for companies. The value of big data lies in its ability to generate knowledge.

Big data poses new challenges for traditional data management technologies, including time-to-information, data heterogeneity, data quality and governance. Data lakes play a key role in overcoming these challenges.

A data lake is a centralised, scalable repository for storing all types of data in a raw, unprocessed form. The data in a data lake can be structured or unstructured, and it can be stored without a predefined structure or schema. Data lakes are therefore well suited for storing big data, as they offer flexibility as a repository for a wide range of data. In addition, data lakes are often built on storage technologies that offer high scalability and flexibility while keeping costs under control.
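Storing data without a predefined schema means that structure is applied only when the data is read. The following Python sketch illustrates this idea under simplified assumptions: the raw records, the field names and the read_with_schema helper are hypothetical and exist only for this example.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no schema enforced on write).
raw_lines = [
    '{"sensor": "t1", "temp_c": "21.5", "ts": "2024-01-01T10:00:00"}',
    '{"sensor": "t2", "temp_c": 19, "location": "hall"}',
    'not valid json',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields and cast them."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable data stays in the lake but is skipped by this reader
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Each consumer can define its own schema for the same raw data.
temperature_schema = {"sensor": str, "temp_c": float}
rows = read_with_schema(raw_lines, temperature_schema)
print(rows)
```

A different consumer could read the same raw lines with another schema, for example one that also includes the location field, without the stored data having to change.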

The main attributes of data lakes are as follows:

Scalability: Data lakes can grow horizontally in order to process large volumes of data.
Flexibility: Data lakes can store any type of data with no predefined structure.
Advanced analytics: Data lakes provide a solid foundation for advanced analytics and machine learning.
Efficiency: Data lakes can be less expensive than conventional systems used to store large amounts of data.

Creating and managing a data lake requires careful planning when it comes to the data life cycle to ensure the data is accessible, secure and consistently available for use. The data life cycle includes:

Data entry
Data storage
Data cataloguing
Data preparation and analysis
Data maintenance and cleansing
Data governance and performance monitoring/management.

Different users with specific responsibilities and data access privileges work with a data lake, including data engineers, data scientists, business analysts, IT administrators, data consumers and data quality analysts. In addition, data stewards are responsible for the management, quality and integrity of the data stored in the data lake.

Data lake architectures

Various approaches to data lake architectures have emerged in recent years:

Centralised data lakes: All data is collected in a single repository, which provides a comprehensive overview, though the repository can become hard to navigate as it grows.
Decentralised data lakes: Data is managed separately by different business units or functions, which can improve data management but also lead to duplicate data.
Cloud data lakes: Companies are migrating their data lakes to cloud services such as Amazon S3, Azure Data Lake Storage and Google Cloud Storage, which provide scalability, simplify the management of the data lake and give access to cloud-based analytics services.
Zone-based data lakes: Data is divided into zones to make it easier to manage and analyse certain fragments of data without having to access the entire data lake.
Semantic data lakes: A semantic structure is applied to the data in order to improve the search and analytics capabilities.
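To make the zone-based approach more concrete, the following Python sketch shows one possible way to lay out zones as path prefixes. The zone names (raw, cleansed, curated), the path convention and the helper functions are assumptions made for this example; real data lakes use a variety of zone models.

```python
from pathlib import PurePosixPath

# Hypothetical zone names, ordered from least to most processed.
ZONES = ("raw", "cleansed", "curated")


def zone_path(zone, source, dataset, day):
    """Build the storage prefix for one dataset partition inside a zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return str(PurePosixPath(zone) / source / dataset / f"ingest_date={day}")


def next_zone(zone):
    """Return the zone that data is promoted to next, or None for the last zone."""
    index = ZONES.index(zone)
    return ZONES[index + 1] if index + 1 < len(ZONES) else None


print(zone_path("raw", "crm", "customers", "2024-01-01"))
print(next_zone("raw"))
```

With a layout like this, an analysis that only needs curated data can be restricted to a single prefix rather than scanning the entire data lake.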

In addition to these approaches, there are also Lambda and Kappa architectures that enable the scalable and flexible management and analysis of large volumes of data.

The Lambda architecture was developed to meet the challenge of running batch and stream processing in parallel but separately from each other. It is well suited to cases where both real-time and batch data need to be processed at the same time, and consists of the following elements:

Batch layer: This layer is responsible for retrospective batch processing of data and includes processes such as aggregation, indexing and preparation of data for analysis.
Speed layer: This layer is responsible for processing real-time or streaming data.
Serving layer: In this layer, the data processed in the two previous layers is made available to users via APIs or direct queries.
Batch and serving views: These views present aggregated versions of the data, and are updated regularly to display new, processed data.
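As a rough illustration of how the batch, speed and serving layers interact, the following Python sketch counts page views. The event format, the page-view use case and all function and class names are assumptions made for this example and do not come from any specific framework.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute page-view counts from the full historical dataset."""
    return Counter(event["page"] for event in master_events)


class SpeedView:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["page"]] += 1


def serve(page, batch, speed):
    """Serving layer: answer a query by merging the batch view and the real-time view."""
    return batch[page] + speed.counts[page]


# Historical events already covered by the last batch run.
history = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
batch = batch_view(history)

# Events that arrived since then are handled by the speed layer only.
speed = SpeedView()
for event in [{"page": "home"}, {"page": "docs"}]:
    speed.on_event(event)

print(serve("home", batch, speed))
```

In this simplified model, the next batch run would absorb the recent events into the batch view, after which the speed view could be reset.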

The Kappa architecture is a streamlined alternative to the more complex Lambda architecture that integrates both batch and stream processing into a single data flow. The main difference between the two architectures is that Lambda maintains a strict separation between batch and stream processing, while Kappa relies primarily on streaming and treats batch processing as an exception.
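The single data flow of the Kappa architecture can be sketched in Python as one append-only event log that is consumed by one processing function. Reprocessing historical data then simply means replaying the log from the beginning. The in-memory list standing in for the log and all function names are assumptions made for this example.

```python
def replay(log, update, state=None, from_offset=0):
    """Feed events from the log, starting at from_offset, through one update function."""
    state = {} if state is None else state
    for event in log[from_offset:]:
        update(state, event)
    return state


def count_per_page(state, event):
    """Original stream logic: count views per page name as given."""
    state[event["page"]] = state.get(event["page"], 0) + 1


def count_per_page_v2(state, event):
    """Revised stream logic: normalise page names to lowercase before counting."""
    page = event["page"].lower()
    state[page] = state.get(page, 0) + 1


# An in-memory list stands in for a durable, replayable event log.
log = [{"page": "Home"}, {"page": "home"}, {"page": "docs"}]
live = replay(log, count_per_page)

# A new event arrives and is processed by the same code path.
log.append({"page": "docs"})
live = replay(log, count_per_page, state=live, from_offset=3)

# "Batch" reprocessing is a replay of the whole log with the revised logic.
reprocessed = replay(log, count_per_page_v2)
print(live, reprocessed)
```

The design choice illustrated here is that there is no separate batch code base: historical and live data pass through the same function.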

Whether it makes sense to opt for a Lambda or a Kappa architecture depends on the relevant requirements, the complexity of the data and the performance the application needs to deliver.

Data discovery

Data lakes are ideal for incorporating raw data from multiple sources, though the variety and volume of data can make data lakes complex and difficult to navigate. The challenges involved in finding the right data include:

Excess complexity: The wide variety of data formats and types makes it difficult to identify the data that is relevant to a given analysis.
Data quality: A lack of standardisation and the presence of unclean data can lead to incorrect results or improper use of the data.
Large volumes of data: The vast quantities of data contained in the data lake can slow down the process of data discovery and impede access to data if no efficient systems are in place.

Advanced algorithms that utilise data mining and machine learning are used to meet these data discovery challenges. These algorithms identify related data by looking for different types of similarity:

Content-based similarity: Content-based similarity algorithms analyse data based on defined characteristics such as keywords or attributes in order to identify similar data. For example, similar documents can be found based on their contents.
Structure-based similarity: These algorithms analyse the data structure (database schemas, for example) to find similar data by identifying common patterns or relationships between the data.
Use-based similarity: These algorithms track how users work with data and identify data that is used in similar contexts. For example, they can recognise that two business analysts are using similar data to carry out similar analyses and generate a recommendation based on these similarities.
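The three types of similarity can be illustrated with one simple measure. The following Python sketch uses the Jaccard index (size of the intersection divided by size of the union) on keywords, column names and user sets. The choice of measure, the dataset names and the metadata shown are assumptions made for this example; production systems typically use more sophisticated techniques.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical catalogue metadata for three datasets in the lake.
datasets = {
    "orders_eu": {
        "keywords": {"order", "customer", "invoice"},
        "columns": {"order_id", "customer_id", "amount", "currency"},
        "users": {"anna", "ben"},
    },
    "orders_us": {
        "keywords": {"order", "customer", "shipping"},
        "columns": {"order_id", "customer_id", "amount", "state"},
        "users": {"ben", "carla"},
    },
    "sensor_log": {
        "keywords": {"temperature", "sensor"},
        "columns": {"sensor_id", "ts", "temp_c"},
        "users": {"dave"},
    },
}


def similarity(name_a, name_b, aspect):
    """Compare two datasets on one aspect: 'keywords', 'columns' or 'users'."""
    return jaccard(datasets[name_a][aspect], datasets[name_b][aspect])


def most_similar(name, aspect):
    """Rank all other datasets by their similarity to the given one."""
    scores = [(other, similarity(name, other, aspect))
              for other in datasets if other != name]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(most_similar("orders_eu", "keywords"))  # content-based
print(most_similar("orders_eu", "columns"))   # structure-based
print(most_similar("orders_eu", "users"))     # use-based
```

A ranking like this could feed a recommendation such as "users of orders_eu also worked with orders_us".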

These algorithms help make data discovery in data lakes more efficient. There are a number of use cases for solutions based on this type of algorithm:

Personal recommendations: Users receive recommendations for related or similar data based on their activities and needs.
Simplified searches: Intelligent search engines allow users to search for data using natural language and deliver relevant results.
Higher data quality: Once similar data has been identified, this information can be used to detect duplicate or incorrect data and thus improve the overall quality of the data.

Conclusion

To recap, big data offers enormous potential, but it also brings with it a number of complex challenges. Data lakes are critical to managing and analysing large amounts of data, and you need to understand the underlying principles and challenges in order to make effective use of them.

Author Christian Del Monte

Christian Del Monte is a software architect and engineer with many years of experience. In various projects in the B2B and B2C sectors, he has worked with a variety of software architectures implemented with different IT technologies and frameworks. He is particularly interested in data lakes as well as highly available, real-time software systems and their implementation with means such as cloud, microservices and event-driven architectures.

Category: Methodology
Tags: Data Lake, Big Data