By Artem Koval, Director of Data Analytics, ClearScale
The exponential rise in data over the past several years has added many new terms to the IT lexicon for managing that data. Among them is the data lake, a scalable, low-cost, centralized data repository for storing raw data from a variety of sources. It enables users to store data as-is without structuring it first. They can then run different types of analytics to gain insights and guide better decision-making.
The Challenges of Building Data Lakes
Building a data lake isn’t simple. It involves a number of manual steps, which makes the process complex and time-consuming. You have to load data from diverse sources and monitor the data flows. You have to set up partitions, turn on encryption, and manage keys. Redundant data has to be deduplicated.
Without the right technology, architecture, data quality, and data governance, a data lake can easily become a data swamp — an isolated pool of difficult-to-use, hard-to-understand, often inaccessible data. Fortunately, the use of modern data lake solutions and the cloud — the AWS Cloud, in particular, in our experience — simplifies things.
The Case for an AWS Cloud-based Data Lake
Building a data lake in the cloud eliminates the cost and hassle of managing the infrastructure that an on-premises data center requires. It also lowers engineering costs through the efficiencies of cloud-based tools. Because cloud services are flexible and offer on-demand infrastructure, it’s also easier to re-think, re-engineer, and re-architect a data lake when new use cases arise.
AWS offers even more benefits by virtue of its broad portfolio of services for building a data lake and analyzing the data in it. That includes Amazon Simple Storage Service (Amazon S3) and Amazon Glacier for storing data in any format, securely and at scale. There are also data ingestion tools like Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and AWS Direct Connect that can be used to transfer large amounts of data to S3.
To make it easy for end-users to discover the relevant data to use in their analysis, AWS Glue automatically creates a single catalog that is searchable by users. You can also take advantage of AWS artificial intelligence (AI) services such as Amazon Comprehend, Amazon Forecast, Amazon Personalize, and Amazon Rekognition to gather insights from unstructured datasets, generate accurate forecasts, create recommendation engines, and analyze images and videos stored in S3.
There’s also the option to use Amazon SageMaker to build, train, and deploy machine learning (ML) models quickly with your datasets stored in S3. Using Amazon FSx for Lustre, you can launch file systems for HPC and ML applications.
AWS Lake Formation Addresses the Trends of Building Data Lakes
One of the services our team at ClearScale particularly likes is AWS Lake Formation. In addition to simplifying the data lake-building process, it addresses many of the trends affecting how data lakes are built and used.
1. The need for data preparation
The amount of data generated daily is growing. According to data published by IBM, the world produces 2.5 quintillion bytes of data each day. Unstructured and semi-structured data comprise most of it. That data is coming from an increasingly wide variety of sources, such as machine-to-machine interactions and real-time sensor data. And it’s coming in a seemingly endless variety of forms.
As such, it’s often messy, inconsistent, and unstandardized. Before it can be analyzed, the data has to be cleaned and transformed. Lake Formation features capabilities that facilitate the required data preparation.
For example, Lake Formation uses ML to clean and deduplicate data to improve data consistency and quality. It can also convert data into analytics-friendly formats such as Apache Parquet and Optimized Row Columnar (ORC).
In addition, Lake Formation contains FindMatches, an ML transform that enables you to match records across different datasets and identify and remove duplicate records with little to no human intervention. Lake Formation also allows for creating custom transformation jobs with AWS Glue and Apache Spark to meet specific requirements.
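To make the idea of matching near-duplicate records concrete, here is a minimal sketch in plain Python. It is not the FindMatches algorithm, which is a trained ML transform; it only illustrates fuzzy record matching with the standard library's difflib. The field names, sample records, and the 0.85 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(record, fields):
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join(" ".join(str(record[f]).lower().split()) for f in fields)


def deduplicate(records, fields, threshold=0.85):
    """Keep the first record of every group of near-identical records.

    Two records are treated as duplicates when the similarity ratio of their
    normalized text is at or above `threshold` (an assumed cutoff).
    """
    kept = []
    for record in records:
        text = normalize(record, fields)
        is_duplicate = any(
            SequenceMatcher(None, text, normalize(other, fields)).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


# Hypothetical sample data: the first two rows describe the same person.
customers = [
    {"name": "John Smith", "city": "Denver"},
    {"name": "Jon  Smith", "city": "denver"},
    {"name": "Alice Wong", "city": "Austin"},
]
unique_customers = deduplicate(customers, fields=["name", "city"])
```

A pairwise comparison like this grows quadratically with the number of records, which is one reason a managed, ML-based transform is attractive at data lake scale.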
2. Data lake automation
From data ingestion and preparation to making data ready to be queried, there are a lot of manual steps involved in building a data lake. For data lakes to be truly beneficial, they need to be more efficient. Automating as many steps as possible is essential.
That’s what Lake Formation does. For example, it employs pre-defined templates that can ingest data from different sources. It then automates the provisioning and configuring of storage.
Lake Formation crawls the data to extract schema and metadata tags, and then automatically optimizes the partitioning of the data. From there, it transforms data into formats like Apache Parquet and ORC for easier analytics. It also classifies the data automatically and applies an organization’s data access policies to govern access to it.
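Because Lake Formation builds on AWS Glue crawlers, the crawling step can be pictured as a Glue crawler pointed at an S3 location. The sketch below builds the request for the Glue CreateCrawler call in boto3; the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the Lake Formation console may configure this differently on your behalf.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the AWS Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and run it once. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
crawler_request = build_crawler_request(
    name="raw-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_raw",
    s3_path="s3://example-data-lake/raw/sales/",
)
```

Calling create_and_start_crawler(crawler_request) would then register the inferred schema in the Data Catalog, assuming the role has access to the bucket.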
3. Greater cost-effectiveness
In the cloud, users who want faster results with their big data analytics can easily steer more resources to the tasks they’re executing. As performance increases, however, it becomes more difficult to keep costs down. Companies are increasingly looking for a better balance between performance benchmarks and efficiency benchmarks.
This is another area where Lake Formation can make a difference. You get the cost efficiencies associated with the cloud, as well as those generated through the use of Lake Formation. For example, Lake Formation source crawlers reduce the overhead involved in just getting data from wherever it is into your data lake.
There’s also the matter of how raw data is laid out when it is loaded: partitions may be too small or too large. Lake Formation optimizes the partitioning of data in S3, improving performance and reducing costs. Data is organized by size, time period, and/or relevant keys. This enables fast scans and parallel, distributed reads for the most commonly used queries.
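As an illustration of organizing data by time period and key, the following sketch builds Hive-style partitioned S3 object keys (key=value path segments), a common convention that query engines such as Athena can use to skip irrelevant partitions. The prefix, table name, and the choice of date plus region as partition keys are assumptions; this is not a description of the layout Lake Formation itself produces.

```python
from datetime import date


def partitioned_key(prefix, table, event_date, region, filename):
    """Return an S3 object key partitioned by year, month, day, and region."""
    return (
        f"{prefix.strip('/')}/{table}/"
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
        f"region={region}/{filename}"
    )


# Hypothetical example values.
key = partitioned_key("curated", "orders", date(2020, 3, 7), "us-east-1", "part-0000.parquet")
print(key)
# curated/orders/year=2020/month=03/day=07/region=us-east-1/part-0000.parquet
```

A query that filters on year, month, or region then only needs to read the objects under the matching prefixes.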
In addition, there’s no extra charge for using Lake Formation’s features. It builds on capabilities available in AWS Glue and uses the Glue Data Catalog, jobs, and crawlers. It also integrates with services like AWS CloudTrail, AWS IAM, Amazon CloudWatch, Amazon Athena, Amazon EMR, Amazon Redshift, and others.
4. Accommodating more data and more workloads
The amount of data and the number of sources are increasing daily. So are the uses of that data and the tools that enable them. We can expect that AI, ML, streaming analytics, and other workload types will continue expanding and changing. Data lakes must be able to handle it all.
This is yet another area where Lake Formation shows its power. It allows for importing data from databases already in AWS, including MySQL, PostgreSQL, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. Both bulk and incremental data loading are supported.
Data can be moved from on-premises databases by connecting with Java Database Connectivity (JDBC), identifying the target sources, and providing access credentials in the console. Lake Formation reads and then loads the data into the data lake. Custom ETL jobs can also be created with AWS Glue to import data from other databases.
Semi-structured and unstructured data can also be pulled from other S3 data sources. It just requires specifying the S3 path to register the data sources and authorize access. Lake Formation can collect and organize data sets such as logs from AWS CloudTrail, Amazon CloudFront, and Elastic Load Balancing. Data can also be loaded into the data lake with Amazon Kinesis or Amazon DynamoDB using custom jobs.
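Registering an S3 path can also be done programmatically. The sketch below converts an S3 URI into the ARN form expected by the Lake Formation RegisterResource call in boto3 and then registers it using the service-linked role. The bucket name and prefix are hypothetical, and using the service-linked role (rather than a custom role) is an assumption made to keep the example short.

```python
def s3_uri_to_arn(s3_uri):
    """Convert 's3://bucket/prefix/' into 'arn:aws:s3:::bucket/prefix'."""
    scheme = "s3://"
    if not s3_uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    path = s3_uri[len(scheme):].rstrip("/")
    if not path:
        raise ValueError("S3 URI must include a bucket name")
    return f"arn:aws:s3:::{path}"


def register_location(s3_uri):
    """Register the S3 location with Lake Formation. Requires boto3 and credentials."""
    import boto3  # imported lazily so s3_uri_to_arn works without boto3

    lakeformation = boto3.client("lakeformation")
    lakeformation.register_resource(
        ResourceArn=s3_uri_to_arn(s3_uri),
        UseServiceLinkedRole=True,
    )


# Hypothetical bucket and prefix.
location_arn = s3_uri_to_arn("s3://example-data-lake/raw/logs/")
```

Once a location is registered, access to the data under it can be governed through Lake Formation permissions instead of bucket-level policies alone.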
5. The balance between data governance and ease of use
Data lakes need to be easily accessible by those who need to use them. They also must be secure and well-governed. The two concepts may seem incompatible, but that’s the challenge that evolving data lake architecture has to embrace.
One way Lake Formation takes this on is with user access permissions that augment AWS Identity and Access Management (IAM) policies. When someone tries to access the data through an AWS service, that person’s credentials are sent to Lake Formation. Lake Formation then returns temporary credentials that permit data access.
In essence, access is controlled by grant and revoke permissions that can be specified on tables and columns instead of buckets and objects. Policies granted to particular users can be viewed and altered easily, and all data access can be audited in one place with Lake Formation.
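A column-level grant might look like the following sketch, which builds the request for the Lake Formation GrantPermissions call in boto3. The principal ARN, database, table, and column names are hypothetical placeholders. The same request shape can be passed to revoke_permissions to withdraw the access.

```python
def build_column_grant(principal_arn, database, table, columns, permissions=("SELECT",)):
    """Build keyword arguments that grant access to specific columns of a table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


def grant(request):
    """Apply the grant. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical analyst role that may read two non-sensitive columns only.
grant_request = build_column_grant(
    principal_arn="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    database="sales_curated",
    table="orders",
    columns=["order_id", "order_total"],
)
```

Here the analyst role can query order_id and order_total but not any other column of the table, which is the kind of restriction that is hard to express with bucket and object policies.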
Lake Formation also integrates with IAM so authenticated users and roles can be automatically mapped to data protection policies stored in the Data Catalog. The IAM integration enables the use of Microsoft Active Directory or LDAP to federate into IAM using SAML.
Third-party business applications, like Tableau and Looker, can also be connected to AWS data sources through Athena or Redshift. Data access is managed by the underlying data catalog. So, regardless of the application used, data access is governed and controlled.
Building Data Lakes on the Cloud
ClearScale has extensive experience in building data lakes using AWS services like Lake Formation. You can read one of our case studies and watch a testimonial video here. But every project is unique. The services and tools we use in building data lakes are based on each customer’s specific needs.
With all the benefits offered by AWS Lake Formation, there’s a good chance it might be one of the tools we employ. ClearScale can help you determine if a data lake is the right solution for your company. And we can build it to best serve your purposes.