Big Data is just that: big. According to Statista, a leading provider of market and consumer data, data creation will exceed 180 zettabytes by 2025, approximately 118.8 zettabytes more than in 2020. That’s a lot of data that has the potential to be translated into business intelligence (BI). It’s also a lot of data that must be stored somewhere so the information can be processed, analyzed, viewed, and distributed. For many organizations, that somewhere will be in the AWS cloud. And for solution architects, that means choosing the most suitable AWS data storage repository.
There are various options, and the appropriate choice will depend on numerous factors. Among them: whether the data is structured or unstructured, and whether it’s currently in use or its use is yet to be determined.
Before making any decisions, however, it’s essential to understand how the different repositories work, what differentiates them from one another, and which AWS resources are available.
The Data Warehouse
A data warehouse is a central repository of information that can be analyzed to make more informed decisions. It ingests structured data with predefined schema from transactional systems, relational databases, and other sources. It then connects that data to downstream analytical tools used by business analysts, data engineers, data scientists, and decision-makers.
Data warehouse architecture consists of tiers. The top tier is the front-end client. It presents results through reporting, analysis, and data mining tools. The middle tier consists of an analytics engine that accesses and analyzes data. The bottom tier is a database server, where data is loaded and stored.
Data that’s frequently accessed is stored in very fast storage, such as solid-state drives (SSD). If it’s accessed infrequently, it’s stored in a lower-cost object store, like Amazon S3. The data warehouse will automatically move frequently accessed data into faster storage to optimize query speed.
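The hot and cold tiering described above can be illustrated with a toy sketch. This is not how Amazon Redshift or any other warehouse implements tiering internally; the TieredStore class, its promote_after threshold, and the sample keys are all hypothetical, chosen only to show frequently read data being copied into a small fast tier.

```python
class TieredStore:
    """Toy two-tier store: a small 'hot' tier in front of a large 'cold' tier."""

    def __init__(self, hot_capacity, promote_after=2):
        self.hot_capacity = hot_capacity    # max number of items in the fast tier
        self.promote_after = promote_after  # cold reads needed before promotion
        self.hot = {}    # stands in for fast storage such as SSD
        self.cold = {}   # stands in for a lower-cost object store
        self.reads = {}  # cold-read counter per key

    def put(self, key, value):
        # New data always lands in the cold tier, which holds every item.
        self.cold[key] = value

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        value = self.cold[key]
        self.reads[key] = self.reads.get(key, 0) + 1
        if self.reads[key] >= self.promote_after:
            self._promote(key, value)
        return value

    def _promote(self, key, value):
        if self.hot_capacity <= 0:
            return
        if len(self.hot) >= self.hot_capacity:
            # Evict the item that was promoted earliest (dicts keep insertion order).
            oldest = next(iter(self.hot))
            del self.hot[oldest]
            self.reads[oldest] = 0
        self.hot[key] = value


store = TieredStore(hot_capacity=1)
store.put("daily_sales", [1, 2, 3])
store.put("archive_2015", [9])
store.get("daily_sales")
store.get("daily_sales")   # second read promotes this key to the hot tier
store.get("archive_2015")  # read only once, so it stays cold
```

In this sketch the cold tier always keeps a full copy, so eviction from the hot tier never loses data.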
Data warehouses follow a schema-on-write data model. The source data must fit into a predefined structure (schema) before entering the warehouse. This is usually accomplished through an extract-transform-load (ETL) process. You must know how the data will be used so you can optimize the structure before it enters a warehouse.
Storage and compute resources are tightly coupled, so ingesting more data into the warehouse requires more ETL. That entails more computation, which increases time, cost, and complexity. Defining schema also requires planning in advance.
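As a rough illustration of schema-on-write, the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse. The orders table, its columns, and the sample records are assumptions made for the example, not part of any AWS service. Each record is transformed to fit the predefined schema before loading, and records that cannot fit are rejected.

```python
import sqlite3

# Hypothetical raw records from a source system; field names are assumptions.
RAW_ORDERS = [
    {"order_id": "1001", "amount": "19.99", "region": "us-east"},
    {"order_id": "1002", "amount": "5.00", "region": "EU-West"},
    {"order_id": "1003", "amount": "n/a", "region": "us-east"},  # cannot fit the schema
]


def transform(record):
    """Coerce a raw record into the predefined schema, or return None if it cannot fit."""
    try:
        return (
            int(record["order_id"]),
            float(record["amount"]),
            record["region"].lower(),
        )
    except (KeyError, ValueError):
        return None


def load_orders(conn, raw_records):
    """Extract-transform-load: only rows that match the schema are written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL NOT NULL, region TEXT NOT NULL)"
    )
    rows = [t for t in (transform(r) for r in raw_records) if t is not None]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows), len(raw_records) - len(rows)


conn = sqlite3.connect(":memory:")
loaded, rejected = load_orders(conn, RAW_ORDERS)
```

The transform step is where the up-front cost sits: every new source needs its own mapping into the schema before any of its data can be queried.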
In terms of AWS data warehouse resources, one of the most prominent is Amazon Redshift. It offers petabyte-scale data warehousing and exabyte-scale data lake analytics together in a single service with pay-only-for-what-you-use pricing. AWS also offers a broad set of managed services that can be used to quickly deploy end-to-end analytics and data warehousing solutions.
The Data Mart
A data mart is a data warehouse, or a portion of a data warehouse, that is intentionally limited in scope. It’s focused on a specific functional area or subject matter, and usually serves the needs of a single team or business unit, like finance, marketing, or sales.
Data marts can be created quickly because of their limited coverage. They’re simple to design, build, and administer, and can be built from a large data warehouse, operational stores, or a combination of the two.
Since it’s condensed and summarized, data mart information derived from the broader data warehouse gives each department access to data that is more focused on its operations. There’s less data in the data mart, so the processing overhead is decreased. As such, queries run faster. Because data marts concentrate on specific functional areas, however, querying across areas can become complex.
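A minimal sketch of deriving a data mart from a warehouse is shown here, again using sqlite3 as a stand-in. The warehouse_sales and marketing_mart tables and their contents are hypothetical. The mart holds only one department's rows, already summarized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the broader warehouse table (contents are made up for the example).
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "ads", 100.0),
        ("marketing", "ads", 50.0),
        ("marketing", "events", 30.0),
        ("finance", "audit", 75.0),
    ],
)

# The mart: a condensed, summarized slice covering a single business unit.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT product, SUM(amount) AS total, COUNT(*) AS n "
    "FROM warehouse_sales WHERE dept = 'marketing' GROUP BY product"
)

mart_rows = conn.execute(
    "SELECT product, total, n FROM marketing_mart ORDER BY product"
).fetchall()
```

Queries against marketing_mart scan two summary rows instead of every sale, but a question that spans marketing and finance has to go back to the warehouse.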
Data marts that are fed directly from source data can also generate inconsistent information. Those fed from an existing data warehouse avoid inconsistency issues.
The Data Lake
A data lake is a centralized data repository that allows for storing, governing, discovering, and sharing structured, semi-structured and unstructured data at any scale. It eliminates data silos by acting as a single landing zone for data from multiple sources.
Unlike data warehouses, data lakes ingest all data types in their source format. This encourages a schema-on-read model, in which structure is applied when the data is queried rather than when it is stored.
One of the advantages of schema-on-read is that it results in loose coupling of the compute and storage resources for maintaining a data lake. Bypassing the ETL process means you can ingest large volumes of data into a data lake without the time, cost, and complexity that usually accompanies the ETL process. Instead, compute resources are consumed at query time where they’re more targeted and cost-effective.
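Schema-on-read can be sketched in a few lines. In the example below, raw JSON lines are kept exactly as they arrived, and a schema (here just a tuple of field names) is applied only when a query runs. The read_with_schema helper and the sample records are hypothetical.

```python
import json

# Raw records kept in their source format; different sources have different fields.
RAW_LINES = [
    '{"user": "a", "event": "click", "ms": 120}',
    '{"user": "b", "event": "view"}',
    '{"sensor": 7, "temp_c": 21.5}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep records that have all requested fields."""
    for line in lines:
        record = json.loads(line)
        if all(f in record for f in fields):
            yield tuple(record[f] for f in fields)


# Two different "schemas" read from the same untouched raw data.
user_events = list(read_with_schema(RAW_LINES, ("user", "event")))
sensor_readings = list(read_with_schema(RAW_LINES, ("sensor", "temp_c")))
```

No work was done at ingestion time; parsing and filtering happen per query, which is where the compute cost is paid.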
Data lakes also make it easy and cost-effective to store large volumes of organizational data, including data without a clearly defined use case. The downside is that, without organization, governance, or integration with known ETL or analytics tools, data lakes can easily become data swamps.
AWS data lake resources include Amazon S3, an object storage service; AWS Lake Formation, a service that makes it easy to set up a secure data lake in days; Amazon S3 Glacier and Glacier Deep Archive, low-cost Amazon S3 cloud storage classes for data archiving and long-term backup; AWS Backup, a cost-effective, fully managed, policy-based service that simplifies data protection at scale; AWS Glue, a serverless data integration service; and AWS Data Exchange, which makes it easy to find, subscribe to, and use third-party data in the cloud.
The Data Lakehouse
A data lakehouse combines the flexibility, scale, and cost-efficiency of data lakes with the atomicity, consistency, isolation, and durability (ACID) transactions of data warehouses. It enables querying data across a data warehouse, data lake, and operational databases to gain faster, deeper insights that aren’t possible otherwise. Data can be stored in open file formats in a data lake and queried in place while joining with data warehouse data.
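The idea of querying lake data in place while joining it with warehouse data can be sketched as follows. This is an illustration only, not how Amazon Redshift performs such queries: an in-memory CSV string stands in for an open-format file in the lake, a sqlite3 table stands in for the warehouse, and all names and values are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for an open-format file sitting in the lake (hypothetical contents).
LAKE_CSV = "customer_id,page\n1,/pricing\n2,/docs\n1,/signup\n"

# Stand-in for curated warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])


def page_views_by_customer(conn, lake_file):
    """Join lake rows with warehouse rows without loading the lake file into the warehouse."""
    names = dict(conn.execute("SELECT customer_id, name FROM customers"))
    counts = {}
    for row in csv.DictReader(lake_file):
        name = names.get(int(row["customer_id"]))
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return counts


views = page_views_by_customer(conn, io.StringIO(LAKE_CSV))
```

The lake file is read where it sits and never copied into the customers database, which is the property the paragraph above describes.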
A data lakehouse has a dual-layered architecture. The warehouse layer sits on top of a data lake and enforces schema on write, providing the quality and control needed for BI and reporting.
On AWS, Amazon Redshift powers a lake house architecture.
ClearScale Knows Data Storage
ClearScale is experienced in selecting and implementing data repositories that are best suited for application development projects – particularly those from AWS. Learn how ClearScale has helped numerous customers ensure their applications incorporate the most effective, cost-efficient data management solutions. Read the case studies here.
Then find out what ClearScale can do for your organization’s data storage needs. Contact us today.