By Artem Koval, Director of Data Analytics, ClearScale
While there’s no direct or implied connection to its namesake Athena, the Olympian goddess of wisdom and war, Amazon Athena can help deliver important insights. In a way, that’s similar to imparting wisdom. But AWS’s interactive query service offers much more.
In this blog, we discuss what Amazon Athena does, how it works, its benefits, and how it compares to other query services.
What Is Amazon Athena?
Amazon Athena is a cost-effective, interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard ANSI SQL. It is built on Presto, an open-source, distributed SQL query engine, but is offered as an AWS managed service.
Amazon Athena is serverless, so there’s no infrastructure or clusters to purchase, manage or maintain. Because it uses Amazon S3 as the underlying data store, Amazon Athena is highly available and durable with data redundantly stored across multiple facilities and multiple devices in each facility.
It’s built to scale automatically. Even when dealing with complex queries and large data sets, it can execute queries in parallel and quickly generate results.
Supported Data Types and Formats
Amazon Athena supports common data formats such as CSV, JSON, ORC, Avro, and Parquet. It also works with compressed data in formats such as Zlib. In addition, it supports simple data types such as INTEGER, DOUBLE, and VARCHAR, as well as complex data types such as MAP, ARRAY, and STRUCT.
How Amazon Athena Works
From the AWS Management Console, users can point Amazon Athena at their data stored in Amazon S3. From there, they create the schema by writing DDL statements in the console or by using the Amazon Athena create table form, and then use the built-in query editor to execute SQL queries on the data. The compute resources needed to return results for a query are provisioned automatically by AWS. Results are typically delivered within seconds.
Amazon Athena works directly with data stored in Amazon S3, so there’s no need to load data. Nor are complex ETL jobs required to prepare data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
Amazon Athena uses a managed data catalog to store information and schemas about the databases and tables created for data stored in Amazon S3. The catalog can be modified using DDL statements or via the AWS Management Console. Amazon Athena uses schema-on-read technology, which means table definitions are applied to the data in Amazon S3 when queries are executed. Table definitions and schema can be deleted without affecting the underlying data stored on Amazon S3.
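To make the schema-on-read idea concrete, here is a minimal sketch of a helper that assembles the kind of DDL statement a user might paste into the query editor. The helper name, the database, table, and column names, and the bucket path are all hypothetical; the statement only registers a table definition over files that already sit in Amazon S3.

```python
def build_create_table_ddl(database, table, columns, location,
                           partitions=None, stored_as="PARQUET"):
    """Return a CREATE EXTERNAL TABLE statement for data already in S3.

    columns and partitions are lists of (name, type) pairs.
    All names passed in are illustrative, not taken from a real account.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)"
    if partitions:
        parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partitions)
        ddl += f"\nPARTITIONED BY ({parts})"
    # The table definition points at the data; nothing is loaded or copied.
    ddl += f"\nSTORED AS {stored_as}\nLOCATION '{location}'"
    return ddl


if __name__ == "__main__":
    print(build_create_table_ddl(
        database="weblogs",                      # hypothetical database
        table="requests",                        # hypothetical table
        columns=[("request_id", "string"),
                 ("latency_ms", "double"),
                 ("headers", "map<string,string>")],
        location="s3://example-bucket/requests/",  # hypothetical bucket
        partitions=[("year", "string")],
    ))
```

Because the table is external, dropping it later removes only the definition from the catalog, which matches the behavior described above.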
In addition to the AWS Management Console, Amazon Athena can be accessed via a JDBC or ODBC connection, the Athena API, or the AWS CLI.
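For programmatic access, the flow is roughly: start a query, poll until it finishes, then read the results. The sketch below assumes the AWS SDK for Python (boto3) and its Athena client methods `start_query_execution`, `get_query_execution`, and `get_query_results`. The client is passed in as a parameter, so the function name `run_query`, the database, and the S3 output path are illustrative assumptions rather than anything prescribed by AWS.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query, wait for it to finish, and return its rows.

    client is expected to behave like boto3.client("athena").
    Only the first page of results is read in this sketch.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended as {status['State']}: {reason}")

    result = client.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; empty cells come back as None here.
    return [[cell.get("VarCharValue") for cell in row["Data"]]
            for row in result["ResultSet"]["Rows"]]


# Example usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   rows = run_query(boto3.client("athena"),
#                    "SELECT COUNT(*) FROM requests",
#                    "weblogs",
#                    "s3://example-bucket/athena-results/")
```

For SELECT statements, the first returned row usually holds the column names, so callers may want to treat it as a header.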
Amazon Athena Pricing
Pricing is based on terabytes of data scanned at $5 per terabyte. Amazon Athena queries data directly from Amazon S3, so your source data is billed at S3 rates. When Amazon Athena runs a query, it stores the results in an S3 bucket of your choice, and you are billed at standard S3 rates for these result sets. Users can save 30% to 90% on per-query costs and get better performance by compressing, partitioning, and converting data into columnar formats.
There’s no charge for failed queries, but users are charged for canceled queries based on the amount of data scanned up to when the query was canceled.
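The pricing rules above reduce to simple arithmetic. The following sketch applies the $5 per terabyte rate and the failed/canceled rules as described in this post. The function names are hypothetical, and the choice of 1 TB = 1024^4 bytes is an assumption made for illustration; it also ignores any minimum charge or rounding AWS may apply.

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def query_cost_usd(bytes_scanned, state="SUCCEEDED"):
    """Estimate the charge for one query from the bytes it scanned.

    Failed queries are free; canceled queries are billed for whatever
    was scanned before cancellation, so bytes_scanned is used as-is.
    """
    if state == "FAILED":
        return 0.0
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


def savings_percent(raw_bytes, optimized_bytes):
    """Percentage saved when a query scans optimized_bytes instead of raw_bytes."""
    return (1 - optimized_bytes / raw_bytes) * 100


if __name__ == "__main__":
    raw = 2 * BYTES_PER_TB         # hypothetical uncompressed CSV scan
    columnar = BYTES_PER_TB // 4   # hypothetical scan after converting to Parquet
    print(query_cost_usd(raw), query_cost_usd(columnar))
    print(savings_percent(raw, columnar))
```

Since cost tracks bytes scanned, anything that shrinks the scan (compression, partition pruning, columnar formats) lowers the bill by the same proportion.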
Benefits of Amazon Athena
The following are just some of the benefits Amazon Athena offers.
- It’s SQL-based, with queries written in standard ANSI SQL. SQL is a commonly used language for data analysts and DBAs and is often considered easier to work with than Python or Scala.
- It uses Presto, the distributed SQL query engine, which is optimized for low-latency data analysis.
- Amazon Athena Federated Query enables the service to run SQL queries across relational, non-relational, object, and custom data sources.
- Users can run multiple queries simultaneously.
- Queries are executed in parallel for large data sets, making complex queries fast.
- It’s serverless, so there’s no need to purchase, manage, or maintain infrastructure to use the tool. AWS automatically handles configuration and software updates.
- It’s cost-effective. Users only pay for data scanned. Plus, it uses Amazon S3 for data storage, so costs are lower compared to storing the same amounts of data in a coupled database.
- It supports commonly used open-source formats, which reduces vendor lock-in and enables users to employ additional querying and analytics tools as needed.
- It easily integrates with other AWS services, including AWS CloudFormation, Amazon CloudFront, AWS CloudTrail, Amazon QuickSight, Amazon S3 Inventory, Amazon Virtual Private Cloud, AWS Glue, AWS Step Functions, Elastic Load Balancing, and AWS Systems Manager Inventory.
- It employs multiple security technologies and tools, including AWS Identity and Access Management (IAM) policies, Amazon S3 bucket policies, and access control lists.
- It can be used for machine learning (ML). Developers can invoke ML models deployed on Amazon SageMaker directly from Amazon Athena SQL queries.
Common Use Cases for Amazon Athena
Amazon Athena works for a wide variety of use cases that require querying data stored on Amazon S3. It can be used:
- To run ad-hoc analytics on big data
- To provide a more cost-effective alternative to Amazon Redshift, a coupled database that can be costly and complex to operate at higher scales
- To quickly check new datasets for validity
- For streaming analytics
- For log analysis
- To query encrypted data
- To create a unified metadata repository
- For on-demand analysis of data spread across multiple data stores using a single tool and SQL dialect
- To design self-service ETL pipelines and event-based data processing workflows with Athena’s integration with AWS Step Functions
- To unify diverse data sources to produce rich input features for ML model training workflows
- To develop user-facing data-as-a-product applications that surface insights across data mesh architectures
Amazon Athena Limitations
Amazon Athena isn’t right for all use cases. It’s important to keep in mind that:
- Optimization is limited to queries. Amazon Athena doesn’t optimize the layout of data already stored in Amazon S3.
- Amazon Athena doesn’t use indexes, so queries may need to scan more data, which increases load and can impact performance.
- Data must first be partitioned to enable efficient queries. Partitions must then be managed for what best fits performance needs.
- Amazon Athena Federated Query is needed to connect data sources. Stored procedures, parameterized queries, and Presto-federated connectors aren’t supported.
- Amazon Athena can time out when querying a table with thousands of partitions.
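Since partitioning carries much of the performance burden, it helps to see what managing partitions can look like in practice. The sketch below assumes Hive-style `key=value` prefixes in Amazon S3 and builds an `ALTER TABLE ... ADD PARTITION` statement for one partition. The helper names, table name, and bucket path are hypothetical.

```python
def partition_location(base_location, partition):
    """Build a Hive-style S3 prefix such as .../year=2024/month=01/.

    partition is a dict of partition column -> value, in column order.
    """
    suffix = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{base_location.rstrip('/')}/{suffix}/"


def add_partition_ddl(table, base_location, partition):
    """Return a DDL statement that registers one partition with the catalog."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    location = partition_location(base_location, partition)
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")


if __name__ == "__main__":
    # Hypothetical table and bucket, for illustration only.
    print(add_partition_ddl("requests",
                            "s3://example-bucket/requests/",
                            {"year": "2024", "month": "01"}))
```

Queries that filter on the partition columns in their WHERE clause can then skip prefixes that don’t match, which is where the efficiency gain comes from.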
AWS Services Comparisons
Amazon Athena is often compared to other AWS services that offer similar capabilities. The following are three of the most commonly discussed:
- Amazon S3 Select vs. Amazon Athena. Amazon S3 Select and Amazon Athena are both serverless solutions that allow SQL-style queries against data in Amazon S3. The main difference between them is that S3 Select supports only simple SQL SELECT queries. That means no joins, no groupings, and no other sophisticated SQL operations. Athena can be used for all kinds of SQL queries.
Another limitation of S3 Select is that it can perform the SELECT operation on only one object at a time. Data also needs to be in a structured format, such as JSON, CSV, or Parquet.
- Amazon Redshift vs. Amazon Athena. Amazon Redshift, a data warehouse service, can analyze data by using standard SQL-based clients and business intelligence (BI) tools. It handles more complex, multipart SQL queries and is a better fit for organizations that need to combine data from disparate sources into a common format.
- Amazon Elastic MapReduce (EMR) vs. Amazon Athena. Amazon EMR enables teams to run distributed data processing frameworks, like Apache Hadoop, Apache Spark, and the Presto SQL query engine. It’s better suited for projects that require custom code, specific cluster configurations, or extremely large data sets. However, Athena can query data processed by EMR without affecting ongoing EMR jobs.
ClearScale’s Use of Amazon Athena
ClearScale regularly employs a variety of best practices in developing big data solutions. The use of Amazon Athena for querying data stored in S3, without having to manage any infrastructure, is among them.
For example, ClearScale has a client that had found its SQL queries were becoming more complex, the resources to perform those complex queries were becoming more costly, and the time to get results back was increasing.
ClearScale’s solution centered on the implementation of Amazon Athena, which allows for querying large data sources without the need to manage servers or data warehouses. The customer’s data was aggregated into a centralized S3 bucket. Users could then simply point to specific datasets in the bucket, configure the schema, and execute queries with Amazon Athena’s built-in query editor or through a data API that was created for Athena.
The solution was a perfect fit for the client’s data science team since they were already familiar with ANSI SQL queries. Because the AWS service is serverless, the scalability issues caused by server constraints, which had previously required the infrastructure team’s involvement, went away. The customer was also able to realize significant operational cost savings.
If you’re interested in learning more about the use of Amazon Athena or other query services or would like to discuss a specific data analytics project, ClearScale is here for you.