As an organization grows, so does the amount of data it stores, either from the information it creates on its own or from an aggregation of data from multiple sources. From a business operations perspective, gleaning useful information from terabytes' worth of data quickly can become challenging as the amount of data grows, or when multiple, complex joins are needed.
A global news organization discovered this several years ago while trying to query data held in three places: their existing EC2 data stores, JSON files stored in S3, and the AWS Kinesis Firehose streaming data service. Their SQL queries were becoming more and more complex, and the time to get results back was increasing with both the complexity of each query and the need to scale clusters to accommodate the more expensive queries.
They asked ClearScale, an AWS Certified Premier Partner, to find ways of optimizing the queries so that they could dive deep into their Big Data repositories without the associated delay in reporting the results. They knew that ClearScale’s expertise in the AWS services ecosystem would likely yield solutions to this common, yet complex problem.
The two issues at the top of the client’s list of concerns were performance and cost. The more complex the query, the more costly the resources needed to run it and the longer the wait for results.
The ClearScale Big Data Queries Solution
Once ClearScale had evaluated the data schema and the queries the client was planning to use, it was quickly apparent that the solution to the client’s issues was AWS Athena, a serverless query service that allows customers to query large data sources without managing servers or data warehouses. Athena is based on ANSI standard SQL and works with standard formats such as CSV, JSON, ORC, Avro, and Parquet. It is built to let a user point to specific datasets in Amazon S3, define the schema they would like to use, and then quickly execute queries with Athena’s built-in query editor.
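To make the "point at a dataset, define a schema, run a query" workflow concrete, here is a minimal Python sketch that builds the kind of table definition Athena accepts for JSON data in S3. The table name, column names, and bucket path are hypothetical and are not taken from the client's project. The JSON SerDe shown is one that Athena supports, but it is only one possible choice.

```python
def build_create_table_ddl(table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for JSON files under an S3 prefix.

    `columns` maps column name -> Athena column type (illustrative only).
    """
    column_defs = ",\n  ".join(
        f"`{name}` {col_type}" for name, col_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_defs}\n"
        f")\n"
        # OpenX JSON SerDe: lets Athena read one JSON object per line.
        f"ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table, schema, and bucket, used purely for illustration.
ddl = build_create_table_ddl(
    "news_events",
    {"event_id": "string", "published_at": "timestamp", "section": "string"},
    "s3://example-central-bucket/events/",
)
print(ddl)

# Once the table exists, an ordinary SQL query can be run against it.
query = "SELECT section, COUNT(*) AS stories FROM news_events GROUP BY section"
print(query)
```

Both statements could be pasted into Athena's query editor. The table definition only describes data that already sits in S3, so nothing is loaded or copied when it runs.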
Aggregating all of the data the customer held in EC2, S3, and the AWS Kinesis Firehose streaming data service into a centralized S3 bucket would allow Athena to query it in one place. Because Athena runs queries in parallel, it can return results quickly, usually within seconds, regardless of the size of the data set or the number of complex joins.
Moreover, because of how Amazon has chosen to implement Athena, customers only pay for the queries they run. Most customers can save anywhere from 30% to 90% of their per-query costs compared with traditional query requests through non-Athena implementations. This is accomplished in part by compressing, partitioning, and converting data into columnar formats, which reduce the amount of data each query has to scan and allow for faster queries over larger data sets.
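The arithmetic behind those savings can be sketched in a few lines of Python. Athena bills by the amount of data a query scans, so anything that shrinks the scan shrinks the bill. The price per terabyte and the data sizes below are assumptions chosen for illustration; they are not figures from the client's project, and current pricing should be checked for the relevant region.

```python
TB = 1024 ** 4  # bytes per terabyte (binary definition, assumed here)


def estimate_query_cost(bytes_scanned, price_per_tb=5.0):
    """Estimate the cost of one query from the bytes it scans.

    price_per_tb is an assumed list price in USD, not a quoted rate.
    """
    return bytes_scanned / TB * price_per_tb


def savings_fraction(raw_bytes, optimized_bytes):
    """Fraction of per-query cost saved by scanning less data."""
    return 1 - optimized_bytes / raw_bytes


# Hypothetical sizes: a query that scans 2 TB of raw JSON versus the same
# query scanning 0.4 TB after compression, partitioning, and conversion
# to a columnar format.
raw_scan = 2 * TB
optimized_scan = 0.4 * TB

print(f"raw cost:       ${estimate_query_cost(raw_scan):.2f}")
print(f"optimized cost: ${estimate_query_cost(optimized_scan):.2f}")
print(f"savings:        {savings_fraction(raw_scan, optimized_scan):.0%}")
```

Under these assumed numbers the saving lands inside the 30% to 90% range quoted above. The real figure depends on how well the data compresses and how selective the partitions and columns are for a given query.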
The result for our client was apparent. Prior to engaging with ClearScale, the customer spent hours attempting to query the data before they were able to analyze the results, including time spent before each query working with their infrastructure team to perform extract, transform, and load (ETL) operations on the data. With Athena in place, the same complex SQL queries ran in seconds against millions of objects. This overwhelming improvement in query performance allowed the client to spend more time analyzing the results to discover trends and valuable information.
The AWS Athena implementation was the ideal solution for a number of reasons. Not only did it perform better than their prior operational model, but because Athena uses ANSI SQL, it was a perfect fit for the client’s data science team since they were already very familiar with ANSI SQL queries.
Moreover, because the AWS service is serverless, the scalability issues caused by server constraints, which had previously required the infrastructure team's involvement, no longer applied. Finally, because the client paid per query, based on how much data they actually queried, and no longer had to maintain their own server environments, they were able to realize significant operational cost savings.
ClearScale continues to work closely with this global news organization as they begin to fully realize the potential of the AWS Athena solution. As the client becomes more familiar with the query power available to them, ClearScale will be there to help with additional refinements to their workflows and with feature requests, all aimed at getting valuable results out of increasingly complex queries. For ClearScale, success is not defined as delivering a completed project to a client, but rather by making sure that our clients are successful now and into the future.
Learn more about ClearScale’s big data services here.