Discovering the Power of AWS Glue in Large Scale Data Analysis
Feb 8, 2018


A modern technology-driven business is awash in mountains of sorted and unsorted data that can be challenging to transform into something that allows for deep analysis. From a development perspective, defining, building, and normalizing or cataloging data in a way that allows for easier analysis is a daunting task, especially if new data sets are identified as needing analysis.
Developers or database administrators have to create processes to extract the information from whichever repository it is stored in, transform it into a common set of data types or even fields, and ultimately load that information in an indexed format to a location that can then be accessed by business intelligence toolsets.
It is this Extract/Transform/Load (ETL) process that is typically the bottleneck in the overall analysis process. Besides the need to create a new process each time a new dataset is identified, setting up routines to run these ETL processes at set times or when certain conditions are met can be a maintenance nightmare.
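To make the three stages concrete, here is a minimal hand-written sketch of the kind of ETL routine described above, using only the Python standard library. The CSV contents, column names, and table name are invented for illustration; a real pipeline would read from an actual repository and load into an actual analytics store.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source repository.
RAW_CSV = """customer_id,signup_date,total_spent
 101 ,2018-01-15,250.50
102,2018-02-01,
103,2018-02-03,99
"""


def extract(raw_text):
    """Extract: read raw rows from a CSV export as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: coerce fields to common types; missing amounts become 0.0."""
    cleaned = []
    for row in rows:
        cleaned.append((
            int(row["customer_id"].strip()),
            row["signup_date"].strip(),
            float(row["total_spent"].strip() or 0.0),
        ))
    return cleaned


def load(records, connection):
    """Load: write the records to an indexed table that reporting tools can query."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, signup_date TEXT, total_spent REAL)"
    )
    connection.execute(
        "CREATE INDEX IF NOT EXISTS idx_signup_date ON customers (signup_date)"
    )
    connection.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
```

Every new dataset needs its own version of these three functions, plus scheduling around them, which is where the maintenance burden comes from.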
The Benefits of AWS Glue
With the introduction by Amazon Web Services (AWS) of a service called AWS Glue, much of this formerly painstaking work can be automated. By integrating closely with other key AWS services, such as DynamoDB and RDS databases, Glue allows an organization to simply point to the location where the raw data resides, and Glue will take care of extracting, transforming, and loading the data into a format that data analysis tools can access.
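One way to "point" Glue at raw data is to define a crawler, which scans the listed locations and records their schemas in the Glue Data Catalog. The sketch below assembles the arguments for the `create_crawler` call in boto3 (the AWS SDK for Python). The crawler name, role ARN, catalog database, table name, and bucket path are all hypothetical placeholders, and this is not necessarily how ClearScale configured its clients' environments.

```python
def build_crawler_request(name, role_arn, catalog_database,
                          dynamodb_tables=(), s3_paths=()):
    """Assemble keyword arguments for the Glue CreateCrawler API call."""
    targets = {}
    if dynamodb_tables:
        targets["DynamoDBTargets"] = [{"Path": table} for table in dynamodb_tables]
    if s3_paths:
        targets["S3Targets"] = [{"Path": path} for path in s3_paths]
    if not targets:
        raise ValueError("at least one data location is required")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": catalog_database,
        "Targets": targets,
    }


def create_and_start_crawler(request):
    """Send the request to AWS; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All names below are hypothetical placeholders.
request = build_crawler_request(
    name="customer-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    catalog_database="analytics_catalog",
    dynamodb_tables=["customer_events"],
    s3_paths=["s3://example-raw-data/orders/"],
)
```

Calling `create_and_start_crawler(request)` would submit the definition to an AWS account. Building the request separately keeps the configuration easy to review and test before anything is created.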
Moreover, AWS Glue provides a serverless environment in which clients pay only for the resources they use while Glue is actually running to process data. And because it generates ETL code that is customizable, reusable, and portable, it gives developers in an organization more freedom to sculpt processes that suit each particular need.
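The pay-for-what-you-use model can be reasoned about with simple arithmetic. Glue meters ETL jobs in data processing units (DPUs) over the time a job runs, so a rough estimate is DPUs multiplied by billed hours multiplied by the hourly rate. In the sketch below the rate, the DPU count, and the optional minimum billing period are placeholder parameters rather than quoted prices; current figures should be taken from the AWS pricing page.

```python
def estimate_job_cost(dpus, runtime_seconds, rate_per_dpu_hour,
                      minimum_billed_seconds=0):
    """Estimate the charge for one job run: DPUs x billed hours x hourly rate."""
    # A minimum billing period, if one applies, rounds short runs up.
    billed_seconds = max(runtime_seconds, minimum_billed_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# Placeholder numbers: 10 DPUs for 15 minutes at 0.50 per DPU-hour.
nightly_cost = estimate_job_cost(dpus=10, runtime_seconds=900,
                                 rate_per_dpu_hour=0.50)
```

Under these assumed numbers a 15-minute run costs 1.25 in whatever currency the rate is quoted in, and nothing is charged while no job is running.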
How ClearScale Applied AWS Glue to Two Different Use Cases
ClearScale, an AWS Certified Premier Partner, was asked by two clients how they could best use AWS Glue to solve ongoing challenges within their organizations. Even though the two clients had very different concerns about aggregating their data sources, ClearScale was able to use AWS Glue in a very similar manner to overcome both sets of challenges.
Client 1: How to Manage and Aggregate Widely Different Data Sources
The first client needed to compile a large amount of data from numerous data sources into one location so that it could be analyzed. Their challenge was that they could not make accurate predictions from their customer data. The complexity of the data meant that significant resources were needed to cleanse it, and they struggled to bring the data together in a form that was easily understood.
With AWS Glue, ClearScale was able to define custom scripts that identified all of the data sources the client relied on, brought them together, cleansed and transformed the data, enriched the datasets, and then divided them into Hive tables for later use. The complexity of the data sources no longer stood as a barrier for the client, which ultimately led to a clearer understanding of customer predictability.
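The sequence described above (combine sources, cleanse, enrich, divide into Hive tables) can be illustrated with a small pure-Python sketch. This is not ClearScale's script, and a real Glue job would typically be written in PySpark against much larger datasets. The record fields, bucket name, and partition key are invented for illustration. The one convention borrowed from Hive is the `key=value` directory layout used for partitioned tables.

```python
from collections import defaultdict

# Hypothetical records from two different sources.
crm_records = [
    {"customer_id": "101", "name": " Ada Lovelace ", "region": "EU"},
    {"customer_id": "102", "name": "Grace Hopper", "region": "us"},
    {"customer_id": "", "name": "Unknown", "region": "US"},  # no usable id
]
order_records = [
    {"customer_id": "101", "amount": "250.50"},
    {"customer_id": "101", "amount": "49.50"},
    {"customer_id": "102", "amount": "99.00"},
]


def cleanse(records):
    """Drop rows without an id and normalize whitespace and letter case."""
    cleaned = []
    for record in records:
        customer_id = record["customer_id"].strip()
        if not customer_id:
            continue
        cleaned.append({
            "customer_id": customer_id,
            "name": record["name"].strip(),
            "region": record["region"].strip().upper(),
        })
    return cleaned


def enrich(customers, orders):
    """Add each customer's total order amount from the second source."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer_id"]] += float(order["amount"])
    return [dict(c, total_spent=totals[c["customer_id"]]) for c in customers]


def partition_paths(rows, table_root, key):
    """Group rows under Hive-style partition paths such as .../region=EU/."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{table_root}/{key}={row[key]}/"].append(row)
    return dict(partitions)


partitions = partition_paths(
    enrich(cleanse(crm_records), order_records),
    table_root="s3://example-bucket/customers",
    key="region",
)
```

Partitioning by a column such as region lets query engines that understand Hive-style layouts read only the directories a query needs.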
Client 2: Extreme Load Latency Means Greater Costs and Inefficiencies
The second client had challenges ingesting all of the data they needed. With only two instances of their data available at any point in time, a bottleneck occurred within RDS each time they needed to consolidate the data into a centralized location. Aggregation took hours, which delayed other reporting activities and ultimately led to higher operational costs. They were interested in finding ways to speed up the process and reduce overall costs.
ClearScale implemented AWS Glue for this client as well. Because of its inherently distributed nature within the AWS service ecosystem, Glue was able to take the client’s large volume of data from disparate sources and load it within minutes instead of hours. Because AWS charges only for the time that Glue operates, and because the low cost of using the service is built into the pricing model, the client saw an immediate reduction in overall operational costs while streamlining the data ingestion process.
Architecture Diagram (For Client 1 and Client 2)
ClearScale Charts the Path
As with any AWS offering, not every service fits every client’s needs. Depending on the complexity and specific requirements of the client, ClearScale can assess the situation and determine which approach and which services are in the client's best interest for long-term success.
With AWS Glue, the implementations ClearScale executed showed that both clients were able to recognize the immediate benefit of the solution. Delivering Glue meant very low costs for ongoing operations, high data ingestion speeds because of the distributed nature of the AWS landscape, and consistent, reproducible cleansing and transformation built on customizable ETL code that can be reused going forward. Leveraging ClearScale as a partner in your own company’s journey means that the outcome will benefit your organization, your infrastructure, and your customers for years to come.