Building Better Data Pipelines with AWS Step Functions
Oct 22, 2019


When it comes to data management, there are three important components. There’s generating the data, which is often called online transaction processing (OLTP). There’s also analyzing the data, which is referred to as online analytical processing (OLAP). Both can involve multiple systems.
Then there’s the process of moving data between systems. This can include copying data, moving it from on-premises systems to the cloud, reformatting it, combining it with other data sources, and other steps. Each step can require separate software. That’s where the data pipeline comes in.
A data pipeline enables a smooth, automated flow of data from one point to the next. It defines what, where, and how data is collected. It then automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. It also speeds up the end-to-end flow by reducing errors and combating bottlenecks and latency.
The Customer Need – AWS Data Pipeline
For one ClearScale client, the analytics pipeline, along with its MySQL database, was proving to be the bottleneck in operating its AWS-based technology platform. Anticipating a significant workload increase on the platform that would intensify the bottlenecks, the company reached out to ClearScale to help remedy the situation.
Specifically, the company needed a new AWS-centric data pipeline solution that would mitigate the bottlenecks while offering greater scalability to handle increased workloads. The solution also needed to be more cost-efficient and not require as many staff resources for management and administration.
There are a variety of ways to manage data. But for many AWS data management projects, AWS Data Pipeline is seen as the go-to service for processing and moving data between AWS compute and storage services and on-premises data sources. It’s known for helping to create complex data processing workloads that are fault-tolerant, repeatable, and highly available.
However, it’s not the only option, or even the best option for every situation.
A Creative Approach to Data Management
ClearScale determined that AWS Data Pipeline lacked the flexibility to meet this particular customer’s needs. After evaluating various options, the ClearScale team decided to take a unique approach by using AWS Step Functions, a general-purpose workflow management tool, for data pipeline orchestration.
Step Functions was developed for orchestrating complex flows using Lambda functions, and it’s primarily used for app development. But ClearScale determined how to use its attributes for data pipeline orchestration and combine it with other AWS services to create a solution that could best meet the customer’s needs.
AWS Step Functions works by coordinating multiple AWS services into serverless workflows. There are no servers to provision, scale, or manage, so the associated costs and personnel are not required.
The workflows are composed of a series of steps, with the output of one step acting as the input for the next. Each workflow is represented as an easy-to-understand state machine diagram. Step Functions automatically triggers and tracks each step and retries when there are errors. As a result, the steps execute in order and as expected. Step Functions also logs the state of each step, so that any problems can be diagnosed and debugged quickly. This all parallels well with the steps associated with data pipelines.
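To make the step-by-step model concrete, here is a minimal sketch of what such a workflow definition could look like in Amazon States Language, built as a Python dictionary. It is not the customer's actual pipeline: the three step names (Extract, Transform, Load), the Lambda ARNs, and the retry values are all assumptions chosen for illustration.

```python
import json

# Retry policy shared by every task (illustrative values): retry a failed
# task up to 3 times, waiting 5s, then 10s, then 20s between attempts.
RETRY_POLICY = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 5,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]


def build_pipeline_definition(extract_arn, transform_arn, load_arn):
    """Return a state machine definition for a hypothetical three-step pipeline."""

    def task(resource_arn, next_state=None):
        # A Task state runs one unit of work; its output becomes the
        # input of the state named in "Next".
        state = {"Type": "Task", "Resource": resource_arn, "Retry": RETRY_POLICY}
        if next_state is None:
            state["End"] = True  # last step of the workflow
        else:
            state["Next"] = next_state
        return state

    return {
        "Comment": "Hypothetical extract-transform-load pipeline",
        "StartAt": "Extract",
        "States": {
            "Extract": task(extract_arn, "Transform"),
            "Transform": task(transform_arn, "Load"),
            "Load": task(load_arn),
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; real ones would point at deployed Lambda functions.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:{}"
    definition = build_pipeline_definition(
        arn.format("extract"), arn.format("transform"), arn.format("load")
    )
    print(json.dumps(definition, indent=2))
```

The resulting JSON is the kind of document Step Functions accepts when a state machine is created. Keeping the retry policy in one place means every step fails and recovers the same way.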
The Rest of the AWS Data Pipeline Solution
Amazon S3 was selected as the primary storage platform for the solution’s associated data lake because of its virtually unlimited scalability. Capacity grows seamlessly and without disruption, and the customer pays only for what is used. It’s designed to provide 99.999999999% durability and has native encryption and access control capabilities. All data types can be stored in their native formats. It also integrates with services such as Amazon Athena and AWS Glue to query and process data, as well as with AWS Lambda serverless computing to run code without provisioning or managing servers.
The solution also uses Amazon Athena, a serverless, interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. The customer pays only for the amount of data scanned. Plus, there’s no need for complex ETL jobs to prepare data for analysis.
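As a rough illustration of querying S3 data through Athena, the sketch below builds the request for the `start_query_execution` call exposed by the boto3 Athena client. The function names are hypothetical helpers, and any database, table, or bucket passed in would be placeholders, not details from the customer project.

```python
def build_athena_request(sql, database, output_location):
    """Build the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(athena_client, sql, database, output_location):
    """Submit a query and return its execution ID.

    athena_client is expected to behave like boto3.client("athena").
    The query runs asynchronously; the ID is used to poll for results.
    """
    request = build_athena_request(sql, database, output_location)
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]
```

With AWS credentials configured, `start_query(boto3.client("athena"), "SELECT ...", "my_database", "s3://my-bucket/results/")` would submit the query. Because billing follows the amount of data scanned, selecting only the needed columns and partitions tends to keep costs down.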
Amazon Kinesis is employed to collect, process, and analyze real-time, streaming data at any scale, including video, audio, application logs, website clickstreams, and other telemetry data. It allows for processing and analyzing data as it arrives instead of waiting until all data is collected.
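For a sense of how data enters a stream, here is a minimal sketch of sending one event to Kinesis Data Streams with the boto3 `put_record` call. The helper names, the event shape, and the choice of partition key are assumptions for illustration only.

```python
import json


def build_record(stream_name, event, partition_key):
    """Build the keyword arguments for the Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        # Kinesis carries opaque bytes, so the event is serialized as JSON.
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        "PartitionKey": partition_key,
    }


def send_event(kinesis_client, stream_name, event, partition_key):
    """Send one event and return the sequence number Kinesis assigned to it.

    kinesis_client is expected to behave like boto3.client("kinesis").
    """
    response = kinesis_client.put_record(
        **build_record(stream_name, event, partition_key)
    )
    return response["SequenceNumber"]
```

Consumers can begin reading each record as soon as it is written, which is what allows processing to start before the full data set has arrived.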
The Database
Amazon Aurora is used for the database. AWS describes it as up to five times faster than standard MySQL databases, with the security, availability, and reliability of commercial databases at one-tenth the cost. It’s fully managed by Amazon RDS, which automates tasks such as database setup, patching, and backups. It delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones (AZs).
In addition, Aurora Auto Scaling dynamically adjusts the number of Aurora Replicas provisioned for an Aurora DB cluster using single-master replication. This enables the Aurora DB cluster to handle sudden workload increases. When the workload decreases, unneeded replicas are removed, so the customer doesn’t have to pay for unused provisioned DB instances.
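Replica auto scaling of this kind is typically configured through Application Auto Scaling. The sketch below builds the two requests involved, a scalable target and a target-tracking policy based on average reader CPU. The cluster name, replica bounds, CPU target, and policy name are hypothetical values, not the customer's settings.

```python
def build_replica_scaling_config(cluster_id, min_replicas, max_replicas,
                                 target_cpu_percent):
    """Return (scalable_target, scaling_policy) request dictionaries."""
    # An Aurora cluster supports at most 15 read replicas.
    if not 0 <= min_replicas <= max_replicas <= 15:
        raise ValueError("replica bounds must satisfy 0 <= min <= max <= 15")

    # Fields that identify the Aurora cluster's replica count to the service.
    common = {
        "ServiceNamespace": "rds",
        "ResourceId": f"cluster:{cluster_id}",
        "ScalableDimension": "rds:cluster:ReadReplicaCount",
    }
    scalable_target = dict(
        common, MinCapacity=min_replicas, MaxCapacity=max_replicas
    )
    scaling_policy = dict(
        common,
        PolicyName=f"{cluster_id}-reader-cpu",  # hypothetical naming scheme
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "RDSReaderAverageCPUUtilization"
            },
            # Replicas are added or removed to hold CPU near this value.
            "TargetValue": float(target_cpu_percent),
        },
    )
    return scalable_target, scaling_policy
```

With a `boto3.client("application-autoscaling")` client, the two dictionaries would be passed to `register_scalable_target(**scalable_target)` and `put_scaling_policy(**scaling_policy)`. The minimum bound sets the floor that remains provisioned when the workload drops.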
The Results – Reduced Data Management Costs
ClearScale’s innovative solution generated serverless analytics and data pipelines for the customer that reduced its administrative costs for data management. Data processing is faster. The analytics are far more granular and beneficial. The overall process is more secure and reliable.
The solution also has implications for data management in general. ClearScale’s creative approach is spurring discussions on and investigations into the use of AWS Step Functions for building “better AWS data pipelines.” This proves once again that ClearScale is truly at the forefront of the Big Data and app development industries.
Learn how ClearScale’s pioneering spirit and vast expertise can benefit your organization.