From climate-resistant crops to predicting disease risk in healthy individuals, genomics (the study of an organism’s complete set of genetic material) holds great promise. It’s a data-intensive field where reproducibility, efficient data management, and fast processing of large datasets are essential.

One thing that slows down that processing is manual work. That was the case for a recent ClearScale customer. Without the right IT architecture, one of the company’s key project teams had to process data samples manually, which delayed the overall workflow. The time-consuming routine entailed going to the data source, downloading the samples, gathering the input parameters, uploading the data to Amazon Elastic Compute Cloud (EC2) instances, and running a variety of data pipeline steps.

Knowing that ClearScale had specific expertise in working with AWS services and in Big Data applications and automation, the customer requested assistance in automating some of its processes — including processing data samples with Nextflow.

Containers, Pipelines, and Workloads

The ClearScale team began by gathering and reviewing the company’s current workflow and business requirements. The team determined that creating and automating data pipelines was the optimal solution. A pipeline processes data in steps, with the output of each step passed on as input to the next.
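To make the step-to-step handoff concrete, here is a minimal Python sketch of a pipeline in which each step receives the previous step’s output. It is an illustration only, not the customer’s pipeline: the step names (fetch_sample, attach_parameters, analyze), the sample ID, and the thread count are all hypothetical placeholders.

```python
from functools import reduce
from typing import Callable, Dict, Iterable

# A step takes the pipeline state and returns an updated copy of it.
Step = Callable[[Dict], Dict]


def fetch_sample(state: Dict) -> Dict:
    # Hypothetical stand-in for downloading a sample from the data source.
    return {**state, "raw": f"reads-for-{state['sample_id']}"}


def attach_parameters(state: Dict) -> Dict:
    # Hypothetical stand-in for gathering the input parameters.
    return {**state, "params": {"threads": 4}}


def analyze(state: Dict) -> Dict:
    # Hypothetical stand-in for an analysis step that uses both earlier outputs.
    threads = state["params"]["threads"]
    return {**state, "result": f"analyzed({state['raw']}, threads={threads})"}


def run_pipeline(steps: Iterable[Step], initial: Dict) -> Dict:
    # Feed the output of each step into the next one, in order.
    return reduce(lambda state, step: step(state), steps, initial)


if __name__ == "__main__":
    final = run_pipeline([fetch_sample, attach_parameters, analyze], {"sample_id": "S1"})
    print(final["result"])
```

Because every step has the same shape (state in, state out), steps can be reordered, swapped, or packaged into separate containers without changing the code that drives the pipeline.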

Various AWS services were evaluated for use in architecting the solution. The team chose Amazon Elastic Container Service (Amazon ECS), a highly scalable, high-performance container orchestration service. The idea was to create containers for all the steps for processing data samples without the customer having to install and operate its own container orchestration software. Nor would it have to manage and scale a cluster of virtual machines (VMs) or schedule containers on the VMs. That, in turn, would significantly reduce the capital investment required.

Nextflow, a free, open-source workflow tool, was selected to enable scalable, reproducible scientific workflows using the containers. Nextflow includes built-in support for AWS Batch, a managed computing service that runs containerized workloads on Amazon ECS.

AWS Batch allows for seamless deployment of Nextflow pipelines in the cloud by offloading process executions as managed batch jobs. The service spins up the required computing instances on demand and scales the number of instances up and down to match the workload’s resource needs at any point in time. That flexibility could yield cost savings as well.
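As a rough illustration of what offloading a process as a managed batch job involves, the Python sketch below builds the request for one AWS Batch job per pipeline step and chains two steps with a job dependency. Nextflow generates and submits such jobs itself, so this is not its implementation and not the customer’s configuration. The field names follow the AWS Batch SubmitJob API, while the queue name, job definition name, commands, and the helper build_batch_request are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical resource names; a real deployment would use its own.
JOB_QUEUE = "example-genomics-queue"
JOB_DEFINITION = "example-nextflow-step"


def build_batch_request(
    step_name: str,
    command: List[str],
    depends_on_job_id: Optional[str] = None,
) -> Dict:
    """Build the keyword arguments for one AWS Batch SubmitJob call."""
    request: Dict = {
        "jobName": step_name,
        "jobQueue": JOB_QUEUE,
        "jobDefinition": JOB_DEFINITION,
        # Override the container command so one job definition can serve many steps.
        "containerOverrides": {"command": command},
    }
    if depends_on_job_id is not None:
        # AWS Batch holds this job until the job it depends on has succeeded.
        request["dependsOn"] = [{"jobId": depends_on_job_id}]
    return request


if __name__ == "__main__":
    align = build_batch_request("align-S1", ["bash", "-c", "echo align S1"])
    # In a real run the ID would come from the response to the first submission;
    # "job-id-of-align-S1" is a placeholder.
    count = build_batch_request(
        "count-S1", ["bash", "-c", "echo count S1"], depends_on_job_id="job-id-of-align-S1"
    )
    # With boto3 installed and AWS credentials configured, each request could be
    # sent with boto3.client("batch").submit_job(**request).
    print(align)
    print(count)
```

Building the request separately from sending it keeps the example runnable without an AWS account and makes the job parameters easy to inspect.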

The Nextflow Custom Architecture

ClearScale then developed the architecture based on an AWS solution for running workflows with EC2 instances pre-configured for Nextflow. The team modified the solution by incorporating Cell Ranger, a set of analysis pipelines; Perl, a programming language; ingestion containers for different Nextflow processes; AWS CodeBuild jobs; custom job definitions; and other components.

With the customized Nextflow solution tested, documented, and deployed, the customer is now able to process data faster. That can lead to accelerated analyses and, ultimately, faster time to market for the company’s products. The solution also enables the company to scale resources efficiently to meet demand and then scale them back when demand subsides, for further cost savings.

The Rest of the Story

Increasingly, companies involved in genomics and fields such as biology, drug discovery, and molecular diagnostics are reaching out to ClearScale. They’re looking for assistance developing custom architecture and infrastructure to optimize and accelerate their data-handling processes and workflows.

While we don’t bill ourselves as genomics experts, we do have extensive experience in using the vast array of AWS services created to support data pipeline development and deployment, as well as automation, cloud migration, and more. That experience is invaluable to genomics companies as well as to those in other fields.