Big Data has transformed the way companies conduct business. By carefully analyzing large amounts of information, organizations can draw conclusions and act on accurate, relevant, and up-to-date data. Companies that embrace Big Data gain access to tools that can significantly reduce the cost of doing business.

Using Big Data to help an organization lower costs requires a detailed understanding of how, when, and where money is being spent. Without an accurate method to assess the true price of a product or service, associated costs can continue to rise, compounding a problem that is already difficult and expensive to solve. Big Data helps organizations curb runaway or unnecessary spending: detailed analytics platforms identify cost centers, pinpoint wasteful areas and opportunities for efficiency, and ultimately inform plans that improve the bottom line.

A client in the financial services industry addresses these inefficiencies through comprehensive performance management tools. Its core analytic and operating system functions form a workflow engine that helps organizations ensure their clients operate with high efficiency. Serving the needs of multiple interested parties, its cloud-based SaaS platform was engineered to use data to model, predict, and manage costs with near 100% accuracy.

The Challenge – Migrating HDP Cluster to Amazon EMR

A significant part of this client’s analytics effort involves a Hadoop cluster that executes large numbers of batch jobs to generate insights from comprehensive data sets. The client operated a Hortonworks Data Platform (HDP) cluster self-hosted on EC2 nodes, but to reduce costs and avoid the lengthy Hortonworks upgrade process, it wanted to migrate the Hadoop cluster to Amazon Elastic MapReduce (Amazon EMR).

One of the most distinctive aspects of the migration was the requirement to integrate Apache Sentry, a framework to enable, monitor, and manage Hadoop data security, with an Amazon EMR cluster that uses Kerberos with a cross-realm trust. Because this client handles vast amounts of data containing private, sensitive information, it wanted to use Sentry’s data security framework, with agents that sync policies and users and plugins that run within the same process as the Hadoop component. The client approached ClearScale, an AWS Premier Consulting Partner, to draft a proposal to migrate its existing Hortonworks platform to Amazon EMR.

The ClearScale Solution

ClearScale began by analyzing the client’s existing data infrastructure. Starting with the Hortonworks implementation, ClearScale audited the volume of data in use, the types of data processing performed, and the way teams within the client’s organization developed and built applications for the Hortonworks platform.

After the audit, ClearScale designed a custom architecture to meet the client’s specific needs. The design uses the Amazon EMR managed cluster platform to cost-effectively process and analyze vast amounts of data, an Amazon RDS for MySQL database as a metastore for the data, and Amazon S3 for cloud-based storage. To meet the client’s specific data security needs, the Apache Sentry framework was integrated with Hadoop, and Kerberos was added for network authentication. Each component was chosen to address the client’s requests for lower maintenance, scalability, and security at a lower cost.
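As a rough illustration of how these pieces can fit together, the Python sketch below builds two JSON-style settings of the kind Amazon EMR accepts: a `hive-site` configuration classification that points a Hive metastore at an external MySQL database with its warehouse on S3, and a security configuration that enables Kerberos with a cross-realm trust. The case study does not publish the actual settings, so this assumes the metastore is a Hive metastore, and every host name, realm, bucket, and credential shown is a hypothetical placeholder. The Apache Sentry integration is not shown.

```python
import json


def hive_metastore_configuration(db_host, db_name, db_user, db_password, warehouse_uri):
    """Return a hive-site classification for an external MySQL metastore.

    Assumption: the metastore is a Hive metastore; all argument values
    passed in by the caller are illustrative placeholders.
    """
    jdbc_url = f"jdbc:mysql://{db_host}:3306/{db_name}?createDatabaseIfNotExist=true"
    return {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": jdbc_url,
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": db_user,
            "javax.jdo.option.ConnectionPassword": db_password,
            # Keep table data in S3 rather than on cluster-local HDFS.
            "hive.metastore.warehouse.dir": warehouse_uri,
        },
    }


def kerberos_security_configuration(realm, domain, admin_server, kdc_server,
                                    ticket_lifetime_hours=24):
    """Return an EMR security configuration with a cluster-dedicated KDC
    that trusts an external realm (for example, a corporate directory)."""
    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {
                    "TicketLifetimeInHours": ticket_lifetime_hours,
                    "CrossRealmTrustConfiguration": {
                        "Realm": realm,
                        "Domain": domain,
                        "AdminServer": admin_server,
                        "KdcServer": kdc_server,
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    metastore = hive_metastore_configuration(
        "metastore.example.internal", "hive", "hive_user", "change-me",
        "s3://example-bucket/warehouse/",
    )
    security = kerberos_security_configuration(
        "CORP.EXAMPLE.COM", "corp.example.com",
        "ad.corp.example.com", "ad.corp.example.com",
    )
    print(json.dumps([metastore], indent=2))
    print(json.dumps(security, indent=2))
```

With boto3, dictionaries like these would typically be passed as the `Configurations` argument of `run_job_flow` and, serialized with `json.dumps`, as the `SecurityConfiguration` argument of `create_security_configuration`. In practice the database password would come from a secrets store rather than being written inline.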

EMR Architecture Design

The Benefits

This client can now take advantage of the virtually unlimited storage capacity of Amazon S3, which offers industry-leading scalability and data availability. Amazon EMR’s highly scalable infrastructure also makes it possible to run task nodes on On-Demand Instances or Spot Instances, flexible options that can significantly reduce costs for particular workflows. Compared to the self-hosted Hortonworks cluster, Amazon EMR’s flexible deployments make it easy to stand up development systems and test upgrades.
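To make the On-Demand versus Spot choice concrete, the sketch below builds an instance group layout in the shape that boto3's `run_job_flow` accepts under `Instances["InstanceGroups"]`. It keeps the primary and core nodes on On-Demand capacity and lets the caller choose the market for task nodes. The instance type, node counts, and group names are hypothetical, since the case study does not describe the client's actual cluster sizing.

```python
def build_instance_groups(core_count, task_count,
                          instance_type="m5.xlarge", task_market="SPOT"):
    """Return EMR instance groups with task nodes on the chosen market.

    Assumption: one primary node, and primary/core nodes stay On-Demand so
    that a Spot interruption only removes task (compute-only) capacity.
    """
    if task_market not in ("ON_DEMAND", "SPOT"):
        raise ValueError("task_market must be 'ON_DEMAND' or 'SPOT'")

    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    # Task nodes are optional; add the group only when capacity is requested.
    if task_count > 0:
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "Market": task_market,
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["Name"], group["Market"], group["InstanceCount"])
```

Because task nodes do not store HDFS data, losing them to a Spot interruption slows a job rather than losing data, which is why this sketch limits Spot capacity to the task group.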

The result is a system that’s more secure, flexible, and cost-efficient. As one of the best choices for Big Data processing and analysis, Amazon EMR helps this client do more with less, streamlining their processing needs to be as efficient and effective as possible.

Because this client chose to partner with ClearScale, the swift transition of its data operations was handled by a trusted AWS Premier Consulting Partner with experts available to address any issues that arose. With its data operations fully addressed, this client can now focus on delivering quality analytics that save large organizations time and money.