Capturing data and utilizing it for business needs is a relatively easy thing to accomplish if done correctly. Analyzing the data to look for patterns and trends is an entirely different issue. When companies decide to implement deep analytics to drive decision-making, often implementing the suite of reports can be a challenge.
This was partially true for our client, a real-time data analytics company, who approached us to help them find a better way of performing Big Data analytics that they needed. Over the years, they had built out a substantial catalog of functions and stored procedures that they had utilized in their Microsoft SQL and Tableau environments. Those functions and procedures were used to analyze data that was then used by thousands of teachers, education professionals, and analysts to gain valuable insight into how schools operated which, in turn, helps change students’ lives.
The challenge our client faced was the reality that with a large dataset and increased usage and concurrency of users and queries, their Microsoft SQL environment was starting to feel the effects of increased latency and high operational cost. They required a solution that potentially could replace their existing data analytics environment and raise the bar on performance while lowering the overall cost of operation.
As an AWS Premier Consulting Partner, ClearScale has a large number of tools available in the AWS suite of products, and we set out to perform an analysis of Amazon Redshift vs. Microsoft SQL data analytics solution. We knew that in addition to using stored procedures and variable-driven SQL scripts, our client also used Tableau analytics to help do deep dives into the data. As such, our analysis also had to take into account a Tableau solution as part of the determination.
The SQL Server to Redshift Solution
Our goal was to determine the cost-effectiveness and scalability of an Amazon Redshift + Tableau combination and then compare it to their existing solution to gauge the value of moving from SQL Server to Redshift. We already knew that Redshift was built with optimized data warehousing in mind and was, like all AWS solutions, highly scalable and agile in nature. From an architecture perspective, this was due in large part to the fact that Redshift utilizes Dense Storage (DS) nodes, essentially hard disk drives, to build out extremely large data warehouses while at the same time leveraging Dense Compute (DC) nodes composed of high-speed CPUs, lots of RAM, and solid-state drives.
Not only that, but Redshift can easily provision additional amounts of space and capacity as needed on the fly and copy over client data from one cluster to the newly provisioned cluster with no downtime or loss of data. With the ability to shift from a minimum of 160GB DC1Large nodes up to 16TB DS2.8XLarge nodes, Redshift can utilize up to a petabyte of compressed client data in the newly created provisioned cluster and automatically reroute existing traffic from the old cluster to the new with minimal effort.
In addition, Redshift uses massively parallel processing (MPP), columnar storage, data compression, zone maps, and a 10GigE mesh network to maximize throughput and optimize data queries. Combined with fault-tolerant routines which automatically back up client data to an Amazon S3 bucket and provide point-in-time restorations should a node fail, Redshift provides a complete end-to-end solution for data warehousing and analysis needs.
As per standard Amazon practice, everything stored in AWS is fully encrypted, both in transit by utilizing SSL, and at rest by leveraging hardware-accelerated AES-256 encryption. This encryption works not only on the data sets being primarily leveraged but on any automatic backups using Redshift-managed keysets or ones that the client chooses to create using AWS CloudHSM or Key Management Services. Allowing clients to also set their own firewall rules to restrict access to their stored data warehouse cluster, in conjunction with AWS CloudTrail to allow auditing of all API calls, is standard operating procedure when utilizing Redshift. It also logs all SQL operations (including connection attempts and data access or change requests) and is compliant with SOC1, SOC2, SOC3, and PCI DSS Level 1 requirements.
When it was time to test Redshift under custom load, ClearScale utilized Tableau’s TabJolt, a tool that can easily be deployed to any environment which then runs simulated loads or query requests against the data warehouse. Because TabJolt is aware of the Tableau presentation model used for data queries and realizes how it can simulate possible interactions, it is able to perform load and performance testing without an operator’s need to create scripts; this gives TabJolt an advantage over traditional load and performance testing methods and provides essentially a point-and-run approach. It also logs typical KPIs, such as throughput and average response time, so ClearScale could see how Amazon Redshift compared to Microsoft SQL data warehouse solution.
When the analysis of the two approaches to data warehousing was complete, it was clear to both ClearScale and our client that not only did Redshift + Tableau outperform their Microsoft SQL + Tableau implementation, but the cost was hugely reduced compared to their existing operations. At roughly $1000 per terabyte per year uncompressed, or using compressed data that can drop as low as $333 per terabyte per year, from a cost perspective the SQL Server to Redshift solution made good financial sense for our client.
Coupled with the knowledge that Redshift is highly scalable and highly secure, and offers high concurrency, and high processing power and throughput, ClearScale was able to demonstrate to our client that migrating their data warehousing solution to Amazon Redshift made a good operational choice for them going forward. Once again, ClearScale was able to show a valuable client how AWS solutions could be implemented to meet their unique and demanding operational needs and greatly reduce their cost.