The ability of a Web application to handle high volumes of customer requests can make or break an organization at a critical juncture. As numerous companies have learned over the years, building a digital product that can’t scale to meet spikes in demand can lead to lost opportunities, lost revenue, and negative perceptions in the market.
A recent engagement with a client reminded us of this concern. ClearScale, an AWS Premier Consulting Partner, was asked by a client to review and audit the current version of their application along with a newer version they had created. Unlike other companies, this client realized the risk they faced at a critical time and needed to understand where the performance issues were within their application and cloud infrastructure layers.
ClearScale needed to perform an extensive audit of the application, which was being prepared for an expected increase in donations due to the 2016 American election cycle. Despite having rebuilt the application, the client was still seeing high error rates and long response times between a request from the application and the response from the database.
The goal was to understand how many concurrent users the application could tolerate before it began experiencing issues, as well as how many Requests Per Second (RPS) it could sustain for given pages. The application itself was built on Ruby on Rails, with the database served by Amazon Relational Database Service (RDS).
Previously, the client’s largest transaction day had seen approximately 64,000 transactions, with a peak of around 6,500 transactions per hour. Not only were they expecting a far higher transaction count for their election donation system in September 2016, but they also believed that nearly 40% of the transactions would come from first-time customers, which require more processing resources.
The audit needed to focus on in-depth performance testing of both the client’s current application and their new application to compare how they performed under high loads and RPS. The client provided the test cases, and ClearScale converted these into test plans using Apache JMeter 3.0, with requests sent to the instances through HAProxy load balancers with external endpoints. The approach was to start at around 1,000 threads over 120 seconds, hold for 30 minutes, and then ramp up to 5,000 threads.
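The thread schedule described above can be sketched as a simple function of elapsed time. This is only an illustration, not the JMeter test plan itself: it reads the 120 seconds as a ramp-up period to the first level, and the length of the second ramp (to 5,000 threads) is not stated in the audit, so `second_ramp_seconds` is a hypothetical parameter.

```python
def target_threads(t_seconds,
                   first_level=1000,
                   first_ramp_seconds=120,      # assumption: 120 s is the ramp-up time
                   hold_seconds=30 * 60,
                   final_level=5000,
                   second_ramp_seconds=120):    # hypothetical: not given in the audit
    """Return the intended number of active threads at t_seconds into the test."""
    if t_seconds < 0:
        raise ValueError("t_seconds must be non-negative")

    # Phase 1: linear ramp from 0 to the first level.
    if t_seconds < first_ramp_seconds:
        return int(first_level * t_seconds / first_ramp_seconds)

    # Phase 2: hold at the first level.
    if t_seconds < first_ramp_seconds + hold_seconds:
        return first_level

    # Phase 3: linear ramp from the first level to the final level.
    elapsed = t_seconds - first_ramp_seconds - hold_seconds
    if elapsed < second_ramp_seconds:
        step = (final_level - first_level) * elapsed / second_ramp_seconds
        return first_level + int(step)

    # After the second ramp, stay at the final level.
    return final_level
```

Sampling this function at a few points (for example, halfway through each ramp) is a quick way to sanity-check a load profile before committing it to a real test plan.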
Performance tests were run against both applications over a two-day period. Repeated testing revealed that the newest application, designed to streamline overall donation processing, consistently saw between 55% and 89% of requests fail due to a gateway time-out. With RDS CPU utilization hovering near 100% during each of the tests, the average processing time for a donation was between 25 and 55 seconds. Even with these results, the ClearScale team discovered that the newer application performed 11% better in RPS tests than the current system, with a 13% decrease in error rates.
However, even under ideal circumstances, the audit found that the system could only take around 100 donations per second. As donations rose to 200 per second, gateway time-outs increased. The culprit was a less-than-optimal RDS configuration, combined with the amount of work RDS was being asked to do during a donation request.
ClearScale identified several areas for remediation and recommended them to the client. Addressing them would decrease the high error rates and allow the system to perform at the level needed for the election donation cycle.
The audit found that root access keys were older than 90 days, increasing the risk of security issues. ClearScale also found a number of unused security groups, as well as a need to restrict access to certain ports to a limited number of external IP addresses. In addition, ClearScale recommended restricting access to the WebApp instances to HTTP and HTTPS, removed unused routing tables, and then reconfigured RDS to allow for ideal performance under the tightest security standards.
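The 90-day key-age check lends itself to a small script. The sketch below is a minimal illustration, assuming the key IDs and creation timestamps have already been exported into a dictionary (for example, from an IAM credential report); the key names and dates in the usage example are hypothetical, and this is not the tooling ClearScale used.

```python
from datetime import datetime, timedelta, timezone

# Rotation threshold taken from the audit finding (keys older than 90 days).
MAX_KEY_AGE = timedelta(days=90)


def stale_keys(keys, now=None):
    """Return the IDs of keys created more than MAX_KEY_AGE before `now`.

    `keys` maps a key ID to its timezone-aware creation datetime.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [key_id for key_id, created in keys.items()
            if now - created > MAX_KEY_AGE]


# Hypothetical usage with made-up key IDs and dates.
audit_date = datetime(2016, 9, 1, tzinfo=timezone.utc)
example_keys = {
    "key-old": datetime(2016, 1, 1, tzinfo=timezone.utc),
    "key-new": datetime(2016, 8, 1, tzinfo=timezone.utc),
}
flagged = stale_keys(example_keys, now=audit_date)
```

Passing `now` explicitly keeps the check reproducible, which is useful when the same report is re-run for an audit trail.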
ClearScale determined that auto-scaling wasn’t properly configured, so an effort was undertaken to correct this critical area of concern. In addition, the client had their RDS instance upgraded to r3.8xl for better performance. RDS was then reconfigured so that the read load, which was driving the high RDS CPU usage, was offloaded from the primary RDS instance to read replicas configured in the application layer, leaving the primary instance to handle only critical transactional data. An effort was also made to catch slower queries using query logging mechanisms to help pinpoint issues going forward, to add indexing to all RDS tables, and to move calculations and sorting from the RDS database to the application itself, lowering the CPU load on the database.
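The idea of offloading reads from the primary to replicas can be illustrated with a small router. This is a simplified sketch in Python rather than the client’s Ruby on Rails code, and the class name, endpoint names, and table name are all hypothetical. It also ignores real-world concerns such as replica lag and reads that must run inside a write transaction.

```python
import itertools


class ReadWriteRouter:
    """Send plain SELECT statements to replicas and everything else to the primary."""

    def __init__(self, primary, replicas):
        if not replicas:
            raise ValueError("at least one read replica is required")
        self.primary = primary
        # Round-robin over the replicas so read load is spread evenly.
        self._replicas = itertools.cycle(replicas)

    def endpoint_for(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        # Simplification: only statements starting with SELECT count as reads.
        # Locking reads such as SELECT ... FOR UPDATE would need the primary.
        if verb == "SELECT":
            return next(self._replicas)
        return self.primary


# Hypothetical usage with made-up endpoint and table names.
router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
first_read = router.endpoint_for("SELECT * FROM donations")
second_read = router.endpoint_for("select id from donations")
write = router.endpoint_for("INSERT INTO donations (amount) VALUES (25)")
```

Keeping the routing decision in one place makes it easier to tighten later, for example by pinning a user’s reads to the primary for a short time after they write.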
As part of normalizing the overall environment, ClearScale discovered that the client’s EC2 instances, which would normally be spread evenly across several Availability Zones, were not uniformly distributed. One Availability Zone had more EC2 instances than the other, increasing the risk to operations should an issue occur in that zone. ClearScale redistributed the EC2 instances evenly across the Availability Zones.
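As a rough illustration of what an even distribution means here, the sketch below computes a balanced target per zone and the change each zone would need. The function names and zone names are hypothetical, and the instance counts in the usage example are invented for illustration rather than taken from the audit.

```python
def balanced_counts(total_instances, zones):
    """Spread total_instances across zones so counts differ by at most one."""
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    base, extra = divmod(total_instances, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if i < extra else 0)
            for i, zone in enumerate(zones)}


def moves_needed(current):
    """Return the per-zone change (positive = add, negative = remove)."""
    target = balanced_counts(sum(current.values()), sorted(current))
    return {zone: target[zone] - count for zone, count in current.items()}


# Hypothetical usage: one zone holds more instances than the other.
changes = moves_needed({"zone-a": 6, "zone-b": 2})
```

In practice an Auto Scaling group spanning multiple Availability Zones is intended to keep instances balanced across them, so a calculation like this mainly serves as an audit check.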
Overall, this remediation effort also helped save the client money. By optimizing the auto-scaling policies and CNAME record sets, reducing the instance types for RDS read replicas, removing unused CodeCommit repositories, archiving access logs, and removing unused EIPs, the client now pays less to AWS for its services.
ClearScale encourages all organizations to undergo an application and infrastructure audit on a regular basis in order to find ways to optimize the performance and reduce bottlenecks in transactional activities.
In the case of our client, they came to us for an audit, and in the end we not only discovered ways to increase transaction counts but also identified areas where they were paying for unused or overused services.
No matter the type of issue our clients bring to us, ClearScale believes that in order to make our clients successful, a thorough understanding of the existing ecosystem needs to be a priority, along with an agreement on the requirements and goals. Laying out a roadmap to success means that we deliver results that meet or exceed our clients’ expectations.