Scaling up for IPL 2023
We’ve had some great feats when it comes to performance engineering, but the one we pulled off recently, in preparation for IPL 2023, was especially noteworthy. In less than 5 days, three teams poured their souls into accomplishing something that had originally been deemed impossible!
This humongous feat wouldn’t have been possible without significant contributions from Navin Nagpal, Denzil Sequeira, Hardik Shah, Kishlaya Kumar, Avneesh Sharma, Kaushal Singh, Lalitha Duru, Shubham Jadhavar, Mayank Pal, Jude Pereira, and the entire engineering team at CleverTap ❤️
One of our prominent customers wanted us to prepare for ingesting 240 million events per minute during the IPL. Naturally, when such a requirement comes in, we have an SOP (standard operating procedure) in place. However, that SOP stated that we’d have to 3x the customer’s TesseractDB cluster size, from 23 to 69 nodes. That would be acceptable as long as there’s a justifiable requirement.
Heads up! An overview of our current architecture can be found here.
Usually, we do not deviate from the standard cluster size of 23, and whenever we’ve had to, multiple teams get involved: engineering and finance, for example. However, for this particular requirement, something didn’t feel quite right. If we went to 69 nodes for this customer, it’d be a herculean effort to bring the cluster back down to 23 immediately after the IPL.
Knowing well that keeping them on a 69-node cluster would mean the economics wouldn’t be justified after the 2023 IPL season, we decided to address this in a two-pronged manner: one team would prepare them for re-sharding from 23 to 69 nodes, while the other evaluated the feasibility of scaling them vertically. Scaling vertically would mean that they’d retain the 23-node cluster size, and have larger instance sizes during the cricket season. This option had a significant benefit over re-sharding them horizontally: we could switch the instance sizes back immediately after the IPL.
Re-sharding from a higher node count to a lower node count is something that we’ve never had to do before, and if we decided to go down this route, it would take away precious time from our roadmap. Our PMs wouldn’t be very happy with that 🙁
Time was running out, and therefore, both teams had to move quickly.
Divide and Conquer
Team #1: Optimise!
One of the teams quickly started evaluating the feasibility of squeezing more out of the existing 23 node cluster. What really transpired was true engineering. Within the first few hours, we quadrupled the throughput from 45 million per minute to 180 million per minute! 🎉
That was a massive feat in under 24 hours! Remarkably, this happened just by enabling one of the core features built into the platform way back in 2017, which hadn’t seen the light of day until now. Along with this, we switched the JVM’s garbage collector to G1 GC, as part of our ongoing migration to Java 17.
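Switching collectors is just a launch-flag change. A sketch of what that looks like (the flag names are real; the heap sizes, pause target, and jar name below are purely illustrative, not our production values):

```shell
# Hypothetical service launch; -XX:+UseG1GC selects the G1 collector.
# From Java 9 onwards G1 is already the default, so once the Java 17
# migration lands, the explicit flag becomes redundant.
java -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -Xms30g -Xmx30g \
     -jar ingestion-service.jar
```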
Team #2: Prepare to Re-shard
This is something that we do on a regular basis. There’s an SOP for this too, with a well-automated procedure. However, the infrastructure to re-shard had to be specially provisioned by our infrastructure team. They were to run one mock sharding attempt, and then the final, production re-shard later that evening. In this exercise, the customer’s cluster would be re-sharded from 23 to 69 nodes. This was the backup plan, in case the first team didn’t have enough time to optimise the platform.
It was just a few hours to midnight, and a decision had to be made: re-shard, or double down on optimising the ingestion pipeline. We had hit 180 million per minute earlier that day, but the requirement was for 240 million. The go-ahead was given to the team responsible for sharding the customer’s TesseractDB cluster.
The next morning, the team responsible for optimisations had hit 239 million per minute! At this point, we ceased re-sharding operations, and both teams focussed all their efforts on moving the needle even further, so that we’d have a comfortable margin just in case traffic spiked beyond the requirement.
The Linux Kernel’s Unexpected Limits
At this point, the team split into two once again, with a new goal in mind: find the maximum number of TCP connections, and the limits of TesseractDB, on a brand-new hardware configuration (read: EC2 instance type).
Very quickly, we discovered that we hit the default limit of TCP connections in Amazon Linux 2’s Linux kernel. That number was 262,144 connections. On any other day, I’d have been happy with that, but that day, we needed more. According to our new ingestion pipeline, there’d be 80,000 connections reserved for ingesting events alone. However, the customer in question wanted to be able to run Native Display Campaigns too, and that would mean that we’d need to support live traffic from over 600 instances! Each of these instances could potentially spawn 600 connections to each TesseractDB node, driving up the grand total to 360,600 TCP connections!
Something had to be done about that. Together with the infra team, we configured netfilter’s connection tracking to allow up to 1 million connections. To our surprise, it was quite easy to do so:
```shell
# echo 1000000 > /proc/sys/net/nf_conntrack_max
# echo 125000 > /sys/module/nf_conntrack/parameters/hashsize
```
For an explanation of how we arrived at the magical 125,000 number for netfilter’s hash size, see here.
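One plausible reading of the ratio (the linked post has the full reasoning): conntrack chains colliding entries into per-bucket linked lists, so sizing the hash table at one-eighth of the connection cap keeps the average chain length around eight entries. A quick sketch of the arithmetic, plus the persistent equivalents of the raw `/proc` and `/sys` writes:

```shell
# 1,000,000 tracked connections / 8 entries per bucket = 125,000 buckets.
CONNTRACK_MAX=1000000
HASHSIZE=$((CONNTRACK_MAX / 8))
echo "$HASHSIZE"    # prints 125000

# Note: raw /proc and /sys writes do not survive a reboot. The persistent
# equivalents are net.netfilter.nf_conntrack_max in /etc/sysctl.d/ and
# "options nf_conntrack hashsize=125000" in /etc/modprobe.d/.
```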
The Ultimate Threshold
We simulated a massive number of TCP connections to a single node by using wrk, a tool that’s quite common amongst the performance engineers at CleverTap. For some reason, despite configuring Linux to allow a million connections, we could only hit 765,000 connections. However, this was more than enough, and didn’t require further scaling out 🙂
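The endpoint, fleet size, and flags below are hypothetical, but the load generation was along these lines: wrk splits `-c` connections across `-t` threads, and since a single client host is bounded by its roughly 64k ephemeral source ports per destination, a connection count this large has to be spread across multiple client hosts (or source IPs).

```shell
# Hypothetical numbers; swap the final echo for a real wrk run per client.
TARGET="http://tesseract-node.internal:8080/health"   # hypothetical endpoint
CONNECTIONS=765000     # the ceiling we observed on a single node
CLIENTS=16             # hypothetical client fleet size
THREADS=16             # wrk worker threads per client
PER_CLIENT=$((CONNECTIONS / CLIENTS))
echo "each client runs: wrk -t${THREADS} -c${PER_CLIENT} -d60s ${TARGET}"
```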
Meanwhile, the team optimising the ingestion pipeline made further leaps, and was able to ingest slightly over 289 million per minute, peaking at 309 million! This was accomplished by fine-tuning Jetty, the embedded server that powers all of our components’ internal APIs. The most important lesson from this exercise was that thread pools and blocking IO are incredibly tricky to tune to the right numbers. This paves the way for us to move to NIO (non-blocking IO) at some point 🙂
The Piña Colada Checklist 🍹
Yes, you read that right! A day before the first IPL game, we drafted a piña colada checklist. If everything had been checked off from this checklist, the team would be sipping piña coladas, watching the very first game of IPL 2023! Of course, this was metaphorical, for we did have a minor emergency during the first game. However, our infrastructure team responded to this incident within seconds (literally!), and everything was business as usual 🙂
The future looks very bright at this moment, knowing that we managed to scale our existing ingestion pipeline by almost an order of magnitude with less than 3 days of work. We haven’t yet touched the true potential of the EC2 instances that we’re utilising, and we know that we can optimise the pipeline to ingest significantly more data points per minute!