Application Profiling at Scale
CleverTap has reached a scale where we run thousands of message delivery service instances across multiple AWS regions. Most of our customers expect very high campaign delivery throughput, which creates the need to profile our message delivery system often and thoroughly, looking at both CPU and memory usage. A year ago, we were mainly concerned with our message delivery system, but we also clearly felt the need for a generic technical solution that would help us profile various company services.
Existing Profiling Solutions
As an AWS tenant, we had the opportunity to explore CodeGuru, an AWS-managed service. We were optimistic about this option since it looked easy to use. However, its major limitation was that it wasn't available in all the AWS regions where our services run. This meant we could not profile in those regions and therefore could not use the managed AWS profiler.
Another option that's native to the JDK is jprof. We found it hard to use, as it required command-line access to our VMs, and collecting the resulting profiles was difficult.
A third option was JProfiler, which is similar to jprof. However, that too could be used on the command line only.
We then found async-profiler. It was particularly useful because of its lack of safepoint bias. In brief, safepoint bias is a weakness of sampling profilers where samples can only be taken at the next available safepoint poll location (more on the subject here). However, async-profiler still has to be used on the command line, and that is inconvenient, as VM access is non-trivial at CleverTap because of our tight security restrictions.
Then it occurred to us to see if there was a way to make our services self-profiling. In other words, as engineers, we do want to profile regularly, and we want that to be easy or we simply wouldn’t do it often enough.
From the solutions listed above, we liked async-profiler, but we wanted to make it easy to use. So, we thought, why not embed it into our services?
Async-profiler provides a Java interface to start profiling within the current JVM, and to stop it too. So that looked great! It meant that our services could initiate their own profiling and stop when necessary.
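To sketch what this looks like, the strings below follow the comma-separated command syntax that async-profiler's `execute()` method accepts. The wrapper class and file paths here are illustrative, not our actual library code:

```java
// Illustrative helper that builds the command strings passed to
// one.profiler.AsyncProfiler.execute(...). Class and path names are made up.
public class ProfilerCommands {

    // e.g. "start,event=itimer,jfr,file=/tmp/profile.jfr"
    public static String startCommand(String event, String jfrPath) {
        return "start,event=" + event + ",jfr,file=" + jfrPath;
    }

    // e.g. "stop,file=/tmp/profile.jfr"
    public static String stopCommand(String jfrPath) {
        return "stop,file=" + jfrPath;
    }

    public static void main(String[] args) {
        // With async-profiler on the classpath, the calls would be roughly:
        //   AsyncProfiler profiler = AsyncProfiler.getInstance();
        //   profiler.execute(startCommand("itimer", "/tmp/profile.jfr"));
        //   ... let the service run under load ...
        //   profiler.execute(stopCommand("/tmp/profile.jfr"));
        System.out.println(startCommand("itimer", "/tmp/profile.jfr"));
    }
}
```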
Since our services already expose HTTP endpoints on the internal EC2 network, we could expose these commands over HTTP endpoints as well. The only problem left was collecting the output (JFR files) from our production environment.
The easiest solution for collecting the JFR files was to simply upload them to S3. So we built a library around the async-profiler Java API (available in Maven) that starts the profiler (with various options) and stops it over HTTP. When the profiler is stopped, the resulting JFR file is uploaded to S3.
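The shape of those endpoints can be sketched with the JDK's built-in HTTP server. The paths are hypothetical, and the two `Runnable` parameters stand in for the real `AsyncProfiler.execute(...)` calls and the S3 upload that happens on stop:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of start/stop profiling endpoints on the service's internal HTTP
// server. Endpoint paths and response bodies are illustrative.
public class ProfilerEndpoints {

    public static HttpServer serve(int port, Runnable startProfiler, Runnable stopAndUpload)
            throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/profiler/start", exchange -> {
            startProfiler.run();  // real code: AsyncProfiler.execute("start,event=itimer,jfr,file=...")
            reply(exchange, "profiling started\n");
        });
        server.createContext("/profiler/stop", exchange -> {
            stopAndUpload.run();  // real code: AsyncProfiler.execute("stop,...") then upload the JFR to S3
            reply(exchange, "profiling stopped\n");
        });
        server.start();
        return server;
    }

    private static void reply(HttpExchange exchange, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, bytes.length);
        try (OutputStream os = exchange.getResponseBody()) {
            os.write(bytes);
        }
    }
}
```

Since the endpoints live on the internal network only, an engineer can kick off a profiling run with a plain HTTP request rather than shell access to the VM.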
This made profiling CleverTap services a lot easier, and of course, we are now able to do it more often.
However, it must be acknowledged that this profiling solution has some disadvantages:
- Since our services run inside Docker containers, we couldn't use the perf_event_open syscall, and we were not willing to relax Docker's security restrictions just for profiling. However, async-profiler's itimer event turned out to be good enough for us.
- This solution made it easy to profile a single instance of a service, but there was no aggregation of profiling output files across multiple instances. This meant that producing an overall picture was not possible. However, the building blocks were in place, and in the future such an aggregator can be implemented if we find there is such a need.
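One building block that would make such an aggregator straightforward is a predictable S3 key layout for the uploaded files. The scheme below is purely illustrative (our actual layout may differ): grouping keys by service and timestamp lets an aggregator list all instance profiles for one service and time window with a single prefix query.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative S3 key scheme for uploaded JFR files. A future aggregator
// could list the "profiles/<service>/" prefix to collect every instance's
// profile for a given window.
public class JfrKeyScheme {

    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'").withZone(ZoneOffset.UTC);

    public static String key(String service, String instanceId, Instant capturedAt) {
        return "profiles/" + service + "/" + TS.format(capturedAt) + "-" + instanceId + ".jfr";
    }

    public static void main(String[] args) {
        System.out.println(key("message-delivery", "i-0abc123", Instant.parse("2022-05-01T10:15:30Z")));
        // profiles/message-delivery/20220501T101530Z-i-0abc123.jfr
    }
}
```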
Overall, a very simple and generic approach has enabled us to implement and embed a fundamental profiling utility within our services. Some of our services have extremely tight latency requirements, and performance is a key indicator for them. Over the last year, this simple tool has helped us identify hundreds of bottlenecks in our message delivery and data ingestion pipelines, which has helped us create more value for our customers.