Using dd, iostat, iotop, and blktrace to Discover a Bug in the Linux Kernel
As part of a recent redesign, we gradually transitioned our in-house analytics database to run on Graviton. This has been a six-month-long journey, since quite a bit of our database engine was rewritten to keep up with the ever-growing volume of data we store. During our initial performance tests, everything seemed to be in order. But alas! Going to production brought its own set of problems.
After the first few weeks of transitioning various clusters, we observed our very first roadblock: excruciatingly poor disk IO performance. The slowdowns seemed to appear at random, one server instance at a time (several instances make up a cluster). To put this in context, a snapshot of part of the database is written to disk periodically. These snapshots range anywhere from 10 GB to 30 GB.
The First Symptoms
All our HTTP threads were waiting on a lock. Who held it? Taking multiple stack traces with jstack confirmed that it was the snapshot activity that was bringing the entire database to a grinding halt. Was it something we had overlooked? Was it a concurrency problem? At the first sight of this, our entire team dove into the problem together. We re-reviewed the code path, and everything seemed to be in order.
To add to the confusion, jstack didn’t provide us with complete stack traces, only partial ones for each thread. We knew this for sure, since the thread writing the snapshot clearly didn’t have the right start frame; it was definitely missing a few frames.
What now? Armed with this information, we took a heap dump and started analysing it. The heap dump revealed the complete stack traces, and we then identified that the thread holding the lock was actually writing to disk. This was expected, but why was it taking so long? Was there something in our new code that suddenly caused more data to be written to disk? Was there a bug in the JVM running on ARM?
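For anyone who wants to do the same, the capture itself is nothing exotic; it boils down to something like the following (the pid and file paths are placeholders, and the heap dump can be analysed offline in a tool such as Eclipse MAT):
# jstack -l <java-pid> > /tmp/jstack-$(date +%s).txt        # take several of these a few seconds apart, to see which lock is held and by whom
# jmap -dump:live,format=b,file=/tmp/heap.hprof <java-pid>  # full heap dump for offline analysis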
Isolating the Problem
Our first step in debugging this IO problem was to pinpoint what was causing it. We quickly used dd to write to the same EBS volume, to see if it was performing as expected. To our surprise, we found that throughput had degraded severely!
# dd if=/dev/zero of=/path/to/a/file/on/our/ebs/volume bs=512
^C264+0 records in
264+0 records out
135168 bytes (135 kB) copied, 3.35126 s, 40.3 kB/s
Whoa! We’ve seen disk degradation problems before, but nothing like this. This was the first step in isolating the problem to the system, rather than the JVM.
Okay, let’s bring in AWS support to see what’s happening here. They reported back that the EBS volume was alright, and that the underlying hardware was also healthy. Nothing from their side.
Now, this snapshot activity has been running for years, and has never posed a problem. We had no reason to doubt the application. But how do we reproduce this problem?
Reproducing the Problem
The first step to solving any problem is to reproduce it. Given that the disk throughput deteriorates over time (after roughly 4 days), we had to find a way to simulate this.
dd to the rescue! We found a reliable way to reproduce the problem by doing the following:
- Use tmpfs to simulate 95% of RAM usage
- Use dd to write to an attached EBS volume
# mkdir /ram
# mount -t tmpfs -o size=121268M none /ram/
# dd if=/dev/zero of=/ram/a_really_large_file bs=4M
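A quick sanity check that the tmpfs file really is resident in memory (and pinning roughly 95% of RAM) never hurts; the exact numbers depend on the instance size:
# df -h /ram      # how full the tmpfs mount is
# free -g         # tmpfs pages show up under the "shared" column, and "available" memory drops accordingly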
Why 95%? On our production systems, we allocate 95% of the total system memory to the JVM, using the -Xmx option. The next step was to write a 30 GB file sequentially and repeatedly, to simulate our database taking a snapshot:
# while true; do date; dd if=/dev/zero of=/path/to/a/file/on/our/ebs/volume bs=512 count=20000000; done &> dd.log & disown
After roughly 4 hours (we actually left it running overnight), dd slowed to a crawl. Success! We were now able to reproduce the problem.
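Since the loop above redirects everything into dd.log, each iteration leaves behind dd’s usual summary line, so spotting the degradation is just a matter of eyeballing the reported throughput over time:
# grep copied dd.log | tail      # dd's "bytes ... copied" summaries; watch the throughput figure fall off a cliff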
Migrating Back to r5.xxx
At this time, since our production workload was severely affected, we ran the same benchmark on an equivalent instance, with the only difference being the instance type: r6g.4xlarge (Graviton2) vs r5.4xlarge (Intel).
Four hours… six hours… and one night later, dd was still able to write consistently. For the time being, we decided to move all our clusters to r5.xxx. This bought us valuable time to experiment on the side, since we had a reliable way to reproduce the issue. Phew!
Is It Graviton2?
We continued our chat with AWS support. They reported that the only possible lead they had was periods of “microbursting” activity on the root volume. Now, although our database’s storage is on a different EBS volume, it does write its logs to the root volume. Yet this couldn’t fully explain why our Java thread couldn’t write to the storage volume. We chased this lead down, as we didn’t have anything else to go on.
Graphing IOPS from iostat
We ran iostat alongside the dd command, and graphed the IOPS it reported on the root volume:
Umm, wait, what!? There were bursts of excessive IO. (Note: to rule out root-volume microbursting, AWS support recommended that we create a 1 TB root volume and try to reproduce the problem on it.) There was clearly something happening here. Who was consuming IOPS on the root volume?
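We haven’t reproduced the exact invocation here, but sampling extended device statistics once a second is all that is needed to produce such a graph; the device name below is an assumption about the root volume on these instances:
# iostat -dxm nvme0n1 1      # the r/s and w/s columns are the read/write IOPS we plotted; %util shows how busy the device is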
Was It cadvisor?
We decided to use blktrace to record disk activity, and then blkparse to interpret its output.
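We haven’t kept the exact command line, but a typical capture-and-decode pass looks roughly like this (the device name is, again, an assumption about the root volume):
# blktrace -d /dev/nvme0n1 -o trace -w 60      # record 60 seconds of block-layer events for the root volume
# blkparse -i trace | less                     # decode the trace into a readable log; each event line includes the pid and process name that issued the IO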
The parsed trace reported that cadvisor was doing considerable disk activity. A quick Google search sent us scurrying to this bug. Had it resurfaced? Let’s take cadvisor out of the equation:
That’s quite the difference! There were no bursts of disk IOPS this time! Is this it? Have we found the holy grail? We didn’t know what to compare this to. So, to find a base reference, we stopped everything else too, including docker:
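If, as in many setups, cadvisor runs as a Docker container and Docker itself is managed by systemd, taking them out of the picture is roughly:
# docker stop cadvisor      # take cadvisor out of the equation (assumes a container named "cadvisor")
# systemctl stop docker     # for the baseline run, stop Docker, and everything running on it, as well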
Armed with this data, we thought that cadvisor was the culprit. We shut it down temporarily and ran the dd workload overnight on a fresh instance. Unfortunately, it wasn’t quite that simple: dd slowed down after the usual 4-hour mark.
Linux Kernel Versions
We realised that there was a difference in the kernel versions between the two instance types. This appeared to be the culprit, and as soon as we matched the kernel version to that of the r5.xxx instances, everything returned to normal!
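For what it’s worth, the comparison itself is a one-liner on each instance; what took time was realising it mattered:
# uname -r      # run on both a healthy r5 instance and an affected r6g instance, and compare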
So it wasn’t cadvisor in the end. However, along the way, we did learn a thing or two about debugging memory.
We also did a ton of analysis on files in the /proc filesystem, and issued a number of other commands; however, none of it revealed anything valuable, so it has been omitted from this article.