Demystifying G1 GC, GCLocker, JNI Critical, and Fake OOM Exceptions

This post is co-authored together with Hardik Shah and Avneesh Sharma.

TesseractDB^TM runs on the JVM, which allows us to iterate fast. Since it’s our bread and butter, we’ve been extremely careful when it comes to allocating memory during the lifetime of a query. Memory is allocated frugally, and even simple things like ThreadLocals are avoided (not that using a ThreadLocal allocates memory, but it’s detrimental to performance).

In order to accelerate certain types of queries, we pushed our allocation rates to as high as 1.6 GB/s within a test environment. It consists of byte arrays used as buffers for streaming data from multiple sources, which makes data readily available for the querying engine. As a side effect, some JVMs would terminate with the following written to stderr:

[78541.871s][warning][gc,alloc] Thread-231880: Retried waiting for GCLocker too often allocating 65538 words
Terminating due to java.lang.OutOfMemoryError: Java heap space

They’d terminate since we run our JVMs with -XX:+ExitOnOutOfMemoryError. This is mostly a precautionary measure to avoid running application logic in a degraded state (running systems in a degraded mode will eventually lead to a catastrophe).

Initial Analysis

In order to understand what’s in the heap, we ran a heap dump through Eclipse MAT. It verified our theory that there was a lot of garbage (only about 112 GB live out of a total heap of 181 GB). Next up, we looked at the garbage collector’s logs (which in our case is G1 GC):

[2022-08-01T00:36:02.153+0530][219327984ms][info][gc] GC(19094) Pause Young (Normal) (G1 Preventive Collection) 185229M->185201M(185696M) 7.491ms
[2022-08-01T00:36:02.163+0530][219327994ms][info][gc] GC(19095) Pause Young (Normal) (G1 Preventive Collection) 185233M->185210M(185696M) 7.572ms
[2022-08-01T00:36:02.175+0530][219328006ms][info][gc] GC(19096) Pause Young (Normal) (G1 Preventive Collection) 185242M->185215M(185696M) 7.607ms
[2022-08-01T00:36:02.185+0530][219328015ms][info][gc] GC(19097) Pause Young (Normal) (G1 Preventive Collection) 185247M->185216M(185696M) 7.304ms
[2022-08-01T00:36:02.195+0530][219328026ms][info][gc] GC(19098) Pause Young (Normal) (G1 Preventive Collection) 185280M->185218M(185696M) 7.009ms
[2022-08-01T00:36:02.204+0530][219328035ms][info][gc] GC(19099) Pause Young (Normal) (G1 Preventive Collection) 185282M->185220M(185696M) 6.683ms
[2022-08-01T00:36:02.213+0530][219328044ms][info][gc] GC(19100) Pause Young (Normal) (G1 Preventive Collection) 185284M->185231M(185696M) 6.757ms
[2022-08-01T00:36:02.223+0530][219328054ms][info][gc] GC(19101) Pause Young (Normal) (G1 Preventive Collection) 185295M->185243M(185696M) 6.635ms
[2022-08-01T00:36:02.240+0530][219328071ms][info][gc] GC(19102) Pause Young (Normal) (G1 Preventive Collection) 185307M->185241M(185696M) 6.634ms
[2022-08-01T00:36:02.255+0530][219328086ms][info][gc] GC(19103) Pause Young (Normal) (G1 Preventive Collection) 185305M->185243M(185696M) 6.119ms
[2022-08-01T00:36:02.270+0530][219328101ms][info][gc] GC(19104) Pause Young (Normal) (G1 Preventive Collection) 185307M->185247M(185696M) 6.326ms
[2022-08-01T00:36:02.289+0530][219328120ms][info][gc] GC(19105) Pause Young (Normal) (GCLocker Initiated GC) 185343M->185259M(185696M) 6.903ms
[2022-08-01T00:36:02.306+0530][219328137ms][info][gc] GC(19106) Pause Young (Normal) (GCLocker Initiated GC) 185323M->185279M(185696M) 7.091ms
[2022-08-01T00:36:02.319+0530][219328150ms][info][gc] GC(19107) Pause Young (Normal) (GCLocker Initiated GC) 185311M->185284M(185696M) 7.019ms
[2022-08-01T00:36:02.334+0530][219328165ms][info][gc] GC(19108) Pause Young (Normal) (G1 Preventive Collection) 185316M->185286M(185696M) 6.885ms
[2022-08-01T00:36:02.342+0530][219328173ms][info][gc] GC(19109) Pause Young (Normal) (G1 Preventive Collection) 185318M->185292M(185696M) 6.789ms
[2022-08-01T00:36:02.351+0530][219328182ms][info][gc] GC(19110) Pause Young (Normal) (G1 Preventive Collection) 185324M->185297M(185696M) 6.503ms
[2022-08-01T00:36:02.367+0530][219328198ms][info][gc] GC(19111) Pause Young (Normal) (G1 Preventive Collection) 185361M->185304M(185696M) 7.365ms
[2022-08-01T00:36:02.385+0530][219328216ms][info][gc] GC(19112) Pause Young (Normal) (GCLocker Initiated GC) 185368M->185321M(185696M) 7.075ms
[2022-08-01T00:36:02.394+0530][219328225ms][info][gc] GC(19113) Pause Young (Normal) (G1 Preventive Collection) 185353M->185319M(185696M) 6.698ms

At that point, it would terminate, due to an OOM error being thrown in some thread. Now, the very first thing that puzzled us was: Why isn’t the stop-the-world GC cycle running (aka Pause Full)?

Seeing that only young GC cycles would run, we tried all sorts of ways to tune G1 and ensure that a full GC would kick in, because according to all documentation on G1, it is supposed to run before an OOM is thrown.

JNI Critical and GCLocker

In order to understand the role of GCLocker, one must first understand JNI Critical:

As soon as execution resumes in the VM for that particular thread, that thread will run a GC cycle if there was one requested while it was in JNI critical. This will be reflected in the GC logs as a “GCLocker Initiated GC.”

Put simply, GCLocker is a tool used to prevent the JVM from moving things around, while JNI is reading/writing to an address, as moving objects around would result in a change in address, which JNI is unaware of.

Okay great, a GC cycle does run in the end, so why is an OOM error thrown? Let’s consider the following scenario:

Surely, when the GCLocker Initiated GC cycle completed, some memory must have been freed, right? Yes, but there are other threads that requested memory too! There’s no FIFO concept when it comes to the memory freed by the GC cycle. Therefore, the same thread that initially requested memory allocation must wait again, for yet another cycle to complete.

And here’s the real kicker: The default value of GCLockerRetryAllocationCount is 2 (at least in Java 17). This would inadvertently produce a false OOM error being thrown!

Tuning G1

The default value of 2 is obviously insufficient for our use case. What happens when this is set to a much larger value, say, 100? We added the following two flags to our VM options:

-XX:+UnlockDiagnosticVMOptions
-XX:GCLockerRetryAllocationCount=100

A video because it’s the easiest way to capture the gravity of G1’s activity.

A premature OOM isn’t thrown, since many young GC cycles can’t free enough memory, and a full GC kicks in!

For the curious: doesn’t G1 split each generation into regions of fixed size? And if yes, can’t it pin just that region? Yes, but G1 hasn’t implemented region pinning yet. In fact, there’s a JDK feature request that’s been open for a while: [JDK-8276094] JEP 423: Region Pinning for G1 – Java Bug System

Conclusion

It took us a few days to truly understand this, since we only went deeper after we eliminated everything else. We hope that it helps you at some point or the other!

CleverTap Tech Blog

Demystifying G1 GC, GCLocker, JNI Critical, and Fake OOM Exceptions

Initial Analysis

JNI Critical and GCLocker

Tuning G1

Conclusion

Further Reading

Like this:

Manuel vicente

A heap dump from OutOfMemoryError contains lots of garbage: how can it happen? - JXRay.com

Marcin Erdmann

Leave a ReplyCancel reply

Demystifying G1 GC, GCLocker, JNI Critical, and Fake OOM Exceptions