Streaming Data from S3 in a Reliable Way
This article is co-authored with Hardik Shah.
When we redesigned our querying and data storage architecture in Arc, some of the data was stored in, and read directly from, Amazon S3 across millions of objects (yes, millions!). This was a consequence of building a novel columnar file storage structure inspired by Apache ORC, RCFile, and other formats.
An in-depth look at this structure will be published in a later article.
From our preliminary testing work, we learned:
Object Sizing is Important
When storing millions of objects in S3, aim for a minimum object size of around 128 MB. Across repeated benchmarks, we observed a fixed per-request latency of roughly 100 milliseconds. Storing data in smaller objects hurts performance because that fixed cost is paid on every request and quickly dominates the time spent actually transferring data.
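A back-of-envelope calculation makes the point. The sketch below assumes a fixed ~100 ms per-request latency and an illustrative link speed; the class and method names are ours, not from any SDK:

```java
// Back-of-envelope model: effective throughput of reading a single S3 object
// when every request pays a fixed latency before bytes start flowing.
// All numbers here are illustrative, not measured constants.
public class SizingMath {

    // Effective MB/s for one object of `sizeMb`, given a link speed of
    // `linkMbPerSec` and a fixed per-request latency of `latencySec`.
    public static double effectiveMbPerSec(double sizeMb, double linkMbPerSec, double latencySec) {
        double transferSec = sizeMb / linkMbPerSec;
        return sizeMb / (latencySec + transferSec);
    }

    public static void main(String[] args) {
        // A 1 MB object spends most of its wall time in latency; a 128 MB
        // object amortizes the same latency over far more bytes.
        System.out.printf("1 MB object:   %.1f MB/s%n", effectiveMbPerSec(1, 100, 0.1));
        System.out.printf("128 MB object: %.1f MB/s%n", effectiveMbPerSec(128, 100, 0.1));
    }
}
```

With these assumed numbers, a 1 MB object achieves under 10 MB/s of effective throughput while a 128 MB object gets over 90 MB/s on the same link, which is why we size objects up.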
Cache Key Listings Aggressively
Our storage structure allows us to discover data at runtime without needing a service that stores massive amounts of metadata (aka a single point of failure). To achieve this, we use the object listing API extensively. We learned that listing is actually an expensive operation for S3: a single request can take anywhere from 10 to 60 seconds. To address this, we started caching object listing responses aggressively, both on disk and in memory.
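The in-memory half of that cache can be sketched roughly as below. This is a minimal illustration, not our production code: the class name is made up, and the `Supplier` stands in for an actual `ListObjectsV2` call to S3.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch of a TTL cache for S3 listing responses, keyed by prefix.
// Illustrative only; the loader stands in for a real ListObjectsV2 call.
public class ListingCache {

    private static final class Entry {
        final List<String> keys;
        final long expiresAtMillis;
        Entry(List<String> keys, long expiresAtMillis) {
            this.keys = keys;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public ListingCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns cached keys for the prefix if fresh; otherwise invokes the
    // loader (the expensive listing call) and caches its result.
    public List<String> get(String prefix, Supplier<List<String>> loader) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(prefix);
        if (e != null && e.expiresAtMillis > now) {
            return e.keys;
        }
        List<String> keys = loader.get();
        cache.put(prefix, new Entry(keys, now + ttlMillis));
        return keys;
    }
}
```

The key design decision is the TTL: too short and you pay the 10–60 second listing cost again; too long and newly written objects stay invisible.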
Read S3 Streams Asynchronously
Our objects are read using byte-range requests and placed into a lazy, buffered SequenceInputStream. Everything worked well in isolation, until we went to production, where millions of streams are opened and closed from hundreds of clusters.
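The lazy stitching of byte ranges can be sketched as follows. This is a simplified illustration: the `rangeOpener` function stands in for an S3 `GetObject` call with a `Range` header, so the idea can be shown without the AWS SDK, and the class name is ours.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Enumeration;
import java.util.List;
import java.util.function.Function;

// Sketch: stitch byte-range reads into one buffered stream. Each range is
// a {firstByte, lastByte} pair; rangeOpener stands in for an S3 GetObject
// call with a Range header.
public class RangeStream {

    public static InputStream open(List<long[]> ranges, Function<long[], InputStream> rangeOpener) {
        Enumeration<InputStream> parts = new Enumeration<>() {
            private int next = 0;

            public boolean hasMoreElements() {
                return next < ranges.size();
            }

            public InputStream nextElement() {
                // Ranges after the first are opened only once the previous
                // one is exhausted, which keeps the stream lazy.
                return new BufferedInputStream(rangeOpener.apply(ranges.get(next++)));
            }
        };
        return new SequenceInputStream(parts);
    }
}
```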
Tip: Use a PrivateLink to S3 from your subnet to avoid skyrocketing NAT gateway costs and to optimize for network bandwidth.
On occasion, S3AbortableInputStream would throw a flurry of connection reset exceptions. Because data processing is deeply abstracted in our stack, we could apply a nifty trick: track the offset consumed so far, and when a stream breaks, resume from exactly that position with a fresh byte-range request.
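The resume trick can be sketched like this. It is a simplified stand-alone version: the `openerFromOffset` function represents a call that reopens the object from a given offset (with S3, a `Range: bytes=<offset>-` request), and the class name is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.LongFunction;

// Sketch of resume-on-reset: count the bytes consumed so far and, if the
// underlying stream throws, reopen it from that offset and carry on.
// openerFromOffset stands in for an S3 GetObject call with a Range header.
public class ResumableInputStream extends InputStream {

    private final LongFunction<InputStream> openerFromOffset;
    private InputStream current;
    private long offset = 0;

    public ResumableInputStream(LongFunction<InputStream> openerFromOffset) {
        this.openerFromOffset = openerFromOffset;
        this.current = openerFromOffset.apply(0);
    }

    @Override
    public int read() throws IOException {
        try {
            int b = current.read();
            if (b >= 0) offset++;
            return b;
        } catch (IOException e) {
            // Connection reset: reopen from where we left off, retry once.
            try { current.close(); } catch (IOException ignored) { }
            current = openerFromOffset.apply(offset);
            int b = current.read();
            if (b >= 0) offset++;
            return b;
        }
    }

    @Override
    public void close() throws IOException {
        current.close();
    }
}
```

A production version would retry with backoff and cap the number of reopens, but the essence is the same: the consumer never sees the break, because the wrapper knows exactly which byte comes next.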
These were some of the key takeaways from transitioning part of our workload to Amazon S3. The other tricks we used to tune the Amazon S3 client are part of the standard S3 tuning documentation provided by AWS, so we won't cover those here.