In collaboration with Carnegie Mellon University, Twitter is building the next generation of storage backend, Segcache, into Pelikan. Segcache enables high memory efficiency, high throughput, and excellent scalability for Twitter’s cache workloads.
This design provides the biggest benefit for workloads that access predominantly small objects and use TTL (time-to-live) 1. These workloads, which represent most of what Twitter has seen in production, have their memory footprint reduced by as much as 60%. We achieve this while maintaining comparable throughput to Twitter’s existing production solution. Segcache also offers much better (write) scalability compared to Memcached in our storage-only benchmark.
The work was first published as a conference paper at NSDI’21, titled “Pelikan Segcache: a memory-efficient and scalable in-memory key-value cache for small objects”. It received NSDI Community Award, and the code used in the paper is merged into Pelikan codebase as an experimental server as of April 2021.
In-memory caches are widely adopted by modern web applications to provide low-latency, high-throughput access to key-value data. In a previous post, we have described the base requirements of caching in datacenters. The architecture of Pelikan makes it easy to evolve individual components, such as plugging in a new storage design. This serves as a convenient backdrop to design, test, and integrate Segcache, an idea born out of collaboration between Carnegie Mellon University and Twitter.
Prevalence and Impact of TTLs
Time-to-live (TTL) is typically associated with a key at write time, and determines the maximum lifespan of the key through expiration. Many Web services use TTLs ranging from a few seconds to a few weeks. The use of TTL is prevalent among datacenter caches for a number of reasons.
First, TTL is used to bound staleness for keys that are read and cached from canonical data store. Second, TTL is used for periodical re-computation of expensive composite objects, such as a rendered view or a scored index. Third, TTL is used for implicit deletion of keys with a specified lifespan, such as API rate limiters and data that fall under GDPR requirements.
Expired objects are unusable, and therefore offer no value. In contrast, evictions remove objects that could be useful in the future. However, expired objects often cannot be removed from the cache promptly, and therefore wasting precious memory. In our previous work, we have demonstrated that for some workloads, if all the expired objects can be removed timely, a lot of evictions can be avoided.
Maintainers of popular caching systems such as Memcached and Redis recognize the importance of TTL all along. We summarize existing techniques used in Memcached, Redis, and some research systems in the following table.
While caches have to respect TTL, existing solutions often do not optimize for expiration. They either incur high computation overhead performing expiration, using a lot of CPU and memory bandwidth, or fail to remove most expired objects in a timely fashion. The detailed discussion can be found in Section 2.1 of our paper.
Small objects, Large Metadata Overhead
It has been shown at Twitter, Facebook, and Reddit that most objects stored in in-memory caches are small. Among Twitter’s top 100+ Twemcache clusters, the mean object size has a median value less than 300 bytes, and the largest cache has a median object size around 200 bytes. In contrast to these small objects, most existing solutions have relatively large metadata per object. Memcached has 56 bytes of metadata per key, Redis is similar, and Pelikan’s slab storage uses 39 bytes2. This means more than one third of the memory goes to metadata for a cache where the average object size is 100 bytes.
Most caching systems introduce some form of memory fragmentation. Redis, for example, delegates memory management to external libraries such as jemalloc to simplify service design. As a result, it incurs external memory fragmentation. Both Twitter’s production deployment and a previous publication have shown that this can cause more than 2x memory usage as intended. Moreover, when using maximum amount of memory configured, (re-)allocations could become complicated and cause unpredictably high tail latency.
Slab-based storage such as the one used in Memcached takes key-value memory management in-house, resulting in much better and stabler performance. However, the slab allocator in turn suffers from internal memory fragmentation and potential slab calcification problem.
Segcache Design Overview
We addressed the problems above with a new storage backend called Segcache, which is short for segment-structured cache. It is a log-structured cache with additional features and constraints.
Segcache prioritizes expiration over eviction by making sure expired objects are removed efficiently. It further offers the following features to achieve high memory efficiency and high throughput:
- Proactive expiration: expired objects can be removed from cache within one second.
- Minimal object metadata: only 5 bytes per object3.
- Almost no memory fragmentation is introduced by variable object size.
- A merge-based eviction algorithm to preserve more frequently accessed objects.
- Similar per-thread throughput as Pelikan’s slab storage, and up to 40% higher than Memcached for Twitter’s workloads.
- Near-linear scalability: we tested up to 24 threads, and achieved 8x higher throughput than Memcached 4.
Segcache has three main components.
The first component is called object store, which is the space allocated for storing key-values. Segcache divides this space into segments. Each segment is a small log storing objects of similar TTLs. There are some similarities between slab and segment. However, slabs store objects of similar sizes, whereas segments store objects of variable sizes but similar TTLs.
The second component is the hash table for fast lookups. Unlike the object chaining hash table used in most caching systems, Segcache uses a bulk-chaining hash table to allow higher occupancy per bucket. Each hash bucket has eight slots. The first one is used to store bucket metadata. The next seven slots store object information, but using offsets into the object store instead of pointers. The values here are chosen to take advantage of the typical size of one CPU cacheline, so that scanning a hash bucket is very fast.
The third component in Segcache is an index into object store, called TTL buckets. Segcache breaks all possible TTL values into ranges, and each TTL bucket is responsible for one range. The range increases with the absolute values of the TTL to keep the number of ranges manageable. Segments of the same TTL range are linked into a chain headed by a TTL bucket.
Three design principles guided the design of Segcache.
In society, sharing improves resource utilization in general. Its manifestation in computer systems includes multi-tenant hardware and serverless computing. Segcache applies a similar concept to object metadata storage. It primarily maximizes metadata sharing between objects in the same segment, and secondarily among objects in the same hash bucket.
First, objects in the same segment share similar creation time, approximate TTL, and therefore expiration time. We further decided they should share a reference counter and next segment pointer. These metadata are stored in the segment header. The metadata cost is typically amortized over 1000s to 10,000s of objects. Second, objects in the same hash bucket share the same approximate last-access timestamp, CAS (compare-and-set) value, and bucket-level spinlock.
Notice that during this sharing, some metadata values become approximate, such as various timestamps. They can cause expiration to be earlier than intended. That said, cache has the unique advantage that the presence of a key at any moment is not a given. Therefore, it is a legitimate tradeoff to slightly shift the expiration time in exchange for highly compressed metadata. In our research, we also observed that objects near the end of their stated TTL have very few accesses to begin with, further limiting the downside of a small early shift in expiration.
Be proactive, do not be lazy
As we mentioned earlier, efficiently and timely removal of expired objects is critical in TTL-based caching workloads. The design of Segcache ensures the efficient removal of expired objects.
Segcache ensures segments in each TTL bucket are sorted by creation, and therefore, expiration time. This property allows Segcache to quickly identify expired segments simply by looking at the first segment of each TTL bucket. If that segment is expired, all objects in it are removed, and the segment recycled. This process continues until we run into the first segment that has not expired, at which point we move to the next TTL bucket. In doing so, we never need to scan segments that have not expired, minimizing wasted computation on valid objects.
Many caching systems spend considerable amount of CPU cycles on maintaining object indexes for eviction and other cache operations. Such activity can limit the throughput and scalability of the storage design. The figure below shows the overhead comparison of Memcached’s object tracking and Segcache’s segment tracking.
Memcached manages objects through several queues, such as object LRU queue and
free chunk queue. Most operations require touching at least one queue. For
get request moves an object to the head of the LRU queue 5, and
delete request requires moving the freed object space to the free chunk
queue. In contrast, Segcache maintains segment queues instead of object queues.
On average, operating on segments requires far less bookkeeping per request. The
necessary bookkeeping, such as during eviction, is performed as a batch
operation over a contiguous block of memory, which significantly improves
Segcache reduces the locking frequency by several orders of magnitude, which
becomes significant for scalability with multiple threads. The number of
locking operations in a systems that performs object-level bookkeeping is
proportional to the the number of (write) requests, while the number of locking
operations in Segcache is proportional to the frequency a segment is created or
moved. For example, if each segment stores 10,000 objects and the write ratio is
10%, Segcache only locks when a segment becomes full, roughly once every
10,000 / 10% = 100,000 requests, this is a 10,000 times reduction compared to
a design that locks at least once for every request.
Throughput and scalability are both important for in-memory caches to fully utilize the power of modern hardware. As the number of objects stored in a single instance of cache continues to grow, we believe that macro-management will become increasingly necessary.
We evaluated Segcache using production traces from Twitter, and we compare Segcache with two production systems — Pelikan with slab-storage, and Memcached, as well as two research systems — Hyperbolic, and LHD. Because the two research systems do not consider TTLs in their design, and use random sampling for eviction, we added random sampling to expire objects in these two systems. In the following sections, we compare the memory efficiency, single-thread throughput, and throughput scalability with multiple threads.
This is one of the most important aspects of caches. Instead of showing miss ratio curves (miss ratio vs cache size), we show the relative cache size to achieve the miss ratios observed in production. This is more intuitive as it shows how much DRAM we should provision for each design to match the current production miss ratio.
Compared to Pelikan-slab (the bar labeled with Production), Segcache can reduce the memory requirement by 40%-90%. While comparing to state-of-the-art, Segcache reduces memory requirement by 22%-60%. This indicates that by switching to the new backend, we can save a significant amount of DRAM for memory-bound workloads.
Throughput and Scalability
Besides memory efficiency, throughput and thread scalability are also important. The figure below shows that Segcache achieves similar single-thread throughput as Pelikan-slab, both of which are significantly higher than Memcached and the two research systems.
In terms of scalability, we observe that Memcached can scale up to 8 threads. After that, we could not achieve higher throughput by adding more cores. Meanwhile, Segcache can scale almost linearly to 24 threads in our test. At 24 threads, it achieves 8x higher throughput than Memcached with 8 threads.
We do not consider Segcache as perfect for everything. For example, workloads that access very large objects, or do not rely on TTLs at all will see limited benefit in adopting Segcache’s design. Segcache also makes it harder to reuse the memory allocated to a key that has since been replaced, which makes it less optimal for workloads with very high overwrite rate. However, for that latter scenario, we are working on an improved design to recycle deleted objects more quickly.
To summarize, we designed a new storage backend for Pelikan called Segcache. Segcahe groups objects of similar TTLs into segments, and provides efficient and proactive TTL expiration, tiny object metadata, and almost no memory fragmentation. As a result of this design, we show that Segcache can significantly reduce the memory footprint required to serve Twitter’s production workloads. Besides, it allows Pelikan to better utilize the many cores offered by modern CPUs.
We are still tweaking the Segcache design and preparing it for production adoption. As we near production-readiness, we will talk about the system-wide optimization we performed to make Pelikan-segcache work well as a full-fledged cache backend.
Workloads with large objects or do not use TTL can also benefit from Segcache, but the benefit is smaller. ↩
Pelikan provides cuckoo storage, which uses 5 bytes per object metadata. However, it only works for workloads in which all objects are of the same (or similar) size. ↩
When we compare object metadata, we do not consider the memory usage of the hash table in any system, because hash table load is often configurable. ↩
When we measure throughput, we remove the networking stack and focus only on the storage stack. A cache deployed as a distributed cluster in production typically sees the kernel networking stack as the throughput bottleneck. However, if the storage module is used locally, or when we switch to a faster networking stack such as DPDK, the throughput and scalability of hte storage module will become important. ↩
Newer versions of Memcached avoid popping a popular object each time it is read. However, the chain operation is still of complexity O(Nrequest). ↩