ARC and L2ARC Design
A data buffer in ARC essentially consists of two portions: its header (struct arc_buf_hdr) and its data portion (struct arc_buf). The header records general information about where on the pool the data in the ARC buffer belongs, when it was created, and a number of other items. The data portion simply holds the data block and a few utility functions for dealing with the data. ARC buffers can exist without their data portion, but never without a header. The things placed on the ARC lists (MRU, MFU, ghost MRU and ghost MFU) are buffer headers, which may or may not have their data portion attached (if on MRU/MFU, data is present; if on the ghost lists, it is not). An on-pool data block can thus exist in one of three general states:
- Cached in ARC: on one of the MRU or MFU lists together with its data.
- Recently cached in ARC: on one of the ghost lists, without its data.
- Evicted: not in ARC at all.
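These three states can be sketched as a minimal state model. The names here are illustrative only; the real ARC keeps separate MRU/MFU and ghost lists rather than a single enum:

```python
from enum import Enum

class BlockState(Enum):
    """Illustrative model of the three states; not the actual ZFS identifiers."""
    CACHED = 1    # header + data, on an MRU/MFU list
    GHOST = 2     # header only, on a ghost MRU/MFU list
    EVICTED = 3   # not tracked by ARC at all

def evict(state):
    # Evicting a cached buffer drops its data but keeps the header (ghost);
    # evicting a ghost header removes the block from ARC entirely.
    return BlockState.GHOST if state is BlockState.CACHED else BlockState.EVICTED
```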
The L2ARC was designed to extend this situation in a very simple manner:
- A thread, called the l2arc_feed_thread, periodically scans the tails of the ARC MRU and MFU lists for buffers that are about to "fall off" the end of each respective list into the ghost cache.
- After performing a quick eligibility check, the buffer's data portion is written to L2ARC, and its header is marked to note that it is now also present on L2ARC (by attaching a special l2arc_buf_hdr sub-structure to it).
- If during regular operation the ARC eviction process wants to remove this buffer, it detects that the buffer is L2ARC-cached, and so rather than moving it to its ghost list, it moves it to a new state, called l2c_only (level-2-cache only).
- If an L2ARC-cached buffer is re-requested from the ARC after it has been evicted, rather than going to the on-pool devices, the L2ARC issues a read to the respective L2ARC device to retrieve the buffer's data.
- If for whatever reason this fails (e.g. the L2ARC device died, or the retrieved data is corrupt), the read is re-issued to the main pool.
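The read path described in the last two points can be sketched as follows; `l2arc_read`, `pool_read` and `checksum_ok` are hypothetical stand-ins for the real ZIO machinery:

```python
def read_block(hdr, l2arc_read, pool_read, checksum_ok):
    """Try the L2ARC device first; on any failure, fall back to the main pool."""
    if hdr.get("l2_daddr") is not None:          # buffer is L2ARC-cached
        try:
            data = l2arc_read(hdr["l2_daddr"], hdr["l2_asize"])
            if checksum_ok(data):
                return data                      # L2ARC hit
        except IOError:
            pass                                 # device error: fall through
    return pool_read(hdr["dva"])                 # authoritative on-pool copy
```

Note that the on-pool copy is always authoritative: the L2ARC is purely a cache, so any L2ARC-side failure is recoverable by re-reading from the pool.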
l2arc_feed_thread writes buffers to devices in a simple rotor fashion, looping around to the start of the device once it has been filled. Sooner or later, it starts overwriting previously written buffers - this is resolved by evicting those buffers from the L2ARC device (essentially just noting in the arc_buf_hdr structure that the buffer is no longer on the L2ARC device). If this kicks a buffer off the ARC completely (i.e. it was only in L2ARC and nowhere else), then it probably wasn't needed in the first place (remember, an L2ARC cache hit places the buffer back on the MRU/MFU ARC lists).
It is important to note that there is no direct eviction path for a buffer from ARC to L2ARC - if this were the case, during extreme memory pressure (when the ARC needs to evict buffers), we might be stalled waiting for L2ARC devices to finish writing buffers to L2ARC. By doing ARC eviction and L2ARC feeding in separate threads, it is guaranteed that a slow or failing L2ARC device can't bring the system down (at worst the feed thread will miss some eligible buffers for writing).
Extensions To L2ARC To Implement Persistency
The notable problem with the design above is that l2arc_feed_thread never writes anything besides the ARC buffer data, and does so in an unstructured way. Therefore, it is not possible to examine an L2ARC device and determine what blocks it contains. To resolve this issue, we've modified the L2ARC feed thread to:
- Periodically append some metadata about the buffers written in the last few feed cycles. This takes the form of short metadata blocks called "pbufs" (persistency buffers). Each pbuf contains all the metadata needed to reconstruct the ARC buffers written before it in L2ARC, plus a pointer to the previously written pbuf, forming a linked list of pbufs.
- At the start of the device we reserve a small region to hold a set of uberblocks. We update one of these uberblocks each time we write a new pbuf to point to the most recently written entry.
The following image demonstrates how we organize these two on-device data structures:
On-device Metadata Structures
Uberblock Region and Uberblocks
We reserve a 1 MiB long block at the start of each L2ARC device (immediately following the vdev label offset VDEV_LABEL_START_SIZE) which we call the uberblock region. We subdivide this uberblock region into 256 separate 4 KiB long uberblocks. During normal operation, these uberblocks are updated in a rotor fashion after each pbuf commit. This is not meant to provide error resilience as in the case of main pool uberblocks (in case of an L2ARC read error we simply drop the L2ARC metadata), but rather to allow for some flash cell wear leveling on SSDs which don't support it internally. Given that current cheap MLC flash can sustain around 2000 erase-program cycles per block, this gives the region an endurance of roughly 512,000 uberblock updates. Even on fairly busy pools we expect uberblocks to be updated only roughly every 10 seconds, so it would take approximately 2 months to burn through all flash cycles on these kinds of devices (and one can always tune the L2ARC to emit pbufs and uberblock updates less frequently, extending this lifetime much further).
On-device an L2ARC uberblock has the following structure:
magic:This is the magic number by which we verify whether an L2ARC device even contains uberblocks.
version:The version of this uberblock structure. Currently always 0x1.
ublk_flags:A 16-bit set of flags identifying various special options about this uberblock:
| Bit | Field name | Description |
|---|---|---|
| 0 | Byte order | 1 = remainder of fields is MSB first (big endian), 0 = LSB first (little endian) |
| 1 | Evict first | 1 = indicates that L2ARC is doing its first sweep through the device |
| 2-15 | Reserved | Set to 0 |
spa_guid:GUID of the main pool SPA to which this L2ARC device belongs (used to protect against rebuilding buffers from a different pool).
birth:A sequential birth counter, starting at 0. Each uberblock update to this L2ARC device increments this counter. This is then used to locate the newest uberblock.
evict_tail:The last device address that L2ARC buffer eviction reached.
alloc_space:How much space is allocated on the L2ARC device.
pbuf_daddr:Device address (byte offset) of the newest pbuf.
pbuf_asize:Actual size of newest pbuf.
pbuf_cksum:Fletcher4 checksum of newest pbuf (i.e. L2ARC device contents starting at pbuf_daddr up to pbuf_daddr + pbuf_asize).
ublk_cksum:Fletcher4 checksum of all 4064 bytes in this uberblock up to this field.
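Locating the active uberblock at rebuild time amounts to scanning all 256 slots and keeping the valid one with the highest birth counter. A simplified sketch, modeling each slot as a dict and replacing the real magic/checksum/GUID verification with precomputed booleans:

```python
def pick_newest(slots, pool_guid):
    """Return the valid uberblock with the highest `birth`, or None."""
    best = None
    for ub in slots:
        if not (ub.get("magic_ok") and ub.get("cksum_ok")):
            continue                      # slot never written, or corrupt
        if ub.get("spa_guid") != pool_guid:
            continue                      # device belonged to another pool
        if best is None or ub["birth"] > best["birth"]:
            best = ub                     # birth increments on every update
    return best
```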
Persistency Buffer (pbuf)
Pbufs are on-disk data structures that record the arc_buf_hdr metadata for a number of ARC buffers in a portable, persistent manner. The amount of ARC buffer data referenced by a single pbuf depends on system behavior, but is generally around 100 MiB, placing consecutive pbufs roughly 100 MiB apart on the device. It is thus possible to estimate the number of pbufs on a given L2ARC device by simply dividing the device's capacity in MiB by 100. This is also approximately the number of IO operations it will take to read all pbuf data back at L2ARC rebuild time.
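The estimate can be expressed directly (a sketch; the 100 MiB spacing is the default pbuf commit bound):

```python
def estimate_pbufs(capacity_mib, spacing_mib=100):
    """Approximate pbuf count (and rebuild read count) for an L2ARC device."""
    return capacity_mib // spacing_mib

print(estimate_pbufs(55 * 1024))   # a 55 GiB device holds roughly 563 pbufs
```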
On-device a pbuf has the following structure:
magic:A magic value which we check to see if the device contains a pbuf at a specific device address.
version:Version of the pbuf structure. Currently always 0x1.
pbuf_flags:A 16-bit set of flags identifying various special options about this pbuf:
| Bit | Field name | Description |
|---|---|---|
| 0 | Byte order | 1 = remainder of fields is MSB first (big endian), 0 = LSB first (little endian) |
| 1 | Compressed | 1 = the `payload' field is LZ4-compressed, 0 = payload is uncompressed. Decompression of the payload must be done via the lz4_decompress ZFS function. |
| 2-15 | Reserved | Set to 0 |
prev_pbuf_daddr:Device address of previous pbuf in pbuf linked list.
prev_pbuf_asize:Actual size of previous pbuf.
prev_pbuf_cksum:Fletcher4 checksum of previous pbuf.
payload_size:Number of uncompressed bytes in `items' payload. If no compression was specified by the pbuf flags, this field's value is exactly equal to the number of bytes remaining in this pbuf.
dva:DVA of the respective ARC buffer.
birth_txg:Birth TXG number of the ARC buffer.
cksum0:First 64 bits of the Fletcher2 checksum of the ARC buffer contents.
freeze_cksum:Complete ARC buffer freeze checksum.
size:Uncompressed size of the ARC buffer contents.
l2_daddr:Device address on the L2ARC device which contains the ARC buffer's data.
l2_asize:Actual on-device size of the ARC buffer data. If the buffer is compressed, this value will be smaller than `size'. If the l2_compress field below indicates that the buffer compressed away entirely (as happens for all-zero blocks), l2_daddr and l2_asize are undefined, as the ARC buffer has been eliminated by compression and isn't present on L2ARC at all.
l2_compress:Compression algorithm applied to the ARC buffer contents.
contents_type:ARC buffer contents type (data or metadata).
flags:ARC buffer flags.
As noted above, making L2ARC data persistent across reboots requires the introduction of some metadata structures, which then have to be re-read at L2ARC device attach time to restore the ARC headers. This necessarily means that we need to consider two areas of performance: how much metadata there is, and how long it takes to restore it to memory when a rebuild is issued.
Overhead of Metadata
Persistent metadata comes in two types: uberblocks and pbufs. Given that uberblocks and the uberblock region are of a fixed known size (1 MiB total), we must focus on the amount of space pbufs consume on L2ARC storage. Pbufs have been designed to be very efficient in terms of space usage, mainly by recording the minimum set of fields from an ARC buffer header needed to reconstruct it, and by allowing the pbuf payload to be compressed.
A pbuf consists of exactly 56 bytes of pbuf header information plus 88 bytes per ARC buffer header referenced. Given that the smallest possible pbuf contains 1 ARC buffer header describing a 512-byte block, this gives us a worst achievable metadata overhead of 28.125%. However, this is the worst possible configuration, and the default configuration of L2ARC persistency doesn't even allow it. The worst possible result using default settings can be achieved by setting the recordsize property on the main pool to its lowest possible value (512). Then, assuming the L2ARC writes only a single ARC block per feed cycle (something that would require great care to achieve at all), we emit pbufs with a minimum of 128 ARC buffer headers. Assuming a 20% reduction in the volume of the pbuf via compression, this gives a theoretical maximum overhead of 13.818%.
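These overhead figures follow directly from the 56-byte header and 88-byte per-entry sizes; a quick check:

```python
PBUF_HDR = 56     # fixed pbuf header bytes
PER_ENTRY = 88    # metadata bytes per referenced ARC buffer header

def overhead(n_bufs, block_size, compress_factor=1.0):
    """Ratio of (possibly compressed) pbuf size to the data it references."""
    meta = (PBUF_HDR + n_bufs * PER_ENTRY) * compress_factor
    return meta / (n_bufs * block_size)

print(overhead(1, 512))          # 0.28125: the 28.125% worst case
print(overhead(128, 512, 0.8))   # ~0.13818: the 13.818% worst case w/ defaults
```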
The following table lists the typically expected metadata overhead ratios for various average block sizes cached on the L2ARC device:

| Average Block Size | Data Referenced By Pbuf | Pbuf Raw Size | Est. Pbuf Compression Ratio | Pbuf Actual Size | Overhead |
|---|---|---|---|---|---|
| 2k | 100 MiB | 4400 KiB | 45% | 2640 KiB | 2.36% |
| 4k | 100 MiB | 2200 KiB | 40% | 1320 KiB | 1.18% |
| 8k | 100 MiB | 1100 KiB | 35% | 715 KiB | 0.7% |
| 16k | 100 MiB | 550 KiB | 30% | 385 KiB | 0.38% |
| 32k | 100 MiB | 275 KiB | 25% | 206 KiB | 0.20% |
| 64k | 100 MiB | 137 KiB | 20% | 110 KiB | 0.1% |
| 128k | 100 MiB | 69 KiB | 15% | 58 KiB | 0.057% |
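The "Pbuf Raw Size" column is simply 88 bytes of metadata per referenced block; for example:

```python
MIB = 1024 * 1024

def pbuf_raw_kib(avg_block_size, data_mib=100):
    """Uncompressed pbuf size (KiB) for 100 MiB of referenced data."""
    n_bufs = data_mib * MIB // avg_block_size
    return n_bufs * 88 / 1024

print(pbuf_raw_kib(2048))     # 4400.0 KiB, matching the 2k row
print(pbuf_raw_kib(131072))   # 68.75 KiB, i.e. the ~69 KiB 128k row
```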
During rebuild, ZFS reads in all metadata stored on an L2ARC device in order to reconstruct the in-memory ARC buffer headers. This means that whatever L2ARC capacity you have installed, its metadata overhead is the amount of data that needs to be read in during rebuild. At a typical overhead of 0.1%, a 240 GB L2ARC device will need roughly 240 MB of metadata to be read. Most solid state drives can transfer this kind of data volume within a second or two.
However, given the structure of L2ARC pbufs, we don't know which pbuf to fetch next until we are done reading the current pbuf, and so we are also bound by the device's IOPS. In practice, the implementation tries to alleviate this problem in two ways:
- By speculatively initiating a prefetch for the next pbuf when only the header of the current pbuf has been read so far. Currently, this provides roughly a 15-20% boost in L2ARC rebuild times when compared to a straight "read/decode/read" linear implementation.
- By executing the rebuild in a background thread for each L2ARC device in parallel. This way the L2ARC rebuild doesn't block main pool import and possibly system boot (if the L2ARC device was part of the root pool).
While an L2ARC rebuild is going on in the background, writes to the L2ARC devices are prevented, to make sure that new L2ARC writes don't overwrite any metadata that is yet to be read or is in the process of being read. To prevent malformed or damaged pbufs from sending the rebuild process into an unending cycle (e.g. by linking the pbufs into a loop), the rebuild process enforces a deadline timeout which, when hit, forces the rebuild process to stop.
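The rebuild walk can be sketched as follows. `read_pbuf` is a hypothetical stand-in for a device read that returns (decoded ARC headers, previous pbuf address); the real implementation relies on the deadline timeout to escape pbuf loops, while the explicit visited-set shown here is an extra guard added for clarity:

```python
import time

def rebuild(read_pbuf, start_daddr, deadline_s=60):
    """Walk the pbuf linked list newest-first, collecting ARC headers."""
    deadline = time.monotonic() + deadline_s
    visited, headers = set(), []
    daddr = start_daddr
    while daddr is not None:
        if time.monotonic() > deadline:
            break                 # slow/failing device: abort, keep what we have
        if daddr in visited:
            break                 # corrupted metadata linked pbufs into a loop
        visited.add(daddr)
        hdrs, daddr = read_pbuf(daddr)
        headers.extend(hdrs)
    return headers
```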
As an example, one benchmark system where L2ARC rebuild was tested was an HP MicroServer with a 55 GiB L2ARC device on an OCZ Vertex 3 SSD. Rebuilding 55 GiB of raw L2ARC data takes approximately 1.9 seconds on such a device and is unnoticeable during system boot or device attach (given that it proceeds asynchronously in the background anyway).
The following kstats were added to the
arcstats set to provide additional information about persistent L2ARC performance:
l2_meta_writes:number of metadata pbufs written to L2ARC devices since boot.
l2_meta_avg_size:floating average of uncompressed size of a metadata pbuf. See the "Floating Averages and Ratios" section below for how these values are computed and updated.
l2_meta_avg_asize:floating average of the actual on-disk size of a metadata pbuf. By comparing these two numbers you can establish the pbuf compression efficiency.
l2_asize_to_meta_ratio:floating ratio of the amount of space taken up on L2ARC devices by ARC buffer data divided by space taken up by pbuf metadata. For example, a value of "1000" means that for each 1000 bytes of data we consume 1 byte of metadata (i.e. metadata is taking up about 0.1% of available L2ARC space).
l2_rebuild_attempts:number of times an L2ARC rebuild has been attempted. An L2ARC rebuild is attempted (and this kstat incremented) only if an L2ARC device is deemed to contain usable L2ARC metadata (this is determined by looking at whether it contains a valid uberblock).
l2_rebuild_successes:number of times an L2ARC rebuild has successfully completed. Except for cases of device failure or serious data corruption, this number should track the above kstat.
l2_rebuild_unsupported:number of times we looked at an L2ARC device and found that L2ARC metadata rebuilding is not possible (either because the device contains no valid uberblocks, or because it didn't belong to this pool).
l2_rebuild_timeout:number of times a rebuild has been aborted due to hitting a predefined maximum duration timeout (see "Tunables" below). A non-zero value here indicates that your L2ARC device is either very slow or is failing.
l2_rebuild_arc_bytes:number of ARC buffer bytes recovered from an L2ARC device during metadata rebuild. This doesn't mean that the ARC buffers have been read back into memory, just how much raw ARC-cacheable buffer data there is on L2ARC.
A companion kstat is similar to l2_rebuild_arc_bytes, but indicates how much space the ARC buffers take up on the L2ARC device. If you've enabled L2ARC compression on datasets, this number will likely be lower than l2_rebuild_arc_bytes. Otherwise, it should be precisely equal.
l2_rebuild_bufs:how many ARC buffers were recovered from L2ARC during rebuild.
l2_rebuild_metabufs:how many L2ARC pbufs were processed from L2ARC during rebuild.
l2_rebuild_precached:if during rebuild we discover that a buffer is already in ARC, we do not attempt to rebuild it and instead bump this counter.
l2_rebuild_uberblk_lookups:number of attempted uberblock lookups. This is the first step taken during L2ARC rebuild. Unless you have disabled L2ARC rebuild (see Tunables below), this number should get bumped at each L2ARC device attach.
l2_rebuild_uberblk_errors:if during L2ARC rebuild we encounter faulty uberblocks, we bump this counter and abort the rebuild. Any non-zero value here indicates a failing L2ARC device or corrupted L2ARC metadata.
l2_rebuild_io_errors:bumped each time an L2ARC rebuild read hits an I/O error. Any non-zero value here indicates a failing L2ARC device.
l2_rebuild_cksum_errors:bumped each time an L2ARC rebuild hits a buffer which doesn't checksum to its correct value (though it should). This indicates either a failing L2ARC device or corrupted L2ARC metadata.
l2_rebuild_loop_errors:bumped each time an L2ARC rebuild encounters a pbuf list loop. This is a serious condition and indicates severely corrupted L2ARC metadata.
l2_rebuild_abort_lowmem:if the system encounters a low-memory condition during L2ARC rebuild, we abort the rebuild and bump this counter, in order to avoid destabilizing the system further. You may want to re-add the L2ARC device at a later time when the memory pressure has been relieved, in order to rebuild the L2ARC device's complete contents.
Floating Averages and Ratios
Because we don't keep an in-memory history of the L2ARC metadata we write, it is difficult to compute certain aggregate averages and ratio values. Instead, we compute a "floating statistic", which tracks the recent history of the statistic by slowly factoring individual updates into it, very similar to how load averages track system load. We do this by applying the following formula:
STAT = STAT - STAT / FACTOR + NEWVAL / FACTOR
- STAT: is the value of the respective statistic
- NEWVAL: is the new value to be added to the statistic
- FACTOR: is the division factor which determines the contribution of the new value to the overall result. The greater this factor, the less of an effect new values have on the statistic and the more history is preserved. By default, we use a factor value of 3.
After repeated applications of this formula the statistic converges on the new value; after 10 updates the floating average is within 2% of the desired new value.
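A short sketch of the update rule shows this convergence:

```python
def update(stat, newval, factor=3):
    """STAT = STAT - STAT/FACTOR + NEWVAL/FACTOR (default FACTOR = 3)."""
    return stat - stat / factor + newval / factor

stat = 0.0
for _ in range(10):
    stat = update(stat, 100.0)    # feed in a constant new value
print(round(stat, 2))             # 98.27: within 2% of 100 after 10 updates
```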
Tunables
This is a list of kernel-tunable parameters. Please note that changing any of these values is generally not advised; their defaults were chosen to provide the best overall performance based on extensive testing and benchmarking. You may alter these using the familiar MDB syntax:
# echo "<tunable>/W <new_value>" | mdb -kw
To make a tunable value persistent, append it to
/etc/system in the following format:
set zfs:<tunable> = <new_value>
l2arc_pbuf_compress_minsz:minimum size in bytes a pbuf must reach before we attempt to compress it (default: 8192 bytes). If a pbuf cannot be compressed, it is written in uncompressed form instead. Increasing this value to something like 32k or more makes sense if you use L2ARC devices with 4k sectors, where compression is unlikely to yield buffers that are small enough to give a measurable reduction in pbuf size.
l2arc_pbuf_max_sz:maximum number of ARC bytes a pbuf may reference before we commit it to L2ARC (default: 100 MiB). This forms the data-based upper bound of the pbuf commit interval.
l2arc_pbuf_max_buflists:maximum number of L2ARC feed cycles a pbuf may contain (default: 128). If an L2ARC feed cycle attempts to write ARC buffers to L2ARC, we open a new "buflist" within the currently open pbuf and write the metadata of buffers to be committed to this buflist. Once a pbuf reaches this maximum number of buflists, it is flushed to storage, regardless of the number of bytes it references. This forms a soft time-based upper bound of the pbuf commit interval.
l2arc_rebuild_enabled:a boolean value controlling whether we should attempt to rebuild L2ARC contents when adding an L2ARC device to the pool (default: B_TRUE). This setting is applied before the device's L2 uberblock has been read, so any writes to this device overwrite any existing metadata on the device. Set this to B_FALSE in case you are having trouble with L2ARC rebuild timing out and you'd like to drop existing metadata on the L2ARC device.
l2arc_rebuild_timeout:a timeout value in seconds of how long at most an L2ARC rebuild is allowed to take (default: 60s). This serves to protect the system from failing or faulty L2ARC devices which are taking too long to complete the rebuild. We check this timeout after each read and if it is hit, we abort L2ARC metadata rebuild and return immediately. If your machine is hitting this timeout, check your system console for L2ARC rebuild timeout messages or the