
Currently under development as issue #3525 is a new feature that makes the second-level ARC (L2ARC) persistent across reboots.

Design

ARC and L2ARC Design

A data buffer in ARC essentially consists of two portions: its header (struct arc_buf_hdr) and the data portion (struct arc_buf). The header records general information about where on-pool the data in the ARC buffer belongs, when it was created and a number of other items. The data portion simply holds the data block and a few utility functions for dealing with the data. ARC buffers can exist without their data portion, but never without a header (a simplified sketch of these two structures follows the list of states below). The things placed on the ARC lists (MRU, MFU, ghost MRU and ghost MFU) are buffer headers, which may or may not have their data portion attached (if on MRU/MFU, the data is present; if on the ghost lists, it is not). An on-pool data block can thus exist in one of three general states:

  1. Cached in ARC: on one of the MRU or MFU lists together with its data.
  2. Recently cached in ARC: on one of the ghost lists, without its data.
  3. Evicted: not in ARC at all.
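
For orientation, the following is a heavily simplified C sketch of how these two portions relate to each other. It only loosely follows the real illumos structures, which carry many more fields, locks and reference counts, so treat it purely as an illustration.

#include <stdint.h>

/* Simplified illustration only; not the actual illumos definitions. */
typedef struct arc_buf_hdr arc_buf_hdr_t;
typedef struct arc_buf arc_buf_t;

struct arc_buf {
        arc_buf_hdr_t *b_hdr;     /* back-pointer to the owning header */
        void          *b_data;    /* the cached data block itself */
};

struct arc_buf_hdr {
        uint64_t      b_dva[2];   /* where on-pool the block lives */
        uint64_t      b_birth;    /* birth TXG of the block */
        uint64_t      b_size;     /* size of the cached data */
        void          *b_state;   /* MRU, MFU, ghost MRU, ghost MFU or l2c_only */
        arc_buf_t     *b_buf;     /* data portion; NULL while on a ghost list */
        void          *b_l2hdr;   /* l2arc_buf_hdr, set if also cached on L2ARC */
};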

The L2ARC was designed to extend this situation in a very simple manner:

  • A thread, called the l2arc_feed_thread, periodically scans the tails of the ARC MRU and MFU lists for buffers that are about to "fall off" the end of each respective list into the ghost cache.
  • After a quick eligibility check, the buffer's data portion is written to L2ARC and its header is updated to note that the buffer is now also present on L2ARC (by attaching a special l2arc_buf_hdr structure).
  • If during regular operation the ARC eviction process wants to remove this buffer, it detects that the buffer is L2ARC-cached, and so rather than moving it to its ghost list, moves it to a new list, called the l2c_only (level-2-cache only) state.
  • If an L2ARC-cached buffer is re-requested from the ARC after it has been evicted, rather than going to the on-pool devices, the L2ARC issues a read to the respective L2ARC device to retrieve the buffer's data.
  • If for whatever reason this fails (e.g. the L2ARC device has died, or the retrieved data is corrupt), the read is re-issued to the main pool.

The l2arc_feed_thread writes buffers to devices in a simple rotor fashion, looping back to the start of the device once it has been filled. Sooner or later, it starts overwriting previously written buffers - this is resolved by evicting those buffers from the L2ARC device (essentially just noting in the arc_buf_hdr structure that the buffer is no longer on the L2ARC device). If this kicks a buffer off the ARC completely (i.e. it was only in L2ARC and nowhere else), then it probably wasn't needed in the first place (remember, an L2ARC cache hit places the buffer back on the MRU/MFU ARC lists).

It is important to note that there is no direct eviction path for a buffer from ARC to L2ARC - if this were the case, during extreme memory pressure (when the ARC needs to evict buffers), we might be stalled waiting for L2ARC devices to finish writing buffers to L2ARC. By doing ARC eviction and L2ARC feeding in separate threads, it is guaranteed that a slow or failing L2ARC device can't bring the system down (at worst the feed thread will miss some eligible buffers for writing).

Extensions To L2ARC To Implement Persistency

The notable problem above is that l2arc_feed_thread never writes anything besides the ARC buffer data, and does so in an unstructured way. Therefore, it is not possible to examine an L2ARC device and determine what blocks it contains. To resolve this issue, we've modified the L2ARC feed thread to:

  • Periodically append some metadata describing the buffers written in the last few feed cycles. This takes the form of short metadata blocks called "pbufs" (persistency buffers). Each pbuf contains all the metadata needed to reconstruct the ARC buffers written before it in L2ARC and a pointer to the previously written pbuf, forming a linked list of pbufs.
  • At the start of the device we reserve a small region to hold a set of uberblocks. We update one of these uberblocks each time we write a new pbuf to point to the most recently written entry.

The following image demonstrates how we organize these two on-device data structures:

On-device Metadata Structures

Uberblock Region and Uberblocks

We reserve a 1 MiB long block at the start of each L2ARC device (immediately following the vdev labels, at offset VDEV_LABEL_START_SIZE) which we call the uberblock region. We subdivide this uberblock region into 256 separate 4 KiB long uberblocks. During normal operation, these uberblocks are updated in a rotor fashion after each pbuf commit. This is not meant to provide error resilience as in the case of main pool uberblocks (in case of an L2ARC read error we simply drop the L2ARC metadata), but rather to allow for some flash cell wear leveling on SSDs which don't support it internally. Given that current cheap MLC flash can sustain around 2000 erase-program cycles per block, this gives an endurance of roughly 512,000 uberblock updates. Even on fairly busy pools we expect uberblocks to be updated only roughly every 10 seconds, so it would take approximately 2 months to burn through all flash cycles on these kinds of devices (and one can always tune the L2ARC to emit pbufs and uberblock updates less frequently, extending this lifetime much further).
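
A quick back-of-the-envelope check of that endurance estimate (the constants come straight from the paragraph above; the short program below merely repeats the arithmetic):

#include <stdio.h>

int
main(void)
{
        /* Figures from the paragraph above. */
        double slots = 256;               /* uberblocks in the 1 MiB region */
        double cycles_per_block = 2000;   /* erase-program cycles of cheap MLC flash */
        double update_interval = 10;      /* seconds between uberblock updates */

        double total_updates = slots * cycles_per_block;
        double lifetime_days = total_updates * update_interval / 86400;

        /* prints: 512000 updates, ~59 days (roughly 2 months) */
        printf("%.0f updates, ~%.0f days\n", total_updates, lifetime_days);
        return (0);
}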

On-device an L2ARC uberblock has the following structure:

struct l2uberblock {
        /* always big endian */
        uint32_t magic = 0x12bab10c;
        uint8_t  version = 0x1;
        uint8_t  reserved = 0x0;
        uint16_t ublk_flags;

        /* endianness of remaining items determined by `ublk_flags' */
        uint64_t spa_guid;
        uint64_t birth;
        uint64_t evict_tail;
        uint64_t alloc_space;
        uint64_t pbuf_daddr;
        uint32_t pbuf_asize;
        uint64_t pbuf_cksum[4];

        uint8_t  reserved2[3996] = {0x0, 0x0, ... 0x0};

        uint64_t ublk_cksum[4] = fletcher4(of the 4064 bytes above);
} l2dev_uberblks[256];

  • magic: This is the magic number by which we verify whether an L2ARC device even contains uberblocks.
  • version: The version of this uberblock structure. Currently always 0x1.
  • ublk_flags: A 16-bit set of flags identifying various special options about this uberblock:

    Bit    Field name     Description
    0      Byte order     1 = remainder of fields is MSB first (big endian), 0 = LSB first (little endian)
    1      Evict first    1 = indicates that L2ARC is doing its first sweep through the device
    2-15   Reserved       Set to 0
  • spa_guid: GUID of the main pool SPA to which this L2ARC device belongs (used to protect against rebuilding buffers from a different pool).
  • birth: A sequential birth counter, starting at 0. Each uberblock update to this L2ARC device increments this counter. This is then used to locate the newest uberblock (see the sketch after this list).
  • evict_tail: The last device address that L2ARC buffer eviction reached.
  • alloc_space: How much space is allocated on the L2ARC device.
  • pbuf_daddr: Device address (byte offset) of the newest pbuf.
  • pbuf_asize: Actual size of newest pbuf.
  • pbuf_cksum: Fletcher4 checksum of newest pbuf (i.e. L2ARC device contents starting at pbuf_daddr up to pbuf_daddr + pbuf_asize).
  • ublk_cksum: Fletcher4 checksum of all 4064 bytes in this uberblock up to this field.
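
The following user-space sketch illustrates how the birth counter can be used to locate the newest uberblock. It is a hypothetical helper, not part of the implementation: it assumes the endianness-dependent fields are little endian (ublk_flags bit 0 clear), assumes the uberblock region starts at a 4 MiB offset (VDEV_LABEL_START_SIZE), and skips the spa_guid and ublk_cksum verification that the real rebuild code performs.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define UBLK_COUNT       256
#define UBLK_SIZE        4096
#define UBLK_REGION_OFF  (4ULL << 20)   /* VDEV_LABEL_START_SIZE, assumed 4 MiB */

int
main(int argc, char **argv)
{
        uint64_t best_birth = 0;
        int best_slot = -1;
        int fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
                fprintf(stderr, "usage: %s <cache-device>\n", argv[0]);
                return (1);
        }
        for (int i = 0; i < UBLK_COUNT; i++) {
                uint8_t hdr[24];        /* magic..birth, per struct l2uberblock */
                uint64_t birth;

                if (pread(fd, hdr, sizeof (hdr),
                    UBLK_REGION_OFF + (off_t)i * UBLK_SIZE) != sizeof (hdr))
                        continue;
                /* magic 0x12bab10c is always stored big endian; version must be 0x1 */
                if (hdr[0] != 0x12 || hdr[1] != 0xba || hdr[2] != 0xb1 ||
                    hdr[3] != 0x0c || hdr[4] != 0x1)
                        continue;       /* no valid uberblock in this slot */
                memcpy(&birth, &hdr[16], sizeof (birth));   /* offset 16: birth */
                if (best_slot == -1 || birth > best_birth) {
                        best_birth = birth;
                        best_slot = i;
                }
        }
        printf("newest uberblock: slot %d, birth %llu\n",
            best_slot, (unsigned long long)best_birth);
        close(fd);
        return (0);
}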

Persistency Buffer (pbuf)

Pbufs are on-disk data structures that record the arc_buf_hdr metadata for a number of ARC buffers in a portable, persistent manner. The amount of ARC buffer data described by a single pbuf depends on system behavior, but is generally around 100 MiB, placing consecutive pbufs roughly 100 MiB apart on the device. It is thus possible to estimate the number of pbufs on a given L2ARC device by simply dividing the device's capacity in MiB by 100. This is also approximately the number of I/O operations it will take to read all pbuf data back at L2ARC rebuild time.

On-device a pbuf has the following structure:

struct l2arc_pbuf {
        /* always big endian */
        uint32_t magic = 0xdb0faba6;
        uint8_t  version = 0x1;
        uint8_t  reserved = 0x0;
        uint16_t pbuf_flags;

        /* endianness of remaining items determined by `pbuf_flags' */
        uint64_t prev_pbuf_daddr;
        uint32_t prev_pbuf_asize;
        uint64_t prev_pbuf_cksum[4];

        uint32_t payload_size;
        /* if(pbuf_flags & pbuf_is_compressed) decompress `payload' before decoding */
        struct l2arc_pbuf_payload_item {
                /* these fields mirror [l2]arc_buf_hdr structure fields */
                uint64_t dva[2];
                uint64_t birth_txg;
                uint64_t cksum0;
                uint64_t freeze_cksum[4];
                uint32_t size;
                uint64_t l2_daddr;
                uint32_t l2_asize;
                uint8_t  l2_compress;
                uint8_t  contents_type;
                uint16_t reserved = 0x0;
                uint32_t flags;
        } payload[];   /* continues for remainder of pbuf */
};

  • magic: A magic value which we check to see if the device contains a pbuf at a specific device address.
  • version: Version of the pbuf structure. Currently always 0x1.
  • pbuf_flags: A 16-bit set of flags identifying various special options about this pbuf:

    Bit    Field name     Description
    0      Byte order     1 = remainder of fields is MSB first (big endian), 0 = LSB first (little endian)
    1      Compressed     1 = indicates `payload' field is LZ4-compressed, 0 = payload is uncompressed.
                          Decompression of the payload must be done via the lz4_decompress ZFS function.
    2-15   Reserved       Set to 0
  • prev_pbuf_daddr: Device address of previous pbuf in pbuf linked list.
  • prev_pbuf_asize: Actual size of previous pbuf.
  • prev_pbuf_cksum: Fletcher4 checksum of previous pbuf.
  • payload_size: Number of uncompressed bytes in the `payload' field. If no compression was specified by the pbuf flags, this field's value is exactly equal to the number of bytes remaining in this pbuf. (A sketch of decoding these payload items follows this list.)
  • dva: DVA of the respective ARC buffer. See struct arc_buf_hdr->b_dva.
  • birth_txg: Birth TXG number of the ARC buffer. See struct arc_buf_hdr->b_birth.
  • cksum0: First 64-bits of Fletcher2 checksum of ARC buffer contents. See struct arc_buf_hdr->b_cksum0.
  • freeze_cksum: Complete ARC buffer freeze checksum. See struct arc_buf_hdr->b_freeze_cksum.
  • size: Uncompressed size of ARC buffer contents. See struct arc_buf_hdr->b_size.
  • l2_daddr: Device address on L2ARC device which contains the ARC buffer's data. See struct l2arc_buf_hdr->b_daddr.
  • l2_asize: Actual on-device size of ARC buffer data. If the buffer is compressed, this value will be smaller than `size'. If the l2_compress field below is set to ZIO_COMPRESS_EMPTY, then l2_daddr and l2_asize are undefined, as the ARC buffer has been eliminated by compression and isn't present on L2ARC at all. See struct l2arc_buf_hdr->b_asize.
  • l2_compress: Compression applied to ARC buffer contents (see enum zio_compress in usr/src/uts/common/fs/zfs/sys/zio.h). See struct l2arc_buf_hdr->b_compress.
  • contents_type: ARC buffer contents type (data or metadata). See struct arc_buf_hdr->b_type.
  • flags: ARC buffer flags. See struct arc_buf_hdr->b_flags.
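
To make the payload layout concrete, here is a hedged user-space sketch that iterates over an already-read and already-decompressed `payload' region and prints each item. The packed struct mirrors the 88-byte on-disk item layout above (the on-disk format has no alignment padding); little-endian fields are assumed, and the step of turning each item back into an in-memory ARC header is omitted.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct l2arc_pbuf_payload_item above; packed, 88 bytes on disk. */
#pragma pack(1)
typedef struct pbuf_item {
        uint64_t dva[2];
        uint64_t birth_txg;
        uint64_t cksum0;
        uint64_t freeze_cksum[4];
        uint32_t size;
        uint64_t l2_daddr;
        uint32_t l2_asize;
        uint8_t  l2_compress;
        uint8_t  contents_type;
        uint16_t reserved;
        uint32_t flags;
} pbuf_item_t;
#pragma pack()

/*
 * Hypothetical sketch: walk the decompressed `payload' region of a pbuf
 * (little-endian fields assumed) and print one line per ARC buffer header
 * it describes.  A real rebuild would allocate an arc_buf_hdr for each
 * item instead of printing it.
 */
static void
walk_payload(const uint8_t *payload, uint32_t payload_size)
{
        for (uint32_t off = 0; off + sizeof (pbuf_item_t) <= payload_size;
            off += sizeof (pbuf_item_t)) {
                pbuf_item_t it;

                memcpy(&it, payload + off, sizeof (it));
                printf("dva %llx:%llx birth %llu size %u l2_daddr %llu l2_asize %u\n",
                    (unsigned long long)it.dva[0],
                    (unsigned long long)it.dva[1],
                    (unsigned long long)it.birth_txg, it.size,
                    (unsigned long long)it.l2_daddr, it.l2_asize);
        }
}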

Performance

As noted above, making L2ARC data persistent across reboots requires the introduction of some metadata structures, which then have to be re-read at L2ARC device attach time to restore the ARC headers. This means we need to consider two areas of performance: how much metadata there is, and how long it takes to restore it to memory when a rebuild is issued.

Overhead of Metadata

Persistent metadata comes in two types: uberblocks and pbufs. Given that uberblocks and the uberblock region are of a fixed known size (1 MiB total), we must focus on the amount of space pbufs consume on L2ARC storage. Pbufs have been designed to be very efficient in terms of space usage, mainly by recording the minimum set of fields from an ARC buffer header needed to reconstruct it, and by allowing the pbuf payload to be compressed.

A pbuf consists of exactly 56 bytes of pbuf header information and 88 bytes per ARC buffer header referenced. Given that the smallest possible pbuf contains 1 ARC buffer header, this gives a worst achievable metadata overhead of 28.125%. However, this is the worst possible configuration, and the default configuration of L2ARC persistency doesn't even allow it. The worst possible result using default settings can be achieved by setting the recordsize property on the main pool to its lowest possible value (512). Then, assuming the L2ARC writes only a single ARC block per feed cycle (something that would require great care to achieve at all), pbufs are emitted with a minimum of 128 ARC buffer headers. Assuming a 20% reduction in the volume of the pbuf via compression, this gives a theoretical maximum overhead of 13.818%.
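
The two worst-case figures above follow directly from the 56-byte pbuf header and 88 bytes per referenced ARC buffer header; the snippet below simply restates that arithmetic.

#include <stdio.h>

int
main(void)
{
        /* Absolute worst case: one 88-byte item describing one 512-byte block. */
        double worst = (56.0 + 88.0) / 512.0;

        /*
         * Worst case under default settings: at least 128 items per pbuf,
         * 512-byte records, and a 20% reduction in pbuf volume via compression.
         */
        double worst_default = 0.8 * (56.0 + 128 * 88.0) / (128 * 512.0);

        /* prints: 28.125% 13.818% */
        printf("%.3f%% %.3f%%\n", worst * 100, worst_default * 100);
        return (0);
}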

The following table lists the typically expected metadata overhead ratios for various average block sizes cached on the L2ARC device:

Average Block Size  Data Referenced By Pbuf  Pbuf Raw Size  Est. Pbuf Compression Ratio  Pbuf Actual Size  Overhead
2k                  100 MiB                  4400 KiB       45%                          2640 KiB          2.36%
4k                  100 MiB                  2200 KiB       40%                          1320 KiB          1.18%
8k                  100 MiB                  1100 KiB       35%                          715 KiB           0.7%
16k                 100 MiB                  550 KiB        30%                          385 KiB           0.38%
32k                 100 MiB                  275 KiB        25%                          206 KiB           0.20%
64k                 100 MiB                  137 KiB        20%                          110 KiB           0.1%
128k                100 MiB                  69 KiB         15%                          58 KiB            0.057%

Rebuild Times

During rebuild, ZFS reads in all metadata stored on an L2ARC device in order to reconstruct in-memory ARC buffer headers. This means that whatever L2ARC capacity is installed, the corresponding amount of metadata (per the overhead figures above) needs to be read in during rebuild. At a typical overhead of 0.1%, a 240 GB L2ARC device will need to read roughly 240 MB of data. Most solid state drives can transfer this volume of data within a second or two.

However, given the structure of L2ARC pbufs, we don't know which pbuf to start fetching before we are done reading the current pbuf, and so we are also bound by the device's IOPS. In practice, the implementation tries to alleviate these problems in two ways:

  • By speculatively initiating a prefetch for the next pbuf when only the header of the current pbuf has been read so far. Currently, this provides roughly a 15-20% boost in L2ARC rebuild times when compared to a straight "read/decode/read" linear implementation.
  • By executing the rebuild in a background thread for each L2ARC device in parallel. This way the L2ARC rebuild doesn't block main pool import and possibly system boot (if the L2ARC device was part of the root pool).

While an L2ARC rebuild is going on in the background, writes to the L2ARC device are prevented, to make sure that new L2ARC writes don't overwrite any metadata that is yet to be read or is in the process of being read. To prevent malformed or damaged pbufs from sending the rebuild process into an unending cycle (e.g. by linking the pbufs into a loop), the rebuild process enforces a deadline timeout which, when hit, forces the rebuild to stop.

As an example, one benchmark system where L2ARC rebuild was tested was an HP MicroServer with a 55 GiB L2ARC device on an OCZ Vertex 3 SSD. Rebuilding 55 GiB of raw L2ARC data takes approximately 1.9 seconds on such a device and is unnoticeable during system boot or device attach (given that it proceeds asynchronously in the background anyway).

Exported kstats

The following kstats were added to the arcstats set to provide additional information about persistent L2ARC performance (an example of how to read them follows the list):

  • l2_meta_writes: number of metadata pbufs written to L2ARC devices since boot.
  • l2_meta_avg_size: floating average of uncompressed size of a metadata pbuf. See the "Floating Averages and Ratios" section below for how these values are computed and updated.
  • l2_meta_avg_asize: floating average of the actual on-disk size of a metadata pbuf. By comparing this with l2_meta_avg_size you can establish pbuf compression efficiency.
  • l2_asize_to_meta_ratio: floating ratio of the amount of space taken up on L2ARC devices by ARC buffer data divided by space taken up by pbuf metadata. For example, a value of "1000" means that for each 1000 bytes of data we consume 1 byte of metadata (i.e. metadata is taking up about 0.1% of available L2ARC space).
  • l2_rebuild_attempts: number of times an L2ARC rebuild has been attempted. An L2ARC rebuild is attempted (and this kstat incremented) only if an L2ARC device is deemed to contain usable L2ARC metadata (this is determined by looking at whether it contains a valid uberblock).
  • l2_rebuild_successes: number of times an L2ARC rebuild has successfully completed. Except for cases of device failure or serious data corruption, this number should track the above kstat.
  • l2_rebuild_unsupported: number of times we looked at an L2ARC device and found that L2ARC metadata rebuilding is not possible (either because the device contains no valid uberblocks, or because it didn't belong to this pool).
  • l2_rebuild_timeout: number of times a rebuild has been aborted due to hitting a predefined maximum duration timeout (see "Tunables" below). A non-zero value here indicates that your L2ARC device is either very slow or is failing.
  • l2_rebuild_arc_bytes: number of ARC buffer bytes recovered from an L2ARC device during metadata rebuild. This doesn't mean that the ARC buffers have been read back into memory, just how much raw ARC-cacheable buffer data there is on L2ARC.
  • l2_rebuild_l2arc_bytes: same as l2_rebuild_arc_bytes, but indicates how much space the ARC buffers are taking up on the L2ARC device. If you've enabled L2ARC compression on datasets, this number will likely be lower than l2_rebuild_arc_bytes. Otherwise, it should be precisely equal.
  • l2_rebuild_bufs: how many ARC buffers were recovered from L2ARC during rebuild.
  • l2_rebuild_metabufs: how many L2ARC pbufs were processed from L2ARC during rebuild.
  • l2_rebuild_precached: if during rebuild we discover that a buffer is already in ARC, we do not attempt to rebuild it and instead bump this counter.
  • l2_rebuild_uberblk_lookups: number of attempted uberblock lookups. This is the first step taken during L2ARC rebuild. Unless you have disabled L2ARC rebuild (see Tunables below), this number should get bumped at each L2ARC device attach.
  • l2_rebuild_uberblk_errors: if during L2ARC rebuild we encounter faulty uberblocks, we bump this counter and abort the rebuild. Any non-zero value here indicates a failing L2ARC device or corrupted L2ARC metadata.
  • l2_rebuild_io_errors: bumped each time an L2ARC rebuild read hits an I/O error. Any non-zero value here indicates a failing L2ARC device.
  • l2_rebuild_cksum_errors: bumped each time an L2ARC rebuild hits a buffer which doesn't checksum to its correct value (though it should). This indicates either a failing L2ARC device or corrupted L2ARC metadata.
  • l2_rebuild_loop_errors: bumped each time an L2ARC rebuild encounters a pbuf list loop. This is a serious condition and indicates severely corrupted L2ARC metadata.
  • l2_rebuild_abort_lowmem: if the system encounters a low-memory condition during L2ARC rebuild, we abort the rebuild and bump this counter, in order to avoid destabilizing the system even further. You may want to re-add the L2ARC device at a later time, when the memory pressure has been relieved, in order to rebuild the L2ARC device's complete contents.
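
On illumos these counters are part of the zfs module's arcstats kstat, so a quick way to inspect the rebuild-related ones (using the standard kstat(1M) utility) is:

# kstat -m zfs -n arcstats | grep l2_rebuild
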
Floating Averages and Ratios

Because we don't keep an in-memory history of the L2ARC metadata we write, it is difficult to compute certain aggregate averages and ratios. Instead, we compute a "floating statistic", which tracks the recent history of the statistic by slowly factoring individual updates into it, very similar to how load averages track system load. We do this by applying the following formula:

STAT = STAT - STAT / FACTOR + NEWVAL / FACTOR

where:

  • STAT: is the value of the respective statistic
  • NEWVAL: is the new value to be added to the statistic
  • FACTOR: is the division factor which determines the contribution of the new value to the overall result. The greater this factor, the less effect new values have on the statistic and the more history is preserved. By default, we use a factor value of 3.

To see this in effect, consider the following example (in GNU bc):

$ bc
stat=100000 ; newval=200000; factor=3
stat=stat - stat/factor + newval/factor ; stat
133333
stat=stat - stat/factor + newval/factor ; stat
155555
stat=stat - stat/factor + newval/factor ; stat
170370
stat=stat - stat/factor + newval/factor ; stat
180246
stat=stat - stat/factor + newval/factor ; stat
186830
stat=stat - stat/factor + newval/factor ; stat
191220
stat=stat - stat/factor + newval/factor ; stat
194146
stat=stat - stat/factor + newval/factor ; stat
196097
stat=stat - stat/factor + newval/factor ; stat
197398
stat=stat - stat/factor + newval/factor ; stat
198265

After 10 updates the floating average value is within 2% of the desired new value.

Tunables

Please note that changing any of these values is generally not advised. Their defaults were chosen to provide the best overall performance, based on extensive testing and benchmarking.

This is a list of kernel-tunable parameters. You may alter these using the familiar MDB syntax:

# echo "<tunable>/W <new_value>" | mdb -kw

To make a tunable value persistent, append it to /etc/system in the following format:

set zfs:<tunable> = <new_value>
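
For example, to keep L2ARC rebuild disabled across reboots (using the l2arc_rebuild_enabled tunable described below):

set zfs:l2arc_rebuild_enabled = 0
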
  • l2arc_pbuf_compress_minsz: minimum size in bytes a pbuf must reach before we attempt to compress it (default: 8192 bytes). If a pbuf cannot be compressed, it is written in uncompressed form instead. Increasing this value to something like 32k or more makes sense if you use L2ARC devices with 4k sectors, where compression is unlikely to yield buffers that are small enough to give a measurable reduction in pbuf size.
  • l2arc_pbuf_max_sz: maximum number of ARC bytes a pbuf may reference before we commit it to L2ARC (default: 100 MiB). This forms the data-based upper bound of the pbuf commit interval.
  • l2arc_pbuf_max_buflists: maximum number of L2ARC feed cycles a pbuf may contain (default: 128). If an L2ARC feed cycle attempts to write ARC buffers to L2ARC, we open a new "buflist" within the currently open pbuf and write the  metadata of buffers to be committed to this buflist. Once a pbuf reaches this maximum number of buflists, it is flushed to storage, regardless of the number of bytes it references. This forms a soft time-based upper bound of the pbuf commit interval.
  • l2arc_rebuild_enabled: a boolean value controlling whether we should attempt to rebuild L2ARC contents when adding an L2ARC device to the pool (default: B_TRUE). This setting is applied before the device's L2 uberblock has been read, so any writes to this device overwrite any existing metadata on the device. Set this to B_FALSE in case you are having trouble with L2ARC rebuild timing out and you'd like to drop existing metadata on the L2ARC device.

  • l2arc_rebuild_timeout: a timeout value in seconds of how long at most an L2ARC rebuild is allowed to take (default: 60s). This serves to protect the system from failing or faulty L2ARC devices which are taking too long to complete the rebuild. We check this timeout after each read and if it is hit, we abort L2ARC metadata rebuild and return immediately. If your machine is hitting this timeout, check your system console for L2ARC rebuild timeout messages or the l2_rebuild_timeout kstat.
Comments
  1. Feb 17, 2013

    On the self-checksum,

    it is more consistent with the rest of ZFS to use SHA-256 instead of Fletcher-4 for the l2 uberblock checksum

     

    1. May 07, 2013

      Also some processors (e.g., ARMv8/Aarch64) have instruction set support for SHA-256 (similar to how x86 has AESNI instructions), but it's rather unlikely to see any common hardware with Fletcher-4 support any time soon.

      1. May 07, 2013

        Fletcher is orders of magnitude faster than SHA-256, so unless the speed bump from hardware acceleration is >>10x, it's not worth bothering about. Moreover, the L2 uberblock is only a few kBs long, so worrying about hashing performance there is not warranted.

  2. Feb 17, 2013

    On worries about endurance,

    Forget about worrying about endurance. There is absolutely no trickery you can try that can defeat the cleverness of the algorithms for wear leveling in modern SSDs.

     

  3. Feb 17, 2013

    On using the 'reserved' boot area,

    I don't see the advantage of using a circular buffer here. Though there is a reserved area that is available (3.5 MB), I do not see why regular COW doesn't work well here. The method employed by the ZIL seems more appropriate.

     

  4. Feb 17, 2013

    On shared L2ARC,

    It seems to me that this design is assuming an unshared L2ARC. This is a departure from the current implementation. I am not a fan of the current implementation, as it has failure modes that are difficult to troubleshoot. However, there is a marketing reason why L2ARCs are currently shared. I'd like to see this issue addressed directly.

     

    1. Feb 17, 2013

      Why do you think L2ARC is shared? As I read the code in l2arc_write_eligible(), a buffer may only be cached on an L2ARC device if it belongs to the same SPA as the L2ARC device itself.

      1. Feb 17, 2013

        L2ARC devices can be shared across different pools.

         

        1. Feb 17, 2013

          I see, so a cache dev can be assigned to multiple pools at the same time - didn't realize that. No problem, I'll add an extra condition to only restore buffers for the current pool. Alternatively, we can enable the metadata only in the case where the cache dev is used by a single pool. Which do you think is better?

        2. Feb 17, 2013

          Minor correction on my previous post, ATM a rebuild attempt will succeed only for the last pool to flush data to the L2ARC, and even then not work very well. In short, the current implementation is broken if multiple pools attempt to use the L2ARC device concurrently.

          I'll fix this and get back to you.

        3. Apr 04, 2013

          Only spare devices can be shared across different pools but can only be active on one pool. Cache devices are added to a given pool and cannot be shared.