George Wilson's talk at ZFS Day, and a related blog post with examples: http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/

Introduction

ZFS was designed in the early 2000s to be aware of trends in disk and device storage for the next generation of storage hardware. Unlike previous filesystems that make fixed assumptions about the device sector size, ZFS is designed to accommodate different sector sizes. As the consumer hard disk (HDD) market migrates from 512-byte sectors to 4KB sectors, ZFS is ready to make the change. In some cases, disks with 4KB physical sectors are also called Advanced Format (AF) disks.

The method used by ZFS is to query the drive parameters using the mechanisms provided by the OS for device inquiries. In most cases, the device reports both the physical and the logical sector size. There are some nuances in the implementations. For example, the SATA protocol uses a sector size that is a multiple of 512 bytes. SCSI devices (SAS, FC, etc.) can have many different sizes, and it is not uncommon for vendors to offer 512, 520, or 528 bytes per sector. By 2010, many HDD manufacturers had started offering consumer-grade disks with 4KB sectors. Going forward, some HDD vendors have deprecated sector sizes that are not powers of 2 (e.g. 520 or 528 bytes). ZFS allocates space using a logical sector size that is a power of 2 and is less than or equal to the physical sector size.
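The last rule above can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not code taken from ZFS; the function name zfs_sector_size is made up for this page.

```python
def zfs_sector_size(physical_sector_size: int) -> int:
    """Return the largest power of two that does not exceed the given size.

    Hypothetical helper that illustrates the rule only; it is not ZFS code.
    """
    if physical_sector_size < 1:
        raise ValueError("sector size must be positive")
    # bit_length() - 1 is the position of the highest set bit.
    return 1 << (physical_sector_size.bit_length() - 1)


if __name__ == "__main__":
    for size in (512, 520, 528, 4096):
        print(size, "->", zfs_sector_size(size))
```

Under this rule, the 520 and 528 byte sectors mentioned above are both treated as 512 bytes, while 512 and 4096 are used unchanged.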

Unfortunately, some HDD manufacturers do not properly respond to the device inquiry sizes. ZFS looks to the physical sector size (aka physical block size) for its hint on how to optimize its use of the device. If the disk reports that the physical sector size is 512 bytes, then ZFS will use an internal sector size of 512 bytes. The problem is that some HDDs misrepresent 4KB sector disks as having a physical sector size of 512 bytes. The proper response should be that the logical sector size is 512 bytes and the physical sector size is 4KB. By 2011-2012, most HDDs were properly reporting logical and physical sector sizes. In some cases, the HDD vendors advertise the disks as "emulating 512 byte sectors" or "512e", with the expected name for disks which advertise 4Kb sector size being "4kn" (for "4k native").

There is no functional or reliability problem with 4KB physical sectors being represented as 512 byte logical sectors. This technique has been used for decades in computer systems to allow expansion of device or address sizes. The general name for the technique is read-modify-write: when you need to write 512 bytes (or less than the physical sector size) then the device reads 4KB (the physical sector), modifies the data, and writes 4KB (because it can't write anything smaller). For HDDs, the cost can be a whole revolution, or 8.33 ms for a 7,200 rpm disk. Thus the performance impact for read-modify-write can be severe, and even worse for slower, consumer-grade, 5,400 rpm or variable speed "green" drives.
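The rotation figure quoted above follows directly from the spindle speed. A small Python sketch of the arithmetic (the function name is illustrative, and real drives add seek and controller overhead on top of this):

```python
def revolution_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return 60_000.0 / rpm  # 60,000 ms per minute divided by revolutions per minute


if __name__ == "__main__":
    for rpm in (7200, 5400):
        print(f"{rpm} rpm: {revolution_ms(rpm):.2f} ms per revolution")
```

A 5,400 rpm disk needs about 11.11 ms per revolution, which is why the read-modify-write penalty is larger on slower drives.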
Bottom line: for best performance, the HDD needs to properly communicate its physical block size via the inquiry commands.
Inside ZFS, the kernel variable that describes the physical sector size is ashift, which is the base-2 logarithm of the sector size in bytes: ashift = log2(physical sector size). A search of the ZFS archives turns up references to the ashift variable in discussions about sector sizes.
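The relationship between sector size and ashift is easy to check by hand. The Python sketch below only illustrates the definition; the helper names are made up for this page.

```python
def ashift_for(sector_size: int) -> int:
    """Return log2(sector_size) for a power-of-two sector size in bytes."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_size.bit_length() - 1


def sector_size_for(ashift: int) -> int:
    """Inverse mapping: the sector size in bytes for a given ashift."""
    return 1 << ashift


if __name__ == "__main__":
    for size in (512, 4096):
        print(f"{size} bytes -> ashift = {ashift_for(size)}")
```

So a 512-byte sector corresponds to ashift = 9 and a 4KB sector to ashift = 12.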

The issues experienced by ZFS implementers and the 4KB sector size HDDs are summarized as:

  1. Some HDD models misrepresent their physical block size, resulting in unexpectedly poor performance for some workloads
  2. Attempting to replace a HDD that had 512 byte physical sectors with a new HDD that has 4KB logical and physical sectors can fail with a mismatched sector size error message
  3. Some, but not all, ZFS implementations offer command-line options to set the physical sector size, either in the zpool command or via other OS commands and configuration settings
  4. Older Solaris releases did not set the default sector boundaries on 4KB boundaries, negatively impacting performance (fixed in illumos, later OpenSolaris releases, Solaris 10 recent updates, and Solaris 11)
  5. 4KB sector disks are not as space-efficient as 512 byte sector disks, in part because ZFS metadata is compressed, dynamically allocated, and often less than 4KB physical size
  6. As George Wilson notes in his ZFS Day talk (see above), certain HDDs include an "XP Jumper" that offsets LBA addresses by 1, so that Windows XP partitions, which by default start at a logical offset of 63 512-byte blocks, become 4KB-aligned (starting at a physical offset of 64 512-byte blocks). However, such a jumper hurts advanced users who take care to define 4KB-aligned partitions, because these are then shifted by 1 into misaligned hardware I/Os. So, if performance is unexpectedly bad despite your tuning, keep this cause in mind.
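Items 5 and 6 above both come down to simple arithmetic. The Python sketch below is a simplification under stated assumptions: it treats an allocation as rounded up to a whole number of sectors of size 2^ashift and ignores raidz parity and padding, and the helper names are made up for this page.

```python
def allocated_size(nbytes: int, ashift: int) -> int:
    """Round nbytes up to a whole number of sectors of size 2**ashift."""
    unit = 1 << ashift
    return -(-nbytes // unit) * unit  # ceiling division, then scale back to bytes


def is_4k_aligned(start_lba: int, lba_size: int = 512) -> bool:
    """True if a partition starting at start_lba begins on a 4KB boundary."""
    return (start_lba * lba_size) % 4096 == 0


if __name__ == "__main__":
    # Item 5: a 1.5KB compressed metadata block on 512-byte vs 4KB sectors.
    print(allocated_size(1536, 9), allocated_size(1536, 12))
    # Item 6: the Windows XP default offset, with and without the jumper shift.
    print(is_4k_aligned(63), is_4k_aligned(63 + 1))
    # A carefully aligned partition becomes misaligned once shifted by 1.
    print(is_4k_aligned(64), is_4k_aligned(64 + 1))
```

In this model the 1.5KB block occupies 1536 bytes with ashift = 9 but 4096 bytes with ashift = 12, and a partition that starts at LBA 64 ends up misaligned when the jumper shifts it to LBA 65.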

This page will try to share some knowledge and offer good operating practices for ZFS on illumos.

The zpool create and zpool add Commands

The physical sector size is queried when the zpool create or zpool add commands are executed. In these cases, a new top-level virtual device (vdev) is created and its ashift value is set. If the disks in that vdev have mixed physical sector sizes, an error message is shown.

When 4KB physical sector disks (ashift = 12) are added to a pool containing 512 byte physical sector disks (ashift = 9), or vice versa, the resulting pool contains top-level vdevs with mixed sector sizes. ZFS functions properly with mixed-size top-level vdevs.

Note: attempting to replace disks with 512 byte physical sectors (or attach into a mirror made from such disks) with disks that only support 4KB logical sectors can fail, leading to operational issues with stocking spares.

Disk partitioning and alignment

TODO: describe usage of parted, fdisk and format to pre-create Solaris slices for rpool disks, and how to check the layout of a particular disk to see if alignment is right.

Also, the examples below represent disks with 512b native sector sizes. It would be informative if someone added similar examples of disks with 512e and 4kn sectors.

There are a number of command-line utilities that can be used to administer and automate the needed setup, as well as to inspect the disk layout and its alignment.

parted

The GNU partition editor parted which is, in particular, available in OI LiveCD media, can display and edit partitioning tables.

For example, here is a typical rpool disk created by the Solaris and OI installers – with an "msdos" (MBR) label and a further SMI slice table inside, and some strange-looking offset (16065 = 63*255 – a "cylinder" size) which does not offer very good 4K alignment:

And here is a typical data pool component disk, wholly given to ZFS, which has therefore created a GPT label and an 8MB buffer partition (to cater for future replacement disk size variations) with proper offsets:

parted can also be used to create partitions, for example, to prepare a roughly 250GB rpool following the example sizes above:

This properly aligned partition can then be used for an rpool – quite a useful trick to use during the OS installation (note that the current OpenIndiana installer will wipe any slice tables and other data in this partition).

Note that parted does not edit SMI labels; you need to use format to define slices (in the case of "msdos" MBR partition tables). With EFI (GPT) partitioning this slicing is irrelevant, as GPT partitions are directly usable by ZFS.

fdisk

The Solaris fdisk tool is of limited utility in this case, because it seems to create a (MBR) partitioning table aligned with odd legacy cylinder sizes.

It can also create an EFI (GPT) label with a Solaris partition starting at cylinder 0 and, likely, proper alignment.

format

The Solaris format tool is used to, among other things, edit the SMI label (Solaris slices in an MBR partition) or the EFI (GPT) partition table. It can specify offsets and sizes with 512-byte sector precision, but it cannot create the partitioning table itself and invokes fdisk for that. Here are examples of format displaying the layout of the same disks as in the parted example above:

prtvtoc

The Solaris prtvtoc tool can also display the slicing or partitioning table info along with some other details, and is often used in conjunction with the fmthard utility to clone partitioning information from a master disk to several others – a convenient step in creating uniform labeling for components of the same pool.

Examples for the same disks as above:

Overriding the Physical Block Size

Some operating systems provide a way to pass the desired ashift value to the zpool invocation, or even hard-code ashift=12 into a separately built zpool binary.

The illumos-gate did not, after long discussions, follow this "trivial" path. Instead, illumos-based OSes can now override the physical block (sector) size for "talking" to particular devices regardless of what they announce, by using a value configured in the sd(7D) driver. Once the device vendor and identification strings are known (as detailed below), the /kernel/drv/sd.conf file can be modified:

sd.conf additions to override ashift for ZFS
  • See here for some community-contributed entries for 512e drives: List of sd-config-list entries for Advanced-Format drives

  • You MUST reconfigure the device as per George Wilson's blog cited at the beginning of this page. For USB disks, George's method does not seem to work, but removing and replugging the USB drive does.
  • Note also that this change is not required to persist if you're immediately using the disk with overridden options to create or change a ZFS pool: after the ashift value gets into the Top-level VDEV label, it is used regardless of the driver options. This may be important on systems such as VMs, where you have (seemingly) identical disk models which you want to use in different manners.
    The change does need to persist if the disk usage will be delayed – for example, if the disk requiring an override is a designated hot-spare for a pool.
  • Other overrides can be set, see the sd(7D) man page. An increasingly useful tunable is the "power-condition:false" that can be set on "green" HDDs for which the vendor has enforced power savings mode in the HDD firmware configuration, so that they spin down too quickly for ZFS's regular TXG syncs or for a larger array's staggered spin-up routine.
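As a rough illustration of the shape of such an override, here is a Python sketch that formats one sd-config-list pair. It rests on assumptions that should be checked against sd(7D) and George Wilson's blog: that the identifier is the vendor string padded to 8 characters followed by the product string, and that the block size tunable is spelled "physical-block-size". The vendor "ACME" and product "GREEN4K" are hypothetical, and the helper name is made up for this page.

```python
def sd_config_entry(vendor: str, product: str, tunables: dict) -> str:
    """Format one '"VENDOR  PRODUCT", "name:value, ..."' pair for sd-config-list.

    Assumption: the vendor ID is padded with blanks to 8 characters, which is
    the width of the vendor field in a SCSI INQUIRY response.
    """
    if len(vendor) > 8:
        raise ValueError("vendor ID is at most 8 characters")
    device_id = vendor.ljust(8) + product
    # Booleans are written in lower case, as in "power-condition:false".
    settings = ", ".join(
        f"{name}:{str(value).lower() if isinstance(value, bool) else value}"
        for name, value in tunables.items()
    )
    return f'"{device_id}", "{settings}"'


if __name__ == "__main__":
    # Hypothetical drive identification strings, for illustration only.
    entry = sd_config_entry(
        "ACME", "GREEN4K",
        {"physical-block-size": 4096, "power-condition": False},
    )
    print(f"sd-config-list = {entry};")
```

Generating the line rather than typing it makes the blank padding between vendor and product explicit, which is easy to get wrong by hand.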

To enable this change on a running system, you need to reload the sd driver; keep in mind that in some cases a reboot may still be required:

Note: the illumos distribution builder can set the values by default for the known cases where overrides are appropriate.

Determining the drive's device vendor and identification strings

The format of the entries in the first column of sd-config-list, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.

The identifiers are the concatenation of the vendor and product strings returned by the SCSI inquiry command which can be accessed by several methods discussed below. The comparison code in sd.c strips leading and trailing blanks and elides repeated blanks. It converts both strings to upper case to make the comparison. If that comparison fails, it then attempts a wildcard match for a substring (according to the comment). However, the code is a bit odd and may not do what the comment states.
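The first stage of that comparison, as described above, can be approximated in Python. This is a sketch of the described behaviour, not a translation of sd.c: it assumes that "eliding repeated blanks" means collapsing runs of blanks into one, and it deliberately does not model the wildcard fallback, whose behaviour is uncertain. The identification strings in the example are hypothetical.

```python
def normalize_id(identifier: str) -> str:
    """Strip outer blanks, collapse runs of blanks, and upper-case the string."""
    return " ".join(identifier.split()).upper()


def ids_match(config_id: str, inquiry_id: str) -> bool:
    """Compare an sd-config-list identifier with a device's inquiry string."""
    return normalize_id(config_id) == normalize_id(inquiry_id)


if __name__ == "__main__":
    # Hypothetical strings: differing case and blank padding still match.
    print(ids_match("ACME    GREEN4K", " acme green4k  "))
    print(ids_match("ACME    GREEN4K", "ACME    BLUE4K"))
```

If this model is right, differences in padding or letter case between sd.conf and the inquiry data should not prevent a match, but a different model string will.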

Query with format

The identifiers can be accessed by "format -e" and "inquiry":

Query with iostat

As another example, the "iostat -Er" command can be used to query the needed strings. Here are sample outputs from several different hosts:

Query the kernel

Another suggested method was to query the kernel using a series of requests to the mdb -k debugger:

For instance, in my case I'm getting this output:

So my sd.conf (for some "green" drives) looks like this:

Observing Negotiated Block Sizes

Unfortunately, the classic Solaris tools for observing block or sector sizes do not show both the logical and physical block sizes. However, you can observe the values negotiated by the sd driver using mdb.

The unit number (un) corresponds to the SCSI driver (sd) instance number. In the above example, "un 0" is also known as "sd0." Sizes are in bytes, written as hex: 0x200 = 512, 0x1000 = 4096.
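To interpret a pair of hex values by hand, the following Python sketch converts them to bytes and names the combination using the 512e and 4kn terms from the introduction. It takes the two values as plain strings and does not parse real mdb output; the function name and the "512 native" label are made up for this page.

```python
def describe_block_sizes(logical_hex: str, physical_hex: str) -> str:
    """Turn hex block sizes such as '0x200' and '0x1000' into a short label."""
    logical = int(logical_hex, 16)
    physical = int(physical_hex, 16)
    if (logical, physical) == (512, 512):
        kind = "512 native"
    elif (logical, physical) == (512, 4096):
        kind = "512e"
    elif (logical, physical) == (4096, 4096):
        kind = "4kn"
    else:
        kind = "other"
    return f"logical={logical} physical={physical} ({kind})"


if __name__ == "__main__":
    print(describe_block_sizes("0x200", "0x200"))
    print(describe_block_sizes("0x200", "0x1000"))
    print(describe_block_sizes("0x1000", "0x1000"))
```

A disk that shows 0x200 for both values while being sold as an Advanced Format drive is a candidate for the sd.conf override described earlier.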

To inspect the ashift value actually used by ZFS for a particular pool, you can use zdb (without parameters it dumps labels of imported pools), for example:

Note that each pool can consist of one or several Top-level VDEVs, and each of those can have an individual composition of devices, i.e. not only mixing of mirrors and raidzN sets is possible, but also of devices with different hardware sector size and thus ashift as set at TLVDEV creation time. However, in order to avoid unbalanced IO and "unpleasant surprises" which might be difficult to explain and debug, it is discouraged to build pools from such mixtures.

Migration of older pools onto newer disks

... TODO: expand the text

In short: if the old pool was created with 512B-sectored drives, it is best not to replace its disks with 4KB-sectored ones, but to create a new pool with those disks and proper alignment, and copy over the data. The copy can be done in a number of ways, one of which is a "zfs send ... | zfs recv ..." run and another is a series of "rsync" invocations, perhaps with some manual labor to recreate the needed dataset hierarchy and apply ZFS properties, and even logically recreate the snapshot and cloning history. The "rsync" approach may be especially useful if you need to change the dataset layout as well, because over time people often find their old setup less optimal than they initially thought.

Note: It was reported on the mailing lists that rsync version 3.10 has finally added support for Solaris ACLs and extended attributes. Previously it was recommended to use Sun (not GNU) tar or cpio to transfer filesystems with such extended file/dir attributes and ACLs.


Thanks

Thanks go to active users of the openindiana-discuss and zfs-discuss mailing lists, including Richard Elling, George Wilson, Sašo Kiselkov, Edward Ned Harvey, Jim Klimov and countless others, including all those who ask questions so we can all receive answers.
