ZFS was designed in the early 2000s to anticipate trends in disk and device storage for the next generation of storage hardware. Unlike previous filesystems, which have fixed assumptions about the device sector size, ZFS is designed to accommodate different sector sizes. As the consumer hard disk (HDD) market migrates from 512 byte sectors to 4KB sectors, ZFS is ready to make the change. In some cases, 4KB physical sector disks are also called Advanced Format (AF) disks.
The method used by ZFS is to query the drive parameters using the mechanisms provided by the OS for device inquiries. In most cases, the device reports its physical and logical sector sizes. There are some nuances in the implementations. For example, the SATA protocol uses a sector size that is a multiple of 512 bytes. SCSI devices (SAS, FC, etc.) can have many different sizes, and it is not uncommon for vendors to offer 512, 520, or 528 bytes per sector. By 2010, many HDD manufacturers had begun offering consumer-grade disks with 4KB sector sizes. Going forward, some HDD vendors have deprecated sector sizes that are not powers of 2 (e.g., 520 or 528 bytes). ZFS allocates space using a logical sector size that is a power of 2 less than or equal to the physical sector size.
Unfortunately, some HDD manufacturers do not respond properly to device inquiries about sector size. ZFS looks to the physical sector size (aka physical block size) for its hint on how to optimize its use of the device. If the disk reports that the physical sector size is 512 bytes, then ZFS will use an internal sector size of 512 bytes. The problem is that some HDDs misrepresent 4KB sector disks as having a physical sector size of 512 bytes. The proper response should be that the logical sector size is 512 bytes and the physical sector size is 4KB. By 2011-2012, most HDDs were properly reporting logical and physical sector sizes. In some cases, HDD vendors advertise these disks as "emulating 512 byte sectors" or "512e."
There is no functional or reliability problem with 4KB physical sectors being represented as 512 byte logical sectors. This technique has been used for decades in computer systems to allow expansion of device or address sizes. The general name for the technique is read-modify-write: when you need to write 512 bytes (or anything less than the physical sector size), the device reads 4KB (the physical sector), modifies the data, and writes 4KB (because it cannot write anything smaller). For HDDs, the cost can be a whole revolution, or 8.33 ms for a 7,200 rpm disk. Thus the performance impact of read-modify-write can be severe, and even worse for slower, consumer-grade, 5,400 rpm or variable speed "green" drives. Bottom line: for best performance, the HDD needs to properly communicate its physical block size via the inquiry commands. Inside ZFS, the kernel variable that describes the sector size is ashift, the log2 of the sector size in bytes. A search of the ZFS archives finds many references to the ashift variable in discussions about sector sizes.
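The relationship between sector size and ashift can be checked with a quick shell sketch (illustrative only, not a ZFS command):

```shell
# ashift is log2 of the sector size:
# 512 byte sectors -> ashift 9; 4KB sectors -> ashift 12
for size in 512 4096; do
  ashift=0
  s=$size
  while [ "$s" -gt 1 ]; do
    s=$((s / 2))
    ashift=$((ashift + 1))
  done
  echo "sector size $size -> ashift $ashift"
done
```

This prints "sector size 512 -> ashift 9" and "sector size 4096 -> ashift 12".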
The issues experienced by ZFS implementers with 4KB sector HDDs can be summarized as:
- Some HDD models misrepresent their physical block size, resulting in unexpectedly poor performance for some workloads
- Attempting to replace an HDD that had 512 byte physical sectors with a new HDD that has 4KB logical and physical sectors can fail with a mismatched sector size error message
- Some, but not all, ZFS implementations offer command-line options to set the physical sector size, either in the zpool command or via other OS commands and configuration settings
- Older Solaris releases did not align default sector boundaries on 4KB boundaries, negatively impacting performance (fixed in illumos, later OpenSolaris releases, recent Solaris 10 updates, and Solaris 11)
- 4KB sector disks are not as space-efficient as 512 byte sector disks, in part because ZFS metadata is compressed, dynamically allocated, and often smaller than 4KB
This page will try to share some knowledge and offer good operating practices for ZFS on illumos.
zpool create and zpool add Commands
The physical sector size is queried when the zpool create or zpool add commands are executed. In these cases, a new top-level virtual device (vdev) is created and the ashift value is set. If the disks have mixed physical sector sizes, an error message is shown.
When adding 4KB physical sector disks (ashift = 12) to a pool containing 512 byte physical sector disks (ashift = 9), or vice versa, the resulting pool contains mixed sector size top-level vdevs. ZFS functions properly with mixed-size top-level vdevs.
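As a sketch (the device names c0t0d0 and c0t1d0 are placeholders; substitute your own disks), the ashift chosen for a new pool can be inspected with zdb after creation:

```shell
# Create a mirrored pool; the ashift is recorded per top-level vdev
zpool create tank mirror c0t0d0 c0t1d0

# Show the ashift for each vdev: 9 = 512 byte sectors, 12 = 4KB sectors
zdb tank | grep ashift
```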
Note: attempting to replace disks with 512 byte physical sectors with disks that only support 4KB logical sectors can fail, leading to operational issues with stocking spares.
Disk partitioning and alignment
... TODO: describe usage of format to pre-create Solaris slices for rpool disks, and how to check the layout of a particular disk to see if alignment is right.
Overriding the Physical Sector Size
illumos can override the physical sector size reported by a device via configuration of the sd(7d) driver. Once the device vendor and identification strings are known, the /kernel/drv/sd.conf file can be modified:
sd-config-list = "SEAGATE ST3300657SS", "physical-block-size:4096",
                 "DGC RAID", "physical-block-size:4096",
                 "NETAPP LUN", "physical-block-size:4096";
- The first string of each entry pair, roughly, defines a substring match against the Vendor Name and Device Model that the device returns in response to the appropriate inquiries.
- Other overrides can be set; see the sd(7D) man page. An increasingly useful tunable is "power-condition:false", which can be set for HDDs whose vendor has enabled power-savings mode in the HDD firmware configuration.
To enable this change on a running system, reload the sd driver configuration:
# update_drv -vf sd
Note: the illumos distribution builder can set the values by default for the known cases where overrides are appropriate.
Migration of older pools onto newer disks
... TODO: expand the text
In short: if the old pool was created with 512B-sectored drives, it is best not to replace its disks with 4KB-sectored ones, but rather to create a new pool with the new disks and copy over the data. The copy can be done in a number of ways, one of which is a "zfs send ... | zfs recv ..." run.
Thanks go to active users of the zfs-discuss mailing lists, including Richard Elling, George Wilson, Saso Kiselkov, Edward Ned Harvey, Jim Klimov and uncountable others – including all those who ask questions so we can all receive answers.
- Using GNOP to emulate 4KB blocks over 512B (to create the pool via FreeBSD, such as a "zfsguru" LiveCD; you can then reboot into OI):
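A sketch of that approach (the device name ada0 is a placeholder for your disk):

```shell
# On FreeBSD: layer a gnop device over the 512B disk that reports
# 4096-byte sectors
gnop create -S 4096 /dev/ada0

# Create the pool on the .nop device so ZFS selects ashift=12
zpool create tank /dev/ada0.nop

# Export the pool and remove the gnop layer; the pool's ashift is
# permanent, so the pool can now be imported directly (e.g. after
# rebooting into OpenIndiana)
zpool export tank
gnop destroy /dev/ada0.nop
zpool import tank
```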