George Wilson's talk at ZFS Day, and a related blog post with examples: http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/
ZFS was designed in the early 2000s to be aware of trends in disk and device storage for the next generation of storage hardware. Unlike previous filesystems that have fixed assumptions about the device sector size, ZFS is designed to accommodate different sector sizes. As the consumer hard disk (HDD) market migrates from 512-byte sectors to 4KB sectors, ZFS is ready to make the change. In some cases, 4KB physical sector disks are also called Advanced Format (AF) disks.
The method used by ZFS is to query the drive parameters using the mechanisms provided by the OS for device inquiries. In most cases, the device reports the physical and logical sector size. There are some nuances in the implementations. For example, the SATA protocol uses a sector size that is a multiple of 512 bytes. SCSI devices (SAS, FC, etc.) can have many different sizes, and it is not uncommon for vendors to offer 512, 520, or 528 bytes per sector. By 2010, many HDD manufacturers had started offering consumer-grade disks with 4KB sector sizes. Going forward, some HDD vendors have deprecated the use of sector sizes that are not powers of 2 (e.g. 520 or 528 bytes). ZFS allocates space using a logical sector size that is a power of 2 that is less than or equal to the physical sector size.
Unfortunately, some HDDs do not respond properly to device inquiries about sector sizes. ZFS looks to the physical sector size (aka physical block size) for its hint on how to optimize its use of the device. If the disk reports that the physical sector size is 512 bytes, then ZFS will use an internal sector size of 512 bytes. The problem is that some HDDs misrepresent 4KB sector disks as having a physical sector size of 512 bytes. The proper response should be that the logical sector size is 512 bytes and the physical sector size is 4KB. By 2011-2012, most HDDs were properly reporting logical and physical sector sizes. In some cases, the HDD vendors advertise the disks as "emulating 512-byte sectors" or "512e"; the expected name for disks which advertise a 4KB sector size is "4kn" (for "4k native").
There is no functional or reliability problem with 4KB physical sectors being represented as 512 byte logical sectors. This technique has been used for decades in computer systems to allow expansion of device or address sizes. The general name for the technique is read-modify-write: when you need to write 512 bytes (or less than the physical sector size) then the device reads 4KB (the physical sector), modifies the data, and writes 4KB (because it can't write anything smaller). For HDDs, the cost can be a whole revolution, or 8.33 ms for a 7,200 rpm disk. Thus the performance impact for read-modify-write can be severe, and even worse for slower, consumer-grade, 5,400 rpm or variable speed "green" drives.
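The revolution times quoted above follow from simple arithmetic. As a minimal sketch (the function name is ours, not something from ZFS or illumos), assuming the worst case where a read-modify-write waits for one full platter revolution:

```python
def full_rotation_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    # 60,000 ms per minute divided by revolutions per minute.
    return 60_000.0 / rpm


for rpm in (7200, 5400):
    print(f"{rpm} rpm: {full_rotation_ms(rpm):.2f} ms per revolution")
```

This only models the rotational wait; seek time and firmware caching are ignored.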
Bottom line: for best performance, the HDD needs to properly communicate the physical block size via the inquiry commands.
Inside ZFS, the kernel variable that describes the physical sector size is ashift, which holds log2(physical sector size in bytes). A search of the ZFS archives will find references to the ashift variable in discussions about sector sizes.
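To make the relationship concrete, here is a small sketch of the arithmetic only (it is not the kernel code, and the function name is hypothetical). It assumes the sector size is already a power of 2 and rejects anything else:

```python
def ashift_for(sector_size: int) -> int:
    """Return log2(sector_size) for a power-of-2 sector size in bytes."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError(f"{sector_size} is not a power of 2")
    # For a power of 2, bit_length() - 1 is exactly log2.
    return sector_size.bit_length() - 1


for size in (512, 4096):
    print(f"{size}-byte sectors -> ashift = {ashift_for(size)}")
```

Going the other way, the sector size in bytes is `1 << ashift`.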
The issues experienced by ZFS implementers and the 4KB sector size HDDs are summarized as:
zpool command or via other OS commands and configuration settings
This page will try to share some knowledge and offer good operating practices for ZFS on illumos.
The physical sector size is queried when the zpool create or zpool add commands are executed. In these cases, a new top-level virtual device (vdev) is created and the ashift value is set. If the disks have mixed physical sector sizes, then an error message is shown.
When adding 4KB physical sector size disks (ashift = 12) to a pool containing 512 byte physical sector disks (ashift = 9), or vice-versa, the resulting pool contains mixed sector size top-level vdevs. ZFS functions properly with mixed-size top-level vdevs.
Note: attempting to replace disks with 512 byte physical sectors (or attach into a mirror made from such disks) with disks that only support 4KB logical sectors can fail, leading to operational issues with stocking spares.
TODO: describe usage of
Also, the examples below represent disks with 512b native sector sizes. It would be informative if someone added similar examples of disks with 512e and 4kn sectors.
There are a number of command-line utilities that can be used to administer and automate the needed setup, as well as inspection, of the disk layout and its alignment.
The GNU partition editor parted, which is available (in particular) on the OI LiveCD media, can display and edit partitioning tables.
For example, here is a typical rpool disk created by the Solaris and OI installers, with an "msdos" (MBR) label, a further SMI slice table inside, and a strange-looking offset (16065 = 63*255, a "cylinder" size) which does not offer very good 4K alignment:
# parted /dev/rdsk/c5t4d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t4d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End         Size        Type     File system  Flags
 1      16065s  488375999s  488359935s  primary  solaris      boot
And here is a typical data pool component disk, wholly given to ZFS, which has therefore created a GPT label and an 8MB buffer partition (to cater for future replacement disk size variations) with proper offsets:
# parted /dev/rdsk/c5t1d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t1d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start       End         Size        File system  Name  Flags
 1      256s        488374207s  488373952s               zfs
 9      488374208s  488390591s  16384s
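The difference between the two start offsets above can be checked with a little arithmetic. The following is a minimal sketch (the helper name is hypothetical) that assumes 512-byte logical sectors, as reported by parted, and asks whether a partition start falls on a 4KB boundary:

```python
def is_aligned(start_sector: int,
               logical_sector_size: int = 512,
               physical_sector_size: int = 4096) -> bool:
    """True if the byte offset of start_sector is a multiple of the physical sector size."""
    return (start_sector * logical_sector_size) % physical_sector_size == 0


# Start sectors taken from the two parted listings above.
for label, start in (("msdos rpool partition", 16065),
                     ("GPT zfs partition", 256),
                     ("GPT reserved partition", 488374208)):
    verdict = "aligned" if is_aligned(start) else "NOT aligned"
    print(f"{label}: start {start}s is {verdict} to 4KB")
```

This only checks the partition start; slices defined inside an MBR partition add their own offset, which would need the same check.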
parted can also be used to create partitions, for example, to prepare a roughly 250GB rpool following the example sizes above:
# parted /dev/rdsk/c2t0d0p0 mklabel msdos
# parted /dev/rdsk/c2t0d0p0 uni s mkpart pri solaris 256 488373952
# parted /dev/rdsk/c2t0d0p0 set 1 boot on
This properly aligned partition can then be used for an rpool, which is quite a useful trick during the OS installation (note that the current OpenIndiana installer will wipe any slice tables and other data in this partition).
parted does not edit SMI labels; you need to use format to define slices (in case of "msdos" MBR partition tables). In case of EFI (GPT) partitioning this slicing is irrelevant, as these GPT partitions are directly usable by ZFS.
The fdisk tool is of limited utility in this case, because it seems to create an MBR partitioning table aligned with odd legacy cylinder sizes.
It can also create an EFI (GPT) label with a Solaris partition starting at cylinder 0 and likely proper alignment.
The format tool is used to, among other things, edit the SMI label (Solaris slices in an MBR partition) or the EFI (GPT) partition table. It can also specify offsets and sizes with 512-byte sector precision, but it cannot create the partitioning table itself and invokes fdisk for that. Here are examples of format displaying the layout of the same disks as in the parted example above:
# format c5t4d0p0
selecting c5t4d0p0
[disk formatted]
/dev/dsk/c5t4d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
...
format> p
...
partition> p
Current partition table (original):
Total disk cylinders available: 30397 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       1 - 30396      232.85GB    (30396/0/0) 488311740
  1 unassigned    wm       0                0         (0/0/0)             0
  2     backup    wm       0 - 30396      232.85GB    (30397/0/0) 488327805
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wu       0                0         (0/0/0)             0

# format c5t1d0p0
selecting c5t1d0p0
[disk formatted]
/dev/dsk/c5t1d0s0 is part of active ZFS pool pond. Please see zpool(1M).
...
format> p
...
partition> p
Current partition table (original):
Total disk sectors available: 488374207 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm               256      232.87GB          488374207
  1 unassigned    wm                 0           0               0
  2 unassigned    wm                 0           0               0
  3 unassigned    wm                 0           0               0
  4 unassigned    wm                 0           0               0
  5 unassigned    wm                 0           0               0
  6 unassigned    wm                 0           0               0
  8   reserved    wm         488374208        8.00MB          488390591
The prtvtoc tool can also display the slicing or partitioning table info along with some other details, and is often used in conjunction with the fmthard utility to clone partitioning information from a master disk to several others, a convenient step in creating uniform labeling for components of the same pool.
Examples for the same disks as above:
# prtvtoc /dev/dsk/c5t4d0p0
* /dev/dsk/c5t4d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*   30399 cylinders
*   30397 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           0     16065     16064
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00      16065 488311740 488327804
       2      5    00          0 488327805 488327804
       8      1    01          0     16065     16064

# prtvtoc /dev/dsk/c5t1d0p0
* /dev/dsk/c5t1d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
* 488390625 sectors
* 488390558 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*          34       222       255
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00        256 488373952 488374207
       8     11    00  488374208     16384 488390591
Some operating systems provide a way to pass the desired ashift value to the zpool invocation, or even hard-code ashift=12 into a separately built binary.
illumos-gate did not, after long discussions, follow this "trivial" path. Instead, illumos-based OSes can now override the physical block (sector) size for "talking" to particular devices regardless of what they announce, by using a value configured in the sd(7D) driver. Once the device vendor and identification strings are known (as detailed below), the /kernel/drv/sd.conf file can be modified:
sd-config-list =
    "SEAGATE ST3300657SS", "physical-block-size:4096",
    "DGC RAID", "physical-block-size:4096",
    "NETAPP LUN", "physical-block-size:4096";
For some community-contributed entries for 512e drives, see: List of sd-config-list entries for Advanced-Format drives
Once the ashift value gets into the top-level vdev label, it is used regardless of the driver options. This may be important on systems such as VMs, where you have (seemingly) identical disk models which you want to use in different manners.
Other tunables are described in the sd(7D) man page. An increasingly useful tunable is "power-condition:false", which can be set on "green" HDDs for which the vendor has enforced power savings mode in the HDD firmware configuration, so that they spin down too quickly for ZFS's regular TXG syncs or for a larger array's staggered spin-up routine.
To enable this change on a running system, you need to reload the sd driver; keep in mind that in some cases a reboot may still be required:
# update_drv -vf sd
Note: the illumos distribution builder can set the values by default for the known cases where overrides are appropriate.
The format of the entries in the first column of sd-config-list, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.
The identifiers are the concatenation of the vendor and product strings returned by the SCSI inquiry command, which can be accessed by several methods discussed below. The comparison code in sd.c strips leading and trailing blanks and elides repeated blanks. It converts both strings to upper case to make the comparison. If that comparison fails, it then attempts a wildcard match for a substring (according to the comment). However, the code is a bit odd and may not do what the comment states.
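The first comparison step described above can be illustrated with a short sketch. This is not the sd.c code; it is a Python approximation of the described behaviour (strip, collapse repeated blanks, upper-case, then compare), with hypothetical function names, and it deliberately leaves out the wildcard fallback since its exact behaviour is uncertain:

```python
def normalize(identifier: str) -> str:
    """Strip leading/trailing blanks, collapse repeated blanks, upper-case."""
    return " ".join(identifier.split()).upper()


def ids_match(config_key: str, vendor: str, product: str) -> bool:
    """Compare an sd-config-list key against raw inquiry vendor and product fields.

    Assumption: vendor and product are passed as returned by the device,
    including their trailing blank padding, and are simply concatenated.
    """
    return normalize(config_key) == normalize(vendor + product)


# Padded strings as they might be returned by a SCSI inquiry.
print(ids_match("SEAGATE ST3300657SS", "SEAGATE ", "ST3300657SS     "))
print(ids_match("seagate   st3300657ss", "SEAGATE ", "ST3300657SS     "))
print(ids_match("SEAGATE ST2000NM0001", "SEAGATE ", "ST3300657SS     "))
```

A key that passes this check should match exactly; a key that fails it may still match through the wildcard path in the real driver.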
The identifiers can be accessed with "format -e" and its inquiry command:
# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c3t0d0 <DEFAULT cyl 8921 alt 2 hd 255 sec 63>
          /pci@0,0/pci1022,7458@11/pci1000,3060@4/sd@0,0
Specify disk (enter its number): 0
selecting c3t0d0
[disk formatted]
...
        inquiry    - show vendor, product and revision
...
format> inq
Vendor:   SEAGATE
Product:  ST973401LSUN72G
Revision: 0556
format> ^D
As another example, the "iostat -Er" command can be used to query the needed strings. Here are sample outputs from several different hosts:
# iostat -Er | grep -i vendor | sort | uniq
Vendor: ,Product: Virtual CDROM ,Revision: 1.00 ,Serial No:
Vendor: Intel ,Product: Multi-Flex ,Revision: 0307 ,Serial No: 4C2020200000000
Vendor: ATA ,Product: ST3320620AS ,Revision: K ,Serial No:
Vendor: AMI ,Product: Virtual CDROM ,Revision: 1.00 ,Serial No:
Vendor: AMI ,Product: Virtual Floppy ,Revision: 1.00 ,Serial No:
Vendor: ATA ,Product: Hitachi HUA72303 ,Revision: A580 ,Serial No:
Vendor: ATA ,Product: SEAGATE ST32500N ,Revision: 3AZQ ,Serial No:
Vendor: AMI ,Product: Virtual CDROM ,Revision: 1.00 ,Serial No:
Vendor: AMI ,Product: Virtual Floppy ,Revision: 1.00 ,Serial No:
Vendor: FUJITSU ,Product: MAY2073RCSUN72G ,Revision: 0501 ,Serial No: 0729S0C9HV
Vendor: MATSHITA ,Product: CD-RW CW-8124 ,Revision: DZ13 ,Serial No:
Vendor: SEAGATE ,Product: ST973401LSUN72G ,Revision: 0556 ,Serial No: 071611N3BL
Vendor: SEAGATE ,Product: ST973401LSUN72G ,Revision: 0556 ,Serial No: 071611N3CQ
Vendor: SEAGATE ,Product: ST973402SSUN72G ,Revision: 0400 ,Serial No: 0716216393
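When many hosts need to be surveyed, lines like those above can be split into fields programmatically. The following is a sketch only; it assumes the "Vendor: ... ,Product: ... ,Revision: ... ,Serial No: ..." layout exactly as shown in the sample output, and the function name is hypothetical:

```python
import re

# Field layout assumed from the sample "iostat -Er" lines above.
LINE_RE = re.compile(
    r"Vendor:\s*(?P<vendor>.*?)\s*,Product:\s*(?P<product>.*?)\s*"
    r",Revision:\s*(?P<revision>.*?)\s*,Serial No:\s*(?P<serial>.*)$"
)


def parse_inquiry_line(line: str):
    """Return a dict of vendor/product/revision/serial, or None if the line does not match."""
    match = LINE_RE.search(line.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


sample = "Vendor: SEAGATE ,Product: ST973401LSUN72G ,Revision: 0556 ,Serial No: 071611N3BL"
fields = parse_inquiry_line(sample)
# Candidate first-column key for sd-config-list: vendor and product joined by a blank.
print(f'"{fields["vendor"]} {fields["product"]}"')
```

The resulting vendor and product pair is what would go into the first column of an sd-config-list entry; empty vendor or serial fields simply come back as empty strings.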
Another suggested method was to query the kernel using a series of requests to the mdb -k debugger:
# echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | \
    ::print struct scsi_device sd_inq | ::print struct scsi_inquiry \
    inq_vid inq_pid" | mdb -k
For instance, in my case I'm getting this output:
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST2000NM0001 " ]
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST3300657SS " ]
...
The corresponding sd.conf (for some "green" drives) looks like this:
sd-config-list =
    "SEAGATE ST3300657SS", "power-condition:false",
    "SEAGATE ST2000NM0001", "power-condition:false";
Unfortunately, the classic Solaris tools for observing block or sector sizes do not show both the logical and physical block sizes. However, you can observe the values negotiated by the sd driver using:
# echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'
un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x200
...
The unit number (un) corresponds to the SCSI driver (sd) instance number. In the above example, "un 0" is also known as "sd0". Sizes are in bytes, written as hex: 0x200 = 512, 0x1000 = 4096.
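For quick reading of such output, the hex values can be converted to decimal bytes. This is a small sketch (hypothetical helper name) that assumes lines of the form "name = 0x..." as in the sample above and ignores everything else:

```python
def parse_blocksizes(text: str) -> dict:
    """Map each *_blocksize field name to its size in bytes."""
    sizes = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name.endswith("_blocksize"):
            # int(..., 16) accepts the leading "0x" prefix.
            sizes[name] = int(value.strip(), 16)
    return sizes


sample = """un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x200
"""
for name, size in parse_blocksizes(sample).items():
    print(f"{name}: {size} bytes")
```

A disk that reports (or is overridden to) a 4KB physical sector would show un_phy_blocksize = 0x1000, i.e. 4096 bytes.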
To inspect the ashift value actually used by ZFS for a particular pool, you can use zdb (without parameters it dumps labels of imported pools), for example:
# zdb | egrep 'ashift| name'
name='pond'
ashift=9
ashift=9
ashift=9
ashift=9
ashift=9
ashift=9
ashift=9
ashift=9
ashift=9
name='rpool'
ashift=9
name='temp'
ashift=9
Note that each pool can consist of one or several top-level vdevs (TLVDEVs), and each of those can have an individual composition of devices. That is, it is possible to mix not only mirrors and raidzN sets, but also devices with different hardware sector sizes and thus different ashift values, as set at TLVDEV creation time. However, in order to avoid unbalanced IO and "unpleasant surprises" which might be difficult to explain and debug, building pools from such mixtures is discouraged.
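Spotting such mixtures by eye gets tedious on hosts with many pools. As a sketch (hypothetical helper name), assuming the filtered zdb output has the name='...' and ashift=N line format shown in the example above, the ashift values can be grouped per pool like this:

```python
def ashifts_by_pool(text: str) -> dict:
    """Group the ashift values seen after each name='pool' line."""
    pools = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("name="):
            current = line.split("=", 1)[1].strip("'\"")
            pools.setdefault(current, set())
        elif line.startswith("ashift=") and current is not None:
            pools[current].add(int(line.split("=", 1)[1]))
    return pools


sample = """name='pond'
ashift=9
ashift=9
name='rpool'
ashift=9
"""
for pool, values in ashifts_by_pool(sample).items():
    status = "mixed" if len(values) > 1 else "uniform"
    print(f"{pool}: ashift values {sorted(values)} ({status})")
```

A pool reported as "mixed" has top-level vdevs with different ashift values and deserves a closer look.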
... TODO: expand the text
In short: if the old pool was created with 512B-sectored drives, it is best not to replace its disks with 4KB-sectored ones, but to create a new pool with those disks and proper alignment, and copy over the data. The copy can be done in a number of ways, one of which is a "zfs send ... | zfs recv ..." run and another is a series of "rsync" invocations, perhaps with some manual labor to recreate the needed dataset hierarchy and apply ZFS properties, and even logically recreate the snapshot and cloning history. The "rsync" approach may be especially useful if you need to change the dataset layout as well, because over time people often find their old setup not as optimal as they initially thought.
Note: It was reported on the mailing lists that rsync version 3.10 has finally added support for Solaris ACLs and extended attributes. Previously it was recommended to use Sun (not GNU) cpio to transfer filesystems with such extended file/dir attributes and ACLs.
Thanks go to active users of the zfs-discuss mailing lists, including Richard Elling, George Wilson, Sašo Kiselkov, Edward Ned Harvey, Jim Klimov and countless others, including all those who ask questions so we can all receive answers.