George Wilson's talk at ZFS Day, and a related blog post with examples: http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/
ZFS was designed in the early 2000s to be aware of trends in disk and device storage for the next generation of storage hardware. Unlike previous filesystems that have fixed assumptions about the device sector size, ZFS is designed to accommodate different sector sizes. As the consumer hard disk (HDD) market migrates from 512 byte sectors to 4KB sectors, ZFS is ready to make the change. In some cases, 4KB physical sector disks are also called Advanced Format (AF) disks.
Unfortunately, some HDDs do not respond properly to device inquiries about their sector sizes. ZFS looks to the physical sector size (aka physical block size) for its hint on how to optimize its use of the device. If the disk reports that the physical sector size is 512 bytes, then ZFS will use an internal sector size of 512 bytes. The problem is that some HDDs misrepresent 4KB sector disks as having a physical sector size of 512 bytes. The proper response should be that the logical sector size is 512 bytes and the physical sector size is 4KB. By 2011-2012, most HDDs were properly reporting logical and physical sector sizes. In some cases, the HDD vendors advertise the disks as "emulating 512 byte sectors" or "512e", while disks that advertise a native 4KB sector size are expected to be called "4kn" (for "4k native").
There is no functional or reliability problem with 4KB physical sectors being represented as 512 byte logical sectors. This technique has been used for decades in computer systems to allow expansion of device or address sizes. The general name for the technique is read-modify-write: when you need to write 512 bytes (or less than the physical sector size) then the device reads 4KB (the physical sector), modifies the data, and writes 4KB (because it can't write anything smaller). For HDDs, the cost can be a whole revolution, or 8.33 ms for a 7,200 rpm disk. Thus the performance impact for read-modify-write can be severe, and even worse for slower, consumer-grade, 5,400 rpm or variable speed "green" drives.
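The rotational cost quoted above follows directly from the spindle speed. A minimal sketch of the arithmetic, assuming the worst case of one full extra revolution per read-modify-write (the function name is illustrative only):

```python
def rotation_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


# Worst case assumed here: the drive waits one whole revolution
# between reading the 4KB sector and writing it back.
for rpm in (7200, 5400):
    print(f"{rpm} rpm: up to {rotation_ms(rpm):.2f} ms per read-modify-write")
```

This is only a back-of-the-envelope model; real drives may hide part of the cost with caching.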
Bottom line: for best performance, the HDD needs to properly communicate its physical block size via the inquiry commands.
Inside ZFS, the kernel variable that describes the physical sector size is ashift, which holds log2(physical sector size in bytes). A search of the ZFS archives finds references to the ashift variable in discussions about sector sizes.
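The relation between sector size and ashift can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, not ZFS code, and the function name is made up for the example:

```python
def ashift_for(sector_bytes: int) -> int:
    """Return log2 of a power-of-two sector size given in bytes."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


print(ashift_for(512))   # 512 byte sectors
print(ashift_for(4096))  # 4KB sectors
```

So a pool on 512 byte sectors has ashift=9, and a pool on 4KB sectors has ashift=12.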
- Some HDD models misrepresent their physical block size, resulting in unexpectedly poor performance for some workloads
- Attempting to replace an HDD that had 512 byte physical sectors with a new HDD that has 4KB logical and physical sectors can fail with a mismatched sector size error message
- Some, but not all, ZFS implementations offer command-line options to set the physical sector size, either in the zpool command or via other OS commands and configuration settings
- Older Solaris releases did not set the default sector boundaries on 4KB boundaries, negatively impacting performance (fixed in illumos, later OpenSolaris releases, Solaris 10 recent updates, and Solaris 11)
- 4KB sector disks are not as space-efficient as 512 byte sector disks, in part because ZFS metadata is compressed, dynamically allocated, and often less than 4KB physical size
- As George Wilson notes in his ZFS Day talk (see above), certain HDDs include an "XP Jumper" that offsets LBA addresses by 1, so that Windows XP partitions, which by default start at a logical offset of 63 512-byte blocks, in fact become 4KB-aligned (starting at a physical offset of 64 512-byte blocks). However, the presence of such a jumper can hurt advanced users who take care to define 4KB-aligned partitions, because these are then shifted by 1 into misaligned hardware I/Os. So, if you see unexpectedly bad performance despite your tuning, keep this possibility in mind.
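The space-efficiency point in the list above can be made concrete with a simplified model: assume that a block's on-disk allocation is its compressed size rounded up to a whole number of sectors of 2^ashift bytes. This sketch ignores raidz parity and padding, and the function name and the 1536-byte example block are assumptions for illustration:

```python
def allocated_bytes(compressed_size: int, ashift: int) -> int:
    """Round a block's compressed size up to whole 2**ashift-byte sectors."""
    sector = 1 << ashift
    return -(-compressed_size // sector) * sector  # ceiling division


# Hypothetical metadata block that compresses to 1536 bytes.
for ashift in (9, 12):
    print(f"ashift={ashift}: {allocated_bytes(1536, ashift)} bytes allocated")
```

Under these assumptions the same small block takes 1536 bytes with ashift=9 but a full 4096 bytes with ashift=12.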
This page will try to share some knowledge and offer good operating practices for ZFS on illumos.
Note: attempting to replace disks that have 512 byte physical sectors with disks that only support 4KB logical sectors (or to attach such a disk to a mirror made from 512 byte disks) can fail, which leads to operational issues with stocking spares.
Disk partitioning and alignment
TODO: describe usage
Overriding the Physical Sector Size
Also, the examples below represent disks with native 512-byte sectors. It would be informative if someone added similar examples for disks with 512e and 4kn sectors.
There are a number of command-line utilities that can be used to administer and automate the needed setup, as well as to inspect the disk layout and its alignment.
The GNU partition editor parted, which is available in particular on OI LiveCD media, can display and edit partitioning tables.
For example, here is a typical rpool disk created by the Solaris and OI installers – with an "msdos" (MBR) label and a further SMI slice table inside, and a strange-looking offset (16065 = 63*255 – a "cylinder" size) which does not offer very good 4K alignment:
# parted /dev/rdsk/c5t4d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t4d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End         Size        Type     File system  Flags
 1      16065s  488375999s  488359935s  primary  solaris      boot
And here is a typical data pool component disk, wholly given to ZFS – so ZFS has created a GPT label and an 8MB buffer partition (to cater for future replacement disk size variations) with proper offsets:
# parted /dev/rdsk/c5t1d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t1d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start       End         Size        File system  Name  Flags
 1      256s        488374207s  488373952s               zfs
 9      488374208s  488390591s  16384s
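Whether a partition start is 4KB-aligned can be checked with simple arithmetic on the start sector reported by parted. The following is a minimal sketch, assuming 512-byte logical sectors as in the listings above; the function name is illustrative:

```python
def is_4k_aligned(start_sector: int, logical_bytes: int = 512) -> bool:
    """True if the partition's byte offset is a multiple of 4096."""
    return (start_sector * logical_bytes) % 4096 == 0


# Start sectors taken from the two parted listings above.
for start in (16065, 256):
    state = "aligned" if is_4k_aligned(start) else "NOT aligned"
    print(f"start sector {start}: {state} to 4KB")
```

The installer-created MBR partition at sector 16065 fails the check, while the ZFS-created GPT partition at sector 256 passes it.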
parted can also be used to create partitions, for example, to prepare a roughly 250GB rpool following the example sizes above:
# parted /dev/rdsk/c2t0d0p0 mklabel msdos
# parted /dev/rdsk/c2t0d0p0 uni s mkpart pri solaris 256 488373952
# parted /dev/rdsk/c2t0d0p0 set 1 boot on
This properly aligned partition can then be used for an rpool – quite a useful trick during OS installation (note that the current OpenIndiana installer will wipe any slice tables and other data in this partition).
parted does not edit SMI labels; you need to use format to define slices (in the case of "msdos" MBR partition tables). In the case of EFI (GPT) partitioning this slicing is irrelevant, as GPT partitions are directly usable by ZFS.
The fdisk tool is of limited utility in this case, because it seems to create an (MBR) partitioning table aligned with odd legacy cylinder sizes.
It can also create an EFI (GPT) label with a Solaris partition starting at cylinder 0 and likely proper alignment.
The format tool is used to, among other things, edit the SMI label (Solaris slices in an MBR partition) or the EFI (GPT) partition table. It can also specify offsets and sizes with 512-byte sector precision, but it can't create the partitioning table itself and invokes fdisk for that. Here are examples of format displaying the layout of the same disks as in the parted example above:
# format c5t4d0p0
selecting c5t4d0p0
[disk formatted]
/dev/dsk/c5t4d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
...
format> p
...
partition> p
Current partition table (original):
Total disk cylinders available: 30397 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       1 - 30396      232.85GB    (30396/0/0) 488311740
  1 unassigned    wm       0                0         (0/0/0)             0
  2     backup    wm       0 - 30396      232.85GB    (30397/0/0) 488327805
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wu       0                0         (0/0/0)             0

# format c5t1d0p0
selecting c5t1d0p0
[disk formatted]
/dev/dsk/c5t1d0s0 is part of active ZFS pool pond. Please see zpool(1M).
...
format> p
...
partition> p
Current partition table (original):
Total disk sectors available: 488374207 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm               256      232.87GB          488374207
  1 unassigned    wm                 0           0               0
  2 unassigned    wm                 0           0               0
  3 unassigned    wm                 0           0               0
  4 unassigned    wm                 0           0               0
  5 unassigned    wm                 0           0               0
  6 unassigned    wm                 0           0               0
  8   reserved    wm         488374208        8.00MB          488390591
The prtvtoc tool can also display the slicing or partitioning table info along with some other details, and is often used in conjunction with the fmthard utility to clone partitioning information from a master disk to several others – a convenient step in creating uniform labeling for components of the same pool.
Examples for the same disks as above:
# prtvtoc /dev/dsk/c5t4d0p0
* /dev/dsk/c5t4d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*   30399 cylinders
*   30397 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           0     16065     16064
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00      16065 488311740 488327804
       2      5    00          0 488327805 488327804
       8      1    01          0     16065     16064

# prtvtoc /dev/dsk/c5t1d0p0
* /dev/dsk/c5t1d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
* 488390625 sectors
* 488390558 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*          34       222       255
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00        256 488373952 488374207
       8     11    00  488374208     16384 488390591
Overriding the Physical Block Size
Some operating systems provide a way to pass the desired ashift value to the zpool invocation, or even hard-code ashift=12 into a separately built zpool binary.
illumos-gate did not, after long discussions, follow this "trivial" path. Instead, illumos-based OSes can now override the physical block (sector) size used for "talking" to particular devices, regardless of what they announce, by using a value configured in the sd(7D) driver. Once the device vendor and identification strings are known (as detailed below), the /kernel/drv/sd.conf file can be modified:
sd-config-list =
    "SEAGATE ST3300657SS", "physical-block-size:4096",
    "DGC RAID", "physical-block-size:4096",
    "NETAPP LUN", "physical-block-size:4096";
- The format of the entries in the first column, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.
Some community-contributed entries for 512e drives are collected in: List of sd-config-list entries for Advanced-Format drives
- You MUST reconfigure the device as per George Wilson's blog cited at the beginning of this page. For USB disks, George's method does not seem to work; however, unplugging and replugging the USB drive does.
- Note also that this change does not need to persist if you immediately use the disk with the overridden options to create or change a ZFS pool: once the ashift value is written into the top-level VDEV label, it is used regardless of the driver options. This may be important on systems such as VMs, where you have (seemingly) identical disk models which you want to use in different ways.
The change does need to persist if use of the disk will be delayed – for example, if the disk requiring an override is a designated hot spare for a pool.
- Other overrides can be set; see the sd(7D) man page. An increasingly useful tunable is "power-condition:false", which can be set on "green" HDDs for which the vendor has enabled an enforced power-saving mode in the HDD firmware configuration, so that they spin down too quickly for ZFS's regular TXG syncs or for a larger array's staggered spin-up routine.
To enable this change on a running system, you need to reload the sd driver; keep in mind that in some cases a reboot may still be required:
# update_drv -vf sd
Note: the illumos distribution builder can set the values by default for the known cases where overrides are appropriate.
Determining the drive's device vendor and identification strings
The format of the entries in the first column of sd-config-list, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.
The identifiers are the concatenation of the vendor and product strings returned by the SCSI inquiry command, which can be accessed by several methods discussed below. The comparison code in sd.c strips leading and trailing blanks and elides repeated blanks. It converts both strings to upper case to make the comparison. If that comparison fails, it then attempts a wildcard match for a substring (according to the comment). However, the code is a bit odd and may not do what the comment states.
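The matching rules described above can be approximated in a few lines. This is a hedged sketch of the described behavior (strip, collapse repeated blanks, upper-case, then fall back to a substring test), not a port of sd.c, which as noted may behave differently; the function names and the blank-padded sample strings are assumptions for illustration:

```python
def normalize_inquiry(text: str) -> str:
    """Strip outer blanks, collapse repeated blanks, convert to upper case."""
    return " ".join(text.split()).upper()


def entry_matches(entry: str, vendor_raw: str, product_raw: str) -> bool:
    """Compare an sd-config-list key with the raw (blank-padded) inquiry fields."""
    device = normalize_inquiry(vendor_raw + product_raw)
    wanted = normalize_inquiry(entry)
    # Exact match first, then the substring fallback the comment describes.
    return device == wanted or wanted in device


print(entry_matches("SEAGATE ST3300657SS", "SEAGATE ", "ST3300657SS     "))
print(entry_matches("NETAPP LUN", "SEAGATE ", "ST3300657SS     "))
```

A quick check like this can help spot typos in an entry before editing sd.conf, but the driver itself remains the final authority on what matches.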
Query with format
The identifiers can be accessed by running "format -e" and using its "inquiry" command:
# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c3t0d0 <DEFAULT cyl 8921 alt 2 hd 255 sec 63>
          /pci@0,0/pci1022,7458@11/pci1000,3060@4/sd@0,0
Specify disk (enter its number): 0
selecting c3t0d0
[disk formatted]
...
        inquiry    - show vendor, product and revision
...
format> inq
Vendor:   SEAGATE
Product:  ST973401LSUN72G
Revision: 0556
format> ^D
Query with iostat
For another example, the "iostat -Er" command can be used to query the needed strings; here are sample outputs from several different hosts:
# iostat -Er | grep -i vendor | sort | uniq
Vendor: ,Product: Virtual CDROM ,Revision: 1.00 ,Serial No:
Vendor: Intel ,Product: Multi-Flex ,Revision: 0307 ,Serial No: 4C2020200000000
Vendor: ATA ,Product: ST3320620AS ,Revision: K ,Serial No:
Vendor: AMI ,Product: Virtual CDROM ,Revision: 1.00 ,Serial No:
Vendor: AMI ,Product: Virtual Floppy ,Revision: 1.00 ,Serial No:
Vendor: ATA ,Product: Hitachi HUA72303 ,Revision: A580 ,Serial No:
Vendor: ATA ,Product: SEAGATE ST32500N ,Revision: 3AZQ ,Serial No:
Vendor: AMI ,Product: Virtual CDROM ,Revision: 1.00 ,Serial No:
Vendor: AMI ,Product: Virtual Floppy ,Revision: 1.00 ,Serial No:
Vendor: FUJITSU ,Product: MAY2073RCSUN72G ,Revision: 0501 ,Serial No: 0729S0C9HV
Vendor: MATSHITA ,Product: CD-RW CW-8124 ,Revision: DZ13 ,Serial No:
Vendor: SEAGATE ,Product: ST973401LSUN72G ,Revision: 0556 ,Serial No: 071611N3BL
Vendor: SEAGATE ,Product: ST973401LSUN72G ,Revision: 0556 ,Serial No: 071611N3CQ
Vendor: SEAGATE ,Product: ST973402SSUN72G ,Revision: 0400 ,Serial No: 0716216393
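When many disks are involved, the vendor and product strings can be pulled out of such lines with a small script. The sketch below assumes the comma-separated "Key: value" layout shown in the sample output and that no field value itself contains a comma; the function name is illustrative:

```python
def parse_iostat_line(line: str) -> dict:
    """Split one 'Vendor: ... ,Product: ...' line into a field dictionary."""
    fields = {}
    for part in line.split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


sample = "Vendor: ATA ,Product: Hitachi HUA72303 ,Revision: A580 ,Serial No:"
info = parse_iostat_line(sample)
print(info["Vendor"], "/", info["Product"])
```

The resulting vendor and product strings are the pieces that go into the first column of an sd-config-list entry.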
Query the kernel
Another suggested method is to query the kernel using a series of requests to the mdb -k debugger:
# echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | \
    ::print struct scsi_device sd_inq | ::print struct scsi_inquiry \
    inq_vid inq_pid" | mdb -k
For instance, in my case I'm getting this output:
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST2000NM0001 " ]
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST3300657SS " ]
...
The corresponding sd.conf entry (for some "green" drives) looks like this:
sd-config-list =
    "SEAGATE ST3300657SS", "power-condition:false",
    "SEAGATE ST2000NM0001", "power-condition:false";
Observing Negotiated Block Sizes
Unfortunately, the classic Solaris tools for observing block or sector sizes do not show both the logical and physical block sizes. However, you can observe the values negotiated by the sd driver using mdb:
# echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'
un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x200
...
The unit number (un) corresponds to the SCSI driver (sd) instance number. In the above example, "un 0" is also known as "sd0". Sizes are in bytes, written as hex: 0x200 = 512, 0x1000 = 4096.
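On a host with many disks, the hex values are easier to review once converted to decimal. The following sketch parses text in the layout of the filtered sd_state output shown above; the sample text and the function name are assumptions for illustration, and the parser does not call mdb itself:

```python
import re

SAMPLE = """\
un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x200
"""


def parse_blocksizes(text: str) -> dict:
    """Map each sd instance name to its block sizes in decimal bytes."""
    units = {}
    current = None
    for line in text.splitlines():
        unit = re.match(r"un (\d+):", line)
        if unit:
            current = f"sd{unit.group(1)}"
            units[current] = {}
            continue
        size = re.match(r"\s*(un_\w+_blocksize) = (0x[0-9a-fA-F]+)", line)
        if size and current is not None:
            units[current][size.group(1)] = int(size.group(2), 16)
    return units


for name, sizes in parse_blocksizes(SAMPLE).items():
    print(name, sizes)
```

A disk with a correctly reported or overridden 4KB physical sector would show un_phy_blocksize as 4096 here.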
To inspect the ashift value actually used by ZFS for a particular pool, you can use zdb (without parameters it dumps labels of imported pools), for example:
# zdb | egrep 'ashift| name'
    name='pond'
        ashift=9
        ashift=9
        ashift=9
        ashift=9
        ashift=9
        ashift=9
        ashift=9
        ashift=9
        ashift=9
    name='rpool'
        ashift=9
    name='temp'
        ashift=9
Note that each pool can consist of one or several top-level VDEVs (TLVDEVs), and each of those can have an individual composition of devices. That is, a pool can mix not only mirrors and raidzN sets, but also devices with different hardware sector sizes and thus different ashift values, as set at TLVDEV creation time. However, in order to avoid unbalanced IO and "unpleasant surprises" which might be difficult to explain and debug, building pools from such mixtures is discouraged.
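Pools that mix ashift values can be spotted by grouping the ashift lines under each pool name. The sketch below works on text in the name='...' / ashift=N layout of the filtered zdb output shown above; the sample text (including the mixed pool) and the function name are made up for illustration:

```python
import re
from collections import defaultdict

SAMPLE = """\
name='pond'
ashift=9
ashift=12
name='rpool'
ashift=9
"""


def ashifts_by_pool(zdb_text: str) -> dict:
    """Collect the set of ashift values seen under each pool name."""
    pools = defaultdict(set)
    current = None
    for raw in zdb_text.splitlines():
        line = raw.strip()
        name = re.match(r"name='([^']+)'", line)
        if name:
            current = name.group(1)
            continue
        shift = re.match(r"ashift=(\d+)", line)
        if shift and current is not None:
            pools[current].add(int(shift.group(1)))
    return dict(pools)


for pool, shifts in ashifts_by_pool(SAMPLE).items():
    note = "mixed ashift" if len(shifts) > 1 else "uniform"
    print(pool, sorted(shifts), note)
```

Any pool reported with more than one value has top-level VDEVs created with different sector sizes.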
Migration of older pools onto newer disks
... TODO: expand the text
In short: if the old pool was created with 512B-sectored drives, it is best not to replace its disks with 4KB-sectored ones, but to create a new pool with those disks and proper alignment, and copy over the data. The copy can be done in a number of ways, one of which is a "zfs send ... | zfs recv ..." run and another is a series of "rsync" invocations – perhaps with some manual labor to recreate the needed dataset hierarchy and apply ZFS properties, and even logically recreate the snapshot and cloning history. The "rsync" approach may be especially useful if you need to change the dataset layout as well, because over time people often find their old setup less optimal than they initially thought.
Note: It was reported on the mailing lists that rsync version 3.10 has finally added support for Solaris ACLs and extended attributes. Previously it was recommended to use Sun (not GNU) cpio to transfer filesystems with such extended file/dir attributes and ACLs.
Thanks go to active users of the zfs-discuss mailing lists, including Richard Elling, George Wilson, Sašo Kiselkov, Edward Ned Harvey, Jim Klimov and countless others – including all those who ask questions so we can all receive answers.
- Using GNOP to emulate 4KB blocks over 512B (to create the pool via FreeBSD, such as a "zfsguru" LiveCD; you can then reboot into OI):