George Wilson's talk at ZFS Day and a related blog post with examples: http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/


ZFS was designed in the early 2000s to be aware of trends in disk and device storage for the next generation of storage hardware. Unlike previous filesystems that have fixed assumptions about the device sector size, ZFS is designed to accommodate different sector sizes. As the consumer hard disk (HDD) market migrates from 512 byte sector sizes to 4KB sectors, ZFS is ready to make the change. In some cases, 4KB physical sector disks are also called Advanced Format (AF) disks.

The method used by ZFS is to query the drive parameters using the mechanisms provided by the OS for device inquiries. In most cases, the device reports the physical and logical sector size. There are some nuances in the implementations. For example, the SATA protocol uses a sector size that is a multiple of 512 bytes. SCSI devices (SAS, FC, etc.) can have many different sizes, and it is not uncommon for vendors to offer 512, 520, or 528 bytes per sector. By 2010, many HDD manufacturers had started offering consumer-grade disks with 4KB sector sizes. Going forward, some HDD vendors have deprecated the use of sector sizes that are not powers of 2 (e.g. 520 or 528 bytes). ZFS allocates space using a logical sector size that is a power of 2 that is less than or equal to the physical sector size.

Unfortunately, some HDDs do not report their sector sizes properly in response to device inquiries. ZFS looks to the physical sector size (aka physical block size) for its hint on how to optimize its use of the device. If the disk reports that the physical sector size is 512 bytes, then ZFS will use an internal sector size of 512 bytes. The problem is that some HDDs misrepresent 4KB sector disks as having a physical sector size of 512 bytes. The proper response should be that the logical sector size is 512 bytes and the physical sector size is 4KB. By 2011-2012, most HDDs were properly reporting logical and physical sector sizes. In some cases, the HDD vendors advertise the disks as "emulating 512 byte sectors" or "512e", with the expected name for disks which advertise a 4KB sector size being "4kn" (for "4k native").

There is no functional or reliability problem with 4KB physical sectors being represented as 512 byte logical sectors. This technique has been used for decades in computer systems to allow expansion of device or address sizes. The general name for the technique is read-modify-write: when you need to write 512 bytes (or less than the physical sector size) then the device reads 4KB (the physical sector), modifies the data, and writes 4KB (because it can't write anything smaller). For HDDs, the cost can be a whole revolution, or 8.33 ms for a 7,200 rpm disk. Thus the performance impact for read-modify-write can be severe, and even worse for slower, consumer-grade, 5,400 rpm or variable speed "green" drives.
Bottom line: for best performance, the HDD needs to properly communicate the physical block size via the inquiry commands.
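
As a rough worked example of the revolution cost quoted above, the following shell sketch computes the time of one full platter revolution from the spindle speed. The function name revolution_ms is hypothetical, and the model deliberately ignores seek time and any caching inside the drive.

```shell
# Time of one full platter revolution in milliseconds: 60000 ms per minute / rpm.
# revolution_ms is an illustrative helper, not part of any ZFS or illumos tool.
revolution_ms() {
    awk -v rpm="$1" 'BEGIN { printf "%.2f\n", 60000 / rpm }'
}

revolution_ms 7200   # 7,200 rpm disk
revolution_ms 5400   # slower 5,400 rpm disk
```

For 7,200 rpm this prints 8.33, matching the figure above; for 5,400 rpm it prints 11.11.
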
Inside ZFS, the kernel variable that describes the physical sector size is ashift, the base-2 logarithm of the sector size in bytes. A search of the ZFS archives will find references to the ashift variable in discussions about sector sizes.
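
To make the relation between sector size and ashift concrete, here is a minimal shell sketch that rounds a sector size down to a power of 2 and prints the exponent, following the "power of 2 that is less than or equal to the physical sector size" rule described earlier. The helper name ashift_for is hypothetical, and this is only an illustration of the arithmetic, not the kernel code.

```shell
# Largest power-of-two exponent that fits in the given sector size (floor of log2).
# ashift_for is a hypothetical helper name used only for this illustration.
ashift_for() {
    size=$1
    exponent=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        exponent=$((exponent + 1))
    done
    echo "$exponent"
}

ashift_for 512    # 512-byte sectors
ashift_for 4096   # 4KB sectors
ashift_for 520    # non-power-of-2 size rounds down to 512 bytes
```

The three calls print 9, 12 and 9 respectively.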

The issues experienced by ZFS implementers and the 4KB sector size HDDs are summarized as:

  1. Some HDD models misrepresent their physical block size, resulting in unexpectedly poor performance for some workloads
  2. Attempting to replace an HDD that had 512 byte physical sectors with a new HDD that has 4KB logical and physical sectors can fail with a mismatched sector size error message
  3. Some, but not all, ZFS implementations offer command-line options to set the physical sector size in either the zpool command or via other OS commands and configuration settings
  4. Older Solaris releases did not set the default sector boundaries on 4KB boundaries, negatively impacting performance (fixed in illumos, later OpenSolaris releases, Solaris 10 recent updates, and Solaris 11)
  5. 4KB sector disks are not as space-efficient as 512 byte sector disks, in part because ZFS metadata is compressed, dynamically allocated, and often less than 4KB physical size
  6. As George Wilson reminds us in his ZFS Day talk (see above), certain HDDs include an "XP Jumper" that offsets LBA addresses by 1, so that Windows XP partitions, which by default start at a logical offset of 63 512-byte blocks, become 4KB-aligned (starting at a physical offset of 64 512-byte blocks). However, the presence of such a jumper hurts advanced users who take care to define 4KB-aligned partitions, because these are shifted by 1 into misaligned hardware I/Os. So, if you see unexpectedly bad performance despite your tuning, keep this cause in mind.
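
The effect of the "XP Jumper" from the last item can be sketched with a little arithmetic. The helper below is hypothetical (jumper_alignment is not an existing tool) and assumes 512-byte logical sectors on a 4KB physical sector disk, so a start is aligned when the physical offset is a multiple of 8.

```shell
# Report whether a partition start (in 512-byte logical sectors) lands on a
# 4KB physical boundary, given the +1 LBA shift applied by an "XP jumper".
# jumper_alignment is a hypothetical helper; pass 1 if the jumper is set, else 0.
jumper_alignment() {
    logical_start=$1
    jumper=$2
    physical_start=$((logical_start + jumper))
    if [ $((physical_start % 8)) -eq 0 ]; then
        echo "logical $logical_start -> physical $physical_start: aligned"
    else
        echo "logical $logical_start -> physical $physical_start: misaligned"
    fi
}

jumper_alignment 63 1    # Windows XP default start, jumper set
jumper_alignment 256 1   # carefully aligned start, shifted by the jumper
jumper_alignment 256 0   # same start without the jumper
```

With the jumper set, the XP default of 63 becomes aligned (physical 64), while a deliberately aligned start of 256 becomes misaligned (physical 257).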

This page will try to share some knowledge and offer good operating practices for ZFS on illumos.

The zpool create and zpool add Commands

The physical sector size is queried when the zpool create or zpool add commands are executed. In these cases, a new top-level virtual device (vdev) is created and the ashift value is set. If the disks have mixed physical sector sizes, then an error message is shown.

When 4KB physical sector size disks (ashift = 12) are added to a pool containing 512 byte physical sector disks (ashift = 9), or vice versa, the resulting pool contains top-level vdevs with mixed sector sizes. ZFS functions properly with mixed-size top-level vdevs.

Note: attempting to replace disks with 512 byte physical sectors (or attach into a mirror made from such disks) with disks that only support 4KB logical sectors can fail, leading to operational issues with stocking spares.

Disk partitioning and alignment

TODO: describe usage of parted, fdisk and format to pre-create Solaris slices for rpool disks, and how to check the layout of a particular disk to see if alignment is right.

Also, the examples below represent disks with 512b native sector sizes. It would be informative if someone added similar examples of disks with 512e and 4kn sectors.

There are a number of command-line utilities that can be used to administer and automate the needed setup, as well as inspection, of the disk layout and its alignment.


The GNU partition editor parted, which is available, in particular, on the OI LiveCD media, can display and edit partitioning tables.

For example, here is a typical rpool disk created by the Solaris and OI installers – with an "msdos" (MBR) label and a further SMI slice table inside, and some strange-looking offset (16065 = 63*255 – a "cylinder" size) which does not offer very good 4K alignment:

# parted /dev/rdsk/c5t4d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t4d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End         Size        Type     File system  Flags
 1      16065s  488375999s  488359935s  primary  solaris      boot 

And here is a typical data pool component disk, wholly given to ZFS, which has created a GPT label and an 8MB buffer partition (to cater for future replacement disk size variations) with proper offsets:

# parted /dev/rdsk/c5t1d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t1d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start       End         Size        File system  Name  Flags
 1      256s        488374207s  488373952s               zfs        
 9      488374208s  488390591s  16384s                              
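
To check alignment without doing the division by hand, the listings above can be fed through a small filter. This is only a sketch: check_parted_alignment is a hypothetical name, and it assumes the sector-unit output format shown above with 512-byte logical sectors, so a start is 4KB-aligned when it is a multiple of 8.

```shell
# Read "parted ... unit s print" output on stdin and report, for each
# partition row, whether its start sector is a multiple of 8
# (8 x 512-byte logical sectors = 4KB). check_parted_alignment is a
# hypothetical name; 512-byte logical sectors are assumed.
check_parted_alignment() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+s$/ {
        start = $2 + 0            # numeric prefix of e.g. "16065s"
        state = (start % 8 == 0) ? "aligned" : "misaligned"
        printf "partition %s start %ds: %s\n", $1, start, state
    }'
}

# Rows copied from the two listings above.
printf '%s\n' \
    ' 1      16065s  488375999s  488359935s  primary  solaris      boot' \
    ' 1      256s        488374207s  488373952s               zfs' \
    ' 9      488374208s  488390591s  16384s' | check_parted_alignment
```

The rpool partition starting at 16065s is reported as misaligned, and both GPT partitions as aligned. On a live system the same function could be placed at the end of a pipe from the parted command shown above.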

parted can also be used to create partitions, for example, to prepare a roughly 250GB rpool following the example sizes above:

# parted /dev/rdsk/c2t0d0p0 mklabel msdos
# parted /dev/rdsk/c2t0d0p0 uni s mkpart pri solaris 256 488373952
# parted /dev/rdsk/c2t0d0p0 set 1 boot on

This properly aligned partition can then be used for an rpool – quite a useful trick to use during the OS installation (note that the current OpenIndiana installer will wipe any slice tables and other data in this partition).

Note that parted does not edit SMI labels; you need to use format to define slices (in the case of "msdos" MBR partition tables). With EFI (GPT) partitioning this slicing is irrelevant, as GPT partitions are directly usable by ZFS.


The Solaris fdisk tool is of limited utility in this case, because it seems to create a (MBR) partitioning table aligned with odd legacy cylinder sizes.

It can also create an EFI (GPT) label with a Solaris partition starting at cylinder 0 and likely proper alignment.


The Solaris format tool is used to, among other things, edit the SMI label (Solaris slices in an MBR partition) or the EFI (GPT) partition table. It can also specify offsets and sizes with 512b-sector precision, but it can't create the partitioning table itself and invokes fdisk for that. Here are examples of format displaying the layout of the same disks as in the parted example above:

# format c5t4d0p0    
selecting c5t4d0p0
[disk formatted]
/dev/dsk/c5t4d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
format> p
partition> p
Current partition table (original):
Total disk cylinders available: 30397 + 2 (reserved cylinders)
Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       1 - 30396      232.85GB    (30396/0/0) 488311740
  1 unassigned    wm       0                0         (0/0/0)             0
  2     backup    wm       0 - 30396      232.85GB    (30397/0/0) 488327805
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wu       0                0         (0/0/0)             0
# format c5t1d0p0
selecting c5t1d0p0
[disk formatted]
/dev/dsk/c5t1d0s0 is part of active ZFS pool pond. Please see zpool(1M).
format> p
partition> p
Current partition table (original):
Total disk sectors available: 488374207 + 16384 (reserved sectors)
Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm               256      232.87GB          488374207    
  1 unassigned    wm                 0           0               0    
  2 unassigned    wm                 0           0               0    
  3 unassigned    wm                 0           0               0    
  4 unassigned    wm                 0           0               0    
  5 unassigned    wm                 0           0               0    
  6 unassigned    wm                 0           0               0    
  8   reserved    wm         488374208        8.00MB          488390591    


The Solaris prtvtoc tool can also display the slicing or partitioning table info along with some other details, and is often used in conjunction with the fmthard utility to clone partitioning information from a master disk to several others – a convenient step in creating uniform labeling for components of the same pool.

Examples for the same disks as above:

# prtvtoc /dev/dsk/c5t4d0p0
* /dev/dsk/c5t4d0p0 partition map
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*   30399 cylinders
*   30397 accessible cylinders
* Flags:
*   1: unmountable
*  10: read-only
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector 
*           0     16065     16064
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00      16065 488311740 488327804
       2      5    00          0 488327805 488327804
       8      1    01          0     16065     16064
# prtvtoc /dev/dsk/c5t1d0p0
* /dev/dsk/c5t1d0p0 partition map
* Dimensions:
*     512 bytes/sector
* 488390625 sectors
* 488390558 accessible sectors
* Flags:
*   1: unmountable
*  10: read-only
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector 
*          34       222       255
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00        256 488373952 488374207
       8     11    00  488374208     16384 488390591
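
The label cloning mentioned before these listings is usually done by piping prtvtoc into fmthard. The sketch below only prints the commands so they can be reviewed first. The function name label_clone_plan and the device names are hypothetical, and it assumes SMI-labeled disks where slice 2 covers the whole disk.

```shell
# Print (do not run) the commands that would copy an SMI label from a master
# disk to each target disk. label_clone_plan and the device names are
# hypothetical; slice 2 is assumed to cover the whole disk, as with SMI labels.
label_clone_plan() {
    master=$1
    shift
    for target in "$@"; do
        echo "prtvtoc /dev/rdsk/${master}s2 | fmthard -s - /dev/rdsk/${target}s2"
    done
}

label_clone_plan c5t4d0 c5t5d0 c5t6d0
```

Printing rather than executing is a deliberate choice here, since fmthard overwrites the label on the target disk.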

Overriding the Physical Block Size

Some operating systems provide a way to pass the desired ashift value to the zpool invocation, or even hard-code ashift=12 into a separately built zpool binary.

The illumos-gate did not, after long discussions, follow this "trivial" path. Instead, illumos-based OSes can now override the physical block (sector) size for "talking" to particular devices regardless of what they announce, by using a value configured in the sd(7d) driver. Once the device vendor and identification strings are known (as detailed below), the /kernel/drv/sd.conf file can be modified:

sd-config-list =
        "SEAGATE ST3300657SS", "physical-block-size:4096",
        "DGC     RAID", "physical-block-size:4096",
        "NETAPP  LUN", "physical-block-size:4096";

To enable this change on a running system, you need to reload the sd driver; keep in mind that in some cases a reboot may still be required:

# update_drv -vf sd

Note: the illumos distribution builder can set the values by default for the known cases where overrides are appropriate.

Determining the drive's device vendor and identification strings

The format of the entries in the first column of sd-config-list, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.

The identifiers are the concatenation of the vendor and product strings returned by the SCSI inquiry command which can be accessed by several methods discussed below. The comparison code in sd.c strips leading and trailing blanks and elides repeated blanks. It converts both strings to upper case to make the comparison. If that comparison fails, it then attempts a wildcard match for a substring (according to the comment). However, the code is a bit odd and may not do what the comment states.
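
The normalisation described above (upper-casing, squeezing repeated blanks, trimming) can be approximated in shell to preview how an identifier would be compared. This is only a sketch of the described behaviour under that reading of the comment; normalize_id is a hypothetical helper and is not the code from sd.c, which, as noted, may behave differently.

```shell
# Approximate the normalisation described above: upper-case the string,
# squeeze runs of blanks to one, and trim leading and trailing blanks.
# normalize_id is a hypothetical helper, not the code from sd.c.
normalize_id() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr -s ' ' | sed 's/^ //; s/ $//'
}

normalize_id 'DGC     RAID'
normalize_id '  Seagate  st3300657ss '
```

The two calls print DGC RAID and SEAGATE ST3300657SS.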

Query with format

The identifiers can be accessed by "format -e" and "inquiry":

# format
Searching for disks...done 
         0. c3t0d0 <DEFAULT cyl 8921 alt 2 hd 255 sec 63>
Specify disk (enter its number): 0  
selecting c3t0d0  
[disk formatted]  
         inquiry    - show vendor, product and revision  
format> inq  
Vendor:   SEAGATE   
Product:  ST973401LSUN72G   
Revision: 0556  
format> ^D

Query with iostat

As another example, the "iostat -Er" command can be used to query the needed strings. Here are sample outputs from several different hosts:

# iostat -Er | grep -i vendor | sort | uniq
Vendor:          ,Product: Virtual CDROM    ,Revision: 1.00 ,Serial No:   
Vendor: Intel    ,Product: Multi-Flex       ,Revision: 0307 ,Serial No: 4C2020200000000 
Vendor: ATA      ,Product: ST3320620AS      ,Revision: K    ,Serial No:   
Vendor: AMI      ,Product: Virtual CDROM    ,Revision: 1.00 ,Serial No:   
Vendor: AMI      ,Product: Virtual Floppy   ,Revision: 1.00 ,Serial No:   
Vendor: ATA      ,Product: Hitachi HUA72303 ,Revision: A580 ,Serial No:   
Vendor: ATA      ,Product: SEAGATE ST32500N ,Revision: 3AZQ ,Serial No:   
Vendor: AMI      ,Product: Virtual CDROM    ,Revision: 1.00 ,Serial No:   
Vendor: AMI      ,Product: Virtual Floppy   ,Revision: 1.00 ,Serial No:   
Vendor: FUJITSU  ,Product: MAY2073RCSUN72G  ,Revision: 0501 ,Serial No: 0729S0C9HV  
Vendor: MATSHITA ,Product: CD-RW  CW-8124   ,Revision: DZ13 ,Serial No:   
Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3BL  
Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3CQ  
Vendor: SEAGATE  ,Product: ST973402SSUN72G  ,Revision: 0400 ,Serial No: 0716216393  
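
Lines like these can be turned into candidate sd-config-list identifiers with a short filter. The sketch below assumes the comma-separated layout shown above and pads the vendor to 8 characters, which is the width of the vendor field in SCSI inquiry data and matches the spacing in the sd.conf example earlier. The function name sd_ids_from_iostat is hypothetical.

```shell
# Read "iostat -Er" output on stdin and print one quoted identifier per
# distinct device model: vendor padded to 8 characters, then the product.
# sd_ids_from_iostat is a hypothetical helper for illustration only.
sd_ids_from_iostat() {
    awk -F',' '/^Vendor:/ {
        vendor = $1;  sub(/^Vendor: */, "", vendor);   sub(/ +$/, "", vendor)
        product = $2; sub(/^Product: */, "", product); sub(/ +$/, "", product)
        if (product != "") printf "\"%-8s%s\"\n", vendor, product
    }' | sort -u
}

# Two lines copied from the sample output above.
printf '%s\n' \
    'Vendor: ATA      ,Product: Hitachi HUA72303 ,Revision: A580 ,Serial No:   ' \
    'Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3BL  ' \
    | sd_ids_from_iostat
```

The resulting strings are only candidates: whether a physical-block-size override is appropriate for a given model still has to be decided per device.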

Query the kernel

Another suggested method was to query the kernel using a series of requests to the mdb -k debugger:

# echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | \
  ::print struct scsi_device sd_inq | ::print struct scsi_inquiry \
  inq_vid inq_pid" | mdb -k

For instance, in my case I'm getting this output:

inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST2000NM0001    " ]
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST3300657SS     " ]

So my sd.conf (for some "green" drives) looks like this:

sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                 "SEAGATE ST2000NM0001", "power-condition:false";

Observing Negotiated Block Sizes

Unfortunately, the classic Solaris tools for observing block or sector sizes do not show both the logical and physical block sizes. However, you can observe the values negotiated by the sd driver using mdb.

# echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'
un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x200

The unit number (un) corresponds to the SCSI driver (sd) instance number. In the above example, "un 0" is also known as "sd0." Sizes are in bytes, written as hex: 0x200 = 512, 0x1000 = 4096.
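
Since the sizes are printed in hex, a small filter can make the output easier to read. The sketch below assumes the line layout shown above. The name decode_blocksizes is hypothetical.

```shell
# Convert the hex block sizes printed by ::sd_state into decimal bytes.
# decode_blocksizes is a hypothetical helper; it expects lines shaped like
# "un 0: <address>" and "un_xxx_blocksize = 0x200" on stdin.
decode_blocksizes() {
    while read -r name sep value; do
        case "$name" in
            un_*_blocksize) echo "$name = $(($value)) bytes" ;;
            un)             echo "unit ${sep%:}" ;;
        esac
    done
}

# Lines copied from the sample output above, plus a 4KB example value.
printf '%s\n' \
    'un 0: ffffff01d1b18d00' \
    '    un_sys_blocksize = 0x200' \
    '    un_phy_blocksize = 0x1000' | decode_blocksizes
```

On a live system the function could be appended to the mdb pipeline shown above in place of reading the hex values by eye.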

To inspect the ashift value actually used by ZFS for a particular pool, you can use zdb (without parameters it dumps labels of imported pools), for example:

# zdb | egrep 'ashift| name'
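
The ashift values found this way translate back to sector sizes as 2 to the power of ashift. A minimal sketch follows, assuming zdb prints lines of the form "ashift: N" as in typical output. The name ashift_to_bytes is hypothetical.

```shell
# Read zdb output on stdin and print the sector size implied by each ashift
# line (sector size = 2^ashift bytes). ashift_to_bytes is a hypothetical
# helper; the "ashift: N" line format is assumed from typical zdb output.
ashift_to_bytes() {
    awk '$1 == "ashift:" { printf "ashift %d = %d-byte sectors\n", $2, 2 ^ $2 }'
}

printf '%s\n' '            ashift: 9' '            ashift: 12' | ashift_to_bytes
```

This prints "ashift 9 = 512-byte sectors" and "ashift 12 = 4096-byte sectors".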

Note that each pool can consist of one or several top-level vdevs, and each of those can have its own composition of devices. That is, a pool can mix not only mirrors and raidzN sets but also devices with different hardware sector sizes, and thus different ashift values, as set when each top-level vdev is created. However, to avoid unbalanced I/O and "unpleasant surprises" that may be difficult to explain and debug, building pools from such mixtures is discouraged.

Migration of older pools onto newer disks

... TODO: expand the text

In short: if the old pool was created with 512B-sectored drives, it is best not to replace its disks with 4KB-sectored ones, but to create a new pool with those disks and proper alignment, and copy over the data. The copy can be done in a number of ways, one of which is a "zfs send ... | zfs recv ..." run and another is a series of "rsync" invocations, perhaps with some manual labor to recreate the needed dataset hierarchy and apply ZFS properties, and even logically recreate the snapshot and cloning history. The "rsync" approach may be especially useful if you need to change the dataset layout as well, because over time people often find their old setup less optimal than they initially thought.
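
As one possible shape for the "zfs send ... | zfs recv ..." route, the sketch below prints a two-step plan: a recursive snapshot of the old pool, then a replication stream received into the new pool. The function name migration_plan, the pool names and the snapshot name are all hypothetical, and the new pool is assumed to exist already on the 4KB-sectored disks. The commands are printed rather than executed so they can be reviewed and adapted.

```shell
# Print (do not run) a recursive snapshot followed by a full replication
# stream into the new pool. migration_plan, the pool names and the snapshot
# name are hypothetical; review the plan before running any of it.
migration_plan() {
    old_pool=$1
    new_pool=$2
    snapshot=${3:-migrate}
    echo "zfs snapshot -r ${old_pool}@${snapshot}"
    echo "zfs send -R ${old_pool}@${snapshot} | zfs recv -vFd ${new_pool}"
}

migration_plan oldpool newpool
```

Here "zfs send -R" includes descendant datasets with their snapshots and properties, and "zfs recv -d" keeps the dataset names below the new pool. The -F flag lets the receive roll back the target, so it should only point at a pool that holds nothing worth keeping.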

Note: It was reported on the mailing lists that rsync version 3.10 has finally added support for Solaris ACLs and extended attributes. Previously it was recommended to use Sun (not GNU) tar or cpio to transfer filesystems with such extended file/dir attributes and ACLs.


Thanks go to active users of the openindiana-discuss and zfs-discuss mailing lists, including Richard Elling, George Wilson, Sašo Kiselkov, Edward Ned Harvey, Jim Klimov and countless others, including all those who ask questions so we can all receive answers.

See also