Comment: Some reformatting, moved device string detection into its own paragraph, added Saso Kiselkov's suggestion on mdb query

Video: https://www.youtube.com/watch?v=TmH3iRLhZ-A&feature=youtu.be

^ George Wilson's talk at ZFS Day, and a blog post with examples related to that talk: http://blog.delphix.com/gwilson/2012/11/15/4k-sectors-and-zfs/

Introduction

ZFS was designed in the early 2000s to be aware of trends in disk and device storage for the next generation of storage hardware. Unlike previous filesystems that have fixed assumptions about the device sector size, ZFS is designed to accommodate different sector sizes. As the consumer hard disk (HDD) market migrates from 512 byte sectors to 4KB sectors, ZFS is ready to make the change. In some cases, 4KB physical sector disks are also called Advanced Format (AF) disks.

...

Unfortunately, disks from some HDD manufacturers do not respond properly to device inquiries about their sector sizes. ZFS looks to the physical sector size (aka physical block size) for its hint on how to optimize its use of the device. If the disk reports that the physical sector size is 512 bytes, then ZFS will use an internal sector size of 512 bytes. The problem is that some HDDs misrepresent 4KB sector disks as having a physical sector size of 512 bytes. The proper response would be that the logical sector size is 512 bytes and the physical sector size is 4KB. By 2011-2012, most HDDs were properly reporting logical and physical sector sizes. In some cases, the HDD vendors advertise such disks as "emulating 512 byte sectors" or "512e", while disks that advertise a native 4KB sector size are expected to be called "4kn" (for "4K native").

There is no functional or reliability problem with 4KB physical sectors being represented as 512 byte logical sectors. This technique has been used for decades in computer systems to allow expansion of device or address sizes. The general name for the technique is read-modify-write: when you need to write 512 bytes (or anything less than the physical sector size), the device reads 4KB (the physical sector), modifies the data, and writes 4KB back (because it can't write anything smaller). For HDDs, the cost can be a whole revolution, or 8.33 ms for a 7,200 rpm disk. Thus the performance impact of read-modify-write can be severe, and even worse for slower, consumer-grade, 5,400 rpm or variable speed "green" drives.
Bottom line: for best performance, the HDD needs to properly communicate its physical block size via the inquiry commands.
Inside ZFS, the kernel variable that describes the physical sector size is ashift, which holds log2(physical sector size in bytes): 9 for 512 byte sectors and 12 for 4KB sectors. A search of the ZFS archives finds references to the ashift variable in discussions about sector sizes.
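The two numbers quoted above (the ashift values and the cost of one revolution) can be reproduced with a few lines of shell. This is only an illustrative sketch: ashift_for and rotation_ms are hypothetical helper names, not part of ZFS or illumos, and the sector size is assumed to be a power of two.

```shell
#!/bin/sh
# ashift_for SIZE: print log2(SIZE), assuming SIZE is a power of two.
# (Hypothetical helper, for illustration only.)
ashift_for() {
    size=$1
    bits=0
    # Each halving of the size accounts for one bit of shift.
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        bits=$((bits + 1))
    done
    echo "$bits"
}

# rotation_ms RPM: print the time of one full platter revolution in ms.
rotation_ms() {
    LC_ALL=C awk -v rpm="$1" 'BEGIN { printf "%.2f\n", 60000 / rpm }'
}

ashift_for 512     # 512 byte sectors -> 9
ashift_for 4096    # 4KB sectors      -> 12
rotation_ms 7200   # 8.33
rotation_ms 5400   # 11.11
```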

...

  1. Some HDD models misrepresent their physical block size, resulting in unexpectedly poor performance for some workloads
  2. Attempting to replace a HDD that had 512 byte physical sectors with a new HDD that has 4KB logical and physical sectors can fail with a mismatched sector size error message
  3. Some, but not all, ZFS implementations offer command-line options to set the physical sector size, either in the zpool command or via other OS commands and configuration settings
  4. Older Solaris releases did not set the default sector boundaries on 4KB boundaries, negatively impacting performance (fixed in illumos, later OpenSolaris releases, Solaris 10 recent updates, and Solaris 11)
  5. 4KB sector disks are not as space-efficient as 512 byte sector disks, in part because ZFS metadata is compressed, dynamically allocated, and often less than 4KB physical size
  6. As George Wilson reminds us in his talk at ZFS Day (see above), certain HDDs include an "XP Jumper" that offsets LBA addresses by 1, so that Windows XP partitions, which by default start at a logical offset of 63 512-byte blocks, in fact become 4KB-aligned (starting at a physical offset of 64 512-byte blocks). However, the presence of such a jumper can hurt advanced users who take care to define 4KB-aligned partitions, only to have them shifted by 1 into misaligned hardware I/Os. So, if you see unexpectedly bad performance despite your tuning, keep this cause in mind.
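The jumper arithmetic from the last item can be spelled out as a small sketch. The helper names is_4k_aligned and describe are made up for this illustration; the only assumption is that offsets are counted in 512-byte blocks, so a 4KB boundary is any multiple of 8.

```shell
#!/bin/sh
# is_4k_aligned OFFSET: succeed when OFFSET (in 512-byte blocks) is a
# multiple of 8, i.e. falls on a 4KB boundary. (Hypothetical helper.)
is_4k_aligned() {
    [ $(( $1 % 8 )) -eq 0 ]
}

# describe OFFSET SHIFT: report where a partition starting at logical
# OFFSET physically lands when the drive adds SHIFT blocks to every LBA.
describe() {
    physical=$(( $1 + $2 ))
    if is_4k_aligned "$physical"; then
        echo "logical $1 -> physical $physical: aligned"
    else
        echo "logical $1 -> physical $physical: misaligned"
    fi
}

describe 63 0   # XP-style partition, jumper absent
describe 63 1   # XP-style partition, jumper set
describe 64 1   # hand-aligned partition, spoiled by the jumper
```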

This page will try to share some knowledge and offer good operating practices for ZFS on illumos.

...

Note: attempting to replace disks that have 512 byte physical sectors (or to attach into a mirror made from such disks) with disks that only support 4KB logical sectors can fail, leading to operational issues with stocking spares.

Disk partitioning and alignment

...

Note

TODO: describe usage

...

of parted,

...

fdisk

...

and format to pre-create Solaris slices

...

for rpool disks, and how to check the layout of a particular disk to see if alignment is right.


...

Also, the examples below represent disks with 512b native sector sizes. It would be informative if someone added similar examples of disks with 512e and 4kn sectors.

There are a number of command-line utilities that can be used to administer and automate the needed setup, as well as the inspection, of the disk layout and its alignment.

parted

The GNU partition editor parted, which is available on the OI LiveCD media among other places, can display and edit partition tables.

For example, here is a typical rpool disk created by the Solaris and OI installers – with an "msdos" (MBR) label and a further SMI slice table inside, and some strange-looking offset (16065 = 63*255 – a "cylinder" size) which does not offer very good 4K alignment:

Code Block
# parted /dev/rdsk/c5t4d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t4d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End         Size        Type     File system  Flags
 1      16065s  488375999s  488359935s  primary  solaris      boot 

And here is a typical data pool component disk, wholly given to ZFS, which has therefore created a GPT label and an 8MB buffer partition (to cater for future replacement disk size variations) with proper offsets:

Code Block
# parted /dev/rdsk/c5t1d0p0 uni s pri
Model: Generic Ide (ide)
Disk /dev/rdsk/c5t1d0p0: 488390625s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start       End         Size        File system  Name  Flags
 1      256s        488374207s  488373952s               zfs        
 9      488374208s  488390591s  16384s                              

parted can also be used to create partitions, for example to prepare a roughly 250GB rpool partition following the example sizes above:

Code Block
# parted /dev/rdsk/c2t0d0p0 mklabel msdos
# parted /dev/rdsk/c2t0d0p0 uni s mkpart pri solaris 256 488373952
# parted /dev/rdsk/c2t0d0p0 set 1 boot on

This properly aligned partition can then be used for an rpool – quite a useful trick to use during the OS installation (note that the current OpenIndiana installer will wipe any slice tables and other data in this partition).

Note that parted does not edit SMI labels; you need to use format to define slices (in the case of "msdos" MBR partition tables). In the case of EFI (GPT) partitioning this slicing is irrelevant, as GPT partitions are directly usable by ZFS.
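The start sectors that parted prints in sector units can be checked for 4KB alignment mechanically. The following is a minimal sketch, assuming the listing was produced with "uni s pri" as in the examples above; check_parted_alignment is a hypothetical helper name, and the sample input is copied from the rpool listing.

```shell
#!/bin/sh
# check_parted_alignment: read a "parted ... uni s pri" listing on stdin
# and report, per partition, whether its start sector is a multiple of 8
# (eight 512-byte sectors make up one 4KB physical sector).
check_parted_alignment() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+s$/ {
        start = $2
        sub(/s$/, "", start)                 # drop the trailing unit suffix
        state = (start % 8 == 0) ? "aligned" : "MISALIGNED"
        printf "partition %s starts at sector %s: %s\n", $1, start, state
    }'
}

# Sample input taken from the rpool disk shown earlier in this section.
check_parted_alignment <<'EOF'
Number  Start   End         Size        Type     File system  Flags
 1      16065s  488375999s  488359935s  primary  solaris      boot
EOF
```

On a live system the listing would be piped in directly, for example: parted /dev/rdsk/c5t1d0p0 uni s pri | check_parted_alignment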

fdisk

The Solaris fdisk tool is of limited utility in this case, because it seems to create a (MBR) partitioning table aligned with odd legacy cylinder sizes.

It can also create an EFI (GPT) label with a Solaris partition starting at cylinder 0 and likely proper alignment.

format

The Solaris format tool is used to, among other things, edit the SMI label (Solaris slices in an MBR partition) or the EFI (GPT) partition table. It can also specify offsets and sizes with 512b-sector precision, but it can't create the partitioning table itself and invokes fdisk for that. Here are examples of format displaying the layout of the same disks as in the parted example above:

Code Block
# format c5t4d0p0    
selecting c5t4d0p0
[disk formatted]
/dev/dsk/c5t4d0s0 is part of active ZFS pool rpool. Please see zpool(1M).
...
format> p
...
partition> p
Current partition table (original):
Total disk cylinders available: 30397 + 2 (reserved cylinders)
Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       1 - 30396      232.85GB    (30396/0/0) 488311740
  1 unassigned    wm       0                0         (0/0/0)             0
  2     backup    wm       0 - 30396      232.85GB    (30397/0/0) 488327805
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm       0                0         (0/0/0)             0
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 unassigned    wu       0                0         (0/0/0)             0
 
 
# format c5t1d0p0
selecting c5t1d0p0
[disk formatted]
/dev/dsk/c5t1d0s0 is part of active ZFS pool pond. Please see zpool(1M).
...
format> p
...
partition> p
Current partition table (original):
Total disk sectors available: 488374207 + 16384 (reserved sectors)
Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm               256      232.87GB          488374207    
  1 unassigned    wm                 0           0               0    
  2 unassigned    wm                 0           0               0    
  3 unassigned    wm                 0           0               0    
  4 unassigned    wm                 0           0               0    
  5 unassigned    wm                 0           0               0    
  6 unassigned    wm                 0           0               0    
  8   reserved    wm         488374208        8.00MB          488390591    

prtvtoc

The Solaris prtvtoc tool can also display the slicing or partitioning table info along with some other details, and is often used in conjunction with the fmthard utility to clone partitioning information from a master disk to several others – a convenient step in creating uniform labeling for components of the same pool.

Examples for the same disks as above:

Code Block
# prtvtoc /dev/dsk/c5t4d0p0
* /dev/dsk/c5t4d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
*      63 sectors/track
*     255 tracks/cylinder
*   16065 sectors/cylinder
*   30399 cylinders
*   30397 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector 
*           0     16065     16064
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00      16065 488311740 488327804
       2      5    00          0 488327805 488327804
       8      1    01          0     16065     16064
 
 
 
# prtvtoc /dev/dsk/c5t1d0p0
* /dev/dsk/c5t1d0p0 partition map
*
* Dimensions:
*     512 bytes/sector
* 488390625 sectors
* 488390558 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector 
*          34       222       255
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00        256 488373952 488374207
       8     11    00  488374208     16384 488390591
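The prtvtoc and fmthard combination mentioned above is usually written as a single pipeline. The sketch below only prints the commands instead of running them, since fmthard overwrites the label of the target disk. It assumes SMI-labeled disks of equal size addressed through slice 2; clone_label is a hypothetical helper name, and c5t5d0 and c5t6d0 are made-up target disks.

```shell
#!/bin/sh
# clone_label SRC DST...: print, for each DST, the pipeline that would
# copy the slice table of disk SRC onto it. Nothing is executed here;
# review the output before pasting it into a root shell.
clone_label() {
    src=$1
    shift
    for dst in "$@"; do
        echo "prtvtoc /dev/rdsk/${src}s2 | fmthard -s - /dev/rdsk/${dst}s2"
    done
}

clone_label c5t4d0 c5t5d0 c5t6d0
```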

Overriding the Physical Block Size

Some operating systems provide a way to pass the desired ashift value to the zpool invocation, or even hard-code ashift=12 into a separately built zpool binary.

The illumos-gate did not, after long discussions, follow this "trivial" path. Instead, illumos-based OSes can now override the physical block (sector) size for "talking" to particular devices regardless of what they announce, by using a value configured in the sd(7D) driver. Once the device vendor and identification strings are known (as detailed below), the /kernel/drv/sd.conf file can be modified:

Code Block
sd.conf additions to override ashift for ZFS
sd-config-list =
        "SEAGATE ST3300657SS", "physical-block-size:4096",
        "DGC     RAID", "physical-block-size:4096",
        "NETAPP  LUN", "physical-block-size:4096";
  • The format of the entries in the first column, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.
  • See here for some community-contributed entries for 512e drives: List of sd-config-list entries for Advanced-Format drives

  • You MUST reconfigure the device as per George Wilson's blog cited at the beginning of this page. In the case of USB disks, George's method does not seem to work. However, removing and replugging the USB drive does.
  • Note also that this change is not required to persist if you're immediately using the disk with overridden options to create or change a ZFS pool: after the ashift value gets into the Top-level VDEV label, it is used regardless of the driver options. This may be important on systems such as VMs, where you have (seemingly) identical disk models which you want to use in different manners.
    The change does need to persist if the disk usage will be delayed – for example, if the disk requiring an override is a designated hot-spare for a pool.
  • Other overrides can be set; see the sd(7D) man page. An increasingly useful tunable is "power-condition:false", which can be set on "green" HDDs for which the vendor has enabled an enforced power savings mode in the HDD firmware configuration, so that they spin down too quickly for ZFS's regular TXG syncs or for a larger array's staggered spin-up routine.

To enable this change on a running system, you need to reload the sd driver; keep in mind that in some cases a reboot may still be required:

Code Block
# update_drv -vf sd

Note: the illumos distribution builder can set the values by default for the known cases where overrides are appropriate.

Determining the drive's device vendor and identification strings

The format of the entries in the first column of sd-config-list, roughly, defines a substring search on the Vendor Name and the Device Model as returned by the device to appropriate inquiries.

The identifiers are the concatenation of the vendor and product strings returned by the SCSI inquiry command which can be accessed by several methods discussed below. The comparison code in sd.c strips leading and trailing blanks and elides repeated blanks. It converts both strings to upper case to make the comparison. If that comparison fails, it then attempts a wildcard match for a substring (according to the comment). However, the code is a bit odd and may not do what the comment states.
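Since the identifier is the vendor string followed by the product string, an entry can be assembled mechanically. The sketch below pads the vendor to 8 characters, which is the width of the vendor field in the SCSI inquiry data and matches the spacing of the sample entries above. sd_entry is a hypothetical helper name, and whether the padding is strictly required is not established here, given the blank-handling described in the previous paragraph.

```shell
#!/bin/sh
# sd_entry VENDOR PRODUCT TUNABLE: print one sd-config-list pair, with
# the vendor left-justified and padded to 8 characters before the
# product string. (Hypothetical helper, for illustration only.)
sd_entry() {
    printf '"%-8s%s", "%s"\n' "$1" "$2" "$3"
}

sd_entry SEAGATE ST3300657SS physical-block-size:4096
sd_entry DGC RAID physical-block-size:4096
```

The printed pairs still have to be joined with commas and terminated with a semicolon when they are placed into sd-config-list.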

Query with format

The identifiers can be accessed by "format -e" and "inquiry":

Code Block
# format
Searching for disks...done 
AVAILABLE DISK SELECTIONS:
         0. c3t0d0 <DEFAULT cyl 8921 alt 2 hd 255 sec 63>
            /pci@0,0/pci1022,7458@11/pci1000,3060@4/sd@0,0 
Specify disk (enter its number): 0  
selecting c3t0d0  
[disk formatted]  
...
         inquiry    - show vendor, product and revision  
... 
format> inq  
Vendor:   SEAGATE   
Product:  ST973401LSUN72G   
Revision: 0556  
format> ^D

Query with iostat

As another example, the "iostat -Er" command can be used to query the needed strings; here are sample outputs from several different hosts:

Code Block
# iostat -Er | grep -i vendor | sort | uniq
Vendor:          ,Product: Virtual CDROM    ,Revision: 1.00 ,Serial No:   
Vendor: Intel    ,Product: Multi-Flex       ,Revision: 0307 ,Serial No: 4C2020200000000 
 
Vendor: ATA      ,Product: ST3320620AS      ,Revision: K    ,Serial No:   
 
Vendor: AMI      ,Product: Virtual CDROM    ,Revision: 1.00 ,Serial No:   
Vendor: AMI      ,Product: Virtual Floppy   ,Revision: 1.00 ,Serial No:   
Vendor: ATA      ,Product: Hitachi HUA72303 ,Revision: A580 ,Serial No:   
Vendor: ATA      ,Product: SEAGATE ST32500N ,Revision: 3AZQ ,Serial No:   
 
Vendor: AMI      ,Product: Virtual CDROM    ,Revision: 1.00 ,Serial No:   
Vendor: AMI      ,Product: Virtual Floppy   ,Revision: 1.00 ,Serial No:   
Vendor: FUJITSU  ,Product: MAY2073RCSUN72G  ,Revision: 0501 ,Serial No: 0729S0C9HV  
Vendor: MATSHITA ,Product: CD-RW  CW-8124   ,Revision: DZ13 ,Serial No:   
Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3BL  
Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3CQ  
Vendor: SEAGATE  ,Product: ST973402SSUN72G  ,Revision: 0400 ,Serial No: 0716216393  
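The vendor and product fields can be pulled out of such listings with awk. This is a minimal sketch that assumes the comma-separated "Vendor: ... ,Product: ..." layout shown above; iostat_ids is a hypothetical helper name, and the "|" separator is an arbitrary choice.

```shell
#!/bin/sh
# iostat_ids: read "iostat -Er" output on stdin and print the unique
# VENDOR|PRODUCT pairs, with the padding blanks trimmed from both ends.
iostat_ids() {
    awk -F',' '/^Vendor:/ {
        vendor = $1
        sub(/^Vendor: */, "", vendor)
        sub(/ +$/, "", vendor)
        product = $2
        sub(/^Product: */, "", product)
        sub(/ +$/, "", product)
        print vendor "|" product
    }' | sort -u
}

# Sample lines copied from the listing above.
iostat_ids <<'EOF'
Vendor: ATA      ,Product: ST3320620AS      ,Revision: K    ,Serial No:   
Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3BL  
Vendor: SEAGATE  ,Product: ST973401LSUN72G  ,Revision: 0556 ,Serial No: 071611N3CQ  
EOF
```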

Query the kernel

Another suggested method was to query the kernel using a series of requests to the mdb -k debugger:

Code Block
# echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | \
  ::print struct scsi_device sd_inq | ::print struct scsi_inquiry \
  inq_vid inq_pid" | mdb -k

For instance, in my case I'm getting this output:

Code Block
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST2000NM0001    " ]
inq_vid = [ "SEAGATE " ]
inq_pid = [ "ST3300657SS     " ]
...

So my sd.conf (for some "green" drives) looks like this:

Code Block
sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                 "SEAGATE ST2000NM0001", "power-condition:false";

Observing Negotiated Block Sizes

Unfortunately, the classic Solaris tools for observing block or sector sizes do not show both the logical and physical block sizes. However, you can observe the values negotiated by the sd driver using mdb.

Code Block
# echo ::sd_state | mdb -k | egrep '(^un|_blocksize)'
un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x200
...

The unit number (un) corresponds to the SCSI driver (sd) instance number. In the above example, "un 0" is also known as "sd0." Sizes are in bytes, written as hex: 0x200 = 512, 0x1000 = 4096.
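To avoid converting the hex values by hand, the filtered mdb output can be passed through a small loop. The sketch below relies on printf accepting C-style hex constants for %d; blocksizes_decimal is a hypothetical helper name, and the 0x1000 sample line is made up to show a 4KB case.

```shell
#!/bin/sh
# blocksizes_decimal: read lines such as "un_phy_blocksize = 0x200" on
# stdin and print each block size in decimal bytes. Other lines (for
# example the "un 0: ..." headers) are skipped.
blocksizes_decimal() {
    while read -r name eq value; do
        case $name in
            un_*_blocksize) printf '%s = %d bytes\n' "$name" "$value" ;;
        esac
    done
}

blocksizes_decimal <<'EOF'
un 0: ffffff01d1b18d00
    un_sys_blocksize = 0x200
    un_tgt_blocksize = 0x200
    un_phy_blocksize = 0x1000
EOF
```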

To inspect the ashift value actually used by ZFS for a particular pool, you can use zdb (without parameters it dumps labels of imported pools), for example:

Code Block
# zdb | egrep 'ashift| name'
     name='pond'
                 ashift=9
                 ashift=9
                 ashift=9
                 ashift=9
                 ashift=9
                 ashift=9
                 ashift=9
                 ashift=9
                 ashift=9
     name='rpool'
                 ashift=9
     name='temp'
                 ashift=9

Note that each pool can consist of one or several Top-level VDEVs, and each of those can have an individual composition of devices, i.e. not only mixing of mirrors and raidzN sets is possible, but also of devices with different hardware sector size and thus ashift as set at TLVDEV creation time. However, in order to avoid unbalanced IO and "unpleasant surprises" which might be difficult to explain and debug, it is discouraged to build pools from such mixtures.
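Pools that mix ashift values can be spotted by summarizing the filtered zdb output per pool. This is a rough sketch that assumes its input has already been reduced to the name= and ashift= lines as in the egrep example above; ashift_summary is a hypothetical helper name and the pool "tank" in the sample is made up.

```shell
#!/bin/sh
# ashift_summary: read "zdb | egrep 'ashift| name'" style output on
# stdin and print, per pool, the distinct ashift values seen among its
# top-level vdevs, flagging pools that have more than one.
ashift_summary() {
    awk -F"'" '
        /^ *name=/ {
            pool = $2                        # text between the quotes
            if (!(pool in count)) { order[++n] = pool; count[pool] = 0 }
            next
        }
        /ashift=/ {
            value = $0
            sub(/^.*ashift=/, "", value)
            value = value + 0
            if (!((pool, value) in seen)) {
                seen[pool, value] = 1
                count[pool]++
                list[pool] = list[pool] " " value
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                pool = order[i]
                note = (count[pool] > 1) ? " (MIXED)" : ""
                printf "%s: ashift%s%s\n", pool, list[pool], note
            }
        }
    '
}

ashift_summary <<'EOF'
     name='pond'
                 ashift=9
                 ashift=9
     name='tank'
                 ashift=9
                 ashift=12
EOF
```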

Migration of older pools onto newer disks

... TODO: expand the text

In short: if the old pool was created with 512B-sectored drives, it is best not to replace its disks with 4KB-sectored ones, but to create a new pool with those disks and proper alignment, and copy over the data. The copy can be done in a number of ways, one of which is a "zfs send ... | zfs recv ..." run and another is a series of "rsync" invocations, perhaps with some manual labor to recreate the needed dataset hierarchy and apply ZFS properties, and even to logically recreate the snapshot and cloning history. The "rsync" approach may be especially useful if you need to change the dataset layout as well, because over time people often find their old setup not as optimal as they initially thought ;)

Note: It was reported on the mailing lists that rsync version 3.10 has finally added support for Solaris ACLs and extended attributes. Previously it was recommended to use Sun (not GNU) tar or cpio to transfer filesystems with such extended file/dir attributes and ACLs.
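For the "zfs send ... | zfs recv ..." route, one common shape is a recursive snapshot followed by a replication stream. The sketch below only prints the commands so they can be reviewed first. migration_plan is a hypothetical helper name, the pool and snapshot names are placeholders, and the flag choice (-R on send, -vFd on recv) is one reasonable option rather than a recommendation from the original authors.

```shell
#!/bin/sh
# migration_plan OLDPOOL NEWPOOL SNAPNAME: print the commands for a
# whole-pool copy. Nothing is executed; -F on the receiving side rolls
# back or overwrites conflicting data there, so review before running.
migration_plan() {
    old=$1
    new=$2
    snap=$3
    echo "zfs snapshot -r ${old}@${snap}"
    echo "zfs send -R ${old}@${snap} | zfs recv -vFd ${new}"
}

migration_plan oldpool newpool migrate
```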

...

Thanks

Thanks go to active users of the openindiana-discuss and zfs-discuss mailing lists, including Richard Elling, George Wilson, Sašo Kiselkov, Edward Ned Harvey, Jim Klimov and countless others, including all those who ask questions so we can all receive answers.

...

...