Accommodating Multi-Zone Disks with Linux Logical Volume Design

During the investigation of the effect of striping on raw disk device performance, a reduction in disk IO rate was observed during a sequential read of a single disk. This investigation focuses on quantifying and explaining this IO reduction, and proposes a logical volume design to offset the effect.

Test Setup: Hardware
The hardware used for this test was:

The Intel 965P chipset has six SATA ports off the southbridge, each with a maximum bandwidth of 300MB/s. Four of these were used to support the drives for this test. An overview of the 965P chipset architecture can be found in (PDF); the chipset block diagram from that document is shown below.

965 Chipset Block Diagram

For more details on the hardware configuration of the system, the output from lshw can be seen here.

Test Setup: Software
The software used for this test was:

The IOgraph script was written to generate the graphs showing CPU, disk, and network utilization for these performance tests.

Throughout this investigation, we'll be characterizing IO performance by using the dd command, specifically:

dd if=<device> of=/dev/null bs=1024K
We use this read because it's non-destructive, which makes it easy to replicate these tests on existing filesystems without harm. In addition, previous tests showed that write latency is quite small and that current-generation drives read and write at similar speeds, so the read speed is a reasonable proxy for the write speed. We use the large block size to avoid any bottlenecks related to small block sizes. By using large block sizes and continuous reads, we are intentionally looking for best-case performance: the IO rates we see will not be typical of a normal workload, but they form the upper bound on the IO rate a normal workload can see.
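As a minimal sketch, the same non-destructive read can be bounded with dd's standard skip and count operands, which makes it possible to sample one region of a device at a time. The helper name read_span is hypothetical and is not part of the tooling used for these tests; the summary it prints is whatever the local dd writes to stderr.

```shell
# read_span SOURCE SKIP_BLOCKS COUNT_BLOCKS   (hypothetical helper name)
# Sequentially read COUNT_BLOCKS blocks of 1024K from SOURCE, starting
# SKIP_BLOCKS blocks in, and discard the data. dd reports its record counts
# (and, with GNU dd, the transfer rate) on stderr, so stderr is redirected
# to stdout to make the summary capturable.
read_span() {
    dd if="$1" of=/dev/null bs=1024K skip="$2" count="$3" 2>&1
}

# Example (needs read access to the device; values are illustrative):
# read 1024 blocks (1GB) starting 25600 blocks (25GB) into /dev/sda.
#   read_span /dev/sda 25600 1024
```

Because the read only ever writes to /dev/null, it is as harmless to existing data as the full-device dd shown above.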

Problem Description
During the testing for my investigation into disk striping, I noticed a progressive reduction in disk IO during a sequential read of a single physical disk. This reduction was a stepped decrease in throughput while running a dd command. The reduction can be seen in this graph:

IO Rate Decrease During 100G Read of /dev/sda

Note the decrease in the read IO rate at about the 500s and 1000s marks.

We can see the full extent of this reduction by removing the count parameter, so that the dd command reads the entire disk. Doing this shows that the reduction continues over the entire disk, and gets worse the further into the disk we read:

IO Rate Decrease During Full Read of /dev/sda

By the end of the dd (not counting the final vertical tail at the very end), the read IO rate is only about half what it was at the beginning.


The first and simplest test is to distinguish whether the rate step-down is related to distance into the disk or time into the test. If we partition the drive such that /dev/sda1 is a 50GB partition starting from track 0, and then dd that partition we see this behaviour:

IO Rate Decrease During 0G Offset Read of /dev/sda

We can see the first reduction at around 500s. If this step is related to the duration of the test (as would be expected if it were a limitation in the application, operating system, or hard drive controller software), then adjusting the partition start should continue to show the reduction at ~500s. However, if the IO reduction is related to the physical parameters of the drive, then adjusting the partition start with respect to the drive geometry should correspondingly adjust the time at which the reduction occurs.

If we change the partitioning so that /dev/sda1 is a 50GB partition (the same size as before) but starts from a track ~25GB into the drive, we see this behaviour:

IO Rate Decrease During 25G Offset Read of /dev/sda

This provides some evidence that the IO rate reduction is a consequence of drive geometry: based on the above data, it appears that the further into the drive we read, the slower the IO rate is.

My suspicion was that this IO rate reduction was a result of the reduced IO rate available as the read or write operation was performed on smaller and smaller tracks. Hard drives hold their data on multiple platters divided into concentric tracks. The size of each disk track increases as the distance from the spindle increases: tracks near the outer edge of the disk are larger than the tracks near the middle. Since bit density is approximately constant over the drive media, this means there is more data in the outer tracks than there is in the inner tracks. The IO rate is in part determined by how often (and how far) the drive head must move: each time the drive head has to reposition on a new track, the IO rate suffers. This means that, when the drive head is reading from the outer (and larger) tracks, there is more data available to be read before an IO-rate-impacting drive head reposition must occur, and thus the IO rate is higher.
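One way to put rough numbers on this reasoning is a small calculation under two simplifying assumptions: bit density along a track is constant, and the platter spins at a constant rate. Under those assumptions the data passing under the head per revolution scales with the track circumference, and so with the track radius. The radii below are hypothetical round numbers chosen for illustration, not measurements of the drive under test.

```shell
# rate_ratio INNER_RADIUS OUTER_RADIUS   (hypothetical helper name)
# Assuming constant linear bit density and constant rotation speed, the
# sequential rate on a track is proportional to its radius, so the ratio of
# innermost to outermost rate is simply inner / outer.
rate_ratio() {
    awk -v inner="$1" -v outer="$2" 'BEGIN { printf "%.2f\n", inner / outer }'
}

# Hypothetical radii in mm: an innermost track at half the outermost radius.
rate_ratio 23 46   # prints 0.50
```

An inner radius of about half the outer radius would give an innermost rate of about half the outermost rate, which is the same order as the reduction seen in the full-disk read.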

After some searching, I found several articles describing this effect. The design of disks is such that tracks are grouped into "zones", where the data density is the same for all tracks in a zone, and thus the IO rate is the same for all tracks in the zone. These two papers describe these "multi-zone disks" in passing:

Multi-Zone Logical Volume Design Criteria
While this reduction in IO rate is an unavoidable part of platter-based drive geometry, we would like to design our logical volumes to ameliorate the effects of multi-zone disks. There are two key criteria that a logical volume design must meet to achieve this: stripe speed matching and zone ordering to optimize disk access speeds.

Logical Volume Design
Given the criteria of not striping partitions with different speeds, and gaining the ability to match disk speeds with data usage, there are several possible solutions to accommodating the changing IO rates of the disks.

Logical Volume Design: Single Disk, Multiple Partitions
The most straightforward solution is to simply not stripe at all, but instead partition a single disk. If we partition a single disk (D1) into multiple partitions (P1 through P4), we can then select the partitions to be used for slow or fast data access based on their position on the disk. We can use the slow partitions for relatively less performance-sensitive data, and the fast partitions for more performance-sensitive data. We're not mismatching speeds in a stripe (since we're not striping), and we can choose which partition serves what data based on the speed.

Single Disk, Multiple Partitions

There are several downsides to this solution, however. In order to match the data with the speed, we must have several partitions, which is a problem if we really want all of the disk space in a single partition. We also lose the benefits of striping, in particular the greatly improved aggregate IO rate, since we're not using multiple partitions in parallel. This solution would be adequate for limited purposes, but a more complete solution is needed.

Logical Volume Design: Multiple Disks, Single Stripe
Given several identical drives (D1 through D4), we could partition them so that each drive contains a single partition (P1) holding all the disk space, and then stripe those partitions together into a single stripe (S1). Because we're striping identical disks, we will implicitly stripe corresponding disk tracks together, which means we will not be mismatching partition speeds. And by striping over multiple disks, we will gain the advantage of greatly improved IO rate.

Multiple Disks, Single Stripe

This is essentially the solution described in this striping analysis. However, striping entire disks together restricts our ability to select what we use our fast tracks for. By default, the disk is filled from fast (low numbered) tracks to slow (high numbered) tracks: this means that the first data placed on the stripe will be placed on the fast tracks, and as we progressively fill up the stripe, the data is written to slower and slower parts of the disks. Essentially, the fuller the stripe gets, the slower it gets.

A variant of this is to divide the stripe S1 into multiple logical volumes: as logical volumes are allocated on the stripe, they will begin with fast storage and move to slow, similar to partitioning a single disk into multiple partitions. For a system in which having multiple partitions is acceptable, this is a good solution. However, like the Single Disk, Multiple Partitions solution, it prevents all of the space from being used in a single partition.

Logical Volume Design: Multiple Disks, Multiple Partitions
A combination of the previous two designs is to partition each drive, and then create four stripes, each containing the same partition from each disk.

Multiple Disks, Multiple Partitions

Assuming that the partitions range from fast (P1) through slow (P4), the stripes will also range from fast (S1) through slow (S4). Data can then be divided over these stripes depending on performance requirements. There is also no mismatch within each stripe, since the four components of each stripe have the same speed.

However, as with the Single Disk, Multiple Partitions design, this again creates problems if we want a single stripe with all available space.

Logical Volume Design: Multiple Disks, Hierarchical Stripe
The logical volume software used on Linux allows the user to designate disk partitions as physical volumes. These physical volumes can be collected into volume groups, and these volume groups can then be divided into logical volumes, as a sort of logical partition. Depending on how the logical volume is created, it can use the underlying physical volumes either in parallel (striping) or sequentially (concatenation), although thus far we have only been striping across the volumes.

However, it's not necessary for the physical volumes which form the underpinnings of this hierarchy to actually be physical disk partitions: any block device can be used. This gives us the flexibility to organize our disks and partitions in such a way as to have a single logical volume striped across all disks with the ability to select at which point in the storage range the fast tracks are located.

Multiple Disks, Hierarchical Stripe

The result is a single logical volume C1 encompassing all the space on four disks. At any time, reading or writing the single logical volume accesses data from tracks at the same point on all four disks, so there is no speed mismatch. And, because we have the flexibility of ordering the stripes which compose C1, we can put the fast part of the disk where we wish: creating C1 by concatenating S4, then S3, then S1, and then S2 would put the fastest disk storage (S1) in the third quarter of the storage range, between the halfway and three-quarter marks.
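The effect of the concatenation order on where each stripe lands can be sketched with a few lines of shell arithmetic. This assumes all stripes are the same size, which holds only if every disk is divided into equal partitions; the helper name concat_layout is hypothetical.

```shell
# concat_layout STRIPE...   (hypothetical helper name)
# Print the share of the storage range each stripe occupies when the stripes
# are concatenated in the order given, assuming equal-sized stripes.
concat_layout() {
    n=$#
    i=0
    for s in "$@"; do
        echo "$s $(( i * 100 / n ))%-$(( (i + 1) * 100 / n ))%"
        i=$(( i + 1 ))
    done
}

concat_layout S4 S3 S1 S2
# prints:
#   S4 0%-25%
#   S3 25%-50%
#   S1 50%-75%
#   S2 75%-100%
```

With this ordering the two slowest stripes fill the first half of the range and the two fastest fill the second half.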

The choice of dividing each physical disk into four partitions was arbitrary, and is coupled with the optimization requirement. Dividing the physical disks into more partitions increases the ability to fine-tune disk performance over the storage range, at the cost of additional setup work.

Hierarchical Stripe: Setup
For this example, we begin with four disks: /dev/sda, /dev/sdb, /dev/sdc, and /dev/sdd.
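A minimal sketch of one way to build this hierarchy with the standard LVM2 tools follows. It is an illustration under assumptions, not necessarily the exact command sequence used for these tests: it assumes each disk has already been divided into four partitions named /dev/sdX1 through /dev/sdX4 (fastest first), and that the volume groups are named vg1 through vg5 with a logical volume lvol0 in each. The helper plan_hierarchical_stripe is hypothetical and only prints the commands, so they can be reviewed before anything is run as root.

```shell
# plan_hierarchical_stripe   (hypothetical helper; prints, does not execute)
# Device names, volume group names, and the partition layout are assumptions.
plan_hierarchical_stripe() (
    disks="sda sdb sdc sdd"

    # Layer 1: for each partition number, stripe the same partition from
    # all four disks into one logical volume (vg1 fastest ... vg4 slowest).
    for n in 1 2 3 4; do
        parts=""
        for d in $disks; do
            parts="$parts /dev/${d}${n}"
        done
        echo "pvcreate$parts"
        echo "vgcreate vg${n}$parts"
        echo "lvcreate -i 4 -l 100%FREE -n lvol0 vg${n}"
    done

    # Layer 2: use the four striped volumes as physical volumes of vg5 and
    # concatenate them in the chosen order: slowest first, fastest third.
    order="vg4 vg3 vg1 vg2"
    pvs=""
    for vg in $order; do
        pvs="$pvs /dev/${vg}/lvol0"
    done
    echo "pvcreate$pvs"
    echo "vgcreate vg5$pvs"

    # Create the top-level volume on the first stripe, then extend it onto
    # each following stripe so the on-disk order is explicit.
    first=yes
    for vg in $order; do
        if [ "$first" = yes ]; then
            echo "lvcreate -l 100%PVS -n lvol0 vg5 /dev/${vg}/lvol0"
            first=no
        else
            echo "lvextend -l +100%PVS /dev/vg5/lvol0 /dev/${vg}/lvol0"
        fi
    done
)

plan_hierarchical_stripe
```

Naming the physical volume on each lvcreate and lvextend call is a deliberate choice here, so that the concatenation order does not depend on the allocator's defaults. On newer LVM2 releases the devices/scan_lvs setting in lvm.conf may need to be enabled before logical volumes are accepted as physical volumes.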

The /dev/vg5/lvol0 device file now represents a concatenation of four stripes, ordered as above.

Hierarchical Stripe: Example
The purpose of the hierarchical stripe is to control where the fast data access is in the storage range. If we create a full disk stripe across four drives, without any hierarchical striping, the data access IO read rate over the storage range will look like this:

Un-Reordered Logical Volume Read IO Rate Profile

This represents the bad situation: the fast tracks are at the beginning, and by the time we're halfway into the stripe, our IO rate has begun dropping.

If we set up a drive with hierarchical striping as described above, and perform the test, we now see:

Reordered Logical Volume Read IO Rate Profile

This shows the expected IO rates at various points in the storage range: near the beginning it is slowest, at 40-50MB/s, followed by vg3 at 55-65MB/s. As intended, the latter half of the disk is composed of vg1 and vg2, with their high IO rates of around 70MB/s.

Hierarchical LVM Device Activation
Unfortunately, the /etc/init.d start scripts for LVM on Debian and possibly other distributions don't handle hierarchical logical volumes very well. They do a single pass vgscan and vgchange, and since the top level volume won't be seen until the sub-volumes are activated, the top level volume isn't detected or activated.

This can be fixed in the /etc/init.d/lvm start script by duplicating the vgscan and vgchange lines, so that a vgscan and vgchange is done, followed by another vgscan and vgchange. Alternatively, the /etc/init.d/lvm script can be run twice by linking it twice in the /etc/rcS.d directory. So if /etc/rcS.d/S26lvm originally existed as a link to /etc/init.d/lvm, then create another link called /etc/rcS.d/S27lvm. When the second instance of the lvm script is run, it will detect the top level volume group and activate it.
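The two-pass activation can be sketched as a small shell function. This is an illustration of the idea rather than the contents of any distribution's init script; the function name activate_lvm_two_pass is hypothetical, and the exact vgscan and vgchange options used by a given /etc/init.d/lvm may differ.

```shell
# activate_lvm_two_pass   (hypothetical helper name)
# Pass 1 finds and activates volume groups built on disk partitions. That
# makes their logical volumes appear as block devices, so pass 2 can find
# and activate the volume group layered on top of them.
activate_lvm_two_pass() {
    for pass in 1 2; do
        vgscan || return 1
        vgchange -a y || return 1
    done
}

# The symlink alternative described above would be, as root:
#   ln -s ../init.d/lvm /etc/rcS.d/S27lvm
```

A hierarchy with more than two layers would need one pass per layer.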

Overall Conclusions

Revision History

March 27, 2007
- first published

March 28, 2007
- added section explaining use of dd command