Effect of Striping on Raw Disk Device Performance
with Linux Logical Volumes

This test was intended to determine how much improvement can be seen in raw disk read and write performance on Linux when one, two, three, and four disk stripes are used with the logical volume software. Results show that given proper configuration, sequential read and write performance improves nearly linearly as additional devices are striped.

Test Setup: Hardware
The hardware used for this test was:
  • CPU: Core 2 Duo 6300
  • Motherboard: Gigabyte GA-965P-DS3
  • Chipset: Intel 965P
  • Memory: 2GB DDR2
  • OS Drive: 120G IDE
  • Data Drives: 4x 320GB SATA
  • Storage: Promise SuperSwap 4100 enclosure
  • Network: Marvell 8053 Gigabit LAN Controller
The Intel 965P chipset has six SATA ports off the southbridge, each with a maximum bandwidth of 300MB/s. Four of these were used to support the drives for this test. An overview of the 965P chipset architecture can be found at http://www.intel.com/products/chipsets/P965/prodbrief.pdf (PDF). The chipset block diagram from that document is shown below.

For more details on the hardware configuration of the system, the output from lshw can be seen here.

Test Setup: Software
The main piece of software used for this test was the IOgraph script, which was written to generate the graphs showing CPU, disk, and network utilization for these performance tests.

Throughout this investigation, we'll be characterizing IO performance by using the dd command, specifically:

dd if=<device> of=/dev/null bs=1024K
We use this read because it's non-destructive, which makes it easy to replicate these tests on existing filesystems without harm. Moreover, previous tests have shown that write latency is quite small and that current-generation drives read and write at similar speeds, so read speed is sufficiently representative of write speed as well. We use the large block size to avoid any per-request bottlenecks. By using large block sizes and continuous reads, we are intentionally looking for best-case performance: the IO rates we see will not be typical of a more normal workload, but they will form the upper bound on the IO rate a normal workload can see.
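As a concrete sketch of how throughput was derived (the helper function is my own illustration, not part of the original tests), the dd invocation above can be wrapped to report an average MB/s directly:

```shell
# Hypothetical helper: time a sequential dd read of a device and print
# the average throughput in MB/s.  count is the number of 1MB blocks.
measure_read_mbs() {
    device=$1
    count=$2
    start=$(date +%s)
    dd if="$device" of=/dev/null bs=1024k count="$count" 2>/dev/null
    end=$(date +%s)
    elapsed=$(( end - start ))
    [ "$elapsed" -eq 0 ] && elapsed=1   # guard against sub-second runs
    echo $(( count / elapsed ))
}

# e.g. measure_read_mbs /dev/sda 102400  (run as root against a raw device)
```

Since each block is 1MB, the block count divided by elapsed seconds gives MB/s directly.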

Test Procedure: Overview
In order to assess the influence of disk stripe width (the number of disks striped together) on raw device performance, the following steps were followed:
  • create a logical volume of a single drive
  • test the read and write speed of that logical volume
  • create a logical volume of two drives
  • test the read and write speed of that logical volume
  • create a logical volume of three drives
  • test the read and write speed of that logical volume
  • create a logical volume of four drives
  • test the read and write speed of that logical volume
I didn't test beyond a four disk stripe because I only had four disks available to me, and because a four disk stripe was intended as the final configuration on this server, so performance measurements beyond four disks were not as relevant.

Test Procedure: Striping
The Linux logical volume software was used to stripe the disks for the test. The four disks were Seagate 320G SATA II devices (model ST3320620AS), controlled through the motherboard's SATA ports. This puts them directly on the 965P's south bridge SATA ports, each with a bandwidth of 300MB/s. Their physical device names are /dev/sda, /dev/sdb, /dev/sdc, and /dev/sdd.

The commands used to stripe the disks were:

pvcreate <device1> [<device2> ...]
vgcreate vg1 <device1> [<device2> ...]
lvcreate -i <numberofdevices> -l 30000 vg1
This creates each device as a physical volume, then creates a volume group out of the physical volumes, and finally partitions space on the volume group for the logical volume. The stripe width (the number of disks in the stripe) is specified with the -i option to lvcreate.

So, for example, to make a two-disk stripe of /dev/sda and /dev/sdb, the following commands are used:

pvcreate /dev/sda /dev/sdb
vgcreate vg1 /dev/sda /dev/sdb
lvcreate -i 2 -l 30000 vg1
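Each stripe width requires tearing the previous volume group down before the next one can be built. As a dry-run sketch of the full cycle (the helper name and the teardown steps are my assumption, not commands from the original writeup):

```shell
# Hypothetical dry-run helper: print the LVM command sequence for an
# N-disk stripe test, including the teardown needed before the next
# stripe width can be created.  vg1/lvol0 follow the naming above.
stripe_test_cmds() {
    n=$1; shift
    disks="$*"
    echo "pvcreate $disks"
    echo "vgcreate vg1 $disks"
    echo "lvcreate -i $n -l 30000 vg1"
    echo "dd if=/dev/vg1/lvol0 of=/dev/null bs=1024k count=102400"
    echo "lvremove -f /dev/vg1/lvol0"
    echo "vgremove vg1"
    echo "pvremove $disks"
}

stripe_test_cmds 2 /dev/sda /dev/sdb
```

Printing the commands first makes them easy to review before touching real disks; piping the output to sh (as root) would execute the cycle.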

Test Procedure: Data Read and Write
The purpose of this test was to understand the theoretical upper limit to the benefit of disk striping on disk performance, rather than to assess the practical benefits of disk striping on a typical disk workload. Thus I didn't test read and write performance through a filesystem or using a realistic model of reads and writes. Instead, I generated a continuous stream of read or write accesses using the dd command.

Typical dd read and write commands would look like:

dd if=/dev/vg1/lvol0 of=/dev/null bs=1024k count=102400
dd if=/dev/zero of=/dev/vg1/lvol0 bs=1024k count=102400
For disk reads, this copies 100G of data from the logical device and sends it to /dev/null: since the /dev/null write latency is (presumably) small, the duration of the command will be dominated by disk performance.

Similarly, for disk writes, this sources the data to be written from /dev/zero. Since the IO rate of reading data from /dev/zero is very high, the performance of this command is dominated by disk performance.

To avoid the influence of buffer cache on the results, the commands generate a quantity of data many times the size of memory: 100G of data read/written compared to 2G of physical memory.
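The sizing argument can be confirmed with quick arithmetic: the data volume is 50 times physical memory, so only a small fraction of accesses could ever be served from cache.

```shell
# Sanity check on the cache-avoidance sizing: 100G of data versus
# 2G of physical memory.
data_mb=102400     # 102400 x 1MB blocks = 100G
mem_mb=2048        # 2G physical memory
echo "data is $(( data_mb / mem_mb ))x physical memory"
```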

Test Procedure: Measurement
The IOgraph software is used in a similar fashion to the time command. The user runs the iograph command, and specifies the commands to be run and the resources to be monitored. The iograph command then begins monitoring the resources and executes the specified commands: once all the commands are complete, the monitoring data is collected and put into graph form with gnuplot.

So, for example, to execute the dd read command from above during a single disk stripe test, iograph would be used like this:

    iograph -cmd "dd if=/dev/vg1/lvol0 bs=1024K count=102400 of=/dev/null" \
        -disk /dev/sda1 \
        -cpu \
        -filename device-read-1-stripe \
        -title "1-stripe 100G device read"
This would make iograph execute the specified dd command while monitoring CPU utilization (-cpu) and the read and write performance on disk /dev/sda1.
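IOgraph's per-disk rates ultimately come from kernel counters. If the script isn't available, a rough stand-in (my sketch, not IOgraph's actual implementation) can sample /proc/diskstats directly:

```shell
# Sample the sectors-read counter (field 6 of /proc/diskstats) twice
# and convert the delta to MB/s, assuming 512-byte sectors.
sectors_read() {
    awk -v d="$1" '$3 == d { print $6 }' /proc/diskstats
}

disk=${DISK:-sda}                 # device name to watch, e.g. sda
a=$(sectors_read "$disk")
sleep 1
b=$(sectors_read "$disk")
awk -v a="${a:-0}" -v b="${b:-0}" \
    'BEGIN { printf "%.1f MB/s\n", (b - a) * 512 / 1048576 }'
```

Repeating the sample in a loop and feeding the results to gnuplot reproduces the kind of rate-over-time graph discussed below, though without IOgraph's conveniences.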

Test Procedure: Measurement Results
There are two primary measurements taken for each sub-test: the duration of the read or write command indicating overall throughput, and the graphs produced by IOgraph. The fundamental measure is the duration: presumably as additional disks are striped, overall performance will increase and this should be reflected in reduced run times, indicative of increased throughput.

Beyond the overall device throughput, however, the graphs from IOgraph will provide a more detailed view of what the system is doing. Specifically, we'll be looking at several parameters:

  • Disk IO Rate: This is measured in MB/s over the duration of the test, for all disks in the stripe. This is used to confirm that the disks in the stripe are being utilized uniformly, and to monitor throughput over time.
  • System CPU Utilization: This will tell us whether we're spending more or less time in system calls: because the benchmark we're using is largely CPU bound, this is a measure of how much work is being done.
  • IO wait CPU Utilization: IO wait is time the CPU is idle while one or more processes are blocked waiting for a read or write request to be fulfilled. It indicates an inefficiency, in that the system is idle waiting for IO when it could have otherwise been performing useful processing.
Of these measures, the IO wait times will be most interesting: the key benefit of striped drives is that it allows a single read or write to be farmed out across multiple drives. Since the bulk of disk IO latency is incurred while waiting for disk heads to move and platters to spin, having multiple disks allows multiple reads and writes to be run in parallel. So, as additional disks are striped, we anticipate that IO wait time will be reduced.

Results: Read Disk Utilization
These graphs show the disk read rates for 100G reads on one, two, three, and four disk stripes. The width of the disk stripe is shown in the title of each graph. Click on each graph to get a large version of it.

  • Overall, sequential read throughput on any given disk hovers around the 70-80MB/s mark: I was informally expecting something in the area of 30MB/s (based on previous experience with IDE drives), so this was a pleasant surprise.
  • Disk devices in the multiple-disk tests do appear to be used uniformly. This isn't unexpected, but it's nice to see confirmation that the devices are being used nearly identically.
  • With the exception of the four-stripe, throughput for each test scales almost exactly in proportion to the number of disks, with durations falling correspondingly. This is a very positive result, since it means that striping additional disks provides a big benefit to IO throughput (for raw sequential access, at least).

    stripe size    duration    scale factor
    1              1364s       1.00x
    2               676s       2.01x
    3               447s       3.05x
    4               360s       3.79x

  • The single disk IO graph shows a stepped pattern, with the read IO rate starting at ~77MB/s but stepping down to ~75MB/s at about 500s into the test, and stepping down again to ~73MB/s at about 1000s into the test. It's unclear whether this step pattern is also present in the 2-, 3-, and 4-stripe graphs: it could be present, but masked because of the noise in the graph.
  • This reduction in IO throughput as we progress into the drive is actually a very worrying result: some informal tests (not shown here) indicate that this reduction in IO rate continues as we move further into the storage. I believe this is caused by the multi-zone nature of modern hard drives, and it has a serious impact on stripe design. A future performance investigation will focus on this.
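The scale factors in the table above can be recomputed from the measured durations (scale = single-disk duration / N-disk duration); tiny differences from the published factors come from the durations being rounded to whole seconds.

```shell
# Recompute the scale factors from the measured read-test durations:
# scale = single-disk duration / N-disk duration.
for t in 1364 676 447 360; do
    awk -v base=1364 -v t="$t" 'BEGIN { printf "%.2fx\n", base / t }'
done
```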

Results: Read CPU Utilization
These graphs show the CPU utilization for 100G reads on one, two, three, and four disk stripes. The width of the disk stripe is shown in the title of each graph. Click each graph to see a larger image of the graph.

  • The bulk of all of these graphs show a red section layered on a green section. The green section represents system CPU usage, and the red shows IO wait CPU. Green is good, since it represents useful work being done by the operating system. Red is bad, since it represents time when useful work could have been done, but wasn't because the system was waiting for IO.
  • The system is running a Core 2 Duo CPU, which appears to the system to be two separate CPUs. That is why the system and IO wait CPU sum to only 50%, because they are only occupying one CPU. (See the write CPU graphs for discussion of why some graphs do go to 100%.)
  • From left to right and top to bottom, the graphs show a 1-, 2-, 3-, and 4-stripe test. It is easy to pick out the reduction in red over the graphs, which is quite satisfying.
  • As noted above, I suspect the non-linear scaling of the 4-stripe disk shown in the table above is caused by the system now bottlenecking on CPU rather than disk, and the fourth graph shows this. The red portion on that graph is quite small.

Results: Write Disk Utilization
These graphs show the disk write rates for 100G writes on one, two, three, and four disk stripes. The width of the disk stripe is shown in the title of each graph.

  • The results here largely mirror the results in the read test.
  • Interestingly, the read and write IO rates are very similar: this would indicate that IO rates are primarily driven by disk head movement and rotational latency, rather than the process of reading or writing the magnetic media.
  • As with the read IO graphs, the scaling factor as we stripe more disks is very close to linear:

    stripe size    duration    scale factor
    1              1364s       1.00x
    2               678s       2.01x
    3               451s       3.02x
    4               346s       3.94x

  • The 4-stripe write IO rate is higher here than the corresponding 4-stripe read IO rate, and (as will be seen in the next section) the IO wait time is higher too. I suspect this means that the write tests, unlike the read tests, are still disk-bound at four drives, and so the 4-stripe configuration sees more benefit from the fourth disk.
  • The same stepped pattern shown in the 1-stripe read test is also clearly apparent in the 1-stripe write test. Further, the steps are about the same size and occur at about the same point in the test. This implies that the step pattern is not likely to be related to a read- or write-specific limitation.

Results: Write CPU Utilization
These graphs show the CPU utilization for 100G writes on one, two, three, and four disk stripes. The width of the disk stripe is shown in the title of each graph.

  • As with the write IO rates, the write CPU graphs don't provide a great deal of extra information over the read CPU graphs.
  • I'm unclear why the combined IO wait and system time is capped at 50% in the first 1-stripe graph, while the rest have the combined value going to 100%. I initially thought that it was because in the 2-, 3-, and 4-stripe cases, there were multiple kernel threads writing to the different disks, but if that was the case, then I would expect the same thing to occur in the read CPU graphs, but it doesn't.

Overall Conclusions
  • Current generation SATA drives can support a sustained IO rate of 70-80MB/s when reading or writing sequentially, at least near the beginning of the storage.
  • Sequential read and write performance against the raw disk device scales close to linearly with the number of striped drives.
  • The reduction in IO wait CPU time clearly shows the benefit of striping multiple devices. Informally, then, the portion of time an existing system spends in IO wait would make a useful estimate of the benefit to be gained from moving to use striped drives.
  • The single threaded process driving sequential reads appears to reach saturation on a four-stripe, implying that efficiency would scale sub-linearly for more than four disks. However, I suspect a more normal workload with more disk seeks would drive IO wait times up again, and thus five- and six-stripe drives would produce visible benefit. Further performance tests will be directed at this question.
  • The posited explanation of multi-zone drive behaviour as the underlying cause of the stepped reduction in IO rate (shown most clearly in the 1-stripe read and write IO rate graphs) needs to be investigated, as this could have a serious impact on overall throughput.
  • It is highly likely that the improvements in throughput as additional drives are striped here will only be partially realized with a more realistic workload, since sequential access to the raw device is essentially the most efficient access method. Accessing the drive with a mixed read and write workload on top of a filesystem will likely show smaller benefits. Further work is needed to quantify that.

Revision History

March 20, 2007
- first published

March 28, 2007
- added section explaining use of dd command