Bad Logical Volume Design

Purpose
A well-designed logical volume can provide a greatly improved IO rate. But how bad can things get with a badly designed one? This page describes and quantifies the IO rates of several misdesigned logical volume layouts.

Test Setup: Hardware
The hardware used for this test was:

The Intel 965P chipset has six SATA ports off the southbridge, each with a maximum bandwidth of 300MB/s. An overview of the 965P chipset architecture can be found at http://www.intel.com/products/chipsets/P965/prodbrief.pdf (PDF). The chipset block diagram from that document is shown below.

965 Chipset Block Diagram

For more details on the hardware configuration of the system, the output from lshw can be seen here.

Test Setup: Software
The software used for this test was:

The IOgraph script was written to generate the graphs showing CPU, disk, and network utilization for these performance tests.

Throughout this investigation, we'll be characterizing IO performance by using the dd command, specifically:

dd if=<device> of=/dev/null bs=1024K

We use this read test because it is non-destructive, which makes it easy to replicate on existing filesystems without harm. We have also seen in previous tests that write latency is quite small and that current-generation drives read and write at similar speeds, so the read speed is a reasonable proxy for the write speed. We use the large block size to avoid any per-request blocking bottlenecks. By using large block sizes and continuous reads, we are intentionally looking for best-case performance: the IO rates we see will not be typical of a more normal workload, but they form an upper bound on the IO rate a normal workload can see.
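As a point of reference, this is roughly how each component device can be profiled in turn before building a stripe (the device names match those used below; the count value limiting each run is an arbitrary choice, not the exact procedure used for these tests):

# Read 4GB from each candidate partition; dd reports the average
# throughput for each run on stderr when it completes.
for dev in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1; do
    dd if=$dev of=/dev/null bs=1024K count=4096
done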

Investigation Overview
This investigation is intended to explore how much impact poor logical volume design can have on IO rate. The misconfigurations we'll be looking at are:

- Two Stripe Partitions on the Same Disk
- Two Stripe Partitions with Mis-Matched Speeds
- Striping Over a PCI Controller

Two Stripe Partitions on the Same Disk
Since striping improves IO rates, we should stripe across as many devices as possible; we have seen this improve IO rates linearly in some cases.

However, these performance benefits only come if the partitions or disks involved in the stripe are independent. If two partitions on the same disk are striped together, this can greatly impact IO rate.

We have four disks each with one partition: /dev/sda1, /dev/sdb1, /dev/sdc1, and /dev/sdd1. The read IO rate for any one of these partitions looks like this:


Typical Partition IO Rate Profile

Striping four such partitions from separate disks together yields a single logical volume with this IO rate profile:


4-Stripe of Independent Partitions

This is what a stripe should look like: all components performing at their maximum, so the aggregate IO rate of the stripe is equal to the sum of the IO rates of the stripe components.
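For reference, a 4-way stripe like this could be built along the following lines (the volume group and logical volume names match those used later on this page; the volume size and 64KB stripe size are assumptions, not the exact commands used for these tests):

# Label the partitions as physical volumes and group them.
pvcreate /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
vgcreate vg1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# -i 4 stripes across all four physical volumes; -I 64 sets a 64KB stripe size.
lvcreate -i 4 -I 64 -L 400G -n lvol0 vg1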

However, suppose we have additional space on /dev/sda, for example in a /dev/sda2 partition. The /dev/sda2 partition is slightly slower than /dev/sda1, but only marginally so:


/dev/sda2 Partition IO Rate Profile

To make use of this additional partition, we create a volume group consisting of /dev/sda1, /dev/sda2, /dev/sdb1, /dev/sdc1, and /dev/sdd1, and then create a single striped logical volume spanning all five partitions.
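Spelled out with the same assumed sizes as the earlier example, the mistake amounts to handing lvcreate two partitions that live on the same physical disk:

pvcreate /dev/sda1 /dev/sda2 /dev/sdb1 /dev/sdc1 /dev/sdd1
vgcreate vg1 /dev/sda1 /dev/sda2 /dev/sdb1 /dev/sdc1 /dev/sdd1

# -i 5 stripes across all five physical volumes, two of which
# (/dev/sda1 and /dev/sda2) are partitions of the same disk.
lvcreate -i 5 -I 64 -L 400G -n lvol0 vg1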

The IO rate profile of such a mixed volume, however, is remarkably poor:


5-Stripe of Non-Independent Partitions

With non-independent components in the stripe, each component delivers only about 3MB/s, giving this 5-stripe approximately 4% of the throughput of a 4-stripe composed of independent components.

The explanation is that striping data across two partitions on the same disk is very nearly the worst case for disk IO. Disk head motion is the enemy of performance, and since the logical volume stripe is only about 64KB wide, it is likely that each stripe-sized IO is followed by a disk head seek to the other partition on the same disk.

The lesson is clear: never stripe to two different locations on the same disk.
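One way to catch this before it hurts is to list the devices backing the logical volume and confirm that no physical disk appears more than once (a quick sketch; the exact output columns vary with the LVM version):

# Show which physical devices back each logical volume in vg1; seeing
# both /dev/sda1 and /dev/sda2 listed for one striped volume is the red flag.
lvs -o +devices vg1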

Two Stripe Partitions with Mis-Matched Speeds
Beyond ensuring that the components of the stripe are placed on different devices, it is also useful to check that the speeds of the stripe components are similar. The reason is that the speed of every component of a stripe is capped at the speed of the slowest component: if there are two partitions in a stripe and one can support 100MB/s and the other 10MB/s, then both components will be driven at 10MB/s, giving the two-component stripe an aggregate rate of roughly 20MB/s.
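A quick comparative read test of the candidate components before striping them will flag this kind of mismatch; the dd test above works, and hdparm's buffered-read timing is a rougher but faster alternative (a sketch, not part of the original test procedure):

# Report a short buffered-read benchmark for each candidate partition;
# a component noticeably slower than the rest will drag the whole stripe down.
hdparm -t /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1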

To demonstrate this, consider that we have four partitions: /dev/sda1, /dev/sdb1, /dev/sdc1, and /dev/sdd1. The IO rate for each of these individual partitions is essentially identical. This graph shows what it looks like for /dev/sda1:


Fast Partition Read IO Rate Profile

If we stripe across these four partitions to create /dev/vg1/lvol0, the overall IO rate graph for the disks in the 4-stripe looks like this:


Fast Logical Volume Read IO Rate Profile

This is what we expect: if each of the components of a stripe can sustain an IO rate of 70-75MB/s, then when striped together they continue to be capable of 70-75MB/s, resulting in a stripe with an aggregate IO rate of 280-300MB/s. When the stripe components are matched in IO rate, the aggregate IO rate is the sum of the IO rates of the components.

However, consider what happens if we change /dev/sdd1 so that it is composed of slower disk tracks (closer to the end of the disk). This bad partition has an IO rate profile that looks like this:


Slow Partition Read IO Rate Profile

As you can see, the average throughput for this bad partition is about 45MB/s.

If we now stripe together four partitions, three good partitions (each with an IO rate of 70-75MB/s), and one bad partition (with an IO rate of ~45MB/s), the overall IO rate graph for the entire stripe looks like this:


Slow Logical Volume Read IO Rate Profile

There's nothing really surprising about this: the system is reading from the stripe components in parallel, and this round-robin access can go no faster than the slowest component. Even though three of the stripe components are capable of higher throughput, each is driven only as fast as the slowest one, so the aggregate here is roughly 4 x 45MB/s, or about 180MB/s, rather than the 280-300MB/s of a fully matched stripe.

The lesson is to stripe components with similar speeds: striping a fast and a slow partition together essentially wastes the IO rate of the fast partition, since it will not be driven any faster than the slow one. This may not matter for a general-purpose logical volume, but if we want to optimize overall throughput then we must be careful about what we put into the logical volume.

Striping Over a PCI Controller
In addition to ensuring that we stripe over independent devices and that we match device speeds, we must also be aware of the adapters and busses over which we are accessing our data. All of the tests so far have been performed with the SATA disks attached to the on-board SATA ports on the motherboard. As can be seen in the 965 block diagram above, this affords each disk a dedicated 300MB/s channel.

If one were instead to use a PCI controller, the behaviour is somewhat different. The conventional PCI bus has a theoretical bandwidth of 132MB/s (32-bit at 33MHz); realistically, the achievable bandwidth is somewhat lower.

We begin with four drives: three are attached to the on-board SATA ports, and one is attached via a PCI SATA adapter based on the ICH8 chipset. If we measure the bandwidth of the PCI-attached drive, /dev/sda, we see the following profile:


PCI-Attached IO Profile

This looks quite reasonable, on par with the on-board SATA-attached drives shown above. However, if we compose a 4-stripe using the three on-board SATA-attached drives and the PCI-attached drive, we see that the overall throughput is somewhat low:


3 On-Board-Attached, 1 PCI-Attached IO Profile

We have previously seen a 4-stripe composed entirely of on-board SATA-attached devices achieve throughput of 70MB/s per device, so the rate of ~60MB/s per device here is lower than normal. It's not clear what the cause is: we know the PCI bus itself can sustain at least 75MB/s, as shown in the previous graph, so perhaps this is cumulative latency from the PCI bus affecting the on-board SATA throughput, or perhaps a limitation in the southbridge's ability to multiplex data across the busses.
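When diagnosing this kind of layout, it helps to confirm which controller each drive actually hangs off; the sysfs path for each block device includes the PCI address of its controller (a sketch, using the device names from these tests):

# Resolve each drive's sysfs device path; the embedded PCI address shows
# whether the drive sits on the on-board controller or the add-in PCI card.
for d in sda sdb sdc sdd; do
    echo -n "$d: "
    readlink -f /sys/block/$d/device
done

# Cross-reference those PCI addresses against the storage controllers.
lspci | grep -i -e sata -e ide -e raid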

The effect gets worse when additional drives are placed on the PCI controller. This graph shows the IO profile for a stripe in which two drives are hosted on the PCI SATA controller:


3 On-Board-Attached, 2 PCI-Attached IO Profile

Note that the IO rate is even lower. The problem is that the PCI bus is shared, and with two drives on the PCI controller, they must split the PCI bus bandwidth. Worse still, there is additional bandwidth lost to bus contention. These tests were run on a quiescent system; should there be any other PCI bus activity, the IO rate would drop even further.

Adding a second PCI controller will not help in this matter: the bottleneck is at the PCI bus, and not the PCI SATA adapter itself.

The lesson here, of course, is to avoid PCI-hosted disk adapters when striving for maximum throughput. If there are insufficient on-board SATA ports, the only other performance-preserving option is to use a PCI-Express (PCI-E) controller. Unlike PCI, each PCI-E slot has a dedicated link rather than a shared bus, with a throughput of 250MB/s per lane. Upper-end motherboards commonly provide multiple PCI-E slots, and PCI-E SATA controllers are available.

Overall Conclusions

The lessons from these tests:

- Never stripe across two partitions on the same physical disk; the resulting head seeks reduce every component to a few MB/s.
- Stripe components with similar speeds; the aggregate IO rate is the slowest component's rate multiplied by the number of components.
- Avoid PCI-hosted disk adapters when maximum throughput is the goal; the shared PCI bus caps the aggregate rate, so use on-board ports or a PCI-E controller instead.

Revision History

March 29, 2007
- first published