01.08.2020

Storage Performance Testing: Virtual Machine Disk

Performance benchmarking is a huge and complex topic, and we face different aspects of it from time to time. One relatively simple and probably common question is how to compare disks in the cloud. What benchmarks do we need to run, and what parameters should we use? When we want to compare different disk types or different cloud providers, the testing process should be reproducible, so that the results can be saved and compared later.

In this article I want to explain a benchmarking methodology for this case and show a small bash-based tool for running disk performance tests (using FIO[1]) and a Python-based tool to plot the results. The tool set is called EasyDiskBench[2]. I’ll also explain a bit about how to interpret the results.

Prerequisites

The reason for testing in this article is the need to compare different disks with each other. It means that we don’t have any predefined reference values.

Virtual machine disks might be literally anything. It could be a locally connected RAID array of HDDs or SSDs, or it could be any kind of network storage solution. Here the storage system behind the disk is treated as a blackbox. We don’t have access to the lower-level subsystem, and even if we have it (it still could be our own storage, right?), we don’t care about it. We don’t try to take into account the busy hour, load spikes, etc.

There’s no reason to run a specific application (e.g. PostgreSQL, gcc), because the object of the testing process is a disk. We don’t know (or don’t care!) what kind of software will be running on top of it, but we want to measure the disk performance. In this case there should be a test suite that generates different workloads. It will help to understand the disk performance under different workload profiles, and it can also indirectly show what mechanisms are used by the storage system.

The number one tool for disk testing is FIO[1]. It was written by Jens Axboe, the Linux kernel block layer maintainer. The tool has a lot of command line options, covers all the possible ways to do IO in Linux, and can even be used to test the network. If you’ve never seen this tool before, take a look at the documentation[3].

Taking all of this into account, FIO with a set of test cases is our choice.

FIO options

The most valuable options are:

direct=1 tells FIO to open a file with the O_DIRECT flag, which forces the Linux kernel to bypass the page cache.
ioengine=libaio makes FIO use the libaio library to do IO. This library is used by a lot of applications and provides an asynchronous interface. The most modern and interesting interface is io_uring[4], but it is still rarely used today.
loops=2 is used to run every test twice. The first and the second run can easily produce different values.
iodepth=1 is set because we care about latency for non-parallel workloads here.
sync=1 asks FIO to open the file with O_SYNC, which means that the Linux kernel will send a flush command for every write request. We could have used fsync=1 to make FIO call fsync() after every write, but it’s easier to parse the results when using the sync option. It’s used in the synchronized write tests.
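
As a rough illustration, the options above could be assembled into a fio command line like this. It is only a sketch and not the content of tests/basic.fio: the job name, the test file path, the size and the build_fio_cmd helper are assumptions made for the example.

```python
import shlex


def build_fio_cmd(name, rw, bs, filename, size="1G", sync=False):
    """Build a fio argument list from the options discussed above.

    name, filename and size are placeholders chosen for this sketch.
    """
    cmd = [
        "fio",
        f"--name={name}",
        f"--filename={filename}",
        f"--size={size}",
        f"--rw={rw}",          # e.g. randwrite, randread, write, read
        f"--bs={bs}",
        "--direct=1",          # O_DIRECT: bypass the page cache
        "--ioengine=libaio",   # asynchronous IO through libaio
        "--iodepth=1",         # one request in flight: latency-oriented
        "--loops=2",           # run the job twice
    ]
    if sync:
        cmd.append("--sync=1")  # O_SYNC: for the synchronized write tests
    return cmd


cmd = build_fio_cmd("randwrite-4k-sync", "randwrite", "4k",
                    "/root/fio/testfile", sync=True)
print(shlex.join(cmd))
```

Printing the command instead of running it makes it easy to check what would be executed before touching a real disk.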

Workloads

The test suite starts with a write test to fill up the file (or an area on the disk). It is followed by a read test, then a write test with other parameters, and so on. This order is not accidental. Some caching mechanisms work on read requests and may distort the results. It doesn’t mean we’re totally preventing caching from happening somewhere inside our blackbox.

Benchmarking

All the scripts I talk about here are on GitHub[2]. The first thing to do is to create a virtual machine. It is good practice to give the machine a name, either a DNS A record or a Host entry in the local ssh config. It will help when working with the results later.

To start the basic test suite it’s enough to run:

./run-basic.sh root something.example.com /root/fio

Here root is the remote username used for ssh/scp, something.example.com is the virtual machine name, and /root/fio is the remote directory where all the tests will be copied from the local machine (it will be cleaned up automatically). By default run-basic.sh uses a file on the virtual machine’s filesystem, and often it is fine to skip the last, optional argument and use the default, as shown here. Using a file is not a bad idea, but you should bear this choice in mind. The filesystem adds extra load on the disk itself, caused by the journal, and it should be taken into account. The good thing here is that most applications use a filesystem, so benchmarking a file in a filesystem brings the test closer to real-world usage. If you want to run the test over the raw disk, add the path to the disk as the last argument of run-basic.sh.

The script installs FIO using apt if it’s not installed yet and starts the tests described in tests/basic.fio. After all tests have completed, the script copies the result files from the remote machine to the ./results/something.example.com/ directory.

It is possible to run the script a few more times on different machines to get more results, and then use the plotting scripts described next to plot them all together.

Plotting

To plot the graphs it’s enough to run:

./plot-all.sh

This script runs ./plot.py for all the files from all the directories in ./results/*, which is just ./results/something.example.com in our case. It generates .png files in the ./plots/ directory.

If more than one directory exists in ./results, it plots results of the same type of test on one chart making it possible to compare different runs.

Interpreting Results

To understand the naming of tests and results, refer to the README[5].

Every test runs twice; both results are stored in the same file and treated as one test result, so that we can see the difference on one graph, if there is one.

Small Random Writes

A block size of 4-8KiB is relatively small. In fact, 4KiB is the minimum memory page size for x86-64 and some other hardware architectures. Some latency-sensitive workloads operate on small blocks; for example, some databases use an 8KiB block size for their journal. Databases usually write their journal sequentially, but synchronously and with no parallelism. So these results tell us something about database write performance on the disk.
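
With iodepth=1 there is only one request in flight, so latency translates directly into IOPS and bandwidth. Here is a small sketch of that arithmetic; the 0.5 ms latency is a made-up figure, not a measurement.

```python
def iops_from_latency(latency_ms, iodepth=1):
    """Upper bound on IOPS when `iodepth` requests are always in flight."""
    return iodepth * 1000.0 / latency_ms


def bandwidth_mib_s(iops, block_size_kib):
    """Bandwidth in MiB/s for a given IOPS value and block size in KiB."""
    return iops * block_size_kib / 1024.0


latency_ms = 0.5  # hypothetical median latency of a synchronous 8KiB write
iops = iops_from_latency(latency_ms)
print(iops)                      # 2000.0
print(bandwidth_mib_s(iops, 8))  # 15.625
```

This is why small synchronous writes never come close to the advertised bandwidth of a disk: at queue depth 1 the latency of a single request sets the limit.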

Also, this kind of workload makes traditional HDDs slow because of the physical limitations of their design.

Small Random Reads

This kind of workload is the same as random writes, but is expected to perform better, because storage systems have more complex code paths for writes to provide reliability. If this test shows worse performance than the random write test, it looks suspicious:)

Synchronized and Non-Synchronized

A synchronized workload must show lower performance than a non-synchronized one. There are a few things to keep in mind here:

If we compare two disks and one shows better non-synchronized write performance while synchronized performance is similar, it is probably caused by safe writeback caching on the backend side.
A non-synchronized workload must always be faster. However, latency-sensitive applications are mostly affected by synchronized write performance.

Big Block Size

Testing performance with a big block size is mostly about bandwidth. I have to mention that the kernel may split big requests into smaller parts, so when benchmarking with a 4MiB block size the storage does not necessarily receive 4MiB requests.

But, as we are comparing different disks on a blackbox storage, it is one of the workloads we have to look at to get the whole picture of the disk performance.
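
To get a feeling for the request splitting mentioned above: on Linux the largest request the block layer sends to a device is limited, among other things, by /sys/block/<device>/queue/max_sectors_kb. Below is a sketch of the arithmetic. The 1280KiB limit is only an assumption for the example, so check the real value on your machine; other queue limits may apply as well.

```python
import math


def split_count(block_size_kib, max_sectors_kb):
    """Number of requests the device sees for one IO of block_size_kib."""
    return math.ceil(block_size_kib / max_sectors_kb)


def read_max_sectors_kb(device):
    """Read the limit for a device such as 'vda' or 'sda' (Linux only)."""
    with open(f"/sys/block/{device}/queue/max_sectors_kb") as f:
        return int(f.read().strip())


print(split_count(4096, 1280))  # 4
print(split_count(4, 1280))     # 1
```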

Pareto Distribution

Tests with a Pareto distribution are here to help us detect whether there’s a caching layer or not. Most tests with a Pareto distribution are expected to show better performance, because most storage systems have some type of read cache to answer clients faster. For example, Ceph object storage uses the page cache for some data and an internal 2Q cache to speed up read requests from the clients.

If reads with a Pareto distribution are notably faster than plain random reads on the same disk, then we are dealing with client-side caching. It’s neither bad nor good, it’s just the way it is. Cache behavior under different workloads depends on the cache policy and its implementation.
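
Why a skewed access pattern reveals a cache can be shown with a toy simulation: the same LRU cache is fed with uniformly random and with Pareto-skewed block numbers. Everything here is an assumption for illustration (disk size, cache size, the shape parameter 1.2), and the generator only imitates a hot set; it is not the one behind fio’s random_distribution option.

```python
import random
from collections import OrderedDict


def lru_hit_rate(offsets, cache_blocks):
    """Replay block numbers through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for off in offsets:
        if off in cache:
            hits += 1
            cache.move_to_end(off)         # mark as most recently used
        else:
            cache[off] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(offsets)


rng = random.Random(42)
total_blocks = 100_000  # blocks on the hypothetical disk
cache_blocks = 1_000    # the cache holds 1% of them
n = 50_000

uniform = [rng.randrange(total_blocks) for _ in range(n)]
# Skewed access: Pareto-distributed values folded onto the block range,
# so a small set of blocks receives most of the requests.
skewed = [int((rng.paretovariate(1.2) - 1.0) * 1_000) % total_blocks
          for _ in range(n)]

print(f"uniform hit rate: {lru_hit_rate(uniform, cache_blocks):.2f}")
print(f"skewed  hit rate: {lru_hit_rate(skewed, cache_blocks):.2f}")
```

With uniform access the cache almost never helps, while the skewed pattern keeps hitting the same hot blocks. A real disk that behaves the same way under both patterns most likely has no effective read cache in the path.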

Sequential vs. Random

Sometimes sequential reads might be faster than random ones. It depends only on the backend. If it is a distributed storage, it probably won’t show a difference. This happens because the disk might be split into a lot of blocks, and a sequential load on the client becomes random on the backend. It also depends on the architecture of the storage. We can still imagine an HDD-based backend that will be faster for sequential workloads.
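
A toy model of how a sequential load can turn into a scattered one: the disk is cut into fixed-size chunks and every chunk is placed on a backend disk chosen by a hash. The chunk size, the number of backend disks and the hash-based placement are assumptions for this sketch and do not describe any particular storage system.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4MiB chunks
BACKEND_DISKS = 12             # assumed number of backend disks


def backend_for(offset):
    """Map a client offset to (chunk index, backend disk) by hashing."""
    obj = offset // OBJECT_SIZE
    digest = hashlib.sha256(str(obj).encode()).digest()
    return obj, int.from_bytes(digest[:4], "big") % BACKEND_DISKS


# A sequential 32MiB read from the client's point of view.
for offset in range(0, 8 * OBJECT_SIZE, OBJECT_SIZE):
    obj, disk = backend_for(offset)
    print(f"offset {offset:>9} -> chunk {obj} -> backend disk {disk}")
```

Neighbouring chunks land on unrelated backend disks, so from the backend’s point of view there is nothing sequential left in this load.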

First Run vs. Second Run

In the very first test (4KiB random write in my tests) the first run will be worse than the second. It happens because of two factors:

Tests are running on a file, and this file is not preallocated in the filesystem. So the filesystem has to create file mappings for every write during the first run.
The same effect may happen on the backend: the disk is not preallocated and blocks are allocated during the test. This process puts additional load on the storage.

Similar behavior might be observed in read tests, but for a different reason: during the first run blocks might get cached on the storage system, so the second run shows better results.

Median and Mean

There is an option to plot mean values instead of the median. At first glance there is no difference, and that would be true for normally distributed data without a skew. But storage latency values are practically never normally distributed and often contain outliers. In this case the mean is affected by them, but the median isn’t.

It’s even possible to get results where the median is lower for the first test, but the second one finished faster. If one calculates the mean latency for these results, one will see a higher value for the first result.

In our context it is better to know everything than something:) In most cases the mean is just a bit higher than the median and may vary more widely.
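
The case of a lower median but a slower test is easy to reproduce with made-up numbers. The latencies below are invented for the illustration and are not measured data.

```python
import statistics

# Hypothetical latencies in microseconds.
first_run = [200] * 95 + [20_000] * 5  # fast most of the time, rare long stalls
second_run = [300] * 100               # slower, but steady

for name, lat in (("first", first_run), ("second", second_run)):
    print(name,
          "median:", statistics.median(lat),
          "mean:", statistics.mean(lat),
          "total:", sum(lat))
```

The first run wins on the median (200 against 300), but five stalls push its mean to 1190 and make the whole run about four times longer.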

Median and Boxplot

Median and mean are easy to understand. They give a basic idea of the result, but they don’t show the distribution and the outliers. Sometimes we may have a better median value, but a worse distribution and a much higher mean, which may not be suitable for some workloads. To find out more about this I use boxplots. A boxplot consists of:

A box, whose borders represent the 25th and 75th percentiles (first and third quartiles).
A line inside the box, which is the median value.
Whiskers, which extend to the most distant values that still lie within 1.5 interquartile ranges (the distance between the first and third quartiles) of the box.
Individual values beyond the whiskers, which are the outliers.
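
The elements listed above can be computed by hand, which helps to read the plots. This is a minimal sketch with made-up latency values; plotting libraries may use a slightly different quartile method, so their boxes can differ a little.

```python
import statistics


def boxplot_stats(values):
    """Return quartiles, whisker ends and outliers using the 1.5*IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


latencies_us = [180, 190, 200, 200, 210, 220, 230, 250, 900]  # made-up values
print(boxplot_stats(latencies_us))
```

For this sample the box spans 200 to 230 with the median at 210, the whiskers end at 180 and 250, and 900 is drawn as an individual outlier.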

A boxplot can give you a lot of information about the distribution of values. We can use boxplots for latency here because FIO in our tests saves the value for every IO without aggregation.
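
For reference, here is a sketch of how per-IO latency logs written by fio can be read. It assumes the comma-separated log format described in the fio documentation (time in msec, value, data direction, block size, offset) with latency values in nanoseconds, as in recent fio versions. The sample lines are invented, and the parse_lat_log helper is not part of EasyDiskBench.

```python
import statistics


def parse_lat_log(text):
    """Parse fio latency log lines into microseconds grouped by direction.

    Direction 0 is read, 1 is write. Lines with fewer than four
    fields are skipped.
    """
    result = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 4:
            continue  # empty or malformed line
        latency_ns, direction = int(fields[1]), int(fields[2])
        result.setdefault(direction, []).append(latency_ns / 1000.0)
    return result


sample = """\
1, 215000, 1, 4096, 0
2, 198000, 1, 4096, 0
3, 5120000, 1, 4096, 0
"""

writes_us = parse_lat_log(sample)[1]
print(writes_us, "median:", statistics.median(writes_us))
```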

You can find more info on boxplots here[6]. Unfortunately I didn’t find a way to plot multiple results with boxplots on one picture without the overlapping issue with plotly or matplotlib:) Leaving it for TODO. Anyway, they are still useful.

After All

That’s all for the beginning. Not too much, but just enough to be able to compare different disks in the cloud. But why am I talking about cloud disks here?

Physical Devices

These scripts and methodologies don’t cover physical device benchmarking. The reason is that different types of physical devices have a lot of specific properties, among them: the technology the disk is based on (SMR HDD, QLC, TLC, etc.), the physical block size the device operates with, the size of the cache, and so on. The testing process for a physical device must be based on its specification and done with an understanding of what device it is. It doesn’t mean you can’t treat the device as a blackbox, but it means that you have few reasons to compare a modern HDD with a SATA SSD or an NVMe SSD. Also, it’s impossible to achieve the highest possible performance on Linux using the tests I’m talking about here. It’s only possible using io_uring[4], but I consciously avoided it in this article, because not much software uses this new interface for now. It might also be necessary to tweak some CPU and software settings to make it work as fast as it can.

Also, cloud disks are subject to unpredictable performance changes. The cause might be a higher load from some large clients, or an internal process of the distributed storage, like recovery after a host or disk failure. A user can’t predict these things, and this makes such disks different from local physical devices.

What’s Next?

One more interesting topic is benchmarking with increased parallelism: an increased iodepth or number of jobs for FIO. The motivation for this is a wish to understand how the storage system scales with a lot of threads and how it reacts to overload. It is possible to find out if bursts over the limits are possible or if the storage subsystem does some “clever” request throttling. And there is another interesting topic, distributed storage benchmarking. But all these things do not fit into the scope of this article.

References

[1] https://github.com/axboe/fio “FIO: Flexible I/O Tester”
[2] https://github.com/AlexZzz/easydiskbench “EasyDiskBench”
[3] https://fio.readthedocs.io/en/latest/fio_doc.html “FIO documentation”
[4] https://kernel.dk/io_uring.pdf “Efficient IO with io_uring”
[5] https://github.com/AlexZzz/easydiskbench#tests-and-results-naming “EasyDiskBench: Tests and results naming”
[6] http://vita.had.co.nz/papers/boxplots.pdf “40 years of boxplots”