We were invited to speak on two panels at the 2011 Flash Memory Summit. Now that the conference is over, we wanted to circle back with some thoughts on what we saw and explain why we test SSDs the way we do.
Synthetic vs. Application vs. Trace-Based
Performance has a profound impact on purchasing decisions, whether you're buying a new car, a vacuum, or a CPU. That's precisely the reason so much effort goes into picking the benchmarks that best represent real-world use.
Testing SSDs is no different. Just as we see in the processor and graphics markets, SSD #1 might beat SSD #2 in one benchmark, and then fall behind SSD #2 in a second metric. The secret is in the tests themselves. But whereas it's easy to reflect the performance of other PC-oriented components using a wide swath of real-world applications that truly tax the hardware, it's much more difficult to reflect SSD performance in the same way.
|Benchmark Type|Synthetic|Application-Based|Trace-Based|
|---|---|---|---|
|Example|Iometer|BAPCo SYSmark|Intel IPEAK|
|I/O Settings|Set By User|No Controls|Created By User|
|Performance Evaluation|Specific Scenario|Specific Workload|Long-Term|
Benchmarks ultimately fall into three groups: synthetic, application-based, and trace-based. But everyone at the conference seemed to have a different take on which tests are best. Some vendors, such as Micron and HP, prefer synthetic benchmarks. Others, like Virident Systems, advocate application-level testing. Meanwhile, Intel maintains that a trace-based benchmark created using software like IPEAK is the best measure of storage performance.
There is no easy answer to the question of how best to test SSDs. Each approach has unique strengths. For consumers, however, we have to say that trace-based benchmarking is most representative, so long as the trace properly mirrors what the consumer is doing on his or her machine. Why is this?
Well, synthetic benchmarks test one specific scenario. For example, running Iometer requires defined settings for queue depth, transfer sizes, and seek distance. But how often is your workload exactly 80% 4 KB random reads and 20% 4 KB random writes at a queue depth of four? Even if you change those values, it can be difficult to see how they relate to the real world when you're trying to isolate one specific performance attribute. That said, synthetic testing does work well for gauging the behaviour of many enterprise applications, where a server might host a single piece of software accessing data in a consistent way.
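To make the "one specific scenario" point concrete, here is a minimal Python sketch of what a fixed synthetic profile looks like when written down. It is not how Iometer works internally; the `IORequest` type, the `synthetic_workload` and `batches` helpers, and the 1 GB test span are all hypothetical names and values chosen for illustration. It only generates request descriptors and never touches a drive.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IORequest:
    op: str      # "read" or "write"
    offset: int  # byte offset into the test span
    size: int    # transfer size in bytes


def synthetic_workload(n_requests, read_fraction=0.8, block_size=4096,
                       span_bytes=1 << 30, seed=0):
    """Build a fixed-profile random workload (hypothetical helper).

    Every request has the same transfer size, and the read/write mix
    never changes, which is what makes the test synthetic.
    """
    rng = random.Random(seed)
    n_blocks = span_bytes // block_size
    requests = []
    for _ in range(n_requests):
        op = "read" if rng.random() < read_fraction else "write"
        offset = rng.randrange(n_blocks) * block_size  # block-aligned
        requests.append(IORequest(op, offset, block_size))
    return requests


def batches(requests, queue_depth=4):
    """Group requests into sets of `queue_depth` I/Os issued together.

    This is a simplification: a real tool keeps the queue topped up
    rather than waiting for a whole batch to finish.
    """
    return [requests[i:i + queue_depth]
            for i in range(0, len(requests), queue_depth)]


if __name__ == "__main__":
    reqs = synthetic_workload(10_000)
    reads = sum(r.op == "read" for r in reqs)
    print(f"read share: {reads / len(reqs):.2%}")
    print(f"batches at QD4: {len(batches(reqs))}")
```

Every knob here is pinned up front, so the result describes that one combination of mix, transfer size, and queue depth and nothing else.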
Application benchmarks are a little different. There, you're measuring the performance of a specific application (or a set of applications), but without regard to queue depth, transfer sizes, or seek distance. Some tests, such as SYSmark, spit out a relative score based on an arbitrary unit of measure. Other application-based metrics are incredibly useful because they measure performance in a more tangible way. For example, the MySQL TPC-C benchmark provides scores in transactions per second. That's an intuitive measurement for an IT manager tasked with increasing the number of queries a server can handle. Unfortunately, in many cases, application-based benchmarks are either ambiguous or limited to a specific scenario.
Trace-based benchmarks are a mix of the two aforementioned approaches, measuring performance as the drive is actually used. This involves recording the I/O commands at the operating system level and playing them back. The result is a benchmark where queue depth, transfer size, and seek distance change over time. That's a great way to shed some light on a more long-term view of performance, which is why we based our Storage Bench v1.0 on a trace.
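As a rough illustration of how those parameters drift within a trace, the Python sketch below buckets a handful of recorded I/Os into one-second windows and reports the average transfer size and queue depth in each. The record layout and the sample values in `TRACE` are invented for this example; they are not the IPEAK format and are not taken from our Storage Bench trace.

```python
from collections import defaultdict

# Hypothetical trace records:
# (timestamp_s, op, offset_bytes, size_bytes, queue_depth)
TRACE = [
    (0.10, "read", 4096, 4096, 1),
    (0.40, "read", 1048576, 131072, 2),
    (1.20, "write", 8192, 4096, 4),
    (1.70, "write", 12288, 4096, 6),
    (2.30, "read", 0, 65536, 1),
]


def summarize(trace, window_s=1.0):
    """Average transfer size and queue depth per time window."""
    buckets = defaultdict(list)
    for ts, _op, _offset, size, qd in trace:
        buckets[int(ts // window_s)].append((size, qd))

    summary = {}
    for window, rows in sorted(buckets.items()):
        sizes = [s for s, _ in rows]
        depths = [q for _, q in rows]
        summary[window] = {
            "ios": len(rows),
            "avg_size": sum(sizes) / len(sizes),
            "avg_qd": sum(depths) / len(depths),
        }
    return summary


if __name__ == "__main__":
    for window, stats in summarize(TRACE).items():
        print(window, stats)
```

Unlike a synthetic profile, nothing is held constant: the first window mixes small and large reads at low queue depth, while the second is all 4 KB writes at a deeper queue.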
The complication is that no two users' workloads are the same. So, a trace is most valuable to the folks who run similar applications.
However, traces still offer an intuitive way to express the overall performance difference between two storage solutions. Synthetic benchmarks are usually dialled up to unrealistic settings, demonstrating the most fringe differences between two drives as a means to draw some sort of conclusion at the end of the story. An application-based benchmark might do a better job of illustrating real-world differences, but then only in a specific piece of software. Using traces, it's possible to see how responsive one drive is versus another.