In a nutshell: in skyzh/uring-positioned-io, I wrapped the low-level `io_uring` interface provided by Tokio's io-uring crate and implemented `io_uring`-based asynchronous random file reads in Rust. You can use it like this:
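A minimal usage sketch, assuming the `UringContext` interface described later in this article (the constructor parameters here are illustrative; see the project README for the exact signature):

```rust
use std::fs::File;
use uring_positioned_io::UringContext;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Each file gets an id from its position in this vector.
    let files = vec![File::open("test.txt")?];
    // Illustrative parameters: queue depth 256, one background poller.
    let ctx = UringContext::new(files, 256, 1)?;
    let mut buf = vec![0u8; 4096];
    // Read 4096 bytes from file 0, starting at offset 0.
    ctx.read(0, 0, &mut buf).await?;
    Ok(())
}
```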
This article first introduces the basic usage of `io_uring`, then describes the implementation of the asynchronous file-read library I wrote, and finally benchmarks it against `mmap`. Click here to visit the GitHub project.
io_uring Introduction
`io_uring` is an asynchronous I/O interface provided by the Linux kernel. It was introduced in Linux 5.1 in May 2019 and is already used in various projects. For example:
- RocksDB's MultiRead currently reads files concurrently via `io_uring`.
- Tokio wraps a layer of API over `io_uring`. With the release of Tokio 1.0, the developers have indicated that true asynchronous file operations will be provided via `io_uring` in the future (see Announcing Tokio 1.0). Currently, Tokio's asynchronous file operations are implemented by dispatching the synchronous API to a separate I/O thread.
- QEMU 5.0 already uses `io_uring` (see the ChangeLog).
Most existing benchmarks of `io_uring` compare Direct I/O performance against Linux AIO, where `io_uring` can typically achieve twice the performance of AIO.
Random file reading scenarios
In database systems, we often need multiple threads reading arbitrary positions in a file (<fid>, <offset>, <size>). The commonly used read / write APIs cannot do this, because each read must seek first, which requires exclusive access to the file handle. The following methods support random file reads:
- Map the file directly into memory via `mmap`. Reading the file becomes a direct memory access, which can be done concurrently from multiple threads.
- `pread` reads `count` bytes starting from a given `offset`, and also supports concurrent reads from multiple threads (see the Rust sketch below).
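In Rust, `pread` is exposed on Unix as `FileExt::read_at`, which takes an explicit offset and therefore needs no shared file cursor. A small sketch of a single positioned read:

```rust
use std::fs::File;
use std::os::unix::fs::FileExt; // read_at is pread(2) under the hood

fn main() -> std::io::Result<()> {
    let file = File::open("data.bin")?;
    let mut buf = vec![0u8; 4096];
    // Read up to 4096 bytes starting at byte offset 8192, without
    // moving any file cursor; safe to call from multiple threads.
    let n = file.read_at(&mut buf, 8192)?;
    println!("read {} bytes", n);
    Ok(())
}
```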
However, both of these options block the current thread. For example, if reading `mmap`-ed memory triggers a page fault, the current thread blocks until the page is loaded; `pread` is itself a blocking API. Asynchronous APIs (e.g. Linux AIO / `io_uring`) can reduce context switching, which improves throughput in some scenarios.
Basic usage of io_uring
The `io_uring`-related syscalls can be found here. liburing provides an easier-to-use API on top of them. Tokio's io-uring crate builds on that to provide an `io_uring` API for the Rust language. Here is an example of how to use `io_uring`.
To use `io_uring`, you first need to create a ring. Here we use the concurrent API provided by tokio-rs/io-uring, which supports using the same ring from multiple threads. Each ring corresponds to a submission queue and a completion queue; here we set each queue to hold at most 256 entries.
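A sketch of the setup, assuming the `concurrent` module of older io-uring crate versions (0.4/0.5); newer releases have since reorganized this API:

```rust
use io_uring::IoUring;

fn main() -> std::io::Result<()> {
    // Submission and completion queues each hold up to 256 entries.
    let ring = IoUring::new(256)?;
    // Switch to the `concurrent` wrapper so multiple threads can
    // push to and pop from the same ring.
    let _ring = ring.concurrent();
    // ... queue, submit, and poll operations go here ...
    Ok(())
}
```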
An I/O operation via `io_uring` is divided into three steps: adding tasks to the submission queue, submitting the tasks to the kernel, and harvesting completed tasks from the completion queue. Here is an example of reading a file.
You can construct a read-file task with `opcode::Read`, and add it to the queue with `ring.submission().push(entry)`.
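A sketch of building and queuing such an entry (the exact `offset` and `push` signatures vary across io-uring crate versions):

```rust
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

// Queue one read of `buf.len()` bytes at `offset` of `file`.
fn queue_read(ring: &mut IoUring, file: &File, offset: u64, buf: &mut [u8]) {
    let entry = opcode::Read::new(
        types::Fd(file.as_raw_fd()),
        buf.as_mut_ptr(),
        buf.len() as u32,
    )
    .offset(offset)
    .build()
    .user_data(0x42); // tag it so we can match the completion later

    // Safety: `buf` must stay valid until the kernel completes this op.
    unsafe {
        ring.submission()
            .push(&entry)
            .expect("submission queue is full");
    }
}
```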
Once tasks have been added to the queue, submit them to the kernel.
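With the io-uring crate this is a single call; continuing the sketch above:

```rust
// Tell the kernel there are new entries in the submission queue.
// Returns how many entries were submitted.
let submitted = ring.submit()?;
```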
Finally, poll for completed tasks.
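Continuing the sketch, with the concurrent API the completion queue can be drained like this:

```rust
// Pop completion-queue entries; each carries the result (bytes read,
// or a negative errno) plus the user_data we attached on submit.
while let Some(cqe) = ring.completion().pop() {
    let user_data = cqe.user_data();
    let result = cqe.result(); // < 0 means -errno
    // ... match user_data back to the in-flight request here ...
}
```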
In this way, we implement random file reads based on `io_uring`.
`io_uring` currently has three execution modes: the default mode, poll mode (IORING_SETUP_IOPOLL), and kernel poll mode (IORING_SETUP_SQPOLL). In kernel poll mode, a kernel thread polls the submission queue, so you do not necessarily need to call the function that submits tasks.
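For example, kernel poll mode can be requested through the crate's builder (a sketch; `setup_sqpoll` follows recent io-uring crate versions, and kernel support varies):

```rust
use io_uring::IoUring;

fn main() -> std::io::Result<()> {
    // SQPOLL mode: a kernel thread polls the submission queue, so a
    // submit() syscall is only needed after that thread has gone idle.
    let _ring = IoUring::builder()
        .setup_sqpoll(100) // kernel poll thread idles after 100 ms
        .build(256)?;
    Ok(())
}
```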
Implementing an asynchronous file-read interface with io_uring
Our goal is to implement an interface like the one below: wrap `io_uring` and expose only a simple `read` function to the developer.
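Roughly this shape (a sketch; the real signature in the library may differ):

```rust
use std::io;

pub struct UringContext {
    // ring, registered files, background poll futures, ...
}

impl UringContext {
    /// Asynchronously read `buf.len()` bytes from file `fid`,
    /// starting at byte `offset`.
    pub async fn read(&self, fid: u32, offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        // implementation walkthrough follows below
        unimplemented!()
    }
}
```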
After referring to tokio-linux-aio's asynchronous wrapping of Linux AIO, I implemented `io_uring`-based asynchronous reads as follows.
- The developer creates a `UringContext` before using `io_uring`.
- When the `UringContext` is created, one (or more) `UringPollFuture` is spawned in the background to submit tasks and poll for completed tasks (corresponding to the second and third steps of reading a file in the previous section).
- The developer calls `ctx.read` to read a file, which creates a `UringReadFuture`. After `ctx.read.await` is called:
  - `UringReadFuture` creates a `UringTask` object pinned in memory, then puts the read task into the queue, using the address of the `UringTask` as the user data of the read operation. Inside the `UringTask` is a channel (a condensed sketch follows the figure below).
  - `UringPollFuture` submits the task in the background.
  - `UringPollFuture` polls for completed tasks in the background.
  - `UringPollFuture` takes the user data, turns it back into a `UringTask` object, and notifies the `UringReadFuture` through the channel that the I/O operation has completed.
The whole process is shown in the following figure.
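A condensed sketch of the user-data trick from the list above (the `UringTask` name follows the description; the channel type and memory management details here are simplifying assumptions):

```rust
use tokio::sync::oneshot;

// One in-flight read. It is boxed so its address stays stable while
// the kernel holds that address as the operation's user_data.
struct UringTask {
    complete: Option<oneshot::Sender<i32>>,
}

// On submit: leak the Box and hand its address to io_uring.
fn into_user_data(task: Box<UringTask>) -> u64 {
    Box::into_raw(task) as u64
}

// On completion: restore the Box from the user_data and notify the
// waiting UringReadFuture through the channel.
unsafe fn complete(user_data: u64, result: i32) {
    let mut task = Box::from_raw(user_data as *mut UringTask);
    if let Some(tx) = task.complete.take() {
        let _ = tx.send(result); // wakes the UringReadFuture
    }
}
```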
This makes it easy to call `io_uring` to read files asynchronously. It also has a side benefit: task submission is automatically batched. Normally, each I/O operation generates one syscall; but since a single future submits and polls tasks, there may be several unsubmitted tasks in the queue at submission time, and they can all be submitted in one go. This reduces syscall and context-switch overhead (and of course increases latency somewhat). The benchmark results show that each submission packs about 20 read tasks on average.
Benchmark
I compared the performance of the wrapped `io_uring` with that of `mmap`. The test workload randomly reads aligned 4K blocks from 128 files of 1 GB each. My machine has 32 GB of RAM and a 1 TB NVMe SSD. Six cases were tested:
- 8-thread mmap (mmap_8)
- 32-thread mmap (mmap_32)
- 512-thread mmap (mmap_512)
- 8-thread, 8-concurrency io_uring (uring_8)
- 8-thread, 32-concurrency io_uring, i.e. 8 worker threads with 32 futures reading simultaneously (uring_32)
- 8-thread, 512-concurrency io_uring (uring_512)
I measured throughput (op/s) and latency (ns):
| case | throughput (op/s) | p50 (ns) | p90 (ns) | p999 (ns) | p9999 (ns) | max (ns) |
|---|---|---|---|---|---|---|
| uring_8 | 104085.78 | 83166 | 109183 | 246416 | 3105883 | 14973666 |
| uring_32 | 227097.61 | 142869 | 212730 | 1111491 | 3321889 | 14336132 |
| uring_512 | 212076.52 | 1973421 | 3521119 | 19478348 | 25551700 | 35433481 |
| mmap_8 | 109697.87 | 78971 | 107021 | 204211 | 1787823 | 18522047 |
| mmap_32 | 312829.53 | 100336 | 178914 | 419955 | 4408214 | 55129932 |
| mmap_512 | 235368.99 | 2556429 | 3265266 | 15946744 | 50029659 | 156095218 |
It turns out that `mmap` is far superior to `io_uring`. Well, it's true that this wrapper isn't very good, but it's barely usable. Below are one-minute latency heatmaps; each pair shows `mmap` first, then `io_uring`.
mmap_8 / uring_8
mmap_32 / uring_32
mmap_512 / uring_512
Some possible improvements
- `io_uring` does not seem to perform well with my Tokio wrapper. The overhead introduced by the wrapper could later be measured by comparing Rust and C performance on the `io_uring` nop opcode.
- Test Direct I/O performance; only Buffered I/O has been tested so far.
- Compare performance against Linux AIO (hopefully it is no worse than AIO).
- Use perf to find the current bottlenecks. At the moment, `io_uring` fails to allocate memory once `cargo flamegraph` is attached.
- Currently, the user must ensure that `&mut buf` stays valid for the whole read cycle; if the future is aborted, memory is leaked. A similar problem in futures-rs is discussed at https://github.com/rust-lang/futures-rs/issues/1278 . Tokio's current I/O avoids this by copying twice (first into a cache, then to the user).
- Perhaps also wrap write and other operations along the way.