Direct IO Comes to the Nanos Unikernel with O_DIRECT.

Intro

We think databases in particular mesh very well with the general idea of unikernels, and we are far from alone, whether it's the folks behind ScyllaDB, DBOS, or TUM. One of the things you'll see in some databases, and that we think we'll see more of in the future, is O_DIRECT.

What is O_DIRECT

O_DIRECT, a flag indicating one wishes to perform "direct I/O", came from XFS on IRIX, which came from the company SGI. You might recognize them from other great hits such as OpenGL and the Indy. Later on, around 2001, O_DIRECT made its way into Linux. Solaris uses directio() instead, and OpenBSD doesn't support O_DIRECT at all. macOS doesn't support O_DIRECT either, but it does have an fcntl flag that disables caching, although data already in the cache can still be used for reads:

fcntl(fd, F_NOCACHE, 1)

Windows has a similar flag, FILE_FLAG_NO_BUFFERING.

ZFS long lacked O_DIRECT support; it was merged in 2018 by Brian Behlendorf. Docker has an outstanding ticket about O_DIRECT causing issues for Mac users running things like SQL Server and FoundationDB on macOS.

I really like detailing this history, as it spans multiple decades and keeps cropping up, even outside of our own implementation. There is a lot of kernel work that happens just like this. A lot of people have the idea that a feature is debated, tried, and implemented or not, all decided decades ago, and that nothing new ever happens, but that's simply not the case for many ideas like this one.

Essentially, O_DIRECT is a way to bypass the operating system page cache and have reads/writes go directly between the application and the storage device. Here is a write test, ripped from our test suite, that exercises the required alignment:

ops run main -n -c config.json

The config just pads the fs:

{
  "BaseVolumeSz": "100m"
}

The test itself:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>
#include <assert.h>

int main() {
    const char *file_name = "test_direct";
    const int alignment = 512;
    int fd = open(file_name, O_CREAT | O_RDWR | O_DIRECT, S_IRUSR | S_IWUSR);
    assert(fd >= 0);
    unsigned char wbuf[3 * alignment];
    unsigned char rbuf[3 * alignment];
    struct iovec iovs[2];
    unsigned char *ptr;

    /* unaligned base pointers: writev() may or may not fail with EINVAL (it fails on Nanos and
     * succeeds on Linux with ext4 filesystem) */
    if ((intptr_t)wbuf & (alignment - 1))
        ptr = wbuf;
    else
        ptr = wbuf + 1;
    iovs[0].iov_base = ptr;
    iovs[1].iov_base = ptr + alignment;
    iovs[0].iov_len = iovs[1].iov_len = alignment;
    if (writev(fd, iovs, 2) > 0)
        assert(lseek(fd, 0, SEEK_SET) == 0);
    else
        assert(errno == EINVAL);

    /* unaligned buffer length */
    ptr = (unsigned char *)((intptr_t)(wbuf - 1) & ~(alignment - 1)) + alignment;
    iovs[0].iov_base = ptr;
    iovs[1].iov_base = ptr + alignment;
    iovs[0].iov_len = 1;
    assert((writev(fd, iovs, 2) == -1) && (errno == EINVAL));

    /* aligned buffer address and length */
    for (int i = 0; i < 2 * alignment; i += sizeof(uint64_t))
        *(uint64_t *)(ptr + i) = i;
    iovs[0].iov_len = alignment;
    assert(writev(fd, iovs, 2) == 2 * alignment);

    /* read back with aligned buffer address and length */
    assert(lseek(fd, 0, SEEK_SET) == 0);
    ptr = (unsigned char *)((intptr_t)(rbuf - 1) & ~(alignment - 1)) + alignment;
    assert(read(fd, ptr, 2 * alignment) == 2 * alignment);

    assert(!memcmp(ptr, iovs[0].iov_base, alignment));
    assert(!memcmp(ptr + alignment, iovs[1].iov_base, alignment));
    close(fd);
    unlink(file_name);
    return 0;
}
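
The test above jumps through hoops to derive aligned addresses from stack buffers; in a real application you would more typically grab an aligned heap buffer with posix_memalign(). A minimal sketch, assuming a 512-byte alignment requirement (the actual requirement depends on the filesystem and device) and a hypothetical file name:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t alignment = 512;   /* assumed; query your device/fs for the real value */
    const size_t len = 8 * alignment;
    void *buf;

    /* posix_memalign() hands back a heap buffer whose address is a multiple of alignment */
    if (posix_memalign(&buf, alignment, len))
        return 1;
    memset(buf, 'x', len);

    int fd = open("direct_file", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    /* buffer address, length and file offset are all multiples of alignment */
    if (write(fd, buf, len) != (ssize_t)len)
        return 1;
    close(fd);
    free(buf);
    return 0;
}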

For normal applications you'll sometimes get performance improvements, but outside of specific use cases you should probably expect performance degradation, because, again, you are explicitly bypassing the page cache. Some databases, however, have their own caches, and some want to make their own durability guarantees. Of course there are opinions.

"The right way to do it is to just not use O_DIRECT. The whole notion of "direct IO" is totally braindamaged. Just say no."
- Linus Torvalds - 2007

One of the reasons Linus has given for thinking O_DIRECT is bad is that you can't actually talk to the device directly - you still need to buffer through a syscall, and you end up caching data yourself, which is precisely what the kernel already provides with normal writes: the page cache. Note: keep in mind that when most people bring up Linus' objections they are referring to an LKML thread that is now 17 years old. A lot of the "common knowledge" that many non-kernel devs carry around today is ancient.

Another complaint against direct I/O, besides fighting the OS page cache, is that it really needs async support, and "normal" aio on Linux can turn into a synchronous operation very easily if you're not aware of the various footguns. For instance, even in unbuffered mode, aio can block if file metadata isn't there; it will wait (block) for that to be available. If you are using something like the newer io_uring, then you can work around some of these limitations.

The reason you often need async I/O to be fast is read-ahead, which the kernel provides without O_DIRECT. Readahead pulls new content into the page cache before it is explicitly requested. Part of this is synchronous, but part of it - since nothing requested it yet - is asynchronous. This is why using O_DIRECT can often slow things down dramatically rather than speed them up.
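
With buffered I/O you can even steer that readahead yourself via posix_fadvise(); with O_DIRECT no such help exists and the application has to issue its own (ideally asynchronous) reads. A minimal sketch, with a hypothetical helper name:

#include <fcntl.h>

/* Hint that we will read this file sequentially so the kernel can read
 * ahead aggressively. The hint is advisory: it only affects the page
 * cache, which O_DIRECT bypasses entirely. */
int open_sequential(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd >= 0)
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    return fd;
}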

Traditionally there were two flavors of aio: Linux/kernel aio and POSIX aio. Linux aio requires O_DIRECT, while POSIX aio spawns threads to perform the blocking calls. Linux aio isn't a great solution for non-databases because of its reliance on O_DIRECT. Many developers have found both of these methods to be lackluster.
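
For reference, the Linux aio path via libaio looks roughly like this - a sketch with most error handling elided, assuming the same 512-byte alignment as before and a hypothetical file name; link with -laio:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("direct_file", O_RDONLY | O_DIRECT);
    void *buf;
    if (fd < 0 || posix_memalign(&buf, 512, 4096))
        return 1;

    io_context_t ctx = 0;
    io_setup(8, &ctx);                     /* create an in-kernel aio context */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);  /* queue a 4 KiB read at offset 0 */
    io_submit(ctx, 1, cbs);

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);    /* reap the completion; ev.res holds the result */

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}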

Are there other options besides aio? Yes: io_uring, which makes use of non-blocking ring buffers shared between the application and the kernel.
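
The same read with liburing looks something like this - again a sketch with a hypothetical file name; link with -luring:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("direct_file", O_RDONLY | O_DIRECT);
    void *buf;
    if (fd < 0 || posix_memalign(&buf, 512, 4096))
        return 1;

    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);           /* set up the shared submission/completion rings */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);  /* queue a 4 KiB read at offset 0 */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);             /* cqe->res holds the number of bytes read */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}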

io_uring can be faster and use less CPU than aio; however, Google recently started restricting io_uring over security concerns. Despite this, we feel these issues will eventually be dealt with, and io_uring paired with O_DIRECT will become much more commonplace. Keep in mind io_uring is still fairly new.

Typically, when not using O_DIRECT, you'll be using normal buffered I/O. In most cases you are probably not going to outsmart the OS page cache. However, databases in particular, which contain their own caches and might be reading/writing a lot, can avoid keeping multiple copies of data between storage and memory (by bypassing the cache) and achieve both performance and durability benefits. There are other benefits as well, such as being able to schedule I/O and make better use of memory bandwidth.

Writing without Direct IO

When you call write(), all you are doing is moving data from the application into the kernel. The kernel then decides how to actually persist that data to disk, and it can and will batch and schedule writes as it sees fit.

Typically a developer will rely on calling something like fsync to force unwritten data to disk.
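
A minimal sketch of that pattern, with the error checks that fsyncgate (covered below) shows you really do need; the helper name is hypothetical:

#include <sys/types.h>
#include <unistd.h>

/* write() only hands the data to the kernel; fsync() forces it (and the
 * file's metadata) to stable storage. If fsync() fails, do not assume a
 * retry can save the data - the dirty pages may already have been
 * dropped from the page cache. */
int durable_write(int fd, const void *buf, size_t len) {
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0)
        return -1;   /* treat as fatal, not retryable */
    return 0;
}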

"From a safety perspective, O_DIRECT is now table stakes"
- Joran Greef - 2021

Why Now?

So besides a handful of databases implementing this, why else did we end up making an implementation? If you've got a new kernel, why not just follow Linus' advice and leave it out?

In one word - durability.

"Hi all, Some time ago I ran into an issue where a user encountered data corruption after a storage error."
- Craig Ringer - 2018

This is how the event known as fsyncgate started on the postgres mailing list. Essentially, what was discovered was that sometimes fsync would error, postgres would retry, be told everything was ok, and corruption would follow, because the dirty pages were *not* being written to disk. This didn't just affect postgres, but they were the ones who realized what was going on.

Data pages that failed to be written were simply marked clean in the page cache when fsync was called and failed. So postgres would retry fsync, but this time there were no dirty pages left for the filesystem to write, making the retry succeed without writing anything to disk, and postgres would then truncate the WAL.

Surprise! You've got corruption.

This of course broke a lot of assumptions that developers had. Postgres now panics on fsync failure. MySQL made a similar change to abort when fsync returns EIO, as did MongoDB; however, many people believed that the real solution would be to use direct I/O for greater control over errors.

In "Can Applications Recover from fsync Failures?" the Arpaci-Dusseau duo noted that postgres had been using fsync incorrectly for 20 years.

There are of course other methods that databases utilize, but many of them have far worse implications. Relying on something like mmap is a choice many databases made in the past only to revert later on - MongoDB and InfluxDB were notable (and of course Influx has been rewritten, what, 3 or 4 times?). While Andy Pavlo's group states in "Are You Sure You Want to Use MMAP in Your Database Management System?" that the database can almost always manage memory better than the OS, they are pretty blunt that mmap is not suitable, for durability reasons.

Postgres uses O_DIRECT when writing its write-ahead log if a non-default wal_sync_method setting is toggled; for most operations, however, buffered I/O is used.

You might be surprised that postgres and many other databases don't use O_DIRECT for many operations, but it's not necessarily because they don't want to - it's that they were designed well before it was introduced. Postgres itself is, technically speaking, older than Linux, although its most recent name dates to 1996, so it is missing a lot of newer architectural design such as threads, which we incorporated into our postgres unikernel packages a year or so ago.

In this post from 2021, Andres Freund proposed putting direct I/O into postgres. He has a WIP here. One of the larger issues in moving a system like postgres to direct I/O, with the aio consideration attached, is that postgres already runs on many different platforms, and supporting each one could be a pain in the ass.

Slightly off-topic, but you might remember Andres as the guy who noticed and raised the alarm about the xz backdoor recently!

"This says to me that Linux will clearly be an undependable platform in the future."
- Josh Berkus - 2013

The Future Awaits

So the saga, now some 25-30 years in the making, continues. Personally, I'm excited to see what other "surprises" are sitting out there waiting to be discovered, forcing the creation of new software and the upgrading of older software.

Deploy Your First Open Source Unikernel In Seconds

Get Started Now.