Cutting Alloc Time Down with Transparent Huge Pages in the Nanos Unikernel

There are a ton of research papers on unikernels going back many years. Now that Nanos has been around for a few years we've started popping up in them, so we pay close attention whenever someone suggests there might be a performance or security issue.

One paper had indicated poor memory management performance. We figured out pretty quickly that it was because we did not have support for transparent huge pages (THP). This is a feature that has been around in Linux for a while now, and we just now got around to implementing it. Pages of memory in Nanos were being allocated in 4k chunks, which is still the default page size in Linux, yet a lot of software today will make use of Linux's THP by allocating in larger chunks of 2M at a time - typically without even knowing it (that's what makes it transparent).

There are a handful of methods to see the default page size on your system:

grep -ir pagesize /proc/self/smaps

or

eyberg@venus:~$ getconf PAGE_SIZE
4096

However, this is slightly misleading, as getconf only reports the base page size. Many systems set THP to madvise, meaning the kernel will only back a region with huge pages if the application gives it a hint for that region. You can check the current THP mode with:

cat /sys/kernel/mm/transparent_hugepage/enabled 
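On a typical distribution this prints something like always [madvise] never, with the active mode in brackets. In madvise mode the kernel only uses huge pages for regions the application has asked for. Below is a minimal sketch of how a program would hint a region - our assumption here is a Linux amd64 host and Go's syscall package; this is not Nanos-specific code:

package main

import (
    "fmt"
    "syscall"
)

func main() {
    size := 64 * 1024 * 1024 // 64 MiB, a multiple of 2M

    // Map an anonymous region; by default it is backed by 4k pages.
    buf, err := syscall.Mmap(-1, 0, size,
        syscall.PROT_READ|syscall.PROT_WRITE,
        syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(buf)

    // Hint that this region should be backed by transparent huge pages;
    // the kernel will use 2M pages for the 2M-aligned portions of it.
    if err := syscall.Madvise(buf, syscall.MADV_HUGEPAGE); err != nil {
        panic(err)
    }

    // Touch one byte per 4k page so the memory is actually faulted in.
    for i := 0; i < size; i += 4096 {
        buf[i] = 0xff
    }
    fmt.Println("mapped and touched", size, "bytes")
}

With the system-wide setting at always, applications get this behavior without any explicit call - which is what makes THP transparent.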

One of the problems THP addresses is that on today's systems large amounts of RAM are common and allocations happen constantly. Even something as simple as 16G of RAM works out to over 4 million 4k pages. Each page is tracked by a page table entry, which maps a virtual address to a physical one along with some metadata such as the important user/supervisor bit, the read/write bit, and the nx/xd bit (amongst others).

Walking these entries on every lookup is so expensive that we even have dedicated hardware called the TLB - the translation lookaside buffer - which is a fancy way of saying a 'cache' of recent translations.

So what happens when we set the page size to 2M instead of 4k in our 16G of RAM example? Our page count drops to 8,192. Not only is each page table walk shorter, but we greatly increase our chances of hitting the TLB. This is especially true for unikernels, as they are always provisioned as virtual machines, which means nested page tables on the host: a TLB miss there has to walk both the guest and host page tables, making each miss that much more expensive.
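A quick back-of-the-envelope check of those numbers (a standalone sketch; the 8-byte entry size is the size of an x86-64 page table entry):

package main

import "fmt"

func main() {
    const ram = 16 << 30 // 16 GiB

    small := ram / (4 << 10) // number of 4k pages
    large := ram / (2 << 20) // number of 2M pages

    // Each x86-64 page table entry is 8 bytes, so just the leaf-level
    // entries for all that memory cost pages * 8 bytes to maintain.
    fmt.Printf("4k pages: %d (%d MiB of leaf PTEs)\n", small, small*8>>20)
    fmt.Printf("2M pages: %d (%d KiB of leaf PTEs)\n", large, large*8>>10)
}

This prints 4194304 pages (32 MiB of leaf entries) for 4k pages versus 8192 pages (64 KiB of leaf entries) for 2M pages.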

Let's See Some Examples

Onward to the meat and potatoes. In this test we try to trigger a page fault for each allocated page, which should show the performance difference between page sizes. Using this somewhat contrived sample program adapted from the paper, we were consistently able to see massive performance gains:

package main

import (
    "fmt"
    "net/http"
    "runtime"
    "time"
)

func allocTest() {
    start := time.Now()
    size := 50 * 1024 * 1024
    data := make([]uint8, size)
    // touch one byte in every 4k page so each page actually gets faulted in
    for i := 0; i < size; i += 4096 {
        data[i] = 0xff
    }
    fmt.Printf("Elapsed = %v\n", time.Now().Sub(start))
}

func allocHandler(w http.ResponseWriter, r *http.Request) {
    allocTest()
}

func main() {
    allocTest()
    fmt.Printf("Go version: %s, listening on port 8080 ...\n", runtime.Version())
    http.HandleFunc("/alloc", allocHandler)
    http.ListenAndServe("0.0.0.0:8080", nil)
}

When we run it we immediately see the difference:

eyberg@venus:~/thp$ ops run thp
running local instance
booting /home/eyberg/.ops/images/thp ...
en1: assigned 10.0.2.15
Elapsed = 18.736866ms
Go version: go1.22.1, listening on port 8080 ...
^Cqemu-system-x86_64: terminating on signal 2
signal: killed

eyberg@venus:~/thp$ ops run thp --nanos-version=522d802
running local instance
booting /home/eyberg/.ops/images/thp ...
en1: assigned 10.0.2.15
Elapsed = 6.769979ms
Go version: go1.22.1, listening on port 8080 ...

Here's the same example in Java:

public class Hello {

  public static void main(String[] args) {

    long startTime = System.currentTimeMillis();

    final byte[] ray = new byte[50 * 1024 * 1024];
    for (int i = 0; i < ray.length; i+= 4096) {
        ray[i] = (byte)0xff;
    }

    long endTime = System.currentTimeMillis();

    System.out.println("That took " + (endTime - startTime) + " milliseconds");

  }
}

eyberg@venus:~/jv$ ops pkg load eyberg/java:20.0.1 -c config.json --nanos-version=522d802
running local instance
booting /home/eyberg/.ops/images/java ...
en1: assigned 10.0.2.15
That took 11 milliseconds
eyberg@venus:~/jv$ ops pkg load eyberg/java:20.0.1 -c config.json --nanos-version=522d802
running local instance
booting /home/eyberg/.ops/images/java ...
en1: assigned 10.0.2.15
That took 12 milliseconds
eyberg@venus:~/jv$ ops pkg load eyberg/java:20.0.1 -c config.json
running local instance
booting /home/eyberg/.ops/images/java ...
en1: assigned 10.0.2.15
That took 25 milliseconds
eyberg@venus:~/jv$ ops pkg load eyberg/java:20.0.1 -c config.json
running local instance
booting /home/eyberg/.ops/images/java ...
en1: assigned 10.0.2.15
That took 23 milliseconds

Now let's try something similar with Python. A first pass might look something like this:

import time

t1 = time.time()
size = 500 * 1024 * 1024
arr = bytearray(size)

# touch one byte per 4k page to force a page fault
i = 0
while i < len(arr):
    arr[i] = 0xff
    i += 4096
t2 = time.time()
print("Time=%s" % (t2 - t1))

... and indeed we see the same sort of speedup, and the operation itself is fairly fast. But a lot of Python code is not built with byte arrays - it is built with lists, and lists take up far more room: a bytearray is a packed sequence of bytes, whereas a list is a sequence of pointers to objects. This compounds further once we start doing anything interesting with them.

Here's the same example with a list:

eyberg@venus:~$ cat hi.py
import time

t1 = time.time()
lst = [0xff] * (500*1024*1024)
t2 = time.time()
print("Time=%s" % (t2 - t1))

This actually takes up a ton of RAM - as you can see we have to bump the instance memory with -m6g to accommodate it - but it really paints a picture of how much faster things go:

eyberg@venus:~$ ops pkg load eyberg/python:3.10.6 -a hi.py -m6g
running local instance
booting /home/eyberg/.ops/images/python3.10 ...
en1: assigned 10.0.2.15
Time=1.4076340198516846
en1: assigned FE80::7CB4:30FF:FEC4:3EC1

eyberg@venus:~$ ops pkg load eyberg/python:3.10.6 -a hi.py -m6g
running local instance
booting /home/eyberg/.ops/images/python3.10 ...
en1: assigned 10.0.2.15
Time=1.4125440120697021
en1: assigned FE80::1CB0:4BFF:FEC7:FFC6

eyberg@venus:~$ ops pkg load eyberg/python:3.10.6 -a hi.py -m6g --nanos-version=522d802
running local instance
booting /home/eyberg/.ops/images/python3.10 ...
en1: assigned 10.0.2.15
Time=0.6030755043029785

eyberg@venus:~$ ops pkg load eyberg/python:3.10.6 -a hi.py -m6g --nanos-version=522d802
running local instance
booting /home/eyberg/.ops/images/python3.10 ...
en1: assigned 10.0.2.15
Time=0.5960953235626221

Now, THP is not always appropriate - in fact many databases suggest disabling it to reduce memory fragmentation. Gil Tene even calls it a "form of in-kernel GC", which it kinda is. Databases can also employ their own methods of memory allocation that THP can interfere with, but for other workloads, as you can see, it can be an immense benefit.

Recent versions of glibc also let you turn huge pages on or off for malloc at will, so you can see what kind of performance to expect:

GLIBC_TUNABLES=glibc.malloc.hugetlb=1
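For instance, to compare a glibc-linked program with and without the hint (./your-app here is just a placeholder for your own binary; the tunable only affects memory managed by glibc's malloc):

GLIBC_TUNABLES=glibc.malloc.hugetlb=1 ./your-app
GLIBC_TUNABLES=glibc.malloc.hugetlb=0 ./your-app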

Depending on your application's workload, huge pages are another one of those features that make a lot more sense for a single-application server, a la a unikernel (versus your typical Ubuntu box with a hundred different programs running). As with all things, we encourage you to benchmark and measure your workloads yourself.
