The Roberto Selbach Chronicles

About |  Blog |  Archive

Tag: lru

An LRU in Go (Part 2)

So we created a concurrency-safe LRU in the last post, but it was too slow when used concurrently because of all the locking.

Reducing the amount of time spent waiting on locks is actually not trivial, but not undoable. You can use things like the sync/atomic package to cleverly change pointers back and forth with basically no locking needed. However, our situation is more complicated than that: we use two separate data structures that need to be updated atomically (the list itself and the index.) I don’t know that we can do that with no mutexes.

We can, however, easily lessen the amount of time spent waiting on mutexes by using sharding.

You see, right now we have a single list.List to hold the data and a map for the index. This means that when one goroutine is trying to add, say, {"foo": "bar"}, it has to wait until another finishes updating {"bar" : "baz" } because even though the two keys are not related at all, the lock needs to protect the entire list and index.

A quick and easy way to go around this is to split the data structures into shards. Each shard has its own data structures:

type shard struct {
cap int
len int32

sync.Mutex // protects the index and list
idx map[interface{}]*list.Element // the index for our list
l *list.List // the actual list holding the data
}

And so now, our LRU looks a bit different:

type LRU struct {
    cap int // the max number of items to hold
    nshards int // number of shards
    shardMask int64 // mask used to select correct shard for key
    shards []*shard
}

The methods in LRU are now also much simpler as all they do is call an equivalent method in a given shard:

func (l *LRU) Add(key, val interface{}) {
    l.shard(key).add(key, val)
}

So now it’s time to figure out how to select the correct shard for a given key. In order to prevent two different values of the same key to be stored in two different shards, we need a given key to always return the same shard. We do this by passing a byte representation of the key through the Fowler–Noll–Vo hash function to generate a hash as an int32 number. We do a logical AND between this hash and a shard mask.

The hard part is actually getting a byte representation of a key. It would be trivial if the key was of a fixed type (say, int or string) but we actually use interface{}, which gives us a bit more work to do. The shard() function actually is a large type switch that tries to find the quickest way possible to find the byte representation of any given type.

For instance, if the type is int, we do:

const il = strconv.IntSize / 8
func intBytes(i int) []byte {
    b := make([]byte, il)
    b[0] = byte(i)
    b[1] = byte(i >> 8)
    b[2] = byte(i >> 16)
    b[3] = byte(i >> 24)
    if il == 8 {
        b[4] = byte(i >> 32)
        b[5] = byte(i >> 40)
        b[6] = byte(i >> 48)
        b[7] = byte(i >> 56)
    }
    return b
}

We do similar things to all of the known types. We also check for custom types that provide a String() string method (i.e. implement the Stringer interface.) This should allow for getting the byte representation for the vast majority of types people are likely to want to use as a key. If, however, the type is unknown and is not a Stringer (nor has a Bytes() []byte method), then we fall back to using gob.Encoder, which will work but is very slow to make it realistic usable.

So what does all of this does for us? Now we never lock the entire LRU when doing operations, instead locking only small portions of it when needed, this results in my less time spent waiting for mutexes on average.

We can play around with how many shards we want depending on how much data we store, but we can potentially have thousands of very small shards to provide very fine locking. Of course, since dealing with sharding does have a small overhead, at some point it is not worth adding more.

The following benchmarks were done with 10000 shards and without concurrency —

BenchmarkAdd/mostly_new-4 1000000    1006 ns/op
BenchmarkAdd/mostly_existing-4 10000000  236 ns/op
BenchmarkGet/mostly_found-4 5000000  571 ns/op
BenchmarkGet/mostly_not_found-4 10000000     289 ns/op
BenchmarkRemove/mostly_found-4 3000000   396 ns/op
BenchmarkRemove/mostly_not_found-4 10000000  299 ns/op

And this with 10000 shards with concurrent access —

BenchmarkAddParallel/mostly_new-4 2000000    719 ns/op
BenchmarkAddParallel/mostly_existing-4 10000000  388 ns/op
BenchmarkGetParallel/mostly_found-4 10000000     147 ns/op
BenchmarkGetParallel/mostly_not_found-4 20000000     126 ns/op
BenchmarkRemoveParallel/mostly_found-4 10000000  142 ns/op
BenchmarkRemoveParallel/mostly_not_found-4 20000000  338 ns/op

Still, each shard still locks completely at every access, we might do better than that.

Source code on Github.

An LRU in Go (Part 1)

In my job, I often catch myself having to implement a way to keep data in memory so I can use them at a later time. Sometimes this is for caching purposes, sometimes it’s to keep track of what I’ve sent some other service earlier so I can create a diff or something. Reasons are plenty.

I noticed that I keep writing versions of the same thing over and over and when we’re doing that, maybe it’s time to try and create a more generic solution. So let’s start simple, shall we? What we want is to implement a least-recently-used list (LRU) with no fuss.

Naïve LRU

We start with a simple and very naïve implementation (see the full implementation on Github.)

type LRU struct {
    cap int                           // the max number of items to hold
    idx map[interface{}]*list.Element // the index for our list
    l   *list.List                    // the actual list holding the data
}

We don’t have much there, we have a doubly-linked list l to hold the data and a map to serve as an index. In true Go fashion, we create an initializer —

func New(cap int) *LRU {
    l := &LRU{
        cap: cap,
        l:   list.New(),
        idx: make(map[interface{}]*list.Element, cap+1),
    }
    return l
}

This is nice. Package users will initialize the list with something like l := lru.New(1000) and joy will spread thoughout the land. But I strongly believe the empty value of a struct should also be usable, which is not the case here :

var l LRU
l.Add("hello", "world")

This will panic because neither l nor idx are initialized. My go-to solution for this is to always provide a simples unexported initialization function that is run at the beginning of each exported method and makes sure the initialization is done.

func (l *LRU) lazyInit() {
    if l.l == nil {
        l.l = list.New()
        l.idx = make(map[interface{}]*list.Element, l.cap+1)
    }
}

Now, let’s see how we’d add a new item to the list —

func (l *LRU) Add(k, v interface{}) {
    l.lazyInit()

    // first let's see if we already have this key
    if le, ok := l.idx[k]; ok {
        // update the entry and move it to the front
        le.Value.(*entry).val = v
        l.l.MoveToFront(le)
        return
    }
    l.idx[k] = l.l.PushFront(&entry{key: k, val: v})

    if l.cap > 0 && l.l.Len() > l.cap {
        l.removeOldest()
    }
    return
}

We call lazyInit to make sure everything is initialized than we check whether the key already is in our list. If it is, we only need to update the value and move it to the front of the list as it’s now become the most-recently-used.

If, however, the key was not already in our index, then we need to create a new entry and push it to the list. By the way, this is the first time we see entry, which is an unexported type that we actually store in our list.

type entry struct {
    key, val interface{}
}

Finally, we check whether by adding the new element, we just passed the capacity of the list (if cap is less than 1, we treat as there being no limit to capacity, by the way). If we are over capacity, we go ahead and delete the oldest element in the list.

This is a naïve implementation, but it’s also very fast.

BenchmarkAdd/mostly_new-4                2000000            742 ns/op
BenchmarkAdd/mostly_existing-4          10000000            153 ns/op
BenchmarkGet/mostly_found-4             10000000            222 ns/op
BenchmarkGet/mostly_not_found-4         20000000            143 ns/op
BenchmarkRemove/mostly_found-4          10000000            246 ns/op
BenchmarkRemove/mostly_not_found-4      20000000            142 ns/op

Getting and removing elements is in the same ballpark as accessing a built-in Go map. Getting an existing item adds a bit of overhead on top of accessing the map because it also needs to move the element to the front of the list, but that’s a very fast pointer operation so it’s not that bad.

Adding has to check for an existing item as well as adding/moving the list element, so it’s relatively slower, but still good enough for my immediate needs.

There is a big problem though. This implementation is not at all safe for concurrent use. Sure, package users could protect it with a sync.Mutex but we may want to hide the complexity from them.

The (once again naïve) solution is to protect the list and map accesses with a sync.Mutex of our own.

Concurrent-safe Naïve LRU

Not much needs to be changed to make our naïve implementation safe for concurrent use (full implementation on Github.). We add a mutex —

type LRU struct {
    cap int // the max number of items to hold

    sync.Mutex                               // protects the idem and list
    idx        map[interface{}]*list.Element // the index for our list
    l          *list.List                    // the actual list holding the data
}

And at the top of every exported entrypoint we do something like —

func (l *LRU) Add(k, v interface{}) {
    l.Lock()
    defer l.Unlock()
    l.lazyInit()

Et voilà, we’re safe for concurrent use. But let’s see what the benchmarks tell us.

BenchmarkAdd/mostly_new-4            2000000           819 ns/op
BenchmarkAdd/mostly_existing-4      10000000           200 ns/op
BenchmarkGet/mostly_found-4         10000000           277 ns/op
BenchmarkGet/mostly_not_found-4     10000000           219 ns/op
BenchmarkRemove/mostly_found-4       5000000           325 ns/op
BenchmarkRemove/mostly_not_found-4  10000000           220 ns/op

Overall the code is slightly slower due to the (for now minimal) overhead of the mutex operations. However, although we are safe for concurrent use, we’re not really taking advantage of it, the entire code is completely sequential. An Add will prevent anyone from getting an item. Even a call to Len() locks everyone out.

This will be clear if we modify the tests to run in parallel.

BenchmarkAddParallel/mostly_new-4            1000000          1040 ns/op
BenchmarkAddParallel/mostly_existing-4       5000000           433 ns/op
BenchmarkGetParallel/mostly_found-4          5000000           293 ns/op
BenchmarkGetParallel/mostly_not_found-4     10000000           289 ns/op
BenchmarkRemoveParallel/mostly_found-4       5000000           305 ns/op
BenchmarkRemoveParallel/mostly_not_found-4  10000000           291 ns/op

On average, the calls will take longer because most of the times they’ll have to wait for another operation to finish and free the mutex.

In the next post, we’ll try and cut on the locking considerably.