Replace the map implementation in the golang standard library, SwissTable is faster?

In the latest instalment of the Go dev team’s weekly meeting, they discuss Google3 re-evaluating the performance of SwissTable, and that it might be worth following up with a replacement for the standard library map.

SwissTable

[MichaelP]: Google3 may want to do some benchmarking on this. Maps are used heavily in google3. This may be of value to us.

So this may be worth investing in.

Some people may not understand SwissTable (or called Swiss Tables) is what Google3 and what?

Google3 may be familiar to people inside Google, but the concept is rarely introduced outside. Google3 refers to Google’s large code repository (monorepo), 95% of Google’s code is in this large repository, except for Chrome and Android and other special projects.

SwissTable is a Google open source implementation of a hash table (hash table, map), announced out at cppcon 2017, which is heavily used internally by Google to replace std::unordered_map , which is very efficient due to its compact design, and it has also been ported to other languages, for example, Rust has a library hashbrown based on the SwissTable implementation, and in Rust 1.36 started to replace the standard library HashMap.

The official documentation and some developers’ notes describe the benefits of SwissTable, such as

Swiss Tables Design Notes
Swisstable, a Quick and Dirty Description
Swisstable: a faster hash table than std::unordered_map in C++.

So Go development team also began to focus on SwissTable, perhaps in a future version to implement it and replace the standard library map, after all, these basic data types and algorithms will help millions of people use Go to do the development of the project, such as sorting algorithms, for example, last year bytedance developers submitted a pdqsort algorithm, replacing the standard library sorting algorithms, now again! Someone proposed dual-pivot quicksort, according to the test performance is better and more concise, we have been using for decades these people have been the basic algorithms and data structures or tireless optimisation.

Although the Go development team is currently only focusing on and discussing SwissTable, and may not implement it in the future, the community has recently implemented SwissTable: swiss.

I introduced this author’s maphash in a post last year, and it looks like the author has been preparing for his SwissTable, and this year he submitted a Go implementation of SwissTable, which calculates hashes using his maphash.

The author has written an article about his swiss design, we can at least understand from this library how the Go language design SwissTable, but also in some optimisation scenarios you can privately use this library to replace the built-in map, as the author did.

The built-in map uses the open addressing method (open-hashing , also called zip method), key-value pairs with conflicting hash values (hash values are the same) will be placed in a bucket, when looking up, first find the corresponding bucket according to the hash value, and then traverse the elements in the bucket to find the corresponding value, of course, there are still some optimisations in this, by extra bits to do faster checking, the built-in map details can refer to the sharing of 2016 GopherCon conference: Inside the Map Implementation.

SwissTable uses a different hashing scheme called Closed Hashing (also known as open address method). Each hash value has its own slot, and the choice of slot is determined by the hash value, the simplest way is to start from the hash(key) mod size slot, and keep searching backward until it reaches the corresponding key or an empty slot (non-existent key), which is also a common way of using the open-address method. SwissTable also uses short hash like the built-in map to support fast checking, but its metadata is stored separately from the hash.

Using this method is more friendly to the CPU cache, and more importantly, it is possible to compare 16 short hashes in parallel with the SSE instruction.

Since the find method is the basis for the other methods Get(), Has(), Put(), Delete(), which look for the corresponding slots, let’s look at the library swiss and see how it implements the above logic (pseudo-code).

func (m Map) find(key) (slot int, ok bool) {
    h1, h2 := hashOf(key)            // High-low hash bit, similar to the handling of the built-in map, implements short address
    s := modulus(h1, len(m.keys)/16) //Calculate the position at which the probe starts
    for {
        // Probe implemented using the SSE command, using h2 for short address matching.
        matches := matchH2(m.metadata[s:s+16], h2)
        for _, idx := range matches {
            if m.keys[idx] == key { 
                return idx, true // Found it. Return to the corresponding slot.
            }
        }
        // Use the SSE command to find empty slots
        matches = matchEmpty(m.metadata[s:s+16])
        for _, idx := range matches {
            return idx, false // Empty slot found, return
        }
        s += 16
    }
}

The library also provides code for benchmark, and we tested its performance against the built-in map on an Apple M2 and a dummy machine, respectively.

goos: darwin
goarch: arm64
pkg: github.com/dolthub/swiss
BenchmarkStringMaps/n=16/runtime_map-8         	190298112	         6.716 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=16/swiss.Map-8           	114514650	         9.000 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=128/runtime_map-8        	181336688	         6.934 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=128/swiss.Map-8          	124560243	        13.63 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=1024/runtime_map-8       	100000000	        10.09 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=1024/swiss.Map-8         	100000000	        11.27 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=8192/runtime_map-8       	60224208	        19.38 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=8192/swiss.Map-8         	88910022	        13.28 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=131072/runtime_map-8     	53933996	        22.12 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=131072/swiss.Map-8       	76083596	        16.36 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=16/runtime_map-8          	262228116	         4.678 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=16/swiss.Map-8            	227993193	         5.439 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=128/runtime_map-8         	242425221	         4.708 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=128/swiss.Map-8           	185926908	         5.876 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=1024/runtime_map-8        	173284822	         6.709 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=1024/swiss.Map-8          	186861410	         6.550 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=8192/runtime_map-8        	71231763	        16.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=8192/swiss.Map-8          	139595205	         8.635 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=131072/runtime_map-8      	64039132	        19.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=131072/swiss.Map-8        	100000000	        11.56 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/dolthub/swiss	33.538s

goos: linux
goarch: amd64
pkg: github.com/dolthub/swiss
cpu: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
BenchmarkStringMaps/n=16/runtime_map-2         	97587910	        12.65 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=16/swiss.Map-2           	61206505	        19.38 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=128/runtime_map-2        	92861481	        13.35 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=128/swiss.Map-2          	58353951	        20.83 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=1024/runtime_map-2       	51516268	        22.70 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=1024/swiss.Map-2         	51832698	        23.09 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=8192/runtime_map-2       	40324459	        29.54 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=8192/swiss.Map-2         	45826951	        26.19 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=131072/runtime_map-2     	26659296	        43.71 ns/op	       0 B/op	       0 allocs/op
BenchmarkStringMaps/n=131072/swiss.Map-2       	30621675	        39.84 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=16/runtime_map-2          	128090280	         9.332 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=16/swiss.Map-2            	87047056	        15.01 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=128/runtime_map-2         	125906628	         9.837 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=128/swiss.Map-2           	79239448	        15.52 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=1024/runtime_map-2        	65208208	        17.22 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=1024/swiss.Map-2          	77075527	        15.81 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=8192/runtime_map-2        	48505800	        25.28 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=8192/swiss.Map-2          	64617066	        18.19 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=131072/runtime_map-2      	36938596	        32.38 ns/op	       0 B/op	       0 allocs/op
BenchmarkInt64Maps/n=131072/swiss.Map-2        	41026358	        28.41 ns/op	       0 B/op	       0 allocs/op

As you can see, when the number of keys is relatively small, swiss can’t take advantage of its SSE instruction, but in the case of a very large number of key values, swiss has an obvious advantage.

The author’s maphash library, introduced earlier, can take advantage of the hardware’s ability to compute hashes faster on some platforms that support the AES instruction.

Not only that, swiss also has its advantages in terms of memory consumption. According to the author’s test, swiss can save more memory.

swiss can be more memory efficient

The reason why the built-in map’s memory will step up when the number of keys increases is because of the bit-hacking optimisation model that is commonly used, which generally uses an exponent of 2 for division in order to improve the method of finding the remainder, so when you need to expand the memory when the number of keys increases, the memory allocated will double. So you see the memory of Go’s built-in maps increasing in steps, which is sometimes very wasteful.

The swiss library uses an algorithm proposed by Daniel Lemire for fast summation (more accurately, modulo reduction), an idea that seems simple but is actually very clever (x * N) / 232:

1
2
3

func fastModN(x, n uint32) uint32 {
    return uint32((uint64(x) * uint64(n)) >> 32)
}

This method restricts the type of n to uint32 and supports a maximum of 2 ^ 32, and since each bucket has 16 elements, the swiss library supports a maximum of 2 ^ 36 elements, which is sufficient for most scenarios.

Ref: https://colobu.com/2023/06/29/replace-std-map-faster/