Renzibei

Hash Table Benchmark

2026-06-13T15:27:44.000Z

This is yet another benchmark to compare different hash tables (hashmaps) with different hash functions in C++, attempting to evaluate the performance of lookup, insertion, deletion, iteration, etc. on different datasets as comprehensively as possible.

We show the performance of hash tables with hash functions for different operations on different types of datasets of different sizes. The reader can refer to these results and choose the hash table and hash function that best match the target application.

The benchmarks were collected in 2022–2023 (machine configurations are listed below); this writeup was compiled and published in 2026.

Before viewing the benchmark results

Anyone familiar enough with hash tables knows that even a well-known and widely used hash table can have a data distribution it is not very good at. In other words, no hash table is the fastest on all datasets for all operations.

The best practice for selecting a hash table is to consider the data characteristics, operation mix, requirements, and hash function together.

This benchmark tries to use a concise and effective method to test the performance of different operations of the hash tables on some of the most common data distributions. But there will always be data distributions that are very different from the data we use for testing, and different users have different requirements for different indicators. Therefore, the best test is still one in the real application.

Methodology

Test Items

We measure combinations of different hash tables with different hash functions. For each combination, we measure its insert, delete, lookup (including successful and failed lookups), and iteration performance under different data. Below is a more detailed table of test items. Please note that in the following we use "hash table" and "hash map" interchangeably to refer to the same concept.

Index	Test items	Notes
1	Insert with reserve	Call map.reserve(n) before inserting n elements
2	Insert without reserve	Insert n elements without prior reserve
3	Erase and insert	Repeatedly do one erase after one insert, keeping the map size constant
4	Look up keys in the map (hit)	Repeatedly look up the elements that are in the map
5	Look up keys that are not in the map (miss)	Repeatedly look up the elements that are not in the map
6	Look up keys with 50% probability in the map	Repeatedly look up elements that each have a 50% probability of being in the map
7	Look up keys in the map with large max_load_factor (hit)	Same as Test Item 4 except that the map is given a max_load_factor of 0.9 and rehashed before the lookup operations
8	Look up keys that are not in the map with large max_load_factor (miss)	Same as Test Item 5 except that the map is given a max_load_factor of 0.9 and rehashed before the lookup operations
9	Look up keys with 50% probability in the map with large max_load_factor	Same as Test Item 6 except that the map is given a max_load_factor of 0.9 and rehashed before the lookup operations
10	Iterate the table	Iterate the whole table several times
11	Heap memory size and load factor with default and large max_load_factor	Record the heap memory size and load factor when constructing the map in Test Items 4 and 7

As may be noticed, several test items measure lookup speed with a larger upper limit on the load factor. The load factor measures how full the hash table is, and max_load_factor is the STL API for controlling its upper bound. We include these items because each hash table may have a different expansion strategy and default max_load_factor, so even with the same number of elements, different tables may end up with different load factors and occupy different amounts of memory. The load factor and memory footprint greatly affect lookup performance: a lower load factor costs more memory and therefore more cache misses, while a higher load factor raises the probability of collisions. Either effect can dominate, so a table with a smaller max_load_factor may look up faster or slower than one with a larger limit.

In addition, extreme lookup performance usually requires making the space used by the hash table as small as possible to reduce the cache miss rate. When available memory is very limited, a larger load factor may also be preferred. One way to do this is to set a higher max_load_factor, and then rehash (or set a large max_load_factor before the main construction process of the table).
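The "raise the limit, then rehash" step can be sketched with std::unordered_map, whose max_load_factor, rehash and load_factor members are standard. The helper names build_map and pack_table are hypothetical and are not taken from the benchmark code; whether rehash(0) actually shrinks the bucket array is up to the library, the standard only guarantees that the resulting load factor respects the limit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Map = std::unordered_map<std::uint64_t, std::uint64_t>;

// Hypothetical helper: fill a map with n sequential keys.
inline Map build_map(std::size_t n) {
    Map map;
    for (std::uint64_t k = 0; k < n; ++k) map.emplace(k, k * 2);
    return map;
}

// Hypothetical helper: raise the load-factor limit, then rehash so the
// table may repack itself. Returns the load factor after the rehash.
inline float pack_table(Map& map, float new_max_load_factor) {
    assert(new_max_load_factor > 0.0f);
    map.max_load_factor(new_max_load_factor);
    // rehash(0) requests the smallest bucket count that still satisfies
    // size() / bucket_count() <= max_load_factor().
    map.rehash(0);
    return map.load_factor();
}
```

Calling pack_table(map, 0.9f) after construction corresponds to the setup described for Test Items 7 to 9; other tables expose similar but not necessarily identical member functions.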

For each of the tests above, we measured throughput and, when the platform under test meets the conditions for the latency test, latency. The throughput results are more representative: modern CPUs are pipelined, and in real software almost every operation is surrounded by other instructions that can overlap with it and keep the pipeline full. For some specific uses, however, latency data is important. The latency measurements here are for reference only for such special needs and have relatively large limitations.

Dataset

All the data used in the benchmark are randomly generated; the user can choose different seeds for the test data. We tested the performance of each hash table at different sizes from 32 to 10^7.

The tested keys consist of 64-bit integers of different distributions and strings of different lengths. The detailed test data is shown in the table below.

Index	Key Type	Value Type	Notes
1	uint64_t with several split bits masked	uint64_t	The keys have the following characteristic: only bits in some positions may be 1, and all other bits are 0. For test data of size n, at most ceil[log2(n)] fixed bits may be 1. For example, if the key type is uint8_t (it is uint64_t in reality) and the test size is 7, the keys will be generated with the method `rng() & 0b10010001`. Such a bit distribution can examine relatively comprehensively whether hash tables and hash functions can handle keys that only carry effective information in specific bit positions.
2	uint64_t, uniformly distributed in [0, UINT64_MAX]	uint64_t	The keys follow a uniform distribution in the range [0, UINT64_MAX].
3	uint64_t, bits in high position are masked out	uint64_t	The bits in the high position are set to 0. For test data of size n, at most ceil[log2(n)] fixed bits may be 1. For example, if the key type is uint8_t (uint64_t in reality) and the test size is 7, the keys will be generated with the method `rng() & 0b00000111`
4	uint64_t, bits in low position are masked out	uint64_t	The bits in the low position are set to 0. For test data of size n, at most ceil[log2(n)] fixed bits may be 1. For example, if the key type is uint8_t (uint64_t in reality) and the test size is 7, the keys will be generated with the method `rng() & 0b11100000`
5	uint64_t with several bits masked	56-byte struct	The keys have the same distribution as dataset 1. The payload is a 56-byte struct, which makes `sizeof(std::pair)==64`
6	Small string with a max length of 12	uint64_t	The key type is a string with a maximum length of 12. Both length and characters are randomly generated. The compiler/library may use Small String Optimization (SSO).
7	Small string with a fixed length of 12	uint64_t	The key type is a string with a fixed length of 12. The characters are randomly generated. The compiler/library may use Small String Optimization (SSO).
8	Mid string with a max length of 24	uint64_t	The key type is a string with a maximum length of 24. Both length and characters are randomly generated.
9	Mid string with a fixed length of 24	uint64_t	The key type is a string with a fixed length of 24. The characters are randomly generated.
10	Large string with a max length of 64	uint64_t	The key type is a string with a maximum length of 64. Both length and characters are randomly generated.
11	Large string with a fixed length of 64	uint64_t	The key type is a string with a fixed length of 64. The characters are randomly generated.
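The `rng() & mask` idea behind the masked integer datasets (1, 3 and 4) can be sketched as follows. All function names here are hypothetical, the choice of std::mt19937_64 is an assumption, and the sketch says nothing about how the benchmark itself picks the split-bit positions or handles duplicate keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Number of bits allowed to vary for n keys: ceil(log2(n)).
inline unsigned bits_for_size(std::size_t n) {
    assert(n >= 1);
    unsigned bits = 0;
    while (bits < 64 && (std::uint64_t{1} << bits) < n) ++bits;
    return bits;
}

// Only the lowest `bits` positions may be 1 (high bits masked out).
inline std::uint64_t low_mask(unsigned bits) {
    return bits >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << bits) - 1;
}

// Only the highest `bits` positions may be 1 (low bits masked out).
inline std::uint64_t high_mask(unsigned bits) {
    assert(bits <= 64);
    return bits == 0 ? 0 : ~std::uint64_t{0} << (64 - bits);
}

// `bits` distinct, randomly chosen positions may be 1 (split bits).
inline std::uint64_t split_mask(unsigned bits, std::mt19937_64& rng) {
    assert(bits <= 64);
    std::uint64_t mask = 0;
    unsigned chosen = 0;
    while (chosen < bits) {
        const std::uint64_t bit = std::uint64_t{1} << (rng() % 64);
        if ((mask & bit) == 0) {
            mask |= bit;
            ++chosen;
        }
    }
    return mask;
}

// Draw n keys whose set bits are confined to `mask`.
inline std::vector<std::uint64_t> make_keys(std::size_t n, std::uint64_t mask,
                                            std::mt19937_64& rng) {
    std::vector<std::uint64_t> keys(n);
    for (auto& k : keys) k = rng() & mask;
    return keys;
}
```

For a test size of 7, bits_for_size returns 3, matching the three set bits in each of the 8-bit example masks in the table.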

Different distributions within the range representable by uint64_t are chosen as keys. Uniformly distributed integers in the range of uint64_t are the easiest to generate with pseudo-random numbers, but they are rare in real situations.

If users are concerned with performance using integer keys, we strongly recommend focusing on the results on the first dataset rather than the second dataset (i.e. the dataset with a uniform random distribution). The data from the first dataset can better examine the ability of hash tables and hash functions to deal with more diverse patterns, while the test on uniform random distributions barely verifies the ability of hash tables to handle other distributions. Moreover, in real data distributions, few keys happen to be uniformly randomly distributed over the [0, 2^64 - 1] range.

With this in mind, our analysis for integer keys focuses mainly on the first dataset. To keep the articles shorter and easier to read, the other integer datasets (including the second, uniform random one and the high-/low-bit-masked ones) are mostly shown only in the appendix of each test, without detailed discussion.

For the string datasets, different character sets are used. For fixed-length strings, the pattern is like the first dataset, where several split bits are masked. In other words, only the bits in some positions may differ among the keys. This pattern is intended to test the quality of the hash function.

For the strings with variable length, a subset of the printable characters can appear in the string.

Real data distributions are often biased. If a combination of hash function and hash table can only handle one distribution but cannot handle other distributions, this combination is not robust to unknown distributions. If the distribution of the data is known in advance, the user can pick the fastest and most stable hash table for that data.

Tested Hash functions and Hash maps

Below is the list of hash functions we tested.

Name	Type	Notes	Link
std::hash	Normal	Implemented by the compiler's standard library; identity hash is used for integer types in libc++ and libstdc++
absl::Hash	Normal	Implemented by Google; uses the 128-bit product of a multiplication and an xor-shift.	https://github.com/abseil/abseil-cpp
robin_hood::hash	Normal	For integer keys, it uses xor-shift, multiplication, xor-shift; for string keys it is similar to absl::Hash	https://github.com/martinus/robin-hood-hashing
xxHash_xxh3	Bytes	Designed for strings; we use the identity hash for integer types so that the code compiles; it does not appear in the results for integer keys	https://github.com/Cyan4973/xxHash

Originally we had some seeded hash functions in the tests, which are hash functions that take both a key and a seed as arguments. We removed these hash functions to keep the test subjects simple, and we use the no-seed version of all the hash tables.

We will not show the results of the xxHash_xxh3 hash in tests on integer keys. For the early versions of absl::Hash, the behavior on the arm64 platform was different from that on the x86-64 platform, and it was poor for some datasets. So we once had a uint128_mul::hash to compare with it, which is similar to absl::Hash on the x86-64 platform. Since the newest version of absl::Hash has fixed this problem, we deleted uint128_mul::hash.

The following table lists the hash tables we tested. Some of these hash tables rely on a "good" hash function to work properly, which can generate hash values that are as uniformly distributed as possible for unbalanced keys. If a hash function that does not have such a property (e.g. identity hash) is used, then the performance of these hash tables may drop drastically. These hash tables may assume that the hash values of the keys from the dataset are uniformly distributed in the output range. This requires hash functions to have properties like uniformity or diffusion.

The implication here is that a "good" hash function tends to be more complex than the simplest hash function (the identity hash), requiring more instructions to complete the computation. Some hash tables do not rely on a good hash function, perhaps because they do some extra work to improve the uniformity of the hash values. For such a hash table, the simpler the hash function, the better, preferably the identity hash. So we should always compare the combinations of hash tables and hash functions, rather than fixing the hash function to compare the hash table, or vice versa.
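A small sketch of why the pairing matters, under the assumption of a table that picks a bucket by masking the low bits of the hash (a common design, not necessarily that of any table listed here). MixHash is the generic splitmix64 finalizer, used only as a stand-in for a hash with good diffusion; it is not one of the benchmarked functions, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Identity hash: returns the key unchanged.
struct IdentityHash {
    std::uint64_t operator()(std::uint64_t key) const noexcept { return key; }
};

// Xor-shift / multiply mixer (splitmix64 finalizer), illustration only.
struct MixHash {
    std::uint64_t operator()(std::uint64_t x) const noexcept {
        x ^= x >> 30;
        x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27;
        x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }
};

// Count how many distinct buckets the keys land in when the bucket index
// is the low bits of the hash. bucket_count must be a power of two.
template <class Hash>
std::size_t distinct_buckets(const std::vector<std::uint64_t>& keys,
                             std::size_t bucket_count) {
    assert(bucket_count != 0 && (bucket_count & (bucket_count - 1)) == 0);
    Hash hash;
    std::unordered_set<std::size_t> used;
    for (std::uint64_t k : keys) {
        used.insert(static_cast<std::size_t>(hash(k) & (bucket_count - 1)));
    }
    return used.size();
}

// Keys that differ only in high bit positions (low 48 bits are zero).
inline std::vector<std::uint64_t> high_bit_keys(std::size_t n) {
    std::vector<std::uint64_t> keys;
    keys.reserve(n);
    for (std::uint64_t i = 0; i < n; ++i) keys.push_back(i << 48);
    return keys;
}
```

With 256 such keys and 1,024 buckets, the identity hash sends every key to the same bucket, while the mixer spreads them over many buckets. A table that does its own mixing or uses a non-power-of-two bucket count would not show this collapse, which is the reason for comparing table and hash function as a pair.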

Here are the hash maps we tested.

Name	Requires a good hash function	Notes	Link
std::unordered_map	No*	Implemented by the STL library; May differ in libc++ and libstdc++.
ska::flat_hash_map	No	Very fast and simple; Uses robin hood hash; Memory overhead: alignof(value_type) per element; Requires small load factor	https://github.com/skarupke/flat_hash_map
ska::bytell_hash_map	No	A little slower than ska::flat_hash_map but one byte per element memory overhead	https://github.com/skarupke/flat_hash_map
absl::flat_hash_map	Yes	Uses SIMD and metadata; Fast when looking up keys that are not in the map; One byte per element memory overhead	https://abseil.io/about/design/swisstables
absl::node_hash_map	Yes	Slower than absl::flat_hash_map but does not invalidate pointers to elements after rehash	https://github.com/abseil/abseil-cpp
tsl::robin_map	Yes	A fast hash table using robin hood hash; Memory overhead is no less than ska::flat_hash_map	https://github.com/Tessil/robin-map
emhash::HashMap7	Yes	Fast in lookup hit operations.	https://github.com/ktprime/emhash
fph::DynamicFphMap	No	A dynamic perfect hash table; Ultra-fast in lookup but slow in insert; 2~8 bits per element memory overhead	https://github.com/renzibei/fph-table
fph::MetaFphMap	No	A dynamic perfect hash table using metadata; Better than fph::DynamicFphMap in the miss lookup case.	https://github.com/renzibei/fph-table
robin_hood::unordered_flat_map	Yes	A table using robin hood hash	https://github.com/martinus/robin-hood-hashing
ankerl::unordered_dense_map	Yes	Stores entries in a dense array; fastest to iterate; compact footprint	https://github.com/martinus/unordered_dense

* Note: For the tested libc++ and libstdc++ versions, the libc++ implementation requires a good hash function but libstdc++ has no such requirement. If using std::hash, the performance can be poor for libc++ when the size is a power of 2.

At a quick glance, it is easy to see that many of the hash tables listed use the robin hood hashing technique in the pursuit of speed.

Experiments and Results

The code of this benchmark is available at https://github.com/renzibei/hashtable-bench.

Testing Platform

Platform 1: Intel Xeon E-2388G CPU @ 3.20 GHz, boost to 5.1 GHz; x86-64; Rocket Lake.

Platform 2: M1 Max MacBook Pro 16 inch, 2021; arm64; Firestorm.

Due to the lack of a high-precision time stamp counter on the arm64 (M1 Max) platform, we only measured the latency on the x86-64 platform (even AMD CPUs have some problems when measuring latency with the TSC, so we only test latency on the Intel CPU). In addition, for the x86-64 platform, we have also taken some measures to ensure the stability of the measurement results, including the following:

Use the taskset command to set CPU core affinity
Turn off hyperthreading
Isolate the cores by adding isolcpus= and rcu_nocbs= in GRUB_CMDLINE_LINUX in /etc/default/grub
Turn off some power-saving options, including disabling the ondemand systemd service, and setting idle=poll and intel_idle.max_cstate=0 in the grub command line.
Turn off timer tick interrupts: recompile the kernel with CONFIG_NO_HZ_FULL=y and set nohz_full= in the grub command line.
Other adjustments that de-jitter the system latency. You can refer to https://rigtorp.se/low-latency-guide/.

These measures cannot be done on macOS. But as we do not measure the latency of operations on macOS, it doesn't matter that much.

Results

For throughput data, performance will be represented by the average time per operation. We will plot the average time per operation for different scales of data. The shorter the time, the better the performance.

For the latency data, due to the limited article length, we only show the 99th-percentile latency in most test cases, which helps to show the worst-case behavior and long-tail latency of the hash table. Even that is not enough to reflect worst-case latency: for a long-tailed distribution, the 0.99, 0.999, and 0.9999 quantiles can all have very different values. If the application has strict requirements on real-time performance and tail latency (such as gaming and high-frequency trading), then this metric is worth paying attention to.
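To make the quantile remark concrete, here is a minimal sketch that computes a nearest-rank quantile from a vector of per-operation latencies. The function name and the nearest-rank definition are assumptions for illustration; the benchmark may compute its percentiles differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank quantile: the sample at 1-based rank ceil(q * N) in sorted
// order. Takes the samples by value because nth_element reorders them.
inline double quantile(std::vector<double> samples, double q) {
    assert(!samples.empty() && q > 0.0 && q <= 1.0);
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(q * static_cast<double>(samples.size())));
    rank = std::min(std::max<std::size_t>(rank, 1), samples.size());
    auto nth = samples.begin() + static_cast<std::ptrdiff_t>(rank - 1);
    std::nth_element(samples.begin(), nth, samples.end());
    return *nth;
}
```

For a made-up sample of 1,000 latencies in which 980 are 10 ns, 15 are 100 ns and 5 are 5,000 ns, the 0.5 quantile is 10 ns, the 0.99 quantile is 100 ns and the 0.999 quantile is 5,000 ns, so a P99 figure alone would hide the slowest operations.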

If a test takes too much time, we count it as a timeout and set the time to zero, and that data point is not plotted.

You can click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

We divide the results into different groups according to data type and operation type.

Conclusion

In short, no single hash table is best for every workload: the right pick depends on which operations dominate, what the keys look like, and how much memory is available. If lookups dominate and the table is built once, the perfect-hash fph::DynamicFphMap (and fph::MetaFphMap when misses are common) is hard to beat, at the cost of slow construction. For a general-purpose map with a mix of inserts and lookups, absl::flat_hash_map with absl::Hash is a fast, compact and robust default; ska::flat_hash_map is the quickest while the data stays in cache, and ankerl::unordered_dense_map is the one to reach for when iteration is frequent. std::unordered_map is the slowest in most tests and is worth keeping mainly for its pointer-stability guarantees.

For the full reasoning behind these picks, with a per-test and per-workload comparison, see the Analysis & Conclusion.

Restrictions

Exclusive access to resources

In our tests, almost all computer resources can be monopolized by the test program, especially cache resources. This is relatively rare in practical applications, where other processes and tasks may occupy part of the cache. Users should therefore expect a smaller available cache in practice.

Cold memory and warm cache

We neither did a warmup, nor did we specifically test the cold start scenario. In our tests, we repeatedly test an operation with a range of data many times. Therefore, when the number of operations is much greater than the amount of data, it can be considered that most operations are accessing the warm cache. When the number of operations is less than the amount of data, most operations are accessing cold memory. In our test, limited by the test time, when the amount of data is small, the number of operations will be much greater than the amount of data; and when the amount of data is large, the number of operations will be equal to the amount of data.

Huge test space

The size of the test space is at least

|hash table set| x |hash function set| x |data sets| x |operation set| x |hardware platform set| x |compiler set|

As can be seen, the testable space is quite huge. Any addition to the set of hash tables or the set of hash functions will greatly increase the testing effort. Due to time and resource constraints, we have only explored part of the combinations, and there are still many combinations and spaces that we have not tested.

Therefore, in order to choose the most suitable hash table and hash function for the user's purpose, real tests should be carried out in the application scenario.

Postscript

This hash table benchmark series took at least four years from start to finish. The first version of the data was already available in 2022, when the M1 Max was still a fairly new CPU; now even the M5 is out. It took this long because organizing so many charts and analyses was genuinely tedious. I finished the main body of the posts roughly between 2022 and 2023, but left a lot of cleanup work undone out of laziness. Because this kept hanging over me, the blog also fell into a kind of head-of-line blocking: I kept thinking, "I still haven't finished this series," and ended up not publishing other posts either.

Over these years, many things have changed. CPUs have gone through several generations, new hash functions and hash tables have appeared, and LLM technology has also advanced rapidly. Many things discussed in these posts may already be somewhat out of date. All I can say is: time passes like a river, never stopping day or night.

Whatever the quality of this benchmark series may be, I have decided to stop iterating on it and publish it as it is.

Hash Table Benchmark - 24 Byte String Lookup

2026-06-13T13:55:00.000Z

The 24 byte string lookup test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

There are two kinds of keys in this test: strings with a fixed length of 24 bytes, and strings with a max length of 24 bytes.

At 24 bytes the keys have crossed the Small String Optimization boundary: a std::string of this length no longer fits inline (libstdc++ stores up to 15 bytes inline), so the fixed-24 keys are all heap-allocated and the lookup must dereference a pointer to reach the characters before it can hash or compare them. This adds a near-guaranteed cache miss per key to the hash/compare cost discussed in the 12-byte post, and it makes the fixed-length and max-length variants behave quite differently: in the max-24 case many keys are short enough to stay inline, avoiding that extra indirection. The four hashes tested are again std::hash, absl::Hash, robin_hood::hash, and xxHash_xxh3; with more bytes to digest per key, the bytes-optimized xxh3 now wins for almost every table. Each chart shows the best hash per table; Xeon E-2388G and M1 Max throughput are paired and latency is Xeon-only.
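Whether a given std::string sits on the inline side of the SSO boundary can be probed with a small helper. This is a sketch that relies on implementation behavior rather than anything the standard guarantees: it assumes the inline buffer lives inside the string object, which holds for current libstdc++, libc++ and MSVC. The function name is hypothetical.

```cpp
#include <cstdint>
#include <string>

// Returns true when the character buffer lies inside the std::string
// object itself, i.e. the string is stored inline via SSO.
// Pass by reference: a copy would be a different object.
inline bool is_stored_inline(const std::string& s) {
    const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto object_end = object_begin + sizeof(std::string);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= object_begin && data < object_end;
}
```

On those implementations an 8-character string reports inline storage, while 24-character and 64-character strings report heap storage, which is the distinction that separates the max-length and fixed-length key sets in this post.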

Throughput

Lookup keys in the table (hit)

Use default max_load_factor

Two things stand out compared to the 12-byte test. First, xxHash_xxh3 is now the per-table winner for essentially every table on the Xeon, because the larger key makes hashing a bigger fraction of the work and xxh3's byte throughput dominates. Second, the whole field is slower and more tightly bunched: with every fixed-24 key on the heap, a hit pays both the hashing of 24 bytes and a pointer chase to the stored characters for the comparison. On the Xeon the robin-hood-style ska::flat_hash_map and tsl::robin_map (xxh3) lead at scale, about 115 ns at 10^7, with the rest of the flat tables within 15-20 ns of them; std::unordered_map is the main outlier at 192 ns. In cache (1,024 elements) fph::DynamicFphMap and ankerl::unordered_dense_map are quickest at roughly 11.5-12 ns, the perfect-hash table again benefiting from its single-probe guarantee before memory traffic dominates.

The M1 Max separates the field more clearly: the perfect-hash tables fph::DynamicFphMap and fph::MetaFphMap (xxh3) lead through the mid-sizes (15.5 and 16.9 ns at 32,768) and stay near the front to 10^7 (125.7 and 139.0 ns), helped by the M1's large caches keeping their sparser arrays resident longer. std::unordered_map is again last (208 ns at 10^7).

The max-length variant below is noticeably faster, often nearly half the time at small sizes (for instance tsl::robin_map at about 6.8 ns vs 18.6 ns at 1,024 on the Xeon), precisely because most max-24 keys are short enough to stay inline and skip the heap dereference. The large-max_load_factor charts keep the same ranking with slightly denser packing.

Use large max_load_factor

Lookup keys not in the table (miss)

Use default max_load_factor

As at 12 bytes, misses are cheaper and fph::MetaFphMap leads because its metadata rejects an absent key without dereferencing the heap pointer or comparing any bytes. On the Xeon it answers a fixed-24 miss in about 6 ns at 1,024 and 10.6 ns at 32,768, ahead of the SwissTable tables (absl::flat_hash_map 11.6 ns, absl::node_hash_map 12.0 ns), and it stays fastest to 10^7 (69.8 ns). The robin-hood tables again lag in the mid-range (tsl::robin_map 24.6 ns, ska::flat_hash_map 25.1 ns at 32,768) because of their longer probe runs. The metadata advantage is even larger on the M1 Max, where fph::MetaFphMap resolves a miss in 4.7-9.5 ns up through 200,000 elements and only 40.5 ns at 10^7, well clear of the field.
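The way per-slot metadata lets a miss skip the key bytes can be illustrated with a toy open-addressing set that keeps a one-byte tag next to each slot and only compares strings when the tag matches. This is a hypothetical sketch for illustration, not the layout of fph::MetaFphMap, absl::flat_hash_map or any other benchmarked table: it uses linear probing, takes the tag from the top 7 bits of std::hash, and never grows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Tag value reserved for empty slots; real tags only use 0..127.
constexpr std::uint8_t kEmptyTag = 0x80;

class TaggedStringSet {
public:
    // capacity must be a power of two and larger than the number of keys.
    explicit TaggedStringSet(std::size_t capacity)
        : tags_(capacity, kEmptyTag), keys_(capacity) {
        assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    }

    void insert(const std::string& key) {
        assert(size_ + 1 < tags_.size());  // always keep one empty slot
        const std::size_t h = std::hash<std::string>{}(key);
        std::size_t i = h & (tags_.size() - 1);
        while (tags_[i] != kEmptyTag) {
            if (tags_[i] == tag_of(h) && keys_[i] == key) return;  // present
            i = (i + 1) & (tags_.size() - 1);
        }
        tags_[i] = tag_of(h);
        keys_[i] = key;
        ++size_;
    }

    // Adds the number of full string comparisons performed to
    // string_compares, so callers can see how often key bytes were touched.
    bool contains(const std::string& key, std::size_t& string_compares) const {
        const std::size_t h = std::hash<std::string>{}(key);
        std::size_t i = h & (tags_.size() - 1);
        while (tags_[i] != kEmptyTag) {
            if (tags_[i] == tag_of(h)) {  // cheap one-byte filter first
                ++string_compares;
                if (keys_[i] == key) return true;
            }
            i = (i + 1) & (tags_.size() - 1);
        }
        return false;
    }

private:
    // Top 7 bits of the hash; the bucket index uses the low bits.
    static std::uint8_t tag_of(std::size_t h) {
        return static_cast<std::uint8_t>(h >> (sizeof(std::size_t) * 8 - 7));
    }

    std::vector<std::uint8_t> tags_;
    std::vector<std::string> keys_;
    std::size_t size_ = 0;
};
```

In this toy, an absent key is usually rejected by tag bytes alone, since an occupied slot shares its 7-bit tag only about once in 128 probes, so the heap-stored characters of the resident keys are rarely read on a miss. A hit always needs at least one full comparison.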

The max-length variant follows the same ordering but with smaller absolute numbers since many keys stay inline; on the M1 Max fph::MetaFphMap dips to just 9-11 ns through 1.2M elements. The large-max_load_factor charts are again consistent with this pattern.

Use large max_load_factor

Lookup keys with a 50% probability in the table

Use default max_load_factor

The mixed workload lands between the two: absl::flat_hash_map, robin_hood::unordered_flat_map (xxh3), and fph::MetaFphMap share the lead, around 8.6-10.2 ns at 1,024 and roughly 105-117 ns at 10^7 on the Xeon, with std::unordered_map trailing at 187 ns. Because half the queries are misses that never touch the heap-stored bytes, the metadata-friendly tables do better here than in the pure-hit case. The M1 Max keeps the same set of flat tables in front with its usual flatter curves. The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

Use large max_load_factor

Latency

The P99 latency charts (Xeon only) capture the worst 1% of lookups. The 24-byte heap layout makes these tails heavier than at 12 bytes, because a slow lookup can miss the cache on the slot array, on the heap-stored key bytes, and on the page table all at once.

Lookup keys in the table (hit)

Use default max_load_factor

For fixed-24 hits the tails climb steeply and converge: by 10^7 every flat table sits in a 750-830 ns band, with robin_hood::unordered_flat_map (xxh3) best at 750 ns and std::unordered_map worst at 930 ns. The jump from the in-cache regime is large, the tail rising from about 50 ns at 1,024 to several hundred nanoseconds already at 32,768, since the worst-case lookup now reliably misses on the heap-stored key. The max-length variant has visibly lower tails (e.g. roughly 535-695 ns at 10^7) because the inline short keys spare those lookups the extra heap miss.

Use large max_load_factor

Lookup keys not in the table (miss)

Use default max_load_factor

On the miss path the metadata tables keep their tails lowest in cache: fph::MetaFphMap and ankerl::unordered_dense_map hold around 22-51 ns up to 32,768, while the robin-hood tables already spike past 145 ns there. By 10^7 the tails again merge into the 600-700 ns range. The max-24 variant is the more telling one: there the tail stays low far longer (the leaders are near 130 ns even at 200,000 elements) before climbing, because most missing keys are rejected from inline data without a heap touch. This shows how the SSO boundary, not just the table algorithm, shapes the latency tail.

Use large max_load_factor

Lookup keys with a 50% probability in the table

Use default max_load_factor

With half the queries hitting, the tail is set by the heavier hit path and the fixed-24 curves converge into the 740-980 ns band at 10^7, with robin_hood::unordered_flat_map best at 745 ns and std::unordered_map worst at 980 ns. The metadata advantage that fph::MetaFphMap enjoys on pure misses is diluted here because the hit half still pays the heap dereference and byte compare. As before, the max-length variant and the large-max_load_factor appendix charts follow the same pattern.

Use large max_load_factor


Hash Table Benchmark - 64 Byte String Lookup

2026-06-13T13:55:00.000Z

The 64 byte string lookup test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

There are two kinds of keys in this test: strings with a fixed length of 64 bytes, and strings with a max length of 64 bytes.

At 64 bytes every fixed-length key is well past the Small String Optimization limit, so all of them live on the heap and a lookup must follow a pointer to reach the characters before hashing or comparing them. The key is now as large as a full cache line on x86-64 (64 bytes), which makes the hash function cost the dominant term in the lookup: feeding 64 bytes through the hash takes far longer than the slot arithmetic. As a result xxHash_xxh3, which is tuned for byte throughput, is the per-table winner for essentially every table on both machines in this test, and the differences between table layouts shrink because they all pay the same large hashing and pointer-chasing cost. The fixed-64 and max-64 variants again differ in that the max-length keys include many short, SSO-eligible strings that skip the heap indirection. Charts pair the Xeon E-2388G with the M1 Max for throughput; latency is Xeon-only.

Throughput

Lookup keys in the table (hit)

Use default max_load_factor

With the hashing of 64 bytes dominating, the field is closer together than in the shorter-key tests, but the ordering is still informative. On the Xeon the perfect-hash fph::DynamicFphMap and fph::MetaFphMap (xxh3) are fastest in cache (about 16.4 and 16.8 ns at 1,024, 54.6 and 56.5 ns at 32,768) thanks to their single-probe guarantee, which still matters even when the hash is expensive. At 10^7 they remain near the front (152.7 and 163.5 ns) but the short-probe robin-hood tables tsl::robin_map and ska::flat_hash_map catch up (151.7 and 152.2 ns). Every table uses xxh3 as its best hash here. The node-based std::unordered_map is clearly behind at 246.7 ns, paying a heap node dereference on top of the already-heavy key dereference.

The M1 Max shows the perfect-hash tables leading more decisively, with fph::DynamicFphMap fastest across the whole range (15.5 ns at 32,768, 112.7 ns at 10^7), helped by the large M1 caches keeping its sparser array resident. std::unordered_map is again last at 184.5 ns.

The max-length variant below is markedly faster, for instance tsl::robin_map runs a hit in about 9.9 ns at 1,024 versus 18.4 ns for fixed-64, because the short SSO-eligible keys avoid both the heap dereference and the cost of hashing a full 64 bytes. The large-max_load_factor charts keep the same ranking with denser packing.

Use large max_load_factor

Lookup keys not in the table (miss)

Use default max_load_factor

Misses are again the case where fph::MetaFphMap stands out: its metadata lets it reject an absent key without dereferencing the heap-stored characters or comparing any bytes, so on the Xeon it leads at 8.2 ns at 1,024, 16.9 ns at 32,768, and stays fastest to 10^7 (95.5 ns). absl::flat_hash_map and ankerl::unordered_dense_map follow closely, while the robin-hood tables tsl::robin_map and ska::flat_hash_map fall behind in the mid-range (41.6 and 43.1 ns at 32,768) due to longer probe runs. Avoiding the full 64-byte compare matters a lot here, so the schemes that short-circuit on a stored tag or fingerprint pull noticeably ahead. The M1 Max amplifies the metadata advantage: fph::MetaFphMap answers a miss in 5.9-18 ns up to 200,000 elements and just 54.2 ns at 10^7. std::unordered_map is the slowest at scale on both machines (231 ns Xeon, 125 ns M1 at 10^7).

The max-length variant follows the same ranking with smaller numbers, and the large-max_load_factor charts are consistent with this pattern.

Use large max_load_factor

Lookup keys with a 50% probability in the table

Use default max_load_factor

The mixed workload sits between the two cases. On the Xeon absl::flat_hash_map, robin_hood::unordered_flat_map and fph::MetaFphMap (all xxh3) share the lead, around 11.7-12 ns at 1,024 and 137-148 ns at 10^7, with std::unordered_map trailing at 235 ns. On the M1 Max fph::MetaFphMap is fastest across most sizes (22.8 ns at 32,768, 104.2 ns at 10^7), since the miss half of the workload rewards its metadata while the hit half still benefits from its single-probe layout. The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

:

Use large max_load_factor

:

Latency

The P99 latency charts (Xeon only) show the tail. With 64-byte heap-allocated keys, a worst-case lookup can miss on the slot array, on the key's heap buffer, and on the page table, so the tails are the heaviest of the three string tests.

Lookup keys in the table (hit)

Use default max_load_factor

:

For fixed-64 hits the tails converge sharply: from about 67-92 ns at 1,024 they jump to the 550-620 ns range already at 32,768 and then to roughly 830-955 ns at 10^7 for the plotted tables, where r_h::unordered_flat_map (xxh3) is at the front at 830 ns and absl::node_hash_map is the slowest of them at 955 ns. std::unordered_map is slower still; its tail runs past the chart's 1,000 ns display limit (about 1,025 ns at 10^7), so it is not drawn here. The steep early rise reflects the guaranteed heap miss on the key bytes once the buffers no longer fit in cache. The max-length variant has lower tails (the leading tables run roughly 290-330 ns at 32,768 versus 550+ for fixed) because the inline short keys spare many lookups the heap miss.

:

Use large max_load_factor

:

Lookup keys not in the table (miss)

Use default max_load_factor

:

On the miss path fph::MetaFphMap keeps the lowest tail in cache (24.8 ns at 1,024, 105.6 ns at 32,768) because its metadata settles a miss without touching the heap-stored key bytes, and ankerl::unordered_dense_map is the next best. The robin-hood tables blow up earliest, past 400 ns at 32,768. The max-64 variant pushes the tail-blowup point out considerably, with the leaders staying near 52-62 ns at 32,768, because most missing keys are short enough to be rejected from inline data. As at 24 bytes, this shows that the SSO boundary shapes the latency tail as much as the table algorithm does.

:

Use large max_load_factor

:

Lookup keys with a 50% probability in the table

Use default max_load_factor

:

With half the queries hitting, the tail is governed by the heavier hit path and most fixed-64 curves bunch into the 805-890 ns band at 10^7 (many tables cluster near 805-845 ns, with fph::MetaFphMap the highest among them at about 890 ns); std::unordered_map again runs past the chart's 1,000 ns display limit (about 1,065 ns) and is not drawn. The metadata advantage that fph::MetaFphMap enjoys on pure misses is diluted because the hit half still requires the heap dereference and 64-byte compare. The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

:

Use large max_load_factor

:


Hash Table Benchmark - 12 Byte String Lookup

2026-06-13T13:55:00.000Z

The 12 byte string lookup test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

There are two kinds of keys in this test: strings with a fixed length of 12 bytes, and strings with a max length of 12 bytes.

Unlike the integer tests, string-key lookup spends a large share of its time inside the hash function and the key comparison: the whole byte sequence has to be fed through the hash, and on a hit the candidate slot's bytes have to be compared against the query. This makes the choice of hash function much more visible than for integers. The four hashes tested here are std::hash, absl::Hash, robin_hood::hash, and xxHash_xxh3; xxh3 is purpose-built for hashing byte ranges, so it tends to win for the tables whose lookup loop is hash-bound. A second string-specific effect is Small String Optimization (SSO): a 12-byte std::string is stored inline in the string object, so there is no separate heap allocation and the characters are reached without a pointer dereference. The fixed-length variant makes every key exactly 12 bytes, while the max-length variant lets keys vary up to 12 bytes, which also exercises the length-dependent branches of the hash and comparison code. Below, each chart shows the fastest hash per table (click the legend to reveal the rest); the Xeon E-2388G and M1 Max throughput charts are paired, and latency is measured on the Xeon only.
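The SSO effect can be observed directly. The following is a minimal sketch, not part of the benchmark code: the helper name `stored_inline` is hypothetical, and it relies on the assumption (true for the common standard library implementations) that an inline string keeps its characters inside the `std::string` object itself.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Returns true when the character buffer lies inside the std::string object,
// i.e. the string is using the Small String Optimization.
// Addresses are compared as integers to avoid comparing unrelated pointers.
bool stored_inline(const std::string& s) {
    const auto obj_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto obj_end = obj_begin + sizeof(std::string);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj_begin && data < obj_end;
}
```

With the usual implementations a 12-byte key reports `true` and a 64-byte key reports `false`; the exact inline capacity is implementation-specific, so keys between those lengths may differ from one standard library to another.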

Throughput

Lookup keys in the table (hit)

Use default max_load_factor

:

On a successful lookup the table has to compute the hash, find the slot, and then actually compare the 12 query bytes against the stored key, so every table pays the full hash-plus-compare cost. On the Xeon E-2388G the perfect-hash fph::DynamicFphMap is the fastest while the working set stays in cache: with xxHash_xxh3 it does a hit in about 6.5 ns at 1,024 elements and 8.4 ns at 32,768, ahead of every conventional table. This is the regime where fph shines, because its perfect hash guarantees the key lands in its slot on the first probe, so the only memory touch is the single slot it reads. As the table grows past the L3 cache the picture inverts: once the lookup becomes memory-bound, the tables that keep their probe sequence short and local win instead. At 10^7 elements tsl::robin_map and ska::flat_hash_map (both with xxh3) lead at about 42 ns, while fph::DynamicFphMap slows to 57 ns and the metadata-heavier fph::MetaFphMap to 64 ns, because the perfect-hash tables spread their entries over a sparser array and miss the cache more often at scale.

The hash choice is decisive here. xxHash_xxh3 is the per-table winner for most of the open-addressing and perfect-hash tables, since their lookup loop is dominated by hashing 12 bytes; the SwissTable-style emhash::hash_map7 and ska::bytell_hash_map instead pair best with absl::Hash. The node-based tables trail badly once they leave cache: absl::node_hash_map reaches 107 ns and std::unordered_map 118 ns at 10^7, each lookup chasing a heap node pointer on top of the hash work.

The M1 Max tells a similar but flatter story. With its larger caches the SwissTable-family tables lead at scale (emhash::hash_map7 and ska::bytell_hash_map around 60-66 ns at 10^7), and absl::Hash is the winning hash for several tables there rather than xxh3, reflecting the Arm-tuned implementation. The node-based tables again finish last (std::unordered_map at about 200 ns at 10^7).

The max-length variant below behaves almost identically: with all keys still fitting SSO and capped at 12 bytes, length variation barely changes the hash/compare cost, so the rankings and magnitudes match the fixed-length case. The large-max_load_factor charts pack the entries more densely, which slightly helps cache residency at large sizes but lengthens probe sequences, leaving the overall ordering unchanged.

:

Use large max_load_factor

:

Lookup keys not in the table (miss)

Use default max_load_factor

:

A miss is cheaper than a hit because most tables can reject the query from metadata alone, without ever comparing the 12 bytes. This is where fph::MetaFphMap is the clear winner: its per-slot metadata lets it answer "not present" after a single metadata check, so on the Xeon it returns a miss in about 3.6 ns at 1,024 elements and stays under 10 ns all the way to 1.2M elements (9.65 ns), far ahead of the field. Even at 10^7, where it becomes DRAM-bound and rises to 30.7 ns, it is competitive with the best SwissTable tables (absl::flat_hash_map at 25.1 ns, r_h::unordered_flat_map at 29.0 ns). The robin-hood-style tables tsl::robin_map and ska::flat_hash_map are much slower in the cache regime (16-26 ns at 32,768-200,000) because their backward-shift probing must walk a run of slots before concluding the key is absent.

Because a miss avoids the full byte comparison, the hash function matters slightly less here, and the per-table winner is more often absl::Hash than xxh3. The node-based std::unordered_map remains the slowest at scale (110 ns at 10^7), since even a miss requires hashing and then traversing a bucket's node chain. On the M1 Max the same ordering holds with smaller absolute numbers: fph::MetaFphMap answers misses in 4-6 ns through 1.2M elements and only 18.6 ns at 10^7, again the tightest curve on the platform.

The max-length variant and the large-max_load_factor charts repeat this pattern; the only visible change is that the denser large-load-factor tables lengthen the robin-hood probe runs slightly, widening their gap behind the metadata-based and SwissTable tables.

:

Use large max_load_factor

:

Lookup keys with a 50% probability in the table

Use default max_load_factor

:

The 50%-hit workload mixes hits and misses, so it sits between the two cases and the ranking changes accordingly. On the Xeon the SwissTable-style absl::flat_hash_map and r_h::unordered_flat_map (both with xxh3) now top the small-to-mid range at about 6 ns at 1,024 and 13-14 ns at 32,768, with fph::MetaFphMap right alongside them; the perfect-hash tables no longer dominate the cache regime as cleanly as in the pure-hit case because half the queries are misses that the metadata tables resolve very cheaply. At 10^7 the robin-hood tables again pull ahead on memory locality (tsl::robin_map 43.5 ns, ska::flat_hash_map 44.4 ns), while std::unordered_map trails at 119.5 ns.

The M1 Max ranks r_h::unordered_flat_map and absl::flat_hash_map first across most sizes, with the usual flatter curves; the per-table best hash is a mix of xxh3 and absl::Hash. As before, the max-length variant and the large-max_load_factor charts follow the same pattern with no qualitative change.

:

Use large max_load_factor

:

Latency

The P99 latency charts (Xeon only) show the tail of the lookup-time distribution: the 99th-percentile single lookup, which is dominated by the worst cache and TLB misses on the probe path. As the working set outgrows the L3 cache, every table's tail jumps to several hundred nanoseconds because the slowest 1% of lookups now take a full DRAM round trip.
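To make the percentile concrete, here is a minimal sketch of how a P99 value can be extracted from a set of per-lookup timings. It uses the nearest-rank definition as an assumption; the benchmark's own sampling and percentile method may differ, and the function name `percentile_ns` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Nearest-rank percentile of per-operation latencies (in nanoseconds).
// The vector is taken by value because std::nth_element reorders it.
double percentile_ns(std::vector<double> samples, unsigned pct) {
    assert(!samples.empty() && pct >= 1 && pct <= 100);
    // rank = ceil(pct / 100 * n), computed in integers to avoid rounding issues.
    const std::size_t rank = (samples.size() * pct + 99) / 100;
    const auto nth = samples.begin() + static_cast<std::ptrdiff_t>(rank - 1);
    std::nth_element(samples.begin(), nth, samples.end());
    return *nth;
}
```

For 100 samples of which 98 take 10 ns and 2 take 500 ns, the median is 10 ns while the P99 is 500 ns, which is why a handful of DRAM round trips dominates the tail even when the average barely moves.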

Lookup keys in the table (hit)

Use default max_load_factor

:

For hits, the tail is set by how many cache lines the probe sequence touches in its worst case. In cache, fph::DynamicFphMap and fph::MetaFphMap have the lowest tails (about 21 ns at 1,024 and 36-37 ns at 32,768) because their single-probe guarantee bounds the worst case tightly. Once the table spills to DRAM the tails converge into the 460-560 ns band for the flat tables, where r_h::unordered_flat_map and absl::flat_hash_map are best at 10^7 (about 527 and 557 ns). The node-based tables have the worst tails throughout (absl::node_hash_map 680 ns, std::unordered_map 795 ns at 10^7), since a single lookup can miss the cache both on the bucket array and on the node it points to.

:

Use large max_load_factor

:

Lookup keys not in the table (miss)

Use default max_load_factor

:

The miss tail is where fph::MetaFphMap's metadata pays off most dramatically: its P99 stays extremely flat, only 16-35 ns from 1,024 up to 200,000 elements and 59.5 ns at 1.2M, where every other table has already climbed into the hundreds of nanoseconds. Because a miss is settled by a single metadata read, its worst case rarely needs a second cache line, so the tail does not blow up until the metadata itself no longer fits in cache (482 ns at 10^7). The robin-hood tables tsl::robin_map and ska::flat_hash_map show the opposite behaviour, jumping to about 412 ns already at 200,000 elements because a worst-case miss walks a long probe run. std::unordered_map again has the heaviest tail (905 ns at 10^7).

:

Use large max_load_factor

:

Lookup keys with a 50% probability in the table

Use default max_load_factor

:

With half the queries hitting, the tail is governed by the more expensive hit path: the flat SwissTable tables r_h::unordered_flat_map and absl::flat_hash_map keep the best P99 (about 23 and 22 ns at 1,024, rising to 527 and 534 ns at 10^7). fph::MetaFphMap no longer has the flat advantage it showed on pure misses, since the hit half of the workload still requires the byte comparison and a possible extra cache line. The node-based std::unordered_map once more has the largest tail (850 ns at 10^7). The max-length variant and the large-max_load_factor appendix charts follow the same pattern.

:

Use large max_load_factor

:


Hash Table Benchmark - Integer Iterate

2026-06-13T13:55:00.000Z

The integer key iteration test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the performance of iterating over the hash table.

Unlike lookup or insertion, iteration speed is dominated almost entirely by how a table stores its entries in memory, not by the hash function. There are three broad storage strategies among the tables we test:

Dense array storage. ankerl::unordered_dense_map and emhash::hash_map7 keep all key-value pairs packed in a contiguous array and store only indices (or small metadata) in the hash slots. Iterating is then a linear scan over a dense array, so the cost per element is small and, crucially, independent of the load factor.
Inline open addressing. ska::flat_hash_map, ska::bytell_hash_map, tsl::robin_map, absl::flat_hash_map, fph::* and robin_hood::unordered_flat_map store the entries directly in a sparse slot array. To iterate they must walk the whole slot array and skip the empty slots, so the work per element grows as the table becomes emptier and the array grows larger than the cache.
Node-based storage. std::unordered_map and absl::node_hash_map allocate each entry in a separate node. Iterating chases pointers between nodes that may be scattered across the heap, which is cheap while the nodes stay in cache but degrades sharply once they spill to DRAM.
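The difference between the first two strategies can be sketched with two toy layouts. This is an illustration only, assuming a `std::vector` of entries for the dense case and a `std::vector` of `std::optional` slots for the sparse case; the names `Entry`, `ScanResult`, `iterate_dense` and `iterate_sparse` are hypothetical and none of the benchmarked libraries is implemented this way literally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using Entry = std::pair<std::uint64_t, std::uint64_t>;

struct ScanResult {
    std::uint64_t sum;          // sum of all values seen
    std::size_t slots_touched;  // memory slots read to get there
};

// Dense storage: every slot read is a live entry.
ScanResult iterate_dense(const std::vector<Entry>& entries) {
    ScanResult r{0, 0};
    for (const Entry& e : entries) {
        ++r.slots_touched;
        r.sum += e.second;
    }
    return r;
}

// Inline open addressing: empty slots must be read and skipped too.
ScanResult iterate_sparse(const std::vector<std::optional<Entry>>& slots) {
    ScanResult r{0, 0};
    for (const auto& slot : slots) {
        ++r.slots_touched;
        if (slot) r.sum += slot->second;
    }
    return r;
}
```

With three entries stored in a ten-slot sparse array, both functions return the same sum, but the sparse scan touches ten slots instead of three. The ratio is the inverse of the load factor, which is why the dense tables are insensitive to it.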

Throughput

:

The charts confirm the picture above. On both platforms ankerl::unordered_dense_map is the fastest by a wide margin and is essentially flat across the whole size range (about 0.22 ns per element on the Xeon E-2388G and about 2.0 ns on the M1 Max), because it always iterates a packed array regardless of how many slots the table has. emhash::hash_map7 is the runner-up with the same flat behaviour (about 0.6 ns on the Xeon and 2.6 ns on the M1).

The inline open-addressing tables behave differently: their per-element cost rises with the number of elements as the slot array grows past the cache. ska::flat_hash_map, for example, climbs from roughly 1 ns at small sizes to about 9-10 ns at 10^7 elements on the Xeon, because most of the time is then spent reading empty slots from memory. The node-based std::unordered_map is the slowest at large sizes (around 35 ns per element at 10^7 on the Xeon and 22 ns on the M1), since iterating its node list becomes a stream of cache-missing pointer dereferences.

A couple of combinations stop after the mid-size points. absl::flat_hash_map and absl::node_hash_map assume a well-distributed hash, but the integer std::hash is the identity function; on the masked-bit keys it collides heavily, so construction times out at large sizes and those data points are recorded as zero and not plotted.

The remaining integer distributions and the 56-byte value type, shown below, give the same ranking. The dense-storage tables stay flat and fastest; the only effect of the larger 56-byte value is that the open-addressing tables, which now move 64 bytes per slot, become memory-bound a little earlier.

:

Latency

The P99 latency of a single iteration step tells the same story from the tail end (latency is measured on the x86-64 platform only). ankerl::unordered_dense_map stays flat at about 1.6 ns regardless of size, because advancing the iterator over a dense array is a sequential access that rarely misses the cache. The inline and node-based tables develop growing tails as the backing storage outgrows the cache: at 10^7 elements the P99 step reaches tens of nanoseconds for the open-addressing tables and hundreds of nanoseconds for the node-based std::unordered_map and absl::node_hash_map (about 374 ns and 439 ns respectively), with each spike corresponding to a cache miss on the next slot or node.

:


Hash Table Benchmark - Integer Erase and Insert

2026-06-13T13:55:00.000Z

The integer key erase and insert test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we first construct a hash table with size N, and then repeat the following operations M times:

Insert a new element into the hash table.
Randomly erase an element from the hash table.

It should be noted that the results of this test are also highly correlated with the distribution of the data, especially the relationship between deleted data and inserted data. Here we randomly select elements to delete with equal probability. In reality, elements may not be selected with equal probability; for example, the most likely deleted element may be the most recently inserted element.

Throughput

We record the time spent in the whole process, which includes both insert and erase operations.

The y axis value is the average time per operation. It is obtained as time/op = (time for insert + time for erase) / (2 * M), that is, the average time taken over all inserts and erases. This number reflects the efficiency of making modifications to the hash table.
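As a worked instance of the formula, the small helper below computes the plotted value from the two accumulated times. The function name `time_per_op_ns` and the numbers in the usage note are hypothetical and are not taken from the benchmark data.

```cpp
#include <cassert>
#include <cstdint>

// time/op = (time for insert + time for erase) / (2 * M),
// where M is the number of insert/erase rounds.
double time_per_op_ns(double insert_total_ns, double erase_total_ns,
                      std::uint64_t rounds) {
    assert(rounds > 0);
    return (insert_total_ns + erase_total_ns) /
           (2.0 * static_cast<double>(rounds));
}
```

For example, if 1,000 rounds spend 6,000 ns in inserts and 4,000 ns in erases, the plotted value is (6,000 + 4,000) / 2,000 = 5 ns per operation.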

:

On Intel Rocket Lake, ska::flat_hash_map with std::hash is the fastest or near-fastest across almost the whole range (about 5 to 37 ns per operation). emhash::hash_map7 with absl::Hash and absl::flat_hash_map with absl::Hash are close behind in the medium range. tsl::robin_map is fast at small sizes and again at very large sizes (about 39 ns at 10^7), but it slumps in the medium range (rising to 60 to 90 ns near 400,000 to 800,000) because of the poorer distribution of robin_hood::hash on these key patterns.

On Apple M1 Max, the combination of ska::flat_hash_map and std::hash has a comparative advantage at almost every data scale (about 6 to 33 ns).

It is also worth noting that on the M1 Max chip, absl::node_hash_map shows a large performance degradation, with a pronounced bump in the data range from about 45,000 to 100,000 elements (peaking near 135 ns). std::unordered_map also exhibits performance degradation, climbing steadily and peaking around 100,000 to 300,000 elements (up to about 600 ns). It is not clear what the cause of this degradation is. It may be related to the system's memory allocation policy, as both hash tables require memory allocation and recycling operations for each insertion and deletion. This phenomenon is more pronounced on datasets with.

:

On Intel Rocket Lake, the rankings are similar to those in the chart above: ska::flat_hash_map with std::hash keeps a comfortable lead (about 8 to 59 ns), and absl::flat_hash_map is the closest follower.

On M1 Max, the relative performance of both absl::flat_hash_map and emhash::hash_map7 increases a bit when the number of elements is in the range of roughly 32,768 to 1,200,000, where emhash::hash_map7 with absl::Hash actually edges ahead of ska::flat_hash_map at several points.

Latency

We record the latency of insert and erase operations separately. The insert latency here is different from that in the "Insert and Construct" test. The latency statistics in the construct test include all operations from size 0 to size N, while in this test the size of the hash table is always N or N + 1.

Insert (after erase)

: Insert Latency

First consider P50 latency. When the data size is small, both ska::flat_hash_map with std::hash and tsl::robin_map with absl::Hash are in the first tier in terms of speed (about 7 to 9 ns). When the number of elements is larger, absl::flat_hash_map and absl::node_hash_map with absl::Hash have a greater advantage: past roughly 300,000 elements the ska/tsl tables jump up (to about 50 ns and beyond) while the absl tables stay near 24 to 28 ns. In addition, the median latency of most open-addressed hash tables converges again as the number of elements approaches 10^7 (around 92 to 99 ns).

For P99 latency, emhash::hash_map7 with absl::Hash has the smallest tail latency through small and medium sizes (about 31 ns at small counts, staying ahead up to roughly 200,000 elements). When the number of elements is larger than that, absl::flat_hash_map comes out on top.

Another point that has to be mentioned about modifying open-addressed hash tables is the tombstone mechanism used in the implementation. For some hash tables, when a delete operation is performed, a special marker (tombstone) is placed on the slot where the deleted element was located. A tombstone marker is not the same as an empty marker: a probe must continue past a tombstone, whereas an empty slot ends it. If a hash table has too many tombstone markers, its lookup performance will be affected. Therefore, some hash tables will rehash when the number of tombstones reaches a certain percentage. This can give an insert operation mixed with erase a poor maximum latency. The P100 latency helps show this.
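The mechanism can be sketched with a toy linear-probing set. This is a simplified illustration under stated assumptions, not the implementation of any benchmarked table: the class name `TombstoneSet`, the 3/4 occupancy threshold and the multiplicative hash are all hypothetical, and unlike some real tables this toy never reuses a tombstone on insert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class TombstoneSet {
public:
    explicit TombstoneSet(std::size_t capacity_pow2) : slots_(capacity_pow2) {
        assert(capacity_pow2 >= 4 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    }
    bool contains(std::uint64_t key) const { return find(key) != kNotFound; }

    void insert(std::uint64_t key) {
        if (contains(key)) return;
        // Live entries and tombstones both lengthen probes, so both count
        // toward the (assumed) 3/4 occupancy limit that triggers a rebuild.
        if ((size_ + tombstones_ + 1) * 4 > slots_.size() * 3) rebuild();
        place(key);
    }
    void erase(std::uint64_t key) {
        const std::size_t i = find(key);
        if (i == kNotFound) return;
        slots_[i].state = State::Tombstone;  // O(1): just mark the slot
        --size_;
        ++tombstones_;
    }
    std::size_t size() const { return size_; }
    std::size_t tombstones() const { return tombstones_; }
    std::size_t rebuilds() const { return rebuilds_; }

private:
    enum class State : std::uint8_t { Empty, Full, Tombstone };
    struct Slot { std::uint64_t key = 0; State state = State::Empty; };
    static constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

    std::size_t home(std::uint64_t key) const {
        const std::uint64_t mixed = (key * 0x9E3779B97F4A7C15ull) >> 32;
        return static_cast<std::size_t>(mixed) & (slots_.size() - 1);
    }
    std::size_t find(std::uint64_t key) const {
        std::size_t i = home(key);
        for (std::size_t n = 0; n < slots_.size(); ++n) {
            const Slot& s = slots_[i];
            if (s.state == State::Empty) return kNotFound;  // only Empty ends a probe
            if (s.state == State::Full && s.key == key) return i;
            i = (i + 1) & (slots_.size() - 1);
        }
        return kNotFound;
    }
    void place(std::uint64_t key) {
        std::size_t i = home(key);
        while (slots_[i].state != State::Empty) i = (i + 1) & (slots_.size() - 1);
        slots_[i].key = key;
        slots_[i].state = State::Full;
        ++size_;
    }
    void rebuild() {  // O(n): this is the occasional latency spike
        std::vector<Slot> old;
        old.swap(slots_);
        std::size_t capacity = old.size();
        if ((size_ + 1) * 4 > capacity * 3) capacity *= 2;  // grow if truly full
        slots_.assign(capacity, Slot{});
        size_ = 0;
        tombstones_ = 0;
        ++rebuilds_;
        for (const Slot& s : old)
            if (s.state == State::Full) place(s.key);
    }

    std::vector<Slot> slots_;
    std::size_t size_ = 0, tombstones_ = 0, rebuilds_ = 0;
};
```

Under an erase-then-insert churn at constant size, almost every operation is cheap, but tombstones pile up until one unlucky insert pays for a full rebuild. That single operation is what a P100 chart exposes and what P50 and P99 hide.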

In addition to the element counts that are powers of 2 (where the first insertion may lead to expansion), some hash tables have P100 latency at some data points proportional to the number of elements. This phenomenon is observed for both the absl-series hash tables and robin_hood::unordered_flat_map. These hash tables should not be selected if the user requires strict maximum latency for modification operations.

: Insert Latency

When the size of value_type becomes 64 bytes, the advantage of ska::flat_hash_map with std::hash over tsl::robin_map increases when the amount of data is small (about 8 ns vs 11 ns at the smallest sizes for P50).

For P99 latency, at small element counts the smallest tail belongs to absl::flat_hash_map with absl::Hash (about 33 ns), ahead of ska::flat_hash_map and emhash::hash_map7.

Erase

: Erase Latency

For P50 latency, ska::flat_hash_map with std::hash almost always has the smallest latency, except in the range from about 300,000 to 800,000 elements. In that range ska::flat_hash_map jumps up (to about 100 to 160 ns) because its aggressive expansion and low load factor push it into a slower memory tier, while absl::flat_hash_map and robin_hood::unordered_flat_map stay lower (about 45 to 120 ns) and perform relatively better. Above 1,200,000 elements the tables converge again.

For P99 latency, emhash::hash_map7 with absl::Hash performs the fastest when the number of elements is small (about 19 ns, leading up to roughly 3,000 elements). The tail latency of absl::flat_hash_map with absl::Hash is smaller in the medium range, roughly from 8,000 to 200,000 elements. When the number of elements is larger, the performance of many hash tables is close.

: Erase Latency

When the size of the value_type is 64 bytes, the picture is basically the same as in the erase latency chart above.

Throughput Appendix

:

Latency Appendix

Insert (after erase)

: Insert Latency

: Insert Latency

Erase

: Erase Latency

: Erase Latency


Hash Table Benchmark - Analysis and Conclusion

2026-06-13T13:55:00.000Z

A full walk-through of what the benchmark showed, across both integer and string keys and small and large values, and how to turn it into a concrete choice of hash table and hash function for a particular workload.

No single winner

The benchmark covers insertion, erasure, lookup (successful, unsuccessful, and a 50% mix), and iteration. It runs them on integer keys of several distributions and on strings of 12, 24 and 64 bytes, with both an 8-byte value and a 56-byte value (so the stored pair is 16 or 64 bytes), at sizes from 32 up to 10^7 elements, on two very different machines: an Intel Xeon E-2388G (Rocket Lake) and an Apple M1 Max. The one lesson that survives all of that variety is that the fastest combination is never the same twice: it changes with the operation that dominates the workload, with the type and distribution of the keys, with the size of the value, with how large the table is relative to the cache, and with the platform.

A second lesson, worth stating up front, is that a hash table and a hash function must be chosen together. Some tables assume the hash already spreads keys uniformly and break down badly if it does not; others do their own mixing and prefer the cheapest possible hash. Every chart therefore compares table + hash pairs rather than tables in isolation.

Lookups

Lookups are usually the operation people care about most, and they are where the design of the table matters most. A lookup breaks into three parts: computing the hash, mapping it to a slot, and loading and comparing keys. Which part dominates depends on the key type and on how much of the table fits in cache.

Integer keys

For integer keys a lookup's time is split between computing and mapping the hash and the memory access that follows, and which one dominates depends on how much of the table sits in cache. The hash is not necessarily cheap: the identity std::hash costs nothing and is fine when the keys are already spread out evenly, but keys whose information sits in only a few bit positions, as with pointers and aligned addresses (whose low bits are always zero) or small sequential IDs (whose high bits are always zero), collide badly under the identity hash and need a real mixing hash such as absl::Hash (a 128-bit multiply and xor-shift) to scatter them across the table. The right hash therefore depends on the keys (discussed further below).

The ranking moves through three regimes as the table grows. While it fits in the L1/L2 cache the memory fetch is fast (only a few nanoseconds), so the per-probe instruction count (the hash and the slot mapping) is comparable to the fetch and is what separates the tables; the leanest combination wins, and ska::flat_hash_map with the identity std::hash is fastest at small sizes (about 1.3 ns per hit at 1,024 elements on both machines), with fph::DynamicFphMap a close second on the M1, trailing it by only about 0.1 ns. In the middle of the range, where the table lives in the L2/L3 caches, fph::DynamicFphMap takes the lead (about 3.4 ns at 200,000 and 10.9 ns at 1.2M on the Xeon) because its bounded probe count keeps cache-line touches low. At the largest sizes, where every lookup misses the cache and the memory access dominates, ska::flat_hash_map is fastest again (about 14.7 ns on the Xeon, 9.3 ns on the M1 at 10^7) because the table that reads the fewest cache lines, a single compact slot array, wins regardless of the hash.

Misses are more decisive. To prove a key absent, a table must rule out every slot it could occupy, and tables that store a byte of metadata per slot can do so without touching the full keys. fph::MetaFphMap is in a class of its own: it rejects an absent key with essentially one memory access, giving it the best average miss time from about 6,000 elements up and a far tighter tail. At 1.2M elements its P99 miss latency is about 34 ns, against about 106-111 ns for r_h::unordered_flat_map / absl::flat_hash_map and 440-465 ns for the probe-walking ska::flat_hash_map and tsl::robin_map. The advantage lasts until the metadata array itself leaves the L3 cache (around 10^7 here), after which it, too, pays one DRAM access. absl::flat_hash_map is the best of the conventional tables on misses, having been built around the same metadata idea. The 50%-hit case is roughly the average of the two, with the alternating outcome adding a branch-misprediction penalty that narrows every gap.
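The metadata idea can be sketched generically as a one-byte tag per slot that is checked before any key is compared. This is an illustration only and does not reproduce the layout of fph::MetaFphMap or absl::flat_hash_map: the class `TaggedSet`, the 7-bit tag taken from the top of the hash, and the linear probing are assumptions, and the caller passes the hash explicitly so the example does not depend on a particular hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint8_t kEmptyTag = 0x80;  // never equal to a 7-bit tag

class TaggedSet {
public:
    explicit TaggedSet(std::size_t capacity_pow2)
        : tags_(capacity_pow2, kEmptyTag), keys_(capacity_pow2) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    }

    // No duplicate check and no growth: the caller keeps the table sparse.
    void insert(std::uint64_t hash, const std::string& key) {
        assert(size_ + 1 < tags_.size());  // always leave one empty slot
        std::size_t i = static_cast<std::size_t>(hash) & (tags_.size() - 1);
        while (tags_[i] != kEmptyTag) i = (i + 1) & (tags_.size() - 1);
        tags_[i] = tag_of(hash);
        keys_[i] = key;
        ++size_;
    }

    // key_compares reports how many full key comparisons were needed.
    bool contains(std::uint64_t hash, const std::string& key,
                  std::size_t& key_compares) const {
        key_compares = 0;
        const std::uint8_t tag = tag_of(hash);
        std::size_t i = static_cast<std::size_t>(hash) & (tags_.size() - 1);
        for (std::size_t n = 0; n < tags_.size(); ++n) {
            if (tags_[i] == kEmptyTag) return false;  // end of probe run
            if (tags_[i] == tag) {                    // only now touch the key
                ++key_compares;
                if (keys_[i] == key) return true;
            }
            i = (i + 1) & (tags_.size() - 1);
        }
        return false;
    }

private:
    static std::uint8_t tag_of(std::uint64_t hash) {
        return static_cast<std::uint8_t>(hash >> 57);  // top 7 bits: 0..127
    }
    std::vector<std::uint8_t> tags_;
    std::vector<std::string> keys_;
    std::size_t size_ = 0;
};
```

A query whose tag differs from the tags of the occupied slots on its probe path is rejected with zero key comparisons, so neither the key bytes nor any heap buffer behind them is ever read. A hit still needs at least one full comparison, which is why the metadata advantage is largest on pure misses.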

String keys

String keys change the picture because hashing costs much more (the whole string must be hashed and compared, rather than a single 64-bit word) and because longer strings may live on the heap. Two effects stand out.

First, the byte-oriented xxHash_xxh3 becomes the best hash for almost every table, and increasingly so as strings lengthen: hashing dominates, so a fast bytes hash matters more than the table. Second, the in-cache lookup floor rises sharply with length. A successful in-cache lookup costs about 1.3 ns for an integer key, about 6.5 ns for a 12-byte string, about 13 ns for a fixed 24-byte string, and about 16 ns for a 64-byte string on the Xeon. The jump between 12 and 24 bytes is mostly the Small String Optimization (SSO): a string of up to ~15 bytes is stored inline in the std::string object, while a longer one is heap-allocated, so a lookup must follow a pointer to a separate cache line to compare the characters. This shows up directly in the data: the max-length-24 keys, most of which are short enough to stay inline, are about twice as fast as the fixed-length-24 keys, which are all heap-allocated (about 7 ns versus 13 ns at small sizes).

Apart from the higher floor, the ranking tracks the integer case: the perfect-hash tables lead in cache (for 64-byte keys fph::DynamicFphMap and fph::MetaFphMap with xxHash_xxh3 are fastest for hits), the robin-hood tables tsl::robin_map and ska::flat_hash_map overtake once memory-bound, and fph::MetaFphMap again dominates misses (about 8 ns at 1,024 and 60 ns at 1.2M for 64-byte keys, versus about 63-68 ns for the next tables).

The effect of the value size

Enlarging the value from 8 to 56 bytes (a 64-byte pair) makes each slot four times bigger, so fewer entries fit in each cache level and every table becomes memory-bound earlier; a hit that cost about 15 ns at 10^7 with an 8-byte value costs about 21 ns with the 56-byte value. This shifts the balance toward tables that touch the fewest cache lines: fph::DynamicFphMap, with its bounded probe count, leads across more of the range with the large value than it does with the small one. If the values are large, weight the mid- and large-size results more heavily than the small-size ones.
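The slot arithmetic can be checked with a few lines of code. The sketch below assumes a 64-byte cache line (as on x86-64) and plain structs standing in for the stored pairs; the names `SmallPair`, `LargePair`, `slots_per_line` and `slots_in_cache` are hypothetical, and per-slot metadata and empty slots are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct SmallPair { std::uint64_t key; std::uint64_t value; };      // 8-byte value
struct LargePair { std::uint64_t key; unsigned char value[56]; };  // 56-byte value

static_assert(sizeof(SmallPair) == 16, "8-byte key + 8-byte value");
static_assert(sizeof(LargePair) == 64, "8-byte key + 56-byte value");

constexpr std::size_t kCacheLineBytes = 64;  // assumption: x86-64 line size

constexpr std::size_t slots_per_line(std::size_t slot_bytes) {
    return kCacheLineBytes / slot_bytes;
}

// Upper bound on how many slots a cache of the given size can hold.
constexpr std::size_t slots_in_cache(std::size_t cache_bytes,
                                     std::size_t slot_bytes) {
    return cache_bytes / slot_bytes;
}
```

One cache line holds four small pairs but only one large pair, and a 16 MB L3 holds at most about 1.05 million 16-byte slots versus about 262 thousand 64-byte slots, so the large-value table leaves each cache level at roughly a quarter of the element count.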

Building and modifying the table

Insertion inverts the lookup ranking for the perfect-hash tables. The flat tables (absl::flat_hash_map, ska::flat_hash_map, emhash::hash_map7, tsl::robin_map, ankerl::unordered_dense_map) are fastest to fill (about 4-6 ns per insert at small sizes and 25-35 ns at 10^7 with capacity reserved), while std::unordered_map is far slower (near 125 ns at 10^7) because it allocates a node per element. The perfect-hash tables are slowest by a wide margin: building a perfect hash costs fph::DynamicFphMap and fph::MetaFphMap roughly 1,450-1,900 ns per element at 10^7 on the Xeon, about 12-15 times std::unordered_map (and 11-12 times on the M1). Dropping the reserve call roughly doubles every table's insert time because growth then triggers repeated rehashing.
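The effect of the reserve call can be observed on std::unordered_map by counting how often the bucket array changes size during construction. This is a minimal sketch with a hypothetical helper name, `count_rehashes`; it counts rehashes rather than timing them, so it shows the cause of the slowdown, not its magnitude.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Inserts n sequential keys and counts how many times bucket_count() changes,
// i.e. how many times the table rehashed while growing.
std::size_t count_rehashes(std::size_t n, bool reserve_first) {
    std::unordered_map<std::uint64_t, std::uint64_t> map;
    if (reserve_first) map.reserve(n);
    std::size_t rehashes = 0;
    std::size_t buckets = map.bucket_count();
    for (std::uint64_t k = 0; k < n; ++k) {
        map.emplace(k, k);
        if (map.bucket_count() != buckets) {
            ++rehashes;
            buckets = map.bucket_count();
        }
    }
    return rehashes;
}
```

With `reserve_first` set, the count is zero because the table was sized for n elements up front; without it, the table rehashes repeatedly as it grows, and each rehash moves every element inserted so far.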

The erase-insert test, which alternates one erase and one insert to hold the size constant, favours the flat tables (ska::flat_hash_map, tsl::robin_map, absl::flat_hash_map); some open-addressing tables leave tombstones behind erased entries, and the perfect-hash tables fare worst, timing out at the largest sizes because they cannot cheaply absorb churn.

Key type and value size matter here too. With string keys, every insert and erase also pays the string hash, and for strings beyond the SSO limit a heap allocation and free for the characters, so the gap between 12-byte and 64-byte string workloads is large, while integer inserts are dominated purely by table mechanics. A larger value, as with lookups, pushes the memory-bound regime earlier.

Iteration

Iteration is decided almost entirely by storage layout, independent of the hash function and nearly independent of the key type (the iterator visits fixed-size table slots; it does not re-hash or, for inline-stored entries, dereference the keys). ankerl::unordered_dense_map and emhash::hash_map7 keep their entries in a dense, contiguous array and iterate in near-constant time per element regardless of load factor: about 0.22 ns on the Xeon and 2.0 ns on the M1, flat across the whole size range. The inline open-addressing tables must scan a sparse slot array and skip empty slots, so their cost rises with size; the node-based std::unordered_map is slowest, chasing cache-missing pointers (about 35 ns per element at 10^7 on the Xeon). The perfect-hash tables get no special benefit, since they also iterate a sparse layout.

Memory and load factor

Footprint, measured on integer key-value pairs, splits the tables into four groups at 10^7 elements. The one-metadata-byte SwissTables absl::flat_hash_map and ska::bytell_hash_map are most compact at about 272 MB; the node-based std::unordered_map and absl::node_hash_map come next (about 308 MB and 297 MB); the fph tables use roughly twice the most compact (about 556-572 MB) for the index that speeds their lookups; and ska::flat_hash_map and tsl::robin_map are largest at about 768 MB, because they keep a low maximum load factor and store the full, alignment-padded value in every slot.

The load-factor chart explains that. Open-addressing tables grow bydoubling, so occupancy sawtooths between roughly 0.4 and 0.9;ska::flat_hash_map and tsl::robin_map settleto the lowest values (about 0.30 at large sizes), buying speed withempty space, while std::unordered_map runs near 1.0 (a nodeper element, no empty slots). Raising max_load_factor packsthe flat tables tighter — ska::flat_hash_map rises fromabout 0.30 to 0.60 at 10^7, roughly halving its empty space — at thecost of longer probes. Because footprint decides when a table crosseseach cache boundary, this is the structural reason behind the cachetiers in the lookup results: a bulkier table falls out of cache at asmaller element count.

The hash function matters as much as the table

For integer keys std::hash is the identity function (in both libstdc++ and libc++). A table that does its own mixing, most notably ska::flat_hash_map, can use it safely and enjoy its zero cost. But keys that are not uniformly random (those whose information lives in only a handful of bit positions, like pointers or small sequential IDs) make tables that assume a uniform hash (absl::flat_hash_map, emhash::hash_map7, tsl::robin_map) collide catastrophically with the identity hash, badly enough that some combinations time out during construction, so they need a good mixing hash like absl::Hash. Even a good hash can have weak spots: robin_hood::hash spreads some such key patterns poorly, producing an irregular mid-range slump for tsl::robin_map with that hash. For string keys, xxHash_xxh3 wins across the board.
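A small experiment shows how badly the identity hash can interact with a power-of-two mask. The sketch below is an illustration only: `distinct_slots` is a hypothetical helper, and the splitmix64 finalizer stands in for "a good mixing hash" (it is not absl::Hash). It counts how many distinct slots a set of evenly strided keys, such as page-aligned pointers, would occupy.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Identity hash, the behaviour described above for std::hash on 64-bit integers.
inline std::uint64_t identity_hash(std::uint64_t x) { return x; }

// splitmix64 finalizer, used here only as a stand-in for a good mixing hash.
inline std::uint64_t mix_hash(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Number of distinct slots hit by `count` keys spaced `stride` apart when the
// slot index is just the low bits of the hash (slot_count is a power of two).
template <typename Hash>
std::size_t distinct_slots(Hash hash, std::uint64_t stride,
                           std::size_t count, std::size_t slot_count) {
    std::unordered_set<std::size_t> used;
    for (std::size_t i = 0; i < count; ++i) {
        used.insert(static_cast<std::size_t>(hash(i * stride) & (slot_count - 1)));
    }
    return used.size();
}
```

With a stride of 4096 and 1024 slots, every key has zero low bits, so `distinct_slots(identity_hash, 4096, 1000, 1024)` is 1: all 1000 keys land in the same slot. The mixed hash spreads the same keys over most of the table.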

Platform differences: Rocket Lake vs M1 Max

The two machines have very different memory systems. The Xeon E-2388G (Rocket Lake) has a 48 KB L1, 512 KB L2 and 16 MB L3 with 4 KB pages; the M1 Max has a 128 KB L1, 12 MB L2 and 48 MB system-level cache with 16 KB pages. The larger caches and pages on the M1 push every cache-tier boundary to the right and reduce TLB pressure, so although the ranking of tables is broadly consistent across platforms, the sizes at which one overtakes another differ. One quirk: the libc++ std::unordered_map on the M1 indexes buckets with a modulo that degenerates to a power-of-two mask when the bucket count is a power of two, discarding the high bits of the hash, which makes that combination anomalously slow at element counts that are exact powers of two.
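The bucket-index rule described in this quirk can be sketched as below. The function name `constrain_hash` is an assumption made for this illustration, modelled on the behaviour described above; the real libc++ code may differ in detail.

```cpp
#include <cstddef>

// Bucket index for hash h: a mask when bucket_count is a power of two,
// a true modulo otherwise. bucket_count must be non-zero.
inline std::size_t constrain_hash(std::size_t h, std::size_t bucket_count) {
    const bool is_pow2 = (bucket_count & (bucket_count - 1)) == 0;
    return is_pow2 ? (h & (bucket_count - 1)) : (h % bucket_count);
}
```

With 1024 buckets, the hashes 0x10005 and 0x20005 both map to bucket 5 because only the low 10 bits survive; with a prime count such as 1021 they map to different buckets (197 and 389).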

Choosing a hash table

Two practical questions decide most of the choice: how often is the table written versus read, and how big is it relative to the cache it actually gets.

On the first question: if the table is built once and then queried far more than it is modified, the perfect-hash fph::DynamicFphMap (or fph::MetaFphMap when misses are common) gives the best lookup performance, at the cost of a slow build, an inability to absorb frequent updates, and roughly twice the memory of the most compact tables. It therefore suits a static dictionary, a read-mostly index or a membership set far better than a constantly changing table. If inserts, erases and lookups are mixed, a flat table is the better all-rounder, and absl::flat_hash_map with absl::Hash (or xxHash_xxh3 for strings) is a strong, compact, distribution-robust default.

The second question is where "fast in cache" needs unpacking. A table is "in cache" when its whole footprint, not the element count alone, fits in a cache level: on the Xeon, roughly 16,000 small (16-byte) entries fit in the L2 and about a million in the L3, and proportionally fewer when the value is large or the keys are heap-allocated strings. ska::flat_hash_map, the fastest table while everything stays resident, is therefore the right pick mainly for genuinely small tables, or when the memory can be spared. The catch is that ska::flat_hash_map also has the largest footprint of all (its low load factor is exactly what makes it fast), so for the same number of elements it leaves cache sooner than a compact table, and it competes harder for whatever cache is left to the rest of the program. In a real application the cache is shared with everything else running, so the effective threshold is well below the raw cache size; if memory is tight, the cache is contended, or the table is large, the compact absl::flat_hash_map or ska::bytell_hash_map will usually beat it despite ska::flat_hash_map's edge in a benchmark that owns the whole cache. For iteration-heavy work, ankerl::unordered_dense_map is the clear choice regardless of size.

Your situation	Consider
build once, then look up a lot (read-mostly / static)	`fph::DynamicFphMap` (hit-heavy) or `fph::MetaFphMap` (miss-heavy)
general-purpose, mixed insert / erase / lookup	`absl::flat_hash_map` (with `absl::Hash`, or `xxHash_xxh3` for strings): fast, compact, robust
small table, or plenty of spare and uncontended cache, want lowest lookup time	`ska::flat_hash_map`, but it is the largest table, so prefer a compact one once it grows
memory tight, cache shared/contended, or table large	`absl::flat_hash_map` or `ska::bytell_hash_map` (most compact)
dominated by iteration / frequent full scans	`ankerl::unordered_dense_map`
need reference / pointer stability	`absl::node_hash_map` or `std::unordered_map`

Final advice

std::unordered_map is the slowest table in almost every throughput test because of its node-per-element layout, so it is worth keeping mainly when its iterator and pointer-stability guarantees are actually required. Beyond that, the table above is a starting point, not a verdict: the full space of tables, hashes, key distributions, value sizes, table sizes and platforms is enormous, and a real workload may land between the cases measured here. The surest answer is always to benchmark the two or three most promising candidates on the actual data and operation mix.


Hash Table Benchmark - Memory Usage and Load Factor

2026-06-13T13:55:00.000Z

This page discusses the memory usage and load factor of hash tables.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

During the previous tests, we recorded the heap memory size and load factor. We count the heap memory size by implementing a custom Allocator, which counts the allocated bytes during the allocate() function call. However, the accuracy of the counted number depends on the hash table container correctly using Allocator; that is, all memory allocations must go through the allocator. Some hash tables, such as emhash::hash_map7, will not get accurate memory data because Allocator is not fully used for all heap memory allocations.
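A counting allocator of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the names `CountingAllocator`, `g_live_bytes`, `CountedMap` and `live_bytes_after_inserting` are hypothetical, the counter is a single global (C++17 inline variable), and the sketch subtracts on deallocate() so that it reports live bytes, which the original text does not specify.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>

// Bytes currently allocated through CountingAllocator (all instantiations).
inline std::size_t g_live_bytes = 0;

template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        g_live_bytes += n * sizeof(T);             // count before delegating
        return std::allocator<T>{}.allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        g_live_bytes -= n * sizeof(T);
        std::allocator<T>{}.deallocate(p, n);
    }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedMap = std::unordered_map<
    std::uint64_t, std::uint64_t,
    std::hash<std::uint64_t>, std::equal_to<std::uint64_t>,
    CountingAllocator<std::pair<const std::uint64_t, std::uint64_t>>>;

// Heap bytes held by a map of n integer pairs, read while the map is alive.
std::size_t live_bytes_after_inserting(std::size_t n) {
    CountedMap map;
    for (std::uint64_t i = 0; i < n; ++i) map.emplace(i, i);
    return g_live_bytes;
}
```

Only memory requested through the allocator is visible to the counter, which is exactly why a container that allocates some of its storage by other means reports an inaccurate total.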

One of the troublesome aspects of C++ design is that classes that use different Allocator template parameters belong to different types (which is what the std::pmr containers are trying to solve). For example, std::basic_string using a std::allocator and std::basic_string using a custom allocator are two types, and hash functions like std::hash are usually only compatible with std::string using std::allocator. Therefore, it is not possible to use these hash functions directly for strings that use other allocators, and our method of counting heap memory size cannot count the heap memory of strings. As a result, we only count cases where both keys and values are integer types.

It needs to be clarified that if the user doesn't care about the total heap memory size but cares about the warm memory size related to the query speed, then the data in this test cannot accurately reflect the cache size that the hash table needs to utilize. Some hash tables require some extra space for non-query work, such as insertion, and this part of the memory space is not accessed during lookup operations.

The set of element counts used in this test is different from other test items. To reflect the ability to cope with different load factors, the number of elements is chosen as 0.4 times or 0.6 times a power of 2, e.g. 0.4 x 2^10, 0.6 x 2^10, 0.4 x 2^11, 0.6 x 2^11, ...

The heap memory a table occupies and the load factor it runs at directly shape its lookup speed, because together they decide how much memory a lookup must stream through and therefore how soon the working set spills out of each cache level. A lower load factor wastes more slots but shortens probe chains; a smaller per-element footprint fits more elements into the same cache. The cache-tier boundaries seen in the lookup throughput test are exactly the points where a table's footprint crosses the L1, L2 and L3 capacities, so the two charts below are the structural explanation behind those tiers.

Because these figures mostly describe the data structures themselves rather than the processor, the heap-memory numbers are mostly CPU-independent. Exact totals can still differ across STL implementations and allocators. The memory-size charts below therefore show the measured Intel and M1 Max values separately; treat them as structure-driven but implementation-sensitive.

Heap Memory Size

Memory size when using default max_load_factor

The footprint splits the tables into four groups. The SwissTable-style absl::flat_hash_map and ska::bytell_hash_map, which spend only one metadata byte per slot, are the most compact: about 272 MB for 10^7 integer pairs. The node-based std::unordered_map and absl::node_hash_map come next (about 308 MB and 297 MB), paying for a per-node allocation and its pointer. The perfect-hash fph::DynamicFphMap and fph::MetaFphMap use roughly twice the memory of the most compact tables (about 556-572 MB), the cost of the extra index and metadata that make their lookups fast. Largest of all are ska::flat_hash_map and tsl::robin_map at about 768 MB, because they keep a low maximum load factor and store the full, alignment-padded value type in every slot. emhash::hash_map7 is shown as zero because it does not route all of its allocations through the custom counting allocator, so its memory cannot be measured here (as noted above).

Memory size when using large max_load_factor

Load factor

Load factor when using default max_load_factor

The load-factor chart explains part of that footprint. Most open-addressing tables grow by doubling, so their load factor sawtooths between roughly 0.4 right after a growth and 0.75-0.9 just before the next one; the test samples sizes at 0.4x and 0.6x powers of two, which is why the visible values cluster around 0.5-0.76. ska::flat_hash_map and tsl::robin_map with robin_hood::hash settle to the lowest load factors (about 0.29-0.30 at large sizes), trading memory for shorter probe chains, the same low occupancy that helps their lookup speed. At the other extreme, std::unordered_map runs at a load factor close to 1.0, because a node is allocated per element and there are no empty slots to leave slack; its memory cost lives in the nodes and pointers rather than in spare slots. Raising max_load_factor packs the flat tables more tightly (at 10^7 elements ska::flat_hash_map's load factor rises from about 0.30 to about 0.60, roughly halving its empty space) at the price of longer probe sequences and slower lookups, the tradeoff explored in the lookup tests.
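The jump from about 0.30 to about 0.60 follows from simple arithmetic once one assumes a power-of-two slot count. The sketch below makes that assumption explicit: `pow2_capacity` and `load_factor` are hypothetical helpers, and the limits 0.5 and 0.9 used in the usage note are illustrative values, not settings confirmed by the benchmark.

```cpp
#include <cstddef>

// Smallest power-of-two slot count that keeps n elements at or below
// max_load_factor (a simplifying assumption, not any particular library's
// growth policy). max_load_factor must be positive.
std::size_t pow2_capacity(std::size_t n, double max_load_factor) {
    std::size_t capacity = 1;
    while (static_cast<double>(n) > max_load_factor * static_cast<double>(capacity)) {
        capacity *= 2;
    }
    return capacity;
}

// Resulting load factor for n elements under that policy.
double load_factor(std::size_t n, double max_load_factor) {
    return static_cast<double>(n) /
           static_cast<double>(pow2_capacity(n, max_load_factor));
}
```

For n = 10^7, a limit of 0.5 forces 2^25 slots and a load factor of about 0.298, while a limit of 0.9 allows 2^24 slots and about 0.596, which matches the two values quoted above.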

Load factor when using large max_load_factor


Hash Table Benchmark - Integer Lookup Latency

2026-06-13T13:55:00.000Z

The integer key lookup latency test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the lookup latency of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

Latency tells a different story from throughput. The throughput test keeps many independent lookups in flight, so the CPU's out-of-order engine overlaps their memory accesses. The P99 latency below is instead the 99th-percentile time of a single lookup, so it captures the worst accesses that cannot be hidden: a lookup that misses several cache lines, walks a long probe sequence, or mispredicts a branch. Latency is measured on the x86-64 platform (Xeon E-2388G) only.
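For readers unfamiliar with percentile latencies, the following sketch shows one common way to reduce a set of per-lookup timings to a P50 or P99 value. It uses the nearest-rank definition, which is an assumption: the text does not say which percentile definition the benchmark uses, and `percentile` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least `percent`
// percent of all samples are less than or equal to it (percent in 0..100).
// The vector is taken by value because nth_element reorders it.
double percentile(std::vector<double> samples, unsigned percent) {
    if (samples.empty()) return 0.0;
    std::size_t rank = (samples.size() * percent + 99) / 100;  // ceil(n * p / 100)
    if (rank == 0) rank = 1;
    auto nth = samples.begin() + static_cast<std::ptrdiff_t>(rank - 1);
    std::nth_element(samples.begin(), nth, samples.end());
    return *nth;
}
```

Given 100 samples with the values 1 to 100 ns, `percentile(samples, 99)` returns 99 and `percentile(samples, 50)` returns 50. A handful of slow lookups is enough to move the P99 while leaving the median untouched.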

Two regimes dominate every chart. While the table still fits in cache, the tail is set by how many memory accesses the worst-case lookup performs: tables that bound their probe count (the perfect-hash fph::* maps) or reject a key quickly through metadata keep a tight tail, while tables that may walk a long probe chain under collisions show a heavier one. Once the table spills out of the L3 cache, almost every 99th-percentile lookup incurs at least one DRAM access, so the tail flattens onto a memory-latency floor of roughly 420-460 ns on this Rocket Lake platform and the choice of table matters far less for the tail than it does for throughput. In each group the second chart sets a large max_load_factor, which packs the tables more tightly and slightly lengthens the probe chains, but leaves the qualitative ordering unchanged.

Lookup keys in the table (hit)

Use default max_load_factor

For hit lookups the ordering shifts with size. At small sizes the simple open-addressing tables have the tightest tail: ska::flat_hash_map with std::hash reaches a P99 of about 7.8 ns at 1,024 elements. In the L2/L3 regime the perfect-hash tables take over: at 32,768 elements fph::DynamicFphMap and fph::MetaFphMap with std::hash have the best P99 (about 22 ns), because they guarantee a small, bounded number of probes even in the worst case. Beyond about one million elements every table converges to the ~420-460 ns DRAM floor; the node-based std::unordered_map is the clear outlier, since a tail lookup chases two cache-missing pointers and reaches 655-765 ns.

Use large max_load_factor


Lookup keys not in the table (miss)

Use default max_load_factor

The miss case is where the metadata design shines most. fph::MetaFphMap can confirm that a key is absent by reading a single metadata cache line, without walking any probe sequence. As long as that metadata array fits in cache this gives it a dramatically tighter tail than any other table: at 1,200,000 elements its P99 miss latency is about 34 ns, while the other tables range from roughly 106 ns (robin_hood::unordered_flat_map, absl::flat_hash_map) up to 440-460 ns (ska::flat_hash_map, tsl::robin_map), a more-than-tenfold gap at the tail. The advantage is largest in the L3 regime and disappears at 10,000,000 elements, where the metadata array itself no longer fits in the L3 cache, so even a single metadata read becomes a DRAM miss and fph::MetaFphMap falls back to the common ~450 ns floor. This is exactly the workload fph::MetaFphMap was designed for, and the main reason to prefer it over fph::DynamicFphMap when unsuccessful lookups are common.

Use large max_load_factor


Lookup keys with a 50% probability in the table

Use default max_load_factor

The 50% case mixes hits and misses, and the alternating outcome adds a branch-misprediction penalty that narrows the gaps between tables. The perfect-hash maps still hold a mild tail advantage in the in-cache regime (fph::DynamicFphMap is around 27 ns at 32,768 elements), but once the working set leaves the cache the memory-latency floor dominates again and the tables become hard to separate, with only the node-based tables clearly behind (absl::node_hash_map at about 520-620 ns and std::unordered_map at about 675-785 ns at the largest sizes).

Use large max_load_factor



Hash Table Benchmark - Integer Lookup Throughput

2026-06-13T13:55:00.000Z

The integer key lookup throughput test.

Click the labels on the legend to view or hide data lines corresponding to specific hash tables and hash functions.

In this test, we measure the lookup performance of hash tables in three kinds of situations:

Look up the keys in the hash table (hit or successful find).
Look up the keys not in the hash table (miss or unsuccessful find).
Look up keys with a 50% probability of being in the hash table.

Exploring the Connection between Lookup Performance and Memory Hierarchy

Before looking at the details of lookup speed, it is important to note the strong correlation between hash table lookup speed and the cache hit rate. With integer key-value pairs, the hash table lookup operation mainly involves loading content from memory, which consumes most of the operation time.

Modern computer systems use a hierarchical memory design. For instance, Intel Rocket Lake has registers, L1 cache, L2 cache, L3 cache, and DRAM, with speed decreasing in that order. The M1 Max has L1 cache, L2 cache, a system-level cache (SLC), and DRAM.

As the cache miss rate at a given level rises, the overall lookup time becomes limited by the memory hardware speed of the next, slower level. Besides cache misses, TLB misses also contribute to the time penalty. The chart below helps illustrate this concept. It shows the P50 (median) hit-lookup latency on the Xeon E-2388G. The structure is easiest to read by isolating a single open-addressing table such as ska::flat_hash_map with std::hash (click the other legend entries to hide them). For such a table the median lookup succeeds after a single key comparison, so the P50 value mostly reflects the latency of one memory load and tracks the memory hierarchy closely.

The figure above shows that on the Intel Rocket Lake architecture, lookup performance separates into four tiers, each determined by the number of elements:

Elements that fully fit into the L1 cache. Given a sizeof(value_type) of 16 bytes and an L1 data cache of 48 KB, this can hold approximately 3,072 elements. Since the default load factor of ska::flat_hash_map is usually less than 0.5, this range in the figure lies between 32 and 1,024.
Elements that fully fit into the L2 cache. Given a sizeof(value_type) of 16 bytes and an L2 cache of 512 KB, this can hold approximately 32,768 elements. For ska::flat_hash_map in this test, the range is 1,500 to 8,192.
Elements that fully fit into the L3 cache. Given a sizeof(value_type) of 16 bytes and an L3 cache of 16 MB, this can hold approximately 1,048,576 elements. For ska::flat_hash_map in this test, the range is 12,000 to 400,000.
The case where reading from RAM becomes inevitable. This typically occurs when the L3 cache can't hold all the elements, generally when there are more than 1,048,576 elements.

In addition to cache misses, TLB misses also have a significant influence on hash table lookup speed. Rocket Lake has 64 L1 DTLB entries for 4 KB pages (32 for 2 MB pages) and 1536 STLB entries. This means that with a 4 KB page size, considerable L1 TLB misses will occur when the number of elements exceeds 16,384, and L2 TLB misses when the number exceeds 393,216. Using huge pages can greatly reduce TLB misses. On macOS with M1 Max, a 16 KB page size is used by default, resulting in much smaller penalties due to TLB misses compared to the default 4 KB pages on the Rocket Lake platform. In the P50 curve this TLB-miss penalty adds an extra rise once the working set pushes the page table beyond the reach of the L2 TLB, on top of the cache-tier steps described above.
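The element counts quoted for the cache tiers and the TLB limits all come from the same division. The sketch below reproduces them at compile time; the helper names are hypothetical, and like the figures in the text the calculation ignores empty slots and any per-slot metadata.

```cpp
#include <cstddef>

// How many elements of element_bytes each fit into capacity_bytes.
constexpr std::size_t elements_that_fit(std::size_t capacity_bytes,
                                        std::size_t element_bytes) {
    return capacity_bytes / element_bytes;
}

// Memory covered by a TLB with `entries` entries at the given page size.
constexpr std::size_t tlb_reach_bytes(std::size_t entries, std::size_t page_bytes) {
    return entries * page_bytes;
}

constexpr std::size_t kElementBytes = 16;   // sizeof(value_type) in this test
constexpr std::size_t kKiB = 1024;
constexpr std::size_t kMiB = 1024 * kKiB;

static_assert(elements_that_fit(48 * kKiB, kElementBytes) == 3072, "L1 data cache");
static_assert(elements_that_fit(512 * kKiB, kElementBytes) == 32768, "L2 cache");
static_assert(elements_that_fit(16 * kMiB, kElementBytes) == 1048576, "L3 cache");
static_assert(elements_that_fit(tlb_reach_bytes(64, 4 * kKiB), kElementBytes) == 16384,
              "L1 DTLB reach with 4 KB pages");
static_assert(elements_that_fit(tlb_reach_bytes(1536, 4 * kKiB), kElementBytes) == 393216,
              "STLB reach with 4 KB pages");
```

Because a table at a load factor near 0.5 occupies about twice as many slots as it has elements, the observed tier boundaries sit below these upper bounds.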

The M1 Max can utilize a 128 KB L1 data cache, a 12 MB L2 cache, and a 48 MB SLC per thread, resulting in a higher cache hit rate and therefore superior performance for most data sizes in the hash table test.

Lookup keys in the table (hit)

Use default max_load_factor


Broadly speaking, the lookup time for a hash table can be split into four parts:

Time required for the hash function to compute the hash value.
Time needed to map the hash value to a specific memory address.
Time taken to load content from the given memory address.
Time spent comparing the loaded content with the target key value, with additional penalty time if the comparison is unsuccessful.

When the element count is low, the CPU cache can often hold all elements. Here, the third step is relatively fast, so the other three components become important. The first and second steps can trade work with each other. If the hash function (in the first step) does not distribute elements well across the hash space, the second step needs extra computation to distribute these non-uniform hash values into the underlying slot array. If the first step uses a high-quality hash function that uniformly distributes hash values, the second step typically only needs a quick bitwise AND instruction for truncation (if the number of slots is a power of 2).
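The four parts can be seen in a minimal linear-probing lookup. The sketch below is a generic illustration requiring C++17, not the code of any benchmarked table; `Slot`, `probe_insert` and `probe_find` are hypothetical names, the slot count must be a power of two, and the table must keep at least one empty slot so that the probe loop terminates.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

struct Slot {
    bool occupied = false;
    std::uint64_t key = 0;
    std::uint64_t value = 0;
};

// Inserts or overwrites key in a power-of-two sized slot array.
void probe_insert(std::vector<Slot>& slots, std::uint64_t key, std::uint64_t value) {
    const std::size_t mask = slots.size() - 1;
    std::size_t index = std::hash<std::uint64_t>{}(key) & mask;
    while (slots[index].occupied && slots[index].key != key) index = (index + 1) & mask;
    slots[index] = Slot{true, key, value};
}

std::optional<std::uint64_t> probe_find(const std::vector<Slot>& slots, std::uint64_t key) {
    const std::size_t mask = slots.size() - 1;
    const std::size_t h = std::hash<std::uint64_t>{}(key);  // part 1: compute the hash
    std::size_t index = h & mask;                           // part 2: map hash to a slot
    for (;;) {
        const Slot& slot = slots[index];                    // part 3: load from memory
        if (!slot.occupied) return std::nullopt;            // empty slot: key is absent
        if (slot.key == key) return slot.value;             // part 4: compare keys
        index = (index + 1) & mask;                         // failed compare: probe on
    }
}
```

Every failed comparison in part 4 sends the loop back to part 3, which is where a poor slot distribution turns into extra memory loads.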

This can be seen from the charts above. Both libc++ and libstdc++'s implementations of std::hash use the identity hash for uint64_t, making the first step extremely fast. If a bitwise AND instruction is directly used to get the slot-array index in the second step, there will be many collisions, leading to redundant comparisons in the fourth step. Hash tables that use a simple method in the second step, such as absl::flat_hash_map, absl::node_hash_map, emhash::hash_map7 and tsl::robin_map, require high-quality hash functions, so std::hash is not enough. robin_hood::unordered_flat_map goes slightly further in the second step but doesn't achieve optimal performance with std::hash. libc++'s implementation of std::unordered_map also shows poor performance when using std::hash and the number of elements is a power of 2. robin_hood::hash, though providing insufficient hash quality for most hash tables demanding good hash functions, is good enough for robin_hood::unordered_flat_map, as illustrated in the range 100,000 to 3,100,000 data points.

In contrast, hash tables that do extra work in the second step can still achieve good performance even with the simplest identity hash in the first step, as seen in ska::flat_hash_map, ska::bytell_hash_map, fph::DynamicFphMap, fph::MetaFphMap and libstdc++'s std::unordered_map. I recommend checking out the innovative trick ska::flat_hash_map uses in the second step, described in "Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo)". This method uses a 64-bit multiplication and a right-shift instruction. Since there are no arithmetic instructions involved in the hash function (for identity hash), these are pretty much all the arithmetic instructions the CPU's ALU has to handle.
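The multiply-and-shift step can be written in one line. The sketch below shows the general Fibonacci hashing technique; the function name is hypothetical and this is not a copy of ska::flat_hash_map's source. The multiplier is 2^64 divided by the golden ratio, rounded to an odd integer.

```cpp
#include <cstddef>
#include <cstdint>

// Maps a 64-bit hash to a slot index in a table of 2^log2_slots slots
// (1 <= log2_slots <= 63) by multiplying with 2^64 / golden ratio and
// keeping the top log2_slots bits of the product.
inline std::size_t fibonacci_index(std::uint64_t hash, unsigned log2_slots) {
    return static_cast<std::size_t>((hash * 11400714819323198485ULL) >> (64 - log2_slots));
}
```

With 8 slots, the sequential keys 1, 2 and 3 map to slots 4, 1 and 6, and the page-aligned keys 4096 and 8192 land in different slots of a 1024-slot table, whereas a plain mask would send both to slot 0. Because the index comes from the top bits of the product, every input bit influences the result.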

On the other hand, for a hash table that requires a "good" hash function and uses a bitwise AND instruction in the second step, I know of no hash function that costs no more than these two instructions and still achieves satisfactory hash quality for most data distributions.

For this reason, the combination of ska::flat_hash_map and std::hash currently ranks as the fastest when all the data fits into the L2 cache on Intel Rocket Lake (or L1 cache on M1 Max). Almost any other combination requires more instructions in the first and second steps combined. Within this data range, tsl::robin_map with absl::Hash is the second fastest on Rocket Lake, with a very minor difference. On M1 Max, the fph::DynamicFphMap with std::hash combination is almost just as fast, differing by less than 0.1 nanoseconds per operation.

However, when the L2 cache can't hold all elements but the L3 cache can, ska::flat_hash_map no longer holds the top spot. On Intel Rocket Lake, the combination of fph::DynamicFphMap and std::hash proves to be the fastest within the range of 8,192 to 3,100,000 elements (though absl::flat_hash_map with absl::Hash is slightly faster at 400,000 elements). On M1 Max, this combination also excels when the element count is within 8,192 to 1,200,000. fph::MetaFphMap is the runner-up in this range. Considering that both ska::flat_hash_map and tsl::robin_map use an aggressive expansion strategy leading to a low load factor and earlier cache misses, and that fph::DynamicFphMap uses perfect hashing (i.e., no collisions), these results make sense.

When the data can't fit into the L3 cache, ska::flat_hash_map with std::hash becomes the fastest combination once more. The tsl::robin_map and absl::Hash combination follows closely, with the difference of one or two instructions becoming almost negligible compared to the cost of cache misses. These two are still the fastest even with a massive number of elements, partially because many other hash tables use auxiliary arrays to store additional information. For example, absl::flat_hash_map employs a metadata array which, when dealing with large data, can have a high and costly cache miss rate. In contrast, ska::flat_hash_map and tsl::robin_map simply use one slot array to store key-value data, allowing a single cache-line load from memory to suffice.

Analyzing this chain of observations, we find that high-performance hash tables tend to have few failed comparisons in the fourth lookup step (either ska::flat_hash_map limits failures, or fph::DynamicFphMap has no failures as a perfect hash table).

Under this premise, when the data is small enough to be fully loaded into the cache, the cost of loading from memory in the third step becomes minimal. During this stage, having few and simple instructions in the first and second steps is crucial for speed. A hash table and hash function combination capable of this (ska::flat_hash_map and std::hash) is the fastest.

When the amount of data slightly increases, and collisions become more frequent, hash tables that avoid collisions can gain an advantage, such as fph::DynamicFphMap. absl::flat_hash_map, which uses SIMD to resolve collisions to a certain degree, can also hold an advantage with certain data sizes.

When the amount of data continues to increase to the point where any memory access results in a cache miss, the hash table with the fewest memory accesses becomes the fastest. This requires fewer hash collisions, and also highlights that any metadata will increase the number of memory fetches at this stage. ska::flat_hash_map and tsl::robin_map are the fastest choices at this point.

In conclusion, a high-performance hash table and hash function combination will depend on several factors. When data fits entirely within cache, ska::flat_hash_map with std::hash performs best, due to the minimal instructions required in the first two steps. As data size increases, hash tables that can effectively avoid collisions or utilize SIMD to resolve them gain the advantage, such as fph::DynamicFphMap and absl::flat_hash_map. Lastly, when data size surpasses cache capacity, hash tables that minimize memory accesses, like ska::flat_hash_map and tsl::robin_map, come out on top.

When the size of value_type expands to 64 bytes, the cache can accommodate fewer elements. Additionally, a cache line can only hold a single element, making hash table collisions more costly. However, modern CPUs' powerful prefetching capabilities typically keep these costs low for hash tables using linear or quadratic probing.

Overall, the relative performance of the hash tables remains mostly consistent with the 16-byte value_type case. The combination of ska::flat_hash_map and std::hash is fastest when data volume is low or high, while the combination of fph::DynamicFphMap and std::hash excels with medium-sized data.

Use large max_load_factor


Prior tests used default load factors. However, different hash tables often use different load factors, leading to performance differences due to distinct memory size requirements.

Hash tables respond differently to increased maximum load factors. While larger load factors reduce memory usage and potential cache miss rates, improving performance and minimizing footprint, they can also increase hash table collisions, slowing lookup speed.

Thus, if hash table performance is sensitive to load factor and collision probability, larger load factors can degrade performance, as seen with ska::flat_hash_map and tsl::robin_map. However, for tables less affected by load factor, such as fph::DynamicFphMap and absl::flat_hash_map, performance remains stable or even improves with larger load factors.

The comparison between larger and default max_load_factor confirms these observations.

ska::flat_hash_map and std::hash remain the fastest combination with few elements. As element count rises, performance of ska::flat_hash_map and tsl::robin_map falls while fph::DynamicFphMap and fph::MetaFphMap performance generally increases. Other hash tables exhibit little change. Thus, for larger data sizes, fph::DynamicFphMap and std::hash offer the fastest lookup, followed by combinations of fph::MetaFphMap and std::hash, and absl::flat_hash_map with absl::Hash.

When the value_type size is 64 bytes, increasing max_load_factor gives results similar to those for the 16-byte case. ska::flat_hash_map with std::hash is fastest for fewer elements, while fph::DynamicFphMap with std::hash takes the lead with more elements.

Lookup keys not in the table (miss)

Use default max_load_factor

Finding nonexistent elements in a hash table differs from locating existing ones. Full comparisons confirm whether the target key equals a stored key, but different hash values are enough to prove that the keys are different. Therefore, hash tables that store hash values or partial hashes can speed up this task. In particular, when partial hash values (e.g., 1 byte) are used as metadata, they use less cache space than complete keys.
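The idea of rejecting a key from one metadata byte can be sketched as follows. `TaggedSet` is a hypothetical, deliberately simple linear-probing set written for this illustration; it is not how absl::flat_hash_map or fph::MetaFphMap are implemented (real tables pack the tags and scan them in groups, often with SIMD), and the splitmix64 finalizer is again only a stand-in hash. The set must keep at least one empty slot.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear-probing set with a separate one-byte tag per slot (0 means empty).
struct TaggedSet {
    std::vector<std::uint8_t> tags;
    std::vector<std::uint64_t> keys;
    std::size_t key_compares = 0;   // full key comparisons done by contains()

    explicit TaggedSet(std::size_t pow2_slots)
        : tags(pow2_slots, 0), keys(pow2_slots, 0) {}

    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }
    // Top 7 bits of the hash, shifted into 1..128 so that 0 can mean "empty".
    static std::uint8_t tag_of(std::uint64_t h) {
        return static_cast<std::uint8_t>((h >> 57) + 1);
    }

    // Assumes key is not present yet and a free slot exists.
    void insert(std::uint64_t key) {
        const std::uint64_t h = mix(key);
        std::size_t i = static_cast<std::size_t>(h) & (tags.size() - 1);
        while (tags[i] != 0) i = (i + 1) & (tags.size() - 1);
        tags[i] = tag_of(h);
        keys[i] = key;
    }

    bool contains(std::uint64_t key) {
        const std::uint64_t h = mix(key);
        const std::uint8_t tag = tag_of(h);
        std::size_t i = static_cast<std::size_t>(h) & (tags.size() - 1);
        while (tags[i] != 0) {
            if (tags[i] == tag) {          // cheap one-byte filter
                ++key_compares;            // only now touch the key array
                if (keys[i] == key) return true;
            }
            i = (i + 1) & (tags.size() - 1);
        }
        return false;
    }
};
```

For a key that is absent, the key array is read only when a stored tag happens to equal the probe tag, so most misses are resolved from the compact tag array alone.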

Although hash tables like absl::flat_hash_map, robin_hood::unordered_flat_map, and fph::MetaFphMap use partial hashes as metadata, they are not always fastest for small tables because the additional instruction cost can outweigh the cache savings. However, the advantage of this approach appears once the L1 cache is insufficient.

On Intel Rocket Lake, fph::MetaFphMap with std::hash is fastest when the element count exceeds 6,000, and stays ahead almost the whole way up; only at the very largest counts (around 10 million) does absl::flat_hash_map with absl::Hash edge ahead. On M1 Max, ska::flat_hash_map is fastest for under 3,000 elements, closely followed by fph::DynamicFphMap. For 6,000 to 150,000 elements, fph::DynamicFphMap leads, but fph::MetaFphMap prevails above 200,000 elements. The M1 Max's larger cache capacity likely influences this variation, with metadata-based methods gaining advantage at higher element counts.

When using a 64-byte value_type, the overall situation is similar to the 16-byte case. On M1 Max, fph::MetaFphMap starts to dominate from 45,000 elements. This change is caused by the fact that the number of elements the cache can hold becomes smaller.

Use large max_load_factor

When a larger maximum load factor is set, the overall ranking is consistent with the default maximum load factor. The range where ska::flat_hash_map is ahead is reduced, because its performance is sensitive to load factor. On Intel Rocket Lake, ska::flat_hash_map with std::hash is the fastest when the number of elements is not greater than 1024. With more elements, fph::MetaFphMap with std::hash is faster. On M1 Max, the range where ska::flat_hash_map is ahead is also smaller.

As above, the overall relative speed relationship is similar to that when using the default load factor.

Lookup keys with a 50% probability in the table

Use default max_load_factor

When the lookup key has a 50% probability of being in a hash table, the throughput decreases due to the increased lookup time caused by the branch misprediction penalty. The lookup time gap between hash tables narrows. By default we show only the fastest hash function for each hash table, to make the graph a bit clearer and easier to read.

From the above graphs, we can see that the performance of the hash tables can be considered very close, except for std::unordered_map.

When the number of elements is relatively small, ska::flat_hash_map with std::hash is still the fastest at most data points.

On Intel Rocket Lake, when the number of elements is between 25,000 and 2,200,000, fph::MetaFphMap and fph::DynamicFphMap are the fastest when paired with std::hash. absl::flat_hash_map and ska::flat_hash_map are very close in performance on some data points. When the number of elements is in the range of 2,200,000 to 6,000,000, absl::flat_hash_map with absl::Hash is the fastest pair. When the number of elements is not less than 6,000,000, ska::flat_hash_map with std::hash returns to first place. The gap is very small, and the leading table changes with the element count.

On the M1 Max, fph::MetaFphMap and absl::flat_hash_map are the fastest hash tables when the number of elements is greater than 32,768. The performance of absl::flat_hash_map fluctuates with the number of elements while fph::MetaFphMap is relatively stable. When the number of elements is extremely large, ska::flat_hash_map returns to first place.

Use large max_load_factor

:

Hit Appendix

Use default max_load_factor

:

Use large max_load_factor

:

Miss Appendix

Use default max_load_factor

:

Use large max_load_factor

:

50% Hit Appendix

Use default max_load_factor

:

Use large max_load_factor

:


Hash Table Benchmark - String Erase and Insert

2026-06-13T13:55:00.000Z

The string key erase and insert test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we first construct a hash table with size N, and then repeat the following procedure M times:

Insert a new element into the hash table
Randomly erase an element from the hash table.

Because the table size stays at N (or N+1) throughout, this measures the steady-state cost of churning a fully-grown table rather than the cost of growing it. For string keys every insert and every erase still has to hash the whole key and compare strings on collision, and for keys above the SSO threshold (the 24- and 64-byte variants) every insert also allocates the string's characters on the heap and every erase frees them, so the 64-byte case is dominated by malloc/free. The 12-byte keys are stored inline (SSO) and skip the allocator entirely, which is why they are several times faster than the long-string cases.

Two structural effects matter here. First, many open-addressing tables use tombstones: an erase marks the slot as deleted rather than empty so the probe chains stay intact, and once tombstones accumulate the table must rehash, which shows up as latency spikes. Second, the perfect-hash tables fph::DynamicFphMap and fph::MetaFphMap are a poor fit for this workload -- every insert can force a perfect-hash rebuild, so they are not only the slowest but also time out at the larger sizes (those points read 0.00 and are not plotted). They are fast to look up but pay for it heavily under churn, and that caveat matters for erase/insert workloads.
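The tombstone mechanism can be illustrated with a toy linear-probing set. This is a sketch under simplifying assumptions (fixed capacity, identity hash, no rehash policy) and does not reproduce any of the benchmarked libraries; the class name `TombstoneSet` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy linear-probing set with tombstones. Fixed capacity, no growth.
class TombstoneSet {
public:
    explicit TombstoneSet(std::size_t capacity) : slots_(capacity) {}

    bool insert(std::uint64_t key) {
        if (contains(key)) return false;
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[index(key, i)];
            if (s.state != State::Full) {  // reuse an empty or deleted slot
                if (s.state == State::Deleted) --tombstones_;
                s.key = key;
                s.state = State::Full;
                return true;
            }
        }
        return false;  // table is full
    }

    bool erase(std::uint64_t key) {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[index(key, i)];
            if (s.state == State::Empty) return false;  // end of probe chain
            if (s.state == State::Full && s.key == key) {
                s.state = State::Deleted;  // tombstone keeps the chain intact
                ++tombstones_;
                return true;
            }
        }
        return false;
    }

    bool contains(std::uint64_t key) const {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const Slot& s = slots_[index(key, i)];
            if (s.state == State::Empty) return false;  // only Empty stops a probe
            if (s.state == State::Full && s.key == key) return true;
        }
        return false;
    }

    std::size_t tombstones() const { return tombstones_; }

private:
    enum class State : std::uint8_t { Empty, Full, Deleted };
    struct Slot {
        std::uint64_t key = 0;
        State state = State::Empty;
    };

    // Identity hash plus linear probing, chosen only to keep the example small.
    std::size_t index(std::uint64_t key, std::size_t probe) const {
        return static_cast<std::size_t>((key + probe) % slots_.size());
    }

    std::vector<Slot> slots_;
    std::size_t tombstones_ = 0;
};
```

Because lookups must walk past deleted slots, a table that accumulates tombstones probes longer and longer until it rehashes, which is the source of the latency spikes described above.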

Throughput

We record the time spent in the whole process, which includes both insert and erase operations.

The y axis value is the average time per operation, obtained by time/op = (time for insert + time for erase) / (2 * M), that is, the average over all insert and erase operations.
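The procedure and the time/op formula can be sketched as below. This is an illustrative loop, not the benchmark's harness: `run_churn`, `ChurnResult` and the `"key<i>"` key generator are assumptions, std::unordered_map stands in for any table under test, and the timed region here also includes building the new key and drawing the random victim.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct ChurnResult {
    double ns_per_op;        // (insert time + erase time) / (2 * M)
    std::size_t final_size;  // stays at N after every insert/erase pair
};

// Build a table of N string keys, then M times insert one new key and erase
// one randomly chosen existing key.
ChurnResult run_churn(std::size_t n, std::size_t m, std::uint64_t seed) {
    std::unordered_map<std::string, std::uint64_t> table;
    std::vector<std::string> live;  // keys currently in the table
    for (std::size_t i = 0; i < n; ++i) {
        std::string key = "key" + std::to_string(i);  // hypothetical key format
        table.emplace(key, i);
        live.push_back(std::move(key));
    }
    if (n == 0 || m == 0) return ChurnResult{0.0, table.size()};

    std::mt19937_64 rng(seed);
    std::size_t next_id = n;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t step = 0; step < m; ++step) {
        std::string fresh = "key" + std::to_string(next_id);
        table.emplace(fresh, next_id);           // size becomes N + 1
        ++next_id;
        const std::size_t victim = rng() % live.size();
        table.erase(live[victim]);               // size returns to N
        live[victim] = std::move(fresh);
    }
    const auto stop = std::chrono::steady_clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return ChurnResult{total_ns / (2.0 * static_cast<double>(m)), table.size()};
}
```

A call such as `run_churn(1024, 100000, 1)` returns the per-operation time for a table held at 1024 elements.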

:

For the 64-byte fixed key on the Xeon, xxHash_xxh3 is the best hash and the flat open-addressing tables are tightly bunched at the front: ska::flat_hash_map, tsl::robin_map, absl::flat_hash_map and robin_hood::unordered_flat_map all run about 67-72 ns at 1024 elements and converge to roughly 320-360 ns at 10^7, where the malloc/free of the 64-byte string and the cache-missing probes dominate. std::unordered_map is roughly 30-60% slower (93 ns small, 513 ns at 10^7) because it allocates and frees a node on top of the string on every modification.

The perfect-hash tables are much slower: fph::DynamicFphMap/fph::MetaFphMap start around 160-170 ns at 1024 and then time out -- fph::DynamicFphMap has no plotted point past 32768 and fph::MetaFphMap shows a 6050 ns spike at 1.2M before dropping out -- because the steady stream of inserts keeps triggering perfect-hash rebuilds. Note also that ankerl::unordered_dense_map competes well at small and mid sizes (~75 ns at 1024) but its 10^7 point is absent (0.00); its dense-array design pays extra to keep entries packed when an arbitrary element is erased, since it back-fills the hole from the array end.
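The back-fill step of a dense-array design can be sketched like this. It is a generic swap-and-pop illustration with a hypothetical helper name, not the library's code; a real table would also have to update the hash slot that pointed at the moved entry, which is the extra cost mentioned above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Erase entries[pos] from a packed array by moving the last entry into the
// hole, so the array stays contiguous. Precondition: pos < entries.size().
// Returns the old index of the entry that was moved (the former last index),
// which an index table would need in order to patch its reference.
template <typename Entry>
std::size_t erase_backfill(std::vector<Entry>& entries, std::size_t pos) {
    const std::size_t last = entries.size() - 1;
    if (pos != last) {
        entries[pos] = std::move(entries[last]);
    }
    entries.pop_back();
    return last;
}
```

For example, erasing index 1 from `{10, 20, 30, 40}` leaves `{10, 40, 30}`: the erase is O(1), but element order is not preserved.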

On the M1 Max the ranking is similar, with ska::flat_hash_map and tsl::robin_map leading (~370-380 ns at 10^7) and the fph tables again timing out at large sizes. The smaller-key variants below preserve the ordering but run much faster: the 12-byte SSO key brings the flat leaders down to ~87-91 ns at 10^7 (Xeon) because no allocation is involved.

:

Latency

Insert (after erase)

:

Latency is measured on the Xeon E-2388G only. The first charts in this section show the 64-byte fixed key, but looking at the 12-byte SSO key makes the table algorithm clearest, since allocation noise is removed. For the P50 (median) insert-after-erase, absl::flat_hash_map and absl::node_hash_map are best at large sizes (~104-110 ns at 10^7), while ska::flat_hash_map and tsl::robin_map lead at small sizes (~14-16 ns at 1024) but lose ground in the 200k-1.2M range (~88-100 ns) as their probe chains lengthen. The fph tables sit far behind even at the median (~50-68 ns at 1024, climbing into the hundreds) and drop out at large sizes.

The P99 (tail) is where the tombstone-rehash behaviour shows. The conventional tables keep a bounded tail -- absl::flat_hash_map is around 492 ns and std::unordered_map around 1000 ns at 10^7 -- but the perfect-hash tables have very large spikes: fph::MetaFphMap's P99 reaches ~44000 ns at 1.2M and ~231000 ns at 10^7, because an insert that triggers a full perfect-hash rebuild lands directly in the tail. These tables are a poor choice where bounded modification latency matters.
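To make the P50/P99 distinction concrete, here is a small nearest-rank percentile helper. The percentile definition is an assumption chosen for illustration; the benchmark does not state which one it uses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. p is in (0, 100].
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
    if (rank == 0) rank = 1;
    if (rank > samples.size()) rank = samples.size();
    return samples[rank - 1];
}
```

With 98 samples of 100 ns and two rebuild-like samples of 50000 ns, the median stays at 100 ns while the P99 jumps to 50000 ns, which is the pattern the fph tables show here.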

For the 64-byte key the string allocation raises the floor: every table's P99 is several hundred ns even at small sizes (~470 ns at 1024), and the same fph blowup (~48000 ns at 1.2M for fph::MetaFphMap) appears. Other length variants follow the same pattern.

:

Erase

:

Erase latency is closer between tables because an erase on an open-addressing table is cheap -- it locates the slot and drops a tombstone -- with no allocation on the inline-stored path. For the 64-byte key the P99 erase is bunched between about 1030 and 1140 ns at 10^7 for the flat tables (absl::flat_hash_map ~1030 ns), with std::unordered_map higher at ~1390 ns; the floor is again set by the free of the string's heap buffer plus the cache miss to reach the slot. As before the fph tables are the exception: fph::DynamicFphMap already times out past 32768 and fph::MetaFphMap's P99 climbs past ~1200 ns at 1.2M before dropping out, since an erase that crosses the tombstone threshold forces a perfect-hash rebuild. The median (P50) erase is dominated by ska::flat_hash_map, tsl::robin_map and emhash::hash_map7 at small-to-mid sizes; the remaining string variants tell the same story scaled by string length.

:


Hash Table Benchmark - String Iterate

2026-06-13T13:55:00.000Z

The string key iterate test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the performance of iterating over the hash table.

Iteration is the one operation where the key type barely matters. Walking a table never hashes a key and never compares two strings; it only advances an iterator and reads back each stored entry. Therefore, unlike lookup or insertion, the string length and the choice of hash function have almost no influence here. What dominates is how a table lays out its entries in memory, exactly as in the integer iterate test. The three storage strategies are:

Dense array storage. ankerl::unordered_dense_map and emhash::hash_map7 keep all key-value pairs packed in a contiguous array and store only indices (or small metadata) in the hash slots. Iterating is a linear scan over a dense array, so the cost per element is tiny and independent of the load factor.
Inline open addressing. ska::flat_hash_map, ska::bytell_hash_map, tsl::robin_map, absl::flat_hash_map, fph::* and robin_hood::unordered_flat_map store the entries directly in a sparse slot array. To iterate they walk the whole slot array and skip the empty slots, so the work per element grows as the table becomes emptier and the array spills out of cache.
Node-based storage. std::unordered_map and absl::node_hash_map allocate each entry in a separate node and chase pointers between them.
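The difference between the first two strategies can be sketched by counting how many array positions a full iteration has to inspect. The types and function names below are hypothetical simplifications; real tables use metadata bytes or SIMD groups rather than a per-slot flag.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A slot of a sparse, inline open-addressing array (simplified).
struct InlineSlot {
    bool full;
    std::uint64_t value;
};

// Dense layout: every array element is a live entry, so iteration inspects
// exactly size() positions. Returns the number of positions inspected.
std::size_t scan_dense(const std::vector<std::uint64_t>& entries,
                       std::uint64_t& sum) {
    for (std::uint64_t v : entries) sum += v;
    return entries.size();
}

// Sparse inline layout: iteration inspects every slot, live or empty, so the
// work per live element grows as the load factor drops.
std::size_t scan_sparse(const std::vector<InlineSlot>& slots,
                        std::uint64_t& sum) {
    std::size_t inspected = 0;
    for (const InlineSlot& s : slots) {
        ++inspected;
        if (s.full) sum += s.value;
    }
    return inspected;
}
```

Three live entries cost three inspections in the dense layout but eight in an eight-slot sparse array, even though both scans produce the same sum.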

One string-specific note: the stored entry here is a std::pair holding the std::string key and the value, which has the same fixed size (a std::string control block, 32 bytes on libstdc++, plus the value) regardless of whether the string is 12, 24 or 64 characters long. The actual character bytes of a long, heap-allocated string live elsewhere and are not touched during iteration, since we only advance the iterator and do not read the string contents. That is why the iterate numbers are essentially identical across all six string variants.

Throughput

:

The charts confirm the layout-driven picture. On both platforms ankerl::unordered_dense_map is the fastest by a wide margin and is perfectly flat across the whole size range -- about 0.22 ns per element on the Xeon E-2388G and about 2.0 ns on the M1 Max -- because it always scans a packed array no matter how many slots the table has. emhash::hash_map7 is the runner-up with the same flat behaviour (about 0.63 ns on the Xeon and 2.6 ns on the M1).

Every inline open-addressing table behaves differently: its per-element cost rises with the element count as the slot array grows past the cache. ska::flat_hash_map, for example, climbs from under 1 ns at 1024 elements to about 12.4 ns at 10^7 on the Xeon, because most of that time is then spent reading empty slots from memory. The node-based std::unordered_map is the slowest at large sizes -- around 48 ns per element at 10^7 on the Xeon and 25 ns on the M1 -- since iterating its node list becomes a stream of cache-missing pointer dereferences.

The perfect-hash tables sit in the middle of the open-addressing pack and never lead here: fph::MetaFphMap runs about 7.4 ns and fph::DynamicFphMap about 10.8 ns at 10^7 on the Xeon. A perfect hash buys fast lookup, but iteration still has to walk a sparse slot array, so it gives fph no advantage and fph::DynamicFphMap is in fact one of the slower flat tables to scan because it keeps a relatively sparse slot array.

The remaining five string variants below tell the same story almost to the decimal -- iteration ignores the key contents, so the fixed-vs-max length and the 12/24/64-byte distinction make no measurable difference.

:

Latency

The P99 latency of a single iteration step tells the same story from the tail end (latency is measured on the Xeon E-2388G only). ankerl::unordered_dense_map stays flat at about 0.94 ns regardless of size, because advancing over a dense array never misses far. The inline and node-based tables develop growing tails as the backing storage outgrows the cache: at 10^7 elements the P99 step reaches roughly 84 ns for ska::bytell_hash_map, about 110 ns for ska::flat_hash_map and about 103 ns for tsl::robin_map, while the node-based std::unordered_map and absl::node_hash_map reach hundreds of nanoseconds (about 406 ns and 444 ns respectively), with each spike corresponding to a cache miss on the next slot or node. The first point (N=1024) is a small outlier for ankerl at 3.13 ns -- a cold-start artifact -- after which it settles to its flat 0.94 ns. The other five string variants follow the same pattern.

:


Hash Table Benchmark - String Insert and Construct

2026-06-13T13:55:00.000Z

The string key insert and construct test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the time spent constructing the hash table. The construction is done by the insert( const value_type& value ) operation. We test both with and without calling reserve before inserting. The time spent in reserve is not counted in the total time.

During the test, a hash table is constructed multiple times, and in the throughput test we record the total time spent on insert.

The length of a string is the same as std::string::length(), which means that one additional byte is needed to save the null character.

Inserting string keys is more expensive than inserting integers because each insert pays for three things that integers do not: hashing the whole byte sequence of the key, comparing whole strings on a collision, and -- for keys longer than the Small String Optimization (SSO) threshold -- allocating heap memory to hold the characters. On the libstdc++/libc++ implementations used here, a std::string of length up to 15 (libstdc++) or 22 (libc++) stores its bytes inline in the control block (no allocation), so the 12-byte keys are pure SSO and never touch the allocator, while the 24-byte and 64-byte keys spill to the heap and every insert also pays a malloc. This single fact -- SSO vs heap allocation -- is the biggest driver of the differences between the string-length variants below.
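Whether a given string is stored inline can be checked with a small address test. This is an illustrative probe of implementation behaviour, not something the C++ standard guarantees, and `stored_inline` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <string>

// Returns true when the character buffer lies inside the std::string object
// itself (SSO), false when it points to separately allocated memory.
// The result depends on the standard library implementation.
bool stored_inline(const std::string& s) {
    const std::uintptr_t obj_begin = reinterpret_cast<std::uintptr_t>(&s);
    const std::uintptr_t obj_end = obj_begin + sizeof(std::string);
    const std::uintptr_t data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj_begin && data < obj_end;
}
```

On common 64-bit implementations a 12-character string reports inline storage and a 64-character string does not, which matches the split between the 12-byte and 64-byte key variants.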

It is also worth setting expectations for the perfect-hash tables up front. fph::DynamicFphMap and fph::MetaFphMap are extremely fast at lookup because they build a (near-)perfect hash with no probing, but that construction is exactly what makes their insert slow -- they periodically rebuild the perfect hash as the table grows. So in this insert/construct test the fph tables are consistently the slowest, and that caveat matters for the insert results.

Throughput

Insert with reserve

:

With reserve called ahead of time, the capacity is fixed before insertion, so this isolates the pure per-insert cost (hash, probe, allocate, store) with no rehashing. For the 64-byte fixed key on the Xeon, xxHash_xxh3 -- which is built to consume raw bytes -- is the best hash for nearly every table, and the leaders are tightly bunched: absl::flat_hash_map, ankerl::unordered_dense_map and robin_hood::unordered_flat_map all sit around 24-26 ns at 1024 elements and converge to roughly 118-119 ns at 10^7, where the per-insert cost is dominated by the malloc of the 64-character string and by cache misses, not by the table algorithm. std::unordered_map trails the flat tables by roughly 2x (about 48 ns small, 250 ns at 10^7) because it allocates a node per element on top of the string allocation.

The fph tables are dramatically slower here: fph::MetaFphMap and fph::DynamicFphMap reach about 2390 ns and 2471 ns at 10^7 on the Xeon -- on the order of 10x slower than std::unordered_map and 20x slower than the flat leaders -- because the perfect-hash build cost grows with the table. (Note their N=1024 points, ~198 and ~178 ns, are an artifact of building and rebuilding a perfect hash for a tiny table; they actually get relatively better at 32768 before the build cost dominates again.) On the M1 Max the ordering is the same but the gap is smaller: the fph tables land around 950 ns at 10^7 versus ~100-120 ns for the flat leaders.

The 24-byte and 12-byte variants below follow the same ranking, just shifted faster as the strings shrink: at 10^7 the flat leaders drop from ~118 ns (64-byte) to ~92-112 ns (24-byte) to ~50-63 ns (12-byte), the last because the 12-byte key is pure SSO and skips the allocator entirely. The "max length N" variants run about the same as, or a little faster than, the matching "fixed length N", because the randomly generated keys average shorter than the maximum -- so there is less to hash, compare, and copy -- even though the varying length makes the per-key cost less uniform.

:

Insert without reserve

:

Without reserve, every table also pays the cost of growing and rehashing as it fills, which roughly doubles the per-insert time across the board. For the 64-byte fixed key on the Xeon, absl::flat_hash_map with xxHash_xxh3 (about 191 ns at 10^7) and ankerl::unordered_dense_map (~178 ns) are the fastest at large sizes: the former grows cheaply and rehashes a metadata array rather than moving whole entries again, and the latter only has to grow a dense array. The probing-heavy tables suffer more from rehash: ska::flat_hash_map climbs to about 465 ns at 10^7, more than double the absl number, because every growth re-inserts every element through the probe sequence.
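The effect of reserve on rehashing can be observed directly with the standard container. The sketch below counts bucket-array changes during construction; `count_rehashes` and the `"key<i>"` keys are hypothetical, and std::unordered_map is used only because its bucket_count() makes growth visible.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Insert n string keys and count how many times the bucket array changes
// size, with or without an up-front reserve(n).
std::size_t count_rehashes(std::size_t n, bool reserve_first) {
    std::unordered_map<std::string, std::uint64_t> table;
    if (reserve_first) table.reserve(n);
    std::size_t rehashes = 0;
    std::size_t buckets = table.bucket_count();
    for (std::size_t i = 0; i < n; ++i) {
        table.emplace("key" + std::to_string(i), i);
        if (table.bucket_count() != buckets) {
            ++rehashes;
            buckets = table.bucket_count();
        }
    }
    return rehashes;
}
```

With `reserve_first` set, the count is zero for any n, since the standard guarantees no rehash while the element count stays within the reserved amount; without it, the table grows several times on the way to n.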

The fph tables are much slower here. Without a reserve, every growth triggers a fresh perfect-hash build, so fph::MetaFphMap/fph::DynamicFphMap reach ~3800 and ~4180 ns at 10^7 on the Xeon -- about 20x the flat leaders and noticeably worse than their with-reserve numbers, where they at least only build the perfect hash once at the final size. This is consistent with fph trading construction speed for lookup speed.

The smaller-key and "max length" variants below preserve this ranking; the 12-byte SSO keys are again the fastest (absl about 64 ns at 10^7) because they never call the allocator, leaving only the table's own growth and probe costs.

:

Latency

The P99 latency of a single insert is measured on the Xeon E-2388G only. It tells the same story from the tail end: the conventional flat tables keep their tail bounded while the perfect-hash tables spike whenever they rebuild.

Insert with reserve

:

With a prior reserve, the 64-byte fixed key shows the conventional tables holding a tight P99: absl::node_hash_map, ankerl::unordered_dense_map and absl::flat_hash_map sit around 490-530 ns at 10^7, with std::unordered_map at ~880 ns. The allocation of the 64-byte string sets a floor under all of them. The fph tables again stand out: fph::DynamicFphMap climbs to about 12660 ns and fph::MetaFphMap to about 12420 ns at 10^7 -- more than an order of magnitude worse -- because even with capacity reserved, the perfect-hash construction has occasional very expensive steps that dominate the 99th percentile. For the 12-byte SSO key the floor drops (the flat tables are around 460-490 ns at 10^7) since there is no allocation, but the fph tail stays high (fph::DynamicFphMap ~11780 ns).

:

Insert without reserve

:

Without a prior reserve the P99 of the conventional tables stays close to the reserved case -- a rehash large enough to fall in the 99th percentile is rare relative to the number of inserts -- so for the 12-byte key absl::flat_hash_map still sits around 442 ns at 10^7 and std::unordered_map around 850 ns. The fph tables, by contrast, degrade sharply: without reserve they rebuild the perfect hash at every growth, so fph::DynamicFphMap and fph::MetaFphMap reach roughly 20400 and 20360 ns at 10^7, almost double their reserved tails. If stable insert latency matters, reserve ahead of time and avoid the perfect-hash tables for the build phase. The remaining string variants follow the same pattern.

:


Hash Table Benchmark - Integer Insert and Construct

2026-06-13T12:30:14.000Z

The integer key insert test.

Click the labels on the legend to hide or show the data lines for specific hash tables and hash functions in the figure.

In this test, we measure the time spent constructing the hash table. The construction is done with the emplace( const value_type& value ) operation. We test both with and without a reserve before inserting. The time taken by reserve is not counted in the total time.

We originally used the insert( const value_type &value) function to build the hash table, but we found that some hash tables do not use std::pair as their value_type (the type used by the STL container). As a result, we cannot use the insert function uniformly across all the tables we test.

During the test, each hash table is constructed many times, and we record the total insertion time in the throughput test.
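One way to build different map types through a single code path is to call emplace with the key and value as separate arguments. The sketch below is an assumption about how such uniform construction could look, not the benchmark's actual harness; `build_with_emplace` is a hypothetical helper.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Build any map-like container that supports emplace(key, value), without
// naming the container's value_type.
template <typename Map, typename Key, typename Value>
Map build_with_emplace(const std::vector<std::pair<Key, Value>>& items) {
    Map table;
    for (const auto& kv : items) {
        table.emplace(kv.first, kv.second);  // duplicates keep the first value
    }
    return table;
}
```

The same call works for `std::unordered_map` and `std::map` here, and would work for any table whose emplace accepts a key and a value.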

Throughput

The y axis value is the average time per operation. This result is obtained by time/op = sum{construct time} / (number of construct * number of elements).
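The formula can be sketched as a small timing loop. This is an illustration under assumptions: `construct_ns_per_op` is a hypothetical name, std::unordered_map stands in for the table under test, and the timer starts after reserve so that reserve time is excluded as described above.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// time/op = sum{construct time} / (number of constructs * number of elements)
double construct_ns_per_op(const std::vector<std::uint64_t>& keys,
                           std::size_t repeats, bool reserve_first) {
    if (keys.empty() || repeats == 0) return 0.0;
    using Clock = std::chrono::steady_clock;
    std::chrono::duration<double, std::nano> total{0.0};
    for (std::size_t r = 0; r < repeats; ++r) {
        std::unordered_map<std::uint64_t, std::uint64_t> table;
        if (reserve_first) table.reserve(keys.size());  // not timed
        const Clock::time_point start = Clock::now();
        for (std::uint64_t k : keys) table.emplace(k, k);
        total += Clock::now() - start;
    }
    return total.count() /
           (static_cast<double>(repeats) * static_cast<double>(keys.size()));
}
```

Destroying the table at the end of each repetition happens outside the timed region, so only the insert loop contributes to the sum.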

:

Insert with reserve

When analyzing insert performance after reserve, the first obvious thing is that some hash tables, such as emhash::hash_map7, tsl::robin_map, absl::node_hash_map and absl::flat_hash_map, are not suitable for use with std::hash. They need a better hash function. Of course, there are some hash tables that haven't exposed this problem in the insert test yet, and they will show this problem in the later lookup test.

When we remove these hash tables that do not use the appropriate hash function (the reader can click the labels in the legend to hide the data points of these hash tables), we can see that the slowest hash tables to insert are the perfect-hash fph::DynamicFphMap and fph::MetaFphMap, by a large margin. This is the honest tradeoff for their fast lookups: building a perfect hash is expensive, and the cost grows with the number of elements. On the Xeon E-2388G, the fph tables already take 37 to 51 ns per insert at 1024 elements (vs about 20 ns for std::unordered_map and 4 to 6 ns for the fastest flat tables), so even with a small table the gap is real. As the element count grows the gap widens further: at 10^7 elements the fph insert time is about 1450 to 1900 ns, roughly 12 to 15 times that of std::unordered_map (which sits near 125 ns) on the Intel CPU, and about 11 to 12 times on the Apple M1 Max (where the fph tables are near 1550 ns and std::unordered_map near 134 ns).

In addition, we found that the combination of std::unordered_map and std::hash is surprisingly slow for some data points on the M1 Max. Further observation shows that the number of elements of these data points is exactly a power of two: at 1024 elements this combination needs about 165 ns per insert, at 2048 about 311 ns, and at 8192 it explodes to roughly 2500 ns, while at the neighboring non-power-of-two sizes (e.g. 800, 1500, 3000, 6000) it stays around 20 to 32 ns. We infer this comes from the libc++ unordered_map implementation used by clang on the M1: it takes the hash value modulo the number of buckets to obtain a slot index. When the bucket count is exactly a power of two, that modulo degenerates into keeping only the low-order bits of the hash and discarding the high-order bits. Since std::hash for integers is the identity function, any entropy that lives in the high bits is thrown away, which produces heavy collisions at exactly the power-of-two sizes.
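The collapse can be reproduced with a few lines of arithmetic. The keys below are an assumed worst case (all entropy above bit 32), not the benchmark's dataset, and the helper names are hypothetical; the bucket index is modelled as identity hash followed by modulo.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Keys whose low 32 bits are all zero: 0, 1<<32, 2<<32, ...
std::vector<std::uint64_t> high_bit_keys(std::size_t n) {
    std::vector<std::uint64_t> keys;
    keys.reserve(n);
    for (std::uint64_t i = 0; i < n; ++i) keys.push_back(i << 32);
    return keys;
}

// Number of distinct buckets hit when the hash is the identity function and
// the bucket index is hash % bucket_count.
std::size_t distinct_buckets(const std::vector<std::uint64_t>& keys,
                             std::size_t bucket_count) {
    std::unordered_set<std::size_t> used;
    for (std::uint64_t k : keys) {
        used.insert(static_cast<std::size_t>(k % bucket_count));
    }
    return used.size();
}
```

For 1024 such keys, a bucket count of 1024 puts every key in bucket 0, while the prime bucket count 1031 spreads them over 1024 distinct buckets. A hash function that mixes high bits into low bits avoids the problem regardless of bucket count.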

When looking for the fastest hash table for insert operations, the results are a little different on the x86-64 and arm64 architectures.

On Intel Rocket Lake, several flat tables are essentially tied at small sizes: emhash::hash_map7 with absl::Hash, ska::flat_hash_map with std::hash, and ankerl::unordered_dense_map all sit around 4 to 5 ns per insert when the element count is no larger than a few thousand. In the medium range, absl::flat_hash_map with absl::Hash is the most consistent winner: it stays near 6 to 12 ns from 6,000 up to about 800,000, where the other flat tables start to climb faster. At the largest sizes the picture shifts again, and in an irregular way. tsl::robin_map with robin_hood::hash is actually the fastest at 10^7 elements (about 25 ns), with ankerl::unordered_dense_map and ska::flat_hash_map close behind (about 27 to 28 ns). Yet the very same combination slumps in the 200,000 to 1,200,000 range (rising to about 30 to 42 ns) before recovering -- a non-monotonic profile that suggests robin_hood::hash produces a weaker distribution for these masked-bit keys in that range rather than a clean cache-tier effect. robin_hood::unordered_flat_map, on the other hand, does not suffer from this hash-quality problem: it stays competitive with either absl::Hash or robin_hood::hash. This means that although it requires a good hash function, it does not rely on the hash quality as much as other hash tables like absl::flat_hash_map.

On Apple M1 Max, the story is a little different. When the element count is no larger than about 6,000, ska::flat_hash_map with std::hash is the fastest, dipping below 4 ns per insert. When the number of elements is larger than 6,000, absl::flat_hash_map with absl::Hash takes the lead, staying around 4.5 to 10 ns all the way to about 1,200,000 while the others trail. But when the number of elements approaches 10^7, the gap between these flat maps shrinks and they converge to roughly 20 to 22 ns. This is because the working set no longer fits in cache when there are many elements. This phenomenon can be observed more clearly in the : data because cache is under more pressure in that situation.

Insert without reserve

When there is no reserve operation before insert, the rankings of these hash tables change again.

On Intel Rocket Lake, absl::flat_hash_map with absl::Hash is almost always the fastest across all data scales (about 16 to 44 ns), with emhash::hash_map7 and ankerl::unordered_dense_map close behind. When the number of elements is small, tsl::robin_map and ska::flat_hash_map are also nearly as fast, but they fall back in the larger ranges because their aggressive growth strategy triggers more rehashing.

On Apple Silicon, ska::flat_hash_map with std::hash is the fastest at the smallest sizes (about 14 ns at 1024). When the number of elements grows beyond a few thousand, absl::flat_hash_map with absl::Hash again becomes the fastest, staying around 20 to 32 ns across the rest of the range. emhash::hash_map7 with absl::Hash comes a close second.

:

Insert with reserve

The pattern of keys is the same as the : . The difference is that the size of the pair is 4 times larger: 64 bytes per element instead of the original 16 bytes. Therefore, the working set runs out of cache earlier.

On Intel Rocket Lake, at small sizes ankerl::unordered_dense_map, ska::flat_hash_map with std::hash and absl::flat_hash_map with absl::Hash are all close (about 5 to 8 ns). In the medium range, absl::flat_hash_map with absl::Hash is the fastest, staying near 10 to 15 ns up to about 200,000 elements. But at the largest sizes absl::flat_hash_map falls behind: at 10^7 it costs about 65 ns while ska::flat_hash_map (about 43 ns) and tsl::robin_map with absl::Hash (about 40 ns) are clearly faster. This is because absl::flat_hash_map keeps its metadata and slots in two separate arrays, so once the working set no longer fits in cache it pays two memory loads per probe; the larger 64-byte element makes this penalty more visible.

On Apple Silicon, ska::flat_hash_map with std::hash is the fastest when the number of elements is small (no larger than about 6,000). For the medium-to-large range, absl::flat_hash_map with absl::Hash takes the lead, holding it up to about 1,200,000 elements. When the element count is higher than that, ska::flat_hash_map again pulls roughly even with or ahead of absl::flat_hash_map. This is because when the data reaches this scale, the M1 Max runs out of cache, and absl::flat_hash_map relies on cache to get decent performance. Its metadata and slots are in two different arrays, while ska::flat_hash_map has only one slot array. So even if there is a cache miss, ska::flat_hash_map only fetches data from RAM once.

Insert without reserve

When there is no prior reserve, the results are quite similar to those of the data . This may be because most of the time is spent on allocation and deallocation of memory.

There is one difference: absl::node_hash_map gets a better ranking than in the data. This shows that when the sizeof(value_type) is large, or when the construction of the value_type is time-consuming, absl::node_hash_map has some advantages because it does not move the value between slots.

Latency

:

Insert with reserve latency

Insert without reserve latency

We found that even if there is no reserve before inserting, the P99 latency of the insert operation is basically at the same level as the P99 latency with reserve. But theoretically, without reserve, the worst-case time complexity of an insert operation is O(n), because the hash table needs to grow. If a reservation is made in advance, rehash can be avoided when inserting.

Therefore, we can additionally look at the P100 latency (aka max latency) of insert operations. The following two graphs show the insertion P100 latency when the reserve operation is performed in advance and the insertion P100 latency when the reserve operation is not performed.

It can be seen that only the fph family of hash tables and robin_hood::unordered_flat_map have large maximum insertion latency when a prior reserve is performed. The fph tables reach tens of millions of nanoseconds at 10^7 elements (their perfect-hash rebuild is the culprit), and robin_hood::unordered_flat_map climbs to over a million ns. From the experimental data, the worst-case time complexity of insertion in the other hash tables, after reserve, does not appear to be O(n). The P100 insertion latency of ska::flat_hash_map and ska::bytell_hash_map is only about 760 to 830 ns even when the number of elements is 10^7, which is a good result.

When there is no prior reserve, the P100 latency of the insert operation arguably reflects the worst time complexity of that operation on the hash table: O(n). The maximum insertion latency is proportional to the number of elements, reaching the order of 10^8 ns at 10^7 elements for the flat tables. Relatively speaking, absl::flat_hash_map and robin_hood::unordered_flat_map (together with ankerl::unordered_dense_map) have a smaller maximum latency when expanding.

If the user wants stable insert time, then reserve should be performed in advance, which can avoid rehash on most hash tables.
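Why the maximum grows linearly while P99 does not can be seen in a toy growth model. The model below is an assumption for illustration (capacity doubles when full and every growth moves all current elements); the real tables use their own growth policies and load factors.

```cpp
#include <algorithm>
#include <cstddef>

// Returns the largest number of elements moved by any single insert while
// inserting n elements into a table that doubles its capacity when full.
std::size_t worst_single_insert_moves(std::size_t n, bool reserve_first) {
    std::size_t capacity = reserve_first ? n : 1;
    std::size_t size = 0;
    std::size_t worst = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {
            worst = std::max(worst, size);  // this insert re-inserts everything
            capacity *= 2;
        }
        ++size;
    }
    return worst;
}
```

For n = 1000 without reserve the worst insert moves 512 elements, about half of n, while only 10 of the 1000 inserts trigger growth at all; that is too few to reach the 99th percentile but enough to set the maximum. With reserve the worst case in this model is 0 moves.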

:

Insert with reserve latency

Insert without reserve latency

When the size of value_type is 64 bytes, the rankingsare very similar to .

Throughput Appendix

:

Insert with reserve

Insert without reserve

:

Insert with reserve

Insert without reserve

:

Insert with reserve

Insert without reserve

Latency Appendix

:

Insert with reserve latency

Insert without reserve latency

:

Insert with reserve latency

Insert without reserve latency

:

Insert with reserve latency

Insert without reserve latency