Hash Table Benchmark - 整数查询延迟

  1. 查询表中存在的 key(命中)
    1. 使用默认 max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
      3. <K,V>: <uint64_t with high position bits masked, uint64_t>
      4. <K,V>: <uint64_t with low position bits masked, uint64_t>
      5. <K,V>: <uint64_t uniformly distributed, uint64_t>
    2. 使用较大 max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
      3. <K,V>: <uint64_t with high position bits masked, uint64_t>
      4. <K,V>: <uint64_t with low position bits masked, uint64_t>
      5. <K,V>: <uint64_t uniformly distributed, uint64_t>
  2. 查询表中不存在的 key(未命中)
    1. 使用默认 max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
      3. <K,V>: <uint64_t with high position bits masked, uint64_t>
      4. <K,V>: <uint64_t with low position bits masked, uint64_t>
      5. <K,V>: <uint64_t uniformly distributed, uint64_t>
    2. 使用较大 max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
      3. <K,V>: <uint64_t with high position bits masked, uint64_t>
      4. <K,V>: <uint64_t with low position bits masked, uint64_t>
      5. <K,V>: <uint64_t uniformly distributed, uint64_t>
  3. 查询有 50% 概率在表中的 key
    1. 使用默认 max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
      3. <K,V>: <uint64_t with high position bits masked, uint64_t>
      4. <K,V>: <uint64_t with low position bits masked, uint64_t>
      5. <K,V>: <uint64_t uniformly distributed, uint64_t>
    2. 使用较大 max_load_factor
      1. <K,V>: <uint64_t with several split bits masked, uint64_t>
      2. <K,V>: <uint64_t with several split bits masked, 56 bytes struct>
      3. <K,V>: <uint64_t with high position bits masked, uint64_t>
      4. <K,V>: <uint64_t with low position bits masked, uint64_t>
      5. <K,V>: <uint64_t uniformly distributed, uint64_t>

这一篇测试整数 key 下的查询延迟。

点击图例中的标签,可以隐藏或显示图中对应哈希表和哈希函数的数据线。

这一篇测试三种情况下哈希表的查询延迟:

  1. 查询表中存在的 key(命中,successful find)。
  2. 查询表中不存在的 key(未命中,unsuccessful find)。
  3. 查询有 50% 概率在表中的 key。

延迟反映的情况和吞吐并不一样。吞吐测试会让很多次相互独立的查询同时进行,CPU 的乱序执行引擎可以把它们的访存重叠起来。而下面的 P99 延迟是单次查询耗时的 99 分位,它刻画的是那些无法被掩盖的最坏访问——一次查询可能错过好几条 cache line、走过很长的探测序列,或者发生分支预测失败。延迟只在 x86-64 平台(Xeon E-2388G)上测试。

每张图里都有两个区间。当表还能放进 cache 时,长尾取决于最坏情况下一次查询要做多少次访存:能限制探测次数的哈希表(perfect hash 的 fph::*),或者能借助 metadata 快速排除一个 key 的哈希表,长尾都比较短;而在冲突下可能走过很长探测链的哈希表,长尾就更重。一旦表的大小超过 L3 cache,几乎每一次 99 分位查询都至少要访问一次 DRAM,于是长尾会落到一个由访存延迟决定的下限附近,在这台 Rocket Lake 平台上大约是 420-460 ns,此时哈希表的选择对长尾的影响远小于它对吞吐的影响。每一组的第二张图都使用了较大的 max_load_factor,它把表填得更满,探测链略微变长,但大体上的排名不变。

查询表中存在的 key(命中)

使用默认 max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

命中查询的排名会随规模变化。规模较小时,简单的开放寻址哈希表长尾最短——ska::flat_hash_map 搭配 std::hash 在 1,024 个元素时 P99 约为 7.8 ns。进入 L2/L3 区间后,perfect hash 表反超:在 32,768 个元素时,fph::DynamicFphMapfph::MetaFphMap 搭配 std::hash 有最好的 P99(约 22 ns),因为它们即使在最坏情况下也能保证探测次数很小且有上界。超过约一百万个元素后,所有表都收敛到约 420-460 ns 的 DRAM 下限;基于节点的 std::unordered_map 是明显的例外,因为一次长尾查询要先后解引用两个指针,而这两次都可能 cache miss,达到 655-765 ns。

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

使用较大 max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

查询表中不存在的 key(未命中)

使用默认 max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

未命中正是 metadata 设计最能发挥作用的场景。fph::MetaFphMap 只需读取一条 metadata cache line 就能确认某个 key 不存在,完全不用走探测序列。只要这块 metadata 数组还能放进 cache,它的长尾就远比其他任何表都短:在 1,200,000 个元素时,它的 P99 未命中延迟约为 34 ns,而其他表大致从约 106 ns(r_h::unordered_flat_mapabsl::flat_hash_map)一直到 440-460 ns(ska::flat_hash_maptsl::robin_map)——长尾上有十倍以上的差距。这个优势在 L3 区间最明显,到 10,000,000 个元素时就消失了:此时 metadata 数组本身也放不进 L3 cache,于是连一次 metadata 读取都会变成一次 DRAM miss,fph::MetaFphMap 也回落到大家共同的约 450 ns 下限。这正是 fph::MetaFphMap 最擅长的场景,也是未命中查询常见时优先选它而不是 fph::DynamicFphMap 的主要理由。

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

使用较大 max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

查询有 50% 概率在表中的 key

使用默认 max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

50% 的情况把命中和未命中混在一起,结果不断交替,带来额外的分支预测失败开销,使各个表之间的差距变小。在 cache 内的区间,perfect hash table 仍有一点长尾上的优势(fph::DynamicFphMap 在 32,768 个元素时约为 27 ns),但一旦工作集离开 cache,访存延迟下限又重新主导,各个表就很难区分了,只有基于节点的表明显落后(在最大规模下,absl::node_hash_map 约为 520-620 ns,std::unordered_map 约为 675-785 ns)。

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>

使用较大 max_load_factor

<K,V>: <uint64_t with several split bits masked, uint64_t>

<K,V>: <uint64_t with several split bits masked, 56 bytes struct>

<K,V>: <uint64_t with high position bits masked, uint64_t>

<K,V>: <uint64_t with low position bits masked, uint64_t>

<K,V>: <uint64_t uniformly distributed, uint64_t>


← 返回 Hash Table Benchmark 目录