Hash Table Benchmark - 字符串删除和插入

吞吐
延迟
1. 插入（删除之后）
2. 删除

这一篇测试字符串 key 下反复删除和插入的性能。

📄 本测试的原始 CSV 数据：Xeon E-2388G · M1 Max。

点击图例中的标签，可以隐藏或显示图中对应哈希表和哈希函数的数据线。

这个测试先构造一个大小为 N 的哈希表，然后重复执行 M 次下面的过程：

向哈希表中插入一个新元素
从哈希表中随机删除一个元素

整个过程中，表的大小基本保持在 N 或 N+1，因此这里测到的不是扩容长大过程的成本，而是哈希表已经达到目标大小后，持续修改时的成本。对字符串 key 来说，每次 insert 和 erase 仍然要对整个 key 计算 hash，遇到冲突时还要比较字符串；对于超过 SSO 阈值的 key（24-byte 和 64-byte 两组），每次 insert 还需要在 heap 上为字符串内容分配内存，每次 erase 也要释放这块内存，所以 64-byte 的结果很大程度上由 malloc/free 主导。12-byte key 会通过 SSO 存在 std::string 内部，不需要额外分配，因此比长字符串快很多。

这里有两点会从结构上影响结果。第一，open-addressing table 会使用 tombstone：erase 时不会直接把 slot 标成 empty，而是标成 deleted，这样 probe chain 才不会断；但 tombstone 积累到一定程度后，哈希表必须 rehash，于是会出现 latency spike。第二，perfect hash table fph::DynamicFphMap 和 fph::MetaFphMap 并不适合这个 workload。每次 insert 都可能触发 perfect-hash rebuild，所以它们不仅慢，在较大规模下还会 timeout（这些点显示为 0.00，不会画出来）。它们 lookup 很快，但如果频繁修改，代价就很大。

吞吐

这里记录的是整个过程的耗时，包括 insert 和 erase 两部分。

y 轴是平均单次操作耗时，计算方式是 time/op = (time for insert + time for erase) / (2 * M)，也就是 insert 和 erase 的平均耗时。

<K,V>: <string with a fixed length of 64, uint64_t>

对于 64-byte fixed key，在 Xeon 上 xxHash_xxh3 是最好的哈希函数，几个 flat open-addressing table 排在最前面且差距不大：ska::flat_hash_map、tsl::robin_map、absl::flat_hash_map 和 robin_hood::unordered_flat_map 在 1024 个元素时大约是 67-72 ns，到 10^7 个元素时收敛到大约 320-360 ns。这时 64-byte string 的 malloc/free 和 cache-missing probe 占了主要时间。std::unordered_map 大约慢 50%，小规模时 93 ns，10^7 时 513 ns，因为每次修改除了字符串自己的分配和释放，还要额外分配和释放 node。

perfect hash table 要慢得多。fph::DynamicFphMap/fph::MetaFphMap 在 1024 个元素时就已经是 160-170 ns 左右，之后还会 timeout：fph::DynamicFphMap 在 32768 之后没有画出的点，fph::MetaFphMap 在 1.2M 处出现 6050 ns 的尖峰，之后也没有后续数据。这是因为持续插入会不断触发 perfect-hash rebuild。还要注意，ankerl::unordered_dense_map 在小中规模下表现不错，1024 时约 75 ns，但 10^7 的点缺失（0.00）；它的 dense-array 设计在删除任意位置的元素后，需要用数组尾部的元素回填空洞以保持紧凑，因此在这里会有额外开销。

M1 Max 上排名类似，ska::flat_hash_map 和 tsl::robin_map 领先，10^7 时约 370-380 ns，fph 表在大规模下同样 timeout。下面几组较短 key 的排序基本保持不变，但整体快很多：12-byte SSO key 让最快的几个 flat table 在 Xeon 10^7 时降到约 87-91 ns，因为没有字符串分配。

<K,V>: <string with a max length of 64, uint64_t>

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

延迟

这里把插入和删除的延迟分开记录。注意，这里的插入延迟和“插入和构造”测试中的插入延迟不一样：构造测试中的延迟统计的是从大小 0 一直插入到大小 N 的整个过程，而本测试中哈希表的大小始终保持在 N 或 N + 1 附近。

插入（删除之后）

<K,V>: <string with a fixed length of 64, uint64_t>

延迟只在 Xeon E-2388G 上测试。这一节第一组图是 64-byte fixed key，但如果想看清哈希表算法本身，12-byte SSO key 反而更合适，因为它去掉了字符串分配带来的噪声。对于 insert-after-erase 的 P50（median）latency，大规模下 absl::flat_hash_map 和 absl::node_hash_map 最好，10^7 时大约 104-110 ns；ska::flat_hash_map 和 tsl::robin_map 在小规模时领先，1024 时大约 14-16 ns，但在 200k-1.2M 这一段会因为 probe chain 变长而落后，约 88-100 ns。fph 表即使看 median 也明显落后，1024 时约 50-68 ns，之后会升到几百 ns，并且大规模下没有后续数据。

P99（tail）能看到 tombstone-rehash 的影响。常规 flat table 的 tail 相对可控，10^7 时 absl::flat_hash_map 大约 492 ns，std::unordered_map 大约 1000 ns；但 perfect hash table 会出现很大的尖峰：fph::MetaFphMap 的 P99 在 1.2M 时达到约 44000 ns，在 10^7 时达到约 231000 ns，因为触发完整 perfect-hash rebuild 的 insert 会直接落在 tail 里。如果场景要求修改操作的延迟有上界，就不应该选择这些表。

对于 64-byte key，字符串分配会抬高 latency 的下限：即使在小规模下，各表的 P99 也有几百 ns，1024 时约 470 ns；同时 fph 的尖峰仍然存在，比如 fph::MetaFphMap 在 1.2M 时约 48000 ns。其他长度的字符串也有类似模式。

<K,V>: <string with a max length of 64, uint64_t>

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

删除

<K,V>: <string with a fixed length of 64, uint64_t>

Erase latency 在各表之间更接近，因为 open-addressing table 的 erase 本身代价不高：找到 slot 后放一个 tombstone，在 inline-stored 路径上没有分配。对于 64-byte key，常规哈希表在 10^7 时的 P99 erase 大多集中在约 1030 到 1140 ns，absl::flat_hash_map 约 1030 ns，std::unordered_map 约 1390 ns。这里的下限仍然主要来自释放字符串 heap buffer 的 free，以及访问 slot 时的 cache miss。和前面一样，fph 表是例外：fph::DynamicFphMap 在 32768 之后就 timeout，fph::MetaFphMap 的 P99 在 1.2M 时超过约 1200 ns 后也没有后续数据，因为当某次 erase 使 tombstone 数量超过阈值时，会触发一次 perfect-hash rebuild。P50 erase 在小中规模下主要由 ska::flat_hash_map、tsl::robin_map 和 emhash::hash_map7 领先；其余字符串长度的结果也是同样模式，只是随字符串长度整体缩放。

<K,V>: <string with a max length of 64, uint64_t>

<K,V>: <string with a fixed length of 24, uint64_t>

<K,V>: <string with a max length of 24, uint64_t>

<K,V>: <string with a fixed length of 12, uint64_t>

<K,V>: <string with a max length of 12, uint64_t>

← 返回 Hash Table Benchmark 目录

作者: Renzibei
文章链接: https://renzibei.com//string-erase-insert/
版权声明: 本网站所有文章（包含文字、图片等内容）除特别声明外，均系作者原创，采用 CC BY-NC-ND 4.0 许可协议。引用与转载时请遵守协议、注明出处。

发布于 2026-06-13

tags:

{ hashtable }

{ benchmark }

{ algorithm }