Hashing and Wrapping the unordered Containers (C++)

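I. Hashing
- 1. Concept
- 2. Hash functions and hash collisions
  - Hash functions (two common ones)
  - Hash collisions
  - Summary
- 3. Resolving hash collisions
  - Closed hashing
    - Linear probing
    - Quadratic probing
    - Implementation
    - Load factor (resizing)
  - Open hashing
    - Hash buckets
    - Implementation
    - Resizing
II. Wrapping the unordered containers
- hash_table
  - How the iterator works (forward iterator)
  - hash_table implementation
- Wrapping unordered_set
- Wrapping unordered_map
III. Summary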

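I. Hashing

1. Concept

A function is used to establish a mapping between an element's key and its storage location.

On insertion, the value the function computes from the key is the element's storage location.
On lookup, the same function computes the location, and the key stored there is compared with the key being searched for; if the two are equal, the search succeeds.

This approach is called hashing, the function it uses is called the hash function, and the resulting structure is called a hash table.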

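2. Hash functions and hash collisions

Hash functions (two common ones)

Direct addressing

Function
The hash address is a linear function of the key: Hash(Key) = A * Key + B
Pros and cons
Pro: simple and uniform
Con: the keys must fall within a compact range
Use case
Counting how many times each character occurs in a string, where the characters come from a compact range.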
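The character-counting use case can be sketched as follows. This is only an illustration: the function name CountLetters and the restriction to lowercase letters are assumptions made here, with A = 1 and B = -'a' in the formula above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Direct addressing for lowercase letters: Hash(key) = 1 * key + (-'a').
// Assumes the input contains only 'a'..'z' (restriction chosen for this sketch).
std::array<int, 26> CountLetters(const std::string& s)
{
    std::array<int, 26> counts{};          // every slot starts at 0
    for (char ch : s)
    {
        assert(ch >= 'a' && ch <= 'z');    // keys must lie in the compact range
        std::size_t address = static_cast<std::size_t>(ch - 'a');
        ++counts[address];                 // the key itself determines the slot
    }
    return counts;
}
```

Calling CountLetters("banana") fills slot 0 ('a') with 3, slot 1 ('b') with 1 and slot 13 ('n') with 2. No collision can occur because every key has its own slot, which is also why the key range has to be small.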

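Division method

Function
Hash(Key) = Key % m (m must not exceed the number of addressable slots in the table; a prime is recommended)
Use case
Suited to keys that are spread over a wide range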
e.g. (figure omitted: keys placed in a table with the division method)
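A minimal sketch of the division method, assuming a table of m = 10 slots. The helper names DivisionHash and MapKeys and any keys passed to them are made up for illustration; they are not the data from the original figure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Division method: Hash(key) = key % m, where m is the table size.
std::size_t DivisionHash(std::size_t key, std::size_t m)
{
    assert(m > 0);
    return key % m;
}

// Returns the slot each key maps to (collisions are not resolved here).
std::vector<std::size_t> MapKeys(const std::vector<std::size_t>& keys, std::size_t m)
{
    std::vector<std::size_t> slots;
    for (std::size_t key : keys)
        slots.push_back(DivisionHash(key, m));
    return slots;
}
```

With m = 10, the keys 1, 4, 5, 6, 7, 9 each map to the slot equal to their own value, while a key such as 25 maps to slot 5, the same slot as 5.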

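Notes:

The division method requires the key on the left of % to be an integer. How can a string key be converted to an integer?
Answer: with a string hash function. An important measure of a hash function's quality is how often it collides; within the available resources, the fewer collisions, the better the hash function performs.
Common string hash algorithms include BKDRHash, APHash, DJBHash…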

e.g.

// BKDR hash function
unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc.
    unsigned int hash = 0;
    while (*str)
    {
        hash = hash * seed + (*str++);
    }
    return (hash & 0x7FFFFFFF); // keep the low 31 bits
}
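To see why the multiplication by the seed matters, here is a small comparison, offered as a sketch. SumHash is a deliberately naive hash invented for the comparison, and BkdrHash is a std::string variant of the function above without the final mask.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive string hash: sums the characters, so anagrams always collide.
std::size_t SumHash(const std::string& s)
{
    std::size_t hash = 0;
    for (unsigned char ch : s)
        hash += ch;
    return hash;
}

// BKDR-style hash: multiplying by the seed makes character order matter.
std::size_t BkdrHash(const std::string& s, std::size_t seed = 131)
{
    assert(seed > 1);   // a seed of 0 or 1 would throw away the ordering
    std::size_t hash = 0;
    for (unsigned char ch : s)
        hash = hash * seed + ch;
    return hash;
}
```

For "abc" and "cba", SumHash returns the same value (294), whereas BkdrHash returns 1677554 and 1711874, so the two strings can land in different slots.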

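With the division method it is best to take the remainder modulo a prime. How can the table keep growing by roughly a factor of two while its size stays prime?
Answer: use a predefined list of primes. The list can differ between STL implementations. In the list below each prime is roughly twice the previous one, so the table still grows geometrically, which helps keep the load factor (the average number of elements per bucket) within a small range and performance steady.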

// Prime list
size_t GetNextPrime(size_t prime)
{
    const size_t PRIMECOUNT = 28;
    static const size_t primeList[PRIMECOUNT] =
    {
        53ul, 97ul, 193ul, 389ul, 769ul,
        1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
        49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
        1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
        50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
        1610612741ul, 3221225473ul, 4294967291ul
    };

    // Return the first prime in the list that is larger than the argument
    for (size_t i = 0; i < PRIMECOUNT; ++i)
    {
        if (primeList[i] > prime)
            return primeList[i];
    }
    // No larger prime in the list: return the largest entry
    return primeList[PRIMECOUNT - 1];
}
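The same lookup can also be written with a binary search. The following is a sketch under assumptions: NextPrime is a name chosen here, and only the first eight primes of the list are reproduced to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Returns the first listed prime greater than `current`,
// or the largest listed prime when none is greater.
std::size_t NextPrime(std::size_t current)
{
    // Shortened list for illustration only.
    static const std::size_t primes[] = {53, 97, 193, 389, 769, 1543, 3079, 6151};
    const std::size_t* first = std::begin(primes);
    const std::size_t* last = std::end(primes);
    assert(std::is_sorted(first, last));   // upper_bound needs sorted input
    const std::size_t* it = std::upper_bound(first, last, current);
    return it == last ? *(last - 1) : *it;
}
```

Starting from an empty table, repeated calls give 53, 97, 193, and so on, so each resize roughly doubles the number of slots while the size stays prime.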

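Hash collisions

In the example above, if the value 25 is added to the data set, the address computed by the hash function turns out to be occupied by another key already.

Definition: when different keys produce the same hash address under the same hash function, this is called a hash collision.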

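Summary

How the hash function is designed directly affects how often collisions occur.
Criteria for designing a hash function:

Its domain must include every key to be stored, and its range must be 0 to (number of addressable slots in the table) - 1
The addresses it computes should be distributed uniformly over the table
It should be simple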
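The uniformity criterion is also one reason a prime modulus is recommended. The sketch below counts how many distinct slots a key set occupies; DistinctSlots and the sample keys (multiples of 10) are assumptions chosen to make the effect visible.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Counts how many different slots the keys occupy under key % m.
std::size_t DistinctSlots(const std::vector<std::size_t>& keys, std::size_t m)
{
    assert(m > 0);
    std::set<std::size_t> used;
    for (std::size_t key : keys)
        used.insert(key % m);
    return used.size();
}
```

For the keys 10, 20, …, 100, a table of size 10 puts everything in slot 0 (one distinct slot), while a table of size 11 spreads them over ten different slots.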

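3. Resolving hash collisions

There are two ways to resolve hash collisions: closed hashing and open hashing

Closed hashing

Closed hashing, also called open addressing: when a collision occurs and the table is not full, there must still be an empty slot somewhere, so the search for an empty slot starts at the colliding position and moves on from there.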

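Linear probing

Definition: starting from the colliding position, probe the following slots one by one until an empty slot is found.

Pros and cons

Pro: simple to implement
Con: once collisions sit next to each other the data tends to "cluster", and lookups slow down

Insertion

Use the hash function to get the element's target position in the table
If that position is empty, insert directly; if it is occupied, a collision has occurred, so use linear probing to find the next empty slot and insert there.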
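The two insertion steps can be traced with a tiny stand-alone sketch. It assumes non-negative int keys, a fixed-size table that is never full, and -1 as the marker for an empty slot; LinearProbeInsert and kEmpty are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEmpty = -1;   // marks an unused slot; keys are assumed non-negative

// Inserts key with linear probing and returns the slot it landed in.
// Assumes the table still has at least one empty slot.
std::size_t LinearProbeInsert(std::vector<int>& table, int key)
{
    assert(!table.empty() && key >= 0);
    std::size_t index = static_cast<std::size_t>(key) % table.size();
    while (table[index] != kEmpty)
        index = (index + 1) % table.size();   // next slot, wrapping around
    table[index] = key;
    return index;
}
```

In a 10-slot table, inserting 5 and 6 places them in slots 5 and 6; inserting 25 then collides at slot 5, finds slot 6 taken as well, and ends up in slot 7. Inserting 9 and then 19 shows the wrap-around: 19 lands in slot 0.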

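Deletion

Because of collisions, an element cannot simply be removed, since that would break lookups for elements stored after it. For example, deleting 6 from the table in the previous example would affect a lookup of 25: if 25 was pushed past 6 by probing, the search would stop early at the now-empty slot.
So lazy deletion is used instead: every slot of the table carries a state.
States: EMPTY means the slot is empty, EXIST means the slot holds an element, DELETE means the element in the slot has been deleted.

enum STATE
{
    EXIST,
    EMPTY,
    DELETE
};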
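Why the third state is needed can be shown with a small lookup sketch. The types Slot and SlotState and the function FindSlot are simplified stand-ins invented for this example, assuming non-negative int keys and linear probing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Exist, Deleted };

struct Slot
{
    int key = 0;
    SlotState state = SlotState::Empty;
};

// Linear-probing lookup: stops at an Empty slot but walks past Deleted ones.
// Returns the slot index, or table.size() when the key is absent.
std::size_t FindSlot(const std::vector<Slot>& table, int key)
{
    assert(!table.empty() && key >= 0);
    std::size_t start = static_cast<std::size_t>(key) % table.size();
    for (std::size_t step = 0; step < table.size(); ++step)
    {
        std::size_t index = (start + step) % table.size();
        const Slot& slot = table[index];
        if (slot.state == SlotState::Empty)
            break;                           // nothing was ever stored beyond here
        if (slot.state == SlotState::Exist && slot.key == key)
            return index;
    }
    return table.size();
}
```

With 5, 6 and 25 stored in slots 5, 6 and 7 of a 10-slot table, marking slot 6 as Deleted still lets a lookup of 25 reach slot 7. Resetting slot 6 to Empty instead makes the same lookup stop at slot 6 and report 25 as missing.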

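Quadratic probing

Where linear probing checks slots one after another, quadratic probing jumps through the table according to a formula.
Hash(i) = (Hash(x) + i^2) % m
Hash(x): the position computed from the key by the hash function, which is already occupied
Hash(i): the position to try next
m: the size of the hash table
i = 1, 2, 3, 4…

Note: besides linear probing and quadratic probing there is also double hashing…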
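The probe positions produced by the quadratic probing formula can be listed with a short sketch. QuadraticProbeSequence is a hypothetical helper; home stands for Hash(x) and count for how many values of i to try.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the positions (home + i^2) % m for i = 1 .. count.
std::vector<std::size_t> QuadraticProbeSequence(std::size_t home, std::size_t m, std::size_t count)
{
    assert(m > 0);
    std::vector<std::size_t> sequence;
    for (std::size_t i = 1; i <= count; ++i)
        sequence.push_back((home + i * i) % m);
    return sequence;
}
```

For home = 5 and m = 10 the first four probes are 6, 9, 4 and 1, so colliding keys are scattered across the table instead of piling up right behind slot 5.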

Implementation

// Open addressing
#include <string>
#include <utility>
#include <vector>
using namespace std;

namespace open_address
{
    // Hash function
    template<class K>
    struct DefaultHashFunc
    {
        size_t operator()(const K& key)
        {
            return size_t(key);     // convert to an unsigned integer
        }
    };

    // Template specialization for strings (BKDRHash algorithm)
    template<>
    struct DefaultHashFunc<string>
    {
        size_t operator()(const string& s)
        {
            size_t hash = 0;
            for (auto ch : s)
            {
                hash *= 131;
                hash += ch;
            }
            return hash;
        }
    };

    // Slot state
    enum STATE
    {
        EXIST,
        EMPTY,
        DELETE
    };

    // Slot data
    template<class K, class V>
    struct HashData
    {
        pair<K, V> _kv;
        STATE _state = EMPTY;
    };

    template<class K, class V, class HashFunc = DefaultHashFunc<K>>
    class HashTable
    {
    public:
        HashTable()
        {
            _table.resize(10);     // start the table with ten slots
        }

        bool Insert(const pair<K, V>& kv)
        {
            if (Find(kv.first))
                return false;

            // Grow when the load factor reaches 0.7
            //if ((double)_n / (double)_table.size() >= 0.7)
            if (10 * _n / _table.size() >= 7)
            {
                size_t newSize = _table.size() * 2;
                // Build the new table
                HashTable<K, V, HashFunc> newHT;
                newHT._table.resize(newSize);
                // Re-map every element of the old table into the new one
                for (size_t i = 0; i < _table.size(); i++)
                {
                    if (_table[i]._state == EXIST)
                    {
                        newHT.Insert(_table[i]._kv);
                    }
                }
                // Swap the tables; the old storage is freed when newHT goes out of scope
                _table.swap(newHT._table);
            }

            // Linear probing
            HashFunc hf;
            size_t hashi = hf(kv.first) % _table.size();
            while (_table[hashi]._state == EXIST)
            {
                ++hashi;
                hashi %= _table.size();
            }
            _table[hashi]._kv = kv;
            _table[hashi]._state = EXIST;
            _n++;
            return true;
        }

        HashData<const K, V>* Find(const K& key)
        {
            HashFunc hf;
            size_t hashi = hf(key) % _table.size();
            // Stop at an EMPTY slot, or once every slot has been probed
            // (after many deletions the table may contain no EMPTY slot)
            for (size_t probed = 0; probed < _table.size() && _table[hashi]._state != EMPTY; ++probed)
            {
                if (_table[hashi]._state == EXIST
                    && _table[hashi]._kv.first == key)
                {
                    // &_table[hashi] has type HashData<K, V>*, so cast it
                    return (HashData<const K, V>*)&_table[hashi];
                }
                ++hashi;
                // Wrap around to the front at the end of _table
                hashi %= _table.size();
            }
            return nullptr;
        }

        bool Erase(const K& key)
        {
            HashData<const K, V>* ret = Find(key);
            if (ret)
            {
                ret->_state = DELETE;   // lazy deletion: only mark the slot
                --_n;
                return true;
            }
            return false;
        }

    private:
        vector<HashData<K, V>> _table;
        size_t _n = 0;           // number of valid elements
    };
}

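Load factor (resizing)

The load factor is computed as α = number of valid elements in the table / length of the table.
For open addressing the load factor matters a great deal; empirically it should be kept at or below about 0.7-0.8. For a fixed table length, α is proportional to the number of valid elements, so the further the load factor rises past 0.8, the more likely collisions become, and lookups hit the CPU cache less often.

So every insertion checks the load factor to decide whether the table needs to grow, trading space for time.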
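The growth check can be done in integer arithmetic, avoiding floating point. A minimal sketch, assuming a threshold of 0.7; NeedsGrow is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// True when count / capacity >= 0.7, computed without division:
// count / capacity >= 7 / 10  <=>  count * 10 >= capacity * 7.
bool NeedsGrow(std::size_t count, std::size_t capacity)
{
    assert(capacity > 0);
    return count * 10 >= capacity * 7;
}
```

For a 10-slot table the check is false with 6 elements and true with 7; for a 20-slot table it turns true at 14 elements.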

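Open hashing

Open hashing, also called separate chaining: the hash function first computes a hash address for every key; keys with the same address belong to the same subset, and each subset is called a bucket. The elements of a bucket are linked into a singly linked list, and the head node of each list is stored in the hash table.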

Hash buckets

(Figure omitted: in the example, the buckets at indices 5 and 8 both hold colliding keys.)
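As a rough stand-in for the bucket picture, the sketch below chains colliding keys with std::forward_list. The alias Buckets, the helpers ChainInsert and ChainLength, and any keys used with them are assumptions for illustration; the hand-written node list in this article plays the same role.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <vector>

using Buckets = std::vector<std::forward_list<int>>;

// Puts key at the head of the bucket selected by key % bucket count.
void ChainInsert(Buckets& buckets, int key)
{
    assert(!buckets.empty() && key >= 0);
    std::size_t index = static_cast<std::size_t>(key) % buckets.size();
    buckets[index].push_front(key);   // head insertion is O(1)
}

// Number of nodes hanging off one bucket.
std::size_t ChainLength(const Buckets& buckets, std::size_t index)
{
    return static_cast<std::size_t>(
        std::distance(buckets[index].begin(), buckets[index].end()));
}
```

With ten buckets, inserting 5, 15, 8, 28 and 3 leaves two nodes in bucket 5, two in bucket 8 and one in bucket 3. A collision only lengthens one list and never spills into a neighbouring slot.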

Implementation

#include <cstdio>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

namespace hash_bucket
{
    template<class K>
    struct DefaultHashFunc
    {
        size_t operator()(const K& key)
        {
            return size_t(key);
        }
    };

    // Template specialization for strings
    template<>
    struct DefaultHashFunc<string>
    {
        size_t operator()(const string& s)
        {
            size_t hash = 0;
            for (auto ch : s)
            {
                hash *= 131;
                hash += ch;
            }
            return hash;
        }
    };

    template<class K, class V>
    struct HashNode
    {
        pair<K, V> _kv;
        HashNode<K, V>* _next;

        // Constructor
        HashNode(const pair<K, V>& kv)
            :_kv(kv)
            , _next(nullptr)
        {}
    };

    template<class K, class V, class HashFunc = DefaultHashFunc<K>>
    class HashTable
    {
        typedef HashNode<const K, V> Node;
    public:
        HashTable()
        {
            // Ten buckets, all initialized to nullptr
            _table.resize(10, nullptr);
        }

        ~HashTable()
        {
            for (size_t i = 0; i < _table.size(); i++)
            {
                Node* cur = _table[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    delete cur;
                    cur = next;
                }
                _table[i] = nullptr;
            }
        }

        bool Insert(const pair<K, V>& kv)
        {
            if (Find(kv.first))
            {
                return false;
            }

            HashFunc hf;
            // Grow when the load factor reaches 1
            if (_n == _table.size())
            {
                size_t newSize = _table.size() * 2;
                vector<Node*> newTable;
                newTable.resize(newSize, nullptr);
                // Walk the old table and relink each node into the new one
                for (size_t i = 0; i < _table.size(); i++)
                {
                    Node* cur = _table[i];
                    while (cur)
                    {
                        Node* next = cur->_next;
                        size_t hashi = hf(cur->_kv.first) % newSize;
                        cur->_next = newTable[hashi];
                        newTable[hashi] = cur;
                        cur = next;
                    }
                    _table[i] = nullptr;
                }
                _table.swap(newTable);
            }

            // Head insertion into the target bucket
            size_t hashi = hf(kv.first) % _table.size();
            Node* newnode = new Node(kv);
            newnode->_next = _table[hashi];
            _table[hashi] = newnode;
            _n++;
            return true;
        }

        Node* Find(const K& key)
        {
            HashFunc hf;
            size_t hashi = hf(key) % _table.size();
            Node* cur = _table[hashi];
            while (cur)
            {
                if (cur->_kv.first == key)
                {
                    return cur;
                }
                cur = cur->_next;
            }
            return nullptr;
        }

        bool Erase(const K& key)
        {
            HashFunc hf;
            size_t hashi = hf(key) % _table.size();
            Node* cur = _table[hashi];
            Node* prev = nullptr;
            while (cur)
            {
                if (cur->_kv.first == key)
                {
                    if (prev == nullptr)
                    {
                        _table[hashi] = cur->_next;
                    }
                    else
                    {
                        prev->_next = cur->_next;
                    }
                    delete cur;
                    --_n;       // keep the element count in sync
                    return true;
                }
                prev = cur;
                cur = cur->_next;
            }
            return false;
        }

        void Print()
        {
            for (size_t i = 0; i < _table.size(); i++)
            {
                printf("[%zu]->", i);
                Node* cur = _table[i];
                while (cur)
                {
                    cout << cur->_kv.first << ":" << cur->_kv.second << "->";
                    cur = cur->_next;
                }
                printf("nullptr\n");
            }
            cout << endl;
        }

    private:
        vector<Node*> _table;
        size_t _n = 0;
    };
}

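Resizing

The number of buckets is fixed (number of buckets == size of the table). Without resizing, a single bucket may end up holding many elements, which hurts the performance of the table. The ideal case for open hashing is exactly one node per bucket; any further insertion must then collide, so the condition for growing can be: number of elements == number of buckets.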

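II. Wrapping the unordered containers

The interfaces of the unordered set and map containers are close to those of the red-black-tree-based set and map, and they are used in much the same way, so they are not covered here.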
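For reference, a brief usage sketch with the standard std::unordered_map and std::unordered_set, whose interface the wrappers in this part imitate. The helpers CountWords and CountDistinct are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Counts occurrences of each word.
std::unordered_map<std::string, int> CountWords(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> counts;
    for (const std::string& w : words)
        ++counts[w];    // operator[] inserts a value-initialized int (0) on first use
    assert(counts.size() <= words.size());   // at most one entry per word
    return counts;
}

// Counts distinct values; duplicate insertions are ignored by the set.
std::size_t CountDistinct(const std::vector<int>& values)
{
    std::unordered_set<int> seen(values.begin(), values.end());
    return seen.size();
}
```

The calls look the same as with std::map and std::set; the visible difference is that iteration order is unspecified rather than sorted.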

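hash_table

How the iterator works (forward iterator)

Iterator ++

If the current bucket is not exhausted, follow the list to the next node
If the current bucket is exhausted
a. Use the hash function to find the index of the current bucket, then add 1
b. Loop while the index is smaller than the size of the table
   Ⅰ. If the bucket at that index is not empty, the next node has been found; return
   Ⅱ. If it is empty, add 1 and check again
c. If the loop ends without finding a bucket, set the node to nullptr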

Self& operator++()
{
    if (_node->_next)  // current bucket not exhausted
    {
        _node = _node->_next;
    }
    else               // current bucket exhausted
    {
        HashFunc hf;
        KeyOfT kot;
        size_t hashi = hf(kot(_node->_data)) % _pht->_table.size();
        ++hashi;
        while (hashi < _pht->_table.size())
        {
            if (_pht->_table[hashi])
            {
                _node = _pht->_table[hashi];
                return *this;
            }
            else
            {
                hashi++;
            }
        }
        _node = nullptr;
    }
    return *this;
}
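The bucket-skipping part of operator++ can be isolated into a small stand-alone sketch. Node here is a simplified stand-in for the article's HashNode, and NextNonEmptyBucket is a hypothetical helper that returns the table size where the iterator above would set its node to nullptr.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node
{
    int key;
    Node* next;
};

// Returns the index of the first non-empty bucket after `current`,
// or buckets.size() if there is none (the "end" position).
std::size_t NextNonEmptyBucket(const std::vector<Node*>& buckets, std::size_t current)
{
    assert(current < buckets.size());
    for (std::size_t i = current + 1; i < buckets.size(); ++i)
    {
        if (buckets[i] != nullptr)
            return i;
    }
    return buckets.size();
}
```

With nodes only in buckets 3 and 7 of a 10-bucket table, stepping from bucket 3 yields 7, and stepping from bucket 7 yields 10, meaning the traversal is finished.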

hash_table implementation

#include <string>
#include <utility>
#include <vector>
using namespace std;

// 1. Hash table
// 2. Wrapping map and set
// 3. Regular iterator
// 4. const iterator
// 5. Return value of insert, operator[]
// 6. Keys must not be modifiable
namespace hash_bucket
{
    template<class K>
    struct DefaultHashFunc
    {
        size_t operator()(const K& key)
        {
            return (size_t)key;
        }
    };

    template<>   // specialization
    struct DefaultHashFunc<string>
    {
        size_t operator()(const string& str)
        {
            size_t hash = 0;
            for (auto ch : str)
            {
                hash *= 131;
                hash += ch;
            }
            return hash;
        }
    };

    template<class T>
    struct HashNode
    {
        T _data;
        HashNode<T>* _next;

        HashNode(const T& data)
            :_data(data)
            , _next(nullptr)
        {}
    };

    // Forward declaration: the iterator holds a pointer to the hash table
    template<class K, class T, class KeyOfT, class HashFunc>
    class HashTable;

    // Iterator
    template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc>
    struct HTIterator
    {
        typedef HashNode<T> Node;
        typedef HTIterator<K, T, Ptr, Ref, KeyOfT, HashFunc> Self;
        // Regular (non-const) iterator
        typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> Iterator;

        Node* _node;
        // Pointer to the table. It must point to const: begin() const passes
        // a const this, and a non-const pointer would drop that qualifier
        const HashTable<K, T, KeyOfT, HashFunc>* _pht;

        HTIterator(Node* node, const HashTable<K, T, KeyOfT, HashFunc>* pht)
            :_node(node)
            , _pht(pht)
        {}

        // For the regular iterator this is the copy constructor;
        // for the const iterator it converts a regular iterator into a const one
        HTIterator(const Iterator& it)
            :_node(it._node)
            , _pht(it._pht)
        {}

        Ref operator*()
        {
            return _node->_data;
        }

        Ptr operator->()
        {
            return &_node->_data;
        }

        bool operator!=(const Self& s) const
        {
            return _node != s._node;
        }

        bool operator==(const Self& s) const
        {
            return _node == s._node;
        }

        Self& operator++()
        {
            if (_node->_next)  // current bucket not exhausted
            {
                _node = _node->_next;
            }
            else               // current bucket exhausted
            {
                HashFunc hf;
                KeyOfT kot;
                size_t hashi = hf(kot(_node->_data)) % _pht->_table.size();
                ++hashi;
                while (hashi < _pht->_table.size())
                {
                    if (_pht->_table[hashi])
                    {
                        _node = _pht->_table[hashi];
                        return *this;
                    }
                    else
                    {
                        hashi++;
                    }
                }
                _node = nullptr;
            }
            return *this;
        }
    };

    // set -> hash_bucket::HashTable<K, K> _ht
    // map -> hash_bucket::HashTable<K, pair<K, V>> _ht
    template<class K, class T, class KeyOfT, class HashFunc = DefaultHashFunc<K>>
    class HashTable
    {
        typedef HashNode<T> Node;

        // Friend: the iterator reads _table through its table pointer.
        // The parameter names differ from the class's own to avoid shadowing.
        template<class K1, class T1, class Ptr1, class Ref1, class KeyOfT1, class HashFunc1>
        friend struct HTIterator;
    public:
        typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> iterator;
        typedef HTIterator<K, T, const T*, const T&, KeyOfT, HashFunc> const_iterator;

        iterator begin()
        {
            // The first node of the first non-empty bucket
            for (size_t i = 0; i < _table.size(); i++)
            {
                Node* cur = _table[i];
                if (cur)
                {
                    return iterator(cur, this);
                }
            }
            return iterator(nullptr, this);
        }

        iterator end()
        {
            return iterator(nullptr, this);
        }

        const_iterator begin() const
        {
            for (size_t i = 0; i < _table.size(); i++)
            {
                Node* cur = _table[i];
                if (cur)
                {
                    return const_iterator(cur, this);
                }
            }
            return const_iterator(nullptr, this);
        }

        const_iterator end() const
        {
            return const_iterator(nullptr, this);
        }

        HashTable()
        {
            _table.resize(10, nullptr);
        }

        ~HashTable()
        {
            for (size_t i = 0; i < _table.size(); i++)
            {
                Node* cur = _table[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    delete cur;
                    cur = next;
                }
                _table[i] = nullptr;
            }
        }

        pair<iterator, bool> Insert(const T& data)
        {
            HashFunc hf;
            KeyOfT kot;
            iterator it = Find(kot(data));
            if (it != end())
            {
                return make_pair(it, false);
            }

            // Grow when the load factor reaches 1
            if (_n == _table.size())
            {
                size_t newSize = _table.size() * 2;
                vector<Node*> newTable;
                newTable.resize(newSize, nullptr);
                // Walk the old table and relink each node into the new one
                for (size_t i = 0; i < _table.size(); i++)
                {
                    Node* cur = _table[i];
                    while (cur)
                    {
                        Node* next = cur->_next;
                        // Re-hash the key of the node being moved
                        size_t hashi = hf(kot(cur->_data)) % newSize;
                        cur->_next = newTable[hashi];
                        newTable[hashi] = cur;
                        cur = next;
                    }
                    _table[i] = nullptr;
                }
                _table.swap(newTable);
            }

            size_t hashi = hf(kot(data)) % _table.size();
            // Head insertion
            Node* newnode = new Node(data);
            newnode->_next = _table[hashi];
            _table[hashi] = newnode;
            ++_n;
            return make_pair(iterator(newnode, this), true);
        }

        iterator Find(const K& key)
        {
            HashFunc hf;
            KeyOfT kot;
            size_t hashi = hf(key) % _table.size();
            Node* cur = _table[hashi];
            while (cur)
            {
                if (kot(cur->_data) == key)
                {
                    return iterator(cur, this);
                }
                cur = cur->_next;
            }
            return iterator(nullptr, this);
        }

        bool Erase(const K& key)
        {
            HashFunc hf;
            KeyOfT kot;
            size_t hashi = hf(key) % _table.size();
            Node* prev = nullptr;
            Node* cur = _table[hashi];
            while (cur)
            {
                if (kot(cur->_data) == key)
                {
                    if (prev == nullptr)
                    {
                        _table[hashi] = cur->_next;
                    }
                    else
                    {
                        prev->_next = cur->_next;
                    }
                    delete cur;
                    --_n;       // only decrement when a node was removed
                    return true;
                }
                prev = cur;
                cur = cur->_next;
            }
            return false;
        }

    private:
        vector<Node*> _table; // array of bucket head pointers
        size_t _n = 0;        // number of valid elements
    };
}

Wrapping unordered_set

namespace kpl
{
    template<class K>
    class unordered_set
    {
        // For set this functor only tags along with map: the key is the element itself
        struct SetKeyOfT
        {
            const K& operator()(const K& key)
            {
                return key;
            }
        };
    public:
        // Both iterator types are const iterators, so keys cannot be modified
        typedef typename hash_bucket::HashTable<K, K, SetKeyOfT>::const_iterator iterator;
        typedef typename hash_bucket::HashTable<K, K, SetKeyOfT>::const_iterator const_iterator;

        const_iterator begin() const
        {
            return _ht.begin();
        }

        const_iterator end() const
        {
            return _ht.end();
        }

        pair<iterator, bool> insert(const K& key)
        {
            // Insert returns a regular iterator in .first, so receive it as one
            pair<typename hash_bucket::HashTable<K, K, SetKeyOfT>::iterator, bool> ret = _ht.Insert(key);
            // Build a const iterator from the regular one; this is where the
            // converting constructor in the iterator comes into play
            return pair<iterator, bool>(ret.first, ret.second);
        }

    private:
        hash_bucket::HashTable<K, K, SetKeyOfT> _ht;
    };
}

Wrapping unordered_map

namespace kpl
{
    template<class K, class V>
    class unordered_map
    {
        // This is where the functor really matters (for set it only tags along):
        // it extracts the key from the key-value pair
        struct MapKeyOfT
        {
            const K& operator()(const pair<const K, V>& kv)
            {
                return kv.first;
            }
        };
    public:
        typedef typename hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT>::iterator iterator;
        typedef typename hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT>::const_iterator const_iterator;

        iterator begin()
        {
            return _ht.begin();
        }

        iterator end()
        {
            return _ht.end();
        }

        const_iterator begin() const
        {
            return _ht.begin();
        }

        const_iterator end() const
        {
            return _ht.end();
        }

        pair<iterator, bool> insert(const pair<K, V>& kv)
        {
            return _ht.Insert(kv);
        }

        // Returns a reference to the value mapped to key,
        // inserting a default-constructed value if the key is absent
        V& operator[](const K& key)
        {
            pair<iterator, bool> ret = _ht.Insert(make_pair(key, V()));
            return ret.first->second;
        }

    private:
        hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT> _ht;
    };
}