PostgreSQL Indexes | Hash Indexes






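  • Static hash table: the number of buckets B is fixed.

  • Dynamic hash table: B is allowed to change so that it stays close to the total number of records divided by the number of records that fit in one block. In other words, each bucket occupies roughly one storage block.

    • Dynamic hash tables come in two kinds:

      • ① Extensible hash table: it extends the simple static hash table structure as follows:

        • Buckets are represented by an array of pointers to data blocks, rather than by an array of the data blocks themselves.
        • The pointer array can grow. Its length is always a power of 2, so each time the array grows the number of buckets doubles. Not every bucket has its own data block, however: if all the records of several buckets fit in a single block, those buckets may share that block.
        • The hash function h computes a K-bit binary sequence for each key, where K is sufficiently large, say 32. At any given time, however, bucket numbers use only some number of bits taken from the start of that sequence, and this number is smaller than K. For example, when r bits are in use, the bucket array has 2^r entries.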
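      • ② Linear hash table: its number of buckets grows more slowly. It has the following properties:

        • The number of buckets B is always chosen so that the average number of records per storage block stays at a fixed fraction, for example 80%, of the number of records a block can hold.

          Suppose each disk block holds two records, r is the current number of records in the hash table, and B is the current number of buckets. Then r ≤ 1.6B. (The average number of records per block is r/B and a block holds 2 records, so r/(2B) ≤ 80%.)

        • Because a storage block cannot always be split, overflow blocks are allowed.

        • The number of bits used to number the entries of the bucket array is ⌈log2 B⌉, where B is the current number of buckets. These bits are always taken from the right (low-order) end of the bit sequence produced by the hash function.

        • Suppose i bits of the hash value are being used to number bucket array entries, and a record with key K is to be inserted into bucket a1a2…ai, that is, a1a2…ai are the last i bits of h(K). Treat a1a2…ai as a binary number and call it m.

          • If m < B, bucket m exists and the record is stored in it.
          • If B ≤ m < 2^i, bucket m does not exist yet, so the record is stored in bucket m − 2^(i−1), which is the bucket obtained by changing a1 to 0.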
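The bucket-selection rule for a linear hash table can be sketched in a few lines of C. This is only an illustration of the textbook rule described above, not PostgreSQL code; the function names (`bits_for_buckets`, `linear_hash_bucket`, `needs_new_bucket`) are hypothetical, and the sketch assumes 1 ≤ B ≤ 2^31.

```c
#include <stdint.h>

/* Number of low-order bits used to number B buckets: ceil(log2(B)). */
static unsigned bits_for_buckets(uint32_t nbuckets)
{
    unsigned i = 0;

    while (((uint32_t) 1 << i) < nbuckets)
        i++;
    return i;
}

/* Choose the bucket for hash value h when nbuckets buckets exist. */
static uint32_t linear_hash_bucket(uint32_t h, uint32_t nbuckets)
{
    unsigned i = bits_for_buckets(nbuckets);
    uint32_t m = h & (((uint32_t) 1 << i) - 1);   /* last i bits of h */

    if (m < nbuckets)
        return m;                                 /* bucket m exists */
    return m - ((uint32_t) 1 << (i - 1));         /* clear the leading bit a1 */
}

/* Nonzero when r records exceed 80% of the capacity of B buckets. */
static int needs_new_bucket(uint32_t r, uint32_t nbuckets, uint32_t per_block)
{
    return (double) r > 0.8 * (double) nbuckets * (double) per_block;
}
```

With B = 3 buckets, two bits are used; a hash value ending in 11 gives m = 3, which does not exist yet, so the record goes to bucket 3 − 2 = 1.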




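The PostgreSQL 8.4.1 implementation contains two kinds of hash table:

  • an on-disk hash table used as an index;
  • an in-memory hash table used for internal data lookups (mentioned several times in Chapter 3).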





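For every type of page, the "Special Space" at the end of the page is filled by the hash index with a HashPageOpaqueData structure. Its definition is shown in Data Structure 4.10.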









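  • hashm_bsize: the number of bytes in a bucket page that are available for storing index tuples. Because one disk block serves as one bucket page, the value is "disk block size (8 kB) − size of PageHeaderData − size of HashPageOpaqueData".
  • hashm_firstfree: the lowest overflow page number that may be free. It is only a hint: the page is not guaranteed to be free, so it must still be checked when searching for a free overflow page.
  • hashm_spares[HASH_MAX_SPLITPOINTS]: the total number of overflow pages allocated to the whole hash index up to each split point. Once splitpoint has been incremented (a new split has taken place), the values stored earlier in this array can no longer change; that is, they stay the same whether or not overflow pages allocated before that split point are later recycled. HASH_MAX_SPLITPOINTS is 32, so at most 32 splits are possible.
  • hashm_mapp[HASH_MAX_BITMAPS]: the block numbers of the bitmap pages. HASH_MAX_BITMAPS is 128, so at most 128 bitmap pages can be allocated. This array makes it quick to find the block number of any bitmap page.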
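To make the sizes above concrete, here is a rough sketch of the arithmetic. The constants are assumptions for a default build (8 kB blocks, a 24-byte aligned page header, a 16-byte aligned HashPageOpaqueData) and should be checked against the actual headers; the helper names are hypothetical. The power-of-two rounding mirrors the "largest bitmap array size that will fit in page size" step in _hash_metapinit.

```c
#include <stdint.h>

/* Assumed sizes for a default build; verify against your own headers. */
#define BLOCK_SIZE        8192u  /* default BLCKSZ */
#define PAGE_HEADER_SIZE    24u  /* assumed aligned size of PageHeaderData */
#define HASH_OPAQUE_SIZE    16u  /* assumed aligned size of HashPageOpaqueData */
#define BITS_PER_BYTE        8u

/* Bytes left in a page once the page header and special space are removed. */
static uint32_t usable_page_bytes(void)
{
    return BLOCK_SIZE - PAGE_HEADER_SIZE - HASH_OPAQUE_SIZE;
}

/* Largest power of two that is <= n (n must be >= 1). */
static uint32_t largest_pow2_le(uint32_t n)
{
    uint32_t p = 1;

    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Overflow pages one bitmap page can track: one bit per page. */
static uint32_t pages_per_bitmap(void)
{
    return largest_pow2_le(usable_page_bytes()) * BITS_PER_BYTE;
}
```

Under these assumptions a page offers 8152 usable bytes, the bitmap array occupies 4096 of them, and one bitmap page therefore tracks 32768 overflow pages.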






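Bucket pages are allocated in groups, and the size of each group is a power of 2. Bucket pages 0 and 1 are allocated when the hash table is initialized. When splitpoint is 1 (the first split), bucket pages 2 and 3 are allocated together. When a fifth bucket is needed, that is, when splitpoint increases to 2, bucket pages 4–7 are allocated together. When splitpoint increases to 3, bucket pages 8–15 are allocated together, and so on.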




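Each time splitpoint increases by 1, another 2^splitpoint bucket pages are allocated, and so on. This scheme makes it easy to compute the disk block number of a page.

len(hashm_spares) = splitpoint + 1; hashm_spares[splitpoint] = n, meaning that n overflow pages have been allocated in total from the first split through split number splitpoint.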
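The mapping from a bucket number to a disk block can be sketched as follows. The sketch is modeled on the BUCKET_TO_BLKNO macro and _hash_log2 helper of the 8.4-era sources as far as I recall them; the standalone names `ceil_log2` and `bucket_to_blkno` are hypothetical, and the spares array stands in for hashm_spares. It assumes bucket numbers well below 2^31.

```c
#include <stdint.h>

#define MAX_SPLITPOINTS 32

/* ceil(log2(num)); the loop assumes num <= 2^31. */
static uint32_t ceil_log2(uint32_t num)
{
    uint32_t i = 0;
    uint32_t limit = 1;

    while (limit < num)
    {
        limit <<= 1;
        i++;
    }
    return i;
}

/*
 * Block number of a bucket page: the bucket number itself, plus the
 * overflow/bitmap pages allocated before the bucket's group, plus one
 * for the metapage that occupies block 0.
 */
static uint32_t bucket_to_blkno(const uint32_t spares[MAX_SPLITPOINTS],
                                uint32_t bucket)
{
    uint32_t before = 0;

    if (bucket != 0)
        before = spares[ceil_log2(bucket + 1) - 1];
    return bucket + before + 1;
}
```

For example, with spares = {0, 1, 3, ...}, buckets 0 and 1 sit in blocks 1 and 2, block 3 holds the first bitmap page, and buckets 2 and 3 follow in blocks 4 and 5.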




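  • When pagenum = 0, blocknum = 1, because the metapage already occupies blocknum = 0.
  • When pagenum ≠ 0, blocknum = the index of the last bucket allocated after split number splitpoint (2^(splitpoint+1) − 1) + [the number of overflow pages (hashm_spares[splitpoint]) + the number of bitmap pages (k)] + the metapage (1).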

metap->hashm_spares[splitnum]++;    // incremented before the blocknum of the new overflow page is computed

// From _hash_getovflpage:

/*
 * No free pages --- have to extend the relation to add an overflow page.
 * First, check to see if we have to add a new bitmap page too.
 */
if (last_bit == (uint32) (BMPGSZ_BIT(metap) - 1))
{
    /*
     * We create the new bitmap page with all pages marked "in use".
     * Actually two pages in the new bitmap's range will exist
     * immediately: the bitmap page itself, and the following page which
     * is the one we return to the caller.  Both of these are correctly
     * marked "in use".  Subsequent pages do not exist yet, but it is
     * convenient to pre-mark them as "in use" too.
     */
    bit = metap->hashm_spares[splitnum];
    _hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit));
    metap->hashm_spares[splitnum]++;    // add 1 before computing the blocknum of the new overflow page
}

/* Calculate address of the new overflow page (only now is blocknum computed) */
bit = metap->hashm_spares[splitnum];
blkno = bitno_to_blkno(metap, bit);



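As noted earlier, each expansion of the hash table reserves the bucket numbers for that expansion, whether or not the buckets have actually been allocated. Overflow pages are therefore always given disk block numbers that come after the reserved buckets. In addition, the hashm_spares array in the metapage records how many overflow pages had been allocated at each expansion, which makes it possible to compute the block number of any bucket page quickly. Once splitpoint goes from k to k+1, hashm_spares[k] may no longer be changed, so that the buckets of every expansion can always be located. In other words, an overflow page exists permanently once it has been allocated: recycling it only marks it as free and does not release its physical space.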
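The reverse calculation, from an overflow page's bit number to its block number, can be sketched in the same spirit. This is modeled on my reading of bitno_to_blkno in the 8.4-era hashovfl.c; the standalone signature below is hypothetical, with spares and ovflpoint standing in for hashm_spares and hashm_ovflpoint.

```c
#include <stdint.h>

/*
 * Convert a zero-based overflow bit number to a block number.
 * spares[i] holds the cumulative number of overflow pages allocated
 * up to split point i; ovflpoint is the current split point.
 */
static uint32_t ovfl_bitno_to_blkno(const uint32_t *spares,
                                    uint32_t ovflpoint,
                                    uint32_t bitno)
{
    uint32_t i;

    /* zero-based bit number -> one-based overflow page number */
    bitno += 1;

    /* find the split point during which this page was allocated */
    for (i = 1; i < ovflpoint && bitno > spares[i]; i++)
        ;

    /* add the bucket pages (and metapage) that precede that split point */
    return ((uint32_t) 1 << i) + bitno;
}
```

With spares = {0, 1, 3} and ovflpoint = 2, bit 0 maps to block 3 (right after buckets 0 and 1), while bits 1 and 2 map to blocks 6 and 7 (after buckets 2 and 3). Because earlier entries of spares never change, these results stay stable as the index grows.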



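The bitmap tracks the usage of the hash index's overflow pages and of the bitmap pages themselves. When all the tuples on an overflow page have been moved away or deleted, the page is recycled. It is not returned to the operating system, however; PostgreSQL keeps managing it so that it can be reused the next time an overflow page is needed. Some mechanism is therefore required to record whether each overflow page is available, and this is what the bitmap is for.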
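A bitmap of this kind needs only three primitive operations. The sketch below illustrates them over an array of 32-bit words, in the spirit of the SETBIT, CLRBIT and ISSET macros used by the hash code; the function names are hypothetical.

```c
#include <stdint.h>

#define BITS_PER_WORD 32u

/* Mark overflow page n as "in use". */
static void bitmap_set(uint32_t *map, uint32_t n)
{
    map[n / BITS_PER_WORD] |= (uint32_t) 1 << (n % BITS_PER_WORD);
}

/* Mark overflow page n as free (recycled, but still part of the file). */
static void bitmap_clear(uint32_t *map, uint32_t n)
{
    map[n / BITS_PER_WORD] &= ~((uint32_t) 1 << (n % BITS_PER_WORD));
}

/* Nonzero if overflow page n is currently in use. */
static int bitmap_is_set(const uint32_t *map, uint32_t n)
{
    return (map[n / BITS_PER_WORD] >> (n % BITS_PER_WORD)) & 1u;
}
```

Allocating a page sets its bit and recycling a page clears it; the page itself stays in the index file either way.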








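PostgreSQL 8.4.1 uses global variables to keep track of the hash index being created, and it provides an external function, hashbuild, for building a hash index. The execution flow of the function is shown in Figure 4-16.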



Datum
hashbuild(PG_FUNCTION_ARGS)
{
    Relation    heap = (Relation) PG_GETARG_POINTER(0);
    Relation    index = (Relation) PG_GETARG_POINTER(1);
    IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
    IndexBuildResult *result;
    BlockNumber relpages;
    double      reltuples;
    uint32      num_buckets;
    HashBuildState buildstate;

    /*
     * We expect to be called exactly once for any index relation. If that's
     * not the case, big trouble's what we have.
     */
    if (RelationGetNumberOfBlocks(index) != 0)
        elog(ERROR, "index \"%s\" already contains data",
             RelationGetRelationName(index));

    /* Estimate the number of rows currently present in the table */
    estimate_rel_size(heap, NULL, &relpages, &reltuples);

    /* Initialize the hash index metadata page and initial buckets */
    num_buckets = _hash_metapinit(index, reltuples);

    /*
     * If we just insert the tuples into the index in scan order, then
     * (assuming their hash codes are pretty random) there will be no locality
     * of access to the index, and if the index is bigger than available RAM
     * then we'll thrash horribly.  To prevent that scenario, we can sort the
     * tuples by (expected) bucket number.  However, such a sort is useless
     * overhead when the index does fit in RAM.  We choose to sort if the
     * initial index size exceeds NBuffers.
     *
     * NOTE: this test will need adjustment if a bucket is ever different from
     * one page.
     */
    if (num_buckets >= (uint32) NBuffers)
        buildstate.spool = _h_spoolinit(index, num_buckets);
    else
        buildstate.spool = NULL;

    /* prepare to build the index */
    buildstate.indtuples = 0;

    /* do the heap scan: scan the table being indexed and generate index tuples */
    reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
                                   hashbuildCallback, (void *) &buildstate);

    if (buildstate.spool)
    {
        /* sort the tuples and insert them into the index */
        _h_indexbuild(buildstate.spool);
        _h_spooldestroy(buildstate.spool);
    }

    /*
     * Return statistics
     */
    result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));

    result->heap_tuples = reltuples;
    result->index_tuples = buildstate.indtuples;

    PG_RETURN_POINTER(result);
}






uint32
_hash_metapinit(Relation rel, double num_tuples)
{
    HashMetaPage metap;
    HashPageOpaque pageopaque;
    Buffer      metabuf;
    Buffer      buf;
    Page        pg;
    int32       data_width;
    int32       item_width;
    int32       ffactor;
    double      dnumbuckets;
    uint32      num_buckets;
    uint32      log2_num_buckets;
    uint32      i;

    /* safety check: verify that the input argument is valid */
    if (RelationGetNumberOfBlocks(rel) != 0)
        elog(ERROR, "cannot initialize non-empty hash index \"%s\"",
             RelationGetRelationName(rel));

    /*
     * Determine the target fill factor (in tuples per bucket) for this index.
     * The idea is to make the fill factor correspond to pages about as full
     * as the user-settable fillfactor parameter says.  We can compute it
     * exactly since the index datatype (i.e. uint32 hash key) is fixed-width.
     * (This fixes the fill factor: once the hash table is this full, new
     * buckets are added.)
     */
    data_width = sizeof(uint32);
    item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) +
        sizeof(ItemIdData);     /* include the line pointer */
    // desired bytes/page  /  bytes/item
    ffactor = RelationGetTargetPageUsage(rel, HASH_DEFAULT_FILLFACTOR) / item_width;
    /* keep to a sane range */
    if (ffactor < 10)
        ffactor = 10;

    /*
     * Choose the number of initial bucket pages to match the fill factor
     * given the estimated number of tuples.  We round up the result to the
     * next power of 2, however, and always force at least 2 bucket pages. The
     * upper limit is determined by considerations explained in
     * _hash_expandtable().
     */
    dnumbuckets = num_tuples / ffactor;
    if (dnumbuckets <= 2.0)     // initial number of buckets
        num_buckets = 2;
    else if (dnumbuckets >= (double) 0x40000000)
        num_buckets = 0x40000000;
    else
        num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);

    log2_num_buckets = _hash_log2(num_buckets);    // <--<--
    Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
    Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);

    /*
     * We initialize the metapage, the first N bucket pages, and the first
     * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
     * calls to occur.  This ensures that the smgr level has the right idea of
     * the physical index length.
     */
    metabuf = _hash_getnewbuf(rel, HASH_METAPAGE);
    pg = BufferGetPage(metabuf);

    pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
    pageopaque->hasho_prevblkno = InvalidBlockNumber;
    pageopaque->hasho_nextblkno = InvalidBlockNumber;
    pageopaque->hasho_bucket = -1;
    pageopaque->hasho_flag = LH_META_PAGE;
    pageopaque->hasho_page_id = HASHO_PAGE_ID;

    // the metapage is now in a buffer; get a pointer to its in-memory contents
    metap = HashPageGetMeta(pg);

    metap->hashm_magic = HASH_MAGIC;
    metap->hashm_version = HASH_VERSION;
    metap->hashm_ntuples = 0;
    metap->hashm_nmaps = 0;
    metap->hashm_ffactor = ffactor;
    metap->hashm_bsize = HashGetMaxBitmapSize(pg);
    /* find largest bitmap array size that will fit in page size */
    for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
    {
        if ((1 << i) <= metap->hashm_bsize)
            break;
    }
    Assert(i > 0);
    metap->hashm_bmsize = 1 << i;
    metap->hashm_bmshift = i + BYTE_TO_BIT;
    Assert((1 << BMPG_SHIFT(metap)) == (BMPG_MASK(metap) + 1));

    /*
     * Label the index with its primary hash support function's OID.  This is
     * pretty useless for normal operation (in fact, hashm_procid is not used
     * anywhere), but it might be handy for forensic purposes so we keep it.
     */
    metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);

    /*
     * We initialize the index with N buckets, 0 .. N-1, occupying physical
     * blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
     * N is a power of 2, we can set the masks this way:
     */
    metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
    metap->hashm_highmask = (num_buckets << 1) - 1;

    MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares));
    MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp));

    /* Set up mapping for one spare page after the initial splitpoints */
    metap->hashm_spares[log2_num_buckets] = 1;    // <--<--
    metap->hashm_ovflpoint = log2_num_buckets;    // current split point
    metap->hashm_firstfree = 0;

    /*
     * Release buffer lock on the metapage while we initialize buckets.
     * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
     * won't accomplish anything.  It's a bad idea to hold buffer locks for
     * long intervals in any case, since that can block the bgwriter.
     */
    _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

    /*
     * Initialize the first N buckets
     */
    for (i = 0; i < num_buckets; i++)
    {
        /* Allow interrupts, in case N is huge */
        CHECK_FOR_INTERRUPTS();

        buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i));
        pg = BufferGetPage(buf);
        pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
        pageopaque->hasho_prevblkno = InvalidBlockNumber;
        pageopaque->hasho_nextblkno = InvalidBlockNumber;
        pageopaque->hasho_bucket = i;
        pageopaque->hasho_flag = LH_BUCKET_PAGE;
        pageopaque->hasho_page_id = HASHO_PAGE_ID;
        _hash_wrtbuf(rel, buf);
    }

    /* Now reacquire buffer lock on metapage */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

    /*
     * Initialize first bitmap page (sets up the bitmap information recorded
     * in the metapage)
     */
    _hash_initbitmap(rel, metap, num_buckets + 1);

    /* all done: initialization is complete, write the metapage back to disk */
    _hash_wrtbuf(rel, metabuf);

    return num_buckets;
}
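The sizing logic at the top of _hash_metapinit can be isolated into a small standalone sketch. The helper names below are hypothetical. The numbers in the usage note assume a default build: a 75% default fill factor on 8 kB pages gives a target of 6144 bytes, and an item width of 20 bytes follows from 8-byte alignment (8 + 8 + 4).

```c
#include <stdint.h>

/* ceil(log2(num)); the loop assumes num <= 2^31. */
static uint32_t ceil_log2(uint32_t num)
{
    uint32_t i = 0;
    uint32_t limit = 1;

    while (limit < num)
    {
        limit <<= 1;
        i++;
    }
    return i;
}

/* Tuples per bucket: target bytes per page / bytes per item, at least 10. */
static uint32_t target_fill_factor(uint32_t target_bytes, uint32_t item_width)
{
    uint32_t ffactor = target_bytes / item_width;

    return (ffactor < 10) ? 10 : ffactor;
}

/* Initial bucket count: a power of two, between 2 and 2^30. */
static uint32_t initial_bucket_count(double num_tuples, uint32_t ffactor)
{
    double dnumbuckets = num_tuples / ffactor;

    if (dnumbuckets <= 2.0)
        return 2;
    if (dnumbuckets >= (double) 0x40000000)
        return 0x40000000;
    return (uint32_t) 1 << ceil_log2((uint32_t) dnumbuckets);
}
```

Under those assumptions the fill factor is 307 tuples per bucket, and an estimated one million rows yields 4096 initial buckets.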



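This function scans the whole base table, builds a hash index tuple from the search key of every tuple, and inserts it into the hash table. Along the way it calls the function hashbuildCallback.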




static void
hashbuildCallback(Relation index,
                  HeapTuple htup,
                  Datum *values,
                  bool *isnull,
                  bool tupleIsAlive,
                  void *state)
{
    HashBuildState *buildstate = (HashBuildState *) state;
    IndexTuple  itup;

    /* form an index tuple and point it at the heap tuple */
    itup = _hash_form_tuple(index, values, isnull);
    itup->t_tid = htup->t_self;     // the index tuple points at the heap tuple

    /* Hash indexes don't index nulls, see notes in hashinsert */
    if (IndexTupleHasNulls(itup))
    {
        pfree(itup);
        return;
    }

    /* Either spool the tuple for sorting, or just put it into the index */
    if (buildstate->spool)
        _h_spool(itup, buildstate->spool);
    else
        _hash_doinsert(index, itup);    // <--<--

    buildstate->indtuples += 1;

    pfree(itup);
}




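  • If the bucket has enough space, the tuple is inserted into it.
  • Otherwise an overflow page is requested and the tuple is inserted into the overflow page.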





void
_hash_doinsert(Relation rel, IndexTuple itup)
{
    Buffer      buf;
    Buffer      metabuf;
    HashMetaPage metap;
    BlockNumber blkno;
    Page        page;
    HashPageOpaque pageopaque;
    Size        itemsz;
    bool        do_expand;
    uint32      hashkey;
    Bucket      bucket;

    /*
     * Get the hash key for the item (it's stored in the index tuple itself).
     */
    hashkey = _hash_get_indextuple_hashkey(itup);

    /* compute item size too */
    itemsz = IndexTupleDSize(*itup);
    itemsz = MAXALIGN(itemsz);  /* be safe, PageAddItem will do this but we
                                 * need to be consistent */

    /*
     * Acquire shared split lock so we can compute the target bucket safely
     * (see README).
     */
    _hash_getlock(rel, 0, HASH_SHARE);

    /* Read the metapage */
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    /*
     * Check whether the item can fit on a hash page at all. (Eventually, we
     * ought to try to apply TOAST methods if not.)  Note that at this point,
     * itemsz doesn't include the ItemId.
     *
     * XXX this is useless code if we are only storing hash keys.
     */
    if (itemsz > HashMaxItemSize((Page) metap))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("index row size %lu exceeds hash maximum %lu",
                        (unsigned long) itemsz,
                        (unsigned long) HashMaxItemSize((Page) metap)),
                 errhint("Values larger than a buffer page cannot be indexed.")));

    /*
     * Compute the target bucket number, and convert to block number.
     * (This works out which bucket the tuple goes into.)
     */
    bucket = _hash_hashkey2bucket(hashkey,
                                  metap->hashm_maxbucket,
                                  metap->hashm_highmask,
                                  metap->hashm_lowmask);

    blkno = BUCKET_TO_BLKNO(metap, bucket);

    /* release lock on metapage, but keep pin since we'll need it again */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

    /*
     * Acquire share lock on target bucket; then we can release split lock.
     */
    _hash_getlock(rel, blkno, HASH_SHARE);

    _hash_droplock(rel, 0, HASH_SHARE);

    /* Fetch the primary bucket page for the bucket (enter the bucket page) */
    buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
    page = BufferGetPage(buf);
    pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
    Assert(pageopaque->hasho_bucket == bucket);

    /* Do the insertion */
    while (PageGetFreeSpace(page) < itemsz)     // not enough free space
    {
        /*
         * no space on this page; check for an overflow page
         */
        BlockNumber nextblkno = pageopaque->hasho_nextblkno;

        if (BlockNumberIsValid(nextblkno))      // an overflow page exists
        {
            /*
             * ovfl page exists; go get it.  if it doesn't have room, we'll
             * find out next pass through the loop test above.
             */
            _hash_relbuf(rel, buf);
            buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
            page = BufferGetPage(buf);
        }
        else
        {
            /*
             * we're at the end of the bucket chain and we haven't found a
             * page with enough room.  allocate a new overflow page.
             */

            /* release our write lock without modifying buffer */
            _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);

            /* chain to a new overflow page (request an overflow page) */
            buf = _hash_addovflpage(rel, metabuf, buf);
            page = BufferGetPage(buf);

            /* should fit now, given test above */
            Assert(PageGetFreeSpace(page) >= itemsz);
        }
        pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
        Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
        Assert(pageopaque->hasho_bucket == bucket);
    }

    /* found page with enough space, so add the item here (the actual insert) */
    (void) _hash_pgaddtup(rel, buf, itemsz, itup);

    /* write and release the modified page */
    _hash_wrtbuf(rel, buf);

    /* We can drop the bucket lock now */
    _hash_droplock(rel, blkno, HASH_SHARE);

    /*
     * Write-lock the metapage so we can increment the tuple count. After
     * incrementing it, check to see if it's time for a split.
     */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

    metap->hashm_ntuples += 1;

    /* Make sure this stays in sync with _hash_expandtable() */
    do_expand = metap->hashm_ntuples >
        (double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);

    /* Write out the metapage and drop lock, but keep pin */
    _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

    /* Attempt to split if a split is needed */
    if (do_expand)
        _hash_expandtable(rel, metabuf);

    /* Finally drop our pin on the metapage */
    _hash_dropbuf(rel, metabuf);
}
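Two small computations in _hash_doinsert carry most of the linear-hashing logic: mapping a hash key to a bucket with the high and low masks, and deciding when to split. The sketch below is modeled on my understanding of _hash_hashkey2bucket and on the do_expand test above; the standalone function names are hypothetical.

```c
#include <stdint.h>

/*
 * Map a hash key to a bucket.  highmask covers the bits of the table
 * size being grown into, lowmask the bits of the previous size.  If the
 * bucket selected by highmask has not been created yet, fall back to
 * the bucket selected by lowmask.
 */
static uint32_t hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
                                  uint32_t highmask, uint32_t lowmask)
{
    uint32_t bucket = hashkey & highmask;

    if (bucket > maxbucket)
        bucket &= lowmask;
    return bucket;
}

/* Split once the tuple count exceeds ffactor tuples per existing bucket. */
static int should_split(double ntuples, uint32_t ffactor, uint32_t maxbucket)
{
    return ntuples > (double) ffactor * (maxbucket + 1);
}
```

With three buckets (maxbucket = 2, lowmask = 1, highmask = 3), a key ending in binary 11 selects bucket 3, which does not exist yet, so it falls back to bucket 1. This is the same rule as "change a1 to 0" in the linear hash table description.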



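  • Allocation: while an index tuple is being inserted, if the target bucket has no room, an overflow page must be created and linked to the bucket, and the index tuple is then inserted into the newly allocated overflow page.
  • Recycling: after the hash table has been expanded, index tuples that were stored in overflow pages may move to the newly added bucket, and the overflow pages they leave behind then need to be recycled.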






static Buffer
_hash_getovflpage(Relation rel, Buffer metabuf)
{
    HashMetaPage metap;
    Buffer      mapbuf = 0;
    Buffer      newbuf;
    BlockNumber blkno;
    uint32      orig_firstfree;
    uint32      splitnum;
    uint32     *freep = NULL;
    uint32      max_ovflpg;
    uint32      bit;
    uint32      first_page;
    uint32      last_bit;
    uint32      last_page;
    uint32      i,
                j;

    /* Get exclusive lock on the meta page */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

    _hash_checkpage(rel, metabuf, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    /* start search at hashm_firstfree */
    orig_firstfree = metap->hashm_firstfree;    // lowest overflow page number that may be free
    first_page = orig_firstfree >> BMPG_SHIFT(metap);
    bit = orig_firstfree & BMPG_MASK(metap);
    i = first_page;
    j = bit / BITS_PER_MAP;
    bit &= ~(BITS_PER_MAP - 1);

    /* outer loop iterates once per bitmap page */
    for (;;)
    {
        BlockNumber mapblkno;
        Page        mappage;
        uint32      last_inpage;

        /*
         * want to end search with the last existing overflow page: fetch the
         * total number of overflow pages allocated to the hash table and
         * locate the last bitmap page and the last bit within it
         */
        splitnum = metap->hashm_ovflpoint;
        max_ovflpg = metap->hashm_spares[splitnum] - 1;
        last_page = max_ovflpg >> BMPG_SHIFT(metap);
        last_bit = max_ovflpg & BMPG_MASK(metap);

        if (i > last_page)
            break;

        Assert(i < metap->hashm_nmaps);
        mapblkno = metap->hashm_mapp[i];

        if (i == last_page)
            last_inpage = last_bit;
        else
            last_inpage = BMPGSZ_BIT(metap) - 1;

        /* Release exclusive lock on metapage while reading bitmap page */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        // fetch the bitmap page with a write lock
        mapbuf = _hash_getbuf(rel, mapblkno, HASH_WRITE, LH_BITMAP_PAGE);
        mappage = BufferGetPage(mapbuf);
        freep = HashPageGetBitmap(mappage);

        // search the bitmap for a free overflow page
        for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
        {
            if (freep[j] != ALL_SET)    // ALL_SET has every bit set (~0), i.e. all pages in use
                goto found;
        }

        /*
         * No free space here, try to advance to next map page (nothing was
         * found, so release the write lock on the buffer)
         */
        _hash_relbuf(rel, mapbuf);
        i++;
        j = 0;                  /* scan from start of next map page */
        bit = 0;

        /* Reacquire exclusive lock on the meta page */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
    }

    /*
     * No free pages --- have to extend the relation to add an overflow page.
     * First, check to see if we have to add a new bitmap page too.
     * (Request a new overflow page and record it in the bitmap.)
     */
    if (last_bit == (uint32) (BMPGSZ_BIT(metap) - 1))
    {
        /*
         * We create the new bitmap page with all pages marked "in use".
         * Actually two pages in the new bitmap's range will exist
         * immediately: the bitmap page itself, and the following page which
         * is the one we return to the caller.  Both of these are correctly
         * marked "in use".  Subsequent pages do not exist yet, but it is
         * convenient to pre-mark them as "in use" too.
         */
        bit = metap->hashm_spares[splitnum];
        _hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit));
        metap->hashm_spares[splitnum]++;
    }
    else
    {
        /*
         * Nothing to do here; since the page will be past the last used page,
         * we know its bitmap bit was preinitialized to "in use".
         */
    }

    /* Calculate address of the new overflow page */
    bit = metap->hashm_spares[splitnum];
    blkno = bitno_to_blkno(metap, bit);

    /*
     * Fetch the page with _hash_getnewbuf to ensure smgr's idea of the
     * relation length stays in sync with ours.  XXX It's annoying to do this
     * with metapage write lock held; would be better to use a lock that
     * doesn't block incoming searches.
     */
    newbuf = _hash_getnewbuf(rel, blkno);

    metap->hashm_spares[splitnum]++;

    /*
     * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
     * changing it if someone moved it while we were searching bitmap pages.
     */
    if (metap->hashm_firstfree == orig_firstfree)
        metap->hashm_firstfree = bit + 1;

    /* Write updated metapage and release lock, but not pin */
    _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

    return newbuf;

found:
    // found an overflow page whose bitmap bit is 0; change the bit to 1

    /* convert bit to bit number within page */
    bit += _hash_firstfreebit(freep[j]);

    /* mark page "in use" in the bitmap */
    SETBIT(freep, bit);
    _hash_wrtbuf(rel, mapbuf);

    /* Reacquire exclusive lock on the meta page */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

    /* convert bit to absolute bit number */
    bit += (i << BMPG_SHIFT(metap));

    /* Calculate address of the recycled overflow page */
    blkno = bitno_to_blkno(metap, bit);

    /*
     * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
     * changing it if someone moved it while we were searching bitmap pages.
     * (This updates the related value in the metapage.)
     */
    if (metap->hashm_firstfree == orig_firstfree)
    {
        metap->hashm_firstfree = bit + 1;

        /* Write updated metapage and release lock, but not pin */
        _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
    }
    else
    {
        /* We didn't change the metapage, so no need to write */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
    }

    /* Fetch, init, and return the recycled page */
    return _hash_getinitbuf(rel, blkno);
}
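The heart of the search in _hash_getovflpage is a word-at-a-time scan: skip every 32-bit word that is entirely set, then find the lowest clear bit in the first word that is not. A minimal sketch of that idea follows; it ignores locking and the split across several bitmap pages, and the function names are hypothetical.

```c
#include <stdint.h>

#define BITS_PER_MAP 32u
#define ALL_SET      ((uint32_t) ~0)   /* every bit set: all pages in use */

/* Index of the lowest zero bit in a word; BITS_PER_MAP if there is none. */
static uint32_t first_free_bit(uint32_t word)
{
    uint32_t i;
    uint32_t mask = 1;

    for (i = 0; i < BITS_PER_MAP; i++)
    {
        if (!(word & mask))
            return i;
        mask <<= 1;
    }
    return BITS_PER_MAP;
}

/*
 * Scan nwords bitmap words and return the bit number of the first free
 * overflow page, or -1 if every page is in use.
 */
static long find_free_page(const uint32_t *freep, uint32_t nwords)
{
    uint32_t j;

    for (j = 0; j < nwords; j++)
    {
        if (freep[j] != ALL_SET)
            return (long) (j * BITS_PER_MAP + first_free_bit(freep[j]));
    }
    return -1;
}
```

A result of -1 corresponds to the "No free pages" path above, where the relation has to be extended with a new overflow page.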


/*
 * Acquire or release the content_lock for the buffer.
 */
void
LockBuffer(Buffer buffer, int mode)
{
    volatile BufferDesc *buf;

    Assert(BufferIsValid(buffer));
    if (BufferIsLocal(buffer))
        return;                 /* local buffers need no lock */

    buf = &(BufferDescriptors[buffer - 1]);

    if (mode == BUFFER_LOCK_UNLOCK)
        LWLockRelease(buf->content_lock);
    else if (mode == BUFFER_LOCK_SHARE)
        LWLockAcquire(buf->content_lock, LW_SHARED);
    else if (mode == BUFFER_LOCK_EXCLUSIVE)
        LWLockAcquire(buf->content_lock, LW_EXCLUSIVE);
    else
        elog(ERROR, "unrecognized buffer lock mode: %d", mode);
}





BlockNumber
_hash_freeovflpage(Relation rel, Buffer ovflbuf,
                   BufferAccessStrategy bstrategy)
{
    HashMetaPage metap;
    Buffer      metabuf;
    Buffer      mapbuf;
    BlockNumber ovflblkno;
    BlockNumber prevblkno;
    BlockNumber blkno;
    BlockNumber nextblkno;
    HashPageOpaque ovflopaque;
    Page        ovflpage;
    Page        mappage;
    uint32     *freep;
    uint32      ovflbitno;
    int32       bitmappage,
                bitmapbit;
    Bucket      bucket;

    /* Get information from the doomed page */
    _hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
    ovflblkno = BufferGetBlockNumber(ovflbuf);
    ovflpage = BufferGetPage(ovflbuf);
    ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
    nextblkno = ovflopaque->hasho_nextblkno;
    prevblkno = ovflopaque->hasho_prevblkno;
    bucket = ovflopaque->hasho_bucket;

    /*
     * (Clear every byte of the overflow page.)
     * Zero the page for debugging's sake; then write and release it. (Note:
     * if we failed to zero the page here, we'd have problems with the Assert
     * in _hash_pageinit() when the page is reused.)
     */
    MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
    _hash_wrtbuf(rel, ovflbuf);

    /*
     * (Unlink the overflow page from the bucket chain.)
     * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
     * up the bucket chain members behind and ahead of the overflow page being
     * deleted.  No concurrency issues since we hold exclusive lock on the
     * entire bucket.
     */
    if (BlockNumberIsValid(prevblkno))
    {
        Buffer      prevbuf = _hash_getbuf_with_strategy(rel,
                                                         prevblkno,
                                                         HASH_WRITE,
                                                         LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
                                                         bstrategy);
        Page        prevpage = BufferGetPage(prevbuf);
        HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);

        Assert(prevopaque->hasho_bucket == bucket);
        prevopaque->hasho_nextblkno = nextblkno;
        _hash_wrtbuf(rel, prevbuf);
    }
    if (BlockNumberIsValid(nextblkno))
    {
        Buffer      nextbuf = _hash_getbuf_with_strategy(rel,
                                                         nextblkno,
                                                         HASH_WRITE,
                                                         LH_OVERFLOW_PAGE,
                                                         bstrategy);
        Page        nextpage = BufferGetPage(nextbuf);
        HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);

        Assert(nextopaque->hasho_bucket == bucket);
        nextopaque->hasho_prevblkno = prevblkno;
        _hash_wrtbuf(rel, nextbuf);
    }

    /* Note: bstrategy is intentionally not used for metapage and bitmap */

    // The code below updates the bitmap: locate the right bitmap page through
    // the metapage, then clear the bit that corresponds to this overflow page.

    /* Read the metapage so we can determine which bitmap page to use */
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    /* Identify which bit to set */
    ovflbitno = blkno_to_bitno(metap, ovflblkno);

    bitmappage = ovflbitno >> BMPG_SHIFT(metap);
    bitmapbit = ovflbitno & BMPG_MASK(metap);

    if (bitmappage >= metap->hashm_nmaps)
        elog(ERROR, "invalid overflow bit number %u", ovflbitno);
    blkno = metap->hashm_mapp[bitmappage];

    /* Release metapage lock while we access the bitmap page */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

    /* Clear the bitmap bit to indicate that this overflow page is free */
    mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
    mappage = BufferGetPage(mapbuf);
    freep = HashPageGetBitmap(mappage);
    Assert(ISSET(freep, bitmapbit));
    CLRBIT(freep, bitmapbit);
    _hash_wrtbuf(rel, mapbuf);

    /* Get write-lock on metapage to update firstfree */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

    /*
     * if this is now the first free page, update hashm_firstfree (decide
     * whether the metapage's hashm_firstfree field has to change)
     */
    if (ovflbitno < metap->hashm_firstfree)
    {
        metap->hashm_firstfree = ovflbitno;
        _hash_wrtbuf(rel, metabuf);
    }
    else
    {
        /* no need to change metapage */
        _hash_relbuf(rel, metabuf);
    }

    return nextblkno;
}
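The chain fix-up in _hash_freeovflpage is an ordinary doubly-linked-list removal in which block numbers play the role of pointers. The toy sketch below shows just that step; the `ChainLinks` type, the in-memory `pages` array indexed by block number, and `INVALID_BLOCK` are hypothetical stand-ins for the page opaque data and InvalidBlockNumber.

```c
#include <stdint.h>

#define INVALID_BLOCK 0xFFFFFFFFu

/* The two link fields a hash page keeps in its special space. */
typedef struct
{
    uint32_t prev;
    uint32_t next;
} ChainLinks;

/*
 * Unlink page `victim` from its bucket chain and return the block number
 * of the page that followed it (INVALID_BLOCK if it was the last page).
 */
static uint32_t unlink_overflow_page(ChainLinks *pages, uint32_t victim)
{
    uint32_t prev = pages[victim].prev;
    uint32_t next = pages[victim].next;

    if (prev != INVALID_BLOCK)
        pages[prev].next = next;     /* predecessor skips the victim */
    if (next != INVALID_BLOCK)
        pages[next].prev = prev;     /* successor points back past it */

    pages[victim].prev = INVALID_BLOCK;
    pages[victim].next = INVALID_BLOCK;
    return next;
}
```

Returning the next block number lets a caller that is walking the chain continue from where the freed page used to be.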









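  • After updating the metapage and before splitting the bucket, the code releases the "lmgr lock" on the metapage. This lets the bucket split run concurrently with other operations; in particular, several bucket splits can run at the same time.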


    /* Write out the metapage and drop lock, but keep pin */
    _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

    /* Release split lock; okay for other splits to occur now */
    _hash_droplock(rel, 0, HASH_EXCLUSIVE);

    /* Relocate records to the new bucket */
    _hash_splitbucket(rel, metabuf, old_bucket, new_bucket,
                      start_oblkno, start_nblkno,
                      maxbucket, highmask, lowmask);

    /* Release bucket locks, allowing others to access them */
    _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);    // these locks were taken via _hash_try_getlock
    _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);


    /*
     * Fetch the item's hash key (conveniently stored in the item) and
     * determine which bucket it now belongs in, that is, whether the
     * existing itup stays in the old bucket or moves to the new one.
     */
    itup = (IndexTuple) PageGetItem(opage, PageGetItemId(opage, ooffnum));
    bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
                                  maxbucket, highmask, lowmask);
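  • If the "lmgr lock" on the bucket to be split cannot be obtained, the attempt to expand the hash table by adding a new bucket is abandoned. If the process kept waiting for that lock while the process holding it happened to be waiting for some lock held by this process, a deadlock would result.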




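  • When the tuples of a bucket are compacted, the scan starts at the last page of the bucket and works backwards, moving every tuple it encounters towards the front of the bucket. Two pages are involved, rpage and wpage.

    Initially rpage is the last page of the bucket chain and wpage is the first. Each tuple is read from rpage and written to wpage in turn, and its entry in rpage is deleted.

    • When all tuples in rpage have been scanned, the page before rpage becomes the new rpage and scanning continues there.

    • When wpage no longer has enough room for an insertion, the page after wpage becomes the new wpage and the tuple is inserted there.

    In this way rpage moves backwards from the last page of the chain and wpage moves forwards from the first page, until rpage and wpage are the same page, at which point the procedure ends.

  • Once the tuples at the back of the chain have been moved to the front, some overflow pages may be left with no tuples. These pages are recycled and marked as available in the bitmap by calling _hash_freeovflpage.

  • When the procedure ends, no page on the bucket chain is empty, unless the whole chain was empty to begin with.

  • The caller must hold the "lmgr lock" on the entire bucket chain, to prevent concurrent processes from accessing the chain and causing errors.

    wblkno = bucket_blkno;
    wbuf = _hash_getbuf_with_strategy(rel, wblkno, HASH_WRITE,
                                      LH_BUCKET_PAGE, bstrategy);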
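The two-cursor compaction described above can be illustrated with a toy model in which a bucket chain is an array of fixed-capacity pages. This is only a sketch of the idea, not the PostgreSQL routine: `ChainPage`, `PAGE_CAP` and `squeeze_chain` are hypothetical, tuples are plain integers, and locking and the special case of adjacent pages are left out.

```c
#define PAGE_CAP 4

/* A toy page: up to PAGE_CAP tuples, n of them in use. */
typedef struct
{
    int items[PAGE_CAP];
    int n;
} ChainPage;

/*
 * Compact a chain of npages pages in place.  The write cursor w starts
 * at the first page and moves forward when its page is full; the read
 * cursor r starts at the last page and moves backward when its page is
 * empty.  Returns the number of pages still needed; the pages after
 * that are empty and could be recycled.
 */
static int squeeze_chain(ChainPage *pages, int npages)
{
    int w = 0;
    int r = npages - 1;

    while (w < r)
    {
        if (pages[r].n == 0)
        {
            r--;                /* rpage drained: step to the previous page */
            continue;
        }
        if (pages[w].n == PAGE_CAP)
        {
            w++;                /* wpage full: step to the next page */
            continue;
        }
        /* move one tuple from the tail of rpage to wpage */
        pages[w].items[pages[w].n++] = pages[r].items[--pages[r].n];
    }

    /* the page where the cursors met may itself be empty */
    if (r > 0 && pages[r].n == 0)
        r--;
    return r + 1;
}
```

For a chain holding 4, 1 and 2 tuples, the two tuples of the last page move into the second page, and the function reports that two pages remain in use, leaving the third free to be recycled.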




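If the page before rpage is wpage itself, the write lock on wpage would prevent _hash_freeovflpage from freeing the overflow page that rpage occupies, because while wpage is write-locked the process cannot modify wpage's pointer to the next block in the bucket chain (see the code analysis of _hash_freeovflpage). This case therefore has to be handled separately.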





