PG存储Vacuum

简单叙述vacuum的基本原理。

前言

PG的astore的delete操作只是标记删除,元组本身空间并没有释放,造成空间浪费。这时候需要vacuum进程来进行不可见元组的清理。

有两种vacuum:

  • lazy vacuum:将无效元组标记为可用(碎片空间)。
  • full vacuum:对表进行完全清理,释放无效元组空间,并整理元组。

版本链与HOT机制

PG Update的实现是先删除旧元组,再插入一条新元组。原始元组使用t_ctid指向新元组,于是通过t_ctid,新旧多个版本的元组就组成了一条版本链

如果表上创建了索引,由于元组的每次更新都会插入一条新元组,所以不论索引列是否更新都会向索引插入一条索引元组,这样会极大的降低插入性能。

可以用一个简单的思路来优化这个问题:如果更新操作不涉及索引列,那么就无需向索引插入索引元组。但是这个方法会造成版本链分散在多个块中,而磁盘IO是致命的慢。于是,PG就采用了HOT(Heap-Only Tuples)机制,使得满足以下两个条件的更新不向索引中插入元组,避免版本链分散在多个块中。

  • 更新操作不涉及索引列
  • 更新后,插入的新元组和老元组在同一个块

上图中,从ver1到ver3的红色链就是一条HOT链。HOT机制作用就是确保版本链不会跨块存储。另外,HOT还为vauccm带来一点优化,后文再述。

HOT链判断

元组的头部字段t_infomask2,新元组标记为HEAP_HOT_UPDATED,新元组会标记为HEAP_ONLY_TUPLE。图中ver2作为被更新的元组,既有only也有updated。

FSM

Free Space Map,FSM,空闲空间映射图。当数据块中进行insert、update、delete这些操作时,都会产生一个旧版本的数据。vacuum时便会清理掉这些旧版本数据,数据块便存在许多空闲的空间,如果能够将这些空间再次利用起来当然是很好的,不然新插入数据时继续分配一个新的数据块,那数据文件的膨胀速度就会非常快,因此便通过FSM来实现这一功能。

将一个page的空闲空间分为256档(一个字节,8位,2^ 8=256),那么8k/256=32,32个字节就能表示一个页面的空闲程度。

回收流程

回收基于HOT链,当HOT链上的版本部分过期时,过期元组的Linp会被设置为LP_UNUSED,同时还需要修改第一个元组的Linp指向版本链的最后一个元组,这个过程称之为重定位:

  • 将ver1的ItemIdData的状态修改为LP_REDIRECT。
  • 将ver1的ItemIdData的lp_off指向ver3的ItemIdData的位置。

Linp复用:heap-insert时查找有无空闲的Linp

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
if (PageHasFreeLinePointers(phdr))
{
/*
* Look for "recyclable" (unused) ItemId. We check for no storage
* as well, just to be paranoid --- unused items should never have
* storage.
*/
for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
{
itemId = PageGetItemId(phdr, offsetNumber);
if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId))
break;
}
if (offsetNumber >= limit)
{
/* the hint is wrong, so reset it */
PageClearHasFreeLinePointers(phdr);
}
}

当HOT链上的版本全部过期时,并不是简单地将所有链上的元组Linp设置为LP_UNUSED并删除索引元组idx1。PG为了优化批量删除idx,在删除了ver1的元组实体后,不会将ItemIdData标记为LP_UNUSED,而是标记为LP_DEAD。后面批量的删除索引后,再来将标记为LP_DEAD的元组修改为LP_UNUSED。

代码

下面介绍heap_page_prune,该函数主要用于清理单个文件块的HOT链,并进行块内碎片的整理(由PageRepairFragmentation完成)。

heap_page_prune

该函数的逻辑步骤就两步:

  • 针对块内的每个元组,调用heap_prune_chain进行HOT链清理
  • 调用heap_page_prune_execute完成块内碎片清理
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
//pruneheap.c line 182
int heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
bool report_stats, TransactionId *latestRemovedXid)
{
//省略...

/* Scan the page */
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
{
ItemId itemid;

/* Ignore items already processed as part of an earlier chain */
if (prstate.marked[offnum])
continue;

/* Nothing to do if slot is empty or already dead */
itemid = PageGetItemId(page, offnum);
if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
continue;

/* Process this item or chain of items */
ndeleted += heap_prune_chain(relation, buffer, offnum, OldestXmin, &prstate);
}
...

/* Have we found any prunable items? */
if (prstate.nredirected > 0 || prstate.ndead > 0 || prstate.nunused > 0)
{
/*
* Apply the planned item changes, then repair page fragmentation, and
* update the page's hint bit about whether it has free line pointers.
*/
heap_page_prune_execute(buffer,
prstate.redirected, prstate.nredirected,
prstate.nowdead, prstate.ndead,
prstate.nowunused, prstate.nunused);

//省略...
}

return ndeleted;
}

heap_prune_chain

遍历HOT链,判断元组是否过期,记录需要设置为LP_UNUSED、LP_DEAD、LP_REDIRECT的元组。

heap_prune_chain包含三个步骤:

步骤1:遍历元组的HOT链

步骤2:判断元组是否过期

需要注意的是,HOT链上的元组是按照版本的先后顺序排列,所以只需要找到最后一条过期的原数,它之前的所有元组都过期了。

步骤3:记录需要设置为LP_UNUSED、LP_DEAD、LP_REDIRECT的元组

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
//pruneheap.c line 354
static int
heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
TransactionId OldestXmin,
PruneState *prstate)
{
int ndeleted = 0;
Page dp = (Page) BufferGetPage(buffer);
TransactionId priorXmax = InvalidTransactionId;
ItemId rootlp;
HeapTupleHeader htup;
OffsetNumber latestdead = InvalidOffsetNumber,
maxoff = PageGetMaxOffsetNumber(dp),
offnum;
OffsetNumber chainitems[MaxHeapTuplesPerPage];
int nchain = 0,
i;
HeapTupleData tup;

tup.t_tableOid = RelationGetRelid(relation);

rootlp = PageGetItemId(dp, rootoffnum);

//省略

/* Start from the root tuple */
offnum = rootoffnum;

/* while not end of the chain
* 步骤1:遍历元组的HOT链
*/
for (;;)
{
ItemId lp;
bool tupdead,
recent_dead;

/* Some sanity checks */
if (offnum < FirstOffsetNumber || offnum > maxoff)
break;

/* If item is already processed, stop --- it must not be same chain */
if (prstate->marked[offnum])
break;

lp = PageGetItemId(dp, offnum);

/* Unused item obviously isn't part of the chain */
if (!ItemIdIsUsed(lp))
break;

/*
* If we are looking at the redirected root line pointer, jump to the
* first normal tuple in the chain. If we find a redirect somewhere
* else, stop --- it must not be same chain.
*/
if (ItemIdIsRedirected(lp))
{
if (nchain > 0)
break; /* not at start of chain */
chainitems[nchain++] = offnum;
offnum = ItemIdGetRedirect(rootlp);
continue;
}

/*
* Likewise, a dead item pointer can't be part of the chain. (We
* already eliminated the case of dead root tuple outside this
* function.)
*/
if (ItemIdIsDead(lp))
break;

Assert(ItemIdIsNormal(lp));
htup = (HeapTupleHeader) PageGetItem(dp, lp);

tup.t_data = htup;
tup.t_len = ItemIdGetLength(lp);
ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);

/*
* Check the tuple XMIN against prior XMAX, if any
*/
if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;

/*
* OK, this tuple is indeed a member of the chain.
*/
chainitems[nchain++] = offnum;

/*
* Check tuple's visibility status.
*/
tupdead = recent_dead = false;

//步骤2:判断元组是否过期
switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
{
case HEAPTUPLE_DEAD:
//元组过期
tupdead = true;
break;

case HEAPTUPLE_RECENTLY_DEAD:
//省略
break;

case HEAPTUPLE_DELETE_IN_PROGRESS:
//省略
break;

case HEAPTUPLE_LIVE:
case HEAPTUPLE_INSERT_IN_PROGRESS:
//元组未过期
break;

default:
elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
break;
}

/*
* Remember the last DEAD tuple seen. We will advance past
* RECENTLY_DEAD tuples just in case there's a DEAD one after them;
* but we can't advance past anything else. (XXX is it really worth
* continuing to scan beyond RECENTLY_DEAD? The case where we will
* find another DEAD tuple is a fairly unusual corner case.)
*/
if (tupdead)
{
latestdead = offnum;
HeapTupleHeaderAdvanceLatestRemovedXid(htup,
&prstate->latestRemovedXid);
}
else if (!recent_dead)
break;

/*
* If the tuple is not HOT-updated, then we are at the end of this
* HOT-update chain.
* 判断是否到达链尾,链尾不含HEAP_HOT_UPDATED标志。
*/
if (!HeapTupleHeaderIsHotUpdated(htup))
break;

/*
* Advance to next chain member.
*/
Assert(ItemPointerGetBlockNumber(&htup->t_ctid) ==
BufferGetBlockNumber(buffer));
offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
priorXmax = HeapTupleHeaderGetUpdateXid(htup);
}

/*
* If we found a DEAD tuple in the chain, adjust the HOT chain so that all
* the DEAD tuples at the start of the chain are removed and the root line
* pointer is appropriately redirected.
* 步骤3:记录需要设置为LP_UNUSED、LP_DEAD、LP_REDIRECT的元组
*/
if (OffsetNumberIsValid(latestdead))
{
/*
* Mark as unused each intermediate item that we are able to remove
* from the chain.
*
* When the previous item is the last dead tuple seen, we are at the
* right candidate for redirection.
*/
for (i = 1; (i < nchain) && (chainitems[i - 1] != latestdead); i++)
{
//记录需要设置为LP_UNUSED的元组
heap_prune_record_unused(prstate, chainitems[i]);
ndeleted++;
}

/*
* If the root entry had been a normal tuple, we are deleting it, so
* count it in the result. But changing a redirect (even to DEAD
* state) doesn't count.
*/
if (ItemIdIsNormal(rootlp))
ndeleted++;

/*
* If the DEAD tuple is at the end of the chain, the entire chain is
* dead and the root line pointer can be marked dead. Otherwise just
* redirect the root to the correct chain member.
*/
if (i >= nchain)
/*记录需要设置为LP_DEAD的元组i >= nchain说明HOT链上的所有元组都过期了,
*这时需要将链头设置为LP_DEAD
*/
heap_prune_record_dead(prstate, rootoffnum);
else
/*HOT链上还有元组没有过期,所以需要将链头重定位到该元组*/
heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
}
else if (nchain < 2 && ItemIdIsRedirected(rootlp))
{
/*
* We found a redirect item that doesn't point to a valid follow-on
* item. This can happen if the loop in heap_page_prune caused us to
* visit the dead successor of a redirect item before visiting the
* redirect item. We can clean up by setting the redirect item to
* DEAD state.
*/
heap_prune_record_dead(prstate, rootoffnum);
}

return ndeleted;
}

heap_page_prune_execute

heap_page_prune_execute主要包含两个步骤:

步骤1:依据heap_prune_chain的处理结果,将相关元组设置为LP_UNUSED、LP_DEAD、LP_REDIRECT。

这三种类型的元组,他们的元组实体部分都没有意义了,所以需要将他们ItemIdData中的lp_len设置为0,以此表示他们已经没有元组实体了。对于LP_REDIRECT元组,需要将其lp_off设置为重定位的ItemIdData位置,对于LP_UNUSED、LP_DEAD直接将其lp_off设置为0。具体代码实现参见:ItemIdSetUnused、ItemIdSetRedirect、ItemIdSetDead。

步骤2:调用PageRepairFragmentation对页面内进行空间整理,即删除过期元组的元组实体部分。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
void
heap_page_prune_execute(Buffer buffer,
OffsetNumber *redirected, int nredirected,
OffsetNumber *nowdead, int ndead,
OffsetNumber *nowunused, int nunused)
{
Page page = (Page) BufferGetPage(buffer);
OffsetNumber *offnum;
int i;

//步骤1:依据heap_prune_chain的处理结果,将相关元组设置为LP_UNUSED、LP_DEAD、LP_REDIRECT。
/* Update all redirected line pointers */
offnum = redirected;
for (i = 0; i < nredirected; i++)
{
OffsetNumber fromoff = *offnum++;
OffsetNumber tooff = *offnum++;
ItemId fromlp = PageGetItemId(page, fromoff);

ItemIdSetRedirect(fromlp, tooff);
}

/* Update all now-dead line pointers */
offnum = nowdead;
for (i = 0; i < ndead; i++)
{
OffsetNumber off = *offnum++;
ItemId lp = PageGetItemId(page, off);

ItemIdSetDead(lp);
}

/* Update all now-unused line pointers */
offnum = nowunused;
for (i = 0; i < nunused; i++)
{
OffsetNumber off = *offnum++;
ItemId lp = PageGetItemId(page, off);

ItemIdSetUnused(lp);
}

/*
* Finally, repair any fragmentation, and update the page's hint bit about
* whether it has free pointers.
* 步骤2:调用PageRepairFragmentation对页面内进行空间整理,即删除过期元组的元组实体部分。
*/
PageRepairFragmentation(page);
}

PageRepairFragmentation

红色表示需要删除的元组,绿色表示需要保留的元组,那么回收的最便捷方式就是将TUPLE3-TUPLE1依次移动到块末尾.

PageRepairFragmentation主要包含三个步骤:

步骤1:统计块内需要保留的元组数量。

如果块内没有元组需要保留,则直接修改块的pd_upper指针让其指向块末尾。

步骤2:记录块内需要保留的元组。

步骤3:调用compactify_tuples实现实际的元组删除。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
void
PageRepairFragmentation(Page page)
{
Offset pd_lower = ((PageHeader) page)->pd_lower;
Offset pd_upper = ((PageHeader) page)->pd_upper;
Offset pd_special = ((PageHeader) page)->pd_special;
ItemId lp;
int nline,
nstorage,
nunused;
int i;
Size totallen;

/*
* It's worth the trouble to be more paranoid here than in most places,
* because we are about to reshuffle data in (what is usually) a shared
* disk buffer. If we aren't careful then corrupted pointers, lengths,
* etc could cause us to clobber adjacent disk buffers, spreading the data
* loss further. So, check everything.
*/
if (pd_lower < SizeOfPageHeaderData ||
pd_lower > pd_upper ||
pd_upper > pd_special ||
pd_special > BLCKSZ ||
pd_special != MAXALIGN(pd_special))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
pd_lower, pd_upper, pd_special)));

nline = PageGetMaxOffsetNumber(page);
nunused = nstorage = 0;
//步骤1:统计块内需要保留的元组数量。
for (i = FirstOffsetNumber; i <= nline; i++)
{
lp = PageGetItemId(page, i);
if (ItemIdIsUsed(lp))
{
if (ItemIdHasStorage(lp))
nstorage++;
}
else
{
/* Unused entries should have lp_len = 0, but make sure */
ItemIdSetUnused(lp);
nunused++;
}
}

if (nstorage == 0)
{
/* Page is completely empty, so just reset it quickly
* 如果块内没有元组需要保留,则直接修改块的pd_upper指针让其指向块末尾。
*/
((PageHeader) page)->pd_upper = pd_special;
}
else
{
/* Need to compact the page the hard way */
itemIdSortData itemidbase[MaxHeapTuplesPerPage];
itemIdSort itemidptr = itemidbase;

//步骤2:记录块内需要保留的元组。
totallen = 0;
for (i = 0; i < nline; i++)
{
lp = PageGetItemId(page, i + 1);
if (ItemIdHasStorage(lp))
{
itemidptr->offsetindex = i;
itemidptr->itemoff = ItemIdGetOffset(lp);
if (itemidptr->itemoff < (int) pd_upper ||
itemidptr->itemoff >= (int) pd_special)
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("corrupted item pointer: %u",
itemidptr->itemoff)));
itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
totallen += itemidptr->alignedlen;
itemidptr++;
}
}

if (totallen > (Size) (pd_special - pd_lower))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("corrupted item lengths: total %u, available space %u",
(unsigned int) totallen, pd_special - pd_lower)));

//步骤3:调用compactify_tuples实现实际的元组删除。
compactify_tuples(itemidbase, nstorage, page);
}

/* Set hint bit for PageAddItem */
if (nunused > 0)
PageSetHasFreeLinePointers(page);
else
PageClearHasFreeLinePointers(page);
}

compactify_tuples

compactify_tuples包含两个步骤:

  • 步骤1:排序

    通过图5和图6,不难发现,元组移动的先后顺序要和元组偏移的大小顺序保持一致,所以在整理之前,需要先将元组按照偏移的大小顺序降序排列。

  • 步骤2:整理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
static void
compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
{
PageHeader phdr = (PageHeader) page;
Offset upper;
int i;

/* sort itemIdSortData array into decreasing itemoff order
* 步骤1:排序
*/
qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
itemoffcompare);

//步骤2:整理
upper = phdr->pd_special;
for (i = 0; i < nitems; i++)
{
itemIdSort itemidptr = &itemidbase[i];
ItemId lp;

lp = PageGetItemId(page, itemidptr->offsetindex + 1);
upper -= itemidptr->alignedlen;
memmove((char *) page + upper,
(char *) page + itemidptr->itemoff,
itemidptr->alignedlen);
lp->lp_off = upper;
}

phdr->pd_upper = upper;
}

小思考:向一个新分配的块中插入元组,元组ItemData的顺序和元组实体offset之前应该是逆序关系。但是,随着元组的删除,元组ItemData的重用,会导致元组ItemData和元组实体offset之间完全乱序,这就是为什么在compactify_tuples中必须对元组排序的原因。

参考

整篇文章大部分基于下方的博客,写得很好。

https://blog.csdn.net/obvious__/article/details/121318928

作者

Desirer

发布于

2024-10-04

更新于

2024-11-15

许可协议