pg_control为什么会存在?
为啥会有pg_control这么个文件呢? pg_control是PostgreSQL中一个很重要的文件,我们之前讲到过PostgreSQL的启动过程,启动过程中很重要的一项工作就是故障恢复,启动startup进程,回放WAL日志进行故障恢复,而从哪里开始进行回放呢?我怎么知道起点在哪里呢?这个位置的保存一定是在磁盘中,而不是在内存中,假设数据库因故障崩溃,内存中的数据会丢失,所以,只有checkpointer进程在做checkpoint操作时不断的更新pg_control文件,使之持久化保存,数据库启动进行故障恢复时,读取该文件,获得故障恢复的起始位置。
除了保存检查点信息,还保存一些其他的状态等信息,用于数据库启动等。比如数据库状态信息,系统表版本号等。
看下面的代码,数据库启动时会检查pg_control文件,如果文件被损坏,数据库就会启动失败。
main()
--> MemoryContextInit() // 初始化内存上下文: TopMemoryContext、ErrorContext
--> PostmasterMain(argc, argv); // Postmaster main entry point--> pqsignal_pm(SIGCHLD, reaper); /* handle child termination */ // 注册信号处理函数--> checkDataDir(); // 检查数据目录--> ValidatePgVersion(DataDir); // 检查PG_VERSION文件,PG实例版本是否与程序兼容--> checkControlFile(); // 检查pg_control文件--> CreateDataDirLockFile(true); // 创建postmaster.pid文件--> LocalProcessControlFile(false); // 读pg_control,到ControlFileData中--> ReadControlFile();
startup进程从pg_control中获取故障恢复起点:
StartupProcessMain(void)
--> StartupXLOG();--> ValidateXLOGDirectoryStructure(); // 检查pg_wal是否存在--> readRecoverySignalFile(); // 依据standby.signal和recovery.signal是否存在,判断进入何种状态--> validateRecoveryParameters();if (read_backup_label(&checkPointLoc, &backupEndRequired, &backupFromStandby)){// 如果backup_label文件存在,则表示从备份文件中进行恢复(例如使用pg_basebackup进行备份)// 此种情况,设置backup_label,而不是用pg_control,为啥呢?下面就是解释/** If we see a backup_label during recovery, we assume that we are recovering* from a backup dump file, and we therefore roll forward from the checkpoint* identified by the label file, NOT what pg_control says. This avoids the* problem that pg_control might have been archived one or more checkpoints* later than the start of the dump, and so if we rely on it as the start* point, we will fail to restore a consistent database state.*/}else{/* Get the last valid checkpoint record. */checkPointLoc = ControlFile->checkPoint; // 从pg_control中获取检查点信息RedoStartLSN = ControlFile->checkPointCopy.redo;record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, true);--> XLogBeginRead(xlogreader, RecPtr); // Begin reading WAL at 'RecPtr'.--> ReadRecord(xlogreader, LOG, true); // Attempt to read the next XLOG record.for (;;){XLogReadRecord(xlogreader, &errormsg); // Attempt to read an XLOG record.}}
pg_control文件的内容
pg_control保存在PGDATA/global/pg_control中,我们通过pg_controldata查看其具体内容:
其中有2个是非常重要的,
Latest checkpoint location: 0/D04FD00 -- 最后一次的checkpoint位置
Latest checkpoint's REDO location: 0/D04FCC8 -- 重做点位置,非常重要,崩溃恢复时回放的起点
怎么解释呢?
就是说checkpoint操作是需要一定时间的,在开始进行checkpoint时,先记录当前点为Latest checkpoint REDO location,作为重做点。当完成刷盘操作后,把checkpoint相关信息也生成一条WAL记录,再把这个WAL记录也写入WAL日志文件中,位置就是Latest checkpoint location。
如果checkpoint的过程中,节点故障挂掉了?这种情况,如果checkpoint没有完成,那么其pg_control文件就没有被更新,也就是说pg_control还是上次的pg_control文件,再次启动时,尽管已经做了部分checkpoint操作,但是仍然从老的重做点开始回放,回放具有幂等性,仍能进行正常的恢复操作。
postgres@slpc:~/pgsql$ ./bin/pg_controldata -D masternode/
pg_control version number: 1300
Catalog version number: 202107181 -- 系统表版本号
Database system identifier: 7242131451622390647 -- 数据库系统标识符,用于标识一套数据库系统,物理复制的主备库拥有相同的数据库唯一标识串,initdb时生成
Database cluster state: in production
pg_control last modified: 2023年08月03日 星期四 18时17分24秒
Latest checkpoint location: 0/D04FD00 -- 最后一次的checkpoint位置
Latest checkpoint's REDO location: 0/D04FCC8 -- 重做点
Latest checkpoint's REDO WAL file: 00000001000000000000000D
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:25998
Latest checkpoint's NextOID: 24641
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 726
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 25998
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: 2023年08月03日 星期四 18时17分18秒
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0 -- 备机用,用于指定当备考异常终止再启动时,只有应用WAL日志过指定点后才能对外提供只读服务,否则,用户读到的数据可能会不一致。
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 5faa9a77552e9f84e1dadd05fabcc97d121cc211905bd02386eb881dd347c198
关于其具体含义,可以看下面的定义,更多可参考src/include/catalog/pg_control.h
的定义。
/* Contents of pg_control. */
typedef struct ControlFileData
{/* Unique system identifier --- to ensure we match up xlog files with the installation that produced them. */uint64 system_identifier;uint32 pg_control_version; /* PG_CONTROL_VERSION */uint32 catalog_version_no; /* see catversion.h *//* System status data */DBState state; /* see enum above */pg_time_t time; /* time stamp of last pg_control update */XLogRecPtr checkPoint; /* last check point record ptr */CheckPoint checkPointCopy; /* copy of last check point record */XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels *//** These two values determine the minimum point we must recover up to* before starting up:** minRecoveryPoint is updated to the latest replayed LSN whenever we* flush a data change during archive recovery. That guards against* starting archive recovery, aborting it, and restarting with an earlier* stop location. If we've already flushed data changes from WAL record X* to disk, we mustn't start up until we reach X again. Zero when not* doing archive recovery.** backupStartPoint is the redo pointer of the backup start checkpoint, if* we are recovering from an online backup and haven't reached the end of* backup yet. It is reset to zero when the end of backup is reached, and* we mustn't start up before that. A boolean would suffice otherwise, but* we use the redo pointer as a cross-check when we see an end-of-backup* record, to make sure the end-of-backup record corresponds the base* backup we're recovering from.** backupEndPoint is the backup end location, if we are recovering from an* online backup which was taken from the standby and haven't reached the* end of backup yet. It is initialized to the minimum recovery point in* pg_control which was backed up last. It is reset to zero when the end* of backup is reached, and we mustn't start up before that.** If backupEndRequired is true, we know for sure that we're restoring* from a backup, and must see a backup-end record before we can safely* start up. If it's false, but backupStartPoint is set, a backup_label* file was found at startup but it may have been a leftover from a stray* pg_start_backup() call, not accompanied by pg_stop_backup().*/XLogRecPtr minRecoveryPoint;TimeLineID minRecoveryPointTLI;XLogRecPtr backupStartPoint;XLogRecPtr backupEndPoint;bool backupEndRequired;/* Parameter settings that determine if the WAL can be used for archival or hot standby. */int wal_level;bool wal_log_hints;int MaxConnections;int max_worker_processes;int max_wal_senders;int max_prepared_xacts;int max_locks_per_xact;bool track_commit_timestamp;/** This data is used to check for hardware-architecture compatibility of* the database and the backend executable. We need not check endianness* explicitly, since the pg_control version will surely look wrong to a* machine of different endianness, but we do need to worry about MAXALIGN* and floating-point format. (Note: storage layout nominally also* depends on SHORTALIGN and INTALIGN, but in practice these are the same* on all architectures of interest.)** Testing just one double value is not a very bulletproof test for* floating-point compatibility, but it will catch most cases.*/uint32 maxAlign; /* alignment requirement for tuples */double floatFormat; /* constant 1234567.0 */
#define FLOATFORMAT_VALUE 1234567.0/* This data is used to make sure that configuration of this database is compatible with the backend executable.*/uint32 blcksz; /* data block size for this DB */uint32 relseg_size; /* blocks per segment of large relation */uint32 xlog_blcksz; /* block size within WAL files */uint32 xlog_seg_size; /* size of each WAL segment */uint32 nameDataLen; /* catalog name field width */uint32 indexMaxKeys; /* max number of columns in an index */uint32 toast_max_chunk_size; /* chunk size in TOAST tables */uint32 loblksize; /* chunk size in pg_largeobject */bool float8ByVal; /* float8, int8, etc pass-by-value? *//* Are data pages protected by checksums? Zero if no checksum version */uint32 data_checksum_version;/* Random nonce, used in authentication requests that need to proceed* based on values that are cluster-unique, like a SASL exchange that* failed at an early stage. */char mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];pg_crc32c crc; /* CRC of all above ... MUST BE LAST! */
} ControlFileData;
pg_controldata
可通过pg_controldata -D masternode/
这种形式查看pg_control文件的内容,我们看一下其主流程,就是读pg_control文件,然后将内容进行解析。
main(int argc, char *argv[])
{ControlFileData *ControlFile;/* get a copy of the control file */ControlFile = get_controlfile(DataDir, &crc_ok);printf(_("pg_control version number: %u\n"), ControlFile->pg_control_version);printf(_("Catalog version number: %u\n"), ControlFile->catalog_version_no);printf(_("Database system identifier: %llu\n"), (unsigned long long) ControlFile->system_identifier);printf(_("Database cluster state: %s\n"), dbState(ControlFile->state));printf(_("pg_control last modified: %s\n"), pgctime_str);printf(_("Latest checkpoint location: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPoint));printf(_("Latest checkpoint's REDO location: %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));printf(_("Latest checkpoint's REDO WAL file: %s\n"), xlogfilename);// 其他信息...
}
// 读pg_control文件到ControlFileData中
ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p)
{ControlFileData *ControlFile;ControlFile = palloc(sizeof(ControlFileData));fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0);r = read(fd, ControlFile, sizeof(ControlFileData));close(fd);return ControlFile;
}
参考文档:
PostgreSQL故障恢复能力之检查点(Checkpoint)
He3DB恢复过程源码分析系列