PostgreSQL Source Code Analysis: pg_control

Why does pg_control exist?

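Why is there a file called pg_control at all? pg_control is one of the most important files in PostgreSQL. As covered earlier when discussing the PostgreSQL startup sequence, a key task during startup is crash recovery: the startup process is launched and replays WAL to bring the database back to a consistent state. But where should replay begin? That position has to be stored on disk rather than in memory, because everything in memory is lost if the database crashes. So the checkpointer process updates pg_control every time it performs a checkpoint, persisting the position, and at startup crash recovery reads the file to find out where replay must start.

Besides checkpoint information, pg_control stores other state that is needed at startup, such as the database cluster state and the catalog version number.

As the code below shows, the postmaster checks pg_control at startup. If the file is missing or corrupted, the database fails to start.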

main()
--> MemoryContextInit()          // initialize memory contexts: TopMemoryContext, ErrorContext
--> PostmasterMain(argc, argv);  // Postmaster main entry point
    --> pqsignal_pm(SIGCHLD, reaper);   /* handle child termination */ // register the signal handler
    --> checkDataDir();                 // check the data directory
        --> ValidatePgVersion(DataDir); // check PG_VERSION: is the cluster version compatible with this binary?
    --> checkControlFile();             // check the pg_control file
    --> CreateDataDirLockFile(true);    // create postmaster.pid
    --> LocalProcessControlFile(false); // read pg_control into ControlFileData
        --> ReadControlFile();

The startup process obtains the recovery starting point from pg_control:

StartupProcessMain(void)
--> StartupXLOG();
    --> ValidateXLOGDirectoryStructure();   // check that pg_wal exists
    --> readRecoverySignalFile();           // decide which recovery mode to enter, based on whether standby.signal and recovery.signal exist
    --> validateRecoveryParameters();
    if (read_backup_label(&checkPointLoc, &backupEndRequired, &backupFromStandby))
    {
        // If a backup_label file exists, we are recovering from a backup (for example one taken with pg_basebackup).
        // In that case the checkpoint named in backup_label is used instead of the one in pg_control. The comment below explains why.
        /*
         * If we see a backup_label during recovery, we assume that we are recovering
         * from a backup dump file, and we therefore roll forward from the checkpoint
         * identified by the label file, NOT what pg_control says.  This avoids the
         * problem that pg_control might have been archived one or more checkpoints
         * later than the start of the dump, and so if we rely on it as the start
         * point, we will fail to restore a consistent database state.
         */
    }
    else
    {
        /* Get the last valid checkpoint record. */
        checkPointLoc = ControlFile->checkPoint;            // take the checkpoint location from pg_control
        RedoStartLSN = ControlFile->checkPointCopy.redo;
        record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, true);
        --> XLogBeginRead(xlogreader, RecPtr);     // Begin reading WAL at 'RecPtr'.
        --> ReadRecord(xlogreader, LOG, true);     // Attempt to read the next XLOG record.
            for (;;)
            {
                XLogReadRecord(xlogreader, &errormsg);    // Attempt to read an XLOG record.
            }
    }

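Contents of the pg_control file

pg_control is stored at PGDATA/global/pg_control. Its contents can be inspected with pg_controldata.
Two fields are especially important:

Latest checkpoint location:           0/D04FD00       -- location of the most recent checkpoint record
Latest checkpoint's REDO location:    0/D04FCC8       -- the redo point; very important, this is where replay starts during crash recovery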

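How should these two values be interpreted?
[Figure: checkpoint.png]
A checkpoint takes time to complete. When a checkpoint begins, the current WAL position is recorded as the redo point (Latest checkpoint's REDO location). Once the dirty pages have been flushed to disk, the checkpoint information is itself written to the WAL as a record, and the position of that record is the Latest checkpoint location.

What if the node crashes in the middle of a checkpoint? Because the checkpoint did not complete, pg_control was not updated and still describes the previous checkpoint. On the next start, even though part of the checkpoint work had already been done, replay begins from the old redo point. Replay is idempotent, so recovery still completes correctly.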
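The idempotence of replay can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which every page remembers the LSN of the last WAL record applied to it, and a record is skipped when the page is already at or beyond that LSN. The names `Page`, `WalRecord`, `replay_record`, and `replay_from` are hypothetical and are not PostgreSQL's actual structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a data page and a WAL record. */
typedef struct Page
{
    uint64_t lsn;    /* LSN of the last record applied to this page */
    int      value;  /* the page's "contents" */
} Page;

typedef struct WalRecord
{
    uint64_t lsn;    /* position of this record in the WAL */
    int      delta;  /* change to apply: value += delta */
} WalRecord;

/* Apply one record unless the page already reflects it. Returns true if applied. */
bool
replay_record(Page *page, const WalRecord *rec)
{
    if (page->lsn >= rec->lsn)
        return false;           /* change already on the page: skip it */
    page->value += rec->delta;
    page->lsn = rec->lsn;
    return true;
}

/* Replay every record at or after the redo point; returns how many were applied. */
size_t
replay_from(Page *page, const WalRecord *recs, size_t nrecs, uint64_t redo_lsn)
{
    size_t applied = 0;

    for (size_t i = 0; i < nrecs; i++)
    {
        if (recs[i].lsn < redo_lsn)
            continue;           /* before the redo point: not needed */
        if (replay_record(page, &recs[i]))
            applied++;
    }
    return applied;
}
```

In this model, replaying the same records a second time from the old redo point applies nothing and leaves the page unchanged, which is why starting from an older redo point than strictly necessary is safe.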

postgres@slpc:~/pgsql$ ./bin/pg_controldata -D masternode/
pg_control version number:            1300
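Catalog version number:               202107181                 -- catalog version number
Database system identifier:           7242131451622390647       -- identifies one database cluster; generated by initdb; the primary and standbys in a physical replication setup share the same identifier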
Database cluster state:               in production
pg_control last modified:             2023年08月03日 星期四 18时17分24秒
Latest checkpoint location:           0/D04FD00       -- location of the most recent checkpoint record
Latest checkpoint's REDO location:    0/D04FCC8       -- the redo point
Latest checkpoint's REDO WAL file:    00000001000000000000000D
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:25998
Latest checkpoint's NextOID:          24641
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        726
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  25998
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            2023年08月03日 星期四 18时17分18秒
Fake LSN counter for unlogged rels:   0/3E8
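Minimum recovery ending location:     0/0         -- used on standbys: if a standby terminates abnormally and is restarted, it may serve read-only queries only after WAL has been replayed past this point; otherwise users could read inconsistent data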
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                off
max_connections setting:              100
max_worker_processes setting:         8
max_wal_senders setting:              10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            5faa9a77552e9f84e1dadd05fabcc97d121cc211905bd02386eb881dd347c198
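The relationship between the REDO location and the REDO WAL file in the output above can be worked out by hand: an LSN printed as `hi/lo` is a 64-bit byte position, and dividing it by the WAL segment size gives the segment number that appears in the file name. The sketch below reproduces that arithmetic. The function names `parse_lsn` and `wal_file_name` are hypothetical helpers written for this example, and the naming layout (timeline, then the segment number split into two 8-digit hex fields) is an assumption that happens to agree with the values shown above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "hi/lo" hex text such as "0/D04FCC8" into a 64-bit LSN. Returns 0 on success, -1 on error. */
int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi, lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/* Build the 24-character name of the WAL segment file that contains the given LSN. */
void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn, uint32_t wal_segsz)
{
    uint64_t segno = lsn / wal_segsz;                              /* which segment */
    uint64_t segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz;  /* segments per 4 GB */

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}
```

With the values from the output (REDO location 0/D04FCC8, TimeLineID 1, 16777216 bytes per WAL segment), the segment number is 0xD04FCC8 / 0x1000000 = 0xD, which gives the file name 00000001000000000000000D.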

For the meaning of each field, see the definition below. More detail can be found in src/include/catalog/pg_control.h.

/* Contents of pg_control. */
typedef struct ControlFileData
{
    /*
     * Unique system identifier --- to ensure we match up xlog files with the
     * installation that produced them.
     */
    uint64      system_identifier;

    uint32      pg_control_version; /* PG_CONTROL_VERSION */
    uint32      catalog_version_no; /* see catversion.h */

    /* System status data */
    DBState     state;          /* see enum above */
    pg_time_t   time;           /* time stamp of last pg_control update */
    XLogRecPtr  checkPoint;     /* last check point record ptr */

    CheckPoint  checkPointCopy; /* copy of last check point record */

    XLogRecPtr  unloggedLSN;    /* current fake LSN value, for unlogged rels */

    /*
     * These two values determine the minimum point we must recover up to
     * before starting up:
     *
     * minRecoveryPoint is updated to the latest replayed LSN whenever we
     * flush a data change during archive recovery. That guards against
     * starting archive recovery, aborting it, and restarting with an earlier
     * stop location. If we've already flushed data changes from WAL record X
     * to disk, we mustn't start up until we reach X again. Zero when not
     * doing archive recovery.
     *
     * backupStartPoint is the redo pointer of the backup start checkpoint, if
     * we are recovering from an online backup and haven't reached the end of
     * backup yet. It is reset to zero when the end of backup is reached, and
     * we mustn't start up before that. A boolean would suffice otherwise, but
     * we use the redo pointer as a cross-check when we see an end-of-backup
     * record, to make sure the end-of-backup record corresponds the base
     * backup we're recovering from.
     *
     * backupEndPoint is the backup end location, if we are recovering from an
     * online backup which was taken from the standby and haven't reached the
     * end of backup yet. It is initialized to the minimum recovery point in
     * pg_control which was backed up last. It is reset to zero when the end
     * of backup is reached, and we mustn't start up before that.
     *
     * If backupEndRequired is true, we know for sure that we're restoring
     * from a backup, and must see a backup-end record before we can safely
     * start up. If it's false, but backupStartPoint is set, a backup_label
     * file was found at startup but it may have been a leftover from a stray
     * pg_start_backup() call, not accompanied by pg_stop_backup().
     */
    XLogRecPtr  minRecoveryPoint;
    TimeLineID  minRecoveryPointTLI;
    XLogRecPtr  backupStartPoint;
    XLogRecPtr  backupEndPoint;
    bool        backupEndRequired;

    /*
     * Parameter settings that determine if the WAL can be used for archival
     * or hot standby.
     */
    int         wal_level;
    bool        wal_log_hints;
    int         MaxConnections;
    int         max_worker_processes;
    int         max_wal_senders;
    int         max_prepared_xacts;
    int         max_locks_per_xact;
    bool        track_commit_timestamp;

    /*
     * This data is used to check for hardware-architecture compatibility of
     * the database and the backend executable.  We need not check endianness
     * explicitly, since the pg_control version will surely look wrong to a
     * machine of different endianness, but we do need to worry about MAXALIGN
     * and floating-point format.  (Note: storage layout nominally also
     * depends on SHORTALIGN and INTALIGN, but in practice these are the same
     * on all architectures of interest.)
     *
     * Testing just one double value is not a very bulletproof test for
     * floating-point compatibility, but it will catch most cases.
     */
    uint32      maxAlign;       /* alignment requirement for tuples */
    double      floatFormat;    /* constant 1234567.0 */
#define FLOATFORMAT_VALUE   1234567.0

    /*
     * This data is used to make sure that configuration of this database is
     * compatible with the backend executable.
     */
    uint32      blcksz;         /* data block size for this DB */
    uint32      relseg_size;    /* blocks per segment of large relation */

    uint32      xlog_blcksz;    /* block size within WAL files */
    uint32      xlog_seg_size;  /* size of each WAL segment */

    uint32      nameDataLen;    /* catalog name field width */
    uint32      indexMaxKeys;   /* max number of columns in an index */

    uint32      toast_max_chunk_size;   /* chunk size in TOAST tables */
    uint32      loblksize;      /* chunk size in pg_largeobject */

    bool        float8ByVal;    /* float8, int8, etc pass-by-value? */

    /* Are data pages protected by checksums? Zero if no checksum version */
    uint32      data_checksum_version;

    /*
     * Random nonce, used in authentication requests that need to proceed
     * based on values that are cluster-unique, like a SASL exchange that
     * failed at an early stage.
     */
    char        mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];

    /* CRC of all above ... MUST BE LAST! */
    pg_crc32c   crc;
} ControlFileData;
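The struct ends with a `crc` field that covers every byte before it, which is how a damaged pg_control can be detected. The sketch below shows the general technique on a much smaller, hypothetical struct (`MiniControl`), using a plain bitwise CRC-32C written for this example. PostgreSQL has its own CRC-32C routines behind the `pg_crc32c` type, so treat this only as an illustration of the "checksum as the last field" layout, not as the server's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. Slow but simple. */
uint32_t
crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical miniature control record; the checksum must be the last field. */
typedef struct MiniControl
{
    uint64_t system_identifier;
    uint32_t version;
    uint32_t state;
    uint32_t crc;               /* CRC of all fields above */
} MiniControl;

/* Compute and store the checksum over everything that precedes the crc field. */
void
mini_control_seal(MiniControl *c)
{
    c->crc = crc32c(c, offsetof(MiniControl, crc));
}

/* Recompute the checksum and compare it with the stored one. */
bool
mini_control_crc_ok(const MiniControl *c)
{
    return c->crc == crc32c(c, offsetof(MiniControl, crc));
}
```

Using `offsetof(MiniControl, crc)` as the length is what makes the "must be last" rule matter: any field placed after `crc` would not be covered by the checksum.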

pg_controldata

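The contents of pg_control can be viewed with a command such as pg_controldata -D masternode/. The main flow of the tool is simple: read the pg_control file, then decode and print its contents.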

int
main(int argc, char *argv[])
{
    ControlFileData *ControlFile;
    bool        crc_ok;

    /* get a copy of the control file */
    ControlFile = get_controlfile(DataDir, &crc_ok);

    printf(_("pg_control version number:            %u\n"), ControlFile->pg_control_version);
    printf(_("Catalog version number:               %u\n"), ControlFile->catalog_version_no);
    printf(_("Database system identifier:           %llu\n"), (unsigned long long) ControlFile->system_identifier);
    printf(_("Database cluster state:               %s\n"), dbState(ControlFile->state));
    printf(_("pg_control last modified:             %s\n"), pgctime_str);
    printf(_("Latest checkpoint location:           %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPoint));
    printf(_("Latest checkpoint's REDO location:    %X/%X\n"), LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));
    printf(_("Latest checkpoint's REDO WAL file:    %s\n"), xlogfilename);
    // other fields...
}

// Read the pg_control file into a ControlFileData (simplified excerpt; error handling omitted)
ControlFileData *
get_controlfile(const char *DataDir, bool *crc_ok_p)
{
    ControlFileData *ControlFile;
    int         fd;
    int         r;
    pg_crc32c   crc;
    char        ControlFilePath[MAXPGPATH];

    ControlFile = palloc(sizeof(ControlFileData));
    snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);

    fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0);
    r = read(fd, ControlFile, sizeof(ControlFileData));
    close(fd);

    /* Check the CRC and report the result to the caller. */
    INIT_CRC32C(crc);
    COMP_CRC32C(crc, (char *) ControlFile, offsetof(ControlFileData, crc));
    FIN_CRC32C(crc);
    *crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);

    return ControlFile;
}

