关闭守护系统时,必须按照一定的顺序来关闭守护进程和数据库实例。特别是自动切换 模式,如果退出守护进程或主备库的顺序不正确,可能会引起主备切换,甚至造成守护进程 DM 数据守护与读写分离集群组分裂。
官方推荐通过在监视器执行stop group命令关闭守护系统:
有时监视器里操作被认为不太方便,比如要login输入SYSDBA复杂的口令,而通过disql可以通过本地认证免密登录数据库。下面官方也推荐了disql操作方式:
如上可见,为防止发生自动接管,首先需要先关停确认监视器。然后再关闭守护进程。守护进程的没有严格的关闭顺序。之所以先关闭备库的watcher,是考虑到操作方便。因为确认监视器往往部署在备机。之所以最后关闭主库的watcher,同样是为了操作方便。因为要随后停数据库是有严格顺序的,先停主,再停备。
为什么要先停主库,能不能先停备库?先停备库会导致主库由于归档发送失败导致短暂进入suspend状态,备库再次重启后需要进入一个恢复阶段。大多数情况系统会完美地处理上述问题,但不能排除极少数情况下发生异常。
下面通过日志观察备库关闭再启动,系统做了什么。
#关停备库的守护进程
[dmdba@node2 bin]$ ./DmWatcherServiceTEST2 stop
Stopping DmWatcherServiceTEST2: [ OK ]#关停备库。之所以先关守护进程,是防止备库自动被守护进程重新拉起
[dmdba@node2 bin]$ ./DmServiceTEST2 stop
Stopping DmServiceTEST2: [ OK ]#查看备库归档日志的LSN号
[dmdba@node2 bin]$ ./dmrachk arch_path=/home/dmdba/dmarch
rachk V8
the database pmnt_magic: 1859674653
the database db_magic: 1291794688
/*******************************************************************/
archive file /home/dmdba/dmarch/STANDBY_ARCHIVE_0x25335316_EP0_2024-06-23_01-43-11.log itemize.
version : 0x7007
status : INACTIVE
enc_flag : FALSE
n_rpkgs : 22
pmnt_magic : 1859674653
db_magic : 1291794688
src_db_magic : 624120598
arch_lsn : 1724574716
arch_seq : 1430505
clsn : 1724574737
next_seq : 1430526
g_next_seq : 1430526
file len : 15360
file free : 15360
dsc seqno : 0
dsc node : 1
create time : 2024-06-23 01:43:11.914690 time_zone:+08:00
close time : 2024-06-23 01:43:12.061566 time_zone:+08:00
llog first time : 1900-01-01 00:00:00.000000 time_zone:+08:00
llog last time : 1900-01-01 00:00:00.000000 time_zone:+08:00
/*******************************************************************/
/*******************************************************************/
archive file /home/dmdba/dmarch/STANDBY_ARCHIVE_0x25335316_EP0_2024-06-23_01-43-12.log itemize.
version : 0x7007
status : INACTIVE
enc_flag : FALSE
n_rpkgs : 32
pmnt_magic : 1859674653
db_magic : 1291794688
src_db_magic : 624120598
arch_lsn : 1724574738
arch_seq : 1430527
clsn : 1724574775
next_seq : 1430558
g_next_seq : 1430558
file len : 21504
file free : 21504
dsc seqno : 0
dsc node : 1
create time : 2024-06-23 01:43:12.562146 time_zone:+08:00
close time : 2024-06-23 01:44:49.847657 time_zone:+08:00
llog first time : 1900-01-01 00:00:00.000000 time_zone:+08:00
llog last time : 1900-01-01 00:00:00.000000 time_zone:+08:00
/*******************************************************************/The SUMMARY(seqno[0]):
total files: 2
okey files: 2
fail file: 0
repeat file: 0the rachk tool running cost 0.869 ms, code:0#启动守护进程,备库自动拉起
[dmdba@node2 bin]$ ./DmWatcherServiceTEST2 start
Starting DmWatcherServiceTEST2: [ OK ]
下面查看主库日志
日志内容比较多,节选几个关键部分
向备库发送实时归档(lsn:1724574777)失败,节点切换为SUSPEND状态,守护进程状态修改为DW_FAILOVER状态。
2024-06-23 01:44:47.931 [ERROR] database P0000001104 T0000000000000001518 [mal recv for arch] mal receive from site(TEST2) success, but site(TEST2) apply failed, begin lsn:1724574777, end lsn:1724574777, code:-518
2024-06-23 01:44:47.931 [ERROR] database P0000001104 T0000000000000001518 send realtime archive to instance[TEST2] failed, code = -518, begin_lsn = 1724574777, end_lsn = 1724574777!
2024-06-23 01:44:47.931 [INFO] database P0000001104 T0000000000000001518 Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=217, dseq=1719099861, name_in=, begin_lsn=-1!
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, begin, msg_dseq:1719099861
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554 set g_dw_stat from NONE to DW_FAILOVER success, g_dw_recover_stop is 0
2024-06-23 01:44:49.934 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, finished, msg_dseq:1719099861, set code:0
归档状态修改为INVALID状态,保证后续数据库open后不再传输实时归档。
数据库恢复为open,业务数据可以继续写入(RLOG)
守护进程状态由FAILOVER恢复为NONE
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001558 Change TEST2 arch status from VALID to INVALID
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, received sql exec cmd:1, dseq:1719099863, sql:ALTER DATABASE OPEN FORCE
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=1, dseq=1719099863, name_in=, begin_lsn=-1!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 pha_altdb_open_for_dw alter database open start...
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 pha_altdb_open_for_dw, altdb set changing end!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 altdb open rlog_flush_notify_ex start!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 altdb open rlog_flush_notify_ex end!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001554 pha_altdb_open_for_dw alter database open success!
2024-06-23 01:44:50.936 [INFO] database P0000001104 T0000000000000001559 total 0 active crash trx, pseg_crash_trx_rollbacksys_only(0) begin ...
2024-06-23 01:44:50.937 [INFO] database P0000001104 T0000000000000001559 pseg_crash_trx_rollback end, total 0 active crash trx, include 0 empty_trxs, 0 empty_pages which only need to delete mgr recs.
2024-06-23 01:44:50.937 [INFO] database P0000001104 T0000000000000001559 pseg_crash_trx_rollback end
2024-06-23 01:44:50.937 [INFO] database P0000001104 T0000000000000001559 hpc_clear_active_trx adjust n_crash_active_trx from 0 to 0.
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=217, dseq=1719099864, name_in=, begin_lsn=-1!
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, begin, msg_dseq:1719099864
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554 set g_dw_stat from DW_FAILOVER to NONE success, g_dw_recover_stop is 0
2024-06-23 01:44:51.938 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, finished, msg_dseq:1719099864, set code:0
备库重启后,主库守护进程状态设置为DW_RECOVERY,准备异步传输归档
...
#连接备库失败
2024-06-23 01:47:46.577 [WARNING] database P0000001104 T0000000000000001507 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2024-06-23 01:47:46.577 [WARNING] database P0000001104 T0000000000000001507 site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2024-06-23 01:47:46.577 [WARNING] database P0000001104 T0000000000000001506 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2024-06-23 01:47:46.578 [INFO] database P0000001104 T0000000000000001506 send CMD_MAL_LINK_CHECK(350): (mal_id:21784098, stmt_id:71964, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)
2024-06-23 01:47:46.578 [INFO] database P0000001104 T0000000000000001506 send CMD_MAL_LINK_CHECK(350): (mal_id:21784098, stmt_id:71966, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:4467)
#重新建立到备库的连接
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001514 mal_site_ctl_link_create startup from mal_site(0) to mal_site(1)!
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001514 mal_site_magic_gen site_magic[39959], src_site:0, dst_site:1
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001514 site[0] mal_site_ctl_port_set to site[1, IP: 192.168.100.101, port_num: 5800], socket handle = 52, site_magic = 39959
2024-06-23 01:47:52.515 [INFO] database P0000001104 T0000000000000001507 mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:52.516 [INFO] database P0000001104 T0000000000000001506 mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:52.523 [INFO] database P0000001104 T0000000000000001513 site[0] mal_site_data_port_set from site[1, IP: 192.168.100.101, port_num: 5800], socket handle = 53, site_magic = 39959
2024-06-23 01:47:52.523 [INFO] database P0000001104 T0000000000000001505 mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:52.523 [INFO] database P0000001104 T0000000000000001508 mal_site_port_get site_magic:39959, src_site:0, dst_site:1
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=217, dseq=1719099865, name_in=, begin_lsn=-1!
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, begin, msg_dseq:1719099865
#守护进程状态设置为DW_RECOVERY
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554 set g_dw_stat from NONE to DW_RECOVERY success, g_dw_recover_stop is 0
#准备异步发送归档
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554 rarch_set_async_send_info, reset total_rpkg and total_time_used.
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, reset all async send info success.
2024-06-23 01:47:53.293 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, finished, msg_dseq:1719099865, set code:0
2024-06-23 01:47:54.347 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=210, dseq=1719099866, name_in=TEST2, begin_lsn=0!
2024-06-23 01:47:54.347 [INFO] database P0000001104 T0000000000000001557 utsk_cmd_exec, cmd:210, sys_status:OPEN, dseq:1719099866
开始异步归档
#备库(mal_site(1))1724574775最末重演日志序列号1430558,lsn1724574775
#注意:与dmrachk分析备库最后一个归档的next_seq和clsn相同
2024-06-23 01:47:54.356 [INFO] database P0000001104 T0000000000000001557 rarch_sync_process from mal_site(1): n_apply_ep=1, apply_seq_arr[0]=1430558, apply_lsn_arr[0]=1724574775
2024-06-23 01:47:54.357 [INFO] database P0000001104 T0000000000000001557 rarch_ep_init begin, dsc_seqno:0, begin_pkg_seqno:1430558, rfil:/u01/dm/dmarch/test/ARCHIVE_LOCAL1_0x25335316_EP0_2024-06-23_00-43-33.log!
2024-06-23 01:47:54.373 [INFO] database P0000001104 T0000000000000001557 rarch_ep_init, ep->prev_term_id change from 0 to 0.
2024-06-23 01:47:54.373 [INFO] database P0000001104 T0000000000000001557 rarch_ep_init, code:0, dsc_seqno:0, begin_pkg_seqno:1430558, load ep pkg info[db_magic:0x25335316, dsc_seqno:0, pkg_seqno:1430558, min_lsn:1724574775, max_lsn:1724574775], prev_term_id:0!
2024-06-23 01:47:54.373 [INFO] database P0000001104 T0000000000000001557 rarch_sync_process, begin to send arch, db_magic:0x25335316. 2024-06-23 01:47:54.376 [INFO] database P0000001104 T0000000000000001557 Local arch send type to TEST2 change from 0 to 1, clear recent arch send info!
2024-06-23 01:47:54.533 [INFO] database P0000001104 T0000000000000001557 rarch_sync_process, ep_seqno[0]'s rlog_pkg send over! 2024-06-23 01:47:54.533 [INFO] database P0000001104 T0000000000000001557 rarch_sync_process, all rlog_pkg send over, rarch_apply_wait!
2024-06-23 01:47:54.547 [INFO] database P0000001104 T0000000000000001557 rarch_sync_process, all rlog_pkg send over, rarch_apply_wait success, site_seq:1, reach_until_time:0!
2024-06-23 01:47:54.547 [INFO] database P0000001104 T0000000000000001557 rarch_local_arch_send to TEST2 finished, code=0, dseq=1719099866, max_arch_lsn=1724574841!
...
为保证可以追齐,传输一段归档后设置主库SUSPEND,禁止RLOG写入。
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001566 rarch_local_arch_send to TEST2 finished, code=0, dseq=1719099871, max_arch_lsn=1724574842!
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, received sql exec cmd:1, dseq:1719099872, sql:ALTER DATABASE SUSPEND
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=1, dseq=1719099872, name_in=, begin_lsn=-1!
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554 utsk_dw_sql, before enqueue, len:0, tsks:0
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554 utsk_dw_sql, after enqueue, len:1, tsks:1
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001554 utsk_dw_sql:ALTER DATABASE SUSPEND put to task sys, sys_status:OPEN
2024-06-23 01:47:54.927 [INFO] database P0000001104 T0000000000000001568 utsk_cmd_exec, cmd:1, sys_status:OPEN, dseq:1719099872
2024-06-23 01:47:55.028 [INFO] database P0000001104 T0000000000000001568 alter database suspend success!
2024-06-23 01:47:55.028 [INFO] database P0000001104 T0000000000000001568 utsk_process_dw_sql_exec, prepare sql:ALTER DATABASE SUSPEND, code:0, dseq:1719099872
继续传输归档直至追齐
2024-06-23 01:47:55.029 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=210, dseq=1719099873, name_in=TEST2, begin_lsn=0!
2024-06-23 01:47:55.029 [INFO] database P0000001104 T0000000000000001569 utsk_cmd_exec, cmd:210, sys_status:SUSPEND, dseq:1719099873
2024-06-23 01:47:55.036 [INFO] database P0000001104 T0000000000000001569 rarch_sync_process from mal_site(1): n_apply_ep=1, apply_seq_arr[0]=1430623, apply_lsn_arr[0]=1724574842
2024-06-23 01:47:55.037 [INFO] database P0000001104 T0000000000000001569 rarch_ep_init begin, dsc_seqno:0, begin_pkg_seqno:1430623, rfil:/u01/dm/dmarch/test/ARCHIVE_LOCAL1_0x25335316_EP0_2024-06-23_00-43-33.log!
2024-06-23 01:47:55.054 [INFO] database P0000001104 T0000000000000001569 rarch_ep_init, ep->prev_term_id change from 0 to 0.
2024-06-23 01:47:55.054 [INFO] database P0000001104 T0000000000000001569 rarch_ep_init, code:0, dsc_seqno:0, begin_pkg_seqno:1430623, load ep pkg info[db_magic:0x25335316, dsc_seqno:0, pkg_seqno:1430623, min_lsn:1724574842, max_lsn:1724574842], prev_term_id:0!
2024-06-23 01:47:55.054 [INFO] database P0000001104 T0000000000000001569 rarch_sync_process, begin to send arch, db_magic:0x25335316.
2024-06-23 01:47:55.056 [INFO] database P0000001104 T0000000000000001569 rarch_sync_process, ep_seqno[0]'s rlog_pkg send over!
2024-06-23 01:47:55.056 [INFO] database P0000001104 T0000000000000001569 rarch_sync_process, all rlog_pkg send over, rarch_apply_wait!
2024-06-23 01:47:55.060 [INFO] database P0000001104 T0000000000000001569 rarch_sync_process, all rlog_pkg send over, rarch_apply_wait success, site_seq:1, reach_until_time:0!
2024-06-23 01:47:55.060 [INFO] database P0000001104 T0000000000000001569 rarch_local_arch_send to TEST2 finished, code=0, dseq=1719099873, max_arch_lsn=1724574842!
...
日志追齐后将路径状态恢复为VALID,保证后续开库后正常发送实时归档
主库准备open
2024-06-23 01:47:55.061 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=100, dseq=1719099874, name_in=, begin_lsn=-1!
2024-06-23 01:47:55.061 [INFO] database P0000001104 T0000000000000001567 utsk_cmd_exec, cmd:100, sys_status:SUSPEND, dseq:1719099874
2024-06-23 01:47:55.061 [INFO] database P0000001104 T0000000000000001567 Change TEST2 arch status from INVALID to VALID
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001567 Notify dest ep(1) set dw_stat_flag from DW_RECOVERY to NONE success.
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001567 utsk_process_mdw_chg_march, dseq:1719099874, code:0, op_flag:0, n_name:1, arch_status:VALID!
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001567 utsk_cmd_exec, utsk_process_mdw_chg_march success, sys_status:SUSPEND!
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, received sql exec cmd:1, dseq:1719099875, sql:ALTER DATABASE OPEN FORCE
数据库open完毕,守护进程状态由DW_RECOVERY切换为NONE
2024-06-23 01:47:55.064 [INFO] database P0000001104 T0000000000000001554 pha_altdb_open_for_dw alter database open start... 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554 pha_altdb_open_for_dw, altdb set changing end! 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554 altdb open rlog_flush_notify_ex start! 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554 altdb open rlog_flush_notify_ex end! 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001554 pha_altdb_open_for_dw alter database open success! 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570 total 0 active crash trx, pseg_crash_trx_rollbacksys_only(0) begin ... 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570 pseg_crash_trx_rollback end, total 0 active crash trx, include 0 empty_trxs, 0 empty_pages which only need to delete mgr recs. 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570 pseg_crash_trx_rollback end 2024-06-23 01:47:55.065 [INFO] database P0000001104 T0000000000000001570 hpc_clear_active_trx adjust n_crash_active_trx from 0 to 0. 2024-06-23 01:47:56.067 [INFO] database P0000001104 T0000000000000001554 utsk_cmd_add, cmd info: cmd=217, dseq=1719099876, name_in=, begin_lsn=-1! 2024-06-23 01:47:56.067 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, begin, msg_dseq:1719099876 2024-06-23 01:47:56.067 [INFO] database P0000001104 T0000000000000001554 set g_dw_stat from DW_RECOVERY to NONE success, g_dw_recover_stop is 0 2024-06-23 01:47:56.070 [INFO] database P0000001104 T0000000000000001554 Notify dest ep(TEST2) set dw_stat_flag from DW_RECOVERY to NONE success. 2024-06-23 01:47:56.070 [INFO] database P0000001104 T0000000000000001554 utsk_set_global_dw_stat, finished, msg_dseq:1719099876, set code:0 2024-06-23 01:47:57.588 [INFO] database P0000001104 T0000000000000001518 Local arch send type to TEST2 change from 1 to 0, clear recent arch send info!