参考文章:https://www.modb.pro/db/70021
概述
为了确保复杂的事务可以安全地同时运行,kingbase(PostgreSQL)提供了各种级别的锁来控制对各种数据对象的并发访问,使得对数据库关键部分的更改序列化。事务并发运行,直到它们尝试获取互相冲突的锁为止(比如两个事务更新同一行时)。当多个事务同时在数据库中运行时,并发控制是一种用于维持一致性和隔离性的技术,在kingbase(PostgreSQL)中,使用快照隔离Sanpshot Isolation (简称SI) 来实现多版本并发控制,同时以两阶段锁定 (2PL) 机制为辅。在执行DDL时使用2PL,在执行DML时使用SI。
在数据库中,同样也存在着各式各样的锁,表级锁、行级锁、页锁等等,数据库的并发能力除了和它的并发控制机制有关, 还和数据库的锁粒度控制息息相关,粒度越细冲突范围就越小,并发能力就越强。锁的最终目的无外乎是为了保证数据的一致性和完整性
kingbase锁矩阵
场景
Z银行客户A系统 性能压测,压测时监控数据库服务器,发现数据库存在长连接情况,对长链接的会话进行锁分析。
经典获取锁信息的SQL:
SELECT blocked_locks.pid AS blocked_pid,blocked_activity.usename AS blocked_user,blocked_activity.client_addr as blocked_client_addr,blocked_activity.client_hostname as blocked_client_hostname,blocked_activity.application_name as blocked_application_name,blocked_activity.wait_event_type as blocked_wait_event_type,blocked_activity.wait_event as blocked_wait_event,blocked_activity.query AS blocked_statement,blocked_activity.xact_start AS blocked_xact_start,blocking_locks.pid AS blocking_pid,blocking_activity.usename AS blocking_user,blocking_activity.client_addr as blocking_client_addr,blocking_activity.client_hostname as blocking_client_hostname,blocking_activity.application_name as blocking_application_name,blocking_activity.wait_event_type as blocking_wait_event_type,blocking_activity.wait_event as blocking_wait_event,blocking_activity.query AS current_statement_in_blocking_process,blocking_activity.xact_start AS blocking_xact_start
FROM pg_catalog.pg_locks blocked_locksJOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pidJOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktypeAND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASEAND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relationAND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.pageAND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tupleAND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxidAND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionidAND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classidAND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objidAND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubidAND blocking_locks.pid != blocked_locks.pidJOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted ORDER BY blocked_activity.pid ;
查询结果:
通过上述的SQL查询找到了相互阻塞的SQL,分析等待事件,pid 4183128 在等待一个事务提交,而pid 4183129 是在等客户端完成操作。
两个事务操作的不是同一张表,不应该相互阻塞才对,再次确认一下各自的阻塞pid。
查询阻塞pid SQL:
SELECT pid,pg_blocking_pid(pid),wait_event_type,wait_event,query from pg_stat_activity where pid =xxxxx;
这里说明一下,pid改变,是使用两次压测的查询结果。
查询结果分析:
通过系统函数pg_blocking_pid可以看到阻塞事务pid和等锁类型,分析阻塞update操作的确实是select操作的事务,锁类型是lock,锁事件是等待事务提交,而select操作的锁类型是客户端操作,等待事件是等待客户端操作提交。
考虑到两个SQL并不是操作的一个表,且select操作不应该阻塞DML操作,因此继续使用第二个经典的SQL来查询一下。
经典SQL2:
with
t_wait as
( select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted, a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath, b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name from pg_locks a,pg_stat_activity b where a.pid=b.pid and not a.granted
),
t_run as
( select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.granted, a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a.transactionid,a.fastpath, b.state,b.query,b.xact_start,b.query_start,b.usename,b.datname,b.client_addr,b.client_port,b.application_name from pg_locks a,pg_stat_activity b where a.pid=b.pid and a.granted
),
t_overlap as
( select r.* from t_wait w join t_run r on ( r.locktype is not distinct from w.locktype and r.database is not distinct from w.database and r.relation is not distinct from w.relation and r.page is not distinct from w.page and r.tuple is not distinct from w.tuple and r.virtualxid is not distinct from w.virtualxid and r.transactionid is not distinct from w.transactionid and r.classid is not distinct from w.classid and r.objid is not distinct from w.objid and r.objsubid is not distinct from w.objsubid and r.pid <> w.pid )
),
t_unionall as
( select r.* from t_overlap r union all select w.* from t_wait w
)
select locktype,datname,relation::regclass,page,tuple,virtualxid,transactionid::text,classid::regclass,objid,objsubid,
string_agg(
'Pid: '||case when pid is null then 'NULL' else pid::text end||chr(10)||
'Lock_Granted: '||case when granted is null then 'NULL' else granted::text end||' , Mode: '||case when mode is null then 'NULL' else mode::text end||' , FastPath: '||case when fastpath is null then 'NULL' else fastpath::text end||' , VirtualTransaction: '||case when virtualtransaction is null then 'NULL' else virtualtransaction::text end||' , Session_State: '||case when state is null then 'NULL' else state::text end||chr(10)||
'Username: '||case when usename is null then 'NULL' else usename::text end||' , Database: '||case when datname is null then 'NULL' else datname::text end||' , Client_Addr: '||case when client_addr is null then 'NULL' else client_addr::text end||' , Client_Port: '||case when client_port is null then 'NULL' else client_port::text end||' , Application_Name: '||case when application_name is null then 'NULL' else application_name::text end||chr(10)||
'Xact_Start: '||case when xact_start is null then 'NULL' else xact_start::text end||' , Query_Start: '||case when query_start is null then 'NULL' else query_start::text end||' , Xact_Elapse: '||case when (now()-xact_start) is null then 'NULL' else (now()-xact_start)::text end||' , Query_Elapse: '||case when (now()-query_start) is null then 'NULL' else (now()-query_start)::text end||chr(10)||
'SQL (Current SQL in Transaction): '||chr(10)||
case when query is null then 'NULL' else query::text end,
chr(10)||'--------'||chr(10)
order by ( case mode when 'INVALID' then 0 when 'AccessShareLock' then 1 when 'RowShareLock' then 2 when 'RowExclusiveLock' then 3 when 'ShareUpdateExclusiveLock' then 4 when 'ShareLock' then 5 when 'ShareRowExclusiveLock' then 6 when 'ExclusiveLock' then 7 when 'AccessExclusiveLock' then 8 else 0 end ) desc, (case when granted then 0 else 1 end)
) as lock_conflict
from t_unionall
group by
locktype,datname,relation,page,tuple,virtualxid,transactionid::text,classid,objid,objsubid ;
SQL2查询结果:
分析SQL2查询结果,发现update实际被另一个update阻塞。
找到了相互阻塞的事务,现在重点关注的是select操作为什么没有commit,这就需要看一下业务的代码了,先现象和分析结果反馈给了业务侧开发,开发反馈是方法上之前有一个事务的注解,去掉之后select正常提交,未在出现锁等待情况。
问题解决方案:
1 数据库端添加超时参数,idle_in_transaction_session_timeout,单位ms。
2 业务代码侧解决。