一、背景
没有做任何特殊操作,突然kuboard访问时,提示如下信息:
{"message": "Failed to connect to the database.","type": "Internal Server Error"
}
二、排查过程
此处kuboard为docker部署的,查看kuboard的运行情况,提示Up 6 months
正在运行
docker ps | grep kuboard
查看kuboard容器的日志:
docker logs -f --tail=10 容器ID
[root@nb003 ~]# docker logs -f --tail=10 a2caf8010e75
time="2023-09-23T05:15:08Z" level=error msg="failed to rotate keys: etcdserver: mvcc: database space exceeded"
{"level":"warn","ts":"2023-09-23T13:15:12.504+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8f19b170-257f-4a30-942f-1be1122e3be0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
time="2023-09-23T05:15:12Z" level=error msg="Storage health check failed: create auth request: etcdserver: mvcc: database space exceeded"
日志如上,发现提示ResourceExhausted desc = etcdserver: mvcc: database space exceeded
,这表示etcd服务磁盘空间不足了,默认的空间配额限制为2G,超出空间配额限制就会影响服务,所以需要定期清理。
故查看数据映射的空间大小,找到自己的kuboard-data,查看
etcd db占用空间大小,发现从9月23日11点57的时候就是2GB了。已经达到默认的空间配额限制为2G的最大值。
[root@nb003 snap]# cd /data/kuboard-data/etcd-data/member/snap
[root@nb003 snap]# pwd
/data/kuboard-data/etcd-data/member/snap
[root@nb003 snap]# ls -lrth
总用量 2.1G
-rw-r--r-- 1 root root 8.0K 9月 19 11:49 0000000000000005-00000000005c4542.snap
-rw-r--r-- 1 root root 8.0K 9月 20 08:41 0000000000000005-00000000005c6c53.snap
-rw-r--r-- 1 root root 8.0K 9月 21 05:33 0000000000000005-00000000005c9364.snap
-rw-r--r-- 1 root root 8.0K 9月 22 02:26 0000000000000005-00000000005cba75.snap
-rw-r--r-- 1 root root 8.0K 9月 23 07:13 0000000000000005-00000000005ce186.snap
-rw------- 1 root root 2.0G 9月 23 11:57 db
进入kuboard容器内部,查看etcd的情况:可以看到在ERRORS列里同样也提示了一个警告alarm:NOSPACE
空间不足
[root@nb003 snap]# docker exec -it a2caf8010e75 bash
root@a2caf8010e75:/# ETCDCTL_API=3 etcdctl --endpoints="http://127.0.0.1:2379" --write-out=table endpoint status
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://127.0.0.1:2379 | 59a9c584ea2c3f35 | 3.4.14 | 2.1 GB | true | false | 6 | 6089300 | 6089300 | memberID:6460912315094810421 |
| | | | | | | | | | alarm:NOSPACE |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
root@a2caf8010e75:/#
三、解决办法
在kuboard容器中依次做如下操作:
# 备份db
etcdctl snapshot save backup.db
# 查看当前版本
rev=$(ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# 压缩旧版本
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 compact $rev
# 整理多余的空间
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 defrag
# 取消告警信息(之前有nospace的告警)
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 alarm disarm
# 再次查看etcd的状态(发现ERROR字段已为空)
ETCDCTL_API=3 etcdctl --endpoints="http://127.0.0.1:2379" --write-out=table endpoint status
详细过程及其输出如下:
root@a2caf8010e75:/# etcdctl snapshot save backup.db
{"level":"info","ts":1695447648.315712,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"backup.db.part"}
{"level":"info","ts":"2023-09-23T13:40:48.317+0800","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1695447648.3172774,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2023-09-23T13:41:03.646+0800","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1695447663.8131642,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"2.1 GB","took":15.497392681}
{"level":"info","ts":1695447663.8132935,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"backup.db"}
Snapshot saved at backup.db
root@a2caf8010e75:/# rev=$(ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
root@a2caf8010e75:/# ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 compact $rev
compacted revision 6077603
root@a2caf8010e75:/# ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 defrag
Finished defragmenting etcd member[http://127.0.0.1:2379]
root@a2caf8010e75:/# ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 alarm disarm
memberID:6460912315094810421 alarm:NOSPACE
root@a2caf8010e75:/# ETCDCTL_API=3 etcdctl --endpoints="http://127.0.0.1:2379" --write-out=table endpoint status
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://127.0.0.1:2379 | 59a9c584ea2c3f35 | 3.4.14 | 127 kB | true | false | 6 | 6089454 | 6089454 | |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
root@a2caf8010e75:/#
四、访问验证结果
浏览器访问kuboard(ip:30080)验证,访问正常
查看etcd db的占用情况:发现大小变为156K