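Contents
- I. Collector Installation
- 1. Introduction to Categraf
- 2. Deploying Categraf
- 3. Test server deployment
- 4. System monitoring plugins
- 5. GPU monitoring plugin
- 6. Service monitoring plugins
- II. Monitoring Dashboards
- 1. Machine list
- 2. System monitoring
- 3. Service monitoring
- III. Alert Configuration
- 1. Email notifications
- 2. Alert rules
- 3. Alert self-healing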
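I. Collector Installation
1. Introduction to Categraf
Categraf must be deployed on every machine you want to monitor, because collecting metrics such as CPU, memory, and processes requires reading information from the operating system.
Categraf pushes monitoring data to the server using the Prometheus RemoteWrite protocol.
- Grafana dashboard marketplace
- Categraf plugin documentation
- Categraf deployment documentation
- Categraf download page
Example download file: categraf-v0.3.45-linux-amd64.tar.gz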
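2. Deploying Categraf
Some monitoring plugins are hard to configure when Categraf runs in Docker, so Categraf is deployed here as a binary.
- Delete the plugins you do not use
categraf-v0.3.45-linux-amd64/conf/input.*
- Edit the plugin configuration files (*.toml)
- Edit the Categraf configuration file config.toml
[global]
# Label that identifies this machine; replace it with your own
hostname = "machine-label"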
[[writers]]
url = "http://192.168.6.226:17000/prometheus/v1/write"
[ibex]
enable = true
servers = ["192.168.6.226:20090"]
[heartbeat]
url = "http://192.168.6.226:17000/v1/n9e/heartbeat"
- Copy Categraf
Copy all files and folders inside categraf-v0.3.45-linux-amd64 to /home/monitor/categraf on the target machine.
- Install and start Categraf
cd /home/monitor/categraf && chmod +x categraf && ./categraf --install && ./categraf --start
- Other commands
# Install as a service; equivalent to adding the service file and running systemctl daemon-reload
sudo ./categraf --install
# Uninstall the service; equivalent to systemctl stop categraf plus deleting the service file
# If Categraf was installed before, uninstall it first
sudo ./categraf --remove
# Start Categraf as a service; equivalent to systemctl start categraf
sudo ./categraf --start
# Stop the Categraf service; equivalent to systemctl stop categraf
sudo ./categraf --stop
# Show the Categraf service status; equivalent to systemctl status categraf
sudo ./categraf --status
# Test which mysql metrics are collected
sudo ./categraf --test --inputs mysql
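The "delete the plugins you do not use" step can be scripted. The following is a minimal sketch, assuming each plugin lives in a `conf/input.<name>` directory as shown above; the helper name `prune_plugins` and the keep-list are hypothetical and only mirror the plugins configured in this article. It defaults to a dry run so nothing is deleted until you opt in.

```python
import shutil
from pathlib import Path


def prune_plugins(conf_dir: Path, keep: set[str], dry_run: bool = True) -> list[str]:
    """Remove every conf/input.<name> directory whose <name> is not in `keep`.

    Returns the sorted names of the plugins that were (or would be) removed.
    """
    removed = []
    for plugin_dir in sorted(conf_dir.glob("input.*")):
        name = plugin_dir.name[len("input."):]
        if plugin_dir.is_dir() and name not in keep:
            removed.append(name)
            if not dry_run:
                shutil.rmtree(plugin_dir)
    return removed


if __name__ == "__main__":
    # Hypothetical keep-list: the plugins configured later in this article.
    keep = {"cpu", "disk", "diskio", "kernel", "mem", "net", "netstat", "ntp",
            "processes", "system", "nvidia_smi", "docker", "mtail", "mysql",
            "nginx", "redis"}
    conf = Path("categraf-v0.3.45-linux-amd64/conf")
    # Dry run: only lists what would be deleted.
    print(prune_plugins(conf, keep, dry_run=True))
```

Review the dry-run output first, then call the function again with `dry_run=False` before copying the directory to the target machine.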
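3. Test server deployment
4. System monitoring plugins
- cpu plugin: collects the local CPU usage, idle rate, and so on
input.cpu/cpu.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Whether to collect metrics for each individual core
collect_per_cpu = false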
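- disk plugin: collects disk usage, inode usage, and so on
input.disk/disk.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Collect only the specified mount points
# mount_points = ["/"]
# Ignore mount points by filesystem type
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "nsfs", "CDFS"]
# Ignore these mount points
ignore_mount_points = ["/boot", "/var/lib/kubelet/pods"]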
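- diskio plugin: collects disk read/write I/O metrics
input.diskio/diskio.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Collect only the specified devices
# devices = ["sda", "sdb", "vd*"]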
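- kernel plugin: collects the OS boot time, the number of context switches, and so on
input.kernel/kernel.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15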
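- mem plugin: collects memory usage and related metrics
input.mem/mem.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Whether to collect platform-specific metrics
collect_platform_fields = true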
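- net plugin: collects network interface traffic, packet counts, and so on
input.net/net.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Whether to collect protocol statistics on Linux
# collect_protocol_stats = false
# Collect only the specified network interfaces
# interfaces = ["eth0"]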
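- netstat plugin: collects the number of time_wait connections, established connections, and so on
input.netstat/netstat.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
disable_summary_stats = false
# With many network connections, collecting connection statistics consumes significant system resources
disable_connection_stats = true
tcp_ext = false
ip_ext = false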
- ntp plugin: monitors the machine's clock offset
input.ntp/ntp.toml
# Collection interval, in seconds
interval = 15
# NTP servers
ntp_servers = ["ntp.aliyun.com"]
# Response timeout
timeout = 5
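- processes plugin: collects how many processes are running, how many are sleeping, and the total
input.processes/processes.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Force collection via the ps command
# force_ps = false
# Force collection via /proc
# force_proc = false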
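- system plugin: collects system load information
input.system/system.toml (the default configuration can be used)
# Collection interval, in seconds
interval = 15
# Whether to collect system_n_users
# collect_user_number = false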
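To make the "clock offset" reported by the ntp plugin more concrete, here is a minimal sketch of the standard NTP offset and round-trip-delay formulas. Whether Categraf computes its value in exactly this way is an assumption; the function names are hypothetical, and the four timestamps are illustrative values in milliseconds.

```python
def ntp_offset_ms(t0: float, t1: float, t2: float, t3: float) -> float:
    """Offset of the server clock relative to the local clock, in milliseconds.

    t0: client send time      (local clock)
    t1: server receive time   (server clock)
    t2: server send time      (server clock)
    t3: client receive time   (local clock)
    A positive result means the local clock is behind the server.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def round_trip_delay_ms(t0: float, t1: float, t2: float, t3: float) -> float:
    """Network round-trip time, excluding the server's processing time."""
    return (t3 - t0) - (t2 - t1)


if __name__ == "__main__":
    # Illustrative timestamps: the local clock is 500 ms behind the server,
    # each network leg takes 100 ms, and the server needs 1 ms to answer.
    print(ntp_offset_ms(1000, 1600, 1601, 1201))        # 500.0
    print(round_trip_delay_ms(1000, 1600, 1601, 1201))  # 200
```

An offset of this kind is what the `ntp_offset_ms` column in the machine-list dashboard displays.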
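5. GPU monitoring plugin
- nvidia_smi plugin: monitors NVIDIA GPU information
input.nvidia_smi/nvidia_smi.toml
# Collection interval, in seconds
interval = 15
# Local command to execute
nvidia_smi_command = "nvidia-smi"
# Run `nvidia-smi --help-query-gpu` to list the available fields
# `AUTO` detects the fields to query automatically
query_field_names = "AUTO"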
6. Service monitoring plugins
- docker plugin: monitors Docker containers
input.docker/docker.toml
# Collection interval, in seconds
interval = 15
[[instances]]
# interval = global.interval * interval_times
interval_times = 1
## Docker Endpoint
endpoint = "unix:///var/run/docker.sock"
# Containers to include/exclude
container_name_include = []
container_name_exclude = []
gather_services = false
gather_extend_memstats = false
container_id_label_enable = true
container_id_label_short_style = false
timeout = "5s"
perdevice_include = []
total_include = ["cpu", "blkio", "network"]
docker_label_include = []
docker_label_exclude = ["annotation*", "io.kubernetes*", "*description*", "*maintainer*", "*hash", "*author*", "*org_*", "*date*", "*url*", "*docker_compose*"]
- mtail plugin: extracts log content and converts it into monitoring metrics
input.mtail/mtail.toml
# Collection interval, in seconds
interval = 15
[[instances]]
progs = "/home/monitor/categraf/conf/input.mtail/prog1" # path to the log-parsing rule files
logs = ["/home/logs/example/all.log"] # log files
labels = { log="6.221-example-log" } # log labels
override_timezone = "Asia/Shanghai" # time zone
emit_metric_timestamp = "true" # emit metric timestamps
input.mtail/prog1/rule_error.mtail
gauge error_num
/ERROR.*/ {
  error_num++
}
input.mtail/prog1/rule_info.mtail
gauge info_num
/INFO.*/ {
  info_num++
}
input.mtail/prog1/rule_login.mtail
gauge login_num
/登录账户.*/ {
  login_num++
}
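The three mtail rules above each increment a metric whenever a log line matches their pattern. The sketch below only imitates that counting logic in Python so the behavior is easy to check against sample lines; it is not how mtail itself is implemented, and the sample log lines are made up for illustration.

```python
import re

# One pattern per rule file above; each is searched anywhere in the line.
RULES = {
    "error_num": re.compile(r"ERROR.*"),
    "info_num": re.compile(r"INFO.*"),
    "login_num": re.compile(r"登录账户.*"),
}


def count_matches(lines):
    """Return how many lines matched each rule."""
    counts = {name: 0 for name in RULES}
    for line in lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines in the style of /home/logs/example/all.log.
    sample = [
        "2024-01-01 10:00:00 INFO 登录账户 admin",
        "2024-01-01 10:00:01 ERROR connection refused",
        "2024-01-01 10:00:02 DEBUG heartbeat",
    ]
    print(count_matches(sample))
```

Note that a single line can increment several metrics: the first sample line counts toward both `info_num` and `login_num`.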
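- mysql plugin: connects to a MySQL instance, runs some SQL statements, parses the output, and reports it as monitoring data
input.mysql/mysql.toml
# Collection interval, in seconds
interval = 15
# Define instances; each instance corresponds to one MySQL instance
[[instances]]
address = "192.168.6.200:3306"
username = "root"
password = "123456"
# Custom parameters, such as whether to use TLS
parameters = "tls=false"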
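- nginx plugin: monitors nginx status; this plugin depends on nginx's http_stub_status_module
input.nginx/nginx.toml
# Collection interval, in seconds
interval = 15
[[instances]]
# URL of the Nginx stub_status page
urls = ["http://192.168.6.223:8080/nginx_status"]
response_timeout = "5s"
The nginx service must have the http_stub_status_module module enabled.
Add the following to nginx.conf (the location must sit inside a server block):
http {
    server {
        listen 8080;
        location /nginx_status {
            stub_status on;
            access_log off;
            allow 192.168.6.226;  # allow this IP
            deny all;             # deny all other IPs
        }
    }
}
Status page URL: http://192.168.6.223:8080/nginx_status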
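The stub_status page returns a small plain-text report. As a minimal sketch of what a collector has to do with it, the following parses that text into named counters. The sample text follows the format shown in the nginx documentation; the function name and the dictionary keys are illustrative choices, not Categraf's internals.

```python
import re


def parse_stub_status(text: str) -> dict[str, int]:
    """Parse the plain-text output of the nginx stub_status page."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The line made only of three integers: accepts, handled, requests.
    counters = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected stub_status format")
    return {
        "active": int(active.group(1)),
        "accepts": int(counters.group(1)),
        "handled": int(counters.group(2)),
        "requests": int(counters.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }


if __name__ == "__main__":
    sample = (
        "Active connections: 291 \n"
        "server accepts handled requests\n"
        " 16630948 16630948 31070465 \n"
        "Reading: 6 Writing: 179 Waiting: 106 \n"
    )
    print(parse_stub_status(sample))
```

The `active` value corresponds to the `nginx_active` metric used in the service dashboard later in this article.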
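- redis plugin: connects to Redis, runs the info command, parses the result, and reports it as monitoring data
input.redis/redis.toml
# Collection interval, in seconds
interval = 15
# Define instances; each instance corresponds to one Redis instance
[[instances]]
address = "192.168.6.223:6379"
username = ""
password = ""
pool_size = 2
# Whether to collect the slowlog
gather_slowlog = true
# Maximum number of slowlog entries to collect
slowlog_max_len = 100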
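The "run info, parse the result" step of the redis plugin can be pictured with a short sketch. Redis answers `INFO` with `key:value` lines grouped under `# Section` headers; the code below keeps only the numeric fields. It is an illustration of the idea, not Categraf's actual parser, and the sample text is abbreviated.

```python
def parse_redis_info(text: str) -> dict[str, float]:
    """Turn the text reply of the Redis INFO command into numeric metrics."""
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, section headers, and anything without a key.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        try:
            metrics[key] = float(value)
        except ValueError:
            continue  # non-numeric field such as used_memory_human
    return metrics


if __name__ == "__main__":
    # Abbreviated, illustrative INFO reply.
    sample = (
        "# Clients\r\n"
        "connected_clients:3\r\n"
        "\r\n"
        "# Memory\r\n"
        "used_memory:1048576\r\n"
        "used_memory_human:1.00M\r\n"
    )
    print(parse_redis_info(sample))
```

The `connected_clients` and `used_memory` fields are the ones behind the `redis_connected_clients` and `redis_used_memory` metrics in the service dashboard.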
II. Monitoring Dashboards
1. Machine list
- Dashboard JSON
{"name": "机器列表","tags": "","ident": "","configs": {"panels": [{"type": "table","id": "77bf513a-8504-4d33-9efe-75aaf9abc9e4","layout": {"h": 11,"i": "77bf513a-8504-4d33-9efe-75aaf9abc9e4","isResizable": true,"w": 24,"x": 0,"y": 5},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "avg(system_uptime{ident=~\"$ident\"}) by (ident)","refId": "A","legend": "启动时长"},{"expr": "avg(cpu_usage_active{cpu=\"cpu-total\", ident=~\"$ident\"}) by (ident)","legend": "CPU使用率","refId": "B"},{"expr": "avg(mem_used_percent{ident=~\"$ident\"}) by (ident)","legend": "内存使用率","refId": "C"},{"expr": "avg(mem_total{ident=~\"$ident\"}) by (ident)","legend": "总内存","refId": "D"},{"expr": "avg(disk_used_percent{ident=~\"$ident\",path=\"/\"}) by (ident)","legend": "硬盘使用率","refId": "E"},{"expr": "avg(disk_total{ident=~\"$ident\"}) by (ident)","refId": "F","legend": "总硬盘"},{"expr": "avg(rate(net_bytes_recv{ident=~\"$ident\"}[1m])) by(ident)","refId": "G","legend": "网络入流量"},{"expr": "avg(rate(net_bytes_sent{ident=~\"$ident\"}[1m])) by(ident)","refId": "H","legend": "网络出流量"},{"expr": "avg(nvidia_smi_utilization_gpu_ratio{ident=~\"$ident\"}) by (ident)","refId": "I","legend": "GPU使用率"},{"expr": "avg(nvidia_smi_memory_used_bytes/nvidia_smi_memory_total_bytes{ident=~\"$ident\"}) by (ident)","refId": "J","legend": "显存使用率"},{"expr": "avg(nvidia_smi_memory_total_bytes{ident=~\"$ident\"}) by (ident)","refId": "K","legend": "总显存"},{"expr": "ntp_offset_ms","refId": "L","legend": "NTP偏移 ms"}],"transformations": [{"id": "organize","options": {"renameByName": {"ident": "机器"}}}],"name": "机器列表","maxPerRow": 4,"custom": {"showHeader": true,"colorMode": "background","calc": "lastNotNull","displayMode": "labelValuesToRows","aggrDimension": "ident","sortColumn": "ident","sortOrder": "ascend","linkMode": "cellLink"},"options": {"standardOptions": {}},"overrides": [{"type": "special","matcher": {"id": "byFrameRefID","value": "A"},"properties": {"standardOptions": 
{"util": "humantimeSeconds"}}},{"matcher": {"id": "byFrameRefID","value": "B"},"properties": {"standardOptions": {"util": "percent","decimals": 1},"valueMappings": []}},{"matcher": {"id": "byFrameRefID","value": "C"},"properties": {"standardOptions": {"util": "percent","decimals": 1},"valueMappings": []},"type": "special"},{"matcher": {"id": "byFrameRefID","value": "D"},"properties": {"standardOptions": {"decimals": 1,"util": "bytesIEC"},"valueMappings": []},"type": "special"},{"matcher": {"id": "byFrameRefID","value": "E"},"properties": {"standardOptions": {"decimals": 1,"util": "percent"},"valueMappings": []},"type": "special"},{"type": "special","matcher": {"id": "byFrameRefID","value": "F"},"properties": {"standardOptions": {"util": "bytesIEC","decimals": 0}}},{"type": "special","matcher": {"id": "byFrameRefID","value": "G"},"properties": {"standardOptions": {"util": "bytesSecIEC","decimals": 1}}},{"type": "special","matcher": {"id": "byFrameRefID","value": "H"},"properties": {"standardOptions": {"util": "bytesSecIEC","decimals": 1}}},{"type": "special","matcher": {"id": "byFrameRefID","value": "I"},"properties": {"standardOptions": {"util": "percentUnit","decimals": 1}}},{"type": "special","matcher": {"id": "byFrameRefID","value": "J"},"properties": {"standardOptions": {"util": "percentUnit","decimals": 1}}},{"type": "special","matcher": {"id": "byFrameRefID","value": "K"},"properties": {"standardOptions": {"util": "bytesIEC","decimals": 1}}}]}],"var": [{"definition": "prometheus","name": "prom","type": "datasource"},{"allOption": true,"datasource": {"cate": "prometheus","value": "${prom}"},"definition": "label_values(system_load1,ident)","multi": true,"name": "ident","type": "query"}],"version": "3.0.0"}
}
- Dashboard preview
2. System monitoring
- Dashboard JSON
{"name": "系统监控","tags": "","ident": "","configs": {"panels": [{"type": "timeseries","id": "043c26de-d19f-4fe8-a615-2b7c10ceb828","layout": {"h": 7,"w": 8,"x": 0,"y": 0,"i": "043c26de-d19f-4fe8-a615-2b7c10ceb828","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "cpu_usage_active{ident=~\"$ident\"}","refId": "A","legend": "{{ident}}-使用率"}],"transformations": [{"id": "organize","options": {}}],"name": "CPU使用率","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"util": "percent","min": 0,"max": 101,"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off","standardOptions": {"min": null,"max": null,"decimals": null}}}]},{"type": "timeseries","id": "239aacdf-1982-428b-b240-57f4ce7f946d","layout": {"h": 7,"w": 8,"x": 8,"y": 0,"i": "239aacdf-1982-428b-b240-57f4ce7f946d","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "mem_used_percent{ident=~\"$ident\"}","refId": "A","legend": "{{ident}}-使用率"}],"transformations": [{"id": "organize","options": {}}],"name": "内存使用率","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"util": "percent","min": 0,"max": 101,"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": 
"linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off","standardOptions": {"decimals": null,"min": null,"max": null}}}]},{"type": "timeseries","id": "bbd1ebda-99f6-419c-90a5-5f84973976dd","layout": {"h": 7,"w": 8,"x": 16,"y": 0,"i": "bbd1ebda-99f6-419c-90a5-5f84973976dd","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "rate(diskio_read_bytes{ident=~\"$ident\"}[1m])","legend": "{{ident}}-{{name}}-读IO","refId": "A"},{"expr": "rate(diskio_write_bytes{ident=~\"$ident\"}[1m])","legend": "{{ident}}-{{name}}-写IO","refId": "B"}],"transformations": [{"id": "organize","options": {}}],"name": "磁盘IO","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"util": "bytesIEC","decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "f2ee5d32-737c-4095-b6b7-b15b778ffdb9","layout": {"h": 7,"w": 8,"x": 0,"y": 7,"i": "f2ee5d32-737c-4095-b6b7-b15b778ffdb9","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "rate(net_bytes_recv{ident=~\"$ident\"}[1m])","legend": "{{ident}}-入流量","refId": "A"},{"expr": "rate(net_bytes_sent{ident=~\"$ident\"}[1m])","legend": "{{ident}}-出流量","refId": "B"}],"transformations": [{"id": "organize","options": {}}],"name": "网络流量","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"util": "bytesIEC","decimals": 0},"thresholds": {"steps": 
[{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "6be9a2be-1d4c-488d-b695-aa1d82df3a3c","layout": {"h": 7,"w": 8,"x": 8,"y": 7,"i": "e164a7cb-394c-4670-b83c-e9321a08cbe6","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "nvidia_smi_utilization_gpu_ratio{ident=~\"$ident\"}","legend": "{{ident}}-使用率","refId": "A"}],"transformations": [{"id": "organize","options": {}}],"name": "显卡使用率","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"util": "percentUnit","min": 0,"max": 1.01,"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "7873f825-1e41-45e9-a1ee-792a87fd4351","layout": {"h": 7,"w": 8,"x": 16,"y": 7,"i": "37ced102-b020-4e3f-8247-6b2c9240a762","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "nvidia_smi_memory_used_bytes/nvidia_smi_memory_total_bytes{ident=~\"$ident\"}","legend": "{{ident}}-使用率","refId": "A"}],"transformations": [{"id": "organize","options": {}}],"name": "显存使用率","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"util": "percentUnit","min": 0,"max": 
1.01,"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]}],"var": [{"definition": "prometheus","name": "prom","type": "datasource"},{"allOption": true,"datasource": {"cate": "prometheus","value": "${prom}"},"definition": "label_values(system_load1,ident)","multi": true,"name": "ident","type": "query"}],"version": "3.0.0"}
}
- Dashboard preview
3. Service monitoring
- Dashboard JSON
{"name": "服务监控","tags": "","ident": "","configs": {"panels": [{"type": "timeseries","id": "043c26de-d19f-4fe8-a615-2b7c10ceb828","layout": {"h": 6,"w": 8,"x": 0,"y": 0,"i": "043c26de-d19f-4fe8-a615-2b7c10ceb828","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "mysql_global_status_threads_connected{ident=~\"$ident\"}","refId": "A","legend": "{{ident}}-当前连接数"}],"transformations": [{"id": "organize","options": {}}],"name": "MySQL 连接数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"min": null,"max": null,"decimals": null},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off","standardOptions": {"min": null,"max": null,"decimals": null}}}]},{"type": "timeseries","id": "bbd1ebda-99f6-419c-90a5-5f84973976dd","layout": {"h": 6,"w": 8,"x": 8,"y": 0,"i": "bbd1ebda-99f6-419c-90a5-5f84973976dd","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "mysql_global_status_slow_queries{ident=~\"$ident\"}","legend": "{{ident}}-慢查询","refId": "A"}],"transformations": [{"id": "organize","options": {}}],"name": "MySQL 慢查询数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"decimals": null},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": 
"linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "3ca8db64-b25e-4e72-8dac-187cec4886ae","layout": {"h": 6,"w": 8,"x": 16,"y": 0,"i": "7174939f-2742-47bd-a023-5d1d3698bf76","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "mtail_login_num{ident=~\"$ident\"}","legend": "{{ident}}-登录","refId": "A","time": {"start": "now-24h","end": "now"}}],"transformations": [{"id": "organize","options": {}}],"name": "登录 日志数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "093b192e-e991-4590-ab4b-aa768159e00f","layout": {"h": 6,"w": 8,"x": 0,"y": 6,"i": "a18a3bd3-8c2b-4fa2-81f3-7b0d00b49cc9","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "redis_connected_clients{ident=~\"$ident\"}","refId": "A","legend": "{{ident}}-当前连接数"}],"transformations": [{"id": "organize","options": {}}],"name": "Redis 连接数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"min": null,"max": null,"decimals": null},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0.01,"gradientMode": "none","stack": "off","scaleDistribution": {"type": 
"linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off","standardOptions": {"min": null,"max": null,"decimals": null}}}]},{"type": "timeseries","id": "2674442f-937f-4027-806b-10b2286b14f6","layout": {"h": 6,"w": 8,"x": 8,"y": 6,"i": "c8c061df-894d-458e-a89d-86a8428c52c9","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "redis_used_memory{ident=~\"$ident\"}","legend": "{{ident}}-内存","refId": "A"}],"transformations": [{"id": "organize","options": {}}],"name": "Redis 使用内存","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"decimals": null},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "d26e8bc3-16a0-4a60-9aa9-36d71b85abc5","layout": {"h": 6,"w": 8,"x": 16,"y": 6,"i": "0a3310ea-74ca-48fa-8c18-52c1b0f71235","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "mtail_error_num{ident=~\"$ident\"}","legend": "{{ident}}-错误","refId": "A","time": {"start": "now-24h","end": "now"}}],"transformations": [{"id": "organize","options": {}}],"name": "Error 日志数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": 
"off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]},{"type": "timeseries","id": "7fa2cdbe-b782-4b71-bd7e-2cdba7455e77","layout": {"h": 6,"w": 8,"x": 0,"y": 12,"i": "9a2e4d49-7a4f-4627-b2f6-cbe0e4ab04b1","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "nginx_active{ident=~\"$ident\"}","refId": "A","legend": "{{ident}}-活跃连接"}],"transformations": [{"id": "organize","options": {}}],"name": "Nginx 活跃连接数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"min": null,"max": null,"decimals": null},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off","standardOptions": {"min": null,"max": null,"decimals": null}}}]},{"type": "timeseries","id": "0cb01432-ea29-41f4-8e6f-e6b9b71e90ab","layout": {"h": 6,"w": 8,"x": 8,"y": 12,"i": "8bf97e38-e840-4804-a686-28bb65fec78d","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "docker_n_containers_running{ident=~\"$ident\"}","refId": "A","legend": "{{ident}}-启动容器"}],"transformations": [{"id": "organize","options": {}}],"name": "Docker 启动容器数","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"min": null,"max": null,"decimals": null},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 
2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off","standardOptions": {"min": null,"max": null,"decimals": null}}}]},{"type": "timeseries","id": "936b934b-6340-4743-8c12-821c63210fd6","layout": {"h": 6,"w": 8,"x": 16,"y": 12,"i": "c6da1998-c1e3-4486-a24c-58e26d349206","isResizable": true},"version": "3.0.0","datasourceCate": "prometheus","datasourceValue": "${prom}","targets": [{"expr": "docker_container_mem_usage{ident=~\"$ident\"}","legend": "{{ident}}-{{container_name}}-内存","refId": "A"}],"transformations": [{"id": "organize","options": {}}],"name": "Docker 内存使用率","maxPerRow": 4,"options": {"tooltip": {"mode": "all","sort": "desc"},"legend": {"displayMode": "hidden","behaviour": "showItem"},"standardOptions": {"decimals": 0},"thresholds": {"steps": [{"color": "#634CD9","value": null,"type": "base"}]}},"custom": {"drawStyle": "lines","lineInterpolation": "smooth","spanNulls": false,"lineWidth": 2,"fillOpacity": 0,"gradientMode": "none","stack": "off","scaleDistribution": {"type": "linear"}},"overrides": [{"matcher": {"id": "byFrameRefID"},"properties": {"rightYAxisDisplay": "off"}}]}],"var": [{"definition": "prometheus","name": "prom","type": "datasource"},{"allOption": true,"datasource": {"cate": "prometheus","value": "${prom}"},"definition": "label_values(system_load1,ident)","multi": true,"name": "ident","type": "query"}],"version": "3.0.0"}
}
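- Dashboard preview
III. Alert Configuration
1. Email notifications
- Configure SMTP
- Configure user email addresses
- Configure the email notification template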
<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><meta http-equiv="X-UA-Compatible" content="ie=edge"><title>夜莺告警通知</title><style type="text/css">.wrapper {background-color: #f8f8f8;padding: 15px;height: 100%;}.main {width: 600px;padding: 30px;margin: 0 auto;background-color: #fff;font-size: 12px;font-family: verdana,'Microsoft YaHei',Consolas,'Deja Vu Sans Mono','Bitstream Vera Sans Mono';}header {border-radius: 2px 2px 0 0;}header .title {font-size: 14px;color: #333333;margin: 0;}header .sub-desc {color: #333;font-size: 14px;margin-top: 6px;margin-bottom: 0;}hr {margin: 20px 0;height: 0;border: none;border-top: 1px solid #e5e5e5;}em {font-weight: 600;}table {margin: 20px 0;width: 100%;}table tbody tr{font-weight: 200;font-size: 12px;color: #666;height: 32px;}.succ {background-color: green;color: #fff;}.fail {background-color: red;color: #fff;}.succ th, .succ td, .fail th, .fail td {color: #fff;}table tbody tr th {width: 80px;text-align: right;}.text-right {text-align: right;}.body {margin-top: 24px;}.body-text {color: #666666;-webkit-font-smoothing: antialiased;}.body-extra {-webkit-font-smoothing: antialiased;}.body-extra.text-right a {text-decoration: none;color: #333;}.body-extra.text-right a:hover {color: #666;}.button {width: 200px;height: 50px;margin-top: 20px;text-align: center;border-radius: 2px;background: #2D77EE;line-height: 50px;font-size: 20px;color: #FFFFFF;cursor: pointer;}.button:hover {background: rgb(25, 115, 255);border-color: rgb(25, 115, 255);color: #fff;}footer {margin-top: 10px;text-align: right;}.footer-logo {text-align: right;}.footer-logo-image {width: 108px;height: 27px;margin-right: 10px;}.copyright {margin-top: 10px;font-size: 12px;text-align: right;color: #999;-webkit-font-smoothing: antialiased;}</style></head><body><div class="wrapper"><div class="main"><header><h3 class="title">{{.RuleName}}</h3><p class="sub-desc"></p></header><hr><div class="body"><table cellspacing="0" cellpadding="0" border="0"><tbody>{{if 
.IsRecovered}}<tr class="succ"><th>级别状态:</th><td>S{{.Severity}} Recovered</td></tr>{{else}}<tr class="fail"><th>级别状态:</th><td>S{{.Severity}} Triggered</td></tr>{{end}}{{if not .IsRecovered}}<tr><th>触发时值:</th><td>{{.TriggerValue}}</td></tr>{{end}}{{if .TargetIdent}}<tr><th>监控对象:</th><td>{{.TargetIdent}}</td></tr>{{end}}<tr><th>监控指标:</th><td>{{.TagsJSON}}</td></tr>{{$time_duration := sub now.Unix .FirstTriggerTime }}{{if .IsRecovered}}<tr><th>持续时间:</th><td>{{humanizeDurationInterface $time_duration}}</td></tr><tr><th>恢复时间:</th><td>{{timeformat .LastEvalTime}}</td></tr>{{else}}<tr><th>触发时间:</th><td>{{timeformat .TriggerTime}}</td></tr>{{end}}</tbody></table></div></div></div></body></html>
2. Alert rules
- CPU usage above 90%
[{"cate": "prometheus","datasource_ids": [0],"name": "CPU 使用率超过90%","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 60,"prom_ql": "","rule_config": {"inhibit": true,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "cpu_usage_active > 90","severity": 1}]},"prom_eval_interval": 15,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["1","2","3","4","5","6","0"],"enable_days_of_weeks": [["1","2","3","4","5","6","0"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": ["email"],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
- MySQL: more than 10 slow queries within 1 minute
[{"cate": "prometheus","datasource_ids": [0],"name": "MySQL 1分钟内慢查询数超过10个","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 120,"prom_ql": "","rule_config": {"inhibit": false,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "increase(mysql_global_status_slow_queries[1m]) > 10","severity": 1}]},"prom_eval_interval": 15,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["1","2","3","4","5","6","0"],"enable_days_of_weeks": [["1","2","3","4","5","6","0"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": ["email"],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
- MySQL connections above 80%
[{"cate": "prometheus","datasource_ids": [0],"name": "MySQL 连接数超过80%","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 120,"prom_ql": "","rule_config": {"inhibit": false,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "avg by (instance) (mysql_global_status_threads_connected) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80","severity": 1}]},"prom_eval_interval": 15,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["1","2","3","4","5","6","0"],"enable_days_of_weeks": [["1","2","3","4","5","6","0"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": ["email"],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
- Memory usage above 85%
[{"cate": "prometheus","datasource_ids": [0],"name": "内存 使用率超过85%","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 60,"prom_ql": "","rule_config": {"inhibit": true,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "mem_used_percent > 85","severity": 1}]},"prom_eval_interval": 15,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["1","2","3","4","5","6","0"],"enable_days_of_weeks": [["1","2","3","4","5","6","0"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": ["email"],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
- Disk usage above 80%
[{"cate": "prometheus","datasource_ids": [0],"name": "硬盘 使用率超过80%","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 60,"prom_ql": "","rule_config": {"inhibit": true,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "disk_used_percent > 80","severity": 1}]},"prom_eval_interval": 30,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["0","1","2","3","4","5","6"],"enable_days_of_weeks": [["0","1","2","3","4","5","6"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": [],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
- Network inbound traffic above 6 MB/s
[{"cate": "prometheus","datasource_ids": [0],"name": "网络 入流量超过6M/s","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 60,"prom_ql": "","rule_config": {"inhibit": false,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "rate(net_bytes_recv[1m]) / 1024 / 1024 > 6","severity": 1}]},"prom_eval_interval": 15,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["1","2","3","4","5","6","0"],"enable_days_of_weeks": [["1","2","3","4","5","6","0"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": ["email"],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
- Network outbound traffic above 6 MB/s
[{"cate": "prometheus","datasource_ids": [0],"name": "网络 出流量超过6M/s","note": "","prod": "metric","algorithm": "","algo_params": null,"delay": 0,"severity": 0,"severities": [1],"disabled": 0,"prom_for_duration": 60,"prom_ql": "","rule_config": {"inhibit": false,"queries": [{"keys": {"labelKey": "","valueKey": ""},"prom_ql": "rate(net_bytes_sent[1m]) / 1024 / 1024 > 6","severity": 1}]},"prom_eval_interval": 15,"enable_stime": "00:00","enable_stimes": ["00:00"],"enable_etime": "23:59","enable_etimes": ["23:59"],"enable_days_of_week": ["1","2","3","4","5","6","0"],"enable_days_of_weeks": [["1","2","3","4","5","6","0"]],"enable_in_bg": 0,"notify_recovered": 1,"notify_channels": ["email"],"notify_repeat_step": 60,"notify_max_number": 3,"recover_duration": 60,"callbacks": [],"runbook_url": "","append_tags": [],"annotations": {},"extra_config": null}
]
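The rule files above differ only in a handful of fields: name, PromQL expression, duration, and inhibit flag. As a minimal sketch, the helper below builds a rule with the same fields and default values as the JSON shown above, so new rules can be generated rather than hand-edited. The function names `make_rule` and `export_rules` are hypothetical, and whether every field is required on import is not verified here.

```python
import json

ALL_DAYS = ["1", "2", "3", "4", "5", "6", "0"]


def make_rule(name, prom_ql, for_seconds=60, severity=1, inhibit=False,
              channels=("email",)):
    """Build one alert rule dict mirroring the fields of the rules above."""
    return {
        "cate": "prometheus", "datasource_ids": [0], "name": name, "note": "",
        "prod": "metric", "algorithm": "", "algo_params": None, "delay": 0,
        "severity": 0, "severities": [severity], "disabled": 0,
        "prom_for_duration": for_seconds, "prom_ql": "",
        "rule_config": {
            "inhibit": inhibit,
            "queries": [{"keys": {"labelKey": "", "valueKey": ""},
                         "prom_ql": prom_ql, "severity": severity}],
        },
        "prom_eval_interval": 15,
        "enable_stime": "00:00", "enable_stimes": ["00:00"],
        "enable_etime": "23:59", "enable_etimes": ["23:59"],
        "enable_days_of_week": list(ALL_DAYS),
        "enable_days_of_weeks": [list(ALL_DAYS)],
        "enable_in_bg": 0, "notify_recovered": 1,
        "notify_channels": list(channels),
        "notify_repeat_step": 60, "notify_max_number": 3,
        "recover_duration": 60, "callbacks": [], "runbook_url": "",
        "append_tags": [], "annotations": {}, "extra_config": None,
    }


def export_rules(rules):
    """Serialize rules as a JSON array, the shape used by the rules above."""
    return json.dumps(rules, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    rule = make_rule("CPU usage above 90%", "cpu_usage_active > 90", inhibit=True)
    print(export_rules([rule]))
```

For the traffic rules, note that `rate(net_bytes_recv[1m]) / 1024 / 1024 > 6` compares against 6 MiB/s, that is 6 × 1024 × 1024 = 6291456 bytes per second.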
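3. Alert self-healing
- Self-healing configuration
- Test alert self-healing
Alert Self-healing > Self-healing Scripts > Create
Alert Self-healing > Self-healing Scripts > test > Create Task > Save and Run Now > Execution History > click the task under the title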