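Table of contents
- I. Deploying the Alertmanager components
- 1. alertmanager-config
- 2. alertmanager-message-tmpl
- 3. alertmanager
- II. Testing email alerts
- III. DingTalk group / WeCom group alerts
- 3.1 Adding a DingTalk group robot
- 3.2 Adding a WeCom group robot
- 3.3 Deploying alertmanager-webhook-adapter
- message-tmpl
- alertmanager-webhook-adapter
- alertmanager-config
- 3.4 DingTalk group alert result
- 3.5 WeCom group alert result
- Summary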
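Prometheus alerting is handled by the Alertmanager component. Prometheus evaluates the metrics it collects against its alerting rules; when a rule's condition is met, it sends the alert event to Alertmanager, which then delivers a notification to the configured receivers.
Steps:
- Deploy Alertmanager
- Configure the alert receivers
- Configure Prometheus to communicate with Alertmanager
- Create alerting rules in Prometheus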
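Steps 3 and 4 are not shown as manifests in this article. As a minimal sketch, assuming Prometheus runs in the same cluster and reaches Alertmanager through a Service named `alertmanager` in the `ops` namespace on port 9093 (the names used by the manifests in this article), the relevant part of `prometheus.yml` could look like this. The rule file path is a hypothetical location; adjust it to wherever your rule files are mounted.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # <service>.<namespace>:<port> of the Alertmanager Service
            - alertmanager.ops:9093
rule_files:
  # Hypothetical path: wherever the rule files are mounted in the Prometheus pod
  - /etc/prometheus/rules/*.yml
```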
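I. Deploying the Alertmanager components
1. alertmanager-config
# alertmanager-config.yaml: main configuration file, mainly the Alertmanager notification settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: ops
data:
  alertmanager.yml: |
    global:
      # Resolve timeout: if no further alert arrives from Prometheus within 5m,
      # a resolved notification is sent
      resolve_timeout: 5m
      # SMTP server
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      # Address the alert emails are sent from
      smtp_from: 'fanxxxxuai@cxxxxne.com'
      # Login username of the sender mailbox
      smtp_auth_username: 'fanxxxxuai@cxxxxne.com'
      # Authorization code of the sender mailbox
      # (for a WeCom mailbox, this is the sender's mailbox login password)
      smtp_auth_password: '123456'
      # Do not require TLS. It is required by default; if left on, sending fails
      # with the error described in the Summary of this article
      smtp_require_tls: false
    # Alertmanager notification templates
    templates:
      - '/etc/alertmanager/msg-tmpl/*.tmpl'
    # Root route
    route:
      # Receiver for the alerts
      receiver: 'mail-receiver'
      # Grouping: alerts with the same cluster and alertname label values form one group
      group_by: [cluster, alertname]
      # When a group first fires, wait 30s; alerts that arrive for the same group
      # in that window are sent together, otherwise the alert is sent alone
      group_wait: 30s
      # Interval between notifications for the same group: after the first
      # notification, wait 5m; alerts that are still firing then fall under repeat_interval
      group_interval: 5m
      # Wait another 30m before re-sending a notification for alerts that have not recovered
      repeat_interval: 30m
      # So repeated notifications for an unresolved alert arrive up to about 35m apart
      # (group_interval + repeat_interval), because a repeat is only sent on a
      # group_interval tick once repeat_interval has passed
    # Receivers
    receivers:
      - name: 'mail-receiver'
        # Deliver by email
        email_configs:
          - to: 'fanxxxxuai@cxxxxne.com'
            send_resolved: true
            html: '{{ template "emailMessage" . }}'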
2. alertmanager-message-tmpl
# alertmanager-message-tmpl.yaml: notification template (email)
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-message-tmpl
  namespace: ops
data:
  email.tmpl: |
    {{/* Only the first firing alert and the first resolved alert of a group are rendered. 28800e9 ns = 8h, which shifts UTC to UTC+8. */}}
    {{ define "emailMessage" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts.Firing -}}
    {{- if eq $index 0 }}
    ------ Alert ------<br>
    Status: {{ .Status }}<br>
    Severity: {{ .Labels.severity }}<br>
    Alert name: {{ .Labels.alertname }}<br>
    Failing instance: {{ .Labels.instance }}<br>
    Summary: {{ .Annotations.summary }}<br>
    Details: {{ .Annotations.description }}<br>
    Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
    ------ END ------<br>
    {{- end }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts.Resolved -}}
    {{- if eq $index 0 }}
    ------ Alert resolved ------<br>
    Status: {{ .Status }}<br>
    Severity: {{ .Labels.severity }}<br>
    Alert name: {{ .Labels.alertname }}<br>
    Recovered instance: {{ .Labels.instance }}<br>
    Summary: {{ .Annotations.summary }}<br>
    Details: {{ .Annotations.description }}<br>
    Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
    Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
    ------ END ------<br>
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
3. alertmanager
# alertmanager.yaml: deploys Alertmanager
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        # Sidecar that hot-reloads the configuration file
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 10Mi
            requests:
              cpu: 10m
              memory: 10Mi
        - name: alertmanager
          image: "prom/alertmanager:latest"
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              # Alertmanager's readiness endpoint
              path: /-/ready
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              # Alertmanager's health endpoint
              path: /-/healthy
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
            - name: message-tmpl
              mountPath: /etc/alertmanager/msg-tmpl
            - name: data
              mountPath: /data
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - name: config
          configMap:
            name: alertmanager-config
        - name: message-tmpl
          configMap:
            name: alertmanager-message-tmpl
        - name: data
          persistentVolumeClaim:
            claimName: alertmanager-data
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager-data
  namespace: ops
spec:
  storageClassName: "managed-nfs-storage"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: ops
spec:
  type: NodePort
  ports:
    - name: http
      port: 9093
      protocol: TCP
      targetPort: 9093
      nodePort: 30093
  selector:
    app: alertmanager
Once deployed, the Alertmanager web UI is available at IP:30093, where IP is the address of any cluster node.
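II. Testing email alerts
At this point you can start a pod to test the setup.
Set its resource requests above what the cluster can provide so that it stays in the Pending state,
as follows:
test-alertmanager.yaml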
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "24Gi"
              cpu: "12000m"
            limits:
              memory: "24Gi"
              cpu: "12000m"
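Note that the Pending pod only produces an alert if Prometheus has a rule that matches it. A sketch of such a rule, assuming kube-state-metrics is installed and scraped (it provides the `kube_pod_status_phase` metric); the group name, alert name, `for` duration, and annotation texts are illustrative and not taken from the author's setup:

```yaml
groups:
  - name: pod-status            # illustrative group name
    rules:
      - alert: PodPending       # illustrative alert name
        # The series is 1 for every pod currently in the Pending phase
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          # summary and description are the annotations the email template reads
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is Pending"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been Pending for more than 1 minute."
```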
Start this Pod and watch what happens.
The email alert looks like this:
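III. DingTalk group / WeCom group alerts
(A custom robot of the webhook type is all that is needed.)
Alertmanager currently has no built-in integration for DingTalk or WeCom group robots, so you need a webhook that converts the payload, either one you write yourself or one written by someone else.
For example: https://github.com/bougou/alertmanager-webhook-adapter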
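3.1 Adding a DingTalk group robot
Once the robot is created you get a webhook URL; the token in that URL is used later.
3.2 Adding a WeCom group robot
Once the robot is created you get a webhook URL; the key in that URL is used later.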
3.3 Deploying alertmanager-webhook-adapter
message-tmpl
# message-tmpl.yaml: notification templates; the DingTalk and WeCom templates are kept in one ConfigMap
Templates taken from: https://github.com/bougou/alertmanager-webhook-adapter/tree/main/pkg/models/templates
apiVersion: v1
kind: ConfigMap
metadata:
  name: message-tmpl
  namespace: ops
data:
  ####################################################################################################################
  dingding.tmpl: |
    {{ define "__subject" -}}
    【{{ .Signature }}】
    {{- if eq (index .Alerts 0).Labels.severity "ok" }} OK{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "info" }} INFO{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "warning" }} WARNING{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "error" }} ERROR{{ end }}
    {{- ` • ` }}
    {{- if .CommonLabels.alertname_cn }}{{ .CommonLabels.alertname_cn }}{{ else if .CommonLabels.alertname_custom }}{{ .CommonLabels.alertname_custom }}{{ else if .CommonAnnotations.alertname }}{{ .CommonAnnotations.alertname }}{{ else }}{{ .GroupLabels.alertname }}{{ end }}
    {{- ` • ` }}
    {{- if gt (.Alerts.Firing|len) 0 }}Firing: {{ .Alerts.Firing|len }}{{ end }}
    {{- if and (gt (.Alerts.Firing|len) 0) (gt (.Alerts.Resolved|len) 0) }}/{{ end }}
    {{- if gt (.Alerts.Resolved|len) 0 }}Resolved: {{ .Alerts.Resolved|len }}{{ end }}
    {{ end }}

    {{ define "__externalURL" -}}
    {{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}
    {{- end }}

    {{ define "__alertinstance" -}}
    {{- if ne .Labels.alertinstance nil -}}{{ .Labels.alertinstance }}
    {{- else if ne .Labels.instance nil -}}{{ .Labels.instance }}
    {{- else if ne .Labels.node nil -}}{{ .Labels.node }}
    {{- else if ne .Labels.nodename nil -}}{{ .Labels.nodename }}
    {{- else if ne .Labels.host nil -}}{{ .Labels.host }}
    {{- else if ne .Labels.hostname nil -}}{{ .Labels.hostname }}
    {{- else if ne .Labels.ip nil -}}{{ .Labels.ip }}
    {{- end -}}
    {{- end }}

    {{ define "__alert_list" }}{{ range . }}
    ---
    > **Alert name**: {{ if .Labels.alertname_cn }}{{ .Labels.alertname_cn }}{{ else if .Labels.alertname_custom }}{{ .Labels.alertname_custom }}{{ else if .Annotations.alertname }}{{ .Annotations.alertname }}{{ else }}{{ .Labels.alertname }}{{ end }}
    >
    > **Severity**: {{ ` ` }}
    {{- if eq .Labels.severity "ok" }}OK{{ end -}}
    {{- if eq .Labels.severity "info" }}INFO{{ end -}}
    {{- if eq .Labels.severity "warning" }}WARNING{{ end -}}
    {{- if eq .Labels.severity "error" }}ERROR{{ end }}
    >
    > **Instance**: `{{ template "__alertinstance" . }}`
    >
    {{- if .Labels.region }}
    > **Region**: {{ .Labels.region }}
    >
    {{- end }}
    {{- if .Labels.zone }}
    > **Zone**: {{ .Labels.zone }}
    >
    {{- end }}
    {{- if .Labels.product }}
    > **Product**: {{ .Labels.product }}
    >
    {{- end }}
    {{- if .Labels.component }}
    > **Component**: {{ .Labels.component }}
    >
    {{- end }}
    > **Status**: {{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} {{ .Status | toUpper }}
    >
    > **Start time**: {{ .StartsAt.Format "2006-01-02T15:04:05Z07:00" }}
    >
    > **End time**: {{ if .EndsAt.After .StartsAt }}{{ .EndsAt.Format "2006-01-02T15:04:05Z07:00" }}{{ else }}Not End{{ end }}
    >
    {{- if eq .Status "firing" }}
    > Description: {{ if .Annotations.description_cn }}{{ .Annotations.description_cn }}{{ else }}{{ .Annotations.description }}{{ end }}
    >
    {{- end }}
    {{ end }}{{ end }}

    {{ define "__alert_summary" }}{{ range . }}- {{ template "__alertinstance" . }}
    {{ end }}{{ end }}

    {{ define "prom.title" }}{{ template "__subject" . }}{{ end }}

    {{ define "prom.markdown" }}{{ .MessageAt.Format "2006-01-02T15:04:05Z07:00" }}
    #### **Summary**
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### **🚨 Firing alerts [{{ .Alerts.Firing|len }}]**
    {{ template "__alert_summary" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### **✅ Resolved alerts [{{ .Alerts.Resolved|len }}]**
    {{ template "__alert_summary" .Alerts.Resolved }}
    {{ end }}
    #### **Details**
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### **🚨 Firing alerts [{{ .Alerts.Firing|len }}]**
    {{ template "__alert_list" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### **✅ Resolved alerts [{{ .Alerts.Resolved|len }}]**
    {{ template "__alert_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}

    {{ define "prom.text" }}{{ template "prom.markdown" . }}{{ end }}
  ####################################################################################################################
  wechat.tmpl: |
    {{ define "__subject" -}}
    【{{ .Signature }}】
    {{- if eq (index .Alerts 0).Labels.severity "ok" }} OK{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "info" }} INFO{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "warning" }} WARNING{{ end }}
    {{- if eq (index .Alerts 0).Labels.severity "error" }} ERROR{{ end }}
    {{- ` • ` }}
    {{- if .CommonLabels.alertname_cn }}{{ .CommonLabels.alertname_cn }}{{ else if .CommonLabels.alertname_custom }}{{ .CommonLabels.alertname_custom }}{{ else if .CommonAnnotations.alertname }}{{ .CommonAnnotations.alertname }}{{ else }}{{ .GroupLabels.alertname }}{{ end }}
    {{- ` • ` }}
    {{- if gt (.Alerts.Firing|len) 0 }}Firing: {{ .Alerts.Firing|len }}{{ end }}
    {{- if and (gt (.Alerts.Firing|len) 0) (gt (.Alerts.Resolved|len) 0) }}/{{ end }}
    {{- if gt (.Alerts.Resolved|len) 0 }}Resolved: {{ .Alerts.Resolved|len }}{{ end }}
    {{ end }}

    {{ define "__externalURL" -}}
    {{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}
    {{- end }}

    {{ define "__alertinstance" -}}
    {{- if ne .Labels.alertinstance nil -}}{{ .Labels.alertinstance }}
    {{- else if ne .Labels.instance nil -}}{{ .Labels.instance }}
    {{- else if ne .Labels.node nil -}}{{ .Labels.node }}
    {{- else if ne .Labels.nodename nil -}}{{ .Labels.nodename }}
    {{- else if ne .Labels.host nil -}}{{ .Labels.host }}
    {{- else if ne .Labels.hostname nil -}}{{ .Labels.hostname }}
    {{- else if ne .Labels.ip nil -}}{{ .Labels.ip }}
    {{- end -}}
    {{- end }}

    {{ define "__alert_list" }}{{ range . }}
    > <font color="comment"> Alert name </font>: {{ if .Labels.alertname_cn }}{{ .Labels.alertname_cn }}{{ else if .Labels.alertname_custom }}{{ .Labels.alertname_custom }}{{ else if .Annotations.alertname }}{{ .Annotations.alertname }}{{ else }}{{ .Labels.alertname }}{{ end }}
    >
    > <font color="comment"> Severity </font>:{{ ` ` }}
    {{- if eq .Labels.severity "ok" }}OK{{ end -}}
    {{- if eq .Labels.severity "info" }}INFO{{ end -}}
    {{- if eq .Labels.severity "warning" }}WARNING{{ end -}}
    {{- if eq .Labels.severity "error" }}ERROR{{ end }}
    >
    > <font color="comment"> Instance </font>: `{{ template "__alertinstance" . }}`
    >
    {{- if .Labels.region }}
    > <font color="comment"> Region </font>: {{ .Labels.region }}
    >
    {{- end }}
    {{- if .Labels.zone }}
    > <font color="comment"> Zone </font>: {{ .Labels.zone }}
    >
    {{- end }}
    {{- if .Labels.product }}
    > <font color="comment"> Product </font>: {{ .Labels.product }}
    >
    {{- end }}
    {{- if .Labels.component }}
    > <font color="comment"> Component </font>: {{ .Labels.component }}
    >
    {{- end }}
    > <font color="comment"> Status </font>: {{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} <font color="{{ if eq .Status "firing" }}warning{{ else }}info{{ end }}">{{ .Status | toUpper }}</font>
    >
    > <font color="comment"> Start time </font>: {{ .StartsAt.Format "2006-01-02T15:04:05Z07:00" }}
    >
    > <font color="comment"> End time </font>: {{ if .EndsAt.After .StartsAt }}{{ .EndsAt.Format "2006-01-02T15:04:05Z07:00" }}{{ else }}Not End{{ end }}
    {{- if eq .Status "firing" }}
    >
    > <font color="comment"> Description </font>: {{ if .Annotations.description_cn }}{{ .Annotations.description_cn }}{{ else }}{{ .Annotations.description }}{{ end }}
    {{- end }}
    {{ end }}{{ end }}

    {{ define "__alert_summary" -}}
    {{ range . }}
    <font color="{{ if eq .Status "firing" }}warning{{ else }}info{{ end }}">{{ template "__alertinstance" . }}</font>
    {{ end }}
    {{ end }}

    {{ define "prom.title" -}}
    {{ template "__subject" . }}
    {{ end }}

    {{ define "prom.markdown" }}{{ .MessageAt.Format "2006-01-02T15:04:05Z07:00" }}
    #### Summary
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### <font color="warning">🚨 Firing alerts [{{ .Alerts.Firing|len }}]</font>
    {{ template "__alert_summary" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### <font color="info">✅ Resolved alerts [{{ .Alerts.Resolved|len }}]</font>
    {{ template "__alert_summary" .Alerts.Resolved }}
    {{ end }}
    #### Details
    {{ if gt (.Alerts.Firing|len ) 0 }}
    ##### <font color="warning">🚨 Firing alerts [{{ .Alerts.Firing|len }}]</font>
    {{ template "__alert_list" .Alerts.Firing }}
    {{ end }}
    {{ if gt (.Alerts.Resolved|len) 0 }}
    ##### <font color="info">✅ Resolved alerts [{{ .Alerts.Resolved|len }}]</font>
    {{ template "__alert_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}

    {{ define "prom.text" }}{{ template "prom.markdown" . }}{{ end }}
alertmanager-webhook-adapter
# alertmanager-webhook-adapter.yaml: the webhook adapter service
Taken from: https://github.com/bougou/alertmanager-webhook-adapter/tree/main/deploy/k8s
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-webhook-adapter
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-webhook-adapter
  template:
    metadata:
      labels:
        app: alertmanager-webhook-adapter
    spec:
      containers:
        - name: webhook
          image: bougou/alertmanager-webhook-adapter:v1.1.7
          command:
            - /alertmanager-webhook-adapter
            # Listen address
            - --listen-address=:8090
            # Source name shown in the first line of each alert (any value will do)
            - --signature=MyIDC
            # Directory that holds the notification templates
            - --tmpl-dir=/msg-tmpl
            # Which template to use (depends on the app that should receive the alerts):
            #   DingTalk group robot: --tmpl-name=dingding
            #   WeCom group robot:    --tmpl-name=wechat
            - --tmpl-name=dingding
            #- --tmpl-lang=zh
          env:
            - name: TZ
              value: Asia/Shanghai
          resources:
            requests:
              memory: 50Mi
              cpu: 100m
            limits:
              memory: 250Mi
              cpu: 500m
          volumeMounts:
            - name: message-tmpl
              mountPath: /msg-tmpl
      volumes:
        - name: message-tmpl
          configMap:
            name: message-tmpl
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-webhook-adapter
  namespace: ops
spec:
  ports:
    - port: 80
      targetPort: 8090
      protocol: TCP
  selector:
    app: alertmanager-webhook-adapter
  sessionAffinity: None
alertmanager-config
# alertmanager-config.yaml: main configuration file; the change here is how alerts are delivered
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: ops
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'fanxxxxuai@cxxxxne.com'
      smtp_auth_username: 'fanxxxxuai@cxxxxne.com'
      smtp_auth_password: '12345'
      smtp_require_tls: false
    templates:
      - '/etc/alertmanager/msg-tmpl/*.tmpl'
    route:
      #receiver: 'email-receiver'
      receiver: 'dingding-receiver'
      #receiver: 'wechat-receiver'
      group_by: [cluster, alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 30m
    receivers:
      # Email alerts
      - name: 'email-receiver'
        email_configs:
          - to: 'fanxxxxuai@cxxxxne.com'
            send_resolved: true
            html: '{{ template "emailMessage" . }}'
      # DingTalk group alerts
      - name: 'dingding-receiver'
        webhook_configs:
          # In the URL below, only the token needs to be replaced with the token
          # from the webhook URL of the DingTalk group robot you created
          # 531c0f251944b69c6e731a3bea9a609d9557ebdbcf17b1bc0df8f7b9cf506734
          - url: http://alertmanager-webhook-adapter:80/webhook/send?channel_type=dingtalk&token=531c0f251944b69c6e731a3bea9a609d9557ebdbcf17b1bc0df8f7b9cf506734
            # Whether to send a notification when the alert is resolved
            send_resolved: true
      # WeCom group alerts
      - name: 'wechat-receiver'
        webhook_configs:
          # In the URL below, only the token needs to be replaced with the key
          # from the webhook URL of the WeCom group robot you created
          # 1cb01f46-f536-4c98-aeac-1455e1472e5d
          - url: http://alertmanager-webhook-adapter:80/webhook/send?channel_type=weixin&token=1cb01f46-f536-4c98-aeac-1455e1472e5d
            # Whether to send a notification when the alert is resolved
            send_resolved: true
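Instead of switching the root receiver by commenting lines in and out, the root route can keep email as the default and hand only some alerts to the chat group through a child route. A sketch of just the `route` section, assuming Alertmanager 0.22 or later (for the `matchers` syntax) and the receiver names defined above; splitting on `severity = "error"` is only an example, not the author's setup:

```yaml
route:
  # Default: everything not matched by a child route goes to email
  receiver: 'email-receiver'
  group_by: [cluster, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 30m
  routes:
    # Alerts labelled severity="error" go to the DingTalk group instead
    - matchers:
        - severity = "error"
      receiver: 'dingding-receiver'
```

Child routes inherit `group_by` and the timing settings from the root route unless they override them.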
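ps:
Add an 'alertinstance' label under labels in the Prometheus alerting rules; without it, the title, description, and other fields shown in the results below are missing.
The rule is as follows: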
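As a sketch of such a rule, the following sets `alertinstance` from the `instance` label of the matching series. The group name, alert name, expression, and annotation texts are illustrative and are not the author's actual rule; `severity` uses one of the values the templates above check for (`ok`, `info`, `warning`, `error`).

```yaml
groups:
  - name: target-status                   # illustrative group name
    rules:
      - alert: TargetDown                 # illustrative alert name
        # up is 0 when Prometheus fails to scrape a target
        expr: up == 0
        for: 1m
        labels:
          severity: error
          # Label read first by the "__alertinstance" template
          alertinstance: '{{ $labels.instance }}'
        annotations:
          summary: "Target {{ $labels.instance }} is down"
          description: "Prometheus has not been able to scrape {{ $labels.instance }} (job {{ $labels.job }}) for more than 1 minute."
```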
3.4 DingTalk group alert result
3.5 WeCom group alert result
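Summary
When sending alerts through Alertmanager, leaving the TLS requirement enabled (smtp_require_tls: true)
causes an error when the alert email is sent; set it to false and the email goes out successfully.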