Prometheus Configuration and Management

1 Configuration File

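Prometheus is configured through command-line flags and a configuration file. The command-line flags set immutable system parameters (such as storage locations and the amount of data to keep on disk and in memory), while the configuration file defines everything related to scrape jobs and their instances, as well as which rule files to load.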

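Prometheus can reload its configuration at runtime. If the new configuration is not well-formed, the changes are not applied. A reload is triggered by sending SIGHUP to the Prometheus process or by sending an HTTP POST request to the /-/reload endpoint (which requires the --web.enable-lifecycle flag). This also reloads all configured rule files.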
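As a rough illustration of the HTTP variant, the sketch below only builds the POST request and does not send it. The helper name build_reload_request and the default address http://localhost:9090 are assumptions made for this example.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build (but do not send) the HTTP POST that asks Prometheus to reload."""
    url = base_url.rstrip("/") + "/-/reload"
    # The endpoint takes no payload, so an empty body is sent.
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    req = build_reload_request()
    print(req.get_method(), req.full_url)
    # Sending it needs a running server started with --web.enable-lifecycle:
    #     with urllib.request.urlopen(req, timeout=5) as resp:
    #         print(resp.status)
```

Keeping request construction separate from sending makes the helper easy to check without a live server.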

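The configuration file to load is specified at startup with the --config.file flag. The global section below, for example, sets parameters that are valid in all other configuration contexts and also serves as defaults for other configuration sections.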

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # Labels added to any time series or alert when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

# A list of globs. Rules and alerts are read from all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
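To make the <duration> values more concrete, here is a minimal Python sketch that parses strings such as 1m or 10s and checks that the scrape timeout does not exceed the scrape interval, a combination Prometheus rejects when loading the configuration. The function names and the plain dict standing in for the global section are hypothetical, and the parser is a simplified approximation of the Prometheus duration format.

```python
import re

# Unit sizes in milliseconds. Assumption: a week is 7 days and a year 365 days.
_UNIT_MS = {"y": 31_536_000_000, "w": 604_800_000, "d": 86_400_000,
            "h": 3_600_000, "m": 60_000, "s": 1000, "ms": 1}
_UNITS = ("y", "w", "d", "h", "m", "s", "ms")
_DURATION_RE = re.compile(
    r"^(?:(\d+)y)?(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?"
    r"(?:(\d+)m)?(?:(\d+)s)?(?:(\d+)ms)?$"
)


def parse_duration_ms(text: str) -> int:
    """Convert a duration such as '1h30m' into milliseconds."""
    if text == "0":
        return 0
    match = _DURATION_RE.match(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return sum(int(value) * _UNIT_MS[unit]
               for unit, value in zip(_UNITS, match.groups()) if value)


def check_global(cfg: dict) -> None:
    """Raise ValueError if scrape_timeout is longer than scrape_interval."""
    interval = parse_duration_ms(cfg.get("scrape_interval", "1m"))
    timeout = parse_duration_ms(cfg.get("scrape_timeout", "10s"))
    if timeout > interval:
        raise ValueError("scrape_timeout must not exceed scrape_interval")


check_global({"scrape_interval": "15s", "scrape_timeout": "10s"})  # passes silently
```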

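The remaining configuration parameters are described in the official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

2 Rules

Prometheus supports two types of rules that can be configured and are then evaluated at regular intervals: recording rules and alerting rules. Rule files are written in YAML and are loaded through the rule_files field of the Prometheus configuration.

Rule files can be reloaded at runtime by sending SIGHUP to the Prometheus process. kill -1 pid sends SIGHUP to the given process.

The promtool utility checks a rule file for syntax errors. It ships with the Prometheus release archive, so a regular download already includes it.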

promtool check rules /path/to/example.rules.yml

2.1 Recording rules

Recording rules let you precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.

groups:
- name: example
  rules:
  # Name of the time series to output.
  - record: job:http_inprogress_requests:sum
    # PromQL expression to evaluate.
    expr: sum(http_inprogress_requests) by (job)
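To show what the expression in this rule computes, the following Python sketch mimics sum(...) by (job) over a handful of made-up samples. The sample values and the helper name sum_by_job are invented for illustration. Real evaluation happens inside Prometheus.

```python
from collections import defaultdict


def sum_by_job(samples):
    """samples: iterable of (labels_dict, value) pairs taken at one instant."""
    totals = defaultdict(float)
    for labels, value in samples:
        # All series sharing the same job label collapse into one output series.
        totals[labels.get("job", "")] += value
    return dict(totals)


samples = [
    ({"job": "api", "instance": "a:8080"}, 3.0),
    ({"job": "api", "instance": "b:8080"}, 4.0),
    ({"job": "web", "instance": "c:8080"}, 1.0),
]
print(sum_by_job(samples))  # {'api': 7.0, 'web': 1.0}
```

The recording rule stores one such aggregated value per job at every evaluation, so dashboards can read the precomputed series instead of re-running the aggregation.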

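2.2 Alerting rules

Alerting rules define alert conditions based on Prometheus expressions and send notifications about firing alerts to an external service. Whenever the alert expression yields one or more vector elements at a given point in time, the alert counts as active for the label sets of those elements.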

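groups:
- name: example
  rules:
  # Name of the alert.
  - alert: HighErrorRate
    # PromQL expression to evaluate. It is evaluated at the current time in
    # every evaluation cycle, and all resulting time series become
    # pending/firing alerts.
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency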
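The pending/firing distinction together with for: 10m can be sketched as a small state machine. This is a simplified model for a single series, not the Prometheus implementation. The function name alert_states and the sample evaluations are assumptions for the example.

```python
def alert_states(evaluations, threshold=0.5, hold_seconds=600):
    """evaluations: list of (timestamp_seconds, value) in time order.

    Returns a list of (timestamp, state) where state is one of
    'inactive', 'pending' or 'firing'.
    """
    active_since = None
    states = []
    for ts, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            # The alert fires only after the condition held for the whole `for` period.
            state = "firing" if ts - active_since >= hold_seconds else "pending"
        else:
            active_since = None  # condition cleared, start over
            state = "inactive"
        states.append((ts, state))
    return states


evals = [(0, 0.6), (300, 0.7), (600, 0.8), (900, 0.2)]
print(alert_states(evals))
```

With these samples the alert is pending at 0 s and 300 s, firing at 600 s, and inactive again at 900 s.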

3 PromQL

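Prometheus provides a functional expression language, PromQL (Prometheus Query Language), that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph, viewed as a table in the expression browser, or consumed by external systems through the HTTP API.

In the Prometheus expression language, every expression or subexpression evaluates to one of four types:

instant vector: a set of time series containing a single sample for each series, all sharing the same timestamp. An expression that returns an instant vector is the only type that can be graphed directly.
range vector: a set of time series containing a range of data points over time for each series.
scalar: a simple floating-point value.
string: a simple string value, currently unused.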

3.1 Literals

Strings may be specified as literals in single quotes, double quotes, or backticks.

"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t`

Scalar float values can be written literally as numbers of the form [-](digits)[.(digits)].

-2.43

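3.2 Time series selectors

Instant vector selectors select a set of time series and a single sample value for each of them at a given timestamp (instant).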

# Select all time series that have the metric name http_requests_total.
http_requests_total
# Select the time series with the metric name http_requests_total whose job label is prometheus and whose group label is canary.
http_requests_total{job="prometheus",group="canary"}
# Select http_requests_total series whose environment label matches staging, testing, or development by regular expression, and whose HTTP method is not GET.
http_requests_total{environment=~"staging|testing|development",method!="GET"}
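The four matcher operators (=, !=, =~, !~) can be approximated in a few lines of Python. This sketch assumes that regular expressions are anchored to the whole label value and that a missing label behaves like an empty string. It uses Python's re module, whose syntax is not identical to the RE2 syntax Prometheus uses. The function name matches and the sample series are made up.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # missing label is treated as empty
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


series = {"__name__": "http_requests_total", "environment": "staging", "method": "POST"}
selector = [("environment", "=~", "staging|testing|development"), ("method", "!=", "GET")]
print(matches(series, selector))  # True
```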

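Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.

The following selects all values recorded within the last 5 minutes for all time series that have the metric name http_requests_total and a job label set to prometheus: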

http_requests_total{job="prometheus"}[5m]

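The offset modifier shifts the time offset of individual instant and range vectors in a query.

# Return the value of http_requests_total 5 minutes in the past, relative to the current query evaluation time: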
http_requests_total offset 5m
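How a range selector and offset interact can be pictured as a sliding window over one series' samples. The sketch below is a conceptual model only. The function name select_range is hypothetical, and the window is treated as excluding its start and including its end, a boundary rule that has differed between Prometheus versions.

```python
def select_range(samples, eval_time, range_seconds, offset_seconds=0):
    """samples: list of (timestamp_seconds, value) for a single series."""
    end = eval_time - offset_seconds   # offset moves the whole window back
    start = end - range_seconds        # the range extends back from there
    # Assumption: start is excluded and end is included.
    return [(t, v) for t, v in samples if start < t <= end]


# One sample every 60 seconds from t=0 to t=600.
samples = [(t, float(t)) for t in range(0, 601, 60)]

last_5m = select_range(samples, eval_time=600, range_seconds=300)
shifted = select_range(samples, eval_time=600, range_seconds=300, offset_seconds=300)
print([t for t, _ in last_5m])   # [360, 420, 480, 540, 600]
print([t for t, _ in shifted])   # [60, 120, 180, 240, 300]
```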

4 API

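The current stable HTTP API is reachable under /api/v1 on a Prometheus server. Responses are in JSON, and every successful request returns a 2xx status code. Invalid requests return a JSON error object together with one of the following HTTP status codes: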

400 Bad Request: parameters are missing or incorrect.
422 Unprocessable Entity: an expression cannot be executed.
503 Service Unavailable: a query timed out or was aborted.
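A client might turn these codes into readable messages roughly as follows. The helper describe_response is made up for this sketch. It assumes the error object carries an error field with a human-readable message, as described in the Prometheus HTTP API documentation.

```python
import json

STATUS_MEANING = {
    400: "parameters are missing or incorrect",
    422: "the expression cannot be executed",
    503: "the query timed out or was aborted",
}


def describe_response(status_code: int, body: str) -> str:
    """Summarize an API response given its HTTP status code and raw body."""
    if 200 <= status_code < 300:
        return "success"
    reason = STATUS_MEANING.get(status_code, "unexpected status")
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None  # body was not JSON, fall back to the status alone
    detail = payload.get("error", "") if isinstance(payload, dict) else ""
    return f"{status_code}: {reason}" + (f" ({detail})" if detail else "")


body = '{"status": "error", "errorType": "bad_data", "error": "invalid parameter"}'
print(describe_response(400, body))
```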

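The /api/v1/query endpoint accepts the following parameters:

query=<string>: Prometheus expression query string.
time=<rfc3339 | unix_timestamp>: evaluation timestamp. If time is omitted, the current server time is used.
timeout=<duration>: evaluation timeout. Optional; defaults to the value of the --query.timeout flag.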

4.1 Querying metric data

The following example evaluates the expression up at a given point in time:

$ curl 'http://localhost:9090/api/v1/query?query=up&time=2015-07-01T20:10:51.781Z'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "job": "prometheus",
          "instance": "localhost:9090"
        },
        "value": [ 1435781451.781, "1" ]
      },
      {
        "metric": {
          "__name__": "up",
          "job": "node",
          "instance": "localhost:9100"
        },
        "value": [ 1435781451.781, "0" ]
      }
    ]
  }
}
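Since sample values arrive as strings inside a [timestamp, value] pair, a client has to convert them before comparing. The sketch below parses the response shown above and lists the instances whose up value is 0. The function name down_instances is an assumption for the example.

```python
import json

RESPONSE = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
   "value": [1435781451.781, "1"]},
  {"metric": {"__name__": "up", "job": "node", "instance": "localhost:9100"},
   "value": [1435781451.781, "0"]}
]}}
"""


def down_instances(body: str):
    """Return the instance labels of all series whose sample value is 0."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError("query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    # Each value is [unix_timestamp, "<sample value as string>"].
    return [item["metric"].get("instance", "")
            for item in data["result"]
            if float(item["value"][1]) == 0.0]


print(down_instances(RESPONSE))  # ['localhost:9100']
```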

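In the same way, the /api/v1/query_range endpoint evaluates an expression over a range of time:

query=<string>: Prometheus expression query string.
start=<rfc3339 | unix_timestamp>: start timestamp.
end=<rfc3339 | unix_timestamp>: end timestamp.
step=<duration | float>: query resolution step width, as a duration or a float number of seconds.
timeout=<duration>: evaluation timeout. Defaults to and is capped by the value of the --query.timeout flag.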
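The start, end and step parameters together determine the evaluation timestamps: the expression is evaluated at start and then every step seconds up to end. The sketch below builds such a request URL and lists those timestamps. Both function names and the base address are assumptions for the example.

```python
from urllib.parse import urlencode


def query_range_url(base_url, query, start, end, step, timeout=None):
    """Build a /api/v1/query_range URL with properly encoded parameters."""
    params = {"query": query, "start": start, "end": end, "step": step}
    if timeout is not None:
        params["timeout"] = timeout
    return base_url.rstrip("/") + "/api/v1/query_range?" + urlencode(params)


def step_timestamps(start, end, step):
    """List the evaluation timestamps (in seconds) from start to end."""
    if step <= 0:
        raise ValueError("step must be positive")
    count = int((end - start) // step)
    # Multiply instead of accumulating to avoid floating-point drift.
    return [start + i * step for i in range(count + 1)]


print(query_range_url("http://localhost:9090", "up",
                      "2015-07-01T20:10:30.781Z", "2015-07-01T20:11:00.781Z", "15s"))
print(step_timestamps(0, 60, 15))  # [0, 15, 30, 45, 60]
```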

4.2 Querying metadata

The /api/v1/series endpoint returns series metadata. One or more series selectors are passed through the repeated match[] parameter.

$ curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'
{
  "status": "success",
  "data": [
    {
      "__name__": "up",
      "job": "prometheus",
      "instance": "localhost:9090"
    },
    {
      "__name__": "up",
      "job": "node",
      "instance": "localhost:9091"
    },
    {
      "__name__": "process_start_time_seconds",
      "job": "prometheus",
      "instance": "localhost:9090"
    }
  ]
}
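Because match[] is repeated once per selector and contains characters such as brackets, braces and quotes, it helps to let a library do the encoding. A minimal sketch, with series_url as a made-up helper name:

```python
from urllib.parse import urlencode


def series_url(base_url, selectors):
    """Build a /api/v1/series URL with one match[] parameter per selector."""
    # A list of pairs (rather than a dict) allows the same key to repeat.
    pairs = [("match[]", selector) for selector in selectors]
    return base_url.rstrip("/") + "/api/v1/series?" + urlencode(pairs)


url = series_url("http://localhost:9090",
                 ["up", 'process_start_time_seconds{job="prometheus"}'])
print(url)
```

The percent-encoded form also avoids the need for curl's -g flag, which the curl example uses to stop curl from interpreting the brackets and braces itself.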

4.3 Querying targets

The /api/v1/targets endpoint returns information about the targets that Prometheus monitors.

$ curl http://localhost:9090/api/v1/targets
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "127.0.0.1:9090",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "job": "prometheus"
        },
        "labels": {
          "instance": "127.0.0.1:9090",
          "job": "prometheus"
        },
        "scrapeUrl": "http://127.0.0.1:9090/metrics",
        "lastError": "",
        "lastScrape": "2017-01-17T15:07:44.723715405+01:00",
        "health": "up"
      }
    ],
    "droppedTargets": [
      {
        "discoveredLabels": {
          "__address__": "127.0.0.1:9100",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "job": "node"
        }
      }
    ]
  }
}

Several other query endpoints exist. See https://prometheus.io/docs/prometheus/latest/querying/api/ for details.

# List the configured rules
curl http://localhost:9090/api/v1/rules
# List all active alerts
curl http://localhost:9090/api/v1/alerts
# Show runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo

4.4 Management API

Prometheus provides the following endpoints for management:

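# Health check: always returns 200. Use it to check that Prometheus is running.
curl http://localhost:9090/-/healthy
# Readiness check: returns 200 once Prometheus is ready to serve traffic (i.e. respond to queries).
GET /-/ready
# Reload: triggers a reload of the configuration and rule files. Disabled by default; enable it with the --web.enable-lifecycle flag.
curl -X POST http://localhost:9090/-/reload
# A reload can also be triggered by sending SIGHUP to the Prometheus process.
kill -HUP pid
# Quit: triggers a graceful shutdown of Prometheus. Disabled by default; enable it with the --web.enable-lifecycle flag.
PUT  /-/quit
POST /-/quit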
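A deployment script often needs to wait for the readiness endpoint before sending queries. The sketch below polls /-/ready a limited number of times. The function names, retry counts and default address are assumptions, and the probe is injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures and non-2xx answers both count as "not ready".
        return False


def wait_until_ready(base_url="http://localhost:9090", attempts=5, delay=1.0,
                     probe=http_probe, sleep=time.sleep):
    """Poll /-/ready until it succeeds or the attempts are used up."""
    url = base_url.rstrip("/") + "/-/ready"
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    return False
```

Calling wait_until_ready() with the defaults gives a server roughly four seconds to come up, plus the time spent on each probe, before the function returns False.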