Common Alerting Rules

node_exporter Alerting Rules
NodeCPUUsageHigh:

    - alert: NodeCPUUsageHigh
      expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High CPU usage detected on {{ $labels.instance }}"
        description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's CPU usage stays above 80% for 5 minutes.
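Each fragment in this article is one entry of a `rules:` list inside a rule group. A minimal sketch of a complete rule file might look like the following; the file name `node_rules.yml` and the group name `node_exporter_alerts` are illustrative assumptions, not names from the original article.

```yaml
# node_rules.yml -- hypothetical file name
groups:
  - name: node_exporter_alerts   # hypothetical group name
    rules:
      - alert: NodeCPUUsageHigh
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."
```

Such a file is listed under `rule_files:` in `prometheus.yml`, and `promtool check rules node_rules.yml` can validate it before Prometheus is reloaded.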

NodeMemoryUsageHigh:

    - alert: NodeMemoryUsageHigh
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High Memory usage detected on {{ $labels.instance }}"
        description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's memory usage stays above 90% for 5 minutes.

NodeDiskUsageHigh:

    - alert: NodeDiskUsageHigh
      expr: (node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Disk usage detected on {{ $labels.instance }}"
        description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when any filesystem on a node (excluding tmpfs and lxcfs) stays above 85% usage for 5 minutes.
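Many Linux filesystems reserve a share of blocks for root, so `node_filesystem_free_bytes` can overstate what ordinary processes can still write. One possible variant, sketched here under the assumption that the 85% threshold should apply to space usable by non-root users, is based on `node_filesystem_avail_bytes` and also names the mount point in the description:

```yaml
- alert: NodeDiskUsageHigh
  # Variant (assumption): measure against space available to non-root users.
  expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}) * 100 > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High Disk usage detected on {{ $labels.instance }}"
    description: "Disk usage on {{ $labels.mountpoint }} is above 85% for more than 5 minutes on {{ $labels.instance }}."
```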

NodeFilesystemReadOnly:

    - alert: NodeFilesystemReadOnly
      expr: node_filesystem_readonly{fstype!~"tmpfs|fuse.lxcfs"} == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Filesystem is read-only on {{ $labels.instance }}"
        description: "Filesystem has been read-only for more than 10 minutes on {{ $labels.instance }}."

Explanation: Fires when a filesystem on a node has been read-only for 10 minutes.

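NodeLoadAverageHigh:

    - alert: NodeLoadAverageHigh
      expr: node_load1 > 2 * count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High load average on {{ $labels.instance }}"
        description: "1-minute load average is more than twice the number of CPUs for over 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's 1-minute load average stays above twice its CPU count for 5 minutes. The CPU count is aggregated with `without (cpu, mode)` so that it keeps the same `instance` and `job` labels as `node_load1`; a bare `count(...)` would yield one label-less value for all nodes and never match.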
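If the per-node CPU count is needed by several rules, it can be precomputed with a recording rule. The sketch below assumes a group name `node_recording_rules` and a record name `instance:node_num_cpu:sum`; both are illustrative choices rather than names from the original article.

```yaml
groups:
  - name: node_recording_rules   # hypothetical group name
    rules:
      # One series per node, keeping the instance and job labels.
      - record: instance:node_num_cpu:sum   # hypothetical record name
        expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
      - alert: NodeLoadAverageHigh
        expr: node_load1 > 2 * instance:node_num_cpu:sum
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "1-minute load average is more than twice the number of CPUs for over 5 minutes on {{ $labels.instance }}."
```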

NodeNetworkDown:

    - alert: NodeNetworkDown
      expr: up{job="node_exporter"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Node down: {{ $labels.instance }}"
        description: "Node has been down for more than 10 minutes."

Explanation: Fires when Prometheus has been unable to scrape a node's node_exporter for 10 minutes.
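Rules like this one can be exercised offline with `promtool test rules`. The following is a sketch of a unit-test file, assuming the NodeNetworkDown rule is saved in a rule file named `node_rules.yml` and using a made-up target `host1:9100`; both names are assumptions for illustration.

```yaml
# node_rules_test.yml -- hypothetical file name
rule_files:
  - node_rules.yml          # assumed to contain the NodeNetworkDown rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The target is reported as down (0) for 16 consecutive samples.
      - series: 'up{job="node_exporter", instance="host1:9100"}'
        values: '0+0x15'
    alert_rule_test:
      # After more than 10 minutes of up == 0 the alert should be firing.
      - eval_time: 11m
        alertname: NodeNetworkDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node_exporter
              instance: host1:9100
            exp_annotations:
              summary: "Node down: host1:9100"
              description: "Node has been down for more than 10 minutes."
```

Running `promtool test rules node_rules_test.yml` reports whether the expected alert matches what the rule produces.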

NodeSwapUsageHigh:

    - alert: NodeSwapUsageHigh
      expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High swap usage on {{ $labels.instance }}"
        description: "Swap usage is above 50% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a node's swap usage stays above 50% for 5 minutes.

NodeFileSystemInodesUsageHigh:

    - alert: NodeFileSystemInodesUsageHigh
      expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High filesystem inodes usage on {{ $labels.instance }}"
        description: "Filesystem inodes usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when inode usage on a node's filesystem stays above 80% for 5 minutes.

NodeTemperatureHigh:

    - alert: NodeTemperatureHigh
      expr: node_hwmon_temp_celsius > 75
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High temperature on {{ $labels.instance }}"
        description: "Node temperature is above 75 degrees Celsius for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a hardware temperature sensor on a node reads above 75 °C for 5 minutes.

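NodeProcessCountHigh:

    - alert: NodeProcessCountHigh
      expr: node_processes_pids > 500
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High process count on {{ $labels.instance }}"
        description: "Number of processes is above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the number of processes on a node stays above 500 for 5 minutes. `node_processes_pids` comes from node_exporter's `processes` collector, which is disabled by default and has to be enabled with `--collector.processes`. (`node_scrape_collector_duration_seconds` only counts collectors, not processes.)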

windows_exporter Alerting Rules
WindowsCPUUsageHigh:

    - alert: WindowsCPUUsageHigh
      expr: avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) < 0.2
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's CPU usage stays above 80% (idle share below 20%) for 5 minutes.

WindowsMemoryUsageHigh:

    - alert: WindowsMemoryUsageHigh
      expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High Memory usage on {{ $labels.instance }}"
        description: "Memory usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's memory usage stays above 90% for 5 minutes. The subtraction is parenthesized because `/` and `*` bind more tightly than `-` in PromQL.

WindowsDiskUsageHigh:

    - alert: WindowsDiskUsageHigh
      expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100 < 15
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Disk usage on {{ $labels.instance }}"
        description: "Disk usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a logical disk on a Windows node stays above 85% usage (less than 15% free) for 5 minutes.

WindowsNetworkDown:

    - alert: WindowsNetworkDown
      expr: up{job="windows_exporter"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Node down: {{ $labels.instance }}"
        description: "Node has been down for more than 10 minutes."

Explanation: Fires when Prometheus has been unable to scrape a Windows node's windows_exporter for 10 minutes.

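WindowsServiceNotRunning:

    - alert: WindowsServiceNotRunning
      expr: windows_service_state{state="running"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Service not running on {{ $labels.instance }}"
        description: "A critical service is not running for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a service on a Windows node has not been in the running state for 5 minutes. `windows_service_state` carries one series per service and state, with value 1 for the service's current state; add a `name` matcher to limit the rule to the services you consider critical.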
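Without a filter, a service-state rule covers every service the exporter reports, including ones that are stopped on purpose. A sketch of a narrowed rule follows, assuming a windows_exporter version that exposes `windows_service_state` with `name` and `state` labels; the service names `w3svc` and `mssqlserver` are placeholders, not values from the original article.

```yaml
- alert: WindowsServiceNotRunning
  # Placeholder service names; replace with the services that matter to you.
  expr: windows_service_state{state="running", name=~"w3svc|mssqlserver"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.name }} not running on {{ $labels.instance }}"
    description: "Service {{ $labels.name }} has not been running for more than 5 minutes on {{ $labels.instance }}."
```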

WindowsPageFileUsageHigh:

    - alert: WindowsPageFileUsageHigh
      expr: windows_os_paging_free_bytes / windows_os_paging_limit_bytes * 100 < 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Page File usage on {{ $labels.instance }}"
        description: "Page file usage is above 90% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's page file usage stays above 90% (less than 10% free) for 5 minutes.

WindowsHandleCountHigh:

    - alert: WindowsHandleCountHigh
      expr: windows_process_handles_count > 10000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Handle count on {{ $labels.instance }}"
        description: "Number of handles is above 10,000 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a process on a Windows node holds more than 10,000 handles for 5 minutes. The metric is reported per process, so the threshold applies to each process rather than to the node as a whole.

WindowsThreadCountHigh:

    - alert: WindowsThreadCountHigh
      expr: windows_process_threads > 500
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Thread count on {{ $labels.instance }}"
        description: "Number of threads is above 500 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a process on a Windows node runs more than 500 threads for 5 minutes. As with handles, the metric is reported per process.

WindowsProcessorQueueLengthHigh:

    - alert: WindowsProcessorQueueLengthHigh
      expr: windows_system_processor_queue_length > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Processor Queue Length on {{ $labels.instance }}"
        description: "Processor queue length is above 10 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a Windows node's processor queue length stays above 10 for 5 minutes.

WindowsDiskIOWaitHigh:

    - alert: WindowsDiskIOWaitHigh
      expr: avg by (instance) (rate(windows_logical_disk_idle_seconds_total{instance=~"[^:]+:.*"}[5m])) < 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High Disk I/O Wait on {{ $labels.instance }}"
        description: "Disk I/O wait time is high for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the logical disks on a Windows node are idle less than 10% of the time on average for 5 minutes, which indicates sustained heavy disk I/O.

mysqld_exporter Alerting Rules
MySQLHighThreadsRunning:

    - alert: MySQLHighThreadsRunning
      expr: mysql_global_status_threads_running > 50
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High number of running threads in MySQL on {{ $labels.instance }}"
        description: "Number of running threads is above 50 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 50 running threads for 5 minutes.

MySQLSlowQueries:

    - alert: MySQLSlowQueries
      expr: rate(mysql_global_status_slow_queries[5m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Slow queries detected on {{ $labels.instance }}"
        description: "One or more slow queries detected in the last 10 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance's slow-query counter keeps increasing for 10 minutes.

MySQLReplicationLag:

    - alert: MySQLReplicationLag
      expr: mysql_slave_status_seconds_behind_master > 10
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Replication lag on MySQL slave {{ $labels.instance }}"
        description: "Replication lag is more than 10 seconds for over 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL replica lags more than 10 seconds behind its source for 5 minutes.

MySQLHighConnections:

    - alert: MySQLHighConnections
      expr: mysql_global_status_threads_connected > 200
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High number of connections to MySQL on {{ $labels.instance }}"
        description: "Number of connections is above 200 for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance has more than 200 open connections for 5 minutes.

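MySQLLowFreeDiskSpace:

    - alert: MySQLLowFreeDiskSpace
      expr: (mysql_global_variables_datadir - node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}) / mysql_global_variables_datadir * 100 > 90
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Low free disk space for MySQL on {{ $labels.instance }}"
        description: "Free disk space for MySQL data directory is below 10% for more than 10 minutes on {{ $labels.instance }}."

Explanation: Intended to fire when less than 10% of the disk space for the MySQL data directory is free for 10 minutes. Note that `datadir` is a path, not a byte count, and mysqld_exporter publishes only numeric variables, so this expression has no usable size to compare against; free space on the data directory has to come from node_exporter's filesystem metrics for the corresponding mount point.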
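One way to express the same intent with node_exporter's filesystem metrics is sketched below. It assumes that node_exporter runs on the database host and that the MySQL data directory sits on a filesystem mounted at `/var/lib/mysql`; the mount point is a placeholder and must match your own layout.

```yaml
- alert: MySQLLowFreeDiskSpace
  # Assumption: the MySQL data directory is on the filesystem mounted at /var/lib/mysql.
  expr: node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes{mountpoint="/var/lib/mysql"} * 100 < 10
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Low free disk space for MySQL on {{ $labels.instance }}"
    description: "Free disk space for MySQL data directory is below 10% for more than 10 minutes on {{ $labels.instance }}."
```

Here `$labels.instance` refers to the node_exporter target, not the mysqld_exporter target.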

MySQLHighQueryTime:

    - alert: MySQLHighQueryTime
      expr: rate(mysql_global_status_questions[5m]) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High query rate on MySQL on {{ $labels.instance }}"
        description: "Query rate is above 1000 queries per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance handles more than 1000 queries per second for 5 minutes. Despite its name, the rule measures query rate, not query duration.

MySQLInnoDBBufferPoolUsageHigh:

    - alert: MySQLInnoDBBufferPoolUsageHigh
      expr: (mysql_global_status_innodb_buffer_pool_bytes_data / mysql_global_variables_innodb_buffer_pool_size) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High InnoDB buffer pool usage on MySQL on {{ $labels.instance }}"
        description: "InnoDB buffer pool usage is above 85% for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when the data held in a MySQL instance's InnoDB buffer pool stays above 85% of the configured `innodb_buffer_pool_size` for 5 minutes.

MySQLMaxConnectionsReached:

    - alert: MySQLMaxConnectionsReached
      expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Max connections threshold reached on MySQL on {{ $labels.instance }}"
        description: "Connections usage is above 90% of the max_connections limit for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance's open connections stay above 90% of `max_connections` for 5 minutes.

MySQLTableLocksContention:

    - alert: MySQLTableLocksContention
      expr: rate(mysql_global_status_table_locks_waited[5m]) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Table locks contention in MySQL on {{ $labels.instance }}"
        description: "More than 10 table locks waited per second for more than 5 minutes on {{ $labels.instance }}."

Explanation: Fires when a MySQL instance sees more than 10 table-lock waits per second for 5 minutes.

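MySQLLongRunningQueries:

    - alert: MySQLLongRunningQueries
      expr: mysql_global_status_queries{job="mysql"} > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Long running queries in MySQL on {{ $labels.instance }}"
        description: "More than 100 long running queries for over 10 minutes on {{ $labels.instance }}."

Explanation: Intended to fire when a MySQL instance has more than 100 long-running queries for 10 minutes. Note that `mysql_global_status_queries` is a cumulative counter of all statements executed since server start, so this expression compares the total statement count with 100 and does not actually identify long-running queries.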

blackbox_exporter Alerting Rules
EndpointDown:

    - alert: EndpointDown
      expr: probe_success == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Endpoint down: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} has been down for more than 5 minutes."

Explanation: Fires when probes of an endpoint have failed continuously for 5 minutes.
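The blackbox rules rely on `{{ $labels.instance }}` holding the probed endpoint rather than the exporter's own address, which is typically arranged through relabeling in the scrape configuration. A sketch of such a job follows; the job name, the target `https://example.com`, and the exporter address `127.0.0.1:9115` are assumptions for illustration.

```yaml
scrape_configs:
  - job_name: blackbox_http        # hypothetical job name
    metrics_path: /probe
    params:
      module: [http_2xx]           # module defined in the blackbox_exporter config
    static_configs:
      - targets:
          - https://example.com    # placeholder endpoint to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed endpoint as the instance label used in alert annotations.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox_exporter itself (assumed address).
      - target_label: __address__
        replacement: 127.0.0.1:9115
```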

EndpointHighLatency:

    - alert: EndpointHighLatency
      expr: probe_duration_seconds > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} has a latency higher than 0.5 seconds for more than 5 minutes."

Explanation: Fires when an endpoint's probe duration stays above 0.5 seconds for 5 minutes.

EndpointDNSResolutionFailure:

    - alert: EndpointDNSResolutionFailure
      expr: probe_dns_lookup_time_seconds > 2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "DNS resolution failure for endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} has DNS resolution time higher than 2 seconds for more than 5 minutes."

Explanation: Fires when DNS resolution for an endpoint takes longer than 2 seconds for 5 minutes.

EndpointConnectionTimeout:

    - alert: EndpointConnectionTimeout
      expr: probe_tcp_connect_duration_seconds > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Connection timeout on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} has connection timeout higher than 1 second for more than 5 minutes."

Explanation: Fires when establishing a connection to an endpoint takes longer than 1 second for 5 minutes.

EndpointCertificateExpiry:

    - alert: EndpointCertificateExpiry
      expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
        description: "The SSL certificate for endpoint {{ $labels.instance }} will expire in less than 7 days."

Explanation: Fires when an endpoint's SSL certificate will expire within 7 days.
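Because the expression evaluates to the number of seconds left before expiry, that value can be shown in the notification. The variant below is a sketch that uses Prometheus's `humanizeDuration` template function on `$value`; the wording of the description is an illustrative choice.

```yaml
- alert: EndpointCertificateExpiry
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expiry for endpoint: {{ $labels.instance }}"
    # $value is the remaining lifetime in seconds; humanizeDuration renders it in days/hours.
    description: "The SSL certificate for endpoint {{ $labels.instance }} expires in {{ $value | humanizeDuration }}."
```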

HTTPResponseCodeMismatch:

    - alert: HTTPResponseCodeMismatch
      expr: probe_http_status_code != 200
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Unexpected HTTP response code on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} returned an unexpected HTTP response code for more than 5 minutes."

Explanation: Fires when an endpoint returns an HTTP status code other than 200 for 5 minutes.

EndpointHTTPRedirects:

    - alert: EndpointHTTPRedirects
      expr: probe_http_redirects > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Too many HTTP redirects on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} encountered more than 5 HTTP redirects for more than 5 minutes."

Explanation: Fires when probes of an endpoint follow more than 5 HTTP redirects for 5 minutes.

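EndpointDNSLookupFailures:

    - alert: EndpointDNSLookupFailures
      expr: increase(probe_dns_lookup_time_seconds[5m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "DNS lookup failures on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} had DNS lookup failures for more than 10 minutes."

Explanation: Intended to fire when DNS lookups for an endpoint have been failing for 10 minutes. Note that `probe_dns_lookup_time_seconds` is a gauge holding the lookup duration, so `increase()` over it does not count failures; a failed lookup makes the probe itself fail and shows up as `probe_success == 0`.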

EndpointTCPConnectionRefused:

    - alert: EndpointTCPConnectionRefused
      expr: probe_tcp_connect_success == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "TCP connection refused on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} refused TCP connections for more than 5 minutes."

Explanation: Fires when TCP connections to an endpoint have failed continuously for 5 minutes.

EndpointSSLCertificateInvalid:

    - alert: EndpointSSLCertificateInvalid
      expr: probe_ssl_last_chain_expiry_timestamp_seconds - time() < 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Invalid SSL certificate on endpoint: {{ $labels.instance }}"
        description: "The endpoint {{ $labels.instance }} has an invalid SSL certificate."

Explanation: Fires when an endpoint's certificate chain has already expired, that is, when its expiry timestamp lies in the past. Other kinds of invalid certificates are not detected by this expression.

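cAdvisor Alerting Rules
ContainerCPUUsageHigh:

    - alert: ContainerCPUUsageHigh
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} has CPU usage above 80% of one core for more than 5 minutes."

Explanation: Fires when a container uses more than 0.8 CPU cores (80% of one core) for 5 minutes. The rate of `container_cpu_usage_seconds_total` is measured in cores, not as a share of the container's CPU limit.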

ContainerMemoryUsageHigh:

    - alert: ContainerMemoryUsageHigh
      expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High memory usage on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} has memory usage above 90% for more than 5 minutes."

Explanation: Fires when a container's memory usage stays above 90% of its memory limit for 5 minutes. The `> 0` filter skips containers without a limit, whose limit is reported as 0 and would otherwise turn the ratio into +Inf.

ContainerDiskIOHigh:

    - alert: ContainerDiskIOHigh
      expr: rate(container_fs_io_time_seconds_total[5m]) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High disk I/O on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} has disk I/O usage above 50% for more than 5 minutes."

Explanation: Fires when a container spends more than 50% of the time doing disk I/O for 5 minutes.

ContainerNetworkErrors:

    - alert: ContainerNetworkErrors
      expr: rate(container_network_receive_errors_total[5m]) > 0 or rate(container_network_transmit_errors_total[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Network errors on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} is experiencing network errors for more than 5 minutes."

Explanation: Fires when a container keeps reporting network receive or transmit errors for 5 minutes.

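ContainerRestarting:

    - alert: ContainerRestarting
      expr: increase(container_restart_count[5m]) > 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Container restarting frequently: {{ $labels.container }}"
        description: "Container {{ $labels.container }} has restarted more than once in the last 5 minutes."

Explanation: Fires when a container restarts more than once within 5 minutes. `increase()` is used because `rate()` returns a per-second value, so `rate(...) > 1` would require more than one restart per second.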

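ContainerOOMKilled:

    - alert: ContainerOOMKilled
      expr: increase(container_oom_events_total[5m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Container OOMKilled: {{ $labels.container }}"
        description: "Container {{ $labels.container }} was killed due to out of memory in the last 5 minutes."

Explanation: Fires as soon as cAdvisor records an out-of-memory kill for a container within the last 5 minutes. PromQL cannot compare a metric with a string such as "OOMKilled", so the rule relies on cAdvisor's `container_oom_events_total` counter, and `for` is set to 0m because the increase is only visible for the length of the 5-minute window.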

ContainerFileSystemFull:

    - alert: ContainerFileSystemFull
      expr: container_fs_usage_bytes / container_fs_limit_bytes > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "File system full on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} file system usage is above 90% for more than 5 minutes."

Explanation: Fires when a container's filesystem usage stays above 90% for 5 minutes.

ContainerMemoryLeak:

    - alert: ContainerMemoryLeak
      expr: delta(container_memory_usage_bytes[1h]) > 1000000000
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Possible memory leak on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} has increased its memory usage by more than 1 GB in the last hour."

Explanation: Fires when a container's memory usage has grown by more than 1 GB over the past hour, which may indicate a memory leak. `delta()` is used because `container_memory_usage_bytes` is a gauge; `increase()` is meant for counters.

ContainerHighCPUThrottling:

    - alert: ContainerHighCPUThrottling
      expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU throttling on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} is experiencing CPU throttling more than 20% of the time for more than 5 minutes."

Explanation: Fires when a container is throttled in more than 20% of its CFS scheduling periods for 5 minutes. Both sides of the ratio count periods, so the result is a dimensionless share.

ContainerNetworkTrafficDrop:

    - alert: ContainerNetworkTrafficDrop
      expr: rate(container_network_receive_packets_dropped_total[5m]) > 0 or rate(container_network_transmit_packets_dropped_total[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Network traffic drops on container {{ $labels.container }}"
        description: "Container {{ $labels.container }} is experiencing network packet drops for more than 5 minutes."

Explanation: Fires when a container keeps dropping received or transmitted network packets for 5 minutes.

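These alerting rules cover several exporters and help you monitor the health of your systems and services. With thresholds and conditions tuned to your environment, you can detect and resolve potential problems early and keep systems stable and performant.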
