Table of Contents
Background:
Environment Preparation:
1. Disk preparation
2. Disk partitioning and formatting
local storage deployment
1. Label the node
2. Create the local pv StorageClass and prometheus-pv
Prometheus-stack deployment
1. Download the helm chart package
2. values.yaml parameter walkthrough
3. Deploy prometheus-stack
4. Verify the deployment
Background:
If the Prometheus monitoring data and the business data of a k8s cluster share a single NFS (network file system), the following problems can arise:
- Business impact: business data and monitoring data should be isolated. In principle we can tolerate losing monitoring data, but business data must never be lost.
- Read/write performance: business services and the monitoring system both mount files or directories shared over NFS; if both run heavy reads and writes at the same time, they interfere with each other.
- Stability: NFS is demanding on the network environment; if the network is unstable, the file share is prone to failures.
- Storage space: Prometheus does have a retention mechanism, but it only reclaims data past the retention window. If a burst of monitoring data arrives one day, it can consume a lot of NFS storage and, in the extreme case, fill the NFS volume completely.
- NFS expansion: NFS scales poorly; expanding capacity requires manual configuration, which is cumbersome.
Environment Preparation:
A cluster that is up and running, ideally version >= 1.21; versions below 1.21 may hit compatibility problems. The kube-prometheus compatibility matrix:
kube-prometheus stack | Kubernetes 1.21 | Kubernetes 1.22 | Kubernetes 1.23 | Kubernetes 1.24 | Kubernetes 1.25 | Kubernetes 1.26 | Kubernetes 1.27 |
release-0.9  | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
release-0.10 | ✗ | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
release-0.11 | ✗ | ✗ | ✔ | ✔ | ✗ | ✗ | ✗ |
release-0.12 | ✗ | ✗ | ✗ | ✔ | ✔ | ✗ | ✗ |
main         | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ | ✔ |
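To see which row applies to your cluster, check the node versions first; a quick check:
# The VERSION column shows each node's kubelet version
kubectl get nodes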
1. Disk preparation
Pick a node in the cluster and attach a dedicated disk to it. Ideally the disk should be backed by a RAID array, e.g. RAID 50, to improve fault tolerance.
2. Disk partitioning and formatting
# If the disk has no partition table yet, create one first (e.g. parted /dev/sdb mklabel gpt)
# Give all of sdb's space to a single partition
parted /dev/sdb mkpart primary 0% 100%
# Write the filesystem
mkfs -t ext4 /dev/sdb1
# Get the partition UUID, used in fstab for automatic mounting at boot
blkid /dev/sdb1
# Create the mount point
mkdir -p /monitoring
# Check the fstab entry
cat /etc/fstab | grep monitoring
/dev/disk/by-uuid/93a76705-814a-4a5e-85f0-88fe03d7837c /monitoring ext4 defaults 0 1
# Mount everything listed in fstab
mount -a
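For completeness, the fstab line above has to be written there by hand before running mount -a; a minimal sketch (the UUID is the one from this example, substitute the value blkid printed for your disk):
# Append the mount entry to fstab
echo "/dev/disk/by-uuid/93a76705-814a-4a5e-85f0-88fe03d7837c /monitoring ext4 defaults 0 1" >> /etc/fstab
# Confirm the mount after running mount -a
findmnt /monitoring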
local storage deployment
1. Label the node
kubectl label node node156 prometheus=deploy
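To confirm the label landed on the node, you can filter by it; a quick check:
# Only node156 should appear in the output
kubectl get nodes -l prometheus=deploy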
2. Create the local pv StorageClass and prometheus-pv
cd /home/sunwenbo/local-pv
kubectl apply -f local-pv-storage.yaml
kubectl apply -f local-pv.yaml
local-pv-storage.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
#reclaimPolicy: Retain        # Note: with no-provisioner this field only applies to dynamically provisioned PVs, so it has no effect here; set the reclaim policy on the PV itself
#volumeBindingMode: Immediate # Note: no-provisioner cannot create PVs dynamically; WaitForFirstConsumer delays binding until a consuming pod is scheduled
local-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv
spec:
  capacity:
    storage: 200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  #persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /monitoring/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: prometheus
              operator: In
              values:
                - "deploy"
A quick explanation: remember the labeling step above? The nodeAffinity here matches that label so the PV is created on, and pinned to, the designated node.
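One more note: Kubernetes does not create the backing directory of a local PV. The path /monitoring/prometheus must already exist on the labeled node before the consuming pod starts, otherwise the mount will fail. A minimal preparation step, run on node156:
# Create the directory that backs the local PV (on the labeled node)
mkdir -p /monitoring/prometheus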
Check the StorageClass
root@master01:/home/sunwenbo/local-pv# kubectl get storageclasses.storage.k8s.io
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-storage kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 17h
nfs-016 nfs.csi.k8s.io Retain Immediate false 59d
nfs-018 nfs.csi.k8s.io Retain Immediate false 44d
nfs-retain (default) nfs.csi.k8s.io Retain Immediate false 62d
Check the PV
Note: normally the PV status would be Available, since no PVC has been created yet. The output below was captured after my deployment, so you can see that prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 is bound to prometheus-pv. Where this PVC comes from is explained below.
root@master01:/home/sunwenbo/local-pv# kubectl get pv | grep prometheus
prometheus-pv 200Gi RWO Retain Bound kube-prometheus/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 local-storage 23m
Prometheus-stack deployment
1. Download the helm chart package
wget https://github.com/prometheus-community/helm-charts/releases/download/kube-prometheus-stack-45.27.2/kube-prometheus-stack-45.27.2.tgz
tar xf kube-prometheus-stack-45.27.2.tgz
cd kube-prometheus-stack
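If the direct GitHub download is slow or blocked, the same pinned chart version can be fetched through the helm repository instead; a sketch assuming helm 3 is installed:
# Add the community chart repo and pull the exact chart version
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack --version 45.27.2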
2. values.yaml parameter walkthrough
The modified sections are as follows:
# alertmanager persistence: nfs-backed, 4Gi of storage
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-retain
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 4Gi

# grafana persistence, environment variables, and extra plugins
grafana:
  enabled: true
  namespaceOverride: ""
  forceDeployDatasources: false
  persistence:
    type: pvc
    enabled: true
    storageClassName: nfs-retain
    accessModes:
      - ReadWriteOnce
    size: 2Gi
    finalizers:
      - kubernetes.io/pvc-protection
  env:
    GF_AUTH_ANONYMOUS_ENABLED: "true"
    GF_AUTH_ANONYMOUS_ORG_NAME: "Main Org."
    GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
  plugins:
    - grafana-worldmap-panel
    - grafana-piechart-panel
  # grafana service exposure
  service:
    portName: http-web
    port: 30080
    externalIPs: ["10.1.2.15"]

prometheus:
  # expose the prometheus service on an external ip
  service:
    externalIPs: ["10.1.2.15"]
  prometheusSpec:
    # keep monitoring data for 15 days
    retention: 15d
    # pin the prometheus pod to the labeled node via node affinity
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: prometheus
                  operator: In
                  values:
                    - deploy
    # cpu/memory requests and limits for prometheus
    resources:
      requests:
        memory: 10Gi
        cpu: 10
      limits:
        memory: 50Gi
        cpu: 10
    # prometheus data persistence uses local-storage
    storageSpec:
      ## Using PersistentVolumeClaim
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi
    # add a gpu-metrics scrape job
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - nvidia-device-plugin
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node
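Before installing, it is worth rendering the chart locally to catch YAML or schema mistakes in the edited values; a minimal sanity check, run from the chart directory:
# Render all templates; errors surface here instead of at install time
helm template kube-prometheus-stack . -f values.yaml > /dev/null && echo "values OK"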
The full values.yaml has been uploaded to CSDN and can be downloaded without points:
https://download.csdn.net/download/weixin_43798031/88046678
3. Deploy prometheus-stack
helm upgrade -i kube-prometheus-stack -f values.yaml . -n kube-prometheus
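Note that helm upgrade -i does not create the target namespace on its own; if kube-prometheus does not exist yet, create it first, or add --create-namespace to the helm command:
kubectl create namespace kube-prometheus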
4. Verify the deployment
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get deployments.apps -n kube-prometheus
NAME READY UP-TO-DATE AVAILABLE AGE
kube-prometheus-stack-grafana 1/1 1 1 123m
kube-prometheus-stack-kube-state-metrics 1/1 1 1 123m
kube-prometheus-stack-operator 1/1 1 1 123m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get daemonsets.apps -n kube-prometheus
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-prometheus-stack-prometheus-node-exporter 148 148 148 148 148 123m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get statefulsets.apps -n kube-prometheus
NAME READY AGE
alertmanager-kube-prometheus-stack-alertmanager 1/1 123m
prometheus-kube-prometheus-stack-prometheus 1/1 123m
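To confirm the node-affinity rule took effect and the prometheus pod really landed on the labeled node, check the pod placement; a quick check (node156 is the node labeled earlier):
# The NODE column should show node156
kubectl get pods -n kube-prometheus -o wide | grep prometheus-kube-prometheus-stack-prometheus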
service
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get svc -n kube-prometheus
NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                            ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   123m
kube-prometheus-stack-alertmanager               ClusterIP   10.111.20.147    <none>        9093/TCP                     123m
kube-prometheus-stack-grafana                    ClusterIP   10.104.171.223   10.1.2.15     30080/TCP                    123m
kube-prometheus-stack-kube-state-metrics         ClusterIP   10.107.110.116   <none>        8080/TCP                     123m
kube-prometheus-stack-operator                   ClusterIP   10.107.180.72    <none>        443/TCP                      123m
kube-prometheus-stack-prometheus                 ClusterIP   10.102.115.147   10.1.2.15     9090/TCP                     123m
kube-prometheus-stack-prometheus-export          ClusterIP   10.109.169.13    10.1.2.15     30081/TCP                    3d5h
kube-prometheus-stack-prometheus-node-exporter   ClusterIP   10.101.152.90    <none>        9100/TCP                     123m
prometheus-operated                              ClusterIP   None             <none>        9090/TCP                     123m
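With the external IP in place, the exposed endpoints can be smoke-tested from outside the cluster; a quick sketch using the IP and ports from this deployment:
# Prometheus health endpoint; should report that the server is healthy
curl -s http://10.1.2.15:9090/-/healthy
# Grafana should answer with an HTTP status line
curl -sI http://10.1.2.15:30080 | head -n 1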
pv and pvc
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get pv | grep prometh
prometheus-pv 200Gi RWO Retain Bound kube-prometheus/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 local-storage 127m
pvc-43823533-9a35-4ace-b0a3-5853e3b4099e 4Gi RWO Retain Bound kube-prometheus/alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0 nfs-retain 60d
pvc-cef3dd98-7090-47ac-8cec-c52c78e9237f 2Gi RWO Retain Bound kube-prometheus/kube-prometheus-stack-grafana nfs-retain 129m
root@master01:/home/sunwenbo/kube-prometheus-stack# kubectl get pvc -n kube-prometheus
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0 Bound pvc-43823533-9a35-4ace-b0a3-5853e3b4099e 4Gi RWO nfs-retain 60d
kube-prometheus-stack-grafana Bound pvc-cef3dd98-7090-47ac-8cec-c52c78e9237f 2Gi RWO nfs-retain 127m
prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 Bound prometheus-pv 200Gi RWO local-storage 127m
A quick explanation: the volumeClaimTemplate dynamically creates a PVC for us; since the PV was created beforehand, this PVC binds to it automatically.
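To double-check the binding described here, the volume name can be read straight off the PVC; a quick check:
# Should print: prometheus-pv
kubectl get pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 -n kube-prometheus -o jsonpath='{.spec.volumeName}'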