Collecting .NET Metrics in Kubernetes with dotnet-monitor

Intro

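dotnet-monitor is a tool from Microsoft for diagnosing and monitoring .NET applications. In Kubernetes it can run as a sidecar container and monitor a .NET application without any changes to the application code. This post walks through how to use it in Kubernetes.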

Get Started

To run it as a sidecar, we only need to modify the YAML file of the application's Deployment. Here is an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sparktodo-api
  labels:
    app: sparktodo-api
spec:
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: sparktodo-api
  minReadySeconds: 0
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "52323"
      labels:
        app: sparktodo-api
    spec:
      containers:
        - name: sparktodo-api
          image: weihanli/sparktodo-api:latest
          imagePullPolicy: Always
          resources:
            requests:
              memory: "64Mi"
              cpu: "20m"
            limits:
              memory: "128Mi"
              cpu: "50m"
          env:
          - name: DOTNET_DiagnosticPorts
            value: /diag/port
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 30
          volumeMounts:
          - mountPath: /diag
            name: diagvol
          - mountPath: /dumps
            name: dumpsvol
        - name: monitor
          image: mcr.microsoft.com/dotnet/monitor
          args: [ "--no-auth" ]
          imagePullPolicy: Always
          ports:
            - containerPort: 52323
          env:
          - name: DOTNETMONITOR_DiagnosticPort__ConnectionMode
            value: Listen
          - name: DOTNETMONITOR_DiagnosticPort__EndpointName
            value: /diag/port
          - name: DOTNETMONITOR_Storage__DumpTempFolder
            value: /dumps
          - name: DOTNETMONITOR_Urls
            value: "http://+:52323"
          volumeMounts:
          - mountPath: /diag
            name: diagvol
          - mountPath: /dumps
            name: dumpsvol
          resources:
            requests:
              cpu: 20m
              memory: 32Mi
            limits:
              cpu: 50m
              memory: 256Mi
      volumes:
      - name: diagvol
        emptyDir: {}
      - name: dumpsvol
        emptyDir: {}

For easier comparison, here is a diff of the changes:

   template:
     metadata:
+      annotations:
+        prometheus.io/scrape: "true"
+        prometheus.io/port: "52323"
       labels:
         app: sparktodo-api
     spec:
       containers:
         - name: sparktodo-api
           image: weihanli/sparktodo-api:latest
           imagePullPolicy: Always
           resources:
             requests:
               memory: "64Mi"
               cpu: "20m"
             limits:
               memory: "128Mi"
               cpu: "50m"
+          env:
+          - name: DOTNET_DiagnosticPorts
+            value: /diag/port
           ports:
             - name: http
               containerPort: 80
               protocol: TCP
           livenessProbe:
             httpGet:
               path: /health
               port: 80
             initialDelaySeconds: 60
             periodSeconds: 30
           readinessProbe:
             httpGet:
               path: /health
               port: 80
             initialDelaySeconds: 60
             periodSeconds: 30
+          volumeMounts:
+          - mountPath: /diag
+            name: diagvol
+          - mountPath: /dumps
+            name: dumpsvol
+        - name: monitor
+          image: mcr.microsoft.com/dotnet/monitor
+          args: [ "--no-auth" ]
+          imagePullPolicy: Always
+          ports:
+            - containerPort: 52323
+          env:
+          - name: DOTNETMONITOR_DiagnosticPort__ConnectionMode
+            value: Listen
+          - name: DOTNETMONITOR_DiagnosticPort__EndpointName
+            value: /diag/port
+          - name: DOTNETMONITOR_Storage__DumpTempFolder
+            value: /dumps
+          - name: DOTNETMONITOR_Urls
+            value: "http://+:52323"
+          volumeMounts:
+          - mountPath: /diag
+            name: diagvol
+          - mountPath: /dumps
+            name: dumpsvol
+          resources:
+            requests:
+              cpu: 20m
+              memory: 32Mi
+            limits:
+              cpu: 50m
+              memory: 256Mi
+      volumes:
+      - name: diagvol
+        emptyDir: {}
+      - name: dumpsvol
+        emptyDir: {}

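Compared with the Deployment before dotnet-monitor was added, the main changes are:

A dotnet-monitor container was added.
Volumes and the DiagnosticPorts setting were added so that the .NET application and dotnet-monitor can communicate.
Prometheus annotations were added so that Prometheus scrapes metrics from dotnet-monitor.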

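[Figure: result in practice]

[Figure: sample metrics]

By default dotnet-monitor collects a lot of information, including CPU, memory, GC, and thread pool data, which gives a much clearer picture of how a .NET application is behaving. Once Prometheus has collected the data, Grafana can be used to build nicer dashboards and to set up monitoring alerts on selected metrics. (I put together a few small samples; the data is for reference only.)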
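To look at the same data without Prometheus, the /metrics output can be read directly. Below is a minimal sketch of a parser for the Prometheus text exposition format. The function names and the http://localhost:52323 base URL are assumptions for illustration (the URL presumes the monitor port has been forwarded to the local machine); the parser is intentionally simple and does not handle label values that contain commas or braces.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples.

    Comment lines (# HELP / # TYPE) are skipped and the optional trailing
    timestamp is ignored.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        labels = {}
        if "{" in line:
            # metric{key="value",...} value [timestamp]
            name, rest = line.split("{", 1)
            label_part, rest = rest.split("}", 1)
            for pair in label_part.split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        else:
            # metric value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        samples.append((name.strip(), labels, float(fields[0])))
    return samples


def fetch_metrics(base_url="http://localhost:52323"):
    """Fetch and parse /metrics; base_url is an assumption (local port-forward)."""
    with urllib.request.urlopen(base_url + "/metrics", timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling fetch_metrics() returns a list that can be filtered by metric name, which is handy for checking whether a particular provider is actually producing data.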

[Figure: Sample 2]

By default, dotnet-monitor monitors three data sources, which correspond to three providers in dotnet-counters: System.Runtime, Microsoft.AspNetCore.Hosting, and Grpc.AspNetCore.Server.

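We can also customize the dotnet-monitor configuration to disable the default providers or to add more providers. There are two ways to supply the configuration. The first is the environment-variable style, where __ separates the levels of the configuration hierarchy, for example: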

Metrics__IncludeDefaultProviders: true

The second is a JSON file:

{
  "Metrics": {
    "IncludeDefaultProviders": true
  }
}

I recommend the JSON style because it is easier to read and easier to maintain.
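To make the relationship between the two styles concrete, here is a small sketch that folds flat __-separated keys into the nested structure a JSON file would use. It only illustrates the naming convention; it is not how dotnet-monitor itself loads configuration, the function name is made up for this example, and values are kept as strings without any type conversion or case-insensitive matching.

```python
import json


def nest_flat_keys(flat, prefix="DOTNETMONITOR_", separator="__"):
    """Fold flat keys such as Metrics__IncludeDefaultProviders into nested dicts.

    Keys that start with `prefix` have it stripped first; other keys are
    used as they are. Values are left untouched (no type conversion).
    """
    nested = {}
    for key, value in flat.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        parts = key.split(separator)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


if __name__ == "__main__":
    flat = {
        "DOTNETMONITOR_DiagnosticPort__ConnectionMode": "Listen",
        "DOTNETMONITOR_DiagnosticPort__EndpointName": "/diag/port",
        "Metrics__IncludeDefaultProviders": "true",
    }
    print(json.dumps(nest_flat_keys(flat), indent=2))
```

Reading the nested output next to the flat keys makes it easier to translate a setting from one style to the other.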

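The two styles use different configuration file locations. For the first style, the configuration files go in /etc/dotnet-monitor. The JSON style is more flexible: XDG_CONFIG_HOME defines the configuration root directory, so setting it to /etc gives the path /etc/dotnet-monitor/settings.json. With either style, the configuration can be defined in a ConfigMap and mounted at the required path in the container. Here is an example that uses a custom configuration: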

apiVersion: apps/v1
kind: Deployment
metadata:
  name: reservation-server
  namespace: default
  labels:
    app: reservation-server
spec:
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: reservation-server
  minReadySeconds: 0
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: reservation-server
    spec:
      containers:
        - name: reservation-server
          image: openreservation/reservation-server:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 30m
              memory: 32Mi
            limits:
              cpu: 80m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 30
          ports:
            - containerPort: 80
          env:
          - name: DOTNET_DiagnosticPorts
            value: /diag/port
          volumeMounts:
          - name: settings
            mountPath: /app/appsettings.Production.json
            subPath: appsettings
          - mountPath: /diag
            name: diagvol
          - mountPath: /dumps
            name: dumpsvol
          - mountPath: /tmp
            name: tmpvol
        - name: dotnet-monitor
          image: mcr.microsoft.com/dotnet/monitor
          args: [ "--no-auth" ]
          imagePullPolicy: Always
          ports:
            - containerPort: 52323
          env:
          - name: DOTNETMONITOR_DiagnosticPort__ConnectionMode
            value: Listen
          - name: DOTNETMONITOR_DiagnosticPort__EndpointName
            value: /diag/port
          - name: DOTNETMONITOR_Storage__DumpTempFolder
            value: /dumps
          - name: DOTNETMONITOR_Urls
            value: "http://+:52323"
          - name: XDG_CONFIG_HOME
            value: "/etc"
          volumeMounts:
          - mountPath: /diag
            name: diagvol
          - mountPath: /dumps
            name: dumpsvol
          - mountPath: /tmp
            name: tmpvol
          - name: monitor-configs
            mountPath: /etc/dotnet-monitor/settings.json
            subPath: default
          resources:
            requests:
              cpu: 30m
              memory: 32Mi
            limits:
              cpu: 50m
              memory: 256Mi
      volumes:
      - name: settings
        configMap:
          name: reservation-configs
      - name: monitor-configs
        configMap:
          name: dotnet-monitor-configs
      - name: diagvol
        emptyDir: {}
      - name: dumpsvol
        emptyDir: {}
      - name: tmpvol
        emptyDir: {}

The dotnet-monitor configuration can live in a ConfigMap that is mounted into the dotnet-monitor container. Here is an example of such a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dotnet-monitor-configs
  namespace: default
data:
  default: |
    {
      "urls": "http://*:52323",
      "Metrics": {
        "IncludeDefaultProviders": true,
        "Providers": [
          {
            "ProviderName": "System.Net.Http"
          },
          {
            "ProviderName": "Microsoft.EntityFrameworkCore"
          },
          {
            "ProviderName": "Microsoft.Data.SqlClient.EventSource"
          }
        ]
      }
    }

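This configuration adds extra metrics sources:

System.Net.Http provides HttpClient-related EventCounters data.
Microsoft.EntityFrameworkCore provides EF Core-related EventCounters data.
Microsoft.Data.SqlClient.EventSource provides SqlClient-related EventCounters data.

If your own application exposes custom EventCounters, those can be collected in the same way.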
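When the provider list grows, generating settings.json from a list of names can be less error-prone than editing the JSON by hand. The sketch below builds the same Metrics structure as the ConfigMap above. The helper name is made up for this example, and MyCompany.MyApp is a hypothetical provider name standing in for an application's own EventSource.

```python
import json


def build_monitor_settings(extra_providers, include_defaults=True):
    """Build a Metrics settings dict in the same shape as the ConfigMap above."""
    return {
        "Metrics": {
            "IncludeDefaultProviders": include_defaults,
            "Providers": [{"ProviderName": name} for name in extra_providers],
        }
    }


if __name__ == "__main__":
    settings = build_monitor_settings(
        [
            "System.Net.Http",
            "Microsoft.EntityFrameworkCore",
            "MyCompany.MyApp",  # hypothetical custom EventSource name
        ]
    )
    # The printed JSON can be pasted into the ConfigMap's data entry.
    print(json.dumps(settings, indent=2))
```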

Connection Mode

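Attentive readers may have noticed that both examples above set the environment variable DOTNETMONITOR_DiagnosticPort__ConnectionMode to Listen in the dotnet-monitor container.

Listen mode is only supported from .NET 5 onward, however. Applications on .NET Core 3.x should use Connect mode (I learned this the hard way).

Here is a Deployment that uses Connect mode. It is the first example converted to Connect mode: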

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sparktodo-api
  labels:
    app: sparktodo-api
spec:
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app: sparktodo-api
  minReadySeconds: 0
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "52323"
      labels:
        app: sparktodo-api
    spec:
      containers:
        - name: sparktodo-api
          image: weihanli/sparktodo-api:latest
          imagePullPolicy: Always
          resources:
            requests:
              memory: "64Mi"
              cpu: "20m"
            limits:
              memory: "128Mi"
              cpu: "50m"
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 30
          volumeMounts:
          - mountPath: /tmp
            name: tmpvol
        - name: monitor
          image: mcr.microsoft.com/dotnet/monitor
          args: [ "--no-auth" ]
          imagePullPolicy: Always
          ports:
            - containerPort: 52323
          env:
          - name: DOTNETMONITOR_DiagnosticPort__ConnectionMode
            value: Connect
          - name: DOTNETMONITOR_Urls
            value: "http://+:52323"
          volumeMounts:
          - mountPath: /tmp
            name: tmpvol
          resources:
            requests:
              cpu: 20m
              memory: 32Mi
            limits:
              cpu: 50m
              memory: 256Mi
      volumes:
      - name: tmpvol
        emptyDir: {}

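Compared with Listen mode, Connect mode is simpler: the application only needs to mount the same /tmp directory as the dotnet-monitor container. Listen mode is more powerful, though. It can monitor multiple .NET containers at once, which Connect mode cannot, and some advanced features such as CollectionRule configuration only work in Listen mode (see https://github.com/dotnet/dotnet-monitor/issues/1274). So use Listen mode whenever possible, and keep in mind that .NET Core 3.x supports only Connect mode.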
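The rule of thumb above can be written down as a small helper, which is useful if Deployment manifests are generated by a script. This is only a sketch of the decision described in this post: the function names are made up, and the environment variable names and the /diag/port default are taken from the earlier examples.

```python
def choose_connection_mode(runtime_major_version):
    """Return the connection mode suggested above for a .NET major version."""
    # Listen mode needs .NET 5 or later; .NET Core 3.x must use Connect.
    if runtime_major_version >= 5:
        return "Listen"
    return "Connect"


def monitor_env(runtime_major_version, endpoint="/diag/port"):
    """Build the dotnet-monitor container's diagnostic port env entries."""
    mode = choose_connection_mode(runtime_major_version)
    env = [
        {"name": "DOTNETMONITOR_DiagnosticPort__ConnectionMode", "value": mode},
    ]
    if mode == "Listen":
        # Listen mode also needs the endpoint that the application connects to.
        env.append(
            {"name": "DOTNETMONITOR_DiagnosticPort__EndpointName", "value": endpoint}
        )
    return env
```

In Listen mode the application container additionally needs DOTNET_DiagnosticPorts set to the same endpoint and a shared volume for it, as in the first example.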

Open API

Besides metrics, dotnet-monitor exposes a number of other APIs. See the documentation at https://github.com/dotnet/dotnet-monitor/blob/main/documentation/api/README.md

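Route	Description
`/processes`	Get information about the discovered processes
`/dump`	Generate a managed dump of a process
`/gcdump`	Generate a GC dump of a process
`/trace`	Generate a trace of a process
`/metrics`	Generate the process's metrics and return them in Prometheus format
`/livemetrics`	Capture live metrics of a process
`/logs`	Capture logs from a process
`/info`	Get information about dotnet-monitor itself (version, basic configuration)
`/operations`	Get the status of egress operations or cancel an operation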
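As a rough illustration of calling these routes from a script, here is a minimal client sketch. It assumes dotnet-monitor is reachable at http://localhost:52323 (for example through a local port-forward) and was started with --no-auth as in the examples above. The helper names are made up, and the pid query parameter is an assumption to check against the API documentation linked above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:52323"  # assumption: reached via a local port-forward


def build_url(route, base_url=BASE_URL, **query):
    """Join a dotnet-monitor route with optional query parameters."""
    url = base_url.rstrip("/") + "/" + route.lstrip("/")
    params = {k: v for k, v in query.items() if v is not None}
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def get_json(route, **query):
    """GET a route that returns JSON, such as /processes or /info."""
    with urllib.request.urlopen(build_url(route, **query), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, get_json("/processes") lists the processes that dotnet-monitor can see, and build_url("/dump", pid=1) produces a URL that can be handed to a download tool, since the dump routes return binary data rather than JSON.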

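With dotnet-monitor in place, we can monitor our applications much better. We previously used the prometheus-net.DotNetRuntime project for this, and dotnet-monitor can replace it almost entirely without a single line of code. It is also easy to extend: collecting more of the data you care about only takes a configuration change. Metrics help us understand the overall state of an application, but some problems still require a process dump to find the root cause. dotnet-monitor makes it easy to produce process dumps and traces, and it can be configured to create them automatically, for example creating a dump when memory stays above 2 GB for one minute.

For simplicity, the deployments above run without authentication. If the endpoint needs to be reachable from the public internet, make sure authentication is set up properly. Authentication is now supported out of the box, and the documentation explains how to configure it. The alternative, which I recommend, is not to expose it publicly at all: keep it reachable only inside the k8s cluster and use a local port-forward when you need to work with it.

dotnet-monitor is powerful, and one post cannot cover everything. It is worth getting familiar with so that you can reach for it when you need it.

So far my overall impression is very good, but I have run into one problem: data collection sometimes misbehaves. Of the several applications I deployed, one is missing its System.Runtime metrics while all of its other data is present. This is puzzling, and after a few days I still have not worked out what I am doing wrong. I have opened an issue at https://github.com/dotnet/dotnet-monitor/issues/1241 for anyone who wants to follow along, and I would be very grateful if someone who has hit the same problem could take a look.

One more point: with the setup above, Prometheus only scrapes the dotnet-monitor data. To collect both the dotnet-monitor metrics and the application's own metrics, you will probably need the ServiceMonitor resource from the Prometheus Operator. I will not cover that here; see the references below.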

References

https://github.com/WeihanLi/SparkTodo/blob/master/manifests/deployment.yml
https://github.com/OpenReservation/ReservationServer/blob/dev/k8s/dotnet-monitor-configmap.yaml
https://github.com/OpenReservation/ReservationServer/blob/dev/k8s/reservation-deployment.yaml
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#kube-prometheus-stack
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack/crds
https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md
https://github.com/dotnet/dotnet-monitor/issues/1274
https://github.com/dotnet/dotnet-monitor/issues/1241
https://github.com/dotnet/dotnet-monitor/
https://github.com/dotnet/dotnet-monitor/tree/main/documentation
https://github.com/dotnet/dotnet-monitor/blob/main/documentation/kubernetes.md