Kubernetes Monitoring

Prerequisites

  1. Deploy a Kubernetes cluster first; see k8s deploy.

Install the Prometheus stack with Helm

Setting up Prometheus — NVIDIA GPU Telemetry 1.0.0 documentation

Install Prometheus with the following commands:

helm repo add prometheus-community \
   https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
   --namespace prometheus \
   --create-namespace \
   --set prometheus.service.type=NodePort \
   --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
   --set prometheusOperator.admissionWebhooks.patch.image.registry=registry.cn-hangzhou.aliyuncs.com \
   --set prometheusOperator.admissionWebhooks.patch.image.repository=linuzb/kube-webhook-certgen \
   --set kube-state-metrics.image.registry=registry.cn-hangzhou.aliyuncs.com \
   --set kube-state-metrics.image.repository=linuzb/kube-state-metrics
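After the install finishes, it can help to confirm that the pods are running and to look up the NodePort that Prometheus was assigned. The sketch below assumes the release name and namespace used above. The Service name kube-prometheus-stack-prometheus follows the chart's usual naming but is an assumption here, and the helper name prometheus_node_port is made up for illustration.

```shell
# Hypothetical helper: print the NodePort of the Prometheus Service.
# Assumes the Service is named kube-prometheus-stack-prometheus and that
# the first listed port is the Prometheus web port.
prometheus_node_port() {
  kubectl -n prometheus get svc kube-prometheus-stack-prometheus \
    -o jsonpath='{.spec.ports[0].nodePort}'
}

# Usage (requires access to the cluster):
#   kubectl -n prometheus get pods          # pods should reach Running
#   echo "http://<node-ip>:$(prometheus_node_port)"
```

If the Service exposes more than one port, index 0 may not be the web port; `kubectl -n prometheus get svc` shows the full list.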

Install DCGM for GPU monitoring

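helm repo add gpu-helm-charts \
   https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

helm upgrade --install \
   dcgm-exporter \
   gpu-helm-charts/dcgm-exporter \
   --values config.yaml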

To also deploy the exporter on master (control-plane) nodes, put this toleration in config.yaml:

tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"
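To check that the toleration took effect, one option is to list the nodes that actually run an exporter pod and see whether the control-plane nodes are among them. This is a rough sketch: the label app.kubernetes.io/name=dcgm-exporter is an assumption about how the chart labels its pods, and dcgm_exporter_nodes is a made-up helper name.

```shell
# Hypothetical helper: list the nodes that currently run a dcgm-exporter pod.
# Assumes the pods carry the label app.kubernetes.io/name=dcgm-exporter.
# Searches all namespaces because the install command above does not set one.
dcgm_exporter_nodes() {
  kubectl get pods --all-namespaces -l app.kubernetes.io/name=dcgm-exporter \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u
}

# Usage (requires access to the cluster):
#   dcgm_exporter_nodes
#   kubectl get nodes        # compare against the full node list
```

If the list comes back empty, `kubectl get pods --all-namespaces --show-labels` should reveal which labels the pods really have.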

Import the GPU monitoring dashboard: https://grafana.com/grafana/dashboards/12239

Using Grafana

Change the Grafana Service type to NodePort

kubectl -n prometheus edit svc kube-prometheus-stack-grafana

In the editor, add nodePort to the http-web port entry and set type to NodePort:

spec:
  ports:
  - name: http-web
    nodePort: 30759
  type: NodePort
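The same change can probably be made without opening an editor by sending a JSON patch. The sketch below assumes that http-web is the first (index 0) port of the Service; grafana_set_node_port is a made-up helper name, not part of any tool.

```shell
# Hypothetical helper: switch the Grafana Service to NodePort and pin the port.
# Assumes http-web is the first entry in .spec.ports.
# Usage: grafana_set_node_port 30759
grafana_set_node_port() {
  kubectl -n prometheus patch svc kube-prometheus-stack-grafana --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/spec/type\",\"value\":\"NodePort\"},
         {\"op\":\"add\",\"path\":\"/spec/ports/0/nodePort\",\"value\":$1}]"
}

# Afterwards, read the port back (requires access to the cluster):
#   kubectl -n prometheus get svc kube-prometheus-stack-grafana \
#     -o jsonpath='{.spec.ports[0].nodePort}'
```

Manual edits and patches to the Service are likely to be overwritten by the next helm upgrade, so setting the Service type through chart values at install time may be the more durable route.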

Dashboard

Log in as admin; the default password is prom-operator. The relevant chart defaults:

# Deploy default dashboards.
#
defaultDashboardsEnabled: true

adminPassword: prom-operator
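If the password was changed at install time, it can usually be read back from the Secret that the Grafana chart creates. A minimal sketch, assuming the Secret is named kube-prometheus-stack-grafana and stores the value under the key admin-password; grafana_admin_password is a made-up helper name.

```shell
# Hypothetical helper: print the Grafana admin password stored in the cluster.
# Assumes the Secret kube-prometheus-stack-grafana has an admin-password key.
grafana_admin_password() {
  kubectl -n prometheus get secret kube-prometheus-stack-grafana \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}

# Usage (requires access to the cluster):
#   grafana_admin_password; echo
```

Secret values are stored base64-encoded, which is why the output is piped through base64 --decode.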