Kubernetes Monitoring
Prerequisites
- First deploy a Kubernetes cluster; see k8s deploy.
Install the Prometheus stack with Helm
Setting up Prometheus — NVIDIA GPU Telemetry 1.0.0 documentation
Install Prometheus with the following commands:
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set prometheus.service.type=NodePort \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheusOperator.admissionWebhooks.patch.image.registry=registry.cn-hangzhou.aliyuncs.com \
--set prometheusOperator.admissionWebhooks.patch.image.repository=linuzb/kube-webhook-certgen \
--set kube-state-metrics.image.registry=registry.cn-hangzhou.aliyuncs.com \
--set kube-state-metrics.image.repository=linuzb/kube-state-metrics
Install DCGM for GPU monitoring
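The chart reference gpu-helm-charts/dcgm-exporter only resolves once that repository has been registered with Helm, a step these notes do not show. A minimal sketch, assuming the repository URL published in NVIDIA's DCGM exporter documentation; the helper name add_dcgm_repo is illustrative and not part of the original notes:

```shell
# Assumption: the gpu-helm-charts repository is served from NVIDIA's GitHub Pages site.
DCGM_REPO_NAME="gpu-helm-charts"
DCGM_REPO_URL="https://nvidia.github.io/dcgm-exporter/helm-charts"

# Hypothetical helper: register the repository and refresh the local chart index.
add_dcgm_repo() {
  helm repo add "$DCGM_REPO_NAME" "$DCGM_REPO_URL"
  helm repo update
}
```

Call add_dcgm_repo once on the machine that runs Helm, before installing the exporter.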
helm upgrade --install \
dcgm-exporter \
gpu-helm-charts/dcgm-exporter \
--values config.yaml
To deploy the exporter on master (control-plane) nodes as well, use the following config.yaml:
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
Import the GPU monitoring dashboard: https://grafana.com/grafana/dashboards/12239
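Before importing the dashboard, it can help to confirm that the exporter is actually serving GPU metrics. The sketch below assumes the release was installed into the current namespace, that its Service is named dcgm-exporter, and that it listens on port 9400 (the chart default); both helper names are hypothetical:

```shell
# Assumptions: Service name and port follow the dcgm-exporter chart defaults.
DCGM_SVC="dcgm-exporter"
DCGM_PORT=9400

# Hypothetical helper: count GPU utilization samples in a metrics payload on stdin.
# "|| true" keeps the exit status at 0 when grep finds no matching line.
count_gpu_util_samples() {
  grep -c '^DCGM_FI_DEV_GPU_UTIL' || true
}

# Hypothetical helper: forward the exporter port, scrape it once, print the sample count.
check_dcgm_metrics() {
  kubectl port-forward "svc/${DCGM_SVC}" "${DCGM_PORT}:${DCGM_PORT}" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # give the port-forward a moment to come up
  curl -s "http://localhost:${DCGM_PORT}/metrics" | count_gpu_util_samples
  kill "$pf_pid"
}
```

A count of 0 from check_dcgm_metrics suggests the exporter pod is not running on a GPU node or is not yet ready.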
Using Grafana
Change the Service type to NodePort
kubectl -n prometheus edit svc kube-prometheus-stack-grafana
Add the following:
spec:
  ports:
    - name: http-web
      nodePort: 30759
  type: NodePort
Dashboard
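With the Service exposed on NodePort 30759, Grafana should be reachable on that port of any cluster node. A small sketch for building the URL, assuming the nodes report an InternalIP address that your workstation can reach; the helper names grafana_url and first_node_ip are illustrative:

```shell
# NodePort taken from the Service edit above.
GRAFANA_NODE_PORT=30759

# Hypothetical helper: print the Grafana URL for a given node IP.
grafana_url() {
  echo "http://$1:${GRAFANA_NODE_PORT}"
}

# Hypothetical helper: look up the InternalIP of the first node in the cluster.
first_node_ip() {
  kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
}
```

For example, grafana_url "$(first_node_ip)" prints the address to open in a browser.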
The default password is prom-operator, set by the chart's Grafana values:
grafana:
  # Deploy default dashboards.
  #
  defaultDashboardsEnabled: true
  adminPassword: prom-operator
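Changes made with kubectl edit are likely to be overwritten the next time the release is upgraded, so it may be worth keeping the Grafana settings in a values file instead. A sketch under stated assumptions: the file name grafana-values.yaml and the helper apply_grafana_values are hypothetical, and the service keys follow the Grafana subchart's service.type and service.nodePort values:

```shell
# Hypothetical file name; the values mirror the settings described in these notes.
cat > grafana-values.yaml <<'EOF'
grafana:
  defaultDashboardsEnabled: true
  adminPassword: prom-operator
  service:
    type: NodePort
    nodePort: 30759
EOF

# Hypothetical helper: apply the file on top of the values already stored in the release.
apply_grafana_values() {
  helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace prometheus \
    --reuse-values \
    --values grafana-values.yaml
}
```

The --reuse-values flag keeps the earlier --set options, so only the Grafana keys in the file change. Replace prom-operator with a password of your own before exposing Grafana outside a test cluster.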