This article describes how to monitor and visualize Kubernetes cluster resources using Prometheus, cAdvisor, Node Exporter, Alertmanager, and Grafana.
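
The manifests in this article all target a namespace named monitoring, which the article never creates explicitly. Assuming that namespace does not exist yet on the cluster, a minimal manifest such as the following could be applied first:

```yaml
# Assumption: the namespace is not created by any other tooling.
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```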

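1. cAdvisor

1.1 Overview

  • cAdvisor monitors the resources and containers on a node in real time and collects performance data, including CPU usage, memory usage, network throughput, and filesystem usage.
  • cAdvisor is written in Go and reads container resource usage from Linux cgroups. In Kubernetes it is built into the kubelet and enabled by default, so it is available out of the box.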
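
Because the kubelet already embeds cAdvisor, container metrics can usually also be scraped from the kubelet's /metrics/cadvisor endpoint without running a separate cAdvisor workload. The job below is a sketch of the common Kubernetes service-discovery pattern, not the setup this article deploys. It assumes Prometheus runs inside the cluster with a ServiceAccount that is allowed to list nodes and to access nodes/proxy.

```yaml
# Sketch only: scrape the kubelet's embedded cAdvisor through the API server proxy.
# Assumes in-cluster Prometheus with RBAC for nodes and nodes/proxy.
scrape_configs:
  - job_name: kubernetes-cadvisor
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Send every request to the API server instead of the node address.
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      # Build the per-node proxy path to the cAdvisor metrics endpoint.
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
```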

1.2 Deploy cAdvisor

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    workload.user.cattle.io/workloadselector: Reconcile
  name: cadvisor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
         app: cadvisor
    spec:
      containers:
      - image: google/cadvisor:latest
        name: cadvisor
        volumeMounts:
         - name: adir
           mountPath: /rootfs
         - name: bdir
           mountPath: /var/run
         - name: cdir
           mountPath: /sys
         - name: ddir
           mountPath: /var/lib/docker
        ports:
        - containerPort: 8080
          protocol: TCP
      nodeName: nodeName  # placeholder: replace with the name of the target node
      restartPolicy: Always
      volumes:
      - name: adir
        hostPath:
          path: /
      - name: bdir
        hostPath:
          path: /var/run
      - name: cdir
        hostPath:
          path: /sys
      - name: ddir
        hostPath:
          path: /var/lib/docker

1.3 Deploy the cAdvisor Service

apiVersion: v1
kind: Service
metadata:
  name: cadvisor
  namespace: monitoring
  labels:
    app: cadvisor
spec:
  selector:
    app: cadvisor
  ports:
  - port: 8080
    targetPort: 8080

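2. Node Exporter

2.1 Overview

  • Node Exporter collects host-level metrics such as loadavg, filesystem, and meminfo. It plays a role similar to zabbix-agent in traditional host monitoring.
  • Node Exporter is provided and maintained by the Prometheus project. It is not bundled with Prometheus, but it is an essential exporter in most setups.

2.2 Deploy Node Exporter

apiVersion: apps/v1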
kind: Deployment
metadata:
  labels:
    workload.user.cattle.io/workloadselector: Reconcile
  name: exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
         app: node-exporter
    spec:
      containers:
      - image: prom/node-exporter
        name: node-exporter
        volumeMounts:
         - name: adir
           mountPath: /host/proc
         - name: bdir
           mountPath: /host/sys
         - name: cdir
           mountPath: /rootfs
        # No inner double quotes around the regex: in exec form they would become part of the pattern.
        command: ['/bin/node_exporter','--path.procfs=/host/proc','--path.sysfs=/host/sys','--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($|/)']
        ports:
        - containerPort: 9100
          protocol: TCP
      nodeName: nodeName  # placeholder: replace with the name of the target node
      restartPolicy: Always
      volumes:
      - name: adir
        hostPath:
          path: /proc
      - name: bdir
        hostPath:
          path: /sys
      - name: cdir
        hostPath:
          path: /

2.3 Deploy the Node Exporter Service

apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    app: node-exporter
  ports:
  - port: 9100
    targetPort: 9100
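
A Deployment pinned with nodeName runs the exporter on a single node only. To collect host metrics from every node, a DaemonSet is the usual choice. The following is an alternative sketch, not the manifest used in this article. It keeps the same image, label, and /proc and /sys mounts, drops the rootfs mount and the filesystem flag for brevity, and assumes the monitoring namespace exists.

```yaml
# Alternative sketch: one node-exporter pod per node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: prom/node-exporter
        # args are appended to the image's own entrypoint (node_exporter).
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        ports:
        - containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
```

With several exporter pods behind one Service, a static target such as node-exporter:9100 reaches an arbitrary pod on each scrape, so per-pod discovery (for example kubernetes_sd_configs) would be needed alongside this change.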

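3. Kube State Metrics

3.1 Overview

  • Kube State Metrics collects data on most built-in Kubernetes resources, such as Pods, Deployments, and Services. It also exposes metrics about itself, mainly the number of resources collected and the number of errors encountered during collection.

3.2 Deploy the Kube State Metrics ClusterRoleBinding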

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring

3.3 Deploy the Kube State Metrics ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets
  verbs: ["list", "watch"]

3.4 Deploy the Kube State Metrics Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.4.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: ist0ne/addon-resizer
        resources:
          limits:
            cpu: 150m
            memory: 50Mi
          requests:
            cpu: 150m
            memory: 50Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - /pod_nanny
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics

3.5 Deploy the Kube State Metrics RoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring

3.6 Deploy the Kube State Metrics Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: monitoring
  name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]

3.7 Deploy the Kube State Metrics ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring

3.8 Deploy the Kube State Metrics Service

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
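
The prometheus.io/scrape: 'true' annotation on this Service has no effect by itself. It is a convention that only works when a scrape job filters on it, and the static target list in section 4.3 does not read it. A sketch of a job that does, assuming Prometheus has RBAC permission to list and watch Services, Endpoints, and Pods:

```yaml
# Sketch only: discover Service endpoints and keep those annotated for scraping.
scrape_configs:
  - job_name: kubernetes-service-endpoints
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints whose Service carries prometheus.io/scrape: 'true'.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Record where each series came from.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
```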

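4. Prometheus

4.1 Overview

  • Prometheus is an open-source systems monitoring and alerting toolkit. Since its inception in 2012, many companies and organizations have adopted it. It is now a standalone open-source project maintained independently of any company.
  • In 2016, Prometheus joined the Cloud Native Computing Foundation (CNCF) as its second hosted project, after Kubernetes. In August 2018, it became the second project to graduate from the CNCF, again after Kubernetes.

4.2 Deploy Prometheus

apiVersion: apps/v1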
kind: Deployment
metadata:
  labels:
    workload.user.cattle.io/workloadselector: Reconcile
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
         app: prometheus
    spec:
      containers:
      - image: prom/prometheus:v2.4.3
        name: prometheus
        volumeMounts:
         - name: adir
           mountPath: /etc/prometheus/
        ports:
        - containerPort: 9090
          protocol: TCP
      restartPolicy: Always
      volumes:
      - name: adir
        configMap:
          name: prometheus-config

4.3 Deploy the Prometheus ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_timeout: 60s
      scrape_interval:     60s
      evaluation_interval: 60s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
    rule_files:
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['kube-state-metrics:8080','kube-state-metrics:8081','node-exporter:9100','cadvisor:8080']
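
With all four targets under a single job named prometheus, every series carries job="prometheus", and the components can be told apart only by the instance label. One possible refinement, shown here as a sketch with hypothetical job names, is to give each component its own job:

```yaml
# Sketch only: one job per component so the job label identifies the source.
scrape_configs:
  - job_name: kube-state-metrics        # hypothetical job names
    static_configs:
      - targets: ['kube-state-metrics:8080', 'kube-state-metrics:8081']
  - job_name: node-exporter
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
```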

4.4 Deploy the Prometheus Service

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30090

4.5 Access test

http://nodeip:30090

image-20201028143147287

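5. Federation

5.1 Overview

  • Federation allows one Prometheus server to scrape monitoring data from another Prometheus server.
  • Data, alerting, and dashboards can be managed centrally, with no need to maintain them separately for each Prometheus instance.

5.2 Architecture

image-20201028143201760

5.3 Configure Federation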

- job_name: federate
  honor_labels: true
  honor_timestamps: true
  params:
    match[]:
    - '{job=~".+"}'
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /federate
  scheme: http
  static_configs:
  - targets:
    - nodeip:30090
    labels:
      group: k8s-cluster
      host: test-k8s-cluster
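
The selector {job=~".+"} pulls every series from the source server, which can become expensive on larger clusters. If only part of the data is needed centrally, match[] accepts several narrower selectors. A sketch that selects by metric-name prefix, assuming the default prefixes used by Kube State Metrics, Node Exporter, and cAdvisor:

```yaml
# Sketch only: federate a subset of series instead of everything.
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    match[]:
    - '{__name__=~"kube_.*"}'        # Kube State Metrics series
    - '{__name__=~"node_.*"}'        # Node Exporter series
    - '{__name__=~"container_.*"}'   # cAdvisor series
  static_configs:
  - targets:
    - nodeip:30090                   # placeholder, as in the configuration above
```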

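6. Grafana

6.1 Overview

  • Grafana is an open-source suite for data analytics and visualization. It is commonly used to visualize time-series data for infrastructure and application analytics.
  • Grafana supports many different data sources. Each data source has its own query editor, customized for the features and capabilities that the particular data source exposes.

6.2 Install the plugin

grafana-cli plugins install grafana-kubernetes-app

6.3 Configure data sources

6.3.1 Prometheus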

image-20201028144141604
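
Besides the UI shown in the screenshot, Grafana 5.0 and later can load data sources from provisioning files. A sketch, assuming Grafana runs in the same cluster and reaches Prometheus through the prometheus Service in the monitoring namespace. The file path in the comment is an assumption based on Grafana's default provisioning directory.

```yaml
# e.g. /etc/grafana/provisioning/datasources/prometheus.yaml (assumed path)
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy                                  # Grafana's backend issues the queries
  url: http://prometheus.monitoring.svc:9090     # assumes in-cluster access to the Service
  isDefault: true
```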

6.3.2 Kubernetes

image-20201028144230909

6.4 Dashboards

image-20201028143431160

image-20201028143453666

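7. Alertmanager

7.1 Overview

  • Alertmanager receives the alerts sent by Prometheus and supports a wide range of notification channels.
  • It supports deduplication, noise reduction, grouping, and policy-based routing of alerts, which makes it a capable alert notification system.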
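
For Prometheus to deliver alerts, its alerting section must list at least one Alertmanager target, which the ConfigMap in section 4.3 leaves empty. A sketch, assuming Alertmanager is exposed inside the cluster through a hypothetical Service named alertmanager on its default port 9093:

```yaml
# Fragment of prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']   # hypothetical Service name
rule_files:
  - /etc/prometheus/rules/*.yml        # assumed location of rule files
```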

7.2 Configure Alertmanager

global:
  resolve_timeout: 5m
 
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 24h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'webhookURL'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

7.3 DingTalk alerts

Project: 驻云科技
Failed host: k8s-cluster
Failed IP: kube-state-metrics:8080
Description: K8s ns (zhuyun) pod (zhuyun-test) container (app-test) pending
Time: 2019.01.01 11:11:11
Event ID: 12345678
Status: PROBLEM
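
A notification like the one above could originate from a Prometheus alerting rule on the kube_pod_status_phase metric exposed by Kube State Metrics. The rule below is a sketch: the group name, alert name, 5-minute hold time, severity label, and annotation text are all assumptions. Mapping the alert into the DingTalk message format depends on the webhook receiver, which this article does not show.

```yaml
# Sketch only: fire when a Pod has been Pending for some time.
groups:
- name: pod-status                # hypothetical group name
  rules:
  - alert: PodPending             # hypothetical alert name
    expr: kube_pod_status_phase{phase="Pending"} == 1
    for: 5m                       # assumed hold time before firing
    labels:
      severity: warning
    annotations:
      description: 'K8s ns ({{ $labels.namespace }}) pod ({{ $labels.pod }}) pending'
```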