Adding alerting rules and notification channels with Prometheus Operator

2019-08-21

Configuring alerting

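Edit /root/kube-prometheus/manifests/alertmanager-service.yaml and add type: NodePort under spec, so the Alertmanager web UI can be opened in a browser. Then apply the file with kubectl apply -f alertmanager-service.yaml.
Run kubectl get svc -n monitoring to find the assigned NodePort. In this example the Status page is at http://172.16.0.6:31568/#/status
The Alertmanager Status page shows the configuration Alertmanager is currently running: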

Config
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: yunwei@hhotel.com
  smtp_hello: hhotel.com
  smtp_smarthost: smtp.qiye.aliyun.com:465
  smtp_auth_username: yunwei@hhotel.com
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
... ...

This configuration actually comes from /root/kube-prometheus/manifests/alertmanager-secret.yaml, which defines a Secret named alertmanager-main:

apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque

The value of the alertmanager.yaml key is base64-encoded. Decode it to read the configuration:

echo Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg== | base64 -d
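The same decoding can be done in Python with only the standard library. This is a small sketch. ENCODED is copied verbatim from alertmanager-secret.yaml above, and the helper names are illustrative, not part of any tool.

```python
import base64

# Value of the "alertmanager.yaml" key, copied from alertmanager-secret.yaml above
ENCODED = "Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg=="


def decode_secret_value(value: str) -> str:
    """Decode one entry of a Secret's data map (base64 -> UTF-8 text)."""
    return base64.b64decode(value).decode("utf-8")


def encode_secret_value(text: str) -> str:
    """Inverse operation: produce the base64 string stored under data:."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Prints the stock config: one receiver named "null" and a route
    # that sends the DeadMansSwitch alert to it.
    print(decode_secret_value(ENCODED))
```

The decoded text is the stock kube-prometheus configuration. Its only receiver, "null", has no notification settings, so nothing is delivered until the Secret is replaced.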

To define custom receivers or message templates, regenerate the alertmanager-main Secret. First write a new alertmanager.yaml:
vi alertmanager.yaml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'yunwei@hhotel.com'
  smtp_auth_username: 'yunwei@hhotel.com'
  smtp_auth_password: 'aRXjq9W1jto^7^Zb'
  smtp_hello: 'hhotel.com'
  smtp_require_tls: true
templates:
  - "*.tmpl"
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'wechat'
  routes:
  - receiver: 'wechat'
    group_wait: 10s
    match:
      alertname: EtcdClusterUnavailable
receivers:
- name: 'default'
  email_configs:
  - to: 'yunwei@hhotel.com'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - corp_id: 'wx02f71fb3dea46c16'
    to_party: '1'
    to_user: "renzhenxin"
    agent_id: '1'
    api_secret: 'r4OGerF_p4UrIN6QERCefJRxzpI0SquNG5gHCxGxcOM'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
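The following Python model is a simplified sketch of how Alertmanager uses the route tree and inhibit rule above. It supports exact-match `match` only: no `continue`, no regex matchers, no timers. An alert enters the root route and descends into the first child whose match labels it carries. Alerts sharing the same group_by values are batched. A firing critical alert suppresses a warning alert with the same alertname, dev and instance. The model also makes one thing visible: the 'default' email receiver is defined but no route references it, so with this file only WeChat notifications are sent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    """A simplified Alertmanager route: exact-match `match`, no `continue`."""
    receiver: str
    group_wait: str
    group_by: List[str] = field(default_factory=list)
    match: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)


def route_matches(route: Route, labels: Dict[str, str]) -> bool:
    return all(labels.get(k) == v for k, v in route.match.items())


def select_route(route: Route, labels: Dict[str, str]) -> Route:
    """Descend into the first matching child route; otherwise stay on this route."""
    for child in route.routes:
        if route_matches(child, labels):
            return select_route(child, labels)
    return route


def group_key(route: Route, labels: Dict[str, str]) -> tuple:
    """Alerts with identical values for the group_by labels are sent together."""
    return tuple(labels.get(name, "") for name in route.group_by)


def is_inhibited(target, firing, source_match, target_match, equal) -> bool:
    """True if a firing source alert suppresses `target` under one inhibit rule.

    This model treats a label missing from both alerts as equal (None == None).
    """
    if not all(target.get(k) == v for k, v in target_match.items()):
        return False
    return any(
        all(src.get(k) == v for k, v in source_match.items())
        and all(src.get(name) == target.get(name) for name in equal)
        for src in firing
    )


# Mirrors the route tree above; group_by is repeated on the child because
# this model does not implement inheritance from the parent route.
ROOT = Route(
    receiver="wechat", group_wait="30s", group_by=["job", "severity"],
    routes=[Route(receiver="wechat", group_wait="10s", group_by=["job", "severity"],
                  match={"alertname": "EtcdClusterUnavailable"})],
)

# Mirrors inhibit_rules above.
INHIBIT = dict(source_match={"severity": "critical"},
               target_match={"severity": "warning"},
               equal=["alertname", "dev", "instance"])
```

With this model, an EtcdClusterUnavailable alert lands on the child route and waits 10s before the first notification. Every other alert uses the root route's 30s. Both go to the wechat receiver.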

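Create the WeChat message template wechat.tmpl. It defines a template named wechat.default.message, so it replaces Alertmanager's built-in template of the same name. wechat_configs uses that template for the message body by default: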

{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
========end==========
{{ end }}
{{ end }}
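Go formats times by example, using the reference time Mon Jan 2 15:04:05 2006. The layout "2006-01-02 15:04:05" therefore means year-month-day hour:minute:second. The Python function below approximates, with plain string building instead of Go's text/template, the text this template produces for a list of alerts. It mirrors the line structure and fields. Whitespace differs from the real output, and the alert dict shape is an assumption made for illustration. .StartsAt keeps the timezone the timestamp arrived with, often UTC in containers, so the printed time may be offset from local time.

```python
from datetime import datetime
from typing import Dict, List

# Go's reference layout "2006-01-02 15:04:05" expressed as a strftime pattern.
GO_LAYOUT_AS_STRFTIME = "%Y-%m-%d %H:%M:%S"


def render_wechat_message(alerts: List[Dict]) -> str:
    """Approximate the text wechat.default.message builds for a list of alerts.

    Each alert is a dict with "startsAt" (datetime), "labels" and "annotations"
    (a hypothetical shape). Whitespace differs from the real Go template output.
    """
    lines = []
    for alert in alerts:
        labels, annotations = alert["labels"], alert["annotations"]
        lines += [
            "========start==========",
            "Started at: " + alert["startsAt"].strftime(GO_LAYOUT_AS_STRFTIME),
            "Source: prometheus_alert",
            "Severity: " + labels.get("severity", ""),
            "Alert name: " + labels.get("alertname", ""),
            "Instance: " + labels.get("instance", ""),
            "Summary: " + annotations.get("summary", ""),
            "Description: " + annotations.get("description", ""),
            "========end==========",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_wechat_message([{
        "startsAt": datetime(2019, 8, 21, 10, 30, 0),
        "labels": {"severity": "critical", "alertname": "EtcdClusterUnavailable",
                   "instance": "10.0.0.1:2379"},
        "annotations": {"summary": "etcd cluster unavailable",
                        "description": "hypothetical example alert"},
    }]))
```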

Delete the old Secret and recreate it from both files:

kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
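kubectl create secret generic --from-file uses each file's base name as the key. The new Secret therefore has the keys alertmanager.yaml and wechat.tmpl. Both end up in the same config directory inside the pod, which is why the relative pattern "*.tmpl" under templates finds wechat.tmpl. The sketch below builds an equivalent Secret manifest in Python to make that mapping explicit. The function names are hypothetical. The printed JSON should also be accepted by kubectl apply -f, which updates the Secret in place instead of deleting it first.

```python
import base64
import json
import os
from typing import Dict, Iterable


def secret_manifest(name: str, namespace: str, files: Dict[str, bytes]) -> dict:
    """Build a Secret like the one `kubectl create secret generic --from-file` creates.

    Keys are file base names; values are the base64-encoded file contents.
    """
    data = {key: base64.b64encode(content).decode("ascii")
            for key, content in files.items()}
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }


def read_files(paths: Iterable[str]) -> Dict[str, bytes]:
    """Read files from disk, keyed by base name (as --from-file does)."""
    files = {}
    for path in paths:
        with open(path, "rb") as handle:
            files[os.path.basename(path)] = handle.read()
    return files


if __name__ == "__main__":
    manifest = secret_manifest("alertmanager-main", "monitoring",
                               read_files(["alertmanager.yaml", "wechat.tmpl"]))
    print(json.dumps(manifest, indent=2))
```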

Check that the WeChat template is present in the Alertmanager pod:

kubectl exec -it alertmanager-main-0 -c alertmanager -n monitoring -- /bin/sh
ls /etc/alertmanager/config
cat /etc/alertmanager/config/wechat.tmpl

After a short while, the Config section of the Alertmanager Status page shows the new configuration.
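The same check can be scripted. This assumes an Alertmanager version that ships the v2 HTTP API. Its GET /api/v2/status response includes the loaded configuration as YAML text under config.original. The sketch below uses the NodePort address from this example; the helper names are illustrative.

```python
import json
import urllib.request


def status_url(base_url: str) -> str:
    """Build the v2 status endpoint URL from the Alertmanager base URL."""
    return base_url.rstrip("/") + "/api/v2/status"


def loaded_config(status: dict) -> str:
    """Return the YAML that Alertmanager reports as its loaded configuration."""
    return status["config"]["original"]


def fetch_status(base_url: str) -> dict:
    with urllib.request.urlopen(status_url(base_url), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # NodePort address used earlier in this article; replace with your own.
    print(loaded_config(fetch_status("http://172.16.0.6:31568")))
```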

Configuring automatic service discovery

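For Prometheus Operator to automatically discover and scrape Services that carry the annotation prometheus.io/scrape: "true", add an extra scrape configuration to Prometheus. Each Service that should be scraped must then declare prometheus.io/scrape: "true" in its annotations.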
vi prometheus-additional.yaml

- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
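The relabel rules above can be traced with the Python model below. It is a sketch, not Prometheus itself. It follows these Prometheus rules: source label values are joined with ";", regexes are fully anchored, and replace leaves the target untouched when the regex does not match. Unlike real Prometheus, it does not pre-populate __scheme__ and __metrics_path__ from the job config. The endpoints role yields one target per endpoint port, so the example input is a kube-dns endpoint on port 53. The port rule rewrites it, like every other port of the same pod, to the annotated metrics port 9153.

```python
import re
from typing import Dict, List, Optional

Labels = Dict[str, str]


def _joined(labels: Labels, source: List[str]) -> str:
    # Source label values are joined with the default separator ";".
    return ";".join(labels.get(name, "") for name in source)


def keep(labels: Labels, source: List[str], regex: str) -> Optional[Labels]:
    # Relabel regexes are anchored, hence fullmatch; None means "dropped".
    return labels if re.fullmatch(regex, _joined(labels, source)) else None


def replace(labels: Labels, source: List[str], target: str,
            regex: str = "(.*)", replacement: str = "$1") -> Labels:
    match = re.fullmatch(regex, _joined(labels, source))
    if match is None:
        return labels  # no match: target label left untouched
    # Translate $1-style references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return {**labels, target: match.expand(template)}


def labelmap(labels: Labels, regex: str) -> Labels:
    out = dict(labels)
    for name, value in labels.items():
        match = re.fullmatch(regex, name)
        if match:
            out[match.group(1)] = value
    return out


def service_endpoints_job(labels: Labels) -> Optional[Labels]:
    """Apply the kubernetes-service-endpoints relabel rules above, in order."""
    ann = "__meta_kubernetes_service_annotation_prometheus_io_"
    out = keep(labels, [ann + "scrape"], "true")
    if out is None:
        return None  # Service lacks prometheus.io/scrape: "true"
    out = replace(out, [ann + "scheme"], "__scheme__", r"(https?)")
    out = replace(out, [ann + "path"], "__metrics_path__", r"(.+)")
    out = replace(out, ["__address__", ann + "port"], "__address__",
                  r"([^:]+)(?::\d+)?;(\d+)", "$1:$2")
    out = labelmap(out, r"__meta_kubernetes_service_label_(.+)")
    out = replace(out, ["__meta_kubernetes_namespace"], "kubernetes_namespace")
    out = replace(out, ["__meta_kubernetes_service_name"], "kubernetes_name")
    return out


if __name__ == "__main__":
    print(service_endpoints_job({
        "__address__": "192.168.73.66:53",
        "__meta_kubernetes_service_annotation_prometheus_io_scrape": "true",
        "__meta_kubernetes_service_annotation_prometheus_io_port": "9153",
        "__meta_kubernetes_service_label_k8s_app": "kube-dns",
        "__meta_kubernetes_namespace": "kube-system",
        "__meta_kubernetes_service_name": "kube-dns",
    }))
```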

Create a Secret from this file:

kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
kubectl get secret additional-configs -n monitoring -o yaml

Reference this extra configuration in the Prometheus resource by adding the following under spec:

additionalScrapeConfigs:
  name: additional-configs
  key: prometheus-additional.yaml

The complete /root/kube-prometheus/manifests/prometheus-prometheus.yaml now reads:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas: 2
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0

Apply the updated manifest:

kubectl apply -f prometheus-prometheus.yaml

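After a while, open the Prometheus web UI and search the Configuration page for kubernetes-service-endpoints to confirm that the new job has been loaded. If the job's targets do not show up, check the logs of the prometheus container: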

kubectl logs -f prometheus-k8s-0 prometheus -n monitoring

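The log contains many errors of the form "xxx is forbidden", which points to an RBAC permission problem. The Prometheus resource above runs under the ServiceAccount prometheus-k8s, and that ServiceAccount is bound to a ClusterRole also named prometheus-k8s. That ClusterRole does not yet allow listing and watching services, endpoints and pods cluster-wide, which endpoints-based discovery needs. Extend prometheus-clusterRole.yaml as follows: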

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

Apply the updated ClusterRole:

kubectl apply -f prometheus-clusterRole.yaml
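Prometheus's endpoints discovery role lists and watches endpoints, services and pods, so the ClusterRole must grant list and watch on all three. The Python sketch below checks the rules transcribed from the ClusterRole above. It ignores "*" wildcards and resourceNames, so it is not a full RBAC evaluator. On a live cluster, the authoritative check is kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-k8s.

```python
from typing import Dict, List

# Rules of the prometheus-k8s ClusterRole above, transcribed by hand.
RULES: List[Dict[str, List[str]]] = [
    {"apiGroups": [""],
     "resources": ["nodes", "services", "endpoints", "pods", "nodes/proxy"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["configmaps", "nodes/metrics"], "verbs": ["get"]},
    {"nonResourceURLs": ["/metrics"], "verbs": ["get"]},
]

# Resources the endpoints service-discovery role lists and watches.
NEEDED = ["endpoints", "services", "pods"]


def allows(rules: List[Dict[str, List[str]]], resource: str, verb: str,
           api_group: str = "") -> bool:
    """True if any resource rule grants `verb` on `resource` (no '*' support)."""
    return any(
        api_group in rule.get("apiGroups", [])
        and resource in rule.get("resources", [])
        and verb in rule.get("verbs", [])
        for rule in rules
    )


if __name__ == "__main__":
    for resource in NEEDED:
        for verb in ("list", "watch"):
            print(resource, verb, allows(RULES, resource, verb))
```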

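The Prometheus Targets page now lists a service on port 9153 that was discovered automatically. It is kube-dns, whose Service carries the annotations prometheus.io/scrape: true and prometheus.io/port: 9153: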

[root@k8s03 manifests]# kubectl describe svc kube-dns -n kube-system
Name:              kube-dns
Namespace:         kube-system
Labels:            k8s-app=kube-dns
                   kubernetes.io/cluster-service=true
                   kubernetes.io/name=KubeDNS
Annotations:       prometheus.io/port: 9153
                   prometheus.io/scrape: true
Selector:          k8s-app=kube-dns
Type:              ClusterIP
IP:                10.96.0.10
Port:              dns  53/UDP
TargetPort:        53/UDP
Endpoints:         192.168.73.66:53,192.168.73.67:53
Port:              dns-tcp  53/TCP
TargetPort:        53/TCP
Endpoints:         192.168.73.66:53,192.168.73.67:53
Port:              metrics  9153/TCP
TargetPort:        9153/TCP
Endpoints:         192.168.73.66:9153,192.168.73.67:9153
Session Affinity:  None
Events:            <none>
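The discovered targets can also be listed through the Prometheus HTTP API. GET /api/v1/targets returns active targets with their labels, scrapeUrl and health. The sketch below filters them by job. It assumes Prometheus is reachable at http://localhost:9090, for example via kubectl port-forward svc/prometheus-k8s 9090 -n monitoring, and the helper names are illustrative.

```python
import json
import urllib.request
from typing import List, Tuple


def targets_for_job(payload: dict, job: str) -> List[Tuple[str, str]]:
    """(scrapeUrl, health) for every active target whose job label equals `job`."""
    return [
        (t["scrapeUrl"], t["health"])
        for t in payload["data"]["activeTargets"]
        if t["labels"].get("job") == job
    ]


def fetch_targets(base_url: str) -> dict:
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumes a port-forward to the prometheus-k8s Service is running locally.
    payload = fetch_targets("http://localhost:9090")
    for scrape_url, health in targets_for_job(payload, "kubernetes-service-endpoints"):
        print(health, scrape_url)
```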

Title: Adding alerting rules and notification channels with Prometheus Operator
Author: fish2018
URL: http://devopser.org/articles/2019/08/21/1566379859249.html