
Sharing a seemingly "garbage" Prometheus monitoring architecture, plus a hands-on walkthrough

不背锅运维 2024-02-02

Monitoring architecture

The overall approach is simple: every Prometheus instance configured for remote write sends its metrics to the same database on the same InfluxDB deployment, and each Prometheus sets global external_labels to mark where its data comes from. The current plan uses 2 InfluxDB nodes and can be scaled out horizontally later.

Host planning

  1. Prometheus planning

Basic HA for Prometheus or Alertmanager themselves is out of scope here; configure it yourself if you need it.

| Purpose | IP | Components | OS | Notes |
| --- | --- | --- | --- | --- |
| Regular monitoring, scrapes exporters | 192.168.88.12 | Prometheus 2.49.1 | CentOS 7.9 | Monitoring node; also sets global external_labels to identify the source, and configures remote write |
| Remote-reads metrics and sends alerts | 192.168.88.13 | Prometheus 2.49.1, Alertmanager | CentOS 7.9 | Configures remote read and alert rules; alerts are pushed elsewhere via webhook |
| Simulates another team's Prometheus | 192.168.88.14 | Prometheus 2.49.1 | CentOS 7.9 | Simulates another business unit's Prometheus feeding in data: they configure remote write plus global external_labels to mark the source. This is much more efficient than the federation approach used previously, which hurt performance badly |
  2. InfluxDB + influx-proxy high-availability planning

Per the plan, writes and reads are split across the two nodes to spread the load. Because machines are limited, this setup co-locates influx-proxy and InfluxDB on the same hosts (ideally influx-proxy would run on 2 separate hosts for HA; adapt as needed).

| Sync | IP | Installed components | Hostname | InfluxDB port | influx-proxy port | OS | Role |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Automatic, bidirectional | 192.168.88.30 | InfluxDB v1.8.10, influx-proxy 2.5.10 | influxdb01 | 8086 | 7076 | CentOS 7.9 | Writes |
| Automatic, bidirectional | 192.168.88.31 | InfluxDB v1.8.10, influx-proxy 2.5.10 | influxdb02 | 8086 | 7076 | CentOS 7.9 | Reads |
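
For reference, a minimal sketch of fetching and unpacking both components into the directory layout used below. The URLs follow the usual InfluxData and github.com/chengshiwen/influx-proxy release naming, but they are assumptions; verify against the official release pages:

# Static InfluxDB 1.8.10 build; extracts to influxdb-1.8.10-1/ (URL pattern assumed)
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.10_linux_amd64.tar.gz
tar -xzf influxdb-1.8.10_linux_amd64.tar.gz -C /data

# influx-proxy release from github.com/chengshiwen/influx-proxy (URL pattern assumed)
wget https://github.com/chengshiwen/influx-proxy/releases/download/v2.5.10/influx-proxy-2.5.10-linux-amd64.tar.gz
tar -xzf influx-proxy-2.5.10-linux-amd64.tar.gz -C /data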

InfluxDB + influx-proxy configuration

Note: influxdb01 and influxdb02 use identical configuration.

InfluxDB component:

/data/influxdb-1.8.10-1/etc/influxdb/influxdb.conf

bind-address = "0.0.0.0:8088"
[meta]
  dir = "/var/lib/influxdb/meta"
[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  series-id-set-cache-size = 100
[coordinator]
[retention]
[shard-precreation]
[monitor]
[http]
   bind-address = ":8086"
[logging]
[subscriber]
[[graphite]]
[[collectd]]
[[opentsdb]]
[[udp]]
[continuous_queries]
[tls]

/etc/systemd/system/influxdb.service

[Unit]
Description="influxdb"
After=network.target

[Service]
Type=simple
ExecStart=/data/influxdb-1.8.10-1/usr/bin/influxd -config /data/influxdb-1.8.10-1/etc/influxdb/influxdb.conf

Restart=on-failure
RestartSec=5s
SuccessExitStatus=0
LimitNOFILE=655360
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=influxdb

[Install]
WantedBy=multi-user.target
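
With the unit file in place, register and start the service, then confirm the HTTP API responds (standard systemd workflow; InfluxDB's /ping endpoint returns HTTP 204 when the node is up):

systemctl daemon-reload
systemctl enable --now influxdb
# Expect "HTTP/1.1 204 No Content" if InfluxDB is healthy
curl -i http://127.0.0.1:8086/ping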

influx-proxy component:

/data/influx-proxy-2.5.10-linux-amd64/proxy.json

{
    "circles": [
        {
            "name": "circle-1",
            "backends": [
                {
                    "name": "influxdb-1",
                    "url": "http://192.168.88.30:8086",
                    "username": "",
                    "password": ""
                }
            ]
        },
        {
            "name": "circle-2",
            "backends": [
                {
                    "name": "influxdb-2",
                    "url": "http://192.168.88.31:8086",
                    "username": "",
                    "password": ""
                }
            ]
        }
    ],
    "listen_addr": ":7076",
    "db_list": [],
    "data_dir": "/data/influx-proxy-2.5.10-linux-amd64/data",
    "tlog_dir": "log",
    "hash_key": "idx",
    "flush_size": 10000,
    "flush_time": 1,
    "check_interval": 1,
    "rewrite_interval": 10,
    "conn_pool_size": 20,
    "write_timeout": 10,
    "idle_timeout": 10,
    "username": "",
    "password": "",
    "write_tracing": false,
    "query_tracing": false,
    "pprof_enabled": false,
    "https_enabled": false,
    "https_cert": "",
    "https_key": ""
}

/etc/systemd/system/influx-proxy.service

[Unit]
Description="influx-proxy"
After=network.target

[Service]
Type=simple
ExecStart=/data/influx-proxy-2.5.10-linux-amd64/influx-proxy -config /data/influx-proxy-2.5.10-linux-amd64/proxy.json

Restart=on-failure
RestartSec=5s
SuccessExitStatus=0
LimitNOFILE=655360
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=influx-proxy

[Install]
WantedBy=multi-user.target
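
Start influx-proxy the same way. A quick sanity check is to hit its HTTP listener; the /health endpoint (an assumption based on the influx-proxy 2.x README, worth verifying for your build) reports per-backend reachability:

systemctl daemon-reload
systemctl enable --now influx-proxy
# /health is assumed from the influx-proxy 2.x docs; it reports backend status
curl -s http://192.168.88.30:7076/health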

Testing and verification

  1. Log in through each influx-proxy instance
[root@influxdb01 ~]# influx -host 192.168.88.30 -port 7076
Connected to http://192.168.88.30:7076 version 2.5.10
InfluxDB shell version: 1.8.10


[root@influxdb02 ~]# influx -host 192.168.88.31 -port 7076
Connected to http://192.168.88.31:7076 version 2.5.10
InfluxDB shell version: 1.8.10


  2. Create a database and check that it replicates

Create the database on 88.30:

[root@influxdb01 ~]# influx -host 192.168.88.30 -port 7076
Connected to http://192.168.88.30:7076 version 2.5.10
InfluxDB shell version: 1.8.10
> create database prometheus
> show databases
name: databases
name
----
_internal
prometheus


Check whether it also exists on 88.31:

[root@influxdb02 ~]# influx -host 192.168.88.31 -port 7076
Connected to http://192.168.88.31:7076 version 2.5.10
InfluxDB shell version: 1.8.10
> show databases
name: databases
name
----
_internal
prometheus


If it exists, replication is working. You can also drop the database on 88.31 and confirm it is gone on 88.30.
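
Data points replicate the same way as DDL. A quick sketch using a throwaway measurement (the name synctest is made up for this test): write through the proxy on 88.30, then read it back through the proxy on 88.31.

# Write a test point via the proxy on 88.30
influx -host 192.168.88.30 -port 7076 -database prometheus -execute 'INSERT synctest,probe=demo value=1'
# Read it back via the proxy on 88.31
influx -host 192.168.88.31 -port 7076 -database prometheus -execute 'SELECT * FROM synctest'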

Prometheus configuration

  • Configure Prometheus on 192.168.88.12:
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  external_labels:
    position: "wgprometheus"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.88.12:9090"]

  - job_name: "host-mon"
    static_configs:
      - targets: ["192.168.88.12:9100"]

remote_write:
  - url: "http://192.168.88.30:7076/api/v1/prom/write?db=prometheus"

  • Configure Prometheus on 192.168.88.13:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

remote_read:
  - url: "http://192.168.88.31:7076/api/v1/prom/read?db=prometheus"

  • Configure Prometheus on 192.168.88.14 (a validation sketch for all three files follows):
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  external_labels:
    position: "other-bus-prometheus"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.88.14:9090"]

  - job_name: "exporter_host"
    static_configs:
      - targets: ["192.168.88.14:9100"]

remote_write:
  - url: "http://192.168.88.30:7076/api/v1/prom/write?db=prometheus"

Verify that metrics are written to InfluxDB

[root@influxdb01 ~]# influx -host '192.168.88.31' -port 7076
Connected to http://192.168.88.31:7076 version 2.5.10
InfluxDB shell version: 1.8.10

> show databases;
name: databases
name
----
_internal
prometheus
> use prometheus
Using database prometheus
> SELECT * FROM node_load15 WHERE position = 'other-bus-prometheus' limit 1
name: node_load15
time                __name__    instance           job           position             value
----                --------    --------           ---           --------             -----
1706775423319000000 node_load15 192.168.88.14:9100 exporter_host other-bus-prometheus 0.04


> SELECT * FROM node_load15 WHERE position = 'wgprometheus' limit 1
name: node_load15
time                __name__    instance           job      position     value
----                --------    --------           ---      --------     -----
1706774096565000000 node_load15 192.168.88.12:9100 host-mon wgprometheus 0.05
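
The read path can be confirmed from the other direction as well: 88.13 scrapes nothing itself, so any PromQL query it answers must come from InfluxDB via remote read. Using the standard Prometheus HTTP API:

# This result can only be served through remote_read, since 88.13 has no scrape targets
curl -s 'http://192.168.88.13:9090/api/v1/query' --data-urlencode 'query=node_load15{position="wgprometheus"}'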

Configuring the alerting component

Per the plan, the alerting component is configured on 192.168.88.13.

  • alertmanager.yml:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5s
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
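
The file can be validated with amtool, which ships alongside Alertmanager. Note that repeat_interval: 5s re-fires notifications almost continuously; it only makes sense for testing:

# Validate the Alertmanager configuration
./amtool check-config alertmanager.yml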

Alert rule configuration

Per the plan, the Prometheus on 192.168.88.13 handles reading the monitoring data and alerting, so the alert rules are configured there.

  • prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

rule_files:
  - "/data/app/prometheus/alert.yml"

remote_read:
  - url: "http://192.168.88.31:7076/api/v1/prom/read?db=prometheus"

  • alert.yml:
groups:
- name: Disk alert related group
  rules:
  - alert: sdaDiskWriteTime
    expr: node_disk_write_time_seconds_total{device="sda"} > 1
    for: 1s
    labels:
      severity: disk
    annotations:
      summary: "Total sda disk write time"
      description: "Total sda disk write time exceeds the expected value"
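
As with the main config, the rule file can be checked with promtool, and Prometheus can be reloaded without a restart if the lifecycle API is enabled:

/data/app/prometheus/promtool check rules /data/app/prometheus/alert.yml
# Hot reload requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://192.168.88.13:9090/-/reload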

Verify that the alert fires
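
One way to check from the command line, assuming default ports, is to query the active-alert endpoints of Prometheus and Alertmanager:

# Alerts as evaluated by Prometheus on 88.13
curl -s http://192.168.88.13:9090/api/v1/alerts
# Alerts as received by Alertmanager (v2 API)
curl -s http://localhost:9093/api/v2/alerts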

