
Sharing a seemingly "garbage" Prometheus monitoring architecture, plus a hands-on walkthrough

不背锅运维 2024-02-02

Monitoring architecture

The overall approach is simple: every Prometheus instance configured for remote write sends its metrics to the same database on the same InfluxDB deployment, and each Prometheus sets global external_labels to mark where its data comes from. The current plan uses 2 InfluxDB nodes and can be scaled out horizontally later.

Host planning

  1. Prometheus planning

Basic HA for Prometheus or Alertmanager themselves is out of scope here; configure it yourself if you need it.

| Purpose | IP | Components | OS | Notes |
| --- | --- | --- | --- | --- |
| Regular monitoring, scrapes exporters | 192.168.88.12 | Prometheus 2.49.1 | CentOS 7.9 | Monitoring node; also sets global external_labels to identify the source, and configures remote write |
| Remote-reads metrics and sends alerts | 192.168.88.13 | Prometheus 2.49.1, Alertmanager | CentOS 7.9 | Configures remote read and alert rules; alerts are pushed elsewhere via webhook |
| Simulates another team's Prometheus | 192.168.88.14 | Prometheus 2.49.1 | CentOS 7.9 | Simulates another business unit's Prometheus feeding in data: they configure remote write plus global external_labels to mark the source. This is much more efficient than the federation approach used previously, which hurt performance badly |
  2. InfluxDB + influx-proxy high-availability planning

Per the plan, writes and reads are split across the two nodes to spread the load. Because machines are limited, this setup co-locates influx-proxy and InfluxDB on the same hosts (ideally influx-proxy would run on 2 separate hosts for HA; adapt as needed).

| Sync | IP | Installed components | Hostname | InfluxDB port | influx-proxy port | OS | Role |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Automatic, bidirectional | 192.168.88.30 | InfluxDB v1.8.10, influx-proxy 2.5.10 | influxdb01 | 8086 | 7076 | CentOS 7.9 | Writes |
| Automatic, bidirectional | 192.168.88.31 | InfluxDB v1.8.10, influx-proxy 2.5.10 | influxdb02 | 8086 | 7076 | CentOS 7.9 | Reads |
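
For reference, a minimal sketch of fetching and unpacking both components into the directory layout used below. The URLs follow the usual InfluxData and github.com/chengshiwen/influx-proxy release naming, but they are assumptions; verify against the official release pages:

# Static InfluxDB 1.8.10 build; extracts to influxdb-1.8.10-1/ (URL pattern assumed)
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.10_linux_amd64.tar.gz
tar -xzf influxdb-1.8.10_linux_amd64.tar.gz -C /data

# influx-proxy release from github.com/chengshiwen/influx-proxy (URL pattern assumed)
wget https://github.com/chengshiwen/influx-proxy/releases/download/v2.5.10/influx-proxy-2.5.10-linux-amd64.tar.gz
tar -xzf influx-proxy-2.5.10-linux-amd64.tar.gz -C /data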

InfluxDB + influx-proxy configuration

Note: influxdb01 and influxdb02 use identical configuration.

InfluxDB component:

/data/influxdb-1.8.10-1/etc/influxdb/influxdb.conf

bind-address = "0.0.0.0:8088"
[meta]
  dir = "/var/lib/influxdb/meta"
[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  series-id-set-cache-size = 100
[coordinator]
[retention]
[shard-precreation]
[monitor]
[http]
   bind-address = ":8086"
[logging]
[subscriber]
[[graphite]]
[[collectd]]
[[opentsdb]]
[[udp]]
[continuous_queries]
[tls]

/etc/systemd/system/influxdb.service

[Unit]
Description="influxdb"
After=network.target

[Service]
Type=simple
ExecStart=/data/influxdb-1.8.10-1/usr/bin/influxd -config /data/influxdb-1.8.10-1/etc/influxdb/influxdb.conf

Restart=on-failure
RestartSec=5s
SuccessExitStatus=0
LimitNOFILE=655360
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=influxdb

[Install]
WantedBy=multi-user.target
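
With the unit file in place, register and start the service, then confirm the HTTP API responds (standard systemd workflow; InfluxDB's /ping endpoint returns HTTP 204 when the node is up):

systemctl daemon-reload
systemctl enable --now influxdb
# Expect "HTTP/1.1 204 No Content" if InfluxDB is healthy
curl -i http://127.0.0.1:8086/ping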

influx-proxy component:

/data/influx-proxy-2.5.10-linux-amd64/proxy.json

{
    "circles": [
        {
            "name": "circle-1",
            "backends": [
                {
                    "name": "influxdb-1",
                    "url": "http://192.168.88.30:8086",
                    "username": "",
                    "password": ""
                }
            ]
        },
        {
            "name": "circle-2",
            "backends": [
                {
                    "name": "influxdb-2",
                    "url": "http://192.168.88.31:8086",
                    "username": "",
                    "password": ""
                }
            ]
        }
    ],
    "listen_addr": ":7076",
    "db_list": [],
    "data_dir": "/data/influx-proxy-2.5.10-linux-amd64/data",
    "tlog_dir": "log",
    "hash_key": "idx",
    "flush_size": 10000,
    "flush_time": 1,
    "check_interval": 1,
    "rewrite_interval": 10,
    "conn_pool_size": 20,
    "write_timeout": 10,
    "idle_timeout": 10,
    "username": "",
    "password": "",
    "write_tracing": false,
    "query_tracing": false,
    "pprof_enabled": false,
    "https_enabled": false,
    "https_cert": "",
    "https_key": ""
}

/etc/systemd/system/influx-proxy.service

[Unit]
Description="influx-proxy"
After=network.target

[Service]
Type=simple
ExecStart=/data/influx-proxy-2.5.10-linux-amd64/influx-proxy -config /data/influx-proxy-2.5.10-linux-amd64/proxy.json

Restart=on-failure
RestartSec=5s
SuccessExitStatus=0
LimitNOFILE=655360
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=influx-proxy

[Install]
WantedBy=multi-user.target
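
Start influx-proxy the same way. A quick sanity check is to hit its HTTP listener; the /health endpoint (an assumption based on the influx-proxy 2.x README, worth verifying for your build) reports per-backend reachability:

systemctl daemon-reload
systemctl enable --now influx-proxy
# /health is assumed from the influx-proxy 2.x docs; it reports backend status
curl -s http://192.168.88.30:7076/health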

Testing and verification

  1. Log in through each influx-proxy instance
[root@influxdb01 ~]# influx -host 192.168.88.30 -port 7076
Connected to http://192.168.88.30:7076 version 2.5.10
InfluxDB shell version: 1.8.10


[root@influxdb02 ~]# influx -host 192.168.88.31 -port 7076
Connected to http://192.168.88.31:7076 version 2.5.10
InfluxDB shell version: 1.8.10


  2. Create a database and check that it replicates

Create the database on 88.30:

[root@influxdb01 ~]# influx -host 192.168.88.30 -port 7076
Connected to http://192.168.88.30:7076 version 2.5.10
InfluxDB shell version: 1.8.10
> create database prometheus
> show databases
name: databases
name
----
_internal
prometheus


Check whether it also exists on 88.31:

[root@influxdb02 ~]# influx -host 192.168.88.31 -port 7076
Connected to http://192.168.88.31:7076 version 2.5.10
InfluxDB shell version: 1.8.10
> show databases
name: databases
name
----
_internal
prometheus


If it exists, replication is working. You can also drop the database on 88.31 and confirm it is gone on 88.30.
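
Data points replicate the same way as DDL. A quick sketch using a throwaway measurement (the name synctest is made up for this test): write through the proxy on 88.30, then read it back through the proxy on 88.31.

# Write a test point via the proxy on 88.30
influx -host 192.168.88.30 -port 7076 -database prometheus -execute 'INSERT synctest,probe=demo value=1'
# Read it back via the proxy on 88.31
influx -host 192.168.88.31 -port 7076 -database prometheus -execute 'SELECT * FROM synctest'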

Prometheus configuration

  • Configure Prometheus on 192.168.88.12:
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  external_labels:
    position: "wgprometheus"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.88.12:9090"]

  - job_name: "host-mon"
    static_configs:
      - targets: ["192.168.88.12:9100"]

remote_write:
  - url: "http://192.168.88.30:7076/api/v1/prom/write?db=prometheus"

  • Configure Prometheus on 192.168.88.13:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

remote_read:
  - url: "http://192.168.88.31:7076/api/v1/prom/read?db=prometheus"

  • Configure Prometheus on 192.168.88.14 (a validation sketch for all three files follows):
global:
  scrape_interval: 15s 
  evaluation_interval: 15s 
  external_labels:
    position: "other-bus-prometheus"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.88.14:9090"]

  - job_name: "exporter_host"
    static_configs:
      - targets: ["192.168.88.14:9100"]

remote_write:
  - url: "http://192.168.88.30:7076/api/v1/prom/write?db=prometheus"

Verify that metrics are written to InfluxDB

[root@influxdb01 ~]# influx -host '192.168.88.31' -port 7076
Connected to http://192.168.88.31:7076 version 2.5.10
InfluxDB shell version: 1.8.10

> show databases;
name: databases
name
----
_internal
prometheus
> use prometheus
Using database prometheus
> SELECT * FROM node_load15 WHERE position = 'other-bus-prometheus' limit 1
name: node_load15
time                __name__    instance           job           position             value
----                --------    --------           ---           --------             -----
1706775423319000000 node_load15 192.168.88.14:9100 exporter_host other-bus-prometheus 0.04


> SELECT * FROM node_load15 WHERE position = 'wgprometheus' limit 1
name: node_load15
time                __name__    instance           job      position     value
----                --------    --------           ---      --------     -----
1706774096565000000 node_load15 192.168.88.12:9100 host-mon wgprometheus 0.05
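
The read path can be confirmed from the other direction as well: 88.13 scrapes nothing itself, so any PromQL query it answers must come from InfluxDB via remote read. Using the standard Prometheus HTTP API:

# This result can only be served through remote_read, since 88.13 has no scrape targets
curl -s 'http://192.168.88.13:9090/api/v1/query' --data-urlencode 'query=node_load15{position="wgprometheus"}'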

Configuring the alerting component

Per the plan, the alerting component is configured on 192.168.88.13.

  • alertmanager.yml:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5s
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
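
The file can be validated with amtool, which ships alongside Alertmanager. Note that repeat_interval: 5s re-fires notifications almost continuously; it only makes sense for testing:

# Validate the Alertmanager configuration
./amtool check-config alertmanager.yml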

Alert rule configuration

Per the plan, the Prometheus on 192.168.88.13 handles reading the monitoring data and alerting, so the alert rules are configured there.

  • prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

rule_files:
  - "/data/app/prometheus/alert.yml"

remote_read:
  - url: "http://192.168.88.31:7076/api/v1/prom/read?db=prometheus"

  • alert.yml:
groups:
- name: Disk alert related group
  rules:
  - alert: sdaDiskWriteTime
    expr: node_disk_write_time_seconds_total{device="sda"} > 1
    for: 1s
    labels:
      severity: disk
    annotations:
      summary: "Total sda disk write time"
      description: "Total sda disk write time exceeds the expected value"
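
As with the main config, the rule file can be checked with promtool, and Prometheus can be reloaded without a restart if the lifecycle API is enabled:

/data/app/prometheus/promtool check rules /data/app/prometheus/alert.yml
# Hot reload requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://192.168.88.13:9090/-/reload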

Verify that the alert fires
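
One way to check from the command line, assuming default ports, is to query the active-alert endpoints of Prometheus and Alertmanager:

# Alerts as evaluated by Prometheus on 88.13
curl -s http://192.168.88.13:9090/api/v1/alerts
# Alerts as received by Alertmanager (v2 API)
curl -s http://localhost:9093/api/v2/alerts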

