4. Prometheus Monitoring Basics: Metric Naming and How to Read the Data

WeiyiGeek 2021-07-12

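0x00 Monitoring Metrics

1. Introduction to metrics
2. Metric naming

0x01 Metric Labels

1. Introduction to labels
2. Label use cases
3. Label usage examples
4. Exporter metric reference
4.1 Linux host metrics
4.2 Windows host metrics
4.3 Container metrics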

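0x00 Monitoring Metrics

1. Introduction to metrics

Q: What is a Prometheus metric?

A: Metrics are the foundation of Prometheus. Each metric identifies an item that is collected (scraped), and its value rises or falls over time, which lets you observe how a monitored item changes over a given period.
The word "metric" can be confusing at this point, because in Prometheus it means different things depending on context: it may refer to a metric family, a child, or a time series. For a Gauge, however, all three refer to the same thing.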

Basic examples:

#@ 1. Time series
latency_seconds_sum{path="/bar"}

#@ 2. Child (includes the _sum and _count time series)
latency_seconds{path="/bar"}

#@ 3. Metric family (metric name only)
latency_seconds
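
To make the relationship between a metric family, a child, and its time series concrete, here is a minimal Python sketch. It assumes the family is a summary without quantiles, so each child exposes only a `_sum` and a `_count` series; the helper name `series_for_summary_child` is hypothetical and not part of any Prometheus library.

```python
def series_for_summary_child(family: str, labels: dict) -> list:
    """Return the time series names that one child of a summary family exposes."""
    # Render the label set as a selector such as {path="/bar"}.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    selector = "{" + label_str + "}" if label_str else ""
    # Assumption: a quantile-free summary exposes exactly _sum and _count.
    return [f"{family}{suffix}{selector}" for suffix in ("_sum", "_count")]


series = series_for_summary_child("latency_seconds", {"path": "/bar"})
for name in series:
    print(name)
```

One family (`latency_seconds`) has one child per label set, and each child expands into several time series.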

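2. Metric naming

Description: When working with Prometheus you constantly assign names to the metrics that exporters collect, so naming matters both for collection and for later use. The rules below help you build standard metric names.

Overall structure of a metric name: library_name_unit_suffix

Naming rules:

Snake case: every part of a metric name should be lowercase and separated by underscores, e.g. node_uname_info.

Characters: a metric name must start with a letter (an underscore or colon is also technically accepted) and may be followed by any number of letters, digits, and underscores. It must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*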

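# - Valid example
node_cpu_core_throttles_total

# - Avoid double underscores (__) in instrumentation metric names; they are reserved for internal use by Prometheus.
bottomk(__rank_number__, __input_vector__)
__meta_kubernetes_node_address_InternalIP="192.168.12.226"

# - Avoid colons (:) in instrumentation metric names; they are reserved for users to use in recording rules.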

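Metric name: the name should express what the metric means, so that a reader can tell what the value is from the name alone. Split the name with underscores, and generally make it more specific from left to right, e.g. http_request_duration_seconds_sum. Do not put label names in the metric name (this can break aggregation queries).

Suffixes: the common suffixes _total, _count, _sum, and _bucket are reserved for the counter, summary, and histogram metric types (for example, the series http_request_duration_seconds_sum produced by a summary or histogram). Apart from always ending counter names with _total, do not put these suffixes at the end of your own metric names.

Units: to avoid ambiguity about whether a metric is in seconds or milliseconds, include the unit in the name, e.g. container_cpu_system_seconds_total. (The Prometheus convention is to use base units such as seconds and bytes.)

Library: metric names live in what is effectively a global namespace, so a library prefix avoids collisions between libraries and shows where a metric comes from. For example, prometheus_http_requests_total and go_gc_duration_seconds come from the Prometheus and Go libraries respectively. Do not use the application name as the prefix for every metric in an application.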
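
The naming rules above can be checked mechanically. The following Python sketch is one possible linter under the rules stated in this section; the function name `check_metric_name` and the wording of the messages are hypothetical, and it does not try to verify units or library prefixes.

```python
import re

# Regular expression for valid metric names, as given in the rules above.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# Suffixes generated by summaries and histograms.
RESERVED_SUFFIXES = ("_count", "_sum", "_bucket")


def check_metric_name(name: str, metric_type: str = "gauge") -> list:
    """Return a list of naming problems; an empty list means the name looks fine."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("does not match [a-zA-Z_:][a-zA-Z0-9_:]*")
    if name != name.lower():
        problems.append("not snake_case (contains uppercase letters)")
    if "__" in name:
        problems.append("contains a double underscore, reserved for internal use")
    if ":" in name:
        problems.append("contains a colon, reserved for recording rules")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter name should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total suffix is reserved for counters")
    if name.endswith(RESERVED_SUFFIXES):
        problems.append("suffix is reserved for summary/histogram series")
    return problems


print(check_metric_name("node_cpu_core_throttles_total", "counter"))
print(check_metric_name("job:http_requests:rate5m"))
```

A check like this is intended for names you define in your own instrumentation. Names produced by recording rules legitimately contain colons and would be flagged.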

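0x01 Metric Labels

1. Introduction to labels

Description: Labels tell us where a monitored item comes from (source, port, method, and so on), and they also give Prometheus rich aggregation and query capabilities.

Label categories
Labels fall into two categories: instrumentation labels and target labels. PromQL queries treat them identically, but distinguishing them helps you use labels more effectively.

Instrumentation labels: as the name suggests, these come from instrumentation and are set inside the application or library code, for example the type of HTTP request received or the database being accessed.

Target labels: these identify a specific monitoring target, i.e. a target that Prometheus scrapes, and are attached as part of the scrape.

Label patterns
Description: Prometheus only supports 64-bit floating-point numbers as time series values, not other data types such as strings. Label values, however, are strings, so they can carry descriptive information and can be used in PromQL expressions.

Tips: A metric can have one or more labels, and labels are unordered, so you can aggregate by any given label while ignoring the others, or aggregate by several labels at once.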
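
As an illustration of aggregating by some labels while ignoring the rest, here is a small Python sketch that mimics what a PromQL query such as `sum by (path) (...)` does. The sample data and the helper `sum_by` are made up for this example.

```python
from collections import defaultdict

# Hypothetical samples: (label set, value) pairs for one metric.
samples = [
    ({"path": "/login", "method": "GET", "instance": "a"}, 3.0),
    ({"path": "/login", "method": "POST", "instance": "a"}, 5.0),
    ({"path": "/login", "method": "GET", "instance": "b"}, 2.0),
    ({"path": "/home", "method": "GET", "instance": "b"}, 7.0),
]


def sum_by(samples, *by):
    """Sum values grouped by the given label names, ignoring all other labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += value
    return dict(totals)


by_path = sum_by(samples, "path")
by_path_method = sum_by(samples, "path", "method")
print(by_path)
print(by_path_method)
```

Because labels are unordered key/value pairs, grouping by `path` alone or by `path` and `method` together only changes which labels form the grouping key.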

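Tips: Label names beginning with an underscore are reserved, and so is __name__, which holds the metric name (the expression up is really syntactic sugar for {__name__="up"}). Avoid using such names for your own labels.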

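2. Label use cases

Description: In Prometheus, labels are typically used in the following scenarios:

At scrape time, to classify collected metrics and keep or drop data according to rules.

In the Prometheus UI, to find monitored items with PromQL expressions that select on labels, and to aggregate or average across them.

When statically configuring scrape targets in Prometheus.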

[{"targets": ["127.0.0.1:9100"],"labels": {"instance": "test","idc": "beijing"}}]  

In Alertmanager, to evaluate configured label names and values and send the corresponding alert notifications.
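
The static target list shown earlier can also be generated rather than written by hand. The Python sketch below assumes that the JSON is meant for file-based service discovery (`file_sd_configs`), which the original text does not state explicitly; the helper `build_target_group` is hypothetical.

```python
import json


def build_target_group(targets, **labels):
    """Build one target group: a list of host:port targets plus shared labels."""
    return {"targets": list(targets), "labels": dict(labels)}


groups = [build_target_group(["127.0.0.1:9100"], instance="test", idc="beijing")]
text = json.dumps(groups, indent=2)
print(text)
```

Every target in a group receives the same labels, so hosts that differ in `idc` or another label need to go into separate groups.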

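The most common case is HTTP request statistics for a website, where you need to count how many times each HTTP path is accessed. Rather than creating many metrics of the same kind that do the same job, we use labels, for example http_requests_total{path="/login"}

Tips: Summary quantiles break the usual rules about sums and averages, so you cannot do arithmetic on quantiles (for example, averaging them across instances).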
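
A small numeric experiment shows why quantiles cannot be averaged. The Python sketch below uses made-up latency data for two instances and a simple nearest-rank quantile, which is an assumption for illustration and not how Prometheus summaries compute quantiles internally.

```python
import math


def quantile(values, percent):
    """Nearest-rank quantile: the value at rank ceil(percent/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in seconds.
instance_a = [0.1] * 9 + [1.0]        # 10 requests, one of them slow
instance_b = [0.1] * 50 + [2.0] * 50  # 100 requests, half of them slow

p90_a = quantile(instance_a, 90)
p90_b = quantile(instance_b, 90)
average_of_p90 = (p90_a + p90_b) / 2
true_p90 = quantile(instance_a + instance_b, 90)

print(p90_a, p90_b, average_of_p90, true_p90)
```

The averaged value lands far from the quantile of the combined data, because the average ignores how many requests each instance served and how they were distributed.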

Tips: job and instance are the two labels that every target always has by default; job defaults to the job_name configuration option.

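3. Label usage examples

Description: When configuring service discovery in Prometheus, relabel_configs is used to keep or drop targets, and to rewrite labels, based on the labels that match.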

# PromQL functions for rewriting labels at query time
label_replace()
label_join()

# Basic examples: relabeling supports replace, labelmap, keep, drop, and other actions
relabel_configs:
- source_labels: []  # Relying on the defaults is a concise way to remove the team label.
  target_label: team
- source_labels: [__meta_consul_tags]  # Populate the env label from the prod, staging, or dev tag.
  regex: '.*,(prod|staging|dev),.*'
  target_label: env

# - 1. Simple match and replace
- source_labels: []      # Add the label k8s_cluster=3 to every target
  replacement: '3'
  target_label: "k8s_cluster"
  action: replace
- source_labels: [team]  # Rewrite team=monitoring to team=monitor
  regex: monitoring
  replacement: monitor
  target_label: team
  action: replace


# - 2. Keep the target when the label matches (regex match)
- source_labels: [__meta_kubernetes_endpoints_label_app_kubernetes_io_name]
  action: keep
  regex: ^(kube-state-metrics)$
- source_labels: [team]
  regex: dev|testing|monitor  # Tips: use | alternation instead of one rule per value.
  action: keep
- source_labels: [__meta_consul_tags]  # Only Consul services that carry the prod tag.
  regex: '.*,prod,.*'
  action: keep


# - 3. Drop the target when the label matches (regex match)
- source_labels: [__meta_kubernetes_endpoints_label_app_kubernetes_io_name]
  regex: ^(kube-state-metrics)$
  action: drop
- source_labels: [job, team]    # Do not monitor targets of the prom job owned by the monitor team (multiple source labels)
  regex: prom;monitor
  action: drop


# - 4. Regex replace (write the matched source data into the target label)
- source_labels: [__meta_consul_address]  # Use the Consul IP and port 9100 as the address.
  regex: '(.*)'
  replacement: '${1}:9100'
  target_label: __address__
- source_labels: [__meta_consul_node]     # Use the node name (plus port) in the instance label.
  regex: '(.*)'
  replacement: '${1}:9100'
  target_label: instance
- source_labels: [__address__,__meta_kubernetes_pod_annotation_prometheus_io_port]
  action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $1:$2
  target_label: __address__
- source_labels: [team]  # Strip the trailing "ing" from the team label (using a capture group)
  regex: '(.*)ing'
  replacement: '${1}'
  target_label: team
  action: replace
- source_labels: []      # Remove the team label with a replace action
  regex: '(.*)'
  replacement: '${1}'
  target_label: team
  action: replace


# - 5. Match the regex against all label names, then copy each matching label's value to the label name given by replacement
- action: labelmap
  regex: __meta_kubernetes_pod_label_(.+)
  replacement: '${1}'
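
To see what these rules do to a label set, the following Python sketch simulates a single relabel rule. It is a simplified model written for this article, not the Prometheus implementation: it covers only replace, keep, drop, and labelmap, assumes the default separator `;`, the default regex `(.*)`, and the default replacement `$1`, and the function names are hypothetical.

```python
import re


def _to_python_template(replacement: str) -> str:
    """Convert $1 and ${1} group references into Python's expand() template form."""
    return re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)


def relabel(labels: dict, rule: dict):
    """Apply one rule; return the new label set, or None if the target is dropped."""
    action = rule.get("action", "replace")
    regex = re.compile(rule.get("regex", "(.*)"))
    template = _to_python_template(rule.get("replacement", "$1"))
    out = dict(labels)

    if action == "labelmap":
        # labelmap matches label NAMES and copies their values to the new name.
        for name, value in labels.items():
            m = regex.fullmatch(name)
            if m:
                out[m.expand(template)] = value
        return out

    # Other actions match the source label VALUES joined by ";" (fully anchored).
    source = ";".join(labels.get(n, "") for n in rule.get("source_labels", []))
    m = regex.fullmatch(source)
    if action == "keep":
        return out if m else None
    if action == "drop":
        return None if m else out
    if m:  # replace
        value = m.expand(template)
        if value:
            out[rule["target_label"]] = value
        else:
            out.pop(rule["target_label"], None)  # an empty value removes the label
    return out


print(relabel({"team": "monitoring"},
              {"source_labels": ["team"], "regex": "(.*)ing", "target_label": "team"}))
print(relabel({"team": "web"},
              {"source_labels": ["team"], "regex": "dev|testing|monitor", "action": "keep"}))
```

The first call strips the trailing "ing" through the capture group, and the second returns None because a keep rule discards every target whose value does not match.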

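Tips: metric_relabel_configs applies relabeling to the time series scraped from a target. The keep/drop/replace/labelmap actions shown above, as well as labeldrop/labelkeep (which act on label names rather than label values), can all be used in metric_relabel_configs.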

metric_relabel_configs:

# - 1. Drop an expensive metric
- source_labels: [__name__]
  regex: http_request_size_bytes
  action: drop

# - 2. Drop histogram buckets to reduce data volume
- source_labels: [__name__, le]
  regex: 'prometheus_tsdb_compaction_duration_seconds_bucket;(4|32|256)'
  action: drop

# - 3. Use labeldrop to remove all labels with a given prefix
- regex: 'container_label_.*'
  action: labeldrop
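
The effect of the bucket-dropping and labeldrop rules on scraped series can be sketched in Python as follows. This is a simplified model with made-up series, not Prometheus code, and `apply_metric_relabel` is a hypothetical helper that hard-codes just those two rules.

```python
import re

# Rule 2 above: matched against "__name__;le", fully anchored.
DROP_BUCKETS = re.compile(r"prometheus_tsdb_compaction_duration_seconds_bucket;(4|32|256)")
# Rule 3 above: matched against label names.
LABEL_PREFIX = re.compile(r"container_label_.*")


def apply_metric_relabel(series: list) -> list:
    """Drop the selected histogram buckets, then strip container_label_* labels."""
    kept = []
    for labels in series:
        joined = ";".join(labels.get(name, "") for name in ("__name__", "le"))
        if DROP_BUCKETS.fullmatch(joined):
            continue  # action: drop removes the whole series
        # action: labeldrop removes matching labels but keeps the series
        kept.append({k: v for k, v in labels.items() if not LABEL_PREFIX.fullmatch(k)})
    return kept


scraped = [
    {"__name__": "prometheus_tsdb_compaction_duration_seconds_bucket", "le": "4"},
    {"__name__": "prometheus_tsdb_compaction_duration_seconds_bucket", "le": "8"},
    {"__name__": "container_memory_usage_bytes", "container_label_app": "web", "name": "c1"},
]
result = apply_metric_relabel(scraped)
for labels in result:
    print(labels)
```

Note the difference between the two actions: drop discards entire series, while labeldrop only removes labels, so the remaining label sets must still be unique after the labels are stripped.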

4. Exporter metric reference

4.1 Linux host metrics

Project: https://github.com/prometheus/node_exporter

Collectors enabled by default

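Name	Description	OS
arp	Collects ARP statistics from /proc/net/arp	Linux
conntrack	Collects conntrack statistics from /proc/sys/net/netfilter/	Linux
cpu	Collects CPU statistics	Darwin, Dragonfly, FreeBSD, Linux
diskstats	Collects disk I/O statistics from /proc/diskstats	Linux
edac	Error detection and correction statistics	Linux
entropy	Available kernel entropy	Linux
exec	Execution statistics	Dragonfly, FreeBSD
filefd	Collects file descriptor statistics from /proc/sys/fs/file-nr	Linux
filesystem	Filesystem statistics, such as disk space used	Darwin, Dragonfly, FreeBSD, Linux, OpenBSD
hwmon	Collects hardware monitoring and sensor data from /sys/class/hwmon/	Linux
infiniband	Collects network statistics from InfiniBand configurations	Linux
loadavg	Collects system load averages	Darwin, Dragonfly, FreeBSD, Linux, NetBSD, OpenBSD, Solaris
mdadm	Collects device statistics from /proc/mdstat	Linux
meminfo	Memory statistics	Darwin, Dragonfly, FreeBSD, Linux
netdev	Network interface traffic statistics, in bytes	Darwin, Dragonfly, FreeBSD, Linux, OpenBSD
netstat	Collects network statistics from /proc/net/netstat, equivalent to netstat -s	Linux
sockstat	Collects socket statistics from /proc/net/sockstat	Linux
stat	Collects various statistics from /proc/stat, including boot time, forks, and interrupts	Linux
textfile	Collects metrics from text files in the local directory given by --collector.textfile.directory	any
time	Current system time	any
uname	System information obtained through the uname system call	any
vmstat	Collects statistics from /proc/vmstat	Linux
wifi	Collects statistics for WiFi devices	Linux
xfs	Collects XFS runtime statistics	Linux (kernel 4.4+)
zfs	Collects ZFS performance statistics	Linux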

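Collectors disabled by default

Name	Description	OS
bonding	Collects the number of configured and active bonded network interfaces	Linux
buddyinfo	Collects memory fragmentation statistics from /proc/buddyinfo	Linux
devstat	Collects device statistics	Dragonfly, FreeBSD
drbd	Collects Distributed Replicated Block Device (DRBD) statistics	Linux
interrupts	Collects detailed interrupt statistics	Linux, OpenBSD
ipvs	Collects IPVS status from /proc/net/ip_vs and statistics from /proc/net/ip_vs_stats	Linux
ksmd	Collects kernel and system statistics from /sys/kernel/mm/ksm	Linux
logind	Collects session statistics from logind	Linux
meminfo_numa	Collects memory statistics from /proc/meminfo_numa	Linux
mountstats	Collects filesystem statistics from /proc/self/mountstats, including NFS client statistics	Linux
nfs	Collects NFS statistics from /proc/net/rpc/nfs, equivalent to nfsstat -c	Linux
qdisc	Collects queuing discipline statistics	Linux
runit	Collects runit service status	any
supervisord	Collects supervisord service status	any
systemd	Collects service and system status from systemd	Linux
tcpstat	Collects TCP connection status from /proc/net/tcp and /proc/net/tcp6	Linux
wifi	Exposes WiFi device and station statistics.	Linux
zoneinfo	Exposes NUMA memory zone metrics.	Linux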

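Tips: a quick guide to common metrics:

node_boot_time: system boot time
node_cpu: system CPU usage
node_disk_*: disk I/O
node_filesystem_*: filesystem usage
node_load1: system load (1-minute average)
node_memory_*: system memory usage
node_network_*: network traffic
node_time: current system time
go_*: Go runtime metrics of node exporter
process_*: metrics about the node exporter process itself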

4.2 Windows host metrics

Metric reference: https://github.com/prometheus-community/windows_exporter#collectors

Name	Description	Enabled by default
ad	Active Directory Domain Services
adfs	Active Directory Federation Services
cache	Cache metrics
cpu	CPU usage	✓
cpu_info	CPU Information
cs	"Computer System" metrics (system properties, num cpus/total memory)	✓
container	Container metrics
dfsr	DFSR metrics
dhcp	DHCP Server
dns	DNS Server
exchange	Exchange metrics
fsrmquota	Microsoft File Server Resource Manager (FSRM) Quotas collector
hyperv	Hyper-V hosts
iis	IIS sites and applications
logical_disk	Logical disks, disk I/O	✓
logon	User logon sessions
memory	Memory usage metrics
msmq	MSMQ queues
mssql	SQL Server Performance Objects metrics
netframework_clrexceptions	.NET Framework CLR Exceptions
netframework_clrinterop	.NET Framework Interop Metrics
netframework_clrjit	.NET Framework JIT metrics
netframework_clrloading	.NET Framework CLR Loading metrics
netframework_clrlocksandthreads	.NET Framework locks and metrics threads
netframework_clrmemory	.NET Framework Memory metrics
netframework_clrremoting	.NET Framework Remoting metrics
netframework_clrsecurity	.NET Framework Security Check metrics
net	Network interface I/O	✓
os	OS metrics (memory, processes, users)	✓
process	Per-process metrics
remote_fx	RemoteFX protocol (RDP) metrics
service	Service state metrics	✓
smtp	IIS SMTP Server
system	System calls	✓
tcp	TCP connections
time	Windows Time Service
thermalzone	Thermal information
terminal_services	Terminal services (RDS)
textfile	Read prometheus metrics from a text file	✓
vmware	Performance counters installed by the Vmware Guest agent

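4.3 Container metrics

Project: https://github.com/google/cadvisor
Metric reference: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md

Common cAdvisor metrics

Metric name	Type	Meaning
container_cpu_load_average_10s	gauge	Average CPU load of the container over the last 10 seconds
container_cpu_usage_seconds_total	counter	Cumulative CPU time consumed by the container on each CPU core (unit: seconds)
container_cpu_system_seconds_total	counter	Cumulative system CPU time consumed (unit: seconds)
container_cpu_user_seconds_total	counter	Cumulative user CPU time consumed (unit: seconds)
container_fs_usage_bytes	gauge	Filesystem space used by the container (unit: bytes)
container_fs_limit_bytes	gauge	Total filesystem space available to the container (unit: bytes)
container_fs_reads_bytes_total	counter	Cumulative amount of data read by the container (unit: bytes)
container_fs_writes_bytes_total	counter	Cumulative amount of data written by the container (unit: bytes)
container_memory_max_usage_bytes	gauge	Maximum memory usage of the container (unit: bytes)
container_memory_usage_bytes	gauge	Current memory usage of the container (unit: bytes)
container_spec_memory_limit_bytes	gauge	Memory limit of the container (unit: bytes)
machine_memory_bytes	gauge	Total memory of the host (unit: bytes)
container_network_receive_bytes_total	counter	Cumulative amount of data received over the container network (unit: bytes)
container_network_transmit_bytes_total	counter	Cumulative amount of data transmitted over the container network (unit: bytes)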
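
Gauges and counters from this table are usually combined rather than read raw. The Python sketch below shows two such derived values under stated assumptions: a memory limit of 0 is treated as "no limit configured", the counter is assumed not to reset between the two samples, and both helper names are hypothetical.

```python
def memory_usage_ratio(usage_bytes: float, limit_bytes: float):
    """container_memory_usage_bytes / container_spec_memory_limit_bytes.

    Returns None when the limit is 0 (assumed here to mean "no limit set").
    """
    if limit_bytes <= 0:
        return None
    return usage_bytes / limit_bytes


def cpu_cores_used(earlier_seconds: float, later_seconds: float, interval_seconds: float) -> float:
    """Average CPU cores used between two samples of container_cpu_usage_seconds_total.

    Assumes the counter did not reset between the two samples.
    """
    return (later_seconds - earlier_seconds) / interval_seconds


# Hypothetical sample values.
print(memory_usage_ratio(256 * 1024**2, 1024**3))  # 256 MiB used of a 1 GiB limit
print(cpu_cores_used(120.0, 150.0, 60.0))          # 30 CPU-seconds over 60 seconds
```

Dividing a gauge by its limit gives a saturation ratio, and dividing a counter's increase by the elapsed time gives a rate, which is the same idea as PromQL's `rate()` applied to a `_total` counter.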