prometheus高并发采集异常回顾

微保技术 2020-06-02

1573

作者：andorccao 曹灿

编辑：mavisxlmeng 蒙雪莲

复制

在大规模微服务架构当中如何快速观察线上运行情况并识别故障？通常我们靠在线监控上报来实现，并且由于其主业无关性，通常用开源中间件是个高效的选择。微保主要使用prometheus进行监控和告警。

背景

在大规模微服务架构当中如何快速观察线上运行情况并识别故障？通常我们靠在线监控上报来实现，并且由于其主业无关性，通常用开源中间件是个高效的选择。微保主要使用prometheus进行监控和告警。

Prometheus采集数据有pull或push两种方式。pull模式是服务端主动向客户端拉取数据。push模式是客户端使用脚本通过pushgateway插件将数据提交给服务端。微保默认情况下会使用pull模式，只有在少数短时间上线服务或者不方便pull的场景使用push模式推送metrics给prometheus server。

本文主要介绍在高并发场景下，我们碰到了一个prometheus系统bug及其调试和解决过程。

发现问题

weblog是我们的一个后台服务(go开发)，主要用于收集前端请求日志，进行格式化处理后保存到磁盘。该服务同时会提炼监控指标通过prometheus client组件缓存，prometheus 监控系统会每30秒定时pull metrics以生成监控视图。

某次weblog服务发布新版本以后，prometheus监控系统检测到相关核心指标异常。定位发现是采集返回报错，weblog服务的所有指标无法正常被采集，错误日志如下图。

An error has occurred during metrics collection:


collected metric weblog_time label:<name:"appID" value:"" > label:<name:"cma" value:"" > label:name:"code" value:"" > 1abel:<name:"event" value:"custom" > label :<name:"httpStatus" value:"" > label:name:"httpStatus" value:"" > label:<name:"ifFirst" value:"" > label:name:"message" value:"pageLoad" >  label:<name:"msg" value:"" > label<name:"num" value:"" >label:<name:"pageID" value:".*mall. *productDetail" > label:<name:"source" value:"" > label:<name:"timeName" value:"resource_cost_ms" > label:<name:"ua" value:"" > histogram:<sanple_count:243372 sanple_sum:4.762622828805921e+09 bucket:<cumulative_count:58815 upper_bound:0.005 > bucket:<cumulative_count:106019 upper_bound:0.01 > bucket:<cumulative_count:160479 upper_bound:0.025 > bucket:<cumulative_count:197994 upper_bound:0.05 > bucket:<cumulative_count:222643 upper_bound:0.1, bucket:<cumulative_count:233694 upper_bound:0.25 > bucket:<cumulative_count:239032 upper_bound:0.5 > bucket:<cumulativ_count:241622 upper_bound:1 > bucket:<cumulative_count:242917 upper_bound:2.5 > bucket:<cumulative_count:243219 upper_bpund:5 > bucket:<cunmulative_ count:243322 upper_bound:10 > > has label dimensions inconsistent with previously collected metrics in the same metric family


复制

解决问题

3.1 信息收集

观察上面采集的报错信息，可以发现上报了两个重复的label: httpStatus，怀疑是使用方动态传参时输入重复的label。然而经过检查上报实现代码：

func init() {  
    var labelTime = []string{"cmd","appID", "pageID", "event", "elementID", "message", "ua", "ifFirst",  "httpStatus", "code", "msg", "source", "num", "timeName"}   
    configHistogram = prometheus.NewHistogramVec(  
        prometheus.HistogramOpts{  
            Name:    "weblog_time",  
            Help:    "config配置的耗时指标",  
            Buckets: prometheus.DefBuckets,  
        }, labelTime,  
    )  
    prometheus.MustRegister(configHistogram)  
}  
func ReportTime(t float64, labelsTimeValue map[string]string) {  
     if t != 0 {  
        configHistogram.With(labelsTimeValue).Observe(t)  
    }  
}  
复制

初始化label没有重复，report使用的map结构体也不会出现重复的label。看来业务出问题的可能性比较小。

3.2 初步分析

1.对比本次发布新增指标与原有版本的差异，未发现异常，且新旧指标没有逻辑差异。关键是旧指标已经正常运行2年

2.对比依赖库版本，未发现更新

3.注释ReportTime上报逻辑，观察一天后（采集异常平均每天出现两次），没有复现

4.在测试环境模拟压测，未复现

可以看到业务代码非常少，且复现概率较低，判断有可能是prometheus开源组件的缺陷导致。

3.3 深入跟踪

3.3.1 根据唯一的错误关键字查找google上类似问题

label dimensions inconsistent with previously collected metrics in the same metric family
复制

大部分建议方法是升级版本。但新版本对proto依赖版本有要求，短期内不太适用，同时我们也想了解问题的根源。

3.3.2 对比client_golang版本

https://github.com/prometheus/client_golang/commit/c06fb788be8a05442219295095ee0e51523802f0

在某个版本中已经去掉这段检查。

  - if dimHash, ok := dimHashes[metricFamily.GetName()]; ok {  
  -   if dimHash != dh {  
  -       return fmt.Errorf(  
  -           "collected metric %s %s has label dimensions inconsistent with previously collected metrics in the same metric family",  
  -           metricFamily.GetName(), dtoMetric,  
  -       )  
  -   }  
复制

那么这个问题可以直接升版本解决。进一步发现，注释这段检查是规避了问题，但不能彻底解决。由于升级成本比较高，而且采集异常时，能通过重启恢复。所以我们打算深入分析问题所在。

3.3.3 梳理错误原因

梳理client_golang代码（下面的代码仅保留关键逻辑），整理一下报错的上下调用链关系。

func (r *Registry) Gather() ([]*dto.MetricFamily, error) {  
    for _, collector := range r.collectorsByID {  
        // A Metric models a single sample value with its meta data being exported to Prometheus.  
        collector.Collect(metricChan)  
    }  
          
    for metric := range metricChan {  
        dtoMetric := &dto.Metric{}  
        metric.Write(dtoMetric)  
        checkMetricConsistency(metricFamily, dtoMetric, metricHashes, dimHashes)  
     }  
}  
func checkMetricConsistency(metricFamily *dto.MetricFamily,dtoMetric *dto.Metric) {  
// Make sure label pairs are sorted. We depend on it for the consistency check.  
   sort.Sort(LabelPairSorter(dtoMetric.Label))  
      dh := hashNew()  
      for _, lp := range dtoMetric.Label {  
          dh = hashAdd(dh, lp.GetName())  
      }  
      if dh!=dimHashes[metricFamily.GetName()] {  
          return fmt.Errorf("collected metric %s %s has label dimensions inconsistent with previously collected metrics in the same metric family")  
      }  
} 
复制

调用链为：定时采集->Gather()->checkMetricConsistency()->校验label的hash是否一致。hash不一致说明两次采集时label有变化。在下一次复现过程中，日志也验证dtoMetric.Label确实改变了。

那么接下来需要分析label为什么会变化，在哪里发生改变？

3.4 缩小目标

1.假设论证1：label在初始化阶段就发生变更。这个很好验证，加上详细日志打印最初的label。复现后确认初始化label未变化，假设不成立。

2.假设论证2：在某种场景下label有变化（比如高并发，指针引用）。这个比较难验证，我们分析调用链上下文，看到所有label都是复制，没有直接引用。正常情况采集端是每30s调用Gather接口，那么我们可以并发调用Gather接口来压测。压测后发现假设2成立，高并发情况下label确实发生了变化。

3.5 定点分析，反复验证

多次复现后，label的变化分为两种：重复label和乱序label。

源码中只有sort.Sort会对label进行修改。仍然分开验证。

将原label改为有序，按常规经验经过sort.Sort不会有变化。修改后去压测。结果失败，最终仍然会报错。
随机构造数据，单独高并发压sort.Sort函数，无序数据时确实会产生数据错乱，有序时一切正常。

对比1和2，有可能环境不同，数据不一致导致了不同结果。

接下来分析golang官方库sort.Sort代码。很快就发现了问题的根源。

func Sort(data Interface) {  
    n := data.Len()  
    quickSort(data, 0, n, maxDepth(n))  
}  
func quickSort(data Interface, a, b, maxDepth int) {  
    for b-a > 12 { // Use ShellSort for slices <= 12 elements  
        if maxDepth == 0 {  
            heapSort(data, a, b)   
            return  
         }  
         maxDepth--
         mlo, mhi := doPivot(data, a, b) 
         // Avoiding recursion on the larger subproblem guarantees a stack depth of at most lg(b-a).  
         if mlo-a < b-mhi {  
             quickSort(data, a, mlo, maxDepth)  
             a = mhi // i.e., quickSort(data, mhi, b)  
         } else {  
             quickSort(data, mhi, b, maxDepth)  
             b = mlo // i.e., quickSort(data, a, mlo)  
         }  
     }  
     if b-a > 1 {  
        // Do ShellSort pass with gap 6  
        // It could be written in this simplified form cause b-a <= 12  
        for i := a + 6; i < b; i++ {  
            if data.Less(i, i-6) {  
                data.Swap(i, i-6)  
            }  
        }  
        insertionSort(data, a, b)  
    }  
}
复制

在数组长度小于等于12时，采用的插入排序，有序时不会交换数据；大于12时采用的快速排序算法，即使有序也会交换数据。而高并发下交换数据有可能导致数据错乱。

重新构造有序数据，并发压测sort.Sort函数，与分析一致，在数组长度大于12时数组元素错乱。这里也反向验证了旧的监控指标一直运行正常的现象，新增的指标label数量太多才会报错。

3.6 灰度发布验证

在不修改三方库的前提下，我们精简了label数量小于12个，灰度验证通过。

3.7 解决方案

1.业务层面限制，初始化的label数量小于等于12个时不会触发异常

2.升级client_golang版本，最新的版本已经彻底解决Sort并发调用的问题。

总结

这个缺陷隐藏的比较深，通过我们不断假设推论然后验证，最终找到问题的根源和解决方法。虽然过程些许曲折，花费不少时间，但收获很大。本文出现的是比较典型的高并发问题，希望解决思路能给大家带来帮助。

数据库

文章转载自微保技术，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。