Monitoring (Kubernetes Series)
Monitoring a cloud data center mainly means monitoring and alerting on the hardware and software inside it; traditional IT architectures, IaaS clouds, PaaS platform infrastructure, and SaaS services all rely on a monitoring system. A complete cloud monitoring system has to support many kinds of targets: containers, network, storage, servers, applications, and more. This article covers the overall architecture of a cloud monitoring system, the Prometheus monitoring system, Kubernetes container monitoring, and the Prometheus ecosystem in practice.
Monitoring System Overview
From a system design perspective, monitoring can be divided into infrastructure monitoring, middleware monitoring, application monitoring, and log monitoring.
Infrastructure monitoring
Host monitoring: physical and virtual machines. Ping checks verify host liveness; other common performance metrics include CPU, memory, network I/O, and disk I/O.
Network monitoring: network performance monitoring, network attack detection, and device monitoring.
Storage monitoring: storage performance, storage system, and storage device monitoring.
Middleware monitoring
Message middleware, web service middleware, caching middleware, and database middleware.
Application monitoring (APM): application runtime status, performance, call-chain (distributed) tracing, and so on.
Log monitoring: collects log text and aggregates it into a log store and search engine; useful for locating performance bottlenecks as well as for statistics and fault alerting.
An open-source cloud monitoring system: Open-Falcon
Open-Falcon is a distributed monitoring system open-sourced by Xiaomi. Built on a microservice architecture and written in Go, it consists of many modules and scales out well. Its core flows fall into four paths: data reporting, alerting, querying, and configuration.
Take a standard enterprise three-tier network architecture as an example (user zone, public services zone, cloud management zone).
Data reporting path
Agent (data collection) -> Proxy (proxy from the user zone to the cloud management zone) -> Transfer (data ingestion and forwarding) -> Graph (stores the monitoring data)
Configuration path
Pa-portal (alert configuration and management UI) -> Mysql (core database holding the configuration from pa-portal) <- Config (syncs the latest configuration from the MySQL database) <- Hbs (gets the latest configuration from the Config module and receives heartbeats reported by the Ping and Agent modules) <- [Ping / (Proxy <- Agent)] reporting heartbeats to Hbs
Query path (charts over reported data): Dashboard -> Mysql -> Graph
Alerting path: Hbs (alert policies: thresholds, maximum alert count, alert level) <- Transfer (fetches the alert policy configuration from Hbs and filters out non-conforming data) -> Judge (receives the data pushed by Transfer together with the alert-triggering policies, checks whether the monitoring data exceeds its thresholds and meets the alert conditions, and pushes qualifying items to the Redis cache) -> Redis <- Sender (pulls alert items from Redis and pushes them to the queue matching their alert level) -> Mail-provider (delivers alerts through different channels according to level) -> Mail/Phone/SMS (email, phone call, text message)
Open-Falcon data format
{
    "endpoint": "test-endpoint",               // name of the monitored object
    "metric": "test-metric",                   // metric name
    "timestamp": ts,                           // unix timestamp
    "step": 60,                                // step, i.e. the reporting interval
    "value": 1,                                // metric value
    "counterType": "GAUGE",                    // metric type
    "tags": "location=beijing,service=falcon"  // tags
}
The Agent module mainly collects host performance metrics; it can be custom-built and adapted for different operating systems and architectures.
The Transfer module is the data ingestion layer. It also accepts correctly formatted data pushed proactively by users and, combined with the configured alert policies, enables custom monitoring; a minimal push example follows.
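As a minimal sketch of such custom reporting (not part of the original article), the Go program below pushes one metric in the format above to a local falcon-agent HTTP push endpoint; the address http://127.0.0.1:1988/v1/push is an assumption that depends on how your agent is deployed.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// FalconMetric mirrors the Open-Falcon push data format shown above.
type FalconMetric struct {
	Endpoint    string  `json:"endpoint"`
	Metric      string  `json:"metric"`
	Timestamp   int64   `json:"timestamp"`
	Step        int     `json:"step"`
	Value       float64 `json:"value"`
	CounterType string  `json:"counterType"`
	Tags        string  `json:"tags"`
}

func main() {
	metrics := []FalconMetric{{
		Endpoint:    "test-endpoint",
		Metric:      "test-metric",
		Timestamp:   time.Now().Unix(),
		Step:        60,
		Value:       1,
		CounterType: "GAUGE",
		Tags:        "location=beijing,service=falcon",
	}}
	body, err := json.Marshal(metrics)
	if err != nil {
		panic(err)
	}
	// The push address below is an assumption; adjust it to your environment.
	resp, err := http.Post("http://127.0.0.1:1988/v1/push", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("push failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("push status:", resp.Status)
}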
The Prometheus Ecosystem
What is Prometheus? From the official introduction:
Prometheus is an open-source monitoring and alerting toolkit. Many companies and organizations have adopted it, its user community is very active, and it is now a standalone open-source project. Notably, Prometheus joined the CNCF in 2016 as the second hosted project after Kubernetes, which says a lot about its standing in the cloud-native world.
Prometheus features and components are covered in the official documentation and are not repeated here.
Prometheus Architecture Walkthrough
1. Prometheus Server discovers its scrape targets (usually HTTP/HTTPS endpoints) through service discovery (Kubernetes, Consul, ZooKeeper, ...) or static file configuration.
2. It actively pulls metrics on a schedule (per the global config).
3. Samples are stored in the local TSDB time-series database (15-day retention by default).
4. Third-party remote storage is supported through remote read/write adapters.
5. When a metric triggers an alert, Alertmanager can send notifications over multiple channels.
6. Prometheus exposes an HTTP query API; for visualization you can use the built-in dashboard or Grafana, and API clients can also export the TSDB data. A small example of an application exposing metrics for Prometheus to pull is shown below.
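As a sketch of the application side of step 1, assuming the prometheus/client_golang library (the metric name and port are illustrative), the program below exposes a /metrics endpoint that Prometheus can discover and pull from:
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requests counts handled HTTP requests, labelled by path (illustrative metric).
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "demo_http_requests_total",
		Help: "Total number of handled HTTP requests.",
	},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requests)

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requests.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("hello"))
	})
	// Prometheus pulls from this endpoint according to its scrape configuration.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}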
The Prometheus Operator introduces four custom resources (CRDs):
Prometheus: defines the desired Prometheus deployment; the Prometheus Operator continuously ensures that a deployment matching the resource definition is running.
ServiceMonitor: declaratively specifies how groups of Services should be monitored; the Operator generates the scrape configuration from this definition automatically.
PrometheusRule: defines a rule file that may contain Prometheus alerting rules and recording rules (which pre-compute series under new metric names); it is loaded by the Prometheus instance, which supports hot reload through its reload endpoint.
Alertmanager: defines the desired Alertmanager deployment.
How Kubernetes, Prometheus, and the Prometheus Operator work together
1. Every application in Kubernetes that needs to be monitored must expose Prometheus-format metrics and have a Service created for it.
2. A ServiceMonitor selects the associated Service(s) by label; the Operator generates the scrape configuration and triggers a reload.
3. All ServiceMonitors in the same namespace as the Prometheus instance are treated as Prometheus targets (jobs); Prometheus pulls metrics from the specified path at the configured interval.
4. The Prometheus Operator watches the state of Prometheus and ServiceMonitor resources through the List and Watch mechanism, and deploys and manages the Prometheus Server accordingly.
// Source: https://github.com/coreos/prometheus-operator/blob/master/pkg/listwatch/listwatch.go
// It allows for the creation of a cache.ListWatch for namespaces from a client that does not have `List` privileges.
// If the slice of namespaces contains only v1.NamespaceAll, then this func assumes that the client has List and
// Watch privileges and returns a regular cache.ListWatch, since there is no other way to get all namespaces.
func NewUnprivilegedNamespaceListWatchFromClient(c cache.Getter, namespaces []string, fieldSelector fields.Selector) *cache.ListWatch {
optionsModifier := func(options *metav1.ListOptions) {
options.FieldSelector = fieldSelector.String()
}
return NewFilteredUnprivilegedNamespaceListWatchFromClient(c, namespaces, optionsModifier)
}
// It allows for the creation of a cache.ListWatch for namespaces from a client that does not have `List` privileges.
// If the slice of namespaces contains only v1.NamespaceAll, then this func assumes that the client has List and
// Watch privileges and returns a regular cache.ListWatch, since there is no other way to get all namespaces.
func NewFilteredUnprivilegedNamespaceListWatchFromClient(c cache.Getter, namespaces []string, optionsModifier func(options *metav1.ListOptions)) *cache.ListWatch {
// If the only namespace given is `v1.NamespaceAll`, then this
// cache.ListWatch must be privileged. In this case, return a regular
// cache.ListWatch.
if IsAllNamespaces(namespaces) {
return cache.NewFilteredListWatchFromClient(c, "namespaces", metav1.NamespaceAll, optionsModifier)
}
listFunc := func(options metav1.ListOptions) (runtime.Object, error) {
optionsModifier(&options)
list := &v1.NamespaceList{}
for _, name := range namespaces {
result := &v1.Namespace{}
err := c.Get().
Resource("namespaces").
Name(name).
VersionedParams(&options, scheme.ParameterCodec).
Do().
Into(result)
if err != nil {
return nil, err
}
list.Items = append(list.Items, *result)
}
return list, nil
}
watchFunc := func(_ metav1.ListOptions) (watch.Interface, error) {
// Since the client does not have Watch privileges, do not
// actually watch anything. Use a watch.FakeWatcher here to
// implement watch.Interface but not send any events.
return watch.NewFake(), nil
}
return &cache.ListWatch{ListFunc: listFunc, WatchFunc: watchFunc}
}
// MultiNamespaceListerWatcher takes a list of namespaces and a cache.ListerWatcher generator func and returns a single cache.ListerWatcher capable of operating on multiple namespaces.
func MultiNamespaceListerWatcher(namespaces []string, f func(string) cache.ListerWatcher) cache.ListerWatcher {
// If there is only one namespace then there is no need to create a proxy.
if len(namespaces) == 1 {
return f(namespaces[0])
}
var lws []cache.ListerWatcher
for _, n := range namespaces {
lws = append(lws, f(n))
}
return multiListerWatcher(lws)
}
// multiListerWatcher abstracts several cache.ListerWatchers, allowing them to be treated as a single cache.ListerWatcher.
type multiListerWatcher []cache.ListerWatcher
// List implements the ListerWatcher interface. It combines the output of the List method of every ListerWatcher into a single result.
func (mlw multiListerWatcher) List(options metav1.ListOptions) (runtime.Object, error) {
l := metav1.List{}
var resourceVersions []string
for _, lw := range mlw {
list, err := lw.List(options)
if err != nil {
return nil, err
}
items, err := meta.ExtractList(list)
if err != nil {
return nil, err
}
metaObj, err := meta.ListAccessor(list)
if err != nil {
return nil, err
}
for _, item := range items {
l.Items = append(l.Items, runtime.RawExtension{Object: item.DeepCopyObject()})
}
resourceVersions = append(resourceVersions, metaObj.GetResourceVersion())
}
// Combine the resource versions so that the composite Watch method can
// distribute appropriate versions to each underlying Watch func.
l.ListMeta.ResourceVersion = strings.Join(resourceVersions, "/")
return &l, nil
}
// Watch implements the ListerWatcher interface. It combines the watch.Interface output of every cache.ListerWatcher into a single result chan and returns a single watch.Interface.
func (mlw multiListerWatcher) Watch(options metav1.ListOptions) (watch.Interface, error) {
resourceVersions := make([]string, len(mlw))
// Allow resource versions to be "".
if options.ResourceVersion != "" {
rvs := strings.Split(options.ResourceVersion, "/")
if len(rvs) != len(mlw) {
return nil, fmt.Errorf("expected resource version to have %d parts to match the number of ListerWatchers", len(mlw))
}
resourceVersions = rvs
}
return newMultiWatch(mlw, resourceVersions, options)
}
// multiWatch abstracts multiple watch.Interface's, allowing them
// to be treated as a single watch.Interface.
type multiWatch struct {
result chan watch.Event
stopped chan struct{}
stoppers []func()
}
// newMultiWatch returns a new multiWatch or an error if one of the underlying
// Watch funcs errored. The length of []cache.ListerWatcher and []string must
// match.
func newMultiWatch(lws []cache.ListerWatcher, resourceVersions []string, options metav1.ListOptions) (*multiWatch, error) {
var (
result = make(chan watch.Event)
stopped = make(chan struct{})
stoppers []func()
wg sync.WaitGroup
)
wg.Add(len(lws))
for i, lw := range lws {
o := options.DeepCopy()
o.ResourceVersion = resourceVersions[i]
w, err := lw.Watch(*o)
if err != nil {
return nil, err
}
go func() {
defer wg.Done()
for {
event, ok := <-w.ResultChan()
if !ok {
return
}
select {
case result <- event:
case <-stopped:
return
}
}
}()
stoppers = append(stoppers, w.Stop)
}
// result chan must be closed,
// once all event sender goroutines exited.
go func() {
wg.Wait()
close(result)
}()
return &multiWatch{
result: result,
stoppers: stoppers,
stopped: stopped,
}, nil
}
// ResultChan implements the watch.Interface interface.
func (mw *multiWatch) ResultChan() <-chan watch.Event {
return mw.result
}
// Stop implements the watch.Interface interface.
// It stops all of the underlying watch.Interfaces and closes the backing chan.
// Can safely be called more than once.
func (mw *multiWatch) Stop() {
select {
case <-mw.stopped:
// nothing to do, we are already stopped
default:
for _, stop := range mw.stoppers {
stop()
}
close(mw.stopped)
}
return
}
// IsAllNamespaces checks if the given slice of namespaces
// contains only v1.NamespaceAll.
func IsAllNamespaces(namespaces []string) bool {
return len(namespaces) == 1 && namespaces[0] == v1.NamespaceAll
}
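A hedged usage sketch of the helpers above (the namespaces, kubeconfig path, and resync period are assumptions, and client-go signatures vary slightly between versions): build one cache.ListerWatcher that spans several namespaces without cluster-wide List privileges and feed it to an informer.
package main

import (
	"fmt"
	"time"

	"github.com/coreos/prometheus-operator/pkg/listwatch"
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; adjust the path for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// One ListerWatcher over two namespaces, each backed by a per-namespace
	// cache.ListWatch built from the typed client's REST interface.
	lw := listwatch.MultiNamespaceListerWatcher(
		[]string{"monitoring", "default"},
		func(ns string) cache.ListerWatcher {
			return cache.NewListWatchFromClient(
				client.CoreV1().RESTClient(), "services", ns, fields.Everything())
		},
	)

	informer := cache.NewSharedInformer(lw, &v1.Service{}, 5*time.Minute)
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*v1.Service)
			fmt.Printf("observed service %s/%s\n", svc.Namespace, svc.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	informer.Run(stop)
}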
Kubernetes Container Monitoring
A Kubernetes cluster has two pipelines for collecting monitoring statistics:
The resource metrics pipeline: provides a limited set of metrics related to cluster components.
The full metrics pipeline: fetches metrics from the kubelet and exposes them to Kubernetes through adapters that implement the custom.metrics.k8s.io or external.metrics.k8s.io APIs. Several solutions exist, e.g. Prometheus and Sysdig.
Kubernetes Monitoring Architecture
Core metrics pipeline (the black part of the architecture diagram): the core metrics Kubernetes needs to function. They are collected from the kubelet, cAdvisor, and so on, and served by metrics-server to consumers such as the Dashboard and the HPA controller.
Monitoring pipeline (the blue part of the diagram): built on top of the core metrics. For example, Prometheus can pull core metrics from metrics-server and non-core metrics from other sources (such as Node Exporter), and build a monitoring and alerting system on top of them.
Related monitoring components and how they work
1. kubelet and cAdvisor
The kubelet is the bridge between the Kubernetes master and the worker node. It gets the Pods scheduled onto its node from kube-apiserver and runs them according to their podSpec (for example with Docker providing the container runtime), and it obtains per-container resource usage statistics from cAdvisor.
cAdvisor is an open-source agent for container resource usage and performance analysis. It supports Docker containers and is compiled into the kubelet binary. cAdvisor automatically discovers all containers on the machine and collects CPU, memory, network, and disk I/O usage statistics; it also reports machine-wide usage by analyzing the "root" container on the machine.
Selected source code
// kubernetes/pkg/kubelet/kubelet.go: kubelet dependency definitions
type Dependencies struct {
Options []Option
// Injected Dependencies
Auth server.AuthInterface
CAdvisorInterface cadvisor.Interface // cAdvisor
Cloud cloudprovider.Interface
ContainerManager cm.ContainerManager
DockerClientConfig *dockershim.ClientConfig
EventClient v1core.EventsGetter
HeartbeatClient clientset.Interface
OnHeartbeatFailure func()
KubeClient clientset.Interface
Mounter mount.Interface
OOMAdjuster *oom.OOMAdjuster
OSInterface kubecontainer.OSInterface
PodConfig *config.PodConfig
Recorder record.EventRecorder
Subpather subpath.Interface
VolumePlugins []volume.VolumePlugin
DynamicPluginProber volume.DynamicPluginProber
TLSOptions *server.TLSOptions
KubeletConfigController *kubeletconfig.Controller
}
// kubelet.go: fetch MachineInfo from cAdvisor (around lines 570-574 of the source file)
machineInfo, err := klet.cadvisor.MachineInfo()
if err != nil {
return nil, err
}
klet.machineInfo = machineInfo
// cAdvisor Interface kubernetes/pkg/kubelet/cadvisor/types.go
type Interface interface {
Start() error
DockerContainer(name string, req *cadvisorapi.ContainerInfoRequest) (cadvisorapi.ContainerInfo, error)
ContainerInfo(name string, req *cadvisorapi.ContainerInfoRequest) (*cadvisorapi.ContainerInfo, error)
ContainerInfoV2(name string, options cadvisorapiv2.RequestOptions) (map[string]cadvisorapiv2.ContainerInfo, error)
SubcontainerInfo(name string, req *cadvisorapi.ContainerInfoRequest) (map[string]*cadvisorapi.ContainerInfo, error)
MachineInfo() (*cadvisorapi.MachineInfo, error) // machine-level info used by the kubelet
VersionInfo() (*cadvisorapi.VersionInfo, error)
// Returns usage information about the filesystem holding container images.
ImagesFsInfo() (cadvisorapiv2.FsInfo, error)
// Returns usage information about the root filesystem.
RootFsInfo() (cadvisorapiv2.FsInfo, error)
// Get events streamed through passedChannel that fit the request.
WatchEvents(request *events.Request) (*events.EventChannel, error)
// Get filesystem information for the filesystem that contains the given file.
GetDirFsInfo(path string) (cadvisorapiv2.FsInfo, error)
}
// ImageFsInfoProvider informs cAdvisor how to find imagefs for container images.
type ImageFsInfoProvider interface {
// ImageFsInfoLabel returns the label cAdvisor should use to find the filesystem holding container images.
ImageFsInfoLabel() (string, error)
}
// kubernetes/pkg/kubelet/cadvisor/cadvisor_linux.go: defines which metrics cAdvisor collects
type cadvisorClient struct {
imageFsInfoProvider ImageFsInfoProvider
rootPath string
manager.Manager
}
var _ Interface = new(cadvisorClient)
// TODO(vmarmol): Make configurable.
// The amount of time for which to keep stats in memory.
const statsCacheDuration = 2 * time.Minute
const maxHousekeepingInterval = 15 * time.Second
const defaultHousekeepingInterval = 10 * time.Second
const allowDynamicHousekeeping = true
func init() {
// Override cAdvisor flag defaults.
flagOverrides := map[string]string{
// Override the default cAdvisor housekeeping interval.
"housekeeping_interval": defaultHousekeepingInterval.String(),
// Disable event storage by default.
"event_storage_event_limit": "default=0",
"event_storage_age_limit": "default=0",
}
for name, defaultValue := range flagOverrides {
if f := flag.Lookup(name); f != nil {
f.DefValue = defaultValue
f.Value.Set(defaultValue)
} else {
klog.Errorf("Expected cAdvisor flag %q not found", name)
}
}
}
func New(imageFsInfoProvider ImageFsInfoProvider, rootPath string, usingLegacyStats bool) (Interface, error) {
sysFs := sysfs.NewRealSysFs()
// included metric types: cpu, memory, cpuLoad, diskIO, network, accelerator, app
includedMetrics := cadvisormetrics.MetricSet{
cadvisormetrics.CpuUsageMetrics: struct{}{},
cadvisormetrics.MemoryUsageMetrics: struct{}{},
cadvisormetrics.CpuLoadMetrics: struct{}{},
cadvisormetrics.DiskIOMetrics: struct{}{},
cadvisormetrics.NetworkUsageMetrics: struct{}{},
cadvisormetrics.AcceleratorUsageMetrics: struct{}{},
cadvisormetrics.AppMetrics: struct{}{},
}
if usingLegacyStats {
includedMetrics[cadvisormetrics.DiskUsageMetrics] = struct{}{}
}
// collect metrics for all cgroups
rawContainerCgroupPathPrefixWhiteList := []string{"/"}
// Create and start the cAdvisor container manager.
m, err := manager.New(memory.New(statsCacheDuration, nil), sysFs, maxHousekeepingInterval, allowDynamicHousekeeping, includedMetrics, http.DefaultClient, rawContainerCgroupPathPrefixWhiteList)
if err != nil {
return nil, err
}
if _, err := os.Stat(rootPath); err != nil {
if os.IsNotExist(err) {
if err := os.MkdirAll(path.Clean(rootPath), 0750); err != nil {
return nil, fmt.Errorf("error creating root directory %q: %v", rootPath, err)
}
} else {
return nil, fmt.Errorf("failed to Stat %q: %v", rootPath, err)
}
}
cadvisorClient := &cadvisorClient{
imageFsInfoProvider: imageFsInfoProvider,
rootPath: rootPath,
Manager: m,
}
return cadvisorClient, nil
}
// The MachineInfo struct
type MachineInfo struct {
// The number of cores in this machine.
NumCores int `json:"num_cores"`
// Maximum clock speed for the cores, in KHz.
CpuFrequency uint64 `json:"cpu_frequency_khz"`
// The amount of memory (in bytes) in this machine
MemoryCapacity uint64 `json:"memory_capacity"`
// HugePages on this machine.
HugePages []HugePagesInfo `json:"hugepages"`
// The machine id
MachineID string `json:"machine_id"`
// The system uuid
SystemUUID string `json:"system_uuid"`
// The boot id
BootID string `json:"boot_id"`
// Filesystems on this machine.
Filesystems []FsInfo `json:"filesystems"`
// Disk map
DiskMap map[string]DiskInfo `json:"disk_map"`
// Network devices
NetworkDevices []NetInfo `json:"network_devices"`
// Machine Topology
// Describes cpu/memory layout and hierarchy.
Topology []Node `json:"topology"`
// Cloud provider the machine belongs to.
CloudProvider CloudProvider `json:"cloud_provider"`
// Type of cloud instance (e.g. GCE standard) the machine is.
InstanceType InstanceType `json:"instance_type"`
// ID of cloud instance (e.g. instance-1) given to it by the cloud provider.
InstanceID InstanceID `json:"instance_id"`
}
2. Metrics Server: Past and Present
How Heapster worked: it first got the list of all nodes in the cluster from the master, then fetched the monitoring data collected by cAdvisor via each node's kubelet, pushed all the resulting time-series data to a backend store, and also supported visualization.
Starting with Kubernetes v1.8, resource usage can be obtained through the Metrics API, served by Metrics Server, the aggregator of core cluster monitoring data that replaces the original Heapster component.
Metrics Server periodically collects metrics from the kubelet Summary API (the kubelet's /stats/summary endpoint); the aggregated data is kept in memory and exposed through the Metrics API.
Metrics Server reuses the api-server libraries to implement features such as authentication and versioning. To keep data in memory it introduces an in-memory store (an implementation of the Storage interface). The in-memory monitoring data is not persisted, but it can be extended with third-party storage, just like Heapster.
Compared with the previous collection approach (Heapster), the Metrics API is a new way of thinking: core monitoring metrics should be stable, versioned, and directly accessible to users (`kubectl top`) or to controllers in the cluster (e.g. HPA autoscaling), just like other Kubernetes APIs.
Using the Metrics API
The Metrics API only returns current measurements; it does not keep historical data.
The Metrics API URI is /apis/metrics.k8s.io/ and it is maintained under k8s.io/metrics.
metrics-server must be deployed for the API to be usable; metrics-server obtains its data by calling the kubelet Summary API.
Examples (e.g. via `kubectl proxy`); a small client sketch follows the example URLs:
http://127.0.0.1:8001/apis/metrics.k8s.io/v1beta1/nodes
http://127.0.0.1:8001/apis/metrics.k8s.io/v1beta1/nodes/<node-name>
http://127.0.0.1:8001/apis/metrics.k8s.io/v1beta1/namespaces/<namespace-name>/pods/<pod-name>
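As a small sketch of consuming this API (assuming `kubectl proxy` is listening on 127.0.0.1:8001 as in the URLs above; the struct below models only the fields we print, not the full API object), the program lists node metrics and prints CPU and memory usage:
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// nodeMetricsList models just enough of the metrics.k8s.io NodeMetricsList
// response for this example; see k8s.io/metrics for the full types.
type nodeMetricsList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Usage map[string]string `json:"usage"` // e.g. {"cpu":"137m","memory":"5216632Ki"}
	} `json:"items"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8001/apis/metrics.k8s.io/v1beta1/nodes")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list nodeMetricsList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		fmt.Printf("%s cpu=%s memory=%s\n",
			item.Metadata.Name, item.Usage["cpu"], item.Usage["memory"])
	}
}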
3. kube-state-metrics
kube-state-metrics generates metrics about the state of Kubernetes objects (Deployments, Nodes, Pods, and so on) by watching the api-server; its monitoring dimension is objects rather than resource usage.
The following is one metric defined in pod.go:
Metric name: kube_pod_status_ready; metric type: GAUGE; Help: "xxx" (required by the Prometheus conventions)
Each object kind defines its own set of metrics; if you are interested, have a look at this part of the source code.
Source directory: kube-state-metrics/internal/collector
4. What if the metrics above still do not meet your needs?
Write a custom exporter that exposes Prometheus-format metrics, extending them according to your business needs, as sketched below.
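A minimal sketch of such a custom exporter, assuming the prometheus/client_golang library (the metric name, queue names, and port are illustrative): implement the prometheus.Collector interface and serve the result over /metrics.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// queueCollector exports a hypothetical business metric: pending jobs per queue.
type queueCollector struct {
	pendingDesc *prometheus.Desc
}

func newQueueCollector() *queueCollector {
	return &queueCollector{
		pendingDesc: prometheus.NewDesc(
			"demo_queue_pending_jobs", // illustrative metric name
			"Number of pending jobs in each queue.",
			[]string{"queue"}, nil,
		),
	}
}

// Describe sends the descriptors of all metrics this collector can export.
func (c *queueCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.pendingDesc
}

// Collect is called on every scrape; a real exporter would query the business
// system here instead of using hard-coded values.
func (c *queueCollector) Collect(ch chan<- prometheus.Metric) {
	fakeValues := map[string]float64{"orders": 12, "payments": 3}
	for queue, v := range fakeValues {
		ch <- prometheus.MustNewConstMetric(c.pendingDesc, prometheus.GaugeValue, v, queue)
	}
}

func main() {
	prometheus.MustRegister(newQueueCollector())
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9105", nil) // arbitrary port
}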
Deployment in Practice
Next we walk through deploying Prometheus container monitoring with prometheus-operator.
1. Download prometheus-operator
git clone https://github.com/coreos/prometheus-operator.git
2. Enter the manifests directory
cd prometheus-operator/contrib/kube-prometheus/manifests
3. The directory contains 63 files in total
ls | wc -l
4. You can install everything in one shot with `kubectl create -f .`
5. If you are not sure what step 4 actually did, the files are introduced below in alphabetical order (the recommended order for a manual deployment).
a. vim 00namespace-namespace.yaml: creates a namespace called monitoring for everything that follows.
b. The four Prometheus Operator CRDs introduced earlier correspond to these files:
0prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
0prometheus-operator-0prometheusCustomResourceDefinition.yaml
0prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
0prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
Tip: what does a CRD actually define? Open one of these files and it looks a lot like an ordinary YAML manifest.
Exactly: each CRD defines one resource type, and every configurable field of that resource type appears in the CRD, along with a "description" explaining it.
c. Deploy prometheus-operator
0prometheus-operator-serviceAccount.yaml
0prometheus-operator-clusterRole.yaml
0prometheus-operator-clusterRoleBinding.yaml
0prometheus-operator-deployment.yaml
0prometheus-operator-service.yaml
0prometheus-operator-serviceMonitor.yaml
The resources are created roughly in the order sa -> clusterRole -> clusterRoleBinding -> deployment -> svc -> servicemonitor.
Open the file 0prometheus-operator-serviceMonitor.yaml:
apiVersion: monitoring.coreos.com/v1   # API group
kind: ServiceMonitor                   # CRD kind
metadata:
  labels:
    k8s-app: prometheus-operator
  name: prometheus-operator
  namespace: monitoring
spec:
  endpoints:                           # must be defined
  - honorLabels: true
    port: http
  selector:
    matchLabels:
      k8s-app: prometheus-operator     # selects Services in the same namespace by label
d. Deploy Prometheus
prometheus-serviceAccount.yaml
prometheus-clusterRole.yaml
prometheus-clusterRoleBinding.yaml
prometheus-roleConfig.yaml
prometheus-roleBindingConfig.yaml
prometheus-roleSpecificNamespaces.yaml // grants the prometheus ServiceAccount access to API resources across namespaces
prometheus-roleBindingSpecificNamespaces.yaml
prometheus-rules.yaml // Prometheus rule file: alerting rules and recording rules
prometheus-prometheus.yaml // the Prometheus deployment itself
prometheus-service.yaml
prometheus-serviceMonitor.yaml
e. Deploy kube-state-metrics
kube-state-metrics-serviceAccount.yaml
kube-state-metrics-clusterRole.yaml
kube-state-metrics-clusterRoleBinding.yaml
kube-state-metrics-role.yaml
kube-state-metrics-roleBinding.yaml
kube-state-metrics-deployment.yaml
kube-state-metrics-service.yaml
kube-state-metrics-serviceMonitor.yaml
f. Exporter deployment follows the same pattern; take node-exporter as an example:
node-exporter-serviceAccount.yaml
node-exporter-clusterRole.yaml
node-exporter-clusterRoleBinding.yaml
node-exporter-daemonset.yaml // node exporters are special: they are deployed as a DaemonSet
node-exporter-service.yaml
node-exporter-serviceMonitor.yaml
g. Monitoring the cluster components
vim prometheus-serviceMonitorApiserver.yaml // if the components are deployed as container images, this single file is enough
For a binary (non-containerized) deployment, create Endpoints matching your environment, a Service on top of them, and a ServiceMonitor that selects that Service.
Example endpoint.yaml:
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 172.16.0.2
  - ip: 172.16.0.7
  - ip: 172.16.0.8
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 172.16.0.2
  - ip: 172.16.0.7
  - ip: 172.16.0.8
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
svc.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  # No selector: the Endpoints created above are maintained by hand, so the
  # endpoints controller must not manage (and overwrite) them.
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  # No selector: the Endpoints created above are maintained by hand.
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
That covers deployment; if you are interested in the other components, they are worth exploring further on your own.
Summary
Upcoming topics, updated weekly:
Cluster security
Networking + Calico in practice
Storage + CephFS in practice
Logging + Fluentd & Elasticsearch in practice
END