[The Most Complete Ever] Ambari Big Data Cluster Operations and Administration Guide (Part 2)

大数据研习社 2022-03-31

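7 Administering the Cluster

Using the Ambari Web Admin options:

Any user: can view information about the installed stack and the version of each service added to it.

Cluster administrators: can

Enable Kerberos security
Regenerate keytabs
View the names and values of service user accounts
Enable auto-start for services

Ambari administrators: can

Add new services to the stack
Upgrade the stack to a new version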

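7.1 Using Stack and Versions Information

The Stack tab lists the services that are installed and available in the cluster stack. Any user can browse the list of services. As an Ambari administrator, you can click Add Service to start the wizard that installs a service into the cluster.

The Versions tab shows which version of the software is currently installed and running in the cluster. As a cluster administrator, you can start an automated cluster upgrade from this page.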

7.2 Viewing Service Accounts

As a cluster administrator, you can view the list of service user and group accounts for the cluster services.

In Ambari Web UI > Admin, click Service Accounts.

7.3 Enabling Kerberos and Regenerating Keytabs

As a cluster administrator, you can enable and manage Kerberos security on the cluster.

Prerequisites:

Before enabling Kerberos on the cluster, you must prepare the cluster as described in:

security/content/ch_configuring_amb_hdp_for_kerberos.html

Steps:

In the Ambari Web UI > Admin menu, click Enable Kerberos to start the Kerberos wizard.

After Kerberos is enabled, you can regenerate keytabs and disable Kerberos from the Ambari Web UI > Admin menu.

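7.3.1 Regenerating Keytabs

As a cluster administrator, you can regenerate the keytabs required to maintain Kerberos security.

Prerequisites:

Before regenerating keytabs:

The cluster must be Kerberos-enabled
You must have the KDC Admin credentials

Steps:

① Browse to Admin > Kerberos

② Click Regenerate Kerberos.

③ Confirm your selection

④ Ambari connects to the Kerberos Key Distribution Center (KDC) and regenerates the keytabs for the service and Ambari principals in the cluster. Optionally, you can regenerate keytabs only for hosts that are missing them, for example, hosts that were offline or unavailable when Ambari enabled Kerberos.

⑤ Restart all services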

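7.3.2 Disabling Kerberos

As a cluster administrator, you can disable Kerberos on the cluster.

Prerequisite:

Before disabling Kerberos security, the cluster must already be Kerberos-enabled.

Steps:

① Browse to Admin > Kerberos

② Click Disable Kerberos

③ Confirm your selection

Cluster services stop, and the Ambari Kerberos security settings are reset.

④ To re-enable Kerberos, click Enable Kerberos and follow the wizard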

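7.4 Enabling Service Auto-Start

As a cluster administrator or cluster operator, you can enable automatic restart for each service in the stack. When auto-start is enabled for a service, ambari-agent restarts service components that are in a stopped state without manual intervention. Auto-start services are enabled by default, but only the Ambari Metrics Collector component is set to auto-start by default.

As a first step, enable auto-start for the worker nodes of the core Hadoop services, for example the DataNode and NameNode components in YARN and HDFS. You should also enable auto-start for all components in the SmartSense service. After enabling auto-start, monitor the operating status of the services on the Ambari Web dashboard. Auto-start attempts do not display as background operations. To diagnose a service component that fails to start, check the Ambari agent log file, located at /var/log/ambari-agent.log on the component host.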

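Managing the auto-start status of a service's components

Steps:

① In Auto-Start Services, click a service name

② For at least one component in the Auto-Start Services control, click the gray area so that its status changes to Enabled

The green icon to the right of the service name indicates the percentage of the service's components that have auto-start enabled

③ To enable auto-start for all components of the service, click Enable All

A fully filled green icon indicates that all components of the service have auto-start enabled

④ To disable auto-start for all components of the service, click Disable All

An empty green icon indicates that all components of the service have auto-start disabled

⑤ To clear all pending status changes before saving them, click Discard

⑥ When you finish changing the auto-start status settings, click Save.

To disable service auto-start:

① In Ambari Web, click Admin > Service Auto-Start

② In Service Auto Start Configuration, in the Auto-Start Services control, click the gray area to change the status from Enabled to Disabled

③ Click Save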

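8 Managing Alerts and Notifications

Ambari uses a predefined set of seven types of alerts (web, port, metric, aggregate, script, server, and recovery) for each cluster component and host. You can use these alerts to monitor cluster health and to alert other users to help identify and troubleshoot problems. You can modify an alert's name, description, and check interval, and you can disable and re-enable alerts.

You can also create groups of alerts and set up notification targets for each group, so that different sets of alerts are sent to different groups of users by different methods.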

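8.1 Understanding Alerts

Ambari predefines a set of alerts that monitor the cluster components and hosts. Each alert is defined by an alert definition, which specifies the alert type, the check interval, and the thresholds. When a cluster is created or modified, Ambari reads the alert definitions and creates alert instances for the specific items to be monitored. For example, if the cluster includes the Hadoop Distributed File System (HDFS), there is an alert definition that monitors "DataNode Process". An instance of that alert definition is created for each DataNode in the cluster.

Using Ambari Web, you can browse the list of alert definitions for the cluster by clicking the Alerts tab. You can search or filter alert definitions by current status, by last status change, and by the service the alert definition is associated with. You can click an alert definition name to view details about the alert, to modify alert properties (such as the check interval and thresholds), and to see the list of alert instances associated with that definition.

Each alert instance reports an alert status, defined by severity. The most common severity levels are OK, WARNING, and CRITICAL; there are also UNKNOWN and NONE severities. Alert notifications are sent when the alert status changes (for example, from OK to CRITICAL).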

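8.1.1 Alert Types

Alert thresholds and threshold units depend on the alert type. The following list describes the alert types, their possible statuses, and which threshold units can be configured, where thresholds are configurable.

WEB Alert Type: A WEB alert watches a web URL on a given component; the alert status is determined by the HTTP response code. You cannot change which HTTP response codes determine the thresholds for WEB alerts. You can customize the response text for each threshold and the overall web connection timeout. A connection timeout is considered a CRITICAL alert. Threshold units are in seconds.

The response codes map to WEB alert statuses as follows:

OK status: the web URL responds with a code below 400.
WARNING status: the web URL responds with a code equal to or above 400.
CRITICAL status: Ambari cannot connect to the web URL.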
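
The mapping above can be sketched as a small Python function. This is only an illustration of the rule as stated in this guide, not Ambari's own implementation; the function name web_alert_state and the convention of passing None for a failed or timed-out connection are assumptions made for the example.

```python
from typing import Optional


def web_alert_state(http_status: Optional[int]) -> str:
    """Map an HTTP response code to a WEB alert state.

    http_status is None when no connection could be made,
    which includes a connection timeout (hypothetical convention).
    """
    if http_status is None:
        return "CRITICAL"   # Ambari could not connect to the URL
    if http_status < 400:
        return "OK"         # any response code below 400
    return "WARNING"        # response code equal to or above 400
```

Note that under this rule a 500 response is still only a WARNING, because the component answered; only a missing answer is CRITICAL.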

PORT Alert Type: A PORT alert checks the response time of a connection to a given port. Threshold units are in seconds.
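
A PORT-style check can be approximated with the Python standard library. The sketch below times a TCP connect and compares it with two thresholds in seconds. The threshold values (1.5 and 5.0), the function names, and the choice to treat a refused or timed-out connection as CRITICAL are assumptions for illustration, not values or behavior taken from this guide.

```python
import socket
import time

# Illustrative thresholds in seconds; assumptions, not values from this guide.
WARNING_SECONDS = 1.5
CRITICAL_SECONDS = 5.0


def classify_connect_time(elapsed, warning_s=WARNING_SECONDS,
                          critical_s=CRITICAL_SECONDS):
    """Classify a measured connect time (in seconds) against two thresholds."""
    if elapsed >= critical_s:
        return "CRITICAL"
    if elapsed >= warning_s:
        return "WARNING"
    return "OK"


def port_alert_state(host, port, warning_s=WARNING_SECONDS,
                     critical_s=CRITICAL_SECONDS):
    """Time a TCP connect to host:port and return an alert state.

    Assumption: a refused, unreachable, or timed-out connection is CRITICAL.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=critical_s):
            elapsed = time.monotonic() - start
    except OSError:
        return "CRITICAL"
    return classify_connect_time(elapsed, warning_s, critical_s)
```

For example, port_alert_state("127.0.0.1", 2181) would probe a local ZooKeeper server on its default port.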

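METRIC Alert Type: A METRIC alert checks the value of one or more metrics (more than one when a calculation is performed). The metrics are read from a URL endpoint available on a given component. A connection timeout is considered a CRITICAL alert.

Thresholds are adjustable, and the unit of each threshold depends on the metric. For example, for CPU utilization alerts the unit is a percentage; for RPC latency alerts the unit is milliseconds.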

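AGGREGATE Alert Type: An AGGREGATE alert aggregates the alert status of other alerts as a percentage of the alerts affected. For example, the Percent DataNode Process alert aggregates the DataNode Process alert.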
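
To make the percentage idea concrete, here is a minimal sketch that turns a list of per-host alert states into one aggregate state. The function name, the default thresholds (10% warning, 30% critical, borrowed from the Percent DataNodes with Available Space alert later in this guide), the set of states counted as "affected", and the at-or-above boundary handling are all assumptions for illustration.

```python
def aggregate_alert_state(instance_states, warning_pct=10.0, critical_pct=30.0,
                          affected_states=("WARNING", "CRITICAL")):
    """Return (state, percent_affected) for a list of alert instance states.

    Assumption: an instance counts as affected when its state is listed in
    affected_states, and a threshold fires at or above its percentage.
    """
    if not instance_states:
        return ("UNKNOWN", 0.0)  # nothing to aggregate
    affected = sum(1 for s in instance_states if s in affected_states)
    percent = 100.0 * affected / len(instance_states)
    if percent >= critical_pct:
        return ("CRITICAL", percent)
    if percent >= warning_pct:
        return ("WARNING", percent)
    return ("OK", percent)
```

With 20 DataNode Process instances, one failing instance gives 5% and stays OK, while eight failing instances give 40% and become CRITICAL.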

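SCRIPT Alert Type: A SCRIPT alert runs a script to determine its status, such as OK, WARNING, or CRITICAL. You can customize the response text and the values of the properties and thresholds of a SCRIPT alert.

SERVER Alert Type: A SERVER alert runs a server-side runnable class to determine the alert status, such as OK, WARNING, or CRITICAL.

RECOVERY Alert Type: A RECOVERY alert is handled by the Ambari Agent and monitors process restarts. The OK, WARNING, and CRITICAL statuses are based on the number of times a process is automatically restarted. This is useful for knowing when processes are terminating and Ambari is automatically restarting them.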

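8.2 Modifying Alerts

The general properties of an alert include name, description, check interval, and thresholds.

The check interval defines how often Ambari checks the alert status. For example, "1 minute" means that Ambari checks the alert status every minute.

The configuration options for thresholds depend on the alert type.

To modify the general properties of an alert:

① Browse to the Alerts section in Ambari Web

② Find the alert definition and click it to view the definition details

③ Click Edit to modify the name, description, check interval, and thresholds (if applicable)

④ Click Save

⑤ The changes take effect on all alert instances at the next check interval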

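8.3 Modifying Alert Check Counts

Ambari lets you set the number of checks an alert performs before dispatching a notification. If the alert status changes during a check, Ambari rechecks the condition a specified number of times (the check count) before dispatching a notification.

Alert check counts do not apply to the AGGREGATE alert type. For AGGREGATE alerts, a single status change results in a notification being dispatched.

If your environment often has transient issues that cause false alerts, you can increase the check count. In that case, the alert status change is still recorded, but as a SOFT state change. If the alert condition is still triggered after the specified number of checks, the state change is considered HARD, and notifications are sent.

The check count is usually set globally for all alerts, but you can also override the global value for individual alerts if one or more of them experience transient issues in practice.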
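
The SOFT/HARD distinction can be illustrated with a small simulation. This is a simplified model written for this guide, not Ambari's actual state machine: the function name classify_checks, the initial_state parameter, and the rule that a new state must be observed check_count times in a row before it becomes HARD are assumptions.

```python
def classify_checks(observed_states, check_count, initial_state="OK"):
    """Label each check as None (no change), "SOFT", or "HARD".

    Simplified model: a state that differs from the last confirmed (HARD)
    state is SOFT until it has been seen check_count times in a row; it then
    becomes HARD, which is the point where a notification would be sent.
    """
    hard_state = initial_state
    pending_state = None
    repeats = 0
    results = []
    for state in observed_states:
        if state == hard_state:
            # Back to the confirmed state: drop any pending change.
            pending_state, repeats = None, 0
            results.append(None)
            continue
        if state == pending_state:
            repeats += 1
        else:
            pending_state, repeats = state, 1
        if repeats >= check_count:
            hard_state = state
            pending_state, repeats = None, 0
            results.append("HARD")
        else:
            results.append("SOFT")
    return results
```

With a check count of 3, the sequence OK, CRITICAL, OK, CRITICAL, CRITICAL, CRITICAL produces a single HARD change on the last check; the earlier one-off CRITICAL is recorded only as SOFT. With a check count of 1, every change is HARD immediately.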

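To modify the global alert check count:

① Browse to the Alerts section in Ambari Web

② In the Actions menu, click Manage Alert Settings

③ Update the Check Count value

④ Click Save

Changes to the global alert check count may take a few seconds to appear in the Ambari UI for individual alerts

To override the global alert check count for an individual alert:

① Browse to the Alerts section in Ambari Web

② Select the alert for which you want to set a specific Check Count

③ On the right, click the Edit icon next to the Check Count property

④ Update the Check Count value

⑤ Click Save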

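8.4 Disabling and Re-enabling Alerts

You can disable alerts. When an alert is disabled, no alert instances are in effect and Ambari no longer performs checks for that alert. Therefore, no alert status changes are recorded and no notifications are sent.

① Browse to the Alerts section in Ambari Web

② Find the alert definition and click the Enabled or Disabled text next to it to enable or disable the alert

③ Alternatively, click the alert to view the definition details, and then click Enabled or Disabled to enable or disable the alert

④ Confirm the change when prompted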

8.5 Tables of Predefined Alerts

8.5.1 HDFS Service Alerts

Alert Name: NameNode Blocks Health

Alert Type: METRIC

Description: This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold.

Potential Causes: Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes.

The corrupt or missing blocks are from files with a replication factor of 1. New replicas cannot be created because the only replica of the block is missing.

Possible Remedies: For critical data, use a replication factor of 3.

Bring up the failed DataNodes with missing or corrupt blocks.

Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command.

Delete the corrupt files and recover them from backup, if one exists.

Alert Name: NFS Gateway Process

Alert Type: PORT

Description: This host-level alert is triggered if the NFS Gateway process cannot be confirmed as active.

Potential Causes: NFS Gateway is down.

Possible Remedies: Check for a non-operating NFS Gateway in Ambari Web.

Alert Name: DataNode Storage

Alert Type: METRIC

Description: This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties.

Potential Causes: Cluster storage is full.

If cluster storage is not full, DataNode is full.

Possible Remedies: If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.

If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.

Alert Name: DataNode Process

Alert Type: PORT

Description: This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network for the configured critical threshold, in seconds.

Potential Causes: DataNode process is down or not responding.

DataNode is not down but is not listening to the correct network port/address.

Possible Remedies: Check for non-operating DataNodes in Ambari Web.

Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary.

Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

Alert Name: DataNode Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the DataNode web UI is unreachable.

Potential Causes: The DataNode process is not running.

Possible Remedies: Check whether the DataNode process is running.

Alert Name: NameNode Host CPU Utilization

Alert Type: METRIC

Description: This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning, 250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is available only if you are running JDK 1.7.

Potential Causes: Unusually high CPU utilization might be caused by a very unusual job or query workload, but this is generally the sign of an issue in the daemon.

Possible Remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

Alert Name: NameNode Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the NameNode web UI is unreachable.

Potential Causes: The NameNode process is not running.

Possible Remedies: Check whether the NameNode process is running.

Alert Name: Percent DataNodes with Available Space

Alert Type: AGGREGATE

Description: This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn, 30% critical).

Potential Causes: Cluster storage is full.

If cluster storage is not full, DataNode is full.

Possible Remedies: If the cluster still has storage, use the load balancer to distribute the data to relatively less-used DataNodes.

If the cluster is full, delete unnecessary data or increase storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the load balancer.

Alert Name: Percent DataNodes Available

Alert Type: AGGREGATE

Description: This alert is triggered if the number of non-operating DataNodes in the cluster is greater than the configured critical threshold. This aggregates the DataNode process alert.

Potential Causes: DataNodes are down.

DataNodes are not down but are not listening to the correct network port/address.

Possible Remedies: Check for non-operating DataNodes in Ambari Web.

Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.

Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

Alert Name: NameNode RPC Latency

Alert Type: METRIC

Description: This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations.

Potential Causes: A job or an application is performing too many NameNode operations.

Possible Remedies: Review the job or the application for potential bugs causing it to perform too many NameNode operations.

Alert Name: NameNode Last Checkpoint

Alert Type: SCRIPT

Description: This alert will trigger if the last time that the NameNode performed a checkpoint was too long ago or if the number of uncommitted transactions is beyond a certain threshold.

Potential Causes: Too much time elapsed since last NameNode checkpoint.

Uncommitted transactions beyond threshold.

Possible Remedies: Set NameNode checkpoint.

Review threshold for uncommitted transactions.

Alert Name: Secondary NameNode Process

Alert Type: WEB

Description: This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable when NameNode HA is configured.

Potential Causes: The Secondary NameNode is not running.

Possible Remedies: Check that the Secondary NameNode process is running.

Alert Name: NameNode Directory Status

Alert Type: METRIC

Description: This alert checks if the NameNode NameDirStatus metric reports a failed directory.

Potential Causes: One or more of the directories are reporting as not healthy.

Possible Remedies: Check the NameNode UI for information about unhealthy directories.

Alert Name: HDFS Capacity Utilization

Alert Type: METRIC

Description: This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties.

Potential Causes: Cluster storage is full.

Possible Remedies: Delete unnecessary data.

Archive unused data.

Add more DataNodes.

Add more or larger disks to the DataNodes.

After adding more storage, run the load balancer.
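
As a worked example of a METRIC-style calculation, the sketch below derives a utilization percentage from the two JMX properties named above and compares it with the 80% and 90% thresholds. The formula used / (used + remaining) and the at-or-above boundary handling are assumptions for illustration; the actual alert definition may compute utilization differently. The function name is hypothetical.

```python
def hdfs_capacity_state(capacity_used, capacity_remaining,
                        warning_pct=80.0, critical_pct=90.0):
    """Return (state, percent_used) from CapacityUsed and CapacityRemaining.

    Assumption: utilization = used / (used + remaining), in percent.
    """
    total = capacity_used + capacity_remaining
    if total <= 0:
        return ("UNKNOWN", 0.0)  # no usable capacity figures
    percent = 100.0 * capacity_used / total
    if percent >= critical_pct:
        return ("CRITICAL", percent)
    if percent >= warning_pct:
        return ("WARNING", percent)
    return ("OK", percent)
```

For instance, 85 TB used with 15 TB remaining is 85% and yields WARNING, while 95 TB used with 5 TB remaining yields CRITICAL.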

Alert Name: DataNode Health Summary

Alert Type: METRIC

Description: This service-level alert is triggered if there are unhealthy DataNodes.

Potential Causes: A DataNode is in an unhealthy state.

Possible Remedies: Check the NameNode UI for the list of non-operating DataNodes.

Alert Name: HDFS Pending Deletion Blocks

Alert Type: METRIC

Description: This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.

Potential Causes: Large number of blocks are pending deletion.

Possible Remedies:

Alert Name: HDFS Upgrade Finalized State

Alert Type: SCRIPT

Description: This service-level alert is triggered if HDFS is not in the finalized state.

Potential Causes: The HDFS upgrade is not finalized.

Possible Remedies: Finalize any upgrade you have in process.

Alert Name: DataNode Unmounted Data Dir

Alert Type: SCRIPT

Description: This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted.

Potential Causes: If the mount history file does not exist, then report an error if a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the root partition, which is undesirable.

Possible Remedies: Check the data directories to confirm they are mounted as expected.

Alert Name: DataNode Heap Usage

Alert Type: METRIC

Description: This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are percentages.

Potential Causes:

Alert Name: NameNode Client RPC Queue Latency

Alert Type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC queue latency on client port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential Causes:

Possible Remedies:

Alert Name: NameNode Client RPC Processing Latency

Alert Type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential Causes:

Possible Remedies:

Alert Name: NameNode Service RPC Queue Latency

Alert Type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC queue latency on the DataNode port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential Causes:

Possible Remedies:

Alert Name: NameNode Service RPC Processing Latency

Alert Type: SCRIPT

Description: This service-level alert is triggered if the deviation of RPC latency on the DataNode port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

Potential Causes:

Possible Remedies:

Alert Name: HDFS Storage Capacity Usage

Alert Type: SCRIPT

Description: This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a given period. This alert will monitor Daily and Weekly periods.

Potential Causes:

Possible Remedies:

Alert Name: NameNode Heap Usage

Alert Type: SCRIPT

Description: This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a given period. This alert will monitor Daily and Weekly periods.

Potential Causes:

Possible Remedies:

8.5.2 HDFS HA Alerts

Alert Name: JournalNode Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The JournalNode process is down or not responding.

The JournalNode is not down but is not listening to the correct network port/address.

Possible Remedies:

Alert Name: NameNode High Availability Health

Alert Type: SCRIPT

Description: This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.

Potential Causes: The Active, Standby or both NameNode processes are down.

Possible Remedies: On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web.

On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.

Alert Name: Percent JournalNodes Available

Alert Type: AGGREGATE

Description: This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.

Potential Causes: JournalNodes are down.

JournalNodes are not down but are not listening to the correct network port/address.

Possible Remedies: Check for dead JournalNodes in Ambari Web.

Alert Name: ZooKeeper Failover Controller Process

Alert Type: PORT

Description: This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.

Potential Causes: The ZKFC process is down or not responding.

Possible Remedies: Check if the ZKFC process is running.

8.5.3 NameNode HA Alerts

Alert Name: JournalNode Process

Alert Type: WEB

Description: This host-level alert is triggered if the individual JournalNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The JournalNode process is down or not responding.

The JournalNode is not down but is not listening to the correct network port/address.

Possible Remedies: Check if the JournalNode process is running.

Alert Name: NameNode High Availability Health

Alert Type: SCRIPT

Description: This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.

Potential Causes: The Active, Standby or both NameNode processes are down.

Possible Remedies: On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web.

On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.

Alert Name: Percent JournalNodes Available

Alert Type: AGGREGATE

Description: This service-level alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold (33% warn, 50% crit). It aggregates the results of JournalNode process checks.

Potential Causes: JournalNodes are down.

JournalNodes are not down but are not listening to the correct network port/address.

Possible Remedies: Check for non-operating JournalNodes in Ambari Web.

Alert Name: ZooKeeper Failover Controller Process

Alert Type: PORT

Description: This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.

Potential Causes: The ZKFC process is down or not responding.

Possible Remedies: Check if the ZKFC process is running.

8.5.4 YARN Alerts

Alert Name: App Timeline Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the App Timeline Server Web UI is unreachable.

Potential Causes: The App Timeline Server is down.

App Timeline Service is not down but is not listening to the correct network port/address.

Possible Remedies: Check for non-operating App Timeline Server in Ambari Web.

Alert Name: Percent NodeManagers Available

Alert Type: AGGREGATE

Description: This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process alert checks.

Potential Causes: NodeManagers are down.

NodeManagers are not down but are not listening to the correct network port/address.

Possible Remedies: Check for non-operating NodeManagers.

Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary.

Run the netstat -tuplpn command to check if the NodeManager process is bound to the correct network port.

Alert Name: ResourceManager Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the ResourceManager Web UI is unreachable.

Potential Causes: The ResourceManager process is not running.

Possible Remedies: Check if the ResourceManager process is running.

Alert Name: ResourceManager RPC Latency

Alert Type: METRIC

Description: This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations.

Potential Causes: A job or an application is performing too many ResourceManager operations.

Possible Remedies: Review the job or the application for potential bugs causing it to perform too many ResourceManager operations.

Alert Name: ResourceManager CPU Utilization

Alert Type: METRIC

Description: This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain thresholds (200% warning, 250% critical). It checks the ResourceManager JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.

Potential Causes: Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Possible Remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

Alert Name: NodeManager Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the NodeManager process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: NodeManager process is down or not responding.

NodeManager is not down but is not listening to the correct network port/address.

Possible Remedies: Check if the NodeManager is running.

Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager, if necessary.

Alert Name: NodeManager Health Summary

Alert Type: SCRIPT

Description: This host-level alert checks the node health property available from the NodeManager component.

Potential Causes: NodeManager Health Check script reports issues or is not configured.

Possible Remedies: Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary.

Check in the ResourceManager UI logs (/var/log/hadoop/yarn) for health check errors.

Alert Name: NodeManager Health

Alert Type: SCRIPT

Description: This host-level alert checks the nodeHealthy property available from the NodeManager component.

Potential Causes: The NodeManager process is down or not responding.

Possible Remedies: Check in the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary.

8.5.5 MapReduce2 Alerts

Alert Name: History Server Web UI

Alert Type: WEB

Description: This host-level alert is triggered if the HistoryServer Web UI is unreachable.

Potential Causes: The HistoryServer process is not running.

Possible Remedies: Check if the HistoryServer process is running.

Alert Name: History Server RPC latency

Alert Type: METRIC

Description: This host-level alert is triggered if the HistoryServer operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for HistoryServer operations.

Potential Causes: A job or an application is performing too many HistoryServer operations.

Possible Remedies: Review the job or the application for potential bugs causing it to perform too many HistoryServer operations.

Alert Name: History Server CPU Utilization

Alert Type: METRIC

Description: This host-level alert is triggered if the percent of CPU utilization on the HistoryServer exceeds the configured critical threshold.

Potential Causes: Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Possible Remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

Alert Name: History Server Process

Alert Type: PORT

Description: This host-level alert is triggered if the HistoryServer process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: HistoryServer process is down or not responding.

HistoryServer is not down but is not listening to the correct network port/address.

Possible Remedies: Check that the HistoryServer is running.

Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart the HistoryServer, if necessary.

8.5.6 HBase Service Alerts

Alert Name: Percent RegionServers Available

Alert Type:

Description: This service-level alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert and 30% to produce a CRITICAL alert. It aggregates the results of RegionServer process down checks.

Potential Causes: Misconfiguration or less-than-ideal configuration caused the RegionServers to crash.

Cascading failures brought on by some workload caused the RegionServers to crash.

The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS.

GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper.

Possible Remedies: Check the dependent services to make sure they are operating correctly.

Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information.

If the failure was associated with a particular workload, try to understand the workload better.

Restart the RegionServers.

Alert Name: HBase Master Process

Alert Type:

Description: This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The HBase master process is down.

The HBase master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.

Possible Remedies: Check the dependent services.

Look at the master log files (usually /var/log/hbase/*.log) for further information.

Look at the configuration files (/etc/hbase/conf).

Restart the master.

Alert Name: HBase Master CPU Utilization

Description: This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200% warning, 250% critical). It checks the HBase Master JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.

Potential Causes: Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Possible Remedies: Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

Alert Name: RegionServers Health Summary

Description: This service-level alert is triggered if there are unhealthy RegionServers.

Potential Causes: The RegionServer process is down on the host.

The RegionServer process is up and running but not listening on the correct network port (default 60030).

Possible Remedies: Check for dead RegionServers in Ambari Web.

Alert Name: HBase RegionServer Process

Description: This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The RegionServer process is down on the host.

The RegionServer process is up and running but not listening on the correct network port (default 60030).

Possible Remedies: Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web.

Run the netstat -tuplpn command to check if the RegionServer process is bound to the correct network port.

8.5.7 Hive Alerts

Alert Name: HiveServer2 Process

Alert Type:

Description: This host-level alert is triggered if the HiveServer cannot be determined to be up and responding to client requests.

Potential Causes: HiveServer2 process is not running.

HiveServer2 process is not responding.

Possible Remedies: Using Ambari Web, check status of HiveServer2 component. Stop and then restart.

Alert Name: HiveMetastore Process

Description: This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The Hive Metastore service is down.

The database used by the Hive Metastore is down.

The Hive Metastore host is not reachable over the network.

Possible Remedies: Using Ambari Web, stop the Hive service and then restart it.

Alert Name: WebHCat Server Status

Alert Type:

Description: This host-level alert is triggered if the WebHCat server cannot be determined to be up and responding to client requests.

Potential Causes: The WebHCat server is down.

The WebHCat server is hung and not responding.

The WebHCat server is not reachable over the network.

Possible Remedies: Restart the WebHCat server using Ambari Web.

8.5.8 Oozie Alerts

Alert Name: Oozie Server Web UI

Description: This host-level alert is triggered if the Oozie server Web UI is unreachable.

Potential Causes: The Oozie server is down.

Oozie Server is not down but is not listening to the correct network port/address.

Possible Remedies: Check for dead Oozie Server in Ambari Web.

Alert Name: Oozie Server Status

Description: This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests.

Potential Causes: The Oozie server is down.

The Oozie server is hung and not responding.

The Oozie server is not reachable over the network.

Possible Remedies: Restart the Oozie service using Ambari Web.

8.5.9 ZooKeeper Alerts

Alert Name: Percent ZooKeeper Servers Available

Alert Type: AGGREGATE

Description: This service-level alert is triggered if the configured percentage of ZooKeeper processes cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds. It aggregates the results of ZooKeeper process checks.

Potential Causes: The majority of your ZooKeeper servers are down and not responding.

Possible Remedies: Check the dependent services to make sure they are operating correctly.

Check the ZooKeeper logs (/var/log/hadoop/zookeeper.log) for further information.

If the failure was associated with a particular workload, try to understand the workload better.

Restart the ZooKeeper servers from the Ambari UI.

Alert Name: ZooKeeper Server Process

Alert Type: PORT

Description: This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The ZooKeeper server process is down on the host.

The ZooKeeper server process is up and running but not listening on the correct network port (default 2181).

Possible Remedies: Check for any errors in the ZooKeeper logs (/var/log/hbase/) and restart the ZooKeeper process using Ambari Web.

Run the netstat -tuplpn command to check if the ZooKeeper server process is bound to the correct network port.

8.5.10 Ambari Alerts

Alert Name: Host Disk Usage

Alert Type: SCRIPT

Description: This host-level alert is triggered if the amount of disk space used on a host goes above specific thresholds (50% warn, 80% crit).

Potential Causes: The amount of free disk space left is low.

Possible Remedies: Check host for disk space to free or add more storage.
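
A disk usage check of this kind can be sketched with Python's standard shutil.disk_usage call. The thresholds come from the description above; the function names, the at-or-above boundary handling, and the choice of which path to inspect are assumptions for illustration and do not reflect how Ambari's own script is written.

```python
import shutil


def disk_usage_state(used_bytes, total_bytes,
                     warning_pct=50.0, critical_pct=80.0):
    """Return (state, percent_used) for the given byte counts."""
    if total_bytes <= 0:
        return ("UNKNOWN", 0.0)
    percent = 100.0 * used_bytes / total_bytes
    if percent >= critical_pct:
        return ("CRITICAL", percent)
    if percent >= warning_pct:
        return ("WARNING", percent)
    return ("OK", percent)


def check_path(path="."):
    """Check the filesystem that contains path (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return disk_usage_state(usage.used, usage.total)
```

Calling check_path("/") on a host reports the state of the root filesystem; a host with several data disks would need one call per mount point.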

Alert Name: Ambari Agent Heartbeat

Alert Type: SERVER

Description: This alert is triggered if the server has lost contact with an agent.

Potential Causes: Ambari Server host is unreachable from Agent host.

Ambari Agent is not running.

Possible Remedies: Check connection from Agent host to Ambari Server.

Check that the Agent is running.

Alert Name: Ambari Server Alerts

Alert Type: SERVER

Description: This alert is triggered if the server detects that there are alerts which have not run in a timely manner.

Potential Causes: Agents are not reporting alert status.

Agents are not running.

Possible Remedies: Check that all Agents are running and heartbeating.

8.5.11 Ambari Metrics Alerts

Alert Name: Metrics Collector Process

Description: This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for a number of seconds equal to the threshold.

Potential Causes: The Metrics Collector process is not running.

Possible Remedies: Check that the Metrics Collector is running.

Alert Name: Metrics Collector - ZooKeeper Server Process

Alert Type:

Description: This host-level alert is triggered if the Metrics Collector ZooKeeper Server Process cannot be determined to be up and listening on the network.

Potential Causes: The Metrics Collector process is not running.

Possible Remedies: Check that the Metrics Collector is running.

Alert Name: Metrics Collector - HBase Master Process

Alert Type:

Description: This alert is triggered if the Metrics Collector HBase Master Processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes: The Metrics Collector process is not running.

Possible Remedies: Check that the Metrics Collector is running.

Alert Name: Metrics Collector - HBase Master CPU Utilization

Alert Type:

Description: This host-level alert is triggered if CPU utilization of the Metrics Collector exceeds certain thresholds.

Potential Causes: Unusually high CPU utilization is generally the sign of an issue in the daemon configuration.

Possible Remedies: Tune the Ambari Metrics Collector.

Alert Name: Metrics Monitor Status

Alert Type:

Description: This host-level alert is triggered if the Metrics Monitor process cannot be confirmed to be up and running on the network.

Potential Causes: The Metrics Monitor is down.

Possible Remedies: Check whether the Metrics Monitor is running on the given host.

Alert Name: Percent Metrics Monitors Available

Description: This is an AGGREGATE alert of the Metrics Monitor Status.

Potential Causes: Metrics Monitors are down.

Possible Remedies: Check that the Metrics Monitors are running.

警报名称: Metrics Collector -Auto-Restart Status

描述: This alert is triggered if the Metrics Collector has been auto-started for number of times equal to start threshold in

a 1 hour timeframe. By default if restarted 2 times in an hour, you will receive a Warning alert. If restarted 4 or more

times in an hour, you will receive a Critical alert.

潜在原因: The Metrics Collector is running but is unstable and causing restarts. This could be due to improper tuning.

解决方法: Tune the Ambari Metrics Collector.

Alert name: Grafana Web UI

Description: This host-level alert is triggered if the AMS Grafana Web UI is unreachable.

Potential causes: The Grafana process is not running.

Resolution: Check whether the Grafana process is running. Restart it if it has gone down.

8.5.12 SmartSense Alerts

Alert name: SmartSense Server Process

Description: This alert is triggered if the HST server process cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential causes: The HST server is not running.

Resolution: Start the HST server process. If startup fails, check hst-server.log.

Alert name: SmartSense Bundle Capture Failure

Description: This alert is triggered if the last triggered SmartSense bundle failed or timed out.

Potential causes: Some nodes timed out or failed during data capture. The upload to Hortonworks may also have failed.

Resolution: On the "Bundles" page, check the status of the bundle. Next, check which agents have failed or timed out, and review their logs.

You can also initiate a new capture.

Alert name: SmartSense Long Running Bundle

Description: This alert is triggered if the in-progress SmartSense bundle may not complete successfully on time.

Potential causes: Service components that are being collected may not be running, or some agents may be timing out during data collection or upload.

Resolution: Restart the services that are not running. Force-complete the bundle and start a new capture.

Alert name: SmartSense Gateway Status

Description: This alert is triggered if the SmartSense Gateway server process is enabled but cannot be reached.

Potential causes: The SmartSense Gateway is not running.

Resolution: Start the gateway. If the gateway fails to start, review hst-gateway.log.

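8.6 Managing Notifications

Alert groups and notifications let you group alerts and set a notification target for each group, so that different sets of alerts reach different cluster stakeholders by different methods. For example, you might want the Hadoop Operations team to receive all alerts by email, regardless of status, while the System Administration team receives only RPC- and CPU-related alerts in Critical state, and only by Simple Network Management Protocol (SNMP).

To achieve this, create one alert notification that manages email delivery for all alert groups at all severity levels, and a second notification that manages SNMP delivery at Critical severity for an alert group containing only the RPC and CPU alerts.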

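8.7 Creating and Editing Notifications

① In Ambari Web, click Alerts.

② On the Alerts page, click the Actions menu, then click Manage Notifications.

③ In Manage Alert Notifications, click + to create a new alert notification.

In Create Alert Notification:

In the Name field, enter a name for the notification.
In the Groups field, click All or Custom to assign the notification to all groups or to a specific set of groups.
In the Description field, enter a short phrase that describes the notification.
In the Method field, click EMAIL, SNMP (for MIB-based), or Custom SNMP as the method by which the Ambari server sends the notification.

④ Complete the fields for the notification method you selected.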

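For email notifications, provide information about your SMTP infrastructure, such as the SMTP server, port, from address, and whether the server requires authentication.

You can add custom properties to the SMTP configuration, based on the JavaMail SMTP options.

Email To: A comma-separated list of one or more email addresses to which alert emails are sent.

SMTP Server: The FQDN or IP address of the SMTP server used to send alert emails.

SMTP Port: The SMTP port on the SMTP server.

Email From: A single email address to be used as the sender of alert emails.

Use Authentication: Whether the SMTP server requires authentication before sending messages. If it does, also provide the username and password credentials.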

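For MIB-based SNMP notifications, provide the version, community, host, and port to which the SNMP trap should be sent.

Version: SNMPv1 or SNMPv2c, depending on the network environment.

Hosts: A comma-separated list of one or more host FQDNs to which to send the trap.

Port: The port on which a process is listening for SNMP traps.

For SNMP notifications, Ambari uses a MIB, a text file manifest of alert definitions, to transfer alert information. A MIB summarizes how object IDs map to objects or attributes.

You can find the MIB file for your cluster on the Ambari server host:

/var/lib/ambari-server/resources/APACHE-AMBARI-MIB.txt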

For custom SNMP notifications, provide the version, community, host, and port to which the SNMP trap should be sent.

The OID parameter must also be configured properly. If you do not use a custom OID, use the enterprise-specific OID.

Version: SNMPv1 or SNMPv2c, depending on the network environment.

OID: 1.3.6.1.4.1.18060.16.1.1

Hosts: A comma-separated list of one or more host FQDNs to which to send the trap.

Port: The port on which a process is listening for SNMP traps.

⑤ Click Save.

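8.8 Creating or Editing Alert Groups

① In Ambari Web, click Alerts.

② On the Alerts page, click the Actions menu, then click Manage Alert Groups.

③ In Manage Alert Groups, click + to create a new alert group.

④ In Create Alert Group, enter a group name and click Save.

⑤ Click the custom group in the list to add or remove alert definitions and to change the notification targets for that group.

⑥ When you have finished the assignments, click Save.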

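8.9 Dispatching Notifications

When an alert is enabled and its status changes (for example, from OK to CRITICAL, or from CRITICAL to OK), Ambari sends either an email or an SNMP notification, depending on how notifications are configured.

For email notifications, Ambari sends a single email digest that includes all alert status changes. For example, if two alerts become critical, Ambari sends one email message:

Alert A is CRITICAL and Alert B is CRITICAL

Ambari sends no further email until a status changes again.

For SNMP notifications, Ambari sends one SNMP trap per alert status change. For example, if two alerts become critical, Ambari sends two SNMP traps, one per alert, and sends traps again when the status of either alert changes.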
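The difference between the two dispatch styles can be illustrated with a small model. This is a sketch of the behavior described above, not Ambari's dispatcher code; the StateChange class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    alert_name: str
    new_state: str  # OK, WARNING, CRITICAL, or UNKNOWN


def email_messages(changes):
    """Email: one digest that covers every state change in the batch."""
    if not changes:
        return []
    body = " and ".join(f"{c.alert_name} is {c.new_state}" for c in changes)
    return [body]


def snmp_traps(changes):
    """SNMP: one trap per state change."""
    return [f"{c.alert_name} is {c.new_state}" for c in changes]


changes = [StateChange("Alert A", "CRITICAL"), StateChange("Alert B", "CRITICAL")]
print(email_messages(changes))  # a single message
print(snmp_traps(changes))      # two traps, one per alert
```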

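8.10 Viewing the Alert Status Log

Whether or not Ambari is configured to send alert notifications, it writes alert status changes to a log on the Ambari server host. To view the log:

① On the Ambari server host, browse to the log directory:

cd /var/log/ambari-server/

② View the ambari-alerts.log file.

③ Each log entry includes the time of the status change, the alert status, the alert definition name, and the response text.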
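To pull only the entries for one state out of the log, a short filter is enough. The sketch below assumes that each entry carries its state as a bracketed token such as [CRITICAL]; that layout is an assumption, so compare it with a few lines of your own ambari-alerts.log before relying on it. The function names are hypothetical.

```python
from pathlib import Path

# Log location taken from the steps above.
ALERT_LOG = Path("/var/log/ambari-server/ambari-alerts.log")


def filter_by_state(lines, state):
    """Keep entries whose text contains the bracketed state token, e.g. [CRITICAL]."""
    token = f"[{state}]"  # assumed entry layout; adjust if your log differs
    return [line.rstrip("\n") for line in lines if token in line]


def critical_entries(path=ALERT_LOG):
    """Read the alert log and return only the CRITICAL entries."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return filter_by_state(handle, "CRITICAL")
```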

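8.10.1 Customizing Notification Templates

The content of a notification generated by Ambari depends on the notification type. Email and SNMP notifications both have customizable templates that are used to generate content. This section describes the steps required to change the templates that Ambari uses when creating alert notifications.

Alert Templates XML Location

By default, Ambari ships with an alert-templates.xml file. It contains all of the templates for every known type of notification (for example, EMAIL and SNMP). The file is bundled in the Ambari server .jar file, so the template is not exposed on disk. The same file is, however, the basis for the reference examples that follow.

When you customize the alert templates, you effectively override the default alert template XML, as follows:

① On the Ambari server host, browse to the /etc/ambari-server/conf directory.

② Edit the ambari.properties file.

③ Add an entry for the location of the new template:

alerts.template.file=/foo/var/alert-templates-custom.xml

④ Save the file and restart Ambari Server.

After you restart Ambari Server, any notification types defined in the new template override those bundled with Ambari. If you provide your own template file, you only need to define the notification types that you want to override. If a notification template type is not found in the custom template, Ambari falls back to the default template bundled in the JAR file.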
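The override-with-fallback rule can be modelled in a few lines. This is an illustrative sketch of the rule as described, not Ambari's implementation; the function names are hypothetical and the XML strings in the usage example are placeholders.

```python
import xml.etree.ElementTree as ET


def templates_by_type(xml_text):
    """Index the <alert-template> elements of a template file by their type attribute."""
    root = ET.fromstring(xml_text)
    return {t.get("type"): t for t in root.findall("alert-template")}


def effective_templates(default_xml, custom_xml):
    """Custom templates win per type; types missing from the custom file keep the default."""
    merged = templates_by_type(default_xml)
    merged.update(templates_by_type(custom_xml))
    return merged


default_xml = (
    "<alert-templates>"
    "<alert-template type='EMAIL'><subject>default email</subject><body>b</body></alert-template>"
    "<alert-template type='SNMP'><subject>default snmp</subject><body>b</body></alert-template>"
    "</alert-templates>"
)
custom_xml = (
    "<alert-templates>"
    "<alert-template type='EMAIL'><subject>custom email</subject><body>b</body></alert-template>"
    "</alert-templates>"
)
merged = effective_templates(default_xml, custom_xml)
print({kind: tpl.findtext("subject") for kind, tpl in merged.items()})
```

Here only EMAIL is overridden, so the SNMP entry still comes from the default file.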

Alert Templates XML Structure

The structure of the template file is as follows. Each <alert-template> element declares the type of alert notification it applies to:

<alert-templates>
  <alert-template type="EMAIL">
    <subject>
      Subject Content
    </subject>
    <body>
      Body Content
    </body>
  </alert-template>
  <alert-template type="SNMP">
    <subject>
      Subject Content
    </subject>
    <body>
      Body Content
    </body>
  </alert-template>
</alert-templates>

Template Variables

The templates use Apache Velocity to render all tokenized content. The following variables are available in a template:

$alert.getAlertDefinition() The definition of which the alert is an instance.

$alert.getAlertText() The specific alert text.

$alert.getAlertName() The name of the alert.

$alert.getAlertState() The alert state (OK, WARNING, CRITICAL, or

UNKNOWN)

$alert.getServiceName() The name of the service that the alert is defined for.

$alert.hasComponentName() True if the alert is for a specific service component.

$alert.getComponentName() The component, if any, that the alert is defined for.

$alert.hasHostName() True if the alert was triggered for a specific host.

$alert.getHostName() The hostname, if any, that the alert was triggered for.

$ambari.getServerUrl() The Ambari Server URL.

$ambari.getServerVersion() The Ambari Server version.

$ambari.getServerHostName() The Ambari Server hostname.

$dispatch.getTargetName() The notification target name.

$dispatch.getTargetDescription() The notification target description.

$summary.getAlerts(service,alertState) A list of all alerts for a given service or alert state (OK|WARNING|CRITICAL|UNKNOWN)

$summary.getServicesByAlertState(alertState) A list of all services for a given alert state (OK|WARNING|CRITICAL|UNKNOWN)

$summary.getServices() A list of all services that are reporting an alert in the

notification.

$summary.getCriticalCount() The CRITICAL alert count.

$summary.getOkCount() The OK alert count.

$summary.getTotalCount() The total alert count.

$summary.getUnknownCount() The UNKNOWN alert count.

$summary.getWarningCount() The WARNING alert count.

$summary.getAlerts() A list of all of the alerts in the notification.

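Example: Modify Alert EMAIL Subject

The following example shows how to change the subject line of all outbound email notifications to include a hard-coded identifier:

① Download the alert-templates.xml code as a starting point.

② On the Ambari Server, save the template to a location such as /var/lib/ambari-server/resources/alert-templates-custom.xml

③ Edit the alert-templates-custom.xml file and modify the subject line of the <alert-template type="EMAIL"> template:

<subject>
<![CDATA[Petstore Ambari has $summary.getTotalCount() alerts!]]>
</subject>

④ Save the file.

⑤ Browse to the /etc/ambari-server/conf directory.

⑥ Edit the ambari.properties file.

⑦ Add an entry for the location of the new template file:

alerts.template.file=/var/lib/ambari-server/resources/alert-templates-custom.xml

⑧ Save the file and restart Ambari Server.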
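Steps ⑥ and ⑦ can be scripted so that repeated runs do not leave duplicate entries. The helper below is a hypothetical sketch that works on the file content as a string; it does not handle line continuations or other advanced Java properties syntax, and Ambari Server still has to be restarted afterwards.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key set to value exactly once.

    Commented-out lines are left alone; an existing active entry is replaced,
    and extra active duplicates of the same key are dropped.
    """
    out, replaced = [], False
    for line in text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        if not is_comment and stripped.split("=", 1)[0].strip() == key:
            if not replaced:
                out.append(f"{key}={value}")
                replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Example with in-memory content; a real run would read and write
# /etc/ambari-server/conf/ambari.properties.
print(set_property(
    "a=1\n",
    "alerts.template.file",
    "/var/lib/ambari-server/resources/alert-templates-custom.xml",
))
```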

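9. Using Ambari Core Services

The Ambari core services let you monitor, analyze, and search the operating status of the hosts in your cluster.

9.1 Understanding Ambari Metrics

Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in Ambari-managed clusters.

9.1.1 AMS Architecture

AMS has four components: Metrics Monitors, Hadoop Sinks, Metrics Collector, and Grafana.

• Metrics Monitors: run on each host in the cluster, collect system-level metrics, and publish them to the Metrics Collector.

• Hadoop Sinks: plug in to Hadoop components to publish Hadoop metrics to the Metrics Collector.

• Metrics Collector: a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and Sinks.

• Grafana: a daemon that runs on a specific host in the cluster and provides prebuilt dashboards for visualizing the metrics collected in the Metrics Collector.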

9.1.2 Using Grafana

Ambari Metrics System includes Grafana, which provides prebuilt dashboards for advanced visualization of cluster metrics.

9.1.2.1 Accessing Grafana

① In Ambari Web, browse to Services > Ambari Metrics > Summary.

② Select Quick Links, then choose Grafana.

A read-only version of the Grafana interface opens in a new browser tab.

9.1.2.2 Viewing Grafana Dashboards

On the Grafana home page, Dashboards provides a list of links to AMS, Ambari server, Druid, and HBase metrics.

To view a specific metric included in the list:

① In Grafana, browse to Dashboards.

② Click a dashboard name.

③ To see more dashboards, click the Home list.

④ Scroll through the list.

For example, System - Servers.

9.1.2.3 Viewing Selected Metrics on Grafana Dashboards

On a dashboard, expand one or more rows to view detailed metrics. For example:

In the System - Servers dashboard, click a row name such as System Load Average - 1 Minute.

The row expands to display a chart that shows the metric information.

9.1.2.4 Viewing Metrics for Selected Hosts

By default, Grafana shows metrics for all hosts in the cluster. You can limit the display to one or more hosts by selecting them from the Hosts menu:

① Expand Hosts.

② Select one or more host names.

9.1.3 Grafana Dashboards Reference

The Grafana instance included with Ambari Metrics System ships with prebuilt dashboards for advanced visualization of cluster metrics:

AMS HBase Dashboards
Ambari Dashboards
HDFS Dashboards
YARN Dashboards
Hive Dashboards
Hive LLAP Dashboards
HBase Dashboards
Kafka Dashboards
Storm Dashboards
System Dashboards
NiFi Dashboard

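9.1.3.1 AMS HBase Dashboards

AMS HBase refers to the HBase instance managed independently by the Ambari Metrics Service. It has no connection to the cluster HBase service. The AMS HBase dashboards track the same metrics as the regular HBase dashboards, but for the AMS-owned instance.

The following Grafana dashboards are available for AMS HBase:

AMS HBase - Home
AMS HBase - RegionServers
AMS HBase - Misc

9.1.3.1.1 AMS HBase - Home

The AMS HBase - Home dashboards display basic statistics about the HBase cluster and give insight into its overall status.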

REGIONSERVERS / REGIONS

-------------------------------------------------------------------------------------------------------------------------------------

Num RegionServers: Total number of RegionServers in the cluster.

Num Dead RegionServers: Total number of RegionServers that are dead in the cluster.

Num Regions: Total number of regions in the cluster.

Avg Num Regions per RegionServer: Average number of regions per RegionServer.

NUM REGIONS/STORES

Num Regions /Stores - Total: Total number of regions and stores (column families) in the cluster.

Store File Size /Count - Total : Total data file size and number of store files.

NUM REQUESTS

Num Requests - Total: Total number of requests (read, write and RPCs) in the cluster.

Num Request - Breakdown - Total: Total number of get,put,mutate,etc requests in the cluster.

REGIONSERVER MEMORY

RegionServer Memory - Average : Average used, max or committed on-heap and offheap memory for RegionServers.

RegionServer Offheap Memory - Average : Average used, free or committed on-heap and offheap memory for RegionServers.

MEMORY - MEMSTORE BLOCKCACHE

Memstore - BlockCache - Average : Average blockcache and memstore sizes for RegionServers.

Num Blocks in BlockCache - Total: Total number of (hfile) blocks in the blockcaches across all RegionServers.

BLOCKCACHE

BlockCache Hit/Miss/s Total : Total number of blockcache hits misses and evictions across all RegionServers.

BlockCache Hit Percent - Average: Average blockcache hit percentage across all RegionServers.

OPERATION LATENCIES - GET/MUTATE

Get Latencies - Average : Average min, median, max, 75th, 95th, 99th percentile latencies for Get operation across

all RegionServers.

Mutate Latencies - Average: Average min, median, max, 75th, 95th, 99th percentile latencies for Mutate operation across

all RegionServers.

OPERATION LATENCIES - DELETE/INCREMENT

Delete Latencies - Average : Average min, median, max, 75th, 95th, 99th percentile latencies for Delete operation across

all RegionServers.

Increment Latencies - Average: Average min, median, max, 75th, 95th, 99th percentile latencies for Increment operation across

all RegionServers.

OPERATION LATENCIES - APPEND/REPLAY

Append Latencies - Average: Average min, median, max, 75th, 95th, 99th percentile latencies for Append operation across

all RegionServers.

Replay Latencies - Average : Average min, median, max, 75th, 95th, 99th percentile latencies for Replay operation across

all RegionServers.

REGIONSERVER RPC

RegionServer RPC - Average: Average number of RPCs, active handler threads and open connections across all RegionServers.

RegionServer RPC Queues - Average: Average number of calls in different RPC scheduling queues and the size of all requests in the

RPC queue across all RegionServers.

REGIONSERVER RPC

RegionServer RPC Throughput - Average : Average sent and received bytes from the RPC across all RegionServers.

9.1.3.1.2 AMS HBase - RegionServers

The AMS HBase - RegionServers dashboards display metrics for the RegionServers in the monitored HBase cluster, including some performance-related data. These dashboards help you view basic I/O data and compare load among RegionServers.

9.1.3.1.3 AMS HBase - Misc

The AMS HBase - Misc dashboards display miscellaneous metrics related to the HBase cluster. You can use these metrics for tasks such as debugging authentication and authorization issues and exceptions raised by RegionServers.

9.1.3.2 Ambari Dashboards

The following dashboards are available for Ambari:

Ambari Server Database
Ambari Server JVM
Ambari Server Top N

9.1.3.2.1 Ambari Server Database

Shows the operational status of the Ambari server database.

TOTAL READ ALL QUERY

Total Read All Query Counter (Rate) : Total ReadAllQuery operations performed.

Total Read All Query Timer (Rate) : Total time spent on ReadAllQuery.

TOTAL CACHE HITS & MISSES

Total Cache Hits (Rate) : Total cache hits on Ambari Server with respect to EclipseLink cache.

Total Cache Misses (Rate) : Total cache misses on Ambari Server with respect to EclipseLink cache.

QUERY

Query Stages Timings: Average time spent on every query sub stage by Ambari Server

Query Types Avg. Timings : Average time spent on every query type by Ambari Server.

HOST ROLE COMMAND ENTITY

Counter.ReadAllQuery.HostRoleCommandEntity (Rate): Rate (num operations per second) in which ReadAllQuery operation on

HostRoleCommandEntity is performed.

Timer.ReadAllQuery.HostRoleCommandEntity (Rate): Rate in which ReadAllQuery operation on HostRoleCommandEntity is performed.

ReadAllQuery.HostRoleCommandEntity : Average time taken for a ReadAllQuery operation on HostRoleCommandEntity (Timer / Counter).

9.1.3.2.2 Ambari Server JVM

JVM - MEMORY PRESSURE

Heap Usage: Used, max or committed on-heap memory for Ambari Server.

Off-Heap Usage: Used, max or committed off-heap memory for Ambari Server.

JVM GC COUNT

GC Count Par new /s: Number of Java ParNew (YoungGen) Garbage Collections per second.

GC Time Par new /s : Total time spent in Java ParNew (YoungGen) Garbage Collections per second.

GC Count CMS /s: Number of Java CMS Garbage Collections per second.

GC Time Par CMS /s : Total time spent in Java CMS Garbage Collections per second.

JVM THREAD COUNT

Thread Count: Number of active, daemon, deadlock, blocked and runnable threads.

9.1.3.2.3 Ambari Server Top N

READ ALL QUERY

Top ReadAllQuery Counters: Top N Ambari Server entities by number of ReadAllQuery operations performed.

Top ReadAllQuery Timers : Top N Ambari Server entities by time spent on ReadAllQuery operations.

Cache Misses

Cache Misses : Top N Ambari Server entities by number of Cache Misses.

9.1.3.3 Druid Dashboards

9.1.3.4 HDFS Dashboards

The following Grafana dashboards are available for the Hadoop Distributed File System (HDFS) component:

HDFS - Home
HDFS - NameNodes
HDFS - DataNodes
HDFS - Top-N
HDFS - Users

9.1.3.5 YARN Dashboards

The following Grafana dashboards are available for YARN:

YARN - Home
YARN - Applications
YARN - MR JobHistory Server
YARN - NodeManagers
YARN - Queues
YARN - ResourceManager

9.1.3.6 Hive Dashboards

The following Grafana dashboards are available for Hive:

Hive - Home
Hive - HiveMetaStore
Hive - HiveServer2

9.1.3.7 Hive LLAP Dashboards

The following Grafana dashboards are available for Hive LLAP:

Hive LLAP - Heatmap
Hive LLAP - Overview
Hive LLAP - Daemon

9.1.3.8 HBase Dashboards

The following Grafana dashboards are available for HBase:

HBase - Home
HBase - RegionServers
HBase - Misc
HBase - Tables
HBase - Users

9.1.3.9 Kafka Dashboards

The following Grafana dashboards are available for Kafka:

Kafka - Home
Kafka - Hosts
Kafka - Topics

9.1.3.10 Storm Dashboards

The following Grafana dashboards are available for Storm:

Storm - Home
Storm - Topology
Storm - Components

9.1.3.11 System Dashboards

The following Grafana dashboards are available for System:

System - Home
System - Servers

9.1.3.12 NiFi Dashboards

The following Grafana dashboard is available for NiFi:

NiFi-Home

9.1.4 AMS Performance Tuning

To set up Ambari Metrics System in your environment, review and customize the following Metrics Collector configuration options:

Customizing the Metrics Collector Mode
Customizing TTL Settings
Customizing Memory Settings
Customizing Cluster-Environment-Specific Settings
Moving the Metrics Collector
(Optional) Enabling Individual Region, Table, and User Metrics for HBase

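9.1.4.1 Customizing the Metrics Collector Mode

Metrics Collector is built using Hadoop technologies such as Apache HBase, Apache Phoenix, and the YARN Application Timeline Server (ATS). The Collector can store metrics data on the local file system, referred to as embedded mode, or use an external HDFS, referred to as distributed mode. By default, the Collector runs in embedded mode. In embedded mode, the Collector captures and writes metrics to the local file system on the host where the Collector is running.

Important:

When running in embedded mode, confirm that hbase.rootdir and hbase.tmp.dir have enough space to accommodate the data and are lightly loaded. The directories are configured in Ambari Metrics > Configs > Advanced > ams-hbase-site and should point to a sufficiently sized, lightly loaded partition, for example:

file:///grid/0/var/lib/ambari-metrics-collector/hbase.

Also confirm that the TTL settings are appropriate.

When the Collector is configured for distributed mode, it writes metrics to HDFS, and the components run in distributed processes, which helps to manage CPU and memory consumption.

To switch the Metrics Collector from embedded mode to distributed mode:

① In Ambari Web, browse to Services > Ambari Metrics > Configs.

② Change the values of the properties listed in the following table: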

+-----------------------+-------------------------------------------+-------------------------------+-------------------------------+

+-----------------------+-------------------------------------------+-------------------------------+-------------------------------+

+-----------------------+-------------------------------------------+-------------------------------+-------------------------------+

+-----------------------+-------------------------------------------+-------------------------------+-------------------------------+

+-----------------------+-------------------------------------------+-------------------------------+-------------------------------+

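③ In Ambari Web > Hosts > Components, restart the Metrics Collector.

If the cluster is configured for NameNode high availability, set the hbase.rootdir value to use the HDFS name service instead of the NameNode host name:

hdfs://hdfsnameservice/apps/ams/metrics

Optionally, you can migrate existing data from the local store to HDFS before switching to distributed mode.

Steps:

① Create an HDFS directory for the ams user:

su - hdfs -c 'hdfs dfs -mkdir -p /apps/ams/metrics'

② Stop the Metrics Collector.

③ Copy the metrics data from the AMS local directory to the HDFS directory. The local directory is the current hbase.rootdir value, for example:

su - hdfs -c 'hdfs dfs -copyFromLocal /var/lib/ambari-metrics-collector/hbase/* /apps/ams/metrics'

su - hdfs -c 'hdfs dfs -chown -R ams:hadoop /apps/ams/metrics'

④ Switch to distributed mode.

⑤ Restart the Metrics Collector.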

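9.1.4.2 Customizing TTL Settings

AMS lets you set a Time To Live (TTL) for aggregated metrics in Ambari Metrics > Configs > Advanced ams-site. Each property name is self-explanatory and controls how long, in seconds, a metric value is kept before it is purged.

For example, suppose you are running a single-node sandbox and want to keep no more than seven days of data to reduce disk space consumption. You can set every property whose name ends in .ttl to 604800 (the number of seconds in seven days).

You would likely do this for the timeline.metrics.cluster.aggregator.daily.ttl property, which controls the TTL of daily aggregates and defaults to two years.

Two other properties that consume a lot of disk space are:

timeline.metrics.cluster.aggregator.minute.ttl: controls the TTL of minute-level aggregated metrics
timeline.metrics.host.aggregator.ttl: controls the TTL of host-level precision metrics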
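The arithmetic behind the seven-day value, and the "every property ending in .ttl" rule, can be checked with a short sketch. The helper names are hypothetical, and the sample values are illustrative: the two-year figure is simply 730 days expressed in seconds, matching the default described above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to the seconds value that the .ttl properties expect."""
    return days * SECONDS_PER_DAY


def cap_ttls(config: dict, max_days: int) -> dict:
    """Return a copy of config in which every *.ttl property is at most max_days."""
    limit = ttl_seconds(max_days)
    return {
        key: (min(int(value), limit) if key.endswith(".ttl") else value)
        for key, value in config.items()
    }


sample = {
    "timeline.metrics.cluster.aggregator.daily.ttl": ttl_seconds(730),  # about two years
    "timeline.metrics.service.operation.mode": "embedded",
}
print(ttl_seconds(7))       # 604800
print(cap_ttls(sample, 7))
```

Non-TTL properties pass through unchanged, so the result can be compared side by side with the current ams-site values before anything is edited in Ambari Web.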

9.1.4.3 Customizing Memory Settings

Because AMS uses multiple components (such as Apache HBase and Apache Phoenix) for metrics storage and query, several tunable properties are available for tuning memory use:

+---------------------------+-------------------------------+-------------------------------------------------------------------+

| Configuration | Property | Description |

+---------------------------+-------------------------------+-------------------------------------------------------------------+

| Advanced ams-env | metrics_collector_heapsize | Heap size configuration for the Collector. |

+---------------------------+-------------------------------+-------------------------------------------------------------------+

| Advanced ams-hbase-env | hbase_regionserver_heapsize | Heap size configuration for the single AMS HBase Region Server. |

+---------------------------+-------------------------------+-------------------------------------------------------------------+

| Advanced ams-hbase-env | hbase_master_heapsize | Heap size configuration for the single AMS HBase Master. |

+---------------------------+-------------------------------+-------------------------------------------------------------------+

| Advanced ams-hbase-env | regionserver_xmn_size | Maximum value for the young generation heap size for the single |

| | | AMS HBase RegionServer. |

+---------------------------+-------------------------------+-------------------------------------------------------------------+

| Advanced ams-hbase-env | hbase_master_xmn_size | Maximum value for the young generation heap size for the single |

| | | AMS HBase Master. |

+---------------------------+-------------------------------+-------------------------------------------------------------------+

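9.1.4.4 Customizing Cluster-Environment-Specific Settings

The Metrics Collector mode, TTL settings, memory settings, and disk space requirements for AMS depend on the number of nodes in the cluster. The following table lists recommendations and tuning guidelines for each configuration: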

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

| Cluster environment | Host count | Disk space | Collector mode | TTL | Memory settings |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

|PoC | 1-5 | 5GB | embedded |Reduce TTLs |metrics_collector_heap_size=1024 |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

|Production | 20-50 | 50GB | embedded | n.a. |metrics_collector_heap_size=1024 |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

|Production | 50-200 | 100GB | embedded | n.a. |metrics_collector_heap_size=2048 |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

|Production | 200-400 | 200GB | embedded | n.a. |metrics_collector_heap_size=2048 |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

|Production | 400-800 | 200GB | distributed | n.a. |metrics_collector_heap_size=8192 |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+

|Production | 800+ | 500GB | distributed | n.a. |metrics_collector_heap_size=12288 |

+---------------+-----------+-----------+---------------+---------------+-----------------------------------+
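The table can be treated as a lookup keyed by host count. The sketch below encodes only the rows that appear in the table above; host counts that fall between those rows return None instead of a guess. The function name and the dictionary layout are hypothetical.

```python
# (min_hosts, max_hosts or None for open-ended, disk space, collector mode, heap size in MB)
SIZING_ROWS = [
    (1, 5, "5GB", "embedded", 1024),
    (20, 50, "50GB", "embedded", 1024),
    (50, 200, "100GB", "embedded", 2048),
    (200, 400, "200GB", "embedded", 2048),
    (400, 800, "200GB", "distributed", 8192),
    (800, None, "500GB", "distributed", 12288),
]


def recommend(host_count: int):
    """Return the first table row whose host range contains host_count, or None."""
    for low, high, disk, mode, heap_mb in SIZING_ROWS:
        if host_count >= low and (high is None or host_count <= high):
            return {"disk": disk, "mode": mode, "metrics_collector_heap_size": heap_mb}
    return None


print(recommend(3))
print(recommend(900))
```

Because the ranges in the table share their endpoints (for example 50 appears in two rows), the sketch resolves a boundary value to the smaller tier; that tie-break is a choice made here, not something the table specifies.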

9.1.4.5 Moving the Metrics Collector

Use the following procedure to move the Ambari Metrics Collector to a new host:

① In Ambari Web, stop the Ambari Metrics service.

② Run the following API call to delete the current Metrics Collector component (metrics.collector.hostname is the current Collector host):

curl -u admin:admin -H "X-Requested-By:ambari" -i -X \
DELETE http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/metrics.collector.hostname/host_components/METRICS_COLLECTOR

③ Run the following API call to add the Metrics Collector to the new host (here metrics.collector.hostname is the new host):

curl -u admin:admin -H "X-Requested-By:ambari" -i -X \
POST http://ambari.server:8080/api/v1/clusters/cluster.name/hosts/metrics.collector.hostname/host_components/METRICS_COLLECTOR

④ In Ambari Web, browse to the host on which you added the new Metrics Collector and click Install the Metrics Collector.

⑤ In Ambari Web, start the Ambari Metrics service.
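The two API calls differ only in the HTTP method and the host name, so they are easy to build from one helper. This sketch mirrors the curl commands above using the Python standard library; it only constructs the requests and sends nothing. The function name is hypothetical, and the server, cluster, host names, and admin:admin credentials are placeholders.

```python
import base64
import urllib.request


def host_component_request(method, server, cluster, host, user, password):
    """Build (but do not send) the METRICS_COLLECTOR host_components request."""
    url = (
        f"http://{server}:8080/api/v1/clusters/{cluster}"
        f"/hosts/{host}/host_components/METRICS_COLLECTOR"
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "X-Requested-By": "ambari",          # same header as in the curl calls
        "Authorization": f"Basic {token}",   # equivalent of curl -u user:password
    }
    return urllib.request.Request(url, method=method, headers=headers)


delete_old = host_component_request("DELETE", "ambari.server", "cluster.name", "old.host", "admin", "admin")
add_new = host_component_request("POST", "ambari.server", "cluster.name", "new.host", "admin", "admin")
print(delete_old.get_method(), delete_old.full_url)
print(add_new.get_method(), add_new.full_url)
# To execute a request: urllib.request.urlopen(delete_old)
```

Printing the method and URL before sending makes it easy to confirm that the DELETE targets the old host and the POST targets the new one.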

9.1.4.6 (Optional) Enabling Individual Region, Table, and User Metrics for HBase

Unlike the HBase RegionServer metrics, Ambari disables per-region, per-table, and per-user metrics by default, because these metrics are numerous and can cause performance issues.

If you want Ambari to collect these metrics, you can re-enable them. Test this option first, however, and confirm that AMS performance remains acceptable.

① On the Ambari Server, browse to the following location:

/var/lib/ambari-server/resources/common-services/HBASE/0.96.0.2.0/package/templates

② Edit the following template files:

hadoop-metrics2-hbase.properties-GANGLIA-MASTER.j2

hadoop-metrics2-hbase.properties-GANGLIA-RS.j2

③ Comment out or remove the following lines:

*.source.filter.class=org.apache.hadoop.metrics2.filter.RegexFilter

hbase.*.source.filter.exclude=.*(Regions|Users|Tables).*

④ Save the template files and restart Ambari Server for the changes to take effect.

Important:

If you upgrade Ambari to a newer version, you must re-apply these changes to the template files.
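To see what the exclude pattern in step ③ removes, the same regular expression can be tried against a few metrics source names. This only illustrates the pattern: the sample names are hypothetical, and Python's re module stands in for the Java regex engine used by Hadoop's RegexFilter.

```python
import re

# Pattern copied from the hbase.*.source.filter.exclude line above.
EXCLUDE = re.compile(r".*(Regions|Users|Tables).*")


def is_excluded(source_name: str) -> bool:
    """True if the exclude pattern matches the given metrics source name."""
    return EXCLUDE.search(source_name) is not None


for name in ["RegionServer,sub=Regions", "RegionServer,sub=Tables", "RegionServer,sub=Server"]:
    print(name, is_excluded(name))
```

The match is case-sensitive, so a name containing "RegionServer" alone is not excluded; only names containing "Regions", "Users", or "Tables" are.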

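9.1.5 AMS High Availability

By default, Ambari installs the Ambari Metrics System (AMS) into the cluster with one Metrics Collector component. The Collector is a daemon that runs on a specific host in the cluster and receives data from the registered publishers, the Monitors and Sinks.

Depending on your needs, you might require AMS to have two Collectors to cover a high availability scenario.

Prerequisite:

AMS must be deployed in distributed (not embedded) mode.

Steps:

① In Ambari Web, browse to the host where you want to install another Collector.

② On the Hosts page, choose +Add.

③ Select Metrics Collector from the list.

Ambari installs the new Metrics Collector and configures Ambari Metrics for HA.

The new Collector is installed in a "stopped" state.

④ In Ambari Web, start the new Collector component.

Note:

If you add a second Collector to the cluster without first switching AMS to distributed mode, the Collector is installed but does not start.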

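9.1.6 AMS Security

9.1.6.1 Changing the Grafana Admin Password

If you need to change the Grafana admin password after the initial Ambari installation, change it directly in Grafana and then make the same change in the Ambari Metrics configuration.

(1) In Ambari Web, browse to Services > Ambari Metrics, select Quick Links, and then choose Grafana.

The Grafana UI opens in read-only mode.

(2) Click Sign In.

(3) Log in as the administrator with the unchanged credentials, admin/admin.

(4) Click the admin label to view the administrator profile, then click Change password.

(5) Enter the unchanged password, enter and confirm the new password, and click the Change password button.

(6) Return to Ambari Web > Services > Ambari Metrics and browse to the Configs tab.

(7) In the General section, update and confirm Grafana Admin Password with the new password.

(8) Save the configuration and restart the services, if prompted.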

9.1.6.2 Set Up HTTPS for AMS

To restrict access to AMS to HTTPS connections, you must provide a certificate. A self-signed certificate is acceptable for initial trials but is not suitable for production environments. After you have a certificate, you must run specific setup commands.

Steps:

(1) Create your own CA certificate.

openssl req -new -x509 -keyout ca.key -out ca.crt -days 365

(2) Import the CA certificate into the truststore.

# keytool -keystore /<path>/truststore.jks -alias CARoot -import -file ca.crt -storepass bigdata

(3) Check the truststore.

# keytool -keystore /<path>/truststore.jks -list

Enter keystore password:

Keystore type: JKS

Keystore provider: SUN

Your keystore contains 2 entries

caroot, Feb 22, 2016, trustedCertEntry,

Certificate fingerprint (SHA1):

AD:EE:A5:BC:A8:FA:61:2F:4D:B3:53:3D:29:23:58:AB:2E:B1:82:AF

(4) Generate a certificate for the AMS Collector and store the private key in the keystore.

# keytool -genkey -alias c6401.ambari.apache.org -keyalg RSA -keysize 1024 -dname "CN=c6401.ambari.apache.org,OU=IT,O=Apache,L=US,ST=US,C=US" -keypass bigdata -keystore /<path>/keystore.jks -storepass bigdata

(5) Create a certificate request for the AMS Collector certificate.

keytool -keystore /<path>/keystore.jks -alias c6401.ambari.apache.org -certreq -file c6401.ambari.apache.org.csr -storepass bigdata

(6) Sign the certificate request with the CA certificate.

openssl x509 -req -CA ca.crt -CAkey ca.key -in c6401.ambari.apache.org.csr -out c6401.ambari.apache.org_signed.crt -days 365 -CAcreateserial -passin pass:bigdata

(7) Import the CA certificate into the keystore.

keytool -keystore /<path>/keystore.jks -alias CARoot -import -file ca.crt -storepass bigdata

(8) Import the signed certificate into the keystore.

keytool -keystore /<path>/keystore.jks -alias c6401.ambari.apache.org -import -file c6401.ambari.apache.org_signed.crt -storepass bigdata

(9) Check the keystore.

caroot2, Feb 22, 2016, trustedCertEntry,

Certificate fingerprint (SHA1):

7C:B7:0C:27:8E:0D:31:E7:BE:F8:BE:A1:A4:1E:81:22:FC:E5:37:D7

[root@c6401 tmp]# keytool -keystore /tmp/keystore.jks -list

Enter keystore password:

Keystore type: JKS

Keystore provider: SUN

Your keystore contains 2 entries

caroot, Feb 22, 2016, trustedCertEntry,

Certificate fingerprint (SHA1):

AD:EE:A5:BC:A8:FA:61:2F:4D:B3:53:3D:29:23:58:AB:2E:B1:82:AF

c6401.ambari.apache.org, Feb 22, 2016, PrivateKeyEntry,

Certificate fingerprint (SHA1):

A2:F9:BE:56:7A:7A:8B:4C:5E:A6:63:60:B7:70:50:43:34:14:EE:AF

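(10) Copy /<path>/truststore.jks to /<path>/truststore.jks on all nodes and set appropriate access permissions.

(11) Copy /<path>/keystore.jks to /<path>/keystore.jks on the AMS Collector node only, and set appropriate access permissions. Making the ams user the file owner and setting the permissions to 400 is recommended.

(12) In Ambari Web, update the AMS configuration: set ams-site/timeline.metrics.service.http.policy=HTTPS_ONLY, along with the following properties: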

ams-ssl-server/ssl.server.keystore.keypassword=bigdata
ams-ssl-server/ssl.server.keystore.location=/<path>/keystore.jks
ams-ssl-server/ssl.server.keystore.password=bigdata
ams-ssl-server/ssl.server.keystore.type=jks
ams-ssl-server/ssl.server.truststore.location=/<path>/truststore.jks
ams-ssl-server/ssl.server.truststore.password=bigdata
ams-ssl-server/ssl.server.truststore.reload.interval=10000
ams-ssl-server/ssl.server.truststore.type=jks
ams-ssl-client/ssl.client.truststore.location=/<path>/truststore.jks
ams-ssl-client/ssl.client.truststore.password=bigdata
ams-ssl-client/ssl.client.truststore.type=jks
ssl.client.truststore.alias=<Alias used to create certificate for AMS. (Default is hostname)>

(13) Restart the services.

(14) Configure the Ambari server to use the truststore.

# ambari-server setup-security

Using python /usr/bin/python

Security setup options...

===========================================================================

Choose one of the following options:

[1] Enable HTTPS for Ambari server.

[2] Encrypt passwords stored in ambari.properties file.

[3] Setup Ambari kerberos JAAS configuration.

[4] Setup truststore.

[5] Import certificate to truststore.

===========================================================================

Enter choice, (1-5): 4

Do you want to configure a truststore [y/n] (y)?

TrustStore type [jks/jceks/pkcs12] (jks):jks

Path to TrustStore file :/<path>/keystore.jks

Password for TrustStore:

Re-enter password:

Ambari Server 'setup-security' completed successfully.

(15) Configure the Ambari server to use HTTPS instead of HTTP for requests to the AMS Collector:

# echo "server.timeline.metrics.https.enabled=true" >> /etc/ambari-server/conf/ambari.properties

(16) Restart the Ambari server.

9.1.6.3 为 Grafana 设置 HTTPS (Set Up HTTPS for Grafana)

如果要限制访问 Grafana 通过 HTTPS 连接，必须提供一个证书。起初测试的时候可以使用自签名的证书，但不适用于生产环境。在获得了一个证书之后，

必须运行特定的安装命令(setup command)。

步骤：

(1) 登录到 Grafana 主机上

(2) 浏览到 Grafana 配置目录

cd /etc/ambari-metrics-grafana/conf/

(3) 定位到证书

如果要创建一个临时的自签名证书，可以运行：

openssl genrsa -out ams-grafana.key 2048

openssl req -new -key ams-grafana.key -out ams-grafana.csr

openssl x509 -req -days 365 -in ams-grafana.csr -signkey ams-grafana.key -

out ams-grafana.crt

(4) 设置证书和秘钥文件的所有者和权限，让 Grafana 可以访问

chown ams:hadoop ams-grafana.crt

chown ams:hadoop ams-grafana.key

chmod 400 ams-grafana.crt

chmod 400 ams-grafana.key

For a non-root Ambari user, use:

chmod 444 ams-grafana.crt

so that the agent user can read the file.

(5) In Ambari Web, browse to Services > Ambari Metrics > Configs.

(6) In the Advanced ams-grafana-ini section, update the following properties:

protocol https

cert_file /etc/ambari-metrics-grafana/conf/ams-grafana.crt

cert_key /etc/ambari-metrics-grafana/conf/ams-grafana.key

(7) Save the configuration and restart the services if prompted.
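Grafana reads these properties from the [server] section of its ini file. As a quick sanity check after the restart, the rendered settings can be validated with a short Python sketch like the one below. The function name https_settings_ok is hypothetical, and the sketch assumes you pass it the text of the rendered ini file; it does not check that the certificate and key files exist.

```python
import configparser

def https_settings_ok(ini_text):
    """Check that the [server] section enables HTTPS with a cert and key set."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("server"):
        return False
    server = parser["server"]
    return (
        server.get("protocol") == "https"
        and bool(server.get("cert_file"))
        and bool(server.get("cert_key"))
    )
```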

9.2 Ambari Log Search (Technical Preview)

The following sections describe the Technical Preview of Ambari Log Search, which should be used only on non-production clusters with fewer than 150 nodes.

9.2.1 Log Search Architecture

Ambari Log Search lets you search the logs generated by Ambari-managed HDP components. It relies on the Ambari Infra service, which provides Apache Solr indexing.

Two components make up the Log Search solution:

Log Feeder
Log Search Server

9.2.1.1 Log Feeder

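The Log Feeder component parses component logs. A Log Feeder is deployed on every node in the cluster and interacts with all component logs on that node. At startup, the Log Feeder begins parsing all known component logs and sends them to the Apache Solr instance (managed by the Ambari Infra service) for indexing.

By default, the Log Feeder captures only FATAL, ERROR, and WARN logs. You can use the Log Search UI filter settings to add other log levels temporarily or permanently.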

9.2.1.2 Log Search Server

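The Log Search Server hosts the Log Search UI web application and provides the API that Ambari and the Log Search UI use to access the indexed component logs. After logging in as a local or LDAP user, you can use the Log Search UI to visualize, browse, and search the indexed component logs.

9.2.2 Installing Log Search

Log Search is a built-in service in Ambari 2.4 and later. You can install it during a new installation by using the +Add Service menu. Log Feeders are installed automatically on all nodes in the cluster. The Log Search Server is placed manually and can be installed on the same host as Ambari Server.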

9.2.3 Using Log Search

Using Log Search includes the following activities:

Accessing Log Search
Using Log Search to Troubleshoot
Viewing Service Logs
Viewing Access Logs

9.2.3.1 Accessing Log Search

After Log Search is installed, you can search the indexed logs in any of the following three ways:

Ambari Background Ops Log Search Link
Host Detail Logs Tab
Log Search UI

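9.2.3.1.1 Ambari Background Ops Log Search Link

When you perform lifecycle operations such as starting or stopping services, access to the logs is important because it helps you recover from potential failures. These logs are now available in Background Ops.

Background Ops also links to the Host Detail Logs tab, which lists all the indexed log files that can be viewed for a host.

9.2.3.1.2 Host Detail Logs Tab

A Logs tab is added to each host's detail page. It lists the indexed, viewable log files, organized by service, component, and type. You can open and search each of these files through a link to the Log Search UI.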

9.2.3.1.3 Log Search UI

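The Log Search UI is a purpose-built web application for searching HDP component logs. The UI focuses on quick access and on searching logs from a single location. Logs can be filtered by log level, by component, and by searchable keywords.

The Log Search UI is available from the Quick Links of the Log Search service in Ambari Web.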

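9.2.3.2 Using Log Search to Troubleshoot

To find the logs associated with a specific problem, use the Troubleshooting tab in the UI to select the service, components, and time frame related to the problem. For example, if you select HDFS, the UI automatically searches the HDFS-related components. You can select a time frame of yesterday or last week, or a custom value. When you are ready to view the matching logs, click Go to Logs.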

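9.2.3.3 Viewing Service Logs

The Service Logs tab lets you search across all component logs by keyword, or filter by log level, component, and time range. The UI is organized so that you can quickly see how many logs were captured at each log level, find keywords, include or exclude components, and view the logs that match your query.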

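9.2.3.4 Viewing Access Logs

When troubleshooting HDFS-related issues, you may find it helpful to look at HDFS user access trends. The Access Logs tab lets you view HDFS audit log entries, with aggregated usage data showing the top ten HDFS users and the top ten file system resources accessed. This can help you find anomalies or hot and cold data sets.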

9.3 Ambari Infra

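Many services in HDP depend on a core service to index data. For example, Apache Atlas uses an indexing service for lineage-free text search, and Apache Ranger indexes audit data.

The role of Ambari Infra is to provide a common indexing service for components in the stack.

Currently, the Ambari Infra service has only one component: the Infra Solr Instance. The Infra Solr Instance is a fully managed Apache Solr installation. By default, when you choose to install the Ambari Infra service, a single-node SolrCloud installation is deployed, but you can install multiple Infra Solr Instances so that you have a distributed index and search for Atlas, Ranger, and Log Search.

To install multiple Infra Solr Instances, simply add them to existing cluster hosts through Ambari's +Add Service capability. The number of Infra Solr Instances to deploy depends on the number of nodes in the cluster and the services deployed.

Because one Ambari Infra Solr Instance is used by multiple HDP components, be careful when restarting the service to avoid disrupting those dependent services. In HDP 2.5 and later, Atlas, Ranger, and Log Search depend on the Ambari Infra Solr Instance.

Note:

The Infra Solr Instance is intended for use only by HDP components; use by third-party components or applications is not supported.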

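9.3.1 Archiving & Purging Data

Large clusters produce a large volume of log content, and Ambari Infra provides a convenient utility for archiving and purging logs that are no longer required.

The utility is called the Solr Data Manager. The Solr Data Manager is a Python program installed at /usr/bin/infra-solr-data-manager. It lets you quickly archive, delete, or save data from a Solr collection.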

9.3.1.1 Command Line Options

Operation Modes

-m MODE, --mode=MODE archive | delete | save

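The mode to use depends on the operation you want to perform:

archive : stores data on a storage medium and deletes it from the collection once it has been stored

delete : deletes data

save : like archive, except that the data is not deleted after it has been saved

Connecting to Solr

-s SOLR_URL, --solr-url=<SOLR_URL>

The URL used to connect to a specific SolrCloud instance

For example, http://c6401.ambari.apache.org:8886/solr

-c COLLECTION, --collection=COLLECTION

The name of the Solr collection, for example 'hadoop_logs'

-k SOLR_KEYTAB, --solr-keytab=SOLR_KEYTAB

The keytab file to use with a kerberized Solr instance

-n SOLR_PRINCIPAL, --solr-principal=SOLR_PRINCIPAL

The principal name to use with a kerberized Solr instance

Record Schema

-i ID_FIELD, --id-field=ID_FIELD

The name of the field in the Solr schema that uniquely identifies each record

-f FILTER_FIELD, --filter-field=FILTER_FIELD

The name of the field in the Solr schema to filter on, for example 'logtime'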

-o DATE_FORMAT, --date-format=DATE_FORMAT

The custom date format to use with the -d DAYS field to match log entries that are older than a certain number of days.

-e END

Based on the filter field and date format, this argument configures the date that should be used as the end of the date range. If you use '2018-08-29T12:00:00.000Z', then any records with a filter field value before that date will be saved, deleted, or archived, depending on the mode.

-d DAYS, --days=DAYS

Based on the filter field and date format, this argument configures how many days before today should be used as the end of the range. If you use '30', then any records with a filter field value older than 30 days will be saved, deleted, or archived, depending on the mode.

-q ADDITIONAL_FILTER, --additional-filter=ADDITIONAL_FILTER

Any additional filter criteria to use to match records in the collection
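The relationship between -d DAYS and -e END can be illustrated with a short Python sketch that turns a number of days into the equivalent end-of-range timestamp. This is an illustration only: the function name end_of_range is hypothetical, and measuring the cutoff from the current UTC time is an assumption, since the source does not state how the tool itself computes it.

```python
from datetime import datetime, timedelta, timezone

def end_of_range(days, now=None):
    """Return the timestamp 'days' before now, in the millisecond format
    used in the -e example (hypothetical helper, UTC assumed)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    # strftime has no millisecond directive, so derive it from microseconds.
    millis = cutoff.microsecond // 1000
    return cutoff.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
```

Passing a fixed value for now makes the result reproducible, which is convenient when checking a cutoff before running the tool in delete or archive mode.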

Extracting Records

-r READ_BLOCK_SIZE, --read-block-size=READ_BLOCK_SIZE

The number of records to read at a time from Solr. For example: ‘10’ to read 10 records at a time.

-w WRITE_BLOCK_SIZE, --write-block-size=WRITE_BLOCK_SIZE

The number of records to write per output file. For example: ‘100’ to write 100 records per file.

-j NAME, --name=NAME name included in result files

Additional name to add to the final filename created in save or archive mode.

--json-file

Default output format is one valid json document per record delimited by a newline. This option will write out a single valid JSON

document containing all of the records.

-z COMPRESSION, --compression=COMPRESSION none | tar.gz | tar.bz2 | zip | gz

Depending on how the output files will be analyzed, you can choose the most suitable compression and file format for them. Gzip compression is used by default.

Writing Data to HDFS

-a HDFS_KEYTAB, --hdfs-keytab=HDFS_KEYTAB

The keytab file to use when writing data to a kerberized HDFS instance.

-l HDFS_PRINCIPAL, --hdfs-principal=HDFS_PRINCIPAL

The principal name to use when writing data to a kerberized HDFS instance

-u HDFS_USER, --hdfs-user=HDFS_USER

The user to connect to HDFS as

-p HDFS_PATH, --hdfs-path=HDFS_PATH

The path in HDFS to write data to in save or archive mode.

Writing Data to S3

-t KEY_FILE_PATH, --key-file-path=KEY_FILE_PATH

The path to the file on the local file system that contains the AWS Access and Secret Keys. The file should contain the keys in this

format: <accessKey>,<secretKey>

-b BUCKET, --bucket=BUCKET

The name of the bucket that data should be uploaded to in save or archive mode.

-y KEY_PREFIX, --key-prefix=KEY_PREFIX

The key prefix allows you to create a logical grouping of the objects in an S3 bucket. The prefix value is similar to a directory name

enabling you to store data in the same directory in a bucket. For example, if your Amazon S3 bucket name is logs, and you set prefix

to hadoop/, and the file on your storage device is hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz, then the file would be identified

by this URL: http://s3.amazonaws.com/logs/hadoop/hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz

-g, --ignore-unfinished-uploading

To deal with connectivity issues, uploading extracted data can be retried. If you do not wish to resume uploads, use the -g flag to

disable this behaviour.
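The way the bucket name, key prefix, and file name combine in the -y example above can be written out as a one-line Python sketch. The function name s3_object_url is hypothetical, and the URL layout simply mirrors the example in the text rather than any particular S3 client library.

```python
def s3_object_url(bucket, key_prefix, filename):
    """Compose the object URL from bucket, key prefix, and file name,
    following the layout of the example in the text (hypothetical helper)."""
    # The prefix is used verbatim, so it should end with "/" to act like a directory.
    return f"http://s3.amazonaws.com/{bucket}/{key_prefix}{filename}"
```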

Writing Data Locally

-x LOCAL_PATH, --local-path=LOCAL_PATH

The path on the local file system that should be used to write data to in save or archive mode

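Examples

Deleting Indexed Data

In delete mode (-m delete), the program deletes data from the Solr collection. This mode uses the filter field option (-f FILTER_FIELD) to control which data is removed from the index.

The following command deletes log entries from the hadoop_logs collection that were created before August 29, 2017. The -f option specifies the field in the Solr collection to use as the filter field, and the -e option specifies the end of the range of values to delete.

infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z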

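Archiving Indexed Data

In archive mode, the program fetches data from the Solr collection, writes it out to HDFS or S3, and then deletes the data.

The program fetches records from Solr and creates a file once the write block size is reached or once there are no more matching records in Solr. It tracks its progress by fetching records ordered by the filter field and the id field, and always saves their last values. Once a file is written, it is compressed using the configured compression type.

After the compressed file is created, the program creates a command file containing instructions for the next steps. If any interruption or error occurs during those steps, the program starts from the saved command file on its next run, so that all data stays consistent. If an invalid configuration causes the error and consistency cannot be restored, the -g option can be used to ignore the saved command file. The program supports writing data to HDFS, S3, or local disk.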
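The interaction between the read block size (-r) and the write block size (-w) can be sketched in a few lines of Python. This is a simplified illustration, not the tool's actual code: the function name write_blocks is hypothetical, the records are held in an in-memory list instead of being fetched from Solr, and compression, progress tracking, and the command file are left out.

```python
def write_blocks(records, read_block_size, write_block_size):
    """Group records into output 'files' of at most write_block_size records,
    reading read_block_size records at a time (hypothetical helper)."""
    files = []
    current = []
    for start in range(0, len(records), read_block_size):
        # One simulated fetch of up to read_block_size records.
        for record in records[start:start + read_block_size]:
            current.append(record)
            if len(current) == write_block_size:
                # The write block size was reached: close this file.
                files.append(current)
                current = []
    if current:
        # No more matching records: flush the partially filled file.
        files.append(current)
    return files
```

With 250 records, a read block size of 10, and a write block size of 100, the sketch produces three files holding 100, 100, and 50 records.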

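The following command archives data from the Solr collection hadoop_logs at http://c6401.ambari.apache.org:8886/solr. It filters on the logtime field and extracts everything older than 1 day, reading 10 documents at a time, writing 100 documents per file, and copying the compressed files to the local /tmp directory.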

infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 1 -r 10 -w 100 -x /tmp -v

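Saving Indexed Data

Saving data is similar to archiving, except that the data is not deleted from Solr after the files are created and uploaded. It is recommended to test in save mode before running in archive mode, to confirm that the data is written as expected.

The following command saves the last 3 days of HDFS audit logs to the HDFS path "/" as the hdfs user, fetching the data from a kerberized Solr instance.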

infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c audit_logs -f logtime -d 3 -r 10 -w 100 \
  -q type:\"hdfs_audit\" -j hdfs_audit -k /etc/security/keytabs/ambari-infra-solr.service.keytab \
  -n infra-solr/c6401.ambari.apache.org@AMBARI.APACHE.ORG -u hdfs -p /

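9.3.2 Performance Tuning for Ambari Infra

When you use Ambari Infra to index and store Ranger audit logs, you should tune Solr properly to handle the number of audit records stored per day. The following sections give recommendations for tuning the operating system and Solr, based on how you use Ambari Infra and Ranger in your environment.

9.3.2.1 Operating System Tuning

Solr uses many network connections when indexing and searching. To avoid having too many open network connections, the following sysctl parameters are recommended: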

net.ipv4.tcp_max_tw_buckets = 1440000

net.ipv4.tcp_tw_recycle = 1

net.ipv4.tcp_tw_reuse = 1

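These settings can be made permanent in the /etc/sysctl.d/net.conf file, or set at runtime with the following sysctl commands. Note that net.ipv4.tcp_tw_recycle was removed in Linux kernel 4.12, so it must be omitted on newer kernels.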

sysctl -w net.ipv4.tcp_max_tw_buckets=1440000

sysctl -w net.ipv4.tcp_tw_recycle=1

sysctl -w net.ipv4.tcp_tw_reuse=1

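In addition, you should raise the user process limit for Solr to avoid exceptions caused by being unable to create new native threads. You can do this by creating a new file named /etc/security/limits.d/infra-solr.conf with the following content: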

infra-solr - nproc 6000

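9.3.2.2 JVM - GC Settings

Heap size and garbage collection settings are very important for production Solr instances that index large volumes of Ranger audit logs. For production deployments, setting both "Infra Solr Minimum Heap Size" and "Infra Solr Maximum Heap Size" to 12 GB is recommended. To apply these settings:

① In Ambari Web, browse to Services > Ambari Infra > Configs.

② On the Settings tab, you will see two sliders that control the Infra Solr heap size.

③ Set Infra Solr Minimum Heap Size to 12 GB (12,288 MB).

④ Set Infra Solr Maximum Heap Size to 12 GB (12,288 MB).

⑤ Click Save, then restart the affected services as prompted by Ambari.

Using G1 as the garbage collector is also recommended for production deployments. To configure G1 garbage collection for the Ambari Infra Solr instance:

① In Ambari Web, browse to Services > Ambari Infra > Configs.

② On the Advanced tab, expand Advanced infra-solr-env.

③ In the infra-solr-env template, locate the multi-line GC_TUNE environment variable definition and replace it with the following content: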

GC_TUNE="-XX:+UseG1GC
-XX:+PerfDisableSharedMem
-XX:+ParallelRefProcEnabled
-XX:G1HeapRegionSize=4m
-XX:MaxGCPauseMillis=250
-XX:InitiatingHeapOccupancyPercent=75
-XX:+UseLargePages
-XX:+AggressiveOpts"

The -XX:G1HeapRegionSize value is based on a 12 GB Solr maximum heap size. If you choose a different heap size for Solr, refer to the recommendations in the following table:

+-----------+------------------+
| Heap Size | G1HeapRegionSize |
+-----------+------------------+
| < 4GB     | 1MB              |
+-----------+------------------+
| 4-8GB     | 2MB              |
+-----------+------------------+
| 8-16GB    | 4MB              |
+-----------+------------------+
| 16-32GB   | 8MB              |
+-----------+------------------+
| 32-64GB   | 16MB             |
+-----------+------------------+
| >64GB     | 32MB             |
+-----------+------------------+
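The table can be expressed as a small lookup, sketched here in Python. The function name g1_region_size_mb is hypothetical. Because the ranges in the table share their boundary values (for example, 8 GB appears in two rows), the sketch assumes that a boundary value belongs to the lower row, except for 4 GB, which the table explicitly excludes from the first row.

```python
def g1_region_size_mb(heap_gb):
    """Return the suggested G1HeapRegionSize in MB for a heap size in GB
    (hypothetical helper; boundary handling is an assumption)."""
    if heap_gb < 4:
        return 1
    # (inclusive upper bound in GB, region size in MB)
    for upper_gb, region_mb in [(8, 2), (16, 4), (32, 8), (64, 16)]:
        if heap_gb <= upper_gb:
            return region_mb
    return 32
```

For the recommended 12 GB heap the lookup returns 4, which matches the -XX:G1HeapRegionSize=4m setting used in GC_TUNE.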

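9.3.2.3 Environment-Specific Tuning Parameters

Each of the following recommendations depends on the number of audit records indexed per day. To quickly determine how many audit records are indexed per day, use an HTTP client such as curl to run the following command: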

curl -g "http://<ambari infra hostname>:8886/solr/ranger_audits/select?q=(evtTime:[NOW-7DAYS+TO+*])&wt=json&indent=true&rows=0"

You should receive a response similar to the following:

{

"responseHeader":{

"status":0,

"QTime":1,

"params":{

"q":"evtTime:[NOW-7DAYS TO *]",

"indent":"true",

"rows":"0",

"wt":"json"}},

"response":{"numFound":306,"start":0,"docs":[]

}}

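Divide the numFound value in the response by 7 to get the average number of audit records indexed per day. If necessary, you can replace '7DAYS' in the curl request with a broader time range, using keywords such as:

1MONTHS
7DAYS

If you change the time range of the query, make sure to divide by the appropriate number of days. The average number of records per day determines which of the following recommendations applies to your environment.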
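The division can be scripted once the JSON response has been saved. A minimal Python sketch follows; the function name daily_average is hypothetical, and the sketch assumes the response has the same structure as the sample shown above.

```python
import json

def daily_average(response_text, days):
    """Average number of indexed records per day, computed from a Solr
    select response and the number of days the query covered."""
    num_found = json.loads(response_text)["response"]["numFound"]
    return num_found / days

# A shortened copy of the sample response shown above.
sample = '{"responseHeader":{"status":0,"QTime":1},"response":{"numFound":306,"start":0,"docs":[]}}'
average = daily_average(sample, 7)
```

For the sample response, 306 records over 7 days works out to roughly 44 records per day, far below the 50 million threshold of the first recommendation.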

Less Than 50 Million Audit Records Per Day

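Based on the Solr REST API call above, if your average number of records per day is less than 50 million, the following recommendations apply. Each recommendation takes into account the time to live (TTL), which controls how long a document is kept in the index before it is removed. The default TTL is 90 days, but some users choose to be more aggressive and remove documents from the index after 30 days. For this reason, recommendations are provided for both TTL settings.

These recommendations assume the recommended 12 GB heap size for each Solr server instance.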

Default Time To Live (TTL) 90 days:

Estimated total index size: ~150 GB to 450 GB
Total number of primary/leader shards: 6
Total number of shards including 1 replica each: 12
Total number of co-located Solr nodes: ~3 nodes, up to 2 shards per node(does not include replicas)
Total number of dedicated Solr nodes: ~1 node, up to 12 shards per node(does not include replicas)

50 - 100 Million Audit Records Per Day

50 to 100 million records ~ 5 - 10 GB data per day.

Default Time To Live (TTL) 90 days:

Estimated total index size: ~ 450 - 900 GB for 90 days
Total number of primary/leader shards: 18-36
Total number of shards including 1 replica each: 36-72
Total number of co-located Solr nodes: ~9-18 nodes, up to 2 shards per node(does not include replicas)
Total number of dedicated Solr nodes: ~3-6 nodes, up to 12 shards per node(does not include replicas)

Custom Time To Live (TTL) 30 days:

Estimated total index size: 150 - 300 GB for 30 days
Total number of primary/leader shards: 6-12
Total number of shards including 1 replica each: 12-24
Total number of co-located Solr nodes: ~3-6 nodes, up to 2 shards per node(does not include replicas)
Total number of dedicated Solr nodes: ~1-2 nodes, up to 12 shards per node(does not include replicas)

100 - 200 Million Audit Records Per Day

100 to 200 million records ~ 10 - 20 GB data per day.

Default Time To Live (TTL) 90 days:

Estimated total index size: ~ 900 - 1800 GB for 90 days
Total number of primary/leader shards: 36-72
Total number of shards including 1 replica each: 72-144
Total number of co-located Solr nodes: ~18-36 nodes, up to 2 shards per node(does not include replicas)
Total number of dedicated Solr nodes: ~3-6 nodes, up to 12 shards per node (does not include replicas)

Custom Time To Live (TTL) 30 days:

Estimated total index size: 300 - 600 GB for 30 days
Total number of primary/leader shards: 12-24
Total number of shards including 1 replica each: 24-48
Total number of co-located Solr nodes: ~6-12 nodes, up to 2 shards per node(does not include replicas)
Total number of dedicated Solr nodes: ~1-3 nodes, up to 12 shards per node(does not include replicas)

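If you choose to use at least one replica for availability, increase the number of nodes accordingly. If high availability is a requirement, consider using no fewer than 3 Solr nodes in your configuration.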
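Two of the figures in each list above follow directly from the number of primary shards: the total with one replica each is twice the primary count, and the co-located node count is the primary count divided by 2 shards per node. The Python sketch below reproduces only that arithmetic. The function name co_located_sizing is hypothetical, and the dedicated-node figures are deliberately not modeled.

```python
import math

def co_located_sizing(primary_shards, shards_per_node=2):
    """Return (total shards with 1 replica each, co-located Solr nodes)
    for a given number of primary shards (hypothetical helper)."""
    total_with_replica = primary_shards * 2
    # Replicas are excluded from the node count, as in the lists above.
    nodes = math.ceil(primary_shards / shards_per_node)
    return total_with_replica, nodes
```

For example, 6 primary shards give 12 shards in total and 3 co-located nodes, and 36 primary shards give 72 shards in total and 18 co-located nodes.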

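As the examples show, a lower TTL requires fewer resources. To retain data for longer, you can use the SolrDataManager to archive it to long-term storage (HDFS or S3) and provide Hive tables over it for easy querying. With this strategy, hot data stays in Solr for fast access through the Ranger UI, while cold data is archived to HDFS or S3, with access provided through Ranger.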

9.3.2.4 Adding New Shards

If, after reviewing the recommendations above, you need to add shards to an existing deployment, refer to the following Solr documentation to understand how to do so:

https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.5.pdf

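9.3.2.5 Out of Memory Exceptions

When using Ambari Infra together with Ranger Audit, if you see many Solr instances exiting with Java "Out Of Memory" exceptions, one solution is to upgrade the Ranger Audit schema to use less heap memory by enabling DocValues. This change requires re-indexing the data and is disruptive, but it helps considerably with memory consumption.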
