ADF2T: an Active Disk Failure Forecasting and Tolerance Software
Hongzhang Yang†‡, Yahui Yang*, Zhengguang Chen, Zongzhao Li and Yaofeng Tu*§
† School of Software & Microelectronics, Peking University, China
‡ ZTE Corporation, China
§ State Key Laboratory of Mobile Network and Mobile Multimedia Technology, China
* Corresponding authors: Yahui Yang (yhyang@ss.pku.edu.cn), Yaofeng Tu (tu.yaofeng@zte.com.cn)
Abstract—The reliability of distributed file systems is inevitably affected by hard disk failures. This paper proposes an active disk failure forecasting and tolerance software. Firstly, multiple SMART records in a time window are merged into one sample, and by sliding the window, tens of times more positive samples are created. Secondly, features are selected by a two-stage sorting method, so that the features most conducive to machine learning modeling are used and the time for model training is shortened considerably. Thirdly, through two-stage verification, parameters can be adjusted in time when a proactive reconstruction strategy proves unreasonable. Experiments on the ZTE data set and the Backblaze data set show recall rates of 95.66% and 84.28%, with error rates of 0.23% and 2.45%, respectively. The work in this paper has been in commercial use in the ZTE data center for more than one year, and the reliability of the distributed file system software is significantly improved.
Keywords—reliability; disk; failure; forecast
I. INTRODUCTION
As a typical large-scale software system, a distributed file system has its reliability inevitably affected by hardware failure, especially hard disk failure. According to the IDC white paper [1], the data all over the world will reach 175 ZB by the year 2025. Currently, the annual failure rate of disks is about 1% [2], so hundreds of millions of failed disks will soon appear each year worldwide. It is undeniable that disk failure has become the main failure source in data centers [3]. According to the survey results provided by ZTE Corporation, among all the failures in the ZTE data center in 2019, disk failures accounted for more than half, namely 53.49%. Similarly, in other research works, this number was 71.1% [4] and 49.1% [2]. Disk failure can directly lead to disastrous consequences, such as data loss and business interruption.
In traditional distributed file systems, reliability is mainly ensured by data redundancy, including replicas, erasure codes, snapshots, and so on. After disks fail, these technologies rely on redundancy to keep read-write operations running and to recover data, so all of them are passive processing technologies. Their defects are as follows: 1) when the number of failed disks exceeds the number of redundant disks, the risk of permanent data loss is significant; 2) in the process of data recovery, system resources are inevitably occupied by the recovery work; 3) redundancy takes up extra storage space, which increases cost. Different from the above technologies, this paper proposes an active disk fault-tolerance technology: by forecasting disk failures, it makes it possible to migrate data in advance, so that the reliability of the distributed file system is ensured.
From a microcosmic view of disk failure, just as human beings experience "healthy, sub-healthy, illness, and death" in their lifetimes, individual disks inevitably experience "healthy, sub-healthy, near-failure, and failure", because of the wear of various hardware components. At the same time, similar to human deaths caused by congenital physical defects, or accidental deaths caused by car accidents, earthquakes, disasters, animal attacks, and so on, disks also have unforecastable failures caused by manufacturing defects or accidents, such as formaldehyde exposure, vibration, sudden voltage changes, excessive air humidity, and so on. Obviously, the inevitable kind of disk failure has a near-term failure window period and is therefore possible to forecast, but the accidental kind is often sudden, with no near-failure window period at all, so it is not forecastable. Drawing on ZTE's experience in data center operation, this paper defines the states of a disk as follows:
• The healthy state: the disk can correctly perform operations such as mount, read, and write; the operation response time meets the specifications of the disk manufacturer's manual; and the disk SMART (Self-Monitoring, Analysis and Reporting Technology) [5] result is normal.
• The sub-healthy state: the disk occasionally freezes, the response time of operations such as mount, read, and write occasionally fails to meet the specifications of the disk manufacturer's manual, and the SMART result of the disk is slightly abnormal.
• The near-failure state: one or more of the following conditions occur on the disk: mechanical noise, severe abnormality of the disk's SMART result, data loss, incomplete reads or writes, slow reads or writes, frequent offline events, and high temperature, among which slow reads or writes are the most common. That is, the waiting time of read or write requests in the queue exceeds 1.5 seconds for 20 consecutive times (see the sketch after this list).
• The failure state: one or more of the following conditions occur: the file system log prints Medium Error, IO Error, critical target error, or Metadata IO Error; mount fails; reads or writes get no response; etc.
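As a concrete reading of the "slow reads or writes" criterion above, here is a minimal sketch; the function name and the input format are illustrative assumptions, and a real system would sample per-request queue waiting times from block-layer statistics:

```python
# Check for the near-failure "read or write slowly" condition: the queue
# waiting time of read/write requests exceeds 1.5 s for 20 consecutive times.
SLOW_THRESHOLD_SECONDS = 1.5
CONSECUTIVE_LIMIT = 20

def is_read_write_slow(wait_times_seconds):
    """Return True if 20 consecutive requests each waited more than 1.5 s."""
    consecutive = 0
    for wait in wait_times_seconds:      # one entry per read/write request
        consecutive = consecutive + 1 if wait > SLOW_THRESHOLD_SECONDS else 0
        if consecutive >= CONSECUTIVE_LIMIT:
            return True                  # disk meets the near-failure criterion
    return False
```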
Disk failure forecast technology aims to detect the trend that a disk is in the near-failure state, thereby forecasting that the disk will enter the failure state in the foreseeable future, for example within 7 days.
II. MOTIVATION
In 2001, Greg Hamerly [6] tried to forecast disk failure through machine learning and gave a standard process for solving this problem: first, gather disk SMART or IO information, then train on both healthy and failure samples until a model is able to classify a disk into one of the two states. In recent years, disk failure forecast has quickly become a research hotspot, and many research results have achieved an accuracy rate of more than 85% for single-brand SATA disk failure forecast.
Although related works have continuously improved the accuracy of failure forecast, there are still many problems in commercial scenarios.
• Positive and negative samples are severely unbalanced. The biggest obstacle in disk failure research at this stage is the small number of failed disks against the huge number of healthy disks. Although positive samples can be increased by over-sampling techniques, this causes overfitting problems; although negative samples can be reduced by under-sampling techniques, this causes information loss. Therefore, a new balancing method is urgently needed.
• It is difficult to obtain high accuracy and a low error rate at the same time. We believe this is caused by both the training data and the training algorithm. In terms of data, feature selection is required to remove interfering or irrelevant features.
• A feedback mechanism for forecasts urgently needs to be established. There are three types of forecast errors: misjudgments, omissions, and delayed judgments. A misjudgment wastes the remaining life span of the disk; omissions and delayed judgments cause the system to enter a degraded state and rely on traditional fault tolerance to ensure reliability. When a forecast error occurs, the usual remedy is to update the forecast model. However, this method has a time lag, so a more flexible feedback mechanism is needed.
To overcome the above shortcomings, this paper proposes ADF2T, an active disk failure forecasting and tolerance software, including:
• A sliding window record merging technology, proposed to solve the severe unbalance between positive and negative samples.
• A two-stage feature selection technology, proposed to obtain forecast results with high accuracy and a low error rate.
• A two-stage verification technology, proposed to deal flexibly and quickly with misjudgments, omissions, and delayed judgments.
III. DESIGN OF ADF2T
This section analyzes and studies the disk situation of a video-oriented data center of ZTE, and proposes a disk failure forecast software covering a gathering stage, a modeling stage, and a reconstructing stage. The data center has 129,887 disks in total, and 1,995 failed disks appeared during the year 2018. For each disk, SMART is gathered once an hour outside peak hours, so there are 16 records every day.
A. Sliding window record merging technology
In the ZTE data center, the ratio of healthy disks to failed disks is about 64:1. Facing this severe unbalance between positive and negative samples, we propose a sliding window record merging technology.
Fig. 1. The sliding window
As shown in Fig. 1, for a failed disk, only the records within N days before the failure time are used; 18 and 30 are common values of N. The record items are sorted by gathering time, and 3 or 6 days is set as the time window, which is called the size of the window. The starting position of the time window is placed at the failure time of the disk, and the window then slides forward by a step size, for example half a day. The time window is moved forward dozens of times in total, until it covers all records in the N days before failure. For each window position, the average, variance, and range of each record item are calculated, so that the information of each record item at multiple consecutive time points is combined into a single positive sample. Through this sliding of the time window, tens of times more positive samples than original records are created. For a healthy disk, original records of one window size are selected randomly, and the average, variance, and range of each record item are likewise calculated to form negative samples.
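A minimal sketch of this merging step follows, assuming one failed disk's SMART records arrive as a 2-D array sorted by gathering time with the record closest to the failure time first, and 16 records per day as in the ZTE data center; the function name and default values are illustrative, not from the paper:

```python
import numpy as np

RECORDS_PER_DAY = 16  # SMART gathered once an hour outside peak hours

def merge_sliding_windows(records, n_days=30, window_days=3, step_days=0.5):
    """Merge one failed disk's SMART records into positive samples.

    `records` has shape (num_records, num_features), sorted by gathering
    time, with the record closest to the failure time at index 0.
    """
    records = np.asarray(records, dtype=float)[:n_days * RECORDS_PER_DAY]
    window = int(window_days * RECORDS_PER_DAY)   # size of the window
    step = int(step_days * RECORDS_PER_DAY)       # step size of the window
    samples = []
    # Place the window at the failure time, then slide it toward older records.
    for start in range(0, len(records) - window + 1, step):
        win = records[start:start + window]
        # Merge each window into one sample: per-feature average, variance, range.
        samples.append(np.concatenate([win.mean(axis=0),
                                       win.var(axis=0),
                                       win.max(axis=0) - win.min(axis=0)]))
    return np.stack(samples)  # tens of positive samples from a single disk
```

With the defaults above (N = 30 days, 3-day window, half-day step), one failed disk yields 55 positive samples, consistent with the "dozens of times" of sliding described in the text.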
Three important parameters can be adjusted:
• The size of the window. The larger the window, the more records a single sample contains. Of course, more records are not always better: the variation of SMART within the window becomes more significant, while at the same time the difference between samples becomes smaller and fewer samples are created.
• The step size of the window. If the step size is smaller than the window, adjacent windows overlap; fortunately, this does not lead to serious overfitting. If the step size is larger than the window, some SMART data is lost, which is unacceptable.
• N, the number of days before failure. This value needs observation to determine. In fact, it varies greatly across data centers, which may be closely related to the IO pressure of the data center. In the ZTE data center studied in this paper, the SMART of a failed disk differs significantly from its SMART of more than 30 days before the failure, so the value is 30.
B. Two-stage feature selection technology
As is known to all, not all the disk SMART features can
help to forecast whether a disk is about to fail. If inappropriate
features are selected for machine learning, it will not help the
forecast, maybe even have side effect. In addition, the more
features are used for machine learning, the longer time it takes,
which not only consumes a lot of CPU resources, but also
cannot forecast
in real time. Therefore, it is necessary to
perform feature selection, so that the most conducive to disk
failure forecast are used for modeling, and the time for model
training can be shorten as well. The simple way is selecting by
artificial experience or removal of the feature with a variance
of 0, but the effect is not good.
This paper proposes a two-stage feature selection technology:
• After gathering the SMART of all disks p times, the features are characterized by N dimensions. Because the SMART gathered from disks of different brands or models differs, feature selection is performed on disks of exactly the same brand and model. If the data center has disks of different brands and models, features need to be selected for each separately.
• Calculate the variance of each feature over the p gatherings, and filter out the M features whose variance is less than 0.01. As a result, N−M features remain.
• Five algorithms are used to rank the importance of the remaining N−M features: the Chi-square test, Logistic Regression, Support Vector Machine (SVM), Random Forest, and Gradient Boosted Decision Tree (GBDT). A total of 5 sequences are formed, labeled a1, a2, a3, a4, and a5.
• Calculate the average ranking value of each feature across the 5 sequences, and then sort to generate sequence b, in which the N−M features are ordered by average ranking value from small to large.
• The Top-k features of sequence b are selected as the final features.
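The following is a sketch of the two stages under stated assumptions: binary labels y for disks of a single brand and model, SMART values shifted to be non-negative for the chi-square test, scikit-learn's LinearSVC standing in for the paper's SVM, and the function name being illustrative:

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def two_stage_select(X, y, k, var_threshold=0.01):
    X = np.asarray(X, dtype=float)
    # Stage 1: filter out the M features whose variance is below the threshold.
    keep = X.var(axis=0) >= var_threshold
    Xr = X[:, keep]                                   # N - M features remain

    # Stage 2: rank the remaining features with five algorithms.
    scores = [
        chi2(Xr - Xr.min(axis=0), y)[0],              # chi-square test (non-negative input)
        np.abs(LogisticRegression(max_iter=1000).fit(Xr, y).coef_[0]),
        np.abs(LinearSVC(max_iter=5000).fit(Xr, y).coef_[0]),
        RandomForestClassifier(random_state=0).fit(Xr, y).feature_importances_,
        GradientBoostingClassifier(random_state=0).fit(Xr, y).feature_importances_,
    ]
    # Turn each score vector into a ranking (0 = most important): sequences a1..a5.
    ranks = [np.argsort(np.argsort(-s)) for s in scores]
    avg_rank = np.mean(ranks, axis=0)                 # sequence b
    # Keep the Top-k features by average ranking value, from small to large.
    top_k = np.argsort(avg_rank)[:k]
    return np.flatnonzero(keep)[top_k]                # indices into the original N features
```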
Take the Seagate ST4000DM000 disk as an example. After excluding features with variance less than 0.01, 21 SMART terms remain, numbered 1 to 21. The importance of each feature is ranked by the five algorithms, and as shown in Table I, these results are synthesized to form the final result. As shown in Fig. 2, the model performs best when the top 17 features are selected.
TABLE I. AN EXAMPLE OF TWO-STAGE FEATURE SELECTION.

Stage     | Method              | Sorted order
----------|---------------------|--------------------------------------------------------------------------
1st stage | Chi-square test     | 21, 7, 19, 5, 10, 11, 9, 3, 6, 16, 17, 8, 13, 20, 14, 2, 4, 18, 12, 15, 1
          | Logistic Regression | 11, 9, 3, 20, 19, 17, 10, 16, 7, 8, 4, 5, 13, 6, 12, 18, 21, 2, 15, 14, 1
          | SVC                 | 11, 9, 10, 17, 3, 16, 7, 20, 19, 8, 13, 4, 5, 12, 6, 2, 15, 21, 1, 14, 18
          | Random Forest       | 16, 9, 14, 20, 5, 19, 21, 2, 6, 17, 11, 3, 15, 12, 4, 13, 1, 7, 18, 10, 8
          | GBDT                | 17, 16, 11, 9, 14, 20, 21, 5, 6, 19, 2, 4, 15, 3, 13, 8, 18, 7, 12, 1, 10
2nd stage | Average ranking     | 9, 11, 16, 17, 19, 20, 3, 5, 21, 7, 10, 6, 14, 4, 13, 2, 8, 12, 15, 18, 1
Fig. 2. The recall rate when selecting top-k features
After feature selection, a suitable machine learning algorithm is needed to build a forecast model. This paper tried XGBoost [7], which in recent years has been widely used in data science competitions. In fact, after trying more than ten different algorithms and more than a hundred different parameter settings, we found that the choice of algorithm has only a small impact on the accuracy of disk failure forecast, while data quality has a large impact. Hence, in disk failure forecasting, data quality deserves attention, yet it is often overlooked by researchers.
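A rough sketch of the modeling step, assuming the training samples come from the merging of Section III-A restricted to the Top-k features of Section III-B; the hyperparameter values and the threshold hookup below are illustrative assumptions, since the paper does not report its settings:

```python
from xgboost import XGBClassifier

def train_forecast_model(X_train, y_train):
    """Fit an XGBoost classifier on merged-window samples (label 1 = failed disk)."""
    # Illustrative hyperparameters only, not the paper's settings.
    model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                          eval_metric="logloss")
    model.fit(X_train, y_train)
    return model

def forecast_failures(model, X, threshold=0.5):
    """Flag disks whose predicted failure probability exceeds the threshold.

    The threshold is, by our reading, the knob adjusted by the two-stage
    verification of Section III-C when a forecast proves wrong.
    """
    return model.predict_proba(X)[:, 1] > threshold
```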
C. Two-stage verification technology

Prior research works lack verification of forecast results and cannot be improved in a timely manner when a misjudgment occurs or the proactive reconstruction strategy is inadequate; one has to wait a long period of time for the forecast models to be updated from newly gathered disk information. Therefore, in this section we propose a two-stage verification technology for forecast results, shown in Fig. 3.
Fig. 3. Two-stage verification.
For disks that are forecast to fail, proactive reconstruction should be performed immediately:
• If a failure occurs during the reconstruction process, the system is downgraded, the remaining reconstruction work is completed by the other, healthy disks, and the forecast threshold needs to be adjusted so that such a disk will be forecast as a failure disk sooner in the future.
[Fig. 2 data: the recall rate rises with k to a peak of 95.74% at k = 17 and declines thereafter, while the error rate falls to a minimum of 0.21% at k = 17–18 before rising again.]