
B. Feature engineering
Due to occasional SMART acquisition failures or errors in transmitting data sets, missing values must be filled. The method adopted in this paper is: if 2 or more consecutive values are missing, the mode of that SMART attribute on the disk is used as the filling value; if only one value is missing, the average of the values immediately before and after it is used as the filling value.
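As an illustration, the following pandas sketch implements this filling rule for one disk's series of a single SMART attribute; the function name fill_missing and the per-disk, per-attribute framing are our assumptions, not code from the paper.

```python
import numpy as np
import pandas as pd

def fill_missing(series: pd.Series) -> pd.Series:
    """Fill gaps in one SMART attribute series for a single disk (sketch)."""
    s = series.copy()
    is_na = s.isna()
    if not is_na.any():
        return s
    mode_val = s.mode().iloc[0]                 # mode of observed values on this disk
    run_id = (is_na != is_na.shift()).cumsum()  # label runs of consecutive NaNs
    for idx in s[is_na].groupby(run_id[is_na]).groups.values():
        if len(idx) >= 2:                       # 2+ consecutive gaps -> fill with mode
            s.loc[idx] = mode_val
        else:                                   # isolated gap -> mean of neighbors
            i = s.index.get_loc(idx[0])
            prev_v = s.iloc[i - 1] if i > 0 else np.nan
            next_v = s.iloc[i + 1] if i < len(s) - 1 else np.nan
            s.iloc[i] = np.nanmean([prev_v, next_v])
    return s

# Example: fill_missing(pd.Series([3.0, np.nan, 5.0, np.nan, np.nan, 3.0]))
# -> the isolated gap becomes 4.0; the run of two becomes the mode, 3.0
```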
Data normalization uses the interval scaling method, which uses the maximum and minimum of each feature to scale all of its values into the interval [0, 1].
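For concreteness, here is a minimal scikit-learn sketch of interval scaling; fitting the min/max on the training split only is our assumption, since the paper does not say which split the bounds come from, and the toy matrices are stand-ins for the real SMART data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the real SMART feature matrices (assumed names).
X_train = np.random.rand(100, 21) * 1000
X_test = np.random.rand(20, 21) * 1000

# Interval scaling: x' = (x - min) / (max - min), per feature, to [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # min/max learned here
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
```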
Sample_days and predict_failure_days are 2 important parameters (a windowing sketch follows this list):
• Sample_days is defined as the time window size of the input data of the LSTM network in each sequence sample. For example, if sample_days is 5, then a training sample will contain the SMART attribute information of the disk over the past 5 days. Sample_days needs an appropriate value. If it is too small, little potential information is provided to the LSTM. If it is too large, it corresponds to a long time series: data that is too far away from the final failure has little impact on the prediction of the final failure and may even be misleading.
• Predict_failure_days is defined as the number of days before the failure, which serves as an alarm boundary. Its value also needs to be appropriate: too long or too short a time interval will affect the effectiveness of disk failure handling. A reasonable value for predict_failure_days is 5-7 days.
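The sketch below shows how these two parameters could turn one disk's daily SMART records into labeled LSTM sequences; make_sequences, the array shapes, and the days_to_failure encoding are our illustration, not the paper's code.

```python
import numpy as np

def make_sequences(smart: np.ndarray, days_to_failure: np.ndarray,
                   sample_days: int = 30, predict_failure_days: int = 5):
    """Slice one disk's daily SMART records into LSTM training sequences.

    smart: (T, F) array of daily SMART attributes for a single disk.
    days_to_failure: (T,) days remaining until the disk's recorded failure
        (np.inf for disks that never failed).
    Each sample spans sample_days consecutive days; its label is 1 if the
    window ends within predict_failure_days of the failure, else 0.
    """
    X, y = [], []
    for end in range(sample_days, smart.shape[0] + 1):
        X.append(smart[end - sample_days:end])
        y.append(1 if days_to_failure[end - 1] <= predict_failure_days else 0)
    return np.stack(X), np.array(y)
```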
Regarding feature selection: because deep learning models learn features automatically, and because only 21 of the dataset's 45 feature dimensions remain after filtering out features with variance 0, no further feature selection is performed in this paper. The selected features are: SMART_1_raw, SMART_4_raw, SMART_5_raw, SMART_7_raw, SMART_9_raw, SMART_12_raw, SMART_183_raw, SMART_184_raw, SMART_187_raw, SMART_188_raw, SMART_189_raw, SMART_190_raw, SMART_192_raw, SMART_193_raw, SMART_194_raw, SMART_197_raw, SMART_198_raw, SMART_199_raw, SMART_240_raw, SMART_241_raw, and SMART_242_raw.
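A zero-variance filter of this kind can be expressed with scikit-learn's VarianceThreshold (whose default threshold of 0 removes exactly the constant features); the toy 45-column matrix below is a stand-in, not the ZTE data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy stand-in: 45 raw SMART columns, one of them constant.
X_raw = np.random.rand(200, 45)
X_raw[:, 5] = 0.0  # a zero-variance column, as filtered out in the paper

selector = VarianceThreshold(threshold=0.0)  # drop features with variance 0
X_selected = selector.fit_transform(X_raw)
kept_columns = selector.get_support(indices=True)
print(X_selected.shape, kept_columns)        # (200, 44) and the kept indices
```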
C. Dataset balancing
As with other hard disk SMART datasets, the number of positive samples in the ZTE dataset is far less than the number of negative samples, which makes it difficult for machine learning algorithms to obtain sufficient positive-sample information when training the model; FDR is therefore relatively low, and it is necessary to balance the data. This paper tries four sample balancing methods, including ADASYN [14], SMOTE [15], ADASYN combined with ENN [16], and SMOTE combined with ENN, in order to compare their prediction results. After balancing, the numbers of positive and negative samples are both 50,000. In the evaluation section, we compare the influence of data balancing on the prediction effect.
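The imbalanced-learn library provides these methods directly; below is a minimal sketch of the best-performing combination, SMOTE with ENN cleaning, on stand-in data. Sequence inputs would need to be flattened to 2-D before resampling, a detail the paper does not discuss.

```python
import numpy as np
from imblearn.combine import SMOTEENN

# Toy imbalanced data standing in for the flattened ZTE samples.
X = np.random.rand(1000, 21)
y = np.array([1] * 50 + [0] * 950)  # far fewer positive (failure) samples

# SMOTE over-sampling followed by ENN under-sampling (cleaning).
X_bal, y_bal = SMOTEENN(random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))           # class counts after balancing
```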
D. Training model
Every disk goes through a process from health to failure, so the SMART data collected periodically at a fixed interval forms a time series. As a common algorithm for such data, the RNN model's memory generally spans no more than about 7 time steps, and data entered early gradually loses influence due to the vanishing gradient. A time window of 7 days or less of SMART attributes is far from sufficient to reflect the change of the disk state. Fortunately, LSTM is an improved version of RNN that works better on problems with time series characteristics. It stores memory through its cell-state "conveyor", which mitigates the vanishing gradient problem, so this paper can use a larger time window for disk prediction. In addition, for comparison, this paper also tried common algorithms, including RNN [10], AdaBoost [7], random forest [5], LOG [5], and SVM [5].
TABLE I. LSTM MODEL PARAMETERS

Layer | Type  | Output dimension
------|-------|-----------------
1     | LSTM  | 32
2     | LSTM  | 64
3     | LSTM  | 128
4     | Dense | 128
5     | Dense | 64
6     | Dense | 1
As shown in Table 1, this paper uses an N-to-1 LSTM model. The input is the data within sample_days, and the output is whether a failure will occur within 5 days: 1 if a failure is about to occur, otherwise 0. Considering the complexity of disk failure prediction rules, this paper builds a neural network composed of 3 LSTM layers and 3 Dense layers. The output dimensions of the first 5 layers are 32, 64, 128, 128, and 64, and the output dimension of the 6th layer is 1.
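Read from the text, the architecture could be written in Keras as follows; the activations, optimizer, and loss are our assumptions, since Table 1's full contents are not reproduced here.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

sample_days, n_features = 30, 21  # values used elsewhere in the paper

# 3 LSTM layers + 3 Dense layers with output dims 32, 64, 128, 128, 64, 1.
model = Sequential([
    LSTM(32, return_sequences=True, input_shape=(sample_days, n_features)),
    LSTM(64, return_sequences=True),
    LSTM(128),                       # N-to-1: returns only the last time step
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),  # probability of failure within 5 days
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```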
III. EVALUATIONS
In this section, training and testing are first carried out on the ZTE historical data set; all samples are divided into training and test sets in the proportion of 8:2. During testing, different data balancing methods, different sample_days sizes, different numbers of training rounds, and different algorithms are compared. Each sample is classified as failure or health. Then, over 7 months after launch in the ZTE data center, we predict for each disk whether a failure will occur in the next 5-7 days. In this section, FDR and FAR are used as evaluation indexes.
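The sketch below computes FDR and FAR from predictions, using the conventional definitions (FDR as the detection rate on failed samples, FAR as the false alarm rate on healthy ones); the paper's exact formulas are not given in this excerpt.

```python
from sklearn.metrics import confusion_matrix

def fdr_far(y_true, y_pred):
    """FDR (failure detection rate) and FAR (false alarm rate).

    FDR = TP / (TP + FN): fraction of failed disks correctly alarmed.
    FAR = FP / (FP + TN): fraction of healthy disks wrongly alarmed.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

# Example: fdr, far = fdr_far(y_test, (model.predict(X_test) > 0.5).ravel())
```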
A. Evaluations on historical data set
In this section, the original dataset is preprocessed with no balancing, ADASYN, SMOTE, ADASYN combined with ENN, and SMOTE combined with ENN for comparison. To improve efficiency, the number of training iterations is limited to 50 rounds, and sample_days is set to 30. The experimental results are shown in Table 2. With the features, model, and predict_failure_days held constant, the over-sampling methods improve the FDR significantly compared with no balancing, and combining over-sampling with under-sampling reduces the FAR, further improving the prediction effect. The best performer is SMOTE combined with ENN, with an FDR of 89.2% and a FAR of 9.3%, so this method is used for data balancing in subsequent experiments.