Big Data System Testing Method Based on Chaos
Engineering
Guo Chen, Guotao Bai, Chun Zhang and Juan Wang
China Mobile Information Technology Co., Ltd.
Beijing, China
guochen_cnmobile@163.com
Kang Ni and Zhi Chen
School of Computer, Nanjing University of Posts and
Telecommunications
Nanjing, Jiangsu Province, China
tznikang@163.com, chenz@njupt.edu.cn
Abstract—Big data systems usually comprise multiple
interdependent network elements and processes. As the number of
network elements and the complexity of networking grow, the
likelihood of system abnormalities also increases, and the
impact of faults on the system becomes difficult to assess. To
effectively discover potential defects in big data system
products and to address the problem of evaluating system
stability, chaos engineering is applied to big data systems, and
a big data system testing method based on chaos engineering is
proposed. The proposed method, which is fault-tolerant,
resilient, and reliable, mainly comprises the following
processes: exception injection, exception recovery, system
correctness verification, observable real-time data statistics,
and result reporting. The experimental results show that the
proposed chaos-engineering-based testing method can shorten the
version development cycle by 20% and exposes 21 hidden faults
caused by improper handling of abnormal scenarios, which
effectively improves the robustness and stability of the
distributed system.
Keywords- big data system; database system; chaos engineering;
chaos testing; reliability
I. INTRODUCTION
With the growing scale and complexity of big data
systems [1], system architectures are evolving toward
distributed designs. To guarantee the reliable operation of
software systems, high-availability deployment has become a
research hotspot in software system architecture design [2-3].
When a single-point or cluster failure occurs within the system,
the system's self-recovery and fault-tolerance capabilities are
degraded. Although researchers have recently made progress in
advanced architectures, high-quality code, and thorough testing
methods, distributed systems still cannot meet the requirements
of high availability and flexibility [4-5]. Chaos engineering
has emerged as a way to discover defects in distributed systems
in a timely manner.
Chaos engineering [6-8] refers to continuously simulating
various unknown faults that may occur in the production
environment of a distributed system and testing the system's
response to these faults, in order to verify that the system can
operate in a turbulent environment and to improve its stability.
The main difference between chaos engineering and other methods
such as plain fault injection is that chaos engineering [9] is a
practice for generating new information, while fault injection
is a fixed method for testing one specific case. Injecting
failures, e.g., communication delays and errors, is an effective
way to explore possible undesirable behaviors of complex
systems. Additionally, the exploratory testing performed in
chaos engineering can generate new knowledge about the system:
it not only tests known properties but also verifies the
behavior of the integrated system.
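As a minimal sketch of this idea, and not the method of any tool cited here, the following Python fragment injects communication delays and errors into a simulated channel and checks whether a steady-state hypothesis still holds. The names FlakyChannel, call_with_retry, and run_experiment, as well as the 95% success threshold, are illustrative assumptions.

```python
import random
import time


class FlakyChannel:
    """Hypothetical stand-in for a network link between two nodes."""

    def __init__(self, delay_s=0.0, error_rate=0.0, seed=0):
        self.delay_s = delay_s          # injected communication delay
        self.error_rate = error_rate    # probability of an injected error
        self._rng = random.Random(seed)  # seeded for repeatable experiments

    def send(self, payload):
        time.sleep(self.delay_s)
        if self._rng.random() < self.error_rate:
            raise ConnectionError("injected communication error")
        return payload


def call_with_retry(channel, payload, attempts=3):
    """Simulated system under test: a client that retries failed sends."""
    for _ in range(attempts):
        try:
            return channel.send(payload)
        except ConnectionError:
            continue
    return None  # all attempts failed


def run_experiment(channel, requests=200, min_success_ratio=0.95):
    """Check an assumed steady-state hypothesis: most requests succeed."""
    ok = sum(call_with_retry(channel, i) == i for i in range(requests))
    ratio = ok / requests
    return {
        "success_ratio": ratio,
        "hypothesis_holds": ratio >= min_success_ratio,
    }


if __name__ == "__main__":
    baseline = run_experiment(FlakyChannel())
    faulty = run_experiment(FlakyChannel(delay_s=0.001, error_rate=0.3))
    print("baseline:", baseline)
    print("with injected faults:", faulty)
```

Comparing the baseline run with the fault-injected run shows whether the retry logic keeps the success ratio above the assumed threshold; a run in which the hypothesis fails is the kind of new information that a fixed, single-case test would not necessarily reveal.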
The application of chaos engineering to software systems is
valuable for both theoretical research and practice. Ali
Basiri et al. built a platform named ChAP for running chaos
engineering experiments in the Netflix microservice
architecture. It mainly focuses on two failure modes, service
degradation and service failure, and can automatically generate
and execute chaos experiments [10]. Reference [11] proposed the
ChaosOrca framework, which performs fault injection for
containerized systems in a microservice environment. Reference [9]
proposed the ChaosMachine framework, which injects
exception-throwing behavior to verify the exception-handling
ability of Java services. In addition, there are practical
tools that can be deployed directly, e.g., Chaos Monkey,
ChaosBlade, and Chaos Toolkit [12-14]. Most existing tools
target cloud-native settings and are integrated into the
Kubernetes environment [15-16]. According to the
characteristics of physical machines, the tool developed in
this work provides a dedicated chaos experiment function and
supports injecting exceptions directly on physical machines. It
is lightweight and simple to install and deploy, and it combines
the application scenarios of big data systems to perform
exception injection and verification on real business
workloads. Recently, China Mobile and ZTE (Zhongxing
Telecom Equipment) have jointly developed a domestic big
data system product. During development, unit tests achieve
high line coverage to ensure that the logic code is exercised,
and numerous integration tests are deployed to guarantee that
the system works with other components. However, testing such a
distributed software system still faces the following
difficulties: 1) There are various types of network elements.
The big data system includes four types of network-element
nodes: management nodes, distributed transaction control nodes,
computing nodes, and storage nodes; 2) The physical network is
complex. In the telecommunications field, where distributed
data is commonly used, a system is usually deployed across data
centers, computer rooms, and