
Batch Insertion Strategy in a Distribution Database
Jintao Gao
School of Computer
Northwestern Polytechnical University
Xi’an 710129, China
gaojintao@mail.nwpu.edu.cn
Wenjie Liu, Hongtao Du and Xiaofang Zhang
School of Computer
Northwestern Polytechnical University
Xi’an 710129, China
{liuwenjie& zhangxiaofang & duhongtao}@nwpu.edu.cn
Abstract—Historical data produced by financial enterprises are
usually very large and often need to be transferred, partly or
wholly, from one table to another, which calls for a flexible
and efficient insertion strategy. Distributed systems are good at
handling massive data in the big data era. For high
performance and scalability, financial enterprises have recently
tended to handle and analyze their business data with distributed
systems instead of the traditional 'IOE' architecture. However,
distributed systems such as HBase are weak in SQL support: they
provide only simple programming interfaces, which do not
satisfy the data insertion requirements of financial enterprises. To
solve these problems, we propose a batch insertion
strategy (BIS). BIS consists of multiple
insertion strategies for inserting large data, a
threshold optimization technique for decreasing
network cost, and a redirection technique for reducing the
pressure on the system. BIS is implemented in OceanBase, a
distributed database system designed by
Alibaba Group. The experimental data come from the actual
business data of several financial enterprises, and the experimental
results show that the performance of BIS is roughly equal to that of
the existing value insertion in OceanBase and much better than that of
program insertion.
Keywords- distributed system; large data; batch insertion;
threshold optimization; redirection technology
I. INTRODUCTION
Data insertion is a basic function of traditional databases
and plays a very important role in processing financial business.
With the coming of the big data era, financial enterprises are
gradually abandoning the 'IOE' architecture (I represents IBM,
O represents Oracle, and E represents EMC) and tend to handle
massive data using distributed systems. However, SQL support in
distributed systems is not user-friendly and does not allow flexible
insertion of large data, which can block the normal business
processes of financial enterprises.
The large volumes of historical data produced by financial
enterprises need to be partly exported from one table and imported
into another. In traditional databases, insertion technology is
very mature. In a distributed environment, however,
finding a convenient, flexible and efficient method to insert
large data is quite challenging.
Google introduced a series of distributed systems and
architectures that have led the development of
distributed systems, such as GFS[2], MapReduce[3] and
Spanner[5]. However, their SQL-level insertion functions are
limited to inserting very small amounts of data or to importing
data with tools, which is not flexible.
To resolve these problems, we propose a batch insertion
strategy (BIS), which can flexibly and efficiently insert large
data into a distributed system. BIS is implemented in
OceanBase[13] (see Section II for its architecture). Although
OceanBase already has an insertion method, it supports
inserting only very small amounts of data. BIS builds on the
existing insertion method and achieves better results. The
contributions of this paper are as follows.
1. We study existing insertion strategies in depth, covering
both traditional and distributed databases, and propose the batch
insertion strategy.
2. To decrease network cost, we propose a threshold
optimization method.
3. To reduce the pressure on the system under highly
concurrent insert operations, we provide a redirection
technique.
4. Using actual data from financial enterprises as
experimental data, we conclude that with BIS, large
data can be inserted into OceanBase[13] normally, and that the
performance of BIS is nearly equal to that of the existing insertion
method and much better than that of the program insertion method.
BIS is also suitable for inserting large data into other
distributed systems.
II. RELATED CONCEPTS
A. Physical operator (Po)
A Po is a node of the physical operator tree that completes
a specific job[15], such as a sort or a join. Its main
operations are initializing itself and getting one row
from its children.
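As an illustration of the two operations described above, the following is a minimal Python sketch, not OceanBase's actual implementation. The class names (PhysicalOperator, ValuesScan, Sort), the method names (open, get_next_row), the helper drain and the tuple-based row model are all hypothetical and chosen only for this example.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[object, ...]  # assumption: a row is modeled as a plain tuple


class PhysicalOperator:
    """Hypothetical base class for one node (Po) of an operator tree."""

    def __init__(self, *children: "PhysicalOperator") -> None:
        self.children: List["PhysicalOperator"] = list(children)

    def open(self) -> None:
        """Initialize this operator; by default just initialize the children."""
        for child in self.children:
            child.open()

    def get_next_row(self) -> Optional[Row]:
        """Return the next row, or None once the operator is exhausted."""
        raise NotImplementedError


class ValuesScan(PhysicalOperator):
    """Leaf operator that emits rows from an in-memory list."""

    def __init__(self, rows: Sequence[Row]) -> None:
        super().__init__()
        self._rows = list(rows)
        self._pos = 0

    def open(self) -> None:
        self._pos = 0  # rewind to the first row

    def get_next_row(self) -> Optional[Row]:
        if self._pos >= len(self._rows):
            return None
        row = self._rows[self._pos]
        self._pos += 1
        return row


class Sort(PhysicalOperator):
    """Sort operator: reads every row from its child, then emits them in order."""

    def __init__(self, child: PhysicalOperator, key_index: int) -> None:
        super().__init__(child)
        self._key_index = key_index
        self._buffer: List[Row] = []
        self._pos = 0

    def open(self) -> None:
        super().open()  # initialize the child first
        child = self.children[0]
        self._buffer = []
        row = child.get_next_row()
        while row is not None:  # pull one row at a time from the child
            self._buffer.append(row)
            row = child.get_next_row()
        self._buffer.sort(key=lambda r: r[self._key_index])
        self._pos = 0

    def get_next_row(self) -> Optional[Row]:
        if self._pos >= len(self._buffer):
            return None
        row = self._buffer[self._pos]
        self._pos += 1
        return row


def drain(op: PhysicalOperator) -> List[Row]:
    """Initialize an operator and collect every row it produces."""
    op.open()
    rows: List[Row] = []
    row = op.get_next_row()
    while row is not None:
        rows.append(row)
        row = op.get_next_row()
    return rows
```

For example, drain(Sort(ValuesScan([(3, "c"), (1, "a")]), key_index=0)) returns [(1, "a"), (3, "c")]. Returning None to signal the end of input is a design choice of this sketch. A real engine would more likely use status codes or exceptions.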
B. Physical operator tree (Pot)
A Pot represents the execution semantics of an SQL statement[15],
and its nodes are Pos. Execution of the physical plan of an SQL
statement starts by initializing the root of the Pot and then
traverses the tree depth-first to initialize the other nodes. After
the whole tree has been initialized, rows are pulled one at a time
from the leaf nodes up to the root, iteratively. The
formal definition of a Pot is as follows.
The whole tree is defined as pot=T(V,E), where V is the set of
nodes of T and E is the set of relations between nodes in T, such as
the father-son (parent-child) relation.
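To make the definition pot=T(V,E) concrete, the sketch below stores V as a list of node names and E as (father, son) pairs, then computes the order in which nodes would be initialized when starting at the root and descending depth-first. The function names and the example plan (a sort over a join of two scans) are assumptions made for illustration and are not taken from OceanBase.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str]  # (father, son) relation between two nodes


def build_children(vertices: Sequence[str], edges: Sequence[Edge]) -> Dict[str, List[str]]:
    """Map every node in V to the ordered list of its sons, using E."""
    children: Dict[str, List[str]] = {v: [] for v in vertices}
    for father, son in edges:
        children[father].append(son)
    return children


def find_root(vertices: Sequence[str], edges: Sequence[Edge]) -> str:
    """The root is the only node that is nobody's son."""
    sons = {son for _, son in edges}
    roots = [v for v in vertices if v not in sons]
    if len(roots) != 1:
        raise ValueError("a tree must have exactly one root")
    return roots[0]


def init_order(vertices: Sequence[str], edges: Sequence[Edge]) -> List[str]:
    """Depth-first order: initialize a node, then each of its sons in turn."""
    children = build_children(vertices, edges)
    order: List[str] = []

    def visit(node: str) -> None:
        order.append(node)  # initialize the node itself first
        for son in children[node]:
            visit(son)  # then descend into each subtree

    visit(find_root(vertices, edges))
    return order


# Hypothetical plan: sort the result of joining two table scans.
V = ["sort", "join", "scan_a", "scan_b"]
E = [("sort", "join"), ("join", "scan_a"), ("join", "scan_b")]
print(init_order(V, E))  # ['sort', 'join', 'scan_a', 'scan_b']
```

Once every node has been initialized in this order, rows flow in the opposite direction: each father repeatedly asks its sons for one row until the leaves are exhausted.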
Foundation items: National Natural Science Foundation of China
(61672434); National High Technology Research and Development Program
(863) of China (2015AA015307); and Natural Science Basic Research Plan
in Shaanxi Province of China (No.2017JM6104).
978-1-5-97-7/1$31.00 ©201 IEEE