暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
Efficient time-interval data extraction in MVCC-based RDBMS_WWWJ2018.pdf
290
21页
1次
2021-11-29
免费下载
World Wide Web
https://doi.org/10.1007/s11280-018-0552-7
Efficient time-interval data extraction in MVCC-based
RDBMS
Haixiang Li
1
· Zhanhao Zhao
2,3
· Yijian Cheng
2,3
·
Wei Lu
2,3
· Xiaoyong Du
2,3
· Anqun Pan
1
Received: 16 December 2017 / Revised: 4 March 2018 / Accepted: 21 March 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract Account reconciliation is the core business in banks and game companies. It reg-
ularly examines the account balance with the bank or expense statement for every user and
reports the daily, weekly, or monthly balance. Once an account imbalance occurs, it is nec-
essary to efficiently trace the transactions that possibly destroy the account balances. To
help efficiently trace this kind of transactions, in this paper, we investigate the problem of
doing efficient time-interval data extraction in MVCC-based RDBMS, i.e., extracting the
incremental data that are valid between a given time interval in MVCC-based RDBMS. To
this end, we propose a snapshot-based method to extract incremental data based on the fact
that each record is inherently associated with lifetime, indicating whether the record can be
This article belongs to the Topical Collection: Special Issue on Web and Big Data
Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao
Wei Lu
lu-wei@ruc.edu.cn
Haixiang Li
blueseali@tencent.com
Zhanhao Zhao
zhanhaozhao@ruc.edu.cn
Yijian Cheng
yijiancheng@ruc.edu.cn
Xiaoyong Du
duyong@ruc.edu.cn
Anqun Pan
aaronpan@tencent.com
1
Tencent Inc., Shenzhen, China
2
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China,
Beijing, China
3
School of Information, Renmin University of China, Beijing, China
World Wide Web
accessed or not for a given time interval. We elaborate how to integrate our method into
MySQL, an open-sourced RDBMS, and propose a declarative way to fetch the incremen-
tal data. Several optimization techniques are proposed to boost the extraction performance.
Extensive experiments are conducted over the standardized Sysbench benchmark to show
that our proposed method is robust and efficient.
Keywords RDBMS · MVCC · Incremental data extraction · Snapshot
1 Introduction
Account reconciliation is of great importance to corporate service charging and banking
deposit systems. For example, in the Tencent’s service charging system, a registered QQ
[14]orWeChat[22] user can recharge his or her account so that he or she can utilize the paid
business offerings that the system provides, like purchasing game equipments, watching
digital fee TV, etc. From the perspective of the service charging systems, any access to the
paid business offerings must guarantee that the account reconciliation is correct, i.e., any
change of the account balances must be in consistency with the expense statements. To do
this, the system is required to do regular account reconciliation and reports the daily, weekly,
or monthly balance.
In some cases, however, due to the unpredictable system failures, results of doing the
account reconciliation could be incorrect, i.e., changes of the account balances is not in
consistency with the expense statements. Account imbalance can bring serious harms to
financial systems, which could result in a loss of money or even a loss of customers. In this
case, it is of great importance to dig out the harmful behaviors by tracing back to the trans-
actions in the expense statements that possibly destroy the account balances. To this end, in
this paper, we investigate the problem, namely time-interval data extraction, which is infor-
mally defined to extract the incremental data that are valid between a given time interval. By
separating the time into a sequence of intervals and doing the account reconciliation with
the help of the time-interval data extraction, once an account reconciliation in a time interval
is checked to be incorrect, then transactions that occur in this time period are considered as
suspicious transactions. For example, if we find the suspicious transactions occurring dur-
ing 10:00 am and 11:00 am in Sep. 17, 2017, then we simply extract the transactions during
10:00 am and 11:00 am in Sep. 17, 2017 and then do account reconciliation to figure out
the harmful transactions.
Execution of the time-interval data extraction tasks in RDBMS is not trivial. The reason
is two-fold. First, implementation of time-interval data extraction should be transparent to
users. One may think that by building an auxiliary relation recording all updates of account
balance and the update time, it is able to do time-interval extraction by checking the update
time based on the auxiliary relation. However, it not only incurs additional maintenance
cost and modifies the database schema, but also degrades the system throughput since any
update operation on the original relations will result in a cascading update of the auxiliary
relation. Second, execution of the time-interval data extraction tasks should be easy to use.
Consider that SQL has become a standard query language in existing commercial and open
source RDBMS. Using SQL to execute the time-interval data extraction tasks is preferred.
We investigate the problem of doing interval data extraction in MVCC-based RDBMS.
MVCC is widely adopted as a concurrency control mechanism in existing RDBMS, includ-
ing Oracle, MySQL and PostgreSQL [19], etc. In MVCC mechanism, (1) each record is
World Wide Web
associated with multiple versions. More specifically, each insertion will generate a new ver-
sion of the record. Each deletion will set a deletion flag in the latest version. Each update
will set a deletion flag in the latest version and generate a new version of the record; (2)
each version is associated with a transaction ID that operates (including insert, delete and
update) the record. The transaction ID and the start time of the transaction are interchange-
able; (3) a snapshot refers to the transaction states of the database system, and records all
alive, completed transaction IDs [6]. To process a regular SELECT-FROM-TABLE query,
the RDBMS engine will generate a snapshot for this query, and each record version, which is
accessible by comparing the snapshot with the latest transaction ID that operates the record,
is returned.
Consider the problem of interval data extraction in which we target to extract records that
are accessible between time t
i
and t
j
. Direct applying snapshot to extract interval data is
infeasible. The reason is two-fold. First, different from the regular SELECT-FROM-TABLE
query, no snapshot is still preserved at time t
i
or time t
j
, and hence it is unable to verify
whether a record is accessible. Second, even we can make effort to rebuild the snapshots, a
judgment mechanism to verify the record is necessary. To address the above two issues, we
first propose a snapshot reconstruction mechanism to preserve the transaction states during
the runtime, and then present a carefully designed record visibility judgment mechanism,
to extract the time-interval incremental data. To boost the query performance, we propose
an index-based method to skip unnecessary pages to reduce the I/O cost. To sum up, our
contributions are listed as follows:
We propose a new type of query pattern, named as time-interval data extraction, which
targets to extract incremental data for any given time interval in MVCC-based RDBMS.
We propose a novel snapshot-based method, with a carefully designed record visibility
judgment mechanism, to extract the time-interval incremental data. Various optimiza-
tions are proposed to reduce the I/O cost by skipping the pages without containing any
records that are the results in the time-interval data extraction queries.
We propose a declarative way, using SQL, to extract the time-interval data and provide
a customizable parameter to fetch the inserted, updated and deleted data in the time-
interval data extraction.
We implement our proposed method in MySQL, a widely used open source RDBMS.
Extensive experiments are conducted over the standardized Sysbench [20] benchmark,
which demonstrates our proposed method is robust and efficient.
The rest of the paper is organized as follows. Section 2 defines our problem, discusses exist-
ing incremental data extraction methods, and reviews the MVCC mechanism. Section 3
gives an overview of our proposed method. Section 4 describes the implementation in
details. Experimental results are discussed in Section 5. Section 6 concludes this paper.
2 Preliminaries
2.1 Problem definition
Let R be a relation with n records. Formally, R is represented as {r
1
,r
2
, ..., r
n
}.Giventwo
timestamps t
i
, t
j
(t
i
<t
j
), our objective is to efficiently identify the incremental data of
relation R between time t
i
and t
j
, i.e., the data inserted, deleted or updated in this certain
time-interval would be extracted. Let S(R, t
i
) represent the collection of records in R that
of 21
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。