Efficient time-interval data extraction in MVCC-based RDBMS_WWWJ2018.pdf

onboard

290

21页

1次

2021-11-29

免费下载

World Wide Web

https://doi.org/10.1007/s11280-018-0552-7

Efficient time-interval data extraction in MVCC-based

RDBMS

Haixiang Li

· Zhanhao Zhao

2,3

· Yijian Cheng

2,3

Wei Lu

2,3

· Xiaoyong Du

2,3

· Anqun Pan

Received: 16 December 2017 / Revised: 4 March 2018 / Accepted: 21 March 2018

Abstract Account reconciliation is the core business in banks and game companies. It reg-

ularly examines the account balance with the bank or expense statement for every user and

reports the daily, weekly, or monthly balance. Once an account imbalance occurs, it is nec-

essary to efficiently trace the transactions that possibly destroy the account balances. To

help efficiently trace this kind of transactions, in this paper, we investigate the problem of

doing efficient time-interval data extraction in MVCC-based RDBMS, i.e., extracting the

incremental data that are valid between a given time interval in MVCC-based RDBMS. To

this end, we propose a snapshot-based method to extract incremental data based on the fact

that each record is inherently associated with lifetime, indicating whether the record can be

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

 Wei Lu

lu-wei@ruc.edu.cn

Haixiang Li

blueseali@tencent.com

Zhanhao Zhao

zhanhaozhao@ruc.edu.cn

Yijian Cheng

yijiancheng@ruc.edu.cn

Xiaoyong Du

duyong@ruc.edu.cn

Anqun Pan

aaronpan@tencent.com

Tencent Inc., Shenzhen, China

Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China,

Beijing, China

School of Information, Renmin University of China, Beijing, China

World Wide Web

accessed or not for a given time interval. We elaborate how to integrate our method into

MySQL, an open-sourced RDBMS, and propose a declarative way to fetch the incremen-

tal data. Several optimization techniques are proposed to boost the extraction performance.

Extensive experiments are conducted over the standardized Sysbench benchmark to show

that our proposed method is robust and efficient.

Keywords RDBMS · MVCC · Incremental data extraction · Snapshot

1 Introduction

Account reconciliation is of great importance to corporate service charging and banking

deposit systems. For example, in the Tencent’s service charging system, a registered QQ

[14]orWeChat[22] user can recharge his or her account so that he or she can utilize the paid

business offerings that the system provides, like purchasing game equipments, watching

digital fee TV, etc. From the perspective of the service charging systems, any access to the

paid business offerings must guarantee that the account reconciliation is correct, i.e., any

change of the account balances must be in consistency with the expense statements. To do

this, the system is required to do regular account reconciliation and reports the daily, weekly,

or monthly balance.

In some cases, however, due to the unpredictable system failures, results of doing the

account reconciliation could be incorrect, i.e., changes of the account balances is not in

consistency with the expense statements. Account imbalance can bring serious harms to

financial systems, which could result in a loss of money or even a loss of customers. In this

case, it is of great importance to dig out the harmful behaviors by tracing back to the trans-

actions in the expense statements that possibly destroy the account balances. To this end, in

this paper, we investigate the problem, namely time-interval data extraction, which is infor-

mally defined to extract the incremental data that are valid between a given time interval. By

separating the time into a sequence of intervals and doing the account reconciliation with

the help of the time-interval data extraction, once an account reconciliation in a time interval

is checked to be incorrect, then transactions that occur in this time period are considered as

suspicious transactions. For example, if we find the suspicious transactions occurring dur-

ing 10:00 am and 11:00 am in Sep. 17, 2017, then we simply extract the transactions during

10:00 am and 11:00 am in Sep. 17, 2017 and then do account reconciliation to figure out

the harmful transactions.

Execution of the time-interval data extraction tasks in RDBMS is not trivial. The reason

is two-fold. First, implementation of time-interval data extraction should be transparent to

users. One may think that by building an auxiliary relation recording all updates of account

balance and the update time, it is able to do time-interval extraction by checking the update

time based on the auxiliary relation. However, it not only incurs additional maintenance

cost and modifies the database schema, but also degrades the system throughput since any

update operation on the original relations will result in a cascading update of the auxiliary

relation. Second, execution of the time-interval data extraction tasks should be easy to use.

Consider that SQL has become a standard query language in existing commercial and open

source RDBMS. Using SQL to execute the time-interval data extraction tasks is preferred.

We investigate the problem of doing interval data extraction in MVCC-based RDBMS.

MVCC is widely adopted as a concurrency control mechanism in existing RDBMS, includ-

ing Oracle, MySQL and PostgreSQL [19], etc. In MVCC mechanism, (1) each record is

World Wide Web

associated with multiple versions. More specifically, each insertion will generate a new ver-

sion of the record. Each deletion will set a deletion flag in the latest version. Each update

will set a deletion flag in the latest version and generate a new version of the record; (2)

each version is associated with a transaction ID that operates (including insert, delete and

update) the record. The transaction ID and the start time of the transaction are interchange-

able; (3) a snapshot refers to the transaction states of the database system, and records all

alive, completed transaction IDs [6]. To process a regular SELECT-FROM-TABLE query,

the RDBMS engine will generate a snapshot for this query, and each record version, which is

accessible by comparing the snapshot with the latest transaction ID that operates the record,

is returned.

Consider the problem of interval data extraction in which we target to extract records that

are accessible between time t

and t

. Direct applying snapshot to extract interval data is

infeasible. The reason is two-fold. First, different from the regular SELECT-FROM-TABLE

query, no snapshot is still preserved at time t

or time t

, and hence it is unable to verify

whether a record is accessible. Second, even we can make effort to rebuild the snapshots, a

judgment mechanism to verify the record is necessary. To address the above two issues, we

first propose a snapshot reconstruction mechanism to preserve the transaction states during

the runtime, and then present a carefully designed record visibility judgment mechanism,

to extract the time-interval incremental data. To boost the query performance, we propose

an index-based method to skip unnecessary pages to reduce the I/O cost. To sum up, our

contributions are listed as follows:

– We propose a new type of query pattern, named as time-interval data extraction, which

targets to extract incremental data for any given time interval in MVCC-based RDBMS.

– We propose a novel snapshot-based method, with a carefully designed record visibility

judgment mechanism, to extract the time-interval incremental data. Various optimiza-

tions are proposed to reduce the I/O cost by skipping the pages without containing any

records that are the results in the time-interval data extraction queries.

– We propose a declarative way, using SQL, to extract the time-interval data and provide

a customizable parameter to fetch the inserted, updated and deleted data in the time-

interval data extraction.

– We implement our proposed method in MySQL, a widely used open source RDBMS.

Extensive experiments are conducted over the standardized Sysbench [20] benchmark,

which demonstrates our proposed method is robust and efficient.

The rest of the paper is organized as follows. Section 2 defines our problem, discusses exist-

ing incremental data extraction methods, and reviews the MVCC mechanism. Section 3

gives an overview of our proposed method. Section 4 describes the implementation in

details. Experimental results are discussed in Section 5. Section 6 concludes this paper.

2 Preliminaries

2.1 Problem definition

Let R be a relation with n records. Formally, R is represented as {r

, ..., r

}.Giventwo

timestamps t

, t

), our objective is to efficiently identify the incremental data of

relation R between time t

and t

, i.e., the data inserted, deleted or updated in this certain

time-interval would be extracted. Let S(R, t

) represent the collection of records in R that

of 21

免费下载

tcaplusdb paper

相关文档

评论