T-SQL A Lightweight Implementation to Enable Built-in Temporal Support in MVCC-based RDBMSs——IEEETKDE.pdf

onboard

237

14页

0次

2021-11-29

免费下载

information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI

10.1109/TKDE.2021.3081717, IEEE Transactions on Knowledge and Data Engineering

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

T-SQL: A Lightweight Implementation to Enable

Built-in Temporal Support in MVCC-based

RDBMSs

Zhanhao Zhao, Wei Lu, Hongyao Zhao, Zongyan He, Haixiang Li, Anqun Pan, Xiaoyong Du

Abstract—The adoption of temporal expressions into SQL:2011 has continuously driven the extensions of temporal support in

relational database systems (a.b.a. RDBMSs). In this paper, we present T-SQL, a lightweight yet efﬁcient built-in temporal

implementation in RDBMSs. T-SQL entirely relies on multi-version concurrency control (MVCC), widely adopted in RDMBSs, to

manage temporal data. For temporal data, current records are maintained in legacy databases, and historical records, i.e., previous

versions of current records (if any), which used to be periodically reclaimed are separately maintained in KV stores. To enable temporal

query processing under SQL:2011, we extend the query engine in legacy RDBMSs to support query processing over either current

records or historical records or both. Further, regarding temporal data are ever-increasing, we propose various optimizations to reduce

the storage overhead of KV stores while keeping efﬁcient query performance. We elaborate a publicly available implementation on how

to integrate T-SQL into both centralized and distributed RDBMSs. We conduct extensive experiments on both YCSB and TPC-series

benchmarks by comparing T-SQL with other temporal database systems. The results show that T-SQL is both lightweight and efﬁcient.

Index Terms—RDBMS, Temporal Database, MVCC, SQL:2011, KV Store

1 INTRODUCTION

T is of great importance to maintain not only currently valid

data, but also the history of all data changes. Typical examples

include forensic analysis and legal requirements to store data for

a certain number of years, retrospective and trend data analysis

to offer the asset use details as of two years ago, and logical

corruption recovery to rewind tuples/relations/databases as of a

particular point in time [32].

The study on temporal data management has been going on for

decades, but only recently has some progress been made. Temporal

data management by applications brings prohibitively expensive

development and maintenance overhead. Instead, extensions to

support temporal data management in conventional RDBMSs have

been extensively explored. Nevertheless, the temporal support of-

fered by commercially available software tools and systems is still

quite limited. Until recently, the adoption of temporal expressions

into SQL:2011 makes the major DBMSs start to provide built-

in temporal support. The temporal extensions typically include:

(1) extension of non-temporal relations to temporal relations, (2)

uniﬁed management of both current data and historical data in a

single database, (3) query rewrite functionality by expressing the

semantics of temporal queries in equivalent conventional (non-

temporal) SQL statements.

In SQL:2011, the temporal features mainly include temporal

relation deﬁnitions, temporal queries, and others (e.g., temporal

constraints). As compared to the non-temporal counterpart, a

• Zhanhao Zhao, Wei Lu, Hongyao Zhao, Zongyan He, and Xiaoyong

Du are with the Key Laboratory of Data Engineering and Knowledge

Engineering, Ministry of Education, China, and School of Information,

Renmin University of China, China.

E-mail: {zhanhaozhao,lu-wei,hongyaozhao,zongyanhe,duyong}@ruc.edu.cn

• Haixiang Li and Anqun Pan are with Tencent Inc., China.

E-mail: {blueseali,aaronpan}@tencent.com

TABLE 1

Gaming player account balance

ID Player Bal Valid Time Transaction Time

1.1

James 0 [2018-05-20,2018-10-21) [2018-05-20,2018-10-21)

1.2

James 50 [2018-10-21,2018-11-01) [2018-10-21,2018-11-01)

1.3

James 1000 [2018-11-01,2018-11-12) [2018-11-01,2018-11-12)

1.4

James 2000 [2018-11-12,2019-05-11) [2018-11-12,∞)

2.1

David 150 [2018-10-20,2018-10-20) [2018-10-20,2018-10-20)

2.2

David 200 [2018-10-20,2019-04-19) [2018-10-20,∞)

3.1

Jack 200 [2018-11-08,2019-05-07) [2018-11-08,∞)

temporal relation associates either valid time, or transaction time,

or both. Either valid time or transaction time is a closed-open

period (i.e., time interval) [s, t), with s as start time and t as end

time. Valid time is a time period during which a fact was/is/will be

true in reality, and transaction time is a time period during which

a fact is/was recorded (i.e., current/historical) in the database.

Example 1. Table 1 shows an example of the account balance for

gaming players. Each player has one current record and perhaps

one or several historical records. For example, r

1.4

, r

2.2

, r

3.1

are

current records of players James, David, and Jack, respectively,

while the remaining are the historical records of either James or

David. Assume each recharge increases the validity of account

balance by 6 months, e.g., the valid time of current record r

1.4

[2018-11-12,2019-05-11). Besides, the transaction time of each

record r indicates the timestamps on which r is created and

deleted/updated if any (otherwise the timestamp is set to ∞). For

instance, it can be observed that r

1.4

is created on ‘2018-11-12’

and has not yet been deleted/updated.

Thus far, we have witnessed a big burst of temporal support

Authorized licensed use limited to: Tencent. Downloaded on July 05,2021 at 16:23:13 UTC from IEEE Xplore. Restrictions apply.

information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI

10.1109/TKDE.2021.3081717, IEEE Transactions on Knowledge and Data Engineering

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2

in legacy database systems, including Oracle [4], Teradata [8],

MariaDB [3], SQL Server [6]. However, they suffer from either

limited expressiveness or poor performance, or both. First, the

temporal data model is inadequate. It is often reported that

DBAs are ﬁred or forced to run away due to an accidental

deletion of valuable business data. This kind of unexpectedly

committed transactions, namely logical data corruption, is difﬁcult

to avoid in real applications. Yet, the temporal data model deﬁned

in SQL:2011 does not explicitly support logical data corruption

recovery. Technically, to address this issue, it is necessary to

extend the temporal data model, based on which we ﬁrst ﬁgure

out the erroneous transaction, and then apply a sequence of

reverse operations over the records involved in the erroneous

transaction. Second, the performance of legacy databases suffers

from the synchronization overhead by introducing the built-in

temporal support. Existing temporal implementations store his-

tory and current records separately so as to skip over historical

records for current data query processing, which is deemed to

the dominant temporal queries. However, this separation degrades

the throughput of conventional transactional workloads because

any update/delete of a record r will cause a synchronization that

transfers r from the current relation to the historical relation. In

this way, any update/delete of a record in the current relation will

require exclusive access to the historical relation, leading to a

signiﬁcant drop in the whole system’s throughput. Third, temporal

data are maintained in an append-only mode, causing an ever-

increasing size. The overhead of maintaining a large volume of

temporal data degrades the query performance, and hence it is

of great necessity to develop a scalable storage engine to enable

efﬁcient query processing.

To address the above issues, in this paper, we propose T-SQL,

a lightweight yet efﬁcient built-in temporal implementation. We

make the following contributions.

• We present a new temporal data model. As compared to the non-

temporal counterpart, besides the valid/transaction time period

deﬁned in SQL:2011, a temporal relation under the new model

has two transaction IDs. One ID corresponds to the transaction that

creates the record, and the other ID corresponds to the transaction

that deletes/updates the record. By introducing the transaction IDs,

it is able to identify all records that are created/updated/deleted

in the same transaction, thus achieving the recovery of logical

data corruption. More importantly, temporal join queries taking

the transaction time as the join key can be enhanced by taking the

transaction ID as the join key instead.

• We propose a built-in temporal implementation with various

optimizations, and encapsulate it into PostgreSQL [5] and Green-

plum [2]. We have released our implementation publicly available.

First, our implementation completely relies on the MVCC

mechanism and is lightweight. Like other temporal implemen-

tations in legacy RDBMSs, we manage current and historical

records separately as well. In legacy RDBMSs, any update/delete

of a current record results in synchronous migration of newly

generated historical records to the historical relation. Instead,

we propose an asynchronous data migration strategy, i.e., any

update/delete of a current record does not cause an immediate data

migration. In our design, all newly generated historical records

are migrated to the historical relation only when the database

system starts to reclaim the storage occupied by records that

are deleted or obsoleted, which is also known as VACUUM in

PostgreSQL and PURGE in MySQL. This late data migration

transfers historical data in batch and is almost non-invasive to the

originally legacy RDBMSs. Further, we utilize the KV store with

various optimizations to organize historical relations, in which

only data changes of historical records of the same entity are

maintained, thus reducing the size of storage space.

Second, in response to the challenge that the transaction time

of each record is expensive to set in legacy RDBMSs, we build an

efﬁcient transaction status manager. It maintains the status of each

transaction, including the transaction commit time, in a transaction

log. A special design of the manager makes the retrieval of the

commit time of a given transaction ID at most one I/O cost. During

the data migration, we update their transaction time for each newly

generated historical record based on its transaction ID. For the

current and historical records that have not yet been transferred to

the historical relation, we are still able to obtain the transaction

time based on their transaction IDs via the manager.

Third, to support temporal query processing, we extend the

parser, query executor, and storage engine of legacy RDBMSs.

Our temporal implementation supports all temporal features de-

ﬁned in SQL:2011. For valid-time qualiﬁers in the temporal

query, we transform the temporal operations into equivalent non-

temporal operations; while for transaction-time qualiﬁers, we

provide a native operator to retrieve current and historical data

with various optimizations.

• We conduct extensive experiments on both real and synthetic

benchmarks by comparing T-SQL with Oracle, SQL Server, and

MariaDB. The results show that T-SQL almost has minimal per-

formance loss (only 7% on average) by introducing the temporal

features, and performs the best for most of the temporal queries.

The rest of the paper is organized below. Section 2 discusses

related work. Section 3 formalizes our new temporal data model.

Section 4 outlines the system architecture. Section 5 elaborates

temporal query processing and storage management. Section 6

presents the implementation. Section 7 reports the experimental

results, and Section 8 concludes the paper.

2 RELATED WORK

The study on temporal data management has been going on for

decades, mainly in the ﬁelds of data model development, query

processing, and implementations.

Early work until 1990s mainly focused on the consensus

glossary of concepts for data modeling [16], [17], [18], [19],

[20]. At this stage, the contributions mainly include temporal

relation deﬁnitions, temporal constraints, and temporal queries. As

compared to the non-temporal counterpart, a temporal relation as-

sociates time, which is multi-dimensional, and can be either valid

time or transaction time, or other types of time. The semantics of

integrity constraints in the temporal data model is also enriched

[12], [39]. Entity integrity does not enforce the uniqueness of the

primary key. Instead, it requires that no intersection exists between

valid times of any two records with the same primary key; while

for reference integrity, there must exist one matching record in

the parent relation whose valid time contains the valid time of the

child record. As compared to the regular query syntax, temporal

queries are formulated by expressing ﬁltering conditions as period

predicates [10], [25].

After attempts with many years to build the implementation

on top of legacy RDBMSs, such as Oracle [4], DB2 [1], and

Ingres [40], it is well recognized that the cost, brought by

the development and maintenance of application programs, is

prohibitively expensive. For this reason, since the late 1990s,

Authorized licensed use limited to: Tencent. Downloaded on July 05,2021 at 16:23:13 UTC from IEEE Xplore. Restrictions apply.

of 14

免费下载

tcaplusdb paper

关注

评论