Greenplum: A Hybrid Database for Transactional and Analytical Workloads

ZHENGHUA LYU, HUAN HUBERT ZHANG, GANG XIONG, HAOZHOU WANG, GANG GUO, JINBAO CHEN, ASIM PRAVEEN, YU YANG, XIAOMING GAO, ASHWIN AGRAWAL, ALEXANDRA WANG, WEN LIN, JUNFENG YANG, HAO WU, XIAOLIANG LI, FENG GUO, JIANG WU, JESSE ZHANG, VENKATESH RAGHAVAN, VMware

Demand for enterprise data warehouse solutions to support real-time Online Transaction Processing (OLTP) queries as well as long-running Online Analytical Processing (OLAP) workloads is growing. Greenplum database is traditionally known as an OLAP data warehouse system with limited ability to process OLTP workloads. In this paper, we augment Greenplum into a hybrid system to serve both OLTP and OLAP workloads. The challenge we address here is to achieve this goal while maintaining the ACID properties with minimal performance overhead. In this effort, we identify the engineering and performance bottlenecks such as the under-performing restrictive locking and the two-phase commit protocol. Next we solve the resource contention issues between transactional and analytical queries. We propose a global deadlock detector to increase the concurrency of query processing. When transactions that update data are guaranteed to reside on exactly one segment, we introduce one-phase commit to speed up query processing. Our resource group model introduces the capability to separate OLAP and OLTP workloads into more suitable query processing modes. Our experimental evaluation on the TPC-B and CH-benCHmark benchmarks demonstrates the effectiveness of our approach in boosting the OLTP performance without sacrificing the OLAP performance.

Additional Key Words and Phrases: Database, Hybrid Transaction and Analytical Process

1 INTRODUCTION

Greenplum is an established large-scale data warehouse system with both enterprise and open-source deployments. The massively parallel processing (MPP) architecture of Greenplum splits the data into disjoint parts that are stored across individual worker segments. This is similar to other large-scale data warehouse systems such as Oracle Exadata [ ], Teradata [ ], and Vertica [ ], including DWaaS systems such as AWS Redshift [ ], AnalyticDB [ ], and BigQuery [ ]. These data warehouse systems are able to efficiently manage and query petabytes of data in a distributed fashion. In contrast, distributed relational databases such as CockroachDB [ ] and Amazon RDS [ ] have focused their efforts on providing a scalable solution for storing terabytes of data and fast processing of transactional queries.

Greenplum users interact with the system through a coordinator node, and the underlying distributed architecture is transparent to the users. For a given query, the coordinator optimizes it for parallel processing and dispatches the generated plan to the segments. Each segment executes the plan in parallel and, when needed, shuffles tuples among segments. This approach achieves significant speedup for long-running analytical queries. Results are gathered by the coordinator and are then relayed to clients. DML operations can be used to modify data hosted in the worker segments. Atomicity is ensured via a two-phase commit protocol. Concurrent transactions are isolated from each other using distributed snapshots. Greenplum supports append-optimized column-oriented tables with a variety of compression algorithms. These tables are well suited for bulk write and read operations, which are typical in OLAP workloads.
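The dispatch-and-gather flow described above can be illustrated with a minimal sketch. It is not Greenplum code: the segment count, the modulo distribution rule, and every function name are assumptions made for illustration, and the tuple shuffle between segments is left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: int) -> int:
    """Pick the segment that stores a row (assumed: simple modulo rule)."""
    return key % NUM_SEGMENTS


def distribute(rows):
    """Split rows into disjoint per-segment partitions by distribution key."""
    segments = defaultdict(list)
    for key, amount in rows:
        segments[segment_for(key)].append((key, amount))
    return segments


def segment_partial_sum(partition):
    """The plan fragment each segment runs, on its local data only."""
    return sum(amount for _, amount in partition)


def coordinator_sum(segments):
    """Dispatch the fragment to all segments in parallel, then gather."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_partial_sum, segments.values()))
    return sum(partials)


rows = [(key, key * 10) for key in range(1, 9)]
segments = distribute(rows)
total = coordinator_sum(segments)
```

Each row lands on exactly one segment, so the partial sums never double-count, and the coordinator only has to combine one value per segment.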

Figure 1 shows a typical data processing workflow which involves operational databases managing hot (most valuable) transactional data for a short period of time. This data is then periodically transformed, using Extract Transform and Load (ETL) tools, and loaded into a data warehouse for further analysis. There is a growing desire to reduce the complexity of maintaining disparate systems [ ]. In this vein, users would prefer to have a single system that can cater to both OLAP and OLTP workloads. In other words, such a system needs to be highly responsive for point queries as well as scalable for long-running analytical queries. This desire is well established in the literature, and such a system is termed a hybrid transactional and analytical processing (HTAP) system [11, 19, 20, 25].

Fig. 1. A Typical Enterprise Data Processing Workflow

Author’s address: Zhenghua Lyu, Huan Hubert Zhang, Gang Xiong, Haozhou Wang, Gang Guo, Jinbao Chen, Asim Praveen, Yu Yang, Xiaoming Gao, Ashwin Agrawal, Alexandra Wang, Wen Lin, Junfeng Yang, Hao Wu, Xiaoliang Li, Feng Guo, Jiang Wu, Jesse Zhang, Venkatesh Raghavan, VMware.

arXiv:2103.11080v3 [cs.DB] 14 May 2021

To meet the needs of Greenplum’s enterprise users, we propose augmenting Greenplum into an HTAP system. In this work, we focus on the following areas: 1) improving data loading into a parallel system with ACID guarantees, 2) reducing response time for servicing point queries prevalent in OLTP workloads, and 3) introducing Resource Group, which isolates resources among different kinds of workloads or user groups.

Greenplum was designed with OLAP queries as the first-class citizen, while OLTP workloads were not the primary focus. The two-phase commit poses a performance penalty for transactions that update only a few tuples. The heavy locking imposed by the coordinator, intended to prevent distributed deadlocks, proves overly restrictive. This penalty disproportionately affects short-running queries. As illustrated in Figure 2, locking takes more than 25% of the query running time on a 10-second sample with a small number of connections. When the number of concurrent connections exceeds 100, the locking time becomes unacceptable.
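To make the two-phase commit penalty concrete, the sketch below contrasts it with the one-phase commit that the abstract introduces for transactions whose writes reside on exactly one segment. It is a simplified model under stated assumptions, not Greenplum's implementation: the `Segment` class, the log strings, and the round-trip counts are all hypothetical, and failure handling is reduced to a single abort path.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a worker segment; records what it was told."""
    name: str
    log: list = field(default_factory=list)

    def prepare(self) -> bool:
        self.log.append("PREPARE")
        return True  # this sketch always votes yes

    def commit_prepared(self) -> None:
        self.log.append("COMMIT PREPARED")

    def commit(self) -> None:
        self.log.append("COMMIT")

    def abort(self) -> None:
        self.log.append("ABORT")


def commit_transaction(writers):
    """Commit on the segments a transaction wrote to.

    Returns (protocol, coordinator-to-segment rounds).
    """
    if not writers:
        return "read-only", 0
    if len(writers) == 1:
        # All written data lives on one segment: a single commit round
        # is enough, because no other participant can disagree.
        writers[0].commit()
        return "one-phase", 1
    # Several writers: every segment must vote before any may commit.
    votes = [seg.prepare() for seg in writers]
    if not all(votes):
        for seg in writers:
            seg.abort()
        return "aborted", 2
    for seg in writers:
        seg.commit_prepared()
    return "two-phase", 2


seg_a, seg_b, seg_c = Segment("a"), Segment("b"), Segment("c")
single = commit_transaction([seg_a])
multi = commit_transaction([seg_b, seg_c])
```

In this model a single-segment update saves one full round between coordinator and segment, which matters most for short transactions where the commit itself dominates the running time.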

Our key contributions in augmenting Greenplum into an HTAP system can be summarized as follows:

• We identify challenges in transforming an OLAP database system into an HTAP database system.

• We propose a global deadlock detector to reduce the locking overhead and improve OLTP response time without sacrificing performance for OLAP workloads.

• We speed up transactions that are guaranteed to update only data resident on exactly one segment by switching to a one-phase commit protocol.

• We develop a new resource isolation component to manage OLTP and OLAP workloads, which can avoid resource conflicts in high-concurrency scenarios.

• We conduct a comprehensive performance evaluation on multiple benchmark data sets. The results demonstrate that the HTAP version of Greenplum performs on par with traditional OLTP databases while still offering the capacity of real-time computing on highly concurrent and mixed workloads.

Organisation. In Section 2 we review related work, followed by a detailed description of Greenplum’s MPP architecture and concepts in Section 3. Section 4 details the design and implementation of the global deadlock detection. Section 5 demonstrates distributed transaction management in Greenplum and the related optimization to improve OLTP performance. Section 6 presents our methodology to alleviate the performance degradation caused by resource competition in a highly concurrent, mixed-workload environment. Lastly, in Section 7 we present our experimental methodology and analysis.

Fig. 2. Example of Greenplum locking benchmark

2 RELATED WORK

Hybrid transactional and analytical processing (HTAP) system. An HTAP system [ ] brings several benefits compared with an OLAP or OLTP system. First, HTAP can significantly reduce the waiting time of new data analysis tasks, as there is no ETL transfer delay. It makes real-time data analysis achievable without extra components or external systems. Second, HTAP systems can also reduce the overall business cost in terms of hardware and administration. Many widely used commercial OLTP DBMSs [ ] have been shifting to HTAP-like DBMSs. However, support for OLTP workloads in commercial OLAP DBMSs is still untouched. As the concept of HTAP becomes popular, more database systems try to support HTAP capabilities. Özcan et al. [19] have dissected HTAP databases into two categories: single systems for OLTP & OLAP and separate OLTP & OLAP systems. In the rest of this section, we discuss the different evolution paths of HTAP databases.

From OLTP to HTAP Databases. OLTP databases are designed to support transactional processing with high concurrency and low latency. Oracle Exadata [ ] is designed to run OLTP workloads simultaneously with analytical processing. Exadata introduces smart scale-out storage, RDMA and InfiniBand networking, and NVMe flash to improve HTAP performance. More recently, it has added features such as column-level checksum with in-memory column cache and smart OLTP caching, which reduce the impact of flash disk failure or replacement. Amazon Aurora [ ] is a cloud OLTP database for AWS. It follows the idea that logs are the database, and it offloads the heavy log processing to the storage layer. To support OLAP workloads, Aurora features parallel
