
Greenplum: A Hybrid Database for Transactional and Analytical
Workloads
ZHENGHUA LYU, HUAN HUBERT ZHANG, GANG XIONG, HAOZHOU WANG, GANG
GUO, JINBAO CHEN, ASIM PRAVEEN, YU YANG, XIAOMING GAO, ASHWIN AGRAWAL,
ALEXANDRA WANG, WEN LIN, JUNFENG YANG, HAO WU, XIAOLIANG LI, FENG GUO,
JIANG WU, JESSE ZHANG, VENKATESH RAGHAVAN, VMware
Demand for enterprise data warehouse solutions to support real-time Online Transaction Processing (OLTP) queries as well
as long-running Online Analytical Processing (OLAP) workloads is growing. Greenplum database is traditionally known
as an OLAP data warehouse system with limited ability to process OLTP workloads. In this paper, we augment Greenplum
into a hybrid system to serve both OLTP and OLAP workloads. The challenge we address here is to achieve this goal
while maintaining the ACID properties with minimal performance overhead. In this effort, we identify the engineering and
performance bottlenecks such as the under-performing restrictive locking and the two-phase commit protocol. Next, we
solve the resource contention issues between transactional and analytical queries. We propose a global deadlock detector to
increase the concurrency of query processing. When transactions that update data are guaranteed to reside on exactly one
segment, we introduce one-phase commit to speed up query processing. Our resource group model introduces the capability
to separate OLAP and OLTP workloads into more suitable query processing modes. Our experimental evaluation on the TPC-B
and CH-benCHmark benchmarks demonstrates the effectiveness of our approach in boosting the OLTP performance without
sacrificing the OLAP performance.
Additional Key Words and Phrases: Database, Hybrid Transaction and Analytical Process
1 INTRODUCTION
Greenplum is an established large scale data-warehouse system with both enterprise and open-source deployments.
The massively parallel processing (MPP) architecture of Greenplum splits the data into disjoint parts that are stored
across individual worker segments. This is similar to the large scale data-warehouse systems such as Oracle
Exadata [5], Teradata [1, 7], and Vertica [13], including DWaaS systems such as AWS Redshift [10], AnalyticDB
[27], and BigQuery [24]. These data warehouse systems are able to efficiently manage and query petabytes of data
in a distributed fashion. In contrast, distributed relational databases such as CockroachDB [23] and Amazon RDS
[2] have focused their efforts on providing a scalable solution for storing terabytes of data and fast processing of
transactional queries.
Greenplum users interact with the system through a coordinator node, and the underlying distributed
architecture is transparent to the users. For a given query, the coordinator optimizes it for parallel processing and
dispatches the generated plan to the segments. Each segment executes the plan in parallel and, when needed,
shuffles tuples among segments. This approach achieves significant speedup for long running analytical queries.
Results are gathered by the coordinator and are then relayed to clients. DML operations can be used to modify data
hosted in the worker segments. Atomicity is ensured via a two-phase commit protocol. Concurrent transactions
are isolated from each other using distributed snapshots. Greenplum supports append-optimized column-oriented
tables with a variety of compression algorithms. These tables are well suited for bulk write and read operations
which are typical in OLAP workloads.
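To make the commit path concrete, the following is a minimal sketch of how a coordinator might choose between two-phase commit and the one-phase commit mentioned in the abstract. It is an illustration under stated assumptions, not Greenplum's implementation: the `Segment` class, the `commit_transaction` function, and the `can_commit` flag are hypothetical names introduced here, and real segments would also have to handle durability, timeouts, and recovery.

```python
from dataclasses import dataclass, field
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


@dataclass
class Segment:
    """Hypothetical stand-in for a worker segment (not a Greenplum API)."""
    name: str
    can_commit: bool = True              # assumption: simulates a local failure
    log: list = field(default_factory=list)

    def prepare(self, txid):
        # Phase 1: record the prepare request and vote on the outcome.
        self.log.append(("prepare", txid))
        return Vote.YES if self.can_commit else Vote.NO

    def commit(self, txid):
        self.log.append(("commit", txid))

    def abort(self, txid):
        self.log.append(("abort", txid))


def commit_transaction(txid, written_segments):
    """Commit txid on the segments it wrote to; return 'committed' or 'aborted'."""
    if len(written_segments) == 1:
        # One-phase commit: a single participant decides alone, so the
        # prepare round trip can be skipped.
        seg = written_segments[0]
        if seg.can_commit:
            seg.commit(txid)
            return "committed"
        seg.abort(txid)
        return "aborted"

    # Two-phase commit: collect votes first, then broadcast the decision.
    votes = [seg.prepare(txid) for seg in written_segments]
    if all(v is Vote.YES for v in votes):
        for seg in written_segments:
            seg.commit(txid)
        return "committed"
    for seg in written_segments:
        seg.abort(txid)
    return "aborted"
```

In this sketch, a transaction that wrote to one segment produces a single `commit` message, while a transaction spanning two segments produces a `prepare` and a `commit` on each. That extra round trip is the overhead the one-phase path avoids.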
Figure 1 shows a typical data processing workflow which involves operational databases managing hot (most
valuable) transactional data for a short period of time. This data is then periodically transformed, using Extract
Author’s address: Zhenghua Lyu, Huan Hubert Zhang, Gang Xiong, Haozhou Wang, Gang Guo, Jinbao Chen, Asim Praveen, Yu Yang,
Xiaoming Gao, Ashwin Agrawal, Alexandra Wang, Wen Lin, Junfeng Yang, Hao Wu, Xiaoliang Li, Feng Guo, Jiang Wu, Jesse Zhang, Venkatesh
Raghavan, VMware.
arXiv:2103.11080v3 [cs.DB] 14 May 2021