POLARIS: The Distributed SQL Engine in Azure Synapse
Josep Aguilar-Saborit, Raghu Ramakrishnan, Krish Srinivasan
Kevin Bocksrocker, Ioannis Alagiannis, Mahadevan Sankara, Moe Shafiei
Jose Blakeley, Girish Dasarathy, Sumeet Dash, Lazar Davidovic, Maja Damjanic, Slobodan Djunic, Nemanja Djurkic, Charles Feddersen, Cesar
Galindo-Legaria, Alan Halverson, Milana Kovacevic, Nikola Kicovic, Goran Lukic, Djordje Maksimovic, Ana Manic, Nikola Markovic, Bosko Mihic,
Ugljesa Milic, Marko Milojevic, Tapas Nayak, Milan Potocnik, Milos Radic, Bozidar Radivojevic, Srikumar Rangarajan, Milan Ruzic, Milan Simic,
Marko Sosic, Igor Stanko, Maja Stikic, Sasa Stanojkov, Vukasin Stefanovic, Milos Sukovic, Aleksandar Tomic , Dragan Tomic, Steve Toscano,
Djordje Trifunovic, Veljko Vasic, Tomer Verona, Aleksandar Vujic, Nikola Vujic, Marko Vukovic, Marko Zivanovic
Microsoft Corp
ABSTRACT
In this paper, we describe the Polaris distributed SQL query engine
in Azure Synapse. It is the result of a multi-year project to re-
architect the query processing framework in the SQL DW parallel
data warehouse service, and addresses two main goals: (i) converge
data warehousing and big data workloads, and (ii) separate compute
and state for cloud-native execution.
From a customer perspective, these goals translate into many useful
features, including the ability to resize live workloads, deliver
predictable performance at scale, and to efficiently handle both
relational and unstructured data. Achieving these goals required
many innovations, including a novel “cell” data abstraction, and
flexible, fine-grained, task monitoring and scheduling capable of
handling partial query restarts and PB-scale execution. Most
importantly, while we develop a completely new scale-out
framework, it is fully compatible with T-SQL and leverages
decades of investment in the SQL Server single-node runtime and
query optimizer. The scalability of the system is highlighted by a
1PB scale run of all 22 TPC-H queries; to our knowledge, this is
the first reported run with scale larger than 100TB.
PVLDB Reference Format:
Josep Aguilar-Saborit, Raghu Ramakrishnan et al.
VLDB Conferences. PVLDB, 13(12): 3204 – 3216, 2020.
DOI: https://doi.org/10.14778/3415478.3415545
1. INTRODUCTION
Relational data warehousing has long been the enterprise approach
to data analytics, in conjunction with multi-dimensional business-
intelligence (BI) tools such as Power BI and Tableau. The recent
explosion in the number and diversity of data sources, together with
the interest in machine learning, real-time analytics and other
advanced capabilities, has made it necessary to extend traditional
relational DBMS based warehouses. In contrast to the traditional
approach of carefully curating data to conform to standard
enterprise schemas and semantics, data lakes focus on rapidly
ingesting data from many sources and give users flexible analytic
tools to handle the resulting data heterogeneity and scale.
A common pattern is that data lakes are used for data preparation,
and the results are then moved to a traditional warehouse for the
phase of interactive analysis and reporting. While this pattern
bridges the lake and warehouse paradigms and allows enterprises
to benefit from their complementary strengths, we believe that the
two approaches are converging, and that the full relational SQL tool
chain (spanning data movement, catalogs, business analytics and
reporting) must be supported directly over the diverse and large
datasets stored in a lake; users will not want to migrate all their
investments in existing tool chains.
In this paper, we present the Polaris interactive relational query
engine, a key component for converging warehouses and lakes in
Azure Synapse [1], with a cloud-native scale-out architecture that
makes novel contributions in the following areas:
• Cell data abstraction: Polaris builds on the abstraction of
a data “cell” to run efficiently on a diverse collection of data
formats and storage systems. The full SQL tool chain can now
be brought to bear over files in the lake with on-demand
interactive performance at scale, eliminating the need to move
files into a warehouse. This reduces costs, simplifies data
governance, and reduces time to insight. Additionally, in
conjunction with a re-designed storage manager (Fido [2]) it
supports the full range of query and transactional performance
needed for Tier 1 warehousing workloads.
• Fine-grained scale-out: The highly-available micro-
service architecture is based on (1) a careful packaging of data
and query processing into units called “tasks” that can be
readily moved across compute nodes and re-started at the task
level; (2) widely-partitioned data with a flexible distribution
model; (3) a task-level “workflow-DAG” that is novel in
spanning multiple queries, in contrast to [3, 4, 5, 6]; and (4) a
framework for fine-grained monitoring and flexible
scheduling of tasks.
• Combining scale-up and scale-out: Production-ready
scale-up SQL systems offer excellent intra-partition
parallelism and have been tuned for interactive queries with
deep enhancements to query optimization and vectorized
processing of columnar data partitions, careful control flow,
and exploitation of tiered data caches. While Polaris has a new
scale-out distributed query processing architecture inspired by
big data query execution frameworks, it is unique in how it
combines this with SQL Server’s scale-up features at each
node; we thus benefit from both scale-up and scale-out.
• Flexible service model: Polaris has a concept of a session,
which supports a spectrum of consumption models, ranging
from “serverless” ad-hoc queries to long-standing pools or
clusters. Leveraging the Polaris session architecture, Azure
Synapse is unique among cloud services in how it brings
together serverless and reserved pools with online scaling. All
data (e.g., files in the lake, as well as managed data in Fido
[2]) are accessible from any session, and multiple sessions can
评论