暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
An overview of query optimization in relational systems.pdf
119
10页
1次
2022-11-21
免费下载
An Overview of Query Optimization in Relational Systems
Surajit Chaudhuri
Microsoft Research
One Microsoft Way
Redmond, WA 98052
+1-(425)-703-l 938
surajitc@ microsofkcom
1. OBJECTIVE
There has been extensive work in query optimization since the
early ‘70s. It is hard to capture the breadth and depth of this large
body of work in a short article. Therefore, I have decided to focus
primarily on the optimization of SQL queries in relational
database systems and present my biased and incomplete view of
this field. The goal of this article is not to be comprehensive, but
rather to explain the foundations and present samplings of
significant work in this area. I would like to apologize to the many
contributors in this area whose work I have failed to explicitly
acknowledge due to oversight or lack of space. I take the liberty of
trading technical precision for ease of presentation.
Index Nested Loop
(A.x = C.x)
/\
Merge-Join
(A.x=B.x)
/\
Index Scan C
Sort Sort
2. INTRODUCTION
Table Scan A Table Scan B
Relational query languages provide a high-level “declarative”
interface to access data stored in relational databases. Over time,
SQL [41] has emerged as the standard for relational query
languages. Two key components of the query evaluation
component of a SQL database system are the query optimizer and
the query execution engine.
Figure 1. Operator Tree
The query execution engine implements a set of physical
operators. An operator takes as input one or more data streams
and produces an output data stream. Examples of physical
operators are (external) sort, sequential scan, index scan, nested-
loop join, and sort-merge join. I refer to such operators as
physical operators since they are not necessarily tied one-to-one
with relational operators. The simplest way to think of physical
operators is as pieces of code that are used as building blocks to
make possible the execution of SQL queries. An abstract
representation of such an execution is a physical operator tree, as
illustrated in Figure 1. The edges in an operator tree represent the
data flow among the physical operators. We use the terms
physical operator tree and execution plan (or, simply plan)
interchangeably. The execution engine is responsible for the
execution of the plan that results in generating answers to the
query. Therefore, the capabilities of the query execution engine
determine the structure of the operator trees that are feasible. We
refer the reader to [20] for an overview of query evaluation
techniques.
The query optimizer is responsible for generating the input for the
execution engine. It takes a parsed representation of a SQL query
as input and is responsible for generating an efficient execution
plan for the given SQL query from the space of possible execution
plans. The task of an optimizer is nontrivial since for a given SQL
query, there can be a large number of possible operator trees:
.
The algebraic representation of the given query can be
transformed into many other logically equivalent algebraic
representations: e.g.,
Join(Join(A,B),C)= Join(Join(B,C),A)
.
For a given algebraic representation, there may be many
operator trees that implement the algebraic expression, e.g.,
typically there are several join algorithms supported in a
database system.
Furthermore, the throughput or the response times for the
execution of these plans may be widely different. Therefore, a
judicious choice of an execution by the optimizer is of critical
importance. Thus, query optimization can be viewed as a difficult
search problem. In order to solve this problem, we need to
provide:
.
A space of plans (search space).
.
A cost estimation technique so that a cost may be assigned to
each plan in the search space. Intuitively, this is an
estimation of the resources needed for the execution of the
plan.
.
An enumeration algorithm that can search through the
execution space.
34
A desirable optimizer is one where (1) the search space includes
plans that have low cosr (2) the costing technique is accurure (3)
the enumeration algorithm is eficient. Each of these three tasks is
nontrivial and that is why building a good optimizer is an
enormous undertaking.
We begin by discussing the System-R optimization framework
since this was a remarkably elegant approach that helped fuel
much of the subsequent work in optimization. In Section 4, we
will discuss the search space that is considered by optimizers.
This section will provide the forum for presentation of important
algebraic transformations that are incorporated in the search
space. In Section 5, we address the problem of cost estimation. In
Section 6, we take up the topic of enumerating the search space.
This completes the discussion of the basic optimization
framework. In Section 7, we discuss some of the recent
developments in query optimization.
3. AN EXAMPLE: SYSTEM-R OPTIMIZER
The System-R project significantly advanced the state of query
optimization of relational systems. The ideas in [.55] have been
incorporated in many commercial optimizers continue to be
remarkably relevant. I will present a subset of those important
ideas here in the context of Select-Project-Join (SPJ) queries. The
class of SPJ queries is closely related to and encapsulates
conjunctive queries, which are widely studied in Database Theory.
The search space for the System-R optimizer in the context of a
SPJ query consists of operator trees that correspond to linear
sequence of join operations,
e.g.,
the
sequence
Join (Join (Join (A, B) , C) , D) is illustrated in Figure
2(a).
Such sequences are logically equivalent because of
associative and commutative properties of joins. A join operator
can use either the nested loop or sort-merge implementation. Each
scan node can use either index scan (using a clustered or non-
clustered index) or sequential scan. Finally, predicates are
evaluated as early as possible.
The cost model assigns an estimated cost to any partial or
complete plan in the search space. It also determines the estimated
size of the data stream for output of every operator in the plan. It
relies on:
(a) A set of statistics maintained on relations and indexes, e.g.,
number of data pages in a relation, number of pages in an
index, number of distinct values in a column
(b) Formulas to estimate selectivity of predicates and to project
the size of the output data stream for every operator node.
For example, the size of the output of a join is estimated by
taking the product of the sizes of the two relations and then
applying the joint selectivity of all applicable predicates.
(c) Formulas to estimate the CPU and I/O costs of query
execution for every operator. These formulas take into
account the statistical properties of its input data streams,
existing access methods over the input data streams, and any
available order on the data stream (e.g., if a data stream is
ordered, then the cost of a sort-merge join on that stream may
be significantly reduced). In addition, it is also checked if the
output data stream will have any order.
The cost model uses (a)-(c) to compute and associate the
following information in a bottom-up fashion for operators in a
plan: (1) The size of the data stream represented by the output of
the operator node. (2) Any ordering of tuples created or sustained
by the output data stream of the operator node. (3) Estimated
execution cost for the operator (and the cumulative cost of the
partial plan so far).
Join(C,D)
Join*E&
A B
A
B
C
D
(4 (b)
I
Figure 2. (a) Linear and (b) bushy join
I
The enumeration algorithm for System-R optimizer demonstrates
two important techniques: use of dynamic programming and use
of interesting orders.
The essence of the dynamic programming approach is based on
the assumption that the cost model satisfies the principle of
optimaliQ. Specifically, it assumes that in order to obtain an
optimal plan for a SPJ query Q consisting of k joins, it suffices to
consider only the optimal plans for subexpressions of Q that
consist of (k-l) joins and extend those plans with an additional
join. In other words, the suboptimal plans for subexpressions of Q
(also called subqueries) consisting of (k-l) joins do not need to be
considered further in determining the optimal plan for Q.
Accordingly, the dynamic programming based enumeration views
a SPJ query Q as a sef of relations (RI, . . R,) to be joined. The
enumeration algorithm proceeds bottom-up. At the end of the j-th
step, the algorithm produces the optimal plans for all subqueries
of size j. To obtain an optimal plan for a subquery consisting of
(i+l) relations, we consider all possible ways of constructing a
plan for the subquery by extending the plans constructed in the j-
th step. For example, the optimal plan for (RI, R2, RX, R4) is
obtained by picking the plan with the cheapest cost from among
the optimal plans for: (1) Join ( ( R1, R2, R3 ) , R4 ) (2)
Join((Rl,Rz,R4),R3) (3) Join ((R~,Rx,R~),R~) (4)
Join((&,R3,R4). RI).
The rest of the plans for
(RI, R2, R3, R4} may be discarded. The dynamic programming
approach is significantly faster than the nai’ve approach since
instead of O(n!) plans, only O(n2”.‘) plans need to be enumerated.
The second important aspect of System R optimizer is the
consideration of interesting orders. Let us now consider a query
that represents the join among ( R1, Ra
, R3]
with the predicates
R1. a = R2. a = R3. a. Let us also assume that the cost of the
plans for the subquery (RI, R2) are x and y for nested-loop and
sort-merge join respectively and x < y. In such a case, while
considering the plan for ( R1, R2, R3 ), we will not consider the
plan where Ri and Rz are joined using sort-merge. However, note
that if sort-merge is used to join R1 and Rs, rhe result ofthe join is
sorted on a. The sorted order may significantly reduce the cost of
the join with R3. Thus, pruning the plan that represents the sort-
merge join between R1 and R2 can result in sub-optimality of the
global plan. The problem arises because the result of the sort-
merge join between R1 and R2 has an ordering of tuples in the
35
output stream that is useful in the subsequent join. However, the
nested-loop join does not have such ordering. Therefore, given a
query, System R identified ordering of tuples that are potentially
consequential to execution plans for the query (hence the name
interesting orders). Furthermore, in the System R optimizer, two
plans are compared only if they represent the same expression as
well as have the same interesting order. The idea of interesting
order was later generalized to physical properties in [22] and is
used extensively in modem optimizers. Intuitively, a physical
property is any characteristic of a plan that is not shared by all
plans for the same logical expression, but can impact the cost of
subsequent operations. Finally, note that the System-R’s approach
of taking into account physical properties demonstrates a simple
mechanism to handle any violation of the principle of optimality,
not necessarily arising only from physical properties.
Despite the elegance of the System-R approach, the framework
cannot be easily extended to incorporate other logical
transformations (beyond join ordering) that expand the search
space. This led to the development of more extensible
optimization architectures. However, the use of cost-based
optimization, dynamic programming and interesting orders
strongly influenced subsequent developments in optimization.
4. SEARCH SPACE
As mentioned in Section 2, the search space for optimization
depends on the set of algebraic transformations that preserve
equivalence and the set of physical operators supported in an
optimizer. In this section, I will discuss a few of the many
important algebraic transformations that have been discovered. It
should be noted that transformations do not necessarily reduce
cost and therefore must be applied in a cost-based manner by the
enumeration algorithm to ensure a positive benefit.
The optimizer may use several representations of a query during
the lifecycle of optimizing a query. The initial representation is
often the parse tree of the query and the final representation is an
operator tree. An intermediate representation that is also used is
that of logical operator trees (also called query trees) that captures
an algebraic expression. Figure 2 is an example of a query tree.
Often, nodes of the query trees are annotated with additional
information.
Some systems also use a “calculus-oriented” representation for
analyzing the structure of the query. For SPJ queries, such a
structure is often captured by a query graph where nodes
represent relations (correlation variables) and labeled edges
represent join predicates among the relations (see Figure 3).
Although conceptually simple, such a representation falls short of
representing the structure of arbitrary SQL statements in a number
of ways. First, predicate graphs only represent a set of join
predicates and cannot represent other algebraic operators, e.g.,
union. Next, unlike natural join, operators such as outerjoin are
asymmetric and are sensitive to the order of evaluation. Finally,
such a representation does not capture the fact that SQL
statements may have nested query blocks. In the QGM structure
used in the Starburst system [26], the building block is an
enhanced query graph that is able to represent a simple SQL
statement that has no nesting (“single block” query). Multi block
queries are represented as a set of subgraphs with edges among
subgraphs that represent predicates (and quantifiers) across query
blocks. In contrast, Exodus [22] and its derivatives, uniformly use
query trees and operator trees for all phases of optimization.
E.Dept#=D.Dept#
,,.~~~rn
EMP E2
Figure 3. Query
4.1 Commuting Between Operators
A large and important class of transformations exploits
commutativity among operators. In this section, we see examples
of such transformations.
4. I. I Generalizing Join Sequencing
In many of the systems, the sequence of join operations is
syntactically restricted to limit search space. For example, in the
System R project, only linear sequences of join operations are
considered and Cartesian product among relations is deferred until
after all the joins.
Since join operations are commutative and associative, the
sequence of joins in an operator tree need not be linear. In
particular, the query consisting of join among relations
RI, Rs , Ra , RI can be algebraically represented and evaluated as
Join(Join(A,B),Join(C,D)).Suchquerytreesarecalled
bushy, illustrated in Figure 2(b). Bushy join sequences require
materialization of intermediate relations. While bushy trees may
result in cheaper query plan, they expand the cost of enumerating
the search space considerably’. Although there has been some
studies of merits of exploring the bushy join sequences, by and
large most systems still focus on linear join sequences and only
restricted subsets of bushy join trees.
Deferring Cartesian products may also result in poor performance.
In many decision-support queries where the query graph forms a
star, it has been observed that a Cartesian product among
appropriate nodes (“dimensional” tables in OLAP terminology
[7]) results in a significant reduction in cost.
In an extensible system, the behavior of the join enumerator may
be adapted on a per query basis so as to restrict the “bushy”-ness
of the join trees and to allow or disallow Cartesian products [46].
However, it is nontrivial to determine a priori the effects of such
tuning on the quality and cost of the search.
4.1.2 Outerjoin and Join
One-sided outerjoin is an asymmetric operator in SQL that
preserves ail of the tuples of one relation. Symmetric outerjoins
preserve both the operand relations. Thus, (R LOJ S), where LOJ
designates left outerjoin between R and S, preserves all tuples of
R. In addition to the tuples from natural join, the above operation
contains all remaining tuples in R that fail to join with S (padded
with NULLS for their S attributes). Unlike natural joins, a
It is not the cost of generating the syntactic join orders that is
most expensive. Rather, the task of choosing physical operators
and computing the cost of each alternative plan is
computationally intensive.
36
of 10
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。