
A desirable optimizer is one where (1) the search space includes
plans that have low cosr (2) the costing technique is accurure (3)
the enumeration algorithm is eficient. Each of these three tasks is
nontrivial and that is why building a good optimizer is an
enormous undertaking.
We begin by discussing the System-R optimization framework
since this was a remarkably elegant approach that helped fuel
much of the subsequent work in optimization. In Section 4, we
will discuss the search space that is considered by optimizers.
This section will provide the forum for presentation of important
algebraic transformations that are incorporated in the search
space. In Section 5, we address the problem of cost estimation. In
Section 6, we take up the topic of enumerating the search space.
This completes the discussion of the basic optimization
framework. In Section 7, we discuss some of the recent
developments in query optimization.
3. AN EXAMPLE: SYSTEM-R OPTIMIZER
The System-R project significantly advanced the state of query
optimization of relational systems. The ideas in [.55] have been
incorporated in many commercial optimizers continue to be
remarkably relevant. I will present a subset of those important
ideas here in the context of Select-Project-Join (SPJ) queries. The
class of SPJ queries is closely related to and encapsulates
conjunctive queries, which are widely studied in Database Theory.
The search space for the System-R optimizer in the context of a
SPJ query consists of operator trees that correspond to linear
sequence of join operations,
e.g.,
the
sequence
Join (Join (Join (A, B) , C) , D) is illustrated in Figure
2(a).
Such sequences are logically equivalent because of
associative and commutative properties of joins. A join operator
can use either the nested loop or sort-merge implementation. Each
scan node can use either index scan (using a clustered or non-
clustered index) or sequential scan. Finally, predicates are
evaluated as early as possible.
The cost model assigns an estimated cost to any partial or
complete plan in the search space. It also determines the estimated
size of the data stream for output of every operator in the plan. It
relies on:
(a) A set of statistics maintained on relations and indexes, e.g.,
number of data pages in a relation, number of pages in an
index, number of distinct values in a column
(b) Formulas to estimate selectivity of predicates and to project
the size of the output data stream for every operator node.
For example, the size of the output of a join is estimated by
taking the product of the sizes of the two relations and then
applying the joint selectivity of all applicable predicates.
(c) Formulas to estimate the CPU and I/O costs of query
execution for every operator. These formulas take into
account the statistical properties of its input data streams,
existing access methods over the input data streams, and any
available order on the data stream (e.g., if a data stream is
ordered, then the cost of a sort-merge join on that stream may
be significantly reduced). In addition, it is also checked if the
output data stream will have any order.
The cost model uses (a)-(c) to compute and associate the
following information in a bottom-up fashion for operators in a
plan: (1) The size of the data stream represented by the output of
the operator node. (2) Any ordering of tuples created or sustained
by the output data stream of the operator node. (3) Estimated
execution cost for the operator (and the cumulative cost of the
partial plan so far).
Join(C,D)
Join*E&
A B
A
B
C
D
(4 (b)
I
Figure 2. (a) Linear and (b) bushy join
I
The enumeration algorithm for System-R optimizer demonstrates
two important techniques: use of dynamic programming and use
of interesting orders.
The essence of the dynamic programming approach is based on
the assumption that the cost model satisfies the principle of
optimaliQ. Specifically, it assumes that in order to obtain an
optimal plan for a SPJ query Q consisting of k joins, it suffices to
consider only the optimal plans for subexpressions of Q that
consist of (k-l) joins and extend those plans with an additional
join. In other words, the suboptimal plans for subexpressions of Q
(also called subqueries) consisting of (k-l) joins do not need to be
considered further in determining the optimal plan for Q.
Accordingly, the dynamic programming based enumeration views
a SPJ query Q as a sef of relations (RI, . . R,) to be joined. The
enumeration algorithm proceeds bottom-up. At the end of the j-th
step, the algorithm produces the optimal plans for all subqueries
of size j. To obtain an optimal plan for a subquery consisting of
(i+l) relations, we consider all possible ways of constructing a
plan for the subquery by extending the plans constructed in the j-
th step. For example, the optimal plan for (RI, R2, RX, R4) is
obtained by picking the plan with the cheapest cost from among
the optimal plans for: (1) Join ( ( R1, R2, R3 ) , R4 ) (2)
Join((Rl,Rz,R4),R3) (3) Join ((R~,Rx,R~),R~) (4)
Join((&,R3,R4). RI).
The rest of the plans for
(RI, R2, R3, R4} may be discarded. The dynamic programming
approach is significantly faster than the nai’ve approach since
instead of O(n!) plans, only O(n2”.‘) plans need to be enumerated.
The second important aspect of System R optimizer is the
consideration of interesting orders. Let us now consider a query
that represents the join among ( R1, Ra
, R3]
with the predicates
R1. a = R2. a = R3. a. Let us also assume that the cost of the
plans for the subquery (RI, R2) are x and y for nested-loop and
sort-merge join respectively and x < y. In such a case, while
considering the plan for ( R1, R2, R3 ), we will not consider the
plan where Ri and Rz are joined using sort-merge. However, note
that if sort-merge is used to join R1 and Rs, rhe result ofthe join is
sorted on a. The sorted order may significantly reduce the cost of
the join with R3. Thus, pruning the plan that represents the sort-
merge join between R1 and R2 can result in sub-optimality of the
global plan. The problem arises because the result of the sort-
merge join between R1 and R2 has an ordering of tuples in the
35
文档被以下合辑收录
相关文档
评论