An overview of query optimization in relational systems.pdf

chirpyli

119

10页

1次

2022-11-21

免费下载

An Overview of Query Optimization in Relational Systems

Surajit Chaudhuri

Microsoft Research

One Microsoft Way

Redmond, WA 98052

+1-(425)-703-l 938

surajitc@ microsofkcom

1. OBJECTIVE

There has been extensive work in query optimization since the

early ‘70s. It is hard to capture the breadth and depth of this large

body of work in a short article. Therefore, I have decided to focus

primarily on the optimization of SQL queries in relational

database systems and present my biased and incomplete view of

this field. The goal of this article is not to be comprehensive, but

rather to explain the foundations and present samplings of

significant work in this area. I would like to apologize to the many

contributors in this area whose work I have failed to explicitly

acknowledge due to oversight or lack of space. I take the liberty of

trading technical precision for ease of presentation.

Index Nested Loop

(A.x = C.x)

Merge-Join

(A.x=B.x)

Index Scan C

Sort Sort

2. INTRODUCTION

Table Scan A Table Scan B

Relational query languages provide a high-level “declarative”

interface to access data stored in relational databases. Over time,

SQL [41] has emerged as the standard for relational query

languages. Two key components of the query evaluation

component of a SQL database system are the query optimizer and

the query execution engine.

Figure 1. Operator Tree

The query execution engine implements a set of physical

operators. An operator takes as input one or more data streams

and produces an output data stream. Examples of physical

operators are (external) sort, sequential scan, index scan, nested-

loop join, and sort-merge join. I refer to such operators as

physical operators since they are not necessarily tied one-to-one

with relational operators. The simplest way to think of physical

operators is as pieces of code that are used as building blocks to

make possible the execution of SQL queries. An abstract

representation of such an execution is a physical operator tree, as

illustrated in Figure 1. The edges in an operator tree represent the

data flow among the physical operators. We use the terms

physical operator tree and execution plan (or, simply plan)

interchangeably. The execution engine is responsible for the

execution of the plan that results in generating answers to the

query. Therefore, the capabilities of the query execution engine

determine the structure of the operator trees that are feasible. We

refer the reader to [20] for an overview of query evaluation

techniques.

The query optimizer is responsible for generating the input for the

execution engine. It takes a parsed representation of a SQL query

as input and is responsible for generating an efficient execution

plan for the given SQL query from the space of possible execution

plans. The task of an optimizer is nontrivial since for a given SQL

query, there can be a large number of possible operator trees:

The algebraic representation of the given query can be

transformed into many other logically equivalent algebraic

representations: e.g.,

Join(Join(A,B),C)= Join(Join(B,C),A)

For a given algebraic representation, there may be many

operator trees that implement the algebraic expression, e.g.,

typically there are several join algorithms supported in a

database system.

Furthermore, the throughput or the response times for the

execution of these plans may be widely different. Therefore, a

judicious choice of an execution by the optimizer is of critical

importance. Thus, query optimization can be viewed as a difficult

search problem. In order to solve this problem, we need to

provide:

A space of plans (search space).

A cost estimation technique so that a cost may be assigned to

each plan in the search space. Intuitively, this is an

estimation of the resources needed for the execution of the

plan.

An enumeration algorithm that can search through the

execution space.

A desirable optimizer is one where (1) the search space includes

plans that have low cosr (2) the costing technique is accurure (3)

the enumeration algorithm is eficient. Each of these three tasks is

nontrivial and that is why building a good optimizer is an

enormous undertaking.

We begin by discussing the System-R optimization framework

since this was a remarkably elegant approach that helped fuel

much of the subsequent work in optimization. In Section 4, we

will discuss the search space that is considered by optimizers.

This section will provide the forum for presentation of important

algebraic transformations that are incorporated in the search

space. In Section 5, we address the problem of cost estimation. In

Section 6, we take up the topic of enumerating the search space.

This completes the discussion of the basic optimization

framework. In Section 7, we discuss some of the recent

developments in query optimization.

3. AN EXAMPLE: SYSTEM-R OPTIMIZER

The System-R project significantly advanced the state of query

optimization of relational systems. The ideas in [.55] have been

incorporated in many commercial optimizers continue to be

remarkably relevant. I will present a subset of those important

ideas here in the context of Select-Project-Join (SPJ) queries. The

class of SPJ queries is closely related to and encapsulates

conjunctive queries, which are widely studied in Database Theory.

The search space for the System-R optimizer in the context of a

SPJ query consists of operator trees that correspond to linear

sequence of join operations,

e.g.,

the

sequence

Join (Join (Join (A, B) , C) , D) is illustrated in Figure

2(a).

Such sequences are logically equivalent because of

associative and commutative properties of joins. A join operator

can use either the nested loop or sort-merge implementation. Each

scan node can use either index scan (using a clustered or non-

clustered index) or sequential scan. Finally, predicates are

evaluated as early as possible.

The cost model assigns an estimated cost to any partial or

complete plan in the search space. It also determines the estimated

size of the data stream for output of every operator in the plan. It

relies on:

(a) A set of statistics maintained on relations and indexes, e.g.,

number of data pages in a relation, number of pages in an

index, number of distinct values in a column

(b) Formulas to estimate selectivity of predicates and to project

the size of the output data stream for every operator node.

For example, the size of the output of a join is estimated by

taking the product of the sizes of the two relations and then

applying the joint selectivity of all applicable predicates.

execution for every operator. These formulas take into

account the statistical properties of its input data streams,

existing access methods over the input data streams, and any

available order on the data stream (e.g., if a data stream is

ordered, then the cost of a sort-merge join on that stream may

be significantly reduced). In addition, it is also checked if the

output data stream will have any order.

The cost model uses (a)-(c) to compute and associate the

following information in a bottom-up fashion for operators in a

plan: (1) The size of the data stream represented by the output of

the operator node. (2) Any ordering of tuples created or sustained

by the output data stream of the operator node. (3) Estimated

execution cost for the operator (and the cumulative cost of the

partial plan so far).

Join(C,D)

Join*E&

A B

(4 (b)

Figure 2. (a) Linear and (b) bushy join

The enumeration algorithm for System-R optimizer demonstrates

two important techniques: use of dynamic programming and use

of interesting orders.

The essence of the dynamic programming approach is based on

the assumption that the cost model satisfies the principle of

optimaliQ. Specifically, it assumes that in order to obtain an

optimal plan for a SPJ query Q consisting of k joins, it suffices to

consider only the optimal plans for subexpressions of Q that

consist of (k-l) joins and extend those plans with an additional

join. In other words, the suboptimal plans for subexpressions of Q

(also called subqueries) consisting of (k-l) joins do not need to be

considered further in determining the optimal plan for Q.

Accordingly, the dynamic programming based enumeration views

a SPJ query Q as a sef of relations (RI, . . R,) to be joined. The

enumeration algorithm proceeds bottom-up. At the end of the j-th

step, the algorithm produces the optimal plans for all subqueries

of size j. To obtain an optimal plan for a subquery consisting of

(i+l) relations, we consider all possible ways of constructing a

plan for the subquery by extending the plans constructed in the j-

th step. For example, the optimal plan for (RI, R2, RX, R4) is

obtained by picking the plan with the cheapest cost from among

the optimal plans for: (1) Join ( ( R1, R2, R3 ) , R4 ) (2)

Join((Rl,Rz,R4),R3) (3) Join ((R~,Rx,R~),R~) (4)

Join((&,R3,R4). RI).

The rest of the plans for

(RI, R2, R3, R4} may be discarded. The dynamic programming

approach is significantly faster than the nai’ve approach since

instead of O(n!) plans, only O(n2”.‘) plans need to be enumerated.

The second important aspect of System R optimizer is the

consideration of interesting orders. Let us now consider a query

that represents the join among ( R1, Ra

, R3]

with the predicates

R1. a = R2. a = R3. a. Let us also assume that the cost of the

plans for the subquery (RI, R2) are x and y for nested-loop and

sort-merge join respectively and x < y. In such a case, while

considering the plan for ( R1, R2, R3 ), we will not consider the

plan where Ri and Rz are joined using sort-merge. However, note

that if sort-merge is used to join R1 and Rs, rhe result ofthe join is

sorted on a. The sorted order may significantly reduce the cost of

the join with R3. Thus, pruning the plan that represents the sort-

merge join between R1 and R2 can result in sub-optimality of the

global plan. The problem arises because the result of the sort-

merge join between R1 and R2 has an ordering of tuples in the

output stream that is useful in the subsequent join. However, the

nested-loop join does not have such ordering. Therefore, given a

query, System R identified ordering of tuples that are potentially

consequential to execution plans for the query (hence the name

interesting orders). Furthermore, in the System R optimizer, two

plans are compared only if they represent the same expression as

well as have the same interesting order. The idea of interesting

order was later generalized to physical properties in [22] and is

used extensively in modem optimizers. Intuitively, a physical

property is any characteristic of a plan that is not shared by all

plans for the same logical expression, but can impact the cost of

subsequent operations. Finally, note that the System-R’s approach

of taking into account physical properties demonstrates a simple

mechanism to handle any violation of the principle of optimality,

not necessarily arising only from physical properties.

Despite the elegance of the System-R approach, the framework

cannot be easily extended to incorporate other logical

transformations (beyond join ordering) that expand the search

space. This led to the development of more extensible

optimization architectures. However, the use of cost-based

optimization, dynamic programming and interesting orders

strongly influenced subsequent developments in optimization.

4. SEARCH SPACE

As mentioned in Section 2, the search space for optimization

depends on the set of algebraic transformations that preserve

equivalence and the set of physical operators supported in an

optimizer. In this section, I will discuss a few of the many

important algebraic transformations that have been discovered. It

should be noted that transformations do not necessarily reduce

cost and therefore must be applied in a cost-based manner by the

enumeration algorithm to ensure a positive benefit.

The optimizer may use several representations of a query during

the lifecycle of optimizing a query. The initial representation is

often the parse tree of the query and the final representation is an

operator tree. An intermediate representation that is also used is

that of logical operator trees (also called query trees) that captures

an algebraic expression. Figure 2 is an example of a query tree.

Often, nodes of the query trees are annotated with additional

information.

Some systems also use a “calculus-oriented” representation for

analyzing the structure of the query. For SPJ queries, such a

structure is often captured by a query graph where nodes

represent relations (correlation variables) and labeled edges

represent join predicates among the relations (see Figure 3).

Although conceptually simple, such a representation falls short of

representing the structure of arbitrary SQL statements in a number

of ways. First, predicate graphs only represent a set of join

predicates and cannot represent other algebraic operators, e.g.,

union. Next, unlike natural join, operators such as outerjoin are

asymmetric and are sensitive to the order of evaluation. Finally,

such a representation does not capture the fact that SQL

statements may have nested query blocks. In the QGM structure

used in the Starburst system [26], the building block is an

enhanced query graph that is able to represent a simple SQL

statement that has no nesting (“single block” query). Multi block

queries are represented as a set of subgraphs with edges among

subgraphs that represent predicates (and quantifiers) across query

blocks. In contrast, Exodus [22] and its derivatives, uniformly use

query trees and operator trees for all phases of optimization.

E.Dept#=D.Dept#

,,.~~~rn

EMP E2

Figure 3. Query

4.1 Commuting Between Operators

A large and important class of transformations exploits

commutativity among operators. In this section, we see examples

of such transformations.

4. I. I Generalizing Join Sequencing

In many of the systems, the sequence of join operations is

syntactically restricted to limit search space. For example, in the

System R project, only linear sequences of join operations are

considered and Cartesian product among relations is deferred until

after all the joins.

Since join operations are commutative and associative, the

sequence of joins in an operator tree need not be linear. In

particular, the query consisting of join among relations

RI, Rs , Ra , RI can be algebraically represented and evaluated as

Join(Join(A,B),Join(C,D)).Suchquerytreesarecalled

bushy, illustrated in Figure 2(b). Bushy join sequences require

materialization of intermediate relations. While bushy trees may

result in cheaper query plan, they expand the cost of enumerating

the search space considerably’. Although there has been some

studies of merits of exploring the bushy join sequences, by and

large most systems still focus on linear join sequences and only

restricted subsets of bushy join trees.

Deferring Cartesian products may also result in poor performance.

In many decision-support queries where the query graph forms a

star, it has been observed that a Cartesian product among

appropriate nodes (“dimensional” tables in OLAP terminology

[7]) results in a significant reduction in cost.

In an extensible system, the behavior of the join enumerator may

be adapted on a per query basis so as to restrict the “bushy”-ness

of the join trees and to allow or disallow Cartesian products [46].

However, it is nontrivial to determine a priori the effects of such

tuning on the quality and cost of the search.

4.1.2 Outerjoin and Join

One-sided outerjoin is an asymmetric operator in SQL that

preserves ail of the tuples of one relation. Symmetric outerjoins

preserve both the operand relations. Thus, (R LOJ S), where LOJ

designates left outerjoin between R and S, preserves all tuples of

R. In addition to the tuples from natural join, the above operation

contains all remaining tuples in R that fail to join with S (padded

with NULLS for their S attributes). Unlike natural joins, a

’ It is not the cost of generating the syntactic join orders that is

most expensive. Rather, the task of choosing physical operators

and computing the cost of each alternative plan is

computationally intensive.

of 10

免费下载

数据库

文档被以下合辑收录

数据库论文（共5篇）

收录数据库领域相关论文

文档被以下合辑收录

相关文档

评论