为了有效处理星形查询,星形转换在Oracle 8i时被引入。这些查询通常在遵从星形数据模型的数据仓库应用中使用。之所以称之为星形模型,是因为该数据模型图看起来像星星。星形的中心由一个或多个事实表构成,而与其相对的,则是维度表。
star_model.PNG
这种转换的基本思想是避免在大表上使用全表扫描的访问方法,在星形模式中称该大表为事实表。在典型的星形查询中,事实表连接到若干个小得多的维度表。事实表通常包含每个维度表的一个键(称为外键)以及多个度量列,比如销售额。维度表中的对应键称为主键。连接是在事实表的外键和维度表的对应主键上。查询还包含维度表的其他列上的过滤谓词,这些列通常非常严格。这些过滤器的组合有助于显著减少事实表中需处理的数据集。星形转换的目标是只访问事实表中的这组简化数据。
考虑如下的星形查询Q1.该查询是查找加利福尼亚所有城市在1999年1季度和2季度,通过互联网的销售总额
Q1:
SELECT c.cust_city, t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount FROM sales s, times t, customers c, channels ch WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id AND s.channel_id = ch.channel_id AND c.cust_state_province = 'CA' AND ch.channel_desc = 'Internet' AND t.calendar_quarter_desc IN ('1999-01','1999-02') GROUP BY c.cust_city, t.calendar_quarter_desc;
复制
SALES表是事实表,而其它表被做为维度表。SALES表中包含产品的每一次售出,也正因如此,它可能包含上千万行销售记录。然而,他们中有只有很少是通过互联网,在指定的季度,被销售给了在加利福尼亚的客户。该查询被转换为Q2。
Q2:
SELECT c.cust_city, t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount FROM sales s, times t, customers c WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id AND c.cust_state_province = 'CA' AND t.calendar_quarter_desc IN ('1999-01','1999-02') AND s.time_id IN (SELECT time_id FROM times WHERE calendar_quarter_desc IN('1999-01','1999-02')) AND s.cust_id IN (SELECT cust_id FROM customers WHERE cust_state_province='CA') AND s.channel_id IN (SELECT channel_id FROM channels WHERE channel_desc = 'Internet') GROUP BY c.cust_city, t.calendar_quarter_desc;
复制
星形转换本质上是关于添加相应维度约束上的子查询谓词。这些子查询谓词称为位图半连接谓词(bitmap semi-join predicates)。当事实连接列(s.timeid、s.custid…)上存在索引时,将执行转换。通过驱动子查询提供的键值的位图与和位图或操作(位图可以来自位图索引或从常规B-Tree索引生成),只有事实表中的相关行才需要返回。如果维度表上的过滤器过滤掉了大量数据,这可能比在事实表上的全表扫描更有效。从事实表中检索到相关行之后,可能需要使用原始谓词将它们连接回维度表。在某些情况下,可以消除该联接。我们稍后将讨论这种情况
表1显示了转换查询的执行计划。请注意SALES表使用了位图访问来替代了全表扫描。对于每一个来自于子查询(行11,16,21)(译者注:个人认为应为行14,19,24)的键值,都要被事实表索引的位图所检索(行12,17,22)(译者注:个人认为应为行15,20,25)。位图中的每一位都对应着事实表中的一行。如果子查询的键值与事实表行中的值相同,则设置该位。例如,位图[1][0][1][0][0][0]…(其余行均为0)表示事实表的第1行和第3行具有来自子查询的匹配键值。让我们假设上面的位图是来自customers表子查询的键值。
行9、14、19(译者注:个人认为应为行12、17、22)中的迭代操作子查询中的键并获得相应的位图。假设customers子查询使用位图[0][1][0][0][0][0][0]生成另一个键值。。。
每一个子查询的位图被合并(OR操作)(行8,13和18)(译者注:个人认为应为行11,16和21)。在上例中,在合并了Customers子查询产生的两个位图,产生一个单一的位图[1][1][1][0][0][0]…
合并后的位置被与操作(行7)(译者注:个人认为应为行10)。假设来于CHANNELS表的位置是[1][0][0][0][0][0]…,如果将它和来自CUSTOMERS表子查询的位图相与,会产生[1][0][0][0][0]…
最终位图所对应的ROWIDS被生成(行6)(译者注:个人认为应为行9)。事实表被ROWIDS(行5)(译者注:个人认为应为行8)所检索。在上例中,它将产生仅对应于第1行的1个ROWID,并仅获取1行,而不是扫描整个事实表。
上述示例中的位图表示仅用于演示。在Oracle中,它们以压缩形式来表示和存储。
表 1: 转换查询的执行计划
-------------------------------------------------------------- | Id | Operation | Name | -------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | HASH GROUP BY | | |* 2 | HASH JOIN | | |* 3 | TABLE ACCESS FULL | CUSTOMERS | |* 4 | HASH JOIN | | |* 5 | TABLE ACCESS FULL | TIMES | | 6 | VIEW | VW_ST_B1772830 | | 7 | NESTED LOOPS | | | 8 | PARTITION RANGE SUBQUERY | | | 9 | BITMAP CONVERSION TO ROWIDS| | | 10 | BITMAP AND | | | 11 | BITMAP MERGE | | | 12 | BITMAP KEY ITERATION | | | 13 | BUFFER SORT | | |* 14 | TABLE ACCESS FULL | CHANNELS | |* 15 | BITMAP INDEX RANGE SCAN| SALES_CHANNEL_BIX| | 16 | BITMAP MERGE | | | 17 | BITMAP KEY ITERATION | | | 18 | BUFFER SORT | | |* 19 | TABLE ACCESS FULL | TIMES | |* 20 | BITMAP INDEX RANGE SCAN| SALES_TIME_BIX | | 21 | BITMAP MERGE | | | 22 | BITMAP KEY ITERATION | | | 23 | BUFFER SORT | | |* 24 | TABLE ACCESS FULL | CUSTOMERS | |* 25 | BITMAP INDEX RANGE SCAN| SALES_CUST_BIX | | 26 | TABLE ACCESS BY USER ROWID | SALES | --------------------------------------------------------------
复制
回连消除
子查询及其位图树仅基于维度表上的过滤条件来过滤事实表,因此可能仍然需要连接到维度表。当维度表上的所有谓词都是半联接子查询谓词的一部分,从子查询中选择的列是唯一的,维度列不在select 列表、group by等中时,维度表的连接被消除。 在上例中,CHANNELS表就没有被连接回SALES表,因为外层未引用且CHANNEL_ID是唯一的。
临时表转换
如果连接不能被消除,Oracle使用临时表存储子查询的结果,以避免重新扫描维度表(用于位图键生成和回连)。除此之外,如果查询并行运行,结果将被物化,这样每个从属进程就可以从临时表中选择结果,而不是再次执行子查询。
比如 ,如果Oracle物化了CUSTOMERS表子查询的结果到临时表中,则转换后的查询Q3将如下所示:
Q3:
SELECT t1.c1 cust_city, t.calendar_quarter_desc calendar_quarter_desc, sum(s.amount_sold) sales_amount FROM sales s, sh.times t, sys_temp_0fd9d6621_e7e24 t1 WHERE s.time_id=t.time_id AND s.cust_id=t1.c0 AND (t.calendar_quarter_desc='1999-q1' OR t.calendar_quarter_desc='1999-q2') AND s.cust_id IN (SELECT t1.c0 FROM sys_temp_0fd9d6621_e7e24 t1) AND s.channel_id IN (SELECT ch.channel_id FROM channels ch WHERE ch.channel_desc='internet') AND s.time_id IN (SELECT t.time_id FROM times t WHERE t.calendar_quarter_desc='1999-q1' OR t.calendar_quarter_desc='1999-q2') GROUP BY t1.c1, t.calendar_quarter_desc;
复制
请注意CUSTOMERS表被替换为临时表sys_temp_0fd9d6621_e7e24,并且,对CUST_ID列和CUST_CITY列的引用,也会相应地被临时表中的列所替代。临时表创建时会带有2列- (c0 number, c1 varchar2(30)). 这些列对应于CUSTOMERS表中的CUST_ID和CUST_CITY。在语句Q3开始执行时,下面的查询Q4将被用来填充该临时表。
Q4:
SELECT c.cust_id, c.cust_city FROM customers WHERE c.cust_state_province = 'CA'
复制
表2显示转换后查询的执行计划
表2: 使用了临时表转换的执行计划
---------------------------------------------------------------------- | Id | Operation | Name | ---------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | TEMP TABLE TRANSFORMATION | | | 2 | LOAD AS SELECT | | |* 3 | TABLE ACCESS FULL | CUSTOMERS | | 4 | HASH GROUP BY | | |* 5 | HASH JOIN | | | 6 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6613_C716F| |* 7 | HASH JOIN | | |* 8 | TABLE ACCESS FULL | TIMES | | 9 | VIEW | VW_ST_A3F94988 | | 10 | NESTED LOOPS | | | 11 | PARTITION RANGE SUBQUERY | | | 12 | BITMAP CONVERSION TO ROWIDS| | | 13 | BITMAP AND | | | 14 | BITMAP MERGE | | | 15 | BITMAP KEY ITERATION | | | 16 | BUFFER SORT | | |* 17 | TABLE ACCESS FULL | CHANNELS | |* 18 | BITMAP INDEX RANGE SCAN| SALES_CHANNEL_BIX | | 19 | BITMAP MERGE | | | 20 | BITMAP KEY ITERATION | | | 21 | BUFFER SORT | | |* 22 | TABLE ACCESS FULL | TIMES | |* 23 | BITMAP INDEX RANGE SCAN| SALES_TIME_BIX | | 24 | BITMAP MERGE | | | 25 | BITMAP KEY ITERATION | | | 26 | BUFFER SORT | | | 27 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6613_C716F| |* 28 | BITMAP INDEX RANGE SCAN| SALES_CUST_BIX | | 29 | TABLE ACCESS BY USER ROWID | SALES | ----------------------------------------------------------------------
复制
执行计划中的行1,行2和行3物化CUSTOMER表子查询到临时表中。在行24(译者注:人个认为是行27),扫描了临时表(代替子查询)以便去创建事实表的位图。行26(译者注:个人认为是行6)为了回连而扫描了临时表,从而代替了对CUSTOMERS表的扫描。注意,CUSTOMERS上的过滤条件不必用于临时表上,因为在物化临时表时,就已经施加了过滤。
启用转换
星形转换由数据库初始化参数star_transformation_enabled所控制,该参数有3个值:
TRUE | Oracle优化器自动识别事实表和限定的维度表。这一行为是基于成本的,比如,如果转换后的执行计划成本低于不转换的执行计划,转换才会执行。而且,如果物化视图可以改善性能,优化器会自动尝试临时表转换。 |
---|---|
FALSE | 不尝试转换 |
TEMP_DISABLE | 除了不进行临时表转换外,类似于TRUE的行为。 |
该参数的默认值是FALSE。你必须修改参数值,并且在事实表的连接列上创建索引,才能发挥出该转换的优势。
总结
星形转换可以改善一个非常大的事实表与多个有很好选择性谓词的维度表连接的查询性能。转换避免了对事实表的全表扫描。它只获取事实表中最终可以和特定维度表关联上的行。转换是基于成本的–仅在转换后的成本低于未转换的成本时。如果维度过滤谓词不能显著减少从事实表中检索到的数据量,那么全表扫描会是更有效的 。
在本文中,我们试图通过对简单查询示例及其执行计划的演示,来说明星形转换背后的基本思想。Oracle可以在更复杂的情况下进行星形转换。比如,有多个事实表的查询,雪花(维度是对多个范式化表的连接,而不是对一个非范式化的单表)等。
原文链接:https://blogs.oracle.com/optimizer/post/optimizer-transformations-star-transformation
Optimizer Transformations: Star Transformation
January 1, 2020 | 8 minute read
Sunil Chakkappen
Star transformation was introduced in Oracle 8i to process star queries efficiently. These queries are commonly used in data warehouse applications that follow the Star Schema data model. The Star Schema is so called because the data model diagram resembles a star. The center of the star consists of one or more fact tables and the points of the star are the dimension tables.
star_model.PNG
The basic idea of this transformation is to steer clear of using a full table scan access method on large tables, referred to as fact tables in the Star Schema. In a typical star query, the fact table is joined to several much smaller dimension tables. The fact table typically contains one key (referred to as foreign key) for every dimension table as well as a number of measure columns such as sales amount. The corresponding key in the dimension table is referred to as the primary key. The join is based on a foreign key of the fact table with the corresponding primary key of the dimension table. The query also contains filter predicates on other columns of the dimension tables that typically are very restrictive. The combination of these filters help to dramatically reduce the data set processed from the fact table. The goal of star transformation is to access only this reduced set of data from the fact table.
Consider the following star query Q1. The query is to find the total sales amount in all cities in California for quarters Q1 and Q2 of year 1999 through the Internet.
Q1:
SELECT c.cust_city, t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount FROM sales s, times t, customers c, channels ch WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id AND s.channel_id = ch.channel_id AND c.cust_state_province = 'CA' AND ch.channel_desc = 'Internet' AND t.calendar_quarter_desc IN ('1999-01','1999-02') GROUP BY c.cust_city, t.calendar_quarter_desc;
复制
Sales is the fact table while the other tables are considered as dimension tables. The Sales table contains one row for every sale of a product and thus it may contain billions of sales records. However only a few of them are sold to customers in California through the Internet for the specified quarters. The query is transformed into Q2.
Q2:
SELECT c.cust_city, t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount FROM sales s, times t, customers c WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id AND c.cust_state_province = 'CA' AND t.calendar_quarter_desc IN ('1999-01','1999-02') AND s.time_id IN (SELECT time_id FROM times WHERE calendar_quarter_desc IN('1999-01','1999-02')) AND s.cust_id IN (SELECT cust_id FROM customers WHERE cust_state_province='CA') AND s.channel_id IN (SELECT channel_id FROM channels WHERE channel_desc = 'Internet') GROUP BY c.cust_city, t.calendar_quarter_desc;
复制
Star transformation is essentially about adding subquery predicates corresponding to the constraint dimensions. These subquery predicates are referred to as bitmap semi-join predicates. The transformation is performed when there are indexes on the fact join columns (s.timeid, s.custid…). By driving bitmap AND and OR operations (bitmaps can be from bitmap indexes or generated from regular B-Tree indexes) of the key values supplied by the subqueries, only the relevant rows from the fact table need to be retrieved. If the filters on the dimension tables filter out a lot of data, this can be much more efficient than a full table scan on the fact table. After the relevant rows have been retrieved from the fact table, they may need to be joined back to the dimension tables, using the original predicates. In some cases, the join back can be eliminated. We will discuss this situation later.
Table 1 shows the query plan for the transformed query. Note that the sales table has a bitmap access path instead of a full table scan. For each key value coming from the subqueries (lines 11, 16, 21), the bitmaps are retrieved from the fact table indexes (lines 12, 17, 22). Each bit in the bitmap corresponds to a row in fact table. The bit is set if the key value from the subquery is same as the value in the row of fact table. For example, the bitmap [1][0][1][0][0][0]…(all 0s for remaining rows) indicate that rows 1 and 3 of fact table has matching key value from subquery. Lets say the above bitmap is for a key value from customers table subquery.
The operations in lines 9, 14, 19 iterates over the keys from the subqueries and get the corresponding bitmaps. Lets say the customers subquery produces one more key value with the bitmap [0][1][0][0][0][0]…
The bitmaps for each subquery are merged (ORed) (lines 8, 13 and 18). In the above example, it will produce a single bitmap [1][1][1][0][0][0]… for customers subquery after merging the two bitmaps.
The merged bitmaps are ANDed (line 7). Lets say the bitmap from channels is [1][0][0][0][0][0]… If you AND this bitmap with the bitmap from customers subquery it will produce [1][0][0][0][0]…
The corresponding rowids of the final bitmap are generated (line 6). The fact table rows are retrieved using the rowids (line 5). In the above example, it will generate only 1 rowid corresponding to the first row and fetches only a single row instead of scanning the entire fact table.
The representation of bitmaps in the above example is for illustration purpose only. In Oracle, they are represented and stored in a compressed form.
Table 1: The plan of the transformed query…
-------------------------------------------------------------- | Id | Operation | Name | -------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | HASH GROUP BY | | |* 2 | HASH JOIN | | |* 3 | TABLE ACCESS FULL | CUSTOMERS | |* 4 | HASH JOIN | | |* 5 | TABLE ACCESS FULL | TIMES | | 6 | VIEW | VW_ST_B1772830 | | 7 | NESTED LOOPS | | | 8 | PARTITION RANGE SUBQUERY | | | 9 | BITMAP CONVERSION TO ROWIDS| | | 10 | BITMAP AND | | | 11 | BITMAP MERGE | | | 12 | BITMAP KEY ITERATION | | | 13 | BUFFER SORT | | |* 14 | TABLE ACCESS FULL | CHANNELS | |* 15 | BITMAP INDEX RANGE SCAN| SALES_CHANNEL_BIX| | 16 | BITMAP MERGE | | | 17 | BITMAP KEY ITERATION | | | 18 | BUFFER SORT | | |* 19 | TABLE ACCESS FULL | TIMES | |* 20 | BITMAP INDEX RANGE SCAN| SALES_TIME_BIX | | 21 | BITMAP MERGE | | | 22 | BITMAP KEY ITERATION | | | 23 | BUFFER SORT | | |* 24 | TABLE ACCESS FULL | CUSTOMERS | |* 25 | BITMAP INDEX RANGE SCAN| SALES_CUST_BIX | | 26 | TABLE ACCESS BY USER ROWID | SALES | --------------------------------------------------------------
复制
Join back elimination
The subqueries and their bitmap tree only filter the fact table based on the dimension filters, so it may still be necessary to join to the dimension table. The join back of the dimension table is eliminated when all the predicates on dimension tables are part of the semijoin subquery predicate, the column(s) selected from the subquery are unique and the dimension columns are not in select list, group by etc. In the above example, the table channels is not joined back to the sales table since it is not referenced outside and channel_id is unique.
Temporary table transformation
If the join back is not eliminated, Oracle stores the results of the subquery in a temporary table to avoid re-scanning the dimension table (for bitmap key generation and join back). In addition to this, the results are materialized if the query is run in parallel, so that each slave can select the results from the temporary tables instead of executing the subquery again.
For example, if Oracle materializes the results of the subquery on customers into a temporary table, the transformed query Q3 will be as follows.
Q3:
SELECT t1.c1 cust_city, t.calendar_quarter_desc calendar_quarter_desc, sum(s.amount_sold) sales_amount FROM sales s, sh.times t, sys_temp_0fd9d6621_e7e24 t1 WHERE s.time_id=t.time_id AND s.cust_id=t1.c0 AND (t.calendar_quarter_desc='1999-q1' OR t.calendar_quarter_desc='1999-q2') AND s.cust_id IN (SELECT t1.c0 FROM sys_temp_0fd9d6621_e7e24 t1) AND s.channel_id IN (SELECT ch.channel_id FROM channels ch WHERE ch.channel_desc='internet') AND s.time_id IN (SELECT t.time_id FROM times t WHERE t.calendar_quarter_desc='1999-q1' OR t.calendar_quarter_desc='1999-q2') GROUP BY t1.c1, t.calendar_quarter_desc;
复制
Note that customers is replaced by the temporary table sys_temp_0fd9d6621_e7e24 and references to columns cust_id and cust_city are replaced by the corresponding columns of the temporary table. The temporary table will be created with 2 columns - (c0 number, c1 varchar2(30)). These columns corresponds to cust_id and cust_city of customers table. The table will be populated using the following query Q4 at the beginning of the execution of the statement Q3.
Q4:
SELECT c.cust_id, c.cust_city FROM customers WHERE c.cust_state_province = 'CA'
复制
Table 2 shows the plan for the transformed query.
Table 2: Plan with temporary table transformation…
---------------------------------------------------------------------- | Id | Operation | Name | ---------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | TEMP TABLE TRANSFORMATION | | | 2 | LOAD AS SELECT | | |* 3 | TABLE ACCESS FULL | CUSTOMERS | | 4 | HASH GROUP BY | | |* 5 | HASH JOIN | | | 6 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6613_C716F| |* 7 | HASH JOIN | | |* 8 | TABLE ACCESS FULL | TIMES | | 9 | VIEW | VW_ST_A3F94988 | | 10 | NESTED LOOPS | | | 11 | PARTITION RANGE SUBQUERY | | | 12 | BITMAP CONVERSION TO ROWIDS| | | 13 | BITMAP AND | | | 14 | BITMAP MERGE | | | 15 | BITMAP KEY ITERATION | | | 16 | BUFFER SORT | | |* 17 | TABLE ACCESS FULL | CHANNELS | |* 18 | BITMAP INDEX RANGE SCAN| SALES_CHANNEL_BIX | | 19 | BITMAP MERGE | | | 20 | BITMAP KEY ITERATION | | | 21 | BUFFER SORT | | |* 22 | TABLE ACCESS FULL | TIMES | |* 23 | BITMAP INDEX RANGE SCAN| SALES_TIME_BIX | | 24 | BITMAP MERGE | | | 25 | BITMAP KEY ITERATION | | | 26 | BUFFER SORT | | | 27 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6613_C716F| |* 28 | BITMAP INDEX RANGE SCAN| SALES_CUST_BIX | | 29 | TABLE ACCESS BY USER ROWID | SALES | ----------------------------------------------------------------------
复制
The lines 1,2 and 3 of the plan materialize the customers subquery into the temporary table. In line 24, it scans the temporary table (instead of the subquery) to build the bitmap from the fact table. Line 26 is for scanning the temporary table for joining back instead of scanning customers. Note that the filter on customers is not needed to be applied on the temporary table since the filter is already applied while materializing the temporary table.
Enabling the transformation
Star transformation is controlled by the star_transformation_enabled database initialization parameter. The parameter takes 3 values:
TRUE | The Oracle Optimizer performs transformation by identifying fact and constraint dimension tables automatically. This is done in a cost-based manner, i.e. the transformation is performed only if the cost of the transformed plan is lower than the non-transformed plan. Also the optimizer will attempt temporary table transformation automatically whenever materialization improves performance. |
---|---|
FALSE | The transformation is not tried. |
TEMP_DISABLE | This value has similar behavior as TRUE except that temporary table transformation is not tried. |
The default value of the parameter is FALSE. You have to change the parameter value and create indexes on the joining columns of the fact table to take advantage of this transformation.
Summary
Star transformation improves the performance of queries with a very big fact table joined to multiple dimension tables when the dimension tables have very selective predicates. The transformation avoids the full scan of the fact table. It fetches only relevant rows from the fact table that will eventually join to the constraint dimension rows. The transformation is performed based on cost - only when the cost of the transformed plan is lower than that of the non-transformed plan. If the dimension filters do not significantly reduce the amount of data to be retrieved from the fact table, then a full table scan is more efficient.
In this post we have tried to illustrate the basic ideas behind star transformation by showing simple example queries and plans. Oracle can do star transformation in more complex cases. For example, a query with multiple fact tables, snowflakes (dimension is a join of several normalized tables instead of denormalized single table), etc.