
Tu YF, et al. BDMasker: Dynamic data protection system for open big data environment 89
(3) Precision of SQL rewriting. SQL is a widely used data query language, and the popular
big data computing engines can all provide SQL access. SQL rewriting represents the key
technology for dynamic data masking. As SQL requests in the business field are ever-changing,
and the SQL rewriting mechanism involves all columns that define masking strategies, the
rewriting of complex SQL statements may result in data distortion and reduce data availability,
even undermining the accuracy of business logic processing. In the case of complex SQL access
requests, it poses a challenge to the design of a dynamic data protection system, especially the
SQL rewriting technology, as to how to ensure that the rewritten SQL is completely transparent
to the businesses on the premise of not exposing the sensitive information of the underlying
physical table so as to make the business logic isolated from the influence of data protection.
Next, we take the query statement Query76 of TPC-DS
[10]
shown in Listing 1 as an example to
explain the technical difficulties of SQL rewriting. Query76 covers most of the important syntax
rules in SQL statements.
Listing 1 Query76 of TPC-DS
SELECT channel , col_name , d_year , d_qoy , i_category , COUNT(*) sales_cnt , SUM(
ext_sales_price) sales_amt
FROM (
SELECT ’store ’ AS chan ne l , ss_store_sk col_name , d_year , d_qoy , i_category ,
ss_ext_sales_price ext_sales_price
FROM store_sales , item , date_dim
WHERE ss_store_sk IS NULL AND ss_sold_date_sk = d_date_sk AND ss_item_sk =
i_item_sk
UNION ALL
SELECT ’web’ AS channel , ws_ship_customer_sk col_name , d_year , d_qoy ,
i_category , ws_ext_sales_price ext_sales_price
FROM web_sales , item , date_dim
WHERE ws_ship_customer_sk IS NULL AND ws_sold_date_sk = d_date_sk AND
ws_item_sk = i_item_sk
UNION ALL
SELECT ’ catalog ’ AS channel , cs_ship_addr_sk col_name , d_year , d_qoy ,
i_category , cs_ext_sales_price ext_sales_price
FROM catalog_sales , item , date_dim
WHERE cs_ship_addr_sk IS NULL AND cs_sold_date_sk = d_date_sk AND cs_item_sk =
i_item_sk
) foo
GROUP BY channel , col_name , d_year , d_qoy , i_category
ORDER BY channel , col_name , d_year , d_qoy , i_category
LIMIT 100
Precise SQL analysis. For an SQL query, the output of its query result set eventually
comes from the output field of the “select” statement of the outermost query, and the output
field may be from sub-query statements, “join” statements, “union” statements, etc., which may
undergo multi-tier conversion through sub-query and nested functions, etc. Therefore, we need
to accurately identify and correctly mask the source of sensitive fields in the outermost output
parts; otherwise, the sensitive information of the underlying physical table may get exposed.
For instance, the fields in the physical table on which the outermost output column col_ name
in Query76 finally depends include the following: ss_store_sk field in the store_ sales
table, ws_ship_customer_sk field in the Web_sales table, and cs_ship_addr_sk field in the
catalog_sales table. If masking rules were set for only one of the three fields, the outermost
output column col_name would cause the leak of sensitive data in the query result set by not
obtaining the field information of the underlying physical table on which it depends and applying
the corresponding masking rules.
Precise positioning of sensitive fields. The sensitive fields in the SQL query requests
may come from different syntactic structures. For example, in Query76, the field i_category
appears many times in the sub-query output field at different tiers; the field
d_year
is seen in both
the sub-query output field and the GROUP BY and ORDER BY statements. Some query output
相关文档
评论