International Journal of Software and Informatics, ISSN 1673-7288

http://www.ijsi.org, ijsi@iscas.ac.cn, +86-10-62661048

IJSI, 2023, 13(1): 87–115, doi: 10.21655/ijsi.1673-7288.00297

Research Article

BDMasker: Dynamic Data Protection System for Open Big Data Environment

Yaofeng Tu (屠要峰)1,2, Jiahao Niu (牛家浩), Dezheng Wang (王德政)1,2, Hong Gao (高洪), Jin Xu (徐进), Ke Hong (洪科), Fang Yang (阳方)

(State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518057, China)

(ZTE Corporation, Nanjing 210014, China)

Corresponding author: Jiahao Niu, niu.jiahao@zte.com.cn

Abstract Big data has become a basic national strategic resource, and the opening and sharing of data is the core of China’s big data strategy. Cloud native technology and the lake-house architecture are reconstructing the big data infrastructure and promoting data sharing and value dissemination. The development of the big data industry and technology requires stronger data security and data sharing capabilities. However, data security in an open environment has become a bottleneck that restricts the development and utilization of big data technology. Data security and privacy protection issues have become increasingly prominent in both the open source big data ecosystem and commercial big data systems. Dynamic data protection systems in the open big data environment now face challenges in data availability, processing efficiency, and system scalability. This paper proposes BDMasker, a dynamic data protection system for the open big data environment. Through precise query analysis and query rewriting technology based on the query dependency model, it accurately perceives the original business request without changing it, so the whole process of dynamic masking has zero impact on the business. Furthermore, its multi-engine-oriented unified security strategy framework realizes the vertical expansion of dynamic data protection capabilities and the horizontal expansion among multiple computing engines. The distributed computing capability of the big data execution engine can be used to improve the data protection processing performance of the system. The experimental results show that the precise SQL analysis and rewriting technology proposed by BDMasker is effective. The system has good scalability and performance, and the overall performance fluctuates within 3% in the TPC-DS and YCSB benchmark tests.

Keywords big data; data masking; dynamic data masking; SQL rewriting; query dependency

Citation Tu YF, Niu JH, Wang DZ, Gao H, Xu J, Hong K, Yang F. BDMasker: Dynamic data protection system for open big data environment, International Journal of Software and Informatics, 2023, 13(1): 87–115. http://www.ijsi.org/1673-7288/297.htm

This is the English version of the Chinese article “面向开放大数据环境的动态数据保护系统, 2022, 34(3): 1213–1235. doi: 10.13328/j.cnki.jos.006783”

Funding items: National Key R&D Program of China (2021YFB3101100)

Received 2022-05-14; Revised 2022-07-29, 2022-09-07; Accepted 2022-09-23; IJSI published online 2023-03-30

In the era of big data, data serves as a basic national strategic resource. Attaching great importance to the development of big data, China has begun to implement its national big data strategy in an all-round manner, with the opening and sharing of data at the core of the big data competition strategy. In terms of technological trends, new technical architectures and big data support platforms are emerging, among which cloud native technology and the lake-house architecture are reconstructing the big data infrastructure. Stronger data security and data sharing capabilities are required, from access to the data lake and data warehouse to cross-database and cross-domain sharing. Both the open source big data ecosystem and commercial big data systems, however, lag worryingly behind business development in their ability to protect big data in an open environment. Privacy disclosures in recent years show that releasing or sharing unmasked data is highly prone to revealing private data, especially sensitive personal information. In 2018, the data of 87 million users of Facebook, a US social media company, was illegally used by the consulting company Cambridge Analytica, and Facebook paid a $5 billion fine over the incident. In 2021, a further data leak involved another 533 million Facebook users. The security problem in the open environment has become a bottleneck in the development and utilization of big data technology. Accordingly, how to protect the privacy of sensitive data in an open and complex environment while ensuring good data availability and computing efficiency has become one of the research focuses in big data security [1, 2].

Data security in the open big data environment differs greatly from traditional data security, with changes in the protection method, the protection object, and the relationship between management and technology. Open big data application scenarios are committed to the opening and sharing of data: more diverse roles are involved in data processing, and data flow is the norm, which sets higher standards for data security protection. Traditional data security measures such as data encryption and static masking are therefore no longer adequate. According to relevant research, privacy protection and dynamic data masking technologies are important means of achieving safe data flow and sharing and credible big data services [3, 4]. By keeping data sources available without leaking sensitive information in the data flow, dynamic data masking technology offers good utility and broad application prospects. In an open big data environment, how to dynamically protect sensitive data in an automated, efficient, and scalable manner while limiting the impact on normal business, amid massive multimodal data and highly concurrent access requests, is a complex problem that urgently needs solving [5–7]. The main challenges are as follows.

(1) Scalability of heterogeneous environments. To satisfy the timeliness requirements of different data queries and data computing in an open big data scenario, several kinds of big data computing engines are often deployed simultaneously on the same cluster. For instance, Apache Spark [8] is suited to batch processing of static data, where higher latency is acceptable, and Apache Flink [9] to low-latency or real-time streaming data processing. Faced with complex and diverse business scenarios and multiple computing engines, we need to explore how to create, manage, and maintain a uniform data protection strategy for heterogeneous engines and provide standardized access methods for the horizontal expansion of heterogeneous environments. Beyond dynamic data masking, it is also necessary to study how to flexibly support multiple dynamic data protection capabilities under one framework and thus support the vertical expansion of the dynamic data protection capability of a single engine.

(2) High efficiency of processing performance. In an open big data environment, data is generated at an ever faster pace, and its size is growing exponentially. To meet the response time requirements of high-performance real-time protection of massive data, data security protection must operate automatically under the rules and optimize the load of the whole processing flow. In this way, it can make full use of the distributed computing capability of the big data execution engine to enhance processing performance.

(3) Precision of SQL rewriting. SQL is a widely used data query language, and the popular big data computing engines all provide SQL access. SQL rewriting is the key technology for dynamic data masking. Because SQL requests in the business field are ever-changing and the SQL rewriting mechanism involves all columns for which masking strategies are defined, rewriting complex SQL statements may distort data and reduce data availability, or even undermine the accuracy of business logic processing. For complex SQL access requests, the challenge for the design of a dynamic data protection system, and especially for its SQL rewriting technology, is to ensure that the rewritten SQL is completely transparent to the business without exposing the sensitive information of the underlying physical table, so that the business logic is isolated from the influence of data protection. Next, we take Query76 of TPC-DS [10], shown in Listing 1, as an example to explain the technical difficulties of SQL rewriting. Query76 covers most of the important syntax rules in SQL statements.

Listing 1 Query76 of TPC-DS

SELECT channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt,
       SUM(ext_sales_price) sales_amt
FROM (
    SELECT 'store' AS channel, ss_store_sk col_name, d_year, d_qoy, i_category,
           ss_ext_sales_price ext_sales_price
    FROM store_sales, item, date_dim
    WHERE ss_store_sk IS NULL AND ss_sold_date_sk = d_date_sk AND ss_item_sk = i_item_sk
    UNION ALL
    SELECT 'web' AS channel, ws_ship_customer_sk col_name, d_year, d_qoy, i_category,
           ws_ext_sales_price ext_sales_price
    FROM web_sales, item, date_dim
    WHERE ws_ship_customer_sk IS NULL AND ws_sold_date_sk = d_date_sk AND ws_item_sk = i_item_sk
    UNION ALL
    SELECT 'catalog' AS channel, cs_ship_addr_sk col_name, d_year, d_qoy, i_category,
           cs_ext_sales_price ext_sales_price
    FROM catalog_sales, item, date_dim
    WHERE cs_ship_addr_sk IS NULL AND cs_sold_date_sk = d_date_sk AND cs_item_sk = i_item_sk
) foo
GROUP BY channel, col_name, d_year, d_qoy, i_category
ORDER BY channel, col_name, d_year, d_qoy, i_category
LIMIT 100

Precise SQL analysis. For an SQL query, the query result set ultimately comes from the output fields of the “select” statement of the outermost query. These output fields may originate from sub-query statements, “join” statements, “union” statements, etc., and may undergo multi-tier conversion through sub-queries, nested functions, and the like. Therefore, we need to accurately identify the sources of sensitive fields in the outermost output and mask them correctly; otherwise, the sensitive information of the underlying physical table may be exposed. For instance, the outermost output column col_name in Query76 ultimately depends on the following physical table fields: the ss_store_sk field in the store_sales table, the ws_ship_customer_sk field in the web_sales table, and the cs_ship_addr_sk field in the catalog_sales table. Suppose a masking rule is set for only one of the three fields. Unless the system resolves the underlying physical table fields on which col_name depends and applies the corresponding masking rule, col_name will leak sensitive data in the query result set.
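
The dependency tracing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BDMasker's query dependency model: the Select and Union classes, the CATALOG dictionary, and the lineage and must_mask functions are hypothetical names introduced here, the catalog lists only the columns that Query76 touches, and the query tree is built by hand rather than parsed from SQL.

```python
from dataclasses import dataclass

# Hypothetical hand-built catalog: physical table -> columns used by Query76.
CATALOG = {
    "store_sales": {"ss_store_sk", "ss_ext_sales_price", "ss_sold_date_sk", "ss_item_sk"},
    "web_sales": {"ws_ship_customer_sk", "ws_ext_sales_price", "ws_sold_date_sk", "ws_item_sk"},
    "catalog_sales": {"cs_ship_addr_sk", "cs_ext_sales_price", "cs_sold_date_sk", "cs_item_sk"},
    "item": {"i_item_sk", "i_category"},
    "date_dim": {"d_date_sk", "d_year", "d_qoy"},
}


@dataclass
class Select:
    outputs: list  # ordered (output_name, source_column); None marks a literal or COUNT(*)
    sources: list  # physical table names (str) or nested Select/Union nodes


@dataclass
class Union:
    branches: list  # Select nodes whose output columns are aligned by position


def output_names(node):
    """Output column names of a node; a UNION takes them from its first branch."""
    if isinstance(node, Union):
        return output_names(node.branches[0])
    return [name for name, _ in node.outputs]


def lineage(node, column):
    """Return the set of (table, column) pairs that `column` of `node` depends on."""
    if isinstance(node, Union):
        idx = output_names(node).index(column)
        deps = set()
        for branch in node.branches:  # every branch contributes to the same position
            deps |= lineage(branch, branch.outputs[idx][0])
        return deps
    source_col = dict(node.outputs)[column]
    if source_col is None:  # literal such as 'store': no physical dependency
        return set()
    deps = set()
    for src in node.sources:
        if isinstance(src, str):
            if source_col in CATALOG[src]:
                deps.add((src, source_col))
        elif source_col in output_names(src):
            deps |= lineage(src, source_col)
    return deps


def must_mask(node, column, masked_fields):
    """True if any physical field behind `column` has a masking rule."""
    return bool(lineage(node, column) & masked_fields)


def branch(table, key_col, price_col):
    """One UNION ALL branch of Query76."""
    return Select(
        outputs=[("channel", None), ("col_name", key_col), ("d_year", "d_year"),
                 ("d_qoy", "d_qoy"), ("i_category", "i_category"),
                 ("ext_sales_price", price_col)],
        sources=[table, "item", "date_dim"],
    )


foo = Union([
    branch("store_sales", "ss_store_sk", "ss_ext_sales_price"),
    branch("web_sales", "ws_ship_customer_sk", "ws_ext_sales_price"),
    branch("catalog_sales", "cs_ship_addr_sk", "cs_ext_sales_price"),
])

query76 = Select(
    outputs=[("channel", "channel"), ("col_name", "col_name"), ("d_year", "d_year"),
             ("d_qoy", "d_qoy"), ("i_category", "i_category"),
             ("sales_cnt", None), ("sales_amt", "ext_sales_price")],
    sources=[foo],
)

# A masking rule on only one of the three underlying fields.
masked = {("web_sales", "ws_ship_customer_sk")}
print(sorted(lineage(query76, "col_name")))
print(must_mask(query76, "col_name", masked))
```

In this sketch, col_name resolves to all three physical fields, so a rule defined on web_sales.ws_ship_customer_sk alone is enough to flag the outermost column for masking. A real system would derive the tree from the engine's SQL parser and would also need to handle aliases, expressions, and nested functions, which are omitted here.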

Precise positioning of sensitive fields. The sensitive fields in the SQL query requests may come from different syntactic structures. For example, in Query76, the field i_category appears many times in the sub-query output field at different tiers; the field d_year is seen in both the sub-query output field and the GROUP BY and ORDER BY statements. Some query output
