International Journal of Software and Informatics, ISSN 1673-7288
http://www.ijsi.org, ijsi@iscas.ac.cn, +86-10-62661048
IJSI, 2023, 13(1): 87–115, doi: 10.21655/ijsi.1673-7288.00297
©2023 by Institute of Software, Chinese Academy of Sciences. All rights reserved.
Research Article
BDMasker: Dynamic Data Protection System for
Open Big Data Environment
Yaofeng Tu ( )¹,², Jiahao Niu ( )², Dezheng Wang ( )¹,², Hong Gao (高洪)², Jin Xu (徐进)², Ke Hong (洪科)², Fang Yang (阳方)²
¹ (State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518057, China)
² (ZTE Corporation, Nanjing 210014, China)
Corresponding author: Jiahao Niu, niu.jiahao@zte.com.cn
Abstract Big data has become a national basic strategic resource, and the opening and sharing
of data is the core of China's big data strategy. Cloud native technology and lake-house
architecture are reconstructing the big data infrastructure and promoting data sharing and value
dissemination. The development of the big data industry and technology requires stronger data
security and data sharing capabilities. However, data security in an open environment has
become a bottleneck, which restricts the development and utilization of big data technology.
The issues of data security and privacy protection have become increasingly prominent both
in the open source big data ecosystem and the commercial big data system. Dynamic data
protection systems in the open big data environment now face challenges in aspects such as
data availability, processing efficiency, and system scalability. This paper proposes
BDMasker, a dynamic data protection system for the open big data environment. Using
precise query analysis and query rewriting based on a query dependency model,
it accurately perceives the original business request without changing it, so
that the whole process of dynamic masking has zero impact on the business. Furthermore,
its multi-engine-oriented unified security strategy framework realizes the vertical expansion of
dynamic data protection capabilities and the horizontal expansion among multiple computing
engines. The distributed computing capability of the big data execution engine can be used to
improve the data protection processing performance of the system. The experimental results
show that the precise SQL analysis and rewriting technology proposed by BDMasker is effective.
The system has good scalability and performance, and the overall performance fluctuates within
3% in the TPC-DS and YCSB benchmark tests.
Keywords big data; data masking; dynamic data masking; SQL rewriting; query dependency
Citation Tu YF, Niu JH, Wang DZ, Gao H, Xu J, Hong K, Yang F. BDMasker: Dynamic data protection
system for open big data environment, International Journal of Software and Informatics, 2023, 13(1):
87–115. http://www.ijsi.org/1673-7288/297.htm
This is the English version of Chinese article “面 , 2022, 34(3):
1213–1235. doi: 10.13328/j.cnki.jos.006783
Funding items: National Key R&D Program of China (2021YFB3101100)
Received 2022-05-14; Revised 2022-07-29, 2022-09-07; Accepted 2022-09-23; IJSI published online 2023-03-30
In the era of big data, big data serves as a national basic strategic resource. Attaching great
importance to the development of big data, China has begun to put in place the national big
data strategy in an all-round manner, in which the opening and sharing of data lies at the core
of the big data competition strategy. From the perspective of the technological development
trend, new technical architectures and big data support platforms are emerging, among which the
cloud native and lake-house architecture are reconstructing the big data infrastructure. Stronger
capabilities of data security and data sharing are required from access to the data lake and
data warehouse to cross-database and cross-domain sharing. Both the open source big data
ecosystem and the commercial big data system, however, fall worryingly behind the business
development in the security protection capability of big data in an open environment. Privacy
disclosures in recent years show that releasing or sharing unmasked data is highly likely
to reveal private data, especially sensitive personal information. In 2018, the
data of 87 million users of Facebook, a US social media company, was illegally used by Cambridge
Analytica, a consulting company, and Facebook paid a $5 billion fine over the incident. In
2021, there was a data leak involving another 533 million Facebook users. The
security problem in the open environment has become a bottleneck in the development and
utilization of big data technology. Accordingly, how to protect the privacy of sensitive data
in an open and complex environment while ensuring good data availability and computing
efficiency has become a research focus in big data security[1, 2].
Data security in the open big data environment differs greatly from traditional data security,
with changes seen in the protection method, protection object, and the relationship between
management and technology. Open big data applications are committed to opening and sharing
data; more diverse roles take part in data processing, and data flow is the norm,
which sets higher standards for data security protection. Traditional data security measures
such as data encryption and static masking have therefore become outdated. According to
relevant research, privacy protection and dynamic data masking technologies represent important
means for safe data flow and sharing and credible big data services[3, 4]. By maintaining the
availability of data sources without leaking sensitive information in the data flow, dynamic
data masking technology offers good utility and broad application prospects. In an open big
data environment, a complex and pressing problem is how to dynamically
protect sensitive data in an automated, efficient, and scalable manner while limiting the impact
on normal businesses amid massive multimodal data and highly concurrent access requests[5–7].
The following challenges are mainly involved.
(1) Scalability of heterogeneous environments. To satisfy the timeliness requirement of
different data queries and data computing under an open big data scenario, many kinds of big
data computing engines are often deployed simultaneously on the same cluster. For instance,
Apache Spark[8] is suitable for batch processing of static data with high latency, and Apache
Flink[9] is for low-latency or real-time streaming data processing. Faced with complex and
diverse business scenarios and multiple computing engines, we should explore how to create,
manage, and maintain a uniform data protection strategy for heterogeneous engines and provide
standardized access methods for the horizontal expansion of heterogeneous environments. In
addition to the capability of dynamic data masking, it is necessary to study how to flexibly
support multiple capabilities of dynamic data protection under one framework and support the
vertical expansion of the dynamic data protection capability of a single engine.
(2) High efficiency of processing performance. In an open big data environment, data is
generated at a faster pace, with its size on the exponential rise. To meet the response time
requirements in high-performance real-time protection of massive data, data security protection
must be able to operate automatically under the rules and optimize the load of the whole
processing process. In this way, it can make full use of the distributed computing capability of
the big data execution engine to enhance the processing performance.
(3) Precision of SQL rewriting. SQL is a widely used data query language, and the popular
big data computing engines can all provide SQL access. SQL rewriting represents the key
technology for dynamic data masking. As SQL requests in the business field are ever-changing,
and the SQL rewriting mechanism involves all columns that define masking strategies, the
rewriting of complex SQL statements may result in data distortion and reduce data availability,
even undermining the accuracy of business logic processing. For complex SQL access
requests, a key challenge in designing a dynamic data protection system, and its SQL
rewriting technology in particular, is to ensure that the rewritten SQL is completely transparent
to the business without exposing the sensitive information of the underlying
physical tables, so that the business logic is isolated from the influence of data protection.
Next, we take Query76 of TPC-DS[10], shown in Listing 1, as an example to
explain the technical difficulties of SQL rewriting. Query76 covers most of the important syntax
rules in SQL statements.
Listing 1 Query76 of TPC-DS
SELECT channel, col_name, d_year, d_qoy, i_category,
       COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt
FROM (
    SELECT 'store' AS channel, ss_store_sk col_name, d_year, d_qoy, i_category,
           ss_ext_sales_price ext_sales_price
    FROM store_sales, item, date_dim
    WHERE ss_store_sk IS NULL AND ss_sold_date_sk = d_date_sk
      AND ss_item_sk = i_item_sk
    UNION ALL
    SELECT 'web' AS channel, ws_ship_customer_sk col_name, d_year, d_qoy,
           i_category, ws_ext_sales_price ext_sales_price
    FROM web_sales, item, date_dim
    WHERE ws_ship_customer_sk IS NULL AND ws_sold_date_sk = d_date_sk
      AND ws_item_sk = i_item_sk
    UNION ALL
    SELECT 'catalog' AS channel, cs_ship_addr_sk col_name, d_year, d_qoy,
           i_category, cs_ext_sales_price ext_sales_price
    FROM catalog_sales, item, date_dim
    WHERE cs_ship_addr_sk IS NULL AND cs_sold_date_sk = d_date_sk
      AND cs_item_sk = i_item_sk
) foo
GROUP BY channel, col_name, d_year, d_qoy, i_category
ORDER BY channel, col_name, d_year, d_qoy, i_category
LIMIT 100
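The risk of data distortion described in challenge (3) can be reproduced on a toy table. The following is a minimal sketch using Python's built-in sqlite3 module. The table orders, its columns, and the masking expression are hypothetical and are not taken from BDMasker. The sketch compares a naive rewrite, which masks a sensitive column inside the sub-query, with a rewrite that masks only the outermost output column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (phone TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('13800001111', 10),
        ('13800002222', 20),
        ('13800002222', 5);
""")

# Hypothetical masking rule: keep the first 7 characters, hide the rest.
MASK = "substr({col}, 1, 7) || '****'"

# Original business query: total amount per phone number.
original = """
    SELECT phone, SUM(amount) AS total
    FROM (SELECT phone, amount FROM orders) t
    GROUP BY phone ORDER BY total
"""

# Naive rewrite: the mask is applied inside the sub-query,
# so GROUP BY operates on masked values and merges distinct users.
naive = f"""
    SELECT phone, SUM(amount) AS total
    FROM (SELECT {MASK.format(col='phone')} AS phone, amount FROM orders) t
    GROUP BY phone ORDER BY total
"""

# Outermost rewrite: grouping still uses the raw value,
# and only the final output column is masked.
outer = f"""
    SELECT {MASK.format(col='phone')} AS phone, total
    FROM (
        SELECT phone, SUM(amount) AS total
        FROM (SELECT phone, amount FROM orders) t
        GROUP BY phone
    ) g ORDER BY total
"""

rows_original = conn.execute(original).fetchall()
rows_naive = conn.execute(naive).fetchall()
rows_outer = conn.execute(outer).fetchall()

print(rows_original)  # two groups, raw phone numbers
print(rows_naive)     # one merged group: the aggregate is distorted
print(rows_outer)     # two groups with the original totals, phone masked
```

In this toy case the naive rewrite changes the number of groups and the totals, whereas masking at the outermost output keeps the aggregates of the original query. This is only one way to read "transparent to the business" and is an assumption for illustration.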
Precise SQL analysis. For an SQL query, the output of its query result set eventually
comes from the output fields of the “select” statement of the outermost query, and an output
field may come from sub-query statements, “join” statements, “union” statements, etc., and may
undergo multi-tier conversion through sub-queries, nested functions, etc. Therefore, we need
to accurately identify and correctly mask the source of sensitive fields in the outermost output
parts; otherwise, the sensitive information of the underlying physical table may get exposed.
For instance, the fields in the physical tables on which the outermost output column col_name
in Query76 finally depends include the following: the ss_store_sk field in the store_sales
table, the ws_ship_customer_sk field in the web_sales table, and the cs_ship_addr_sk field in the
catalog_sales table. If masking rules are set for only one of the three fields, and the
rewriter fails to obtain the underlying physical fields on which col_name depends and to apply
the corresponding masking rules, the outermost output column col_name will leak sensitive
data in the query result set.
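The dependency of col_name on three physical fields can be traced mechanically. The sketch below is a simplified illustration, not BDMasker's query dependency model: the Select and Union classes and the physical_sources function are hypothetical, the query structure is written by hand rather than produced by a SQL parser, and UNION branches are matched by column name (which happens to work for Query76 because every branch uses the alias col_name).

```python
from dataclasses import dataclass


@dataclass
class Select:
    # Maps an output column name to its source:
    #   ("table", table_name, column_name)  for a physical field, or
    #   ("query", inner_column_name)        for a column of the inner query.
    outputs: dict
    source: object = None  # inner Select or Union, if any


@dataclass
class Union:
    branches: list


def physical_sources(node, column):
    """Return the set of (table, column) pairs that `column` of `node` depends on."""
    if isinstance(node, Union):
        # A UNION output column depends on the matching column of every branch.
        result = set()
        for branch in node.branches:
            result |= physical_sources(branch, column)
        return result
    kind, *ref = node.outputs[column]
    if kind == "table":
        return {tuple(ref)}
    # kind == "query": follow the reference into the inner query.
    return physical_sources(node.source, ref[0])


def branch(table, key):
    # One UNION ALL branch of Query76, reduced to the col_name output.
    return Select(outputs={"col_name": ("table", table, key)})


inner = Union([
    branch("store_sales", "ss_store_sk"),
    branch("web_sales", "ws_ship_customer_sk"),
    branch("catalog_sales", "cs_ship_addr_sk"),
])
outer = Select(outputs={"col_name": ("query", "col_name")}, source=inner)

deps = physical_sources(outer, "col_name")

# Hypothetical policy: only one of the three fields has a masking rule.
masked_fields = {("store_sales", "ss_store_sk")}

# The output column must be masked if any field it depends on is masked.
must_mask_output = bool(deps & masked_fields)
print(sorted(deps), must_mask_output)
```

Under this assumption, a rule on ss_store_sk alone is enough to require masking of the outermost col_name, since the output mixes values from all three branches.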
Precise positioning of sensitive fields. The sensitive fields in the SQL query requests
may come from different syntactic structures. For example, in Query76, the field i_category
appears many times in the sub-query output fields at different tiers; the field d_year is seen in both
the sub-query output fields and the GROUP BY and ORDER BY statements. Some query output