暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
SIGMOD-2024-Secure Sampling for Approximate Multi-party Query Processing_阿里云.pdf
96
27页
1次
2024-06-24
免费下载
219
Secure Sampling for Approximate Multi-party ery
Processing
QIYAO LUO, Hong Kong University of Science and Technology, Hong Kong SAR, China
YILEI WANG, Alibaba Group, Hangzhou, China
KE YI, Hong Kong University of Science and Technology, Hong Kong SAR, China
SHENG WANG, Alibaba Group, Hangzhou, China
FEIFEI LI, Alibaba Group, Hangzhou, China
We study the problem of random sampling in the secure multi-party computation (MPC) model. In MPC,
taking a sample securely must have a cost
Ω(𝑛)
irrespective to the sample size
𝑠
. This is in stark contrast with
the plaintext setting, where a sample can be taken in
𝑂 (𝑠)
time trivially. Thus, the goal of approximate query
processing (AQP) with sublinear costs seems unachievable under MPC. To get around this inherent barrier, in
this paper we take a two-stage approach: In the oine stage, we generate a batch of
𝑛/𝑠
samples with
˜
𝑂 (𝑛)
total cost, which can then be consumed to answer queries as they arrive online. Such an approach allows us to
achieve an
˜
𝑂 (𝑠)
amortized cost per query, similar to the plaintext setting. Based on our secure batch sampling
algorithms, we build MASQUE, an MPC-AQP system that achieves sublinear online query costs by running
an MPC protocol to evaluate the queries on pre-generated samples. MASQUE achieves the strong security
guarantee of the MPC model, i.e., nothing is revealed beyond the query result, which itself can be further
protected by (amplied) dierential privacy.
CCS Concepts: Security and privacy
Database and storage security; Information systems
Data management systems; Mathematics of computing Probability and statistics.
Additional Key Words and Phrases: Secure multi-party computation; Sampling; Approximate query processing
ACM Reference Format:
Qiyao Luo, Yilei Wang, Ke Yi, Sheng Wang, and Feifei Li. 2023. Secure Sampling for Approximate Multi-
party Query Processing. Proc. ACM Manag. Data 1, 3 (SIGMOD), Article 219 (September 2023), 27 pages.
https://doi.org/10.1145/3617339
1 INTRODUCTION
Data analysis often needs to be conducted across multiple organizations. Meanwhile, as data gets
increasingly valuable, there is also a growing need to protect its security and privacy.
Example 1.1. An insurance company wishes to estimate the budget it should prepare for certain
types of diseases in 2023. This information could be obtained by running a query on the joint
data from the insurance company and the hospitals. For example, several hospitals jointly hold
R(person, year, disease)
and the insurance company has relation
S(person, cost, rate)
. A simple
SQL query on these relations can be the following:
Authors’ addresses: Qiyao Luo, qluoak@cse.ust.hk, Hong Kong University of Science and Technology, Hong Kong SAR,
China; Yilei Wang, fengmi.wyl@alibaba-inc.com, Alibaba Group, Hangzhou, China; Ke Yi, yike@cse.ust.hk, Hong Kong
University of Science and Technology, Hong Kong SAR, China; Sheng Wang, sh.wang@alibaba-inc.com, Alibaba Group,
Hangzhou, China; Feifei Li, lifeifei@alibaba-inc.com, Alibaba Group, Hangzhou, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the
full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specic permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2836-6573/2023/9-ART219 $15.00
https://doi.org/10.1145/3617339
Proc. ACM Manag. Data, Vol. 1, No. 3 (SIGMOD), Article 219. Publication date: September 2023.
219:2 Qiyao Luo et al.
SELECT disease, SUM(cost * rate)
FROM R, S
WHERE R.person = S.person AND R.year = 2023
GROUP BY disease
However, the medical records at the hospitals, as well as the insurance policies, contain sensitive
personal information, and should not be revealed to other parties. A query processing system under
the secure multi-party computation (MPC) model thus meets this requirement.
MPC enables multiple parties to jointly compute a function without disclosing their private input.
Unfortunately, the current MPC-SQL systems are 1000+ times slower than on plaintext [
7
,
44
]. For
instance, it takes SMCQL [
7
] 1000 seconds to evaluate a query over only 200 tuples. Most existing
works improve the performance by weakening the security guarantee or restricting to a small class
of queries. Conclave [
52
] relies on a trusted third party; Scape [
27
] reveals the intermediate join
sizes; Shrinkwrap [
8
] protects the intermediate join sizes by dierential privacy but at a high cost of
at least
Ω(𝑛
2
)
; Secure Yannakakis [
53
] strictly follows the MPC security requirement, but supports
only limited types of queries. In this paper, we aim at designing an MPC-SQL system that does not
compromise on security or generality; the sacrice we make is that only approximate query results
are returned.
In fact, even for query processing over plaintext, where performance is not a big issue unless
running on very large datasets, approximate query processing (AQP) has already been extensively
studied [
1
,
15
,
29
,
37
,
38
,
42
]. Thus, AQP is a more valuable approach for MPC even on not-so-large
datasets. Among the many AQP techniques, random sampling is the most appealing due to its
versatility in handling a large class of queries. Meanwhile, there is a rich body of work from the
statistics literature that can be used to derive statistically sound interpretations of the query results
evaluated on the sample. A general rule of thumb is that the (normalized) sampling error is roughly
proportional to 1
/
𝑠
for sample size
𝑠
. This means that the sample size, hence the query processing
cost, can be sublinear or even independent to the database size
𝑛
, which is especially appreciated
when facing large data sets.
Due to the high overhead of MPC protocols, sampling based AQP techniques are in an even
stronger demand than on plaintext. In Example 1.1, since the insurance company does not necessarily
require an exact result, the protocol can work on a sample so as to reduce the query evaluation
time.
Indeed, this motivation is well articulated in SAQE [
9
], the rst MPC-AQP system. Note that the
sample must be taken securely, i.e., no one should know which elements are taken (or not) into the
sample; otherwise, a curious onlooker may be able to deduce unauthorized information by linking
this knowledge with the query result [
9
]. However, the MPC sampling protocol of [
9
] has a cost
(both running time and communication) of
𝑂 (𝑛 log 𝑛)
, no matter how small the sample size
𝑠
is.
This may not be faster than exactly computing the query result without sampling, which has
𝑂 (𝑛)
cost for many queries but with a larger hidden constant.
Unfortunately, there is an inherent
Ω(𝑛)
lower bound for secure sampling (so the algorithm of
[
9
] cannot be improved by more than a logarithmic factor): If a record is not touched, then it reveals
that it must not be in the sample. To get around this barrier, in this paper, we take a two-stage
approach: In the oine stage (before any query is given), we solve the batch sampling problem,
i.e., generating
𝑛/𝑠
samples of size
𝑠
each, with a total cost of
˜
𝑂 (𝑛)
1
. Thus, the amortized cost per
sample is
˜
𝑂 (𝑠)
. Then during the online stage, the samples are consumed as queries arrive, such that
the online query processing cost is
˜
𝑂 (𝑠)
, similar to the plaintext setting (but with a larger hidden
1
The
˜
𝑂 notation suppresses polylogarithmic factors.
Proc. ACM Manag. Data, Vol. 1, No. 3 (SIGMOD), Article 219. Publication date: September 2023.
Secure Sampling for Approximate Multi-party ery Processing 219:3
constant, as we need to run an MPC protocol to evaluate the query on the sample). When samples
are depleted, we re-run the batch sampling to obtain a fresh set of samples. We also support the
case when
𝑠
is given at query time instead of during the oine stage, with a polylogarithmic factor
increase in the oine storage and computation. This idea will be introduced in Section 3.3.
This two-stage approach considerably reduces the response time of queries, as the heavy com-
putation parts have been moved to the oine stage. In fact, most MPC systems already adopt a
two-stage design: Oblivious transfer (OT) extension [
32
] generates a small number of base OTs
oine, which can then be used to extend to a large number of OTs online; Beaver triples [
10
] are
generated in an oine stage to accelerate multiplications in the online stage; when data is stored
in multiple tables, it is often denormalized using MPC joins [
7
,
27
,
40
,
44
,
54
] in an oine stage, i.e.,
the fact table is joined with all the dimension tables to form a at table. In some sense, our proposal
is to decouple the sampling process in such a manner as well, which is necessary to break the
Ω(𝑛)
lower bound.
1.1 Contributions
Specically, we make the following contributions in this paper:
(1)
We formally introduce the batch sampling problem under MPC, as a necessary way to
achieving an
˜
𝑂 (𝑠) cost amortized per sample.
(2)
We describe a suite of MPC batch sampling algorithms including shue sampling, sampling
with replacement (WR sampling), sampling without replacement (WoR sampling), and strat-
ied sampling. We discuss their pros and cons in terms of cost, independence, sampling
error, and (dierential) privacy amplication. The algorithm for WoR sampling under MPC is
particularly nontrivial. We model the problem as a graph, design a circuit for pointer jumping,
and then construct a WoR sampling circuit based on it. We also generalize our WoR sampling
algorithm to support stratied sampling.
(3)
Based on our sampling algorithms, we build MASQUE (Multi-party Approximate Secure
QUEry processing), an MPC-AQP system that utilizes two semi-honest non-colluding servers
with any number of data owners. During the oine stage, the data owners rst secret-
share their data to the two servers. MASQUE then denormalizes the data using secure joins
and generates a batch of samples. During the online stage, one sample is consumed for
each query received. For all-around privacy protection, MASQUE also optionally injects
dierential privacy (DP) noise to the query result before returning it to the designated
receiver. We empirically compare MASQUE with SAQE [
9
], the previous MPC-AQP system,
and SMCQL [
7
] and SecYan [
53
], the state-of-the-art exact MPC query processing engines.
Experimental results show that MASQUE can reduce the online latency of MPC query
processing signicantly.
1.2 Related Work
Sampling based AQP techniques have been extensively studied over plaintext data and many AQP
systems exist [
1
,
15
,
29
,
31
,
37
,
42
]. Some systems take the sample online after a query arrives
[
31
,
37
], while others also take a two-stage approach, where samples are pre-generated during an
oine stage [
1
,
29
,
42
], similar to what we do in this paper. Since an online sample can be taken
in
𝑂 (𝑠)
time over plaintext, the two-stage approach oers limited improvement, and which one
performs better delicately depends on the query and data characteristics [
15
]. This is the major
dierence between MPC and the plaintext setting, as taking an online sample under MPC requires
Ω(𝑛) cost, so the two-stage approach has a signicant advantage in this case.
Proc. ACM Manag. Data, Vol. 1, No. 3 (SIGMOD), Article 219. Publication date: September 2023.
of 27
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。