SIGMOD-2024-Secure Sampling for Approximate Multi-party Query Processing_阿里云.pdf

赤井秀一

27页

1次

2024-06-24

免费下载

219

Secure Sampling for Approximate Multi-party ery

Processing

QIYAO LUO, Hong Kong University of Science and Technology, Hong Kong SAR, China

YILEI WANG, Alibaba Group, Hangzhou, China

KE YI, Hong Kong University of Science and Technology, Hong Kong SAR, China

SHENG WANG, Alibaba Group, Hangzhou, China

FEIFEI LI, Alibaba Group, Hangzhou, China

We study the problem of random sampling in the secure multi-party computation (MPC) model. In MPC,

taking a sample securely must have a cost

Ω(𝑛)

irrespective to the sample size

𝑠

. This is in stark contrast with

the plaintext setting, where a sample can be taken in

𝑂 (𝑠)

time trivially. Thus, the goal of approximate query

processing (AQP) with sublinear costs seems unachievable under MPC. To get around this inherent barrier, in

this paper we take a two-stage approach: In the oine stage, we generate a batch of

𝑛/𝑠

samples with

𝑂 (𝑛)

total cost, which can then be consumed to answer queries as they arrive online. Such an approach allows us to

achieve an

𝑂 (𝑠)

amortized cost per query, similar to the plaintext setting. Based on our secure batch sampling

algorithms, we build MASQUE, an MPC-AQP system that achieves sublinear online query costs by running

an MPC protocol to evaluate the queries on pre-generated samples. MASQUE achieves the strong security

guarantee of the MPC model, i.e., nothing is revealed beyond the query result, which itself can be further

protected by (amplied) dierential privacy.

CCS Concepts: • Security and privacy

→

Database and storage security; • Information systems

→

Data management systems; • Mathematics of computing → Probability and statistics.

Additional Key Words and Phrases: Secure multi-party computation; Sampling; Approximate query processing

ACM Reference Format:

Qiyao Luo, Yilei Wang, Ke Yi, Sheng Wang, and Feifei Li. 2023. Secure Sampling for Approximate Multi-

party Query Processing. Proc. ACM Manag. Data 1, 3 (SIGMOD), Article 219 (September 2023), 27 pages.

https://doi.org/10.1145/3617339

1 INTRODUCTION

Data analysis often needs to be conducted across multiple organizations. Meanwhile, as data gets

increasingly valuable, there is also a growing need to protect its security and privacy.

Example 1.1. An insurance company wishes to estimate the budget it should prepare for certain

types of diseases in 2023. This information could be obtained by running a query on the joint

data from the insurance company and the hospitals. For example, several hospitals jointly hold

R(person, year, disease)

and the insurance company has relation

S(person, cost, rate)

. A simple

SQL query on these relations can be the following:

Authors’ addresses: Qiyao Luo, qluoak@cse.ust.hk, Hong Kong University of Science and Technology, Hong Kong SAR,

China; Yilei Wang, fengmi.wyl@alibaba-inc.com, Alibaba Group, Hangzhou, China; Ke Yi, yike@cse.ust.hk, Hong Kong

University of Science and Technology, Hong Kong SAR, China; Sheng Wang, sh.wang@alibaba-inc.com, Alibaba Group,

Hangzhou, China; Feifei Li, lifeifei@alibaba-inc.com, Alibaba Group, Hangzhou, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee

provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the

full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires

prior specic permission and/or a fee. Request permissions from permissions@acm.org.

2836-6573/2023/9-ART219 $15.00

https://doi.org/10.1145/3617339

Proc. ACM Manag. Data, Vol. 1, No. 3 (SIGMOD), Article 219. Publication date: September 2023.

219:2 Qiyao Luo et al.

SELECT disease, SUM(cost * rate)

FROM R, S

WHERE R.person = S.person AND R.year = 2023

GROUP BY disease

However, the medical records at the hospitals, as well as the insurance policies, contain sensitive

personal information, and should not be revealed to other parties. A query processing system under

the secure multi-party computation (MPC) model thus meets this requirement.

MPC enables multiple parties to jointly compute a function without disclosing their private input.

Unfortunately, the current MPC-SQL systems are 1000+ times slower than on plaintext [

]. For

instance, it takes SMCQL [

] 1000 seconds to evaluate a query over only 200 tuples. Most existing

works improve the performance by weakening the security guarantee or restricting to a small class

of queries. Conclave [

] relies on a trusted third party; Scape [

] reveals the intermediate join

sizes; Shrinkwrap [

] protects the intermediate join sizes by dierential privacy but at a high cost of

at least

Ω(𝑛

)

; Secure Yannakakis [

] strictly follows the MPC security requirement, but supports

only limited types of queries. In this paper, we aim at designing an MPC-SQL system that does not

compromise on security or generality; the sacrice we make is that only approximate query results

are returned.

In fact, even for query processing over plaintext, where performance is not a big issue unless

running on very large datasets, approximate query processing (AQP) has already been extensively

studied [

]. Thus, AQP is a more valuable approach for MPC even on not-so-large

datasets. Among the many AQP techniques, random sampling is the most appealing due to its

versatility in handling a large class of queries. Meanwhile, there is a rich body of work from the

statistics literature that can be used to derive statistically sound interpretations of the query results

evaluated on the sample. A general rule of thumb is that the (normalized) sampling error is roughly

proportional to 1

√

𝑠

for sample size

𝑠

. This means that the sample size, hence the query processing

cost, can be sublinear or even independent to the database size

𝑛

, which is especially appreciated

when facing large data sets.

Due to the high overhead of MPC protocols, sampling based AQP techniques are in an even

stronger demand than on plaintext. In Example 1.1, since the insurance company does not necessarily

require an exact result, the protocol can work on a sample so as to reduce the query evaluation

time.

Indeed, this motivation is well articulated in SAQE [

], the rst MPC-AQP system. Note that the

sample must be taken securely, i.e., no one should know which elements are taken (or not) into the

sample; otherwise, a curious onlooker may be able to deduce unauthorized information by linking

this knowledge with the query result [

]. However, the MPC sampling protocol of [

] has a cost

(both running time and communication) of

𝑂 (𝑛 log 𝑛)

, no matter how small the sample size

𝑠

is.

This may not be faster than exactly computing the query result without sampling, which has

𝑂 (𝑛)

cost for many queries but with a larger hidden constant.

Unfortunately, there is an inherent

Ω(𝑛)

lower bound for secure sampling (so the algorithm of

[

] cannot be improved by more than a logarithmic factor): If a record is not touched, then it reveals

that it must not be in the sample. To get around this barrier, in this paper, we take a two-stage

approach: In the oine stage (before any query is given), we solve the batch sampling problem,

i.e., generating

𝑛/𝑠

samples of size

𝑠

each, with a total cost of

𝑂 (𝑛)

. Thus, the amortized cost per

sample is

𝑂 (𝑠)

. Then during the online stage, the samples are consumed as queries arrive, such that

the online query processing cost is

𝑂 (𝑠)

, similar to the plaintext setting (but with a larger hidden

The

𝑂 notation suppresses polylogarithmic factors.

Proc. ACM Manag. Data, Vol. 1, No. 3 (SIGMOD), Article 219. Publication date: September 2023.

Secure Sampling for Approximate Multi-party ery Processing 219:3

constant, as we need to run an MPC protocol to evaluate the query on the sample). When samples

are depleted, we re-run the batch sampling to obtain a fresh set of samples. We also support the

case when

𝑠

is given at query time instead of during the oine stage, with a polylogarithmic factor

increase in the oine storage and computation. This idea will be introduced in Section 3.3.

This two-stage approach considerably reduces the response time of queries, as the heavy com-

putation parts have been moved to the oine stage. In fact, most MPC systems already adopt a

two-stage design: Oblivious transfer (OT) extension [

] generates a small number of base OTs

oine, which can then be used to extend to a large number of OTs online; Beaver triples [

] are

generated in an oine stage to accelerate multiplications in the online stage; when data is stored

in multiple tables, it is often denormalized using MPC joins [

] in an oine stage, i.e.,

the fact table is joined with all the dimension tables to form a at table. In some sense, our proposal

is to decouple the sampling process in such a manner as well, which is necessary to break the

Ω(𝑛)

lower bound.

1.1 Contributions

Specically, we make the following contributions in this paper:

(1)

We formally introduce the batch sampling problem under MPC, as a necessary way to

achieving an

𝑂 (𝑠) cost amortized per sample.

(2)

We describe a suite of MPC batch sampling algorithms including shue sampling, sampling

with replacement (WR sampling), sampling without replacement (WoR sampling), and strat-

ied sampling. We discuss their pros and cons in terms of cost, independence, sampling

error, and (dierential) privacy amplication. The algorithm for WoR sampling under MPC is

particularly nontrivial. We model the problem as a graph, design a circuit for pointer jumping,

and then construct a WoR sampling circuit based on it. We also generalize our WoR sampling

algorithm to support stratied sampling.

(3)

Based on our sampling algorithms, we build MASQUE (Multi-party Approximate Secure

QUEry processing), an MPC-AQP system that utilizes two semi-honest non-colluding servers

with any number of data owners. During the oine stage, the data owners rst secret-

share their data to the two servers. MASQUE then denormalizes the data using secure joins

and generates a batch of samples. During the online stage, one sample is consumed for

each query received. For all-around privacy protection, MASQUE also optionally injects

dierential privacy (DP) noise to the query result before returning it to the designated

receiver. We empirically compare MASQUE with SAQE [

], the previous MPC-AQP system,

and SMCQL [

] and SecYan [

], the state-of-the-art exact MPC query processing engines.

Experimental results show that MASQUE can reduce the online latency of MPC query

processing signicantly.

1.2 Related Work

Sampling based AQP techniques have been extensively studied over plaintext data and many AQP

systems exist [

]. Some systems take the sample online after a query arrives

[

], while others also take a two-stage approach, where samples are pre-generated during an

oine stage [

], similar to what we do in this paper. Since an online sample can be taken

𝑂 (𝑠)

time over plaintext, the two-stage approach oers limited improvement, and which one

performs better delicately depends on the query and data characteristics [

]. This is the major

dierence between MPC and the plaintext setting, as taking an online sample under MPC requires

Ω(𝑛) cost, so the two-stage approach has a signicant advantage in this case.

Proc. ACM Manag. Data, Vol. 1, No. 3 (SIGMOD), Article 219. Publication date: September 2023.

of 27

免费下载

相关文档

评论