VLDB2023_A Learned Sketch for Subgraph Counting_ a holistic approach_华为云.pdf

迹部景吾

26页

2次

2024-08-28

免费下载

The VLDB Journal (2023) 32:937–962

https://doi.org/10.1007/s00778-023-00781-5

REGULAR PAPER

Learned sketch for subgraph counting: a holistic approach

Kangfei Zhao

· Jeﬀrey Xu Yu

· Qiyan Li

· Hao Zhang

· Yu Rong

Received: 6 May 2022 / Revised: 25 November 2022 / Accepted: 6 January 2023 / Published online: 9 February 2023

Abstract

Subgraph counting, as a fundamental problem in network analysis, is to count the number of subgraphs in a data graph that

match a given query graph by either homomorphism or subgraph isomorphism. The importance of subgraph counting derives

from the fact that it provides insights of a large graph, in particular a labeled graph, when a collection of query graphs with

different sizes and labels are issued. The problem of counting is challenging. On the one hand, exact counting by enumerating

subgraphs is NP-hard. On the other hand, approximate counting by subgraph isomorphism can only support small query graphs

over unlabeled graphs. Another way for subgraph counting is to specify it as an

SQL

query and estimate the cardinality of

the query in

RDBMS

. Existing approaches for cardinality estimation can only support subgraph counting by homomorphism

up to some extent, as it is difﬁcult to deal with sampling failure when a query graph becomes large. A question that arises is

how we support subgraph counting by machine learning (ML) and deep learning (DL). To devise an ML/DL solution, apart

from the query graphs, another issue is to deal with large data graphs by ML/DL, as the existing DL approach for subgraph

isomorphism counting can only support small data graphs. In addition, the ML/DL approaches proposed in

RDBMS

context

for approximate query processing and cardinality estimation cannot be used, as subgraph counting is to do complex self-joins

over one relation, whereas existing approaches focus on multiple relations. In this work, we propose an active l earned sketch

for subgraph counting (ALSS) with two main components: a learned sketch for subgraph counting and an active learner. The

sketch is constructed by a neural network regression model, and the active learner is to perform model updates based on new

arrival test query graphs. Our holistic learning framework supports both undirected graphs and directed graphs, whose nodes

and/or edges are associated zero to multiple labels. We conduct extensive experimental studies to conﬁrm the effectiveness

and efﬁciency of ALSS using large real labeled graphs. Moreover, we show that ALSS can assist query optimizers in ﬁnding a

better query plan for complex multi-way self-joins.

Keywords Subgraph counting · Graph neural network · Active learning

Jeffrey Xu Yu

yu@se.cuhk.edu.hk

Kangfei Zhao

zkf1105@gmail.com

Qiyan Li

qyli@se.cuhk.edu.hk

Hao Zhang

zhanghaowuda12@gmail.com

Yu Rong

yu.rong@hotmail.com

Beijing Institute of Technology, Beijing, China

Chinese University of Hong Kong, Hong Kong, China

Huawei Cloud Database Innovation Lab, Beijing, China

Tencent AI Lab, Shenzhen, China

1 Introduction

Graph has been widely used for modeling real applica-

tions in commercial, biological, lexical, and social networks

[62,85,89,101,119]. As one of fundamental problems, sub-

graph counting is to count how many subgraphs in a labeled

data graph that match a user-given query graph by either

homomorphism or subgraph isomorphism. The subgraph

counting provides insights to understand a large complex data

graph, as users can issue a collection of query graphs with

different sizes and labels. It has a wide range of applications

in network analysis, to name a few, designing graph kernels

for graph comparison and representation [81,94], building

probabilistic models for computer vision tasks (e.g., photo

cropping and image segmentation) [30,110,115 ], and ana-

lyzing chemical and biological networks (e.g., molecular

123

938 K. Zhao et al.

properties prediction and phylogeny construction) [14]. In

addition to supporting applications, subgraph counting can

also be used in a system for query optimization to estimate

the cardinality for complex large joins with cycles, where

a query is to enumerate subgraph matchings in a large data

graph stored in a relational system [2,61].

1.1 Existing approaches and motivation

Exact subgraph counting by subgraph isomorphism is chal-

lenging. As surveyed in [78], there are enumeration meth-

ods [7,10,11,13,42,61,80] and analytical methods [4,60,67,

72]. Full subgraph enumeration is difﬁcult as determining

whether a given query graph exists in a data graph by sub-

graph isomorphism is NP-complete [12,22]. The analytical

approaches do counting by decomposing a query graph into

smaller s ubgraphs, and counting the query graph based on

the counts obtained for the smaller subgraphs. The analyt-

ical methods are designed for some query graphs up to a

certain size (e.g., 5-node query graph), and are difﬁcult to

have a general form for any size of query graphs. Some exact

subgraph counting algorithms, e.g., [60,112] can support

query graphs with up to 6 nodes. Instead of exact subgraph

counting, approximate subgraph counting has been studied

in recent years [15–17,21,41,99], which is designed for small

query graphs over simple unlabeled data graphs. For larger

query graphs, [82] proposes an approximate algorithm that

can count tree-shaped query graphs with up to 12 nodes. [17]

can support approximate subgraph counting for query graphs

with 10 nodes by color coding. These approaches are non-

trivial to be extended to support labeled graphs due to the

intrinsic techniques they use for approximation (e.g., color

coding) [15,17].

The subgraph counting can be supported in

RDBMS

using

SQL

[65,84], multi-way joins [18,52,97,116], and

processed as cardinality estimation. Speciﬁcally, consider a

node-labeled undirected graph G = (V , E, L). It can be

stored in an

RDBMS

using a node table, T

(V , L), and an

edge table, T

(F, T ), where the node table keeps its node-

ids and the labels associated with the node, and the edge table

keeps the edges (node-id pairs). Consider a simple subgraph

counting task to count the number of open triangles among

three nodes, {v

}, which are connected by two edges

) and (v

), and the node-labels for v

are

A/B/C, respectively. The subgraph counting by homomor-

phism can be speciﬁed as an

SQL

query

selectcount(∗)from

as T

, T

as T

, T

as T

, T

as T

, T

as T

where T

.T = T

.F and T

.V = T

.Fand

.L =



and T

.V = T

.T and

.L =



and T

.V = T

.T and T

.L =



Note that the same subgraph counting by subgraph isomor-

phism needs to add a constraint in the where clause as T

.F =

.T .

To process such

SQL

queries, ﬁrst, the approaches in

[18,65,84,97] decompose a query graph into a set of small

subgraphs, count the subgraph matchings for the smaller sub-

graphs, and aggregate the counts to approximate the ﬁnal

count. They rely on an assumption that the counts of the

smaller subgraphs are independent, which is impractical for

real large graphs so that it incurs inaccurate estimation. Sec-

ond, the approaches in [52,116] make use of join sampling

that draws matchings from the data graph. These approaches

face drastically increasing sampling failure in the intermedi-

ate join steps when the query graphs and the distribution

of the data graph are complex (i.e., complex topological

structure and label distribution), although the sampling is

independent. The limitations of these approaches cause the

deterioration of estimation accuracy as well as efﬁciency

on counting f or complex subgraphs. In [69], a benchmark

G-CARE

for cardinality estimation of subgraph matching

(homomorphism) is presented, which investigates estimation

approaches for graphs [21,65,84] as well as relational data

[52,97,116]. With

G-CARE

open source, we verify that join

sampling has its limitation for complex subgraph matching

even by homomorphism when it is used to test for large query

graphs. The difﬁculty of cardinality estimation comes from

complex data distribution [50].

A question that arises is if machine learning (ML) and deep

learning (DL) can be effectively used for subgraph counting.

Liu et al. in [54] study neural subgraph isomorphism counting

using a dynamic memory network. Their models are trained

on synthetic graphs which are up to 512 nodes (Table 4 in

[54]). Their approach is computationally too expensive to

be used for large data graphs. Below, as subgraph counting

(homomorphism or subgraph isomorphism) can be speciﬁed

as an

SQL

query, we discuss if DL approaches studied in

RDBMS

context for approximate query processing (AQP)

or cardinality estimation can be used to support subgraph

counting.

Table 1 summarizes the ML/DL approaches studied for

AQP (the top three) and cardinality estimation (the bottom

twelve). The

SQL

queries they support include join, selec-

tion, and group-by and aggregation. The joins can be either

equi-join, joins between primary-key (PK) and foreign-key

(FK), or precomputed join results. The selection conditions

are on numerical or categorical attributes. For numerical

attributes, the condition can be on a value or a range. For cate-

gorical attributes, string estimation is also considered in [87].

For selection predicates, most approaches support multivari-

ate numerical range predicates and categorical predicates,

and [87,100] support string comparison. For AQP, aggre-

Homomorphism allows two nodes in a query graph to be mapped to

the same node in a data graph, where subgraph isomorphism does not.

123

of 26

免费下载

关注

评论