GridGraph- Large Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning (1).pdf

波风水门

109

13页

0次

2023-05-31

免费下载

This paper is included in the Proceedings of the

2015 USENIX Annual Technical Conference (USENIC ATC ’15).

July 8–10, 2015 • Santa Clara, CA, USA

ISBN 978-1-931971-225

Open access to the Proceedings of the

2015 USENIX Annual Technical Conference

(USENIX ATC ’15) is sponsored by USENIX.

GridGraph: Large-Scale Graph Processing

on a Single Machine Using 2-Level

Hierarchical Partitioning

Xiaowei Zhu, Wentao Han, and Wenguang Chen, Tsinghua University

https://www.usenix.org/conference/atc15/technical-session/presentation/zhu

USENIX Association 2015 USENIX Annual Technical Conference 375

GridGraph: Large-Scale Graph Processing on a Single Machine

Using 2-Level Hierarchical Partitioning

Xiaowei Zhu, Wentao Han, and Wenguang Chen

Department of Computer Science and Technology, Tsinghua University

{zhuxw13,hwt04}@mails.tsinghua.edu.cn, cwg@tsinghua.edu.cn

Abstract

In this paper, we present GridGraph, a system for pro-

cessing large-scale graphs on a single machine. Grid-

Graph breaks graphs into 1D-partitioned vertex chunks

and 2D-partitioned edge blocks using a ﬁrst ﬁne-grained

level partitioning in preprocessing. A second coarse-

grained level partitioning is applied in runtime. Through

a novel dual sliding windows method, GridGraph can

stream the edges and apply on-the-ﬂy vertex updates,

thus reduce the I/O amount required for computation.

The partitioning of edges also enable selective schedul-

ing so that some of the blocks can be skipped to reduce

unnecessary I/O. This is very effective when the active

vertex set shrinks with convergence.

Our evaluation results show that GridGraph scales

seamlessly with memory capacity and disk bandwidth,

and outperforms state-of-the-art out-of-core systems, in-

cluding GraphChi and X-Stream. Furthermore, we show

that the performance of GridGraph is even competitive

with distributed systems, and it also provides signiﬁcant

cost efﬁciency in cloud environment.

1 Introduction

There has been increasing interests to process large-scale

graphs efﬁciently in both academic and industrial com-

munities. Many real-world problems, such as online so-

cial networks, web graphs, user-item matrices, and more,

can be represented as graph computing problems.

Many distributed graph processing systems like Pregel

[17], GraphLab [16], PowerGraph [8], GraphX [28], and

others [1, 22] have been proposed in the past few years.

They are able to handle graphs of very large scale by ex-

ploiting the powerful computation resources of clusters.

However, load imbalance [11, 20], synchronization over-

head [33] and fault tolerance overhead [27] are still chal-

lenges for graph processing in distributed environment.

Moreover, users need to be skillful since tuning a cluster

and optimizing graph algorithms in distributed systems

are non-trivial tasks.

GraphChi [13], X-Stream [21] and other out-of-core

systems [9, 15, 31, 34] provide alternative solutions.

They enable users to process large-scale graphs on a sin-

gle machine by using disks efﬁciently. GraphChi par-

titions the vertices into disjoint intervals and breaks the

large edge list into smaller shards containing edges with

destinations in corresponding intervals. It uses a vertex-

centric processing model, which gathers data from neigh-

bors by reading edge values, computes and applies new

values to vertices, and scatters new data to neighbors

by writing values on edges. By using a novel parallel

sliding windows method to reduce random I/O accesses,

GraphChi is able to process large-scale graphs in rea-

sonable time. However, its sharding process requires

the edges in every shard to be sorted by sources. This

is a very time consuming process since several passes

over edges are needed. Fragmented accesses over several

shards are often inevitable, decreasing the usage of disk

bandwidth. X-Stream introduces an edge-centric scatter-

gather processing model. In the scatter phase, X-Stream

streams every edge and generates updates to propagate

vertex states. In the gather phase, X-Stream streams ev-

ery update, and applies it to the corresponding vertex

state. Accesses to vertices are random and happen on

a high level of storage hierarchy which is small but fast.

And accesses to edges and updates fall into a low level of

storage hierarchy which is large but slow. However, these

accesses are sequential so that maximum throughput can

be achieved. Although X-Stream can leverage high disk

bandwidth by sequential accessing, it needs to generate

updates which could be in the same magnitude as edges,

and its lack of support on selective scheduling could also

be a critical problem when dealing with graphs of large

diameters.

We propose GridGraph, which groups edges into a

“grid” representation. In GridGraph, vertices are parti-

tioned into 1D chunks and edges are partitioned into 2D

of 13

免费下载

tugraph paper

关注

评论