USENIX Association 2015 USENIX Annual Technical Conference 375
GridGraph: Large-Scale Graph Processing on a Single Machine
Using 2-Level Hierarchical Partitioning
Xiaowei Zhu, Wentao Han, and Wenguang Chen
Department of Computer Science and Technology, Tsinghua University
{zhuxw13,hwt04}@mails.tsinghua.edu.cn, cwg@tsinghua.edu.cn
Abstract
In this paper, we present GridGraph, a system for pro-
cessing large-scale graphs on a single machine. Grid-
Graph breaks graphs into 1D-partitioned vertex chunks
and 2D-partitioned edge blocks using a first fine-grained
level partitioning in preprocessing. A second coarse-
grained level partitioning is applied in runtime. Through
a novel dual sliding windows method, GridGraph can
stream the edges and apply on-the-fly vertex updates,
thus reduce the I/O amount required for computation.
The partitioning of edges also enable selective schedul-
ing so that some of the blocks can be skipped to reduce
unnecessary I/O. This is very effective when the active
vertex set shrinks with convergence.
Our evaluation results show that GridGraph scales
seamlessly with memory capacity and disk bandwidth,
and outperforms state-of-the-art out-of-core systems, in-
cluding GraphChi and X-Stream. Furthermore, we show
that the performance of GridGraph is even competitive
with distributed systems, and it also provides significant
cost efficiency in cloud environment.
1 Introduction
There has been increasing interests to process large-scale
graphs efficiently in both academic and industrial com-
munities. Many real-world problems, such as online so-
cial networks, web graphs, user-item matrices, and more,
can be represented as graph computing problems.
Many distributed graph processing systems like Pregel
[17], GraphLab [16], PowerGraph [8], GraphX [28], and
others [1, 22] have been proposed in the past few years.
They are able to handle graphs of very large scale by ex-
ploiting the powerful computation resources of clusters.
However, load imbalance [11, 20], synchronization over-
head [33] and fault tolerance overhead [27] are still chal-
lenges for graph processing in distributed environment.
Moreover, users need to be skillful since tuning a cluster
and optimizing graph algorithms in distributed systems
are non-trivial tasks.
GraphChi [13], X-Stream [21] and other out-of-core
systems [9, 15, 31, 34] provide alternative solutions.
They enable users to process large-scale graphs on a sin-
gle machine by using disks efficiently. GraphChi par-
titions the vertices into disjoint intervals and breaks the
large edge list into smaller shards containing edges with
destinations in corresponding intervals. It uses a vertex-
centric processing model, which gathers data from neigh-
bors by reading edge values, computes and applies new
values to vertices, and scatters new data to neighbors
by writing values on edges. By using a novel parallel
sliding windows method to reduce random I/O accesses,
GraphChi is able to process large-scale graphs in rea-
sonable time. However, its sharding process requires
the edges in every shard to be sorted by sources. This
is a very time consuming process since several passes
over edges are needed. Fragmented accesses over several
shards are often inevitable, decreasing the usage of disk
bandwidth. X-Stream introduces an edge-centric scatter-
gather processing model. In the scatter phase, X-Stream
streams every edge and generates updates to propagate
vertex states. In the gather phase, X-Stream streams ev-
ery update, and applies it to the corresponding vertex
state. Accesses to vertices are random and happen on
a high level of storage hierarchy which is small but fast.
And accesses to edges and updates fall into a low level of
storage hierarchy which is large but slow. However, these
accesses are sequential so that maximum throughput can
be achieved. Although X-Stream can leverage high disk
bandwidth by sequential accessing, it needs to generate
updates which could be in the same magnitude as edges,
and its lack of support on selective scheduling could also
be a critical problem when dealing with graphs of large
diameters.
We propose GridGraph, which groups edges into a
“grid” representation. In GridGraph, vertices are parti-
tioned into 1D chunks and edges are partitioned into 2D
评论