uSendfile: A User-space Sendfile Verb based on Flash and RDMA
Hongzhang Yang*, Yahui Yang*, Yaofeng Tu†, Ping Wang*
∗Peking University, China
Email: {yanghongzhang, yhyang, pwang}@pku.edu.cn
†ZTE Corporation, China
Email: tu.yaofeng@zte.com.cn
Abstract—Flash and RDMA (Remote Direct Memory Access)
provide extremely high performance in storage and network
hardware. However, a gap exists between distributed systems and
this new hardware. Although RDMA speeds up memory access
between two nodes, many serious problems remain to be
solved when sending data or commands from one flash device to
another. In this paper, we propose a distributed system
on flash and RDMA, called DSFR, and implement a User-space
Sendfile verb, uSendfile, based on it. Experimental results show
that the RPC in DSFR outperforms the traditional RPC mechanism
by dozens of times, and that uSendfile significantly reduces
time overhead.
Keywords- Sendfile; RPC; Distributed System; Flash; RDMA
I. INTRODUCTION
With the increasing demands of cloud storage and big
data processing, scaling data management from a single node
to distributed environments has become a popular approach.
Traditional distributed systems use HDD
(Hard Disk Drive) as the storage medium and transfer data
through RPC (Remote Procedure Call), which is based on the
TCP/IP protocol. Flash and RDMA provide extremely high
performance in storage and network hardware. Although
RDMA speeds up memory access between two nodes, many
serious problems remain to be solved when sending data
or commands from one flash device to another. A distributed
system designed for flash and RDMA is therefore needed.
RDMA can reduce the network transmission delay incurred
when data is processed across different nodes.
RDMA achieves zero-copy data transmission without the
involvement of the operating systems on either side, which
provides high bandwidth and low latency. RDMA
uses three kinds of queues: the Send Queue (SQ), the
Receive Queue (RQ), and the Completion Queue (CQ). An SQ
and an RQ are usually created together, and such a pair is
called a Queue Pair (QP).
RDMA is widely used in distributed systems [1-2].
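The relationship between the three queues can be illustrated with a small toy model. This is only a sketch in plain Python, not real RDMA code: the `QueuePair` class, the `transfer` function, and the completion tuples are hypothetical names introduced here for illustration, and the "NIC" is simulated by an ordinary function call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    """Toy endpoint (hypothetical): an SQ and an RQ that report to one CQ."""
    name: str
    sq: deque = field(default_factory=deque)  # payloads waiting to be sent
    rq: deque = field(default_factory=deque)  # buffers posted for incoming data
    cq: deque = field(default_factory=deque)  # completions for finished work

    def post_send(self, payload: bytes) -> None:
        self.sq.append(payload)

    def post_recv(self, buffer: bytearray) -> None:
        self.rq.append(buffer)


def transfer(sender: QueuePair, receiver: QueuePair) -> None:
    """Simulate the NIC moving one message from the sender's SQ into a
    buffer taken from the receiver's RQ, then reporting both completions."""
    payload = sender.sq.popleft()
    buffer = receiver.rq.popleft()
    buffer[:len(payload)] = payload
    sender.cq.append(("send_done", len(payload)))
    receiver.cq.append(("recv_done", len(payload)))


node_a, node_b = QueuePair("node_a"), QueuePair("node_b")
recv_buffer = bytearray(16)
node_b.post_recv(recv_buffer)   # the receiver posts a buffer first
node_a.post_send(b"hello")      # the sender posts a work request
transfer(node_a, node_b)        # the simulated NIC moves the data
```

In this model the application only posts work and later polls the CQ; the data movement itself happens outside the application's code path, which mirrors the absence of operating system involvement described above.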
Flash is the core component of SSD (Solid State Drive).
Flash has some special access patterns due to its physical
features [3-4], including erasing before writing, garbage
collection, and out-of-place update. Before writing data to
flash, it is sometimes necessary to perform garbage collection
first and then erase old blocks, so write performance may be
seriously affected. In short, because of the differences
between HDD and flash, the system needs to be
redesigned.
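To make these access patterns concrete, the following toy model sketches out-of-place update and erase-before-write. It is an illustration under simplifying assumptions, not the garbage collection method of DSFR: the `ToyFlash` class and all of its fields are hypothetical, the victim block is simply the one with the most invalid pages, and surviving pages are buffered in memory and rewritten after the erase.

```python
class ToyFlash:
    """Toy flash model (hypothetical): a page is written once, then its
    whole block must be erased before the page can be written again."""

    def __init__(self, num_blocks: int = 2, pages_per_block: int = 2):
        self.pages_per_block = pages_per_block
        # Each physical page is "free", "valid", or "invalid".
        self.state = ["free"] * (num_blocks * pages_per_block)
        self.data = [None] * len(self.state)
        self.mapping = {}       # logical page -> physical page
        self.erase_count = 0

    def read(self, logical: int):
        return self.data[self.mapping[logical]]

    def write(self, logical: int, value: str) -> None:
        # Out-of-place update: the old copy is only marked invalid.
        old = self.mapping.pop(logical, None)
        if old is not None:
            self.state[old] = "invalid"
        if "free" not in self.state:
            self._garbage_collect()
        self._program(logical, value)

    def _program(self, logical: int, value: str) -> None:
        physical = self.state.index("free")
        self.state[physical] = "valid"
        self.data[physical] = value
        self.mapping[logical] = physical

    def _garbage_collect(self) -> None:
        n = self.pages_per_block
        starts = range(0, len(self.state), n)
        # Victim: the block holding the most invalid pages.
        victim = max(starts, key=lambda s: self.state[s:s + n].count("invalid"))
        if "invalid" not in self.state[victim:victim + n]:
            raise RuntimeError("flash is full of valid data")
        survivors = [(lp, self.data[pp]) for lp, pp in self.mapping.items()
                     if victim <= pp < victim + n]
        for pp in range(victim, victim + n):    # erase the whole block
            self.state[pp], self.data[pp] = "free", None
        self.erase_count += 1
        for lp, value in survivors:             # relocate still-valid pages
            self._program(lp, value)


flash = ToyFlash(num_blocks=2, pages_per_block=2)
flash.write(0, "a")
flash.write(1, "b")
flash.write(0, "a2")   # update goes to a new page; the old page becomes invalid
flash.write(0, "a3")
flash.write(1, "b2")   # no free page is left, so this write triggers an erase
```

The last write shows why write latency can be unpredictable: it has to wait for garbage collection and a block erase before the new data can be programmed.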
In this paper, we propose a distributed system on flash
and RDMA, and implement a User-space Sendfile verb
based on it. The goal of this paper is to make sending data
or commands from one node to another in distributed systems
efficient, so as to take full
advantage of flash and RDMA. In summary, we make the
following contributions:
• We design a distributed system on flash and RDMA
called DSFR, which optimizes the parallel network
topology of RDMA for data transmission, uses an
RDMA-based RPC to improve the performance of
the distributed system, and adopts an efficient flash
garbage collection method.
• We also design a User-space Sendfile verb called
uSendfile, which solves the problems caused by the
traditional kernel Sendfile verb. uSendfile
combines reading a file from local flash into local
memory with sending that file from local memory to
remote memory, and on to remote flash.
• We implement and evaluate the above designs.
Experimental results show that the RPC in DSFR
outperforms the traditional RPC mechanism by
dozens of times. Moreover, uSendfile reduces time
overhead by an order of magnitude compared with
Sendfile on exactly the same flash and RDMA hardware.
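For reference, the traditional kernel Sendfile path that uSendfile is compared against can be sketched as follows. This is a baseline illustration only, not the uSendfile verb itself: it uses Python's `socket.sendfile`, which relies on the kernel `sendfile` call where the platform provides it and otherwise falls back to ordinary `send`. The temporary file stands in for a file on flash and the local socket pair stands in for a network connection; both are assumptions made for the example, and no RDMA is involved.

```python
import os
import socket
import tempfile

payload = b"block of file data " * 64

# A temporary file stands in for a file stored on local flash.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# A connected local socket pair stands in for the network link.
left, right = socket.socketpair()
try:
    with open(path, "rb") as src:
        # The file contents go to the socket without being read into
        # a user-space buffer when kernel sendfile is available.
        sent = left.sendfile(src)
    left.close()  # signal end-of-stream to the receiver

    received = b""
    while chunk := right.recv(4096):
        received += chunk
finally:
    left.close()
    right.close()
    os.unlink(path)
```

The call still passes through the kernel's file and socket layers, which is the path that a user-space verb over RDMA is intended to avoid.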
II. MOTIVATION
A. Problems of distributed systems on flash and RDMA
When transferring data or commands from one node to
another, several problems arise. First, a single QP may
become the bottleneck and fail to saturate the processing
power of the NIC (Network Interface Controller). To address
this, multiple QP connections can be created between two nodes,
and messages can be transferred in parallel to improve
throughput. Second, the traditional RPC is too complex,
especially the Acknowledgement process, which
seems unnecessary. Third, existing flash
garbage collection methods are inefficient: cold dirty data
pages are relocated twice, first during garbage collection
and again during cache replacement.
Therefore, the system on flash and RDMA urgently needs to be
optimized.
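The idea of spreading one transfer over several QP connections can be sketched as a simple striping scheme. This is a minimal illustration, not the parallel topology used in DSFR: the functions `stripe` and `reassemble`, the round-robin policy, and the use of sequence numbers for reordering are all assumptions introduced here, and each QP is modeled as a plain in-memory queue.

```python
from collections import deque


def stripe(message: bytes, num_qps: int, chunk_size: int) -> list:
    """Split a message into chunks and assign them round-robin to
    num_qps queues. Each chunk carries its sequence number."""
    queues = [deque() for _ in range(num_qps)]
    for seq, start in enumerate(range(0, len(message), chunk_size)):
        chunk = message[start:start + chunk_size]
        queues[seq % num_qps].append((seq, chunk))
    return queues


def reassemble(queues) -> bytes:
    """Merge chunks from all queues back into the original order
    using their sequence numbers."""
    chunks = [item for queue in queues for item in queue]
    return b"".join(data for _, data in sorted(chunks))


message = b"abcdefghij" * 10           # 100 bytes
queues = stripe(message, num_qps=4, chunk_size=10)
restored = reassemble(queues)
```

Because every queue holds only a share of the chunks, the queues can be drained independently, so no single QP has to carry the whole message.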
2019 IEEE 12th International Conference on Cloud Computing (CLOUD)
2159-6190/19/$31.00 ©2019 IEEE
DOI 10.1109/CLOUD.2019.00080