URFS: A User-space Raw File System based on NVMe SSD
Yaofeng Tu
College of Computer Science & Technology
Nanjing University of Aeronautics & Astronautics, ZTE Corporation
Nanjing, China
tu.yaofeng@zte.com.cn
Yinjun Han
Central R&D Institute
ZTE Corporation
Nanjing, China
Corresponding Author Email: hanynjun@163.com
Zhenghua Chen
Central R&D Institute
ZTE Corporation
Nanjing, China
chen.zhenghua@zte.com.cn
Zhengguang Chen
Central R&D Institute
ZTE Corporation
Nanjing, China
chen.zhengguang@zte.com.cn
Bing Chen
College of Computer Science & Technology
Nanjing University of Aeronautics & Astronautics
Nanjing, China
cb_china@nuaa.edu.cn
Abstract—NVMe (Non-Volatile Memory Express) is a protocol designed specifically for SSDs (Solid State Drives) and has significantly improved the performance of SSD storage devices. However, the traditional kernel-space IO path hinders the performance of NVMe SSD devices. In this paper, a user-space raw file system (URFS) based on NVMe SSD is proposed. Through a user-space multi-process shared cache, multiple applications can share access to the SSD, reducing the number of SSD accesses; an NVMe-oriented log-free data layout and multi-granularity elastic IO queue separation are used to improve system performance and throughput. Experiments show that, compared to traditional file systems, URFS improves performance by more than 23% in CDN (Content Delivery Network) scenarios, with even larger gains in small-file and read-intensive scenarios.
Keywords-NVMe SSD; user-space; file system; Multi-Queue;
IO isolation
I. INTRODUCTION AND MOTIVATION
With the development of emerging semiconductor storage technology, the IO performance of external computer storage has improved rapidly. The NVMe protocol is designed for SSDs and fully exploits the low latency and parallelism of PCIe SSDs; at the same time, it takes full advantage of the concurrency of multi-core CPUs by means of multiple queues. The advantages of NVMe over AHCI stem from its ability to exploit parallelism in host hardware and software, manifested in differences in command queue depth, interrupt-processing efficiency, and the frequency of register accesses [1]. The distinguishing feature of NVMe is the provision of multiple queues to process IO commands. When an IO command is issued, the host system places the command in the submission queue and uses the doorbell register to notify the NVMe device. After processing the IO command, the NVMe device writes the result to the completion queue and triggers an interrupt to the host system. NVMe also enhances interrupt-processing performance through MSI/MSI-X and interrupt aggregation [2]. These features of the NVMe protocol help to take full advantage of SSDs.
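For concreteness, the following minimal C sketch mirrors this submission/doorbell flow. The command and completion structures are heavily simplified for illustration; the actual layouts are defined by the NVMe specification.

```c
#include <stdint.h>

/* Simplified command/completion layouts, for illustration only. */
struct nvme_cmd { uint8_t  opcode; uint64_t lba; uint16_t nblocks; };
struct nvme_cqe { uint16_t status; uint16_t sq_head; };

struct nvme_queue_pair {
    struct nvme_cmd   *sq;          /* submission queue (host fills)   */
    struct nvme_cqe   *cq;          /* completion queue (device fills) */
    volatile uint32_t *sq_doorbell; /* memory-mapped device register   */
    uint16_t sq_tail;
    uint16_t depth;
};

/* Host side of the flow described above: place the command in the
 * submission queue, then ring the doorbell to notify the device. */
static void nvme_submit(struct nvme_queue_pair *qp, const struct nvme_cmd *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    *qp->sq_doorbell = qp->sq_tail; /* doorbell write wakes the SSD */
}
```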
However, the current usage of NVMe SSDs has the following problems:
(1) The overhead of the traditional kernel-space IO stack, which was designed for HDDs, becomes a performance bottleneck. To overcome this problem, many researchers have tried to reduce kernel overhead by using polling in user space and eliminating unnecessary context switches. Intel provides the user-space file system BlobFS based on SPDK, but BlobFS supports neither in-place updates nor shared access to the SSD from multiple applications, which restricts the usage of NVMe SSDs.
(2) Multi-queue technology is an important means by which NVMe improves performance. NVMe allocates different queues according to tasks, scheduling priorities, and the number of cores, thereby achieving high performance. But the read and write performance of SSDs is imbalanced: reads are far faster than writes, by 10-40x [3]. Existing file systems do not separate read queues from write queues based on this characteristic, so write traffic degrades read performance.
(3) SSD reads and writes are aligned to pages (e.g., 4K), while erases are aligned to blocks (e.g., 2M). Existing file systems do not take these characteristics into account, resulting in substantial write amplification and unnecessary data migration, which greatly reduces read and write performance and causes jitter.
This paper designs a user-space raw file system (URFS) around the characteristics of the NVMe protocol and SSDs, namely imbalanced read/write performance and erasure by block but reads and writes by page. The contributions of this paper mainly include:
(1) A method for multiple processes to share NVMe SSDs. Through shared memory, data is transferred directly between multiple applications and the file system, realizing efficient sharing of NVMe SSD devices across processes. A user-space file system dynamic library is proposed, which applications call directly, reducing memory copies and interaction with kernel space. Through deadlock detection, the status of an application can be identified in time, invalid resources released, and system resources used efficiently.
(2) With NVMe's support for atomic operations, a simple and reliable metadata management mechanism is designed that removes the write-ahead logs of metadata operations, alleviates write amplification caused by applications and the file system, and improves disk IO utilization.
(3) Based on the characteristics of NVMe multi-queue, an elastic queue separation technology is proposed. Due to the imbalanced read/write performance of SSDs, read IOPS is significantly higher than write IOPS. Read queues and write queues are therefore separated to eliminate the impact of write operations on read operations.
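As a sketch of the idea behind contribution (3), the dispatch below routes reads and writes to disjoint NVMe queue pairs; the per-core mapping and all names here are assumptions for illustration, not the paper's implementation.

```c
#include <stddef.h>

struct nvme_queue_pair; /* opaque; set up at device initialization */

enum io_kind { IO_READ, IO_WRITE };

struct qp_set {
    struct nvme_queue_pair **read_qps;  /* queue pairs reserved for reads  */
    struct nvme_queue_pair **write_qps; /* queue pairs reserved for writes */
    size_t nread, nwrite;
};

/* Reads and writes never share a submission queue, so a burst of slow
 * writes cannot queue ahead of latency-sensitive reads. */
static struct nvme_queue_pair *
pick_queue(const struct qp_set *s, enum io_kind kind, unsigned core)
{
    return kind == IO_READ ? s->read_qps[core % s->nread]
                           : s->write_qps[core % s->nwrite];
}
```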
II. URFS DESIGN AND IMPLEMENTATION
URFS is a user-mode file system based on the NVMe protocol. This section first describes the URFS system architecture and then introduces the design details of URFS, including the user-mode multi-process shared-memory framework, the NVMe-oriented log-free data layout, and the multi-granularity elastic IO queue separation technology.
A. System architecture of URFS
As shown in Fig. 1(a), a traditional user-space driver monopolizes the SSD device, leaving multiple APPs unable to share access to the SSD. Reference [4] proposes running the file system as a process, which can directly manage raw SSD devices and ensure metadata integrity without kernel involvement, but multiple APPs are still unable to share the SSD. In this paper, a more optimized method is adopted: through the management of shared memory, multiple APPs can access multiple NVMe SSDs, and the shared-memory cache improves the performance of APP access to the SSD. At the same time, file security is managed through fssvr, which authenticates an APP's access rights without kernel involvement.
As shown in Fig. 1(b), the architecture of URFS is composed of three parts: the application layer, the service layer, and the resource layer. In the application layer, an APP calls fslib to achieve shared access to multiple SSD resources via shared memory and communication with fssvr. fslib is a client dynamic library that provides POSIX-like file system APIs for client applications to call.
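The paper does not list fslib's actual entry points; a hypothetical sketch of the POSIX-like surface such a library could expose, covering the basic operations URFS supports (open, read, write, remove, stat, rename), might look like this. All names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical metadata record returned by urfs_stat(). */
struct urfs_stat {
    uint64_t size;
    uint32_t mode; /* file vs. directory, permission bits */
};

int     urfs_open(const char *path, int flags);
ssize_t urfs_read(int fd, void *buf, size_t len, off_t off);
ssize_t urfs_write(int fd, const void *buf, size_t len, off_t off);
int     urfs_stat(const char *path, struct urfs_stat *st);
int     urfs_remove(const char *path);
int     urfs_rename(const char *oldpath, const char *newpath);
int     urfs_close(int fd);
```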
[Figure 1 shows (a) the original access method, in which each APP's user-space driver monopolizes an SSD, and (b) the improved access method, in which APP1..APPn call fslib and share the SSDs through fssvr, whose shared memory holds globalfd, ctlHead, ctlBuf, pageHead, the ACL, deadlock check, and Page Cache, across the application, service, and resource layers.]
Figure 1. Multi-process shared NVMe SSD framework in user-space

The service layer is implemented as a high-priority user-mode process called fssvr, whose core is the shared-memory framework. The messages between fslib and fssvr include
control messages and data messages. Control messages are managed through the control header ctlHead and the control cache blocks ctlBuf; data messages are managed through the page header pageHead and the page cache. Globalfd manages all file handles opened by an APP and saves the information related to each open file, including the file name, file id, storage location, the storage block currently being operated on, and the file length. The Page Cache implements sharing of cached data among multiple processes, and file access control and data security are implemented through the GID and UID of the application process. The server-side program fssvr is responsible for access management of the NVMe SSD devices: its main process manages the storage space of the NVMe SSDs and dispatches read/write requests. Multiple APPs can share multiple NVMe SSDs by calling fslib, and fslib instances and fssvr exchange messages and transfer content through shared memory.
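The paper names these shared-memory components but not their layouts; the sketch below shows one plausible arrangement, with all field choices being assumptions.

```c
#include <stdint.h>

/* One control-message slot (ctlHead) pointing at its payload (a ctlBuf).
 * Field layout is an assumption; the paper only names the components. */
struct ctlHead {
    uint32_t app_id;         /* which APP posted the request           */
    uint32_t op;             /* open / read / write / rename / ...     */
    uint32_t ctlbuf_index;   /* which ctlBuf block holds the payload   */
    volatile uint32_t state; /* FREE -> POSTED -> DONE, polled by both sides */
};

/* Descriptor (pageHead) for one shared Page Cache page. */
struct pageHead {
    uint64_t file_id;
    uint64_t page_no; /* 4K page offset within the file            */
    uint32_t refcnt;  /* APPs currently pinning this cached page   */
    uint32_t dirty;   /* must be flushed to SSD before eviction    */
};
```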
The SSDs in the resource layer are uniformly scheduled and managed by fssvr in user mode.
For deadlock handling and resource release, fslib periodically exchanges heartbeat messages with fssvr. For key resources such as Globalfd, control block headers, and cache blocks, deadlock detection is performed based on heartbeat status and lock-holding time. When it is detected that an APP has exited abnormally or a resource is deadlocked, the resource is forcibly released. The Page Cache can be shared among multiple APPs, so that when one APP opens a file for reading and writing, other APPs can simultaneously access the shared Page Cache, reducing the number of accesses to the SSD and improving the performance and throughput of the system.
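A minimal sketch of this heartbeat/lock-time check follows; the thresholds and field names are assumptions, not values from the paper.

```c
#include <stdint.h>
#include <time.h>

struct client_state {
    uint32_t app_id;
    time_t   last_heartbeat; /* refreshed by fslib's periodic message */
    time_t   lock_acquired;  /* 0 when no key resource is held        */
};

#define HEARTBEAT_TIMEOUT_S 5  /* assumed value */
#define LOCK_HOLD_LIMIT_S   30 /* assumed value */

/* fssvr-side check: reclaim resources from APPs that stopped
 * heartbeating (crashed/hung) or held a lock too long (deadlock). */
static int should_reclaim(const struct client_state *c, time_t now)
{
    if (now - c->last_heartbeat > HEARTBEAT_TIMEOUT_S)
        return 1;
    if (c->lock_acquired != 0 && now - c->lock_acquired > LOCK_HOLD_LIMIT_S)
        return 1;
    return 0;
}
```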
Through the above design, URFS implements a method for multiple applications to share access to SSDs and improves security through ACL control. Through the Page Cache mechanism, data cache acceleration and data sharing among multiple applications are realized, which greatly improves usability and performance.
[Figure 2 shows the disk layout: a metadata region (super block, inodes) followed by a data region (small zone, big zone), and the three 4K inode formats: directory inode (head, path, reserved), small-file inode (head, path, inline data), and big-file inode (head, path, extent list).]
Figure 2. Layout of URFS
B. Log-free data layout for NVMe
Many applications do not require complete POSIX semantics. Such applications try to minimize their dependence on the file system and optimize access on their own according to the business scenario. Taking CDN as an example, only basic file and directory operations are needed. In such scenarios, the data layout of the file system can be simplified to achieve better performance. Therefore, URFS supports only basic operations such as open, read, write, remove, stat, and rename of files and directories, while advanced features such as extended attributes and symlinks are not supported. URFS uses the page as the minimum granularity for disk-space allocation and access, and the page is fixed at 4K to match SSD characteristics. In addition, NVMe atomic operations are used to guarantee consistency when metadata is updated. As shown in Fig. 2, the file system layout of URFS mainly comprises a metadata region and a data region.
The metadata region is composed of the super block and inodes. The super block is 2M in size and records immutable information such as the file system type and version number, as well as variable information such as the size of the metadata region. Each inode is 4K in size and records file or directory metadata. A directory inode contains a 512-byte header and the 512-byte full path of the directory; the remaining 3K is reserved. A file inode contains a 512-byte header and a 512-byte full file path; the remaining 3K uses different layouts depending on the file size.
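The sizes in the following sketch come straight from this description; the field names and the use of a union are assumptions about how the 4K inode could be laid out.

```c
#include <stdint.h>

struct urfs_inode {
    uint8_t header[512];           /* type, length, timestamps, ...          */
    char    path[512];             /* full path: lookups hash on this        */
    union {                        /* remaining 3K depends on type and size  */
        uint8_t reserved[3072];    /* directory inode (rename intent area)   */
        uint8_t inline_data[3072]; /* file < 3K: data stored in the inode    */
        uint8_t extent_list[3072]; /* big file: list of 2M-aligned extents   */
    } tail;
};

_Static_assert(sizeof(struct urfs_inode) == 4096,
               "inode must be exactly one 4K page");
```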
Unlike traditional file systems such as EXT4 and XFS, the disk structure of URFS contains no disk-space bitmap, write-ahead log, or directory entries. Instead, URFS reads and parses the inodes when the file system is mounted and constructs in memory a hash index of files, a B-Tree index of directories, and a free-disk-space list. This design reduces the kinds and amount of metadata that must be persisted, thereby reducing the number of IOs during a metadata update and improving the performance of update operations. Furthermore, it greatly simplifies the consistency model, avoiding the need for a logging mechanism: when a file is updated, the metadata modification involves only the file's inode block, which can be completed with one NVMe atomic operation. The disadvantage of this design is slower startup. Although the read bandwidth of NVMe disks is usually above 2GB/s, actual tests show that the overhead mainly comes from constructing the in-memory data structures rather than from reading data off the disk; startup with millions of files takes on the order of minutes.
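A sketch of the mount-time scan implied above follows; the index and free-list helpers are hypothetical, and the inode stub stands in for the layout sketched earlier.

```c
#include <stdint.h>

/* Stands in for the 4K on-disk inode sketched earlier. */
struct urfs_inode { uint8_t raw[4096]; };

/* Hypothetical helpers for reading inodes and maintaining the in-memory
 * hash index, directory B-Tree, and free-space list. */
extern uint64_t inode_count;
extern void read_inode(uint64_t idx, struct urfs_inode *out);
extern int  inode_in_use(const struct urfs_inode *ino);
extern int  inode_is_dir(const struct urfs_inode *ino);
extern void hash_insert_file(const struct urfs_inode *ino, uint64_t idx);
extern void btree_insert_dir(const struct urfs_inode *ino);
extern void freelist_mark_used(const struct urfs_inode *ino);

/* One sequential pass over the inode region rebuilds everything that
 * other file systems persist as bitmaps, logs, and directory entries. */
void urfs_mount_scan(void)
{
    for (uint64_t i = 0; i < inode_count; i++) {
        struct urfs_inode ino;
        read_inode(i, &ino);
        if (!inode_in_use(&ino))
            continue;
        if (inode_is_dir(&ino))
            btree_insert_dir(&ino);    /* directory B-Tree index       */
        else
            hash_insert_file(&ino, i); /* full-path hash index         */
        freelist_mark_used(&ino);      /* this inode's extents are in use */
    }
}
```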
The full path of a file is stored in its inode, so path-based file access can hit the hash index directly without resolving inodes component by component. Directory inodes exist to support empty directories and directory renaming. URFS does not maintain dentries; directory traversal is served by the in-memory B-Tree index. A directory rename is split into rename operations on each file. To achieve crash consistency, the directory rename first records the new directory name in the reserved area of the directory inode, then modifies the file inodes one by one, and finally completes the actual modification of the directory name. During startup, an unfinished rename can be detected and resumed by checking the reserved area of the directory inode. The cost of a rename is therefore proportional to the number of files in the directory, so a flat directory structure with a bounded number of files per directory is advisable, which is consistent with common file-layout optimization strategies.
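A sketch of the three-phase rename protocol just described; the helper names are hypothetical, and each phase reduces to atomic 4K inode writes.

```c
struct urfs_inode; /* 4K on-disk inode */

/* Hypothetical helpers over the in-memory directory index and the
 * atomic single-inode write path. */
extern void write_rename_intent(struct urfs_inode *dir, const char *newname);
extern struct urfs_inode *first_child(struct urfs_inode *dir);
extern struct urfs_inode *next_child(struct urfs_inode *dir,
                                     struct urfs_inode *f);
extern void rewrite_path_prefix(struct urfs_inode *f, const char *newname);
extern void commit_dir_rename(struct urfs_inode *dir, const char *newname);

void urfs_rename_dir(struct urfs_inode *dir, const char *newname)
{
    /* Phase 1: persist the intent in the directory inode's reserved
     * area with one atomic write. */
    write_rename_intent(dir, newname);

    /* Phase 2: rewrite the stored full path of every file in the
     * directory, one atomic inode write each. */
    for (struct urfs_inode *f = first_child(dir); f; f = next_child(dir, f))
        rewrite_path_prefix(f, newname);

    /* Phase 3: update the directory inode's own path and clear the
     * intent. A crash before this point is detected at mount time
     * from the non-empty reserved area, and the rename is resumed. */
    commit_dir_rename(dir, newname);
}
```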
When data blocks are allocated, a small zone and a big zone are distinguished. Files of different sizes usually differ in type, and hence in access frequency and life cycle. URFS exploits this by concentrating small files in the small zone, while large-file allocation granularity is aligned with the SSD erase granularity. Partitioning the data in this way effectively reduces the FTL's GC overhead [5]. Specifically, the format and content of a file inode change with the file length. Files smaller than 3K are stored inline in the inode, which reduces the number of IOs and improves space utilization and read/write performance. For files larger than 3K, data blocks are first allocated in the small zone at 4K granularity. When the file length exceeds 2M, the file's original 4K data blocks are copied and merged into a 2M data block, and subsequent data blocks are allocated from the big zone at 2M granularity or multiples thereof. This SSD-friendly multi-level allocation method balances performance and space utilization.
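The size thresholds below are the ones stated above; the enum and helper function are illustrative.

```c
#include <stdint.h>

#define INLINE_MAX (3 * 1024)        /* fits in the inode's 3K tail   */
#define SMALL_PAGE (4 * 1024)        /* small-zone allocation unit    */
#define BIG_BLOCK  (2 * 1024 * 1024) /* big-zone unit = erase block   */

enum placement { PLACE_INLINE, PLACE_SMALL_ZONE, PLACE_BIG_ZONE };

/* Choose where a file's data lives based on its current length.
 * Crossing BIG_BLOCK triggers the 4K -> 2M merge described above. */
static enum placement choose_placement(uint64_t file_len)
{
    if (file_len < INLINE_MAX)
        return PLACE_INLINE;     /* data stored directly in the inode */
    if (file_len <= BIG_BLOCK)
        return PLACE_SMALL_ZONE; /* 4K pages from the small zone      */
    return PLACE_BIG_ZONE;       /* 2M blocks aligned to erase size   */
}
```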
For files written as a stream, data-block allocation may quickly cross the stages above, causing inode layout changes and unnecessary data movement. As in traditional file systems, delayed allocation of data blocks effectively mitigates this problem. By distinguishing files by size, URFS collects frequently modified data into the same blocks; when the FTL performs GC, the number of valid pages in a victim block is therefore reduced, lowering GC overhead. The SSD storage space uses the FTL erase block size as the minimum allocation unit, the metadata space and the