URFS: A User-space Raw File System based on NVMe SSD
Yaofeng Tu
College of Computer Science & Technology
Nanjing University of Aeronautics & Astronautics, ZTE Corporation
Nanjing, China
tu.yaofeng@zte.com.cn
Yinjun Han
Central R&D Institute
ZTE Corporation
Nanjing, China
Corresponding Author Email: hanynjun@163.com
Zhenghua Chen
Central R&D Institute
ZTE Corporation
Nanjing, China
chen.zhenghua@zte.com.cn
Zhengguang Chen
Central R&D Institute
ZTE Corporation
Nanjing, China
chen.zhengguang@zte.com.cn
Bing Chen
College of Computer Science & Technology
Nanjing University of Aeronautics & Astronautics
Nanjing, China
cb_china@nuaa.edu.cn
Abstract—NVMe (Non-Volatile Memory Express) is a protocol designed specifically for SSDs (Solid State Drives) and has significantly improved the performance of SSD storage devices. However, the traditional kernel-space IO path hinders the performance of NVMe SSD devices. In this paper, a user-space raw file system (URFS) based on NVMe SSD is proposed. Through a user-space multi-process shared cache, multiple applications can share access to the SSD, reducing the number of SSD accesses; an NVMe-oriented log-free data layout and multi-granularity elastic IO queue separation are used to improve system performance and throughput. Experiments show that, compared to traditional file systems, URFS improves performance by more than 23% in CDN (Content Delivery Network) scenarios, with even larger gains in small-file and read-intensive scenarios.
Keywords-NVMe SSD; user-space; file system; Multi-Queue;
IO isolation
I. INTRODUCTION AND MOTIVATION
With the development of emerging semiconductor storage technology, the IO performance of external computer storage has improved rapidly. The NVMe protocol is designed for SSDs and fully exploits the low latency and parallelism of PCIe SSDs; at the same time, it takes full advantage of the concurrency of multi-core CPUs by means of multiple queues. The advantages of NVMe over AHCI stem from its ability to exploit parallelism in host hardware and software, manifested in differences in command queue depth, interrupt-processing efficiency, and the frequency of register accesses [1]. The distinguishing feature of NVMe is the provision of multiple queues to process IO commands. When an IO command is issued, the host system places the command in the submission queue and uses the doorbell register to notify the NVMe device. After processing the IO command, the NVMe device writes the result to the completion queue and triggers an interrupt to the host system. NVMe also enhances interrupt-processing performance through MSI/MSI-X and interrupt aggregation [2]. These features of the NVMe protocol help to take full advantage of SSDs.
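For concreteness, the following minimal C sketch mirrors this submission/doorbell flow. The command and completion structures are heavily simplified for illustration; the actual layouts are defined by the NVMe specification.

```c
#include <stdint.h>

/* Simplified command/completion layouts, for illustration only. */
struct nvme_cmd { uint8_t  opcode; uint64_t lba; uint16_t nblocks; };
struct nvme_cqe { uint16_t status; uint16_t sq_head; };

struct nvme_queue_pair {
    struct nvme_cmd   *sq;          /* submission queue (host fills)   */
    struct nvme_cqe   *cq;          /* completion queue (device fills) */
    volatile uint32_t *sq_doorbell; /* memory-mapped device register   */
    uint16_t sq_tail;
    uint16_t depth;
};

/* Host side of the flow described above: place the command in the
 * submission queue, then ring the doorbell to notify the device. */
static void nvme_submit(struct nvme_queue_pair *qp, const struct nvme_cmd *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    *qp->sq_doorbell = qp->sq_tail; /* doorbell write wakes the SSD */
}
```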
However, the current usage of NVMe SSDs has the following problems:
(1) The overhead of the traditional kernel-space IO stack, which was designed for HDDs, becomes a performance bottleneck. To overcome this problem, many researchers have tried to reduce kernel overhead by using polling in user space and eliminating unnecessary context switches. Intel provides the user-space file system BlobFS based on SPDK, but BlobFS supports neither in-place updates nor shared access to the SSD from multiple applications, which restricts the usage of NVMe SSDs.
(2) Multi-queue technology is an important means by which NVMe improves performance. NVMe allocates different queues according to tasks, scheduling priorities, and the number of cores, thereby achieving high performance. But the read and write performance of SSDs is imbalanced: reads are far faster than writes, by 10-40x [3]. Existing file systems do not separate read queues from write queues based on this characteristic, so write traffic degrades read performance.
(3) SSD reads and writes are aligned to pages (e.g., 4K), while erases are aligned to blocks (e.g., 2M). Existing file systems do not take these characteristics into account, resulting in substantial write amplification and unnecessary data migration, which greatly reduces read and write performance and causes jitter.
This paper designs a user-space raw file system (URFS) around the characteristics of the NVMe protocol and SSDs, namely imbalanced read/write performance and erasure by block but reads and writes by page. The contributions of this paper mainly include:
(1) A method for multiple processes to share NVMe SSDs. Through shared memory, data is transferred directly between multiple applications and the file system, realizing efficient sharing of NVMe SSD devices across processes. A user-space file system dynamic library is proposed, which applications call directly, reducing memory copies and interaction with kernel space. Through deadlock detection, the status of an application can be identified in time, invalid resources released, and system resources used efficiently.
(2) With NVMe's support for atomic operations, a simple and reliable metadata management mechanism is designed that removes the write-ahead logs of metadata operations, alleviates write amplification caused by applications and the file system, and improves disk IO utilization.
(3) Based on the characteristics of NVMe multi-queue, an elastic queue separation technology is proposed. Due to the imbalanced read/write performance of SSDs, read IOPS is significantly higher than write IOPS. Read queues and write queues are therefore separated to eliminate the impact of write operations on read operations.
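As a sketch of the idea behind contribution (3), the dispatch below routes reads and writes to disjoint NVMe queue pairs; the per-core mapping and all names here are assumptions for illustration, not the paper's implementation.

```c
#include <stddef.h>

struct nvme_queue_pair; /* opaque; set up at device initialization */

enum io_kind { IO_READ, IO_WRITE };

struct qp_set {
    struct nvme_queue_pair **read_qps;  /* queue pairs reserved for reads  */
    struct nvme_queue_pair **write_qps; /* queue pairs reserved for writes */
    size_t nread, nwrite;
};

/* Reads and writes never share a submission queue, so a burst of slow
 * writes cannot queue ahead of latency-sensitive reads. */
static struct nvme_queue_pair *
pick_queue(const struct qp_set *s, enum io_kind kind, unsigned core)
{
    return kind == IO_READ ? s->read_qps[core % s->nread]
                           : s->write_qps[core % s->nwrite];
}
```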
II. URFS DESIGN AND IMPLEMENTATION
URFS is a user-mode file system based on the NVMe protocol. This section first describes the URFS system architecture and then introduces the design details of URFS, including the user-mode multi-process shared-memory framework, the NVMe-oriented log-free data layout, and the multi-granularity elastic IO queue separation technology.
A. System architecture of URFS
As shown in Fig. 1(a), a traditional user-space driver monopolizes the SSD device, leaving multiple APPs unable to share access to the SSD. Reference [4] proposes running the file system as a process, which can directly manage raw SSD devices and ensure metadata integrity without kernel involvement, but multiple APPs are still unable to share the SSD. In this paper, a more optimized method is adopted: through the management of shared memory, multiple APPs can access multiple NVMe SSDs, and the shared-memory cache improves the performance of APP access to the SSD. At the same time, file security is managed through fssvr, which authenticates an APP's access rights without kernel involvement.
As shown in Fig. 1(b), the architecture of URFS is composed of three parts: the application layer, the service layer, and the resource layer. In the application layer, an APP calls fslib to achieve shared access to multiple SSD resources via shared memory and communication with fssvr. fslib is a client dynamic library that provides POSIX-like file system APIs for client applications to call.
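The paper does not list fslib's actual entry points; a hypothetical sketch of the POSIX-like surface such a library could expose, covering the basic operations URFS supports (open, read, write, remove, stat, rename), might look like this. All names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical metadata record returned by urfs_stat(). */
struct urfs_stat {
    uint64_t size;
    uint32_t mode; /* file vs. directory, permission bits */
};

int     urfs_open(const char *path, int flags);
ssize_t urfs_read(int fd, void *buf, size_t len, off_t off);
ssize_t urfs_write(int fd, const void *buf, size_t len, off_t off);
int     urfs_stat(const char *path, struct urfs_stat *st);
int     urfs_remove(const char *path);
int     urfs_rename(const char *oldpath, const char *newpath);
int     urfs_close(int fd);
```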
[Figure 1 shows (a) the original access method, in which each APP's user-space driver monopolizes an SSD, and (b) the improved access method, in which APP1..APPn call fslib and share the SSDs through fssvr, whose shared memory holds globalfd, ctlHead, ctlBuf, pageHead, the ACL, deadlock check, and Page Cache, across the application, service, and resource layers.]
Figure 1. Multi-process shared NVMe SSD framework in user-space

The service layer is implemented as a high-priority user-mode process called fssvr, whose core is the shared-memory framework. The messages between fslib and fssvr include
control messages and data messages. Control messages are managed through the control header ctlHead and the control cache blocks ctlBuf; data messages are managed through the page header pageHead and the page cache. Globalfd manages all file handles opened by an APP and saves the information related to each open file, including the file name, file id, storage location, the storage block currently being operated on, and the file length. The Page Cache implements sharing of cached data among multiple processes, and file access control and data security are implemented through the GID and UID of the application process. The server-side program fssvr is responsible for access management of the NVMe SSD devices: its main process manages the storage space of the NVMe SSDs and dispatches read/write requests. Multiple APPs can share multiple NVMe SSDs by calling fslib, and fslib instances and fssvr exchange messages and transfer content through shared memory.
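The paper names these shared-memory components but not their layouts; the sketch below shows one plausible arrangement, with all field choices being assumptions.

```c
#include <stdint.h>

/* One control-message slot (ctlHead) pointing at its payload (a ctlBuf).
 * Field layout is an assumption; the paper only names the components. */
struct ctlHead {
    uint32_t app_id;         /* which APP posted the request           */
    uint32_t op;             /* open / read / write / rename / ...     */
    uint32_t ctlbuf_index;   /* which ctlBuf block holds the payload   */
    volatile uint32_t state; /* FREE -> POSTED -> DONE, polled by both sides */
};

/* Descriptor (pageHead) for one shared Page Cache page. */
struct pageHead {
    uint64_t file_id;
    uint64_t page_no; /* 4K page offset within the file            */
    uint32_t refcnt;  /* APPs currently pinning this cached page   */
    uint32_t dirty;   /* must be flushed to SSD before eviction    */
};
```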
The SSDs in the resource layer are uniformly scheduled and managed by fssvr in user mode.
For deadlock handling and resource release, fslib periodically exchanges heartbeat messages with fssvr. For key resources such as Globalfd, control block headers, and cache blocks, deadlock detection is performed based on heartbeat status and lock-holding time. When it is detected that an APP has exited abnormally or a resource is deadlocked, the resource is forcibly released. The Page Cache can be shared among multiple APPs, so that when one APP opens a file for reading and writing, other APPs can simultaneously access the shared Page Cache, reducing the number of accesses to the SSD and improving the performance and throughput of the system.
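A minimal sketch of this heartbeat/lock-time check follows; the thresholds and field names are assumptions, not values from the paper.

```c
#include <stdint.h>
#include <time.h>

struct client_state {
    uint32_t app_id;
    time_t   last_heartbeat; /* refreshed by fslib's periodic message */
    time_t   lock_acquired;  /* 0 when no key resource is held        */
};

#define HEARTBEAT_TIMEOUT_S 5  /* assumed value */
#define LOCK_HOLD_LIMIT_S   30 /* assumed value */

/* fssvr-side check: reclaim resources from APPs that stopped
 * heartbeating (crashed/hung) or held a lock too long (deadlock). */
static int should_reclaim(const struct client_state *c, time_t now)
{
    if (now - c->last_heartbeat > HEARTBEAT_TIMEOUT_S)
        return 1;
    if (c->lock_acquired != 0 && now - c->lock_acquired > LOCK_HOLD_LIMIT_S)
        return 1;
    return 0;
}
```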
Through the above design, URFS implements a method for multiple applications to share access to SSDs and improves security through ACL control. Through the Page Cache mechanism, data cache acceleration and data sharing among multiple applications are realized, which greatly improves usability and performance.
[Figure 2 shows the disk layout: a metadata region (super block, inodes) followed by a data region (small zone, big zone), and the three 4K inode formats: directory inode (head, path, reserved), small-file inode (head, path, inline data), and big-file inode (head, path, extent list).]
Figure 2. Layout of URFS
B. Log-free data layout for NVMe
Many applications do not require complete POSIX semantics. Such applications try to minimize their dependence on the file system and optimize access on their own according to the business scenario. Taking CDN as an example, only basic file and directory operations are needed. In such scenarios, the data layout of the file system can be simplified to achieve better performance. Therefore, URFS supports only basic operations such as open, read, write, remove, stat, and rename of files and directories, while advanced features such as extended attributes and symlinks are not supported. URFS uses the page as the minimum granularity for disk-space allocation and access, and the page is fixed at 4K to match SSD characteristics. In addition, NVMe atomic operations are used to guarantee consistency when metadata is updated. As shown in Fig. 2, the file system layout of URFS mainly comprises a metadata region and a data region.
The metadata region is composed of the super block and inodes. The super block is 2M in size and records immutable information such as the file system type and version number, as well as variable information such as the size of the metadata region. Each inode is 4K in size and records file or directory metadata. A directory inode contains a 512-byte header and the 512-byte full path of the directory; the remaining 3K is reserved. A file inode contains a 512-byte header and a 512-byte full file path; the remaining 3K uses different layouts depending on the file size.
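The sizes in the following sketch come straight from this description; the field names and the use of a union are assumptions about how the 4K inode could be laid out.

```c
#include <stdint.h>

struct urfs_inode {
    uint8_t header[512];           /* type, length, timestamps, ...          */
    char    path[512];             /* full path: lookups hash on this        */
    union {                        /* remaining 3K depends on type and size  */
        uint8_t reserved[3072];    /* directory inode (rename intent area)   */
        uint8_t inline_data[3072]; /* file < 3K: data stored in the inode    */
        uint8_t extent_list[3072]; /* big file: list of 2M-aligned extents   */
    } tail;
};

_Static_assert(sizeof(struct urfs_inode) == 4096,
               "inode must be exactly one 4K page");
```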
Unlike traditional file systems such as EXT4 and XFS, the disk structure of URFS contains no disk-space bitmap, write-ahead log, or directory entries. Instead, URFS reads and parses the inodes when the file system is mounted and constructs in memory a hash index of files, a B-Tree index of directories, and a free-disk-space list. This design reduces the kinds and amount of metadata that must be persisted, thereby reducing the number of IOs during a metadata update and improving the performance of update operations. Furthermore, it greatly simplifies the consistency model, avoiding the need for a logging mechanism: when a file is updated, the metadata modification involves only the file's inode block, which can be completed with one NVMe atomic operation. The disadvantage of this design is slower startup. Although the read bandwidth of NVMe disks is usually above 2GB/s, actual tests show that the overhead mainly comes from constructing the in-memory data structures rather than from reading data off the disk; startup with millions of files takes on the order of minutes.
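A sketch of the mount-time scan implied above follows; the index and free-list helpers are hypothetical, and the inode stub stands in for the layout sketched earlier.

```c
#include <stdint.h>

/* Stands in for the 4K on-disk inode sketched earlier. */
struct urfs_inode { uint8_t raw[4096]; };

/* Hypothetical helpers for reading inodes and maintaining the in-memory
 * hash index, directory B-Tree, and free-space list. */
extern uint64_t inode_count;
extern void read_inode(uint64_t idx, struct urfs_inode *out);
extern int  inode_in_use(const struct urfs_inode *ino);
extern int  inode_is_dir(const struct urfs_inode *ino);
extern void hash_insert_file(const struct urfs_inode *ino, uint64_t idx);
extern void btree_insert_dir(const struct urfs_inode *ino);
extern void freelist_mark_used(const struct urfs_inode *ino);

/* One sequential pass over the inode region rebuilds everything that
 * other file systems persist as bitmaps, logs, and directory entries. */
void urfs_mount_scan(void)
{
    for (uint64_t i = 0; i < inode_count; i++) {
        struct urfs_inode ino;
        read_inode(i, &ino);
        if (!inode_in_use(&ino))
            continue;
        if (inode_is_dir(&ino))
            btree_insert_dir(&ino);    /* directory B-Tree index       */
        else
            hash_insert_file(&ino, i); /* full-path hash index         */
        freelist_mark_used(&ino);      /* this inode's extents are in use */
    }
}
```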
The full path of a file is stored in its inode, so path-based file access can hit the hash index directly without resolving inodes component by component. Directory inodes exist to support empty directories and directory renaming. URFS does not maintain dentries; directory traversal is served by the in-memory B-Tree index. A directory rename is split into rename operations on each file. To achieve crash consistency, the directory rename first records the new directory name in the reserved area of the directory inode, then modifies the file inodes one by one, and finally completes the actual modification of the directory name. During startup, an unfinished rename can be detected and resumed by checking the reserved area of the directory inode. The cost of a rename is therefore proportional to the number of files in the directory, so a flat directory structure with a bounded number of files per directory is advisable, which is consistent with common file-layout optimization strategies.
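A sketch of the three-phase rename protocol just described; the helper names are hypothetical, and each phase reduces to atomic 4K inode writes.

```c
struct urfs_inode; /* 4K on-disk inode */

/* Hypothetical helpers over the in-memory directory index and the
 * atomic single-inode write path. */
extern void write_rename_intent(struct urfs_inode *dir, const char *newname);
extern struct urfs_inode *first_child(struct urfs_inode *dir);
extern struct urfs_inode *next_child(struct urfs_inode *dir,
                                     struct urfs_inode *f);
extern void rewrite_path_prefix(struct urfs_inode *f, const char *newname);
extern void commit_dir_rename(struct urfs_inode *dir, const char *newname);

void urfs_rename_dir(struct urfs_inode *dir, const char *newname)
{
    /* Phase 1: persist the intent in the directory inode's reserved
     * area with one atomic write. */
    write_rename_intent(dir, newname);

    /* Phase 2: rewrite the stored full path of every file in the
     * directory, one atomic inode write each. */
    for (struct urfs_inode *f = first_child(dir); f; f = next_child(dir, f))
        rewrite_path_prefix(f, newname);

    /* Phase 3: update the directory inode's own path and clear the
     * intent. A crash before this point is detected at mount time
     * from the non-empty reserved area, and the rename is resumed. */
    commit_dir_rename(dir, newname);
}
```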
When data blocks are allocated, a small zone and a big zone are distinguished. Files of different sizes usually differ in type, and hence in access frequency and life cycle. URFS exploits this by concentrating small files in the small zone, while large-file allocation granularity is aligned with the SSD erase granularity. Partitioning the data in this way effectively reduces the FTL's GC overhead [5]. Specifically, the format and content of a file inode change with the file length. Files smaller than 3K are stored inline in the inode, which reduces the number of IOs and improves space utilization and read/write performance. For files larger than 3K, data blocks are first allocated in the small zone at 4K granularity. When the file length exceeds 2M, the file's original 4K data blocks are copied and merged into a 2M data block, and subsequent data blocks are allocated from the big zone at 2M granularity or multiples thereof. This SSD-friendly multi-level allocation method balances performance and space utilization.
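The size thresholds below are the ones stated above; the enum and helper function are illustrative.

```c
#include <stdint.h>

#define INLINE_MAX (3 * 1024)        /* fits in the inode's 3K tail   */
#define SMALL_PAGE (4 * 1024)        /* small-zone allocation unit    */
#define BIG_BLOCK  (2 * 1024 * 1024) /* big-zone unit = erase block   */

enum placement { PLACE_INLINE, PLACE_SMALL_ZONE, PLACE_BIG_ZONE };

/* Choose where a file's data lives based on its current length.
 * Crossing BIG_BLOCK triggers the 4K -> 2M merge described above. */
static enum placement choose_placement(uint64_t file_len)
{
    if (file_len < INLINE_MAX)
        return PLACE_INLINE;     /* data stored directly in the inode */
    if (file_len <= BIG_BLOCK)
        return PLACE_SMALL_ZONE; /* 4K pages from the small zone      */
    return PLACE_BIG_ZONE;       /* 2M blocks aligned to erase size   */
}
```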
For files written as a stream, data-block allocation may quickly cross the stages above, causing inode layout changes and unnecessary data movement. As in traditional file systems, delayed allocation of data blocks effectively mitigates this problem. By distinguishing files by size, URFS collects frequently modified data into the same blocks; when the FTL performs GC, the number of valid pages in a victim block is therefore reduced, lowering GC overhead. The SSD storage space uses the FTL erase block size as the minimum allocation unit, the metadata space and the