
[Figure 2 depicts the URFS disk layout: a metadata region (super block and inodes) followed by a data region (small zone and big zone). A directory inode holds a header, the full path, and a reserved area; a small-file inode holds a header, the path, and inline data; a big-file inode holds a header, the path, and an extent list.]
Figure 2. Layout of URFS
B. Non-log data layout for NVMe
Many applications do not require complete POSIX semantics. Such applications minimize their dependence on the file system and optimize access themselves according to their business scenarios. Taking CDN as an example, only basic file and directory operations are needed. In such scenarios, the data layout of the file system can be simplified to achieve better performance. URFS is therefore designed to support only basic operations on files and directories, such as open, read, write, remove, stat, and rename; advanced features such as attrs and symlinks are not supported. URFS uses the page as the minimum granularity of disk space allocation and access, and the page size is fixed at 4K to optimize SSD performance. In addition, NVMe atomic operations are used to guarantee consistency when metadata is updated. As shown in Fig. 2, the file system layout of URFS consists mainly of a metadata region and a data region.
The metadata region is composed of a super block and inodes. The super block is 2M in size and records immutable information, such as the file system type and version number, as well as variable information, such as the size of the metadata region. Each inode is 4K in size and records the metadata of a file or directory. A directory inode contains a 512-byte header and the 512-byte full path of the directory; the remaining 3K is reserved. A file inode contains a 512-byte header and the 512-byte full path of the file; the remaining 3K uses different layouts depending on the file size.
Unlike traditional file systems such as EXT4 and XFS, the on-disk structure of URFS contains no disk space bitmap, write-ahead log, or directory entries. Instead, URFS reads and parses the inodes when the file system is mounted and constructs, in memory, a hash index of files, a B-Tree index of directories, and a free-space list. This design reduces the types and amount of metadata that must be persisted to disk, which in turn reduces the number of IOs per metadata update and improves the performance of data update operations. It also greatly simplifies the consistency model, making a logging mechanism unnecessary: when a file is updated, the metadata modification touches only that file's inode block and can be completed with a single NVMe atomic operation. The disadvantage of this design is slower startup. Although the read bandwidth of NVMe disks is usually above 2GB/s, actual tests show that the overhead comes mainly from constructing the in-memory data structures rather than from reading the disk. With millions of files, the startup time is on the order of minutes.
The full path of a file is stored in its file inode, so a full-path file access can be resolved directly from the hash index without walking inodes component by component. Directory inodes exist to support empty directories and directory renaming. URFS does not maintain dentries; directory traversal is served by the in-memory B-Tree index. A directory rename operation is split into a rename of each file in the directory. To achieve crash consistency, the directory rename first records the new directory name in the reserved area of the directory inode, then modifies the file inodes one by one, and finally completes the actual modification of the directory name. During the startup phase, an unfinished rename can be detected and recovered by checking the reserved area of the directory inode. The cost of a rename is clearly proportional to the number of files in the directory, so applications should prefer a flat directory structure and limit the number of files in a single directory, which is consistent with common file layout optimization strategies.
When data blocks are allocated, URFS distinguishes between a small zone and a big zone. Files of different sizes usually differ in type, and therefore also in access frequency and life cycle. URFS exploits this by concentrating small files in the small zone, while the allocation granularity for large files is aligned with the SSD erase granularity. Partitioning the data in this way effectively reduces the FTL's GC overhead [5]. Specifically, the format and content of the file inode change with the file length. Files smaller than 3K are stored inline in the inode, which reduces the number of IOs and improves both space utilization and read/write performance. For files larger than 3K, data blocks are first allocated from the small zone at 4K granularity. When the file length exceeds 2M, the file's existing 4K data blocks are copied and merged into a 2MB data block, and subsequent data blocks are allocated from the big zone at 2MB (or multiples of 2MB) granularity. This SSD-friendly multi-level allocation scheme better balances performance and space utilization.
For files written as a stream, data block allocation may quickly pass through the multiple stages above, causing inode layout changes and unnecessary data movement. As in traditional file systems, delayed allocation of data blocks effectively mitigates this problem. By distinguishing file sizes, URFS also collects frequently modified data into the same block; when the FTL performs GC, the number of valid pages in the block selected for erasure is therefore reduced, lowering GC overhead.
The SSD storage space uses the FTL erase block size as
the minimum allocation unit, the metadata space and the