Linux-Fsdevel Archive on
help / color / mirror / Atom feed
From: Matthew Wilcox <>
To: Mikulas Patocka <>
Cc: Dave Chinner <>,
	Dan Williams <>,
	Linus Torvalds <>,
	Alexander Viro <>,
	Andrew Morton <>,
	Vishal Verma <>,
	Dave Jiang <>,
	Ira Weiny <>, Jan Kara <>,
	Eric Sandeen <>,
	Dave Chinner <>,
	"Kani, Toshi" <>,
	"Norton, Scott J" <>,
	Rajesh (DCIG/CDI/HPS Perf)"  <>,
	Linux Kernel Mailing List <>,
	linux-fsdevel <>,
	linux-nvdimm <>
Subject: Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
Date: Tue, 22 Sep 2020 18:25:53 +0100	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Sep 22, 2020 at 12:46:05PM -0400, Mikulas Patocka wrote:
> I agree that the b+tree were a good choice for XFS.
> In RAM-based maps, red-black trees or avl trees are used often. In 
> disk-based maps, btrees or b+trees are used. That's because in RAM, you 
> are optimizing for the number of cache lines accessed, and on the disk, 
> you are optimizing for the number of blocks accessed.

That's because software hasn't yet caught up to modern hardware.  An
in-memory btree outperforms an rbtree for several reasons:

 - rbtrees are taller; the average height of a b-tree with even just
   8 pointers per level is about half of an rbtree.
 - Not all cachelines are created equal.  The subsequent cacheline
   of this cacheline is probably already in cache.  If not, it's on its
   way and will be here shortly.  So a forward scan of this node will be
   quicker than finding another level of tree.  Our experiments have found
   a performance improvement with 256-byte B-tree nodes over an rbtree.
 - btrees are (or can be) RCU-safe.  It's very hard to make an rbtree
   RCU safe.  You can do a seqlock to get most of the benefit, but btrees
   let you allocate a new node and copy into it, while rbtrees embed the
   node in the object.

The downside, of course, is that b-trees require external allocations
while rbtrees embed the node in the object.  So you may need to do
more allocations.  But filesystems love doing additional allocations;
they make journalling so much easier.

> BTW. How does XFS "predict" the file size? - so that it allocates extent 
> of proper size without knowing how big the file will be?

XFS does delayed allocation.  The page cache absorbs the writes and then
at writeback time, XFS chooses where on storage to put it.  Some things
break this like O_SYNC and, er, DAX, but it's very effective.

> > The NVFS indirect block tree has a fan-out of 16,
> No. The top level in the inode contains 16 blocks (11 direct and 5 
> indirect). And each indirect block can have 512 pointers (4096/8). You can 
> format the device with larger block size and this increases the fanout 
> (the NVFS block size must be greater or equal than the system page size).
> 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> map 256 TiB and 5 levels can map 128 PiB.

But compare to an unfragmented file ... you can map the entire thing with
a single entry.  Even if you have to use a leaf node, you can get four
extents in a single cacheline (and that's a fairly naive leaf node layout;
I don't know exactly what XFS uses)

> > Rename is another operation that has specific "operation has atomic
> > behaviour" expectations. I haven't looked at how you've
> > implementated that yet, but I suspect it also is extremely difficult
> > to implement in an atomic manner using direct pmem updates to the
> > directory structures.
> There is a small window when renamed inode is neither in source nor in 
> target directory. Fsck will reclaim such inode and add it to lost+found - 
> just like on EXT2.

... ouch.  If you have to choose, it'd be better to link it to the second
directory then unlink it from the first one.  Then your fsck can detect
it has the wrong count and fix up the count (ie link it into both
directories rather than neither).

> If you think that the lack of journaling is show-stopper, I can implement 
> it. But then, I'll have something that has complexity of EXT4 and 
> performance of EXT4. So that there will no longer be any reason why to use 
> NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> attract some users who want good performance and who don't care about GID 
> and UID being updated atomically, etc.

Well, what's your intent with nvfs?  Do you already have customers in mind
who want to use this in production, or is this somewhere to play with and
develop concepts that might make it into one of the longer-established

  reply	other threads:[~2020-09-22 17:26 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-15 12:34 [RFC] nvfs: a filesystem for persistent memory Mikulas Patocka
2020-09-15 13:00 ` Matthew Wilcox
2020-09-15 13:24   ` Mikulas Patocka
2020-09-22 10:04   ` Ritesh Harjani
2020-09-15 15:16 ` Dan Williams
2020-09-15 16:58   ` Mikulas Patocka
2020-09-15 17:38     ` Mikulas Patocka
2020-09-16 10:57       ` [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache Mikulas Patocka
2020-09-16 16:21         ` Dan Williams
2020-09-16 17:24           ` Mikulas Patocka
2020-09-16 17:40             ` Dan Williams
2020-09-16 18:06               ` Mikulas Patocka
2020-09-21 16:20                 ` NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache) Mikulas Patocka
2020-09-22  5:03                   ` Dave Chinner
2020-09-22 16:46                     ` Mikulas Patocka
2020-09-22 17:25                       ` Matthew Wilcox [this message]
2020-09-24 15:00                         ` Mikulas Patocka
2020-09-28 15:22                           ` Mikulas Patocka
2020-09-23  2:45                       ` Dave Chinner
2020-09-23  9:20                         ` A bug in ext4 with big directories (was: NVFS XFS metadata) Mikulas Patocka
2020-09-23  9:44                           ` Jan Kara
2020-09-23 12:46                             ` Mikulas Patocka
2020-09-23 20:20                             ` Andreas Dilger
2020-09-23 17:19                         ` NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache) Mikulas Patocka
2020-09-23  9:57                       ` Jan Kara
2020-09-23 13:11                         ` Mikulas Patocka
2020-09-23 15:04                           ` Matthew Wilcox
2020-09-22 12:28                   ` Matthew Wilcox
2020-09-22 12:39                     ` Mikulas Patocka
2020-09-16 18:56               ` [PATCH] pmem: fix __copy_user_flushcache Mikulas Patocka
2020-09-18  1:53                 ` Dan Williams
2020-09-18 12:25                   ` the "read" syscall sees partial effects of the "write" syscall Mikulas Patocka
2020-09-18 13:13                     ` Jan Kara
2020-09-18 18:02                       ` Linus Torvalds
2020-09-20 23:41                       ` Dave Chinner
2020-09-17  6:50               ` [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache Christoph Hellwig
2020-09-21 16:19   ` [RFC] nvfs: a filesystem for persistent memory Mikulas Patocka
2020-09-21 16:29     ` Dan Williams
2020-09-22 15:43     ` Ira Weiny

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \
    --subject='Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).