Linux-Fsdevel Archive on
help / color / mirror / Atom feed
From: Ming Lei <>
To: Matthew Wilcox <>
Cc:, linux-mm <>,
	Linux FS Devel <>,
	linux-block <>
Subject: Re: [LSF/MM TOPIC] A high-performance userspace block driver
Date: Wed, 17 Jan 2018 10:49:24 +0800	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Jan 16, 2018 at 10:52 PM, Matthew Wilcox <> wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

I like the idea, and one line code of Python/... may need thousands
of C code to be done in kernel.

> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If yu want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> It's either that or make pipes seekable ...

Userfaultfd might be another choice:

1) map the block LBA space into a range of process vm space

2) when READ/WRITE req comes, convert it to page fault on the
mapped range, and let userland to take control of it, and meantime
kernel req context is slept

3) IO req context in kernel side is waken up after userspace completed
the IO request via userfaultfd

4) kernel side continue to complete the IO, such as copying page from
storage range to req(bio) pages.

Seems READ should be fine since it is very similar with the use case
of QEMU postcopy live migration, WRITE can be a bit different, and
maybe need some change on userfaultfd.

Ming Lei

To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to  For more info on Linux MM,
see: .
Don't email: <a href=mailto:""> </a>

  parent reply	other threads:[~2018-01-17  2:49 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-16 14:52 Matthew Wilcox
2018-01-16 23:04 ` Viacheslav Dubeyko
2018-01-16 23:23 ` Theodore Ts'o
2018-01-16 23:28   ` [Lsf-pc] " James Bottomley
2018-01-16 23:57     ` Bart Van Assche
2018-01-17  0:41 ` Bart Van Assche
2018-01-17  2:49 ` Ming Lei [this message]
2018-01-17 21:21   ` Matthew Wilcox
2018-01-22 12:02     ` Mike Rapoport
2018-01-22 12:18     ` Ming Lei
2018-01-18  5:27 ` Figo.zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \
    --subject='Re: [LSF/MM TOPIC] A high-performance userspace block driver' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).