Linux-Fsdevel Archive on lore.kernel.org
* [LSF/MM TOPIC] A high-performance userspace block driver
@ 2018-01-16 14:52 Matthew Wilcox
  2018-01-16 23:04 ` Viacheslav Dubeyko
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Matthew Wilcox @ 2018-01-16 14:52 UTC
  To: lsf-pc; +Cc: linux-mm, linux-fsdevel, linux-block


I see the improvements that Facebook have been making to the nbd driver,
and I think that's a wonderful thing.  Maybe the outcome of this topic
is simply: "Shut up, Matthew, this is good enough".

It's clear that there's an appetite for userspace block devices; not for
swap devices or the root device, but for accessing data that's stored
in that silo over there, and I really don't want to bring that entire
mess of CORBA / Go / Rust / whatever into the kernel to get to it,
but it would be really handy to present it as a block device.

I've looked at a few block-driver-in-userspace projects that exist, and
they all seem pretty bad.  For example, one API maps a few gigabytes of
address space and plays games with vm_insert_page() to put page cache
pages into the address space of the client process.  Of course, the TLB
flush overhead of that solution is criminal.

I've looked at pipes, and they're not an awful solution.  We've almost
got enough syscalls to treat other objects as pipes.  The problem is
that they're not seekable.  So essentially you're looking at having one
pipe per outstanding command.  If you want to make good use of a modern
NAND device, you want a few hundred outstanding commands, and that's a
bit of a shoddy interface.
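
As a small illustration of the seekability problem (a sketch, not code
from any of the projects mentioned): lseek() on a pipe fails with ESPIPE,
so a single pipe cannot address the data of more than one command by
offset.  The helper name pipe_is_seekable() is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a freshly created pipe accepts
 * lseek(), 0 if the kernel rejects it with ESPIPE, -1 on any other
 * error. */
static int pipe_is_seekable(void)
{
    int fds[2];
    int result;

    if (pipe(fds) != 0)
        return -1;
    /* pipe() succeeded, so both ends are valid descriptors. */
    assert(fds[0] >= 0 && fds[1] >= 0);

    errno = 0;
    if (lseek(fds[0], 0, SEEK_SET) != (off_t)-1)
        result = 1;
    else if (errno == ESPIPE)
        result = 0;
    else
        result = -1;

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

On Linux, pipe_is_seekable() should return 0, which is why a pipe-based
design ends up with one pipe per outstanding command.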

Right now, I'm leaning towards combining these two approaches; adding
a VM_NOTLB flag so the mmaped bits of the page cache never make it into
the process's address space, so the TLB shootdown can be safely skipped.
Then check it in follow_page_mask() and return the appropriate struct
page.  As long as the userspace process does everything using O_DIRECT,
I think this will work.

It's either that or make pipes seekable ...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 [LSF/MM TOPIC] A high-performance userspace block driver Matthew Wilcox
@ 2018-01-16 23:04 ` Viacheslav Dubeyko
  2018-01-16 23:23 ` Theodore Ts'o
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Viacheslav Dubeyko @ 2018-01-16 23:04 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, linux-fsdevel, linux-block

On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
> 
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> 
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> 
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> 
> It's either that or make pipes seekable ...

I like the whole idea. But why pipes? What about shared memory? Making
pipes seekable sounds like killing the original concept. We usually
treat a pipe as a FIFO communication channel, so a seekable pipe sounds
really strange from my point of view. Maybe we need some new
abstraction?

By the way, what use case(s) do you have in mind for the suggested
approach?

Thanks,
Vyacheslav Dubeyko.



* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 [LSF/MM TOPIC] A high-performance userspace block driver Matthew Wilcox
  2018-01-16 23:04 ` Viacheslav Dubeyko
@ 2018-01-16 23:23 ` Theodore Ts'o
  2018-01-16 23:28   ` [Lsf-pc] " James Bottomley
  2018-01-17  0:41 ` Bart Van Assche
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2018-01-16 23:23 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, linux-fsdevel, linux-block

On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> 
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

... and using iSCSI was too painful and heavyweight.

Google has an iblock device implementation, so you can use that as
confirmation that there certainly has been a desire for such a thing.
In fact, we're happily using it in production even as we speak.

We have been (tentatively) planning on presenting it at OSS North
America later in the year, since the Vault conference is no longer
with us, but we could probably put together a quick presentation for
LSF/MM if there is interest.

There were plans to do something using page cache tricks (what we were
calling the "zero copy" option), but we decided to start with
something simpler and more reliable.  So long as it was less overhead
and pain than iSCSI (which was simply an over-engineered solution for
our use case), it was all upside.

						- Ted


* Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 23:23 ` Theodore Ts'o
@ 2018-01-16 23:28   ` James Bottomley
  2018-01-16 23:57     ` Bart Van Assche
  0 siblings, 1 reply; 11+ messages in thread
From: James Bottomley @ 2018-01-16 23:28 UTC
  To: Theodore Ts'o, Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, lsf-pc, linux-block, linux-scsi

On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > 
> > 
> > I see the improvements that Facebook have been making to the nbd
> > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > this topic is simply: "Shut up, Matthew, this is good enough".
> > 
> > It's clear that there's an appetite for userspace block devices;
> > not for swap devices or the root device, but for accessing data
> > that's stored in that silo over there, and I really don't want to
> > bring that entire mess of CORBA / Go / Rust / whatever into the
> > kernel to get to it, but it would be really handy to present it as
> > a block device.
> 
> ... and using iSCSI was too painful and heavyweight.

From what I've seen a reasonable number of storage over IP cloud
implementations are actually using AoE.  The argument goes that the
protocol is about ideal (at least as compared to iSCSI or FCoE) and the
company behind it doesn't seem to want to add any more features that
would bloat it.

James


* Re: [Lsf-pc] [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 23:28   ` [Lsf-pc] " James Bottomley
@ 2018-01-16 23:57     ` Bart Van Assche
  0 siblings, 0 replies; 11+ messages in thread
From: Bart Van Assche @ 2018-01-16 23:57 UTC
  To: James.Bottomley, tytso, willy
  Cc: linux-scsi, linux-mm, linux-block, lsf-pc, linux-fsdevel

On Tue, 2018-01-16 at 15:28 -0800, James Bottomley wrote:
> On Tue, 2018-01-16 at 18:23 -0500, Theodore Ts'o wrote:
> > On Tue, Jan 16, 2018 at 06:52:40AM -0800, Matthew Wilcox wrote:
> > > 
> > > 
> > > I see the improvements that Facebook have been making to the nbd
> > > driver, and I think that's a wonderful thing.  Maybe the outcome of
> > > this topic is simply: "Shut up, Matthew, this is good enough".
> > > 
> > > It's clear that there's an appetite for userspace block devices;
> > > not for swap devices or the root device, but for accessing data
> > > that's stored in that silo over there, and I really don't want to
> > > bring that entire mess of CORBA / Go / Rust / whatever into the
> > > kernel to get to it, but it would be really handy to present it as
> > > a block device.
> > 
> > ... and using iSCSI was too painful and heavyweight.
> 
> From what I've seen a reasonable number of storage over IP cloud
> implementations are actually using AoE.  The argument goes that the
> protocol is about ideal (at least as compared to iSCSI or FCoE) and the
> company behind it doesn't seem to want to add any more features that
> would bloat it.

Has anyone already looked into iSER, SRP or NVMeOF over rdma_rxe over the
loopback network driver? I think all three driver stacks support zero-copy
receiving, something that is not possible with iSCSI/TCP nor with AoE.

Bart.


* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 [LSF/MM TOPIC] A high-performance userspace block driver Matthew Wilcox
  2018-01-16 23:04 ` Viacheslav Dubeyko
  2018-01-16 23:23 ` Theodore Ts'o
@ 2018-01-17  0:41 ` Bart Van Assche
  2018-01-17  2:49 ` Ming Lei
  2018-01-18  5:27 ` Figo.zhang
  4 siblings, 0 replies; 11+ messages in thread
From: Bart Van Assche @ 2018-01-17  0:41 UTC
  To: lsf-pc, willy; +Cc: linux-mm, linux-block, linux-fsdevel

On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
> 
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
> 
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
> 
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
> 
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
> 
> It's either that or make pipes seekable ...

How about using the RDMA API and the rdma_rxe driver over loopback? The RDMA
API supports zero-copy communication which is something the BSD socket API
does not support. The RDMA API also supports byte-level granularity and the
hot path (ib_post_send(), ib_post_recv(), ib_poll_cq()) does not require any
system calls for PCIe RDMA adapters. The rdma_rxe driver however uses a system
call to trigger the send doorbell.

Bart.


* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 [LSF/MM TOPIC] A high-performance userspace block driver Matthew Wilcox
                   ` (2 preceding siblings ...)
  2018-01-17  0:41 ` Bart Van Assche
@ 2018-01-17  2:49 ` Ming Lei
  2018-01-17 21:21   ` Matthew Wilcox
  2018-01-18  5:27 ` Figo.zhang
  4 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2018-01-17  2:49 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, Linux FS Devel, linux-block

On Tue, Jan 16, 2018 at 10:52 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.

I like the idea; one line of Python/... code may need thousands of
lines of C to do the same thing in the kernel.

>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.  For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...

Userfaultfd might be another choice:

1) map the block LBA space into a range of process vm space

2) when READ/WRITE req comes, convert it to page fault on the
mapped range, and let userland take control of it; meanwhile the
kernel req context sleeps

3) IO req context in kernel side is woken up after userspace completes
the IO request via userfaultfd

4) kernel side continues to complete the IO, such as copying pages from
the storage range to the req(bio) pages.

READ should be fine, since it is very similar to the use case of QEMU
postcopy live migration. WRITE can be a bit different, and may need
some changes to userfaultfd.

-- 
Ming Lei


* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-17  2:49 ` Ming Lei
@ 2018-01-17 21:21   ` Matthew Wilcox
  2018-01-22 12:02     ` Mike Rapoport
  2018-01-22 12:18     ` Ming Lei
  0 siblings, 2 replies; 11+ messages in thread
From: Matthew Wilcox @ 2018-01-17 21:21 UTC
  To: Ming Lei; +Cc: lsf-pc, linux-mm, Linux FS Devel, linux-block

On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
> Userfaultfd might be another choice:
> 
> 1) map the block LBA space into a range of process vm space

That would limit the size of a block device to ~200TB (with my laptop's
CPU).  That's probably OK for most users, but I suspect there are some
who would chafe at such a restriction (before the 57-bit CPUs arrive).
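
For reference, the arithmetic behind these limits can be sketched as
below.  The split is an assumption based on x86-64 Linux, where half of
the virtual address space belongs to userspace: 48-bit addressing leaves
47 bits (128 TiB) for a process, and 57-bit addressing leaves 56 bits
(64 PiB).  The helper names va_bytes() and to_tib() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes addressable with `bits` bits of virtual address. */
static uint64_t va_bytes(unsigned bits)
{
    assert(bits < 64);          /* a shift by 64 or more would be undefined */
    return UINT64_C(1) << bits;
}

/* Convert a byte count to whole TiB (1 TiB = 2^40 bytes). */
static uint64_t to_tib(uint64_t bytes)
{
    return bytes >> 40;
}
```

Under that assumption, to_tib(va_bytes(47)) is 128, to_tib(va_bytes(48))
is 256, and to_tib(va_bytes(56)) is 65536 (64 PiB), so the quoted figure
sits in the range that 48-bit addressing allows.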

> 2) when READ/WRITE req comes, convert it to page fault on the
> mapped range, and let userland to take control of it, and meantime
> kernel req context is slept

You don't want to sleep the request; you want it to be able to submit
more I/O.  But we have infrastructure in place to inform the submitter
when I/Os have completed.

> 3) IO req context in kernel side is waken up after userspace completed
> the IO request via userfaultfd
> 
> 4) kernel side continue to complete the IO, such as copying page from
> storage range to req(bio) pages.
> 
> Seems READ should be fine since it is very similar with the use case
> of QEMU postcopy live migration, WRITE can be a bit different, and
> maybe need some change on userfaultfd.

I like this idea, and maybe extending UFFD is the way to solve this
problem.  Perhaps I should explain a little more what the requirements
are.  At the point the driver gets the I/O, pages to copy data into (for
a read) or copy data from (for a write) have already been allocated.
At all costs, we need to avoid playing VM tricks (because TLB flushes
are expensive).  So one copy is probably OK, but we'd like to avoid it
if reasonable.

Let's assume that the userspace program looks at the request metadata and
decides that it needs to send a network request.  Ideally, it would find
a way to have the data from the response land in the pre-allocated pages
(for a read) or send the data straight from the pages in the request
(for a write).  I'm not sure UFFD helps us with that part of the problem.


* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-16 14:52 [LSF/MM TOPIC] A high-performance userspace block driver Matthew Wilcox
                   ` (3 preceding siblings ...)
  2018-01-17  2:49 ` Ming Lei
@ 2018-01-18  5:27 ` Figo.zhang
  4 siblings, 0 replies; 11+ messages in thread
From: Figo.zhang @ 2018-01-18  5:27 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, Linux MM, linux-fsdevel, linux-block


2018-01-16 22:52 GMT+08:00 Matthew Wilcox <willy@infradead.org>:

>
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing.  Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad.


How about SPDK?


> For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process.  Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution.  We've almost
> got enough syscalls to treat other objects as pipes.  The problem is
> that they're not seekable.  So essentially you're looking at having one
> pipe per outstanding command.  If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page.  As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...
>

* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-17 21:21   ` Matthew Wilcox
@ 2018-01-22 12:02     ` Mike Rapoport
  2018-01-22 12:18     ` Ming Lei
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Rapoport @ 2018-01-22 12:02 UTC
  To: Matthew Wilcox; +Cc: Ming Lei, lsf-pc, linux-mm, Linux FS Devel, linux-block

On Wed, Jan 17, 2018 at 01:21:44PM -0800, Matthew Wilcox wrote:
> On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
> > Userfaultfd might be another choice:
> > 
> > 1) map the block LBA space into a range of process vm space
> 
> That would limit the size of a block device to ~200TB (with my laptop's
> CPU).  That's probably OK for most users, but I suspect there are some
> who would chafe at such a restriction (before the 57-bit CPUs arrive).
> 
> > 2) when READ/WRITE req comes, convert it to page fault on the
> > mapped range, and let userland to take control of it, and meantime
> > kernel req context is slept
> 
> You don't want to sleep the request; you want it to be able to submit
> more I/O.  But we have infrastructure in place to inform the submitter
> when I/Os have completed.

It's possible to queue IO requests and have a kthread that will convert
those requests to page faults. The thread indeed will sleep on each page
fault, though.
 
> > 3) IO req context in kernel side is waken up after userspace completed
> > the IO request via userfaultfd
> > 
> > 4) kernel side continue to complete the IO, such as copying page from
> > storage range to req(bio) pages.
> > 
> > Seems READ should be fine since it is very similar with the use case
> > of QEMU postcopy live migration, WRITE can be a bit different, and
> > maybe need some change on userfaultfd.
> 
> I like this idea, and maybe extending UFFD is the way to solve this
> problem.  Perhaps I should explain a little more what the requirements
> are.  At the point the driver gets the I/O, pages to copy data into (for
> a read) or copy data from (for a write) have already been allocated.
> At all costs, we need to avoid playing VM tricks (because TLB flushes
> are expensive).  So one copy is probably OK, but we'd like to avoid it
> if reasonable.
> 
> Let's assume that the userspace program looks at the request metadata and
> decides that it needs to send a network request.  Ideally, it would find
> a way to have the data from the response land in the pre-allocated pages
> (for a read) or send the data straight from the pages in the request
> (for a write).  I'm not sure UFFD helps us with that part of the problem.

As of now it does not. UFFD allocates pages when userland asks to copy the
data into a UFFD-controlled VMA.
In your example, after the data has arrived from the network, userland can
copy it into a page that UFFD will allocate.

Unrelated to block device, I've been thinking of implementing splice for
userfaultfd...

-- 
Sincerely yours,
Mike.


* Re: [LSF/MM TOPIC] A high-performance userspace block driver
  2018-01-17 21:21   ` Matthew Wilcox
  2018-01-22 12:02     ` Mike Rapoport
@ 2018-01-22 12:18     ` Ming Lei
  1 sibling, 0 replies; 11+ messages in thread
From: Ming Lei @ 2018-01-22 12:18 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm, Linux FS Devel, linux-block

On Thu, Jan 18, 2018 at 5:21 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
>> Userfaultfd might be another choice:
>>
>> 1) map the block LBA space into a range of process vm space
>
> That would limit the size of a block device to ~200TB (with my laptop's
> CPU).  That's probably OK for most users, but I suspect there are some
> who would chafe at such a restriction (before the 57-bit CPUs arrive).

In theory, it won't be an issue, since the LBA space can be partitioned
across more than one process's vm space, so this approach should work no
matter what the size of the block device is.
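
A rough sketch of that partitioning, assuming 512-byte logical blocks and
fixed-size windows (both are assumptions; nothing above specifies either).
The struct and function names are hypothetical: an LBA is turned into a
byte offset, then split into the index of the window (for example, one
per process) and the offset inside that window.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9   /* assumption: 512-byte logical blocks */

struct lba_slot {
    uint64_t window;   /* index of the mapping that covers this LBA */
    uint64_t offset;   /* byte offset inside that mapping */
};

/* Hypothetical helper: locate `lba` in a device split into windows of
 * `window_bytes` bytes each. */
static struct lba_slot lba_to_slot(uint64_t lba, uint64_t window_bytes)
{
    struct lba_slot s;
    uint64_t byte_off;

    /* The window size must be a non-zero multiple of the block size. */
    assert(window_bytes != 0);
    assert((window_bytes & ((UINT64_C(1) << SECTOR_SHIFT) - 1)) == 0);
    /* The byte offset must fit in 64 bits. */
    assert(lba <= (UINT64_MAX >> SECTOR_SHIFT));

    byte_off = lba << SECTOR_SHIFT;
    s.window = byte_off / window_bytes;
    s.offset = byte_off % window_bytes;
    return s;
}
```

With 1 TiB windows, the LBA that sits 4096 bytes into the fourth terabyte
resolves to window 3, offset 4096.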

>
>> 2) when READ/WRITE req comes, convert it to page fault on the
>> mapped range, and let userland to take control of it, and meantime
>> kernel req context is slept
>
> You don't want to sleep the request; you want it to be able to submit
> more I/O.  But we have infrastructure in place to inform the submitter
> when I/Os have completed.

Yes, the current bio completion (.bi_end_io) model can be respected, and
this issue (where to sleep) may depend on UFFD's read/POLLIN protocol.

>
>> 3) IO req context in kernel side is waken up after userspace completed
>> the IO request via userfaultfd
>>
>> 4) kernel side continue to complete the IO, such as copying page from
>> storage range to req(bio) pages.
>>
>> Seems READ should be fine since it is very similar with the use case
>> of QEMU postcopy live migration, WRITE can be a bit different, and
>> maybe need some change on userfaultfd.
>
> I like this idea, and maybe extending UFFD is the way to solve this
> problem.  Perhaps I should explain a little more what the requirements
> are.  At the point the driver gets the I/O, pages to copy data into (for
> a read) or copy data from (for a write) have already been allocated.
> At all costs, we need to avoid playing VM tricks (because TLB flushes
> are expensive).  So one copy is probably OK, but we'd like to avoid it
> if reasonable.

I agree, and a single page copy would be easier to implement.

>
> Let's assume that the userspace program looks at the request metadata and
> decides that it needs to send a network request.  Ideally, it would find
> a way to have the data from the response land in the pre-allocated pages
> (for a read) or send the data straight from the pages in the request
> (for a write).  I'm not sure UFFD helps us with that part of the problem.


-- 
Ming Lei

