LKML Archive on lore.kernel.org help / color / mirror / Atom feed
* [PATCH net-next 0/2] tcp: mmap: rework zerocopy receive @ 2018-04-25 5:27 Eric Dumazet 2018-04-25 5:27 ` [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet 2018-04-25 5:27 ` [PATCH net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE Eric Dumazet 0 siblings, 2 replies; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 5:27 UTC (permalink / raw) To: David S . Miller Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet syzbot reported a lockdep issue caused by tcp mmap() support. I implemented Andy Lutomirski nice suggestions to resolve the issue and increase scalability as well. First patch is adding a new setsockopt() operation and changes mmap() behavior. Second patch changes tcp_mmap reference program. Eric Dumazet (2): tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE include/uapi/linux/tcp.h | 8 ++ net/ipv4/tcp.c | 186 +++++++++++++------------ tools/testing/selftests/net/tcp_mmap.c | 63 +++++---- 3 files changed, 139 insertions(+), 118 deletions(-) -- 2.17.0.484.g0c8726318c-goog ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 5:27 [PATCH net-next 0/2] tcp: mmap: rework zerocopy receive Eric Dumazet @ 2018-04-25 5:27 ` Eric Dumazet 2018-04-25 6:28 ` Christoph Hellwig 2018-04-25 13:42 ` Eric Dumazet 2018-04-25 5:27 ` [PATCH net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE Eric Dumazet 1 sibling, 2 replies; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 5:27 UTC (permalink / raw) To: David S . Miller Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet When adding tcp mmap() implementation, I forgot that socket lock had to be taken before current->mm->mmap_sem. syzbot eventually caught the bug. Since we can not lock the socket in tcp mmap() handler we have to split the operation in two phases. 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. This operation does not involve any TCP locking. 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements the transfert of pages from skbs to one VMA. This operation only uses down_read(¤t->mm->mmap_sem) after holding TCP lock, thus solving the lockdep issue. This new implementation was suggested by Andy Lutomirski with great details. Benefits are : - Better scalability, in case multiple threads reuse VMAS (without mmap()/munmap() calls) since mmap_sem wont be write locked. - Better error recovery. The previous mmap() model had to provide the expected size of the mapping. If for some reason one part could not be mapped (partial MSS), the whole operation had to be aborted. With the tcp_zerocopy_receive struct, kernel can report how many bytes were successfuly mapped, and how many bytes should be read to skip the problematic sequence. - No more memory allocation to hold an array of page pointers. 16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/ - skbs are freed while mmap_sem has been released Following patch makes the change in tcp_mmap tool to demonstrate one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...) Note that memcg might require additional changes. Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Cc: linux-mm@kvack.org Cc: Soheil Hassas Yeganeh <soheil@google.com> --- include/uapi/linux/tcp.h | 8 ++ net/ipv4/tcp.c | 186 ++++++++++++++++++++------------------- 2 files changed, 103 insertions(+), 91 deletions(-) diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -122,6 +122,7 @@ enum { #define TCP_MD5SIG_EXT 32 /* TCP MD5 Signature with extensions */ #define TCP_FASTOPEN_KEY 33 /* Set the key for Fast Open (cookie) */ #define TCP_FASTOPEN_NO_COOKIE 34 /* Enable TFO without a TFO cookie */ +#define TCP_ZEROCOPY_RECEIVE 35 struct tcp_repair_opt { __u32 opt_code; @@ -276,4 +277,11 @@ struct tcp_diag_md5sig { __u8 tcpm_key[TCP_MD5SIG_MAXKEYLEN]; }; +/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */ + +struct tcp_zerocopy_receive { + __u64 address; /* in: address of mapping */ + __u32 length; /* in/out: number of bytes to map/mapped */ + __u32 recv_skip_hint; /* out: amount of bytes to skip */ +}; #endif /* _UAPI_LINUX_TCP_H */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index dfd090ea54ad47112fc23c61180b5bf8edd2c736..a28eca97df9465a0aa522a1833a171b53c237b1f 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1726,118 +1726,108 @@ int tcp_set_rcvlowat(struct sock *sk, int val) } EXPORT_SYMBOL(tcp_set_rcvlowat); -/* When user wants to mmap X pages, we first need to perform the mapping - * before freeing any skbs in receive queue, otherwise user would be unable - * to fallback to standard recvmsg(). This happens if some data in the - * requested block is not exactly fitting in a page. - * - * We only support order-0 pages for the moment. - * mmap() on TCP is very strict, there is no point - * trying to accommodate with pathological layouts. - */ +static const struct vm_operations_struct tcp_vm_ops = { +}; + int tcp_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma) { - unsigned long size = vma->vm_end - vma->vm_start; - unsigned int nr_pages = size >> PAGE_SHIFT; - struct page **pages_array = NULL; - u32 seq, len, offset, nr = 0; - struct sock *sk = sock->sk; - const skb_frag_t *frags; + if (vma->vm_flags & (VM_WRITE | VM_EXEC)) + return -EPERM; + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC); + + /* Instruct vm_insert_page() to not down_read(mmap_sem) */ + vma->vm_flags |= VM_MIXEDMAP; + + vma->vm_ops = &tcp_vm_ops; + return 0; +} +EXPORT_SYMBOL(tcp_mmap); + +static int tcp_zerocopy_receive(struct sock *sk, + struct tcp_zerocopy_receive *zc) +{ + unsigned long address = (unsigned long)zc->address; + const skb_frag_t *frags = NULL; + u32 length = 0, seq, offset; + struct vm_area_struct *vma; + struct sk_buff *skb = NULL; struct tcp_sock *tp; - struct sk_buff *skb; int ret; - if (vma->vm_pgoff || !nr_pages) + if (address & (PAGE_SIZE - 1) || address != zc->address) return -EINVAL; - if (vma->vm_flags & VM_WRITE) - return -EPERM; - /* TODO: Maybe the following is not needed if pages are COW */ - vma->vm_flags &= ~VM_MAYWRITE; - - lock_sock(sk); - - ret = -ENOTCONN; if (sk->sk_state == TCP_LISTEN) - goto out; + return -ENOTCONN; sock_rps_record_flow(sk); - if (tcp_inq(sk) < size) { - ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN; + down_read(¤t->mm->mmap_sem); + + ret = -EINVAL; + vma = find_vma(current->mm, address); + if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops) goto out; - } + zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); + tp = tcp_sk(sk); seq = tp->copied_seq; - /* Abort if urgent data is in the area */ - if (unlikely(tp->urg_data)) { - u32 urg_offset = tp->urg_seq - seq; + zc->length = min_t(u32, zc->length, tcp_inq(sk)); - ret = -EINVAL; - if (urg_offset < size) - goto out; - } - ret = -ENOMEM; - pages_array = kvmalloc_array(nr_pages, sizeof(struct page *), - GFP_KERNEL); - if (!pages_array) - goto out; - skb = tcp_recv_skb(sk, seq, &offset); - ret = -EINVAL; -skb_start: - /* We do not support anything not in page frags */ - offset -= skb_headlen(skb); - if ((int)offset < 0) - goto out; - if (skb_has_frag_list(skb)) - goto out; - len = skb->data_len - offset; - frags = skb_shinfo(skb)->frags; - while (offset) { - if (frags->size > offset) - goto out; - offset -= frags->size; - frags++; - } - while (nr < nr_pages) { - if (len) { - if (len < PAGE_SIZE) - goto out; - if (frags->size != PAGE_SIZE || frags->page_offset) - goto out; - pages_array[nr++] = skb_frag_page(frags); - frags++; - len -= PAGE_SIZE; - seq += PAGE_SIZE; - continue; - } - skb = skb->next; - offset = seq - TCP_SKB_CB(skb)->seq; - goto skb_start; - } - /* OK, we have a full set of pages ready to be inserted into vma */ - for (nr = 0; nr < nr_pages; nr++) { - ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT), - pages_array[nr]); - if (ret) - goto out; - } - /* operation is complete, we can 'consume' all skbs */ - tp->copied_seq = seq; - tcp_rcv_space_adjust(sk); - - /* Clean up data we have read: This will do ACK frames. */ - tcp_recv_skb(sk, seq, &offset); - tcp_cleanup_rbuf(sk, size); + zap_page_range(vma, address, zc->length); + zc->recv_skip_hint = 0; ret = 0; + while (length + PAGE_SIZE <= zc->length) { + if (zc->recv_skip_hint < PAGE_SIZE) { + if (skb) { + skb = skb->next; + offset = seq - TCP_SKB_CB(skb)->seq; + } else { + skb = tcp_recv_skb(sk, seq, &offset); + } + + zc->recv_skip_hint = skb->len - offset; + offset -= skb_headlen(skb); + if ((int)offset < 0 || skb_has_frag_list(skb)) + break; + frags = skb_shinfo(skb)->frags; + while (offset) { + if (frags->size > offset) + goto out; + offset -= frags->size; + frags++; + } + } + if (frags->size != PAGE_SIZE || frags->page_offset) + break; + ret = vm_insert_page(vma, address + length, + skb_frag_page(frags)); + if (ret) + break; + length += PAGE_SIZE; + seq += PAGE_SIZE; + zc->recv_skip_hint -= PAGE_SIZE; + frags++; + } out: - release_sock(sk); - kvfree(pages_array); + up_read(¤t->mm->mmap_sem); + if (length) { + tp->copied_seq = seq; + tcp_rcv_space_adjust(sk); + + /* Clean up data we have read: This will do ACK frames. */ + tcp_recv_skb(sk, seq, &offset); + tcp_cleanup_rbuf(sk, length); + ret = 0; + } else { + if (!zc->recv_skip_hint && sock_flag(sk, SOCK_DONE)) + ret = -EIO; + } + zc->length = length; return ret; } -EXPORT_SYMBOL(tcp_mmap); static void tcp_update_recv_tstamps(struct sk_buff *skb, struct scm_timestamping *tss) @@ -2738,6 +2728,20 @@ static int do_tcp_setsockopt(struct sock *sk, int level, return tcp_fastopen_reset_cipher(net, sk, key, sizeof(key)); } + case TCP_ZEROCOPY_RECEIVE: { + struct tcp_zerocopy_receive zc; + + if (optlen != sizeof(zc)) + return -EINVAL; + if (copy_from_user(&zc, optval, optlen)) + return -EFAULT; + lock_sock(sk); + err = tcp_zerocopy_receive(sk, &zc); + release_sock(sk); + if (!err && copy_to_user(optval, &zc, optlen)) + err = -EFAULT; + return err; + } default: /* fallthru */ break; -- 2.17.0.484.g0c8726318c-goog ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 5:27 ` [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet @ 2018-04-25 6:28 ` Christoph Hellwig 2018-04-25 13:01 ` Eric Dumazet 2018-04-25 13:42 ` Eric Dumazet 1 sibling, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2018-04-25 6:28 UTC (permalink / raw) To: Eric Dumazet Cc: David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh, Eric Dumazet On Tue, Apr 24, 2018 at 10:27:21PM -0700, Eric Dumazet wrote: > When adding tcp mmap() implementation, I forgot that socket lock > had to be taken before current->mm->mmap_sem. syzbot eventually caught > the bug. > > Since we can not lock the socket in tcp mmap() handler we have to > split the operation in two phases. > > 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. > This operation does not involve any TCP locking. > > 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements > the transfert of pages from skbs to one VMA. > This operation only uses down_read(¤t->mm->mmap_sem) after > holding TCP lock, thus solving the lockdep issue. > > This new implementation was suggested by Andy Lutomirski with great details. Thanks, this looks much more sensible to me. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 6:28 ` Christoph Hellwig @ 2018-04-25 13:01 ` Eric Dumazet 2018-04-25 15:24 ` Christoph Hellwig 2018-04-25 16:04 ` Matthew Wilcox 0 siblings, 2 replies; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 13:01 UTC (permalink / raw) To: Christoph Hellwig, Eric Dumazet Cc: David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh On 04/24/2018 11:28 PM, Christoph Hellwig wrote: > On Tue, Apr 24, 2018 at 10:27:21PM -0700, Eric Dumazet wrote: >> When adding tcp mmap() implementation, I forgot that socket lock >> had to be taken before current->mm->mmap_sem. syzbot eventually caught >> the bug. >> >> Since we can not lock the socket in tcp mmap() handler we have to >> split the operation in two phases. >> >> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. >> This operation does not involve any TCP locking. >> >> 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements >> the transfert of pages from skbs to one VMA. >> This operation only uses down_read(¤t->mm->mmap_sem) after >> holding TCP lock, thus solving the lockdep issue. >> >> This new implementation was suggested by Andy Lutomirski with great details. > > Thanks, this looks much more sensible to me. > Thanks Christoph Note the high cost of zap_page_range(), needed to avoid -EBUSY being returned from vm_insert_page() the second time TCP_ZEROCOPY_RECEIVE is used on one VMA. Ideally a vm_replace_page() would avoid this cost ? 6.51% tcp_mmap [kernel.kallsyms] [k] unmap_page_range 5.90% tcp_mmap [kernel.kallsyms] [k] vm_insert_page 4.85% tcp_mmap [kernel.kallsyms] [k] _raw_spin_lock 4.50% tcp_mmap [kernel.kallsyms] [k] mark_page_accessed 3.51% tcp_mmap [kernel.kallsyms] [k] page_remove_rmap 2.99% tcp_mmap [kernel.kallsyms] [k] page_add_file_rmap 2.53% tcp_mmap [kernel.kallsyms] [k] release_pages 2.38% tcp_mmap [kernel.kallsyms] [k] put_page 2.37% tcp_mmap [kernel.kallsyms] [k] smp_call_function_single 2.28% tcp_mmap [kernel.kallsyms] [k] __get_locked_pte 2.25% tcp_mmap [kernel.kallsyms] [k] do_tcp_setsockopt.isra.35 2.21% tcp_mmap [kernel.kallsyms] [k] page_clear_age ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 13:01 ` Eric Dumazet @ 2018-04-25 15:24 ` Christoph Hellwig 2018-04-25 16:04 ` Matthew Wilcox 1 sibling, 0 replies; 13+ messages in thread From: Christoph Hellwig @ 2018-04-25 15:24 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh On Wed, Apr 25, 2018 at 06:01:02AM -0700, Eric Dumazet wrote: > Thanks Christoph > > Note the high cost of zap_page_range(), needed to avoid -EBUSY being returned > from vm_insert_page() the second time TCP_ZEROCOPY_RECEIVE is used on one VMA. > > Ideally a vm_replace_page() would avoid this cost ? No my area of expertise, I'll defer to the MM folks.. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 13:01 ` Eric Dumazet 2018-04-25 15:24 ` Christoph Hellwig @ 2018-04-25 16:04 ` Matthew Wilcox 2018-04-25 16:20 ` Eric Dumazet 2018-04-25 16:22 ` Andy Lutomirski 1 sibling, 2 replies; 13+ messages in thread From: Matthew Wilcox @ 2018-04-25 16:04 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh On Wed, Apr 25, 2018 at 06:01:02AM -0700, Eric Dumazet wrote: > On 04/24/2018 11:28 PM, Christoph Hellwig wrote: > > On Tue, Apr 24, 2018 at 10:27:21PM -0700, Eric Dumazet wrote: > >> When adding tcp mmap() implementation, I forgot that socket lock > >> had to be taken before current->mm->mmap_sem. syzbot eventually caught > >> the bug. > >> > >> Since we can not lock the socket in tcp mmap() handler we have to > >> split the operation in two phases. > >> > >> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. > >> This operation does not involve any TCP locking. > >> > >> 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements > >> the transfert of pages from skbs to one VMA. > >> This operation only uses down_read(¤t->mm->mmap_sem) after > >> holding TCP lock, thus solving the lockdep issue. > >> > >> This new implementation was suggested by Andy Lutomirski with great details. > > > > Thanks, this looks much more sensible to me. > > > > Thanks Christoph > > Note the high cost of zap_page_range(), needed to avoid -EBUSY being returned > from vm_insert_page() the second time TCP_ZEROCOPY_RECEIVE is used on one VMA. > > Ideally a vm_replace_page() would avoid this cost ? If you don't zap the page range, any of the CPUs in the system where any thread in this task have ever run may have a TLB entry pointing to this page ... if the page is being recycled into the page allocator, then that page might end up as a slab page or page table or page cache while the other CPU still have access to it. You could hang onto the page until you've built up a sufficiently large batch, then bulk-invalidate all of the TLB entries, but we start to get into weirdnesses on different CPU architectures. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 16:04 ` Matthew Wilcox @ 2018-04-25 16:20 ` Eric Dumazet 2018-04-25 16:30 ` Matthew Wilcox 2018-04-25 16:22 ` Andy Lutomirski 1 sibling, 1 reply; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 16:20 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh On 04/25/2018 09:04 AM, Matthew Wilcox wrote: > If you don't zap the page range, any of the CPUs in the system where > any thread in this task have ever run may have a TLB entry pointing to > this page ... if the page is being recycled into the page allocator, > then that page might end up as a slab page or page table or page cache > while the other CPU still have access to it. Yes, this makes sense. > > You could hang onto the page until you've built up a sufficiently large > batch, then bulk-invalidate all of the TLB entries, but we start to get > into weirdnesses on different CPU architectures. > zap_page_range() is already doing a bulk-invalidate, so maybe vm_replace_page() wont bring serious improvement if we end-up doing same dance. Thanks. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 16:20 ` Eric Dumazet @ 2018-04-25 16:30 ` Matthew Wilcox 0 siblings, 0 replies; 13+ messages in thread From: Matthew Wilcox @ 2018-04-25 16:30 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh On Wed, Apr 25, 2018 at 09:20:55AM -0700, Eric Dumazet wrote: > On 04/25/2018 09:04 AM, Matthew Wilcox wrote: > > If you don't zap the page range, any of the CPUs in the system where > > any thread in this task have ever run may have a TLB entry pointing to > > this page ... if the page is being recycled into the page allocator, > > then that page might end up as a slab page or page table or page cache > > while the other CPU still have access to it. > > Yes, this makes sense. > > > > > You could hang onto the page until you've built up a sufficiently large > > batch, then bulk-invalidate all of the TLB entries, but we start to get > > into weirdnesses on different CPU architectures. > > > > zap_page_range() is already doing a bulk-invalidate, > so maybe vm_replace_page() wont bring serious improvement if we end-up doing same dance. Sorry, I was unclear. zap_page_range() bulk-invalidates all pages that were torn down as part of this call. What I was trying to say was that we could have a whole new API which put page after page into the same address, and bumped the refcount on them to prevent them from actually being freed. Once we get to a batch limit, we invalidate all of the pages which were mapped at those addresses and can then free the pages back to the allocator. I don't think you can implement this scheme on s390 because it requires the userspace address to still be mapped to that page on shootdown (?) but I think we could implement it on x86. Another possibility is if we had some way to insert the TLB entry into the local CPU's page tables only, we wouldn't need to broadcast-invalidate the TLB entry; we could just do it locally which is relatively quick. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 16:04 ` Matthew Wilcox 2018-04-25 16:20 ` Eric Dumazet @ 2018-04-25 16:22 ` Andy Lutomirski 2018-04-25 16:35 ` Eric Dumazet 1 sibling, 1 reply; 13+ messages in thread From: Andy Lutomirski @ 2018-04-25 16:22 UTC (permalink / raw) To: Matthew Wilcox Cc: Eric Dumazet, Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh On Wed, Apr 25, 2018 at 9:04 AM, Matthew Wilcox <willy@infradead.org> wrote: > On Wed, Apr 25, 2018 at 06:01:02AM -0700, Eric Dumazet wrote: >> On 04/24/2018 11:28 PM, Christoph Hellwig wrote: >> > On Tue, Apr 24, 2018 at 10:27:21PM -0700, Eric Dumazet wrote: >> >> When adding tcp mmap() implementation, I forgot that socket lock >> >> had to be taken before current->mm->mmap_sem. syzbot eventually caught >> >> the bug. >> >> >> >> Since we can not lock the socket in tcp mmap() handler we have to >> >> split the operation in two phases. >> >> >> >> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. >> >> This operation does not involve any TCP locking. >> >> >> >> 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements >> >> the transfert of pages from skbs to one VMA. >> >> This operation only uses down_read(¤t->mm->mmap_sem) after >> >> holding TCP lock, thus solving the lockdep issue. >> >> >> >> This new implementation was suggested by Andy Lutomirski with great details. >> > >> > Thanks, this looks much more sensible to me. >> > >> >> Thanks Christoph >> >> Note the high cost of zap_page_range(), needed to avoid -EBUSY being returned >> from vm_insert_page() the second time TCP_ZEROCOPY_RECEIVE is used on one VMA. >> >> Ideally a vm_replace_page() would avoid this cost ? > > If you don't zap the page range, any of the CPUs in the system where > any thread in this task have ever run may have a TLB entry pointing to > this page ... if the page is being recycled into the page allocator, > then that page might end up as a slab page or page table or page cache > while the other CPU still have access to it. Indeed. This is one of the reasons that Linus has generally been quite vocal that he doesn't like MMU-based zerocopy schemes. > > You could hang onto the page until you've built up a sufficiently large > batch, then bulk-invalidate all of the TLB entries, but we start to get > into weirdnesses on different CPU architectures. The existing mmu_gather code should already handle this at least moderately well. If it's not, then it should be fixed. On x86, there is no operation to flush a range of addresses. You can flush one address or you can flush all of them. If you flush one page at a time, then you might never recover the performance of a plain old memcpy(). If you flush all of them, then you're hurting the performance of everything else in the task. In general, I suspect that the zerocopy receive mechanism will only really be a win in single-threaded applications that consume large amounts of receive bandwidth on a single TCP socket using lots of memory and don't do all that much else. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 16:22 ` Andy Lutomirski @ 2018-04-25 16:35 ` Eric Dumazet 2018-04-25 16:44 ` Eric Dumazet 0 siblings, 1 reply; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 16:35 UTC (permalink / raw) To: Andy Lutomirski, Matthew Wilcox Cc: Eric Dumazet, Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, linux-kernel, linux-mm, Soheil Hassas Yeganeh On 04/25/2018 09:22 AM, Andy Lutomirski wrote: > In general, I suspect that the zerocopy receive mechanism will only > really be a win in single-threaded applications that consume large > amounts of receive bandwidth on a single TCP socket using lots of > memory and don't do all that much else. This was dully noted in the original patch submission. https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=309c446cb45f6663932c8e6d0754f4ac81d1b5cd Our intent at Google is to use it for some specific 1MB+ receives, not as a generic and universal mechanism. The major benefit is really the 4KB+ MTU, allowing to pack exactly 4096 bytes of payload per page. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 16:35 ` Eric Dumazet @ 2018-04-25 16:44 ` Eric Dumazet 0 siblings, 0 replies; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 16:44 UTC (permalink / raw) To: Andy Lutomirski, Matthew Wilcox Cc: Christoph Hellwig, Eric Dumazet, David S . Miller, netdev, linux-kernel, linux-mm, Soheil Hassas Yeganeh On 04/25/2018 09:35 AM, Eric Dumazet wrote: > > > On 04/25/2018 09:22 AM, Andy Lutomirski wrote: > >> In general, I suspect that the zerocopy receive mechanism will only >> really be a win in single-threaded applications that consume large >> amounts of receive bandwidth on a single TCP socket using lots of >> memory and don't do all that much else. > > This was dully noted in the original patch submission. > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=309c446cb45f6663932c8e6d0754f4ac81d1b5cd > > Our intent at Google is to use it for some specific 1MB+ receives, not as a generic and universal mechanism. > > The major benefit is really the 4KB+ MTU, allowing to pack exactly 4096 bytes of payload per page. > Some perf numbers with 10 concurrent threads in tcp_mmap with zero copy enabled : (tcp_mmap uses 512 KB chunks, not 1MB ones) received 32768 MB (100 % mmap'ed) in 28.3054 s, 9.71116 Gbit cpu usage user:0.039 sys:1.946, 60.5774 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 28.2504 s, 9.73004 Gbit cpu usage user:0.052 sys:1.941, 60.8215 usec per MB, 65536 c-switches received 32768 MB (99.9998 % mmap'ed) in 28.2508 s, 9.72993 Gbit cpu usage user:0.056 sys:1.915, 60.1501 usec per MB, 65539 c-switches received 32768 MB (100 % mmap'ed) in 28.2544 s, 9.72866 Gbit cpu usage user:0.053 sys:1.966, 61.615 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 115.985 s, 2.36995 Gbit cpu usage user:0.057 sys:2.492, 77.7893 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 62.633 s, 4.38871 Gbit cpu usage user:0.048 sys:2.076, 64.8193 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 59.4608 s, 4.62285 Gbit cpu usage user:0.047 sys:1.965, 61.4014 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 119.364 s, 2.30285 Gbit cpu usage user:0.057 sys:2.757, 85.8765 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 121.37 s, 2.2648 Gbit cpu usage user:0.05 sys:2.224, 69.397 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 121.382 s, 2.26457 Gbit cpu usage user:0.049 sys:2.163, 67.5049 usec per MB, 65538 c-switches received 32768 MB (100 % mmap'ed) in 39.7636 s, 6.91281 Gbit cpu usage user:0.055 sys:2.053, 64.3311 usec per MB, 65536 c-switches received 32768 MB (100 % mmap'ed) in 21.2803 s, 12.917 Gbit cpu usage user:0.043 sys:2.057, 64.0869 usec per MB, 65537 c-switches When zero copy is not enabled : received 32768 MB (0 % mmap'ed) in 49.4301 s, 5.56094 Gbit cpu usage user:0.036 sys:6.747, 207.001 usec per MB, 65546 c-switches received 32768 MB (0 % mmap'ed) in 49.431 s, 5.56084 Gbit cpu usage user:0.042 sys:5.262, 161.865 usec per MB, 65540 c-switches received 32768 MB (0 % mmap'ed) in 84.7254 s, 3.24434 Gbit cpu usage user:0.045 sys:5.154, 158.661 usec per MB, 65548 c-switches received 32768 MB (0 % mmap'ed) in 84.7274 s, 3.24426 Gbit cpu usage user:0.043 sys:6.528, 200.531 usec per MB, 65542 c-switches received 32768 MB (0 % mmap'ed) in 35.3133 s, 7.78398 Gbit cpu usage user:0.032 sys:5.066, 155.579 usec per MB, 65540 c-switches received 32768 MB (0 % mmap'ed) in 35.3137 s, 7.78389 Gbit cpu usage user:0.034 sys:6.358, 195.068 usec per MB, 65536 c-switches received 32768 MB (0 % mmap'ed) in 98.8568 s, 2.78057 Gbit cpu usage user:0.042 sys:6.519, 200.226 usec per MB, 65550 c-switches received 32768 MB (0 % mmap'ed) in 98.8638 s, 2.78037 Gbit cpu usage user:0.042 sys:5.243, 161.285 usec per MB, 65545 c-switches received 32768 MB (0 % mmap'ed) in 108.282 s, 2.53853 Gbit cpu usage user:0.059 sys:5.938, 183.014 usec per MB, 65538 c-switches received 32768 MB (0 % mmap'ed) in 108.314 s, 2.53778 Gbit cpu usage user:0.04 sys:6.096, 187.256 usec per MB, 65548 c-switches received 32768 MB (0 % mmap'ed) in 29.4351 s, 9.33845 Gbit cpu usage user:0.041 sys:6.03, 185.272 usec per MB, 65536 c-switches received 32768 MB (0 % mmap'ed) in 44.3993 s, 6.19104 Gbit cpu usage user:0.034 sys:5.115, 157.135 usec per MB, 65535 c-switches received 32768 MB (0 % mmap'ed) in 79.7203 s, 3.44803 Gbit cpu usage user:0.046 sys:5.214, 160.522 usec per MB, 65540 c-switches ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive 2018-04-25 5:27 ` [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet 2018-04-25 6:28 ` Christoph Hellwig @ 2018-04-25 13:42 ` Eric Dumazet 1 sibling, 0 replies; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 13:42 UTC (permalink / raw) To: Eric Dumazet, David S . Miller Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh, Eric Dumazet On 04/24/2018 10:27 PM, Eric Dumazet wrote: > When adding tcp mmap() implementation, I forgot that socket lock > had to be taken before current->mm->mmap_sem. syzbot eventually caught > the bug. > + ... > + down_read(¤t->mm->mmap_sem); > + > + ret = -EINVAL; > + vma = find_vma(current->mm, address); > + if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops) > goto out; > - } > + zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); > + > tp = tcp_sk(sk); > seq = tp->copied_seq; > - /* Abort if urgent data is in the area */ > - if (unlikely(tp->urg_data)) { > - u32 urg_offset = tp->urg_seq - seq; > + zc->length = min_t(u32, zc->length, tcp_inq(sk)); > > I might have to make sure zc->length is page aligned before calling zap_page_range() ? zc->length &= ~(PAGE_SIZE - 1); + zap_page_range(vma, address, zc->length); ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE 2018-04-25 5:27 [PATCH net-next 0/2] tcp: mmap: rework zerocopy receive Eric Dumazet 2018-04-25 5:27 ` [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet @ 2018-04-25 5:27 ` Eric Dumazet 1 sibling, 0 replies; 13+ messages in thread From: Eric Dumazet @ 2018-04-25 5:27 UTC (permalink / raw) To: David S . Miller Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet After prior kernel change, mmap() on TCP socket only reserves VMA. We have to use setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) to perform the transfert of pages from skbs in TCP receive queue into such VMA. struct tcp_zerocopy_receive { __u64 address; /* in: address of mapping */ __u32 length; /* in/out: number of bytes to map/mapped */ __u32 recv_skip_hint; /* out: amount of bytes to skip */ }; After a successful setsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains number of bytes that were mapped, and @recv_skip_hint contains number of bytes that should be read using conventional read()/recv()/recvmsg() system calls, to skip a sequence of bytes that can not be mapped, because not properly page aligned. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Soheil Hassas Yeganeh <soheil@google.com> --- tools/testing/selftests/net/tcp_mmap.c | 63 +++++++++++++++----------- 1 file changed, 36 insertions(+), 27 deletions(-) diff --git a/tools/testing/selftests/net/tcp_mmap.c b/tools/testing/selftests/net/tcp_mmap.c index dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..5b381cdbdd6319556ba4e3dad530fae8f13f5a9b 100644 --- a/tools/testing/selftests/net/tcp_mmap.c +++ b/tools/testing/selftests/net/tcp_mmap.c @@ -76,9 +76,10 @@ #include <time.h> #include <sys/time.h> #include <netinet/in.h> -#include <netinet/tcp.h> #include <arpa/inet.h> #include <poll.h> +#include <linux/tcp.h> +#include <assert.h> #ifndef MSG_ZEROCOPY #define MSG_ZEROCOPY 0x4000000 @@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length) void *child_thread(void *arg) { unsigned long total_mmap = 0, total = 0; + struct tcp_zerocopy_receive zc; unsigned long delta_usec; int flags = MAP_SHARED; struct timeval t0, t1; char *buffer = NULL; - void *oaddr = NULL; + void *addr = NULL; double throughput; struct rusage ru; int lu, fd; @@ -153,41 +155,45 @@ void *child_thread(void *arg) perror("malloc"); goto error; } + if (zflg) { + addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0); + if (addr == (void *)-1) + zflg = 0; + } while (1) { struct pollfd pfd = { .fd = fd, .events = POLLIN, }; int sub; poll(&pfd, 1, 10000); if (zflg) { - void *naddr; + int res; - naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 0); - if (naddr == (void *)-1) { - if (errno == EAGAIN) { - /* That is if SO_RCVLOWAT is buggy */ - usleep(1000); - continue; - } - if (errno == EINVAL) { - flags = MAP_SHARED; - oaddr = NULL; - goto fallback; - } - if (errno != EIO) - perror("mmap()"); + zc.address = (__u64)addr; + zc.length = chunk_size; + zc.recv_skip_hint = 0; + res = setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, + &zc, sizeof(zc)); + if (res == -1) break; + + if (zc.length) { + assert(zc.length <= chunk_size); + total_mmap += zc.length; + if (xflg) + hash_zone(addr, zc.length); + total += zc.length; } - total_mmap += chunk_size; - if (xflg) - hash_zone(naddr, chunk_size); - total += chunk_size; - if (!keepflag) { - flags |= MAP_FIXED; - oaddr = naddr; + if (zc.recv_skip_hint) { + assert(zc.recv_skip_hint <= chunk_size); + lu = read(fd, buffer, zc.recv_skip_hint); + if (lu > 0) { + if (xflg) + hash_zone(buffer, lu); + total += lu; + } } continue; } -fallback: sub = 0; while (sub < chunk_size) { lu = read(fd, buffer + sub, chunk_size - sub); @@ -228,6 +234,8 @@ void *child_thread(void *arg) error: free(buffer); close(fd); + if (zflg) + munmap(addr, chunk_size); pthread_exit(0); } @@ -371,7 +379,8 @@ int main(int argc, char *argv[]) setup_sockaddr(cfg_family, host, &listenaddr); if (mss && - setsockopt(fdlisten, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) { + setsockopt(fdlisten, IPPROTO_TCP, TCP_MAXSEG, + &mss, sizeof(mss)) == -1) { perror("setsockopt TCP_MAXSEG"); exit(1); } @@ -402,7 +411,7 @@ int main(int argc, char *argv[]) setup_sockaddr(cfg_family, host, &addr); if (mss && - setsockopt(fd, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) { + setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) { perror("setsockopt TCP_MAXSEG"); exit(1); } -- 2.17.0.484.g0c8726318c-goog ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2018-04-25 16:44 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-04-25 5:27 [PATCH net-next 0/2] tcp: mmap: rework zerocopy receive Eric Dumazet 2018-04-25 5:27 ` [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet 2018-04-25 6:28 ` Christoph Hellwig 2018-04-25 13:01 ` Eric Dumazet 2018-04-25 15:24 ` Christoph Hellwig 2018-04-25 16:04 ` Matthew Wilcox 2018-04-25 16:20 ` Eric Dumazet 2018-04-25 16:30 ` Matthew Wilcox 2018-04-25 16:22 ` Andy Lutomirski 2018-04-25 16:35 ` Eric Dumazet 2018-04-25 16:44 ` Eric Dumazet 2018-04-25 13:42 ` Eric Dumazet 2018-04-25 5:27 ` [PATCH net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE Eric Dumazet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).