Linux-Fsdevel Archive on lore.kernel.org
* remove the last set_fs() in common code, and remove it for x86 and powerpc v2
From: Christoph Hellwig @ 2020-08-27 15:00 UTC
To: Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Hi all,

this series removes the last set_fs() used to force a kernel address
space for the uaccess code in the kernel read/write/splice code, and then
stops implementing the address space overrides entirely for x86 and
powerpc.

The file system part has been posted a few times, and the read/write side
has been pretty much unchanged.  For splice this series drops the
conversion of the seq_file and sysctl code to the iter ops, and thus loses
the splice support for them.  The reason for that is that it caused a lot
of churn for not much use - splice for these small files really isn't much
of a win, even if existing userspace uses it.  All callers I found do the
proper fallback, but if this turns out to be an issue the conversion can
be resurrected.

Besides x86 and powerpc I plan to eventually convert all other
architectures, although this will be a slow process, starting with the
easier ones once the infrastructure is merged.  The process to convert
architectures is roughly:

 (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
 (2) implement __get_kernel_nofault and __put_kernel_nofault
 (3) remove the arch specific address limitation functionality

Changes since v1:
 - drop the patch to remove the non-iter ops for /dev/zero and
   /dev/null as they caused a performance regression
 - don't enable user access in __get_kernel on powerpc
 - xfail the set_fs() based lkdtm tests

Diffstat:

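Step (2) of the conversion process can be illustrated with a userspace toy
model.  This is only a sketch, not kernel code: the macro mirrors the
(dst, src, type, err_label) shape of __get_kernel_nofault, but toy_mem,
toy_addr_ok() and the toy_ prefixed names are hypothetical stand-ins, and a
simple range check replaces the real fault handling via exception tables.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "kernel memory": only this array counts as readable. */
static unsigned char toy_mem[64];

/* Stand-in for the hardware fault check; assumption, not a kernel API. */
static bool toy_addr_ok(const void *p, size_t len)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t base = (uintptr_t)toy_mem;

	return len <= sizeof(toy_mem) && a >= base &&
	       a - base <= sizeof(toy_mem) - len;
}

/* Read one object of 'type', or jump to err_label if the access "faults". */
#define toy_get_kernel_nofault(dst, src, type, err_label)	\
do {								\
	if (!toy_addr_ok((src), sizeof(type)))			\
		goto err_label;					\
	memcpy((dst), (src), sizeof(type));			\
} while (0)

/* Generic copy helper layered on the per-type primitive. */
static int toy_copy_from_kernel_nofault(void *dst, const void *src, size_t size)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Copy in 8-byte chunks first, then finish byte by byte. */
	while (size >= sizeof(uint64_t)) {
		toy_get_kernel_nofault(d, s, uint64_t, fault);
		d += sizeof(uint64_t);
		s += sizeof(uint64_t);
		size -= sizeof(uint64_t);
	}
	while (size) {
		toy_get_kernel_nofault(d, s, uint8_t, fault);
		d++;
		s++;
		size--;
	}
	return 0;
fault:
	return -EFAULT;
}
```

The point of the design is that the generic helper needs no address-limit
override at all: the architecture only supplies a primitive that reads one
object and bails out to a label on failure.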
* [PATCH 01/10] fs: don't allow kernel reads and writes without iter ops
From: Christoph Hellwig @ 2020-08-27 15:00 UTC
To: Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Don't allow calling ->read or ->write with set_fs as a preparation for
killing off set_fs.  All the instances that we use kernel_read/write on
are using the iter ops already.

If a file has both the regular ->read/->write methods and the iter
variants those could have different semantics for messed up enough
drivers.  Also fail the kernel access to them in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/read_write.c | 67 +++++++++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 5db58b8c78d0dd..702c4301d9eb6b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -419,27 +419,41 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	return ret;
 }

+static int warn_unsupported(struct file *file, const char *op)
+{
+	pr_warn_ratelimited(
+		"kernel %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
+		op, file, current->pid, current->comm);
+	return -EINVAL;
+}
+
 ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
 {
-	mm_segment_t old_fs = get_fs();
+	struct kvec iov = {
+		.iov_base = buf,
+		.iov_len = min_t(size_t, count, MAX_RW_COUNT),
+	};
+	struct kiocb kiocb;
+	struct iov_iter iter;
 	ssize_t ret;

 	if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
 		return -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	/*
+	 * Also fail if ->read_iter and ->read are both wired up as that
+	 * implies very convoluted semantics.
+	 */
+	if (unlikely(!file->f_op->read_iter || file->f_op->read))
+		return warn_unsupported(file, "read");

-	if (count > MAX_RW_COUNT)
-		count = MAX_RW_COUNT;
-	set_fs(KERNEL_DS);
-	if (file->f_op->read)
-		ret = file->f_op->read(file, (void __user *)buf, count, pos);
-	else if (file->f_op->read_iter)
-		ret = new_sync_read(file, (void __user *)buf, count, pos);
-	else
-		ret = -EINVAL;
-	set_fs(old_fs);
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *pos;
+	iov_iter_kvec(&iter, READ, &iov, 1, iov.iov_len);
+	ret = file->f_op->read_iter(&kiocb, &iter);
 	if (ret > 0) {
+		*pos = kiocb.ki_pos;
 		fsnotify_access(file);
 		add_rchar(current, ret);
 	}
@@ -510,28 +524,31 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 /* caller is responsible for file_start_write/file_end_write */
 ssize_t __kernel_write(struct file *file, const void *buf, size_t count, loff_t *pos)
 {
-	mm_segment_t old_fs;
-	const char __user *p;
+	struct kvec iov = {
+		.iov_base = (void *)buf,
+		.iov_len = min_t(size_t, count, MAX_RW_COUNT),
+	};
+	struct kiocb kiocb;
+	struct iov_iter iter;
 	ssize_t ret;

 	if (WARN_ON_ONCE(!(file->f_mode & FMODE_WRITE)))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	/*
+	 * Also fail if ->write_iter and ->write are both wired up as that
+	 * implies very convoluted semantics.
+	 */
+	if (unlikely(!file->f_op->write_iter || file->f_op->write))
+		return warn_unsupported(file, "write");

-	old_fs = get_fs();
-	set_fs(KERNEL_DS);
-	p = (__force const char __user *)buf;
-	if (count > MAX_RW_COUNT)
-		count = MAX_RW_COUNT;
-	if (file->f_op->write)
-		ret = file->f_op->write(file, p, count, pos);
-	else if (file->f_op->write_iter)
-		ret = new_sync_write(file, p, count, pos);
-	else
-		ret = -EINVAL;
-	set_fs(old_fs);
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *pos;
+	iov_iter_kvec(&iter, WRITE, &iov, 1, iov.iov_len);
+	ret = file->f_op->write_iter(&kiocb, &iter);
 	if (ret > 0) {
+		*pos = kiocb.ki_pos;
 		fsnotify_modify(file);
 		add_wchar(current, ret);
 	}
-- 
2.28.0

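The new gate in __kernel_read() and __kernel_write() can be modelled in plain
userspace C.  The sketch below is an illustration under stated assumptions:
struct toy_fops and struct toy_file are hypothetical, heavily reduced
stand-ins for the kernel's file_operations and file, and the iter method takes
a bare kernel buffer instead of a kiocb and iov_iter.  Only the decision logic
is meant to match the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Hypothetical, reduced method table (not the real file_operations). */
struct toy_fops {
	ssize_t (*read)(struct toy_file *f, char *ubuf, size_t len);
	ssize_t (*read_iter)(struct toy_file *f, void *kbuf, size_t len);
};

struct toy_file {
	const struct toy_fops *f_op;
	const char *data;	/* backing contents, NUL terminated */
};

/* Iter-style method: copies into a kernel buffer. */
static ssize_t toy_read_iter(struct toy_file *f, void *kbuf, size_t len)
{
	size_t n = strlen(f->data);

	if (n > len)
		n = len;
	memcpy(kbuf, f->data, n);
	return (ssize_t)n;
}

/* Legacy-style method: would expect a user pointer in the kernel. */
static ssize_t toy_read(struct toy_file *f, char *ubuf, size_t len)
{
	return toy_read_iter(f, ubuf, len);
}

/*
 * Same gate as the patch: kernel reads need ->read_iter, and are refused
 * when a legacy ->read is wired up as well.
 */
static ssize_t toy_kernel_read(struct toy_file *f, void *buf, size_t len)
{
	if (!f->f_op->read_iter || f->f_op->read)
		return -EINVAL;
	return f->f_op->read_iter(f, buf, len);
}
```

A file that provides only the iter method reads normally, while a file with
only the legacy method, or with both, gets -EINVAL from the kernel-internal
path.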
* RE: [PATCH 01/10] fs: don't allow kernel reads and writes without iter ops
From: David Laight @ 2020-08-27 15:58 UTC
To: 'Christoph Hellwig', Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel

From: Christoph Hellwig
> Sent: 27 August 2020 16:00
>
> Don't allow calling ->read or ->write with set_fs as a preparation for
> killing off set_fs.  All the instances that we use kernel_read/write on
> are using the iter ops already.
>
> If a file has both the regular ->read/->write methods and the iter
> variants those could have different semantics for messed up enough
> drivers.  Also fail the kernel access to them in that case.

Is there a real justification for that?
For system calls supplying both methods makes sense to avoid
the extra code paths for a simple read/write.

Any one stupid enough to make them behave differently gets
what they deserve.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

* Re: [PATCH 01/10] fs: don't allow kernel reads and writes without iter ops
From: 'Christoph Hellwig' @ 2020-08-29 9:23 UTC
To: David Laight
Cc: 'Christoph Hellwig', Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel

On Thu, Aug 27, 2020 at 03:58:02PM +0000, David Laight wrote:
> Is there a real justification for that?
> For system calls supplying both methods makes sense to avoid
> the extra code paths for a simple read/write.

Al asked for it as two of our four in-tree instances do have weird
semantics, and we can't change that any more.  And the other two don't
make sense to be used with kernel_read and kernel_write (/dev/null and
/dev/zero).

[parent not found: <20200901064849.GI4299@shao2-debian>]
* Re: [fs] ef30fb3c60: kernel write not supported for file /sys/kernel/softlockup_panic
From: Christoph Hellwig @ 2020-09-01 7:08 UTC
To: kernel test robot
Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel, 0day robot, lkp, rui.zhang, yu.c.chen

Looks like since the start of this series we've grown new code to use
kernel_write on sysctl files based on boot parameters.  The good news
is that this just means I need to resurrect the sysctl series as all
that work was done already.

* [PATCH 02/10] fs: don't allow splice read/write without explicit ops
From: Christoph Hellwig @ 2020-08-27 15:00 UTC
To: Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

default_file_splice_read is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers.  It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method.  The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files.  If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/read_write.c    |   2 +-
 fs/splice.c        | 130 +++++----------------------------------------
 include/linux/fs.h |   2 -
 3 files changed, 15 insertions(+), 119 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 702c4301d9eb6b..8c61f67453e3d3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1077,7 +1077,7 @@ ssize_t vfs_iter_write(struct file *file, struct iov_iter *iter, loff_t *ppos,
 }
 EXPORT_SYMBOL(vfs_iter_write);

-ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
+static ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		  unsigned long vlen, loff_t *pos, rwf_t flags)
 {
 	struct iovec iovstack[UIO_FASTIOV];
diff --git a/fs/splice.c b/fs/splice.c
index d7c8a7c4db07ff..412df7b48f9eb7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -342,89 +342,6 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
 };
 EXPORT_SYMBOL(nosteal_pipe_buf_ops);

-static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
-			    unsigned long vlen, loff_t offset)
-{
-	mm_segment_t old_fs;
-	loff_t pos = offset;
-	ssize_t res;
-
-	old_fs = get_fs();
-	set_fs(KERNEL_DS);
-	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
-	set_fs(old_fs);
-
-	return res;
-}
-
-static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
-				 struct pipe_inode_info *pipe, size_t len,
-				 unsigned int flags)
-{
-	struct kvec *vec, __vec[PIPE_DEF_BUFFERS];
-	struct iov_iter to;
-	struct page **pages;
-	unsigned int nr_pages;
-	unsigned int mask;
-	size_t offset, base, copied = 0;
-	ssize_t res;
-	int i;
-
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return -EAGAIN;
-
-	/*
-	 * Try to keep page boundaries matching to source pagecache ones -
-	 * it probably won't be much help, but...
-	 */
-	offset = *ppos & ~PAGE_MASK;
-
-	iov_iter_pipe(&to, READ, pipe, len + offset);
-
-	res = iov_iter_get_pages_alloc(&to, &pages, len + offset, &base);
-	if (res <= 0)
-		return -ENOMEM;
-
-	nr_pages = DIV_ROUND_UP(res + base, PAGE_SIZE);
-
-	vec = __vec;
-	if (nr_pages > PIPE_DEF_BUFFERS) {
-		vec = kmalloc_array(nr_pages, sizeof(struct kvec), GFP_KERNEL);
-		if (unlikely(!vec)) {
-			res = -ENOMEM;
-			goto out;
-		}
-	}
-
-	mask = pipe->ring_size - 1;
-	pipe->bufs[to.head & mask].offset = offset;
-	pipe->bufs[to.head & mask].len -= offset;
-
-	for (i = 0; i < nr_pages; i++) {
-		size_t this_len = min_t(size_t, len, PAGE_SIZE - offset);
-		vec[i].iov_base = page_address(pages[i]) + offset;
-		vec[i].iov_len = this_len;
-		len -= this_len;
-		offset = 0;
-	}
-
-	res = kernel_readv(in, vec, nr_pages, *ppos);
-	if (res > 0) {
-		copied = res;
-		*ppos += res;
-	}
-
-	if (vec != __vec)
-		kfree(vec);
-out:
-	for (i = 0; i < nr_pages; i++)
-		put_page(pages[i]);
-	kvfree(pages);
-	iov_iter_advance(&to, copied);	/* truncates and discards */
-	return res;
-}
-
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
  * using sendpage(). Return the number of bytes sent.
@@ -788,33 +705,6 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,

 EXPORT_SYMBOL(iter_file_splice_write);

-static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
-			  struct splice_desc *sd)
-{
-	int ret;
-	void *data;
-	loff_t tmp = sd->pos;
-
-	data = kmap(buf->page);
-	ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp);
-	kunmap(buf->page);
-
-	return ret;
-}
-
-static ssize_t default_file_splice_write(struct pipe_inode_info *pipe,
-					 struct file *out, loff_t *ppos,
-					 size_t len, unsigned int flags)
-{
-	ssize_t ret;
-
-	ret = splice_from_pipe(pipe, out, ppos, len, flags, write_pipe_buf);
-	if (ret > 0)
-		*ppos += ret;
-
-	return ret;
-}
-
 /**
  * generic_splice_sendpage - splice data from a pipe to a socket
  * @pipe:	pipe to splice from
@@ -836,15 +726,23 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,

 EXPORT_SYMBOL(generic_splice_sendpage);

+static int warn_unsupported(struct file *file, const char *op)
+{
+	pr_debug_ratelimited(
+		"splice %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
+		op, file, current->pid, current->comm);
+	return -EINVAL;
+}
+
 /*
  * Attempt to initiate a splice from pipe to file.
  */
 static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 			   loff_t *ppos, size_t len, unsigned int flags)
 {
-	if (out->f_op->splice_write)
-		return out->f_op->splice_write(pipe, out, ppos, len, flags);
-	return default_file_splice_write(pipe, out, ppos, len, flags);
+	if (unlikely(!out->f_op->splice_write))
+		return warn_unsupported(out, "write");
+	return out->f_op->splice_write(pipe, out, ppos, len, flags);
 }

 /*
@@ -866,9 +764,9 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 	if (unlikely(len > MAX_RW_COUNT))
 		len = MAX_RW_COUNT;

-	if (in->f_op->splice_read)
-		return in->f_op->splice_read(in, ppos, pipe, len, flags);
-	return default_file_splice_read(in, ppos, pipe, len, flags);
+	if (unlikely(!in->f_op->splice_read))
+		return warn_unsupported(in, "read");
+	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }

 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e019ea2f1347e6..d33cc3e8ed410b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1894,8 +1894,6 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,

 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
-extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
-- 
2.28.0

* [PATCH 03/10] uaccess: add infrastructure for kernel builds with set_fs()
From: Christoph Hellwig @ 2020-08-27 15:00 UTC
To: Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Add a CONFIG_SET_FS option that is selected by architectures that
implement set_fs, which is all of them initially.  If the option is not
set, stubs for routines related to overriding the address space are
provided so that architectures can start to opt out of providing set_fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig            |  3 +++
 arch/alpha/Kconfig      |  1 +
 arch/arc/Kconfig        |  1 +
 arch/arm/Kconfig        |  1 +
 arch/arm64/Kconfig      |  1 +
 arch/c6x/Kconfig        |  1 +
 arch/csky/Kconfig       |  1 +
 arch/h8300/Kconfig      |  1 +
 arch/hexagon/Kconfig    |  1 +
 arch/ia64/Kconfig       |  1 +
 arch/m68k/Kconfig       |  1 +
 arch/microblaze/Kconfig |  1 +
 arch/mips/Kconfig       |  1 +
 arch/nds32/Kconfig      |  1 +
 arch/nios2/Kconfig      |  1 +
 arch/openrisc/Kconfig   |  1 +
 arch/parisc/Kconfig     |  1 +
 arch/powerpc/Kconfig    |  1 +
 arch/riscv/Kconfig      |  1 +
 arch/s390/Kconfig       |  1 +
 arch/sh/Kconfig         |  1 +
 arch/sparc/Kconfig      |  1 +
 arch/um/Kconfig         |  1 +
 arch/x86/Kconfig        |  1 +
 arch/xtensa/Kconfig     |  1 +
 include/linux/uaccess.h | 18 ++++++++++++++++++
 26 files changed, 45 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493fc..3fab619a6aa51a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -24,6 +24,9 @@ config KEXEC_ELF
 config HAVE_IMA_KEXEC
 	bool

+config SET_FS
+	bool
+
 config HOTPLUG_SMT
 	bool

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 9c5f06e8eb9bc0..d6e9fc7a7b19e2 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -39,6 +39,7 @@ config ALPHA
 	select OLD_SIGSUSPEND
 	select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
 	select MMU_GATHER_NO_RANGE
+	select SET_FS
 	help
 	  The Alpha is a 64-bit general-purpose processor designed and
 	  marketed by the Digital Equipment Corporation of blessed memory,
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index ba00c4e1e1c271..c49f5754a11e40 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -48,6 +48,7 @@ config ARC
 	select PCI_SYSCALL if PCI
 	select PERF_USE_VMALLOC if ARC_CACHE_VIPT_ALIASING
 	select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32
+	select SET_FS

 config ARCH_HAS_CACHE_LINE_SIZE
 	def_bool y
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b1665876..87e1478a42dc4f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -118,6 +118,7 @@ config ARM
 	select PCI_SYSCALL if PCI
 	select PERF_USE_VMALLOC
 	select RTC_LIB
+	select SET_FS
 	select SYS_SUPPORTS_APM_EMULATION
 	# Above selects are sorted alphabetically; please add new ones
 	# according to that.  Thanks.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbeee8..fbd9e35bef096f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -192,6 +192,7 @@ config ARM64
 	select PCI_SYSCALL if PCI
 	select POWER_RESET
 	select POWER_SUPPLY
+	select SET_FS
 	select SPARSE_IRQ
 	select SWIOTLB
 	select SYSCTL_EXCEPTION_TRACE
diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig
index 6444ebfd06a665..48d66bf0465d68 100644
--- a/arch/c6x/Kconfig
+++ b/arch/c6x/Kconfig
@@ -22,6 +22,7 @@ config C6X
 	select GENERIC_CLOCKEVENTS
 	select MODULES_USE_ELF_RELA
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS

 config MMU
 	def_bool n
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f568543..2836f6e76fdb2d 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -78,6 +78,7 @@ config CSKY
 	select PCI_DOMAINS_GENERIC if PCI
 	select PCI_SYSCALL if PCI
 	select PCI_MSI if PCI
+	select SET_FS

 config LOCKDEP_SUPPORT
 	def_bool y
diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig
index d11666d538fea8..7945de067e9fcc 100644
--- a/arch/h8300/Kconfig
+++ b/arch/h8300/Kconfig
@@ -25,6 +25,7 @@ config H8300
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_HASH
 	select CPU_NO_EFFICIENT_FFS
+	select SET_FS
 	select UACCESS_MEMCPY

 config CPU_BIG_ENDIAN
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 667cfc511cf999..f2afabbadd430e 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -31,6 +31,7 @@ config HEXAGON
 	select GENERIC_CLOCKEVENTS_BROADCAST
 	select MODULES_USE_ELF_RELA
 	select GENERIC_CPU_DEVICES
+	select SET_FS
 	help
 	  Qualcomm Hexagon is a processor architecture designed for high
 	  performance and low power across a wide variety of applications.
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 5b4ec80bf5863a..22a6853840e235 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -56,6 +56,7 @@ config IA64
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
 	select NUMA if !FLATMEM
+	select SET_FS
 	default y
 	help
 	  The Itanium Processor Family is Intel's 64-bit successor to
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 6f2f38d05772ab..dcf4ae8c9b215f 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -32,6 +32,7 @@ config M68K
 	select OLD_SIGSUSPEND3
 	select OLD_SIGACTION
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS

 config CPU_BIG_ENDIAN
 	def_bool y
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714bd..7e3d4583abf3e6 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -46,6 +46,7 @@ config MICROBLAZE
 	select CPU_NO_EFFICIENT_FFS
 	select MMU_GATHER_NO_RANGE if MMU
 	select SPARSE_IRQ
+	select SET_FS

 # Endianness selection
 choice
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484cf0..fbc26391b588f8 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -87,6 +87,7 @@ config MIPS
 	select MODULES_USE_ELF_RELA if MODULES && 64BIT
 	select PERF_USE_VMALLOC
 	select RTC_LIB
+	select SET_FS
 	select SYSCTL_EXCEPTION_TRACE
 	select VIRT_TO_BUS

diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig
index e30298e99e1bdf..e8e541fd2267d0 100644
--- a/arch/nds32/Kconfig
+++ b/arch/nds32/Kconfig
@@ -48,6 +48,7 @@ config NDS32
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_DYNAMIC_FTRACE
+	select SET_FS
 	help
 	  Andes(nds32) Linux support.

diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index c6645141bb2a88..c7c6ba6bec9dfc 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -27,6 +27,7 @@ config NIOS2
 	select USB_ARCH_HAS_HCD if USB_SUPPORT
 	select CPU_NO_EFFICIENT_FFS
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS

 config GENERIC_CSUM
 	def_bool y
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 7e94fe37cb2fdf..6233c62931803f 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -39,6 +39,7 @@ config OPENRISC
 	select ARCH_WANT_FRAME_POINTERS
 	select GENERIC_IRQ_MULTI_HANDLER
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS

 config CPU_BIG_ENDIAN
 	def_bool y
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc9b..be70af482b5a9a 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -63,6 +63,7 @@ config PARISC
 	select HAVE_FTRACE_MCOUNT_RECORD if HAVE_DYNAMIC_FTRACE
 	select HAVE_KPROBES_ON_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
+	select SET_FS

 	help
 	  The PA-RISC microprocessor is designed by Hewlett-Packard and used
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce99d..3f09d6fdf89405 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -249,6 +249,7 @@ config PPC
 	select PCI_SYSCALL if PCI
 	select PPC_DAWR if PPC64
 	select RTC_LIB
+	select SET_FS
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8d2..ea0c1ad456d838 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -82,6 +82,7 @@ config RISCV
 	select PCI_MSI if PCI
 	select RISCV_INTC
 	select RISCV_TIMER if RISCV_SBI
+	select SET_FS
 	select SPARSEMEM_STATIC if 32BIT
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c21..fd81385a7787cb 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -185,6 +185,7 @@ config S390
 	select OLD_SIGSUSPEND3
 	select PCI_DOMAINS if PCI
 	select PCI_MSI if PCI
+	select SET_FS
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce05..2bd1653f3b3fea 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -71,6 +71,7 @@ config SUPERH
 	select PERF_EVENTS
 	select PERF_USE_VMALLOC
 	select RTC_LIB
+	select SET_FS
 	select SPARSE_IRQ
 	help
 	  The SuperH is a RISC processor targeted for use in embedded systems
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a544..3e0cf0319a278a 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -49,6 +49,7 @@ config SPARC
 	select LOCKDEP_SMALL if LOCKDEP
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
+	select SET_FS

 config SPARC32
 	def_bool !64BIT
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec759484a..3aefcd81566809 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -19,6 +19,7 @@ config UML
 	select GENERIC_CPU_DEVICES
 	select GENERIC_CLOCKEVENTS
 	select HAVE_GCC_PLUGINS
+	select SET_FS
 	select TTY # Needed for line.c

 config MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb209d..f85c13355732fe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,6 +237,7 @@ config X86
 	select HAVE_ARCH_KCSAN			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
 	select PROC_PID_ARCH_STATUS		if PROC_FS
+	select SET_FS
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI

 config INSTRUCTION_DECODER
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c0251..94bad4d66b4bde 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -41,6 +41,7 @@ config XTENSA
 	select IRQ_DOMAIN
 	select MODULES_USE_ELF_RELA
 	select PERF_USE_VMALLOC
+	select SET_FS
 	select VIRT_TO_BUS
 	help
 	  Xtensa processors are 32-bit RISC machines designed by Tensilica
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 94b28541165929..70073c802b48ed 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -8,6 +8,7 @@

 #include <asm/uaccess.h>

+#ifdef CONFIG_SET_FS
 /*
  * Force the uaccess routines to be wired up for actual userspace access,
  * overriding any possible set_fs(KERNEL_DS) still lingering around.  Undone
@@ -25,6 +26,23 @@ static inline void force_uaccess_end(mm_segment_t oldfs)
 {
 	set_fs(oldfs);
 }
+#else /* CONFIG_SET_FS */
+typedef struct {
+	/* empty dummy */
+} mm_segment_t;
+
+#define uaccess_kernel()		(false)
+#define user_addr_max()			(TASK_SIZE_MAX)
+
+static inline mm_segment_t force_uaccess_begin(void)
+{
+	return (mm_segment_t) { };
+}
+
+static inline void force_uaccess_end(mm_segment_t oldfs)
+{
+}
+#endif /* CONFIG_SET_FS */

 /*
  * Architectures should provide two primitives (raw_copy_{to,from}_user())
-- 
2.28.0

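The effect of the stubs can be shown with a small userspace model.  This is a
sketch under assumptions, not the kernel header: TOY_SET_FS plays the role of
CONFIG_SET_FS, all toy_ prefixed names are hypothetical, and the dummy type
carries one unused member because an empty struct (as used in the patch) is a
GNU extension rather than ISO C.

```c
#include <stdbool.h>

#ifdef TOY_SET_FS
/* Model of an architecture that still has an address limit to override. */
typedef struct { unsigned long seg; } toy_mm_segment_t;

static toy_mm_segment_t toy_addr_limit;	/* 0 = user, 1 = kernel */

#define toy_uaccess_kernel()	(toy_addr_limit.seg == 1)

static inline toy_mm_segment_t toy_force_uaccess_begin(void)
{
	toy_mm_segment_t old = toy_addr_limit;

	toy_addr_limit.seg = 0;		/* force user access */
	return old;
}

static inline void toy_force_uaccess_end(toy_mm_segment_t old)
{
	toy_addr_limit = old;
}
#else /* !TOY_SET_FS */
/* Stub variant: nothing to save or restore, so the token is a dummy. */
typedef struct { char unused; } toy_mm_segment_t;

#define toy_uaccess_kernel()	(false)

static inline toy_mm_segment_t toy_force_uaccess_begin(void)
{
	return (toy_mm_segment_t){ 0 };
}

static inline void toy_force_uaccess_end(toy_mm_segment_t old)
{
	(void)old;
}
#endif /* TOY_SET_FS */

/* The caller is written once and compiles in both configurations. */
static bool toy_user_access_section(void)
{
	toy_mm_segment_t old = toy_force_uaccess_begin();
	bool kernel_ds = toy_uaccess_kernel();

	toy_force_uaccess_end(old);
	return kernel_ds;
}
```

Keeping the begin/end pair and the token type in the stub variant means
existing callers need no #ifdef of their own.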
* [PATCH 04/10] test_bitmap: skip user bitmap tests for !CONFIG_SET_FS
From: Christoph Hellwig @ 2020-08-27 15:00 UTC
To: Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

We can't run the tests for userspace bitmap parsing if set_fs() doesn't
exist.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 lib/test_bitmap.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index df903c53952bb9..49b1d25fbaf546 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -365,6 +365,7 @@ static void __init __test_bitmap_parselist(int is_user)
 	for (i = 0; i < ARRAY_SIZE(parselist_tests); i++) {
 #define ptest parselist_tests[i]

+#ifdef CONFIG_SET_FS
 		if (is_user) {
 			mm_segment_t orig_fs = get_fs();
 			size_t len = strlen(ptest.in);
@@ -375,7 +376,9 @@ static void __init __test_bitmap_parselist(int is_user)
 						    bmap, ptest.nbits);
 			time = ktime_get() - time;
 			set_fs(orig_fs);
-		} else {
+		} else
+#endif /* CONFIG_SET_FS */
+		{
 			time = ktime_get();
 			err = bitmap_parselist(ptest.in, bmap, ptest.nbits);
 			time = ktime_get() - time;
@@ -454,6 +457,7 @@ static void __init __test_bitmap_parse(int is_user)
 	for (i = 0; i < ARRAY_SIZE(parse_tests); i++) {
 		struct test_bitmap_parselist test = parse_tests[i];

+#ifdef CONFIG_SET_FS
 		if (is_user) {
 			size_t len = strlen(test.in);
 			mm_segment_t orig_fs = get_fs();
@@ -464,7 +468,9 @@ static void __init __test_bitmap_parse(int is_user)
 						bmap, test.nbits);
 			time = ktime_get() - time;
 			set_fs(orig_fs);
-		} else {
+		} else
+#endif /* CONFIG_SET_FS */
+		{
 			size_t len = test.flags & NO_LEN ?
 				UINT_MAX : strlen(test.in);
 			time = ktime_get();
-- 
2.28.0

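The "} else / #endif / {" construct used in this patch is easy to misread, so
here is a minimal standalone sketch of the idiom.  The function name and the
TOY_SET_FS macro are hypothetical; TOY_SET_FS stands in for CONFIG_SET_FS.
When the macro is undefined, the whole if/else head is preprocessed away and
the trailing block runs unconditionally.

```c
#include <stdbool.h>

/* Build with -DTOY_SET_FS to compile the "user" branch in. */
static const char *toy_pick_path(bool is_user)
{
	(void)is_user;	/* unused when TOY_SET_FS is not defined */

#ifdef TOY_SET_FS
	if (is_user) {
		return "user";
	} else
#endif /* TOY_SET_FS */
	{
		/* Plain block without TOY_SET_FS, else-branch with it. */
		return "kernel";
	}
}
```

Without TOY_SET_FS both toy_pick_path(true) and toy_pick_path(false) take the
"kernel" path, which mirrors how the user-pointer bitmap tests silently fall
back to the kernel-pointer variant when set_fs() is unavailable.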
* [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS
From: Christoph Hellwig @ 2020-08-27 15:00 UTC
To: Linus Torvalds, Al Viro, Michael Ellerman, x86
Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Once we can't manipulate the address limit, we also can't test what
happens when the manipulation is abused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/misc/lkdtm/bugs.c     | 4 ++++
 drivers/misc/lkdtm/usercopy.c | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
index 4dfbfd51bdf774..0d5b93694a0183 100644
--- a/drivers/misc/lkdtm/bugs.c
+++ b/drivers/misc/lkdtm/bugs.c
@@ -315,11 +315,15 @@ void lkdtm_CORRUPT_LIST_DEL(void)
 /* Test if unbalanced set_fs(KERNEL_DS)/set_fs(USER_DS) check exists. */
 void lkdtm_CORRUPT_USER_DS(void)
 {
+#ifdef CONFIG_SET_FS
 	pr_info("setting bad task size limit\n");
 	set_fs(KERNEL_DS);

 	/* Make sure we do not keep running with a KERNEL_DS! */
 	force_sig(SIGKILL);
+#else
+	pr_err("XFAIL: this requires set_fs()\n");
+#endif
 }

 /* Test that VMAP_STACK is actually allocating with a leading guard page */
diff --git a/drivers/misc/lkdtm/usercopy.c b/drivers/misc/lkdtm/usercopy.c
index b833367a45d053..04d10063835241 100644
--- a/drivers/misc/lkdtm/usercopy.c
+++ b/drivers/misc/lkdtm/usercopy.c
@@ -327,6 +327,7 @@ void lkdtm_USERCOPY_KERNEL(void)

 void lkdtm_USERCOPY_KERNEL_DS(void)
 {
+#ifdef CONFIG_SET_FS
 	char __user *user_ptr =
 		(char __user *)(0xFUL << (sizeof(unsigned long) * 8 - 4));
 	mm_segment_t old_fs = get_fs();
@@ -338,6 +339,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void)
 	if (copy_to_user(user_ptr, buf, sizeof(buf)) == 0)
 		pr_err("copy_to_user() to noncanonical address succeeded!?\n");
 	set_fs(old_fs);
+#else
+	pr_err("XFAIL: this requires set_fs()\n");
+#endif
 }

 void __init lkdtm_usercopy_init(void)
-- 
2.28.0

* Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS 2020-08-27 15:00 ` [PATCH 05/10] lkdtm: disable set_fs-based " Christoph Hellwig @ 2020-08-27 18:06 ` Linus Torvalds 2020-08-29 9:24 ` Christoph Hellwig 0 siblings, 1 reply; 45+ messages in thread From: Linus Torvalds @ 2020-08-27 18:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, Michael Ellerman, the arch/x86 maintainers, Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > Once we can't manipulate the address limit, we also can't test what > happens when the manipulation is abused. Just remove these tests entirely. Once set_fs() doesn't exist on x86, the tests no longer make any sense what-so-ever, because test coverage will be basically zero. So don't make the code uglier just to maintain a fiction that something is tested when it isn't really. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS 2020-08-27 18:06 ` Linus Torvalds @ 2020-08-29 9:24 ` Christoph Hellwig 2020-09-01 18:52 ` Kees Cook 2020-09-01 18:57 ` Kees Cook 0 siblings, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-08-29 9:24 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Al Viro, Michael Ellerman, the arch/x86 maintainers, Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote: > On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > Once we can't manipulate the address limit, we also can't test what > > happens when the manipulation is abused. > > Just remove these tests entirely. > > Once set_fs() doesn't exist on x86, the tests no longer make any sense > what-so-ever, because test coverage will be basically zero. > > So don't make the code uglier just to maintain a fiction that > something is tested when it isn't really. Sure fine with me unless Kees screams. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS 2020-08-29 9:24 ` Christoph Hellwig @ 2020-09-01 18:52 ` Kees Cook 2020-09-01 18:57 ` Kees Cook 1 sibling, 0 replies; 45+ messages in thread From: Kees Cook @ 2020-09-01 18:52 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Al Viro, Michael Ellerman, the arch/x86 maintainers, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Sat, Aug 29, 2020 at 11:24:06AM +0200, Christoph Hellwig wrote: > On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote: > > On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > Once we can't manipulate the address limit, we also can't test what > > > happens when the manipulation is abused. > > > > Just remove these tests entirely. > > > > Once set_fs() doesn't exist on x86, the tests no longer make any sense > > what-so-ever, because test coverage will be basically zero. > > > > So don't make the code uglier just to maintain a fiction that > > something is tested when it isn't really. > > Sure fine with me unless Kees screams. If we don't have set_fs, we don't need the tests. :) -- Kees Cook ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS 2020-08-29 9:24 ` Christoph Hellwig 2020-09-01 18:52 ` Kees Cook @ 2020-09-01 18:57 ` Kees Cook 2020-09-02 8:09 ` Christoph Hellwig 1 sibling, 1 reply; 45+ messages in thread From: Kees Cook @ 2020-09-01 18:57 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Al Viro, Michael Ellerman, the arch/x86 maintainers, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Sat, Aug 29, 2020 at 11:24:06AM +0200, Christoph Hellwig wrote: > On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote: > > On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > Once we can't manipulate the address limit, we also can't test what > > > happens when the manipulation is abused. > > > > Just remove these tests entirely. > > > > Once set_fs() doesn't exist on x86, the tests no longer make any sense > > what-so-ever, because test coverage will be basically zero. > > > > So don't make the code uglier just to maintain a fiction that > > something is tested when it isn't really. > > Sure fine with me unless Kees screams. To clarify: if any of x86, arm64, arm, powerpc, riscv, and s390 are using set_fs(), I want to keep this test. "ugly" is fine in lkdtm. :) -- Kees Cook ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS 2020-09-01 18:57 ` Kees Cook @ 2020-09-02 8:09 ` Christoph Hellwig 0 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-09-02 8:09 UTC (permalink / raw) To: Kees Cook Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, the arch/x86 maintainers, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Tue, Sep 01, 2020 at 11:57:37AM -0700, Kees Cook wrote: > On Sat, Aug 29, 2020 at 11:24:06AM +0200, Christoph Hellwig wrote: > > On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote: > > > On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > Once we can't manipulate the address limit, we also can't test what > > > > happens when the manipulation is abused. > > > > > > Just remove these tests entirely. > > > > > > Once set_fs() doesn't exist on x86, the tests no longer make any sense > > > what-so-ever, because test coverage will be basically zero. > > > > > > So don't make the code uglier just to maintain a fiction that > > > something is tested when it isn't really. > > > > Sure fine with me unless Kees screams. > > To clarify: if any of x86, arm64, arm, powerpc, riscv, and s390 are > using set_fs(), I want to keep this test. "ugly" is fine in lkdtm. :) And Linus wants them gone entirely, so I'll need a stage fight between the two of you. At least for this merge window I'm only planning on x86 and power, plus maybe riscv if I get the work done in time. Although help from the maintainers would be welcome. s390 has a driver that still uses set_fs that will need some surgery, although it shouldn't be too bad, but arm will be a piece of work. Unless I get help it will take a while. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 06/10] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h 2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig ` (4 preceding siblings ...) 2020-08-27 15:00 ` [PATCH 05/10] lkdtm: disable set_fs-based " Christoph Hellwig @ 2020-08-27 15:00 ` Christoph Hellwig 2020-08-27 15:00 ` [PATCH 07/10] x86: make TASK_SIZE_MAX usable from assembly code Christoph Hellwig ` (7 subsequent siblings) 13 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-08-27 15:00 UTC (permalink / raw) To: Linus Torvalds, Al Viro, Michael Ellerman, x86 Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev At least for 64-bit this moves them closer to some of the defines they are based on, and it prepares for using the TASK_SIZE_MAX definition from assembly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> --- arch/x86/include/asm/page_32_types.h | 11 +++++++ arch/x86/include/asm/page_64_types.h | 38 +++++++++++++++++++++ arch/x86/include/asm/processor.h | 49 ---------------------------- 3 files changed, 49 insertions(+), 49 deletions(-) diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h index 565ad755c785e2..26236925fb2c36 100644 --- a/arch/x86/include/asm/page_32_types.h +++ b/arch/x86/include/asm/page_32_types.h @@ -41,6 +41,17 @@ #define __VIRTUAL_MASK_SHIFT 32 #endif /* CONFIG_X86_PAE */ +/* + * User space process size: 3GB (default). 
+ */ +#define IA32_PAGE_OFFSET PAGE_OFFSET +#define TASK_SIZE PAGE_OFFSET +#define TASK_SIZE_LOW TASK_SIZE +#define TASK_SIZE_MAX TASK_SIZE +#define DEFAULT_MAP_WINDOW TASK_SIZE +#define STACK_TOP TASK_SIZE +#define STACK_TOP_MAX STACK_TOP + /* * Kernel image size is limited to 512 MB (see in arch/x86/kernel/head_32.S) */ diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 288b065955b729..996595c9897e0a 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -58,6 +58,44 @@ #define __VIRTUAL_MASK_SHIFT 47 #endif +/* + * User space process size. This is the first address outside the user range. + * There are a few constraints that determine this: + * + * On Intel CPUs, if a SYSCALL instruction is at the highest canonical + * address, then that syscall will enter the kernel with a + * non-canonical return address, and SYSRET will explode dangerously. + * We avoid this particular problem by preventing anything executable + * from being mapped at the maximum canonical address. + * + * On AMD CPUs in the Ryzen family, there's a nasty bug in which the + * CPUs malfunction if they execute code from the highest canonical page. + * They'll speculate right off the end of the canonical space, and + * bad things happen. This is worked around in the same way as the + * Intel problem. + * + * With page table isolation enabled, we map the LDT in ... [stay tuned] + */ +#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE) + +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) + +/* This decides where the kernel will search for a free chunk of vm + * space during mmap's. + */ +#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \ + 0xc0000000 : 0xFFFFe000) + +#define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \ + IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW) +#define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? 
\ + IA32_PAGE_OFFSET : TASK_SIZE_MAX) +#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \ + IA32_PAGE_OFFSET : TASK_SIZE_MAX) + +#define STACK_TOP TASK_SIZE_LOW +#define STACK_TOP_MAX TASK_SIZE_MAX + /* * Maximum kernel image size is limited to 1 GiB, due to the fixmap living * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S). diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 97143d87994c24..1618eeb08361a9 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -782,17 +782,6 @@ static inline void spin_lock_prefetch(const void *x) }) #ifdef CONFIG_X86_32 -/* - * User space process size: 3GB (default). - */ -#define IA32_PAGE_OFFSET PAGE_OFFSET -#define TASK_SIZE PAGE_OFFSET -#define TASK_SIZE_LOW TASK_SIZE -#define TASK_SIZE_MAX TASK_SIZE -#define DEFAULT_MAP_WINDOW TASK_SIZE -#define STACK_TOP TASK_SIZE -#define STACK_TOP_MAX STACK_TOP - #define INIT_THREAD { \ .sp0 = TOP_OF_INIT_STACK, \ .sysenter_cs = __KERNEL_CS, \ @@ -802,44 +791,6 @@ static inline void spin_lock_prefetch(const void *x) #define KSTK_ESP(task) (task_pt_regs(task)->sp) #else -/* - * User space process size. This is the first address outside the user range. - * There are a few constraints that determine this: - * - * On Intel CPUs, if a SYSCALL instruction is at the highest canonical - * address, then that syscall will enter the kernel with a - * non-canonical return address, and SYSRET will explode dangerously. - * We avoid this particular problem by preventing anything executable - * from being mapped at the maximum canonical address. - * - * On AMD CPUs in the Ryzen family, there's a nasty bug in which the - * CPUs malfunction if they execute code from the highest canonical page. - * They'll speculate right off the end of the canonical space, and - * bad things happen. This is worked around in the same way as the - * Intel problem. 
- * - * With page table isolation enabled, we map the LDT in ... [stay tuned] - */ -#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE) - -#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) - -/* This decides where the kernel will search for a free chunk of vm - * space during mmap's. - */ -#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? \ - 0xc0000000 : 0xFFFFe000) - -#define TASK_SIZE_LOW (test_thread_flag(TIF_ADDR32) ? \ - IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW) -#define TASK_SIZE (test_thread_flag(TIF_ADDR32) ? \ - IA32_PAGE_OFFSET : TASK_SIZE_MAX) -#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_ADDR32)) ? \ - IA32_PAGE_OFFSET : TASK_SIZE_MAX) - -#define STACK_TOP TASK_SIZE_LOW -#define STACK_TOP_MAX TASK_SIZE_MAX - #define INIT_THREAD { \ .addr_limit = KERNEL_DS, \ } -- 2.28.0 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 07/10] x86: make TASK_SIZE_MAX usable from assembly code 2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig ` (5 preceding siblings ...) 2020-08-27 15:00 ` [PATCH 06/10] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h Christoph Hellwig @ 2020-08-27 15:00 ` Christoph Hellwig 2020-08-27 15:00 ` [PATCH 08/10] x86: remove address space overrides using set_fs() Christoph Hellwig ` (6 subsequent siblings) 13 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-08-27 15:00 UTC (permalink / raw) To: Linus Torvalds, Al Viro, Michael Ellerman, x86 Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev For 64-bit the only thing missing was a strategic _AC, and for 32-bit we need to use __PAGE_OFFSET instead of PAGE_OFFSET in the TASK_SIZE definition to escape the explicit unsigned long cast. This just works because __PAGE_OFFSET is defined using _AC itself and thus never needs the cast anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> --- arch/x86/include/asm/page_32_types.h | 4 ++-- arch/x86/include/asm/page_64_types.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h index 26236925fb2c36..f462895a33e452 100644 --- a/arch/x86/include/asm/page_32_types.h +++ b/arch/x86/include/asm/page_32_types.h @@ -44,8 +44,8 @@ /* * User space process size: 3GB (default). 
*/ -#define IA32_PAGE_OFFSET PAGE_OFFSET -#define TASK_SIZE PAGE_OFFSET +#define IA32_PAGE_OFFSET __PAGE_OFFSET +#define TASK_SIZE __PAGE_OFFSET #define TASK_SIZE_LOW TASK_SIZE #define TASK_SIZE_MAX TASK_SIZE #define DEFAULT_MAP_WINDOW TASK_SIZE diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 996595c9897e0a..838515daf87b36 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -76,7 +76,7 @@ * * With page table isolation enabled, we map the LDT in ... [stay tuned] */ -#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE) +#define TASK_SIZE_MAX ((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE) #define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) -- 2.28.0 ^ permalink raw reply [flat|nested] 45+ messages in thread
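Editorial sketch, not part of the posted patch: the commit message above relies on the kernel's _AC() helper to make one definition usable from both C and assembly. The user-space model below illustrates the idea under stated assumptions. MODEL_AC, MODEL_VIRTUAL_MASK_SHIFT, MODEL_PAGE_SIZE, MODEL_TASK_SIZE_MAX and model_task_size_max() are hypothetical stand-ins for the kernel names, the 47-bit shift and 4 KiB page size are the 4-level x86-64 values quoted in the diff, and an LP64 target (64-bit unsigned long) is assumed.

```c
/*
 * Model of the _AC() idea: in C the type suffix is pasted onto the
 * literal, while an assembler build (signalled by __ASSEMBLY__) drops
 * it, because the assembler does not understand "UL".
 * All MODEL_* names are illustrative stand-ins, not kernel symbols.
 */
#ifdef __ASSEMBLY__
#define MODEL_AC(X, Y)   X
#else
#define MODEL___AC(X, Y) (X##Y)
#define MODEL_AC(X, Y)   MODEL___AC(X, Y)
#endif

#define MODEL_VIRTUAL_MASK_SHIFT 47                  /* 4-level paging */
#define MODEL_PAGE_SIZE          (MODEL_AC(1, UL) << 12)

/* Same shape as the patched TASK_SIZE_MAX definition. */
#define MODEL_TASK_SIZE_MAX \
	((MODEL_AC(1, UL) << MODEL_VIRTUAL_MASK_SHIFT) - MODEL_PAGE_SIZE)

/* C-side consumer; assumes unsigned long is 64 bits wide. */
unsigned long model_task_size_max(void)
{
	return MODEL_TASK_SIZE_MAX;
}
```

With a bare 1UL in the definition, an assembler including the same header would be handed the token "1UL" and fail to parse it, which appears to be the problem the one-line 64-bit change addresses.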
* [PATCH 08/10] x86: remove address space overrides using set_fs() 2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig ` (6 preceding siblings ...) 2020-08-27 15:00 ` [PATCH 07/10] x86: make TASK_SIZE_MAX usable from assembly code Christoph Hellwig @ 2020-08-27 15:00 ` Christoph Hellwig 2020-08-27 18:15 ` Linus Torvalds 2020-08-27 15:00 ` [PATCH 09/10] powerpc: use non-set_fs based maccess routines Christoph Hellwig ` (5 subsequent siblings) 13 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2020-08-27 15:00 UTC (permalink / raw) To: Linus Torvalds, Al Viro, Michael Ellerman, x86 Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev Stop providing the possibility to override the address space using set_fs() now that there is no need for that any more. To properly handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on x86 a new alternative is introduced, which just like the one in entry_64.S has to use the hardcoded virtual address bits to escape the fact that TASK_SIZE_MAX isn't actually a constant when 5-level page tables are enabled. 
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> --- arch/x86/Kconfig | 1 - arch/x86/ia32/ia32_aout.c | 1 - arch/x86/include/asm/processor.h | 11 +---------- arch/x86/include/asm/thread_info.h | 2 -- arch/x86/include/asm/uaccess.h | 26 +------------------------- arch/x86/kernel/asm-offsets.c | 3 --- arch/x86/lib/getuser.S | 28 ++++++++++++++++++---------- arch/x86/lib/putuser.S | 21 ++++++++++++--------- 8 files changed, 32 insertions(+), 61 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f85c13355732fe..7101ac64bb209d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -237,7 +237,6 @@ config X86 select HAVE_ARCH_KCSAN if X86_64 select X86_FEATURE_NAMES if PROC_FS select PROC_PID_ARCH_STATUS if PROC_FS - select SET_FS imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI config INSTRUCTION_DECODER diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c index ca8a657edf5977..a09fc37ead9d47 100644 --- a/arch/x86/ia32/ia32_aout.c +++ b/arch/x86/ia32/ia32_aout.c @@ -239,7 +239,6 @@ static int load_aout_binary(struct linux_binprm *bprm) (regs)->ss = __USER32_DS; regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0; - set_fs(USER_DS); return 0; } diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 1618eeb08361a9..189573d95c3af6 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -482,10 +482,6 @@ extern unsigned int fpu_user_xstate_size; struct perf_event; -typedef struct { - unsigned long seg; -} mm_segment_t; - struct thread_struct { /* Cached TLS descriptors: */ struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES]; @@ -538,8 +534,6 @@ struct thread_struct { */ unsigned long iopl_emul; - mm_segment_t addr_limit; - unsigned int sig_on_uaccess_err:1; /* Floating point and extended processor state */ @@ -785,15 +779,12 @@ static inline void spin_lock_prefetch(const void *x) #define 
INIT_THREAD { \ .sp0 = TOP_OF_INIT_STACK, \ .sysenter_cs = __KERNEL_CS, \ - .addr_limit = KERNEL_DS, \ } #define KSTK_ESP(task) (task_pt_regs(task)->sp) #else -#define INIT_THREAD { \ - .addr_limit = KERNEL_DS, \ -} +#define INIT_THREAD { } extern unsigned long KSTK_ESP(struct task_struct *task); diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 267701ae3d86dd..44733a4bfc4294 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -102,7 +102,6 @@ struct thread_info { #define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */ #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */ #define TIF_X32 30 /* 32-bit native x86-64 binary */ -#define TIF_FSCHECK 31 /* Check FS is USER_DS on return */ #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) @@ -131,7 +130,6 @@ struct thread_info { #define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT) #define _TIF_ADDR32 (1 << TIF_ADDR32) #define _TIF_X32 (1 << TIF_X32) -#define _TIF_FSCHECK (1 << TIF_FSCHECK) /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index ecefaffd15d4c8..a4ceda0510ea87 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -12,30 +12,6 @@ #include <asm/smap.h> #include <asm/extable.h> -/* - * The fs value determines whether argument validity checking should be - * performed or not. If get_fs() == USER_DS, checking is performed, with - * get_fs() == KERNEL_DS, checking is bypassed. - * - * For historical reasons, these macros are grossly misnamed. 
- */ - -#define MAKE_MM_SEG(s) ((mm_segment_t) { (s) }) - -#define KERNEL_DS MAKE_MM_SEG(-1UL) -#define USER_DS MAKE_MM_SEG(TASK_SIZE_MAX) - -#define get_fs() (current->thread.addr_limit) -static inline void set_fs(mm_segment_t fs) -{ - current->thread.addr_limit = fs; - /* On user-mode return, check fs is correct */ - set_thread_flag(TIF_FSCHECK); -} - -#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) -#define user_addr_max() (current->thread.addr_limit.seg) - /* * Test whether a block of memory is a valid user space address. * Returns 0 if the range is valid, nonzero otherwise. @@ -93,7 +69,7 @@ static inline bool pagefault_disabled(void); #define access_ok(addr, size) \ ({ \ WARN_ON_IN_IRQ(); \ - likely(!__range_not_ok(addr, size, user_addr_max())); \ + likely(!__range_not_ok(addr, size, TASK_SIZE_MAX)); \ }) /* diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 3ca07ad552ae0c..70b7154f4bdd62 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -37,9 +37,6 @@ static void __used common(void) OFFSET(TASK_stack_canary, task_struct, stack_canary); #endif - BLANK(); - OFFSET(TASK_addr_limit, task_struct, thread.addr_limit); - BLANK(); OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx); diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index c8a85b512796e1..ccc9808c66420a 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -35,10 +35,18 @@ #include <asm/smap.h> #include <asm/export.h> +#ifdef CONFIG_X86_5LEVEL +#define LOAD_TASK_SIZE_MAX \ + ALTERNATIVE "mov $((1 << 47) - 4096),%rdx", \ + "mov $((1 << 56) - 4096),%rdx", X86_FEATURE_LA57 +#else +#define LOAD_TASK_SIZE_MAX mov $TASK_SIZE_MAX,%_ASM_DX +#endif + .text SYM_FUNC_START(__get_user_1) - mov PER_CPU_VAR(current_task), %_ASM_DX - cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX + LOAD_TASK_SIZE_MAX + cmp %_ASM_DX,%_ASM_AX jae bad_get_user sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */ and %_ASM_DX, %_ASM_AX @@ -53,8 
+61,8 @@ EXPORT_SYMBOL(__get_user_1) SYM_FUNC_START(__get_user_2) add $1,%_ASM_AX jc bad_get_user - mov PER_CPU_VAR(current_task), %_ASM_DX - cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX + LOAD_TASK_SIZE_MAX + cmp %_ASM_DX,%_ASM_AX jae bad_get_user sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */ and %_ASM_DX, %_ASM_AX @@ -69,8 +77,8 @@ EXPORT_SYMBOL(__get_user_2) SYM_FUNC_START(__get_user_4) add $3,%_ASM_AX jc bad_get_user - mov PER_CPU_VAR(current_task), %_ASM_DX - cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX + LOAD_TASK_SIZE_MAX + cmp %_ASM_DX,%_ASM_AX jae bad_get_user sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */ and %_ASM_DX, %_ASM_AX @@ -86,8 +94,8 @@ SYM_FUNC_START(__get_user_8) #ifdef CONFIG_X86_64 add $7,%_ASM_AX jc bad_get_user - mov PER_CPU_VAR(current_task), %_ASM_DX - cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX + LOAD_TASK_SIZE_MAX + cmp %_ASM_DX,%_ASM_AX jae bad_get_user sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */ and %_ASM_DX, %_ASM_AX @@ -99,8 +107,8 @@ SYM_FUNC_START(__get_user_8) #else add $7,%_ASM_AX jc bad_get_user_8 - mov PER_CPU_VAR(current_task), %_ASM_DX - cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX + LOAD_TASK_SIZE_MAX + cmp %_ASM_DX,%_ASM_AX jae bad_get_user_8 sbb %_ASM_DX, %_ASM_DX /* array_index_mask_nospec() */ and %_ASM_DX, %_ASM_AX diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S index 7c7c92db8497af..f5a56394985875 100644 --- a/arch/x86/lib/putuser.S +++ b/arch/x86/lib/putuser.S @@ -31,12 +31,18 @@ * as they get called from within inline assembly. 
*/ -#define ENTER mov PER_CPU_VAR(current_task), %_ASM_BX +#ifdef CONFIG_X86_5LEVEL +#define LOAD_TASK_SIZE_MAX \ + ALTERNATIVE "mov $((1 << 47) - 4096),%rbx", \ + "mov $((1 << 56) - 4096),%rbx", X86_FEATURE_LA57 +#else +#define LOAD_TASK_SIZE_MAX mov $TASK_SIZE_MAX,%_ASM_BX +#endif .text SYM_FUNC_START(__put_user_1) - ENTER - cmp TASK_addr_limit(%_ASM_BX),%_ASM_CX + LOAD_TASK_SIZE_MAX + cmp %_ASM_BX,%_ASM_CX jae .Lbad_put_user ASM_STAC 1: movb %al,(%_ASM_CX) @@ -47,8 +53,7 @@ SYM_FUNC_END(__put_user_1) EXPORT_SYMBOL(__put_user_1) SYM_FUNC_START(__put_user_2) - ENTER - mov TASK_addr_limit(%_ASM_BX),%_ASM_BX + LOAD_TASK_SIZE_MAX sub $1,%_ASM_BX cmp %_ASM_BX,%_ASM_CX jae .Lbad_put_user @@ -61,8 +66,7 @@ SYM_FUNC_END(__put_user_2) EXPORT_SYMBOL(__put_user_2) SYM_FUNC_START(__put_user_4) - ENTER - mov TASK_addr_limit(%_ASM_BX),%_ASM_BX + LOAD_TASK_SIZE_MAX sub $3,%_ASM_BX cmp %_ASM_BX,%_ASM_CX jae .Lbad_put_user @@ -75,8 +79,7 @@ SYM_FUNC_END(__put_user_4) EXPORT_SYMBOL(__put_user_4) SYM_FUNC_START(__put_user_8) - ENTER - mov TASK_addr_limit(%_ASM_BX),%_ASM_BX + LOAD_TASK_SIZE_MAX sub $7,%_ASM_BX cmp %_ASM_BX,%_ASM_CX jae .Lbad_put_user -- 2.28.0 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 08/10] x86: remove address space overrides using set_fs() 2020-08-27 15:00 ` [PATCH 08/10] x86: remove address space overrides using set_fs() Christoph Hellwig @ 2020-08-27 18:15 ` Linus Torvalds 2020-08-29 9:25 ` Christoph Hellwig 0 siblings, 1 reply; 45+ messages in thread From: Linus Torvalds @ 2020-08-27 18:15 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, Michael Ellerman, the arch/x86 maintainers, Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > SYM_FUNC_START(__get_user_2) > add $1,%_ASM_AX > jc bad_get_user This no longer makes sense, and > - mov PER_CPU_VAR(current_task), %_ASM_DX > - cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX > + LOAD_TASK_SIZE_MAX > + cmp %_ASM_DX,%_ASM_AX This should be LOAD_TASK_SIZE_MAX_MINUS_N(1) cmp %_ASM_DX,%_ASM_AX instead (and then because we no longer modify _ASM_AX, we'd also remove the offset on the access). > SYM_FUNC_START(__put_user_2) > - ENTER > - mov TASK_addr_limit(%_ASM_BX),%_ASM_BX > + LOAD_TASK_SIZE_MAX > sub $1,%_ASM_BX It's even more obvious here. We load a constant and then immediately do a "sub $1" on that value. It's not a huge deal, you don't have to respin the series for this, I just wanted to point it out so that people are aware of it and if I forget somebody else will hopefully remember that "we should fix that too". Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 08/10] x86: remove address space overrides using set_fs() 2020-08-27 18:15 ` Linus Torvalds @ 2020-08-29 9:25 ` Christoph Hellwig 0 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-08-29 9:25 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Al Viro, Michael Ellerman, the arch/x86 maintainers, Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch, linuxppc-dev On Thu, Aug 27, 2020 at 11:15:12AM -0700, Linus Torvalds wrote: > > SYM_FUNC_START(__put_user_2) > > - ENTER > > - mov TASK_addr_limit(%_ASM_BX),%_ASM_BX > > + LOAD_TASK_SIZE_MAX > > sub $1,%_ASM_BX > > It's even more obvious here. We load a constant and then immediately > do a "sub $1" on that value. > > It's not a huge deal, you don't have to respin the series for this, I > just wanted to point it out so that people are aware of it and if I > forget somebody else will hopefully remember that "we should fix that > too". The changes seem easy enough and I need to respin at least for the lkdtm changes, and probably also for a pending fix in the low-level x86 code that will hopefully be picked up for 5.9. But the more important question is: how do we want to pick the series up? Especially due to the splice changes I really want it to be in linux-next as long as possible. ^ permalink raw reply [flat|nested] 45+ messages in thread
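Editorial sketch, not part of the thread: the review point above is that once the limit is a compile-time constant, the "minus N" adjustment can be folded into that constant instead of being applied to a register at run time. The C model below compares the two forms for an access of n bytes. MODEL_TASK_SIZE_MAX, model_check_add() and model_check_folded() are hypothetical names, the limit is the 4-level value from the diff, and a 64-bit unsigned long is assumed.

```c
#include <stdbool.h>

/* Illustrative constant limit: (1 << 47) - 4096, as in the 4-level case. */
#define MODEL_TASK_SIZE_MAX ((1UL << 47) - 4096)

/*
 * Form used by the posted __get_user_N: advance the address by n - 1,
 * reject on carry ("jc"), then reject if the result is at or above the
 * limit ("jae").
 */
bool model_check_add(unsigned long addr, unsigned long n)
{
	unsigned long last = addr + (n - 1);

	if (last < addr)		/* wrapped around: the carry case */
		return false;
	return last < MODEL_TASK_SIZE_MAX;
}

/*
 * Form suggested in review: leave the address untouched and compare it
 * against a limit that already has n - 1 subtracted. With a constant
 * limit the subtraction can happen at build time.
 */
bool model_check_folded(unsigned long addr, unsigned long n)
{
	return addr < MODEL_TASK_SIZE_MAX - (n - 1);
}
```

For access sizes of 1, 2, 4 and 8 bytes the two predicates accept the same addresses: any address that would wrap when n - 1 is added is already far above the reduced limit, so the separate carry test is no longer needed in the folded form.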
* [PATCH 09/10] powerpc: use non-set_fs based maccess routines 2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig ` (7 preceding siblings ...) 2020-08-27 15:00 ` [PATCH 08/10] x86: remove address space overrides using set_fs() Christoph Hellwig @ 2020-08-27 15:00 ` Christoph Hellwig 2020-08-27 15:00 ` [PATCH 10/10] powerpc: remove address space overrides using set_fs() Christoph Hellwig ` (4 subsequent siblings) 13 siblings, 0 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-08-27 15:00 UTC (permalink / raw) To: Linus Torvalds, Al Viro, Michael Ellerman, x86 Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev Provide __get_kernel_nofault and __put_kernel_nofault routines to implement the maccess routines without messing with set_fs and without opening up access to user space. Signed-off-by: Christoph Hellwig <hch@lst.de> --- arch/powerpc/include/asm/uaccess.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 00699903f1efca..7fe3531ad36a77 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -623,4 +623,20 @@ do { \ __put_user_goto(*(u8*)(_src + _i), (u8 __user *)(_dst + _i), e);\ } while (0) +#define HAVE_GET_KERNEL_NOFAULT + +#define __get_kernel_nofault(dst, src, type, err_label) \ +do { \ + int __kr_err; \ + \ + __get_user_size_allowed(*((type *)(dst)), (__force type __user *)(src),\ + sizeof(type), __kr_err); \ + if (unlikely(__kr_err)) \ + goto err_label; \ +} while (0) + +#define __put_kernel_nofault(dst, src, type, err_label) \ + __put_user_size_goto(*((type *)(src)), \ + (__force type __user *)(dst), sizeof(type), err_label) + #endif /* _ARCH_POWERPC_UACCESS_H */ -- 2.28.0 ^ permalink raw reply [flat|nested] 45+ messages in thread
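Editorial sketch, not part of the posted patch: the __get_kernel_nofault() macro above reports failure by jumping to a caller-supplied label rather than by returning a value. The user-space model below shows that control-flow shape only. It cannot reproduce real fault handling, so model_addr_ok() is a hypothetical stand-in that treats NULL as the faulting address, and model_get_nofault() and model_copy_nofault() are illustrative names rather than kernel functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "this access would fault". */
static bool model_addr_ok(const void *p)
{
	return p != NULL;
}

/*
 * Same calling shape as the macro in the patch: copy one object of the
 * given type, or jump to err_label if the source is not accessible.
 */
#define model_get_nofault(dst, src, type, err_label)	\
do {							\
	if (!model_addr_ok(src))			\
		goto err_label;				\
	memcpy((dst), (src), sizeof(type));		\
} while (0)

/* Caller: copy in word-sized chunks, then finish byte by byte. */
long model_copy_nofault(void *dst, const void *src, size_t size)
{
	char *d = dst;
	const char *s = src;

	while (size >= sizeof(unsigned long)) {
		model_get_nofault(d, s, unsigned long, efault);
		d += sizeof(unsigned long);
		s += sizeof(unsigned long);
		size -= sizeof(unsigned long);
	}
	while (size > 0) {
		model_get_nofault(d, s, char, efault);
		d++;
		s++;
		size--;
	}
	return 0;
efault:
	return -EFAULT;
}
```

The label-based form keeps the success path free of per-access error checks in the caller, which is presumably why the macros take an err_label argument instead of returning a status.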
* [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig ` (8 preceding siblings ...) 2020-08-27 15:00 ` [PATCH 09/10] powerpc: use non-set_fs based maccess routines Christoph Hellwig @ 2020-08-27 15:00 ` Christoph Hellwig 2020-09-02 6:15 ` Christophe Leroy 2020-08-27 15:31 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig ` (3 subsequent siblings) 13 siblings, 1 reply; 45+ messages in thread From: Christoph Hellwig @ 2020-08-27 15:00 UTC (permalink / raw) To: Linus Torvalds, Al Viro, Michael Ellerman, x86 Cc: Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev Stop providing the possibility to override the address space using set_fs() now that there is no need for that any more. Signed-off-by: Christoph Hellwig <hch@lst.de> --- arch/powerpc/Kconfig | 1 - arch/powerpc/include/asm/processor.h | 7 --- arch/powerpc/include/asm/thread_info.h | 5 +-- arch/powerpc/include/asm/uaccess.h | 62 ++++++++------------------ arch/powerpc/kernel/signal.c | 3 -- arch/powerpc/lib/sstep.c | 6 +-- 6 files changed, 22 insertions(+), 62 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3f09d6fdf89405..1f48bbfb3ce99d 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -249,7 +249,6 @@ config PPC select PCI_SYSCALL if PCI select PPC_DAWR if PPC64 select RTC_LIB - select SET_FS select SPARSE_IRQ select SYSCTL_EXCEPTION_TRACE select THREAD_INFO_IN_TASK diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index ed0d633ab5aa42..f01e4d650c520a 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -83,10 +83,6 @@ struct task_struct; void start_thread(struct pt_regs *regs, unsigned long fdptr, unsigned long sp); void release_thread(struct task_struct *); -typedef struct { - unsigned long seg; -} 
mm_segment_t; - #define TS_FPR(i) fp_state.fpr[i][TS_FPROFFSET] #define TS_CKFPR(i) ckfp_state.fpr[i][TS_FPROFFSET] @@ -148,7 +144,6 @@ struct thread_struct { unsigned long ksp_vsid; #endif struct pt_regs *regs; /* Pointer to saved register state */ - mm_segment_t addr_limit; /* for get_fs() validation */ #ifdef CONFIG_BOOKE /* BookE base exception scratch space; align on cacheline */ unsigned long normsave[8] ____cacheline_aligned; @@ -295,7 +290,6 @@ struct thread_struct { #define INIT_THREAD { \ .ksp = INIT_SP, \ .ksp_limit = INIT_SP_LIMIT, \ - .addr_limit = KERNEL_DS, \ .pgdir = swapper_pg_dir, \ .fpexc_mode = MSR_FE0 | MSR_FE1, \ SPEFSCR_INIT \ @@ -303,7 +297,6 @@ struct thread_struct { #else #define INIT_THREAD { \ .ksp = INIT_SP, \ - .addr_limit = KERNEL_DS, \ .fpexc_mode = 0, \ } #endif diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h index ca6c9702570494..46a210b03d2b80 100644 --- a/arch/powerpc/include/asm/thread_info.h +++ b/arch/powerpc/include/asm/thread_info.h @@ -90,7 +90,6 @@ void arch_setup_new_exec(void); #define TIF_SYSCALL_TRACE 0 /* syscall trace active */ #define TIF_SIGPENDING 1 /* signal pending */ #define TIF_NEED_RESCHED 2 /* rescheduling necessary */ -#define TIF_FSCHECK 3 /* Check FS is USER_DS on return */ #define TIF_SYSCALL_EMU 4 /* syscall emulation active */ #define TIF_RESTORE_TM 5 /* need to restore TM FP/VEC/VSX */ #define TIF_PATCH_PENDING 6 /* pending live patching update */ @@ -130,7 +129,6 @@ void arch_setup_new_exec(void); #define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT) #define _TIF_EMULATE_STACK_STORE (1<<TIF_EMULATE_STACK_STORE) #define _TIF_NOHZ (1<<TIF_NOHZ) -#define _TIF_FSCHECK (1<<TIF_FSCHECK) #define _TIF_SYSCALL_EMU (1<<TIF_SYSCALL_EMU) #define _TIF_SYSCALL_DOTRACE (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \ _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \ @@ -138,8 +136,7 @@ void arch_setup_new_exec(void); #define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | 
_TIF_NEED_RESCHED | \ _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ - _TIF_RESTORE_TM | _TIF_PATCH_PENDING | \ - _TIF_FSCHECK) + _TIF_RESTORE_TM | _TIF_PATCH_PENDING) #define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR) /* Bits in local_flags */ diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 7fe3531ad36a77..39727537d39701 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -8,62 +8,36 @@ #include <asm/extable.h> #include <asm/kup.h> -/* - * The fs value determines whether argument validity checking should be - * performed or not. If get_fs() == USER_DS, checking is performed, with - * get_fs() == KERNEL_DS, checking is bypassed. - * - * For historical reasons, these macros are grossly misnamed. - * - * The fs/ds values are now the highest legal address in the "segment". - * This simplifies the checking in the routines below. - */ - -#define MAKE_MM_SEG(s) ((mm_segment_t) { (s) }) - -#define KERNEL_DS MAKE_MM_SEG(~0UL) #ifdef __powerpc64__ /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */ -#define USER_DS MAKE_MM_SEG(TASK_SIZE_USER64 - 1) -#else -#define USER_DS MAKE_MM_SEG(TASK_SIZE - 1) -#endif - -#define get_fs() (current->thread.addr_limit) +#define TASK_SIZE_MAX TASK_SIZE_USER64 -static inline void set_fs(mm_segment_t fs) +static inline bool __access_ok(unsigned long addr, unsigned long size) { - current->thread.addr_limit = fs; - /* On user-mode return check addr_limit (fs) is correct */ - set_thread_flag(TIF_FSCHECK); + if (addr >= TASK_SIZE_MAX) + return false; + /* + * This check is sufficient because there is a large enough gap between + * user addresses and the kernel addresses. 
+ */ + return size <= TASK_SIZE_MAX; } - -#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) -#define user_addr_max() (get_fs().seg) - -#ifdef __powerpc64__ -/* - * This check is sufficient because there is a large enough - * gap between user addresses and the kernel addresses - */ -#define __access_ok(addr, size, segment) \ - (((addr) <= (segment).seg) && ((size) <= (segment).seg)) - #else +#define TASK_SIZE_MAX TASK_SIZE -static inline int __access_ok(unsigned long addr, unsigned long size, - mm_segment_t seg) +static inline bool __access_ok(unsigned long addr, unsigned long size) { - if (addr > seg.seg) - return 0; - return (size == 0 || size - 1 <= seg.seg - addr); + if (addr >= TASK_SIZE_MAX) + return false; + if (size == 0) + return false; + return size <= TASK_SIZE_MAX - addr; } - -#endif +#endif /* __powerpc64__ */ #define access_ok(addr, size) \ (__chk_user_ptr(addr), \ - __access_ok((__force unsigned long)(addr), (size), get_fs())) + __access_ok((unsigned long)(addr), (size))) /* * These are the main single-value transfer routines. 
They automatically diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index d15a98c758b8b4..df547d8e31e49c 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -312,9 +312,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) { user_exit(); - /* Check valid addr_limit, TIF check is done there */ - addr_limit_user_check(); - if (thread_info_flags & _TIF_UPROBE) uprobe_notify_resume(regs); diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index caee8cc77e1954..8342188ea1acd0 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -108,11 +108,11 @@ static nokprobe_inline long address_ok(struct pt_regs *regs, { if (!user_mode(regs)) return 1; - if (__access_ok(ea, nb, USER_DS)) + if (__access_ok(ea, nb)) return 1; - if (__access_ok(ea, 1, USER_DS)) + if (__access_ok(ea, 1)) /* Access overlaps the end of the user region */ - regs->dar = USER_DS.seg; + regs->dar = TASK_SIZE_MAX - 1; else regs->dar = ea; return 0; -- 2.28.0 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-08-27 15:00 ` [PATCH 10/10] powerpc: remove address space overrides using set_fs() Christoph Hellwig @ 2020-09-02 6:15 ` Christophe Leroy 2020-09-02 12:36 ` Christoph Hellwig 0 siblings, 1 reply; 45+ messages in thread From: Christophe Leroy @ 2020-09-02 6:15 UTC (permalink / raw) To: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, x86 Cc: linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel Le 27/08/2020 à 17:00, Christoph Hellwig a écrit : > Stop providing the possibility to override the address space using > set_fs() now that there is no need for that any more. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > arch/powerpc/Kconfig | 1 - > arch/powerpc/include/asm/processor.h | 7 --- > arch/powerpc/include/asm/thread_info.h | 5 +-- > arch/powerpc/include/asm/uaccess.h | 62 ++++++++------------------ > arch/powerpc/kernel/signal.c | 3 -- > arch/powerpc/lib/sstep.c | 6 +-- > 6 files changed, 22 insertions(+), 62 deletions(-) > > diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h > index 7fe3531ad36a77..39727537d39701 100644 > --- a/arch/powerpc/include/asm/uaccess.h > +++ b/arch/powerpc/include/asm/uaccess.h > @@ -8,62 +8,36 @@ > #include <asm/extable.h> > #include <asm/kup.h> > > -/* > - * The fs value determines whether argument validity checking should be > - * performed or not. If get_fs() == USER_DS, checking is performed, with > - * get_fs() == KERNEL_DS, checking is bypassed. > - * > - * For historical reasons, these macros are grossly misnamed. > - * > - * The fs/ds values are now the highest legal address in the "segment". > - * This simplifies the checking in the routines below. 
> - */ > - > -#define MAKE_MM_SEG(s) ((mm_segment_t) { (s) }) > - > -#define KERNEL_DS MAKE_MM_SEG(~0UL) > #ifdef __powerpc64__ > /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */ > -#define USER_DS MAKE_MM_SEG(TASK_SIZE_USER64 - 1) > -#else > -#define USER_DS MAKE_MM_SEG(TASK_SIZE - 1) > -#endif > - > -#define get_fs() (current->thread.addr_limit) > +#define TASK_SIZE_MAX TASK_SIZE_USER64 > > -static inline void set_fs(mm_segment_t fs) > +static inline bool __access_ok(unsigned long addr, unsigned long size) > { > - current->thread.addr_limit = fs; > - /* On user-mode return check addr_limit (fs) is correct */ > - set_thread_flag(TIF_FSCHECK); > + if (addr >= TASK_SIZE_MAX) > + return false; > + /* > + * This check is sufficient because there is a large enough gap between > + * user addresses and the kernel addresses. > + */ > + return size <= TASK_SIZE_MAX; > } > - > -#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) > -#define user_addr_max() (get_fs().seg) > - > -#ifdef __powerpc64__ > -/* > - * This check is sufficient because there is a large enough > - * gap between user addresses and the kernel addresses > - */ > -#define __access_ok(addr, size, segment) \ > - (((addr) <= (segment).seg) && ((size) <= (segment).seg)) > - > #else > +#define TASK_SIZE_MAX TASK_SIZE > > -static inline int __access_ok(unsigned long addr, unsigned long size, > - mm_segment_t seg) > +static inline bool __access_ok(unsigned long addr, unsigned long size) > { > - if (addr > seg.seg) > - return 0; > - return (size == 0 || size - 1 <= seg.seg - addr); > + if (addr >= TASK_SIZE_MAX) > + return false; > + if (size == 0) > + return false; __access_ok() was returning true when size == 0 up to now. Any reason to return false now ? 
> + return size <= TASK_SIZE_MAX - addr; > } > - > -#endif > +#endif /* __powerpc64__ */ > > #define access_ok(addr, size) \ > (__chk_user_ptr(addr), \ > - __access_ok((__force unsigned long)(addr), (size), get_fs())) > + __access_ok((unsigned long)(addr), (size))) > > /* > * These are the main single-value transfer routines. They automatically Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 6:15 ` Christophe Leroy @ 2020-09-02 12:36 ` Christoph Hellwig 2020-09-02 13:13 ` David Laight 2020-09-02 15:17 ` Christophe Leroy 0 siblings, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-09-02 12:36 UTC (permalink / raw) To: Christophe Leroy Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: >> - return 0; >> - return (size == 0 || size - 1 <= seg.seg - addr); >> + if (addr >= TASK_SIZE_MAX) >> + return false; >> + if (size == 0) >> + return false; > > __access_ok() was returning true when size == 0 up to now. Any reason to > return false now ? No, this is accidental and broken. Can you re-run your benchmark with this fixed? ^ permalink raw reply [flat|nested] 45+ messages in thread
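The size == 0 regression discussed above can be reproduced outside the kernel. The following is only a user-space sketch, not kernel code: it assumes a 32-bit layout with TASK_SIZE_MAX at 0xc0000000, models unsigned long as uint32_t, and uses the hypothetical names old_access_ok() and posted_access_ok() for the pre-patch check (with the segment fixed to USER_DS) and the check as posted in patch 10/10.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 32-bit layout with the user/kernel split at 0xc0000000. */
#define TASK_SIZE_MAX 0xc0000000u

/* Pre-patch 32-bit check, with seg fixed to USER_DS (TASK_SIZE - 1). */
static bool old_access_ok(uint32_t addr, uint32_t size)
{
	uint32_t seg = TASK_SIZE_MAX - 1;

	if (addr > seg)
		return false;
	/* size == 0 short-circuits, so size - 1 never wraps here. */
	return size == 0 || size - 1 <= seg - addr;
}

/* Check as posted in patch 10/10, including the size == 0 rejection. */
static bool posted_access_ok(uint32_t addr, uint32_t size)
{
	if (addr >= TASK_SIZE_MAX)
		return false;
	if (size == 0)
		return false;
	return size <= TASK_SIZE_MAX - addr;
}
```

In this model the two checks agree for every non-empty range and differ only for zero-length accesses, which is the behaviour change Christophe pointed out.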
* RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 12:36 ` Christoph Hellwig @ 2020-09-02 13:13 ` David Laight 2020-09-02 13:24 ` Christophe Leroy 2020-09-02 15:17 ` Christophe Leroy 1 sibling, 1 reply; 45+ messages in thread From: David Laight @ 2020-09-02 13:13 UTC (permalink / raw) To: 'Christoph Hellwig', Christophe Leroy Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel From: Christoph Hellwig > Sent: 02 September 2020 13:37 > > On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: > >> - return 0; > >> - return (size == 0 || size - 1 <= seg.seg - addr); > >> + if (addr >= TASK_SIZE_MAX) > >> + return false; > >> + if (size == 0) > >> + return false; > > > > __access_ok() was returning true when size == 0 up to now. Any reason to > > return false now ? > > No, this is accidental and broken. Can you re-run your benchmark with > this fixed? Is TASK_SIZE_MAX defined such that you can do: return ((addr | size) < TASK_SIZE_MAX) || !size; David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 13:13 ` David Laight @ 2020-09-02 13:24 ` Christophe Leroy 2020-09-02 13:51 ` David Laight 0 siblings, 1 reply; 45+ messages in thread From: Christophe Leroy @ 2020-09-02 13:24 UTC (permalink / raw) To: David Laight, 'Christoph Hellwig' Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel Le 02/09/2020 à 15:13, David Laight a écrit : > From: Christoph Hellwig >> Sent: 02 September 2020 13:37 >> >> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: >>>> - return 0; >>>> - return (size == 0 || size - 1 <= seg.seg - addr); >>>> + if (addr >= TASK_SIZE_MAX) >>>> + return false; >>>> + if (size == 0) >>>> + return false; >>> >>> __access_ok() was returning true when size == 0 up to now. Any reason to >>> return false now ? >> >> No, this is accidental and broken. Can you re-run your benchmark with >> this fixed? > > Is TASK_SIZE_MASK defined such that you can do: > > return (addr | size) < TASK_SIZE_MAX) || !size; TASK_SIZE_MAX will usually be 0xc0000000 With: addr = 0x80000000; size = 0x80000000; I expect it to fail .... With the formula you propose it will succeed, won't it ? Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
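Christophe's counterexample can be checked with a few lines of user-space C. This is a sketch under stated assumptions: TASK_SIZE_MAX is taken to be 0xc0000000 as in his message, unsigned long is modelled as uint32_t, the unbalanced parenthesis in the proposed expression is resolved in what looks like the intended way, and or_check() and exact_check() are hypothetical names. exact_check() is not a kernel helper; it simply computes the end of the range in 64-bit arithmetic as a reference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 32-bit layout with the user/kernel split at 0xc0000000. */
#define TASK_SIZE_MAX 0xc0000000u

/* The OR-based shortcut proposed in the thread (parentheses balanced). */
static bool or_check(uint32_t addr, uint32_t size)
{
	return ((addr | size) < TASK_SIZE_MAX) || !size;
}

/* Reference: the range [addr, addr + size) must end at or below the limit. */
static bool exact_check(uint32_t addr, uint32_t size)
{
	uint64_t end = (uint64_t)addr + size;

	return end <= TASK_SIZE_MAX;
}
```

With addr = 0x80000000 and size = 0x80000000 the OR of the two values is still 0x80000000, which is below the limit, so the shortcut accepts a range that runs up to 0xffffffff and therefore covers kernel addresses.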
* RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 13:24 ` Christophe Leroy @ 2020-09-02 13:51 ` David Laight 2020-09-02 14:12 ` Christophe Leroy 0 siblings, 1 reply; 45+ messages in thread From: David Laight @ 2020-09-02 13:51 UTC (permalink / raw) To: 'Christophe Leroy', 'Christoph Hellwig' Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel From: Christophe Leroy > Sent: 02 September 2020 14:25 > Le 02/09/2020 à 15:13, David Laight a écrit : > > From: Christoph Hellwig > >> Sent: 02 September 2020 13:37 > >> > >> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: > >>>> - return 0; > >>>> - return (size == 0 || size - 1 <= seg.seg - addr); > >>>> + if (addr >= TASK_SIZE_MAX) > >>>> + return false; > >>>> + if (size == 0) > >>>> + return false; > >>> > >>> __access_ok() was returning true when size == 0 up to now. Any reason to > >>> return false now ? > >> > >> No, this is accidental and broken. Can you re-run your benchmark with > >> this fixed? > > > > Is TASK_SIZE_MASK defined such that you can do: > > > > return (addr | size) < TASK_SIZE_MAX) || !size; > > TASK_SIZE_MAX will usually be 0xc0000000 > > With: > addr = 0x80000000; > size = 0x80000000; > > I expect it to fail .... > > With the formula you propose it will succeed, won't it ? Hmmm... Was i getting confused about some comments for 64bit about there being such a big hole between valid user and kernel addresses that it was enough to check that 'size < TASK_SIZE_MAX'. That would be true for 64bit x86 (and probably ppc (& arm??)) if TASK_SIZE_MAX were 0x4 << 60. IIUC the highest user address is (much) less than 0x0 << 60 and the lowest kernel address (much) greater than 0xf << 60 on all these 64bit platforms. Actually if doing access_ok() inside get_user() you don't need to check the size at all. You don't even need to in copy_to/from_user() provided it always does a forwards copy. 
(Rather than copying the last word first for misaligned lengths.) David ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 13:51 ` David Laight @ 2020-09-02 14:12 ` Christophe Leroy 2020-09-02 15:02 ` David Laight 0 siblings, 1 reply; 45+ messages in thread From: Christophe Leroy @ 2020-09-02 14:12 UTC (permalink / raw) To: David Laight, 'Christoph Hellwig' Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel Le 02/09/2020 à 15:51, David Laight a écrit : > From: Christophe Leroy >> Sent: 02 September 2020 14:25 >> Le 02/09/2020 à 15:13, David Laight a écrit : >>> From: Christoph Hellwig >>>> Sent: 02 September 2020 13:37 >>>> >>>> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: >>>>>> - return 0; >>>>>> - return (size == 0 || size - 1 <= seg.seg - addr); >>>>>> + if (addr >= TASK_SIZE_MAX) >>>>>> + return false; >>>>>> + if (size == 0) >>>>>> + return false; >>>>> >>>>> __access_ok() was returning true when size == 0 up to now. Any reason to >>>>> return false now ? >>>> >>>> No, this is accidental and broken. Can you re-run your benchmark with >>>> this fixed? >>> >>> Is TASK_SIZE_MASK defined such that you can do: >>> >>> return (addr | size) < TASK_SIZE_MAX) || !size; >> >> TASK_SIZE_MAX will usually be 0xc0000000 >> >> With: >> addr = 0x80000000; >> size = 0x80000000; >> >> I expect it to fail .... >> >> With the formula you propose it will succeed, won't it ? > > Hmmm... Was i getting confused about some comments for 64bit > about there being such a big hole between valid user and kernel > addresses that it was enough to check that 'size < TASK_SIZE_MAX'. > > That would be true for 64bit x86 (and probably ppc (& arm??)) > if TASK_SIZE_MAX were 0x4 << 60. > IIUC the highest user address is (much) less than 0x0 << 60 > and the lowest kernel address (much) greater than 0xf << 60 > on all these 64bit platforms. > > Actually if doing access_ok() inside get_user() you don't > need to check the size at all. 
You mean on 64 bit or on any platform ? What about a word write to 0xbffffffe, won't it overwrite 0xc0000000 ? > You don't even need to in copy_to/from_user() provided > it always does a forwards copy. Do you mean due to the gap ? Is it guaranteed that there is a gap ? Even on 32-bit with TASK_SIZE set to 0xc0000000 and PAGE_OFFSET set to the same ? Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
* RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 14:12 ` Christophe Leroy @ 2020-09-02 15:02 ` David Laight 0 siblings, 0 replies; 45+ messages in thread From: David Laight @ 2020-09-02 15:02 UTC (permalink / raw) To: 'Christophe Leroy', 'Christoph Hellwig' Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel From: Christophe Leroy > Sent: 02 September 2020 15:13 > > > Le 02/09/2020 à 15:51, David Laight a écrit : > > From: Christophe Leroy > >> Sent: 02 September 2020 14:25 > >> Le 02/09/2020 à 15:13, David Laight a écrit : > >>> From: Christoph Hellwig > >>>> Sent: 02 September 2020 13:37 > >>>> > >>>> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: > >>>>>> - return 0; > >>>>>> - return (size == 0 || size - 1 <= seg.seg - addr); > >>>>>> + if (addr >= TASK_SIZE_MAX) > >>>>>> + return false; > >>>>>> + if (size == 0) > >>>>>> + return false; > >>>>> > >>>>> __access_ok() was returning true when size == 0 up to now. Any reason to > >>>>> return false now ? > >>>> > >>>> No, this is accidental and broken. Can you re-run your benchmark with > >>>> this fixed? > >>> > >>> Is TASK_SIZE_MASK defined such that you can do: > >>> > >>> return (addr | size) < TASK_SIZE_MAX) || !size; > >> > >> TASK_SIZE_MAX will usually be 0xc0000000 > >> > >> With: > >> addr = 0x80000000; > >> size = 0x80000000; > >> > >> I expect it to fail .... > >> > >> With the formula you propose it will succeed, won't it ? > > > > Hmmm... Was i getting confused about some comments for 64bit > > about there being such a big hole between valid user and kernel > > addresses that it was enough to check that 'size < TASK_SIZE_MAX'. > > > > That would be true for 64bit x86 (and probably ppc (& arm??)) > > if TASK_SIZE_MAX were 0x4 << 60. 
> > IIUC the highest user address is (much) less than 0x0 << 60 > > and the lowest kernel address (much) greater than 0xf << 60 > > on all these 64bit platforms. > > > > Actually if doing access_ok() inside get_user() you don't > > need to check the size at all. > > You mean on 64 bit or on any platform ? 64bit and 32bit > What about a word write to 0xbffffffe, won't it overwrite 0xc0000000 ? > > > You don't even need to in copy_to/from_user() provided > > it always does a forwards copy. > > Do you mean due to the gap ? > Is it guaranteed that there is a gap ? Even on 32-bit with TASK_SIZE set to > 0xc0000000 and PAGE_OFFSET set to the same ? I read somewhere (I won't find it again) that the last 4k page (below 0xc0000000) must not be allocated on i386 because some CPUs (both Intel and AMD) do 'horrid things' if they try to (IIRC) do instruction prefetches across the boundary. So the accesses to 0xbffffffe will fault and the one to 0xc0000000 won't happen (in any useful way at least). I'd suspect that not allocating the 3G-4k page would be a safe bet on all architectures - even 68k. David ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 12:36 ` Christoph Hellwig 2020-09-02 13:13 ` David Laight @ 2020-09-02 15:17 ` Christophe Leroy 2020-09-02 18:02 ` Linus Torvalds 1 sibling, 1 reply; 45+ messages in thread From: Christophe Leroy @ 2020-09-02 15:17 UTC (permalink / raw) To: Christoph Hellwig Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel On 02/09/2020 at 14:36, Christoph Hellwig wrote: > On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: >>> - return 0; >>> - return (size == 0 || size - 1 <= seg.seg - addr); >>> + if (addr >= TASK_SIZE_MAX) >>> + return false; >>> + if (size == 0) >>> + return false; >> >> __access_ok() was returning true when size == 0 up to now. Any reason to >> return false now ? > > No, this is accidental and broken. Can you re-run your benchmark with > this fixed? > With this fix, I get root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M 1048576+0 records in 1048576+0 records out 536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s real 0m 6.78s user 0m 1.64s sys 0m 5.13s That's still far from the 91.7MB/s I get with 5.9-rc2, but better than the 65.8MB/s I got yesterday with your series. Still some way to go though. Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 15:17 ` Christophe Leroy @ 2020-09-02 18:02 ` Linus Torvalds 2020-09-03 7:11 ` Christoph Hellwig 2020-09-03 7:20 ` Christophe Leroy 0 siblings, 2 replies; 45+ messages in thread From: Linus Torvalds @ 2020-09-02 18:02 UTC (permalink / raw) To: Christophe Leroy Cc: Christoph Hellwig, Al Viro, Michael Ellerman, the arch/x86 maintainers, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, Linux Kernel Mailing List On Wed, Sep 2, 2020 at 8:17 AM Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > > With this fix, I get > > root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M > 536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s > > That's still far from the 91.7MB/s I get with 5.9-rc2, but better than > the 65.8MB/s I got yesterday with your series. Still some way to go though. I don't see why this change would make any difference. And btw, why do the 32-bit and 64-bit checks even differ? It's not like the extra (single) instruction should even matter. I think the main reason is that the simpler 64-bit case could stay as a macro (because it only uses "addr" and "size" once), but honestly, that "simplification" doesn't help when you then need to have that #ifdef for the 32-bit case and an inline function anyway. So why isn't it just static inline int __access_ok(unsigned long addr, unsigned long size) { return addr <= TASK_SIZE_MAX && size <= TASK_SIZE_MAX-addr; } for both and be done with it? The "size=0" check is only relevant for the "addr == TASK_SIZE_MAX" case, and existed in the old code because it had that "-1" thing because "seg.seg" was actually TASK_SIZE-1. Now that we don't have any TASK_SIZE-1, zero isn't special any more. However, I suspect a bigger reason for the actual performance degradation would be the patch that makes things use "write_iter()" for writing, even when a simpler "write()" exists. 
For writing to /dev/null, the cost of setting up iterators and all the pointless indirection is all kinds of stupid. So I think "write()" should just go back to default to using "->write()" rather than "->write_iter()" if the simpler case exists. Linus ^ permalink raw reply [flat|nested] 45+ messages in thread
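The single-expression check Linus proposes can be exercised in user space to see why zero no longer needs special treatment. This is a sketch only: it assumes the 32-bit value TASK_SIZE_MAX = 0xc0000000 mentioned earlier in the thread, models unsigned long as uint32_t, and uses the hypothetical name unified_access_ok() for the proposed body.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 32-bit layout with the user/kernel split at 0xc0000000. */
#define TASK_SIZE_MAX 0xc0000000u

/* Body of the check proposed above, used for both 32-bit and 64-bit. */
static bool unified_access_ok(uint32_t addr, uint32_t size)
{
	/*
	 * The first comparison guarantees that TASK_SIZE_MAX - addr cannot
	 * wrap, so the second one bounds the end of the range.
	 */
	return addr <= TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
}
```

In this model a zero-length access at addr == TASK_SIZE_MAX is accepted while any non-empty access there is rejected, and the 4-byte write at 0xbffffffe raised earlier in the thread is rejected because only 2 bytes remain below the limit.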
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 18:02 ` Linus Torvalds @ 2020-09-03 7:11 ` Christoph Hellwig 2020-09-03 7:27 ` Christophe Leroy 2020-09-03 8:55 ` Christophe Leroy 2020-09-03 7:20 ` Christophe Leroy 1 sibling, 2 replies; 45+ messages in thread From: Christoph Hellwig @ 2020-09-03 7:11 UTC (permalink / raw) To: Linus Torvalds Cc: Christophe Leroy, Christoph Hellwig, Al Viro, Michael Ellerman, the arch/x86 maintainers, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, Linux Kernel Mailing List On Wed, Sep 02, 2020 at 11:02:22AM -0700, Linus Torvalds wrote: > I don't see why this change would make any difference. Me neither, but while looking at a different project I did spot places that actually do an access_ok with len 0, that's why I wanted him to try. That being said: Christophe, are these numbers stable? Do you get similar numbers with multiple runs? > And btw, why do the 32-bit and 64-bit checks even differ? It's not > like the extra (single) instruction should even matter. I think the > main reason is that the simpler 64-bit case could stay as a macro > (because it only uses "addr" and "size" once), but honestly, that > "simplification" doesn't help when you then need to have that #ifdef > for the 32-bit case and an inline function anyway. I'll have to leave that to the powerpc folks. The intent was to not change the behavior (and I even fucked that up for the size == 0 case). > However, I suspect a bigger reason for the actual performance > degradation would be the patch that makes things use "write_iter()" > for writing, even when a simpler "write()" exists. Except that we do not actually have such a patch. For normal user writes we only use ->write_iter if ->write is not present. But what shows up in the profile is that /dev/zero only has a read_iter op and not a normal read. 
I've added a patch below that implements a normal read which might help a tad with this workload, but should not be part of a regression. Also Christophe: can you bisect which patch starts this? Is it really this last patch in the series? --- diff --git a/drivers/char/mem.c b/drivers/char/mem.c index abd4ffdc8cdebc..1dc99ab158457a 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -726,6 +726,27 @@ static ssize_t read_iter_zero(struct kiocb *iocb, struct iov_iter *iter) return written; } +static ssize_t read_zero(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + size_t cleared = 0; + + while (count) { + size_t chunk = min_t(size_t, count, PAGE_SIZE); + + if (clear_user(buf + cleared, chunk)) + return cleared ? cleared : -EFAULT; + cleared += chunk; + count -= chunk; + + if (signal_pending(current)) + return cleared ? cleared : -ERESTARTSYS; + cond_resched(); + } + + return cleared; +} + static int mmap_zero(struct file *file, struct vm_area_struct *vma) { #ifndef CONFIG_MMU @@ -921,6 +942,7 @@ static const struct file_operations zero_fops = { .llseek = zero_lseek, .write = write_zero, .read_iter = read_iter_zero, + .read = read_zero, .write_iter = write_iter_zero, .mmap = mmap_zero, .get_unmapped_area = get_unmapped_area_zero, ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-03 7:11 ` Christoph Hellwig @ 2020-09-03 7:27 ` Christophe Leroy 2020-09-03 8:55 ` Christophe Leroy 1 sibling, 0 replies; 45+ messages in thread From: Christophe Leroy @ 2020-09-03 7:27 UTC (permalink / raw) To: Christoph Hellwig, Linus Torvalds Cc: Al Viro, Michael Ellerman, the arch/x86 maintainers, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, Linux Kernel Mailing List Le 03/09/2020 à 09:11, Christoph Hellwig a écrit : > On Wed, Sep 02, 2020 at 11:02:22AM -0700, Linus Torvalds wrote: >> I don't see why this change would make any difference. > > Me neither, but while looking at a different project I did spot places > that actually do an access_ok with len 0, that's why I wanted him to > try. > > That being said: Christophe are these number stables? Do you get > similar numbers with multiple runs? Yes the numbers are similar with multiple runs and multiple reboots. > >> And btw, why do the 32-bit and 64-bit checks even differ? It's not >> like the extra (single) instruction should even matter. I think the >> main reason is that the simpler 64-bit case could stay as a macro >> (because it only uses "addr" and "size" once), but honestly, that >> "simplification" doesn't help when you then need to have that #ifdef >> for the 32-bit case and an inline function anyway. > > I'll have to leave that to the powerpc folks. The intent was to not > change the behavior (and I even fucked that up for the the size == 0 > case). > >> However, I suspect a bigger reason for the actual performance >> degradation would be the patch that makes things use "write_iter()" >> for writing, even when a simpler "write()" exists. > > Except that we do not actually have such a patch. For normal user > writes we only use ->write_iter if ->write is not present. But what > shows up in the profile is that /dev/zero only has a read_iter op and > not a normal read. 
I've added a patch below that implements a normal > read which might help a tad with this workload, but should not be part > of a regression. > > Also Christophe: can you bisect which patch starts this? Is it really > this last patch in the series?

5.9-rc2: 91.5MB/s
Patch 1: 74.9MB/s
Patch 2: 97.9MB/s
Patch 3: 97.7MB/s
Patch 4 to 9: 97.9MB/s
Patch 10: 85.3MB/s
Patch 11: 75.4MB/s

See my other mail, when removing CONFIG_STACKPROTECTOR, I get a stable 99.8MB/s throughput. Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-03 7:11 ` Christoph Hellwig 2020-09-03 7:27 ` Christophe Leroy @ 2020-09-03 8:55 ` Christophe Leroy 1 sibling, 0 replies; 45+ messages in thread From: Christophe Leroy @ 2020-09-03 8:55 UTC (permalink / raw) To: Christoph Hellwig, Linus Torvalds Cc: Al Viro, Michael Ellerman, the arch/x86 maintainers, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, Linux Kernel Mailing List Le 03/09/2020 à 09:11, Christoph Hellwig a écrit : > > Except that we do not actually have such a patch. For normal user > writes we only use ->write_iter if ->write is not present. But what > shows up in the profile is that /dev/zero only has a read_iter op and > not a normal read. I've added a patch below that implements a normal > read which might help a tad with this workload, but should not be part > of a regression. > With that patch below, throughput is 113.5MB/s (instead of 99.9MB/s). So a 14% improvement. That's not bad. Christophe > > --- > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > index abd4ffdc8cdebc..1dc99ab158457a 100644 > --- a/drivers/char/mem.c > +++ b/drivers/char/mem.c > @@ -726,6 +726,27 @@ static ssize_t read_iter_zero(struct kiocb *iocb, struct iov_iter *iter) > return written; > } > > +static ssize_t read_zero(struct file *file, char __user *buf, > + size_t count, loff_t *ppos) > +{ > + size_t cleared = 0; > + > + while (count) { > + size_t chunk = min_t(size_t, count, PAGE_SIZE); > + > + if (clear_user(buf + cleared, chunk)) > + return cleared ? cleared : -EFAULT; > + cleared += chunk; > + count -= chunk; > + > + if (signal_pending(current)) > + return cleared ? 
cleared : -ERESTARTSYS; > + cond_resched(); > + } > + > + return cleared; > +} > + > static int mmap_zero(struct file *file, struct vm_area_struct *vma) > { > #ifndef CONFIG_MMU > @@ -921,6 +942,7 @@ static const struct file_operations zero_fops = { > .llseek = zero_lseek, > .write = write_zero, > .read_iter = read_iter_zero, > + .read = read_zero, > .write_iter = write_iter_zero, > .mmap = mmap_zero, > .get_unmapped_area = get_unmapped_area_zero, > ^ permalink raw reply [flat|nested] 45+ messages in thread
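The return convention of the read_zero() loop quoted above (a partial count if some chunks were already cleared, an error only if nothing was) can be modelled in user space. This is a rough sketch, not the kernel code: CHUNK stands in for PAGE_SIZE, fake_clear() is a hypothetical stand-in for clear_user() that reports a fault when a chosen address falls inside the chunk, and the signal_pending() and cond_resched() steps are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096 /* stands in for PAGE_SIZE */

/* Hypothetical stand-in for clear_user(): nonzero means the chunk faulted. */
static int fake_clear(char *buf, size_t n, const char *fault_at)
{
	if (fault_at && fault_at >= buf && fault_at < buf + n)
		return 1;
	memset(buf, 0, n);
	return 0;
}

/* Same loop shape as read_zero(), minus signal and rescheduling checks. */
static long read_zero_model(char *buf, size_t count, const char *fault_at)
{
	size_t cleared = 0;

	while (count) {
		size_t chunk = count < CHUNK ? count : CHUNK;

		if (fake_clear(buf + cleared, chunk, fault_at))
			return cleared ? (long)cleared : -EFAULT;
		cleared += chunk;
		count -= chunk;
	}
	return (long)cleared;
}
```

In this model a fault in the second chunk yields a short read of one chunk rather than an error, while a fault in the very first chunk yields -EFAULT.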
* Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs() 2020-09-02 18:02 ` Linus Torvalds 2020-09-03 7:11 ` Christoph Hellwig @ 2020-09-03 7:20 ` Christophe Leroy 1 sibling, 0 replies; 45+ messages in thread From: Christophe Leroy @ 2020-09-03 7:20 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Al Viro, Michael Ellerman, the arch/x86 maintainers, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, Linux Kernel Mailing List On 02/09/2020 at 20:02, Linus Torvalds wrote: > On Wed, Sep 2, 2020 at 8:17 AM Christophe Leroy > <christophe.leroy@csgroup.eu> wrote: >> >> >> With this fix, I get >> >> root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M >> 536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s >> >> That's still far from the 91.7MB/s I get with 5.9-rc2, but better than >> the 65.8MB/s I got yesterday with your series. Still some way to go though. > > I don't see why this change would make any difference. > Neither do I. Looks like nowadays, CONFIG_STACKPROTECTOR has become a default. I rebuilt the kernel without it, I now get a throughput of 99.8MB/s both without and with this series. Looking at the generated code (GCC 10.1), a small change in a function seems to make large changes in the generated code when CONFIG_STACKPROTECTOR is set. In addition to that, trivial functions which don't use the stack at all get a stack frame anyway when CONFIG_STACKPROTECTOR is set, although that's only -fstack-protector-strong. And there is no canary check. 
Without CONFIG_STACKPROTECTOR: c01572a0 <no_llseek>: c01572a0: 38 60 ff ff li r3,-1 c01572a4: 38 80 ff e3 li r4,-29 c01572a8: 4e 80 00 20 blr With CONFIG_STACKPROTECTOR (regardless of CONFIG_STACKPROTECTOR_STRONG or not): c0164e08 <no_llseek>: c0164e08: 94 21 ff f0 stwu r1,-16(r1) c0164e0c: 38 60 ff ff li r3,-1 c0164e10: 38 80 ff e3 li r4,-29 c0164e14: 38 21 00 10 addi r1,r1,16 c0164e18: 4e 80 00 20 blr Wondering why CONFIG_STACKPROTECTOR has become the default. 
It seems to imply a 10% performance loss even in the best case (91.7MB/s versus 99.8MB/s).

Note that without CONFIG_STACKPROTECTOR_STRONG, I'm at 99.3MB/s, so it's really the _STRONG alternative that hurts.

Christophe

^ permalink raw reply [flat|nested] 45+ messages in thread
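The throughput figures quoted in this message can be turned into a relative loss with one line of arithmetic. The sketch below is only an illustration of that calculation; the helper name loss_percent is made up for this note and the inputs are the MB/s values reported above, taken at face value.

```c
#include <assert.h>

/*
 * Relative throughput loss, in percent, of `measured` against `baseline`.
 * Illustrative helper, not part of any kernel or tool.
 */
static double loss_percent(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (baseline - measured) / baseline * 100.0;
}
```

With the numbers above, loss_percent(99.8, 91.7) comes to a little over 8%, which the message rounds to 10%, while loss_percent(99.8, 99.3) is about 0.5%.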
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2
  2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2020-08-27 15:00 ` [PATCH 10/10] powerpc: remove address space overrides using set_fs() Christoph Hellwig
@ 2020-08-27 15:31 ` Christoph Hellwig
  2020-09-01 17:13 ` Christophe Leroy
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2020-08-27 15:31 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel

> Diffstat:

Actually no diffstat here as David Howells pointed out. Here we go:

 arch/Kconfig                           |    3
 arch/alpha/Kconfig                     |    1
 arch/arc/Kconfig                       |    1
 arch/arm/Kconfig                       |    1
 arch/arm64/Kconfig                     |    1
 arch/c6x/Kconfig                       |    1
 arch/csky/Kconfig                      |    1
 arch/h8300/Kconfig                     |    1
 arch/hexagon/Kconfig                   |    1
 arch/ia64/Kconfig                      |    1
 arch/m68k/Kconfig                      |    1
 arch/microblaze/Kconfig                |    1
 arch/mips/Kconfig                      |    1
 arch/nds32/Kconfig                     |    1
 arch/nios2/Kconfig                     |    1
 arch/openrisc/Kconfig                  |    1
 arch/parisc/Kconfig                    |    1
 arch/powerpc/include/asm/processor.h   |    7 -
 arch/powerpc/include/asm/thread_info.h |    5 -
 arch/powerpc/include/asm/uaccess.h     |   78 ++++++++----------
 arch/powerpc/kernel/signal.c           |    3
 arch/powerpc/lib/sstep.c               |    6 -
 arch/riscv/Kconfig                     |    1
 arch/s390/Kconfig                      |    1
 arch/sh/Kconfig                        |    1
 arch/sparc/Kconfig                     |    1
 arch/um/Kconfig                        |    1
 arch/x86/ia32/ia32_aout.c              |    1
 arch/x86/include/asm/page_32_types.h   |   11 ++
 arch/x86/include/asm/page_64_types.h   |   38 +++++++++
 arch/x86/include/asm/processor.h       |   60 ---------------
 arch/x86/include/asm/thread_info.h     |    2
 arch/x86/include/asm/uaccess.h         |   26 ------
 arch/x86/kernel/asm-offsets.c          |    3
 arch/x86/lib/getuser.S                 |   28 ++++---
 arch/x86/lib/putuser.S                 |   21 +++--
 arch/xtensa/Kconfig                    |    1
 drivers/misc/lkdtm/bugs.c              |    4 +
 drivers/misc/lkdtm/usercopy.c          |    4 +
 fs/read_write.c                        |   69 ++++++++++-------
 fs/splice.c                            |  130 +++------------------------------
 include/linux/fs.h                     |    2
 include/linux/uaccess.h                |   18 ++++
 lib/test_bitmap.c                      |   10 ++
 44 files changed, 235 insertions(+), 316 deletions(-)

^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2
  2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2020-08-27 15:31 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig
@ 2020-09-01 17:13 ` Christophe Leroy
  2020-09-01 17:25   ` Al Viro
  2020-10-27  9:29 ` [PATCH 02/10] fs: don't allow splice read/write without explicit ops David Howells
  2020-10-27  9:51 ` David Howells
  13 siblings, 1 reply; 45+ messages in thread
From: Christophe Leroy @ 2020-09-01 17:13 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel

Hi Christoph,

On 27/08/2020 at 17:00, Christoph Hellwig wrote:
> Hi all,
>
> this series removes the last set_fs() used to force a kernel address
> space for the uaccess code in the kernel read/write/splice code, and then
> stops implementing the address space overrides entirely for x86 and
> powerpc.
>
> The file system part has been posted a few times, and the read/write side
> has been pretty much unchanged. For splice this series drops the
> conversion of the seq_file and sysctl code to the iter ops, and thus loses
> the splice support for them. The reason for that is that it caused a lot
> of churn for not much use - splice for these small files really isn't much
> of a win, even if existing userspace uses it. All callers I found do the
> proper fallback, but if this turns out to be an issue the conversion can
> be resurrected.
>
> Besides x86 and powerpc I plan to eventually convert all other
> architectures, although this will be a slow process, starting with the
> easier ones once the infrastructure is merged. The process to convert
> architectures is roughly:
>
> (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
> (2) implement __get_kernel_nofault and __put_kernel_nofault
> (3) remove the arch specific address limitation functionality
>
> Changes since v1:
> - drop the patch to remove the non-iter ops for /dev/zero and
> /dev/null as they caused a performance regression
> - don't enable user access in __get_kernel on powerpc
> - xfail the set_fs() based lkdtm tests
>
> Diffstat:
>

I'm still sceptical about the results I get.

With 5.9-rc2:

root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M
1048576+0 records in
1048576+0 records out
536870912 bytes (512.0MB) copied, 5.585880 seconds, 91.7MB/s
real    0m 5.59s
user    0m 1.40s
sys     0m 4.19s

With your series:

root@vgoippro:/tmp# time dd if=/dev/zero of=/dev/null count=1M
1048576+0 records in
1048576+0 records out
536870912 bytes (512.0MB) copied, 7.780540 seconds, 65.8MB/s
real    0m 7.79s
user    0m 2.12s
sys     0m 5.66s

Top of perf report of a standard perf record:

With 5.9-rc2:

    20.31%  dd  [kernel.kallsyms]  [k] __arch_clear_user
     8.37%  dd  [kernel.kallsyms]  [k] transfer_to_syscall
     7.37%  dd  [kernel.kallsyms]  [k] __fsnotify_parent
     6.95%  dd  [kernel.kallsyms]  [k] iov_iter_zero
     5.72%  dd  [kernel.kallsyms]  [k] new_sync_read
     4.87%  dd  [kernel.kallsyms]  [k] vfs_write
     4.47%  dd  [kernel.kallsyms]  [k] vfs_read
     3.07%  dd  [kernel.kallsyms]  [k] ksys_write
     2.77%  dd  [kernel.kallsyms]  [k] ksys_read
     2.65%  dd  [kernel.kallsyms]  [k] __fget_light
     2.37%  dd  [kernel.kallsyms]  [k] __fdget_pos
     2.35%  dd  [kernel.kallsyms]  [k] memset
     1.53%  dd  [kernel.kallsyms]  [k] rw_verify_area
     1.52%  dd  [kernel.kallsyms]  [k] read_iter_zero

With your series:

    19.60%  dd  [kernel.kallsyms]  [k] __arch_clear_user
    10.92%  dd  [kernel.kallsyms]  [k] iov_iter_zero
     9.50%  dd  [kernel.kallsyms]  [k] vfs_write
     8.97%  dd  [kernel.kallsyms]  [k] __fsnotify_parent
     5.46%  dd  [kernel.kallsyms]  [k] transfer_to_syscall
     5.42%  dd  [kernel.kallsyms]  [k] vfs_read
     3.58%  dd  [kernel.kallsyms]  [k] ksys_read
     2.84%  dd  [kernel.kallsyms]  [k] read_iter_zero
     2.24%  dd  [kernel.kallsyms]  [k] ksys_write
     1.80%  dd  [kernel.kallsyms]  [k] __fget_light
     1.34%  dd  [kernel.kallsyms]  [k] __fdget_pos
     0.91%  dd  [kernel.kallsyms]  [k] memset
     0.91%  dd  [kernel.kallsyms]  [k] rw_verify_area

Christophe

^ permalink raw reply [flat|nested] 45+ messages in thread
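The dd figures in this message can be reproduced from the byte count and the elapsed time. With count=1M and dd's default 512-byte block size, 1048576 records amount to 536870912 bytes, which matches the output. The sketch below assumes the "MB/s" that this dd prints are binary megabytes (1024 * 1024 bytes), an assumption that agrees with the printed values; mib_per_sec is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Throughput in binary megabytes per second for `bytes` moved in `seconds`.
 * Assumption: the dd shown in the thread reports MiB/s under the label "MB/s".
 */
static double mib_per_sec(uint64_t bytes, double seconds)
{
	assert(seconds > 0.0);
	return (double)bytes / (1024.0 * 1024.0) / seconds;
}
```

Feeding in 536870912 bytes with 5.585880 s and 7.780540 s gives roughly 91.7 and 65.8, the two values reported for 5.9-rc2 and for the series.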
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2 2020-09-01 17:13 ` Christophe Leroy @ 2020-09-01 17:25 ` Al Viro 2020-09-01 17:42 ` Matthew Wilcox ` (3 more replies) 0 siblings, 4 replies; 45+ messages in thread From: Al Viro @ 2020-09-01 17:25 UTC (permalink / raw) To: Christophe Leroy Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote: > 10.92% dd [kernel.kallsyms] [k] iov_iter_zero Interesting... Could you get an instruction-level profile inside iov_iter_zero(), along with the disassembly of that sucker? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2
  2020-09-01 17:25 ` Al Viro
@ 2020-09-01 17:42   ` Matthew Wilcox
  2020-09-01 18:39   ` Christophe Leroy
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2020-09-01 17:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Christophe Leroy, Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel

On Tue, Sep 01, 2020 at 06:25:12PM +0100, Al Viro wrote:
> On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote:
>
> > 10.92% dd [kernel.kallsyms] [k] iov_iter_zero
>
> Interesting... Could you get an instruction-level profile inside iov_iter_zero(),
> along with the disassembly of that sucker?

Also, does [1] make any difference? Probably not since it's translating O flags into IOCB flags instead of RWF flags into IOCB flags.

I wonder if there's a useful trick we can play here ... something like:

static inline int iocb_flags(struct file *file)
{
	int res = 0;

	/* Fast path: none of the relevant open flags set and the inode is not sync. */
	if (likely(!(file->f_flags & (O_APPEND | O_DIRECT | O_DSYNC | __O_SYNC))) &&
	    !IS_SYNC(file->f_mapping->host))
		return res;
	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
	if (file->f_flags & O_DIRECT)
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
		res |= IOCB_DSYNC;
	if (file->f_flags & __O_SYNC)
		res |= IOCB_SYNC;
	return res;
}

Can we do something like force O_DSYNC to be set if the inode IS_SYNC() at the time of open? Or is setting the sync bit on the inode required to affect currently-open files?

[1] https://lore.kernel.org/linux-fsdevel/95de7ce4-9254-39f1-304f-4455f66bf0f4@kernel.dk/

^ permalink raw reply [flat|nested] 45+ messages in thread
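A fast-path test of this shape depends on how the mask check is parenthesized, because in C the unary `!` binds tighter than `&`. The following user-space sketch illustrates the difference. The flag values and function names are hypothetical and chosen only for the demonstration; they are not the kernel's O_* definitions.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; NOT the kernel's O_* values. */
#define F_APPEND	0x1u
#define F_DIRECT	0x2u
#define F_DSYNC		0x4u
#define F_SYNC		0x8u
#define F_SLOW_MASK	(F_APPEND | F_DIRECT | F_DSYNC | F_SYNC)

/* Intended test: true when none of the slow-path bits is set. */
static bool no_slow_flags(unsigned int flags)
{
	return !(flags & F_SLOW_MASK);
}

/*
 * What "!flags & F_SLOW_MASK" parses as: the negation applies to the
 * whole flags word first, so the result is true only when flags == 0.
 */
static bool no_slow_flags_unparenthesized(unsigned int flags)
{
	return (!flags) & F_SLOW_MASK;
}
```

For a flags word that carries only an unrelated bit (0x40 in this made-up layout), the first helper reports that the fast path applies while the second does not, so the shortcut would be skipped for almost every open file.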
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2 2020-09-01 17:25 ` Al Viro 2020-09-01 17:42 ` Matthew Wilcox @ 2020-09-01 18:39 ` Christophe Leroy 2020-09-01 19:01 ` Christophe Leroy 2020-09-02 8:10 ` Christoph Hellwig 3 siblings, 0 replies; 45+ messages in thread From: Christophe Leroy @ 2020-09-01 18:39 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel Le 01/09/2020 à 19:25, Al Viro a écrit : > On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote: > >> 10.92% dd [kernel.kallsyms] [k] iov_iter_zero > > Interesting... Could you get an instruction-level profile inside iov_iter_zero(), > along with the disassembly of that sucker? > Output of perf annotate: Percent | Source code & Disassembly of vmlinux for cpu-clock (3579 samples) --------------------------------------------------------------------------------- : : : : Disassembly of section .text: : : c02cb3a4 <iov_iter_zero>: : iov_iter_zero(): 2.24 : c02cb3a4: stwu r1,-80(r1) 0.31 : c02cb3a8: stw r30,72(r1) 0.00 : c02cb3ac: mr r30,r4 0.11 : c02cb3b0: stw r31,76(r1) 0.00 : c02cb3b4: mr r31,r3 1.06 : c02cb3b8: stw r27,60(r1) : iov_iter_type(): 0.03 : c02cb3bc: lwz r10,0(r4) 0.06 : c02cb3c0: rlwinm r9,r10,0,0,30 : iov_iter_zero(): 0.03 : c02cb3c4: cmpwi r9,32 0.00 : c02cb3c8: lwz r9,624(r2) 2.15 : c02cb3cc: stw r9,28(r1) 0.00 : c02cb3d0: li r9,0 0.00 : c02cb3d4: beq c02cb520 <iov_iter_zero+0x17c> 0.14 : c02cb3d8: lwz r9,8(r4) 0.08 : c02cb3dc: cmplw r9,r3 0.00 : c02cb3e0: mr r27,r9 0.03 : c02cb3e4: bgt c02cb4fc <iov_iter_zero+0x158> 1.34 : c02cb3e8: cmpwi r9,0 0.00 : c02cb3ec: beq c02cb4d0 <iov_iter_zero+0x12c> 0.11 : c02cb3f0: andi. r8,r10,16 0.17 : c02cb3f4: lwz r31,4(r30) 1.79 : c02cb3f8: bne c02cb61c <iov_iter_zero+0x278> 0.00 : c02cb3fc: andi. r8,r10,8 0.06 : c02cb400: bne c02cb770 <iov_iter_zero+0x3cc> 0.22 : c02cb404: andi. 
r10,r10,64 0.03 : c02cb408: bne c02cb88c <iov_iter_zero+0x4e8> 0.11 : c02cb40c: stw r29,68(r1) 1.59 : c02cb410: stw r28,64(r1) 0.03 : c02cb414: lwz r28,12(r30) 0.00 : c02cb418: lwz r7,4(r28) 1.87 : c02cb41c: subf r29,r31,r7 0.28 : c02cb420: cmplw r29,r27 0.03 : c02cb424: bgt c02cb50c <iov_iter_zero+0x168> 0.03 : c02cb428: cmpwi r29,0 0.00 : c02cb42c: beq c02cb898 <iov_iter_zero+0x4f4> 1.34 : c02cb430: lwz r3,0(r28) : __access_ok(): 0.00 : c02cb434: lis r10,-16384 : iov_iter_zero(): 0.36 : c02cb438: add r3,r3,r31 : __access_ok(): 0.03 : c02cb43c: cmplw r3,r10 1.79 : c02cb440: bge c02cb514 <iov_iter_zero+0x170> 13.19 : c02cb444: subf r10,r3,r10 : clear_user(): 0.00 : c02cb448: cmplw r29,r10 4.41 : c02cb44c: mflr r0 0.00 : c02cb450: stw r0,84(r1) 0.00 : c02cb454: bgt c02cb8c4 <iov_iter_zero+0x520> 0.00 : c02cb458: mr r4,r29 0.00 : c02cb45c: bl c001a41c <__arch_clear_user> : iov_iter_zero(): 0.70 : c02cb460: add r31,r31,r29 0.00 : c02cb464: cmpwi r3,0 17.13 : c02cb468: subf r29,r29,r27 0.00 : c02cb46c: subf r31,r3,r31 1.20 : c02cb470: add r29,r29,r3 0.00 : c02cb474: beq c02cb8b8 <iov_iter_zero+0x514> 0.00 : c02cb478: lwz r9,8(r30) 0.00 : c02cb47c: subf r10,r27,r29 0.00 : c02cb480: lwz r0,84(r1) 0.00 : c02cb484: subf r27,r29,r27 0.00 : c02cb488: add r9,r10,r9 0.00 : c02cb48c: lwz r7,4(r28) 0.00 : c02cb490: lwz r10,12(r30) 0.00 : c02cb494: mtlr r0 1.65 : c02cb498: cmplw r31,r7 14.61 : c02cb49c: bne c02cb4a8 <iov_iter_zero+0x104> 1.65 : c02cb4a0: addi r28,r28,8 0.00 : c02cb4a4: li r31,0 14.92 : c02cb4a8: lwz r8,16(r30) 0.00 : c02cb4ac: subf r10,r10,r28 1.12 : c02cb4b0: srawi r10,r10,3 0.56 : c02cb4b4: stw r28,12(r30) 0.00 : c02cb4b8: subf r10,r10,r8 1.23 : c02cb4bc: stw r10,16(r30) 0.00 : c02cb4c0: lwz r28,64(r1) 0.56 : c02cb4c4: lwz r29,68(r1) 0.00 : c02cb4c8: stw r9,8(r30) 2.12 : c02cb4cc: stw r31,4(r30) 0.00 : c02cb4d0: lwz r9,28(r1) 0.61 : c02cb4d4: lwz r10,624(r2) 0.00 : c02cb4d8: xor. 
r9,r9,r10 0.00 : c02cb4dc: li r10,0 0.00 : c02cb4e0: bne c02cb9a8 <iov_iter_zero+0x604> 0.00 : c02cb4e4: mr r3,r27 0.00 : c02cb4e8: lwz r30,72(r1) 1.73 : c02cb4ec: lwz r27,60(r1) 0.50 : c02cb4f0: lwz r31,76(r1) 0.00 : c02cb4f4: addi r1,r1,80 0.00 : c02cb4f8: blr 0.00 : c02cb4fc: cmpwi r9,0 0.00 : c02cb500: mr r27,r3 0.00 : c02cb504: beq c02cb4d0 <iov_iter_zero+0x12c> 0.00 : c02cb508: b c02cb3f0 <iov_iter_zero+0x4c> 0.00 : c02cb50c: mr r29,r27 0.00 : c02cb510: b c02cb428 <iov_iter_zero+0x84> : __access_ok(): 0.00 : c02cb514: li r27,0 0.00 : c02cb518: mr r10,r28 0.00 : c02cb51c: b c02cb498 <iov_iter_zero+0xf4> : pipe_zero(): 0.00 : c02cb520: mflr r0 0.00 : c02cb524: stw r26,56(r1) 0.00 : c02cb528: stw r0,84(r1) 0.00 : c02cb52c: mr r3,r4 0.00 : c02cb530: stw r28,64(r1) 0.00 : c02cb534: lwz r28,12(r4) 0.00 : c02cb538: lwz r26,40(r28) 0.00 : c02cb53c: bl c02c8e48 <sanity> 0.00 : c02cb540: cmpwi r3,0 0.00 : c02cb544: bne c02cb560 <iov_iter_zero+0x1bc> 0.00 : c02cb548: lwz r0,84(r1) 0.00 : c02cb54c: li r27,0 0.00 : c02cb550: lwz r26,56(r1) 0.00 : c02cb554: lwz r28,64(r1) 0.00 : c02cb558: mtlr r0 0.00 : c02cb55c: b c02cb4d0 <iov_iter_zero+0x12c> 0.00 : c02cb560: mr r4,r31 0.00 : c02cb564: addi r6,r1,24 0.00 : c02cb568: addi r5,r1,20 0.00 : c02cb56c: mr r3,r30 0.00 : c02cb570: bl c02c9030 <push_pipe> 0.00 : c02cb574: mr. 
r27,r3 0.00 : c02cb578: beq c02cb548 <iov_iter_zero+0x1a4> 0.00 : c02cb57c: lwz r4,24(r1) 0.00 : c02cb580: addi r26,r26,-1 0.00 : c02cb584: lwz r9,20(r1) 0.00 : c02cb588: stw r25,52(r1) 0.00 : c02cb58c: li r25,0 0.00 : c02cb590: stw r29,68(r1) 0.00 : c02cb594: mr r29,r27 0.00 : c02cb598: subfic r31,r4,4096 0.00 : c02cb59c: cmplw r31,r29 0.00 : c02cb5a0: ble c02cb5a8 <iov_iter_zero+0x204> 0.00 : c02cb5a4: mr r31,r29 0.00 : c02cb5a8: and r9,r26,r9 0.00 : c02cb5ac: lwz r8,80(r28) 0.00 : c02cb5b0: rlwinm r10,r9,1,0,30 0.00 : c02cb5b4: add r9,r10,r9 0.00 : c02cb5b8: rlwinm r9,r9,3,0,28 0.00 : c02cb5bc: lwzx r3,r8,r9 0.00 : c02cb5c0: mr r5,r31 0.00 : c02cb5c4: bl c02c92ec <memzero_page> 0.00 : c02cb5c8: subf. r29,r31,r29 0.00 : c02cb5cc: lwz r9,20(r1) 0.00 : c02cb5d0: li r4,0 0.00 : c02cb5d4: lwz r10,24(r1) 0.00 : c02cb5d8: stw r9,16(r30) 0.00 : c02cb5dc: addi r9,r9,1 0.00 : c02cb5e0: add r10,r10,r31 0.00 : c02cb5e4: stw r9,20(r1) 0.00 : c02cb5e8: stw r10,4(r30) 0.00 : c02cb5ec: stw r25,24(r1) 0.00 : c02cb5f0: bne c02cb598 <iov_iter_zero+0x1f4> 0.00 : c02cb5f4: lwz r9,8(r30) 0.00 : c02cb5f8: subf r9,r27,r9 0.00 : c02cb5fc: stw r9,8(r30) : iov_iter_zero(): 0.00 : c02cb600: lwz r0,84(r1) 0.00 : c02cb604: lwz r25,52(r1) 0.00 : c02cb608: lwz r26,56(r1) 0.00 : c02cb60c: mtlr r0 0.00 : c02cb610: lwz r28,64(r1) 0.00 : c02cb614: lwz r29,68(r1) 0.00 : c02cb618: b c02cb4d0 <iov_iter_zero+0x12c> 0.00 : c02cb61c: stw r23,44(r1) 0.00 : c02cb620: cmpwi r27,0 0.00 : c02cb624: stw r28,64(r1) 0.00 : c02cb628: mr r23,r27 0.00 : c02cb62c: stw r24,48(r1) 0.00 : c02cb630: li r28,0 0.00 : c02cb634: lwz r24,12(r30) 0.00 : c02cb638: mr r8,r24 0.00 : c02cb63c: beq c02cb714 <iov_iter_zero+0x370> 0.00 : c02cb640: mflr r0 0.00 : c02cb644: stw r25,52(r1) 0.00 : c02cb648: stw r0,84(r1) 0.00 : c02cb64c: stw r26,56(r1) 0.00 : c02cb650: stw r29,68(r1) 0.00 : c02cb654: rlwinm r25,r28,1,0,30 0.00 : c02cb658: add r25,r25,r28 0.00 : c02cb65c: rlwinm r25,r25,2,0,29 0.00 : c02cb660: add r10,r8,r25 0.00 : 
c02cb664: lwz r26,4(r10) 0.00 : c02cb668: mr r29,r25 0.00 : c02cb66c: lwz r9,8(r10) 0.00 : c02cb670: subf r26,r31,r26 0.00 : c02cb674: cmplw r26,r23 0.00 : c02cb678: add r9,r31,r9 0.00 : c02cb67c: clrlwi r4,r9,20 0.00 : c02cb680: ble c02cb688 <iov_iter_zero+0x2e4> 0.00 : c02cb684: mr r26,r23 0.00 : c02cb688: subfic r7,r4,4096 0.00 : c02cb68c: cmplw r26,r7 0.00 : c02cb690: ble c02cb698 <iov_iter_zero+0x2f4> 0.00 : c02cb694: mr r26,r7 0.00 : c02cb698: cmpwi r26,0 0.00 : c02cb69c: beq c02cb6c0 <iov_iter_zero+0x31c> 0.00 : c02cb6a0: lwz r3,0(r10) 0.00 : c02cb6a4: rlwinm r9,r9,25,7,26 0.00 : c02cb6a8: mr r5,r26 0.00 : c02cb6ac: add r3,r3,r9 0.00 : c02cb6b0: bl c02c92ec <memzero_page> : bvec_iter_advance(): 0.00 : c02cb6b4: cmplw r23,r26 : iov_iter_zero(): 0.00 : c02cb6b8: lwz r8,12(r30) : bvec_iter_advance(): 0.00 : c02cb6bc: blt c02cb850 <iov_iter_zero+0x4ac> 0.00 : c02cb6c0: add. r31,r31,r26 0.00 : c02cb6c4: subf r23,r26,r23 0.00 : c02cb6c8: addi r10,r8,4 0.00 : c02cb6cc: bne c02cb6e4 <iov_iter_zero+0x340> 0.00 : c02cb6d0: b c02cb6f0 <iov_iter_zero+0x34c> 0.00 : c02cb6d4: subf. 
r31,r9,r31 0.00 : c02cb6d8: addi r28,r28,1 0.00 : c02cb6dc: addi r29,r29,12 0.00 : c02cb6e0: beq c02cb760 <iov_iter_zero+0x3bc> 0.00 : c02cb6e4: lwzx r9,r10,r29 0.00 : c02cb6e8: cmplw r31,r9 0.00 : c02cb6ec: bge c02cb6d4 <iov_iter_zero+0x330> : iov_iter_zero(): 0.00 : c02cb6f0: cmpwi r23,0 0.00 : c02cb6f4: bne c02cb654 <iov_iter_zero+0x2b0> 0.00 : c02cb6f8: add r8,r8,r29 0.00 : c02cb6fc: lwz r0,84(r1) 0.00 : c02cb700: lwz r9,8(r30) 0.00 : c02cb704: lwz r25,52(r1) 0.00 : c02cb708: mtlr r0 0.00 : c02cb70c: lwz r26,56(r1) 0.00 : c02cb710: lwz r29,68(r1) 0.00 : c02cb714: subf r24,r24,r8 0.00 : c02cb718: stw r8,12(r30) 0.00 : c02cb71c: srawi r6,r24,2 0.00 : c02cb720: lwz r7,16(r30) 0.00 : c02cb724: rlwinm r10,r24,0,0,29 0.00 : c02cb728: add r10,r10,r6 0.00 : c02cb72c: rlwinm r8,r10,4,0,27 0.00 : c02cb730: add r10,r10,r8 0.00 : c02cb734: rlwinm r8,r10,8,0,23 0.00 : c02cb738: add r10,r10,r8 0.00 : c02cb73c: rlwinm r8,r10,16,0,15 0.00 : c02cb740: add r10,r10,r8 0.00 : c02cb744: add r10,r7,r10 0.00 : c02cb748: stw r10,16(r30) 0.00 : c02cb74c: subf r9,r27,r9 0.00 : c02cb750: lwz r23,44(r1) 0.00 : c02cb754: lwz r24,48(r1) 0.00 : c02cb758: lwz r28,64(r1) 0.00 : c02cb75c: b c02cb4c8 <iov_iter_zero+0x124> 0.00 : c02cb760: rlwinm r29,r28,1,0,30 0.00 : c02cb764: add r29,r29,r28 0.00 : c02cb768: rlwinm r29,r29,2,0,29 0.00 : c02cb76c: b c02cb6f0 <iov_iter_zero+0x34c> 0.00 : c02cb770: mflr r0 0.00 : c02cb774: stw r26,56(r1) 0.00 : c02cb778: stw r0,84(r1) 0.00 : c02cb77c: stw r28,64(r1) 0.00 : c02cb780: stw r29,68(r1) 0.00 : c02cb784: lwz r28,12(r30) 0.00 : c02cb788: lwz r29,4(r28) 0.00 : c02cb78c: subf r29,r31,r29 0.00 : c02cb790: cmplw r29,r27 0.00 : c02cb794: ble c02cb79c <iov_iter_zero+0x3f8> 0.00 : c02cb798: mr r29,r27 0.00 : c02cb79c: cmpwi r29,0 0.00 : c02cb7a0: beq c02cb8d8 <iov_iter_zero+0x534> 0.00 : c02cb7a4: lwz r3,0(r28) 0.00 : c02cb7a8: mr r5,r29 0.00 : c02cb7ac: li r4,0 0.00 : c02cb7b0: add r3,r3,r31 0.00 : c02cb7b4: subf r26,r29,r27 0.00 : c02cb7b8: bl c001999c 
<memset> 0.00 : c02cb7bc: add r31,r31,r29 0.00 : c02cb7c0: cmpwi r26,0 0.00 : c02cb7c4: bne c02cb818 <iov_iter_zero+0x474> 0.00 : c02cb7c8: lwz r9,4(r28) 0.00 : c02cb7cc: cmpw r9,r31 0.00 : c02cb7d0: bne c02cb7dc <iov_iter_zero+0x438> 0.00 : c02cb7d4: addi r28,r28,8 0.00 : c02cb7d8: li r31,0 0.00 : c02cb7dc: lwz r9,12(r30) 0.00 : c02cb7e0: lwz r8,16(r30) 0.00 : c02cb7e4: subf r10,r9,r28 0.00 : c02cb7e8: stw r28,12(r30) 0.00 : c02cb7ec: srawi r10,r10,3 0.00 : c02cb7f0: lwz r9,8(r30) 0.00 : c02cb7f4: subf r10,r10,r8 0.00 : c02cb7f8: stw r10,16(r30) 0.00 : c02cb7fc: subf r9,r27,r9 0.00 : c02cb800: lwz r0,84(r1) 0.00 : c02cb804: lwz r26,56(r1) 0.00 : c02cb808: lwz r28,64(r1) 0.00 : c02cb80c: mtlr r0 0.00 : c02cb810: lwz r29,68(r1) 0.00 : c02cb814: b c02cb4c8 <iov_iter_zero+0x124> 0.00 : c02cb818: lwz r31,12(r28) 0.00 : c02cb81c: addi r28,r28,8 0.00 : c02cb820: cmplw r31,r26 0.00 : c02cb824: ble c02cb82c <iov_iter_zero+0x488> 0.00 : c02cb828: mr r31,r26 0.00 : c02cb82c: cmpwi r31,0 0.00 : c02cb830: beq c02cb818 <iov_iter_zero+0x474> 0.00 : c02cb834: lwz r3,0(r28) 0.00 : c02cb838: mr r5,r31 0.00 : c02cb83c: li r4,0 0.00 : c02cb840: bl c001999c <memset> 0.00 : c02cb844: subf. 
r26,r31,r26 0.00 : c02cb848: beq c02cb7c8 <iov_iter_zero+0x424> 0.00 : c02cb84c: b c02cb818 <iov_iter_zero+0x474> : bvec_iter_advance(): 0.00 : c02cb850: lis r9,-16236 0.00 : c02cb854: lbz r10,-20170(r9) 0.00 : c02cb858: cmpwi r10,0 0.00 : c02cb85c: beq c02cb868 <iov_iter_zero+0x4c4> : iov_iter_zero(): 0.00 : c02cb860: add r8,r8,r25 0.00 : c02cb864: b c02cb6fc <iov_iter_zero+0x358> : bvec_iter_advance(): 0.00 : c02cb868: lis r3,-16253 0.00 : c02cb86c: li r10,1 0.00 : c02cb870: addi r3,r3,7580 0.00 : c02cb874: stb r10,-20170(r9) 0.00 : c02cb878: bl c0029b1c <__warn_printk> 0.00 : c02cb87c: twui r0,0 : iov_iter_zero(): 0.00 : c02cb880: lwz r8,12(r30) 0.00 : c02cb884: add r8,r8,r25 0.00 : c02cb888: b c02cb6fc <iov_iter_zero+0x358> 0.00 : c02cb88c: add r31,r31,r27 0.00 : c02cb890: subf r9,r27,r9 0.00 : c02cb894: b c02cb4c8 <iov_iter_zero+0x124> 0.00 : c02cb898: mr r29,r27 0.00 : c02cb89c: cmpwi r29,0 1.65 : c02cb8a0: bne c02cb8e0 <iov_iter_zero+0x53c> 0.00 : c02cb8a4: lwz r9,8(r30) 0.53 : c02cb8a8: lwz r7,4(r28) 0.00 : c02cb8ac: lwz r10,12(r30) 0.00 : c02cb8b0: subf r9,r27,r9 0.00 : c02cb8b4: b c02cb498 <iov_iter_zero+0xf4> 0.25 : c02cb8b8: lwz r0,84(r1) 2.26 : c02cb8bc: mtlr r0 0.00 : c02cb8c0: b c02cb89c <iov_iter_zero+0x4f8> : clear_user(): 0.00 : c02cb8c4: lwz r0,84(r1) 0.00 : c02cb8c8: li r27,0 0.00 : c02cb8cc: mr r10,r28 0.00 : c02cb8d0: mtlr r0 0.00 : c02cb8d4: b c02cb498 <iov_iter_zero+0xf4> : iov_iter_zero(): 0.00 : c02cb8d8: mr r26,r27 0.00 : c02cb8dc: b c02cb7c0 <iov_iter_zero+0x41c> 0.00 : c02cb8e0: stw r26,56(r1) 0.00 : c02cb8e4: stw r25,52(r1) : __access_ok(): 0.00 : c02cb8e8: lis r25,-16384 : iov_iter_zero(): 0.00 : c02cb8ec: lwz r7,12(r28) 0.00 : c02cb8f0: addi r26,r28,8 0.00 : c02cb8f4: mr r31,r29 0.00 : c02cb8f8: cmplw r29,r7 0.00 : c02cb8fc: ble c02cb904 <iov_iter_zero+0x560> 0.00 : c02cb900: mr r31,r7 0.00 : c02cb904: cmpwi r31,0 0.00 : c02cb908: beq c02cba04 <iov_iter_zero+0x660> 0.00 : c02cb90c: lwz r3,0(r26) : __access_ok(): 0.00 : c02cb910: 
cmplw r3,r25 0.00 : c02cb914: bge c02cb980 <iov_iter_zero+0x5dc> 0.00 : c02cb918: subf r9,r3,r25 : clear_user(): 0.00 : c02cb91c: cmplw r31,r9 0.00 : c02cb920: mflr r0 0.00 : c02cb924: stw r0,84(r1) 0.00 : c02cb928: bgt c02cb978 <iov_iter_zero+0x5d4> 0.00 : c02cb92c: mr r4,r31 0.00 : c02cb930: bl c001a41c <__arch_clear_user> : iov_iter_zero(): 0.00 : c02cb934: subf r29,r31,r29 0.00 : c02cb938: cmpwi r3,0 0.00 : c02cb93c: subf r31,r3,r31 0.00 : c02cb940: add r29,r3,r29 0.00 : c02cb944: beq c02cb9cc <iov_iter_zero+0x628> 0.00 : c02cb948: lwz r9,8(r30) 0.00 : c02cb94c: subf r8,r27,r29 0.00 : c02cb950: lwz r0,84(r1) 0.00 : c02cb954: subf r27,r29,r27 0.00 : c02cb958: lwz r7,12(r28) 0.00 : c02cb95c: add r9,r8,r9 0.00 : c02cb960: mr r28,r26 0.00 : c02cb964: lwz r10,12(r30) 0.00 : c02cb968: lwz r25,52(r1) 0.00 : c02cb96c: mtlr r0 0.00 : c02cb970: lwz r26,56(r1) 0.00 : c02cb974: b c02cb498 <iov_iter_zero+0xf4> 0.00 : c02cb978: lwz r0,84(r1) 0.00 : c02cb97c: mtlr r0 0.00 : c02cb980: lwz r9,8(r30) 0.00 : c02cb984: subf r8,r27,r29 0.00 : c02cb988: mr r28,r26 0.00 : c02cb98c: lwz r10,12(r30) 0.00 : c02cb990: lwz r25,52(r1) 0.00 : c02cb994: add r9,r8,r9 0.00 : c02cb998: lwz r26,56(r1) 0.00 : c02cb99c: subf r27,r29,r27 0.00 : c02cb9a0: li r31,0 0.00 : c02cb9a4: b c02cb498 <iov_iter_zero+0xf4> 0.00 : c02cb9a8: mflr r0 0.00 : c02cb9ac: stw r23,44(r1) 0.00 : c02cb9b0: stw r0,84(r1) 0.00 : c02cb9b4: stw r24,48(r1) 0.00 : c02cb9b8: stw r25,52(r1) 0.00 : c02cb9bc: stw r26,56(r1) 0.00 : c02cb9c0: stw r28,64(r1) 0.00 : c02cb9c4: stw r29,68(r1) 0.00 : c02cb9c8: bl c071db48 <__stack_chk_fail> 0.00 : c02cb9cc: cmpwi r29,0 0.00 : c02cb9d0: bne c02cb9fc <iov_iter_zero+0x658> 0.00 : c02cb9d4: lwz r9,8(r30) 0.00 : c02cb9d8: lwz r0,84(r1) 0.00 : c02cb9dc: lwz r7,12(r28) 0.00 : c02cb9e0: subf r9,r27,r9 0.00 : c02cb9e4: mr r28,r26 0.00 : c02cb9e8: lwz r10,12(r30) 0.00 : c02cb9ec: lwz r25,52(r1) 0.00 : c02cb9f0: mtlr r0 0.00 : c02cb9f4: lwz r26,56(r1) 0.00 : c02cb9f8: b c02cb498 
<iov_iter_zero+0xf4> 0.00 : c02cb9fc: lwz r0,84(r1) 0.00 : c02cba00: mtlr r0 0.00 : c02cba04: mr r28,r26 0.00 : c02cba08: b c02cb8ec <iov_iter_zero+0x548> Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
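The __access_ok() and clear_user() fragments in the listing above boil down to two unsigned comparisons against a constant: `lis r10,-16384` loads 0xc0000000, `cmplw r3,r10` / `bge` rejects addresses at or above it, and `subf r10,r3,r10` followed by `cmplw r29,r10` / `bgt` rejects sizes larger than the remaining room. A minimal C sketch of that check, assuming 0xc0000000 is the user address limit on this 32-bit configuration; USER_LIMIT and range_ok are illustrative names, not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constant produced by "lis r10,-16384" (0xc000 << 16); assumed user limit. */
#define USER_LIMIT 0xc0000000u

/* True when [addr, addr + size) lies entirely below USER_LIMIT. */
static bool range_ok(uint32_t addr, uint32_t size)
{
	if (addr >= USER_LIMIT)			/* cmplw r3,r10 ; bge fail */
		return false;
	return size <= USER_LIMIT - addr;	/* subf r10,r3,r10 ; cmplw r29,r10 ; bgt fail */
}
```

Comparing the size against `USER_LIMIT - addr` rather than testing `addr + size` avoids any wrap-around in 32-bit arithmetic.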
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2 2020-09-01 17:25 ` Al Viro 2020-09-01 17:42 ` Matthew Wilcox 2020-09-01 18:39 ` Christophe Leroy @ 2020-09-01 19:01 ` Christophe Leroy 2020-09-02 8:10 ` Christoph Hellwig 3 siblings, 0 replies; 45+ messages in thread From: Christophe Leroy @ 2020-09-01 19:01 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel Le 01/09/2020 à 19:25, Al Viro a écrit : > On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote: > >> 10.92% dd [kernel.kallsyms] [k] iov_iter_zero > > Interesting... Could you get an instruction-level profile inside iov_iter_zero(), > along with the disassembly of that sucker? > As a comparison, hereunder is the perf annotate of the 5.9-rc2 without the series: Percent | Source code & Disassembly of vmlinux for cpu-clock (2581 samples) --------------------------------------------------------------------------------- : : : : Disassembly of section .text: : : c02cbb80 <iov_iter_zero>: : iov_iter_zero(): 3.22 : c02cbb80: stwu r1,-80(r1) 3.25 : c02cbb84: stw r30,72(r1) 0.00 : c02cbb88: mr r30,r4 2.91 : c02cbb8c: stw r31,76(r1) 0.00 : c02cbb90: mr r31,r3 0.19 : c02cbb94: stw r27,60(r1) : iov_iter_type(): 1.82 : c02cbb98: lwz r10,0(r4) 0.54 : c02cbb9c: rlwinm r9,r10,0,0,30 : iov_iter_zero(): 1.98 : c02cbba0: cmpwi r9,32 0.00 : c02cbba4: lwz r9,624(r2) 0.35 : c02cbba8: stw r9,28(r1) 0.00 : c02cbbac: li r9,0 0.00 : c02cbbb0: beq c02cbd00 <iov_iter_zero+0x180> 2.67 : c02cbbb4: lwz r9,8(r4) 1.98 : c02cbbb8: cmplw r9,r3 0.00 : c02cbbbc: mr r27,r9 0.00 : c02cbbc0: bgt c02cbce8 <iov_iter_zero+0x168> 0.31 : c02cbbc4: cmpwi r9,0 0.00 : c02cbbc8: beq c02cbcbc <iov_iter_zero+0x13c> 3.22 : c02cbbcc: andi. r8,r10,16 1.70 : c02cbbd0: lwz r31,4(r30) 0.00 : c02cbbd4: bne c02cbe10 <iov_iter_zero+0x290> 0.31 : c02cbbd8: andi. 
r8,r10,8 0.00 : c02cbbdc: bne c02cbf64 <iov_iter_zero+0x3e4> 1.82 : c02cbbe0: andi. r10,r10,64 0.00 : c02cbbe4: bne c02cc080 <iov_iter_zero+0x500> 0.27 : c02cbbe8: stw r29,68(r1) 1.94 : c02cbbec: stw r28,64(r1) 1.98 : c02cbbf0: lwz r28,12(r30) 0.31 : c02cbbf4: lwz r7,4(r28) 2.13 : c02cbbf8: subf r29,r31,r7 1.78 : c02cbbfc: cmplw r29,r27 0.08 : c02cbc00: bgt c02cbcf8 <iov_iter_zero+0x178> 28.24 : c02cbc04: cmpwi r29,0 0.00 : c02cbc08: beq c02cc08c <iov_iter_zero+0x50c> 2.01 : c02cbc0c: lwz r3,0(r28) 3.10 : c02cbc10: lwz r10,1208(r2) 0.00 : c02cbc14: add r3,r3,r31 : __access_ok(): 0.00 : c02cbc18: cmplw r3,r10 0.00 : c02cbc1c: bgt c02cbc7c <iov_iter_zero+0xfc> 3.37 : c02cbc20: subf r10,r3,r10 0.00 : c02cbc24: addi r8,r29,-1 3.14 : c02cbc28: cmplw r8,r10 0.08 : c02cbc2c: mflr r0 0.00 : c02cbc30: stw r0,84(r1) 0.00 : c02cbc34: bgt c02cbd40 <iov_iter_zero+0x1c0> : clear_user(): 0.00 : c02cbc38: mr r4,r29 2.40 : c02cbc3c: bl c001a428 <__arch_clear_user> : iov_iter_zero(): 1.55 : c02cbc40: add r31,r31,r29 0.00 : c02cbc44: cmpwi r3,0 1.94 : c02cbc48: subf r29,r29,r27 0.00 : c02cbc4c: subf r31,r3,r31 0.00 : c02cbc50: add r29,r29,r3 0.00 : c02cbc54: beq c02cc0ac <iov_iter_zero+0x52c> 0.00 : c02cbc58: lwz r9,8(r30) 0.00 : c02cbc5c: subf r10,r27,r29 0.00 : c02cbc60: lwz r0,84(r1) 0.00 : c02cbc64: subf r27,r29,r27 0.00 : c02cbc68: add r9,r10,r9 0.00 : c02cbc6c: lwz r7,4(r28) 0.00 : c02cbc70: lwz r10,12(r30) 0.00 : c02cbc74: mtlr r0 0.00 : c02cbc78: b c02cbc84 <iov_iter_zero+0x104> : __access_ok(): 0.00 : c02cbc7c: li r27,0 0.00 : c02cbc80: mr r10,r28 : iov_iter_zero(): 0.00 : c02cbc84: cmplw r31,r7 0.00 : c02cbc88: bne c02cbc94 <iov_iter_zero+0x114> 0.93 : c02cbc8c: addi r28,r28,8 0.00 : c02cbc90: li r31,0 1.28 : c02cbc94: lwz r8,16(r30) 0.00 : c02cbc98: subf r10,r10,r28 1.05 : c02cbc9c: srawi r10,r10,3 0.00 : c02cbca0: stw r28,12(r30) 0.00 : c02cbca4: subf r10,r10,r8 0.93 : c02cbca8: stw r10,16(r30) 0.04 : c02cbcac: lwz r28,64(r1) 0.00 : c02cbcb0: lwz r29,68(r1) 1.05 : 
c02cbcb4: stw r9,8(r30) 0.00 : c02cbcb8: stw r31,4(r30) 1.39 : c02cbcbc: lwz r9,28(r1) 0.00 : c02cbcc0: lwz r10,624(r2) 1.08 : c02cbcc4: xor. r9,r9,r10 0.00 : c02cbcc8: li r10,0 0.00 : c02cbccc: bne c02cc180 <iov_iter_zero+0x600> 1.08 : c02cbcd0: mr r3,r27 0.00 : c02cbcd4: lwz r30,72(r1) 0.08 : c02cbcd8: lwz r27,60(r1) 1.01 : c02cbcdc: lwz r31,76(r1) 0.00 : c02cbce0: addi r1,r1,80 0.04 : c02cbce4: blr 0.00 : c02cbce8: cmpwi r9,0 0.00 : c02cbcec: mr r27,r3 0.00 : c02cbcf0: beq c02cbcbc <iov_iter_zero+0x13c> 0.00 : c02cbcf4: b c02cbbcc <iov_iter_zero+0x4c> 0.00 : c02cbcf8: mr r29,r27 0.00 : c02cbcfc: b c02cbc04 <iov_iter_zero+0x84> : pipe_zero(): 0.00 : c02cbd00: mflr r0 0.00 : c02cbd04: stw r26,56(r1) 0.00 : c02cbd08: stw r0,84(r1) 0.00 : c02cbd0c: mr r3,r4 0.00 : c02cbd10: stw r28,64(r1) 0.00 : c02cbd14: lwz r28,12(r4) 0.00 : c02cbd18: lwz r26,40(r28) 0.00 : c02cbd1c: bl c02c95d0 <sanity> 0.00 : c02cbd20: cmpwi r3,0 0.00 : c02cbd24: bne c02cbd54 <iov_iter_zero+0x1d4> 0.00 : c02cbd28: lwz r0,84(r1) 0.00 : c02cbd2c: li r27,0 0.00 : c02cbd30: lwz r26,56(r1) 0.00 : c02cbd34: lwz r28,64(r1) 0.00 : c02cbd38: mtlr r0 0.00 : c02cbd3c: b c02cbcbc <iov_iter_zero+0x13c> : __access_ok(): 0.00 : c02cbd40: lwz r0,84(r1) 0.00 : c02cbd44: li r27,0 0.00 : c02cbd48: mr r10,r28 0.00 : c02cbd4c: mtlr r0 0.00 : c02cbd50: b c02cbc84 <iov_iter_zero+0x104> : pipe_zero(): 0.00 : c02cbd54: mr r4,r31 0.00 : c02cbd58: addi r6,r1,24 0.00 : c02cbd5c: addi r5,r1,20 0.00 : c02cbd60: mr r3,r30 0.00 : c02cbd64: bl c02c97ac <push_pipe> 0.00 : c02cbd68: mr. 
r27,r3 0.00 : c02cbd6c: beq c02cbd28 <iov_iter_zero+0x1a8> 0.00 : c02cbd70: lwz r4,24(r1) 0.00 : c02cbd74: addi r26,r26,-1 0.00 : c02cbd78: lwz r9,20(r1) 0.00 : c02cbd7c: stw r25,52(r1) 0.00 : c02cbd80: li r25,0 0.00 : c02cbd84: stw r29,68(r1) 0.00 : c02cbd88: mr r29,r27 0.00 : c02cbd8c: subfic r31,r4,4096 0.00 : c02cbd90: cmplw r31,r29 0.00 : c02cbd94: ble c02cbd9c <iov_iter_zero+0x21c> 0.00 : c02cbd98: mr r31,r29 0.00 : c02cbd9c: and r9,r26,r9 0.00 : c02cbda0: lwz r8,80(r28) 0.00 : c02cbda4: rlwinm r10,r9,1,0,30 0.00 : c02cbda8: add r9,r10,r9 0.00 : c02cbdac: rlwinm r9,r9,3,0,28 0.00 : c02cbdb0: lwzx r3,r8,r9 0.00 : c02cbdb4: mr r5,r31 0.00 : c02cbdb8: bl c02c99d0 <memzero_page> 0.00 : c02cbdbc: subf. r29,r31,r29 0.00 : c02cbdc0: lwz r9,20(r1) 0.00 : c02cbdc4: li r4,0 0.00 : c02cbdc8: lwz r10,24(r1) 0.00 : c02cbdcc: stw r9,16(r30) 0.00 : c02cbdd0: addi r9,r9,1 0.00 : c02cbdd4: add r10,r10,r31 0.00 : c02cbdd8: stw r9,20(r1) 0.00 : c02cbddc: stw r10,4(r30) 0.00 : c02cbde0: stw r25,24(r1) 0.00 : c02cbde4: bne c02cbd8c <iov_iter_zero+0x20c> 0.00 : c02cbde8: lwz r9,8(r30) 0.00 : c02cbdec: subf r9,r27,r9 0.00 : c02cbdf0: stw r9,8(r30) : iov_iter_zero(): 0.00 : c02cbdf4: lwz r0,84(r1) 0.00 : c02cbdf8: lwz r25,52(r1) 0.00 : c02cbdfc: lwz r26,56(r1) 0.00 : c02cbe00: mtlr r0 0.00 : c02cbe04: lwz r28,64(r1) 0.00 : c02cbe08: lwz r29,68(r1) 0.00 : c02cbe0c: b c02cbcbc <iov_iter_zero+0x13c> 0.00 : c02cbe10: stw r23,44(r1) 0.00 : c02cbe14: cmpwi r27,0 0.00 : c02cbe18: stw r28,64(r1) 0.00 : c02cbe1c: mr r23,r27 0.00 : c02cbe20: stw r24,48(r1) 0.00 : c02cbe24: li r28,0 0.00 : c02cbe28: lwz r24,12(r30) 0.00 : c02cbe2c: mr r8,r24 0.00 : c02cbe30: beq c02cbf08 <iov_iter_zero+0x388> 0.00 : c02cbe34: mflr r0 0.00 : c02cbe38: stw r25,52(r1) 0.00 : c02cbe3c: stw r0,84(r1) 0.00 : c02cbe40: stw r26,56(r1) 0.00 : c02cbe44: stw r29,68(r1) 0.00 : c02cbe48: rlwinm r25,r28,1,0,30 0.00 : c02cbe4c: add r25,r25,r28 0.00 : c02cbe50: rlwinm r25,r25,2,0,29 0.00 : c02cbe54: add r10,r8,r25 0.00 : 
c02cbe58: lwz r26,4(r10) 0.00 : c02cbe5c: mr r29,r25 0.00 : c02cbe60: lwz r9,8(r10) 0.00 : c02cbe64: subf r26,r31,r26 0.00 : c02cbe68: cmplw r26,r23 0.00 : c02cbe6c: add r9,r31,r9 0.00 : c02cbe70: clrlwi r4,r9,20 0.00 : c02cbe74: ble c02cbe7c <iov_iter_zero+0x2fc> 0.00 : c02cbe78: mr r26,r23 0.00 : c02cbe7c: subfic r7,r4,4096 0.00 : c02cbe80: cmplw r26,r7 0.00 : c02cbe84: ble c02cbe8c <iov_iter_zero+0x30c> 0.00 : c02cbe88: mr r26,r7 0.00 : c02cbe8c: cmpwi r26,0 0.00 : c02cbe90: beq c02cbeb4 <iov_iter_zero+0x334> 0.00 : c02cbe94: lwz r3,0(r10) 0.00 : c02cbe98: rlwinm r9,r9,25,7,26 0.00 : c02cbe9c: mr r5,r26 0.00 : c02cbea0: add r3,r3,r9 0.00 : c02cbea4: bl c02c99d0 <memzero_page> : bvec_iter_advance(): 0.00 : c02cbea8: cmplw r23,r26 : iov_iter_zero(): 0.00 : c02cbeac: lwz r8,12(r30) : bvec_iter_advance(): 0.00 : c02cbeb0: blt c02cc044 <iov_iter_zero+0x4c4> 0.00 : c02cbeb4: add. r31,r31,r26 0.00 : c02cbeb8: subf r23,r26,r23 0.00 : c02cbebc: addi r10,r8,4 0.00 : c02cbec0: bne c02cbed8 <iov_iter_zero+0x358> 0.00 : c02cbec4: b c02cbee4 <iov_iter_zero+0x364> 0.00 : c02cbec8: subf. 
r31,r9,r31 0.00 : c02cbecc: addi r28,r28,1 0.00 : c02cbed0: addi r29,r29,12 0.00 : c02cbed4: beq c02cbf54 <iov_iter_zero+0x3d4> 0.00 : c02cbed8: lwzx r9,r10,r29 0.00 : c02cbedc: cmplw r31,r9 0.00 : c02cbee0: bge c02cbec8 <iov_iter_zero+0x348> : iov_iter_zero(): 0.00 : c02cbee4: cmpwi r23,0 0.00 : c02cbee8: bne c02cbe48 <iov_iter_zero+0x2c8> 0.00 : c02cbeec: add r8,r8,r29 0.00 : c02cbef0: lwz r0,84(r1) 0.00 : c02cbef4: lwz r9,8(r30) 0.00 : c02cbef8: lwz r25,52(r1) 0.00 : c02cbefc: mtlr r0 0.00 : c02cbf00: lwz r26,56(r1) 0.00 : c02cbf04: lwz r29,68(r1) 0.00 : c02cbf08: subf r24,r24,r8 0.00 : c02cbf0c: stw r8,12(r30) 0.00 : c02cbf10: srawi r6,r24,2 0.00 : c02cbf14: lwz r7,16(r30) 0.00 : c02cbf18: rlwinm r10,r24,0,0,29 0.00 : c02cbf1c: add r10,r10,r6 0.00 : c02cbf20: rlwinm r8,r10,4,0,27 0.00 : c02cbf24: add r10,r10,r8 0.00 : c02cbf28: rlwinm r8,r10,8,0,23 0.00 : c02cbf2c: add r10,r10,r8 0.00 : c02cbf30: rlwinm r8,r10,16,0,15 0.00 : c02cbf34: add r10,r10,r8 0.00 : c02cbf38: add r10,r7,r10 0.00 : c02cbf3c: stw r10,16(r30) 0.00 : c02cbf40: subf r9,r27,r9 0.00 : c02cbf44: lwz r23,44(r1) 0.00 : c02cbf48: lwz r24,48(r1) 0.00 : c02cbf4c: lwz r28,64(r1) 0.00 : c02cbf50: b c02cbcb4 <iov_iter_zero+0x134> 0.00 : c02cbf54: rlwinm r29,r28,1,0,30 0.00 : c02cbf58: add r29,r29,r28 0.00 : c02cbf5c: rlwinm r29,r29,2,0,29 0.00 : c02cbf60: b c02cbee4 <iov_iter_zero+0x364> 0.00 : c02cbf64: mflr r0 0.00 : c02cbf68: stw r26,56(r1) 0.00 : c02cbf6c: stw r0,84(r1) 0.00 : c02cbf70: stw r28,64(r1) 0.00 : c02cbf74: stw r29,68(r1) 0.00 : c02cbf78: lwz r28,12(r30) 0.00 : c02cbf7c: lwz r29,4(r28) 0.00 : c02cbf80: subf r29,r31,r29 0.00 : c02cbf84: cmplw r29,r27 0.00 : c02cbf88: ble c02cbf90 <iov_iter_zero+0x410> 0.00 : c02cbf8c: mr r29,r27 0.00 : c02cbf90: cmpwi r29,0 0.00 : c02cbf94: beq c02cc0b8 <iov_iter_zero+0x538> 0.00 : c02cbf98: lwz r3,0(r28) 0.00 : c02cbf9c: mr r5,r29 0.00 : c02cbfa0: li r4,0 0.00 : c02cbfa4: add r3,r3,r31 0.00 : c02cbfa8: subf r26,r29,r27 0.00 : c02cbfac: bl c001999c 
<memset> 0.00 : c02cbfb0: add r31,r31,r29 0.00 : c02cbfb4: cmpwi r26,0 0.00 : c02cbfb8: bne c02cc00c <iov_iter_zero+0x48c> 0.00 : c02cbfbc: lwz r9,4(r28) 0.00 : c02cbfc0: cmpw r9,r31 0.00 : c02cbfc4: bne c02cbfd0 <iov_iter_zero+0x450> 0.00 : c02cbfc8: addi r28,r28,8 0.00 : c02cbfcc: li r31,0 0.00 : c02cbfd0: lwz r9,12(r30) 0.00 : c02cbfd4: lwz r8,16(r30) 0.00 : c02cbfd8: subf r10,r9,r28 0.00 : c02cbfdc: stw r28,12(r30) 0.00 : c02cbfe0: srawi r10,r10,3 0.00 : c02cbfe4: lwz r9,8(r30) 0.00 : c02cbfe8: subf r10,r10,r8 0.00 : c02cbfec: stw r10,16(r30) 0.00 : c02cbff0: subf r9,r27,r9 0.00 : c02cbff4: lwz r0,84(r1) 0.00 : c02cbff8: lwz r26,56(r1) 0.00 : c02cbffc: lwz r28,64(r1) 0.00 : c02cc000: mtlr r0 0.00 : c02cc004: lwz r29,68(r1) 0.00 : c02cc008: b c02cbcb4 <iov_iter_zero+0x134> 0.00 : c02cc00c: lwz r31,12(r28) 0.00 : c02cc010: addi r28,r28,8 0.00 : c02cc014: cmplw r31,r26 0.00 : c02cc018: ble c02cc020 <iov_iter_zero+0x4a0> 0.00 : c02cc01c: mr r31,r26 0.00 : c02cc020: cmpwi r31,0 0.00 : c02cc024: beq c02cc00c <iov_iter_zero+0x48c> 0.00 : c02cc028: lwz r3,0(r28) 0.00 : c02cc02c: mr r5,r31 0.00 : c02cc030: li r4,0 0.00 : c02cc034: bl c001999c <memset> 0.00 : c02cc038: subf. 
r26,r31,r26 0.00 : c02cc03c: beq c02cbfbc <iov_iter_zero+0x43c> 0.00 : c02cc040: b c02cc00c <iov_iter_zero+0x48c> : bvec_iter_advance(): 0.00 : c02cc044: lis r9,-16236 0.00 : c02cc048: lbz r10,-20202(r9) 0.00 : c02cc04c: cmpwi r10,0 0.00 : c02cc050: beq c02cc05c <iov_iter_zero+0x4dc> : iov_iter_zero(): 0.00 : c02cc054: add r8,r8,r25 0.00 : c02cc058: b c02cbef0 <iov_iter_zero+0x370> : bvec_iter_advance(): 0.00 : c02cc05c: lis r3,-16253 0.00 : c02cc060: li r10,1 0.00 : c02cc064: addi r3,r3,7692 0.00 : c02cc068: stb r10,-20202(r9) 0.00 : c02cc06c: bl c0029bc0 <__warn_printk> 0.00 : c02cc070: twui r0,0 : iov_iter_zero(): 0.00 : c02cc074: lwz r8,12(r30) 0.00 : c02cc078: add r8,r8,r25 0.00 : c02cc07c: b c02cbef0 <iov_iter_zero+0x370> 0.00 : c02cc080: add r31,r31,r27 0.00 : c02cc084: subf r9,r27,r9 0.00 : c02cc088: b c02cbcb4 <iov_iter_zero+0x134> 0.00 : c02cc08c: mr r29,r27 0.00 : c02cc090: cmpwi r29,0 0.00 : c02cc094: bne c02cc0c0 <iov_iter_zero+0x540> 1.51 : c02cc098: lwz r9,8(r30) 0.00 : c02cc09c: lwz r7,4(r28) 0.00 : c02cc0a0: lwz r10,12(r30) 0.00 : c02cc0a4: subf r9,r27,r9 0.00 : c02cc0a8: b c02cbc84 <iov_iter_zero+0x104> 1.47 : c02cc0ac: lwz r0,84(r1) 6.47 : c02cc0b0: mtlr r0 0.00 : c02cc0b4: b c02cc090 <iov_iter_zero+0x510> 0.00 : c02cc0b8: mr r26,r27 0.00 : c02cc0bc: b c02cbfb4 <iov_iter_zero+0x434> 0.00 : c02cc0c0: stw r26,56(r1) 0.00 : c02cc0c4: lwz r7,12(r28) 0.00 : c02cc0c8: addi r26,r28,8 0.00 : c02cc0cc: mr r31,r29 0.00 : c02cc0d0: cmplw r29,r7 0.00 : c02cc0d4: ble c02cc0dc <iov_iter_zero+0x55c> 0.00 : c02cc0d8: mr r31,r7 0.00 : c02cc0dc: cmpwi r31,0 0.00 : c02cc0e0: beq c02cc1d8 <iov_iter_zero+0x658> 0.00 : c02cc0e4: lwz r3,0(r26) : clear_user(): 0.00 : c02cc0e8: lwz r9,1208(r2) : __access_ok(): 0.00 : c02cc0ec: cmplw r3,r9 0.00 : c02cc0f0: bgt c02cc114 <iov_iter_zero+0x594> 0.00 : c02cc0f4: subf r9,r3,r9 0.00 : c02cc0f8: addi r10,r31,-1 0.00 : c02cc0fc: cmplw r10,r9 0.00 : c02cc100: mflr r0 0.00 : c02cc104: stw r0,84(r1) 0.00 : c02cc108: ble c02cc138 
<iov_iter_zero+0x5b8> 0.00 : c02cc10c: lwz r0,84(r1) 0.00 : c02cc110: mtlr r0 : iov_iter_zero(): 0.00 : c02cc114: lwz r9,8(r30) 0.00 : c02cc118: subf r8,r27,r29 0.00 : c02cc11c: mr r28,r26 0.00 : c02cc120: lwz r10,12(r30) 0.00 : c02cc124: lwz r26,56(r1) 0.00 : c02cc128: add r9,r8,r9 0.00 : c02cc12c: subf r27,r29,r27 0.00 : c02cc130: li r31,0 0.00 : c02cc134: b c02cbc84 <iov_iter_zero+0x104> : clear_user(): 0.00 : c02cc138: mr r4,r31 0.00 : c02cc13c: bl c001a428 <__arch_clear_user> : iov_iter_zero(): 0.00 : c02cc140: subf r29,r31,r29 0.00 : c02cc144: cmpwi r3,0 0.00 : c02cc148: subf r31,r3,r31 0.00 : c02cc14c: add r29,r3,r29 0.00 : c02cc150: beq c02cc1a4 <iov_iter_zero+0x624> 0.00 : c02cc154: lwz r9,8(r30) 0.00 : c02cc158: subf r8,r27,r29 0.00 : c02cc15c: lwz r0,84(r1) 0.00 : c02cc160: subf r27,r29,r27 0.00 : c02cc164: lwz r7,12(r28) 0.00 : c02cc168: add r9,r8,r9 0.00 : c02cc16c: mr r28,r26 0.00 : c02cc170: lwz r10,12(r30) 0.00 : c02cc174: lwz r26,56(r1) 0.00 : c02cc178: mtlr r0 0.00 : c02cc17c: b c02cbc84 <iov_iter_zero+0x104> 0.00 : c02cc180: mflr r0 0.00 : c02cc184: stw r23,44(r1) 0.00 : c02cc188: stw r0,84(r1) 0.00 : c02cc18c: stw r24,48(r1) 0.00 : c02cc190: stw r25,52(r1) 0.00 : c02cc194: stw r26,56(r1) 0.00 : c02cc198: stw r28,64(r1) 0.00 : c02cc19c: stw r29,68(r1) 0.00 : c02cc1a0: bl c071e2b0 <__stack_chk_fail> 0.00 : c02cc1a4: cmpwi r29,0 0.00 : c02cc1a8: bne c02cc1d0 <iov_iter_zero+0x650> 0.00 : c02cc1ac: lwz r9,8(r30) 0.00 : c02cc1b0: lwz r0,84(r1) 0.00 : c02cc1b4: lwz r7,12(r28) 0.00 : c02cc1b8: subf r9,r27,r9 0.00 : c02cc1bc: mr r28,r26 0.00 : c02cc1c0: lwz r10,12(r30) 0.00 : c02cc1c4: lwz r26,56(r1) 0.00 : c02cc1c8: mtlr r0 0.00 : c02cc1cc: b c02cbc84 <iov_iter_zero+0x104> 0.00 : c02cc1d0: lwz r0,84(r1) 0.00 : c02cc1d4: mtlr r0 0.00 : c02cc1d8: mr r28,r26 0.00 : c02cc1dc: b c02cc0c4 <iov_iter_zero+0x544> Christophe ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2
  2020-09-01 17:25 ` Al Viro
@ 2020-09-02  8:10 ` Christoph Hellwig
From: Christoph Hellwig @ 2020-09-02  8:10 UTC
To: Al Viro
Cc: Christophe Leroy, Christoph Hellwig, Linus Torvalds, Michael Ellerman,
    x86, linux-fsdevel, linux-arch, linuxppc-dev, Kees Cook, linux-kernel

On Tue, Sep 01, 2020 at 06:25:12PM +0100, Al Viro wrote:
> On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote:
>
> > 10.92% dd [kernel.kallsyms] [k] iov_iter_zero
>
> Interesting... Could you get an instruction-level profile inside
> iov_iter_zero(), along with the disassembly of that sucker?

So the interesting thing here is that none of these code paths should
have changed at all, and the biggest items on the profile look the same
modulo some minor reordering.
* Re: [PATCH 02/10] fs: don't allow splice read/write without explicit ops
  2020-08-27 15:00 remove the last set_fs() in common code, and remove it for x86 and powerpc v2 Christoph Hellwig
@ 2020-10-27  9:29 ` David Howells
From: David Howells @ 2020-10-27  9:29 UTC
To: Christoph Hellwig
Cc: dhowells, Linus Torvalds, Al Viro, Michael Ellerman, x86, Kees Cook,
    linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Christoph Hellwig <hch@lst.de> wrote:

> default_file_splice_write is the last piece of generic code that uses
> set_fs to make the uaccess routines operate on kernel pointers. It
> implements a "fallback loop" for splicing from files that do not actually
> provide a proper splice_read method. The usual file systems and other
> high bandwidth instances all provide a ->splice_read, so this just removes
> support for various device drivers and procfs/debugfs files. If splice
> support for any of those turns out to be important it can be added back
> by switching them to the iter ops and using generic_file_splice_read.

Hmmm... this causes the copy_file_range() syscall to fail with EINVAL in some
places where before it used to work.

For my part, it causes the generic/112 xfstest to fail with afs, but there may
be other places.

Is this a regression we need to fix in the VFS core? Or is it something we
need to fix in xfstests and assume userspace will fall back to doing it itself?

David
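The userspace half of the question above can be sketched as follows. This is a minimal illustration, not code from the thread: copy_with_fallback() is a hypothetical helper that tries copy_file_range() first and drops to a plain read()/write() loop once the kernel refuses the call. The set of errno values treated as "unsupported" is an assumption; the thread itself only mentions EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: these errors mean "try another way" rather than "give up". */
static int cfr_unsupported(int err)
{
	return err == EINVAL || err == ENOSYS || err == EXDEV ||
	       err == EOPNOTSUPP || err == EPERM;
}

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t w = write(fd, buf, len);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += w;
		len -= (size_t)w;
	}
	return 0;
}

/*
 * Copy up to len bytes from fd_in to fd_out at their current offsets.
 * Returns the number of bytes copied (short on EOF), or -1 with errno set.
 */
static ssize_t copy_with_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[16384];
	size_t done = 0;
	int use_cfr = 1;

	assert(fd_in >= 0 && fd_out >= 0);

	while (done < len) {
		ssize_t n;

		if (use_cfr) {
			n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);
			if (n < 0 && errno == EINTR)
				continue;
			if (n < 0 && cfr_unsupported(errno)) {
				use_cfr = 0;	/* switch to read()/write() */
				continue;
			}
		} else {
			size_t want = len - done;

			if (want > sizeof(buf))
				want = sizeof(buf);
			n = read(fd_in, buf, want);
			if (n < 0 && errno == EINTR)
				continue;
			if (n > 0 && write_all(fd_out, buf, (size_t)n) < 0)
				return -1;
		}
		if (n < 0)
			return -1;
		if (n == 0)
			break;		/* end of input */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Because both paths advance the file offsets, a copy that starts with copy_file_range() and then hits an unsupported case simply continues from where it stopped.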
* Re: [PATCH 02/10] fs: don't allow splice read/write without explicit ops
  2020-10-27  9:29 ` [PATCH 02/10] fs: don't allow splice read/write without explicit ops David Howells
@ 2020-10-27  9:51 ` David Howells
From: David Howells @ 2020-10-27  9:51 UTC
To: Christoph Hellwig
Cc: dhowells, Linus Torvalds, Al Viro, Michael Ellerman, x86, Kees Cook,
    linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

David Howells <dhowells@redhat.com> wrote:

> > default_file_splice_write is the last piece of generic code that uses
> > set_fs to make the uaccess routines operate on kernel pointers. It
> > implements a "fallback loop" for splicing from files that do not actually
> > provide a proper splice_read method. The usual file systems and other
> > high bandwidth instances all provide a ->splice_read, so this just removes
> > support for various device drivers and procfs/debugfs files. If splice
> > support for any of those turns out to be important it can be added back
> > by switching them to the iter ops and using generic_file_splice_read.
>
> Hmmm... this causes the copy_file_range() syscall to fail with EINVAL in some
> places where before it used to work.
>
> For my part, it causes the generic/112 xfstest to fail with afs, but there may
> be other places.
>
> Is this a regression we need to fix in the VFS core? Or is it something we
> need to fix in xfstests and assume userspace will fall back to doing it itself?

That said, for afs at least, the fix seems to be just this:

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 395075d7fe02..2bc6adfe351a 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -33,6 +33,7 @@ const struct file_operations afs_file_operations = {
 	.write_iter	= afs_file_write,
 	.mmap		= afs_file_mmap,
 	.splice_read	= generic_file_splice_read,
+	.splice_write	= iter_file_splice_write,
 	.fsync		= afs_fsync,
 	.lock		= afs_lock,
 	.flock		= afs_flock,

David
* Re: [PATCH 02/10] fs: don't allow splice read/write without explicit ops
  2020-10-27  9:51 ` David Howells
@ 2020-10-27  9:54 ` Christoph Hellwig
From: Christoph Hellwig @ 2020-10-27  9:54 UTC
To: David Howells
Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, x86,
    Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

On Tue, Oct 27, 2020 at 09:51:34AM +0000, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > > default_file_splice_write is the last piece of generic code that uses
> > > set_fs to make the uaccess routines operate on kernel pointers. It
> > > implements a "fallback loop" for splicing from files that do not actually
> > > provide a proper splice_read method. The usual file systems and other
> > > high bandwidth instances all provide a ->splice_read, so this just removes
> > > support for various device drivers and procfs/debugfs files. If splice
> > > support for any of those turns out to be important it can be added back
> > > by switching them to the iter ops and using generic_file_splice_read.
> >
> > Hmmm... this causes the copy_file_range() syscall to fail with EINVAL in some
> > places where before it used to work.
> >
> > For my part, it causes the generic/112 xfstest to fail with afs, but there may
> > be other places.
> >
> > Is this a regression we need to fix in the VFS core? Or is it something we
> > need to fix in xfstests and assume userspace will fall back to doing it itself?
>
> That said, for afs at least, the fix seems to be just this:

And that is the correct fix, I was about to send it to you. We can't
have a "generic" splice using ->read/->write without set_fs. On top of
that, the iter_file_splice_write based version is a lot more efficient
than what you had before.
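The behaviour discussed in this exchange can be modelled in plain userspace C. This is only a simplified sketch under stated assumptions, not the kernel implementation: struct model_file_ops, struct model_file, model_append() and model_splice_to_file() are hypothetical stand-ins invented for illustration. The one detail taken from the thread is the outcome: with no ->splice_write set there is no fallback through ->write, and the caller gets EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct model_file;

/* Hypothetical, heavily reduced stand-in for struct file_operations. */
struct model_file_ops {
	ssize_t (*write)(struct model_file *f, const char *buf, size_t len);
	ssize_t (*splice_write)(struct model_file *f, const char *buf,
				size_t len);
};

/* Hypothetical file object: an ops table plus a small in-memory buffer. */
struct model_file {
	const struct model_file_ops *ops;
	char data[64];
	size_t used;
};

/* Append as much of buf as fits; returns the number of bytes stored. */
static ssize_t model_append(struct model_file *f, const char *buf, size_t len)
{
	assert(f->used <= sizeof(f->data));
	if (len > sizeof(f->data) - f->used)
		len = sizeof(f->data) - f->used;
	memcpy(f->data + f->used, buf, len);
	f->used += len;
	return (ssize_t)len;
}

/*
 * Model of the post-series rule: splice only works through an explicit
 * ->splice_write. A file that only has ->write is rejected with -EINVAL
 * instead of being driven through a generic fallback loop.
 */
static ssize_t model_splice_to_file(struct model_file *f, const char *buf,
				    size_t len)
{
	if (!f->ops->splice_write)
		return -EINVAL;
	return f->ops->splice_write(f, buf, len);
}

/* Before the afs fix: only ->write is wired up. */
static const struct model_file_ops ops_without_splice = {
	.write		= model_append,
};

/* After the afs fix: ->splice_write is set explicitly as well. */
static const struct model_file_ops ops_with_splice = {
	.write		= model_append,
	.splice_write	= model_append,
};
```

In this model, switching a file from ops_without_splice to ops_with_splice plays the role of the one-line afs change: the same request that returned -EINVAL now succeeds.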
* Re: [PATCH 02/10] fs: don't allow splice read/write without explicit ops
  2020-10-27  9:54 ` Christoph Hellwig
@ 2020-10-27 10:38 ` David Howells
From: David Howells @ 2020-10-27 10:38 UTC
To: Christoph Hellwig
Cc: dhowells, Linus Torvalds, Al Viro, Michael Ellerman, x86, Kees Cook,
    linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Christoph Hellwig <hch@lst.de> wrote:

> > That said, for afs at least, the fix seems to be just this:
>
> And that is the correct fix, I was about to send it to you.

Thanks.

David