LKML Archive on lore.kernel.org help / color / mirror / Atom feed
From: Dave Hansen <haveblue@us.ibm.com> To: akpm@osdl.org Cc: linux-kernel@vger.kernel.org, miklos@szeredi.hu, hch@infradead.org, Dave Hansen <haveblue@us.ibm.com> Subject: [PATCH 24/27] r-o-bind-mounts-track-number-of-mount-writers Date: Thu, 01 Nov 2007 16:08:58 -0700 [thread overview] Message-ID: <20071101230858.6F76E20C@kernel> (raw) In-Reply-To: <20071101230826.9A4F6E00@kernel> This is the real meat of the entire series. It actually implements the tracking of the number of writers to a mount. However, it causes scalability problems because there can be hundreds of cpus doing open()/close() on files on the same mnt at the same time. Even an atomic_t in the mnt has massive scalaing problems because the cacheline gets so terribly contended. This uses a statically-allocated percpu variable. All operations are local to a cpu as long that cpu operates on the same mount, and there are no writer count imbalances. Writer count imbalances happen when a write is taken on one cpu, and released on another, like when an open/close pair is performed on two Upon a remount,ro request, all of the data from the percpu variables is collected (expensive, but very rare) and we determine if there are any outstanding writers to the mount. I've written a little benchmark to sit in a loop for a couple of seconds in several cpus in parallel doing open/write/close loops. http://sr71.net/~dave/linux/openbench.c The code in here is a a worst-possible case for this patch. It does opens on a _pair_ of files in two different mounts in parallel. This should cause my code to lose its "operate on the same mount" optimization completely. This worst-case scenario causes a 3% degredation in the benchmark. I could probably get rid of even this 3%, but it would be more complex than what I have here, and I think this is getting into acceptable territory. In practice, I expect writing more than 3 bytes to a file, as well as disk I/O to mask any effects that this has. (To get rid of that 3%, we could have an #defined number of mounts in the percpu variable. So, instead of a CPU getting operate only on percpu data when it accesses only one mount, it could stay on percpu data when it only accesses N or fewer mounts.) Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- linux-2.6.git-dave/fs/namespace.c | 205 ++++++++++++++++++++++++++++--- linux-2.6.git-dave/include/linux/mount.h | 8 + 2 files changed, 198 insertions(+), 15 deletions(-) diff -puN fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers fs/namespace.c --- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers 2007-11-01 14:46:20.000000000 -0700 +++ linux-2.6.git-dave/fs/namespace.c 2007-11-01 14:46:20.000000000 -0700 @@ -17,6 +17,7 @@ #include <linux/quotaops.h> #include <linux/acct.h> #include <linux/capability.h> +#include <linux/cpumask.h> #include <linux/module.h> #include <linux/sysfs.h> #include <linux/seq_file.h> @@ -52,6 +53,8 @@ static inline unsigned long hash(struct return tmp & hash_mask; } +#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16) + struct vfsmount *alloc_vfsmnt(const char *name) { struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL); @@ -65,6 +68,7 @@ struct vfsmount *alloc_vfsmnt(const char INIT_LIST_HEAD(&mnt->mnt_share); INIT_LIST_HEAD(&mnt->mnt_slave_list); INIT_LIST_HEAD(&mnt->mnt_slave); + atomic_set(&mnt->__mnt_writers, 0); if (name) { int size = strlen(name) + 1; char *newname = kmalloc(size, GFP_KERNEL); @@ -85,6 +89,84 @@ struct vfsmount *alloc_vfsmnt(const char * we can determine when writes are able to occur to * a filesystem. */ +/* + * __mnt_is_readonly: check whether a mount is read-only + * @mnt: the mount to check for its write status + * + * This shouldn't be used directly ouside of the VFS. + * It does not guarantee that the filesystem will stay + * r/w, just that it is right *now*. This can not and + * should not be used in place of IS_RDONLY(inode). + * mnt_want/drop_write() will _keep_ the filesystem + * r/w. + */ +int __mnt_is_readonly(struct vfsmount *mnt) +{ + return (mnt->mnt_sb->s_flags & MS_RDONLY); +} +EXPORT_SYMBOL_GPL(__mnt_is_readonly); + +struct mnt_writer { + /* + * If holding multiple instances of this lock, they + * must be ordered by cpu number. + */ + spinlock_t lock; + unsigned long count; + struct vfsmount *mnt; +} ____cacheline_aligned_in_smp; +static DEFINE_PER_CPU(struct mnt_writer, mnt_writers); + +static int __init init_mnt_writers(void) +{ + int cpu; + for_each_possible_cpu(cpu) { + struct mnt_writer *writer = &per_cpu(mnt_writers, cpu); + spin_lock_init(&writer->lock); + writer->count = 0; + } + return 0; +} +fs_initcall(init_mnt_writers); + +static void mnt_unlock_cpus(void) +{ + int cpu; + struct mnt_writer *cpu_writer; + + for_each_possible_cpu(cpu) { + cpu_writer = &per_cpu(mnt_writers, cpu); + spin_unlock(&cpu_writer->lock); + } +} + +static inline void __clear_mnt_count(struct mnt_writer *cpu_writer) +{ + if (!cpu_writer->mnt) + return; + atomic_add(cpu_writer->count, &cpu_writer->mnt->__mnt_writers); + cpu_writer->count = 0; +} + /* + * must hold cpu_writer->lock + */ +static inline void use_cpu_writer_for_mount(struct mnt_writer *cpu_writer, + struct vfsmount *mnt) +{ + if (cpu_writer->mnt == mnt) + return; + __clear_mnt_count(cpu_writer); + cpu_writer->mnt = mnt; +} + +/* + * Most r/o checks on a fs are for operations that take + * discrete amounts of time, like a write() or unlink(). + * We must keep track of when those operations start + * (for permission checks) and when they end, so that + * we can determine when writes are able to occur to + * a filesystem. + */ /** * mnt_want_write - get write access to a mount * @mnt: the mount on which to take a write @@ -97,12 +179,58 @@ struct vfsmount *alloc_vfsmnt(const char */ int mnt_want_write(struct vfsmount *mnt) { - if (__mnt_is_readonly(mnt)) - return -EROFS; - return 0; + int ret = 0; + struct mnt_writer *cpu_writer; + + cpu_writer = &get_cpu_var(mnt_writers); + spin_lock(&cpu_writer->lock); + if (__mnt_is_readonly(mnt)) { + ret = -EROFS; + goto out; + } + use_cpu_writer_for_mount(cpu_writer, mnt); + cpu_writer->count++; +out: + spin_unlock(&cpu_writer->lock); + put_cpu_var(mnt_writers); + return ret; } EXPORT_SYMBOL_GPL(mnt_want_write); +static void lock_and_coalesce_cpu_mnt_writer_counts(void) +{ + int cpu; + struct mnt_writer *cpu_writer; + + for_each_possible_cpu(cpu) { + cpu_writer = &per_cpu(mnt_writers, cpu); + spin_lock(&cpu_writer->lock); + __clear_mnt_count(cpu_writer); + cpu_writer->mnt = NULL; + } +} + +/* + * These per-cpu write counts are not guaranteed to have + * matched increments and decrements on any given cpu. + * A file open()ed for write on one cpu and close()d on + * another cpu will imbalance this count. Make sure it + * does not get too far out of whack. + */ +static void handle_write_count_underflow(struct vfsmount *mnt) +{ + while (atomic_read(&mnt->__mnt_writers) < + MNT_WRITER_UNDERFLOW_LIMIT) { + /* + * It isn't necessary to hold all of the locks + * at the same time, but doing it this way makes + * us share a lot more code. + */ + lock_and_coalesce_cpu_mnt_writer_counts(); + mnt_unlock_cpus(); + } +} + /** * mnt_drop_write - give up write access to a mount * @mnt: the mount on which to give up write access @@ -113,23 +241,61 @@ EXPORT_SYMBOL_GPL(mnt_want_write); */ void mnt_drop_write(struct vfsmount *mnt) { + int must_check_underflow = 0; + struct mnt_writer *cpu_writer; + + cpu_writer = &get_cpu_var(mnt_writers); + spin_lock(&cpu_writer->lock); + + use_cpu_writer_for_mount(cpu_writer, mnt); + if (cpu_writer->count > 0) { + cpu_writer->count--; + } else { + must_check_underflow = 1; + atomic_dec(&mnt->__mnt_writers); + } + + spin_unlock(&cpu_writer->lock); + /* + * Logically, we could call this each time, + * but the __mnt_writers cacheline tends to + * be cold, and makes this expensive. + */ + if (must_check_underflow) + handle_write_count_underflow(mnt); + /* + * This could be done right after the spinlock + * is taken because the spinlock keeps us on + * the cpu, and disables preemption. However, + * putting it here bounds the amount that + * __mnt_writers can underflow. Without it, + * we could theoretically wrap __mnt_writers. + */ + put_cpu_var(mnt_writers); } EXPORT_SYMBOL_GPL(mnt_drop_write); -/* - * __mnt_is_readonly: check whether a mount is read-only - * @mnt: the mount to check for its write status - * - * This shouldn't be used directly ouside of the VFS. - * It does not guarantee that the filesystem will stay - * r/w, just that it is right *now*. This can not and - * should not be used in place of IS_RDONLY(inode). - */ -int __mnt_is_readonly(struct vfsmount *mnt) +int mnt_make_readonly(struct vfsmount *mnt) { - return (mnt->mnt_sb->s_flags & MS_RDONLY); + int ret = 0; + + lock_and_coalesce_cpu_mnt_writer_counts(); + /* + * With all the locks held, this value is stable + */ + if (atomic_read(&mnt->__mnt_writers) > 0) { + ret = -EBUSY; + goto out; + } + /* + * actually set mount's r/o flag here to make + * __mnt_is_readonly() true, which keeps anyone + * from doing a successful mnt_want_write(). + */ +out: + mnt_unlock_cpus(); + return ret; } -EXPORT_SYMBOL_GPL(__mnt_is_readonly); int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb) { @@ -325,6 +491,15 @@ static struct vfsmount *clone_mnt(struct static inline void __mntput(struct vfsmount *mnt) { struct super_block *sb = mnt->mnt_sb; + lock_and_coalesce_cpu_mnt_writer_counts(); + mnt_unlock_cpus(); + /* + * This probably indicates that somebody messed + * up a mnt_want/drop_write() pair. If this + * happens, the filesystem was probably unable + * to make r/w->r/o transitions. + */ + WARN_ON(atomic_read(&mnt->__mnt_writers)); dput(mnt->mnt_root); free_vfsmnt(mnt); deactivate_super(sb); diff -puN include/linux/mount.h~r-o-bind-mounts-track-number-of-mount-writers include/linux/mount.h --- linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-track-number-of-mount-writers 2007-11-01 14:46:20.000000000 -0700 +++ linux-2.6.git-dave/include/linux/mount.h 2007-11-01 14:46:20.000000000 -0700 @@ -14,6 +14,7 @@ #include <linux/types.h> #include <linux/list.h> +#include <linux/nodemask.h> #include <linux/spinlock.h> #include <asm/atomic.h> @@ -61,6 +62,13 @@ struct vfsmount { atomic_t mnt_count; int mnt_expiry_mark; /* true if marked for expiry */ int mnt_pinned; + /* + * This value is not stable unless all of the + * mnt_writers[] spinlocks are held, and all + * mnt_writer[]s on this mount have 0 as + * their ->count + */ + atomic_t __mnt_writers; }; static inline struct vfsmount *mntget(struct vfsmount *mnt) _
next prev parent reply other threads:[~2007-11-01 23:17 UTC|newest] Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top 2007-11-01 23:08 [PATCH 00/27] Read-only bind mounts (-mm resend) Dave Hansen 2007-11-01 23:08 ` [PATCH 01/27] do namei_flags calculation inside open_namei() Dave Hansen 2007-11-01 23:08 ` [PATCH 02/27] make open_namei() return a filp Dave Hansen 2007-11-01 23:08 ` [PATCH 03/27] kill do_filp_open() Dave Hansen 2007-11-01 23:08 ` [PATCH 04/27] kill filp_open() Dave Hansen 2008-01-16 8:52 ` Andrew Morton 2008-01-16 17:04 ` Dave Hansen 2008-01-16 17:10 ` Christoph Hellwig 2008-01-16 17:41 ` Dave Hansen 2008-01-16 17:47 ` Christoph Hellwig 2008-01-16 17:12 ` Bryn M. Reeves 2007-11-01 23:08 ` [PATCH 05/27] rename open_namei() to open_pathname() Dave Hansen 2007-11-26 14:33 ` Christoph Hellwig 2007-11-01 23:08 ` [PATCH 06/27] r-o-bind-mounts-stub-functions Dave Hansen 2007-11-01 23:08 ` [PATCH 07/27] r-o-bind-mounts-do_rmdir-elevate-write-count Dave Hansen 2007-11-01 23:08 ` [PATCH 08/27] r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir Dave Hansen 2007-11-01 23:08 ` [PATCH 09/27] r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers Dave Hansen 2007-11-01 23:08 ` [PATCH 10/27] r-o-bind-mounts-elevate-mount-count-for-extended-attributes Dave Hansen 2007-11-01 23:08 ` [PATCH 11/27] r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl Dave Hansen 2007-11-01 23:08 ` [PATCH 12/27] r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime Dave Hansen 2007-11-01 23:08 ` [PATCH 13/27] r-o-bind-mounts-elevate-write-count-for-do_utimes Dave Hansen 2007-11-01 23:08 ` [PATCH 14/27] r-o-bind-mounts-elevate-write-count-for-file_update_time Dave Hansen 2007-11-01 23:08 ` [PATCH 15/27] r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls Dave Hansen 2007-11-01 23:08 ` [PATCH 16/27] r-o-bind-mounts-elevate-write-count-for-some-ioctls Dave Hansen 2007-11-05 23:23 ` Andrew Morton 2007-11-06 9:01 ` Jan Kara 2007-11-06 9:12 ` Andrew Morton 2007-11-01 23:08 ` [PATCH 17/27] r-o-bind-mounts-elevate-write-count-opend-files Dave Hansen 2007-11-01 23:08 ` [PATCH 18/27] r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename Dave Hansen 2007-11-01 23:08 ` [PATCH 19/27] r-o-bind-mounts-elevate-writer-count-for-chown-and-friends Dave Hansen 2007-11-01 23:08 ` [PATCH 20/27] r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate Dave Hansen 2007-11-01 23:08 ` [PATCH 21/27] r-o-bind-mounts-make-access-use-mnt-check Dave Hansen 2007-11-01 23:08 ` [PATCH 22/27] r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly Dave Hansen 2007-11-01 23:08 ` [PATCH 23/27] r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create Dave Hansen 2007-11-01 23:08 ` Dave Hansen [this message] 2007-11-01 23:09 ` [PATCH 25/27] r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts Dave Hansen 2007-11-05 23:35 ` Andrew Morton 2007-11-01 23:09 ` [PATCH 26/27] r-o-bind-mounts-honor-r-w-changes-at-do_remount-time Dave Hansen 2007-11-01 23:09 ` [PATCH 27/27] keep track of mnt_writer state of struct file Dave Hansen
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20071101230858.6F76E20C@kernel \ --to=haveblue@us.ibm.com \ --cc=akpm@osdl.org \ --cc=hch@infradead.org \ --cc=linux-kernel@vger.kernel.org \ --cc=miklos@szeredi.hu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).