From: Ian Kent <raven@themaw.net>
To: NeilBrown <neilb@suse.com>, Mike Marion <mmarion@qualcomm.com>
Cc: autofs mailing list <autofs@vger.kernel.org>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
Date: Wed, 29 Nov 2017 10:56:25 +0800
Message-ID: <ae8e93be-8e3f-ffd8-9043-13737230d18d@themaw.net>
In-Reply-To: <87a7z5yjbs.fsf@notabene.neil.brown.name>


Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
Al have discussed something similar. Al, please feel free to chime
in with your thoughts.

On 29/11/17 09:17, NeilBrown wrote:
> On Tue, Nov 28 2017, Mike Marion wrote:
> 
>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>
>>> I think the situation is going to get worse before it gets better.
>>>
>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>> I see:
>>>
>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>
>>> all go crazy consuming large amounts of CPU.
>>
>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>> it'll be more of a problem as time goes on).  We have pretty huge
>> direct maps and our initial startup tests on a new host with the link vs
>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>> to come up with a fix, which should've been pushed here some time ago.
>>
>> Then, there's shutdowns (and reboots). They also took a long time (on
>> the order of 20+min) because it would walk the entire /proc/mounts
>> "unmounting" things.  Also fixed now.  That one had something to do with
>> SMP code: with a single CPU/core, it didn't take long at all.
>>
>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>> looking for the root dev to skip over fstype autofs
>> (probe_nfsroot_device function).
>>
>>> The symlink change was probably the start; a number of applications
>>> now go directly to the proc file system for this information.
>>>
>>> For large mount tables and many processes accessing the mount table
>>> (probably reading the whole thing, either periodically or on change
>>> notification) the current system does not scale well at all.
>>
>> We use Clearcase in some instances as well, and that's yet another thing
>> adding mounts, and its startup is very slow, due to the size of
>> /proc/mounts.  
>>
>> It's definitely something that's more than just autofs and probably
>> going to get worse, as you say.
> 
> If we assume that applications are going to want to read
> /proc/self/mount* a lot, we probably need to make it faster.
> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
> copied /proc/self/mountinfo to /tmp/mountinfo, then
> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
> On a single CPU VM:
>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
> On a 4 CPU VM
>   /tmp/mountinfo: 1.5secs
>   /proc/self/mountinfo: 3.5 secs
> 
> Using "perf record" it appears that most of the cost is repeated calls
> to prepend_path, with a small contribution from the fact that each read
> only returns 4K rather than the 128K that cat asks for.
> 
> If we could hang a cache off struct mnt_namespace and use it instead of
> iterating the mount table - using rcu and ns->event to ensure currency -
> we should be able to minimize the cost of this increased use of
> /proc/self/mount*.
> 
> I suspect that the best approach would be to implement a cache at the
> seq_file level.
> 
> One possible problem might be if applications assume that a read will
> always return a whole number of lines (it currently does).  To be
> sure we remain safe, we would only be able to use the cache for
> a read() syscall which reads the whole file.
> How big do people see /proc/self/mount* getting?  What size reads
> does 'strace' show the various programs using to read it?
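
For anyone wanting to repeat the measurement above without strace,
here is a rough Python sketch. It reads a file to EOF using 128K
read() requests (the size cat is said to ask for), reports how many
reads that took and the largest single read, and times repeated
whole-file reads. The function names read_whole and time_reads are
made up for illustration, and /tmp/mountinfo is assumed to be a copy
made beforehand as in the experiment.

```python
import os
import time


def read_whole(path, bufsize=128 * 1024):
    """Read path to EOF with read() requests of bufsize bytes.

    Returns (number of non-empty reads, total bytes, largest single read).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        reads = total = largest = 0
        while True:
            chunk = os.read(fd, bufsize)
            if not chunk:
                break
            reads += 1
            total += len(chunk)
            largest = max(largest, len(chunk))
        return reads, total, largest
    finally:
        os.close(fd)


def time_reads(path, iterations=1000):
    """Wall-clock seconds taken to read path to EOF `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        read_whole(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Compare the proc file against a plain copy of it, if both exist.
    for p in ("/proc/self/mountinfo", "/tmp/mountinfo"):
        if os.path.exists(p):
            print(p, read_whole(p), "%.2fs" % time_reads(p))
```

The "largest single read" figure is the one that would show whether
the proc file hands back less per read() than a regular file does.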

Buffer size almost always has a significant impact on IO so that's
likely a big factor but the other aspect of this is notification
of changes.

The risk is that improving the IO efficiency might just allow a
higher rate of change notification processing, leading to symptoms
similar to what we have now.

The suggestion is that a system that allows for incremental (diff
type) update notification is needed to allow mount table propagation
to scale well.

That implies some as yet undefined user <-> kernel communication
protocol.
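
To make "diff type" a bit more concrete, here is a purely user-space
sketch in Python of the information such an update would carry. It
compares two snapshots of mountinfo text keyed on the mount ID (the
first field of each line) and reports added, removed and changed
entries. This is only an illustration: no such kernel interface
exists as far as this thread is concerned, the function names are
invented, and the sample lines in the usage note are made up.

```python
def parse_mountinfo(text):
    """Map mount ID -> full line for each non-empty mountinfo line."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            # The first space-separated field of a mountinfo line is the mount ID.
            mount_id = int(line.split(" ", 1)[0])
            table[mount_id] = line
    return table


def diff_mounts(old_text, new_text):
    """Return (added, removed, changed) lists of mountinfo lines."""
    old = parse_mountinfo(old_text)
    new = parse_mountinfo(new_text)
    added = [new[i] for i in sorted(new.keys() - old.keys())]
    removed = [old[i] for i in sorted(old.keys() - new.keys())]
    # Same mount ID but different line text, e.g. changed mount options.
    changed = [new[i] for i in sorted(old.keys() & new.keys())
               if old[i] != new[i]]
    return added, removed, changed
```

For example, with two hypothetical snapshots:

```python
before = "21 1 0:20 / /a rw - tmpfs tmpfs rw\n22 1 0:21 / /b rw - tmpfs tmpfs rw\n"
after = "21 1 0:20 / /a ro - tmpfs tmpfs ro\n23 1 0:22 / /c rw - tmpfs tmpfs rw\n"
added, removed, changed = diff_mounts(before, after)
```

Note that this still reads and parses the whole table for every
comparison, which is exactly the cost an incremental notification
would be meant to avoid. A mount ID reused after an umount would
also show up as "changed" rather than as a remove plus an add.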

Ian

