From: Jeff Mahoney
To: "Eric W. Biederman"
	Al Viro,
	Alexey Dobriyan,
	Oleg Nesterov
Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks
Date: Sat, 23 Mar 2019 23:01:08 -0400


On 3/23/19 11:56 AM, Eric W. Biederman wrote:
> Jeff Mahoney writes:
>> On 4/24/18 10:14 AM, Eric W. Biederman wrote:
>>> writes:
>>>> From: Jeff Mahoney
>>>> Hi all -
>>>> I recently encountered a customer issue where, on a machine with many TiB
>>>> of memory and a few hundred cores, after a task with a few thousand threads
>>>> and hundreds of files open exited, the system would softlockup.  That
>>>> issue was (is still) being addressed by Nik Borisov's patch to add a
>>>> cond_resched call to shrink_dentry_list.  The underlying issue is still
>>>> there, though.  We just don't complain as loudly.  When a huge task
>>>> exits, now the system is more or less unresponsive for about eight
>>>> minutes.  All CPUs are pinned and every one of them is going through
>>>> dentry and inode eviction for the procfs files associated with each
>>>> thread.  It's made worse by every CPU contending on the super's
>>>> inode list lock.
>>>> The numbers get big.  My test case was 4096 threads with 16384 files
>>>> open.  It's a contrived example, but not that far off from the actual
>>>> customer case.  In this case, a simple "find /proc" would create around
>>>> 300 million dentry/inode pairs.  More practically, lsof(1) does it too,
>>>> it just takes longer.  On smaller systems, memory pressure starts pushing
>>>> them out. Memory pressure isn't really an issue on this machine, so we
>>>> end up using well over 100GB for proc files.  It's the combination of
>>>> the wasted CPU cycles in teardown and the wasted memory at runtime that
>>>> pushed me to take this approach.
>>>> The biggest culprit is the "fd" and "fdinfo" directories, but those are
>>>> made worse by there being multiple copies of them even for the same
>>>> task without threads getting involved:
>>>> - /proc/pid/fd and /proc/pid/task/pid/fd are identical but share no
>>>>   resources.
>>>> - Every /proc/pid/task/*/fd directory in a thread group has identical
>>>>   contents (unless unshare(CLONE_FILES) was called), but shares no
>>>>   resources.
>>>> - If we do a lookup like /proc/pid/fd on a member of a thread group,
>>>>   we'll get a valid directory.  Inside, there will be a complete
>>>>   copy of /proc/pid/task/* just like in /proc/tgid/task.  Again,
>>>>   nothing is shared.
>>>> This patch set reduces some (most) of the duplication by conditionally
>>>> replacing some of the directories with symbolic links to copies that are
>>>> identical.
>>>> 1) Eliminate the duplication of the task directories between threads.
>>>>    The task directory belongs to the thread leader and the threads
>>>>    link to it: e.g. /proc/915/task -> ../910/task  This mainly
>>>>    reduces duplication when individual threads are looked up directly
>>>>    at the tgid level.  The impact varies based on the number of threads.
>>>>    The user has to go out of their way in order to mess up their system
>>>>    in this way.  But if they were so inclined, they could create ~550
>>>>    billion inodes and dentries using the test case.
>>>> 2) Eliminate the duplication of directories that are created identically
>>>>    between the tgid-level pid directory and its task directory: fd,
>>>>    fdinfo, ns, net, attr.  There is obviously more duplication between
>>>>    the two directories, but replacing a file with a symbolic link
>>>>    doesn't get us anything.  This reduces the number of files associated
>>>>    with fd and fdinfo by half if threads aren't involved.
>>>> 3) Eliminate the duplication of fd and fdinfo directories among threads
>>>>    that share a files_struct.  We check at directory creation time if
>>>>    the task is a group leader and if not, whether it shares ->files with
>>>>    the group leader.  If so, we create a symbolic link to ../tgid/fd*.
>>>>    We use a d_revalidate callback to check whether the thread has called
>>>>    unshare(CLONE_FILES) and, if so, fail the revalidation for the symlink.
>>>>    Upon re-lookup, a directory will be created in its place.  This is
>>>>    pretty simple, so if the thread group leader calls unshare, all threads
>>>>    get directories.
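
The decision described in item 3 can be modeled in a few lines of Python.
This is only an illustrative sketch, not the kernel code from the patch:
the Task class, the files_id field (standing in for the ->files pointer),
and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    tid: int
    tgid: int
    files_id: int  # hypothetical stand-in for the task's ->files pointer


def fd_entry(task, leader):
    """Model what a lookup of a task's fd entry would create."""
    if task.tid == task.tgid:
        return ("dir", None)                      # group leader owns the directory
    if task.files_id == leader.files_id:
        return ("symlink", f"../{task.tgid}/fd")  # shares files with the leader
    return ("dir", None)                          # thread has unshared its files


def revalidate(entry, task, leader):
    """Model d_revalidate: a symlink is valid only while files are still shared."""
    kind, _target = entry
    if kind == "symlink":
        return task.files_id == leader.files_id
    return True
```

In this model, a thread that calls unshare(CLONE_FILES) gets a new files_id,
so revalidate() fails for its symlink and the next fd_entry() call yields a
directory.  If the leader's files_id changes instead, every thread's symlink
fails revalidation, matching the behavior described above.
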
>>>> With these patches applied, running the same testcase, the proc_inode
>>>> cache only gets to about 600k objects, which is about 99.7% fewer.  I
>>>> get that procfs isn't supposed to be scalable, but this is kind of
>>>> extreme. :)
>>>> Finally, I'm not a procfs expert.  I'm posting this as an RFC for folks
>>>> with more knowledge of the details to pick it apart.  The biggest
>>>> concern is that I'm not sure if any tools depend on any of these
>>>> things being directories instead of symlinks.  I'd hope not, but I
>>>> don't have the answer.  I'm sure there are corner cases I'm missing.
>>>> Hopefully, it's not just flat out broken since this is a problem that
>>>> does need solving.
>>>> Now I'll go put on the fireproof suit.
>> Thanks for your comments.  This ended up having to get back-burnered but
>> I've finally found some time to get back to it.  I have new patches that
>> don't treat each entry as a special case and make more sense, IMO.
>> They're not worth posting yet since some of the issues below remain.
>>> This needs to be tested against at least apparmor to see if this breaks
>>> common policies.  Changing files to symlinks in proc has a bad habit of
>>> either breaking apparmor policies or userspace assumptions.   Symbolic
>>> links are unfortunately visible to userspace.
>> AppArmor uses the @{pids} var in profiles that translates to a numeric
>> regex.  That means that /proc/pid/task -> /proc/tgid/task won't break
>> profiles but /proc/pid/fdinfo -> /proc/pid/task/tgid/fdinfo will break.
>> AppArmor doesn't have a follow_link hook at all, so all that matters is
>> the final path.  SELinux does have a follow_link hook, but I'm not
>> familiar enough with it to know whether introducing a symlink in proc
>> will make a difference.
>> I've dropped the /proc/pid/{dirs} -> /proc/pid/task/pid/{dirs} part
>> since that clearly won't work.
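
The difference between the two symlink cases can be sketched with a
simplified matcher.  The assumptions here are loud ones: @{pids} is modeled
as the regex [0-9]+, "*" matches within a single path component, the two
rules are hypothetical examples rather than lines from a real profile, and
only the final resolved path is checked.

```python
import re

PIDS = r"[0-9]+"  # assumption: simplified stand-in for AppArmor's @{pids}


def rule_to_regex(rule):
    """Translate a simplified rule into a compiled regex."""
    parts = re.split(r"(@\{pids\}|\*)", rule)
    out = []
    for part in parts:
        if part == "@{pids}":
            out.append(PIDS)
        elif part == "*":
            out.append(r"[^/]*")   # glob star does not cross '/'
        else:
            out.append(re.escape(part))
    return re.compile("".join(out))


def allowed(rule, final_path):
    """True if the resolved path matches the rule in full."""
    return rule_to_regex(rule).fullmatch(final_path) is not None


# /proc/915/task -> ../910/task: the resolved path has the same shape.
print(allowed("/proc/@{pids}/task/@{pids}/stat", "/proc/910/task/915/stat"))

# /proc/910/fdinfo -> task/910/fdinfo: the resolved path gains two components.
print(allowed("/proc/@{pids}/fdinfo/*", "/proc/910/task/910/fdinfo/3"))
```

Under these assumptions the first check passes and the second fails, which
is why only the task symlink survives existing profiles.
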
>>> Further the proc structure is tgid/task/tid where the leaf directories
>>> are per thread.
>> Yes, but threads are still in /proc for lookup at the tgid level even if
>> they don't show up in readdir.
>>> We more likely could get away with some magic symlinks (that would not
>>> be user visible) rather than actual symlinks.
>> I think I'm missing something here.  Aren't magic symlinks still
>> represented to the user as symlinks?
>>> So I think you are probably on the right track to reduce the memory
>>> usage but I think some more work will be needed to make it transparently
>>> backwards compatible.
>> Yeah, that's going to be the big hiccup.  I think I've resolved the
>> biggest issue with AppArmor, but I don't think the problem is solvable
>> without introducing symlinks.
> Has anyone looked at making the fd and fdinfo files hard links?

That could work to a certain degree.  It would certainly reduce the
inode count.  It would still create all the dentries, though.  That's
still an n^2 problem where n is the number of threads in the group.
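
As a back-of-the-envelope sketch of that growth (plain Python arithmetic,
not a measurement): it assumes one dentry per fd entry and one per fdinfo
entry, ignores every other per-task file, and the function names are made
up for illustration.

```python
def visible_walk_dentries(threads, open_files):
    """fd + fdinfo entries reachable by readdir: the tgid dir plus each task dir."""
    return (threads + 1) * 2 * open_files


def worst_case_dentries(threads, open_files):
    """Every thread looked up at the tgid level, each exposing a full task/ copy."""
    return threads * threads * 2 * open_files


print(visible_walk_dentries(4096, 16384))  # 134250496
print(worst_case_dentries(4096, 16384))    # 549755813888
```

The second figure is 2^39, which lines up with the ~550 billion from the
original cover letter.  The first covers only fd and fdinfo, so it is a
lower bound on what a full walk of /proc creates.  Hard links would not
change either dentry count.
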

> Alternatively it may make sense to see if there is something that we can
> do with the locking to reduce the thundering herd problem that is being
> seen.

Yeah, that could still use some attention.  The thundering herd problem
is more of a tap when you reduce the contention by 99% though.


Jeff Mahoney



Thread overview: 15+ messages
2018-04-24  2:21 jeffm
2018-04-24  2:21 ` [PATCH 1/5] procfs: factor out a few helpers jeffm
2018-04-24  2:21 ` [PATCH 2/5] procfs: factor out inode revalidation work from pid_revalidation jeffm
2018-04-24  2:21 ` [PATCH 3/5] procfs: use symlinks for /proc/<pid>/task when not thread group leader jeffm
2018-04-24  2:21 ` [PATCH 4/5] procfs: share common directories between /proc/tgid and /proc/tgid/task/tgid jeffm
2018-04-24  2:21 ` [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared jeffm
2018-04-24 15:41   ` kbuild test robot
2018-04-24 15:41   ` [RFC PATCH] procfs: proc_pid_files_link_dentry_operations can be static kbuild test robot
2018-04-24  6:17 ` [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks Alexey Dobriyan
2018-04-25 18:04   ` Jeff Mahoney
2018-04-24 14:14 ` Eric W. Biederman
2018-04-26 21:03   ` Jeff Mahoney
2019-03-21 18:30   ` Jeff Mahoney
2019-03-23 15:56     ` Eric W. Biederman
2019-03-24  3:01       ` Jeff Mahoney [this message]
