LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v4 0/4] Deterministic charging of shared memory
@ 2021-11-20  4:50 Mina Almasry
  2021-11-20  4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  4:50 UTC (permalink / raw)
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
	Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner,
	Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel,
	linux-fsdevel, linux-mm

Problem:
Currently shared memory is charged to the memcg of the allocating
process. This makes memory usage of processes accessing shared memory
a bit unpredictable since whichever process accesses the memory first
will get charged. We have a number of use cases where our userspace
would like deterministic charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.

2. Shared filesystem between subtasks of a large job
Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its
subtasks use that tmpfs mount for various purposes such as data
sharing or persistent data between the subtask restarts. In kubernetes
terminology, the meta job is similar to pods and subtasks are
containers under pods. We want the shared memory to be
deterministically charged to the kubernetes's pod and independent to
the lifetime of containers under the pod.

3. Shared libraries and language runtimes shared between independent jobs.
We’d like to optimize memory usage on the machine by sharing libraries
and language runtimes of many of the processes running on our machines
in separate memcgs. This produces a side effect that one job may be
unlucky to be the first to access many of the libraries and may get
oom killed as all the cached files get charged to it.

Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems:
directing all the memory of the file system to be ‘remote charged’ to
cgroup provided by that memcg= option.

Caveats:

1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case the
oom-killer will be invoked, but the oom-killer may not find anything
to kill in the target memcg being charged. Thera are a number of considerations
in this case:

1. It's not great to kill the allocating process since the allocating process
   is not running in the memcg under oom, and killing it will not free memory
   in the memcg under oom.
2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
   somehow. If not, the process will forever loop the pagefault in the upstream
   kernel.

In this case, I propose simply failing the remote charge and returning an ENOSPC
to the caller. This will cause will cause the process executing the remote
charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
path.  This will be documented behavior of remote charging, and this feature is
opt-in. Users can:
- Not opt-into the feature if they want.
- Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
  abort if they desire.
- Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
  operation without executing the remote charge if possible.

2. Only processes allowed the enter cgroup at mount time can mount a
tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
process with write access to this mount point will be able to charge memory to
<cgroup>. This is largely a non-issue because in configurations where there is
untrusted code running on the machine, mount point access needs to be
restricted to the intended users only regardless of whether the mount point
memory is deterministly charged or not.

[1] https://research.google/pubs/pub48630

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

Mina Almasry (4):
  mm: support deterministic memory charging of filesystems
  mm/oom: handle remote ooms
  mm, shmem: add filesystem memcg= option documentation
  mm, shmem, selftests: add tmpfs memcg= mount option tests

 Documentation/filesystems/tmpfs.rst       |  28 ++++
 fs/fs_context.c                           |  27 ++++
 fs/proc_namespace.c                       |   4 +
 fs/super.c                                |   9 ++
 include/linux/fs.h                        |   5 +
 include/linux/fs_context.h                |   2 +
 include/linux/memcontrol.h                |  38 +++++
 mm/filemap.c                              |   2 +-
 mm/khugepaged.c                           |   3 +-
 mm/memcontrol.c                           | 171 ++++++++++++++++++++++
 mm/oom_kill.c                             |   9 ++
 mm/shmem.c                                |   3 +-
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/mmap_write.c   | 103 +++++++++++++
 tools/testing/selftests/vm/tmpfs-memcg.sh | 116 +++++++++++++++
 15 files changed, 518 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/vm/mmap_write.c
 create mode 100755 tools/testing/selftests/vm/tmpfs-memcg.sh

--
2.34.0.rc2.393.gf8c9666880-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/4] mm: support deterministic memory charging of filesystems
  2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
@ 2021-11-20  4:50 ` Mina Almasry
  2021-11-20  7:53   ` Shakeel Butt
  2021-11-20  4:50 ` [PATCH v4 2/4] mm/oom: handle remote ooms Mina Almasry
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  4:50 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Hugh Dickins
  Cc: Mina Almasry, Jonathan Corbet, Shuah Khan, Shakeel Butt,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm,
	cgroups

Users can specify a memcg= mount option option at mount time and all
data page charges will be charged to the memcg supplied.  This is useful
to deterministicly charge the memory of the file system or memory shared
via tmpfs for example.

Implementation notes:
- Add memcg= option parsing to fs common code.
- We attach the memcg to charge for this filesystem data pages to the
  struct super_block. The memcg can be changed via a remount operation,
  and all future memcg charges in this filesystem will be charged to
  the new memcg.
- We create a new interface mem_cgroup_charge_mapping(), which will
  check if the super_block in the mapping has a memcg to charge. It
  charges that, and falls back to the mm passed if there is no
  super_block memcg.
- On filesystem data memory allocation paths, we call the new interface
  mem_cgroup_charge_mapping().

Caveats:
- Processes are only allowed to direct filesystem charges to a cgroup that
  they themselves can enter and allocate memory in. This so that we do not
  introduce an attack vector where processes can DoS any cgroup in the
  system that they are not normally allowed to enter and allocate memory in.
- In mem_cgroup_charge_mapping() we pay the cost of checking whether the
  super_block has a memcg to charge, regardless of whether the mount
  point was mounted with memcg=. This can be alleviated by putting the
  memcg to charge in the struct address_space, but, this increases the
  size of that struct and makes it difficult to support remounting the
  memcg= option, although remounting is of dubious value.
- mem_cgroup_charge_mapping() simply returns any error received from the
  following charge_memcg() or mem_cgroup_charge() calls. There is
  a follow up patch in this series which closely examines and handles the
  behavior when hitting the limit of the remote memcg.

Signed-off-by: Mina Almasry <almasrymina@google.com>


---

Changes in v4:
- Added cover letter and moved list of Cc's there.
- Made memcg= option generic to all file systems.
- Reverted to calling mem_cgroup_charge_mapping() for generic file
  system allocation paths, since this feature is not implemented for all
  filesystems.
- Refactored some memcontrol interfaces slightly to reduce the number of
  "#ifdef CONFIG_MEMCG" needed in other files.

Changes in v3:
- Fixed build failures/warnings Reported-by: kernel test robot <lkp@intel.com>

Changes in v2:
- Fixed Roman's email.
- Added a new wrapper around charge_memcg() instead of __mem_cgroup_charge()
- Merged the permission check into this patch as Roman suggested.
- Instead of checking for a s_memcg_to_charge off the superblock in the
filemap code, I set_active_memcg() before calling into the fs generic
code as Dave suggests.
- I have kept the s_memcg_to_charge in the superblock to keep the
struct address_space pointer small and preserve the remount use case..

---
 fs/fs_context.c            |  27 +++++++
 fs/proc_namespace.c        |   4 ++
 fs/super.c                 |   9 +++
 include/linux/fs.h         |   5 ++
 include/linux/fs_context.h |   2 +
 include/linux/memcontrol.h |  32 +++++++++
 mm/filemap.c               |   2 +-
 mm/khugepaged.c            |   3 +-
 mm/memcontrol.c            | 142 +++++++++++++++++++++++++++++++++++++
 mm/shmem.c                 |   3 +-
 10 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index b7e43a780a625..fe2449d5f1fbf 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -23,6 +23,7 @@
 #include <asm/sections.h>
 #include "mount.h"
 #include "internal.h"
+#include <linux/memcontrol.h>

 enum legacy_fs_param {
 	LEGACY_FS_UNSET_PARAMS,
@@ -108,6 +109,28 @@ int vfs_parse_fs_param_source(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_fs_param_source);

+static int parse_param_memcg(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (strcmp(param->key, "memcg") != 0)
+		return -ENOPARAM;
+
+	if (param->type != fs_value_is_string)
+		return invalf(fc, "Non-string source");
+
+	if (fc->memcg)
+		return invalf(fc, "Multiple memcgs specified");
+
+	memcg = mem_cgroup_get_from_path(param->string);
+	if (IS_ERR(memcg))
+		return invalf(fc, "Bad value for memcg");
+
+	fc->memcg = memcg;
+	param->string = NULL;
+	return 0;
+}
+
 /**
  * vfs_parse_fs_param - Add a single parameter to a superblock config
  * @fc: The filesystem context to modify
@@ -148,6 +171,10 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 			return ret;
 	}

+	ret = parse_param_memcg(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
+
 	/* If the filesystem doesn't take any arguments, give it the
 	 * default handling of source.
 	 */
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 392ef5162655b..32e1647dcef43 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -12,6 +12,7 @@
 #include <linux/security.h>
 #include <linux/fs_struct.h>
 #include <linux/sched/task.h>
+#include <linux/memcontrol.h>

 #include "proc/internal.h" /* only for get_proc_task() in ->open() */

@@ -125,6 +126,9 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
 	if (err)
 		goto out;
 	show_mnt_opts(m, mnt);
+
+	mem_cgroup_put_name_in_seq(m, sb);
+
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
diff --git a/fs/super.c b/fs/super.c
index 3bfc0f8fbd5bc..06c972f80c529 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -24,6 +24,7 @@
 #include <linux/export.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/memcontrol.h>
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/writeback.h>		/* for the emergency remount stuff */
@@ -180,6 +181,7 @@ static void destroy_unused_super(struct super_block *s)
 	up_write(&s->s_umount);
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	mem_cgroup_set_charge_target(s, NULL);
 	security_sb_free(s);
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
@@ -292,6 +294,7 @@ static void __put_super(struct super_block *s)
 		WARN_ON(s->s_dentry_lru.node);
 		WARN_ON(s->s_inode_lru.node);
 		WARN_ON(!list_empty(&s->s_mounts));
+		mem_cgroup_set_charge_target(s, NULL);
 		security_sb_free(s);
 		fscrypt_sb_free(s);
 		put_user_ns(s->s_user_ns);
@@ -904,6 +907,9 @@ int reconfigure_super(struct fs_context *fc)
 		}
 	}

+	if (fc->memcg)
+		mem_cgroup_set_charge_target(sb, fc->memcg);
+
 	if (fc->ops->reconfigure) {
 		retval = fc->ops->reconfigure(fc);
 		if (retval) {
@@ -1528,6 +1534,9 @@ int vfs_get_tree(struct fs_context *fc)
 		return error;
 	}

+	if (fc->memcg)
+		mem_cgroup_set_charge_target(sb, fc->memcg);
+
 	/*
 	 * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
 	 * but s_maxbytes was an unsigned long long for many releases. Throw
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3afca821df32e..59407b3e7aee3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1567,6 +1567,11 @@ struct super_block {
 	struct workqueue_struct *s_dio_done_wq;
 	struct hlist_head s_pins;

+#ifdef CONFIG_MEMCG
+	/* memcg to charge for pages allocated to this filesystem */
+	struct mem_cgroup *s_memcg_to_charge;
+#endif
+
 	/*
 	 * Owning user namespace and default context in which to
 	 * interpret filesystem uids, gids, quotas, device nodes,
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 6b54982fc5f37..8e2cc1e554fa1 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -25,6 +25,7 @@ struct super_block;
 struct user_namespace;
 struct vfsmount;
 struct path;
+struct mem_cgroup;

 enum fs_context_purpose {
 	FS_CONTEXT_FOR_MOUNT,		/* New superblock for explicit mount */
@@ -110,6 +111,7 @@ struct fs_context {
 	bool			need_free:1;	/* Need to call ops->free() */
 	bool			global:1;	/* Goes into &init_user_ns */
 	bool			oldapi:1;	/* Coming from mount(2) */
+	struct mem_cgroup 	*memcg;		/* memcg to charge */
 };

 struct fs_context_operations {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6b..0a9b0bba5f3c8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,6 +27,7 @@ struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct super_block;

 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -923,6 +924,15 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 	return !!(memcg->css.flags & CSS_ONLINE);
 }

+void mem_cgroup_set_charge_target(struct super_block *sb,
+				  struct mem_cgroup *memcg);
+
+int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
+			      gfp_t gfp, struct address_space *mapping);
+
+struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
+void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb);
+
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);

@@ -1223,6 +1233,28 @@ static inline int mem_cgroup_charge(struct folio *folio,
 	return 0;
 }

+static inline void mem_cgroup_set_charge_target(struct super_block *sb,
+						struct mem_cgroup *memcg)
+{
+}
+
+static inline int mem_cgroup_charge_mapping(struct folio *folio,
+					    struct mm_struct *mm, gfp_t gfp,
+					    struct address_space *mapping)
+{
+	return 0;
+}
+
+static inline struct mem_cgroup *mem_cgroup_get_from_path(const char *path)
+{
+	return NULL;
+}
+
+static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq,
+					      struct super_block *sb)
+{
+}
+
 static inline int mem_cgroup_swapin_charge_page(struct page *page,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 6844c9816a864..3825cf12bc345 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	folio->index = index;

 	if (!huge) {
-		error = mem_cgroup_charge(folio, NULL, gfp);
+		error = mem_cgroup_charge_mapping(folio, NULL, gfp, mapping);
 		VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
 		if (error)
 			goto error;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1ab..8468a3ad446b9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1661,7 +1661,8 @@ static void collapse_file(struct mm_struct *mm,
 		goto out;
 	}

-	if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
+	if (unlikely(mem_cgroup_charge_mapping(page_folio(new_page), mm, gfp,
+					       mapping))) {
 		result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e920153..c4ba7f364c214 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/string.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2580,6 +2581,129 @@ void mem_cgroup_handle_over_high(void)
 	css_put(&memcg->css);
 }

+/*
+ * Non error return value must eventually be released with css_put().
+ */
+struct mem_cgroup *mem_cgroup_get_from_path(const char *path)
+{
+	static const char procs_filename[] = "/cgroup.procs";
+	struct file *file, *procs;
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+	char *procs_path =
+		kmalloc(strlen(path) + sizeof(procs_filename), GFP_KERNEL);
+
+	if (procs_path == NULL)
+		return ERR_PTR(-ENOMEM);
+	strcpy(procs_path, path);
+	strcat(procs_path, procs_filename);
+
+	procs = filp_open(procs_path, O_WRONLY, 0);
+	kfree(procs_path);
+
+	/*
+	 * Restrict the capability for tasks to mount with memcg charging to the
+	 * cgroup they could not join. For example, disallow:
+	 *
+	 * mount -t tmpfs -o memcg=root-cgroup nodev <MOUNT_DIR>
+	 *
+	 * if it is a non-root task.
+	 */
+	if (IS_ERR(procs))
+		return (struct mem_cgroup *)procs;
+	fput(procs);
+
+	file = filp_open(path, O_DIRECTORY | O_RDONLY, 0);
+	if (IS_ERR(file))
+		return (struct mem_cgroup *)file;
+
+	css = css_tryget_online_from_dir(file->f_path.dentry,
+					 &memory_cgrp_subsys);
+	if (IS_ERR(css))
+		memcg = (struct mem_cgroup *)css;
+	else
+		memcg = container_of(css, struct mem_cgroup, css);
+
+	fput(file);
+	return memcg;
+}
+
+void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
+{
+	struct mem_cgroup *memcg;
+	int ret = 0;
+	char *buf = __getname();
+	int len = PATH_MAX;
+
+	if (!buf)
+		return;
+
+	buf[0] = '\0';
+
+	rcu_read_lock();
+	memcg = rcu_dereference(sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	if (!memcg)
+		return;
+
+	ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
+	if (ret >= len / 2)
+		strcpy(buf, "?");
+	else {
+		char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
+
+		if (p)
+			*p = '\0';
+		else
+			strcpy(buf, "?");
+	}
+
+	css_put(&memcg->css);
+	if (buf[0] != '\0')
+		seq_printf(m, ",memcg=%s", buf);
+
+	__putname(buf);
+}
+
+/*
+ * Set or clear (if @memcg is NULL) charge association from file system to
+ * memcg.  If @memcg != NULL, then a css reference must be held by the caller to
+ * ensure that the cgroup is not deleted during this operation, this reference
+ * is dropped after this operation.
+ */
+void mem_cgroup_set_charge_target(struct super_block *sb,
+				  struct mem_cgroup *memcg)
+{
+	memcg = xchg(&sb->s_memcg_to_charge, memcg);
+	if (memcg)
+		css_put(&memcg->css);
+}
+
+/*
+ * Returns the memcg to charge for inode pages.  If non-NULL is returned, caller
+ * must drop reference with css_put().  NULL indicates that the inode does not
+ * have a memcg to charge, so the default process based policy should be used.
+ */
+static struct mem_cgroup *
+mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg;
+
+	if (!mapping)
+		return NULL;
+
+	rcu_read_lock();
+	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	return memcg;
+}
+
 static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			unsigned int nr_pages)
 {
@@ -6678,6 +6802,24 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	return ret;
 }

+int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
+			      gfp_t gfp, struct address_space *mapping)
+{
+	struct mem_cgroup *mapping_memcg;
+	int ret = 0;
+	if (mem_cgroup_disabled())
+		return 0;
+
+	mapping_memcg = mem_cgroup_mapping_get_charge_target(mapping);
+	if (mapping_memcg) {
+		ret = charge_memcg(folio, mapping_memcg, gfp);
+		css_put(&mapping_memcg->css);
+		return ret;
+	}
+
+	return mem_cgroup_charge(folio, mm, gfp);
+}
+
 int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
 {
 	struct mem_cgroup *memcg;
diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8beb781..e469da13a1b8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -709,7 +709,8 @@ static int shmem_add_to_page_cache(struct page *page,
 	page->index = index;

 	if (!PageSwapCache(page)) {
-		error = mem_cgroup_charge(page_folio(page), charge_mm, gfp);
+		error = mem_cgroup_charge_mapping(page_folio(page), charge_mm,
+						  gfp, mapping);
 		if (error) {
 			if (PageTransHuge(page)) {
 				count_vm_event(THP_FILE_FALLBACK);
--
2.34.0.rc2.393.gf8c9666880-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 2/4] mm/oom: handle remote ooms
  2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
  2021-11-20  4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
@ 2021-11-20  4:50 ` Mina Almasry
  2021-11-20  5:07   ` Matthew Wilcox
  2021-11-20  7:58   ` Shakeel Butt
  2021-11-20  4:50 ` [PATCH v4 3/4] mm, shmem: add filesystem memcg= option documentation Mina Almasry
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  4:50 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Hugh Dickins,
	Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner,
	Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel,
	linux-fsdevel, linux-mm, cgroups

On remote ooms (OOMs due to remote charging), the oom-killer will attempt
to find a task to kill in the memcg under oom. The oom-killer may be
unable to find a process to kill if there are no killable processes in
the remote memcg. In this case, the oom-killer (out_of_memory()) will return
false, and depending on the gfp, that will generally get bubbled up to
mem_cgroup_charge_mapping() as an ENOMEM.

A few considerations on how to handle this edge case:

1. memcg= is an opt-in feature, so we have some flexibility with the
   behavior that we export to userspace using this feature to carry
   out remote charges that may result in remote ooms. The critical thing
   is to document this behavior so the userspace knows what to expect
   and handle the edge cases.

2. It is generally not desirable to kill the allocating process, because it's
   not a member of the remote memcg which is under oom, and so killing it
   will almost certainly not free any memory in the memcg under oom.

3. There are allocations that happen in pagefault paths, as well as
   those that happen in non-pagefault paths, and the error returned from
   mem_cgroup_charge_mapping() will be handled by the caller resulting
   in different behavior seen by the userspace in the pagefault and
   non-pagefault paths. For example, currently if mem_cgroup_charge_mapping()
   returns ENOMEM, the caller will generally get an ENOMEM on non-pagefault
   paths, and the caller will be stuck looping the pagefault forever in the
   pagefault path.

4. In general, it's desirable to give userspace the option to gracefully
   handle and recover from a failed remote charge rather than kill the
   process or put it into a situation that's hard to recover from.

With these considerations, the thing that makes most sense here is to
handle this edge case similarly to how we handle ENOSPC error, and to return
ENOSPC from mem_cgroup_charge_mapping() when the remote charge
fails. This has the desirable properties:

1. On pagefault allocations, the userspace will get a SIGBUS if the remote
   charge fails, and the userspace is able to catch this signal and handle it
   to recover gracefully as desired.

2. On non-pagefault paths, the userspace will get an ENOSPC error which
   it can also handle gracefully, if desired.

3. We would not leave the remote charging process in a looping
   pagetfault (a state somewhat hard to recover from) or kill it.

Implementation notes:

1. To get the ENOSPC behavior we alegedly want, in
   mem_cgroup_charge_mapping() we detect whether charge_memcg() has
   failed, and we return ENOSPC here.

2. If the oom-killer is invoked and finds nothing to kill, it prints out
   the "Out of memory and no killable processes..." message, which can
   be spammy if the system is executing many remote charges and
   generally will cause worry as it will likely be seen as a scary
   looking kernel warning, even though this is somewhat of an expected edge
   case to run into and we handle it adequately. Therefore, in out_of_memory()
   we return early to not print this warning. This is not necessary for the
   functionality of the remote charges.

Signed-off-by: Mina Almasry <almasrymina@google.com>


---

Changes in v4:
- Greatly expanded on the commit message to include all my current
thinking.
- Converted the patch to handle remote ooms similarly to ENOSPC, rather
than ENOMEM.

Changes in v3:
- Fixed build failures/warnings Reported-by: kernel test robot <lkp@intel.com>

Changes in v2:
- Moved the remote oom handling as Roman requested.
- Used mem_cgroup_from_task(current) instead of grabbing the memcg from
current->mm

---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 31 ++++++++++++++++++++++++++++++-
 mm/oom_kill.c              |  9 +++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0a9b0bba5f3c8..451feebabf160 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,7 @@ int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,

 struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
 void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb);
+bool is_remote_oom(struct mem_cgroup *memcg_under_oom);

 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
@@ -1255,6 +1256,11 @@ static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq,
 {
 }

+static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	return false;
+}
+
 static inline int mem_cgroup_swapin_charge_page(struct page *page,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4ba7f364c214..3e5bc2c32c9b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2668,6 +2668,35 @@ void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
 	__putname(buf);
 }

+/*
+ * Returns true if current's mm is a descendant of the memcg_under_oom (or
+ * equal to it). False otherwise. This is used by the oom-killer to detect
+ * ooms due to remote charging.
+ */
+bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	struct mem_cgroup *current_memcg;
+	bool is_remote_oom;
+
+	if (!memcg_under_oom)
+		return false;
+
+	rcu_read_lock();
+	current_memcg = mem_cgroup_from_task(current);
+	if (current_memcg && !css_tryget_online(&current_memcg->css))
+		current_memcg = NULL;
+	rcu_read_unlock();
+
+	if (!current_memcg)
+		return false;
+
+	is_remote_oom =
+		!mem_cgroup_is_descendant(current_memcg, memcg_under_oom);
+	css_put(&current_memcg->css);
+
+	return is_remote_oom;
+}
+
 /*
  * Set or clear (if @memcg is NULL) charge association from file system to
  * memcg.  If @memcg != NULL, then a css reference must be held by the caller to
@@ -6814,7 +6843,7 @@ int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
 	if (mapping_memcg) {
 		ret = charge_memcg(folio, mapping_memcg, gfp);
 		css_put(&mapping_memcg->css);
-		return ret;
+		return ret == -ENOMEM ? -ENOSPC : ret;
 	}

 	return mem_cgroup_charge(folio, mm, gfp);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0a7e16b16b8c3..8db500b337415 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1108,6 +1108,15 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		if (is_remote_oom(oc->memcg)) {
+			/*
+			 * For remote ooms with no killable processes, return
+			 * false here without logging the warning below as we
+			 * expect the caller to handle this as they please.
+			 */
+			return false;
+		}
+
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
--
2.34.0.rc2.393.gf8c9666880-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 3/4] mm, shmem: add filesystem memcg= option documentation
  2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
  2021-11-20  4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
  2021-11-20  4:50 ` [PATCH v4 2/4] mm/oom: handle remote ooms Mina Almasry
@ 2021-11-20  4:50 ` Mina Almasry
  2021-11-20  4:50 ` [PATCH v4 4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests Mina Almasry
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  4:50 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Mina Almasry, Alexander Viro, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
	Roman Gushchin, Theodore Ts'o, linux-kernel, linux-fsdevel,
	linux-mm, linux-doc

Document the usage of the memcg= mount option, as well as permission
restrictions of its use and caveats with remote charging.

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v4:
- Added more info about the permissions to mount with memcg=, and the
  importance of restricting write access to the mount point.
- Changed documentation to describe the ENOSPC/SIGBUS behavior rather
  than the ENOMEM behavior implemented in earlier patches.
- I did not find a good place to put this documentation after making the
  mount option generic. Please let me know if there is a good place to
  add this, and if not I can add a new file. Thanks!

---
 Documentation/filesystems/tmpfs.rst | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 0408c245785e3..dc1f46e16eaf4 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -137,6 +137,34 @@ mount options.  It can be added later, when the tmpfs is already mounted
 on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.


+If CONFIG_MEMCG is enabled, filesystems (including tmpfs) has a mount option to
+specify the memory cgroup to be charged for page allocations.
+
+memcg=/sys/fs/cgroup/unified/test/: data page allocations are charged to
+cgroup /sys/fs/cgroup/unified/test/.
+
+Only processes that have write access to
+/sys/fs/cgroup/unified/test/cgroup.procs can mount a tmpfs with
+memcg=/sys/fs/cgroup/unified/test. Thus, a process is able to charge memory to a
+cgroup only if it itself is able to enter that cgroup and allocate memory
+there. This is to prevent random processes from mounting filesystems in user
+namespaces and intentionally DoSing random cgroups running on the system.
+
+Once a mount point is created with memcg=, any process that has write access to
+this mount point is able to use this mount point and direct charges to the
+cgroup provided. Thus, it is important to limit write access to the mount point
+to the intended users if untrusted code is running on the machine. This is
+generally required regardless of whether the mount is done with memcg= or not.
+
+When charging memory to the remote memcg (memcg specified with memcg=) and
+hitting that memcg's limit, the oom-killer will be invoked (if enabled) and will
+attempt to kill a process in the remote memcg. If no killable processes are
+found, the remote charging process gets an ENOSPC error. If the remote charging
+process is in the pagefault path, it gets a SIGBUS signal. It's recommended
+that processes executing remote charges are able to handle a SIGBUS signal or
+ENOSPC error that may arise during executing the remote charges.
+
+
 To specify the initial root directory you can use the following mount
 options:

--
2.34.0.rc2.393.gf8c9666880-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests
  2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
                   ` (2 preceding siblings ...)
  2021-11-20  4:50 ` [PATCH v4 3/4] mm, shmem: add filesystem memcg= option documentation Mina Almasry
@ 2021-11-20  4:50 ` Mina Almasry
  2021-11-20  5:01 ` [PATCH v4 0/4] Deterministic charging of shared memory Matthew Wilcox
  2021-11-22 19:04 ` Johannes Weiner
  5 siblings, 0 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  4:50 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

- Test mounting and remounting with memcg= succeeds.
- Test that simple writes in this file system are charged to the correct
  memecg.
- Test that on non-pagefault paths the calling process gets an ENOSPC.
- Test that in pagefault paths the calling process gets a SIGBUS.

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v4:

- Convert tests to expect ENOSPC/SIGBUS rather than ENOMEM oom behavior.
- Added remount test.

---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/mmap_write.c   | 103 +++++++++++++++++++
 tools/testing/selftests/vm/tmpfs-memcg.sh | 116 ++++++++++++++++++++++
 3 files changed, 220 insertions(+)
 create mode 100644 tools/testing/selftests/vm/mmap_write.c
 create mode 100755 tools/testing/selftests/vm/tmpfs-memcg.sh

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 2e7e86e852828..cb229974c5f15 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -19,6 +19,7 @@ madv_populate
 userfaultfd
 mlock-intersect-test
 mlock-random-test
+mmap_write
 virtual_address_range
 gup_test
 va_128TBswitch
diff --git a/tools/testing/selftests/vm/mmap_write.c b/tools/testing/selftests/vm/mmap_write.c
new file mode 100644
index 0000000000000..88a8468f2128c
--- /dev/null
+++ b/tools/testing/selftests/vm/mmap_write.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program faults memory in tmpfs
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+/* Global definitions. */
+
+/* Global variables. */
+static const char *self;
+static char *shmaddr;
+static int shmid;
+
+/*
+ * Show usage and exit.
+ */
+static void exit_usage(void)
+{
+	printf("Usage: %s -p <path to tmpfs file> -s <size to map>\n", self);
+	exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+	int fd = 0;
+	int key = 0;
+	int *ptr = NULL;
+	int c = 0;
+	int size = 0;
+	char path[256] = "";
+	int want_sleep = 0, private = 0;
+	int populate = 0;
+	int write = 0;
+	int reserve = 1;
+
+	/* Parse command-line arguments. */
+	setvbuf(stdout, NULL, _IONBF, 0);
+	self = argv[0];
+
+	while ((c = getopt(argc, argv, ":s:p:")) != -1) {
+		switch (c) {
+		case 's':
+			size = atoi(optarg);
+			break;
+		case 'p':
+			strncpy(path, optarg, sizeof(path));
+			break;
+		default:
+			errno = EINVAL;
+			perror("Invalid arg");
+			exit_usage();
+		}
+	}
+
+	printf("%s\n", path);
+	if (strncmp(path, "", sizeof(path)) != 0) {
+		printf("Writing to this path: %s\n", path);
+	} else {
+		errno = EINVAL;
+		perror("path not found");
+		exit_usage();
+	}
+
+	if (size != 0) {
+		printf("Writing this size: %d\n", size);
+	} else {
+		errno = EINVAL;
+		perror("size not found");
+		exit_usage();
+	}
+
+	fd = open(path, O_CREAT | O_RDWR, 0777);
+	if (fd == -1)
+		err(1, "Failed to open file.");
+
+	if (ftruncate(fd, size))
+		err(1, "failed to ftruncate %s", path);
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (ptr == MAP_FAILED) {
+		close(fd);
+		err(1, "Error mapping the file");
+	}
+
+	printf("Writing to memory.\n");
+	memset(ptr, 1, size);
+	printf("Done writing to memory.\n");
+	close(fd);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/vm/tmpfs-memcg.sh b/tools/testing/selftests/vm/tmpfs-memcg.sh
new file mode 100755
index 0000000000000..50876992107fd
--- /dev/null
+++ b/tools/testing/selftests/vm/tmpfs-memcg.sh
@@ -0,0 +1,116 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+CGROUP_PATH=/dev/cgroup/memory/tmpfs-memcg-test
+REMOUNT_CGROUP_PATH=/dev/cgroup/memory/remount-memcg-test
+
+function cleanup() {
+  rm -rf /mnt/tmpfs/*
+  umount /mnt/tmpfs
+  rm -rf /mnt/tmpfs
+
+  rmdir $CGROUP_PATH
+  rmdir $REMOUNT_CGROUP_PATH
+
+  echo CLEANUP DONE
+}
+
+function setup() {
+  mkdir -p $CGROUP_PATH
+  mkdir -p $REMOUNT_CGROUP_PATH
+  echo $((10 * 1024 * 1024)) > $CGROUP_PATH/memory.limit_in_bytes
+  echo 0 > $CGROUP_PATH/cpuset.cpus
+  echo 0 > $CGROUP_PATH/cpuset.mems
+
+  mkdir -p /mnt/tmpfs
+
+  echo SETUP DONE
+}
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$actual" != "$expected" ]]; then
+    echo "expected ($expected) != actual ($actual): $3" >&2
+    cleanup
+    exit 1
+  fi
+}
+
+function expect_ge() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$actual" -lt "$expected" ]]; then
+    echo "expected ($expected) < actual ($actual): $3" >&2
+    cleanup
+    exit 1
+  fi
+}
+
+cleanup
+setup
+
+mount -t tmpfs -o memcg=$REMOUNT_CGROUP_PATH tmpfs /mnt/tmpfs
+check=$(cat /proc/mounts | grep -i remount-memcg-test)
+if [ -z "$check" ]; then
+  echo "tmpfs memcg= was not mounted correctly:"
+  echo $check
+  echo "FAILED"
+  cleanup
+  exit 1
+fi
+
+mount -t tmpfs -o remount,memcg=$CGROUP_PATH tmpfs /mnt/tmpfs
+check=$(cat /proc/mounts | grep -i tmpfs-memcg-test)
+if [ -z "$check" ]; then
+  echo "tmpfs memcg= was not remounted correctly:"
+  echo $check
+  echo "FAILED"
+  cleanup
+  exit 1
+fi
+
+TARGET_MEMCG_USAGE=$(cat $CGROUP_PATH/memory.usage_in_bytes)
+expect_equal 0 "$TARGET_MEMCG_USAGE" "Before echo, memcg usage should be 0"
+
+# Echo to allocate a page in the tmpfs
+echo
+echo
+echo hello > /mnt/tmpfs/test
+TARGET_MEMCG_USAGE=$(cat $CGROUP_PATH/memory.usage_in_bytes)
+expect_ge 4096 "$TARGET_MEMCG_USAGE" "After echo, memcg usage should be greater than 4096"
+echo "Echo test succeeded"
+
+echo
+echo
+tools/testing/selftests/vm/mmap_write -p /mnt/tmpfs/test -s $((1 * 1024 * 1024))
+TARGET_MEMCG_USAGE=$(cat $CGROUP_PATH/memory.usage_in_bytes)
+expect_ge $((1 * 1024 * 1024)) "$TARGET_MEMCG_USAGE" "After mmap_write, memcg usage should greater than 1MB"
+echo "WRITE TEST SUCCEEDED"
+
+# SIGBUS the remote container on pagefault.
+echo
+echo
+echo "SIGBUS the process doing the remote charge on hitting the limit of the remote cgroup."
+echo "This will take a long time because the kernel goes through reclaim retries,"
+echo "but should eventually the write process should receive a SIGBUS"
+set +e
+tools/testing/selftests/vm/mmap_write -p /mnt/tmpfs/test -s $((11 * 1024 * 1024)) &
+wait $!
+expect_equal "$?" "135" "mmap_write should have exited with SIGBUS"
+set -e
+
+# ENOSPC the remote container on non pagefault.
+echo
+echo
+echo "OOMing the remote container using cat (non-pagefault)"
+echo "This will take a long time because the kernel goes through reclaim retries,"
+echo "but should eventually the cat command should receive an ENOSPC"
+cat /dev/random > /mnt/tmpfs/random || true
+
+cleanup
+echo TEST PASSED
--
2.34.0.rc2.393.gf8c9666880-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
                   ` (3 preceding siblings ...)
  2021-11-20  4:50 ` [PATCH v4 4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests Mina Almasry
@ 2021-11-20  5:01 ` Matthew Wilcox
  2021-11-20  5:27   ` Mina Almasry
  2021-11-22 19:04 ` Johannes Weiner
  5 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2021-11-20  5:01 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> 1. One complication to address is the behavior when the target memcg
> hits its memory.max limit because of remote charging. In this case the
> oom-killer will be invoked, but the oom-killer may not find anything
> to kill in the target memcg being charged. Thera are a number of considerations
> in this case:
> 
> 1. It's not great to kill the allocating process since the allocating process
>    is not running in the memcg under oom, and killing it will not free memory
>    in the memcg under oom.
> 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
>    somehow. If not, the process will forever loop the pagefault in the upstream
>    kernel.
> 
> In this case, I propose simply failing the remote charge and returning an ENOSPC
> to the caller. This will cause will cause the process executing the remote
> charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> path.  This will be documented behavior of remote charging, and this feature is
> opt-in. Users can:
> - Not opt-into the feature if they want.
> - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
>   abort if they desire.
> - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
>   operation without executing the remote charge if possible.

Why is ENOSPC the right error instead of ENOMEM?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/4] mm/oom: handle remote ooms
  2021-11-20  4:50 ` [PATCH v4 2/4] mm/oom: handle remote ooms Mina Almasry
@ 2021-11-20  5:07   ` Matthew Wilcox
  2021-11-20  5:31     ` Mina Almasry
  2021-11-20  7:58   ` Shakeel Butt
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2021-11-20  5:07 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Jonathan Corbet, Alexander Viro, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm,
	cgroups

On Fri, Nov 19, 2021 at 08:50:08PM -0800, Mina Almasry wrote:
> On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> to find a task to kill in the memcg under oom. The oom-killer may be
> unable to find a process to kill if there are no killable processes in
> the remote memcg. In this case, the oom-killer (out_of_memory()) will return
> false, and depending on the gfp, that will generally get bubbled up to
> mem_cgroup_charge_mapping() as an ENOMEM.

Why doesn't it try to run the shrinkers to get back some page cache /
slab cache memory from this memcg?  I understand it might not be able
to (eg if the memory is mlocked), but surely that's rare.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-20  5:01 ` [PATCH v4 0/4] Deterministic charging of shared memory Matthew Wilcox
@ 2021-11-20  5:27   ` Mina Almasry
  0 siblings, 0 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  5:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Fri, Nov 19, 2021 at 9:01 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. Thera are a number of considerations
> > in this case:
> >
> > 1. It's not great to kill the allocating process since the allocating process
> >    is not running in the memcg under oom, and killing it will not free memory
> >    in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> >    somehow. If not, the process will forever loop the pagefault in the upstream
> >    kernel.
> >
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause will cause the process executing the remote
> > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > path.  This will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt-into the feature if they want.
> > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> >   abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> >   operation without executing the remote charge if possible.
>
> Why is ENOSPC the right error instead of ENOMEM?

Returning ENOMEM from mem_cgroup_charge_mapping() will cause the
application to get ENOMEM from non-pagefault paths (which is perfectly
fine), and get stuck in a loop trying to resolve the pagefault in the
pagefault path (less fine). The logic is here:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1432

ENOMEM gets bubbled up here as VM_FAULT_OOM and on remote charges the
behavior I see is that the kernel loops the pagefault forever until
memory is freed in the remote memcg, and it may never will.

ENOSPC gets bubbled up here as a VM_FAULT_SIGBUS and and sends a
SIGBUS to the allocating process. The conjecture here is that it's
preferred to send a SIGBUS to the allocating process rather than have
it be stuck in a loop trying to resolve a pagefault.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/4] mm/oom: handle remote ooms
  2021-11-20  5:07   ` Matthew Wilcox
@ 2021-11-20  5:31     ` Mina Almasry
  0 siblings, 0 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-20  5:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Jonathan Corbet, Alexander Viro, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm,
	cgroups

On Fri, Nov 19, 2021 at 9:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:08PM -0800, Mina Almasry wrote:
> > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > to find a task to kill in the memcg under oom. The oom-killer may be
> > unable to find a process to kill if there are no killable processes in
> > the remote memcg. In this case, the oom-killer (out_of_memory()) will return
> > false, and depending on the gfp, that will generally get bubbled up to
> > mem_cgroup_charge_mapping() as an ENOMEM.
>
> Why doesn't it try to run the shrinkers to get back some page cache /
> slab cache memory from this memcg?  I understand it might not be able
> to (eg if the memory is mlocked), but surely that's rare.
>

Please correct me if I'm wrong, but AFAICT I haven't disabled any
shrinkers from running or disabled any reclaim mechanism on remote
charges. What I see in the code is that when the remote memcg runs out
of memory is that try_to_free_mem_cgroup_pages() gets called as normal
and we do the MAX_RECLAIM_RETRIES as normal.

It's only in the (rare as you point out) case where the kernel is
completely unable to find memory in the remote memcg that we need to
do the special handling that's described in this patch. Although this
situation is probably quite rare in real-world scenarios, it's easily
reproducible (and the attached test in the series does so), so my
feeling is that it needs some sane handling, which I'm attempting to
do in this patch.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 1/4] mm: support deterministic memory charging of filesystems
  2021-11-20  4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
@ 2021-11-20  7:53   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2021-11-20  7:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Alexander Viro, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Hugh Dickins, Jonathan Corbet, Shuah Khan,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm,
	cgroups

On Fri, Nov 19, 2021 at 8:50 PM Mina Almasry <almasrymina@google.com> wrote:
>
> Users can specify a memcg= mount option option at mount time and all

*option

> data page charges will be charged to the memcg supplied.  This is useful
> to deterministicly charge the memory of the file system or memory shared

*deterministically

> via tmpfs for example.
>
> Implementation notes:
> - Add memcg= option parsing to fs common code.
> - We attach the memcg to charge for this filesystem data pages to the
>   struct super_block. The memcg can be changed via a remount operation,
>   and all future memcg charges in this filesystem will be charged to
>   the new memcg.
> - We create a new interface mem_cgroup_charge_mapping(), which will
>   check if the super_block in the mapping has a memcg to charge. It
>   charges that, and falls back to the mm passed if there is no
>   super_block memcg.
> - On filesystem data memory allocation paths, we call the new interface
>   mem_cgroup_charge_mapping().
>
> Caveats:
> - Processes are only allowed to direct filesystem charges to a cgroup that

I don't think 'Processes' is the right terminology here. The
admin/user doing the mount operation with memcg option should have
access to move processes into the target memcg.

>   they themselves can enter and allocate memory in. This so that we do not
>   introduce an attack vector where processes can DoS any cgroup in the
>   system that they are not normally allowed to enter and allocate memory in.
> - In mem_cgroup_charge_mapping() we pay the cost of checking whether the
>   super_block has a memcg to charge, regardless of whether the mount
>   point was mounted with memcg=. This can be alleviated by putting the
>   memcg to charge in the struct address_space, but, this increases the
>   size of that struct and makes it difficult to support remounting the
>   memcg= option, although remounting is of dubious value.

We can start simple with no remount option or maybe follow the memcg
v2 semantics of process migrating from one memcg to another. The older
memory of the process will remain charged to the older memcg and after
migration, the new memory will be charged to the newer memcg.

Similarly the older inode/mappings will still be linked to the older
memcg and after remount the newer mappings will be linked with newer
memcg. You would need to find out if each mapping should hold a memcg
reference.

[...]
>
> +static int parse_param_memcg(struct fs_context *fc, struct fs_parameter *param)
> +{
> +       struct mem_cgroup *memcg;
> +
> +       if (strcmp(param->key, "memcg") != 0)
> +               return -ENOPARAM;
> +
> +       if (param->type != fs_value_is_string)
> +               return invalf(fc, "Non-string source");
> +
> +       if (fc->memcg)
> +               return invalf(fc, "Multiple memcgs specified");
> +
> +       memcg = mem_cgroup_get_from_path(param->string);

You need to drop this reference in put_fs_context() and give the
super_block its own memcg reference.

> +       if (IS_ERR(memcg))
> +               return invalf(fc, "Bad value for memcg");
> +
> +       fc->memcg = memcg;
> +       param->string = NULL;
> +       return 0;
> +}
> +
>  /**
>   * vfs_parse_fs_param - Add a single parameter to a superblock config
>   * @fc: The filesystem context to modify
> @@ -148,6 +171,10 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
>                         return ret;
>         }
>
> +       ret = parse_param_memcg(fc, param);

You might need to call this before fs specific handling (i.e.
fc->ops->parse_param).

> +       if (ret != -ENOPARAM)
> +               return ret;
> +
>         /* If the filesystem doesn't take any arguments, give it the
>          * default handling of source.
>          */

[...]

> +
> +void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
> +{
> +       struct mem_cgroup *memcg;
> +       int ret = 0;
> +       char *buf = __getname();
> +       int len = PATH_MAX;
> +
> +       if (!buf)
> +               return;
> +
> +       buf[0] = '\0';
> +
> +       rcu_read_lock();
> +       memcg = rcu_dereference(sb->s_memcg_to_charge);
> +       if (memcg && !css_tryget_online(&memcg->css))

No need to get an explicit reference. You can call cgroup_path within rcu.

> +               memcg = NULL;
> +       rcu_read_unlock();
> +
> +       if (!memcg)
> +               return;
> +
> +       ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
> +       if (ret >= len / 2)
> +               strcpy(buf, "?");
> +       else {
> +               char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
> +
> +               if (p)
> +                       *p = '\0';
> +               else
> +                       strcpy(buf, "?");
> +       }
> +
> +       css_put(&memcg->css);
> +       if (buf[0] != '\0')
> +               seq_printf(m, ",memcg=%s", buf);
> +
> +       __putname(buf);
> +}
> +
> +/*
> + * Set or clear (if @memcg is NULL) charge association from file system to
> + * memcg.  If @memcg != NULL, then a css reference must be held by the caller to
> + * ensure that the cgroup is not deleted during this operation, this reference
> + * is dropped after this operation.

The given reference is not really dropped after this operation but
rather the reference is being stolen here.

> + */
> +void mem_cgroup_set_charge_target(struct super_block *sb,
> +                                 struct mem_cgroup *memcg)
> +{
> +       memcg = xchg(&sb->s_memcg_to_charge, memcg);

Don't borrow the reference, get a new one for sb.

> +       if (memcg)
> +               css_put(&memcg->css);
> +}
> +
> +/*
> + * Returns the memcg to charge for inode pages.  If non-NULL is returned, caller
> + * must drop reference with css_put().  NULL indicates that the inode does not
> + * have a memcg to charge,

* or the attached memcg is offline.

> so the default process based policy should be used.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
> +{
> +       struct mem_cgroup *memcg;
> +
> +       if (!mapping)
> +               return NULL;
> +
> +       rcu_read_lock();
> +       memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> +       if (memcg && !css_tryget_online(&memcg->css))
> +               memcg = NULL;
> +       rcu_read_unlock();
> +
> +       return memcg;
> +}
> +

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 2/4] mm/oom: handle remote ooms
  2021-11-20  4:50 ` [PATCH v4 2/4] mm/oom: handle remote ooms Mina Almasry
  2021-11-20  5:07   ` Matthew Wilcox
@ 2021-11-20  7:58   ` Shakeel Butt
  1 sibling, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2021-11-20  7:58 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Jonathan Corbet, Alexander Viro, Hugh Dickins, Shuah Khan,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm,
	cgroups

On Fri, Nov 19, 2021 at 8:50 PM Mina Almasry <almasrymina@google.com> wrote:
>
[...]
> +/*
> + * Returns true if current's mm is a descendant of the memcg_under_oom (or
> + * equal to it). False otherwise. This is used by the oom-killer to detect
> + * ooms due to remote charging.
> + */
> +bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
> +{
> +       struct mem_cgroup *current_memcg;
> +       bool is_remote_oom;
> +
> +       if (!memcg_under_oom)
> +               return false;
> +
> +       rcu_read_lock();
> +       current_memcg = mem_cgroup_from_task(current);
> +       if (current_memcg && !css_tryget_online(&current_memcg->css))

No need to grab a reference. You can call mem_cgroup_is_descendant() within rcu.

> +               current_memcg = NULL;
> +       rcu_read_unlock();
> +
> +       if (!current_memcg)
> +               return false;
> +
> +       is_remote_oom =
> +               !mem_cgroup_is_descendant(current_memcg, memcg_under_oom);
> +       css_put(&current_memcg->css);
> +
> +       return is_remote_oom;
> +}
> +

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
                   ` (4 preceding siblings ...)
  2021-11-20  5:01 ` [PATCH v4 0/4] Deterministic charging of shared memory Matthew Wilcox
@ 2021-11-22 19:04 ` Johannes Weiner
  2021-11-22 22:09   ` Mina Almasry
                     ` (3 more replies)
  5 siblings, 4 replies; 20+ messages in thread
From: Johannes Weiner @ 2021-11-22 19:04 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Michal Hocko,
	Vladimir Davydov, Hugh Dickins, Shuah Khan, Shakeel Butt,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> Problem:
> Currently shared memory is charged to the memcg of the allocating
> process. This makes memory usage of processes accessing shared memory
> a bit unpredictable since whichever process accesses the memory first
> will get charged. We have a number of use cases where our userspace
> would like deterministic charging of shared memory:
> 
> 1. System services allocating memory for client jobs:
> We have services (namely a network access service[1]) that provide
> functionality for clients running on the machine and allocate memory
> to carry out these services. The memory usage of these services
> depends on the number of jobs running on the machine and the nature of
> the requests made to the service, which makes the memory usage of
> these services hard to predict and thus hard to limit via memory.max.
> These system services would like a way to allocate memory and instruct
> the kernel to charge this memory to the client’s memcg.
> 
> 2. Shared filesystem between subtasks of a large job
> Our infrastructure has large meta jobs such as kubernetes which spawn
> multiple subtasks which share a tmpfs mount. These jobs and its
> subtasks use that tmpfs mount for various purposes such as data
> sharing or persistent data between the subtask restarts. In kubernetes
> terminology, the meta job is similar to pods and subtasks are
> containers under pods. We want the shared memory to be
> deterministically charged to the kubernetes's pod and independent to
> the lifetime of containers under the pod.
> 
> 3. Shared libraries and language runtimes shared between independent jobs.
> We’d like to optimize memory usage on the machine by sharing libraries
> and language runtimes of many of the processes running on our machines
> in separate memcgs. This produces a side effect that one job may be
> unlucky to be the first to access many of the libraries and may get
> oom killed as all the cached files get charged to it.
> 
> Design:
> My rough proposal to solve this problem is to simply add a
> ‘memcg=/path/to/memcg’ mount option for filesystems:
> directing all the memory of the file system to be ‘remote charged’ to
> cgroup provided by that memcg= option.
> 
> Caveats:
> 
> 1. One complication to address is the behavior when the target memcg
> hits its memory.max limit because of remote charging. In this case the
> oom-killer will be invoked, but the oom-killer may not find anything
> to kill in the target memcg being charged. Thera are a number of considerations
> in this case:
> 
> 1. It's not great to kill the allocating process since the allocating process
>    is not running in the memcg under oom, and killing it will not free memory
>    in the memcg under oom.
> 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
>    somehow. If not, the process will forever loop the pagefault in the upstream
>    kernel.
> 
> In this case, I propose simply failing the remote charge and returning an ENOSPC
> to the caller. This will cause will cause the process executing the remote
> charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> path.  This will be documented behavior of remote charging, and this feature is
> opt-in. Users can:
> - Not opt-into the feature if they want.
> - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
>   abort if they desire.
> - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
>   operation without executing the remote charge if possible.
> 
> 2. Only processes allowed the enter cgroup at mount time can mount a
> tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> process with write access to this mount point will be able to charge memory to
> <cgroup>. This is largely a non-issue because in configurations where there is
> untrusted code running on the machine, mount point access needs to be
> restricted to the intended users only regardless of whether the mount point
> memory is deterministly charged or not.

I'm not a fan of this. It uses filesystem mounts to create shareable
resource domains outside of the cgroup hierarchy, which has all the
downsides you listed, and more:

1. You need a filesystem interface in the first place, and a new
   ad-hoc channel and permission model to coordinate with the cgroup
   tree, which isn't great. All filesystems you want to share data on
   need to be converted.

2. It doesn't extend to non-filesystem sources of shared data, such as
   memfds, ipc shm etc.

3. It requires unintuitive configuration for what should be basic
   shared accounting semantics. Per default you still get the old
   'first touch' semantics, but to get sharing you need to reconfigure
   the filesystems?

4. If a task needs to work with a hierarchy of data sharing domains -
   system-wide, group of jobs, job - it must interact with a hierarchy
   of filesystem mounts. This is a pain to setup and may require task
   awareness. Moving data around, working with different mount points.
   Also, no shared and private data accounting within the same file.

5. It reintroduces cgroup1 semantics of tasks and resouces, which are
   entangled, sitting in disjunct domains. OOM killing is one quirk of
   that, but there are others you haven't touched on. Who is charged
   for the CPU cycles of reclaim in the out-of-band domain?  Who is
   charged for the paging IO? How is resource pressure accounted and
   attributed? Soon you need cpu= and io= as well.

My take on this is that it might work for your rather specific
usecase, but it doesn't strike me as a general-purpose feature
suitable for upstream.


If we want sharing semantics for memory, I think we need a more
generic implementation with a cleaner interface.

Here is one idea:

Have you considered reparenting pages that are accessed by multiple
cgroups to the first common ancestor of those groups?

Essentially, whenever there is a memory access (minor fault, buffered
IO) to a page that doesn't belong to the accessing task's cgroup, you
find the common ancestor between that task and the owning cgroup, and
move the page there.

With a tree like this:

	root - job group - job
                        `- job
            `- job group - job
                        `- job

all pages accessed inside that tree will propagate to the highest
level at which they are shared - which is the same level where you'd
also set shared policies, like a job group memory limit or io weight.

E.g. libc pages would (likely) bubble to the root, persistent tmpfs
pages would bubble to the respective job group, private data would
stay within each job.

No further user configuration necessary. Although you still *can* use
mount namespacing etc. to prohibit undesired sharing between cgroups.

The actual user-visible accounting change would be quite small, and
arguably much more intuitive. Remember that accounting is recursive,
meaning that a job page today also shows up in the counters of job
group and root. This would not change. The only thing that IS weird
today is that when two jobs share a page, it will arbitrarily show up
in one job's counter but not in the other's. That would change: it
would no longer show up as either, since it's not private to either;
it would just be a job group (and up) page.

This would be a generic implementation of resource sharing semantics:
independent of data source and filesystems, contained inside the
cgroup interface, and reusing the existing hierarchies of accounting
and control domains to also represent levels of common property.

Thoughts?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-22 19:04 ` Johannes Weiner
@ 2021-11-22 22:09   ` Mina Almasry
  2021-11-22 23:09   ` Roman Gushchin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-22 22:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jonathan Corbet, Alexander Viro, Andrew Morton, Michal Hocko,
	Vladimir Davydov, Hugh Dickins, Shuah Khan, Shakeel Butt,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Mon, Nov 22, 2021 at 11:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared memory
> > a bit unpredictable since whichever process accesses the memory first
> > will get charged. We have a number of use cases where our userspace
> > would like deterministic charging of shared memory:
> >
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature of
> > the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via memory.max.
> > These system services would like a way to allocate memory and instruct
> > the kernel to charge this memory to the client’s memcg.
> >
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which spawn
> > multiple subtasks which share a tmpfs mount. These jobs and its
> > subtasks use that tmpfs mount for various purposes such as data
> > sharing or persistent data between the subtask restarts. In kubernetes
> > terminology, the meta job is similar to pods and subtasks are
> > containers under pods. We want the shared memory to be
> > deterministically charged to the kubernetes's pod and independent to
> > the lifetime of containers under the pod.
> >
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We’d like to optimize memory usage on the machine by sharing libraries
> > and language runtimes of many of the processes running on our machines
> > in separate memcgs. This produces a side effect that one job may be
> > unlucky to be the first to access many of the libraries and may get
> > oom killed as all the cached files get charged to it.
> >
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > ‘memcg=/path/to/memcg’ mount option for filesystems:
> > directing all the memory of the file system to be ‘remote charged’ to
> > cgroup provided by that memcg= option.
> >
> > Caveats:
> >
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. Thera are a number of considerations
> > in this case:
> >
> > 1. It's not great to kill the allocating process since the allocating process
> >    is not running in the memcg under oom, and killing it will not free memory
> >    in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> >    somehow. If not, the process will forever loop the pagefault in the upstream
> >    kernel.
> >
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause will cause the process executing the remote
> > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > path.  This will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt-into the feature if they want.
> > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> >   abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> >   operation without executing the remote charge if possible.
> >
> > 2. Only processes allowed the enter cgroup at mount time can mount a
> > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> > process with write access to this mount point will be able to charge memory to
> > <cgroup>. This is largely a non-issue because in configurations where there is
> > untrusted code running on the machine, mount point access needs to be
> > restricted to the intended users only regardless of whether the mount point
> > memory is deterministly charged or not.
>
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
>
> 1. You need a filesystem interface in the first place, and a new
>    ad-hoc channel and permission model to coordinate with the cgroup
>    tree, which isn't great. All filesystems you want to share data on
>    need to be converted.
>

My understanding is that this problem exists today with tmpfs-shared
memory, regardless of memcg= support or not. I.e. for processes to
share memory via tmpfs the sys admin needs to limit access to the
mount point to the processes regardless of which cgroup[s] the
processes are in for the machine to be properly configured, or risk
unintended data access and a security violation. So existing tmpfs
shared memory would/should already have these permissions in place,
and (I'm hoping) we can piggy back or that and provide deterministic
charging.

> 2. It doesn't extend to non-filesystem sources of shared data, such as
>    memfds, ipc shm etc.
>

I was hoping - if possible - to extend similar APIs/semantics to other
shared memory sources, although to be honest I'll concede I haven't
thoroughly thought of how the implementation would look like.

> 3. It requires unintuitive configuration for what should be basic
>    shared accounting semantics. Per default you still get the old
>    'first touch' semantics, but to get sharing you need to reconfigure
>    the filesystems?
>

Yes, this is indeed an explicit option that needs to be configured by
the sys admin. I'm not so sure about changing the default in the
kernel and potentially breaking existing accounting like you mention
below. I think the kernel also automagically trying to figure out the
proper memcg to deterministically charge has its own issues (comments
on the proposal below).

> 4. If a task needs to work with a hierarchy of data sharing domains -
>    system-wide, group of jobs, job - it must interact with a hierarchy
>    of filesystem mounts. This is a pain to setup and may require task
>    awareness. Moving data around, working with different mount points.
>    Also, no shared and private data accounting within the same file.
>

Again, my impression/feeling here is that this is a generic problem
with tmpfs shared memory, and maybe shared memory in general, which
folks find very useful already despite the existing shortcomings.
Today AFAIK we don't have interfaces to say 'this is shared memory and
it's shared between processes in cgroups A, B, and C'. Instead we say
this is shared memory and the tmpfs access permissions or visibility
decree who can access the shared memory (and the permissions are
oblivious to cgroups) and the memory charging is first touch based and
not deterministic.

> 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
>    entangled, sitting in disjunct domains. OOM killing is one quirk of
>    that, but there are others you haven't touched on. Who is charged
>    for the CPU cycles of reclaim in the out-of-band domain?  Who is
>    charged for the paging IO? How is resource pressure accounted and
>    attributed? Soon you need cpu= and io= as well.
>

I think the allocating task is charged for cpu and io resources and
I'm not sure I see a compelling reason to change that. I think the
distinction is that memory is shared but charged to the one faulting
it which is maybe not really fair or can be deterministically
predicted by the sys admin setting limits on the various cgroups. I
don't see that logic extending to cpu, but perhaps to io maybe.

> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.
>
>
> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
>

My issue here is that AFAICT in the upstream kernel there is no way to
deterministically charge the shared memory other than preallocation
which doesn't work so well on overcommitted systems and requires
changes in the individual tasks that are allocating the shared memory.
I'm definitely on board with any proposal that achieves what we want,
although there are issues with the specific proposal you mentioned.
(and thanks for reviewing and suggesting alternatives!)

> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
>         root - job group - job
>                         `- job
>             `- job group - job
>                         `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
>
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
>
> This would be a generic implementation of resource sharing semantics:
> independent of data source and filesystems, contained inside the
> cgroup interface, and reusing the existing hierarchies of accounting
> and control domains to also represent levels of common property.
>
> Thoughts?

2 issues I see here:
1. This is a default change, somewhat likely to break existing accounting.
2. I think we're trying to make the charging deterministic, and this
makes it even harder to a priori predict where memory is charged:
(a) memory is initially charged to the allocating task, which forces
the sys admin to over provision cgroups that access shared memory,
because what if they pre-allocate the shared memory and get charged
for all of it?
(b) The shared memory will only land "where it's supposed to land" if
the sys admin has correctly set the permissions of the shared memory
(tmpfs file system permissions/visibility for example). If the mount
access is incorrectly configured and accessed by a bad actor the
memory will likely be reparented to root, which is likely worse than
causing ENOSPC/SIGBUS in the current proposal. Hence, it's really an
implicit requirement for the shared memory permissions to be correct
for this to work, in which case memcg= seems better to me since it
doesn't suffer from issue (a).

I'm loosely aware of past conversations with Shakeel where it was
recommended to charge the first common ancestor, mainly to side-step
issues with the oom-killer not finding anything to kill. IMO I quite
like memcg= approach because you can:
1. memcg=<first common ancestor cgroup>, and not deal with potential
SIGBUS/ENOSPC
2. memcg=<remote cgroup>, and deal with potential SIGBUS/ENOSPC.

And the user has flexibility to decide. But regardless of the
proposal, I see it as an existing/orthogonal problem that shared
memory permissions be 'correct', and AFAICT existing shared memory
permission models are completely oblivious to cgroups, so there is
work for the sys admin to do anyway to make sure that processes in the
intended processes only are able to access the shared memory.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-22 19:04 ` Johannes Weiner
  2021-11-22 22:09   ` Mina Almasry
@ 2021-11-22 23:09   ` Roman Gushchin
  2021-11-23 19:26     ` Mina Almasry
  2021-11-23 20:21     ` Johannes Weiner
  2021-11-24 17:27   ` Michal Hocko
  2021-11-29  6:00   ` Shakeel Butt
  3 siblings, 2 replies; 20+ messages in thread
From: Roman Gushchin @ 2021-11-22 23:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared memory
> > a bit unpredictable since whichever process accesses the memory first
> > will get charged. We have a number of use cases where our userspace
> > would like deterministic charging of shared memory:
> > 
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature of
> > the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via memory.max.
> > These system services would like a way to allocate memory and instruct
> > the kernel to charge this memory to the client’s memcg.
> > 
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which spawn
> > multiple subtasks which share a tmpfs mount. These jobs and its
> > subtasks use that tmpfs mount for various purposes such as data
> > sharing or persistent data between the subtask restarts. In kubernetes
> > terminology, the meta job is similar to pods and subtasks are
> > containers under pods. We want the shared memory to be
> > deterministically charged to the kubernetes's pod and independent to
> > the lifetime of containers under the pod.
> > 
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We’d like to optimize memory usage on the machine by sharing libraries
> > and language runtimes of many of the processes running on our machines
> > in separate memcgs. This produces a side effect that one job may be
> > unlucky to be the first to access many of the libraries and may get
> > oom killed as all the cached files get charged to it.
> > 
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > ‘memcg=/path/to/memcg’ mount option for filesystems:
> > directing all the memory of the file system to be ‘remote charged’ to
> > cgroup provided by that memcg= option.
> > 
> > Caveats:
> > 
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. Thera are a number of considerations
> > in this case:
> > 
> > 1. It's not great to kill the allocating process since the allocating process
> >    is not running in the memcg under oom, and killing it will not free memory
> >    in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> >    somehow. If not, the process will forever loop the pagefault in the upstream
> >    kernel.
> > 
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause will cause the process executing the remote
> > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > path.  This will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt-into the feature if they want.
> > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> >   abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> >   operation without executing the remote charge if possible.
> > 
> > 2. Only processes allowed the enter cgroup at mount time can mount a
> > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> > process with write access to this mount point will be able to charge memory to
> > <cgroup>. This is largely a non-issue because in configurations where there is
> > untrusted code running on the machine, mount point access needs to be
> > restricted to the intended users only regardless of whether the mount point
> > memory is deterministly charged or not.
> 
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
> 
> 1. You need a filesystem interface in the first place, and a new
>    ad-hoc channel and permission model to coordinate with the cgroup
>    tree, which isn't great. All filesystems you want to share data on
>    need to be converted.
> 
> 2. It doesn't extend to non-filesystem sources of shared data, such as
>    memfds, ipc shm etc.
> 
> 3. It requires unintuitive configuration for what should be basic
>    shared accounting semantics. Per default you still get the old
>    'first touch' semantics, but to get sharing you need to reconfigure
>    the filesystems?
> 
> 4. If a task needs to work with a hierarchy of data sharing domains -
>    system-wide, group of jobs, job - it must interact with a hierarchy
>    of filesystem mounts. This is a pain to setup and may require task
>    awareness. Moving data around, working with different mount points.
>    Also, no shared and private data accounting within the same file.
> 
> 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
>    entangled, sitting in disjunct domains. OOM killing is one quirk of
>    that, but there are others you haven't touched on. Who is charged
>    for the CPU cycles of reclaim in the out-of-band domain?  Who is
>    charged for the paging IO? How is resource pressure accounted and
>    attributed? Soon you need cpu= and io= as well.
> 
> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.
> 
> 
> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
> 
> Here is one idea:
> 
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
> 
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
> 
> With a tree like this:
> 
> 	root - job group - job
>                         `- job
>             `- job group - job
>                         `- job
> 
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
> 
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
> 
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
> 
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.

In general I like the idea, but I think the user-visible change will be quite
large, almost "cgroup v3"-large. Here are some problems:
1) Anything shared between e.g. system.slice and user.slice now belongs
   to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache
   belonging to shared libraries.
2) It's concerning in security terms. If I understand the idea correctly, a
   read-only access will allow to move charges to an upper level, potentially
   crossing memory.max limits. It doesn't sound safe.
3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent
   it returns us to the cgroup v1 world and a question of competition between
   resources consumed by a cgroup directly and through children cgroups. Not
   like the problem doesn't exist now, but it's less pronounced.
   If say >50% of system.slice's memory will belong to system.slice directly,
   then we likely will need separate non-recursive counters, limits, protections,
   etc.
4) Imagine a production server and a system administrator entering using ssh
   (and being put into user.slice) and running a big grep... It screws up all
   memory accounting until a next reboot. Not a completely impossible scenario.

That said, I agree with Johannes and I'm also not a big fan of this patchset.

I agree that the problem exist and that the patchset provides a solution, but
it doesn't look nice (and generic enough) and creates a lot of questions and
corner cases.

Btw, won't (an optional) disabling of memcg accounting for a tmpfs solve your
problem? It will be less invasive and will not require any oom changes.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-22 23:09   ` Roman Gushchin
@ 2021-11-23 19:26     ` Mina Almasry
  2021-11-23 20:21     ` Johannes Weiner
  1 sibling, 0 replies; 20+ messages in thread
From: Mina Almasry @ 2021-11-23 19:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Mon, Nov 22, 2021 at 3:09 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > > Problem:
> > > Currently shared memory is charged to the memcg of the allocating
> > > process. This makes memory usage of processes accessing shared memory
> > > a bit unpredictable since whichever process accesses the memory first
> > > will get charged. We have a number of use cases where our userspace
> > > would like deterministic charging of shared memory:
> > >
> > > 1. System services allocating memory for client jobs:
> > > We have services (namely a network access service[1]) that provide
> > > functionality for clients running on the machine and allocate memory
> > > to carry out these services. The memory usage of these services
> > > depends on the number of jobs running on the machine and the nature of
> > > the requests made to the service, which makes the memory usage of
> > > these services hard to predict and thus hard to limit via memory.max.
> > > These system services would like a way to allocate memory and instruct
> > > the kernel to charge this memory to the client’s memcg.
> > >
> > > 2. Shared filesystem between subtasks of a large job
> > > Our infrastructure has large meta jobs such as kubernetes which spawn
> > > multiple subtasks which share a tmpfs mount. These jobs and its
> > > subtasks use that tmpfs mount for various purposes such as data
> > > sharing or persistent data between the subtask restarts. In kubernetes
> > > terminology, the meta job is similar to pods and subtasks are
> > > containers under pods. We want the shared memory to be
> > > deterministically charged to the kubernetes's pod and independent to
> > > the lifetime of containers under the pod.
> > >
> > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > We’d like to optimize memory usage on the machine by sharing libraries
> > > and language runtimes of many of the processes running on our machines
> > > in separate memcgs. This produces a side effect that one job may be
> > > unlucky to be the first to access many of the libraries and may get
> > > oom killed as all the cached files get charged to it.
> > >
> > > Design:
> > > My rough proposal to solve this problem is to simply add a
> > > ‘memcg=/path/to/memcg’ mount option for filesystems:
> > > directing all the memory of the file system to be ‘remote charged’ to
> > > cgroup provided by that memcg= option.
> > >
> > > Caveats:
> > >
> > > 1. One complication to address is the behavior when the target memcg
> > > hits its memory.max limit because of remote charging. In this case the
> > > oom-killer will be invoked, but the oom-killer may not find anything
> > > to kill in the target memcg being charged. Thera are a number of considerations
> > > in this case:
> > >
> > > 1. It's not great to kill the allocating process since the allocating process
> > >    is not running in the memcg under oom, and killing it will not free memory
> > >    in the memcg under oom.
> > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > >    somehow. If not, the process will forever loop the pagefault in the upstream
> > >    kernel.
> > >
> > > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > > to the caller. This will cause will cause the process executing the remote
> > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > > path.  This will be documented behavior of remote charging, and this feature is
> > > opt-in. Users can:
> > > - Not opt-into the feature if they want.
> > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> > >   abort if they desire.
> > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > >   operation without executing the remote charge if possible.
> > >
> > > 2. Only processes allowed the enter cgroup at mount time can mount a
> > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> > > process with write access to this mount point will be able to charge memory to
> > > <cgroup>. This is largely a non-issue because in configurations where there is
> > > untrusted code running on the machine, mount point access needs to be
> > > restricted to the intended users only regardless of whether the mount point
> > > memory is deterministly charged or not.
> >
> > I'm not a fan of this. It uses filesystem mounts to create shareable
> > resource domains outside of the cgroup hierarchy, which has all the
> > downsides you listed, and more:
> >
> > 1. You need a filesystem interface in the first place, and a new
> >    ad-hoc channel and permission model to coordinate with the cgroup
> >    tree, which isn't great. All filesystems you want to share data on
> >    need to be converted.
> >
> > 2. It doesn't extend to non-filesystem sources of shared data, such as
> >    memfds, ipc shm etc.
> >
> > 3. It requires unintuitive configuration for what should be basic
> >    shared accounting semantics. Per default you still get the old
> >    'first touch' semantics, but to get sharing you need to reconfigure
> >    the filesystems?
> >
> > 4. If a task needs to work with a hierarchy of data sharing domains -
> >    system-wide, group of jobs, job - it must interact with a hierarchy
> >    of filesystem mounts. This is a pain to setup and may require task
> >    awareness. Moving data around, working with different mount points.
> >    Also, no shared and private data accounting within the same file.
> >
> > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
> >    entangled, sitting in disjunct domains. OOM killing is one quirk of
> >    that, but there are others you haven't touched on. Who is charged
> >    for the CPU cycles of reclaim in the out-of-band domain?  Who is
> >    charged for the paging IO? How is resource pressure accounted and
> >    attributed? Soon you need cpu= and io= as well.
> >
> > My take on this is that it might work for your rather specific
> > usecase, but it doesn't strike me as a general-purpose feature
> > suitable for upstream.
> >
> >
> > If we want sharing semantics for memory, I think we need a more
> > generic implementation with a cleaner interface.
> >
> > Here is one idea:
> >
> > Have you considered reparenting pages that are accessed by multiple
> > cgroups to the first common ancestor of those groups?
> >
> > Essentially, whenever there is a memory access (minor fault, buffered
> > IO) to a page that doesn't belong to the accessing task's cgroup, you
> > find the common ancestor between that task and the owning cgroup, and
> > move the page there.
> >
> > With a tree like this:
> >
> >       root - job group - job
> >                         `- job
> >             `- job group - job
> >                         `- job
> >
> > all pages accessed inside that tree will propagate to the highest
> > level at which they are shared - which is the same level where you'd
> > also set shared policies, like a job group memory limit or io weight.
> >
> > E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> > pages would bubble to the respective job group, private data would
> > stay within each job.
> >
> > No further user configuration necessary. Although you still *can* use
> > mount namespacing etc. to prohibit undesired sharing between cgroups.
> >
> > The actual user-visible accounting change would be quite small, and
> > arguably much more intuitive. Remember that accounting is recursive,
> > meaning that a job page today also shows up in the counters of job
> > group and root. This would not change. The only thing that IS weird
> > today is that when two jobs share a page, it will arbitrarily show up
> > in one job's counter but not in the other's. That would change: it
> > would no longer show up as either, since it's not private to either;
> > it would just be a job group (and up) page.
>
> In general I like the idea, but I think the user-visible change will be quite
> large, almost "cgroup v3"-large. Here are some problems:
> 1) Anything shared between e.g. system.slice and user.slice now belongs
>    to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache
>    belonging to shared libraries.
> 2) It's concerning in security terms. If I understand the idea correctly, a
>    read-only access will allow to move charges to an upper level, potentially
>    crossing memory.max limits. It doesn't sound safe.
> 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent
>    it returns us to the cgroup v1 world and a question of competition between
>    resources consumed by a cgroup directly and through children cgroups. Not
>    like the problem doesn't exist now, but it's less pronounced.
>    If say >50% of system.slice's memory will belong to system.slice directly,
>    then we likely will need separate non-recursive counters, limits, protections,
>    etc.
> 4) Imagine a production server and a system administrator entering using ssh
>    (and being put into user.slice) and running a big grep... It screws up all
>    memory accounting until a next reboot. Not a completely impossible scenario.
>
> That said, I agree with Johannes and I'm also not a big fan of this patchset.
>
> I agree that the problem exist and that the patchset provides a solution, but
> it doesn't look nice (and generic enough) and creates a lot of questions and
> corner cases.
>

Thanks as always for your review and I definitely welcome any
suggestions for how to solve this. I surmise from your response and
Johannes's that we're looking here for a solution that involves no
configuration from the sysadmin, where the kernel automatically
figures out where is the best place for the shared memory to get
charged and there are little to no corner cases to handle. I honestly
can't think of one at this moment. I was thinking some opt-in
deterministic charging with some configuration from the sysadmin and
reasonable edge case handling could make sense.

> Btw, won't (an optional) disabling of memcg accounting for a tmpfs solve your
> problem? It will be less invasive and will not require any oom changes.

I think it will solve use case #1, but I don't see it solving use
cases #2 and #3. To be completely honest it sounds a bit hacky to me
and there were concerns on this patchset that sysadmin needs to rely
on ad-hoc mount write permissions to reliably use a memcg= feature,
but disabling tmpfs accounting is in the same boat and seems a more
dangerous? (as in mistakingly granting write access to a tmpfs mount
to a bad actor can reliably DoS the entire machine).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-22 23:09   ` Roman Gushchin
  2021-11-23 19:26     ` Mina Almasry
@ 2021-11-23 20:21     ` Johannes Weiner
  2021-11-23 21:19       ` Mina Almasry
  1 sibling, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2021-11-23 20:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote:
> On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > > Problem:
> > > Currently shared memory is charged to the memcg of the allocating
> > > process. This makes memory usage of processes accessing shared memory
> > > a bit unpredictable since whichever process accesses the memory first
> > > will get charged. We have a number of use cases where our userspace
> > > would like deterministic charging of shared memory:
> > > 
> > > 1. System services allocating memory for client jobs:
> > > We have services (namely a network access service[1]) that provide
> > > functionality for clients running on the machine and allocate memory
> > > to carry out these services. The memory usage of these services
> > > depends on the number of jobs running on the machine and the nature of
> > > the requests made to the service, which makes the memory usage of
> > > these services hard to predict and thus hard to limit via memory.max.
> > > These system services would like a way to allocate memory and instruct
> > > the kernel to charge this memory to the client’s memcg.
> > > 
> > > 2. Shared filesystem between subtasks of a large job
> > > Our infrastructure has large meta jobs such as kubernetes which spawn
> > > multiple subtasks which share a tmpfs mount. These jobs and its
> > > subtasks use that tmpfs mount for various purposes such as data
> > > sharing or persistent data between the subtask restarts. In kubernetes
> > > terminology, the meta job is similar to pods and subtasks are
> > > containers under pods. We want the shared memory to be
> > > deterministically charged to the kubernetes's pod and independent to
> > > the lifetime of containers under the pod.
> > > 
> > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > We’d like to optimize memory usage on the machine by sharing libraries
> > > and language runtimes of many of the processes running on our machines
> > > in separate memcgs. This produces a side effect that one job may be
> > > unlucky to be the first to access many of the libraries and may get
> > > oom killed as all the cached files get charged to it.
> > > 
> > > Design:
> > > My rough proposal to solve this problem is to simply add a
> > > ‘memcg=/path/to/memcg’ mount option for filesystems:
> > > directing all the memory of the file system to be ‘remote charged’ to
> > > cgroup provided by that memcg= option.
> > > 
> > > Caveats:
> > > 
> > > 1. One complication to address is the behavior when the target memcg
> > > hits its memory.max limit because of remote charging. In this case the
> > > oom-killer will be invoked, but the oom-killer may not find anything
> > > to kill in the target memcg being charged. Thera are a number of considerations
> > > in this case:
> > > 
> > > 1. It's not great to kill the allocating process since the allocating process
> > >    is not running in the memcg under oom, and killing it will not free memory
> > >    in the memcg under oom.
> > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > >    somehow. If not, the process will forever loop the pagefault in the upstream
> > >    kernel.
> > > 
> > > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > > to the caller. This will cause will cause the process executing the remote
> > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > > path.  This will be documented behavior of remote charging, and this feature is
> > > opt-in. Users can:
> > > - Not opt-into the feature if they want.
> > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> > >   abort if they desire.
> > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > >   operation without executing the remote charge if possible.
> > > 
> > > 2. Only processes allowed the enter cgroup at mount time can mount a
> > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> > > process with write access to this mount point will be able to charge memory to
> > > <cgroup>. This is largely a non-issue because in configurations where there is
> > > untrusted code running on the machine, mount point access needs to be
> > > restricted to the intended users only regardless of whether the mount point
> > > memory is deterministly charged or not.
> > 
> > I'm not a fan of this. It uses filesystem mounts to create shareable
> > resource domains outside of the cgroup hierarchy, which has all the
> > downsides you listed, and more:
> > 
> > 1. You need a filesystem interface in the first place, and a new
> >    ad-hoc channel and permission model to coordinate with the cgroup
> >    tree, which isn't great. All filesystems you want to share data on
> >    need to be converted.
> > 
> > 2. It doesn't extend to non-filesystem sources of shared data, such as
> >    memfds, ipc shm etc.
> > 
> > 3. It requires unintuitive configuration for what should be basic
> >    shared accounting semantics. Per default you still get the old
> >    'first touch' semantics, but to get sharing you need to reconfigure
> >    the filesystems?
> > 
> > 4. If a task needs to work with a hierarchy of data sharing domains -
> >    system-wide, group of jobs, job - it must interact with a hierarchy
> >    of filesystem mounts. This is a pain to setup and may require task
> >    awareness. Moving data around, working with different mount points.
> >    Also, no shared and private data accounting within the same file.
> > 
> > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
> >    entangled, sitting in disjunct domains. OOM killing is one quirk of
> >    that, but there are others you haven't touched on. Who is charged
> >    for the CPU cycles of reclaim in the out-of-band domain?  Who is
> >    charged for the paging IO? How is resource pressure accounted and
> >    attributed? Soon you need cpu= and io= as well.
> > 
> > My take on this is that it might work for your rather specific
> > usecase, but it doesn't strike me as a general-purpose feature
> > suitable for upstream.
> > 
> > 
> > If we want sharing semantics for memory, I think we need a more
> > generic implementation with a cleaner interface.
> > 
> > Here is one idea:
> > 
> > Have you considered reparenting pages that are accessed by multiple
> > cgroups to the first common ancestor of those groups?
> > 
> > Essentially, whenever there is a memory access (minor fault, buffered
> > IO) to a page that doesn't belong to the accessing task's cgroup, you
> > find the common ancestor between that task and the owning cgroup, and
> > move the page there.
> > 
> > With a tree like this:
> > 
> > 	root - job group - job
> >                         `- job
> >             `- job group - job
> >                         `- job
> > 
> > all pages accessed inside that tree will propagate to the highest
> > level at which they are shared - which is the same level where you'd
> > also set shared policies, like a job group memory limit or io weight.
> > 
> > E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> > pages would bubble to the respective job group, private data would
> > stay within each job.
> > 
> > No further user configuration necessary. Although you still *can* use
> > mount namespacing etc. to prohibit undesired sharing between cgroups.
> > 
> > The actual user-visible accounting change would be quite small, and
> > arguably much more intuitive. Remember that accounting is recursive,
> > meaning that a job page today also shows up in the counters of job
> > group and root. This would not change. The only thing that IS weird
> > today is that when two jobs share a page, it will arbitrarily show up
> > in one job's counter but not in the other's. That would change: it
> > would no longer show up as either, since it's not private to either;
> > it would just be a job group (and up) page.

These are great questions.

> In general I like the idea, but I think the user-visible change will be quite
> large, almost "cgroup v3"-large.

I wouldn't quite say cgroup3 :-) But it would definitely require a new
mount option for cgroupfs.

> Here are some problems:
> 1) Anything shared between e.g. system.slice and user.slice now belongs
>    to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache
>    belonging to shared libraries.

Correct, but arguably that's a good thing:

Right now, even though the libraries are used by both, they'll be held
by one group. This can cause two priority inversions: hipri references
don't prevent the shared page from thrashing inside a lowpri group,
which could subject the hipri group to reclaim pressure and waiting
for slow refaults of the lowpri groups; if the lowpri group is the
hotter user of this page, this could sustain. Or the page ends up in
the hipri group, and the lowpri group pins it there even when the
hipri group is done with it, thus stealing its capacity.

Yes, a libc page used by everybody in the system would end up in the
root cgroup. But arguably that makes much more sense than having it
show up as exclusive memory of system.slice/systemd-udevd.service.
And certainly we don't want a universally shared page be subjected to
the local resource pressure of one lowpri user of it.

Recognizing the shared property and propagating it to the common
domain - the level at which priorities are equal between them - would
make the accounting clearer and solve both these inversions.

> 2) It's concerning in security terms. If I understand the idea correctly, a
>    read-only access will allow to move charges to an upper level, potentially
>    crossing memory.max limits. It doesn't sound safe.

Hm. The mechanism is slightly different, but escaping memory.max
happens today as well: shared memory is already not subject to the
memory.max of (n-1)/n cgroups that touch it.

So before, you can escape containment to whatever other cgroup is
using the page. After, you can escape to the common domain. It's
difficult for me to say one is clearly worse than the other. You can
conceive of realistic scenarios where both are equally problematic.

Practically, they appear to require the same solution: if the
environment isn't to be trusted, namespacing and limiting access to
shared data is necessary to avoid cgroups escaping containment or
DoSing other groups.

> 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent
>    it returns us to the cgroup v1 world and a question of competition between
>    resources consumed by a cgroup directly and through children cgroups. Not
>    like the problem doesn't exist now, but it's less pronounced.
>    If say >50% of system.slice's memory will belong to system.slice directly,
>    then we likely will need separate non-recursive counters, limits, protections,
>    etc.

I actually do see numbers like this in practice. Temporary
system.slice units allocate cache, then their cgroups get deleted and
the cache is reused by the next instances. Quite often, system.slice
has much more memory than its subgroups combined.

So in a way, we have what I'm proposing if the sharing happens with
dead cgroups. Sharing with live cgroups wouldn't necessarily create a
bigger demand for new counters than what we have now.

I think the cgroup1 issue was slightly different: in cgroup1 we
allowed *tasks* to live in non-leaf groups, and so users wanted to
control the *private* memory of said tasks with policies that were
*different* from the shared policies applied to the leaves.

This wouldn't be the same here. Tasks are still only inside leafs, and
there is no "private" memory inside a non-leaf group. It's shared
among the children, and so subject to policies shared by all children.

> 4) Imagine a production server and a system administrator entering using ssh
>    (and being put into user.slice) and running a big grep... It screws up all
>    memory accounting until a next reboot. Not a completely impossible scenario.

This can also happen with the first-touch model, though. The second
you touch private data of some workload, the memory might escape it.

It's not as pronounced with a first-touch policy - although proactive
reclaim makes this worse. But I'm not sure you can call it a new
concern in the proposed model: you already have to be careful with the
data you touch and bring into memory from your current cgroup.

Again, I think this is where mount namespaces come in. You're not
necessarily supposed to see private data of workloads from the outside
and access it accidentally. It's common practice to ssh directly into
containers to muck with them and their memory, at which point you'll
be in the appropriate cgroup and permission context, too.

However, I do agree with Mina and you: this is a significant change in
behavior, and a cgroupfs mount option would certainly be warranted.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-23 20:21     ` Johannes Weiner
@ 2021-11-23 21:19       ` Mina Almasry
  2021-11-23 22:49         ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Mina Almasry @ 2021-11-23 21:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Tue, Nov 23, 2021 at 12:21 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote:
> > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > > > Problem:
> > > > Currently shared memory is charged to the memcg of the allocating
> > > > process. This makes memory usage of processes accessing shared memory
> > > > a bit unpredictable since whichever process accesses the memory first
> > > > will get charged. We have a number of use cases where our userspace
> > > > would like deterministic charging of shared memory:
> > > >
> > > > 1. System services allocating memory for client jobs:
> > > > We have services (namely a network access service[1]) that provide
> > > > functionality for clients running on the machine and allocate memory
> > > > to carry out these services. The memory usage of these services
> > > > depends on the number of jobs running on the machine and the nature of
> > > > the requests made to the service, which makes the memory usage of
> > > > these services hard to predict and thus hard to limit via memory.max.
> > > > These system services would like a way to allocate memory and instruct
> > > > the kernel to charge this memory to the client’s memcg.
> > > >
> > > > 2. Shared filesystem between subtasks of a large job
> > > > Our infrastructure has large meta jobs such as kubernetes which spawn
> > > > multiple subtasks which share a tmpfs mount. These jobs and its
> > > > subtasks use that tmpfs mount for various purposes such as data
> > > > sharing or persistent data between the subtask restarts. In kubernetes
> > > > terminology, the meta job is similar to pods and subtasks are
> > > > containers under pods. We want the shared memory to be
> > > > deterministically charged to the kubernetes's pod and independent to
> > > > the lifetime of containers under the pod.
> > > >
> > > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > > We’d like to optimize memory usage on the machine by sharing libraries
> > > > and language runtimes of many of the processes running on our machines
> > > > in separate memcgs. This produces a side effect that one job may be
> > > > unlucky to be the first to access many of the libraries and may get
> > > > oom killed as all the cached files get charged to it.
> > > >
> > > > Design:
> > > > My rough proposal to solve this problem is to simply add a
> > > > ‘memcg=/path/to/memcg’ mount option for filesystems:
> > > > directing all the memory of the file system to be ‘remote charged’ to
> > > > cgroup provided by that memcg= option.
> > > >
> > > > Caveats:
> > > >
> > > > 1. One complication to address is the behavior when the target memcg
> > > > hits its memory.max limit because of remote charging. In this case the
> > > > oom-killer will be invoked, but the oom-killer may not find anything
> > > > to kill in the target memcg being charged. Thera are a number of considerations
> > > > in this case:
> > > >
> > > > 1. It's not great to kill the allocating process since the allocating process
> > > >    is not running in the memcg under oom, and killing it will not free memory
> > > >    in the memcg under oom.
> > > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > > >    somehow. If not, the process will forever loop the pagefault in the upstream
> > > >    kernel.
> > > >
> > > > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > > > to the caller. This will cause will cause the process executing the remote
> > > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > > > path.  This will be documented behavior of remote charging, and this feature is
> > > > opt-in. Users can:
> > > > - Not opt-into the feature if they want.
> > > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> > > >   abort if they desire.
> > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > > >   operation without executing the remote charge if possible.
> > > >
> > > > 2. Only processes allowed the enter cgroup at mount time can mount a
> > > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> > > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> > > > process with write access to this mount point will be able to charge memory to
> > > > <cgroup>. This is largely a non-issue because in configurations where there is
> > > > untrusted code running on the machine, mount point access needs to be
> > > > restricted to the intended users only regardless of whether the mount point
> > > > memory is deterministly charged or not.
> > >
> > > I'm not a fan of this. It uses filesystem mounts to create shareable
> > > resource domains outside of the cgroup hierarchy, which has all the
> > > downsides you listed, and more:
> > >
> > > 1. You need a filesystem interface in the first place, and a new
> > >    ad-hoc channel and permission model to coordinate with the cgroup
> > >    tree, which isn't great. All filesystems you want to share data on
> > >    need to be converted.
> > >
> > > 2. It doesn't extend to non-filesystem sources of shared data, such as
> > >    memfds, ipc shm etc.
> > >
> > > 3. It requires unintuitive configuration for what should be basic
> > >    shared accounting semantics. Per default you still get the old
> > >    'first touch' semantics, but to get sharing you need to reconfigure
> > >    the filesystems?
> > >
> > > 4. If a task needs to work with a hierarchy of data sharing domains -
> > >    system-wide, group of jobs, job - it must interact with a hierarchy
> > >    of filesystem mounts. This is a pain to setup and may require task
> > >    awareness. Moving data around, working with different mount points.
> > >    Also, no shared and private data accounting within the same file.
> > >
> > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
> > >    entangled, sitting in disjunct domains. OOM killing is one quirk of
> > >    that, but there are others you haven't touched on. Who is charged
> > >    for the CPU cycles of reclaim in the out-of-band domain?  Who is
> > >    charged for the paging IO? How is resource pressure accounted and
> > >    attributed? Soon you need cpu= and io= as well.
> > >
> > > My take on this is that it might work for your rather specific
> > > usecase, but it doesn't strike me as a general-purpose feature
> > > suitable for upstream.
> > >
> > >
> > > If we want sharing semantics for memory, I think we need a more
> > > generic implementation with a cleaner interface.
> > >
> > > Here is one idea:
> > >
> > > Have you considered reparenting pages that are accessed by multiple
> > > cgroups to the first common ancestor of those groups?
> > >
> > > Essentially, whenever there is a memory access (minor fault, buffered
> > > IO) to a page that doesn't belong to the accessing task's cgroup, you
> > > find the common ancestor between that task and the owning cgroup, and
> > > move the page there.
> > >
> > > With a tree like this:
> > >
> > >     root - job group - job
> > >                         `- job
> > >             `- job group - job
> > >                         `- job
> > >
> > > all pages accessed inside that tree will propagate to the highest
> > > level at which they are shared - which is the same level where you'd
> > > also set shared policies, like a job group memory limit or io weight.
> > >
> > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> > > pages would bubble to the respective job group, private data would
> > > stay within each job.
> > >
> > > No further user configuration necessary. Although you still *can* use
> > > mount namespacing etc. to prohibit undesired sharing between cgroups.
> > >
> > > The actual user-visible accounting change would be quite small, and
> > > arguably much more intuitive. Remember that accounting is recursive,
> > > meaning that a job page today also shows up in the counters of job
> > > group and root. This would not change. The only thing that IS weird
> > > today is that when two jobs share a page, it will arbitrarily show up
> > > in one job's counter but not in the other's. That would change: it
> > > would no longer show up as either, since it's not private to either;
> > > it would just be a job group (and up) page.
>
> These are great questions.
>
> > In general I like the idea, but I think the user-visible change will be quite
> > large, almost "cgroup v3"-large.
>
> I wouldn't quite say cgroup3 :-) But it would definitely require a new
> mount option for cgroupfs.
>
> > Here are some problems:
> > 1) Anything shared between e.g. system.slice and user.slice now belongs
> >    to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache
> >    belonging to shared libraries.
>
> Correct, but arguably that's a good thing:
>
> Right now, even though the libraries are used by both, they'll be held
> by one group. This can cause two priority inversions: hipri references
> don't prevent the shared page from thrashing inside a lowpri group,
> which could subject the hipri group to reclaim pressure and waiting
> for slow refaults of the lowpri groups; if the lowpri group is the
> hotter user of this page, this could sustain. Or the page ends up in
> the hipri group, and the lowpri group pins it there even when the
> hipri group is done with it, thus stealing its capacity.
>
> Yes, a libc page used by everybody in the system would end up in the
> root cgroup. But arguably that makes much more sense than having it
> show up as exclusive memory of system.slice/systemd-udevd.service.
> And certainly we don't want a universally shared page be subjected to
> the local resource pressure of one lowpri user of it.
>
> Recognizing the shared property and propagating it to the common
> domain - the level at which priorities are equal between them - would
> make the accounting clearer and solve both these inversions.
>
> > 2) It's concerning in security terms. If I understand the idea correctly, a
> >    read-only access will allow to move charges to an upper level, potentially
> >    crossing memory.max limits. It doesn't sound safe.
>
> Hm. The mechanism is slightly different, but escaping memory.max
> happens today as well: shared memory is already not subject to the
> memory.max of (n-1)/n cgroups that touch it.
>
> So before, you can escape containment to whatever other cgroup is
> using the page. After, you can escape to the common domain. It's
> difficult for me to say one is clearly worse than the other. You can
> conceive of realistic scenarios where both are equally problematic.
>
> Practically, they appear to require the same solution: if the
> environment isn't to be trusted, namespacing and limiting access to
> shared data is necessary to avoid cgroups escaping containment or
> DoSing other groups.
>
> > 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent
> >    it returns us to the cgroup v1 world and a question of competition between
> >    resources consumed by a cgroup directly and through children cgroups. Not
> >    like the problem doesn't exist now, but it's less pronounced.
> >    If say >50% of system.slice's memory will belong to system.slice directly,
> >    then we likely will need separate non-recursive counters, limits, protections,
> >    etc.
>
> I actually do see numbers like this in practice. Temporary
> system.slice units allocate cache, then their cgroups get deleted and
> the cache is reused by the next instances. Quite often, system.slice
> has much more memory than its subgroups combined.
>
> So in a way, we have what I'm proposing if the sharing happens with
> dead cgroups. Sharing with live cgroups wouldn't necessarily create a
> bigger demand for new counters than what we have now.
>
> I think the cgroup1 issue was slightly different: in cgroup1 we
> allowed *tasks* to live in non-leaf groups, and so users wanted to
> control the *private* memory of said tasks with policies that were
> *different* from the shared policies applied to the leaves.
>
> This wouldn't be the same here. Tasks are still only inside leafs, and
> there is no "private" memory inside a non-leaf group. It's shared
> among the children, and so subject to policies shared by all children.
>
> > 4) Imagine a production server and a system administrator entering using ssh
> >    (and being put into user.slice) and running a big grep... It screws up all
> >    memory accounting until a next reboot. Not a completely impossible scenario.
>
> This can also happen with the first-touch model, though. The second
> you touch private data of some workload, the memory might escape it.
>
> It's not as pronounced with a first-touch policy - although proactive
> reclaim makes this worse. But I'm not sure you can call it a new
> concern in the proposed model: you already have to be careful with the
> data you touch and bring into memory from your current cgroup.
>
> Again, I think this is where mount namespaces come in. You're not
> necessarily supposed to see private data of workloads from the outside
> and access it accidentally. It's common practice to ssh directly into
> containers to muck with them and their memory, at which point you'll
> be in the appropriate cgroup and permission context, too.
>
> However, I do agree with Mina and you: this is a significant change in
> behavior, and a cgroupfs mount option would certainly be warranted.

I don't mean to be a nag here but I have trouble seeing pages being
re-accounted on minor faults working for us, and that might be fine,
but I'm expecting if it doesn't really work for us it likely won't
work for the next person trying to use this.

The issue is that the fact that the memory is initially accounted to
the allocating process forces the sysadmin to overprovision the cgroup
limit anyway so that the tasks don't oom if tasks are pre-allocating
memory. The memory usage of a task accessing shared memory stays very
unpredictable because it's waiting on another task in another cgroup
to touch the shared memory for the shared memory to be unaccounted to
its cgroup.

I have a couple of (admittingly probably controversial) suggestions:
1. memcg flag, say memory.charge_for_shared_memory. When we allocate
shared memory, we charge it to the first ancestor memcg that has
memory.charge_for_shared_memory==true.
2. Somehow on the creation of shared memory, we somehow declare that
this memory belongs to <cgroup>. Only descendants of <cgroup> are able
to touch the shared memory and the shared memory is charged to
<cgroup>.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-23 21:19       ` Mina Almasry
@ 2021-11-23 22:49         ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2021-11-23 22:49 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Johannes Weiner, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Tue, Nov 23, 2021 at 01:19:47PM -0800, Mina Almasry wrote:
> On Tue, Nov 23, 2021 at 12:21 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote:
> > > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> > > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > > > > Problem:
> > > > > Currently shared memory is charged to the memcg of the allocating
> > > > > process. This makes memory usage of processes accessing shared memory
> > > > > a bit unpredictable since whichever process accesses the memory first
> > > > > will get charged. We have a number of use cases where our userspace
> > > > > would like deterministic charging of shared memory:
> > > > >
> > > > > 1. System services allocating memory for client jobs:
> > > > > We have services (namely a network access service[1]) that provide
> > > > > functionality for clients running on the machine and allocate memory
> > > > > to carry out these services. The memory usage of these services
> > > > > depends on the number of jobs running on the machine and the nature of
> > > > > the requests made to the service, which makes the memory usage of
> > > > > these services hard to predict and thus hard to limit via memory.max.
> > > > > These system services would like a way to allocate memory and instruct
> > > > > the kernel to charge this memory to the client’s memcg.
> > > > >
> > > > > 2. Shared filesystem between subtasks of a large job
> > > > > Our infrastructure has large meta jobs such as kubernetes which spawn
> > > > > multiple subtasks which share a tmpfs mount. These jobs and its
> > > > > subtasks use that tmpfs mount for various purposes such as data
> > > > > sharing or persistent data between the subtask restarts. In kubernetes
> > > > > terminology, the meta job is similar to pods and subtasks are
> > > > > containers under pods. We want the shared memory to be
> > > > > deterministically charged to the kubernetes's pod and independent to
> > > > > the lifetime of containers under the pod.
> > > > >
> > > > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > > > We’d like to optimize memory usage on the machine by sharing libraries
> > > > > and language runtimes of many of the processes running on our machines
> > > > > in separate memcgs. This produces a side effect that one job may be
> > > > > unlucky to be the first to access many of the libraries and may get
> > > > > oom killed as all the cached files get charged to it.
> > > > >
> > > > > Design:
> > > > > My rough proposal to solve this problem is to simply add a
> > > > > ‘memcg=/path/to/memcg’ mount option for filesystems:
> > > > > directing all the memory of the file system to be ‘remote charged’ to
> > > > > cgroup provided by that memcg= option.
> > > > >
> > > > > Caveats:
> > > > >
> > > > > 1. One complication to address is the behavior when the target memcg
> > > > > hits its memory.max limit because of remote charging. In this case the
> > > > > oom-killer will be invoked, but the oom-killer may not find anything
> > > > > to kill in the target memcg being charged. Thera are a number of considerations
> > > > > in this case:
> > > > >
> > > > > 1. It's not great to kill the allocating process since the allocating process
> > > > >    is not running in the memcg under oom, and killing it will not free memory
> > > > >    in the memcg under oom.
> > > > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > > > >    somehow. If not, the process will forever loop the pagefault in the upstream
> > > > >    kernel.
> > > > >
> > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > > > > to the caller. This will cause will cause the process executing the remote
> > > > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
> > > > > path.  This will be documented behavior of remote charging, and this feature is
> > > > > opt-in. Users can:
> > > > > - Not opt-into the feature if they want.
> > > > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
> > > > >   abort if they desire.
> > > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > > > >   operation without executing the remote charge if possible.
> > > > >
> > > > > 2. Only processes allowed the enter cgroup at mount time can mount a
> > > > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
> > > > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
> > > > > process with write access to this mount point will be able to charge memory to
> > > > > <cgroup>. This is largely a non-issue because in configurations where there is
> > > > > untrusted code running on the machine, mount point access needs to be
> > > > > restricted to the intended users only regardless of whether the mount point
> > > > > memory is deterministly charged or not.
> > > >
> > > > I'm not a fan of this. It uses filesystem mounts to create shareable
> > > > resource domains outside of the cgroup hierarchy, which has all the
> > > > downsides you listed, and more:
> > > >
> > > > 1. You need a filesystem interface in the first place, and a new
> > > >    ad-hoc channel and permission model to coordinate with the cgroup
> > > >    tree, which isn't great. All filesystems you want to share data on
> > > >    need to be converted.
> > > >
> > > > 2. It doesn't extend to non-filesystem sources of shared data, such as
> > > >    memfds, ipc shm etc.
> > > >
> > > > 3. It requires unintuitive configuration for what should be basic
> > > >    shared accounting semantics. Per default you still get the old
> > > >    'first touch' semantics, but to get sharing you need to reconfigure
> > > >    the filesystems?
> > > >
> > > > 4. If a task needs to work with a hierarchy of data sharing domains -
> > > >    system-wide, group of jobs, job - it must interact with a hierarchy
> > > >    of filesystem mounts. This is a pain to setup and may require task
> > > >    awareness. Moving data around, working with different mount points.
> > > >    Also, no shared and private data accounting within the same file.
> > > >
> > > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
> > > >    entangled, sitting in disjunct domains. OOM killing is one quirk of
> > > >    that, but there are others you haven't touched on. Who is charged
> > > >    for the CPU cycles of reclaim in the out-of-band domain?  Who is
> > > >    charged for the paging IO? How is resource pressure accounted and
> > > >    attributed? Soon you need cpu= and io= as well.
> > > >
> > > > My take on this is that it might work for your rather specific
> > > > usecase, but it doesn't strike me as a general-purpose feature
> > > > suitable for upstream.
> > > >
> > > >
> > > > If we want sharing semantics for memory, I think we need a more
> > > > generic implementation with a cleaner interface.
> > > >
> > > > Here is one idea:
> > > >
> > > > Have you considered reparenting pages that are accessed by multiple
> > > > cgroups to the first common ancestor of those groups?
> > > >
> > > > Essentially, whenever there is a memory access (minor fault, buffered
> > > > IO) to a page that doesn't belong to the accessing task's cgroup, you
> > > > find the common ancestor between that task and the owning cgroup, and
> > > > move the page there.
> > > >
> > > > With a tree like this:
> > > >
> > > >     root - job group - job
> > > >                         `- job
> > > >             `- job group - job
> > > >                         `- job
> > > >
> > > > all pages accessed inside that tree will propagate to the highest
> > > > level at which they are shared - which is the same level where you'd
> > > > also set shared policies, like a job group memory limit or io weight.
> > > >
> > > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> > > > pages would bubble to the respective job group, private data would
> > > > stay within each job.
> > > >
> > > > No further user configuration necessary. Although you still *can* use
> > > > mount namespacing etc. to prohibit undesired sharing between cgroups.
> > > >
> > > > The actual user-visible accounting change would be quite small, and
> > > > arguably much more intuitive. Remember that accounting is recursive,
> > > > meaning that a job page today also shows up in the counters of job
> > > > group and root. This would not change. The only thing that IS weird
> > > > today is that when two jobs share a page, it will arbitrarily show up
> > > > in one job's counter but not in the other's. That would change: it
> > > > would no longer show up as either, since it's not private to either;
> > > > it would just be a job group (and up) page.
> >
> > These are great questions.
> >
> > > In general I like the idea, but I think the user-visible change will be quite
> > > large, almost "cgroup v3"-large.
> >
> > I wouldn't quite say cgroup3 :-) But it would definitely require a new
> > mount option for cgroupfs.
> >
> > > Here are some problems:
> > > 1) Anything shared between e.g. system.slice and user.slice now belongs
> > >    to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache
> > >    belonging to shared libraries.
> >
> > Correct, but arguably that's a good thing:
> >
> > Right now, even though the libraries are used by both, they'll be held
> > by one group. This can cause two priority inversions: hipri references
> > don't prevent the shared page from thrashing inside a lowpri group,
> > which could subject the hipri group to reclaim pressure and waiting
> > for slow refaults of the lowpri groups; if the lowpri group is the
> > hotter user of this page, this could sustain. Or the page ends up in
> > the hipri group, and the lowpri group pins it there even when the
> > hipri group is done with it, thus stealing its capacity.
> >
> > Yes, a libc page used by everybody in the system would end up in the
> > root cgroup. But arguably that makes much more sense than having it
> > show up as exclusive memory of system.slice/systemd-udevd.service.
> > And certainly we don't want a universally shared page be subjected to
> > the local resource pressure of one lowpri user of it.
> >
> > Recognizing the shared property and propagating it to the common
> > domain - the level at which priorities are equal between them - would
> > make the accounting clearer and solve both these inversions.
> >
> > > 2) It's concerning in security terms. If I understand the idea correctly, a
> > >    read-only access will allow to move charges to an upper level, potentially
> > >    crossing memory.max limits. It doesn't sound safe.
> >
> > Hm. The mechanism is slightly different, but escaping memory.max
> > happens today as well: shared memory is already not subject to the
> > memory.max of (n-1)/n cgroups that touch it.
> >
> > So before, you can escape containment to whatever other cgroup is
> > using the page. After, you can escape to the common domain. It's
> > difficult for me to say one is clearly worse than the other. You can
> > conceive of realistic scenarios where both are equally problematic.
> >
> > Practically, they appear to require the same solution: if the
> > environment isn't to be trusted, namespacing and limiting access to
> > shared data is necessary to avoid cgroups escaping containment or
> > DoSing other groups.
> >
> > > 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent
> > >    it returns us to the cgroup v1 world and a question of competition between
> > >    resources consumed by a cgroup directly and through children cgroups. Not
> > >    like the problem doesn't exist now, but it's less pronounced.
> > >    If say >50% of system.slice's memory will belong to system.slice directly,
> > >    then we likely will need separate non-recursive counters, limits, protections,
> > >    etc.
> >
> > I actually do see numbers like this in practice. Temporary
> > system.slice units allocate cache, then their cgroups get deleted and
> > the cache is reused by the next instances. Quite often, system.slice
> > has much more memory than its subgroups combined.
> >
> > So in a way, we have what I'm proposing if the sharing happens with
> > dead cgroups. Sharing with live cgroups wouldn't necessarily create a
> > bigger demand for new counters than what we have now.
> >
> > I think the cgroup1 issue was slightly different: in cgroup1 we
> > allowed *tasks* to live in non-leaf groups, and so users wanted to
> > control the *private* memory of said tasks with policies that were
> > *different* from the shared policies applied to the leaves.
> >
> > This wouldn't be the same here. Tasks are still only inside leafs, and
> > there is no "private" memory inside a non-leaf group. It's shared
> > among the children, and so subject to policies shared by all children.
> >
> > > 4) Imagine a production server and a system administrator entering using ssh
> > >    (and being put into user.slice) and running a big grep... It screws up all
> > >    memory accounting until a next reboot. Not a completely impossible scenario.
> >
> > This can also happen with the first-touch model, though. The second
> > you touch private data of some workload, the memory might escape it.
> >
> > It's not as pronounced with a first-touch policy - although proactive
> > reclaim makes this worse. But I'm not sure you can call it a new
> > concern in the proposed model: you already have to be careful with the
> > data you touch and bring into memory from your current cgroup.
> >
> > Again, I think this is where mount namespaces come in. You're not
> > necessarily supposed to see private data of workloads from the outside
> > and access it accidentally. It's common practice to ssh directly into
> > containers to muck with them and their memory, at which point you'll
> > be in the appropriate cgroup and permission context, too.
> >
> > However, I do agree with Mina and you: this is a significant change in
> > behavior, and a cgroupfs mount option would certainly be warranted.
> 
> I don't mean to be a nag here but I have trouble seeing pages being
> re-accounted on minor faults working for us, and that might be fine,
> but I'm expecting if it doesn't really work for us it likely won't
> work for the next person trying to use this.

Yes, I agree, the performance impact might be non-trivial.
I think we discussed something similar in the past in the context
of re-charging pages belonging to a deleted cgroup. And the consensus
was that we'd need to add hooks into many places to check whether
a page belongs to a dying (or other-than-current) cgroup and it might
be not cheap.

> 
> The issue is that the fact that the memory is initially accounted to
> the allocating process forces the sysadmin to overprovision the cgroup
> limit anyway so that the tasks don't oom if tasks are pre-allocating
> memory. The memory usage of a task accessing shared memory stays very
> unpredictable because it's waiting on another task in another cgroup
> to touch the shared memory for the shared memory to be unaccounted to
> its cgroup.
> 
> I have a couple of (admittingly probably controversial) suggestions:
> 1. memcg flag, say memory.charge_for_shared_memory. When we allocate
> shared memory, we charge it to the first ancestor memcg that has
> memory.charge_for_shared_memory==true.

I think the problem here is that we try really hard to avoid any
per-memory-type knobs, and this is another one.

> 2. Somehow on the creation of shared memory, we somehow declare that
> this memory belongs to <cgroup>. Only descendants of <cgroup> are able
> to touch the shared memory and the shared memory is charged to
> <cgroup>.

This sounds like a mount namespace.

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-22 19:04 ` Johannes Weiner
  2021-11-22 22:09   ` Mina Almasry
  2021-11-22 23:09   ` Roman Gushchin
@ 2021-11-24 17:27   ` Michal Hocko
  2021-11-29  6:00   ` Shakeel Butt
  3 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2021-11-24 17:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Vladimir Davydov, Hugh Dickins, Shuah Khan, Shakeel Butt,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

On Mon 22-11-21 14:04:04, Johannes Weiner wrote:
[...]
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
> 
> 1. You need a filesystem interface in the first place, and a new
>    ad-hoc channel and permission model to coordinate with the cgroup
>    tree, which isn't great. All filesystems you want to share data on
>    need to be converted.
> 
> 2. It doesn't extend to non-filesystem sources of shared data, such as
>    memfds, ipc shm etc.
> 
> 3. It requires unintuitive configuration for what should be basic
>    shared accounting semantics. Per default you still get the old
>    'first touch' semantics, but to get sharing you need to reconfigure
>    the filesystems?
> 
> 4. If a task needs to work with a hierarchy of data sharing domains -
>    system-wide, group of jobs, job - it must interact with a hierarchy
>    of filesystem mounts. This is a pain to setup and may require task
>    awareness. Moving data around, working with different mount points.
>    Also, no shared and private data accounting within the same file.
> 
> 5. It reintroduces cgroup1 semantics of tasks and resouces, which are
>    entangled, sitting in disjunct domains. OOM killing is one quirk of
>    that, but there are others you haven't touched on. Who is charged
>    for the CPU cycles of reclaim in the out-of-band domain?  Who is
>    charged for the paging IO? How is resource pressure accounted and
>    attributed? Soon you need cpu= and io= as well.
> 
> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.

I just want to reiterate that this resonates with my concerns expressed
earlier and thanks for expressing them in a much better structured and
comprehensive way, Johannes.

[btw. a non-technical comment. For features like this it is better to
 not rush into newer versions posting until there is at least some
 agreement for the feature. Otherwise we have fragments of the
 discussion spread over several email threads]

> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
> 
> Here is one idea:
> 
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
> 
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
> 
> With a tree like this:
> 
> 	root - job group - job
>                         `- job
>             `- job group - job
>                         `- job
> 
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
> 
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
> 
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
> 
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
> 
> This would be a generic implementation of resource sharing semantics:
> independent of data source and filesystems, contained inside the
> cgroup interface, and reusing the existing hierarchies of accounting
> and control domains to also represent levels of common property.
> 
> Thoughts?

This is an interesting concept. I am not sure how expensive and
intrusive (code wise) this would get but that is more of an
implementation detail.

Another option would be to provide a syscall to claim a shared resource.
This would require a cooperation of the application but it would
establish a clear responsibility model.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/4] Deterministic charging of shared memory
  2021-11-22 19:04 ` Johannes Weiner
                     ` (2 preceding siblings ...)
  2021-11-24 17:27   ` Michal Hocko
@ 2021-11-29  6:00   ` Shakeel Butt
  3 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2021-11-29  6:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Michal Hocko, Vladimir Davydov, Hugh Dickins, Shuah Khan,
	Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin,
	Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm

Hi Johannes,

On Mon, Nov 22, 2021 at 11:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[...]
> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
>         root - job group - job
>                         `- job
>             `- job group - job
>                         `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
>
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
>
> This would be a generic implementation of resource sharing semantics:
> independent of data source and filesystems, contained inside the
> cgroup interface, and reusing the existing hierarchies of accounting
> and control domains to also represent levels of common property.
>
> Thoughts?

Before commenting on your proposal, I would like to clarify that the
use-cases given are not specific to us but are more general. Though I
think you are arguing that the implementation is not general purpose
which I kind of agree with.

Let me take a stab again at describing these use-cases which I think
can be partitioned based on the relationship of the entities
sharing/accessing the memory among them. (Sorry for repeating these
because I think we should keep these in mind while discussing the
possible solutions).

1) Mutually trusted entities sharing memory for collaborative work.
One example is a file-system shared between sub-tasks of a meta-job.
(Mina's second use-case).

2) Independent entities sharing memory to reduce cost. Examples
include shared libraries, packages or tool chains. (Mina's third
use-case).

3) One entity observing or monitoring another entity. Examples include
gdb, ptrace, uprobes, VM or process migration and checkpointing.

4) Server-Client relationship. (Mina's first use-case.

Let me put (3) out of the way first as these operations have special
interfaces and the target entity is a process (not a cgroup). Remote
charging works for these and no new oom corner cases are introduced.

For (1) and (2), I think your proposal aligns pretty well with them
but one important property is still missing which we are very adamant
about i.e. 'deterministic charge'. To explain with an example, suppose
two instances of the same job are running on two different systems. On
one system, it is sharing a shared library with an unrelated job and
the second instance is using that library alone. The owner will see
different memory usage for both instances which can mess with their
resource planning.

However I think this can be solved very easily with an opt-in add-on.
The node controller knows upfront the libraries/packages which can be
shared between the jobs and is responsible for creating the cgroup
hierarchy (at least the top level) for the jobs. It can create a
common ancestor for all such jobs and let the kernel know that if any
descendant accesses these libraries, charge to this specific ancestor.
If someone out of this sub-hierarchy accesses the memory, follow the
proposal i.e. common ancestor. With this specific opt-in add-on, all
job owners will see their job usage more consistent.

[I am putting this as a brainstorming discussion] Regarding (4), for
our use-case, the server wants the cost of the memory needed to serve
a client to be paid by the corresponding client. Please note that the
memory is not necessarily accessed by the client.

Now we can argue that this use-case can be served similar to (3) i.e.
through a special interface/syscall. I think that would be challenging
particularly when the lifetime of a client 'process' is independent of
the memory needed to serve that client.

Another way is to disable the accounting of that specific memory
needed to serve the clients (I think Roman suggested a similar notion
as disabling accounting of a tmpfs). Any other ideas?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-11-29  6:04 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
2021-11-20  4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
2021-11-20  7:53   ` Shakeel Butt
2021-11-20  4:50 ` [PATCH v4 2/4] mm/oom: handle remote ooms Mina Almasry
2021-11-20  5:07   ` Matthew Wilcox
2021-11-20  5:31     ` Mina Almasry
2021-11-20  7:58   ` Shakeel Butt
2021-11-20  4:50 ` [PATCH v4 3/4] mm, shmem: add filesystem memcg= option documentation Mina Almasry
2021-11-20  4:50 ` [PATCH v4 4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests Mina Almasry
2021-11-20  5:01 ` [PATCH v4 0/4] Deterministic charging of shared memory Matthew Wilcox
2021-11-20  5:27   ` Mina Almasry
2021-11-22 19:04 ` Johannes Weiner
2021-11-22 22:09   ` Mina Almasry
2021-11-22 23:09   ` Roman Gushchin
2021-11-23 19:26     ` Mina Almasry
2021-11-23 20:21     ` Johannes Weiner
2021-11-23 21:19       ` Mina Almasry
2021-11-23 22:49         ` Roman Gushchin
2021-11-24 17:27   ` Michal Hocko
2021-11-29  6:00   ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).