LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <clameter@sgi.com>, Paul Jackson <pj@sgi.com>,
	David Rientjes <rientjes@google.com>, Mel Gorman <mel@csn.ul.ie>,
	torvalds@linux-foundation.org, Eric Whitney <eric.whitney@hp.com>
Subject: Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't works on memoryless node.
Date: Tue, 05 Feb 2008 16:57:32 -0500	[thread overview]
Message-ID: <1202248652.5332.51.camel@localhost> (raw)
In-Reply-To: <20080205163406.270B.KOSAKI.MOTOHIRO@jp.fujitsu.com>

Here's a patch that addresses the problem w/o requiring change to
numactl or libnuma.  It DOES have side affects, discussed in the
description.

Tested with memoryless nodes and restricted cpusets using the numactl
installed with RHEL5.1.

Altho' nominally against 24-mm1, applies cleanly to 2.6.24.  Should be
suitable for 'stable' if everyone agrees.

Lee
----------------------------------

[PATCH] 2.6.24-mm1 - mempolicy:  silently restrict to allowed nodes

Kosaki-san noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes.  This patch attempts to fix that 
problem.

Some background:  

numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask.  set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned.  A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory.  So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.

  NOTE:  the same thing will occur when running in a cpuset
         with restricted mem_allowed--for the same reason:
         node mask contains dis-allowed nodes.

mbind(2), on the other hand, just masks off any nodes in the 
nodemask that are not included in the caller's mems_allowed.

In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains 
any memoryless nodes.  This is somewhat redundant as mpol_new() 
will remove memoryless nodes for interleave policy, as will 
bind_zonelist()--called by mpol_new() for BIND policy.

Proposed fix:

1) modify contextualize_policy to just remove the non-allowed
   nodes, as is currently done in-line for mbind().  This
   guarantees that the resulting mask includes only nodes with
   memory.

   NOTE:  this is a [benign, IMO] change in behavior for
          set_mempolicy().  Dis-allowed nodes will be silently
          ignored, rather than returning an error.

          Another, perhaps less benign, change in behavior:
          MPOL_PREFERRED policy that specifies only memoryless nodes
          or nodes that are disallowed in the cpuset will be interpreted
          as "local allocation" as the nodemask will be empty after
          the masking in contextualize_policy().  With a bit of
          additional hackery I can make this return EINVAL.

          Comments?

2) modify mbind() to use contextualize_policy(), like set_mempolicy(),
   instead of masking nodes in-line.

3) remove the now redundant check for memoryless nodes from
   mpol_check_policy().

4) remove the masking of policy nodes for interleave policy from
   mpol_new().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/mempolicy.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c	2008-02-05 11:25:17.000000000 -0500
+++ Linux/mm/mempolicy.c	2008-02-05 16:03:11.000000000 -0500
@@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
 			return -EINVAL;
 		break;
 	}
- 	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ 	return 0;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
 	switch (mode) {
 	case MPOL_INTERLEAVE:
 		policy->v.nodes = *nodes;
-		nodes_and(policy->v.nodes, policy->v.nodes,
-					node_states[N_HIGH_MEMORY]);
 		if (nodes_weight(policy->v.nodes) == 0) {
 			kmem_cache_free(policy_cache, policy);
 			return ERR_PTR(-EINVAL);
@@ -426,9 +424,13 @@ static int contextualize_policy(int mode
 	if (!nodes)
 		return 0;
 
+	/*
+	 * Restrict the nodes to the allowed nodes in the cpuset.
+	 * This is guaranteed to be a subset of nodes with memory.
+	 */
 	cpuset_update_task_memory_state();
-	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
-		return -EINVAL;
+	nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
 	return mpol_check_policy(mode, nodes);
 }
 
@@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
 	if (end == start)
 		return 0;
 
-	if (mpol_check_policy(mode, nmask))
+	if (contextualize_policy(mode, nmask))
 		return -EINVAL;
 
 	new = mpol_new(mode, nmask);
@@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long 
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-#ifdef CONFIG_CPUSETS
-	/* Restrict the nodes to the allowed nodes in the cpuset */
-	nodes_and(nodes, nodes, current->mems_allowed);
-#endif
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 




  reply	other threads:[~2008-02-05 21:57 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-02  8:12 [2.6.24-rc8-mm1][regression?] " KOSAKI Motohiro
2008-02-02  9:09 ` Andi Kleen
2008-02-02  9:37   ` KOSAKI Motohiro
2008-02-02 11:30     ` Andi Kleen
2008-02-04 19:03       ` Christoph Lameter
2008-02-04 18:20     ` Lee Schermerhorn
2008-02-05  9:26       ` [2.6.24 regression][BUGFIX] " KOSAKI Motohiro
2008-02-05 21:57         ` Lee Schermerhorn [this message]
2008-02-05 22:12           ` Christoph Lameter
2008-02-06 16:00             ` Lee Schermerhorn
2008-02-05 22:15           ` Paul Jackson
2008-02-06  2:17           ` David Rientjes
2008-02-06 16:11             ` Lee Schermerhorn
2008-02-06  6:49           ` KOSAKI Motohiro
2008-02-06 17:38         ` Lee Schermerhorn
2008-02-07  8:31           ` KOSAKI Motohiro
2008-02-08 19:45         ` [PATCH 2.6.24-mm1] Mempolicy: silently restrict nodemask to allowed nodes V3 Lee Schermerhorn
2008-02-09 18:11           ` KOSAKI Motohiro
2008-02-10  5:29           ` KOSAKI Motohiro
2008-02-10  5:49             ` Greg KH
2008-02-10  7:42               ` Linus Torvalds
2008-02-10 10:31                 ` Andrew Morton
2008-02-11 16:47                 ` Lee Schermerhorn
2008-02-12  4:30                   ` [PATCH for 2.6.24][regression fix] " KOSAKI Motohiro
2008-02-12  5:06                     ` David Rientjes
2008-02-12  5:07                     ` Andrew Morton
2008-02-12 13:18                       ` KOSAKI Motohiro
2008-02-05 10:17       ` [2.6.24-rc8-mm1][regression?] numactl --interleave=all doesn't works on memoryless node Paul Jackson
2008-02-05 11:14         ` KOSAKI Motohiro
2008-02-05 19:56         ` David Rientjes
2008-02-05 20:51           ` Paul Jackson
2008-02-05 21:03             ` David Rientjes
2008-02-05 21:33               ` Paul Jackson
2008-02-05 22:04                 ` Lee Schermerhorn
2008-02-05 22:44                   ` David Rientjes
2008-02-05 22:50                   ` Paul Jackson
2008-02-05 14:31       ` Mel Gorman
2008-02-05 15:23         ` Lee Schermerhorn
2008-02-05 18:12           ` Christoph Lameter
2008-02-05 18:27             ` Lee Schermerhorn
2008-02-05 19:04               ` Christoph Lameter
2008-02-05 19:15                 ` Paul Jackson
2008-02-05 20:06                   ` David Rientjes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1202248652.5332.51.camel@localhost \
    --to=lee.schermerhorn@hp.com \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=eric.whitney@hp.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mel@csn.ul.ie \
    --cc=pj@sgi.com \
    --cc=rientjes@google.com \
    --cc=torvalds@linux-foundation.org \
    --subject='Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn'\''t works on memoryless node.' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).