LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RESEND PATCH v2] kernfs: fix dentry unexpected skip
@ 2018-05-28 12:54 Hatayama, Daisuke
  2018-05-28 13:08 ` Hatayama, Daisuke
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-05-28 12:54 UTC (permalink / raw)
  To: 'gregkh@linuxfoundation.org', 'tj@kernel.org'
  Cc: Okajima, Toshiyuki, linux-kernel, 'ebiederm@aristanetworks.com'

kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.

To fix this issue, try to find a next node only when the returned
object is less than or equal to the original one.

Note that this implementation focuses on getting guarantee that the
existing nodes are never skipped, not interested in the other nodes
that are added or removed during the process.

We found this issue during a stress test that repeatedly reads
/sys/module directory to get a list of the currently loaded kernel
modules while repeatedly loading and unloading kernel modules
simultaneously.

v2: Fix the case where nodes with the same hash but with the name
    larger than the original node could still be skipped. Use
    kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
    description.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Suggested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Cc: Eric W. Biederman <ebiederm@aristanetworks.com>
---
 fs/kernfs/dir.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..3aeeb7a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
 {
+	struct kernfs_node *orig = pos;
+
 	pos = kernfs_dir_pos(ns, parent, ino, pos);
-	if (pos) {
+	if (pos && kernfs_sd_compare(pos, orig) <= 0) {
 		do {
 			struct rb_node *node = rb_next(&pos->rb);
 			if (!node)
-- 
1.7.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
  2018-05-28 12:54 [RESEND PATCH v2] kernfs: fix dentry unexpected skip Hatayama, Daisuke
@ 2018-05-28 13:08 ` Hatayama, Daisuke
  2018-05-29 16:26 ` 'tj@kernel.org'
  2018-06-02 17:25 ` Eric W. Biederman
  2 siblings, 0 replies; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-05-28 13:08 UTC (permalink / raw)
  To: 'gregkh@linuxfoundation.org', 'tj@kernel.org'
  Cc: Okajima, Toshiyuki, linux-kernel, 'ebiederm@aristanetworks.com'

[-- Attachment #1: Type: text/plain, Size: 3250 bytes --]

Hi,

I attach a reproducer for this issue.

Expand repro.tar.gz and run two commands:

  # make
  # ./repro.sh

Then, repro.sh will terminate when the issue is reproduced.

In this reproducer, we prepare AtvbAC0jwH.ko and U1cN9ZbARQ.ko having the same hash value.
Then, install U1cN9ZbARQ.ko and then repeating installing and removing AtvbAC0jwH.ko over and over.
Note that AtvbAC0jwH is smaller than U1cN9ZbARQ in the string comparison.
The issue is that output of ls -1 /sys/module contains NO U1cN9ZbARQ.ko.

I found a pair of AtvbAC0jwH and U1cN9ZbARQ using kernhash.c, retrieved from fs/kernfs/dirs.c,
contained in repro.tar.gz like:

    # ./kernfshash
    AtvbAC0jwH U1cN9ZbARQ 6b2609c5

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Hatayama, Daisuke
> Sent: Monday, May 28, 2018 9:54 PM
> To: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>;
> 'tj@kernel.org' <tj@kernel.org>
> Cc: Okajima, Toshiyuki <toshi.okajima@jp.fujitsu.com>;
> linux-kernel@vger.kernel.org; 'ebiederm@aristanetworks.com'
> <ebiederm@aristanetworks.com>
> Subject: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
> 
> To fix this issue, try to find a next node only when the returned
> object is less than or equal to the original one.
> 
> Note that this implementation focuses on getting guarantee that the
> existing nodes are never skipped, not interested in the other nodes
> that are added or removed during the process.
> 
> We found this issue during a stress test that repeatedly reads
> /sys/module directory to get a list of the currently loaded kernel
> modules while repeatedly loading and unloading kernel modules
> simultaneously.
> 
> v2: Fix the case where nodes with the same hash but with the name
>     larger than the original node could still be skipped. Use
>     kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
>     description.
> 
> Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
> Suggested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Cc: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
>  fs/kernfs/dir.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>  	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
> +	struct kernfs_node *orig = pos;
> +
>  	pos = kernfs_dir_pos(ns, parent, ino, pos);
> -	if (pos) {
> +	if (pos && kernfs_sd_compare(pos, orig) <= 0) {
>  		do {
>  			struct rb_node *node = rb_next(&pos->rb);
>  			if (!node)
> --
> 1.7.1
> 
> 
> 
> 
> 
> 


[-- Attachment #2: repro.tar.gz --]
[-- Type: application/x-gzip, Size: 2394 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
  2018-05-28 12:54 [RESEND PATCH v2] kernfs: fix dentry unexpected skip Hatayama, Daisuke
  2018-05-28 13:08 ` Hatayama, Daisuke
@ 2018-05-29 16:26 ` 'tj@kernel.org'
  2018-06-01  9:25   ` Hatayama, Daisuke
  2018-06-02 17:25 ` Eric W. Biederman
  2 siblings, 1 reply; 17+ messages in thread
From: 'tj@kernel.org' @ 2018-05-29 16:26 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

Hello,

On Mon, May 28, 2018 at 12:54:03PM +0000, Hatayama, Daisuke wrote:
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>  	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
> +	struct kernfs_node *orig = pos;
> +
>  	pos = kernfs_dir_pos(ns, parent, ino, pos);
> -	if (pos) {
> +	if (pos && kernfs_sd_compare(pos, orig) <= 0) {

Hmm... the code seems a bit unintuitive to me and I wonder whether
it's because there are two identical skipping loops in
kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
selectively disable one of them.  Wouldn't it make more sense to get
rid of it from kernfs_dir_pos() and skip explicitly only when
necessary?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
  2018-05-29 16:26 ` 'tj@kernel.org'
@ 2018-06-01  9:25   ` Hatayama, Daisuke
  2018-06-01 17:07     ` 'tj@kernel.org'
  0 siblings, 1 reply; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-06-01  9:25 UTC (permalink / raw)
  To: 'tj@kernel.org'
  Cc: 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

> -----Original Message-----
> From: Tejun Heo [mailto:htejun@gmail.com] On Behalf Of 'tj@kernel.org'
> Sent: Wednesday, May 30, 2018 1:26 AM
> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>; Okajima,
> Toshiyuki <toshi.okajima@jp.fujitsu.com>;
> linux-kernel@vger.kernel.org; 'ebiederm@aristanetworks.com'
> <ebiederm@aristanetworks.com>
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> Hello,
> 
> On Mon, May 28, 2018 at 12:54:03PM +0000, Hatayama, Daisuke wrote:
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 89d1dc1..3aeeb7a 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
> >  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> >  	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> >  {
> > +	struct kernfs_node *orig = pos;
> > +
> >  	pos = kernfs_dir_pos(ns, parent, ino, pos);
> > -	if (pos) {
> > +	if (pos && kernfs_sd_compare(pos, orig) <= 0) {
> 
> Hmm... the code seems a bit unintuitive to me and I wonder whether
> it's because there are two identical skipping loops in
> kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
> selectively disable one of them.  Wouldn't it make more sense to get
> rid of it from kernfs_dir_pos() and skip explicitly only when
> necessary?
> 

kernfs_dir_pos() checks if a kernfs_node object given as one of its
arguments is still active and if so returns it, or returns a
kernfs_node object that is most equal (possibly smaller and larger) to
the given object.

kernfs_dir_next_pos() returns a kernfs_node object that is next to the
object given by kernfs_dir_pos().

Two functions does different things and both need to skip inactive
nodes. I don't think it natural to remove the skip only from
kernfs_dir_pos().

OTOH, throughout getdents(), there is no case that the kernfs_node
object given to kernfs_dir_pos() is used afterwards in the
processing. So, is it enough to provide kernfs_dir_next_pos() only?
Then, the skip code is now not duplicated.

The patch below is my thought. How do you think?

But note that this patch has some bug so that system boot get hang
without detecting root filesystem disk :) I'm debugging this now.

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..140706f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,9 +1584,11 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
        return 0;
 }

-static struct kernfs_node *kernfs_dir_pos(const void *ns,
+static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
        struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+       struct kernfs_node *orig = pos;
+
        if (pos) {
                int valid = kernfs_active(pos) &&
                        pos->parent == parent && hash == pos->hash;
@@ -1608,7 +1610,9 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
                }
        }
        /* Skip over entries which are dying/dead or in the wrong namespace */
-       while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+       while (pos && (!kernfs_active(pos) ||
+                      (!orig || kernfs_sd_compare(pos, orig) <= 0) ||
+                      pos->ns != ns)) {
                struct rb_node *node = rb_next(&pos->rb);
                if (!node)
                        pos = NULL;
@@ -1618,22 +1622,6 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
        return pos;
 }

-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
-       struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
-{
-       pos = kernfs_dir_pos(ns, parent, ino, pos);
-       if (pos) {
-               do {
-                       struct rb_node *node = rb_next(&pos->rb);
-                       if (!node)
-                               pos = NULL;
-                       else
-                               pos = rb_to_kn(node);
-               } while (pos && (!kernfs_active(pos) || pos->ns != ns));
-       }
-       return pos;
-}
-
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
        struct dentry *dentry = file->f_path.dentry;
@@ -1648,7 +1636,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
        if (kernfs_ns_enabled(parent))
                ns = kernfs_info(dentry->d_sb)->ns;

-       for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
+       for (pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos);
             pos;
             pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
                const char *name = pos->name;

Thanks.
HATAYAMA, Daisuke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
  2018-06-01  9:25   ` Hatayama, Daisuke
@ 2018-06-01 17:07     ` 'tj@kernel.org'
  2018-06-04  9:46       ` Hatayama, Daisuke
  0 siblings, 1 reply; 17+ messages in thread
From: 'tj@kernel.org' @ 2018-06-01 17:07 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

Hello,

On Fri, Jun 01, 2018 at 09:25:32AM +0000, Hatayama, Daisuke wrote:
> kernfs_dir_pos() checks if a kernfs_node object given as one of its
> arguments is still active and if so returns it, or returns a
> kernfs_node object that is most equal (possibly smaller and larger) to
> the given object.

Sometimes they're duplicate operations tho, which is exactly the bug
the posted patch is trying to fix.  What I'm suggesting is instead of
leaving both instances and skipping one conditionally, put them in one
place and trigger only when necessary.  The sequence of operations
would be exactly the same.  The only difference is how the code is
organized.

> kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> object given by kernfs_dir_pos().
> 
> Two functions does different things and both need to skip inactive
> nodes. I don't think it natural to remove the skip only from
> kernfs_dir_pos().
> 
> OTOH, throughout getdents(), there is no case that the kernfs_node
> object given to kernfs_dir_pos() is used afterwards in the
> processing. So, is it enough to provide kernfs_dir_next_pos() only?
> Then, the skip code is now not duplicated.
> 
> The patch below is my thought. How do you think?
> 
> But note that this patch has some bug so that system boot get hang
> without detecting root filesystem disk :) I'm debugging this now.

I haven't looked into the code that closely but given that we had
cases where both skippings were fine and not, the condition is likely
gonna be a bit tricker?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
  2018-05-28 12:54 [RESEND PATCH v2] kernfs: fix dentry unexpected skip Hatayama, Daisuke
  2018-05-28 13:08 ` Hatayama, Daisuke
  2018-05-29 16:26 ` 'tj@kernel.org'
@ 2018-06-02 17:25 ` Eric W. Biederman
  2018-06-03 18:51   ` [CFT][PATCH] kernfs: Correct kernfs directory seeks Eric W. Biederman
  2 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-06-02 17:25 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

"Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com> writes:

> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
>
> To fix this issue, try to find a next node only when the returned
> object is less than or equal to the original one.
>
> Note that this implementation focuses on getting guarantee that the
> existing nodes are never skipped, not interested in the other nodes
> that are added or removed during the process.
>
> We found this issue during a stress test that repeatedly reads
> /sys/module directory to get a list of the currently loaded kernel
> modules while repeatedly loading and unloading kernel modules
> simultaneously.
>
> v2: Fix the case where nodes with the same hash but with the name
>     larger than the original node could still be skipped. Use
>     kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
>     description.
>

Semantically this looks like the right fix.

The deep bug is that kernfs_dir_pos can always advance the position,
and the code has never looked for or handled that case.  AKA just the
rbtree walk is enough to advance the position.

That said your implementation is buggy.  It is not safe to call
kernfs_sd_compare on orig as kernfs_put has already been called on orig
and thus orig may already be free.

I suggest moving the kernfs_put for orig into kernfs_fop_readdir,
just before the mutex_unlock calls.  That makes your kernfs_sd_compare
safe.

That would then allow moving the code to skip the next entry to also
be vmoed into kernfs_fop_readir.

Perhaps something like this:

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..2d0f71ffb539 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,13 +1584,12 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static struct kernfs_node *kernfs_dir_pos(const void *ns,
+static struct kernfs_node *kernfs_dir_pos(
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
 	if (pos) {
 		int valid = kernfs_active(pos) &&
 			pos->parent == parent && hash == pos->hash;
-		kernfs_put(pos);
 		if (!valid)
 			pos = NULL;
 	}
@@ -1607,8 +1606,14 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 				break;
 		}
 	}
-	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+	return pos;
+}
+
+static struct kernfs_node *kernfs_dir_next_pos(
+	struct kernfs_node *parent, ino_t ino, struct kernfs_node *start)
+{
+	struct kernfs_node *pos = kernfs_dir_pos(parent, ino, start);
+	if (pos && (kernfs_sd_compare(pos, start) <= 0)) {
 		struct rb_node *node = rb_next(&pos->rb);
 		if (!node)
 			pos = NULL;
@@ -1618,27 +1623,11 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	return pos;
 }
 
-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
-	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
-{
-	pos = kernfs_dir_pos(ns, parent, ino, pos);
-	if (pos) {
-		do {
-			struct rb_node *node = rb_next(&pos->rb);
-			if (!node)
-				pos = NULL;
-			else
-				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
-	}
-	return pos;
-}
-
 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file->f_path.dentry;
 	struct kernfs_node *parent = kernfs_dentry_node(dentry);
-	struct kernfs_node *pos = file->private_data;
+	struct kernfs_node *pos, *saved = file->private_data;
 	const void *ns = NULL;
 
 	if (!dir_emit_dots(file, ctx))
@@ -1648,23 +1637,30 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
 
-	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
+	for (pos = kernfs_dir_pos(parent, ctx->pos, saved);
 	     pos;
-	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
+	     pos = kernfs_dir_next_pos(parent, ctx->pos, pos)) {
 		const char *name = pos->name;
 		unsigned int type = dt_type(pos);
 		int len = strlen(name);
 		ino_t ino = pos->id.ino;
 
-		ctx->pos = pos->hash;
-		file->private_data = pos;
-		kernfs_get(pos);
+		/* Skip entries not fit for userspace */
+		if (!kernfs_active(pos) || !(pos->ns != ns))
+			continue;
+
+		kernfs_put(saved);
+		saved = pos;
+		ctx->pos = saved->hash;
+		file->private_data = saved;
+		kernfs_get(saved);
 
 		mutex_unlock(&kernfs_mutex);
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
 		mutex_lock(&kernfs_mutex);
 	}
+	kernfs_put(saved);
 	mutex_unlock(&kernfs_mutex);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-02 17:25 ` Eric W. Biederman
@ 2018-06-03 18:51   ` Eric W. Biederman
  2018-06-04  9:34     ` Hatayama, Daisuke
  0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-06-03 18:51 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'


Over the years two bugs have crept into the code that handled the last
returned kernfs directory being deleted, and handled general seeking
in kernfs directories.  The result is that reading the /sys/module
directory while moduled are loading and unloading will sometimes
skip over a module that was present for the entire duration of
the directory read.

The first bug is that kernfs_dir_next_pos was advancing the position
in the directory even when kernfs_dir_pos had found the starting
kernfs directory entry was not usable.  In this case kernfs_dir_pos is
guaranteed to return the directory entry past the starting directory
entry, making it so that advancing the directory position is wrong.

The second bug is that kernfs_dir_pos when the saved kernfs_node
did not validate, was not always returning a directory after
the the target position, but was sometimes returning the directory
entry just before the target position.

To correct these issues and to make the code easily readable and
comprehensible several cleanups are made.

- A function kernfs_dir_next is factored out to make it straight-forward
  to find the next node in a kernfs directory.

- A function kernfs_dir_skip_pos is factored out to skip over directory
  entries that should not be shown to userspace.  This function is called
  from kernfs_fop_readdir directly removing the complication of calling
  it from the other directory advancement functions.

- The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
  to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
  and ensuring the directory position advancment functions can reference
  the saved node without complications.

- The function kernfs_dir_next_pos is modified to only advance the directory
  position in the common case when kernfs_dir_pos validates and returns
  the starting kernfs node.  For all other cases kernfs_dir_pos is guaranteed
  to return a directory position in advance of the starting directory position.

- kernfs_dir_pos is given a significant make over.  It's essense remains the
  same but some siginficant changes were made.

  + The offset is validated to be a valid hash value at the very beginning.
    The validation is updated to handle the fact that neither 0 nor 1 are
    valid hash values.

  + An extra test is added at the end of the rbtree lookup to ensure
    the returned position is at or beyond the target position.

  + kernfs_name_compare is used during the rbtree lookup instead of just comparing
    the hash.  This allows the the passed namespace parameter to be used, and
    if it is available from the saved entry the old saved name as well.

  + The validation of the saved kernfs node now checks for two cases.
    If the saved node is returnable.
    If the name of the saved node is usable during lookup.

In net this should result in a easier to understand, and more correct
version of directory traversal for kernfs directories.

Reported-by: "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com>
Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Can you test this and please verify it fixes your issue?

 fs/kernfs/dir.c | 109 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 67 insertions(+), 42 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..8148b5fec48d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
+{
+	struct rb_node *node = rb_next(&pos->rb);
+	return node ? rb_to_kn(node) : NULL;
+}
+
 static struct kernfs_node *kernfs_dir_pos(const void *ns,
-	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
+	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
 {
-	if (pos) {
-		int valid = kernfs_active(pos) &&
-			pos->parent == parent && hash == pos->hash;
-		kernfs_put(pos);
-		if (!valid)
-			pos = NULL;
-	}
-	if (!pos && (hash > 1) && (hash < INT_MAX)) {
-		struct rb_node *node = parent->dir.children.rb_node;
-		while (node) {
-			pos = rb_to_kn(node);
-
-			if (hash < pos->hash)
-				node = node->rb_left;
-			else if (hash > pos->hash)
-				node = node->rb_right;
-			else
-				break;
+	struct kernfs_node *pos;
+	struct rb_node *node;
+	unsigned int hash;
+	const char *name = "";
+
+	/* Is off a valid name hash? */
+	if ((off < 2) || (off >= INT_MAX))
+		return NULL;
+	hash = off;
+
+	/* Is the saved position usable? */
+	if (saved) {
+		/* Proper parent and hash? */
+		if ((parent != saved->parent) || (saved->hash != hash)) {
+			saved = NULL;
+		} else {
+			if (kernfs_active(saved))
+				return saved;
+			name = saved->name;
 		}
 	}
-	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
-		struct rb_node *node = rb_next(&pos->rb);
-		if (!node)
-			pos = NULL;
+
+	/* Find the closest pos to the hash we are looking for */
+	pos = NULL;
+	node = parent->dir.children.rb_node;
+	while (node) {
+		int result;
+
+		pos = rb_to_kn(node);
+		result = kernfs_name_compare(hash, name, ns, pos);
+		if (result < 0)
+			node = node->rb_left;
+		else if (result > 0)
+			node = node->rb_right;
 		else
-			pos = rb_to_kn(node);
+			break;
 	}
+
+	/* Ensure pos is at or beyond the target position */
+	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
+		pos = kernfs_dir_next(pos);
+
 	return pos;
 }
 
 static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
-	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
+	struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
 {
-	pos = kernfs_dir_pos(ns, parent, ino, pos);
-	if (pos) {
-		do {
-			struct rb_node *node = rb_next(&pos->rb);
-			if (!node)
-				pos = NULL;
-			else
-				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
-	}
+	struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
+	if (likely(pos == start))
+		pos = kernfs_dir_next(pos);
+	return pos;
+}
+
+static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
+					       struct kernfs_node *pos)
+{
+	/* Skip entries we don't want to return to userspace */
+	while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
+		pos = kernfs_dir_next(pos);
 	return pos;
 }
 
@@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct dentry *dentry = file->f_path.dentry;
 	struct kernfs_node *parent = kernfs_dentry_node(dentry);
-	struct kernfs_node *pos = file->private_data;
+	struct kernfs_node *pos, *saved = file->private_data;
 	const void *ns = NULL;
 
 	if (!dir_emit_dots(file, ctx))
@@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
 
-	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
-	     pos;
+	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
+	     (pos = kernfs_dir_skip_pos(ns, pos));
 	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
 		const char *name = pos->name;
 		unsigned int type = dt_type(pos);
 		int len = strlen(name);
 		ino_t ino = pos->id.ino;
 
-		ctx->pos = pos->hash;
-		file->private_data = pos;
-		kernfs_get(pos);
+		kernfs_put(saved);
+		saved = pos;
+		ctx->pos = saved->hash;
+		file->private_data = saved;
+		kernfs_get(saved);
 
 		mutex_unlock(&kernfs_mutex);
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
 		mutex_lock(&kernfs_mutex);
 	}
+	kernfs_put(saved);
 	mutex_unlock(&kernfs_mutex);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-03 18:51   ` [CFT][PATCH] kernfs: Correct kernfs directory seeks Eric W. Biederman
@ 2018-06-04  9:34     ` Hatayama, Daisuke
  2018-06-04 14:44       ` Eric W. Biederman
  0 siblings, 1 reply; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-06-04  9:34 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

Hello,

Thanks for this patch, Eric.

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Monday, June 4, 2018 3:51 AM
> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>;
> 'tj@kernel.org' <tj@kernel.org>; Okajima, Toshiyuki 
> <toshi.okajima@jp.fujitsu.com>; linux-kernel@vger.kernel.org;
> 'ebiederm@aristanetworks.com' <ebiederm@aristanetworks.com>
> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> 
> Over the years two bugs have crept into the code that handled the last
> returned kernfs directory being deleted, and handled general seeking
> in kernfs directories.  The result is that reading the /sys/module
> directory while moduled are loading and unloading will sometimes
> skip over a module that was present for the entire duration of
> the directory read.
> 
> The first bug is that kernfs_dir_next_pos was advancing the position
> in the directory even when kernfs_dir_pos had found the starting
> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
> guaranteed to return the directory entry past the starting directory
> entry, making it so that advancing the directory position is wrong.
> 
> The second bug is that kernfs_dir_pos when the saved kernfs_node
> did not validate, was not always returning a directory after
> the the target position, but was sometimes returning the directory
> entry just before the target position.
> 
> To correct these issues and to make the code easily readable and
> comprehensible several cleanups are made.
> 
> - A function kernfs_dir_next is factored out to make it straight-forward
>   to find the next node in a kernfs directory.
> 
> - A function kernfs_dir_skip_pos is factored out to skip over directory
>   entries that should not be shown to userspace.  This function is called
>   from kernfs_fop_readdir directly removing the complication of calling
>   it from the other directory advancement functions.
> 
> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
>   and ensuring the directory position advancment functions can reference
>   the saved node without complications.
> 
> - The function kernfs_dir_next_pos is modified to only advance the directory
>   position in the common case when kernfs_dir_pos validates and returns
>   the starting kernfs node.  For all other cases kernfs_dir_pos is guaranteed
>   to return a directory position in advance of the starting directory position.
> 
> - kernfs_dir_pos is given a significant make over.  It's essense remains the
>   same but some siginficant changes were made.
> 
>   + The offset is validated to be a valid hash value at the very beginning.
>     The validation is updated to handle the fact that neither 0 nor 1 are
>     valid hash values.
> 
>   + An extra test is added at the end of the rbtree lookup to ensure
>     the returned position is at or beyond the target position.
> 
>   + kernfs_name_compare is used during the rbtree lookup instead of just
> comparing
>     the hash.  This allows the the passed namespace parameter to be used, and
>     if it is available from the saved entry the old saved name as well.
> 
>   + The validation of the saved kernfs node now checks for two cases.
>     If the saved node is returnable.
>     If the name of the saved node is usable during lookup.
> 
> In net this should result in a easier to understand, and more correct
> version of directory traversal for kernfs directories.
>
> Reported-by: "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com>
> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
> scalability v2")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> Can you test this and please verify it fixes your issue?
>

I tried this patch on top of v4.17 but the system fails to boot
without detecting root disks by dracut like this:

    [  196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  199.945999] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  200.490281] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  201.071177] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
    [  201.074958] dracut-initqueue[324]: Warning: Could not boot.
	     Starting Setup Virtual Console...
    [  OK  ] Started Setup Virtual Console.
    [  201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
	     [  201.256378] audit: type=1131 audit(1528132266.260:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
    Starting Dracut Emergency Shell...
    Warning: /dev/fedora/root does not exist
    Warning: /dev/fedora/swap does not exist
    Warning: /dev/mapper/fedora-root does not exist

    Generating "/run/initramfs/rdsosreport.txt"


    Entering emergency mode. Exit the shell to continue.
    Type "journalctl" to view system logs.
    You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
    after mounting them and attach it to a bug report.


    dracut:/# ls /sys/module
    8139cp        dm_bufio        kernel       psmouse      ttm
    8139too       dm_mirror       keyboard     pstore       uhci_hcd
    8250          dm_mod          kgdboc       qemu_fw_cfg  usbcore
    acpi          dm_snapshot     kgdbts       qxl          usbhid
    acpiphp       drm             libahci      random       uv_nmi
    ahci          drm_kms_helper  libata       rcupdate     virtio
    ata_generic   dynamic_debug   md_mod       rcutree      virtio_console
    ata_piix      edac_core       mii          rng_core     virtio_pci
    battery       ehci_hcd        module       scsi_mod     virtio_ring
    block         fb              mousedev     serio_raw    vt
    button        firmware_class  pata_acpi    sg           watchdog
    configfs      fscrypto        pci_hotplug  slab_common  workqueue
    cpufreq       hid             pcie_aspm    spurious     xhci_hcd
    cpuidle       hid_magicmouse  pciehp       sr_mod       xz_dec
    crc32c_intel  hid_ntrig       pcmcia       srcutree     zswap
    cryptomgr     i8042           pcmcia_core  suspend
    debug_core    intel_idle      printk       sysrq
    devres        ipv6            processor    tcp_cubic
    dracut:/# lsmod
    Module                  Size  Used by
    qxl                    77824  1
    drm_kms_helper        196608  1
    ttm                   126976  1
    drm                   454656  4 qxl,ttm
    virtio_console         32768  0
    ata_generic            16384  0
    crc32c_intel           24576  0
    8139too                40960  0
    virtio_pci             28672  0
    8139cp                 32768  0
    virtio_ring            28672  2 virtio_pci
    serio_raw              16384  0
    pata_acpi              16384  0
    virtio                 16384  2 virtio_pci
    mii                    16384  2 8139too
    qemu_fw_cfg            16384  0
    dracut:/#

OTOH, there's no issue on the pure v4.17 kernel.

As above, ls /sys/module looks apparently good. But I guess any part of
behavior of getdentries() on sysfs must have changed, affecting the disk
detection...

and,

>  fs/kernfs/dir.c | 109
> ++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 67 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc19340b..8148b5fec48d 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
>  	return 0;
>  }
> 
> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> +{
> +	struct rb_node *node = rb_next(&pos->rb);
> +	return node ? rb_to_kn(node) : NULL;
> +}
> +
>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> -	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>  {
> -	if (pos) {
> -		int valid = kernfs_active(pos) &&
> -			pos->parent == parent && hash == pos->hash;
> -		kernfs_put(pos);
> -		if (!valid)
> -			pos = NULL;
> -	}
> -	if (!pos && (hash > 1) && (hash < INT_MAX)) {
> -		struct rb_node *node = parent->dir.children.rb_node;
> -		while (node) {
> -			pos = rb_to_kn(node);
> -
> -			if (hash < pos->hash)
> -				node = node->rb_left;
> -			else if (hash > pos->hash)
> -				node = node->rb_right;
> -			else
> -				break;
> +	struct kernfs_node *pos;
> +	struct rb_node *node;
> +	unsigned int hash;
> +	const char *name = "";
> +
> +	/* Is off a valid name hash? */
> +	if ((off < 2) || (off >= INT_MAX))
> +		return NULL;
> +	hash = off;
> +
> +	/* Is the saved position usable? */
> +	if (saved) {
> +		/* Proper parent and hash? */
> +		if ((parent != saved->parent) || (saved->hash != hash)) {
> +			saved = NULL;

name is uninitialized in this path.

> +		} else {
> +			if (kernfs_active(saved))
> +				return saved;
> +			name = saved->name;
>  		}
>  	}
> -	/* Skip over entries which are dying/dead or in the wrong namespace
> */
> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> -		struct rb_node *node = rb_next(&pos->rb);
> -		if (!node)
> -			pos = NULL;
> +
> +	/* Find the closest pos to the hash we are looking for */
> +	pos = NULL;
> +	node = parent->dir.children.rb_node;
> +	while (node) {
> +		int result;
> +
> +		pos = rb_to_kn(node);
> +		result = kernfs_name_compare(hash, name, ns, pos);
> +		if (result < 0)
> +			node = node->rb_left;
> +		else if (result > 0)
> +			node = node->rb_right;
>  		else
> -			pos = rb_to_kn(node);
> +			break;
>  	}
> +
> +	/* Ensure pos is at or beyond the target position */
> +	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> +		pos = kernfs_dir_next(pos);
> +

Why not trying to find the final target object here?

Looking at the code, I think the operation needed here is to find the
smallest active object in the same namespace from the objects larger
than the saved object given as argument. The saved object appears to
be used as cache. I think dividing the process into kernfs_dir_pos()
is not necessary.

I mean like this:

# git diff
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8148b5f..eeffc8c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)

        /* Is the saved position usable? */
        if (saved) {
+               name = saved->name;
                /* Proper parent and hash? */
                if ((parent != saved->parent) || (saved->hash != hash)) {
                        saved = NULL;
-               } else {
-                       if (kernfs_active(saved))
-                               return saved;
-                       name = saved->name;
+               } else if (kernfs_active(saved)) {
+                       pos = saved;
+                       goto skip;
                }
        }

@@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
                        break;
        }

+skip:
        /* Ensure pos is at or beyond the target position */
-       if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
+       if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
                pos = kernfs_dir_next(pos);

        return pos;
 }

-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
-       struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
-{
-       struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
-       if (likely(pos == start))
-               pos = kernfs_dir_next(pos);
-       return pos;
-}
-
 static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
                                               struct kernfs_node *pos)
 {
@@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)

        for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
             (pos = kernfs_dir_skip_pos(ns, pos));
-            pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
+            pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
                const char *name = pos->name;
                unsigned int type = dt_type(pos);
                int len = strlen(name);

>  	return pos;
>  }
> 
>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> -	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
>  {
> -	pos = kernfs_dir_pos(ns, parent, ino, pos);
> -	if (pos) {
> -		do {
> -			struct rb_node *node = rb_next(&pos->rb);
> -			if (!node)
> -				pos = NULL;
> -			else
> -				pos = rb_to_kn(node);
> -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
> -	}
> +	struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> +	if (likely(pos == start))
> +		pos = kernfs_dir_next(pos);
> +	return pos;
> +}
> +
> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> +					       struct kernfs_node *pos)
> +{
> +	/* Skip entries we don't want to return to userspace */
> +	while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
> +		pos = kernfs_dir_next(pos);
>  	return pos;
>  }
> 
> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct
> dir_context *ctx)
>  {
>  	struct dentry *dentry = file->f_path.dentry;
>  	struct kernfs_node *parent = kernfs_dentry_node(dentry);
> -	struct kernfs_node *pos = file->private_data;
> +	struct kernfs_node *pos, *saved = file->private_data;
>  	const void *ns = NULL;
> 
>  	if (!dir_emit_dots(file, ctx))
> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
> struct dir_context *ctx)
>  	if (kernfs_ns_enabled(parent))
>  		ns = kernfs_info(dentry->d_sb)->ns;
> 
> -	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
> -	     pos;
> +	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> +	     (pos = kernfs_dir_skip_pos(ns, pos));
>  	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
>  		const char *name = pos->name;
>  		unsigned int type = dt_type(pos);
>  		int len = strlen(name);
>  		ino_t ino = pos->id.ino;
> 
> -		ctx->pos = pos->hash;
> -		file->private_data = pos;
> -		kernfs_get(pos);
> +		kernfs_put(saved);
> +		saved = pos;
> +		ctx->pos = saved->hash;
> +		file->private_data = saved;
> +		kernfs_get(saved);
> 
>  		mutex_unlock(&kernfs_mutex);
>  		if (!dir_emit(ctx, name, len, ino, type))
>  			return 0;
>  		mutex_lock(&kernfs_mutex);
>  	}
> +	kernfs_put(saved);
>  	mutex_unlock(&kernfs_mutex);
>  	file->private_data = NULL;
>  	ctx->pos = INT_MAX;
> --
> 2.14.1
> 
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
  2018-06-01 17:07     ` 'tj@kernel.org'
@ 2018-06-04  9:46       ` Hatayama, Daisuke
  0 siblings, 0 replies; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-06-04  9:46 UTC (permalink / raw)
  To: 'tj@kernel.org'
  Cc: 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'



> -----Original Message-----
> From: Tejun Heo [mailto:htejun@gmail.com] On Behalf Of 'tj@kernel.org'
> Sent: Saturday, June 2, 2018 2:07 AM
> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>; Okajima,
> Toshiyuki <toshi.okajima@jp.fujitsu.com>;
> linux-kernel@vger.kernel.org; 'ebiederm@aristanetworks.com'
> <ebiederm@aristanetworks.com>
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
> 
> Hello,
> 
> On Fri, Jun 01, 2018 at 09:25:32AM +0000, Hatayama, Daisuke wrote:
> > kernfs_dir_pos() checks if a kernfs_node object given as one of its
> > arguments is still active and if so returns it, or returns a
> > kernfs_node object that is most equal (possibly smaller and larger) to
> > the given object.
> 
> Sometimes they're duplicate operations tho, which is exactly the bug
> the posted patch is trying to fix.  What I'm suggesting is instead of
> leaving both instances and skipping one conditionally, put them in one
> place and trigger only when necessary.  The sequence of operations
> would be exactly the same.  The only difference is how the code is
> organized.
> 

I see and I think Eric's patch is written as you suggest very well.

> > kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> > object given by kernfs_dir_pos().
> >
> > Two functions does different things and both need to skip inactive
> > nodes. I don't think it natural to remove the skip only from
> > kernfs_dir_pos().
> >
> > OTOH, throughout getdents(), there is no case that the kernfs_node
> > object given to kernfs_dir_pos() is used afterwards in the
> > processing. So, is it enough to provide kernfs_dir_next_pos() only?
> > Then, the skip code is now not duplicated.
> >
> > The patch below is my thought. How do you think?
> >
> > But note that this patch has some bug so that system boot get hang
> > without detecting root filesystem disk :) I'm debugging this now.
> 
> I haven't looked into the code that closely but given that we had
> cases where both skippings were fine and not, the condition is likely
> gonna be a bit tricker?

I agree to this version looks a bit tricker. But I think once the skipping
code is separated as Eric's patch, it would be resolved naturally.

> 
> Thanks.
> 
> --
> tejun
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-04  9:34     ` Hatayama, Daisuke
@ 2018-06-04 14:44       ` Eric W. Biederman
  2018-06-05  2:02         ` Eric W. Biederman
  2018-06-05  5:45         ` Hatayama, Daisuke
  0 siblings, 2 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-06-04 14:44 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

"Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com> writes:

> Hello,
>
> Thanks for this patch, Eric.
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
>> Sent: Monday, June 4, 2018 3:51 AM
>> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
>> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>;
>> 'tj@kernel.org' <tj@kernel.org>; Okajima, Toshiyuki 
>> <toshi.okajima@jp.fujitsu.com>; linux-kernel@vger.kernel.org;
>> 'ebiederm@aristanetworks.com' <ebiederm@aristanetworks.com>
>> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>> 
>> 
>> Over the years two bugs have crept into the code that handled the last
>> returned kernfs directory being deleted, and handled general seeking
>> in kernfs directories.  The result is that reading the /sys/module
>> directory while moduled are loading and unloading will sometimes
>> skip over a module that was present for the entire duration of
>> the directory read.
>> 
>> The first bug is that kernfs_dir_next_pos was advancing the position
>> in the directory even when kernfs_dir_pos had found the starting
>> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
>> guaranteed to return the directory entry past the starting directory
>> entry, making it so that advancing the directory position is wrong.
>> 
>> The second bug is that kernfs_dir_pos when the saved kernfs_node
>> did not validate, was not always returning a directory after
>> the the target position, but was sometimes returning the directory
>> entry just before the target position.
>> 
>> To correct these issues and to make the code easily readable and
>> comprehensible several cleanups are made.
>> 
>> - A function kernfs_dir_next is factored out to make it straight-forward
>>   to find the next node in a kernfs directory.
>> 
>> - A function kernfs_dir_skip_pos is factored out to skip over directory
>>   entries that should not be shown to userspace.  This function is called
>>   from kernfs_fop_readdir directly removing the complication of calling
>>   it from the other directory advancement functions.
>> 
>> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
>>   and ensuring the directory position advancment functions can reference
>>   the saved node without complications.
>> 
>> - The function kernfs_dir_next_pos is modified to only advance the directory
>>   position in the common case when kernfs_dir_pos validates and returns
>>   the starting kernfs node.  For all other cases kernfs_dir_pos is guaranteed
>>   to return a directory position in advance of the starting directory position.
>> 
>> - kernfs_dir_pos is given a significant make over.  It's essense remains the
>>   same but some siginficant changes were made.
>> 
>>   + The offset is validated to be a valid hash value at the very beginning.
>>     The validation is updated to handle the fact that neither 0 nor 1 are
>>     valid hash values.
>> 
>>   + An extra test is added at the end of the rbtree lookup to ensure
>>     the returned position is at or beyond the target position.
>> 
>>   + kernfs_name_compare is used during the rbtree lookup instead of just
>> comparing
>>     the hash.  This allows the the passed namespace parameter to be used, and
>>     if it is available from the saved entry the old saved name as well.
>> 
>>   + The validation of the saved kernfs node now checks for two cases.
>>     If the saved node is returnable.
>>     If the name of the saved node is usable during lookup.
>> 
>> In net this should result in a easier to understand, and more correct
>> version of directory traversal for kernfs directories.
>>
>> Reported-by: "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com>
>> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
>> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
>> scalability v2")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>> 
>> Can you test this and please verify it fixes your issue?
>>
>
> I tried this patch on top of v4.17 but the system fails to boot
> without detecting root disks by dracut like this:
>
>     [  196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  199.945999] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  200.490281] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  201.071177] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  201.074958] dracut-initqueue[324]: Warning: Could not boot.
> 	     Starting Setup Virtual Console...
>     [  OK  ] Started Setup Virtual Console.
>     [  201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> 	     [  201.256378] audit: type=1131 audit(1528132266.260:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
>     Starting Dracut Emergency Shell...
>     Warning: /dev/fedora/root does not exist
>     Warning: /dev/fedora/swap does not exist
>     Warning: /dev/mapper/fedora-root does not exist
>
>     Generating "/run/initramfs/rdsosreport.txt"
>
>
>     Entering emergency mode. Exit the shell to continue.
>     Type "journalctl" to view system logs.
>     You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
>     after mounting them and attach it to a bug report.
>
>
>     dracut:/# ls /sys/module
>     8139cp        dm_bufio        kernel       psmouse      ttm
>     8139too       dm_mirror       keyboard     pstore       uhci_hcd
>     8250          dm_mod          kgdboc       qemu_fw_cfg  usbcore
>     acpi          dm_snapshot     kgdbts       qxl          usbhid
>     acpiphp       drm             libahci      random       uv_nmi
>     ahci          drm_kms_helper  libata       rcupdate     virtio
>     ata_generic   dynamic_debug   md_mod       rcutree      virtio_console
>     ata_piix      edac_core       mii          rng_core     virtio_pci
>     battery       ehci_hcd        module       scsi_mod     virtio_ring
>     block         fb              mousedev     serio_raw    vt
>     button        firmware_class  pata_acpi    sg           watchdog
>     configfs      fscrypto        pci_hotplug  slab_common  workqueue
>     cpufreq       hid             pcie_aspm    spurious     xhci_hcd
>     cpuidle       hid_magicmouse  pciehp       sr_mod       xz_dec
>     crc32c_intel  hid_ntrig       pcmcia       srcutree     zswap
>     cryptomgr     i8042           pcmcia_core  suspend
>     debug_core    intel_idle      printk       sysrq
>     devres        ipv6            processor    tcp_cubic
>     dracut:/# lsmod
>     Module                  Size  Used by
>     qxl                    77824  1
>     drm_kms_helper        196608  1
>     ttm                   126976  1
>     drm                   454656  4 qxl,ttm
>     virtio_console         32768  0
>     ata_generic            16384  0
>     crc32c_intel           24576  0
>     8139too                40960  0
>     virtio_pci             28672  0
>     8139cp                 32768  0
>     virtio_ring            28672  2 virtio_pci
>     serio_raw              16384  0
>     pata_acpi              16384  0
>     virtio                 16384  2 virtio_pci
>     mii                    16384  2 8139too
>     qemu_fw_cfg            16384  0
>     dracut:/#
>
> OTOH, there's no issue on the pure v4.17 kernel.
>
> As above, ls /sys/module looks apparently good. But I guess any part of
> behavior of getdentries() on sysfs must have changed, affecting the disk
> detection...

I haven't been able to reproduce this yet.  My test system boots fine.
Which fedora are you testing on?


>>  fs/kernfs/dir.c | 109
>> ++++++++++++++++++++++++++++++++++----------------------
>>  1 file changed, 67 insertions(+), 42 deletions(-)
>> 
>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> index 89d1dc19340b..8148b5fec48d 100644
>> --- a/fs/kernfs/dir.c
>> +++ b/fs/kernfs/dir.c
>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
>> struct file *filp)
>>  	return 0;
>>  }
>> 
>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
>> +{
>> +	struct rb_node *node = rb_next(&pos->rb);
>> +	return node ? rb_to_kn(node) : NULL;
>> +}
>> +
>>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
>> -	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>>  {
>> -	if (pos) {
>> -		int valid = kernfs_active(pos) &&
>> -			pos->parent == parent && hash == pos->hash;
>> -		kernfs_put(pos);
>> -		if (!valid)
>> -			pos = NULL;
>> -	}
>> -	if (!pos && (hash > 1) && (hash < INT_MAX)) {
>> -		struct rb_node *node = parent->dir.children.rb_node;
>> -		while (node) {
>> -			pos = rb_to_kn(node);
>> -
>> -			if (hash < pos->hash)
>> -				node = node->rb_left;
>> -			else if (hash > pos->hash)
>> -				node = node->rb_right;
>> -			else
>> -				break;
>> +	struct kernfs_node *pos;
>> +	struct rb_node *node;
>> +	unsigned int hash;
>> +	const char *name = "";
>> +
>> +	/* Is off a valid name hash? */
>> +	if ((off < 2) || (off >= INT_MAX))
>> +		return NULL;
>> +	hash = off;
>> +
>> +	/* Is the saved position usable? */
>> +	if (saved) {
>> +		/* Proper parent and hash? */
>> +		if ((parent != saved->parent) || (saved->hash != hash)) {
>> +			saved = NULL;
>
> name is uninitialized in this path.

It is.  name is initialized to "" see above.

>> +		} else {
>> +			if (kernfs_active(saved))
>> +				return saved;
>> +			name = saved->name;
>>  		}
>>  	}
>> -	/* Skip over entries which are dying/dead or in the wrong namespace
>> */
>> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
>> -		struct rb_node *node = rb_next(&pos->rb);
>> -		if (!node)
>> -			pos = NULL;
>> +
>> +	/* Find the closest pos to the hash we are looking for */
>> +	pos = NULL;
>> +	node = parent->dir.children.rb_node;
>> +	while (node) {
>> +		int result;
>> +
>> +		pos = rb_to_kn(node);
>> +		result = kernfs_name_compare(hash, name, ns, pos);
>> +		if (result < 0)
>> +			node = node->rb_left;
>> +		else if (result > 0)
>> +			node = node->rb_right;
>>  		else
>> -			pos = rb_to_kn(node);
>> +			break;
>>  	}
>> +
>> +	/* Ensure pos is at or beyond the target position */
>> +	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
>> +		pos = kernfs_dir_next(pos);
>> +
>
> Why not trying to find the final target object here?
>
> Looking at the code, I think the operation needed here is to find the
> smallest active object in the same namespace from the objects larger
> than the saved object given as argument. The saved object appears to
> be used as cache. I think dividing the process into kernfs_dir_pos()
> is not necessary.

What you are suggesting below is wrong when restarting readdir.

Ordinarily saved is the next entry we are going to return from readdir.
When dir_emit does not succeed we stop returnning entries and when
we restart readdir we need to attempt dir_emit on saved again.

Always advancing past saved will cause us to skip saved in the
event a single readdir is not enough.

The restart (kernfs_dir_pos) must return saved or if saved is now gone,
the first entry past saved.

Saved dying in the middle is what I believe caused the original issue.

Making all of this clear is part of the reason I moved the
kernfs_dir_skip_pos logic into it's own function.

Eric

>
> I mean like this:
>
> # git diff
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 8148b5f..eeffc8c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>
>         /* Is the saved position usable? */
>         if (saved) {
> +               name = saved->name;
>                 /* Proper parent and hash? */
>                 if ((parent != saved->parent) || (saved->hash != hash)) {
>                         saved = NULL;
> -               } else {
> -                       if (kernfs_active(saved))
> -                               return saved;
> -                       name = saved->name;
> +               } else if (kernfs_active(saved)) {
> +                       pos = saved;
> +                       goto skip;
>                 }
>         }
>
> @@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>                         break;
>         }
>
> +skip:
>         /* Ensure pos is at or beyond the target position */
> -       if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> +       if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
>                 pos = kernfs_dir_next(pos);
>
>         return pos;
>  }
>
> -static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> -       struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> -{
> -       struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> -       if (likely(pos == start))
> -               pos = kernfs_dir_next(pos);
> -       return pos;
> -}
> -
>  static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
>                                                struct kernfs_node *pos)
>  {
> @@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>
>         for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
>              (pos = kernfs_dir_skip_pos(ns, pos));
> -            pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> +            pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
>                 const char *name = pos->name;
>                 unsigned int type = dt_type(pos);
>                 int len = strlen(name);
>
>>  	return pos;
>>  }
>> 
>>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>> -	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
>>  {
>> -	pos = kernfs_dir_pos(ns, parent, ino, pos);
>> -	if (pos) {
>> -		do {
>> -			struct rb_node *node = rb_next(&pos->rb);
>> -			if (!node)
>> -				pos = NULL;
>> -			else
>> -				pos = rb_to_kn(node);
>> -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
>> -	}
>> +	struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
>> +	if (likely(pos == start))
>> +		pos = kernfs_dir_next(pos);
>> +	return pos;
>> +}
>> +
>> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
>> +					       struct kernfs_node *pos)
>> +{
>> +	/* Skip entries we don't want to return to userspace */
>> +	while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
>> +		pos = kernfs_dir_next(pos);
>>  	return pos;
>>  }
>> 
>> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct
>> dir_context *ctx)
>>  {
>>  	struct dentry *dentry = file->f_path.dentry;
>>  	struct kernfs_node *parent = kernfs_dentry_node(dentry);
>> -	struct kernfs_node *pos = file->private_data;
>> +	struct kernfs_node *pos, *saved = file->private_data;
>>  	const void *ns = NULL;
>> 
>>  	if (!dir_emit_dots(file, ctx))
>> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
>> struct dir_context *ctx)
>>  	if (kernfs_ns_enabled(parent))
>>  		ns = kernfs_info(dentry->d_sb)->ns;
>> 
>> -	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
>> -	     pos;
>> +	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
>> +	     (pos = kernfs_dir_skip_pos(ns, pos));
>>  	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
>>  		const char *name = pos->name;
>>  		unsigned int type = dt_type(pos);
>>  		int len = strlen(name);
>>  		ino_t ino = pos->id.ino;
>> 
>> -		ctx->pos = pos->hash;
>> -		file->private_data = pos;
>> -		kernfs_get(pos);
>> +		kernfs_put(saved);
>> +		saved = pos;
>> +		ctx->pos = saved->hash;
>> +		file->private_data = saved;
>> +		kernfs_get(saved);
>> 
>>  		mutex_unlock(&kernfs_mutex);
>>  		if (!dir_emit(ctx, name, len, ino, type))
>>  			return 0;
>>  		mutex_lock(&kernfs_mutex);
>>  	}
>> +	kernfs_put(saved);
>>  	mutex_unlock(&kernfs_mutex);
>>  	file->private_data = NULL;
>>  	ctx->pos = INT_MAX;
>> --
>> 2.14.1
>> 
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-04 14:44       ` Eric W. Biederman
@ 2018-06-05  2:02         ` Eric W. Biederman
  2018-06-05  5:52           ` Hatayama, Daisuke
  2018-06-05  5:45         ` Hatayama, Daisuke
  1 sibling, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-06-05  2:02 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

ebiederm@xmission.com (Eric W. Biederman) writes:

> "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com> writes:
>
>>> Can you test this and please verify it fixes your issue?
>>
>> I tried this patch on top of v4.17 but the system fails to boot
>> without detecting root disks by dracut like this:
[snip]

>> OTOH, there's no issue on the pure v4.17 kernel.
>>
>> As above, ls /sys/module looks apparently good. But I guess any part of
>> behavior of getdentries() on sysfs must have changed, affecting the disk
>> detection...
>
> I haven't been able to reproduce this yet.  My test system boots fine.
> Which fedora are you testing on?

I reproduced something similar and fedora 28.  So I think I have found
and fixed the issue.  I believe I simply reversed the test at the end of
kernfs_dir_pos. AKA "<" instead of ">".

I am going to see if I can test my changes more throughly on this side
and then repost.



>>>  fs/kernfs/dir.c | 109
>>> ++++++++++++++++++++++++++++++++++----------------------
>>>  1 file changed, 67 insertions(+), 42 deletions(-)
>>> 
>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>> index 89d1dc19340b..8148b5fec48d 100644
>>> --- a/fs/kernfs/dir.c
>>> +++ b/fs/kernfs/dir.c
>>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
>>> struct file *filp)
>>>  	return 0;
>>>  }
>>> 
>>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
>>> +{
>>> +	struct rb_node *node = rb_next(&pos->rb);
>>> +	return node ? rb_to_kn(node) : NULL;
>>> +}
>>> +
>>>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
>>> -	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>>> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>>>  {
>>> -	if (pos) {
>>> -		int valid = kernfs_active(pos) &&
>>> -			pos->parent == parent && hash == pos->hash;
>>> -		kernfs_put(pos);
>>> -		if (!valid)
>>> -			pos = NULL;
>>> -	}
>>> -	if (!pos && (hash > 1) && (hash < INT_MAX)) {
>>> -		struct rb_node *node = parent->dir.children.rb_node;
>>> -		while (node) {
>>> -			pos = rb_to_kn(node);
>>> -
>>> -			if (hash < pos->hash)
>>> -				node = node->rb_left;
>>> -			else if (hash > pos->hash)
>>> -				node = node->rb_right;
>>> -			else
>>> -				break;
>>> +	struct kernfs_node *pos;
>>> +	struct rb_node *node;
>>> +	unsigned int hash;
>>> +	const char *name = "";
>>> +
>>> +	/* Is off a valid name hash? */
>>> +	if ((off < 2) || (off >= INT_MAX))
>>> +		return NULL;
>>> +	hash = off;
>>> +
>>> +	/* Is the saved position usable? */
>>> +	if (saved) {
>>> +		/* Proper parent and hash? */
>>> +		if ((parent != saved->parent) || (saved->hash != hash)) {
>>> +			saved = NULL;
>>
>> name is uninitialized in this path.
>
> It is.  name is initialized to "" see above.
>
>>> +		} else {
>>> +			if (kernfs_active(saved))
>>> +				return saved;
>>> +			name = saved->name;
>>>  		}
>>>  	}
>>> -	/* Skip over entries which are dying/dead or in the wrong namespace
>>> */
>>> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
>>> -		struct rb_node *node = rb_next(&pos->rb);
>>> -		if (!node)
>>> -			pos = NULL;
>>> +
>>> +	/* Find the closest pos to the hash we are looking for */
>>> +	pos = NULL;
>>> +	node = parent->dir.children.rb_node;
>>> +	while (node) {
>>> +		int result;
>>> +
>>> +		pos = rb_to_kn(node);
>>> +		result = kernfs_name_compare(hash, name, ns, pos);
>>> +		if (result < 0)
>>> +			node = node->rb_left;
>>> +		else if (result > 0)
>>> +			node = node->rb_right;
>>>  		else
>>> -			pos = rb_to_kn(node);
>>> +			break;
>>>  	}
>>> +
>>> +	/* Ensure pos is at or beyond the target position */
>>> +	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
                                                    ^^^^^^^^^^^^^^^^
                                          should be > 0
>>> +		pos = kernfs_dir_next(pos);
>>> +

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-04 14:44       ` Eric W. Biederman
  2018-06-05  2:02         ` Eric W. Biederman
@ 2018-06-05  5:45         ` Hatayama, Daisuke
  2018-06-05 15:31           ` Eric W. Biederman
  1 sibling, 1 reply; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-06-05  5:45 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'



> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Monday, June 4, 2018 11:45 PM
> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>;
> 'tj@kernel.org' <tj@kernel.org>; Okajima, Toshiyuki 
> <toshi.okajima@jp.fujitsu.com>; linux-kernel@vger.kernel.org;
> 'ebiederm@aristanetworks.com' <ebiederm@aristanetworks.com>
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com> writes:
> 
> > Hello,
> >
> > Thanks for this patch, Eric.
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> >> Sent: Monday, June 4, 2018 3:51 AM
> >> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
> >> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>;
> >> 'tj@kernel.org' <tj@kernel.org>; Okajima, Toshiyuki
> >> <toshi.okajima@jp.fujitsu.com>; linux-kernel@vger.kernel.org;
> >> 'ebiederm@aristanetworks.com' <ebiederm@aristanetworks.com>
> >> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> >>
> >>
> >> Over the years two bugs have crept into the code that handled the last
> >> returned kernfs directory being deleted, and handled general seeking
> >> in kernfs directories.  The result is that reading the /sys/module
> >> directory while moduled are loading and unloading will sometimes
> >> skip over a module that was present for the entire duration of
> >> the directory read.
> >>
> >> The first bug is that kernfs_dir_next_pos was advancing the position
> >> in the directory even when kernfs_dir_pos had found the starting
> >> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
> >> guaranteed to return the directory entry past the starting directory
> >> entry, making it so that advancing the directory position is wrong.
> >>
> >> The second bug is that kernfs_dir_pos when the saved kernfs_node
> >> did not validate, was not always returning a directory after
> >> the the target position, but was sometimes returning the directory
> >> entry just before the target position.
> >>
> >> To correct these issues and to make the code easily readable and
> >> comprehensible several cleanups are made.
> >>
> >> - A function kernfs_dir_next is factored out to make it straight-forward
> >>   to find the next node in a kernfs directory.
> >>
> >> - A function kernfs_dir_skip_pos is factored out to skip over directory
> >>   entries that should not be shown to userspace.  This function is called
> >>   from kernfs_fop_readdir directly removing the complication of calling
> >>   it from the other directory advancement functions.
> >>
> >> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
> >>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
> >>   and ensuring the directory position advancment functions can reference
> >>   the saved node without complications.
> >>
> >> - The function kernfs_dir_next_pos is modified to only advance the directory
> >>   position in the common case when kernfs_dir_pos validates and returns
> >>   the starting kernfs node.  For all other cases kernfs_dir_pos is
> guaranteed
> >>   to return a directory position in advance of the starting directory
> position.
> >>
> >> - kernfs_dir_pos is given a significant make over.  It's essense remains
> the
> >>   same but some siginficant changes were made.
> >>
> >>   + The offset is validated to be a valid hash value at the very beginning.
> >>     The validation is updated to handle the fact that neither 0 nor 1 are
> >>     valid hash values.
> >>
> >>   + An extra test is added at the end of the rbtree lookup to ensure
> >>     the returned position is at or beyond the target position.
> >>
> >>   + kernfs_name_compare is used during the rbtree lookup instead of just
> >> comparing
> >>     the hash.  This allows the the passed namespace parameter to be used,
> and
> >>     if it is available from the saved entry the old saved name as well.
> >>
> >>   + The validation of the saved kernfs node now checks for two cases.
> >>     If the saved node is returnable.
> >>     If the name of the saved node is usable during lookup.
> >>
> >> In net this should result in a easier to understand, and more correct
> >> version of directory traversal for kernfs directories.
> >>
> >> Reported-by: "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com>
> >> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
> >> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
> >> scalability v2")
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>
> >> Can you test this and please verify it fixes your issue?
> >>
> >
> > I tried this patch on top of v4.17 but the system fails to boot
> > without detecting root disks by dracut like this:
> >
> >     [  196.121294] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  196.672175] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  197.219519] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  197.768997] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  198.312498] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  198.856841] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  199.403190] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  199.945999] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  200.490281] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  201.071177] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> >     [  201.074958] dracut-initqueue[324]: Warning: Could not boot.
> > 	     Starting Setup Virtual Console...
> >     [  OK  ] Started Setup Virtual Console.
> >     [  201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup
> comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
> res=success'
> > 	     [  201.256378] audit: type=1131 audit(1528132266.260:13):
> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel
> msg='unit=systemd-vconsole-setup comm="systemd"
> exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> >     Starting Dracut Emergency Shell...
> >     Warning: /dev/fedora/root does not exist
> >     Warning: /dev/fedora/swap does not exist
> >     Warning: /dev/mapper/fedora-root does not exist
> >
> >     Generating "/run/initramfs/rdsosreport.txt"
> >
> >
> >     Entering emergency mode. Exit the shell to continue.
> >     Type "journalctl" to view system logs.
> >     You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick
> or /boot
> >     after mounting them and attach it to a bug report.
> >
> >
> >     dracut:/# ls /sys/module
> >     8139cp        dm_bufio        kernel       psmouse      ttm
> >     8139too       dm_mirror       keyboard     pstore       uhci_hcd
> >     8250          dm_mod          kgdboc       qemu_fw_cfg  usbcore
> >     acpi          dm_snapshot     kgdbts       qxl          usbhid
> >     acpiphp       drm             libahci      random       uv_nmi
> >     ahci          drm_kms_helper  libata       rcupdate     virtio
> >     ata_generic   dynamic_debug   md_mod       rcutree
> virtio_console
> >     ata_piix      edac_core       mii          rng_core     virtio_pci
> >     battery       ehci_hcd        module       scsi_mod     virtio_ring
> >     block         fb              mousedev     serio_raw    vt
> >     button        firmware_class  pata_acpi    sg           watchdog
> >     configfs      fscrypto        pci_hotplug  slab_common  workqueue
> >     cpufreq       hid             pcie_aspm    spurious     xhci_hcd
> >     cpuidle       hid_magicmouse  pciehp       sr_mod       xz_dec
> >     crc32c_intel  hid_ntrig       pcmcia       srcutree     zswap
> >     cryptomgr     i8042           pcmcia_core  suspend
> >     debug_core    intel_idle      printk       sysrq
> >     devres        ipv6            processor    tcp_cubic
> >     dracut:/# lsmod
> >     Module                  Size  Used by
> >     qxl                    77824  1
> >     drm_kms_helper        196608  1
> >     ttm                   126976  1
> >     drm                   454656  4 qxl,ttm
> >     virtio_console         32768  0
> >     ata_generic            16384  0
> >     crc32c_intel           24576  0
> >     8139too                40960  0
> >     virtio_pci             28672  0
> >     8139cp                 32768  0
> >     virtio_ring            28672  2 virtio_pci
> >     serio_raw              16384  0
> >     pata_acpi              16384  0
> >     virtio                 16384  2 virtio_pci
> >     mii                    16384  2 8139too
> >     qemu_fw_cfg            16384  0
> >     dracut:/#
> >
> > OTOH, there's no issue on the pure v4.17 kernel.
> >
> > As above, ls /sys/module looks apparently good. But I guess any part of
> > behavior of getdentries() on sysfs must have changed, affecting the disk
> > detection...
> 
> I haven't been able to reproduce this yet.  My test system boots fine.
> Which fedora are you testing on?
> 
> 
> >>  fs/kernfs/dir.c | 109
> >> ++++++++++++++++++++++++++++++++++----------------------
> >>  1 file changed, 67 insertions(+), 42 deletions(-)
> >>
> >> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> >> index 89d1dc19340b..8148b5fec48d 100644
> >> --- a/fs/kernfs/dir.c
> >> +++ b/fs/kernfs/dir.c
> >> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode
> *inode,
> >> struct file *filp)
> >>  	return 0;
> >>  }
> >>
> >> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> >> +{
> >> +	struct rb_node *node = rb_next(&pos->rb);
> >> +	return node ? rb_to_kn(node) : NULL;
> >> +}
> >> +
> >>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >> -	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> >>  {
> >> -	if (pos) {
> >> -		int valid = kernfs_active(pos) &&
> >> -			pos->parent == parent && hash == pos->hash;
> >> -		kernfs_put(pos);
> >> -		if (!valid)
> >> -			pos = NULL;
> >> -	}
> >> -	if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >> -		struct rb_node *node = parent->dir.children.rb_node;
> >> -		while (node) {
> >> -			pos = rb_to_kn(node);
> >> -
> >> -			if (hash < pos->hash)
> >> -				node = node->rb_left;
> >> -			else if (hash > pos->hash)
> >> -				node = node->rb_right;
> >> -			else
> >> -				break;
> >> +	struct kernfs_node *pos;
> >> +	struct rb_node *node;
> >> +	unsigned int hash;
> >> +	const char *name = "";
> >> +
> >> +	/* Is off a valid name hash? */
> >> +	if ((off < 2) || (off >= INT_MAX))
> >> +		return NULL;
> >> +	hash = off;
> >> +
> >> +	/* Is the saved position usable? */
> >> +	if (saved) {
> >> +		/* Proper parent and hash? */
> >> +		if ((parent != saved->parent) || (saved->hash != hash)) {
> >> +			saved = NULL;
> >
> > name is uninitialized in this path.
> 
> It is.  name is initialized to "" see above.
> 

Or when either of the conditions is true, it has resulted in some inconsistent state, right?
So, why not terminating this session of readdir() immediately by returning NULL just as when off is turned out to be invalid?

> >> +		} else {
> >> +			if (kernfs_active(saved))
> >> +				return saved;
> >> +			name = saved->name;
> >>  		}
> >>  	}
> >> -	/* Skip over entries which are dying/dead or in the wrong namespace
> >> */
> >> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> >> -		struct rb_node *node = rb_next(&pos->rb);
> >> -		if (!node)
> >> -			pos = NULL;
> >> +
> >> +	/* Find the closest pos to the hash we are looking for */
> >> +	pos = NULL;
> >> +	node = parent->dir.children.rb_node;
> >> +	while (node) {
> >> +		int result;
> >> +
> >> +		pos = rb_to_kn(node);
> >> +		result = kernfs_name_compare(hash, name, ns, pos);
> >> +		if (result < 0)
> >> +			node = node->rb_left;
> >> +		else if (result > 0)
> >> +			node = node->rb_right;
> >>  		else
> >> -			pos = rb_to_kn(node);
> >> +			break;
> >>  	}
> >> +
> >> +	/* Ensure pos is at or beyond the target position */
> >> +	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> >> +		pos = kernfs_dir_next(pos);
> >> +
> >
> > Why not trying to find the final target object here?
> >
> > Looking at the code, I think the operation needed here is to find the
> > smallest active object in the same namespace from the objects larger
> > than the saved object given as argument. The saved object appears to
> > be used as cache. I think dividing the process into kernfs_dir_pos()
> > is not necessary.
> 
> What you are suggesting below is wrong when restarting readdir.
> 
> Ordinarily saved is the next entry we are going to return from readdir.
> When dir_emit does not succeed we stop returnning entries and when
> we restart readdir we need to attempt dir_emit on saved again.
> 
> Always advancing past saved will cause us to skip saved in the
> event a single readdir is not enough.
> 
> The restart (kernfs_dir_pos) must return saved or if saved is now gone,
> the first entry past saved.
> 
> Saved dying in the middle is what I believe caused the original issue.
> 
> Making all of this clear is part of the reason I moved the
> kernfs_dir_skip_pos logic into it's own function.
> 
> Eric

Thanks for the explanation. I'm sure why my suggestion doesn't work.

> 
> >
> > I mean like this:
> >
> > # git diff
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 8148b5f..eeffc8c 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode
> *inode, struct file *filp)
> >
> >         /* Is the saved position usable? */
> >         if (saved) {
> > +               name = saved->name;
> >                 /* Proper parent and hash? */
> >                 if ((parent != saved->parent) || (saved->hash != hash)) {
> >                         saved = NULL;
> > -               } else {
> > -                       if (kernfs_active(saved))
> > -                               return saved;
> > -                       name = saved->name;
> > +               } else if (kernfs_active(saved)) {
> > +                       pos = saved;
> > +                       goto skip;
> >                 }
> >         }
> >
> > @@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode
> *inode, struct file *filp)
> >                         break;
> >         }
> >
> > +skip:
> >         /* Ensure pos is at or beyond the target position */
> > -       if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> > +       if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
> >                 pos = kernfs_dir_next(pos);
> >
> >         return pos;
> >  }
> >
> > -static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> > -       struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> > -{
> > -       struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> > -       if (likely(pos == start))
> > -               pos = kernfs_dir_next(pos);
> > -       return pos;
> > -}
> > -
> >  static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> >                                                struct kernfs_node *pos)
> >  {
> > @@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct
> dir_context *ctx)
> >
> >         for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> >              (pos = kernfs_dir_skip_pos(ns, pos));
> > -            pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> > +            pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
> >                 const char *name = pos->name;
> >                 unsigned int type = dt_type(pos);
> >                 int len = strlen(name);
> >
> >>  	return pos;
> >>  }
> >>
> >>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> >> -	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> >> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> >>  {
> >> -	pos = kernfs_dir_pos(ns, parent, ino, pos);
> >> -	if (pos) {
> >> -		do {
> >> -			struct rb_node *node = rb_next(&pos->rb);
> >> -			if (!node)
> >> -				pos = NULL;
> >> -			else
> >> -				pos = rb_to_kn(node);
> >> -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
> >> -	}
> >> +	struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> >> +	if (likely(pos == start))
> >> +		pos = kernfs_dir_next(pos);
> >> +	return pos;
> >> +}
> >> +
> >> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> >> +					       struct kernfs_node *pos)
> >> +{
> >> +	/* Skip entries we don't want to return to userspace */
> >> +	while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
> >> +		pos = kernfs_dir_next(pos);
> >>  	return pos;
> >>  }
> >>
> >> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file,
> struct
> >> dir_context *ctx)
> >>  {
> >>  	struct dentry *dentry = file->f_path.dentry;
> >>  	struct kernfs_node *parent = kernfs_dentry_node(dentry);
> >> -	struct kernfs_node *pos = file->private_data;
> >> +	struct kernfs_node *pos, *saved = file->private_data;
> >>  	const void *ns = NULL;
> >>
> >>  	if (!dir_emit_dots(file, ctx))
> >> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
> >> struct dir_context *ctx)
> >>  	if (kernfs_ns_enabled(parent))
> >>  		ns = kernfs_info(dentry->d_sb)->ns;
> >>
> >> -	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
> >> -	     pos;
> >> +	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> >> +	     (pos = kernfs_dir_skip_pos(ns, pos));
> >>  	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> >>  		const char *name = pos->name;
> >>  		unsigned int type = dt_type(pos);
> >>  		int len = strlen(name);
> >>  		ino_t ino = pos->id.ino;
> >>
> >> -		ctx->pos = pos->hash;
> >> -		file->private_data = pos;
> >> -		kernfs_get(pos);
> >> +		kernfs_put(saved);
> >> +		saved = pos;
> >> +		ctx->pos = saved->hash;
> >> +		file->private_data = saved;
> >> +		kernfs_get(saved);
> >>
> >>  		mutex_unlock(&kernfs_mutex);
> >>  		if (!dir_emit(ctx, name, len, ino, type))
> >>  			return 0;
> >>  		mutex_lock(&kernfs_mutex);
> >>  	}
> >> +	kernfs_put(saved);
> >>  	mutex_unlock(&kernfs_mutex);
> >>  	file->private_data = NULL;
> >>  	ctx->pos = INT_MAX;
> >> --
> >> 2.14.1
> >>
> >>
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-05  2:02         ` Eric W. Biederman
@ 2018-06-05  5:52           ` Hatayama, Daisuke
  0 siblings, 0 replies; 17+ messages in thread
From: Hatayama, Daisuke @ 2018-06-05  5:52 UTC (permalink / raw)
  To: 'Eric W. Biederman'
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'



> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@xmission.com]
> Sent: Tuesday, June 5, 2018 11:03 AM
> To: Hatayama, Daisuke <d.hatayama@jp.fujitsu.com>
> Cc: 'gregkh@linuxfoundation.org' <gregkh@linuxfoundation.org>;
> 'tj@kernel.org' <tj@kernel.org>; Okajima, Toshiyuki 
> <toshi.okajima@jp.fujitsu.com>; linux-kernel@vger.kernel.org;
> 'ebiederm@aristanetworks.com' <ebiederm@aristanetworks.com>
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> 
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > "Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com> writes:
> >
> >>> Can you test this and please verify it fixes your issue?
> >>
> >> I tried this patch on top of v4.17 but the system fails to boot
> >> without detecting root disks by dracut like this:
> [snip]
> 
> >> OTOH, there's no issue on the pure v4.17 kernel.
> >>
> >> As above, ls /sys/module looks apparently good. But I guess any part of
> >> behavior of getdentries() on sysfs must have changed, affecting the disk
> >> detection...
> >
> > I haven't been able to reproduce this yet.  My test system boots fine.
> > Which fedora are you testing on?
> 
> I reproduced something similar and fedora 28.  So I think I have found
> and fixed the issue.  I believe I simply reversed the test at the end of
> kernfs_dir_pos. AKA "<" instead of ">".

Though too late, I used fedora 28 and RHEL7.5.

I applied this fix to your patch and the system boot was successfully done.

I'll start testing your patch from now on.

> 
> I am going to see if I can test my changes more throughly on this side
> and then repost.
> 
> 
> 
> >>>  fs/kernfs/dir.c | 109
> >>> ++++++++++++++++++++++++++++++++++----------------------
> >>>  1 file changed, 67 insertions(+), 42 deletions(-)
> >>>
> >>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> >>> index 89d1dc19340b..8148b5fec48d 100644
> >>> --- a/fs/kernfs/dir.c
> >>> +++ b/fs/kernfs/dir.c
> >>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode
> *inode,
> >>> struct file *filp)
> >>>  	return 0;
> >>>  }
> >>>
> >>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> >>> +{
> >>> +	struct rb_node *node = rb_next(&pos->rb);
> >>> +	return node ? rb_to_kn(node) : NULL;
> >>> +}
> >>> +
> >>>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >>> -	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >>> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> >>>  {
> >>> -	if (pos) {
> >>> -		int valid = kernfs_active(pos) &&
> >>> -			pos->parent == parent && hash == pos->hash;
> >>> -		kernfs_put(pos);
> >>> -		if (!valid)
> >>> -			pos = NULL;
> >>> -	}
> >>> -	if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >>> -		struct rb_node *node = parent->dir.children.rb_node;
> >>> -		while (node) {
> >>> -			pos = rb_to_kn(node);
> >>> -
> >>> -			if (hash < pos->hash)
> >>> -				node = node->rb_left;
> >>> -			else if (hash > pos->hash)
> >>> -				node = node->rb_right;
> >>> -			else
> >>> -				break;
> >>> +	struct kernfs_node *pos;
> >>> +	struct rb_node *node;
> >>> +	unsigned int hash;
> >>> +	const char *name = "";
> >>> +
> >>> +	/* Is off a valid name hash? */
> >>> +	if ((off < 2) || (off >= INT_MAX))
> >>> +		return NULL;
> >>> +	hash = off;
> >>> +
> >>> +	/* Is the saved position usable? */
> >>> +	if (saved) {
> >>> +		/* Proper parent and hash? */
> >>> +		if ((parent != saved->parent) || (saved->hash != hash)) {
> >>> +			saved = NULL;
> >>
> >> name is uninitialized in this path.
> >
> > It is.  name is initialized to "" see above.
> >
> >>> +		} else {
> >>> +			if (kernfs_active(saved))
> >>> +				return saved;
> >>> +			name = saved->name;
> >>>  		}
> >>>  	}
> >>> -	/* Skip over entries which are dying/dead or in the wrong namespace
> >>> */
> >>> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> >>> -		struct rb_node *node = rb_next(&pos->rb);
> >>> -		if (!node)
> >>> -			pos = NULL;
> >>> +
> >>> +	/* Find the closest pos to the hash we are looking for */
> >>> +	pos = NULL;
> >>> +	node = parent->dir.children.rb_node;
> >>> +	while (node) {
> >>> +		int result;
> >>> +
> >>> +		pos = rb_to_kn(node);
> >>> +		result = kernfs_name_compare(hash, name, ns, pos);
> >>> +		if (result < 0)
> >>> +			node = node->rb_left;
> >>> +		else if (result > 0)
> >>> +			node = node->rb_right;
> >>>  		else
> >>> -			pos = rb_to_kn(node);
> >>> +			break;
> >>>  	}
> >>> +
> >>> +	/* Ensure pos is at or beyond the target position */
> >>> +	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
>                                                     ^^^^^^^^^^^^^^^^
>                                           should be > 0
> >>> +		pos = kernfs_dir_next(pos);
> >>> +
> 
> Eric
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-05  5:45         ` Hatayama, Daisuke
@ 2018-06-05 15:31           ` Eric W. Biederman
  2018-06-05 15:42             ` 'tj@kernel.org'
  0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-06-05 15:31 UTC (permalink / raw)
  To: Hatayama, Daisuke
  Cc: 'gregkh@linuxfoundation.org', 'tj@kernel.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

"Hatayama, Daisuke" <d.hatayama@jp.fujitsu.com> writes:

>> >> +
>> >> +	/* Is the saved position usable? */
>> >> +	if (saved) {
>> >> +		/* Proper parent and hash? */
>> >> +		if ((parent != saved->parent) || (saved->hash != hash)) {
>> >> +			saved = NULL;
>> >
>> > name is uninitialized in this path.
>> 
>> It is.  name is initialized to "" see above.
>> 
>
> Or when either of the conditions is true, it has resulted in some inconsistent state, right?
> So, why not terminating this session of readdir() immediately by
> returning NULL just as when off is turned out to be invalid?

What I have above is not the clearest, and in fact the logic could be
better.

The fundamental challenge is because hash collisions are possible a file
offset does not hold complete position information in a directory.

So the kernfs node that is to be read/displayed next is saved in the
struct file.  The it is tested if the saved kernfs node is usable
for finding the location in the directory.  Several things may have
gone wrong.

- Someone may have called seekdir.
- The saved kernfs node may have been renamed.
- The saved kernfs node may have been moved to a different directory in
  kernfs.
- the saved kernfs node may have been deleted.

If any of those are true the code needs to do the rbtree lookup.

If the kernfs node has been deleted or moved to a different directory we
can safely use it's name while performing the rbtree lookup.  Which in
the event of a hash collision will be more accurate in finding our old
location, and preventing the same directory entry being returned
multiple times.

Which is completely different than if the directory offset is an invalid
value that will never point to any directory entries.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-05 15:31           ` Eric W. Biederman
@ 2018-06-05 15:42             ` 'tj@kernel.org'
  2018-06-05 17:47               ` Eric W. Biederman
  0 siblings, 1 reply; 17+ messages in thread
From: 'tj@kernel.org' @ 2018-06-05 15:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Hatayama, Daisuke, 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

Hello,

On Tue, Jun 05, 2018 at 10:31:36AM -0500, Eric W. Biederman wrote:
> What I have above is not the clearest, and in fact the logic could be
> better.
> 
> The fundamental challenge is because hash collisions are possible a file
> offset does not hold complete position information in a directory.
> 
> So the kernfs node that is to be read/displayed next is saved in the
> struct file.  The it is tested if the saved kernfs node is usable
> for finding the location in the directory.  Several things may have
> gone wrong.
> 
> - Someone may have called seekdir.
> - The saved kernfs node may have been renamed.
> - The saved kernfs node may have been moved to a different directory in
>   kernfs.
> - the saved kernfs node may have been deleted.
> 
> If any of those are true the code needs to do the rbtree lookup.

So, given that the whole thing is protected by a mutex which protects
modifications, it could be an option to simply keep track of who's
iterating what and shift them on removals.  IOW, always keep cursor
pointing to the next thing to visit and if that gets removed shift the
cursor to the next one.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-05 15:42             ` 'tj@kernel.org'
@ 2018-06-05 17:47               ` Eric W. Biederman
  2018-06-07 18:36                 ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-06-05 17:47 UTC (permalink / raw)
  To: 'tj@kernel.org'
  Cc: Hatayama, Daisuke, 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

"'tj@kernel.org'" <tj@kernel.org> writes:

> Hello,
>
> On Tue, Jun 05, 2018 at 10:31:36AM -0500, Eric W. Biederman wrote:
>> What I have above is not the clearest, and in fact the logic could be
>> better.
>> 
>> The fundamental challenge is because hash collisions are possible a file
>> offset does not hold complete position information in a directory.
>> 
>> So the kernfs node that is to be read/displayed next is saved in the
>> struct file.  The it is tested if the saved kernfs node is usable
>> for finding the location in the directory.  Several things may have
>> gone wrong.
>> 
>> - Someone may have called seekdir.
>> - The saved kernfs node may have been renamed.
>> - The saved kernfs node may have been moved to a different directory in
>>   kernfs.
>> - the saved kernfs node may have been deleted.
>> 
>> If any of those are true the code needs to do the rbtree lookup.
>
> So, given that the whole thing is protected by a mutex which protects
> modifications, it could be an option to simply keep track of who's
> iterating what and shift them on removals.  IOW, always keep cursor
> pointing to the next thing to visit and if that gets removed shift the
> cursor to the next one.

Yes.  We could.

The primary case we have to worry about is someone using seekdir,
and for that we always need the rbtree lookup.   For seekdir
we could invalidate the saved entry and make things simpler
that way.

We could add list_head to the kernfs_node and create:
struct kernfs_dir_file {
	struct list_head entry;
        struct kernfs_node *kn;
}
And point at that from struct file->private_data.

I don't know if it would be worth the trouble to do that over a quick
check to make certain the kernfs_node is what it is expected to be.
But that is an option.

Part of the pain of supporting seekdir is that the offset we expose
to userspace in has to be 32bit to support 32bit userspace applications.
Which unfortunately is small enough that if nothing else a name
collision can be brute forced.  So we can not avoid handling collisions.


Sigh,  I have found another issue with kernfs_fop_readdir.

We are not currently protecting file->private_data with the kernfs_mutex
or any other kind of serialization.  Which means if two processes are
calling readdir on the same file descriptor we might get unpredictable
behavior.

It doesn't look too bad and easy enough to fix, but definitely something
to be watchful of.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
  2018-06-05 17:47               ` Eric W. Biederman
@ 2018-06-07 18:36                 ` Al Viro
  0 siblings, 0 replies; 17+ messages in thread
From: Al Viro @ 2018-06-07 18:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: 'tj@kernel.org',
	Hatayama, Daisuke, 'gregkh@linuxfoundation.org',
	Okajima, Toshiyuki, linux-kernel,
	'ebiederm@aristanetworks.com'

On Tue, Jun 05, 2018 at 12:47:33PM -0500, Eric W. Biederman wrote:

> Sigh,  I have found another issue with kernfs_fop_readdir.
> 
> We are not currently protecting file->private_data with the kernfs_mutex
> or any other kind of serialization.  Which means if two processes are
> calling readdir on the same file descriptor we might get unpredictable
> behavior.
> 
> It doesn't look too bad and easy enough to fix, but definitely something
> to be watchful of.

As discussed off-list - this is not a problem; getdents() et.al. are
serialized on per-struct-file basis by fdget_pos() in relevant syscalls,
since all directories automatically get FMODE_ATOMIC_POS in ->f_mode.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-06-07 18:36 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-28 12:54 [RESEND PATCH v2] kernfs: fix dentry unexpected skip Hatayama, Daisuke
2018-05-28 13:08 ` Hatayama, Daisuke
2018-05-29 16:26 ` 'tj@kernel.org'
2018-06-01  9:25   ` Hatayama, Daisuke
2018-06-01 17:07     ` 'tj@kernel.org'
2018-06-04  9:46       ` Hatayama, Daisuke
2018-06-02 17:25 ` Eric W. Biederman
2018-06-03 18:51   ` [CFT][PATCH] kernfs: Correct kernfs directory seeks Eric W. Biederman
2018-06-04  9:34     ` Hatayama, Daisuke
2018-06-04 14:44       ` Eric W. Biederman
2018-06-05  2:02         ` Eric W. Biederman
2018-06-05  5:52           ` Hatayama, Daisuke
2018-06-05  5:45         ` Hatayama, Daisuke
2018-06-05 15:31           ` Eric W. Biederman
2018-06-05 15:42             ` 'tj@kernel.org'
2018-06-05 17:47               ` Eric W. Biederman
2018-06-07 18:36                 ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).