LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] mm: fix possible cause of a page_mapped BUG
@ 2011-02-24  5:39 Hugh Dickins
  2011-02-28 23:35 ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-02-24  5:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Robert Swiecki, Miklos Szeredi, linux-kernel, linux-mm

Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
a hole with madvise(,, MADV_REMOVE).  That path is under mutex, and
cannot be explained by lack of serialization in unmap_mapping_range().

Reviewing the code, I found one place where vm_truncate_count handling
should have been updated, when I switched at the last minute from one
way of managing the restart_addr to another: mremap move changes the
virtual addresses, so it ought to adjust the restart_addr.

But rather than exporting the notion of restart_addr from memory.c, or
converting to restart_pgoff throughout, simply reset vm_truncate_count
to 0 to force a rescan if mremap move races with preempted truncation.

We have no confirmation that this fixes Robert's BUG,
but it is a fix that's worth making anyway.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/mremap.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- 2.6.38-rc6/mm/mremap.c	2011-01-18 22:04:56.000000000 -0800
+++ linux/mm/mremap.c	2011-02-23 15:29:52.000000000 -0800
@@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str
 		 */
 		mapping = vma->vm_file->f_mapping;
 		spin_lock(&mapping->i_mmap_lock);
-		if (new_vma->vm_truncate_count &&
-		    new_vma->vm_truncate_count != vma->vm_truncate_count)
-			new_vma->vm_truncate_count = 0;
+		new_vma->vm_truncate_count = 0;
 	}
 
 	/*

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-02-24  5:39 [PATCH] mm: fix possible cause of a page_mapped BUG Hugh Dickins
@ 2011-02-28 23:35 ` Robert Święcki
  2011-03-17 15:40   ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-02-28 23:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, linux-kernel, linux-mm

> But rather than exporting the notion of restart_addr from memory.c, or
> converting to restart_pgoff throughout, simply reset vm_truncate_count
> to 0 to force a rescan if mremap move races with preempted truncation.
>
> We have no confirmation that this fixes Robert's BUG,
> but it is a fix that's worth making anyway.

Hi, I don't have currently access to my rs232/console testing machine
(lame excuse but it helps a lot;), cause I'm working currently OOtO,
so I'll try to test it asap - which is probably Mar 15th or so.

Btw, the fuzzer is here: http://code.google.com/p/iknowthis/

I think i was trying it with this revision:
http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
newest 'iknowthis' supports x86-64 natively), so feel free to try it.

It used to crash the machine (it's BUG_ON but the system became
unusable) in matter of hours. Btw, when I was testing it for the last
time it Ooopsed much more frequently in proc_readdir (I sent report in
one of earliet e-mails).

> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  mm/mremap.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> --- 2.6.38-rc6/mm/mremap.c      2011-01-18 22:04:56.000000000 -0800
> +++ linux/mm/mremap.c   2011-02-23 15:29:52.000000000 -0800
> @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str
>                 */
>                mapping = vma->vm_file->f_mapping;
>                spin_lock(&mapping->i_mmap_lock);
> -               if (new_vma->vm_truncate_count &&
> -                   new_vma->vm_truncate_count != vma->vm_truncate_count)
> -                       new_vma->vm_truncate_count = 0;
> +               new_vma->vm_truncate_count = 0;
>        }
>
>        /*
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-02-28 23:35 ` Robert Święcki
@ 2011-03-17 15:40   ` Robert Święcki
  2011-03-19  5:34     ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-03-17 15:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, linux-kernel, linux-mm

Hi,

On Tue, Mar 1, 2011 at 12:35 AM, Robert Święcki <robert@swiecki.net> wrote:
>> But rather than exporting the notion of restart_addr from memory.c, or
>> converting to restart_pgoff throughout, simply reset vm_truncate_count
>> to 0 to force a rescan if mremap move races with preempted truncation.
>>
>> We have no confirmation that this fixes Robert's BUG,
>> but it is a fix that's worth making anyway.
>
> Hi, I don't have currently access to my rs232/console testing machine
> (lame excuse but it helps a lot;), cause I'm working currently OOtO,
> so I'll try to test it asap - which is probably Mar 15th or so.

So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other
problems, and never seen anything about mremap in 2.6.38 (yet), as it
had been happening in 2.6.37-rc2. The output goes to
http://alt.swiecki.net/linux_kernel/ - I'm still trying.

> Btw, the fuzzer is here: http://code.google.com/p/iknowthis/
>
> I think i was trying it with this revision:
> http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
> newest 'iknowthis' supports x86-64 natively), so feel free to try it.
>
> It used to crash the machine (it's BUG_ON but the system became
> unusable) in matter of hours. Btw, when I was testing it for the last
> time it Ooopsed much more frequently in proc_readdir (I sent report in
> one of earliet e-mails).
>
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>> ---
>>
>>  mm/mremap.c |    4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> --- 2.6.38-rc6/mm/mremap.c      2011-01-18 22:04:56.000000000 -0800
>> +++ linux/mm/mremap.c   2011-02-23 15:29:52.000000000 -0800
>> @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str
>>                 */
>>                mapping = vma->vm_file->f_mapping;
>>                spin_lock(&mapping->i_mmap_lock);
>> -               if (new_vma->vm_truncate_count &&
>> -                   new_vma->vm_truncate_count != vma->vm_truncate_count)
>> -                       new_vma->vm_truncate_count = 0;
>> +               new_vma->vm_truncate_count = 0;
>>        }
>>
>>        /*
>>
>
>
>
> --
> Robert Święcki
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-03-17 15:40   ` Robert Święcki
@ 2011-03-19  5:34     ` Hugh Dickins
  2011-04-01 14:34       ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-03-19  5:34 UTC (permalink / raw)
  To: Robert Swiecki
  Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm

On Thu, 17 Mar 2011, Robert Swiecki wrote:
> On Tue, Mar 1, 2011 at 12:35 AM, Robert Swiecki <robert@swiecki.net> wrote:
> 
> So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other
> problems, and never seen anything about mremap in 2.6.38 (yet),

Thanks a lot for getting back to this, Robert, and thanks for the update.
I won't be celebrating, but this sounds like good news for my mremap patch.

> as it had been happening in 2.6.37-rc2. The output goes to
> http://alt.swiecki.net/linux_kernel/ - I'm still trying.

A problem in sys_mlock: I've Cc'ed Michel who is the current expert.

A problem in sys_munlock: Michel again, except vma_prio_tree_add is
implicated, and I used to be involved with that.  I've appended below
a debug patch which I wrote years ago, and have largely forgotten, but
Andrew keeps it around in mmotm: we might learn more if you add that
into your kernel build.

A problem in next_pidmap from find_ge_pid from ... proc_pid_readdir.
I did spend a while looking into that when you first reported it.
I'm pretty sure, from the register values, that it's a result of
a pid number (in some places signed int, in some places unsigned)
getting unexpectedly sign-extended to negative, so indexing before
the beginning of an array; but I never tracked down the root of the
problem, and failed to reproduce it with odd lseeks on the directory.

Ah, the one you report now comes from compat_sys_getdents,
whereas the original one came from compat_sys_old_readdir: okay,
I had been wondering whether it was peculiar to the old_readdir case,
but no, it's reproduced with getdents too.  Might be peculiar to compat.

Anyway, I've Cc'ed Eric who will be the best for that one.

And a couple of watchdog problems: I haven't even glanced at
those, hope someone else can suggest a good way forward on them.

Hugh

> 
> > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/
> >
> > I think i was trying it with this revision:
> > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
> > newest 'iknowthis' supports x86-64 natively), so feel free to try it.
> >
> > It used to crash the machine (it's BUG_ON but the system became
> > unusable) in matter of hours. Btw, when I was testing it for the last
> > time it Ooopsed much more frequently in proc_readdir (I sent report in
> > one of earliet e-mails).

From: Hugh Dickins <hughd@google.com>

Jayson Santos has sighted mm/prio_tree.c:78,79 BUGs (kernel bugzilla 8446),
and one was sighted a couple of years ago.  No reason yet to suppose
they're prio_tree bugs, but we can't tell much about them without seeing
the vmas.

So dump vma and the one it's supposed to resemble: I had expected to use
print_hex_dump(), but that's designed for u8 dumps, whereas almost every
field of vm_area_struct is either a pointer or an unsigned long - which
look nonsense dumped as u8s.

Replace the two BUG_ONs by a single WARN_ON; and if it fires, just keep
this vma out of the tree (truncation and swapout won't be able to find it).
 How safe this is depends on what the error really is; but we hold a file's
i_mmap_lock here, so it may be impossible to recover from BUG_ON.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jayson Santos <jaysonsantos2003@yahoo.com.br>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/prio_tree.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff -puN mm/prio_tree.c~prio_tree-debugging-patch mm/prio_tree.c
--- a/mm/prio_tree.c~prio_tree-debugging-patch
+++ a/mm/prio_tree.c
@@ -67,6 +67,20 @@
  * 	vma->shared.vm_set.head == NULL ==> a list node
  */
 
+static void dump_vma(struct vm_area_struct *vma)
+{
+	void **ptr = (void **) vma;
+	int i;
+
+	printk("vm_area_struct at %p:", ptr);
+	for (i = 0; i < sizeof(*vma)/sizeof(*ptr); i++, ptr++) {
+		if (!(i & 3))
+			printk("\n");
+		printk(" %p", *ptr);
+	}
+	printk("\n");
+}
+
 /*
  * Add a new vma known to map the same set of pages as the old vma:
  * useful for fork's dup_mmap as well as vma_prio_tree_insert below.
@@ -74,14 +88,23 @@
  */
 void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old)
 {
-	/* Leave these BUG_ONs till prio_tree patch stabilizes */
-	BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
-	BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
-
 	vma->shared.vm_set.head = NULL;
 	vma->shared.vm_set.parent = NULL;
 
-	if (!old->shared.vm_set.parent)
+	if (WARN_ON(RADIX_INDEX(vma) != RADIX_INDEX(old) ||
+		    HEAP_INDEX(vma)  != HEAP_INDEX(old))) {
+		/*
+		 * This should never happen, yet it has been seen a few times:
+		 * we cannot say much about it without seeing the vma contents.
+		 */
+		dump_vma(vma);
+		dump_vma(old);
+		/*
+		 * Don't try to link this (corrupt?) vma into the (corrupt?)
+		 * prio_tree, but arrange for its removal to succeed later.
+		 */
+		INIT_LIST_HEAD(&vma->shared.vm_set.list);
+	} else if (!old->shared.vm_set.parent)
 		list_add(&vma->shared.vm_set.list,
 				&old->shared.vm_set.list);
 	else if (old->shared.vm_set.head)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-03-19  5:34     ` Hugh Dickins
@ 2011-04-01 14:34       ` Robert Święcki
  2011-04-01 15:44         ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-01 14:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm

On Sat, Mar 19, 2011 at 6:34 AM, Hugh Dickins <hughd@google.com> wrote:
> On Thu, 17 Mar 2011, Robert Swiecki wrote:
>> On Tue, Mar 1, 2011 at 12:35 AM, Robert Swiecki <robert@swiecki.net> wrote:
>>
>> So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other
>> problems, and never seen anything about mremap in 2.6.38 (yet),
>
> Thanks a lot for getting back to this, Robert, and thanks for the update.
> I won't be celebrating, but this sounds like good news for my mremap patch.
>
>> as it had been happening in 2.6.37-rc2. The output goes to
>> http://alt.swiecki.net/linux_kernel/ - I'm still trying.
>
> A problem in sys_mlock: I've Cc'ed Michel who is the current expert.
>
> A problem in sys_munlock: Michel again, except vma_prio_tree_add is
> implicated, and I used to be involved with that.  I've appended below
> a debug patch which I wrote years ago, and have largely forgotten, but
> Andrew keeps it around in mmotm: we might learn more if you add that
> into your kernel build.

Hey, I'll apply your patch and check it out. In the meantime I
triggered another Oops (NULL-ptr deref via sys_mprotect).

The oops is here:

http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt

> A problem in next_pidmap from find_ge_pid from ... proc_pid_readdir.
> I did spend a while looking into that when you first reported it.
> I'm pretty sure, from the register values, that it's a result of
> a pid number (in some places signed int, in some places unsigned)
> getting unexpectedly sign-extended to negative, so indexing before
> the beginning of an array; but I never tracked down the root of the
> problem, and failed to reproduce it with odd lseeks on the directory.
>
> Ah, the one you report now comes from compat_sys_getdents,
> whereas the original one came from compat_sys_old_readdir: okay,
> I had been wondering whether it was peculiar to the old_readdir case,
> but no, it's reproduced with getdents too.  Might be peculiar to compat.
>
> Anyway, I've Cc'ed Eric who will be the best for that one.
>
> And a couple of watchdog problems: I haven't even glanced at
> those, hope someone else can suggest a good way forward on them.
>
> Hugh
>
>>
>> > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/
>> >
>> > I think i was trying it with this revision:
>> > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode,
>> > newest 'iknowthis' supports x86-64 natively), so feel free to try it.
>> >
>> > It used to crash the machine (it's BUG_ON but the system became
>> > unusable) in matter of hours. Btw, when I was testing it for the last
>> > time it Ooopsed much more frequently in proc_readdir (I sent report in
>> > one of earliet e-mails).
>
> From: Hugh Dickins <hughd@google.com>
>
> Jayson Santos has sighted mm/prio_tree.c:78,79 BUGs (kernel bugzilla 8446),
> and one was sighted a couple of years ago.  No reason yet to suppose
> they're prio_tree bugs, but we can't tell much about them without seeing
> the vmas.
>
> So dump vma and the one it's supposed to resemble: I had expected to use
> print_hex_dump(), but that's designed for u8 dumps, whereas almost every
> field of vm_area_struct is either a pointer or an unsigned long - which
> look nonsense dumped as u8s.
>
> Replace the two BUG_ONs by a single WARN_ON; and if it fires, just keep
> this vma out of the tree (truncation and swapout won't be able to find it).
>  How safe this is depends on what the error really is; but we hold a file's
> i_mmap_lock here, so it may be impossible to recover from BUG_ON.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Jayson Santos <jaysonsantos2003@yahoo.com.br>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/prio_tree.c |   33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff -puN mm/prio_tree.c~prio_tree-debugging-patch mm/prio_tree.c
> --- a/mm/prio_tree.c~prio_tree-debugging-patch
> +++ a/mm/prio_tree.c
> @@ -67,6 +67,20 @@
>  *     vma->shared.vm_set.head == NULL ==> a list node
>  */
>
> +static void dump_vma(struct vm_area_struct *vma)
> +{
> +       void **ptr = (void **) vma;
> +       int i;
> +
> +       printk("vm_area_struct at %p:", ptr);
> +       for (i = 0; i < sizeof(*vma)/sizeof(*ptr); i++, ptr++) {
> +               if (!(i & 3))
> +                       printk("\n");
> +               printk(" %p", *ptr);
> +       }
> +       printk("\n");
> +}
> +
>  /*
>  * Add a new vma known to map the same set of pages as the old vma:
>  * useful for fork's dup_mmap as well as vma_prio_tree_insert below.
> @@ -74,14 +88,23 @@
>  */
>  void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old)
>  {
> -       /* Leave these BUG_ONs till prio_tree patch stabilizes */
> -       BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
> -       BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
> -
>        vma->shared.vm_set.head = NULL;
>        vma->shared.vm_set.parent = NULL;
>
> -       if (!old->shared.vm_set.parent)
> +       if (WARN_ON(RADIX_INDEX(vma) != RADIX_INDEX(old) ||
> +                   HEAP_INDEX(vma)  != HEAP_INDEX(old))) {
> +               /*
> +                * This should never happen, yet it has been seen a few times:
> +                * we cannot say much about it without seeing the vma contents.
> +                */
> +               dump_vma(vma);
> +               dump_vma(old);
> +               /*
> +                * Don't try to link this (corrupt?) vma into the (corrupt?)
> +                * prio_tree, but arrange for its removal to succeed later.
> +                */
> +               INIT_LIST_HEAD(&vma->shared.vm_set.list);
> +       } else if (!old->shared.vm_set.parent)
>                list_add(&vma->shared.vm_set.list,
>                                &old->shared.vm_set.list);
>        else if (old->shared.vm_set.head)
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-01 14:34       ` Robert Święcki
@ 2011-04-01 15:44         ` Linus Torvalds
  2011-04-01 16:21           ` Robert Święcki
  2011-04-02  1:46           ` Hugh Dickins
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2011-04-01 15:44 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote:
>
> Hey, I'll apply your patch and check it out. In the meantime I
> triggered another Oops (NULL-ptr deref via sys_mprotect).
>
> The oops is here:
>
> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt

That's not a NULL pointer dereference. That's a BUG_ON().

And for some reason you've turned off the BUG_ON() messages, saving
some tiny amount of memory.

Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
would be this one:

        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));

but it is possible that gcc has shuffled things around (so it _might_
be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
get a filename and line number. One reason I hate -O2 in cases like
this is that the basic block movement makes it way harder to actually
debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
or whatever it's called).

Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?

I do wonder what triggers this. Is it a huge-page vma? We seem to be
lacking the check to see that mprotect() is on a hugepage boundary -
and that seems bogus. Or am I missing some check? The new transparent
hugepage support splits the page, but what if it's a _static_ hugepage
thing?

But why would that affect the radix_index thing? I have no idea. I'd
like to blame the anon_vma rewrites last year, but I can't see why
that should matter either. Again, hugepages had some special rules, I
think (and that would explain why nobody normal sees this).

Guys, please give this one a look.

                          Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-01 15:44         ` Linus Torvalds
@ 2011-04-01 16:21           ` Robert Święcki
  2011-04-01 16:35             ` Linus Torvalds
  2011-04-02  1:46           ` Hugh Dickins
  1 sibling, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-01 16:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Fri, Apr 1, 2011 at 5:44 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote:
>>
>> Hey, I'll apply your patch and check it out. In the meantime I
>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>
>> The oops is here:
>>
>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>
> That's not a NULL pointer dereference. That's a BUG_ON().
>
> And for some reason you've turned off the BUG_ON() messages, saving
> some tiny amount of memory.

Is it possible to turn it off via config flags? Looking into
arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
always manifests itself somehow") and I have
CONFIG_DEBUG_BUGVERBOSE=y.

This BUG/Oopps was triggered before I applied Hugh's patch on a vanilla kernel.

Anything that could help you debugging this? Uploading kernel image
(unfortunately I've overwritten this one), dumping more kgdb data? I
must admit I'm not up-to-date with current linux kernel debugging
techniques. The kernel config is here:
http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt

For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)

> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
> would be this one:
>
>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>
> but it is possible that gcc has shuffled things around (so it _might_
> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
> get a filename and line number. One reason I hate -O2 in cases like
> this is that the basic block movement makes it way harder to actually
> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
> or whatever it's called).
>
> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>
> I do wonder what triggers this. Is it a huge-page vma? We seem to be
> lacking the check to see that mprotect() is on a hugepage boundary -
> and that seems bogus. Or am I missing some check? The new transparent
> hugepage support splits the page, but what if it's a _static_ hugepage
> thing?
>
> But why would that affect the radix_index thing? I have no idea. I'd
> like to blame the anon_vma rewrites last year, but I can't see why
> that should matter either. Again, hugepages had some special rules, I
> think (and that would explain why nobody normal sees this).
>
> Guys, please give this one a look.
>
>                          Linus
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-01 16:21           ` Robert Święcki
@ 2011-04-01 16:35             ` Linus Torvalds
  2011-04-02  4:01               ` Hui Zhu
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-01 16:35 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <robert@swiecki.net> wrote:
>
> Is it possible to turn it off via config flags? Looking into
> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
> always manifests itself somehow") and I have
> CONFIG_DEBUG_BUGVERBOSE=y.

Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table.

Maybe it's just kdb that is broken, and doesn't print it. I wouldn't
be surprised. It's not the first time I've seen debugging features
that just make debugging a mess.

> Anything that could help you debugging this? Uploading kernel image
> (unfortunately I've overwritten this one), dumping more kgdb data?

So in this case kgdb just dropped the most important data on the floor.

But if you have kdb active next time, print out the vma/old contents
in that function that has the BUG() in it.

> I must admit I'm not up-to-date with current linux kernel debugging
> techniques. The kernel config is here:
> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt
>
> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)

Oh, don't do that. -O0 makes the code totally unreadable (the compiler
just does _stupid_ things, making the asm code look so horrible that
you can't match it up against anything sane), and -fno-inline isn't
worth the pain either.

-Os is much better than those.

But in this case, just getting the filename and line number would have
made the thing moot anyway - without kdb it _should_ have said
something clear like

   kernel BUG at %s:%u!

where %s:%u is the filename and line number.

                          Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-01 15:44         ` Linus Torvalds
  2011-04-01 16:21           ` Robert Święcki
@ 2011-04-02  1:46           ` Hugh Dickins
  2011-04-04 12:46             ` Robert Święcki
  1 sibling, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-04-02  1:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Święcki, Andrew Morton, Miklos Szeredi,
	Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote:
>>
>> Hey, I'll apply your patch and check it out. In the meantime I
>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>
>> The oops is here:
>>
>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>
> That's not a NULL pointer dereference. That's a BUG_ON().
>
> And for some reason you've turned off the BUG_ON() messages, saving
> some tiny amount of memory.
>
> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
> would be this one:
>
>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>
> but it is possible that gcc has shuffled things around (so it _might_
> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
> get a filename and line number. One reason I hate -O2 in cases like
> this is that the basic block movement makes it way harder to actually
> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
> or whatever it's called).
>
> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>
> I do wonder what triggers this. Is it a huge-page vma? We seem to be
> lacking the check to see that mprotect() is on a hugepage boundary -
> and that seems bogus. Or am I missing some check? The new transparent
> hugepage support splits the page, but what if it's a _static_ hugepage
> thing?
>
> But why would that affect the radix_index thing? I have no idea. I'd
> like to blame the anon_vma rewrites last year, but I can't see why
> that should matter either. Again, hugepages had some special rules, I
> think (and that would explain why nobody normal sees this).
>
> Guys, please give this one a look.

I do intend to look, but I think it makes more sense to wait until
Robert has reproduced it (or something like it) with my debugging
patch in.

He's fuzzing, so no reason to get anxious about recent changes, it may
have been lurking there for years.  He did already report a
vma_prio_tree_add() crash, which led me to send him the patch: so the
issue seems to be reproducible, and the patch dumps out the
vma_area_structs involved, which has some hope of telling us more.

Down the years we've had about three earlier reports of crashes there:
so rare we've tended to put them down to bad memory or slab
corruption, but never had anything like a reproducible case to study
until now.

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-01 16:35             ` Linus Torvalds
@ 2011-04-02  4:01               ` Hui Zhu
  2011-04-04 13:02                 ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Hui Zhu @ 2011-04-02  4:01 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, Miklos Szeredi,
	Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Sat, Apr 2, 2011 at 00:35, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <robert@swiecki.net> wrote:
>>
>> Is it possible to turn it off via config flags? Looking into
>> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
>> always manifests itself somehow") and I have
>> CONFIG_DEBUG_BUGVERBOSE=y.
>
> Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table.
>
> Maybe it's just kdb that is broken, and doesn't print it. I wouldn't
> be surprised. It's not the first time I've seen debugging features
> that just make debugging a mess.
>
>> Anything that could help you debugging this? Uploading kernel image
>> (unfortunately I've overwritten this one), dumping more kgdb data?
>
> So in this case kgdb just dropped the most important data on the floor.
>
> But if you have kdb active next time, print out the vma/old contents
> in that function that has the BUG() in it.
>
>> I must admit I'm not up-to-date with current linux kernel debugging
>> techniques. The kernel config is here:
>> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt
>>
>> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)

Hi Robert,

I am not sure you can success with build trunk with  -O0 -fno-inline.
I suggest you try the patch in
http://code.google.com/p/kgtp/downloads/detail?name=co.patch.
It add a option in "Kernel hacking" called "Compile with almost no
optimization". It will make kernel be built without -O2. It support
x86_32, x86_64 and arm.

PS, maybe you can try kgtp (https://code.google.com/p/kgtp/)  debug your kernel.

Thanks,
Hui

>
> Oh, don't do that. -O0 makes the code totally unreadable (the compiler
> just does _stupid_ things, making the asm code look so horrible that
> you can't match it up against anything sane), and -fno-inline isn't
> worth the pain either.
>
> -Os is much better than those.
>
> But in this case, just getting the filename and line number would have
> made the thing moot anyway - without kdb it _should_ have said
> something clear like
>
>   kernel BUG at %s:%u!
>
> where %s:%u is the filename and line number.
>
>                          Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-02  1:46           ` Hugh Dickins
@ 2011-04-04 12:46             ` Robert Święcki
  2011-04-04 18:30               ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-04 12:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Sat, Apr 2, 2011 at 3:46 AM, Hugh Dickins <hughd@google.com> wrote:
> On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote:
>>>
>>> Hey, I'll apply your patch and check it out. In the meantime I
>>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>>
>>> The oops is here:
>>>
>>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>>
>> That's not a NULL pointer dereference. That's a BUG_ON().
>>
>> And for some reason you've turned off the BUG_ON() messages, saving
>> some tiny amount of memory.
>>
>> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
>> would be this one:
>>
>>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>>
>> but it is possible that gcc has shuffled things around (so it _might_
>> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
>> get a filename and line number. One reason I hate -O2 in cases like
>> this is that the basic block movement makes it way harder to actually
>> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
>> or whatever it's called).
>>
>> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>>
>> I do wonder what triggers this. Is it a huge-page vma? We seem to be
>> lacking the check to see that mprotect() is on a hugepage boundary -
>> and that seems bogus. Or am I missing some check? The new transparent
>> hugepage support splits the page, but what if it's a _static_ hugepage
>> thing?
>>
>> But why would that affect the radix_index thing? I have no idea. I'd
>> like to blame the anon_vma rewrites last year, but I can't see why
>> that should matter either. Again, hugepages had some special rules, I
>> think (and that would explain why nobody normal sees this).
>>
>> Guys, please give this one a look.
>
> I do intend to look, but I think it makes more sense to wait until
> Robert has reproduced it (or something like it) with my debugging
> patch in.

Hi Hugh,

I did two things, included your patch, and compiled with
CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for
~2 days under fuzzing (with getdents and readdir syscalls disabled in
the fuzzer). I don't think -Os has any bigger influence on how mm
internally works therefore I must attribute the change to your patch
(was it patch which fixes something or merely dumps vma structures in
case of any problem?).


==============
BTW, another problem arose in the meantime, not sure if anyhow related
to things we're discussing here, although 'btc 0' in kdb shows that
processor 0 hangs in sys_mlock - I did it in two different moments, to
exclude any coincidences. After those 2 days of fuzzing, 'ps wuax'
stopped working, i.e. it prints some output, then stops, cannot be
killed with -SIGKILL etc. I'll let it run for the time being, I can
dump more data in this PID 17750 if anybody wants:

strace:

# strace -f ps wwuax
....
open("/proc/17750/status", O_RDONLY)    = 6
read(6, "Name:\tiknowthis\nState:\tD (disk s"..., 1023) = 777
close(6)                                = 0
open("/proc/17750/cmdline", O_RDONLY)   = 6
read(6,

Process 17750 also cannot be killed. Attaching more data:

# cat /proc/17750/status
Name:	iknowthis
State:	D (disk sleep)
Tgid:	17750
Pid:	17750
PPid:	1
TracerPid:	0
Uid:	1001	1001	1001	1001
Gid:	1001	1001	1001	1001
FDSize:	64
Groups:	1001
VmPeak:	    7752 kB
VmSize:	    5760 kB
VmLck:	      32 kB
VmHWM:	    4892 kB
VmRSS:	    3068 kB
VmData:	    2472 kB
VmStk:	     408 kB
VmExe:	     160 kB
VmLib:	    2684 kB
VmPTE:	      44 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	218/16382
SigPnd:	0000000000000b00
ShdPnd:	0000400000000503
SigBlk:	0000000000000000
SigIgn:	0000000001001000
SigCgt:	0000000000000000
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	ffffffffffffffff
Cpus_allowed:	01
Cpus_allowed_list:	0
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	43330
nonvoluntary_ctxt_switches:	4436

# cat /proc/17750/wchan
call_rwsem_down_write_failed

# cat /proc/17750/maps (hangs)

(from kdb)

[0]kdb> btp 17750
Stack traceback for pid 17750
0xffff88011e772dc0    17750        1  0    0   D  0xffff88011e773240  iknowthis
<c> ffff88011cbcfb88<c> 0000000000000086<c> ffff88011cbcfb08<c>
ffff88011cbcffd8<c>
<c> 0000000000013f00<c> ffff88011e772dc0<c> ffff88011e773180<c>
ffff88011e773178<c>
<c> ffff88011cbce000<c> ffff88011cbcffd8<c> 0000000000013f00<c>
0000000000013f00<c>
Call Trace:
 [<ffffffff81e286ad>] rwsem_down_failed_common+0xdb/0x10d
 [<ffffffff81e286f2>] rwsem_down_write_failed+0x13/0x15
 [<ffffffff81416953>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff81e27da0>] ? down_write+0x25/0x27
 [<ffffffff8115f041>] do_coredump+0x14f/0x9a5
 [<ffffffff8114d4e2>] ? T.1006+0x17/0x32
 [<ffffffff810a45f0>] ? __dequeue_signal+0xfa/0x12f
 [<ffffffff8108a79c>] ? get_parent_ip+0x11/0x42
 [<ffffffff810a6406>] get_signal_to_deliver+0x3be/0x3e6
 [<ffffffff8103e0c1>] do_signal+0x72/0x67d
 [<ffffffff81096807>] ? child_wait_callback+0x0/0x58
 [<ffffffff81e28c28>] ? _raw_spin_unlock_irq+0x36/0x41
 [<ffffffff8108bb06>] ? finish_task_switch+0x4b/0xb9
 [<ffffffff8108c3ed>] ? schedule_tail+0x38/0x68
 [<ffffffff8103eb43>] ? ret_from_fork+0x13/0x80
 [<ffffffff8103e6f8>] do_notify_resume+0x2c/0x6e

[0]kdb>  btc 0
Stack traceback for pid 10350
0xffff88011b6badc0    10350        1  1    0   R  0xffff88011b6bb240 *iknowthis2
<c> ffff8800cfc03db8<c> 0000000000000000<c>
Call Trace:
 <#DB>  <<EOE>>  <IRQ>  [<ffffffff81518b03>] ? __handle_sysrq+0xbf/0x15c
 [<ffffffff81518d7d>] ? handle_sysrq+0x2c/0x2e
 [<ffffffff8152bd90>] ? serial8250_handle_port+0x157/0x2b2
 [<ffffffff810a1be8>] ? run_timer_softirq+0x2b3/0x2c2
 [<ffffffff8152bf4c>] ? serial8250_interrupt+0x61/0x111
 [<ffffffff810e6e52>] ? handle_IRQ_event+0x78/0x150
 [<ffffffff810ea044>] ? move_native_irq+0x19/0x6d
 [<ffffffff810e8d90>] ? handle_edge_irq+0xe3/0x12f
 [<ffffffff8104198f>] ? handle_irq+0x88/0x91
 [<ffffffff81e297a5>] ? do_IRQ+0x4d/0xb3
 [<ffffffff81e29193>] ? ret_from_intr+0x0/0x15
 <EOI>  [<ffffffff811336aa>] ? __mlock_vma_pages_range+0x49/0xad
 [<ffffffff8113370a>] ? __mlock_vma_pages_range+0xa9/0xad
 [<ffffffff811337c0>] ? do_mlock_pages+0xb2/0x118
 [<ffffffff81134002>] ? sys_mlock+0xe8/0xf6
 [<ffffffff8107d7e3>] ? ia32_sysret+0x0/0x5

[0]kdb> btc 1
Stack traceback for pid 9409
0xffff88011c9816e0     9409        1  1    1   R  0xffff88011c981b60  iknowthis2
<c> ffff88011dc21ec8

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-02  4:01               ` Hui Zhu
@ 2011-04-04 13:02                 ` Robert Święcki
  0 siblings, 0 replies; 37+ messages in thread
From: Robert Święcki @ 2011-04-04 13:02 UTC (permalink / raw)
  To: Hui Zhu
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, Miklos Szeredi,
	Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Sat, Apr 2, 2011 at 6:01 AM, Hui Zhu <teawater@gmail.com> wrote:
> On Sat, Apr 2, 2011 at 00:35, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <robert@swiecki.net> wrote:
>>>
>>> Is it possible to turn it off via config flags? Looking into
>>> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it
>>> always manifests itself somehow") and I have
>>> CONFIG_DEBUG_BUGVERBOSE=y.
>>
>> Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table.
>>
>> Maybe it's just kdb that is broken, and doesn't print it. I wouldn't
>> be surprised. It's not the first time I've seen debugging features
>> that just make debugging a mess.
>>
>>> Anything that could help you debugging this? Uploading kernel image
>>> (unfortunately I've overwritten this one), dumping more kgdb data?
>>
>> So in this case kgdb just dropped the most important data on the floor.
>>
>> But if you have kdb active next time, print out the vma/old contents
>> in that function that has the BUG() in it.
>>
>>> I must admit I'm not up-to-date with current linux kernel debugging
>>> techniques. The kernel config is here:
>>> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt
>>>
>>> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?)
>
> Hi Robert,
>
> I am not sure you can success with build trunk with  -O0 -fno-inline.
> I suggest you try the patch in
> http://code.google.com/p/kgtp/downloads/detail?name=co.patch.
> It add a option in "Kernel hacking" called "Compile with almost no
> optimization". It will make kernel be built without -O2. It support
> x86_32, x86_64 and arm.

HI,

Yeah.. -O0 doesn't build smoothly, it seems that building with -O0 is
not required right now, but I'll keep your patch in mind in case it
becomes necessary. Thanks for the tip.

> PS, maybe you can try kgtp (https://code.google.com/p/kgtp/)  debug your kernel.
>
>>
>> Oh, don't do that. -O0 makes the code totally unreadable (the compiler
>> just does _stupid_ things, making the asm code look so horrible that
>> you can't match it up against anything sane), and -fno-inline isn't
>> worth the pain either.
>>
>> -Os is much better than those.
>>
>> But in this case, just getting the filename and line number would have
>> made the thing moot anyway - without kdb it _should_ have said
>> something clear like
>>
>>   kernel BUG at %s:%u!
>>
>> where %s:%u is the filename and line number.
>>
>>                          Linus
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-04 12:46             ` Robert Święcki
@ 2011-04-04 18:30               ` Hugh Dickins
  2011-04-05 12:21                 ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-04-04 18:30 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Mon, Apr 4, 2011 at 5:46 AM, Robert Święcki <robert@swiecki.net> wrote:
> On Sat, Apr 2, 2011 at 3:46 AM, Hugh Dickins <hughd@google.com> wrote:
>> On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote:
>>>>
>>>> Hey, I'll apply your patch and check it out. In the meantime I
>>>> triggered another Oops (NULL-ptr deref via sys_mprotect).
>>>>
>>>> The oops is here:
>>>>
>>>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt
>>>
>>> That's not a NULL pointer dereference. That's a BUG_ON().
>>>
>>> And for some reason you've turned off the BUG_ON() messages, saving
>>> some tiny amount of memory.
>>>
>>> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it
>>> would be this one:
>>>
>>>        BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
>>>
>>> but it is possible that gcc has shuffled things around (so it _might_
>>> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd
>>> get a filename and line number. One reason I hate -O2 in cases like
>>> this is that the basic block movement makes it way harder to actually
>>> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE
>>> or whatever it's called).
>>>
>>> Anyway, I do find it worrying. The vma code shouldn't be this fragile.  Hugh?
>>>
>>> I do wonder what triggers this. Is it a huge-page vma? We seem to be
>>> lacking the check to see that mprotect() is on a hugepage boundary -
>>> and that seems bogus. Or am I missing some check? The new transparent
>>> hugepage support splits the page, but what if it's a _static_ hugepage
>>> thing?
>>>
>>> But why would that affect the radix_index thing? I have no idea. I'd
>>> like to blame the anon_vma rewrites last year, but I can't see why
>>> that should matter either. Again, hugepages had some special rules, I
>>> think (and that would explain why nobody normal sees this).
>>>
>>> Guys, please give this one a look.
>>
>> I do intend to look, but I think it makes more sense to wait until
>> Robert has reproduced it (or something like it) with my debugging
>> patch in.
>
> Hi Hugh,
>
> I did two things, included your patch, and compiled with
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for
> ~2 days under fuzzing (with getdents and readdir syscalls disabled in
> the fuzzer). I don't think -Os has any bigger influence on how mm
> internally works therefore I must attribute the change to your patch
> (was it patch which fixes something or merely dumps vma structures in
> case of any problem?).

I'm sorry, I should have explained the patch a little more.  Along
with dumping out the vma structs, it does change the BUG or BUGs there
to WARN_ONs, allowing the system to continue if it's not too badly
corrupted, though leaking some structure memory (if the structs have
been reused, it's probably not safe to assume we still have ownership
of them).  So if the problem has occurred again, it should be leaving
WARNING messages and vma struct dumps in your /var/log/messages -
please look for them and send them in if found.

Perhaps we should simply include the patch in mainline kernel: it
doesn't do much good just lingering in mmotm, but seems to be helping
your system to limp along longer.

>
>
> ==============
> BTW, another problem arose in the meantime, not sure if anyhow related
> to things we're discussing here, although 'btc 0' in kdb shows that
> processor 0 hangs in sys_mlock - I did it in two different moments, to
> exclude any coincidences. After those 2 days of fuzzing, 'ps wuax'
> stopped working, i.e. it prints some output, then stops, cannot be
> killed with -SIGKILL etc. I'll let it run for the time being, I can
> dump more data in this PID 17750 if anybody wants:
>
> strace:
>
> # strace -f ps wwuax
> ....
> open("/proc/17750/status", O_RDONLY)    = 6
> read(6, "Name:\tiknowthis\nState:\tD (disk s"..., 1023) = 777
> close(6)                                = 0
> open("/proc/17750/cmdline", O_RDONLY)   = 6
> read(6,
>
> Process 17750 also cannot be killed. Attaching more data:
>
> # cat /proc/17750/status
> Name:   iknowthis
> State:  D (disk sleep)
> Tgid:   17750
> Pid:    17750
> PPid:   1
> TracerPid:      0
> Uid:    1001    1001    1001    1001
> Gid:    1001    1001    1001    1001
> FDSize: 64
> Groups: 1001
> VmPeak:     7752 kB
> VmSize:     5760 kB
> VmLck:        32 kB
> VmHWM:      4892 kB
> VmRSS:      3068 kB
> VmData:     2472 kB
> VmStk:       408 kB
> VmExe:       160 kB
> VmLib:      2684 kB
> VmPTE:        44 kB
> VmSwap:        0 kB
> Threads:        1
> SigQ:   218/16382
> SigPnd: 0000000000000b00
> ShdPnd: 0000400000000503
> SigBlk: 0000000000000000
> SigIgn: 0000000001001000
> SigCgt: 0000000000000000
> CapInh: 0000000000000000
> CapPrm: 0000000000000000
> CapEff: 0000000000000000
> CapBnd: ffffffffffffffff
> Cpus_allowed:   01
> Cpus_allowed_list:      0
> Mems_allowed:   00000000,00000001
> Mems_allowed_list:      0
> voluntary_ctxt_switches:        43330
> nonvoluntary_ctxt_switches:     4436
>
> # cat /proc/17750/wchan
> call_rwsem_down_write_failed
>
> # cat /proc/17750/maps (hangs)
>
> (from kdb)
>
> [0]kdb> btp 17750
> Stack traceback for pid 17750
> 0xffff88011e772dc0    17750        1  0    0   D  0xffff88011e773240  iknowthis
> <c> ffff88011cbcfb88<c> 0000000000000086<c> ffff88011cbcfb08<c>
> ffff88011cbcffd8<c>
> <c> 0000000000013f00<c> ffff88011e772dc0<c> ffff88011e773180<c>
> ffff88011e773178<c>
> <c> ffff88011cbce000<c> ffff88011cbcffd8<c> 0000000000013f00<c>
> 0000000000013f00<c>
> Call Trace:
>  [<ffffffff81e286ad>] rwsem_down_failed_common+0xdb/0x10d
>  [<ffffffff81e286f2>] rwsem_down_write_failed+0x13/0x15
>  [<ffffffff81416953>] call_rwsem_down_write_failed+0x13/0x20
>  [<ffffffff81e27da0>] ? down_write+0x25/0x27
>  [<ffffffff8115f041>] do_coredump+0x14f/0x9a5
>  [<ffffffff8114d4e2>] ? T.1006+0x17/0x32
>  [<ffffffff810a45f0>] ? __dequeue_signal+0xfa/0x12f
>  [<ffffffff8108a79c>] ? get_parent_ip+0x11/0x42
>  [<ffffffff810a6406>] get_signal_to_deliver+0x3be/0x3e6
>  [<ffffffff8103e0c1>] do_signal+0x72/0x67d
>  [<ffffffff81096807>] ? child_wait_callback+0x0/0x58
>  [<ffffffff81e28c28>] ? _raw_spin_unlock_irq+0x36/0x41
>  [<ffffffff8108bb06>] ? finish_task_switch+0x4b/0xb9
>  [<ffffffff8108c3ed>] ? schedule_tail+0x38/0x68
>  [<ffffffff8103eb43>] ? ret_from_fork+0x13/0x80
>  [<ffffffff8103e6f8>] do_notify_resume+0x2c/0x6e
>
> [0]kdb>  btc 0
> Stack traceback for pid 10350
> 0xffff88011b6badc0    10350        1  1    0   R  0xffff88011b6bb240 *iknowthis2
> <c> ffff8800cfc03db8<c> 0000000000000000<c>
> Call Trace:
>  <#DB>  <<EOE>>  <IRQ>  [<ffffffff81518b03>] ? __handle_sysrq+0xbf/0x15c
>  [<ffffffff81518d7d>] ? handle_sysrq+0x2c/0x2e
>  [<ffffffff8152bd90>] ? serial8250_handle_port+0x157/0x2b2
>  [<ffffffff810a1be8>] ? run_timer_softirq+0x2b3/0x2c2
>  [<ffffffff8152bf4c>] ? serial8250_interrupt+0x61/0x111
>  [<ffffffff810e6e52>] ? handle_IRQ_event+0x78/0x150
>  [<ffffffff810ea044>] ? move_native_irq+0x19/0x6d
>  [<ffffffff810e8d90>] ? handle_edge_irq+0xe3/0x12f
>  [<ffffffff8104198f>] ? handle_irq+0x88/0x91
>  [<ffffffff81e297a5>] ? do_IRQ+0x4d/0xb3
>  [<ffffffff81e29193>] ? ret_from_intr+0x0/0x15
>  <EOI>  [<ffffffff811336aa>] ? __mlock_vma_pages_range+0x49/0xad
>  [<ffffffff8113370a>] ? __mlock_vma_pages_range+0xa9/0xad
>  [<ffffffff811337c0>] ? do_mlock_pages+0xb2/0x118
>  [<ffffffff81134002>] ? sys_mlock+0xe8/0xf6
>  [<ffffffff8107d7e3>] ? ia32_sysret+0x0/0x5
>
> [0]kdb> btc 1
> Stack traceback for pid 9409
> 0xffff88011c9816e0     9409        1  1    1   R  0xffff88011c981b60  iknowthis2
> <c> ffff88011dc21ec8

Sorry, I've no time to think about this one at the moment (at LSF).
Does this look similar to what you previously reported on mlock?

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-04 18:30               ` Hugh Dickins
@ 2011-04-05 12:21                 ` Robert Święcki
  2011-04-05 15:37                   ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-05 12:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

>> Hi Hugh,
>>
>> I did two things, included your patch, and compiled with
>> CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for
>> ~2 days under fuzzing (with getdents and readdir syscalls disabled in
>> the fuzzer). I don't think -Os has any bigger influence on how mm
>> internally works therefore I must attribute the change to your patch
>> (was it patch which fixes something or merely dumps vma structures in
>> case of any problem?).
>
> I'm sorry, I should have explained the patch a little more.  Along
> with dumping out the vma structs, it does change the BUG or BUGs there
> to WARN_ONs, allowing the system to continue if it's not too badly
> corrupted, though leaking some structure memory (if the structs have
> been reused, it's probably not safe to assume we still have ownership
> of them).  So if the problem has occurred again, it should be leaving
> WARNING messages and vma struct dumps in your /var/log/messages -
> please look for them and send them in if found.

Here it is, I'll leave it in this state (kdb) in case you need some
remote debugging

<4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110()
<4>[ 1523.884381] Hardware name: Precision WorkStation 390
<4>[ 1523.889703] Pid: 13801, comm: iknowthis2 Not tainted 2.6.38 #2
<4>[ 1523.895544] Call Trace:
<4>[ 1523.898000]  [<ffffffff810b5d2a>] ? warn_slowpath_common+0x7a/0xb0
<4>[ 1523.904195]  [<ffffffff810b5d7a>] ? warn_slowpath_null+0x1a/0x20
<4>[ 1523.910210]  [<ffffffff8116b4b3>] ? vma_prio_tree_add+0x43/0x110
<4>[ 1523.916226]  [<ffffffff8116b5c1>] ? vma_prio_tree_insert+0x41/0x60
<4>[ 1523.922416]  [<ffffffff8117b69c>] ? __vma_link_file+0x4c/0x90
<4>[ 1523.928171]  [<ffffffff8117c078>] ? vma_adjust+0xe8/0x570
<4>[ 1523.933579]  [<ffffffff8117c641>] ? __split_vma+0x141/0x280
<4>[ 1523.939157]  [<ffffffff8117c7a5>] ? split_vma+0x25/0x30
<4>[ 1523.944391]  [<ffffffff811733f6>] ? sys_madvise+0x6a6/0x720
<4>[ 1523.949969]  [<ffffffff810a4e09>] ? sub_preempt_count+0xa9/0xe0
<4>[ 1523.955900]  [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c
<4>[ 1523.962347]  [<ffffffff8109a653>] ? ia32_sysret+0x0/0x5
<4>[ 1523.967580]  [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c
<4>[ 1523.974026] ---[ end trace c13483b7eb481afd ]---
<4>[ 1523.978650] vm_area_struct at ffff880120bda508:
<4>[ 1523.983199]  ffff88011eb5aa00 00000000f72f3000 00000000f73f0000
ffff88011b8eaa10
<4>[ 1523.990674]  ffff88011b8ea228 0000000000000027 00000000000101ff
ffff88011b8ea6b1
<4>[ 1523.998151]  ffff88011e390820 ffff88011b8ea260 ffff880120796780
ffff880120bdad40
<4>[ 1524.005624]            (null)           (null) ffff88011ed5b910
ffff88011ed5b1f0
<4>[ 1524.013103]  ffff88011f72b168 ffffffff82427480 ffffffffffffff03
ffff8800793ff0c0
<4>[ 1524.020581]            (null)           (null)           (null)
<4>[ 1524.026556] vm_area_struct at ffff880120bdacf0:
<4>[ 1524.031110]  ffff88011eb5a300 00000000f72f3000 00000000f7400000
ffff88011f6c6f18
<4>[ 1524.038584]  ffff88011b5c9da8 0000000000000027 00000000000101ff
ffff8801206f0c71
<4>[ 1524.046062]  ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40
ffff880120bdad40
<4>[ 1524.053536]  ffff880120bda558           (null) ffff88011f758ee0
ffff88011f7583a0
<4>[ 1524.061016]  ffff88011f556690 ffffffff82427480 ffffffffffffff03
ffff8800793ff0c0
<4>[ 1524.068491]            (null)           (null)           (null)

[1]kdb> pid
KDB current process is iknowthis2(pid=13801)

[1]kdb> btp 13801
Stack traceback for pid 13801
0xffff88011ec35cc0    13801     4516  1    1   R  0xffff88011ec36140 *iknowthis2
<c> ffff88011c5d1d68<c> 0000000000000000<c> ffff88011f7a3eb8<c>
ffff88011c5d1d88<c>
<c> ffffffff8116b3b9<c> ffff88011b8ea730<c> ffff88011b8eaa10<c>
ffff88011c5d1e28<c>
<c> ffffffff8117c0cb<c> 00000000f73f0000<c> ffff88011eb5aa00<c>
ffff88011f72b168<c>
Call Trace:
 [<ffffffff8116b3b9>] ? vma_prio_tree_remove+0xc9/0x110
 [<ffffffff8117c0cb>] ? vma_adjust+0x13b/0x570
 [<ffffffff8117c641>] ? __split_vma+0x141/0x280
 [<ffffffff8117c7a5>] ? split_vma+0x25/0x30
 [<ffffffff811733f6>] ? sys_madvise+0x6a6/0x720
 [<ffffffff810a4e09>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [<ffffffff8109a653>] ? ia32_sysret+0x0/0x5
 [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c

[1]kdb> rd
ax: ffff88011b8ea780  bx: ffff880120796678  cx: ffff8801207966c8
dx: ffff8801207966c8  si: ffff8801207966c8  di: ffff880027d3bec8
bp: ffff88011c5d1d68  sp: ffff88011c5d1d68  r8: 0000000000000000
r9: ffff88011c5d1946  r10: ffff88011c5d1945  r11: 0000000000000000
r12: ffff88011b8eaa10  r13: ffff880120bda508  r14: ffff880027d3bea8
r15: ffff88011eb5aa00  ip: ffffffff8158d935  flags: 00010297  cs: 00000010

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-05 12:21                 ` Robert Święcki
@ 2011-04-05 15:37                   ` Linus Torvalds
  2011-04-06 14:47                     ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-05 15:37 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Tue, Apr 5, 2011 at 5:21 AM, Robert Święcki <robert@swiecki.net> wrote:
>
> Here it is, I'll leave it in this state (kdb) in case you need some
> remote debugging
>
> <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110()
> <4>[ 1523.978650] vm_area_struct at ffff880120bda508:
> <4>[ 1523.983199]  ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10
> <4>[ 1523.990674]  ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1
> <4>[ 1523.998151]  ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40
> <4>[ 1524.005624]            (null)           (null) ffff88011ed5b910 ffff88011ed5b1f0
> <4>[ 1524.013103]  ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
> <4>[ 1524.020581]            (null)           (null)           (null)

vma->vm_start/end is 0xf72f3000-0xf73f0000

> <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0:
> <4>[ 1524.031110]  ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18
> <4>[ 1524.038584]  ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71
> <4>[ 1524.046062]  ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40
> <4>[ 1524.053536]  ffff880120bda558           (null) ffff88011f758ee0 ffff88011f7583a0
> <4>[ 1524.061016]  ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
> <4>[ 1524.068491]            (null)           (null)           (null)

vma->vm_start/end is 0xf72f3000-0xf7400000.

If I read those right, then the vm_pgoff (RADIX_INDEX for the
prio-tree) is ffffffffffffff03 for both cases. That doesn't look good.
How do we get a negative pg_off for a file mapping?

Also, since they have a different size, they should have a different
HEAP_INDEX. That's why we BUG_ON() - with a different HEAP_INDEX,
shouldn't that mean that the prio_tree_insert() logic should create a
new node for it?

I dunno. But that odd negative pg_off thing makes me think there is
some overflow issue (ie HEAP_INDEX being pg_off + size ends up
fluctuating between really big and really small). So I'd suspect THAT
as the main reason.

But maybe I'm mis-reading the dump, and the ffffffffffffff03 isn't
vm_pgoff at all.

Hugh?

                              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-05 15:37                   ` Linus Torvalds
@ 2011-04-06 14:47                     ` Hugh Dickins
  2011-04-06 15:32                       ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-04-06 14:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Święcki, Andrew Morton, Miklos Szeredi,
	Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Tue, Apr 5, 2011 at 8:37 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 5, 2011 at 5:21 AM, Robert Święcki <robert@swiecki.net> wrote:
>>
>> Here it is, I'll leave it in this state (kdb) in case you need some
>> remote debugging
>>
>> <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110()
>> <4>[ 1523.978650] vm_area_struct at ffff880120bda508:
>> <4>[ 1523.983199]  ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10
>> <4>[ 1523.990674]  ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1
>> <4>[ 1523.998151]  ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40
>> <4>[ 1524.005624]            (null)           (null) ffff88011ed5b910 ffff88011ed5b1f0
>> <4>[ 1524.013103]  ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
>> <4>[ 1524.020581]            (null)           (null)           (null)
>
> vma->vm_start/end is 0xf72f3000-0xf73f0000
>
>> <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0:
>> <4>[ 1524.031110]  ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18
>> <4>[ 1524.038584]  ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71
>> <4>[ 1524.046062]  ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40
>> <4>[ 1524.053536]  ffff880120bda558           (null) ffff88011f758ee0 ffff88011f7583a0
>> <4>[ 1524.061016]  ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0
>> <4>[ 1524.068491]            (null)           (null)           (null)
>
> vma->vm_start/end is 0xf72f3000-0xf7400000.
>
> If I read those right, then the vm_pgoff (RADIX_INDEX for the
> prio-tree) is ffffffffffffff03 for both cases. That doesn't look good.
> How do we get a negative pg_off for a file mapping?

Yes, I think that's probably at the root of it.  Robert is using a
fuzzer, and it's a 32-bit executable running on a 64-bit kernel: I
suspect there's somewhere on our compat path where we've not validated
incoming mmap offset properly.

Hmm, but I don't see anything wrong there.

>
> Also, since they have a different size, they should have a different
> HEAP_INDEX. That's why we BUG_ON() - with a different HEAP_INDEX,
> shouldn't that mean that the prio_tree_insert() logic should create a
> new node for it?

Yes.

>
> I dunno. But that odd negative pg_off thing makes me think there is
> some overflow issue (ie HEAP_INDEX being pg_off + size ends up
> fluctuating between really big and really small). So I'd suspect THAT
> as the main reason.

Yes, one of the vmas is such that the end offset (pgoff of next page
after) would be 0, and for the other it would be 16.  There's sure to
be places, inside the prio_tree code and outside it, where we rely
upon pgoff not wrapping around - wrap should be prevented by original
validation of arguments.

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-06 14:47                     ` Hugh Dickins
@ 2011-04-06 15:32                       ` Linus Torvalds
  2011-04-06 15:43                         ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-06 15:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Robert Święcki, Andrew Morton, Miklos Szeredi,
	Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1381 bytes --]

On Wed, Apr 6, 2011 at 7:47 AM, Hugh Dickins <hughd@google.com> wrote:
>>
>> I dunno. But that odd negative pg_off thing makes me think there is
>> some overflow issue (ie HEAP_INDEX being pg_off + size ends up
>> fluctuating between really big and really small). So I'd suspect THAT
>> as the main reason.
>
> Yes, one of the vmas is such that the end offset (pgoff of next page
> after) would be 0, and for the other it would be 16.  There's sure to
> be places, inside the prio_tree code and outside it, where we rely
> upon pgoff not wrapping around - wrap should be prevented by original
> validation of arguments.

Well, we _do_ validate them in do_mmap_pgoff(), which is the main
routine for all the mmap() system calls, and the main way to get a new
mapping.

There are other ways, like do_brk(), but afaik that always sets
vm_pgoff to the virtual address (shifted), so again the new mapping
should be fine.

So when a new mapping is created, it should all be ok.

But I think mremap() may end up expanding it without doing the same
overflow check.

Do you see any other way to get this situation? Does the vma dump give
you any hint about where it came from?

Robert - here's a (UNTESTED!) patch to make mremap() be a bit more
careful about vm_pgoff when growing a mapping. Does it make any
difference?

                            Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 771 bytes --]

 mm/mremap.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 1de98d492ddc..a7c1f9f9b941 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -277,9 +277,16 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (old_len > vma->vm_end - addr)
 		goto Efault;
 
-	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) {
-		if (new_len > old_len)
+	/* Need to be careful about a growing mapping */
+	if (new_len > old_len) {
+		unsigned long pgoff;
+
+		if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
 			goto Efault;
+		pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
+		pgoff += vma->vm_pgoff;
+		if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
+			goto Einval;
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-06 15:32                       ` Linus Torvalds
@ 2011-04-06 15:43                         ` Hugh Dickins
  2011-04-06 15:59                           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-04-06 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Robert Święcki, Andrew Morton,
	Miklos Szeredi, Michel Lespinasse, Eric W. Biederman,
	linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2555 bytes --]

On Wed, 6 Apr 2011, Linus Torvalds wrote:
> On Wed, Apr 6, 2011 at 7:47 AM, Hugh Dickins <hughd@google.com> wrote:
> >>
> >> I dunno. But that odd negative pg_off thing makes me think there is
> >> some overflow issue (ie HEAP_INDEX being pg_off + size ends up
> >> fluctuating between really big and really small). So I'd suspect THAT
> >> as the main reason.
> >
> > Yes, one of the vmas is such that the end offset (pgoff of next page
> > after) would be 0, and for the other it would be 16.  There's sure to
> > be places, inside the prio_tree code and outside it, where we rely
> > upon pgoff not wrapping around - wrap should be prevented by original
> > validation of arguments.
> 
> Well, we _do_ validate them in do_mmap_pgoff(), which is the main
> routine for all the mmap() system calls, and the main way to get a new
> mapping.
> 
> There are other ways, like do_brk(), but afaik that always sets
> vm_pgoff to the virtual address (shifted), so again the new mapping
> should be fine.
> 
> So when a new mapping is created, it should all be ok.
> 
> But I think mremap() may end up expanding it without doing the same
> overflow check.
> 
> Do you see any other way to get this situation? Does the vma dump give
> you any hint about where it came from?
> 
> Robert - here's a (UNTESTED!) patch to make mremap() be a bit more
> careful about vm_pgoff when growing a mapping. Does it make any
> difference?

I'd come to the same conclusion: the original page_mapped BUG has itself
suggested that mremap() is getting used.

I was about to send you my own UNTESTED patch: let me append it anyway,
I think it is more correct than yours (it's the offset of vm_end we need
to worry about, and there's the funny old_len,new_len stuff).  See what
you think - sorry, I'm going out now.

Hugh

--- 2.6.38/mm/mremap.c	2011-03-14 18:20:32.000000000 -0700
+++ linux/mm/mremap.c	2011-04-06 08:31:46.000000000 -0700
@@ -282,6 +282,12 @@ static struct vm_area_struct *vma_to_res
 			goto Efault;
 	}
 
+	if (vma->vm_file && new_len > old_len) {
+		pgoff_t endoff = linear_page_index(vma, vma->vm_end);
+		if (endoff + ((new_len - old_len) >> PAGE_SHIFT) < endoff)
+			goto Eoverflow;
+	}
+
 	if (vma->vm_flags & VM_LOCKED) {
 		unsigned long locked, lock_limit;
 		locked = mm->locked_vm << PAGE_SHIFT;
@@ -311,6 +317,8 @@ Enomem:
 	return ERR_PTR(-ENOMEM);
 Eagain:
 	return ERR_PTR(-EAGAIN);
+Eoverflow:
+	return ERR_PTR(-EOVERFLOW);
 }
 
 static unsigned long mremap_to(unsigned long addr,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-06 15:43                         ` Hugh Dickins
@ 2011-04-06 15:59                           ` Linus Torvalds
  2011-04-06 17:54                             ` Robert Święcki
  2011-04-07 14:17                             ` Hugh Dickins
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2011-04-06 15:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Robert Święcki, Andrew Morton, Miklos Szeredi,
	Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <hughd@google.com> wrote:
>
> I was about to send you my own UNTESTED patch: let me append it anyway,
> I think it is more correct than yours (it's the offset of vm_end we need
> to worry about, and there's the funny old_len,new_len stuff).

Umm. That's what my patch did too. The

   pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;

is the "offset of the pgoff" from the original mapping, then we do

   pgoff += vma->vm_pgoff;

to get the pgoff of the new mapping, and then we do

   if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)

to check that the new mapping is ok.

I think yours is equivalent, just a different (and odd - that
linear_page_index() thing will do lots of unnecessary shifts and
hugepage crap) way of writing it.

> See what you think - sorry, I'm going out now.

I think _yours_ is conceptually buggy, because I think that test for
"vma->vm_file" is wrong.

Yes, new anonymous mappings set vm_pgoff to the virtual address, but
that's not true for mremap() moving them around, afaik.

Admittedly it's really hard to get to the overflow case, because the
address is shifted down, so even if you start out with an anonymous
mmap at a high address (to get a big vm_off), and then move it down
and expand it (to get a big size), I doubt you can possibly overflow.
But I still don't think that the test for vm_file is semantically
sensible, even if it might not _matter_.

But whatever. I suspect both our patches are practically doing the
same thing, and it would be interesting to hear if it actually fixes
the issue. Maybe there is some other way to mess up vm_pgoff that I
can't think of right now.

                                  Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-06 15:59                           ` Linus Torvalds
@ 2011-04-06 17:54                             ` Robert Święcki
  2011-04-07 12:41                               ` Robert Święcki
  2011-04-07 14:17                             ` Hugh Dickins
  1 sibling, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-06 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Wed, Apr 6, 2011 at 5:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <hughd@google.com> wrote:
>>
>> I was about to send you my own UNTESTED patch: let me append it anyway,
>> I think it is more correct than yours (it's the offset of vm_end we need
>> to worry about, and there's the funny old_len,new_len stuff).
>
> Umm. That's what my patch did too. The
>
>   pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>
> is the "offset of the pgoff" from the original mapping, then we do
>
>   pgoff += vma->vm_pgoff;
>
> to get the pgoff of the new mapping, and then we do
>
>   if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
>
> to check that the new mapping is ok.
>
> I think yours is equivalent, just a different (and odd - that
> linear_page_index() thing will do lots of unnecessary shifts and
> hugepage crap) way of writing it.
>
>> See what you think - sorry, I'm going out now.
>
> I think _yours_ is conceptually buggy, because I think that test for
> "vma->vm_file" is wrong.
>
> Yes, new anonymous mappings set vm_pgoff to the virtual address, but
> that's not true for mremap() moving them around, afaik.
>
> Admittedly it's really hard to get to the overflow case, because the
> address is shifted down, so even if you start out with an anonymous
> mmap at a high address (to get a big vm_off), and then move it down
> and expand it (to get a big size), I doubt you can possibly overflow.
> But I still don't think that the test for vm_file is semantically
> sensible, even if it might not _matter_.
>
> But whatever. I suspect both our patches are practically doing the
> same thing, and it would be interesting to hear if it actually fixes
> the issue. Maybe there is some other way to mess up vm_pgoff that I
> can't think of right now.

Testing with Linus' patch. Will let you know in a few hours.

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-06 17:54                             ` Robert Święcki
@ 2011-04-07 12:41                               ` Robert Święcki
  2011-04-07 14:24                                 ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-07 12:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

>>> I was about to send you my own UNTESTED patch: let me append it anyway,
>>> I think it is more correct than yours (it's the offset of vm_end we need
>>> to worry about, and there's the funny old_len,new_len stuff).
>>
>> Umm. That's what my patch did too. The
>>
>>   pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>>
>> is the "offset of the pgoff" from the original mapping, then we do
>>
>>   pgoff += vma->vm_pgoff;
>>
>> to get the pgoff of the new mapping, and then we do
>>
>>   if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
>>
>> to check that the new mapping is ok.
>>
>> I think yours is equivalent, just a different (and odd - that
>> linear_page_index() thing will do lots of unnecessary shifts and
>> hugepage crap) way of writing it.
>>
>>> See what you think - sorry, I'm going out now.
>>
>> I think _yours_ is conceptually buggy, because I think that test for
>> "vma->vm_file" is wrong.
>>
>> Yes, new anonymous mappings set vm_pgoff to the virtual address, but
>> that's not true for mremap() moving them around, afaik.
>>
>> Admittedly it's really hard to get to the overflow case, because the
>> address is shifted down, so even if you start out with an anonymous
>> mmap at a high address (to get a big vm_off), and then move it down
>> and expand it (to get a big size), I doubt you can possibly overflow.
>> But I still don't think that the test for vm_file is semantically
>> sensible, even if it might not _matter_.
>>
>> But whatever. I suspect both our patches are practically doing the
>> same thing, and it would be interesting to hear if it actually fixes
>> the issue. Maybe there is some other way to mess up vm_pgoff that I
>> can't think of right now.
>
> Testing with Linus' patch. Will let you know in a few hours.

Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h.

I can add some printk in this condition, and let it run for a few days
(I will not have access to my testing machine throughout that time),
if you think this will confirm your hypothesis.

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-06 15:59                           ` Linus Torvalds
  2011-04-06 17:54                             ` Robert Święcki
@ 2011-04-07 14:17                             ` Hugh Dickins
  1 sibling, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2011-04-07 14:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Swiecki, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4160 bytes --]

On Wed, 6 Apr 2011, Linus Torvalds wrote:
> On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <hughd@google.com> wrote:
> >
> > I was about to send you my own UNTESTED patch: let me append it anyway,
> > I think it is more correct than yours (it's the offset of vm_end we need
> > to worry about, and there's the funny old_len,new_len stuff).
> 
> Umm. That's what my patch did too. The
> 
>    pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
> 
> is the "offset of the pgoff" from the original mapping, then we do
> 
>    pgoff += vma->vm_pgoff;
> 
> to get the pgoff of the new mapping, and then we do
> 
>    if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
> 
> to check that the new mapping is ok.

Right, I was forgetting the semantics for mremap when
addr + old_len < vma->vm_end.  It has to move out the
old section and extend it elsewhere, it does not affect
the page just before vma->vm_end at all.  So mine was
indeed a more complicated way of doing yours.

> 
> I think yours is equivalent, just a different (and odd - that
> linear_page_index() thing will do lots of unnecessary shifts and
> hugepage crap) way of writing it.

I was trying to use the common function provided: but it's
actually wrong, that's a function for getting the value found
in page->index (in units of PAGE_CACHE_SIZE), whereas here we
want the value found in vm_pgoff (in units of PAGE_SIZE).

Of course PAGE_CACHE_SIZE has equalled PAGE_SIZE everywhere but in
some patches by Christoph Lameter a few years back, so there isn't
an effective difference; but I was wrong to use that function.

> 
> > See what you think - sorry, I'm going out now.
> 
> I think _yours_ is conceptually buggy, because I think that test for
> "vma->vm_file" is wrong.

Just being cautious: we cannot hit the BUG in prio_tree.c when we're
dealing with an anonymous mapping, and I didn't want to think about
anonymous at the time.

> 
> Yes, new anonymous mappings set vm_pgoff to the virtual address, but
> that's not true for mremap() moving them around, afaik.
> 
> Admittedly it's really hard to get to the overflow case, because the
> address is shifted down, so even if you start out with an anonymous
> mmap at a high address (to get a big vm_off), and then move it down
> and expand it (to get a big size), I doubt you can possibly overflow.
> But I still don't think that the test for vm_file is semantically
> sensible, even if it might not _matter_.

The strangest case is when a 64-bit kernel execs a 32-bit executable,
preparing the stack with a very high virtual address which goes into
vm_pgoff (shifted by PAGE_SHIFT), then moves that stack down into the
32-bit address space but leaving it with the original high vm_pgoff.

I think you are now excluding some wild anonymous cases which were
allowed before, and gave no trouble - vma_address() looks like a wrap
won't upset it.  But they're not cases which anyone is likely to do,
and safer to keep the anon rules in synch with the file rules.

> 
> But whatever. I suspect both our patches are practically doing the
> same thing, and it would be interesting to hear if it actually fixes
> the issue. Maybe there is some other way to mess up vm_pgoff that I
> can't think of right now.

Here's yours inline below:

Acked-by: Hugh Dickins <hughd@google.com>
---

 mm/mremap.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 1de98d492ddc..a7c1f9f9b941 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -277,9 +277,16 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (old_len > vma->vm_end - addr)
 		goto Efault;
 
-	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) {
-		if (new_len > old_len)
+	/* Need to be careful about a growing mapping */
+	if (new_len > old_len) {
+		unsigned long pgoff;
+
+		if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
 			goto Efault;
+		pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
+		pgoff += vma->vm_pgoff;
+		if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
+			goto Einval;
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-07 12:41                               ` Robert Święcki
@ 2011-04-07 14:24                                 ` Hugh Dickins
  2011-04-12  9:58                                   ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2011-04-07 14:24 UTC (permalink / raw)
  To: Robert Swiecki
  Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Thu, 7 Apr 2011, Robert Swiecki wrote:
> >
> > Testing with Linus' patch. Will let you know in a few hours.
> 
> Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h.
> 
> I can add some printk in this condition, and let it run for a few days
> (I will not have access to my testing machine throughout that time),
> if you think this will confirm your hypothesis.

That's great, thanks Robert.  If the machine has nothing better to do,
then it would be nice to let it run a little longer (a few days if that's
what suits you), but it does look good so far.  Though I'm afraid you'll
now discover something else entirely ;)

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-07 14:24                                 ` Hugh Dickins
@ 2011-04-12  9:58                                   ` Robert Święcki
  2011-04-12 14:21                                     ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-12  9:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Thu, Apr 7, 2011 at 4:24 PM, Hugh Dickins <hughd@google.com> wrote:
> On Thu, 7 Apr 2011, Robert Swiecki wrote:
>> >
>> > Testing with Linus' patch. Will let you know in a few hours.
>>
>> Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h.
>>
>> I can add some printk in this condition, and let it run for a few days
>> (I will not have access to my testing machine throughout that time),
>> if you think this will confirm your hypothesis.
>
> That's great, thanks Robert.  If the machine has nothing better to do,
> then it would be nice to let it run a little longer (a few days if that's
> what suits you), but it does look good so far.  Though I'm afraid you'll
> now discover something else entirely ;)

Ok, I added printk here:

        if (new_len > old_len) {
                unsigned long pgoff;

                if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
                        goto Efault;
                pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
                pgoff += vma->vm_pgoff;
                if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) {
                        printk("VMA_TO_RESIZE: ADDR:%lx OLD_LEN:%lx
NEW_LEN:%lx PGOFF: %lx VMA->VM_START:%lx VMA->VM_FLAGS:%lx",
                                addr, old_len, new_len, pgoff,
vma->vm_start, vma->vm_flags);

                        goto Einval;
                }
        }


and after a few mins of fuzzing I get:

[  584.224028] VMA_TO_RESIZE: ADDR:f751f000 OLD_LEN:6000 NEW_LEN:c000
PGOFF: fffffffffffffffa VMA->VM_START:f751f000 VMA->VM_FLAGS:2321fa
[  639.777561] VMA_TO_RESIZE: ADDR:f751f000 OLD_LEN:6000 NEW_LEN:b000
PGOFF: fffffffffffffffa VMA->VM_START:f751f000 VMA->VM_FLAGS:2301f8

So, if this case is not caught later on in the code, I guess it solves
the problem. During the fuzzing I didn't experience any panic's, but
some other problems arose, i.e. cannot read /proc/<pid>/maps for some
processes (sys_read hangs, and such process cannot be killed or
stopped with any signal, still it's running (R state) and using CPU -
I'll submit another report for that).

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-12  9:58                                   ` Robert Święcki
@ 2011-04-12 14:21                                     ` Linus Torvalds
       [not found]                                       ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com>
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-12 14:21 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Tue, Apr 12, 2011 at 2:58 AM, Robert Święcki <robert@swiecki.net> wrote:
>
> So, if this case is not caught later on in the code, I guess it solves
> the problem. During the fuzzing I didn't experience any panic's, but
> some other problems arose, i.e. cannot read /proc/<pid>/maps for some
> processes (sys_read hangs, and such process cannot be killed or
> stopped with any signal, still it's running (R state) and using CPU -
> I'll submit another report for that).

Hmm. Sounds like an endless loop in kernel mode.

Use "perf record -ag" as root, it should show up very clearly in the report.

                          Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
       [not found]                                       ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com>
@ 2011-04-12 16:17                                         ` Robert Święcki
  2011-04-12 17:19                                         ` Linus Torvalds
  1 sibling, 0 replies; 37+ messages in thread
From: Robert Święcki @ 2011-04-12 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

>>> So, if this case is not caught later on in the code, I guess it solves
>>> the problem. During the fuzzing I didn't experience any panic's, but
>>> some other problems arose, i.e. cannot read /proc/<pid>/maps for some
>>> processes (sys_read hangs, and such process cannot be killed or
>>> stopped with any signal, still it's running (R state) and using CPU -
>>> I'll submit another report for that).
>>
>> Hmm. Sounds like an endless loop in kernel mode.
>>
>> Use "perf record -ag" as root, it should show up very clearly in the report.
>>
>>                          Linus
>
> I've put some data here -
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/4345dcc4f7750ce2
> - I think it's somewhat connected (sys_mlock appears on both cases).
>
> Attaching perf data (for 2.6.38) + kdb dumpall + procdump for process 14158
>
> Those 3 processes cannot be stopped/killed
>
> 14158 66.2  0.0   8380  3012 ?  RL  /tmp/iknowthis
> 17100 63.6  0.1  18248  4004 ?  RL  /tmp/iknowthis
> 19772 63.8  0.0   4000  1888 ?   RL  /tmp/iknowthis

Also, the system doesn't look usable after such fuzzing (executing a
few times some pretty deterministic program)

root@ise-test:~# gcc -m32 mlock.c  -o mlock

root@ise-test:~# ./mlock
./mlock: relocation error: ./mlock: symbol perror, version GLIBC_2.0
not defined in file libc.so.6 with link time reference

root@ise-test:~# ./mlock
mmap: Success
RET: 0xf751f000
mremap: Invalid argument
RET: 0xffffffff

root@ise-test:~# ./mlock
Segmentation fault

root@ise-test:~# dmesg | tail -n 1
[ 5164.961568] mlock[7097]: segfault at 0 ip           (null) sp
00000000ff8a00d4 error 14 in mlock[8048000+1000]

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
       [not found]                                       ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com>
  2011-04-12 16:17                                         ` Robert Święcki
@ 2011-04-12 17:19                                         ` Linus Torvalds
  2011-04-12 18:59                                           ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-12 17:19 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1701 bytes --]

On Tue, Apr 12, 2011 at 8:48 AM, Robert Święcki <robert@swiecki.net> wrote:
>>
>> Hmm. Sounds like an endless loop in kernel mode.
>>
>> Use "perf record -ag" as root, it should show up very clearly in the report.
>
> I've put some data here -
> http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/4345dcc4f7750ce2
> - I think it's somewhat connected (sys_mlock appears on both cases).

Ok, so it's definitely sys_mlock.

And I suspect it's due to commit 53a7706d5ed8 somehow looping forever.

One possible cause would be how that commit made things care deeply
about the return value of __get_user_pages(), and in particular what
happens when that return value is zero. It ends up looping forever in
do_mlock_pages() for that case, because it does

                nend = nstart + ret * PAGE_SIZE;

so now the next round we'll set "nstart = nend" and start all over.

I see at least one way __get_user_pages() will return zero, and it's
if it is passed a npages of 0 to begin with. Which can easily happen
if you try to mlock() the first page of a stack segment: the code will
jump over that stack segment page, and then have nothing to do, and
return zero. So then do_mlock_pages() will just keep on trying again.

THIS IS A HACKY AND UNTESTED PATCH!

It's ugly as hell, because the real problem is do_mlock_pages() caring
too damn much about the return value, and us hiding the whole stack
page thing in that function. I wouldn't want to commit it as-is, but
if you can easily reproduce the problem, it's a good patch to test out
the theory. Assuming I didn't screw something up.

Again, TOTALLY UNTESTED!

                           Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1066 bytes --]

 mm/mlock.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 2689a08c79af..080c219973ea 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -162,6 +162,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	unsigned long addr = start;
 	int nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
+	long retval, offset;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(end   & ~PAGE_MASK);
@@ -189,13 +190,20 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		gup_flags |= FOLL_MLOCK;
 
 	/* We don't try to access the guard page of a stack vma */
+	offset = 0;
 	if (stack_guard_page(vma, start)) {
 		addr += PAGE_SIZE;
 		nr_pages--;
+		offset = 1;
 	}
 
-	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+	retval = __get_user_pages(current, mm, addr, nr_pages, gup_flags,
 				NULL, NULL, nonblocking);
+
+	/* Get the return value correct even in the face of the guard page */
+	if (retval < 0)
+		return offset ? : retval;
+	return retval + offset;
 }
 
 /*

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-12 17:19                                         ` Linus Torvalds
@ 2011-04-12 18:59                                           ` Linus Torvalds
  2011-04-12 19:02                                             ` Robert Święcki
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-12 18:59 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 437 bytes --]

On Tue, Apr 12, 2011 at 10:19 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> THIS IS A HACKY AND UNTESTED PATCH!

.. and here is a rather less hacky, but still equally untested patch.
It moves the stack guard page handling into __get_user_pages() itself,
and thus avoids the whole problem.

This one I could easily see myself committing. Assuming I get some
ack's and testing..

Comments?

                          Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3078 bytes --]

 mm/memory.c |   26 ++++++++++++++++++--------
 mm/mlock.c  |   13 -------------
 2 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9da8cab1b1b0..b623a249918c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1410,6 +1410,13 @@ no_page_table:
 	return page;
 }
 
+static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSDOWN) &&
+		(vma->vm_start == addr) &&
+		!vma_stack_continue(vma->vm_prev, addr);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -1488,7 +1495,6 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		vma = find_extend_vma(mm, start);
 		if (!vma && in_gate_area(mm, start)) {
 			unsigned long pg = start & PAGE_MASK;
-			struct vm_area_struct *gate_vma = get_gate_vma(mm);
 			pgd_t *pgd;
 			pud_t *pud;
 			pmd_t *pmd;
@@ -1513,10 +1519,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				pte_unmap(pte);
 				return i ? : -EFAULT;
 			}
+			vma = get_gate_vma(mm);
 			if (pages) {
 				struct page *page;
 
-				page = vm_normal_page(gate_vma, start, *pte);
+				page = vm_normal_page(vma, start, *pte);
 				if (!page) {
 					if (!(gup_flags & FOLL_DUMP) &&
 					     is_zero_pfn(pte_pfn(*pte)))
@@ -1530,12 +1537,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				get_page(page);
 			}
 			pte_unmap(pte);
-			if (vmas)
-				vmas[i] = gate_vma;
-			i++;
-			start += PAGE_SIZE;
-			nr_pages--;
-			continue;
+			goto next_page;
 		}
 
 		if (!vma ||
@@ -1549,6 +1551,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
+		/*
+		 * If we don't actually want the page itself,
+		 * and it's the stack guard page, just skip it.
+		 */
+		if (!pages && stack_guard_page(vma, start))
+			goto next_page;
+
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
@@ -1631,6 +1640,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				flush_anon_page(vma, page, start);
 				flush_dcache_page(page);
 			}
+next_page:
 			if (vmas)
 				vmas[i] = vma;
 			i++;
diff --git a/mm/mlock.c b/mm/mlock.c
index 2689a08c79af..6b55e3efe0df 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -135,13 +135,6 @@ void munlock_vma_page(struct page *page)
 	}
 }
 
-static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
-{
-	return (vma->vm_flags & VM_GROWSDOWN) &&
-		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
-}
-
 /**
  * __mlock_vma_pages_range() -  mlock a range of pages in the vma.
  * @vma:   target vma
@@ -188,12 +181,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_LOCKED)
 		gup_flags |= FOLL_MLOCK;
 
-	/* We don't try to access the guard page of a stack vma */
-	if (stack_guard_page(vma, start)) {
-		addr += PAGE_SIZE;
-		nr_pages--;
-	}
-
 	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
 				NULL, NULL, nonblocking);
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-12 18:59                                           ` Linus Torvalds
@ 2011-04-12 19:02                                             ` Robert Święcki
  2011-04-12 19:38                                               ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Robert Święcki @ 2011-04-12 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Tue, Apr 12, 2011 at 8:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 12, 2011 at 10:19 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> THIS IS A HACKY AND UNTESTED PATCH!
>
> .. and here is a rather less hacky, but still equally untested patch.
> It moves the stack guard page handling into __get_user_pages() itself,
> and thus avoids the whole problem.
>
> This one I could easily see myself committing. Assuming I get some
> ack's and testing..

I'm testing currently with the old one, w/o any symptoms of problems
by now, but it's not a meaningful period of time. I can try with the
new one, leave it over(European)night, and let you know tomorrow.

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-12 19:02                                             ` Robert Święcki
@ 2011-04-12 19:38                                               ` Linus Torvalds
  2011-04-18 21:15                                                 ` Michel Lespinasse
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-04-12 19:38 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse,
	Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra,
	Rik van Riel

On Tue, Apr 12, 2011 at 12:02 PM, Robert Święcki <robert@swiecki.net> wrote:
>
> I'm testing currently with the old one, w/o any symptoms of problems
> by now, but it's not a meaningful period of time. I can try with the
> new one, leave it over(European)night, and let you know tomorrow.

You might as well keep testing the old one, if that gives it better
coverage. No need to disrupt anything you already have running.

The more important input is "was that actually the root cause", rather
than deciding between the ugly or clean way of fixing it.

So if the first patch fixes it, then I'm pretty sure the second one
will too - just in a cleaner manner.

                          Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-12 19:38                                               ` Linus Torvalds
@ 2011-04-18 21:15                                                 ` Michel Lespinasse
  2011-05-05  0:09                                                   ` Michel Lespinasse
  0 siblings, 1 reply; 37+ messages in thread
From: Michel Lespinasse @ 2011-04-18 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Tue, Apr 12, 2011 at 12:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 12, 2011 at 12:02 PM, Robert Święcki <robert@swiecki.net> wrote:
>>
>> I'm testing currently with the old one, w/o any symptoms of problems
>> by now, but it's not a meaningful period of time. I can try with the
>> new one, leave it over(European)night, and let you know tomorrow.
>
> You might as well keep testing the old one, if that gives it better
> coverage. No need to disrupt anything you already have running.
>
> The more important input is "was that actually the root cause", rather
> than deciding between the ugly or clean way of fixing it.
>
> So if the first patch fixes it, then I'm pretty sure the second one
> will too - just in a cleaner manner.

Sorry for the delayed response - I have been traveling abroad in the
last two weeks and until the end of the month.

This second patch looks more attractive than the first, but is also
harder to prove correct. Hugh looked at all gup call sites and
convinced himself that the change was safe, except for the
fault_in_user_writeable() site in futex.c which he asked me to look
at. I am worried that we would have an issue there, as places like
futex_wake_op() or fixup_pi_state_owner() operate on user memory with
page faults disabled, and expect fault_in_user_writeable() to set up
the user page so that they can retry if the initial access failed.
With this proposal, fault_in_user_writeable() would become inoperative
when the  address is within the guard page; this could cause some
malicious futex operation to create an infinite loop.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-04-18 21:15                                                 ` Michel Lespinasse
@ 2011-05-05  0:09                                                   ` Michel Lespinasse
  2011-05-05  0:38                                                     ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Michel Lespinasse @ 2011-05-05  0:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

FYI, the attached code causes an infinite loop in kernels that have
the 95042f9eb7 commit:

#include <stdio.h>
#include <string.h>

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

int *get_stack_guard(void)
{
  FILE *map;
  char buf[1000];

  map = fopen("/proc/self/maps", "r");
  if (!map)
    return NULL;
  while(fgets(buf, 1000, map)) {
    long a, b;
    char c[1000], d[1000], e[1000], f[1000], g[1000];
    if (sscanf(buf, "%lx-%lx %s %s %s %s %s", &a, &b, c, d, e, f, g) == 7 &&
        !strcmp(g, "[stack]")) {
      fclose(map);
      return (int *)(a - 4096);
    }
  }
  fclose(map);
  return NULL;
}

int main(void)
{
  int *uaddr = get_stack_guard();
  syscall(SYS_futex, uaddr, FUTEX_LOCK_PI_PRIVATE, 0, NULL, NULL, 0);
  return 0;
}

Linus, I am not sure as to what would be the preferred way to fix
this. One option could be to modify fault_in_user_writeable so that it
passes a non-NULL page pointer, and just does a put_page on it
afterwards. While this would work, this is kinda ugly and would slow
down futex operations somewhat. A more conservative alternative could
be to enable the guard page special case under an new GUP flag, but
this loses much of the elegance of your original proposal...

On Mon, Apr 18, 2011 at 2:15 PM, Michel Lespinasse <walken@google.com> wrote:
> This second patch looks more attractive than the first, but is also
> harder to prove correct. Hugh looked at all gup call sites and
> convinced himself that the change was safe, except for the
> fault_in_user_writeable() site in futex.c which he asked me to look
> at. I am worried that we would have an issue there, as places like
> futex_wake_op() or fixup_pi_state_owner() operate on user memory with
> page faults disabled, and expect fault_in_user_writeable() to set up
> the user page so that they can retry if the initial access failed.
> With this proposal, fault_in_user_writeable() would become inoperative
> when the  address is within the guard page; this could cause some
> malicious futex operation to create an infinite loop.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-05-05  0:09                                                   ` Michel Lespinasse
@ 2011-05-05  0:38                                                     ` Linus Torvalds
  2011-05-05  1:18                                                       ` Michel Lespinasse
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-05-05  0:38 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1654 bytes --]

On Wed, May 4, 2011 at 5:09 PM, Michel Lespinasse <walken@google.com> wrote:
>
> FYI, the attached code causes an infinite loop in kernels that have
> the 95042f9eb7 commit:

Mmm.

Yes. The atomic fault will never work, and the get_user_pages() thing
won't either, so things will just loop forever.

> Linus, I am not sure as to what would be the preferred way to fix
> this. One option could be to modify fault_in_user_writeable so that it
> passes a non-NULL page pointer, and just does a put_page on it
> afterwards. While this would work, this is kinda ugly and would slow
> down futex operations somewhat.

No, that's just ugly as hell.

> A more conservative alternative could
> be to enable the guard page special case under an new GUP flag, but
> this loses much of the elegance of your original proposal...

How about only doing that only for FOLL_MLOCK?

Also, looking at mm/mlock.c, why _do_ we call get_user_pages() even if
the vma isn't mlocked? That looks bogus. Since we have dropped the
mm_semaphore, an unlock may have happened, and afaik we should *not*
try to bring those pages back in at all. There's this whole comment
about that in the caller ("__mlock_vma_pages_range() double checks the
vma flags, so that it won't mlock pages if the vma was already
munlocked."), but despite that it would actually call
__get_user_pages() even if the VM_LOCKED bit had been cleared (it just
wouldn't call it with the FOLL_MLOCK flag).

So maybe something like the attached?

UNTESTED! And maybe there was some really subtle reason to still call
__get_user_pages() without that FOLL_MLOCK thing that I'm missing.

                           Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1442 bytes --]

 mm/memory.c |    2 +-
 mm/mlock.c  |    8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 607098d47e74..f7a487c908a5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1555,7 +1555,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		 * If we don't actually want the page itself,
 		 * and it's the stack guard page, just skip it.
 		 */
-		if (!pages && stack_guard_page(vma, start))
+		if (!pages && (gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
 			goto next_page;
 
 		do {
diff --git a/mm/mlock.c b/mm/mlock.c
index 6b55e3efe0df..8ed7fd09f81c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -162,7 +162,10 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	VM_BUG_ON(end   > vma->vm_end);
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	gup_flags = FOLL_TOUCH;
+	if (!(vma->vm_flags & VM_LOCKED))
+		return nr_pages;
+
+	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
@@ -178,9 +181,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
 		gup_flags |= FOLL_FORCE;
 
-	if (vma->vm_flags & VM_LOCKED)
-		gup_flags |= FOLL_MLOCK;
-
 	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
 				NULL, NULL, nonblocking);
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-05-05  0:38                                                     ` Linus Torvalds
@ 2011-05-05  1:18                                                       ` Michel Lespinasse
  2011-05-05  1:40                                                         ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Michel Lespinasse @ 2011-05-05  1:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Wed, May 4, 2011 at 5:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>> A more conservative alternative could
>> be to enable the guard page special case under an new GUP flag, but
>> this loses much of the elegance of your original proposal...
>
> How about only doing that only for FOLL_MLOCK?

Sounds reasonable.

> Also, looking at mm/mlock.c, why _do_ we call get_user_pages() even if
> the vma isn't mlocked? That looks bogus. Since we have dropped the
> mm_semaphore, an unlock may have happened, and afaik we should *not*
> try to bring those pages back in at all. There's this whole comment
> about that in the caller ("__mlock_vma_pages_range() double checks the
> vma flags, so that it won't mlock pages if the vma was already
> munlocked."), but despite that it would actually call
> __get_user_pages() even if the VM_LOCKED bit had been cleared (it just
> wouldn't call it with the FOLL_MLOCK flag).

There are two reasons VM_LOCKED might be cleared in
__mlock_vma_pages_range(). It could be that one of the VM_SPECIAL
flags were set on the VMA, in which case mlock() won't set VM_LOCKED
but it still must make the pages present. Or, there is an munlock()
executing concurrently with mlock() - in that case, the conservative
thing to do is to give the same results as if the mlock() had
completed before the munlock(). That is, the mlock() would have broken
COW / made the pages present and the munlock() would have cleared the
VM_LOCKED and PageMlocked flags.

> UNTESTED! And maybe there was some really subtle reason to still call
> __get_user_pages() without that FOLL_MLOCK thing that I'm missing.

I think we want the mm/memory.c part of this proposal without the
mm/mlock.c part.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-05-05  1:18                                                       ` Michel Lespinasse
@ 2011-05-05  1:40                                                         ` Linus Torvalds
  2011-05-05  3:37                                                           ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-05-05  1:40 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Wed, May 4, 2011 at 6:18 PM, Michel Lespinasse <walken@google.com> wrote:
>
> I think we want the mm/memory.c part of this proposal without the
> mm/mlock.c part.

.. but what about mlock not setting the FOLL_MLOCK bit, then?

In that case, you'd get the "mlock extends the stack" problem.

                         Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-05-05  1:40                                                         ` Linus Torvalds
@ 2011-05-05  3:37                                                           ` Linus Torvalds
  2011-05-05  4:26                                                             ` Michel Lespinasse
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2011-05-05  3:37 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1541 bytes --]

Ok, so here's a slightly different approach.

It still makes the FOLL_MLOCK be unconditional in the mlock path, but
it really just pushes down the

-       gup_flags = FOLL_TOUCH;
+       gup_flags = FOLL_TOUCH | FOLL_MLOCK;
        ...
-       if (vma->vm_flags & VM_LOCKED)
-               gup_flags |= FOLL_MLOCK;

from __mlock_vma_pages_range(), and moves the conditional into
'follow_page()' (which is the only _user_ of that flag) instead:

-       if (flags & FOLL_MLOCK) {
+       if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {

so semantically this changes nothing at all.

But now, because __get_user_pages() can look at FOLL_MLOCK to see that
it's a mlock access, we can do that whole "skip stack guard page"
based on that flag:

-               if (!pages && stack_guard_page(vma, start))
+               if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))

which means that other uses will try to page in the stack guard page.

I seriously considered making that "skip stack guard page" and the
"mlock lookup" be two separate bits, because conceptually they are
really pretty independent, but right now the only _users_ seem to be
tied together, so I kept it as one single bit (FOLL_MLOCK).

But as far as I can tell, the attached patch is 100% equivalent to
what we do now, except for that "skip stack guard pages only for
mlock" change.

Comments? I like this patch because it seems to make the logic more
straightforward.

But somebody else should double-check my logic.

                         Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1728 bytes --]

 mm/memory.c |    7 +++----
 mm/mlock.c  |    5 +----
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 607098d47e74..27f425378112 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1359,7 +1359,7 @@ split_fallthrough:
 		 */
 		mark_page_accessed(page);
 	}
-	if (flags & FOLL_MLOCK) {
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * The preliminary mapping check is mainly to avoid the
 		 * pointless overhead of lock_page on the ZERO_PAGE
@@ -1552,10 +1552,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		/*
-		 * If we don't actually want the page itself,
-		 * and it's the stack guard page, just skip it.
+		 * For mlock, just skip the stack guard page.
 		 */
-		if (!pages && stack_guard_page(vma, start))
+		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
 			goto next_page;
 
 		do {
diff --git a/mm/mlock.c b/mm/mlock.c
index 6b55e3efe0df..516b2c2ddd5a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	VM_BUG_ON(end   > vma->vm_end);
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	gup_flags = FOLL_TOUCH;
+	gup_flags = FOLL_TOUCH | FOLL_MLOCK;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
@@ -178,9 +178,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
 		gup_flags |= FOLL_FORCE;
 
-	if (vma->vm_flags & VM_LOCKED)
-		gup_flags |= FOLL_MLOCK;
-
 	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
 				NULL, NULL, nonblocking);
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] mm: fix possible cause of a page_mapped BUG
  2011-05-05  3:37                                                           ` Linus Torvalds
@ 2011-05-05  4:26                                                             ` Michel Lespinasse
  0 siblings, 0 replies; 37+ messages in thread
From: Michel Lespinasse @ 2011-05-05  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robert Święcki, Hugh Dickins, Andrew Morton,
	Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm,
	Peter Zijlstra, Rik van Riel

On Wed, May 4, 2011 at 8:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I seriously considered making that "skip stack guard page" and the
> "mlock lookup" be two separate bits, because conceptually they are
> really pretty independent, but right now the only _users_ seem to be
> tied together, so I kept it as one single bit (FOLL_MLOCK).

Yes, this seems best for now.

> But as far as I can tell, the attached patch is 100% equivalent to
> what we do now, except for that "skip stack guard pages only for
> mlock" change.
>
> Comments? I like this patch because it seems to make the logic more
> straightforward.
>
> But somebody else should double-check my logic.

It makes perfect sense.

Reviewed-by: Michel Lespinasse <walken@google.com>

(I would argue for it to go to stable trees as well)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2011-05-05  4:26 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-24  5:39 [PATCH] mm: fix possible cause of a page_mapped BUG Hugh Dickins
2011-02-28 23:35 ` Robert Święcki
2011-03-17 15:40   ` Robert Święcki
2011-03-19  5:34     ` Hugh Dickins
2011-04-01 14:34       ` Robert Święcki
2011-04-01 15:44         ` Linus Torvalds
2011-04-01 16:21           ` Robert Święcki
2011-04-01 16:35             ` Linus Torvalds
2011-04-02  4:01               ` Hui Zhu
2011-04-04 13:02                 ` Robert Święcki
2011-04-02  1:46           ` Hugh Dickins
2011-04-04 12:46             ` Robert Święcki
2011-04-04 18:30               ` Hugh Dickins
2011-04-05 12:21                 ` Robert Święcki
2011-04-05 15:37                   ` Linus Torvalds
2011-04-06 14:47                     ` Hugh Dickins
2011-04-06 15:32                       ` Linus Torvalds
2011-04-06 15:43                         ` Hugh Dickins
2011-04-06 15:59                           ` Linus Torvalds
2011-04-06 17:54                             ` Robert Święcki
2011-04-07 12:41                               ` Robert Święcki
2011-04-07 14:24                                 ` Hugh Dickins
2011-04-12  9:58                                   ` Robert Święcki
2011-04-12 14:21                                     ` Linus Torvalds
     [not found]                                       ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com>
2011-04-12 16:17                                         ` Robert Święcki
2011-04-12 17:19                                         ` Linus Torvalds
2011-04-12 18:59                                           ` Linus Torvalds
2011-04-12 19:02                                             ` Robert Święcki
2011-04-12 19:38                                               ` Linus Torvalds
2011-04-18 21:15                                                 ` Michel Lespinasse
2011-05-05  0:09                                                   ` Michel Lespinasse
2011-05-05  0:38                                                     ` Linus Torvalds
2011-05-05  1:18                                                       ` Michel Lespinasse
2011-05-05  1:40                                                         ` Linus Torvalds
2011-05-05  3:37                                                           ` Linus Torvalds
2011-05-05  4:26                                                             ` Michel Lespinasse
2011-04-07 14:17                             ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).