LKML Archive on lore.kernel.org
* Re: Linux 3.19-rc3
@ 2015-01-06  4:49 Sedat Dilek
  2015-01-06  9:34 ` Sedat Dilek
  2015-01-06  9:40 ` Peter Zijlstra
  0 siblings, 2 replies; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06  4:49 UTC
  To: Dave Jones; +Cc: Linus Torvalds, LKML, Peter Zijlstra (Intel)

[ Please CC me I am not subscribed to LKML ]

[ QUOTE ]

On Mon, Jan 05, 2015 at 05:46:15PM -0800, Linus Torvalds wrote:
 > It's a day delayed - not because of any particular development issues,
 > but simply because I was tiling a bathroom yesterday. But rc3 is out
 > there now, and things have stayed reasonably calm. I really hope that
 > implies that 3.19 is looking good, but it's equally likely that it's
 > just that people are still recovering from the holiday season.
 >
 > A bit over three quarters of the changes here are drivers - mostly
 > networking, thermal, input layer, sound, power management. The rest is
 > misc - filesystems, core networking, some arch fixes, etc. But all of
 > it is pretty small.
 >
 > So go out and test,

This has been there since just before rc1. Is there a fix for this
stalled in someone's git tree maybe?

[    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303 __might_sleep+0x8d/0xa0()
[    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
[    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100
[    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88 ffffffff915b47c7
[    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8 ffffffff91062c30
[    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d 0000000000000000
[    7.952600] Call Trace:
[    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
[    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
[    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
[    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
[    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
[    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
[    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
[    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
[    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
[    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
[    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
[    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
[    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
[    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
[    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
[    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
[    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
[    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17

[ /QUOTE ]

I am seeing a similar call-trace/warning.
It is reproducible when running fio (latest: v2.2.4) during my loop-mq
tests (see block.git#for-next).

Some people tend to say it's coming from the linux-aio area [1], but I
am not sure.
At first I thought this was a Linux-next problem, but I am also seeing it
with my rc-kernels.
For parts of aio there is a patch discussed in [2].
The experimental patchset from Kent [3] made the "aio" call-trace go
away here.

I also tried a patch from Eric Sandeen that is pending in
peterz/queue.git#sched/core [4].
It's "check for stack overflow in ___might_sleep".
Unfortunately, it did not help in the case of my loop-mq tests.
( BTW, this touches ___might_sleep() (note: triple underscore) and not
the affected __might_sleep() (double underscore). )

Let me hear your feedback.

Have more fun!

- Sedat -

[1] http://marc.info/?l=linux-aio&m=142033318411355&w=2
[2] http://marc.info/?l=linux-aio&m=142035799514685&w=2
[3] http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
[4] http://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/patch/?id=48e615e4c3ebed488fecb6bfb40b372151f62db2


* Re: Linux 3.19-rc3
  2015-01-06  4:49 Linux 3.19-rc3 Sedat Dilek
@ 2015-01-06  9:34 ` Sedat Dilek
  2015-01-06  9:56   ` Takashi Iwai
  2015-01-06  9:59   ` Peter Zijlstra
  2015-01-06  9:40 ` Peter Zijlstra
  1 sibling, 2 replies; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06  9:34 UTC
  To: Dave Jones; +Cc: Linus Torvalds, LKML, Peter Zijlstra (Intel), Takashi Iwai

[-- Attachment #1: Type: text/plain, Size: 4914 bytes --]

On Tue, Jan 6, 2015 at 5:49 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
> [ Please CC me I am not subscribed to LKML ]
>
> [ QUOTE ]
>
> On Mon, Jan 05, 2015 at 05:46:15PM -0800, Linus Torvalds wrote:
>  > It's a day delayed - not because of any particular development issues,
>  > but simply because I was tiling a bathroom yesterday. But rc3 is out
>  > there now, and things have stayed reasonably calm. I really hope that
>  > implies that 3.19 is looking good, but it's equally likely that it's
>  > just that people are still recovering from the holiday season.
>  >
>  > A bit over three quarters of the changes here are drivers - mostly
>  > networking, thermal, input layer, sound, power management. The rest is
>  > misc - filesystems, core networking, some arch fixes, etc. But all of
>  > it is pretty small.
>  >
>  > So go out and test,
>
> This has been there since just before rc1. Is there a fix for this
> stalled in someones git tree maybe ?
>
> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
> __might_sleep+0x8d/0xa0()
> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1
> set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted
> 3.19.0-rc3+ #100
> [    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88
> ffffffff915b47c7
> [    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8
> ffffffff91062c30
> [    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d
> 0000000000000000
> [    7.952600] Call Trace:
> [    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
> [    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
> [    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
> [    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
> [    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> [    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> [    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
> [    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
> [    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
> [    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
> [    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
> [    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
> [    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
> [    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
> [    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
> [    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
> [    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17
>
> [ /QUOTE ]
>
> I am seeing a similiar call-trace/warning.
> It is reproducible when running fio (latest: v2.2.4) while my loop-mq
> tests (see block.git#for-next)
>
> Some people tend to say it's coming from the linux-aio area [1], but I
> am not sure.
> 1st I thought this is a Linux-next problem but I am seeing it also
> with my rc-kernels.
> For parts of aio there is a patch discussed in [2].
> The experimental patchset of Ken from [3] made the "aio" call-trace go
> away here.
>
> I tried also a patch pending in peterz/queue.git#sched/core from Eric Sandeen.
> It's "check for stack overflow in ___might_sleep".
> Unfortunately, it did not help in case of my loop-mq tests.
> ( BTW, this is touching ___might_sleep() (note: triple-underscore VS.
> affected __might_sleep() <--- double-underscrore). )
>
> Let me hear your feedback.
>
> Have more fun!
>
> - Sedat -
>
> [1] http://marc.info/?l=linux-aio&m=142033318411355&w=2
> [2] http://marc.info/?l=linux-aio&m=142035799514685&w=2
> [3] http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
> [4] http://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/patch/?id=48e615e4c3ebed488fecb6bfb40b372151f62db2

[ CC Takashi ]

From [1]:
...

Just "me too" (but overlooked until recently).

The cause is a mutex_lock() call right after prepare_to_wait() with
TASK_INTERRUPTIBLE in fanotify_read().

static ssize_t fanotify_read(struct file *file, char __user *buf,
                             size_t count, loff_t *pos)
{
        ....
        while (1) {
                prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
                mutex_lock(&group->notification_mutex);

I saw Peter already fixed a similar code in inotify_user.c by commit
e23738a7300a (but interestingly for a different reason, "Deal with
nested sleeps").  Supposedly a similar fix would be needed for
fanotify_user.c.
...
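
As a rough illustration of the ordering problem described in the excerpt
above, here is a small userspace model in C. It is only a sketch, not kernel
code: struct task, might_sleep_model(), mutex_lock_model() and the two
*_ordering() helpers are hypothetical names invented for this example, and
the model assumes the mutex is contended, so that taking it leaves the task
in TASK_RUNNING.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for the kernel's task states. */
enum task_state { TASK_RUNNING = 0, TASK_INTERRUPTIBLE = 1 };

struct task {
	enum task_state state;
	int warnings;	/* how often the debug check fired */
};

/* Models the debug check: blocking ops must be entered in TASK_RUNNING. */
void might_sleep_model(struct task *t)
{
	if (t->state != TASK_RUNNING) {
		t->warnings++;
		fprintf(stderr,
			"do not call blocking ops when !TASK_RUNNING; state=%d\n",
			(int)t->state);
	}
}

/* Models a contended mutex_lock(): it sleeps and returns in TASK_RUNNING. */
void mutex_lock_model(struct task *t)
{
	might_sleep_model(t);
	t->state = TASK_RUNNING;
}

/*
 * Old ordering: set the sleep state first, then take the mutex.
 * Returns the task state at the point where schedule() would be called.
 */
enum task_state old_ordering(struct task *t)
{
	assert(t->state == TASK_RUNNING);	/* callers start out runnable */
	t->state = TASK_INTERRUPTIBLE;		/* prepare_to_wait() */
	mutex_lock_model(t);			/* warns, sleep state is lost */
	return t->state;
}

/* New ordering: take the mutex while running, set the sleep state last. */
enum task_state new_ordering(struct task *t)
{
	assert(t->state == TASK_RUNNING);	/* callers start out runnable */
	mutex_lock_model(t);			/* no warning */
	t->state = TASK_INTERRUPTIBLE;		/* set just before sleeping */
	return t->state;
}
```

In this model, old_ordering() on a fresh task records one warning and
reaches the schedule() point in TASK_RUNNING, while new_ordering() records
no warning and reaches it in TASK_INTERRUPTIBLE.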

Can you explain why you think the problem is in sched-fanotify?

I tried a similar (quick) fix, analogous to the mentioned
"sched, inotify: Deal with nested sleeps" patch from Peter.
If I did it correctly, it does not make the call-trace go away here.

- Sedat -

[1] http://marc.info/?l=linux-kernel&m=142053231023575&w=2

[-- Attachment #2: 0001-sched-fanotify-Deal-with-nested-sleeps.patch --]
[-- Type: text/x-patch, Size: 1750 bytes --]

From 5445404e768653771faca9770755340200fe8b6c Mon Sep 17 00:00:00 2001
From: Sedat Dilek <sedat.dilek@gmail.com>
Date: Tue, 6 Jan 2015 09:51:54 +0100
Subject: [PATCH] sched: fanotify: Deal with nested sleeps

---
 fs/notify/fanotify/fanotify_user.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index c991616..65e96e2 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 #include <linux/uaccess.h>
 #include <linux/compat.h>
+#include <linux/wait.h>
 
 #include <asm/ioctls.h>
 
@@ -259,15 +260,15 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
 	struct fsnotify_event *kevent;
 	char __user *start;
 	int ret;
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
 	start = buf;
 	group = file->private_data;
 
 	pr_debug("%s: group=%p\n", __func__, group);
 
+	add_wait_queue(&group->notification_waitq, &wait);
 	while (1) {
-		prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
 
 		mutex_lock(&group->notification_mutex);
 		kevent = get_one_event(group, count);
@@ -289,7 +290,7 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
 
 			if (start != buf)
 				break;
-			schedule();
+			wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
 			continue;
 		}
 
@@ -318,8 +319,8 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
 		buf += ret;
 		count -= ret;
 	}
+	remove_wait_queue(&group->notification_waitq, &wait);
 
-	finish_wait(&group->notification_waitq, &wait);
 	if (start != buf && ret != -EFAULT)
 		ret = buf - start;
 	return ret;
-- 
2.2.1
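
The wait_woken() approach used in the patch above can be modelled in
userspace roughly as follows. This is a single-threaded sketch under stated
assumptions, not the kernel implementation: struct waiter, wake_model() and
wait_woken_model() are hypothetical names, the "woken" field stands in for
the wait-queue entry's woken flag, and sleeping is only counted rather than
performed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one wait-queue entry and its owning task. */
struct waiter {
	bool sleep_state;	/* true: task state is TASK_INTERRUPTIBLE */
	bool woken;		/* stands in for the entry's "woken" flag */
	int sleeps;		/* how often the model would really block */
};

/*
 * Waker side, modelled on woken_wake_function(): record the wakeup in
 * the entry, then make the task runnable.
 */
void wake_model(struct waiter *w)
{
	w->woken = true;
	w->sleep_state = false;
}

/*
 * Waiter side, modelled on wait_woken(): the sleep state is set only
 * here, after all locks were dropped, and a wakeup that arrived earlier
 * is noticed through the flag instead of being lost.
 */
void wait_woken_model(struct waiter *w)
{
	assert(!w->sleep_state);	/* caller is in TASK_RUNNING */
	w->sleep_state = true;
	if (!w->woken)
		w->sleeps++;		/* schedule_timeout() would block here */
	w->sleep_state = false;
	w->woken = false;		/* consume the wakeup */
}
```

If wake_model() runs before wait_woken_model(), the waiter does not block
in this model. Without a prior wakeup it counts one sleep. The point of the
flag is that the waiter can take mutexes in TASK_RUNNING and still not miss
a wakeup that arrives in between.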



* Re: Linux 3.19-rc3
  2015-01-06  4:49 Linux 3.19-rc3 Sedat Dilek
  2015-01-06  9:34 ` Sedat Dilek
@ 2015-01-06  9:40 ` Peter Zijlstra
  2015-01-06  9:42   ` Sedat Dilek
  2015-01-06 10:29   ` Sedat Dilek
  1 sibling, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06  9:40 UTC
  To: Sedat Dilek; +Cc: Dave Jones, Linus Torvalds, LKML

On Tue, Jan 06, 2015 at 05:49:11AM +0100, Sedat Dilek wrote:
> This has been there since just before rc1. Is there a fix for this
> stalled in someones git tree maybe ?
> 
> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
> __might_sleep+0x8d/0xa0()
> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100

> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0


http://marc.info/?l=linux-kernel&m=141874374029791


* Re: Linux 3.19-rc3
  2015-01-06  9:40 ` Peter Zijlstra
@ 2015-01-06  9:42   ` Sedat Dilek
  2015-01-06  9:57     ` Sedat Dilek
  2015-01-06 10:29   ` Sedat Dilek
  1 sibling, 1 reply; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06  9:42 UTC
  To: Peter Zijlstra; +Cc: Dave Jones, Linus Torvalds, LKML

On Tue, Jan 6, 2015 at 10:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 06, 2015 at 05:49:11AM +0100, Sedat Dilek wrote:
>> This has been there since just before rc1. Is there a fix for this
>> stalled in someones git tree maybe ?
>>
>> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
>> __might_sleep+0x8d/0xa0()
>> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
>> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100
>
>> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
>
>
> http://marc.info/?l=linux-kernel&m=141874374029791

Hehe, I created the same fix... it did not help here.

- Sedat -


* Re: Linux 3.19-rc3
  2015-01-06  9:34 ` Sedat Dilek
@ 2015-01-06  9:56   ` Takashi Iwai
  2015-01-06 10:06     ` Sedat Dilek
  2015-01-06  9:59   ` Peter Zijlstra
  1 sibling, 1 reply; 101+ messages in thread
From: Takashi Iwai @ 2015-01-06  9:56 UTC
  To: sedat.dilek; +Cc: Dave Jones, Linus Torvalds, LKML, Peter Zijlstra (Intel)

At Tue, 6 Jan 2015 10:34:30 +0100,
Sedat Dilek wrote:
> 
> On Tue, Jan 6, 2015 at 5:49 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
> > [ Please CC me I am not subscribed to LKML ]
> >
> > [ QUOTE ]
> >
> > On Mon, Jan 05, 2015 at 05:46:15PM -0800, Linus Torvalds wrote:
> >  > It's a day delayed - not because of any particular development issues,
> >  > but simply because I was tiling a bathroom yesterday. But rc3 is out
> >  > there now, and things have stayed reasonably calm. I really hope that
> >  > implies that 3.19 is looking good, but it's equally likely that it's
> >  > just that people are still recovering from the holiday season.
> >  >
> >  > A bit over three quarters of the changes here are drivers - mostly
> >  > networking, thermal, input layer, sound, power management. The rest is
> >  > misc - filesystems, core networking, some arch fixes, etc. But all of
> >  > it is pretty small.
> >  >
> >  > So go out and test,
> >
> > This has been there since just before rc1. Is there a fix for this
> > stalled in someones git tree maybe ?
> >
> > [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
> > __might_sleep+0x8d/0xa0()
> > [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1
> > set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
> > [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted
> > 3.19.0-rc3+ #100
> > [    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88
> > ffffffff915b47c7
> > [    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8
> > ffffffff91062c30
> > [    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d
> > 0000000000000000
> > [    7.952600] Call Trace:
> > [    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
> > [    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
> > [    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
> > [    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
> > [    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> > [    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> > [    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
> > [    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
> > [    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
> > [    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
> > [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
> > [    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
> > [    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
> > [    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
> > [    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
> > [    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
> > [    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
> > [    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17
> >
> > [ /QUOTE ]
> >
> > I am seeing a similiar call-trace/warning.
> > It is reproducible when running fio (latest: v2.2.4) while my loop-mq
> > tests (see block.git#for-next)
> >
> > Some people tend to say it's coming from the linux-aio area [1], but I
> > am not sure.
> > 1st I thought this is a Linux-next problem but I am seeing it also
> > with my rc-kernels.
> > For parts of aio there is a patch discussed in [2].
> > The experimental patchset of Ken from [3] made the "aio" call-trace go
> > away here.
> >
> > I tried also a patch pending in peterz/queue.git#sched/core from Eric Sandeen.
> > It's "check for stack overflow in ___might_sleep".
> > Unfortunately, it did not help in case of my loop-mq tests.
> > ( BTW, this is touching ___might_sleep() (note: triple-underscore VS.
> > affected __might_sleep() <--- double-underscrore). )
> >
> > Let me hear your feedback.
> >
> > Have more fun!
> >
> > - Sedat -
> >
> > [1] http://marc.info/?l=linux-aio&m=142033318411355&w=2
> > [2] http://marc.info/?l=linux-aio&m=142035799514685&w=2
> > [3] http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
> > [4] http://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/patch/?id=48e615e4c3ebed488fecb6bfb40b372151f62db2
> 
> [ CC Takashi ]
> 
> >From [1]:
> ...
> 
> Just "me too" (but overlooked until recently).
> 
> The cause is a mutex_lock() call right after prepare_to_wait() with
> TASK_INTERRUPTIBLE in fanotify_read().
> 
> static ssize_t fanotify_read(struct file *file, char __user *buf,
>     size_t count, loff_t *pos)
> {
> ....
> while (1) {
> prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
> mutex_lock(&group->notification_mutex);
> 
> I saw Peter already fixed a similar code in inotify_user.c by commit
> e23738a7300a (but interestingly for a different reason, "Deal with
> nested sleeps").  Supposedly a similar fix would be needed for
> fanotify_user.c.
> ...
> 
> Can you explain why do you think the problem is in sched-fanotify?
> 
> I tried to do such a "similiar" (quick) fix analog to the mentioned
> "sched, inotify: Deal with nested sleeps" patch from Peter.
> If I did correct... It does not make the call-trace go away here.

Your code path is different from what Dave and I hit.  Take a closer
look at the stack trace.


Takashi


* Re: Linux 3.19-rc3
  2015-01-06  9:42   ` Sedat Dilek
@ 2015-01-06  9:57     ` Sedat Dilek
  2015-01-06 10:06       ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06  9:57 UTC
  To: Peter Zijlstra; +Cc: Dave Jones, Linus Torvalds, LKML

On Tue, Jan 6, 2015 at 10:42 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
> On Tue, Jan 6, 2015 at 10:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Jan 06, 2015 at 05:49:11AM +0100, Sedat Dilek wrote:
>>> This has been there since just before rc1. Is there a fix for this
>>> stalled in someones git tree maybe ?
>>>
>>> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
>>> __might_sleep+0x8d/0xa0()
>>> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
>>> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100
>>
>>> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
>>
>>
>> http://marc.info/?l=linux-kernel&m=141874374029791
>
> Hehe, I created same fix... did not help here.
>

From my call-trace...

[   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0

...and having a quick look at read_events() in...

"
         * But aio_read_events() can block, and if it blocks it's going to flip
         * the task state back to TASK_RUNNING.
"

[ fs/aio.c ]
...
static long read_events(struct kioctx *ctx, long min_nr, long nr,
                        struct io_event __user *event,
                        struct timespec __user *timeout)
{
        ktime_t until = { .tv64 = KTIME_MAX };
        long ret = 0;

        if (timeout) {
                struct timespec ts;

                if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
                        return -EFAULT;

                until = timespec_to_ktime(ts);
        }

        /*
         * Note that aio_read_events() is being called as the conditional - i.e.
         * we're calling it after prepare_to_wait() has set task state to
         * TASK_INTERRUPTIBLE.
         *
         * But aio_read_events() can block, and if it blocks it's going to flip
         * the task state back to TASK_RUNNING.
         *
         * This should be ok, provided it doesn't flip the state back to
         * TASK_RUNNING and return 0 too much - that causes us to spin. That
         * will only happen if the mutex_lock() call blocks, and we then find
         * the ringbuffer empty. So in practice we should be ok, but it's
         * something to be aware of when touching this code.
         */
        if (until.tv64 == 0)
                aio_read_events(ctx, min_nr, nr, event, &ret);
        else
                wait_event_interruptible_hrtimeout(ctx->wait,
                                aio_read_events(ctx, min_nr, nr, event, &ret),
                                until);

        if (!ret && signal_pending(current))
                ret = -EINTR;

        return ret;
}
...
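
The spinning scenario that the comment in read_events() warns about can be
sketched with a small userspace model. This is an illustration under stated
assumptions, not the fs/aio.c logic: struct wait_sim and wait_event_model()
are hypothetical names, "ready_at" is the loop iteration at which the
condition first reports events, and "contended" models mutex_lock() blocking
inside the condition on every call.

```c
#include <assert.h>
#include <stdbool.h>

struct wait_sim {
	bool sleep_state;	/* true: TASK_INTERRUPTIBLE, false: TASK_RUNNING */
	int sleeps;		/* schedule() calls that would really sleep */
	int spins;		/* schedule() calls that do not sleep */
};

/*
 * Models a wait_event-style loop around a condition that takes a mutex.
 * ready_at:  iteration at which the condition first returns true
 * contended: whether the mutex inside the condition blocks each time
 */
void wait_event_model(struct wait_sim *s, int ready_at, bool contended)
{
	assert(ready_at >= 0);
	for (int i = 0; ; i++) {
		s->sleep_state = true;		/* prepare_to_wait_event() */

		/* condition: aio_read_events() takes a mutex */
		if (contended)
			s->sleep_state = false;	/* blocked, back to TASK_RUNNING */
		if (i >= ready_at)
			break;			/* events found */

		/* schedule() only sleeps if the sleep state survived */
		if (s->sleep_state)
			s->sleeps++;
		else
			s->spins++;
	}
	s->sleep_state = false;			/* finish_wait() */
}
```

With ready_at = 3 and a contended mutex, the model counts three iterations
that do not sleep. With an uncontended mutex the same three iterations are
real sleeps. That is the "return 0 too much" case the comment describes.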

I do not have the right skills to look at this more deeply.

- Sedat -


* Re: Linux 3.19-rc3
  2015-01-06  9:34 ` Sedat Dilek
  2015-01-06  9:56   ` Takashi Iwai
@ 2015-01-06  9:59   ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06  9:59 UTC
  To: Sedat Dilek; +Cc: Dave Jones, Linus Torvalds, LKML, Takashi Iwai

On Tue, Jan 06, 2015 at 10:34:30AM +0100, Sedat Dilek wrote:
> I tried to do such a "similiar" (quick) fix analog to the mentioned
> "sched, inotify: Deal with nested sleeps" patch from Peter.
> If I did correct... It does not make the call-trace go away here.

Are you very sure it's the same splat and not another? I don't appear to
have anything triggering this on my test boxes (which are very much not
infected with system-disease).

The patch does look about right.

> ---
>  fs/notify/fanotify/fanotify_user.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index c991616..65e96e2 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -14,6 +14,7 @@
>  #include <linux/types.h>
>  #include <linux/uaccess.h>
>  #include <linux/compat.h>
> +#include <linux/wait.h>
>  
>  #include <asm/ioctls.h>
>  
> @@ -259,15 +260,15 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
>  	struct fsnotify_event *kevent;
>  	char __user *start;
>  	int ret;
> -	DEFINE_WAIT(wait);
> +	DEFINE_WAIT_FUNC(wait, woken_wake_function);
>  
>  	start = buf;
>  	group = file->private_data;
>  
>  	pr_debug("%s: group=%p\n", __func__, group);
>  
> +	add_wait_queue(&group->notification_waitq, &wait);
>  	while (1) {
> -		prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
>  
>  		mutex_lock(&group->notification_mutex);
>  		kevent = get_one_event(group, count);
> @@ -289,7 +290,7 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
>  
>  			if (start != buf)
>  				break;
> -			schedule();
> +			wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
>  			continue;
>  		}
>  
> @@ -318,8 +319,8 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
>  		buf += ret;
>  		count -= ret;
>  	}
> +	remove_wait_queue(&group->notification_waitq, &wait);
>  
> -	finish_wait(&group->notification_waitq, &wait);
>  	if (start != buf && ret != -EFAULT)
>  		ret = buf - start;
>  	return ret;
> -- 
> 2.2.1
> 



* Re: Linux 3.19-rc3
  2015-01-06  9:57     ` Sedat Dilek
@ 2015-01-06 10:06       ` Peter Zijlstra
  2015-01-06 10:18         ` Sedat Dilek
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 10:06 UTC
  To: Sedat Dilek
  Cc: Dave Jones, Linus Torvalds, LKML, Kent Overstreet, Chris Mason

On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
> 

Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
I'm not touching the AIO code either ;-)


* Re: Linux 3.19-rc3
  2015-01-06  9:56   ` Takashi Iwai
@ 2015-01-06 10:06     ` Sedat Dilek
  2015-01-06 10:28       ` Takashi Iwai
  0 siblings, 1 reply; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 10:06 UTC
  To: Takashi Iwai; +Cc: Dave Jones, Linus Torvalds, LKML, Peter Zijlstra (Intel)

On Tue, Jan 6, 2015 at 10:56 AM, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 6 Jan 2015 10:34:30 +0100,
> Sedat Dilek wrote:
>>
>> On Tue, Jan 6, 2015 at 5:49 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>> > [ Please CC me I am not subscribed to LKML ]
>> >
>> > [ QUOTE ]
>> >
>> > On Mon, Jan 05, 2015 at 05:46:15PM -0800, Linus Torvalds wrote:
>> >  > It's a day delayed - not because of any particular development issues,
>> >  > but simply because I was tiling a bathroom yesterday. But rc3 is out
>> >  > there now, and things have stayed reasonably calm. I really hope that
>> >  > implies that 3.19 is looking good, but it's equally likely that it's
>> >  > just that people are still recovering from the holiday season.
>> >  >
>> >  > A bit over three quarters of the changes here are drivers - mostly
>> >  > networking, thermal, input layer, sound, power management. The rest is
>> >  > misc - filesystems, core networking, some arch fixes, etc. But all of
>> >  > it is pretty small.
>> >  >
>> >  > So go out and test,
>> >
>> > This has been there since just before rc1. Is there a fix for this
>> > stalled in someones git tree maybe ?
>> >
>> > [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
>> > __might_sleep+0x8d/0xa0()
>> > [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1
>> > set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
>> > [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted
>> > 3.19.0-rc3+ #100
>> > [    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88
>> > ffffffff915b47c7
>> > [    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8
>> > ffffffff91062c30
>> > [    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d
>> > 0000000000000000
>> > [    7.952600] Call Trace:
>> > [    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
>> > [    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
>> > [    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
>> > [    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
>> > [    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
>> > [    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
>> > [    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
>> > [    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
>> > [    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
>> > [    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
>> > [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
>> > [    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
>> > [    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
>> > [    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
>> > [    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
>> > [    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
>> > [    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
>> > [    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17
>> >
>> > [ /QUOTE ]
>> >
>> > I am seeing a similiar call-trace/warning.
>> > It is reproducible when running fio (latest: v2.2.4) while my loop-mq
>> > tests (see block.git#for-next)
>> >
>> > Some people tend to say it's coming from the linux-aio area [1], but I
>> > am not sure.
>> > 1st I thought this is a Linux-next problem but I am seeing it also
>> > with my rc-kernels.
>> > For parts of aio there is a patch discussed in [2].
>> > The experimental patchset of Ken from [3] made the "aio" call-trace go
>> > away here.
>> >
>> > I tried also a patch pending in peterz/queue.git#sched/core from Eric Sandeen.
>> > It's "check for stack overflow in ___might_sleep".
>> > Unfortunately, it did not help in case of my loop-mq tests.
>> > ( BTW, this is touching ___might_sleep() (note: triple-underscore VS.
>> > affected __might_sleep() <--- double-underscrore). )
>> >
>> > Let me hear your feedback.
>> >
>> > Have more fun!
>> >
>> > - Sedat -
>> >
>> > [1] http://marc.info/?l=linux-aio&m=142033318411355&w=2
>> > [2] http://marc.info/?l=linux-aio&m=142035799514685&w=2
>> > [3] http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
>> > [4] http://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/patch/?id=48e615e4c3ebed488fecb6bfb40b372151f62db2
>>
>> [ CC Takashi ]
>>
>> >From [1]:
>> ...
>>
>> Just "me too" (but overlooked until recently).
>>
>> The cause is a mutex_lock() call right after prepare_to_wait() with
>> TASK_INTERRUPTIBLE in fanotify_read().
>>
>> static ssize_t fanotify_read(struct file *file, char __user *buf,
>>     size_t count, loff_t *pos)
>> {
>> ....
>> while (1) {
>> prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
>> mutex_lock(&group->notification_mutex);
>>
>> I saw Peter already fixed a similar code in inotify_user.c by commit
>> e23738a7300a (but interestingly for a different reason, "Deal with
>> nested sleeps").  Supposedly a similar fix would be needed for
>> fanotify_user.c.
>> ...
>>
>> Can you explain why do you think the problem is in sched-fanotify?
>>
>> I tried to do such a "similiar" (quick) fix analog to the mentioned
>> "sched, inotify: Deal with nested sleeps" patch from Peter.
>> If I did correct... It does not make the call-trace go away here.
>
> Your code path is different from what Dave and I hit.  Take a closer
> look at the stack trace.
>

Yeah, you are right.
I looked again into the code (see thread "Linux 3.19-rc3" [0], I am
reading offline).

As I said, the aio_ring_fix patchset, and especially [1], fixed the issue for me.

Can you confirm Peter's new patch works for you?

- Sedat -

[0] http://marc.info/?t=142050755700004&r=1&w=2
[1] http://evilpiepirate.org/git/linux-bcache.git/commit/?h=aio_ring_fix&id=c91f0de111da37581709f7d201793a88c6993188


* Re: Linux 3.19-rc3
  2015-01-06 10:06       ` Peter Zijlstra
@ 2015-01-06 10:18         ` Sedat Dilek
  2015-01-06 11:01           ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 10:18 UTC
  To: Peter Zijlstra
  Cc: Dave Jones, Linus Torvalds, LKML, Kent Overstreet, Chris Mason

On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
>> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
>>
>
> Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
> I'm not touching the AIO code either ;-)

I know. I was so excited when I saw nearly the same output.

Can you tell me why people see "similar" problems in different areas?

[  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
__might_sleep+0xbd/0xd0()
[  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110

With similar buzzwords, namely:

mutex_lock_nested
prepare_to_wait(_event)
__might_sleep

I am asking myself: where is the real root cause? Is it in sched/core?
Fix it in one single place vs. fixing the impact in several other places?

- Sedat -

* Re: Linux 3.19-rc3
  2015-01-06 10:06     ` Sedat Dilek
@ 2015-01-06 10:28       ` Takashi Iwai
  2015-01-06 10:31         ` Sedat Dilek
  0 siblings, 1 reply; 101+ messages in thread
From: Takashi Iwai @ 2015-01-06 10:28 UTC (permalink / raw)
  To: sedat.dilek; +Cc: Dave Jones, Linus Torvalds, LKML, Peter Zijlstra (Intel)

At Tue, 6 Jan 2015 11:06:45 +0100,
Sedat Dilek wrote:
> 
> Yeah, you are right.
> I looked again into the code (see thread "Linux 3.19-rc3", I am
> reading offline).
> 
> As said aio_ring_fix patchset and especially [1] fixed the issue for me.
> 
> Can you confirm Peter's new patch works-for-you?

Yes, it seemed to work for me the last time I tried.
(BTW, you don't need to add #include <linux/wait.h>)


Takashi

* Re: Linux 3.19-rc3
  2015-01-06  9:40 ` Peter Zijlstra
  2015-01-06  9:42   ` Sedat Dilek
@ 2015-01-06 10:29   ` Sedat Dilek
  1 sibling, 0 replies; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 10:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Jones, Linus Torvalds, LKML

On Tue, Jan 6, 2015 at 10:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 06, 2015 at 05:49:11AM +0100, Sedat Dilek wrote:
>> This has been there since just before rc1. Is there a fix for this
>> stalled in someones git tree maybe ?
>>
>> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303
>> __might_sleep+0x8d/0xa0()
>> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
>> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100
>
>> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
>
>
> http://marc.info/?l=linux-kernel&m=141874374029791

I added the include below to my patch; inotify_user.c has it, too [1]:
...
+#include <linux/wait.h>
...

Will you send out a proper patch once people confirm that it solves their
issue in the fanotify area?

- Sedat -

[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/notify/inotify/inotify_user.c#n40

* Re: Linux 3.19-rc3
  2015-01-06 10:28       ` Takashi Iwai
@ 2015-01-06 10:31         ` Sedat Dilek
  2015-01-06 10:37           ` Takashi Iwai
  0 siblings, 1 reply; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 10:31 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Dave Jones, Linus Torvalds, LKML, Peter Zijlstra (Intel)

On Tue, Jan 6, 2015 at 11:28 AM, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 6 Jan 2015 11:06:45 +0100,
> Sedat Dilek wrote:
>>
>> Yeah, you are right.
>> I looked again into the code (see thread "Linux 3.19-rc3", I am
>> reading offline).
>>
>> As said aio_ring_fix patchset and especially [1] fixed the issue for me.
>>
>> Can you confirm Peter's new patch works-for-you?
>
> Yes, it seems working for me at the last time I tried.
> (BTW, you don't need to add #include <linux/wait.h>)
>

I asked about that just a minute ago.
Can you explain? Is it pulled in by another include?

- Sedat -

* Re: Linux 3.19-rc3
  2015-01-06 10:31         ` Sedat Dilek
@ 2015-01-06 10:37           ` Takashi Iwai
  2015-01-06 10:42             ` Sedat Dilek
  0 siblings, 1 reply; 101+ messages in thread
From: Takashi Iwai @ 2015-01-06 10:37 UTC (permalink / raw)
  To: sedat.dilek; +Cc: Dave Jones, Linus Torvalds, LKML, Peter Zijlstra (Intel)

At Tue, 6 Jan 2015 11:31:34 +0100,
Sedat Dilek wrote:
> 
> On Tue, Jan 6, 2015 at 11:28 AM, Takashi Iwai <tiwai@suse.de> wrote:
> > At Tue, 6 Jan 2015 11:06:45 +0100,
> > Sedat Dilek wrote:
> >>
> >> Yeah, you are right.
> >> I looked again into the code (see thread "Linux 3.19-rc3", I am
> >> reading offline).
> >>
> >> As said aio_ring_fix patchset and especially [1] fixed the issue for me.
> >>
> >> Can you confirm Peter's new patch works-for-you?
> >
> > Yes, it seems working for me at the last time I tried.
> > (BTW, you don't need to add #include <linux/wait.h>)
> >
> 
> Just one minute ago, I asked about that?
> Can you explain that - included by another include?

Well, the original code already calls things defined in linux/wait.h, so
the header is obviously included already.
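
For what it's worth, a redundant #include is normally harmless, because
headers carry include guards. A minimal userspace sketch of that mechanism
follows; DEMO_WAIT_H, struct demo_wait_entry and demo_flags() are made-up
names for illustration, not anything taken from linux/wait.h:

```c
#include <stddef.h>
#include <stddef.h>    /* second inclusion is a no-op: the header is guarded */

/* Hypothetical header body, pasted twice to mimic a double #include. */
#ifndef DEMO_WAIT_H
#define DEMO_WAIT_H
struct demo_wait_entry { int flags; };
#endif

#ifndef DEMO_WAIT_H    /* guard macro already defined, so this copy is skipped */
#define DEMO_WAIT_H
struct demo_wait_entry { int flags; };
#endif

/* Compiles only because the struct above was defined exactly once. */
int demo_flags(void)
{
    struct demo_wait_entry e = { .flags = 1 };
    return e.flags;
}
```

Without the #ifndef/#define pair the second struct definition would be a
redefinition error, so the guard is what makes the extra include safe.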


Takashi

* Re: Linux 3.19-rc3
  2015-01-06 10:37           ` Takashi Iwai
@ 2015-01-06 10:42             ` Sedat Dilek
  0 siblings, 0 replies; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 10:42 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: Dave Jones, Linus Torvalds, LKML, Peter Zijlstra (Intel)

On Tue, Jan 6, 2015 at 11:37 AM, Takashi Iwai <tiwai@suse.de> wrote:
> At Tue, 6 Jan 2015 11:31:34 +0100,
> Sedat Dilek wrote:
>>
>> On Tue, Jan 6, 2015 at 11:28 AM, Takashi Iwai <tiwai@suse.de> wrote:
>> > At Tue, 6 Jan 2015 11:06:45 +0100,
>> > Sedat Dilek wrote:
>> >>
>> >> Yeah, you are right.
>> >> I looked again into the code (see thread "Linux 3.19-rc3", I am
>> >> reading offline).
>> >>
>> >> As said aio_ring_fix patchset and especially [1] fixed the issue for me.
>> >>
>> >> Can you confirm Peter's new patch works-for-you?
>> >
>> > Yes, it seems working for me at the last time I tried.
>> > (BTW, you don't need to add #include <linux/wait.h>)
>> >
>>
>> Just one minute ago, I asked about that?
>> Can you explain that - included by another include?
>
> Well, the original code calls the stuff defined in linux/wait.h, so
> it's already there obviously.
>

From that POV you are right, but as I said, I checked inotify_user.c in
parallel, and it has this include.
Does it hurt to include it explicitly again?

- Sedat -

* Re: Linux 3.19-rc3
  2015-01-06 10:18         ` Sedat Dilek
@ 2015-01-06 11:01           ` Peter Zijlstra
  2015-01-06 11:07             ` Kent Overstreet
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 11:01 UTC (permalink / raw)
  To: Sedat Dilek
  Cc: Dave Jones, Linus Torvalds, LKML, Kent Overstreet, Chris Mason

On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
> On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
> >> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
> >>
> >
> > Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
> > I'm not touching the AIO code either ;-)
> 
> I know, I was so excited when I see nearly the same output.
> 
> Can you tell me why people see "similiar" problems in different areas?

Because the debug check is new :-) It's a pattern that should not be
used, but it works most of the time.
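
A rough userspace model of what such a check does, only to illustrate why
it fires in several unrelated places at once. The model_* names are
invented for this sketch, and resetting the state after the warning is an
assumption of the model, not a statement about the kernel source:

```c
#include <stdio.h>

/* Values mirror the "state=1" seen in the warnings (1 == interruptible). */
enum { MODEL_TASK_RUNNING = 0, MODEL_TASK_INTERRUPTIBLE = 1 };

/* Hypothetical stand-in for the one field of task_struct that matters here. */
struct model_task { long state; };

/*
 * Model of the debug check: a blocking primitive is about to run, so the
 * task is expected to be in the running state. Returns 1 if it warned.
 */
int model_might_sleep(struct model_task *t)
{
    if (t->state != MODEL_TASK_RUNNING) {
        fprintf(stderr,
                "do not call blocking ops when !TASK_RUNNING; state=%lx\n",
                (unsigned long)t->state);
        /* Assumption of this sketch: continue with a sane state. */
        t->state = MODEL_TASK_RUNNING;
        return 1;
    }
    return 0;
}
```

Any caller that sets a sleeping state and then enters a blocking primitive
trips the same check, whatever subsystem it lives in.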

> [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
> __might_sleep+0xbd/0xd0()
> [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
> set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
> 
> With similiar buzzwords... namely...
> 
> mutex_lock_nested
> prepare_to_wait(_event)
> __might_sleep
> 
> I am asking myself... Where is the real root cause - in sched/core?
> Fix one single place VS. fix the impact at several other places?

No, the root cause is nesting sleep primitives, this is not fixable in
the one place, both prepare_to_wait and mutex_lock are using
task_struct::state, they have to, no way around it.
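
The nesting problem described above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: every
model_* name is hypothetical, nothing here is kernel code, and the model
keeps only the one fact that matters, namely that both primitives write the
same state field.

```c
/* Userspace model only: none of these functions are the kernel's. */
#include <assert.h>
#include <stdbool.h>

enum task_state { TASK_RUNNING = 0, TASK_INTERRUPTIBLE = 1 };

struct task {
	enum task_state state;	/* stands in for task_struct::state */
	int sleeps;		/* times model_schedule() really blocked */
	int spins;		/* times model_schedule() returned at once */
	int warnings;		/* times the might-sleep check fired */
};

/* Models prepare_to_wait(): mark the task as about to sleep. */
static void model_prepare_to_wait(struct task *t)
{
	t->state = TASK_INTERRUPTIBLE;
}

/* Models the debug check: blocking ops expect TASK_RUNNING. */
static void model_might_sleep(struct task *t)
{
	if (t->state != TASK_RUNNING)
		t->warnings++;
}

/* Models a blocking lock: it may sleep and returns in TASK_RUNNING. */
static void model_mutex_lock(struct task *t)
{
	model_might_sleep(t);
	t->state = TASK_RUNNING;
}

/* Models schedule(): it only blocks if the task is not TASK_RUNNING. */
static void model_schedule(struct task *t)
{
	if (t->state == TASK_RUNNING)
		t->spins++;
	else
		t->sleeps++;
	t->state = TASK_RUNNING;	/* running again after the wakeup */
}

/* One pass of a wait loop whose condition may take a blocking lock. */
void model_wait_iteration(struct task *t, bool cond_takes_mutex)
{
	model_prepare_to_wait(t);
	if (cond_takes_mutex)
		model_mutex_lock(t);	/* overwrites TASK_INTERRUPTIBLE */
	model_schedule(t);
}
```

With cond_takes_mutex set, the model records one warning and one spin
instead of a sleep, which is the "mostly works" behaviour: the wait loop
still terminates, it just degrades into polling for that iteration.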

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:01           ` Peter Zijlstra
@ 2015-01-06 11:07             ` Kent Overstreet
  2015-01-06 11:25               ` Sedat Dilek
                                 ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 11:07 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:01:12PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
> > On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
> > >> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
> > >>
> > >
> > > Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
> > > I'm not touching the AIO code either ;-)
> > 
> > I know, I was so excited when I see nearly the same output.
> > 
> > Can you tell me why people see "similiar" problems in different areas?
> 
> Because the debug check is new :-) It's a pattern that should not be
> used but mostly works most of the times.
> 
> > [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
> > __might_sleep+0xbd/0xd0()
> > [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
> > set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
> > 
> > With similiar buzzwords... namely...
> > 
> > mutex_lock_nested
> > prepare_to_wait(_event)
> > __might_sleep
> > 
> > I am asking myself... Where is the real root cause - in sched/core?
> > Fix one single place VS. fix the impact at several other places?
> 
> No, the root cause is nesting sleep primitives, this is not fixable in
> the one place, both prepare_to_wait and mutex_lock are using
> task_struct::state, they have to, no way around it.

No, it's completely possible to construct a prepare_to_wait() that doesn't
require messing with the task state. Had it for years.

http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:07             ` Kent Overstreet
@ 2015-01-06 11:25               ` Sedat Dilek
  2015-01-06 11:40                 ` Kent Overstreet
  2015-01-06 11:42               ` Peter Zijlstra
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 11:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Peter Zijlstra, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 6, 2015 at 12:07 PM, Kent Overstreet <kmo@daterainc.com> wrote:
> On Tue, Jan 06, 2015 at 12:01:12PM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
>> > On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
>> > >> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
>> > >>
>> > >
>> > > Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
>> > > I'm not touching the AIO code either ;-)
>> >
>> > I know, I was so excited when I see nearly the same output.
>> >
>> > Can you tell me why people see "similiar" problems in different areas?
>>
>> Because the debug check is new :-) It's a pattern that should not be
>> used but mostly works most of the times.
>>
>> > [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
>> > __might_sleep+0xbd/0xd0()
>> > [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
>> > set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
>> >
>> > With similiar buzzwords... namely...
>> >
>> > mutex_lock_nested
>> > prepare_to_wait(_event)
>> > __might_sleep
>> >
>> > I am asking myself... Where is the real root cause - in sched/core?
>> > Fix one single place VS. fix the impact at several other places?
>>
>> No, the root cause is nesting sleep primitives, this is not fixable in
>> the one place, both prepare_to_wait and mutex_lock are using
>> task_struct::state, they have to, no way around it.
>
> No, it's completely possible to construct a prepare_to_wait() that doesn't
> require messing with the task state. Had it for years.
>
> http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

I am just rebuilding a new kernel with "aio_ring_fix" included - I
have tested this already with loop-mq and it made the call-trace in aio
go away.


Just curious...
What would a patch to fix the sched-fanotify issue look like
with a conversion to "closure waitlist"?

Thanks.

- Sedat -

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:25               ` Sedat Dilek
@ 2015-01-06 11:40                 ` Kent Overstreet
  2015-01-06 12:51                   ` Sedat Dilek
  0 siblings, 1 reply; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 11:40 UTC (permalink / raw)
  To: Sedat Dilek; +Cc: Peter Zijlstra, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:25:39PM +0100, Sedat Dilek wrote:
> On Tue, Jan 6, 2015 at 12:07 PM, Kent Overstreet <kmo@daterainc.com> wrote:
> > On Tue, Jan 06, 2015 at 12:01:12PM +0100, Peter Zijlstra wrote:
> >> On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
> >> > On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > > On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
> >> > >> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
> >> > >>
> >> > >
> >> > > Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
> >> > > I'm not touching the AIO code either ;-)
> >> >
> >> > I know, I was so excited when I see nearly the same output.
> >> >
> >> > Can you tell me why people see "similiar" problems in different areas?
> >>
> >> Because the debug check is new :-) It's a pattern that should not be
> >> used but mostly works most of the times.
> >>
> >> > [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
> >> > __might_sleep+0xbd/0xd0()
> >> > [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
> >> > set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
> >> >
> >> > With similiar buzzwords... namely...
> >> >
> >> > mutex_lock_nested
> >> > prepare_to_wait(_event)
> >> > __might_sleep
> >> >
> >> > I am asking myself... Where is the real root cause - in sched/core?
> >> > Fix one single place VS. fix the impact at several other places?
> >>
> >> No, the root cause is nesting sleep primitives, this is not fixable in
> >> the one place, both prepare_to_wait and mutex_lock are using
> >> task_struct::state, they have to, no way around it.
> >
> > No, it's completely possible to construct a prepare_to_wait() that doesn't
> > require messing with the task state. Had it for years.
> >
> > http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
> 
> I am just rebuilding a new kernel with "aio_ring_fix" included - I
> have tested this alread with loop-mq and it made the call-trace in aio
> go away.
> 
> 
> Jut curious...
> How would a patch look like a patch to fix the sched-fanotify issue
> with a conversion to "closure waitlist"?

wait_queue_head_t	-> struct closure_waitlist
DEFINE_WAIT()		-> struct closure cl; closure_init_stack(&cl)
prepare_to_wait()	-> closure_wait(&waitlist, &cl)
schedule()		-> closure_sync()
finish_wait()		-> closure_wake_up(); closure_sync()

That's the standard conversion, I haven't looked at the fanotify code before
just now but from a cursory glance it appears that all should work here. Only
annoying thing is the waitqueue here is actually part of the poll interface (if
I'm reading this correctly), so I dunno what I'd do about that.

Also FYI: closure waitlists are currently singly linked, thus there's no direct
equivalent to finish_wait(), the conversion I gave works but will lead to
spurious wakeups. I kinda figured I was going to have to switch to doubly linked
lists eventually though.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:07             ` Kent Overstreet
  2015-01-06 11:25               ` Sedat Dilek
@ 2015-01-06 11:42               ` Peter Zijlstra
  2015-01-06 11:48                 ` Peter Zijlstra
  2015-01-06 11:56                 ` Kent Overstreet
  2015-01-06 11:58               ` Peter Zijlstra
  2015-01-16 16:56               ` Peter Hurley
  3 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 11:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 03:07:30AM -0800, Kent Overstreet wrote:
> > No, the root cause is nesting sleep primitives, this is not fixable in
> > the one place, both prepare_to_wait and mutex_lock are using
> > task_struct::state, they have to, no way around it.
> 
> No, it's completely possible to construct a prepare_to_wait() that doesn't
> require messing with the task state. Had it for years.
> 
> http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

Your closures are cute but not the same. And sure you can do a wait
queue like interface -- my wait_woken thing is an example -- that
doesn't require task state.

The point remains that you then have to fix every instance to conform to
the new interface.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:42               ` Peter Zijlstra
@ 2015-01-06 11:48                 ` Peter Zijlstra
  2015-01-06 12:01                   ` Kent Overstreet
  2015-01-06 11:56                 ` Kent Overstreet
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 11:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason



Looking at that closure stuff, why is there an smp_mb() in
closure_wake_up() ? Typically wakeup only needs to imply a wmb.

Also note that __closure_wake_up() starts with a fully serializing
instruction (xchg) and thereby already implies the full barrier.



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:42               ` Peter Zijlstra
  2015-01-06 11:48                 ` Peter Zijlstra
@ 2015-01-06 11:56                 ` Kent Overstreet
  2015-01-06 12:16                   ` Peter Zijlstra
  1 sibling, 1 reply; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 11:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:42:15PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 06, 2015 at 03:07:30AM -0800, Kent Overstreet wrote:
> > > No, the root cause is nesting sleep primitives, this is not fixable in
> > > the one place, both prepare_to_wait and mutex_lock are using
> > > task_struct::state, they have to, no way around it.
> > 
> > No, it's completely possible to construct a prepare_to_wait() that doesn't
> > require messing with the task state. Had it for years.
> > 
> > http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
> 
> Your closures are cute but not the same. And sure you can do a wait
> queue like interface -- my wait_woken thing is an example -- that
> doesn't require task state.
> 
> The point remains that you then have to fix every instance to conform to
> the new interface.

Possibly I missed your point because I've been overly crotchety lately.

I do want to make the point that it's not really the callers that are broken;
especially those that are using prepare_to_wait() via wait_event(). Using
wait_event() is exactly what people _should_ be doing, instead of open coding
this stuff and/or coming up with hacks to work around the fact that
prepare_to_wait() is implemented via messing with the task state.

This is a sore point for me because I've seen other experienced, _skilled_
programmers screw this kind of code up too many times and I hate debugging lost
wakeups. I can write this kind of code in my sleep because I've spent too much
of my time implementing these kinds of primitives and I'm sure you can too,
but most people can't.

Anyways, my point is either wait_event() should be fixed to not muck with the
task state, or since that's not really practical we should at least provide a
standard drop in replacement that doesn't.

And the drop in replacement more or less exists, closure_wait_event() has the
same semantics as wait_event, similarly with the lower level primitives I just
listed the conversions for.

I don't even really care if it's closures, we could make something new that uses
the existing wait_queue_head_t but IMO it's gonna look a lot like closures with
an embedded refcount and closure_sync() instead of schedule().

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:07             ` Kent Overstreet
  2015-01-06 11:25               ` Sedat Dilek
  2015-01-06 11:42               ` Peter Zijlstra
@ 2015-01-06 11:58               ` Peter Zijlstra
  2015-01-06 12:18                 ` Kent Overstreet
  2015-01-16 16:56               ` Peter Hurley
  3 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 11:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 03:07:30AM -0800, Kent Overstreet wrote:
> http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

Very terse changelogs there :/

Also, I'm not sure I agree with that whole closure_wait_event*() stuff,
the closure interface as it existed before that makes sense, but now
you're just mixing up things.

Why would you want to retrofit a lot of the wait_event*() stuff on top
of this?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:48                 ` Peter Zijlstra
@ 2015-01-06 12:01                   ` Kent Overstreet
  2015-01-06 12:20                     ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 12:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
> 
> 
> Looking at that closure stuff, why is there an smp_mb() in
> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
> 
> Also note that __closure_wake_up() starts with a fully serializing
> instruction (xchg) and thereby already implies the full barrier.

Probably no good reason, that code is pretty old :)

If I was to hazard a guess, I had my own lockless linked lists before llist.h
existed and perhaps I did it with atomic_xchg() - which was at least documented
to not imply a barrier. I suppose it should just be dropped.
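
For readers unfamiliar with the pattern being discussed, here is a minimal
userspace sketch of a lockless singly linked list whose "take everything"
operation is a single exchange. It is an assumption-laden model, not the
closure code and not llist.h: the struct and function names are
hypothetical, and C11 memory_order_seq_cst is used only as the nearest
stand-in for a fully ordered kernel xchg (the two memory models are not
identical).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int value;
};

struct lockless_list {
	_Atomic(struct node *) first;
};

void list_init(struct lockless_list *l)
{
	atomic_init(&l->first, (struct node *)NULL);
}

/* Push one node at the head with a compare-and-swap retry loop. */
void list_push(struct lockless_list *l, struct node *n)
{
	struct node *old = atomic_load_explicit(&l->first,
						memory_order_relaxed);

	do {
		n->next = old;	/* 'old' is refreshed on each failed CAS */
	} while (!atomic_compare_exchange_weak_explicit(&l->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/*
 * Detach the whole list with one exchange. Because the exchange itself
 * is sequentially consistent here, no separate fence precedes it.
 */
struct node *list_del_all(struct lockless_list *l)
{
	return atomic_exchange_explicit(&l->first, (struct node *)NULL,
					memory_order_seq_cst);
}
```

A waker would call list_del_all() once and then walk the returned chain
privately, so the only shared-memory operation on the wakeup path is the
exchange.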

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:56                 ` Kent Overstreet
@ 2015-01-06 12:16                   ` Peter Zijlstra
  2015-01-06 12:43                     ` Kent Overstreet
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 12:16 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 03:56:45AM -0800, Kent Overstreet wrote:
> I do want to make the point that it's not really the callers that are broken;
> especially those that are using prepare_to_wait() via wait_event(). Using
> wait_event() is exactly what people _should_ be doing, instead of open coding
> this stuff and/or coming up with hacks to work around the fact that
> prepare_to_wait() is implemented via messing with the task state.

Yes and no.

So I agree that people should be using wait_event(), but I was also very
much hoping people would not be nesting sleep primitives like this.

Now that we have the debug check it's at least obvious when you do.

But yes I'm somewhat saddened by the amount of stuff that has come up
because of this.

> Anyways, my point is either wait_event() should be fixed to not muck with the
> task state, or since that's not really practical we should at least provide a
> standard drop in replacement that doesn't.

I had explicitly not done this because I had hoped this would be rare
and feel/felt we should not encourage this pattern.

> And the drop in replacement more or less exists, closure_wait_event() has the
> same semantics as wait_event, similarly with the lower level primitives I just
> listed the conversions for.

See my other email, I don't really agree with the whole
closure_wait_event() thing, I think it dilutes what closures are. You've
just used what you know to cobble together something that has the right
semantics, but it's not at all related to the concept of what closures
were.

I'm also not sure we want to change the existing wait_event() stuff to
allow nested sleeps per default, there is some additional overhead
involved -- although it could turn out to not be an issue, we'd have to
look at that.

But IF we want to create a drop in replacement it should be in the wait
code, it shouldn't be too hard once we've decided we do indeed want to
go do this.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:58               ` Peter Zijlstra
@ 2015-01-06 12:18                 ` Kent Overstreet
  0 siblings, 0 replies; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 12:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:58:22PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 06, 2015 at 03:07:30AM -0800, Kent Overstreet wrote:
> > http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
> 
> Very terse changelogs there :/

erg, I've been slacking on changelogs lately. that closure_sync() fix definitely
merits explanation.

> Also, I'm not sure I agree with that whole closure_wait_event*() stuff,
> the closure interface as it exist before that makes sense, but now
> you're just mixing up things.
> 
> Why would you want to retrofit a lot of the wait_event*() stuff on top
> of this?

Actually it's not retrofitted, closure_wait_event() dates to the very original
closure code, it was dropped for awhile because bcache happened not to be using
it anymore and I just dug it out of the git history.

Think of it this way - closures wait on things: sometimes you want to wait
asynchronously, sometimes synchronously, but you want the same primitives for
both - something has to bridge the gap between the async and sync stuff.

For example - here's the code in the bcache-dev branch that handles reading the
journal from each device in the cache set in parallel:

http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/journal.c?h=bcache-dev#n399

It's using closure_call() to kick off the read for each device, then
closure_sync() to wait on them all to finish.

So closure_sync() is completely necessary, and then once you've got that
closure_wait_event() is just a trivial macro.

Also, closures could be using wait_queue_head_t instead of closure waitlist,
mainly I didn't want to nearly double the size of closures to stuff in a
__wait_queue.

I'd argue that "closures the junk for writing weird pseudo continuation passing
style asynchronous C" are not really the important parts of closures, the
important part is the infrastructure for waiting on stuff and then doing
something when that stuff completes. closure_get(), closure_put() and waitlists
are the real primitives; both closure_sync() and all the fancy asynchronous
stuff are built on top of that.
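
The get/put idea can be sketched in userspace as a refcounted wait object
that runs a callback when the last reference is dropped. This is a
deliberately simplified, single-threaded illustration: the wait_obj names
are hypothetical, there is no waitlist, and it should not be read as how
bcache closures are actually implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct wait_obj {
	atomic_int remaining;			/* outstanding references */
	void (*done)(struct wait_obj *);	/* runs when the last ref drops */
	int done_calls;				/* bookkeeping for the example */
};

/* The creator starts out holding one reference. */
void wait_obj_init(struct wait_obj *w, void (*done)(struct wait_obj *))
{
	atomic_init(&w->remaining, 1);
	w->done = done;
	w->done_calls = 0;
}

/* Take a reference on behalf of one operation that is in flight. */
void wait_obj_get(struct wait_obj *w)
{
	atomic_fetch_add(&w->remaining, 1);
}

/* Drop a reference; whoever drops the last one runs the callback. */
void wait_obj_put(struct wait_obj *w)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&w->remaining, 1) == 1 && w->done != NULL)
		w->done(w);
}

/* Example callback: just record that completion happened. */
void mark_done(struct wait_obj *w)
{
	w->done_calls++;
}
```

In this model, kicking off N operations means N calls to wait_obj_get(),
each completion path calls wait_obj_put(), and the creator drops its own
reference once everything has been submitted. A synchronous wait and an
asynchronous continuation are then two different choices of what happens
when the count reaches zero.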

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 12:01                   ` Kent Overstreet
@ 2015-01-06 12:20                     ` Peter Zijlstra
  2015-01-06 12:45                       ` Kent Overstreet
  2015-01-06 12:55                       ` Peter Hurley
  0 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 12:20 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
> > 
> > 
> > Looking at that closure stuff, why is there an smp_mb() in
> > closure_wake_up() ? Typically wakeup only needs to imply a wmb.
> > 
> > Also note that __closure_wake_up() starts with a fully serializing
> > instruction (xchg) and thereby already implies the full barrier.
> 
> Probably no good reason, that code is pretty old :)
> 
> If I was to hazard a guess, I had my own lockless linked lists before llist.h
> existed and perhaps I did it with atomic_xchg() - which was at least documented
> to not imply a barrier. I suppose it should just be dropped.

We (probably me) should probably audit all the atomic_xchg()
implementations and documentation and fix that. I was very much under
the impression it should imply a full barrier (and it certainly does on
x86), the documentation should state the rule that any atomic_ function
that returns a result is fully serializing, therefore, because
atomic_xchg() has a return value, it should too.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 12:16                   ` Peter Zijlstra
@ 2015-01-06 12:43                     ` Kent Overstreet
  2015-01-06 13:03                       ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 12:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 01:16:03PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 06, 2015 at 03:56:45AM -0800, Kent Overstreet wrote:
> > I do want to make the point that it's not really the callers that are broken;
> > especially those that are using prepare_to_wait() via wait_event(). Using
> > wait_event() is exactly what people _should_ be doing, instead of open coding
> > this stuff and/or coming up with hacks to work around the fact that
> > prepare_to_wait() is implemented via messing with the task state.
> 
> Yes and no.
> 
> So I agree that people should be using wait_event(), but I was also very
> much hoping people would not be nesting sleep primitives like this.
> 
> Now that we have the debug check its at least obvious when you do.
> 
> But yes I'm somewhat saddened by the amount of stuff that has come up
> because of this.

The cond argument to wait_event() _really is_ an arbitrary expression/chunk of
code; it's inescapable that you're going to be doing stuff that sleeps, and even
much more complicated stuff in there.

I have code out of tree that's sending network RPCs under wait_event_timeout()
(or did we switch that to closures? I'd have to check...) - and that actually
wasn't the first way I wrote it, but when I rewrote it that way the end result
was _much_ improved and easier to understand.

> > Anyways, my point is either wait_event() should be fixed to not muck with the
> > task state, or since that's not really practical we should at least provide a
> > standard drop in replacement that doesn't.
> 
> I had explicitly not done this because I had hoped this would be rare
> and feel/felt we should not encourage this pattern.

But it should be encouraged! If the expression you're waiting on sleeps, you
shouldn't have to contort your code to work around that - I mean, look at the
history of the AIO code, what was tried in the past and what Ben came up with
most recently for this bug.

I can see where you're coming from, but this is something I've learned from
painful experience.

> > And the drop in replacement more or less exists, closure_wait_event() has the
> > same semantics as wait_event, similarly with the lower level primitives I just
> > listed the conversions for.
> 
> See my other email, I don't really agree with the whole
> closure_wait_event() thing, I think it dilutes what closures are. You've
> just used what you know to cobble together something that has the right
> semantics, but its not at all related to the concept of what closures
> were.

You know, if anyone's the authority on what closures are it's me :) I've done a
lot of programming with them, and experimented a lot with them - I've added and
taken back out lots of functionality, and this is something I'll confidently say
naturally goes with closures.

> I'm also not sure we want to change the existing wait_event() stuff to
> allow nested sleeps per default, there is some additional overhead
> involved -- although it could turn out to not be an issue, we'd have to
> look at that.

Yeah I don't think there's anything wrong with having two parallel
implementations, with a slightly faster one that doesn't allow sleeps.

> But IF we want to create a drop in replacement it should be in the wait
> code, it shouldn't be too hard once we've decided we do indeed want to
> go do this.

I don't care one way or the other there.

It might make the most sense to cook up something new, stealing some of the
closure code but using the standard wait_queue_head_t - having a single standard
waitlist type is definitely a good thing, and unfortunately I don't think it'd
be a good idea to convert closures to wait_queue_head_t mainly because of the
memory usage.

I will note that one thing that has been immensely useful with closures is the
ability to pass a closure around - think of it as a "wait object" - to some code
that may end up waiting on something, but you don't want to itself sleep, and
then the caller can closure_sync() or continue_at() or whatever it wants (or use
the same closure for waiting on multiple things, e.g. where we wait on writing
the two new btree nodes after a split).

Think of it as a souped-up completion.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 12:20                     ` Peter Zijlstra
@ 2015-01-06 12:45                       ` Kent Overstreet
  2015-01-06 12:55                       ` Peter Hurley
  1 sibling, 0 replies; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 12:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 01:20:06PM +0100, Peter Zijlstra wrote:
> We (probably me) should probably audit all the atomic_xchg()
> implementations and documentation and fix that. I was very much under
> the impression it should imply a full barrier (and it certainly does on
> x86), the documentation should state the rule that any atomic_ function
> that returns a result is fully serializing, therefore, because
> atomic_xchg() has a return value, it should too.

I think that the documentation was changed awhile ago - I'd have to check and I
should sleep, though. It was probably 4-5 years ago that I saw that old weird
"atomic_xchg() doesn't imply barriers" thing.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:40                 ` Kent Overstreet
@ 2015-01-06 12:51                   ` Sedat Dilek
  0 siblings, 0 replies; 101+ messages in thread
From: Sedat Dilek @ 2015-01-06 12:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Peter Zijlstra, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 6, 2015 at 12:40 PM, Kent Overstreet <kmo@daterainc.com> wrote:
> On Tue, Jan 06, 2015 at 12:25:39PM +0100, Sedat Dilek wrote:
>> On Tue, Jan 6, 2015 at 12:07 PM, Kent Overstreet <kmo@daterainc.com> wrote:
>> > On Tue, Jan 06, 2015 at 12:01:12PM +0100, Peter Zijlstra wrote:
>> >> On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
>> >> > On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > > On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
>> >> > >> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
>> >> > >>
>> >> > >
>> >> > > Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
>> >> > > I'm not touching the AIO code either ;-)
>> >> >
>> >> > I know, I was so excited when I see nearly the same output.
>> >> >
>> >> > Can you tell me why people see "similiar" problems in different areas?
>> >>
>> >> Because the debug check is new :-) It's a pattern that should not be
>> >> used but mostly works most of the times.
>> >>
>> >> > [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
>> >> > __might_sleep+0xbd/0xd0()
>> >> > [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
>> >> > set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
>> >> >
>> >> > With similiar buzzwords... namely...
>> >> >
>> >> > mutex_lock_nested
>> >> > prepare_to_wait(_event)
>> >> > __might_sleep
>> >> >
>> >> > I am asking myself... Where is the real root cause - in sched/core?
>> >> > Fix one single place VS. fix the impact at several other places?
>> >>
>> >> No, the root cause is nesting sleep primitives, this is not fixable in
>> >> the one place, both prepare_to_wait and mutex_lock are using
>> >> task_struct::state, they have to, no way around it.
>> >
>> > No, it's completely possible to construct a prepare_to_wait() that doesn't
>> > require messing with the task state. Had it for years.
>> >
>> > http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
>>
>> I am just rebuilding a new kernel with "aio_ring_fix" included - I
>> have tested this alread with loop-mq and it made the call-trace in aio
>> go away.
>>
>>
>> Jut curious...
>> How would a patch look like a patch to fix the sched-fanotify issue
>> with a conversion to "closure waitlist"?
>
> wait_queue_head_t       -> struct closure_waitlist
> DEFINE_WAIT()           -> struct closure cl; closure_init_stack(&cl)
> prepare_to_wait()       -> closure_wait(&waitlist, &cl)
> schedule()              -> closure_sync()
> finish_wait()           -> closure_wake_up(); closure_sync()
>
> That's the standard conversion, I haven't looked at the fanotify code before
> just now but from a cursory glance it appears that all should work here. Only
> annoying thing is the waitqueue here is actually part of the poll interface (if
> I'm reading this correctly), so I dunno what I'd do about that.
>
> Also FYI: closure waitlists are currently singly linked, thus there's no direct
> equivalent to finish_wait(), the conversion I gave works but will lead to
> spurious wakeups. I kinda figured I was going to have to switch to doubly linked
> lists eventually though.

I followed as far as I have understood the subsequent discussion.
Let's see where this will lead to.

I am also very curious about how that aio issue will be fixed.

Thanks Peter and Kent for the vital and hopefully fruitful discussion.

- Sedat -

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 12:20                     ` Peter Zijlstra
  2015-01-06 12:45                       ` Kent Overstreet
@ 2015-01-06 12:55                       ` Peter Hurley
  2015-01-06 17:38                         ` Paul E. McKenney
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Hurley @ 2015-01-06 12:55 UTC (permalink / raw)
  To: Peter Zijlstra, Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason,
	Paul E. McKenney

[ +cc Paul McKenney ]

On 01/06/2015 07:20 AM, Peter Zijlstra wrote:
> On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
>> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
>>>
>>>
>>> Looking at that closure stuff, why is there an smp_mb() in
>>> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
>>>
>>> Also note that __closure_wake_up() starts with a fully serializing
>>> instruction (xchg) and thereby already implies the full barrier.
>>
>> Probably no good reason, that code is pretty old :)
>>
>> If I was to hazard a guess, I had my own lockless linked lists before llist.h
>> existed and perhaps I did it with atomic_xchg() - which was at least documented
>> to not imply a barrier. I suppose it should just be dropped.
> 
> We (probably me) should probably audit all the atomic_xchg()
> implementations and documentation and fix that. I was very much under
> the impression it should imply a full barrier (and it certainly does on
> x86), the documentation should state the rule that any atomic_ function
> that returns a result is fully serializing, therefore, because
> atomic_xchg() has a return value, it should too.

memory-barriers.txt and atomic_ops.txt appear to contradict each other here,
but I think that's because atomic_ops.txt has drifted toward an
arch-implementer's POV:

260:atomic_xchg requires explicit memory barriers around the operation.

All the serializing atomic operations have descriptions like this.

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 12:43                     ` Kent Overstreet
@ 2015-01-06 13:03                       ` Peter Zijlstra
  2015-01-06 13:28                         ` Kent Overstreet
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-06 13:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 04:43:13AM -0800, Kent Overstreet wrote:
> It might make the most sense to cook up something new, stealing some of the
> closure code but using the standard wait_queue_head_t - having a single standard
> waitlist type is definitely a good thing, and unfortunately I don't think it'd
> be a good idea to convert closures to wait_queue_head_t mainly because of the
> memory usage.
> 
> I will note that one thing that has been immensely useful with closures is the
> ability to pass a closure around - think of it as a "wait object" - to some code
> that may end up waiting on something, but you don't want to itself sleep, and
> then the caller can closure_sync() or continue_at() or whatever it wants (or use
> the same closure for waiting on multiple things, e.g. where we wait on writing
> the two new btree nodes after a split).
> 
> Think of it as a souped-up completion.

Yeah I got that aspect. I'm still trying to get my head around how the
wait_event bit would be a natural match though ;-)

Let me stew a bit on that.

That said, the RT people want a simple waitqueue, one that has
deterministic behaviour. This is only possible by removing some of the
more obscure waitqueue features and thus also results in a slimmer
structure.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 13:03                       ` Peter Zijlstra
@ 2015-01-06 13:28                         ` Kent Overstreet
  2015-01-13 15:23                           ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Kent Overstreet @ 2015-01-06 13:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 02:03:17PM +0100, Peter Zijlstra wrote:
> Yeah I got that aspect. I'm still trying to get my head around how the
> wait_event bit would be a natural match though ;-)
> 
> Let me stew a bit on that.

Also, I think it was kind of a surprise to me back in the day how structurally
similar the wait_event() implementation (circa several years ago, not the
monstrosity it is now :) and closure_wait_event() turned out - I wasn't aiming
for that, but that got me thinking there must be something more fundamental
going on here.

Probably rambling at this point though.

> That said, the RT people want a simple waitqueue, one that has
> deterministic behaviour. This is only possible by removing some of the
> more obscure waitqueue features and thus also results in a slimmer
> structure.

Oh really? That's good to hear.

I do like wake_all() being cheaper with the singly linked list, wakeups are much
more common than waiting on things (e.g. the aio code delivering events to the
ringbuffer, anything that's freeing up resources).

Been kind of wondering how sane it would be to implement
finish_wait()/wake_one() with a singly linked list, and maybe preserve some of
the locklessness. You do fancy lockless stuff too, don't you? Maybe you have
some ideas :)
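
A rough userspace sketch of why wake_all() is cheap on a singly linked
list, using C11 atomics rather than the kernel's llist.h or the closure
code. The names struct waiter, struct waitlist, waitlist_add() and
waitlist_wake_all() are hypothetical, and "waking" is reduced to setting
a flag. The point is only that detaching every waiter costs one atomic
exchange on the head, whatever the list length:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical waiter node; not the kernel's or bcache's actual type. */
struct waiter {
	struct waiter *next;
	int woken;
};

struct waitlist {
	_Atomic(struct waiter *) head;
};

/* Push a waiter onto the list with a compare-and-swap loop (LIFO order). */
static void waitlist_add(struct waitlist *wl, struct waiter *w)
{
	struct waiter *old;

	assert(wl != NULL && w != NULL);
	old = atomic_load_explicit(&wl->head, memory_order_relaxed);
	do {
		w->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&wl->head, &old, w,
			memory_order_release, memory_order_relaxed));
}

/*
 * Detach the whole list with a single exchange, then walk the private
 * copy. Returns the number of waiters that were marked woken.
 */
static int waitlist_wake_all(struct waitlist *wl)
{
	struct waiter *w, *next;
	int count = 0;

	w = atomic_exchange_explicit(&wl->head, (struct waiter *)NULL,
				     memory_order_acq_rel);
	for (; w != NULL; w = next) {
		next = w->next;	/* read before the waiter may reuse its node */
		w->woken = 1;
		count++;
	}
	return count;
}
```

Removing one specific waiter from the middle (the finish_wait() case)
has no equally cheap answer in this layout, which is the open question
raised above.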

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 12:55                       ` Peter Hurley
@ 2015-01-06 17:38                         ` Paul E. McKenney
  2015-01-06 17:58                           ` Peter Hurley
  0 siblings, 1 reply; 101+ messages in thread
From: Paul E. McKenney @ 2015-01-06 17:38 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 07:55:39AM -0500, Peter Hurley wrote:
> [ +cc Paul McKenney ]
> 
> On 01/06/2015 07:20 AM, Peter Zijlstra wrote:
> > On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
> >> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
> >>>
> >>>
> >>> Looking at that closure stuff, why is there an smp_mb() in
> >>> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
> >>>
> >>> Also note that __closure_wake_up() starts with a fully serializing
> >>> instruction (xchg) and thereby already implies the full barrier.
> >>
> >> Probably no good reason, that code is pretty old :)
> >>
> >> If I was to hazard a guess, I had my own lockless linked lists before llist.h
> >> existed and perhaps I did it with atomic_xchg() - which was at least documented
> >> to not imply a barrier. I suppose it should just be dropped.
> > 
> > We (probably me) should probably audit all the atomic_xchg()
> > implementations and documentation and fix that. I was very much under
> > the impression it should imply a full barrier (and it certainly does on
> > x86), the documentation should state the rule that any atomic_ function
> > that returns a result is fully serializing, therefore, because
> > atomic_xchg() has a return value, it should too.
> 
> memory-barriers.txt and atomic_ops.txt appear to contradict each other here,
> but I think that's because atomic_ops.txt has drifted toward an
> arch-implementer's POV:
> 
> 260:atomic_xchg requires explicit memory barriers around the operation.
> 
> All the serializing atomic operations have descriptions like this.

I am not seeing the contradiction.

You posted the relevant line from atomic_ops.txt.  The relevant passage
from memory-barriers.txt is as follows:

	Any atomic operation that modifies some state in memory and
	returns information about the state (old or new) implies an
	SMP-conditional general memory barrier (smp_mb()) on each side
	of the actual operation (with the exception of explicit lock
	operations, described later).  These include:

		xchg();
		...
		atomic_xchg();			atomic_long_xchg();

So it appears to me that both documents require full barriers before and
after any atomic exchange operation in SMP builds.  Therefore, any
SMP-capable architecture that omits these barriers is buggy.

So, what am I missing here?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 17:38                         ` Paul E. McKenney
@ 2015-01-06 17:58                           ` Peter Hurley
  2015-01-06 19:25                             ` Paul E. McKenney
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Hurley @ 2015-01-06 17:58 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On 01/06/2015 12:38 PM, Paul E. McKenney wrote:
> On Tue, Jan 06, 2015 at 07:55:39AM -0500, Peter Hurley wrote:
>> [ +cc Paul McKenney ]
>>
>> On 01/06/2015 07:20 AM, Peter Zijlstra wrote:
>>> On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
>>>> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
>>>>>
>>>>>
>>>>> Looking at that closure stuff, why is there an smp_mb() in
>>>>> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
>>>>>
>>>>> Also note that __closure_wake_up() starts with a fully serializing
>>>>> instruction (xchg) and thereby already implies the full barrier.
>>>>
>>>> Probably no good reason, that code is pretty old :)
>>>>
>>>> If I was to hazard a guess, I had my own lockless linked lists before llist.h
>>>> existed and perhaps I did it with atomic_xchg() - which was at least documented
>>>> to not imply a barrier. I suppose it should just be dropped.
>>>
>>> We (probably me) should probably audit all the atomic_xchg()
>>> implementations and documentation and fix that. I was very much under
>>> the impression it should imply a full barrier (and it certainly does on
>>> x86), the documentation should state the rule that any atomic_ function
>>> that returns a result is fully serializing, therefore, because
>>> atomic_xchg() has a return value, it should too.
>>
>> memory-barriers.txt and atomic_ops.txt appear to contradict each other here,
>> but I think that's because atomic_ops.txt has drifted toward an
>> arch-implementer's POV:
>>
>> 260:atomic_xchg requires explicit memory barriers around the operation.
>>
>> All the serializing atomic operations have descriptions like this.
> 
> I am not seeing the contradiction.
> 
> You posted the relevant line from atomic_ops.txt.  The relevant passage
> from memory-barriers.txt is as follows:
> 
> 	Any atomic operation that modifies some state in memory and
> 	returns information about the state (old or new) implies an
> 	SMP-conditional general memory barrier (smp_mb()) on each side
> 	of the actual operation (with the exception of explicit lock
> 	operations, described later).  These include:
> 
> 		xchg();
> 		...
> 		atomic_xchg();			atomic_long_xchg();
> 
> So it appears to me that both documents require full barriers before and
> after any atomic exchange operation in SMP builds.  Therefore, any
> SMP-capable architecture that omits these barriers is buggy.

Sure, I understand that, but I think the atomic_ops.txt is easy to
misinterpret.

> So, what am I missing here?

Well, it's a matter of the intended audience. There is a significant
difference between:

static inline int atomic_xchg(atomic_t *v, int new)
{
	/* this arch doesn't have serializing xchg() */
	smp_mb();
	/* arch xchg */
	smp_mb();
}

and

	smp_mb();
	atomic_xchg(&v, 1);
	smp_mb();

but both have "explicit memory barriers around the operation."

Regards,
Peter Hurley
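
A compilable userspace version of the first reading, as a sketch only.
It uses C11 atomics, with atomic_thread_fence(memory_order_seq_cst)
standing in for smp_mb(); the two are not formally equivalent, and
my_xchg() and claim_flag() are hypothetical names. The barriers sit
inside the primitive, so the caller adds none:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Implementer's side: an exchange with no ordering of its own,
 * wrapped in full fences so that the primitive as a whole is
 * fully ordered. Returns the previous value.
 */
static inline int my_xchg(atomic_int *v, int new)
{
	int old;

	assert(v != NULL);
	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for smp_mb() */
	old = atomic_exchange_explicit(v, new, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for smp_mb() */
	return old;
}

/*
 * User's side: relies on the ordering built into my_xchg() and adds
 * no barriers of its own. Returns 1 if this call set the flag first.
 */
static int claim_flag(atomic_int *flag)
{
	return my_xchg(flag, 1) == 0;
}
```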

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 17:58                           ` Peter Hurley
@ 2015-01-06 19:25                             ` Paul E. McKenney
  2015-01-06 19:57                               ` Peter Hurley
  0 siblings, 1 reply; 101+ messages in thread
From: Paul E. McKenney @ 2015-01-06 19:25 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:58:36PM -0500, Peter Hurley wrote:
> On 01/06/2015 12:38 PM, Paul E. McKenney wrote:
> > On Tue, Jan 06, 2015 at 07:55:39AM -0500, Peter Hurley wrote:
> >> [ +cc Paul McKenney ]
> >>
> >> On 01/06/2015 07:20 AM, Peter Zijlstra wrote:
> >>> On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
> >>>> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
> >>>>>
> >>>>>
> >>>>> Looking at that closure stuff, why is there an smp_mb() in
> >>>>> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
> >>>>>
> >>>>> Also note that __closure_wake_up() starts with a fully serializing
> >>>>> instruction (xchg) and thereby already implies the full barrier.
> >>>>
> >>>> Probably no good reason, that code is pretty old :)
> >>>>
> >>>> If I was to hazard a guess, I had my own lockless linked lists before llist.h
> >>>> existed and perhaps I did it with atomic_xchg() - which was at least documented
> >>>> to not imply a barrier. I suppose it should just be dropped.
> >>>
> >>> We (probably me) should probably audit all the atomic_xchg()
> >>> implementations and documentation and fix that. I was very much under
> >>> the impression it should imply a full barrier (and it certainly does on
> >>> x86), the documentation should state the rule that any atomic_ function
> >>> that returns a result is fully serializing, therefore, because
> >>> atomic_xchg() has a return value, it should too.
> >>
> >> memory-barriers.txt and atomic_ops.txt appear to contradict each other here,
> >> but I think that's because atomic_ops.txt has drifted toward an
> >> arch-implementer's POV:
> >>
> >> 260:atomic_xchg requires explicit memory barriers around the operation.
> >>
> >> All the serializing atomic operations have descriptions like this.
> > 
> > I am not seeing the contradiction.
> > 
> > You posted the relevant line from atomic_ops.txt.  The relevant passage
> > from memory-barriers.txt is as follows:
> > 
> > 	Any atomic operation that modifies some state in memory and
> > 	returns information about the state (old or new) implies an
> > 	SMP-conditional general memory barrier (smp_mb()) on each side
> > 	of the actual operation (with the exception of explicit lock
> > 	operations, described later).  These include:
> > 
> > 		xchg();
> > 		...
> > 		atomic_xchg();			atomic_long_xchg();
> > 
> > So it appears to me that both documents require full barriers before and
> > after any atomic exchange operation in SMP builds.  Therefore, any
> > SMP-capable architecture that omits these barriers is buggy.
> 
> Sure, I understand that, but I think the atomic_ops.txt is easy to
> misinterpret.
> 
> > So, what am I missing here?
> 
> Well, it's a matter of the intended audience. There is a significant
> difference between:
> 
> static inline int atomic_xchg(atomic_t *v, int new)
> {
> 	/* this arch doesn't have serializing xchg() */
> 	smp_mb();
> 	/* arch xchg */
> 	smp_mb();
> }
> 
> and
> 
> 	smp_mb();
> 	atomic_xchg(&v, 1);
> 	smp_mb();
> 
> but both have "explicit memory barriers around the operation."

The atomic_ops.txt file is pretty explicit about its intended audience
right at the beginning of the document:

	This document is intended to serve as a guide to Linux port
	maintainers on how to implement atomic counter, bitops, and spinlock
	interfaces properly.

It is intended for people implementing the atomic operations more than
for people making use of them.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 19:25                             ` Paul E. McKenney
@ 2015-01-06 19:57                               ` Peter Hurley
  2015-01-06 20:47                                 ` Paul E. McKenney
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Hurley @ 2015-01-06 19:57 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On 01/06/2015 02:25 PM, Paul E. McKenney wrote:
> On Tue, Jan 06, 2015 at 12:58:36PM -0500, Peter Hurley wrote:
>> On 01/06/2015 12:38 PM, Paul E. McKenney wrote:
>>> On Tue, Jan 06, 2015 at 07:55:39AM -0500, Peter Hurley wrote:
>>>> [ +cc Paul McKenney ]
>>>>
>>>> On 01/06/2015 07:20 AM, Peter Zijlstra wrote:
>>>>> On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
>>>>>> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
>>>>>>>
>>>>>>>
>>>>>>> Looking at that closure stuff, why is there an smp_mb() in
>>>>>>> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
>>>>>>>
>>>>>>> Also note that __closure_wake_up() starts with a fully serializing
>>>>>>> instruction (xchg) and thereby already implies the full barrier.
>>>>>>
>>>>>> Probably no good reason, that code is pretty old :)
>>>>>>
>>>>>> If I was to hazard a guess, I had my own lockless linked lists before llist.h
>>>>>> existed and perhaps I did it with atomic_xchg() - which was at least documented
>>>>>> to not imply a barrier. I suppose it should just be dropped.
>>>>>
>>>>> We (probably me) should probably audit all the atomic_xchg()
>>>>> implementations and documentation and fix that. I was very much under
>>>>> the impression it should imply a full barrier (and it certainly does on
>>>>> x86), the documentation should state the rule that any atomic_ function
>>>>> that returns a result is fully serializing, therefore, because
>>>>> atomic_xchg() has a return value, it should too.
>>>>
>>>> memory-barriers.txt and atomic_ops.txt appear to contradict each other here,
>>>> but I think that's because atomic_ops.txt has drifted toward an
>>>> arch-implementer's POV:
>>>>
>>>> 260:atomic_xchg requires explicit memory barriers around the operation.
>>>>
>>>> All the serializing atomic operations have descriptions like this.
>>>
>>> I am not seeing the contradiction.
>>>
>>> You posted the relevant line from atomic_ops.txt.  The relevant passage
>>> from memory-barriers.txt is as follows:
>>>
>>> 	Any atomic operation that modifies some state in memory and
>>> 	returns information about the state (old or new) implies an
>>> 	SMP-conditional general memory barrier (smp_mb()) on each side
>>> 	of the actual operation (with the exception of explicit lock
>>> 	operations, described later).  These include:
>>>
>>> 		xchg();
>>> 		...
>>> 		atomic_xchg();			atomic_long_xchg();
>>>
>>> So it appears to me that both documents require full barriers before and
>>> after any atomic exchange operation in SMP builds.  Therefore, any
>>> SMP-capable architecture that omits these barriers is buggy.
>>
>> Sure, I understand that, but I think the atomic_ops.txt is easy to
>> misinterpret.
>>
>>> So, what am I missing here?
>>
>> Well, it's a matter of the intended audience. There is a significant
>> difference between:
>>
>> static inline int atomic_xchg(atomic_t *v, int new)
>> {
>> 	/* this arch doesn't have serializing xchg() */
>> 	smp_mb();
>> 	/* arch xchg */
>> 	smp_mb();
>> }
>>
>> and
>>
>> 	smp_mb();
>> 	atomic_xchg(&v, 1);
>> 	smp_mb();
>>
>> but both have "explicit memory barriers around the operation."
> 
> The atomic_ops.txt file is pretty explicit about its intended audience
> right at the beginning of the document:
> 
> 	This document is intended to serve as a guide to Linux port
> 	maintainers on how to implement atomic counter, bitops, and spinlock
> 	interfaces properly.
> 
> It is intended for people implementing the atomic operations more than
> for people making use of them.

And yet the following admonition is clearly aimed at interface users:

*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

Some architectures may choose to use the volatile keyword, barriers, or inline
assembly to guarantee some degree of immediacy for atomic_read() and
atomic_set().  This is not uniformly guaranteed, and may change in the future,
so all users of atomic_t should treat atomic_read() and atomic_set() as simple
C statements that may be reordered or optimized away entirely by the compiler
or processor, and explicitly invoke the appropriate compiler and/or memory
barrier for each use case.  Failure to do so will result in code that may
suddenly break when used with different architectures or compiler
optimizations, or even changes in unrelated code which changes how the
compiler optimizes the section accessing atomic_t variables.

*** YOU HAVE BEEN WARNED! ***


To me, it makes sense to (also) document the arch-independent interfaces
for the much, much larger audience actually using them (not that I'm
suggesting this is your responsibility).

Regards,
Peter Hurley
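
What that admonition asks of users can be sketched in userspace C11,
with relaxed loads and stores standing in for atomic_read() and
atomic_set() and explicit fences standing in for the barriers the caller
has to supply. This is an analogy under those assumptions, not kernel
code; publish() and try_consume() are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;		/* plain data, published via the flag */
static atomic_int ready;	/* zero-initialized: nothing published yet */

/* Writer: the store to the flag carries no ordering, so add it explicitly. */
static void publish(int value)
{
	payload = value;
	atomic_thread_fence(memory_order_release);	/* payload before flag */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: likewise, the load of the flag orders nothing by itself. */
static int try_consume(int *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return 0;
	atomic_thread_fence(memory_order_acquire);	/* flag before payload */
	*out = payload;
	return 1;
}
```

Without the two fences, nothing in this sketch would stop the compiler
or CPU from reordering the payload access against the flag access.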



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 19:57                               ` Peter Hurley
@ 2015-01-06 20:47                                 ` Paul E. McKenney
  2015-01-20  0:30                                   ` Paul E. McKenney
  0 siblings, 1 reply; 101+ messages in thread
From: Paul E. McKenney @ 2015-01-06 20:47 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 02:57:37PM -0500, Peter Hurley wrote:
> On 01/06/2015 02:25 PM, Paul E. McKenney wrote:
> > On Tue, Jan 06, 2015 at 12:58:36PM -0500, Peter Hurley wrote:
> >> On 01/06/2015 12:38 PM, Paul E. McKenney wrote:
> >>> On Tue, Jan 06, 2015 at 07:55:39AM -0500, Peter Hurley wrote:
> >>>> [ +cc Paul McKenney ]
> >>>>
> >>>> On 01/06/2015 07:20 AM, Peter Zijlstra wrote:
> >>>>> On Tue, Jan 06, 2015 at 04:01:21AM -0800, Kent Overstreet wrote:
> >>>>>> On Tue, Jan 06, 2015 at 12:48:42PM +0100, Peter Zijlstra wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Looking at that closure stuff, why is there an smp_mb() in
> >>>>>>> closure_wake_up() ? Typically wakeup only needs to imply a wmb.
> >>>>>>>
> >>>>>>> Also note that __closure_wake_up() starts with a fully serializing
> >>>>>>> instruction (xchg) and thereby already implies the full barrier.
> >>>>>>
> >>>>>> Probably no good reason, that code is pretty old :)
> >>>>>>
> >>>>>> If I was to hazard a guess, I had my own lockless linked lists before llist.h
> >>>>>> existed and perhaps I did it with atomic_xchg() - which was at least documented
> >>>>>> to not imply a barrier. I suppose it should just be dropped.
> >>>>>
> >>>>> We (probably me) should probably audit all the atomic_xchg()
> >>>>> implementations and documentation and fix that. I was very much under
> >>>>> the impression it should imply a full barrier (and it certainly does on
> >>>>> x86), the documentation should state the rule that any atomic_ function
> >>>>> that returns a result is fully serializing, therefore, because
> >>>>> atomic_xchg() has a return value, it should too.
> >>>>
> >>>> memory-barriers.txt and atomic_ops.txt appear to contradict each other here,
> >>>> but I think that's because atomic_ops.txt has drifted toward an
> >>>> arch-implementer's POV:
> >>>>
> >>>> 260:atomic_xchg requires explicit memory barriers around the operation.
> >>>>
> >>>> All the serializing atomic operations have descriptions like this.
> >>>
> >>> I am not seeing the contradiction.
> >>>
> >>> You posted the relevant line from atomic_ops.txt.  The relevant passage
> >>> from memory-barriers.txt is as follows:
> >>>
> >>> 	Any atomic operation that modifies some state in memory and
> >>> 	returns information about the state (old or new) implies an
> >>> 	SMP-conditional general memory barrier (smp_mb()) on each side
> >>> 	of the actual operation (with the exception of explicit lock
> >>> 	operations, described later).  These include:
> >>>
> >>> 		xchg();
> >>> 		...
> >>> 		atomic_xchg();			atomic_long_xchg();
> >>>
> >>> So it appears to me that both documents require full barriers before and
> >>> after any atomic exchange operation in SMP builds.  Therefore, any
> >>> SMP-capable architecture that omits these barriers is buggy.
> >>
> >> Sure, I understand that, but I think the atomic_ops.txt is easy to
> >> misinterpret.
> >>
> >>> So, what am I missing here?
> >>
> >> Well, it's a matter of the intended audience. There is a significant
> >> difference between:
> >>
> >> static inline int atomic_xchg(atomic_t *v, int new)
> >> {
> >> 	/* this arch doesn't have serializing xchg() */
> >> 	smp_mb();
> >> 	/* arch xchg */
> >> 	smp_mb();
> >> }
> >>
> >> and
> >>
> >> 	smp_mb();
> >> 	atomic_xchg(&v, 1);
> >> 	smp_mb();
> >>
> >> but both have "explicit memory barriers around the operation."
> > 
> > The atomic_ops.txt file is pretty explicit about its intended audience
> > right at the beginning of the document:
> > 
> > 	This document is intended to serve as a guide to Linux port
> > 	maintainers on how to implement atomic counter, bitops, and spinlock
> > 	interfaces properly.
> > 
> > It is intended for people implementing the atomic operations more than
> > for people making use of them.
> 
> And yet the following admonition is clearly aimed at interface users:
> 
> *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
> 
> Some architectures may choose to use the volatile keyword, barriers, or inline
> assembly to guarantee some degree of immediacy for atomic_read() and
> atomic_set().  This is not uniformly guaranteed, and may change in the future,
> so all users of atomic_t should treat atomic_read() and atomic_set() as simple
> C statements that may be reordered or optimized away entirely by the compiler
> or processor, and explicitly invoke the appropriate compiler and/or memory
> barrier for each use case.  Failure to do so will result in code that may
> suddenly break when used with different architectures or compiler
> optimizations, or even changes in unrelated code which changes how the
> compiler optimizes the section accessing atomic_t variables.
> 
> *** YOU HAVE BEEN WARNED! ***
> 
> 
> To me, it makes sense to (also) document the arch-independent interfaces
> for the much, much larger audience actually using them (not that I'm
> suggesting this is your responsibility).

David Miller's call, actually.

But the rule is that if it is an atomic read-modify-write operation and it
returns a value, then the operation itself needs to include full memory
barriers before and after (as in the caller doesn't need to add them).
Otherwise, the operation does not need to include memory ordering.
Since xchg(), atomic_xchg(), and atomic_long_xchg() all return a value,
their implementations must include full memory barriers before and after.

Pretty straightforward.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 13:28                         ` Kent Overstreet
@ 2015-01-13 15:23                           ` Peter Zijlstra
  0 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2015-01-13 15:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 05:28:44AM -0800, Kent Overstreet wrote:
> On Tue, Jan 06, 2015 at 02:03:17PM +0100, Peter Zijlstra wrote:
> > Yeah I got that aspect. I'm still trying to get my head around how the
> > wait_event bit would be a natural match though ;-)
> > 
> > Let me stew a bit on that.
> 
> Also, I think it was kind of a surprise to me back in the day how structurally
> similar the wait_event() implementation (circa several years ago, not the
> monstrosity it is now :) and closure_wait_event() turned out - I wasn't aiming
> for that, but that got me thinking there must be something more fundamental
> going on here.

That 'monstrosity' helped reduce the line count significantly, but more
importantly it fixed a fair few inconsistencies across the various
wait_event*() functions. But yes, it's a bit of a handful.

Now back to why I don't really like closures for this purpose; the
wait_event*() stuff is really only a wait list, closures are a wait list
+ bits.

So while it makes sense to me to implement closures in terms of
wait_event, the reverse does not make sense to me.

Now you gave a good reason to not use the existing wait list stuff: it's
somewhat bloated, and that's fair.

> > That said, the RT people want a simple waitqueue, one that has
> > deterministic behaviour. This is only possible by removing some of the
> > more obscure waitqueue features and thus also results in a slimmer
> > structure.
> 
> Oh really? That's good to hear.

http://thread.gmane.org/gmane.linux.kernel/1808752 is the last posting
iirc.

> I do like wake_all() being cheaper with the singly linked list, wakeups are much
> more common than waiting on things (e.g. the aio code delivering events to the
> ringbuffer, anything that's freeing up resources).
> 
> Been kind of wondering how sane it would be to implement
> finish_wait()/wake_one() with a singly linked list, and maybe preserve some of
> the locklessness. You do fancy lockless stuff too, don't you? Maybe you have
> some ideas :)

Ha! I think I implemented the required nightmare, have a look at:

  fb0527bd5ea9 ("locking/mutexes: Introduce cancelable MCS lock for adaptive spinning")

MCS locks are basically a singly linked lockless FIFO queue; however, for
the optimistic spinning stuff we needed to be able to abort the lock
op/unlink ourselves.

I'll be the first to admit that that code is somewhat mind bending. I
had to draw quite a few doodles when writing that code :-)
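
For orientation, a minimal userspace sketch of a plain MCS lock in C11.
This is the ordinary, non-cancelable form and not the code from
fb0527bd5ea9; the abort/unlink path, which is the hard part, is left out
entirely. The names mcs_node, mcs_lock, mcs_lock_acquire() and
mcs_lock_release() are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;	/* set by the predecessor on handoff */
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;	/* NULL when the lock is free */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	assert(lock != NULL && node != NULL);
	atomic_store_explicit(&node->next, (struct mcs_node *)NULL,
			      memory_order_relaxed);
	atomic_store_explicit(&node->locked, false, memory_order_relaxed);

	/* Join the queue with a single exchange on the tail pointer. */
	prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL)
		return;		/* queue was empty: lock acquired */

	/* Link behind the predecessor, then spin on our own node only. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;		/* busy-wait until the predecessor hands off */
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next;

	assert(lock != NULL && node != NULL);
	next = atomic_load_explicit(&node->next, memory_order_acquire);
	if (next == NULL) {
		struct mcs_node *expected = node;

		/* No visible successor: try to mark the queue empty. */
		if (atomic_compare_exchange_strong_explicit(&lock->tail,
				&expected, (struct mcs_node *)NULL,
				memory_order_acq_rel, memory_order_acquire))
			return;

		/* A successor is mid-enqueue; wait for it to link itself. */
		while ((next = atomic_load_explicit(&node->next,
					memory_order_acquire)) == NULL)
			;
	}
	/* Hand the lock to the successor. */
	atomic_store_explicit(&next->locked, true, memory_order_release);
}
```

Each waiter spins only on its own node, and the queue is linked in one
direction only, so a waiter that wants to give up has no back pointer
with which to unlink itself.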



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 11:07             ` Kent Overstreet
                                 ` (2 preceding siblings ...)
  2015-01-06 11:58               ` Peter Zijlstra
@ 2015-01-16 16:56               ` Peter Hurley
  2015-01-16 17:00                 ` Chris Mason
  3 siblings, 1 reply; 101+ messages in thread
From: Peter Hurley @ 2015-01-16 16:56 UTC (permalink / raw)
  To: Kent Overstreet, Peter Zijlstra
  Cc: Sedat Dilek, Dave Jones, Linus Torvalds, LKML, Chris Mason

On 01/06/2015 06:07 AM, Kent Overstreet wrote:
> On Tue, Jan 06, 2015 at 12:01:12PM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
>>> On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
>>>>> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
>>>>>
>>>>
>>>> Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
>>>> I'm not touching the AIO code either ;-)
>>>
>>> I know, I was so excited when I saw nearly the same output.
>>>
>>> Can you tell me why people see "similar" problems in different areas?
>>
>> Because the debug check is new :-) It's a pattern that should not be
>> used but mostly works most of the time.
>>
>>> [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
>>> __might_sleep+0xbd/0xd0()
>>> [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
>>> set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
>>>
>>> With similar buzzwords... namely...
>>>
>>> mutex_lock_nested
>>> prepare_to_wait(_event)
>>> __might_sleep
>>>
>>> I am asking myself... Where is the real root cause - in sched/core?
>>> Fix one single place VS. fix the impact at several other places?
>>
>> No, the root cause is nesting sleep primitives, this is not fixable in
>> the one place, both prepare_to_wait and mutex_lock are using
>> task_struct::state, they have to, no way around it.
> 
> No, it's completely possible to construct a prepare_to_wait() that doesn't
> require messing with the task state. Had it for years.
> 
> http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

Peter & Kent,

What's the plan here?

I ask because this triggers maximum CPU utilization in mysql (mysqld is
at 795% on 8 cores). The machine is still usable so the threads are at
least sleeping :)

I grabbed cpu backtraces [1] in case anyone cares.

Regards,
Peter Hurley

PS - I didn't bother bisecting because bisection near the new checks
for nested sleeps results in an inoperable machine (my fault: when I
reviewed that series, I should have realized that the patches needed to
go in backwards).

[1] cpu backtraces
[   90.667392] sending NMI to all CPUs:
[   90.667552] NMI backtrace for cpu 0
[   90.667555] CPU: 0 PID: 2032 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667557] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667558] task: ffff8800bb768000 ti: ffff8800ba770000 task.ti: ffff8800ba770000
[   90.667560] RIP: 0010:[<ffffffff817b4a8a>]  [<ffffffff817b4a8a>] mutex_lock_nested+0xda/0x530
[   90.667562] RSP: 0018:ffff8800ba773d48  EFLAGS: 00000046
[   90.667563] RAX: 000000000000d8d8 RBX: ffff8800bad7b000 RCX: 0000000000000000
[   90.667565] RDX: 00000000000000d8 RSI: 0000000000000001 RDI: ffffffff810bf71d
[   90.667566] RBP: ffff8800ba773db8 R08: 0000000000000000 R09: 0000000000000000
[   90.667568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800bad7b200
[   90.667570] R13: ffff8800bad7b208 R14: 0000000000000246 R15: ffff8800bb768000
[   90.667572] FS:  00007fd92d07e700(0000) GS:ffff8802b4600000(0000) knlGS:0000000000000000
[   90.667573] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.667575] CR2: 00007f265ffff9c8 CR3: 00000002af011000 CR4: 00000000000007f0
[   90.667576] Stack:
[   90.667578]  ffffffff8124bbca ffffffff810bfaae ffffffff8124bbca ffff8800bad7b278
[   90.667580]  00000000457705c0 ffffffff817b97a5 ffffffff817b97a5 ffff8800bb768000
[   90.667582]  ffff8800ba773da8 ffff8800bad7b000 0000000000000001 ffff8800bad7b000
[   90.667583] Call Trace:
[   90.667585]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667586]  [<ffffffff810bfaae>] ? put_lock_stats.isra.26+0xe/0x30
[   90.667588]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667590]  [<ffffffff817b97a5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[   90.667591]  [<ffffffff817b97a5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[   90.667593]  [<ffffffff8124bbca>] aio_read_events+0x4a/0x350
[   90.667594]  [<ffffffff810b7a05>] ? prepare_to_wait_event+0x95/0x110
[   90.667596]  [<ffffffff8124c0cc>] read_events+0x1fc/0x230
[   90.667598]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667599]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667601]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.667602]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.667604]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.667606]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.667608] Code: 00 f0 66 41 0f c1 44 24 08 0f b6 d4 38 c2 0f 85 82 03 00 00 44 8b 05 1e 7e 62 01 45 85 c0 75 0b 4d 3b 64 24 70 0f 85 e3 03 00 00 <41> 8b 04 24 83 f8 01 0f 84 39 02 00 00 48 8d 75 b0 4c 89 e7 e8 
[   90.667610] NMI backtrace for cpu 1
[   90.667612] CPU: 1 PID: 2024 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667613] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667615] task: ffff8802b1ea4b20 ti: ffff8800bf124000 task.ti: ffff8800bf124000
[   90.667617] RIP: 0010:[<ffffffff810a53c9>]  [<ffffffff810a53c9>] local_clock+0x9/0x30
[   90.667618] RSP: 0018:ffff8800bf127c58  EFLAGS: 00000002
[   90.667620] RAX: 0000000000000004 RBX: ffff8802b1ea4b20 RCX: 0000000000000000
[   90.667622] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800bad78e78
[   90.667623] RBP: ffff8800bf127c58 R08: 0000000000000001 R09: 0000000000000000
[   90.667625] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000549
[   90.667626] R13: ffff8802b1ea53f8 R14: 0000000000000000 R15: ffff8800bad78e78
[   90.667628] FS:  00007fd931086700(0000) GS:ffff8802b4800000(0000) knlGS:0000000000000000
[   90.667630] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.667631] CR2: 00007f17540010b8 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.667633] Stack:
[   90.667635]  ffff8800bf127cc8 ffffffff810c3bc0 ffff8800bf127c98 ffffffff810a50f5
[   90.667636]  ffff8800bf127c98 ffff8802b49d5b80 00000000001d5b80 ffff880200000001
[   90.667638]  ffff8800bf127cb8 0000000000000246 0000000000000000 0000000000000000
[   90.667640] Call Trace:
[   90.667641]  [<ffffffff810c3bc0>] __lock_acquire+0x250/0x10b0
[   90.667643]  [<ffffffff810a50f5>] ? sched_clock_local+0x25/0x90
[   90.667644]  [<ffffffff810c51cd>] lock_acquire+0xbd/0x170
[   90.667646]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667647]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667649]  [<ffffffff817b4a22>] mutex_lock_nested+0x72/0x530
[   90.667651]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667652]  [<ffffffff810bfaae>] ? put_lock_stats.isra.26+0xe/0x30
[   90.667654]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667656]  [<ffffffff817b97a5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[   90.667657]  [<ffffffff817b97a5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[   90.667659]  [<ffffffff8124bbca>] aio_read_events+0x4a/0x350
[   90.667660]  [<ffffffff810b7a05>] ? prepare_to_wait_event+0x95/0x110
[   90.667662]  [<ffffffff8124c0cc>] read_events+0x1fc/0x230
[   90.667664]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667665]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667667]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.667669]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.667670]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.667672]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.667674] Code: 00 00 55 48 89 e5 66 66 66 66 90 e8 f2 fe ff ff 5d c3 e8 6b c7 f7 ff 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 <65> 8b 3d 38 4d f6 7e e8 cb fe ff ff 5d c3 66 0f 1f 84 00 00 00 
[   90.667676] NMI backtrace for cpu 2
[   90.667678] CPU: 2 PID: 2023 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667679] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667681] task: ffff8802adf64b20 ti: ffff8800bb4fc000 task.ti: ffff8800bb4fc000
[   90.667683] RIP: 0010:[<ffffffff810c3ac4>]  [<ffffffff810c3ac4>] __lock_acquire+0x154/0x10b0
[   90.667684] RSP: 0018:ffff8800bb4ffc68  EFLAGS: 00000006
[   90.667686] RAX: 0000000000000000 RBX: ffff8802adf64b20 RCX: 0000000000000000
[   90.667688] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800bad78278
[   90.667689] RBP: ffff8800bb4ffcc8 R08: 0000000000000001 R09: 0000000000000000
[   90.667691] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff828a03d0
[   90.667692] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800bad78278
[   90.667694] FS:  00007fd940cdb700(0000) GS:ffff8802b4a00000(0000) knlGS:0000000000000000
[   90.667696] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.667697] CR2: 00007f5cae47f000 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.667699] Stack:
[   90.667701]  ffff8800bb4ffc98 ffffffff810a50f5 ffff8800bb4ffc98 ffff8802b4bd5b80
[   90.667702]  00000000001d5b80 ffff8802adf653f8 ffff8800bb4ffcb8 0000000000000246
[   90.667704]  0000000000000000 0000000000000000 0000000000000001 0000000000000000
[   90.667705] Call Trace:
[   90.667707]  [<ffffffff810a50f5>] ? sched_clock_local+0x25/0x90
[   90.667709]  [<ffffffff810c51cd>] lock_acquire+0xbd/0x170
[   90.667710]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667712]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667713]  [<ffffffff817b4a22>] mutex_lock_nested+0x72/0x530
[   90.667715]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667717]  [<ffffffff810bfaae>] ? put_lock_stats.isra.26+0xe/0x30
[   90.667718]  [<ffffffff8124bbca>] ? aio_read_events+0x4a/0x350
[   90.667720]  [<ffffffff817b97a5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[   90.667722]  [<ffffffff817b97a5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[   90.667723]  [<ffffffff8124bbca>] aio_read_events+0x4a/0x350
[   90.667725]  [<ffffffff810b7a05>] ? prepare_to_wait_event+0x95/0x110
[   90.667727]  [<ffffffff8124bb92>] ? aio_read_events+0x12/0x350
[   90.667728]  [<ffffffff8124c0cc>] read_events+0x1fc/0x230
[   90.667730]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667731]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667733]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.667735]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.667736]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.667738]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.667740] Code: 00 00 00 44 0f 44 c0 41 83 fd 01 44 89 e8 0f 87 2c ff ff ff 4d 8b 64 c7 08 4d 85 e4 0f 84 28 ff ff ff f0 41 ff 84 24 98 01 00 00 <8b> 05 ce 8d d1 01 44 8b b3 d0 08 00 00 85 c0 75 0a 41 83 fe 2f 
[   90.667742] NMI backtrace for cpu 3
[   90.667743] CPU: 3 PID: 2026 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667745] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667747] task: ffff8800bf7d8000 ti: ffff8800ba61c000 task.ti: ffff8800ba61c000
[   90.667748] RIP: 0010:[<ffffffff810c56b1>]  [<ffffffff810c56b1>] lock_release+0xe1/0x2a0
[   90.667750] RSP: 0018:ffff8800ba61fd98  EFLAGS: 00000092
[   90.667752] RAX: ffff8800bf7d8000 RBX: ffff8800bf7d8000 RCX: 000000000000a9c0
[   90.667753] RDX: ffff8802b4c58960 RSI: 000000151c32a5b1 RDI: 0000000000000092
[   90.667755] RBP: ffff8800ba61fdc8 R08: 0000000000000000 R09: 0000000000000001
[   90.667757] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800bad7b6c0
[   90.667758] R13: ffffffff810b7a05 R14: 0000000000000092 R15: 0000000000000001
[   90.667760] FS:  00007fd930084700(0000) GS:ffff8802b4c00000(0000) knlGS:0000000000000000
[   90.667762] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.667763] CR2: 00007fc97c006028 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.667765] Stack:
[   90.667766]  00007fd945764340 0000000000000296 ffff8800bad7b6a8 0000000000000001
[   90.667768]  00007fd945764340 0000000000000000 ffff8800ba61fde8 ffffffff817b9764
[   90.667770]  ffff8800ba61fe68 ffff8800bad7b6a8 ffff8800ba61fe38 ffffffff810b7a05
[   90.667771] Call Trace:
[   90.667773]  [<ffffffff817b9764>] _raw_spin_unlock_irqrestore+0x24/0x80
[   90.667774]  [<ffffffff810b7a05>] prepare_to_wait_event+0x95/0x110
[   90.667776]  [<ffffffff8124c0b1>] read_events+0x1e1/0x230
[   90.667778]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667779]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667781]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.667782]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.667784]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.667786]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.667788] Code: 85 b5 00 00 00 4c 89 ea 4c 89 e6 48 89 df e8 e7 fb ff ff 65 48 8b 04 25 80 ba 00 00 c7 80 d4 08 00 00 00 00 00 00 4c 89 f7 57 9d <66> 66 90 66 90 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d f3 c3 
[   90.667790] NMI backtrace for cpu 4
[   90.667791] CPU: 4 PID: 2028 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667793] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667795] task: ffff8800bf15cb20 ti: ffff8800bb534000 task.ti: ffff8800bb518000
[   90.667796] RIP: 0010:[<ffffffff810166bf>]  [<ffffffff810166bf>] __switch_to+0x1ff/0x5a0
[   90.667798] RSP: 0018:ffff8800bb51bd60  EFLAGS: 00000046
[   90.667800] RAX: ffff8800bb518000 RBX: ffff8800bf15cb20 RCX: 00000000c0000100
[   90.667801] RDX: 0000000080000002 RSI: 000000002f082700 RDI: 00000000c0000100
[   90.667803] RBP: ffff8800bb51bdb0 R08: 0000000000000000 R09: 0000000000000000
[   90.667805] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8802b03aa590
[   90.667806] R13: ffff8802b4fd21c0 R14: 0000000000000000 R15: ffff8802b03aabf8
[   90.667808] FS:  00007fd92f082700(0000) GS:ffff8802b4e00000(0000) knlGS:0000000000000000
[   90.667810] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.667811] CR2: 00007fc9841ab350 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.667813] Stack:
[   90.667815]  ffff8800bf15cb20 00000000b4fd5058 0000000000000004 ffff8800bf15d188
[   90.667816]  0000000000000096 ffff8802b4fd5040 ffff8802ad83b200 ffff8802ad83b200
[   90.667818]  ffff8802b03aa590 0000000000000000 ffff8800bf15cb20 ffffffff817b2a33
[   90.667819] Call Trace:
[   90.667821]  [<ffffffff817b2a33>] __schedule+0x393/0x8d0
[   90.667823]  [<ffffffff817b2f99>] ? schedule+0x29/0x70
[   90.667824]  [<ffffffff8124c099>] ? read_events+0x1c9/0x230
[   90.667826]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667827]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667829]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.667831]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.667832]  [<ffffffff8124e0d9>] ? SyS_io_getevents+0x59/0x110
[   90.667834]  [<ffffffff817ba46d>] ? system_call_fastpath+0x16/0x1b
[   90.667836] Code: 89 05 e6 49 ff 7e 49 8b 44 24 08 65 8b 15 ba 53 ff 7e 65 48 89 1d d2 53 ff 7e 89 50 1c 48 8b 43 08 8b 50 1c 65 89 15 a1 53 ff 7e <48> 8d 90 d8 3f 00 00 65 48 89 15 ba 53 ff 7e f7 40 10 00 00 41 
[   90.667838] NMI backtrace for cpu 5
[   90.667839] CPU: 5 PID: 2030 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667841] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667843] task: ffff8802b03a8000 ti: ffff8800bfb7c000 task.ti: ffff8800bfb7c000
[   90.667845] RIP: 0010:[<ffffffff810c3ac4>]  [<ffffffff810c3ac4>] __lock_acquire+0x154/0x10b0
[   90.667846] RSP: 0018:ffff8800bfb7fcd8  EFLAGS: 00000002
[   90.667848] RAX: 0000000000000000 RBX: ffff8802b03a8000 RCX: 0000000000000000
[   90.667850] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800bad78ac0
[   90.667851] RBP: ffff8800bfb7fd38 R08: 0000000000000001 R09: 0000000000000001
[   90.667853] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff828a0d80
[   90.667855] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800bad78ac0
[   90.667856] FS:  00007fd92e080700(0000) GS:ffff8802b5000000(0000) knlGS:0000000000000000
[   90.667858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   90.667860] CR2: 00007f9b65f089c8 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.667861] Stack:
[   90.667863]  0000000000000b56 ffff8802b51d5b80 00000000001d5b80 ffffffff817b2f2c
[   90.667864]  ffff8800bfb7fd18 ffffffff810a5358 ffff8802b03a8000 0000000000000046
[   90.667866]  0000000000000000 0000000000000000 0000000000000001 0000000000000000
[   90.667868] Call Trace:
[   90.667869]  [<ffffffff817b2f2c>] ? __schedule+0x88c/0x8d0
[   90.667871]  [<ffffffff810a5358>] ? sched_clock_cpu+0xb8/0xe0
[   90.667872]  [<ffffffff810c51cd>] lock_acquire+0xbd/0x170
[   90.667874]  [<ffffffff810b79c9>] ? prepare_to_wait_event+0x59/0x110
[   90.667876]  [<ffffffff817b95c8>] _raw_spin_lock_irqsave+0x58/0xa0
[   90.667877]  [<ffffffff810b79c9>] ? prepare_to_wait_event+0x59/0x110
[   90.667879]  [<ffffffff817b2a77>] ? __schedule+0x3d7/0x8d0
[   90.667881]  [<ffffffff810b79c9>] prepare_to_wait_event+0x59/0x110
[   90.667882]  [<ffffffff8124c0b1>] read_events+0x1e1/0x230
[   90.667884]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667886]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667887]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.667889]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.667891]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.667892]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.667894] Code: 00 00 00 44 0f 44 c0 41 83 fd 01 44 89 e8 0f 87 2c ff ff ff 4d 8b 64 c7 08 4d 85 e4 0f 84 28 ff ff ff f0 41 ff 84 24 98 01 00 00 <8b> 05 ce 8d d1 01 44 8b b3 d0 08 00 00 85 c0 75 0a 41 83 fe 2f 
[   90.667896] NMI backtrace for cpu 6
[   90.667898] CPU: 6 PID: 2025 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.667900] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.667901] task: ffff8802accb8000 ti: ffff8800ba518000 task.ti: ffff8800ba518000
[   90.667903] RIP: 0010:[<ffffffff810bf71d>]  [<ffffffff810bf71d>] trace_hardirqs_off+0xd/0x10
[   90.667904] RSP: 0018:ffff8802b5203908  EFLAGS: 00000046
[   90.667906] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
[   90.667908] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff810bf71d
[   90.667909] RBP: ffff8802b5203908 R08: 0000000000000001 R09: 0000000000000001
[   90.667911] R10: 000000000000048b R11: ffff8802b520361e R12: 0000000000000096
[   90.667913] R13: 0000000000000c00 R14: 00000000000000ff R15: ffff8802adb36ba0
[   90.667914] FS:  00007fd930885700(0000) GS:ffff8802b5200000(0000) knlGS:0000000000000000
[   90.667916] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.667917] CR2: 00007fcc1e73b000 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.667919] Stack:
[   90.667921]  ffff8802b5203938 ffffffff810555cf 000000000000d100 0000000000000001
[   90.667922]  0000000000000001 0000000000000004 ffff8802b5203988 ffffffff81051313
[   90.667924]  0000000000000000 0000000000000010 ffff8802b52039a8 ffffffff81cc65a0
[   90.667926] Call Trace:
[   90.667927]  <IRQ>  [<ffffffff810555cf>] flat_send_IPI_mask+0xbf/0xd0
[   90.667929]  [<ffffffff81051313>] arch_trigger_all_cpu_backtrace+0x273/0x280
[   90.667930]  [<ffffffff814ada43>] sysrq_handle_showallcpus+0x13/0x20
[   90.667932]  [<ffffffff814ae187>] __handle_sysrq+0x137/0x1b0
[   90.667934]  [<ffffffff814ae055>] ? __handle_sysrq+0x5/0x1b0
[   90.667935]  [<ffffffff814ae5d9>] sysrq_filter+0x3a9/0x3f0
[   90.667937]  [<ffffffff81606979>] input_to_handler+0x59/0xf0
[   90.667938]  [<ffffffff81609589>] input_pass_values.part.4+0x1a9/0x1b0
[   90.667940]  [<ffffffff816093e5>] ? input_pass_values.part.4+0x5/0x1b0
[   90.667942]  [<ffffffff81609e29>] input_handle_event+0x129/0x550
[   90.667943]  [<ffffffff8160a295>] ? input_event+0x45/0x70
[   90.667945]  [<ffffffff8160a2a9>] input_event+0x59/0x70
[   90.667946]  [<ffffffffa02f409f>] hidinput_report_event+0x3f/0x50 [hid]
[   90.667948]  [<ffffffffa02f2208>] hid_report_raw_event+0x148/0x1c0 [hid]
[   90.667950]  [<ffffffffa02f2399>] hid_input_report+0x119/0x1a0 [hid]
[   90.667951]  [<ffffffffa0120878>] ? logi_dj_raw_event+0x168/0x2d0 [hid_logitech_dj]
[   90.667953]  [<ffffffffa0120973>] logi_dj_raw_event+0x263/0x2d0 [hid_logitech_dj]
[   90.667955]  [<ffffffff817b978a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
[   90.667956]  [<ffffffffa02f23e7>] hid_input_report+0x167/0x1a0 [hid]
[   90.667958]  [<ffffffffa0327d02>] hid_irq_in+0xc2/0x260 [usbhid]
[   90.667960]  [<ffffffff815bc8e3>] __usb_hcd_giveback_urb+0x83/0x130
[   90.667961]  [<ffffffff815bcab3>] usb_hcd_giveback_urb+0x43/0x120
[   90.667963]  [<ffffffff815e751a>] uhci_giveback_urb+0xaa/0x290
[   90.667965]  [<ffffffff811cb2f7>] ? dma_pool_free+0xa7/0xd0
[   90.667966]  [<ffffffff815e9533>] uhci_scan_schedule+0x493/0xb30
[   90.667968]  [<ffffffff815ea20e>] uhci_irq+0xae/0x1a0
[   90.667969]  [<ffffffff815bbb66>] usb_hcd_irq+0x26/0x40
[   90.667971]  [<ffffffff810d7be9>] handle_irq_event_percpu+0x59/0x260
[   90.667973]  [<ffffffff810d7e31>] handle_irq_event+0x41/0x70
[   90.667974]  [<ffffffff810da928>] ? handle_fasteoi_irq+0x28/0x140
[   90.667976]  [<ffffffff810da982>] handle_fasteoi_irq+0x82/0x140
[   90.667977]  [<ffffffff8101a832>] handle_irq+0x22/0x40
[   90.667979]  [<ffffffff817bd5e1>] do_IRQ+0x51/0xf0
[   90.667981]  [<ffffffff817bb272>] common_interrupt+0x72/0x72
[   90.667982]  <EOI>  [<ffffffff810bfaae>] ? put_lock_stats.isra.26+0xe/0x30
[   90.667984]  [<ffffffff817b97f4>] ? _raw_spin_unlock_irq+0x34/0x60
[   90.667986]  [<ffffffff817b97f0>] ? _raw_spin_unlock_irq+0x30/0x60
[   90.667987]  [<ffffffff810992a1>] finish_task_switch+0x91/0x170
[   90.667989]  [<ffffffff81099262>] ? finish_task_switch+0x52/0x170
[   90.667991]  [<ffffffff817b2a5c>] __schedule+0x3bc/0x8d0
[   90.667992]  [<ffffffff817b2f99>] schedule+0x29/0x70
[   90.667994]  [<ffffffff8124c099>] read_events+0x1c9/0x230
[   90.667996]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.667997]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.667999]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.668000]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.668002]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.668003]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.668003] Code: aa 81 31 c0 be 44 0a 00 00 48 c7 c7 10 92 a9 81 e8 79 3c fb ff eb b4 0f 1f 80 00 00 00 00 55 48 89 e5 48 8b 7d 08 e8 33 ff ff ff <5d> c3 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 81 ff 00 30 f3 
[   90.668003] NMI backtrace for cpu 7
[   90.668003] CPU: 7 PID: 2029 Comm: mysqld Tainted: G        W      3.19.0-rc4+wip-xeon+debug #rc4+wip
[   90.668003] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
[   90.668003] task: ffff8802b04b0000 ti: ffff8800bf924000 task.ti: ffff8800bf924000
[   90.668003] RIP: 0010:[<ffffffff81021486>]  [<ffffffff81021486>] native_read_tsc+0x6/0x20
[   90.668003] RSP: 0018:ffff8800bf927cd8  EFLAGS: 00000092
[   90.668003] RAX: 00000000fe931f38 RBX: ffff8802b55d5b80 RCX: 0000000000000000
[   90.668003] RDX: 000000000000004a RSI: 0000000000000000 RDI: ffff8802b55d5b80
[   90.668003] RBP: ffff8800bf927cd8 R08: 0000000000000000 R09: 0000000000000001
[   90.668003] R10: 0000000000000000 R11: 0000000000000304 R12: 0000000000000000
[   90.668003] R13: ffff8802b55d5b90 R14: 0000000000000092 R15: 0000000000000001
[   90.668003] FS:  00007fd92e881700(0000) GS:ffff8802b5400000(0000) knlGS:0000000000000000
[   90.668003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   90.668003] CR2: 00007f3289fc6000 CR3: 00000002af011000 CR4: 00000000000007e0
[   90.668003] Stack:
[   90.668003]  ffff8800bf927ce8 ffffffff81021ab5 ffff8800bf927cf8 ffffffff81021b29
[   90.668003]  ffff8800bf927d28 ffffffff810a50f5 ffff8802b04b0000 ffff8802b55d5b80
[   90.668003]  00000000001d5b80 ffffffff810b7a05 ffff8800bf927d48 ffffffff810a5358
[   90.668003] Call Trace:
[   90.668003]  [<ffffffff81021ab5>] native_sched_clock+0x35/0xa0
[   90.668003]  [<ffffffff81021b29>] sched_clock+0x9/0x10
[   90.668003]  [<ffffffff810a50f5>] sched_clock_local+0x25/0x90
[   90.668003]  [<ffffffff810b7a05>] ? prepare_to_wait_event+0x95/0x110
[   90.668003]  [<ffffffff810a5358>] sched_clock_cpu+0xb8/0xe0
[   90.668003]  [<ffffffff810a53d5>] local_clock+0x15/0x30
[   90.668003]  [<ffffffff810c0155>] lock_release_holdtime.part.27+0x15/0x1b0
[   90.668003]  [<ffffffff810b7a05>] ? prepare_to_wait_event+0x95/0x110
[   90.668003]  [<ffffffff810c57fe>] lock_release+0x22e/0x2a0
[   90.668003]  [<ffffffff817b9764>] _raw_spin_unlock_irqrestore+0x24/0x80
[   90.668003]  [<ffffffff810b7a05>] prepare_to_wait_event+0x95/0x110
[   90.668003]  [<ffffffff8124c0b1>] read_events+0x1e1/0x230
[   90.668003]  [<ffffffff810b7a80>] ? prepare_to_wait_event+0x110/0x110
[   90.668003]  [<ffffffff810eae00>] ? hrtimer_get_res+0x50/0x50
[   90.668003]  [<ffffffff810eb384>] ? hrtimer_start_range_ns+0x14/0x20
[   90.668003]  [<ffffffff8124c1c0>] ? lookup_ioctx+0xc0/0x150
[   90.668003]  [<ffffffff8124e0d9>] SyS_io_getevents+0x59/0x110
[   90.668003]  [<ffffffff817ba46d>] system_call_fastpath+0x16/0x1b
[   90.668003] Code: 00 e8 5f d7 0b 00 e9 b7 fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 5d c3 0f 1f 44 00 00 55 48 89 e5 0f 31 <89> c0 48 c1 e2 20 5d 48 09 c2 48 89 d0 c3 66 66 66 2e 0f 1f 84 

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-16 16:56               ` Peter Hurley
@ 2015-01-16 17:00                 ` Chris Mason
  2015-01-16 18:58                   ` Peter Hurley
  0 siblings, 1 reply; 101+ messages in thread
From: Chris Mason @ 2015-01-16 17:00 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Kent Overstreet, Peter Zijlstra, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML

On Fri, Jan 16, 2015 at 11:56 AM, Peter Hurley 
<peter@hurleysoftware.com> wrote:
> On 01/06/2015 06:07 AM, Kent Overstreet wrote:
>> On Tue, Jan 06, 2015 at 12:01:12PM +0100, Peter Zijlstra wrote:
>>> On Tue, Jan 06, 2015 at 11:18:04AM +0100, Sedat Dilek wrote:
>>>> On Tue, Jan 6, 2015 at 11:06 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> On Tue, Jan 06, 2015 at 10:57:19AM +0100, Sedat Dilek wrote:
>>>>>> [   88.028739]  [<ffffffff8124433f>] aio_read_events+0x4f/0x2d0
>>>>>>
>>>>>
>>>>> Ah, that one. Chris Mason and Kent Overstreet were looking at that one.
>>>>> I'm not touching the AIO code either ;-)
>>>>
>>>> I know, I was so excited when I saw nearly the same output.
>>>>
>>>> Can you tell me why people see "similar" problems in different areas?
>>>
>>> Because the debug check is new :-) It's a pattern that should not be
>>> used but works most of the time.
>>>
>>>> [  181.397024] WARNING: CPU: 0 PID: 2872 at kernel/sched/core.c:7303
>>>> __might_sleep+0xbd/0xd0()
>>>> [  181.397028] do not call blocking ops when !TASK_RUNNING; state=1
>>>> set at [<ffffffff810b83bd>] prepare_to_wait_event+0x5d/0x110
>>>>
>>>> With similar buzzwords... namely...
>>>>
>>>> mutex_lock_nested
>>>> prepare_to_wait(_event)
>>>> __might_sleep
>>>>
>>>> I am asking myself... Where is the real root cause - in sched/core?
>>>> Fix one single place VS. fix the impact at several other places?
>>>
>>> No, the root cause is nesting sleep primitives, this is not fixable in
>>> the one place, both prepare_to_wait and mutex_lock are using
>>> task_struct::state, they have to, no way around it.
>>
>> No, it's completely possible to construct a prepare_to_wait() that doesn't
>> require messing with the task state. Had it for years.
>>
>> http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix
> 
> Peter & Kent,
> 
> What's the plan here?

I'm cleaning up my patch slightly and resubmitting.

-chris




^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-16 17:00                 ` Chris Mason
@ 2015-01-16 18:58                   ` Peter Hurley
  0 siblings, 0 replies; 101+ messages in thread
From: Peter Hurley @ 2015-01-16 18:58 UTC (permalink / raw)
  To: Chris Mason
  Cc: Kent Overstreet, Peter Zijlstra, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML

On 01/16/2015 12:00 PM, Chris Mason wrote:
> On Fri, Jan 16, 2015 at 11:56 AM, Peter Hurley <peter@hurleysoftware.com> wrote:
>> Peter & Kent,
>>
>> What's the plan here?
> 
> I'm cleaning up my patch slightly and resubmitting.

Awesome. Will you copy me on it, please?

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-06 20:47                                 ` Paul E. McKenney
@ 2015-01-20  0:30                                   ` Paul E. McKenney
  2015-01-20 14:03                                     ` Peter Hurley
  0 siblings, 1 reply; 101+ messages in thread
From: Paul E. McKenney @ 2015-01-20  0:30 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On Tue, Jan 06, 2015 at 12:47:53PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 06, 2015 at 02:57:37PM -0500, Peter Hurley wrote:

[ . . . ]

> David Miller's call, actually.
> 
> But the rule is that if it is an atomic read-modify-write operation and it
> returns a value, then the operation itself needs to include full memory
> barriers before and after (as in the caller doesn't need to add them).
> Otherwise, the operation does not need to include memory ordering.
> Since xchg(), atomic_xchg(), and atomic_long_xchg() all return a value,
> their implementations must include full memory barriers before and after.
> 
> Pretty straightforward.  ;-)

Hello again, Peter,

Were you going to push a patch clarifying this?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-20  0:30                                   ` Paul E. McKenney
@ 2015-01-20 14:03                                     ` Peter Hurley
  2015-02-02 16:11                                       ` Paul E. McKenney
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Hurley @ 2015-01-20 14:03 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On 01/19/2015 07:30 PM, Paul E. McKenney wrote:
> On Tue, Jan 06, 2015 at 12:47:53PM -0800, Paul E. McKenney wrote:
>> On Tue, Jan 06, 2015 at 02:57:37PM -0500, Peter Hurley wrote:
> 
> [ . . . ]
> 
>> David Miller's call, actually.
>>
>> But the rule is that if it is an atomic read-modify-write operation and it
>> returns a value, then the operation itself needs to include full memory
>> barriers before and after (as in the caller doesn't need to add them).
>> Otherwise, the operation does not need to include memory ordering.
>> Since xchg(), atomic_xchg(), and atomic_long_xchg() all return a value,
>> their implementations must include full memory barriers before and after.
>>
>> Pretty straightforward.  ;-)
> 
> Hello again, Peter,
> 
> Were you going to push a patch clarifying this?

Hi Paul,

As you pointed out, atomic_ops.txt is for arch implementors, so I wasn't
planning on patching that file.

I've been meaning to write up something specifically for everyone else but
my own bugs have kept me from that. [That, and I'm not sure what I write
will be suitable for Documentation.]

Regards,
Peter Hurley

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-20 14:03                                     ` Peter Hurley
@ 2015-02-02 16:11                                       ` Paul E. McKenney
  2015-02-02 19:03                                         ` Peter Hurley
  0 siblings, 1 reply; 101+ messages in thread
From: Paul E. McKenney @ 2015-02-02 16:11 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On Tue, Jan 20, 2015 at 09:03:12AM -0500, Peter Hurley wrote:
> On 01/19/2015 07:30 PM, Paul E. McKenney wrote:
> > On Tue, Jan 06, 2015 at 12:47:53PM -0800, Paul E. McKenney wrote:
> >> On Tue, Jan 06, 2015 at 02:57:37PM -0500, Peter Hurley wrote:
> > 
> > [ . . . ]
> > 
> >> David Miller's call, actually.
> >>
> >> But the rule is that if it is an atomic read-modify-write operation and it
> >> returns a value, then the operation itself needs to include full memory
> >> barriers before and after (as in the caller doesn't need to add them).
> >> Otherwise, the operation does not need to include memory ordering.
> >> Since xchg(), atomic_xchg(), and atomic_long_xchg() all return a value,
> >> their implementations must include full memory barriers before and after.
> >>
> >> Pretty straightforward.  ;-)
> > 
> > Hello again, Peter,
> > 
> > Were you going to push a patch clarifying this?
> 
> Hi Paul,
> 
> As you pointed out, atomic_ops.txt is for arch implementors, so I wasn't
> planning on patching that file.
> 
> I've been meaning to write up something specifically for everyone else but
> my own bugs have kept me from that. [That, and I'm not sure what I write
> will be suitable for Documentation.]

Well, upon revisiting this after coming back from travel, I am more inclined
to agree that a change would be good.  I doubt if you would be the only
person who might be confused.  So how about the following patch?

							Thanx, Paul

------------------------------------------------------------------------

documentation: Clarify memory-barrier semantics of atomic operations

All value-returning atomic read-modify-write operations must provide full
memory-barrier semantics on both sides of the operation.  This commit
clarifies the documentation to make it clear that these memory-barrier
semantics are provided by the operations themselves, not by their callers.

Reported-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index 183e41bdcb69..672201e79c49 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -201,11 +201,11 @@ These routines add 1 and subtract 1, respectively, from the given
 atomic_t and return the new counter value after the operation is
 performed.
 
-Unlike the above routines, it is required that explicit memory
-barriers are performed before and after the operation.  It must be
-done such that all memory operations before and after the atomic
-operation calls are strongly ordered with respect to the atomic
-operation itself.
+Unlike the above routines, it is required that these primitives
+inlude explicit memory barriers that are performed before and after
+the operation.  It must be done such that all memory operations before
+and after the atomic operation calls are strongly ordered with respect
+to the atomic operation itself.
 
 For example, it should behave as if a smp_mb() call existed both
 before and after the atomic operation.
@@ -233,21 +233,21 @@ These two routines increment and decrement by 1, respectively, the
 given atomic counter.  They return a boolean indicating whether the
 resulting counter value was zero or not.
 
-It requires explicit memory barrier semantics around the operation as
-above.
+Again, these primitive provide explicit memory barrier semantics around
+the atomic operation.
 
 	int atomic_sub_and_test(int i, atomic_t *v);
 
 This is identical to atomic_dec_and_test() except that an explicit
-decrement is given instead of the implicit "1".  It requires explicit
-memory barrier semantics around the operation.
+decrement is given instead of the implicit "1".  This primitive must
+provide explicit memory barrier semantics around the operation.
 
 	int atomic_add_negative(int i, atomic_t *v);
 
-The given increment is added to the given atomic counter value.  A
-boolean is return which indicates whether the resulting counter value
-is negative.  It requires explicit memory barrier semantics around the
-operation.
+The given increment is added to the given atomic counter value.  A boolean
+is return which indicates whether the resulting counter value is negative.
+This primitive must provide explicit memory barrier semantics around
+the operation.
 
 Then:
 
@@ -257,7 +257,7 @@ This performs an atomic exchange operation on the atomic variable v, setting
 the given new value.  It returns the old value that the atomic variable v had
 just before the operation.
 
-atomic_xchg requires explicit memory barriers around the operation.
+atomic_xchg must provide explicit memory barriers around the operation.
 
 	int atomic_cmpxchg(atomic_t *v, int old, int new);
 
@@ -266,7 +266,7 @@ with the given old and new values. Like all atomic_xxx operations,
 atomic_cmpxchg will only satisfy its atomicity semantics as long as all
 other accesses of *v are performed through atomic_xxx operations.
 
-atomic_cmpxchg requires explicit memory barriers around the operation.
+atomic_cmpxchg must provide explicit memory barriers around the operation.
 
 The semantics for atomic_cmpxchg are the same as those defined for 'cas'
 below.
@@ -279,8 +279,8 @@ If the atomic value v is not equal to u, this function adds a to v, and
 returns non zero. If v is equal to u then it returns zero. This is done as
 an atomic operation.
 
-atomic_add_unless requires explicit memory barriers around the operation
-unless it fails (returns 0).
+atomic_add_unless must provide explicit memory barriers around the
+operation unless it fails (returns 0).
 
 atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)
 
@@ -460,9 +460,9 @@ the return value into an int.  There are other places where things
 like this occur as well.
 
 These routines, like the atomic_t counter operations returning values,
-require explicit memory barrier semantics around their execution.  All
-memory operations before the atomic bit operation call must be made
-visible globally before the atomic bit operation is made visible.
+must provide explicit memory barrier semantics around their execution.
+All memory operations before the atomic bit operation call must be
+made visible globally before the atomic bit operation is made visible.
 Likewise, the atomic bit operation must be visible globally before any
 subsequent memory operation is made visible.  For example:
 
@@ -536,8 +536,9 @@ except that two underscores are prefixed to the interface name.
 These non-atomic variants also do not require any special memory
 barrier semantics.
 
-The routines xchg() and cmpxchg() need the same exact memory barriers
-as the atomic and bit operations returning values.
+The routines xchg() and cmpxchg() must provide the same exact
+memory-barrier semantics as the atomic and bit operations returning
+values.
 
 Spinlocks and rwlocks have memory barrier expectations as well.
 The rule to follow is simple:
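
As a rough userspace illustration of the rule this patch documents, a
value-returning read-modify-write carries its own ordering, while one that
returns nothing does not. This is a sketch only: it uses C11 <stdatomic.h>
rather than the kernel primitives, the model_* names are hypothetical, and
a C11 seq_cst fence only approximates smp_mb().

```c
#include <stdatomic.h>

/*
 * Hypothetical model of a value-returning atomic RMW: the primitive
 * itself behaves as if a full barrier sat on both sides of the
 * operation, so the caller adds none.
 */
static inline int model_atomic_xchg(atomic_int *v, int new)
{
	int old;

	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() before */
	old = atomic_exchange_explicit(v, new, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() after */
	return old;
}

/*
 * Hypothetical model of a non-value-returning atomic RMW: atomic,
 * but with no ordering guarantee of its own.
 */
static inline void model_atomic_inc(atomic_int *v)
{
	atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
}
```

A caller of model_atomic_xchg() adds no barriers of its own, whereas a
caller of model_atomic_inc() that needs ordering has to supply it (in the
kernel, for example, with smp_mb__before_atomic()/smp_mb__after_atomic()).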



* Re: Linux 3.19-rc3
  2015-02-02 16:11                                       ` Paul E. McKenney
@ 2015-02-02 19:03                                         ` Peter Hurley
  2015-02-02 19:33                                           ` Paul E. McKenney
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Hurley @ 2015-02-02 19:03 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On 02/02/2015 11:11 AM, Paul E. McKenney wrote:
> On Tue, Jan 20, 2015 at 09:03:12AM -0500, Peter Hurley wrote:
>> On 01/19/2015 07:30 PM, Paul E. McKenney wrote:
>>> On Tue, Jan 06, 2015 at 12:47:53PM -0800, Paul E. McKenney wrote:
>>>> On Tue, Jan 06, 2015 at 02:57:37PM -0500, Peter Hurley wrote:
>>>
>>> [ . . . ]
>>>
>>>> David Miller's call, actually.
>>>>
>>>> But the rule is that if it is an atomic read-modify-write operation and it
>>>> returns a value, then the operation itself needs to include full memory
>>>> barriers before and after (as in the caller doesn't need to add them).
>>>> Otherwise, the operation does not need to include memory ordering.
>>>> Since xchg(), atomic_xchg(), and atomic_long_xchg() all return a value,
>>>> their implementations must include full memory barriers before and after.
>>>>
>>>> Pretty straightforward.  ;-)
>>>
>>> Hello again, Peter,
>>>
>>> Were you going to push a patch clarifying this?
>>
>> Hi Paul,
>>
>> As you pointed out, atomic_ops.txt is for arch implementors, so I wasn't
>> planning on patching that file.
>>
>> I've been meaning to write up something specifically for everyone else but
>> my own bugs have kept me from that. [That, and I'm not sure what I write
>> will be suitable for Documentation.]
> 
> Well, upon revisiting this after coming back from travel, I am more inclined
> to agree that a change would be good.  I doubt if you would be the only
> person who might be confused.  So how about the following patch?

I think this is much clearer to both audiences. Thanks!

Regards,
Peter Hurley

[minor corrections below]

> ------------------------------------------------------------------------
> 
> documentation: Clarify memory-barrier semantics of atomic operations
> 
> All value-returning atomic read-modify-write operations must provide full
> memory-barrier semantics on both sides of the operation.  This commit
> clarifies the documentation to make it clear that these memory-barrier
> semantics are provided by the operations themselves, not by their callers.
> 
> Reported-by: Peter Hurley <peter@hurleysoftware.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
> index 183e41bdcb69..672201e79c49 100644
> --- a/Documentation/atomic_ops.txt
> +++ b/Documentation/atomic_ops.txt
> @@ -201,11 +201,11 @@ These routines add 1 and subtract 1, respectively, from the given
>  atomic_t and return the new counter value after the operation is
>  performed.
>  
> -Unlike the above routines, it is required that explicit memory
> -barriers are performed before and after the operation.  It must be
> -done such that all memory operations before and after the atomic
> -operation calls are strongly ordered with respect to the atomic
> -operation itself.
> +Unlike the above routines, it is required that these primitives
> +inlude explicit memory barriers that are performed before and after
   ^^^^^^
   include|provide
> +the operation.  It must be done such that all memory operations before
> +and after the atomic operation calls are strongly ordered with respect
> +to the atomic operation itself.
>  
>  For example, it should behave as if a smp_mb() call existed both
>  before and after the atomic operation.
> @@ -233,21 +233,21 @@ These two routines increment and decrement by 1, respectively, the
>  given atomic counter.  They return a boolean indicating whether the
>  resulting counter value was zero or not.
>  
> -It requires explicit memory barrier semantics around the operation as
> -above.
> +Again, these primitive provide explicit memory barrier semantics around
                         ^
                         s
> +the atomic operation.
>  
>  	int atomic_sub_and_test(int i, atomic_t *v);
>  
>  This is identical to atomic_dec_and_test() except that an explicit
> -decrement is given instead of the implicit "1".  It requires explicit
> -memory barrier semantics around the operation.
> +decrement is given instead of the implicit "1".  This primitive must
> +provide explicit memory barrier semantics around the operation.
>  
>  	int atomic_add_negative(int i, atomic_t *v);
>  
> -The given increment is added to the given atomic counter value.  A
> -boolean is return which indicates whether the resulting counter value
> -is negative.  It requires explicit memory barrier semantics around the
> -operation.
> +The given increment is added to the given atomic counter value.  A boolean
> +is return which indicates whether the resulting counter value is negative.
> +This primitive must provide explicit memory barrier semantics around
> +the operation.
>  
>  Then:
>  
> @@ -257,7 +257,7 @@ This performs an atomic exchange operation on the atomic variable v, setting
>  the given new value.  It returns the old value that the atomic variable v had
>  just before the operation.
>  
> -atomic_xchg requires explicit memory barriers around the operation.
> +atomic_xchg must provide explicit memory barriers around the operation.
>  
>  	int atomic_cmpxchg(atomic_t *v, int old, int new);
>  
> @@ -266,7 +266,7 @@ with the given old and new values. Like all atomic_xxx operations,
>  atomic_cmpxchg will only satisfy its atomicity semantics as long as all
>  other accesses of *v are performed through atomic_xxx operations.
>  
> -atomic_cmpxchg requires explicit memory barriers around the operation.
> +atomic_cmpxchg must provide explicit memory barriers around the operation.
>  
>  The semantics for atomic_cmpxchg are the same as those defined for 'cas'
>  below.
> @@ -279,8 +279,8 @@ If the atomic value v is not equal to u, this function adds a to v, and
>  returns non zero. If v is equal to u then it returns zero. This is done as
>  an atomic operation.
>  
> -atomic_add_unless requires explicit memory barriers around the operation
> -unless it fails (returns 0).
> +atomic_add_unless must provide explicit memory barriers around the
> +operation unless it fails (returns 0).
>  
>  atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)
>  
> @@ -460,9 +460,9 @@ the return value into an int.  There are other places where things
>  like this occur as well.
>  
>  These routines, like the atomic_t counter operations returning values,
> -require explicit memory barrier semantics around their execution.  All
> -memory operations before the atomic bit operation call must be made
> -visible globally before the atomic bit operation is made visible.
> +must provide explicit memory barrier semantics around their execution.
> +All memory operations before the atomic bit operation call must be
> +made visible globally before the atomic bit operation is made visible.
>  Likewise, the atomic bit operation must be visible globally before any
>  subsequent memory operation is made visible.  For example:
>  
> @@ -536,8 +536,9 @@ except that two underscores are prefixed to the interface name.
>  These non-atomic variants also do not require any special memory
>  barrier semantics.
>  
> -The routines xchg() and cmpxchg() need the same exact memory barriers
> -as the atomic and bit operations returning values.
> +The routines xchg() and cmpxchg() must provide the same exact
> +memory-barrier semantics as the atomic and bit operations returning
> +values.
>  
>  Spinlocks and rwlocks have memory barrier expectations as well.
>  The rule to follow is simple:
> 



* Re: Linux 3.19-rc3
  2015-02-02 19:03                                         ` Peter Hurley
@ 2015-02-02 19:33                                           ` Paul E. McKenney
  0 siblings, 0 replies; 101+ messages in thread
From: Paul E. McKenney @ 2015-02-02 19:33 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Peter Zijlstra, Kent Overstreet, Sedat Dilek, Dave Jones,
	Linus Torvalds, LKML, Chris Mason

On Mon, Feb 02, 2015 at 02:03:33PM -0500, Peter Hurley wrote:
> On 02/02/2015 11:11 AM, Paul E. McKenney wrote:
> > On Tue, Jan 20, 2015 at 09:03:12AM -0500, Peter Hurley wrote:
> >> On 01/19/2015 07:30 PM, Paul E. McKenney wrote:
> >>> On Tue, Jan 06, 2015 at 12:47:53PM -0800, Paul E. McKenney wrote:
> >>>> On Tue, Jan 06, 2015 at 02:57:37PM -0500, Peter Hurley wrote:
> >>>
> >>> [ . . . ]
> >>>
> >>>> David Miller's call, actually.
> >>>>
> >>>> But the rule is that if it is an atomic read-modify-write operation and it
> >>>> returns a value, then the operation itself needs to include full memory
> >>>> barriers before and after (as in the caller doesn't need to add them).
> >>>> Otherwise, the operation does not need to include memory ordering.
> >>>> Since xchg(), atomic_xchg(), and atomic_long_xchg() all return a value,
> >>>> their implementations must include full memory barriers before and after.
> >>>>
> >>>> Pretty straightforward.  ;-)
> >>>
> >>> Hello again, Peter,
> >>>
> >>> Were you going to push a patch clarifying this?
> >>
> >> Hi Paul,
> >>
> >> As you pointed out, atomic_ops.txt is for arch implementors, so I wasn't
> >> planning on patching that file.
> >>
> >> I've been meaning to write up something specifically for everyone else but
> >> my own bugs have kept me from that. [That, and I'm not sure what I write
> >> will be suitable for Documentation.]
> > 
> > Well, upon revisiting this after coming back from travel, I am more inclined
> > to agree that a change would be good.  I doubt if you would be the only
> > person who might be confused.  So how about the following patch?
> 
> I think this is much clearer to both audiences. Thanks!
> 
> Regards,
> Peter Hurley
> 
> [minor corrections below]
> 
> > ------------------------------------------------------------------------
> > 
> > documentation: Clarify memory-barrier semantics of atomic operations
> > 
> > All value-returning atomic read-modify-write operations must provide full
> > memory-barrier semantics on both sides of the operation.  This commit
> > clarifies the documentation to make it clear that these memory-barrier
> > semantics are provided by the operations themselves, not by their callers.
> > 
> > Reported-by: Peter Hurley <peter@hurleysoftware.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
> > index 183e41bdcb69..672201e79c49 100644
> > --- a/Documentation/atomic_ops.txt
> > +++ b/Documentation/atomic_ops.txt
> > @@ -201,11 +201,11 @@ These routines add 1 and subtract 1, respectively, from the given
> >  atomic_t and return the new counter value after the operation is
> >  performed.
> >  
> > -Unlike the above routines, it is required that explicit memory
> > -barriers are performed before and after the operation.  It must be
> > -done such that all memory operations before and after the atomic
> > -operation calls are strongly ordered with respect to the atomic
> > -operation itself.
> > +Unlike the above routines, it is required that these primitives
> > +inlude explicit memory barriers that are performed before and after
>    ^^^^^^
>    include|provide

Good catch, switched to "include".

> > +the operation.  It must be done such that all memory operations before
> > +and after the atomic operation calls are strongly ordered with respect
> > +to the atomic operation itself.
> >  
> >  For example, it should behave as if a smp_mb() call existed both
> >  before and after the atomic operation.
> > @@ -233,21 +233,21 @@ These two routines increment and decrement by 1, respectively, the
> >  given atomic counter.  They return a boolean indicating whether the
> >  resulting counter value was zero or not.
> >  
> > -It requires explicit memory barrier semantics around the operation as
> > -above.
> > +Again, these primitive provide explicit memory barrier semantics around
>                          ^
>                          s

Good eyes, fixed.

							Thanx, Paul

> > +the atomic operation.
> >  
> >  	int atomic_sub_and_test(int i, atomic_t *v);
> >  
> >  This is identical to atomic_dec_and_test() except that an explicit
> > -decrement is given instead of the implicit "1".  It requires explicit
> > -memory barrier semantics around the operation.
> > +decrement is given instead of the implicit "1".  This primitive must
> > +provide explicit memory barrier semantics around the operation.
> >  
> >  	int atomic_add_negative(int i, atomic_t *v);
> >  
> > -The given increment is added to the given atomic counter value.  A
> > -boolean is return which indicates whether the resulting counter value
> > -is negative.  It requires explicit memory barrier semantics around the
> > -operation.
> > +The given increment is added to the given atomic counter value.  A boolean
> > +is return which indicates whether the resulting counter value is negative.
> > +This primitive must provide explicit memory barrier semantics around
> > +the operation.
> >  
> >  Then:
> >  
> > @@ -257,7 +257,7 @@ This performs an atomic exchange operation on the atomic variable v, setting
> >  the given new value.  It returns the old value that the atomic variable v had
> >  just before the operation.
> >  
> > -atomic_xchg requires explicit memory barriers around the operation.
> > +atomic_xchg must provide explicit memory barriers around the operation.
> >  
> >  	int atomic_cmpxchg(atomic_t *v, int old, int new);
> >  
> > @@ -266,7 +266,7 @@ with the given old and new values. Like all atomic_xxx operations,
> >  atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> >  other accesses of *v are performed through atomic_xxx operations.
> >  
> > -atomic_cmpxchg requires explicit memory barriers around the operation.
> > +atomic_cmpxchg must provide explicit memory barriers around the operation.
> >  
> >  The semantics for atomic_cmpxchg are the same as those defined for 'cas'
> >  below.
> > @@ -279,8 +279,8 @@ If the atomic value v is not equal to u, this function adds a to v, and
> >  returns non zero. If v is equal to u then it returns zero. This is done as
> >  an atomic operation.
> >  
> > -atomic_add_unless requires explicit memory barriers around the operation
> > -unless it fails (returns 0).
> > +atomic_add_unless must provide explicit memory barriers around the
> > +operation unless it fails (returns 0).
> >  
> >  atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)
> >  
> > @@ -460,9 +460,9 @@ the return value into an int.  There are other places where things
> >  like this occur as well.
> >  
> >  These routines, like the atomic_t counter operations returning values,
> > -require explicit memory barrier semantics around their execution.  All
> > -memory operations before the atomic bit operation call must be made
> > -visible globally before the atomic bit operation is made visible.
> > +must provide explicit memory barrier semantics around their execution.
> > +All memory operations before the atomic bit operation call must be
> > +made visible globally before the atomic bit operation is made visible.
> >  Likewise, the atomic bit operation must be visible globally before any
> >  subsequent memory operation is made visible.  For example:
> >  
> > @@ -536,8 +536,9 @@ except that two underscores are prefixed to the interface name.
> >  These non-atomic variants also do not require any special memory
> >  barrier semantics.
> >  
> > -The routines xchg() and cmpxchg() need the same exact memory barriers
> > -as the atomic and bit operations returning values.
> > +The routines xchg() and cmpxchg() must provide the same exact
> > +memory-barrier semantics as the atomic and bit operations returning
> > +values.
> >  
> >  Spinlocks and rwlocks have memory barrier expectations as well.
> >  The rule to follow is simple:
> > 
> 



* Re: Linux 3.19-rc3
  2015-01-13  3:33                     ` Rik van Riel
@ 2015-01-13 10:28                       ` Catalin Marinas
  0 siblings, 0 replies; 101+ messages in thread
From: Catalin Marinas @ 2015-01-13 10:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Lang, Linus Torvalds, Kirill A. Shutemov, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

On Tue, Jan 13, 2015 at 03:33:12AM +0000, Rik van Riel wrote:
> On 01/09/2015 09:51 PM, David Lang wrote:
> > On Fri, 9 Jan 2015, Linus Torvalds wrote:
> > 
> >> Big pages are a bad bad bad idea. They work fine for databases,
> >> and that's pretty much just about it. I'm sure there are some
> >> other loads, but they are few and far between.
> > 
> > what about a dedicated virtualization host (where your workload is
> > a handful of virtual machines), would the file cache issue still
> > be overwhelming, even though it's the virtual machines accessing
> > things?
> 
> You would still have page cache inside the guest.
> 
> Using large pages in the host, and small pages in the guest
> would not give you the TLB benefits, and that is assuming
> that different page sizes in host and guest even work...

This works on ARM. The TLB caching the full VA->PA translation would
indeed stick to the guest page size as that's the input. But, depending
on the TLB implementation, it may also cache the guest PA -> real PA
translation (a TLB with the guest/Intermediate PA as input; ARMv8 also
introduces TLB invalidation ops that take such IPA as input). A miss in
the stage 1 (guest) TLB would be cheaper if it hits in the stage 2 TLB,
especially when it needs to look up the stage 2 for each level in the
stage 1 table.

But when it doesn't hit in any of the stages, it's still beneficial to
have a smaller number of levels at stage 2 (host) and that's what 64KB
pages bring on ARM. If you use the maximum 4 levels in both host and
guest, a TLB miss in the guest requires 24 memory accesses to populate
it (each guest page table level entry needs a stage 2 look-up). In
practice, you may get some locality but I think the guest page table
access pattern can get quite sparse. In addition, stage 2 entries are
not as volatile as the stage 1 entries, as they are per VM rather than
per process.
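
The figure of 24 follows from a simple count, sketched here under the
assumption that no walk caches or TLB hits shorten the walk (the function
name is hypothetical). Each of the s1 guest levels costs one full stage 2
walk plus the read of the guest entry itself, and the final guest PA needs
one more stage 2 walk, giving s1 * (s2 + 1) + s2 = (s1 + 1) * (s2 + 1) - 1
accesses.

```c
/*
 * Worst-case memory accesses for a two-stage (nested) page table walk,
 * assuming nothing is cached along the way.
 *
 * s1_levels: page table levels in the guest (stage 1)
 * s2_levels: page table levels in the host (stage 2)
 */
static unsigned int nested_walk_accesses(unsigned int s1_levels,
					 unsigned int s2_levels)
{
	return (s1_levels + 1) * (s2_levels + 1) - 1;
}
```

With 4 levels at both stages this gives 24. Dropping stage 2 to 2 levels
(an assumed figure for a 64KB host granule, not taken from this thread)
gives 14.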

> Using large pages in the guests gets you back to the wasted
> memory, except you are now wasting memory in a situation where
> you have less memory available in each guest. Density is a real
> consideration for virtualization.

I agree. I think guests should stick to 4KB pages (well, unless all they
need to do is mmap large database files).

-- 
Catalin


* Re: Linux 3.19-rc3
  2015-01-10  2:51                   ` David Lang
  2015-01-10  3:06                     ` Linus Torvalds
@ 2015-01-13  3:33                     ` Rik van Riel
  2015-01-13 10:28                       ` Catalin Marinas
  1 sibling, 1 reply; 101+ messages in thread
From: Rik van Riel @ 2015-01-13  3:33 UTC (permalink / raw)
  To: David Lang, Linus Torvalds
  Cc: Kirill A. Shutemov, Catalin Marinas, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/09/2015 09:51 PM, David Lang wrote:
> On Fri, 9 Jan 2015, Linus Torvalds wrote:
> 
>> Big pages are a bad bad bad idea. They work fine for databases,
>> and that's pretty much just about it. I'm sure there are some
>> other loads, but they are few and far between.
> 
> what about a dedicated virtualization host (where your workload is
> a handful of virtual machines), would the file cache issue still
> be overwhelming, even though it's the virtual machines accessing
> things?

You would still have page cache inside the guest.

Using large pages in the host, and small pages in the guest
would not give you the TLB benefits, and that is assuming
that different page sizes in host and guest even work...

Using large pages in the guests gets you back to the wasted
memory, except you are now wasting memory in a situation where
you have less memory available in each guest. Density is a real
consideration for virtualization.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUtJH4AAoJEM553pKExN6Dd3QH/ivcIo2n06Czg14/gL61MSHM
uZPOuMGQt51DYtF3s3mtDHqWyZq9hafz+2hoSJDwGIvVE6hJKVJ5rvb/OcN7AEKe
PWfru+bOvID0d4YOy38ax2tZwdItlL/sj1AbTbPXjnkLWm0yP3dYVM40dj47JvPy
+aE3iHB+wPZ+xxUmQ5KIlpRydUS1fl+tdmsiyi41fSFu8X19YDtDSrPylLk3to/w
6RGbHWLxJQZXJk+pkVWuSELmzWRrCaNaE7XBlvP9VS4U8bRg8WYJJXax1FBKLGBO
ygVt2OmqLi9dneN8ePNRUW8x2Y6OqjobDgCkOzTxJB8NrtRDJpSqWCI4W5xReBM=
=WhPN
-----END PGP SIGNATURE-----


* Re: Linux 3.19-rc3
  2015-01-12 19:07                           ` Linus Torvalds
@ 2015-01-12 19:24                             ` Will Deacon
  0 siblings, 0 replies; 101+ messages in thread
From: Will Deacon @ 2015-01-12 19:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin, Dave Hansen

On Mon, Jan 12, 2015 at 07:07:12PM +0000, Linus Torvalds wrote:
> On Tue, Jan 13, 2015 at 8:06 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So I'm ok with it, as long as we don't have a performance regression.
> >
> > Your "don't bother freeing when the batch is empty" should hopefully
> > be fine. Dave, does that work for your case?
> 
> Oh, and Dave just replied that it's ok. So should I just take it
> directly, or expect it through the arm64 tree? Either works for me.

Although I do have a couple of arm64 fixes on the radar, it'd be quicker
if you just take the patch. I added a commit log/SoB below.

Cheers,

Will

--->8

From bcf792ffc9ce29415261d2055954b883c5bec978 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Mon, 12 Jan 2015 19:10:55 +0000
Subject: [PATCH] mm: mmu_gather: use tlb->end != 0 only for TLB invalidation

When batching up address ranges for TLB invalidation, we check tlb->end
!= 0 to indicate that some pages have actually been unmapped.

As of commit f045bbb9fa1b ("mmu_gather: fix over-eager
tlb_flush_mmu_free() calling"), we use the same check for freeing these
pages in order to avoid a performance regression where we call
free_pages_and_swap_cache even when no pages are actually queued up.

Unfortunately, the range could have been reset (tlb->end = 0) by
tlb_end_vma, which has been shown to cause memory leaks on arm64.
Furthermore, investigation into these leaks revealed that the fullmm
case on task exit no longer invalidates the TLB, by virtue of tlb->end
 == 0 (in 3.18, need_flush would have been set).

This patch resolves the problem by reverting f045bbb9fa1b, using
tlb->local.nr as the predicate for page freeing in tlb_flush_mmu_free
and ensuring that tlb->end is initialised to a non-zero value in the
fullmm case.

Tested-by: Mark Langsdorf <mlangsdo@redhat.com>
Tested-by: Dave Hansen <dave@sr71.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/tlb.h | 8 ++++++--
 mm/memory.c               | 8 ++++----
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 08848050922e..db284bff29dc 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -136,8 +136,12 @@ static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 
 static inline void __tlb_reset_range(struct mmu_gather *tlb)
 {
-	tlb->start = TASK_SIZE;
-	tlb->end = 0;
+	if (tlb->fullmm) {
+		tlb->start = tlb->end = ~0;
+	} else {
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
+	}
 }
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index c6565f00fb38..54f3a9b00956 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -235,6 +235,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
+	if (!tlb->end)
+		return;
+
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
@@ -247,7 +250,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
-	for (batch = &tlb->local; batch; batch = batch->next) {
+	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
 		free_pages_and_swap_cache(batch->pages, batch->nr);
 		batch->nr = 0;
 	}
@@ -256,9 +259,6 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->end)
-		return;
-
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }
-- 
2.1.4
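
The split between the two predicates in the patch above can be modeled in
userspace roughly as follows. This is a sketch only, not kernel code: the
model_* names and MODEL_TASK_SIZE are hypothetical stand-ins, and "freeing"
is reduced to counting pages. It contrasts the pre-patch check (everything
gated on tlb->end) with the patched one (TLB invalidation gated on
tlb->end, page freeing gated on batch->nr).

```c
#include <stddef.h>

#define MODEL_TASK_SIZE	(~0UL >> 1)	/* hypothetical stand-in for TASK_SIZE */

struct model_batch {
	struct model_batch *next;
	unsigned int nr;		/* pages queued in this batch */
};

struct model_gather {
	int fullmm;
	unsigned long start, end;
	struct model_batch local;
};

/* Mirrors the patched __tlb_reset_range(): fullmm keeps end non-zero. */
static void model_reset_range(struct model_gather *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = MODEL_TASK_SIZE;
		tlb->end = 0;
	}
}

/* Patched behaviour: TLB invalidation depends on tlb->end only. */
static int model_needs_tlb_flush(const struct model_gather *tlb)
{
	return tlb->end != 0;
}

/* Patched behaviour: freeing depends on batch->nr only. */
static unsigned int model_free_batches(struct model_gather *tlb)
{
	unsigned int freed = 0;
	struct model_batch *b;

	for (b = &tlb->local; b && b->nr; b = b->next) {
		freed += b->nr;
		b->nr = 0;
	}
	return freed;
}

/* Pre-patch behaviour for comparison: everything gated on tlb->end. */
static unsigned int model_old_flush_mmu(struct model_gather *tlb)
{
	if (!tlb->end)
		return 0;	/* queued pages stay queued: the leak */
	return model_free_batches(tlb);
}
```

With pages queued and the range reset (end == 0), model_old_flush_mmu()
releases nothing while model_free_batches() releases the whole batch. With
fullmm set, the range survives a reset, so the TLB is still invalidated.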



* Re: Linux 3.19-rc3
  2015-01-12 19:06                         ` Linus Torvalds
@ 2015-01-12 19:07                           ` Linus Torvalds
  2015-01-12 19:24                             ` Will Deacon
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-12 19:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin, Dave Hansen

On Tue, Jan 13, 2015 at 8:06 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'm ok with it, as long as we don't have a performance regression.
>
> Your "don't bother freeing when the batch is empty" should hopefully
> be fine. Dave, does that work for your case?

Oh, and Dave just replied that it's ok. So should I just take it
directly, or expect it through the arm64 tree? Either works for me.

            Linus


* Re: Linux 3.19-rc3
  2015-01-12 12:42                       ` Will Deacon
  2015-01-12 13:22                         ` Mark Langsdorf
  2015-01-12 19:03                         ` Dave Hansen
@ 2015-01-12 19:06                         ` Linus Torvalds
  2015-01-12 19:07                           ` Linus Torvalds
  2 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-12 19:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin, Dave Hansen

On Tue, Jan 13, 2015 at 1:42 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> Linus: this moves the tlb->end check back into tlb_flush_mmu_tlbonly.
> How much do you hate that?

So I'm ok with it, as long as we don't have a performance regression.

Your "don't bother freeing when the batch is empty" should hopefully
be fine. Dave, does that work for your case?

           Linus


* Re: Linux 3.19-rc3
  2015-01-12 12:42                       ` Will Deacon
  2015-01-12 13:22                         ` Mark Langsdorf
@ 2015-01-12 19:03                         ` Dave Hansen
  2015-01-12 19:06                         ` Linus Torvalds
  2 siblings, 0 replies; 101+ messages in thread
From: Dave Hansen @ 2015-01-12 19:03 UTC (permalink / raw)
  To: Will Deacon, Linus Torvalds
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

On 01/12/2015 04:42 AM, Will Deacon wrote:
> Since this effectively reverts f045bbb9fa1b, I've added Dave to Cc. It
> should fix the leak without reintroducing the performance regression.

I ran this on the big system that showed the earlier issue.  Everything
seems OK, at least with the test that showed the issue earlier (the brk1
test from will-it-scale).

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-12 14:23                             ` Catalin Marinas
@ 2015-01-12 15:42                               ` Arnd Bergmann
  0 siblings, 0 replies; 101+ messages in thread
From: Arnd Bergmann @ 2015-01-12 15:42 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Catalin Marinas, Kirill A. Shutemov, Mark Langsdorf,
	Linus Torvalds, Linux Kernel Mailing List

On Monday 12 January 2015 14:23:32 Catalin Marinas wrote:
> 
> So, I guess it's run-time cost of the LRU algorithm, especially under
> memory pressure. Harder to benchmark though (we'll see when we get
> hardware, though probably not very soon).

One thing you could try is to add an access fault handler that gathers
statistics about number of calls (you could trivially get this by
using gcov) and time spent in the handler for a workload that causes
memory pressure.

	Arnd

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-12 13:57                           ` Arnd Bergmann
@ 2015-01-12 14:23                             ` Catalin Marinas
  2015-01-12 15:42                               ` Arnd Bergmann
  0 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2015-01-12 14:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kirill A. Shutemov, Mark Langsdorf, Linus Torvalds,
	Linux Kernel Mailing List, linux-arm-kernel

On Mon, Jan 12, 2015 at 01:57:48PM +0000, Arnd Bergmann wrote:
> On Monday 12 January 2015 12:18:15 Catalin Marinas wrote:
> > On Sat, Jan 10, 2015 at 09:36:13PM +0000, Arnd Bergmann wrote:
> > > On Saturday 10 January 2015 13:00:27 Linus Torvalds wrote:
> > > > > IIRC, AIX works great with 64k pages, but only because of two
> > > > > reasons that don't apply on Linux:
> > > > 
> > > > .. there's a few other ones:
> > > > 
> > > > >  (c) nobody really runs AIX on desktops. It's very much a DB load
> > > > environment, with historically some HPC.
> > > > 
> > > >  (d) the powerpc TLB fill/buildup/teardown costs are horrible, so on
> > > > AIX the cost of lots of small pages is much higher too.
> > > 
> > > I think (d) applies to ARM as well, since it has no hardware
> > > dirty/referenced bit tracking and requires the OS to mark the
> > > pages as invalid/readonly until the first access. ARMv8.1
> > > has a fix for that, but it's optional and we haven't seen any
> > > implementations yet.
> > 
> > Do you happen to have any data on how significantly non-hardware
> > dirty/access bits impact the performance? I think it may affect the user
> > process start-up time a bit, but at run-time it shouldn't be that bad.
> > 
> > If it is that significant, we could optimise it further in the arch
> > code. For example, make a fast exception path where we need to mark the
> > pte dirty. This would be handled by arch code without even calling
> > handle_pte_fault().
> 
> If I understand the way that LRU works right, we end up clearing
> the referenced bits in shrink_active_list()->page_referenced()->
> page_referenced_one()->ptep_clear_flush_young_notify()->pte_mkold()
> whenever there is memory pressure, so definitely not just for
> startup.

Yes. Actually, I think the start-up is probably not that bad. For pages
pointing to zero-page, you need to go through the kernel to allocate an
anonymous page and change the pte anyway. If the access was "write", the
page is marked dirty already (same with the access flag, the page is
marked young to avoid a subsequent fault).

So, I guess it's run-time cost of the LRU algorithm, especially under
memory pressure. Harder to benchmark though (we'll see when we get
hardware, though probably not very soon).

> > > > so I feel pretty confident in saying it won't happen. It's just too
> > > > much of a bother, for little to no actual upside. It's likely a much
> > > > better approach to try to instead use THP for anonymous mappings.
> > > 
> > > arm64 already supports 2MB transparent hugepages. I guess it
> > > wouldn't be too hard to change it so that an existing hugepage
> > > on an anonymous mapping that gets split up into 4KB pages gets
> > > split along 64KB boundaries with the contiguous mapping bit set.
> > > 
> > > Having full support for multiple hugepage sizes (64KB, 2MB and 32MB
> > > in case of ARM64 with 4KB PAGE_SIZE) would be even better and
> > > probably negate any benefits of 64KB PAGE_SIZE, but requires more
> > > changes to common mm code.
> > 
> > As I replied to your other email, I don't think that's simple for the
> > transparent huge pages case.
> > 
> > The main advantage I see with 64KB pages is not the reduced TLB pressure
> > but the number of levels of page tables. Take the AMD Seattle board for
> > example, with 4KB pages you need 4 levels but 64KB allow only 2 levels
> > (42-bit VA). Larger TLBs and improved walk caches (caching VA -> pmd
> > entry translation rather than all the way to pte/PA) make things better
> > but you still have the warming up time for any fork/new process as they
> > don't share the same TLB entries.
> 
> Not sure I'm following. Does the A57 core cache partial TLBs or not?

AFAIK, yes (I think A15 started doing this years ago). Architecturally,
an entry at any page table level could be cached in the TLB if it is
valid, irrespective of whether the full translation is valid.
Implementations, however, usually just cache the VA->pmd translation in
the intermediate TLB. Anyway, populating such cache still requires 3
accesses on a 4-level system.

> Even if not, I would expect the page tables to be hot in dcache most
> of the time, possibly with the exception of the last level on
> multi-threaded processes, but then you are back to the difference
> between the page size and the upper levels almost out of the equation.

As I said, I think it's more important with virtualisation where each
guest page table entry address needs to be translated from the guest
address space (IPA) to the physical address via the stage 2 tables.

BTW, arm64 can differentiate between a leaf TLB invalidation (just
changing the pte) and an all-levels one (which affects the walk cache
for a given VA). I have some hacked code to do this in Linux but I
haven't had time to assess the impact properly.

-- 
Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-12 12:18                         ` Catalin Marinas
@ 2015-01-12 13:57                           ` Arnd Bergmann
  2015-01-12 14:23                             ` Catalin Marinas
  0 siblings, 1 reply; 101+ messages in thread
From: Arnd Bergmann @ 2015-01-12 13:57 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linus Torvalds, linux-arm-kernel, Kirill A. Shutemov,
	Mark Langsdorf, Linux Kernel Mailing List

On Monday 12 January 2015 12:18:15 Catalin Marinas wrote:
> On Sat, Jan 10, 2015 at 09:36:13PM +0000, Arnd Bergmann wrote:
> > On Saturday 10 January 2015 13:00:27 Linus Torvalds wrote:
> > > > IIRC, AIX works great with 64k pages, but only because of two
> > > > reasons that don't apply on Linux:
> > > 
> > > .. there's a few other ones:
> > > 
> > >  (c) nobody really runs AIX on desktops. It's very much a DB load
> > > environment, with historically some HPC.
> > > 
> > >  (d) the powerpc TLB fill/buildup/teardown costs are horrible, so on
> > > AIX the cost of lots of small pages is much higher too.
> > 
> > I think (d) applies to ARM as well, since it has no hardware
> > dirty/referenced bit tracking and requires the OS to mark the
> > pages as invalid/readonly until the first access. ARMv8.1
> > has a fix for that, but it's optional and we haven't seen any
> > implementations yet.
> 
> Do you happen to have any data on how significantly non-hardware
> dirty/access bits impact the performance? I think it may affect the user
> process start-up time a bit, but at run-time it shouldn't be that bad.
> 
> If it is that significant, we could optimise it further in the arch
> code. For example, make a fast exception path where we need to mark the
> pte dirty. This would be handled by arch code without even calling
> handle_pte_fault().

If I understand the way that LRU works right, we end up clearing
the referenced bits in shrink_active_list()->page_referenced()->
page_referenced_one()->ptep_clear_flush_young_notify()->pte_mkold()
whenever there is memory pressure, so definitely not just for
startup.
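To make the run-time cost concrete, here is a minimal user-space sketch of software-managed referenced bits. It is not kernel code: the names (model_pte, touch, age_all) and the bit layout are invented for illustration, and the fault handler is reduced to a counter. It only shows that each aging pass costs one extra trap for every page that is touched again afterwards.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PTE layout for this model only; not the arm64 encoding. */
#define MODEL_PTE_VALID (1u << 0)
#define MODEL_PTE_YOUNG (1u << 1)	/* software "referenced" bit */

struct model_pte {
	uint32_t flags;
};

/* Number of access faults taken so far (what a real handler could count). */
static unsigned long model_faults;

/* Roughly what pte_mkold() does: drop the referenced bit. */
static void age(struct model_pte *pte)
{
	pte->flags &= ~MODEL_PTE_YOUNG;
}

/*
 * Simulate a CPU access. Without hardware tracking, a pte that is not
 * marked young traps, and the handler sets the bit before retrying.
 * Returns true if the access had to go through the fault path.
 */
static bool touch(struct model_pte *pte)
{
	if (pte->flags & MODEL_PTE_YOUNG)
		return false;
	model_faults++;
	pte->flags |= MODEL_PTE_YOUNG;
	return true;
}

/* Age every page once, as reclaim would do under memory pressure. */
static void age_all(struct model_pte *ptes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		age(&ptes[i]);
}
```

In this model a page faults once on first touch, runs fault-free until the next age_all(), and then faults once more on its next access. That repeated trap is the per-page cost under sustained memory pressure.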

> > > so I feel pretty confident in saying it won't happen. It's just too
> > > much of a bother, for little to no actual upside. It's likely a much
> > > better approach to try to instead use THP for anonymous mappings.
> > 
> > arm64 already supports 2MB transparent hugepages. I guess it
> > wouldn't be too hard to change it so that an existing hugepage
> > on an anonymous mapping that gets split up into 4KB pages gets
> > split along 64KB boundaries with the contiguous mapping bit set.
> > 
> > Having full support for multiple hugepage sizes (64KB, 2MB and 32MB
> > in case of ARM64 with 4KB PAGE_SIZE) would be even better and
> > probably negate any benefits of 64KB PAGE_SIZE, but requires more
> > changes to common mm code.
> 
> As I replied to your other email, I don't think that's simple for the
> transparent huge pages case.
> 
> The main advantage I see with 64KB pages is not the reduced TLB pressure
> but the number of levels of page tables. Take the AMD Seattle board for
> example, with 4KB pages you need 4 levels but 64KB allow only 2 levels
> (42-bit VA). Larger TLBs and improved walk caches (caching VA -> pmd
> entry translation rather than all the way to pte/PA) make things better
> but you still have the warming up time for any fork/new process as they
> don't share the same TLB entries.

Not sure I'm following. Does the A57 core cache partial TLBs or not?

Even if not, I would expect the page tables to be hot in dcache most
of the time, possibly with the exception of the last level on
multi-threaded processes, but then you are back to the difference
between the page size and the upper levels almost out of the equation.

	Arnd

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-12 12:42                       ` Will Deacon
@ 2015-01-12 13:22                         ` Mark Langsdorf
  2015-01-12 19:03                         ` Dave Hansen
  2015-01-12 19:06                         ` Linus Torvalds
  2 siblings, 0 replies; 101+ messages in thread
From: Mark Langsdorf @ 2015-01-12 13:22 UTC (permalink / raw)
  To: Will Deacon, Linus Torvalds
  Cc: Laszlo Ersek, Marc Zyngier, Mark Rutland, Steve Capper,
	vishnu.ps, main kernel list, arm kernel list, Kyle McMartin,
	dave

On 01/12/2015 06:42 AM, Will Deacon wrote:
> On Sat, Jan 10, 2015 at 07:51:03PM +0000, Linus Torvalds wrote:
>> I can revert the commit that causes problems, but considering the
>> performance impact on x86 (it would be a regression since 3.18), I
>> would *really* like to fix the arm64 problem instead. So I'll wait
>> with the revert for at least a week, I think, hoping that the arm64
>> people figure this out. Sound reasonable?
>
> Sure. I've put together the following patch which:
>
>    (1) Uses tlb->end != 0 only as the predicate for TLB invalidation
>
>    (2) Uses tlb->local.nr as the predicate for page freeing in
>        tlb_flush_mmu_free
>
>    (3) Ensures tlb->end != 0 for the fullmm case (I think this was a
>        benign change introduced by my original patch, but this puts us
>        back to 3.18 behaviour)
>
> Since this effectively reverts f045bbb9fa1b, I've added Dave to Cc. It
> should fix the leak without reintroducing the performance regression.
>
> Linus: this moves the tlb->end check back into tlb_flush_mmu_tlbonly.
> How much do you hate that?
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 08848050922e..db284bff29dc 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -136,8 +136,12 @@ static inline void __tlb_adjust_range(struct mmu_gather *tlb,
>
>   static inline void __tlb_reset_range(struct mmu_gather *tlb)
>   {
> -	tlb->start = TASK_SIZE;
> -	tlb->end = 0;
> +	if (tlb->fullmm) {
> +		tlb->start = tlb->end = ~0;
> +	} else {
> +		tlb->start = TASK_SIZE;
> +		tlb->end = 0;
> +	}
>   }
>
>   /*
> diff --git a/mm/memory.c b/mm/memory.c
> index c6565f00fb38..54f3a9b00956 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -235,6 +235,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
>
>   static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>   {
> +	if (!tlb->end)
> +		return;
> +
>   	tlb_flush(tlb);
>   	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
>   #ifdef CONFIG_HAVE_RCU_TABLE_FREE
> @@ -247,7 +250,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
>   {
>   	struct mmu_gather_batch *batch;
>
> -	for (batch = &tlb->local; batch; batch = batch->next) {
> +	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>   		free_pages_and_swap_cache(batch->pages, batch->nr);
>   		batch->nr = 0;
>   	}
> @@ -256,9 +259,6 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
>
>   void tlb_flush_mmu(struct mmu_gather *tlb)
>   {
> -	if (!tlb->end)
> -		return;
> -
>   	tlb_flush_mmu_tlbonly(tlb);
>   	tlb_flush_mmu_free(tlb);
>   }
>

This fixes the originally reported problem. I have no opinion on
moving the tlb->end check back into tlb_flush_mmu_tlbonly.

Tested-by: Mark Langsdorf <mlangsdo@redhat.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-12 11:53                     ` Catalin Marinas
@ 2015-01-12 13:15                       ` Arnd Bergmann
  0 siblings, 0 replies; 101+ messages in thread
From: Arnd Bergmann @ 2015-01-12 13:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-arm-kernel, Linus Torvalds, Kirill A. Shutemov,
	Mark Langsdorf, Linux Kernel Mailing List

On Monday 12 January 2015 11:53:42 Catalin Marinas wrote:
> On Sat, Jan 10, 2015 at 08:16:02PM +0000, Arnd Bergmann wrote:
> > Regarding ARM64 in particular, I think it would be nice to investigate
> > how to extend the THP code to cover 64KB TLBs when running with the 4KB
> > page size. There is a hint bit in the page table to tell the CPU that
> > a set of 16 aligned pages can share one TLB, and it would be nice to
> > use that bit in Linux, and to make this case more common for anonymous
> > mappings, and possibly large file based mappings.
> 
> The generic THP code assumes that huge pages are done at the pmd level,
> which means 2MB for arm64 with 4KB page configuration. Hugetlb allows
> larger ptes which may not necessarily be at the pmd level, though we
> haven't implemented this on arm64 and it's not transparent either. As a
> first step it would be nice if at least we unify the APIs between
> hugetlbfs and THP (set_huge_pte_at vs. set_pmd_at).
> 
> I think you could do some arch-only tricks by pretending that you have a
> pte with 16 entries only and a dummy pmd (without a corresponding
> hardware page table level) that can host a "huge" page (16 consecutive
> ptes). But we lose the 2MB transparent huge page as I don't see
> mm/huge_memory.c handling huge puds. We also lose the ability of
> building 4 real level page tables since we use the pmd as a dummy one.

Yes, it quickly gets ugly at that point.
 
> But it would be a nice investigation. Maybe something simpler like
> getting the mm layer to prefer contiguous 64KB ranges and we do the
> detection in the arch set_pte_at().

Doing the detection would be easy enough I guess and immediately
helps with the post-split THP mapping, but I don't think that
by itself would have a noticeable benefit on general workloads.

My first reaction to a change to the mm layer was that it's probably really
hard, but then again if we limit it to anonymous mappings, all we really
need is a modification in do_anonymous_page() to allocate a larger chunk
if possible and install n PTEs at a time or fall back to the current
behavior if anything gets in the way. For completeness, the same thing
could be done in do_wp_page() for the case where an entire block of pages
are either not mapped or point to the zero page. Anything beyond that
probably adds more complexity than it gains.

Do we have someone who could code this up and do some benchmarks to find out
the cost in terms of memory consumption and the performance compared to
normal 4k pages and static 64k pages?

Do the Cortex-A53/A57 cores actually implement the necessary hardware
feature?
IIRC some x86 processors are also able to use larger TLBs for contiguous
page table entries even without an architected hint bit, so if one
could show this to perform better on x86, it would be much easier to
merge.

	Arnd

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 19:51                     ` Linus Torvalds
@ 2015-01-12 12:42                       ` Will Deacon
  2015-01-12 13:22                         ` Mark Langsdorf
                                           ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Will Deacon @ 2015-01-12 12:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin, dave

On Sat, Jan 10, 2015 at 07:51:03PM +0000, Linus Torvalds wrote:
> On Sat, Jan 10, 2015 at 5:37 AM, Will Deacon <will.deacon@arm.com> wrote:
> >>
> >> Will?
> >
> > I'm wondering if this is now broken in the fullmm case, because tlb->end
> > will be zero and we won't actually free any of the pages being unmapped
> > on task exit. Does that sound plausible?
> 
> But did anything change wrt fullmm? I don't see any changes wrt fullmm
> logic in generic code.

No, and now that I've had a play, the issue is worse than I thought.

> The arm64 code changed more, so maybe there was something I missed.
> Again, arm64 uses tlb_end_vma() etc, so arm64 certainly triggers code
> that x86 does not.

Yup, and I think tlb_end_vma is also problematic. The issue is that we
reset the range in __tlb_end_vma, and therefore the batched pages don't
get freed. We also have an issue in the fullmm case, because
tlb_flush_mmu tests tlb->end != 0 before doing any freeing.

The fundamental problem is that tlb->end *only* indicates whether or not
TLB invalidation is required, *not* whether there are pages to be freed.
The advantage of the old need_flush flag was that it was sticky over
calls to things like tlb_flush_mmu_tlbonly.

> I can revert the commit that causes problems, but considering the
> performance impact on x86 (it would be a regression since 3.18), I
> would *really* like to fix the arm64 problem instead. So I'll wait
> with the revert for at least a week, I think, hoping that the arm64
> people figure this out. Sound reasonable?

Sure. I've put together the following patch which:

  (1) Uses tlb->end != 0 only as the predicate for TLB invalidation

  (2) Uses tlb->local.nr as the predicate for page freeing in
      tlb_flush_mmu_free

  (3) Ensures tlb->end != 0 for the fullmm case (I think this was a
      benign change introduced by my original patch, but this puts us
      back to 3.18 behaviour)

Since this effectively reverts f045bbb9fa1b, I've added Dave to Cc. It
should fix the leak without reintroducing the performance regression.

Linus: this moves the tlb->end check back into tlb_flush_mmu_tlbonly.
How much do you hate that?
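The difference between the two predicates can be sketched with a small user-space model. This is an illustration under stated assumptions, not the kernel code: struct model_gather and all model_* functions are hypothetical stand-ins, the TLB flush itself is elided, only the tlb_end_vma path is modelled, and the "patched" flag switches between the 3.19-rc3 logic and the logic in the diff that follows.

```c
#include <stdbool.h>

#define MODEL_TASK_SIZE	(1UL << 39)	/* arbitrary value for this model */
#define MODEL_PAGE_SIZE	4096UL

struct model_gather {
	unsigned long start, end;	/* range still needing TLB invalidation */
	unsigned int nr;		/* pages batched, waiting to be freed */
	unsigned int freed;		/* pages handed back so far */
	bool fullmm;
	bool patched;			/* false: rc3 logic, true: with the patch */
};

static void model_reset_range(struct model_gather *tlb)
{
	if (tlb->patched && tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = MODEL_TASK_SIZE;
		tlb->end = 0;
	}
}

static void model_gather_init(struct model_gather *tlb, bool fullmm, bool patched)
{
	tlb->nr = 0;
	tlb->freed = 0;
	tlb->fullmm = fullmm;
	tlb->patched = patched;
	model_reset_range(tlb);
}

/* Unmap one page at addr: grow the range and batch the page. */
static void model_remove_page(struct model_gather *tlb, unsigned long addr)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + MODEL_PAGE_SIZE > tlb->end)
		tlb->end = addr + MODEL_PAGE_SIZE;
	tlb->nr++;
}

/* End of a VMA: invalidate the range (elided here) and reset it. */
static void model_end_vma(struct model_gather *tlb)
{
	if (!tlb->fullmm && tlb->end)
		model_reset_range(tlb);
}

static void model_flush_mmu(struct model_gather *tlb)
{
	if (!tlb->patched && !tlb->end)
		return;			/* rc3: an empty range also skips the freeing */

	if (tlb->end)			/* TLB-only part, predicate is tlb->end */
		model_reset_range(tlb);

	if (tlb->nr) {			/* freeing part, predicate is the batch count */
		tlb->freed += tlb->nr;
		tlb->nr = 0;
	}
}
```

With patched == false, the sequence remove_page, end_vma, flush_mmu leaves nr at 1 and freed at 0, which is the leak. With patched == true the same sequence frees the page, because the freeing decision no longer depends on a range that end_vma has already reset.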

Will

--->8

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 08848050922e..db284bff29dc 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -136,8 +136,12 @@ static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 
 static inline void __tlb_reset_range(struct mmu_gather *tlb)
 {
-	tlb->start = TASK_SIZE;
-	tlb->end = 0;
+	if (tlb->fullmm) {
+		tlb->start = tlb->end = ~0;
+	} else {
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
+	}
 }
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index c6565f00fb38..54f3a9b00956 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -235,6 +235,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
+	if (!tlb->end)
+		return;
+
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
@@ -247,7 +250,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
-	for (batch = &tlb->local; batch; batch = batch->next) {
+	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
 		free_pages_and_swap_cache(batch->pages, batch->nr);
 		batch->nr = 0;
 	}
@@ -256,9 +259,6 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->end)
-		return;
-
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 21:36                       ` Arnd Bergmann
  2015-01-10 21:48                         ` Linus Torvalds
  2015-01-12 11:37                         ` Kirill A. Shutemov
@ 2015-01-12 12:18                         ` Catalin Marinas
  2015-01-12 13:57                           ` Arnd Bergmann
  2 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2015-01-12 12:18 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, linux-arm-kernel, Kirill A. Shutemov,
	Mark Langsdorf, Linux Kernel Mailing List

On Sat, Jan 10, 2015 at 09:36:13PM +0000, Arnd Bergmann wrote:
> On Saturday 10 January 2015 13:00:27 Linus Torvalds wrote:
> > > IIRC, AIX works great with 64k pages, but only because of two
> > > reasons that don't apply on Linux:
> > 
> > .. there's a few other ones:
> > 
> >  (c) nobody really runs AIX on desktops. It's very much a DB load
> > environment, with historically some HPC.
> > 
> >  (d) the powerpc TLB fill/buildup/teardown costs are horrible, so on
> > AIX the cost of lots of small pages is much higher too.
> 
> I think (d) applies to ARM as well, since it has no hardware
> dirty/referenced bit tracking and requires the OS to mark the
> pages as invalid/readonly until the first access. ARMv8.1
> has a fix for that, but it's optional and we haven't seen any
> implementations yet.

Do you happen to have any data on how significantly non-hardware
dirty/access bits impact the performance? I think it may affect the user
process start-up time a bit, but at run-time it shouldn't be that bad.

If it is that significant, we could optimise it further in the arch
code. For example, make a fast exception path where we need to mark the
pte dirty. This would be handled by arch code without even calling
handle_pte_fault().

> > so I feel pretty confident in saying it won't happen. It's just too
> > much of a bother, for little to no actual upside. It's likely a much
> > better approach to try to instead use THP for anonymous mappings.
> 
> arm64 already supports 2MB transparent hugepages. I guess it
> wouldn't be too hard to change it so that an existing hugepage
> on an anonymous mapping that gets split up into 4KB pages gets
> split along 64KB boundaries with the contiguous mapping bit set.
> 
> Having full support for multiple hugepage sizes (64KB, 2MB and 32MB
> in case of ARM64 with 4KB PAGE_SIZE) would be even better and
> probably negate any benefits of 64KB PAGE_SIZE, but requires more
> changes to common mm code.

As I replied to your other email, I don't think that's simple for the
transparent huge pages case.

The main advantage I see with 64KB pages is not the reduced TLB pressure
but the number of levels of page tables. Take the AMD Seattle board for
example, with 4KB pages you need 4 levels but 64KB allow only 2 levels
(42-bit VA). Larger TLBs and improved walk caches (caching VA -> pmd
entry translation rather than all the way to pte/PA) make things better
but you still have the warming up time for any fork/new process as they
don't share the same TLB entries.

But as Linus said already, the trade-off with the memory wastage
is highly dependent on the targeted load.
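The level counts above follow from simple arithmetic, sketched here as a small helper. The function name pt_levels is made up for this example, and it assumes 8-byte descriptors with every table filling exactly one page. The 4-level figure for 4KB pages assumes a 48-bit VA, which is an assumption of this sketch; only the 42-bit VA for 64KB pages is stated above.

```c
#include <assert.h>

/*
 * Number of translation levels needed to resolve va_bits of virtual
 * address with pages of 2^page_shift bytes and 8-byte descriptors.
 * A table that fills one page holds 2^(page_shift - 3) entries, so
 * each level resolves (page_shift - 3) bits.
 */
static unsigned int pt_levels(unsigned int va_bits, unsigned int page_shift)
{
	assert(page_shift > 3 && va_bits > page_shift);

	unsigned int bits_per_level = page_shift - 3;
	unsigned int bits_to_resolve = va_bits - page_shift;

	/* Round up: a partially used top-level table still counts as a level. */
	return (bits_to_resolve + bits_per_level - 1) / bits_per_level;
}
```

For example, pt_levels(48, 12) gives 4 (36 bits at 9 bits per level) and pt_levels(42, 16) gives 2 (26 bits at 13 bits per level).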

-- 
Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 20:16                   ` Arnd Bergmann
  2015-01-10 21:00                     ` Linus Torvalds
@ 2015-01-12 11:53                     ` Catalin Marinas
  2015-01-12 13:15                       ` Arnd Bergmann
  1 sibling, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2015-01-12 11:53 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Linus Torvalds, Kirill A. Shutemov,
	Mark Langsdorf, Linux Kernel Mailing List

On Sat, Jan 10, 2015 at 08:16:02PM +0000, Arnd Bergmann wrote:
> Regarding ARM64 in particular, I think it would be nice to investigate
> how to extend the THP code to cover 64KB TLBs when running with the 4KB
> page size. There is a hint bit in the page table to tell the CPU that
> a set of 16 aligned pages can share one TLB, and it would be nice to
> use that bit in Linux, and to make this case more common for anonymous
> mappings, and possibly large file based mappings.

The generic THP code assumes that huge pages are done at the pmd level,
which means 2MB for arm64 with 4KB page configuration. Hugetlb allows
larger ptes which may not necessarily be at the pmd level, though we
haven't implemented this on arm64 and it's not transparent either. As a
first step it would be nice if at least we unify the APIs between
hugetlbfs and THP (set_huge_pte_at vs. set_pmd_at).

I think you could do some arch-only tricks by pretending that you have a
pte with 16 entries only and a dummy pmd (without a corresponding
hardware page table level) that can host a "huge" page (16 consecutive
ptes). But we lose the 2MB transparent huge page as I don't see
mm/huge_memory.c handling huge puds. We also lose the ability of
building 4 real level page tables since we use the pmd as a dummy one.

But it would be a nice investigation. Maybe something simpler like
getting the mm layer to prefer contiguous 64KB ranges and we do the
detection in the arch set_pte_at().
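A rough sketch of what such a detection could check, written as plain user-space C. The helper cont_range_ok and the flat pfns[] array are hypothetical simplifications. A real implementation would work on actual ptes and would also have to compare attributes and permissions across the group, which this model ignores.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONT_PTES 16u	/* 16 x 4KB = 64KB, the group size of the hint bit */

/*
 * pfns[] holds the page frame number mapped by each pte of one table,
 * with 0 standing for "not mapped" (a simplification of this model).
 * Returns true if the aligned group of 16 ptes containing idx maps a
 * physically contiguous, 64KB-aligned range, so the hint could be set.
 */
static bool cont_range_ok(const uint64_t *pfns, size_t nr_ptes, size_t idx)
{
	size_t base = idx & ~(size_t)(CONT_PTES - 1);

	if (base + CONT_PTES > nr_ptes)
		return false;

	/* The first frame must be mapped and itself 64KB aligned. */
	if (pfns[base] == 0 || (pfns[base] & (CONT_PTES - 1)) != 0)
		return false;

	/* Every following pte must map the next physical frame. */
	for (size_t i = 1; i < CONT_PTES; i++)
		if (pfns[base + i] != pfns[base] + i)
			return false;

	return true;
}
```

The check is cheap when it fails early, but it only pays off if the allocator actually hands out aligned 64KB runs, which is why the preference in the mm layer matters more than the detection.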

-- 
Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 21:36                       ` Arnd Bergmann
  2015-01-10 21:48                         ` Linus Torvalds
@ 2015-01-12 11:37                         ` Kirill A. Shutemov
  2015-01-12 12:18                         ` Catalin Marinas
  2 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2015-01-12 11:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, linux-arm-kernel, Catalin Marinas,
	Mark Langsdorf, Linux Kernel Mailing List

On Sat, Jan 10, 2015 at 10:36:13PM +0100, Arnd Bergmann wrote:
> On Saturday 10 January 2015 13:00:27 Linus Torvalds wrote:
> > so I feel pretty confident in saying it won't happen. It's just too
> > much of a bother, for little to no actual upside. It's likely a much
> > better approach to try to instead use THP for anonymous mappings.
> 
> arm64 already supports 2MB transparent hugepages. I guess it
> wouldn't be too hard to change it so that an existing hugepage
> on an anonymous mapping that gets split up into 4KB pages gets
> split along 64KB boundaries with the contiguous mapping bit set.

What you are talking about is in fact multi-level transparent huge page
support: you need to couple 4k pages into 64k to avoid breaking them apart
by compaction or migration or whatever.

That definitely would not make the THP code simpler.

> Having full support for multiple hugepage sizes (64KB, 2MB and 32MB
> in case of ARM64 with 4KB PAGE_SIZE) would be even better and
> probably negate any benefits of 64KB PAGE_SIZE, but requires more
> changes to common mm code.
> 
> 	Arnd

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 21:36                       ` Arnd Bergmann
@ 2015-01-10 21:48                         ` Linus Torvalds
  2015-01-12 11:37                         ` Kirill A. Shutemov
  2015-01-12 12:18                         ` Catalin Marinas
  2 siblings, 0 replies; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10 21:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Kirill A. Shutemov, Catalin Marinas,
	Mark Langsdorf, Linux Kernel Mailing List

On Sat, Jan 10, 2015 at 1:36 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>>  (d) the powerpc TLB fill/buildup/teardown costs are horrible, so on
>> AIX the cost of lots of small pages is much higher too.
>
> I think (d) applies to ARM as well, since it has no hardware
> dirty/referenced bit tracking and requires the OS to mark the
> pages as invalid/readonly until the first access. ARMv8.1
> has a fix for that, but it's optional and we haven't seen any
> implementations yet.

Powerpc really makes things worse by having those hashed page tables
that (a) have bad locality and (b) have to be built up and torn down
in software. I don't think ARM ends up coming close, even with the
issues it has.

Now, it's definitely true that the x86 page table handling hardware
tends to just be superior. Both Intel and AMD had to work really hard
on it, because Windows (in the pre-NT days) used to flush the TLB
absolutely _all_ the time.  So x86 hardware really does tend to do
very well on this.

ARM simply doesn't have the same kind of history. TLB issues seldom
show up very much on simple benchmarks or on smaller loads. It's one
of those things that tends to take a couple of generations.

[ Or, sadly, _much_ more, because some hardware designers never get
the memo, and continue to blame software and say "you should use big
pages", because they don't see the problems ]

Of course, it's entirely possible that vendors like AMD could transfer
their TLB handling know-how to their ARM64 cores. I have no visibility
into that, maybe some people here do..

                      Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 21:00                     ` Linus Torvalds
@ 2015-01-10 21:36                       ` Arnd Bergmann
  2015-01-10 21:48                         ` Linus Torvalds
                                           ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Arnd Bergmann @ 2015-01-10 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arm-kernel, Kirill A. Shutemov, Catalin Marinas,
	Mark Langsdorf, Linux Kernel Mailing List

On Saturday 10 January 2015 13:00:27 Linus Torvalds wrote:
> 
> > IIRC, AIX works great with 64k pages, but only because of two
> > reasons that don't apply on Linux:
> 
> .. there's a few other ones:
> 
>  (c) nobody really runs AIX on desktops. It's very much a DB load
> environment, with historically some HPC.
> 
>  (d) the powerpc TLB fill/buildup/teardown costs are horrible, so on
> AIX the cost of lots of small pages is much higher too.

I think (d) applies to ARM as well, since it has no hardware
dirty/referenced bit tracking and requires the OS to mark the
pages as invalid/readonly until the first access. ARMv8.1
has a fix for that, but it's optional and we haven't seen any
implementations yet.

> so I feel pretty confident in saying it won't happen. It's just too
> much of a bother, for little to no actual upside. It's likely a much
> better approach to try to instead use THP for anonymous mappings.

arm64 already supports 2MB transparent hugepages. I guess it
wouldn't be too hard to change it so that an existing hugepage
on an anonymous mapping that gets split up into 4KB pages gets
split along 64KB boundaries with the contiguous mapping bit set.

Having full support for multiple hugepage sizes (64KB, 2MB and 32MB
in case of ARM64 with 4KB PAGE_SIZE) would be even better and
probably negate any benefits of 64KB PAGE_SIZE, but requires more
changes to common mm code.
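
The three sizes mentioned above can be derived from the page table
geometry. A rough sketch of that arithmetic, assuming the 4KB granule
uses 512 entries per table and that one contiguous hint covers 16
entries; the constant and function names are made up for this
illustration and are not kernel identifiers:

```python
PAGE_SIZE = 4 * 1024      # 4KB base pages
ENTRIES_PER_TABLE = 512   # assumption: 4KB table holding 8-byte entries
CONT_ENTRIES = 16         # assumption: entries covered by one contiguous hint

# Mapping sizes reachable without changing PAGE_SIZE.
MAPPING_SIZES = {
    "16 contiguous 4KB pages": CONT_ENTRIES * PAGE_SIZE,
    "one 2MB block": ENTRIES_PER_TABLE * PAGE_SIZE,
    "16 contiguous 2MB blocks": CONT_ENTRIES * ENTRIES_PER_TABLE * PAGE_SIZE,
}


def is_cont_aligned(address):
    """True if address starts a 64KB-aligned run of 16 base pages."""
    return address % (CONT_ENTRIES * PAGE_SIZE) == 0


for name, size in MAPPING_SIZES.items():
    print(name, "=", size // 1024, "KB")
```

Splitting a 2MB hugepage "along 64KB boundaries" would then mean
keeping every run for which is_cont_aligned() holds intact.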

	Arnd

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 20:16                   ` Arnd Bergmann
@ 2015-01-10 21:00                     ` Linus Torvalds
  2015-01-10 21:36                       ` Arnd Bergmann
  2015-01-12 11:53                     ` Catalin Marinas
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10 21:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-arm-kernel, Kirill A. Shutemov, Catalin Marinas,
	Mark Langsdorf, Linux Kernel Mailing List

On Sat, Jan 10, 2015 at 12:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On a recent kernel, I get 628 MB for storing all files of the
> kernel tree in 4KB pages, and 3141 MB for storing the same data
> in 64KB pages, almost exactly factor 5, or 2.45 GiB wasted.

Ok, so it's even worse than it used to be.  Partly because the tree
has grown, partly because I did the math for 16kB and 64kB is just
hugely worse.

I did the math back in the days when the PPC people were talking about
16kB pages (iirc - it's been closer to a decade, so I might
misremember the details).

And back then, with 4kB pages I could cache the kernel tree twice over
in 1GB, and have enough left to run a graphical desktop. So enough
memory to build a tree and also enough to have two kernel trees and do
"diff -urN" between them.

Of course, back then, 1-2GB was the usual desktop memory size, so the
"I can do kernel development in 1GB without excessive IO" mattered to
me in ways it wouldn't today.

And it was before "git", so the whole "two kernel trees and do diffs
between them" was a real concern.

With 16kB pages, I think I had to have twice the memory for the same loads.

> IIRC, AIX works great with 64k pages, but only because of two
> reasons that don't apply on Linux:

.. there's a few other ones:

 (c) nobody really runs AIX on desktops. It's very much a DB load
environment, with historically some HPC.

 (d) the powerpc TLB fill/buildup/teardown costs are horrible, so on
AIX the cost of lots of small pages is much higher too.

Now obviously, we *could* try to have a 64kB page size, and then do
lots of tricks to actually allocate file caches in partial pages in
order to avoid the internal fragmentation costs. HOWEVER:

 - that obviously doesn't help with the page management overhead (in
fact, it hurts). So it would be purely about trying to optimize for
bad TLB's.

 - that adds a *lot* of complexity to the VM. The coherency issues
when you may need to move cached information between partial pages and
full pages (required for mmap, but *most* files don't get mmap'ed)
would actually be pretty horrible.

 - all of this cost and complexity wouldn't help at all on x86, so it
would be largely untested and almost inevitably broken crap.

so I feel pretty confident in saying it won't happen. It's just too
much of a bother, for little to no actual upside. It's likely a much
better approach to try to instead use THP for anonymous mappings.

                            Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10  2:27                 ` Linus Torvalds
  2015-01-10  2:51                   ` David Lang
  2015-01-10  3:17                   ` Tony Luck
@ 2015-01-10 20:16                   ` Arnd Bergmann
  2015-01-10 21:00                     ` Linus Torvalds
  2015-01-12 11:53                     ` Catalin Marinas
  2 siblings, 2 replies; 101+ messages in thread
From: Arnd Bergmann @ 2015-01-10 20:16 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Linus Torvalds, Kirill A. Shutemov, Catalin Marinas,
	Mark Langsdorf, Linux Kernel Mailing List

On Friday 09 January 2015 18:27:38 Linus Torvalds wrote:
> On Fri, Jan 9, 2015 at 4:35 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > With bigger page size there's also reduction in number of entities to
> > handle by kernel: less memory occupied by struct pages, fewer pages on
> > lru, etc.
> 
> Really, do the math. [...]
>
> With a 64kB page, that means that for caching the kernel tree (what,
> closer to 50k files by now), you are basically wasting 60kB for most
> source files. Say, 60kB * 30k files, or 1.8GB.

On a recent kernel, I get 628 MB for storing all files of the
kernel tree in 4KB pages, and 3141 MB for storing the same data
in 64KB pages, almost exactly factor 5, or 2.45 GiB wasted.

> Maybe things have changed, and maybe I did my math wrong, and people
> can give a more exact number. But it's an example of why 64kB
> granularity is completely unacceptable in any kind of general-purpose
> load.

I'd say it's unacceptable for any file backed mappings in general, but
usually an improvement for anonymous maps, for the same reasons that
transparent huge pages are great. IIRC, AIX works great with 64k
pages, but only because of two reasons that don't apply on Linux:

a) The PowerPC MMU can mix 4KB and 64KB pages in a single process.
   Linux doesn't use this feature except for very special cases,
   although it could be done on PowerPC but not most other architectures.

b) Linux has a unified page cache page size that is used for both
   anonymous and file backed mappings. It's a great feature of the
   Linux MM code (it avoids having two copies of each mapped file
   in memory), but other OSs can just use 4KB blocks in the file
   system cache independent of the page size.

> 4kB works well. 8kB is perfectly acceptable. 16kB is already wasting a
> lot of memory. 32kB and up is complete garbage for general-purpose
> computing.

I was expecting 16KB pages to work better, but you are right:

arnd:~/linux$ for i in 1 2 4 8 16 32 64 128 256 ; do echo -n "$i KiB pages: " ; total=0 ; git ls-files | xargs ls -ld | while read a b c d e f ; do echo $[((e + $i*1024 - 1) / (1024 * $i))  ]  ; done | sort -n | uniq -c | while read num size ; do total=$[$total + ($num * $size) * $i] ; echo $[total / 1024] MiB ; done  | tail -n 1 ; done
1 KiB pages: 544 MiB
2 KiB pages: 571 MiB
4 KiB pages: 628 MiB
8 KiB pages: 759 MiB
16 KiB pages: 1055 MiB
32 KiB pages: 1717 MiB
64 KiB pages: 3141 MiB
128 KiB pages: 6103 MiB
256 KiB pages: 12125 MiB
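
The one-liner above rounds every tracked file up to a whole number of
pages and sums the result. A minimal sketch of the same arithmetic, for
illustration only: the function names and the sample file sizes are
invented here, and it takes a list of sizes instead of walking
"git ls-files":

```python
def pages_needed(size_bytes, page_kib):
    """Pages of page_kib KiB needed to cache one file (ceiling division)."""
    page_bytes = page_kib * 1024
    return (size_bytes + page_bytes - 1) // page_bytes


def cache_footprint_kib(file_sizes, page_kib):
    """Total footprint in KiB when every file is cached in whole pages."""
    return sum(pages_needed(size, page_kib) for size in file_sizes) * page_kib


# Hypothetical file sizes in bytes, not taken from the kernel tree.
sample_sizes = [100, 4096, 5000, 70000]

for page_kib in (4, 16, 64):
    total = cache_footprint_kib(sample_sizes, page_kib)
    print(page_kib, "KiB pages:", total, "KiB")
```

Small files dominate a source tree, so most of the growth at larger
page sizes comes from the unused tail of the last (often only) page of
each file.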

Regarding ARM64 in particular, I think it would be nice to investigate
how to extend the THP code to cover 64KB TLBs when running with the 4KB
page size. There is a hint bit in the page table to tell the CPU that
a set of 16 aligned pages can share one TLB entry, and it would be nice
to use that bit in Linux, and to make this case more common for anonymous
mappings, and possibly large file based mappings.

	Arnd

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 19:56                       ` Linus Torvalds
@ 2015-01-10 20:08                         ` Laszlo Ersek
  0 siblings, 0 replies; 101+ messages in thread
From: Laszlo Ersek @ 2015-01-10 20:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

On 01/10/15 20:56, Linus Torvalds wrote:
> On Sat, Jan 10, 2015 at 11:47 AM, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> I grepped the tree for "fullmm", and only tlb_gather_mmu() seems to set
>> it. There are several instances of that function, but each sets fullmm to:
>>
>>         /* Is it from 0 to ~0? */
>>         tlb->fullmm     = !(start | (end+1));
>>
>> So, a nonzero fullmm seems to imply (end == ~0UL).
> 
> Yes. But note how it implies "end == ~0ul", but it does *not* imply
> "tlb->end" having that value.

Ooops! :)

> tlb->end is initialized to zero (not obvious, but it's what the call
> to __tlb_reset_range() does). It's then updated by
> __tlb_adjust_range() as we actually flush individual pages.

Thanks.
Laszlo


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 19:47                     ` Laszlo Ersek
@ 2015-01-10 19:56                       ` Linus Torvalds
  2015-01-10 20:08                         ` Laszlo Ersek
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10 19:56 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Will Deacon, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

On Sat, Jan 10, 2015 at 11:47 AM, Laszlo Ersek <lersek@redhat.com> wrote:
>
> I grepped the tree for "fullmm", and only tlb_gather_mmu() seems to set
> it. There are several instances of that function, but each sets fullmm to:
>
>         /* Is it from 0 to ~0? */
>         tlb->fullmm     = !(start | (end+1));
>
> So, a nonzero fullmm seems to imply (end == ~0UL).

Yes. But note how it implies "end == ~0ul", but it does *not* imply
"tlb->end" having that value.

tlb->end is initialized to zero (not obvious, but it's what the call
to __tlb_reset_range() does). It's then updated by
__tlb_adjust_range() as we actually flush individual pages.
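
The distinction between the "end" argument and the tracked tlb->end can
be shown with a toy model. This is only a sketch: the class is a
stand-in invented for illustration, PAGE_SIZE and TASK_SIZE are
placeholder values, and resetting start to TASK_SIZE is an assumption
about what __tlb_reset_range() does beyond zeroing end:

```python
PAGE_SIZE = 4096            # assumed 4KB pages
TASK_SIZE = 1 << 47         # placeholder upper bound, an assumption
ULONG_MAX = (1 << 64) - 1   # model a 64-bit unsigned long


class MmuGather:
    """Toy stand-in for struct mmu_gather; only the fields discussed here."""

    def __init__(self, start, end):
        # fullmm is derived from the *arguments* start and end ...
        self.fullmm = (start | ((end + 1) & ULONG_MAX)) == 0
        # ... while the tracked range starts out empty.
        self.reset_range()

    def reset_range(self):
        # Models __tlb_reset_range(): end == 0 means "nothing recorded".
        self.start = TASK_SIZE
        self.end = 0

    def adjust_range(self, address):
        # Models __tlb_adjust_range(): grow the range to cover one page.
        self.start = min(self.start, address)
        self.end = max(self.end, address + PAGE_SIZE)


tlb = MmuGather(0, ULONG_MAX)
print(tlb.fullmm, hex(tlb.end))   # fullmm is set, tracked end is still 0
tlb.adjust_range(0x10000)
print(hex(tlb.start), hex(tlb.end))
```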

                           Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 13:37                   ` Will Deacon
  2015-01-10 19:47                     ` Laszlo Ersek
@ 2015-01-10 19:51                     ` Linus Torvalds
  2015-01-12 12:42                       ` Will Deacon
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10 19:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

On Sat, Jan 10, 2015 at 5:37 AM, Will Deacon <will.deacon@arm.com> wrote:
>>
>> Will?
>
> I'm wondering if this is now broken in the fullmm case, because tlb->end
> will be zero and we won't actually free any of the pages being unmapped
> on task exit. Does that sound plausible?

But did anything change wrt fullmm? I don't see any changes wrt fullmm
logic in generic code.

The arm64 code changed more, so maybe there was something I missed.
Again, arm64 uses tlb_end_vma() etc, so arm64 certainly triggers code
that x86 does not.

I can revert the commit that causes problems, but considering the
performance impact on x86 (it would be a regression since 3.18), I
would *really* like to fix the arm64 problem instead. So I'll wait
with the revert for at least a week, I think, hoping that the arm64
people figure this out. Sound reasonable?

                                  Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 13:37                   ` Will Deacon
@ 2015-01-10 19:47                     ` Laszlo Ersek
  2015-01-10 19:56                       ` Linus Torvalds
  2015-01-10 19:51                     ` Linus Torvalds
  1 sibling, 1 reply; 101+ messages in thread
From: Laszlo Ersek @ 2015-01-10 19:47 UTC (permalink / raw)
  To: Will Deacon, Linus Torvalds
  Cc: Mark Langsdorf, Marc Zyngier, Mark Rutland, Steve Capper,
	vishnu.ps, main kernel list, arm kernel list, Kyle McMartin

On 01/10/15 14:37, Will Deacon wrote:

> My hunch is that when a task exits and sets fullmm, end is zero and so the
> old need_flush cases no longer run.

(Disclaimer: I'm completely unfamiliar with this code.)

If you have the following call chain in mind:

  exit_mmap()
    tlb_gather_mmu()

then I think that (fullmm != 0) precludes (end == 0).

I grepped the tree for "fullmm", and only tlb_gather_mmu() seems to set
it. There are several instances of that function, but each sets fullmm to:

	/* Is it from 0 to ~0? */
	tlb->fullmm     = !(start | (end+1));

So, a nonzero fullmm seems to imply (end == ~0UL).

(And sure enough, exit_mmap() passes it ((unsigned long)-1) as "end").
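
The expression relies on unsigned wraparound: end + 1 is zero only when
end is all ones. A small sketch of that, assuming a 64-bit unsigned
long; the helper name is made up for this illustration:

```python
ULONG_MAX = (1 << 64) - 1  # assumption: 64-bit unsigned long


def is_fullmm(start, end):
    """Model of !(start | (end + 1)) with 64-bit unsigned wraparound."""
    return (start | ((end + 1) & ULONG_MAX)) == 0


print(is_fullmm(0, ULONG_MAX))       # the exit_mmap() case
print(is_fullmm(0, 0x1000))          # ordinary partial range
print(is_fullmm(0x1000, ULONG_MAX))  # nonzero start
```

Only the first call reports a full address space teardown.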

> With my original patch, we skipped the
> TLB invalidation (since the task is exiting and we will invalidate the TLB
> for that ASID before the ASID is reallocated) but still did the freeing.
> With the current code, we skip the freeing too, which causes us to leak
> pages on exit.

Yes, the new check prevents

  tlb_flush_mmu()
    tlb_flush_mmu_free()  <--- this
      free_pages_and_swap_cache()

> I guess we can either check need_flush as well as end, or we could set both
> start == end == some_nonzero_value in __tlb_adjust_range when need_flush is
> set. Unfortunately, I'm away from my h/w right now, so it's not easy to test
> this.

If you have a patch that applies and builds, I'm glad to test it. I got
a few hours now and I'll have some tomorrow as well. (On Monday I guess
you'll have access to your hardware again.)

Thanks!
Laszlo


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10 10:46                       ` Andreas Mohr
@ 2015-01-10 19:42                         ` Linus Torvalds
  0 siblings, 0 replies; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10 19:42 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: David Lang, Kirill A. Shutemov, Catalin Marinas, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

On Sat, Jan 10, 2015 at 2:46 AM, Andreas Mohr <andi@lisas.de> wrote:
>
> Yet that is what any VirtualAlloc() call on Windows does
> One thing less left to wonder why 'doze is such a performance pig...

Well, to be fair, you shouldn't use VirtualAlloc() as some 'malloc()'
replacement. It's  more of a backing store allocator for malloc() and
friends than anything else.

                         Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10  3:29               ` Laszlo Ersek
  2015-01-10  4:39                 ` Linus Torvalds
@ 2015-01-10 15:22                 ` Kyle McMartin
  1 sibling, 0 replies; 101+ messages in thread
From: Kyle McMartin @ 2015-01-10 15:22 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Mark Langsdorf, Will Deacon, Marc Zyngier, Mark Rutland,
	Steve Capper, Linus Torvalds, vishnu.ps, main kernel list,
	arm kernel list

On Sat, Jan 10, 2015 at 04:29:39AM +0100, Laszlo Ersek wrote:
> I've bisected this issue to
> 

Awesome, this was on my list of suspicious commits to check
before my ARM64 box decided not to come back from reboot on Friday. :)

Thanks for bisecting!

cheers,
--Kyle

> > f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1 is the first bad commit
> > commit f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1
> > Author: Linus Torvalds <torvalds@linux-foundation.org>
> > Date:   Wed Dec 17 11:59:04 2014 -0800
> >
> >     mmu_gather: fix over-eager tlb_flush_mmu_free() calling
> >
> >     Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
> >     range calculations into generic code") caused a performance problem:
> >
> >       "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
> >        tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
> >        applied, but does not show up at all on the commit before"
> >
> >     and the reason is that Will moved the test for whether we need to flush
> >     from tlb_flush_mmu() into tlb_flush_mmu_tlbonly().  But that meant that
> >     tlb_flush_mmu_free() basically lost that check.
> >
> >     Move it back into tlb_flush_mmu() where it belongs, so that it covers
> >     both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().
> >
> >     Reported-and-tested-by: Dave Hansen <dave@sr71.net>
> >     Acked-by: Will Deacon <will.deacon@arm.com>
> >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> >

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10  4:39                 ` Linus Torvalds
@ 2015-01-10 13:37                   ` Will Deacon
  2015-01-10 19:47                     ` Laszlo Ersek
  2015-01-10 19:51                     ` Linus Torvalds
  0 siblings, 2 replies; 101+ messages in thread
From: Will Deacon @ 2015-01-10 13:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Laszlo Ersek, Mark Langsdorf, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

Hi Linus, Laszlo,

On Sat, Jan 10, 2015 at 04:39:05AM +0000, Linus Torvalds wrote:
> On Fri, Jan 9, 2015 at 7:29 PM, Laszlo Ersek <lersek@redhat.com> wrote:
> > I've bisected this issue to

Thanks for bisecting this!

> .. commit f045bbb9fa1b ("mmu_gather: fix over-eager
> tlb_flush_mmu_free() calling")
> 
> Hmm. That commit literally just undoes something that commit
> fb7332a9fedf ("mmu_gather: move minimal range calculations into
> generic code") changed, and that was very wrong on x86.
> 
> But arm64 did have very different TLB flushing logic, so there may be
> some ARM64 reason that Will did that change originally, and then he
> forgot that reason when he ack'ed commit f045bbb9fa1b that undid it.
> 
> Will?

I'm wondering if this is now broken in the fullmm case, because tlb->end
will be zero and we won't actually free any of the pages being unmapped
on task exit. Does that sound plausible?

> Before your mmu_gather range calculations commit, we used to have
> 
> In tlb_flush_mmu_tlbonly():
> 
>      tlb->need_flush = 0;
> 
> and in tlb_flush_mmu():
> 
>     if (!tlb->need_flush)
>                 return;
> 
> and your commit changed the rule to be
> 
>     !tlb->need_flush == !tlb->end
> 
> so in the current tree we have
> 
>  In tlb_flush_mmu_tlbonly():
> 
>     __tlb_reset_range(tlb);   // replaces "tlb->need_flush = 0;"
> 
> and in tlb_flush_mmu():
> 
>     if (!tlb->end)    // replaces if (!tlb->need_flush)
>         return;
> 
> so we seem to do exactly the same as 3.18.
> 
> But in your original patch, you moved that "if (!tlb->end) return;"
> check from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(), and that
> apparently is actually needed on arm64. But *why*?

My hunch is that when a task exits and sets fullmm, end is zero and so the
old need_flush cases no longer run. With my original patch, we skipped the
TLB invalidation (since the task is exiting and we will invalidate the TLB
for that ASID before the ASID is reallocated) but still did the freeing.
With the current code, we skip the freeing too, which causes us to leak
pages on exit.
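
To make the hunch concrete, here is a toy model of the two placements
of the check. It is a sketch of the hypothesis only, not of the real
kernel code: the class, the function names and the page labels are
invented, and it simply assumes a state where pages are batched while
end is still zero:

```python
class Gather:
    """Toy mmu_gather: a tracked end plus a batch of pages to free."""

    def __init__(self):
        self.end = 0          # 0 means no range was recorded
        self.batched = []     # pages queued for freeing
        self.freed = []       # pages actually handed back
        self.tlb_flushes = 0


def flush_tlbonly(tlb, check_here):
    if check_here and not tlb.end:
        return                # nothing to invalidate
    tlb.tlb_flushes += 1
    tlb.end = 0               # stands in for __tlb_reset_range()


def flush_free(tlb):
    tlb.freed.extend(tlb.batched)
    tlb.batched.clear()


def flush_mmu_check_in_tlbonly(tlb):
    """Check inside the TLB-only step: freeing always runs."""
    flush_tlbonly(tlb, check_here=True)
    flush_free(tlb)


def flush_mmu_check_in_caller(tlb):
    """Check in the caller: an empty range skips both steps."""
    if not tlb.end:
        return
    flush_tlbonly(tlb, check_here=False)
    flush_free(tlb)


tlb = Gather()
tlb.batched = ["page1", "page2"]   # hypothetical pages, end left at 0
flush_mmu_check_in_caller(tlb)
print("check in caller:  freed", tlb.freed, "still batched", tlb.batched)
flush_mmu_check_in_tlbonly(tlb)
print("check in tlbonly: freed", tlb.freed, "still batched", tlb.batched)
```

In this model the caller-side check leaves the batch untouched, which
is the leak described above, while the tlbonly-side check skips the
invalidation but still frees the pages.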

> Also, looking at that commit fb7332a9fedf, I note that some of the
> "need_flush" setting was simply removed. See for example
> arch/powerpc/mm/hugetlbpage.c, and also in mm/memory.c:
> tlb_remove_table(). Is there something non-obvious that sets tlb->end
> there?

I figured that need_flush was already set in these cases, so the additional
setting was redundant:

  https://lkml.org/lkml/2014/11/10/340

> The other need_flush removals seem to all be paired with adding a
> __tlb_adjust_range() call, which will set ->end.
> 
> I'm starting to suspect that you moved the need_flush test into
> tlbonly exactly because you removed that
> 
>         tlb->need_flush = 1;
> 
> from mm/memory.c: tlb_remove_table().
> 
> x86 doesn't care, because x86 doesn't *use* tlb_remove_table(). But
> arm64 does, at least with the RCU freeing.
> 
> Any ideas?

I guess we can either check need_flush as well as end, or we could set both
start == end == some_nonzero_value in __tlb_adjust_range when need_flush is
set. Unfortunately, I'm away from my h/w right now, so it's not easy to test
this.

What do you reckon?

Will

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10  3:06                     ` Linus Torvalds
@ 2015-01-10 10:46                       ` Andreas Mohr
  2015-01-10 19:42                         ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Andreas Mohr @ 2015-01-10 10:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Lang, Kirill A. Shutemov, Catalin Marinas, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

Linus Torvalds wrote:
> I dunno. I do know that you definitely don't want to have a
> desktop/workstation with 64kB pages.

Yet that is what any VirtualAlloc() call on Windows does
(well, not exactly *page* granularity but *allocation* granularity there).
Prime example: do a naively specific/custom VirtualAlloc() request
for a simple string Hello World\0 allocation (11+1 bytes),
get one page (4kB, "Private Data") plus "overhead" (60kB, "Unusable").
--> allocation efficiency: 0.01831%(!).
And that does hurt plenty IME, especially on a 32bit address space's
very limited 2GB/3GB total per Win32 process.
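
The efficiency figure follows directly from the sizes quoted above. A
short check of the arithmetic, with constant and function names chosen
only for this illustration:

```python
ALLOCATION_GRANULARITY = 64 * 1024   # bytes of address space per allocation
PAGE_SIZE = 4 * 1024                 # bytes actually committed


def efficiency_percent(requested_bytes, reserved_bytes=ALLOCATION_GRANULARITY):
    """Share of the reserved address space that the request really uses."""
    return 100.0 * requested_bytes / reserved_bytes


request = len("Hello World") + 1     # 11 characters plus the terminating NUL
unusable_kib = (ALLOCATION_GRANULARITY - PAGE_SIZE) // 1024

print("requested bytes:", request)
print("unusable KiB:", unusable_kib)
print("efficiency: %.5f%%" % efficiency_percent(request))
```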

http://blogs.microsoft.co.il/sasha/2014/07/22/tracking-unusable-virtual-memory-vmmap/
"Why is address space allocation granularity 64K?"
  http://blogs.msdn.com/b/oldnewthing/archive/2003/10/08/55239.aspx

One thing less left to wonder why 'doze is such a performance pig...

Andreas Mohr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-10  3:29               ` Laszlo Ersek
@ 2015-01-10  4:39                 ` Linus Torvalds
  2015-01-10 13:37                   ` Will Deacon
  2015-01-10 15:22                 ` Kyle McMartin
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10  4:39 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Mark Langsdorf, Will Deacon, Marc Zyngier, Mark Rutland,
	Steve Capper, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

On Fri, Jan 9, 2015 at 7:29 PM, Laszlo Ersek <lersek@redhat.com> wrote:
>
> I've bisected this issue to

.. commit f045bbb9fa1b ("mmu_gather: fix over-eager
tlb_flush_mmu_free() calling")

Hmm. That commit literally just undoes something that commit
fb7332a9fedf ("mmu_gather: move minimal range calculations into
generic code") changed, and that was very wrong on x86.

But arm64 did have very different TLB flushing logic, so there may be
some ARM64 reason that Will did that change originally, and then he
forgot that reason when he ack'ed commit f045bbb9fa1b that undid it.

Will?

Before your mmu_gather range calculations commit, we used to have

In tlb_flush_mmu_tlbonly():

     tlb->need_flush = 0;

and in tlb_flush_mmu():

    if (!tlb->need_flush)
                return;

and your commit changed the rule to be

    !tlb->need_flush == !tlb->end

so in the current tree we have

 In tlb_flush_mmu_tlbonly():

    __tlb_reset_range(tlb);   // replaces "tlb->need_flush = 0;"

and in tlb_flush_mmu():

    if (!tlb->end)    // replaces if (!tlb->need_flush)
        return;

so we seem to do exactly the same as 3.18.

But in your original patch, you moved that "if (!tlb->end) return;"
check from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(), and that
apparently is actually needed on arm64. But *why*?

Also, looking at that commit fb7332a9fedf, I note that some of the
"need_flush" setting was simply removed. See for example
arch/powerpc/mm/hugetlbpage.c, and also in mm/memory.c:
tlb_remove_table(). Is there something non-obvious that sets tlb->end
there?

The other need_flush removals seem to all be paired with adding a
__tlb_adjust_range() call, which will set ->end.

I'm starting to suspect that you moved the need_flush test into
tlbonly exactly because you removed that

        tlb->need_flush = 1;

from mm/memory.c: tlb_remove_table().

x86 doesn't care, because x86 doesn't *use* tlb_remove_table(). But
arm64 does, at least with the RCU freeing.

Any ideas?

                         Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: Linux 3.19-rc3
  2015-01-09 19:43             ` Will Deacon
@ 2015-01-10  3:29               ` Laszlo Ersek
  2015-01-10  4:39                 ` Linus Torvalds
  2015-01-10 15:22                 ` Kyle McMartin
  0 siblings, 2 replies; 101+ messages in thread
From: Laszlo Ersek @ 2015-01-10  3:29 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Will Deacon, Marc Zyngier, Mark Rutland, Steve Capper,
	Linus Torvalds, vishnu.ps, main kernel list, arm kernel list,
	Kyle McMartin

On 01/09/15 20:43, Will Deacon wrote:
> On Fri, Jan 09, 2015 at 06:37:36PM +0000, Marc Zyngier wrote:
>> On 09/01/15 17:57, Mark Rutland wrote:
>>> On Fri, Jan 09, 2015 at 02:27:06PM +0000, Mark Langsdorf wrote:
>>>> On 01/09/2015 08:19 AM, Steve Capper wrote:
>>>>> On 9 January 2015 at 12:13, Mark Rutland <mark.rutland@arm.com> wrote:
>>>>>> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
>>>>>>> I'm consistently getting an out of memory killer triggered when
>>>>>>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
>>>>>>> with 16 GB of memory. This doesn't happen when running a 3.18
>>>>>>> kernel.
>>>>>>>
>>>>>>> I'm going to start bisecting the failure now, but here's the crash
>>>>>>> log in case someone can see something obvious in it.
>>>>>>
>>>>>> FWIW I've just reproduced this with v3.19-rc3 defconfig +
>>>>>> CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
>>>>>> system has 16GB of RAM and 6 CPUs.
> 
> [...]
> 
>>> I wasn't able to trigger the issue again with git, and the only way I've
>>> managed to trigger the issue is repeatedly building the kernel in a
>>> loop:
>>>
>>> while true; do
>>> 	git clean -fdx > /dev/null 2>&1;
>>> 	make defconfig > /dev/null 2>&1;
>>> 	make > /dev/null 2>&1;
>>> done
>>>
>>> Which after a while died:
>>>
>>> -bash: fork: Cannot allocate memory
> 
> [...]
> 
>> Just as another data point: I'm reproducing the exact same thing (it
>> only took a couple of kernel builds to kill the box), with almost all
>> 16GB of RAM stuck in Active(anon). I do *not* have CMA enabled though.
>>
>> I've kicked another run with 4k pages.
> 
> The `mallocstress' tool from LTP seems to be a quick way to reproduce
> the memory leak behind this (leaks 5/8GB on my Juno). It spawns a bunch
> of threads, that each call malloc until it returns NULL. I thought maybe
> we're leaking page tables, but 5GB is pretty excessive.
> 
> However, I'm unable to reproduce the problem under a 32-bit kernel on my
> TC2 board or on 3.18 + the 3.19 merge window pull for arm64.
> 
> I guess we should try to bisect using the above.
> 
> Will
> 

I've bisected this issue to

> f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1 is the first bad commit
> commit f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed Dec 17 11:59:04 2014 -0800
>
>     mmu_gather: fix over-eager tlb_flush_mmu_free() calling
>
>     Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
>     range calculations into generic code") caused a performance problem:
>
>       "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
>        tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
>        applied, but does not show up at all on the commit before"
>
>     and the reason is that Will moved the test for whether we need to flush
>     from tlb_flush_mmu() into tlb_flush_mmu_tlbonly().  But that meant that
>     tlb_flush_mmu_free() basically lost that check.
>
>     Move it back into tlb_flush_mmu() where it belongs, so that it covers
>     both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().
>
>     Reported-and-tested-by: Dave Hansen <dave@sr71.net>
>     Acked-by: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> :040000 040000 a4768484a068b37a43863123ac782fb6d01149b7 75ff0be6a2f3e4caa9a06df6503fa54d25dfa44d M      mm

in an aarch64 QEMU/KVM guest running Fedora 21 Server (8GB RAM, 8
VCPUs). I bisected between v3.18 and v3.19-rc3. I used
"config-3.17.4-302.fc21.aarch64" as starting config, on which I kept
running olddefconfig and localmodconfig.

<tangent>
It is beneficial to perform such a bisection in a virtual machine. Super
fast reboot times, and the hot parts of the virtual disk are cached in
host memory. It's easy to boot a test kernel for running the reproducer,
and then reboot the known good distro kernel for building the next step
in the bisection.
</tangent>

As reproducer I used "./mallocstress -t 300" (recommended by Mark
Langsdorf & Kyle McMartin, but also named by Will just above this
thread).

One thing I noticed during the several repro turns is that the OOM
killer never hit while mallocstress was running "normally" (ie. before
the first thread exited). In the broken kernels, the OOM killer always
hit after a few (tens) of the threads had exited. The leak is probably
related to thread exit. (Which is consistent with the kernel build
reproducer, because that causes a lot of threads (processes) to exit
too.)

Bisection log below.

Thanks
Laszlo

git bisect start
# bad: [b1940cd21c0f4abdce101253e860feff547291b0] Linux 3.19-rc3
git bisect bad b1940cd21c0f4abdce101253e860feff547291b0
# good: [b2776bf7149bddd1f4161f14f79520f17fc1d71d] Linux 3.18
git bisect good b2776bf7149bddd1f4161f14f79520f17fc1d71d
# good: [a7cfef21e3d066343bec14d3113a9f9c92d1c2a8] Merge branches 'core', 'cxgb4', 'ipoib', 'iser', 'mlx4', 'ocrdma', 'odp' and 'srp' into for-next
git bisect good a7cfef21e3d066343bec14d3113a9f9c92d1c2a8
# good: [988adfdffdd43cfd841df734664727993076d7cb] Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
git bisect good 988adfdffdd43cfd841df734664727993076d7cb
# good: [eb64c3c6cdb8fa8a4d324eb71a9033b62e150918] Merge tag 'stable/for-linus-3.19-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
git bisect good eb64c3c6cdb8fa8a4d324eb71a9033b62e150918
# bad: [385336e321c41b5174055c0194b60c19a27cc5c5] Merge tag 'platform-drivers-x86-v3.19-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
git bisect bad 385336e321c41b5174055c0194b60c19a27cc5c5
# bad: [87c31b39abcb6fb6bd7d111200c9627a594bf6a9] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
git bisect bad 87c31b39abcb6fb6bd7d111200c9627a594bf6a9
# good: [0ea90b9e79cff66934119e6dd8fa8e9d0f7d005a] Merge tag 'microblaze-3.19-rc1' of git://git.monstr.eu/linux-2.6-microblaze
git bisect good 0ea90b9e79cff66934119e6dd8fa8e9d0f7d005a
# good: [d797da41b2aceed5daa8cd2eee92cd74b2a0c652] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good d797da41b2aceed5daa8cd2eee92cd74b2a0c652
# good: [c89d99546dc5b076ccd6692c48ada9a92820a4ac] Merge branch 'eduardo-soc-thermal' into thermal-soc
git bisect good c89d99546dc5b076ccd6692c48ada9a92820a4ac
# good: [9f3e15129902bca9d8e296c165345f158bac94eb] Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
git bisect good 9f3e15129902bca9d8e296c165345f158bac94eb
# good: [80dd00a23784b384ccea049bfb3f259d3f973b9d] userns: Check euid no fsuid when establishing an unprivileged uid mapping
git bisect good 80dd00a23784b384ccea049bfb3f259d3f973b9d
# good: [db86da7cb76f797a1a8b445166a15cb922c6ff85] userns: Unbreak the unprivileged remount tests
git bisect good db86da7cb76f797a1a8b445166a15cb922c6ff85
# good: [cc669743a39e3f61c9ca5e786e959bf478ccd197] Merge tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio
git bisect good cc669743a39e3f61c9ca5e786e959bf478ccd197
# bad: [f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1] mmu_gather: fix over-eager tlb_flush_mmu_free() calling
git bisect bad f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1
# good: [cf3c0a1579eff90195a791c5f464463c1011ef4a] x86: mm: fix VM_FAULT_RETRY handling
git bisect good cf3c0a1579eff90195a791c5f464463c1011ef4a
# first bad commit: [f045bbb9fa1bf6f507ad4de12d4e3471d8f672f1] mmu_gather: fix over-eager tlb_flush_mmu_free() calling


* Re: Linux 3.19-rc3
  2015-01-10  2:27                 ` Linus Torvalds
  2015-01-10  2:51                   ` David Lang
@ 2015-01-10  3:17                   ` Tony Luck
  2015-01-10 20:16                   ` Arnd Bergmann
  2 siblings, 0 replies; 101+ messages in thread
From: Tony Luck @ 2015-01-10  3:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Catalin Marinas, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Jan 9, 2015 at 6:27 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Big pages are a bad bad bad idea. They work fine for databases, and
> that's pretty much just about it. I'm sure there are some other loads,
> but they are few and far between.

For HPC too. They tend not to do a lot of I/O (and when they do it is
from a few big files). Then they just sit crunching over gigabytes of
memory for seven and a half million years before doing:

    printf("Answer is %d\n", 42);

-Tony

* Re: Linux 3.19-rc3
  2015-01-10  2:51                   ` David Lang
@ 2015-01-10  3:06                     ` Linus Torvalds
  2015-01-10 10:46                       ` Andreas Mohr
  2015-01-13  3:33                     ` Rik van Riel
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10  3:06 UTC (permalink / raw)
  To: David Lang
  Cc: Kirill A. Shutemov, Catalin Marinas, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, Jan 9, 2015 at 6:51 PM, David Lang <david@lang.hm> wrote:
>
> What about a dedicated virtualization host (where your workload is a
> handful of virtual machines)? Would the file cache issue still be
> overwhelming, even though it's the virtual machines accessing things?

How much filesystem cache does the host need or use? It can range
from basically zero ("pure" hypervisor host with no filesystem at all)
to 100% (virtual filesystem in all the clients with the host doing all
the real work).

I dunno. I do know that you definitely don't want to have a
desktop/workstation with 64kB pages.

                        Linus

* Re: Linux 3.19-rc3
  2015-01-10  2:27                 ` Linus Torvalds
@ 2015-01-10  2:51                   ` David Lang
  2015-01-10  3:06                     ` Linus Torvalds
  2015-01-13  3:33                     ` Rik van Riel
  2015-01-10  3:17                   ` Tony Luck
  2015-01-10 20:16                   ` Arnd Bergmann
  2 siblings, 2 replies; 101+ messages in thread
From: David Lang @ 2015-01-10  2:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Catalin Marinas, Mark Langsdorf,
	Linux Kernel Mailing List, linux-arm-kernel

On Fri, 9 Jan 2015, Linus Torvalds wrote:

> Big pages are a bad bad bad idea. They work fine for databases, and
> that's pretty much just about it. I'm sure there are some other loads,
> but they are few and far between.

What about a dedicated virtualization host (where your workload is a
handful of virtual machines)? Would the file cache issue still be
overwhelming, even though it's the virtual machines accessing things?

David Lang

* Re: Linux 3.19-rc3
  2015-01-10  0:35               ` Kirill A. Shutemov
@ 2015-01-10  2:27                 ` Linus Torvalds
  2015-01-10  2:51                   ` David Lang
                                     ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Linus Torvalds @ 2015-01-10  2:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Catalin Marinas, Mark Langsdorf, Linux Kernel Mailing List,
	linux-arm-kernel

On Fri, Jan 9, 2015 at 4:35 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> With a bigger page size there's also a reduction in the number of
> entities the kernel has to handle: less memory occupied by struct
> pages, fewer pages on the lru, etc.

Bah. Humbug. You've been listening to HW people too much, but they see
the small details, not the big picture.

Page management was indeed a huge PITA for PAE. But that was because
PAE is garbage.

Really, do the math. 64kB pages waste so much memory due to internal
fragmentation that any other memory use is completely immaterial.

Pretty much the only situation where 64kB pages are fine is when you
have one single meaningful load running on your machine, and that one
single load is likely a database.

Any other time, your filesystem caches will just suck.

Just as an example, look at the kernel tree. Last I did the math
(which is admittedly a long time ago), the median size (not average -
median) of a file was pretty much 4kB.

With a 64kB page, that means that for caching the kernel tree (what,
closer to 50k files by now), you are basically wasting 60kB for most
source files. Say, 60kB * 30k files, or 1.8GB.
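
For anyone who wants to redo that arithmetic with their own numbers,
here is a rough sketch in C. It is not kernel code: the function names
are made up for illustration, and it assumes that every file has the
same size (4kB, the median quoted above) and that a cached file always
occupies a whole number of pages.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of page cache left unused when one file of `file_size` bytes
 * is cached in whole pages of `page_size` bytes. */
uint64_t cache_waste_per_file(uint64_t page_size, uint64_t file_size)
{
	assert(page_size > 0);
	/* Round the file up to a whole number of pages. */
	uint64_t pages = (file_size + page_size - 1) / page_size;
	return pages * page_size - file_size;
}

/* Total waste for `nfiles` files that all have the same size. */
uint64_t cache_waste_total(uint64_t page_size, uint64_t file_size,
			   uint64_t nfiles)
{
	return cache_waste_per_file(page_size, file_size) * nfiles;
}
```

With page_size = 65536, file_size = 4096 and nfiles = 30000 this gives
61440 bytes per file and 1843200000 bytes in total, which is roughly the
1.8GB above. With 4096-byte pages the same files waste nothing.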

Read that again. 1.8 GIGABYTES. Wasted. Completely unused. Thrown
away, with no upside. And that's just from the page cache for the
kernel sources, which would be populated by a single "git grep".

Maybe things have changed, and maybe I did my math wrong, and people
can give a more exact number. But it's an example of why 64kB
granularity is completely unacceptable in any kind of general-purpose
load.

Anybody who brings up TLB costs or costs of maintaining pages is just
lying, or hasn't actually looked at the real issues. Anything you win
in TLB you lose in *lots* and lots of extra IO, because you aren't
using your memory efficiently for caching, and are basically easily
throwing away half your RAM.

Big pages are a bad bad bad idea. They work fine for databases, and
that's pretty much just about it. I'm sure there are some other loads,
but they are few and far between.

4kB works well. 8kB is perfectly acceptable. 16kB is already wasting a
lot of memory. 32kB and up is complete garbage for general-purpose
computing.

And no, source code isn't *that* special. There are lots of other
cases where you have a multitude of small files. And small files are
the things you want to cache - don't tell me about big video files etc
that make a 64kB page size look small, because those files don't
actually tend to be all that relevant for caching.

                                   Linus

* Re: Linux 3.19-rc3
  2015-01-09 23:27             ` Catalin Marinas
@ 2015-01-10  0:35               ` Kirill A. Shutemov
  2015-01-10  2:27                 ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Kirill A. Shutemov @ 2015-01-10  0:35 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linus Torvalds, Mark Langsdorf, Linux Kernel Mailing List,
	linux-arm-kernel

On Fri, Jan 09, 2015 at 11:27:07PM +0000, Catalin Marinas wrote:
> On Thu, Jan 08, 2015 at 07:21:02PM +0000, Linus Torvalds wrote:
> > The only excuse for 64kB pages is "my hardware TLB is complete crap,
> > and I have very specialized server-only loads".
> 
> I would make a slight correction: s/and/or/.
> 
> I agree that for a general purpose system (and even systems like web
> hosting servers), 64KB is overkill; 16KB may be a better compromise.
> 
> There are however some specialised loads that benefit from this. The
> main example here is virtualisation where if both guest and host use 4
> levels of page tables each (that's what you may get with 4KB pages on
> arm64), a full TLB miss in both stages of translation (the ARM
> terminology for nested page tables) needs up to _24_ memory accesses
> (though cached). Of course, once the TLB warms up, there will be far
> fewer, but for new mmaps you always get some misses.
> 
> With 64KB pages (in the host usually), you can reduce the page table
> levels to three or two (the latter for 42-bit VA) or you could even
> couple this with some insanely huge pages (512MB, the next up from 64KB)
> to decrease the number of levels further.
> 
> I see three main advantages: the usual reduced TLB pressure (which
> arguably can be solved with bigger TLBs), fewer TLB misses and, pretty
> important with virtualisation, a lower cost per TLB miss due to the
> reduced number of levels. But it's for the user to balance these
> advantages against the disadvantages you already mentioned, based on
> the planned workload (e.g. host configured with 64KB pages while
> guests use 4KB).
> 
> Another aspect on ARM is the TLB flushing on (large) MP systems. With a
> larger page size, we reduce the number of (in-hardware) TLB operations
> broadcast between CPUs (we could use non-broadcasting ops and IPIs,
> though I'm not sure they are any faster).

With a bigger page size there's also a reduction in the number of
entities the kernel has to handle: less memory occupied by struct
pages, fewer pages on the lru, etc.

Managing a lot of memory (TiB scale) in 4k chunks is just insane.
We will need to find a way to cluster memory together to manage it
reasonably, whether it's a bigger base page size or some other
mechanism. Maybe THP? ;)

-- 
 Kirill A. Shutemov

* Re: Linux 3.19-rc3
  2015-01-08 19:21           ` Linus Torvalds
@ 2015-01-09 23:27             ` Catalin Marinas
  2015-01-10  0:35               ` Kirill A. Shutemov
  0 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2015-01-09 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Langsdorf, Linux Kernel Mailing List, linux-arm-kernel

On Thu, Jan 08, 2015 at 07:21:02PM +0000, Linus Torvalds wrote:
> The only excuse for 64kB pages is "my hardware TLB is complete crap,
> and I have very specialized server-only loads".

I would make a slight correction: s/and/or/.

I agree that for a general purpose system (and even systems like web
hosting servers), 64KB is overkill; 16KB may be a better compromise.

There are however some specialised loads that benefit from this. The
main example here is virtualisation where if both guest and host use 4
levels of page tables each (that's what you may get with 4KB pages on
arm64), a full TLB miss in both stages of translation (the ARM
terminology for nested page tables) needs up to _24_ memory accesses
(though cached). Of course, once the TLB warms up, there will be far
fewer, but for new mmaps you always get some misses.
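
The 24 comes from a simple counting argument, sketched below in C. This
is illustration only, not kernel code: the function name is made up, and
it assumes a completely cold TLB with no caching of intermediate walk
results, so real hardware will usually do better.

```c
#include <assert.h>

/* Worst-case memory accesses for one two-stage (nested) translation.
 *   guest_levels: levels in the guest (stage 1) page table
 *   host_levels:  levels in the host (stage 2) page table
 */
unsigned int nested_walk_accesses(unsigned int guest_levels,
				  unsigned int host_levels)
{
	assert(guest_levels > 0 && host_levels > 0);
	/* Each guest table pointer is an intermediate address: walk
	 * stage 2 for it (host_levels reads), then read the guest entry. */
	unsigned int per_guest_level = host_levels + 1;
	/* The final guest output address needs one more stage 2 walk. */
	return guest_levels * per_guest_level + host_levels;
}
```

nested_walk_accesses(4, 4) is 24, matching the figure above. A 4-level
guest on a 2-level host comes to 14, and 3 levels on both sides to 15.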

With 64KB pages (in the host usually), you can reduce the page table
levels to three or two (the latter for 42-bit VA) or you could even
couple this with some insanely huge pages (512MB, the next up from 64KB)
to decrease the number of levels further.

I see three main advantages: the usual reduced TLB pressure (which
arguably can be solved with bigger TLBs), fewer TLB misses and, pretty
important with virtualisation, a lower cost per TLB miss due to the
reduced number of levels. But it's for the user to balance these
advantages against the disadvantages you already mentioned, based on
the planned workload (e.g. host configured with 64KB pages while
guests use 4KB).

Another aspect on ARM is the TLB flushing on (large) MP systems. With a
larger page size, we reduce the number of (in-hardware) TLB operations
broadcast between CPUs (we could use non-broadcasting ops and IPIs,
though I'm not sure they are any faster).

-- 
Catalin

* Re: Linux 3.19-rc3
  2015-01-09 18:37           ` Marc Zyngier
@ 2015-01-09 19:43             ` Will Deacon
  2015-01-10  3:29               ` Laszlo Ersek
  0 siblings, 1 reply; 101+ messages in thread
From: Will Deacon @ 2015-01-09 19:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Mark Rutland, Mark Langsdorf, Steve Capper,
	Linux Kernel Mailing List, linux-arm-kernel, Linus Torvalds,
	vishnu.ps

On Fri, Jan 09, 2015 at 06:37:36PM +0000, Marc Zyngier wrote:
> On 09/01/15 17:57, Mark Rutland wrote:
> > On Fri, Jan 09, 2015 at 02:27:06PM +0000, Mark Langsdorf wrote:
> >> On 01/09/2015 08:19 AM, Steve Capper wrote:
> >>> On 9 January 2015 at 12:13, Mark Rutland <mark.rutland@arm.com> wrote:
> >>>> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
> >>>>> I'm consistently getting an out of memory killer triggered when
> >>>>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
> >>>>> with 16 GB of memory. This doesn't happen when running a 3.18
> >>>>> kernel.
> >>>>>
> >>>>> I'm going to start bisecting the failure now, but here's the crash
> >>>>> log in case someone can see something obvious in it.
> >>>>
> >>>> FWIW I've just reproduced this with v3.19-rc3 defconfig +
> >>>> CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
> >>>> system has 16GB of RAM and 6 CPUs.

[...]

> > I wasn't able to trigger the issue again with git, and the only way I've
> > managed to trigger the issue is repeatedly building the kernel in a
> > loop:
> > 
> > while true; do
> > 	git clean -fdx > /dev/null 2>&1;
> > 	make defconfig > /dev/null 2>&1;
> > 	make > /dev/null 2>&1;
> > done
> > 
> > Which after a while died:
> > 
> > -bash: fork: Cannot allocate memory

[...]

> Just as another data point: I'm reproducing the exact same thing (it
> only took a couple of kernel builds to kill the box), with almost all
> 16GB of RAM stuck in Active(anon). I do *not* have CMA enabled though.
> 
> I've kicked off another run with 4k pages.

The `mallocstress' tool from LTP seems to be a quick way to reproduce
the memory leak behind this (it leaks 5GB of the 8GB on my Juno). It
spawns a bunch of threads that each call malloc until it returns NULL.
I thought maybe we're leaking page tables, but 5GB is pretty excessive.

However, I'm unable to reproduce the problem under a 32-bit kernel on my
TC2 board or on 3.18 + the 3.19 merge window pull for arm64.

I guess we should try to bisect using the above.

Will

* Re: Linux 3.19-rc3
  2015-01-09 17:57         ` Mark Rutland
@ 2015-01-09 18:37           ` Marc Zyngier
  2015-01-09 19:43             ` Will Deacon
  0 siblings, 1 reply; 101+ messages in thread
From: Marc Zyngier @ 2015-01-09 18:37 UTC (permalink / raw)
  To: Mark Rutland, Mark Langsdorf
  Cc: Steve Capper, Will Deacon, Linux Kernel Mailing List,
	linux-arm-kernel, Linus Torvalds, vishnu.ps

On 09/01/15 17:57, Mark Rutland wrote:
> On Fri, Jan 09, 2015 at 02:27:06PM +0000, Mark Langsdorf wrote:
>> On 01/09/2015 08:19 AM, Steve Capper wrote:
>>> On 9 January 2015 at 12:13, Mark Rutland <mark.rutland@arm.com> wrote:
>>>> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
>>>>> I'm consistently getting an out of memory killer triggered when
>>>>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
>>>>> with 16 GB of memory. This doesn't happen when running a 3.18
>>>>> kernel.
>>>>>
>>>>> I'm going to start bisecting the failure now, but here's the crash
>>>>> log in case someone can see something obvious in it.
>>>>
>>>> FWIW I've just reproduced this with v3.19-rc3 defconfig +
>>>> CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
>>>> system has 16GB of RAM and 6 CPUs.
>>>>
>>>> I have a similarly dodgy looking number of pages reserved
>>>> (18446744073709544451 A.K.A. -7165). Log below.
>>>>
>>>
>>> I think the negative page reserved count is a consequence of another bug.
>>>
>>> We have the following reporting code in lib/show_mem.c:
>>> #ifdef CONFIG_CMA
>>>          printk("%lu pages reserved\n", (reserved - totalcma_pages));
>>>          printk("%lu pages cma reserved\n", totalcma_pages);
>>> #else
>>>
>>> With totalcma_pages being reported as 8192, that would account for the
>>> -7000ish values reported.
>>>
>>> That change appears to have come from:
>>> 49abd8c lib/show_mem.c: add cma reserved information
>>>
>>> Is the quickest way to exacerbate this OOM a kernel compile?
>>
>> I haven't really tried to characterize this. Compiling a kernel
>> on a 64K page machine causes a failure reasonably quickly and
>> doesn't require a lot of thought. I think that time spent finding
>> a faster reproducer wouldn't pay off.
> 
> I wasn't able to trigger the issue again with git, and the only way I've
> managed to trigger the issue is repeatedly building the kernel in a
> loop:
> 
> while true; do
> 	git clean -fdx > /dev/null 2>&1;
> 	make defconfig > /dev/null 2>&1;
> 	make > /dev/null 2>&1;
> done
> 
> Which after a while died:
> 
> -bash: fork: Cannot allocate memory
> 
> I didn't see anything interesting in dmesg, but I was able to get at
> /proc/meminfo:
> 
> MemTotal:       16695168 kB
> MemFree:          998336 kB
> MemAvailable:     325568 kB
> Buffers:           51200 kB
> Cached:           236224 kB
> SwapCached:            0 kB
> Active:         14970880 kB
> Inactive:         580288 kB
> Active(anon):   14834496 kB
> Inactive(anon):     5760 kB
> Active(file):     136384 kB
> Inactive(file):   574528 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:             0 kB
> SwapFree:              0 kB
> Dirty:               448 kB
> Writeback:             0 kB
> AnonPages:         22400 kB
> Mapped:            10240 kB
> Shmem:              8768 kB
> Slab:              63744 kB
> SReclaimable:      27072 kB
> SUnreclaim:        36672 kB
> KernelStack:        1824 kB
> PageTables:         3776 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     8347584 kB
> Committed_AS:      50368 kB
> VmallocTotal:   2142764992 kB
> VmallocUsed:      283264 kB
> VmallocChunk:   2142387200 kB
> AnonHugePages:         0 kB
> CmaTotal:         524288 kB
> CmaFree:             128 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:     524288 kB
> 
> And also magic-sysrq m:
> 
> SysRq : Show Memory
> Mem-Info:
> DMA per-cpu:
> CPU    0: hi:    6, btch:   1 usd:   1
> CPU    1: hi:    6, btch:   1 usd:   1
> CPU    2: hi:    6, btch:   1 usd:   1
> CPU    3: hi:    6, btch:   1 usd:   3
> CPU    4: hi:    6, btch:   1 usd:   5
> CPU    5: hi:    6, btch:   1 usd:   5
> Normal per-cpu:
> CPU    0: hi:    6, btch:   1 usd:   0
> CPU    1: hi:    6, btch:   1 usd:   5
> CPU    2: hi:    6, btch:   1 usd:   1
> CPU    3: hi:    6, btch:   1 usd:   5
> CPU    4: hi:    6, btch:   1 usd:   5
> CPU    5: hi:    6, btch:   1 usd:   5
> active_anon:231780 inactive_anon:90 isolated_anon:0
>  active_file:2131 inactive_file:8977 isolated_file:0
>  unevictable:0 dirty:8 writeback:0 unstable:0
>  free:15601 slab_reclaimable:423 slab_unreclaimable:573
>  mapped:160 shmem:137 pagetables:59 bounce:0
>  free_cma:2
> DMA free:302336kB min:208000kB low:259968kB high:312000kB active_anon:3618432kB inactive_anon:768kB active_file:34432kB inactive_file:131584kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4177920kB managed:4166528kB mlocked:0kB dirty:192kB writeback:0kB mapped:4736kB shmem:1024kB slab_reclaimable:5184kB slab_unreclaimable:3328kB kernel_stack:0kB pagetables:1600kB unstable:0kB bounce:0kB free_cma:128kB writeback_tmp:0kB pages_scanned:1208448 all_unreclaimable? yes
> lowmem_reserve[]: 0 764 764
> Normal free:696128kB min:625472kB low:781824kB high:938176kB active_anon:11215488kB inactive_anon:4992kB active_file:101952kB inactive_file:442944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12582912kB managed:12528640kB mlocked:0kB dirty:320kB writeback:0kB mapped:5504kB shmem:7744kB slab_reclaimable:21888kB slab_unreclaimable:33344kB kernel_stack:1840kB pagetables:2176kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3331648 all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0
> DMA: 42*64kB (MRC) 37*128kB (R) 6*256kB (R) 5*512kB (R) 2*1024kB (R) 3*2048kB (R) 1*4096kB (R) 0*8192kB 1*16384kB (R) 0*32768kB 0*65536kB 0*131072kB 1*262144kB (R) 0*524288kB = 302336kB
> Normal: 280*64kB (MR) 40*128kB (R) 5*256kB (R) 4*512kB (R) 6*1024kB (R) 4*2048kB (R) 1*4096kB (R) 1*8192kB (R) 1*16384kB (R) 1*32768kB (R) 1*65536kB (R) 0*131072kB 0*262144kB 1*524288kB (R) = 691968kB
> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
> 4492 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB
> 261888 pages RAM
> 0 pages HighMem/MovableOnly
> 18446744073709544450 pages reserved
> 8192 pages cma reserved
> 
> I also ran ps aux, but I didn't see any stale tasks lying around, nor
> did any remaining tasks seem to account for all that active anonymous
> memory.
> 
> I'll see if I can reproduce on x86.

Just as another data point: I'm reproducing the exact same thing (it
only took a couple of kernel builds to kill the box), with almost all
16GB of RAM stuck in Active(anon). I do *not* have CMA enabled though.

I've kicked off another run with 4k pages.

	M.
-- 
Jazz is not dead. It just smells funny...

* Re: Linux 3.19-rc3
  2015-01-09 14:27       ` Mark Langsdorf
@ 2015-01-09 17:57         ` Mark Rutland
  2015-01-09 18:37           ` Marc Zyngier
  0 siblings, 1 reply; 101+ messages in thread
From: Mark Rutland @ 2015-01-09 17:57 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Steve Capper, Linus Torvalds, Linux Kernel Mailing List,
	linux-arm-kernel, vishnu.ps, Will Deacon

On Fri, Jan 09, 2015 at 02:27:06PM +0000, Mark Langsdorf wrote:
> On 01/09/2015 08:19 AM, Steve Capper wrote:
> > On 9 January 2015 at 12:13, Mark Rutland <mark.rutland@arm.com> wrote:
> >> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
> >>> I'm consistently getting an out of memory killer triggered when
> >>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
> >>> with 16 GB of memory. This doesn't happen when running a 3.18
> >>> kernel.
> >>>
> >>> I'm going to start bisecting the failure now, but here's the crash
> >>> log in case someone can see something obvious in it.
> >>
> >> FWIW I've just reproduced this with v3.19-rc3 defconfig +
> >> CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
> >> system has 16GB of RAM and 6 CPUs.
> >>
> >> I have a similarly dodgy looking number of pages reserved
> >> (18446744073709544451 A.K.A. -7165). Log below.
> >>
> >
> > I think the negative page reserved count is a consequence of another bug.
> >
> > We have the following reporting code in lib/show_mem.c:
> > #ifdef CONFIG_CMA
> >          printk("%lu pages reserved\n", (reserved - totalcma_pages));
> >          printk("%lu pages cma reserved\n", totalcma_pages);
> > #else
> >
> > With totalcma_pages being reported as 8192, that would account for the
> > -7000ish values reported.
> >
> > That change appears to have come from:
> > 49abd8c lib/show_mem.c: add cma reserved information
> >
> > Is the quickest way to exacerbate this OOM a kernel compile?
> 
> I haven't really tried to characterize this. Compiling a kernel
> on a 64K page machine causes a failure reasonably quickly and
> doesn't require a lot of thought. I think that time spent finding
> a faster reproducer wouldn't pay off.

I wasn't able to trigger the issue again with git, and the only way I've
managed to trigger the issue is repeatedly building the kernel in a
loop:

while true; do
	git clean -fdx > /dev/null 2>&1;
	make defconfig > /dev/null 2>&1;
	make > /dev/null 2>&1;
done

Which after a while died:

-bash: fork: Cannot allocate memory

I didn't see anything interesting in dmesg, but I was able to get at
/proc/meminfo:

MemTotal:       16695168 kB
MemFree:          998336 kB
MemAvailable:     325568 kB
Buffers:           51200 kB
Cached:           236224 kB
SwapCached:            0 kB
Active:         14970880 kB
Inactive:         580288 kB
Active(anon):   14834496 kB
Inactive(anon):     5760 kB
Active(file):     136384 kB
Inactive(file):   574528 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               448 kB
Writeback:             0 kB
AnonPages:         22400 kB
Mapped:            10240 kB
Shmem:              8768 kB
Slab:              63744 kB
SReclaimable:      27072 kB
SUnreclaim:        36672 kB
KernelStack:        1824 kB
PageTables:         3776 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8347584 kB
Committed_AS:      50368 kB
VmallocTotal:   2142764992 kB
VmallocUsed:      283264 kB
VmallocChunk:   2142387200 kB
AnonHugePages:         0 kB
CmaTotal:         524288 kB
CmaFree:             128 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:     524288 kB

And also magic-sysrq m:

SysRq : Show Memory
Mem-Info:
DMA per-cpu:
CPU    0: hi:    6, btch:   1 usd:   1
CPU    1: hi:    6, btch:   1 usd:   1
CPU    2: hi:    6, btch:   1 usd:   1
CPU    3: hi:    6, btch:   1 usd:   3
CPU    4: hi:    6, btch:   1 usd:   5
CPU    5: hi:    6, btch:   1 usd:   5
Normal per-cpu:
CPU    0: hi:    6, btch:   1 usd:   0
CPU    1: hi:    6, btch:   1 usd:   5
CPU    2: hi:    6, btch:   1 usd:   1
CPU    3: hi:    6, btch:   1 usd:   5
CPU    4: hi:    6, btch:   1 usd:   5
CPU    5: hi:    6, btch:   1 usd:   5
active_anon:231780 inactive_anon:90 isolated_anon:0
 active_file:2131 inactive_file:8977 isolated_file:0
 unevictable:0 dirty:8 writeback:0 unstable:0
 free:15601 slab_reclaimable:423 slab_unreclaimable:573
 mapped:160 shmem:137 pagetables:59 bounce:0
 free_cma:2
DMA free:302336kB min:208000kB low:259968kB high:312000kB active_anon:3618432kB inactive_anon:768kB active_file:34432kB inactive_file:131584kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4177920kB managed:4166528kB mlocked:0kB dirty:192kB writeback:0kB mapped:4736kB shmem:1024kB slab_reclaimable:5184kB slab_unreclaimable:3328kB kernel_stack:0kB pagetables:1600kB unstable:0kB bounce:0kB free_cma:128kB writeback_tmp:0kB pages_scanned:1208448 all_unreclaimable? yes
lowmem_reserve[]: 0 764 764
Normal free:696128kB min:625472kB low:781824kB high:938176kB active_anon:11215488kB inactive_anon:4992kB active_file:101952kB inactive_file:442944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12582912kB managed:12528640kB mlocked:0kB dirty:320kB writeback:0kB mapped:5504kB shmem:7744kB slab_reclaimable:21888kB slab_unreclaimable:33344kB kernel_stack:1840kB pagetables:2176kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3331648 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
DMA: 42*64kB (MRC) 37*128kB (R) 6*256kB (R) 5*512kB (R) 2*1024kB (R) 3*2048kB (R) 1*4096kB (R) 0*8192kB 1*16384kB (R) 0*32768kB 0*65536kB 0*131072kB 1*262144kB (R) 0*524288kB = 302336kB
Normal: 280*64kB (MR) 40*128kB (R) 5*256kB (R) 4*512kB (R) 6*1024kB (R) 4*2048kB (R) 1*4096kB (R) 1*8192kB (R) 1*16384kB (R) 1*32768kB (R) 1*65536kB (R) 0*131072kB 0*262144kB 1*524288kB (R) = 691968kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
4492 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
261888 pages RAM
0 pages HighMem/MovableOnly
18446744073709544450 pages reserved
8192 pages cma reserved

I also ran ps aux, but I didn't see any stale tasks lying around, nor
did any remaining tasks seem to account for all that active anonymous
memory.

I'll see if I can reproduce on x86.

Thanks,
Mark.

* Re: Linux 3.19-rc3
  2015-01-08 16:37     ` Mark Langsdorf
@ 2015-01-09 15:56       ` Michal Hocko
  0 siblings, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2015-01-09 15:56 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel,
	linux-mm, Joonsoo Kim, Marek Szyprowski

On Thu 08-01-15 10:37:50, Mark Langsdorf wrote:
> On 01/08/2015 09:08 AM, Michal Hocko wrote:
> >[CCing linux-mm and CMA people]
> >[Full message here:
> >http://article.gmane.org/gmane.linux.ports.arm.kernel/383669]
> 
> >>[ 1054.095277] DMA: 109*64kB (UR) 53*128kB (R) 8*256kB (R) 0*512kB 0*1024kB
> >>0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 0*65536kB = 52672kB
> >>[ 1054.108621] Normal: 191*64kB (MR) 0*128kB 0*256kB 0*512kB 0*1024kB
> >>0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 12224kB
> >[...]
> >>[ 1054.142545] Free swap  = 6598400kB
> >>[ 1054.145928] Total swap = 8388544kB
> >>[ 1054.149317] 262112 pages RAM
> >>[ 1054.152180] 0 pages HighMem/MovableOnly
> >>[ 1054.155995] 18446744073709544361 pages reserved
> >>[ 1054.160505] 8192 pages cma reserved
> >
> >Besides underflow in the reserved pages accounting mentioned in other
> >email the free lists look strange as well. All free blocks with some memory
> >are marked as reserved. I would suspect something CMA related.
> 
> I get the same failure with CMA turned off entirely. I assume that means
> CMA is not the culprit.

OK. Do you see all the free page blocks completely reserved without CMA
as well?

-- 
Michal Hocko
SUSE Labs

* Re: Linux 3.19-rc3
  2015-01-09 14:19     ` Steve Capper
@ 2015-01-09 14:27       ` Mark Langsdorf
  2015-01-09 17:57         ` Mark Rutland
  0 siblings, 1 reply; 101+ messages in thread
From: Mark Langsdorf @ 2015-01-09 14:27 UTC (permalink / raw)
  To: Steve Capper, Mark Rutland
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel, vishnu.ps

On 01/09/2015 08:19 AM, Steve Capper wrote:
> On 9 January 2015 at 12:13, Mark Rutland <mark.rutland@arm.com> wrote:
>> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
>>> I'm consistently getting an out of memory killer triggered when
>>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
>>> with 16 GB of memory. This doesn't happen when running a 3.18
>>> kernel.
>>>
>>> I'm going to start bisecting the failure now, but here's the crash
>>> log in case someone can see something obvious in it.
>>
>> FWIW I've just reproduced this with v3.19-rc3 defconfig +
>> CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
>> system has 16GB of RAM and 6 CPUs.
>>
>> I have a similarly dodgy looking number of pages reserved
>> (18446744073709544451 A.K.A. -7165). Log below.
>>
>
> I think the negative page reserved count is a consequence of another bug.
>
> We have the following reporting code in lib/show_mem.c:
> #ifdef CONFIG_CMA
>          printk("%lu pages reserved\n", (reserved - totalcma_pages));
>          printk("%lu pages cma reserved\n", totalcma_pages);
> #else
>
> With totalcma_pages being reported as 8192, that would account for the
> -7000ish values reported.
>
> That change appears to have come from:
> 49abd8c lib/show_mem.c: add cma reserved information
>
> Is the quickest way to exacerbate this OOM a kernel compile?

I haven't really tried to characterize this. Compiling a kernel
on a 64K page machine causes a failure reasonably quickly and
doesn't require a lot of thought. I think that time spent finding
a faster reproducer wouldn't pay off.

Also, contrary to last night's report and in line with Linus'
assumption, the failure still occurs with 4K pages. It just
takes substantially longer to occur.

--Mark Langsdorf

* Re: Linux 3.19-rc3
  2015-01-09 12:13   ` Mark Rutland
@ 2015-01-09 14:19     ` Steve Capper
  2015-01-09 14:27       ` Mark Langsdorf
  0 siblings, 1 reply; 101+ messages in thread
From: Steve Capper @ 2015-01-09 14:19 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Mark Langsdorf, Linus Torvalds, Linux Kernel Mailing List,
	linux-arm-kernel, vishnu.ps

On 9 January 2015 at 12:13, Mark Rutland <mark.rutland@arm.com> wrote:
> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
>> On 01/05/2015 07:46 PM, Linus Torvalds wrote:
>> > It's a day delayed - not because of any particular development issues,
>> > but simply because I was tiling a bathroom yesterday. But rc3 is out
>> > there now, and things have stayed reasonably calm. I really hope that
>> > implies that 3.19 is looking good, but it's equally likely that it's
>> > just that people are still recovering from the holiday season.
>>
>> I'm consistently getting an out of memory killer triggered when
>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
>> with 16 GB of memory. This doesn't happen when running a 3.18
>> kernel.
>>
>> I'm going to start bisecting the failure now, but here's the crash
>> log in case someone can see something obvious in it.
>
> FWIW I've just reproduced this with v3.19-rc3 defconfig +
> CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
> system has 16GB of RAM and 6 CPUs.
>
> I have a similarly dodgy looking number of pages reserved
> (18446744073709544451 A.K.A. -7165). Log below.
>

I think the negative page reserved count is a consequence of another bug.

We have the following reporting code in lib/show_mem.c:
#ifdef CONFIG_CMA
        printk("%lu pages reserved\n", (reserved - totalcma_pages));
        printk("%lu pages cma reserved\n", totalcma_pages);
#else

With totalcma_pages being reported as 8192, that would account for the
-7000ish values reported.

That change appears to have come from:
49abd8c lib/show_mem.c: add cma reserved information

Is the quickest way to exacerbate this OOM a kernel compile?

Cheers,
--
Steve

* Re: Linux 3.19-rc3
  2015-01-08 12:51 ` Mark Langsdorf
  2015-01-08 13:45   ` Catalin Marinas
  2015-01-08 15:08   ` Michal Hocko
@ 2015-01-09 12:13   ` Mark Rutland
  2015-01-09 14:19     ` Steve Capper
  2 siblings, 1 reply; 101+ messages in thread
From: Mark Rutland @ 2015-01-09 12:13 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel

On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
> On 01/05/2015 07:46 PM, Linus Torvalds wrote:
> > It's a day delayed - not because of any particular development issues,
> > but simply because I was tiling a bathroom yesterday. But rc3 is out
> > there now, and things have stayed reasonably calm. I really hope that
> > implies that 3.19 is looking good, but it's equally likely that it's
> > just that people are still recovering from the holiday season.
> 
> I'm consistently getting an out of memory killer triggered when
> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
> with 16 GB of memory. This doesn't happen when running a 3.18
> kernel.
> 
> I'm going to start bisecting the failure now, but here's the crash
> log in case someone can see something obvious in it.

FWIW I've just reproduced this with v3.19-rc3 defconfig +
CONFIG_ARM64_64K_PAGES=y by attempting a git clone of mainline. My
system has 16GB of RAM and 6 CPUs.

I have a similarly dodgy looking number of pages reserved
(18446744073709544451 A.K.A. -7165). Log below.

Thanks,
Mark.

git invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
CPU: 2 PID: 9777 Comm: git Not tainted 3.19.0-rc3+ #37
Call trace:
[<fffffe0000096b4c>] dump_backtrace+0x0/0x124
[<fffffe0000096c80>] show_stack+0x10/0x1c
[<fffffe0000552f40>] dump_stack+0x80/0xc4
[<fffffe000013c148>] dump_header.isra.9+0x80/0x1c4
[<fffffe000013c868>] oom_kill_process+0x390/0x3f0
[<fffffe000013cdb0>] out_of_memory+0x2f0/0x324
[<fffffe0000141178>] __alloc_pages_nodemask+0x860/0x874
[<fffffe0000161ae4>] handle_mm_fault+0x7c0/0xe7c
[<fffffe000009f8d0>] do_page_fault+0x188/0x2f8
[<fffffe0000090230>] do_mem_abort+0x38/0x9c
Exception stack(0xfffffe03bf877e30 to 0xfffffe03bf877f50)
7e20:                                     00000000 00000000 84000078 000003ff
7e40: ffffffff ffffffff abc27210 000003ff 00000006 00000000 001971a4 fffffe00
7e60: bf877ec0 fffffe03 0019725c fffffe00 00000000 00000000 00000028 00000000
7e80: ffffffff ffffffff abc6c718 000003ff 00000000 00000000 00000015 00000000
7ea0: 0000011a 00000000 00024800 00000000 00000024 00000100 00000003 fffffe03
7ec0: 93f7d6c0 000003ff 000939b0 fffffe00 85879770 000003ff 858817b0 000003ff
7ee0: 0000e851 00000000 00008045 00000000 00008041 00000000 04b10000 00000000
7f00: 00008060 00000000 85890000 000003ff 00000038 00000000 6f6c72ff 62606f5e
7f20: 00000040 00000000 01010101 01010101 00000076 00000000 00040000 00000000
7f40: 09fecda5 00000000 ec5a90c4 00000000
Mem-Info:
DMA per-cpu:
CPU    0: hi:    6, btch:   1 usd:   4
CPU    1: hi:    6, btch:   1 usd:   5
CPU    2: hi:    6, btch:   1 usd:   4
CPU    3: hi:    6, btch:   1 usd:   5
CPU    4: hi:    6, btch:   1 usd:   5
CPU    5: hi:    6, btch:   1 usd:   2
Normal per-cpu:
CPU    0: hi:    6, btch:   1 usd:   1
CPU    1: hi:    6, btch:   1 usd:   5
CPU    2: hi:    6, btch:   1 usd:   0
CPU    3: hi:    6, btch:   1 usd:   5
CPU    4: hi:    6, btch:   1 usd:   4
CPU    5: hi:    6, btch:   1 usd:   5
active_anon:241994 inactive_anon:226 isolated_anon:0
 active_file:1063 inactive_file:1075 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:13439 slab_reclaimable:363 slab_unreclaimable:1137
 mapped:188 shmem:146 pagetables:94 bounce:0
 free_cma:2401
DMA free:243840kB min:208000kB low:259968kB high:312000kB active_anon:3800448kB inactive_anon:2624kB active_file:8576kB inactive_file:9344kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4177920kB managed:4166464kB mlocked:0kB dirty:0kB writeback:0kB mapped:1152kB shmem:2304kB slab_reclaimable:4480kB slab_unreclaimable:16576kB kernel_stack:128kB pagetables:2688kB unstable:0kB bounce:0kB free_cma:153664kB writeback_tmp:0kB pages_scanned:201984 all_unreclaimable? yes
lowmem_reserve[]: 0 764 764
Normal free:616256kB min:625472kB low:781824kB high:938176kB active_anon:11687168kB inactive_anon:11840kB active_file:59456kB inactive_file:59456kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12582912kB managed:12528640kB mlocked:0kB dirty:0kB writeback:0kB mapped:10880kB shmem:7040kB slab_reclaimable:18752kB slab_unreclaimable:56192kB kernel_stack:2032kB pagetables:3328kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2989056 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
DMA: 1563*64kB (URC) 527*128kB (UC) 0*256kB 1*512kB (R) 0*1024kB 1*2048kB (R) 0*4096kB 1*8192kB (R) 0*16384kB 0*32768kB 1*65536kB (R) 0*131072kB 0*262144kB 0*524288kB = 243776kB
Normal: 5*64kB (MR) 4*128kB (MR) 4*256kB (R) 2*512kB (MR) 3*1024kB (MR) 2*2048kB (MR) 4*4096kB (MR) 2*8192kB (MR) 3*16384kB (MR) 4*32768kB (MR) 0*65536kB 1*131072kB (R) 1*262144kB (R) 0*524288kB = 616256kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
1266 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
261888 pages RAM
0 pages HighMem/MovableOnly
18446744073709544451 pages reserved
8192 pages cma reserved
[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  930]     0   930       57       13       2        0             0 upstart-udev-br
[  935]     0   935      195       60       4        0         -1000 systemd-udevd
[ 1252]     0  1252       51       11       2        0             0 upstart-file-br
[ 1254]   101  1254     3538       56       3        0             0 rsyslogd
[ 1278]   104  1278       88       49       4        0             0 dbus-daemon
[ 1338]     0  1338       96       52       4        0             0 systemd-logind
[ 1376]     0  1376       50       11       2        0             0 upstart-socket-
[ 1416]     0  1416     3682      144       3        0             0 ModemManager
[ 1527]     0  1527       74       34       4        0             0 getty
[ 1528]     0  1528     3872      145       5        0             0 NetworkManager
[ 1529]     0  1529       74       35       4        0             0 getty
[ 1534]     0  1534       74       34       4        0             0 getty
[ 1535]     0  1535       74       34       3        0             0 getty
[ 1537]     0  1537       74       34       3        0             0 getty
[ 1552]     0  1552      151       92       3        0         -1000 sshd
[ 1561]     0  1561       63       38       3        0             0 cron
[ 1565]     0  1565     3602       89       4        0             0 polkitd
[ 1604]     0  1604      126       72       4        0             0 login
[ 1606]     0  1606       74       34       3        0             0 getty
[ 1621]     0  1621       99       68       3        0             0 dhclient
[ 1626] 65534  1626       67       48       4        0             0 dnsmasq
[ 1716]  1000  1716       72       45       4        0             0 bash
[ 1730]     0  1730      238      120       2        0             0 sshd
[ 1743]  1000  1743      238       94       2        0             0 sshd
[ 1744]  1000  1744       96       50       4        0             0 bash
[ 9767]  1000  9767      240       58       3        0             0 git
[ 9773]  1000  9773    11894     9252       6        0             0 git
Out of memory: Kill process 9773 (git) score 35 or sacrifice child
Killed process 9773 (git) total-vm:761216kB, anon-rss:589376kB, file-rss:2752kB


* Re: Linux 3.19-rc3
  2015-01-08 18:48         ` Mark Langsdorf
@ 2015-01-08 19:21           ` Linus Torvalds
  2015-01-09 23:27             ` Catalin Marinas
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2015-01-08 19:21 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Catalin Marinas, Linux Kernel Mailing List, linux-arm-kernel

On Thu, Jan 8, 2015 at 10:48 AM, Mark Langsdorf <mlangsdo@redhat.com> wrote:
>
> With 4K pages, the oom killer doesn't trigger during `make -j 16 -s`
> on a fresh kernel. Thanks for the suggestion. I'm not sure what to do
> about that, though.

I suspect that bisecting remains the best option.

A 64kB allocation will make various memory pressure issues *much*
worse, and quite frankly, I consider 64kB pages completely
unacceptable for any real life situation for that and other reasons,
but if it used to work, we do want to figure out what actually broke.

The only excuse for 64kB pages is "my hardware TLB is complete crap,
and I have very specialized server-only loads".

People who think 64kB pages are a good idea are wrong. It's that
simple. You seem to be trying to compile stuff, which is not a good
load for 64kB pages (notably the page cache becomes 90% unused due to
wasted space at the end of pages).

But while 64kB pages are a completely braindead idea, the actual *bug*
is likely not the 64kB page thing in itself, but some other
issue that gets triggered by it. Maybe we have some piece of code that
"knows" that a page is 4kB and mis-accounts things due to that, or
similar.

                       Linus

* Re: Linux 3.19-rc3
  2015-01-08 17:34       ` Catalin Marinas
@ 2015-01-08 18:48         ` Mark Langsdorf
  2015-01-08 19:21           ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Mark Langsdorf @ 2015-01-08 18:48 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel

On 01/08/2015 11:34 AM, Catalin Marinas wrote:
> On Thu, Jan 08, 2015 at 05:29:40PM +0000, Mark Langsdorf wrote:
>> On 01/08/2015 07:45 AM, Catalin Marinas wrote:
>>> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
>>>> On 01/05/2015 07:46 PM, Linus Torvalds wrote:
>>>>> It's a day delayed - not because of any particular development issues,
>>>>> but simply because I was tiling a bathroom yesterday. But rc3 is out
>>>>> there now, and things have stayed reasonably calm. I really hope that
>>>>> implies that 3.19 is looking good, but it's equally likely that it's
>>>>> just that people are still recovering from the holiday season.
>>>>
>>>> I'm consistently getting an out of memory killer triggered when
>>>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
>>>> with 16 GB of memory. This doesn't happen when running a 3.18
>>>> kernel.
>>>>
>>>> I'm going to start bisecting the failure now, but here's the crash
>>>> log in case someone can see something obvious in it.
>>
>>> Can you disable (transparent) huge pages? I don't have any better
>>> suggestion at the moment apart from bisecting.
>>
>> I didn't have transparent huge pages on. Turning off hugetlbfs didn't
>> change anything. Turning off 64K pages isn't an option because of
>> firmware constraints.
>
> What constraints are these? I thought they could only happen the other
> way around (4K to 64K).

I was confused. I can turn off 64K pages with only minor loss of
functionality (network MAC address gets corrupted; I can work around
for testing).

With 4K pages, the oom killer doesn't trigger during `make -j 16 -s`
on a fresh kernel. Thanks for the suggestion. I'm not sure what to do
about that, though.

--Mark Langsdorf



* Re: Linux 3.19-rc3
  2015-01-08 17:29     ` Mark Langsdorf
@ 2015-01-08 17:34       ` Catalin Marinas
  2015-01-08 18:48         ` Mark Langsdorf
  0 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2015-01-08 17:34 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel

On Thu, Jan 08, 2015 at 05:29:40PM +0000, Mark Langsdorf wrote:
> On 01/08/2015 07:45 AM, Catalin Marinas wrote:
> > On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
> >> On 01/05/2015 07:46 PM, Linus Torvalds wrote:
> >>> It's a day delayed - not because of any particular development issues,
> >>> but simply because I was tiling a bathroom yesterday. But rc3 is out
> >>> there now, and things have stayed reasonably calm. I really hope that
> >>> implies that 3.19 is looking good, but it's equally likely that it's
> >>> just that people are still recovering from the holiday season.
> >>
> >> I'm consistently getting an out of memory killer triggered when
> >> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
> >> with 16 GB of memory. This doesn't happen when running a 3.18
> >> kernel.
> >>
> >> I'm going to start bisecting the failure now, but here's the crash
> >> log in case someone can see something obvious in it.
> 
> > Can you disable (transparent) huge pages? I don't have any better
> > suggestion at the moment apart from bisecting.
> 
> I didn't have transparent huge pages on. Turning off hugetlbfs didn't
> change anything. Turning off 64K pages isn't an option because of
> firmware constraints.

What constraints are these? I thought they could only happen the other
way around (4K to 64K).

-- 
Catalin

* Re: Linux 3.19-rc3
  2015-01-08 13:45   ` Catalin Marinas
@ 2015-01-08 17:29     ` Mark Langsdorf
  2015-01-08 17:34       ` Catalin Marinas
  0 siblings, 1 reply; 101+ messages in thread
From: Mark Langsdorf @ 2015-01-08 17:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel

On 01/08/2015 07:45 AM, Catalin Marinas wrote:
> On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
>> On 01/05/2015 07:46 PM, Linus Torvalds wrote:
>>> It's a day delayed - not because of any particular development issues,
>>> but simply because I was tiling a bathroom yesterday. But rc3 is out
>>> there now, and things have stayed reasonably calm. I really hope that
>>> implies that 3.19 is looking good, but it's equally likely that it's
>>> just that people are still recovering from the holiday season.
>>
>> I'm consistently getting an out of memory killer triggered when
>> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
>> with 16 GB of memory. This doesn't happen when running a 3.18
>> kernel.
>>
>> I'm going to start bisecting the failure now, but here's the crash
>> log in case someone can see something obvious in it.

> Can you disable (transparent) huge pages? I don't have any better
> suggestion at the moment apart from bisecting.

I didn't have transparent huge pages on. Turning off hugetlbfs didn't
change anything. Turning off 64K pages isn't an option because of
firmware constraints.

--Mark Langsdorf


* Re: Linux 3.19-rc3
  2015-01-08 15:08   ` Michal Hocko
@ 2015-01-08 16:37     ` Mark Langsdorf
  2015-01-09 15:56       ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Mark Langsdorf @ 2015-01-08 16:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel,
	linux-mm, Joonsoo Kim, Marek Szyprowski

On 01/08/2015 09:08 AM, Michal Hocko wrote:
> [CCing linux-mm and CMA people]
> [Full message here:
> http://article.gmane.org/gmane.linux.ports.arm.kernel/383669]

>> [ 1054.095277] DMA: 109*64kB (UR) 53*128kB (R) 8*256kB (R) 0*512kB 0*1024kB
>> 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 0*65536kB = 52672kB
>> [ 1054.108621] Normal: 191*64kB (MR) 0*128kB 0*256kB 0*512kB 0*1024kB
>> 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 12224kB
> [...]
>> [ 1054.142545] Free swap  = 6598400kB
>> [ 1054.145928] Total swap = 8388544kB
>> [ 1054.149317] 262112 pages RAM
>> [ 1054.152180] 0 pages HighMem/MovableOnly
>> [ 1054.155995] 18446744073709544361 pages reserved
>> [ 1054.160505] 8192 pages cma reserved
>
> Besides underflow in the reserved pages accounting mentioned in other
> email the free lists look strange as well. All free blocks with some memory
> are marked as reserved. I would suspect something CMA related.

I get the same failure with CMA turned off entirely. I assume that means
CMA is not the culprit.

--Mark Langsdorf


* Re: Linux 3.19-rc3
  2015-01-08 12:51 ` Mark Langsdorf
  2015-01-08 13:45   ` Catalin Marinas
@ 2015-01-08 15:08   ` Michal Hocko
  2015-01-08 16:37     ` Mark Langsdorf
  2015-01-09 12:13   ` Mark Rutland
  2 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2015-01-08 15:08 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel,
	linux-mm, Joonsoo Kim, Marek Szyprowski

[CCing linux-mm and CMA people]
[Full message here:
http://article.gmane.org/gmane.linux.ports.arm.kernel/383669]

On Thu 08-01-15 06:51:31, Mark Langsdorf wrote:
[...]
> [ 1053.968815] active_anon:207417 inactive_anon:25722 isolated_anon:0
> [ 1053.968815]  active_file:1300 inactive_file:21234 isolated_file:0
> [ 1053.968815]  unevictable:0 dirty:0 writeback:0 unstable:0
> [ 1053.968815]  free:1014 slab_reclaimable:1047 slab_unreclaimable:1758
> [ 1053.968815]  mapped:733 shmem:58 pagetables:267 bounce:0
> [ 1053.968815]  free_cma:1

Still a lot of pages (~80M) on the file LRU list which should be reclaimable
because they are not dirty apparently.
Anon pages can be reclaimed as well because the swap is basically
unused.

[...]
> [ 1054.095277] DMA: 109*64kB (UR) 53*128kB (R) 8*256kB (R) 0*512kB 0*1024kB
> 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 0*65536kB = 52672kB
> [ 1054.108621] Normal: 191*64kB (MR) 0*128kB 0*256kB 0*512kB 0*1024kB
> 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 12224kB
[...]
> [ 1054.142545] Free swap  = 6598400kB
> [ 1054.145928] Total swap = 8388544kB
> [ 1054.149317] 262112 pages RAM
> [ 1054.152180] 0 pages HighMem/MovableOnly
> [ 1054.155995] 18446744073709544361 pages reserved
> [ 1054.160505] 8192 pages cma reserved

Besides underflow in the reserved pages accounting mentioned in other
email the free lists look strange as well. All free blocks with some memory
are marked as reserved. I would suspect something CMA related.

-- 
Michal Hocko
SUSE Labs

* Re: Linux 3.19-rc3
  2015-01-08 12:51 ` Mark Langsdorf
@ 2015-01-08 13:45   ` Catalin Marinas
  2015-01-08 17:29     ` Mark Langsdorf
  2015-01-08 15:08   ` Michal Hocko
  2015-01-09 12:13   ` Mark Rutland
  2 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2015-01-08 13:45 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel

On Thu, Jan 08, 2015 at 12:51:31PM +0000, Mark Langsdorf wrote:
> On 01/05/2015 07:46 PM, Linus Torvalds wrote:
> > It's a day delayed - not because of any particular development issues,
> > but simply because I was tiling a bathroom yesterday. But rc3 is out
> > there now, and things have stayed reasonably calm. I really hope that
> > implies that 3.19 is looking good, but it's equally likely that it's
> > just that people are still recovering from the holiday season.
> 
> I'm consistently getting an out of memory killer triggered when
> compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
> with 16 GB of memory. This doesn't happen when running a 3.18
> kernel.
> 
> I'm going to start bisecting the failure now, but here's the crash
> log in case someone can see something obvious in it.
[...]
> [ 1053.968815] active_anon:207417 inactive_anon:25722 isolated_anon:0
> [ 1053.968815]  active_file:1300 inactive_file:21234 isolated_file:0
> [ 1053.968815]  unevictable:0 dirty:0 writeback:0 unstable:0
> [ 1053.968815]  free:1014 slab_reclaimable:1047 slab_unreclaimable:1758
> [ 1053.968815]  mapped:733 shmem:58 pagetables:267 bounce:0
> [ 1053.968815]  free_cma:1
> [ 1054.000398] DMA free:52928kB min:4032kB low:4992kB high:6016kB
> active_anon:3025728kB inactive_anon:612608kB active_file:20992kB
> inactive_file:323072kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:15168kB shmem:1792kB
> slab_reclaimable:26368kB slab_unreclaimable:36160kB kernel_stack:464kB
> pagetables:6976kB unstable:0kB bounce:0kB free_cma:64kB
> writeback_tmp:0kB pages_scanned:26151872 all_unreclaimable? yes
> [ 1054.043628] lowmem_reserve[]: 0 765 765
> [ 1054.047498] Normal free:12032kB min:12224kB low:15232kB high:18304kB
> active_anon:10248960kB inactive_anon:1033728kB active_file:62208kB
> inactive_file:1036032kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:31744kB shmem:1920kB
> slab_reclaimable:40640kB slab_unreclaimable:76352kB kernel_stack:2992kB
> pagetables:10112kB unstable:0kB bounce:0kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:82150272 all_unreclaimable? yes
> [ 1054.091760] lowmem_reserve[]: 0 0 0
> [ 1054.095277] DMA: 109*64kB (UR) 53*128kB (R) 8*256kB (R) 0*512kB
> 0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R)
> 0*65536kB = 52672kB
> [ 1054.108621] Normal: 191*64kB (MR) 0*128kB 0*256kB 0*512kB 0*1024kB
> 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 12224kB
> [ 1054.120708] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=524288kB
> [ 1054.129280] 43732 total pagecache pages
> [ 1054.133096] 25736 pages in swap cache
> [ 1054.136739] Swap cache stats: add 27971, delete 2235, find 0/1
> [ 1054.142545] Free swap  = 6598400kB
> [ 1054.145928] Total swap = 8388544kB
> [ 1054.149317] 262112 pages RAM
> [ 1054.152180] 0 pages HighMem/MovableOnly
> [ 1054.155995] 18446744073709544361 pages reserved

This looks weird (pages reserved = -7255).

Can you disable (transparent) huge pages? I don't have any better
suggestion at the moment apart from bisecting.

-- 
Catalin

* Re: Linux 3.19-rc3
  2015-01-06  1:46 Linus Torvalds
  2015-01-06  2:46 ` Dave Jones
@ 2015-01-08 12:51 ` Mark Langsdorf
  2015-01-08 13:45   ` Catalin Marinas
                     ` (2 more replies)
  1 sibling, 3 replies; 101+ messages in thread
From: Mark Langsdorf @ 2015-01-08 12:51 UTC (permalink / raw)
  To: Linus Torvalds, Linux Kernel Mailing List, linux-arm-kernel

On 01/05/2015 07:46 PM, Linus Torvalds wrote:
> It's a day delayed - not because of any particular development issues,
> but simply because I was tiling a bathroom yesterday. But rc3 is out
> there now, and things have stayed reasonably calm. I really hope that
> implies that 3.19 is looking good, but it's equally likely that it's
> just that people are still recovering from the holiday season.

I'm consistently getting an out of memory killer triggered when
compiling the kernel (make -j 16 -s) on a 16 core ARM64 system
with 16 GB of memory. This doesn't happen when running a 3.18
kernel.

I'm going to start bisecting the failure now, but here's the crash
log in case someone can see something obvious in it.

--Mark Langsdorf

[  137.440443] random: nonblocking pool is initialized
[ 1053.720094] cc1 invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[ 1053.727292] cc1 cpuset=/ mems_allowed=0
[ 1053.731169] CPU: 5 PID: 32180 Comm: cc1 Not tainted 3.19.0-rc3+ #2
[ 1053.737321] Hardware name: APM X-Gene Mustang board (DT)
[ 1053.742627] Call trace:
[ 1053.745073] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1053.750465] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1053.755498] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1053.760543] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1053.766439] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1053.772182] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1053.777647] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1053.783898] [<fffffe00001a443c>] handle_mm_fault+0x938/0xd1c
[ 1053.789544] [<fffffe00000a01c8>] do_page_fault+0x214/0x338
[ 1053.795004] [<fffffe000009022c>] do_mem_abort+0x38/0x9c
[ 1053.800211] Exception stack(0xfffffe0010b1fe30 to 0xfffffe0010b1ff50)
[ 1053.806622] fe20:                                     00000200 00000000 00f886a8 00000000
[ 1053.814771] fe40: ffffffff ffffffff aa924620 000003ff 00b11000 fffffe00 000f7c84 fffffe00
[ 1053.822915] fe60: 10b1fea0 fffffe00 000903b4 fffffe00 10b1fed0 fffffe00 0002000c fffffc00
[ 1053.831063] fe80: 00b686b8 fffffe00 00020010 fffffc00 10b1feb0 fffffe00 000964e8 fffffe00
[ 1053.839211] fea0: 00000208 00000000 00092f90 fffffe00 cac18530 000003ff 0009308c fffffe00
[ 1053.847348] fec0: 00000200 00000000 00093094 fffffe00 a28e0000 000003ff 00000000 00000000
[ 1053.855489] fee0: 00000248 00000000 a28e0000 000003ff 00000000 00000000 a28e0000 000003ff
[ 1053.863630] ff00: 00000009 00000000 00000009 00000000 00fe8000 00000000 00000480 00000000
[ 1053.871779] ff20: 008213c8 00000000 000000a2 00000000 00000006 00000000 00000018 00000000
[ 1053.879921] ff40: 00000008 00000000 00000000 00000000
[ 1053.884946] Mem-Info:
[ 1053.887206] DMA per-cpu:
[ 1053.889731] CPU    0: hi:    6, btch:   1 usd:   0
[ 1053.894496] CPU    1: hi:    6, btch:   1 usd:   0
[ 1053.899268] CPU    2: hi:    6, btch:   1 usd:   0
[ 1053.904034] CPU    3: hi:    6, btch:   1 usd:   0
[ 1053.908809] CPU    4: hi:    6, btch:   1 usd:   0
[ 1053.913576] CPU    5: hi:    6, btch:   1 usd:   0
[ 1053.918350] CPU    6: hi:    6, btch:   1 usd:   0
[ 1053.923115] CPU    7: hi:    6, btch:   1 usd:   0
[ 1053.927880] Normal per-cpu:
[ 1053.930663] CPU    0: hi:    6, btch:   1 usd:   0
[ 1053.935428] CPU    1: hi:    6, btch:   1 usd:   0
[ 1053.940198] CPU    2: hi:    6, btch:   1 usd:   0
[ 1053.944963] CPU    3: hi:    6, btch:   1 usd:   0
[ 1053.949732] CPU    4: hi:    6, btch:   1 usd:   0
[ 1053.954497] CPU    5: hi:    6, btch:   1 usd:   0
[ 1053.959271] CPU    6: hi:    6, btch:   1 usd:   0
[ 1053.964038] CPU    7: hi:    6, btch:   1 usd:   0
[ 1053.968815] active_anon:207417 inactive_anon:25722 isolated_anon:0
[ 1053.968815]  active_file:1300 inactive_file:21234 isolated_file:0
[ 1053.968815]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1053.968815]  free:1014 slab_reclaimable:1047 slab_unreclaimable:1758
[ 1053.968815]  mapped:733 shmem:58 pagetables:267 bounce:0
[ 1053.968815]  free_cma:1
[ 1054.000398] DMA free:52928kB min:4032kB low:4992kB high:6016kB active_anon:3025728kB inactive_anon:612608kB active_file:20992kB inactive_file:323072kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB dirty:0kB writeback:0kB mapped:15168kB shmem:1792kB slab_reclaimable:26368kB slab_unreclaimable:36160kB kernel_stack:464kB pagetables:6976kB unstable:0kB bounce:0kB free_cma:64kB writeback_tmp:0kB pages_scanned:26151872 all_unreclaimable? yes
[ 1054.043628] lowmem_reserve[]: 0 765 765
[ 1054.047498] Normal free:12032kB min:12224kB low:15232kB high:18304kB active_anon:10248960kB inactive_anon:1033728kB active_file:62208kB inactive_file:1036032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB dirty:0kB writeback:0kB mapped:31744kB shmem:1920kB slab_reclaimable:40640kB slab_unreclaimable:76352kB kernel_stack:2992kB pagetables:10112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:82150272 all_unreclaimable? yes
[ 1054.091760] lowmem_reserve[]: 0 0 0
[ 1054.095277] DMA: 109*64kB (UR) 53*128kB (R) 8*256kB (R) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 0*65536kB = 52672kB
[ 1054.108621] Normal: 191*64kB (MR) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 12224kB
[ 1054.120708] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
[ 1054.129280] 43732 total pagecache pages
[ 1054.133096] 25736 pages in swap cache
[ 1054.136739] Swap cache stats: add 27971, delete 2235, find 0/1
[ 1054.142545] Free swap  = 6598400kB
[ 1054.145928] Total swap = 8388544kB
[ 1054.149317] 262112 pages RAM
[ 1054.152180] 0 pages HighMem/MovableOnly
[ 1054.155995] 18446744073709544361 pages reserved
[ 1054.160505] 8192 pages cma reserved
[ 1054.163974] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 1054.171784] [  481]     0   481      233       89       3       39            0 systemd-journal
[ 1054.180534] [  487]     0   487     1420       32       3       80            0 lvmetad
[ 1054.188591] [  512]     0   512      216       40       2       46        -1000 systemd-udevd
[ 1054.197162] [  598]     0   598      258       45       3       38        -1000 auditd
[ 1054.205133] [  620]     0   620     6825      213       3      114            0 NetworkManager
[ 1054.213794] [  623]     0   623     2028       83       3      100            0 abrtd
[ 1054.221678] [  624]     0   624     2019       78       3       92            0 abrt-watch-log
[ 1054.230338] [  628]    70   628       87       38       4       34            0 avahi-daemon
[ 1054.238826] [  629]   998   629       40       13       2        0            0 lsmd
[ 1054.246617] [  631]    70   631       87        0       4       32            0 avahi-daemon
[ 1054.255106] [  641]     0   641     2687       48       3       77            0 rsyslogd
[ 1054.263248] [  642]     0   642       86       42       3        9            0 irqbalance
[ 1054.271563] [  643]     0   643     6821      196       5      241            0 tuned
[ 1054.279447] [  647]     0   647      109       44       2       31            0 smartd
[ 1054.287411] [  651]   997   651       83       45       2       18            0 chronyd
[ 1054.295466] [  652]     0   652       99       38       3       36            0 systemd-logind
[ 1054.304129] [  653]    81   653      222       35       2       37         -900 dbus-daemon
[ 1054.312531] [  667]     0   667       79       27       3       29            0 atd
[ 1054.320242] [  675]     0   675     1846       68       4       86            0 login
[ 1054.328125] [  676]     0   676     1717       23       5       10            0 agetty
[ 1054.336089] [  710]     0   710     1761       47       3       36            0 bash
[ 1054.343887] [  727]     0   727      240       84       2       74        -1000 sshd
[ 1054.351684] [  733]     0   733     1739       34       3        9            0 rhsmcertd
[ 1054.359913] [  741]   999   741     7276      118       4      122            0 polkitd
[ 1054.367964] [  747]     0   747      502      102       3      281            0 dhclient
[ 1054.376106] [ 1323]     0  1323      334       69       3       69            0 master
[ 1054.384075] [ 1325]    89  1325      337       82       2       78            0 qmgr
[ 1054.391872] [ 1328]     0  1328      358      130       3      111            0 sshd
[ 1054.399669] [ 1331]     0  1331     1762       48       5       39            0 bash
[ 1054.407460] [ 1373]    89  1373      336       89       2       69            0 pickup
[ 1054.415430] [ 2944]     0  2944     1734       50       4        3            0 make
[ 1054.423228] [31425]     0 31425     1760       79       3        0            0 make
[ 1054.431028] [32092]     0 32092     1736       51       4        0            0 sh
[ 1054.438652] [32103]     0 32103     1736       51       4        0            0 sh
[ 1054.446271] [32108]     0 32108     1737       42       4        0            0 gcc
[ 1054.453981] [32110]     0 32110     1736       52       4        0            0 sh
[ 1054.461605] [32111]     0 32111     2479      373       4        0            0 cc1
[ 1054.469315] [32118]     0 32118     1736       52       4        0            0 sh
[ 1054.476934] [32120]     0 32120     1737       42       4        0            0 gcc
[ 1054.484644] [32123]     0 32123     1736       51       4        0            0 sh
[ 1054.492269] [32125]     0 32125     2479      373       4        0            0 cc1
[ 1054.499980] [32127]     0 32127     1737       43       4        0            0 gcc
[ 1054.507684] [32132]     0 32132     1736       52       4        0            0 sh
[ 1054.515308] [32134]     0 32134     1737       42       5        0            0 gcc
[ 1054.523019] [32135]     0 32135     2472      304       4        0            0 cc1
[ 1054.530730] [32136]     0 32136     2474      314       4        0 
           0 cc1
[ 1054.538440] [32140]     0 32140     1736       52       3        0 
           0 sh
[ 1054.546058] [32141]     0 32141     1737       42       4        0 
           0 gcc
[ 1054.553768] [32142]     0 32142     1736       51       3        0 
           0 sh
[ 1054.561394] [32146]     0 32146     1737       42       4        0 
           0 gcc
[ 1054.569105] [32150]     0 32150     1736       52       4        0 
           0 sh
[ 1054.576723] [32151]     0 32151     2474      314       3        0 
           0 cc1
[ 1054.584433] [32154]     0 32154     1737       42       4        0 
           0 gcc
[ 1054.592143] [32157]     0 32157     1736       51       3        0 
           0 sh
[ 1054.599767] [32158]     0 32158     2479      373       4        0 
           0 cc1
[ 1054.607472] [32164]     0 32164     1737       43       5        0 
           0 gcc
[ 1054.615183] [32165]     0 32165     1736       52       3        0 
           0 sh
[ 1054.622808] [32166]     0 32166     1737       42       4        0 
           0 gcc
[ 1054.630524] [32170]     0 32170     2466      232       3        0 
           0 cc1
[ 1054.638236] [32173]     0 32173     2470      232       4        0 
           0 cc1
[ 1054.645941] [32176]     0 32176     1736       52       4        0 
           0 sh
[ 1054.653566] [32178]     0 32178     1737       42       4        0 
           0 gcc
[ 1054.661276] [32180]     0 32180     2474      313       3        0 
           0 cc1
[ 1054.668986] [32182]     0 32182     1736       51       3        0 
           0 sh
[ 1054.676604] [32184]     0 32184     1737       42       3        0 
           0 gcc
[ 1054.684314] [32185]     0 32185     1736       52       5        0 
           0 sh
[ 1054.691940] [32186]     0 32186     1737       42       4        0 
           0 gcc
[ 1054.699650] [32188]     0 32188     2464      199       4        0 
           0 cc1
[ 1054.707354] [32190]     0 32190     1736       51       4        0 
           0 sh
[ 1054.714978] [32194]     0 32194     2466      232       4        0 
           0 cc1
[ 1054.722688] [32195]     0 32195     1737       43       4        0 
           0 gcc
[ 1054.730397] [32197]     0 32197     1736       13       4        0 
           0 sh
[ 1054.738015] [32199]     0 32199     2464      199       4        0 
           0 cc1
[ 1054.745725] [32200]     0 32200       70       34       3        0 
           0 mv
[ 1054.753349] [32201]     0 32201     1760       79       3        0 
           0 make
[ 1054.761146] [32202]     0 32202     1909      157       4        0 
           0 as
[ 1054.768770] Out of memory: Kill process 643 (tuned) score 1 or 
sacrifice child
[ 1054.775957] Killed process 643 (tuned) total-vm:436544kB, 
anon-rss:2496kB, file-rss:10048kB
[ 1054.828700] cc1 invoked oom-killer: gfp_mask=0x200da, order=0, 
oom_score_adj=0
[ 1054.835892] cc1 cpuset=/ mems_allowed=0
[ 1054.839764] CPU: 0 PID: 32125 Comm: cc1 Not tainted 3.19.0-rc3+ #2
[ 1054.845913] Hardware name: APM X-Gene Mustang board (DT)
[ 1054.851209] Call trace:
[ 1054.853651] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1054.859033] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1054.864060] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1054.869094] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1054.874986] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1054.880710] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1054.886169] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1054.892413] [<fffffe00001a443c>] handle_mm_fault+0x938/0xd1c
[ 1054.898045] [<fffffe00000a01c8>] do_page_fault+0x214/0x338
[ 1054.903509] [<fffffe000009022c>] do_mem_abort+0x38/0x9c
[ 1054.908713] Exception stack(0xfffffe011be67e30 to 0xfffffe011be67f50)
[ 1054.915122] 7e20:                                     00000200 
00000000 9d060000 000003ff
[ 1054.923266] 7e40: ffffffff ffffffff a4e64f54 000003ff 00b11000 
fffffe00 000f7c84 fffffe00
[ 1054.931408] 7e60: 1be67ea0 fffffe01 000903b4 fffffe00 1be67ed0 
fffffe01 0002000c fffffc00
[ 1054.939553] 7e80: 00b686b8 fffffe00 00096520 fffffe00 1be67eb0 
fffffe01 000964e8 fffffe00
[ 1054.947689] 7ea0: 00000208 00000000 00092f90 fffffe00 cc09e9e0 
000003ff 0009308c fffffe00
[ 1054.955832] 7ec0: 00000200 00000000 007a7d08 00000000 9d05ffe0 
000003ff 9cf68c58 000003ff
[ 1054.963977] 7ee0: 00000018 00000000 00000003 00000000 00000004 
00000004 a4e64fac 000003ff
[ 1054.972119] 7f00: a4e64fac 000003ff 00000000 00000000 00fe8000 
00000000 000005a0 00000000
[ 1054.980262] 7f20: 00a46f60 00000000 0000009d 00000000 00000006 
00000000 00000018 00000000
[ 1054.988404] 7f40: 00000008 00000000 00000000 00000000
[ 1054.993428] Mem-Info:
[ 1054.995687] DMA per-cpu:
[ 1054.998213] CPU    0: hi:    6, btch:   1 usd:   0
[ 1055.002978] CPU    1: hi:    6, btch:   1 usd:   0
[ 1055.007743] CPU    2: hi:    6, btch:   1 usd:   0
[ 1055.012514] CPU    3: hi:    6, btch:   1 usd:   0
[ 1055.017279] CPU    4: hi:    6, btch:   1 usd:   0
[ 1055.022050] CPU    5: hi:    6, btch:   1 usd:   0
[ 1055.026815] CPU    6: hi:    6, btch:   1 usd:   0
[ 1055.031586] CPU    7: hi:    6, btch:   1 usd:   0
[ 1055.036350] Normal per-cpu:
[ 1055.039133] CPU    0: hi:    6, btch:   1 usd:   0
[ 1055.043898] CPU    1: hi:    6, btch:   1 usd:   0
[ 1055.048668] CPU    2: hi:    6, btch:   1 usd:   0
[ 1055.053432] CPU    3: hi:    6, btch:   1 usd:   0
[ 1055.058203] CPU    4: hi:    6, btch:   1 usd:   0
[ 1055.062969] CPU    5: hi:    6, btch:   1 usd:   0
[ 1055.067733] CPU    6: hi:    6, btch:   1 usd:   0
[ 1055.072504] CPU    7: hi:    6, btch:   1 usd:   0
[ 1055.077274] active_anon:207422 inactive_anon:25737 isolated_anon:0
[ 1055.077274]  active_file:1301 inactive_file:21246 isolated_file:0
[ 1055.077274]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1055.077274]  free:1013 slab_reclaimable:1047 slab_unreclaimable:1758
[ 1055.077274]  mapped:734 shmem:58 pagetables:267 bounce:0
[ 1055.077274]  free_cma:0
[ 1055.108845] DMA free:52672kB min:4032kB low:4992kB high:6016kB 
active_anon:3026048kB inactive_anon:612608kB active_file:20992kB 
inactive_file:323136kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:15040kB shmem:1792kB 
slab_reclaimable:26368kB slab_unreclaimable:36160kB kernel_stack:464kB 
pagetables:6976kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:26608000 all_unreclaimable? yes
[ 1055.151981] lowmem_reserve[]: 0 765 765
[ 1055.155847] Normal free:11904kB min:12224kB low:15232kB high:18304kB 
active_anon:10248960kB inactive_anon:1034816kB active_file:62272kB 
inactive_file:1036608kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:31936kB shmem:1920kB 
slab_reclaimable:40640kB slab_unreclaimable:76352kB kernel_stack:2992kB 
pagetables:10112kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:82908864 all_unreclaimable? yes
[ 1055.200104] lowmem_reserve[]: 0 0 0
[ 1055.203616] DMA: 109*64kB (UR) 53*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 52672kB
[ 1055.216957] Normal: 182*64kB (MR) 0*128kB 0*256kB 0*512kB 0*1024kB 
0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 11648kB
[ 1055.229048] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1055.237614] 43737 total pagecache pages
[ 1055.241438] 25743 pages in swap cache
[ 1055.245081] Swap cache stats: add 27980, delete 2237, find 2297/2303
[ 1055.251407] Free swap  = 6598528kB
[ 1055.254789] Total swap = 8388544kB
[ 1055.258176] 262112 pages RAM
[ 1055.261040] 0 pages HighMem/MovableOnly
[ 1055.264855] 18446744073709544361 pages reserved
[ 1055.269366] 8192 pages cma reserved
[ 1055.272835] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents 
oom_score_adj name
[ 1055.280640] [  481]     0   481      233       89       3       39 
           0 systemd-journal
[ 1055.289389] [  487]     0   487     1420       32       3       80 
           0 lvmetad
[ 1055.297440] [  512]     0   512      216       40       2       46 
       -1000 systemd-udevd
[ 1055.306017] [  598]     0   598      258       45       3       38 
       -1000 auditd
[ 1055.313989] [  620]     0   620     6825      213       3      114 
           0 NetworkManager
[ 1055.322650] [  623]     0   623     2028       83       3      100 
           0 abrtd
[ 1055.330534] [  624]     0   624     2019       78       3       92 
           0 abrt-watch-log
[ 1055.339195] [  628]    70   628       87       38       4       34 
           0 avahi-daemon
[ 1055.347677] [  629]   998   629       40       13       2        0 
           0 lsmd
[ 1055.355474] [  631]    70   631       87        0       4       32 
           0 avahi-daemon
[ 1055.363963] [  641]     0   641     2687       48       3       77 
           0 rsyslogd
[ 1055.372106] [  642]     0   642       86       42       3        9 
           0 irqbalance
[ 1055.380423] [  681]     0   643     6821      210       5      236 
           0 tuned
[ 1055.388306] [  647]     0   647      109       44       2       31 
           0 smartd
[ 1055.396270] [  651]   997   651       83       45       2       18 
           0 chronyd
[ 1055.404326] [  652]     0   652       99       38       3       36 
           0 systemd-logind
[ 1055.412987] [  653]    81   653      222       35       2       37 
        -900 dbus-daemon
[ 1055.421389] [  667]     0   667       79       27       3       29 
           0 atd
[ 1055.429100] [  675]     0   675     1846       68       4       86 
           0 login
[ 1055.436978] [  676]     0   676     1717       23       5       10 
           0 agetty
[ 1055.444948] [  710]     0   710     1761       47       3       36 
           0 bash
[ 1055.452752] [  727]     0   727      240       84       2       74 
       -1000 sshd
[ 1055.460553] [  733]     0   733     1739       34       3        9 
           0 rhsmcertd
[ 1055.468784] [  741]   999   741     7276      118       4      122 
           0 polkitd
[ 1055.476834] [  747]     0   747      502      102       3      281 
           0 dhclient
[ 1055.484977] [ 1323]     0  1323      334       69       3       69 
           0 master
[ 1055.492947] [ 1325]    89  1325      337       82       2       78 
           0 qmgr
[ 1055.500746] [ 1328]     0  1328      358      130       3      111 
           0 sshd
[ 1055.508544] [ 1331]     0  1331     1762       48       5       39 
           0 bash
[ 1055.516335] [ 1373]    89  1373      336       89       2       69 
           0 pickup
[ 1055.524305] [ 2944]     0  2944     1734       50       4        3 
           0 make
[ 1055.532102] [31425]     0 31425     1760       79       3        0 
           0 make
[ 1055.539900] [32092]     0 32092     1736       51       4        0 
           0 sh
[ 1055.547518] [32103]     0 32103     1736       51       4        0 
           0 sh
[ 1055.555142] [32108]     0 32108     1737       42       4        0 
           0 gcc
[ 1055.562852] [32110]     0 32110     1736       52       4        0 
           0 sh
[ 1055.570477] [32111]     0 32111     2479      373       4        0 
           0 cc1
[ 1055.578187] [32118]     0 32118     1736       52       4        0 
           0 sh
[ 1055.585805] [32120]     0 32120     1737       42       4        0 
           0 gcc
[ 1055.593515] [32123]     0 32123     1736       51       4        0 
           0 sh
[ 1055.601137] [32125]     0 32125     2479      373       4        0 
           0 cc1
[ 1055.608848] [32127]     0 32127     1737       43       4        0 
           0 gcc
[ 1055.616552] [32132]     0 32132     1736       52       4        0 
           0 sh
[ 1055.624177] [32134]     0 32134     1737       42       5        0 
           0 gcc
[ 1055.631890] [32135]     0 32135     2472      304       4        0 
           0 cc1
[ 1055.639601] [32136]     0 32136     2474      314       4        0 
           0 cc1
[ 1055.647305] [32140]     0 32140     1736       52       3        0 
           0 sh
[ 1055.654930] [32141]     0 32141     1737       42       4        0 
           0 gcc
[ 1055.662640] [32142]     0 32142     1736       51       3        0 
           0 sh
[ 1055.670264] [32146]     0 32146     1737       42       4        0 
           0 gcc
[ 1055.677968] [32150]     0 32150     1736       52       4        0 
           0 sh
[ 1055.685591] [32151]     0 32151     2474      314       3        0 
           0 cc1
[ 1055.693302] [32154]     0 32154     1737       42       4        0 
           0 gcc
[ 1055.701012] [32157]     0 32157     1736       51       3        0 
           0 sh
[ 1055.708636] [32158]     0 32158     2479      373       4        0 
           0 cc1
[ 1055.716340] [32164]     0 32164     1737       43       5        0 
           0 gcc
[ 1055.724051] [32165]     0 32165     1736       52       3        0 
           0 sh
[ 1055.731673] [32166]     0 32166     1737       42       4        0 
           0 gcc
[ 1055.739383] [32170]     0 32170     2466      232       3        0 
           0 cc1
[ 1055.747087] [32173]     0 32173     2470      232       4        0 
           0 cc1
[ 1055.754799] [32176]     0 32176     1736       52       4        0 
           0 sh
[ 1055.762424] [32178]     0 32178     1737       42       4        0 
           0 gcc
[ 1055.770134] [32180]     0 32180     2474      313       3        0 
           0 cc1
[ 1055.777839] [32182]     0 32182     1736       51       3        0 
           0 sh
[ 1055.785462] [32184]     0 32184     1737       42       3        0 
           0 gcc
[ 1055.793172] [32185]     0 32185     1736       52       5        0 
           0 sh
[ 1055.800795] [32186]     0 32186     1737       42       4        0 
           0 gcc
[ 1055.808505] [32188]     0 32188     2464      199       4        0 
           0 cc1
[ 1055.816209] [32190]     0 32190     1736       51       4        0 
           0 sh
[ 1055.823833] [32194]     0 32194     2466      232       4        0 
           0 cc1
[ 1055.831544] [32195]     0 32195     1737       43       4        0 
           0 gcc
[ 1055.839254] [32197]     0 32197     1736       13       4        0 
           0 sh
[ 1055.846872] [32199]     0 32199     2464      199       4        0 
           0 cc1
[ 1055.854582] [32200]     0 32200       70       34       3        0 
           0 mv
[ 1055.862207] [32201]     0 32201     1760       79       3        0 
           0 make
[ 1055.870002] [32202]     0 32202     1909      157       4        0 
           0 as
[ 1055.877620] Out of memory: Kill process 681 (tuned) score 1 or 
sacrifice child
[ 1055.884812] Killed process 681 (tuned) total-vm:436544kB, 
anon-rss:3200kB, file-rss:10240kB

Message from syslogd@localhost at Jan  8 07:46:37 ...
  kernel:Call trace:

[ 1057.551522] cc1 invoked oom-killer: gfp_mask=0x200da, order=0, 
oom_score_adj=0
[ 1057.558741] cc1 cpuset=/ mems_allowed=0
[ 1057.562589] CPU: 3 PID: 32236 Comm: cc1 Not tainted 3.19.0-rc3+ #2
[ 1057.568761] Hardware name: APM X-Gene Mustang board (DT)
[ 1057.574049] Call trace:
[ 1057.576494] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1057.581888] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1057.586927] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1057.591973] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1057.597869] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1057.603597] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1057.609070] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1057.615310] [<fffffe00001a1fc8>] do_wp_page+0x444/0x904
[ 1057.620519] [<fffffe00001a3fc8>] handle_mm_fault+0x4c4/0xd1c
[ 1057.626152] [<fffffe00000a01c8>] do_page_fault+0x214/0x338
[ 1057.631625] [<fffffe000009022c>] do_mem_abort+0x38/0x9c
[ 1057.636825] Exception stack(0xfffffe00419e3e30 to 0xfffffe00419e3f50)
[ 1057.643241] 3e20:                                     00000200 
00000000 1ffe5c60 00000000
[ 1057.651385] 3e40: ffffffff ffffffff 006f0560 00000000 00b11000 
fffffe00 000f7c84 fffffe00
[ 1057.659528] 3e60: 419e3ea0 fffffe00 000903b4 fffffe00 419e3ed0 
fffffe00 0002000c fffffc00
[ 1057.667664] 3e80: 00b686b8 fffffe00 af976004 000003ff 419e3eb0 
fffffe00 000964e8 fffffe00
[ 1057.675805] 3ea0: 00000208 00000000 00092f90 fffffe00 e1ae3770 
000003ff 0009308c fffffe00
[ 1057.683947] 3ec0: 00000200 00000000 00093094 fffffe00 20006cf0 
00000000 20006ce0 00000000
[ 1057.692092] 3ee0: a727db78 000003ff a727db77 000003ff fffffffe 
00000000 00000000 00000000
[ 1057.700234] 3f00: 00007fff 00000000 00004000 00000000 4c9abfcd 
00000000 00003fcd 00000000
[ 1057.708374] 3f20: 00003fcd 00000000 0000bf97 00000000 00007ffe 
00000000 00000000 00000000
[ 1057.716510] 3f40: 1fffffff 00000000 0000ddd0 00000000
[ 1057.721539] Mem-Info:
[ 1057.723800] DMA per-cpu:
[ 1057.726319] CPU    0: hi:    6, btch:   1 usd:   0
[ 1057.731089] CPU    1: hi:    6, btch:   1 usd:   0
[ 1057.735855] CPU    2: hi:    6, btch:   1 usd:   0
[ 1057.740625] CPU    3: hi:    6, btch:   1 usd:   0
[ 1057.745389] CPU    4: hi:    6, btch:   1 usd:   0
[ 1057.750159] CPU    5: hi:    6, btch:   1 usd:   0
[ 1057.754924] CPU    6: hi:    6, btch:   1 usd:   0
[ 1057.759693] CPU    7: hi:    6, btch:   1 usd:   0
[ 1057.764458] Normal per-cpu:
[ 1057.767235] CPU    0: hi:    6, btch:   1 usd:   0
[ 1057.772004] CPU    1: hi:    6, btch:   1 usd:   0
[ 1057.776769] CPU    2: hi:    6, btch:   1 usd:   0
[ 1057.781538] CPU    3: hi:    6, btch:   1 usd:   0
[ 1057.786302] CPU    4: hi:    6, btch:   1 usd:   0
[ 1057.791072] CPU    5: hi:    6, btch:   1 usd:   0
[ 1057.795837] CPU    6: hi:    6, btch:   1 usd:   0
[ 1057.800606] CPU    7: hi:    6, btch:   1 usd:   0
[ 1057.805375] active_anon:207677 inactive_anon:25723 isolated_anon:0
[ 1057.805375]  active_file:1246 inactive_file:21239 isolated_file:0
[ 1057.805375]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1057.805375]  free:850 slab_reclaimable:1045 slab_unreclaimable:1748
[ 1057.805375]  mapped:667 shmem:63 pagetables:232 bounce:0
[ 1057.805375]  free_cma:6
[ 1057.836860] DMA free:48576kB min:4032kB low:4992kB high:6016kB 
active_anon:3035200kB inactive_anon:612608kB active_file:17856kB 
inactive_file:323136kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:11520kB shmem:2048kB 
slab_reclaimable:26304kB slab_unreclaimable:36096kB kernel_stack:352kB 
pagetables:5120kB unstable:0kB bounce:0kB free_cma:384kB 
writeback_tmp:0kB pages_scanned:25701184 all_unreclaimable? yes
[ 1057.880173] lowmem_reserve[]: 0 765 765
[ 1057.884043] Normal free:5824kB min:12224kB low:15232kB high:18304kB 
active_anon:10256128kB inactive_anon:1033664kB active_file:61888kB 
inactive_file:1036160kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:31168kB shmem:1984kB 
slab_reclaimable:40576kB slab_unreclaimable:75776kB kernel_stack:3072kB 
pagetables:9728kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:80137088 all_unreclaimable? yes
[ 1057.928132] lowmem_reserve[]: 0 0 0
[ 1057.931662] DMA: 55*64kB (UMRC) 49*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 48704kB
[ 1057.945093] Normal: 62*64kB (MR) 9*128kB (R) 4*256kB (R) 0*512kB 
0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 6144kB
[ 1057.957726] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1057.966305] 43758 total pagecache pages
[ 1057.970133] 25806 pages in swap cache
[ 1057.973778] Swap cache stats: add 28140, delete 2334, find 2447/2608
[ 1057.980108] Free swap  = 6619264kB
[ 1057.983491] Total swap = 8388544kB
[ 1057.986874] 262112 pages RAM
[ 1057.989750] 0 pages HighMem/MovableOnly
[ 1057.993567] 18446744073709544361 pages reserved
[ 1057.998072] 8192 pages cma reserved
[ 1058.001550] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents 
oom_score_adj name
[ 1058.009370] [  481]     0   481      233      106       3       26 
           0 systemd-journal
[ 1058.018115] [  487]     0   487     1420       32       3       80 
           0 lvmetad
[ 1058.026179] [  512]     0   512      216       40       2       46 
       -1000 systemd-udevd
[ 1058.034756] [  598]     0   598      258       45       3       38 
       -1000 auditd
[ 1058.042728] [  620]     0   620     6825      224       3      106 
           0 NetworkManager
[ 1058.051400] [  623]     0   623     2028       83       3      100 
           0 abrtd
[ 1058.059289] [  624]     0   624     2019       78       3       92 
           0 abrt-watch-log
[ 1058.067944] [  628]    70   628       87       38       4       34 
           0 avahi-daemon
[ 1058.076440] [  629]   998   629       40       13       2        0 
           0 lsmd
[ 1058.084238] [  631]    70   631       87        0       4       32 
           0 avahi-daemon
[ 1058.092736] [  641]     0   641     2687       82       3       56 
           0 rsyslogd
[ 1058.100881] [  642]     0   642       86       42       3        9 
           0 irqbalance
[ 1058.109202] [  647]     0   647      109       44       2       31 
           0 smartd
[ 1058.117169] [  651]   997   651       83       45       2       18 
           0 chronyd
[ 1058.125230] [  652]     0   652       99       38       3       36 
           0 systemd-logind
[ 1058.133900] [  653]    81   653      222       53       2       21 
        -900 dbus-daemon
[ 1058.142303] [  667]     0   667       79       27       3       29 
           0 atd
[ 1058.150020] [  675]     0   675     1846       68       4       86 
           0 login
[ 1058.157900] [  676]     0   676     1717       23       5       10 
           0 agetty
[ 1058.165874] [  710]     0   710     1761       47       3       36 
           0 bash
[ 1058.173679] [  727]     0   727      240       84       2       74 
       -1000 sshd
[ 1058.181476] [  733]     0   733     1739       34       3        9 
           0 rhsmcertd
[ 1058.189712] [  741]   999   741     7276      132       4      105 
           0 polkitd
[ 1058.197764] [  747]     0   747      502      102       3      281 
           0 dhclient
[ 1058.205912] [ 1323]     0  1323      334       69       3       69 
           0 master
[ 1058.213891] [ 1325]    89  1325      337       82       2       78 
           0 qmgr
[ 1058.221691] [ 1328]     0  1328      358      130       3      111 
           0 sshd
[ 1058.229495] [ 1331]     0  1331     1762       48       5       39 
           0 bash
[ 1058.237289] [ 1373]    89  1373      336       89       2       69 
           0 pickup
[ 1058.245264] [ 2944]     0  2944     1734       50       4        3 
           0 make
[ 1058.253069] [31425]     0 31425     1760       79       3        0 
           0 make
[ 1058.260869] [32190]     0 32190     1736       51       4        0 
           0 sh
[ 1058.268493] [32195]     0 32195     1737       43       4        0 
           0 gcc
[ 1058.276199] [32199]     0 32199     2479      382       4        0 
           0 cc1
[ 1058.283915] [32224]     0 32224     1736       51       4        0 
           0 sh
[ 1058.291548] [32230]     0 32230     1737       42       4        0 
           0 gcc
[ 1058.299264] [32232]     0 32232     1736       51       3        0 
           0 sh
[ 1058.306883] [32236]     0 32236     2479      314       4        0 
           0 cc1
[ 1058.314594] [32237]     0 32237     1736       52       5        0 
           0 sh
[ 1058.322217] [32239]     0 32239     1737       42       5        0 
           0 gcc
[ 1058.329934] [32241]     0 32241     1737       43       5        0 
           0 gcc
[ 1058.337642] [32245]     0 32245     1736       51       4        0 
           0 sh
[ 1058.345271] [32246]     0 32246     2474      313       4        0 
           0 cc1
[ 1058.352992] [32248]     0 32248     1737       42       4        0 
           0 gcc
[ 1058.360704] [32249]     0 32249     2479      373       3        0 
           0 cc1
[ 1058.368414] [32250]     0 32250     1736       52       3        0 
           0 sh
[ 1058.376032] [32252]     0 32252     1737       43       4        0 
           0 gcc
[ 1058.383750] [32253]     0 32253     1736       51       5        0 
           0 sh
[ 1058.391376] [32254]     0 32254     2472      303       4        0 
           0 cc1
[ 1058.399087] [32257]     0 32257     2473      296       3        0 
           0 cc1
[ 1058.406791] [32258]     0 32258     1737       42       4        0 
           0 gcc
[ 1058.414501] [32260]     0 32260     1736       51       3        0 
           0 sh
[ 1058.422132] [32261]     0 32261     1737       42       3        0 
           0 gcc
[ 1058.429844] [32264]     0 32264     2470      232       3        0 
           0 cc1
[ 1058.437549] [32268]     0 32268     2479      373       3        0 
           0 cc1
[ 1058.445259] [32273]     0 32273     1736       51       5        0 
           0 sh
[ 1058.452890] [32276]     0 32276     1737       43       4        0 
           0 gcc
[ 1058.460602] [32278]     0 32278     1736       51       4        0 
           0 sh
[ 1058.468220] [32280]     0 32280     1737       43       4        0 
           0 gcc
[ 1058.475930] [32282]     0 32282     2470      296       4        0 
           0 cc1
[ 1058.483652] [32286]     0 32286     1736       52       4        0 
           0 sh
[ 1058.491278] [32287]     0 32287     2474      313       5        0 
           0 cc1
[ 1058.498992] [32291]     0 32291     1737       42       3        0 
           0 gcc
[ 1058.506691] [32294]     0 32294     2470      233       4        0 
           0 cc1
[ 1058.514405] [32299]     0 32299     1736       51       3        0 
           0 sh
[ 1058.522036] [32300]     0 32300     1737       42       3        0 
           0 gcc
[ 1058.529747] [32301]     0 32301     1737       42       3        0 
           0 gcc
[ 1058.537452] [32302]     0 32302     1760       79       3        0 
           0 make
[ 1058.545255] Out of memory: Kill process 747 (dhclient) score 0 or 
sacrifice child
[ 1058.552708] Killed process 747 (dhclient) total-vm:32128kB, 
anon-rss:192kB, file-rss:6336kB
[ 1059.379317] NetworkManager invoked oom-killer: gfp_mask=0xd0, 
order=0, oom_score_adj=0
[ 1059.387206] NetworkManager cpuset=/ mems_allowed=0
[ 1059.392022] CPU: 6 PID: 620 Comm: NetworkManager Not tainted 
3.19.0-rc3+ #2
[ 1059.398964] Hardware name: APM X-Gene Mustang board (DT)
[ 1059.404251] Call trace:
[ 1059.406694] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1059.412088] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1059.417121] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1059.422162] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1059.428057] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1059.433788] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1059.439264] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1059.445504] [<fffffe0000180c5c>] __get_free_pages+0x14/0x48
[ 1059.451063] [<fffffe00001e8454>] __pollwait+0x4c/0xf4
[ 1059.456091] [<fffffe00004ff568>] datagram_poll+0x24/0x11c
[ 1059.461479] [<fffffe0000535a70>] netlink_poll+0xac/0x180
[ 1059.466767] [<fffffe00004ef3dc>] sock_poll+0xf8/0x110
[ 1059.471800] [<fffffe00001e97b0>] do_sys_poll+0x220/0x484
[ 1059.477085] [<fffffe00001e9d24>] SyS_ppoll+0x1c4/0x1dc
[ 1059.482213] Mem-Info:
[ 1059.484476] DMA per-cpu:
[ 1059.486996] CPU    0: hi:    6, btch:   1 usd:   0
[ 1059.491767] CPU    1: hi:    6, btch:   1 usd:   0
[ 1059.496532] CPU    2: hi:    6, btch:   1 usd:   0
[ 1059.501301] CPU    3: hi:    6, btch:   1 usd:   0
[ 1059.506066] CPU    4: hi:    6, btch:   1 usd:   0
[ 1059.510842] CPU    5: hi:    6, btch:   1 usd:   0
[ 1059.515609] CPU    6: hi:    6, btch:   1 usd:   0
[ 1059.520383] CPU    7: hi:    6, btch:   1 usd:   0
[ 1059.525149] Normal per-cpu:
[ 1059.527927] CPU    0: hi:    6, btch:   1 usd:   0
[ 1059.532706] CPU    1: hi:    6, btch:   1 usd:   0
[ 1059.537471] CPU    2: hi:    6, btch:   1 usd:   0
[ 1059.542242] CPU    3: hi:    6, btch:   1 usd:   0
[ 1059.547006] CPU    4: hi:    6, btch:   1 usd:   0
[ 1059.551783] CPU    5: hi:    6, btch:   1 usd:   0
[ 1059.556549] CPU    6: hi:    6, btch:   1 usd:   0
[ 1059.561321] CPU    7: hi:    6, btch:   1 usd:   1
[ 1059.566092] active_anon:207642 inactive_anon:25756 isolated_anon:0
[ 1059.566092]  active_file:1244 inactive_file:21227 isolated_file:0
[ 1059.566092]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1059.566092]  free:864 slab_reclaimable:1045 slab_unreclaimable:1748
[ 1059.566092]  mapped:642 shmem:63 pagetables:229 bounce:0
[ 1059.566092]  free_cma:6
[ 1059.597581] DMA free:48768kB min:4032kB low:4992kB high:6016kB 
active_anon:3035200kB inactive_anon:612608kB active_file:17792kB 
inactive_file:322944kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:11520kB shmem:2048kB 
slab_reclaimable:26304kB slab_unreclaimable:36096kB kernel_stack:336kB 
pagetables:5056kB unstable:0kB bounce:0kB free_cma:384kB 
writeback_tmp:0kB pages_scanned:24814080 all_unreclaimable? yes
[ 1059.640895] lowmem_reserve[]: 0 765 765
[ 1059.644766] Normal free:6528kB min:12224kB low:15232kB high:18304kB 
active_anon:10253888kB inactive_anon:1035776kB active_file:61824kB 
inactive_file:1035584kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:29568kB shmem:1984kB 
slab_reclaimable:40576kB slab_unreclaimable:75776kB kernel_stack:2992kB 
pagetables:9600kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:79671936 all_unreclaimable? yes
[ 1059.688856] lowmem_reserve[]: 0 0 0
[ 1059.692375] DMA: 57*64kB (UMRC) 49*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 48832kB
[ 1059.705808] Normal: 68*64kB (MR) 9*128kB (R) 4*256kB (R) 0*512kB 
0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 6528kB
[ 1059.718443] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1059.727021] 43784 total pagecache pages
[ 1059.730850] 25838 pages in swap cache
[ 1059.734493] Swap cache stats: add 28173, delete 2335, find 2447/2609
[ 1059.740820] Free swap  = 6635136kB
[ 1059.744204] Total swap = 8388544kB
[ 1059.747587] 262112 pages RAM
[ 1059.750463] 0 pages HighMem/MovableOnly
[ 1059.754280] 18446744073709544361 pages reserved
[ 1059.758794] 8192 pages cma reserved
[ 1059.762265] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents 
oom_score_adj name
[ 1059.770084] [  481]     0   481      233      106       3       26 
           0 systemd-journal
[ 1059.778838] [  487]     0   487     1420       32       3       80 
           0 lvmetad
[ 1059.786890] [  512]     0   512      216       40       2       46 
       -1000 systemd-udevd
[ 1059.795474] [  598]     0   598      258       45       3       38 
       -1000 auditd
[ 1059.803446] [  620]     0   620     6825      224       3      106 
           0 NetworkManager
[ 1059.812116] [  623]     0   623     2028       83       3      100 
           0 abrtd
[ 1059.820004] [  624]     0   624     2019       78       3       92 
           0 abrt-watch-log
[ 1059.828667] [  628]    70   628       87       38       4       34 
           0 avahi-daemon
[ 1059.837151] [  629]   998   629       40       13       2        0 
           0 lsmd
[ 1059.844954] [  631]    70   631       87        0       4       32 
           0 avahi-daemon
[ 1059.853450] [  641]     0   641     2687       82       3       56 
           0 rsyslogd
[ 1059.861594] [  642]     0   642       86       42       3        9 
           0 irqbalance
[ 1059.869918] [  647]     0   647      109       44       2       31 
           0 smartd
[ 1059.877885] [  651]   997   651       83       45       2       18 
           0 chronyd
[ 1059.885945] [  652]     0   652       99       38       3       36 
           0 systemd-logind
[ 1059.894614] [  653]    81   653      222       53       2       21 
        -900 dbus-daemon
[ 1059.903021] [  667]     0   667       79       27       3       29 
           0 atd
[ 1059.910740] [  675]     0   675     1846       68       4       86 
           0 login
[ 1059.918629] [  676]     0   676     1717       23       5       10 
           0 agetty
[ 1059.926594] [  710]     0   710     1761       47       3       36 
           0 bash
[ 1059.934392] [  727]     0   727      240       84       2       74 
       -1000 sshd
[ 1059.942197] [  733]     0   733     1739       34       3        9 
           0 rhsmcertd
[ 1059.950427] [  741]   999   741     7276      132       4      105 
           0 polkitd
[ 1059.958478] [ 1323]     0  1323      334       69       3       69 
           0 master
[ 1059.966457] [ 1325]    89  1325      337       82       2       78 
           0 qmgr
[ 1059.974256] [ 1328]     0  1328      358      130       3      111 
           0 sshd
[ 1059.982061] [ 1331]     0  1331     1762       48       5       39 
           0 bash
[ 1059.989859] [ 1373]    89  1373      336       89       2       69 
           0 pickup
[ 1059.997824] [ 2944]     0  2944     1734       50       4        3 
           0 make
[ 1060.005622] [31425]     0 31425     1760       79       3        0 
           0 make
[ 1060.013426] [32190]     0 32190     1736       51       4        0 
           0 sh
[ 1060.021052] [32195]     0 32195     1737       43       4        0 
           0 gcc
[ 1060.028768] [32199]     0 32199     2479      382       4        0 
           0 cc1
[ 1060.036475] [32224]     0 32224     1736       51       4        0 
           0 sh
[ 1060.044104] [32230]     0 32230     1737       42       4        0 
           0 gcc
[ 1060.051824] [32232]     0 32232     1736       51       3        0 
           0 sh
[ 1060.059450] [32236]     0 32236     2479      314       4        0 
           0 cc1
[ 1060.067155] [32237]     0 32237     1736       52       5        0 
           0 sh
[ 1060.074786] [32239]     0 32239     1737       42       5        0 
           0 gcc
[ 1060.082496] [32241]     0 32241     1737       43       5        0 
           0 gcc
[ 1060.090215] [32245]     0 32245     1736       51       4        0 
           0 sh
[ 1060.097836] [32246]     0 32246     2474      313       4        0 
           0 cc1
[ 1060.105550] [32248]     0 32248     1737       42       4        0 
           0 gcc
[ 1060.113274] [32249]     0 32249     2479      373       3        0 
           0 cc1
[ 1060.120986] [32250]     0 32250     1736       52       3        0 
           0 sh
[ 1060.128610] [32252]     0 32252     1737       43       4        0 
           0 gcc
[ 1060.136316] [32253]     0 32253     1736       51       5        0 
           0 sh
[ 1060.143945] [32254]     0 32254     2472      303       4        0 
           0 cc1
[ 1060.151664] [32257]     0 32257     2473      296       3        0 
           0 cc1
[ 1060.159383] [32258]     0 32258     1737       42       4        0 
           0 gcc
[ 1060.167088] [32260]     0 32260     1736       51       3        0 
           0 sh
[ 1060.174720] [32261]     0 32261     1737       42       3        0 
           0 gcc
[ 1060.182432] [32264]     0 32264     2470      232       3        0 
           0 cc1
[ 1060.190149] [32268]     0 32268     2479      373       3        0 
           0 cc1
[ 1060.197856] [32273]     0 32273     1736       51       5        0 
           0 sh
[ 1060.205484] [32276]     0 32276     1737       43       4        0 
           0 gcc
[ 1060.213202] [32278]     0 32278     1736       51       4        0 
           0 sh
[ 1060.220827] [32280]     0 32280     1737       43       4        0 
           0 gcc
[ 1060.228532] [32282]     0 32282     2470      296       4        0 
           0 cc1
[ 1060.236248] [32286]     0 32286     1736       52       4        0 
           0 sh
[ 1060.243874] [32287]     0 32287     2474      313       5        0 
           0 cc1
[ 1060.251591] [32291]     0 32291     1737       42       3        0 
           0 gcc
[ 1060.259303] [32294]     0 32294     2470      233       4        0 
           0 cc1
[ 1060.267008] [32299]     0 32299     1736       51       3        0 
           0 sh
[ 1060.274638] [32300]     0 32300     1737       42       3        0 
           0 gcc
[ 1060.282349] [32301]     0 32301     1737       42       3        0 
           0 gcc
[ 1060.290072] [32302]     0 32302     1760       79       3        0 
           0 make
[ 1060.297866] Out of memory: Kill process 32199 (cc1) score 0 or 
sacrifice child
[ 1060.305063] Killed process 32199 (cc1) total-vm:158656kB, 
anon-rss:14912kB, file-rss:9536kB

Message from syslogd@localhost at Jan  8 07:46:41 ...
  kernel:Call trace:

Message from syslogd@localhost at Jan  8 07:46:41 ...
  kernel:Call trace:
[ 1061.074807] cc1 invoked oom-killer: gfp_mask=0x200da, order=0, 
oom_score_adj=0
[ 1061.082033] cc1 cpuset=/ mems_allowed=0
[ 1061.085881] CPU: 3 PID: 32257 Comm: cc1 Not tainted 3.19.0-rc3+ #2
[ 1061.092051] Hardware name: APM X-Gene Mustang board (DT)
[ 1061.097339] Call trace:
[ 1061.099794] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1061.105169] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1061.110205] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1061.115233] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1061.121131] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1061.126848] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1061.132314] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1061.138550] [<fffffe00001a443c>] handle_mm_fault+0x938/0xd1c
[ 1061.144188] [<fffffe00000a01c8>] do_page_fault+0x214/0x338
[ 1061.149652] [<fffffe000009022c>] do_mem_abort+0x38/0x9c
[ 1061.154849] Exception stack(0xfffffe011be2fe30 to 0xfffffe011be2ff50)
[ 1061.161264] fe20:                                     00000200 
00000000 00f886a8 00000000
[ 1061.169407] fe40: ffffffff ffffffff 969b4620 000003ff 0000011a 
00000000 00000015 00000000
[ 1061.177543] fe60: 1be2fea0 fffffe01 00000000 00000000 1be2fe80 
fffffe01 000ce950 fffffe00
[ 1061.185685] fe80: 1be2feb0 fffffe01 00096520 fffffe00 00000204 
00000000 00000d7e 00000000
[ 1061.193826] fea0: ffffffff ffffffff 969fc628 000003ff ed276070 
000003ff 0009308c fffffe00
[ 1061.201968] fec0: 00000200 00000000 000931cc fffffe00 8e7c0000 
000003ff 00000000 00000000
[ 1061.210112] fee0: 00000248 00000000 8e7c0000 000003ff 00000000 
00000000 8e7c0000 000003ff
[ 1061.218248] ff00: 00000009 00000000 00000009 00000000 00fe8000 
00000000 000003a8 00000000
[ 1061.226390] ff20: 006528e0 00000000 0000008e 00000000 00000006 
00000000 00000018 00000000
[ 1061.234531] ff40: 00000008 00000000 00000000 00000000
[ 1061.239560] Mem-Info:
[ 1061.241820] DMA per-cpu:
[ 1061.244339] CPU    0: hi:    6, btch:   1 usd:   0
[ 1061.249110] CPU    1: hi:    6, btch:   1 usd:   0
[ 1061.253875] CPU    2: hi:    6, btch:   1 usd:   0
[ 1061.258644] CPU    3: hi:    6, btch:   1 usd:   0
[ 1061.263409] CPU    4: hi:    6, btch:   1 usd:   0
[ 1061.268174] CPU    5: hi:    6, btch:   1 usd:   0
[ 1061.272944] CPU    6: hi:    6, btch:   1 usd:   0
[ 1061.277709] CPU    7: hi:    6, btch:   1 usd:   0
[ 1061.282479] Normal per-cpu:
[ 1061.285257] CPU    0: hi:    6, btch:   1 usd:   0
[ 1061.290027] CPU    1: hi:    6, btch:   1 usd:   0
[ 1061.294792] CPU    2: hi:    6, btch:   1 usd:   0
[ 1061.299562] CPU    3: hi:    6, btch:   1 usd:   0
[ 1061.304327] CPU    4: hi:    6, btch:   1 usd:   0
[ 1061.309096] CPU    5: hi:    6, btch:   1 usd:   0
[ 1061.313861] CPU    6: hi:    6, btch:   1 usd:   0
[ 1061.318625] CPU    7: hi:    6, btch:   1 usd:   0
[ 1061.323400] active_anon:207507 inactive_anon:25760 isolated_anon:0
[ 1061.323400]  active_file:1223 inactive_file:21249 isolated_file:0
[ 1061.323400]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1061.323400]  free:1016 slab_reclaimable:1044 slab_unreclaimable:1748
[ 1061.323400]  mapped:652 shmem:65 pagetables:229 bounce:0
[ 1061.323400]  free_cma:1
[ 1061.354973] DMA free:52992kB min:4032kB low:4992kB high:6016kB 
active_anon:3030912kB inactive_anon:612608kB active_file:17920kB 
inactive_file:323328kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:11648kB shmem:2112kB 
slab_reclaimable:26240kB slab_unreclaimable:36096kB kernel_stack:336kB 
pagetables:5056kB unstable:0kB bounce:0kB free_cma:64kB 
writeback_tmp:0kB pages_scanned:24781760 all_unreclaimable? yes
[ 1061.398197] lowmem_reserve[]: 0 765 765
[ 1061.402072] Normal free:12032kB min:12224kB low:15232kB high:18304kB 
active_anon:10249536kB inactive_anon:1036032kB active_file:60352kB 
inactive_file:1036608kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:30080kB shmem:2048kB 
slab_reclaimable:40576kB slab_unreclaimable:75776kB kernel_stack:2992kB 
pagetables:9600kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:76284480 all_unreclaimable? yes
[ 1061.446244] lowmem_reserve[]: 0 0 0
[ 1061.449762] DMA: 122*64kB (UEMR) 49*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 52992kB
[ 1061.463271] Normal: 134*64kB (MR) 14*128kB (MR) 5*256kB (R) 1*512kB 
(M) 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 
12160kB
[ 1061.476597] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1061.485172] 43768 total pagecache pages
[ 1061.488993] 25841 pages in swap cache
[ 1061.492636] Swap cache stats: add 28188, delete 2347, find 2460/2635
[ 1061.498962] Free swap  = 6635968kB
[ 1061.502345] Total swap = 8388544kB
[ 1061.505729] 262112 pages RAM
[ 1061.508592] 0 pages HighMem/MovableOnly
[ 1061.512412] 18446744073709544361 pages reserved
[ 1061.516917] 8192 pages cma reserved
[ 1061.520390] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents 
oom_score_adj name
[ 1061.528192] [  481]     0   481      234      111       3       26 
           0 systemd-journal
[ 1061.536940] [  487]     0   487     1420       32       3       80 
           0 lvmetad
[ 1061.544998] [  512]     0   512      216       40       2       46 
       -1000 systemd-udevd
[ 1061.553575] [  598]     0   598      258       45       3       38 
       -1000 auditd
[ 1061.561545] [  620]     0   620     6825      224       3      106 
           0 NetworkManager
[ 1061.570206] [  623]     0   623     2028       83       3      100 
           0 abrtd
[ 1061.578083] [  624]     0   624     2019       87       3       84 
           0 abrt-watch-log
[ 1061.586744] [  628]    70   628       87       38       4       34 
           0 avahi-daemon
[ 1061.595232] [  629]   998   629       40       13       2        0 
           0 lsmd
[ 1061.603030] [  631]    70   631       87        0       4       32 
           0 avahi-daemon
[ 1061.611520] [  641]     0   641     2687       82       3       56 
           0 rsyslogd
[ 1061.619662] [  642]     0   642       86       42       3        9 
           0 irqbalance
[ 1061.627972] [  647]     0   647      109       44       2       31 
           0 smartd
[ 1061.635942] [  651]   997   651       83       45       2       18 
           0 chronyd
[ 1061.643997] [  652]     0   652       99       38       3       36 
           0 systemd-logind
[ 1061.652658] [  653]    81   653      222       53       2       21 
        -900 dbus-daemon
[ 1061.661059] [  667]     0   667       79       27       3       29 
           0 atd
[ 1061.668769] [  675]     0   675     1846       68       4       86 
           0 login
[ 1061.676647] [  676]     0   676     1717       23       5       10 
           0 agetty
[ 1061.684616] [  710]     0   710     1761       47       3       36 
           0 bash
[ 1061.692413] [  727]     0   727      240       84       2       74 
       -1000 sshd
[ 1061.700209] [  733]     0   733     1739       34       3        9 
           0 rhsmcertd
[ 1061.708432] [  741]   999   741     7276      132       4      105 
           0 polkitd
[ 1061.716488] [ 1323]     0  1323      334       69       3       69 
           0 master
[ 1061.724457] [ 1325]    89  1325      337       82       2       78 
           0 qmgr
[ 1061.732254] [ 1328]     0  1328      358      130       3      111 
           0 sshd
[ 1061.740052] [ 1331]     0  1331     1762       48       5       39 
           0 bash
[ 1061.747843] [ 1373]    89  1373      336       89       2       69 
           0 pickup
[ 1061.755813] [ 2944]     0  2944     1734       50       4        3 
           0 make
[ 1061.763611] [31425]     0 31425     1760       79       3        0 
           0 make
[ 1061.771408] [32190]     0 32190     1736       51       4        0 
           0 sh
[ 1061.779032] [32195]     0 32195     1751       52       4        0 
           0 gcc
[ 1061.786738] [32224]     0 32224     1736       51       4        0 
           0 sh
[ 1061.794362] [32230]     0 32230     1737       42       4        0 
           0 gcc
[ 1061.802073] [32232]     0 32232     1736       51       3        0 
           0 sh
[ 1061.809696] [32236]     0 32236     2479      373       4        0 
           0 cc1
[ 1061.817401] [32237]     0 32237     1736       52       5        0 
           0 sh
[ 1061.825024] [32239]     0 32239     1737       42       5        0 
           0 gcc
[ 1061.832734] [32241]     0 32241     1737       43       5        0 
           0 gcc
[ 1061.840445] [32245]     0 32245     1736       51       4        0 
           0 sh
[ 1061.848063] [32246]     0 32246     2474      313       4        0 
           0 cc1
[ 1061.855773] [32248]     0 32248     1737       42       4        0 
           0 gcc
[ 1061.863483] [32249]     0 32249     2479      383       3        0 
           0 cc1
[ 1061.871195] [32250]     0 32250     1736       52       3        0 
           0 sh
[ 1061.878820] [32252]     0 32252     1737       43       4        0 
           0 gcc
[ 1061.886525] [32253]     0 32253     1736       51       5        0 
           0 sh
[ 1061.894148] [32254]     0 32254     2472      303       4        0 
           0 cc1
[ 1061.901858] [32257]     0 32257     2472      304       3        0 
           0 cc1
[ 1061.909570] [32258]     0 32258     1737       42       4        0 
           0 gcc
[ 1061.917274] [32260]     0 32260     1736       51       3        0 
           0 sh
[ 1061.924898] [32261]     0 32261     1737       42       3        0 
           0 gcc
[ 1061.932607] [32264]     0 32264     2470      232       3        0 
           0 cc1
[ 1061.940317] [32268]     0 32268     2479      373       3        0 
           0 cc1
[ 1061.948022] [32273]     0 32273     1736       51       5        0 
           0 sh
[ 1061.955646] [32276]     0 32276     1737       43       4        0 
           0 gcc
[ 1061.963355] [32278]     0 32278     1736       51       4        0 
           0 sh
[ 1061.970979] [32280]     0 32280     1737       43       4        0 
           0 gcc
[ 1061.978683] [32282]     0 32282     2473      296       4        0 
           0 cc1
[ 1061.986394] [32286]     0 32286     1736       52       4        0 
           0 sh
[ 1061.994017] [32287]     0 32287     2474      313       5        0 
           0 cc1
[ 1062.001729] [32291]     0 32291     1737       42       3        0 
           0 gcc
[ 1062.009440] [32294]     0 32294     2470      233       4        0 
           0 cc1
[ 1062.017144] [32299]     0 32299     1736       51       3        0 
           0 sh
[ 1062.024768] [32300]     0 32300     1737       42       3        0 
           0 gcc
[ 1062.032479] [32301]     0 32301      208       11       3        0 
           0 cc1
[ 1062.040189] [32302]     0 32302       32       11       3        0 
           0 sh
[ 1062.047807] [32303]     0 32303     1760       79       3        0 
           0 make
[ 1062.055607] Out of memory: Kill process 32249 (cc1) score 0 or 
sacrifice child
[ 1062.062801] Killed process 32249 (cc1) total-vm:158656kB, 
anon-rss:14976kB, file-rss:9536kB
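
The figures in these reports can be sanity-checked mechanically. Below is a minimal sketch in Python; the helper names are hypothetical and not part of any kernel tool. It sums the "count*sizekB" terms of a per-zone free-list line and compares the result with the total the kernel printed. It also reinterprets the very large "pages reserved" value as a signed 64-bit number, on the assumption (not confirmed anywhere in this log) that the counter wrapped below zero.

```python
import re

def buddy_total_kb(line):
    """Return (computed, reported) kB for a per-zone free-list line."""
    # Each term looks like "57*64kB": <number of free blocks>*<block size>.
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    computed = sum(int(count) * int(size) for count, size in terms)
    # The kernel prints its own total after the "=" sign.
    reported = int(re.search(r"=\s*(\d+)kB", line).group(1))
    return computed, reported

def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as signed (assumes a wrapped counter)."""
    return value - (1 << 64) if value >= (1 << 63) else value

# Free-list line copied from the first report above, with the mail wrapping undone.
dma_line = ("DMA: 57*64kB (UMRC) 49*128kB (R) 8*256kB (R) 0*512kB "
            "0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB "
            "1*32768kB (R) 0*65536kB = 48832kB")

computed, reported = buddy_total_kb(dma_line)
print(computed, reported)                  # 48832 48832
print(as_signed64(18446744073709544361))   # -7255
```

The free-list total agrees with what the kernel printed. If the wrap assumption holds, the "pages reserved" line would correspond to -7255 pages.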

Message from syslogd@localhost at Jan  8 07:46:43 ...
  kernel:Call trace:
[ 1063.003783] systemd invoked oom-killer: gfp_mask=0x200da, order=0, 
oom_score_adj=0
[ 1063.011343] systemd cpuset=/ mems_allowed=0
[ 1063.015536] CPU: 2 PID: 1 Comm: systemd Not tainted 3.19.0-rc3+ #2
[ 1063.021707] Hardware name: APM X-Gene Mustang board (DT)
[ 1063.026995] Call trace:
[ 1063.029450] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1063.034829] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1063.039872] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1063.044901] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1063.050808] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1063.056531] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1063.062007] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1063.068246] [<fffffe00001a1c54>] do_wp_page+0xd0/0x904
[ 1063.073374] [<fffffe00001a3fc8>] handle_mm_fault+0x4c4/0xd1c
[ 1063.079020] [<fffffe00000a01c8>] do_page_fault+0x214/0x338
[ 1063.084480] [<fffffe000009022c>] do_mem_abort+0x38/0x9c
[ 1063.089691] Exception stack(0xfffffe03dbe83e30 to 0xfffffe03dbe83f50)
[ 1063.096102] 3e20:                                     00000000 
00000000 00007e36 00000000
[ 1063.104253] 3e40: ffffffff ffffffff b5728d14 000003ff dbe83e60 
fffffe03 001a8b80 fffffe00
[ 1063.112403] 3e60: dbe83ea0 fffffe03 001a9968 fffffe00 b4f60000 
000003ff 00000000 00000000
[ 1063.120554] 3e80: ffffffff ffffffff 001055ac fffffe00 dbe83ea0 
fffffe03 00010000 00000000
[ 1063.128693] 3ea0: e6cf3e80 000003ff 00093170 fffffe00 00000000 
00000000 0009308c fffffe00
[ 1063.136843] 3ec0: 00000000 00000000 00010000 00000000 d6cb9ee0 
000003ff d6d2fd70 000003ff
[ 1063.144994] 3ee0: d6d307f0 000003ff d6e03e80 000003ff 00000000 
00000000 d6de9ea0 000003ff
[ 1063.153148] 3f00: 00000006 00000000 00000006 00000000 0002edc2 
00000000 00000030 00000000
[ 1063.161299] 3f20: 00000426 00000000 1009a555 00000000 00000018 
00000000 ab5187f2 ffffffff
[ 1063.169446] 3f40: bc000000 003103cf 00000000 003b9aca
[ 1063.174472] Mem-Info:
[ 1063.176732] DMA per-cpu:
[ 1063.179262] CPU    0: hi:    6, btch:   1 usd:   0
[ 1063.184030] CPU    1: hi:    6, btch:   1 usd:   0
[ 1063.188797] CPU    2: hi:    6, btch:   1 usd:   0
[ 1063.193573] CPU    3: hi:    6, btch:   1 usd:   0
[ 1063.198341] CPU    4: hi:    6, btch:   1 usd:   0
[ 1063.203116] CPU    5: hi:    6, btch:   1 usd:   0
[ 1063.207883] CPU    6: hi:    6, btch:   1 usd:   0
[ 1063.212659] CPU    7: hi:    6, btch:   1 usd:   0
[ 1063.217426] Normal per-cpu:
[ 1063.220213] CPU    0: hi:    6, btch:   1 usd:   0
[ 1063.224981] CPU    1: hi:    6, btch:   1 usd:   1
[ 1063.229755] CPU    2: hi:    6, btch:   1 usd:   0
[ 1063.234521] CPU    3: hi:    6, btch:   1 usd:   0
[ 1063.239295] CPU    4: hi:    6, btch:   1 usd:   0
[ 1063.244061] CPU    5: hi:    6, btch:   1 usd:   0
[ 1063.248834] CPU    6: hi:    6, btch:   1 usd:   0
[ 1063.253602] CPU    7: hi:    6, btch:   1 usd:   0
[ 1063.258374] active_anon:207481 inactive_anon:25761 isolated_anon:0
[ 1063.258374]  active_file:1241 inactive_file:21246 isolated_file:3
[ 1063.258374]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1063.258374]  free:1015 slab_reclaimable:1044 slab_unreclaimable:1748
[ 1063.258374]  mapped:660 shmem:66 pagetables:236 bounce:0
[ 1063.258374]  free_cma:0
[ 1063.289954] DMA free:52864kB min:4032kB low:4992kB high:6016kB 
active_anon:3030784kB inactive_anon:612608kB active_file:18304kB 
inactive_file:323456kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:11968kB shmem:2112kB 
slab_reclaimable:26240kB slab_unreclaimable:36096kB kernel_stack:352kB 
pagetables:4928kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:26407104 all_unreclaimable? yes
[ 1063.333094] lowmem_reserve[]: 0 765 765
[ 1063.336965] Normal free:12096kB min:12224kB low:15232kB high:18304kB 
active_anon:10248000kB inactive_anon:1036096kB active_file:61120kB 
inactive_file:1036288kB unevictable:0kB isolated(anon):0kB 
isolated(file):192kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:30272kB shmem:2112kB 
slab_reclaimable:40576kB slab_unreclaimable:75776kB kernel_stack:2944kB 
pagetables:10176kB unstable:0kB bounce:0kB free_cma:0kB 
writeback_tmp:0kB pages_scanned:81586304 all_unreclaimable? yes
[ 1063.381407] lowmem_reserve[]: 0 0 0
[ 1063.384924] DMA: 120*64kB (UR) 49*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 52864kB
[ 1063.398274] Normal: 144*64kB (EMR) 9*128kB (R) 6*256kB (R) 0*512kB 
0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 11904kB
[ 1063.411177] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1063.419760] 43809 total pagecache pages
[ 1063.423578] 25864 pages in swap cache
[ 1063.427222] Swap cache stats: add 28224, delete 2360, find 2495/2705
[ 1063.433558] Free swap  = 6636800kB
[ 1063.436942] Total swap = 8388544kB
[ 1063.440333] 262112 pages RAM
[ 1063.443199] 0 pages HighMem/MovableOnly
[ 1063.447014] 18446744073709544361 pages reserved
[ 1063.451530] 8192 pages cma reserved
[ 1063.455000] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents 
oom_score_adj name
[ 1063.462815] [  481]     0   481      234      112       3       26 
           0 systemd-journal
[ 1063.471568] [  487]     0   487     1420       32       3       80 
           0 lvmetad
[ 1063.479630] [  512]     0   512      216       40       2       46 
       -1000 systemd-udevd
[ 1063.488204] [  598]     0   598      258       45       3       38 
       -1000 auditd
[ 1063.496181] [  620]     0   620     6830      252       3       82 
           0 NetworkManager
[ 1063.504847] [  623]     0   623     2028       83       3      100 
           0 abrtd
[ 1063.512737] [  624]     0   624     2019       87       3       84 
           0 abrt-watch-log
[ 1063.521403] [  628]    70   628       87       55       4       18 
           0 avahi-daemon
[ 1063.529896] [  629]   998   629       40       13       2        0 
           0 lsmd
[ 1063.537689] [  631]    70   631       87        0       4       32 
           0 avahi-daemon
[ 1063.546184] [  641]     0   641     2687       82       3       56 
           0 rsyslogd
[ 1063.554337] [  642]     0   642       86       42       3        9 
           0 irqbalance
[ 1063.562660] [  647]     0   647      109       44       2       31 
           0 smartd
[ 1063.570635] [  651]   997   651       83       45       2       18 
           0 chronyd
[ 1063.578687] [  652]     0   652       99       38       3       36 
           0 systemd-logind
[ 1063.587354] [  653]    81   653      222       53       2       21 
        -900 dbus-daemon
[ 1063.595761] [  667]     0   667       79       27       3       29 
           0 atd
[ 1063.603476] [  675]     0   675     1846       68       4       86 
           0 login
[ 1063.611363] [  676]     0   676     1717       23       5       10 
           0 agetty
[ 1063.619338] [  710]     0   710     1761       47       3       36 
           0 bash
[ 1063.627132] [  727]     0   727      240       84       2       74 
       -1000 sshd
[ 1063.634935] [  733]     0   733     1739       34       3        9 
           0 rhsmcertd
[ 1063.643169] [  741]   999   741     7276      132       4      105 
           0 polkitd
[ 1063.651230] [ 1323]     0  1323      334       69       3       69 
           0 master
[ 1063.659205] [ 1325]    89  1325      337       82       2       78 
           0 qmgr
[ 1063.666998] [ 1328]     0  1328      358      130       3      111 
           0 sshd
[ 1063.674802] [ 1331]     0  1331     1762       48       5       39 
           0 bash
[ 1063.682608] [ 1373]    89  1373      336       89       2       69 
           0 pickup
[ 1063.690584] [ 2944]     0  2944     1734       50       4        3 
           0 make
[ 1063.698380] [31425]     0 31425     1760       82       3        0 
           0 make
[ 1063.706184] [32224]     0 32224     1736       51       4        0 
           0 sh
[ 1063.713814] [32230]     0 32230     1737       42       4        0 
           0 gcc
[ 1063.721529] [32232]     0 32232     1736       51       3        0 
           0 sh
[ 1063.729157] [32239]     0 32239     1737       42       5        0 
           0 gcc
[ 1063.736863] [32245]     0 32245     1736       51       4        0 
           0 sh
[ 1063.744494] [32246]     0 32246     2474      313       4        0 
           0 cc1
[ 1063.752208] [32248]     0 32248     1737       42       4        0 
           0 gcc
[ 1063.759923] [32250]     0 32250     1736       52       3        0 
           0 sh
[ 1063.767544] [32252]     0 32252     1737       43       4        0 
           0 gcc
[ 1063.775260] [32253]     0 32253     1736       51       5        0 
           0 sh
[ 1063.782889] [32254]     0 32254     2474      313       4        0 
           0 cc1
[ 1063.790605] [32257]     0 32257     2474      314       3        0 
           0 cc1
[ 1063.798311] [32258]     0 32258     1737       42       4        0 
           0 gcc
[ 1063.806028] [32260]     0 32260     1736       51       3        0 
           0 sh
[ 1063.813663] [32261]     0 32261     1737       42       3        0 
           0 gcc
[ 1063.821379] [32264]     0 32264     2470      295       3        0 
           0 cc1
[ 1063.829094] [32273]     0 32273     1736       51       5        0 
           0 sh
[ 1063.836714] [32276]     0 32276     1737       43       4        0 
           0 gcc
[ 1063.844429] [32278]     0 32278     1736       51       4        0 
           0 sh
[ 1063.852058] [32280]     0 32280     1737       43       4        0 
           0 gcc
[ 1063.859774] [32282]     0 32282     2474      314       4        0 
           0 cc1
[ 1063.867480] [32286]     0 32286     1736       52       4        0 
           0 sh
[ 1063.875110] [32287]     0 32287     2474      313       5        0 
           0 cc1
[ 1063.882826] [32291]     0 32291     1737       42       3        0 
           0 gcc
[ 1063.890541] [32294]     0 32294     2473      296       4        0 
           0 cc1
[ 1063.898248] [32299]     0 32299     1736       51       3        0 
           0 sh
[ 1063.905870] [32300]     0 32300     1737       42       3        0 
           0 gcc
[ 1063.913587] [32301]     0 32301     2464      199       3        0 
           0 cc1
[ 1063.921303] [32302]     0 32302     1736       51       4        0 
           0 sh
[ 1063.928931] [32303]     0 32303     1736       51       3        0 
           0 sh
[ 1063.936552] [32304]     0 32304     1737       42       4        0 
           0 gcc
[ 1063.944272] [32305]     0 32305     1737       43       4        0 
           0 gcc
[ 1063.951993] [32306]     0 32306     2464      199       4        0 
           0 cc1
[ 1063.959708] [32307]     0 32307       45       12       3        0 
           0 as
[ 1063.967329] [32308]     0 32308     2464      200       4        0 
           0 cc1
[ 1063.975044] [32309]     0 32309     1925      222       4        0 
           0 as
[ 1063.982674] [32310]     0 32310      211       61       3       63 
           0 systemd
[ 1063.990735] Out of memory: Kill process 620 (NetworkManager) score 0 
or sacrifice child
[ 1063.998701] Killed process 620 (NetworkManager) total-vm:437120kB, 
anon-rss:3712kB, file-rss:12416kB
[ 1064.007830] NetworkManager: page allocation failure: order:0, 
mode:0x2015a
[ 1064.014689] CPU: 1 PID: 620 Comm: NetworkManager Not tainted 
3.19.0-rc3+ #2
[ 1064.021622] Hardware name: APM X-Gene Mustang board (DT)
[ 1064.026907] Call trace:
[ 1064.029355] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1064.034728] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1064.039761] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1064.044789] [<fffffe000017dc10>] warn_alloc_failed+0xe0/0x138
[ 1064.050513] [<fffffe00001809c8>] __alloc_pages_nodemask+0x658/0x8d8
[ 1064.056748] [<fffffe000017a690>] generic_file_read_iter+0x3c4/0x534
[ 1064.063030] [<fffffdfffc196754>] xfs_file_read_iter+0xe8/0x2a0 [xfs]
[ 1064.069367] [<fffffe00001d6500>] new_sync_read+0x84/0xc8
[ 1064.074654] [<fffffe00001d6ff4>] __vfs_read+0x14/0x50
[ 1064.079694] [<fffffe00001d70ac>] vfs_read+0x7c/0x154
[ 1064.084638] [<fffffe00001d71c4>] SyS_read+0x40/0xa0
[ 1064.089503] Mem-Info:
[ 1064.091765] DMA per-cpu:
[ 1064.094285] CPU    0: hi:    6, btch:   1 usd:   0
[ 1064.099056] CPU    1: hi:    6, btch:   1 usd:   0
[ 1064.103822] CPU    2: hi:    6, btch:   1 usd:   0
[ 1064.108586] CPU    3: hi:    6, btch:   1 usd:   0
[ 1064.113368] CPU    4: hi:    6, btch:   1 usd:   0
[ 1064.118134] CPU    5: hi:    6, btch:   1 usd:   0
[ 1064.122905] CPU    6: hi:    6, btch:   1 usd:   0
[ 1064.127671] CPU    7: hi:    6, btch:   1 usd:   0
[ 1064.132442] Normal per-cpu:
[ 1064.135220] CPU    0: hi:    6, btch:   1 usd:   0
[ 1064.139991] CPU    1: hi:    6, btch:   1 usd:   1
[ 1064.144755] CPU    2: hi:    6, btch:   1 usd:   0
[ 1064.149526] CPU    3: hi:    6, btch:   1 usd:   0
[ 1064.154290] CPU    4: hi:    6, btch:   1 usd:   0
[ 1064.159061] CPU    5: hi:    6, btch:   1 usd:   0
[ 1064.163826] CPU    6: hi:    6, btch:   1 usd:   0
[ 1064.168591] CPU    7: hi:    6, btch:   1 usd:   0
[ 1064.173365] active_anon:207484 inactive_anon:25769 isolated_anon:0
[ 1064.173365]  active_file:1247 inactive_file:21260 isolated_file:3
[ 1064.173365]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1064.173365]  free:1012 slab_reclaimable:1044 slab_unreclaimable:1748
[ 1064.173365]  mapped:661 shmem:66 pagetables:233 bounce:0
[ 1064.173365]  free_cma:0
[ 1064.204935] DMA free:52864kB min:4032kB low:4992kB high:6016kB 
active_anon:3030784kB inactive_anon:612608kB active_file:18304kB 
inactive_file:323520kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:11968kB shmem:2112kB 
slab_reclaimable:26240kB slab_unreclaimable:36096kB kernel_stack:336kB 
pagetables:4928kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:32080832 all_unreclaimable? yes
[ 1064.248071] lowmem_reserve[]: 0 765 765
[ 1064.251936] Normal free:11904kB min:12224kB low:15232kB high:18304kB 
active_anon:10248192kB inactive_anon:1036608kB active_file:61504kB 
inactive_file:1037120kB unevictable:0kB isolated(anon):0kB 
isolated(file):192kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:30336kB shmem:2112kB 
slab_reclaimable:40576kB slab_unreclaimable:75776kB kernel_stack:2976kB 
pagetables:9984kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:98616256 all_unreclaimable? yes
[ 1064.296280] lowmem_reserve[]: 0 0 0
[ 1064.299798] DMA: 120*64kB (UR) 49*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 52864kB
[ 1064.313134] Normal: 142*64kB (EMR) 9*128kB (R) 6*256kB (R) 0*512kB 
0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 11776kB
[ 1064.326021] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1064.334600] 43816 total pagecache pages
[ 1064.338418] 25866 pages in swap cache
[ 1064.342068] Swap cache stats: add 28226, delete 2360, find 2497/2709
[ 1064.348389] Free swap  = 6636800kB
[ 1064.351777] Total swap = 8388544kB
[ 1064.355161] 262112 pages RAM
[ 1064.358025] 0 pages HighMem/MovableOnly
[ 1064.361844] 18446744073709544361 pages reserved
[ 1064.366349] 8192 pages cma reserved

Message from syslogd@localhost at Jan  8 07:46:45 ...
  kernel:Call trace:

Message from syslogd@localhost at Jan  8 07:46:45 ...
  kernel:Call trace:
[ 1065.640588] systemd-journal invoked oom-killer: gfp_mask=0x200da, 
order=0, oom_score_adj=0
[ 1065.648823] systemd-journal cpuset=/ mems_allowed=0
[ 1065.653724] CPU: 7 PID: 481 Comm: systemd-journal Not tainted 
3.19.0-rc3+ #2
[ 1065.660753] Hardware name: APM X-Gene Mustang board (DT)
[ 1065.666041] Call trace:
[ 1065.668484] [<fffffe0000096954>] dump_backtrace+0x0/0x16c
[ 1065.673878] [<fffffe0000096ad0>] show_stack+0x10/0x1c
[ 1065.678910] [<fffffe000062911c>] dump_stack+0x74/0x98
[ 1065.683957] [<fffffe0000626550>] dump_header.isra.11+0x98/0x1d0
[ 1065.689867] [<fffffe000017c4e0>] oom_kill_process+0x3ac/0x408
[ 1065.695590] [<fffffe000017c9d4>] out_of_memory+0x2a4/0x2d4
[ 1065.701067] [<fffffe0000180b78>] __alloc_pages_nodemask+0x808/0x8d8
[ 1065.707307] [<fffffe00001a443c>] handle_mm_fault+0x938/0xd1c
[ 1065.712945] [<fffffe00000a01c8>] do_page_fault+0x214/0x338
[ 1065.718403] [<fffffe000009022c>] do_mem_abort+0x38/0x9c
[ 1065.723607] Exception stack(0xfffffe03d6323bf0 to 0xfffffe03d6323d10)
[ 1065.730022] 3be0:                                     d6109780 
fffffe03 00000000 00000000
[ 1065.738159] 3c00: d6323db0 fffffe03 0031c184 fffffe00 00000005 
00000000 d6323cf0 fffffe03
[ 1065.746301] 3c20: d6323d00 fffffe03 ff0a0004 ffffffff d6323c70 
fffffe03 001f8b84 fffffe00
[ 1065.754443] 3c40: d6109780 fffffe03 00100073 00000000 d6323cf0 
fffffe03 d6323cf0 fffffe03
[ 1065.762584] 3c60: d6323cc0 fffffe03 ffffffd0 00000000 d6323cb0 
fffffe03 0023d2b0 fffffe00
[ 1065.770725] 3c80: 00d7d1d8 fffffe00 00000140 00000000 94560000 
000003ff ebed0008 fffffe03
[ 1065.778861] 3ca0: 0000006d 00000000 65703a39 655f6672 9456007d 
000003ff 002b5d34 00000000
[ 1065.787002] 3cc0: d9ff1000 fffffe03 00000000 00000000 00000000 
00000000 fefefefe ff2efefe
[ 1065.795143] 3ce0: ffffffff 7f7fffff 01010101 01010101 00000038 
00000000 00000038 00000000
[ 1065.803289] 3d00: ffffffff 0fffffff ffffffed ffffffff
[ 1065.808319] [<fffffe0000092b64>] el1_da+0x14/0x70
[ 1065.813007] [<fffffe00001d6ff4>] __vfs_read+0x14/0x50
[ 1065.818032] [<fffffe00001d70ac>] vfs_read+0x7c/0x154
[ 1065.822978] [<fffffe00001d71c4>] SyS_read+0x40/0xa0
[ 1065.827829] Mem-Info:
[ 1065.830097] DMA per-cpu:
[ 1065.832617] CPU    0: hi:    6, btch:   1 usd:   0
[ 1065.837381] CPU    1: hi:    6, btch:   1 usd:   0
[ 1065.842151] CPU    2: hi:    6, btch:   1 usd:   0
[ 1065.846916] CPU    3: hi:    6, btch:   1 usd:   0
[ 1065.851684] CPU    4: hi:    6, btch:   1 usd:   0
[ 1065.856449] CPU    5: hi:    6, btch:   1 usd:   0
[ 1065.861219] CPU    6: hi:    6, btch:   1 usd:   0
[ 1065.865984] CPU    7: hi:    6, btch:   1 usd:   0
[ 1065.870752] Normal per-cpu:
[ 1065.873529] CPU    0: hi:    6, btch:   1 usd:   0
[ 1065.878294] CPU    1: hi:    6, btch:   1 usd:   0
[ 1065.883063] CPU    2: hi:    6, btch:   1 usd:   0
[ 1065.887828] CPU    3: hi:    6, btch:   1 usd:   0
[ 1065.892598] CPU    4: hi:    6, btch:   1 usd:   0
[ 1065.897362] CPU    5: hi:    6, btch:   1 usd:   0
[ 1065.902133] CPU    6: hi:    6, btch:   1 usd:   0
[ 1065.906897] CPU    7: hi:    6, btch:   1 usd:   0
[ 1065.911671] active_anon:207543 inactive_anon:25751 isolated_anon:0
[ 1065.911671]  active_file:1192 inactive_file:21238 isolated_file:0
[ 1065.911671]  unevictable:0 dirty:1 writeback:0 unstable:0
[ 1065.911671]  free:1016 slab_reclaimable:1044 slab_unreclaimable:1748
[ 1065.911671]  mapped:572 shmem:68 pagetables:224 bounce:0
[ 1065.911671]  free_cma:1
[ 1065.943242] DMA free:52928kB min:4032kB low:4992kB high:6016kB 
active_anon:3030720kB inactive_anon:612544kB active_file:18496kB 
inactive_file:323008kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:4192256kB managed:4177024kB mlocked:0kB 
dirty:64kB writeback:0kB mapped:11904kB shmem:2112kB 
slab_reclaimable:26240kB slab_unreclaimable:36096kB kernel_stack:336kB 
pagetables:4736kB unstable:0kB bounce:0kB free_cma:64kB 
writeback_tmp:0kB pages_scanned:25027136 all_unreclaimable? yes
[ 1065.986550] lowmem_reserve[]: 0 765 765
[ 1065.990433] Normal free:12480kB min:12224kB low:15232kB high:18304kB 
active_anon:10252032kB inactive_anon:1035776kB active_file:57792kB 
inactive_file:1036672kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:12582912kB managed:12538176kB mlocked:0kB 
dirty:0kB writeback:0kB mapped:24704kB shmem:2240kB 
slab_reclaimable:40576kB slab_unreclaimable:75776kB kernel_stack:2928kB 
pagetables:9600kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:81503232 all_unreclaimable? yes
[ 1066.034608] lowmem_reserve[]: 0 0 0
[ 1066.038124] DMA: 122*64kB (UR) 49*128kB (R) 8*256kB (R) 0*512kB 
0*1024kB 0*2048kB 1*4096kB (R) 0*8192kB 0*16384kB 1*32768kB (R) 
0*65536kB = 52992kB
[ 1066.051465] Normal: 116*64kB (EMR) 18*128kB (MR) 6*256kB (R) 0*512kB 
0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 11264kB
[ 1066.064522] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=524288kB
[ 1066.073093] 43745 total pagecache pages
[ 1066.076909] 25845 pages in swap cache
[ 1066.080557] Swap cache stats: add 28232, delete 2387, find 2500/2717
[ 1066.086877] Free swap  = 6643328kB
[ 1066.090264] Total swap = 8388544kB
[ 1066.093648] 262112 pages RAM
[ 1066.096511] 0 pages HighMem/MovableOnly
[ 1066.100330] 18446744073709544361 pages reserved
[ 1066.104835] 8192 pages cma reserved
[ 1066.108305] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents 
oom_score_adj name
[ 1066.116115] [  481]     0   481      234      112       3       26 
           0 systemd-journal
[ 1066.124864] [  487]     0   487     1420       32       3       80 
           0 lvmetad
[ 1066.132920] [  512]     0   512      216       40       2       46 
       -1000 systemd-udevd
[ 1066.141498] [  598]     0   598      258       45       3       38 
       -1000 auditd
[ 1066.149467] [  623]     0   623     2028       83       3      100 
           0 abrtd
[ 1066.157345] [  624]     0   624     2019       87       3       84 
           0 abrt-watch-log
[ 1066.166006] [  628]    70   628       87       55       4       18 
           0 avahi-daemon
[ 1066.174494] [  629]   998   629       40       13       2        0 
           0 lsmd
[ 1066.182291] [  631]    70   631       87        0       4       32 
           0 avahi-daemon
[ 1066.190779] [  641]     0   641     2687       82       3       56 
           0 rsyslogd
[ 1066.198915] [  642]     0   642       86       42       3        9 
           0 irqbalance
[ 1066.207230] [  647]     0   647      109       44       2       31 
           0 smartd
[ 1066.215199] [  651]   997   651       83       45       2       18 
           0 chronyd
[ 1066.223255] [  652]     0   652       99       38       3       36 
           0 systemd-logind
[ 1066.231916] [  653]    81   653      222       53       2       21 
        -900 dbus-daemon
[ 1066.240319] [  667]     0   667       79       27       3       29 
           0 atd
[ 1066.248024] [  675]     0   675     1846       68       4       86 
           0 login
[ 1066.255907] [  676]     0   676     1717       23       5       10 
           0 agetty
[ 1066.263878] [  710]     0   710     1761       47       3       36 
           0 bash
[ 1066.271674] [  727]     0   727      240       84       2       74 
       -1000 sshd
[ 1066.279472] [  733]     0   733     1739       34       3        9 
           0 rhsmcertd
[ 1066.287695] [  741]   999   741     7276      132       4      105 
           0 polkitd
[ 1066.295750] [ 1323]     0  1323      334       69       3       69 
           0 master
[ 1066.303720] [ 1325]    89  1325      337       82       2       78 
           0 qmgr
[ 1066.311516] [ 1328]     0  1328      358      130       3      111 
           0 sshd
[ 1066.319313] [ 1331]     0  1331     1762       48       5       39 
           0 bash
[ 1066.327105] [ 1373]    89  1373      336       89       2       69 
           0 pickup
[ 1066.335073] [ 2944]     0  2944     1734       50       4        3 
           0 make
[ 1066.342871] [31425]     0 31425     1760       82       3        0 
           0 make
[ 1066.350668] [32224]     0 32224     1736       51       4        0 
           0 sh
[ 1066.358287] [32232]     0 32232     1736       51       3        0 
           0 sh
[ 1066.365912] [32239]     0 32239     1737       42       5        0 
           0 gcc
[ 1066.373624] [32245]     0 32245     1736       51       4        0 
           0 sh
[ 1066.381249] [32246]     0 32246     2474      313       4        0 
           0 cc1
[ 1066.388953] [32248]     0 32248     1737       42       4        0 
           0 gcc
[ 1066.396667] [32250]     0 32250     1736       52       3        0 
           0 sh
[ 1066.404293] [32252]     0 32252     1737       43       4        0 
           0 gcc
[ 1066.412005] [32253]     0 32253     1736       51       5        0 
           0 sh
[ 1066.419631] [32254]     0 32254     2474      313       4        0 
           0 cc1
[ 1066.427336] [32257]     0 32257     2474      314       3        0 
           0 cc1
[ 1066.435046] [32258]     0 32258     1737       42       4        0 
           0 gcc
[ 1066.442757] [32260]     0 32260     1736       51       3        0 
           0 sh
[ 1066.450387] [32261]     0 32261     1737       42       3        0 
           0 gcc
[ 1066.458092] [32264]     0 32264     2473      295       3        0 
           0 cc1
[ 1066.465802] [32273]     0 32273     1736       51       5        0 
           0 sh
[ 1066.473426] [32276]     0 32276     1737       43       4        0 
           0 gcc
[ 1066.481136] [32278]     0 32278     1736       51       4        0 
           0 sh
[ 1066.488755] [32280]     0 32280     1737       43       4        0 
           0 gcc
[ 1066.496464] [32282]     0 32282     2474      314       4        0 
           0 cc1
[ 1066.504175] [32286]     0 32286     1736       52       4        0 
           0 sh
[ 1066.511799] [32287]     0 32287     2474      313       5        0 
           0 cc1
[ 1066.519509] [32291]     0 32291     1737       42       3        0 
           0 gcc
[ 1066.527213] [32294]     0 32294     2473      296       4        0 
           0 cc1
[ 1066.534923] [32299]     0 32299     1736       51       3        0 
           0 sh
[ 1066.542547] [32300]     0 32300     1737       42       3        0 
           0 gcc
[ 1066.550257] [32301]     0 32301     2464      199       3        0 
           0 cc1
[ 1066.557961] [32302]     0 32302     1736       51       4        0 
           0 sh
[ 1066.565584] [32303]     0 32303     1736       51       3        0 
           0 sh
[ 1066.573208] [32304]     0 32304     1737       42       4        0 
           0 gcc
[ 1066.580917] [32305]     0 32305     1737       43       4        0 
           0 gcc
[ 1066.588622] [32306]     0 32306     2466      232       4        0 
           0 cc1
[ 1066.596332] [32307]     0 32307     1918      222       5        0 
           0 as
[ 1066.603955] [32308]     0 32308     2464      200       4        0 
           0 cc1
[ 1066.611666] [32310]     0 32310       42       13       2        0 
           0 nm-dispatcher
[ 1066.620239] [32311]     0 32311        7        1       1        0 
           0 systemd-cgroups
[ 1066.628980] Out of memory: Kill process 32282 (cc1) score 0 or 
sacrifice child
[ 1066.636172] Killed process 32282 (cc1) total-vm:158336kB, 
anon-rss:10816kB, file-rss:9280kB

Message from syslogd@localhost at Jan  8 07:46:47 ...
  kernel:Call trace:



* Re: Linux 3.19-rc3
  2015-01-06  2:46 ` Dave Jones
  2015-01-06  8:18   ` Takashi Iwai
@ 2015-01-06  9:45   ` Jiri Kosina
  1 sibling, 0 replies; 101+ messages in thread
From: Jiri Kosina @ 2015-01-06  9:45 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Linux Kernel Mailing List, Al Viro, Eric Paris

On Mon, 5 Jan 2015, Dave Jones wrote:

>  > It's a day delayed - not because of any particular development issues,
>  > but simply because I was tiling a bathroom yesterday. But rc3 is out
>  > there now, and things have stayed reasonably calm. I really hope that
>  > implies that 3.19 is looking good, but it's equally likely that it's
>  > just that people are still recovering from the holiday season.
>  > 
>  > A bit over three quarters of the changes here are drivers - mostly
>  > networking, thermal, input layer, sound, power management. The rest is
>  > misc - filesystems, core networking, some arch fixes, etc. But all of
>  > it is pretty small.
>  > 
>  > So go out and test,
>  
> This has been there since just before rc1. Is there a fix for this
> stalled in someone's git tree maybe?

This is an issue in the fanotify code that has been there for ages, but
only now has the scheduler started to warn about it.

I've reported it to Eric here:

	https://lkml.org/lkml/2014/12/30/95

but no response so far. Adding Eric to CC here as well.

> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303 __might_sleep+0x8d/0xa0()
> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100 
> [    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88 ffffffff915b47c7
> [    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8 ffffffff91062c30
> [    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d 0000000000000000
> [    7.952600] Call Trace:
> [    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
> [    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
> [    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
> [    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
> [    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> [    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> [    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
> [    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
> [    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
> [    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
> [    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
> [    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
> [    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
> [    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
> [    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
> [    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
> [    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17
> 
> 

-- 
Jiri Kosina
SUSE Labs



* Re: Linux 3.19-rc3
  2015-01-06  2:46 ` Dave Jones
@ 2015-01-06  8:18   ` Takashi Iwai
  2015-01-06  9:45   ` Jiri Kosina
  1 sibling, 0 replies; 101+ messages in thread
From: Takashi Iwai @ 2015-01-06  8:18 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linus Torvalds, Linux Kernel Mailing List, Al Viro, Eric Paris,
	Peter Zijlstra

At Mon, 5 Jan 2015 21:46:34 -0500,
Dave Jones wrote:
> 
> On Mon, Jan 05, 2015 at 05:46:15PM -0800, Linus Torvalds wrote:
>  > It's a day delayed - not because of any particular development issues,
>  > but simply because I was tiling a bathroom yesterday. But rc3 is out
>  > there now, and things have stayed reasonably calm. I really hope that
>  > implies that 3.19 is looking good, but it's equally likely that it's
>  > just that people are still recovering from the holiday season.
>  > 
>  > A bit over three quarters of the changes here are drivers - mostly
>  > networking, thermal, input layer, sound, power management. The rest is
>  > misc - filesystems, core networking, some arch fixes, etc. But all of
>  > it is pretty small.
>  > 
>  > So go out and test,
>  
> This has been there since just before rc1. Is there a fix for this
> stalled in someone's git tree maybe?
> 
> [    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303 __might_sleep+0x8d/0xa0()
> [    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
> [    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100 
> [    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88 ffffffff915b47c7
> [    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8 ffffffff91062c30
> [    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d 0000000000000000
> [    7.952600] Call Trace:
> [    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
> [    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
> [    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
> [    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
> [    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> [    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
> [    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
> [    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
> [    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
> [    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
> [    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
> [    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
> [    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
> [    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
> [    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
> [    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
> [    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
> [    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17

Just "me too" (but overlooked until recently).

The cause is a mutex_lock() call right after prepare_to_wait() with
TASK_INTERRUPTIBLE in fanotify_read().

static ssize_t fanotify_read(struct file *file, char __user *buf,
			     size_t count, loff_t *pos)
{
	....
	while (1) {
		prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
		mutex_lock(&group->notification_mutex);


I saw Peter already fixed similar code in inotify_user.c with commit
e23738a7300a (but interestingly for a different reason, "Deal with
nested sleeps").  Presumably a similar fix is needed for
fanotify_user.c.
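
To make the ordering problem concrete, here is a minimal userspace
model of it.  This is only a sketch, not kernel code: every model_*
name is hypothetical, and the behaviour of the blocking op (warn and
fall back to the running state) is an assumption based on the
__might_sleep() warning quoted above.  The second helper shows a
wake-flag style of waiting, which, as far as I understand, is the
general idea behind the inotify change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_state { MODEL_RUNNING = 0, MODEL_INTERRUPTIBLE = 1 };

struct model_task {
	enum model_state state;	/* stands in for the task's sleep state */
	bool woken;		/* stands in for a recorded wakeup */
	int warnings;		/* count of "blocking op when !RUNNING" hits */
};

/* Models prepare_to_wait(): the task marks itself as about to sleep. */
static void model_prepare_to_wait(struct model_task *t)
{
	assert(t);
	t->state = MODEL_INTERRUPTIBLE;
}

/*
 * Models a blocking op such as mutex_lock().  Assumption: it warns when
 * entered in a non-running state and leaves the task in the running state.
 */
static void model_mutex_lock(struct model_task *t)
{
	if (t->state != MODEL_RUNNING)
		t->warnings++;
	t->state = MODEL_RUNNING;
}

/* Models schedule(): returns true only if the task would really sleep. */
static bool model_schedule(const struct model_task *t)
{
	return t->state != MODEL_RUNNING;
}

/* Ordering as in the quoted loop: set the sleep state, then take the lock. */
static bool model_buggy_wait(struct model_task *t)
{
	model_prepare_to_wait(t);
	model_mutex_lock(t);		/* warns and resets the state */
	/* ... look at the queue, drop the lock ... */
	return model_schedule(t);	/* false: no sleep, the loop spins */
}

/*
 * Wake-flag ordering: the blocking op runs in the running state, and the
 * sleep state is set only afterwards, unless a wakeup was already recorded.
 */
static bool model_woken_wait(struct model_task *t)
{
	bool slept;

	model_mutex_lock(t);
	/* ... look at the queue, drop the lock ... */
	if (t->woken) {
		t->woken = false;	/* consume the wakeup, do not sleep */
		return false;
	}
	t->state = MODEL_INTERRUPTIBLE;
	slept = model_schedule(t);
	t->state = MODEL_RUNNING;
	return slept;
}
```

In this model the first helper records one warning and never sleeps,
while the second sleeps only when no wakeup has been recorded.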

Eric, any fixes planned?


thanks,

Takashi


* Re: Linux 3.19-rc3
  2015-01-06  1:46 Linus Torvalds
@ 2015-01-06  2:46 ` Dave Jones
  2015-01-06  8:18   ` Takashi Iwai
  2015-01-06  9:45   ` Jiri Kosina
  2015-01-08 12:51 ` Mark Langsdorf
  1 sibling, 2 replies; 101+ messages in thread
From: Dave Jones @ 2015-01-06  2:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Al Viro

On Mon, Jan 05, 2015 at 05:46:15PM -0800, Linus Torvalds wrote:
 > It's a day delayed - not because of any particular development issues,
 > but simply because I was tiling a bathroom yesterday. But rc3 is out
 > there now, and things have stayed reasonably calm. I really hope that
 > implies that 3.19 is looking good, but it's equally likely that it's
 > just that people are still recovering from the holiday season.
 > 
 > A bit over three quarters of the changes here are drivers - mostly
 > networking, thermal, input layer, sound, power management. The rest is
 > misc - filesystems, core networking, some arch fixes, etc. But all of
 > it is pretty small.
 > 
 > So go out and test,
 
This has been there since just before rc1. Is there a fix for this
stalled in someone's git tree maybe?

[    7.952588] WARNING: CPU: 0 PID: 299 at kernel/sched/core.c:7303 __might_sleep+0x8d/0xa0()
[    7.952592] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
[    7.952595] CPU: 0 PID: 299 Comm: systemd-readahe Not tainted 3.19.0-rc3+ #100 
[    7.952597]  0000000000001c87 00000000720a2c76 ffff8800b2513c88 ffffffff915b47c7
[    7.952598]  ffffffff910a3648 ffff8800b2513ce0 ffff8800b2513cc8 ffffffff91062c30
[    7.952599]  0000000000000000 ffffffff91796fb2 000000000000026d 0000000000000000
[    7.952600] Call Trace:
[    7.952603]  [<ffffffff915b47c7>] dump_stack+0x4c/0x65
[    7.952604]  [<ffffffff910a3648>] ? down_trylock+0x28/0x40
[    7.952606]  [<ffffffff91062c30>] warn_slowpath_common+0x80/0xc0
[    7.952607]  [<ffffffff91062cc0>] warn_slowpath_fmt+0x50/0x70
[    7.952608]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
[    7.952610]  [<ffffffff910a0f7a>] ? prepare_to_wait+0x2a/0x90
[    7.952611]  [<ffffffff910867ed>] __might_sleep+0x8d/0xa0
[    7.952614]  [<ffffffff915b8ea9>] mutex_lock_nested+0x39/0x3e0
[    7.952616]  [<ffffffff910a77ad>] ? trace_hardirqs_on+0xd/0x10
[    7.952617]  [<ffffffff910a0fac>] ? prepare_to_wait+0x5c/0x90
[    7.952620]  [<ffffffff911a63e0>] fanotify_read+0xe0/0x5b0
[    7.952622]  [<ffffffff91090801>] ? sched_clock_cpu+0xc1/0xd0
[    7.952624]  [<ffffffff91242459>] ? selinux_file_permission+0xb9/0x130
[    7.952626]  [<ffffffff910a14d0>] ? prepare_to_wait_event+0xf0/0xf0
[    7.952628]  [<ffffffff91162513>] __vfs_read+0x13/0x50
[    7.952629]  [<ffffffff911625d8>] vfs_read+0x88/0x140
[    7.952631]  [<ffffffff911626e7>] SyS_read+0x57/0xd0
[    7.952633]  [<ffffffff915bd952>] system_call_fastpath+0x12/0x17




* Linux 3.19-rc3
@ 2015-01-06  1:46 Linus Torvalds
  2015-01-06  2:46 ` Dave Jones
  2015-01-08 12:51 ` Mark Langsdorf
  0 siblings, 2 replies; 101+ messages in thread
From: Linus Torvalds @ 2015-01-06  1:46 UTC (permalink / raw)
  To: Linux Kernel Mailing List

It's a day delayed - not because of any particular development issues,
but simply because I was tiling a bathroom yesterday. But rc3 is out
there now, and things have stayed reasonably calm. I really hope that
implies that 3.19 is looking good, but it's equally likely that it's
just that people are still recovering from the holiday season.

A bit over three quarters of the changes here are drivers - mostly
networking, thermal, input layer, sound, power management. The rest is
misc - filesystems, core networking, some arch fixes, etc. But all of
it is pretty small.

So go out and test,

                       Linus

---

Aaron Lu (1):
      ACPI / video: Add some Samsung models to disable_native_backlight list

Abhilash Kesavan (1):
      drivers: thermal: Remove ARCH_HAS_BANDGAP dependency for samsung

Al Viro (3):
      Bluetooth: hidp_connection_add() unsafe use of l2cap_pi()
      Bluetooth: cmtp: cmtp_add_connection() should verify that it's
dealing with l2cap socket
      Bluetooth: bnep: bnep_add_connection() should verify that it's
dealing with l2cap socket

Alan Stern (1):
      SCSI: fix regression in scsi_send_eh_cmnd()

Alexandre Belloni (1):
      mmc: core: stop trying to switch width when only one bit is supported

Amir Vadai (1):
      net/mlx4_en: Doorbell is byteswapped in Little Endian archs

Amit Daniel Kachhap (1):
      PM / Domains: Export of_genpd_get_from_provider function

Andrew Bresticker (2):
      spi: img-spfi: Enable controller before starting TX DMA
      spi: img-spfi: Increase DMA burst size

Andrew Jackson (2):
      ASoC: dwc: Ensure FIFOs are flushed to prevent channel swap
      ASoC: dwc: Iterate over all channels

Anil Chintalapati (achintal) (1):
      fnic: IOMMU Fault occurs when IO and abort IO is out of order

Aniroop Mathur (1):
      Input: evdev - add CLOCK_BOOTTIME support

Anshul Garg (1):
      Input: optimize events_per_packet count calculation

Antonio Quartulli (1):
      batman-adv: avoid NULL dereferences and fix if check

Appana Durga Kedareswara Rao (1):
      net: xilinx: Remove unnecessary temac_property in the driver

Asaf Vertz (1):
      Input: edt-ft5x06 - fixed a macro coding style issue

Catalin Marinas (1):
      clocksource: arch_timer: Only use the virtual counter (CNTVCT) on arm64

Dan Carpenter (2):
      thermal: cpu_cooling: small memory leak on error
      OMAPDSS: pll: NULL dereference in error handling

Dan Collins (1):
      packet: Fixed TPACKET V3 to signal poll when block is closed
rather than every packet

Daniel Borkmann (1):
      x86, um: actually mark system call tables readonly

Daniel Glöckner (1):
      net: s6gmac: remove driver

David S. Miller (1):
      genetlink: A genl_bind() to an out-of-range multicast group
should not WARN().

Dmitry Torokhov (6):
      Input: gpio_keys - allow separating gpio and irq in device tree
      Input: gpio_keys - replace timer and workqueue with delayed workqueue
      PM / OPP: add some lockdep annotations
      PM / OPP: fix warning in of_free_opp_table()
      PM / OPP: take RCU lock in dev_pm_opp_get_opp_count
      cpufreq-dt: defer probing if OPP table is not ready

Eduardo Valentin (3):
      thermal: cpu_cooling: check for the readiness of cpufreq layer
      thermal: db8500: Do not print error message in the EPROBE_DEFER case
      thermal: ti-soc-thermal: Do not print error message in the
EPROBE_DEFER case

Eliad Peller (1):
      iwlwifi: mvm: clear IN_HW_RESTART flag on stop()

Emmanuel Grumbach (3):
      iwlwifi: pcie: re-ACK all interrupts after device reset
      iwlwifi: don't double free a pointer if no FW was found
      iwlwifi: add new device IDs for 3165

Ethan Zhao (1):
      cpufreq: fix a NULL pointer dereference in __cpufreq_governor()

Fabio Estevam (1):
      thermal: imx: Do not print error message in the EPROBE_DEFER case

Fang, Yang A (1):
      ASoC: rt5677: fixed rt5677_dsp_vad_put rt5677_dsp_vad_get panic

Geert Uytterhoeven (1):
      selftests/exec: Use %zu to format size_t

Govindarajulu Varadarajan (1):
      enic: fix rx skb checksum

Gregory CLEMENT (1):
      ARM: mvebu: Fix pinctrl configuration for Armada 370 DB

Haiyang Zhang (1):
      hyperv: Fix some variable name typos in send-buffer init/revoke

Hans de Goede (4):
      Input: alps - v7: ignore new packets
      Input: alps - v7: sometimes a single touch is reported in mt[1]
      Input: alps - v7: fix finger counting for > 2 fingers on clickpads
      Input: alps - v7: document the v7 touchpad packet protocol

Hari Bathini (1):
      powerpc/kdump: Ignore failure in enabling big endian exception
during crash

Hariprasad Shenai (1):
      cxgb4vf: Fix ethtool get_settings for VF driver

Herbert Xu (6):
      virtio_net: Fix napi poll list corruption
      caif: Fix napi poll list corruption
      net: Move napi polling code out of net_rx_action
      net: Detect drivers that reschedule NAPI and exhaust budget
      net: Always poll at least one device in net_rx_action
      net: Rearrange loop in net_rx_action

Hisashi Nakamura (1):
      spi: sh-msiof: Add runtime PM lock in initializing

Huacai Chen (1):
      stmmac: Don't init ptp again when resume from suspend/hibernation

Ilkka Koskinen (1):
      Thermal/int340x: Handle properly the case when _trt or _art acpi
entry is missing

Jacob Pan (2):
      powercap / RAPL: add IDs for future Xeon CPUs
      thermal/powerclamp: add ids for future xeon cpus

Jan Kara (6):
      isofs: Fix unchecked printing of ER records
      udf: Verify i_size when loading inode
      udf: Verify symlink size before loading it
      udf: Check path length when reading symlink
      udf: Check component length before reading it
      udf: Reduce repeated dereferences

Jarkko Nikula (3):
      ASoC: Intel: Add I2C dependency to two new machines
      ASoC: Intel: Fix BYTCR firmware name
      ASoC: Intel: Fix BYTCR machine driver MODULE_ALIAS

Jason Wang (1):
      net: drop the packet when fails to do software segmentation or
header check

Javi Merino (2):
      thermal: cpu_cooling: return ERR_PTR() for !CPU_THERMAL or !THERMAL_OF
      thermal: cpu_cooling: document node in struct cpufreq_cooling_device

Jay Vosburgh (1):
      net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding

Jesse Gross (1):
      net: Generalize ndo_gso_check to ndo_features_check

Jia-Ju Bai (3):
      8139too: Fix the lack of pci_disable_device
      8139too: Add netif_napi_del in the driver
      ne2k-pci: Add pci_disable_device in error handling

Jianqun Xu (2):
      ASoC: rockchip: i2s: fix error defination of transmit data level
      ASoC: rockchip: i2s: fix maxburst of dma data to 4

Jie Yang (1):
      ASoC: Intel: correct the fixed free block allocation

Jiri Kosina (1):
      Revert "cfg80211: make WEXT compatibility unselectable"

Johan Hedberg (1):
      Bluetooth: Fix accepting connections when not using mgmt

Johan Hovold (1):
      net: phy: micrel: use generic config_init for KSZ8021/KSZ8031

Johannes Berg (6):
      netlink: rename netlink_unbind() to netlink_undo_bind()
      genetlink: pass only network namespace to genl_has_listeners()
      netlink: update listeners directly when removing socket
      netlink: call unbind when releasing socket
      genetlink: pass multicast bind/unbind to families
      netlink/genetlink: pass network namespace to bind/unbind

Jukka Rissanen (1):
      Bluetooth: 6lowpan: Do not free skb when packet is dropped

Kevin Cernekee (1):
      Fix signed/unsigned pointer warning

Krzysztof Kozlowski (1):
      regulator: s2mps11: Fix dw_mmc failure on Gear 2

Lars-Peter Clausen (1):
      ALSA: pcm: Fix kerneldoc for params_*() functions

Len Brown (3):
      cpuidle: menu: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID
      cpuidle: ladder: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID
      cpuidle / ACPI: remove unused CPUIDLE_FLAG_TIME_INVALID

Li RongQing (1):
      sunvnet: fix a memory leak in vnet_handle_offloads

Liad Kaufman (1):
      iwlwifi: pcie: limit fw chunk sizes given to fh

Linus Torvalds (2):
      Revert "Input: atmel_mxt_ts - use deep sleep mode when stopped"
      Linux 3.19-rc3

Linus Walleij (3):
      mfd: stmpe: add pull up/down register offsets for STMPE
      Input: stmpe - enforce device tree only mode
      Input: stmpe - bias keypad columns properly

Lukasz Majewski (1):
      thermal:core:fix: Check return code of the ->get_max_state() callback

Marcel Holtmann (1):
      Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR

Mark Brown (1):
      ASoC: dapm: Remove snd_soc_of_parse_audio_routing() due to deferred probe

Martin K. Petersen (1):
      sd: tweak discard heuristics to work around QEMU SCSI issue

Michael Ellerman (1):
      Revert "powerpc: Secondary CPUs must set cpu_callin_map after setting active and online"

Michael S. Tsirkin (2):
      virtio_ring: document alignment requirements
      vhost: relax used address alignment

Michal Hocko (1):
      mm: get rid of radix tree gfp mask for pagecache_get_page

Michal Privoznik (1):
      tools / cpupower: Correctly detect if running as root

Mika Westerberg (1):
      brcmfmac: Do not crash if platform data is not populated

Nakajima Akira (1):
      cifs: make new inode cache when file type is different

Nicholas Mc Guire (2):
      net: incorrect use of init_completion fixup
      Input: hil_kbd - fix incorrect use of init_completion

Nicolas Dichtel (2):
      tcp6: don't move IP6CB before xfrm6_policy_check()
      neigh: remove next ptr from struct neigh_table

Paul Bolle (1):
      ipw2200: select CFG80211_WEXT

Paul Moore (1):
      audit: create private file name copies when auditing inodes

Pavel Machek (1):
      Revert "ARM: 7830/1: delay: don't bother reporting bogomips in /proc/cpuinfo"

Pranith Kumar (1):
      powerpc: Wire up sys_execveat() syscall

Prarit Bhargava (1):
      tools / cpupower: Fix no idle state information return value

Prashant Sreedharan (1):
      tg3: tg3_disable_ints using uninitialized mailbox value to disable interrupts

Pravin B Shelar (6):
      mpls: Fix config check for mpls.
      mpls: Fix allowed protocols for mpls gso
      openvswitch: Fix MPLS action validation.
      openvswitch: Fix GSO with multiple MPLS label.
      openvswitch: Fix vport_send double free
      vxlan: Fix double free of skb.

Punit Agrawal (1):
      thermal: Fix cdev registration with THERMAL_NO_LIMIT on 64bit

Rabin Vincent (1):
      crypto: af_alg - fix backlog handling

Richard Weinberger (1):
      um: Skip futex_atomic_cmpxchg_inatomic() test

Rickard Strandqvist (1):
      net: ethernet: micrel: ksz884x.c: Remove unused function

Sachin Prabhu (1):
      Convert MessageID in smb2_hdr to LE

Srinivas Pandruvada (4):
      thermal: int340x: Introduce processor reporting device
      Thermal/int340x/int3403: Fix memory leak
      Thermal/int340x/processor_thermal: Fix memory leak
      Thermal/int340x/int3403: Free acpi notification handler

Steev Klimaszewski (1):
      Add USB_EHCI_EXYNOS to multi_v7_defconfig

Sven Eckelmann (2):
      batman-adv: Calculate extra tail size based on queued fragments
      batman-adv: Unify fragment size calculation

Thomas Graf (1):
      net: Reset secmark when scrubbing packet

Tobias Klauser (1):
      nios2: Use preempt_schedule_irq

Tomi Valkeinen (4):
      OMAPDSS: HDMI: remove double initializer entries
      video/logo: prevent use of logos after they have been freed
      video/fbdev: fix defio's fsync
      OMAPDSS: SDI: fix output port_num

Tony Luck (1):
      [IA64] Enable execveat syscall for ia64

Toshiaki Makita (1):
      net: Fix stacked vlan offload features computation

Viresh Kumar (25):
      thermal: db8500: pass cpu_present_mask to cpufreq_cooling_register()
      thermal: imx: pass cpu_present_mask to cpufreq_cooling_register()
      thermal: exynos: pass cpu_present_mask to cpufreq_cooling_register()
      thermal: cpu_cooling: random comment fixups
      thermal: cpu_cooling: fix doc comment over struct cpufreq_cooling_device
      thermal: cpu_cooling: Add comment to clarify relation between cooling state and frequency
      thermal: cpu_cooling: Pass variable instead of its type to sizeof()
      thermal: cpu_cooling: no need to set cpufreq_state to zero
      thermal: cpu_cooling: no need to set cpufreq_dev to NULL
      thermal: cpu_cooling: no need to initialze 'ret'
      thermal: cpu_cooling: propagate error returned by idr_alloc()
      thermal: cpu_cooling: Don't match min/max frequencies for all CPUs on cooling register
      thermal: cpu_cooling: don't iterate over all allowed_cpus to update cpufreq policy
      thermal: cpu_cooling: Don't check is_cpufreq_valid()
      thermal: cpu_cooling: do error handling at the bottom in __cpufreq_cooling_register()
      thermal: cpu_cooling: initialize 'cpufreq_val' on registration
      thermal: cpu_cooling: Merge cpufreq_apply_cooling() into cpufreq_set_cur_state()
      thermal: cpu_cooling: remove unnecessary wrapper get_cpu_frequency()
      thermal: cpu_cooling: find max level during device registration
      thermal: cpu_cooling: get_property() doesn't need to support GET_MAXL anymore
      thermal: cpu_cooling: use cpufreq_dev_list instead of cpufreq_dev_count
      thermal: cpu_cooling: Pass 'cpufreq_dev' to get_property()
      thermal: cpu_cooling: Store frequencies in descending order
      thermal: cpu_cooling: Use cpufreq_dev->freq_table for finding level/freq
      thermal: cpu_cooling: update copyright tags

Walter Goossens (1):
      nios2: Initialize cpuinfo.mmu

Wengang Wang (1):
      bonding: change error message to debug message in __bond_release_one()

Wolfram Sang (3):
      thermal: drop owner assignment from platform_drivers
      thermal: int340x_thermal: drop owner assignment from platform_drivers
      net: ethernet: stmicro: stmmac: drop owner assignment from platform_drivers

Wu Fengguang (1):
      openvswitch: fix odd_ptr_err.cocci warnings

haarp (1):
      Input: psmouse - expose drift duration for IBM trackpoints

stephen hemminger (1):
      in6: fix conflict with glibc


end of thread, newest message: ~2015-02-02 19:33 UTC

Thread overview: 101+ messages
2015-01-06  4:49 Linux 3.19-rc3 Sedat Dilek
2015-01-06  9:34 ` Sedat Dilek
2015-01-06  9:56   ` Takashi Iwai
2015-01-06 10:06     ` Sedat Dilek
2015-01-06 10:28       ` Takashi Iwai
2015-01-06 10:31         ` Sedat Dilek
2015-01-06 10:37           ` Takashi Iwai
2015-01-06 10:42             ` Sedat Dilek
2015-01-06  9:59   ` Peter Zijlstra
2015-01-06  9:40 ` Peter Zijlstra
2015-01-06  9:42   ` Sedat Dilek
2015-01-06  9:57     ` Sedat Dilek
2015-01-06 10:06       ` Peter Zijlstra
2015-01-06 10:18         ` Sedat Dilek
2015-01-06 11:01           ` Peter Zijlstra
2015-01-06 11:07             ` Kent Overstreet
2015-01-06 11:25               ` Sedat Dilek
2015-01-06 11:40                 ` Kent Overstreet
2015-01-06 12:51                   ` Sedat Dilek
2015-01-06 11:42               ` Peter Zijlstra
2015-01-06 11:48                 ` Peter Zijlstra
2015-01-06 12:01                   ` Kent Overstreet
2015-01-06 12:20                     ` Peter Zijlstra
2015-01-06 12:45                       ` Kent Overstreet
2015-01-06 12:55                       ` Peter Hurley
2015-01-06 17:38                         ` Paul E. McKenney
2015-01-06 17:58                           ` Peter Hurley
2015-01-06 19:25                             ` Paul E. McKenney
2015-01-06 19:57                               ` Peter Hurley
2015-01-06 20:47                                 ` Paul E. McKenney
2015-01-20  0:30                                   ` Paul E. McKenney
2015-01-20 14:03                                     ` Peter Hurley
2015-02-02 16:11                                       ` Paul E. McKenney
2015-02-02 19:03                                         ` Peter Hurley
2015-02-02 19:33                                           ` Paul E. McKenney
2015-01-06 11:56                 ` Kent Overstreet
2015-01-06 12:16                   ` Peter Zijlstra
2015-01-06 12:43                     ` Kent Overstreet
2015-01-06 13:03                       ` Peter Zijlstra
2015-01-06 13:28                         ` Kent Overstreet
2015-01-13 15:23                           ` Peter Zijlstra
2015-01-06 11:58               ` Peter Zijlstra
2015-01-06 12:18                 ` Kent Overstreet
2015-01-16 16:56               ` Peter Hurley
2015-01-16 17:00                 ` Chris Mason
2015-01-16 18:58                   ` Peter Hurley
2015-01-06 10:29   ` Sedat Dilek
  -- strict thread matches above, loose matches on Subject: below --
2015-01-06  1:46 Linus Torvalds
2015-01-06  2:46 ` Dave Jones
2015-01-06  8:18   ` Takashi Iwai
2015-01-06  9:45   ` Jiri Kosina
2015-01-08 12:51 ` Mark Langsdorf
2015-01-08 13:45   ` Catalin Marinas
2015-01-08 17:29     ` Mark Langsdorf
2015-01-08 17:34       ` Catalin Marinas
2015-01-08 18:48         ` Mark Langsdorf
2015-01-08 19:21           ` Linus Torvalds
2015-01-09 23:27             ` Catalin Marinas
2015-01-10  0:35               ` Kirill A. Shutemov
2015-01-10  2:27                 ` Linus Torvalds
2015-01-10  2:51                   ` David Lang
2015-01-10  3:06                     ` Linus Torvalds
2015-01-10 10:46                       ` Andreas Mohr
2015-01-10 19:42                         ` Linus Torvalds
2015-01-13  3:33                     ` Rik van Riel
2015-01-13 10:28                       ` Catalin Marinas
2015-01-10  3:17                   ` Tony Luck
2015-01-10 20:16                   ` Arnd Bergmann
2015-01-10 21:00                     ` Linus Torvalds
2015-01-10 21:36                       ` Arnd Bergmann
2015-01-10 21:48                         ` Linus Torvalds
2015-01-12 11:37                         ` Kirill A. Shutemov
2015-01-12 12:18                         ` Catalin Marinas
2015-01-12 13:57                           ` Arnd Bergmann
2015-01-12 14:23                             ` Catalin Marinas
2015-01-12 15:42                               ` Arnd Bergmann
2015-01-12 11:53                     ` Catalin Marinas
2015-01-12 13:15                       ` Arnd Bergmann
2015-01-08 15:08   ` Michal Hocko
2015-01-08 16:37     ` Mark Langsdorf
2015-01-09 15:56       ` Michal Hocko
2015-01-09 12:13   ` Mark Rutland
2015-01-09 14:19     ` Steve Capper
2015-01-09 14:27       ` Mark Langsdorf
2015-01-09 17:57         ` Mark Rutland
2015-01-09 18:37           ` Marc Zyngier
2015-01-09 19:43             ` Will Deacon
2015-01-10  3:29               ` Laszlo Ersek
2015-01-10  4:39                 ` Linus Torvalds
2015-01-10 13:37                   ` Will Deacon
2015-01-10 19:47                     ` Laszlo Ersek
2015-01-10 19:56                       ` Linus Torvalds
2015-01-10 20:08                         ` Laszlo Ersek
2015-01-10 19:51                     ` Linus Torvalds
2015-01-12 12:42                       ` Will Deacon
2015-01-12 13:22                         ` Mark Langsdorf
2015-01-12 19:03                         ` Dave Hansen
2015-01-12 19:06                         ` Linus Torvalds
2015-01-12 19:07                           ` Linus Torvalds
2015-01-12 19:24                             ` Will Deacon
2015-01-10 15:22                 ` Kyle McMartin
