LKML Archive on lore.kernel.org
* [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
@ 2018-05-23 23:22 Qing Huang
  2018-05-24  7:23 ` Gi-Oh Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Qing Huang @ 2018-05-23 23:22 UTC (permalink / raw)
  To: tariqt, davem, haakon.bugge, yanjun.zhu
  Cc: netdev, linux-rdma, linux-kernel, gi-oh.kim, Qing Huang

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter its slow path, doing memory compaction/migration
ops in order to complete high-order memory allocations.

When that happens, user processes calling uverbs APIs may easily get stuck
for more than 120s even though plenty of free pages
in smaller chunks are available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.
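The sizing math above can be checked with a small userspace sketch
(icm_table_bytes() is a hypothetical helper written for this illustration;
it assumes 8-byte MTT entries and a 64-bit host where pointers are 8 bytes):

```c
#include <stddef.h>

/* Reproduce the commit message's arithmetic: with log_num_mtt=30 there
 * are 1G MTT entries of 8 bytes each. A 4KB ICM chunk holds 512 of
 * them, so the table->icm pointer array needs 2M entries, i.e. 16MB on
 * a 64-bit host. With the old 256KB chunks it was only 256KB. */
static unsigned long icm_table_bytes(unsigned long nobj,
                                     unsigned long obj_size,
                                     unsigned long chunk_size)
{
    unsigned long obj_per_chunk = chunk_size / obj_size;
    /* Same round-up division as mlx4_init_icm_table() */
    unsigned long num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

    return num_icm * sizeof(void *);   /* bytes for the pointer array */
}
```

icm_table_bytes(1UL << 30, 8, 4096) evaluates to 16MB, while
icm_table_bytes(1UL << 30, 8, 1 << 18) is only 256KB, which is why the
old 256KB-chunk code could get away with kcalloc.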

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Acked-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
v4: use kvzalloc instead of vzalloc
    add one err condition check
    don't include vmalloc.h any more

v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u64 size;
 
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
+	if (WARN_ON(!obj_per_chunk))
+		return -EINVAL;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm      = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm      = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt     = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 		}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 		}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 }
-- 
2.9.3

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-23 23:22 [PATCH V4] mlx4_core: allocate ICM memory in page size chunks Qing Huang
@ 2018-05-24  7:23 ` Gi-Oh Kim
  2018-05-24  9:45 ` Tariq Toukan
  2018-05-25 14:23 ` David Miller
  2 siblings, 0 replies; 22+ messages in thread
From: Gi-Oh Kim @ 2018-05-24  7:23 UTC (permalink / raw)
  To: Qing Huang
  Cc: Tariq Toukan, davem, haakon.bugge, yanjun.zhu, netdev,
	linux-rdma, linux-kernel

On Thu, May 24, 2018 at 1:22 AM, Qing Huang <qing.huang@oracle.com> wrote:
> When a system is under memory pressure (high usage with fragmentation),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter its slow path, doing memory compaction/migration
> ops in order to complete high-order memory allocations.
>
> When that happens, user processes calling uverbs APIs may easily get stuck
> for more than 120s even though plenty of free pages
> in smaller chunks are available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
>
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
>
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.


Hi,

Could you please explain why it first tries to allocate contiguous pages?
I think it is necessary to comment on why it uses kvzalloc instead of vzalloc.
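For what it's worth, the semantics in question can be sketched in userspace
(a toy model only: fake_kmalloc_zeroed(), fake_vmalloc_zeroed(),
fake_kvzalloc() and the 4MB cap are all invented names; the real
implementation lives in the kernel's mm code):

```c
#include <stdlib.h>

/* Toy model of kvzalloc(): try a physically contiguous "kmalloc"
 * first, and only fall back to a "vmalloc" (virtually contiguous,
 * built from order-0 pages) when that fails. Both paths are simulated
 * with calloc(); the kmalloc path gets an artificial 4MB cap so the
 * fallback can be exercised. */
#define FAKE_KMALLOC_MAX (4UL << 20)

static void *fake_kmalloc_zeroed(size_t size)
{
    return size <= FAKE_KMALLOC_MAX ? calloc(1, size) : NULL;
}

static void *fake_vmalloc_zeroed(size_t size)
{
    return calloc(1, size);
}

static void *fake_kvzalloc(size_t size, int *used_vmalloc)
{
    void *p = fake_kmalloc_zeroed(size);  /* fast path: contiguous */

    *used_vmalloc = 0;
    if (!p) {                             /* slow path: page-by-page */
        p = fake_vmalloc_zeroed(size);
        *used_vmalloc = 1;
    }
    return p;
}
```

Contiguous memory is tried first because it is cheaper to map and walk;
vzalloc would pay the vmalloc overhead even for small tables, which is
the usual argument for preferring kvzalloc here.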


>
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>

+Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>

> ---
> v4: use kvzalloc instead of vzalloc
>     add one err condition check
>     don't include vmalloc.h any more
>
> v3: use PAGE_SIZE instead of PAGE_SHIFT
>     add comma to the end of enum variables
>     include vmalloc.h header file to avoid build issues on Sparc
>
> v2: adjusted chunk size to reflect different architectures
>
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..685337d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
>  #include "fw.h"
>
>  /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
> + * order memory allocations in fragmented/high usage memory situation.
>   */
>  enum {
> -       MLX4_ICM_ALLOC_SIZE     = 1 << 18,
> -       MLX4_TABLE_CHUNK_SIZE   = 1 << 18
> +       MLX4_ICM_ALLOC_SIZE     = PAGE_SIZE,
> +       MLX4_TABLE_CHUNK_SIZE   = PAGE_SIZE,
>  };
>
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>         u64 size;
>
>         obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
> +       if (WARN_ON(!obj_per_chunk))
> +               return -EINVAL;
>         num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>
> -       table->icm      = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> +       table->icm      = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
>         if (!table->icm)
>                 return -ENOMEM;
>         table->virt     = virt;
> @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>                         mlx4_free_icm(dev, table->icm[i], use_coherent);
>                 }
>
> -       kfree(table->icm);
> +       kvfree(table->icm);
>
>         return -ENOMEM;
>  }
> @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
>                         mlx4_free_icm(dev, table->icm[i], table->coherent);
>                 }
>
> -       kfree(table->icm);
> +       kvfree(table->icm);
>  }
> --
> 2.9.3
>



-- 
GIOH KIM
Linux Kernel Entwickler

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 176 2697 8962
Fax:      +49 30 577 008 299
Email:    gi-oh.kim@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss, Matthias Steinberg, Christoph Steffens


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-23 23:22 [PATCH V4] mlx4_core: allocate ICM memory in page size chunks Qing Huang
  2018-05-24  7:23 ` Gi-Oh Kim
@ 2018-05-24  9:45 ` Tariq Toukan
  2018-05-25 14:23 ` David Miller
  2 siblings, 0 replies; 22+ messages in thread
From: Tariq Toukan @ 2018-05-24  9:45 UTC (permalink / raw)
  To: Qing Huang, tariqt, davem, haakon.bugge, yanjun.zhu
  Cc: netdev, linux-rdma, linux-kernel, gi-oh.kim



On 24/05/2018 2:22 AM, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragmentation),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter its slow path, doing memory compaction/migration
> ops in order to complete high-order memory allocations.
> 
> When that happens, user processes calling uverbs APIs may easily get stuck
> for more than 120s even though plenty of free pages
> in smaller chunks are available in the system.
> 
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
> 
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
> 
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
> 
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
> 
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
> 
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> ---
> v4: use kvzalloc instead of vzalloc
>      add one err condition check
>      don't include vmalloc.h any more
> 
> v3: use PAGE_SIZE instead of PAGE_SHIFT
>      add comma to the end of enum variables
>      include vmalloc.h header file to avoid build issues on Sparc
> 
> v2: adjusted chunk size to reflect different architectures
> 
>   drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
>   1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..685337d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
>   #include "fw.h"
>   
>   /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
> + * order memory allocations in fragmented/high usage memory situation.
>    */
>   enum {
> -	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> -	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
> +	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
> +	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
>   };
>   
>   static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>   	u64 size;
>   
>   	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
> +	if (WARN_ON(!obj_per_chunk))
> +		return -EINVAL;
>   	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>   
> -	table->icm      = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> +	table->icm      = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
>   	if (!table->icm)
>   		return -ENOMEM;
>   	table->virt     = virt;
> @@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>   			mlx4_free_icm(dev, table->icm[i], use_coherent);
>   		}
>   
> -	kfree(table->icm);
> +	kvfree(table->icm);
>   
>   	return -ENOMEM;
>   }
> @@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
>   			mlx4_free_icm(dev, table->icm[i], table->coherent);
>   		}
>   
> -	kfree(table->icm);
> +	kvfree(table->icm);
>   }
> 

Thanks Qing.

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-23 23:22 [PATCH V4] mlx4_core: allocate ICM memory in page size chunks Qing Huang
  2018-05-24  7:23 ` Gi-Oh Kim
  2018-05-24  9:45 ` Tariq Toukan
@ 2018-05-25 14:23 ` David Miller
  2018-05-30  3:34   ` Eric Dumazet
  2 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2018-05-25 14:23 UTC (permalink / raw)
  To: qing.huang
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim

From: Qing Huang <qing.huang@oracle.com>
Date: Wed, 23 May 2018 16:22:46 -0700

> When a system is under memory pressure (high usage with fragmentation),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter its slow path, doing memory compaction/migration
> ops in order to complete high-order memory allocations.
> 
> When that happens, user processes calling uverbs APIs may easily get stuck
> for more than 120s even though plenty of free pages
> in smaller chunks are available in the system.
> 
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
> 
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
> 
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
> 
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
> 
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
> 
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Applied, thanks.


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-25 14:23 ` David Miller
@ 2018-05-30  3:34   ` Eric Dumazet
  2018-05-30  3:44     ` Eric Dumazet
  2018-05-30 17:53     ` Qing Huang
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2018-05-30  3:34 UTC (permalink / raw)
  To: David Miller, qing.huang
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim



On 05/25/2018 10:23 AM, David Miller wrote:
> From: Qing Huang <qing.huang@oracle.com>
> Date: Wed, 23 May 2018 16:22:46 -0700
> 
>> When a system is under memory pressure (high usage with fragmentation),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter its slow path, doing memory compaction/migration
>> ops in order to complete high-order memory allocations.
>>
>> When that happens, user processes calling uverbs APIs may easily get stuck
>> for more than 120s even though plenty of free pages
>> in smaller chunks are available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>>
>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>
>> However in order to support smaller ICM chunk size, we need to fix
>> another issue in large size kcalloc allocations.
>>
>> E.g.
>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
>> entry). So we need a 16MB allocation for a table->icm pointer array to
>> hold 2M pointers which can easily cause kcalloc to fail.
>>
>> The solution is to use kvzalloc to replace kcalloc which will fall back
>> to vmalloc automatically if kmalloc fails.
>>
>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>> Acked-by: Daniel Jurgens <danielj@mellanox.com>
>> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> 
> Applied, thanks.
> 

I must say this patch causes regressions here.

KASAN is not happy.

It looks like you guys did not really look at mlx4_alloc_icm().

This function properly handles high-order allocations, falling back to order-0 pages
under high memory pressure.
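The order-lowering fallback being referred to can be sketched as follows
(a simplified toy model: alloc_order(), sim_max_order, and
alloc_with_fallback() are invented names; the real mlx4_alloc_icm()
allocates pages via alloc_pages_node() and also handles coherent DMA
allocations):

```c
/* Simplified model of the fallback in mlx4_alloc_icm(): start at the
 * largest order and halve the request whenever allocation fails, so
 * that order-0 (single-page) requests still succeed under
 * fragmentation. alloc_order() stands in for alloc_pages(); it is
 * simulated with a configurable maximum order that "succeeds". */
static int sim_max_order;   /* highest order the fake allocator can satisfy */

static int alloc_order(int order)
{
    return order <= sim_max_order;  /* 1 = success, like a non-NULL page */
}

static int alloc_with_fallback(int start_order)
{
    int order = start_order;

    while (!alloc_order(order)) {
        if (order == 0)
            return -1;      /* even a single page failed: -ENOMEM */
        order--;            /* retry with a smaller (lower-order) chunk */
    }
    return order;           /* the order that finally succeeded */
}
```

So even with 256KB chunks requested, a fragmented system ends up with
smaller pieces rather than blocking, which is the behavior the revert
relies on.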

BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
Read of size 4 at addr ffff8817df584f68 by task qp_listing_test/92585

CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G           O     
Call Trace:
 [<ffffffffba80d7bb>] dump_stack+0x4d/0x72
 [<ffffffffb951dc5f>] print_address_description+0x6f/0x260
 [<ffffffffb951e1c7>] kasan_report+0x257/0x370
 [<ffffffffb951e339>] __asan_report_load4_noabort+0x19/0x20
 [<ffffffffc0256d28>] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
 [<ffffffffc02785b3>] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
 [<ffffffffc02dbfdb>] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
 [<ffffffffc02dc3ea>] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
 [<ffffffffb95f125c>] seq_read+0xa9c/0x1230
 [<ffffffffb96e0821>] proc_reg_read+0xc1/0x180
 [<ffffffffb9577918>] __vfs_read+0xe8/0x730
 [<ffffffffb9578057>] vfs_read+0xf7/0x300
 [<ffffffffb95794d2>] SyS_read+0xd2/0x1b0
 [<ffffffffb8e06b16>] do_syscall_64+0x186/0x420
 [<ffffffffbaa00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f851a7bb30d
RSP: 002b:00007ffd09a758c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007f84ff959440 RCX: 00007f851a7bb30d
RDX: 000000000003fc00 RSI: 00007f84ff60a000 RDI: 000000000000000b
RBP: 00007ffd09a75900 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000293 R12: 0000000000000000
R13: 000000000003ffff R14: 000000000003ffff R15: 00007f84ff60a000

Allocated by task 4488: 
 save_stack+0x46/0xd0
 kasan_kmalloc+0xad/0xe0
 __kmalloc+0x101/0x5e0
 ib_register_device+0xc03/0x1250 [ib_core]
 mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
 mlx4_add_device+0xa9/0x340 [mlx4_core]
 mlx4_register_interface+0x16e/0x390 [mlx4_core]
 xhci_pci_remove+0x7a/0x180 [xhci_pci]
 do_one_initcall+0xa0/0x230
 do_init_module+0x1b9/0x5a4
 load_module+0x63e6/0x94c0
 SYSC_init_module+0x1a4/0x1c0
 SyS_init_module+0xe/0x10
 do_syscall_64+0x186/0x420
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at ffff8817df584f40
 which belongs to the cache kmalloc-32 of size 32
The buggy address is located 8 bytes to the right of
 32-byte region [ffff8817df584f40, ffff8817df584f60)
The buggy address belongs to the page:
page:ffffea005f7d6100 count:1 mapcount:0 mapping:ffff8817df584000 index:0xffff8817df584fc1
flags: 0x880000000000100(slab)
raw: 0880000000000100 ffff8817df584000 ffff8817df584fc1 000000010000003f
raw: ffffea005f3ac0a0 ffffea005c476760 ffff8817fec00900 ffff883ff78d26c0
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff883ff78d26c0

Memory state around the buggy address:
 ffff8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
 ffff8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
>ffff8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
                                                          ^
 ffff8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

I will test :

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 685337d58276fc91baeeb64387c52985e1bc6dda..4d2a71381acb739585d662175e86caef72338097 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,13 @@
 #include "fw.h"
 
 /*
- * We allocate in page size (default 4KB on many archs) chunks to avoid high
- * order memory allocations in fragmented/high usage memory situation.
+ * We allocate in as big chunks as we can, up to a maximum of 256 KB
+ * per chunk. Note that the chunks are not necessarily in contiguous
+ * physical memory.
  */
 enum {
-       MLX4_ICM_ALLOC_SIZE     = PAGE_SIZE,
-       MLX4_TABLE_CHUNK_SIZE   = PAGE_SIZE,
+       MLX4_ICM_ALLOC_SIZE     = 1 << 18,
+       MLX4_TABLE_CHUNK_SIZE   = 1 << 18
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-30  3:34   ` Eric Dumazet
@ 2018-05-30  3:44     ` Eric Dumazet
  2018-05-30  3:49       ` Eric Dumazet
  2018-05-30 17:53     ` Qing Huang
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2018-05-30  3:44 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, qing.huang
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim



On 05/29/2018 11:34 PM, Eric Dumazet wrote:

> I will test :
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 685337d58276fc91baeeb64387c52985e1bc6dda..4d2a71381acb739585d662175e86caef72338097 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,13 @@
>  #include "fw.h"
>  
>  /*
> - * We allocate in page size (default 4KB on many archs) chunks to avoid high
> - * order memory allocations in fragmented/high usage memory situation.
> + * We allocate in as big chunks as we can, up to a maximum of 256 KB
> + * per chunk. Note that the chunks are not necessarily in contiguous
> + * physical memory.
>   */
>  enum {
> -       MLX4_ICM_ALLOC_SIZE     = PAGE_SIZE,
> -       MLX4_TABLE_CHUNK_SIZE   = PAGE_SIZE,
> +       MLX4_ICM_ALLOC_SIZE     = 1 << 18,
> +       MLX4_TABLE_CHUNK_SIZE   = 1 << 18
>  };
>  
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> 

And I will add this simple fix, this really should address your initial concern much better.

@@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
 {
        struct page *page;
 
+       if (order)
+               gfp_mask |= __GFP_NORETRY;
        page = alloc_pages_node(node, gfp_mask, order);
        if (!page) {
                page = alloc_pages(gfp_mask, order);


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-30  3:44     ` Eric Dumazet
@ 2018-05-30  3:49       ` Eric Dumazet
  2018-05-30 17:39         ` Qing Huang
  2018-05-31  6:54         ` Michal Hocko
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2018-05-30  3:49 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, qing.huang
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim



On 05/29/2018 11:44 PM, Eric Dumazet wrote:

> 
> And I will add this simple fix, this really should address your initial concern much better.
> 
> @@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
>  {
>         struct page *page;
>  
> +       if (order)
> +               gfp_mask |= __GFP_NORETRY;

    and also      gfp_mask &= ~__GFP_DIRECT_RECLAIM


>         page = alloc_pages_node(node, gfp_mask, order);
>         if (!page) {
>                 page = alloc_pages(gfp_mask, order);
> 


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-30  3:49       ` Eric Dumazet
@ 2018-05-30 17:39         ` Qing Huang
  2018-05-31  6:54         ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Qing Huang @ 2018-05-30 17:39 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim



On 5/29/2018 8:49 PM, Eric Dumazet wrote:
>
> On 05/29/2018 11:44 PM, Eric Dumazet wrote:
>
>> And I will add this simple fix, this really should address your initial concern much better.
>>
>> @@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
>>   {
>>          struct page *page;
>>   
>> +       if (order)
>> +               gfp_mask |= __GFP_NORETRY;
>      and also      gfp_mask &= ~__GFP_DIRECT_RECLAIM
>

Would this just fail the allocation without trying to reclaim memory 
under memory pressure? We've tried something similar but it didn't fix 
the original problem we were facing.


>>          page = alloc_pages_node(node, gfp_mask, order);
>>          if (!page) {
>>                  page = alloc_pages(gfp_mask, order);
>>


* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-30  3:34   ` Eric Dumazet
  2018-05-30  3:44     ` Eric Dumazet
@ 2018-05-30 17:53     ` Qing Huang
  1 sibling, 0 replies; 22+ messages in thread
From: Qing Huang @ 2018-05-30 17:53 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim



On 5/29/2018 8:34 PM, Eric Dumazet wrote:
>
> On 05/25/2018 10:23 AM, David Miller wrote:
>> From: Qing Huang <qing.huang@oracle.com>
>> Date: Wed, 23 May 2018 16:22:46 -0700
>>
>>> When a system is under memory pressure (high usage with fragmentation),
>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>> memory management to enter its slow path, doing memory compaction/migration
>>> ops in order to complete high-order memory allocations.
>>>
>>> When that happens, user processes calling uverbs APIs may easily get stuck
>>> for more than 120s even though plenty of free pages
>>> in smaller chunks are available in the system.
>>>
>>> Syslog:
>>> ...
>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>> ...
>>>
>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>
>>> However in order to support smaller ICM chunk size, we need to fix
>>> another issue in large size kcalloc allocations.
>>>
>>> E.g.
>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
>>> entry). So we need a 16MB allocation for a table->icm pointer array to
>>> hold 2M pointers which can easily cause kcalloc to fail.
>>>
>>> The solution is to use kvzalloc to replace kcalloc which will fall back
>>> to vmalloc automatically if kmalloc fails.
>>>
>>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>>> Acked-by: Daniel Jurgens <danielj@mellanox.com>
>>> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>> Applied, thanks.
>>
> I must say this patch causes regressions here.
>
> KASAN is not happy.
>
> It looks like you guys did not really look at mlx4_alloc_icm().
>
> This function properly handles high-order allocations, falling back to order-0 pages
> under high memory pressure.
>
> BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
> Read of size 4 at addr ffff8817df584f68 by task qp_listing_test/92585
>
> CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G           O
> Call Trace:
>   [<ffffffffba80d7bb>] dump_stack+0x4d/0x72
>   [<ffffffffb951dc5f>] print_address_description+0x6f/0x260
>   [<ffffffffb951e1c7>] kasan_report+0x257/0x370
>   [<ffffffffb951e339>] __asan_report_load4_noabort+0x19/0x20
>   [<ffffffffc0256d28>] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
>   [<ffffffffc02785b3>] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
>   [<ffffffffc02dbfdb>] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
>   [<ffffffffc02dc3ea>] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
>   [<ffffffffb95f125c>] seq_read+0xa9c/0x1230
>   [<ffffffffb96e0821>] proc_reg_read+0xc1/0x180
>   [<ffffffffb9577918>] __vfs_read+0xe8/0x730
>   [<ffffffffb9578057>] vfs_read+0xf7/0x300
>   [<ffffffffb95794d2>] SyS_read+0xd2/0x1b0
>   [<ffffffffb8e06b16>] do_syscall_64+0x186/0x420
>   [<ffffffffbaa00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f851a7bb30d
> RSP: 002b:00007ffd09a758c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 00007f84ff959440 RCX: 00007f851a7bb30d
> RDX: 000000000003fc00 RSI: 00007f84ff60a000 RDI: 000000000000000b
> RBP: 00007ffd09a75900 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000000000000022 R11: 0000000000000293 R12: 0000000000000000
> R13: 000000000003ffff R14: 000000000003ffff R15: 00007f84ff60a000
>
> Allocated by task 4488:
>   save_stack+0x46/0xd0
>   kasan_kmalloc+0xad/0xe0
>   __kmalloc+0x101/0x5e0
>   ib_register_device+0xc03/0x1250 [ib_core]
>   mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
>   mlx4_add_device+0xa9/0x340 [mlx4_core]
>   mlx4_register_interface+0x16e/0x390 [mlx4_core]
>   xhci_pci_remove+0x7a/0x180 [xhci_pci]
>   do_one_initcall+0xa0/0x230
>   do_init_module+0x1b9/0x5a4
>   load_module+0x63e6/0x94c0
>   SYSC_init_module+0x1a4/0x1c0
>   SyS_init_module+0xe/0x10
>   do_syscall_64+0x186/0x420
>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>
> Freed by task 0:
> (stack is not available)
>
> The buggy address belongs to the object at ffff8817df584f40
>   which belongs to the cache kmalloc-32 of size 32
> The buggy address is located 8 bytes to the right of
>   32-byte region [ffff8817df584f40, ffff8817df584f60)
> The buggy address belongs to the page:
> page:ffffea005f7d6100 count:1 mapcount:0 mapping:ffff8817df584000 index:0xffff8817df584fc1
> flags: 0x880000000000100(slab)
> raw: 0880000000000100 ffff8817df584000 ffff8817df584fc1 000000010000003f
> raw: ffffea005f3ac0a0 ffffea005c476760 ffff8817fec00900 ffff883ff78d26c0
> page dumped because: kasan: bad access detected
> page->mem_cgroup:ffff883ff78d26c0

What kind of test case did you run? It looks like a bug somewhere in the
code. Perhaps smaller chunks make it easier to trigger; we should fix the
bug regardless.


>
> Memory state around the buggy address:
>   ffff8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
>   ffff8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
>> ffff8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
>                                                            ^
>   ffff8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>   ffff8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
> I will test :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 685337d58276fc91baeeb64387c52985e1bc6dda..4d2a71381acb739585d662175e86caef72338097 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,13 @@
>   #include "fw.h"
>   
>   /*
> - * We allocate in page size (default 4KB on many archs) chunks to avoid high
> - * order memory allocations in fragmented/high usage memory situation.
> + * We allocate in as big chunks as we can, up to a maximum of 256 KB
> + * per chunk. Note that the chunks are not necessarily in contiguous
> + * physical memory.
>    */
>   enum {
> -       MLX4_ICM_ALLOC_SIZE     = PAGE_SIZE,
> -       MLX4_TABLE_CHUNK_SIZE   = PAGE_SIZE,
> +       MLX4_ICM_ALLOC_SIZE     = 1 << 18,
> +       MLX4_TABLE_CHUNK_SIZE   = 1 << 18
>   };
>   
>   static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-30  3:49       ` Eric Dumazet
  2018-05-30 17:39         ` Qing Huang
@ 2018-05-31  6:54         ` Michal Hocko
  2018-05-31  8:35           ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-05-31  6:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim

On Tue 29-05-18 23:49:59, Eric Dumazet wrote:
> 
> 
> On 05/29/2018 11:44 PM, Eric Dumazet wrote:
> 
> > 
> > And I will add this simple fix, this really should address your initial concern much better.
> > 
> > @@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
> >  {
> >         struct page *page;
> >  
> > +       if (order)
> > +               gfp_mask |= __GFP_NORETRY;
> 
>     and also      gfp_mask &= ~__GFP_DIRECT_RECLAIM

JFTR, the latter makes __GFP_NORETRY pointless: non-sleeping
allocations do not retry.

> >         page = alloc_pages_node(node, gfp_mask, order);
> >         if (!page) {
> >                 page = alloc_pages(gfp_mask, order);
> > 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-31  6:54         ` Michal Hocko
@ 2018-05-31  8:35           ` Eric Dumazet
  2018-05-31  8:55             ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2018-05-31  8:35 UTC (permalink / raw)
  To: Michal Hocko, Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim



On 05/31/2018 02:54 AM, Michal Hocko wrote:
> On Tue 29-05-18 23:49:59, Eric Dumazet wrote:
>>
>>
>> On 05/29/2018 11:44 PM, Eric Dumazet wrote:
>>
>>>
>>> And I will add this simple fix, this really should address your initial concern much better.
>>>
>>> @@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
>>>  {
>>>         struct page *page;
>>>  
>>> +       if (order)
>>> +               gfp_mask |= __GFP_NORETRY;
>>
>>     and also      gfp_mask &= ~__GFP_DIRECT_RECLAIM
> 
> JFTR the latter one makes __GFP_NORETRY pointless. Non sleeping allocations
> are not retrying.

Hi Michal

Since when has this rule applied?

These GFP flags change all the time; perhaps the mm experts should clean up the existing call sites?

I merely copied/pasted from alloc_skb_with_frags() :/

Thanks !

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-31  8:35           ` Eric Dumazet
@ 2018-05-31  8:55             ` Michal Hocko
  2018-05-31  9:10               ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-05-31  8:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim

On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> 
> 
> On 05/31/2018 02:54 AM, Michal Hocko wrote:
> > On Tue 29-05-18 23:49:59, Eric Dumazet wrote:
> >>
> >>
> >> On 05/29/2018 11:44 PM, Eric Dumazet wrote:
> >>
> >>>
> >>> And I will add this simple fix, this really should address your initial concern much better.
> >>>
> >>> @@ -99,6 +100,8 @@ static int mlx4_alloc_icm_pages(struct scatterlist *mem, int order,
> >>>  {
> >>>         struct page *page;
> >>>  
> >>> +       if (order)
> >>> +               gfp_mask |= __GFP_NORETRY;
> >>
> >>     and also      gfp_mask &= ~__GFP_DIRECT_RECLAIM
> > 
> > JFTR the latter one makes __GFP_NORETRY pointless. Non sleeping allocations
> > are not retrying.
> 
> Hi Michal
> 
> Since when this rule is applied ?

Well, this has been the case for as long as I can remember.
~__GFP_DIRECT_RECLAIM, or GFP_NOWAIT, has always meant that the
allocation doesn't perform _any_ reclaim and therefore never retries.
Why would it? Any retry would just hope for a reclaim on behalf of
somebody else. __GFP_NORETRY has always simply stopped retrying after
one or a few reclaim attempts. The implementation has changed here and
there over time but in essence the semantics have remained the same.
Please have a look at the current documentation in the gfp.h file and
let me know if you would like some clarifications.

> These GFP flags change all the time, I suggest mm experts to cleanup existing call sites ?

Do you have any specific areas to look at?

> I merely copied/pasted from alloc_skb_with_frags() :/

I will have a look at it. Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-31  8:55             ` Michal Hocko
@ 2018-05-31  9:10               ` Michal Hocko
  2018-06-01  2:04                 ` Qing Huang
  2018-06-04 13:11                 ` Michal Hocko
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2018-05-31  9:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim

On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
[...]
> > I merely copied/pasted from alloc_skb_with_frags() :/
> 
> I will have a look at it. Thanks!

OK, so this is an example of incremental development ;).

__GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
high order allocations") to prevent the OOM killer. Yet this was not
enough, because fb05e7a89f50 ("net: don't wait for order-3 page
allocation") didn't want excessive reclaim for non-costly orders, so
it made the allocation completely NOWAIT while preserving
__GFP_NORETRY in place, which is now redundant. Should I send a patch?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-31  9:10               ` Michal Hocko
@ 2018-06-01  2:04                 ` Qing Huang
  2018-06-01  7:31                   ` Michal Hocko
  2018-06-04 13:11                 ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Qing Huang @ 2018-06-01  2:04 UTC (permalink / raw)
  To: Michal Hocko, Eric Dumazet
  Cc: David Miller, tariqt, haakon.bugge, yanjun.zhu, netdev,
	linux-rdma, linux-kernel, gi-oh.kim, santosh.shilimkar



On 5/31/2018 2:10 AM, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> [...]
>>> I merely copied/pasted from alloc_skb_with_frags() :/
>> I will have a look at it. Thanks!
> OK, so this is an example of an incremental development ;).
>
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant. Should I send a patch?
>

Just curious, how about the GFP_ATOMIC flag? Would it work in a
similar fashion? We experimented with it a bit in the past but it
seemed to cause other issues in our tests. :-)

By the way, we didn't encounter any OOM killer events. It seemed that
mlx4_alloc_icm() triggered the slowpath. We still had about 2GB of
free memory, but it was highly fragmented.

  #0 [ffff8801f308b380] remove_migration_pte at ffffffff811f0e0b
  #1 [ffff8801f308b3e0] rmap_walk_file at ffffffff811cb890
  #2 [ffff8801f308b440] rmap_walk at ffffffff811cbaf2
  #3 [ffff8801f308b450] remove_migration_ptes at ffffffff811f0db0
  #4 [ffff8801f308b490] __unmap_and_move at ffffffff811f2ea6
  #5 [ffff8801f308b4e0] unmap_and_move at ffffffff811f2fc5
  #6 [ffff8801f308b540] migrate_pages at ffffffff811f3219
  #7 [ffff8801f308b5c0] compact_zone at ffffffff811b707e
  #8 [ffff8801f308b650] compact_zone_order at ffffffff811b735d
  #9 [ffff8801f308b6e0] try_to_compact_pages at ffffffff811b7485
#10 [ffff8801f308b770] __alloc_pages_direct_compact at ffffffff81195f96
#11 [ffff8801f308b7b0] __alloc_pages_slowpath at ffffffff811978a1
#12 [ffff8801f308b890] __alloc_pages_nodemask at ffffffff81197ec1
#13 [ffff8801f308b970] alloc_pages_current at ffffffff811e261f
#14 [ffff8801f308b9e0] mlx4_alloc_icm at ffffffffa01f39b2 [mlx4_core]

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-01  2:04                 ` Qing Huang
@ 2018-06-01  7:31                   ` Michal Hocko
  2018-06-01 22:05                     ` Qing Huang
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-06-01  7:31 UTC (permalink / raw)
  To: Qing Huang
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim, santosh.shilimkar

On Thu 31-05-18 19:04:46, Qing Huang wrote:
> 
> 
> On 5/31/2018 2:10 AM, Michal Hocko wrote:
> > On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> > [...]
> > > > I merely copied/pasted from alloc_skb_with_frags() :/
> > > I will have a look at it. Thanks!
> > OK, so this is an example of an incremental development ;).
> > 
> > __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> > high order allocations") to prevent from OOM killer. Yet this was
> > not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> > allocation") didn't want an excessive reclaim for non-costly orders
> > so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> > place which is now redundant. Should I send a patch?
> > 
> 
> Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
> We experimented
> with it a bit in the past but it seemed to cause other issue in our tests.
> :-)

GFP_ATOMIC is a non-sleeping (i.e. no-reclaim) context with access to
memory reserves. So the risk is that you deplete those reserves and
cause issues for other subsystems which need them as well.

> By the way, we didn't encounter any OOM killer events. It seemed that the
> mlx4_alloc_icm() triggered slowpath.
> We still had about 2GB free memory while it was highly fragmented.

Compaction was able to make reasonable forward progress for you.
But considering mlx4_alloc_icm is called with GFP_KERNEL or
GFP_HIGHUSER, the OOM killer is clearly possible as long as the order
is lower than 4.

>  #0 [ffff8801f308b380] remove_migration_pte at ffffffff811f0e0b
>  #1 [ffff8801f308b3e0] rmap_walk_file at ffffffff811cb890
>  #2 [ffff8801f308b440] rmap_walk at ffffffff811cbaf2
>  #3 [ffff8801f308b450] remove_migration_ptes at ffffffff811f0db0
>  #4 [ffff8801f308b490] __unmap_and_move at ffffffff811f2ea6
>  #5 [ffff8801f308b4e0] unmap_and_move at ffffffff811f2fc5
>  #6 [ffff8801f308b540] migrate_pages at ffffffff811f3219
>  #7 [ffff8801f308b5c0] compact_zone at ffffffff811b707e
>  #8 [ffff8801f308b650] compact_zone_order at ffffffff811b735d
>  #9 [ffff8801f308b6e0] try_to_compact_pages at ffffffff811b7485
> #10 [ffff8801f308b770] __alloc_pages_direct_compact at ffffffff81195f96
> #11 [ffff8801f308b7b0] __alloc_pages_slowpath at ffffffff811978a1
> #12 [ffff8801f308b890] __alloc_pages_nodemask at ffffffff81197ec1
> #13 [ffff8801f308b970] alloc_pages_current at ffffffff811e261f
> #14 [ffff8801f308b9e0] mlx4_alloc_icm at ffffffffa01f39b2 [mlx4_core]
> 
> Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-01  7:31                   ` Michal Hocko
@ 2018-06-01 22:05                     ` Qing Huang
  2018-06-04  6:27                       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Qing Huang @ 2018-06-01 22:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim, santosh.shilimkar



On 6/1/2018 12:31 AM, Michal Hocko wrote:
> On Thu 31-05-18 19:04:46, Qing Huang wrote:
>>
>> On 5/31/2018 2:10 AM, Michal Hocko wrote:
>>> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>>>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
>>> [...]
>>>>> I merely copied/pasted from alloc_skb_with_frags() :/
>>>> I will have a look at it. Thanks!
>>> OK, so this is an example of an incremental development ;).
>>>
>>> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
>>> high order allocations") to prevent from OOM killer. Yet this was
>>> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
>>> allocation") didn't want an excessive reclaim for non-costly orders
>>> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
>>> place which is now redundant. Should I send a patch?
>>>
>> Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
>> We experimented
>> with it a bit in the past but it seemed to cause other issue in our tests.
>> :-)
> GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
> memory reserves. So the risk is that you deplete those reserves and
> cause issues to other subsystems which need them as well.
>
>> By the way, we didn't encounter any OOM killer events. It seemed that the
>> mlx4_alloc_icm() triggered slowpath.
>> We still had about 2GB free memory while it was highly fragmented.
> The compaction was able to make a reasonable forward progress for you.
> But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
> then the OOM killer is clearly possible as long as the order is lower
> than 4.

The allocation was 256KB, so the order was much higher than 4.
Compaction seemed to be the root cause of our problem: it took too
long to finish its work, putting mlx4_alloc_icm to sleep in a heavily
fragmented memory situation. Will the __GFP_NORETRY flag avoid the
compaction ops and fail the 256KB allocation immediately, so that
mlx4_alloc_icm can fall back to the adjustable lower-order allocation
path quickly?

Thanks.

>
>>   #0 [ffff8801f308b380] remove_migration_pte at ffffffff811f0e0b
>>   #1 [ffff8801f308b3e0] rmap_walk_file at ffffffff811cb890
>>   #2 [ffff8801f308b440] rmap_walk at ffffffff811cbaf2
>>   #3 [ffff8801f308b450] remove_migration_ptes at ffffffff811f0db0
>>   #4 [ffff8801f308b490] __unmap_and_move at ffffffff811f2ea6
>>   #5 [ffff8801f308b4e0] unmap_and_move at ffffffff811f2fc5
>>   #6 [ffff8801f308b540] migrate_pages at ffffffff811f3219
>>   #7 [ffff8801f308b5c0] compact_zone at ffffffff811b707e
>>   #8 [ffff8801f308b650] compact_zone_order at ffffffff811b735d
>>   #9 [ffff8801f308b6e0] try_to_compact_pages at ffffffff811b7485
>> #10 [ffff8801f308b770] __alloc_pages_direct_compact at ffffffff81195f96
>> #11 [ffff8801f308b7b0] __alloc_pages_slowpath at ffffffff811978a1
>> #12 [ffff8801f308b890] __alloc_pages_nodemask at ffffffff81197ec1
>> #13 [ffff8801f308b970] alloc_pages_current at ffffffff811e261f
>> #14 [ffff8801f308b9e0] mlx4_alloc_icm at ffffffffa01f39b2 [mlx4_core]
>>
>> Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-01 22:05                     ` Qing Huang
@ 2018-06-04  6:27                       ` Michal Hocko
  2018-06-04 12:40                         ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-06-04  6:27 UTC (permalink / raw)
  To: Qing Huang
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim, santosh.shilimkar,
	Vlastimil Babka

On Fri 01-06-18 15:05:26, Qing Huang wrote:
> 
> 
> On 6/1/2018 12:31 AM, Michal Hocko wrote:
> > On Thu 31-05-18 19:04:46, Qing Huang wrote:
> > > 
> > > On 5/31/2018 2:10 AM, Michal Hocko wrote:
> > > > On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > > > > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> > > > [...]
> > > > > > I merely copied/pasted from alloc_skb_with_frags() :/
> > > > > I will have a look at it. Thanks!
> > > > OK, so this is an example of an incremental development ;).
> > > > 
> > > > __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> > > > high order allocations") to prevent from OOM killer. Yet this was
> > > > not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> > > > allocation") didn't want an excessive reclaim for non-costly orders
> > > > so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> > > > place which is now redundant. Should I send a patch?
> > > > 
> > > Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
> > > We experimented
> > > with it a bit in the past but it seemed to cause other issue in our tests.
> > > :-)
> > GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
> > memory reserves. So the risk is that you deplete those reserves and
> > cause issues to other subsystems which need them as well.
> > 
> > > By the way, we didn't encounter any OOM killer events. It seemed that the
> > > mlx4_alloc_icm() triggered slowpath.
> > > We still had about 2GB free memory while it was highly fragmented.
> > The compaction was able to make a reasonable forward progress for you.
> > But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
> > then the OOM killer is clearly possible as long as the order is lower
> > than 4.
> 
> The allocation was 256KB so the order was much higher than 4. The compaction
> seemed to be the root
> cause for our problem. It took too long to finish its work while putting
> mlx4_alloc_icm to sleep in a heavily
> fragmented memory situation . Will NORETRY flag avoid the compaction ops and
> fail the 256KB allocation
> immediately so mlx4_alloc_icm can enter adjustable lower order allocation
> code path quickly?

Costly orders should only perform a light compaction attempt unless
__GFP_RETRY_MAYFAIL is used, IIRC. CCing Vlastimil. So __GFP_NORETRY
shouldn't make any difference.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-04  6:27                       ` Michal Hocko
@ 2018-06-04 12:40                         ` Vlastimil Babka
  2018-06-05 18:51                           ` Qing Huang
  0 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2018-06-04 12:40 UTC (permalink / raw)
  To: Michal Hocko, Qing Huang
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim, santosh.shilimkar

On 06/04/2018 08:27 AM, Michal Hocko wrote:
> On Fri 01-06-18 15:05:26, Qing Huang wrote:
>>
>>
>> On 6/1/2018 12:31 AM, Michal Hocko wrote:
>>> On Thu 31-05-18 19:04:46, Qing Huang wrote:
>>>>
>>>> On 5/31/2018 2:10 AM, Michal Hocko wrote:
>>>>> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>>>>>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
>>>>> [...]
>>>>>>> I merely copied/pasted from alloc_skb_with_frags() :/
>>>>>> I will have a look at it. Thanks!
>>>>> OK, so this is an example of an incremental development ;).
>>>>>
>>>>> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
>>>>> high order allocations") to prevent from OOM killer. Yet this was
>>>>> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
>>>>> allocation") didn't want an excessive reclaim for non-costly orders
>>>>> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
>>>>> place which is now redundant. Should I send a patch?
>>>>>
>>>> Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
>>>> We experimented
>>>> with it a bit in the past but it seemed to cause other issue in our tests.
>>>> :-)
>>> GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
>>> memory reserves. So the risk is that you deplete those reserves and
>>> cause issues to other subsystems which need them as well.
>>>
>>>> By the way, we didn't encounter any OOM killer events. It seemed that the
>>>> mlx4_alloc_icm() triggered slowpath.
>>>> We still had about 2GB free memory while it was highly fragmented.
>>> The compaction was able to make a reasonable forward progress for you.
>>> But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
>>> then the OOM killer is clearly possible as long as the order is lower
>>> than 4.
>>
>> The allocation was 256KB so the order was much higher than 4. The compaction
>> seemed to be the root
>> cause for our problem. It took too long to finish its work while putting
>> mlx4_alloc_icm to sleep in a heavily
>> fragmented memory situation . Will NORETRY flag avoid the compaction ops and
>> fail the 256KB allocation
>> immediately so mlx4_alloc_icm can enter adjustable lower order allocation
>> code path quickly?
> 
> Costly orders should only perform a light compaction attempt unless
> __GFP_RETRY_MAYFAIL is used, IIRC. CCing Vlastimil. So __GFP_NORETRY
> shouldn't make any difference.

It's a bit more complicated. Costly allocations will try the light
compaction attempt first, even before reclaim. This is followed by
reclaim and a more costly compaction attempt. With __GFP_NORETRY, the
second compaction attempt is also only the light one, so the flag does
make a difference here.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-05-31  9:10               ` Michal Hocko
  2018-06-01  2:04                 ` Qing Huang
@ 2018-06-04 13:11                 ` Michal Hocko
  2018-06-04 13:22                   ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-06-04 13:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim

On Thu 31-05-18 11:10:22, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> [...]
> > > I merely copied/pasted from alloc_skb_with_frags() :/
> > 
> > I will have a look at it. Thanks!
> 
> OK, so this is an example of an incremental development ;).
> 
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant. Should I send a patch?

Just in case you are interested
---
From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 4 Jun 2018 15:07:37 +0200
Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags

alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
which is just a noop and a little bit confusing.

__GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
high order allocations") to prevent from the OOM killer. Yet this was
not enough because fb05e7a89f50 ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant.

Drop the pointless __GFP_NORETRY because this function is used as
copy&paste source for other places.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/core/skbuff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 857e4e6f751a..c1f22adc30de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5239,8 +5239,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 			if (npages >= 1 << order) {
 				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
 						   __GFP_COMP |
-						   __GFP_NOWARN |
-						   __GFP_NORETRY,
+						   __GFP_NOWARN,
 						   order);
 				if (page)
 					goto fill_page;
-- 
2.17.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-04 13:11                 ` Michal Hocko
@ 2018-06-04 13:22                   ` Eric Dumazet
  2018-06-05  8:32                     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2018-06-04 13:22 UTC (permalink / raw)
  To: Michal Hocko, Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim



On 06/04/2018 06:11 AM, Michal Hocko wrote:
> On Thu 31-05-18 11:10:22, Michal Hocko wrote:

> Just in case you are interested
> ---
> From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 4 Jun 2018 15:07:37 +0200
> Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags
> 
> alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
> which is just a noop and a little bit confusing.
> 
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from the OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant.
> 
> Drop the pointless __GFP_NORETRY because this function is used as
> copy&paste source for other places.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-04 13:22                   ` Eric Dumazet
@ 2018-06-05  8:32                     ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-06-05  8:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim

On Mon 04-06-18 06:22:26, Eric Dumazet wrote:
> 
> 
> On 06/04/2018 06:11 AM, Michal Hocko wrote:
> > On Thu 31-05-18 11:10:22, Michal Hocko wrote:
> 
> > Just in case you are interested
> > ---
> > From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Mon, 4 Jun 2018 15:07:37 +0200
> > Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags
> > 
> > alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
> > which is just a noop and a little bit confusing.
> > 
> > __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> > high order allocations") to prevent from the OOM killer. Yet this was
> > not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> > allocation") didn't want an excessive reclaim for non-costly orders
> > so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> > place which is now redundant.
> > 
> > Drop the pointless __GFP_NORETRY because this function is used as
> > copy&paste source for other places.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks! What is the proper process now? Should I resend, or can
somebody pick it up from this thread?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
  2018-06-04 12:40                         ` Vlastimil Babka
@ 2018-06-05 18:51                           ` Qing Huang
  0 siblings, 0 replies; 22+ messages in thread
From: Qing Huang @ 2018-06-05 18:51 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim, santosh.shilimkar,
	rama nichanamatlu



On 6/4/2018 5:40 AM, Vlastimil Babka wrote:
> On 06/04/2018 08:27 AM, Michal Hocko wrote:
>> On Fri 01-06-18 15:05:26, Qing Huang wrote:
>>>
>>> On 6/1/2018 12:31 AM, Michal Hocko wrote:
>>>> On Thu 31-05-18 19:04:46, Qing Huang wrote:
>>>>> On 5/31/2018 2:10 AM, Michal Hocko wrote:
>>>>>> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>>>>>>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
>>>>>> [...]
>>>>>>>> I merely copied/pasted from alloc_skb_with_frags() :/
>>>>>>> I will have a look at it. Thanks!
>>>>>> OK, so this is an example of an incremental development ;).
>>>>>>
>>>>>> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
>>>>>> high order allocations") to prevent from OOM killer. Yet this was
>>>>>> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
>>>>>> allocation") didn't want an excessive reclaim for non-costly orders
>>>>>> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
>>>>>> place which is now redundant. Should I send a patch?
>>>>>>
>>>>> Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
>>>>> We experimented
>>>>> with it a bit in the past but it seemed to cause other issue in our tests.
>>>>> :-)
>>>> GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
>>>> memory reserves. So the risk is that you deplete those reserves and
>>>> cause issues to other subsystems which need them as well.
>>>>
>>>>> By the way, we didn't encounter any OOM killer events. It seemed that the
>>>>> mlx4_alloc_icm() triggered slowpath.
>>>>> We still had about 2GB free memory while it was highly fragmented.
>>>> The compaction was able to make a reasonable forward progress for you.
>>>> But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
>>>> then the OOM killer is clearly possible as long as the order is lower
>>>> than 4.
>>> The allocation was 256KB so the order was much higher than 4. The compaction
>>> seemed to be the root
>>> cause for our problem. It took too long to finish its work while putting
>>> mlx4_alloc_icm to sleep in a heavily
>>> fragmented memory situation . Will NORETRY flag avoid the compaction ops and
>>> fail the 256KB allocation
>>> immediately so mlx4_alloc_icm can enter adjustable lower order allocation
>>> code path quickly?
>> Costly orders should only perform a light compaction attempt unless
>> __GFP_RETRY_MAYFAIL is used, IIRC. CCing Vlastimil. So __GFP_NORETRY
>> shouldn't make any difference.
> It's a bit more complicated. Costly allocations will try the light
> compaction attempt first, even before reclaim. This is followed by
> reclaim and a more costly compaction attempt. With __GFP_NORETRY, the
> second compaction attempt is also only the light one, so the flag does
> make a difference here.

Thanks for the clarification!

Looks like our production kernel is kinda old; neither
__GFP_DIRECT_RECLAIM nor __GFP_NORETRY is used in
__alloc_pages_slowpath() in our kernel.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-06-05 18:51 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-23 23:22 [PATCH V4] mlx4_core: allocate ICM memory in page size chunks Qing Huang
2018-05-24  7:23 ` Gi-Oh Kim
2018-05-24  9:45 ` Tariq Toukan
2018-05-25 14:23 ` David Miller
2018-05-30  3:34   ` Eric Dumazet
2018-05-30  3:44     ` Eric Dumazet
2018-05-30  3:49       ` Eric Dumazet
2018-05-30 17:39         ` Qing Huang
2018-05-31  6:54         ` Michal Hocko
2018-05-31  8:35           ` Eric Dumazet
2018-05-31  8:55             ` Michal Hocko
2018-05-31  9:10               ` Michal Hocko
2018-06-01  2:04                 ` Qing Huang
2018-06-01  7:31                   ` Michal Hocko
2018-06-01 22:05                     ` Qing Huang
2018-06-04  6:27                       ` Michal Hocko
2018-06-04 12:40                         ` Vlastimil Babka
2018-06-05 18:51                           ` Qing Huang
2018-06-04 13:11                 ` Michal Hocko
2018-06-04 13:22                   ` Eric Dumazet
2018-06-05  8:32                     ` Michal Hocko
2018-05-30 17:53     ` Qing Huang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).