LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users
@ 2021-03-01 16:11 Mel Gorman
  2021-03-01 16:11 ` [PATCH 1/5] SUNRPC: Set rq_page_end differently Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Mel Gorman @ 2021-03-01 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

This series introduces a bulk order-0 page allocator with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient and the intention is to iron out what the semantics
of the API should have for users. Once the semantics are ironed out, it can
be made more efficient.

Improving the implementation requires fairly deep surgery in numerous
places. The lock scope would need to be significantly reduced, particularly
as vmstat, per-cpu and the buddy allocator have different locking protocol
that overall -- e.g. all partially depend on irqs being disabled at
various points. Secondly, the core of the allocator deals with single
pages where as both the bulk allocator and per-cpu allocator operate in
batches. All of that has to be reconciled with all the existing users and
their constraints (memory offline, CMA and cpusets being the trickiest).

Light testing passed, I'm relying on Chuck and Jesper to test the target
users more aggressively but both report performance improvements with the
initial RFC.

Patch 1 of this series is a cleanup to sunrpc, it could be merged
	separately but is included here as a pre-requisite.

Patch 2 is the prototype bulk allocator

Patch 3 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

Patch 4 is a preparation patch only for the network user

Patch 5 converts the net page pool to the bulk allocator for order-0 pages.

 include/linux/gfp.h   |  13 +++++
 mm/page_alloc.c       | 113 +++++++++++++++++++++++++++++++++++++++++-
 net/core/page_pool.c  | 102 +++++++++++++++++++++++---------------
 net/sunrpc/svc_xprt.c |  47 ++++++++++++------
 4 files changed, 220 insertions(+), 55 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/5] SUNRPC: Set rq_page_end differently
  2021-03-01 16:11 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
@ 2021-03-01 16:11 ` Mel Gorman
  2021-03-01 16:11 ` [PATCH 2/5] mm/page_alloc: Add a bulk page allocator Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-03-01 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Chuck Lever <chuck.lever@oracle.com>

Refactor:

I'm about to use the loop variable @i for something else.

As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.

Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit cf8208d0eabd ("sendfile: convert nfsd to
splice_direct_to_actor()").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/sunrpc/svc_xprt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index dcc50ae54550..cfa7e4776d0e 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -667,8 +667,8 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			}
 			rqstp->rq_pages[i] = p;
 		}
-	rqstp->rq_page_end = &rqstp->rq_pages[i];
-	rqstp->rq_pages[i++] = NULL; /* this might be seen in nfs_read_actor */
+	rqstp->rq_page_end = &rqstp->rq_pages[pages];
+	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 
 	/* Make arg->head point to first page and arg->pages point to rest */
 	arg = &rqstp->rq_arg;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/5] mm/page_alloc: Add a bulk page allocator
  2021-03-01 16:11 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
  2021-03-01 16:11 ` [PATCH 1/5] SUNRPC: Set rq_page_end differently Mel Gorman
@ 2021-03-01 16:11 ` Mel Gorman
  2021-03-09 17:12   ` Christoph Hellwig
  2021-03-10 11:04   ` Shay Agroskin
  2021-03-01 16:11 ` [PATCH 3/5] SUNRPC: Refresh rq_pages using " Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Mel Gorman @ 2021-03-01 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h |  13 +++++
 mm/page_alloc.c     | 113 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8572a1474e16..4903d1cc48dc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page *page)
 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+				nodemask_t *nodemask, int nr_pages,
+				struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 							nodemask_t *nodemask);
@@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
 	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head *list)
+{
+	return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+							nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ff1e55793786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 	}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		trace_mm_page_free_batched(page);
+		if (put_page_testzero(page)) {
+			list_del(&page->lru);
+			__free_pages_ok(page, 0, FPI_NONE);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct alloc_context *ac, gfp_t *alloc_mask,
 		unsigned int *alloc_flags)
 {
+	gfp_mask &= gfp_allowed_mask;
+	*alloc_mask = gfp_mask;
+
 	ac->highest_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
@@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+			nodemask_t *nodemask, int nr_pages,
+			struct list_head *alloc_list)
+{
+	struct page *page;
+	unsigned long flags;
+	struct zone *zone;
+	struct zoneref *z;
+	struct per_cpu_pages *pcp;
+	struct list_head *pcp_list;
+	struct alloc_context ac;
+	gfp_t alloc_mask;
+	unsigned int alloc_flags;
+	int alloced = 0;
+
+	if (nr_pages == 1)
+		goto failed;
+
+	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+	if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+		return 0;
+	gfp_mask = alloc_mask;
+
+	/* Find an allowed local zone that meets the high watermark. */
+	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+		unsigned long mark;
+
+		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+		    !__cpuset_zone_allowed(zone, gfp_mask)) {
+			continue;
+		}
+
+		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
+			goto failed;
+		}
+
+		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
+		if (zone_watermark_fast(zone, 0,  mark,
+				zonelist_zone_idx(ac.preferred_zoneref),
+				alloc_flags, gfp_mask)) {
+			break;
+		}
+	}
+	if (!zone)
+		return 0;
+
+	/* Attempt the batch allocation */
+	local_irq_save(flags);
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
+	pcp_list = &pcp->lists[ac.migratetype];
+
+	while (alloced < nr_pages) {
+		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
+								pcp, pcp_list);
+		if (!page)
+			break;
+
+		prep_new_page(page, 0, gfp_mask, 0);
+		list_add(&page->lru, alloc_list);
+		alloced++;
+	}
+
+	if (!alloced)
+		goto failed_irq;
+
+	if (alloced) {
+		__count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
+		zone_statistics(zone, zone);
+	}
+
+	local_irq_restore(flags);
+
+	return alloced;
+
+failed_irq:
+	local_irq_restore(flags);
+
+failed:
+	page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
+	if (page) {
+		alloced++;
+		list_add(&page->lru, alloc_list);
+	}
+
+	return alloced;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -4981,8 +5092,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 		return NULL;
 	}
 
-	gfp_mask &= gfp_allowed_mask;
-	alloc_mask = gfp_mask;
 	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 3/5] SUNRPC: Refresh rq_pages using a bulk page allocator
  2021-03-01 16:11 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
  2021-03-01 16:11 ` [PATCH 1/5] SUNRPC: Set rq_page_end differently Mel Gorman
  2021-03-01 16:11 ` [PATCH 2/5] mm/page_alloc: Add a bulk page allocator Mel Gorman
@ 2021-03-01 16:11 ` Mel Gorman
  2021-03-01 16:11 ` [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
  2021-03-01 16:12 ` [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
  4 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-03-01 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Chuck Lever <chuck.lever@oracle.com>

Reduce the rate at which nfsd threads hammer on the page allocator.
This improve throughput scalability by enabling the threads to run
more independently of each other.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index cfa7e4776d0e..38a8d6283801 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
+	unsigned long needed;
 	struct xdr_buf *arg;
+	struct page *page;
 	int pages;
 	int i;
 
-	/* now allocate needed pages.  If we get a failure, sleep briefly */
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
 		pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
@@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		/* use as many pages as possible */
 		pages = RPCSVC_MAXPAGES;
 	}
-	for (i = 0; i < pages ; i++)
-		while (rqstp->rq_pages[i] == NULL) {
-			struct page *p = alloc_page(GFP_KERNEL);
-			if (!p) {
-				set_current_state(TASK_INTERRUPTIBLE);
-				if (signalled() || kthread_should_stop()) {
-					set_current_state(TASK_RUNNING);
-					return -EINTR;
-				}
-				schedule_timeout(msecs_to_jiffies(500));
+
+	for (needed = 0, i = 0; i < pages ; i++)
+		if (!rqstp->rq_pages[i])
+			needed++;
+	if (needed) {
+		LIST_HEAD(list);
+
+retry:
+		alloc_pages_bulk(GFP_KERNEL, needed, &list);
+		for (i = 0; i < pages; i++) {
+			if (!rqstp->rq_pages[i]) {
+				page = list_first_entry_or_null(&list,
+								struct page,
+								lru);
+				if (unlikely(!page))
+					goto empty_list;
+				list_del(&page->lru);
+				rqstp->rq_pages[i] = page;
+				needed--;
 			}
-			rqstp->rq_pages[i] = p;
 		}
+	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
 	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
 
@@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 	arg->len = (pages-1)*PAGE_SIZE;
 	arg->tail[0].iov_len = 0;
 	return 0;
+
+empty_list:
+	set_current_state(TASK_INTERRUPTIBLE);
+	if (signalled() || kthread_should_stop()) {
+		set_current_state(TASK_RUNNING);
+		return -EINTR;
+	}
+	schedule_timeout(msecs_to_jiffies(500));
+	goto retry;
 }
 
 static bool
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-03-01 16:11 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (2 preceding siblings ...)
  2021-03-01 16:11 ` [PATCH 3/5] SUNRPC: Refresh rq_pages using " Mel Gorman
@ 2021-03-01 16:11 ` Mel Gorman
  2021-03-02 18:49   ` Ilias Apalodimas
  2021-03-01 16:12 ` [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
  4 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-03-01 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Jesper Dangaard Brouer <brouer@redhat.com>

In preparation for next patch, move the dma mapping into its own
function, as this will make it easier to follow the changes.

V2: make page_pool_dma_map return boolean (Ilias)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/core/page_pool.c | 45 +++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..a26f2ceb6a87 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -180,14 +180,37 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					 pool->p.dma_dir);
 }
 
+static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
+{
+	dma_addr_t dma;
+
+	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
+	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page_attrs(pool->p.dev, page, 0,
+				 (PAGE_SIZE << pool->p.order),
+				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+	if (dma_mapping_error(pool->p.dev, dma))
+		return false;
+
+	page->dma_addr = dma;
+
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
+	return true;
+}
+
 /* slow path */
 noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
+	unsigned int pp_flags = pool->p.flags;
 	struct page *page;
 	gfp_t gfp = _gfp;
-	dma_addr_t dma;
 
 	/* We could always set __GFP_COMP, and avoid this branch, as
 	 * prep_new_page() can handle order-0 with __GFP_COMP.
@@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (!page)
 		return NULL;
 
-	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
-		goto skip_dma_map;
-
-	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
-	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
-	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
-	 * This mapping is kept for lifetime of page, until leaving pool.
-	 */
-	dma = dma_map_page_attrs(pool->p.dev, page, 0,
-				 (PAGE_SIZE << pool->p.order),
-				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
-	if (dma_mapping_error(pool->p.dev, dma)) {
+	if (pp_flags & PP_FLAG_DMA_MAP &&
+	    unlikely(!page_pool_dma_map(pool, page))) {
 		put_page(page);
 		return NULL;
 	}
-	page->dma_addr = dma;
 
-	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
-		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-
-skip_dma_map:
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-
 	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-01 16:11 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
                   ` (3 preceding siblings ...)
  2021-03-01 16:11 ` [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
@ 2021-03-01 16:12 ` Mel Gorman
  2021-03-02 19:05   ` Ilias Apalodimas
  4 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-03-01 16:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, LKML, Linux-Net, Linux-MM,
	Linux-NFS, Mel Gorman

From: Jesper Dangaard Brouer <brouer@redhat.com>

There are cases where the page_pool need to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling these pages.

For these workload it can improve performance to bulk alloc pages from
the page-allocator to refill the alloc cache.

For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
redirecting xdp_frame packets into a veth, that does XDP_PASS to create
an SKB from the xdp_frame, which then cannot return the page to the
page_pool. In this case, we saw[1] an improvement of 18.8% from using
the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 net/core/page_pool.c | 63 ++++++++++++++++++++++++++++----------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a26f2ceb6a87..567680bd91c4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -208,44 +208,61 @@ noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 						 gfp_t _gfp)
 {
+	const int bulk = PP_ALLOC_CACHE_REFILL;
+	struct page *page, *next, *first_page;
 	unsigned int pp_flags = pool->p.flags;
-	struct page *page;
+	unsigned int pp_order = pool->p.order;
+	int pp_nid = pool->p.nid;
+	LIST_HEAD(page_list);
 	gfp_t gfp = _gfp;
 
-	/* We could always set __GFP_COMP, and avoid this branch, as
-	 * prep_new_page() can handle order-0 with __GFP_COMP.
-	 */
-	if (pool->p.order)
+	/* Don't support bulk alloc for high-order pages */
+	if (unlikely(pp_order)) {
 		gfp |= __GFP_COMP;
+		first_page = alloc_pages_node(pp_nid, gfp, pp_order);
+		if (unlikely(!first_page))
+			return NULL;
+		goto out;
+	}
 
-	/* FUTURE development:
-	 *
-	 * Current slow-path essentially falls back to single page
-	 * allocations, which doesn't improve performance.  This code
-	 * need bulk allocation support from the page allocator code.
-	 */
-
-	/* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
-	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
-	page = alloc_pages(gfp, pool->p.order);
-#endif
-	if (!page)
+	if (unlikely(!__alloc_pages_bulk_nodemask(gfp, pp_nid, NULL,
+						  bulk, &page_list)))
 		return NULL;
 
+	/* First page is extracted and returned to caller */
+	first_page = list_first_entry(&page_list, struct page, lru);
+	list_del(&first_page->lru);
+
+	/* Remaining pages store in alloc.cache */
+	list_for_each_entry_safe(page, next, &page_list, lru) {
+		list_del(&page->lru);
+		if (pp_flags & PP_FLAG_DMA_MAP &&
+		    unlikely(!page_pool_dma_map(pool, page))) {
+			put_page(page);
+			continue;
+		}
+		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+			pool->pages_state_hold_cnt++;
+			trace_page_pool_state_hold(pool, page,
+						   pool->pages_state_hold_cnt);
+		} else {
+			put_page(page);
+		}
+	}
+out:
 	if (pp_flags & PP_FLAG_DMA_MAP &&
-	    unlikely(!page_pool_dma_map(pool, page))) {
-		put_page(page);
+	    unlikely(!page_pool_dma_map(pool, first_page))) {
+		put_page(first_page);
 		return NULL;
 	}
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
-	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+	trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
 
 	/* When page just alloc'ed is should/must have refcnt 1. */
-	return page;
+	return first_page;
 }
 
 /* For using page_pool replace: alloc_pages() API calls, but provide
-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-03-01 16:11 ` [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
@ 2021-03-02 18:49   ` Ilias Apalodimas
  2021-03-03  9:18     ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Ilias Apalodimas @ 2021-03-02 18:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS

Hi Mel,

Can you please CC me in future revisions. I almost missed that!

On Mon, Mar 01, 2021 at 04:11:59PM +0000, Mel Gorman wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> In preparation for next patch, move the dma mapping into its own
> function, as this will make it easier to follow the changes.
> 
> V2: make page_pool_dma_map return boolean (Ilias)
> 

[...]

>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  						 gfp_t _gfp)
>  {
> +	unsigned int pp_flags = pool->p.flags;
>  	struct page *page;
>  	gfp_t gfp = _gfp;
> -	dma_addr_t dma;
>  
>  	/* We could always set __GFP_COMP, and avoid this branch, as
>  	 * prep_new_page() can handle order-0 with __GFP_COMP.
> @@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  	if (!page)
>  		return NULL;
>  
> -	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> -		goto skip_dma_map;
> -
> -	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> -	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
> -	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> -	 * This mapping is kept for lifetime of page, until leaving pool.
> -	 */
> -	dma = dma_map_page_attrs(pool->p.dev, page, 0,
> -				 (PAGE_SIZE << pool->p.order),
> -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> -	if (dma_mapping_error(pool->p.dev, dma)) {
> +	if (pp_flags & PP_FLAG_DMA_MAP &&

Nit pick but can we have if ((pp_flags & PP_FLAG_DMA_MAP) && ...

> +	    unlikely(!page_pool_dma_map(pool, page))) {
>  		put_page(page);
>  		return NULL;
>  	}
> -	page->dma_addr = dma;
>  
> -	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> -		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> -
> -skip_dma_map:
>  	/* Track how many pages are held 'in-flight' */
>  	pool->pages_state_hold_cnt++;
> -
>  	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
>  
>  	/* When page just alloc'ed is should/must have refcnt 1. */
> -- 
> 2.26.2
> 

Otherwise 
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path
  2021-03-01 16:12 ` [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
@ 2021-03-02 19:05   ` Ilias Apalodimas
  0 siblings, 0 replies; 19+ messages in thread
From: Ilias Apalodimas @ 2021-03-02 19:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Mon, Mar 01, 2021 at 04:12:00PM +0000, Mel Gorman wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
> 
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
> 
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool. In this case, we saw[1] an improvement of 18.8% from using
> the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  net/core/page_pool.c | 63 ++++++++++++++++++++++++++++----------------
>  1 file changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a26f2ceb6a87..567680bd91c4 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -208,44 +208,61 @@ noinline
>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  						 gfp_t _gfp)
>  {
> +	const int bulk = PP_ALLOC_CACHE_REFILL;
> +	struct page *page, *next, *first_page;
>  	unsigned int pp_flags = pool->p.flags;
> -	struct page *page;
> +	unsigned int pp_order = pool->p.order;
> +	int pp_nid = pool->p.nid;
> +	LIST_HEAD(page_list);
>  	gfp_t gfp = _gfp;
>  
> -	/* We could always set __GFP_COMP, and avoid this branch, as
> -	 * prep_new_page() can handle order-0 with __GFP_COMP.
> -	 */
> -	if (pool->p.order)
> +	/* Don't support bulk alloc for high-order pages */
> +	if (unlikely(pp_order)) {
>  		gfp |= __GFP_COMP;
> +		first_page = alloc_pages_node(pp_nid, gfp, pp_order);
> +		if (unlikely(!first_page))
> +			return NULL;
> +		goto out;
> +	}
>  
> -	/* FUTURE development:
> -	 *
> -	 * Current slow-path essentially falls back to single page
> -	 * allocations, which doesn't improve performance.  This code
> -	 * need bulk allocation support from the page allocator code.
> -	 */
> -
> -	/* Cache was empty, do real allocation */
> -#ifdef CONFIG_NUMA
> -	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> -#else
> -	page = alloc_pages(gfp, pool->p.order);
> -#endif
> -	if (!page)
> +	if (unlikely(!__alloc_pages_bulk_nodemask(gfp, pp_nid, NULL,
> +						  bulk, &page_list)))
>  		return NULL;
>  
> +	/* First page is extracted and returned to caller */
> +	first_page = list_first_entry(&page_list, struct page, lru);
> +	list_del(&first_page->lru);
> +
> +	/* Remaining pages store in alloc.cache */
> +	list_for_each_entry_safe(page, next, &page_list, lru) {
> +		list_del(&page->lru);
> +		if (pp_flags & PP_FLAG_DMA_MAP &&
> +		    unlikely(!page_pool_dma_map(pool, page))) {
> +			put_page(page);
> +			continue;
> +		}
> +		if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> +			pool->alloc.cache[pool->alloc.count++] = page;
> +			pool->pages_state_hold_cnt++;
> +			trace_page_pool_state_hold(pool, page,
> +						   pool->pages_state_hold_cnt);
> +		} else {
> +			put_page(page);
> +		}
> +	}
> +out:
>  	if (pp_flags & PP_FLAG_DMA_MAP &&
> -	    unlikely(!page_pool_dma_map(pool, page))) {
> -		put_page(page);
> +	    unlikely(!page_pool_dma_map(pool, first_page))) {
> +		put_page(first_page);
>  		return NULL;
>  	}
>  
>  	/* Track how many pages are held 'in-flight' */
>  	pool->pages_state_hold_cnt++;
> -	trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> +	trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
>  
>  	/* When page just alloc'ed is should/must have refcnt 1. */
> -	return page;
> +	return first_page;
>  }
>  
>  /* For using page_pool replace: alloc_pages() API calls, but provide
> -- 
> 2.26.2
> 

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-03-02 18:49   ` Ilias Apalodimas
@ 2021-03-03  9:18     ` Mel Gorman
  2021-03-03 10:19       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-03-03  9:18 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Tue, Mar 02, 2021 at 08:49:06PM +0200, Ilias Apalodimas wrote:
> Hi Mel,
> 
> Can you please CC me in future revisions. I almost missed that!
> 

Will do.

> On Mon, Mar 01, 2021 at 04:11:59PM +0000, Mel Gorman wrote:
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > 
> > In preparation for next patch, move the dma mapping into its own
> > function, as this will make it easier to follow the changes.
> > 
> > V2: make page_pool_dma_map return boolean (Ilias)
> > 
> 
> [...]
> 
> > @@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> >  	if (!page)
> >  		return NULL;
> >  
> > -	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> > -		goto skip_dma_map;
> > -
> > -	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> > -	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
> > -	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > -	 * This mapping is kept for lifetime of page, until leaving pool.
> > -	 */
> > -	dma = dma_map_page_attrs(pool->p.dev, page, 0,
> > -				 (PAGE_SIZE << pool->p.order),
> > -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > -	if (dma_mapping_error(pool->p.dev, dma)) {
> > +	if (pp_flags & PP_FLAG_DMA_MAP &&
> 
> Nit pick but can we have if ((pp_flags & PP_FLAG_DMA_MAP) && ...
> 

Done.

> [...]
>
> > 
> 
> Otherwise 
> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> 

Thanks. I'll wait for other review feedback before sending a V2.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map
  2021-03-03  9:18     ` Mel Gorman
@ 2021-03-03 10:19       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-03 10:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ilias Apalodimas, Andrew Morton, Chuck Lever, LKML, Linux-Net,
	Linux-MM, Linux-NFS, brouer

On Wed, 3 Mar 2021 09:18:25 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:
> On Tue, Mar 02, 2021 at 08:49:06PM +0200, Ilias Apalodimas wrote:
> > On Mon, Mar 01, 2021 at 04:11:59PM +0000, Mel Gorman wrote:  
> > > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > > 
> > > In preparation for next patch, move the dma mapping into its own
> > > function, as this will make it easier to follow the changes.
> > > 
> > > V2: make page_pool_dma_map return boolean (Ilias)
> > >   
> > 
> > [...]
> >   
> > > @@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> > >  	if (!page)
> > >  		return NULL;
> > >  
> > > -	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> > > -		goto skip_dma_map;
> > > -
> > > -	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> > > -	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
> > > -	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > > -	 * This mapping is kept for lifetime of page, until leaving pool.
> > > -	 */
> > > -	dma = dma_map_page_attrs(pool->p.dev, page, 0,
> > > -				 (PAGE_SIZE << pool->p.order),
> > > -				 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > -	if (dma_mapping_error(pool->p.dev, dma)) {
> > > +	if (pp_flags & PP_FLAG_DMA_MAP &&  
> > 
> > Nit pick but can we have if ((pp_flags & PP_FLAG_DMA_MAP) && ...
> >   
> 
> Done.

Thanks for fixing this nitpick, and carrying the patch.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator
  2021-03-01 16:11 ` [PATCH 2/5] mm/page_alloc: Add a bulk page allocator Mel Gorman
@ 2021-03-09 17:12   ` Christoph Hellwig
  2021-03-09 18:10     ` Mel Gorman
  2021-03-10 11:04   ` Shay Agroskin
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2021-03-09 17:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS

Would vmalloc be another good user of this API? 

> +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> +	if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))

This crazy long line is really hard to follow.

> +		return 0;
> +	gfp_mask = alloc_mask;
> +
> +	/* Find an allowed local zone that meets the high watermark. */
> +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {

Same here.

> +		unsigned long mark;
> +
> +		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> +		    !__cpuset_zone_allowed(zone, gfp_mask)) {
> +			continue;
> +		}

No need for the curly braces.

>  	}
>  
> -	gfp_mask &= gfp_allowed_mask;
> -	alloc_mask = gfp_mask;

Is this change intentional?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator
  2021-03-09 17:12   ` Christoph Hellwig
@ 2021-03-09 18:10     ` Mel Gorman
  0 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2021-03-09 18:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Tue, Mar 09, 2021 at 05:12:30PM +0000, Christoph Hellwig wrote:
> Would vmalloc be another good user of this API? 
> 
> > +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > +	if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> 
> This crazy long line is really hard to follow.
> 

It's not crazier than what is already in alloc_pages_nodemask to share
code.

> > +		return 0;
> > +	gfp_mask = alloc_mask;
> > +
> > +	/* Find an allowed local zone that meets the high watermark. */
> > +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> 
> Same here.
> 

Similar to what happens in get_page_from_freelist with the
for_next_zone_zonelist_nodemask iterator.

> > +		unsigned long mark;
> > +
> > +		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > +		    !__cpuset_zone_allowed(zone, gfp_mask)) {
> > +			continue;
> > +		}
> 
> No need for the curly braces.
> 

Yes, but it's for coding style. MM has no hard coding style guidelines
around this but for sched, it's generally preferred that if the "if"
statement spans multiple lines then it should use {} even if the block
is one line long for clarity.

> >  	}
> >  
> > -	gfp_mask &= gfp_allowed_mask;
> > -	alloc_mask = gfp_mask;
> 
> Is this change intentional?

Yes so that prepare_alloc_pages works for both the single page and bulk
allocator. Slightly less code duplication.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator
  2021-03-01 16:11 ` [PATCH 2/5] mm/page_alloc: Add a bulk page allocator Mel Gorman
  2021-03-09 17:12   ` Christoph Hellwig
@ 2021-03-10 11:04   ` Shay Agroskin
  2021-03-10 11:38     ` Mel Gorman
  1 sibling, 1 reply; 19+ messages in thread
From: Shay Agroskin @ 2021-03-10 11:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS


Mel Gorman <mgorman@techsingularity.net> writes:

>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 8572a1474e16..4903d1cc48dc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -515,6 +515,10 @@ static inline int 
> arch_make_page_accessible(struct page *page)
>  }
>  #endif
>  
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int 
> preferred_nid,
> +				nodemask_t *nodemask, int 
> nr_pages,
> +				struct list_head *list);
> +
>  struct page *
>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int 
>  preferred_nid,
>  							nodemask_t 
>  *nodemask);
> @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int 
> order, int preferred_nid)
>  	return __alloc_pages_nodemask(gfp_mask, order, 
>  preferred_nid, NULL);
>  }
>  
> +/* Bulk allocate order-0 pages */
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct 
> list_head *list)
> +{
> +	return __alloc_pages_bulk_nodemask(gfp_mask, 
> numa_mem_id(), NULL,
> +							nr_pages, 
> list);

Is the second line indentation intentional ? Why not align it to 
the first argument (gfp_mask) ?

> +}
> +
>  /*
>   * Allocate pages, preferring the node given as nid. The node 
>   must be valid and
>   * online. For more general interface, see alloc_pages_node().
> @@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int 
> nid, size_t size, gfp_t gfp_mask);
>  
>  extern void __free_pages(struct page *page, unsigned int 
>  order);
>  extern void free_pages(unsigned long addr, unsigned int order);
> +extern void free_pages_bulk(struct list_head *list);
>  
>  struct page_frag_cache;
>  extern void __page_frag_cache_drain(struct page *page, unsigned 
>  int count);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..ff1e55793786 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int 
> order, gfp_t gfp_mask,
>  	}
>  }
> ...
>  
> +/*
> + * This is a batched version of the page allocator that 
> attempts to
> + * allocate nr_pages quickly from the preferred zone and add 
> them to list.
> + */
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int 
> preferred_nid,
> +			nodemask_t *nodemask, int nr_pages,
> +			struct list_head *alloc_list)
> +{
> +	struct page *page;
> +	unsigned long flags;
> +	struct zone *zone;
> +	struct zoneref *z;
> +	struct per_cpu_pages *pcp;
> +	struct list_head *pcp_list;
> +	struct alloc_context ac;
> +	gfp_t alloc_mask;
> +	unsigned int alloc_flags;
> +	int alloced = 0;

Does alloced count the number of allocated pages ? Do you mind 
renaming it to 'allocated' ?

> +
> +	if (nr_pages == 1)
> +		goto failed;
> +
> +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 
> page. */
> +	if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, 
> nodemask, &ac, &alloc_mask, &alloc_flags))
> +		return 0;
> +	gfp_mask = alloc_mask;
> +
> +	/* Find an allowed local zone that meets the high 
> watermark. */
> +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> ac.highest_zoneidx, ac.nodemask) {
> +		unsigned long mark;
> +
> +		if (cpusets_enabled() && (alloc_flags & 
> ALLOC_CPUSET) &&
> +		    !__cpuset_zone_allowed(zone, gfp_mask)) {
> +			continue;
> +		}
> +
> +		if (nr_online_nodes > 1 && zone != 
> ac.preferred_zoneref->zone &&
> +		    zone_to_nid(zone) != 
> zone_to_nid(ac.preferred_zoneref->zone)) {
> +			goto failed;
> +		}
> +
> +		mark = wmark_pages(zone, alloc_flags & 
> ALLOC_WMARK_MASK) + nr_pages;
> +		if (zone_watermark_fast(zone, 0,  mark,
> + 
> zonelist_zone_idx(ac.preferred_zoneref),
> +				alloc_flags, gfp_mask)) {
> +			break;
> +		}
> +	}
> +	if (!zone)
> +		return 0;
> +
> +	/* Attempt the batch allocation */
> +	local_irq_save(flags);
> +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> +	pcp_list = &pcp->lists[ac.migratetype];
> +
> +	while (alloced < nr_pages) {
> +		page = __rmqueue_pcplist(zone, ac.migratetype, 
> alloc_flags,
> + 
> pcp, pcp_list);

Same indentation comment as before

> +		if (!page)
> +			break;
> +
> +		prep_new_page(page, 0, gfp_mask, 0);
> +		list_add(&page->lru, alloc_list);
> +		alloced++;
> +	}
> +
> +	if (!alloced)
> +		goto failed_irq;
> +
> +	if (alloced) {
> +		__count_zid_vm_events(PGALLOC, zone_idx(zone), 
> alloced);
> +		zone_statistics(zone, zone);
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return alloced;
> +
> +failed_irq:
> +	local_irq_restore(flags);
> +
> +failed:
> +	page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, 
> nodemask);
> +	if (page) {
> +		alloced++;
> +		list_add(&page->lru, alloc_list);
> +	}
> +
> +	return alloced;
> +}
> +EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
> +
>  /*
>   * This is the 'heart' of the zoned buddy allocator.
>   */
> @@ -4981,8 +5092,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, 
> unsigned int order, int preferred_nid,
>  		return NULL;
>  	}
>  
> -	gfp_mask &= gfp_allowed_mask;
> -	alloc_mask = gfp_mask;
>  	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, 
>  nodemask, &ac, &alloc_mask, &alloc_flags))
>  		return NULL;


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator
  2021-03-10 11:04   ` Shay Agroskin
@ 2021-03-10 11:38     ` Mel Gorman
  2021-03-12 12:01       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-03-10 11:38 UTC (permalink / raw)
  To: Shay Agroskin
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Wed, Mar 10, 2021 at 01:04:17PM +0200, Shay Agroskin wrote:
> 
> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 8572a1474e16..4903d1cc48dc 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct
> > page *page)
> >  }
> >  #endif
> >   +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +				nodemask_t *nodemask, int nr_pages,
> > +				struct list_head *list);
> > +
> >  struct page *
> >  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> > preferred_nid,
> >  							nodemask_t  *nodemask);
> > @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > int preferred_nid)
> >  	return __alloc_pages_nodemask(gfp_mask, order,  preferred_nid, NULL);
> >  }
> > +/* Bulk allocate order-0 pages */
> > +static inline unsigned long
> > +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct
> > list_head *list)
> > +{
> > +	return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
> > +							nr_pages, list);
> 
> Is the second line indentation intentional ? Why not align it to the first
> argument (gfp_mask) ?
> 

No particular reason. I usually pick this as it's visually clearer to me
that it's part of the same line when the multi-line is part of an if block.

> > +}
> > +
> >  /*
> >   * Allocate pages, preferring the node given as nid. The node   must be
> > valid and
> >   * online. For more general interface, see alloc_pages_node().
> > @@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid,
> > size_t size, gfp_t gfp_mask);
> >    extern void __free_pages(struct page *page, unsigned int  order);
> >  extern void free_pages(unsigned long addr, unsigned int order);
> > +extern void free_pages_bulk(struct list_head *list);
> >  struct page_frag_cache;
> >  extern void __page_frag_cache_drain(struct page *page, unsigned  int
> > count);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3e4b29ee2b1e..ff1e55793786 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order,
> > gfp_t gfp_mask,
> >  	}
> >  }
> > ...
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to
> > list.
> > + */
> > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +			nodemask_t *nodemask, int nr_pages,
> > +			struct list_head *alloc_list)
> > +{
> > +	struct page *page;
> > +	unsigned long flags;
> > +	struct zone *zone;
> > +	struct zoneref *z;
> > +	struct per_cpu_pages *pcp;
> > +	struct list_head *pcp_list;
> > +	struct alloc_context ac;
> > +	gfp_t alloc_mask;
> > +	unsigned int alloc_flags;
> > +	int alloced = 0;
> 
> Does alloced count the number of allocated pages ?

Yes.

> Do you mind renaming it to 'allocated' ?

I will if there is another version as I do not feel particularly strongly
about alloced vs allocated. alloc was to match the function name and I
don't think the change makes it much clearer.


> > <SNIP>
> > +	/* Attempt the batch allocation */
> > +	local_irq_save(flags);
> > +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > +	pcp_list = &pcp->lists[ac.migratetype];
> > +
> > +	while (alloced < nr_pages) {
> > +		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > + pcp, pcp_list);
> 
> Same indentation comment as before
> 

Again, simple personal perference to avoid any possibility it's mixed
up with a later line. There has not been consistent code styling
enforcement of what indentation style should be used for a multi-line
within mm/page_alloc.c

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator
  2021-03-10 11:38     ` Mel Gorman
@ 2021-03-12 12:01       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2021-03-12 12:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shay Agroskin, Andrew Morton, Chuck Lever, LKML, Linux-Net,
	Linux-MM, Linux-NFS, brouer

On Wed, 10 Mar 2021 11:38:36 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Mar 10, 2021 at 01:04:17PM +0200, Shay Agroskin wrote:
> > 
> > Mel Gorman <mgorman@techsingularity.net> writes:
> >   
> > > 
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 8572a1474e16..4903d1cc48dc 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct
> > > page *page)
> > >  }
> > >  #endif
> > >   +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > > +				nodemask_t *nodemask, int nr_pages,
> > > +				struct list_head *list);
> > > +
> > >  struct page *
> > >  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> > > preferred_nid,
> > >  							nodemask_t  *nodemask);
> > > @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > int preferred_nid)
> > >  	return __alloc_pages_nodemask(gfp_mask, order,  preferred_nid, NULL);
> > >  }
> > > +/* Bulk allocate order-0 pages */
> > > +static inline unsigned long
> > > +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct
> > > list_head *list)
> > > +{
> > > +	return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
> > > +							nr_pages, list);  
> > 
> > Is the second line indentation intentional ? Why not align it to the first
> > argument (gfp_mask) ?
> >   
> 
> No particular reason. I usually pick this as it's visually clearer to me
> that it's part of the same line when the multi-line is part of an if block.
> 
> > > +}
> > > +
[...]
> > 
> > Same indentation comment as before
> >   
> 
> Again, simple personal perference to avoid any possibility it's mixed
> up with a later line. There has not been consistent code styling
> enforcement of what indentation style should be used for a multi-line
> within mm/page_alloc.c

Hi Shay, it is might be surprising that indentation style actually
differs slightly in different parts of the kernel.  I started in
networking area of the kernel, and I was also surprised when I started
working in MM area that the coding style differs.  I can tell you that
the indentation style Mel choose is consistent with the code styling in
MM area.  I usually respect that even-though I prefer the networking
style as I was "raised" with that style.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-11  8:48   ` Mel Gorman
@ 2021-03-12 15:10     ` Matthew Wilcox
  0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2021-03-12 15:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Chuck Lever, Jesper Dangaard Brouer,
	Christoph Hellwig, LKML, Linux-Net, Linux-MM, Linux-NFS

On Thu, Mar 11, 2021 at 08:48:27AM +0000, Mel Gorman wrote:
> I don't have that information unfortunately. It's a chicken and egg
> problem because without the API, there is no point creating new users.
> For example, fault around or readahead could potentially batch pages
> but whether it is actually noticable when page zeroing has to happen
> is a completely different story. It's a similar story for SLUB, we know
> lower order allocations hurt some microbenchmarks like hackbench-sockets
> but have not quantified what happens if SLUB batch allocates pages when
> high-order allocations fail.

I'm planning on reducing overhead in the readahead path by allocating
higher-order pages rather than by allocating a batch of order-0 pages.
With the old ->readpages interface, it would have made sense to allocate a
batch of pages, but with the new ->readahead interface, we put the pages
into the page cache for the filesystem, so it doesn't make as much sense
any more.

Right now, measuring performance in the readahead path is hard because
we end up contending against kswapd that's trying to evict all the clean
pages that we earlier readahead into this same file.  Could avoid that by
having N files, each about half the size of memory, but then we restart
the readahead algorithm for each file ...

I feel like the real solution for that is to do a GFP_NOWAIT allocation,
then try to evict earlier pages for the same file we're working on so
that kswapd doesn't get woken up if we're just streaming a read through
a gargantuan file.  But I should probably talk to Johannes first.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-10 23:47 ` Andrew Morton
@ 2021-03-11  8:48   ` Mel Gorman
  2021-03-12 15:10     ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-03-11  8:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Wed, Mar 10, 2021 at 03:47:04PM -0800, Andrew Morton wrote:
> On Wed, 10 Mar 2021 10:46:13 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This series introduces a bulk order-0 page allocator with sunrpc and
> > the network page pool being the first users.
> 
> <scratches head>
> 
> Right now, the [0/n] doesn't even tell us that it's a performance
> patchset!
> 

I'll add a note about this improving performance for users that operate
on batches of patches and want to avoid multiple round-trips to the
page allocator.

> The whole point of this patchset appears to appear in the final paragraph
> of the final patch's changelog.
> 

I'll copy&paste that note to the introduction. It's likely that high-speed
networking is the most relevant user in the short-term.

> : For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> : redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> : an SKB from the xdp_frame, which then cannot return the page to the
> : page_pool.  In this case, we saw[1] an improvement of 18.8% from using
> : the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> Much more detail on the overall objective and the observed results,
> please?
> 

I cannot generate that data right now so I need Jesper to comment on
exactly why this is beneficial. For example, while I get that more data
can be processed in a microbenchmark, I do not have a good handle on how
much difference that makes to a practical application. About all I know
is that this problem has been knocking around for 3-4 years at least.

> Also, that workload looks awfully corner-casey.  How beneficial is this
> work for more general and widely-used operations?
> 

At this point, probably nothing for most users because batch page
allocation is not common. It's primarily why I avoided reworking the
whole allocator just to make this a bit tidier.

> > The implementation is not
> > particularly efficient and the intention is to iron out what the semantics
> > of the API should have for users. Once the semantics are ironed out, it can
> > be made more efficient.
> 
> And some guesstimates about how much benefit remains to be realized
> would be helpful.
> 

I don't have that information unfortunately. It's a chicken and egg
problem because without the API, there is no point creating new users.
For example, fault around or readahead could potentially batch pages
but whether it is actually noticable when page zeroing has to happen
is a completely different story. It's a similar story for SLUB, we know
lower order allocations hurt some microbenchmarks like hackbench-sockets
but have not quantified what happens if SLUB batch allocates pages when
high-order allocations fail.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users
  2021-03-10 10:46 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
@ 2021-03-10 23:47 ` Andrew Morton
  2021-03-11  8:48   ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2021-03-10 23:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig, LKML,
	Linux-Net, Linux-MM, Linux-NFS

On Wed, 10 Mar 2021 10:46:13 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users.

<scratches head>

Right now, the [0/n] doesn't even tell us that it's a performance
patchset!

The whole point of this patchset appears to appear in the final paragraph
of the final patch's changelog.

: For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
: redirecting xdp_frame packets into a veth, that does XDP_PASS to create
: an SKB from the xdp_frame, which then cannot return the page to the
: page_pool.  In this case, we saw[1] an improvement of 18.8% from using
: the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Much more detail on the overall objective and the observed results,
please?

Also, that workload looks awfully corner-casey.  How beneficial is this
work for more general and widely-used operations?

> The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it can
> be made more efficient.

And some guesstimates about how much benefit remains to be realized
would be helpful.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users
@ 2021-03-10 10:46 Mel Gorman
  2021-03-10 23:47 ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2021-03-10 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chuck Lever, Jesper Dangaard Brouer, Christoph Hellwig, LKML,
	Linux-Net, Linux-MM, Linux-NFS, Mel Gorman

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient and the intention is to iron out what the semantics
of the API should have for users. Once the semantics are ironed out, it can
be made more efficient.

Improving the implementation requires fairly deep surgery in numerous
places. The lock scope would need to be significantly reduced, particularly
as vmstat, per-cpu and the buddy allocator have different locking protocol
that overall -- e.g. all partially depend on irqs being disabled at
various points. Secondly, the core of the allocator deals with single
pages where as both the bulk allocator and per-cpu allocator operate in
batches. All of that has to be reconciled with all the existing users and
their constraints (memory offline, CMA and cpusets being the trickiest).

Light testing passed, I'm relying on Chuck and Jesper to test the target
users more aggressively but both report performance improvements with the
initial RFC.

Patch 1 of this series is a cleanup to sunrpc, it could be merged
	separately but is included here as a pre-requisite.

Patch 2 is the prototype bulk allocator

Patch 3 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

Patch 4 is a preparation patch only for the network user

Patch 5 converts the net page pool to the bulk allocator for order-0 pages.

There is no obvious impact to the existing paths as only new users of the
API should notice a difference between multiple calls to the allocator
and a single bulk allocation.

 include/linux/gfp.h   |  13 +++++
 mm/page_alloc.c       | 113 +++++++++++++++++++++++++++++++++++++++++-
 net/core/page_pool.c  | 102 +++++++++++++++++++++++---------------
 net/sunrpc/svc_xprt.c |  47 ++++++++++++------
 4 files changed, 220 insertions(+), 55 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-03-12 15:11 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-01 16:11 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
2021-03-01 16:11 ` [PATCH 1/5] SUNRPC: Set rq_page_end differently Mel Gorman
2021-03-01 16:11 ` [PATCH 2/5] mm/page_alloc: Add a bulk page allocator Mel Gorman
2021-03-09 17:12   ` Christoph Hellwig
2021-03-09 18:10     ` Mel Gorman
2021-03-10 11:04   ` Shay Agroskin
2021-03-10 11:38     ` Mel Gorman
2021-03-12 12:01       ` Jesper Dangaard Brouer
2021-03-01 16:11 ` [PATCH 3/5] SUNRPC: Refresh rq_pages using " Mel Gorman
2021-03-01 16:11 ` [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
2021-03-02 18:49   ` Ilias Apalodimas
2021-03-03  9:18     ` Mel Gorman
2021-03-03 10:19       ` Jesper Dangaard Brouer
2021-03-01 16:12 ` [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
2021-03-02 19:05   ` Ilias Apalodimas
2021-03-10 10:46 [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
2021-03-10 23:47 ` Andrew Morton
2021-03-11  8:48   ` Mel Gorman
2021-03-12 15:10     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).