Linux-Fsdevel Archive on lore.kernel.org
* [RFC v6 00/51] Large pages in the page cache
@ 2020-06-10 20:12 Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 01/51] mm: Print head flags in dump_page Matthew Wilcox
                   ` (52 more replies)
  0 siblings, 53 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Another fortnight, another dump of my current large pages work.
I've squished a lot of bugs this time.  xfstests is much happier now,
running for 1631 seconds and getting as far as generic/086.  This patchset
is getting a little big, so I'm going to try to get some bits of it
upstream soon (the bits that make sense regardless of whether the rest
of this is merged).

It's now based on Linus' master (6f630784cc0d), and you can get it from
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache
if you'd rather see it there (this branch is force-pushed frequently).

The primary idea here is that a large part of the overhead in dealing
with individual pages is that there's just so darned many of them.
We would be better off dealing with fewer, larger pages, even if they
don't get to be the size necessary for the CPU to use a larger TLB entry.

The approach taken is to make THPs support arbitrary power-of-two sizes
(instead of just PMDs).  There's probably some tuning to be done to decide
what sizes are worth using, but we're a fair way from doing performance
work with this patchset yet.
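
To make the "fewer, larger pages" argument concrete, here is a minimal
userspace sketch (not kernel code) of how the number of pages needed to
cover a range shrinks as the page order grows.  The 4KiB base page size
and every sketch_* name are assumptions made for illustration only.

```c
#include <assert.h>

/* Assumed base page size for this sketch; the kernel takes it from the arch. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Bytes covered by one page of the given order (order 0 is a base page). */
static inline unsigned long sketch_page_bytes(unsigned int order)
{
	/* Keep the shift within the width of unsigned long. */
	assert(order < 8 * sizeof(unsigned long) - SKETCH_PAGE_SHIFT);
	return SKETCH_PAGE_SIZE << order;
}

/* Pages of this order needed to cover len bytes, rounding up. */
static inline unsigned long sketch_pages_needed(unsigned long len,
						unsigned int order)
{
	unsigned long bytes = sketch_page_bytes(order);

	return (len + bytes - 1) / bytes;
}
```

Under these assumptions, covering 1MiB takes 256 order-0 pages but only
16 order-4 pages, which is the kind of per-page overhead reduction the
paragraph above is after, even though an order-4 page is far smaller
than a PMD.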

TODO:
 - Fix arc/arm/arm64/mips/powerpc/sparc flush_dcache_page() to
   support THPs natively
 - Actually create large pages for sufficiently large writes
 - Copy in larger chunks for write() in iomap
 - More bug fixing

v6:
 - Improved debug output for large pages (will send to Andrew soon)
 - Make compound_nr() more efficient (will send to Andrew soon)
 - Renamed hpage_nr_pages() to thp_nr_pages()
 - Added thp_head()
 - Set the THP_SUPPORT flag in shmfs
 - Change zero_user_segments() to call flush_dcache_page() once for the
   head page instead of once for each subpage.  The architectures listed
   above need to be fixed.
 - Fix shmem & truncate to call zero_user_segment() with the head page
 - Fix page_is_mergeable() for THPs
 - Fix a bug in iomap_iop_set_range_uptodate() where I was assuming that
   the offset was block-aligned
 - Fix a few more places that assume unsigned int is large enough to hold
   offset/length within a page
 - Fix doing writeback of a page after discarding its iop due to a partial
   truncate
 - Convert the iomap write paths more comprehensively.  That's now four
   separate patches

v5:
 - Add a mapping AS_LARGE_PAGES flag to reduce the levels of indirection
   (Dave Chinner)
 - Change iomap_invalidate_page() to handle subpages of a THP being punched
 - Ensure we don't call page_cache_async_readahead() with a tail page
 - Revert to Bill's original patch for thp_get_unmapped_area() to allow
   for hardware page sizes other than PMD to be supported more easily
 - Remove a few more HPAGE_PMD_NR
 - Move shmem_punch_compound() to truncate.c and rename it to punch_thp()
 - Add support for page_private to punch_thp()

v4:
 - Fix thp_size typo
 - Fix the iomap page_mkwrite() path to operate on the head page, even
   though the vm_fault has a pointer to the tail page
 - Fix iomap_finish_ioend() to use bio_for_each_thp_segment_all()
 - Rework PageDoubleMap (see first two patches for details)
 - Fix page_cache_delete() to handle shadow entries being stored to a THP
 - Fix the assertion in pagecache_get_page() to handle tail pages
 - Change PageReadahead from NO_COMPOUND to ONLY_HEAD
 - Handle PageReadahead being set on head pages
 - Handle total_mapcount correctly (Kirill)
 - Pull the FS_LARGE_PAGES check out into mapping_large_pages()
 - Fix page size assumption in truncate_cleanup_page()
 - Avoid splitting large pages unnecessarily on truncate
 - Disable the page cache truncation introduced as part of the read-only
   THP patch set
 - Call compound_head() in iomap buffered write paths -- we retrieve a
   (potentially) tail page from the page cache and need to use that for
   flush_dcache_page(), but we expect to operate on a head page in most
   of the iomap code



Kirill A. Shutemov (1):
  mm: Fix total_mapcount assumption of page size

Matthew Wilcox (Oracle) (49):
  mm: Print head flags in dump_page
  mm: Print the inode number in dump_page
  mm: Print hashed address of struct page
  mm: Move PageDoubleMap bit
  mm: Simplify PageDoubleMap with PF_SECOND policy
  mm: Store compound_nr as well as compound_order
  mm: Move page-flags include to top of file
  mm: Add thp_order
  mm: Add thp_size
  mm: Replace hpage_nr_pages with thp_nr_pages
  mm: Add thp_head
  mm: Introduce offset_in_thp
  mm: Support arbitrary THP sizes
  fs: Add a filesystem flag for THPs
  fs: Do not update nr_thps for mappings which support THPs
  fs: Introduce i_blocks_per_page
  fs: Make page_mkwrite_check_truncate thp-aware
  mm: Support THPs in zero_user_segments
  mm: Zero the head page, not the tail page
  block: Add bio_for_each_thp_segment_all
  block: Support THPs in page_is_mergeable
  iomap: Support arbitrarily many blocks per page
  iomap: Support THPs in iomap_adjust_read_range
  iomap: Support THPs in invalidatepage
  iomap: Support THPs in read paths
  iomap: Convert iomap_write_end types
  iomap: Change calling convention for zeroing
  iomap: Change iomap_write_begin calling convention
  iomap: Support THPs in write paths
  iomap: Inline data shouldn't see THPs
  iomap: Handle tail pages in iomap_page_mkwrite
  xfs: Support THPs
  mm: Make prep_transhuge_page return its argument
  mm: Add __page_cache_alloc_order
  mm: Allow THPs to be added to the page cache
  mm: Allow THPs to be removed from the page cache
  mm: Remove page fault assumption of compound page size
  mm: Remove assumptions of THP size
  mm: Avoid splitting THPs
  mm: Fix truncation for pages of arbitrary size
  mm: Handle truncates that split THPs
  mm: Support storing shadow entries for THPs
  mm: Support retrieving tail pages from the page cache
  mm: Support tail pages in wait_for_stable_page
  mm: Add DEFINE_READAHEAD
  mm: Make page_cache_readahead_unbounded take a readahead_control
  mm: Make __do_page_cache_readahead take a readahead_control
  mm: Allow PageReadahead to be set on head pages
  mm: Add THP readahead

William Kucharski (1):
  mm: Align THP mappings for non-DAX

 block/bio.c                |   2 +-
 drivers/nvdimm/btt.c       |   4 +-
 drivers/nvdimm/pmem.c      |   6 +-
 fs/dax.c                   |  13 +-
 fs/ext4/verity.c           |   4 +-
 fs/f2fs/verity.c           |   4 +-
 fs/inode.c                 |   2 +
 fs/iomap/buffered-io.c     | 250 +++++++++++++++++++------------------
 fs/jfs/jfs_metapage.c      |   2 +-
 fs/xfs/xfs_aops.c          |   4 +-
 fs/xfs/xfs_super.c         |   2 +-
 include/linux/bio.h        |  13 ++
 include/linux/bvec.h       |  23 ++++
 include/linux/dax.h        |   3 +-
 include/linux/fs.h         |  28 +----
 include/linux/highmem.h    |  11 +-
 include/linux/huge_mm.h    |  65 ++++++++--
 include/linux/mm.h         |  46 +++----
 include/linux/mm_inline.h  |   6 +-
 include/linux/mm_types.h   |   1 +
 include/linux/page-flags.h |  46 ++-----
 include/linux/pagemap.h    | 102 ++++++++++++---
 mm/compaction.c            |   2 +-
 mm/debug.c                 |  23 ++--
 mm/filemap.c               | 101 +++++++++------
 mm/gup.c                   |   2 +-
 mm/highmem.c               |  62 ++++++++-
 mm/huge_memory.c           |  38 +++---
 mm/hugetlb.c               |   2 +-
 mm/internal.h              |  17 +--
 mm/memcontrol.c            |  10 +-
 mm/memory.c                |   7 +-
 mm/memory_hotplug.c        |   7 +-
 mm/mempolicy.c             |   2 +-
 mm/migrate.c               |  16 +--
 mm/mlock.c                 |   9 +-
 mm/page-writeback.c        |   1 +
 mm/page_alloc.c            |   5 +-
 mm/page_io.c               |   4 +-
 mm/page_vma_mapped.c       |   6 +-
 mm/readahead.c             | 145 ++++++++++++++++-----
 mm/rmap.c                  |  18 +--
 mm/shmem.c                 |  39 ++----
 mm/swap.c                  |  16 +--
 mm/swap_state.c            |   6 +-
 mm/swapfile.c              |   2 +-
 mm/truncate.c              |  70 ++++++++++-
 mm/vmscan.c                |  12 +-
 48 files changed, 795 insertions(+), 464 deletions(-)

-- 
2.26.2

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 01/51] mm: Print head flags in dump_page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
@ 2020-06-10 20:12 ` Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 02/51] mm: Print the inode number " Matthew Wilcox
                   ` (51 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Tail page flags contain very little useful information.  Print the head
page's flags instead (even though seeing PageHead set when a tail page
is dumped is a little misleading).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/debug.c b/mm/debug.c
index b5b1de8c71ac..384eef80649d 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -163,7 +163,7 @@ void __dump_page(struct page *page, const char *reason)
 out_mapping:
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1);
 
-	pr_warn("%sflags: %#lx(%pGp)%s\n", type, page->flags, &page->flags,
+	pr_warn("%sflags: %#lx(%pGp)%s\n", type, head->flags, &head->flags,
 		page_cma ? " CMA" : "");
 
 hex_only:
-- 
2.26.2



* [PATCH v6 02/51] mm: Print the inode number in dump_page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 01/51] mm: Print head flags in dump_page Matthew Wilcox
@ 2020-06-10 20:12 ` Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 03/51] mm: Print hashed address of struct page Matthew Wilcox
                   ` (50 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The inode number helps correlate this page with debug messages elsewhere
in the kernel.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/debug.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index 384eef80649d..e30e35b41d0e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -133,15 +133,16 @@ void __dump_page(struct page *page, const char *reason)
 			goto out_mapping;
 		}
 
-		if (probe_kernel_read(&dentry_first,
-			&host->i_dentry.first, sizeof(struct hlist_node *))) {
+		if (probe_kernel_read(&dentry_first, &host->i_dentry.first,
+					sizeof(struct hlist_node *))) {
 			pr_warn("mapping->a_ops:%ps with invalid mapping->host inode address %px\n",
 				a_ops, host);
 			goto out_mapping;
 		}
 
 		if (!dentry_first) {
-			pr_warn("mapping->a_ops:%ps\n", a_ops);
+			pr_warn("mapping->a_ops:%ps ino %lx\n", a_ops,
+					host->i_ino);
 			goto out_mapping;
 		}
 
@@ -156,8 +157,8 @@ void __dump_page(struct page *page, const char *reason)
 			 * crash, but it's unlikely that we reach here with a
 			 * corrupted struct page
 			 */
-			pr_warn("mapping->aops:%ps dentry name:\"%pd\"\n",
-								a_ops, &dentry);
+			pr_warn("mapping->aops:%ps ino %lx dentry name:\"%pd\"\n",
+					a_ops, host->i_ino, &dentry);
 		}
 	}
 out_mapping:
-- 
2.26.2



* [PATCH v6 03/51] mm: Print hashed address of struct page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 01/51] mm: Print head flags in dump_page Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 02/51] mm: Print the inode number " Matthew Wilcox
@ 2020-06-10 20:12 ` Matthew Wilcox
  2020-06-10 20:12 ` [PATCH v6 04/51] mm: Move PageDoubleMap bit Matthew Wilcox
                   ` (49 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The actual address of the struct page isn't particularly helpful,
while the hashed address helps match with other messages elsewhere.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/debug.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index e30e35b41d0e..b17909c16a77 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -86,23 +86,23 @@ void __dump_page(struct page *page, const char *reason)
 
 	if (compound)
 		if (hpage_pincount_available(page)) {
-			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
-				"index:%#lx head:%px order:%u "
+			pr_warn("page:%p refcount:%d mapcount:%d mapping:%p "
+				"index:%#lx head:%p order:%u "
 				"compound_mapcount:%d compound_pincount:%d\n",
 				page, page_ref_count(head), mapcount,
 				mapping, page_to_pgoff(page), head,
 				compound_order(head), compound_mapcount(page),
 				compound_pincount(page));
 		} else {
-			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
-				"index:%#lx head:%px order:%u "
+			pr_warn("page:%p refcount:%d mapcount:%d mapping:%p "
+				"index:%#lx head:%p order:%u "
 				"compound_mapcount:%d\n",
 				page, page_ref_count(head), mapcount,
 				mapping, page_to_pgoff(page), head,
 				compound_order(head), compound_mapcount(page));
 		}
 	else
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
+		pr_warn("page:%p refcount:%d mapcount:%d mapping:%p index:%#lx\n",
 			page, page_ref_count(page), mapcount,
 			mapping, page_to_pgoff(page));
 	if (PageKsm(page))
-- 
2.26.2



* [PATCH v6 04/51] mm: Move PageDoubleMap bit
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (2 preceding siblings ...)
  2020-06-10 20:12 ` [PATCH v6 03/51] mm: Print hashed address of struct page Matthew Wilcox
@ 2020-06-10 20:12 ` Matthew Wilcox
  2020-06-11 15:03   ` Zi Yan
  2020-06-10 20:12 ` [PATCH v6 05/51] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
                   ` (48 subsequent siblings)
  52 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

PG_private_2 is defined as being PF_ANY (applicable to tail pages
as well as regular & head pages).  That means that the first tail
page of a double-map page will appear to have Private2 set.  Use the
Workingset bit instead, which is defined as PF_HEAD, so any attempt to
access the Workingset bit on a tail page will redirect to the head page's
Workingset bit.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..de6e0696f55c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -164,7 +164,7 @@ enum pageflags {
 	PG_slob_free = PG_private,
 
 	/* Compound pages. Stored in first tail page's flags */
-	PG_double_map = PG_private_2,
+	PG_double_map = PG_workingset,
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
-- 
2.26.2
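
The aliasing problem described in this patch can be reproduced in a
small userspace model.  This is only a sketch: struct sketch_page, the
two-bit enum and the sketch_* helpers are hypothetical stand-ins for
struct page and the PF_ANY / PF_HEAD policies, not the kernel's
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two flag bits, standing in for PG_private_2 and PG_workingset. */
enum sketch_bit { SK_PRIVATE_2, SK_WORKINGSET };

struct sketch_page {
	unsigned long flags;
	struct sketch_page *head;	/* NULL for head or base pages */
};

/* PF_ANY-style test: look at the flags of exactly the page passed in. */
static inline bool sketch_test_any(const struct sketch_page *page,
				   enum sketch_bit bit)
{
	return page->flags & (1UL << bit);
}

/* PF_HEAD-style test: a tail page is redirected to its head page first. */
static inline bool sketch_test_head(const struct sketch_page *page,
				    enum sketch_bit bit)
{
	const struct sketch_page *p = page->head ? page->head : page;

	return p->flags & (1UL << bit);
}

/* DoubleMap is stored in the first tail page's flags, in the chosen bit. */
static inline void sketch_set_double_map(struct sketch_page *head,
					 enum sketch_bit bit)
{
	head[1].flags |= 1UL << bit;
}
```

In this model, storing DoubleMap in the PF_ANY bit makes a Private2
test on the first tail page return true, while storing it in the
PF_HEAD bit leaves a Workingset test on that tail page unaffected,
because the test never reads the tail page's own flags word.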



* [PATCH v6 05/51] mm: Simplify PageDoubleMap with PF_SECOND policy
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (3 preceding siblings ...)
  2020-06-10 20:12 ` [PATCH v6 04/51] mm: Move PageDoubleMap bit Matthew Wilcox
@ 2020-06-10 20:12 ` Matthew Wilcox
  2020-06-11 15:14   ` Zi Yan
  2020-06-10 20:13 ` [PATCH v6 06/51] mm: Store compound_nr as well as compound_order Matthew Wilcox
                   ` (47 subsequent siblings)
  52 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Introduce the new page policy of PF_SECOND which lets us use the
normal pageflags generation machinery to create the various DoubleMap
manipulation functions.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page-flags.h | 40 ++++++++++----------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index de6e0696f55c..979460df4768 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -232,6 +232,9 @@ static inline void page_init_poison(struct page *page, size_t size)
  *
  * PF_NO_COMPOUND:
  *     the page flag is not relevant for compound pages.
+ *
+ * PF_SECOND:
+ *     the page flag is stored in the first tail page.
  */
 #define PF_POISONED_CHECK(page) ({					\
 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
@@ -247,6 +250,9 @@ static inline void page_init_poison(struct page *page, size_t size)
 #define PF_NO_COMPOUND(page, enforce) ({				\
 		VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
 		PF_POISONED_CHECK(page); })
+#define PF_SECOND(page, enforce) ({					\
+		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
+		PF_POISONED_CHECK(&page[1]); })
 
 /*
  * Macros to create function definitions for page flags
@@ -685,42 +691,15 @@ static inline int PageTransTail(struct page *page)
  *
  * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
  */
-static inline int PageDoubleMap(struct page *page)
-{
-	return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
-}
-
-static inline void SetPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	set_bit(PG_double_map, &page[1].flags);
-}
-
-static inline void ClearPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	clear_bit(PG_double_map, &page[1].flags);
-}
-static inline int TestSetPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_set_bit(PG_double_map, &page[1].flags);
-}
-
-static inline int TestClearPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_clear_bit(PG_double_map, &page[1].flags);
-}
-
+PAGEFLAG(DoubleMap, double_map, PF_SECOND)
+	TESTSCFLAG(DoubleMap, double_map, PF_SECOND)
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
 PAGEFLAG_FALSE(DoubleMap)
-	TESTSETFLAG_FALSE(DoubleMap)
-	TESTCLEARFLAG_FALSE(DoubleMap)
+	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
 /*
@@ -875,6 +854,7 @@ static inline int page_has_private(struct page *page)
 #undef PF_ONLY_HEAD
 #undef PF_NO_TAIL
 #undef PF_NO_COMPOUND
+#undef PF_SECOND
 #endif /* !__GENERATING_BOUNDS_H */
 
 #endif	/* PAGE_FLAGS_H */
-- 
2.26.2
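
As a rough illustration of what a "policy" buys here, the sketch below
generates test/set/clear helpers from one macro plus a policy function
that maps a head page to its first tail page.  It is a simplified
userspace model: SK_PAGEFLAG, sk_pf_second, struct sketch_page and the
bit number are all hypothetical, and the kernel's PAGEFLAG machinery is
considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_page {
	unsigned long flags;
	bool is_head;
};

#define SK_DOUBLE_MAP_BIT 3	/* arbitrary bit number for this sketch */

/* PF_SECOND-style policy: require a head page, operate on the first tail. */
static inline struct sketch_page *sk_pf_second(struct sketch_page *page)
{
	assert(page->is_head);
	return &page[1];
}

/* Generate test/set/clear helpers from a name, a bit and a policy. */
#define SK_PAGEFLAG(name, bit, policy)					\
static inline bool SkPage##name(struct sketch_page *page)		\
{ return policy(page)->flags & (1UL << (bit)); }			\
static inline void SkSetPage##name(struct sketch_page *page)		\
{ policy(page)->flags |= 1UL << (bit); }				\
static inline void SkClearPage##name(struct sketch_page *page)		\
{ policy(page)->flags &= ~(1UL << (bit)); }

SK_PAGEFLAG(DoubleMap, SK_DOUBLE_MAP_BIT, sk_pf_second)
```

One difference is visible in the diff itself: the removed open-coded
PageDoubleMap() returned false when handed a non-head page, whereas
with PF_SECOND the head-page requirement becomes an assertion in the
policy, as it does in this sketch.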



* [PATCH v6 06/51] mm: Store compound_nr as well as compound_order
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (4 preceding siblings ...)
  2020-06-10 20:12 ` [PATCH v6 05/51] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 07/51] mm: Move page-flags include to top of file Matthew Wilcox
                   ` (46 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This removes a few instructions from functions which need to know how many
pages are in a compound page.  The storage used is either page->mapping
on 64-bit or page->index on 32-bit.  Both of these are fine to overlay
on tail pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h       | 5 ++++-
 include/linux/mm_types.h | 1 +
 mm/page_alloc.c          | 5 +++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..af0305ad090f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -911,12 +911,15 @@ static inline int compound_pincount(struct page *page)
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
+	page[1].compound_nr = 1U << order;
 }
 
 /* Returns the number of pages in this potentially compound page. */
 static inline unsigned long compound_nr(struct page *page)
 {
-	return 1UL << compound_order(page);
+	if (!PageHead(page))
+		return 1;
+	return page[1].compound_nr;
 }
 
 /* Returns the number of bytes in this potentially compound page. */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 64ede5f150dc..561ed987ab44 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -134,6 +134,7 @@ struct page {
 			unsigned char compound_dtor;
 			unsigned char compound_order;
 			atomic_t compound_mapcount;
+			unsigned int compound_nr; /* 1 << compound_order */
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 727751219003..3fb61ef4c3a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -673,8 +673,6 @@ void prep_compound_page(struct page *page, unsigned int order)
 	int i;
 	int nr_pages = 1 << order;
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
@@ -682,6 +680,9 @@ void prep_compound_page(struct page *page, unsigned int order)
 		p->mapping = TAIL_MAPPING;
 		set_compound_head(p, page);
 	}
+
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
 	atomic_set(compound_mapcount_ptr(page), -1);
 	if (hpage_pincount_available(page))
 		atomic_set(compound_pincount_ptr(page), 0);
-- 
2.26.2
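
The caching idea in this patch can be sketched in userspace as follows.
The struct layout and sketch_* names are hypothetical; only the shape
(store 1 << order next to the order in the first tail page, and return
1 for anything that is not a head page) is taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_page {
	bool is_head;
	unsigned char compound_order;	/* meaningful in page[1] only */
	unsigned int compound_nr;	/* cached 1 << compound_order, page[1] only */
};

/* Record the order and the derived page count in the first tail page. */
static inline void sketch_set_compound_order(struct sketch_page *head,
					     unsigned int order)
{
	assert(order < 32);
	head[1].compound_order = order;
	head[1].compound_nr = 1U << order;
}

/* Number of pages: a plain load for head pages, 1 for everything else. */
static inline unsigned long sketch_compound_nr(const struct sketch_page *page)
{
	if (!page->is_head)
		return 1;
	return page[1].compound_nr;
}
```

The reordering in prep_compound_page() is presumably related: the loop
writes p->mapping = TAIL_MAPPING into every tail page, and since the
commit message says compound_nr overlays page->mapping on 64-bit,
setting the order before the loop would let that store overwrite the
cached count in page[1].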



* [PATCH v6 07/51] mm: Move page-flags include to top of file
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (5 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 06/51] mm: Store compound_nr as well as compound_order Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 08/51] mm: Add thp_order Matthew Wilcox
                   ` (45 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Give up on the notion that we can remove page-flags.h from mm.h.
There are currently 14 inline functions which use a PageFoo function.
Also, two of the files directly included by mm.h include page-flags.h
themselves, and there are probably more indirect inclusions.  So just
include it at the top like any other header file.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index af0305ad090f..6c29b663135f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -24,6 +24,7 @@
 #include <linux/resource.h>
 #include <linux/page_ext.h>
 #include <linux/err.h>
+#include <linux/page-flags.h>
 #include <linux/page_ref.h>
 #include <linux/memremap.h>
 #include <linux/overflow.h>
@@ -667,11 +668,6 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 struct mmu_gather;
 struct inode;
 
-/*
- * FIXME: take this include out, include page-flags.h in
- * files which need it (119 of them)
- */
-#include <linux/page-flags.h>
 #include <linux/huge_mm.h>
 
 /*
-- 
2.26.2



* [PATCH v6 08/51] mm: Add thp_order
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (6 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 07/51] mm: Move page-flags include to top of file Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 09/51] mm: Add thp_size Matthew Wilcox
                   ` (44 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This function returns the order of a transparent huge page.  It
compiles to 0 if CONFIG_TRANSPARENT_HUGEPAGE is disabled.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 71f20776b06c..dd19720a8bc2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -265,6 +265,19 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 	else
 		return NULL;
 }
+
+/**
+ * thp_order - Order of a transparent huge page.
+ * @page: Head page of a transparent huge page.
+ */
+static inline unsigned int thp_order(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	if (PageHead(page))
+		return HPAGE_PMD_ORDER;
+	return 0;
+}
+
 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
@@ -324,6 +337,12 @@ static inline struct list_head *page_deferred_list(struct page *page)
 #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
+static inline unsigned int thp_order(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	return 0;
+}
+
 static inline int hpage_nr_pages(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
-- 
2.26.2
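
The "compiles to 0" property can be modelled with an ordinary
preprocessor switch.  In this userspace sketch, SKETCH_CONFIG_THP stands
in for CONFIG_TRANSPARENT_HUGEPAGE and SKETCH_PMD_ORDER is an assumed
value (9, i.e. 2MiB PMDs over 4KiB base pages); neither name exists in
the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CONFIG_THP 1	/* remove this line to model a THP-less build */
#define SKETCH_PMD_ORDER  9	/* assumed: 2MiB PMD, 4KiB base page */

struct sketch_page {
	bool is_head;
	bool is_tail;
};

#ifdef SKETCH_CONFIG_THP
/* With THP support, a head page reports the PMD order. */
static inline unsigned int sketch_thp_order(const struct sketch_page *page)
{
	assert(!page->is_tail);		/* callers must pass a head or base page */
	return page->is_head ? SKETCH_PMD_ORDER : 0;
}
#else
/* Without THP support the result is a constant the compiler can fold. */
static inline unsigned int sketch_thp_order(const struct sketch_page *page)
{
	assert(!page->is_tail);
	return 0;
}
#endif
```

Callers can then write size arithmetic in terms of the order without
wrapping it in #ifdef blocks.  Note that at this point in the series a
head page always reports the PMD order, so the helper does not yet
describe pages of other sizes.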



* [PATCH v6 09/51] mm: Add thp_size
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (7 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 08/51] mm: Add thp_order Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 10/51] mm: Replace hpage_nr_pages with thp_nr_pages Matthew Wilcox
                   ` (43 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This function returns the number of bytes in a THP.  It is like
page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
is disabled.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/nvdimm/btt.c    |  4 +---
 drivers/nvdimm/pmem.c   |  6 ++----
 include/linux/huge_mm.h | 11 +++++++++++
 mm/internal.h           |  2 +-
 mm/page_io.c            |  2 +-
 mm/page_vma_mapped.c    |  4 ++--
 6 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 90c0c4bbe77b..7255cfe6ebe2 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1490,10 +1490,8 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 {
 	struct btt *btt = bdev->bd_disk->private_data;
 	int rc;
-	unsigned int len;
 
-	len = hpage_nr_pages(page) * PAGE_SIZE;
-	rc = btt_do_bvec(btt, NULL, page, len, 0, op, sector);
+	rc = btt_do_bvec(btt, NULL, page, thp_size(page), 0, op, sector);
 	if (rc == 0)
 		page_endio(page, op_is_write(op), 0);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d1ecd6da11a2..d1999c266b20 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -238,11 +238,9 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	blk_status_t rc;
 
 	if (op_is_write(op))
-		rc = pmem_do_write(pmem, page, 0, sector,
-				   hpage_nr_pages(page) * PAGE_SIZE);
+		rc = pmem_do_write(pmem, page, 0, sector, thp_size(page));
 	else
-		rc = pmem_do_read(pmem, page, 0, sector,
-				   hpage_nr_pages(page) * PAGE_SIZE);
+		rc = pmem_do_read(pmem, page, 0, sector, thp_size(page));
 	/*
 	 * The ->rw_page interface is subtle and tricky.  The core
 	 * retries on any error, so we can only invoke page_endio() in
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index dd19720a8bc2..0ec3b5a73d38 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -469,4 +469,15 @@ static inline bool thp_migration_supported(void)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+/**
+ * thp_size - Size of a transparent huge page.
+ * @page: Head page of a transparent huge page.
+ *
+ * Return: Number of bytes in this page.
+ */
+static inline unsigned long thp_size(struct page *page)
+{
+	return PAGE_SIZE << thp_order(page);
+}
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/internal.h b/mm/internal.h
index 9886db20d94f..de9f1d0ba5fc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,7 +395,7 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + thp_size(page) - PAGE_SIZE;
 
 	/* page should be within @vma mapping range */
 	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
diff --git a/mm/page_io.c b/mm/page_io.c
index e8726f3e3820..888000d1a8cc 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -40,7 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
 		bio->bi_end_io = end_io;
 
-		bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
+		bio_add_page(bio, page, thp_size(page), 0);
 	}
 	return bio;
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 719c35246cfa..e65629c056e8 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -227,7 +227,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			if (pvmw->address >= pvmw->vma->vm_end ||
 			    pvmw->address >=
 					__vma_address(pvmw->page, pvmw->vma) +
-					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+					thp_size(pvmw->page))
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -268,7 +268,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	unsigned long start, end;
 
 	start = __vma_address(page, vma);
-	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+	end = start + thp_size(page) - PAGE_SIZE;
 
 	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
 		return 0;
-- 
2.26.2
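
Several hunks above replace PAGE_SIZE * (hpage_nr_pages(page) - 1) with
thp_size(page) - PAGE_SIZE.  The following userspace sketch checks that
the two forms agree for power-of-two page counts; the 4KiB page size
and the sketch_* helpers are assumptions for illustration, and the
helpers take an order directly rather than a struct page.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size */

/* Bytes in a page of the given order, mirroring PAGE_SIZE << thp_order(). */
static inline unsigned long sketch_thp_size(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Address of the last base page, old form: PAGE_SIZE * (nr_pages - 1). */
static inline unsigned long sketch_last_old(unsigned long start,
					    unsigned int order)
{
	unsigned long nr = 1UL << order;

	return start + SKETCH_PAGE_SIZE * (nr - 1);
}

/* Address of the last base page, new form: thp_size() - PAGE_SIZE. */
static inline unsigned long sketch_last_new(unsigned long start,
					    unsigned int order)
{
	return start + sketch_thp_size(order) - SKETCH_PAGE_SIZE;
}
```

Both forms name the start of the last base page inside the huge page,
so the range checks in vma_address() and page_mapped_in_vma() keep
their meaning after the conversion.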



* [PATCH v6 10/51] mm: Replace hpage_nr_pages with thp_nr_pages
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (8 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 09/51] mm: Add thp_size Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 11/51] mm: Add thp_head Matthew Wilcox
                   ` (42 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The thp prefix is more frequently used than hpage and we should
be consistent between the various functions.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
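For illustration only: a minimal userspace sketch of what the renamed
helper returns at this point in the series.  It assumes a toy struct page
with two booleans standing in for PageHead()/PageTail(), and fixes
HPAGE_PMD_NR at 512 (the value for 2MiB PMDs with 4KiB base pages).
lru_size_after_add() is a hypothetical caller, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value only: 2MiB PMD / 4KiB base page. */
#define HPAGE_PMD_NR 512

/* Toy stand-in for the kernel's struct page. */
struct page {
	bool head;	/* models PageHead() */
	bool tail;	/* models PageTail() */
};

/* Same shape as the renamed helper: callers must pass a head page. */
static inline int thp_nr_pages(struct page *page)
{
	assert(!page->tail);	/* stands in for VM_BUG_ON_PGFLAGS() */
	if (page->head)
		return HPAGE_PMD_NR;
	return 1;
}

/* Hypothetical caller: account every base page of a THP in one step. */
static inline long lru_size_after_add(long lru_size, struct page *page)
{
	return lru_size + thp_nr_pages(page);
}
```

A regular page counts as 1 and a head page as HPAGE_PMD_NR, which is why
the statistics callers below can pass the result straight through.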
 include/linux/huge_mm.h   | 13 +++++++++----
 include/linux/mm_inline.h |  6 +++---
 include/linux/pagemap.h   |  6 +++---
 mm/compaction.c           |  2 +-
 mm/filemap.c              |  2 +-
 mm/gup.c                  |  2 +-
 mm/hugetlb.c              |  2 +-
 mm/internal.h             |  2 +-
 mm/memcontrol.c           | 10 +++++-----
 mm/memory_hotplug.c       |  7 +++----
 mm/mempolicy.c            |  2 +-
 mm/migrate.c              | 16 ++++++++--------
 mm/mlock.c                |  9 ++++-----
 mm/page_io.c              |  2 +-
 mm/page_vma_mapped.c      |  2 +-
 mm/rmap.c                 |  8 ++++----
 mm/swap.c                 | 16 ++++++++--------
 mm/swap_state.c           |  6 +++---
 mm/swapfile.c             |  2 +-
 mm/vmscan.c               |  6 +++---
 20 files changed, 62 insertions(+), 59 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0ec3b5a73d38..dcdfd21763a3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -278,9 +278,14 @@ static inline unsigned int thp_order(struct page *page)
 	return 0;
 }
 
-static inline int hpage_nr_pages(struct page *page)
+/**
+ * thp_nr_pages - The number of regular pages in this huge page.
+ * @page: The head page of a huge page.
+ */
+static inline int thp_nr_pages(struct page *page)
 {
-	if (unlikely(PageTransHuge(page)))
+	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	if (PageHead(page))
 		return HPAGE_PMD_NR;
 	return 1;
 }
@@ -343,9 +348,9 @@ static inline unsigned int thp_order(struct page *page)
 	return 0;
 }
 
-static inline int hpage_nr_pages(struct page *page)
+static inline int thp_nr_pages(struct page *page)
 {
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	VM_BUG_ON_PGFLAGS(PageTail(page), page);
 	return 1;
 }
 
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 219bef41d87c..8fc71e9d7bb0 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -48,14 +48,14 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
 static __always_inline void add_page_to_lru_list_tail(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
 	list_add_tail(&page->lru, &lruvec->lists[lru]);
 }
 
@@ -63,7 +63,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
 	list_del(&page->lru);
-	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), -thp_nr_pages(page));
 }
 
 /**
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cf2468da68e9..484a36185bb5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -381,7 +381,7 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index)
 	if (PageHuge(head))
 		return head;
 
-	return head + (index & (hpage_nr_pages(head) - 1));
+	return head + (index & (thp_nr_pages(head) - 1));
 }
 
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
@@ -730,7 +730,7 @@ static inline struct page *readahead_page(struct readahead_control *rac)
 
 	page = xa_load(&rac->mapping->i_pages, rac->_index);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	rac->_batch_count = hpage_nr_pages(page);
+	rac->_batch_count = thp_nr_pages(page);
 
 	return page;
 }
@@ -753,7 +753,7 @@ static inline unsigned int __readahead_batch(struct readahead_control *rac,
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageTail(page), page);
 		array[i++] = page;
-		rac->_batch_count += hpage_nr_pages(page);
+		rac->_batch_count += thp_nr_pages(page);
 
 		/*
 		 * The page cache isn't using multi-index entries yet,
diff --git a/mm/compaction.c b/mm/compaction.c
index fd988b7e5f2b..81eaffcfbe4e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -991,7 +991,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		mod_node_page_state(page_pgdat(page),
 				NR_ISOLATED_ANON + page_is_file_lru(page),
-				hpage_nr_pages(page));
+				thp_nr_pages(page));
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..80ce3658b147 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -197,7 +197,7 @@ static void unaccount_page_cache_page(struct address_space *mapping,
 	if (PageHuge(page))
 		return;
 
-	nr = hpage_nr_pages(page);
+	nr = thp_nr_pages(page);
 
 	__mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page)) {
diff --git a/mm/gup.c b/mm/gup.c
index de9e36262ccb..12d066baee11 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1703,7 +1703,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 					mod_node_page_state(page_pgdat(head),
 							    NR_ISOLATED_ANON +
 							    page_is_file_lru(head),
-							    hpage_nr_pages(head));
+							    thp_nr_pages(head));
 				}
 			}
 		}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 57ece74e3aae..6bb07bc655f7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1593,7 +1593,7 @@ static struct address_space *_get_hugetlb_page_mapping(struct page *hpage)
 
 	/* Use first found vma */
 	pgoff_start = page_to_pgoff(hpage);
-	pgoff_end = pgoff_start + hpage_nr_pages(hpage) - 1;
+	pgoff_end = pgoff_start + thp_nr_pages(hpage) - 1;
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
 					pgoff_start, pgoff_end) {
 		struct vm_area_struct *vma = avc->vma;
diff --git a/mm/internal.h b/mm/internal.h
index de9f1d0ba5fc..ac3c79408045 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -368,7 +368,7 @@ extern void clear_page_mlock(struct page *page);
 static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 {
 	if (TestClearPageMlocked(page)) {
-		int nr_pages = hpage_nr_pages(page);
+		int nr_pages = thp_nr_pages(page);
 
 		/* Holding pmd lock, no change in irq context: __mod is safe */
 		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b38b6ad547d..af02455db4b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5363,7 +5363,7 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	struct lruvec *from_vec, *to_vec;
 	struct pglist_data *pgdat;
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+	unsigned int nr_pages = compound ? thp_nr_pages(page) : 1;
 	int ret;
 
 	VM_BUG_ON(from == to);
@@ -6453,7 +6453,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
  */
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
-	unsigned int nr_pages = hpage_nr_pages(page);
+	unsigned int nr_pages = thp_nr_pages(page);
 	struct mem_cgroup *memcg = NULL;
 	int ret = 0;
 
@@ -6684,7 +6684,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 		return;
 
 	/* Force-charge the new page. The old one will be freed soon */
-	nr_pages = hpage_nr_pages(newpage);
+	nr_pages = thp_nr_pages(newpage);
 
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
@@ -6897,7 +6897,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	nr_entries = hpage_nr_pages(page);
+	nr_entries = thp_nr_pages(page);
 	/* Get references for the tail pages, too */
 	if (nr_entries > 1)
 		mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
@@ -6942,7 +6942,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
  */
 int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 {
-	unsigned int nr_pages = hpage_nr_pages(page);
+	unsigned int nr_pages = thp_nr_pages(page);
 	struct page_counter *counter;
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c4d5c45820d0..dfbff3565dad 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1253,7 +1253,7 @@ static int
 do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 {
 	unsigned long pfn;
-	struct page *page;
+	struct page *page, *head;
 	int ret = 0;
 	LIST_HEAD(source);
 
@@ -1261,15 +1261,14 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
+		head = compound_head(page);
 
 		if (PageHuge(page)) {
-			struct page *head = compound_head(page);
 			pfn = page_to_pfn(head) + compound_nr(head) - 1;
 			isolate_huge_page(head, &source);
 			continue;
 		} else if (PageTransHuge(page))
-			pfn = page_to_pfn(compound_head(page))
-				+ hpage_nr_pages(page) - 1;
+			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
 
 		/*
 		 * HWPoison pages have elevated reference counts so the migration would
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 381320671677..d2b11c291e78 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1049,7 +1049,7 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
 			list_add_tail(&head->lru, pagelist);
 			mod_node_page_state(page_pgdat(head),
 				NR_ISOLATED_ANON + page_is_file_lru(head),
-				hpage_nr_pages(head));
+				thp_nr_pages(head));
 		} else if (flags & MPOL_MF_STRICT) {
 			/*
 			 * Non-movable page may reach here.  And, there may be
diff --git a/mm/migrate.c b/mm/migrate.c
index f37729673558..9d0c6a853c1c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -193,7 +193,7 @@ void putback_movable_pages(struct list_head *l)
 			put_page(page);
 		} else {
 			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_lru(page), -hpage_nr_pages(page));
+					page_is_file_lru(page), -thp_nr_pages(page));
 			putback_lru_page(page);
 		}
 	}
@@ -386,7 +386,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 	 */
 	expected_count += is_device_private_page(page);
 	if (mapping)
-		expected_count += hpage_nr_pages(page) + page_has_private(page);
+		expected_count += thp_nr_pages(page) + page_has_private(page);
 
 	return expected_count;
 }
@@ -441,7 +441,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 */
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
-	page_ref_add(newpage, hpage_nr_pages(page)); /* add cache reference */
+	page_ref_add(newpage, thp_nr_pages(page)); /* add cache reference */
 	if (PageSwapBacked(page)) {
 		__SetPageSwapBacked(newpage);
 		if (PageSwapCache(page)) {
@@ -474,7 +474,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * to one less reference.
 	 * We know this isn't the last reference.
 	 */
-	page_ref_unfreeze(page, expected_count - hpage_nr_pages(page));
+	page_ref_unfreeze(page, expected_count - thp_nr_pages(page));
 
 	xas_unlock(&xas);
 	/* Leave irq disabled to prevent preemption while updating stats */
@@ -591,7 +591,7 @@ static void copy_huge_page(struct page *dst, struct page *src)
 	} else {
 		/* thp page */
 		BUG_ON(!PageTransHuge(src));
-		nr_pages = hpage_nr_pages(src);
+		nr_pages = thp_nr_pages(src);
 	}
 
 	for (i = 0; i < nr_pages; i++) {
@@ -1224,7 +1224,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 */
 		if (likely(!__PageMovable(page)))
 			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
-					page_is_file_lru(page), -hpage_nr_pages(page));
+					page_is_file_lru(page), -thp_nr_pages(page));
 	}
 
 	/*
@@ -1598,7 +1598,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_lru(head),
-			hpage_nr_pages(head));
+			thp_nr_pages(head));
 	}
 out_putpage:
 	/*
@@ -1962,7 +1962,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 
 	page_lru = page_is_file_lru(page);
 	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
-				hpage_nr_pages(page));
+				thp_nr_pages(page));
 
 	/*
 	 * Isolating the page has taken another reference, so the
diff --git a/mm/mlock.c b/mm/mlock.c
index f8736136fad7..93ca2bf30b4f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -61,8 +61,7 @@ void clear_page_mlock(struct page *page)
 	if (!TestClearPageMlocked(page))
 		return;
 
-	mod_zone_page_state(page_zone(page), NR_MLOCK,
-			    -hpage_nr_pages(page));
+	mod_zone_page_state(page_zone(page), NR_MLOCK, -thp_nr_pages(page));
 	count_vm_event(UNEVICTABLE_PGCLEARED);
 	/*
 	 * The previous TestClearPageMlocked() corresponds to the smp_mb()
@@ -95,7 +94,7 @@ void mlock_vma_page(struct page *page)
 
 	if (!TestSetPageMlocked(page)) {
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
+				    thp_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 		if (!isolate_lru_page(page))
 			putback_lru_page(page);
@@ -192,7 +191,7 @@ unsigned int munlock_vma_page(struct page *page)
 	/*
 	 * Serialize with any parallel __split_huge_page_refcount() which
 	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
+	 * we clear it in the head page. It also stabilizes thp_nr_pages().
 	 */
 	spin_lock_irq(&pgdat->lru_lock);
 
@@ -202,7 +201,7 @@ unsigned int munlock_vma_page(struct page *page)
 		goto unlock_out;
 	}
 
-	nr_pages = hpage_nr_pages(page);
+	nr_pages = thp_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
 	if (__munlock_isolate_lru_page(page, true)) {
diff --git a/mm/page_io.c b/mm/page_io.c
index 888000d1a8cc..77170b7e6f04 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -274,7 +274,7 @@ static inline void count_swpout_vm_event(struct page *page)
 	if (unlikely(PageTransHuge(page)))
 		count_vm_event(THP_SWPOUT);
 #endif
-	count_vm_events(PSWPOUT, hpage_nr_pages(page));
+	count_vm_events(PSWPOUT, thp_nr_pages(page));
 }
 
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e65629c056e8..5e77b269c330 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -61,7 +61,7 @@ static inline bool pfn_is_match(struct page *page, unsigned long pfn)
 		return page_pfn == pfn;
 
 	/* THP can be referenced by any subpage */
-	return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page);
+	return pfn >= page_pfn && pfn - page_pfn < thp_nr_pages(page);
 }
 
 /**
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..c56fab5826c1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1130,7 +1130,7 @@ void do_page_add_anon_rmap(struct page *page,
 	}
 
 	if (first) {
-		int nr = compound ? hpage_nr_pages(page) : 1;
+		int nr = compound ? thp_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
@@ -1169,7 +1169,7 @@ void do_page_add_anon_rmap(struct page *page,
 void page_add_new_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	int nr = compound ? hpage_nr_pages(page) : 1;
+	int nr = compound ? thp_nr_pages(page) : 1;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	__SetPageSwapBacked(page);
@@ -1860,7 +1860,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		return;
 
 	pgoff_start = page_to_pgoff(page);
-	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+	pgoff_end = pgoff_start + thp_nr_pages(page) - 1;
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
 			pgoff_start, pgoff_end) {
 		struct vm_area_struct *vma = avc->vma;
@@ -1913,7 +1913,7 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 		return;
 
 	pgoff_start = page_to_pgoff(page);
-	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+	pgoff_end = pgoff_start + thp_nr_pages(page) - 1;
 	if (!locked)
 		i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap,
diff --git a/mm/swap.c b/mm/swap.c
index dbcab84c6fce..a3b8cc8bdc17 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -241,7 +241,7 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += hpage_nr_pages(page);
+		(*pgmoved) += thp_nr_pages(page);
 	}
 }
 
@@ -312,7 +312,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 void lru_note_cost_page(struct page *page)
 {
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
-		      page_is_file_lru(page), hpage_nr_pages(page));
+		      page_is_file_lru(page), thp_nr_pages(page));
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -320,7 +320,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
-		int nr_pages = hpage_nr_pages(page);
+		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
 		SetPageActive(page);
@@ -500,7 +500,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
 		 * lock is held(spinlock), which implies preemption disabled.
 		 */
 		__mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
+				    thp_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 	}
 	lru_cache_add(page);
@@ -532,7 +532,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 {
 	int lru;
 	bool active;
-	int nr_pages = hpage_nr_pages(page);
+	int nr_pages = thp_nr_pages(page);
 
 	if (!PageLRU(page))
 		return;
@@ -580,7 +580,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
-		int nr_pages = hpage_nr_pages(page);
+		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
 		ClearPageActive(page);
@@ -599,7 +599,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
-		int nr_pages = hpage_nr_pages(page);
+		int nr_pages = thp_nr_pages(page);
 
 		del_page_from_lru_list(page, lruvec,
 				       LRU_INACTIVE_ANON + active);
@@ -972,7 +972,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
-	int nr_pages = hpage_nr_pages(page);
+	int nr_pages = thp_nr_pages(page);
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e98ff460e9e9..108d2574371f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -115,7 +115,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 	struct address_space *address_space = swap_address_space(entry);
 	pgoff_t idx = swp_offset(entry);
 	XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
-	unsigned long i, nr = hpage_nr_pages(page);
+	unsigned long i, nr = thp_nr_pages(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -157,7 +157,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
 void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
 {
 	struct address_space *address_space = swap_address_space(entry);
-	int i, nr = hpage_nr_pages(page);
+	int i, nr = thp_nr_pages(page);
 	pgoff_t idx = swp_offset(entry);
 	XA_STATE(xas, &address_space->i_pages, idx);
 
@@ -250,7 +250,7 @@ void delete_from_swap_cache(struct page *page)
 	xa_unlock_irq(&address_space->i_pages);
 
 	put_swap_page(page, entry);
-	page_ref_sub(page, hpage_nr_pages(page));
+	page_ref_sub(page, thp_nr_pages(page));
 }
 
 /* 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 987276c557d1..142095774e55 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1368,7 +1368,7 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 	unsigned char *map;
 	unsigned int i, free_entries = 0;
 	unsigned char val;
-	int size = swap_entry_size(hpage_nr_pages(page));
+	int size = swap_entry_size(thp_nr_pages(page));
 
 	si = _swap_info_get(entry);
 	if (!si)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..17934e03b3aa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,7 +1359,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			case PAGE_ACTIVATE:
 				goto activate_locked;
 			case PAGE_SUCCESS:
-				stat->nr_pageout += hpage_nr_pages(page);
+				stat->nr_pageout += thp_nr_pages(page);
 
 				if (PageWriteback(page))
 					goto keep;
@@ -1867,7 +1867,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		SetPageLRU(page);
 		lru = page_lru(page);
 
-		nr_pages = hpage_nr_pages(page);
+		nr_pages = thp_nr_pages(page);
 		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 
@@ -2067,7 +2067,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			 * so we ignore them here.
 			 */
 			if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
-				nr_rotated += hpage_nr_pages(page);
+				nr_rotated += thp_nr_pages(page);
 				list_add(&page->lru, &l_active);
 				continue;
 			}
-- 
2.26.2



* [PATCH v6 11/51] mm: Add thp_head
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (9 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 10/51] mm: Replace hpage_nr_pages with thp_nr_pages Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 12/51] mm: Introduce offset_in_thp Matthew Wilcox
                   ` (41 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This is like compound_head() but compiles away when
CONFIG_TRANSPARENT_HUGEPAGE is not enabled.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
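For illustration only: a userspace sketch of the two variants.  MODEL_THP
is a hypothetical macro standing in for CONFIG_TRANSPARENT_HUGEPAGE, and
the toy struct page keeps a back pointer to its head instead of the
kernel's encoded compound_head field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for CONFIG_TRANSPARENT_HUGEPAGE. */
#ifndef MODEL_THP
#define MODEL_THP 1
#endif

/* Toy page: 'head' is non-NULL only for tail pages in this model. */
struct page {
	struct page *head;
};

/* Toy compound_head(): tail pages point back at their head page. */
static inline struct page *compound_head(struct page *page)
{
	return page->head ? page->head : page;
}

#if MODEL_THP
static inline struct page *thp_head(struct page *page)
{
	return compound_head(page);
}
#else
/* No THPs means no tail pages in the page cache, so this is identity. */
static inline struct page *thp_head(struct page *page)
{
	assert(page->head == NULL);	/* stands in for VM_BUG_ON_PGFLAGS() */
	return page;
}
#endif
```

With MODEL_THP set to 0 the helper reduces to returning its argument, so
the compiler can drop the lookup entirely.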
 include/linux/huge_mm.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index dcdfd21763a3..bd13e9ac3437 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -266,6 +266,15 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 		return NULL;
 }
 
+/**
+ * thp_head - Head page of a transparent huge page.
+ * @page: Any page (tail, head or regular) found in the page cache.
+ */
+static inline struct page *thp_head(struct page *page)
+{
+	return compound_head(page);
+}
+
 /**
  * thp_order - Order of a transparent huge page.
  * @page: Head page of a transparent huge page.
@@ -342,6 +351,12 @@ static inline struct list_head *page_deferred_list(struct page *page)
 #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
+static inline struct page *thp_head(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	return page;
+}
+
 static inline unsigned int thp_order(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
-- 
2.26.2



* [PATCH v6 12/51] mm: Introduce offset_in_thp
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (10 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 11/51] mm: Add thp_head Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 13/51] mm: Support arbitrary THP sizes Matthew Wilcox
                   ` (40 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, David Hildenbrand

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Mirroring offset_in_page(), this gives you the offset within this
particular page, no matter what size page it is.  It optimises down
to offset_in_page() if CONFIG_TRANSPARENT_HUGEPAGE is not set.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
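For illustration only: a worked example of the mask arithmetic in
userspace.  It assumes 4KiB base pages and a toy struct page that stores
its order directly; thp_size() here is a stand-in that returns
PAGE_SIZE << order.

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* Toy page that records its own order (0 for a regular page). */
struct page {
	unsigned int order;
};

/* Stand-in for thp_size(): bytes covered by this page. */
static inline unsigned long thp_size(const struct page *page)
{
	return PAGE_SIZE << page->order;
}

#define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
```

For position 0x12345, offset_in_page() keeps the low 12 bits (0x345),
while offset_in_thp() on an order-4 (64KiB) page keeps the low 16 bits
(0x2345).  On an order-0 page both give the same answer.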
 include/linux/mm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6c29b663135f..3fc7e8121216 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1583,6 +1583,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 extern void pagefault_out_of_memory(void);
 
 #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
+#define offset_in_thp(page, p)	((unsigned long)(p) & (thp_size(page) - 1))
 
 /*
  * Flags passed to show_mem() and show_free_areas() to suppress output in
-- 
2.26.2



* [PATCH v6 13/51] mm: Support arbitrary THP sizes
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (11 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 12/51] mm: Introduce offset_in_thp Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 14/51] fs: Add a filesystem flag for THPs Matthew Wilcox
                   ` (39 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the compound size of the page instead of assuming PTE or PMD size.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
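For illustration only: a userspace model of how the order and page count
are read back from the first tail page.  The toy struct page and
model_make_compound() are assumptions made for this sketch; the other
helpers follow the shape of the ones moved in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page; the compound_* fields are only meaningful in page[1]. */
struct page {
	bool head;			/* models PageHead() */
	unsigned char compound_order;
	unsigned int compound_nr;
};

static inline void set_compound_order(struct page *page, unsigned int order)
{
	page[1].compound_order = order;
	page[1].compound_nr = 1U << order;
}

static inline unsigned int compound_order(struct page *page)
{
	if (!page->head)
		return 0;
	return page[1].compound_order;
}

static inline unsigned long compound_nr(struct page *page)
{
	if (!page->head)
		return 1;
	return page[1].compound_nr;
}

/* After this patch the THP helpers simply defer to the compound ones. */
static inline unsigned int thp_order(struct page *page)
{
	return compound_order(page);
}

static inline int thp_nr_pages(struct page *page)
{
	return compound_nr(page);
}

/* Hypothetical helper: make pages[0] the head of an order-N page. */
static inline void model_make_compound(struct page *pages, unsigned int order)
{
	pages[0].head = true;
	set_compound_order(pages, order);
}
```

An order-4 page therefore reports 16 base pages rather than HPAGE_PMD_NR,
and a regular page still reports order 0 and a count of 1.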
 include/linux/huge_mm.h |  8 ++------
 include/linux/mm.h      | 42 ++++++++++++++++++++---------------------
 2 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd13e9ac3437..d125912a3e0d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -282,9 +282,7 @@ static inline struct page *thp_head(struct page *page)
 static inline unsigned int thp_order(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
-	if (PageHead(page))
-		return HPAGE_PMD_ORDER;
-	return 0;
+	return compound_order(page);
 }
 
 /**
@@ -294,9 +292,7 @@ static inline unsigned int thp_order(struct page *page)
 static inline int thp_nr_pages(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
-	if (PageHead(page))
-		return HPAGE_PMD_NR;
-	return 1;
+	return compound_nr(page);
 }
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3fc7e8121216..67b36b141ec7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -668,6 +668,27 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 struct mmu_gather;
 struct inode;
 
+static inline unsigned int compound_order(struct page *page)
+{
+	if (!PageHead(page))
+		return 0;
+	return page[1].compound_order;
+}
+
+/* Returns the number of pages in this potentially compound page. */
+static inline unsigned long compound_nr(struct page *page)
+{
+	if (!PageHead(page))
+		return 1;
+	return page[1].compound_nr;
+}
+
+static inline void set_compound_order(struct page *page, unsigned int order)
+{
+	page[1].compound_order = order;
+	page[1].compound_nr = 1U << order;
+}
+
 #include <linux/huge_mm.h>
 
 /*
@@ -879,13 +900,6 @@ static inline void destroy_compound_page(struct page *page)
 	compound_page_dtors[page[1].compound_dtor](page);
 }
 
-static inline unsigned int compound_order(struct page *page)
-{
-	if (!PageHead(page))
-		return 0;
-	return page[1].compound_order;
-}
-
 static inline bool hpage_pincount_available(struct page *page)
 {
 	/*
@@ -904,20 +918,6 @@ static inline int compound_pincount(struct page *page)
 	return atomic_read(compound_pincount_ptr(page));
 }
 
-static inline void set_compound_order(struct page *page, unsigned int order)
-{
-	page[1].compound_order = order;
-	page[1].compound_nr = 1U << order;
-}
-
-/* Returns the number of pages in this potentially compound page. */
-static inline unsigned long compound_nr(struct page *page)
-{
-	if (!PageHead(page))
-		return 1;
-	return page[1].compound_nr;
-}
-
 /* Returns the number of bytes in this potentially compound page. */
 static inline unsigned long page_size(struct page *page)
 {
-- 
2.26.2



* [PATCH v6 14/51] fs: Add a filesystem flag for THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (12 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 13/51] mm: Support arbitrary THP sizes Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 15/51] fs: Do not update nr_thps for mappings which support THPs Matthew Wilcox
                   ` (38 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The page cache needs to know whether the filesystem supports THPs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
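For illustration only: a userspace model of how the per-filesystem flag
becomes a per-mapping bit.  The structs are cut down to the fields used
here, plain bit operations replace __set_bit()/test_bit(), and
model_init_mapping() is a hypothetical stand-in for the hunk in
inode_init_always().

```c
#include <assert.h>
#include <stdbool.h>

#define FS_USERNS_MOUNT	8
#define FS_THP_SUPPORT	8192

enum mapping_flags {
	AS_THP_SUPPORT = 6,	/* THPs supported */
};

/* Cut-down versions of the kernel structures. */
struct file_system_type {
	int fs_flags;
};

struct address_space {
	unsigned long flags;
};

/* Models the new lines in inode_init_always(). */
static inline void model_init_mapping(struct address_space *mapping,
				      const struct file_system_type *type)
{
	mapping->flags = 0;
	if (type->fs_flags & FS_THP_SUPPORT)
		mapping->flags |= 1UL << AS_THP_SUPPORT;
}

static inline bool mapping_thp_support(const struct address_space *mapping)
{
	return (mapping->flags & (1UL << AS_THP_SUPPORT)) != 0;
}
```

A filesystem opts in once through fs_flags, and page cache code can then
ask the mapping directly without looking at the superblock.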
 fs/inode.c              | 2 ++
 include/linux/fs.h      | 1 +
 include/linux/pagemap.h | 6 ++++++
 mm/shmem.c              | 2 +-
 4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 72c4c347afb7..9d78c37b00b8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -181,6 +181,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
 	mapping->flags = 0;
+	if (sb->s_type->fs_flags & FS_THP_SUPPORT)
+		__set_bit(AS_THP_SUPPORT, &mapping->flags);
 	mapping->wb_err = 0;
 	atomic_set(&mapping->i_mmap_writable, 0);
 #ifdef CONFIG_READ_ONLY_THP_FOR_FS
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 19ef6c88c152..ac411a08f83c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2241,6 +2241,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
+#define FS_THP_SUPPORT		8192	/* Remove once all fs converted */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 484a36185bb5..8a45b572bcf1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -29,6 +29,7 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
+	AS_THP_SUPPORT = 6,	/* THPs supported */
 };
 
 /**
@@ -119,6 +120,11 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+static inline bool mapping_thp_support(struct address_space *mapping)
+{
+	return test_bit(AS_THP_SUPPORT, &mapping->flags);
+}
+
 void release_pages(struct page **pages, int nr);
 
 /*
diff --git a/mm/shmem.c b/mm/shmem.c
index a0dbe62f8042..a05d129a45e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3854,7 +3854,7 @@ static struct file_system_type shmem_fs_type = {
 	.parameters	= shmem_fs_parameters,
 #endif
 	.kill_sb	= kill_litter_super,
-	.fs_flags	= FS_USERNS_MOUNT,
+	.fs_flags	= FS_USERNS_MOUNT | FS_THP_SUPPORT,
 };
 
 int __init shmem_init(void)
-- 
2.26.2


* [PATCH v6 15/51] fs: Do not update nr_thps for mappings which support THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (13 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 14/51] fs: Add a filesystem flag for THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 16/51] fs: Introduce i_blocks_per_page Matthew Wilcox
                   ` (37 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The nr_thps counter exists to support THPs in the page cache when the
filesystem doesn't understand THPs.  Eventually it will be removed, but
we should still support filesystems which do not understand THPs yet.
Move the nr_thps manipulation functions to pagemap.h since they're
page-cache specific.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
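Not for the commit log: a userspace sketch of the gating added to the
inc/dec helpers, assuming CONFIG_READ_ONLY_THP_FOR_FS is enabled.  The
struct and function names are hypothetical stand-ins, and a plain int
replaces the kernel's atomic_t.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two address_space fields involved. */
struct mapping_model {
	bool thp_support;	/* models the AS_THP_SUPPORT bit */
	int nr_thps;		/* models mapping->nr_thps */
};

/* Only mappings whose filesystem does not understand THPs are counted. */
static void nr_thps_inc_model(struct mapping_model *m)
{
	if (!m->thp_support)
		m->nr_thps++;
}

static void nr_thps_dec_model(struct mapping_model *m)
{
	if (!m->thp_support)
		m->nr_thps--;
}
```

For a mapping with native THP support the counter stays at zero, so code
that checks filemap_nr_thps() sees no THPs that need special handling.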
 include/linux/fs.h      | 27 ---------------------------
 include/linux/pagemap.h | 29 +++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac411a08f83c..e11929152103 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2839,33 +2839,6 @@ static inline errseq_t file_sample_sb_err(struct file *file)
 	return errseq_sample(&file->f_path.dentry->d_sb->s_wb_err);
 }
 
-static inline int filemap_nr_thps(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	return atomic_read(&mapping->nr_thps);
-#else
-	return 0;
-#endif
-}
-
-static inline void filemap_nr_thps_inc(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	atomic_inc(&mapping->nr_thps);
-#else
-	WARN_ON_ONCE(1);
-#endif
-}
-
-static inline void filemap_nr_thps_dec(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
-	atomic_dec(&mapping->nr_thps);
-#else
-	WARN_ON_ONCE(1);
-#endif
-}
-
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8a45b572bcf1..644caff3e78b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -125,6 +125,35 @@ static inline bool mapping_thp_support(struct address_space *mapping)
 	return test_bit(AS_THP_SUPPORT, &mapping->flags);
 }
 
+static inline int filemap_nr_thps(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	return atomic_read(&mapping->nr_thps);
+#else
+	return 0;
+#endif
+}
+
+static inline void filemap_nr_thps_inc(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	if (!mapping_thp_support(mapping))
+		atomic_inc(&mapping->nr_thps);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
+static inline void filemap_nr_thps_dec(struct address_space *mapping)
+{
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS
+	if (!mapping_thp_support(mapping))
+		atomic_dec(&mapping->nr_thps);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
 void release_pages(struct page **pages, int nr);
 
 /*
-- 
2.26.2


* [PATCH v6 16/51] fs: Introduce i_blocks_per_page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (14 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 15/51] fs: Do not update nr_thps for mappings which support THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 17/51] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
                   ` (36 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Christoph Hellwig

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This helper is useful for both THPs and for supporting block size larger
than page size.  Convert some example users (we have a few different
ways of writing this idiom).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
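Not for the commit log: a userspace sketch of the arithmetic behind the
helper, assuming 4 KiB base pages and describing a page by its order
instead of a struct page.  The *_model names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT_MODEL 12	/* assumption: 4 KiB base pages */

/* Bytes covered by a page of the given order (order 0 is a base page). */
static unsigned long thp_size_model(unsigned int order)
{
	return 1UL << (PAGE_SHIFT_MODEL + order);
}

/* Mirrors i_blocks_per_page(): page size divided by block size. */
static unsigned int blocks_per_page_model(unsigned int blkbits,
		unsigned int order)
{
	return (unsigned int)(thp_size_model(order) >> blkbits);
}
```

With 1 KiB blocks a base page holds 4 blocks, an order-9 page with 4 KiB
blocks holds 512, and a 64 KiB block size on a base page gives zero, which
is why the converted callers test for "<= 1" rather than "== 1".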
 fs/iomap/buffered-io.c  |  8 ++++----
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c       |  2 +-
 include/linux/pagemap.h | 15 +++++++++++++++
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a1ed7620fbac..4d3d0abc1421 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
 
-	if (iop || i_blocksize(inode) == PAGE_SIZE)
+	if (iop || i_blocks_per_page(inode, page) <= 1)
 		return iop;
 
 	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
 		if (i >= first && i <= last)
 			set_bit(i, iop->uptodate);
 		else if (!test_bit(i, iop->uptodate))
@@ -1079,7 +1079,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -1375,7 +1375,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page *page)
 	struct inode *inode = page->mapping->host;
 	struct bio *bio = NULL;
 	int block_offset;
-	int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+	int blocks_per_page = i_blocks_per_page(inode, page);
 	sector_t page_start;	/* address of page in fs blocks */
 	sector_t pblock;
 	int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..55d126d4e096 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -544,7 +544,7 @@ xfs_discard_page(
 			page, ip->i_ino, offset);
 
 	error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-			PAGE_SIZE / i_blocksize(inode));
+			i_blocks_per_page(inode, page));
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 644caff3e78b..3ad0c92be649 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -891,4 +891,19 @@ static inline int page_mkwrite_check_truncate(struct page *page,
 	return offset;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The page (head page if the page is a THP).
+ *
+ * If the block size is larger than the size of this page, return zero.
+ *
+ * Context: Any context.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+	return thp_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
-- 
2.26.2


* [PATCH v6 17/51] fs: Make page_mkwrite_check_truncate thp-aware
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (15 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 16/51] fs: Introduce i_blocks_per_page Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 18/51] mm: Support THPs in zero_user_segments Matthew Wilcox
                   ` (35 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the page is compound, check the last index in the page and return
the appropriate size.  Change the return type to ssize_t in case we ever
support pages larger than 2GB.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
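Not for the commit log: a userspace sketch of the three EOF cases for a
compound page.  It assumes 4 KiB base pages, a naturally aligned page, and
that offset_in_thp() behaves like "pos & (thp_size - 1)".  The
page->mapping check is left out, and the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>

#define PAGE_SHIFT_MODEL 12	/* assumption: 4 KiB base pages */

/*
 * index: page cache index of the head page
 * order: log2 of the number of base pages in the page
 * isize: file size in bytes
 */
static long mkwrite_check_model(unsigned long index, unsigned int order,
		long long isize)
{
	unsigned long nr_pages = 1UL << order;
	unsigned long size = nr_pages << PAGE_SHIFT_MODEL;
	unsigned long eof_index = (unsigned long)(isize >> PAGE_SHIFT_MODEL);
	unsigned long offset = (unsigned long)isize & (size - 1);

	/* page is wholly inside EOF: compare the last index it covers */
	if (index + nr_pages - 1 < eof_index)
		return (long)size;
	/* page is wholly past EOF */
	if (index > eof_index || !offset)
		return -EFAULT;
	/* page is partially inside EOF */
	return (long)offset;
}
```

For a 16 KiB page at index 4, a 64 KiB file returns 16384, a 20000 byte
file returns 3616, and a file of exactly 16384 bytes returns -EFAULT.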
 include/linux/pagemap.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3ad0c92be649..f47ba9f18f3e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -868,22 +868,22 @@ static inline unsigned long dir_pages(struct inode *inode)
  * @page: the page to check
  * @inode: the inode to check the page against
  *
- * Returns the number of bytes in the page up to EOF,
+ * Return: The number of bytes in the page up to EOF,
  * or -EFAULT if the page was truncated.
  */
-static inline int page_mkwrite_check_truncate(struct page *page,
+static inline ssize_t page_mkwrite_check_truncate(struct page *page,
 					      struct inode *inode)
 {
 	loff_t size = i_size_read(inode);
 	pgoff_t index = size >> PAGE_SHIFT;
-	int offset = offset_in_page(size);
+	unsigned long offset = offset_in_thp(page, size);
 
 	if (page->mapping != inode->i_mapping)
 		return -EFAULT;
 
 	/* page is wholly inside EOF */
-	if (page->index < index)
-		return PAGE_SIZE;
+	if (page->index + thp_nr_pages(page) - 1 < index)
+		return thp_size(page);
 	/* page is wholly past EOF */
 	if (page->index > index || !offset)
 		return -EFAULT;
-- 
2.26.2


* [PATCH v6 18/51] mm: Support THPs in zero_user_segments
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (16 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 17/51] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 19/51] mm: Zero the head page, not the tail page Matthew Wilcox
                   ` (34 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We can only kmap() one subpage of a THP at a time, so loop over all
relevant subpages, skipping ones which don't need to be zeroed.  This is
too large to inline when THPs are enabled and we actually need highmem,
so put it in highmem.c.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
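Not for the commit log: a userspace restatement of the idea, not the loop
in this patch.  A flat buffer stands in for the THP, each 4 KiB slice
stands in for one kmap_atomic()ed subpage, and both segments are clipped
to the current subpage.  All names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE_MODEL 4096u	/* assumption: 4 KiB base pages */

/*
 * Zero [start1, end1) and [start2, end2) in a buffer of nr_pages base
 * pages.  Returns how many subpages had to be "mapped".
 */
static unsigned int zero_segments_model(unsigned char *thp,
		unsigned int nr_pages,
		unsigned int start1, unsigned int end1,
		unsigned int start2, unsigned int end2)
{
	unsigned int i, mapped = 0;

	/* Same bounds check as the BUG_ON() in the patch. */
	assert(end1 <= nr_pages * PAGE_SIZE_MODEL);
	assert(end2 <= nr_pages * PAGE_SIZE_MODEL);

	for (i = 0; i < nr_pages; i++) {
		unsigned int lo = i * PAGE_SIZE_MODEL;
		unsigned int hi = lo + PAGE_SIZE_MODEL;
		/* Clip both segments to this subpage. */
		unsigned int s1 = start1 > lo ? start1 : lo;
		unsigned int e1 = end1 < hi ? end1 : hi;
		unsigned int s2 = start2 > lo ? start2 : lo;
		unsigned int e2 = end2 < hi ? end2 : hi;
		unsigned char *kaddr;

		if (s1 >= e1 && s2 >= e2)
			continue;	/* nothing to zero, so no mapping */

		kaddr = thp + lo;	/* stands in for kmap_atomic(page + i) */
		mapped++;
		if (e1 > s1)
			memset(kaddr + (s1 - lo), 0, e1 - s1);
		if (e2 > s2)
			memset(kaddr + (s2 - lo), 0, e2 - s2);
	}
	return mapped;
}
```

Zeroing [100, 5000) and [12298, 12308) in a four-page buffer touches
subpages 0, 1 and 3 and leaves subpage 2 unmapped.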
 include/linux/highmem.h | 11 ++++++--
 mm/highmem.c            | 62 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index d6e82e3de027..f05589513103 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -284,13 +284,17 @@ static inline void clear_highpage(struct page *page)
 	kunmap_atomic(kaddr);
 }
 
+#if defined(CONFIG_HIGHMEM) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2);
+#else /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
 static inline void zero_user_segments(struct page *page,
-	unsigned start1, unsigned end1,
-	unsigned start2, unsigned end2)
+		unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2)
 {
 	void *kaddr = kmap_atomic(page);
 
-	BUG_ON(end1 > PAGE_SIZE || end2 > PAGE_SIZE);
+	BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
 
 	if (end1 > start1)
 		memset(kaddr + start1, 0, end1 - start1);
@@ -301,6 +305,7 @@ static inline void zero_user_segments(struct page *page,
 	kunmap_atomic(kaddr);
 	flush_dcache_page(page);
 }
+#endif /* !HIGHMEM || !TRANSPARENT_HUGEPAGE */
 
 static inline void zero_user_segment(struct page *page,
 	unsigned start, unsigned end)
diff --git a/mm/highmem.c b/mm/highmem.c
index 64d8dea47dd1..686cae2f1ba5 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -367,9 +367,67 @@ void kunmap_high(struct page *page)
 	if (need_wakeup)
 		wake_up(pkmap_map_wait);
 }
-
 EXPORT_SYMBOL(kunmap_high);
-#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2)
+{
+	unsigned int i;
+
+	BUG_ON(end1 > thp_size(page) || end2 > thp_size(page));
+
+	for (i = 0; i < thp_nr_pages(page); i++) {
+		void *kaddr;
+		unsigned this_end;
+
+		if (end1 == 0 && start2 >= PAGE_SIZE) {
+			start2 -= PAGE_SIZE;
+			end2 -= PAGE_SIZE;
+			continue;
+		}
+
+		if (start1 >= PAGE_SIZE) {
+			start1 -= PAGE_SIZE;
+			end1 -= PAGE_SIZE;
+			if (start2) {
+				start2 -= PAGE_SIZE;
+				end2 -= PAGE_SIZE;
+			}
+			continue;
+		}
+
+		kaddr = kmap_atomic(page + i);
+
+		this_end = min_t(unsigned, end1, PAGE_SIZE);
+		if (end1 > start1)
+			memset(kaddr + start1, 0, this_end - start1);
+		end1 -= this_end;
+		start1 = 0;
+
+		if (start2 >= PAGE_SIZE) {
+			start2 -= PAGE_SIZE;
+			end2 -= PAGE_SIZE;
+		} else {
+			this_end = min_t(unsigned, end2, PAGE_SIZE);
+			if (end2 > start2)
+				memset(kaddr + start2, 0, this_end - start2);
+			end2 -= this_end;
+			start2 = 0;
+		}
+
+		kunmap_atomic(kaddr);
+
+		if (!end1 && !end2)
+			break;
+	}
+	flush_dcache_page(page);
+
+	BUG_ON((start1 | start2 | end1 | end2) != 0);
+}
+EXPORT_SYMBOL(zero_user_segments);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_HIGHMEM */
 
 #if defined(HASHED_PAGE_VIRTUAL)
 
-- 
2.26.2


* [PATCH v6 19/51] mm: Zero the head page, not the tail page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (17 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 18/51] mm: Support THPs in zero_user_segments Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 20/51] block: Add bio_for_each_thp_segment_all Matthew Wilcox
                   ` (33 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Pass the head page to zero_user_segment(), not the tail page, and adjust
the byte offsets appropriately.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
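Not for the commit log: a userspace sketch of the offset adjustment,
assuming 4 KiB base pages.  The struct and function names are
hypothetical; the arithmetic follows the hunks in this patch.

```c
#include <assert.h>

#define PAGE_SHIFT_MODEL 12	/* assumption: 4 KiB base pages */

struct byte_range_model {
	unsigned int start;
	unsigned int end;
};

/*
 * Convert a byte range relative to a tail page into the same range
 * relative to the head page.  tail_index and head_index are page cache
 * indices; for the head page itself the range is returned unchanged.
 */
static struct byte_range_model to_head_range_model(unsigned long tail_index,
		unsigned long head_index,
		unsigned int partial_start, unsigned int top)
{
	unsigned int diff = (unsigned int)(tail_index - head_index);
	struct byte_range_model r;

	r.start = partial_start + (diff << PAGE_SHIFT_MODEL);
	r.end = top + (diff << PAGE_SHIFT_MODEL);
	return r;
}
```

For a head page at index 8 and a lookup that lands on index 10, the range
[100, 4096) in the tail becomes [8292, 12288) in the head.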
 mm/shmem.c    | 7 +++++++
 mm/truncate.c | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index a05d129a45e9..55405d811cfd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -898,11 +898,18 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		struct page *page = NULL;
 		shmem_getpage(inode, start - 1, &page, SGP_READ);
 		if (page) {
+			struct page *head = thp_head(page);
 			unsigned int top = PAGE_SIZE;
 			if (start > end) {
 				top = partial_end;
 				partial_end = 0;
 			}
+			if (head != page) {
+				unsigned int diff = start - 1 - head->index;
+				partial_start += diff << PAGE_SHIFT;
+				top += diff << PAGE_SHIFT;
+				page = head;
+			}
 			zero_user_segment(page, partial_start, top);
 			set_page_dirty(page);
 			unlock_page(page);
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..152974888124 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -374,12 +374,19 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (partial_start) {
 		struct page *page = find_lock_page(mapping, start - 1);
 		if (page) {
+			struct page *head = thp_head(page);
 			unsigned int top = PAGE_SIZE;
 			if (start > end) {
 				/* Truncation within a single page */
 				top = partial_end;
 				partial_end = 0;
 			}
+			if (head != page) {
+				unsigned int diff = start - 1 - head->index;
+				partial_start += diff << PAGE_SHIFT;
+				top += diff << PAGE_SHIFT;
+				page = head;
+			}
 			wait_on_page_writeback(page);
 			zero_user_segment(page, partial_start, top);
 			cleancache_invalidate_page(mapping, page);
-- 
2.26.2


* [PATCH v6 20/51] block: Add bio_for_each_thp_segment_all
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (18 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 19/51] mm: Zero the head page, not the tail page Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-11 18:20   ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 21/51] block: Support THPs in page_is_mergeable Matthew Wilcox
                   ` (32 subsequent siblings)
  52 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Iterate once for each THP instead of once for each base page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
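Not for the commit log: a userspace sketch of how one bvec is carved into
per-THP segments.  It tracks byte counts only, handles a single bvec, and
assumes every page in the bvec has the size of the first one.  The names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct seg_iter_model {
	unsigned int done;	/* bytes of the bvec already handed out */
	unsigned int off;	/* offset of the current segment in its page */
	unsigned int len;	/* length of the current segment */
};

/* Produce the next segment; returns false once the bvec is consumed. */
static bool next_thp_segment_model(unsigned int bv_offset,
		unsigned int bv_len, unsigned int page_size,
		struct seg_iter_model *it)
{
	unsigned int room, remaining;

	assert(bv_offset < page_size);	/* mirrors the BUG_ON() in the patch */
	if (it->done >= bv_len)
		return false;

	/* Only the first segment starts part-way into a page. */
	it->off = it->done ? 0 : bv_offset;
	room = page_size - it->off;
	remaining = bv_len - it->done;
	it->len = room < remaining ? room : remaining;
	it->done += it->len;
	return true;
}

/* Count how many iterations a bio_for_each-style loop would make. */
static unsigned int count_segments_model(unsigned int bv_offset,
		unsigned int bv_len, unsigned int page_size)
{
	struct seg_iter_model it = { 0, 0, 0 };
	unsigned int n = 0;

	while (next_thp_segment_model(bv_offset, bv_len, page_size, &it))
		n++;
	return n;
}
```

A 20 KiB bvec takes five iterations with 4 KiB pages and two with 16 KiB
pages, which is the saving the commit message describes.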
 include/linux/bio.h  | 13 +++++++++++++
 include/linux/bvec.h | 23 +++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 91676d4b2dfe..1489e196abf5 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -131,12 +131,25 @@ static inline bool bio_next_segment(const struct bio *bio,
 	return true;
 }
 
+static inline bool bio_next_thp_segment(const struct bio *bio,
+				    struct bvec_iter_all *iter)
+{
+	if (iter->idx >= bio->bi_vcnt)
+		return false;
+
+	bvec_thp_advance(&bio->bi_io_vec[iter->idx], iter);
+	return true;
+}
+
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
 #define bio_for_each_segment_all(bvl, bio, iter) \
 	for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_thp_segment_all(bvl, bio, iter) \
+	for (bvl = bvec_init_iter_all(&iter); \
+	     bio_next_thp_segment((bio), &iter); )
 
 static inline void bio_advance_iter(const struct bio *bio,
 				    struct bvec_iter *iter, unsigned int bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ac0c7299d5b8..71b435e573e1 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -162,4 +162,27 @@ static inline void bvec_advance(const struct bio_vec *bvec,
 	}
 }
 
+static inline void bvec_thp_advance(const struct bio_vec *bvec,
+				struct bvec_iter_all *iter_all)
+{
+	struct bio_vec *bv = &iter_all->bv;
+	unsigned int page_size = thp_size(bvec->bv_page);
+
+	if (iter_all->done) {
+		bv->bv_page += thp_nr_pages(bv->bv_page);
+		bv->bv_offset = 0;
+	} else {
+		BUG_ON(bvec->bv_offset >= page_size);
+		bv->bv_page = bvec->bv_page;
+		bv->bv_offset = bvec->bv_offset & (page_size - 1);
+	}
+	bv->bv_len = min(page_size - bv->bv_offset,
+			 bvec->bv_len - iter_all->done);
+	iter_all->done += bv->bv_len;
+
+	if (iter_all->done == bvec->bv_len) {
+		iter_all->idx++;
+		iter_all->done = 0;
+	}
+}
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.26.2


* [PATCH v6 21/51] block: Support THPs in page_is_mergeable
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (19 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 20/51] block: Add bio_for_each_thp_segment_all Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-12 16:17   ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 22/51] iomap: Support arbitrarily many blocks per page Matthew Wilcox
                   ` (31 subsequent siblings)
  52 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page_is_mergeable() would incorrectly claim that two IOs were on different
pages because they were on different base pages rather than on the
same THP.  This led to a reference counting bug in iomap.  Simplify the
'same_page' test by just comparing whether we have the same struct page
instead of doing arithmetic on the physical addresses.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 5235da6434aa..cd677cde853d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -747,7 +747,7 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
 		return false;
 
-	*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
+	*same_page = bv->bv_page == page;
 	if (!*same_page && pfn_to_page(PFN_DOWN(vec_end_addr)) + 1 != page)
 		return false;
 	return true;
-- 
2.26.2


* [PATCH v6 22/51] iomap: Support arbitrarily many blocks per page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (20 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 21/51] block: Support THPs in page_is_mergeable Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 23/51] iomap: Support THPs in iomap_adjust_read_range Matthew Wilcox
                   ` (30 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Size the uptodate array dynamically.  Now that this array is protected
by a spinlock, we can use bitmap functions to set the bits in this array
instead of a loop around set_bit().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
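Not for the commit log: a userspace sketch of a dynamically sized uptodate
bitmap.  calloc() stands in for kzalloc() with struct_size(), open-coded
loops stand in for bitmap_set() and bitmap_full(), and locking is omitted.
The nr_blocks field and all *_model names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BITS_PER_LONG_MODEL	(sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS_MODEL(n) \
	(((n) + BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL)

struct iop_model {
	unsigned int nr_blocks;		/* kept here only for the sketch */
	unsigned long uptodate[];	/* flexible array, one bit per block */
};

/* Allocate a zeroed structure with room for nr_blocks bits. */
static struct iop_model *iop_create_model(unsigned int nr_blocks)
{
	size_t longs = BITS_TO_LONGS_MODEL(nr_blocks);
	struct iop_model *iop;

	iop = calloc(1, sizeof(*iop) + longs * sizeof(unsigned long));
	if (iop)
		iop->nr_blocks = nr_blocks;
	return iop;
}

/* Mark blocks [first, last] uptodate; return true once all blocks are. */
static bool iop_set_range_uptodate_model(struct iop_model *iop,
		unsigned int first, unsigned int last)
{
	unsigned int i;

	assert(first <= last && last < iop->nr_blocks);
	for (i = first; i <= last; i++)
		iop->uptodate[i / BITS_PER_LONG_MODEL] |=
			1UL << (i % BITS_PER_LONG_MODEL);

	for (i = 0; i < iop->nr_blocks; i++)
		if (!(iop->uptodate[i / BITS_PER_LONG_MODEL] &
		      (1UL << (i % BITS_PER_LONG_MODEL))))
			return false;
	return true;
}
```

With 128 blocks per page, the page only counts as uptodate after the last
outstanding block has been set.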
 fs/iomap/buffered-io.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4d3d0abc1421..1a398af42ed2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,14 +22,14 @@
 #include "../internal.h"
 
 /*
- * Structure allocated for each page when block size < PAGE_SIZE to track
+ * Structure allocated for each page when block size < page size to track
  * sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_count;
 	atomic_t		write_count;
 	spinlock_t		uptodate_lock;
-	DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+	unsigned long		uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
@@ -45,15 +45,14 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
 	struct iomap_page *iop = to_iomap_page(page);
+	unsigned int nr_blocks = i_blocks_per_page(inode, page);
 
-	if (iop || i_blocks_per_page(inode, page) <= 1)
+	if (iop || nr_blocks <= 1)
 		return iop;
 
-	iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-	atomic_set(&iop->read_count, 0);
-	atomic_set(&iop->write_count, 0);
+	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+				GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
-	bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
 
 	/*
 	 * migrate_page_move_mapping() assumes that pages with private data have
@@ -142,19 +141,11 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 	struct inode *inode = page->mapping->host;
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	bool uptodate = true;
 	unsigned long flags;
-	unsigned int i;
 
 	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-		if (i >= first && i <= last)
-			set_bit(i, iop->uptodate);
-		else if (!test_bit(i, iop->uptodate))
-			uptodate = false;
-	}
-
-	if (uptodate)
+	bitmap_set(iop->uptodate, first, last - first + 1);
+	if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
 		SetPageUptodate(page);
 	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
 }
-- 
2.26.2


* [PATCH v6 23/51] iomap: Support THPs in iomap_adjust_read_range
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (21 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 22/51] iomap: Support arbitrarily many blocks per page Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 24/51] iomap: Support THPs in invalidatepage Matthew Wilcox
                   ` (29 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Pass the struct page instead of the iomap_page so we can determine the
size of the page.  Use offset_in_thp() instead of offset_in_page() and
use thp_size() instead of PAGE_SIZE.  Convert the arguments to be size_t
instead of unsigned int, in case pages ever get larger than 2^31 bytes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1a398af42ed2..1b450e966822 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -77,16 +77,16 @@ iomap_page_release(struct page *page)
 /*
  * Calculate the range inside the page that we actually need to read.
  */
-static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
-		loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
+static void iomap_adjust_read_range(struct inode *inode, struct page *page,
+		loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
 {
+	struct iomap_page *iop = to_iomap_page(page);
 	loff_t orig_pos = *pos;
 	loff_t isize = i_size_read(inode);
 	unsigned block_bits = inode->i_blkbits;
 	unsigned block_size = (1 << block_bits);
-	unsigned poff = offset_in_page(*pos);
-	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+	size_t poff = offset_in_thp(page, *pos);
+	size_t plen = min_t(loff_t, thp_size(page) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -124,7 +124,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
 	if (orig_pos <= isize && orig_pos + length > isize) {
-		unsigned end = offset_in_page(isize - 1) >> block_bits;
+		unsigned end = offset_in_thp(page, isize - 1) >> block_bits;
 
 		if (first <= end && last > end)
 			plen -= (last - end) * block_size;
@@ -241,7 +241,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	struct iomap_page *iop = iomap_page_create(inode, page);
 	bool same_page = false, is_contig = false;
 	loff_t orig_pos = pos;
-	unsigned poff, plen;
+	size_t poff, plen;
 	sector_t sector;
 
 	if (iomap->type == IOMAP_INLINE) {
@@ -251,7 +251,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
-	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	iomap_adjust_read_range(inode, page, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
@@ -560,18 +560,19 @@ static int
 __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 		struct page *page, struct iomap *srcmap)
 {
-	struct iomap_page *iop = iomap_page_create(inode, page);
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+	unsigned from = offset_in_page(pos), to = from + len;
+	size_t poff, plen;
 	int status;
 
 	if (PageUptodate(page))
 		return 0;
+	iomap_page_create(inode, page);
 
 	do {
-		iomap_adjust_read_range(inode, iop, &block_start,
+		iomap_adjust_read_range(inode, page, &block_start,
 				block_end - block_start, &poff, &plen);
 		if (plen == 0)
 			break;
-- 
2.26.2


* [PATCH v6 24/51] iomap: Support THPs in invalidatepage
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (22 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 23/51] iomap: Support THPs in iomap_adjust_read_range Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 25/51] iomap: Support THPs in read paths Matthew Wilcox
                   ` (28 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If we're punching a hole in a THP, we need to remove the per-page iomap
data, but not clear the dirty bit from the page, so separate the two
conditions.  This means that writepage can now come across a page with
no iop allocated, so remove the assertions that there is already one,
and just create one if there isn't one.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
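Not for the commit log: a userspace sketch of the two conditions after
they are separated, assuming 4 KiB base pages.  The struct and function
names are hypothetical; the conditions follow the iomap_invalidatepage()
hunk.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE_MODEL 4096u	/* assumption: 4 KiB base pages */

struct inval_actions_model {
	bool cancel_dirty;	/* would call cancel_dirty_page() */
	bool release_iop;	/* would call iomap_page_release() */
};

/* Decide what an invalidation of [offset, offset + len) should do. */
static struct inval_actions_model invalidate_model(unsigned int offset,
		unsigned int len, unsigned int order)
{
	unsigned int size = PAGE_SIZE_MODEL << order;
	bool full_page = (offset == 0) && (len == size);
	struct inval_actions_model a;

	a.cancel_dirty = full_page;
	/* Punching a hole in a THP releases the iop but keeps the page dirty. */
	a.release_iop = full_page || order > 0;
	return a;
}
```

A partial invalidation of a base page does neither, a partial invalidation
of a THP only drops the iop, and a full invalidation does both.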
 fs/iomap/buffered-io.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1b450e966822..b1a2ab2c7a8d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -487,17 +487,21 @@ EXPORT_SYMBOL_GPL(iomap_releasepage);
 void
 iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
 {
+	bool full_page = (offset == 0) && (len == thp_size(page));
 	trace_iomap_invalidatepage(page->mapping->host, offset, len);
 
 	/*
 	 * If we are invalidating the entire page, clear the dirty state from it
 	 * and release it to avoid unnecessary buildup of the LRU.
 	 */
-	if (offset == 0 && len == PAGE_SIZE) {
+	if (full_page) {
 		WARN_ON_ONCE(PageWriteback(page));
 		cancel_dirty_page(page);
-		iomap_page_release(page);
 	}
+
+	/* Punching a hole in a THP requires releasing the iop */
+	if (full_page || thp_order(page) > 0)
+		iomap_page_release(page);
 }
 EXPORT_SYMBOL_GPL(iomap_invalidatepage);
 
@@ -1064,14 +1068,13 @@ static void
 iomap_finish_page_writeback(struct inode *inode, struct page *page,
 		int error)
 {
-	struct iomap_page *iop = to_iomap_page(page);
+	struct iomap_page *iop = iomap_page_create(inode, page);
 
 	if (error) {
 		SetPageError(page);
 		mapping_set_error(inode->i_mapping, -EIO);
 	}
 
-	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
 
 	if (!iop || atomic_dec_and_test(&iop->write_count))
@@ -1360,14 +1363,13 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct page *page, u64 end_offset)
 {
-	struct iomap_page *iop = to_iomap_page(page);
+	struct iomap_page *iop = iomap_page_create(inode, page);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	u64 file_offset; /* file offset of page */
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
-	WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
 
 	/*
-- 
2.26.2


* [PATCH v6 25/51] iomap: Support THPs in read paths
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (23 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 24/51] iomap: Support THPs in invalidatepage Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 26/51] iomap: Convert iomap_write_end types Matthew Wilcox
                   ` (27 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use thp_size() instead of PAGE_SIZE, offset_in_thp() instead of
offset_in_page() and bio_for_each_thp_segment_all() instead of
bio_for_each_segment_all().
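
The vector-count estimate in the diff below is a round-up division by the
page size.  A minimal userspace sketch of that arithmetic, assuming the page
size is a power of two (model_nr_vecs() is a hypothetical name and takes the
shift directly rather than a struct page):

```c
#include <assert.h>

/*
 * Hypothetical helper: page_shift is log2 of the page size in bytes.
 * Returns how many pages of that size are needed to cover length bytes.
 */
static unsigned int model_nr_vecs(unsigned int page_shift, long long length)
{
	long long page_size = 1LL << page_shift;

	/* Round up so a partial trailing page still gets a vector. */
	return (unsigned int)((length + page_size - 1) >> page_shift);
}
```

With a larger page the same length needs fewer vectors, which is why the
estimate depends on the size of the current page.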

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b1a2ab2c7a8d..555ec90e8b54 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -192,7 +192,7 @@ iomap_read_end_io(struct bio *bio)
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, iter_all)
+	bio_for_each_thp_segment_all(bvec, bio, iter_all)
 		iomap_read_page_end_io(bvec, error);
 	bio_put(bio);
 }
@@ -232,6 +232,16 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 		pos >= i_size_read(inode);
 }
 
+/*
+ * Estimate the number of vectors we need based on the current page size;
+ * if we're wrong we'll end up doing an overly large allocation or needing
+ * to do a second allocation, neither of which is a big deal.
+ */
+static unsigned int iomap_nr_vecs(struct page *page, loff_t length)
+{
+	return (length + thp_size(page) - 1) >> page_shift(page);
+}
+
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap, struct iomap *srcmap)
@@ -288,7 +298,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
 		gfp_t orig_gfp = gfp;
-		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		int nr_vecs = iomap_nr_vecs(page, length);
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
@@ -332,9 +342,9 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 
 	trace_iomap_readpage(page->mapping->host, 1);
 
-	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
+	for (poff = 0; poff < thp_size(page); poff += ret) {
 		ret = iomap_apply(inode, page_offset(page) + poff,
-				PAGE_SIZE - poff, 0, ops, &ctx,
+				thp_size(page) - poff, 0, ops, &ctx,
 				iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
@@ -368,7 +378,8 @@ iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 	loff_t done, ret;
 
 	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
+		if (ctx->cur_page &&
+		    offset_in_thp(ctx->cur_page, pos + done) == 0) {
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
 			put_page(ctx->cur_page);
-- 
2.26.2


* [PATCH v6 26/51] iomap: Convert iomap_write_end types
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (24 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 25/51] iomap: Support THPs in read paths Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 27/51] iomap: Change calling convention for zeroing Matthew Wilcox
                   ` (26 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

iomap_write_end cannot return an error, so switch it to return
size_t instead of int and remove the error checking from the callers.
Also convert the arguments to size_t from unsigned int, in case anyone
ever wants to support a page size larger than 2GB.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 555ec90e8b54..fffdf5906e99 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -692,9 +692,8 @@ iomap_set_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
 
-static int
-__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-		unsigned copied, struct page *page)
+static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+		size_t copied, struct page *page)
 {
 	flush_dcache_page(page);
 
@@ -716,9 +715,8 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 	return copied;
 }
 
-static int
-iomap_write_end_inline(struct inode *inode, struct page *page,
-		struct iomap *iomap, loff_t pos, unsigned copied)
+static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
+		struct iomap *iomap, loff_t pos, size_t copied)
 {
 	void *addr;
 
@@ -733,13 +731,14 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
 	return copied;
 }
 
-static int
-iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
-		struct page *page, struct iomap *iomap, struct iomap *srcmap)
+/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
+static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+		size_t copied, struct page *page, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
 	loff_t old_size = inode->i_size;
-	int ret;
+	size_t ret;
 
 	if (srcmap->type == IOMAP_INLINE) {
 		ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
@@ -820,11 +819,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		flush_dcache_page(page);
 
-		status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
+		copied = iomap_write_end(inode, pos, bytes, copied, page, iomap,
 				srcmap);
-		if (unlikely(status < 0))
-			break;
-		copied = status;
 
 		cond_resched();
 
@@ -898,11 +894,8 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
 				srcmap);
-		if (unlikely(status <= 0)) {
-			if (WARN_ON_ONCE(status == 0))
-				return -EIO;
-			return status;
-		}
+		if (WARN_ON_ONCE(status == 0))
+			return -EIO;
 
 		cond_resched();
 
-- 
2.26.2


* [PATCH v6 27/51] iomap: Change calling convention for zeroing
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (25 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 26/51] iomap: Convert iomap_write_end types Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 28/51] iomap: Change iomap_write_begin calling convention Matthew Wilcox
                   ` (25 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Pass the full length to iomap_zero() and dax_iomap_zero(), then have
them return how many bytes they actually handled.  This is preparatory
work for handling THP, although it looks like DAX could actually take
advantage of it if there's a larger contiguous area.
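
A minimal userspace sketch of the new convention, for illustration: the
caller passes the whole remaining length, the helper decides how much it can
handle in one go and returns that count.  The names model_zero() and
model_zero_range() and the fixed MODEL_PAGE_SIZE are assumptions of the
sketch, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096LL

/* Zero at most up to the next page boundary; return the bytes handled. */
static long long model_zero(unsigned char *buf, long long pos,
		long long length)
{
	long long offset = pos % MODEL_PAGE_SIZE;
	long long size = MODEL_PAGE_SIZE - offset;

	if (size > length)
		size = length;
	memset(buf + pos, 0, (size_t)size);
	return size;
}

/* The caller no longer computes offset/bytes; it just loops on the result. */
static long long model_zero_range(unsigned char *buf, long long pos,
		long long length, int *calls)
{
	long long written = 0;

	do {
		long long bytes = model_zero(buf, pos, length);

		if (bytes < 0)
			return bytes;
		pos += bytes;
		length -= bytes;
		written += bytes;
		if (calls)
			(*calls)++;
	} while (length > 0);

	return written;
}
```

Because the per-call limit now lives in the helper, a helper that can handle
more than one page per call only needs to return a larger count; the loop
does not change.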

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/dax.c               | 13 ++++++-------
 fs/iomap/buffered-io.c | 33 +++++++++++++++------------------
 include/linux/dax.h    |  3 +--
 3 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..18e9b5f0dedf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1038,18 +1038,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	return ret;
 }
 
-int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
-		   struct iomap *iomap)
+ssize_t dax_iomap_zero(loff_t pos, loff_t length, struct iomap *iomap)
 {
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
 	long rc, id;
 	void *kaddr;
 	bool page_aligned = false;
-
+	unsigned offset = offset_in_page(pos);
+	unsigned size = min_t(loff_t, PAGE_SIZE - offset, length);
 
 	if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
-	    IS_ALIGNED(size, PAGE_SIZE))
+	    (size == PAGE_SIZE))
 		page_aligned = true;
 
 	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
@@ -1059,8 +1059,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
 	id = dax_read_lock();
 
 	if (page_aligned)
-		rc = dax_zero_page_range(iomap->dax_dev, pgoff,
-					 size >> PAGE_SHIFT);
+		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
 	else
 		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
 	if (rc < 0) {
@@ -1073,7 +1072,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
 	}
 	dax_read_unlock(id);
-	return 0;
+	return size;
 }
 
 static loff_t
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fffdf5906e99..8d690ad68657 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -928,11 +928,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-		unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
+static ssize_t iomap_zero(struct inode *inode, loff_t pos, loff_t length,
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page;
 	int status;
+	unsigned offset = offset_in_page(pos);
+	unsigned bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 
 	status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
 	if (status)
@@ -944,38 +946,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
-static loff_t
-iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
+		loff_t length, void *data, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
-	int status;
 
 	/* already zeroed?  we're done. */
 	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-		return count;
+		return length;
 
 	do {
-		unsigned offset, bytes;
-
-		offset = offset_in_page(pos);
-		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
+		ssize_t bytes;
 
 		if (IS_DAX(inode))
-			status = dax_iomap_zero(pos, offset, bytes, iomap);
+			bytes = dax_iomap_zero(pos, length, iomap);
 		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap,
-					srcmap);
-		if (status < 0)
-			return status;
+			bytes = iomap_zero(inode, pos, length, iomap, srcmap);
+		if (bytes < 0)
+			return bytes;
 
 		pos += bytes;
-		count -= bytes;
+		length -= bytes;
 		written += bytes;
 		if (did_zero)
 			*did_zero = true;
-	} while (count > 0);
+	} while (length > 0);
 
 	return written;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..0ba618aeec98 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -214,8 +214,7 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
-			struct iomap *iomap);
+ssize_t dax_iomap_zero(loff_t pos, loff_t length, struct iomap *iomap);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
-- 
2.26.2


* [PATCH v6 28/51] iomap: Change iomap_write_begin calling convention
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (26 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 27/51] iomap: Change calling convention for zeroing Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 29/51] iomap: Support THPs in write paths Matthew Wilcox
                   ` (24 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Pass (up to) the remaining length of the extent to iomap_write_begin()
and have it return the number of bytes that will fit in the page.
That lets us copy more bytes per call to iomap_write_begin() if the page
cache has already allocated a THP (and will in future allow us to pass
a hint to the page cache that it should try to allocate a larger page
if there are none in the cache).
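
The clamping step can be sketched in userspace as follows.  This is only an
illustration under the assumption that the page size is a power of two;
model_write_begin_len() is a hypothetical name and takes the page size
directly instead of looking a page up in the page cache.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given a file position, the remaining length of the extent and the size
 * of the page found at that position, return how many bytes fit in the
 * page.  page_size must be a power of two.
 */
static size_t model_write_begin_len(long long pos, long long len,
		size_t page_size)
{
	size_t offset = (size_t)(pos & (long long)(page_size - 1));

	if (len > (long long)(page_size - offset))
		len = (long long)(page_size - offset);
	return (size_t)len;
}
```

For example, at position 100 a 4096-byte page yields 3996 bytes, while a
65536-byte page at position 4196 yields 61340 bytes from a single call.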

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 63 +++++++++++++++++++++++-------------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8d690ad68657..e445ee5f0521 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -571,14 +571,14 @@ iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
 	return submit_bio_wait(&bio);
 }
 
-static int
-__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
-		struct page *page, struct iomap *srcmap)
+static ssize_t __iomap_write_begin(struct inode *inode, loff_t pos,
+		size_t len, int flags, struct page *page, struct iomap *srcmap)
 {
 	loff_t block_size = i_blocksize(inode);
 	loff_t block_start = pos & ~(block_size - 1);
 	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
-	unsigned from = offset_in_page(pos), to = from + len;
+	size_t from = offset_in_thp(page, pos);
+	size_t to = from + len;
 	size_t poff, plen;
 	int status;
 
@@ -614,12 +614,13 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
 	return 0;
 }
 
-static int
-iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
-		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
+static ssize_t iomap_write_begin(struct inode *inode, loff_t pos, loff_t len,
+		unsigned flags, struct page **pagep, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
 	struct page *page;
+	size_t offset;
 	int status = 0;
 
 	BUG_ON(pos + len > iomap->offset + iomap->length);
@@ -630,6 +631,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		return -EINTR;
 
 	if (page_ops && page_ops->page_prepare) {
+		if (len > UINT_MAX)
+			len = UINT_MAX;
 		status = page_ops->page_prepare(inode, pos, len, iomap);
 		if (status)
 			return status;
@@ -641,6 +644,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		status = -ENOMEM;
 		goto out_no_page;
 	}
+	page = thp_head(page);
+	offset = offset_in_thp(page, pos);
+	if (len > thp_size(page) - offset)
+		len = thp_size(page) - offset;
 
 	if (srcmap->type == IOMAP_INLINE)
 		iomap_read_inline_data(inode, page, srcmap);
@@ -650,11 +657,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		status = __iomap_write_begin(inode, pos, len, flags, page,
 				srcmap);
 
-	if (unlikely(status))
+	if (status < 0)
 		goto out_unlock;
 
 	*pagep = page;
-	return 0;
+	return len;
 
 out_unlock:
 	unlock_page(page);
@@ -809,8 +816,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
 				srcmap);
-		if (unlikely(status))
+		if (status < 0)
 			break;
+		/* We may be partway through a THP */
+		offset = offset_in_thp(page, pos);
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -872,8 +881,7 @@ static loff_t
 iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap, struct iomap *srcmap)
 {
-	long status = 0;
-	ssize_t written = 0;
+	loff_t written = 0;
 
 	/* don't bother with blocks that are not shared to start with */
 	if (!(iomap->flags & IOMAP_F_SHARED))
@@ -883,25 +891,24 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		return length;
 
 	do {
-		unsigned long offset = offset_in_page(pos);
-		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 		struct page *page;
+		ssize_t bytes;
 
-		status = iomap_write_begin(inode, pos, bytes,
+		bytes = iomap_write_begin(inode, pos, length,
 				IOMAP_WRITE_F_UNSHARE, &page, iomap, srcmap);
-		if (unlikely(status))
-			return status;
+		if (bytes < 0)
+			return bytes;
 
-		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
+		bytes = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
 				srcmap);
-		if (WARN_ON_ONCE(status == 0))
+		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
 		cond_resched();
 
-		pos += status;
-		written += status;
-		length -= status;
+		pos += bytes;
+		written += bytes;
+		length -= bytes;
 
 		balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (length);
@@ -932,15 +939,13 @@ static ssize_t iomap_zero(struct inode *inode, loff_t pos, loff_t length,
 		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page;
-	int status;
-	unsigned offset = offset_in_page(pos);
-	unsigned bytes = min_t(loff_t, PAGE_SIZE - offset, length);
+	ssize_t bytes;
 
-	status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
-	if (status)
-		return status;
+	bytes = iomap_write_begin(inode, pos, length, 0, &page, iomap, srcmap);
+	if (bytes < 0)
+		return bytes;
 
-	zero_user(page, offset, bytes);
+	zero_user(page, offset_in_thp(page, pos), bytes);
 	mark_page_accessed(page);
 
 	return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
-- 
2.26.2


* [PATCH v6 29/51] iomap: Support THPs in write paths
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (27 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 28/51] iomap: Change iomap_write_begin calling convention Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 30/51] iomap: Inline data shouldn't see THPs Matthew Wilcox
                   ` (23 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use thp_size() instead of PAGE_SIZE and offset_in_thp() instead of
offset_in_page().  Also simplify the logic in iomap_do_writepage()
for determining end of file.
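
A simplified userspace sketch of the end-of-file classification, for
illustration.  It compares the end of the page with i_size first and only
looks closer when the page extends past it.  The names here (model_eof(),
enum eof_case, struct eof_result) are hypothetical, and the "outside" test
is a simplification of the index comparison that iomap_do_writepage()
performs.

```c
#include <assert.h>
#include <stddef.h>

enum eof_case { PAGE_INSIDE, PAGE_STRADDLES, PAGE_OUTSIDE };

struct eof_result {
	enum eof_case kind;
	long long end_offset;	/* file offset at which writeback stops */
	size_t zero_from;	/* offset in page of the tail to zero */
};

static struct eof_result model_eof(long long page_pos, size_t page_size,
		long long isize)
{
	struct eof_result r;

	r.kind = PAGE_INSIDE;
	r.end_offset = page_pos + (long long)page_size;
	r.zero_from = page_size;	/* nothing to zero */

	if (r.end_offset <= isize)
		return r;		/* page lies entirely below i_size */

	if (page_pos >= isize) {
		r.kind = PAGE_OUTSIDE;	/* nothing in this page to write */
		r.end_offset = page_pos;
		return r;
	}

	r.kind = PAGE_STRADDLES;
	r.zero_from = (size_t)(isize - page_pos);
	r.end_offset = isize;		/* stop writeback at end of file */
	return r;
}
```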

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 44 +++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e445ee5f0521..9275268ea97e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -460,7 +460,7 @@ iomap_is_partially_uptodate(struct page *page, unsigned long from,
 	unsigned i;
 
 	/* Limit range to one page */
-	len = min_t(unsigned, PAGE_SIZE - from, count);
+	len = min_t(unsigned, thp_size(page) - from, count);
 
 	/* First and last blocks in range within page */
 	first = from >> inode->i_blkbits;
@@ -654,8 +654,8 @@ static ssize_t iomap_write_begin(struct inode *inode, loff_t pos, loff_t len,
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
 	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
-				srcmap);
+		status = __iomap_write_begin(inode, pos, len, flags,
+				thp_head(page), srcmap);
 
 	if (status < 0)
 		goto out_unlock;
@@ -717,7 +717,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !PageUptodate(page)))
 		return 0;
-	iomap_set_range_uptodate(page, offset_in_page(pos), len);
+	iomap_set_range_uptodate(page, offset_in_thp(page, pos), len);
 	iomap_set_page_dirty(page);
 	return copied;
 }
@@ -753,7 +753,8 @@ static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		ret = block_write_end(NULL, inode->i_mapping, pos, len, copied,
 				page, NULL);
 	} else {
-		ret = __iomap_write_end(inode, pos, len, copied, page);
+		ret = __iomap_write_end(inode, pos, len, copied,
+				thp_head(page));
 	}
 
 	/*
@@ -792,6 +793,10 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
+		/*
+		 * XXX: We don't know what size page we'll find in the
+		 * page cache, so only copy up to a regular page boundary.
+		 */
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
 						iov_iter_count(i));
@@ -1116,7 +1121,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bv, bio, iter_all)
+		bio_for_each_thp_segment_all(bv, bio, iter_all)
 			iomap_finish_page_writeback(inode, bv->bv_page, error);
 		bio_put(bio);
 	}
@@ -1322,7 +1327,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 {
 	sector_t sector = iomap_sector(&wpc->iomap, offset);
 	unsigned len = i_blocksize(inode);
-	unsigned poff = offset & (PAGE_SIZE - 1);
+	unsigned poff = offset & (thp_size(page) - 1);
 	bool merged, same_page = false;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
@@ -1372,8 +1377,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	struct iomap_page *iop = iomap_page_create(inode, page);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
-	u64 file_offset; /* file offset of page */
+	loff_t pos;
 	int error = 0, count = 0, i;
+	int nr_blocks = i_blocks_per_page(inode, page);
 	LIST_HEAD(submit_list);
 
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
@@ -1383,20 +1389,20 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * end of the current map or find the current map invalid, grab a new
 	 * one.
 	 */
-	for (i = 0, file_offset = page_offset(page);
-	     i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
-	     i++, file_offset += len) {
+	for (i = 0, pos = page_offset(page);
+	     i < nr_blocks && pos < end_offset;
+	     i++, pos += len) {
 		if (iop && !test_bit(i, iop->uptodate))
 			continue;
 
-		error = wpc->ops->map_blocks(wpc, inode, file_offset);
+		error = wpc->ops->map_blocks(wpc, inode, pos);
 		if (error)
 			break;
 		if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
 			continue;
 		if (wpc->iomap.type == IOMAP_HOLE)
 			continue;
-		iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
+		iomap_add_to_ioend(inode, pos, page, iop, wpc, wbc,
 				 &submit_list);
 		count++;
 	}
@@ -1478,7 +1484,6 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 {
 	struct iomap_writepage_ctx *wpc = data;
 	struct inode *inode = page->mapping->host;
-	pgoff_t end_index;
 	u64 end_offset;
 	loff_t offset;
 
@@ -1519,10 +1524,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 	 * ---------------------------------^------------------|
 	 */
 	offset = i_size_read(inode);
-	end_index = offset >> PAGE_SHIFT;
-	if (page->index < end_index)
-		end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
-	else {
+	end_offset = page_offset(page) + thp_size(page);
+	if (end_offset > offset) {
 		/*
 		 * Check whether the page to write out is beyond or straddles
 		 * i_size or not.
@@ -1534,7 +1537,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * |				    |      Straddles     |
 		 * ---------------------------------^-----------|--------|
 		 */
-		unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+		unsigned offset_into_page = offset_in_thp(page, offset);
+		pgoff_t end_index = offset >> PAGE_SHIFT;
 
 		/*
 		 * Skip the page if it is fully outside i_size, e.g. due to a
@@ -1565,7 +1569,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * memory is zeroed when mapped, and writes to that region are
 		 * not written out to the file."
 		 */
-		zero_user_segment(page, offset_into_page, PAGE_SIZE);
+		zero_user_segment(page, offset_into_page, thp_size(page));
 
 		/* Adjust the end_offset to the end of file */
 		end_offset = offset;
-- 
2.26.2


* [PATCH v6 30/51] iomap: Inline data shouldn't see THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (28 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 29/51] iomap: Support THPs in write paths Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 31/51] iomap: Handle tail pages in iomap_page_mkwrite Matthew Wilcox
                   ` (22 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Christoph Hellwig

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Assert that we're not seeing THPs in functions that read/write
inline data, rather than zeroing out the tail.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9275268ea97e..318c1ecc18c0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -215,6 +215,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 		return;
 
 	BUG_ON(page->index);
+	BUG_ON(PageCompound(page));
 	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
@@ -728,6 +729,7 @@ static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
 	void *addr;
 
 	WARN_ON_ONCE(!PageUptodate(page));
+	BUG_ON(PageCompound(page));
 	BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
 	addr = kmap_atomic(page);
-- 
2.26.2


* [PATCH v6 31/51] iomap: Handle tail pages in iomap_page_mkwrite
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (29 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 30/51] iomap: Inline data shouldn't see THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 32/51] xfs: Support THPs Matthew Wilcox
                   ` (21 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

iomap_page_mkwrite() can be called with a tail page.  If it is,
operate on the head page, since we're treating the entire thing as a
single unit and the whole page is dirtied at the same time.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 318c1ecc18c0..8359e369c4b5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1046,7 +1046,7 @@ iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
 
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 {
-	struct page *page = vmf->page;
+	struct page *page = thp_head(vmf->page);
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	unsigned long length;
 	loff_t offset;
-- 
2.26.2


* [PATCH v6 32/51] xfs: Support THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (30 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 31/51] iomap: Handle tail pages in iomap_page_mkwrite Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 33/51] mm: Make prep_transhuge_page return its argument Matthew Wilcox
                   ` (20 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

There is one place which assumes the size of a page; fix it, and set
FS_THP_SUPPORT in the filesystem's fs_flags.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/xfs/xfs_aops.c  | 2 +-
 fs/xfs/xfs_super.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 55d126d4e096..20968842b2f0 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -548,7 +548,7 @@ xfs_discard_page(
 	if (error && !XFS_FORCED_SHUTDOWN(mp))
 		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
-	iomap_invalidatepage(page, 0, PAGE_SIZE);
+	iomap_invalidatepage(page, 0, thp_size(page));
 }
 
 static const struct iomap_writeback_ops xfs_writeback_ops = {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 379cbff438bc..1a4a7a8766db 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1829,7 +1829,7 @@ static struct file_system_type xfs_fs_type = {
 	.init_fs_context	= xfs_init_fs_context,
 	.parameters		= xfs_fs_parameters,
 	.kill_sb		= kill_block_super,
-	.fs_flags		= FS_REQUIRES_DEV,
+	.fs_flags		= FS_REQUIRES_DEV | FS_THP_SUPPORT,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.26.2


* [PATCH v6 33/51] mm: Make prep_transhuge_page return its argument
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (31 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 32/51] xfs: Support THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 34/51] mm: Add __page_cache_alloc_order Matthew Wilcox
                   ` (19 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Kirill A. Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

By permitting NULL or order-0 pages as an argument, and returning the
argument, callers can write:

	return prep_transhuge_page(alloc_pages(...));

instead of assigning the result to a temporary variable and conditionally
passing that to prep_transhuge_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h | 7 +++++--
 mm/huge_memory.c        | 9 +++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)
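
Not part of the patch: a minimal userspace sketch of the calling pattern
described in the commit message above.  struct toy_page, toy_alloc() and
toy_prep() are hypothetical stand-ins for struct page, alloc_pages() and
prep_transhuge_page(); only the NULL/order-0 pass-through and the
return-the-argument shape are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; only the fields this sketch needs. */
struct toy_page {
	unsigned int order;	/* 0 for a base page */
	int prepared;		/* set once THP-specific setup has run */
};

/* Hypothetical allocator: returns NULL to simulate allocation failure. */
static struct toy_page *toy_alloc(struct toy_page *slot, unsigned int order,
				  int fail)
{
	if (fail)
		return NULL;
	slot->order = order;
	slot->prepared = 0;
	return slot;
}

/*
 * Same shape as the new prep_transhuge_page(): tolerate NULL and order-0
 * pages, and always hand the argument back to the caller.
 */
static struct toy_page *toy_prep(struct toy_page *page)
{
	if (!page || page->order == 0)
		return page;
	assert(page->order != 1);	/* the patch uses BUG_ON() here */
	page->prepared = 1;
	return page;
}

/* Allocation and preparation chained in one expression, no temporary. */
static struct toy_page *toy_alloc_prepared(struct toy_page *slot,
					   unsigned int order, int fail)
{
	return toy_prep(toy_alloc(slot, order, fail));
}
```

A failed allocation propagates NULL straight through toy_prep(), which
is what lets the caller drop the temporary variable and the NULL check.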

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d125912a3e0d..61de7e8683cd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -193,7 +193,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 		unsigned long addr, unsigned long len, unsigned long pgoff,
 		unsigned long flags);
 
-extern void prep_transhuge_page(struct page *page);
+extern struct page *prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
@@ -381,7 +381,10 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 	return false;
 }
 
-static inline void prep_transhuge_page(struct page *page) {}
+static inline struct page *prep_transhuge_page(struct page *page)
+{
+	return page;
+}
 
 static inline bool is_transparent_hugepage(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..80fb38e93837 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -508,15 +508,20 @@ static inline struct deferred_split *get_deferred_split_queue(struct page *page)
 }
 #endif
 
-void prep_transhuge_page(struct page *page)
+struct page *prep_transhuge_page(struct page *page)
 {
+	if (!page || compound_order(page) == 0)
+		return page;
 	/*
-	 * we use page->mapping and page->indexlru in second tail page
+	 * we use page->mapping and page->index in second tail page
 	 * as list_head: assuming THP order >= 2
 	 */
+	BUG_ON(compound_order(page) == 1);
 
 	INIT_LIST_HEAD(page_deferred_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+
+	return page;
 }
 
 bool is_transparent_hugepage(struct page *page)
-- 
2.26.2


* [PATCH v6 34/51] mm: Add __page_cache_alloc_order
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (32 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 33/51] mm: Make prep_transhuge_page return its argument Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 35/51] mm: Allow THPs to be added to the page cache Matthew Wilcox
                   ` (18 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel, Kirill A. Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This new function allows page cache pages to be allocated that are
larger than an order-0 page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h | 24 +++++++++++++++++++++---
 mm/filemap.c            | 12 ++++++++----
 2 files changed, 29 insertions(+), 7 deletions(-)
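
Not part of the patch: a rough userspace sketch of how the allocation
mask is adjusted for orders above zero.  The TOY_GFP_* bit values are
made up for illustration, and TOY_GFP_TRANSHUGE_LIGHT is a simplified
stand-in for the real GFP_TRANSHUGE_LIGHT; only the "OR in the light THP
flags, then clear direct reclaim" sequence is taken from thp_gfpmask()
in the diff below.

```c
#include <assert.h>

typedef unsigned int toy_gfp_t;

/* Illustrative bit values only; the real flags live in linux/gfp.h. */
#define TOY_GFP_DIRECT_RECLAIM	0x1u
#define TOY_GFP_KSWAPD_RECLAIM	0x2u
#define TOY_GFP_COMP		0x4u
#define TOY_GFP_NOWARN		0x8u
#define TOY_GFP_TRANSHUGE_LIGHT	(TOY_GFP_COMP | TOY_GFP_NOWARN)

/* Prefer failing the large allocation over stalling in direct reclaim. */
static toy_gfp_t toy_thp_gfpmask(toy_gfp_t gfp)
{
	gfp |= TOY_GFP_TRANSHUGE_LIGHT;
	gfp &= ~TOY_GFP_DIRECT_RECLAIM;
	assert((gfp & TOY_GFP_DIRECT_RECLAIM) == 0);
	return gfp;
}

/* Order-0 allocations keep the caller's mask untouched. */
static toy_gfp_t toy_alloc_mask(toy_gfp_t gfp, unsigned int order)
{
	return order == 0 ? gfp : toy_thp_gfpmask(gfp);
}
```

With a base mask that allows both kinds of reclaim, an order-3 request
keeps kswapd reclaim but loses direct reclaim, so a caller that cannot
get a large page quickly can fall back to a smaller one.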

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f47ba9f18f3e..8455a3e16900 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -280,15 +280,33 @@ static inline void *detach_page_private(struct page *page)
 	return data;
 }
 
+static inline gfp_t thp_gfpmask(gfp_t gfp)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/* We'd rather allocate smaller pages than stall a page fault */
+	gfp |= GFP_TRANSHUGE_LIGHT;
+	gfp &= ~__GFP_DIRECT_RECLAIM;
+#endif
+	return gfp;
+}
+
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
-	return alloc_pages(gfp, 0);
+	if (order == 0)
+		return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(thp_gfpmask(gfp), order));
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+	return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x));
diff --git a/mm/filemap.c b/mm/filemap.c
index 80ce3658b147..3a9579a1ffa7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -938,24 +938,28 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order)
 {
 	int n;
 	struct page *page;
 
+	if (order > 0)
+		gfp = thp_gfpmask(gfp);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
 			cpuset_mems_cookie = read_mems_allowed_begin();
 			n = cpuset_mem_spread_node();
-			page = __alloc_pages_node(n, gfp, 0);
+			page = __alloc_pages_node(n, gfp, order);
+			prep_transhuge_page(page);
 		} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
 
 		return page;
 	}
-	return alloc_pages(gfp, 0);
+	return prep_transhuge_page(alloc_pages(gfp, order));
 }
-EXPORT_SYMBOL(__page_cache_alloc);
+EXPORT_SYMBOL(__page_cache_alloc_order);
 #endif
 
 /*
-- 
2.26.2


* [PATCH v6 35/51] mm: Allow THPs to be added to the page cache
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (33 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 34/51] mm: Add __page_cache_alloc_order Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 36/51] mm: Allow THPs to be removed from " Matthew Wilcox
                   ` (17 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We return -EEXIST if there are any non-shadow entries in the page
cache in the range covered by the THP.  If there are multiple
shadow entries in the range, we set *shadowp to one of them (currently
the one at the highest index).  If that turns out to be the wrong
answer, we can implement something more complex.  This is mostly
modelled after the equivalent function in the shmem code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 52 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 17 deletions(-)
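
Not part of the patch: a userspace model of the conflict rule described
in the commit message above.  The page cache is reduced to a flat array
of hypothetical struct toy_slot entries, and toy_check_range() is an
invented helper; the real code walks the XArray with
xas_for_each_conflict() under the xa_lock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_kind { TOY_EMPTY, TOY_SHADOW, TOY_PAGE };

struct toy_slot {
	enum toy_kind kind;
	int value;		/* payload of a shadow entry */
};

/*
 * Scan the nr slots starting at start.  Any real page in the range is a
 * conflict (-EEXIST).  Shadow entries are counted, and *shadowp ends up
 * pointing at the last one seen, i.e. the one at the highest index.
 */
static int toy_check_range(const struct toy_slot *slots, size_t start,
			   size_t nr, const struct toy_slot **shadowp,
			   size_t *exceptional)
{
	size_t i;

	assert(slots && exceptional);
	*exceptional = 0;
	for (i = start; i < start + nr; i++) {
		if (slots[i].kind == TOY_EMPTY)
			continue;
		if (slots[i].kind == TOY_PAGE)
			return -EEXIST;
		(*exceptional)++;
		if (shadowp)
			*shadowp = &slots[i];
	}
	return 0;
}
```

On success the caller would subtract the shadow count from its
exceptional-entry accounting and add nr to its page count, mirroring the
nrexceptional and nrpages updates in the diff below.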

diff --git a/mm/filemap.c b/mm/filemap.c
index 3a9579a1ffa7..ab9746aff766 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -834,41 +834,58 @@ static int __add_to_page_cache_locked(struct page *page,
 	XA_STATE(xas, &mapping->i_pages, offset);
 	int huge = PageHuge(page);
 	int error;
+	unsigned int nr = 1;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 	mapping_set_update(&xas, mapping);
 
-	get_page(page);
-	page->mapping = mapping;
-	page->index = offset;
-
 	if (!huge) {
 		error = mem_cgroup_charge(page, current->mm, gfp_mask);
 		if (error)
-			goto error;
+			return error;
+		xas_set_order(&xas, offset, thp_order(page));
+		nr = thp_nr_pages(page);
 	}
 
+	page_ref_add(page, nr);
+	page->mapping = mapping;
+	page->index = offset;
+
 	do {
+		unsigned long exceptional = 0;
+		unsigned int i = 0;
+
 		xas_lock_irq(&xas);
-		old = xas_load(&xas);
-		if (old && !xa_is_value(old))
-			xas_set_err(&xas, -EEXIST);
-		xas_store(&xas, page);
+		xas_for_each_conflict(&xas, old) {
+			if (!xa_is_value(old)) {
+				xas_set_err(&xas, -EEXIST);
+				break;
+			}
+			exceptional++;
+			if (shadowp)
+				*shadowp = old;
+		}
+		xas_create_range(&xas);
 		if (xas_error(&xas))
 			goto unlock;
 
-		if (xa_is_value(old)) {
-			mapping->nrexceptional--;
-			if (shadowp)
-				*shadowp = old;
+next:
+		xas_store(&xas, page);
+		if (++i < nr) {
+			xas_next(&xas);
+			goto next;
 		}
-		mapping->nrpages++;
+		mapping->nrexceptional -= exceptional;
+		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
-		if (!huge)
-			__inc_lruvec_page_state(page, NR_FILE_PAGES);
+		if (!huge) {
+			__mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
+			if (nr > 1)
+				__inc_lruvec_page_state(page, NR_FILE_THPS);
+		}
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -883,7 +900,8 @@ static int __add_to_page_cache_locked(struct page *page,
 error:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
-	put_page(page);
+	page_ref_sub(page, nr);
+	VM_BUG_ON_PAGE(page_count(page) <= 0, page);
 	return error;
 }
 ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
-- 
2.26.2


* [PATCH v6 36/51] mm: Allow THPs to be removed from the page cache
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (34 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 35/51] mm: Allow THPs to be added to the page cache Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 37/51] mm: Remove page fault assumption of compound page size Matthew Wilcox
                   ` (16 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page_cache_free_page() assumes THPs are PMD_SIZE; fix that assumption.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ab9746aff766..78f888d028c5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -248,7 +248,7 @@ static void page_cache_free_page(struct address_space *mapping,
 		freepage(page);
 
 	if (PageTransHuge(page) && !PageHuge(page)) {
-		page_ref_sub(page, HPAGE_PMD_NR);
+		page_ref_sub(page, thp_nr_pages(page));
 		VM_BUG_ON_PAGE(page_count(page) <= 0, page);
 	} else {
 		put_page(page);
-- 
2.26.2


* [PATCH v6 37/51] mm: Remove page fault assumption of compound page size
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (35 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 36/51] mm: Allow THPs to be removed from " Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 38/51] mm: Fix total_mapcount assumption of " Matthew Wilcox
                   ` (15 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

A compound page in the page cache will not necessarily be of PMD size,
so check explicitly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dc7f3543b1fd..3e6ef0ebfdd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3552,13 +3552,14 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	pmd_t entry;
 	int i;
-	vm_fault_t ret;
+	vm_fault_t ret = VM_FAULT_FALLBACK;
 
 	if (!transhuge_vma_suitable(vma, haddr))
-		return VM_FAULT_FALLBACK;
+		return ret;
 
-	ret = VM_FAULT_FALLBACK;
 	page = compound_head(page);
+	if (compound_order(page) != HPAGE_PMD_ORDER)
+		return ret;
 
 	/*
 	 * Archs like ppc64 need additonal space to store information
-- 
2.26.2


* [PATCH v6 38/51] mm: Fix total_mapcount assumption of page size
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (36 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 37/51] mm: Remove page fault assumption of compound page size Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 39/51] mm: Remove assumptions of THP size Matthew Wilcox
                   ` (14 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Kirill A. Shutemov, linux-mm, linux-kernel, Matthew Wilcox

From: "Kirill A. Shutemov" <kirill@shutemov.name>

File THPs may now be of arbitrary order.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
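
Not part of the patch: a userspace sketch of the total_mapcount()
arithmetic once the loop bound comes from the page's order instead of
HPAGE_PMD_NR.  struct toy_thp is hypothetical, and its per-subpage
counts are stored already adjusted by +1 (the kernel keeps _mapcount
biased by -1 and adds one when reading).

```c
#include <assert.h>
#include <stdbool.h>

struct toy_thp {
	unsigned int order;
	int compound_mapcount;		/* mappings of the page as a whole */
	const int *sub_mapcount;	/* per-subpage counts, +1 adjusted */
	bool anon;
	bool double_map;
};

static int toy_total_mapcount(const struct toy_thp *thp)
{
	int nr, ret, i;

	assert(thp->sub_mapcount && thp->order < 31);
	nr = 1 << thp->order;
	ret = thp->compound_mapcount;

	for (i = 0; i < nr; i++)
		ret += thp->sub_mapcount[i];
	/* File pages have compound_mapcount included in each subpage. */
	if (!thp->anon)
		return ret - thp->compound_mapcount * nr;
	/* A double-mapped anon THP carries one extra count per subpage. */
	if (thp->double_map)
		ret -= nr;
	return ret;
}
```

For an order-2 file THP mapped once as a whole and once more by a PTE on
its first subpage, the subpage counts are {2, 1, 1, 1} and the result is
1 + 5 - 1 * 4 = 2.  With a hard-coded 512 the loop would read counts far
beyond the four subpages that exist.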

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 80fb38e93837..744863aa0374 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2490,7 +2490,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 int total_mapcount(struct page *page)
 {
-	int i, compound, ret;
+	int i, compound, nr, ret;
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -2498,16 +2498,17 @@ int total_mapcount(struct page *page)
 		return atomic_read(&page->_mapcount) + 1;
 
 	compound = compound_mapcount(page);
+	nr = compound_nr(page);
 	if (PageHuge(page))
 		return compound;
 	ret = compound;
-	for (i = 0; i < HPAGE_PMD_NR; i++)
+	for (i = 0; i < nr; i++)
 		ret += atomic_read(&page[i]._mapcount) + 1;
 	/* File pages has compound_mapcount included in _mapcount */
 	if (!PageAnon(page))
-		return ret - compound * HPAGE_PMD_NR;
+		return ret - compound * nr;
 	if (PageDoubleMap(page))
-		ret -= HPAGE_PMD_NR;
+		ret -= nr;
 	return ret;
 }
 
-- 
2.26.2


* [PATCH v6 39/51] mm: Remove assumptions of THP size
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (37 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 38/51] mm: Fix total_mapcount assumption of " Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 40/51] mm: Avoid splitting THPs Matthew Wilcox
                   ` (13 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove direct uses of HPAGE_PMD_NR in paths that aren't necessarily
PMD sized.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 15 ++++++++-------
 mm/rmap.c        | 10 +++++-----
 2 files changed, 13 insertions(+), 12 deletions(-)
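
Not part of the patch: a small userspace sketch of the pin accounting in
can_split_huge_page() once the page cache holds one reference per
subpage of an arbitrary-order THP.  toy_extra_pins() and toy_can_split()
are hypothetical helpers that take the counts as plain integers.

```c
#include <assert.h>
#include <stdbool.h>

/* References held by the page cache or swap cache: one per subpage. */
static unsigned int toy_extra_pins(unsigned int order, bool anon,
				   bool swap_cache)
{
	unsigned int nr;

	assert(order < 31);
	nr = 1u << order;
	if (anon)
		return swap_cache ? nr : 0;
	return nr;
}

/*
 * A split is only possible when every reference is accounted for by a
 * mapping, a cache pin, or the caller's own reference.
 */
static bool toy_can_split(unsigned int order, bool anon, bool swap_cache,
			  int total_mapcount, int page_count)
{
	int extra = (int)toy_extra_pins(order, anon, swap_cache);

	return total_mapcount == page_count - extra - 1;
}
```

An order-3 file THP mapped twice needs a reference count of exactly
2 + 8 + 1 = 11 to be splittable; assuming 512 cache pins for such a page
would make the check fail every time.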

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 744863aa0374..c25d8e2310e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2340,7 +2340,7 @@ static void remap_page(struct page *page)
 	if (PageTransHuge(page)) {
 		remove_migration_ptes(page, page, true);
 	} else {
-		for (i = 0; i < HPAGE_PMD_NR; i++)
+		for (i = 0; i < thp_nr_pages(page); i++)
 			remove_migration_ptes(page + i, page + i, true);
 	}
 }
@@ -2415,6 +2415,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
+	unsigned int nr = thp_nr_pages(head);
 	int i;
 
 	lruvec = mem_cgroup_page_lruvec(head, pgdat);
@@ -2430,7 +2431,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
+	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
@@ -2471,7 +2472,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	remap_page(head);
 
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
 			continue;
@@ -2553,14 +2554,14 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	page = compound_head(page);
 
 	_total_mapcount = ret = 0;
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < thp_nr_pages(page); i++) {
 		mapcount = atomic_read(&page[i]._mapcount) + 1;
 		ret = max(ret, mapcount);
 		_total_mapcount += mapcount;
 	}
 	if (PageDoubleMap(page)) {
 		ret -= 1;
-		_total_mapcount -= HPAGE_PMD_NR;
+		_total_mapcount -= thp_nr_pages(page);
 	}
 	mapcount = compound_mapcount(page);
 	ret += mapcount;
@@ -2577,9 +2578,9 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 
 	/* Additional pins from page cache */
 	if (PageAnon(page))
-		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+		extra_pins = PageSwapCache(page) ? thp_nr_pages(page) : 0;
 	else
-		extra_pins = HPAGE_PMD_NR;
+		extra_pins = thp_nr_pages(page);
 	if (pextra_pins)
 		*pextra_pins = extra_pins;
 	return total_mapcount(page) == page_count(page) - extra_pins - 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index c56fab5826c1..c0295282928b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1205,7 +1205,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 	if (compound && PageTransHuge(page)) {
-		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
 			if (atomic_inc_and_test(&page[i]._mapcount))
 				nr++;
 		}
@@ -1246,7 +1246,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 
 	/* page still mapped by someone else? */
 	if (compound && PageTransHuge(page)) {
-		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
@@ -1293,7 +1293,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		 * Subpages can be mapped with PTEs too. Check how many of
 		 * them are still mapped.
 		 */
-		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
@@ -1303,10 +1303,10 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		 * page of the compound page is unmapped, but at least one
 		 * small page is still mapped.
 		 */
-		if (nr && nr < HPAGE_PMD_NR)
+		if (nr && nr < thp_nr_pages(page))
 			deferred_split_huge_page(page);
 	} else {
-		nr = HPAGE_PMD_NR;
+		nr = thp_nr_pages(page);
 	}
 
 	if (unlikely(PageMlocked(page)))
-- 
2.26.2


* [PATCH v6 40/51] mm: Avoid splitting THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (38 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 39/51] mm: Remove assumptions of THP size Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 41/51] mm: Fix truncation for pages of arbitrary size Matthew Wilcox
                   ` (12 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the filesystem supports THPs, then do not split them before
removing them from the page cache; remove them as a unit.
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
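
Not part of the patch: a userspace sketch of the reclaim decision for a
file THP as changed in the hunk below.  struct toy_mapping, enum
toy_outcome and toy_reclaim_file_thp() are hypothetical; split_error
stands for the return value of split_huge_page_to_list() (0 on success,
negative on failure).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping {
	bool thp_support;	/* models mapping_thp_support(mapping) */
};

enum toy_outcome {
	TOY_KEEP_LOCKED,	/* split was needed but failed */
	TOY_RECLAIM_SPLIT,	/* split succeeded, continue with small pages */
	TOY_RECLAIM_WHOLE,	/* no split attempted, remove as a unit */
};

static enum toy_outcome
toy_reclaim_file_thp(const struct toy_mapping *mapping, int split_error)
{
	assert(split_error <= 0);
	/* Only split when the filesystem cannot handle a THP itself. */
	if (!mapping || !mapping->thp_support)
		return split_error ? TOY_KEEP_LOCKED : TOY_RECLAIM_SPLIT;
	return TOY_RECLAIM_WHOLE;
}
```

Because of the short-circuit in the real condition, the split is never
attempted for a mapping that supports THPs, so its result does not
matter in that case.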

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17934e03b3aa..0db62c1001f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1277,9 +1277,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 				/* Adding to swap updated mapping */
 				mapping = page_mapping(page);
 			}
-		} else if (unlikely(PageTransHuge(page))) {
-			/* Split file THP */
-			if (split_huge_page_to_list(page, page_list))
+		} else if (PageTransHuge(page)) {
+			if ((!mapping || !mapping_thp_support(mapping)) &&
+			    split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}
 
-- 
2.26.2


* [PATCH v6 41/51] mm: Fix truncation for pages of arbitrary size
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (39 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 40/51] mm: Avoid splitting THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 42/51] mm: Handle truncates that split THPs Matthew Wilcox
                   ` (11 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove the assumption that a compound page is HPAGE_PMD_SIZE,
and the assumption that any page is PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/truncate.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
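
Not part of the patch: a userspace sketch contrasting the old size
assumption with the order-based one.  The toy_* helpers are
hypothetical, and the constants assume 4 KiB base pages with an order-9
PMD (as on x86-64); other configurations use different values.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT		12	/* assumption: 4 KiB base pages */
#define TOY_PAGE_SIZE		(1UL << TOY_PAGE_SHIFT)
#define TOY_HPAGE_PMD_ORDER	9	/* assumption: 2 MiB PMD pages */
#define TOY_HPAGE_PMD_NR	(1UL << TOY_HPAGE_PMD_ORDER)

/* Old rule: any compound page is assumed to be PMD sized. */
static unsigned long toy_old_nr(unsigned int order)
{
	return order > 0 ? TOY_HPAGE_PMD_NR : 1;
}

/* New rule: derive the page count and byte size from the order. */
static unsigned long toy_thp_nr_pages(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - TOY_PAGE_SHIFT);
	return 1UL << order;
}

static unsigned long toy_thp_size(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - TOY_PAGE_SHIFT);
	return TOY_PAGE_SIZE << order;
}
```

For an order-4 page the old rule would unmap 512 pages where only 16
exist, and invalidating PAGE_SIZE bytes would cover 4096 of the page's
65536 bytes.  The two rules agree only at order 0 and at the PMD order.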

diff --git a/mm/truncate.c b/mm/truncate.c
index 152974888124..a9fde773179b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -168,7 +168,7 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  * becomes orphaned.  It will be left on the LRU and may even be mapped into
  * user pagetables if we're racing with filemap_fault().
  *
- * We need to bale out if page->mapping is no longer equal to the original
+ * We need to bail out if page->mapping is no longer equal to the original
  * mapping.  This happens a) when the VM reclaimed the page while we waited on
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
@@ -177,12 +177,12 @@ static void
 truncate_cleanup_page(struct address_space *mapping, struct page *page)
 {
 	if (page_mapped(page)) {
-		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
+		unsigned int nr = thp_nr_pages(page);
 		unmap_mapping_pages(mapping, page->index, nr, false);
 	}
 
 	if (page_has_private(page))
-		do_invalidatepage(page, 0, PAGE_SIZE);
+		do_invalidatepage(page, 0, thp_size(page));
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
-- 
2.26.2


* [PATCH v6 42/51] mm: Handle truncates that split THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (40 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 41/51] mm: Fix truncation for pages of arbitrary size Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 43/51] mm: Support storing shadow entries for THPs Matthew Wilcox
                   ` (10 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Move shmem_punch_compound() to truncate.c and rename it to punch_thp().
Change its arguments to loff_t to make calling do_invalidatepage()
easier.  Call it from truncate_inode_pages_range() when we find a THP in
the cache.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h |  2 ++
 mm/shmem.c    | 30 ++-------------------------
 mm/truncate.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 59 insertions(+), 30 deletions(-)
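
Not part of the patch: a userspace sketch of the offset/length
arithmetic in punch_thp() below.  struct toy_punch and toy_punch_range()
are hypothetical; pos is the byte position of the head page and size is
the THP size in bytes.  The sketch treats start and end exactly as the
quoted arithmetic does and assumes the range overlaps the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_punch {
	unsigned int offset;	/* first punched byte within the THP */
	unsigned int length;	/* number of punched bytes within the THP */
	bool whole;		/* true if the THP can go without a split */
};

static struct toy_punch toy_punch_range(int64_t pos, unsigned int size,
					int64_t start, int64_t end)
{
	struct toy_punch p;

	/* Precondition for this sketch: the range touches the page. */
	assert(start < pos + size && end > pos);

	p.offset = pos < start ? (unsigned int)(start - pos) : 0;
	if (pos + size < end)
		p.length = size - p.offset;
	else
		p.length = (unsigned int)(end - pos) - p.offset;
	p.whole = (p.length == size);
	return p;
}
```

For a 64 KiB THP at byte 65536, punching from 70000 up to 80000 yields
offset 4464 and length 10000.  That is the sub-range the filesystem
would be told about through do_invalidatepage() before the split is
attempted.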

diff --git a/mm/internal.h b/mm/internal.h
index ac3c79408045..cd7038a36354 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -613,4 +613,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 
 void setup_zone_pageset(struct zone *zone);
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+bool punch_thp(struct page *page, loff_t start, loff_t end);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 55405d811cfd..495b8684d94a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -804,32 +804,6 @@ void shmem_unlock_mapping(struct address_space *mapping)
 	}
 }
 
-/*
- * Check whether a hole-punch or truncation needs to split a huge page,
- * returning true if no split was required, or the split has been successful.
- *
- * Eviction (or truncation to 0 size) should never need to split a huge page;
- * but in rare cases might do so, if shmem_undo_range() failed to trylock on
- * head, and then succeeded to trylock on tail.
- *
- * A split can only succeed when there are no additional references on the
- * huge page: so the split below relies upon find_get_entries() having stopped
- * when it found a subpage of the huge page, without getting further references.
- */
-static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
-{
-	if (!PageTransCompound(page))
-		return true;
-
-	/* Just proceed to delete a huge page wholly within the range punched */
-	if (PageHead(page) &&
-	    page->index >= start && page->index + HPAGE_PMD_NR <= end)
-		return true;
-
-	/* Try to split huge page, so we can truly punch the hole or truncate */
-	return split_huge_page(page) >= 0;
-}
-
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -883,7 +857,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if ((!unfalloc || !PageUptodate(page)) &&
 			    page_mapping(page) == mapping) {
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
-				if (shmem_punch_compound(page, start, end))
+				if (punch_thp(page, lstart, lend))
 					truncate_inode_page(mapping, page);
 			}
 			unlock_page(page);
@@ -973,7 +947,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 					break;
 				}
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
-				if (shmem_punch_compound(page, start, end))
+				if (punch_thp(page, lstart, lend))
 					truncate_inode_page(mapping, page);
 				else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 					/* Wipe the page and don't get stuck */
diff --git a/mm/truncate.c b/mm/truncate.c
index a9fde773179b..0ef2001c2f65 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -229,6 +229,55 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 	return 0;
 }
 
+/*
+ * Check whether a hole-punch or truncation needs to split a huge page,
+ * returning true if no split was required, or the split has been
+ * successful.
+ *
+ * Eviction (or truncation to 0 size) should never need to split a huge
+ * page; but in rare cases might do so, if shmem_undo_range() failed to
+ * trylock on head, and then succeeded to trylock on tail.
+ *
+ * A split can only succeed when there are no additional references on
+ * the huge page: so the split below relies upon find_get_entries()
+ * having stopped when it found a subpage of the huge page, without
+ * getting further references.
+ */
+bool punch_thp(struct page *page, loff_t start, loff_t end)
+{
+	struct page *head = thp_head(page);
+	loff_t pos = page_offset(head);
+	unsigned int offset, length;
+
+	if (!PageTransCompound(page))
+		return true;
+
+	if (pos < start)
+		offset = start - pos;
+	else
+		offset = 0;
+	length = thp_size(head);
+	if (pos + length < end)
+		length = length - offset;
+	else
+		length = end - pos - offset;
+
+	/* Just proceed to delete a huge page wholly within the range punched */
+	if (length == thp_size(head))
+		return true;
+
+	/*
+	 * We're going to split the page into order-0 pages.  Tell the
+	 * filesystem which range of the page is going to be punched out
+	 * so it can discard unnecessary private data.
+	 */
+	if (page_has_private(head))
+		do_invalidatepage(head, offset, length);
+
+	/* Try to split huge page, so we can truly punch the hole or truncate */
+	return split_huge_page(page) >= 0;
+}
+
 /*
  * Used to get rid of pages on hardware memory corruption.
  */
@@ -359,7 +408,10 @@ void truncate_inode_pages_range(struct address_space *mapping,
 				unlock_page(page);
 				continue;
 			}
-			pagevec_add(&locked_pvec, page);
+			if (punch_thp(page, lstart, lend))
+				pagevec_add(&locked_pvec, page);
+			else
+				unlock_page(page);
 		}
 		for (i = 0; i < pagevec_count(&locked_pvec); i++)
 			truncate_cleanup_page(mapping, locked_pvec.pages[i]);
@@ -453,7 +505,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			lock_page(page);
 			WARN_ON(page_to_index(page) != index);
 			wait_on_page_writeback(page);
-			truncate_inode_page(mapping, page);
+			if (punch_thp(page, lstart, lend))
+				truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
 		truncate_exceptional_pvec_entries(mapping, &pvec, indices, end);
-- 
2.26.2


* [PATCH v6 43/51] mm: Support storing shadow entries for THPs
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (41 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 42/51] mm: Handle truncates that split THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 44/51] mm: Support retrieving tail pages from the page cache Matthew Wilcox
                   ` (9 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the page is being replaced with a NULL, we can do a single store that
erases the entire range of indices.  Otherwise we have to use a loop to
store one shadow entry in each index.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)
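
Not part of the patch: a userspace model of the two deletion strategies
described in the commit message above.  struct toy_cache is a flat array
standing in for the XArray, and toy_store_order() is a hypothetical
multi-index store that fills 2^order slots in one call; the stores
counter exists only to make the difference visible.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16

struct toy_cache {
	const void *slot[TOY_SLOTS];
	unsigned int stores;	/* number of store operations issued */
};

/* One store covering 2^order consecutive indices. */
static void toy_store_order(struct toy_cache *c, size_t index,
			    unsigned int order, const void *entry)
{
	size_t i, nr = (size_t)1 << order;

	assert(index + nr <= TOY_SLOTS);
	for (i = 0; i < nr; i++)
		c->slot[index + i] = entry;
	c->stores++;
}

/* Remove a THP: erase in one go, or leave a shadow entry per index. */
static void toy_delete(struct toy_cache *c, size_t index,
		       unsigned int order, const void *shadow)
{
	size_t i, entries = (size_t)1 << order;

	if (!shadow) {
		toy_store_order(c, index, order, NULL);
		return;
	}
	for (i = 0; i < entries; i++)
		toy_store_order(c, index + i, 0, shadow);
}
```

Deleting an order-2 page with no shadow takes one store; deleting it
with a shadow takes four, one per index.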

diff --git a/mm/filemap.c b/mm/filemap.c
index 78f888d028c5..17db007f0277 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -120,22 +120,27 @@ static void page_cache_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
-	unsigned int nr = 1;
+	unsigned int i, nr = 1, entries = 1;
 
 	mapping_set_update(&xas, mapping);
 
 	/* hugetlb pages are represented by a single entry in the xarray */
 	if (!PageHuge(page)) {
-		xas_set_order(&xas, page->index, compound_order(page));
-		nr = compound_nr(page);
+		entries = nr = thp_nr_pages(page);
+		if (!shadow) {
+			xas_set_order(&xas, page->index, thp_order(page));
+			entries = 1;
+		}
 	}
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	VM_BUG_ON_PAGE(nr != 1 && shadow, page);
 
-	xas_store(&xas, shadow);
-	xas_init_marks(&xas);
+	for (i = 0; i < entries; i++) {
+		xas_store(&xas, shadow);
+		xas_init_marks(&xas);
+		xas_next(&xas);
+	}
 
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-- 
2.26.2


* [PATCH v6 44/51] mm: Support retrieving tail pages from the page cache
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (42 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 43/51] mm: Support storing shadow entries for THPs Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 45/51] mm: Support tail pages in wait_for_stable_page Matthew Wilcox
                   ` (8 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page->index is not meaningful for tail pages; we have to use
page_to_index() in this assertion.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 17db007f0277..869b970fe1ab 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1652,7 +1652,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page->index != index, page);
+		VM_BUG_ON_PAGE(page_to_index(page) != index, page);
 	}
 
 	if (fgp_flags & FGP_ACCESSED)
-- 
2.26.2
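
Why the assertion needs page_to_index() can be illustrated with a small userspace model. The `toy_page` struct and `toy_page_to_index()` are hypothetical; the sketch assumes a tail page's file index is the head page's index plus the tail's distance from the head.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page: only the head page carries a meaningful index. */
struct toy_page {
	struct toy_page *head;	/* points to itself for a head or order-0 page */
	unsigned long index;	/* valid on the head page only */
};

/* File index of any page in a compound page: head index plus distance. */
static unsigned long toy_page_to_index(const struct toy_page *page)
{
	assert(page->head != NULL);
	return page->head->index + (unsigned long)(page - page->head);
}
```

With a four-page compound page whose head has index 64, the last tail page resolves to 67 even though its own `index` field holds no useful value.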


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 45/51] mm: Support tail pages in wait_for_stable_page
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (43 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 44/51] mm: Support retrieving tail pages from the page cache Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 46/51] mm: Add DEFINE_READAHEAD Matthew Wilcox
                   ` (7 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

page->mapping is undefined for tail pages, so operate exclusively on
the head page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page-writeback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 28b3e7a67565..1b358aac065f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2851,6 +2851,7 @@ EXPORT_SYMBOL_GPL(wait_on_page_writeback);
  */
 void wait_for_stable_page(struct page *page)
 {
+	page = thp_head(page);
 	if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
 		wait_on_page_writeback(page);
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 46/51] mm: Add DEFINE_READAHEAD
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (44 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 45/51] mm: Support tail pages in wait_for_stable_page Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 47/51] mm: Make page_cache_readahead_unbounded take a readahead_control Matthew Wilcox
                   ` (6 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Allow for a more concise definition of a struct readahead_control.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 7 +++++++
 mm/readahead.c          | 6 +-----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8455a3e16900..77c20da5ae99 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -759,6 +759,13 @@ struct readahead_control {
 	unsigned int _batch_count;
 };
 
+#define DEFINE_READAHEAD(rac, f, m, i)					\
+	struct readahead_control rac = {				\
+		.file = f,						\
+		.mapping = m,						\
+		._index = i,						\
+	}
+
 /**
  * readahead_page - Get the next page to read.
  * @rac: The current readahead request.
diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..2126a2754e22 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,11 +179,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 {
 	LIST_HEAD(page_pool);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	struct readahead_control rac = {
-		.mapping = mapping,
-		.file = file,
-		._index = index,
-	};
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	unsigned long i;
 
 	/*
-- 
2.26.2
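
A userspace sketch of the same pattern may help show what the macro buys. The `toy_*` types are hypothetical stand-ins for the kernel structures; the point is that a designated initializer leaves every unnamed member zeroed, so `_nr_pages` and `_batch_count` start at 0 without being mentioned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters here. */
struct toy_file { int unused; };
struct toy_mapping { int unused; };

struct toy_readahead_control {
	struct toy_file *file;
	struct toy_mapping *mapping;
	unsigned long _index;
	unsigned int _nr_pages;
	unsigned int _batch_count;
};

/* Same shape as DEFINE_READAHEAD in the patch, applied to the toy struct. */
#define TOY_DEFINE_READAHEAD(rac, f, m, i)				\
	struct toy_readahead_control rac = {				\
		.file = f,						\
		.mapping = m,						\
		._index = i,						\
	}

/* Declare a control block with the macro and report its page counters. */
static unsigned int toy_fresh_nr_pages(struct toy_mapping *m, unsigned long i)
{
	TOY_DEFINE_READAHEAD(rac, NULL, m, i);

	assert(rac.mapping == m && rac._index == i);
	return rac._nr_pages + rac._batch_count;
}
```

Because the macro expands to a declaration, it has to sit with the other local variable declarations, as it does in the hunks above.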


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 47/51] mm: Make page_cache_readahead_unbounded take a readahead_control
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (45 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 46/51] mm: Add DEFINE_READAHEAD Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 48/51] mm: Make __do_page_cache_readahead " Matthew Wilcox
                   ` (5 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Define it in the callers instead of in page_cache_readahead_unbounded().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/verity.c        |  4 ++--
 fs/f2fs/verity.c        |  4 ++--
 include/linux/pagemap.h |  5 ++---
 mm/readahead.c          | 26 ++++++++++++--------------
 4 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index dec1244dd062..fe2e541543da 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -346,6 +346,7 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
 {
+	DEFINE_READAHEAD(rac, NULL, inode->i_mapping, index);
 	struct page *page;
 
 	index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;
@@ -355,8 +356,7 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			page_cache_readahead_unbounded(inode->i_mapping, NULL,
-					index, num_ra_pages, 0);
+			page_cache_readahead_unbounded(&rac, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 865c9fb774fb..707a94745472 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -226,6 +226,7 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
 {
+	DEFINE_READAHEAD(rac, NULL, inode->i_mapping, index);
 	struct page *page;
 
 	index += f2fs_verity_metadata_pos(inode) >> PAGE_SHIFT;
@@ -235,8 +236,7 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			page_cache_readahead_unbounded(inode->i_mapping, NULL,
-					index, num_ra_pages, 0);
+			page_cache_readahead_unbounded(&rac, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 77c20da5ae99..f36714473b1e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -715,9 +715,8 @@ void page_cache_sync_readahead(struct address_space *, struct file_ra_state *,
 void page_cache_async_readahead(struct address_space *, struct file_ra_state *,
 		struct file *, struct page *, pgoff_t index,
 		unsigned long req_count);
-void page_cache_readahead_unbounded(struct address_space *, struct file *,
-		pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_count);
+void page_cache_readahead_unbounded(struct readahead_control *,
+		unsigned long nr_to_read, unsigned long lookahead_count);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/mm/readahead.c b/mm/readahead.c
index 2126a2754e22..62da2d4beed1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -159,9 +159,7 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 
 /**
  * page_cache_readahead_unbounded - Start unchecked readahead.
- * @mapping: File address space.
- * @file: This instance of the open file; used for authentication.
- * @index: First page index to read.
+ * @rac: Readahead control.
  * @nr_to_read: The number of pages to read.
  * @lookahead_size: Where to start the next readahead.
  *
@@ -173,13 +171,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
  * Context: File is referenced by caller.  Mutexes may be held by caller.
  * May sleep, but will not reenter filesystem to reclaim memory.
  */
-void page_cache_readahead_unbounded(struct address_space *mapping,
-		struct file *file, pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_size)
+void page_cache_readahead_unbounded(struct readahead_control *rac,
+		unsigned long nr_to_read, unsigned long lookahead_size)
 {
+	struct address_space *mapping = rac->mapping;
+	unsigned long index = readahead_index(rac);
 	LIST_HEAD(page_pool);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	DEFINE_READAHEAD(rac, file, mapping, index);
 	unsigned long i;
 
 	/*
@@ -200,7 +198,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 	for (i = 0; i < nr_to_read; i++) {
 		struct page *page = xa_load(&mapping->i_pages, index + i);
 
-		BUG_ON(index + i != rac._index + rac._nr_pages);
+		BUG_ON(index + i != rac->_index + rac->_nr_pages);
 
 		if (page && !xa_is_value(page)) {
 			/*
@@ -211,7 +209,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 			 * have a stable reference to this page, and it's
 			 * not worth getting one just for that.
 			 */
-			read_pages(&rac, &page_pool, true);
+			read_pages(rac, &page_pool, true);
 			continue;
 		}
 
@@ -224,12 +222,12 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 		} else if (add_to_page_cache_lru(page, mapping, index + i,
 					gfp_mask) < 0) {
 			put_page(page);
-			read_pages(&rac, &page_pool, true);
+			read_pages(rac, &page_pool, true);
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
-		rac._nr_pages++;
+		rac->_nr_pages++;
 	}
 
 	/*
@@ -237,7 +235,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 	 * uptodate then the caller will launch readpage again, and
 	 * will then handle the error.
 	 */
-	read_pages(&rac, &page_pool, false);
+	read_pages(rac, &page_pool, false);
 	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_readahead_unbounded);
@@ -252,6 +250,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		struct file *file, pgoff_t index, unsigned long nr_to_read,
 		unsigned long lookahead_size)
 {
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct inode *inode = mapping->host;
 	loff_t isize = i_size_read(inode);
 	pgoff_t end_index;	/* The last page we want to read */
@@ -266,8 +265,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	if (nr_to_read > end_index - index)
 		nr_to_read = end_index - index + 1;
 
-	page_cache_readahead_unbounded(mapping, file, index, nr_to_read,
-			lookahead_size);
+	page_cache_readahead_unbounded(&rac, nr_to_read, lookahead_size);
 }
 
 /*
-- 
2.26.2


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 48/51] mm: Make __do_page_cache_readahead take a readahead_control
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (46 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 47/51] mm: Make page_cache_readahead_unbounded take a readahead_control Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 49/51] mm: Allow PageReadahead to be set on head pages Matthew Wilcox
                   ` (4 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Also call __do_page_cache_readahead() directly from ondemand_readahead()
instead of indirecting via ra_submit().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h  | 11 +++++------
 mm/readahead.c | 26 ++++++++++++++------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cd7038a36354..63c493f892e2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -51,18 +51,17 @@ void unmap_page_range(struct mmu_gather *tlb,
 
 void force_page_cache_readahead(struct address_space *, struct file *,
 		pgoff_t index, unsigned long nr_to_read);
-void __do_page_cache_readahead(struct address_space *, struct file *,
-		pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_size);
+void __do_page_cache_readahead(struct readahead_control *,
+		unsigned long nr_to_read, unsigned long lookahead_size);
 
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
 static inline void ra_submit(struct file_ra_state *ra,
-		struct address_space *mapping, struct file *filp)
+		struct address_space *mapping, struct file *file)
 {
-	__do_page_cache_readahead(mapping, filp,
-			ra->start, ra->size, ra->async_size);
+	DEFINE_READAHEAD(rac, file, mapping, ra->start);
+	__do_page_cache_readahead(&rac, ra->size, ra->async_size);
 }
 
 /**
diff --git a/mm/readahead.c b/mm/readahead.c
index 62da2d4beed1..74c7e1eff540 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -246,12 +246,11 @@ EXPORT_SYMBOL_GPL(page_cache_readahead_unbounded);
  * behaviour which would occur if page allocations are causing VM writeback.
  * We really don't want to intermingle reads and writes like that.
  */
-void __do_page_cache_readahead(struct address_space *mapping,
-		struct file *file, pgoff_t index, unsigned long nr_to_read,
-		unsigned long lookahead_size)
+void __do_page_cache_readahead(struct readahead_control *rac,
+		unsigned long nr_to_read, unsigned long lookahead_size)
 {
-	DEFINE_READAHEAD(rac, file, mapping, index);
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
+	unsigned long index = readahead_index(rac);
 	loff_t isize = i_size_read(inode);
 	pgoff_t end_index;	/* The last page we want to read */
 
@@ -265,7 +264,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	if (nr_to_read > end_index - index)
 		nr_to_read = end_index - index + 1;
 
-	page_cache_readahead_unbounded(&rac, nr_to_read, lookahead_size);
+	page_cache_readahead_unbounded(rac, nr_to_read, lookahead_size);
 }
 
 /*
@@ -273,10 +272,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
  * memory at once.
  */
 void force_page_cache_readahead(struct address_space *mapping,
-		struct file *filp, pgoff_t index, unsigned long nr_to_read)
+		struct file *file, pgoff_t index, unsigned long nr_to_read)
 {
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	struct file_ra_state *ra = &filp->f_ra;
+	struct file_ra_state *ra = &file->f_ra;
 	unsigned long max_pages;
 
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages &&
@@ -294,7 +294,7 @@ void force_page_cache_readahead(struct address_space *mapping,
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
-		__do_page_cache_readahead(mapping, filp, index, this_chunk, 0);
+		__do_page_cache_readahead(&rac, this_chunk, 0);
 
 		index += this_chunk;
 		nr_to_read -= this_chunk;
@@ -432,10 +432,11 @@ static int try_context_readahead(struct address_space *mapping,
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static void ondemand_readahead(struct address_space *mapping,
-		struct file_ra_state *ra, struct file *filp,
+		struct file_ra_state *ra, struct file *file,
 		bool hit_readahead_marker, pgoff_t index,
 		unsigned long req_size)
 {
+	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	unsigned long max_pages = ra->ra_pages;
 	unsigned long add_pages;
@@ -516,7 +517,7 @@ static void ondemand_readahead(struct address_space *mapping,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	__do_page_cache_readahead(mapping, filp, index, req_size, 0);
+	__do_page_cache_readahead(&rac, req_size, 0);
 	return;
 
 initial_readahead:
@@ -542,7 +543,8 @@ static void ondemand_readahead(struct address_space *mapping,
 		}
 	}
 
-	ra_submit(ra, mapping, filp);
+	rac._index = ra->start;
+	__do_page_cache_readahead(&rac, ra->size, ra->async_size);
 }
 
 /**
-- 
2.26.2
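
The context lines of the __do_page_cache_readahead() hunk clamp the request so that it does not run past end of file. A minimal userspace sketch of that arithmetic follows; `toy_clamp_to_eof()` is a hypothetical name, a 4KiB page size is assumed, and the early returns are modelled as a result of 0 pages.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4KiB pages */

/* Return how many pages may be read from index without passing EOF. */
static unsigned long toy_clamp_to_eof(unsigned long long isize,
				      unsigned long index,
				      unsigned long nr_to_read)
{
	unsigned long end_index;	/* the last page we want to read */

	if (isize == 0)
		return 0;
	end_index = (unsigned long)((isize - 1) >> TOY_PAGE_SHIFT);
	if (index > end_index)
		return 0;
	/* Don't read past the page containing the last byte. */
	if (nr_to_read > end_index - index)
		nr_to_read = end_index - index + 1;
	assert(nr_to_read == 0 || index + nr_to_read - 1 <= end_index);
	return nr_to_read;
}
```

For a file of ten full pages plus one byte, the last page index is 10, so a 16-page request starting at index 8 is trimmed to 3 pages.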


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 49/51] mm: Allow PageReadahead to be set on head pages
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (47 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 48/51] mm: Make __do_page_cache_readahead " Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 50/51] mm: Add THP readahead Matthew Wilcox
                   ` (3 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Adjust the callers to only call PageReadahead on the head page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/page-flags.h |  4 ++--
 mm/filemap.c               | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 979460df4768..a3110d675cd0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -377,8 +377,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
 	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Readahead, reclaim, PF_ONLY_HEAD)
+	TESTCLEARFLAG(Readahead, reclaim, PF_ONLY_HEAD)
 
 #ifdef CONFIG_HIGHMEM
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 869b970fe1ab..7a746b486237 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2064,9 +2064,9 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
-		if (PageReadahead(page)) {
+		if (PageReadahead(thp_head(page))) {
 			page_cache_async_readahead(mapping,
-					ra, filp, page,
+					ra, filp, thp_head(page),
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
@@ -2452,10 +2452,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 		return fpin;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page)) {
+	if (PageReadahead(thp_head(page))) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
-					   page, offset, ra->ra_pages);
+				thp_head(page), offset, ra->ra_pages);
 	}
 	return fpin;
 }
@@ -2637,11 +2637,11 @@ void filemap_map_pages(struct vm_fault *vmf,
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto skip;
+		if (PageReadahead(page))
+			goto skip;
 		page = find_subpage(page, xas.xa_index);
 
-		if (!PageUptodate(page) ||
-				PageReadahead(page) ||
-				PageHWPoison(page))
+		if (!PageUptodate(page) || PageHWPoison(page))
 			goto skip;
 		if (!trylock_page(page))
 			goto skip;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 50/51] mm: Add THP readahead
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (48 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 49/51] mm: Allow PageReadahead to be set on head pages Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-10 20:13 ` [PATCH v6 51/51] mm: Align THP mappings for non-DAX Matthew Wilcox
                   ` (2 subsequent siblings)
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Matthew Wilcox (Oracle), linux-mm, linux-kernel

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

If the filesystem supports THPs, allocate larger pages in the
readahead code when it seems worth doing.  The heuristic for choosing
larger page sizes will surely need some tuning, but this aggressive
ramp-up seems good for testing.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 74c7e1eff540..98bbcc986b39 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -149,7 +149,7 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 
 	blk_finish_plug(&plug);
 
-	BUG_ON(!list_empty(pages));
+	BUG_ON(pages && !list_empty(pages));
 	BUG_ON(readahead_count(rac));
 
 out:
@@ -428,13 +428,92 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int ra_alloc_page(struct readahead_control *rac, pgoff_t index,
+		pgoff_t mark, unsigned int order, gfp_t gfp)
+{
+	int err;
+	struct page *page = __page_cache_alloc_order(gfp, order);
+
+	if (!page)
+		return -ENOMEM;
+	if (mark - index < (1UL << order))
+		SetPageReadahead(page);
+	err = add_to_page_cache_lru(page, rac->mapping, index, gfp);
+	if (err)
+		put_page(page);
+	else
+		rac->_nr_pages += 1UL << order;
+	return err;
+}
+
+static bool page_cache_readahead_order(struct readahead_control *rac,
+		struct file_ra_state *ra, unsigned int order)
+{
+	struct address_space *mapping = rac->mapping;
+	unsigned int old_order = order;
+	pgoff_t index = readahead_index(rac);
+	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t mark = index + ra->size - ra->async_size;
+	int err = 0;
+	gfp_t gfp = readahead_gfp_mask(mapping);
+
+	if (!mapping_thp_support(mapping))
+		return false;
+
+	limit = min(limit, index + ra->size - 1);
+
+	/* Grow page size up to PMD size */
+	if (order < HPAGE_PMD_ORDER) {
+		order += 2;
+		if (order > HPAGE_PMD_ORDER)
+			order = HPAGE_PMD_ORDER;
+		while ((1 << order) > ra->size)
+			order--;
+	}
+
+	/* If size is somehow misaligned, fill with order-0 pages */
+	while (!err && index & ((1UL << old_order) - 1))
+		err = ra_alloc_page(rac, index++, mark, 0, gfp);
+
+	while (!err && index & ((1UL << order) - 1)) {
+		err = ra_alloc_page(rac, index, mark, old_order, gfp);
+		index += 1UL << old_order;
+	}
+
+	while (!err && index <= limit) {
+		err = ra_alloc_page(rac, index, mark, order, gfp);
+		index += 1UL << order;
+	}
+
+	if (index > limit) {
+		ra->size += index - limit - 1;
+		ra->async_size += index - limit - 1;
+	}
+
+	read_pages(rac, NULL, false);
+
+	/*
+	 * If there were already pages in the page cache, then we may have
+	 * left some gaps.  Let the regular readahead code take care of this
+	 * situation.
+	 */
+	return !err;
+}
+#else
+static bool page_cache_readahead_order(struct readahead_control *rac,
+		struct file_ra_state *ra, unsigned int order)
+{
+	return false;
+}
+#endif
+
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static void ondemand_readahead(struct address_space *mapping,
 		struct file_ra_state *ra, struct file *file,
-		bool hit_readahead_marker, pgoff_t index,
-		unsigned long req_size)
+		struct page *page, pgoff_t index, unsigned long req_size)
 {
 	DEFINE_READAHEAD(rac, file, mapping, index);
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
@@ -473,7 +552,7 @@ static void ondemand_readahead(struct address_space *mapping,
 	 * Query the pagecache for async_size, which normally equals to
 	 * readahead size. Ramp it up and use it as the new readahead size.
 	 */
-	if (hit_readahead_marker) {
+	if (page) {
 		pgoff_t start;
 
 		rcu_read_lock();
@@ -544,6 +623,8 @@ static void ondemand_readahead(struct address_space *mapping,
 	}
 
 	rac._index = ra->start;
+	if (page && page_cache_readahead_order(&rac, ra, thp_order(page)))
+		return;
 	__do_page_cache_readahead(&rac, ra->size, ra->async_size);
 }
 
@@ -578,7 +659,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 	}
 
 	/* do read-ahead */
-	ondemand_readahead(mapping, ra, filp, false, index, req_count);
+	ondemand_readahead(mapping, ra, filp, NULL, index, req_count);
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
 
@@ -624,7 +705,7 @@ page_cache_async_readahead(struct address_space *mapping,
 		return;
 
 	/* do read-ahead */
-	ondemand_readahead(mapping, ra, filp, true, index, req_count);
+	ondemand_readahead(mapping, ra, filp, page, index, req_count);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
 
-- 
2.26.2
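
The ramp-up heuristic and the three allocation loops in page_cache_readahead_order() can be traced with a userspace sketch. The `toy_*` names are hypothetical, HPAGE_PMD_ORDER is assumed to be 9 (4KiB pages, 2MiB PMD), and allocation failures are ignored, so this only models the arithmetic.

```c
#include <assert.h>

#define TOY_PMD_ORDER 9	/* assumption: 4KiB pages, 2MiB PMD */

/* Next allocation order: grow by 2, cap at PMD order, fit within ra_size. */
static unsigned int toy_next_order(unsigned int order, unsigned long ra_size)
{
	assert(ra_size >= 1);	/* keeps the shrink loop from passing order 0 */
	if (order < TOY_PMD_ORDER) {
		order += 2;
		if (order > TOY_PMD_ORDER)
			order = TOY_PMD_ORDER;
		while ((1UL << order) > ra_size)
			order--;
	}
	return order;
}

/*
 * Count the allocations needed to cover [index, limit]: order-0 pages up to
 * old_order alignment, old_order pages up to order alignment, then pages of
 * the new order.  The first index past the covered range is stored in *end.
 */
static unsigned int toy_plan(unsigned long index, unsigned long limit,
			     unsigned int old_order, unsigned int order,
			     unsigned long *end)
{
	unsigned int allocs = 0;

	while (index & ((1UL << old_order) - 1)) {
		index++;
		allocs++;
	}
	while (index & ((1UL << order) - 1)) {
		index += 1UL << old_order;
		allocs++;
	}
	while (index <= limit) {
		index += 1UL << order;
		allocs++;
	}
	*end = index;
	return allocs;
}
```

Starting at index 5 with old_order 1, order 3 and limit 31, the plan is one order-0 page at 5, one order-1 page at 6, then order-3 pages at 8, 16 and 24. When the last page overshoots the limit, the patch grows ra->size and ra->async_size by the overshoot (`index - limit - 1`).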


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v6 51/51] mm: Align THP mappings for non-DAX
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (49 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 50/51] mm: Add THP readahead Matthew Wilcox
@ 2020-06-10 20:13 ` Matthew Wilcox
  2020-06-11  6:59 ` [RFC v6 00/51] Large pages in the page cache Christoph Hellwig
  2020-06-14 16:26 ` Matthew Wilcox
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-10 20:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: William Kucharski, linux-mm, linux-kernel, Matthew Wilcox

From: William Kucharski <william.kucharski@oracle.com>

When we have the opportunity to use transparent huge pages to map a
file, we want to follow the same rules as DAX.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/huge_memory.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c25d8e2310e8..0fb08e4eeb8e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -577,13 +577,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
-		goto out;
-
 	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
 	if (ret)
 		return ret;
-out:
+
 	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
-- 
2.26.2
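
The alignment being requested can be sketched as plain arithmetic. This assumes the goal of the helper is to pick a virtual address that is congruent to the file offset modulo PMD_SIZE, so that a PMD-sized chunk of the file can be mapped by a single PMD entry; `toy_thp_align()` is a hypothetical name and a 2MiB PMD is assumed.

```c
#include <assert.h>

#define TOY_PMD_SIZE (1UL << 21)	/* assumption: 2MiB PMD mappings */

/*
 * Given the start of a free region padded by one extra PMD, return the first
 * address in it whose offset within a PMD matches the file offset's.
 */
static unsigned long toy_thp_align(unsigned long region, unsigned long off)
{
	unsigned long addr = region + ((off - region) & (TOY_PMD_SIZE - 1));

	assert(addr - region < TOY_PMD_SIZE);
	return addr;
}
```

For a free region at 0x40123000 and a file offset of 0x5000, the sketch returns 0x40205000, whose low 21 bits equal those of the file offset.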


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC v6 00/51] Large pages in the page cache
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (50 preceding siblings ...)
  2020-06-10 20:13 ` [PATCH v6 51/51] mm: Align THP mappings for non-DAX Matthew Wilcox
@ 2020-06-11  6:59 ` Christoph Hellwig
  2020-06-11 11:24   ` Matthew Wilcox
  2020-06-14 16:26 ` Matthew Wilcox
  52 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2020-06-11  6:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Another fortnight, another dump of my current large pages work.
> I've squished a lot of bugs this time.  xfstests is much happier now,
> running for 1631 seconds and getting as far as generic/086.  This patchset
> is getting a little big, so I'm going to try to get some bits of it
> upstream soon (the bits that make sense regardless of whether the rest
> of this is merged).

At this size a git tree to pull would also be nice..

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC v6 00/51] Large pages in the page cache
  2020-06-11  6:59 ` [RFC v6 00/51] Large pages in the page cache Christoph Hellwig
@ 2020-06-11 11:24   ` Matthew Wilcox
  2020-06-15 13:32     ` Christoph Hellwig
  0 siblings, 1 reply; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-11 11:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mm, linux-kernel

On Wed, Jun 10, 2020 at 11:59:54PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Another fortnight, another dump of my current large pages work.
> > I've squished a lot of bugs this time.  xfstests is much happier now,
> > running for 1631 seconds and getting as far as generic/086.  This patchset
> > is getting a little big, so I'm going to try to get some bits of it
> > upstream soon (the bits that make sense regardless of whether the rest
> > of this is merged).
> 
> At this size a git tree to pull would also be nice..

That was literally the next paragraph ...

It's now based on linus' master (6f630784cc0d), and you can get it from
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache
if you'd rather see it there (this branch is force-pushed frequently)

Or are you saying you'd rather see the git URL than the link to gitweb?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v6 04/51] mm: Move PageDoubleMap bit
  2020-06-10 20:12 ` [PATCH v6 04/51] mm: Move PageDoubleMap bit Matthew Wilcox
@ 2020-06-11 15:03   ` Zi Yan
  0 siblings, 0 replies; 60+ messages in thread
From: Zi Yan @ 2020-06-11 15:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 654 bytes --]

On 10 Jun 2020, at 16:12, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> PG_private_2 is defined as being PF_ANY (applicable to tail pages
> as well as regular & head pages).  That means that the first tail
> page of a double-map page will appear to have Private2 set.  Use the
> Workingset bit instead which is defined as PF_HEAD so any attempt to
> access the Workingset bit on a tail page will redirect to the head page's
> Workingset bit.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---

Makes sense to me.

Reviewed-by: Zi Yan <ziy@nvidia.com>

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v6 05/51] mm: Simplify PageDoubleMap with PF_SECOND policy
  2020-06-10 20:12 ` [PATCH v6 05/51] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
@ 2020-06-11 15:14   ` Zi Yan
  0 siblings, 0 replies; 60+ messages in thread
From: Zi Yan @ 2020-06-11 15:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]

On 10 Jun 2020, at 16:12, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> Introduce the new page policy of PF_SECOND which lets us use the
> normal pageflags generation machinery to create the various DoubleMap
> manipulation functions.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/page-flags.h | 40 ++++++++++----------------------------
>  1 file changed, 10 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index de6e0696f55c..979460df4768 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -232,6 +232,9 @@ static inline void page_init_poison(struct page *page, size_t size)
>   *
>   * PF_NO_COMPOUND:
>   *     the page flag is not relevant for compound pages.
> + *
> + * PF_SECOND:
> + *     the page flag is stored in the first tail page.
>   */

Would PF_FIRST_TAIL or PF_SECOND_IN_COMPOUND be more informative?
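
For readers weighing the name, a toy userspace model of the three policies
discussed in these two patches may help.  It is only a sketch: struct
model_page, head_offset and the MODEL_PF_* names are invented for
illustration and are not the macros in include/linux/page-flags.h.  The
model assumes a PF_SECOND flag is only ever accessed through the head page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page inside one compound page. */
struct model_page {
	unsigned long flags;
	unsigned int head_offset;	/* distance back to the head; 0 on the head */
};

enum model_policy { MODEL_PF_ANY, MODEL_PF_HEAD, MODEL_PF_SECOND };

/* Pick the struct page whose flags word a given policy reads or writes. */
struct model_page *model_flag_page(struct model_page *page,
				   enum model_policy policy)
{
	struct model_page *head = page - page->head_offset;

	switch (policy) {
	case MODEL_PF_HEAD:
		return head;		/* tail accesses redirect to the head */
	case MODEL_PF_SECOND:
		assert(page == head);	/* model assumption: callers pass the head */
		return head + 1;	/* flag lives in the first tail page */
	default:
		return page;		/* PF_ANY: use whichever page was passed */
	}
}

void model_set_flag(struct model_page *page, unsigned int bit,
		    enum model_policy policy)
{
	model_flag_page(page, policy)->flags |= 1UL << bit;
}

bool model_test_flag(struct model_page *page, unsigned int bit,
		     enum model_policy policy)
{
	return (model_flag_page(page, policy)->flags >> bit) & 1UL;
}
```

Setting a bit through MODEL_PF_SECOND on the head and then testing the same
bit with MODEL_PF_ANY on the first tail page reports it as set, which is the
Private2 aliasing described in patch 04; testing it with MODEL_PF_HEAD on
that tail page reads the head's flags instead.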

—
Best Regards,
Yan Zi

* Re: [PATCH v6 20/51] block: Add bio_for_each_thp_segment_all
  2020-06-10 20:13 ` [PATCH v6 20/51] block: Add bio_for_each_thp_segment_all Matthew Wilcox
@ 2020-06-11 18:20   ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-11 18:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, linux-kernel

On Wed, Jun 10, 2020 at 01:13:14PM -0700, Matthew Wilcox wrote:
> +static inline void bvec_thp_advance(const struct bio_vec *bvec,
> +				struct bvec_iter_all *iter_all)
> +{
> +	struct bio_vec *bv = &iter_all->bv;
> +	unsigned int page_size = thp_size(bvec->bv_page);
> +
> +	if (iter_all->done) {
> +		bv->bv_page += thp_nr_pages(bv->bv_page);
> +		bv->bv_offset = 0;
> +	} else {
> +		BUG_ON(bvec->bv_offset >= page_size);
> +		bv->bv_page = bvec->bv_page;
> +		bv->bv_offset = bvec->bv_offset & (page_size - 1);
> +	}
> +	bv->bv_len = min(page_size - bv->bv_offset,
> +			 bvec->bv_len - iter_all->done);
> +	iter_all->done += bv->bv_len;
> +
> +	if (iter_all->done == bvec->bv_len) {
> +		iter_all->idx++;
> +		iter_all->done = 0;
> +	}
> +}

If, for example, we have an order-2 page followed by two order-0 pages
(thanks, generic/127!) in the bvec, we'll end up skipping the third
page because we calculate the size based on bvec->bv_page instead of
bv->bv_page.

+++ b/include/linux/bvec.h
@@ -166,15 +166,19 @@ static inline void bvec_thp_advance(const struct bio_vec *bvec,
                                struct bvec_iter_all *iter_all)
 {
        struct bio_vec *bv = &iter_all->bv;
-       unsigned int page_size = thp_size(bvec->bv_page);
+       unsigned int page_size;
 
        if (iter_all->done) {
                bv->bv_page += thp_nr_pages(bv->bv_page);
+               page_size = thp_size(bv->bv_page);
                bv->bv_offset = 0;
        } else {
-               BUG_ON(bvec->bv_offset >= page_size);
-               bv->bv_page = bvec->bv_page;
-               bv->bv_offset = bvec->bv_offset & (page_size - 1);
+               bv->bv_page = thp_head(bvec->bv_page +
+                               (bvec->bv_offset >> PAGE_SHIFT));
+               page_size = thp_size(bv->bv_page);
+               bv->bv_offset = bvec->bv_offset -
+                               (bv->bv_page - bvec->bv_page) * PAGE_SIZE;
+               BUG_ON(bv->bv_offset >= page_size);
        }
        bv->bv_len = min(page_size - bv->bv_offset,
                         bvec->bv_len - iter_all->done);

The previous code also wasn't handling the case fixed in 6bedf00e55e5
where a split bio might end up splitting a bvec.  That BUG_ON can probably
come out after a few months of testing.
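
The order-2-plus-two-order-0 case can be reproduced in a small userspace
model.  This is a sketch under stated assumptions, not the kernel code:
pages are array indices rather than struct page pointers, and every model_*
name is invented here.  The "fixed" argument selects between taking the size
from the bvec's first page (the posted patch) and from the page the iterator
currently points at (the fix above).

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1u << MODEL_PAGE_SHIFT)

/* Hypothetical stand-in for struct page: head index and order of the head. */
struct model_page {
	unsigned int head;
	unsigned int order;	/* only consulted on the head page */
};

/* Stand-in for one bio_vec: first page index, byte offset, byte length. */
struct model_bvec {
	unsigned int page;
	unsigned int offset;
	unsigned int len;
};

/* Stand-in for bvec_iter_all: current segment plus bytes consumed so far. */
struct model_iter {
	unsigned int page;
	unsigned int offset;
	unsigned int len;
	unsigned int done;
};

static unsigned int model_thp_nr_pages(const struct model_page *pages,
				       unsigned int idx)
{
	return 1u << pages[pages[idx].head].order;
}

static unsigned int model_thp_size(const struct model_page *pages,
				   unsigned int idx)
{
	return MODEL_PAGE_SIZE * model_thp_nr_pages(pages, idx);
}

/* Produce the next segment; returns 1 once the whole bvec is consumed. */
int model_advance(const struct model_page *pages,
		  const struct model_bvec *bvec,
		  struct model_iter *it, int fixed)
{
	unsigned int page_size, left;

	if (it->done) {
		it->page += model_thp_nr_pages(pages, it->page);
		it->offset = 0;
		page_size = model_thp_size(pages, fixed ? it->page : bvec->page);
	} else if (fixed) {
		it->page = pages[bvec->page +
				 (bvec->offset >> MODEL_PAGE_SHIFT)].head;
		page_size = model_thp_size(pages, it->page);
		it->offset = (unsigned int)((long)bvec->offset -
			((long)it->page - (long)bvec->page) *
			(long)MODEL_PAGE_SIZE);
	} else {
		page_size = model_thp_size(pages, bvec->page);
		it->page = bvec->page;
		it->offset = bvec->offset & (page_size - 1);
	}
	assert(it->offset < page_size);	/* mirrors the BUG_ON */

	left = bvec->len - it->done;
	it->len = page_size - it->offset < left ? page_size - it->offset : left;
	it->done += it->len;
	return it->done == bvec->len;
}

/* Count the segments the iterator yields for one bvec. */
unsigned int model_count_segments(const struct model_page *pages,
				  const struct model_bvec *bvec, int fixed)
{
	struct model_iter it = { 0, 0, 0, 0 };
	unsigned int n = 1;

	while (!model_advance(pages, bvec, &it, fixed))
		n++;
	return n;
}
```

For a bvec covering one order-2 page followed by two order-0 pages (24576
bytes with 4kB base pages), the model yields two segments with fixed == 0,
the second one 8192 bytes long and swallowing the last page, and three
segments with fixed != 0.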

* Re: [PATCH v6 21/51] block: Support THPs in page_is_mergeable
  2020-06-10 20:13 ` [PATCH v6 21/51] block: Support THPs in page_is_mergeable Matthew Wilcox
@ 2020-06-12 16:17   ` Matthew Wilcox
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-12 16:17 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, linux-kernel

On Wed, Jun 10, 2020 at 01:13:15PM -0700, Matthew Wilcox wrote:
> page_is_mergeable() would incorrectly claim that two IOs were on different
> pages because they were on different base pages rather than on the
> same THP.  This led to a reference counting bug in iomap.  Simplify the
> 'same_page' test by just comparing whether we have the same struct page
> instead of doing arithmetic on the physical addresses.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  block/bio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 5235da6434aa..cd677cde853d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -747,7 +747,7 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
>  	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
>  		return false;
>  
> -	*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
> +	*same_page = bv->bv_page == page;
>  	if (!*same_page && pfn_to_page(PFN_DOWN(vec_end_addr)) + 1 != page)
>  		return false;
>  	return true;

No, this is also wrong.  If you put two order-2 pages into the same
bvec, and then add the contents of each subpage one at a time, page 0
will be !same, pages 1-3 will be same, but then pages 4-7 will all be
!same instead of page 4 being !same and pages 5-7 being same.  And the
reference count will be wrong on the second THP.
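
The miscount can be shown with a toy model.  It is a sketch only: it assumes
the caller passes the THP head page for every subpage it adds, that each
"!same" result takes one reference on that THP, and that a THP-aware test
would compare against the THP holding the last byte already in the bvec.
All names are invented for the illustration.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_THP_PAGES 4u	/* order-2 THP */
#define MODEL_NR_THPS   2u

/* Head page index of the THP that contains base page idx. */
static unsigned int model_head(unsigned int idx)
{
	return idx - (idx % MODEL_THP_PAGES);
}

/*
 * Add the eight subpages of two adjacent order-2 THPs to one bvec, one
 * subpage at a time, and count how often each THP is reported "!same".
 * thp_aware == 0 models "bv->bv_page == page" from the patch.
 */
void model_count_not_same(int thp_aware, unsigned int bumps[MODEL_NR_THPS])
{
	unsigned int bv_page = 0, bv_len = 0, i;

	bumps[0] = bumps[1] = 0;
	for (i = 0; i < MODEL_THP_PAGES * MODEL_NR_THPS; i++) {
		unsigned int head = model_head(i);
		int same;

		if (bv_len == 0) {
			same = 0;	/* first add starts the bvec */
			bv_page = head;
		} else if (thp_aware) {
			unsigned int last = bv_page +
					    (bv_len - 1) / MODEL_PAGE_SIZE;

			same = model_head(last) == head;
		} else {
			same = bv_page == head;
		}
		if (!same) {
			assert(head / MODEL_THP_PAGES < MODEL_NR_THPS);
			bumps[head / MODEL_THP_PAGES]++;
		}
		bv_len += MODEL_PAGE_SIZE;
	}
}
```

With thp_aware == 0 the counts come out as 1 for the first THP and 4 for the
second; with thp_aware != 0 both are 1.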

But now I'm thinking about the whole interaction with the block
layer, and it's all a bit complicated.  Changing the definition
of page_is_mergeable() to treat compound pages differently without
also changing bio_next_segment() seems like it'll cause problems with
refcounting if any current users can submit (parts of) a compound page.
And changing how bio_next_segment() works requires auditing all the users,
and I don't want to do that.

We could use a bio flag to indicate whether this is a THP-bearing BIO
and how it should decide whether two pages are actually part of the same
page, but that seems like a really bad bit of added complexity.

We could also pass a flag to __bio_try_merge_page() from the iomap code,
but again added complexity.  We could also add a __bio_try_merge_thp()
that would only be called from iomap for now.  That would call a new
thp_is_mergable() which would use the THP definition of what a "same
page" is.  I think I hate this idea least of all the ones named so far.

My preferred solution is to change the definition of iop->write_count
(and iop->read_count).  Instead of being a count of the number of segments
submitted, make it a count of the number of bytes submitted.  Like this:

diff -u b/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
--- b/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -174,9 +174,9 @@
 }
 
 static void
-iomap_read_finish(struct iomap_page *iop, struct page *page)
+iomap_read_finish(struct iomap_page *iop, struct page *page, unsigned int len)
 {
-	if (!iop || atomic_dec_and_test(&iop->read_count))
+	if (!iop || atomic_sub_and_test(len, &iop->read_count))
 		unlock_page(page);
 }
 
@@ -193,7 +193,7 @@
 		iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
 	}
 
-	iomap_read_finish(iop, page);
+	iomap_read_finish(iop, page, bvec->bv_len);
 }
 
 static void
@@ -294,8 +294,8 @@
 
 	if (is_contig &&
 	    __bio_try_merge_page(ctx->bio, page, plen, poff, &same_page)) {
-		if (!same_page && iop)
-			atomic_inc(&iop->read_count);
+		if (iop)
+			atomic_add(plen, &iop->read_count);
 		goto done;
 	}
 
@@ -305,7 +305,7 @@
 	 * that we don't prematurely unlock the page.
 	 */
 	if (iop)
-		atomic_inc(&iop->read_count);
+		atomic_add(plen, &iop->read_count);
 
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
@@ -1090,7 +1090,7 @@
 
 static void
 iomap_finish_page_writeback(struct inode *inode, struct page *page,
-		int error)
+		int error, unsigned int len)
 {
 	struct iomap_page *iop = iomap_page_create(inode, page);
 
@@ -1101,7 +1101,7 @@
 
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
 
-	if (!iop || atomic_dec_and_test(&iop->write_count))
+	if (!iop || atomic_sub_and_test(len, &iop->write_count))
 		end_page_writeback(page);
 }
 
@@ -1135,7 +1135,8 @@
 
 		/* walk each page on bio, ending page IO on them */
 		bio_for_each_thp_segment_all(bv, bio, iter_all)
-			iomap_finish_page_writeback(inode, bv->bv_page, error);
+			iomap_finish_page_writeback(inode, bv->bv_page, error,
+					bv->bv_len);
 		bio_put(bio);
 	}
 	/* The ioend has been freed by bio_put() */
@@ -1351,8 +1352,8 @@
 
 	merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
 			&same_page);
-	if (iop && !same_page)
-		atomic_inc(&iop->write_count);
+	if (iop)
+		atomic_add(len, &iop->write_count);
 
 	if (!merged) {
 		if (bio_full(wpc->ioend->io_bio, len)) {
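
The byte-counting scheme in the diff above can be tried out in userspace
with C11 atomics.  This is a minimal sketch, not the iomap code: struct
model_iop and the model_* helpers are invented names, and
atomic_fetch_sub() returning the old value stands in for the kernel's
atomic_sub_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the read side of struct iomap_page. */
struct model_iop {
	atomic_uint read_bytes;	/* bytes submitted but not yet completed */
};

void model_iop_init(struct model_iop *iop)
{
	atomic_init(&iop->read_bytes, 0);
}

/* Submission: account for the bytes whether or not the bvec was merged. */
void model_read_submit(struct model_iop *iop, unsigned int len)
{
	atomic_fetch_add(&iop->read_bytes, len);
}

/*
 * Completion: returns true when this call brought the count to zero,
 * i.e. when the caller should unlock the page.
 */
bool model_read_finish(struct model_iop *iop, unsigned int len)
{
	unsigned int old = atomic_fetch_sub(&iop->read_bytes, len);

	assert(old >= len);	/* completing more than was submitted is a bug */
	return old == len;
}
```

Because the counter is in bytes, it no longer matters how submissions were
split or merged into segments: three submissions of 4096, 4096 and 8192
bytes can be completed as 4096 followed by 12288, and only the second
completion reports that the page is finished.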

* Re: [RFC v6 00/51] Large pages in the page cache
  2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
                   ` (51 preceding siblings ...)
  2020-06-11  6:59 ` [RFC v6 00/51] Large pages in the page cache Christoph Hellwig
@ 2020-06-14 16:26 ` Matthew Wilcox
  52 siblings, 0 replies; 60+ messages in thread
From: Matthew Wilcox @ 2020-06-14 16:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, linux-kernel, Hugh Dickins

On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> Another fortnight, another dump of my current large pages work.

The generic/127 test has pointed out to me that range writeback is
broken by this patchset.  Here's how (may not be exactly what's going on,
but it's close):

page cache allocates an order-2 page covering indices 40-43.
bytes are written, page is dirtied
test then calls fallocate(FALLOC_FL_COLLAPSE_RANGE) for a range which
starts in page 41.
XFS calls filemap_write_and_wait_range() which calls
__filemap_fdatawrite_range() which calls
do_writepages() which calls
iomap_writepages() which calls
write_cache_pages() which calls
tag_pages_for_writeback() which calls
xas_for_each_marked() starting at page 41.  Which doesn't find page
  41 because when we dirtied pages 40-43, we only marked index 40 as
  being dirty.

Annoyingly, the XArray actually handles this just fine ... if we were
using multi-order entries, we'd find it.  But we're still storing 2^N
entries for an order N page.
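
A toy model of the failed lookup, for illustration only.  It uses a flat
array of mark bits rather than the XArray, and the multi-order case is
approximated, as an assumption of this sketch, by rounding the start index
down to the entry boundary.  None of these helpers exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SLOTS 64u

/* First marked index in [start, end], or -1 if there is none. */
int model_find_marked(const bool marks[MODEL_SLOTS],
		      unsigned int start, unsigned int end)
{
	unsigned int i;

	assert(start <= end && end < MODEL_SLOTS);
	for (i = start; i <= end; i++)
		if (marks[i])
			return (int)i;
	return -1;
}

/* Dirty a 2^order page at index head: mark only the head, or every slot. */
void model_mark_page(bool marks[MODEL_SLOTS], unsigned int head,
		     unsigned int order, bool all_slots)
{
	unsigned int i, n = all_slots ? 1u << order : 1u;

	assert(head + (1u << order) <= MODEL_SLOTS);
	for (i = 0; i < n; i++)
		marks[head + i] = true;
}

/* Crude multi-order model: the search starts at the entry's first index. */
int model_find_marked_multi(const bool marks[MODEL_SLOTS], unsigned int start,
			    unsigned int end, unsigned int order)
{
	return model_find_marked(marks, start & ~((1u << order) - 1), end);
}
```

With only index 40 marked, a search starting at 41 returns -1, which is the
missed writeback described above.  Marking all four slots, or starting the
search at the entry boundary, finds the page.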

I can see two ways to fix this.  One is to bite the bullet and do the
conversion of the page cache to use multi-order entries.  The second
is to set and clear the marks on all entries.  I'm concerned about the
performance of the latter solution.  Not so bad for order-2 pages, but for
an order-9 page we have 520 bits to set, spread over 9 non-consecutive
cachelines.  Also, I'm unenthusiastic about writing code that I want to
throw away as quickly as possible.
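
One way to arrive at those figures, as a back-of-the-envelope sketch: it
assumes 64 slots per XArray node (the default chunk size), one mark bit per
slot, a naturally aligned page, and one extra bit per leaf node in the
shared parent.  The function names are invented for the illustration.

```c
#include <assert.h>

#define MODEL_SLOTS_PER_NODE 64u	/* assumption: default XArray chunk size */

/* Leaf nodes spanned by a naturally aligned 2^order page. */
unsigned int model_leaf_nodes(unsigned int order)
{
	assert(order <= 12);	/* keep the model to a single parent node */
	return ((1u << order) + MODEL_SLOTS_PER_NODE - 1) / MODEL_SLOTS_PER_NODE;
}

/* One mark bit per slot, plus one bit in the parent for each leaf node. */
unsigned int model_mark_bits(unsigned int order)
{
	return (1u << order) + model_leaf_nodes(order);
}

/* Nodes whose mark bitmap is written: every leaf plus the parent. */
unsigned int model_nodes_touched(unsigned int order)
{
	return model_leaf_nodes(order) + 1;
}
```

For order 9 this gives 512 + 8 = 520 bits across 8 leaf nodes and 1 parent,
matching the 9 cachelines quoted above if each node's mark bitmap sits in
its own cacheline.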

So unless somebody has a really good alternative idea, I'm going to
convert the page cache over to multi-order entries.  This will have
several positive effects:

 - Get DAX and regular page cache using the xarray in a more similar way
 - Saves about 4.5kB of memory for every 2MB page in tmpfs/shmem
 - Prep work for converting hugetlbfs to use the page cache the same
   way as tmpfs
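
The 4.5kB figure is consistent with a multi-order entry removing the leaf
nodes that a 2MB page otherwise needs.  A sketch of the arithmetic, assuming
4kB base pages, 64 slots per node and a 576-byte struct xa_node (its usual
size on 64-bit builds; treat that number as an assumption):

```c
#include <assert.h>

#define MODEL_NODE_SHIFT    6u		/* assumption: 64 slots per node */
#define MODEL_XA_NODE_BYTES 576u	/* assumption: sizeof(struct xa_node) */

/* Bytes of leaf nodes no longer needed once one entry covers 2^order slots. */
unsigned int model_bytes_saved(unsigned int order)
{
	assert(order >= MODEL_NODE_SHIFT && order <= 12);
	return (1u << (order - MODEL_NODE_SHIFT)) * MODEL_XA_NODE_BYTES;
}
```

For order 9 (2MB with 4kB pages) that is 8 nodes of 576 bytes, or 4608
bytes, which is 4.5kB.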

* Re: [RFC v6 00/51] Large pages in the page cache
  2020-06-11 11:24   ` Matthew Wilcox
@ 2020-06-15 13:32     ` Christoph Hellwig
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2020-06-15 13:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christoph Hellwig, linux-fsdevel, linux-mm, linux-kernel

On Thu, Jun 11, 2020 at 04:24:12AM -0700, Matthew Wilcox wrote:
> On Wed, Jun 10, 2020 at 11:59:54PM -0700, Christoph Hellwig wrote:
> > On Wed, Jun 10, 2020 at 01:12:54PM -0700, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > Another fortnight, another dump of my current large pages work.
> > > I've squished a lot of bugs this time.  xfstests is much happier now,
> > > running for 1631 seconds and getting as far as generic/086.  This patchset
> > > is getting a little big, so I'm going to try to get some bits of it
> > > upstream soon (the bits that make sense regardless of whether the rest
> > > of this is merged).
> > 
> > At this size a git tree to pull would also be nice..
> 
> That was literally the next paragraph ...

Oops.  Next time with more coffee..

Thread overview: 60+ messages
2020-06-10 20:12 [RFC v6 00/51] Large pages in the page cache Matthew Wilcox
2020-06-10 20:12 ` [PATCH v6 01/51] mm: Print head flags in dump_page Matthew Wilcox
2020-06-10 20:12 ` [PATCH v6 02/51] mm: Print the inode number " Matthew Wilcox
2020-06-10 20:12 ` [PATCH v6 03/51] mm: Print hashed address of struct page Matthew Wilcox
2020-06-10 20:12 ` [PATCH v6 04/51] mm: Move PageDoubleMap bit Matthew Wilcox
2020-06-11 15:03   ` Zi Yan
2020-06-10 20:12 ` [PATCH v6 05/51] mm: Simplify PageDoubleMap with PF_SECOND policy Matthew Wilcox
2020-06-11 15:14   ` Zi Yan
2020-06-10 20:13 ` [PATCH v6 06/51] mm: Store compound_nr as well as compound_order Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 07/51] mm: Move page-flags include to top of file Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 08/51] mm: Add thp_order Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 09/51] mm: Add thp_size Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 10/51] mm: Replace hpage_nr_pages with thp_nr_pages Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 11/51] mm: Add thp_head Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 12/51] mm: Introduce offset_in_thp Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 13/51] mm: Support arbitrary THP sizes Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 14/51] fs: Add a filesystem flag for THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 15/51] fs: Do not update nr_thps for mappings which support THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 16/51] fs: Introduce i_blocks_per_page Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 17/51] fs: Make page_mkwrite_check_truncate thp-aware Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 18/51] mm: Support THPs in zero_user_segments Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 19/51] mm: Zero the head page, not the tail page Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 20/51] block: Add bio_for_each_thp_segment_all Matthew Wilcox
2020-06-11 18:20   ` Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 21/51] block: Support THPs in page_is_mergeable Matthew Wilcox
2020-06-12 16:17   ` Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 22/51] iomap: Support arbitrarily many blocks per page Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 23/51] iomap: Support THPs in iomap_adjust_read_range Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 24/51] iomap: Support THPs in invalidatepage Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 25/51] iomap: Support THPs in read paths Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 26/51] iomap: Convert iomap_write_end types Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 27/51] iomap: Change calling convention for zeroing Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 28/51] iomap: Change iomap_write_begin calling convention Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 29/51] iomap: Support THPs in write paths Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 30/51] iomap: Inline data shouldn't see THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 31/51] iomap: Handle tail pages in iomap_page_mkwrite Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 32/51] xfs: Support THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 33/51] mm: Make prep_transhuge_page return its argument Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 34/51] mm: Add __page_cache_alloc_order Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 35/51] mm: Allow THPs to be added to the page cache Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 36/51] mm: Allow THPs to be removed from " Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 37/51] mm: Remove page fault assumption of compound page size Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 38/51] mm: Fix total_mapcount assumption of " Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 39/51] mm: Remove assumptions of THP size Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 40/51] mm: Avoid splitting THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 41/51] mm: Fix truncation for pages of arbitrary size Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 42/51] mm: Handle truncates that split THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 43/51] mm: Support storing shadow entries for THPs Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 44/51] mm: Support retrieving tail pages from the page cache Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 45/51] mm: Support tail pages in wait_for_stable_page Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 46/51] mm: Add DEFINE_READAHEAD Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 47/51] mm: Make page_cache_readahead_unbounded take a readahead_control Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 48/51] mm: Make __do_page_cache_readahead " Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 49/51] mm: Allow PageReadahead to be set on head pages Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 50/51] mm: Add THP readahead Matthew Wilcox
2020-06-10 20:13 ` [PATCH v6 51/51] mm: Align THP mappings for non-DAX Matthew Wilcox
2020-06-11  6:59 ` [RFC v6 00/51] Large pages in the page cache Christoph Hellwig
2020-06-11 11:24   ` Matthew Wilcox
2020-06-15 13:32     ` Christoph Hellwig
2020-06-14 16:26 ` Matthew Wilcox
