LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers
@ 2018-07-02  0:56 john.hubbard
  2018-07-02  0:56 ` [PATCH v2 1/6] mm: get_user_pages: consolidate error handling john.hubbard
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

This fixes a few problems that came up when using devices (NICs, GPUs,
for example) that want to have direct access to a chunk of system (CPU)
memory, so that they can DMA to/from that memory. Problems [1] come up
if that memory is backed by persistence storage; for example, an ext4
file system. I've been working on several customer bugs that are hitting
this, and this patchset fixes those bugs.

The bugs happen via:

-- get_user_pages() on some ext4-backed pages
-- device does DMA for a while to/from those pages

    -- Somewhere in here, some of the pages get disconnected from the
       file system, via try_to_unmap() and eventually drop_buffers()

-- device is all done, device driver calls set_page_dirty_locked, then
   put_page()

And then at some point, we see a this BUG():

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

How to fix this:

If a page is pinned by any of the get_user_page("gup", here) variants, then
there is no need for that page to be on an LRU. So, this patchset removes
such pages from their LRU, thus leaving the page->lru fields *mostly*
available for tracking gup pages. (The lowest bit of page->lru.next is used
as PageTail, and these flags have to be checked when we don't know if it
really is a tail page or not, so avoid that bit.)

After that, the page is reference-counted via page->dma_pinned_count, and
flagged via page->dma_pinned_flags. The PageDmaPinned flag is cleared when
the reference count hits zero, and the reference count is only used when
the flag is set.

All of the above provides a reliable PageDmaPinned flag, which is then used
to decide when to abort or wait for operations such as:

    try_to_unmap()
    page_mkclean()

In order to handle page_mkclean(), new information had to be plumbed down
from the filesystems, so that page_mkclean can decide whether to skip
dma-pinned pages, or to wait for them.

Thanks to Matthew Wilcox for suggesting re-using page->lru fields for a
new refcount and flag, and to Jan Kara for explaining the rest of the
design details (how to deal with page_mkclean() and try_to_unmap(),
especially). Also thanks to Dan Williams for design advice and DAX,
long-term pinning, and page flag thoughts.

References:

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Changes since v1:

 -- Use page->lru and full reference counting, instead of a single page flag.
 -- Proper handling of page_mkclean().

John Hubbard (6):
  mm: get_user_pages: consolidate error handling
  mm: introduce page->dma_pinned_flags, _count
  mm: introduce zone_gup_lock, for dma-pinned pages
  mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  mm: track gup pages with page->dma_pinned_* fields
  mm: page_mkclean, ttu: handle pinned pages

 drivers/video/fbdev/core/fb_defio.c |  3 +-
 fs/9p/vfs_addr.c                    |  2 +-
 fs/afs/write.c                      |  6 +-
 fs/btrfs/extent_io.c                | 14 ++---
 fs/btrfs/file.c                     |  2 +-
 fs/btrfs/free-space-cache.c         |  2 +-
 fs/btrfs/ioctl.c                    |  2 +-
 fs/ceph/addr.c                      |  4 +-
 fs/cifs/cifssmb.c                   |  3 +-
 fs/cifs/file.c                      |  5 +-
 fs/ext4/inode.c                     |  5 +-
 fs/f2fs/checkpoint.c                |  4 +-
 fs/f2fs/data.c                      |  2 +-
 fs/f2fs/dir.c                       |  2 +-
 fs/f2fs/gc.c                        |  4 +-
 fs/f2fs/inline.c                    |  2 +-
 fs/f2fs/node.c                      | 10 ++--
 fs/f2fs/segment.c                   |  3 +-
 fs/fuse/file.c                      |  2 +-
 fs/gfs2/aops.c                      |  2 +-
 fs/nfs/write.c                      |  2 +-
 fs/nilfs2/page.c                    |  2 +-
 fs/nilfs2/segment.c                 | 10 ++--
 fs/ubifs/file.c                     |  2 +-
 fs/xfs/xfs_aops.c                   |  2 +-
 include/linux/mm.h                  | 22 ++++++-
 include/linux/mm_types.h            | 22 +++++--
 include/linux/mmzone.h              |  7 +++
 include/linux/page-flags.h          | 50 ++++++++++++++++
 include/linux/rmap.h                |  4 +-
 mm/gup.c                            | 93 +++++++++++++++++++++++------
 mm/memcontrol.c                     |  7 +++
 mm/memory-failure.c                 |  3 +-
 mm/migrate.c                        |  2 +-
 mm/page-writeback.c                 | 14 +++--
 mm/page_alloc.c                     |  1 +
 mm/rmap.c                           | 71 ++++++++++++++++++++--
 mm/swap.c                           | 48 +++++++++++++++
 mm/truncate.c                       |  3 +-
 mm/vmscan.c                         |  2 +-
 40 files changed, 361 insertions(+), 85 deletions(-)

-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/6] mm: get_user_pages: consolidate error handling
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
@ 2018-07-02  0:56 ` john.hubbard
  2018-07-02 10:17   ` Jan Kara
  2018-07-02  0:56 ` [PATCH v2 2/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial a side effect.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b70d7ba7cc13..73f0b3316fa7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
 	long i = 0;
+	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (!vma || start >= vma->vm_end) {
 			vma = find_extend_vma(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
-				int ret;
-				ret = get_gate_page(mm, start & PAGE_MASK,
+				err = get_gate_page(mm, start & PAGE_MASK,
 						gup_flags, &vma,
 						pages ? &pages[i] : NULL);
-				if (ret)
-					return i ? : ret;
+				if (err)
+					goto out;
 				page_mask = 0;
 				goto next_page;
 			}
 
-			if (!vma || check_vma_flags(vma, gup_flags))
-				return i ? : -EFAULT;
+			if (!vma || check_vma_flags(vma, gup_flags)) {
+				err = -EFAULT;
+				goto out;
+			}
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		 * If we have a pending SIGKILL, don't keep faulting pages and
 		 * potentially allocating memory.
 		 */
-		if (unlikely(fatal_signal_pending(current)))
-			return i ? i : -ERESTARTSYS;
+		if (unlikely(fatal_signal_pending(current))) {
+			err = -ERESTARTSYS;
+			goto out;
+		}
 		cond_resched();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
 		if (!page) {
-			int ret;
-			ret = faultin_page(tsk, vma, start, &foll_flags,
+			err = faultin_page(tsk, vma, start, &foll_flags,
 					nonblocking);
-			switch (ret) {
+			switch (err) {
 			case 0:
 				goto retry;
 			case -EFAULT:
 			case -ENOMEM:
 			case -EHWPOISON:
-				return i ? i : ret;
+				goto out;
 			case -EBUSY:
-				return i;
+				err = 0;
+				goto out;
 			case -ENOENT:
 				goto next_page;
 			}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			 */
 			goto next_page;
 		} else if (IS_ERR(page)) {
-			return i ? i : PTR_ERR(page);
+			err = PTR_ERR(page);
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
 	} while (nr_pages);
-	return i;
+
+out:
+	return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 2/6] mm: introduce page->dma_pinned_flags, _count
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
  2018-07-02  0:56 ` [PATCH v2 1/6] mm: get_user_pages: consolidate error handling john.hubbard
@ 2018-07-02  0:56 ` john.hubbard
  2018-07-02  0:56 ` [PATCH v2 3/6] mm: introduce zone_gup_lock, for dma-pinned pages john.hubbard
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Add two struct page fields that, combined, are unioned with
struct page->lru. There is no change in the size of
struct page. These new fields are for type safety and clarity.

Also add page flag accessors to test, set and clear the new
page->dma_pinned_flags field.

The page->dma_pinned_count field will be used in upcoming
patches

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm_types.h   | 22 ++++++++++++-----
 include/linux/page-flags.h | 50 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..0ecd29dcd642 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,12 +78,22 @@ struct page {
 	 */
 	union {
 		struct {	/* Page cache and anonymous pages */
-			/**
-			 * @lru: Pageout list, eg. active_list protected by
-			 * zone_lru_lock.  Sometimes used as a generic list
-			 * by the page owner.
-			 */
-			struct list_head lru;
+			union {
+				/**
+				 * @lru: Pageout list, eg. active_list protected
+				 * by zone_lru_lock.  Sometimes used as a
+				 * generic list by the page owner.
+				 */
+				struct list_head lru;
+				/* Used by get_user_pages*(). Pages may not be
+				 * on an LRU while these dma_pinned_* fields
+				 * are in use.
+				 */
+				struct {
+					unsigned long dma_pinned_flags;
+					atomic_t      dma_pinned_count;
+				};
+			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
 			pgoff_t index;		/* Our offset within mapping. */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 901943e4754b..b694a1a3bdf3 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -420,6 +420,56 @@ static __always_inline int __PageMovable(struct page *page)
 				PAGE_MAPPING_MOVABLE;
 }
 
+/*
+ * page->dma_pinned_flags is protected by the zone_gup_lock, plus the
+ * page->dma_pinned_count field as well.
+ *
+ * Because page->dma_pinned_flags is unioned with page->lru, any page that
+ * uses these flags must NOT be on an LRU. That's partly enforced by
+ * ClearPageDmaPinned, which gives the page back to LRU.
+ *
+ * Because PageDmaPinned also corresponds to PageTail (the lowest bit in
+ * the first union of struct page), and this flag is checked without knowing
+ * whether it is a tail page or a PageDmaPinned page, start the flags at
+ * bit 1 (0x2), rather than bit 0.
+ */
+#define PAGE_DMA_PINNED		0x2
+#define PAGE_DMA_PINNED_FLAGS	(PAGE_DMA_PINNED)
+
+/*
+ * Because these flags are read outside of a lock, ensure visibility between
+ * different threads, by using READ|WRITE_ONCE.
+ */
+static __always_inline int PageDmaPinnedFlags(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED_FLAGS) != 0;
+}
+
+static __always_inline int PageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	return (READ_ONCE(page->dma_pinned_flags) & PAGE_DMA_PINNED) != 0;
+}
+
+static __always_inline void SetPageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	WRITE_ONCE(page->dma_pinned_flags, PAGE_DMA_PINNED);
+}
+
+static __always_inline void ClearPageDmaPinned(struct page *page)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON_PAGE(!PageDmaPinnedFlags(page), page);
+
+	/* This does a WRITE_ONCE to the lru.next, which is also the
+	 * page->dma_pinned_flags field. So in addition to restoring page->lru,
+	 * this provides visibility to other threads.
+	 */
+	INIT_LIST_HEAD(&page->lru);
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 3/6] mm: introduce zone_gup_lock, for dma-pinned pages
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
  2018-07-02  0:56 ` [PATCH v2 1/6] mm: get_user_pages: consolidate error handling john.hubbard
  2018-07-02  0:56 ` [PATCH v2 2/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
@ 2018-07-02  0:56 ` john.hubbard
  2018-07-02  0:56 ` [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io() john.hubbard
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

The page->dma_pinned_flags and _count fields require
lock protection. A lock at approximately the granularity
of the zone_lru_lock is called for, but adding to the
locking contention of zone_lru_lock is undesirable,
because that is a pre-existing hot spot. Fortunately,
these new dma_pinned_* fields can use an independent
lock, so this patch creates an entirely new lock, right
next to the zone_lru_lock.

Why "zone_gup_lock"?

Most of the naming refers to "DMA-pinned pages", but
"zone DMA lock" has other meanings already, so this is
called zone_gup_lock instead. The "dma pinning" is a result
of get_user_pages (gup) being called, so the name still
helps explain its use.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mmzone.h | 7 +++++++
 mm/page_alloc.c        | 1 +
 2 files changed, 8 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2dc52a..5b4ceef82657 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -662,6 +662,8 @@ typedef struct pglist_data {
 
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
 
+	spinlock_t pinned_dma_lock;
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
@@ -740,6 +742,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
+static inline spinlock_t *zone_gup_lock(struct zone *zone)
+{
+	return &zone->zone_pgdat->pinned_dma_lock;
+}
+
 static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
 	return &pgdat->lruvec;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..9c493442b57c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6211,6 +6211,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	int nid = pgdat->node_id;
 
 	pgdat_resize_init(pgdat);
+	spin_lock_init(&pgdat->pinned_dma_lock);
 #ifdef CONFIG_NUMA_BALANCING
 	spin_lock_init(&pgdat->numabalancing_migrate_lock);
 	pgdat->numabalancing_migrate_nr_pages = 0;
-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
                   ` (2 preceding siblings ...)
  2018-07-02  0:56 ` [PATCH v2 3/6] mm: introduce zone_gup_lock, for dma-pinned pages john.hubbard
@ 2018-07-02  0:56 ` john.hubbard
  2018-07-02  2:11   ` kbuild test robot
  2018-07-02  2:47   ` kbuild test robot
  2018-07-02  0:56 ` [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Add a sync_mode parameter to clear_page_dirty_for_io(), to specify the
writeback sync mode, and also pass in the appropriate value
(WB_SYNC_NONE or WB_SYNC_ALL), from each filesystem location that calls
it. This will be used in subsequent patches, to allow page_mkclean() to
decide how to handle pinned pages.

How to decide which value to pass to clear_page_dirty_for_io?
Jan Kara's write-up explains that:

    What needs to happen in page_mkclean() depends on the caller. Most of
    the callers really need to be sure the page is write-protected once
    page_mkclean() returns. Those are:

      pagecache_isize_extended()
      fb_deferred_io_work()
      clear_page_dirty_for_io() if called for data-integrity
      writeback--which is currently known only in its caller
      (e.g. write_cache_pages()), where it can be determined as
      wbc->sync_mode == WB_SYNC_ALL. Getting this information into
      page_mkclean() will require some plumbing and
      clear_page_dirty_for_io() has some 50 callers but it's doable.

    clear_page_dirty_for_io() for cleaning writeback
    (wbc->sync_mode != WB_SYNC_ALL) can just skip pinned pages and we
    probably need to do that as otherwise memory cleaning would get
    stuck on pinned pages until RDMA drivers release its pins.

An enum (writeback_sync_modes) is used instead of a bool,
for the usual readability reasons, but it is declared as
an int, because mm.h neads it, and unless a new header file
is used, an actual enum would further complicate header
file dependencies.

The enum writeback_sync_modes was chosen because these changes are being
made to the filesystems, in order to pass down a hint to the memory
management system. Therefore, it's best to use naming that is
filesystem-centric. The name will be interpreted into more mm-specific
terms, in page_mkclean, in a subsequent patch.

CC: Jan Kara <jack@suse.cz>
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/9p/vfs_addr.c            |  2 +-
 fs/afs/write.c              |  6 +++---
 fs/btrfs/extent_io.c        | 14 +++++++-------
 fs/btrfs/file.c             |  2 +-
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/ioctl.c            |  2 +-
 fs/ceph/addr.c              |  4 ++--
 fs/cifs/cifssmb.c           |  3 ++-
 fs/cifs/file.c              |  5 +++--
 fs/ext4/inode.c             |  5 +++--
 fs/f2fs/checkpoint.c        |  4 ++--
 fs/f2fs/data.c              |  2 +-
 fs/f2fs/dir.c               |  2 +-
 fs/f2fs/gc.c                |  4 ++--
 fs/f2fs/inline.c            |  2 +-
 fs/f2fs/node.c              | 10 +++++-----
 fs/f2fs/segment.c           |  3 ++-
 fs/fuse/file.c              |  2 +-
 fs/gfs2/aops.c              |  2 +-
 fs/nfs/write.c              |  2 +-
 fs/nilfs2/page.c            |  2 +-
 fs/nilfs2/segment.c         | 10 ++++++----
 fs/ubifs/file.c             |  2 +-
 fs/xfs/xfs_aops.c           |  2 +-
 include/linux/mm.h          |  7 ++++++-
 mm/migrate.c                |  2 +-
 mm/page-writeback.c         | 11 ++++++++---
 mm/vmscan.c                 |  2 +-
 28 files changed, 66 insertions(+), 50 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index e1cbdfdb7c68..35572f150b31 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -221,7 +221,7 @@ static int v9fs_launder_page(struct page *page)
 	struct inode *inode = page->mapping->host;
 
 	v9fs_fscache_wait_on_page_write(inode, page);
-	if (clear_page_dirty_for_io(page)) {
+	if (clear_page_dirty_for_io(page, WB_SYNC_NONE)) {
 		retval = v9fs_vfs_writepage_locked(page);
 		if (retval)
 			return retval;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 8b39e6ebb40b..a0c7a364ffb7 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -472,7 +472,7 @@ static int afs_write_back_from_locked_page(struct address_space *mapping,
 			trace_afs_page_dirty(vnode, tracepoint_string("store+"),
 					     page->index, priv);
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(page, wbc->sync_mode))
 				BUG();
 			if (test_set_page_writeback(page))
 				BUG();
@@ -612,7 +612,7 @@ static int afs_writepages_region(struct address_space *mapping,
 			continue;
 		}
 
-		if (!clear_page_dirty_for_io(page))
+		if (!clear_page_dirty_for_io(page, wbc->sync_mode))
 			BUG();
 		ret = afs_write_back_from_locked_page(mapping, wbc, page, end);
 		put_page(page);
@@ -838,7 +838,7 @@ int afs_launder_page(struct page *page)
 	_enter("{%lx}", page->index);
 
 	priv = page_private(page);
-	if (clear_page_dirty_for_io(page)) {
+	if (clear_page_dirty_for_io(page, WB_SYNC_NONE)) {
 		f = 0;
 		t = PAGE_SIZE;
 		if (PagePrivate(page)) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e55843f536bc..78fafec3b607 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1364,7 +1364,7 @@ void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
 	while (index <= end_index) {
 		page = find_get_page(inode->i_mapping, index);
 		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(page, WB_SYNC_ALL);
 		put_page(page);
 		index++;
 	}
@@ -1707,7 +1707,7 @@ static int __process_pages_contig(struct address_space *mapping,
 				continue;
 			}
 			if (page_ops & PAGE_CLEAR_DIRTY)
-				clear_page_dirty_for_io(pages[i]);
+				clear_page_dirty_for_io(pages[i], WB_SYNC_ALL);
 			if (page_ops & PAGE_SET_WRITEBACK)
 				set_page_writeback(pages[i]);
 			if (page_ops & PAGE_SET_ERROR)
@@ -3740,7 +3740,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = eb->pages[i];
 
-		clear_page_dirty_for_io(p);
+		clear_page_dirty_for_io(p, wbc->sync_mode);
 		set_page_writeback(p);
 		ret = submit_extent_page(REQ_OP_WRITE | write_flags, tree, wbc,
 					 p, offset, PAGE_SIZE, 0, bdev,
@@ -3764,7 +3764,7 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	if (unlikely(ret)) {
 		for (; i < num_pages; i++) {
 			struct page *p = eb->pages[i];
-			clear_page_dirty_for_io(p);
+			clear_page_dirty_for_io(p, wbc->sync_mode);
 			unlock_page(p);
 		}
 	}
@@ -3984,7 +3984,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
 			}
 
 			if (PageWriteback(page) ||
-			    !clear_page_dirty_for_io(page)) {
+			    !clear_page_dirty_for_io(page, wbc->sync_mode)) {
 				unlock_page(page);
 				continue;
 			}
@@ -4089,7 +4089,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 
 	while (start <= end) {
 		page = find_get_page(mapping, start >> PAGE_SHIFT);
-		if (clear_page_dirty_for_io(page))
+		if (clear_page_dirty_for_io(page, wbc_writepages.sync_mode))
 			ret = __extent_writepage(page, &wbc_writepages, &epd);
 		else {
 			if (tree->ops && tree->ops->writepage_end_io_hook)
@@ -5170,7 +5170,7 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
 		lock_page(page);
 		WARN_ON(!PagePrivate(page));
 
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(page, WB_SYNC_ALL);
 		xa_lock_irq(&page->mapping->i_pages);
 		if (!PageDirty(page)) {
 			radix_tree_tag_clear(&page->mapping->i_pages,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 51e77d72068a..dace1375f366 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1514,7 +1514,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
 	}
 
 	for (i = 0; i < num_pages; i++) {
-		if (clear_page_dirty_for_io(pages[i]))
+		if (clear_page_dirty_for_io(pages[i]), WB_SYNC_NONE)
 			account_page_redirty(pages[i]);
 		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index d5f80cb300be..2014e9a2b659 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -387,7 +387,7 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, struct inode *inode
 	}
 
 	for (i = 0; i < io_ctl->num_pages; i++) {
-		clear_page_dirty_for_io(io_ctl->pages[i]);
+		clear_page_dirty_for_io(io_ctl->pages[i], WB_SYNC_ALL);
 		set_page_extent_mapped(io_ctl->pages[i]);
 	}
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 43ecbe620dea..e7291284b4c8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1337,7 +1337,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 			     page_start, page_end - 1, &cached_state);
 
 	for (i = 0; i < i_done; i++) {
-		clear_page_dirty_for_io(pages[i]);
+		clear_page_dirty_for_io(pages[i], WB_SYNC_ALL);
 		ClearPageChecked(pages[i]);
 		set_page_extent_mapped(pages[i]);
 		set_page_dirty(pages[i]);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 292b3d72d725..29948c0464c0 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -937,7 +937,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 				wait_on_page_writeback(page);
 			}
 
-			if (!clear_page_dirty_for_io(page)) {
+			if (!clear_page_dirty_for_io(page, wbc->sync_mode)) {
 				dout("%p !clear_page_dirty_for_io\n", page);
 				unlock_page(page);
 				continue;
@@ -1277,7 +1277,7 @@ static int ceph_update_writeable_page(struct file *file,
 		/* yay, writeable, do it now (without dropping page lock) */
 		dout(" page %p snapc %p not current, but oldest\n",
 		     page, snapc);
-		if (!clear_page_dirty_for_io(page))
+		if (!clear_page_dirty_for_io(page, WB_SYNC_ALL))
 			goto retry_locked;
 		r = writepage_nounlock(page, NULL);
 		if (r < 0)
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index d352da325de3..0f8ddc80a4db 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1999,7 +1999,8 @@ cifs_writev_requeue(struct cifs_writedata *wdata)
 		for (j = 0; j < nr_pages; j++) {
 			wdata2->pages[j] = wdata->pages[i + j];
 			lock_page(wdata2->pages[j]);
-			clear_page_dirty_for_io(wdata2->pages[j]);
+			clear_page_dirty_for_io(wdata2->pages[j],
+						wdata->sync_mode);
 		}
 
 		wdata2->sync_mode = wdata->sync_mode;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 8d41ca7bfcf1..488061185b95 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2019,7 +2019,8 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
 			wait_on_page_writeback(page);
 
 		if (PageWriteback(page) ||
-				!clear_page_dirty_for_io(page)) {
+				!clear_page_dirty_for_io(page,
+							 wbc->sync_mode)) {
 			unlock_page(page);
 			break;
 		}
@@ -4089,7 +4090,7 @@ static int cifs_launder_page(struct page *page)
 
 	cifs_dbg(FYI, "Launder page: %p\n", page);
 
-	if (clear_page_dirty_for_io(page))
+	if (clear_page_dirty_for_io(page, wbc.sync_mode))
 		rc = cifs_writepage_locked(page, &wbc);
 
 	cifs_fscache_invalidate_page(page, page->mapping->host);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ea07efbe016..78f77d31c8ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1741,7 +1741,8 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 			BUG_ON(PageWriteback(page));
 			if (invalidate) {
 				if (page_mapped(page))
-					clear_page_dirty_for_io(page);
+					clear_page_dirty_for_io(page,
+								WB_SYNC_ALL);
 				block_invalidatepage(page, 0, PAGE_SIZE);
 				ClearPageUptodate(page);
 			}
@@ -2187,7 +2188,7 @@ static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
 	int err;
 
 	BUG_ON(page->index != mpd->first_page);
-	clear_page_dirty_for_io(page);
+	clear_page_dirty_for_io(page, WB_SYNC_ALL);
 	/*
 	 * We have to be very careful here!  Nothing protects writeback path
 	 * against i_size changes and the page can be writeably mapped into
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 9f1c96caebda..e1dcc848207b 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -354,7 +354,7 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum page_type type,
 			f2fs_wait_on_page_writeback(page, META, true);
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(page, wbc.sync_mode))
 				goto continue_unlock;
 
 			if (__f2fs_write_meta_page(page, &wbc, io_type)) {
@@ -1197,7 +1197,7 @@ static void commit_checkpoint(struct f2fs_sb_info *sbi,
 
 	f2fs_wait_on_page_writeback(page, META, true);
 	f2fs_bug_on(sbi, PageWriteback(page));
-	if (unlikely(!clear_page_dirty_for_io(page)))
+	if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
 		f2fs_bug_on(sbi, 1);
 
 	/* writeout cp pack 2 page */
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 8f931d699287..a03eac98cfe3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2018,7 +2018,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(page), wbc->sync_mode)
 				goto continue_unlock;
 
 			ret = __write_data_page(page, &submitted, wbc, io_type);
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 7f955c4e86a4..258f9dc117f4 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -731,7 +731,7 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
 	if (bit_pos == NR_DENTRY_IN_BLOCK &&
 		!f2fs_truncate_hole(dir, page->index, page->index + 1)) {
 		f2fs_clear_radix_tree_dirty_tag(page);
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(page, WB_SYNC_ALL);
 		ClearPagePrivate(page);
 		ClearPageUptodate(page);
 		inode_dec_dirty_pages(dir);
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 9093be6e7a7d..a6eb7ecf6f0b 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -693,7 +693,7 @@ static void move_data_block(struct inode *inode, block_t bidx,
 
 	set_page_dirty(fio.encrypted_page);
 	f2fs_wait_on_page_writeback(fio.encrypted_page, DATA, true);
-	if (clear_page_dirty_for_io(fio.encrypted_page))
+	if (clear_page_dirty_for_io(fio.encrypted_page, fio.io_wbc->sync_mode))
 		dec_page_count(fio.sbi, F2FS_DIRTY_META);
 
 	set_page_writeback(fio.encrypted_page);
@@ -780,7 +780,7 @@ static void move_data_page(struct inode *inode, block_t bidx, int gc_type,
 retry:
 		set_page_dirty(page);
 		f2fs_wait_on_page_writeback(page, DATA, true);
-		if (clear_page_dirty_for_io(page)) {
+		if (clear_page_dirty_for_io(page, fio.io_wbc->sync_mode)) {
 			inode_dec_dirty_pages(inode);
 			f2fs_remove_dirty_inode(inode);
 		}
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index 043830be5662..97dbc721e985 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -136,7 +136,7 @@ int f2fs_convert_inline_page(struct dnode_of_data *dn, struct page *page)
 	set_page_dirty(page);
 
 	/* clear dirty state */
-	dirty = clear_page_dirty_for_io(page);
+	dirty = clear_page_dirty_for_io(page, fio.io_wbc->sync_mode);
 
 	/* write data page to try to make data consistent */
 	set_page_writeback(page);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 10643b11bd59..9401f70d3f9f 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -104,7 +104,7 @@ static void clear_node_page_dirty(struct page *page)
 {
 	if (PageDirty(page)) {
 		f2fs_clear_radix_tree_dirty_tag(page);
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(page, WB_SYNC_ALL);
 		dec_page_count(F2FS_P_SB(page), F2FS_DIRTY_NODES);
 	}
 	ClearPageUptodate(page);
@@ -1276,7 +1276,7 @@ static void flush_inline_data(struct f2fs_sb_info *sbi, nid_t ino)
 	if (!PageDirty(page))
 		goto page_out;
 
-	if (!clear_page_dirty_for_io(page))
+	if (!clear_page_dirty_for_io(page, WB_SYNC_ALL))
 		goto page_out;
 
 	ret = f2fs_write_inline_data(inode, page);
@@ -1444,7 +1444,7 @@ void f2fs_move_node_page(struct page *node_page, int gc_type)
 		f2fs_wait_on_page_writeback(node_page, NODE, true);
 
 		f2fs_bug_on(F2FS_P_SB(node_page), PageWriteback(node_page));
-		if (!clear_page_dirty_for_io(node_page))
+		if (!clear_page_dirty_for_io(node_page, wbc.sync_mode))
 			goto out_page;
 
 		if (__write_node_page(node_page, false, NULL,
@@ -1544,7 +1544,7 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 					set_page_dirty(page);
 			}
 
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(page, WB_SYNC_ALL))
 				goto continue_unlock;
 
 			ret = __write_node_page(page, atomic &&
@@ -1658,7 +1658,7 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
 			f2fs_wait_on_page_writeback(page, NODE, true);
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(page, wbc->sync_mode))
 				goto continue_unlock;
 
 			set_fsync_mark(page, 0);
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 9efce174c51a..947a16c1836f 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -382,7 +382,8 @@ static int __f2fs_commit_inmem_pages(struct inode *inode)
 
 			set_page_dirty(page);
 			f2fs_wait_on_page_writeback(page, DATA, true);
-			if (clear_page_dirty_for_io(page)) {
+			if (clear_page_dirty_for_io(page,
+						    fio.io_wbc->sync_mode)) {
 				inode_dec_dirty_pages(inode);
 				f2fs_remove_dirty_inode(inode);
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a201fb0ac64f..a7837d909894 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2015,7 +2015,7 @@ static int fuse_write_end(struct file *file, struct address_space *mapping,
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
-	if (clear_page_dirty_for_io(page)) {
+	if (clear_page_dirty_for_io(page, WB_SYNC_NONE)) {
 		struct inode *inode = page->mapping->host;
 		err = fuse_writepage_locked(page);
 		if (!err)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 35f5ee23566d..43d8baabaa13 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -304,7 +304,7 @@ static int gfs2_write_jdata_pagevec(struct address_space *mapping,
 		}
 
 		BUG_ON(PageWriteback(page));
-		if (!clear_page_dirty_for_io(page))
+		if (!clear_page_dirty_for_io(page, wbc->sync_mode))
 			goto continue_unlock;
 
 		trace_wbc_writepage(wbc, inode_to_bdi(inode));
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a057b4f45a46..0535ee5c1b88 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2037,7 +2037,7 @@ int nfs_wb_page(struct inode *inode, struct page *page)
 
 	for (;;) {
 		wait_on_page_writeback(page);
-		if (clear_page_dirty_for_io(page)) {
+		if (clear_page_dirty_for_io(page, wbc.sync_mode)) {
 			ret = nfs_writepage_locked(page, &wbc);
 			if (ret < 0)
 				goto out_error;
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 4cb850a6f1c2..f38805177248 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -480,7 +480,7 @@ int __nilfs_clear_page_dirty(struct page *page)
 					     page_index(page),
 					     PAGECACHE_TAG_DIRTY);
 			xa_unlock_irq(&mapping->i_pages);
-			return clear_page_dirty_for_io(page);
+			return clear_page_dirty_for_io(page, WB_SYNC_ALL);
 		}
 		xa_unlock_irq(&mapping->i_pages);
 		return 0;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 0953635e7d48..b237378cbc81 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1644,7 +1644,7 @@ static void nilfs_begin_page_io(struct page *page)
 		return;
 
 	lock_page(page);
-	clear_page_dirty_for_io(page);
+	clear_page_dirty_for_io(page, WB_SYNC_ALL);
 	set_page_writeback(page);
 	unlock_page(page);
 }
@@ -1662,7 +1662,8 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 			if (bh->b_page != bd_page) {
 				if (bd_page) {
 					lock_page(bd_page);
-					clear_page_dirty_for_io(bd_page);
+					clear_page_dirty_for_io(bd_page,
+								WB_SYNC_ALL);
 					set_page_writeback(bd_page);
 					unlock_page(bd_page);
 				}
@@ -1676,7 +1677,8 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 			if (bh == segbuf->sb_super_root) {
 				if (bh->b_page != bd_page) {
 					lock_page(bd_page);
-					clear_page_dirty_for_io(bd_page);
+					clear_page_dirty_for_io(bd_page,
+								WB_SYNC_ALL);
 					set_page_writeback(bd_page);
 					unlock_page(bd_page);
 					bd_page = bh->b_page;
@@ -1691,7 +1693,7 @@ static void nilfs_segctor_prepare_write(struct nilfs_sc_info *sci)
 	}
 	if (bd_page) {
 		lock_page(bd_page);
-		clear_page_dirty_for_io(bd_page);
+		clear_page_dirty_for_io(bd_page, WB_SYNC_ALL);
 		set_page_writeback(bd_page);
 		unlock_page(bd_page);
 	}
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index fd7eb6fe9090..80804e13a05f 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1169,7 +1169,7 @@ static int do_truncation(struct ubifs_info *c, struct inode *inode,
 				 */
 				ubifs_assert(PagePrivate(page));
 
-				clear_page_dirty_for_io(page);
+				clear_page_dirty_for_io(page, WB_SYNC_ALL);
 				if (UBIFS_BLOCKS_PER_PAGE_SHIFT)
 					offset = new_size &
 						 (PAGE_SIZE - 1);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 8eb3ba3d4d00..2d61e7c92287 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -496,7 +496,7 @@ xfs_start_page_writeback(
 	 * write this page in this writeback sweep will be made.
 	 */
 	if (clear_dirty) {
-		clear_page_dirty_for_io(page);
+		clear_page_dirty_for_io(page, WB_SYNC_ALL);
 		set_page_writeback(page);
 	} else
 		set_page_writeback_keepwrite(page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..3094500f5cff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1532,7 +1532,12 @@ static inline void cancel_dirty_page(struct page *page)
 	if (PageDirty(page))
 		__cancel_dirty_page(page);
 }
-int clear_page_dirty_for_io(struct page *page);
+
+/* The sync_mode argument expects enum writeback_sync_modes (see
+ * include/linux/writeback.h), but is declared as an int here, to avoid
+ * even more header file dependencies in mm.h.
+ */
+int clear_page_dirty_for_io(struct page *page, int sync_mode);
 
 int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..703c9ee86309 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -849,7 +849,7 @@ static int writeout(struct address_space *mapping, struct page *page)
 		/* No write method for the address space */
 		return -EINVAL;
 
-	if (!clear_page_dirty_for_io(page))
+	if (!clear_page_dirty_for_io(page, wbc.sync_mode))
 		/* Someone else already triggered a write */
 		return -EAGAIN;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 337c6afb3345..e526b3cbf900 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2233,7 +2233,7 @@ int write_cache_pages(struct address_space *mapping,
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io(page, wbc->sync_mode))
 				goto continue_unlock;
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
@@ -2371,7 +2371,7 @@ int write_one_page(struct page *page)
 
 	wait_on_page_writeback(page);
 
-	if (clear_page_dirty_for_io(page)) {
+	if (clear_page_dirty_for_io(page, wbc.sync_mode)) {
 		get_page(page);
 		ret = mapping->a_ops->writepage(page, &wbc);
 		if (ret == 0)
@@ -2643,8 +2643,13 @@ EXPORT_SYMBOL(__cancel_dirty_page);
  *
  * This incoherency between the page's dirty flag and radix-tree tag is
  * unfortunate, but it only exists while the page is locked.
+ *
+ * The sync_mode argument expects enum writeback_sync_modes (see
+ * include/linux/writeback.h), but is declared as an int here, to avoid
+ * even more header file dependencies in mm.h.
  */
-int clear_page_dirty_for_io(struct page *page)
+
+int clear_page_dirty_for_io(struct page *page, int sync_mode)
 {
 	struct address_space *mapping = page_mapping(page);
 	int ret = 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f86f288..16c305a8d476 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -668,7 +668,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	if (!may_write_to_inode(mapping->host, sc))
 		return PAGE_KEEP;
 
-	if (clear_page_dirty_for_io(page)) {
+	if (clear_page_dirty_for_io(page, WB_SYNC_ALL)) {
 		int res;
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_NONE,
-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
                   ` (3 preceding siblings ...)
  2018-07-02  0:56 ` [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io() john.hubbard
@ 2018-07-02  0:56 ` john.hubbard
  2018-07-02  2:11   ` kbuild test robot
                     ` (2 more replies)
  2018-07-02  0:56 ` [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages john.hubbard
  2018-07-02  5:54 ` [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers John Hubbard
  6 siblings, 3 replies; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

This patch sets and restores the new page->dma_pinned_flags and
page->dma_pinned_count fields, but does not actually use them for
anything yet.

In order to use these fields at all, the page must be removed from
any LRU list that it's on. The patch also adds some precautions that
prevent the page from getting moved back onto an LRU, once it is
in this state.

This is in preparation to fix some problems that came up when using
devices (NICs, GPUs, for example) that set up direct access to a chunk
of system (CPU) memory, so that they can DMA to/from that memory.

CC: Matthew Wilcox <willy@infradead.org>
CC: Jan Kara <jack@suse.cz>
CC: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 15 +++++++++++++
 mm/gup.c           | 56 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/memcontrol.c    |  7 ++++++
 mm/swap.c          | 48 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3094500f5cff..aeba3a13a2e3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -895,6 +895,9 @@ static inline bool is_device_public_page(const struct page *page)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+void __put_page_for_pinned_dma(struct page *page);
+void __get_page_for_pinned_dma(struct page *page);
+
 static inline void get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -904,12 +907,24 @@ static inline void get_page(struct page *page)
 	 */
 	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
 	page_ref_inc(page);
+
+	if (unlikely(PageDmaPinned(page)))
+		__get_page_for_pinned_dma(page);
 }
 
 static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
+	/* Because the page->dma_pinned_* fields are unioned with
+	 * page->lru, there is no way to do classical refcount-style
+	 * decrement-and-test-for-zero. Instead, PageDmaPinned(page) must
+	 * be checked, in order to safely check if we are allowed to decrement
+	 * page->dma_pinned_count at all.
+	 */
+	if (unlikely(PageDmaPinned(page)))
+		__put_page_for_pinned_dma(page);
+
 	/*
 	 * For devmap managed pages we need to catch refcount transition from
 	 * 2 to 1, when refcount reach one it means the page is free and we
diff --git a/mm/gup.c b/mm/gup.c
index 73f0b3316fa7..e5c0104fd234 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -20,6 +20,51 @@
 
 #include "internal.h"
 
+static int pin_page_for_dma(struct page *page)
+{
+	int ret = 0;
+	struct zone *zone;
+
+	page = compound_head(page);
+	zone = page_zone(page);
+
+	spin_lock(zone_gup_lock(zone));
+
+	if (PageDmaPinned(page)) {
+		/* Page was not on an LRU list, because it was DMA-pinned. */
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+
+		atomic_inc(&page->dma_pinned_count);
+		goto unlock_out;
+	}
+
+	/*
+	 * Note that page->dma_pinned_flags is unioned with page->lru.
+	 * Therefore, the rules are: checking if any of the
+	 * PAGE_DMA_PINNED_FLAGS bits are set may be done while page->lru
+	 * is in use. However, setting those flags requires that
+	 * the page is both locked, and also, removed from the LRU.
+	 */
+	ret = isolate_lru_page(page);
+
+	if (ret == 0) {
+		/* Avoid problems later, when freeing the page: */
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+
+		/* counteract isolate_lru_page's effects: */
+		put_page(page);
+
+		atomic_set(&page->dma_pinned_count, 1);
+		SetPageDmaPinned(page);
+	}
+
+unlock_out:
+	spin_unlock(zone_gup_lock(zone));
+
+	return ret;
+}
+
 static struct page *no_page_table(struct vm_area_struct *vma,
 		unsigned int flags)
 {
@@ -659,7 +704,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, j;
 	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
@@ -764,6 +809,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	} while (nr_pages);
 
 out:
+	if (pages)
+		for (j = 0; j < i; j++)
+			pin_page_for_dma(pages[j]);
+
 	return i ? i : err;
 }
 
@@ -1843,7 +1892,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr = 0, ret = 0, i;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -1864,6 +1913,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		ret = nr;
 	}
 
+	for (i = 0; i < nr; i++)
+		pin_page_for_dma(pages[i]);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5ef320a..510d442647c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2062,6 +2062,11 @@ static void lock_page_lru(struct page *page, int *isolated)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
+		/* LRU and PageDmaPinned are mutually exclusive: they use the
+		 * same fields in struct page, but for different purposes.
+		 */
+		VM_BUG_ON_PAGE(PageDmaPinned(page), page);
+
 		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -2079,6 +2084,8 @@ static void unlock_page_lru(struct page *page, int isolated)
 
 		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
+
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 26fc9b5f1b6c..09ba61300d06 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -52,6 +52,43 @@ static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 #endif
 
+void __get_page_for_pinned_dma(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock(zone_gup_lock(zone));
+
+	if (PageDmaPinned(page))
+		atomic_inc(&page->dma_pinned_count);
+
+	spin_unlock(zone_gup_lock(zone));
+}
+EXPORT_SYMBOL(__get_page_for_pinned_dma);
+
+void __put_page_for_pinned_dma(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	if (atomic_dec_and_test(&page->dma_pinned_count)) {
+		spin_lock(zone_gup_lock(zone));
+
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+
+		/* Re-check while holding the lock, because
+		 * pin_page_for_dma() or get_page() may have snuck in right
+		 * after the atomic_dec_and_test, and raised the count
+		 * above zero again. If so, just leave the flag set. And
+		 * because the atomic_dec_and_test above already got the
+		 * accounting correct, no other action is required.
+		 */
+		if (atomic_read(&page->dma_pinned_count) == 0)
+			ClearPageDmaPinned(page);
+
+		spin_unlock(zone_gup_lock(zone));
+	}
+}
+EXPORT_SYMBOL(__put_page_for_pinned_dma);
+
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs.  But it gets used by networking.
@@ -824,6 +861,11 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+
+	/* LRU and PageDmaPinned are mutually exclusive: they use the
+	 * same fields in struct page, but for different purposes.
+	 */
+	VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
 		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
 
@@ -863,6 +905,12 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+	/* LRU and PageDmaPinned are mutually exclusive: they use the
+	 * same fields in struct page, but for different purposes.
+	 */
+	if (PageDmaPinned(page))
+		return;
+
 	SetPageLRU(page);
 	/*
 	 * Page becomes evictable in two ways:
-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
                   ` (4 preceding siblings ...)
  2018-07-02  0:56 ` [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
@ 2018-07-02  0:56 ` john.hubbard
  2018-07-02 10:15   ` Jan Kara
  2018-07-02  5:54 ` [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers John Hubbard
  6 siblings, 1 reply; 30+ messages in thread
From: john.hubbard @ 2018-07-02  0:56 UTC (permalink / raw)
  To: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Update page_mkclean(), page_mkclean's callers, and try_to_unmap(), so that
there is a choice: in some cases, skipped dma-pinned pages. In other cases
(sync_mode ==  WB_SYNC_ALL), wait for those pages to become unpinned.

This fixes some problems that came up when using devices (NICs, GPUs, for
example) that set up direct access to a chunk of system (CPU) memory, so
that they can DMA to/from that memory. Problems [1] come up if that memory
is backed by persistence storage; for example, an ext4 file system. This
has caused several customers to experience kernel oops crashes, due to the
BUG_ON, below.

The bugs happen via:

-- get_user_pages() on some ext4-backed pages
-- device does DMA for a while to/from those pages

    -- Somewhere in here, some of the pages get disconnected from the
       file system, via try_to_unmap() and eventually drop_buffers()

-- device is all done, device driver calls set_page_dirty_locked, then
   put_page()

And then at some point, we see a this BUG():

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

        ({                                                      \
                BUG_ON(!PagePrivate(page));                     \
                ((struct buffer_head *)page_private(page));     \
        })

How to fix this:

If a page is pinned by any of the get_user_page("gup", here) variants, then
there is no need for that page to be on an LRU. So, this patch removes such
pages from their LRU, thus leaving the page->lru fields *mostly* available
for tracking gup pages. (The lowest bit of page->lru.next is used as
PageTail, and these flags have to be checked when we don't know if it
really is a tail page or not, so avoid that bit.)

After that, the page is reference-counted via page->dma_pinned_count, and
flagged via page->dma_pinned_flags. The PageDmaPinned flag is cleared when
the reference count hits zero, and the reference count is only used when
the flag is set, and we can only lock the page *most* of the time that we
look at the flag, so it's a little bit complicated, but it works.

All of the above provides a reliable PageDmaPinned flag, which will be used
in subsequent patches, to decide when to abort or wait for operations such
as:

    try_to_unmap()
    page_mkclean()

Thanks to Matthew Wilcox for suggesting re-using page->lru fields for a
new refcount and flag, and to Jan Kara for explaining the rest of the
design details (how to deal with page_mkclean() and try_to_unmap(),
especially). Also thanks to Dan Williams for design advice and DAX,
long-term pinning, and page flag thoughts.

References:

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

CC: Matthew Wilcox <willy@infradead.org>
CC: Jan Kara <jack@suse.cz>
CC: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/video/fbdev/core/fb_defio.c |  3 +-
 include/linux/rmap.h                |  4 +-
 mm/memory-failure.c                 |  3 +-
 mm/page-writeback.c                 |  3 +-
 mm/rmap.c                           | 71 ++++++++++++++++++++++++++---
 mm/truncate.c                       |  3 +-
 6 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index 82c20c6047b0..f5aca45adb75 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -181,12 +181,13 @@ static void fb_deferred_io_work(struct work_struct *work)
 	struct list_head *node, *next;
 	struct page *cur;
 	struct fb_deferred_io *fbdefio = info->fbdefio;
+	bool skip_pinned_pages = false;
 
 	/* here we mkclean the pages, then do all deferred IO */
 	mutex_lock(&fbdefio->lock);
 	list_for_each_entry(cur, &fbdefio->pagelist, lru) {
 		lock_page(cur);
-		page_mkclean(cur);
+		page_mkclean(cur, skip_pinned_pages);
 		unlock_page(cur);
 	}
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..f68a473a48fb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -233,7 +233,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
  *
  * returns the number of cleaned PTEs.
  */
-int page_mkclean(struct page *);
+int page_mkclean(struct page *page, bool skip_pinned_pages);
 
 /*
  * called in munlock()/munmap() path to check for other vmas holding
@@ -291,7 +291,7 @@ static inline int page_referenced(struct page *page, int is_locked,
 
 #define try_to_unmap(page, refs) false
 
-static inline int page_mkclean(struct page *page)
+static inline int page_mkclean(struct page *page, bool skip_pinned_pages)
 {
 	return 0;
 }
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..c4bc8d216746 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -931,6 +931,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
+	bool skip_pinned_pages = false;
 
 	/*
 	 * Here we are interested only in user-mapped pages, so skip any
@@ -968,7 +969,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	mapping = page_mapping(hpage);
 	if (!(flags & MF_MUST_KILL) && !PageDirty(hpage) && mapping &&
 	    mapping_cap_writeback_dirty(mapping)) {
-		if (page_mkclean(hpage)) {
+		if (page_mkclean(hpage, skip_pinned_pages)) {
 			SetPageDirty(hpage);
 		} else {
 			kill = 0;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e526b3cbf900..19f4972ba5c6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2660,6 +2660,7 @@ int clear_page_dirty_for_io(struct page *page, int sync_mode)
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
 		struct wb_lock_cookie cookie = {};
+		bool skip_pinned_pages = (sync_mode != WB_SYNC_ALL);
 
 		/*
 		 * Yes, Virginia, this is indeed insane.
@@ -2686,7 +2687,7 @@ int clear_page_dirty_for_io(struct page *page, int sync_mode)
 		 * as a serialization point for all the different
 		 * threads doing their things.
 		 */
-		if (page_mkclean(page))
+		if (page_mkclean(page, skip_pinned_pages))
 			set_page_dirty(page);
 		/*
 		 * We carefully synchronise fault handlers against
diff --git a/mm/rmap.c b/mm/rmap.c
index 6db729dc4c50..c137c43eb2ad 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -879,6 +879,26 @@ int page_referenced(struct page *page,
 	return pra.referenced;
 }
 
+/* Must be called with pinned_dma_lock held. */
+static void wait_for_dma_pinned_to_clear(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	while (PageDmaPinnedFlags(page)) {
+		spin_unlock(zone_gup_lock(zone));
+
+		schedule();
+
+		spin_lock(zone_gup_lock(zone));
+	}
+}
+
+struct page_mkclean_info {
+	int cleaned;
+	int skipped;
+	bool skip_pinned_pages;
+};
+
 static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			    unsigned long address, void *arg)
 {
@@ -889,7 +909,24 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		.flags = PVMW_SYNC,
 	};
 	unsigned long start = address, end;
-	int *cleaned = arg;
+	struct page_mkclean_info *mki = (struct page_mkclean_info *)arg;
+	bool is_dma_pinned;
+	struct zone *zone = page_zone(page);
+
+	/* Serialize with get_user_pages: */
+	spin_lock(zone_gup_lock(zone));
+	is_dma_pinned = PageDmaPinned(page);
+
+	if (is_dma_pinned) {
+		if (mki->skip_pinned_pages) {
+			spin_unlock(zone_gup_lock(zone));
+			mki->skipped++;
+			return false;
+		}
+	}
+
+	/* Unlock while doing mmu notifier callbacks */
+	spin_unlock(zone_gup_lock(zone));
 
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
@@ -898,6 +935,10 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
+	spin_lock(zone_gup_lock(zone));
+
+	wait_for_dma_pinned_to_clear(page);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		unsigned long cstart;
 		int ret = 0;
@@ -945,9 +986,11 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		if (ret)
-			(*cleaned)++;
+			(mki->cleaned)++;
 	}
 
+	spin_unlock(zone_gup_lock(zone));
+
 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 
 	return true;
@@ -961,12 +1004,17 @@ static bool invalid_mkclean_vma(struct vm_area_struct *vma, void *arg)
 	return true;
 }
 
-int page_mkclean(struct page *page)
+int page_mkclean(struct page *page, bool skip_pinned_pages)
 {
-	int cleaned = 0;
+	struct page_mkclean_info mki = {
+		.cleaned = 0,
+		.skipped = 0,
+		.skip_pinned_pages = skip_pinned_pages
+	};
+
 	struct address_space *mapping;
 	struct rmap_walk_control rwc = {
-		.arg = (void *)&cleaned,
+		.arg = (void *)&mki,
 		.rmap_one = page_mkclean_one,
 		.invalid_vma = invalid_mkclean_vma,
 	};
@@ -982,7 +1030,7 @@ int page_mkclean(struct page *page)
 
 	rmap_walk(page, &rwc);
 
-	return cleaned;
+	return mki.cleaned && !mki.skipped;
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
@@ -1346,6 +1394,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	struct zone *zone = page_zone(page);
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1360,6 +1409,16 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				flags & TTU_SPLIT_FREEZE, page);
 	}
 
+	/* Serialize with get_user_pages: */
+	spin_lock(zone_gup_lock(zone));
+
+	if (PageDmaPinned(page)) {
+		spin_unlock(zone_gup_lock(zone));
+		return false;
+	}
+
+	spin_unlock(zone_gup_lock(zone));
+
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
 	 * the page can not be free in this function as call of try_to_unmap()
diff --git a/mm/truncate.c b/mm/truncate.c
index 1d2fb2dca96f..61e73d0d8777 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -852,6 +852,7 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
 	loff_t rounded_from;
 	struct page *page;
 	pgoff_t index;
+	bool skip_pinned_pages = false;
 
 	WARN_ON(to > inode->i_size);
 
@@ -871,7 +872,7 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
 	 * See clear_page_dirty_for_io() for details why set_page_dirty()
 	 * is needed.
 	 */
-	if (page_mkclean(page))
+	if (page_mkclean(page, skip_pinned_pages))
 		set_page_dirty(page);
 	unlock_page(page);
 	put_page(page);
-- 
2.18.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02  0:56 ` [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
@ 2018-07-02  2:11   ` kbuild test robot
  2018-07-02  2:58   ` kbuild test robot
  2018-07-02  9:53   ` Jan Kara
  2 siblings, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2018-07-02  2:11 UTC (permalink / raw)
  To: john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 3228 bytes --]

Hi John,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc3]
[cannot apply to next-20180629]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
config: i386-randconfig-x074-201826 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/asm-generic/atomic-instrumented.h:16:0,
                    from arch/x86/include/asm/atomic.h:283,
                    from include/linux/atomic.h:5,
                    from include/linux/page_counter.h:5,
                    from mm/memcontrol.c:34:
   mm/memcontrol.c: In function 'unlock_page_lru':
>> mm/memcontrol.c:2087:32: error: 'page_tail' undeclared (first use in this function); did you mean 'page_pool'?
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
                                   ^
   include/linux/build_bug.h:36:63: note: in definition of macro 'BUILD_BUG_ON_INVALID'
    #define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((__force long)(e))))
                                                                  ^
>> include/linux/mmdebug.h:46:36: note: in expansion of macro 'VM_BUG_ON'
    #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond)
                                       ^~~~~~~~~
>> mm/memcontrol.c:2087:3: note: in expansion of macro 'VM_BUG_ON_PAGE'
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
      ^~~~~~~~~~~~~~
   mm/memcontrol.c:2087:32: note: each undeclared identifier is reported only once for each function it appears in
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
                                   ^
   include/linux/build_bug.h:36:63: note: in definition of macro 'BUILD_BUG_ON_INVALID'
    #define BUILD_BUG_ON_INVALID(e) ((void)(sizeof((__force long)(e))))
                                                                  ^
>> include/linux/mmdebug.h:46:36: note: in expansion of macro 'VM_BUG_ON'
    #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond)
                                       ^~~~~~~~~
>> mm/memcontrol.c:2087:3: note: in expansion of macro 'VM_BUG_ON_PAGE'
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
      ^~~~~~~~~~~~~~

vim +2087 mm/memcontrol.c

  2077	
  2078	static void unlock_page_lru(struct page *page, int isolated)
  2079	{
  2080		struct zone *zone = page_zone(page);
  2081	
  2082		if (isolated) {
  2083			struct lruvec *lruvec;
  2084	
  2085			lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
  2086			VM_BUG_ON_PAGE(PageLRU(page), page);
> 2087			VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
  2088	
  2089			SetPageLRU(page);
  2090			add_page_to_lru_list(page, lruvec, page_lru(page));
  2091		}
  2092		spin_unlock_irq(zone_lru_lock(zone));
  2093	}
  2094	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29636 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  2018-07-02  0:56 ` [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io() john.hubbard
@ 2018-07-02  2:11   ` kbuild test robot
  2018-07-02  4:40     ` John Hubbard
  2018-07-02  2:47   ` kbuild test robot
  1 sibling, 1 reply; 30+ messages in thread
From: kbuild test robot @ 2018-07-02  2:11 UTC (permalink / raw)
  To: john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 12697 bytes --]

Hi John,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.18-rc3]
[cannot apply to next-20180629]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
config: x86_64-randconfig-x010-201826 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from fs/f2fs/checkpoint.c:11:
   fs/f2fs/checkpoint.c: In function 'commit_checkpoint':
   fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> fs/f2fs/checkpoint.c:1200:2: note: in expansion of macro 'if'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
     ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
   fs/f2fs/checkpoint.c:1200:6: note: in expansion of macro 'unlikely'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
         ^~~~~~~~
   fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> fs/f2fs/checkpoint.c:1200:2: note: in expansion of macro 'if'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
     ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
   fs/f2fs/checkpoint.c:1200:6: note: in expansion of macro 'unlikely'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
         ^~~~~~~~
   fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> fs/f2fs/checkpoint.c:1200:2: note: in expansion of macro 'if'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
     ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
   fs/f2fs/checkpoint.c:1200:6: note: in expansion of macro 'unlikely'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
         ^~~~~~~~
   fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> fs/f2fs/checkpoint.c:1200:2: note: in expansion of macro 'if'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
     ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
   fs/f2fs/checkpoint.c:1200:6: note: in expansion of macro 'unlikely'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
         ^~~~~~~~
   fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> fs/f2fs/checkpoint.c:1200:2: note: in expansion of macro 'if'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
     ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
   fs/f2fs/checkpoint.c:1200:6: note: in expansion of macro 'unlikely'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
         ^~~~~~~~
   fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> fs/f2fs/checkpoint.c:1200:2: note: in expansion of macro 'if'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
     ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
   fs/f2fs/checkpoint.c:1200:6: note: in expansion of macro 'unlikely'
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
         ^~~~~~~~
--
   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from fs/f2fs/data.c:11:
   fs/f2fs/data.c: In function 'f2fs_write_cache_pages':
   fs/f2fs/data.c:2021:9: error: too few arguments to function 'clear_page_dirty_for_io'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
            ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
       ^~
   In file included from include/linux/pagemap.h:8:0,
                    from include/linux/f2fs_fs.h:14,
                    from fs/f2fs/data.c:12:
   include/linux/mm.h:1540:5: note: declared here
    int clear_page_dirty_for_io(struct page *page, int sync_mode);
        ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from fs/f2fs/data.c:11:
   include/linux/compiler.h:56:41: warning: left-hand operand of comma expression has no effect [-Wunused-value]
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                                            ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
       ^~
   fs/f2fs/data.c:2021:9: error: too few arguments to function 'clear_page_dirty_for_io'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
            ^
   include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
       ^~
   In file included from include/linux/pagemap.h:8:0,
                    from include/linux/f2fs_fs.h:14,
                    from fs/f2fs/data.c:12:
   include/linux/mm.h:1540:5: note: declared here
    int clear_page_dirty_for_io(struct page *page, int sync_mode);
        ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from fs/f2fs/data.c:11:
   include/linux/compiler.h:56:41: warning: left-hand operand of comma expression has no effect [-Wunused-value]
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                                            ^
   include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
       ^~
   fs/f2fs/data.c:2021:9: error: too few arguments to function 'clear_page_dirty_for_io'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
            ^
   include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
       ^~
   In file included from include/linux/pagemap.h:8:0,
                    from include/linux/f2fs_fs.h:14,
                    from fs/f2fs/data.c:12:
   include/linux/mm.h:1540:5: note: declared here
    int clear_page_dirty_for_io(struct page *page, int sync_mode);
        ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from fs/f2fs/data.c:11:
   include/linux/compiler.h:56:41: warning: left-hand operand of comma expression has no effect [-Wunused-value]
    #define if(cond, ...) __trace_if( (cond , ## __VA_ARGS__) )
                                            ^
   include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
       ^~

vim +/if +1200 fs/f2fs/checkpoint.c

  1179	
  1180	static void commit_checkpoint(struct f2fs_sb_info *sbi,
  1181		void *src, block_t blk_addr)
  1182	{
  1183		struct writeback_control wbc = {
  1184			.for_reclaim = 0,
  1185		};
  1186	
  1187		/*
  1188		 * pagevec_lookup_tag and lock_page again will take
  1189		 * some extra time. Therefore, f2fs_update_meta_pages and
  1190		 * f2fs_sync_meta_pages are combined in this function.
  1191		 */
  1192		struct page *page = f2fs_grab_meta_page(sbi, blk_addr);
  1193		int err;
  1194	
  1195		memcpy(page_address(page), src, PAGE_SIZE);
  1196		set_page_dirty(page);
  1197	
  1198		f2fs_wait_on_page_writeback(page, META, true);
  1199		f2fs_bug_on(sbi, PageWriteback(page));
> 1200		if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
  1201			f2fs_bug_on(sbi, 1);
  1202	
  1203		/* writeout cp pack 2 page */
  1204		err = __f2fs_write_meta_page(page, &wbc, FS_CP_META_IO);
  1205		f2fs_bug_on(sbi, err);
  1206	
  1207		f2fs_put_page(page, 0);
  1208	
  1209		/* submit checkpoint (with barrier if NOBARRIER is not set) */
  1210		f2fs_submit_merged_write(sbi, META_FLUSH);
  1211	}
  1212	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35450 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  2018-07-02  0:56 ` [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io() john.hubbard
  2018-07-02  2:11   ` kbuild test robot
@ 2018-07-02  2:47   ` kbuild test robot
  2018-07-02  4:40     ` John Hubbard
  1 sibling, 1 reply; 30+ messages in thread
From: kbuild test robot @ 2018-07-02  2:47 UTC (permalink / raw)
  To: john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 4950 bytes --]

Hi John,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc3]
[cannot apply to next-20180629]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
config: i386-randconfig-x075-201826 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/f2fs/dir.c: In function 'f2fs_delete_entry':
>> fs/f2fs/dir.c:734:33: error: 'WB_SYNC_ALL' undeclared (first use in this function); did you mean 'FS_SYNC_FL'?
      clear_page_dirty_for_io(page, WB_SYNC_ALL);
                                    ^~~~~~~~~~~
                                    FS_SYNC_FL
   fs/f2fs/dir.c:734:33: note: each undeclared identifier is reported only once for each function it appears in
--
   fs/f2fs/inline.c: In function 'f2fs_convert_inline_page':
>> fs/f2fs/inline.c:139:50: error: dereferencing pointer to incomplete type 'struct writeback_control'
     dirty = clear_page_dirty_for_io(page, fio.io_wbc->sync_mode);
                                                     ^~
--
   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/wait.h:7,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from fs/f2fs/checkpoint.c:11:
   fs/f2fs/checkpoint.c: In function 'commit_checkpoint':
>> fs/f2fs/checkpoint.c:1200:49: error: invalid type argument of '->' (have 'struct writeback_control')
     if (unlikely(!clear_page_dirty_for_io(page, wbc->sync_mode)))
                                                    ^
   include/linux/compiler.h:77:42: note: in definition of macro 'unlikely'
    # define unlikely(x) __builtin_expect(!!(x), 0)
                                             ^
--
   fs/f2fs/data.c: In function 'f2fs_write_cache_pages':
>> fs/f2fs/data.c:2021:9: error: too few arguments to function 'clear_page_dirty_for_io'
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
            ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/pagemap.h:8:0,
                    from include/linux/f2fs_fs.h:14,
                    from fs/f2fs/data.c:12:
   include/linux/mm.h:1540:5: note: declared here
    int clear_page_dirty_for_io(struct page *page, int sync_mode);
        ^~~~~~~~~~~~~~~~~~~~~~~
   fs/f2fs/data.c:2021:38: warning: left-hand operand of comma expression has no effect [-Wunused-value]
       if (!clear_page_dirty_for_io(page), wbc->sync_mode)
                                         ^

vim +734 fs/f2fs/dir.c

   690	
   691	/*
   692	 * It only removes the dentry from the dentry page, corresponding name
   693	 * entry in name page does not need to be touched during deletion.
   694	 */
   695	void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
   696						struct inode *dir, struct inode *inode)
   697	{
   698		struct	f2fs_dentry_block *dentry_blk;
   699		unsigned int bit_pos;
   700		int slots = GET_DENTRY_SLOTS(le16_to_cpu(dentry->name_len));
   701		int i;
   702	
   703		f2fs_update_time(F2FS_I_SB(dir), REQ_TIME);
   704	
   705		if (F2FS_OPTION(F2FS_I_SB(dir)).fsync_mode == FSYNC_MODE_STRICT)
   706			f2fs_add_ino_entry(F2FS_I_SB(dir), dir->i_ino, TRANS_DIR_INO);
   707	
   708		if (f2fs_has_inline_dentry(dir))
   709			return f2fs_delete_inline_entry(dentry, page, dir, inode);
   710	
   711		lock_page(page);
   712		f2fs_wait_on_page_writeback(page, DATA, true);
   713	
   714		dentry_blk = page_address(page);
   715		bit_pos = dentry - dentry_blk->dentry;
   716		for (i = 0; i < slots; i++)
   717			__clear_bit_le(bit_pos + i, &dentry_blk->dentry_bitmap);
   718	
   719		/* Let's check and deallocate this dentry page */
   720		bit_pos = find_next_bit_le(&dentry_blk->dentry_bitmap,
   721				NR_DENTRY_IN_BLOCK,
   722				0);
   723		set_page_dirty(page);
   724	
   725		dir->i_ctime = dir->i_mtime = current_time(dir);
   726		f2fs_mark_inode_dirty_sync(dir, false);
   727	
   728		if (inode)
   729			f2fs_drop_nlink(dir, inode);
   730	
   731		if (bit_pos == NR_DENTRY_IN_BLOCK &&
   732			!f2fs_truncate_hole(dir, page->index, page->index + 1)) {
   733			f2fs_clear_radix_tree_dirty_tag(page);
 > 734			clear_page_dirty_for_io(page, WB_SYNC_ALL);
   735			ClearPagePrivate(page);
   736			ClearPageUptodate(page);
   737			inode_dec_dirty_pages(dir);
   738			f2fs_remove_dirty_inode(dir);
   739		}
   740		f2fs_put_page(page, 1);
   741	}
   742	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31381 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02  0:56 ` [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
  2018-07-02  2:11   ` kbuild test robot
@ 2018-07-02  2:58   ` kbuild test robot
  2018-07-02  5:05     ` John Hubbard
  2018-07-02  9:53   ` Jan Kara
  2 siblings, 1 reply; 30+ messages in thread
From: kbuild test robot @ 2018-07-02  2:58 UTC (permalink / raw)
  To: john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

[-- Attachment #1: Type: text/plain, Size: 6428 bytes --]

Hi John,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.18-rc3]
[cannot apply to next-20180629]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
config: x86_64-randconfig-x010-201826 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/atomic.h:5:0,
                    from include/linux/atomic.h:5,
                    from include/linux/page_counter.h:5,
                    from mm/memcontrol.c:34:
   mm/memcontrol.c: In function 'unlock_page_lru':
   mm/memcontrol.c:2087:32: error: 'page_tail' undeclared (first use in this function); did you mean 'page_pool'?
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
                                   ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> include/linux/mmdebug.h:21:3: note: in expansion of macro 'if'
      if (unlikely(cond)) {     \
      ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
>> include/linux/mmdebug.h:21:7: note: in expansion of macro 'unlikely'
      if (unlikely(cond)) {     \
          ^~~~~~~~
   mm/memcontrol.c:2087:3: note: in expansion of macro 'VM_BUG_ON_PAGE'
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
      ^~~~~~~~~~~~~~
   mm/memcontrol.c:2087:32: note: each undeclared identifier is reported only once for each function it appears in
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
                                   ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> include/linux/mmdebug.h:21:3: note: in expansion of macro 'if'
      if (unlikely(cond)) {     \
      ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
>> include/linux/mmdebug.h:21:7: note: in expansion of macro 'unlikely'
      if (unlikely(cond)) {     \
          ^~~~~~~~
   mm/memcontrol.c:2087:3: note: in expansion of macro 'VM_BUG_ON_PAGE'
      VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
      ^~~~~~~~~~~~~~

vim +/if +21 include/linux/mmdebug.h

309381feae Sasha Levin           2014-01-23  16  
59ea746337 Jiri Slaby            2008-06-12  17  #ifdef CONFIG_DEBUG_VM
59ea746337 Jiri Slaby            2008-06-12  18  #define VM_BUG_ON(cond) BUG_ON(cond)
309381feae Sasha Levin           2014-01-23  19  #define VM_BUG_ON_PAGE(cond, page)					\
e4f674229c Dave Hansen           2014-06-04  20  	do {								\
e4f674229c Dave Hansen           2014-06-04 @21  		if (unlikely(cond)) {					\
e4f674229c Dave Hansen           2014-06-04  22  			dump_page(page, "VM_BUG_ON_PAGE(" __stringify(cond)")");\
e4f674229c Dave Hansen           2014-06-04  23  			BUG();						\
e4f674229c Dave Hansen           2014-06-04  24  		}							\
e4f674229c Dave Hansen           2014-06-04  25  	} while (0)
fa3759ccd5 Sasha Levin           2014-10-09  26  #define VM_BUG_ON_VMA(cond, vma)					\
fa3759ccd5 Sasha Levin           2014-10-09  27  	do {								\
fa3759ccd5 Sasha Levin           2014-10-09  28  		if (unlikely(cond)) {					\
fa3759ccd5 Sasha Levin           2014-10-09  29  			dump_vma(vma);					\
fa3759ccd5 Sasha Levin           2014-10-09  30  			BUG();						\
fa3759ccd5 Sasha Levin           2014-10-09  31  		}							\
fa3759ccd5 Sasha Levin           2014-10-09  32  	} while (0)
31c9afa6db Sasha Levin           2014-10-09  33  #define VM_BUG_ON_MM(cond, mm)						\
31c9afa6db Sasha Levin           2014-10-09  34  	do {								\
31c9afa6db Sasha Levin           2014-10-09  35  		if (unlikely(cond)) {					\
31c9afa6db Sasha Levin           2014-10-09  36  			dump_mm(mm);					\
31c9afa6db Sasha Levin           2014-10-09  37  			BUG();						\
31c9afa6db Sasha Levin           2014-10-09  38  		}							\
31c9afa6db Sasha Levin           2014-10-09  39  	} while (0)
91241681c6 Michal Hocko          2018-04-05  40  #define VM_WARN_ON(cond) (void)WARN_ON(cond)
91241681c6 Michal Hocko          2018-04-05  41  #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
91241681c6 Michal Hocko          2018-04-05  42  #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format)
91241681c6 Michal Hocko          2018-04-05  43  #define VM_WARN(cond, format...) (void)WARN(cond, format)
59ea746337 Jiri Slaby            2008-06-12  44  #else
02602a18c3 Konstantin Khlebnikov 2012-05-29  45  #define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)
309381feae Sasha Levin           2014-01-23  46  #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond)
fa3759ccd5 Sasha Levin           2014-10-09  47  #define VM_BUG_ON_VMA(cond, vma) VM_BUG_ON(cond)
31c9afa6db Sasha Levin           2014-10-09  48  #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond)
02a8efeda8 Andrew Morton         2014-06-04  49  #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)
02a8efeda8 Andrew Morton         2014-06-04  50  #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
ef6b571fb8 Andrew Morton         2014-08-06  51  #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
a54f9aebaa Aneesh Kumar K.V      2016-07-26  52  #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
59ea746337 Jiri Slaby            2008-06-12  53  #endif
59ea746337 Jiri Slaby            2008-06-12  54  

:::::: The code at line 21 was first introduced by commit
:::::: e4f674229ce63dac60be0c4ddfb5ef8d1225d30d mm: pass VM_BUG_ON() reason to dump_page()

:::::: TO: Dave Hansen <dave.hansen@linux.intel.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35450 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  2018-07-02  2:11   ` kbuild test robot
@ 2018-07-02  4:40     ` John Hubbard
  0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-02  4:40 UTC (permalink / raw)
  To: kbuild test robot, john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel

On 07/01/2018 07:11 PM, kbuild test robot wrote:
> Hi John,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.18-rc3]
> [cannot apply to next-20180629]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
> config: x86_64-randconfig-x010-201826 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
[...]
>                                             ^
>    include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
>      if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
>                                              ^~~~
>>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
>        if (!clear_page_dirty_for_io(page), wbc->sync_mode)
>        ^~
>    fs/f2fs/data.c:2021:9: error: too few arguments to function 'clear_page_dirty_for_io'
>        if (!clear_page_dirty_for_io(page), wbc->sync_mode)
>             ^
>    include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
>       ______r = !!(cond);     \
>                    ^~~~
>>> fs/f2fs/data.c:2021:4: note: in expansion of macro 'if'
>        if (!clear_page_dirty_for_io(page), wbc->sync_mode)
>        ^~

> 

Typo, that should have been:
         if (!clear_page_dirty_for_io(page, wbc->sync_mode))

...fixed locally, I'll include it in the next spin. (Somehow my last build didn't
have all the filesystems enabled, sorry for the glitches.)
   

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io()
  2018-07-02  2:47   ` kbuild test robot
@ 2018-07-02  4:40     ` John Hubbard
  0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-02  4:40 UTC (permalink / raw)
  To: kbuild test robot, john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel

On 07/01/2018 07:47 PM, kbuild test robot wrote:
> Hi John,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18-rc3]
> [cannot apply to next-20180629]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
> config: i386-randconfig-x075-201826 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>    fs/f2fs/dir.c: In function 'f2fs_delete_entry':
>>> fs/f2fs/dir.c:734:33: error: 'WB_SYNC_ALL' undeclared (first use in this function); did you mean 'FS_SYNC_FL'?
>       clear_page_dirty_for_io(page, WB_SYNC_ALL);
>                                     ^~~~~~~~~~~
>                                     FS_SYNC_FL

Fixed locally, via:

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 258f9dc117f4..ca20c1262582 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -16,6 +16,7 @@
 #include "acl.h"
 #include "xattr.h"
 #include <trace/events/f2fs.h>
+#include <linux/writeback.h>
 
 static unsigned long dir_blocks(struct inode *inode)
 {



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02  2:58   ` kbuild test robot
@ 2018-07-02  5:05     ` John Hubbard
  0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-02  5:05 UTC (permalink / raw)
  To: kbuild test robot, john.hubbard
  Cc: kbuild-all, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel

On 07/01/2018 07:58 PM, kbuild test robot wrote:
> Hi John,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.18-rc3]
> [cannot apply to next-20180629]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/john-hubbard-gmail-com/mm-fs-gup-don-t-unmap-or-drop-filesystem-buffers/20180702-090125
> config: x86_64-randconfig-x010-201826 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All warnings (new ones prefixed by >>):
> 
>    In file included from arch/x86/include/asm/atomic.h:5:0,
>                     from include/linux/atomic.h:5,
>                     from include/linux/page_counter.h:5,
>                     from mm/memcontrol.c:34:
>    mm/memcontrol.c: In function 'unlock_page_lru':
>    mm/memcontrol.c:2087:32: error: 'page_tail' undeclared (first use in this function); did you mean 'page_pool'?
>       VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
>                                    ^
Yes, that should have been:

        VM_BUG_ON_PAGE(PageDmaPinned(page), page);

Fixed locally...maybe I'll post a v3 right now, as there were half a dozen ridiculous typos that
snuck in.



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers
  2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
                   ` (5 preceding siblings ...)
  2018-07-02  0:56 ` [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages john.hubbard
@ 2018-07-02  5:54 ` John Hubbard
  6 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-02  5:54 UTC (permalink / raw)
  To: john.hubbard, Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara
  Cc: linux-mm, LKML, linux-rdma, linux-fsdevel

On 07/01/2018 05:56 PM, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 

There were some typos in patches #4 and #5, which I've fixed locally.
Let me know if anyone would like me to repost with those right away, otherwise
I'll wait for other review besides the kbuild test robot.

Meanwhile, for convenience, you can pull down the latest version of the
patchset from:

    git@github.com:johnhubbard/linux (branch: gup_dma_next)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02  0:56 ` [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
  2018-07-02  2:11   ` kbuild test robot
  2018-07-02  2:58   ` kbuild test robot
@ 2018-07-02  9:53   ` Jan Kara
  2018-07-02 20:43     ` John Hubbard
  2 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2018-07-02  9:53 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

On Sun 01-07-18 17:56:53, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> This patch sets and restores the new page->dma_pinned_flags and
> page->dma_pinned_count fields, but does not actually use them for
> anything yet.
> 
> In order to use these fields at all, the page must be removed from
> any LRU list that it's on. The patch also adds some precautions that
> prevent the page from getting moved back onto an LRU, once it is
> in this state.
> 
> This is in preparation to fix some problems that came up when using
> devices (NICs, GPUs, for example) that set up direct access to a chunk
> of system (CPU) memory, so that they can DMA to/from that memory.
> 
> CC: Matthew Wilcox <willy@infradead.org>
> CC: Jan Kara <jack@suse.cz>
> CC: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

...

> @@ -904,12 +907,24 @@ static inline void get_page(struct page *page)
>  	 */
>  	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>  	page_ref_inc(page);
> +
> +	if (unlikely(PageDmaPinned(page)))
> +		__get_page_for_pinned_dma(page);
>  }
>  
>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
>  
> +	/* Because the page->dma_pinned_* fields are unioned with
> +	 * page->lru, there is no way to do classical refcount-style
> +	 * decrement-and-test-for-zero. Instead, PageDmaPinned(page) must
> +	 * be checked, in order to safely check if we are allowed to decrement
> +	 * page->dma_pinned_count at all.
> +	 */
> +	if (unlikely(PageDmaPinned(page)))
> +		__put_page_for_pinned_dma(page);
> +

These two are just wrong. You cannot make any page reference for
PageDmaPinned() account against a pin count. First, it is just conceptually
wrong as these references need not be long term pins, second, you can
easily race like:

Pinner				Random process
				get_page(page)
pin_page_for_dma()
				put_page(page)
				 -> oops, page gets unpinned too early

So you really have to create counterpart to get_user_pages() - like
put_user_page() or whatever... It is inconvenient to have to modify all GUP
users but I don't see a way around that.

								Honza

>  	/*
>  	 * For devmap managed pages we need to catch refcount transition from
>  	 * 2 to 1, when refcount reach one it means the page is free and we
> diff --git a/mm/gup.c b/mm/gup.c
> index 73f0b3316fa7..e5c0104fd234 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -20,6 +20,51 @@
>  
>  #include "internal.h"
>  
> +static int pin_page_for_dma(struct page *page)
> +{
> +	int ret = 0;
> +	struct zone *zone;
> +
> +	page = compound_head(page);
> +	zone = page_zone(page);
> +
> +	spin_lock(zone_gup_lock(zone));
> +
> +	if (PageDmaPinned(page)) {
> +		/* Page was not on an LRU list, because it was DMA-pinned. */
> +		VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +		atomic_inc(&page->dma_pinned_count);
> +		goto unlock_out;
> +	}
> +
> +	/*
> +	 * Note that page->dma_pinned_flags is unioned with page->lru.
> +	 * Therefore, the rules are: checking if any of the
> +	 * PAGE_DMA_PINNED_FLAGS bits are set may be done while page->lru
> +	 * is in use. However, setting those flags requires that
> +	 * the page is both locked, and also, removed from the LRU.
> +	 */
> +	ret = isolate_lru_page(page);
> +
> +	if (ret == 0) {
> +		/* Avoid problems later, when freeing the page: */
> +		ClearPageActive(page);
> +		ClearPageUnevictable(page);
> +
> +		/* counteract isolate_lru_page's effects: */
> +		put_page(page);
> +
> +		atomic_set(&page->dma_pinned_count, 1);
> +		SetPageDmaPinned(page);
> +	}
> +
> +unlock_out:
> +	spin_unlock(zone_gup_lock(zone));
> +
> +	return ret;
> +}
> +
>  static struct page *no_page_table(struct vm_area_struct *vma,
>  		unsigned int flags)
>  {
> @@ -659,7 +704,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas, int *nonblocking)
>  {
> -	long i = 0;
> +	long i = 0, j;
>  	int err = 0;
>  	unsigned int page_mask;
>  	struct vm_area_struct *vma = NULL;
> @@ -764,6 +809,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  	} while (nr_pages);
>  
>  out:
> +	if (pages)
> +		for (j = 0; j < i; j++)
> +			pin_page_for_dma(pages[j]);
> +
>  	return i ? i : err;
>  }
>  
> @@ -1843,7 +1892,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages)
>  {
>  	unsigned long addr, len, end;
> -	int nr = 0, ret = 0;
> +	int nr = 0, ret = 0, i;
>  
>  	start &= PAGE_MASK;
>  	addr = start;
> @@ -1864,6 +1913,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  		ret = nr;
>  	}
>  
> +	for (i = 0; i < nr; i++)
> +		pin_page_for_dma(pages[i]);
> +
>  	if (nr < nr_pages) {
>  		/* Try to get the remaining pages with get_user_pages */
>  		start += nr << PAGE_SHIFT;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e6f0d5ef320a..510d442647c2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2062,6 +2062,11 @@ static void lock_page_lru(struct page *page, int *isolated)
>  	if (PageLRU(page)) {
>  		struct lruvec *lruvec;
>  
> +		/* LRU and PageDmaPinned are mutually exclusive: they use the
> +		 * same fields in struct page, but for different purposes.
> +		 */
> +		VM_BUG_ON_PAGE(PageDmaPinned(page), page);
> +
>  		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>  		ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
> @@ -2079,6 +2084,8 @@ static void unlock_page_lru(struct page *page, int isolated)
>  
>  		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
> +		VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
> +
>  		SetPageLRU(page);
>  		add_page_to_lru_list(page, lruvec, page_lru(page));
>  	}
> diff --git a/mm/swap.c b/mm/swap.c
> index 26fc9b5f1b6c..09ba61300d06 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -52,6 +52,43 @@ static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  #endif
>  
> +void __get_page_for_pinned_dma(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	spin_lock(zone_gup_lock(zone));
> +
> +	if (PageDmaPinned(page))
> +		atomic_inc(&page->dma_pinned_count);
> +
> +	spin_unlock(zone_gup_lock(zone));
> +}
> +EXPORT_SYMBOL(__get_page_for_pinned_dma);
> +
> +void __put_page_for_pinned_dma(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	if (atomic_dec_and_test(&page->dma_pinned_count)) {
> +		spin_lock(zone_gup_lock(zone));
> +
> +		VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +		/* Re-check while holding the lock, because
> +		 * pin_page_for_dma() or get_page() may have snuck in right
> +		 * after the atomic_dec_and_test, and raised the count
> +		 * above zero again. If so, just leave the flag set. And
> +		 * because the atomic_dec_and_test above already got the
> +		 * accounting correct, no other action is required.
> +		 */
> +		if (atomic_read(&page->dma_pinned_count) == 0)
> +			ClearPageDmaPinned(page);
> +
> +		spin_unlock(zone_gup_lock(zone));
> +	}
> +}
> +EXPORT_SYMBOL(__put_page_for_pinned_dma);
> +
>  /*
>   * This path almost never happens for VM activity - pages are normally
>   * freed via pagevecs.  But it gets used by networking.
> @@ -824,6 +861,11 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>  	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> +
> +	/* LRU and PageDmaPinned are mutually exclusive: they use the
> +	 * same fields in struct page, but for different purposes.
> +	 */
> +	VM_BUG_ON_PAGE(PageDmaPinned(page_tail), page);
>  	VM_BUG_ON(NR_CPUS != 1 &&
>  		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>  
> @@ -863,6 +905,12 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> +	/* LRU and PageDmaPinned are mutually exclusive: they use the
> +	 * same fields in struct page, but for different purposes.
> +	 */
> +	if (PageDmaPinned(page))
> +		return;
> +
>  	SetPageLRU(page);
>  	/*
>  	 * Page becomes evictable in two ways:
> -- 
> 2.18.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages
  2018-07-02  0:56 ` [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages john.hubbard
@ 2018-07-02 10:15   ` Jan Kara
  2018-07-02 21:07     ` John Hubbard
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2018-07-02 10:15 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

On Sun 01-07-18 17:56:54, john.hubbard@gmail.com wrote:
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 9d142b9b86dc..c4bc8d216746 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -931,6 +931,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	int kill = 1, forcekill;
>  	struct page *hpage = *hpagep;
>  	bool mlocked = PageMlocked(hpage);
> +	bool skip_pinned_pages = false;

I'm not sure we can afford to wait for page pins when handling page
poisoning. In an ideal world we should but... But I guess this is for
someone understanding memory poisoning better to judge.

> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6db729dc4c50..c137c43eb2ad 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -879,6 +879,26 @@ int page_referenced(struct page *page,
>  	return pra.referenced;
>  }
>  
> +/* Must be called with pinned_dma_lock held. */
> +static void wait_for_dma_pinned_to_clear(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	while (PageDmaPinnedFlags(page)) {
> +		spin_unlock(zone_gup_lock(zone));
> +
> +		schedule();
> +
> +		spin_lock(zone_gup_lock(zone));
> +	}
> +}

Ouch, we definitely need something better here. Either reuse the
page_waitqueue() mechanism or create at least a global wait queue for this
(I don't expect too much contention on the waitqueue and even if there
eventually is, we can switch to page_waitqueue() when we find it).  But
this is a no-go...

> +
> +struct page_mkclean_info {
> +	int cleaned;
> +	int skipped;
> +	bool skip_pinned_pages;
> +};
> +
>  static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  			    unsigned long address, void *arg)
>  {
> @@ -889,7 +909,24 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  		.flags = PVMW_SYNC,
>  	};
>  	unsigned long start = address, end;
> -	int *cleaned = arg;
> +	struct page_mkclean_info *mki = (struct page_mkclean_info *)arg;
> +	bool is_dma_pinned;
> +	struct zone *zone = page_zone(page);
> +
> +	/* Serialize with get_user_pages: */
> +	spin_lock(zone_gup_lock(zone));
> +	is_dma_pinned = PageDmaPinned(page);

Hum, why do you do this for each page table this is mapped in? Also the
locking is IMHO going to hurt a lot and we need to avoid it.

What I think needs to happen is that in page_mkclean(), after you've
cleared all the page tables, you check PageDmaPinned() and wait if needed.
Page cannot be faulted in again as we hold page lock and so races with
concurrent GUP are fairly limited. So with some careful ordering & memory
barriers you should be able to get away without any locking. Ditto for the
unmap path...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/6] mm: get_user_pages: consolidate error handling
  2018-07-02  0:56 ` [PATCH v2 1/6] mm: get_user_pages: consolidate error handling john.hubbard
@ 2018-07-02 10:17   ` Jan Kara
  2018-07-02 21:34     ` John Hubbard
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2018-07-02 10:17 UTC (permalink / raw)
  To: john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, Jan Kara, linux-mm, LKML,
	linux-rdma, linux-fsdevel, John Hubbard

On Sun 01-07-18 17:56:49, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> An upcoming patch requires a way to operate on each page that
> any of the get_user_pages_*() variants returns.
> 
> In preparation for that, consolidate the error handling for
> __get_user_pages(). This provides a single location (the "out:" label)
> for operating on the collected set of pages that are about to be returned.
> 
> As long every use of the "ret" variable is being edited, rename
> "ret" --> "err", so that its name matches its true role.
> This also gets rid of two shadowed variable declarations, as a
> tiny beneficial a side effect.
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

This looks nice! You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/gup.c | 37 ++++++++++++++++++++++---------------
>  1 file changed, 22 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index b70d7ba7cc13..73f0b3316fa7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		struct vm_area_struct **vmas, int *nonblocking)
>  {
>  	long i = 0;
> +	int err = 0;
>  	unsigned int page_mask;
>  	struct vm_area_struct *vma = NULL;
>  
> @@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		if (!vma || start >= vma->vm_end) {
>  			vma = find_extend_vma(mm, start);
>  			if (!vma && in_gate_area(mm, start)) {
> -				int ret;
> -				ret = get_gate_page(mm, start & PAGE_MASK,
> +				err = get_gate_page(mm, start & PAGE_MASK,
>  						gup_flags, &vma,
>  						pages ? &pages[i] : NULL);
> -				if (ret)
> -					return i ? : ret;
> +				if (err)
> +					goto out;
>  				page_mask = 0;
>  				goto next_page;
>  			}
>  
> -			if (!vma || check_vma_flags(vma, gup_flags))
> -				return i ? : -EFAULT;
> +			if (!vma || check_vma_flags(vma, gup_flags)) {
> +				err = -EFAULT;
> +				goto out;
> +			}
>  			if (is_vm_hugetlb_page(vma)) {
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> @@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		 * If we have a pending SIGKILL, don't keep faulting pages and
>  		 * potentially allocating memory.
>  		 */
> -		if (unlikely(fatal_signal_pending(current)))
> -			return i ? i : -ERESTARTSYS;
> +		if (unlikely(fatal_signal_pending(current))) {
> +			err = -ERESTARTSYS;
> +			goto out;
> +		}
>  		cond_resched();
>  		page = follow_page_mask(vma, start, foll_flags, &page_mask);
>  		if (!page) {
> -			int ret;
> -			ret = faultin_page(tsk, vma, start, &foll_flags,
> +			err = faultin_page(tsk, vma, start, &foll_flags,
>  					nonblocking);
> -			switch (ret) {
> +			switch (err) {
>  			case 0:
>  				goto retry;
>  			case -EFAULT:
>  			case -ENOMEM:
>  			case -EHWPOISON:
> -				return i ? i : ret;
> +				goto out;
>  			case -EBUSY:
> -				return i;
> +				err = 0;
> +				goto out;
>  			case -ENOENT:
>  				goto next_page;
>  			}
> @@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			 */
>  			goto next_page;
>  		} else if (IS_ERR(page)) {
> -			return i ? i : PTR_ERR(page);
> +			err = PTR_ERR(page);
> +			goto out;
>  		}
>  		if (pages) {
>  			pages[i] = page;
> @@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		start += page_increm * PAGE_SIZE;
>  		nr_pages -= page_increm;
>  	} while (nr_pages);
> -	return i;
> +
> +out:
> +	return i ? i : err;
>  }
>  
>  static bool vma_permits_fault(struct vm_area_struct *vma,
> -- 
> 2.18.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02  9:53   ` Jan Kara
@ 2018-07-02 20:43     ` John Hubbard
  2018-07-03  0:08       ` Christopher Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: John Hubbard @ 2018-07-02 20:43 UTC (permalink / raw)
  To: Jan Kara, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On 07/02/2018 02:53 AM, Jan Kara wrote:
> On Sun 01-07-18 17:56:53, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
> ...
> 
>> @@ -904,12 +907,24 @@ static inline void get_page(struct page *page)
>>  	 */
>>  	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
>>  	page_ref_inc(page);
>> +
>> +	if (unlikely(PageDmaPinned(page)))
>> +		__get_page_for_pinned_dma(page);
>>  }
>>  
>>  static inline void put_page(struct page *page)
>>  {
>>  	page = compound_head(page);
>>  
>> +	/* Because the page->dma_pinned_* fields are unioned with
>> +	 * page->lru, there is no way to do classical refcount-style
>> +	 * decrement-and-test-for-zero. Instead, PageDmaPinned(page) must
>> +	 * be checked, in order to safely check if we are allowed to decrement
>> +	 * page->dma_pinned_count at all.
>> +	 */
>> +	if (unlikely(PageDmaPinned(page)))
>> +		__put_page_for_pinned_dma(page);
>> +
> 
> These two are just wrong. You cannot make any page reference for
> PageDmaPinned() account against a pin count. First, it is just conceptually
> wrong as these references need not be long term pins, second, you can
> easily race like:
> 
> Pinner				Random process
> 				get_page(page)
> pin_page_for_dma()
> 				put_page(page)
> 				 -> oops, page gets unpinned too early
> 

I'll drop this approach, without mentioning any of the locking that is hiding in
there, since that was probably breaking other rules anyway. :) Thanks for your
patience in reviewing this.

> So you really have to create counterpart to get_user_pages() - like
> put_user_page() or whatever... It is inconvenient to have to modify all GUP
> users but I don't see a way around that. 

OK, there will be a long-ish pause, while I go visit all the gup sites. I count about
88 callers, which is not nearly as crazy as my first casual grep showed, but still
quite a chunk, since I have to track down where each one does its put_page call(s).

It's definitely worth the effort, though. These pins just plain need some special
handling in order to get everything correct.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages
  2018-07-02 10:15   ` Jan Kara
@ 2018-07-02 21:07     ` John Hubbard
  0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-02 21:07 UTC (permalink / raw)
  To: Jan Kara, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On 07/02/2018 03:15 AM, Jan Kara wrote:
> On Sun 01-07-18 17:56:54, john.hubbard@gmail.com wrote:
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 9d142b9b86dc..c4bc8d216746 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -931,6 +931,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>>  	int kill = 1, forcekill;
>>  	struct page *hpage = *hpagep;
>>  	bool mlocked = PageMlocked(hpage);
>> +	bool skip_pinned_pages = false;
> 
> I'm not sure we can afford to wait for page pins when handling page
> poisoning. In an ideal world we should but... But I guess this is for
> someone understanding memory poisoning better to judge.


OK, then until I hear otherwise, in the next version I'll set 
skipped_pinned_pages = true here, based on the idea that it's probably
better to be sure we don't hang while trying to remove a bad page. It's
hard to achieve perfection in the presence of a memory failure.

> 
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 6db729dc4c50..c137c43eb2ad 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -879,6 +879,26 @@ int page_referenced(struct page *page,
>>  	return pra.referenced;
>>  }
>>  
>> +/* Must be called with pinned_dma_lock held. */
>> +static void wait_for_dma_pinned_to_clear(struct page *page)
>> +{
>> +	struct zone *zone = page_zone(page);
>> +
>> +	while (PageDmaPinnedFlags(page)) {
>> +		spin_unlock(zone_gup_lock(zone));
>> +
>> +		schedule();
>> +
>> +		spin_lock(zone_gup_lock(zone));
>> +	}
>> +}
> 
> Ouch, we definitely need something better here. Either reuse the
> page_waitqueue() mechanism or create at least a global wait queue for this
> (I don't expect too much contention on the waitqueue and even if there
> eventually is, we can switch to page_waitqueue() when we find it).  But
> this is a no-go...

Yes, no problem. At one point I had a separate bit waiting queue, which was
only a few lines of code to do, but I dropped it because I thought that maybe 
it was overkill. I'll put it back in.

> 
>> +
>> +struct page_mkclean_info {
>> +	int cleaned;
>> +	int skipped;
>> +	bool skip_pinned_pages;
>> +};
>> +
>>  static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>  			    unsigned long address, void *arg)
>>  {
>> @@ -889,7 +909,24 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>  		.flags = PVMW_SYNC,
>>  	};
>>  	unsigned long start = address, end;
>> -	int *cleaned = arg;
>> +	struct page_mkclean_info *mki = (struct page_mkclean_info *)arg;
>> +	bool is_dma_pinned;
>> +	struct zone *zone = page_zone(page);
>> +
>> +	/* Serialize with get_user_pages: */
>> +	spin_lock(zone_gup_lock(zone));
>> +	is_dma_pinned = PageDmaPinned(page);
> 
> Hum, why do you do this for each page table this is mapped in? Also the
> locking is IMHO going to hurt a lot and we need to avoid it.
> 
> What I think needs to happen is that in page_mkclean(), after you've
> cleared all the page tables, you check PageDmaPinned() and wait if needed.
> Page cannot be faulted in again as we hold page lock and so races with
> concurrent GUP are fairly limited. So with some careful ordering & memory
> barriers you should be able to get away without any locking. Ditto for the
> unmap path...
> 

I guess I was thinking about this backwards. It would work much better if
we go ahead and write protect or unmap first, let things drain, and wait later.
Very nice!


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/6] mm: get_user_pages: consolidate error handling
  2018-07-02 10:17   ` Jan Kara
@ 2018-07-02 21:34     ` John Hubbard
  0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-02 21:34 UTC (permalink / raw)
  To: Jan Kara, john.hubbard
  Cc: Matthew Wilcox, Michal Hocko, Christopher Lameter,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On 07/02/2018 03:17 AM, Jan Kara wrote:
> On Sun 01-07-18 17:56:49, john.hubbard@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> An upcoming patch requires a way to operate on each page that
>> any of the get_user_pages_*() variants returns.
>>
>> In preparation for that, consolidate the error handling for
>> __get_user_pages(). This provides a single location (the "out:" label)
>> for operating on the collected set of pages that are about to be returned.
>>
>> As long every use of the "ret" variable is being edited, rename
>> "ret" --> "err", so that its name matches its true role.
>> This also gets rid of two shadowed variable declarations, as a
>> tiny beneficial a side effect.
>>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> 
> This looks nice! You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 

Great, thanks for the review!

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-02 20:43     ` John Hubbard
@ 2018-07-03  0:08       ` Christopher Lameter
  2018-07-03  4:30         ` John Hubbard
  0 siblings, 1 reply; 30+ messages in thread
From: Christopher Lameter @ 2018-07-03  0:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On Mon, 2 Jul 2018, John Hubbard wrote:

> >
> > These two are just wrong. You cannot make any page reference for
> > PageDmaPinned() account against a pin count. First, it is just conceptually
> > wrong as these references need not be long term pins, second, you can
> > easily race like:
> >
> > Pinner				Random process
> > 				get_page(page)
> > pin_page_for_dma()
> > 				put_page(page)
> > 				 -> oops, page gets unpinned too early
> >
>
> I'll drop this approach, without mentioning any of the locking that is hiding in
> there, since that was probably breaking other rules anyway. :) Thanks for your
> patience in reviewing this.

Mayb the following would work:

If you establish a reference to a page then increase the page count. If
the reference is a dma pin action also then increase the pinned count.

That way you know how many of the references to the page are dma
pins and you can correctly manage the state of the page if the dma pins go
away.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-03  0:08       ` Christopher Lameter
@ 2018-07-03  4:30         ` John Hubbard
  2018-07-03 17:08           ` Christopher Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: John Hubbard @ 2018-07-03  4:30 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On 07/02/2018 05:08 PM, Christopher Lameter wrote:
> On Mon, 2 Jul 2018, John Hubbard wrote:
> 
>>>
>>> These two are just wrong. You cannot make any page reference for
>>> PageDmaPinned() account against a pin count. First, it is just conceptually
>>> wrong as these references need not be long term pins, second, you can
>>> easily race like:
>>>
>>> Pinner				Random process
>>> 				get_page(page)
>>> pin_page_for_dma()
>>> 				put_page(page)
>>> 				 -> oops, page gets unpinned too early
>>>
>>
>> I'll drop this approach, without mentioning any of the locking that is hiding in
>> there, since that was probably breaking other rules anyway. :) Thanks for your
>> patience in reviewing this.
> 
> Mayb the following would work:
> 
> If you establish a reference to a page then increase the page count. If
> the reference is a dma pin action also then increase the pinned count.
> 
> That way you know how many of the references to the page are dma
> pins and you can correctly manage the state of the page if the dma pins go
> away.
> 

I think this sounds like what this patch already does, right? See:
__put_page_for_pinned_dma(), __get_page_for_pinned_dma(), and 
pin_page_for_dma(). The locking seems correct to me, but I suspect it's 
too heavyweight for such a hot path. But without adding a new put_user_page()
call, that was the best I could come up with.

What I'm hearing now from Jan and Michal is that the desired end result is
a separate API call, put_user_pages(), so that we can explicitly manage
these pinned pages.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-03  4:30         ` John Hubbard
@ 2018-07-03 17:08           ` Christopher Lameter
  2018-07-03 17:36             ` John Hubbard
  0 siblings, 1 reply; 30+ messages in thread
From: Christopher Lameter @ 2018-07-03 17:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On Mon, 2 Jul 2018, John Hubbard wrote:

> > If you establish a reference to a page then increase the page count. If
> > the reference is a dma pin action also then increase the pinned count.
> >
> > That way you know how many of the references to the page are dma
> > pins and you can correctly manage the state of the page if the dma pins go
> > away.
> >
>
> I think this sounds like what this patch already does, right? See:
> __put_page_for_pinned_dma(), __get_page_for_pinned_dma(), and
> pin_page_for_dma(). The locking seems correct to me, but I suspect it's
> too heavyweight for such a hot path. But without adding a new put_user_page()
> call, that was the best I could come up with.

When I saw the patch it looked like you were avoiding to increment the
page->count field.

> What I'm hearing now from Jan and Michal is that the desired end result is
> a separate API call, put_user_pages(), so that we can explicitly manage
> these pinned pages.

Certainly a good approach.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-03 17:08           ` Christopher Lameter
@ 2018-07-03 17:36             ` John Hubbard
  2018-07-03 17:48               ` Christopher Lameter
  2018-07-04 10:43               ` Jan Kara
  0 siblings, 2 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-03 17:36 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On 07/03/2018 10:08 AM, Christopher Lameter wrote:
> On Mon, 2 Jul 2018, John Hubbard wrote:
> 
>>> If you establish a reference to a page then increase the page count. If
>>> the reference is a dma pin action also then increase the pinned count.
>>>
>>> That way you know how many of the references to the page are dma
>>> pins and you can correctly manage the state of the page if the dma pins go
>>> away.
>>>
>>
>> I think this sounds like what this patch already does, right? See:
>> __put_page_for_pinned_dma(), __get_page_for_pinned_dma(), and
>> pin_page_for_dma(). The locking seems correct to me, but I suspect it's
>> too heavyweight for such a hot path. But without adding a new put_user_page()
>> call, that was the best I could come up with.
> 
> When I saw the patch it looked like you were avoiding to increment the
> page->count field.

Looking at it again, this patch is definitely susceptible to Jan's "page gets
dma-unpinnned too soon" problem.  That leaves a window in which the original
problem can occur.

The page->_refcount field is used normally, in addition to the dma_pinned_count.
But the problem is that, unless the caller knows what kind of page it is,
the page->dma_pinned_count cannot be looked at, because it is unioned with
page->lru.prev.  page->dma_pinned_flags, at least starting at bit 1, are 
safe to look at due to pointer alignment, but now you cannot atomically 
count...

So this seems unsolvable without having the caller specify that it knows the
page type, and that it is therefore safe to decrement page->dma_pinned_count.
I was hoping I'd found a way, but clearly I haven't. :)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-03 17:36             ` John Hubbard
@ 2018-07-03 17:48               ` Christopher Lameter
  2018-07-03 18:48                 ` John Hubbard
  2018-07-04 10:43               ` Jan Kara
  1 sibling, 1 reply; 30+ messages in thread
From: Christopher Lameter @ 2018-07-03 17:48 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On Tue, 3 Jul 2018, John Hubbard wrote:

> The page->_refcount field is used normally, in addition to the dma_pinned_count.
> But the problem is that, unless the caller knows what kind of page it is,
> the page->dma_pinned_count cannot be looked at, because it is unioned with
> page->lru.prev.  page->dma_pinned_flags, at least starting at bit 1, are
> safe to look at due to pointer alignment, but now you cannot atomically
> count...
>
> So this seems unsolvable without having the caller specify that it knows the
> page type, and that it is therefore safe to decrement page->dma_pinned_count.
> I was hoping I'd found a way, but clearly I haven't. :)

Try to find some way to indicate that the page is pinned by using some of
the existing page flags? There is already an MLOCK flag. Maybe some
creativity with that can lead to something (but then the MLOCKed pages are
on the unevictable LRU....). cgroups used to have something called struct
page_ext. Oh its there in linux/mm/page_ext.c.





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-03 17:48               ` Christopher Lameter
@ 2018-07-03 18:48                 ` John Hubbard
  0 siblings, 0 replies; 30+ messages in thread
From: John Hubbard @ 2018-07-03 18:48 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jan Kara, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On 07/03/2018 10:48 AM, Christopher Lameter wrote:
> On Tue, 3 Jul 2018, John Hubbard wrote:
> 
>> The page->_refcount field is used normally, in addition to the dma_pinned_count.
>> But the problem is that, unless the caller knows what kind of page it is,
>> the page->dma_pinned_count cannot be looked at, because it is unioned with
>> page->lru.prev.  page->dma_pinned_flags, at least starting at bit 1, are
>> safe to look at due to pointer alignment, but now you cannot atomically
>> count...
>>
>> So this seems unsolvable without having the caller specify that it knows the
>> page type, and that it is therefore safe to decrement page->dma_pinned_count.
>> I was hoping I'd found a way, but clearly I haven't. :)
> 
> Try to find some way to indicate that the page is pinned by using some of
> the existing page flags? There is already an MLOCK flag. Maybe some
> creativity with that can lead to something (but then the MLOCKed pages are
> on the unevictable LRU....). cgroups used to have something called struct
> page_ext. Oh its there in linux/mm/page_ext.c.
> 

Yes, that would provide just a touch more cabability: we could both read and
write a dma-pinned page(_ext) flag safely, instead of only being able to just 
read.  I'm doubt that that's enough additional information, though. The general
problem of allowing random put_page() calls to decrement the dma-pinned count (see
Jan's diagram at the beginning of this thread) cannot be solved by anything less
than some sort of "who (or which special type of caller, at least) owns this page"
approach, as far as I can see. The put_user_pages() provides arguably the simplest 
version of that kind of solution.

Also, even just using a single bit from page extensions would cost some extra memory, 
because for example on 64-bit systems many configurations do not need the additional 
flags that page_ext.h provides, so they return "false" from the page_ext_operations.need() 
callback. Changing get_user_pages to require page extensions would lead to
*every* configuration requiring page extensions, so 64-bit users would lose some memory
for sure. On the other hand, it avoids the "take page off of the LRU" complexity that 
I've got here. But again, I don't think a single flag, or even a count, would actually 
solve the problem.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-03 17:36             ` John Hubbard
  2018-07-03 17:48               ` Christopher Lameter
@ 2018-07-04 10:43               ` Jan Kara
  2018-07-05 14:17                 ` Christopher Lameter
  1 sibling, 1 reply; 30+ messages in thread
From: Jan Kara @ 2018-07-04 10:43 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christopher Lameter, Jan Kara, john.hubbard, Matthew Wilcox,
	Michal Hocko, Jason Gunthorpe, Dan Williams, linux-mm, LKML,
	linux-rdma, linux-fsdevel

On Tue 03-07-18 10:36:05, John Hubbard wrote:
> On 07/03/2018 10:08 AM, Christopher Lameter wrote:
> > On Mon, 2 Jul 2018, John Hubbard wrote:
> > 
> >>> If you establish a reference to a page then increase the page count. If
> >>> the reference is a dma pin action also then increase the pinned count.
> >>>
> >>> That way you know how many of the references to the page are dma
> >>> pins and you can correctly manage the state of the page if the dma pins go
> >>> away.
> >>>
> >>
> >> I think this sounds like what this patch already does, right? See:
> >> __put_page_for_pinned_dma(), __get_page_for_pinned_dma(), and
> >> pin_page_for_dma(). The locking seems correct to me, but I suspect it's
> >> too heavyweight for such a hot path. But without adding a new put_user_page()
> >> call, that was the best I could come up with.
> > 
> > When I saw the patch it looked like you were avoiding to increment the
> > page->count field.
> 
> Looking at it again, this patch is definitely susceptible to Jan's "page gets
> dma-unpinnned too soon" problem.  That leaves a window in which the original
> problem can occur.
> 
> The page->_refcount field is used normally, in addition to the dma_pinned_count.
> But the problem is that, unless the caller knows what kind of page it is,
> the page->dma_pinned_count cannot be looked at, because it is unioned with
> page->lru.prev.  page->dma_pinned_flags, at least starting at bit 1, are 
> safe to look at due to pointer alignment, but now you cannot atomically 
> count...
> 
> So this seems unsolvable without having the caller specify that it knows the
> page type, and that it is therefore safe to decrement page->dma_pinned_count.
> I was hoping I'd found a way, but clearly I haven't. :)

Well, I think the misconception is that "pinned" is a fundamental property
of a page. It is not. "pinned" is a property of a page reference (i.e., a
kind of reference that can be used for DMA access) and page gets into
"pinned" state if it has any reference of "pinned" type. And when you
realize this, it is obvious that you just have to have a special api for
getting and dropping references of this "pinned" type. For getting we
already have get_user_pages(), for putting we have to create the api...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-04 10:43               ` Jan Kara
@ 2018-07-05 14:17                 ` Christopher Lameter
  2018-07-09 13:49                   ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Christopher Lameter @ 2018-07-05 14:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, john.hubbard, Matthew Wilcox, Michal Hocko,
	Jason Gunthorpe, Dan Williams, linux-mm, LKML, linux-rdma,
	linux-fsdevel

On Wed, 4 Jul 2018, Jan Kara wrote:

> > So this seems unsolvable without having the caller specify that it knows the
> > page type, and that it is therefore safe to decrement page->dma_pinned_count.
> > I was hoping I'd found a way, but clearly I haven't. :)
>
> Well, I think the misconception is that "pinned" is a fundamental property
> of a page. It is not. "pinned" is a property of a page reference (i.e., a
> kind of reference that can be used for DMA access) and page gets into
> "pinned" state if it has any reference of "pinned" type. And when you
> realize this, it is obvious that you just have to have a special api for
> getting and dropping references of this "pinned" type. For getting we
> already have get_user_pages(), for putting we have to create the api...

Maybe we can do something by creating a special "pinned" bit in the pte?
If it is a RDMA reference then set that pinned bit there.

Thus any of the references could cause a pin. Since the page struct does
not contain that information we therefore have to scan through the ptes to
figure out if a page is pinned?

If so then we would not need a special function for dropping the
reference.

References to a page can also be created from devices mmu. Maybe we could
at some point start to manage them in a similar way to the page tables of
the processor? The mmu notifiers are a bit awkward if we need closer mm
integration.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields
  2018-07-05 14:17                 ` Christopher Lameter
@ 2018-07-09 13:49                   ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2018-07-09 13:49 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jan Kara, John Hubbard, john.hubbard, Matthew Wilcox,
	Michal Hocko, Jason Gunthorpe, Dan Williams, linux-mm, LKML,
	linux-rdma, linux-fsdevel

On Thu 05-07-18 14:17:19, Christopher Lameter wrote:
> On Wed, 4 Jul 2018, Jan Kara wrote:
> 
> > > So this seems unsolvable without having the caller specify that it knows the
> > > page type, and that it is therefore safe to decrement page->dma_pinned_count.
> > > I was hoping I'd found a way, but clearly I haven't. :)
> >
> > Well, I think the misconception is that "pinned" is a fundamental property
> > of a page. It is not. "pinned" is a property of a page reference (i.e., a
> > kind of reference that can be used for DMA access) and page gets into
> > "pinned" state if it has any reference of "pinned" type. And when you
> > realize this, it is obvious that you just have to have a special api for
> > getting and dropping references of this "pinned" type. For getting we
> > already have get_user_pages(), for putting we have to create the api...
> 
> Maybe we can do something by creating a special "pinned" bit in the pte?
> If it is a RDMA reference then set that pinned bit there.
> 
> Thus any of the references could cause a pin. Since the page struct does
> not contain that information we therefore have to scan through the ptes to
> figure out if a page is pinned?
> 
> If so then we would not need a special function for dropping the
> reference.

I don't really see how a PTE bit would help in getting rid of the special
function for dropping "pinned" reference. You still need to distinguish
preexisting page references (and corresponding page ref drops which must
not unpin the page) from the references acquired after transitioning PTE to
the pinned state...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2018-07-09 13:49 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-02  0:56 [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers john.hubbard
2018-07-02  0:56 ` [PATCH v2 1/6] mm: get_user_pages: consolidate error handling john.hubbard
2018-07-02 10:17   ` Jan Kara
2018-07-02 21:34     ` John Hubbard
2018-07-02  0:56 ` [PATCH v2 2/6] mm: introduce page->dma_pinned_flags, _count john.hubbard
2018-07-02  0:56 ` [PATCH v2 3/6] mm: introduce zone_gup_lock, for dma-pinned pages john.hubbard
2018-07-02  0:56 ` [PATCH v2 4/6] mm/fs: add a sync_mode param for clear_page_dirty_for_io() john.hubbard
2018-07-02  2:11   ` kbuild test robot
2018-07-02  4:40     ` John Hubbard
2018-07-02  2:47   ` kbuild test robot
2018-07-02  4:40     ` John Hubbard
2018-07-02  0:56 ` [PATCH v2 5/6] mm: track gup pages with page->dma_pinned_* fields john.hubbard
2018-07-02  2:11   ` kbuild test robot
2018-07-02  2:58   ` kbuild test robot
2018-07-02  5:05     ` John Hubbard
2018-07-02  9:53   ` Jan Kara
2018-07-02 20:43     ` John Hubbard
2018-07-03  0:08       ` Christopher Lameter
2018-07-03  4:30         ` John Hubbard
2018-07-03 17:08           ` Christopher Lameter
2018-07-03 17:36             ` John Hubbard
2018-07-03 17:48               ` Christopher Lameter
2018-07-03 18:48                 ` John Hubbard
2018-07-04 10:43               ` Jan Kara
2018-07-05 14:17                 ` Christopher Lameter
2018-07-09 13:49                   ` Jan Kara
2018-07-02  0:56 ` [PATCH v2 6/6] mm: page_mkclean, ttu: handle pinned pages john.hubbard
2018-07-02 10:15   ` Jan Kara
2018-07-02 21:07     ` John Hubbard
2018-07-02  5:54 ` [PATCH v2 0/6] mm/fs: gup: don't unmap or drop filesystem buffers John Hubbard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).