LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCHv4 00/24] THP refcounting redesign
@ 2015-03-04 16:32 Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 01/24] thp: cluster split_huge_page* code together Kirill A. Shutemov
                   ` (26 more replies)
  0 siblings, 27 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Hello everybody,

It's bug-fix update of my thp refcounting work.

The goal of patchset is to make refcounting on THP pages cheaper with
simpler semantics and allow the same THP compound page to be mapped with
PMD and PTEs. This is required to get reasonable THP-pagecache
implementation.

With the new refcounting design it's much easier to protect against
split_huge_page(): simple reference on a page will make you the deal.
It makes gup_fast() implementation simpler and doesn't require
special-case in futex code to handle tail THP pages.

It should improve THP utilization over the system since splitting THP in
one process doesn't necessary lead to splitting the page in all other
processes have the page mapped.

= Changelog =

v4:
  - fix sizes reported in smaps;
  - defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
  - skip THP pages on munlock_vma_pages_range(): they are never mlocked;
  - properly handle huge zero page on FOLL_SPLIT;
  - fix lock_page() slow path on tail pages;
  - account page_get_anon_vma() fail to THP_SPLIT_PAGE_FAILED;
  - fix split_huge_page() on huge page with unmapped head page;
  - fix transfering 'write' and 'young' from pmd to ptes on split_huge_pmd;
  - call page_remove_rmap() in unfreeze_page under ptl.

= Design overview =

The main reason why we can't map THP with 4k is how refcounting on THP
designed. It built around two requirements:

  - split of huge page should never fail;
  - we can't change interface of get_user_page();

To be able to split huge page at any point we have to track which tail
page was pinned. It leads to tricky and expensive get_page() on tail pages
and also occupy tail_page->_mapcount.

Most split_huge_page*() users want PMD to be split into table of PTEs and
don't care whether compound page is going to be split or not.

The plan is:

 - allow split_huge_page() to fail if the page is pinned. It's trivial to
   split non-pinned page and it doesn't require tail page refcounting, so
   tail_page->_mapcount is free to be reused.

 - introduce new routine -- split_huge_pmd() -- to split PMD into table of
   PTEs. It splits only one PMD, not touching other PMDs the page is
   mapped with or underlying compound page. Unlike new split_huge_page(),
   split_huge_pmd() never fails.

Fortunately, we have only few places where split_huge_page() is needed:
swap out, memory failure, migration, KSM. And all of them can handle
split_huge_page() fail.

In new scheme we use page->_mapcount is used to account how many time
the page is mapped with PTEs. We have separate compound_mapcount() to
count mappings with PMD. page_mapcount() returns sum of PTE and PMD
mappings of the page.

Introducing split_huge_pmd() effectively allows THP to be mapped with 4k.
It may be a surprise to some code to see a PTE which points to tail page
or VMA start/end in the middle of compound page.

munmap() part of THP will split PMD, but doesn't split the huge page. In
order to take memory consuption under control we put partially unmapped
huge page on per-zone list, which would be drained on first shrink_zone()
call. This way we also avoid unnecessary split_huge_page() on exit(2) if a
THP belong to more than one VMA.

= Patches overview =

Patch 1:
        Move split_huge_page code around. Preparation for future changes.

Patches 2-3:
        Make PageAnon() and PG_locked related helpers to look on head
        page if tail page is passed. It's required since pte_page() can
        now point to tail page. It's likely that we need to change other
        pageflags-related helpers too, but I haven't step on any other
        yet.

Patch 4:
        With PTE-mapeed THP, rmap cannot rely on PageTransHuge() check to
        decide if map small page or THP. We need to get the info from
        caller.

Patch 5:
        We need to look on all subpages of compound page to calculate
        correct PSS, because they can have different mapcount.

Patch 6:
        Store mapcount for compound pages separately: in the first tail
        page ->mapping.

Patch 7:
        Adjust conditions when we can re-use the page on write-protection
        fault.

Patch 8:
        FOLL_SPLIT should be handled on PTE level too.

Patch 9:
        Split all pages in mlocked VMA. We would need to look on this
        again later.

Patch 10:
        Make khugepaged aware about PTE-mapped huge pages.

Patch 11:
        split_huge_page_pmd() to split_huge_pmd() to reflect that page is
        not going to be split, only PMD.

Patch 12:
        Temporary make split_huge_page() to return -EBUSY on all split
        requests. This allows to drop tail-page refcounting and change
        implementation of split_huge_pmd() to split PMD to table of PTEs
        without splitting compound page.

Patch 13:
        New THP_SPLIT_* vmstats.

Patch 14:
        Implement new split_huge_page() which fails if the page is pinned.
        For now, we rely on compound_lock() to make page counts stable.

Patches 15-16:
        Drop infrastructure for handling PMD splitting. We don't use it
        anymore in split_huge_page(). For now we only remove it from
        generic code and x86. I'll cleanup other architectures later.

Patch 17:
        Remove ugly special case if futex happened to be in tail THP page.
        With new refcounting it much easier to protect against split.

Patches 18-20:
        Replaces compound_lock with migration entries as mechanism to
        freeze page counts on split_huge_page(). We don't need
        compound_lock anymore. It makes get_page()/put_page() on tail
        pages faster.

Patch 21:
        Handle partial unmap of THP. We put partially unmapped huge page
        on per-zone list, which would be drained on first shrink_zone()
        call. This way we also avoid unnecessary split_huge_page() on
        exit(2) if a THP belong to more than one VMA.

Patch 22:
        Make memcg aware about new refcounting. Validation needed.

Patch 23:
        Fix never-succeed split_huge_page() inside KSM machinery.

Patch 24:
        Documentation update.

I believe all known bugs have been fixed, but I'm sure Sasha will bring more
reports.

The patchset also available on git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v4

Comments?
Kirill A. Shutemov (24):
  thp: cluster split_huge_page* code together
  mm: change PageAnon() and page_anon_vma() to work on tail pages
  mm: avoid PG_locked on tail pages
  rmap: add argument to charge compound page
  mm, proc: adjust PSS calculation
  mm: store mapcount for compound page separately
  mm, thp: adjust conditions when we can reuse the page on WP fault
  mm: adjust FOLL_SPLIT for new refcounting
  thp, mlock: do not allow huge pages in mlocked area
  khugepaged: ignore pmd tables with THP mapped with ptes
  thp: rename split_huge_page_pmd() to split_huge_pmd()
  thp: PMD splitting without splitting compound page
  mm, vmstats: new THP splitting event
  thp: implement new split_huge_page()
  mm, thp: remove infrastructure for handling splitting PMDs
  x86, thp: remove infrastructure for handling splitting PMDs
  futex, thp: remove special case for THP in get_futex_key
  thp, mm: split_huge_page(): caller need to lock page
  thp, mm: use migration entries to freeze page counts on split
  mm, thp: remove compound_lock
  thp: introduce deferred_split_huge_page()
  memcg: adjust to support new THP refcounting
  ksm: split huge pages on follow_page()
  thp: update documentation

 Documentation/vm/transhuge.txt       | 100 ++--
 arch/mips/mm/gup.c                   |   4 -
 arch/powerpc/mm/hugetlbpage.c        |  13 +-
 arch/powerpc/mm/subpage-prot.c       |   2 +-
 arch/s390/mm/gup.c                   |  13 +-
 arch/sparc/mm/gup.c                  |  14 +-
 arch/x86/include/asm/pgtable.h       |   9 -
 arch/x86/include/asm/pgtable_types.h |   2 -
 arch/x86/kernel/vm86_32.c            |   6 +-
 arch/x86/mm/gup.c                    |  17 +-
 arch/x86/mm/pgtable.c                |  14 -
 fs/proc/task_mmu.c                   |  51 ++-
 include/asm-generic/pgtable.h        |   5 -
 include/linux/huge_mm.h              |  40 +-
 include/linux/hugetlb_inline.h       |   9 +-
 include/linux/memcontrol.h           |  16 +-
 include/linux/migrate.h              |   3 +
 include/linux/mm.h                   | 134 ++----
 include/linux/mm_types.h             |  20 +-
 include/linux/mmzone.h               |   5 +
 include/linux/page-flags.h           |  15 +-
 include/linux/pagemap.h              |  14 +-
 include/linux/rmap.h                 |  17 +-
 include/linux/swap.h                 |   3 +-
 include/linux/vm_event_item.h        |   4 +-
 kernel/events/uprobes.c              |  11 +-
 kernel/futex.c                       |  61 +--
 mm/debug.c                           |   8 +-
 mm/filemap.c                         |  19 +-
 mm/gup.c                             |  96 ++--
 mm/huge_memory.c                     | 866 ++++++++++++++++++-----------------
 mm/hugetlb.c                         |   8 +-
 mm/internal.h                        |  57 +--
 mm/ksm.c                             |  60 +--
 mm/madvise.c                         |   2 +-
 mm/memcontrol.c                      |  76 +--
 mm/memory-failure.c                  |  12 +-
 mm/memory.c                          |  67 ++-
 mm/mempolicy.c                       |   2 +-
 mm/migrate.c                         |  97 ++--
 mm/mincore.c                         |   2 +-
 mm/mlock.c                           |  51 +--
 mm/mprotect.c                        |   2 +-
 mm/mremap.c                          |   2 +-
 mm/page_alloc.c                      |  13 +-
 mm/pagewalk.c                        |   2 +-
 mm/pgtable-generic.c                 |  14 -
 mm/rmap.c                            | 121 +++--
 mm/shmem.c                           |  21 +-
 mm/slub.c                            |   2 +
 mm/swap.c                            | 260 ++---------
 mm/swapfile.c                        |  16 +-
 mm/vmscan.c                          |   3 +
 mm/vmstat.c                          |   4 +-
 54 files changed, 1069 insertions(+), 1416 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 01/24] thp: cluster split_huge_page* code together
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 02/24] mm: change PageAnon() and page_anon_vma() to work on tail pages Kirill A. Shutemov
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Rearrange code in mm/huge_memory.c to make future changes somewhat
easier.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 224 +++++++++++++++++++++++++++----------------------------
 1 file changed, 112 insertions(+), 112 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e6dfa527a35..8da6d46c5fb9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1622,6 +1622,118 @@ int pmd_freeable(pmd_t pmd)
 	return !pmd_dirty(pmd);
 }
 
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	put_huge_zero_page();
+}
+
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	struct page *page;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
+
+	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+again:
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(ptl);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		return;
+	}
+	if (is_huge_zero_pmd(*pmd)) {
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		spin_unlock(ptl);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		return;
+	}
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	get_page(page);
+	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+	split_huge_page(page);
+
+	put_page(page);
+
+	/*
+	 * We don't always have down_write of mmap_sem here: a racing
+	 * do_huge_pmd_wp_page() might have copied-on-write to another
+	 * huge page before our split_huge_page() got the anon_vma lock.
+	 */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		goto again;
+}
+
+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+		pmd_t *pmd)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	BUG_ON(vma == NULL);
+	split_huge_page_pmd(vma, address, pmd);
+}
+
+static void split_huge_page_address(struct mm_struct *mm,
+				    unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return;
+	/*
+	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
+	 * materialize from under us.
+	 */
+	split_huge_page_pmd_mm(mm, address, pmd);
+}
+
 static int __split_huge_page_splitting(struct page *page,
 				       struct vm_area_struct *vma,
 				       unsigned long address)
@@ -2870,118 +2982,6 @@ static int khugepaged(void *none)
 	return 0;
 }
 
-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
-	pmd_t _pmd;
-	int i;
-
-	pmdp_clear_flush_notify(vma, haddr, pmd);
-	/* leave pmd empty until pte is filled */
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pmd_populate(mm, &_pmd, pgtable);
-
-	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-		pte_t *pte, entry;
-		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
-		entry = pte_mkspecial(entry);
-		pte = pte_offset_map(&_pmd, haddr);
-		VM_BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
-	}
-	smp_wmb(); /* make pte visible before pmd */
-	pmd_populate(mm, pmd, pgtable);
-	put_huge_zero_page();
-}
-
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd)
-{
-	spinlock_t *ptl;
-	struct page *page;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
-
-	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
-
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	if (is_huge_zero_pmd(*pmd)) {
-		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
-	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	split_huge_page(page);
-
-	put_page(page);
-
-	/*
-	 * We don't always have down_write of mmap_sem here: a racing
-	 * do_huge_pmd_wp_page() might have copied-on-write to another
-	 * huge page before our split_huge_page() got the anon_vma lock.
-	 */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		goto again;
-}
-
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd)
-{
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
-				    unsigned long address)
-{
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-
-	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		return;
-
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		return;
-
-	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
-		return;
-	/*
-	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
-	 * materialize from under us.
-	 */
-	split_huge_page_pmd_mm(mm, address, pmd);
-}
-
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 			     unsigned long start,
 			     unsigned long end,
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 02/24] mm: change PageAnon() and page_anon_vma() to work on tail pages
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 01/24] thp: cluster split_huge_page* code together Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 03/24] mm: avoid PG_locked " Kirill A. Shutemov
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Currently PageAnon() and page_anon_vma() are always return false/NULL
for tail. We need to look on head page for correct answer.

Let's change the function to give the correct result for tail page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h   | 1 +
 include/linux/rmap.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c21b42d07bf..1aea94e837a0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1047,6 +1047,7 @@ struct address_space *page_file_mapping(struct page *page)
 
 static inline int PageAnon(struct page *page)
 {
+	page = compound_head(page);
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9c5ff69fa0cd..c4088feac1fc 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -108,6 +108,7 @@ static inline void put_anon_vma(struct anon_vma *anon_vma)
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
+	page = compound_head(page);
 	if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
 					    PAGE_MAPPING_ANON)
 		return NULL;
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 03/24] mm: avoid PG_locked on tail pages
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 01/24] thp: cluster split_huge_page* code together Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 02/24] mm: change PageAnon() and page_anon_vma() to work on tail pages Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 18:48   ` Christoph Lameter
  2015-03-04 16:32 ` [PATCHv4 04/24] rmap: add argument to charge compound page Kirill A. Shutemov
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting pte entries can point to tail pages. It's doesn't
make much sense to mark tail page locked -- we need to protect whole
compound page.

This patch adjust helpers related to PG_locked to operate on head page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/page-flags.h |  3 ++-
 include/linux/pagemap.h    |  5 +++++
 mm/filemap.c               | 11 +++++++----
 mm/slub.c                  |  2 ++
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c851ff92d5b3..58b98bced299 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -207,7 +207,8 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
 
 struct page;	/* forward declaration */
 
-TESTPAGEFLAG(Locked, locked)
+#define PageLocked(page) test_bit(PG_locked, &compound_head(page)->flags)
+
 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
 	__SETPAGEFLAG(Referenced, referenced)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3736f7065c..ad6da4e49555 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -428,16 +428,19 @@ extern void unlock_page(struct page *page);
 
 static inline void __set_page_locked(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	__set_bit(PG_locked, &page->flags);
 }
 
 static inline void __clear_page_locked(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	__clear_bit(PG_locked, &page->flags);
 }
 
 static inline int trylock_page(struct page *page)
 {
+	page = compound_head(page);
 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
 }
 
@@ -490,6 +493,7 @@ extern int wait_on_page_bit_killable_timeout(struct page *page,
 
 static inline int wait_on_page_locked_killable(struct page *page)
 {
+	page = compound_head(page);
 	if (PageLocked(page))
 		return wait_on_page_bit_killable(page, PG_locked);
 	return 0;
@@ -510,6 +514,7 @@ static inline void wake_up_page(struct page *page, int bit)
  */
 static inline void wait_on_page_locked(struct page *page)
 {
+	page = compound_head(page);
 	if (PageLocked(page))
 		wait_on_page_bit(page, PG_locked);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 434dba317400..a0c3a57d29c9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -743,6 +743,7 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  */
 void unlock_page(struct page *page)
 {
+	page = compound_head(page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
@@ -807,18 +808,20 @@ EXPORT_SYMBOL_GPL(page_endio);
  */
 void __lock_page(struct page *page)
 {
-	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	struct page *page_head = compound_head(page);
+	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, bit_wait_io,
+	__wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
 int __lock_page_killable(struct page *page)
 {
-	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	struct page *page_head = compound_head(page);
+	DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
+	return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
 					bit_wait_io, TASK_KILLABLE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
diff --git a/mm/slub.c b/mm/slub.c
index 6832c4eab104..37205f648294 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -338,11 +338,13 @@ static inline int oo_objects(struct kmem_cache_order_objects x)
  */
 static __always_inline void slab_lock(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
 static __always_inline void slab_unlock(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 04/24] rmap: add argument to charge compound page
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 03/24] mm: avoid PG_locked " Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 05/24] mm, proc: adjust PSS calculation Kirill A. Shutemov
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of THP compound
page. It means we cannot rely on PageTransHuge() check to decide if map
small page or THP.

The patch adds new argument to rmap function to indicate whethe we want
to map whole compound page or only the small page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h    | 12 +++++++++---
 kernel/events/uprobes.c |  4 ++--
 mm/huge_memory.c        | 16 ++++++++--------
 mm/hugetlb.c            |  4 ++--
 mm/ksm.c                |  4 ++--
 mm/memory.c             | 14 +++++++-------
 mm/migrate.c            |  8 ++++----
 mm/rmap.c               | 43 +++++++++++++++++++++++++++----------------
 mm/swapfile.c           |  4 ++--
 9 files changed, 63 insertions(+), 46 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4088feac1fc..ebe50ceacea6 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -168,16 +168,22 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *page_get_anon_vma(struct page *page);
 
+/* bitflags for do_page_add_anon_rmap() */
+#define RMAP_EXCLUSIVE 0x01
+#define RMAP_COMPOUND 0x02
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f26a22d..5523daf59953 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		goto unlock;
 
 	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
+	page_add_new_anon_rmap(kpage, vma, addr, false);
 	mem_cgroup_commit_charge(kpage, memcg, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	pte_unmap_unlock(ptep, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8da6d46c5fb9..38c6b72cbe80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -743,7 +743,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pmd_t entry;
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr);
+		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1034,7 +1034,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vma, haddr);
+		page_add_new_anon_rmap(pages[i], vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
@@ -1046,7 +1046,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 	spin_unlock(ptl);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1168,7 +1168,7 @@ alloc:
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush_notify(vma, haddr, pmd);
-		page_add_new_anon_rmap(new_page, vma, haddr);
+		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -1178,7 +1178,7 @@ alloc:
 			put_huge_zero_page();
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1431,7 +1431,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2380,7 +2380,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page);
+			page_remove_rmap(src_page, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -2670,7 +2670,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
+	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0a9ac6c26832..ebb7329301c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2688,7 +2688,7 @@ again:
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, true);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
 			address += sz;
@@ -2908,7 +2908,7 @@ retry_avoidcopy:
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page);
+		page_remove_rmap(old_page, true);
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index 4162dce2eb44..92182eeba87d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	page_add_anon_rmap(kpage, vma, addr, false);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 3f858a136f3e..f4fd3e078b51 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1125,7 +1125,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2112,7 +2112,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address);
+		page_add_new_anon_rmap(new_page, vma, address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2145,7 +2145,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, false);
 		}
 
 		/* Free the old page.. */
@@ -2526,7 +2526,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
-		exclusive = 1;
+		exclusive = RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(orig_pte))
@@ -2536,7 +2536,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		do_page_add_anon_rmap(page, vma, address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -2674,7 +2674,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address);
+	page_add_new_anon_rmap(page, vma, address, false);
 	mem_cgroup_commit_charge(page, memcg, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -2762,7 +2762,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (anon) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
 		page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..0d2b3110277a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		else
 			page_dup_rmap(new);
 	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr);
+		page_add_anon_rmap(new, vma, addr, false);
 	else
 		page_add_file_rmap(new);
 
@@ -1803,7 +1803,7 @@ fail_putback:
 	 * guarantee the copy is visible before the pagetable update.
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
+	page_add_anon_rmap(new_page, vma, mmun_start, true);
 	pmdp_clear_flush_notify(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1814,13 +1814,13 @@ fail_putback:
 		flush_tlb_range(vma, mmun_start, mmun_end);
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page);
+		page_remove_rmap(new_page, true);
 		goto fail_putback;
 	}
 
 	mem_cgroup_migrate(page, new_page, false);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 47b3ba87c2dd..f67e83be75e4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1041,9 +1041,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	do_page_add_anon_rmap(page, vma, address, 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
 }
 
 /*
@@ -1052,21 +1052,24 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int exclusive)
+	struct vm_area_struct *vma, unsigned long address, int flags)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	if (first) {
+		bool compound = flags & RMAP_COMPOUND;
+		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (PageTransHuge(page))
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-				hpage_nr_pages(page));
+		}
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1074,7 +1077,8 @@ void do_page_add_anon_rmap(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
-		__page_set_anon_rmap(page, vma, address, exclusive);
+		__page_set_anon_rmap(page, vma, address,
+				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -1090,15 +1094,18 @@ void do_page_add_anon_rmap(struct page *page,
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			hpage_nr_pages(page));
+	}
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1154,9 +1161,12 @@ out:
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	if (!PageAnon(page)) {
+		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
 		return;
 	}
@@ -1174,11 +1184,12 @@ void page_remove_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			      -hpage_nr_pages(page));
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1320,7 +1331,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		dec_mm_counter(mm, MM_FILEPAGES);
 
 discard:
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	page_cache_release(page);
 
 out_unmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55ccb9b26..200298895cee 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr);
+		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr);
+		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 05/24] mm, proc: adjust PSS calculation
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 04/24] rmap: add argument to charge compound page Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 06/24] mm: store mapcount for compound page separately Kirill A. Shutemov
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting all subpages of the compound page are not nessessary
have the same mapcount. We need to take into account mapcount of every
sub-page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 956b75d61809..50acf17538db 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -449,9 +449,10 @@ struct mem_size_stats {
 };
 
 static void smaps_account(struct mem_size_stats *mss, struct page *page,
-		unsigned long size, bool young, bool dirty)
+		bool compound, bool young, bool dirty)
 {
-	int mapcount;
+	int i, nr = compound ? hpage_nr_pages(page) : 1;
+	unsigned long size = nr << PAGE_SHIFT;
 
 	if (PageAnon(page))
 		mss->anonymous += size;
@@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	/* Accumulate the size in pages that have been accessed. */
 	if (young || PageReferenced(page))
 		mss->referenced += size;
-	mapcount = page_mapcount(page);
-	if (mapcount >= 2) {
-		u64 pss_delta;
 
-		if (dirty || PageDirty(page))
-			mss->shared_dirty += size;
-		else
-			mss->shared_clean += size;
-		pss_delta = (u64)size << PSS_SHIFT;
-		do_div(pss_delta, mapcount);
-		mss->pss += pss_delta;
-	} else {
-		if (dirty || PageDirty(page))
-			mss->private_dirty += size;
-		else
-			mss->private_clean += size;
-		mss->pss += (u64)size << PSS_SHIFT;
+	for (i = 0; i < nr; i++) {
+		int mapcount = page_mapcount(page + i);
+
+		if (mapcount >= 2) {
+			if (dirty || PageDirty(page + i))
+				mss->shared_dirty += PAGE_SIZE;
+			else
+				mss->shared_clean += PAGE_SIZE;
+			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+		} else {
+			if (dirty || PageDirty(page + i))
+				mss->private_dirty += PAGE_SIZE;
+			else
+				mss->private_clean += PAGE_SIZE;
+			mss->pss += PAGE_SIZE << PSS_SHIFT;
+		}
 	}
 }
 
@@ -500,7 +501,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (!page)
 		return;
-	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
+
+	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -516,8 +518,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (IS_ERR_OR_NULL(page))
 		return;
 	mss->anonymous_thp += HPAGE_PMD_SIZE;
-	smaps_account(mss, page, HPAGE_PMD_SIZE,
-			pmd_young(*pmd), pmd_dirty(*pmd));
+	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
 static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 06/24] mm: store mapcount for compound page separately
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 05/24] mm, proc: adjust PSS calculation Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 07/24] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of THP compound and
we need a cheap way to find out how many time the compound page is
mapped with PMD -- compound_mapcount() does this.

We use the same approach as with compound page destructor and compound
order: use space in first tail page, ->mapping this time.

page_mapcount() counts both: PTE and PMD mappings of the page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h       | 16 ++++++++++++++--
 include/linux/mm_types.h |  1 +
 include/linux/rmap.h     |  4 ++--
 mm/debug.c               |  5 ++++-
 mm/huge_memory.c         | 23 ++++++++++++++---------
 mm/hugetlb.c             |  4 ++--
 mm/memory.c              |  4 ++--
 mm/migrate.c             |  2 +-
 mm/page_alloc.c          |  7 ++++++-
 mm/rmap.c                | 47 +++++++++++++++++++++++++++++++++++++++--------
 10 files changed, 85 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1aea94e837a0..b64dfe352d71 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -472,6 +472,18 @@ static inline struct page *compound_head_fast(struct page *page)
 		return page->first_page;
 	return page;
 }
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+	return &page[1].compound_mapcount;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+	if (!PageCompound(page))
+		return 0;
+	page = compound_head(page);
+	return atomic_read(compound_mapcount_ptr(page)) + 1;
+}
 
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
@@ -486,7 +498,7 @@ static inline void page_mapcount_reset(struct page *page)
 static inline int page_mapcount(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	return atomic_read(&page->_mapcount) + 1;
+	return atomic_read(&page->_mapcount) + compound_mapcount(page) + 1;
 }
 
 static inline int page_count(struct page *page)
@@ -1081,7 +1093,7 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) >= 0;
+	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 590630eb59ba..aefbc95148c4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
 						 * see PAGE_MAPPING_ANON below.
 						 */
 		void *s_mem;			/* slab first object */
+		atomic_t compound_mapcount;	/* first tail page */
 	};
 
 	/* Second double word */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index ebe50ceacea6..ec63d3f20ca3 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -190,9 +190,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 				unsigned long);
 
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
 {
-	atomic_inc(&page->_mapcount);
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
 /*
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..13d2b8146ef9 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -83,9 +83,12 @@ static void dump_flags(unsigned long flags,
 void dump_page_badflags(struct page *page, const char *reason,
 		unsigned long badflags)
 {
-	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
 		  page, atomic_read(&page->_count), page_mapcount(page),
 		  page->mapping, page->index);
+	if (PageCompound(page))
+		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+	pr_cont("\n");
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
 	dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
 	if (reason)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 38c6b72cbe80..0c83451679f8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -890,7 +890,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
-	page_dup_rmap(src_page);
+	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
@@ -1787,8 +1787,8 @@ static void __split_huge_page_refcount(struct page *page,
 		struct page *page_tail = page + i;
 
 		/* tail_page->_mapcount cannot change */
-		BUG_ON(page_mapcount(page_tail) < 0);
-		tail_count += page_mapcount(page_tail);
+		BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
+		tail_count += atomic_read(&page_tail->_mapcount) + 1;
 		/* check for overflow */
 		BUG_ON(tail_count < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
@@ -1805,8 +1805,7 @@ static void __split_huge_page_refcount(struct page *page,
 		 * atomic_set() here would be safe on all archs (and
 		 * not only on x86), it's safer to use atomic_add().
 		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
+		atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb__after_atomic();
@@ -1843,15 +1842,18 @@ static void __split_huge_page_refcount(struct page *page,
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		page_tail->_mapcount = page->_mapcount;
+		atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);
 
-		BUG_ON(page_tail->mapping);
-		page_tail->mapping = page->mapping;
+		/* ->mapping in first tail page is compound_mapcount */
+		if (i != 1) {
+			BUG_ON(page_tail->mapping);
+			page_tail->mapping = page->mapping;
+			BUG_ON(!PageAnon(page_tail));
+		}
 
 		page_tail->index = page->index + i;
 		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
 		BUG_ON(!PageSwapBacked(page_tail));
@@ -1861,6 +1863,9 @@ static void __split_huge_page_refcount(struct page *page,
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
+	page->_mapcount = *compound_mapcount_ptr(page);
+	page[1].mapping = page->mapping;
+
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ebb7329301c4..2aa2a850d002 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2606,7 +2606,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
-			page_dup_rmap(ptepage);
+			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(src_ptl);
@@ -3065,7 +3065,7 @@ retry:
 		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
 	} else
-		page_dup_rmap(page);
+		page_dup_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index f4fd3e078b51..c6e1a557f609 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -873,7 +873,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page);
+		page_dup_rmap(page, false);
 		if (PageAnon(page))
 			rss[MM_ANONPAGES]++;
 		else
@@ -3033,7 +3033,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * pinned by vma->vm_file's reference.  We rely on unlock_page()'s
 	 * release semantics to prevent the compiler from undoing this copying.
 	 */
-	mapping = fault_page->mapping;
+	mapping = compound_head(fault_page)->mapping;
 	unlock_page(fault_page);
 	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 0d2b3110277a..01449826b914 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,7 +164,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		if (PageAnon(new))
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
-			page_dup_rmap(new);
+			page_dup_rmap(new, false);
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f939cb10e10d..730f6ad2ab10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -369,6 +369,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 
 	set_compound_page_dtor(page, free_compound_page);
 	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
@@ -658,7 +659,9 @@ static inline int free_pages_check(struct page *page)
 
 	if (unlikely(page_mapcount(page)))
 		bad_reason = "nonzero mapcount";
-	if (unlikely(page->mapping != NULL))
+	if (unlikely(compound_mapcount(page)))
+		bad_reason = "nonzero compound_mapcount";
+	if (unlikely(page->mapping != NULL) && !PageTail(page))
 		bad_reason = "non-NULL mapping";
 	if (unlikely(atomic_read(&page->_count) != 0))
 		bad_reason = "nonzero _count";
@@ -800,6 +803,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	}
 	if (bad)
 		return false;
+	if (order)
+		page[1].mapping = NULL;
 
 	reset_page_owner(page, order);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index f67e83be75e4..333938475831 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1025,7 +1025,7 @@ static void __page_check_anon_rmap(struct page *page,
 	 * over the call to page_add_new_anon_rmap.
 	 */
 	BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root);
-	BUG_ON(page->index != linear_page_index(vma, address));
+	BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
 #endif
 }
 
@@ -1054,9 +1054,26 @@ void page_add_anon_rmap(struct page *page,
 void do_page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, int flags)
 {
-	int first = atomic_inc_and_test(&page->_mapcount);
+	bool compound = flags & RMAP_COMPOUND;
+	bool first;
+
+	if (PageTransCompound(page)) {
+		VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			first = atomic_inc_and_test(compound_mapcount_ptr(page));
+		} else {
+			/* Anon THP always mapped first with PMD */
+			first = 0;
+			VM_BUG_ON_PAGE(!page_mapcount(page), page);
+			atomic_inc(&page->_mapcount);
+		}
+	} else {
+		VM_BUG_ON_PAGE(compound, page);
+		first = atomic_inc_and_test(&page->_mapcount);
+	}
+
 	if (first) {
-		bool compound = flags & RMAP_COMPOUND;
 		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1074,7 +1091,8 @@ void do_page_add_anon_rmap(struct page *page,
 	if (unlikely(PageKsm(page)))
 		return;
 
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
+
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1100,10 +1118,16 @@ void page_add_new_anon_rmap(struct page *page,
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
-	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (compound) {
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(compound_mapcount_ptr(page), 0);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	} else {
+		/* Anon THP always mapped first with PMD */
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
 	}
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
@@ -1172,7 +1196,9 @@ void page_remove_rmap(struct page *page, bool compound)
 	}
 
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	if (!atomic_add_negative(-1, compound ?
+			       compound_mapcount_ptr(page) :
+			       &page->_mapcount))
 		return;
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
@@ -1185,8 +1211,13 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
 	if (compound) {
+		int i;
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		/* The page can be mapped with ptes */
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			if (page_mapcount(page + i))
+				nr--;
 	}
 
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
@@ -1630,7 +1661,7 @@ void hugepage_add_anon_rmap(struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!anon_vma);
 	/* address might be in next vma when migration races vma_adjust */
-	first = atomic_inc_and_test(&page->_mapcount);
+	first = atomic_inc_and_test(compound_mapcount_ptr(page));
 	if (first)
 		__hugepage_set_anon_rmap(page, vma, address, 0);
 }
@@ -1639,7 +1670,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-	atomic_set(&page->_mapcount, 0);
+	atomic_set(compound_mapcount_ptr(page), 0);
 	__hugepage_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 07/24] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 06/24] mm: store mapcount for compound page separately Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 08/24] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting we will be able map the same compound page with
PTEs and PMDs. It requires adjustment to conditions when we can reuse
the page on write-protection fault.

For PTE fault we can't reuse the page if it's part of huge page.

For PMD we can only reuse the page if nobody else maps the huge page or
it's part. We can do it by checking page_mapcount() on each sub-page,
but it's expensive.

The cheaper way is to check page_count() to be equal 1: every mapcount
takes page reference, so this way we can guarantee, that the PMD is the
only mapping.

This approach can give false negative if somebody pinned the page, but
that doesn't affect correctness.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/swap.h |  3 ++-
 mm/huge_memory.c     | 12 +++++++++++-
 mm/rmap.c            |  2 +-
 mm/swapfile.c        |  3 +++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7067eca501e2..f0e4868f63b1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -523,7 +523,8 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
-#define reuse_swap_page(page)	(page_mapcount(page) == 1)
+#define reuse_swap_page(page) \
+	(!PageTransCompound(page) && page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0c83451679f8..98b99e620af9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1092,7 +1092,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
-	if (page_mapcount(page) == 1) {
+	/*
+	 * We can only reuse the page if nobody else maps the huge page or it's
+	 * part. We can do it by checking page_mapcount() on each sub-page, but
+	 * it's expensive.
+	 * The cheaper way is to check page_count() to be equal 1: every
+	 * mapcount takes page reference reference, so this way we can
+	 * guarantee, that the PMD is the only mapping.
+	 * This can give false negative if somebody pinned the page, but that's
+	 * fine.
+	 */
+	if (page_mapcount(page) == 1 && page_count(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
diff --git a/mm/rmap.c b/mm/rmap.c
index 333938475831..db8b99e48966 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1215,7 +1215,7 @@ void page_remove_rmap(struct page *page, bool compound)
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 		/* The page can be mapped with ptes */
-		for (i = 0; i < HPAGE_PMD_NR; i++)
+		for (i = 0; i < hpage_nr_pages(page); i++)
 			if (page_mapcount(page + i))
 				nr--;
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 200298895cee..99f97c31ede5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return 0;
+	/* The page is part of THP and cannot be reused */
+	if (PageTransCompound(page))
+		return 0;
 	count = page_mapcount(page);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 08/24] mm: adjust FOLL_SPLIT for new refcounting
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 07/24] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 09/24] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

We prepare kernel to allow transhuge pages to be mapped with ptes too.
We need to handle FOLL_SPLIT in follow_page_pte().

Also we use split_huge_page() directly instead of split_huge_page_pmd().
split_huge_page_pmd() will gone.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/gup.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 19 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ca7b607ab671..080535d5efab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,6 +79,20 @@ retry:
 		page = pte_page(pte);
 	}
 
+	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
+		int ret;
+		page = compound_head(page);
+		get_page(page);
+		pte_unmap_unlock(ptep, ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		if (ret)
+			return ERR_PTR(ret);
+		goto retry;
+	}
+
 	if (flags & FOLL_GET)
 		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
@@ -186,27 +200,45 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
-	if (pmd_trans_huge(*pmd)) {
-		if (flags & FOLL_SPLIT) {
-			split_huge_page_pmd(vma, address, pmd);
-			return follow_page_pte(vma, address, pmd, flags);
-		}
-		ptl = pmd_lock(mm, pmd);
-		if (likely(pmd_trans_huge(*pmd))) {
-			if (unlikely(pmd_trans_splitting(*pmd))) {
-				spin_unlock(ptl);
-				wait_split_huge_page(vma->anon_vma, pmd);
-			} else {
-				page = follow_trans_huge_pmd(vma, address,
-							     pmd, flags);
-				spin_unlock(ptl);
-				*page_mask = HPAGE_PMD_NR - 1;
-				return page;
-			}
-		} else
+	if (likely(!pmd_trans_huge(*pmd)))
+		return follow_page_pte(vma, address, pmd, flags);
+
+	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(ptl);
+		return follow_page_pte(vma, address, pmd, flags);
+	}
+
+	if (unlikely(pmd_trans_splitting(*pmd))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return follow_page_pte(vma, address, pmd, flags);
+	}
+
+	if (flags & FOLL_SPLIT) {
+		int ret;
+		page = pmd_page(*pmd);
+		if (is_huge_zero_page(page)) {
+			spin_unlock(ptl);
+			ret = 0;
+			split_huge_pmd(vma, pmd, address);
+		} else {
+			get_page(page);
 			spin_unlock(ptl);
+			lock_page(page);
+			ret = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+		}
+
+		return ret ? ERR_PTR(ret) :
+			follow_page_pte(vma, address, pmd, flags);
 	}
-	return follow_page_pte(vma, address, pmd, flags);
+
+	page = follow_trans_huge_pmd(vma, address, pmd, flags);
+	spin_unlock(ptl);
+	*page_mask = HPAGE_PMD_NR - 1;
+	return page;
 }
 
 static int get_gate_page(struct mm_struct *mm, unsigned long address,
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 09/24] thp, mlock: do not allow huge pages in mlocked area
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 08/24] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 10/24] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting THP can belong to several VMAs. This makes tricky to
tracking THP pages, when they partially mlocked. It can lead to leaking
mlocked pages to non-VM_LOCKED vmas and other problems.

With this patch we will split all pages on mlock and avoid
fault-in/collapse new THP in VM_LOCKED vmas.

I've tried alternative approach: do not mark THP pages mlocked and keep
them on normal LRUs. This way vmscan could try to split huge pages on
memory pressure and free up subpages which doesn't belong to VM_LOCKED
vmas.  But this is user-visible change: we screw up Mlocked accouting
reported in meminfo, so I had to leave this approach aside.

We can bring something better later, but this should be good enough for
now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/gup.c         |  2 ++
 mm/huge_memory.c |  5 ++++-
 mm/memory.c      |  3 ++-
 mm/mlock.c       | 51 +++++++++++++++++++--------------------------------
 4 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 080535d5efab..01a53052f7c8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -883,6 +883,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
 
 	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	if (vma->vm_flags & VM_LOCKED)
+		gup_flags |= FOLL_SPLIT;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 98b99e620af9..36e03eb9aa41 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -787,6 +787,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return VM_FAULT_FALLBACK;
+	if (vma->vm_flags & VM_LOCKED)
+		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
@@ -2565,7 +2567,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
 	    (vma->vm_flags & VM_NOHUGEPAGE))
 		return false;
-
+	if (vma->vm_flags & VM_LOCKED)
+		return false;
 	if (!vma->anon_vma || vma->vm_ops)
 		return false;
 	if (is_vma_temporary_stack(vma))
diff --git a/mm/memory.c b/mm/memory.c
index c6e1a557f609..9be5b36a5010 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2160,7 +2160,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	pte_unmap_unlock(page_table, ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	if (old_page) {
+	/* THP pages are never mlocked */
+	if (old_page && !PageTransCompound(old_page)) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
 		 * keep the mlocked page.
diff --git a/mm/mlock.c b/mm/mlock.c
index 283ff972ea43..4e73ffccb204 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -443,39 +443,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
 				&page_mask);
 
-		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-				zoneid = page_zone_id(page);
+		if (page && !IS_ERR(page) && !PageTransCompound(page)) {
+			/*
+			 * Non-huge pages are handled in batches via
+			 * pagevec. The pin from follow_page_mask()
+			 * prevents them from collapsing by THP.
+			 */
+			pagevec_add(&pvec, page);
+			zone = page_zone(page);
+			zoneid = page_zone_id(page);
 
-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zoneid, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
+			/*
+			 * Try to fill the rest of pagevec using fast
+			 * pte walk. This will also update start to
+			 * the next page to process. Then munlock the
+			 * pagevec.
+			 */
+			start = __munlock_pagevec_fill(&pvec, vma,
+					zoneid, start, end);
+			__munlock_pagevec(&pvec, zone);
+			goto next;
 		}
 		/* It's a bug to munlock in the middle of a THP page */
 		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 10/24] khugepaged: ignore pmd tables with THP mapped with ptes
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 09/24] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:32 ` [PATCHv4 11/24] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Prepare khugepaged to see compound pages mapped with pte. For now we
won't collapse the pmd table with such pte.

khugepaged is subject for future rework wrt new refcounting.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 36e03eb9aa41..ff1282b1859c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2747,6 +2747,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
+
+		/* TODO: teach khugepaged to collapse THP mapped with pte */
+		if (PageCompound(page))
+			goto out_unmap;
+
 		/*
 		 * Record which node the original page is from and save this
 		 * information to khugepaged_node_load[].
@@ -2757,7 +2762,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (khugepaged_scan_abort(node))
 			goto out_unmap;
 		khugepaged_node_load[node]++;
-		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
 		/*
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 11/24] thp: rename split_huge_page_pmd() to split_huge_pmd()
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 10/24] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
@ 2015-03-04 16:32 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to decouple splitting THP PMD from splitting underlying
compound page.

This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
to reflect the fact that it doesn't imply page splitting, only PMD.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/mm/subpage-prot.c |  2 +-
 arch/x86/kernel/vm86_32.c      |  6 +++++-
 include/linux/huge_mm.h        |  8 ++------
 mm/huge_memory.c               | 33 +++++++++++++--------------------
 mm/madvise.c                   |  2 +-
 mm/memory.c                    |  2 +-
 mm/mempolicy.c                 |  2 +-
 mm/mprotect.c                  |  2 +-
 mm/mremap.c                    |  2 +-
 mm/pagewalk.c                  |  2 +-
 10 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	return 0;
 }
 
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index e8edcf52e069..883160599965 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
-	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+	if (pmd_trans_huge(*pmd)) {
+		struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+		split_huge_pmd(vma, pmd, 0xA0000);
+	}
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 44a840a53974..34bbf769d52e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
 }
 extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd)			\
+#define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
@@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd);
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -187,11 +185,9 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define split_huge_page_pmd(__vma, __address, __pmd)	\
-	do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
+#define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff1282b1859c..c9e2e07ac033 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1127,13 +1127,13 @@ alloc:
 
 	if (unlikely(!new_page)) {
 		if (!page) {
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
 			if (ret & VM_FAULT_OOM) {
-				split_huge_page(page);
+				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
 			put_user_huge_page(page);
@@ -1146,10 +1146,10 @@ alloc:
 					   GFP_TRANSHUGE, &memcg))) {
 		put_page(new_page);
 		if (page) {
-			split_huge_page(page);
+			split_huge_pmd(vma, pmd, address);
 			put_user_huge_page(page);
 		} else
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1709,17 +1709,7 @@ again:
 		goto again;
 }
 
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd)
-{
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
 	pgd_t *pgd;
@@ -1728,7 +1718,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 
 	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
 
-	pgd = pgd_offset(mm, address);
+	pgd = pgd_offset(vma->vm_mm, address);
 	if (!pgd_present(*pgd))
 		return;
 
@@ -1739,11 +1729,14 @@ static void split_huge_page_address(struct mm_struct *mm,
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return;
+
+	if (!pmd_trans_huge(*pmd))
+		return;
 	/*
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	split_huge_page_pmd_mm(mm, address, pmd);
+	__split_huge_page_pmd(vma, address, pmd);
 }
 
 static int __split_huge_page_splitting(struct page *page,
@@ -3017,7 +3010,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, start);
+		split_huge_pmd_address(vma, start);
 
 	/*
 	 * If the new end address isn't hpage aligned and it could
@@ -3027,7 +3020,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, end);
+		split_huge_pmd_address(vma, end);
 
 	/*
 	 * If we're also updating the vma->vm_next->vm_start, if the new
@@ -3041,6 +3034,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
-			split_huge_page_address(next->vm_mm, nstart);
+			split_huge_pmd_address(next, nstart);
 	}
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8921c2..c0bbe52eb3bf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -279,7 +279,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
 		if (next - addr != HPAGE_PMD_SIZE)
-			split_huge_page_pmd(vma, addr, pmd);
+			split_huge_pmd(vma, pmd, addr);
 		else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
 			goto next;
 		/* fall through */
diff --git a/mm/memory.c b/mm/memory.c
index 9be5b36a5010..0db0481311d5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1204,7 +1204,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 					BUG();
 				}
 #endif
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046a134a..eeec3dd199f5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..81632238f857 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,7 +155,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
 						newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index 57dadc025c64..95ac595de7da 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -208,7 +208,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				need_flush = true;
 				continue;
 			} else if (!err) {
-				split_huge_page_pmd(vma, old_addr, old_pmd);
+				split_huge_pmd(vma, old_pmd, old_addr);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 75c1f2878519..5b23a69439c8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
 		if (!walk->pte_entry)
 			continue;
 
-		split_huge_page_pmd_mm(walk->mm, addr, pmd);
+		split_huge_pmd(walk->vma, pmd, addr);
 		if (pmd_trans_unstable(pmd))
 			goto again;
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2015-03-04 16:32 ` [PATCHv4 11/24] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-17  8:32   ` Aneesh Kumar K.V
                     ` (3 more replies)
  2015-03-04 16:33 ` [PATCHv4 13/24] mm, vmstats: new THP splitting event Kirill A. Shutemov
                   ` (14 subsequent siblings)
  26 siblings, 4 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Current split_huge_page() combines two operations: splitting PMDs into
tables of PTEs and splitting underlying compound page. This patch
changes split_huge_pmd() implementation to split the given PMD without
splitting other PMDs this page mapped with or underlying compound page.

In order to do this we have to get rid of tail page refcounting, which
uses _mapcount of tail pages. Tail page refcounting is needed to be able
to split THP page at any point: we always know which of tail pages is
pinned (i.e. by get_user_pages()) and can distribute page count
correctly.

We can avoid this by allowing split_huge_page() to fail if the compound
page is pinned. This patch removes all infrastructure for tail page
refcounting and make split_huge_page() to always return -EBUSY. All
split_huge_page() users already know how to handle its fail. Proper
implementation will be added later.

Without tail page refcounting, implementation of split_huge_pmd() is
pretty straight-forward.

Memory cgroup is not yet ready for new refcouting. Let's disable it on
Kconfig level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/mips/mm/gup.c            |   4 -
 arch/powerpc/mm/hugetlbpage.c |  13 +-
 arch/s390/mm/gup.c            |  13 +-
 arch/sparc/mm/gup.c           |  14 +--
 arch/x86/mm/gup.c             |   4 -
 include/linux/huge_mm.h       |   7 +-
 include/linux/mm.h            |  79 +++---------
 include/linux/mm_types.h      |  19 +--
 mm/Kconfig                    |   2 +-
 mm/gup.c                      |  31 +----
 mm/huge_memory.c              | 282 +++++++++++-------------------------------
 mm/internal.h                 |  31 +----
 mm/swap.c                     | 245 +-----------------------------------
 13 files changed, 104 insertions(+), 640 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 349995d19c7f..36a35115dc2e 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7e408bfc7948..9a7f513d0068 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1037,7 +1037,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	pte_t pte;
 	int refs;
 
@@ -1060,7 +1060,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -1082,15 +1081,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 1eb41bb3010c..f8112899f6fe 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask, result;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc2daa9..9091c5daa2e1 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(head);
 			return 0;
 		}
-		if (head != page)
-			get_huge_page_tail(page);
 
 		pages[*nr] = page;
 		(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 			unsigned long end, int write, struct page **pages,
 			int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/* Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..62a887a3cf50 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 34bbf769d52e..3c5fe722cc14 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -102,14 +102,13 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
 }
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd);
+extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address);
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
-			__split_huge_page_pmd(__vma, __address,		\
-					____pmd);			\
+			__split_huge_pmd(__vma, __pmd, __address);	\
 	}  while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)				\
 	do {								\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b64dfe352d71..020dbbe1563c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -433,31 +433,10 @@ static inline void compound_unlock_irqrestore(struct page *page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-	struct page *head = tail->first_page;
-
-	/*
-	 * page->first_page may be a dangling pointer to an old
-	 * compound page, so recheck that it is still a tail
-	 * page before returning.
-	 */
-	smp_rmb();
-	if (likely(PageTail(tail)))
-		return head;
-	return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
-		return compound_head_by_tail(page);
+		return page->first_page;
 	return page;
 }
 
@@ -515,50 +494,11 @@ static inline int PageHeadHuge(struct page *page_head)
 }
 #endif /* CONFIG_HUGETLB_PAGE */
 
-static inline bool __compound_tail_refcounted(struct page *page)
-{
-	return !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run from under us.
-	 */
-	VM_BUG_ON_PAGE(!PageTail(page), page);
-	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
-		atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
 static inline void get_page(struct page *page)
 {
-	if (unlikely(PageTail(page)))
-		if (likely(__get_page_tail(page)))
-			return;
-	/*
-	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-	atomic_inc(&page->_count);
+	struct page *page_head = compound_head(page);
+	VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
+	atomic_inc(&page_head->_count);
 }
 
 static inline struct page *virt_to_head_page(const void *x)
@@ -1093,7 +1033,16 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
+	int i;
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) >= 0;
+	if (compound_mapcount(page))
+		return 1;
+	for (i = 0; i < hpage_nr_pages(page); i++) {
+		if (atomic_read(&page[i]._mapcount) >= 0)
+			return 1;
+	}
+	return 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index aefbc95148c4..907f99e74281 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -93,20 +93,9 @@ struct page {
 
 				union {
 					/*
-					 * Count of ptes mapped in
-					 * mms, to show when page is
-					 * mapped & limit reverse map
-					 * searches.
-					 *
-					 * Used also for tail pages
-					 * refcounting instead of
-					 * _count. Tail pages cannot
-					 * be mapped and keeping the
-					 * tail page _count zero at
-					 * all times guarantees
-					 * get_page_unless_zero() will
-					 * never succeed on tail
-					 * pages.
+					 * Count of ptes mapped in mms, to show
+					 * when page is mapped & limit reverse
+					 * map searches.
 					 */
 					atomic_t _mapcount;
 
@@ -117,7 +106,7 @@ struct page {
 					};
 					int units;	/* SLOB */
 				};
-				atomic_t _count;		/* Usage count, see below. */
+				atomic_t _count; /* Usage count, see below. */
 			};
 			unsigned int active;	/* SLAB */
 		};
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..19d090534194 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -409,7 +409,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !MEMCG
 	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
diff --git a/mm/gup.c b/mm/gup.c
index 01a53052f7c8..b68d9ffa3c9e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1107,7 +1107,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (write && !pmd_write(orig))
@@ -1116,7 +1116,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	refs = 0;
 	head = pmd_page(orig);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1137,24 +1136,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail pages need their mapcount reference taken before we
-	 * return. (This allows the THP code to bump their ref count when
-	 * they are split into base pages).
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (write && !pud_write(orig))
@@ -1163,7 +1151,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	refs = 0;
 	head = pud_page(orig);
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1184,12 +1171,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
@@ -1198,7 +1179,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 			struct page **pages, int *nr)
 {
 	int refs;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 
 	if (write && !pgd_write(orig))
 		return 0;
@@ -1227,12 +1208,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 	}
 
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c9e2e07ac033..f420847a3288 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -932,37 +932,6 @@ unlock:
 	spin_unlock(ptl);
 }
 
-/*
- * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
- * during copy_user_huge_page()'s copy_page_rep(): in the case when
- * the source page gets split and a tail freed before copy completes.
- * Called under pmd_lock of checked pmd, so safe from splitting itself.
- */
-static void get_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		atomic_add(HPAGE_PMD_NR, &page->_count);
-		while (++page < endpage)
-			get_huge_page_tail(page);
-	} else {
-		get_page(page);
-	}
-}
-
-static void put_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		while (page < endpage)
-			put_page(page++);
-	} else {
-		put_page(page);
-	}
-}
-
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
@@ -1113,7 +1082,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
 	}
-	get_user_huge_page(page);
+	get_page(page);
 	spin_unlock(ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
@@ -1136,7 +1105,7 @@ alloc:
 				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
-			put_user_huge_page(page);
+			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1147,7 +1116,7 @@ alloc:
 		put_page(new_page);
 		if (page) {
 			split_huge_pmd(vma, pmd, address);
-			put_user_huge_page(page);
+			put_page(page);
 		} else
 			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
@@ -1169,7 +1138,7 @@ alloc:
 
 	spin_lock(ptl);
 	if (page)
-		put_user_huge_page(page);
+		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
 		mem_cgroup_cancel_charge(new_page, memcg);
@@ -1662,51 +1631,78 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	put_huge_zero_page();
 }
 
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd)
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
-	spinlock_t *ptl;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	bool young, write;
+	int i;
 
-	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
+	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
+
+	if (is_huge_zero_pmd(*pmd))
+		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+
+	write = pmd_write(*pmd);
+	young = pmd_young(*pmd);
+
+	/* leave pmd empty until pte is filled */
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t entry, *pte;
+		/*
+		 * Note that NUMA hinting access restrictions are not
+		 * transferred to avoid any possibility of altering
+		 * permissions across VMAs.
+		 */
+		entry = mk_pte(page + i, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (!write)
+			entry = pte_wrprotect(entry);
+		if (!young)
+			entry = pte_mkold(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		BUG_ON(!pte_none(*pte));
+		atomic_inc(&page[i]._mapcount);
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	atomic_dec(compound_mapcount_ptr(page));
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	unsigned long mmun_start;       /* For mmu_notifiers */
+	unsigned long mmun_end;         /* For mmu_notifiers */
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-again:
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	if (is_huge_zero_pmd(*pmd)) {
-		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
+	if (likely(pmd_trans_huge(*pmd)))
+		__split_huge_pmd_locked(vma, pmd, address);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	split_huge_page(page);
-
-	put_page(page);
-
-	/*
-	 * We don't always have down_write of mmap_sem here: a racing
-	 * do_huge_pmd_wp_page() might have copied-on-write to another
-	 * huge page before our split_huge_page() got the anon_vma lock.
-	 */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		goto again;
 }
 
 static void split_huge_pmd_address(struct vm_area_struct *vma,
@@ -1736,42 +1732,10 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	__split_huge_page_pmd(vma, address, pmd);
-}
-
-static int __split_huge_page_splitting(struct page *page,
-				       struct vm_area_struct *vma,
-				       unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd;
-	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
-
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
-	if (pmd) {
-		/*
-		 * We can't temporarily set the pmd to null in order
-		 * to split it, the pmd must remain marked huge at all
-		 * times or the VM won't take the pmd_trans_huge paths
-		 * and it won't wait on the anon_vma->root->rwsem to
-		 * serialize against split_huge_page*.
-		 */
-		pmdp_splitting_flush(vma, address, pmd);
-
-		ret = 1;
-		spin_unlock(ptl);
-	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	return ret;
+	__split_huge_pmd(vma, pmd, address);
 }
 
+#if 0
 static void __split_huge_page_refcount(struct page *page,
 				       struct list_head *list)
 {
@@ -1897,82 +1861,6 @@ static void __split_huge_page_refcount(struct page *page,
 	BUG_ON(page_count(page) <= 0);
 }
 
-static int __split_huge_page_map(struct page *page,
-				 struct vm_area_struct *vma,
-				 unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd, _pmd;
-	int ret = 0, i;
-	pgtable_t pgtable;
-	unsigned long haddr;
-
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
-	if (pmd) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-		pmd_populate(mm, &_pmd, pgtable);
-		if (pmd_write(*pmd))
-			BUG_ON(page_mapcount(page) != 1);
-
-		haddr = address;
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-			pte_t *pte, entry;
-			BUG_ON(PageCompound(page+i));
-			/*
-			 * Note that NUMA hinting access restrictions are not
-			 * transferred to avoid any possibility of altering
-			 * permissions across VMAs.
-			 */
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			if (!pmd_write(*pmd))
-				entry = pte_wrprotect(entry);
-			if (!pmd_young(*pmd))
-				entry = pte_mkold(entry);
-			pte = pte_offset_map(&_pmd, haddr);
-			BUG_ON(!pte_none(*pte));
-			set_pte_at(mm, haddr, pte, entry);
-			pte_unmap(pte);
-		}
-
-		smp_wmb(); /* make pte visible before pmd */
-		/*
-		 * Up to this point the pmd is present and huge and
-		 * userland has the whole access to the hugepage
-		 * during the split (which happens in place). If we
-		 * overwrite the pmd with the not-huge version
-		 * pointing to the pte here (which of course we could
-		 * if all CPUs were bug free), userland could trigger
-		 * a small page size TLB miss on the small sized TLB
-		 * while the hugepage TLB entry is still established
-		 * in the huge TLB. Some CPU doesn't like that. See
-		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-		 * Erratum 383 on page 93. Intel should be safe but is
-		 * also warns that it's only safe if the permission
-		 * and cache attributes of the two entries loaded in
-		 * the two TLB is identical (which should be the case
-		 * here). But it is generally safer to never allow
-		 * small and huge TLB entries for the same virtual
-		 * address to be loaded simultaneously. So instead of
-		 * doing "pmd_populate(); flush_tlb_range();" we first
-		 * mark the current pmd notpresent (atomically because
-		 * here the pmd_trans_huge and pmd_trans_splitting
-		 * must remain set at all times on the pmd until the
-		 * split is complete for this pmd), then we flush the
-		 * SMP TLB and finally we write the non-huge version
-		 * of the pmd entry with pmd_populate.
-		 */
-		pmdp_invalidate(vma, address, pmd);
-		pmd_populate(mm, pmd, pgtable);
-		ret = 1;
-		spin_unlock(ptl);
-	}
-
-	return ret;
-}
-
 /* must be called with anon_vma->root->rwsem held */
 static void __split_huge_page(struct page *page,
 			      struct anon_vma *anon_vma,
@@ -2023,48 +1911,18 @@ static void __split_huge_page(struct page *page,
 		BUG();
 	}
 }
+#endif
 
 /*
  * Split a hugepage into normal pages. This doesn't change the position of head
  * page. If @list is null, tail pages will be added to LRU list, otherwise, to
  * @list. Both head page and tail pages will inherit mapping, flags, and so on
  * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
+ * Return 0 if the hugepage is split successfully otherwise return -errno.
  */
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
-	struct anon_vma *anon_vma;
-	int ret = 1;
-
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
-	/*
-	 * The caller does not necessarily hold an mmap_sem that would prevent
-	 * the anon_vma disappearing so we first we take a reference to it
-	 * and then lock the anon_vma for write. This is similar to
-	 * page_lock_anon_vma_read except the write lock is taken to serialise
-	 * against parallel split or collapse operations.
-	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
-
-	ret = 0;
-	if (!PageCompound(page))
-		goto out_unlock;
-
-	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
-	count_vm_event(THP_SPLIT);
-
-	BUG_ON(PageCompound(page));
-out_unlock:
-	anon_vma_unlock_write(anon_vma);
-	put_anon_vma(anon_vma);
-out:
-	return ret;
+	return -EBUSY;
 }
 
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/internal.h b/mm/internal.h
index edaab69a9c35..67017c32dfba 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,26 +47,6 @@ static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
-static inline void __get_page_tail_foll(struct page *page,
-					bool get_page_head)
-{
-	/*
-	 * If we're getting a tail page, the elevated page->_count is
-	 * required only in the head page and we will elevate the head
-	 * page->_count and tail page->_mapcount.
-	 *
-	 * We elevate page_tail->_mapcount for tail pages to force
-	 * page_tail->_count to be zero at all times to avoid getting
-	 * false positives from get_page_unless_zero() with
-	 * speculative page access (like in
-	 * page_cache_get_speculative()) on tail pages.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
-	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
-	get_huge_page_tail(page);
-}
-
 /*
  * This is meant to be called as the FOLL_GET operation of
  * follow_page() and it must be called while holding the proper PT
@@ -74,14 +54,9 @@ static inline void __get_page_tail_foll(struct page *page,
  */
 static inline void get_page_foll(struct page *page)
 {
-	if (unlikely(PageTail(page)))
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount() can't run under
-		 * get_page_foll() because we hold the proper PT lock.
-		 */
-		__get_page_tail_foll(page, true);
-	else {
+	if (unlikely(PageTail(page))) {
+		atomic_inc(&page->first_page->_count);
+	} else {
 		/*
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..2e647d4dc6bb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,185 +80,12 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- *    PageHeadHuge will remain true until the compound page
- *    is released and enters the buddy allocator, and it could
- *    not be split by __split_huge_page_refcount().
- *
- *    So if we see PageHeadHuge set, and we have the tail page pin,
- *    then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- *    PG_slab is cleared before the slab frees the head page, and
- *    tail pin cannot be the last reference left on the head page,
- *    because the slab code is free to reuse the compound page
- *    after a kfree/kmem_cache_free without having to check if
- *    there's any tail pin left.  In turn all tail pinsmust be always
- *    released while the head is still pinned by the slab code
- *    and so we know PG_slab will be still set too.
- *
- *    So if we see PageSlab set, and we have the tail page pin,
- *    then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
-	/*
-	 * If @page is a THP tail, we must read the tail page
-	 * flags after the head page flags. The
-	 * __split_huge_page_refcount side enforces write memory barriers
-	 * between clearing PageTail and before the head page
-	 * can be freed and reallocated.
-	 */
-	smp_rmb();
-	if (likely(PageTail(page))) {
-		/*
-		 * __split_huge_page_refcount cannot race
-		 * here, see the comment above this function.
-		 */
-		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-		VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
-		if (put_page_testzero(page_head)) {
-			/*
-			 * If this is the tail of a slab THP page,
-			 * the tail pin must not be the last reference
-			 * held on the page, because the PG_slab cannot
-			 * be cleared before all tail pins (which skips
-			 * the _mapcount tail refcounting) have been
-			 * released.
-			 *
-			 * If this is the tail of a hugetlbfs page,
-			 * the tail pin may be the last reference on
-			 * the page instead, because PageHeadHuge will
-			 * not go away until the compound page enters
-			 * the buddy allocator.
-			 */
-			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
-			__put_compound_page(page_head);
-		}
-	} else
-		/*
-		 * __split_huge_page_refcount run before us,
-		 * @page was a THP tail. The split @page_head
-		 * has been freed and reallocated as slab or
-		 * hugetlbfs page of smaller order (only
-		 * possible if reallocated as slab on x86).
-		 */
-		if (put_page_testzero(page))
-			__put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		unsigned long flags;
-
-		/*
-		 * @page_head wasn't a dangling pointer but it may not
-		 * be a head page anymore by the time we obtain the
-		 * lock. That is ok as long as it can't be freed from
-		 * under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		if (unlikely(!PageTail(page))) {
-			/* __split_huge_page_refcount run before us */
-			compound_unlock_irqrestore(page_head, flags);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * The @page_head may have been freed
-				 * and reallocated as a compound page
-				 * of smaller order and then freed
-				 * again.  All we know is that it
-				 * cannot have become: a THP page, a
-				 * compound page of higher order, a
-				 * tail page.  That is because we
-				 * still hold the refcount of the
-				 * split THP tail and page_head was
-				 * the THP head before the split.
-				 */
-				if (PageHead(page_head))
-					__put_compound_page(page_head);
-				else
-					__put_single_page(page_head);
-			}
-out_put_single:
-			if (put_page_testzero(page))
-				__put_single_page(page);
-			return;
-		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
-		/*
-		 * We can release the refcount taken by
-		 * get_page_unless_zero() now that
-		 * __split_huge_page_refcount() is blocked on the
-		 * compound_lock.
-		 */
-		if (put_page_testzero(page_head))
-			VM_BUG_ON_PAGE(1, page_head);
-		/* __split_huge_page_refcount will wait now */
-		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
-		atomic_dec(&page->_mapcount);
-		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-		compound_unlock_irqrestore(page_head, flags);
-
-		if (put_page_testzero(page_head)) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
-		}
-	} else {
-		/* @page_head is a dangling pointer */
-		VM_BUG_ON_PAGE(PageTail(page), page);
-		goto out_put_single;
-	}
-}
-
 static void put_compound_page(struct page *page)
 {
-	struct page *page_head;
-
-	/*
-	 * We see the PageCompound set and PageTail not set, so @page maybe:
-	 *  1. hugetlbfs head page, or
-	 *  2. THP head page.
-	 */
-	if (likely(!PageTail(page))) {
-		if (put_page_testzero(page)) {
-			/*
-			 * By the time all refcounts have been released
-			 * split_huge_page cannot run anymore from under us.
-			 */
-			if (PageHead(page))
-				__put_compound_page(page);
-			else
-				__put_single_page(page);
-		}
-		return;
-	}
+	struct page *page_head = compound_head(page);
 
-	/*
-	 * We see the PageCompound set and PageTail set, so @page maybe:
-	 *  1. a tail hugetlbfs page, or
-	 *  2. a tail THP page, or
-	 *  3. a split THP page.
-	 *
-	 *  Case 3 is possible, as we may race with
-	 *  __split_huge_page_refcount tearing down a THP page.
-	 */
-	page_head = compound_head_by_tail(page);
-	if (!__compound_tail_refcounted(page_head))
-		put_unrefcounted_compound_page(page_head, page);
-	else
-		put_refcounted_compound_page(page_head, page);
+	if (put_page_testzero(page_head))
+			__put_compound_page(page_head);
 }
 
 void put_page(struct page *page)
@@ -270,72 +97,6 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
-	/*
-	 * This takes care of get_page() if run on a tail page
-	 * returned by one of the get_user_pages/follow_page variants.
-	 * get_user_pages/follow_page itself doesn't need the compound
-	 * lock because it runs __get_page_tail_foll() under the
-	 * proper PT lock that already serializes against
-	 * split_huge_page().
-	 */
-	unsigned long flags;
-	bool got;
-	struct page *page_head = compound_head(page);
-
-	/* Ref to put_compound_page() comment. */
-	if (!__compound_tail_refcounted(page_head)) {
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * This is a hugetlbfs page or a slab
-			 * page. __split_huge_page_refcount
-			 * cannot race here.
-			 */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			__get_page_tail_foll(page, true);
-			return true;
-		} else {
-			/*
-			 * __split_huge_page_refcount run
-			 * before us, "page" was a THP
-			 * tail. The split page_head has been
-			 * freed and reallocated as slab or
-			 * hugetlbfs page of smaller order
-			 * (only possible if reallocated as
-			 * slab on x86).
-			 */
-			return false;
-		}
-	}
-
-	got = false;
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
-		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			put_page(page_head);
-	}
-	return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
-
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 13/24] mm, vmstats: new THP splitting event
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 18:49   ` Christoph Lameter
  2015-03-04 16:33 ` [PATCHv4 14/24] thp: implement new split_huge_page() Kirill A. Shutemov
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

The patch replaces THP_SPLIT with tree events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILT and THP_SPLIT_PMD. It reflects the fact that we
now can split PMD without the compound page and that split_huge_page()
can fail.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/vm_event_item.h | 4 +++-
 mm/huge_memory.c              | 3 +++
 mm/vmstat.c                   | 4 +++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..3261bfe2156a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
-		THP_SPLIT,
+		THP_SPLIT_PAGE,
+		THP_SPLIT_PAGE_FAILED,
+		THP_SPLIT_PMD,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f420847a3288..3409a5c7dbb8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1646,6 +1646,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma,
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
 
+	count_vm_event(THP_SPLIT_PMD);
+
 	if (is_huge_zero_pmd(*pmd))
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 
@@ -1922,6 +1924,7 @@ static void __split_huge_page(struct page *page,
  */
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
+	count_vm_event(THP_SPLIT_PAGE_FAILED);
 	return -EBUSY;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886a389f..e1c87425fe11 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
-	"thp_split",
+	"thp_split_page",
+	"thp_split_page_failed",
+	"thp_split_pmd",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 #endif
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 14/24] thp: implement new split_huge_page()
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 13/24] mm, vmstats: new THP splitting event Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

The new split_huge_page() can fail if the compound is pinned: we expect
only caller to have one reference to head page. If the page is pinned
split_huge_page() returns -EBUSY and caller must handle this correctly.

We don't need mark PMDs splitting since now we can split one PMD a time
with split_huge_pmd().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/hugetlb_inline.h |   9 +-
 include/linux/mm.h             |  22 +++--
 mm/huge_memory.c               | 183 +++++++++++++++++++++++------------------
 mm/swap.c                      | 126 +++++++++++++++++++++++++++-
 4 files changed, 244 insertions(+), 96 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..c5cd37479731 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return !!(vma->vm_flags & VM_HUGETLB);
 }
 
+int PageHeadHuge(struct page *page_head);
+
 #else
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +19,11 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return 0;
 }
 
-#endif
+static inline int PageHeadHuge(struct page *page_head)
+{
+       return 0;
+}
+
+#endif /* CONFIG_HUGETLB_PAGE */
 
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 020dbbe1563c..28aeae6e553b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -485,20 +485,18 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-extern int PageHeadHuge(struct page *page_head);
-#else /* CONFIG_HUGETLB_PAGE */
-static inline int PageHeadHuge(struct page *page_head)
-{
-	return 0;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
+void __get_page_tail(struct page *page);
 static inline void get_page(struct page *page)
 {
-	struct page *page_head = compound_head(page);
-	VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
-	atomic_inc(&page_head->_count);
+	if (unlikely(PageTail(page)))
+		return __get_page_tail(page);
+
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count.
+	 */
+	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	atomic_inc(&page->_count);
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3409a5c7dbb8..6f6429426edb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1737,31 +1737,52 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
 	__split_huge_pmd(vma, pmd, address);
 }
 
-#if 0
-static void __split_huge_page_refcount(struct page *page,
+static int __split_huge_page_refcount(struct page *page,
 				       struct list_head *list)
 {
 	int i;
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
-	int tail_count = 0;
+	int tail_mapcount = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 
 	compound_lock(page);
+
+	/*
+	 * We cannot split pinned THP page: we expect page count to be equal
+	 * to sum of mapcount of all sub-pages plus one (split_huge_page()
+	 * caller must take reference for head page).
+	 *
+	 * Compound lock only prevents page->_count to be updated from
+	 * get_page() or put_page() on tail page. It means means page_count()
+	 * can change under us from head page after the check, but it's okay:
+	 * all new refernces will stay on head page after split.
+	 */
+	tail_mapcount = 0;
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		tail_mapcount += page_mapcount(page + i);
+	if (tail_mapcount != page_count(page) - 1) {
+		BUG_ON(tail_mapcount > page_count(page) - 1);
+		compound_unlock(page);
+		spin_unlock_irq(&zone->lru_lock);
+		return -EBUSY;
+	}
+
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
+	tail_mapcount = 0;
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		struct page *page_tail = page + i;
 
 		/* tail_page->_mapcount cannot change */
-		BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
-		tail_count += atomic_read(&page_tail->_mapcount) + 1;
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_mapcount += page_mapcount(page_tail);
 		/* check for overflow */
-		BUG_ON(tail_count < 0);
+		BUG_ON(tail_mapcount < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
 		/*
 		 * tail_page->_count is zero and not changing from
@@ -1799,28 +1820,9 @@ static void __split_huge_page_refcount(struct page *page,
 		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
-		/*
-		 * __split_huge_page_splitting() already set the
-		 * splitting bit in all pmd that could map this
-		 * hugepage, that will ensure no CPU can alter the
-		 * mapcount on the head page. The mapcount is only
-		 * accounted in the head page and it has to be
-		 * transferred to all tail pages in the below code. So
-		 * for this code to be safe, the split the mapcount
-		 * can't change. But that doesn't mean userland can't
-		 * keep changing and reading the page contents while
-		 * we transfer the mapcount, so the pmd splitting
-		 * status is achieved setting a reserved bit in the
-		 * pmd, not by clearing the present bit.
-		*/
-		atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);
-
 		/* ->mapping in first tail page is compound_mapcount */
-		if (i != 1) {
-			BUG_ON(page_tail->mapping);
-			page_tail->mapping = page->mapping;
-			BUG_ON(!PageAnon(page_tail));
-		}
+		BUG_ON(i != 1 && page_tail->mapping);
+		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
 		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
@@ -1831,12 +1833,9 @@ static void __split_huge_page_refcount(struct page *page,
 
 		lru_add_page_tail(page, page_tail, lruvec, list);
 	}
-	atomic_sub(tail_count, &page->_count);
+	atomic_sub(tail_mapcount, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	page->_mapcount = *compound_mapcount_ptr(page);
-	page[1].mapping = page->mapping;
-
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
@@ -1861,71 +1860,95 @@ static void __split_huge_page_refcount(struct page *page,
 	 * to be pinned by the caller.
 	 */
 	BUG_ON(page_count(page) <= 0);
+	return 0;
 }
 
-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
-			      struct anon_vma *anon_vma,
-			      struct list_head *list)
+/*
+ * Split a hugepage into normal pages. This doesn't change the position of head
+ * page. If @list is null, tail pages will be added to LRU list, otherwise, to
+ * @list. Both head page and tail pages will inherit mapping, flags, and so on
+ * from the hugepage.
+ * Return 0 if the hugepage is split successfully otherwise return -errno.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
-	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	int i, tail_mapcount;
+	int ret = -EBUSY;
 
-	BUG_ON(!PageHead(page));
-	BUG_ON(PageTail(page));
+	BUG_ON(is_huge_zero_page(page));
+	BUG_ON(!PageAnon(page));
 
-	mapcount = 0;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount += __split_huge_page_splitting(page, vma, addr);
-	}
 	/*
-	 * It is critical that new vmas are added to the tail of the
-	 * anon_vma list. This guarantes that if copy_huge_pmd() runs
-	 * and establishes a child pmd before
-	 * __split_huge_page_splitting() freezes the parent pmd (so if
-	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
-	 * the newly established pmd of the child later during the
-	 * walk, to be able to set it as pmd_trans_splitting too.
+	 * The caller does not necessarily hold an mmap_sem that would prevent
+	 * the anon_vma disappearing so we first we take a reference to it
+	 * and then lock the anon_vma for write. This is similar to
+	 * page_lock_anon_vma_read except the write lock is taken to serialise
+	 * against parallel split or collapse operations.
 	 */
-	if (mapcount != page_mapcount(page)) {
-		pr_err("mapcount %d page_mapcount %d\n",
-			mapcount, page_mapcount(page));
-		BUG();
+	anon_vma = page_get_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	anon_vma_lock_write(anon_vma);
+
+	if (!PageCompound(page)) {
+		ret = 0;
+		goto out_unlock;
 	}
 
-	__split_huge_page_refcount(page, list);
+	BUG_ON(!PageSwapBacked(page));
+
+	/*
+	 * Racy check if __split_huge_page_refcount() can be successful, before
+	 * splitting PMDs.
+	 */
+	tail_mapcount = compound_mapcount(page);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		tail_mapcount += atomic_read(&page[i]._mapcount) + 1;
+	if (tail_mapcount != page_count(page) - 1) {
+		VM_BUG_ON_PAGE(tail_mapcount > page_count(page) - 1, page);
+		ret = -EBUSY;
+		goto out_unlock;
+	}
 
-	mapcount2 = 0;
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount2 += __split_huge_page_map(page, vma, addr);
-	}
-	if (mapcount != mapcount2) {
-		pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
-			mapcount, mapcount2, page_mapcount(page));
-		BUG();
+		spinlock_t *ptl;
+		pmd_t *pmd;
+		unsigned long haddr = addr & HPAGE_PMD_MASK;
+		unsigned long mmun_start;	/* For mmu_notifiers */
+		unsigned long mmun_end;		/* For mmu_notifiers */
+
+		mmun_start = haddr;
+		mmun_end   = haddr + HPAGE_PMD_SIZE;
+		mmu_notifier_invalidate_range_start(vma->vm_mm,
+				mmun_start, mmun_end);
+		pmd = page_check_address_pmd(page, vma->vm_mm, addr,
+				PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		if (pmd) {
+			__split_huge_pmd_locked(vma, pmd, addr);
+			spin_unlock(ptl);
+		}
+		mmu_notifier_invalidate_range_end(vma->vm_mm,
+				mmun_start, mmun_end);
 	}
-}
-#endif
 
-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return -errno.
- */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
-{
-	count_vm_event(THP_SPLIT_PAGE_FAILED);
-	return -EBUSY;
+	BUG_ON(compound_mapcount(page));
+	ret = __split_huge_page_refcount(page, list);
+	BUG_ON(!ret && PageCompound(page));
+
+out_unlock:
+	anon_vma_unlock_write(anon_vma);
+	put_anon_vma(anon_vma);
+out:
+	if (ret)
+		count_vm_event(THP_SPLIT_PAGE_FAILED);
+	else
+		count_vm_event(THP_SPLIT_PAGE);
+	return ret;
 }
 
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/swap.c b/mm/swap.c
index 2e647d4dc6bb..7b4fbb26cc2c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,12 +80,86 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
+static inline bool compound_lock_needed(struct page *page)
+{
+	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+		!PageSlab(page) && !PageHeadHuge(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	struct page *page_head = compound_head(page);
+	struct page *page_head;
+	unsigned long flags;
+
+	if (likely(!PageTail(page))) {
+		if (put_page_testzero(page)) {
+			/*
+			 * By the time all refcounts have been released
+			 * split_huge_page cannot run anymore from under us.
+			 */
+			if (PageHead(page))
+				__put_compound_page(page);
+			else
+				__put_single_page(page);
+		}
+		return;
+	}
+
+	/* __split_huge_page_refcount can run under us */
+	page_head = compound_head(page);
+
+	if (!compound_lock_needed(page_head)) {
+		/*
+		 * If "page" is a THP tail, we must read the tail page flags
+		 * after the head page flags. The split_huge_page side enforces
+		 * write memory barriers between clearing PageTail and before
+		 * the head page can be freed and reallocated.
+		 */
+		smp_rmb();
+		if (likely(PageTail(page))) {
+			/* __split_huge_page_refcount cannot race here. */
+			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+			VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
+			if (put_page_testzero(page_head)) {
+				/*
+				 * If this is the tail of a slab compound page,
+				 * the tail pin must not be the last reference
+				 * held on the page, because the PG_slab cannot
+				 * be cleared before all tail pins (which skips
+				 * the _mapcount tail refcounting) have been
+				 * released. For hugetlbfs the tail pin may be
+				 * the last reference on the page instead,
+				 * because PageHeadHuge will not go away until
+				 * the compound page enters the buddy
+				 * allocator.
+				 */
+				VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+				__put_compound_page(page_head);
+			}
+		} else if (put_page_testzero(page))
+			__put_single_page(page);
+		return;
+	}
 
-	if (put_page_testzero(page_head))
-			__put_compound_page(page_head);
+	flags = compound_lock_irqsave(page_head);
+	/* here __split_huge_page_refcount won't run anymore */
+	if (likely(page != page_head && PageTail(page))) {
+		bool free;
+
+		free = put_page_testzero(page_head);
+		compound_unlock_irqrestore(page_head, flags);
+		if (free) {
+			if (PageHead(page_head))
+				__put_compound_page(page_head);
+			else
+				__put_single_page(page_head);
+		}
+	} else {
+		compound_unlock_irqrestore(page_head, flags);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+		if (put_page_testzero(page))
+			__put_single_page(page);
+	}
 }
 
 void put_page(struct page *page)
@@ -97,6 +171,52 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+/*
+ * This function is exported but must not be called by anything other
+ * than get_page(). It implements the slow path of get_page().
+ */
+void __get_page_tail(struct page *page)
+{
+	struct page *page_head = compound_head(page);
+	unsigned long flags;
+
+	if (!compound_lock_needed(page_head)) {
+		smp_rmb();
+		if (likely(PageTail(page))) {
+			/*
+			 * This is a hugetlbfs page or a slab page.
+			 * __split_huge_page_refcount cannot race here.
+			 */
+			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+			VM_BUG_ON(page_head != page->first_page);
+			VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+					page);
+			atomic_inc(&page_head->_count);
+		} else {
+			/*
+			 * __split_huge_page_refcount run before us, "page" was
+			 * a thp tail. the split page_head has been freed and
+			 * reallocated as slab or hugetlbfs page of smaller
+			 * order (only possible if reallocated as slab on x86).
+			 */
+			VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+			atomic_inc(&page->_count);
+		}
+		return;
+	}
+
+	flags = compound_lock_irqsave(page_head);
+	/* here __split_huge_page_refcount won't run anymore */
+	if (unlikely(page == page_head || !PageTail(page) ||
+				!get_page_unless_zero(page_head))) {
+		/* page is not part of THP page anymore */
+		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+		atomic_inc(&page->_count);
+	}
+	compound_unlock_irqrestore(page_head, flags);
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 14/24] thp: implement new split_huge_page() Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-29 16:10   ` Aneesh Kumar K.V
  2015-03-04 16:33 ` [PATCHv4 16/24] x86, " Kirill A. Shutemov
                   ` (11 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting we don't need to mark PMDs splitting. Let's drop code
to handle this.

Arch-specific code will removed separately.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c            |  8 +++---
 include/asm-generic/pgtable.h |  5 ----
 include/linux/huge_mm.h       | 16 ------------
 mm/gup.c                      |  7 ------
 mm/huge_memory.c              | 57 +++++++++----------------------------------
 mm/memcontrol.c               | 14 ++---------
 mm/memory.c                   | 18 ++------------
 mm/mincore.c                  |  2 +-
 mm/pgtable-generic.c          | 14 -----------
 mm/rmap.c                     |  4 +--
 10 files changed, 21 insertions(+), 124 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 50acf17538db..fcd1b887eccc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -534,7 +534,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
@@ -799,7 +799,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
 			clear_soft_dirty_pmd(vma, addr, pmd);
 			goto out;
@@ -1112,7 +1112,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte, *orig_pte;
 	int err = 0;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		int pmd_flags2;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1416,7 +1416,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte;
 	pte_t *pte;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1f9f5da6828f..8a89bb60d6f5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -183,11 +183,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp);
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3c5fe722cc14..7a0c477a2b38 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -49,15 +49,9 @@ enum transparent_hugepage_flag {
 #endif
 };
 
-enum page_check_address_pmd_flag {
-	PAGE_CHECK_ADDRESS_PMD_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
 extern pmd_t *page_check_address_pmd(struct page *page,
 				     struct mm_struct *mm,
 				     unsigned long address,
-				     enum page_check_address_pmd_flag flag,
 				     spinlock_t **ptl);
 extern int pmd_freeable(pmd_t pmd);
 
@@ -110,14 +104,6 @@ extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
 			__split_huge_pmd(__vma, __pmd, __address);	\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
-		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
-		       pmd_trans_huge(*____pmd));			\
-	} while (0)
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -184,8 +170,6 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define wait_split_huge_page(__anon_vma, __pmd)	\
-	do { } while (0)
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/mm/gup.c b/mm/gup.c
index b68d9ffa3c9e..2f776850108c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -208,13 +208,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
 	}
-
-	if (unlikely(pmd_trans_splitting(*pmd))) {
-		spin_unlock(ptl);
-		wait_split_huge_page(vma->anon_vma, pmd);
-		return follow_page_pte(vma, address, pmd, flags);
-	}
-
 	if (flags & FOLL_SPLIT) {
 		int ret;
 		page = pmd_page(*pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f6429426edb..3741f81e423e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -880,15 +880,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 
-	if (unlikely(pmd_trans_splitting(pmd))) {
-		/* split huge page running from under us */
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		pte_free(dst_mm, pgtable);
-
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
-		goto out;
-	}
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
@@ -1392,7 +1383,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
@@ -1432,7 +1423,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		  pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	spinlock_t *old_ptl, *new_ptl;
-	int ret = 0;
 	pmd_t pmd;
 
 	struct mm_struct *mm = vma->vm_mm;
@@ -1441,7 +1431,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	    (new_addr & ~HPAGE_PMD_MASK) ||
 	    old_end - old_addr < HPAGE_PMD_SIZE ||
 	    (new_vma->vm_flags & VM_NOHUGEPAGE))
-		goto out;
+		return 0;
 
 	/*
 	 * The destination pmd shouldn't be established, free_pgtables()
@@ -1449,15 +1439,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	 */
 	if (WARN_ON(!pmd_none(*new_pmd))) {
 		VM_BUG_ON(pmd_trans_huge(*new_pmd));
-		goto out;
+		return 0;
 	}
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_sem prevents deadlock.
 	 */
-	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
-	if (ret == 1) {
+	if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1473,9 +1462,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		spin_unlock(old_ptl);
+		return 1;
 	}
-out:
-	return ret;
+	return 0;
 }
 
 /*
@@ -1491,7 +1480,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pmd_t entry;
 
 		/*
@@ -1529,17 +1518,8 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl)
 {
 	*ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(*ptl);
-			wait_split_huge_page(vma->anon_vma, pmd);
-			return -1;
-		} else {
-			/* Thp mapped by 'pmd' is stable, so we can
-			 * handle it as it is. */
-			return 1;
-		}
-	}
+	if (likely(pmd_trans_huge(*pmd)))
+		return 1;
 	spin_unlock(*ptl);
 	return 0;
 }
@@ -1555,7 +1535,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
-			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
 	pgd_t *pgd;
@@ -1578,21 +1557,8 @@ pmd_t *page_check_address_pmd(struct page *page,
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-	/*
-	 * split_vma() may create temporary aliased mappings. There is
-	 * no risk as long as all huge pmd are found and have their
-	 * splitting bit set before __split_huge_page_refcount
-	 * runs. Finding the same huge pmd more than once during the
-	 * same rmap walk is not a problem.
-	 */
-	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
-	    pmd_trans_splitting(*pmd))
-		goto unlock;
-	if (pmd_trans_huge(*pmd)) {
-		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
-			  !pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd))
 		return pmd;
-	}
 unlock:
 	spin_unlock(*ptl);
 	return NULL;
@@ -1926,8 +1892,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		mmun_end   = haddr + HPAGE_PMD_SIZE;
 		mmu_notifier_invalidate_range_start(vma->vm_mm,
 				mmun_start, mmun_end);
-		pmd = page_check_address_pmd(page, vma->vm_mm, addr,
-				PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
 		if (pmd) {
 			__split_huge_pmd_locked(vma, pmd, addr);
 			spin_unlock(ptl);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c86945bcc9a..0724cd50b48a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4903,7 +4903,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -5071,17 +5071,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	union mc_target target;
 	struct page *page;
 
-	/*
-	 * We don't take compound_lock() here but no race with splitting thp
-	 * happens because:
-	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
-	 *    under splitting, which means there's no concurrent thp split,
-	 *  - if another thread runs into split_huge_page() just after we
-	 *    entered this if-block, the thread must wait for page table lock
-	 *    to be unlocked in __split_huge_page_splitting(), where the main
-	 *    part of thp split is not executed yet.
-	 */
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(ptl);
 			return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 0db0481311d5..e7a08de704b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -565,7 +565,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	pgtable_t new = pte_alloc_one(mm, address);
-	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -585,18 +584,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	ptl = pmd_lock(mm, pmd);
-	wait_split_huge_page = 0;
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		atomic_long_inc(&mm->nr_ptes);
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	} else if (unlikely(pmd_trans_splitting(*pmd)))
-		wait_split_huge_page = 1;
+	}
 	spin_unlock(ptl);
 	if (new)
 		pte_free(mm, new);
-	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -612,8 +607,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	} else
-		VM_BUG_ON(pmd_trans_splitting(*pmd));
+	}
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3284,14 +3278,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_trans_huge(orig_pmd)) {
 			unsigned int dirty = flags & FAULT_FLAG_WRITE;
 
-			/*
-			 * If the pmd is splitting, return and retry the
-			 * the fault.  Alternative: wait until the split
-			 * is done, and goto retry.
-			 */
-			if (pmd_trans_splitting(orig_pmd))
-				return 0;
-
 			if (pmd_protnone(orig_pmd))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
diff --git a/mm/mincore.c b/mm/mincore.c
index be25efde64a4..feb867f5fdf4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		memset(vec, 1, nr);
 		spin_unlock(ptl);
 		goto out;
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c25f94b33811..2fe699cedd4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	pmd_t pmd = pmd_mksplitting(*pmdp);
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-	/* tlb flush only to serialize against gup-fast */
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index db8b99e48966..eb2f4a0d3961 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -730,8 +730,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		 * rmap might return false positives; we must filter
 		 * these out using page_check_address_pmd().
 		 */
-		pmd = page_check_address_pmd(page, mm, address,
-					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		pmd = page_check_address_pmd(page, mm, address, &ptl);
 		if (!pmd)
 			return SWAP_AGAIN;
 
@@ -741,7 +740,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return SWAP_FAIL; /* To break the loop */
 		}
 
-		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 16/24] x86, thp: remove infrastructure for handling splitting PMDs
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 17/24] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting we don't need to mark PMDs splitting. Let's drop
code to handle this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h       |  9 ---------
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/gup.c                    | 13 +------------
 arch/x86/mm/pgtable.c                | 14 --------------
 4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index affcb3459847..822ede95a64e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
 static inline int pmd_trans_huge(pmd_t pmd)
 {
 	return pmd_val(pmd) & _PAGE_PSE;
@@ -794,10 +789,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
 #define __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 78f0c8cbe316..45f7cff1baac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
@@ -46,7 +45,6 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 62a887a3cf50..49bbbc57603b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		/*
-		 * The pmd_trans_splitting() check below explains why
-		 * pmdp_splitting_flush has to flush the tlb, to stop
-		 * this gup-fast code from running while we set the
-		 * splitting bit in the pmd. Returning zero will take
-		 * the slow path that will call wait_split_huge_page()
-		 * if the pmd is still in splitting state. gup-fast
-		 * can't because it has irq disabled and
-		 * wait_split_huge_page() would never return as the
-		 * tlb flush IPI wouldn't run.
-		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
 			/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b28edfecbdfe..dad3b2bb57e0 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -508,20 +508,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			  unsigned long address, pmd_t *pmdp)
-{
-	int set;
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)pmdp);
-	if (set) {
-		pmd_update(vma->vm_mm, address, pmdp);
-		/* need tlb flush only to serialize against gup-fast */
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-	}
-}
 #endif
 
 /**
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 17/24] futex, thp: remove special case for THP in get_futex_key
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 16/24] x86, " Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page Kirill A. Shutemov
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new THP refcounting, we don't need tricks to stabilize huge page.
If we've got reference to tail page, it can't split under us.

This patch effectively reverts a5b338f2b0b1.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/futex.c | 61 ++++++++++++----------------------------------------------
 1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2579e407ff67..2c7cec27058b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page, *page_head;
+	struct page *page;
 	int err, ro = 0;
 
 	/*
@@ -442,46 +442,9 @@ again:
 	else
 		err = 0;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page_head = page;
-	if (unlikely(PageTail(page))) {
-		put_page(page);
-		/* serialize against __split_huge_page_splitting() */
-		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
-			page_head = compound_head(page);
-			/*
-			 * page_head is valid pointer but we must pin
-			 * it before taking the PG_lock and/or
-			 * PG_compound_lock. The moment we re-enable
-			 * irqs __split_huge_page_splitting() can
-			 * return and the head page can be freed from
-			 * under us. We can't take the PG_lock and/or
-			 * PG_compound_lock on a page that could be
-			 * freed from under us.
-			 */
-			if (page != page_head) {
-				get_page(page_head);
-				put_page(page);
-			}
-			local_irq_enable();
-		} else {
-			local_irq_enable();
-			goto again;
-		}
-	}
-#else
-	page_head = compound_head(page);
-	if (page != page_head) {
-		get_page(page_head);
-		put_page(page);
-	}
-#endif
-
-	lock_page(page_head);
-
+	lock_page(page);
 	/*
-	 * If page_head->mapping is NULL, then it cannot be a PageAnon
+	 * If page->mapping is NULL, then it cannot be a PageAnon
 	 * page; but it might be the ZERO_PAGE or in the gate area or
 	 * in a special mapping (all cases which we are happy to fail);
 	 * or it may have been a good file page when get_user_pages_fast
@@ -493,12 +456,12 @@ again:
 	 *
 	 * The case we do have to guard against is when memory pressure made
 	 * shmem_writepage move it from filecache to swapcache beneath us:
-	 * an unlikely race, but we do need to retry for page_head->mapping.
+	 * an unlikely race, but we do need to retry for page->mapping.
 	 */
-	if (!page_head->mapping) {
-		int shmem_swizzled = PageSwapCache(page_head);
-		unlock_page(page_head);
-		put_page(page_head);
+	if (!page->mapping) {
+		int shmem_swizzled = PageSwapCache(page);
+		unlock_page(page);
+		put_page(page);
 		if (shmem_swizzled)
 			goto again;
 		return -EFAULT;
@@ -511,7 +474,7 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page_head)) {
+	if (PageAnon(page)) {
 		/*
 		 * A RO anonymous page will never change and thus doesn't make
 		 * sense for futex operations.
@@ -526,15 +489,15 @@ again:
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page_head->mapping->host;
+		key->shared.inode = page->mapping->host;
 		key->shared.pgoff = basepage_index(page);
 	}
 
 	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
-	unlock_page(page_head);
-	put_page(page_head);
+	unlock_page(page);
+	put_page(page);
 	return err;
 }
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 17/24] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-30 14:10   ` Aneesh Kumar K.V
  2015-03-04 16:33 ` [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
                   ` (8 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setup and remove migration entries require
page to be locked.

Some of split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c    |  1 +
 mm/ksm.c            |  6 ++++--
 mm/memory-failure.c | 12 +++++++++---
 mm/migrate.c        |  8 ++++++--
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3741f81e423e..f1d88b9059e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1846,6 +1846,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
+	BUG_ON(!PageLocked(page));
 
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
diff --git a/mm/ksm.c b/mm/ksm.c
index 92182eeba87d..a8a88b0f6f62 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -987,9 +987,11 @@ static int page_trans_compound_anon_split(struct page *page)
 			 * Recheck we got the reference while the head
 			 * was still anonymous.
 			 */
-			if (PageAnon(transhuge_head))
+			if (PageAnon(transhuge_head)) {
+				lock_page(transhuge_head);
 				ret = split_huge_page(transhuge_head);
-			else
+				unlock_page(transhuge_head);
+			} else
 				/*
 				 * Retry later if split_huge_page run
 				 * from under us.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d487f8dc6d39..74c5aaddae85 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -950,7 +950,10 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 		 * enough * to be safe.
 		 */
 		if (!PageHuge(hpage) && PageAnon(hpage)) {
-			if (unlikely(split_huge_page(hpage))) {
+			lock_page(hpage);
+			ret = split_huge_page(hpage);
+			unlock_page(hpage);
+			if (unlikely(ret)) {
 				/*
 				 * FIXME: if splitting THP is failed, it is
 				 * better to stop the following operation rather
@@ -1694,10 +1697,13 @@ int soft_offline_page(struct page *page, int flags)
 		return -EBUSY;
 	}
 	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
+		lock_page(page);
+		ret = split_huge_page(hpage);
+		unlock_page(page);
+		if (unlikely(ret)) {
 			pr_info("soft offline: %#lx: failed to split THP\n",
 				pfn);
-			return -EBUSY;
+			return ret;
 		}
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 01449826b914..91a67029bb18 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -920,9 +920,13 @@ static int unmap_and_move(new_page_t get_new_page, free_page_t put_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page(page)))
+	if (unlikely(PageTransHuge(page))) {
+		lock_page(page);
+		rc = split_huge_page(page);
+		unlock_page(page);
+		if (rc)
 			goto out;
+	}
 
 	rc = __unmap_and_move(page, newpage, force, mode);
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-30 14:19   ` Aneesh Kumar K.V
  2015-03-30 15:08   ` Aneesh Kumar K.V
  2015-03-04 16:33 ` [PATCHv4 20/24] mm, thp: remove compound_lock Kirill A. Shutemov
                   ` (7 subsequent siblings)
  26 siblings, 2 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Currently, we rely on compound_lock() to get page counts stable on
splitting page refcounting. To get it work we also take the lock on
get_page() and put_page() which is hot path.

This patch rework splitting code to setup migration entries to stabilaze
page count/mapcount before distribute refcounts. It means we don't need
to compound lock in get_page()/put_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/migrate.h |   3 +
 include/linux/mm.h      |   1 +
 include/linux/pagemap.h |   9 ++-
 mm/huge_memory.c        | 188 +++++++++++++++++++++++++++++++++++-------------
 mm/internal.h           |  26 +++++--
 mm/migrate.c            |  79 +++++++++++---------
 mm/rmap.c               |  21 ------
 7 files changed, 218 insertions(+), 109 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 78baed5f2952..b9bc86c24829 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -43,6 +43,9 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode,
 		int extra_count);
+extern int __remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, struct page *old);
+
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 28aeae6e553b..43a9993f1333 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -981,6 +981,7 @@ extern struct address_space *page_mapping(struct page *page);
 /* Neutral page->mapping pointer to address_space or anon_vma or other */
 static inline void *page_rmapping(struct page *page)
 {
+	page = compound_head(page);
 	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
 }
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad6da4e49555..faef48e04fc4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -387,10 +387,17 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
  */
 static inline pgoff_t page_to_pgoff(struct page *page)
 {
+	pgoff_t pgoff;
+
 	if (unlikely(PageHeadHuge(page)))
 		return page->index << compound_order(page);
-	else
+
+	if (likely(!PageTransTail(page)))
 		return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+	pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff += page - page->first_page;
+	return pgoff;
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f1d88b9059e2..2bc89b2169aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/swapops.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1599,7 +1600,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma,
-		pmd_t *pmd, unsigned long address)
+		pmd_t *pmd, unsigned long address, int freeze)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 	struct page *page;
@@ -1637,12 +1638,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		entry = mk_pte(page + i, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (!write)
-			entry = pte_wrprotect(entry);
-		if (!young)
-			entry = pte_mkold(entry);
+		if (freeze) {
+			swp_entry_t swp_entry;
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pte(swp_entry);
+		} else {
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!write)
+				entry = pte_wrprotect(entry);
+			if (!young)
+				entry = pte_mkold(entry);
+		}
 		pte = pte_offset_map(&_pmd, haddr);
 		BUG_ON(!pte_none(*pte));
 		atomic_inc(&page[i]._mapcount);
@@ -1668,7 +1675,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_trans_huge(*pmd)))
-		__split_huge_pmd_locked(vma, pmd, address);
+		__split_huge_pmd_locked(vma, pmd, address, 0);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 }
@@ -1703,20 +1710,127 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
 	__split_huge_pmd(vma, pmd, address);
 }
 
-static int __split_huge_page_refcount(struct page *page,
-				       struct list_head *list)
+static void freeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	unsigned long addr, haddr;
+	unsigned long mmun_start, mmun_end;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *start_pte, *pte;
+	spinlock_t *ptl;
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
+			pgoff + HPAGE_PMD_NR - 1) {
+		vma = avc->vma;
+		haddr = addr = __vma_address(page, vma) & HPAGE_PMD_MASK;
+		mmun_start = haddr;
+		mmun_end   = haddr + HPAGE_PMD_SIZE;
+		mmu_notifier_invalidate_range_start(vma->vm_mm,
+				mmun_start, mmun_end);
+
+		pgd = pgd_offset(vma->vm_mm, addr);
+		if (!pgd_present(*pgd))
+			goto next;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto next;
+		pmd = pmd_offset(pud, addr);
+
+		ptl = pmd_lock(vma->vm_mm, pmd);
+		if (!pmd_present(*pmd)) {
+			spin_unlock(ptl);
+			goto next;
+		}
+		if (pmd_trans_huge(*pmd)) {
+			if (page == pmd_page(*pmd))
+				__split_huge_pmd_locked(vma, pmd, addr, 1);
+			spin_unlock(ptl);
+			goto next;
+		}
+		spin_unlock(ptl);
+
+		start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+		pte = start_pte;
+		do {
+			pte_t entry, swp_pte;
+			swp_entry_t swp_entry;
+
+			if (!pte_present(*pte))
+				continue;
+			if (page_to_pfn(page) != pte_pfn(*pte))
+				continue;
+			flush_cache_page(vma, addr, page_to_pfn(page));
+			entry = ptep_clear_flush(vma, addr, pte);
+			swp_entry = make_migration_entry(page,
+					pte_write(entry));
+			swp_pte = swp_entry_to_pte(swp_entry);
+			if (pte_soft_dirty(entry))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(vma->vm_mm, addr, pte, swp_pte);
+		} while (pte++, addr += PAGE_SIZE, page++, addr != mmun_end);
+		pte_unmap_unlock(start_pte, ptl);
+next:
+		mmu_notifier_invalidate_range_end(vma->vm_mm,
+				mmun_start, mmun_end);
+	}
+}
+
+static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page_to_pgoff(page);
+	unsigned long addr;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	int i;
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, pgoff++, page++) {
+		if (!page_mapcount(page))
+			continue;
+
+		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+				pgoff, pgoff) {
+			addr = vma_address(page, avc->vma);
+			pmd = mm_find_pmd(avc->vma->vm_mm, addr);
+			if (!pmd)
+				continue;
+			ptep = pte_offset_map_lock(avc->vma->vm_mm, pmd, addr,
+					&ptl);
+			__remove_migration_pte(page, avc->vma, addr,
+					ptep, page);
+
+			/*
+			 * remove_migration_pte() adds page to rmap, but we
+			 * didn't remove it on freeze_page().
+			 * Let's fix it up here.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			pte_unmap_unlock(ptep, ptl);
+		}
+	}
+}
+
+static int __split_huge_page_refcount(struct anon_vma *anon_vma,
+		struct page *page, struct list_head *list)
 {
 	int i;
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_mapcount = 0;
 
+	freeze_page(anon_vma, page);
+	VM_BUG_ON_PAGE(compound_mapcount(page), page);
+
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 
-	compound_lock(page);
-
 	/*
 	 * We cannot split pinned THP page: we expect page count to be equal
 	 * to sum of mapcount of all sub-pages plus one (split_huge_page()
@@ -1732,8 +1846,8 @@ static int __split_huge_page_refcount(struct page *page,
 		tail_mapcount += page_mapcount(page + i);
 	if (tail_mapcount != page_count(page) - 1) {
 		BUG_ON(tail_mapcount > page_count(page) - 1);
-		compound_unlock(page);
 		spin_unlock_irq(&zone->lru_lock);
+		unfreeze_page(anon_vma, page);
 		return -EBUSY;
 	}
 
@@ -1780,6 +1894,7 @@ static int __split_huge_page_refcount(struct page *page,
 				      (1L << PG_mlocked) |
 				      (1L << PG_uptodate) |
 				      (1L << PG_active) |
+				      (1L << PG_locked) |
 				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
@@ -1805,12 +1920,14 @@ static int __split_huge_page_refcount(struct page *page,
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
-	compound_unlock(page);
 	spin_unlock_irq(&zone->lru_lock);
 
+	unfreeze_page(anon_vma, page);
+
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 		BUG_ON(page_count(page_tail) <= 0);
+		unlock_page(page_tail);
 		/*
 		 * Tail pages may be freed if there wasn't any mapping
 		 * like if add_to_swap() is running on a lru page that
@@ -1839,14 +1956,13 @@ static int __split_huge_page_refcount(struct page *page,
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
-	struct anon_vma_chain *avc;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	int i, tail_mapcount;
-	int ret = -EBUSY;
+	int ret;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
 	BUG_ON(!PageLocked(page));
+	BUG_ON(PageTail(page));
 
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
@@ -1856,17 +1972,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * against parallel split or collapse operations.
 	 */
 	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
+	if (!anon_vma) {
+		ret = -EBUSY;
 		goto out;
+	}
 	anon_vma_lock_write(anon_vma);
 
+	BUG_ON(!PageSwapBacked(page));
 	if (!PageCompound(page)) {
 		ret = 0;
 		goto out_unlock;
 	}
 
-	BUG_ON(!PageSwapBacked(page));
-
 	/*
 	 * Racy check if __split_huge_page_refcount() can be successful, before
 	 * splitting PMDs.
@@ -1880,40 +1997,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		goto out_unlock;
 	}
 
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		spinlock_t *ptl;
-		pmd_t *pmd;
-		unsigned long haddr = addr & HPAGE_PMD_MASK;
-		unsigned long mmun_start;	/* For mmu_notifiers */
-		unsigned long mmun_end;		/* For mmu_notifiers */
-
-		mmun_start = haddr;
-		mmun_end   = haddr + HPAGE_PMD_SIZE;
-		mmu_notifier_invalidate_range_start(vma->vm_mm,
-				mmun_start, mmun_end);
-		pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
-		if (pmd) {
-			__split_huge_pmd_locked(vma, pmd, addr);
-			spin_unlock(ptl);
-		}
-		mmu_notifier_invalidate_range_end(vma->vm_mm,
-				mmun_start, mmun_end);
-	}
-
-	BUG_ON(compound_mapcount(page));
-	ret = __split_huge_page_refcount(page, list);
+	ret = __split_huge_page_refcount(anon_vma, page, list);
 	BUG_ON(!ret && PageCompound(page));
-
 out_unlock:
 	anon_vma_unlock_write(anon_vma);
 	put_anon_vma(anon_vma);
 out:
-	if (ret)
-		count_vm_event(THP_SPLIT_PAGE_FAILED);
-	else
-		count_vm_event(THP_SPLIT_PAGE);
+	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 67017c32dfba..07ffd9fe0728 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,7 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/pagemap.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -263,10 +264,27 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern unsigned long vma_address(struct page *page,
-				 struct vm_area_struct *vma);
-#endif
+/*
+ * At what user virtual address is page expected in @vma?
+ */
+static inline unsigned long
+__vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	pgoff_t pgoff = page_to_pgoff(page);
+	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+}
+
+static inline unsigned long
+vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	unsigned long address = __vma_address(page, vma);
+
+	/* page should be within @vma mapping range */
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+
+	return address;
+}
+
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
diff --git a/mm/migrate.c b/mm/migrate.c
index 91a67029bb18..dbc4fc974464 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -102,45 +102,24 @@ void putback_movable_pages(struct list_head *l)
 /*
  * Restore a potential migration pte to a working pte entry
  */
-static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
-				 unsigned long addr, void *old)
+int __remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+				 unsigned long addr, pte_t *ptep,
+				 struct page *old_page)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	swp_entry_t entry;
- 	pmd_t *pmd;
-	pte_t *ptep, pte;
- 	spinlock_t *ptl;
-
-	if (unlikely(PageHuge(new))) {
-		ptep = huge_pte_offset(mm, addr);
-		if (!ptep)
-			goto out;
-		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
-	} else {
-		pmd = mm_find_pmd(mm, addr);
-		if (!pmd)
-			goto out;
-
-		ptep = pte_offset_map(pmd, addr);
+	pte_t  pte;
 
-		/*
-		 * Peek to check is_swap_pte() before taking ptlock?  No, we
-		 * can race mremap's move_ptes(), which skips anon_vma lock.
-		 */
-
-		ptl = pte_lockptr(mm, pmd);
-	}
-
- 	spin_lock(ptl);
 	pte = *ptep;
-	if (!is_swap_pte(pte))
-		goto unlock;
+	if (!is_swap_pte(pte)) {
+		return SWAP_AGAIN;
+	}
 
 	entry = pte_to_swp_entry(pte);
 
 	if (!is_migration_entry(entry) ||
-	    migration_entry_to_page(entry) != old)
-		goto unlock;
+	    migration_entry_to_page(entry) != old_page) {
+		return SWAP_AGAIN;
+	}
 
 	get_page(new);
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
@@ -158,7 +137,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	}
 #endif
 	flush_dcache_page(new);
-	set_pte_at(mm, addr, ptep, pte);
+	set_pte_at(vma->vm_mm, addr, ptep, pte);
 
 	if (PageHuge(new)) {
 		if (PageAnon(new))
@@ -172,9 +151,41 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, ptep);
-unlock:
+	return SWAP_AGAIN;
+}
+
+static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+		unsigned long addr, void *old)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	int ret;
+
+	if (unlikely(PageHuge(new))) {
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep)
+			return SWAP_AGAIN;
+		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
+	} else {
+		pmd = mm_find_pmd(mm, addr);
+		if (!pmd)
+			return SWAP_AGAIN;
+
+		ptep = pte_offset_map(pmd, addr);
+
+		/*
+		 * Peek to check is_swap_pte() before taking ptlock?  No, we
+		 * can race mremap's move_ptes(), which skips anon_vma lock.
+		 */
+
+		ptl = pte_lockptr(mm, pmd);
+	}
+
+	spin_lock(ptl);
+	ret = __remove_migration_pte(new, vma, addr, ptep, old);
 	pte_unmap_unlock(ptep, ptl);
-out:
 	return SWAP_AGAIN;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index eb2f4a0d3961..2dc26770d1d3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -554,27 +554,6 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 }
 
 /*
- * At what user virtual address is page expected in @vma?
- */
-static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
-{
-	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-}
-
-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
-{
-	unsigned long address = __vma_address(page, vma);
-
-	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-
-	return address;
-}
-
-/*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
  */
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 20/24] mm, thp: remove compound_lock
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-31 15:50   ` Aneesh Kumar K.V
  2015-03-04 16:33 ` [PATCHv4 21/24] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

We don't need a compound lock anymore: split_huge_page() doesn't need it
anymore.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h         |  35 ------------
 include/linux/page-flags.h |  12 +---
 mm/debug.c                 |   3 -
 mm/swap.c                  | 135 +++++++++++++++------------------------------
 4 files changed, 46 insertions(+), 139 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43a9993f1333..370aaa17fab8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -398,41 +398,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
-	unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	local_irq_save(flags);
-	compound_lock(page);
-#endif
-	return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
-					      unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	compound_unlock(page);
-	local_irq_restore(flags);
-#endif
-}
-
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 58b98bced299..dbaa54259f62 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	PG_compound_lock,
-#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -514,12 +511,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 #define __PG_MLOCKED		0
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK		0
-#endif
-
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -529,8 +520,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
-	 __PG_COMPOUND_LOCK)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/debug.c b/mm/debug.c
index 13d2b8146ef9..4a82f639b964 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_MEMORY_FAILURE
 	{1UL << PG_hwpoison,		"hwpoison"	},
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	{1UL << PG_compound_lock,	"compound_lock"	},
-#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/swap.c b/mm/swap.c
index 7b4fbb26cc2c..6c9e764f95d7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,16 +80,9 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-static inline bool compound_lock_needed(struct page *page)
-{
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		!PageSlab(page) && !PageHeadHuge(page);
-}
-
 static void put_compound_page(struct page *page)
 {
 	struct page *page_head;
-	unsigned long flags;
 
 	if (likely(!PageTail(page))) {
 		if (put_page_testzero(page)) {
@@ -108,58 +101,33 @@ static void put_compound_page(struct page *page)
 	/* __split_huge_page_refcount can run under us */
 	page_head = compound_head(page);
 
-	if (!compound_lock_needed(page_head)) {
-		/*
-		 * If "page" is a THP tail, we must read the tail page flags
-		 * after the head page flags. The split_huge_page side enforces
-		 * write memory barriers between clearing PageTail and before
-		 * the head page can be freed and reallocated.
-		 */
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/* __split_huge_page_refcount cannot race here. */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * If this is the tail of a slab compound page,
-				 * the tail pin must not be the last reference
-				 * held on the page, because the PG_slab cannot
-				 * be cleared before all tail pins (which skips
-				 * the _mapcount tail refcounting) have been
-				 * released. For hugetlbfs the tail pin may be
-				 * the last reference on the page instead,
-				 * because PageHeadHuge will not go away until
-				 * the compound page enters the buddy
-				 * allocator.
-				 */
-				VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
-				__put_compound_page(page_head);
-			}
-		} else if (put_page_testzero(page))
-			__put_single_page(page);
-		return;
-	}
-
-	flags = compound_lock_irqsave(page_head);
-	/* here __split_huge_page_refcount won't run anymore */
-	if (likely(page != page_head && PageTail(page))) {
-		bool free;
-
-		free = put_page_testzero(page_head);
-		compound_unlock_irqrestore(page_head, flags);
-		if (free) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
+	/*
+	 * If "page" is a THP tail, we must read the tail page flags after the
+	 * head page flags. The split_huge_page side enforces write memory
+	 * barriers between clearing PageTail and before the head page can be
+	 * freed and reallocated.
+	 */
+	smp_rmb();
+	if (likely(PageTail(page))) {
+		/* __split_huge_page_refcount cannot race here. */
+		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+		if (put_page_testzero(page_head)) {
+			/*
+			 * If this is the tail of a slab compound page, the
+			 * tail pin must not be the last reference held on the
+			 * page, because the PG_slab cannot be cleared before
+			 * all tail pins (which skips the _mapcount tail
+			 * refcounting) have been released. For hugetlbfs the
+			 * tail pin may be the last reference on the page
+			 * instead, because PageHeadHuge will not go away until
+			 * the compound page enters the buddy allocator.
+			 */
+			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+			__put_compound_page(page_head);
 		}
-	} else {
-		compound_unlock_irqrestore(page_head, flags);
-		VM_BUG_ON_PAGE(PageTail(page), page);
-		if (put_page_testzero(page))
-			__put_single_page(page);
-	}
+	} else if (put_page_testzero(page))
+		__put_single_page(page);
+	return;
 }
 
 void put_page(struct page *page)
@@ -178,42 +146,29 @@ EXPORT_SYMBOL(put_page);
 void __get_page_tail(struct page *page)
 {
 	struct page *page_head = compound_head(page);
-	unsigned long flags;
 
-	if (!compound_lock_needed(page_head)) {
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * This is a hugetlbfs page or a slab page.
-			 * __split_huge_page_refcount cannot race here.
-			 */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			VM_BUG_ON(page_head != page->first_page);
-			VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
-					page);
-			atomic_inc(&page_head->_count);
-		} else {
-			/*
-			 * __split_huge_page_refcount run before us, "page" was
-			 * a thp tail. the split page_head has been freed and
-			 * reallocated as slab or hugetlbfs page of smaller
-			 * order (only possible if reallocated as slab on x86).
-			 */
-			VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-			atomic_inc(&page->_count);
-		}
-		return;
-	}
-
-	flags = compound_lock_irqsave(page_head);
-	/* here __split_huge_page_refcount won't run anymore */
-	if (unlikely(page == page_head || !PageTail(page) ||
-				!get_page_unless_zero(page_head))) {
-		/* page is not part of THP page anymore */
+	smp_rmb();
+	if (likely(PageTail(page))) {
+		/*
+		 * This is a hugetlbfs page or a slab page.
+		 * __split_huge_page_refcount cannot race here.
+		 */
+		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+		VM_BUG_ON(page_head != page->first_page);
+		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+				page);
+		atomic_inc(&page_head->_count);
+	} else {
+		/*
+		 * __split_huge_page_refcount run before us, "page" was
+		 * a thp tail. the split page_head has been freed and
+		 * reallocated as slab or hugetlbfs page of smaller
+		 * order (only possible if reallocated as slab on x86).
+		 */
 		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
 		atomic_inc(&page->_count);
 	}
-	compound_unlock_irqrestore(page_head, flags);
+	return;
 }
 EXPORT_SYMBOL(__get_page_tail);
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 21/24] thp: introduce deferred_split_huge_page()
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 20/24] mm, thp: remove compound_lock Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 22/24] memcg: adjust to support new THP refcounting Kirill A. Shutemov
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Currently we don't split huge page on partial unmap. It's not an ideal
situation. It can lead to memory overhead.

Furtunately, we can detect partial unmap on page_remove_rmap(). But we
cannot call split_huge_page() from there due to locking context.

It's also counterproductive to do directly from munmap() codepath: in
many cases we will hit this from exit(2) and splitting the huge page
just to free it up in small pages is not what we really want.

The patch introduce deferred_split_huge_page() which put the huge page
into queue for splitting. The splitting itself will happen when we get
memory pressure. The page will be dropped from list on freeing through
compound page destructor.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |  9 +++++
 include/linux/mm.h      |  2 ++
 include/linux/mmzone.h  |  5 +++
 mm/huge_memory.c        | 90 +++++++++++++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c         |  6 +++-
 mm/rmap.c               | 10 +++++-
 mm/vmscan.c             |  3 ++
 7 files changed, 120 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7a0c477a2b38..0aaebd81beb6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,13 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
 }
+void deferred_split_huge_page(struct page *page);
+void __drain_split_queue(struct zone *zone);
+static inline void drain_split_queue(struct zone *zone)
+{
+	if (!list_empty(&zone->split_queue))
+		__drain_split_queue(zone);
+}
 extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address);
 #define split_huge_pmd(__vma, __pmd, __address)				\
@@ -170,6 +177,8 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+static inline void deferred_split_huge_page(struct page *page) {}
+static inline void drain_split_queue(struct zone *zone) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 370aaa17fab8..c155f1262b19 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -568,6 +568,8 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].compound_order = order;
 }
 
+void free_compound_page(struct page *page);
+
 #ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f279d9c158cd..4f1afa447e2d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -525,6 +525,11 @@ struct zone {
 	bool			compact_blockskip_flush;
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long split_queue_length;
+	struct list_head split_queue;
+#endif
+
 	ZONE_PADDING(_pad3_)
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2bc89b2169aa..92afb8d2b46d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -69,6 +69,8 @@ static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
 
+static void free_transhuge_page(struct page *page);
+
 #define MM_SLOTS_HASH_BITS 10
 static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
@@ -825,6 +827,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
+	INIT_LIST_HEAD(&page[2].lru);
+	set_compound_page_dtor(page, free_transhuge_page);
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1086,7 +1090,10 @@ alloc:
 	} else
 		new_page = NULL;
 
-	if (unlikely(!new_page)) {
+	if (likely(new_page)) {
+		INIT_LIST_HEAD(&new_page[2].lru);
+		set_compound_page_dtor(new_page, free_transhuge_page);
+	} else {
 		if (!page) {
 			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
@@ -1851,6 +1858,10 @@ static int __split_huge_page_refcount(struct anon_vma *anon_vma,
 		return -EBUSY;
 	}
 
+	spin_lock(&zone->lock);
+	list_del(&page[2].lru);
+	spin_unlock(&zone->lock);
+
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
@@ -2007,6 +2018,71 @@ out:
 	return ret;
 }
 
+static void free_transhuge_page(struct page *page)
+{
+	if (!list_empty(&page[2].lru)) {
+		struct zone *zone = page_zone(page);
+		unsigned long flags;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		list_del(&page[2].lru);
+		memset(&page[2].lru, 0, sizeof(page[2].lru));
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+	free_compound_page(page);
+}
+
+void deferred_split_huge_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+	/* we use page->lru in second tail page: assuming THP order >= 2 */
+	BUILD_BUG_ON(HPAGE_PMD_ORDER < 2);
+
+	if (!list_empty(&page[2].lru))
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (list_empty(&page[2].lru))
+		list_add_tail(&page[2].lru, &zone->split_queue);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+void __drain_split_queue(struct zone *zone)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+	struct page *page, *next;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list_splice_init(&zone->split_queue, &list);
+	/*
+	 * take reference for all pages under zone->lock to avoid race
+	 * with free_transhuge_page().
+	 */
+	list_for_each_entry_safe(page, next, &list, lru)
+		get_page(compound_head(page));
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	list_for_each_entry_safe(page, next, &list, lru) {
+		page = compound_head(page);
+		lock_page(page);
+		/* split_huge_page() removes page from list on success */
+		split_huge_page(compound_head(page));
+		unlock_page(page);
+		put_page(page);
+	}
+
+	if (!list_empty(&list)) {
+		spin_lock_irqsave(&zone->lock, flags);
+		list_splice_tail(&list, &zone->split_queue);
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
@@ -2438,6 +2514,8 @@ static struct page
 		return NULL;
 	}
 
+	INIT_LIST_HEAD(&(*hpage)[2].lru);
+	set_compound_page_dtor(*hpage, free_transhuge_page);
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
@@ -2449,8 +2527,14 @@ static int khugepaged_find_target_node(void)
 
 static inline struct page *alloc_hugepage(int defrag)
 {
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
-			   HPAGE_PMD_ORDER);
+	struct page *page;
+
+	page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0), HPAGE_PMD_ORDER);
+	if (page) {
+		INIT_LIST_HEAD(&page[2].lru);
+		set_compound_page_dtor(page, free_transhuge_page);
+	}
+	return page;
 }
 
 static struct page *khugepaged_alloc_hugepage(bool *wait)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 730f6ad2ab10..9b979447fce2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -357,7 +357,7 @@ out:
  * This usage means that zero-order pages may not be compound.
  */
 
-static void free_compound_page(struct page *page)
+void free_compound_page(struct page *page)
 {
 	__free_pages_ok(page, compound_order(page));
 }
@@ -5009,6 +5009,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->zone_pgdat = pgdat;
 		zone_pcp_init(zone);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		INIT_LIST_HEAD(&zone->split_queue);
+#endif
+
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 2dc26770d1d3..6795babf5739 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1165,6 +1165,7 @@ out:
 void page_remove_rmap(struct page *page, bool compound)
 {
 	int nr = compound ? hpage_nr_pages(page) : 1;
+	bool partial_thp_unmap;
 
 	if (!PageAnon(page)) {
 		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
@@ -1195,13 +1196,20 @@ void page_remove_rmap(struct page *page, bool compound)
 		for (i = 0; i < hpage_nr_pages(page); i++)
 			if (page_mapcount(page + i))
 				nr--;
-	}
+		partial_thp_unmap = nr != hpage_nr_pages(page);
+	} else if (PageTransCompound(page)) {
+		partial_thp_unmap = !compound_mapcount(page);
+	} else
+		partial_thp_unmap = false;
 
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
+	if (partial_thp_unmap)
+		deferred_split_huge_page(compound_head(page));
+
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
 	 * but that might overwrite a racing page_add_anon_rmap
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 260c413d39cd..83a8346a5494 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2376,6 +2376,9 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		unsigned long zone_lru_pages = 0;
 		struct mem_cgroup *memcg;
 
+		/* XXX: accounting for shrinking progress ? */
+		drain_split_queue(zone);
+
 		nr_reclaimed = sc->nr_reclaimed;
 		nr_scanned = sc->nr_scanned;
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 22/24] memcg: adjust to support new THP refcounting
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 21/24] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 23/24] ksm: split huge pages on follow_page() Kirill A. Shutemov
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

With new refcounting we cannot rely on PageTransHuge() check if we need
to charge size of huge page form the cgroup. We need to get information
from caller to know whether it was mapped with PMD or PTE.

We do uncharge when last reference on the page gone. At that point if we
see PageTransHuge() it means we need to unchange whole huge page.

The tricky part is partial unmap. We don't do a special handing of this
situation, meaning we don't uncharge the part of huge page unless last
user is gone of split_huge_page() is triggered. In case of cgroup memory
pressure happens the partial unmapped page will be split through
shrink_zone(). This should be good enough.

I did quick sanity check, more testing is required.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/memcontrol.h | 16 +++++++-----
 kernel/events/uprobes.c    |  7 +++---
 mm/Kconfig                 |  2 +-
 mm/filemap.c               |  8 +++---
 mm/huge_memory.c           | 29 +++++++++++-----------
 mm/memcontrol.c            | 62 +++++++++++++++++-----------------------------
 mm/memory.c                | 26 +++++++++----------
 mm/shmem.c                 | 21 +++++++++-------
 mm/swapfile.c              |  9 ++++---
 9 files changed, 87 insertions(+), 93 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..6a70e6c4bece 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -74,10 +74,12 @@ void mem_cgroup_events(struct mem_cgroup *memcg,
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
-			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      bool lrucare);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+			      bool lrucare, bool compound);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+		bool compound);
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
 
@@ -209,7 +211,8 @@ static inline bool mem_cgroup_low(struct mem_cgroup *root,
 
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
-					struct mem_cgroup **memcgp)
+					struct mem_cgroup **memcgp,
+					bool compound)
 {
 	*memcgp = NULL;
 	return 0;
@@ -217,12 +220,13 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 
 static inline void mem_cgroup_commit_charge(struct page *page,
 					    struct mem_cgroup *memcg,
-					    bool lrucare)
+					    bool lrucare, bool compound)
 {
 }
 
 static inline void mem_cgroup_cancel_charge(struct page *page,
-					    struct mem_cgroup *memcg)
+					    struct mem_cgroup *memcg,
+					    bool compound)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5523daf59953..04e26bdf0717 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	const unsigned long mmun_end   = addr + PAGE_SIZE;
 	struct mem_cgroup *memcg;
 
-	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
+	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg,
+			false);
 	if (err)
 		return err;
 
@@ -184,7 +185,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr, false);
-	mem_cgroup_commit_charge(kpage, memcg, false);
+	mem_cgroup_commit_charge(kpage, memcg, false, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
@@ -207,7 +208,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mem_cgroup_cancel_charge(kpage, memcg);
+	mem_cgroup_cancel_charge(kpage, memcg, false);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	unlock_page(page);
 	return err;
diff --git a/mm/Kconfig b/mm/Kconfig
index 19d090534194..390214da4546 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -409,7 +409,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !MEMCG
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
diff --git a/mm/filemap.c b/mm/filemap.c
index a0c3a57d29c9..25bec2c3e7a3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -555,7 +555,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
-					      gfp_mask, &memcg);
+					      gfp_mask, &memcg, false);
 		if (error)
 			return error;
 	}
@@ -563,7 +563,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, false);
 		return error;
 	}
 
@@ -579,7 +579,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
@@ -587,7 +587,7 @@ err_insert:
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92afb8d2b46d..d14f814f5eed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -719,12 +719,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg, true))
 		return VM_FAULT_OOM;
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, true);
 		return VM_FAULT_OOM;
 	}
 
@@ -739,7 +739,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, true);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -747,7 +747,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -957,13 +957,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
 			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
-						   &memcg))) {
+						   &memcg, false))) {
 			if (pages[i])
 				put_page(pages[i]);
 			while (--i >= 0) {
 				memcg = (void *)page_private(pages[i]);
 				set_page_private(pages[i], 0);
-				mem_cgroup_cancel_charge(pages[i], memcg);
+				mem_cgroup_cancel_charge(pages[i], memcg,
+						false);
 				put_page(pages[i]);
 			}
 			kfree(pages);
@@ -1002,7 +1003,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr, false);
-		mem_cgroup_commit_charge(pages[i], memcg, false);
+		mem_cgroup_commit_charge(pages[i], memcg, false, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
@@ -1030,7 +1031,7 @@ out_free_pages:
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		mem_cgroup_cancel_charge(pages[i], memcg);
+		mem_cgroup_cancel_charge(pages[i], memcg, false);
 		put_page(pages[i]);
 	}
 	kfree(pages);
@@ -1111,7 +1112,7 @@ alloc:
 	}
 
 	if (unlikely(mem_cgroup_try_charge(new_page, mm,
-					   GFP_TRANSHUGE, &memcg))) {
+					   GFP_TRANSHUGE, &memcg, true))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_pmd(vma, pmd, address);
@@ -1140,7 +1141,7 @@ alloc:
 		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_cancel_charge(new_page, memcg);
+		mem_cgroup_cancel_charge(new_page, memcg, true);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1149,7 +1150,7 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush_notify(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
-		mem_cgroup_commit_charge(new_page, memcg, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
@@ -2619,7 +2620,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		return;
 
 	if (unlikely(mem_cgroup_try_charge(new_page, mm,
-					   GFP_TRANSHUGE, &memcg)))
+					   GFP_TRANSHUGE, &memcg, true)))
 		return;
 
 	/*
@@ -2706,7 +2707,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address, true);
-	mem_cgroup_commit_charge(new_page, memcg, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
@@ -2721,7 +2722,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_cancel_charge(new_page, memcg);
+	mem_cgroup_cancel_charge(new_page, memcg, true);
 	goto out_up_write;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0724cd50b48a..72e0a886f68e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -826,7 +826,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 int nr_pages)
+					 bool compound, int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
@@ -839,9 +839,11 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
+	}
 
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
@@ -2800,30 +2802,24 @@ void mem_cgroup_split_huge_fixup(struct page *head)
  * from old cgroup.
  */
 static int mem_cgroup_move_account(struct page *page,
-				   unsigned int nr_pages,
+				   bool compound,
 				   struct mem_cgroup *from,
 				   struct mem_cgroup *to)
 {
 	unsigned long flags;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-	/*
-	 * The page is isolated from LRU. So, collapse function
-	 * will not handle this page. But page splitting can happen.
-	 * Do this check under compound_page_lock(). The caller should
-	 * hold it.
-	 */
-	ret = -EBUSY;
-	if (nr_pages > 1 && !PageTransHuge(page))
-		goto out;
+	VM_BUG_ON(compound && !PageTransHuge(page));
 
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup
 	 * of its source page while we change it: page migration takes
 	 * both pages off the LRU, but page cache replacement doesn't.
 	 */
+	ret = -EBUSY;
 	if (!trylock_page(page))
 		goto out;
 
@@ -2860,9 +2856,9 @@ static int mem_cgroup_move_account(struct page *page,
 	ret = 0;
 
 	local_irq_disable();
-	mem_cgroup_charge_statistics(to, page, nr_pages);
+	mem_cgroup_charge_statistics(to, page, compound, nr_pages);
 	memcg_check_events(to, page);
-	mem_cgroup_charge_statistics(from, page, -nr_pages);
+	mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
 	memcg_check_events(from, page);
 	local_irq_enable();
 out_unlock:
@@ -5080,7 +5076,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
 			if (!isolate_lru_page(page)) {
-				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
+				if (!mem_cgroup_move_account(page, true,
 							     mc.from, mc.to)) {
 					mc.precharge -= HPAGE_PMD_NR;
 					mc.moved_charge += HPAGE_PMD_NR;
@@ -5109,7 +5105,8 @@ retry:
 			page = target.page;
 			if (isolate_lru_page(page))
 				goto put;
-			if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) {
+			if (!mem_cgroup_move_account(page, false,
+						mc.from, mc.to)) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
@@ -5455,10 +5452,11 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
  * with mem_cgroup_cancel_charge() in case page instantiation fails.
  */
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
-			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound)
 {
 	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret = 0;
 
 	if (mem_cgroup_disabled())
@@ -5476,11 +5474,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			goto out;
 	}
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	if (do_swap_account && PageSwapCache(page))
 		memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
@@ -5516,9 +5509,9 @@ out:
  * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
  */
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      bool lrucare)
+			      bool lrucare, bool compound)
 {
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5535,13 +5528,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 
 	commit_charge(page, memcg, lrucare);
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	local_irq_disable();
-	mem_cgroup_charge_statistics(memcg, page, nr_pages);
+	mem_cgroup_charge_statistics(memcg, page, compound, nr_pages);
 	memcg_check_events(memcg, page);
 	local_irq_enable();
 
@@ -5563,9 +5551,10 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
  *
  * Cancel a charge transaction started by mem_cgroup_try_charge().
  */
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+		bool compound)
 {
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -5577,11 +5566,6 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	if (!memcg)
 		return;
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	cancel_charge(memcg, nr_pages);
 }
 
@@ -5835,7 +5819,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
 	VM_BUG_ON(!irqs_disabled());
 
-	mem_cgroup_charge_statistics(memcg, page, -1);
+	mem_cgroup_charge_statistics(memcg, page, false, -1);
 	memcg_check_events(memcg, page);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index e7a08de704b4..4af9b61e97a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2078,7 +2078,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_new;
 
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
@@ -2107,7 +2107,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address, false);
-		mem_cgroup_commit_charge(new_page, memcg, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -2146,7 +2146,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		new_page = old_page;
 		page_copied = 1;
 	} else {
-		mem_cgroup_cancel_charge(new_page, memcg);
+		mem_cgroup_cancel_charge(new_page, memcg, false);
 	}
 
 	if (new_page)
@@ -2487,7 +2487,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2529,10 +2529,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, address, page_table, pte);
 	if (page == swapcache) {
 		do_page_add_anon_rmap(page, vma, address, exclusive);
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, address, false);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
 
@@ -2567,7 +2567,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
@@ -2657,7 +2657,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_page;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2670,7 +2670,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address, false);
-	mem_cgroup_commit_charge(page, memcg, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
@@ -2681,7 +2681,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return 0;
 release:
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
@@ -2932,7 +2932,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
@@ -2961,7 +2961,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
-	mem_cgroup_commit_charge(new_page, memcg, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
 	if (fault_page) {
@@ -2976,7 +2976,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	return ret;
 uncharge_out:
-	mem_cgroup_cancel_charge(new_page, memcg);
+	mem_cgroup_cancel_charge(new_page, memcg, false);
 	page_cache_release(new_page);
 	return ret;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 2f17cb5f00a4..3468552274db 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -706,7 +706,8 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
+	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
+			false);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -729,9 +730,9 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	if (error) {
 		if (error != -ENOMEM)
 			error = 0;
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 	} else
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 out:
 	unlock_page(page);
 	page_cache_release(page);
@@ -1114,7 +1115,8 @@ repeat:
 				goto failed;
 		}
 
-		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+				false);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						swp_to_radix_entry(swap));
@@ -1131,14 +1133,14 @@ repeat:
 			 * "repeat": reading a hole and writing should succeed.
 			 */
 			if (error) {
-				mem_cgroup_cancel_charge(page, memcg);
+				mem_cgroup_cancel_charge(page, memcg, false);
 				delete_from_swap_cache(page);
 			}
 		}
 		if (error)
 			goto failed;
 
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 
 		spin_lock(&info->lock);
 		info->swapped--;
@@ -1177,7 +1179,8 @@ repeat:
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 
-		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+				false);
 		if (error)
 			goto decused;
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1187,10 +1190,10 @@ repeat:
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, false);
 			goto decused;
 		}
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 99f97c31ede5..298efd73dc60 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1106,14 +1106,15 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+	{
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 		ret = 0;
 		goto out;
 	}
@@ -1125,10 +1126,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
 	swap_free(entry);
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 23/24] ksm: split huge pages on follow_page()
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 22/24] memcg: adjust to support new THP refcounting Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-04 16:33 ` [PATCHv4 24/24] thp: update documentation Kirill A. Shutemov
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

Let's split THP with FOLL_SPLIT. Attempting to split them laterk would
always fail bacause we take references on tail pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/ksm.c | 58 ++++++++--------------------------------------------------
 1 file changed, 8 insertions(+), 50 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index a8a88b0f6f62..8d977f074a74 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -441,20 +441,6 @@ static void break_cow(struct rmap_item *rmap_item)
 	up_read(&mm->mmap_sem);
 }
 
-static struct page *page_trans_compound_anon(struct page *page)
-{
-	if (PageTransCompound(page)) {
-		struct page *head = compound_head(page);
-		/*
-		 * head may actually be splitted and freed from under
-		 * us but it's ok here.
-		 */
-		if (PageAnon(head))
-			return head;
-	}
-	return NULL;
-}
-
 static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 {
 	struct mm_struct *mm = rmap_item->mm;
@@ -467,10 +453,10 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	if (!vma)
 		goto out;
 
-	page = follow_page(vma, addr, FOLL_GET);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
-	if (PageAnon(page) || page_trans_compound_anon(page)) {
+	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
@@ -976,35 +962,6 @@ out:
 	return err;
 }
 
-static int page_trans_compound_anon_split(struct page *page)
-{
-	int ret = 0;
-	struct page *transhuge_head = page_trans_compound_anon(page);
-	if (transhuge_head) {
-		/* Get the reference on the head to split it. */
-		if (get_page_unless_zero(transhuge_head)) {
-			/*
-			 * Recheck we got the reference while the head
-			 * was still anonymous.
-			 */
-			if (PageAnon(transhuge_head)) {
-				lock_page(transhuge_head);
-				ret = split_huge_page(transhuge_head);
-				unlock_page(transhuge_head);
-			} else
-				/*
-				 * Retry later if split_huge_page run
-				 * from under us.
-				 */
-				ret = 1;
-			put_page(transhuge_head);
-		} else
-			/* Retry later if split_huge_page run from under us. */
-			ret = 1;
-	}
-	return ret;
-}
-
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @vma: the vma that holds the pte pointing to page
@@ -1025,9 +982,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 
 	if (!(vma->vm_flags & VM_MERGEABLE))
 		goto out;
-	if (PageTransCompound(page) && page_trans_compound_anon_split(page))
-		goto out;
+
+	/* huge pages must be split by this time */
 	BUG_ON(PageTransCompound(page));
+
 	if (!PageAnon(page))
 		goto out;
 
@@ -1616,14 +1574,14 @@ next_mm:
 		while (ksm_scan.address < vma->vm_end) {
 			if (ksm_test_exit(mm))
 				break;
-			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
+			*page = follow_page(vma, ksm_scan.address,
+					FOLL_GET | FOLL_SPLIT);
 			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
 				continue;
 			}
-			if (PageAnon(*page) ||
-			    page_trans_compound_anon(*page)) {
+			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
 				rmap_item = get_next_rmap_item(slot,
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCHv4 24/24] thp: update documentation
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (22 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 23/24] ksm: split huge pages on follow_page() Kirill A. Shutemov
@ 2015-03-04 16:33 ` Kirill A. Shutemov
  2015-03-05 12:55 ` [PATCHv4 00/24] THP refcounting redesign Jerome Marchand
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 16:33 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm, Kirill A. Shutemov

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 100 +++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 55 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 6b31cfbe2a9a..a12171e850d4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -35,10 +35,10 @@ miss is going to run faster.
 
 == Design ==
 
-- "graceful fallback": mm components which don't have transparent
-  hugepage knowledge fall back to breaking a transparent hugepage and
-  working on the regular pages and their respective regular pmd/pte
-  mappings
+- "graceful fallback": mm components which don't have transparent hugepage
+  knowledge fall back to breaking huge pmd mapping into table of ptes and,
+  if nesessary, split a transparent hugepage. Therefore these components
+  can continue working on the regular pages or regular pte mappings.
 
 - if a hugepage allocation fails because of memory fragmentation,
   regular pages should be gracefully allocated instead and mixed in
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
+	This action implies splitting all PMD the page mapped with.
+
+thp_split_page_failed is is incremented if kernel fails to split huge
+	page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD split into table of PTEs.
+	This can happen, for instance, when application calls mprotect() or
+	munmap() on part of huge page. It doesn't split huge page, only
+	page table entry.
 
 thp_zero_page_alloc is incremented every time a huge zero page is
 	successfully allocated. It includes allocations which where
@@ -253,10 +262,8 @@ is complete, so they won't ever notice the fact the page is huge. But
 if any driver is going to mangle over the page structure of the tail
 page (like for checking page->mapping or other bits that are relevant
 for the head page and not the tail page), it should be updated to jump
-to check head page instead (while serializing properly against
-split_huge_page() to avoid the head and tail pages to disappear from
-under it, see the futex code to see an example of that, hugetlbfs also
-needed special handling in futex code for similar reasons).
+to check head page instead. Taking reference on any head/tail page would
+prevent page from being split by anyone.
 
 NOTE: these aren't new constraints to the GUP API, and they match the
 same constrains that applies to hugetlbfs too, so any driver capable
@@ -291,9 +298,9 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==
 
 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
 fallback design, with a one liner change, you can avoid to write
 hundred if not thousand of lines of complex code to make your code
@@ -302,7 +309,8 @@ hugepage aware.
 If you're not walking pagetables but you run into a physical hugepage
 but you can't handle it natively in your code, you can split it by
 calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page() can fail
+if the page is pinned and you must handle this correctly.
 
 Example to make mremap.c transparent hugepage aware with a one liner
 change:
@@ -314,14 +322,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-+	split_huge_page_pmd(vma, addr, pmd);
++	split_huge_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.
 
 To make pagetable walks huge pmd aware, all you need to do is to call
 pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -330,47 +338,29 @@ created from under you by khugepaged (khugepaged collapse_huge_page
 takes the mmap_sem in write mode in addition to the anon_vma lock). If
 pmd_trans_huge returns false, you just fallback in the old code
 paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd to be converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
 pagetable walk). If the second pmd_trans_huge returns false, you
-should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+should just drop the page table lock and fallback to the old code as
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page table lock.
+
+== Refcounts and transparent huge pages ==
 
+As with other compound page types we do all refcounting for THP on head
+page, but unlike other compound pages THP support splitting.
 split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (i.e. from get_user_pages). split_huge_page fails any
+requests to split pinned huge page: it expects page count to be equal to
+sum of mapcount of all sub-pages plus one (split_huge_page caller must
+have reference for head page).
+
+split_huge_page uses migration entries to stabilize page->_count and
+page->_mapcount.
+
+Note that split_huge_pmd() doesn't have any limitation on refcounting:
+pmd can be split at any point and never fails.
-- 
2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 03/24] mm: avoid PG_locked on tail pages
  2015-03-04 16:32 ` [PATCHv4 03/24] mm: avoid PG_locked " Kirill A. Shutemov
@ 2015-03-04 18:48   ` Christoph Lameter
  2015-03-04 20:56     ` Kirill A. Shutemov
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2015-03-04 18:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm

On Wed, 4 Mar 2015, Kirill A. Shutemov wrote:

> index c851ff92d5b3..58b98bced299 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -207,7 +207,8 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
>
>  struct page;	/* forward declaration */
>
> -TESTPAGEFLAG(Locked, locked)
> +#define PageLocked(page) test_bit(PG_locked, &compound_head(page)->flags)
> +
>  PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)

Hmmm... Now one of the pageflag functions operates on the head page unlike
the other pageflag functions that only operate on the flag indicated.

Given that pageflags provide a way to implement checks for head / tail
pages this seems to be a bad idea.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 13/24] mm, vmstats: new THP splitting event
  2015-03-04 16:33 ` [PATCHv4 13/24] mm, vmstats: new THP splitting event Kirill A. Shutemov
@ 2015-03-04 18:49   ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2015-03-04 18:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm

On Wed, 4 Mar 2015, Kirill A. Shutemov wrote:

> The patch replaces THP_SPLIT with tree events: THP_SPLIT_PAGE,
> THP_SPLIT_PAGE_FAILT and THP_SPLIT_PMD. It reflects the fact that we
> now can split PMD without the compound page and that split_huge_page()
> can fail.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 03/24] mm: avoid PG_locked on tail pages
  2015-03-04 18:48   ` Christoph Lameter
@ 2015-03-04 20:56     ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-04 20:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, linux-kernel, linux-mm

On Wed, Mar 04, 2015 at 12:48:54PM -0600, Christoph Lameter wrote:
> On Wed, 4 Mar 2015, Kirill A. Shutemov wrote:
> 
> > index c851ff92d5b3..58b98bced299 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -207,7 +207,8 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
> >
> >  struct page;	/* forward declaration */
> >
> > -TESTPAGEFLAG(Locked, locked)
> > +#define PageLocked(page) test_bit(PG_locked, &compound_head(page)->flags)
> > +
> >  PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
> 
> Hmmm... Now one of the pageflag functions operates on the head page unlike
> the other pageflag functions that only operate on the flag indicated.
> 
> Given that pageflags provide a way to implement checks for head / tail
> pages this seems to be a bad idea.

I agree, I need to take more systematic look on page flags vs. compound
pages. I'll try to come up with something before the summit.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (23 preceding siblings ...)
  2015-03-04 16:33 ` [PATCHv4 24/24] thp: update documentation Kirill A. Shutemov
@ 2015-03-05 12:55 ` Jerome Marchand
  2015-03-05 16:04   ` Kirill A. Shutemov
  2015-03-06 12:18   ` Kirill A. Shutemov
  2015-03-17  9:42 ` Aneesh Kumar K.V
  2015-03-30 15:40 ` Aneesh Kumar K.V
  26 siblings, 2 replies; 56+ messages in thread
From: Jerome Marchand @ 2015-03-05 12:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 8087 bytes --]

On 03/04/2015 05:32 PM, Kirill A. Shutemov wrote:
> Hello everybody,
> 
> It's bug-fix update of my thp refcounting work.
> 
> The goal of patchset is to make refcounting on THP pages cheaper with
> simpler semantics and allow the same THP compound page to be mapped with
> PMD and PTEs. This is required to get reasonable THP-pagecache
> implementation.
> 
> With the new refcounting design it's much easier to protect against
> split_huge_page(): simple reference on a page will make you the deal.
> It makes gup_fast() implementation simpler and doesn't require
> special-case in futex code to handle tail THP pages.
> 
> It should improve THP utilization over the system since splitting THP in
> one process doesn't necessary lead to splitting the page in all other
> processes have the page mapped.
> 
[...]
> I believe all known bugs have been fixed, but I'm sure Sasha will bring more
> reports.
> 
> The patchset also available on git:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v4
> 

Hi Kirill,

I ran some ltp tests and it triggered two bugs:

[  318.526528] ------------[ cut here ]------------
[  318.527031] kernel BUG at mm/filemap.c:203!
[  318.527031] invalid opcode: 0000 [#1] SMP 
[  318.527031] Modules linked in: loop ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev microcode joydev virtio_console pcspkr serio_raw virtio_balloon parport_pc parport nfsd pvpanic i2c_piix4 acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc qxl virtio_blk virtio_net drm_kms_helper ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi floppy
[  318.527031] CPU: 0 PID: 8929 Comm: hugemmap01 Not tainted 4.0.0-rc1-next-20150227+ #213
[  318.527031] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  318.527031] task: ffff8800783d9200 ti: ffff880078bc0000 task.ti: ffff880078bc0000
[  318.527031] RIP: 0010:[<ffffffff811a74dc>]  [<ffffffff811a74dc>] __delete_from_page_cache+0x2ec/0x350
[  318.527031] RSP: 0018:ffff880078bc3c58  EFLAGS: 00010002
[  318.527031] RAX: 0000000000000001 RBX: ffffea0001338000 RCX: 00000000ffffffec
[  318.527031] RDX: 003ffc0000000001 RSI: 000000000000000a RDI: ffff88007ffe97c0
[  318.527031] RBP: ffff880078bc3ca8 R08: ffffffff82660f14 R09: 0000000000000000
[  318.527031] R10: ffff8800783d9200 R11: 0000000000000001 R12: ffff880077824980
[  318.527031] R13: ffff880077824968 R14: 0000000000000000 R15: ffff880077824970
[  318.527031] FS:  00007fe5cff0b700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  318.527031] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  318.527031] CR2: 00007fe5cff17000 CR3: 00000000778fb000 CR4: 00000000001407f0
[  318.527031] Stack:
[  318.527031]  ffff880077824980 ffff880077824980 0000000000000000 ffff880077824978
[  318.527031]  ffff880078bc3ca8 ffffea0001338000 ffff880077824980 0000000000000000
[  318.527031]  0000000000000001 0000000000000000 ffff880078bc3cd8 ffffffff811a758c
[  318.527031] Call Trace:
[  318.527031]  [<ffffffff811a758c>] delete_from_page_cache+0x4c/0x80
[  318.527031]  [<ffffffff81304ecb>] truncate_hugepages+0xfb/0x1d0
[  318.527031]  [<ffffffff8124b19e>] ? inode_wait_for_writeback+0x1e/0x40
[  318.527031]  [<ffffffff81246f8d>] ? __inode_wait_for_writeback+0x6d/0xc0
[  318.527031]  [<ffffffff81305118>] hugetlbfs_evict_inode+0x18/0x40
[  318.527031]  [<ffffffff8123852b>] evict+0xab/0x180
[  318.527031]  [<ffffffff81238e7b>] iput+0x19b/0x200
[  318.527031]  [<ffffffff8122b8c9>] do_unlinkat+0x1e9/0x310
[  318.527031]  [<ffffffff817b25f9>] ? ret_from_sys_call+0x24/0x63
[  318.527031]  [<ffffffff810e274d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[  318.527031]  [<ffffffff813ad73b>] ? trace_hardirqs_on_thunk+0x17/0x19
[  318.527031]  [<ffffffff8122cac6>] SyS_unlink+0x16/0x20
[  318.527031]  [<ffffffff817b25d0>] system_call_fastpath+0x12/0x17
[  318.527031] Code: 00 00 48 83 f8 01 19 c0 25 01 fe ff ff 05 00 02 00 00 39 d0 0f 8e 92 fe ff ff 48 63 c2 48 c1 e0 06 48 01 d8 8b 40 18 85 c0 78 c4 <0f> 0b 48 8b 43 30 e9 7b fd ff ff e8 aa e6 5f 00 80 3d c0 d8 b5 
[  318.527031] RIP  [<ffffffff811a74dc>] __delete_from_page_cache+0x2ec/0x350
[  318.527031]  RSP <ffff880078bc3c58>
[  318.527031] ---[ end trace 4595a8f53048ea33 ]---
[  320.670687] ------------[ cut here ]------------

And:

[ 4309.839683] page:ffffea0000000640 count:0 mapcount:0 mapping:00000000ffffffff index:0x0 compound_mapcount: 0
[ 4309.842296] flags: 0x1ffc0000008000(tail)
[ 4309.843253] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[ 4309.845357] ------------[ cut here ]------------
[ 4309.846306] kernel BUG at include/linux/page-flags.h:438!
[ 4309.846306] invalid opcode: 0000 [#3] SMP 
[ 4309.846306] Modules linked in: binfmt_misc loop ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev microcode joydev virtio_console pcspkr serio_raw virtio_balloon parport_pc parport nfsd pvpanic i2c_piix4 acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc qxl virtio_blk virtio_net drm_kms_helper ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi floppy
[ 4309.851030] CPU: 1 PID: 30932 Comm: proc01 Tainted: G      D         4.0.0-rc1-next-20150227+ #213
[ 4309.851030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 4309.851030] task: ffff88007c883600 ti: ffff8800237e0000 task.ti: ffff8800237e0000
[ 4309.851030] RIP: 0010:[<ffffffff81299d7c>]  [<ffffffff81299d7c>] stable_page_flags+0x30c/0x330
[ 4309.851030] RSP: 0018:ffff8800237e3e08  EFLAGS: 00010292
[ 4309.851030] RAX: 0000000000000033 RBX: 0000000000000800 RCX: 0000000000000000
[ 4309.851030] RDX: 0000000000000001 RSI: ffff88007fd0e438 RDI: ffff88007fd0e438
[ 4309.851030] RBP: ffff8800237e3e28 R08: 0000000000000001 R09: 0000000000000001
[ 4309.851030] R10: 0000000000000000 R11: 0000000000000000 R12: 001ffc0000008000
[ 4309.851030] R13: ffffea0000000640 R14: 0000000000000019 R15: 0000000000629508
[ 4309.851030] FS:  00007f8fa69c7700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 4309.851030] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4309.851030] CR2: 00007f129037c750 CR3: 0000000078c79000 CR4: 00000000001407e0
[ 4309.851030] Stack:
[ 4309.851030]  ffffffff811d4d82 ffffea0000000640 0000000000000338 ffffea0000000640
[ 4309.851030]  ffff8800237e3e78 ffffffff81299e67 0000000000629440 ffff8800237e3f20
[ 4309.851030]  ffff8800784b8a90 ffff88007c01c300 ffff8800784b8a80 ffff8800237e3f20
[ 4309.851030] Call Trace:
[ 4309.851030]  [<ffffffff811d4d82>] ? might_fault+0x42/0xa0
[ 4309.851030]  [<ffffffff81299e67>] kpageflags_read+0xc7/0x140
[ 4309.851030]  [<ffffffff8128a4fd>] proc_reg_read+0x3d/0x80
[ 4309.851030]  [<ffffffff81219ec6>] ? rw_verify_area+0x56/0xe0
[ 4309.851030]  [<ffffffff8121a9b8>] __vfs_read+0x18/0x50
[ 4309.851030]  [<ffffffff8121aa76>] vfs_read+0x86/0x140
[ 4309.851030]  [<ffffffff8121ab79>] SyS_read+0x49/0xb0
[ 4309.851030]  [<ffffffff817b25d0>] system_call_fastpath+0x12/0x17
[ 4309.851030] Code: 63 c2 48 c1 e0 06 4c 01 e8 8b 40 18 85 c0 79 26 83 c2 01 49 8b 45 00 f6 c4 80 74 c6 48 c7 c6 38 49 a5 81 4c 89 ef e8 84 95 f3 ff <0f> 0b 48 8b 50 30 e9 e2 fe ff ff bb 00 08 00 00 e9 0d fd ff ff 
[ 4309.851030] RIP  [<ffffffff81299d7c>] stable_page_flags+0x30c/0x330
[ 4309.851030]  RSP <ffff8800237e3e08>
[ 4310.285652] ---[ end trace 4595a8f53048ea35 ]---

I'll try to find out which tests triggered the bugs.

Thanks,
Jerome



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-05 12:55 ` [PATCHv4 00/24] THP refcounting redesign Jerome Marchand
@ 2015-03-05 16:04   ` Kirill A. Shutemov
  2015-03-06 12:18   ` Kirill A. Shutemov
  1 sibling, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-05 16:04 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm

On Thu, Mar 05, 2015 at 01:55:15PM +0100, Jerome Marchand wrote:
> On 03/04/2015 05:32 PM, Kirill A. Shutemov wrote:
> > Hello everybody,
> > 
> > It's bug-fix update of my thp refcounting work.
> > 
> > The goal of patchset is to make refcounting on THP pages cheaper with
> > simpler semantics and allow the same THP compound page to be mapped with
> > PMD and PTEs. This is required to get reasonable THP-pagecache
> > implementation.
> > 
> > With the new refcounting design it's much easier to protect against
> > split_huge_page(): simple reference on a page will make you the deal.
> > It makes gup_fast() implementation simpler and doesn't require
> > special-case in futex code to handle tail THP pages.
> > 
> > It should improve THP utilization over the system since splitting THP in
> > one process doesn't necessary lead to splitting the page in all other
> > processes have the page mapped.
> > 
> [...]
> > I believe all known bugs have been fixed, but I'm sure Sasha will bring more
> > reports.
> > 
> > The patchset also available on git:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v4
> > 
> 
> Hi Kirill,
> 
> I ran some ltp tests and it triggered two bugs:

Okay. The root of both is change in page_mapped(). I'll think how to fix
this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-05 12:55 ` [PATCHv4 00/24] THP refcounting redesign Jerome Marchand
  2015-03-05 16:04   ` Kirill A. Shutemov
@ 2015-03-06 12:18   ` Kirill A. Shutemov
  2015-03-06 15:58     ` Jerome Marchand
  1 sibling, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-06 12:18 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm

On Thu, Mar 05, 2015 at 01:55:15PM +0100, Jerome Marchand wrote:
> On 03/04/2015 05:32 PM, Kirill A. Shutemov wrote:
> > Hello everybody,
> > 
> > It's bug-fix update of my thp refcounting work.
> > 
> > The goal of patchset is to make refcounting on THP pages cheaper with
> > simpler semantics and allow the same THP compound page to be mapped with
> > PMD and PTEs. This is required to get reasonable THP-pagecache
> > implementation.
> > 
> > With the new refcounting design it's much easier to protect against
> > split_huge_page(): simple reference on a page will make you the deal.
> > It makes gup_fast() implementation simpler and doesn't require
> > special-case in futex code to handle tail THP pages.
> > 
> > It should improve THP utilization over the system since splitting THP in
> > one process doesn't necessary lead to splitting the page in all other
> > processes have the page mapped.
> > 
> [...]
> > I believe all known bugs have been fixed, but I'm sure Sasha will bring more
> > reports.
> > 
> > The patchset also available on git:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v4
> > 
> 
> Hi Kirill,
> 
> I ran some ltp tests and it triggered two bugs:
> 

Could you test with the patch below?

diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
index 8c3a3e02ba92..1baa4d23314b 100644
--- a/arch/arc/mm/cache_arc700.c
+++ b/arch/arc/mm/cache_arc700.c
@@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
 	 */
 	if (!mapping_mapped(mapping)) {
 		clear_bit(PG_dc_clean, &page->flags);
-	} else if (page_mapped(page)) {
+	} else if (page_mapcount(page)) {
 
 		/* kernel reading from page with U-mapping */
 		void *paddr = page_address(page);
@@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
 	 * equally valid for SRC page as well
 	 */
-	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
+	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
 		__flush_dcache_page(kfrom, u_vaddr);
 		clean_src_k_mappings = 1;
 	}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 34b66af516ea..8f972fc8933d 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
 	mapping = page_mapping(page);
 
 	if (!cache_ops_need_broadcast() &&
-	    mapping && !page_mapped(page))
+	    mapping && !page_mapcount(page))
 		clear_bit(PG_dcache_clean, &page->flags);
 	else {
 		__flush_dcache_page(mapping, page);
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index 3f8059602765..b28bf185ef77 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
 		 * another ASID than the current one.
 		 */
 		map_coherent = (cpu_has_dc_aliases &&
-				page_mapped(page) && !Page_dcache_dirty(page));
+				page_mapcount(page) &&
+				!Page_dcache_dirty(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, addr);
 		else
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 7e3ea7766822..e695b28dc32c 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (page_mapped(page) && !Page_dcache_dirty(page)) {
+		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
 			void *kaddr;
 
 			kaddr = kmap_coherent(page, vmaddr);
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 448cde372af0..2c8e44aa536e 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 	if (cpu_has_dc_aliases &&
-	    page_mapped(from) && !Page_dcache_dirty(from)) {
+	    page_mapcount(from) && !Page_dcache_dirty(from)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
 		kunmap_coherent();
@@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
 		kunmap_coherent();
@@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
 		kunmap_coherent();
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 51d8f7f31d1d..58aaa4f33b81 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
 		 */
 		map_coherent = (current_cpu_data.dcache.n_aliases &&
 			test_bit(PG_dcache_clean, &page->flags) &&
-			page_mapped(page));
+			page_mapcount(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, address);
 		else
diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index f770e3992620..e58cfbf45150 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 		       unsigned long vaddr, void *dst, const void *src,
 		       unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
@@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
 			 unsigned long vaddr, void *dst, const void *src,
 			 unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
@@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
 	    test_bit(PG_dcache_clean, &from->flags)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
@@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 		    test_bit(PG_dcache_clean, &page->flags)) {
 			void *kaddr;
 
diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 5ece856c5725..35c822286bbe 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
 						page_mapcount(p));
 				if (!page_count(p))
 					rc |= TLB_INSANE;
-				else if (page_mapped(p))
+				else if (page_mapcount(p))
 					rc |= TLB_SUSPICIOUS;
 			} else {
 				rc |= TLB_INSANE;
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..e99c059339f6 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
 	 * pseudo flags for the well known (anonymous) memory mapped pages
 	 *
 	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
-	 * simple test in page_mapped() is not enough.
+	 * simple test in page_mapcount() is not enough.
 	 */
-	if (!PageSlab(page) && page_mapped(page))
+	if (!PageSlab(page) && page_mapcount(page))
 		u |= 1 << KPF_MMAP;
 	if (PageAnon(page))
 		u |= 1 << KPF_ANON;
diff --git a/mm/filemap.c b/mm/filemap.c
index 25bec2c3e7a3..da8f66067670 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -200,7 +200,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
-	BUG_ON(page_mapped(page));
+	VM_BUG_ON_PAGE(page_mapped(page), page);
 
 	/*
 	 * At this point page must be either written or cleaned by truncate.
diff --git a/mm/rmap.c b/mm/rmap.c
index 7b8838f2c5d0..3361fe78fbe2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,12 +1141,15 @@ static void page_remove_file_rmap(struct page *page)
 
 	memcg = mem_cgroup_begin_page_stat(page);
 
-	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
+	if (unlikely(PageHuge(page))) {
+		/* hugetlb pages are always mapped with pmds */
+		atomic_dec(compound_mapcount_ptr(page));
 		goto out;
+	}
 
-	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
-	if (unlikely(PageHuge(page)))
+	/* page still mapped by someone else? */
+	if (!atomic_add_negative(-1, &page->_mapcount))
 		goto out;
 
 	/*
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-06 12:18   ` Kirill A. Shutemov
@ 2015-03-06 15:58     ` Jerome Marchand
  0 siblings, 0 replies; 56+ messages in thread
From: Jerome Marchand @ 2015-03-06 15:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 10059 bytes --]

On 03/06/2015 01:18 PM, Kirill A. Shutemov wrote:
> On Thu, Mar 05, 2015 at 01:55:15PM +0100, Jerome Marchand wrote:
>> On 03/04/2015 05:32 PM, Kirill A. Shutemov wrote:
>>> Hello everybody,
>>>
>>> It's bug-fix update of my thp refcounting work.
>>>
>>> The goal of patchset is to make refcounting on THP pages cheaper with
>>> simpler semantics and allow the same THP compound page to be mapped with
>>> PMD and PTEs. This is required to get reasonable THP-pagecache
>>> implementation.
>>>
>>> With the new refcounting design it's much easier to protect against
>>> split_huge_page(): simple reference on a page will make you the deal.
>>> It makes gup_fast() implementation simpler and doesn't require
>>> special-case in futex code to handle tail THP pages.
>>>
>>> It should improve THP utilization over the system since splitting THP in
>>> one process doesn't necessary lead to splitting the page in all other
>>> processes have the page mapped.
>>>
>> [...]
>>> I believe all known bugs have been fixed, but I'm sure Sasha will bring more
>>> reports.
>>>
>>> The patchset also available on git:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v4
>>>
>>
>> Hi Kirill,
>>
>> I ran some ltp tests and it triggered two bugs:
>>
> 
> Could you test with the patch below?

It seems to fix the issue. I can't reproduce the bugs anymore.

Thanks,
Jerome

> 
> diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
> index 8c3a3e02ba92..1baa4d23314b 100644
> --- a/arch/arc/mm/cache_arc700.c
> +++ b/arch/arc/mm/cache_arc700.c
> @@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
>  	 */
>  	if (!mapping_mapped(mapping)) {
>  		clear_bit(PG_dc_clean, &page->flags);
> -	} else if (page_mapped(page)) {
> +	} else if (page_mapcount(page)) {
>  
>  		/* kernel reading from page with U-mapping */
>  		void *paddr = page_address(page);
> @@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
>  	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
>  	 * equally valid for SRC page as well
>  	 */
> -	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
> +	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
>  		__flush_dcache_page(kfrom, u_vaddr);
>  		clean_src_k_mappings = 1;
>  	}
> diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> index 34b66af516ea..8f972fc8933d 100644
> --- a/arch/arm/mm/flush.c
> +++ b/arch/arm/mm/flush.c
> @@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
>  	mapping = page_mapping(page);
>  
>  	if (!cache_ops_need_broadcast() &&
> -	    mapping && !page_mapped(page))
> +	    mapping && !page_mapcount(page))
>  		clear_bit(PG_dcache_clean, &page->flags);
>  	else {
>  		__flush_dcache_page(mapping, page);
> diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
> index 3f8059602765..b28bf185ef77 100644
> --- a/arch/mips/mm/c-r4k.c
> +++ b/arch/mips/mm/c-r4k.c
> @@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
>  		 * another ASID than the current one.
>  		 */
>  		map_coherent = (cpu_has_dc_aliases &&
> -				page_mapped(page) && !Page_dcache_dirty(page));
> +				page_mapcount(page) &&
> +				!Page_dcache_dirty(page));
>  		if (map_coherent)
>  			vaddr = kmap_coherent(page, addr);
>  		else
> diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
> index 7e3ea7766822..e695b28dc32c 100644
> --- a/arch/mips/mm/cache.c
> +++ b/arch/mips/mm/cache.c
> @@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
>  	unsigned long addr = (unsigned long) page_address(page);
>  
>  	if (pages_do_alias(addr, vmaddr)) {
> -		if (page_mapped(page) && !Page_dcache_dirty(page)) {
> +		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
>  			void *kaddr;
>  
>  			kaddr = kmap_coherent(page, vmaddr);
> diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
> index 448cde372af0..2c8e44aa536e 100644
> --- a/arch/mips/mm/init.c
> +++ b/arch/mips/mm/init.c
> @@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
>  
>  	vto = kmap_atomic(to);
>  	if (cpu_has_dc_aliases &&
> -	    page_mapped(from) && !Page_dcache_dirty(from)) {
> +	    page_mapcount(from) && !Page_dcache_dirty(from)) {
>  		vfrom = kmap_coherent(from, vaddr);
>  		copy_page(vto, vfrom);
>  		kunmap_coherent();
> @@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
>  	unsigned long len)
>  {
>  	if (cpu_has_dc_aliases &&
> -	    page_mapped(page) && !Page_dcache_dirty(page)) {
> +	    page_mapcount(page) && !Page_dcache_dirty(page)) {
>  		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(vto, src, len);
>  		kunmap_coherent();
> @@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
>  	unsigned long len)
>  {
>  	if (cpu_has_dc_aliases &&
> -	    page_mapped(page) && !Page_dcache_dirty(page)) {
> +	    page_mapcount(page) && !Page_dcache_dirty(page)) {
>  		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(dst, vfrom, len);
>  		kunmap_coherent();
> diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
> index 51d8f7f31d1d..58aaa4f33b81 100644
> --- a/arch/sh/mm/cache-sh4.c
> +++ b/arch/sh/mm/cache-sh4.c
> @@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
>  		 */
>  		map_coherent = (current_cpu_data.dcache.n_aliases &&
>  			test_bit(PG_dcache_clean, &page->flags) &&
> -			page_mapped(page));
> +			page_mapcount(page));
>  		if (map_coherent)
>  			vaddr = kmap_coherent(page, address);
>  		else
> diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
> index f770e3992620..e58cfbf45150 100644
> --- a/arch/sh/mm/cache.c
> +++ b/arch/sh/mm/cache.c
> @@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
>  		       unsigned long vaddr, void *dst, const void *src,
>  		       unsigned long len)
>  {
> -	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
>  	    test_bit(PG_dcache_clean, &page->flags)) {
>  		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(vto, src, len);
> @@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
>  			 unsigned long vaddr, void *dst, const void *src,
>  			 unsigned long len)
>  {
> -	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
>  	    test_bit(PG_dcache_clean, &page->flags)) {
>  		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(dst, vfrom, len);
> @@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
>  
>  	vto = kmap_atomic(to);
>  
> -	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
> +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
>  	    test_bit(PG_dcache_clean, &from->flags)) {
>  		vfrom = kmap_coherent(from, vaddr);
>  		copy_page(vto, vfrom);
> @@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
>  	unsigned long addr = (unsigned long) page_address(page);
>  
>  	if (pages_do_alias(addr, vmaddr)) {
> -		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> +		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
>  		    test_bit(PG_dcache_clean, &page->flags)) {
>  			void *kaddr;
>  
> diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
> index 5ece856c5725..35c822286bbe 100644
> --- a/arch/xtensa/mm/tlb.c
> +++ b/arch/xtensa/mm/tlb.c
> @@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
>  						page_mapcount(p));
>  				if (!page_count(p))
>  					rc |= TLB_INSANE;
> -				else if (page_mapped(p))
> +				else if (page_mapcount(p))
>  					rc |= TLB_SUSPICIOUS;
>  			} else {
>  				rc |= TLB_INSANE;
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 7eee2d8b97d9..e99c059339f6 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
>  	 * pseudo flags for the well known (anonymous) memory mapped pages
>  	 *
>  	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
> -	 * simple test in page_mapped() is not enough.
> +	 * simple test in page_mapcount() is not enough.
>  	 */
> -	if (!PageSlab(page) && page_mapped(page))
> +	if (!PageSlab(page) && page_mapcount(page))
>  		u |= 1 << KPF_MMAP;
>  	if (PageAnon(page))
>  		u |= 1 << KPF_ANON;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 25bec2c3e7a3..da8f66067670 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -200,7 +200,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
>  	if (PageSwapBacked(page))
>  		__dec_zone_page_state(page, NR_SHMEM);
> -	BUG_ON(page_mapped(page));
> +	VM_BUG_ON_PAGE(page_mapped(page), page);
>  
>  	/*
>  	 * At this point page must be either written or cleaned by truncate.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7b8838f2c5d0..3361fe78fbe2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1141,12 +1141,15 @@ static void page_remove_file_rmap(struct page *page)
>  
>  	memcg = mem_cgroup_begin_page_stat(page);
>  
> -	/* page still mapped by someone else? */
> -	if (!atomic_add_negative(-1, &page->_mapcount))
> +	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
> +	if (unlikely(PageHuge(page))) {
> +		/* hugetlb pages are always mapped with pmds */
> +		atomic_dec(compound_mapcount_ptr(page));
>  		goto out;
> +	}
>  
> -	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
> -	if (unlikely(PageHuge(page)))
> +	/* page still mapped by someone else? */
> +	if (!atomic_add_negative(-1, &page->_mapcount))
>  		goto out;
>  
>  	/*
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
@ 2015-03-17  8:32   ` Aneesh Kumar K.V
  2015-03-29 15:55   ` Aneesh Kumar K.V
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-17  8:32 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Current split_huge_page() combines two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> changes split_huge_pmd() implementation to split the given PMD without
> splitting other PMDs this page mapped with or underlying compound page.
>
> In order to do this we have to get rid of tail page refcounting, which
> uses _mapcount of tail pages. Tail page refcounting is needed to be able
> to split THP page at any point: we always know which of tail pages is
> pinned (i.e. by get_user_pages()) and can distribute page count
> correctly.
>
> We can avoid this by allowing split_huge_page() to fail if the compound
> page is pinned. This patch removes all infrastructure for tail page
> refcounting and make split_huge_page() to always return -EBUSY. All
> split_huge_page() users already know how to handle its fail. Proper
> implementation will be added later.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
>

mm/gup.c: In function ‘gup_huge_pgd’:
mm/gup.c:1183:2: error: ‘tail’ undeclared (first use in this function)
  tail = page;
  ^
diff --git a/mm/gup.c b/mm/gup.c
index 2f776850108c..141b5a81cf8a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1180,7 +1180,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	refs = 0;
 	head = pgd_page(orig);
 	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (24 preceding siblings ...)
  2015-03-05 12:55 ` [PATCHv4 00/24] THP refcounting redesign Jerome Marchand
@ 2015-03-17  9:42 ` Aneesh Kumar K.V
  2015-03-19 17:10   ` Kirill A. Shutemov
  2015-03-30 15:40 ` Aneesh Kumar K.V
  26 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-17  9:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Hello everybody,
>
> It's bug-fix update of my thp refcounting work.
>
> The goal of patchset is to make refcounting on THP pages cheaper with
> simpler semantics and allow the same THP compound page to be mapped with
> PMD and PTEs. This is required to get reasonable THP-pagecache
> implementation.
>
> With the new refcounting design it's much easier to protect against
> split_huge_page(): simple reference on a page will make you the deal.
> It makes gup_fast() implementation simpler and doesn't require
> special-case in futex code to handle tail THP pages.
>
> It should improve THP utilization over the system since splitting THP in
> one process doesn't necessary lead to splitting the page in all other
> processes have the page mapped.

I tested this patch on ppc64 and verified thp allocation and split.
I also checked the subpage_prot and it worked as expected. I will
run more tests with this series and update if I find any issues.

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-17  9:42 ` Aneesh Kumar K.V
@ 2015-03-19 17:10   ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-19 17:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Tue, Mar 17, 2015 at 03:12:05PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Hello everybody,
> >
> > It's bug-fix update of my thp refcounting work.
> >
> > The goal of patchset is to make refcounting on THP pages cheaper with
> > simpler semantics and allow the same THP compound page to be mapped with
> > PMD and PTEs. This is required to get reasonable THP-pagecache
> > implementation.
> >
> > With the new refcounting design it's much easier to protect against
> > split_huge_page(): simple reference on a page will make you the deal.
> > It makes gup_fast() implementation simpler and doesn't require
> > special-case in futex code to handle tail THP pages.
> >
> > It should improve THP utilization over the system since splitting THP in
> > one process doesn't necessary lead to splitting the page in all other
> > processes have the page mapped.
> 
> I tested this patch on ppc64 and verified thp allocation and split.
> I also checked the subpage_prot and it worked as expected. I will
> run more tests with this series and update if I find any issues.

Thanks a lot.

Could you also prepare patch to drop power-specific code related to
pmd_trans_splitting()? It's not needed anymore with the patchset.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
  2015-03-17  8:32   ` Aneesh Kumar K.V
@ 2015-03-29 15:55   ` Aneesh Kumar K.V
  2015-03-29 17:42     ` Kirill A. Shutemov
  2015-03-29 16:28   ` Aneesh Kumar K.V
  2015-04-01  6:38   ` Aneesh Kumar K.V
  3 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-29 15:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Current split_huge_page() combines two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> changes split_huge_pmd() implementation to split the given PMD without
> splitting other PMDs this page mapped with or underlying compound page.
>
> In order to do this we have to get rid of tail page refcounting, which
> uses _mapcount of tail pages. Tail page refcounting is needed to be able
> to split THP page at any point: we always know which of tail pages is
> pinned (i.e. by get_user_pages()) and can distribute page count
> correctly.
>
> We can avoid this by allowing split_huge_page() to fail if the compound
> page is pinned. This patch removes all infrastructure for tail page
> refcounting and make split_huge_page() to always return -EBUSY. All
> split_huge_page() users already know how to handle its fail. Proper
> implementation will be added later.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
>
> Memory cgroup is not yet ready for new refcouting. Let's disable it on
> Kconfig level.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
.....
.....

>  static inline int page_mapped(struct page *page)
>  {
> -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> +	int i;
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) >= 0;
> +	if (compound_mapcount(page))
> +		return 1;
> +	for (i = 0; i < hpage_nr_pages(page); i++) {
> +		if (atomic_read(&page[i]._mapcount) >= 0)
> +			return 1;
> +	}

do we need to loop with head page here ? ie,

page = compound_page(page);

> +	return 0;
>  }


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs
  2015-03-04 16:33 ` [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
@ 2015-03-29 16:10   ` Aneesh Kumar K.V
  2015-03-29 17:51     ` Kirill A. Shutemov
  0 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-29 16:10 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> With new refcounting we don't need to mark PMDs splitting. Let's drop code
> to handle this.
>
> Arch-specific code will removed separately.
>

Can you explain this more ? Why we don't care of PMD splitting case even
w.r.t to split_huge_page() ? 

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
  2015-03-17  8:32   ` Aneesh Kumar K.V
  2015-03-29 15:55   ` Aneesh Kumar K.V
@ 2015-03-29 16:28   ` Aneesh Kumar K.V
  2015-03-29 17:43     ` Kirill A. Shutemov
  2015-04-01  6:38   ` Aneesh Kumar K.V
  3 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-29 16:28 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Current split_huge_page() combines two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> changes split_huge_pmd() implementation to split the given PMD without
> splitting other PMDs this page mapped with or underlying compound page.
>
> In order to do this we have to get rid of tail page refcounting, which
> uses _mapcount of tail pages. Tail page refcounting is needed to be able
> to split THP page at any point: we always know which of tail pages is
> pinned (i.e. by get_user_pages()) and can distribute page count
> correctly.
>
> We can avoid this by allowing split_huge_page() to fail if the compound
> page is pinned. This patch removes all infrastructure for tail page
> refcounting and make split_huge_page() to always return -EBUSY. All
> split_huge_page() users already know how to handle its fail. Proper
> implementation will be added later.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
>
> Memory cgroup is not yet ready for new refcouting. Let's disable it on
> Kconfig level.
>
.....
......

>
>  	spin_lock(ptl);
>  	if (page)
> -		put_user_huge_page(page);
> +		put_page(page);
>  	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
>  		spin_unlock(ptl);
>  		mem_cgroup_cancel_charge(new_page, memcg);
> @@ -1662,51 +1631,78 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	put_huge_zero_page();
>  }
>
> -void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> -		pmd_t *pmd)
> +
> +static void __split_huge_pmd_locked(struct vm_area_struct *vma,
> +		pmd_t *pmd, unsigned long address)
>  {
> -	spinlock_t *ptl;
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
>  	struct page *page;
>  	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long haddr = address & HPAGE_PMD_MASK;
> -	unsigned long mmun_start;	/* For mmu_notifiers */
> -	unsigned long mmun_end;		/* For mmu_notifiers */
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	bool young, write;
> +	int i;
>
> -	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
> +	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> +	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> +
> +	if (is_huge_zero_pmd(*pmd))
> +		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON_PAGE(!page_count(page), page);
> +	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
> +
> +	write = pmd_write(*pmd);
> +	young = pmd_young(*pmd);
> +
> +	/* leave pmd empty until pte is filled */
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +

So we now mark pmd none, while we go ahead and split the pmd. But then what
happens to a parallel fault ? We don't hold mmap_sem here right ?

> +	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t entry, *pte;
> +		/*
> +		 * Note that NUMA hinting access restrictions are not
> +		 * transferred to avoid any possibility of altering
> +		 * permissions across VMAs.
> +		 */
> +		entry = mk_pte(page + i, vma->vm_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		if (!write)
> +			entry = pte_wrprotect(entry);
> +		if (!young)
> +			entry = pte_mkold(entry);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		BUG_ON(!pte_none(*pte));
> +		atomic_inc(&page[i]._mapcount);
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
> +	}
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +	atomic_dec(compound_mapcount_ptr(page));
> +}
> +

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-29 15:55   ` Aneesh Kumar K.V
@ 2015-03-29 17:42     ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-29 17:42 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Sun, Mar 29, 2015 at 09:25:29PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
> > Memory cgroup is not yet ready for new refcouting. Let's disable it on
> > Kconfig level.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> .....
> .....
> 
> >  static inline int page_mapped(struct page *page)
> >  {
> > -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> > +	int i;
> > +	if (likely(!PageCompound(page)))
> > +		return atomic_read(&page->_mapcount) >= 0;
> > +	if (compound_mapcount(page))
> > +		return 1;
> > +	for (i = 0; i < hpage_nr_pages(page); i++) {
> > +		if (atomic_read(&page[i]._mapcount) >= 0)
> > +			return 1;
> > +	}
> 
> do we need to loop with head page here ? ie,

We do need to loop if we have only tail pages mapped. Partial unmap case.

> 
> page = compound_page(page);

compound_head() ?

This function expects to see head page on input. I'll put VM_BUG_ON()
there to make sure we don't call it tail pages.

> > +	return 0;
> >  }

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-29 16:28   ` Aneesh Kumar K.V
@ 2015-03-29 17:43     ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-29 17:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Sun, Mar 29, 2015 at 09:58:25PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
> > Memory cgroup is not yet ready for new refcouting. Let's disable it on
> > Kconfig level.
> >
> .....
> ......
> 
> >
> >  	spin_lock(ptl);
> >  	if (page)
> > -		put_user_huge_page(page);
> > +		put_page(page);
> >  	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
> >  		spin_unlock(ptl);
> >  		mem_cgroup_cancel_charge(new_page, memcg);
> > @@ -1662,51 +1631,78 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >  	put_huge_zero_page();
> >  }
> >
> > -void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> > -		pmd_t *pmd)
> > +
> > +static void __split_huge_pmd_locked(struct vm_area_struct *vma,
> > +		pmd_t *pmd, unsigned long address)
> >  {
> > -	spinlock_t *ptl;
> > +	unsigned long haddr = address & HPAGE_PMD_MASK;
> >  	struct page *page;
> >  	struct mm_struct *mm = vma->vm_mm;
> > -	unsigned long haddr = address & HPAGE_PMD_MASK;
> > -	unsigned long mmun_start;	/* For mmu_notifiers */
> > -	unsigned long mmun_end;		/* For mmu_notifiers */
> > +	pgtable_t pgtable;
> > +	pmd_t _pmd;
> > +	bool young, write;
> > +	int i;
> >
> > -	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
> > +	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> > +	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> > +
> > +	if (is_huge_zero_pmd(*pmd))
> > +		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> > +
> > +	page = pmd_page(*pmd);
> > +	VM_BUG_ON_PAGE(!page_count(page), page);
> > +	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
> > +
> > +	write = pmd_write(*pmd);
> > +	young = pmd_young(*pmd);
> > +
> > +	/* leave pmd empty until pte is filled */
> > +	pmdp_clear_flush_notify(vma, haddr, pmd);
> > +
> 
> So we now mark pmd none, while we go ahead and split the pmd. But then what
> happens to a parallel fault ? We don't hold mmap_sem here right ?

We do hold ptl. That should be enough.

> > +	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > +	pmd_populate(mm, &_pmd, pgtable);
> > +
> > +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> > +		pte_t entry, *pte;
> > +		/*
> > +		 * Note that NUMA hinting access restrictions are not
> > +		 * transferred to avoid any possibility of altering
> > +		 * permissions across VMAs.
> > +		 */
> > +		entry = mk_pte(page + i, vma->vm_page_prot);
> > +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > +		if (!write)
> > +			entry = pte_wrprotect(entry);
> > +		if (!young)
> > +			entry = pte_mkold(entry);
> > +		pte = pte_offset_map(&_pmd, haddr);
> > +		BUG_ON(!pte_none(*pte));
> > +		atomic_inc(&page[i]._mapcount);
> > +		set_pte_at(mm, haddr, pte, entry);
> > +		pte_unmap(pte);
> > +	}
> > +	smp_wmb(); /* make pte visible before pmd */
> > +	pmd_populate(mm, pmd, pgtable);
> > +	atomic_dec(compound_mapcount_ptr(page));
> > +}
> > +
> 
> -aneesh
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs
  2015-03-29 16:10   ` Aneesh Kumar K.V
@ 2015-03-29 17:51     ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-29 17:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Sun, Mar 29, 2015 at 09:40:07PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > With new refcounting we don't need to mark PMDs splitting. Let's drop code
> > to handle this.
> >
> > Arch-specific code will removed separately.
> >
> 
> Can you explain this more ? Why we don't care of PMD splitting case even
> w.r.t to split_huge_page() ? 

It used to be required to keep kernel from updating page refcounts while
we splitting the page. Now with split_huge_pmd() we can split one PMD a
time without blocking refcounting. Once all PMDs split we can freeze the
page's refcounts with compound lock[1] and split underlying compound page.

[1] compound lock is replaced with migration entries by the end of the
    patchset.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page
  2015-03-04 16:33 ` [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page Kirill A. Shutemov
@ 2015-03-30 14:10   ` Aneesh Kumar K.V
  2015-03-30 15:20     ` Kirill A. Shutemov
  0 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-30 14:10 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> We're going to use migration entries instead of compound_lock() to
> stabilize page refcounts. Setup and remove migration entries require
> page to be locked.
>
> Some of split_huge_page() callers already have the page locked. Let's
> require everybody to lock the page before calling split_huge_page().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Why not have split_huge_page_locked/unlocked, and call the one which
takes lock internally every where ?

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-04 16:33 ` [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
@ 2015-03-30 14:19   ` Aneesh Kumar K.V
  2015-03-30 15:23     ` Kirill A. Shutemov
  2015-03-30 15:08   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-30 14:19 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Currently, we rely on compound_lock() to get page counts stable on
> splitting page refcounting. To get it work we also take the lock on
> get_page() and put_page() which is hot path.
>
> This patch rework splitting code to setup migration entries to stabilaze
> page count/mapcount before distribute refcounts. It means we don't need
> to compound lock in get_page()/put_page().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/migrate.h |   3 +
>  include/linux/mm.h      |   1 +
>  include/linux/pagemap.h |   9 ++-
>  mm/huge_memory.c        | 188 +++++++++++++++++++++++++++++++++++-------------
>  mm/internal.h           |  26 +++++--
>  mm/migrate.c            |  79 +++++++++++---------
>  mm/rmap.c               |  21 ------
>  7 files changed, 218 insertions(+), 109 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 78baed5f2952..b9bc86c24829 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -43,6 +43,9 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
>  		struct page *newpage, struct page *page,
>  		struct buffer_head *head, enum migrate_mode mode,
>  		int extra_count);
> +extern int __remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, struct page *old);
> +
>  #else
>
>  static inline void putback_movable_pages(struct list_head *l) {}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 28aeae6e553b..43a9993f1333 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -981,6 +981,7 @@ extern struct address_space *page_mapping(struct page *page);
>  /* Neutral page->mapping pointer to address_space or anon_vma or other */
>  static inline void *page_rmapping(struct page *page)
>  {
> +	page = compound_head(page);
>  	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
>  }
>

The above hunk is related to this patch ?. Are we calling page_rmapping
on tail pages now ? If so it needs additonal comment why we handle them
differently now. Or split it to a seperate patch ?

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-04 16:33 ` [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
  2015-03-30 14:19   ` Aneesh Kumar K.V
@ 2015-03-30 15:08   ` Aneesh Kumar K.V
  2015-03-30 15:26     ` Kirill A. Shutemov
  1 sibling, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-30 15:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

....
....
 +static void freeze_page(struct anon_vma *anon_vma, struct page *page)
> +{
> +	struct anon_vma_chain *avc;
> +	struct vm_area_struct *vma;
> +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

So this get called only with head page, We also do
BUG_ON(PageTail(page)) in the caller.  But


> +	unsigned long addr, haddr;
> +	unsigned long mmun_start, mmun_end;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *start_pte, *pte;
> +	spinlock_t *ptl;
......


> +
> +static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
> +{
> +	struct anon_vma_chain *avc;
> +	pgoff_t pgoff = page_to_pgoff(page);

Why ? Can this get called for tail pages ?


-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page
  2015-03-30 14:10   ` Aneesh Kumar K.V
@ 2015-03-30 15:20     ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-30 15:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Mon, Mar 30, 2015 at 07:40:29PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > We're going to use migration entries instead of compound_lock() to
> > stabilize page refcounts. Setup and remove migration entries require
> > page to be locked.
> >
> > Some of split_huge_page() callers already have the page locked. Let's
> > require everybody to lock the page before calling split_huge_page().
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Why not have split_huge_page_locked/unlocked, and call the one which
> takes lock internally every where ?

We could do that, but it's not obvoius for me what is benefit. Couple of
lines on caller side?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-30 14:19   ` Aneesh Kumar K.V
@ 2015-03-30 15:23     ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-30 15:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Mon, Mar 30, 2015 at 07:49:43PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Currently, we rely on compound_lock() to get page counts stable on
> > splitting page refcounting. To get it work we also take the lock on
> > get_page() and put_page() which is hot path.
> >
> > This patch rework splitting code to setup migration entries to stabilaze
> > page count/mapcount before distribute refcounts. It means we don't need
> > to compound lock in get_page()/put_page().
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/migrate.h |   3 +
> >  include/linux/mm.h      |   1 +
> >  include/linux/pagemap.h |   9 ++-
> >  mm/huge_memory.c        | 188 +++++++++++++++++++++++++++++++++++-------------
> >  mm/internal.h           |  26 +++++--
> >  mm/migrate.c            |  79 +++++++++++---------
> >  mm/rmap.c               |  21 ------
> >  7 files changed, 218 insertions(+), 109 deletions(-)
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 78baed5f2952..b9bc86c24829 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -43,6 +43,9 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
> >  		struct page *newpage, struct page *page,
> >  		struct buffer_head *head, enum migrate_mode mode,
> >  		int extra_count);
> > +extern int __remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> > +		unsigned long addr, pte_t *ptep, struct page *old);
> > +
> >  #else
> >
> >  static inline void putback_movable_pages(struct list_head *l) {}
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 28aeae6e553b..43a9993f1333 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -981,6 +981,7 @@ extern struct address_space *page_mapping(struct page *page);
> >  /* Neutral page->mapping pointer to address_space or anon_vma or other */
> >  static inline void *page_rmapping(struct page *page)
> >  {
> > +	page = compound_head(page);
> >  	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
> >  }
> >
> 
> The above hunk is related to this patch ?. Are we calling page_rmapping
> on tail pages now ? If so it needs additonal comment why we handle them
> differently now. Or split it to a seperate patch ?

This change is already in -mm via my patchset on tail pages vs. ->mapping
and page falgs.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-30 15:08   ` Aneesh Kumar K.V
@ 2015-03-30 15:26     ` Kirill A. Shutemov
  2015-03-30 15:45       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-03-30 15:26 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Mon, Mar 30, 2015 at 08:38:08PM +0530, Aneesh Kumar K.V wrote:
> ....
> ....
>  +static void freeze_page(struct anon_vma *anon_vma, struct page *page)
> > +{
> > +	struct anon_vma_chain *avc;
> > +	struct vm_area_struct *vma;
> > +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> 
> So this get called only with head page, We also do
> BUG_ON(PageTail(page)) in the caller.  But
> 
> 
> > +	unsigned long addr, haddr;
> > +	unsigned long mmun_start, mmun_end;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +	pte_t *start_pte, *pte;
> > +	spinlock_t *ptl;
> ......
> 
> 
> > +
> > +static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
> > +{
> > +	struct anon_vma_chain *avc;
> > +	pgoff_t pgoff = page_to_pgoff(page);
> 
> Why ? Can this get called for tail pages ?

It cannot. pgoff is offset of head page (and therefore whole compound
page) within rmapping.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
                   ` (25 preceding siblings ...)
  2015-03-17  9:42 ` Aneesh Kumar K.V
@ 2015-03-30 15:40 ` Aneesh Kumar K.V
  2015-04-01 13:26   ` Kirill A. Shutemov
  26 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-30 15:40 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Hello everybody,
>
> It's bug-fix update of my thp refcounting work.
>
> The goal of patchset is to make refcounting on THP pages cheaper with
> simpler semantics and allow the same THP compound page to be mapped with
> PMD and PTEs. This is required to get reasonable THP-pagecache
> implementation.
>
> With the new refcounting design it's much easier to protect against
> split_huge_page(): simple reference on a page will make you the deal.
> It makes gup_fast() implementation simpler and doesn't require
> special-case in futex code to handle tail THP pages.
>
> It should improve THP utilization over the system since splitting THP in
> one process doesn't necessary lead to splitting the page in all other
> processes have the page mapped.

Can we split this series into two. The compound_lock removal and using
migration entries ro freeze page count can be made into a seperate
series ?

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-30 15:26     ` Kirill A. Shutemov
@ 2015-03-30 15:45       ` Aneesh Kumar K.V
  2015-04-01 13:19         ` Kirill A. Shutemov
  0 siblings, 1 reply; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-30 15:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Mon, Mar 30, 2015 at 08:38:08PM +0530, Aneesh Kumar K.V wrote:
>> ....
>> ....
>>  +static void freeze_page(struct anon_vma *anon_vma, struct page *page)
>> > +{
>> > +	struct anon_vma_chain *avc;
>> > +	struct vm_area_struct *vma;
>> > +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>> 
>> So this get called only with head page, We also do
>> BUG_ON(PageTail(page)) in the caller.  But
>> 
>> 
>> > +	unsigned long addr, haddr;
>> > +	unsigned long mmun_start, mmun_end;
>> > +	pgd_t *pgd;
>> > +	pud_t *pud;
>> > +	pmd_t *pmd;
>> > +	pte_t *start_pte, *pte;
>> > +	spinlock_t *ptl;
>> ......
>> 
>> 
>> > +
>> > +static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
>> > +{
>> > +	struct anon_vma_chain *avc;
>> > +	pgoff_t pgoff = page_to_pgoff(page);
>> 
>> Why ? Can this get called for tail pages ?
>
> It cannot. pgoff is offset of head page (and therefore whole compound
> page) within rmapping.
>

This we can use 

	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
        
similar to what we do in freeze_page(). The difference between
freeze/unfreeze confused me.

-aneesh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 20/24] mm, thp: remove compound_lock
  2015-03-04 16:33 ` [PATCHv4 20/24] mm, thp: remove compound_lock Kirill A. Shutemov
@ 2015-03-31 15:50   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-03-31 15:50 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov


....
....

  static void put_compound_page(struct page *page)
>  {
>  	struct page *page_head;
> -	unsigned long flags;
>
>  	if (likely(!PageTail(page))) {
>  		if (put_page_testzero(page)) {
> @@ -108,58 +101,33 @@ static void put_compound_page(struct page *page)
>  	/* __split_huge_page_refcount can run under us */
>  	page_head = compound_head(page);
>
> -	if (!compound_lock_needed(page_head)) {
> -		/*
> -		 * If "page" is a THP tail, we must read the tail page flags
> -		 * after the head page flags. The split_huge_page side enforces
> -		 * write memory barriers between clearing PageTail and before
> -		 * the head page can be freed and reallocated.
> -		 */
> -		smp_rmb();
> -		if (likely(PageTail(page))) {
> -			/* __split_huge_page_refcount cannot race here. */
> -			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> -			VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
> -			if (put_page_testzero(page_head)) {
> -				/*
> -				 * If this is the tail of a slab compound page,
> -				 * the tail pin must not be the last reference
> -				 * held on the page, because the PG_slab cannot
> -				 * be cleared before all tail pins (which skips
> -				 * the _mapcount tail refcounting) have been
> -				 * released. For hugetlbfs the tail pin may be
> -				 * the last reference on the page instead,
> -				 * because PageHeadHuge will not go away until
> -				 * the compound page enters the buddy
> -				 * allocator.
> -				 */
> -				VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
> -				__put_compound_page(page_head);
> -			}
> -		} else if (put_page_testzero(page))
> -			__put_single_page(page);
> -		return;
> -	}
> -
> -	flags = compound_lock_irqsave(page_head);
> -	/* here __split_huge_page_refcount won't run anymore */
> -	if (likely(page != page_head && PageTail(page))) {
> -		bool free;
> -
> -		free = put_page_testzero(page_head);
> -		compound_unlock_irqrestore(page_head, flags);
> -		if (free) {
> -			if (PageHead(page_head))
> -				__put_compound_page(page_head);
> -			else
> -				__put_single_page(page_head);
> +	/*
> +	 * If "page" is a THP tail, we must read the tail page flags after the
> +	 * head page flags. The split_huge_page side enforces write memory
> +	 * barriers between clearing PageTail and before the head page can be
> +	 * freed and reallocated.
> +	 */
> +	smp_rmb();
> +	if (likely(PageTail(page))) {
> +		/* __split_huge_page_refcount cannot race here. */
> +		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> +		if (put_page_testzero(page_head)) {
> +			/*
> +			 * If this is the tail of a slab compound page, the
> +			 * tail pin must not be the last reference held on the
> +			 * page, because the PG_slab cannot be cleared before
> +			 * all tail pins (which skips the _mapcount tail
> +			 * refcounting) have been released. For hugetlbfs the
> +			 * tail pin may be the last reference on the page
> +			 * instead, because PageHeadHuge will not go away until
> +			 * the compound page enters the buddy allocator.
> +			 */
> +			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
> +			__put_compound_page(page_head);
>  		}


The comment may need an update ? For THP also a tail pin may be the last
reference on the page right ?


> -	} else {
> -		compound_unlock_irqrestore(page_head, flags);
> -		VM_BUG_ON_PAGE(PageTail(page), page);
> -		if (put_page_testzero(page))
> -			__put_single_page(page);
> -	}
> +	} else if (put_page_testzero(page))
> +		__put_single_page(page);
> +	return;
>  }
>
>  void put_page(struct page *page)
> @@ -178,42 +146,29 @@ EXPORT_SYMBOL(put_page);
>  void __get_page_tail(struct page *page)
>  {
>  	struct page *page_head = compound_head(page);

.....
....

> +	smp_rmb();
> +	if (likely(PageTail(page))) {
> +		/*
> +		 * This is a hugetlbfs page or a slab page.
> +		 * __split_huge_page_refcount cannot race here.
> +		 */

This comment also need an update. This can be a THP tail page right ?

> +		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> +		VM_BUG_ON(page_head != page->first_page);
> +		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
> +				page);
> +		atomic_inc(&page_head->_count);
> +	} else {
> +		/*
> +		 * __split_huge_page_refcount run before us, "page" was
> +		 * a thp tail. the split page_head has been freed and
> +		 * reallocated as slab or hugetlbfs page of smaller
> +		 * order (only possible if reallocated as slab on x86).
> +		 */
>  		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
>  		atomic_inc(&page->_count);
>  	}
> -	compound_unlock_irqrestore(page_head, flags);
> +	return;
>  }
>  EXPORT_SYMBOL(__get_page_tail);
>
> -- 
> 2.1.4


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
                     ` (2 preceding siblings ...)
  2015-03-29 16:28   ` Aneesh Kumar K.V
@ 2015-04-01  6:38   ` Aneesh Kumar K.V
  2015-04-01 13:17     ` Kirill A. Shutemov
  2015-04-02 15:39     ` Kirill A. Shutemov
  3 siblings, 2 replies; 56+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-01  6:38 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, Jerome Marchand,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Current split_huge_page() combines two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> changes split_huge_pmd() implementation to split the given PMD without
> splitting other PMDs this page mapped with or underlying compound page.
>
> In order to do this we have to get rid of tail page refcounting, which
> uses _mapcount of tail pages. Tail page refcounting is needed to be able
> to split THP page at any point: we always know which of tail pages is
> pinned (i.e. by get_user_pages()) and can distribute page count
> correctly.
>
> We can avoid this by allowing split_huge_page() to fail if the compound
> page is pinned. This patch removes all infrastructure for tail page
> refcounting and make split_huge_page() to always return -EBUSY. All
> split_huge_page() users already know how to handle its fail. Proper
> implementation will be added later.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
>

With this we now have pte mapping part of a compound page(). Now the
gneric gup implementation does

gup_pte_range()
	ptem = ptep = pte_offset_map(&pmd, addr);
	do {

....
...
		if (!page_cache_get_speculative(page))
			goto pte_unmap;
.....
        }

That page_cache_get_speculative will fail in our case because it does
if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

-aneesh
	


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-04-01  6:38   ` Aneesh Kumar K.V
@ 2015-04-01 13:17     ` Kirill A. Shutemov
  2015-04-01 23:13       ` Hugh Dickins
  2015-04-02 15:39     ` Kirill A. Shutemov
  1 sibling, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-04-01 13:17 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
> 
> With this we now have pte mapping part of a compound page(). Now the
> gneric gup implementation does
> 
> gup_pte_range()
> 	ptem = ptep = pte_offset_map(&pmd, addr);
> 	do {
> 
> ....
> ...
> 		if (!page_cache_get_speculative(page))
> 			goto pte_unmap;
> .....
>         }
> 
> That page_cache_get_speculative will fail in our case because it does
> if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

Nice catch, thanks.

But the problem is not exclusive to my patchset. In current kernel some
drivers (sound, for instance) map compound pages with PTEs.

We can try to get page_cache_get_speculative() work on PTE-mapped tail
pages. Untested patch is below.

BTW, it would be nice to rename the function since it's now used not only
for page cache.

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c3790764795..b2b6e886ed9b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -142,8 +142,10 @@ void release_pages(struct page **pages, int nr, bool cold);
  */
 static inline int page_cache_get_speculative(struct page *page)
 {
+	struct page *head_page;
 	VM_BUG_ON(in_interrupt());
-
+retry:
+	head_page = compound_head_fast(page);
 #ifdef CONFIG_TINY_RCU
 # ifdef CONFIG_PREEMPT_COUNT
 	VM_BUG_ON(!in_atomic());
@@ -157,11 +159,11 @@ static inline int page_cache_get_speculative(struct page *page)
 	 * disabling preempt, and hence no need for the "speculative get" that
 	 * SMP requires.
 	 */
-	VM_BUG_ON_PAGE(page_count(page) == 0, page);
-	atomic_inc(&page->_count);
+	VM_BUG_ON_PAGE(page_count(head_page) == 0, head_page);
+	atomic_inc(&head_page->_count);
 
 #else
-	if (unlikely(!get_page_unless_zero(page))) {
+	if (unlikely(!get_page_unless_zero(head_page))) {
 		/*
 		 * Either the page has been freed, or will be freed.
 		 * In either case, retry here and the caller should
@@ -170,7 +172,21 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	/* compound_head_fast() seen PageTail(page) == true */
+	if (unlikely(head_page != page)) {
+		/*
+		 * compound_head_fast() could fetch dangling page->first_page
+		 * pointer to an old compound page, so recheck that it is still
+		 * a tail page before returning.
+		 */
+		smp_rmb();
+		if (unlikely(!PageTail(page))) {
+			put_page(head_page);
+			goto retry;
+		}
+		/* Do not touch THP tail pages! */
+		VM_BUG_ON_PAGE(compound_tail_refcounted(head_page), head_page);
+	}
 
 	return 1;
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split
  2015-03-30 15:45       ` Aneesh Kumar K.V
@ 2015-04-01 13:19         ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-04-01 13:19 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Mon, Mar 30, 2015 at 09:15:47PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > On Mon, Mar 30, 2015 at 08:38:08PM +0530, Aneesh Kumar K.V wrote:
> >> ....
> >> ....
> >>  +static void freeze_page(struct anon_vma *anon_vma, struct page *page)
> >> > +{
> >> > +	struct anon_vma_chain *avc;
> >> > +	struct vm_area_struct *vma;
> >> > +	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> >> 
> >> So this get called only with head page, We also do
> >> BUG_ON(PageTail(page)) in the caller.  But
> >> 
> >> 
> >> > +	unsigned long addr, haddr;
> >> > +	unsigned long mmun_start, mmun_end;
> >> > +	pgd_t *pgd;
> >> > +	pud_t *pud;
> >> > +	pmd_t *pmd;
> >> > +	pte_t *start_pte, *pte;
> >> > +	spinlock_t *ptl;
> >> ......
> >> 
> >> 
> >> > +
> >> > +static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
> >> > +{
> >> > +	struct anon_vma_chain *avc;
> >> > +	pgoff_t pgoff = page_to_pgoff(page);
> >> 
> >> Why ? Can this get called for tail pages ?
> >
> > It cannot. pgoff is offset of head page (and therefore whole compound
> > page) within rmapping.
> >
> 
> This we can use 
> 
> 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>         
> similar to what we do in freeze_page(). The difference between
> freeze/unfreeze confused me.

Fair enough. Will fix.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 00/24] THP refcounting redesign
  2015-03-30 15:40 ` Aneesh Kumar K.V
@ 2015-04-01 13:26   ` Kirill A. Shutemov
  0 siblings, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-04-01 13:26 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Mon, Mar 30, 2015 at 09:10:41PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Hello everybody,
> >
> > It's bug-fix update of my thp refcounting work.
> >
> > The goal of patchset is to make refcounting on THP pages cheaper with
> > simpler semantics and allow the same THP compound page to be mapped with
> > PMD and PTEs. This is required to get reasonable THP-pagecache
> > implementation.
> >
> > With the new refcounting design it's much easier to protect against
> > split_huge_page(): simple reference on a page will make you the deal.
> > It makes gup_fast() implementation simpler and doesn't require
> > special-case in futex code to handle tail THP pages.
> >
> > It should improve THP utilization over the system since splitting THP in
> > one process doesn't necessary lead to splitting the page in all other
> > processes have the page mapped.
> 
> Can we split this series into two. The compound_lock removal and using
> migration entries ro freeze page count can be made into a seperate
> series ?

I would problaby move these patchse to the end of patchset. So they can be
applied separately.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-04-01 13:17     ` Kirill A. Shutemov
@ 2015-04-01 23:13       ` Hugh Dickins
  0 siblings, 0 replies; 56+ messages in thread
From: Hugh Dickins @ 2015-04-01 23:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, Andrew Morton,
	Andrea Arcangeli, Dave Hansen, Hugh Dickins, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Johannes Weiner, Michal Hocko,
	Jerome Marchand, linux-kernel, linux-mm

On Wed, 1 Apr 2015, Kirill A. Shutemov wrote:
> On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> > 
> > With this we now have pte mapping part of a compound page(). Now the
> > gneric gup implementation does
> > 
> > gup_pte_range()
> > 	ptem = ptep = pte_offset_map(&pmd, addr);
> > 	do {
> > 
> > ....
> > ...
> > 		if (!page_cache_get_speculative(page))
> > 			goto pte_unmap;
> > .....
> >         }
> > 
> > That page_cache_get_speculative will fail in our case because it does
> > if (unlikely(!get_page_unless_zero(page))) on a tail page. ??
> 
> Nice catch, thanks.

Indeed; but it's not the generic gup implementation,
it's just the generic fast gup implementation.

> 
> But the problem is not exclusive to my patchset. In current kernel some
> drivers (sound, for instance) map compound pages with PTEs.

Nobody has cared whether fast gup works on those, just so long as
slow gup works on those without VM_IO | VM_PFNMAP.  Whereas people did
care that fast gup work on THPs, so gave them more complicated handling.

> 
> We can try to get page_cache_get_speculative() work on PTE-mapped tail
> pages. Untested patch is below.

I didn't check through; but we'll agree that it's sad to see the
complexity you've managed to reduce elsewhere now popping up again
in other places.

Hugh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCHv4 12/24] thp: PMD splitting without splitting compound page
  2015-04-01  6:38   ` Aneesh Kumar K.V
  2015-04-01 13:17     ` Kirill A. Shutemov
@ 2015-04-02 15:39     ` Kirill A. Shutemov
  1 sibling, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2015-04-02 15:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, linux-kernel,
	linux-mm

On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
> 
> With this we now have pte mapping part of a compound page(). Now the
> gneric gup implementation does
> 
> gup_pte_range()
> 	ptem = ptep = pte_offset_map(&pmd, addr);
> 	do {
> 
> ....
> ...
> 		if (!page_cache_get_speculative(page))
> 			goto pte_unmap;
> .....
>         }
> 
> That page_cache_get_speculative will fail in our case because it does
> if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

IIUC, something as simple as patch below should work fine with migration
entries.

The reason I'm talking about migration enties is that with new refcounting
split_huge_page() breaks this generic fast GUP invariant:

 *  *) THP splits will broadcast an IPI, this can be achieved by overriding
 *      pmdp_splitting_flush.

We don't necessary trigger IPI during split. The page can be mapped only
with ptes by split time. That's fine for migration entries since we
re-check pte value after taking the pin. But it seems we don't have
anything in place for compound_lock case.

Hm. If I will not find any way to get it work with compound_lock, I would
need to implement new split_huge_page() on migration entries without
intermediate step with compound_lock.

Any comments?

diff --git a/mm/gup.c b/mm/gup.c
index d58af0785d24..b45edb8e6455 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1047,7 +1047,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		 * for an example see gup_get_pte in arch/x86/mm/gup.c
 		 */
 		pte_t pte = READ_ONCE(*ptep);
-		struct page *page;
+		struct page *head, *page;
 
 		/*
 		 * Similar to the PMD case below, NUMA hinting must take slow
@@ -1059,15 +1059,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		head = compound_head(page);
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(head))
 			goto pte_unmap;
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-			put_page(page);
+			put_page(head);
 			goto pte_unmap;
 		}
 
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2015-04-02 15:40 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-04 16:32 [PATCHv4 00/24] THP refcounting redesign Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 01/24] thp: cluster split_huge_page* code together Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 02/24] mm: change PageAnon() and page_anon_vma() to work on tail pages Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 03/24] mm: avoid PG_locked " Kirill A. Shutemov
2015-03-04 18:48   ` Christoph Lameter
2015-03-04 20:56     ` Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 04/24] rmap: add argument to charge compound page Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 05/24] mm, proc: adjust PSS calculation Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 06/24] mm: store mapcount for compound page separately Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 07/24] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 08/24] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 09/24] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 10/24] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
2015-03-04 16:32 ` [PATCHv4 11/24] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 12/24] thp: PMD splitting without splitting compound page Kirill A. Shutemov
2015-03-17  8:32   ` Aneesh Kumar K.V
2015-03-29 15:55   ` Aneesh Kumar K.V
2015-03-29 17:42     ` Kirill A. Shutemov
2015-03-29 16:28   ` Aneesh Kumar K.V
2015-03-29 17:43     ` Kirill A. Shutemov
2015-04-01  6:38   ` Aneesh Kumar K.V
2015-04-01 13:17     ` Kirill A. Shutemov
2015-04-01 23:13       ` Hugh Dickins
2015-04-02 15:39     ` Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 13/24] mm, vmstats: new THP splitting event Kirill A. Shutemov
2015-03-04 18:49   ` Christoph Lameter
2015-03-04 16:33 ` [PATCHv4 14/24] thp: implement new split_huge_page() Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 15/24] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
2015-03-29 16:10   ` Aneesh Kumar K.V
2015-03-29 17:51     ` Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 16/24] x86, " Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 17/24] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 18/24] thp, mm: split_huge_page(): caller need to lock page Kirill A. Shutemov
2015-03-30 14:10   ` Aneesh Kumar K.V
2015-03-30 15:20     ` Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 19/24] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
2015-03-30 14:19   ` Aneesh Kumar K.V
2015-03-30 15:23     ` Kirill A. Shutemov
2015-03-30 15:08   ` Aneesh Kumar K.V
2015-03-30 15:26     ` Kirill A. Shutemov
2015-03-30 15:45       ` Aneesh Kumar K.V
2015-04-01 13:19         ` Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 20/24] mm, thp: remove compound_lock Kirill A. Shutemov
2015-03-31 15:50   ` Aneesh Kumar K.V
2015-03-04 16:33 ` [PATCHv4 21/24] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 22/24] memcg: adjust to support new THP refcounting Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 23/24] ksm: split huge pages on follow_page() Kirill A. Shutemov
2015-03-04 16:33 ` [PATCHv4 24/24] thp: update documentation Kirill A. Shutemov
2015-03-05 12:55 ` [PATCHv4 00/24] THP refcounting redesign Jerome Marchand
2015-03-05 16:04   ` Kirill A. Shutemov
2015-03-06 12:18   ` Kirill A. Shutemov
2015-03-06 15:58     ` Jerome Marchand
2015-03-17  9:42 ` Aneesh Kumar K.V
2015-03-19 17:10   ` Kirill A. Shutemov
2015-03-30 15:40 ` Aneesh Kumar K.V
2015-04-01 13:26   ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).