LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86
@ 2019-05-09 16:05 Larry Bassel
  2019-05-09 16:05 ` [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing Larry Bassel
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Larry Bassel @ 2019-05-09 16:05 UTC (permalink / raw)
  To: mike.kravetz, willy, dan.j.williams, linux-mm, linux-kernel,
	linux-nvdimm
  Cc: Larry Bassel

This patchset implements sharing of page table entries pointing
to 2MiB pages (PMDs) for FS/DAX on x86.

Only shared mmapings of files (i.e. neither private mmapings nor
anonymous pages) are eligible for PMD sharing.

Due to the characteristics of DAX, this code is simpler and
less intrusive than the general case would be.

In our use case (high end Oracle database using DAX/XFS/PMEM/2MiB
pages) there would be significant memory savings.

A future system might have 6 TiB of PMEM on it and
there might be 10000 processes each mapping all of this 6 TiB.
Here the savings would be approximately
(6 TiB / 2 MiB) * 8 bytes (page table size) * 10000 = 240 GiB
(and these page tables themselves would be in non-PMEM (ordinary RAM)).

There would also be a reduction in page faults because in
some cases the page fault has already been satisfied and
the page table entry has been filled in (and so the processes
after the first would not take a fault).

The code for detecting whether PMDs can be shared and
the implementation of sharing and unsharing is based
on, but somewhat different than that in mm/hugetlb.c,
though some of the code from this file could be reused and
thus was made non-static.

Larry Bassel (2):
  Add config option to enable FS/DAX PMD sharing.
  Implement sharing/unsharing of PMDs for FS/DAX.

 arch/x86/Kconfig        |   3 ++
 include/linux/hugetlb.h |   4 ++
 mm/huge_memory.c        |  32 ++++++++++++++
 mm/hugetlb.c            |  21 ++++++++--
 mm/memory.c             | 108 +++++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 163 insertions(+), 5 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing
  2019-05-09 16:05 [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Larry Bassel
@ 2019-05-09 16:05 ` Larry Bassel
  2019-05-10 16:32   ` Elliott, Robert (Servers)
  2019-05-09 16:05 ` [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX Larry Bassel
  2019-05-14 12:28 ` [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Kirill A. Shutemov
  2 siblings, 1 reply; 14+ messages in thread
From: Larry Bassel @ 2019-05-09 16:05 UTC (permalink / raw)
  To: mike.kravetz, willy, dan.j.williams, linux-mm, linux-kernel,
	linux-nvdimm
  Cc: Larry Bassel

If enabled, sharing of FS/DAX PMDs will be attempted.

Signed-off-by: Larry Bassel <larry.bassel@oracle.com>
---
 arch/x86/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e721273..e11702e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -297,6 +297,9 @@ config ARCH_SUSPEND_POSSIBLE
 config ARCH_WANT_HUGE_PMD_SHARE
 	def_bool y
 
+config MAY_SHARE_FSDAX_PMD
+	def_bool y
+
 config ARCH_WANT_GENERAL_HUGETLB
 	def_bool y
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-09 16:05 [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Larry Bassel
  2019-05-09 16:05 ` [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing Larry Bassel
@ 2019-05-09 16:05 ` Larry Bassel
  2019-05-09 16:49   ` Matthew Wilcox
  2019-05-14 13:01   ` Kirill A. Shutemov
  2019-05-14 12:28 ` [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Kirill A. Shutemov
  2 siblings, 2 replies; 14+ messages in thread
From: Larry Bassel @ 2019-05-09 16:05 UTC (permalink / raw)
  To: mike.kravetz, willy, dan.j.williams, linux-mm, linux-kernel,
	linux-nvdimm
  Cc: Larry Bassel

This is based on (but somewhat different from) what hugetlbfs
does to share/unshare page tables.

Signed-off-by: Larry Bassel <larry.bassel@oracle.com>
---
 include/linux/hugetlb.h |   4 ++
 mm/huge_memory.c        |  32 ++++++++++++++
 mm/hugetlb.c            |  21 ++++++++--
 mm/memory.c             | 108 +++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 160 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 11943b6..9ed9542 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -142,6 +142,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end);
+unsigned long page_table_shareable(struct vm_area_struct *svma,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pgoff_t idx);
+bool vma_shareable(struct vm_area_struct *vma, unsigned long addr);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b6a34b3..e1627c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1747,6 +1747,33 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 	mm_dec_nr_ptes(mm);
 }
 
+#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
+static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
+			    pmd_t *pmdp)
+{
+	pgd_t *pgd = pgd_offset(mm, addr);
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	pud_t *pud = pud_offset(p4d, addr);
+
+	WARN_ON(page_count(virt_to_page(pmdp)) == 0);
+	if (page_count(virt_to_page(pmdp)) == 1)
+		return 0;
+
+	pud_clear(pud);
+	put_page(virt_to_page(pmdp));
+	mm_dec_nr_pmds(mm);
+	return 1;
+}
+
+#else
+static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
+			    pmd_t *pmdp)
+{
+	return 0;
+}
+
+#endif
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
@@ -1764,6 +1791,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pgtable_trans_huge_withdraw after finishing pmdp related
 	 * operations.
 	 */
+	if (unshare_huge_pmd(vma->vm_mm, addr, pmd)) {
+		spin_unlock(ptl);
+		return 1;
+	}
+
 	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
 			tlb->fullmm);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 641cedf..919a290 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4594,9 +4594,9 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 }
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
-static unsigned long page_table_shareable(struct vm_area_struct *svma,
-				struct vm_area_struct *vma,
-				unsigned long addr, pgoff_t idx)
+unsigned long page_table_shareable(struct vm_area_struct *svma,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pgoff_t idx)
 {
 	unsigned long saddr = ((idx - svma->vm_pgoff) << PAGE_SHIFT) +
 				svma->vm_start;
@@ -4619,7 +4619,7 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 {
 	unsigned long base = addr & PUD_MASK;
 	unsigned long end = base + PUD_SIZE;
@@ -4763,6 +4763,19 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)
 {
 }
+
+unsigned long page_table_shareable(struct vm_area_struct *svma,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pgoff_t idx)
+{
+	return 0;
+}
+
+bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+{
+	return false;
+}
+
 #define want_pmd_share()	(0)
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
diff --git a/mm/memory.c b/mm/memory.c
index f7d962d..4c1814c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
+static pmd_t *huge_pmd_offset(struct mm_struct *mm,
+			      unsigned long addr, unsigned long sz)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		return NULL;
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d_present(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (sz != PUD_SIZE && pud_none(*pud))
+		return NULL;
+	/* hugepage or swap? */
+	if (pud_huge(*pud) || !pud_present(*pud))
+		return (pmd_t *)pud;
+
+	pmd = pmd_offset(pud, addr);
+	if (sz != PMD_SIZE && pmd_none(*pmd))
+		return NULL;
+	/* hugepage or swap? */
+	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+		return pmd;
+
+	return NULL;
+}
+
+static pmd_t *pmd_share(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+{
+	struct vm_area_struct *vma = find_vma(mm, addr);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
+			vma->vm_pgoff;
+	struct vm_area_struct *svma;
+	unsigned long saddr;
+	pmd_t *spmd = NULL;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+
+	if (!vma_shareable(vma, addr))
+		return pmd_alloc(mm, pud, addr);
+
+	i_mmap_lock_write(mapping);
+
+	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
+		if (svma == vma)
+			continue;
+
+		saddr = page_table_shareable(svma, vma, addr, idx);
+		if (saddr) {
+			spmd = huge_pmd_offset(svma->vm_mm, saddr,
+					       vma_mmu_pagesize(svma));
+			if (spmd) {
+				get_page(virt_to_page(spmd));
+				break;
+			}
+		}
+	}
+
+	if (!spmd)
+		goto out;
+
+	ptl = pmd_lockptr(mm, spmd);
+	spin_lock(ptl);
+
+	if (pud_none(*pud)) {
+		pud_populate(mm, pud,
+			    (pmd_t *)((unsigned long)spmd & PAGE_MASK));
+		mm_inc_nr_pmds(mm);
+	} else {
+		put_page(virt_to_page(spmd));
+	}
+	spin_unlock(ptl);
+out:
+	pmd = pmd_alloc(mm, pud, addr);
+	i_mmap_unlock_write(mapping);
+	return pmd;
+}
+
+static bool may_share_pmd(struct vm_area_struct *vma)
+{
+	if (vma_is_fsdax(vma))
+		return true;
+	return false;
+}
+#else
+static pmd_t *pmd_share(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+{
+	return pmd_alloc(mm, pud, addr);
+}
+
+static bool may_share_pmd(struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -3898,7 +4001,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		}
 	}
 
-	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
+	if (unlikely(may_share_pmd(vma)))
+		vmf.pmd = pmd_share(mm, vmf.pud, address);
+	else
+		vmf.pmd = pmd_alloc(mm, vmf.pud, address);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-09 16:05 ` [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX Larry Bassel
@ 2019-05-09 16:49   ` Matthew Wilcox
  2019-05-10 16:16     ` Larry Bassel
  2019-05-14 13:01   ` Kirill A. Shutemov
  1 sibling, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2019-05-09 16:49 UTC (permalink / raw)
  To: Larry Bassel
  Cc: mike.kravetz, dan.j.williams, linux-mm, linux-kernel, linux-nvdimm

On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
> This is based on (but somewhat different from) what hugetlbfs
> does to share/unshare page tables.

Wow, that worked out far more cleanly than I was expecting to see.

> @@ -4763,6 +4763,19 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end)
>  {
>  }
> +
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx)
> +{
> +	return 0;
> +}
> +
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	return false;
> +}

I don't think you need these stubs, since the only caller of them is
also gated by MAY_SHARE_FSDAX_PMD ... right?

> +	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> +		if (svma == vma)
> +			continue;
> +
> +		saddr = page_table_shareable(svma, vma, addr, idx);
> +		if (saddr) {
> +			spmd = huge_pmd_offset(svma->vm_mm, saddr,
> +					       vma_mmu_pagesize(svma));
> +			if (spmd) {
> +				get_page(virt_to_page(spmd));
> +				break;
> +			}
> +		}
> +	}

I'd be tempted to reduce the indentation here:

	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
		if (svma == vma)
			continue;

		saddr = page_table_shareable(svma, vma, addr, idx);
		if (!saddr)
			continue;

		spmd = huge_pmd_offset(svma->vm_mm, saddr,
					vma_mmu_pagesize(svma));
		if (spmd)
			break;
	}


> +	if (!spmd)
> +		goto out;

... and move the get_page() down to here, so we don't split the
"when we find it" logic between inside and outside the loop.

	get_page(virt_to_page(spmd));

> +
> +	ptl = pmd_lockptr(mm, spmd);
> +	spin_lock(ptl);
> +
> +	if (pud_none(*pud)) {
> +		pud_populate(mm, pud,
> +			    (pmd_t *)((unsigned long)spmd & PAGE_MASK));
> +		mm_inc_nr_pmds(mm);
> +	} else {
> +		put_page(virt_to_page(spmd));
> +	}
> +	spin_unlock(ptl);
> +out:
> +	pmd = pmd_alloc(mm, pud, addr);
> +	i_mmap_unlock_write(mapping);

I would swap these two lines.  There's no need to hold the i_mmap_lock
while allocating this PMD, is there?  I mean, we don't for the !may_share
case.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-09 16:49   ` Matthew Wilcox
@ 2019-05-10 16:16     ` Larry Bassel
  2019-05-10 22:45       ` Mike Kravetz
  0 siblings, 1 reply; 14+ messages in thread
From: Larry Bassel @ 2019-05-10 16:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Larry Bassel, mike.kravetz, dan.j.williams, linux-mm,
	linux-kernel, linux-nvdimm

On 09 May 19 09:49, Matthew Wilcox wrote:
> On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
> > This is based on (but somewhat different from) what hugetlbfs
> > does to share/unshare page tables.
> 
> Wow, that worked out far more cleanly than I was expecting to see.

Yes, I was pleasantly surprised. As I've mentioned already, I 
think this is at least partially due to the nature of DAX.

> 
> > @@ -4763,6 +4763,19 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> >  				unsigned long *start, unsigned long *end)
> >  {
> >  }
> > +
> > +unsigned long page_table_shareable(struct vm_area_struct *svma,
> > +				   struct vm_area_struct *vma,
> > +				   unsigned long addr, pgoff_t idx)
> > +{
> > +	return 0;
> > +}
> > +
> > +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +	return false;
> > +}
> 
> I don't think you need these stubs, since the only caller of them is
> also gated by MAY_SHARE_FSDAX_PMD ... right?

These are also called in mm/hugetlb.c, but those calls are gated by
CONFIG_ARCH_WANT_HUGE_PMD_SHARE. In fact if this is not set (though
it is the default), then one wouldn't get FS/DAX sharing even if
MAY_SHARE_FSDAX_PMD is set. I think that this isn't what we want
(perhaps the real question is how should these two config options interact?).
Removing the stubs would fix this and I will make that change.

Maybe these two functions should be moved into mm/memory.c as well.

> 
> > +	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> > +		if (svma == vma)
> > +			continue;
> > +
> > +		saddr = page_table_shareable(svma, vma, addr, idx);
> > +		if (saddr) {
> > +			spmd = huge_pmd_offset(svma->vm_mm, saddr,
> > +					       vma_mmu_pagesize(svma));
> > +			if (spmd) {
> > +				get_page(virt_to_page(spmd));
> > +				break;
> > +			}
> > +		}
> > +	}
> 
> I'd be tempted to reduce the indentation here:
> 
> 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> 		if (svma == vma)
> 			continue;
> 
> 		saddr = page_table_shareable(svma, vma, addr, idx);
> 		if (!saddr)
> 			continue;
> 
> 		spmd = huge_pmd_offset(svma->vm_mm, saddr,
> 					vma_mmu_pagesize(svma));
> 		if (spmd)
> 			break;
> 	}
> 
> 
> > +	if (!spmd)
> > +		goto out;
> 
> ... and move the get_page() down to here, so we don't split the
> "when we find it" logic between inside and outside the loop.
> 
> 	get_page(virt_to_page(spmd));
> 
> > +
> > +	ptl = pmd_lockptr(mm, spmd);
> > +	spin_lock(ptl);
> > +
> > +	if (pud_none(*pud)) {
> > +		pud_populate(mm, pud,
> > +			    (pmd_t *)((unsigned long)spmd & PAGE_MASK));
> > +		mm_inc_nr_pmds(mm);
> > +	} else {
> > +		put_page(virt_to_page(spmd));
> > +	}
> > +	spin_unlock(ptl);
> > +out:
> > +	pmd = pmd_alloc(mm, pud, addr);
> > +	i_mmap_unlock_write(mapping);
> 
> I would swap these two lines.  There's no need to hold the i_mmap_lock
> while allocating this PMD, is there?  I mean, we don't for the !may_share
> case.
> 

These were done in the style of functions already in mm/hugetlb.c and I was
trying to change as little as necessary in my copy of those. I agree that
these are good suggestions. One could argue that if these changes
were made, they should also be made in mm/hugetlb.c, though
this is perhaps beyond the scope of getting FS/DAX PMD sharing
implemented -- your thoughts?

Thanks for the review, I'll wait a few more days for other comments
and then send out a v2.

Larry

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing
  2019-05-09 16:05 ` [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing Larry Bassel
@ 2019-05-10 16:32   ` Elliott, Robert (Servers)
  2019-05-10 18:14     ` Dan Williams
  0 siblings, 1 reply; 14+ messages in thread
From: Elliott, Robert (Servers) @ 2019-05-10 16:32 UTC (permalink / raw)
  To: Larry Bassel, mike.kravetz, willy, dan.j.williams, linux-mm,
	linux-kernel, linux-nvdimm



> -----Original Message-----
> From: Linux-nvdimm <linux-nvdimm-bounces@lists.01.org> On Behalf Of
> Larry Bassel
> Sent: Thursday, May 09, 2019 11:06 AM
> Subject: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD
> sharing
> 
> If enabled, sharing of FS/DAX PMDs will be attempted.
> 
...
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
...
> 
> +config MAY_SHARE_FSDAX_PMD
> +	def_bool y
> +

Is a config option really necessary - is there any reason to
not choose to do this?




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing
  2019-05-10 16:32   ` Elliott, Robert (Servers)
@ 2019-05-10 18:14     ` Dan Williams
  0 siblings, 0 replies; 14+ messages in thread
From: Dan Williams @ 2019-05-10 18:14 UTC (permalink / raw)
  To: Elliott, Robert (Servers)
  Cc: Larry Bassel, mike.kravetz, willy, linux-mm, linux-kernel, linux-nvdimm

On Fri, May 10, 2019 at 9:32 AM Elliott, Robert (Servers)
<elliott@hpe.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Linux-nvdimm <linux-nvdimm-bounces@lists.01.org> On Behalf Of
> > Larry Bassel
> > Sent: Thursday, May 09, 2019 11:06 AM
> > Subject: [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD
> > sharing
> >
> > If enabled, sharing of FS/DAX PMDs will be attempted.
> >
> ...
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> ...
> >
> > +config MAY_SHARE_FSDAX_PMD
> > +def_bool y
> > +
>
> Is a config option really necessary - is there any reason to
> not choose to do this?

Agree. Either the arch implementation supports it or it doesn't, I
don't see a need for any further configuration flexibility. Seems
ARCH_WANT_HUGE_PMD_SHARE should be renamed ARCH_HAS_HUGE_PMD_SHARE and
then auto-enable it.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-10 16:16     ` Larry Bassel
@ 2019-05-10 22:45       ` Mike Kravetz
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Kravetz @ 2019-05-10 22:45 UTC (permalink / raw)
  To: Larry Bassel, Matthew Wilcox
  Cc: dan.j.williams, linux-mm, linux-kernel, linux-nvdimm

On 5/10/19 9:16 AM, Larry Bassel wrote:
> On 09 May 19 09:49, Matthew Wilcox wrote:
>> On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
>>> This is based on (but somewhat different from) what hugetlbfs
>>> does to share/unshare page tables.
>>
>> Wow, that worked out far more cleanly than I was expecting to see.
> 
> Yes, I was pleasantly surprised. As I've mentioned already, I 
> think this is at least partially due to the nature of DAX.

I have not looked in detail to make sure this is indeed all the places you
need to hook and special case for sharing/unsharing.  Since this scheme is
somewhat like that used for hugetlb, I just wanted to point out some nasty
bugs related to hugetlb PMD sharing that were fixed last year.

5e41540c8a0f hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
dff11abe280b hugetlb: take PMD sharing into account when flushing tlb/caches
017b1660df89 mm: migration: fix migration of huge PMD shared pages

The common issue in these is that when unmapping a page with a shared PMD
mapping you need to flush the entire shared range and not just the unmapped
page.  The above changes were hugetlb specific.  I do not know if any of
this applies in the case of DAX.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86
  2019-05-09 16:05 [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Larry Bassel
  2019-05-09 16:05 ` [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing Larry Bassel
  2019-05-09 16:05 ` [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX Larry Bassel
@ 2019-05-14 12:28 ` Kirill A. Shutemov
  2019-05-14 16:09   ` Larry Bassel
  2 siblings, 1 reply; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-05-14 12:28 UTC (permalink / raw)
  To: Larry Bassel
  Cc: mike.kravetz, willy, dan.j.williams, linux-mm, linux-kernel,
	linux-nvdimm

On Thu, May 09, 2019 at 09:05:31AM -0700, Larry Bassel wrote:
> This patchset implements sharing of page table entries pointing
> to 2MiB pages (PMDs) for FS/DAX on x86.

-EPARSE.

How do you share entries? Entries do not take any space, page tables that
cointain these entries do.

Have you checked if the patch makes memory consumption any better. I have
doubts in it.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-09 16:05 ` [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX Larry Bassel
  2019-05-09 16:49   ` Matthew Wilcox
@ 2019-05-14 13:01   ` Kirill A. Shutemov
  2019-05-24 16:07     ` Larry Bassel
  1 sibling, 1 reply; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-05-14 13:01 UTC (permalink / raw)
  To: Larry Bassel
  Cc: mike.kravetz, willy, dan.j.williams, linux-mm, linux-kernel,
	linux-nvdimm

On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
> This is based on (but somewhat different from) what hugetlbfs
> does to share/unshare page tables.
> 
> Signed-off-by: Larry Bassel <larry.bassel@oracle.com>
> ---
>  include/linux/hugetlb.h |   4 ++
>  mm/huge_memory.c        |  32 ++++++++++++++
>  mm/hugetlb.c            |  21 ++++++++--
>  mm/memory.c             | 108 +++++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 160 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 11943b6..9ed9542 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -142,6 +142,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end);
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx);
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pd(struct vm_area_struct *vma,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b6a34b3..e1627c3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1747,6 +1747,33 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
>  	mm_dec_nr_ptes(mm);
>  }
>  
> +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> +			    pmd_t *pmdp)
> +{
> +	pgd_t *pgd = pgd_offset(mm, addr);
> +	p4d_t *p4d = p4d_offset(pgd, addr);
> +	pud_t *pud = pud_offset(p4d, addr);
> +
> +	WARN_ON(page_count(virt_to_page(pmdp)) == 0);
> +	if (page_count(virt_to_page(pmdp)) == 1)
> +		return 0;
> +
> +	pud_clear(pud);

You don't have proper locking in place to do this.

> +	put_page(virt_to_page(pmdp));
> +	mm_dec_nr_pmds(mm);
> +	return 1;
> +}
> +
> +#else
> +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> +			    pmd_t *pmdp)
> +{
> +	return 0;
> +}
> +
> +#endif
> +
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		 pmd_t *pmd, unsigned long addr)
>  {
> @@ -1764,6 +1791,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	 * pgtable_trans_huge_withdraw after finishing pmdp related
>  	 * operations.
>  	 */
> +	if (unshare_huge_pmd(vma->vm_mm, addr, pmd)) {
> +		spin_unlock(ptl);
> +		return 1;
> +	}
> +
>  	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
>  			tlb->fullmm);
>  	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 641cedf..919a290 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4594,9 +4594,9 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  }
>  
>  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> -static unsigned long page_table_shareable(struct vm_area_struct *svma,
> -				struct vm_area_struct *vma,
> -				unsigned long addr, pgoff_t idx)
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx)
>  {
>  	unsigned long saddr = ((idx - svma->vm_pgoff) << PAGE_SHIFT) +
>  				svma->vm_start;
> @@ -4619,7 +4619,7 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>  	return saddr;
>  }
>  
> -static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	unsigned long base = addr & PUD_MASK;
>  	unsigned long end = base + PUD_SIZE;
> @@ -4763,6 +4763,19 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end)
>  {
>  }
> +
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx)
> +{
> +	return 0;
> +}
> +
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	return false;
> +}
> +
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index f7d962d..4c1814c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> +static pmd_t *huge_pmd_offset(struct mm_struct *mm,
> +			      unsigned long addr, unsigned long sz)

Could you explain what this function suppose to do?

As far as I can see vma_mmu_pagesize() is always PAGE_SIZE of DAX
filesystem. So we have 'sz' == PAGE_SIZE here.

So this function can pointer to PMD of PUD page table entry casted to
pmd_t*.

Why?

> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +
> +	pgd = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgd))
> +		return NULL;
> +	p4d = p4d_offset(pgd, addr);
> +	if (!p4d_present(*p4d))
> +		return NULL;
> +
> +	pud = pud_offset(p4d, addr);
> +	if (sz != PUD_SIZE && pud_none(*pud))
> +		return NULL;
> +	/* hugepage or swap? */
> +	if (pud_huge(*pud) || !pud_present(*pud))
> +		return (pmd_t *)pud;
> +
> +	pmd = pmd_offset(pud, addr);
> +	if (sz != PMD_SIZE && pmd_none(*pmd))
> +		return NULL;
> +	/* hugepage or swap? */
> +	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> +		return pmd;
> +
> +	return NULL;
> +}
> +
> +static pmd_t *pmd_share(struct mm_struct *mm, pud_t *pud, unsigned long addr)
> +{
> +	struct vm_area_struct *vma = find_vma(mm, addr);

Why? Caller has vma on hands.

> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
> +			vma->vm_pgoff;

linear_page_index()?

> +	struct vm_area_struct *svma;
> +	unsigned long saddr;
> +	pmd_t *spmd = NULL;
> +	pmd_t *pmd;
> +	spinlock_t *ptl;
> +
> +	if (!vma_shareable(vma, addr))
> +		return pmd_alloc(mm, pud, addr);
> +
> +	i_mmap_lock_write(mapping);
> +
> +	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> +		if (svma == vma)
> +			continue;
> +
> +		saddr = page_table_shareable(svma, vma, addr, idx);
> +		if (saddr) {
> +			spmd = huge_pmd_offset(svma->vm_mm, saddr,
> +					       vma_mmu_pagesize(svma));
> +			if (spmd) {
> +				get_page(virt_to_page(spmd));

So, here we get a pin on a page table page. And we don't know if it's PMD
or PUD page table.

And we only checked one entry in the page table.

What if the page table mixes huge-PMD/PUD entries with pointers to page
table.

> +				break;
> +			}
> +		}
> +	}
> +
> +	if (!spmd)
> +		goto out;
> +
> +	ptl = pmd_lockptr(mm, spmd);
> +	spin_lock(ptl);

You take lock on PMD page table...

> +
> +	if (pud_none(*pud)) {
> +		pud_populate(mm, pud,
> +			    (pmd_t *)((unsigned long)spmd & PAGE_MASK));

... and modify PUD page table.

> +		mm_inc_nr_pmds(mm);
> +	} else {
> +		put_page(virt_to_page(spmd));
> +	}
> +	spin_unlock(ptl);
> +out:
> +	pmd = pmd_alloc(mm, pud, addr);
> +	i_mmap_unlock_write(mapping);
> +	return pmd;
> +}
> +
> +static bool may_share_pmd(struct vm_area_struct *vma)
> +{
> +	if (vma_is_fsdax(vma))
> +		return true;
> +	return false;
> +}
> +#else
> +static pmd_t *pmd_share(struct mm_struct *mm, pud_t *pud, unsigned long addr)
> +{
> +	return pmd_alloc(mm, pud, addr);
> +}
> +
> +static bool may_share_pmd(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +#endif
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> @@ -3898,7 +4001,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  		}
>  	}
>  
> -	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
> +	if (unlikely(may_share_pmd(vma)))
> +		vmf.pmd = pmd_share(mm, vmf.pud, address);
> +	else
> +		vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>  	if (!vmf.pmd)
>  		return VM_FAULT_OOM;
>  	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86
  2019-05-14 12:28 ` [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Kirill A. Shutemov
@ 2019-05-14 16:09   ` Larry Bassel
  0 siblings, 0 replies; 14+ messages in thread
From: Larry Bassel @ 2019-05-14 16:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Larry Bassel, mike.kravetz, willy, dan.j.williams, linux-mm,
	linux-kernel, linux-nvdimm

On 14 May 19 15:28, Kirill A. Shutemov wrote:
> On Thu, May 09, 2019 at 09:05:31AM -0700, Larry Bassel wrote:
> > This patchset implements sharing of page table entries pointing
> > to 2MiB pages (PMDs) for FS/DAX on x86.
> 
> -EPARSE.
> 
> How do you share entries? Entries do not take any space, page tables that
> cointain these entries do.

Yes, I'll correct this in v2.

> 
> Have you checked if the patch makes memory consumption any better. I have
> doubts in it.

Yes I have -- the following is debugging output I have from my testing.
The (admittedly simple) test case is two copies of a program that mmaps
1GiB of a DAX/XFS file (with 2MiB page size), touches the first page
(physical 200400000 in this case) and then sleeps forever.

sharing disabled:

(process A)
[  420.369975] pgd_index = fe
[  420.369975] pgd = 00000000e1ebf83b
[  420.369975] pgd_val = 8000000405ca8067
[  420.369976] pud_index = 100
[  420.369976] pud = 00000000bd7a7df0
[  420.369976] pud_val = 4058f9067
[  420.369977] pmd_index = 0
[  420.369977] pmd = 00000000791e93d4
[  420.369977] pmd_val = 84000002004008e7
[  420.369978] pmd huge
[  420.369978] page_addr = 200400000, page_offset = 0
[  420.369979] vaddr = 7f4000000000, paddr = 200400000

(process B)
[  420.370013] pgd_index = fe
[  420.370014] pgd = 00000000a2bac60d
[  420.370014] pgd_val = 8000000405a8f067
[  420.370015] pud_index = 100
[  420.370015] pud = 00000000dcc3ff1a
[  420.370015] pud_val = 3fc713067
[  420.370016] pmd_index = 0
[  420.370016] pmd = 000000006b4679db
[  420.370016] pmd_val = 84000002004008e7
[  420.370017] pmd huge
[  420.370017] page_addr = 200400000, page_offset = 0
[  420.370018] vaddr = 7f4000000000, paddr = 200400000

sharing enabled:

(process A)
[  696.992342] pgd_index = fe
[  696.992342] pgd = 000000009612024b
[  696.992343] pgd_val = 8000000404725067
[  696.992343] pud_index = 100
[  696.992343] pud = 00000000c98ab17c
[  696.992344] pud_val = 4038e3067
[  696.992344] pmd_index = 0
[  696.992344] pmd = 000000002437681b
[  696.992344] pmd_val = 84000002004008e7
[  696.992345] pmd huge
[  696.992345] page_addr = 200400000, page_offset = 0
[  696.992345] vaddr = 7f4000000000, paddr = 200400000

(process B)
[  696.992351] pgd_index = fe
[  696.992351] pgd = 0000000012326848
[  696.992352] pgd_val = 800000040a953067
[  696.992352] pud_index = 100
[  696.992352] pud = 00000000f989bcf6
[  696.992352] pud_val = 4038e3067
[  696.992353] pmd_index = 0
[  696.992353] pmd = 000000002437681b
[  696.992353] pmd_val = 84000002004008e7
[  696.992353] pmd huge
[  696.992354] page_addr = 200400000, page_offset = 0
[  696.992354] vaddr = 7f4000000000, paddr = 200400000

Note that in the sharing enabled case, the pud_val and pmd are
the same for the two processes. In the disabled case we
have two separate pmds (and so more memory was allocated).

Also, (though not visible from the output above) the second
process did not take a page fault as the virtual->physical mapping
was already established thanks to the sharing.

Larry

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-14 13:01   ` Kirill A. Shutemov
@ 2019-05-24 16:07     ` Larry Bassel
  2019-05-24 17:02       ` Dan Williams
  2019-06-12  2:07       ` Kirill A. Shutemov
  0 siblings, 2 replies; 14+ messages in thread
From: Larry Bassel @ 2019-05-24 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Larry Bassel, mike.kravetz, willy, dan.j.williams, linux-mm,
	linux-kernel, linux-nvdimm

On 14 May 19 16:01, Kirill A. Shutemov wrote:
> On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
[trim]
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1747,6 +1747,33 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
> >  	mm_dec_nr_ptes(mm);
> >  }
> >  
> > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> > +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> > +			    pmd_t *pmdp)
> > +{
> > +	pgd_t *pgd = pgd_offset(mm, addr);
> > +	p4d_t *p4d = p4d_offset(pgd, addr);
> > +	pud_t *pud = pud_offset(p4d, addr);
> > +
> > +	WARN_ON(page_count(virt_to_page(pmdp)) == 0);
> > +	if (page_count(virt_to_page(pmdp)) == 1)
> > +		return 0;
> > +
> > +	pud_clear(pud);
> 
> You don't have proper locking in place to do this.

This code is based on and very similar to the code in
mm/hugetlb.c (huge_pmd_unshare()).

I asked Mike Kravetz why the locking in huge_pmd_share() and
huge_pmd_unshare() is correct. The issue (as you point out later
in your email) is whether in both of those cases it is OK to
take the PMD table lock and then modify the PUD table.

He responded with the following analysis:

---------------------------------------------------------------------------------
I went back and looked at the locking in the hugetlb code.  Here is
most of the code for huge_pmd_share().

	i_mmap_lock_write(mapping);
	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
		if (svma == vma)
			continue;

		saddr = page_table_shareable(svma, vma, addr, idx);
		if (saddr) {
			spte = huge_pte_offset(svma->vm_mm, saddr,
					       vma_mmu_pagesize(svma));
			if (spte) {
				get_page(virt_to_page(spte));
				break;
			}
		}
	}

	if (!spte)
		goto out;

	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
>>>
The primary reason the page table lock is taken here is for the purpose of
checking and possibly updating the PUD (pointer to PMD page).  Note that by
the time we get here we already have found a PMD page to share.  Also note
that the lock taken is the one associated with the PMD page.

The synchronization question to ask is:  Can anyone else modify the PUD value
while I am holding the PMD lock?  In general, the answer is Yes.  However,
we can infer something subtle about the shared PMD case.  Suppose someone
else wanted to set the PUD value.  The only value they could set it to is the
PMD page we found in this routine.  They also would need to go through this
routine to set the value.  They also would need to get the lock on the same
shared PMD.  Actually, they would hit the mapping->i_mmap_rwsem first.  But,
the bottom line is that nobody else can set it.  What about clearing?  In the
hugetlb case, the only places where PUD gets cleared are final page table
tear down and huge_pmd_unshare().  The final page table tear down case is not
interesting as the process is exiting.  All callers if huge_pmd_unshare must
hold the (PMD) page table lock.  This is a requirement.  Therefore, within
a single process this synchronizes two threads:  one calling huge_pmd_share
and another huge_pmd_unshare.
---------------------------------------------------------------------------------

I assert that the same analysis applies to pmd_share() and unshare_huge_pmd()
which are added in this patch.

> 
> > +	put_page(virt_to_page(pmdp));
> > +	mm_dec_nr_pmds(mm);
> > +	return 1;
> > +}
> > +
> > +#else
> > +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> > +			    pmd_t *pmdp)
> > +{
> > +	return 0;
> > +}
> > +
> > +#endif
> > +
> >  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		 pmd_t *pmd, unsigned long addr)
> >  {
> > @@ -1764,6 +1791,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  	 * pgtable_trans_huge_withdraw after finishing pmdp related
> >  	 * operations.
> >  	 */
> > +	if (unshare_huge_pmd(vma->vm_mm, addr, pmd)) {
> > +		spin_unlock(ptl);
> > +		return 1;
> > +	}
> > +
> >  	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> >  			tlb->fullmm);
> >  	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 641cedf..919a290 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4594,9 +4594,9 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> >  }
> >  
> >  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> > -static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > -				struct vm_area_struct *vma,
> > -				unsigned long addr, pgoff_t idx)
> > +unsigned long page_table_shareable(struct vm_area_struct *svma,
> > +				   struct vm_area_struct *vma,
> > +				   unsigned long addr, pgoff_t idx)
> >  {
> >  	unsigned long saddr = ((idx - svma->vm_pgoff) << PAGE_SHIFT) +
> >  				svma->vm_start;
> > @@ -4619,7 +4619,7 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> >  	return saddr;
> >  }
> >  
> > -static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> > +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  {
> >  	unsigned long base = addr & PUD_MASK;
> >  	unsigned long end = base + PUD_SIZE;
> > @@ -4763,6 +4763,19 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> >  				unsigned long *start, unsigned long *end)
> >  {
> >  }
> > +
> > +unsigned long page_table_shareable(struct vm_area_struct *svma,
> > +				   struct vm_area_struct *vma,
> > +				   unsigned long addr, pgoff_t idx)
> > +{
> > +	return 0;
> > +}
> > +
> > +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +	return false;
> > +}
> > +
> >  #define want_pmd_share()	(0)
> >  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> >  
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f7d962d..4c1814c 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> >  	return 0;
> >  }
> >  
> > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> > +static pmd_t *huge_pmd_offset(struct mm_struct *mm,
> > +			      unsigned long addr, unsigned long sz)
> 
> Could you explain what this function suppose to do?
> 
> As far as I can see vma_mmu_pagesize() is always PAGE_SIZE of DAX
> filesystem. So we have 'sz' == PAGE_SIZE here.

I thought so too, but in my testing I found that vma_mmu_pagesize() returns
4KiB, which differs from the DAX filesystem's 2MiB pagesize.

> 
> So this function can pointer to PMD of PUD page table entry casted to
> pmd_t*.
> 
> Why?

I don't understand your question here.

> 
> > +{
> > +	pgd_t *pgd;
> > +	p4d_t *p4d;
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +
> > +	pgd = pgd_offset(mm, addr);
> > +	if (!pgd_present(*pgd))
> > +		return NULL;
> > +	p4d = p4d_offset(pgd, addr);
> > +	if (!p4d_present(*p4d))
> > +		return NULL;
> > +
> > +	pud = pud_offset(p4d, addr);
> > +	if (sz != PUD_SIZE && pud_none(*pud))
> > +		return NULL;
> > +	/* hugepage or swap? */
> > +	if (pud_huge(*pud) || !pud_present(*pud))
> > +		return (pmd_t *)pud;
> > +
> > +	pmd = pmd_offset(pud, addr);
> > +	if (sz != PMD_SIZE && pmd_none(*pmd))
> > +		return NULL;
> > +	/* hugepage or swap? */
> > +	if (pmd_huge(*pmd) || !pmd_present(*pmd))
> > +		return pmd;
> > +
> > +	return NULL;
> > +}
> > +
> > +static pmd_t *pmd_share(struct mm_struct *mm, pud_t *pud, unsigned long addr)
> > +{
> > +	struct vm_area_struct *vma = find_vma(mm, addr);
> 
> Why? Caller has vma on hands.

This was taken from huge_pmd_share() in mm/hugetlb.c which does
things that way. Are you suggesting that I just pass vma as
an argument to pmd_share()?

> 
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
> > +			vma->vm_pgoff;
> 
> linear_page_index()?

Again this came from huge_pmd_share(). I was trying to keep
the differences between both functions as small as possible.

> 
> > +	struct vm_area_struct *svma;
> > +	unsigned long saddr;
> > +	pmd_t *spmd = NULL;
> > +	pmd_t *pmd;
> > +	spinlock_t *ptl;
> > +
> > +	if (!vma_shareable(vma, addr))
> > +		return pmd_alloc(mm, pud, addr);
> > +
> > +	i_mmap_lock_write(mapping);
> > +
> > +	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> > +		if (svma == vma)
> > +			continue;
> > +
> > +		saddr = page_table_shareable(svma, vma, addr, idx);
> > +		if (saddr) {
> > +			spmd = huge_pmd_offset(svma->vm_mm, saddr,
> > +					       vma_mmu_pagesize(svma));
> > +			if (spmd) {
> > +				get_page(virt_to_page(spmd));
> 
> So, here we get a pin on a page table page. And we don't know if it's PMD
> or PUD page table.

DAX only does 4 KiB and 2 MiB pagesizes, not 1 GiB. The checks for sharing
prevent any 4 KiB DAX from entering this code.

> 
> And we only checked one entry in the page table.
> 
> What if the page table mixes huge-PMD/PUD entries with pointers to page
> table.

Again, I don't think this can happen in DAX. The only sharing allowed
is for FS/DAX/2MiB pagesize.

> 
> > +				break;
> > +			}
> > +		}
> > +	}
> > +
> > +	if (!spmd)
> > +		goto out;
> > +
> > +	ptl = pmd_lockptr(mm, spmd);
> > +	spin_lock(ptl);
> 
> You take lock on PMD page table...
> 
> > +
> > +	if (pud_none(*pud)) {
> > +		pud_populate(mm, pud,
> > +			    (pmd_t *)((unsigned long)spmd & PAGE_MASK));
> 
> ... and modify PUD page table.

Please see my comments about this issue above.

> 
> > +		mm_inc_nr_pmds(mm);
> > +	} else {
> > +		put_page(virt_to_page(spmd));
> > +	}
> > +	spin_unlock(ptl);
> > +out:
> > +	pmd = pmd_alloc(mm, pud, addr);
> > +	i_mmap_unlock_write(mapping);
> > +	return pmd;
> > +}

[trim]

Thanks for the review. My apologies for not getting
back to you sooner.

Larry

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-24 16:07     ` Larry Bassel
@ 2019-05-24 17:02       ` Dan Williams
  2019-06-12  2:07       ` Kirill A. Shutemov
  1 sibling, 0 replies; 14+ messages in thread
From: Dan Williams @ 2019-05-24 17:02 UTC (permalink / raw)
  To: Larry Bassel
  Cc: Kirill A. Shutemov, Mike Kravetz, Matthew Wilcox, Linux MM,
	Linux Kernel Mailing List, linux-nvdimm

On Fri, May 24, 2019 at 9:07 AM Larry Bassel <larry.bassel@oracle.com> wrote:
> On 14 May 19 16:01, Kirill A. Shutemov wrote:
> > On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
[..]
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f7d962d..4c1814c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > >     return 0;
> > >  }
> > >
> > > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> > > +static pmd_t *huge_pmd_offset(struct mm_struct *mm,
> > > +                         unsigned long addr, unsigned long sz)
> >
> > Could you explain what this function suppose to do?
> >
> > As far as I can see vma_mmu_pagesize() is always PAGE_SIZE of DAX
> > filesystem. So we have 'sz' == PAGE_SIZE here.
>
> I thought so too, but in my testing I found that vma_mmu_pagesize() returns
> 4KiB, which differs from the DAX filesystem's 2MiB pagesize.

A given filesystem-dax vma is allowed to support both 4K and 2M
mappings, so the vma_mmu_pagesize() is not granular enough to describe
the capabilities of a filesystem-dax vma. In the device-dax case,
where there are mapping guarantees, the implementation does arrange
for vma_mmu_pagesize() to reflect the right page size.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
  2019-05-24 16:07     ` Larry Bassel
  2019-05-24 17:02       ` Dan Williams
@ 2019-06-12  2:07       ` Kirill A. Shutemov
  1 sibling, 0 replies; 14+ messages in thread
From: Kirill A. Shutemov @ 2019-06-12  2:07 UTC (permalink / raw)
  To: Larry Bassel
  Cc: mike.kravetz, willy, dan.j.williams, linux-mm, linux-kernel,
	linux-nvdimm

On Fri, May 24, 2019 at 09:07:11AM -0700, Larry Bassel wrote:
> Again, I don't think this can happen in DAX. The only sharing allowed
> is for FS/DAX/2MiB pagesize.

Hm. I still don't follow. How do you guarantee that DAX actually allocated
continues space for the file on backing storage and you can map it with
PMD page? I believe you don't have such guarantee.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-06-12  2:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-09 16:05 [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Larry Bassel
2019-05-09 16:05 ` [PATCH, RFC 1/2] Add config option to enable FS/DAX PMD sharing Larry Bassel
2019-05-10 16:32   ` Elliott, Robert (Servers)
2019-05-10 18:14     ` Dan Williams
2019-05-09 16:05 ` [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX Larry Bassel
2019-05-09 16:49   ` Matthew Wilcox
2019-05-10 16:16     ` Larry Bassel
2019-05-10 22:45       ` Mike Kravetz
2019-05-14 13:01   ` Kirill A. Shutemov
2019-05-24 16:07     ` Larry Bassel
2019-05-24 17:02       ` Dan Williams
2019-06-12  2:07       ` Kirill A. Shutemov
2019-05-14 12:28 ` [PATCH, RFC 0/2] Share PMDs for FS/DAX on x86 Kirill A. Shutemov
2019-05-14 16:09   ` Larry Bassel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).