LKML Archive on lore.kernel.org
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Hugh Dickins <hughd@google.com>
Cc: kirill.shutemov@linux.intel.com, aarcange@redhat.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [v2 PATCH] mm: shmem: allow split THP when truncating THP partially
Date: Wed, 4 Dec 2019 16:50:49 -0800
Message-ID: <c0f134f1-caf8-0baa-5f0e-87f2c530c631@linux.alibaba.com>
In-Reply-To: <alpine.LSU.2.11.1912041601270.12930@eggly.anvils>



On 12/4/19 4:15 PM, Hugh Dickins wrote:
> On Wed, 4 Dec 2019, Yang Shi wrote:
>
>> Currently, when truncating a shmem file, if the range covers only part
>> of a THP (start or end is in the middle of the THP), the pages are just
>> cleared rather than freed; they are freed only if the range covers the
>> whole THP.  Even if all the subpages are truncated (randomly or
>> sequentially), the THP may still be kept in the page cache.  This might
>> be fine for some use cases which prefer preserving THP.
>>
>> But when doing balloon inflation, QEMU does hole punch or MADV_DONTNEED
>> at base page size granularity if hugetlbfs is not used.  So, when shmem
>> THP is used as the memory backend, QEMU inflation doesn't work as
>> expected since it doesn't free memory, and the inflation use case really
>> needs the memory to be freed.  Anonymous THP is not freed right away
>> either, but it is freed eventually once all subpages are unmapped,
>> whereas shmem THP would still stay in the page cache.
>>
>> Split the THP right away when doing a partial hole punch, and if the
>> split fails just clear the page so that reads of the hole-punched area
>> return zeros.
>>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: * Adopted the comment from Kirill.
>>      * Dropped fallocate mode flag, THP split is the default behavior.
>>      * Blended Hugh's implementation with my v1 patch. TBH I'm not very keen on
>>        Hugh's find_get_entries() hack (basically neutral), but without that hack
> Thanks for giving it a try.  I'm not neutral about my find_get_entries()
> hack: it surely had to go (without it, I'd have just pushed my own patch).

We are on the same page :-)

> I've not noticed anything wrong with your patch, and it's in the right
> direction, but I'm still not thrilled with it.  I also remember that I
> got the looping wrong in my first internal attempt (fixed in what I sent),
> and need to be very sure of the try-again-versus-move-on-to-next conditions
> before agreeing to anything.  No rush, I'll come back to this in the days
> or months ahead: I'll try to find a less gotoey blend of yours and mine.

Yes, those gotos look a little bit convoluted, so I added a lot of
comments to improve the readability. Thanks for your time.
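
For anyone following the try-again-versus-move-on conditions, the rule
both loops in the patch apply can be modelled outside the kernel roughly
as below. This is only a sketch under stated assumptions: the names
model_classify(), model_next_index() and MODEL_HPAGE_PMD_NR are
hypothetical, the value 512 assumes PMD-sized THP on 4K base pages, and
the boolean arguments stand in for PageTransCompound(), PageHead() and
the unfalloc flag. It is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 512 base pages per PMD-sized THP (4K pages, 2M THP). */
#define MODEL_HPAGE_PMD_NR 512UL

enum thp_action {
	ACT_DEFAULT,		/* no THP special-casing: normal truncate path */
	ACT_TRUNCATE_WHOLE,	/* head page and the range covers the whole THP */
	ACT_SPLIT		/* tail page, or the range ends inside this THP */
};

/* Round x down to a multiple of nr; nr must be a power of two. */
static unsigned long model_round_down(unsigned long x, unsigned long nr)
{
	assert(nr != 0 && (nr & (nr - 1)) == 0);
	return x & ~(nr - 1);
}

/*
 * Mirrors the condition in the patch:
 *   if (PageTransCompound(page) && !unfalloc) {
 *           if (PageHead(page) && index != round_down(end, HPAGE_PMD_NR))
 *                   punch the whole THP
 *           else
 *                   split
 *   }
 */
static enum thp_action model_classify(bool compound, bool head, bool unfalloc,
				      unsigned long index, unsigned long end)
{
	if (!compound || unfalloc)
		return ACT_DEFAULT;
	if (head && index != model_round_down(end, MODEL_HPAGE_PMD_NR))
		return ACT_TRUNCATE_WHOLE;
	return ACT_SPLIT;
}

/*
 * Index to look up next after a split attempt.  split_ret follows the
 * split_huge_page() convention used in the patch: 0 means success.
 * Success: re-lookup from the same index.  Failure: move to the next one.
 */
static unsigned long model_next_index(unsigned long index, int split_ret)
{
	return split_ret == 0 ? index : index + 1;
}
```

So a head page at index 0 with end = 1024 is punched whole, a head page
at index 512 with end = 600 is split, and any tail page is split.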

>
> Hugh
>
>>        we have to rely on pagevec_release() to release extra pins and play with
>>        goto. This version does it this way. The patch is bigger than Hugh's due
>>        to extra comments to make the flow clear.
>>
>>   mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
>>   1 file changed, 83 insertions(+), 37 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 220be9f..1ae0c7f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   	long nr_swaps_freed = 0;
>>   	pgoff_t index;
>>   	int i;
>> +	bool split = false;
>> +	struct page *page = NULL;
>>   
>>   	if (lend == -1)
>>   		end = -1;	/* unsigned, so actually very big */
>>   
>>   	pagevec_init(&pvec);
>>   	index = start;
>> +retry:
>>   	while (index < end) {
>>   		pvec.nr = find_get_entries(mapping, index,
>>   			min(end - index, (pgoff_t)PAGEVEC_SIZE),
>> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		if (!pvec.nr)
>>   			break;
>>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>   
>>   			index = indices[i];
>>   			if (index >= end)
>> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			if (!trylock_page(page))
>>   				continue;
>>   
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>   					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>   					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto split;
>>   				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>   			}
>>   
>>   			if (!unfalloc || !PageUptodate(page)) {
>> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			}
>>   			unlock_page(page);
>>   		}
>> +split:
>>   		pagevec_remove_exceptionals(&pvec);
>>   		pagevec_release(&pvec);
>>   		cond_resched();
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup.  And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto retry;
>> +			}
>> +
>> +			/* Fail to split THP, move to next index */
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>   		index++;
>>   	}
>>   
>> @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   		return;
>>   
>>   	index = start;
>> +again:
>>   	while (index < end) {
>>   		cond_resched();
>>   
>> @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			continue;
>>   		}
>>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>   
>>   			index = indices[i];
>>   			if (index >= end)
>> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   
>>   			lock_page(page);
>>   
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				/*
>> -				 * Partial thp truncate due 'start' in middle
>> -				 * of THP: don't need to look on these pages
>> -				 * again on !pvec.nr restart.
>> -				 */
>> -				if (index != round_down(end, HPAGE_PMD_NR))
>> -					start++;
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>   					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>   					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto rescan_split;
>>   				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>   			}
>>   
>>   			if (!unfalloc || !PageUptodate(page)) {
>> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>   			}
>>   			unlock_page(page);
>>   		}
>> +rescan_split:
>>   		pagevec_remove_exceptionals(&pvec);
>>   		pagevec_release(&pvec);
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup.  And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto again;
>> +			}
>> +
>> +			/*
>> +			 * Split fail, clear the page then move to next
>> +			 * index.
>> +			 */
>> +			clear_highpage(page);
>> +
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>   		index++;
>>   	}
>>   
>> -- 
>> 1.8.3.1
>>
>>
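
To make the balloon inflation case from the changelog concrete, here is
a toy accounting model of how many base pages come back when holes are
punched one base page at a time over a single THP. It is a rough sketch,
not a measurement: freed_before(), freed_after(), balloon_freed() and
SKETCH_THP_NR are hypothetical names, 512 subpages per THP is an
assumption, the "before" case takes the worst case described in the
changelog (nothing is freed unless one range covers the whole THP), and
the "after" case assumes every split_huge_page() call succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one THP spans 512 base pages, at offsets [0, 512). */
#define SKETCH_THP_NR 512

/*
 * Before the patch (worst case from the changelog): a partial punch only
 * clears subpages, so pages are freed only when one punch covers the
 * whole THP.
 */
static size_t freed_before(size_t start, size_t end)
{
	assert(start <= end && end <= SKETCH_THP_NR);
	return (start == 0 && end == SKETCH_THP_NR) ? SKETCH_THP_NR : 0;
}

/*
 * With the patch, assuming the split succeeds: the THP is split and each
 * punched base page can be freed on its own.
 */
static size_t freed_after(size_t start, size_t end)
{
	assert(start <= end && end <= SKETCH_THP_NR);
	return end - start;
}

/* Balloon-style inflation: punch npages holes, one base page per call. */
static size_t balloon_freed(size_t npages, size_t (*freed)(size_t, size_t))
{
	size_t total = 0;

	for (size_t i = 0; i < npages; i++)
		total += freed(i, i + 1);
	return total;
}
```

In this model, punching all 512 subpages one by one frees nothing before
the patch and 512 pages with it, while a single punch over the whole THP
frees 512 pages either way.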


Thread overview: 3+ messages
2019-12-04  0:42 Yang Shi
2019-12-05  0:15 ` Hugh Dickins
2019-12-05  0:50   ` Yang Shi [this message]
