LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
	peterx@redhat.com, david@redhat.com,
	christian.brauner@ubuntu.com, adobriyan@gmail.com,
	songmuchun@bytedance.com, axboe@kernel.dk,
	vincenzo.frascino@arm.com, catalin.marinas@arm.com,
	peterz@infradead.org, chinwen.chang@mediatek.com,
	linmiaohe@huawei.com, jannh@google.com, apopple@nvidia.com,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, ivan.teterevkov@nutanix.com,
	florian.schmidt@nutanix.com, carl.waldspurger@nutanix.com,
	jonathan.davies@nutanix.com
Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages
Date: Fri, 30 Jul 2021 12:28:17 -0500	[thread overview]
Message-ID: <87y29nbtji.fsf@disp2133> (raw)
In-Reply-To: <20210730160826.63785-1-tiberiu.georgescu@nutanix.com> (Tiberiu A. Georgescu's message of "Fri, 30 Jul 2021 16:08:25 +0000")

Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com> writes:

> This patch follows up on a previous RFC:
> 20210714152426.216217-1-tiberiu.georgescu@nutanix.com
>
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap
> interface.

What is the point?

You say a shared swapped out page is the same as a clean shared page
and you are exactly correct.  What is the point in knowing a shared
page was swapped out?  What does is the gain?

I tried to understand the point by looking at your numbers below
and everything I could see looked worse post patch.  

Eric



> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
>     #define NPAGES (256)
>     /* map 1MiB shared memory */
>     size_t pagesize = getpagesize();
>     char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>     		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>     /* Dirty new pages. */
>     for (i = 0; i < PAGES; i++)
>     	p[i * pagesize] = i;
>
> Run the above program in a small cgroup, which causes swapping:
>     /* Initialise cgroup & run a program */
>     $ echo 512K > foo/memory.limit_in_bytes
>     $ echo 60 > foo/memory.swappiness
>     $ cgexec -g memory:foo ./pagemap-test
>
> Check the pagemap report. Example of the current expected output:
>     $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
>     00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>     *
>     00000710  e1 6b 06 00 00 00 00 a1  9e eb 06 00 00 00 00 a1  |.k..............|
>     00000720  6b ee 06 00 00 00 00 a1  a5 a4 05 00 00 00 00 a1  |k...............|
>     00000730  5c bf 06 00 00 00 00 a1  90 b6 06 00 00 00 00 a1  |\...............|
>
> The first pagemap entries are reported as zeroes, indicating the pages have
> never been allocated while they have actually been swapped out.
>
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are in either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
>
> Performance
> ============
> I measured the performance of the patch on a single socket Xeon E5-2620
> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
> steps taken:
>
>   1. Run example pagemap-test code on a cgroup
>     a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
>     b. allocate 16GiB (about 4 million pages);
>     c. dirty 0,50 or 100% of pages;
>     d. do this for both private and shared memory.
>   2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
>      for each possible configuration above
>     a.  3 times for warm up;
>     b. 10 times to measure performance.
>        Use `time` or another performance measuring tool.
>
> Results (averaged over 10 iterations):
>                +--------+------------+------------+
>                | dirty% |  pre patch | post patch |
>                +--------+------------+------------+
>  private|anon  |     0% |      8.15s |      8.40s |
>                |    50% |     11.83s |     12.19s |
>                |   100% |     12.37s |     12.20s |
>                +--------+------------+------------+
>   shared|anon  |     0% |      8.17s |      8.18s |
>                |    50% | (*) 10.43s |     37.43s |
>                |   100% | (*) 10.20s |     38.59s |
>                +--------+------------+------------+
>
> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
>      out pages.
>
> From run to run the above results are stable (mostly <1% stderr).
>
> The amount of time it takes for a full read of the pagemap depends on the
> granularity used by dd to read the pagemap file. Even though the access is
> sequential, the script only reads 8 bytes at a time, running pagemap_read()
> COUNT times (one time for each page in a 16GiB area).
>
> To reduce overhead, we can use batching for large amounts of sequential
> access. We can make dd read multiple page entries at a time,
> allowing the kernel to make optimisations and yield more throughput.
>
> Performance in real time (seconds) of
> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
> count=$((4194304 / $BATCH))`:
> +---------------------------------+ +---------------------------------+
> |     Shared, Anon, 50% dirty     | |     Shared, Anon, 100% dirty    |
> +-------+------------+------------+ +-------+------------+------------+
> | Batch |  Pre-patch | Post-patch | | Batch |  Pre-patch | Post-patch |
> +-------+------------+------------+ +-------+------------+------------+
> |     1 | (*) 10.43s |     37.43s | |     1 | (*) 10.20s |     38.59s |
> |     2 | (*)  5.25s |     18.77s | |     2 | (*)  5.15s |     19.37s |
> |     4 | (*)  2.63s |      9.42s | |     4 | (*)  2.63s |      9.74s |
> |     8 | (*)  1.38s |      4.80s | |     8 | (*)  1.35s |      4.94s |
> |    16 | (*)  0.73s |      2.46s | |    16 | (*)  0.72s |      2.54s |
> |    32 | (*)  0.40s |      1.31s | |    32 | (*)  0.41s |      1.34s |
> |    64 | (*)  0.25s |      0.72s | |    64 | (*)  0.24s |      0.74s |
> |   128 | (*)  0.16s |      0.43s | |   128 | (*)  0.16s |      0.44s |
> |   256 | (*)  0.12s |      0.28s | |   256 | (*)  0.12s |      0.29s |
> |   512 | (*)  0.10s |      0.21s | |   512 | (*)  0.10s |      0.22s |
> |  1024 | (*)  0.10s |      0.20s | |  1024 | (*)  0.10s |      0.21s |
> +-------+------------+------------+ +-------+------------+------------+
>
> To conclude, in order to make the most of the underlying mechanisms of
> pagemap and xarray, one should be using batching to achieve better
> performance.
>
> Future Work
> ============
>
> Note: there are PTE flags which currently do not survive the swap out when
> the page is shmem: SOFT_DIRTY and UFFD_WP.
>
> A solution for saving the state of the UFFD_WP flag has been proposed by
> Peter Xu in the patch linked below. The concept and mechanism proposed
> could be extended to include the SOFT_DIRTY bit as well:
> 20210715201422.211004-1-peterx@redhat.com
> Our patches are mostly orthogonal.
>
> Kind regards,
> Tibi
>
> Tiberiu A Georgescu (1):
>   pagemap: report swap location for shared pages
>
>  fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
>  1 file changed, 30 insertions(+), 8 deletions(-)

  parent reply	other threads:[~2021-07-30 17:28 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-30 16:08 Tiberiu A Georgescu
2021-07-30 16:08 ` [PATCH 1/1] pagemap: report " Tiberiu A Georgescu
2021-07-30 17:28 ` Eric W. Biederman [this message]
2021-08-02 12:20   ` [PATCH 0/1] pagemap: " Tiberiu Georgescu
2021-08-04 18:33 ` Peter Xu
2021-08-04 18:49   ` David Hildenbrand
2021-08-04 19:17     ` Peter Xu
2021-08-11 16:15       ` David Hildenbrand
2021-08-11 16:17         ` David Hildenbrand
2021-08-11 18:25         ` Peter Xu
2021-08-11 18:41           ` David Hildenbrand
2021-08-11 19:54             ` Peter Xu
2021-08-11 20:13               ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87y29nbtji.fsf@disp2133 \
    --to=ebiederm@xmission.com \
    --cc=adobriyan@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=axboe@kernel.dk \
    --cc=carl.waldspurger@nutanix.com \
    --cc=catalin.marinas@arm.com \
    --cc=chinwen.chang@mediatek.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=david@redhat.com \
    --cc=florian.schmidt@nutanix.com \
    --cc=ivan.teterevkov@nutanix.com \
    --cc=jannh@google.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=songmuchun@bytedance.com \
    --cc=tiberiu.georgescu@nutanix.com \
    --cc=vincenzo.frascino@arm.com \
    --cc=viro@zeniv.linux.org.uk \
    --subject='Re: [PATCH 0/1] pagemap: swap location for shared pages' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).