LKML Archive on lore.kernel.org
From: Christoph Lameter <clameter@sgi.com>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: Avi Kivity <avi@qumranet.com>, Izik Eidus <izike@qumranet.com>,
	Andrew Morton <akpm@osdl.org>, Nick Piggin <npiggin@suse.de>,
	kvm-devel@lists.sourceforge.net,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	steiner@sgi.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, daniel.blueman@quadrics.com, holt@sgi.com,
	Hugh Dickins <hugh@veritas.com>
Subject: Re: [kvm-devel] [PATCH] export notifier #1
Date: Thu, 24 Jan 2008 12:01:57 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0801241141030.22285@schroedinger.engr.sgi.com>
In-Reply-To: <20080124143454.GN7141@v2.random>

On Thu, 24 Jan 2008, Andrea Arcangeli wrote:

> > SetPageExported is set when a remote instance of linux establishes a 
> > reference to the page (a kind of remote page fault). In the KVM scenario 
> > that would occur when memory is made available.
> 
> The remote page fault is exactly the thing that has to wait on the
> PageExported bit to return on! So how can it be the thing that sets
> SetPageExported?

I do not remember us saying that the remote page fault has to wait on PageExported.

> The idea is:
> 
>     NODE0			NODE1

SetPageLocked

>     ->invalidate_page()
>     ClearPageExported
>     GFP_KERNEL (== GFP_ATOMIC in mm/rmap.c, won't ever do any I/O)
> 
> 				->invalidate_page() arrives and drop
>                                   references
> 
ClearPageLocked

>     __free_page -> unpin so it can be freed
>     go ahead after invalidate_page
> 
>     zero locking so previous invalidate_page could schedule (not wait for I/O,
>     there won't be any I/O out of GFP_KERNEL inside PF_MEMALLOC i.e. mm/rmap.c!!!)

PageLocked is set, so there could be synchronization among the 
callbacks. For example, the mm_struct invalidate_page could set a flag to 
prevent new references from being established. The callback after removal 
of the OS ptes could then re-enable establishing new references.
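
A minimal user-space sketch of that two-callback idea, for illustration
only. It is not kernel code: struct export_state, invalidate_begin(),
invalidate_end() and try_establish_ref() are hypothetical names, a pthread
mutex stands in for whatever lock the real implementation would use, and
dropping the existing references in the first callback is an assumption
taken from the diagram quoted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-mm export state; all names here are illustrative. */
struct export_state {
	pthread_mutex_t lock;
	bool invalidating;	/* set by the first callback, cleared by the second */
	int remote_refs;	/* references currently held by the remote side */
};

static void export_state_init(struct export_state *es)
{
	pthread_mutex_init(&es->lock, NULL);
	es->invalidating = false;
	es->remote_refs = 0;
}

/* First callback: block new references and drop the existing ones. */
static void invalidate_begin(struct export_state *es)
{
	pthread_mutex_lock(&es->lock);
	es->invalidating = true;
	es->remote_refs = 0;
	pthread_mutex_unlock(&es->lock);
}

/* Second callback, run after the OS ptes are gone: allow references again. */
static void invalidate_end(struct export_state *es)
{
	pthread_mutex_lock(&es->lock);
	es->invalidating = false;
	pthread_mutex_unlock(&es->lock);
}

/* Remote side: fails while an invalidation is in flight. */
static bool try_establish_ref(struct export_state *es)
{
	bool ok;

	pthread_mutex_lock(&es->lock);
	ok = !es->invalidating;
	if (ok)
		es->remote_refs++;
	pthread_mutex_unlock(&es->lock);
	return ok;
}
```

A remote fault that sees try_establish_ref() fail would have to retry or
wait; how it does that is left open here.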

> 
> 				remote page fault
> 				tries to instantiate more references
>     remote page fault arrives
>     instantiate more references
>     get_page() -> pin

lock_page	Waits until rmap is complete. Then rechecks if page is 
		still part of the mapping.
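
The lock-then-recheck step could look roughly like the following
user-space model. This is a sketch under stated assumptions, not the
actual fault path: struct page_model, remote_fault() and reclaim_page()
are hypothetical, and a pthread mutex models lock_page()/unlock_page().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct page; not kernel code. */
struct page_model {
	pthread_mutex_t lock;	/* models PG_locked / lock_page() */
	const void *mapping;	/* models page->mapping; NULL once removed */
	int refcount;		/* models page_count() */
	bool exported;		/* models the proposed PG_exported bit */
};

static void page_model_init(struct page_model *pg, const void *mapping)
{
	pthread_mutex_init(&pg->lock, NULL);
	pg->mapping = mapping;
	pg->refcount = 1;
	pg->exported = false;
}

/* Remote fault: returns true only if a new reference was established. */
static bool remote_fault(struct page_model *pg, const void *expected_mapping)
{
	bool ok = false;

	pthread_mutex_lock(&pg->lock);	/* waits until the rmap side unlocks */
	if (expected_mapping != NULL && pg->mapping == expected_mapping) {
		pg->refcount++;		/* models get_page() */
		pg->exported = true;	/* models SetPageExported under the lock */
		ok = true;
	}
	pthread_mutex_unlock(&pg->lock);
	return ok;
}

/* Reclaim side: remove the page from its mapping while holding the lock. */
static void reclaim_page(struct page_model *pg)
{
	pthread_mutex_lock(&pg->lock);
	pg->exported = false;		/* models ClearPageExported */
	pg->mapping = NULL;
	pthread_mutex_unlock(&pg->lock);
}
```

The recheck is what keeps a fault that slept on the lock from pinning a
page that was removed from the mapping in the meantime.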

>     SetPageExported
> 				remote page fault succeeded
> 
>     zero locking so invalidate_page can schedule (not wait for I/O,
>     there won't be any I/O out of GFP_KERNEL!)

Ok, this is often a PF_MEMALLOC context. Do we already do disk I/O in that 
context?

 
> I thought your solution was to have the remote page fault wait on
> PG_exported to return ON!! But now you tell me the remote page fault
> is the thing that has to SetPageExported, not the linux VM. So make up
> your mind about this PG_exported mess...

SetPageExported is mainly an on/off switch for the callbacks on a page. 
It is not necessarily used for synchronization. PageExported should be 
modified under the page lock.
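
To illustrate the "switch" reading of the bit, here is a small user-space
sketch. Everything in it is hypothetical (struct page_model,
set_exported(), invalidate_if_exported(), the counting callback); it only
shows a flag gating a callback, with the flag changed while the page is
locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model;
typedef void (*invalidate_fn)(struct page_model *pg);

/* Hypothetical stand-in for the relevant page state; not kernel code. */
struct page_model {
	bool locked;		/* models PG_locked */
	bool exported;		/* models the proposed PG_exported bit */
	int callbacks_run;	/* counts callback invocations, for the example */
};

/* The bit may only be changed while the page is locked. */
static void set_exported(struct page_model *pg, bool on)
{
	assert(pg->locked);
	pg->exported = on;
}

/* Example callback: only records that it was called. */
static void count_invalidate(struct page_model *pg)
{
	pg->callbacks_run++;
}

/* Runs the callback only for exported pages; returns whether it ran. */
static bool invalidate_if_exported(struct page_model *pg, invalidate_fn fn)
{
	if (!pg->exported)
		return false;	/* switch is off: no callback for this page */
	fn(pg);
	return true;
}
```

Pages that were never exported then cost only the flag test in the rmap
paths.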


> > You are saying that clearing the main linux ptes and leaving the remote 
> > ptes in place will not allow access to the page via the remote ptes?
> 
> No, I'm saying if you clear the main linux pte while there are still
> remote ptes in place (in turn the page_count has been boosted by 1
> with your current code), and you rely on mm/rmap.c for the
> ->invalidate_page, you will generate a unswappable-pin-leak.

The invalidate_page presumably would reduce the page count to zero after 
clearing the remote ptes?
 
> The linux pte must be present and the page must be mapped in userland
> as long as there are remote references to the page and in turn as long
> as the page_count has been boosted by 1. Otherwise mm/rmap.c won't be
> called.

page_mapped() must be true. So we would need to increase mapcount instead 
of page_count?

> At the very least you should move your invalidate_page in
> mm/vmscan.c and have it called regardless if the page is mapped in
> userland or not.

That would not cover page migration and other uses. We also need
invalidate_page for page_mkclean etc., which is needed for dirty page
tracking.


> > Right. That is why the mmu_ops approach does not work and that is why we 
> > need to sleep.
> 
> You told me you worried about atomic allocations. Now you tell me you
> need to sleep after I just explained you how utterly useless is to
> sleep inside GFP_KERNEL allocations when invoked by try_to_unmap in
> the mm/rmap.c paths. You will never sleep in any memory allocation
> other than to call schedule() because need_resched is set. You will do
> zero I/O. all your allocations will come from the PF_MEMALLOC pool
> like I said above, not from swapping, not from the VM. The VM will
> obviously refuse to be invoked recursively.

That may be okay if we do not need to generate list heads to track all the 
mm_structs in the rmap loops. If we loop on our own then we do not need to 
construct this list and can communicate directly with the other partition.

> Also not sure why you call my patch mmops, when it's mmu_notifier instead.

Oh. Sorry. Will use the correct name in the future. I think I keyed off the 
mm_ops structure.

> > > All kvm guest physical pages would need to be marked exported of
> > > course.
> > 
> > Why export all pages? Then you are planning to have mm_struct 
> > notifiers for all processes in the system?
> 
> KVM is 1 process, not sure how you get to imagine I need to track
> process in the system, when infact I only need to track pages
> belonging to the KVM process.

Ahh. A KVM guest is one process to the host but may have multiple processes 
running in it, and you want the notifier for that one process in the host.

> It's utterly useless to call ->invalidate_page(page) on a page that is
> still mapped by some linux pte with the young bit set. You must defer
> the ->invalidate_page after all young bits are gone. This is what I
> do, infact I do tons more than that by also honouring the accessed
> bits in all sptes. There's zero chance you can do as remotely as
> efficient as my mmu-notifiers are, unless you also do "cat rmap.c >>
> /sgi/yoursubsystem/something.c" and you check the young bit in the
> linux ptes yourself _before_ deciding if you've to start dropping
> remote references or not.

I think we agreed on doing the callback after the OS rmaps have been 
walked, right?

> > that point we do not have an mm_struct anymore so the callback would have 
> 
> The mm struct wasn't available in the place where you put
> invalidate_page either.

Right.
