LKML Archive on lore.kernel.org
From: Christoph Lameter <clameter@sgi.com>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: Avi Kivity <avi@qumranet.com>, Izik Eidus <izike@qumranet.com>,
	Andrew Morton <akpm@osdl.org>, Nick Piggin <npiggin@suse.de>,
	kvm-devel@lists.sourceforge.net,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	steiner@sgi.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	daniel.blueman@quadrics.com, holt@sgi.com,
	Hugh Dickins <hugh@veritas.com>
Subject: Re: [kvm-devel] [PATCH] export notifier #1
Date: Thu, 24 Jan 2008 12:01:57 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0801241141030.22285@schroedinger.engr.sgi.com>
In-Reply-To: <20080124143454.GN7141@v2.random>

On Thu, 24 Jan 2008, Andrea Arcangeli wrote:

> > SetPageExported is set when a remote instance of linux establishes a
> > reference to the page (a kind of remote page fault). In the KVM scenario
> > that would occur when memory is made available.
>
> The remote page fault is exactly the thing that has to wait on the
> PageExported bit to return on! So how can it be the thing that sets
> SetPageExported?

I do not remember us saying that the remote page fault has to wait on
PageExported.

> The idea is:
>
> NODE0 NODE1

SetPageLocked

> ->invalidate_page()
> ClearPageExported
> GFP_KERNEL (== GFP_ATOMIC in mm/rmap.c, won't ever do any I/O)
>
> ->invalidate_page() arrives and drop
> references

ClearPageLocked

> __free_page -> unpin so it can be freed
> go ahead after invalidate_page
>
> zero locking so previous invalidate_page could schedule (not wait for I/O,
> there won't be any I/O out of GFP_KERNEL inside PF_MEMALLOC i.e. mm/rmap.c!!!)

PageLocked is set and there could be synchronization among the callbacks.
For example, the mm_struct invalidate_page could set a flag to prevent new
references from being established. The callback after removal of the OS ptes
could reenable establishing new references.
>
> remote page fault
> tries to instantiate more references
> remote page fault arrives
> instantiate more references
> get_page() -> pin

lock_page

Waits until rmap is complete. Then rechecks if page is still part of the
mapping.

> SetPageExported
> remote page fault succeeded
>
> zero locking so invalidate_page can schedule (not wait for I/O,
> there won't be any I/O out of GFP_KERNEL!)

Ok, this is often a PF_MEMALLOC context. We already do disk I/O in that
context?

> I thought your solution was to have the remote page fault wait on
> PG_exported to return ON!! But now you tell me the remote page fault
> is the thing that has to SetPageExported, not the linux VM. So make up
> your mind about this PG_exported mess...

SetPageExported is mainly an on/off switch for the callbacks for a page.
It is not necessarily used for synchronization. PageExported should be
modified under the page lock.

> > You are saying that clearing the main linux ptes and leaving the remote
> > ptes in place will not allow access to the page via the remote ptes?
>
> No, I'm saying if you clear the main linux pte while there are still
> remote ptes in place (in turn the page_count has been boosted by 1
> with your current code), and you rely on mm/rmap.c for the
> ->invalidate_page, you will generate an unswappable-pin-leak.

The invalidate_page presumably would reduce the page count to zero after
clearing the remote ptes?

> The linux pte must be present and the page must be mapped in userland
> as long as there are remote references to the page and in turn as long
> as the page_count has been boosted by 1. Otherwise mm/rmap.c won't be
> called. page_mapped() must be true.

So we would need to increase mapcount instead of page_count?

> At the very least you should move your invalidate_page in
> mm/vmscan.c and have it called regardless if the page is mapped in
> userland or not.

That would not cover page migration and other uses. We also need the
invalidate_page for page_mkclean etc.
Needed for dirty page tracking.

> > Right. That is why the mmu_ops approach does not work and that is why we
> > need to sleep.
>
> You told me you worried about atomic allocations. Now you tell me you
> need to sleep after I just explained to you how utterly useless it is to
> sleep inside GFP_KERNEL allocations when invoked by try_to_unmap in
> the mm/rmap.c paths. You will never sleep in any memory allocation
> other than to call schedule() because need_resched is set. You will do
> zero I/O. All your allocations will come from the PF_MEMALLOC pool
> like I said above, not from swapping, not from the VM. The VM will
> obviously refuse to be invoked recursively.

That may be okay if we do not need to generate listheads to track all the
mm_structs in the rmap loops. If we loop on our own then we do not need to
construct this list and can directly communicate with the other partition.

> Also not sure why you call my patch mmops, when it's mmu_notifier instead.

Oh. Sorry. Will use the correct name in the future. I think I keyed off the
mm_ops structure.

> > > All kvm guest physical pages would need to be marked exported of
> > > course.
> >
> > Why export all pages? Then you are planning to have mm_struct
> > notifiers for all processes in the system?
>
> KVM is 1 process, not sure how you get to imagine I need to track
> process in the system, when in fact I only need to track pages
> belonging to the KVM process.

Ahh. A KVM is one process to the host but may have multiple processes
running in it, and you want the notifier for the one process in the host.

> It's utterly useless to call ->invalidate_page(page) on a page that is
> still mapped by some linux pte with the young bit set. You must defer
> the ->invalidate_page after all young bits are gone. This is what I
> do, in fact I do tons more than that by also honouring the accessed
> bits in all sptes.
> There's zero chance you can be remotely as
> efficient as my mmu-notifiers are, unless you also do "cat rmap.c >>
> /sgi/yoursubsystem/something.c" and you check the young bit in the
> linux ptes yourself _before_ deciding if you've to start dropping
> remote references or not.

I think we agreed on doing the callback after the OS rmaps have been
walked, right?

> > that point we do not have an mm_struct anymore so the callback would have
>
> The mm struct wasn't available in the place where you put
> invalidate_page either.

Right.
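
The ordering discussed in this message (pin the page, take the page lock, recheck that the page is still part of the mapping, and only then set the exported bit) can be modelled in userspace. What follows is a minimal sketch under stated assumptions, not kernel code: `struct model_page`, `model_page_init`, `remote_fault` and `invalidate_page_model` are hypothetical names, a C11 `atomic_flag` spinlock stands in for the page lock, and plain fields stand in for PG_exported, the mapping check and page_count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel layout. */
struct model_page {
    atomic_flag locked;     /* stands in for PG_locked */
    bool mapped;            /* is the page still part of the mapping? */
    bool exported;          /* stands in for PG_exported */
    int remote_refs;        /* remote references, guarded by the lock */
    atomic_int refcount;    /* stands in for page_count */
};

static void model_page_init(struct model_page *p)
{
    atomic_flag_clear(&p->locked);
    p->mapped = true;
    p->exported = false;
    p->remote_refs = 0;
    atomic_init(&p->refcount, 1);   /* one reference held by the mapping */
}

/* Spin until the flag is ours; models lock_page() without sleeping. */
static void model_lock_page(struct model_page *p)
{
    while (atomic_flag_test_and_set(&p->locked))
        ;
}

static void model_unlock_page(struct model_page *p)
{
    atomic_flag_clear(&p->locked);
}

/* Remote fault: pin, lock, recheck the mapping, then mark as exported. */
static bool remote_fault(struct model_page *p)
{
    bool ok;

    atomic_fetch_add(&p->refcount, 1);      /* get_page() -> pin */
    model_lock_page(p);                     /* waits for a running invalidate */
    ok = p->mapped;                         /* recheck under the lock */
    if (ok) {
        p->exported = true;                 /* SetPageExported */
        p->remote_refs++;
    }
    model_unlock_page(p);
    if (!ok)
        atomic_fetch_sub(&p->refcount, 1);  /* lost the race: drop the pin */
    return ok;
}

/* Invalidate: under the lock, unmap and drop every remote reference. */
static void invalidate_page_model(struct model_page *p)
{
    model_lock_page(p);                     /* SetPageLocked */
    p->mapped = false;                      /* OS ptes are gone */
    if (p->exported) {
        assert(atomic_load(&p->refcount) > p->remote_refs);
        p->exported = false;                /* ClearPageExported */
        atomic_fetch_sub(&p->refcount, p->remote_refs);
        p->remote_refs = 0;
    }
    model_unlock_page(p);                   /* ClearPageLocked */
}
```

In this model a fault that runs before the invalidate succeeds and leaves the count at 2; the invalidate brings it back to 1; a fault that runs afterwards fails the recheck and releases its own pin, so no reference is leaked. Whether the real callbacks may sleep while holding the page lock is the open question in the thread, and the model does not address it.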