LKML Archive on lore.kernel.org
* [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
@ 2008-03-29 0:00 Jeremy Fitzhardinge
2008-03-29 0:47 ` Dave Hansen
2008-03-29 4:38 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-29 0:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter, Dave Hansen
Cc: Linux Kernel Mailing List
The Xen balloon driver needs to separate the process of hot-installing
memory into two phases: one to allocate the page structures and
configure the zones, and another to actually online the pages of newly
installed memory.
This patch splits up the innards of online_pages() into two pieces which
correspond to these two phases. The behaviour of online_pages() itself
is unchanged.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
---
include/linux/memory_hotplug.h | 3 +
mm/memory_hotplug.c | 66 ++++++++++++++++++++++++++++++++--------
2 files changed, 57 insertions(+), 12 deletions(-)
===================================================================
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -57,7 +57,10 @@
/* need some defines for these for archs that don't support it */
extern void online_page(struct page *page);
/* VM interface that may be used by firmware interface */
+extern int prepare_online_pages(unsigned long pfn, unsigned long nr_pages);
+extern unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long, unsigned long);
+
extern void __offline_isolated_pages(unsigned long, unsigned long);
extern int offline_pages(unsigned long, unsigned long, unsigned long);
===================================================================
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -180,31 +180,35 @@
return 0;
}
-
-int online_pages(unsigned long pfn, unsigned long nr_pages)
+/* Tell anyone who's interested that we're onlining some memory */
+static int notify_going_online(unsigned long pfn, unsigned long nr_pages)
{
- unsigned long flags;
- unsigned long onlined_pages = 0;
- struct zone *zone;
- int need_zonelists_rebuild = 0;
+ struct memory_notify arg;
int nid;
int ret;
- struct memory_notify arg;
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
arg.status_change_nid = -1;
-
+
nid = page_to_nid(pfn_to_page(pfn));
if (node_present_pages(nid) == 0)
arg.status_change_nid = nid;
ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
- if (ret) {
+ if (ret)
memory_notify(MEM_CANCEL_ONLINE, &arg);
- return ret;
- }
+
+ return ret;
+}
+
+/* Grow the zone to fit the expected amount of memory being added */
+static struct zone *online_pages_zone(unsigned long pfn, unsigned long nr_pages)
+{
+ struct zone *zone;
+ unsigned long flags;
+
/*
* This doesn't need a lock to do pfn_to_page().
* The section can't be removed here because of the
@@ -215,6 +219,16 @@
grow_zone_span(zone, pfn, pfn + nr_pages);
grow_pgdat_span(zone->zone_pgdat, pfn, pfn + nr_pages);
pgdat_resize_unlock(zone->zone_pgdat, &flags);
+
+ return zone;
+}
+
+/* Mark a set of pages as online */
+unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages)
+{
+ struct zone *zone = page_zone(pfn_to_page(pfn));
+ unsigned long onlined_pages = 0;
+ int need_zonelists_rebuild = 0;
/*
* If this zone is not populated, then it is not in zonelist.
@@ -240,10 +254,38 @@
vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
- if (onlined_pages)
+ if (onlined_pages) {
+ struct memory_notify arg;
+
+ arg.start_pfn = pfn; /* ? */
+ arg.nr_pages = onlined_pages;
+ arg.status_change_nid = -1; /* ? */
+
memory_notify(MEM_ONLINE, &arg);
+ }
+ return onlined_pages;
+}
+
+int prepare_online_pages(unsigned long pfn, unsigned long nr_pages)
+{
+ int ret = notify_going_online(pfn, nr_pages);
+ if (ret)
+ return ret;
+
+ online_pages_zone(pfn, nr_pages);
return 0;
+}
+
+int online_pages(unsigned long pfn, unsigned long nr_pages)
+{
+ int ret;
+
+ ret = prepare_online_pages(pfn, nr_pages);
+ if (ret == 0)
+ mark_pages_onlined(pfn, nr_pages);
+
+ return ret;
}
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 0:00 [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining Jeremy Fitzhardinge
@ 2008-03-29 0:47 ` Dave Hansen
2008-03-29 2:08 ` Jeremy Fitzhardinge
2008-03-29 4:38 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2008-03-29 0:47 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List
On Fri, 2008-03-28 at 17:00 -0700, Jeremy Fitzhardinge wrote:
> The Xen balloon driver needs to separate the process of hot-installing
> memory into two phases: one to allocate the page structures and
> configure the zones, and another to actually online the pages of newly
> installed memory.
>
> This patch splits up the innards of online_pages() into two pieces which
> correspond to these two phases. The behaviour of online_pages() itself
> is unchanged.
...
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -180,31 +180,35 @@
> return 0;
> }
>
> -
> -int online_pages(unsigned long pfn, unsigned long nr_pages)
> +/* Tell anyone who's interested that we're onlining some memory */
> +static int notify_going_online(unsigned long pfn, unsigned long nr_pages)
> {
> - unsigned long flags;
> - unsigned long onlined_pages = 0;
> - struct zone *zone;
> - int need_zonelists_rebuild = 0;
> + struct memory_notify arg;
> int nid;
> int ret;
> - struct memory_notify arg;
>
> arg.start_pfn = pfn;
> arg.nr_pages = nr_pages;
> arg.status_change_nid = -1;
> -
> +
> nid = page_to_nid(pfn_to_page(pfn));
> if (node_present_pages(nid) == 0)
> arg.status_change_nid = nid;
That's kind of a weird line in the patch. How'd that get there? Why are
you moving 'arg'?
This looks OK, but it does add ~45 lines of code, and I'm not immediately
sure how you're going to use it. Could you address that a bit?
I do kinda wish you'd take a real hard look at what the new functions
are doing now. I'm not sure their current names are very good.
> +/* Grow the zone to fit the expected amount of memory being added */
> +static struct zone *online_pages_zone(unsigned long pfn, unsigned long nr_pages)
The comment is good, but the function name is not. :) How about
grow_zone_span() or something?
> +/* Mark a set of pages as online */
> +unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages)
Isn't the comment on this one a bit redundant? :)
This looks to me to have become the real online_pages() now. This
function is what goes and individually onlines pages. If someone was
trying to figure out whether to call online_pages() or
mark_pages_onlined(), which one would they know to call?
> - if (onlined_pages)
> + if (onlined_pages) {
> + struct memory_notify arg;
> +
> + arg.start_pfn = pfn; /* ? */
> + arg.nr_pages = onlined_pages;
> + arg.status_change_nid = -1; /* ? */
> +
> memory_notify(MEM_ONLINE, &arg);
> + }
We should really wrap up memory notify:
static int memory_notify(int state, unsigned long start_pfn,
                         unsigned long nr_pages, int status_change_nid)
{
        struct memory_notify arg;

        arg.start_pfn = start_pfn;
        arg.nr_pages = nr_pages;
        arg.status_change_nid = status_change_nid;

        return the_current_memory_notify(state, &arg);
}
We can use that in a couple of spots, right?
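For illustration, a rough sketch of those spots, assuming the wrapper above
(the four-argument memory_notify() is only a proposal, not an existing
interface). The two notify sites in the patch would then look roughly like:

        /* in notify_going_online(), sketch only */
        nid = page_to_nid(pfn_to_page(pfn));
        status_nid = (node_present_pages(nid) == 0) ? nid : -1;

        ret = memory_notify(MEM_GOING_ONLINE, pfn, nr_pages, status_nid);
        ret = notifier_to_errno(ret);
        if (ret)
                memory_notify(MEM_CANCEL_ONLINE, pfn, nr_pages, status_nid);

        /* in mark_pages_onlined(), sketch only */
        if (onlined_pages)
                memory_notify(MEM_ONLINE, pfn, onlined_pages, -1);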
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 0:47 ` Dave Hansen
@ 2008-03-29 2:08 ` Jeremy Fitzhardinge
2008-03-29 6:01 ` Dave Hansen
2008-03-29 16:06 ` Dave Hansen
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-29 2:08 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List
Dave Hansen wrote:
> On Fri, 2008-03-28 at 17:00 -0700, Jeremy Fitzhardinge wrote:
>
>> The Xen balloon driver needs to separate the process of hot-installing
>> memory into two phases: one to allocate the page structures and
>> configure the zones, and another to actually online the pages of newly
>> installed memory.
>>
>> This patch splits up the innards of online_pages() into two pieces which
>> correspond to these two phases. The behaviour of online_pages() itself
>> is unchanged.
>>
> ...
>
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -180,31 +180,35 @@
>> return 0;
>> }
>>
>> -
>> -int online_pages(unsigned long pfn, unsigned long nr_pages)
>> +/* Tell anyone who's interested that we're onlining some memory */
>> +static int notify_going_online(unsigned long pfn, unsigned long nr_pages)
>> {
>> - unsigned long flags;
>> - unsigned long onlined_pages = 0;
>> - struct zone *zone;
>> - int need_zonelists_rebuild = 0;
>> + struct memory_notify arg;
>> int nid;
>> int ret;
>> - struct memory_notify arg;
>>
>> arg.start_pfn = pfn;
>> arg.nr_pages = nr_pages;
>> arg.status_change_nid = -1;
>> -
>> +
>> nid = page_to_nid(pfn_to_page(pfn));
>> if (node_present_pages(nid) == 0)
>> arg.status_change_nid = nid;
>>
>
> That's kind a weird line in the patch. How'd that get there? Why are
> you moving 'arg'?
>
arg is the notifier arg. This function is just wrapping up the
GOING_ONLINE notifier. The code is just a copy, so it isn't doing
anything it wasn't doing before.
The original code also recycles arg for the ONLINE notifier, but that
seemed unnecessary.
> This look OK, but it does add ~45 lines of code, and I'm not immediately
> sure how you're going to use it. Could you address that a bit?
>
Sure. When the balloon driver wants to increase the domain's size, but
it finds it has run out of page structures to grow into, it hotplug-adds
some memory. This code uses the add_memory_resource() function I posted
the patch for yesterday. (Error-checking removed for brevity.)
static void balloon_expand(unsigned pages)
{
        struct resource *res;
        int ret;
        u64 size = (u64)pages * PAGE_SIZE;
        unsigned pfn;
        unsigned start_pfn, end_pfn;

        res = kzalloc(sizeof(*res), GFP_KERNEL);
        res->name = "Xen Balloon";
        res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;

        ret = allocate_resource(&iomem_resource, res, size, 0, -1,
                                1ul << SECTION_SIZE_BITS, NULL, NULL);

        start_pfn = res->start >> PAGE_SHIFT;
        end_pfn = (res->end + 1) >> PAGE_SHIFT;

        ret = add_memory_resource(0, res);
        ret = prepare_online_pages(start_pfn, pages);

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                SetPageReserved(page);
                set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
                balloon_append(page);   /* add to a list of balloon pages */
        }
}
So this just gives us some page structures, but there's no underlying
memory yet. Later, the balloon driver starts populating the pages with
real memory behind them:
        for (i = 0; i < nr_pages; i++) {
                page = balloon_retrieve();
                pfn = page_to_pfn(page);

                /* frame_list is set of real memory pages */
                set_phys_to_machine(pfn, frame_list[i]);

                /* Relinquish the page back to the allocator. */
                mark_pages_onlined(pfn, 1);

                /* Link back into the page tables if not highmem. */
                if (!PageHighMem(page)) {
                        int ret;
                        ret = HYPERVISOR_update_va_mapping(
                                (unsigned long)__va(pfn << PAGE_SHIFT),
                                mfn_pte(frame_list[i], PAGE_KERNEL),
                                0);
                }
        }
> I do kinda wish you'd take a real hard look at what the new functions
> are doing now. I'm not sure their current names are very good.
>
You're right. This is very much a first pass.
>> +/* Grow the zone to fit the expected amount of memory being added */
>> +static struct zone *online_pages_zone(unsigned long pfn, unsigned long nr_pages)
>>
>
> The comment is good, but the function name is not. :) How about
> grow_zone_span() or something?
>
Sure. I'm not really sure what the bookkeeping it's doing really means,
though.
>> +/* Mark a set of pages as online */
>> +unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages)
>>
>
> Isn't the comment on this one a bit redundant? :)
>
> This looks to me to have become the real online_pages() now. This
> function is what goes and individually onlines pages. If someone was
> trying to figure out whether to call online_pages() or
> mark_pages_onlined(), which one would they know to call?
>
Yep. My goal in this was to extract the behaviours I need without
affecting any other users. online_pages() is definitely a trivial
helper function now, and it could be removed without causing much damage
to its couple of callers.
>> - if (onlined_pages)
>> + if (onlined_pages) {
>> + struct memory_notify arg;
>> +
>> + arg.start_pfn = pfn; /* ? */
>> + arg.nr_pages = onlined_pages;
>> + arg.status_change_nid = -1; /* ? */
>> +
>> memory_notify(MEM_ONLINE, &arg);
>> + }
>>
>
> We should really wrap up memory notify:
>
> static void memory_notify(int state, unsigned long start_pfn,
> unsigned long nr_pages, int status_change_nid)
> {
> struct memory_notify arg;
> arg.start_pfn = start_pfn;
> arg.nr_pages = nr_pages;
> arg.status_change_nid = status_change_nid;
> return the_current_memory_notify(state, &arg);
> }
>
> We can use that in a couple of spots, right?
>
Perhaps. Or we could get rid of it altogether. There's only a single
user of the notifier (mm/slub.c), and given that it doesn't even use the
MEM_ONLINE notifier, we could just drop this part. Seems like it would
be simpler to just have the hotplug code call directly into slub if
that's what it needs...
My big remaining problem is how to disable the sysfs interface for this
memory. I need to prevent any onlining via /sys/device/system/memory.
Looks like I need to modify register_new_memory() to pass a "don't
change state" flag, and stash it in struct memory_block so that
store_mem_state() knows to ignore any state changes. But it's not clear
how I can get that information down into __add_section()... I guess I
just need to propagate it down from add_memory().
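As a sketch of that idea (purely illustrative -- the "no_user_online" field
is made up, and the exact store_mem_state() plumbing may differ), it would
amount to something like:

        /* drivers/base/memory.c, illustration only */
        static ssize_t store_mem_state(struct sys_device *dev,
                                       const char *buf, size_t count)
        {
                struct memory_block *mem =
                        container_of(dev, struct memory_block, sysdev);

                if (mem->no_user_online)
                        return -EPERM;  /* onlining driven by the balloon driver */

                /* ... existing online/offline handling ... */
        }

with register_new_memory() growing a parameter that sets mem->no_user_online
when the section is added by the balloon driver.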
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 0:00 [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining Jeremy Fitzhardinge
2008-03-29 0:47 ` Dave Hansen
@ 2008-03-29 4:38 ` KAMEZAWA Hiroyuki
2008-03-29 5:48 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-29 4:38 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Yasunori Goto, Christoph Lameter, Dave Hansen, Linux Kernel Mailing List
Hi,
On Fri, 28 Mar 2008 17:00:05 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> - if (onlined_pages)
> + if (onlined_pages) {
> + struct memory_notify arg;
> +
> + arg.start_pfn = pfn; /* ? */
> + arg.nr_pages = onlined_pages;
> + arg.status_change_nid = -1; /* ? */
> +
> memory_notify(MEM_ONLINE, &arg);
> + }
I think you should add an "onlined" member instead of reusing nr_pages.
But, in general, I have no objection to this way.
Thanks,
-Kame
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 4:38 ` KAMEZAWA Hiroyuki
@ 2008-03-29 5:48 ` Jeremy Fitzhardinge
2008-03-29 6:26 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-29 5:48 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Yasunori Goto, Christoph Lameter, Dave Hansen, Linux Kernel Mailing List
KAMEZAWA Hiroyuki wrote:
> Hi,
>
> On Fri, 28 Mar 2008 17:00:05 -0700
> Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>
>> - if (onlined_pages)
>> + if (onlined_pages) {
>> + struct memory_notify arg;
>> +
>> + arg.start_pfn = pfn; /* ? */
>> + arg.nr_pages = onlined_pages;
>> + arg.status_change_nid = -1; /* ? */
>> +
>> memory_notify(MEM_ONLINE, &arg);
>> + }
>>
> I think you should add "onlined" member instead of reusing nr_pages.
>
I suppose. What would I put into nr_pages? And anyway, there are no
users for this notification...
> But, in general, I have no objection to this way.
>
The refactoring in general?
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 2:08 ` Jeremy Fitzhardinge
@ 2008-03-29 6:01 ` Dave Hansen
2008-03-29 16:06 ` Dave Hansen
1 sibling, 0 replies; 24+ messages in thread
From: Dave Hansen @ 2008-03-29 6:01 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List
On Fri, 2008-03-28 at 19:08 -0700, Jeremy Fitzhardinge wrote:
> Perhaps. Or we could get rid of it altogether. There's only a single
> user of the notifier (mm/slub.c), and given that it doesn't even use the
> MEM_ONLINE notifier, we could just drop this part.
There is at least one other user that I know of, which is the ehea
driver. They're running through patches now to use it.
Anyway, we need the notifier. We're only going to get more and more
drivers that need notification. There may be one user now, but that's
no reason to rip it out. Feel free to revisit it in a year if there's
still only one user. :)
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 5:48 ` Jeremy Fitzhardinge
@ 2008-03-29 6:26 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-29 6:26 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Yasunori Goto, Christoph Lameter, Dave Hansen, Linux Kernel Mailing List
On Fri, 28 Mar 2008 22:48:21 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> >> - if (onlined_pages)
> >> + if (onlined_pages) {
> >> + struct memory_notify arg;
> >> +
> >> + arg.start_pfn = pfn; /* ? */
> >> + arg.nr_pages = onlined_pages;
> >> + arg.status_change_nid = -1; /* ? */
> >> +
> >> memory_notify(MEM_ONLINE, &arg);
> >> + }
> >>
> > I think you should add "onlined" member instead of reusing nr_pages.
> >
>
> I suppose. What would I put into nr_pages? And anyway, there are no
> users for this notification...
>
My point is "Notifier" is expexted to work correctly and include precise
information regardless of users.
> > But, in general, I have no objection to this way.
> >
>
> The refactoring in general?
>
Separating online_pages() into some meaningful blocks. Then, you can
reuse some parts and avoid duplication.
Thanks,
-Kame
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 2:08 ` Jeremy Fitzhardinge
2008-03-29 6:01 ` Dave Hansen
@ 2008-03-29 16:06 ` Dave Hansen
2008-03-29 23:53 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2008-03-29 16:06 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List
On Fri, 2008-03-28 at 19:08 -0700, Jeremy Fitzhardinge wrote:
> My big remaining problem is how to disable the sysfs interface for this
> memory. I need to prevent any onlining via /sys/device/system/memory.
I've been thinking about this some more, and I wish that you wouldn't
just throw this interface away or completely disable it. It actually
does *exactly* what you want in a way. :)
When the /memoryXX/ directory appears, that means that the hardware has
found the memory, and that the 'struct page' is allocated and ready to
be initialized.
When the OS actually wants to use the memory (initialize the 'struct
page', and free_page() it), it does the 'echo online > /sys...'. Both
the 'struct page' and the memory represented by it are untouched until
the "online". This was originally in place to avoid fragmenting it
immediately in the case that the system did not need it.
To me, it sounds like the only different thing that you want is to make
sure that only partial sections are onlined. So, shall we work with the
existing interfaces to online partial sections, or will we just disable
it entirely when we see Xen?
For Xen and KVM, how does it get decided that the guest needs more
memory? Is this guest or host driven? Both? How is the guest
notified? Is guest userspace involved at all?
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 16:06 ` Dave Hansen
@ 2008-03-29 23:53 ` Jeremy Fitzhardinge
2008-03-30 0:26 ` Anthony Liguori
2008-03-31 16:42 ` Dave Hansen
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-29 23:53 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
Dave Hansen wrote:
> On Fri, 2008-03-28 at 19:08 -0700, Jeremy Fitzhardinge wrote:
>
>> My big remaining problem is how to disable the sysfs interface for this
>> memory. I need to prevent any onlining via /sys/device/system/memory.
>>
>
> I've been thinking about this some more, and I wish that you wouldn't
> just throw this interface away or completely disable it.
I had no intention of globally disabling it. I just need to disable it
for my use case.
> It actually
> does *exactly* what you want in a way. :)
>
> When the /memoryXX/ directory appears, that means that the hardware has
> found the memory, and that the 'struct page' is allocated and ready to
> be initialized.
>
> When the OS actually wants to use the memory (initialize the 'struct
> page', and free_page() it), it does the 'echo online > /sys...'. Both
> the 'struct page' and the memory represented by it are untouched until
> the "online". This was originally in place to avoid fragmenting it
> immediately in the case that the system did not need it.
>
> To me, it sounds like the only different thing that you want is to make
> sure that only partial sections are onlined. So, shall we work with the
> existing interfaces to online partial sections, or will we just disable
> it entirely when we see Xen?
>
Well, yes and no.
For the current balloon driver, it doesn't make much sense. It would
add a fair amount of complexity without any real gain. It's currently
based around alloc_page/free_page. When it wants to shrink the domain
and give memory back to the host, it allocates pages, adds the page
structures to a ballooned pages list, and strips off the backing memory
and gives it to the host. Growing the domain is the converse: it gets
pages from the host, pulls page structures off the list, binds them
together and frees them back to the kernel. If it runs out of ballooned
page structures, it hotplugs in some memory to add more.
That said, if (partial-)sections were much smaller - say 2-4 meg - and
page migration/defrag worked reliably, then we could probably do without
the balloon driver and do it all in terms of memory hot plug/unplug.
That would give us a general mechanism which could either be driven from
userspace, and/or have in-kernel Xen/kvm/s390/etc policy modules. Aside
from small sections, the only additional requirement would be an online
hook which can actually attach backing memory to the pages being
onlined, rather than just assuming an underlying DIMM as current code does.
> For Xen and KVM, how does it get decided that the guest needs more
> memory? Is this guest or host driven? Both? How is the guest
> notified? Is guest userspace involved at all?
In Xen, either the host or the guest can set the target size for the
domain, which is capped by the host-set limit. Aside from possibly
setting the target size, there's no usermode involvement in managing
ballooning. The virtio balloon driver is similar, though from a quick
look it seems to be entirely driven by the host side.
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 23:53 ` Jeremy Fitzhardinge
@ 2008-03-30 0:26 ` Anthony Liguori
2008-03-31 16:42 ` Dave Hansen
1 sibling, 0 replies; 24+ messages in thread
From: Anthony Liguori @ 2008-03-30 0:26 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Dave Hansen, KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Mel Gorman
Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
>>
>> To me, it sounds like the only different thing that you want is to make
>> sure that only partial sections are onlined. So, shall we work with the
>> existing interfaces to online partial sections, or will we just disable
>> it entirely when we see Xen?
>>
>
> Well, yes and no.
>
> For the current balloon driver, it doesn't make much sense. It would
> add a fair amount of complexity without any real gain. It's currently
> based around alloc_page/free_page. When it wants to shrink the domain
> and give memory back to the host, it allocates pages, adds the page
> structures to a ballooned pages list, and strips off the backing
> memory and gives it to the host. Growing the domain is the converse:
> it gets pages from the host, pulls page structures off the list, binds
> them together and frees them back to the kernel. If it runs out of
> ballooned page structures, it hotplugs in some memory to add more.
>
> That said, if (partial-)sections were much smaller - say 2-4 meg - and
> page migration/defrag worked reliably, then we could probably do
> without the balloon driver and do it all in terms of memory hot
> plug/unplug. That would give us a general mechanism which could
> either be driven from userspace, and/or have in-kernel
> Xen/kvm/s390/etc policy modules. Aside from small sections, the only
> additional requirement would be an online hook which can actually
> attach backing memory to the pages being onlined, rather than just
> assuming an underlying DIMM as current code does.
Ballooning on KVM (and s390) is very much a different beast from Xen.
With Xen, ballooning is very similar to hotplug in that you're adding
and removing physical memory from the guest. The use of alloc_page() to
implement it instead of hotplug is for the reasons Jeremy's outlined
above. Logically though, it's hotplug.
For KVM and s390, ballooning is really a primitive form of guest page
hinting. The host asks the guest to allocate some memory and the guest
allocates what it can, and then tells the host which pages they were.
It's basically saying the pages are Unused and then the host may move
those pages from Up=>Uz which reduces the resident size of the guest.
The virtual size stays the same though. We can enforce limits on the
resident size of the guest via the new cgroup memory controller.
The guest is free to reclaim those pages at any time it wants without
informing the host. In fact, we plan to utilize this by implementing a
shrinker and OOM handler in the virtio balloon driver.
Hotplug is still useful for us as it's more efficient to hot-add 1GB of
memory instead of starting out with an extra 1GB and ballooning down.
We wouldn't want to hotplug away every page we balloon though, as we want
to be able to reclaim them if necessary without the host's intervention
(like on an OOM condition).
>> For Xen and KVM, how does it get decided that the guest needs more
>> memory? Is this guest or host driven? Both? How is the guest
>> notified? Is guest userspace involved at all?
>
> In Xen, either the host or the guest can set the target size for the
> domain, which is capped by the host-set limit. Aside from possibly
> setting the target size, there's no usermode involvement in managing
> ballooning. The virtio balloon driver is similar, though from a quick
> look it seems to be entirely driven by the host side.
The host support for KVM ballooning is entirely in userspace, but that's
orthogonal to the discussion at hand really.
Regards,
Anthony Liguori
> J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-29 23:53 ` Jeremy Fitzhardinge
2008-03-30 0:26 ` Anthony Liguori
@ 2008-03-31 16:42 ` Dave Hansen
2008-03-31 18:06 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2008-03-31 16:42 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Sat, 2008-03-29 at 16:53 -0700, Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
> > On Fri, 2008-03-28 at 19:08 -0700, Jeremy Fitzhardinge wrote:
> >
> >> My big remaining problem is how to disable the sysfs interface for this
> >> memory. I need to prevent any onlining via /sys/device/system/memory.
> >>
> >
> > I've been thinking about this some more, and I wish that you wouldn't
> > just throw this interface away or completely disable it.
>
> I had no intention of globally disabling it. I just need to disable it
> for my use case.
Right, but by disabling it for your case, you have given up all of the
testing that others have done on it. Let's try and see if we can get
the interface to work for you.
> > To me, it sounds like the only different thing that you want is to make
> > sure that only partial sections are onlined. So, shall we work with the
> > existing interfaces to online partial sections, or will we just disable
> > it entirely when we see Xen?
> >
>
> Well, yes and no.
>
> For the current balloon driver, it doesn't make much sense. It would
> add a fair amount of complexity without any real gain. It's currently
> based around alloc_page/free_page. When it wants to shrink the domain
> and give memory back to the host, it allocates pages, adds the page
> structures to a ballooned pages list, and strips off the backing memory
> and gives it to the host. Growing the domain is the converse: it gets
> pages from the host, pulls page structures off the list, binds them
> together and frees them back to the kernel. If it runs out of ballooned
> page structures, it hotplugs in some memory to add more.
How does this deal with things like present_pages in the zones? Does
the total ram just grow with each hot-add, or does it grow on a per-page
basis from the ballooning?
> That said, if (partial-)sections were much smaller - say 2-4 meg - and
> page migration/defrag worked reliably, then we could probably do without
> the balloon driver and do it all in terms of memory hot plug/unplug.
> That would give us a general mechanism which could either be driven from
> userspace, and/or have in-kernel Xen/kvm/s390/etc policy modules. Aside
> from small sections, the only additional requirement would be an online
> hook which can actually attach backing memory to the pages being
> onlined, rather than just assuming an underlying DIMM as current code does.
Even with 1MB sections and a flat sparsemem map, you're only looking at
~500k of overhead for the sparsemem storage. Less if you use vmemmap.
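The napkin math behind that figure, assuming a 32-bit PAE physical address
space of 64GB and roughly 8 bytes per flat-map struct mem_section entry
(those assumptions are mine, for illustration):

        64GB / 1MB per section   =  65536 sections
        65536 sections * 8 bytes ~= 512KB of sparsemem storage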
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-31 16:42 ` Dave Hansen
@ 2008-03-31 18:06 ` Jeremy Fitzhardinge
2008-04-01 7:17 ` Yasunori Goto
2008-04-02 18:46 ` Dave Hansen
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-03-31 18:06 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
Dave Hansen wrote:
> On Sat, 2008-03-29 at 16:53 -0700, Jeremy Fitzhardinge wrote:
>
>> Dave Hansen wrote:
>>
>>> On Fri, 2008-03-28 at 19:08 -0700, Jeremy Fitzhardinge wrote:
>>>
>>>
>>>> My big remaining problem is how to disable the sysfs interface for this
>>>> memory. I need to prevent any onlining via /sys/device/system/memory.
>>>>
>>>>
>>> I've been thinking about this some more, and I wish that you wouldn't
>>> just throw this interface away or completely disable it.
>>>
>> I had no intention of globally disabling it. I just need to disable it
>> for my use case.
>>
>
> Right, but by disabling it for your case, you have given up all of the
> testing that others have done on it. Let's try and see if we can get
> the interface to work for you.
>
I suppose, but I'm not sure I see the point. What are the benefits of
using this interface? You mentioned that the interface exists so that
it's possible to defer using a newly added piece of memory to avoid
fragmentation. I suppose I can see the point of that.
But in the xen-balloon case, the memory is added on-demand precisely
when it's about to be used, and then onlined in pieces as needed.
Extending the usermode interface to allow partial onlining/offlining
doesn't seem very useful for the case of physical hotplug memory, and
it's not at all clear how to do it in a useful way for the xen-balloon
case. Particularly for offlining, since you'd need to guarantee that
any page chosen for offlining isn't currently in use.
>>> To me, it sounds like the only different thing that you want is to make
>>> sure that only partial sections are onlined. So, shall we work with the
>>> existing interfaces to online partial sections, or will we just disable
>>> it entirely when we see Xen?
>>>
>>>
>> Well, yes and no.
>>
>> For the current balloon driver, it doesn't make much sense. It would
>> add a fair amount of complexity without any real gain. It's currently
>> based around alloc_page/free_page. When it wants to shrink the domain
>> and give memory back to the host, it allocates pages, adds the page
>> structures to a ballooned pages list, and strips off the backing memory
>> and gives it to the host. Growing the domain is the converse: it gets
>> pages from the host, pulls page structures off the list, binds them
>> together and frees them back to the kernel. If it runs out of ballooned
>> page structures, it hotplugs in some memory to add more.
>>
>
> How does this deal with things like present_pages in the zones? Does
> the total ram just grow with each hot-add, or does it grow on a per-page
> basis from the ballooning?
>
Well, there are two ways of looking at it:

 - either hot-plugging memory immediately adds pages, but they're also
   all immediately allocated and therefore unavailable for general use, or

 - the pages are notionally physically added as they're populated by
   the host.
In principle they're equivalent, but I could imagine the former has the
potential to make the VM waste time scanning unfreeable pages.
I'm not sure the patches I've posted are doing this stuff correctly
either way.
>> That said, if (partial-)sections were much smaller - say 2-4 meg - and
>> page migration/defrag worked reliably, then we could probably do without
>> the balloon driver and do it all in terms of memory hot plug/unplug.
>> That would give us a general mechanism which could either be driven from
>> userspace, and/or have in-kernel Xen/kvm/s390/etc policy modules. Aside
>> from small sections, the only additional requirement would be an online
>> hook which can actually attach backing memory to the pages being
>> onlined, rather than just assuming an underlying DIMM as current code does.
>>
>
> Even with 1MB sections
1MB is too small. It shouldn't be smaller than the size of a large page.
> and a flat sparsemem map, you're only looking at
> ~500k of overhead for the sparsemem storage. Less if you use vmemmap.
>
At the moment my concern is 32-bit x86, which doesn't support vmemmap or
sections smaller than 512MB because of the shortage of page flags bits.
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-31 18:06 ` Jeremy Fitzhardinge
@ 2008-04-01 7:17 ` Yasunori Goto
2008-04-02 18:46 ` Dave Hansen
1 sibling, 0 replies; 24+ messages in thread
From: Yasunori Goto @ 2008-04-01 7:17 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Dave Hansen, KAMEZAWA Hiroyuki, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
> Dave Hansen wrote:
> > On Sat, 2008-03-29 at 16:53 -0700, Jeremy Fitzhardinge wrote:
> >
> >> Dave Hansen wrote:
> >>
> >>> On Fri, 2008-03-28 at 19:08 -0700, Jeremy Fitzhardinge wrote:
> >>>
> >>>
> >>>> My big remaining problem is how to disable the sysfs interface for this
> >>>> memory. I need to prevent any onlining via /sys/device/system/memory.
> >>>>
> >>>>
> >>> I've been thinking about this some more, and I wish that you wouldn't
> >>> just throw this interface away or completely disable it.
> >>>
> >> I had no intention of globally disabling it. I just need to disable it
> >> for my use case.
> >>
> >
> > Right, but by disabling it for your case, you have given up all of the
> > testing that others have done on it. Let's try and see if we can get
> > the interface to work for you.
> >
>
> I suppose, but I'm not sure I see the point. What are the benefits of
> using this interface? You mentioned that the interface exists so that
> its possible to defer using a newly added piece of memory to avoid
> fragmentation. I suppose I can see the point of that
Not only to avoid fragmentation, but also to notify user level so it can
prepare for the memory-add event.
When memory is added, there is a notification via udev for each memory
device.
In our box, one node which includes some DIMMs and CPUs can be added by
hot-add, and there is another notification for the node from ACPI's
container device.
After the user-level preparation checks, the user (or a shell script) can
online the memory.
IIRC, some user-level applications would require this notification,
e.g. a resource manager over physical/logical partitioning.
>
> But in the xen-balloon case, the memory is added on-demand precisely
> when its about to be used, and then onlined in pieces as needed.
> Extending the usermode interface to allow partial onlining/offlining
> doesn't seem very useful for the case of physical hotplug memory, and
> its not at all clear how to do it in a useful way for the xen-balloon
> case. Particularly for offlining, since you'd need to guarantee that
> any page chosen for offlining isn't currently in use.
>
Basically, I hope there is as little change as possible in the user-level
interface between physical hotplug and Xen.
So, I would like to understand why memory is added "on-demand" on Xen.
I thought the hypervisor gathers a section's worth of memory and moves all
of it from one guest to another at a time. The gathering may take a long
time, but moving memory page by page may cause fragmentation, if my
understanding is correct....
> >>> To me, it sounds like the only different thing that you want is to make
> >>> sure that only partial sections are onlined. So, shall we work with the
> >>> existing interfaces to online partial sections, or will we just disable
> >>> it entirely when we see Xen?
> >>>
> >>>
> >> Well, yes and no.
> >>
> >> For the current balloon driver, it doesn't make much sense. It would
> >> add a fair amount of complexity without any real gain. It's currently
> >> based around alloc_page/free_page. When it wants to shrink the domain
> >> and give memory back to the host, it allocates pages, adds the page
> >> structures to a ballooned pages list, and strips off the backing memory
> >> and gives it to the host. Growing the domain is the converse: it gets
> >> pages from the host, pulls page structures off the list, binds them
> >> together and frees them back to the kernel. If it runs out of ballooned
> >> page structures, it hotplugs in some memory to add more.
> >>
> >
> > How does this deal with things like present_pages in the zones? Does
> > the total ram just grow with each hot-add, or does it grow on a per-page
> > basis from the ballooning?
> >
>
> Well, there are two ways of looking at it:
>
> either hot-plugging memory immediately adds pages, but they're also
> all immediately allocated and therefore unavailable for general use, or
>
> the pages are notionally physically added as they're populated by
> the host
>
>
> In principle they're equivalent, but I could imagine the former has the
> potential to make the VM waste time scanning unfreeable pages.
>
> I'm not sure the patches I've posted are doing this stuff correctly
> either way.
I don't fully understand either of your ideas yet. Could you tell me more?
One of them may be the same as my understanding, but I'm not sure.
Thanks.
--
Yasunori Goto
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-03-31 18:06 ` Jeremy Fitzhardinge
2008-04-01 7:17 ` Yasunori Goto
@ 2008-04-02 18:46 ` Dave Hansen
2008-04-02 18:52 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2008-04-02 18:46 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Mon, 2008-03-31 at 11:06 -0700, Jeremy Fitzhardinge wrote:
> >> That said, if (partial-)sections were much smaller - say 2-4 meg -
> and
> >> page migration/defrag worked reliably, then we could probably do
> without
> >> the balloon driver and do it all in terms of memory hot plug/unplug.
> >> That would give us a general mechanism which could either be driven from
> >> userspace, and/or have in-kernel Xen/kvm/s390/etc policy modules. Aside
> >> from small sections, the only additional requirement would be an online
> >> hook which can actually attach backing memory to the pages being
> >> onlined, rather than just assuming an underlying DIMM as current code does.
> >>
> >
> > Even with 1MB sections
>
> 1MB is too small. It shouldn't be smaller than the size of a large page.
Oh, I was just using 1MB as an easy-to-do-math-on-a-napkin number. :)
> > and a flat sparsemem map, you're only looking at
> > ~500k of overhead for the sparsemem storage. Less if you use vmemmap.
> >
>
> At the moment my concern is 32-bit x86, which doesn't support vmemmap or
> sections smaller than 512MB because of the shortage of page flags bits.
Yeah, I forgot that we didn't have vmemmap on x86-32. Ugh.
OK, here's another idea: Xen (and the balloon driver) already handle a
case where a guest boots up with 2GB of memory but only needs 1GB,
right? It will balloon the guest down to 1GB from 2GB.
Why don't we just have hotplug work that way? When we want to take a
guest from 1GB to 1GB+1 page (or whatever), we just hotplug the entire
section (512MB or 1GB or whatever), actually online the whole thing,
then make the balloon driver take it back to where it *should* be. That
way we're completely reusing existing components that have to be able to
handle this case anyway.
Yeah, this is suboptimal, and it has a possibility of fragmenting the
memory, but it will only be used for the x86-32 case.
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 18:46 ` Dave Hansen
@ 2008-04-02 18:52 ` Jeremy Fitzhardinge
2008-04-02 18:59 ` Dave Hansen
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-02 18:52 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
Dave Hansen wrote:
>>> and a flat sparsemem map, you're only looking at
>>> ~500k of overhead for the sparsemem storage. Less if you use vmemmap.
>>>
>>>
>> At the moment my concern is 32-bit x86, which doesn't support vmemmap or
>> sections smaller than 512MB because of the shortage of page flags bits.
>>
>
> Yeah, I forgot that we didn't have vmemmap on x86-32. Ugh.
>
> OK, here's another idea: Xen (and the balloon driver) already handle a
> case where a guest boots up with 2GB of memory but only needs 1GB,
> right? It will balloon the guest down to 1GB from 2GB.
>
Right.
> Why don't we just have hotplug work that way? When we want to take a
> guest from 1GB to 1GB+1 page (or whatever), we just hotplug the entire
> section (512MB or 1GB or whatever), actually online the whole thing,
> then make the balloon driver take it back to where it *should* be. That
> way we're completely reusing existing components that have do be able to
> handle this case anyway.
>
> Yeah, this is suboptimal, an it has a possibility of fragmenting the
> memory, but it will only be used for the x86-32 case.
>
It also requires you actually have the memory on hand to populate the
whole area. 512MB is still a significant chunk on a 2GB server; you may
end up generating significant overall system memory pressure to scrape
together the memory, only to immediately discard it again.
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 18:52 ` Jeremy Fitzhardinge
@ 2008-04-02 18:59 ` Dave Hansen
2008-04-02 21:03 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2008-04-02 18:59 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Wed, 2008-04-02 at 11:52 -0700, Jeremy Fitzhardinge wrote:
> > Why don't we just have hotplug work that way? When we want to take a
> > guest from 1GB to 1GB+1 page (or whatever), we just hotplug the entire
> > section (512MB or 1GB or whatever), actually online the whole thing,
> > then make the balloon driver take it back to where it *should* be. That
> > way we're completely reusing existing components that have do be able to
> > handle this case anyway.
> >
> > Yeah, this is suboptimal, an it has a possibility of fragmenting the
> > memory, but it will only be used for the x86-32 case.
> >
>
> It also requires you actually have the memory on hand to populate the
> whole area. 512MB is still a significant chunk on a 2GB server; you may
> end up generating significant overall system memory pressure to scrape
> together the memory, only to immediately discard it again.
That's a very good point. Can we make it so that the hypervisors don't
actually allocate the memory to the guest until its first touch? If the
pages are on the freelist, their *contents* shouldn't be touched at all
during the onlining process.
Maybe we could put a special mark on the pages (please no page flag :)
and the allocator can jump in and ask for the page from the hypervisor
before returning it to the system. I think Anthony had some ideas
around this area. It's kinda a poor man's page hinting.
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 18:59 ` Dave Hansen
@ 2008-04-02 21:03 ` Jeremy Fitzhardinge
2008-04-02 21:17 ` Dave Hansen
0 siblings, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-02 21:03 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
Dave Hansen wrote:
> On Wed, 2008-04-02 at 11:52 -0700, Jeremy Fitzhardinge wrote:
>
>>> Why don't we just have hotplug work that way? When we want to take a
>>> guest from 1GB to 1GB+1 page (or whatever), we just hotplug the entire
>>> section (512MB or 1GB or whatever), actually online the whole thing,
>>> then make the balloon driver take it back to where it *should* be. That
>>> way we're completely reusing existing components that have do be able to
>>> handle this case anyway.
>>>
>>> Yeah, this is suboptimal, an it has a possibility of fragmenting the
>>> memory, but it will only be used for the x86-32 case.
>>>
>>>
>> It also requires you actually have the memory on hand to populate the
>> whole area. 512MB is still a significant chunk on a 2GB server; you may
>> end up generating significant overall system memory pressure to scrape
>> together the memory, only to immediately discard it again.
>>
>
> That's a very good point. Can we make it so that the hypervisors don't
> actually allocate the memory to the guest until its first touch? If the
> pages are on the freelist, their *contents* shouldn't be touched at all
> during the onlining process.
>
No, not in a Xen direct-pagetable guest. The guest actually sees real
hardware page numbers (mfns) when the hypervisor gives it a page. By
the time the hypervisor gives it a page reference, it is already
guaranteeing that the page is available for guest use. The only thing
that we could do is prevent the guest from mapping the page, but that
doesn't really achieve much.
I think we're getting off track here; this is a lot of extra complexity
to justify allowing usermode to use /sys to online a chunk of hotplugged
memory.
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 21:03 ` Jeremy Fitzhardinge
@ 2008-04-02 21:17 ` Dave Hansen
2008-04-02 21:35 ` Jeremy Fitzhardinge
2008-04-02 21:36 ` Anthony Liguori
0 siblings, 2 replies; 24+ messages in thread
From: Dave Hansen @ 2008-04-02 21:17 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Wed, 2008-04-02 at 14:03 -0700, Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
> No, not in a Xen direct-pagetable guest. The guest actually sees real
> hardware page numbers (mfns) when the hypervisor gives it a page. By
> the time the hypervisor gives it a page reference, it already
> guaranteeing that the page is available for guest use. The only thing
> that we could do is prevent the guest from mapping the page, but that
> doesn't really achieve much.
Oh, once we've let Linux establish ptes to it, we've required that the
hypervisor have it around? How does that work with the balloon driver?
Do we destroy the ptes when giving balloon memory back to the
hypervisor?
If we're talking about i386, then we're set. We don't map the hot-added
memory at all because we only add highmem on i386. The only time we map
these pages is *after* we actually allocate them when they get mapped
into userspace or used as vmalloc() or they're kmap()'d.
> I think we're getting off track here; this is a lot of extra complexity
> to justify allowing usermode to use /sys to online a chunk of hotplugged
> memory.
Either that, or we're going to develop the entire Xen/kvm memory hotplug
architecture around the soon-to-be-legacy i386 limitations. :)
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 21:17 ` Dave Hansen
@ 2008-04-02 21:35 ` Jeremy Fitzhardinge
2008-04-02 21:43 ` Dave Hansen
2008-04-02 21:36 ` Anthony Liguori
1 sibling, 1 reply; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-02 21:35 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
Dave Hansen wrote:
> Oh, once we've let Linux establish ptes to it, we've required that the
> hypervisor have it around? How does that work with the balloon driver?
> Do we destroy the ptes when giving balloon memory back to the
> hypervisor?
>
Yep. It removes any mapping before handing it back to the hypervisor.
> If we're talking about i386, then we're set. We don't map the hot-added
> memory at all because we only add highmem on i386. The only time we map
> these pages is *after* we actually allocate them when they get mapped
> into userspace or used as vmalloc() or they're kmap()'d.
>
Well, the balloon driver can balloon out lowmem pages, so we have to
deal with mappings either way. But balloon+hotplug would work
identically on x86-64, so all pages are mapped.
>> I think we're getting off track here; this is a lot of extra complexity
>> to justify allowing usermode to use /sys to online a chunk of hotplugged
>> memory.
>>
>
> Either that, or we're going to develop the entire Xen/kvm memory hotplug
> architecture around the soon-to-be-legacy i386 limitations. :)
Everything also applies to x86-64.
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 21:17 ` Dave Hansen
2008-04-02 21:35 ` Jeremy Fitzhardinge
@ 2008-04-02 21:36 ` Anthony Liguori
1 sibling, 0 replies; 24+ messages in thread
From: Anthony Liguori @ 2008-04-02 21:36 UTC (permalink / raw)
To: Dave Hansen
Cc: Jeremy Fitzhardinge, KAMEZAWA Hiroyuki, Yasunori Goto,
Christoph Lameter, Linux Kernel Mailing List, Mel Gorman
Dave Hansen wrote:
> On Wed, 2008-04-02 at 14:03 -0700, Jeremy Fitzhardinge wrote:
>
>> Dave Hansen wrote:
>> No, not in a Xen direct-pagetable guest. The guest actually sees real
>> hardware page numbers (mfns) when the hypervisor gives it a page. By
>> the time the hypervisor gives it a page reference, it already
>> guaranteeing that the page is available for guest use. The only thing
>> that we could do is prevent the guest from mapping the page, but that
>> doesn't really achieve much.
>>
>
> Oh, once we've let Linux establish ptes to it, we've required that the
> hypervisor have it around? How does that work with the balloon driver?
> Do we destroy the ptes when giving balloon memory back to the
> hypervisor?
>
> If we're talking about i386, then we're set. We don't map the hot-added
> memory at all because we only add highmem on i386. The only time we map
> these pages is *after* we actually allocate them when they get mapped
> into userspace or used as vmalloc() or they're kmap()'d.
>
>
>> I think we're getting off track here; this is a lot of extra complexity
>> to justify allowing usermode to use /sys to online a chunk of hotplugged
>> memory.
>>
>
> Either that, or we're going to develop the entire Xen/kvm memory hotplug
> architecture around the soon-to-be-legacy i386 limitations. :)
>
s:Xen/kvm:Xen:g
We don't need anything special for KVM. Bare metal memory hotplug
should be sufficient provided userspace udev scripts are properly
configured to offline memory automatically.
Regards,
Anthony Liguori
> -- Dave
>
>
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 21:35 ` Jeremy Fitzhardinge
@ 2008-04-02 21:43 ` Dave Hansen
2008-04-02 22:13 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2008-04-02 21:43 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Wed, 2008-04-02 at 14:35 -0700, Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
> > Oh, once we've let Linux establish ptes to it, we've required that the
> > hypervisor have it around? How does that work with the balloon driver?
> > Do we destroy the ptes when giving balloon memory back to the
> > hypervisor?
>
> Yep. It removes any mapping before handing it back to the hypervisor.
Wow. So does Xen ever use PSE to map kernel data? That sucks.
> > If we're talking about i386, then we're set. We don't map the hot-added
> > memory at all because we only add highmem on i386. The only time we map
> > these pages is *after* we actually allocate them when they get mapped
> > into userspace or used as vmalloc() or they're kmap()'d.
>
> Well, the balloon driver can balloon out lowmem pages, so we have to
> deal with mappings either way. But balloon+hotplug would work
> identically on x86-64, so all pages are mapped.
Yeah, but I'm just talking about hotplugged memory. When we add it, we
don't have to map the added pages (since they're highmem) and don't have
to touch their contents and zero them out, either. Then, the balloon
driver can notice that the memory is too large, and start to balloon it
down.
> >> I think we're getting off track here; this is a lot of extra complexity
> >> to justify allowing usermode to use /sys to online a chunk of hotplugged
> >> memory.
> >>
> >
> > Either that, or we're going to develop the entire Xen/kvm memory hotplug
> > architecture around the soon-to-be-legacy i386 limitations. :)
>
> Everything also applies to x86-64.
Not really, though. We don't have the page->flags shortage or lack of
vmemmap on x86_64.
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 21:43 ` Dave Hansen
@ 2008-04-02 22:13 ` Jeremy Fitzhardinge
2008-04-02 23:27 ` Dave Hansen
2008-04-03 7:03 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 24+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-02 22:13 UTC (permalink / raw)
To: Dave Hansen
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
Dave Hansen wrote:
> On Wed, 2008-04-02 at 14:35 -0700, Jeremy Fitzhardinge wrote:
>
>> Dave Hansen wrote:
>>
>>> Oh, once we've let Linux establish ptes to it, we've required that the
>>> hypervisor have it around? How does that work with the balloon driver?
>>> Do we destroy the ptes when giving balloon memory back to the
>>> hypervisor?
>>>
>> Yep. It removes any mapping before handing it back to the hypervisor.
>>
>
> Wow. So does Xen ever use PSE to map kernel data? That sucks.
>
Not at present. But I'd like to change it to manage memory in largepage
chunks so that we can.
> Yeah, but I'm just talking about hotplugged memory. When we add it, we
> don't have to map the added pages (since they're highmem) and don't have
> to touch their contents and zero them out, either. Then, the balloon
> driver can notice that the memory is too large, and start to balloon it
> down.
>
I didn't think x86-64 had a notion of highmem.
How do you prevent the pages from being used before they're ballooned out?
>> Everything also applies to x86-64.
>>
>
> Not really, though. We don't have the page->flags shortage or lack of
> vmemmap on x86_64.
Right now, I'd rather have a single mechanism that works for both.
J
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 22:13 ` Jeremy Fitzhardinge
@ 2008-04-02 23:27 ` Dave Hansen
2008-04-03 7:03 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 24+ messages in thread
From: Dave Hansen @ 2008-04-02 23:27 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: KAMEZAWA Hiroyuki, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Wed, 2008-04-02 at 15:13 -0700, Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
> > Yeah, but I'm just talking about hotplugged memory. When we add it, we
> > don't have to map the added pages (since they're highmem) and don't have
> > to touch their contents and zero them out, either. Then, the balloon
> > driver can notice that the memory is too large, and start to balloon it
> > down.
>
> I didn't think x86-64 had a notion of highmem.
It doesn't.
> How do you prevent the pages from being used before they're ballooned out?
I think there are a few options here. One is to check on the way out of
the allocator that we're not over some Xen-specific limit. Basically
that we aren't about to touch a hardware page for which the hypervisor
hasn't allocated backing memory.
Another is to give pages sitting in the allocator some kind of
associated state or keep them on separate lists. (I think this has
something in common with those s390 CMM patches). When you want to
allocate a page, you not only pull it off the buddy lists, but you also
have to check with the hypervisor to make sure it has backing store
before you actually return it. You make it non-volatile in CMM-speak (I
think).
If you can't allocate backing store for a page, you toss it over to the
balloon driver (who's whole job is to keep track of pages without
hypervisor backing anyway) and go back to the allocator for another
one.
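A rough sketch of that first option (illustration only -- the
xen_populate_backing() call below is an invented placeholder for whatever
hypercall actually attaches backing memory to a pfn):

        /* On the way out of the allocator, make sure the hypervisor has
         * backing memory for the page; park unbacked pages in the balloon
         * and retry. */
        static struct page *xen_alloc_backed_page(gfp_t gfp)
        {
                struct page *page;

                for (;;) {
                        page = alloc_page(gfp);
                        if (!page)
                                return NULL;

                        if (xen_populate_backing(page_to_pfn(page)) == 0)
                                return page;

                        /* No backing store: hand the page to the balloon
                         * driver and try another one. */
                        balloon_append(page);
                }
        }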
> >> Everything also applies to x86-64.
> >
> > Not really, though. We don't have the page->flags shortage or lack of
> > vmemmap on x86_64.
>
> Right now, I'd rather have a single mechanism that works for both.
Yeah, that would be most ideal. But, at the same time, you don't want
to hobble your rockstar x86_64 implementation with quirks inherited from
the crufty 32-bit junk. :)
-- Dave
* Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining
2008-04-02 22:13 ` Jeremy Fitzhardinge
2008-04-02 23:27 ` Dave Hansen
@ 2008-04-03 7:03 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-04-03 7:03 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Dave Hansen, Yasunori Goto, Christoph Lameter,
Linux Kernel Mailing List, Anthony Liguori, Mel Gorman
On Wed, 02 Apr 2008 15:13:05 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> > Yeah, but I'm just talking about hotplugged memory. When we add it, we
> > don't have to map the added pages (since they're highmem) and don't have
> > to touch their contents and zero them out, either. Then, the balloon
> > driver can notice that the memory is too large, and start to balloon it
> > down.
> >
>
> I didn't think x86-64 had a notion of highmem.
>
> How do you prevent the pages from being used before they're ballooned out?
>
As I mentioned before, you can do that by hooking online_page().
Now, online_page() is per-architecture. So, it's not so bad to make this
online_page() just a callback.
=
int online_page(struct page *page)
{
        if (online_page_callback)
                return (*online_page_callback)(page);
        return arch_default_online_page(page);
}
=
Maybe it doesn't look so dirty.
Your balloon driver can overwrite this callback pointer.
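As a rough sketch of how the balloon driver might use that hook (illustration
only -- xen_balloon_online_page() is an invented name, and the callback
pointer itself is just the proposal above):

        /* Queue freshly-onlined pages until the hypervisor actually
         * provides backing memory for them. */
        static int xen_balloon_online_page(struct page *page)
        {
                SetPageReserved(page);
                balloon_append(page);
                return 0;
        }

        /* at balloon driver init */
        online_page_callback = xen_balloon_online_page;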
Thanks,
-Kame