LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Nick Piggin <npiggin@suse.de>
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [patch] mm: NUMA replicated pagecache
Date: Thu, 15 Feb 2007 01:38:10 +0100	[thread overview]
Message-ID: <20070215003810.GE29797@wotan.suse.de> (raw)
In-Reply-To: <1171485124.5099.43.camel@localhost>

On Wed, Feb 14, 2007 at 03:32:04PM -0500, Lee Schermerhorn wrote:
> On Tue, 2007-02-13 at 07:09 +0100, Nick Piggin wrote:
> > Hi,
> > 
> > Just tinkering around with this and got something working, so I'll see
> > if anyone else wants to try it.
> > 
> > Not proposing for inclusion, but I'd be interested in comments or results.
> > 
> > Thanks,
> > Nick
> 
> I've included a small patch below that allow me to build and boot with
> these patches on an HP NUMA platform.  I'm still seeing an "unable to

Thanks Lee. Merged.

> > - Would like to be able to control replication via userspace, and maybe
> >   even internally to the kernel.
> How about per cpuset?  Consider a cpuset, on a NUMA system, with cpus
> and memories from a specific set of nodes.  One might choose to have
> page cache pages referenced by tasks in this cpuset to be pulled into
> the cpuset's memories for local access.  The remainder of the system may
> choose not to replicate page cache pages--e.g., to conserve memory.
> However, "unreplicating" on write would still need to work system wide.
> 
> But, note:  may [probably] want option to disable replication for shmem
> pages?  I'm thinking here of large data base shmem regions that, at any
> time, might have a lot of pages accessed "read only".  Probably wouldn't
> want a lot of replication/unreplication happening behind the scene. 

Yeah cpusets is an interesting possibility. A per-inode attribute could be
another one. The good old global sysctl is also a must :)


> > - Ideally, reclaim might reclaim replicated pages preferentially, however
> >   I aim to be _minimally_ intrusive.
> > - Would like to replicate PagePrivate, but filesystem may dirty page via
> >   buffers. Any solutions? (currently should mount with 'nobh').
> Linux migrates pages with PagePrivate using a per mapping migratepage
> address space op to handle the buffers.  File systems can provide their
> own or use a generic version.  How about a "replicatepage" aop?

I guess the main problem is those filesystems which dirty the page via
the buffers, via b_this_page, or b_data. However AFAIKS, these only happen
for things like directories. I _think_ we can safely assume that regular
file pages will not get modified (that would be data corruption!).

> > +struct page * find_get_page_readonly(struct address_space *mapping, unsigned long offset)
> > +{
> > +	struct page *page;
> > +
> > +retry:
> > +	read_lock_irq(&mapping->tree_lock);
> > +	if (radix_tree_tag_get(&mapping->page_tree, offset,
> > +					PAGECACHE_TAG_REPLICATED)) {
> > +		int nid;
> > +		struct pcache_desc *pcd;
> > +replicated:
> > +		nid = numa_node_id();
> > +		pcd = radix_tree_lookup(&mapping->page_tree, offset);
> ??? possible NULL pcd?  I believe I'm seeing one here...

Hmm, OK. I'll have to do some stress testing. I'm sure there are a few bugs
left.

> 
> > +		if (!node_isset(nid, pcd->nodes_present)) {
> Do this check [and possible replicate] only if replication enabled
> [system wide?, per cpuset?  based on explicit replication policy?, ...]?

Yep.

> > +			struct page *repl_page;
> > +
> > +			page = pcd->master;
> > +			page_cache_get(page);
> > +			read_unlock_irq(&mapping->tree_lock);
> > +			repl_page = alloc_pages_node(nid,
> > +					mapping_gfp_mask(mapping), 0);
> ??? don't try to hard to allocate page, as it's only a performance
> optimization.  E.g., add in GFP_THISNODE and remove and __GFP_WAIT?

I think that has merit. The problem if we remove __GFP_WAIT is that the
page allocator gives us access to some reserves. __GFP_NORETRY should
be reasonable?

> 
> > +			if (!repl_page)
> > +				return page;
> > +			copy_highpage(repl_page, page);
> > +			flush_dcache_page(repl_page);
> > +			page->mapping = mapping;
> > +			page->index = offset;
> > +			SetPageUptodate(repl_page); /* XXX: nonatomic */
> > +			page_cache_release(page);
> > +			write_lock_irq(&mapping->tree_lock);
> > +			__insert_replicated_page(repl_page, mapping, offset, nid);
> ??? can this fail due to race?  Don't care because we retry the lookup?
> page freed [released] in the function...

Yeah, I told you it was ugly :P Sorry you had to wade through this, but
it can be cleaned up..

> >  EXPORT_SYMBOL(find_lock_page);
> ??? should find_trylock_page() handle potential replicated page?
>     until it is removed, anyway?  

It is removed upstream, but in 2.6.20 it has no callers anyway so I didn't
worry about it.


Thanks for the comments & patch.



  reply	other threads:[~2007-02-15  0:38 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-02-13  6:09 Nick Piggin
2007-02-13  6:09 ` Nick Piggin
2007-02-14  4:28 ` Nick Piggin
2007-02-14 18:57 ` Christoph Lameter
2007-02-15  0:07   ` Nick Piggin
2007-02-14 19:00 ` Christoph Lameter
2007-02-15  0:10   ` Nick Piggin
2007-02-15  0:13     ` Christoph Lameter
2007-02-14 20:32 ` Lee Schermerhorn
2007-02-15  0:38   ` Nick Piggin [this message]
2007-02-15 23:29     ` Lee Schermerhorn
2007-02-16 13:35       ` Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070215003810.GE29797@wotan.suse.de \
    --to=npiggin@suse.de \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --subject='Re: [patch] mm: NUMA replicated pagecache' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).