From: Neil Brown <neilb@suse.de>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	netdev@vger.kernel.org, trond.myklebust@fys.uio.no
Subject: Re: [PATCH 00/28] Swap over NFS -v16
Date: Fri, 29 Feb 2008 12:29:18 +1100
Message-ID: <18375.24558.876276.255804@notabene.brown>
In-Reply-To: message from Peter Zijlstra on Wednesday February 27


So I've been pondering all this some more, trying to find the pattern,
and things are beginning to crystallise (I hope).

One of the approaches I have been taking is to compare it to mempools
(which I think I understand) and work out what the important
differences are.

One difference is that you don't wait for memory to become available
(as I mentioned earlier).  Rather you just try to get the memory and
if it isn't available, you drop the packet.  This probably makes sense
for incoming packets as you rely on the packet being re-sent, and
hopefully various back-off algorithms will slow things down a bit so
that there is a good chance that memory will be available next time...

For outgoing messages I'm less clear on exactly what is going on.
Maybe I haven't looked at that code properly yet, but I would expect
there to be a place for waiting for memory to become available
somewhere in the outgoing path?

But there is another important difference to mempools which I think is
worth exploring.  With mempools, you are certain that the memory will
only be used to make forward progress in writing out dirty data.  So
if you find that there isn't enough memory at the moment and you have
to wait, you can be sure someone else is making forward progress and
so waiting isn't such a bad thing.

With your reservations it isn't quite the same.  Reserved memory can
be used for related purposes.  In particular, any incoming packet can
use some reserved memory.  Once the purpose of that packet is
discovered (i.e. that matching socket is found), the memory will be
freed again.  But there is a period of time when memory is being used
for an inappropriate purpose.  The consequences of this should be
clearly understood.

In particular, the memory that is reserved for the emergency pool
should include some overhead to acknowledge the fact that memory
might be used for short periods of time for unrelated purposes.

I think we can fit this acknowledgement into the current model quite
easily, and it makes the tree structure suddenly make lots of sense
(whereas before I was still struggling with it).

A key observation in this design is "Sometimes we need to allocate
emergency memory without knowing exactly what it is going to be used
for".  I think we should make that explicit in the implementation as
follows:

  We have a tree of reservations (as you already do) where levels in
  the tree correspond to more explicit knowledge of how the memory
  will be used.
  At the top level there is a generic 'page' reservation.  Below that
  to one side we have a 'SLUB/SLAB' reservation.  I'm not sure yet
  exactly what that will look like.
  Also below the 'page' reservation is a reservation for pages to hold
  incoming network fragments.
  Below the SLxB reservation is a reservation for skbs, which is
  parent to a reservation for IPv4 skbs and another for IPv6 skbs.

Each of these nodes has its own independent reservation - parents are
not simply the sum of the children.
The sum over the whole tree is given to the VM as the size of the
emergency pool to reserve for emergency allocations.

Now, every actual allocation from the emergency pool effectively comes
in at the top of the tree and moves down as its purpose is more fully
understood.  Every emergency allocation is *always* charged to one
node in the tree, though which node may change.

e.g.
  A network driver asks for a page to store a fragment.
  netdev_alloc_page calls alloc_page with __GFP_MEMALLOC set.
  If alloc_page needs to dive into the emergency pool, it first
  charges the one page against the root of the reservation tree.
  If this succeeds, it returns the page with ->reserve set.  If the
  reservation fails, it ignores the __GFP_MEMALLOC and fails.
  netdev_alloc_page notices that the page is a ->reserve page, and
  knows that it has been charged to the top 'page' reservation, but it
  should be charged to the network-page reservation.  So it tries to
  charge against the network-pages reservation, and reverses the
  charge against 'pages'.  If the network-pages reservation fails, the
  page is freed and netdev_alloc_page fails.
  As you can see, the charge moves down the tree as more information
  becomes available.

  Similarly a charge might move from 'pages' to 'SLxB' to 'net_skb' to
  'ipv4_skb'.
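
  To make the "charge moves down the tree" step concrete, here is a
  minimal user-space sketch in C.  It is only a model of the idea, not
  code from the patch set: struct mem_reserve, reserve_charge(),
  reserve_uncharge(), reserve_move() and emergency_netdev_page() are all
  hypothetical names, and no real page is allocated.  It assumes each
  node is just a limit plus a usage counter, in pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reservation node: a limit and the current usage, in pages. */
struct mem_reserve {
	const char *name;
	long limit;	/* pages set aside for this node */
	long usage;	/* pages currently charged to this node */
};

/* Charge 'pages' to 'res'; fail with no side effects if it would overflow. */
static bool reserve_charge(struct mem_reserve *res, long pages)
{
	if (res->usage + pages > res->limit)
		return false;
	res->usage += pages;
	return true;
}

/* Reverse an earlier charge. */
static void reserve_uncharge(struct mem_reserve *res, long pages)
{
	assert(res->usage >= pages);
	res->usage -= pages;
}

/*
 * Move a charge one step down the tree: charge the more specific node
 * first, then reverse the charge on the more general one.  On failure
 * the charge stays on 'from' and the caller must give the memory back.
 */
static bool reserve_move(struct mem_reserve *from, struct mem_reserve *to,
			 long pages)
{
	if (!reserve_charge(to, pages))
		return false;
	reserve_uncharge(from, pages);
	return true;
}

/*
 * Model of the netdev_alloc_page example: charge the root 'page' node,
 * then re-charge to the network-pages node.  Returns true if the
 * (imaginary) emergency page ends up charged to 'net_pages'.
 */
static bool emergency_netdev_page(struct mem_reserve *root,
				  struct mem_reserve *net_pages)
{
	if (!reserve_charge(root, 1))
		return false;		/* top level exhausted: fail early */
	if (!reserve_move(root, net_pages, 1)) {
		reserve_uncharge(root, 1);	/* stands in for freeing the page */
		return false;
	}
	return true;
}
```

  Note that at every point the page is charged to exactly one node, and
  that a failure at either level leaves both counters as they were.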

  At the bottom levels, the reservation says how much memory is
  needed for that particular usage to be able to make sensible forward
  progress.
  At the higher levels, the reservation says how much overhead we need
  to allow to ensure that transient invalid uses don't unduly limit
  available emergency memory.  As pages are likely to be immediately
  re-charged lower down the tree, the reservation at the top level
  would probably be proportional to the number of CPUs (probably one
  page per CPU would be perfect).  Lower down, different calculations
  might suggest different intermediate reservations.
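
  As a rough illustration of the sizing, the following C sketch adds up
  independent per-node reservations to get the figure handed to the VM,
  with the top-level node sized at one page per CPU as guessed above.
  struct reserve_size, top_level_pages() and emergency_pool_pages() are
  hypothetical names, and any numbers fed to them are made up for the
  example rather than taken from the patches.

```c
#include <stddef.h>

/* Hypothetical per-node reservation size, in pages. */
struct reserve_size {
	const char *name;
	long pages;
};

/* Assumed sizing for the top-level 'page' node: one page per CPU. */
static long top_level_pages(long nr_cpus)
{
	return nr_cpus;
}

/*
 * Emergency pool size: the plain sum over every node in the tree.
 * Parents are not the sum of their children; each node contributes
 * its own independent reservation.
 */
static long emergency_pool_pages(const struct reserve_size *nodes, size_t n)
{
	long total = 0;

	for (size_t i = 0; i < n; i++)
		total += nodes[i].pages;
	return total;
}
```

  With, say, 4 CPUs and illustrative reservations of 2, 16, 1, 8 and 8
  pages for the SLxB, network-page, skb, IPv4 and IPv6 nodes, the pool
  would be 4 + 2 + 16 + 1 + 8 + 8 = 39 pages.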

Of course, these things don't need to be explicitly structured as a
tree.  There is no need for 'parent' or 'sibling' pointers.  The code
implicitly knows where to move charges from and to.
You still need an explicit structure to allow groups of reservations
that are activated or de-activated as a whole.  That can use your
current tree structure, or whatever else turns out to make sense.

This model, I think, captures the important "allocate before charging"
aspect of reservations that you need (particularly for incoming
network packets) and it makes that rule apply throughout the different
stages that an allocated chunk of memory goes through.

With this model, alloc_page could fail more often, as it now also
fails if the top level reservation is exhausted.  This may seem
unnecessary, but I think it could be a good thing.  It means that at
very busy times (when lots of requests are needing emergency memory)
we drop requests randomly and very early.  If we are going to drop a
request eventually, dropping it early means we waste less time on it
which is probably a good thing.


So: Does this model help others with understanding how the
reservations work, or am I just over-engineering?

NeilBrown
