LKML Archive
From: Neil Brown <>
To: Peter Zijlstra <>
Cc: Andrew Morton <>,
	Linus Torvalds <>
Subject: Re: [PATCH 00/28] Swap over NFS -v16
Date: Tue, 4 Mar 2008 10:41:23 +1100	[thread overview]
Message-ID: <18380.36003.162081.900296@notabene.brown> (raw)
In-Reply-To: message from Peter Zijlstra on Monday March 3

Hi Peter,

 Thanks for trying to spell it out for me. :-)

On Monday March 3, wrote:
> From my POV there is a model, and I've tried to convey it, but clearly
> I'm failing horribly. Let me try again:
> Create a stable state where you can receive an unlimited amount of
> network packets awaiting the one packet you need to move forward.


> To do so we need to distinguish needed from unneeded packets; we do this
> by means of SK_MEMALLOC. So we need to be able to receive packets up to
> that point.


> The unlimited amount of packets means unlimited time; which means that
> our state must not consume memory, merely use memory. That is, the
> amount of memory used must not grow unbounded over time.

Yes.  Good point.

> So we must guarantee that all memory allocated will be promptly freed
> again, and never allocate more than available.


> Because this state is not the normal state, we need a trigger to enter
> this state (and consequently a trigger to leave this state). We do that
> by detecting a low memory situation just like you propose. We enter this
> state once normal memory allocations fail and leave this state once they
> start succeeding again.


> We need the accounting to ensure we never allocate more than is
> available, but more importantly because we need to ensure progress for
> those packets we already have allocated.

 1/ Memory is used 
     a/ in caches, such as the fragment cache and the route cache
     b/ in transient allocations on their way from one place to
        another. e.g. network card to fragment cache, frag cache to
        socket.
    The caches can (do?) impose a natural limit on the amount of
    memory they use.  The transient allocations should be satisfied
    from the normal low watermark pool.  When we are in a low memory
    condition we can expect packet loss, so we expect network streams
    to slow down, so we expect there to be fewer bits in transit.
    Also in low memory conditions the caches would be extra-cautious
    not to use too much memory.
    So it isn't completely clear (to me) that extra accounting is needed.
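
    The "natural limit" a cache can impose (point 1a) can be sketched as
    a toy in plain C -- a FIFO cache with a hard byte budget that prunes
    its oldest entries to make room, so total memory use never exceeds
    the budget.  All names here are illustrative, not the kernel's
    actual fragment-cache code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy bounded cache: a FIFO of entry sizes with a hard byte budget.
 * Illustrative only -- the real fragment/route caches are richer. */
#define CACHE_SLOTS 64

struct bounded_cache {
    size_t budget;              /* hard limit on total bytes held */
    size_t used;                /* bytes currently held */
    size_t sizes[CACHE_SLOTS];  /* FIFO of entry sizes */
    int head, tail, count;
};

static void cache_init(struct bounded_cache *c, size_t budget)
{
    c->budget = budget;
    c->used = 0;
    c->head = c->tail = c->count = 0;
}

/* Drop the oldest entry, releasing its bytes. */
static void cache_prune_oldest(struct bounded_cache *c)
{
    if (c->count == 0)
        return;
    c->used -= c->sizes[c->head];
    c->head = (c->head + 1) % CACHE_SLOTS;
    c->count--;
}

/* Admit a new entry, pruning old ones until it fits.
 * Memory use can therefore never grow past the budget. */
static int cache_add(struct bounded_cache *c, size_t size)
{
    if (size > c->budget)
        return -1;              /* can never fit */
    while (c->used + size > c->budget || c->count == CACHE_SLOTS)
        cache_prune_oldest(c);
    c->sizes[c->tail] = size;
    c->tail = (c->tail + 1) % CACHE_SLOTS;
    c->count++;
    c->used += size;
    return 0;
}
```

    With such a limit in place, the cache uses memory without consuming
    it: old entries are recycled to admit new ones.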

 2/ If we were to do accounting to "ensure progress for those packets
    we already have allocated", then I would expect a reservation
    (charge) of max_packet_size when a fragment arrives on the network
    card - or at least when a new fragment is determined to not match
    any packet already in the fragment cache.  But I didn't see that
    in your code.  I saw incremental charges as each page arrived.
    And that implementation doesn't seem to fit the model.

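
    The up-front reservation described in 2/ might look like the
    following toy sketch (hypothetical names, not the code in the patch
    series): charge max_packet_size once when a fragment opens a new
    reassembly slot, rather than charging incrementally per fragment.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of up-front charging.  All names are illustrative. */
struct reserve {
    size_t total;    /* bytes set aside for this purpose */
    size_t charged;  /* bytes currently committed */
};

static int reserve_charge(struct reserve *r, size_t bytes)
{
    if (r->charged + bytes > r->total)
        return -1;   /* no room: caller must drop the fragment */
    r->charged += bytes;
    return 0;
}

static void reserve_uncharge(struct reserve *r, size_t bytes)
{
    r->charged -= bytes;
}

#define MAX_PACKET_SIZE 65536u

/* A fragment that matches no existing packet charges the worst-case
 * packet size up front; later fragments of that packet are free. */
static int frag_new_packet(struct reserve *r)
{
    return reserve_charge(r, MAX_PACKET_SIZE);
}

static void packet_done(struct reserve *r)
{
    reserve_uncharge(r, MAX_PACKET_SIZE);
}
```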
> A packet is received, it can be a fragment, it will be placed in the
> fragment cache for packet re-assembly.


> We need to ensure we can overflow this fragment cache in order that
> something will come out at the other end. If under a fragment attack,
> the fragment cache limit will prune the oldest fragments, freeing up
> memory to receive new ones.

I don't understand why we want to "overflow this fragment cache".
I picture the cache having a target size.  When under this size,
fragments might be allowed to live longer.  When at or over the target
size, old fragments are pruned earlier.  When in a low memory
situation it might be even more keen to prune old fragments, to keep
beneath the target size.
When you say "overflow this fragment cache", I picture deliberately
allowing the cache to get bigger than the target size.  I don't
understand why you would want to do that.

> Eventually we'd be able to receive either a whole packet, or enough
> fragments to assemble one.

That would be important, yes.

> Next comes routing the packet; we need to know where to process the
> packet; local or non-local. This potentially involves filling the
> route-cache.
> If at this point there is no memory available because we forgot to limit
> the amount of memory available for skb allocation we again are stuck.

Those skbs we allocated - they are either sitting in the fragment
cache, or have been attached to a SK_MEMALLOC socket, or have been
freed - correct?  If so, then there is already a limit to how much
memory they can consume.

> The route-cache, like the fragment assembly, is already accounted and
> will prune old (unused) entries once the total memory usage exceeds a
> pre-determined amount of memory.

Good.  So as long as the normal emergency reserves covers the size of
the route cache plus the size of the fragment cache plus a little bit
of slack, we should be safe - yes?

> Eventually we'll end up at socket demux, matching packets to sockets
> which allows us to either toss the packet or consume it. Dropping
> packets is allowed because network is assumed lossy, and we have not yet
> acknowledged the receive.
> Does this make sense?

Lots of it does, yes.

> Then we have TX, which like I said above needs to operate under certain
> limits as well. We need to be able to send out packets when under
> pressure in order to relieve said pressure.

Catch-22 ?? :-)

> We need to ensure doing so will not exhaust our reserves.
> Writing out a page typically takes a little memory, you fudge some
> packets with protocol info, mtu size etc.. send them out, and wait for
> an acknowledge from the other end, and drop the stuff and go on writing
> other pages.

Yes, rate-limiting those write-outs should keep that moving.

> So sending out pages does not consume memory if we're able to receive
> ACKs. Being able to receive packets was what all the previous was
> about.
> Now of course there is some RPC concurrency, TCP windows and other
> funnies going on, but I assumed - and I don't think that's a wrong
> assumption - that sending out pages will consume endless amounts of
                                          ^not ??
> memory.

Sounds fair.

> Nor will it keep on sending pages, once there is a certain amount of
> packets outstanding (nfs congestion logic), it will wait, at which point
> it should have no memory in use at all.

Providing it frees any headers it attached to each page (or had
allocated them from a private pool), it should have no memory in use.
I'd have to check through the RPC code (I get lost in there too) to
see how much memory is tied up by each outstanding page write.
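
    The nfs congestion logic mentioned above amounts to a cap on writes
    in flight: once the cap is hit, the writer waits for ACKs, so
    steady-state memory use stays bounded by the cap times the per-write
    overhead.  A toy version (hypothetical names):

```c
#include <assert.h>

/* Toy congestion gate: allow at most 'limit' writes outstanding. */
struct congestion {
    int limit;        /* maximum writes in flight */
    int outstanding;  /* writes currently awaiting ACK */
};

/* Returns 1 if a new write may start; 0 means the caller must wait. */
static int write_may_start(struct congestion *c)
{
    if (c->outstanding >= c->limit)
        return 0;
    c->outstanding++;
    return 1;
}

/* An ACK arrived: one slot frees up. */
static void write_acked(struct congestion *c)
{
    c->outstanding--;
}
```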

> Anyway I did get lost in the RPC code, and I know I didn't fully account
> everything, but under some (hopefully realistic) assumptions I think the
> model is sound.
> Does this make sense?


So I can see two possible models here.

The first is the "bounded cache" or "locally bounded" model.
At every step in the path from writepage to clear_page_writeback,
the amount of extra memory used is bounded by some local rules.
NFS and RPC uses congestion logic to limit the number of outstanding
writes.  For incoming packets, the fragment cache and route cache
impose their own limits.
We simply need that the VM reserves a total amount of memory to meet
the sum of those local limits.

Your code embodies this model with the tree of reservations.  The root
of the tree stores the sum of all the reservations below, and this
number is given to the VM.
The value of the tree is that different components can register their
needs independently, and the whole tree (or subtrees) can be attached
or not depending on global conditions, such as whether there are any
SK_MEMALLOC sockets or not.
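
    A minimal sketch of such a reservation tree, assuming made-up names
    rather than the patch's actual API: each node records its own need,
    and the root's recursive total is the figure handed to the VM.

```c
#include <assert.h>
#include <stddef.h>

/* Toy reservation tree.  Illustrative names only. */
#define MAX_CHILDREN 8

struct res_node {
    size_t own;                           /* this component's own need */
    struct res_node *child[MAX_CHILDREN];
    int nchild;
};

/* Total for a subtree = own need plus all descendants' needs. */
static size_t res_total(const struct res_node *n)
{
    size_t sum = n->own;
    for (int i = 0; i < n->nchild; i++)
        sum += res_total(n->child[i]);
    return sum;
}

/* Components register independently by attaching subtrees. */
static void res_attach(struct res_node *parent, struct res_node *c)
{
    parent->child[parent->nchild++] = c;
}
```

    Detaching a subtree (e.g. when the last SK_MEMALLOC socket closes)
    would simply shrink the root total the VM must hold in reserve.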

However I don't see how the charging that you implemented fits into
this model.
You don't do any significant charging for the route cache.  But you do
for skbs.  Why?  Don't the majority of those skbs live in the fragment
cache?  Doesn't it account their size?  (Maybe it doesn't.... maybe it
should?)

I also don't see the value of tracking pages to see if they are
'reserve' pages or not.  The decision to drop an skb that is not for
an SK_MEMALLOC socket should be based on whether we are currently
short on memory.  Not whether we were short on memory when the skb was
allocated.

The second model that could fit is "total accounting". 
In this model we reserve memory at each stage including the transient
stages (packet that has arrived but isn't in fragment cache yet).
As memory moves around, we move the charging from one reserve to
another.  If the target reserve doesn't have any space, we drop the
data.
On the transmit side, that means putting the page back on a queue for
sending later.  On the receive side that means discarding the packet
and waiting for a resend.
This model makes it easy for the various limits to be very different
while under memory pressure than otherwise.  It also means they are
imposed differently, which isn't so good.
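
    The charge-moving step of this second model can be sketched as
    follows (toy code, illustrative names): each stage has its own
    reserve, and when data moves between stages its charge moves too;
    if the next stage's reserve is full, the move fails and the caller
    drops (receive side) or requeues (transmit side).

```c
#include <assert.h>
#include <stddef.h>

/* Per-stage reserve for the "total accounting" model. */
struct stage_reserve {
    size_t total;    /* bytes this stage may hold */
    size_t charged;  /* bytes currently held */
};

/* Move a charge downstream; fail if the target has no room. */
static int charge_move(struct stage_reserve *from,
                       struct stage_reserve *to, size_t bytes)
{
    if (to->charged + bytes > to->total)
        return -1;   /* no room downstream: drop or requeue */
    from->charged -= bytes;
    to->charged += bytes;
    return 0;
}
```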

 - Why do you impose skb allocation limits beyond what is imposed
   by the fragment cache?
 - Why do you need to track whether each allocation is a reserve or
   not?


  reply	other threads:[~2008-03-03 23:41 UTC|newest]

Thread overview: 73+ messages
2008-02-20 14:46 Peter Zijlstra
2008-02-20 14:46 ` [PATCH 01/28] mm: gfp_to_alloc_flags() Peter Zijlstra
2008-02-20 14:46 ` [PATCH 02/28] mm: tag reseve pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 03/28] mm: slb: add knowledge of reserve pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 04/28] mm: kmem_estimate_pages() Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 06/28] mm: serialize access to min_free_kbytes Peter Zijlstra
2008-02-20 14:46 ` [PATCH 07/28] mm: emergency pool Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK Peter Zijlstra
2008-02-23  8:05   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 09/28] mm: __GFP_MEMALLOC Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 10/28] mm: memory reserve management Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 11/28] selinux: tag avc cache alloc as non-critical Peter Zijlstra
2008-02-20 14:46 ` [PATCH 12/28] net: wrap sk->sk_backlog_rcv() Peter Zijlstra
2008-02-20 14:46 ` [PATCH 13/28] net: packet split receive api Peter Zijlstra
2008-02-20 14:46 ` [PATCH 14/28] net: sk_allocation() - concentrate socket related allocations Peter Zijlstra
2008-02-20 14:46 ` [PATCH 15/28] netvm: network reserve infrastructure Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-24  6:52   ` Mike Snitzer
2008-02-20 14:46 ` [PATCH 16/28] netvm: INET reserves Peter Zijlstra
2008-02-20 14:46 ` [PATCH 17/28] netvm: hook skb allocation to reserves Peter Zijlstra
2008-02-23  8:06   ` Andrew Morton
2008-02-20 14:46 ` [PATCH 18/28] netvm: filter emergency skbs Peter Zijlstra
2008-02-20 14:46 ` [PATCH 19/28] netvm: prevent a stream specific deadlock Peter Zijlstra
2008-02-20 14:46 ` [PATCH 20/28] netfilter: NF_QUEUE vs emergency skbs Peter Zijlstra
2008-02-20 14:46 ` [PATCH 21/28] netvm: skb processing Peter Zijlstra
2008-02-20 14:46 ` [PATCH 22/28] mm: add support for non block device backed swap files Peter Zijlstra
2008-02-20 16:30   ` Randy Dunlap
2008-02-20 16:46     ` Peter Zijlstra
2008-02-26 12:45   ` Miklos Szeredi
2008-02-26 12:58     ` Peter Zijlstra
2008-02-20 14:46 ` [PATCH 23/28] mm: methods for teaching filesystems about PG_swapcache pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 24/28] nfs: remove mempools Peter Zijlstra
2008-02-20 14:46 ` [PATCH 25/28] nfs: teach the NFS client how to treat PG_swapcache pages Peter Zijlstra
2008-02-20 14:46 ` [PATCH 26/28] nfs: disable data cache revalidation for swapfiles Peter Zijlstra
2008-02-20 14:46 ` [PATCH 27/28] nfs: enable swap on NFS Peter Zijlstra
2008-02-20 14:46 ` [PATCH 28/28] nfs: fix various memory recursions possible with swap over NFS Peter Zijlstra
2008-02-23  8:06 ` [PATCH 00/28] Swap over NFS -v16 Andrew Morton
2008-02-26  6:03   ` Neil Brown
2008-02-26 10:50     ` Peter Zijlstra
2008-02-26 12:00       ` Peter Zijlstra
2008-02-26 15:29       ` Miklos Szeredi
2008-02-26 15:41         ` Peter Zijlstra
2008-02-26 15:43         ` Peter Zijlstra
2008-02-26 15:47           ` Miklos Szeredi
2008-02-26 17:56       ` Andrew Morton
2008-02-27  5:51       ` Neil Brown
2008-02-27  7:58         ` Peter Zijlstra
2008-02-27  8:05           ` Pekka Enberg
2008-02-27  8:14             ` Peter Zijlstra
2008-02-27  8:33               ` Peter Zijlstra
2008-02-27  8:43                 ` Pekka J Enberg
2008-02-29 11:51             ` Peter Zijlstra
2008-02-29 11:58               ` Pekka Enberg
2008-02-29 12:18                 ` Peter Zijlstra
2008-02-29 12:29                   ` Pekka Enberg
2008-02-29  1:29           ` Neil Brown
2008-02-29 10:21             ` Peter Zijlstra
2008-03-02 22:18               ` Neil Brown
2008-03-02 23:33                 ` Peter Zijlstra
2008-03-03 23:41                   ` Neil Brown [this message]
2008-03-04 10:28                     ` Peter Zijlstra
     [not found]           ` <1204626509.6241.39.camel@lappy>
2008-03-07  3:33             ` Neil Brown
2008-03-07 11:17               ` Peter Zijlstra
2008-03-07 11:55                 ` Peter Zijlstra
2008-03-10  5:15                 ` Neil Brown
2008-03-10  9:17                   ` Peter Zijlstra
2008-03-14  5:22                     ` Neil Brown
