LKML Archive on lore.kernel.org
* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-18  7:42 Martin Knoblauch
  2008-09-18  8:18 ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-18  7:42 UTC (permalink / raw)
  To: Andrew Morton, Greg Banks; +Cc: linux-nfs list, linux-kernel

----- Original Message ----

> From: Andrew Morton <akpm@linux-foundation.org>
> To: Greg Banks <gnb@melbourne.sgi.com>
> Cc: Martin Knoblauch <knobi@knobisoft.de>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Thursday, September 18, 2008 5:13:34 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> On Thu, 18 Sep 2008 11:42:54 +1000 Greg Banks wrote:
> 
> > I think having a tunable for client readahead is an excellent idea,
> > although not to solve your particular problem.  The SLES10 kernel has a
> > patch which does precisely that, perhaps Neil could post it.
> > 
> > I don't think there's a lot of point having both a module parameter and
> > a sysctl.
> 
> mount -o remount,readahead=42

[root@lpsdm52 ~]# mount -o remount,readahead=42 /net/spsdms/fs13
Bad nfs mount parameter: readahead
[root@lpsdm52 ~]# mount -o readahead=42 /net/spsdms/fs13
Bad nfs mount parameter: readahead


 I assume the reply was meant to say that the correct way of introducing a modifiable readahead size is to implement it as a mount option? :-) I considered it, but it seems more intrusive than the workaround patch. It also needs changes to userspace tools, correct?

Cheers
Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18  7:42 [RFC][Resend] Make NFS-Client readahead tunable Martin Knoblauch
@ 2008-09-18  8:18 ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2008-09-18  8:18 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: Greg Banks, linux-nfs list, linux-kernel

On Thu, 18 Sep 2008 00:42:58 -0700 (PDT) Martin Knoblauch <knobi@knobisoft.de> wrote:

> ----- Original Message ----
> 
> > From: Andrew Morton <akpm@linux-foundation.org>
> > To: Greg Banks <gnb@melbourne.sgi.com>
> > Cc: Martin Knoblauch <knobi@knobisoft.de>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> > Sent: Thursday, September 18, 2008 5:13:34 AM
> > Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> > 
> > On Thu, 18 Sep 2008 11:42:54 +1000 Greg Banks wrote:
> > 
> > > I think having a tunable for client readahead is an excellent idea,
> > > although not to solve your particular problem.  The SLES10 kernel has a
> > > patch which does precisely that, perhaps Neil could post it.
> > > 
> > > I don't think there's a lot of point having both a module parameter and
> > > a sysctl.
> > 
> > mount -o remount,readahead=42
> 
> [root@lpsdm52 ~]# mount -o remount,readahead=42 /net/spsdms/fs13
> Bad nfs mount parameter: readahead
> [root@lpsdm52 ~]# mount -o readahead=42 /net/spsdms/fs13
> Bad nfs mount parameter: readahead
> 
> 
>  I assume the reply was meant to say that the correct way of introducing a modifiable readahead size is to implement it as a mount option? :-)

Yes.

> I considered it, but it seems to be more intrusive than the workaround patch. It also needs changes to userspace tools - correct?

No.  mount(8) will pass unrecognised options straight down into the
filesystem driver.

It's better this way - it allows the tunable to be set on a per-mount
basis rather than machine-wide.

Note that for block devices, readahead is a per-backing_dev_info thing
(and a backing_dev_info has a 1:1 relationship to a disk drive for sane
setups).  

And the NFS client maintains a backing_dev_info, which appears to map
onto a server, so making the NFS readahead a per-backing_dev_info (ie:
per server) thing might make sense.  Maybe nfs makes per-server information
manipulatable down in sysfs somewhere..
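
As a rough illustration of the point above: the comma-separated option string reaches the filesystem, which then has to split it into bare flags and key=value pairs. The sketch below mimics that step in Python. It is not the kernel's parser, and `readahead` is the hypothetical option discussed in this thread, not one the NFS client is known to accept. The default of 15 is the value quoted elsewhere in the thread.

```python
def parse_mount_options(option_string):
    """Split a mount(8)-style option string into flags and key=value pairs."""
    flags = []
    values = {}
    for token in option_string.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate empty fields such as "a,,b"
        key, sep, value = token.partition("=")
        if sep:
            values[key] = value
        else:
            flags.append(key)
    return flags, values


def readahead_from_options(option_string, default=15):
    """Return the hypothetical 'readahead' value, or the default if absent."""
    _, values = parse_mount_options(option_string)
    raw = values.get("readahead")
    if raw is None:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError("readahead must not be negative")
    return value
```

Because the value is taken from the option string of one mount, two mounts can carry different settings, which is the per-mount property argued for above.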



* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-21 12:50 Martin Knoblauch
@ 2008-09-21 13:53 ` Chuck Lever
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2008-09-21 13:53 UTC (permalink / raw)
  To: Martin Knoblauch
  Cc: Andrew Morton, Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra

On Sun, Sep 21, 2008 at 8:50 AM, Martin Knoblauch <knobi@knobisoft.de> wrote:
> ----- Original Message ----
>
>> From: Chuck Lever <chucklever@gmail.com>
>> To: Martin Knoblauch <knobi@knobisoft.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>; Greg Banks <gnb@melbourne.sgi.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org; Peter zijlstra <a.p.zijlstra@chello.nl>
>> Sent: Thursday, September 18, 2008 8:24:42 PM
>> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>>
>> On Thu, Sep 18, 2008 at 6:53 AM, Martin Knoblauch wrote:
>> > ----- Original Message ----
>> >
>> >> From: Andrew Morton
>> >> To: Martin Knoblauch
>> >> Cc: Greg Banks; linux-nfs list; linux-kernel@vger.kernel.org; Peter zijlstra
>> >> Sent: Thursday, September 18, 2008 10:47:33 AM
>> >> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>> >>
>> >> On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch
>> >> wrote:
>> >>
>> >> > > No.  mount(8) will pass unrecognised options straight down into the
>> >> > > filesystem driver.
>> >> > >
>> >> >
>> >> >  Has that always been the case, or is it a recent change? I have to support
>> >> >  RHEL4 userland, which is not really new.
>> >>
>> >> It's been that way for ever and ever.  It's how all these guys:
>> >>
>> >> y:/usr/src/25> grep Opt_ fs/*/super.c|wc
>> >>     781    2626   33703
>> >>
>> >> get handled.
>> >
>> >  While that does not seem too complicated, I seem to have a problem passing
>> > the mount options to the kernel. They come down as mount data version "6".
>> > Apparently mount(8) or mount.nfs(8) is doing the parsing and sending down the
>> > legacy data block. So, what is the minimum version of mount or mount.nfs that
>> > passes the options down unaltered?
>>
>> The mount command has passed a string of options to the kernel for
>> particular file systems for a while, but the facility for the NFS
>> client to parse a string of mount options in the kernel was added only
>> recently -- at least 2.6.23 or 2.6.24 is required to support this.
>> Before this, the mount command parsed these options.
>>
>
>  I understand that. The question remains: which version of the mount(8) or mount.nfs(8) command do I need in order to pass the options to the kernel?

You can use the latest version of nfs-utils, which is 1.1.3 to get a
version of mount.nfs that passes a string of mount options.

You should probably also replace the mount command with the latest
version from the util-linux package to get a version that starts the
mount.nfs subcommand instead of trying to do an NFS mount itself.

This is not needed for experimentation, though.  You can issue
mount.nfs directly.

--
Chuck Lever


* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-21 12:53 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-21 12:53 UTC (permalink / raw)
  To: Peter Staubach, Chuck Lever
  Cc: Andrew Morton, Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra



> >
> > I agree that a mount option would allow more fine-grained control over
> > readahead.  A system wide parameter controlling readahead has always
> > been a weakness.  Readahead, as implemented in the VFS, has a
> > *per-file descriptor* context, however, which operates automatically
> > (and can be tuned at run-time by an application with [mf]advise(2)).
> >
> > As a future feature, this might work in better combination with the
> > per-mount bdi changes proposed by Peter to provide maximal flexibility
> > without exposing yet another confusing knob that could help some
> > workloads but hurt others.
> 
> And perhaps add some dynamic tuning capabilities to the NFS client
> code to just make it do "the right thing".  This would be better
> than any tunables and would help to serve in other situations, such
> as high bandwidth/latency networks, overloaded servers who don't
> need more read-ahead READ requests piled on, etc...
> 

 This goes beyond my capabilities, but it would certainly help the situation. But then I would hate to see Sun/Linux getting off the hook just because Linux played nice :-)

Cheers
Martin



* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-21 12:50 Martin Knoblauch
  2008-09-21 13:53 ` Chuck Lever
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-21 12:50 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Andrew Morton, Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra

----- Original Message ----

> From: Chuck Lever <chucklever@gmail.com>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Greg Banks <gnb@melbourne.sgi.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org; Peter zijlstra <a.p.zijlstra@chello.nl>
> Sent: Thursday, September 18, 2008 8:24:42 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> On Thu, Sep 18, 2008 at 6:53 AM, Martin Knoblauch wrote:
> > ----- Original Message ----
> >
> >> From: Andrew Morton 
> >> To: Martin Knoblauch 
> >> Cc: Greg Banks; linux-nfs list; linux-kernel@vger.kernel.org; Peter zijlstra
> >> Sent: Thursday, September 18, 2008 10:47:33 AM
> >> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> >>
> >> On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch
> >> wrote:
> >>
> >> > > No.  mount(8) will pass unrecognised options straight down into the
> >> > > filesystem driver.
> >> > >
> >> >
> >> >  Has that always been the case, or is it a recent change? I have to support
> >> >  RHEL4 userland, which is not really new.
> >>
> >> It's been that way for ever and ever.  It's how all these guys:
> >>
> >> y:/usr/src/25> grep Opt_ fs/*/super.c|wc
> >>     781    2626   33703
> >>
> >> get handled.
> >
> >  While that does not seem too complicated, I seem to have a problem passing
> > the mount options to the kernel. They come down as mount data version "6".
> > Apparently mount(8) or mount.nfs(8) is doing the parsing and sending down the
> > legacy data block. So, what is the minimum version of mount or mount.nfs that
> > passes the options down unaltered?
> 
> The mount command has passed a string of options to the kernel for
> particular file systems for a while, but the facility for the NFS
> client to parse a string of mount options in the kernel was added only
> recently -- at least 2.6.23 or 2.6.24 is required to support this.
> Before this, the mount command parsed these options.
>

 I understand that. The question remains: which version of the mount(8) or mount.nfs(8) command do I need in order to pass the options to the kernel?
 
> For RHEL 4, based on 2.6.9, you are stuck.  It uses a binary structure
> whose fields must match between the kernel and user space.  For RH
> enterprise kernels, the ABI cannot change in a given release, so RH
> wouldn't take a patch to change the data structure that mount uses.
> You would have to maintain such a change yourself, and build your own
> kernels and mount command after each RHEL 4 update is released.
> 

 For implementation/testing purposes I have some freedom in what userland I run. My systems are mainly RHEL4, but I have long since junked the 2.6.9-based kernels.

> I agree that a mount option would allow more fine-grained control over
> readahead.  A system wide parameter controlling readahead has always
> been a weakness.  Readahead, as implemented in the VFS, has a
> *per-file descriptor* context, however, which operates automatically
> (and can be tuned at run-time by an application with [mf]advise(2)).
>

 The per-file descriptor approach is not applicable in my case, as I do not have control over the programs accessing the files.
 
> As a future feature, this might work in better combination with the
> per-mount bdi changes proposed by Peter to provide maximal flexibility
> without exposing yet another confusing knob that could help some
> workloads but hurt others.
> 


 That's why I am open to experimenting, but I need to know which userland tools support passing mount options directly to the NFS client.

Cheers
Martin



* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18 18:24 ` Chuck Lever
@ 2008-09-18 19:03   ` Peter Staubach
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Staubach @ 2008-09-18 19:03 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Martin Knoblauch, Andrew Morton, Greg Banks, linux-nfs list,
	linux-kernel, Peter zijlstra

Chuck Lever wrote:
> On Thu, Sep 18, 2008 at 6:53 AM, Martin Knoblauch <knobi@knobisoft.de> wrote:
>   
>> ----- Original Message ----
>>
>>     
>>> From: Andrew Morton <akpm@linux-foundation.org>
>>> To: Martin Knoblauch <knobi@knobisoft.de>
>>> Cc: Greg Banks <gnb@melbourne.sgi.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org; Peter zijlstra <a.p.zijlstra@chello.nl>
>>> Sent: Thursday, September 18, 2008 10:47:33 AM
>>> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>>>
>>> On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch
>>> wrote:
>>>
>>>       
>>>>> No.  mount(8) will pass unrecognised options straight down into the
>>>>> filesystem driver.
>>>>>
>>>>>           
>>>>  Has that always been the case, or is it a recent change? I have to support
>>>>  RHEL4 userland, which is not really new.
>>>
>>> It's been that way for ever and ever.  It's how all these guys:
>>>
>>> y:/usr/src/25> grep Opt_ fs/*/super.c|wc
>>>     781    2626   33703
>>>
>>> get handled.
>>>       
>>  While that does not seem too complicated, I seem to have a problem passing the mount options to the kernel. They come down as mount data version "6". Apparently mount(8) or mount.nfs(8) is doing the parsing and sending down the legacy data block. So, what is the minimum version of mount or mount.nfs that passes the options down unaltered?
>>     
>
> The mount command has passed a string of options to the kernel for
> particular file systems for a while, but the facility for the NFS
> client to parse a string of mount options in the kernel was added only
> recently -- at least 2.6.23 or 2.6.24 is required to support this.
> Before this, the mount command parsed these options.
>
> For RHEL 4, based on 2.6.9, you are stuck.  It uses a binary structure
> whose fields must match between the kernel and user space.  For RH
> enterprise kernels, the ABI cannot change in a given release, so RH
> wouldn't take a patch to change the data structure that mount uses.
> You would have to maintain such a change yourself, and build your own
> kernels and mount command after each RHEL 4 update is released.
>
> I agree that a mount option would allow more fine-grained control over
> readahead.  A system wide parameter controlling readahead has always
> been a weakness.  Readahead, as implemented in the VFS, has a
> *per-file descriptor* context, however, which operates automatically
> (and can be tuned at run-time by an application with [mf]advise(2)).
>
> As a future feature, this might work in better combination with the
> per-mount bdi changes proposed by Peter to provide maximal flexibility
> without exposing yet another confusing knob that could help some
> workloads but hurt others.

And perhaps add some dynamic tuning capabilities to the NFS client
code to just make it do "the right thing".  This would be better
than any tunables and would help to serve in other situations, such
as high bandwidth/latency networks, overloaded servers who don't
need more read-ahead READ requests piled on, etc...

       ps


* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18 11:53 Martin Knoblauch
@ 2008-09-18 18:24 ` Chuck Lever
  2008-09-18 19:03   ` Peter Staubach
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2008-09-18 18:24 UTC (permalink / raw)
  To: Martin Knoblauch
  Cc: Andrew Morton, Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra

On Thu, Sep 18, 2008 at 6:53 AM, Martin Knoblauch <knobi@knobisoft.de> wrote:
> ----- Original Message ----
>
>> From: Andrew Morton <akpm@linux-foundation.org>
>> To: Martin Knoblauch <knobi@knobisoft.de>
>> Cc: Greg Banks <gnb@melbourne.sgi.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org; Peter zijlstra <a.p.zijlstra@chello.nl>
>> Sent: Thursday, September 18, 2008 10:47:33 AM
>> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>>
>> On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch
>> wrote:
>>
>> > > No.  mount(8) will pass unrecognised options straight down into the
>> > > filesystem driver.
>> > >
>> >
>> >  Has that always been the case, or is it a recent change? I have to support
>> >  RHEL4 userland, which is not really new.
>>
>> It's been that way for ever and ever.  It's how all these guys:
>>
>> y:/usr/src/25> grep Opt_ fs/*/super.c|wc
>>     781    2626   33703
>>
>> get handled.
>
>  While that does not seem too complicated, I seem to have a problem passing the mount options to the kernel. They come down as mount data version "6". Apparently mount(8) or mount.nfs(8) is doing the parsing and sending down the legacy data block. So, what is the minimum version of mount or mount.nfs that passes the options down unaltered?

The mount command has passed a string of options to the kernel for
particular file systems for a while, but the facility for the NFS
client to parse a string of mount options in the kernel was added only
recently -- at least 2.6.23 or 2.6.24 is required to support this.
Before this, the mount command parsed these options.

For RHEL 4, based on 2.6.9, you are stuck.  It uses a binary structure
whose fields must match between the kernel and user space.  For RH
enterprise kernels, the ABI cannot change in a given release, so RH
wouldn't take a patch to change the data structure that mount uses.
You would have to maintain such a change yourself, and build your own
kernels and mount command after each RHEL 4 update is released.

I agree that a mount option would allow more fine-grained control over
readahead.  A system wide parameter controlling readahead has always
been a weakness.  Readahead, as implemented in the VFS, has a
*per-file descriptor* context, however, which operates automatically
(and can be tuned at run-time by an application with [mf]advise(2)).

As a future feature, this might work in better combination with the
per-mount bdi changes proposed by Peter to provide maximal flexibility
without exposing yet another confusing knob that could help some
workloads but hurt others.
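
For the per-file-descriptor tuning mentioned above, a minimal sketch of what an application can do itself, here through Python's `os.posix_fadvise` wrapper. The helper names are made up for illustration. The advice is only a hint; how much it changes readahead is up to the kernel and the filesystem, and it is of no use when the application cannot be modified.

```python
import os


def open_with_sequential_hint(path):
    """Open a file read-only and hint that it will be read sequentially."""
    fd = os.open(path, os.O_RDONLY)
    # os.posix_fadvise is only available on some Unix platforms.
    if hasattr(os, "posix_fadvise"):
        # Offset 0 and length 0 mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return fd


def read_all(path, chunk_size=1 << 20):
    """Read a file in chunks after giving the sequential-access hint."""
    fd = open_with_sequential_hint(path)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```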

--
Chuck Lever


* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18  8:38 Martin Knoblauch
  2008-09-18  8:47 ` Andrew Morton
@ 2008-09-18 13:20 ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2008-09-18 13:20 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: Andrew Morton, Greg Banks, linux-nfs list, linux-kernel

On Thu, 2008-09-18 at 01:38 -0700, Martin Knoblauch wrote:

>  I believe Peter wanted to add per bdi stuff for nfs some time ago. Not sure what came out of it.

$ mount localhost:/ /mnt/tmp
$ grep nfs /proc/$$/mountinfo
21 17 0:17 / /var/lib/nfs/rpc_pipefs rw - rpc_pipefs rpc_pipefs rw
31 13 0:20 / /proc/fs/nfsd rw - nfsd nfsd rw
37 17 0:22 / /mnt/tmp rw - nfs localhost:/ rw,vers=3,rsize=65536,wsize=65536,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountproto=tcp,addr=127.0.0.1
$ ls -la /sys/class/bdi/0\:22/
total 0
drwxr-xr-x  3 root root    0 2008-09-18 15:16 .
drwxr-xr-x 21 root root    0 2008-09-18 15:16 ..
-rw-r--r--  1 root root 4096 2008-09-18 15:19 max_ratio
-rw-r--r--  1 root root 4096 2008-09-18 15:19 min_ratio
drwxr-xr-x  2 root root    0 2008-09-18 15:19 power
-rw-r--r--  1 root root 4096 2008-09-18 15:19 read_ahead_kb
lrwxrwxrwx  1 root root    0 2008-09-18 15:19 subsystem -> ../../bdi
-rw-r--r--  1 root root 4096 2008-09-18 15:19 uevent
$ cat /sys/class/bdi/0\:22/read_ahead_kb
960
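
The lookup shown in the transcript above (mountinfo device id, then the matching directory under /sys/class/bdi) can be sketched as a small script. This is an illustration only: the function names are invented here, and the parsing assumes the mountinfo layout visible in the output above, with the filesystem type following the " - " separator.

```python
import os


def parse_mountinfo_line(line):
    """Return (device_id, mount_point, fs_type) from one mountinfo line."""
    # Fields before " - " are per-mount; fields after it describe the filesystem.
    left, _, right = line.partition(" - ")
    left_fields = left.split()
    right_fields = right.split()
    device_id = left_fields[2]    # "major:minor", e.g. "0:22"
    mount_point = left_fields[4]
    fs_type = right_fields[0]
    return device_id, mount_point, fs_type


def bdi_readahead_path(device_id, sysfs_root="/sys"):
    """Build the path of the per-bdi readahead file for a device id."""
    return os.path.join(sysfs_root, "class", "bdi", device_id, "read_ahead_kb")


def nfs_readahead_kb(mountinfo_path="/proc/self/mountinfo", sysfs_root="/sys"):
    """Map each NFS mount point to the contents of its read_ahead_kb file."""
    result = {}
    with open(mountinfo_path) as mountinfo:
        for line in mountinfo:
            device_id, mount_point, fs_type = parse_mountinfo_line(line)
            if fs_type not in ("nfs", "nfs4"):
                continue  # skip nfsd, rpc_pipefs and everything else
            with open(bdi_readahead_path(device_id, sysfs_root)) as knob:
                result[mount_point] = int(knob.read().strip())
    return result
```

Since the file is listed as writable by root in the directory listing above, writing a new number to it is presumably how the value would be changed at run time.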






* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-18 11:53 Martin Knoblauch
  2008-09-18 18:24 ` Chuck Lever
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-18 11:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra

----- Original Message ----

> From: Andrew Morton <akpm@linux-foundation.org>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Greg Banks <gnb@melbourne.sgi.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org; Peter zijlstra <a.p.zijlstra@chello.nl>
> Sent: Thursday, September 18, 2008 10:47:33 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch 
> wrote:
> 
> > > No.  mount(8) will pass unrecognised options straight down into the
> > > filesystem driver.
> > >
> > 
> >  Has that always been the case, or is it a recent change? I have to support
> >  RHEL4 userland, which is not really new.
> 
> It's been that way for ever and ever.  It's how all these guys:
> 
> y:/usr/src/25> grep Opt_ fs/*/super.c|wc
>     781    2626   33703
> 
> get handled.

 While that does not seem too complicated, I seem to have a problem passing the mount options to the kernel. They come down as mount data version "6". Apparently mount(8) or mount.nfs(8) is doing the parsing and sending down the legacy data block. So, what is the minimum version of mount or mount.nfs that passes the options down unaltered?

Cheers
Martin



* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-18  9:32 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-18  9:32 UTC (permalink / raw)
  To: Greg Banks; +Cc: linux-nfs list, linux-kernel

----- Original Message ----

> From: Greg Banks <gnb@melbourne.sgi.com>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Thursday, September 18, 2008 10:45:01 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> Martin Knoblauch wrote:
> > ----- Original Message ----
> >
> >  
> >>
> >> I think having a tunable for client readahead is an excellent idea,
> >> although not to solve your particular problem.  The SLES10 kernel has a
> >> patch which does precisely that, perhaps Neil could post it.
> >>
> >> I don't think there's a lot of point having both a module parameter and
> >> a sysctl.
> >>
> >>    
> >
> >  Actually there is a good reason. The module parameter can be used to set the
> > new value at load time and never bother again. The sysctl is very convenient
> > when doing experiments.
> >  
> You can set module parameters after module load in
> /sys/module/$module/parameters.

 OK, one always learns new stuff :-)

> >  As Andrew already pointed out, the best solution would be a mount option.
> Yep.
> >  But that seems much more involved than my workaround patch.
> >
> >  
> Yep.

 Seeing Andrew's mails, it might be less involved than I thought :-) Will have a look.

> >> A maximum of 15 is unwise.  I've found that (at least with the older
> >> readahead mechanisms in SLES10) a multiple of 4 is required to preserve
> >> rsize-alignment of READ rpcs to the server, which helps a lot with wide
> >> RAID backends.  So in SGI we tune client readahead to 16.
> >>
> >>    
> >
> >  15 is the value that the Linux NFS client uses, at least since 2.6.3.
> It's a silly value.

 I was not sure whether it was just arbitrary, or caused by some other internal limit. The definition made it look like it was somehow related to RPC_DEF_SLOT_TABLE.

> > As it is not tunable up to today, the comment seems moot :-)  But it opens the
> > questions:
> >
> > a) should 1 be the minimum, or 0?
> >  
> Turning off client RA entirely is potentially useful.
> > b) can the backing_dev_info.ra_pages field safely be set to something higher
> > than 15?
> >  
> Yes.  Did I mention 16 ?

 You also mentioned that you do that at SGI. So, who knows what else has been changed there :-) Anyway, thanks for the hint.

> >  
> >> Your patch seems to have a bunch of other unrelated stuff mixed in.
> >>
> >>    
> >
> >  Yeah, someone already pointed out that the Makefile hunk does not belong
> > there. But you say "a bunch" - anything else?
> >  
> I rapidly scrolled past some stuff about 64bit inodes.

 Ahh. Now I remember... when I implemented the module parameter, I found that "enable_ino64" lacks a description. Yeah, that should be a separate patch.

Cheers
Martin
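
A worked check of the numbers in this thread: the read_ahead_kb value of 960 reported for a mount with rsize=65536 is what a multiplier of 15 gives if the readahead window is the multiplier times the rsize expressed in pages. Both that formula and the 4 KiB page size are assumptions made for this sketch, not something stated in the thread.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def readahead_kb(rsize_bytes, multiplier, page_size=PAGE_SIZE):
    """Readahead window in KiB for a given rsize and multiplier."""
    pages_per_read = -(-rsize_bytes // page_size)  # ceiling division
    window_pages = multiplier * pages_per_read
    return window_pages * page_size // 1024


if __name__ == "__main__":
    for multiplier in (15, 16):
        print(multiplier, readahead_kb(65536, multiplier))
```

Under the same assumptions a multiplier of 16 would give a window of 1024 KiB for that rsize.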



* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18  8:47 ` Andrew Morton
@ 2008-09-18  8:57   ` Greg Banks
  0 siblings, 0 replies; 29+ messages in thread
From: Greg Banks @ 2008-09-18  8:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Martin Knoblauch, linux-nfs list, linux-kernel, Peter zijlstra

Andrew Morton wrote:
> On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch <knobi@knobisoft.de> wrote:
>
>   
>>> No.  mount(8) will pass unrecognised options straight down into the
>>> filesystem driver.
>>>
>>>       
>>  Has that always been the case, or is it a recent change? I have to support RHEL4 userland, which is not really new.
>>     
>
> It's been that way for ever and ever.  It's how all these guys:
>
> y:/usr/src/25> grep Opt_ fs/*/super.c|wc
>     781    2626   33703
>
> get handled.
>   
NFS was...special...for a long time.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
Be like the squirrel.
I don't speak for SGI.



* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18  8:38 Martin Knoblauch
@ 2008-09-18  8:47 ` Andrew Morton
  2008-09-18  8:57   ` Greg Banks
  2008-09-18 13:20 ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2008-09-18  8:47 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra

On Thu, 18 Sep 2008 01:38:57 -0700 (PDT) Martin Knoblauch <knobi@knobisoft.de> wrote:

> > No.  mount(8) will pass unrecognised options straight down into the
> > filesystem driver.
> >
> 
>  Has that always been the case, or is it a recent change? I have to support RHEL4 userland, which is not really new.

It's been that way for ever and ever.  It's how all these guys:

y:/usr/src/25> grep Opt_ fs/*/super.c|wc
    781    2626   33703

get handled.


* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18  8:19 Martin Knoblauch
@ 2008-09-18  8:45 ` Greg Banks
  0 siblings, 0 replies; 29+ messages in thread
From: Greg Banks @ 2008-09-18  8:45 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: linux-nfs list, linux-kernel

Martin Knoblauch wrote:
> ----- Original Message ----
>
>   
>>
>> I think having a tunable for client readahead is an excellent idea,
>> although not to solve your particular problem.  The SLES10 kernel has a
>> patch which does precisely that, perhaps Neil could post it.
>>
>> I don't think there's a lot of point having both a module parameter and
>> a sysctl.
>>
>>     
>
>  Actually there is a good reason. The module parameter can be used to set the new value at load time and never bother again. The sysctl is very convenient when doing experiments.
>   
You can set module parameters after module load in
/sys/module/$module/parameters.
>  As Andrew already pointed out, the best solution would be a mount option.
Yep.
>  But that seems much more involved than my workaround patch.
>
>   
Yep.
>> A maximum of 15 is unwise.  I've found that (at least with the older
>> readahead mechanisms in SLES10) a multiple of 4 is required to preserve
>> rsize-alignment of READ rpcs to the server, which helps a lot with wide
>> RAID backends.  So in SGI we tune client readahead to 16.
>>
>>     
>
>  15 is the value that the Linux NFS client uses, at least since 2.6.3.
It's a silly value.
> As it is not tunable up to today, the comment seems moot :-)  But it opens the questions:
>
> a) should 1 be the minimum, or 0?
>   
Turning off client RA entirely is potentially useful.
> b) can the backing_dev_info.ra_pages field safely be set to something higher than 15?
>   
Yes.  Did I mention 16 ?
>   
>> Your patch seems to have a bunch of other unrelated stuff mixed in.
>>
>>     
>
>  Yeah, someone already pointed out, that the Makefile hunk does not belong there. But you say "a bunch" - anything else?
>   
I rapidly scrolled past some stuff about 64bit inodes.
> Cheers
> Martin
> PS: Did we ever meet/mail when I was at SGI (1991-1997)?
>
>   
I'm more recent than that.
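
The /sys/module/$module/parameters interface mentioned above can be sketched as follows. The helper names are invented for this example, and the `sysfs_root` argument exists only so the code can be tried against a scratch directory. Whether a given parameter may be changed after load depends on the permissions the module declared for it.

```python
import os


def module_parameter_path(module, parameter, sysfs_root="/sys"):
    """Build the sysfs path of a module parameter."""
    return os.path.join(sysfs_root, "module", module, "parameters", parameter)


def read_module_parameter(module, parameter, sysfs_root="/sys"):
    """Return the current value of a module parameter as a string."""
    with open(module_parameter_path(module, parameter, sysfs_root)) as handle:
        return handle.read().strip()


def write_module_parameter(module, parameter, value, sysfs_root="/sys"):
    """Write a new value; on a real system this normally requires root."""
    with open(module_parameter_path(module, parameter, sysfs_root), "w") as handle:
        handle.write(str(value))
```

For instance, `read_module_parameter("nfs", "enable_ino64")` would read the parameter that comes up elsewhere in this thread, provided the nfs module is loaded and exports it.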

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
Be like the squirrel.
I don't speak for SGI.



* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-18  8:38 Martin Knoblauch
  2008-09-18  8:47 ` Andrew Morton
  2008-09-18 13:20 ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-18  8:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Greg Banks, linux-nfs list, linux-kernel, Peter zijlstra

----- Original Message ----

> From: Andrew Morton <akpm@linux-foundation.org>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Greg Banks <gnb@melbourne.sgi.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Thursday, September 18, 2008 10:18:18 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> On Thu, 18 Sep 2008 00:42:58 -0700 (PDT) Martin Knoblauch 
> wrote:
> 
> > ----- Original Message ----
> > 
> > > From: Andrew Morton 
> > > To: Greg Banks 
> > > Cc: Martin Knoblauch; linux-nfs list; linux-kernel@vger.kernel.org
> > > Sent: Thursday, September 18, 2008 5:13:34 AM
> > > Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> > > 
> > > On Thu, 18 Sep 2008 11:42:54 +1000 Greg Banks wrote:
> > > 
> > > > I think having a tunable for client readahead is an excellent idea,
> > > > although not to solve your particular problem.  The SLES10 kernel has a
> > > > patch which does precisely that, perhaps Neil could post it.
> > > > 
> > > > I don't think there's a lot of point having both a module parameter and
> > > > a sysctl.
> > > 
> > > mount -o remount,readahead=42
> > 
> > [root@lpsdm52 ~]# mount -o remount,readahead=42 /net/spsdms/fs13
> > Bad nfs mount parameter: readahead
> > [root@lpsdm52 ~]# mount -o readahead=42 /net/spsdms/fs13
> > Bad nfs mount parameter: readahead
> > 
> > 
> >  I assume the reply was meant to say that the correct way of introducing a
> > modifiable readahead size is to implement it as a mount option? :-)
> 
> Yes.
>

:-)
 
> > I considered it, but it seems to be more intrusive than the workaround patch.
> > It also needs changes to userspace tools - correct?
> 
> No.  mount(8) will pass unrecognised options straight down into the
> filesystem driver.
>

 Has that always been the case, or is it a recent change? I have to support RHEL4 userland, which is not really new.
 
> It's better this way - it allows the tunable to be set on a per-mount
> basis rather than machine-wide.
>

 No question about that. I just thought it would be too complicated. Maybe I erred.
 
> Note that for block devices, readahead is a per-backing_dev_info thing
> (and a backing_dev_info has a 1:1 relationship to a disk drive for sane
> setups).  
> 
> And the NFS client maintains a backing_dev_info, which appears to map
> onto a server, so making the NFS readahead a per-backing_dev_info (ie:
> per server) thing might make sense.  Maybe nfs makes per-server information
> manipulatable down in sysfs somewhere..

 I believe Peter wanted to add per bdi stuff for nfs some time ago. Not sure what came out of it.

Cheers
Martin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-18  8:19 Martin Knoblauch
  2008-09-18  8:45 ` Greg Banks
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-18  8:19 UTC (permalink / raw)
  To: Greg Banks; +Cc: linux-nfs list, linux-kernel

----- Original Message ----

> From: Greg Banks <gnb@melbourne.sgi.com>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Thursday, September 18, 2008 3:42:54 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> Martin Knoblauch wrote:
> > Hi,
> >
> > the following/attached patch works around a [obscure] problem when an 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chance for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I just will have to maintain it for eternity out of tree.
> >
> > The problem: SAM-FS is Suns proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media like e.g. tape. This is completely transparent to the users. If the date for an "offline" file is needed, the so called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if an Linux NFS client tries to read such an offline file, performance drops to "extremely slow".
> By "extremely slow" do you mean "tape read speed"?
> > After lengthly investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.
> > 
Hi Greg,

 my impression is that there is some confusion here, likely caused by me not writing a good description :-(
 
> So, you need to
> 
> a) make your stager daemon do IO more sensibly, and
>

 As I am not affiliated with Sun in any way, it is "their" stager daemon. And I told "them", but a solution will not come before the next major release :-(
 
> b) apply something like this patch which adds O_NONBLOCK when knfsd does
> reads writes and truncates and translates -EAGAIN into NFS3ERR_JUKEBOX
> 
> http://kerneltrap.org/mailarchive/linux-fsdevel/2006/5/5/312567
> 

 OK, what has knfsd to do with it? The NFS server is Solaris-10 on Sparc.

> and
> 
> c) make your filesystem IO interposing layer report -EAGAIN when a
> process tries to do IO to an offline region in a file and O_NONBLOCK is
> present.

 I leave that to "them" :-)

> > - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
> > - The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing a NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
> >  
> I think having a tunable for client readahead is an excellent idea,
> although not to solve your particular problem.  The SLES10 kernel has a
> patch which does precisely that, perhaps Neil could post it.
> 
> I don't think there's a lot of point having both a module parameter and
> a sysctl.
> 

 Actually there is a good reason. The module parameter can be used to set the new value at load time and never bother again. The sysctl is very convenient when doing experiments.

 As Andrew already pointed out, the best solution would be a mount option. But that seems much more involved than my workaround patch.

> A maximum of 15 is unwise.  I've found that (at least with the older
> readahead mechanisms in SLES10) a multiple of 4 is required to preserve
> rsize-alignment of READ rpcs to the server, which helps a lot with wide
> RAID backends.  So in SGI we tune client readahead to 16.
>

 15 is the value that the Linux NFS client uses, at least since 2.6.3. As it is not tunable to this day, the comment seems moot :-)  But it raises two questions:

a) should 1 be the minimum, or 0?
b) can the backing_dev_info.ra_pages field safely be set to something higher than 15?
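
To make the numbers in this exchange concrete, here is a small C sketch of the readahead arithmetic. It mirrors the rpages and ra_pages lines from the patch and uses the relation NFS_MAX_READAHEAD = RPC_DEF_SLOT_TABLE - 1 = 15 stated elsewhere in the thread. The helper names are hypothetical, and the "multiple of 4" check is only one reading of Greg's alignment remark, not kernel code.

```c
/* Default readahead factor as discussed in the thread (15 = 16 - 1). */
#define RPC_DEF_SLOT_TABLE 16U
#define NFS_MAX_READAHEAD  (RPC_DEF_SLOT_TABLE - 1)

/* Pages needed to hold one READ of rsize bytes, rounded up.
 * Same arithmetic as the rpages line in the patch, written with a
 * division instead of PAGE_CACHE_SHIFT (hypothetical helper). */
static unsigned int rpages_for(unsigned int rsize, unsigned int page_size)
{
    return (rsize + page_size - 1) / page_size;
}

/* Readahead window in pages: rpages * ra_factor, which is the value the
 * patch stores in backing_dev_info.ra_pages (hypothetical helper). */
static unsigned int ra_pages_for(unsigned int rsize, unsigned int page_size,
                                 unsigned int ra_factor)
{
    return rpages_for(rsize, page_size) * ra_factor;
}

/* Assumption: reads Greg's remark as "the factor should be a multiple
 * of 4". Returns 1 if that holds, 0 otherwise. */
static int ra_factor_is_aligned(unsigned int ra_factor)
{
    return ra_factor % 4 == 0;
}
```

Assuming rsize=32768 and 4096-byte pages (illustrative values, not from the thread), ra_pages_for(32768, 4096, 15) is 8 * 15 = 120 pages, i.e. 480 KiB of readahead, while a factor of 1 gives 8 pages. Under the multiple-of-4 reading, 15 does not pass the check and 16 does.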

> Your patch seems to have a bunch of other unrelated stuff mixed in.
> 

 Yeah, someone already pointed out that the Makefile hunk does not belong there. But you say "a bunch" - anything else?

Cheers
Martin
PS: Did we ever meet/mail when I was at SGI (1991-1997)?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-18  1:42 ` Greg Banks
@ 2008-09-18  3:13   ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2008-09-18  3:13 UTC (permalink / raw)
  To: Greg Banks; +Cc: Martin Knoblauch, linux-nfs list, linux-kernel

On Thu, 18 Sep 2008 11:42:54 +1000 Greg Banks <gnb@melbourne.sgi.com> wrote:

> I think having a tunable for client readahead is an excellent idea,
> although not to solve your particular problem.  The SLES10 kernel has a
> patch which does precisely that, perhaps Neil could post it.
> 
> I don't think there's a lot of point having both a module parameter and
> a sysctl.

mount -o remount,readahead=42

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-17 13:06 Martin Knoblauch
  2008-09-17 14:06 ` Peter Staubach
@ 2008-09-18  1:42 ` Greg Banks
  2008-09-18  3:13   ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Greg Banks @ 2008-09-18  1:42 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: linux-nfs list, linux-kernel

Martin Knoblauch wrote:
> Hi,
>
> the following/attached patch works around a [obscure] problem when an 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chance for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I just will have to maintain it for eternity out of tree.
>
> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media like e.g. tape. This is completely transparent to the users. If the date for an "offline" file is needed, the so called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if an Linux NFS client tries to read such an offline file, performance drops to "extremely slow". 
By "extremely slow" do you mean "tape read speed"?
> After lengthly investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.
>   
So, you need to

a) make your stager daemon do IO more sensibly, and

b) apply something like this patch which adds O_NONBLOCK when knfsd does
reads writes and truncates and translates -EAGAIN into NFS3ERR_JUKEBOX

http://kerneltrap.org/mailarchive/linux-fsdevel/2006/5/5/312567

and

c) make your filesystem IO interposing layer report -EAGAIN when a
process tries to do IO to an offline region in a file and O_NONBLOCK is
present.
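
The error translation in step b) can be sketched as follows. This is a userspace illustration only, not the knfsd patch behind the link: the function name is hypothetical, the status values are the NFSv3 protocol constants (NFS3ERR_JUKEBOX = 10008, NFS3ERR_IO = 5), and mapping every other error to NFS3ERR_IO is a simplifying assumption.

```c
#include <errno.h>

/* Subset of NFSv3 status codes, values as defined by the protocol. */
enum nfsstat3_subset {
    NFS3_OK         = 0,
    NFS3ERR_IO      = 5,
    NFS3ERR_JUKEBOX = 10008
};

/* Hypothetical helper: translate the result of a non-blocking read or
 * write into an NFSv3 status. Accepts kernel-style negative errno
 * values as well as positive ones. */
static int nfs3_status_from_errno(int err)
{
    if (err < 0)
        err = -err;

    switch (err) {
    case 0:
        return NFS3_OK;
    case EAGAIN:
        /* Data is offline and O_NONBLOCK was set: tell the client to
         * retry later instead of blocking the server thread. */
        return NFS3ERR_JUKEBOX;
    default:
        /* Simplification: a real server maps many more errors. */
        return NFS3ERR_IO;
    }
}
```

A client that receives NFS3ERR_JUKEBOX is expected to back off and retry, which is what would keep the server thread from waiting on the stager.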
> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
> - The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing a NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
>   
I think having a tunable for client readahead is an excellent idea,
although not to solve your particular problem.  The SLES10 kernel has a
patch which does precisely that, perhaps Neil could post it.

I don't think there's a lot of point having both a module parameter and
a sysctl.

A maximum of 15 is unwise.  I've found that (at least with the older
readahead mechanisms in SLES10) a multiple of 4 is required to preserve
rsize-alignment of READ rpcs to the server, which helps a lot with wide
RAID backends.  So in SGI we tune client readahead to 16.

Your patch seems to have a bunch of other unrelated stuff mixed in.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
Be like the squirrel.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 17:01 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 17:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Peter Staubach, linux-nfs list, linux-kernel

----- Original Message ----

> From: Chuck Lever <chucklever@gmail.com>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Peter Staubach <staubach@redhat.com>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Wednesday, September 17, 2008 6:43:48 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> On Wed, Sep 17, 2008 at 11:23 AM, Martin Knoblauch wrote:
> > ----- Original Message ----
> >
> >> From: Chuck Lever 
> >> To: Peter Staubach 
> >> Cc: Martin Knoblauch; linux-nfs list; linux-kernel@vger.kernel.org
> >> Sent: Wednesday, September 17, 2008 5:41:15 PM
> >> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> >>
> >> On Wed, Sep 17, 2008 at 9:06 AM, Peter Staubach wrote:
> >> > Martin Knoblauch wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> the following/attached patch works around a [obscure] problem when an 2.6
> >> >> (not sure/caring about 2.4) NFS client accesses an "offline" file on a
> >> >> Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS.
> >> >> Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux
> >> >> problem, but the chance for a short-/mid-term solution from Sun are very
> >> >> slim. So, being lazy, I would love to get this patch into Linux. If not, I
> >> >> just will have to maintain it for eternity out of tree.
> >> >>
> >> >> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores
> >> >> meta-data and a relatively small amount of data "online" on disk and pushes
> >> >> old or infrequently used data to "offline" media like e.g. tape. This is
> >> >> completely transparent to the users. If the date for an "offline" file is
> >> >> needed, the so called "stager daemon" copies it back from the offline
> >> >> medium. All of this works great most of the time. Now, if an Linux NFS
> >> >> client tries to read such an offline file, performance drops to "extremely
> >> >> slow". After lengthly investigation of tcp-dumps, mount options and
> >> >> procedures involving black cats at midnight, we found out that the readahead
> >> >> behaviour of the Linux NFS client causes the problem. Basically it seems to
> >> >> issue read requests up to 15*rsize to the server. In the case of the
> >> >> "offline" files, this behaviour causes heavy competition for the inode lock
> >> >> between the NFSD process and the stager daemon on the Solaris server.
> >> >>
> >> >> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks
> >> >> the problem, but a solution will need time. Lots of it.
> >> >> - The working solution: disable the client side readahead, or make it
> >> >> tunable. The patch does that by introducing a NFS module parameter
> >> >> "ra_factor" which can take values between 1 and 15 (default 15) and a
> >> >> tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
> >> >
> >> > Hi.
> >> >
> >> > I was curious if a design to limit or eliminate read-ahead
> >> > activity when the server returns EJUKEBOX was considered?
> >> > Unless one can know that the server and client can get into
> >> > this situation ahead of time, how would the tunable be used?
> >>
> >> I tend to agree.  A tunable is probably not a good solution in this case.
> >>
> >> I would bet that this lock contention issue is a problem in other more
> >> common cases, and would merit some careful analysis.
> >>
> >
> >  Are you talking wrt. a Solaris NFS-Server with SAM-FS/QFS as backend filesystem?
> 
> I misread your mail, and thought the inode lock contention issue was
> on the client.
> 

 No problem, maybe I was not articulating myself clearly.  Just to restate - the lock contention happens on the server.

Cheers
Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-17 16:23 Martin Knoblauch
@ 2008-09-17 16:43 ` Chuck Lever
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2008-09-17 16:43 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: Peter Staubach, linux-nfs list, linux-kernel

On Wed, Sep 17, 2008 at 11:23 AM, Martin Knoblauch <knobi@knobisoft.de> wrote:
> ----- Original Message ----
>
>> From: Chuck Lever <chucklever@gmail.com>
>> To: Peter Staubach <staubach@redhat.com>
>> Cc: Martin Knoblauch <knobi@knobisoft.de>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
>> Sent: Wednesday, September 17, 2008 5:41:15 PM
>> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>>
>> On Wed, Sep 17, 2008 at 9:06 AM, Peter Staubach wrote:
>> > Martin Knoblauch wrote:
>> >>
>> >> Hi,
>> >>
>> >> the following/attached patch works around a [obscure] problem when an 2.6
>> >> (not sure/caring about 2.4) NFS client accesses an "offline" file on a
>> >> Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS.
>> >> Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux
>> >> problem, but the chance for a short-/mid-term solution from Sun are very
>> >> slim. So, being lazy, I would love to get this patch into Linux. If not, I
>> >> just will have to maintain it for eternity out of tree.
>> >>
>> >> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores
>> >> meta-data and a relatively small amount of data "online" on disk and pushes
>> >> old or infrequently used data to "offline" media like e.g. tape. This is
>> >> completely transparent to the users. If the date for an "offline" file is
>> >> needed, the so called "stager daemon" copies it back from the offline
>> >> medium. All of this works great most of the time. Now, if an Linux NFS
>> >> client tries to read such an offline file, performance drops to "extremely
>> >> slow". After lengthly investigation of tcp-dumps, mount options and
>> >> procedures involving black cats at midnight, we found out that the readahead
>> >> behaviour of the Linux NFS client causes the problem. Basically it seems to
>> >> issue read requests up to 15*rsize to the server. In the case of the
>> >> "offline" files, this behaviour causes heavy competition for the inode lock
>> >> between the NFSD process and the stager daemon on the Solaris server.
>> >>
>> >> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks
>> >> the problem, but a solution will need time. Lots of it.
>> >> - The working solution: disable the client side readahead, or make it
>> >> tunable. The patch does that by introducing a NFS module parameter
>> >> "ra_factor" which can take values between 1 and 15 (default 15) and a
>> >> tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
>> >
>> > Hi.
>> >
>> > I was curious if a design to limit or eliminate read-ahead
>> > activity when the server returns EJUKEBOX was considered?
>> > Unless one can know that the server and client can get into
>> > this situation ahead of time, how would the tunable be used?
>>
>> I tend to agree.  A tunable is probably not a good solution in this case.
>>
>> I would bet that this lock contention issue is a problem in other more
>> common cases, and would merit some careful analysis.
>>
>
>  Are you talking wrt. a Solaris NFS-Server with SAM-FS/QFS as backend filesystem?

I misread your mail, and thought the inode lock contention issue was
on the client.

--
Chuck Lever

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 16:23 Martin Knoblauch
  2008-09-17 16:43 ` Chuck Lever
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 16:23 UTC (permalink / raw)
  To: Chuck Lever, Peter Staubach; +Cc: linux-nfs list, linux-kernel

----- Original Message ----

> From: Chuck Lever <chucklever@gmail.com>
> To: Peter Staubach <staubach@redhat.com>
> Cc: Martin Knoblauch <knobi@knobisoft.de>; linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Wednesday, September 17, 2008 5:41:15 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> On Wed, Sep 17, 2008 at 9:06 AM, Peter Staubach wrote:
> > Martin Knoblauch wrote:
> >>
> >> Hi,
> >>
> >> the following/attached patch works around a [obscure] problem when an 2.6
> >> (not sure/caring about 2.4) NFS client accesses an "offline" file on a
> >> Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS.
> >> Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux
> >> problem, but the chance for a short-/mid-term solution from Sun are very
> >> slim. So, being lazy, I would love to get this patch into Linux. If not, I
> >> just will have to maintain it for eternity out of tree.
> >>
> >> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores
> >> meta-data and a relatively small amount of data "online" on disk and pushes
> >> old or infrequently used data to "offline" media like e.g. tape. This is
> >> completely transparent to the users. If the date for an "offline" file is
> >> needed, the so called "stager daemon" copies it back from the offline
> >> medium. All of this works great most of the time. Now, if an Linux NFS
> >> client tries to read such an offline file, performance drops to "extremely
> >> slow". After lengthly investigation of tcp-dumps, mount options and
> >> procedures involving black cats at midnight, we found out that the readahead
> >> behaviour of the Linux NFS client causes the problem. Basically it seems to
> >> issue read requests up to 15*rsize to the server. In the case of the
> >> "offline" files, this behaviour causes heavy competition for the inode lock
> >> between the NFSD process and the stager daemon on the Solaris server.
> >>
> >> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks
> >> the problem, but a solution will need time. Lots of it.
> >> - The working solution: disable the client side readahead, or make it
> >> tunable. The patch does that by introducing a NFS module parameter
> >> "ra_factor" which can take values between 1 and 15 (default 15) and a
> >> tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
> >
> > Hi.
> >
> > I was curious if a design to limit or eliminate read-ahead
> > activity when the server returns EJUKEBOX was considered?
> > Unless one can know that the server and client can get into
> > this situation ahead of time, how would the tunable be used?
> 
> I tend to agree.  A tunable is probably not a good solution in this case.
> 
> I would bet that this lock contention issue is a problem in other more
> common cases, and would merit some careful analysis.
> 

 Are you talking wrt. a Solaris NFS-Server with SAM-FS/QFS as backend filesystem? In a lot of tests we have not observed any problems when accessing online files with the default readahead setting. The "offline" situation seems unique. As for other NFS servers, we have never observed any readahead-related problems either.

 As I already replied elsewhere, teaching the Linux NFS client to do "proper" readahead handling is beyond my knowledge. But I can test. I guess Sun engineering would be delighted if they don't have to fix their stuff :-)

Cheers
Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 16:15 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 16:15 UTC (permalink / raw)
  To: Michael Trimarchi, linux-nfs list; +Cc: linux-kernel

----- Original Message ----

> From: Michael Trimarchi <trimarchimichael@yahoo.it>
> To: Martin Knoblauch <knobi@knobisoft.de>; linux-nfs list <linux-nfs@vger.kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Sent: Wednesday, September 17, 2008 3:42:30 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> Hi,
> 
> ...
> 
> > Signed-off-by: Martin Knoblauch 
> > 
> > diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
> > --- linux-2.6.27-rc6-git4/fs/nfs/client.c       2008-09-17 11:35:21.000000000 +0200
> > +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c        2008-09-17 11:55:18.000000000 +0200
> > @@ -722,6 +722,11 @@ error:
> > }
> > 
> > /*
> > + * NFS Client Read-Ahead factor
> > +*/
> > +unsigned int nfs_ra_factor;
> > +
> > +/*
> >   * Load up the server record from information gained in an fsinfo record
> >   */
> > static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
> > @@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
> >                 server->rsize = NFS_MAX_FILE_IO_SIZE;
> >         server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> 
> > PAGE_CACHE_SHIFT;
> > 
> > -       server->backing_dev_info.ra_pages = server->rpages * 
> NFS_MAX_READAHEAD;
> > +       dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, \
> > +               nfs_ra_factor, ra_pages: %d %d %d %d %d\n",
> > +               server->rsize,server->wsize,server->rpages,
> > +               nfs_ra_factor,server->rpages * nfs_ra_factor);
> > +       server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;
> > 
> >         if (server->wsize > max_rpc_payload)
> >                 server->wsize = max_rpc_payload;
> > diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
> > --- linux-2.6.27-rc6-git4/fs/nfs/inode.c        2008-09-17 11:35:21.000000000 +0200
> > +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c 2008-09-17 11:45:09.000000000 +0200
> > @@ -53,6 +53,8 @@
> > 
> > /* Default is to see 64-bit inode numbers */
> > static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
> > +static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
> > +
> > 
> > static void nfs_invalidate_inode(struct inode *);
> > static int nfs_update_inode(struct inode *, struct nfs_fattr *);
> > @@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
> > #endif
> >         if ((err = register_nfs_fs()) != 0)
> >                 goto out;
> > +
> > +       if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
> > +               nfs_ra_factor = NFS_MAX_READAHEAD;
> > +       else
> > +               nfs_ra_factor = ra_factor;
> > +
> 
> So, I think that this is not necessary because it is done ( ... I hope) by the 
> proc_dointvec_minmax handler. It is correct?
> 

That is of course true if the tunable is changed via the /proc interface. The code above handles the module parameter, which is not governed by the minmax handler.
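
The load-time check discussed here can be sketched in isolation as below. The helper name is hypothetical (the patch open-codes the test in init_nfs_fs()), and NFS_MAX_READAHEAD is hard-coded to 15 for the example. Note that an out-of-range value falls back to the default rather than being clamped to the nearest bound, which differs from what a min/max sysctl handler would do.

```c
/* Default and upper bound of the readahead factor, as in the thread. */
#define NFS_MAX_READAHEAD 15U

/* Hypothetical helper mirroring the range check in the quoted hunk:
 * accept 1..NFS_MAX_READAHEAD, otherwise fall back to the default. */
static unsigned int nfs_ra_factor_from_param(unsigned int ra_factor)
{
    if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
        return NFS_MAX_READAHEAD;
    return ra_factor;
}
```

So a module parameter of 0 or 16 both end up as 15, while 1 through 15 are taken as given.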

Cheers
Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 16:10 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 16:10 UTC (permalink / raw)
  To: Jim Rees; +Cc: linux-nfs list, linux-kernel

Adding back LKML.


----- Original Message ----
> From: Jim Rees <rees@umich.edu>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: linux-nfs list <linux-nfs@vger.kernel.org>
> Sent: Wednesday, September 17, 2008 5:31:12 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> Martin Knoblauch wrote:
> 
>    We never needed that in our case. But yes, would be trivial. The question
>    is, whether there should be a maximum, just as a safeguard.
> 
> Yes.  The default should be (RPC_DEF_SLOT_TABLE - 1), and the maximum should
> be max(xprt_udp_slot_table_entries, xprt_tcp_slot_table_entries) (maybe
> minus one).
> 

The default is NFS_MAX_READAHEAD, which is (RPC_DEF_SLOT_TABLE - 1). Incidentally, your suggested maximum seems to be the same on a default setup (minus one applied).

> I wonder if it would make sense to adjust NFS_MAX_READAHEAD when
> xprt_*_slot_table_entries is changed via sysctl.

I am not sure how useful/practical this is, as currently the ra_factor is applied at mount time.

Cheers
Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 16:03 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 16:03 UTC (permalink / raw)
  To: Peter Staubach; +Cc: linux-nfs list, linux-kernel

----- Original Message ----

> From: Peter Staubach <staubach@redhat.com>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: linux-nfs list <linux-nfs@vger.kernel.org>; linux-kernel@vger.kernel.org
> Sent: Wednesday, September 17, 2008 4:06:44 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> Martin Knoblauch wrote:
> > Hi,
> >
> > the following/attached patch works around a [obscure] problem when an 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chance for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I just will have to maintain it for eternity out of tree.
> >
> > The problem: SAM-FS is Suns proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media like e.g. tape. This is completely transparent to the users. If the date for an "offline" file is needed, the so called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if an Linux NFS client tries to read such an offline file, performance drops to "extremely slow". After lengthly investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.
> >
> > - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
> > - The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing a NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
> 
> Hi.
> 
> I was curious if a design to limit or eliminate read-ahead
> activity when the server returns EJUKEBOX was considered?

 Not seriously, because that would need a lot more knowledge about the internal workings of the NFS client than I have. The Solaris client seems to work along those lines, but the code to modify the readahead window looks complicated. The Solaris client also seems to be a lot less aggressive when doing readahead; the maximum seems to be 4x8k. As far as I can see, the Linux client doesn't really care about the readahead handling at all. It just fills "server->backing_dev_info.ra_pages" and leaves the handling to the MM system.

 Then, there is no guarantee that EJUKEBOX is ever sent by the server. If the offline archive resides on disk (e.g. a cheap SATA array), delivery will start almost immediately and the server will not send that error. Tracked that :-( Same for already positioned tapes.

> Unless one can know that the server and client can get into
> this situation ahead of time, how would the tunable be used?
> 

 Basically one has to know that the problem exists (that is easily detected) and that the readahead factor is involved.

 My patch of course has some pitfalls, at least:

a) as implemented, the nfs_ra_factor will be used for all NFS mounts. It should/could be per filesystem, but that needs a new mount option and I did not want to touch that code due to lack of understanding (and no time to acquire said understanding). But frankly, so far we have not observed any serious performance drawbacks with ra_factor=1.
b) changing the factor needs a remount, as the NFS client only cares about it at that time.

 Not a problem in my situation of course.

Cheers
Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-17 14:06 ` Peter Staubach
@ 2008-09-17 15:41   ` Chuck Lever
  0 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2008-09-17 15:41 UTC (permalink / raw)
  To: Peter Staubach; +Cc: Martin Knoblauch, linux-nfs list, linux-kernel

On Wed, Sep 17, 2008 at 9:06 AM, Peter Staubach <staubach@redhat.com> wrote:
> Martin Knoblauch wrote:
>>
>> Hi,
>>
>> the following/attached patch works around an [obscure] problem when a 2.6
>> (not sure/caring about 2.4) NFS client accesses an "offline" file on a
>> Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS.
>> Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux
>> problem, but the chances for a short-/mid-term solution from Sun are very
>> slim. So, being lazy, I would love to get this patch into Linux. If not, I
>> will just have to maintain it for eternity out of tree.
>>
>> The problem: SAM-FS is Sun's proprietary HSM filesystem. It stores
>> meta-data and a relatively small amount of data "online" on disk and pushes
>> old or infrequently used data to "offline" media such as tape. This is
>> completely transparent to the users. If the data for an "offline" file is
>> needed, the so-called "stager daemon" copies it back from the offline
>> medium. All of this works great most of the time. Now, if a Linux NFS
>> client tries to read such an offline file, performance drops to "extremely
>> slow". After lengthy investigation of tcp-dumps, mount options and
>> procedures involving black cats at midnight, we found out that the readahead
>> behaviour of the Linux NFS client causes the problem. Basically it seems to
>> issue read requests up to 15*rsize to the server. In the case of the
>> "offline" files, this behaviour causes heavy competition for the inode lock
>> between the NFSD process and the stager daemon on the Solaris server.
>>
>> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks
>> the problem, but a solution will need time. Lots of it.
>> - The working solution: disable the client side readahead, or make it
>> tunable. The patch does that by introducing an NFS module parameter
>> "ra_factor" which can take values between 1 and 15 (default 15) and a
>> tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
>
> Hi.
>
> I was curious if a design to limit or eliminate read-ahead
> activity when the server returns EJUKEBOX was considered?
> Unless one can know that the server and client can get into
> this situation ahead of time, how would the tunable be used?

I tend to agree.  A tunable is probably not a good solution in this case.

I would bet that this lock contention issue is a problem in other more
common cases, and would merit some careful analysis.

-- 
Chuck Lever


* Re: [RFC][Resend] Make NFS-Client readahead tunable
  2008-09-17 13:06 Martin Knoblauch
@ 2008-09-17 14:06 ` Peter Staubach
  2008-09-17 15:41   ` Chuck Lever
  2008-09-18  1:42 ` Greg Banks
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Staubach @ 2008-09-17 14:06 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: linux-nfs list, linux-kernel

Martin Knoblauch wrote:
> Hi,
>
> the following/attached patch works around an [obscure] problem when a 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chances for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I will just have to maintain it for eternity out of tree.
>
> The problem: SAM-FS is Sun's proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media such as tape. This is completely transparent to the users. If the data for an "offline" file is needed, the so-called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if a Linux NFS client tries to read such an offline file, performance drops to "extremely slow". After lengthy investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.
>
> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
> - The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing an NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.

Hi.

I was curious if a design to limit or eliminate read-ahead
activity when the server returns EJUKEBOX was considered?
Unless one can know that the server and client can get into
this situation ahead of time, how would the tunable be used?

    Thanx...

       ps



* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 13:42 Michael Trimarchi
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Trimarchi @ 2008-09-17 13:42 UTC (permalink / raw)
  To: Martin Knoblauch, linux-nfs list; +Cc: linux-kernel

Hi,

...

> Signed-off-by: Martin Knoblauch 
> 
> diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
> --- linux-2.6.27-rc6-git4/fs/nfs/client.c       2008-09-17 11:35:21.000000000 +0200
> +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c        2008-09-17 11:55:18.000000000 +0200
> @@ -722,6 +722,11 @@ error:
> }
> 
> /*
> + * NFS Client Read-Ahead factor
> +*/
> +unsigned int nfs_ra_factor;
> +
> +/*
>   * Load up the server record from information gained in an fsinfo record
>   */
> static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
> @@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
>                 server->rsize = NFS_MAX_FILE_IO_SIZE;
>         server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> 
> -       server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
> +       dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, "
> +               "nfs_ra_factor, ra_pages: %u %u %u %u %u\n",
> +               server->rsize, server->wsize, server->rpages,
> +               nfs_ra_factor, server->rpages * nfs_ra_factor);
> +       server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;
> 
>         if (server->wsize > max_rpc_payload)
>                 server->wsize = max_rpc_payload;
> diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
> --- linux-2.6.27-rc6-git4/fs/nfs/inode.c        2008-09-17 11:35:21.000000000 +0200
> +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c 2008-09-17 11:45:09.000000000 +0200
> @@ -53,6 +53,8 @@
> 
> /* Default is to see 64-bit inode numbers */
> static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
> +static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
> +
> 
> static void nfs_invalidate_inode(struct inode *);
> static int nfs_update_inode(struct inode *, struct nfs_fattr *);
> @@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
> #endif
>         if ((err = register_nfs_fs()) != 0)
>                 goto out;
> +
> +       if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
> +               nfs_ra_factor = NFS_MAX_READAHEAD;
> +       else
> +               nfs_ra_factor = ra_factor;
> +

So, I think that this is not necessary because it is done (... I hope) by the
proc_dointvec_minmax handler. Is that correct?
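
One way to look at this question, as a sketch only: a value given as a module parameter at load time does not pass through the sysctl handler, so the two checks guard different paths. The user-space model below uses hypothetical names (sketch_*) and is not kernel code. The second function models only the range check of a min/max sysctl write, not the exact return behaviour of proc_dointvec_minmax.

```c
#define SKETCH_MAX_READAHEAD 15u

/*
 * Load-time path: same logic as the init_nfs_fs() hunk quoted above.
 * Out-of-range values fall back to the maximum, they are not clamped
 * to the nearest bound.
 */
static unsigned int sketch_factor_at_load(unsigned int ra_factor)
{
	if (ra_factor < 1 || ra_factor > SKETCH_MAX_READAHEAD)
		return SKETCH_MAX_READAHEAD;
	return ra_factor;
}

/*
 * Runtime path: simplified model of a range-checked sysctl write.
 * Stores val and returns 0 if it is in range; otherwise leaves
 * *factor untouched and returns -1.
 */
static int sketch_sysctl_write(unsigned int *factor, unsigned int val)
{
	if (val < 1 || val > SKETCH_MAX_READAHEAD)
		return -1;
	*factor = val;
	return 0;
}
```

In this model, loading with ra_factor=0 yields 15 through the first function without the second ever being called.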

Regards Michael



* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 13:27 Martin Knoblauch
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 13:27 UTC (permalink / raw)
  To: Michael Trimarchi, linux-nfs list; +Cc: linux-kernel

----- Original Message ----

> From: Michael Trimarchi <trimarchimichael@yahoo.it>
> To: Martin Knoblauch <knobi@knobisoft.de>; linux-nfs list <linux-nfs@vger.kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Sent: Wednesday, September 17, 2008 3:19:25 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
> 
> Hi
> 
> 
> 
> ----- Original Message -----
> > From: Martin Knoblauch
> > To: linux-nfs list
> > Cc: linux-kernel@vger.kernel.org
> > Sent: Wednesday, 17 September 2008, 15:06:40
> > Subject: [RFC][Resend] Make NFS-Client readahead tunable
> > 
> ....
> 
> > diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
> > --- linux-2.6.27-rc6-git4/Makefile      2008-09-17 11:35:56.000000000 +0200
> > +++ linux-2.6.27-rc6-git4-nfs_ra/Makefile       2008-09-17 11:45:09.000000000 +0200
> > @@ -1,7 +1,7 @@
> > VERSION = 2
> > PATCHLEVEL = 6
> > SUBLEVEL = 27
> > -EXTRAVERSION = -rc6-git4
> > +EXTRAVERSION = -rc6-git4-nfs_ra
> > NAME = Rotary Wombat
> > 
> > # *DOCUMENTATION*
> 
> I'm not an expert but maybe this is not necessary :) 
> 

 Doh, yes. A final patch should not have that. Instead it should have some documentation for the module parameter and the tunable :-)

Cheers
Martin



* Re: [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 13:19 Michael Trimarchi
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Trimarchi @ 2008-09-17 13:19 UTC (permalink / raw)
  To: Martin Knoblauch, linux-nfs list; +Cc: linux-kernel

Hi



----- Original Message -----
> From: Martin Knoblauch <knobi@knobisoft.de>
> To: linux-nfs list <linux-nfs@vger.kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Sent: Wednesday, 17 September 2008, 15:06:40
> Subject: [RFC][Resend] Make NFS-Client readahead tunable
> 
....

> diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
> --- linux-2.6.27-rc6-git4/Makefile      2008-09-17 11:35:56.000000000 +0200
> +++ linux-2.6.27-rc6-git4-nfs_ra/Makefile       2008-09-17 11:45:09.000000000 +0200
> @@ -1,7 +1,7 @@
> VERSION = 2
> PATCHLEVEL = 6
> SUBLEVEL = 27
> -EXTRAVERSION = -rc6-git4
> +EXTRAVERSION = -rc6-git4-nfs_ra
> NAME = Rotary Wombat
> 
> # *DOCUMENTATION*

I'm not an expert but maybe this is not necessary :) 

> 
> 
> Cheers
> Martin
> 
> ------------------------------------------------------

Michael



* [RFC][Resend] Make NFS-Client readahead tunable
@ 2008-09-17 13:06 Martin Knoblauch
  2008-09-17 14:06 ` Peter Staubach
  2008-09-18  1:42 ` Greg Banks
  0 siblings, 2 replies; 29+ messages in thread
From: Martin Knoblauch @ 2008-09-17 13:06 UTC (permalink / raw)
  To: linux-nfs list; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6900 bytes --]

Hi,

the following/attached patch works around an [obscure] problem when a 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chances for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I will just have to maintain it for eternity out of tree.

The problem: SAM-FS is Sun's proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media such as tape. This is completely transparent to the users. If the data for an "offline" file is needed, the so-called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if a Linux NFS client tries to read such an offline file, performance drops to "extremely slow". After lengthy investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.

- The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
- The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing an NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
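
To put numbers on the "15*rsize" figure, here is a rough user-space sketch of the arithmetic. The 4 KiB page size and all SKETCH_*/sketch_* names are assumptions for illustration only; the rounding follows the rpages computation in the patch context below.

```c
/* Assumed page size of 4 KiB; the real value depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Pages needed to hold one rsize-sized read, rounded up. */
static unsigned int sketch_rpages(unsigned int rsize)
{
	return (rsize + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Readahead window in bytes for a given rsize and readahead factor. */
static unsigned long sketch_readahead_bytes(unsigned int rsize,
					    unsigned int factor)
{
	return (unsigned long)sketch_rpages(rsize) * factor * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, rsize=32768 gives 8 pages per read, so a factor of 15 allows a window of 120 pages (480 KiB), while a factor of 1 limits it to 8 pages (32 KiB).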

Signed-off-by: Martin Knoblauch <knobi@knobisoft.de>

diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
--- linux-2.6.27-rc6-git4/fs/nfs/client.c       2008-09-17 11:35:21.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c        2008-09-17 11:55:18.000000000 +0200
@@ -722,6 +722,11 @@ error:
 }

 /*
+ * NFS Client Read-Ahead factor
+*/
+unsigned int nfs_ra_factor;
+
+/*
  * Load up the server record from information gained in an fsinfo record
  */
 static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
@@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
                server->rsize = NFS_MAX_FILE_IO_SIZE;
        server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

-       server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+       dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, "
+               "nfs_ra_factor, ra_pages: %u %u %u %u %u\n",
+               server->rsize, server->wsize, server->rpages,
+               nfs_ra_factor, server->rpages * nfs_ra_factor);
+       server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;

        if (server->wsize > max_rpc_payload)
                server->wsize = max_rpc_payload;
diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
--- linux-2.6.27-rc6-git4/fs/nfs/inode.c        2008-09-17 11:35:21.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c 2008-09-17 11:45:09.000000000 +0200
@@ -53,6 +53,8 @@

 /* Default is to see 64-bit inode numbers */
 static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
+static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
+

 static void nfs_invalidate_inode(struct inode *);
 static int nfs_update_inode(struct inode *, struct nfs_fattr *);
@@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
 #endif
        if ((err = register_nfs_fs()) != 0)
                goto out;
+
+       if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
+               nfs_ra_factor = NFS_MAX_READAHEAD;
+       else
+               nfs_ra_factor = ra_factor;
+
        return 0;
 out:
 #ifdef CONFIG_PROC_FS
@@ -1388,6 +1396,10 @@ static void __exit exit_nfs_fs(void)
 MODULE_AUTHOR("Olaf Kirch <okir@monad.swb.de>");
 MODULE_LICENSE("GPL");
 module_param(enable_ino64, bool, 0644);
+MODULE_PARM_DESC(enable_ino64, "Enable 64-bit inode numbers (Default: 1)");
+module_param(ra_factor, uint, 0644);
+MODULE_PARM_DESC(ra_factor,
+       "Number of rsize read-ahead requests (Default/Max: 15, Min: 1)");

 module_init(init_nfs_fs)
 module_exit(exit_nfs_fs)
diff -urp linux-2.6.27-rc6-git4/fs/nfs/sysctl.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/sysctl.c
--- linux-2.6.27-rc6-git4/fs/nfs/sysctl.c       2008-07-13 23:51:29.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/sysctl.c        2008-09-17 11:45:09.000000000 +0200
@@ -14,9 +14,12 @@
 #include <linux/nfs_fs.h>

 #include "callback.h"
+#include "internal.h"

 static const int nfs_set_port_min = 0;
 static const int nfs_set_port_max = 65535;
+static const unsigned int min_nfs_ra_factor = 1;
+static const unsigned int max_nfs_ra_factor = NFS_MAX_READAHEAD;
 static struct ctl_table_header *nfs_callback_sysctl_table;

 static ctl_table nfs_cb_sysctls[] = {
@@ -58,6 +61,16 @@ static ctl_table nfs_cb_sysctls[] = {
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },
+       {
+               .ctl_name = CTL_UNNUMBERED,
+               .procname = "nfs_ra_factor",
+               .data = &nfs_ra_factor,
+               .maxlen = sizeof(unsigned int),
+               .mode = 0644,
+               .proc_handler = &proc_dointvec_minmax,
+               .extra1 = (unsigned int *)&min_nfs_ra_factor,
+               .extra2 = (unsigned int *)&max_nfs_ra_factor,
+       },
        { .ctl_name = 0 }
 };

diff -urp linux-2.6.27-rc6-git4/include/linux/nfs_fs.h linux-2.6.27-rc6-git4-nfs_ra/include/linux/nfs_fs.h
--- linux-2.6.27-rc6-git4/include/linux/nfs_fs.h        2008-09-17 11:35:25.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/include/linux/nfs_fs.h 2008-09-17 11:45:09.000000000 +0200
@@ -464,6 +464,11 @@ extern int nfs_writeback_done(struct rpc
 extern void nfs_writedata_release(void *);

 /*
+ * linux/fs/nfs/client.c
+*/
+extern unsigned int nfs_ra_factor;
+
+/*
  * Try to write back everything synchronously (but check the
  * return value!)
  */
diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
--- linux-2.6.27-rc6-git4/Makefile      2008-09-17 11:35:56.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/Makefile       2008-09-17 11:45:09.000000000 +0200
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 27
-EXTRAVERSION = -rc6-git4
+EXTRAVERSION = -rc6-git4-nfs_ra
 NAME = Rotary Wombat

 # *DOCUMENTATION*



Cheers
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:  http://www.knobisoft.de

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: nfs_ra-2.6.27-rc6-git4.diff --]
[-- Type: text/x-patch; name="nfs_ra-2.6.27-rc6-git4.diff", Size: 4467 bytes --]

diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
--- linux-2.6.27-rc6-git4/fs/nfs/client.c	2008-09-17 11:35:21.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c	2008-09-17 11:55:18.000000000 +0200
@@ -722,6 +722,11 @@ error:
 }
 
 /*
+ * NFS Client Read-Ahead factor
+*/
+unsigned int nfs_ra_factor;
+
+/*
  * Load up the server record from information gained in an fsinfo record
  */
 static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
@@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
-	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+	dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, "
+		"nfs_ra_factor, ra_pages: %u %u %u %u %u\n",
+		server->rsize, server->wsize, server->rpages,
+		nfs_ra_factor, server->rpages * nfs_ra_factor);
+	server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;
 
 	if (server->wsize > max_rpc_payload)
 		server->wsize = max_rpc_payload;
diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
--- linux-2.6.27-rc6-git4/fs/nfs/inode.c	2008-09-17 11:35:21.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c	2008-09-17 11:45:09.000000000 +0200
@@ -53,6 +53,8 @@
 
 /* Default is to see 64-bit inode numbers */
 static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
+static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
+
 
 static void nfs_invalidate_inode(struct inode *);
 static int nfs_update_inode(struct inode *, struct nfs_fattr *);
@@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
 #endif
 	if ((err = register_nfs_fs()) != 0)
 		goto out;
+
+	if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
+		nfs_ra_factor = NFS_MAX_READAHEAD;
+	else
+		nfs_ra_factor = ra_factor;
+
 	return 0;
 out:
 #ifdef CONFIG_PROC_FS
@@ -1388,6 +1396,10 @@ static void __exit exit_nfs_fs(void)
 MODULE_AUTHOR("Olaf Kirch <okir@monad.swb.de>");
 MODULE_LICENSE("GPL");
 module_param(enable_ino64, bool, 0644);
+MODULE_PARM_DESC(enable_ino64, "Enable 64-bit inode numbers (Default: 1)");
+module_param(ra_factor, uint, 0644);
+MODULE_PARM_DESC(ra_factor,
+	"Number of rsize read-ahead requests (Default/Max: 15, Min: 1)");
 
 module_init(init_nfs_fs)
 module_exit(exit_nfs_fs)
diff -urp linux-2.6.27-rc6-git4/fs/nfs/sysctl.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/sysctl.c
--- linux-2.6.27-rc6-git4/fs/nfs/sysctl.c	2008-07-13 23:51:29.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/sysctl.c	2008-09-17 11:45:09.000000000 +0200
@@ -14,9 +14,12 @@
 #include <linux/nfs_fs.h>
 
 #include "callback.h"
+#include "internal.h"
 
 static const int nfs_set_port_min = 0;
 static const int nfs_set_port_max = 65535;
+static const unsigned int min_nfs_ra_factor = 1;
+static const unsigned int max_nfs_ra_factor = NFS_MAX_READAHEAD;
 static struct ctl_table_header *nfs_callback_sysctl_table;
 
 static ctl_table nfs_cb_sysctls[] = {
@@ -58,6 +61,16 @@ static ctl_table nfs_cb_sysctls[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name = CTL_UNNUMBERED,
+		.procname = "nfs_ra_factor",
+		.data = &nfs_ra_factor,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = &proc_dointvec_minmax,
+		.extra1 = (unsigned int *)&min_nfs_ra_factor,
+		.extra2 = (unsigned int *)&max_nfs_ra_factor,
+	},
 	{ .ctl_name = 0 }
 };
 
diff -urp linux-2.6.27-rc6-git4/include/linux/nfs_fs.h linux-2.6.27-rc6-git4-nfs_ra/include/linux/nfs_fs.h
--- linux-2.6.27-rc6-git4/include/linux/nfs_fs.h	2008-09-17 11:35:25.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/include/linux/nfs_fs.h	2008-09-17 11:45:09.000000000 +0200
@@ -464,6 +464,11 @@ extern int nfs_writeback_done(struct rpc
 extern void nfs_writedata_release(void *);
 
 /*
+ * linux/fs/nfs/client.c
+*/
+extern unsigned int nfs_ra_factor;
+
+/*
  * Try to write back everything synchronously (but check the
  * return value!)
  */
diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
--- linux-2.6.27-rc6-git4/Makefile	2008-09-17 11:35:56.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/Makefile	2008-09-17 11:45:09.000000000 +0200
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 27
-EXTRAVERSION = -rc6-git4
+EXTRAVERSION = -rc6-git4-nfs_ra
 NAME = Rotary Wombat
 
 # *DOCUMENTATION*


end of thread, other threads:[~2008-09-21 13:53 UTC | newest]

Thread overview: 29+ messages
2008-09-18  7:42 [RFC][Resend] Make NFS-Client readahead tunable Martin Knoblauch
2008-09-18  8:18 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2008-09-21 12:53 Martin Knoblauch
2008-09-21 12:50 Martin Knoblauch
2008-09-21 13:53 ` Chuck Lever
2008-09-18 11:53 Martin Knoblauch
2008-09-18 18:24 ` Chuck Lever
2008-09-18 19:03   ` Peter Staubach
2008-09-18  9:32 Martin Knoblauch
2008-09-18  8:38 Martin Knoblauch
2008-09-18  8:47 ` Andrew Morton
2008-09-18  8:57   ` Greg Banks
2008-09-18 13:20 ` Peter Zijlstra
2008-09-18  8:19 Martin Knoblauch
2008-09-18  8:45 ` Greg Banks
2008-09-17 17:01 Martin Knoblauch
2008-09-17 16:23 Martin Knoblauch
2008-09-17 16:43 ` Chuck Lever
2008-09-17 16:15 Martin Knoblauch
2008-09-17 16:10 Martin Knoblauch
2008-09-17 16:03 Martin Knoblauch
2008-09-17 13:42 Michael Trimarchi
2008-09-17 13:27 Martin Knoblauch
2008-09-17 13:19 Michael Trimarchi
2008-09-17 13:06 Martin Knoblauch
2008-09-17 14:06 ` Peter Staubach
2008-09-17 15:41   ` Chuck Lever
2008-09-18  1:42 ` Greg Banks
2008-09-18  3:13   ` Andrew Morton
