LKML Archive on lore.kernel.org
* 2.6.18 mmap hangs unrelated apps
@ 2006-12-15  2:30 Michal Sabala
  2006-12-15 16:24 ` Trond Myklebust
  2006-12-16 12:59 ` Christian Kuhn
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Sabala @ 2006-12-15  2:30 UTC
  To: linux-kernel

Hello LKML,

I am observing processes entering uninterruptible sleep apparently due
to an unrelated application using mmap over nfs. Applications in
"uninterruptible sleep" hang indefinitely while other applications
continue working properly.

The code causing the mmap nfs hangs does the following:
(as replicated by the included test-mmap.c file)

  1. create file on nfs (file_A, descr_A)
  2. make file_A a sparse 200MB file
  3. mmap descr_A
  4. close descr_A
  5. unlink file_A
  6. memcpy 200MB to mmaped buffer
  7. create a second file on nfs (file_B, descr_B)
  8. write() 200MB from mmaped buffer to descr_B
  9. close descr_B
  10. munmap first file

This code may need to be run tens to hundreds of times to trigger the condition.

During the execution of the above code, unrelated applications enter
uninterruptible sleep (D) - usually firefox2.0, Xorg/XFree86, gimp2.2, gconfd
or bash; probably the most active processes.

`dmesg` shows nothing of interest.

`free` shows anywhere between 1MB and 80MB of memory still remaining
free when the problem occurs.

`cat /proc/*PID*/wchan` for all hanging processes shows sync_page.
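
The wchan check can be automated for every blocked task at once. What
follows is only a minimal sketch: it assumes the usual Linux /proc layout,
in which the state letter follows the parenthesised command name in
/proc/PID/stat, and the helper names stat_state and list_d_state are made
up for illustration.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Extract the state letter from a /proc/<pid>/stat line.
 * The command name is wrapped in parentheses and may itself contain
 * spaces or ')', so look for the LAST ')' and read the field after it.
 * Returns '?' if the line cannot be parsed. */
char stat_state(const char *stat_line)
{
    const char *p = strrchr(stat_line, ')');

    if (p == NULL)
        return '?';
    p++;
    while (*p == ' ')
        p++;
    return *p ? *p : '?';
}

/* Print "<pid>\t<wchan>" for every task currently in state 'D'.
 * Returns the number of such tasks, or -1 if /proc cannot be opened. */
int list_d_state(FILE *out)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int found = 0;

    if (proc == NULL)
        return -1;

    while ((de = readdir(proc)) != NULL) {
        char path[300], line[512], wchan[128] = "";
        FILE *f;

        /* Only numeric directory names are process IDs. */
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;

        snprintf(path, sizeof path, "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (f == NULL)
            continue;               /* task exited during the scan */
        if (fgets(line, sizeof line, f) == NULL) {
            fclose(f);
            continue;
        }
        fclose(f);

        if (stat_state(line) != 'D')
            continue;

        /* wchan names the kernel function the task is sleeping in. */
        snprintf(path, sizeof path, "/proc/%s/wchan", de->d_name);
        f = fopen(path, "r");
        if (f != NULL) {
            if (fgets(wchan, sizeof wchan, f) == NULL)
                wchan[0] = '\0';
            fclose(f);
        }
        fprintf(out, "%s\t%s\n", de->d_name,
                wchan[0] ? wchan : "(unknown)");
        found++;
    }
    closedir(proc);
    return found;
}
```

Calling list_d_state(stdout) prints one line per task in uninterruptible
sleep: the PID followed by its wait channel.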

* Client Setups:

  Linux 2.6.18 debian kernel (not tainted)
  Intel P3/800
  512MB ram
  0 swap
  NFS root (rw,noatime,rsize=8192,wsize=8192,nfsvers=3,hard,lock,udp)
  NIC: 100mbit tulip Cardbus
  NFS server is Linux 2.6.8 (debian)
  Gnome running with ooffice, gimp2.2 and firefox2 open

  and

  Linux 2.6.18 debian kernel (not tainted)
  Intel P4/2.8
  mem=192M boot option
  0 swap
  NFS home (rw,nosuid,rsize=8192,wsize=8192,hard)
  NIC: 100mbit e100 PCI
  NFS server is Apple OSX 10.3
  Gnome running with ooffice, gimp2.2 and firefox2 open

This happens with NFS servers based on Linux 2.6.8 and OSX 10.3.x. There
is nothing unusual in the server log files. Other than large nfs mmaps
on limited-RAM clients, the NFS clients are 100% stable (file locking,
performance, 6-month uptimes, etc.).

NOTE:
  I also ran the same code on the P4 machine in /tmp (local disk)
and it too caused some applications to enter uninterruptible sleep
(dozens of consecutive runs were needed). As such this looks not to
be directly related to nfs.

I would like to assist in any way I can in tracking this bug. I am open
to running patched kernels, etc...

Thank You,
 Sincerely,

   Michal Sabala



PS. thank you for all the hard work on the Linux kernel.


----------- test-mmap.c: --------------------------------

#include <unistd.h>
#include <stdlib.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

#include <sys/mman.h>

int main (int argc, char * argv[] ){

  char * data = 0;
  int blocks = 12800;
  int bSize = 16384;
  
  char mmapFileName[] = "temp-XXXXXX";
  int mmapFileDes = mkstemp( mmapFileName );
  if ( mmapFileDes == -1 ){
    printf( "cannot make temporary file %s !\n", mmapFileName );
    exit( -1 );
  }

  printf( "using desc %d tempfile %s\n", mmapFileDes, mmapFileName );

  errno = 0;  
  if ( lseek( mmapFileDes, (blocks*bSize)-1, SEEK_SET ) == -1 ){
    if ( errno != 0 ){
      perror ( "lseek error" );
    }
    printf(  "cannot lseek tempfile %s !\n", mmapFileName);
    close( mmapFileDes );
    unlink( mmapFileName );
    exit( -1 );
  }

  if ( write( mmapFileDes, "X", 1 ) != 1 ){
    printf(  "cannot sparse write tempfile %s !\n", mmapFileName);
    close( mmapFileDes );
    unlink( mmapFileName );
    exit( -1 );
  }

  data = mmap ( NULL, (blocks*bSize), PROT_READ | PROT_WRITE, MAP_SHARED, mmapFileDes, 0 );
  if ( data == MAP_FAILED ){
    printf(  "mmap of %s failed!\n", mmapFileName );
    close( mmapFileDes );
    unlink( mmapFileName );
    exit( -1 );
  }

  printf( "block size: %d, blocks num: %d\n", bSize, blocks);

  close( mmapFileDes );
  unlink( mmapFileName );

  int i;
  char * ptr = data;
  for ( i = 1; i <= blocks; i++ ){
    printf( "wrote %d of %d blocks to %s\n", i, blocks, mmapFileName );
    memset( ptr, 0, bSize ); 
    ptr += bSize;
  }

  // msync( data, blocks*bSize, MS_SYNC );

  char destFile[] = "destination-XXXXXX";
  int destDes = mkstemp( destFile );
  if ( destDes == -1 ){
    printf( "cannot make destination file %s !\n", destFile );
    exit( -1 );
  }

  printf( "using desc %d destfile %s\n", destDes, destFile);
 
  ptr = data;
  for ( i = 1; i <= blocks; i++ ){
    int wLen = write( destDes, ptr, bSize );
    printf( "wrote %d of %d blocks to %s\n", i, blocks, destFile );
    if ( wLen != bSize ){
      printf( "debug: short write to %s at %d bytes\n", destFile, wLen );
    }
    ptr += bSize;
  }
  
  close( destDes );
  
  munmap( data, blocks*bSize );

  exit( 0 );
}

-- 
Michal "Saahbs" Sabala


* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15  2:30 2.6.18 mmap hangs unrelated apps Michal Sabala
@ 2006-12-15 16:24 ` Trond Myklebust
  2006-12-15 17:50   ` Michal Sabala
  2006-12-16 12:59 ` Christian Kuhn
  1 sibling, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2006-12-15 16:24 UTC
  To: Michal Sabala; +Cc: linux-kernel

On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> Hello LKML,
> 
> I am observing processes entering uninterruptible sleep apparently due
> to an unrelated application using mmap over nfs. Applications in
> "uninterruptible sleep" hang indefinitely while other applications
> continue working properly.
> 
> The code causing the mmap nfs hangs does the following:
> (as replicated by the included test-mmap.c file)
> 
>   1. create file on nfs (file_A, descr_A)
>   2. make file_A a sparse 200MB file
>   3. mmap descr_A
>   4. close descr_A
>   5. unlink file_A
>   6. memcpy 200MB to mmaped buffer
>   7. create a second file on nfs (file_B, descr_B)
>   8. write() 200MB from mmaped buffer to descr_B
>   9. close descr_B
>   10. munmap first file
> 
> This code may need to be ran tens to hundred runs to trigger the condition.
> 
> During the execution of the above code, unrelated applications enter
> uninterruptible sleep (D) - usually firefox2.0, Xorg/XFree86, gimp2.2, gconfd
> or bash; probably the most active processes.
> 
> `dmesg` shows nothing of interest.
> 
> `free` shows anywhere between 1MB and 80MB of memory still remaining
> free when the problem occurs.
> 
> `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.

Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
these hanging processes? If so, what does the output look like?

Cheers
  Trond



* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 16:24 ` Trond Myklebust
@ 2006-12-15 17:50   ` Michal Sabala
  2006-12-15 19:44     ` Trond Myklebust
  2006-12-15 20:42     ` Andrew Morton
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Sabala @ 2006-12-15 17:50 UTC
  To: Trond Myklebust; +Cc: linux-kernel

On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > 
> > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> 
> Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> these hanging processes? If so, what does the output look like?

Hello Trond,

Below is the sysrq trace output for XFree86 which entered the
uninterruptible sleep state on the P4 machine with nfs /home. Please
note that XFree86 does not have any files open in /home - as reported by
`lsof`. Below, I also listed the output of vmstat.


XFree86       D 00000003     0  2471   2453                     (NOTLB)
       c4871c0c 00003082 c86b72bc 00000003 cb7c94a4 0000001d 3b67f3ff c0146dd2 
       c1184180 cb3e7110 00000000 001ec7ff a60f8097 00000089 c02e1e60 cb3e7000 
       c1184180 00000000 c1180030 c4871c18 c028c7d8 c4871c5c c01435b6 c01435f3 
Call Trace:
 [<c0146dd2>] free_pages_bulk+0x1d/0x1d4
 [<c028c7d8>] io_schedule+0x26/0x30
 [<c01435b6>] sync_page+0x0/0x40
 [<c01435f3>] sync_page+0x3d/0x40
 [<c028c9ce>] __wait_on_bit_lock+0x2c/0x52
 [<c0143c13>] __lock_page+0x6a/0x72
 [<c012ec77>] wake_bit_function+0x0/0x3c
 [<c012ec77>] wake_bit_function+0x0/0x3c
 [<c0149d2f>] pagevec_lookup+0x17/0x1d
 [<c014a085>] truncate_inode_pages_range+0x20a/0x260
 [<c014a0e4>] truncate_inode_pages+0x9/0xc
 [<c0172c8a>] generic_delete_inode+0xb6/0x10f
 [<c0172e73>] iput+0x5f/0x61
 [<c01706bd>] dentry_iput+0x68/0x83
 [<c01707d8>] dput+0x100/0x118
 [<ccb6c334>] put_nfs_open_context+0x67/0x88 [nfs]
 [<ccb701ed>] nfs_release_request+0x38/0x47 [nfs]
 [<ccb736dd>] nfs_wait_on_requests_locked+0x62/0x98 [nfs]
 [<ccb74c32>] nfs_sync_inode_wait+0x4a/0x130 [nfs]
 [<ccb6b639>] nfs_release_page+0x0/0x30 [nfs]
 [<ccb6b655>] nfs_release_page+0x1c/0x30 [nfs]
 [<c015f37c>] try_to_release_page+0x34/0x46
 [<c014aa8b>] shrink_page_list+0x263/0x350
 [<c0104db8>] do_IRQ+0x48/0x50
 [<c01036c6>] common_interrupt+0x1a/0x20
 [<c014acd7>] shrink_inactive_list+0x9b/0x248
 [<c014b2fd>] shrink_zone+0xb5/0xd0
 [<c014b382>] shrink_zones+0x6a/0x7e
 [<c014b48e>] try_to_free_pages+0xf8/0x1da
 [<c0147a18>] __alloc_pages+0x17c/0x278
 [<c014f555>] do_anonymous_page+0x45/0x150
 [<c014f9f7>] __handle_mm_fault+0xda/0x1bf
 [<c0115849>] do_page_fault+0x1c4/0x4bc
 [<c01021b7>] restore_sigcontext+0x10c/0x15f
 [<c0115685>] do_page_fault+0x0/0x4bc
 [<c0103809>] error_code+0x39/0x40




$> vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  1      0  82128  11484  36096    0    0    12    21  311   287  8  3  0 89

$> vmstat -m
Cache                       Num  Total   Size  Pages
nfs_direct_cache              0      0     76     50
nfs_write_data               36     91    512      7
nfs_read_data                32     35    512      7
nfs_inode_cache              28    108    648      6
nfs_page                      1     59     64     59
rpc_buffers                   8      8   2048      2
rpc_tasks                     8     15    256     15
rpc_inode_cache               8     14    512      7
fib6_nodes                    5    113     32    113
ip6_dst_cache                 4     15    256     15
ndisc_cache                   1     15    256     15
RAWv6                         4      6    640      6
UDPv6                         1      6    640      6
tw_sock_TCPv6                 0      0    128     30
request_sock_TCPv6            0      0    128     30
TCPv6                         2      3   1280      3
ip_conntrack_expect           0      0     96     40
ip_conntrack                 13     68    224     17
ip_fib_alias                  9    113     32    113
ip_fib_hash                   9    113     32    113
jbd_4k                        0      0   4096      1
Cache                       Num  Total   Size  Pages
ext3_inode_cache          12284  25352    504      8
ext3_xattr                    0      0     48     78
journal_handle                2    169     20    169
journal_head                 85    504     52     72
revoke_table                  2    254     12    254
revoke_record                 0      0     16    203
uhci_urb_priv                 1    127     28    127
clip_arp_cache                0      0    256     15
UNIX                         46    133    512      7
flow_cache                    0      0    128     30
cfq_ioc_pool                 29     84     92     42
cfq_pool                     27     80     96     40
crq_pool                     20     84     44     84
deadline_drq                  0      0     44     84
as_arq                        0      0     56     67
mqueue_inode_cache            1      6    640      6
dnotify_cache                 0      0     20    169
dquot                         0      0    128     30
eventpoll_pwq                 0      0     36    101
eventpoll_epi                 0      0    128     30
inotify_event_cache           0      0     28    127
Cache                       Num  Total   Size  Pages
inotify_watch_cache           1     92     40     92
kioctx                        0      0    256     15
kiocb                         0      0    128     30
fasync_cache                  2    203     16    203
shmem_inode_cache           452    459    448      9
posix_timers_cache            0      0     88     44
uid_cache                     3     59     64     59
ip_mrt_cache                  0      0    128     30
tcp_bind_bucket              12    203     16    203
inet_peer_cache               4     59     64     59
secpath_cache                 0      0     32    113
xfrm_dst_cache                0      0    384     10
ip_dst_cache                 18     45    256     15
arp_cache                     3     15    256     15
RAW                           2      7    512      7
UDP                           9     14    512      7
tw_sock_TCP                   0      0    128     30
request_sock_TCP              0      0     64     59
TCP                          12     21   1152      7
blkdev_ioc                   29    127     28    127
blkdev_queue                 19     20    956      4
Cache                       Num  Total   Size  Pages
blkdev_requests              17     23    172     23
biovec-256                    7      8   3072      2
biovec-128                    7     10   1536      5
biovec-64                     7     10    768      5
biovec-16                     7     15    256     15
biovec-4                      7     59     64     59
biovec-1                     51    203     16    203
bio                         276    300    128     30
sock_inode_cache             73    154    512      7
skbuff_fclone_cache           1     10    384     10
skbuff_head_cache           459    615    256     15
file_lock_cache              40     40     96     40
Acpi-Operand                699    736     40     92
Acpi-ParseExt                 0      0     44     84
Acpi-Parse                    0      0     28    127
Acpi-State                    0      0     44     84
Acpi-Namespace              319    338     20    169
proc_inode_cache           1218   1310    368     10
sigqueue                     16     27    144     27
radix_tree_node             989   2576    276     14
bdev_cache                    4      7    512      7
Cache                       Num  Total   Size  Pages
sysfs_dir_cache            3656   3696     44     84
mnt_cache                    25     30    128     30
inode_cache                 724    869    352     11
dentry_cache               8164  26100    132     29
filp                        693   1840    192     20
names_cache                   1      1   4096      1
key_jar                       6     30    128     30
idr_layer_cache             100    116    136     29
buffer_head                3614   6192     52     72
mm_struct                    54     81    448      9
vm_area_struct             1272   2576     84     46
fs_cache                     72    118     64     59
files_cache                 102    135    256     15
signal_cache                 55     90    384     10
sighand_cache                66     66   1344      3
task_struct                  68     87   1360      3
anon_vma                    647   1524     12    254
pgd                          40     40   4096      1
pid                          65    202     36    101
size-131072(DMA)              0      0 131072      1
size-131072                   0      0 131072      1
Cache                       Num  Total   Size  Pages
size-65536(DMA)               0      0  65536      1
size-65536                    1      1  65536      1
size-32768(DMA)               0      0  32768      1
size-32768                    0      0  32768      1
size-16384(DMA)               0      0  16384      1
size-16384                    0      0  16384      1
size-8192(DMA)                0      0   8192      1
size-8192                    67     69   8192      1
size-4096(DMA)                0      0   4096      1
size-4096                    33     33   4096      1
size-2048(DMA)                0      0   2048      2
size-2048                   503    534   2048      2
size-1024(DMA)                0      0   1024      4
size-1024                   197    208   1024      4
size-512(DMA)                 0      0    512      8
size-512                    241    296    512      8
size-256(DMA)                 0      0    256     15
size-256                     72     90    256     15
size-192(DMA)                 0      0    192     20
size-192                    308    320    192     20
size-128(DMA)                 0      0    128     30
Cache                       Num  Total   Size  Pages
size-128                    455    480    128     30
size-96(DMA)                  0      0    128     30
size-96                     672    720    128     30
size-64(DMA)                  0      0     64     59
size-32(DMA)                  0      0     32    113
size-64                     946   1357     64     59
size-32                    2424   2599     32    113
kmem_cache                  133    150    128     30


Thanks, Michal

-- 
Michal "Saahbs" Sabala



* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 17:50   ` Michal Sabala
@ 2006-12-15 19:44     ` Trond Myklebust
  2006-12-15 21:06       ` Michal Sabala
  2006-12-15 20:42     ` Andrew Morton
  1 sibling, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2006-12-15 19:44 UTC
  To: Michal Sabala; +Cc: linux-kernel

On Fri, 2006-12-15 at 11:50 -0600, Michal Sabala wrote:
> On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > > 
> > > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> > 
> > Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> > these hanging processes? If so, what does the output look like?
> 
> Hello Trond,
> 
> Below is the sysrq trace output for XFree86 which entered the
> uninterruptible sleep state on the P4 machine with nfs /home. Please
> note that XFree86 does not have any files open in /home - as reported by
> `lsof`. Below, I also listed the output of vmstat.


It is hanging because it is trying to free up memory by reclaiming pages
that are held by your mmapped file on NFS. Do you know why NFS is
hanging?

Cheers
  Trond

> XFree86       D 00000003     0  2471   2453                     (NOTLB)
>        c4871c0c 00003082 c86b72bc 00000003 cb7c94a4 0000001d 3b67f3ff c0146dd2 
>        c1184180 cb3e7110 00000000 001ec7ff a60f8097 00000089 c02e1e60 cb3e7000 
>        c1184180 00000000 c1180030 c4871c18 c028c7d8 c4871c5c c01435b6 c01435f3 
> Call Trace:
>  [<c0146dd2>] free_pages_bulk+0x1d/0x1d4
>  [<c028c7d8>] io_schedule+0x26/0x30
>  [<c01435b6>] sync_page+0x0/0x40
>  [<c01435f3>] sync_page+0x3d/0x40
>  [<c028c9ce>] __wait_on_bit_lock+0x2c/0x52
>  [<c0143c13>] __lock_page+0x6a/0x72
>  [<c012ec77>] wake_bit_function+0x0/0x3c
>  [<c012ec77>] wake_bit_function+0x0/0x3c
>  [<c0149d2f>] pagevec_lookup+0x17/0x1d
>  [<c014a085>] truncate_inode_pages_range+0x20a/0x260
>  [<c014a0e4>] truncate_inode_pages+0x9/0xc
>  [<c0172c8a>] generic_delete_inode+0xb6/0x10f
>  [<c0172e73>] iput+0x5f/0x61
>  [<c01706bd>] dentry_iput+0x68/0x83
>  [<c01707d8>] dput+0x100/0x118
>  [<ccb6c334>] put_nfs_open_context+0x67/0x88 [nfs]
>  [<ccb701ed>] nfs_release_request+0x38/0x47 [nfs]
>  [<ccb736dd>] nfs_wait_on_requests_locked+0x62/0x98 [nfs]
>  [<ccb74c32>] nfs_sync_inode_wait+0x4a/0x130 [nfs]
>  [<ccb6b639>] nfs_release_page+0x0/0x30 [nfs]
>  [<ccb6b655>] nfs_release_page+0x1c/0x30 [nfs]
>  [<c015f37c>] try_to_release_page+0x34/0x46
>  [<c014aa8b>] shrink_page_list+0x263/0x350
>  [<c0104db8>] do_IRQ+0x48/0x50
>  [<c01036c6>] common_interrupt+0x1a/0x20
>  [<c014acd7>] shrink_inactive_list+0x9b/0x248
>  [<c014b2fd>] shrink_zone+0xb5/0xd0
>  [<c014b382>] shrink_zones+0x6a/0x7e
>  [<c014b48e>] try_to_free_pages+0xf8/0x1da
>  [<c0147a18>] __alloc_pages+0x17c/0x278
>  [<c014f555>] do_anonymous_page+0x45/0x150
>  [<c014f9f7>] __handle_mm_fault+0xda/0x1bf
>  [<c0115849>] do_page_fault+0x1c4/0x4bc
>  [<c01021b7>] restore_sigcontext+0x10c/0x15f
>  [<c0115685>] do_page_fault+0x0/0x4bc
>  [<c0103809>] error_code+0x39/0x40



* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 17:50   ` Michal Sabala
  2006-12-15 19:44     ` Trond Myklebust
@ 2006-12-15 20:42     ` Andrew Morton
  2006-12-15 21:35       ` Michal Sabala
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2006-12-15 20:42 UTC
  To: Michal Sabala; +Cc: Trond Myklebust, linux-kernel

On Fri, 15 Dec 2006 11:50:30 -0600
Michal Sabala <lkml@saahbs.net> wrote:

> On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > > 
> > > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> > 
> > Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> > these hanging processes? If so, what does the output look like?
> 
> Hello Trond,
> 
> Below is the sysrq trace output for XFree86 which entered the
> uninterruptible sleep state on the P4 machine with nfs /home. Please
> note that XFree86 does not have any files open in /home - as reported by
> `lsof`. Below, I also listed the output of vmstat.

We'd need to see the trace of all D-state processes, please.  XFree86 might
just be a victim of a deadlock elsewhere.  However there is a problem here.


> 
> XFree86       D 00000003     0  2471   2453                     (NOTLB)
>        c4871c0c 00003082 c86b72bc 00000003 cb7c94a4 0000001d 3b67f3ff c0146dd2 
>        c1184180 cb3e7110 00000000 001ec7ff a60f8097 00000089 c02e1e60 cb3e7000 
>        c1184180 00000000 c1180030 c4871c18 c028c7d8 c4871c5c c01435b6 c01435f3 
> Call Trace:
>  [<c0146dd2>] free_pages_bulk+0x1d/0x1d4
>  [<c028c7d8>] io_schedule+0x26/0x30
>  [<c01435b6>] sync_page+0x0/0x40
>  [<c01435f3>] sync_page+0x3d/0x40
>  [<c028c9ce>] __wait_on_bit_lock+0x2c/0x52
>  [<c0143c13>] __lock_page+0x6a/0x72
>  [<c012ec77>] wake_bit_function+0x0/0x3c
>  [<c012ec77>] wake_bit_function+0x0/0x3c
>  [<c0149d2f>] pagevec_lookup+0x17/0x1d
>  [<c014a085>] truncate_inode_pages_range+0x20a/0x260
>  [<c014a0e4>] truncate_inode_pages+0x9/0xc
>  [<c0172c8a>] generic_delete_inode+0xb6/0x10f
>  [<c0172e73>] iput+0x5f/0x61
>  [<c01706bd>] dentry_iput+0x68/0x83
>  [<c01707d8>] dput+0x100/0x118
>  [<ccb6c334>] put_nfs_open_context+0x67/0x88 [nfs]
>  [<ccb701ed>] nfs_release_request+0x38/0x47 [nfs]
>  [<ccb736dd>] nfs_wait_on_requests_locked+0x62/0x98 [nfs]
>  [<ccb74c32>] nfs_sync_inode_wait+0x4a/0x130 [nfs]
>  [<ccb6b639>] nfs_release_page+0x0/0x30 [nfs]
>  [<ccb6b655>] nfs_release_page+0x1c/0x30 [nfs]
>  [<c015f37c>] try_to_release_page+0x34/0x46
>  [<c014aa8b>] shrink_page_list+0x263/0x350
>  [<c0104db8>] do_IRQ+0x48/0x50
>  [<c01036c6>] common_interrupt+0x1a/0x20
>  [<c014acd7>] shrink_inactive_list+0x9b/0x248
>  [<c014b2fd>] shrink_zone+0xb5/0xd0
>  [<c014b382>] shrink_zones+0x6a/0x7e
>  [<c014b48e>] try_to_free_pages+0xf8/0x1da
>  [<c0147a18>] __alloc_pages+0x17c/0x278
>  [<c014f555>] do_anonymous_page+0x45/0x150
>  [<c014f9f7>] __handle_mm_fault+0xda/0x1bf
>  [<c0115849>] do_page_fault+0x1c4/0x4bc
>  [<c01021b7>] restore_sigcontext+0x10c/0x15f
>  [<c0115685>] do_page_fault+0x0/0x4bc
>  [<c0103809>] error_code+0x39/0x40

nfs_release_page() was called with a locked page.  It's doing a bunch of
stuff which results in a call to truncate_inode_pages(), which will run
lock_page(), which is deadlocky.

But it's rather obviously deadlocky, so perhaps NFS drops and reacquires
the page lock somewhere?
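
A minimal user-space sketch of the pattern described above, assuming nothing about the real NFS code: `toy_page`, `toy_reclaim()`, `toy_release_page()` and `toy_truncate()` are hypothetical names, and an `atomic_flag` stands in for the page-lock bit. A trylock is used in place of a blocking lock so the model reports the conflict instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; the flag models the page-lock bit. */
struct toy_page {
	atomic_flag locked;
};

/* Take the lock if it is free; return false if it is already held. */
static bool toy_trylock_page(struct toy_page *page)
{
	return !atomic_flag_test_and_set(&page->locked);
}

static void toy_unlock_page(struct toy_page *page)
{
	atomic_flag_clear(&page->locked);
}

/*
 * Models the truncate path, which needs the page lock. A blocking lock
 * would sleep forever if the caller already holds it; the trylock lets
 * the sketch return false at the point where the hang would occur.
 */
static bool toy_truncate(struct toy_page *page)
{
	if (!toy_trylock_page(page))
		return false;	/* would block: lock is already held */
	toy_unlock_page(page);
	return true;
}

/* Models a release callback whose cleanup ends up in the truncate path. */
static bool toy_release_page(struct toy_page *page)
{
	return toy_truncate(page);
}

/* Models reclaim: lock the page, then invoke the release callback. */
static bool toy_reclaim(struct toy_page *page)
{
	bool ok;
	bool locked = toy_trylock_page(page);

	assert(locked);
	(void)locked;
	ok = toy_release_page(page);	/* re-enters the lock we hold */
	toy_unlock_page(page);
	return ok;
}
```

Calling `toy_truncate()` on an unlocked page succeeds, while `toy_reclaim()` returns false because the inner path tries to take a lock its own caller already holds. This only illustrates the lock-inside-lock shape of the trace, not the actual kernel control flow.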


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 19:44     ` Trond Myklebust
@ 2006-12-15 21:06       ` Michal Sabala
  2006-12-15 21:12         ` Arjan van de Ven
  2006-12-15 21:44         ` Trond Myklebust
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Sabala @ 2006-12-15 21:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On 2006/12/15 at 13:44:44 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> On Fri, 2006-12-15 at 11:50 -0600, Michal Sabala wrote:
> > On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > > On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > > > 
> > > > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> > > 
> > > Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> > > these hanging processes? If so, what does the output look like?
> > 
> > Hello Trond,
> > 
> > Below is the sysrq trace output for XFree86 which entered the
> > uninterruptible sleep state on the P4 machine with nfs /home. Please
> > note that XFree86 does not have any files open in /home - as reported by
> > `lsof`. Below, I also listed the output of vmstat.
> 
> 
> It is hanging because it is trying to free up memory by reclaiming pages
> that are held by your mmaped file on NFS. Do you know why NFS is
> hanging?

Trond,

I do not have any indication that it is the server not responding. Other
applications which have NFS files open are continuing to work while in
this case XFree86 blocks.

Also, please note that test-mmap.c has successfully finished execution
and it is no longer running while XFree86 is still hanging.

Could this be related to the fact that the nfs mmaped file is unlinked
before it is unmapped? The .nfsXXXXXXX file disappears from the NFS
server as soon as test-mmap.c exits.
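
For readers without the attachment, the ten steps from the original report can be sketched roughly as follows. This is not the actual test-mmap.c: the function name `mmap_unlink_sequence()`, the helper `write_all()`, the use of `MAP_SHARED`, and filling the buffer with `memset()` rather than `memcpy()` are all assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to fd, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Returns 0 on success, -1 on any failure. */
static int mmap_unlink_sequence(const char *path_a, const char *path_b,
				size_t len)
{
	int rc = -1;
	int fd_a, fd_b;
	char *map;

	assert(len > 0);
	fd_a = open(path_a, O_RDWR | O_CREAT | O_TRUNC, 0600);	/* step 1 */
	if (fd_a < 0)
		return -1;
	if (ftruncate(fd_a, (off_t)len) != 0) {	/* step 2: sparse file */
		close(fd_a);
		unlink(path_a);
		return -1;
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd_a, 0);		/* step 3 */
	close(fd_a);					/* step 4 */
	unlink(path_a);					/* step 5 */
	if (map == MAP_FAILED)
		return -1;

	memset(map, 0x5a, len);				/* step 6: dirty pages */

	fd_b = open(path_b, O_WRONLY | O_CREAT | O_TRUNC, 0600); /* step 7 */
	if (fd_b >= 0) {
		rc = write_all(fd_b, map, len);		/* step 8 */
		if (close(fd_b) != 0)			/* step 9 */
			rc = -1;
	}
	munmap(map, len);				/* step 10 */
	return rc;
}
```

In the reported setup both paths would live on the NFS mount and `len` would be 200MB. The mapping stays valid after the close and unlink, which is why the client keeps a .nfsXXXXXXX placeholder until the last reference goes away.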

What nfs_debug information would be useful in tracking this
problem? Is there any other information I can provide you?

Thank You,
 Sincerely,

   Michal


--
Michal "Saahbs" Sabala

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:06       ` Michal Sabala
@ 2006-12-15 21:12         ` Arjan van de Ven
  2006-12-15 21:43           ` Michal Sabala
  2006-12-15 21:44         ` Trond Myklebust
  1 sibling, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2006-12-15 21:12 UTC (permalink / raw)
  To: Michal Sabala; +Cc: linux-kernel


> 
> I do not have any indication that it is the server not responding. Other
> applications which have NFS files open are continuing to work while in
> this case XFree86 blocks.

just a strange question, but which video driver do you use in X? maybe
that one is blocking say the pci bus or something...



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 20:42     ` Andrew Morton
@ 2006-12-15 21:35       ` Michal Sabala
  2006-12-15 21:41         ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Sabala @ 2006-12-15 21:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Trond Myklebust, linux-kernel

On 2006/12/15 at 14:42:08 Andrew Morton <akpm@osdl.org> wrote
> On Fri, 15 Dec 2006 11:50:30 -0600
> Michal Sabala <lkml@saahbs.net> wrote:
> 
> > On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > > On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > > > 
> > > > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> > > 
> > > Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> > > these hanging processes? If so, what does the output look like?
> > 
> > Hello Trond,
> > 
> > Below is the sysrq trace output for XFree86 which entered the
> > uninterruptible sleep state on the P4 machine with nfs /home. Please
> > note that XFree86 does not have any files open in /home - as reported by
> > `lsof`. Below, I also listed the output of vmstat.
> 
> We'd need to see the trace of all D-state processes, please.  Xfree86 might
> just be a victim of a deadlock elsewhere.  However there is a problem here..

Hi Andrew,

In most cases only a single process enters the D-state. This time it was
XFree86, but I've also seen gimp, firefox, gconfd and bash. Once or twice I did
see two or three processes ending up in uninterruptible sleep, but I
suspect they entered this state at different test-mmap.c runs (I left
test-mmap.c running in a bash loop and checked the system after a few
hours).
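
When checking a machine after a long run, the D-state processes can be picked out from /proc/<pid>/stat, whose line has the form `pid (comm) state ...`. The sketch below only shows the parsing step; the function names `stat_line_state()` and `is_uninterruptible()` are hypothetical, and walking /proc to feed it lines is left out.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the state character from one /proc/<pid>/stat line, or '\0'
 * if the line is malformed. The command name is wrapped in parentheses
 * and may itself contain ')', so search for the last ')' in the line;
 * every field after the command name is a single letter or a number.
 */
static char stat_line_state(const char *line)
{
	const char *close_paren;

	assert(line != NULL);
	close_paren = strrchr(line, ')');
	if (close_paren == NULL || close_paren[1] != ' ' ||
	    close_paren[2] == '\0')
		return '\0';
	return close_paren[2];
}

/* Nonzero if the line describes a task in uninterruptible sleep. */
static int is_uninterruptible(const char *line)
{
	return stat_line_state(line) == 'D';
}
```

For every PID that matches, /proc/<pid>/wchan and the sysrq-t output give the wait channel and the stack trace requested earlier in the thread.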

Would it be beneficial to keep running test-mmap.c on this machine until
two or more processes end up in D-state? I can leave this machine
running test-mmap.c over the weekend. 

Please advise,

Sincerely,

Michal

-- 
Michal "Saahbs" Sabala

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:35       ` Michal Sabala
@ 2006-12-15 21:41         ` Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2006-12-15 21:41 UTC (permalink / raw)
  To: Michael Sabala; +Cc: Trond Myklebust, linux-kernel

On Fri, 15 Dec 2006 15:35:00 -0600
Michal Sabala <lkml@saahbs.net> wrote:

> On 2006/12/15 at 14:42:08 Andrew Morton <akpm@osdl.org> wrote
> > On Fri, 15 Dec 2006 11:50:30 -0600
> > Michal Sabala <lkml@saahbs.net> wrote:
> > 
> > > On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > > > On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > > > > 
> > > > > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> > > > 
> > > > Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> > > > these hanging processes? If so, what does the output look like?
> > > 
> > > Hello Trond,
> > > 
> > > Below is the sysrq trace output for XFree86 which entered the
> > > uninterruptible sleep state on the P4 machine with nfs /home. Please
> > > note that XFree86 does not have any files open in /home - as reported by
> > > `lsof`. Below, I also listed the output of vmstat.
> > 
> > We'd need to see the trace of all D-state processes, please.  Xfree86 might
> > just be a victim of a deadlock elsewhere.  However there is a problem here..
> 
> Hi Andrew,
> 
> In most cases only a single process enters the D-state, this time it was
> XFree, but I've seen gimp, firefox, gconfd and bash. Once or twice I did
> see two or three processes ending up in uninterruptible sleep, but I
> suspect they entered this state at different test-mmap.c runs (I left
> test-mmap.c running in a bash loop and checked the system after a few
> hours).

OK, useful info, thanks.

> Would it be beneficial to keep running test-mmap.c on this machine until
> two or more processes end up in D-state? I can leave this machine
> running test-mmap.c over the weekend. 

No, that's OK.  The next step should be for a kernel wrangler to get in
there with your testcase.  It could well be that lock_page-inside-lock_page
thing.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:12         ` Arjan van de Ven
@ 2006-12-15 21:43           ` Michal Sabala
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Sabala @ 2006-12-15 21:43 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel

On 2006/12/15 at 15:12:06 Arjan van de Ven <arjan@infradead.org> wrote
> 
> > 
> > I do not have any indication that it is the server not responding. Other
> > applications which have NFS files open are continuing to work while in
> > this case XFree86 blocks.
> 
> just a strange question, but which video driver do you use in X? maybe
> that one is blocking say the pci bus or something...

Arjan,

The P3 box with nfs root uses the "ati" X11 driver with:
0000:01:00.0 VGA compatible controller: ATI Technologies Inc Rage Mobility P/M AGP 2x (rev 64)

The P4 box with nfs /home uses the "i810" X11 driver with:
0000:00:02.0 VGA compatible controller: Intel Corp. 82865G Integrated Graphics Device (rev 02)

Thanks, Michal

-- 
Michal "Saahbs" Sabala

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:06       ` Michal Sabala
  2006-12-15 21:12         ` Arjan van de Ven
@ 2006-12-15 21:44         ` Trond Myklebust
  2006-12-15 22:05           ` Michal Sabala
                             ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Trond Myklebust @ 2006-12-15 21:44 UTC (permalink / raw)
  To: Michal Sabala; +Cc: linux-kernel

On Fri, 2006-12-15 at 15:06 -0600, Michal Sabala wrote:
> On 2006/12/15 at 13:44:44 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > On Fri, 2006-12-15 at 11:50 -0600, Michal Sabala wrote:
> > > On 2006/12/15 at 10:24:15 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> > > > On Thu, 2006-12-14 at 20:30 -0600, Michal Sabala wrote:
> > > > > 
> > > > > `cat /proc/*PID*/wchan` for all hanging processes contains page_sync.
> > > > 
> > > > Have you tried an 'echo t >/proc/sysrq-trigger' on a client with one of
> > > > these hanging processes? If so, what does the output look like?
> > > 
> > > Hello Trond,
> > > 
> > > Below is the sysrq trace output for XFree86 which entered the
> > > uninterruptible sleep state on the P4 machine with nfs /home. Please
> > > note that XFree86 does not have any files open in /home - as reported by
> > > `lsof`. Below, I also listed the output of vmstat.
> > 
> > 
> > It is hanging because it is trying to free up memory by reclaiming pages
> > that are held by your mmaped file on NFS. Do you know why NFS is
> > hanging?
> 
> Trond,
> 
> I do not have any indication that it is the server not responding. Other
> applications which have NFS files open are continuing to work while in
> this case XFree86 blocks.
> 
> Also, please note that test-mmap.c has successfully finished execution
> and it is no longer running while XFree86 is still hanging.
> 
> Could this be related to the fact that the nfs mmaped file is unlinked
> > before it is unmapped? The .nfsXXXXXXX file disappears from the NFS
> server as soon as test-mmap.c exits.

That shouldn't normally matter. The file won't be deleted until after
the last user has stopped referencing it. However it is true that the
trace you sent indicated that XFree86 was hanging in iput().

> What nfs_debug information would be useful in tracking this
> problem? Is there any other information I can provide you?

Could you just out of interest try 2.6.20-rc1?

Cheers
  Trond


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:44         ` Trond Myklebust
@ 2006-12-15 22:05           ` Michal Sabala
  2006-12-19 22:26           ` Andrew Morton
  2006-12-20 14:51           ` 2.6.18 mmap hangs unrelated apps Michal Sabala
  2 siblings, 0 replies; 22+ messages in thread
From: Michal Sabala @ 2006-12-15 22:05 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

On 2006/12/15 at 15:44:14 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> On Fri, 2006-12-15 at 15:06 -0600, Michal Sabala wrote:
> > Could this be related to the fact that the nfs mmaped file is unlinked
> > before it is unmapped? The .nfsXXXXXXX file disappears from the NFS
> > server as soon as test-mmap.c exits.
> 
> That shouldn't normally matter. The file won't be deleted until after
> the last user has stopped referencing it. However it is true that the
> trace you sent indicated that XFree86 was hanging in iput().
> 
> > What nfs_debug information would be useful in tracking this
> > problem? Is there any other information I can provide you?
> 
> Could you just out of interest try 2.6.20-rc1?

Trond,

I'll try 2.6.20-rc1 on Monday and post results to the list.

Thanks,
Michal

-- 
Michal "Saahbs" Sabala

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15  2:30 2.6.18 mmap hangs unrelated apps Michal Sabala
  2006-12-15 16:24 ` Trond Myklebust
@ 2006-12-16 12:59 ` Christian Kuhn
  2006-12-16 18:45   ` Christian Kuhn
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Kuhn @ 2006-12-16 12:59 UTC (permalink / raw)
  To: Michal Sabala; +Cc: linux-kernel

Hi,

On Thursday, 14.12.2006 at 20:30 -0600, Michal Sabala wrote:
> I am observing processes entering uninterruptible sleep apparently due
> to an unrelated application using mmap over nfs. Applications in
> "uninterruptible sleep" hang indefinitely while other applications
> continue working properly.
> 

I have a similar issue that seems to be related to this. If I start a
copy process to a nfs mount, this leads to one other process starving
for a while.

This can be reproduced with vanilla 2.6.17,18,19 and 19.1 by starting a
cp to a nfs share and running a local kernel untar at the same time.
After some seconds the tar hangs for an arbitrary time. The system is
stable as long as nfs is not used.

Neither logs nor dmesg show anything of interest.

I have already stripped down my (not tainted) x86_64 kernel as far as I
could. This is an AMD Athlon X2 with an nForce chipset, running Ubuntu Edgy
Eft and Gentoo. I have tested the usual suspects: disabled SMP, APIC, ACPI
and preemption, tried another NIC, ran without X and such.

I will provide any information needed to track this down. But I am not a
kernel hacker and will need advice on how to do this.

Thank You,
Christian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-16 12:59 ` Christian Kuhn
@ 2006-12-16 18:45   ` Christian Kuhn
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Kuhn @ 2006-12-16 18:45 UTC (permalink / raw)
  To: Michal Sabala; +Cc: linux-kernel

Hi,

I enabled some kernel hacking options on 2.6.20-rc1 and ran sysrq t when
the problem occurred.

Some hopefully useful information is in the links below: .config, dmesg,
vmstat and vmstat -m. Sorry for the links; I do not know what is
relevant and this is too much to inline for this list (is it?)


http://www.lollingola.de/kernel/dmesg
http://www.lollingola.de/kernel/vmstat
http://www.lollingola.de/kernel/vmstat-m
http://www.lollingola.de/kernel/config-2.6.20-rc1


Christian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:44         ` Trond Myklebust
  2006-12-15 22:05           ` Michal Sabala
@ 2006-12-19 22:26           ` Andrew Morton
  2006-12-19 23:19             ` Trond Myklebust
  2006-12-20 14:51           ` 2.6.18 mmap hangs unrelated apps Michal Sabala
  2 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2006-12-19 22:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Michal Sabala, linux-kernel

On Fri, 15 Dec 2006 16:44:14 -0500
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> However it is true that the
> trace you sent indicated that XFree86 was hanging in iput().

We know what the bug is, don't we?

> > XFree86       D 00000003     0  2471   2453                     (NOTLB)
> >        c4871c0c 00003082 c86b72bc 00000003 cb7c94a4 0000001d 3b67f3ff c0146dd2 
> >        c1184180 cb3e7110 00000000 001ec7ff a60f8097 00000089 c02e1e60 cb3e7000 
> >        c1184180 00000000 c1180030 c4871c18 c028c7d8 c4871c5c c01435b6 c01435f3 
> > Call Trace:
> >  [<c0146dd2>] free_pages_bulk+0x1d/0x1d4
> >  [<c028c7d8>] io_schedule+0x26/0x30
> >  [<c01435b6>] sync_page+0x0/0x40
> >  [<c01435f3>] sync_page+0x3d/0x40
> >  [<c028c9ce>] __wait_on_bit_lock+0x2c/0x52
> >  [<c0143c13>] __lock_page+0x6a/0x72
> >  [<c012ec77>] wake_bit_function+0x0/0x3c
> >  [<c012ec77>] wake_bit_function+0x0/0x3c
> >  [<c0149d2f>] pagevec_lookup+0x17/0x1d
> >  [<c014a085>] truncate_inode_pages_range+0x20a/0x260
> >  [<c014a0e4>] truncate_inode_pages+0x9/0xc
> >  [<c0172c8a>] generic_delete_inode+0xb6/0x10f
> >  [<c0172e73>] iput+0x5f/0x61
> >  [<c01706bd>] dentry_iput+0x68/0x83
> >  [<c01707d8>] dput+0x100/0x118
> >  [<ccb6c334>] put_nfs_open_context+0x67/0x88 [nfs]
> >  [<ccb701ed>] nfs_release_request+0x38/0x47 [nfs]
> >  [<ccb736dd>] nfs_wait_on_requests_locked+0x62/0x98 [nfs]
> >  [<ccb74c32>] nfs_sync_inode_wait+0x4a/0x130 [nfs]
> >  [<ccb6b639>] nfs_release_page+0x0/0x30 [nfs]
> >  [<ccb6b655>] nfs_release_page+0x1c/0x30 [nfs]
> >  [<c015f37c>] try_to_release_page+0x34/0x46
> >  [<c014aa8b>] shrink_page_list+0x263/0x350
> >  [<c0104db8>] do_IRQ+0x48/0x50
> >  [<c01036c6>] common_interrupt+0x1a/0x20
> >  [<c014acd7>] shrink_inactive_list+0x9b/0x248
> >  [<c014b2fd>] shrink_zone+0xb5/0xd0
> >  [<c014b382>] shrink_zones+0x6a/0x7e
> >  [<c014b48e>] try_to_free_pages+0xf8/0x1da
> >  [<c0147a18>] __alloc_pages+0x17c/0x278
> >  [<c014f555>] do_anonymous_page+0x45/0x150
> >  [<c014f9f7>] __handle_mm_fault+0xda/0x1bf
> >  [<c0115849>] do_page_fault+0x1c4/0x4bc
> >  [<c01021b7>] restore_sigcontext+0x10c/0x15f
> >  [<c0115685>] do_page_fault+0x0/0x4bc
> >  [<c0103809>] error_code+0x39/0x40
> 
> nfs_release_page() was called with a locked page.  It's doing a bunch of
> stuff which results in a call to truncate_inode_pages(), which will run
> lock_page(), which is deadlocky.

Now, arguably the VM shouldn't be calling try_to_release_page() with
__GFP_FS when it's holding a lock on a page.

But otoh, NFS should never be running lock_page() within nfs_release_page()
against the page which was passed into nfs_release_page().  It'll deadlock
for sure.

So we could alter the VM to not pass in __GFP_FS in this situation, but
nfs_release_page() would still be deadlocky.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-19 22:26           ` Andrew Morton
@ 2006-12-19 23:19             ` Trond Myklebust
  2006-12-20  0:03               ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2006-12-19 23:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Sabala, linux-kernel

On Tue, 2006-12-19 at 14:26 -0800, Andrew Morton wrote:
> On Fri, 15 Dec 2006 16:44:14 -0500
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> > However it is true that the
> > trace you sent indicated that XFree86 was hanging in iput().
> 
> We know what the bug is, don't we?
> 
> > > XFree86       D 00000003     0  2471   2453                     (NOTLB)
> > >        c4871c0c 00003082 c86b72bc 00000003 cb7c94a4 0000001d 3b67f3ff c0146dd2 
> > >        c1184180 cb3e7110 00000000 001ec7ff a60f8097 00000089 c02e1e60 cb3e7000 
> > >        c1184180 00000000 c1180030 c4871c18 c028c7d8 c4871c5c c01435b6 c01435f3 
> > > Call Trace:
> > >  [<c0146dd2>] free_pages_bulk+0x1d/0x1d4
> > >  [<c028c7d8>] io_schedule+0x26/0x30
> > >  [<c01435b6>] sync_page+0x0/0x40
> > >  [<c01435f3>] sync_page+0x3d/0x40
> > >  [<c028c9ce>] __wait_on_bit_lock+0x2c/0x52
> > >  [<c0143c13>] __lock_page+0x6a/0x72
> > >  [<c012ec77>] wake_bit_function+0x0/0x3c
> > >  [<c012ec77>] wake_bit_function+0x0/0x3c
> > >  [<c0149d2f>] pagevec_lookup+0x17/0x1d
> > >  [<c014a085>] truncate_inode_pages_range+0x20a/0x260
> > >  [<c014a0e4>] truncate_inode_pages+0x9/0xc
> > >  [<c0172c8a>] generic_delete_inode+0xb6/0x10f
> > >  [<c0172e73>] iput+0x5f/0x61
> > >  [<c01706bd>] dentry_iput+0x68/0x83
> > >  [<c01707d8>] dput+0x100/0x118
> > >  [<ccb6c334>] put_nfs_open_context+0x67/0x88 [nfs]
> > >  [<ccb701ed>] nfs_release_request+0x38/0x47 [nfs]
> > >  [<ccb736dd>] nfs_wait_on_requests_locked+0x62/0x98 [nfs]
> > >  [<ccb74c32>] nfs_sync_inode_wait+0x4a/0x130 [nfs]
> > >  [<ccb6b639>] nfs_release_page+0x0/0x30 [nfs]
> > >  [<ccb6b655>] nfs_release_page+0x1c/0x30 [nfs]
> > >  [<c015f37c>] try_to_release_page+0x34/0x46
> > >  [<c014aa8b>] shrink_page_list+0x263/0x350
> > >  [<c0104db8>] do_IRQ+0x48/0x50
> > >  [<c01036c6>] common_interrupt+0x1a/0x20
> > >  [<c014acd7>] shrink_inactive_list+0x9b/0x248
> > >  [<c014b2fd>] shrink_zone+0xb5/0xd0
> > >  [<c014b382>] shrink_zones+0x6a/0x7e
> > >  [<c014b48e>] try_to_free_pages+0xf8/0x1da
> > >  [<c0147a18>] __alloc_pages+0x17c/0x278
> > >  [<c014f555>] do_anonymous_page+0x45/0x150
> > >  [<c014f9f7>] __handle_mm_fault+0xda/0x1bf
> > >  [<c0115849>] do_page_fault+0x1c4/0x4bc
> > >  [<c01021b7>] restore_sigcontext+0x10c/0x15f
> > >  [<c0115685>] do_page_fault+0x0/0x4bc
> > >  [<c0103809>] error_code+0x39/0x40
> > 
> > nfs_release_page() was called with a locked page.  It's doing a bunch of
> > stuff which results in a call to truncate_inode_pages(), which will run
> > lock_page(), which is deadlocky.
> 
> Now, arguably the VM shouldn't be calling try_to_release_page() with
> __GFP_FS when it's holding a lock on a page.
> 
> But otoh, NFS should never be running lock_page() within nfs_release_page()
> against the page which was passed into nfs_release_page().  It'll deadlock
> for sure.

The reason why it is happening is that the last dirty page from that
inode gets cleaned, resulting in a call to dput().

> So we could alter the VM to not pass in __GFP_FS in this situation, but
> nfs_release_page() would still be deadlocky.

It certainly never makes sense to call nfs_release_page(__GFP_FS) if the
caller doesn't have a reference to the inode or the dentry.

OK. How about something like this? (untested)

Cheers
  Trond
-------------------
commit ac5d597264255dfc0f29b4e3f54294d3d5f1778e
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Dec 19 18:17:36 2006 -0500

    NFS: Fix race in nfs_release_page()
    
    invalidate_inode_pages2() may set the dirty bit on a page owing to the call
    to unmap_mapping_range() after the page was locked. In order to fix this,
    NFS has hooked the releasepage() method. This, however leads to deadlocks
    in other parts of the VM.
    
    Fix is to add a new callback: flushpage(), which will write out a dirty
    page that is under the page lock.
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 Documentation/filesystems/Locking |    6 ++++++
 fs/nfs/file.c                     |   11 ++---------
 include/linux/fs.h                |    1 +
 mm/truncate.c                     |   23 ++++++++++++++++++-----
 4 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 790ef6f..ddcff76 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -171,6 +171,7 @@ prototypes:
 	int (*releasepage) (struct page *, int);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
+	int (*flushpage) (struct page *);
 
 locking rules:
 	All except set_page_dirty may block
@@ -188,6 +189,7 @@ bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
+flushpage:		no	yes
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -281,6 +283,10 @@ buffers from the page in preparation for
 indicate that the buffers are (or may be) freeable.  If ->releasepage is zero,
 the kernel assumes that the fs has no private interest in the buffers.
 
+	->flushpage() is called when the kernel has locked a dirty page prior
+to releasing it. It returns 0 if the page was successfully cleaned. Note
+that the page lock must be held across the entire operation.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0dd6be3..77e6c42 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -313,15 +313,8 @@ static void nfs_invalidate_page(struct p
 	nfs_wb_page_priority(page->mapping->host, page, FLUSH_INVALIDATE);
 }
 
-static int nfs_release_page(struct page *page, gfp_t gfp)
+static int nfs_flush_page(struct page *page)
 {
-	/*
-	 * Avoid deadlock on nfs_wait_on_request().
-	 */
-	if (!(gfp & __GFP_FS))
-		return 0;
-	/* Hack... Force nfs_wb_page() to write out the page */
-	SetPageDirty(page);
 	return !nfs_wb_page(page->mapping->host, page);
 }
 
@@ -334,10 +327,10 @@ const struct address_space_operations nf
 	.prepare_write = nfs_prepare_write,
 	.commit_write = nfs_commit_write,
 	.invalidatepage = nfs_invalidate_page,
-	.releasepage = nfs_release_page,
 #ifdef CONFIG_NFS_DIRECTIO
 	.direct_IO = nfs_direct_IO,
 #endif
+	.flushpage = nfs_flush_page,
 };
 
 static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 186da81..bb9ce57 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -426,6 +426,7 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
+	int (*flushpage) (struct page *);
 };
 
 struct backing_dev_info;
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..0d6b18d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -321,6 +321,16 @@ failed:
 	return 0;
 }
 
+static int
+flush_page(struct address_space *mapping, struct page *page)
+{
+	if (!PageDirty(page))
+		return 0;
+	if (page->mapping != mapping || mapping->a_ops->flushpage == NULL)
+		return 0;
+	return mapping->a_ops->flushpage(page);
+}
+
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -386,11 +396,14 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
-			if (!invalidate_complete_page2(mapping, page)) {
-				if (was_dirty)
-					set_page_dirty(page);
-				ret = -EIO;
+			ret = flush_page(mapping, page);
+			if (ret == 0) {
+				was_dirty = test_clear_page_dirty(page);
+				if (!invalidate_complete_page2(mapping, page)) {
+					if (was_dirty)
+						set_page_dirty(page);
+					ret = -EIO;
+				}
 			}
 			unlock_page(page);
 		}



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-19 23:19             ` Trond Myklebust
@ 2006-12-20  0:03               ` Andrew Morton
  2006-12-20  0:17                 ` Trond Myklebust
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2006-12-20  0:03 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Michal Sabala, linux-kernel

On Tue, 19 Dec 2006 18:19:38 -0500
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

>     NFS: Fix race in nfs_release_page()
>     
>     invalidate_inode_pages2() may set the dirty bit on a page owing to the call
>     to unmap_mapping_range() after the page was locked. In order to fix this,
>     NFS has hooked the releasepage() method. This, however leads to deadlocks
>     in other parts of the VM.

hmm, subtle.

>     Fix is to add a new callback: flushpage(), which will write out a dirty
>     page that is under the page lock.
>     

I guess this might permit us to clean up some of the nasties in
invalidate_inode_pages2() - if the page comes dirty again, write it again. 
But the requirement that the page remain locked makes it hard.  Need to
think about it some more.

Are you sure this is the cause of the NFS problem?

>  	.prepare_write = nfs_prepare_write,
>  	.commit_write = nfs_commit_write,
>  	.invalidatepage = nfs_invalidate_page,
> -	.releasepage = nfs_release_page,

A NULL ->releasepage means that try_to_release_page() will call
try_to_free_buffers() if PagePrivate().  I suspect you'll need a stub to
prevent this.
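
The fallback behaviour can be modelled in a few lines, as a sketch only: `toy_page`, `toy_aops`, `toy_free_buffers()`, `toy_try_to_release_page()` and `toy_release_stub()` are hypothetical names, and the model only covers pages that already have private data attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page that has private data attached. */
struct toy_page {
	int buffers_freed;	/* set when the generic fallback ran */
};

/* Hypothetical operations table; the hook may be left NULL. */
struct toy_aops {
	int (*releasepage)(struct toy_page *page);
};

/* Models the generic buffer-freeing fallback. */
static int toy_free_buffers(struct toy_page *page)
{
	page->buffers_freed = 1;
	return 1;
}

/* A missing hook does not mean "do nothing"; it selects the fallback. */
static int toy_try_to_release_page(const struct toy_aops *ops,
				   struct toy_page *page)
{
	assert(ops != NULL && page != NULL);
	if (ops->releasepage != NULL)
		return ops->releasepage(page);
	return toy_free_buffers(page);
}

/* Stub hook: always reports that the page cannot be released. */
static int toy_release_stub(struct toy_page *page)
{
	(void)page;
	return 0;
}
```

With a NULL hook the generic path runs on private data it may not understand. With the stub installed, the page is reported as not releasable and the fallback is never reached.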

(We were supposed to stop doing that about four years ago - change it so
that all a_ops must implement ->releasepage, but nobody got around to it).


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-20  0:03               ` Andrew Morton
@ 2006-12-20  0:17                 ` Trond Myklebust
  2006-12-20  0:22                   ` Andrew Morton
  2006-12-20  1:21                   ` Trond Myklebust
  0 siblings, 2 replies; 22+ messages in thread
From: Trond Myklebust @ 2006-12-20  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Sabala, linux-kernel

On Tue, 2006-12-19 at 16:03 -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2006 18:19:38 -0500
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> 
> >     NFS: Fix race in nfs_release_page()
> >     
> >     invalidate_inode_pages2() may set the dirty bit on a page owing to the call
> >     to unmap_mapping_range() after the page was locked. In order to fix this,
> >     NFS has hooked the releasepage() method. This, however leads to deadlocks
> >     in other parts of the VM.
> 
> hmm, subtle.
> 
> >     Fix is to add a new callback: flushpage(), which will write out a dirty
> >     page that is under the page lock.
> >     
> 
> I guess this might permit us to clean up some of the nasties in
> invalidate_inode_pages2() - if the page comes dirty again, write it again. 
> But the requirement that the page remain locked makes it hard.  Need to
> think about it some more.

This was one of the reasons why I had to introduce
nfs_writepage_locked() for 2.6.20 (the other reason being readpage()).

The problem is that you can only protect against redirtying of the page
by holding the page lock across the call to unmap_mapping_range(), the
page writeout and the page removal.

> Are you sure this is the cause of the NFS problem?
> 
> >  	.prepare_write = nfs_prepare_write,
> >  	.commit_write = nfs_commit_write,
> >  	.invalidatepage = nfs_invalidate_page,
> > -	.releasepage = nfs_release_page,
> 
> A NULL ->releasepage means that try_to_release_page() will call
> try_to_free_buffers() if PagePrivate().  I suspect you'll need a stub to
> prevent this.

Ack, I'll add one in. If PagePrivate() is set during the call to
try_to_release_page(), then the page should never be freeable.

> (We were supposed to stop doing that about four years ago - change it so
> that all a_ops must implement ->releasepage, but nobody got around to it).

Would you still be interested in seeing this done?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-20  0:17                 ` Trond Myklebust
@ 2006-12-20  0:22                   ` Andrew Morton
  2006-12-20  1:21                   ` Trond Myklebust
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2006-12-20  0:22 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Michal Sabala, linux-kernel

On Tue, 19 Dec 2006 19:17:43 -0500
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> > (We were supposed to stop doing that about four years ago - change it so
> > that all a_ops must implement ->releasepage, but nobody got around to it).
> 
> Would you still be interested in seeing this done?

Sure, when things calm down.  It's just a cleanup.

There are various places where we got lazy and did this.  ->set_page_dirty,
->page_mkwrite, many others.  With varying degrees of consequential ugliness.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-20  0:17                 ` Trond Myklebust
  2006-12-20  0:22                   ` Andrew Morton
@ 2006-12-20  1:21                   ` Trond Myklebust
  2007-01-08 14:48                     ` [PATCH] NFS: Fix race in nfs_release_page() Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Trond Myklebust @ 2006-12-20  1:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Sabala, linux-kernel

On Tue, 2006-12-19 at 19:17 -0500, Trond Myklebust wrote:
> Ack, I'll add one in. If PagePrivate() is set during the call to
> try_to_release_page(), then the page should never be freeable.

OK. This one actually compiles, and eliminates a few logic bugs. Note
that I renamed the callback to ->launder_page() for clarity (and for
histerical reasons).

Cheers
  Trond

----------------------------------------------------------------
commit 85a5b844c56706a5e3f47cde8b82109d325ad609
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Dec 19 20:18:55 2006 -0500

    NFS: Fix race in nfs_release_page()
    
    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however, leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.
    
    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 Documentation/filesystems/Locking |    8 ++++++++
 fs/nfs/file.c                     |   16 ++++++++--------
 include/linux/fs.h                |    1 +
 mm/truncate.c                     |   23 ++++++++++++++++++-----
 4 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 790ef6f..28bfea7 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -171,6 +171,7 @@ prototypes:
 	int (*releasepage) (struct page *, int);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
+	int (*launder_page) (struct page *);
 
 locking rules:
 	All except set_page_dirty may block
@@ -188,6 +189,7 @@ bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
+launder_page:		no	yes
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -281,6 +283,12 @@ buffers from the page in preparation for
 indicate that the buffers are (or may be) freeable.  If ->releasepage is zero,
 the kernel assumes that the fs has no private interest in the buffers.
 
+	->launder_page() may be called prior to releasing a page if
+it is still found to be dirty. It returns zero if the page was successfully
+cleaned, or an error value if not. Note that in order to prevent the page
+getting mapped back in and redirtied, it needs to be kept locked
+across the entire operation.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0dd6be3..fab20d0 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -315,14 +315,13 @@ static void nfs_invalidate_page(struct p
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
-	/*
-	 * Avoid deadlock on nfs_wait_on_request().
-	 */
-	if (!(gfp & __GFP_FS))
-		return 0;
-	/* Hack... Force nfs_wb_page() to write out the page */
-	SetPageDirty(page);
-	return !nfs_wb_page(page->mapping->host, page);
+	/* If PagePrivate() is set, then the page is not freeable */
+	return 0;
+}
+
+static int nfs_launder_page(struct page *page)
+{
+	return nfs_wb_page(page->mapping->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
@@ -338,6 +337,7 @@ const struct address_space_operations nf
 #ifdef CONFIG_NFS_DIRECTIO
 	.direct_IO = nfs_direct_IO,
 #endif
+	.launder_page = nfs_launder_page,
 };
 
 static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 186da81..14a337c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -426,6 +426,7 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
+	int (*launder_page) (struct page *);
 };
 
 struct backing_dev_info;
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..d4811dc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -321,6 +321,16 @@ failed:
 	return 0;
 }
 
+static int
+do_launder_page(struct address_space *mapping, struct page *page)
+{
+	if (!PageDirty(page))
+		return 0;
+	if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
+		return 0;
+	return mapping->a_ops->launder_page(page);
+}
+
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -386,11 +396,14 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			was_dirty = test_clear_page_dirty(page);
-			if (!invalidate_complete_page2(mapping, page)) {
-				if (was_dirty)
-					set_page_dirty(page);
-				ret = -EIO;
+			ret = do_launder_page(mapping, page);
+			if (ret == 0) {
+				was_dirty = test_clear_page_dirty(page);
+				if (!invalidate_complete_page2(mapping, page)) {
+					if (was_dirty)
+						set_page_dirty(page);
+					ret = -EIO;
+				}
 			}
 			unlock_page(page);
 		}
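
The control flow this patch adds to invalidate_inode_pages2_range() can be
modelled outside the kernel. The sketch below is a simplified user-space
model, not the kernel implementation: all toy_* names are hypothetical, the
dirty-accounting details are left out, and toy_remove_page() simply assumes
that a page which is still dirty cannot be dropped.

```c
#include <errno.h>
#include <stddef.h>

struct toy_mapping;

/* Hypothetical stand-ins for struct page and friends. */
struct toy_page {
	int dirty;
	struct toy_mapping *mapping;
};

struct toy_aops {
	/* Optional: write the page out while it stays locked. */
	int (*launder_page)(struct toy_page *page);
};

struct toy_mapping {
	const struct toy_aops *a_ops;
};

/* Same shape as do_launder_page() in the patch: a clean page, a page that
 * no longer belongs to this mapping, or a missing method all count as
 * "nothing to do". */
static int toy_do_launder_page(struct toy_mapping *mapping,
			       struct toy_page *page)
{
	if (!page->dirty)
		return 0;
	if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
		return 0;
	return mapping->a_ops->launder_page(page);
}

/* Assumed stand-in for invalidate_complete_page2(): returns 1 if the page
 * was dropped from the mapping, 0 if it had to stay. */
static int toy_remove_page(struct toy_page *page)
{
	if (page->dirty)
		return 0;
	page->mapping = NULL;
	return 1;
}

/* Launder first; only try to drop the page if laundering succeeded. */
static int toy_invalidate_one(struct toy_mapping *mapping,
			      struct toy_page *page)
{
	int ret = toy_do_launder_page(mapping, page);

	if (ret == 0 && !toy_remove_page(page))
		ret = -EIO;
	return ret;
}

/* Example callbacks: one that cleans the page, one that fails. */
static int toy_launder_ok(struct toy_page *page)
{
	page->dirty = 0;
	return 0;
}

static int toy_launder_fail(struct toy_page *page)
{
	(void)page;
	return -EAGAIN;
}
```

In this model a dirty page whose mapping supplies a working launder callback
is cleaned and then dropped, a dirty page with no callback is reported as
-EIO, and a failing callback's error is passed straight back to the caller
without attempting the removal.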




* Re: 2.6.18 mmap hangs unrelated apps
  2006-12-15 21:44         ` Trond Myklebust
  2006-12-15 22:05           ` Michal Sabala
  2006-12-19 22:26           ` Andrew Morton
@ 2006-12-20 14:51           ` Michal Sabala
  2 siblings, 0 replies; 22+ messages in thread
From: Michal Sabala @ 2006-12-20 14:51 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel

On 2006/12/15 at 15:44:14 Trond Myklebust <trond.myklebust@fys.uio.no> wrote
> On Fri, 2006-12-15 at 15:06 -0600, Michal Sabala wrote:
> >
> > What nfs_debug information would be useful in tracking this
> > problem? Is there any other information I can provide you?
> 
> Could you just out of interest try 2.6.20-rc1?

Hello Trond, Andrew,

For what it's worth, after running 2.6.20-rc1 for ~12 hours, I did not
observe the uninterruptible sleep condition.

Thanks, Michal

-- 
Michal "Saahbs" Sabala


* [PATCH] NFS: Fix race in nfs_release_page()
  2006-12-20  1:21                   ` Trond Myklebust
@ 2007-01-08 14:48                     ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2007-01-08 14:48 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, Michal Sabala, linux-kernel, torvalds


Could we push this to mainline before .20 - pretty please?

---
From: Trond Myklebust <Trond.Myklebust@netapp.com>

    NFS: Fix race in nfs_release_page()
    
    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however, leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.
    
    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.
    
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

    Also, the bare SetPageDirty() can skew all sorts of accounting, leading
    to other nasties.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/filesystems/Locking |    8 ++++++++
 fs/nfs/file.c                     |   16 ++++++++--------
 include/linux/fs.h                |    1 +
 mm/truncate.c                     |   13 ++++++++++++-
 4 files changed, 29 insertions(+), 9 deletions(-)

Index: linux-2.6-git/Documentation/filesystems/Locking
===================================================================
--- linux-2.6-git.orig/Documentation/filesystems/Locking	2006-12-08 10:11:33.000000000 +0100
+++ linux-2.6-git/Documentation/filesystems/Locking	2007-01-08 15:11:15.000000000 +0100
@@ -171,6 +171,7 @@ prototypes:
 	int (*releasepage) (struct page *, int);
 	int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs);
+	int (*launder_page) (struct page *);
 
 locking rules:
 	All except set_page_dirty may block
@@ -188,6 +189,7 @@ bmap:			yes
 invalidatepage:		no	yes
 releasepage:		no	yes
 direct_IO:		no
+launder_page:		no	yes
 
 	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
@@ -281,6 +283,12 @@ buffers from the page in preparation for
 indicate that the buffers are (or may be) freeable.  If ->releasepage is zero,
 the kernel assumes that the fs has no private interest in the buffers.
 
+	->launder_page() may be called prior to releasing a page if
+it is still found to be dirty. It returns zero if the page was successfully
+cleaned, or an error value if not. Note that in order to prevent the page
+getting mapped back in and redirtied, it needs to be kept locked
+across the entire operation.
+
 	Note: currently almost all instances of address_space methods are
 using BKL for internal serialization and that's one of the worst sources
 of contention. Normally they are calling library functions (in fs/buffer.c)
Index: linux-2.6-git/fs/nfs/file.c
===================================================================
--- linux-2.6-git.orig/fs/nfs/file.c	2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/fs/nfs/file.c	2007-01-08 15:11:15.000000000 +0100
@@ -315,14 +315,13 @@ static void nfs_invalidate_page(struct p
 
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
-	/*
-	 * Avoid deadlock on nfs_wait_on_request().
-	 */
-	if (!(gfp & __GFP_FS))
-		return 0;
-	/* Hack... Force nfs_wb_page() to write out the page */
-	SetPageDirty(page);
-	return !nfs_wb_page(page->mapping->host, page);
+	/* If PagePrivate() is set, then the page is not freeable */
+	return 0;
+}
+
+static int nfs_launder_page(struct page *page)
+{
+	return nfs_wb_page(page->mapping->host, page);
 }
 
 const struct address_space_operations nfs_file_aops = {
@@ -338,6 +337,7 @@ const struct address_space_operations nf
 #ifdef CONFIG_NFS_DIRECTIO
 	.direct_IO = nfs_direct_IO,
 #endif
+	.launder_page = nfs_launder_page,
 };
 
 static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
Index: linux-2.6-git/include/linux/fs.h
===================================================================
--- linux-2.6-git.orig/include/linux/fs.h	2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/include/linux/fs.h	2007-01-08 15:11:15.000000000 +0100
@@ -426,6 +426,7 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *);
+	int (*launder_page) (struct page *);
 };
 
 struct backing_dev_info;
Index: linux-2.6-git/mm/truncate.c
===================================================================
--- linux-2.6-git.orig/mm/truncate.c	2007-01-08 11:53:13.000000000 +0100
+++ linux-2.6-git/mm/truncate.c	2007-01-08 15:30:08.000000000 +0100
@@ -341,6 +341,16 @@ failed:
 	return 0;
 }
 
+static int
+do_launder_page(struct address_space *mapping, struct page *page)
+{
+	if (!PageDirty(page))
+		return 0;
+	if (page->mapping != mapping || mapping->a_ops->launder_page == NULL)
+		return 0;
+	return mapping->a_ops->launder_page(page);
+}
+
 /**
  * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping: the address_space
@@ -405,7 +415,8 @@ int invalidate_inode_pages2_range(struct
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
-			if (!invalidate_complete_page2(mapping, page))
+			ret = do_launder_page(mapping, page);
+			if (ret == 0 && !invalidate_complete_page2(mapping, page))
 				ret = -EIO;
 			unlock_page(page);
 		}




Thread overview: 22+ messages
2006-12-15  2:30 2.6.18 mmap hangs unrelated apps Michal Sabala
2006-12-15 16:24 ` Trond Myklebust
2006-12-15 17:50   ` Michal Sabala
2006-12-15 19:44     ` Trond Myklebust
2006-12-15 21:06       ` Michal Sabala
2006-12-15 21:12         ` Arjan van de Ven
2006-12-15 21:43           ` Michal Sabala
2006-12-15 21:44         ` Trond Myklebust
2006-12-15 22:05           ` Michal Sabala
2006-12-19 22:26           ` Andrew Morton
2006-12-19 23:19             ` Trond Myklebust
2006-12-20  0:03               ` Andrew Morton
2006-12-20  0:17                 ` Trond Myklebust
2006-12-20  0:22                   ` Andrew Morton
2006-12-20  1:21                   ` Trond Myklebust
2007-01-08 14:48                     ` [PATCH] NFS: Fix race in nfs_release_page() Peter Zijlstra
2006-12-20 14:51           ` 2.6.18 mmap hangs unrelated apps Michal Sabala
2006-12-15 20:42     ` Andrew Morton
2006-12-15 21:35       ` Michal Sabala
2006-12-15 21:41         ` Andrew Morton
2006-12-16 12:59 ` Christian Kuhn
2006-12-16 18:45   ` Christian Kuhn
