LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 0/8] fuse: writable mmap + batched write
@ 2008-03-17 19:19 Miklos Szeredi
2008-03-17 19:19 ` [patch 1/8] mm: bdi: export bdi_writeout_inc() Miklos Szeredi
` (7 more replies)
0 siblings, 8 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
Hi Andrew,
These are fuse updates for 2.6.26.
1-4) small tweaks to core code to make writable mmap in fuse possible
5) the fuse writable mmap support itself
6-7) allows buffered fuse writes to be bigger than 4k
8) handle short buffered reads better
Thanks,
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 1/8] mm: bdi: export bdi_writeout_inc()
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-18 11:27 ` Peter Zijlstra
2008-03-17 19:19 ` [patch 2/8] mm: Add NR_WRITEBACK_TEMP counter Miklos Szeredi
` (6 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: export_bdi_writeout_inc.patch --]
[-- Type: text/plain, Size: 1361 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
Fuse needs this for writable mmap support.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
include/linux/backing-dev.h | 2 ++
mm/page-writeback.c | 10 ++++++++++
2 files changed, 12 insertions(+)
Index: linux/include/linux/backing-dev.h
===================================================================
--- linux.orig/include/linux/backing-dev.h 2008-03-17 18:24:13.000000000 +0100
+++ linux/include/linux/backing-dev.h 2008-03-17 18:24:36.000000000 +0100
@@ -134,6 +134,8 @@ static inline s64 bdi_stat_sum(struct ba
return sum;
}
+extern void bdi_writeout_inc(struct backing_dev_info *bdi);
+
/*
* maximal error of a stat counter.
*/
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/mm/page-writeback.c 2008-03-17 18:24:36.000000000 +0100
@@ -168,6 +168,16 @@ static inline void __bdi_writeout_inc(st
bdi->max_prop_frac);
}
+void bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __bdi_writeout_inc(bdi);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL(bdi_writeout_inc);
+
static inline void task_dirty_inc(struct task_struct *tsk)
{
prop_inc_single(&vm_dirties, &tsk->dirties);
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 2/8] mm: Add NR_WRITEBACK_TEMP counter
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
2008-03-17 19:19 ` [patch 1/8] mm: bdi: export bdi_writeout_inc() Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-18 5:05 ` Andrew Morton
2008-03-17 19:19 ` [patch 3/8] mm: rotate_reclaimable_page() cleanup Miklos Szeredi
` (5 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: add_nr_writeback_temp_stat.patch --]
[-- Type: text/plain, Size: 3781 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
Fuse will use temporary buffers to write back dirty data from memory
mappings (normal writes are done synchronously). This is needed,
because there cannot be any guarantee about the time in which a write
will complete.
By using temporary buffers, from the MM's point if view the page is
written back immediately. If the writeout was due to memory pressure,
this effectively migrates data from a full zone to a less full zone.
This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of
pages used as temporary buffers.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
drivers/base/node.c | 2 ++
fs/proc/proc_misc.c | 2 ++
include/linux/mmzone.h | 1 +
mm/page-writeback.c | 3 ++-
4 files changed, 7 insertions(+), 1 deletion(-)
Index: linux/fs/proc/proc_misc.c
===================================================================
--- linux.orig/fs/proc/proc_misc.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/fs/proc/proc_misc.c 2008-03-17 18:25:17.000000000 +0100
@@ -179,6 +179,7 @@ static int meminfo_read_proc(char *page,
"PageTables: %8lu kB\n"
"NFS_Unstable: %8lu kB\n"
"Bounce: %8lu kB\n"
+ "WritebackTmp: %8lu kB\n"
"CommitLimit: %8lu kB\n"
"Committed_AS: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
@@ -210,6 +211,7 @@ static int meminfo_read_proc(char *page,
K(global_page_state(NR_PAGETABLE)),
K(global_page_state(NR_UNSTABLE_NFS)),
K(global_page_state(NR_BOUNCE)),
+ K(global_page_state(NR_WRITEBACK_TEMP)),
K(allowed),
K(committed),
(unsigned long)VMALLOC_TOTAL >> 10,
Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h 2008-03-17 18:24:13.000000000 +0100
+++ linux/include/linux/mmzone.h 2008-03-17 18:25:17.000000000 +0100
@@ -95,6 +95,7 @@ enum zone_stat_item {
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
+ NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
Index: linux/drivers/base/node.c
===================================================================
--- linux.orig/drivers/base/node.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/drivers/base/node.c 2008-03-17 18:25:17.000000000 +0100
@@ -64,6 +64,7 @@ static ssize_t node_read_meminfo(struct
"Node %d PageTables: %8lu kB\n"
"Node %d NFS_Unstable: %8lu kB\n"
"Node %d Bounce: %8lu kB\n"
+ "Node %d WritebackTmp: %8lu kB\n"
"Node %d Slab: %8lu kB\n"
"Node %d SReclaimable: %8lu kB\n"
"Node %d SUnreclaim: %8lu kB\n",
@@ -86,6 +87,7 @@ static ssize_t node_read_meminfo(struct
nid, K(node_page_state(nid, NR_PAGETABLE)),
nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
nid, K(node_page_state(nid, NR_BOUNCE)),
+ nid, K(node_page_state(nid, NR_WRITEBACK_TEMP)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2008-03-17 18:24:36.000000000 +0100
+++ linux/mm/page-writeback.c 2008-03-17 18:25:17.000000000 +0100
@@ -211,7 +211,8 @@ clip_bdi_dirty_limit(struct backing_dev_
avail_dirty = dirty -
(global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_WRITEBACK) +
- global_page_state(NR_UNSTABLE_NFS));
+ global_page_state(NR_UNSTABLE_NFS) +
+ global_page_state(NR_WRITEBACK_TEMP));
if (avail_dirty < 0)
avail_dirty = 0;
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 3/8] mm: rotate_reclaimable_page() cleanup
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
2008-03-17 19:19 ` [patch 1/8] mm: bdi: export bdi_writeout_inc() Miklos Szeredi
2008-03-17 19:19 ` [patch 2/8] mm: Add NR_WRITEBACK_TEMP counter Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-18 11:31 ` Peter Zijlstra
2008-03-17 19:19 ` [patch 4/8] mm: allow not updating BDI stats in end_page_writeback() Miklos Szeredi
` (4 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: rotate_reclaimable_page_cleanup.patch --]
[-- Type: text/plain, Size: 3116 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
Clean up messy conditional calling of test_clear_page_writeback() from
both rotate_reclaimable_page() and end_page_writeback().
The only user of rotate_reclaimable_page() is end_page_writeback() so
this is OK.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
include/linux/swap.h | 2 +-
mm/filemap.c | 10 ++++++----
mm/swap.c | 37 ++++++++++++-------------------------
3 files changed, 19 insertions(+), 30 deletions(-)
Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h 2008-03-17 18:24:13.000000000 +0100
+++ linux/include/linux/swap.h 2008-03-17 18:25:38.000000000 +0100
@@ -177,7 +177,7 @@ extern void activate_page(struct page *)
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
-extern int rotate_reclaimable_page(struct page *page);
+extern void rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
/* linux/mm/vmscan.c */
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/mm/filemap.c 2008-03-17 18:25:38.000000000 +0100
@@ -577,10 +577,12 @@ EXPORT_SYMBOL(unlock_page);
*/
void end_page_writeback(struct page *page)
{
- if (!TestClearPageReclaim(page) || rotate_reclaimable_page(page)) {
- if (!test_clear_page_writeback(page))
- BUG();
- }
+ if (TestClearPageReclaim(page))
+ rotate_reclaimable_page(page);
+
+ if (!test_clear_page_writeback(page))
+ BUG();
+
smp_mb__after_clear_bit();
wake_up_page(page, PG_writeback);
}
Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/mm/swap.c 2008-03-17 18:25:38.000000000 +0100
@@ -133,34 +133,21 @@ static void pagevec_move_tail(struct pag
* Writeback is about to end against a page which has been marked for immediate
* reclaim. If it still appears to be reclaimable, move it to the tail of the
* inactive list.
- *
- * Returns zero if it cleared PG_writeback.
*/
-int rotate_reclaimable_page(struct page *page)
+void rotate_reclaimable_page(struct page *page)
{
- struct pagevec *pvec;
- unsigned long flags;
-
- if (PageLocked(page))
- return 1;
- if (PageDirty(page))
- return 1;
- if (PageActive(page))
- return 1;
- if (!PageLRU(page))
- return 1;
-
- page_cache_get(page);
- local_irq_save(flags);
- pvec = &__get_cpu_var(lru_rotate_pvecs);
- if (!pagevec_add(pvec, page))
- pagevec_move_tail(pvec);
- local_irq_restore(flags);
-
- if (!test_clear_page_writeback(page))
- BUG();
+ if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+ PageLRU(page)) {
+ struct pagevec *pvec;
+ unsigned long flags;
- return 0;
+ page_cache_get(page);
+ local_irq_save(flags);
+ pvec = &__get_cpu_var(lru_rotate_pvecs);
+ if (!pagevec_add(pvec, page))
+ pagevec_move_tail(pvec);
+ local_irq_restore(flags);
+ }
}
/*
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
` (2 preceding siblings ...)
2008-03-17 19:19 ` [patch 3/8] mm: rotate_reclaimable_page() cleanup Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-18 5:04 ` Andrew Morton
2008-03-18 11:33 ` Peter Zijlstra
2008-03-17 19:19 ` [patch 5/8] fuse: support writable mmap Miklos Szeredi
` (3 subsequent siblings)
7 siblings, 2 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: end_page_writeback_nobdi.patch --]
[-- Type: text/plain, Size: 3714 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
Fuse's writepage will need to clear page writeback separately from
updating the per BDI counters.
This patch renames end_page_writeback() to __end_page_writeback() and
adds a boolean parameter to indicate if the per BDI stats need to be
updated.
Regular callers get an inline end_page_writeback() without the boolean
parameter.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
include/linux/page-flags.h | 2 +-
include/linux/pagemap.h | 7 ++++++-
mm/filemap.c | 7 ++++---
mm/page-writeback.c | 4 ++--
4 files changed, 13 insertions(+), 7 deletions(-)
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h 2008-03-17 18:24:13.000000000 +0100
+++ linux/include/linux/page-flags.h 2008-03-17 18:25:53.000000000 +0100
@@ -300,7 +300,7 @@ struct page; /* forward declaration */
extern void cancel_dirty_page(struct page *page, unsigned int account_size);
-int test_clear_page_writeback(struct page *page);
+int test_clear_page_writeback(struct page *page, bool bdi_stats);
int test_set_page_writeback(struct page *page);
static inline void set_page_writeback(struct page *page)
Index: linux/include/linux/pagemap.h
===================================================================
--- linux.orig/include/linux/pagemap.h 2008-03-17 18:24:13.000000000 +0100
+++ linux/include/linux/pagemap.h 2008-03-17 18:25:53.000000000 +0100
@@ -223,7 +223,12 @@ static inline void wait_on_page_writebac
wait_on_page_bit(page, PG_writeback);
}
-extern void end_page_writeback(struct page *page);
+extern void __end_page_writeback(struct page *page, bool bdi_stats);
+
+static inline void end_page_writeback(struct page *page)
+{
+ __end_page_writeback(page, true);
+}
/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c 2008-03-17 18:25:38.000000000 +0100
+++ linux/mm/filemap.c 2008-03-17 18:25:53.000000000 +0100
@@ -574,19 +574,20 @@ EXPORT_SYMBOL(unlock_page);
/**
* end_page_writeback - end writeback against a page
* @page: the page
+ * @bdi_stats: update the per-bdi writeback counter
*/
-void end_page_writeback(struct page *page)
+void __end_page_writeback(struct page *page, bool bdi_stats)
{
if (TestClearPageReclaim(page))
rotate_reclaimable_page(page);
- if (!test_clear_page_writeback(page))
+ if (!test_clear_page_writeback(page, bdi_stats))
BUG();
smp_mb__after_clear_bit();
wake_up_page(page, PG_writeback);
}
-EXPORT_SYMBOL(end_page_writeback);
+EXPORT_SYMBOL(__end_page_writeback);
/**
* __lock_page - get a lock on the page, assuming we need to sleep to get it
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2008-03-17 18:25:17.000000000 +0100
+++ linux/mm/page-writeback.c 2008-03-17 18:25:53.000000000 +0100
@@ -1242,7 +1242,7 @@ int clear_page_dirty_for_io(struct page
}
EXPORT_SYMBOL(clear_page_dirty_for_io);
-int test_clear_page_writeback(struct page *page)
+int test_clear_page_writeback(struct page *page, bool bdi_stats)
{
struct address_space *mapping = page_mapping(page);
int ret;
@@ -1257,7 +1257,7 @@ int test_clear_page_writeback(struct pag
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi_cap_writeback_dirty(bdi)) {
+ if (bdi_stats && bdi_cap_writeback_dirty(bdi)) {
__dec_bdi_stat(bdi, BDI_WRITEBACK);
__bdi_writeout_inc(bdi);
}
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 5/8] fuse: support writable mmap
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
` (3 preceding siblings ...)
2008-03-17 19:19 ` [patch 4/8] mm: allow not updating BDI stats in end_page_writeback() Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-17 19:19 ` [patch 6/8] fuse: clean up setting i_size in write Miklos Szeredi
` (2 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: fuse_mmap_write.patch --]
[-- Type: text/plain, Size: 25592 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty
page accounting infrastructure to materialize (thanks to Peter
Zijlstra and others). This nicely solved the biggest problem:
limiting the number of pages used for write caching.
Some small details remained, however, which this largish patch
attempts to address. It provides a page writeback implementation for
fuse, which is completely safe against VM related deadlocks.
Performance may not be very good for certain usage patterns, but
generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
This patch depends on
mm-bdi-allow-setting-a-maximum-for-the-bdi-dirty-limit-fix.patch
Fuse page writeback design
--------------------------
fuse_writepage() allocates a new temporary page with
GFP_NOFS|__GFP_HIGHMEM. It copies the contents of the original page,
and queues a WRITE request to the userspace filesystem using this temp
page.
The writeback is finished instantly from the MM's point of view: the
page is removed from the radix trees, and the PageDirty and
PageWriteback flags are cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the
actual write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This makes sure, there can only be one temporary page used
at a time for one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is
this complication needed?
The basic problem is that there can be no guarantee about the time in
which the userspace filesystem will complete a write. It may be buggy
or even malicious, and fail to complete WRITE requests. We don't want
unrelated parts of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory)
to complete a WRITE request. There's a great danger of a deadlock if
that allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow
blocking. So for these cases new code has to be added to fuse, since
the VM is not tracking writeback pages for us any more.
As an extra safetly measure, the maximum dirty ratio allocated to a
single fuse filesystem is set to 1% by default. This way one (or
several) buggy or malicious fuse filesystems cannot slow down the rest
of the system by hogging dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
fs/fuse/dev.c | 19 +++
fs/fuse/dir.c | 84 +++++++++++++-
fs/fuse/file.c | 320 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
fs/fuse/fuse_i.h | 37 ++++++
fs/fuse/inode.c | 47 ++++++--
5 files changed, 478 insertions(+), 29 deletions(-)
Index: linux/fs/fuse/dev.c
===================================================================
--- linux.orig/fs/fuse/dev.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/fs/fuse/dev.c 2008-03-17 18:26:04.000000000 +0100
@@ -47,6 +47,14 @@ struct fuse_req *fuse_request_alloc(void
return req;
}
+struct fuse_req *fuse_request_alloc_nofs(void)
+{
+ struct fuse_req *req = kmem_cache_alloc(fuse_req_cachep, GFP_NOFS);
+ if (req)
+ fuse_request_init(req);
+ return req;
+}
+
void fuse_request_free(struct fuse_req *req)
{
kmem_cache_free(fuse_req_cachep, req);
@@ -430,6 +438,17 @@ void request_send_background(struct fuse
}
/*
+ * Called under fc->lock
+ *
+ * fc->connected must have been checked previously
+ */
+void request_send_background_locked(struct fuse_conn *fc, struct fuse_req *req)
+{
+ req->isreply = 1;
+ request_send_nowait_locked(fc, req);
+}
+
+/*
* Lock the request. Up to the next unlock_request() there mustn't be
* anything that could cause a page-fault. If the request was already
* aborted bail out.
Index: linux/fs/fuse/dir.c
===================================================================
--- linux.orig/fs/fuse/dir.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/fs/fuse/dir.c 2008-03-17 18:26:04.000000000 +0100
@@ -1107,6 +1107,50 @@ static void iattr_to_fattr(struct iattr
}
/*
+ * Prevent concurrent writepages on inode
+ *
+ * This is done by adding a negative bias to the inode write counter
+ * and waiting for all pending writes to finish.
+ */
+void fuse_set_nowrite(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ BUG_ON(!mutex_is_locked(&inode->i_mutex));
+
+ spin_lock(&fc->lock);
+ BUG_ON(fi->writectr < 0);
+ fi->writectr += FUSE_NOWRITE;
+ spin_unlock(&fc->lock);
+ wait_event(fi->page_waitq, fi->writectr == FUSE_NOWRITE);
+}
+
+/*
+ * Allow writepages on inode
+ *
+ * Remove the bias from the writecounter and send any queued
+ * writepages.
+ */
+static void __fuse_release_nowrite(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ BUG_ON(fi->writectr != FUSE_NOWRITE);
+ fi->writectr = 0;
+ fuse_flush_writepages(inode);
+}
+
+void fuse_release_nowrite(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ spin_lock(&fc->lock);
+ __fuse_release_nowrite(inode);
+ spin_unlock(&fc->lock);
+}
+
+/*
* Set attributes, and at the same time refresh them.
*
* Truncation is slightly complicated, because the 'truncate' request
@@ -1122,6 +1166,8 @@ static int fuse_do_setattr(struct dentry
struct fuse_req *req;
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
+ bool is_truncate = false;
+ loff_t oldsize;
int err;
if (!fuse_allow_task(fc, current))
@@ -1145,12 +1191,16 @@ static int fuse_do_setattr(struct dentry
send_sig(SIGXFSZ, current, 0);
return -EFBIG;
}
+ is_truncate = true;
}
req = fuse_get_req(fc);
if (IS_ERR(req))
return PTR_ERR(req);
+ if (is_truncate)
+ fuse_set_nowrite(inode);
+
memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
iattr_to_fattr(attr, &inarg);
@@ -1181,16 +1231,44 @@ static int fuse_do_setattr(struct dentry
if (err) {
if (err == -EINTR)
fuse_invalidate_attr(inode);
- return err;
+ goto error;
}
if ((inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
make_bad_inode(inode);
- return -EIO;
+ err = -EIO;
+ goto error;
+ }
+
+ spin_lock(&fc->lock);
+ fuse_change_attributes_common(inode, &outarg.attr,
+ attr_timeout(&outarg));
+ oldsize = inode->i_size;
+ i_size_write(inode, outarg.attr.size);
+
+ if (is_truncate) {
+ /* NOTE: this may release/reacquire fc->lock */
+ __fuse_release_nowrite(inode);
+ }
+ spin_unlock(&fc->lock);
+
+ /*
+ * Only call invalidate_inode_pages2() after removing
+ * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
+ */
+ if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+ if (outarg.attr.size < oldsize)
+ fuse_truncate(inode->i_mapping, outarg.attr.size);
+ invalidate_inode_pages2(inode->i_mapping);
}
- fuse_change_attributes(inode, &outarg.attr, attr_timeout(&outarg), 0);
return 0;
+
+error:
+ if (is_truncate)
+ fuse_release_nowrite(inode);
+
+ return err;
}
static int fuse_setattr(struct dentry *entry, struct iattr *attr)
Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/fs/fuse/file.c 2008-03-17 18:26:04.000000000 +0100
@@ -210,6 +210,49 @@ u64 fuse_lock_owner_id(struct fuse_conn
return (u64) v0 + ((u64) v1 << 32);
}
+/*
+ * Check if page is under writeback
+ *
+ * This is currently done by walking the list of writepage requests
+ * for the inode, which can be pretty inefficient.
+ */
+static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_req *req;
+ bool found = false;
+
+ spin_lock(&fc->lock);
+ list_for_each_entry(req, &fi->writepages, writepages_entry) {
+ pgoff_t curr_index;
+
+ BUG_ON(req->inode != inode);
+ curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+ if (curr_index == index) {
+ found = true;
+ break;
+ }
+ }
+ spin_unlock(&fc->lock);
+
+ return found;
+}
+
+/*
+ * Wait for page writeback to be completed.
+ *
+ * Since fuse doesn't rely on the VM writeback tracking, this has to
+ * use some other means.
+ */
+static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
+ return 0;
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file->f_path.dentry->d_inode;
@@ -245,6 +288,21 @@ static int fuse_flush(struct file *file,
return err;
}
+/*
+ * Wait for all pending writepages on the inode to finish.
+ *
+ * This is currently done by blocking further writes with FUSE_NOWRITE
+ * and waiting for all sent writes to complete.
+ *
+ * This must be called under i_mutex, otherwise the FUSE_NOWRITE usage
+ * could conflict with truncation.
+ */
+static void fuse_sync_writes(struct inode *inode)
+{
+ fuse_set_nowrite(inode);
+ fuse_release_nowrite(inode);
+}
+
int fuse_fsync_common(struct file *file, struct dentry *de, int datasync,
int isdir)
{
@@ -261,6 +319,17 @@ int fuse_fsync_common(struct file *file,
if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
return 0;
+ /*
+ * Start writeback against all dirty pages of the inode, then
+ * wait for all outstanding writes, before sending the FSYNC
+ * request.
+ */
+ err = write_inode_now(inode, 0);
+ if (err)
+ return err;
+
+ fuse_sync_writes(inode);
+
req = fuse_get_req(fc);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -340,6 +409,13 @@ static int fuse_readpage(struct file *fi
if (is_bad_inode(inode))
goto out;
+ /*
+ * Page writeback can extend beyond the liftime of the
+ * page-cache page, so make sure we read a properly synced
+ * page.
+ */
+ fuse_wait_on_page_writeback(inode, page->index);
+
req = fuse_get_req(fc);
err = PTR_ERR(req);
if (IS_ERR(req))
@@ -411,6 +487,8 @@ static int fuse_readpages_fill(void *_da
struct inode *inode = data->inode;
struct fuse_conn *fc = get_fuse_conn(inode);
+ fuse_wait_on_page_writeback(inode, page->index);
+
if (req->num_pages &&
(req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
(req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read ||
@@ -477,11 +555,10 @@ static ssize_t fuse_file_aio_read(struct
}
static void fuse_write_fill(struct fuse_req *req, struct file *file,
- struct inode *inode, loff_t pos, size_t count,
- int writepage)
+ struct fuse_file *ff, struct inode *inode,
+ loff_t pos, size_t count, int writepage)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_file *ff = file->private_data;
struct fuse_write_in *inarg = &req->misc.write.in;
struct fuse_write_out *outarg = &req->misc.write.out;
@@ -490,7 +567,7 @@ static void fuse_write_fill(struct fuse_
inarg->offset = pos;
inarg->size = count;
inarg->write_flags = writepage ? FUSE_WRITE_CACHE : 0;
- inarg->flags = file->f_flags;
+ inarg->flags = file ? file->f_flags : 0;
req->in.h.opcode = FUSE_WRITE;
req->in.h.nodeid = get_node_id(inode);
req->in.argpages = 1;
@@ -511,7 +588,7 @@ static size_t fuse_send_write(struct fus
fl_owner_t owner)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- fuse_write_fill(req, file, inode, pos, count, 0);
+ fuse_write_fill(req, file, file->private_data, inode, pos, count, 0);
if (owner != NULL) {
struct fuse_write_in *inarg = &req->misc.write.in;
inarg->write_flags |= FUSE_WRITE_LOCKOWNER;
@@ -546,6 +623,12 @@ static int fuse_buffered_write(struct fi
if (is_bad_inode(inode))
return -EIO;
+ /*
+ * Make sure writepages on the same page are not mixed up with
+ * plain writes.
+ */
+ fuse_wait_on_page_writeback(inode, page->index);
+
req = fuse_get_req(fc);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -716,21 +799,224 @@ static ssize_t fuse_direct_write(struct
return res;
}
-static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
+static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
{
- if ((vma->vm_flags & VM_SHARED)) {
- if ((vma->vm_flags & VM_WRITE))
- return -ENODEV;
- else
- vma->vm_flags &= ~VM_MAYWRITE;
+ __free_page(req->pages[0]);
+ fuse_file_put(req->ff);
+ fuse_put_request(fc, req);
+}
+
+static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
+{
+ struct inode *inode = req->inode;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+
+ list_del(&req->writepages_entry);
+ dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+ bdi_writeout_inc(bdi);
+ wake_up(&fi->page_waitq);
+}
+
+/* Called under fc->lock, may release and reacquire it */
+static void fuse_send_writepage(struct fuse_conn *fc, struct fuse_req *req)
+{
+ struct fuse_inode *fi = get_fuse_inode(req->inode);
+ loff_t size = i_size_read(req->inode);
+ struct fuse_write_in *inarg = &req->misc.write.in;
+
+ if (!fc->connected)
+ goto out_free;
+
+ if (inarg->offset + PAGE_CACHE_SIZE <= size) {
+ inarg->size = PAGE_CACHE_SIZE;
+ } else if (inarg->offset < size) {
+ inarg->size = size & (PAGE_CACHE_SIZE - 1);
+ } else {
+ /* Got truncated off completely */
+ goto out_free;
+ }
+
+ req->in.args[1].size = inarg->size;
+ fi->writectr++;
+ request_send_background_locked(fc, req);
+ return;
+
+ out_free:
+ fuse_writepage_finish(fc, req);
+ spin_unlock(&fc->lock);
+ fuse_writepage_free(fc, req);
+ spin_lock(&fc->lock);
+}
+
+/*
+ * If fi->writectr is positive (no truncate or fsync going on) send
+ * all queued writepage requests.
+ *
+ * Called with fc->lock
+ */
+void fuse_flush_writepages(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_req *req;
+
+ while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) {
+ req = list_entry(fi->queued_writes.next, struct fuse_req, list);
+ list_del_init(&req->list);
+ fuse_send_writepage(fc, req);
}
- return generic_file_mmap(file, vma);
}
-static int fuse_set_page_dirty(struct page *page)
+static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
{
- printk("fuse_set_page_dirty: should not happen\n");
- dump_stack();
+ struct inode *inode = req->inode;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ mapping_set_error(inode->i_mapping, req->out.h.error);
+ spin_lock(&fc->lock);
+ fi->writectr--;
+ fuse_writepage_finish(fc, req);
+ spin_unlock(&fc->lock);
+ fuse_writepage_free(fc, req);
+}
+
+static int fuse_writepage_locked(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_req *req;
+ struct fuse_file *ff;
+ struct page *tmp_page;
+
+ set_page_writeback(page);
+
+ req = fuse_request_alloc_nofs();
+ if (!req)
+ goto err;
+
+ tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+ if (!tmp_page)
+ goto err_free;
+
+ spin_lock(&fc->lock);
+ BUG_ON(list_empty(&fi->write_files));
+ ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+ req->ff = fuse_file_get(ff);
+ spin_unlock(&fc->lock);
+
+ fuse_write_fill(req, NULL, ff, inode, page_offset(page), 0, 1);
+
+ copy_highpage(tmp_page, page);
+ req->num_pages = 1;
+ req->pages[0] = tmp_page;
+ req->page_offset = 0;
+ req->end = fuse_writepage_end;
+ req->inode = inode;
+
+ inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+ __end_page_writeback(page, false);
+
+ spin_lock(&fc->lock);
+ list_add(&req->writepages_entry, &fi->writepages);
+ list_add_tail(&req->list, &fi->queued_writes);
+ fuse_flush_writepages(inode);
+ spin_unlock(&fc->lock);
+
+ return 0;
+
+err_free:
+ fuse_request_free(req);
+err:
+ end_page_writeback(page);
+ return -ENOMEM;
+}
+
+static int fuse_writepage(struct page *page, struct writeback_control *wbc)
+{
+ int err;
+
+ err = fuse_writepage_locked(page);
+ unlock_page(page);
+
+ return err;
+}
+
+static int fuse_launder_page(struct page *page)
+{
+ int err = 0;
+ if (clear_page_dirty_for_io(page)) {
+ struct inode *inode = page->mapping->host;
+ err = fuse_writepage_locked(page);
+ if (!err)
+ fuse_wait_on_page_writeback(inode, page->index);
+ }
+ return err;
+}
+
+/*
+ * Write back dirty pages now, because there may not be any suitable
+ * open files later
+ */
+static void fuse_vma_close(struct vm_area_struct *vma)
+{
+ filemap_write_and_wait(vma->vm_file->f_mapping);
+}
+
+/*
+ * Wait for writeback against this page to complete before allowing it
+ * to be marked dirty again, and hence written back again, possibly
+ * before the previous writepage completed.
+ *
+ * Block here, instead of in ->writepage(), so that the userspace fs
+ * can only block processes actually operating on the filesystem.
+ *
+ * Otherwise unprivileged userspace fs would be able to block
+ * unrelated:
+ *
+ * - page migration
+ * - sync(2)
+ * - try_to_free_pages() with order > PAGE_ALLOC_COSTLY_ORDER
+ */
+static int fuse_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ /*
+ * Don't use page->mapping as it may become NULL from a
+ * concurrent truncate.
+ */
+ struct inode *inode = vma->vm_file->f_mapping->host;
+
+ fuse_wait_on_page_writeback(inode, page->index);
+ return 0;
+}
+
+static struct vm_operations_struct fuse_file_vm_ops = {
+ .close = fuse_vma_close,
+ .fault = filemap_fault,
+ .page_mkwrite = fuse_page_mkwrite,
+};
+
+static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
+ struct inode *inode = file->f_dentry->d_inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_file *ff = file->private_data;
+ /*
+ * file may be written through mmap, so chain it onto the
+ * inodes's write_file list
+ */
+ spin_lock(&fc->lock);
+ if (list_empty(&ff->write_entry))
+ list_add(&ff->write_entry, &fi->write_files);
+ spin_unlock(&fc->lock);
+ }
+ file_accessed(file);
+ vma->vm_ops = &fuse_file_vm_ops;
return 0;
}
@@ -940,10 +1226,12 @@ static const struct file_operations fuse
static const struct address_space_operations fuse_file_aops = {
.readpage = fuse_readpage,
+ .writepage = fuse_writepage,
+ .launder_page = fuse_launder_page,
.write_begin = fuse_write_begin,
.write_end = fuse_write_end,
.readpages = fuse_readpages,
- .set_page_dirty = fuse_set_page_dirty,
+ .set_page_dirty = __set_page_dirty_nobuffers,
.bmap = fuse_bmap,
};
Index: linux/fs/fuse/fuse_i.h
===================================================================
--- linux.orig/fs/fuse/fuse_i.h 2008-03-17 18:24:13.000000000 +0100
+++ linux/fs/fuse/fuse_i.h 2008-03-17 18:26:04.000000000 +0100
@@ -15,6 +15,7 @@
#include <linux/mm.h>
#include <linux/backing-dev.h>
#include <linux/mutex.h>
+#include <linux/rwsem.h>
/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -25,6 +26,9 @@
/** Congestion starts at 75% of maximum */
#define FUSE_CONGESTION_THRESHOLD (FUSE_MAX_BACKGROUND * 75 / 100)
+/** Bias for fi->writectr, meaning new writepages must not be sent */
+#define FUSE_NOWRITE INT_MIN
+
/** It could be as large as PATH_MAX, but would that have any uses? */
#define FUSE_NAME_MAX 1024
@@ -73,6 +77,19 @@ struct fuse_inode {
/** Files usable in writepage. Protected by fc->lock */
struct list_head write_files;
+
+ /** Writepages pending on truncate or fsync */
+ struct list_head queued_writes;
+
+ /** Number of sent writes, a negative bias (FUSE_NOWRITE)
+ * means more writes are blocked */
+ int writectr;
+
+ /** Waitq for writepage completion */
+ wait_queue_head_t page_waitq;
+
+ /** List of writepage requestst (pending or sent) */
+ struct list_head writepages;
};
/** FUSE specific file data */
@@ -242,6 +259,12 @@ struct fuse_req {
/** File used in the request (or NULL) */
struct fuse_file *ff;
+ /** Inode used in the request or NULL */
+ struct inode *inode;
+
+ /** Link on fi->writepages */
+ struct list_head writepages_entry;
+
/** Request completion callback */
void (*end)(struct fuse_conn *, struct fuse_req *);
@@ -504,6 +527,11 @@ void fuse_init_symlink(struct inode *ino
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
u64 attr_valid, u64 attr_version);
+void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid);
+
+void fuse_truncate(struct address_space *mapping, loff_t offset);
+
/**
* Initialize the client device
*/
@@ -522,6 +550,8 @@ void fuse_ctl_cleanup(void);
*/
struct fuse_req *fuse_request_alloc(void);
+struct fuse_req *fuse_request_alloc_nofs(void);
+
/**
* Free a request
*/
@@ -558,6 +588,8 @@ void request_send_noreply(struct fuse_co
*/
void request_send_background(struct fuse_conn *fc, struct fuse_req *req);
+void request_send_background_locked(struct fuse_conn *fc, struct fuse_req *req);
+
/* Abort all requests */
void fuse_abort_conn(struct fuse_conn *fc);
@@ -600,3 +632,8 @@ u64 fuse_lock_owner_id(struct fuse_conn
int fuse_update_attributes(struct inode *inode, struct kstat *stat,
struct file *file, bool *refreshed);
+
+void fuse_flush_writepages(struct inode *inode);
+
+void fuse_set_nowrite(struct inode *inode);
+void fuse_release_nowrite(struct inode *inode);
Index: linux/fs/fuse/inode.c
===================================================================
--- linux.orig/fs/fuse/inode.c 2008-03-17 18:24:13.000000000 +0100
+++ linux/fs/fuse/inode.c 2008-03-17 18:26:04.000000000 +0100
@@ -59,7 +59,11 @@ static struct inode *fuse_alloc_inode(st
fi->nodeid = 0;
fi->nlookup = 0;
fi->attr_version = 0;
+ fi->writectr = 0;
INIT_LIST_HEAD(&fi->write_files);
+ INIT_LIST_HEAD(&fi->queued_writes);
+ INIT_LIST_HEAD(&fi->writepages);
+ init_waitqueue_head(&fi->page_waitq);
fi->forget_req = fuse_request_alloc();
if (!fi->forget_req) {
kmem_cache_free(fuse_inode_cachep, inode);
@@ -73,6 +77,7 @@ static void fuse_destroy_inode(struct in
{
struct fuse_inode *fi = get_fuse_inode(inode);
BUG_ON(!list_empty(&fi->write_files));
+ BUG_ON(!list_empty(&fi->queued_writes));
if (fi->forget_req)
fuse_request_free(fi->forget_req);
kmem_cache_free(fuse_inode_cachep, inode);
@@ -109,7 +114,7 @@ static int fuse_remount_fs(struct super_
return 0;
}
-static void fuse_truncate(struct address_space *mapping, loff_t offset)
+void fuse_truncate(struct address_space *mapping, loff_t offset)
{
/* See vmtruncate() */
unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
@@ -117,19 +122,12 @@ static void fuse_truncate(struct address
unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
}
-
-void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version)
+void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
- loff_t oldsize;
- spin_lock(&fc->lock);
- if (attr_version != 0 && fi->attr_version > attr_version) {
- spin_unlock(&fc->lock);
- return;
- }
fi->attr_version = ++fc->attr_version;
fi->i_time = attr_valid;
@@ -159,6 +157,22 @@ void fuse_change_attributes(struct inode
fi->orig_i_mode = inode->i_mode;
if (!(fc->flags & FUSE_DEFAULT_PERMISSIONS))
inode->i_mode &= ~S_ISVTX;
+}
+
+void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid, u64 attr_version)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t oldsize;
+
+ spin_lock(&fc->lock);
+ if (attr_version != 0 && fi->attr_version > attr_version) {
+ spin_unlock(&fc->lock);
+ return;
+ }
+
+ fuse_change_attributes_common(inode, attr, attr_valid);
oldsize = inode->i_size;
i_size_write(inode, attr->size);
@@ -476,6 +490,19 @@ static struct fuse_conn *new_conn(struct
err = bdi_register_dev(&fc->bdi, fc->dev);
if (err)
goto error_bdi_destroy;
+ /*
+ * For a single fuse filesystem use max 1% of dirty +
+ * writeback threshold.
+ *
+ * This gives about 1M of write buffer for memory maps on a
+ * machine with 1G and 10% dirty_ratio, which should be more
+ * than enough.
+ *
+ * Privileged users can raise it by writing to
+ *
+ * /sys/class/bdi/<bdi>/max_ratio
+ */
+ bdi_set_max_ratio(&fc->bdi, 1);
fc->reqctr = 0;
fc->blocked = 1;
fc->attr_version = 1;
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 6/8] fuse: clean up setting i_size in write
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
` (4 preceding siblings ...)
2008-03-17 19:19 ` [patch 5/8] fuse: support writable mmap Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-18 5:08 ` Andrew Morton
2008-03-17 19:19 ` [patch 7/8] fuse: implement perform_write Miklos Szeredi
2008-03-17 19:19 ` [patch 8/8] fuse: update file size on short read Miklos Szeredi
7 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: fuse_write_update_size.patch --]
[-- Type: text/plain, Size: 1954 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
Extract common code for setting i_size in write functions into a
common helper.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
fs/fuse/file.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c 2008-03-17 18:26:04.000000000 +0100
+++ linux/fs/fuse/file.c 2008-03-17 18:26:28.000000000 +0100
@@ -610,13 +610,24 @@ static int fuse_write_begin(struct file
return 0;
}
+static void fuse_write_update_size(struct inode *inode, loff_t pos)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ spin_lock(&fc->lock);
+ fi->attr_version = ++fc->attr_version;
+ if (pos > inode->i_size)
+ i_size_write(inode, pos);
+ spin_unlock(&fc->lock);
+}
+
static int fuse_buffered_write(struct file *file, struct inode *inode,
loff_t pos, unsigned count, struct page *page)
{
int err;
size_t nres;
struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_inode *fi = get_fuse_inode(inode);
unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
struct fuse_req *req;
@@ -643,12 +654,7 @@ static int fuse_buffered_write(struct fi
err = -EIO;
if (!err) {
pos += nres;
- spin_lock(&fc->lock);
- fi->attr_version = ++fc->attr_version;
- if (pos > inode->i_size)
- i_size_write(inode, pos);
- spin_unlock(&fc->lock);
-
+ fuse_write_update_size(inode, pos);
if (count == PAGE_CACHE_SIZE)
SetPageUptodate(page);
}
@@ -766,12 +772,8 @@ static ssize_t fuse_direct_io(struct fil
}
fuse_put_request(fc, req);
if (res > 0) {
- if (write) {
- spin_lock(&fc->lock);
- if (pos > inode->i_size)
- i_size_write(inode, pos);
- spin_unlock(&fc->lock);
- }
+ if (write)
+ fuse_write_update_size(inode, pos);
*ppos = pos;
}
fuse_invalidate_attr(inode);
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 7/8] fuse: implement perform_write
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
` (5 preceding siblings ...)
2008-03-17 19:19 ` [patch 6/8] fuse: clean up setting i_size in write Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
2008-03-17 19:19 ` [patch 8/8] fuse: update file size on short read Miklos Szeredi
7 siblings, 0 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm
Cc: linux-kernel, linux-fsdevel, linux-mm, Nick Piggin, Christoph Hellwig
[-- Attachment #1: fuse_perform_write.patch --]
[-- Type: text/plain, Size: 5744 bytes --]
From: Nick Piggin <npiggin@suse.de>
Introduce fuse_perform_write. With fusexmp (a passthrough filesystem), large
(1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times
(256MB/s vs 71MB/s).
[mszeredi@suse.cz]:
- split into smaller functions
- testing
- duplicate generic_file_aio_write(), so that there's no need to add a
new ->perform_write() a_op. Comment from hch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Christoph Hellwig <hch@infradead.org>
---
fs/fuse/file.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 193 insertions(+), 1 deletion(-)
Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c 2008-03-17 18:26:28.000000000 +0100
+++ linux/fs/fuse/file.c 2008-03-17 18:35:26.000000000 +0100
@@ -677,6 +677,198 @@ static int fuse_write_end(struct file *f
return res;
}
+static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
+ struct inode *inode, loff_t pos,
+ size_t count)
+{
+ size_t res;
+ unsigned offset;
+ unsigned i;
+
+ for (i = 0; i < req->num_pages; i++)
+ fuse_wait_on_page_writeback(inode, req->pages[i]->index);
+
+ res = fuse_send_write(req, file, inode, pos, count, NULL);
+
+ offset = req->page_offset;
+ count = res;
+ for (i = 0; i < req->num_pages; i++) {
+ struct page *page = req->pages[i];
+
+ if (!req->out.h.error && !offset && count >= PAGE_CACHE_SIZE)
+ SetPageUptodate(page);
+
+ if (count > PAGE_CACHE_SIZE - offset)
+ count -= PAGE_CACHE_SIZE - offset;
+ else
+ count = 0;
+ offset = 0;
+
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
+ return res;
+}
+
+static ssize_t fuse_fill_write_pages(struct fuse_req *req,
+ struct address_space *mapping,
+ struct iov_iter *ii, loff_t pos)
+{
+ struct fuse_conn *fc = get_fuse_conn(mapping->host);
+ unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+ size_t count = 0;
+ int err;
+
+ req->page_offset = offset;
+
+ do {
+ size_t tmp;
+ struct page *page;
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+ size_t bytes = min_t(size_t, PAGE_CACHE_SIZE - offset,
+ iov_iter_count(ii));
+
+ bytes = min_t(size_t, bytes, fc->max_write - count);
+
+ again:
+ err = -EFAULT;
+ if (iov_iter_fault_in_readable(ii, bytes))
+ break;
+
+ err = -ENOMEM;
+ page = __grab_cache_page(mapping, index);
+ if (!page)
+ break;
+
+ pagefault_disable();
+ tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
+ pagefault_enable();
+ flush_dcache_page(page);
+
+ if (!tmp) {
+ unlock_page(page);
+ page_cache_release(page);
+ bytes = min(bytes, iov_iter_single_seg_count(ii));
+ goto again;
+ }
+
+ err = 0;
+ req->pages[req->num_pages] = page;
+ req->num_pages++;
+
+ iov_iter_advance(ii, tmp);
+ count += tmp;
+ pos += tmp;
+ offset += tmp;
+ if (offset == PAGE_CACHE_SIZE)
+ offset = 0;
+
+ } while (iov_iter_count(ii) && count < fc->max_write &&
+ req->num_pages < FUSE_MAX_PAGES_PER_REQ && offset == 0);
+
+ return count > 0 ? count : err;
+}
+
+static ssize_t fuse_perform_write(struct file *file,
+ struct address_space *mapping,
+ struct iov_iter *ii, loff_t pos)
+{
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ int err = 0;
+ ssize_t res = 0;
+
+ if (is_bad_inode(inode))
+ return -EIO;
+
+ do {
+ struct fuse_req *req;
+ ssize_t count;
+
+ req = fuse_get_req(fc);
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
+ break;
+ }
+
+ count = fuse_fill_write_pages(req, mapping, ii, pos);
+ if (count <= 0) {
+ err = count;
+ } else {
+ size_t num_written;
+
+ num_written = fuse_send_write_pages(req, file, inode,
+ pos, count);
+ err = req->out.h.error;
+ if (!err) {
+ res += num_written;
+ pos += num_written;
+
+ /* break out of the loop on short write */
+ if (num_written != count)
+ err = -EIO;
+ }
+ }
+ fuse_put_request(fc, req);
+ } while (!err && iov_iter_count(ii));
+
+ if (res > 0)
+ fuse_write_update_size(inode, pos);
+
+ fuse_invalidate_attr(inode);
+
+ return res > 0 ? res : err;
+}
+
+static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ size_t count = 0;
+ ssize_t written = 0;
+ struct inode *inode = mapping->host;
+ ssize_t err;
+ struct iov_iter i;
+
+ WARN_ON(iocb->ki_pos != pos);
+
+ err = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+ if (err)
+ return err;
+
+ mutex_lock(&inode->i_mutex);
+ vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+
+ /* We can write back this queue in page reclaim */
+ current->backing_dev_info = mapping->backing_dev_info;
+
+ err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
+ if (err)
+ goto out;
+
+ if (count == 0)
+ goto out;
+
+ err = remove_suid(file->f_path.dentry);
+ if (err)
+ goto out;
+
+ file_update_time(file);
+
+ iov_iter_init(&i, iov, nr_segs, count, 0);
+ written = fuse_perform_write(file, mapping, &i, pos);
+ if (written >= 0)
+ iocb->ki_pos = pos + written;
+
+out:
+ current->backing_dev_info = NULL;
+ mutex_unlock(&inode->i_mutex);
+
+ return written ? written : err;
+}
+
static void fuse_release_user_pages(struct fuse_req *req, int write)
{
unsigned i;
@@ -1202,7 +1394,7 @@ static const struct file_operations fuse
.read = do_sync_read,
.aio_read = fuse_file_aio_read,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .aio_write = fuse_file_aio_write,
.mmap = fuse_file_mmap,
.open = fuse_open,
.flush = fuse_flush,
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* [patch 8/8] fuse: update file size on short read
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
` (6 preceding siblings ...)
2008-03-17 19:19 ` [patch 7/8] fuse: implement perform_write Miklos Szeredi
@ 2008-03-17 19:19 ` Miklos Szeredi
7 siblings, 0 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-17 19:19 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel, linux-fsdevel, linux-mm
[-- Attachment #1: fuse_truncate_file_on_short_read.patch --]
[-- Type: text/plain, Size: 2777 bytes --]
From: Miklos Szeredi <mszeredi@suse.cz>
If the READ request returned a short count, then either
- cached size is incorrect
- filesystem is buggy, as short reads are only allowed on EOF
So assume, that the size is wrong and refresh it, so that cached
read() doesn't zero fill the missing chunk.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
fs/fuse/file.c | 42 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 38 insertions(+), 4 deletions(-)
Index: linux/fs/fuse/file.c
===================================================================
--- linux.orig/fs/fuse/file.c 2008-03-17 18:26:43.000000000 +0100
+++ linux/fs/fuse/file.c 2008-03-17 18:27:11.000000000 +0100
@@ -398,11 +398,26 @@ static size_t fuse_send_read(struct fuse
return req->out.args[0].size;
}
+static void fuse_read_update_size(struct inode *inode, loff_t size)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ spin_lock(&fc->lock);
+ fi->attr_version = ++fc->attr_version;
+ if (size < inode->i_size)
+ i_size_write(inode, size);
+ spin_unlock(&fc->lock);
+}
+
static int fuse_readpage(struct file *file, struct page *page)
{
struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
+ size_t num_read;
+ loff_t pos = page_offset(page);
+ size_t count = PAGE_CACHE_SIZE;
int err;
err = -EIO;
@@ -424,12 +439,20 @@ static int fuse_readpage(struct file *fi
req->out.page_zeroing = 1;
req->num_pages = 1;
req->pages[0] = page;
- fuse_send_read(req, file, inode, page_offset(page), PAGE_CACHE_SIZE,
- NULL);
+ num_read = fuse_send_read(req, file, inode, pos, count, NULL);
err = req->out.h.error;
fuse_put_request(fc, req);
- if (!err)
+
+ if (!err) {
+ /*
+ * Short read means EOF. If file size is larger, truncate it
+ */
+ if (num_read < count)
+ fuse_read_update_size(inode, pos + num_read);
+
SetPageUptodate(page);
+ }
+
fuse_invalidate_attr(inode); /* atime changed */
out:
unlock_page(page);
@@ -439,8 +462,19 @@ static int fuse_readpage(struct file *fi
static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
{
int i;
+ size_t count = req->misc.read_in.size;
+ size_t num_read = req->out.args[0].size;
+ struct inode *inode = req->pages[0]->mapping->host;
- fuse_invalidate_attr(req->pages[0]->mapping->host); /* atime changed */
+ /*
+ * Short read means EOF. If file size is larger, truncate it
+ */
+ if (!req->out.h.error && num_read < count) {
+ loff_t pos = page_offset(req->pages[0]) + num_read;
+ fuse_read_update_size(inode, pos);
+ }
+
+ fuse_invalidate_attr(inode); /* atime changed */
for (i = 0; i < req->num_pages; i++) {
struct page *page = req->pages[i];
--
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-17 19:19 ` [patch 4/8] mm: allow not updating BDI stats in end_page_writeback() Miklos Szeredi
@ 2008-03-18 5:04 ` Andrew Morton
2008-03-18 8:11 ` Miklos Szeredi
2008-03-18 11:33 ` Peter Zijlstra
1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2008-03-18 5:04 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel, linux-mm
On Mon, 17 Mar 2008 20:19:12 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Fuse's writepage will need to clear page writeback separately from
> updating the per BDI counters.
>
> This patch renames end_page_writeback() to __end_page_writeback() and
> adds a boolean parameter to indicate if the per BDI stats need to be
> updated.
>
> Regular callers get an inline end_page_writeback() without the boolean
> parameter.
>
> ...
>
> Index: linux/include/linux/page-flags.h
> ===================================================================
> --- linux.orig/include/linux/page-flags.h 2008-03-17 18:24:13.000000000 +0100
> +++ linux/include/linux/page-flags.h 2008-03-17 18:25:53.000000000 +0100
> @@ -300,7 +300,7 @@ struct page; /* forward declaration */
>
> extern void cancel_dirty_page(struct page *page, unsigned int account_size);
>
> -int test_clear_page_writeback(struct page *page);
> +int test_clear_page_writeback(struct page *page, bool bdi_stats);
> int test_set_page_writeback(struct page *page);
>
> static inline void set_page_writeback(struct page *page)
> Index: linux/include/linux/pagemap.h
> ===================================================================
> --- linux.orig/include/linux/pagemap.h 2008-03-17 18:24:13.000000000 +0100
> +++ linux/include/linux/pagemap.h 2008-03-17 18:25:53.000000000 +0100
> @@ -223,7 +223,12 @@ static inline void wait_on_page_writebac
> wait_on_page_bit(page, PG_writeback);
> }
>
> -extern void end_page_writeback(struct page *page);
> +extern void __end_page_writeback(struct page *page, bool bdi_stats);
> +
> +static inline void end_page_writeback(struct page *page)
> +{
> + __end_page_writeback(page, true);
> +}
>
> /*
> * Fault a userspace page into pagetables. Return non-zero on a fault.
> Index: linux/mm/filemap.c
> ===================================================================
> --- linux.orig/mm/filemap.c 2008-03-17 18:25:38.000000000 +0100
> +++ linux/mm/filemap.c 2008-03-17 18:25:53.000000000 +0100
> @@ -574,19 +574,20 @@ EXPORT_SYMBOL(unlock_page);
> /**
> * end_page_writeback - end writeback against a page
> * @page: the page
> + * @bdi_stats: update the per-bdi writeback counter
> */
> -void end_page_writeback(struct page *page)
> +void __end_page_writeback(struct page *page, bool bdi_stats)
> {
> if (TestClearPageReclaim(page))
> rotate_reclaimable_page(page);
>
> - if (!test_clear_page_writeback(page))
> + if (!test_clear_page_writeback(page, bdi_stats))
> BUG();
>
> smp_mb__after_clear_bit();
> wake_up_page(page, PG_writeback);
> }
> -EXPORT_SYMBOL(end_page_writeback);
> +EXPORT_SYMBOL(__end_page_writeback);
>
> /**
> * __lock_page - get a lock on the page, assuming we need to sleep to get it
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c 2008-03-17 18:25:17.000000000 +0100
> +++ linux/mm/page-writeback.c 2008-03-17 18:25:53.000000000 +0100
> @@ -1242,7 +1242,7 @@ int clear_page_dirty_for_io(struct page
> }
> EXPORT_SYMBOL(clear_page_dirty_for_io);
>
> -int test_clear_page_writeback(struct page *page)
> +int test_clear_page_writeback(struct page *page, bool bdi_stats)
> {
> struct address_space *mapping = page_mapping(page);
> int ret;
> @@ -1257,7 +1257,7 @@ int test_clear_page_writeback(struct pag
> radix_tree_tag_clear(&mapping->page_tree,
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> - if (bdi_cap_writeback_dirty(bdi)) {
> + if (bdi_stats && bdi_cap_writeback_dirty(bdi)) {
> __dec_bdi_stat(bdi, BDI_WRITEBACK);
> __bdi_writeout_inc(bdi);
> }
Adding `mode' flags to a core function is generally considered poor form.
And it adds additional overhead and possibly stack utilisation for all
callers.
We generally prefer that a new function be created. After all, that's what
you've done here, only the code has gone and wedged two different functions
into one.
Another approach might be to add a new bdi_cap_foo() flag. We could then do
if (bdi_cap_writeback_dirty(bdi) && bdi_cap_mumble(bdi)) {
here. But even better would be to create a new BDI capability which
indicates that this address_space doesn't want this treatment in
test_clear_page_writeback(), then go fix up all the
!bdi_cap_writeback_dirty() address_spaces to set that flag.
So then the code becomes
if (!bdi_cap_account_writeback_in_test_clear_page_writeback(bdi)) {
(good luck thinking up a better name ;))
Reason: bdi_cap_writeback_dirty() is kinda weirdly intrepreted to mean
various different things in different places and we really should separate
its multiple interpretations into separate flags.
Note that this becomes a standalone VFS cleanup patch, and the fuse code
can then just use it later on.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 2/8] mm: Add NR_WRITEBACK_TEMP counter
2008-03-17 19:19 ` [patch 2/8] mm: Add NR_WRITEBACK_TEMP counter Miklos Szeredi
@ 2008-03-18 5:05 ` Andrew Morton
0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2008-03-18 5:05 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel, linux-mm
On Mon, 17 Mar 2008 20:19:10 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> @@ -179,6 +179,7 @@ static int meminfo_read_proc(char *page,
> "PageTables: %8lu kB\n"
> "NFS_Unstable: %8lu kB\n"
> "Bounce: %8lu kB\n"
> + "WritebackTmp: %8lu kB\n"
These fields are documented in Documentation/filesystems/proc.txt (please),
although we're missing some of them now.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 6/8] fuse: clean up setting i_size in write
2008-03-17 19:19 ` [patch 6/8] fuse: clean up setting i_size in write Miklos Szeredi
@ 2008-03-18 5:08 ` Andrew Morton
2008-03-18 8:16 ` Miklos Szeredi
0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2008-03-18 5:08 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-kernel, linux-fsdevel, linux-mm
On Mon, 17 Mar 2008 20:19:14 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Extract common code for setting i_size in write functions into a
> common helper.
>
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
> fs/fuse/file.c | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
> Index: linux/fs/fuse/file.c
> ===================================================================
> --- linux.orig/fs/fuse/file.c 2008-03-17 18:26:04.000000000 +0100
> +++ linux/fs/fuse/file.c 2008-03-17 18:26:28.000000000 +0100
> @@ -610,13 +610,24 @@ static int fuse_write_begin(struct file
> return 0;
> }
>
> +static void fuse_write_update_size(struct inode *inode, loff_t pos)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + spin_lock(&fc->lock);
> + fi->attr_version = ++fc->attr_version;
> + if (pos > inode->i_size)
> + i_size_write(inode, pos);
> + spin_unlock(&fc->lock);
> +}
>
> ...
>
> @@ -766,12 +772,8 @@ static ssize_t fuse_direct_io(struct fil
> }
> fuse_put_request(fc, req);
> if (res > 0) {
> - if (write) {
> - spin_lock(&fc->lock);
> - if (pos > inode->i_size)
> - i_size_write(inode, pos);
> - spin_unlock(&fc->lock);
> - }
> + if (write)
> + fuse_write_update_size(inode, pos);
We require that i_mutex be held here, to prevent i_size_write() deadlocks.
Is it held?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 5:04 ` Andrew Morton
@ 2008-03-18 8:11 ` Miklos Szeredi
2008-03-18 8:18 ` Andrew Morton
0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 8:11 UTC (permalink / raw)
To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-fsdevel, linux-mm
[PeterZ added to CC]
> On Mon, 17 Mar 2008 20:19:12 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > From: Miklos Szeredi <mszeredi@suse.cz>
> >
> > Fuse's writepage will need to clear page writeback separately from
> > updating the per BDI counters.
> >
> > This patch renames end_page_writeback() to __end_page_writeback() and
> > adds a boolean parameter to indicate if the per BDI stats need to be
> > updated.
> >
> > Regular callers get an inline end_page_writeback() without the boolean
> > parameter.
> >
> > ...
> >
> > Index: linux/include/linux/page-flags.h
> > ===================================================================
> > --- linux.orig/include/linux/page-flags.h 2008-03-17 18:24:13.000000000 +0100
> > +++ linux/include/linux/page-flags.h 2008-03-17 18:25:53.000000000 +0100
> > @@ -300,7 +300,7 @@ struct page; /* forward declaration */
> >
> > extern void cancel_dirty_page(struct page *page, unsigned int account_size);
> >
> > -int test_clear_page_writeback(struct page *page);
> > +int test_clear_page_writeback(struct page *page, bool bdi_stats);
> > int test_set_page_writeback(struct page *page);
> >
> > static inline void set_page_writeback(struct page *page)
> > Index: linux/include/linux/pagemap.h
> > ===================================================================
> > --- linux.orig/include/linux/pagemap.h 2008-03-17 18:24:13.000000000 +0100
> > +++ linux/include/linux/pagemap.h 2008-03-17 18:25:53.000000000 +0100
> > @@ -223,7 +223,12 @@ static inline void wait_on_page_writebac
> > wait_on_page_bit(page, PG_writeback);
> > }
> >
> > -extern void end_page_writeback(struct page *page);
> > +extern void __end_page_writeback(struct page *page, bool bdi_stats);
> > +
> > +static inline void end_page_writeback(struct page *page)
> > +{
> > + __end_page_writeback(page, true);
> > +}
> >
> > /*
> > * Fault a userspace page into pagetables. Return non-zero on a fault.
> > Index: linux/mm/filemap.c
> > ===================================================================
> > --- linux.orig/mm/filemap.c 2008-03-17 18:25:38.000000000 +0100
> > +++ linux/mm/filemap.c 2008-03-17 18:25:53.000000000 +0100
> > @@ -574,19 +574,20 @@ EXPORT_SYMBOL(unlock_page);
> > /**
> > * end_page_writeback - end writeback against a page
> > * @page: the page
> > + * @bdi_stats: update the per-bdi writeback counter
> > */
> > -void end_page_writeback(struct page *page)
> > +void __end_page_writeback(struct page *page, bool bdi_stats)
> > {
> > if (TestClearPageReclaim(page))
> > rotate_reclaimable_page(page);
> >
> > - if (!test_clear_page_writeback(page))
> > + if (!test_clear_page_writeback(page, bdi_stats))
> > BUG();
> >
> > smp_mb__after_clear_bit();
> > wake_up_page(page, PG_writeback);
> > }
> > -EXPORT_SYMBOL(end_page_writeback);
> > +EXPORT_SYMBOL(__end_page_writeback);
> >
> > /**
> > * __lock_page - get a lock on the page, assuming we need to sleep to get it
> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c 2008-03-17 18:25:17.000000000 +0100
> > +++ linux/mm/page-writeback.c 2008-03-17 18:25:53.000000000 +0100
> > @@ -1242,7 +1242,7 @@ int clear_page_dirty_for_io(struct page
> > }
> > EXPORT_SYMBOL(clear_page_dirty_for_io);
> >
> > -int test_clear_page_writeback(struct page *page)
> > +int test_clear_page_writeback(struct page *page, bool bdi_stats)
> > {
> > struct address_space *mapping = page_mapping(page);
> > int ret;
> > @@ -1257,7 +1257,7 @@ int test_clear_page_writeback(struct pag
> > radix_tree_tag_clear(&mapping->page_tree,
> > page_index(page),
> > PAGECACHE_TAG_WRITEBACK);
> > - if (bdi_cap_writeback_dirty(bdi)) {
> > + if (bdi_stats && bdi_cap_writeback_dirty(bdi)) {
> > __dec_bdi_stat(bdi, BDI_WRITEBACK);
> > __bdi_writeout_inc(bdi);
> > }
>
> Adding `mode' flags to a core function is generally considered poor form.
> And it adds additional overhead and possibly stack utilisation for all
> callers.
>
> We generally prefer that a new function be created. After all, that's what
> you've done here, only the code has gone and wedged two different functions
> into one.
Yes, although duplicating such a not entirely trivial function has
it's dangers as well, I think.
> Another approach might be to add a new bdi_cap_foo() flag. We could then do
>
> if (bdi_cap_writeback_dirty(bdi) && bdi_cap_mumble(bdi)) {
>
> here. But even better would be to create a new BDI capability which
> indicates that this address_space doesn't want this treatment in
> test_clear_page_writeback(), then go fix up all the
> !bdi_cap_writeback_dirty() address_spaces to set that flag.
>
> So then the code becomes
>
> if (!bdi_cap_account_writeback_in_test_clear_page_writeback(bdi)) {
>
> (good luck thinking up a better name ;))
>
> Reason: bdi_cap_writeback_dirty() is kinda weirdly intrepreted to mean
> various different things in different places and we really should separate
> its multiple interpretations into separate flags.
>
> Note that this becomes a standalone VFS cleanup patch, and the fuse code
> can then just use it later on.
Hmm, I can see two slightly different meanings of bdi_cap_writeback_dirty():
1) need to call ->writepage (sync_page_range(), ...)
2) need to update BDI stats (test_clear_page_writeback(), ...)
If these two were different flags, then fuse could set the
NEED_WRITEPAGE flag, but clear the NEED_UPDATE_BDI_STATS flag, and do
it manually.
Does that sound workable?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 6/8] fuse: clean up setting i_size in write
2008-03-18 5:08 ` Andrew Morton
@ 2008-03-18 8:16 ` Miklos Szeredi
0 siblings, 0 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 8:16 UTC (permalink / raw)
To: akpm; +Cc: miklos, linux-kernel, linux-fsdevel, linux-mm
> On Mon, 17 Mar 2008 20:19:14 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > From: Miklos Szeredi <mszeredi@suse.cz>
> >
> > Extract common code for setting i_size in write functions into a
> > common helper.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> > ---
> > fs/fuse/file.c | 28 +++++++++++++++-------------
> > 1 file changed, 15 insertions(+), 13 deletions(-)
> >
> > Index: linux/fs/fuse/file.c
> > ===================================================================
> > --- linux.orig/fs/fuse/file.c 2008-03-17 18:26:04.000000000 +0100
> > +++ linux/fs/fuse/file.c 2008-03-17 18:26:28.000000000 +0100
> > @@ -610,13 +610,24 @@ static int fuse_write_begin(struct file
> > return 0;
> > }
> >
> > +static void fuse_write_update_size(struct inode *inode, loff_t pos)
> > +{
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > +
> > + spin_lock(&fc->lock);
> > + fi->attr_version = ++fc->attr_version;
> > + if (pos > inode->i_size)
> > + i_size_write(inode, pos);
> > + spin_unlock(&fc->lock);
> > +}
> >
> > ...
> >
> > @@ -766,12 +772,8 @@ static ssize_t fuse_direct_io(struct fil
> > }
> > fuse_put_request(fc, req);
> > if (res > 0) {
> > - if (write) {
> > - spin_lock(&fc->lock);
> > - if (pos > inode->i_size)
> > - i_size_write(inode, pos);
> > - spin_unlock(&fc->lock);
> > - }
> > + if (write)
> > + fuse_write_update_size(inode, pos);
>
> We require that i_mutex be held here, to prevent i_size_write() deadlocks.
> Is it held?
>
No, fuse uses the per connection spinlock to protect against
concurrent calls to i_size_write(), because in some cases holding
i_mutex would be difficult.
Fuse already got painfully bitten by the i_size_write() deadlock, so
I'm well aware of the problem :)
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 8:11 ` Miklos Szeredi
@ 2008-03-18 8:18 ` Andrew Morton
0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2008-03-18 8:18 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-fsdevel, linux-mm
On Tue, 18 Mar 2008 09:11:49 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > Reason: bdi_cap_writeback_dirty() is kinda weirdly intrepreted to mean
> > various different things in different places and we really should separate
> > its multiple interpretations into separate flags.
> >
> > Note that this becomes a standalone VFS cleanup patch, and the fuse code
> > can then just use it later on.
>
> Hmm, I can see two slightly different meanings of bdi_cap_writeback_dirty():
>
> 1) need to call ->writepage (sync_page_range(), ...)
> 2) need to update BDI stats (test_clear_page_writeback(), ...)
>
> If these two were different flags, then fuse could set the
> NEED_WRITEPAGE flag, but clear the NEED_UPDATE_BDI_STATS flag, and do
> it manually.
>
> Does that sound workable?
Yup, thanks.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 1/8] mm: bdi: export bdi_writeout_inc()
2008-03-17 19:19 ` [patch 1/8] mm: bdi: export bdi_writeout_inc() Miklos Szeredi
@ 2008-03-18 11:27 ` Peter Zijlstra
2008-03-18 11:46 ` Miklos Szeredi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2008-03-18 11:27 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, linux-mm
On Mon, 2008-03-17 at 20:19 +0100, Miklos Szeredi wrote:
> plain text document attachment (export_bdi_writeout_inc.patch)
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Fuse needs this for writable mmap support.
>
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
> include/linux/backing-dev.h | 2 ++
> mm/page-writeback.c | 10 ++++++++++
> 2 files changed, 12 insertions(+)
>
> Index: linux/include/linux/backing-dev.h
> ===================================================================
> --- linux.orig/include/linux/backing-dev.h 2008-03-17 18:24:13.000000000 +0100
> +++ linux/include/linux/backing-dev.h 2008-03-17 18:24:36.000000000 +0100
> @@ -134,6 +134,8 @@ static inline s64 bdi_stat_sum(struct ba
> return sum;
> }
>
> +extern void bdi_writeout_inc(struct backing_dev_info *bdi);
> +
> /*
> * maximal error of a stat counter.
> */
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c 2008-03-17 18:24:13.000000000 +0100
> +++ linux/mm/page-writeback.c 2008-03-17 18:24:36.000000000 +0100
> @@ -168,6 +168,16 @@ static inline void __bdi_writeout_inc(st
> bdi->max_prop_frac);
> }
>
> +void bdi_writeout_inc(struct backing_dev_info *bdi)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + __bdi_writeout_inc(bdi);
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(bdi_writeout_inc);
> +
May I ask to make this a _GPL export, please?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 3/8] mm: rotate_reclaimable_page() cleanup
2008-03-17 19:19 ` [patch 3/8] mm: rotate_reclaimable_page() cleanup Miklos Szeredi
@ 2008-03-18 11:31 ` Peter Zijlstra
2008-03-18 11:56 ` Miklos Szeredi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2008-03-18 11:31 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, linux-mm
On Mon, 2008-03-17 at 20:19 +0100, Miklos Szeredi wrote:
> plain text document attachment (rotate_reclaimable_page_cleanup.patch)
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Clean up messy conditional calling of test_clear_page_writeback() from
> both rotate_reclaimable_page() and end_page_writeback().
> -int rotate_reclaimable_page(struct page *page)
> +void rotate_reclaimable_page(struct page *page)
> {
> - struct pagevec *pvec;
> - unsigned long flags;
> -
> - if (PageLocked(page))
> - return 1;
> - if (PageDirty(page))
> - return 1;
> - if (PageActive(page))
> - return 1;
> - if (!PageLRU(page))
> - return 1;
Might be me, but I find the above easier to read than
> + if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> + PageLRU(page)) {
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-17 19:19 ` [patch 4/8] mm: allow not updating BDI stats in end_page_writeback() Miklos Szeredi
2008-03-18 5:04 ` Andrew Morton
@ 2008-03-18 11:33 ` Peter Zijlstra
2008-03-18 11:59 ` Miklos Szeredi
1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2008-03-18 11:33 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, linux-mm
On Mon, 2008-03-17 at 20:19 +0100, Miklos Szeredi wrote:
> plain text document attachment (end_page_writeback_nobdi.patch)
> From: Miklos Szeredi <mszeredi@suse.cz>
>
> Fuse's writepage will need to clear page writeback separately from
> updating the per BDI counters.
This is because of the juggling with temporary pages, right?
Would be nice to have some comments in the code explaining this.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 1/8] mm: bdi: export bdi_writeout_inc()
2008-03-18 11:27 ` Peter Zijlstra
@ 2008-03-18 11:46 ` Miklos Szeredi
0 siblings, 0 replies; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 11:46 UTC (permalink / raw)
To: peterz; +Cc: miklos, akpm, linux-kernel, linux-fsdevel, linux-mm
> > +void bdi_writeout_inc(struct backing_dev_info *bdi)
> > +{
> > + unsigned long flags;
> > +
> > + local_irq_save(flags);
> > + __bdi_writeout_inc(bdi);
> > + local_irq_restore(flags);
> > +}
> > +EXPORT_SYMBOL(bdi_writeout_inc);
> > +
>
> May I ask to make this a _GPL export, please?
Sure.
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 3/8] mm: rotate_reclaimable_page() cleanup
2008-03-18 11:31 ` Peter Zijlstra
@ 2008-03-18 11:56 ` Miklos Szeredi
2008-03-18 16:45 ` Andrew Morton
0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 11:56 UTC (permalink / raw)
To: peterz; +Cc: miklos, akpm, linux-kernel, linux-fsdevel, linux-mm
> > -int rotate_reclaimable_page(struct page *page)
> > +void rotate_reclaimable_page(struct page *page)
> > {
> > - struct pagevec *pvec;
> > - unsigned long flags;
> > -
> > - if (PageLocked(page))
> > - return 1;
> > - if (PageDirty(page))
> > - return 1;
> > - if (PageActive(page))
> > - return 1;
> > - if (!PageLRU(page))
> > - return 1;
>
> Might be me, but I find the above easier to read than
>
> > + if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> > + PageLRU(page)) {
> >
Matter of taste, returning from a middle of a function is generally to
be avoided (unless not). Anyway, this is just a side effect of the
main cleanup, so I think I'm entitled to choose the style I prefer ;)
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 11:33 ` Peter Zijlstra
@ 2008-03-18 11:59 ` Miklos Szeredi
2008-03-18 12:29 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 11:59 UTC (permalink / raw)
To: peterz; +Cc: miklos, akpm, linux-kernel, linux-fsdevel, linux-mm
> On Mon, 2008-03-17 at 20:19 +0100, Miklos Szeredi wrote:
> > plain text document attachment (end_page_writeback_nobdi.patch)
> > From: Miklos Szeredi <mszeredi@suse.cz>
> >
> > Fuse's writepage will need to clear page writeback separately from
> > updating the per BDI counters.
>
> This is because of the juggling with temporary pages, right?
>
> Would be nice to have some comments in the code explaining this.
Yup, well it will go through a bigger cleanup, as discussed with
Andrew, if that's OK with you?
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 11:59 ` Miklos Szeredi
@ 2008-03-18 12:29 ` Peter Zijlstra
2008-03-18 12:51 ` Miklos Szeredi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2008-03-18 12:29 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, linux-mm
On Tue, 2008-03-18 at 12:59 +0100, Miklos Szeredi wrote:
> > On Mon, 2008-03-17 at 20:19 +0100, Miklos Szeredi wrote:
> > > plain text document attachment (end_page_writeback_nobdi.patch)
> > > From: Miklos Szeredi <mszeredi@suse.cz>
> > >
> > > Fuse's writepage will need to clear page writeback separately from
> > > updating the per BDI counters.
> >
> > This is because of the juggling with temporary pages, right?
> >
> > Would be nice to have some comments in the code explaining this.
>
> Yup, well it will go through a bigger cleanup, as discussed with
> Andrew, if that's OK with you?
Well, I was pondering that - it hadn't made its way out to the other
side of my brain yet.. but I'll dump have I have:
Yes, it does two things, _however_ those two things are very much
related. Your use-case that breaks this relation is an execption - and I
haven't really grasped it yet..
I'm in general not too keen about you having to export the BDI
accounting stuff and using it explicitly like this, but I'm afraid I
don't see a way around it - the danger is that other filesystems will
get creative (hence the req for GPL - that excludes the most creative
ones).
Yes, it makes sense to delay the write completion accounting until its
actually completed.. but I would suggest all writeback accounting.
So the thing that's in your way is that removing a page from the radix
tree doesn't imply its done writing. So perhaps we should make that
distinction instead?
So instead of conditionally do part of the accounting, never do it and
require something like: page_writeback_complete() to be called after a
successfull test_clear_page_writeback().
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 12:29 ` Peter Zijlstra
@ 2008-03-18 12:51 ` Miklos Szeredi
2008-03-18 13:08 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 12:51 UTC (permalink / raw)
To: peterz; +Cc: miklos, akpm, linux-kernel, linux-fsdevel, linux-mm
> Yes, it does two things, _however_ those two things are very much
> related. Your use-case that breaks this relation is an execption - and I
> haven't really grasped it yet..
>
> I'm in general not too keen about you having to export the BDI
> accounting stuff and using it explicitly like this, but I'm afraid I
> don't see a way around it - the danger is that other filesystems will
> get creative (hence the req for GPL - that excludes the most creative
> ones).
>
> Yes, it makes sense to delay the write completion accounting until its
> actually completed.. but I would suggest all writeback accounting.
Doesn't work, as long as we have throttle_vm_writeout() waiting for
NR_WRITEBACK to go below a threshold, delaying the NR_WRITEBACK
accounting could lead to a deadlock.
So at least until that's resolved NR_WRITEBACK_TEMP needs to be
separate from NR_WRITEBACK_TEMP. And it makes sense possibly even
after that, as they are fundamentally different things. The first one
is page cache pages being under writeout, the second is just kernel
buffers (mostly) unrelated to the page cache.
> So the thing that's in your way is that removing a page from the radix
> tree doesn't imply its done writing. So perhaps we should make that
> distinction instead?
>
> So instead of conditionally do part of the accounting, never do it and
> require something like: page_writeback_complete() to be called after a
> successfull test_clear_page_writeback().
Yes, that's a possibility, but then normal filesystems miss out on the
small optimization provided by doing the BDI accounting functions
inside the same IRQ disabled region as the radix tree operation.
Would that have any significant performance impact?
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 12:51 ` Miklos Szeredi
@ 2008-03-18 13:08 ` Peter Zijlstra
2008-03-18 13:58 ` Miklos Szeredi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2008-03-18 13:08 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, linux-mm
On Tue, 2008-03-18 at 13:51 +0100, Miklos Szeredi wrote:
> > Yes, it does two things, _however_ those two things are very much
> > related. Your use-case that breaks this relation is an execption - and I
> > haven't really grasped it yet..
> >
> > I'm in general not too keen about you having to export the BDI
> > accounting stuff and using it explicitly like this, but I'm afraid I
> > don't see a way around it - the danger is that other filesystems will
> > get creative (hence the req for GPL - that excludes the most creative
> > ones).
> >
> > Yes, it makes sense to delay the write completion accounting until its
> > actually completed.. but I would suggest all writeback accounting.
>
> Doesn't work, as long as we have throttle_vm_writeout() waiting for
> NR_WRITEBACK to go below a threshold, delaying the NR_WRITEBACK
> accounting could lead to a deadlock.
>
> So at least until that's resolved NR_WRITEBACK_TEMP needs to be
> separate from NR_WRITEBACK_TEMP. And it makes sense possibly even
> after that, as they are fundamentally different things. The first one
> is page cache pages being under writeout, the second is just kernel
> buffers (mostly) unrelated to the page cache.
Urgh, throttle_vm_writeout() again.. Agreed, that'll deadlock.
> > So the thing that's in your way is that removing a page from the radix
> > tree doesn't imply its done writing. So perhaps we should make that
> > distinction instead?
> >
> > So instead of conditionally do part of the accounting, never do it and
> > require something like: page_writeback_complete() to be called after a
> > successfull test_clear_page_writeback().
>
> Yes, that's a possibility, but then normal filesystems miss out on the
> small optimization provided by doing the BDI accounting functions
> inside the same IRQ disabled region as the radix tree operation.
> Would that have any significant performance impact?
Yeah, realized that. Don't know, would have to measure it somehow...
some archs are rather slow with disabling IRQs, but we're talking about
writeout which should be dominated by the IO times.
Its just that your proposal exposes too much guts, I'd like the
interface to be a little higher level.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 13:08 ` Peter Zijlstra
@ 2008-03-18 13:58 ` Miklos Szeredi
2008-03-18 13:59 ` Peter Zijlstra
0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 13:58 UTC (permalink / raw)
To: peterz; +Cc: miklos, akpm, linux-kernel, linux-fsdevel, linux-mm
> > > So the thing that's in your way is that removing a page from the radix
> > > tree doesn't imply its done writing. So perhaps we should make that
> > > distinction instead?
> > >
> > > So instead of conditionally do part of the accounting, never do it and
> > > require something like: page_writeback_complete() to be called after a
> > > successfull test_clear_page_writeback().
> >
> > Yes, that's a possibility, but then normal filesystems miss out on the
> > small optimization provided by doing the BDI accounting functions
> > inside the same IRQ disabled region as the radix tree operation.
> > Would that have any significant performance impact?
>
> Yeah, realized that. Don't know, would have to measure it somehow...
> some archs are rather slow with disabling IRQs, but we're talking about
> writeout which should be dominated by the IO times.
>
> Its just that your proposal exposes too much guts, I'd like the
> interface to be a little higher level.
Well, but this is the kernel, you can't really make foolproof
interfaces. If we'll go with Andrew's suggestion, I'll add comments
warning users about not touching those flags unless they know what
they are doing, OK?
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 13:58 ` Miklos Szeredi
@ 2008-03-18 13:59 ` Peter Zijlstra
2008-03-18 15:53 ` Miklos Szeredi
0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2008-03-18 13:59 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel, linux-mm
On Tue, 2008-03-18 at 14:58 +0100, Miklos Szeredi wrote:
> > > > So the thing that's in your way is that removing a page from the radix
> > > > tree doesn't imply its done writing. So perhaps we should make that
> > > > distinction instead?
> > > >
> > > > So instead of conditionally do part of the accounting, never do it and
> > > > require something like: page_writeback_complete() to be called after a
> > > > successfull test_clear_page_writeback().
> > >
> > > Yes, that's a possibility, but then normal filesystems miss out on the
> > > small optimization provided by doing the BDI accounting functions
> > > inside the same IRQ disabled region as the radix tree operation.
> > > Would that have any significant performance impact?
> >
> > Yeah, realized that. Don't know, would have to measure it somehow...
> > some archs are rather slow with disabling IRQs, but we're talking about
> > writeout which should be dominated by the IO times.
> >
> > Its just that your proposal exposes too much guts, I'd like the
> > interface to be a little higher level.
>
> Well, but this is the kernel, you can't really make foolproof
> interfaces. If we'll go with Andrew's suggestion, I'll add comments
> warning users about not touching those flags unless they know what
> they are doing, OK?
Yeah, I guess so :-)
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 13:59 ` Peter Zijlstra
@ 2008-03-18 15:53 ` Miklos Szeredi
2008-03-18 16:49 ` Andrew Morton
0 siblings, 1 reply; 29+ messages in thread
From: Miklos Szeredi @ 2008-03-18 15:53 UTC (permalink / raw)
To: peterz; +Cc: miklos, akpm, linux-kernel, linux-fsdevel, linux-mm
> > Well, but this is the kernel, you can't really make foolproof
> > interfaces. If we'll go with Andrew's suggestion, I'll add comments
> > warning users about not touching those flags unless they know what
> > they are doing, OK?
>
> Yeah, I guess so :-)
Cool :)
On a related note, is there a reason why bdi_cap_writeback_dirty() and
friends need to be macros instead of inline functions? If not I'd
clean that up as well.
Miklos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 3/8] mm: rotate_reclaimable_page() cleanup
2008-03-18 11:56 ` Miklos Szeredi
@ 2008-03-18 16:45 ` Andrew Morton
0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2008-03-18 16:45 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: peterz, linux-kernel, linux-fsdevel, linux-mm
On Tue, 18 Mar 2008 12:56:34 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > -int rotate_reclaimable_page(struct page *page)
> > > +void rotate_reclaimable_page(struct page *page)
> > > {
> > > - struct pagevec *pvec;
> > > - unsigned long flags;
> > > -
> > > - if (PageLocked(page))
> > > - return 1;
> > > - if (PageDirty(page))
> > > - return 1;
> > > - if (PageActive(page))
> > > - return 1;
> > > - if (!PageLRU(page))
> > > - return 1;
> >
> > Might be me, but I find the above easier to read than
Me too, but (believe it or not) sometimes I will eschew comment ;)
> > > + if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> > > + PageLRU(page)) {
> > >
>
> Matter of taste, returning from a middle of a function is generally to
> be avoided (unless not).
Avoiding multiple returns is more than a matter of taste: the practice is a
source of code bloat, resource leaks and locking errors.
But we do do it quite commonly at the start of the function when checking the
arguments, before the function has actually altered anything.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [patch 4/8] mm: allow not updating BDI stats in end_page_writeback()
2008-03-18 15:53 ` Miklos Szeredi
@ 2008-03-18 16:49 ` Andrew Morton
0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2008-03-18 16:49 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: peterz, linux-kernel, linux-fsdevel, linux-mm
On Tue, 18 Mar 2008 16:53:52 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:
> On a related note, is there a reason why bdi_cap_writeback_dirty() and
> friends need to be macros instead of inline functions?
None whatsoever.
> If not I'd clean that up as well.
Goodness.
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2008-03-20 0:23 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-17 19:19 [patch 0/8] fuse: writable mmap + batched write Miklos Szeredi
2008-03-17 19:19 ` [patch 1/8] mm: bdi: export bdi_writeout_inc() Miklos Szeredi
2008-03-18 11:27 ` Peter Zijlstra
2008-03-18 11:46 ` Miklos Szeredi
2008-03-17 19:19 ` [patch 2/8] mm: Add NR_WRITEBACK_TEMP counter Miklos Szeredi
2008-03-18 5:05 ` Andrew Morton
2008-03-17 19:19 ` [patch 3/8] mm: rotate_reclaimable_page() cleanup Miklos Szeredi
2008-03-18 11:31 ` Peter Zijlstra
2008-03-18 11:56 ` Miklos Szeredi
2008-03-18 16:45 ` Andrew Morton
2008-03-17 19:19 ` [patch 4/8] mm: allow not updating BDI stats in end_page_writeback() Miklos Szeredi
2008-03-18 5:04 ` Andrew Morton
2008-03-18 8:11 ` Miklos Szeredi
2008-03-18 8:18 ` Andrew Morton
2008-03-18 11:33 ` Peter Zijlstra
2008-03-18 11:59 ` Miklos Szeredi
2008-03-18 12:29 ` Peter Zijlstra
2008-03-18 12:51 ` Miklos Szeredi
2008-03-18 13:08 ` Peter Zijlstra
2008-03-18 13:58 ` Miklos Szeredi
2008-03-18 13:59 ` Peter Zijlstra
2008-03-18 15:53 ` Miklos Szeredi
2008-03-18 16:49 ` Andrew Morton
2008-03-17 19:19 ` [patch 5/8] fuse: support writable mmap Miklos Szeredi
2008-03-17 19:19 ` [patch 6/8] fuse: clean up setting i_size in write Miklos Szeredi
2008-03-18 5:08 ` Andrew Morton
2008-03-18 8:16 ` Miklos Szeredi
2008-03-17 19:19 ` [patch 7/8] fuse: implement perform_write Miklos Szeredi
2008-03-17 19:19 ` [patch 8/8] fuse: update file size on short read Miklos Szeredi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).