Linux-Fsdevel Archive on lore.kernel.org
* [PATCH v2 00/20] virtiofs: Add DAX support
@ 2020-08-07 19:55 Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
                   ` (20 more replies)
  0 siblings, 21 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

Hi All,

This is V2 of the patches. I posted V1 here:

https://lore.kernel.org/linux-fsdevel/20200304165845.3081-1-vgoyal@redhat.com/

I have taken care of most of the comments from V1.

Amir suggested that some of the code could be moved to a new file,
possibly after this series. I have left that item for the post-series
cleanup.

Miklos was concerned about KVM error handling when a mapped file on the
host gets truncated and the guest tries to access it. As of now the
guest can spin in an infinite loop. I have discussed this with the KVM
developers and there is no easy race-free way to inject a synchronous
exception which can indicate the error in the guest. For now, I have
proposed that the guest at least exit to qemu with an error when such a
scenario happens, and posted a patch which has received two acks. But
it has not been merged yet.

https://lore.kernel.org/kvm/20200720211359.GF502563@redhat.com/

We don't support modifications of virtiofs files from the host or
from other guests. And even if somebody modifies one, qemu will
exit gracefully without bringing down the host. So I think this
is probably good enough for now to make progress on the virtiofs
DAX patches.

Fixing KVM to inject errors is not easy and requires a complicated
design. We will need to make sure that happens before we support DAX
in shared mode in virtiofs. But blocking behind this is probably not
a good idea, because DAX can be very useful in non-shared mode in
kata containers.

Performance numbers are more or less the same as in the V1 post, so I
am not providing any numbers again. In general, DAX can provide a good
performance boost and memory savings.

Any thoughts or feedback are welcome.

Description from previous post
------------------------------

This patch series adds DAX support to the virtiofs filesystem. This
allows bypassing the guest page cache and mapping the host page cache
directly into the guest address space.

When a page of a file is needed, the guest sends a request to map that
page (from the host page cache) into qemu's address space. Inside the
guest this is a physical memory range controlled by the virtiofs
device, and the guest directly maps this physical address range using
DAX, thereby getting access to the file data on the host.
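
As an illustration, the mapping request introduced later in this series
(patch 12, the setupmapping/removemapping commands) looks roughly like
the following sketch; see that patch for the authoritative layout:

  struct fuse_setupmapping_in {
          uint64_t fh;            /* already open file handle */
          uint64_t foffset;       /* offset into the file */
          uint64_t len;           /* length of mapping required */
          uint64_t flags;         /* read/write mapping flags */
          uint64_t moffset;       /* offset into the DAX memory window */
  };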

This can speed things up considerably in many situations. It can also
result in substantial memory savings, as file data does not have to be
copied into the guest and is instead accessed directly from the host
page cache.

Most of the changes are limited to fuse/virtiofs. There are a couple
of changes needed in the generic dax infrastructure and a couple of
changes in virtio to be able to access the shared memory region.
 
Thanks
Vivek

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (2):
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse,dax: add DAX mmap support

Vivek Goyal (15):
  dax: Modify bdev_dax_pgoff() to handle NULL bdev
  dax: Create a range version of dax_layout_busy_page()
  virtiofs: Provide a helper function for virtqueue initialization
  fuse: Get rid of no_mount_options
  fuse,virtiofs: Add a mount option to enable dax
  fuse,virtiofs: Keep a list of free dax memory ranges
  fuse: implement FUSE_INIT map_alignment field
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse,virtiofs: Define dax address space operations
  fuse,virtiofs: Maintain a list of busy elements
  fuse: Release file in process context
  fuse: Take inode lock for dax inode truncation
  fuse,virtiofs: Add logic to free up a memory range

 drivers/dax/super.c                |    3 +-
 drivers/virtio/virtio_mmio.c       |   32 +
 drivers/virtio/virtio_pci_modern.c |   96 +++
 fs/dax.c                           |   66 +-
 fs/fuse/dir.c                      |    2 +
 fs/fuse/file.c                     | 1189 +++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h                   |  112 ++-
 fs/fuse/inode.c                    |  147 +++-
 fs/fuse/virtio_fs.c                |  279 ++++++-
 include/linux/dax.h                |    6 +
 include/linux/virtio_config.h      |   17 +
 include/uapi/linux/fuse.h          |   34 +-
 include/uapi/linux/virtio_fs.h     |    3 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h    |   11 +-
 15 files changed, 1927 insertions(+), 81 deletions(-)

-- 
2.25.4



* [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-17 16:57   ` Jan Kara
  2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Christoph Hellwig,
	Dan Williams, linux-nvdimm

virtiofs does not have a block device, but it does have a dax device.
Modify bdev_dax_pgoff() to be able to handle that.

If there is no bdev, the dax offset is 0 (it can't be a partition
block device starting at an offset into the dax device).
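
A hedged usage sketch for a bdev-less caller like virtiofs, where the
iomap->bdev seen by the dax code is NULL:

  pgoff_t pgoff;
  int ret;

  /* No partition offset: the sector maps 1:1 into the dax device (sketch) */
  ret = bdev_dax_pgoff(NULL, sector, PAGE_SIZE, &pgoff);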

This is a little hackish. There have been discussions about getting
rid of dax support on partitions altogether.

https://lore.kernel.org/linux-fsdevel/20200107125159.GA15745@infradead.org/

IMHO, that path can easily break existing users. For example,
ioctl(BLKPG_ADD_PARTITION) would start failing on block devices
supporting DAX. Also, I personally find it very useful to be able to
partition dax devices and still be able to use DAX.

Alternatively, I tried to store the offset into the dax device in the
iomap interface, but that got NACKed.

https://lore.kernel.org/linux-fsdevel/20200217133117.GB20444@infradead.org/

I can't think of a good way to solve this issue properly. So, to make
progress, this patch seems like the least bad option for now, and I
hope we can take it.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
---
 drivers/dax/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 8e32345be0f7..c4bec437e88b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 		pgoff_t *pgoff)
 {
-	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+	phys_addr_t phys_off = (start_sect + sector) * 512;
 
 	if (pgoff)
 		*pgoff = PHYS_PFN(phys_off);
-- 
2.25.4



* [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-17 16:53   ` Jan Kara
  2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Dan Williams, linux-nvdimm

The virtiofs device has a range of memory which is mapped into file
inodes using dax. This memory is mapped into qemu on the host and maps
different sections of real files on the host. The size of this memory
is limited (determined by the administrator) and, depending on the
filesystem size, we will soon reach a situation where all the memory
is in use and we need to reclaim some.

As part of the reclaim process, we will need to make sure that there
are no active references to pages (taken by get_user_pages()) in the
memory range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this, but in its current form it is
per-inode and scans through all the pages of the inode.

We want to reclaim only a portion of memory (say a 2MB range), so we
want to make sure that only that 2MB range of pages has no references
(and we don't want to unmap all the pages of the inode).

Hence, create a range version of this function, named
dax_layout_busy_page_range(), which can be passed the range which
needs to be unmapped.
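
A hedged sketch of the intended reclaim-side usage (FUSE_DAX_SZ and the
surrounding logic come from later patches in this series; treat the
details as illustrative):

  /* Check one 2MB reclaim candidate for pinned pages (sketch) */
  struct page *page;

  page = dax_layout_busy_page_range(inode->i_mapping, start,
                                    start + FUSE_DAX_SZ - 1);
  if (page) {
          /* Busy: a reference is held, e.g. by get_user_pages() */
  }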

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/dax.c            | 66 ++++++++++++++++++++++++++++++++-------------
 include/linux/dax.h |  6 +++++
 2 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..0d51b0fbb489 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
-/**
- * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
+/*
+ * Partial pages are included. If end is LLONG_MAX, pages in the range from
+ * start to the end of the file are included.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+					loff_t start, loff_t end)
 {
-	XA_STATE(xas, &mapping->i_pages, 0);
 	void *entry;
 	unsigned int scanned = 0;
 	struct page *page = NULL;
+	pgoff_t start_idx = start >> PAGE_SHIFT;
+	pgoff_t end_idx = end >> PAGE_SHIFT;
+	XA_STATE(xas, &mapping->i_pages, start_idx);
+	loff_t len, lstart = round_down(start, PAGE_SIZE);
 
 	/*
 	 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
 		return NULL;
 
+	/* If end == LLONG_MAX, all pages from start to the end of the file */
+	if (end == LLONG_MAX) {
+		end_idx = ULONG_MAX;
+		len = 0;
+	} else {
+		/* The length is calculated from lstart and not start.
+		 * This is due to the behavior of unmap_mapping_range(). If
+		 * start is, say, 4094 and end is 4096, then we want to
+		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
+		 * would unmap only the page at idx 0. If we calculate len
+		 * from the rounded-down start, this problem does not
+		 * happen.
+		 */
+		len = end - lstart + 1;
+	}
+
 	/*
 	 * If we race get_user_pages_fast() here either we'll see the
 	 * elevated page count in the iteration and wait, or
@@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	 * guaranteed to either see new references or prevent new
 	 * references from being established.
 	 */
-	unmap_mapping_range(mapping, 0, 0, 0);
+	unmap_mapping_range(mapping, start, len, 0);
 
 	xas_lock_irq(&xas);
-	xas_for_each(&xas, entry, ULONG_MAX) {
+	xas_for_each(&xas, entry, end_idx) {
 		if (WARN_ON_ONCE(!xa_is_value(entry)))
 			continue;
 		if (unlikely(dax_is_locked(entry)))
@@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	xas_unlock_irq(&xas);
 	return page;
 }
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+}
 EXPORT_SYMBOL_GPL(dax_layout_busy_page);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..9016929db4c6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
 #else
@@ -171,6 +172,11 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 	return NULL;
 }
 
+static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end)
+{
+	return NULL;
+}
+
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc)
 {
-- 
2.25.4



* [PATCH v2 03/20] virtio: Add get_shm_region method
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10 13:47   ` Michael S. Tsirkin
                     ` (2 more replies)
  2020-08-07 19:55 ` [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport Vivek Goyal
                   ` (17 subsequent siblings)
  20 siblings, 3 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf, kvm,
	Michael S. Tsirkin

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

Virtio defines 'shared memory regions' that provide a contiguous
region of memory shared between the host and guest.

Provide a method to find a particular region on a device.
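
A hedged sketch of how a driver queries a region through this method
(the shm id is device-specific; 0 is only a placeholder here):

  struct virtio_shm_region region;

  if (!virtio_get_shm_region(vdev, &region, 0))
          return -ENODEV;  /* region not advertised by the device */

  /* region.addr and region.len now describe the shared memory window */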

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: kvm@vger.kernel.org
Cc: "Michael S. Tsirkin" <mst@redhat.com>
---
 include/linux/virtio_config.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bb4cc4910750..c859f000a751 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -10,6 +10,11 @@
 
 struct irq_affinity;
 
+struct virtio_shm_region {
+       u64 addr;
+       u64 len;
+};
+
 /**
  * virtio_config_ops - operations for configuring a virtio device
  * Note: Do not assume that a transport implements all of the operations
@@ -65,6 +70,7 @@ struct irq_affinity;
  *      the caller can then copy.
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -88,6 +94,8 @@ struct virtio_config_ops {
 			       const struct cpumask *cpu_mask);
 	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
 			int index);
+	bool (*get_shm_region)(struct virtio_device *vdev,
+			       struct virtio_shm_region *region, u8 id);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
 	return 0;
 }
 
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+                         struct virtio_shm_region *region, u8 id)
+{
+	if (!vdev->config->get_shm_region)
+		return false;
+	return vdev->config->get_shm_region(vdev, region, id);
+}
+
 static inline bool virtio_is_little_endian(struct virtio_device *vdev)
 {
 	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
-- 
2.25.4



* [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (2 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10 14:05   ` Michael S. Tsirkin
  2020-08-07 19:55 ` [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf,
	kbuild test robot, kvm, Michael S. Tsirkin

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

On PCI the shm regions are found using capability entries;
find a region by searching for the capability.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
Cc: kvm@vger.kernel.org
Cc: "Michael S. Tsirkin" <mst@redhat.com>
---
 drivers/virtio/virtio_pci_modern.c | 96 ++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_pci.h    | 11 +++-
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index db93cedd262f..3fc0cd848fe9 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -444,6 +444,100 @@ static void del_vq(struct virtio_pci_vq_info *info)
 	vring_del_virtqueue(vq);
 }
 
+static int virtio_pci_find_shm_cap(struct pci_dev *dev, u8 required_id,
+				   u8 *bar, u64 *offset, u64 *len)
+{
+	int pos;
+
+	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); pos > 0;
+	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+		u8 type, cap_len, id;
+		u32 tmp32;
+		u64 res_offset, res_length;
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 cfg_type), &type);
+		if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+			continue;
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 cap_len), &cap_len);
+		if (cap_len != sizeof(struct virtio_pci_cap64)) {
+			printk(KERN_ERR "%s: shm cap with bad size offset: %d"
+			       " size: %d\n", __func__, pos, cap_len);
+			continue;
+		}
+
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+                                                         id), &id);
+		if (id != required_id)
+			continue;
+
+		/* Type, and ID match, looks good */
+		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+							 bar), bar);
+
+		/* Read the lower 32bit of length and offset */
+		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
+							  offset), &tmp32);
+		res_offset = tmp32;
+		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
+							  length), &tmp32);
+		res_length = tmp32;
+
+		/* and now the top half */
+		pci_read_config_dword(dev,
+				      pos + offsetof(struct virtio_pci_cap64,
+                                                     offset_hi), &tmp32);
+		res_offset |= ((u64)tmp32) << 32;
+		pci_read_config_dword(dev,
+				      pos + offsetof(struct virtio_pci_cap64,
+                                                     length_hi), &tmp32);
+		res_length |= ((u64)tmp32) << 32;
+
+		*offset = res_offset;
+		*len = res_length;
+
+		return pos;
+	}
+	return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+			      struct virtio_shm_region *region, u8 id)
+{
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct pci_dev *pci_dev = vp_dev->pci_dev;
+	u8 bar;
+	u64 offset, len;
+	phys_addr_t phys_addr;
+	size_t bar_len;
+
+	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
+		return false;
+	}
+
+	phys_addr = pci_resource_start(pci_dev, bar);
+	bar_len = pci_resource_len(pci_dev, bar);
+
+	if ((offset + len) < offset) {
+		dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
+			__func__);
+		return false;
+	}
+
+	if (offset + len > bar_len) {
+		dev_err(&pci_dev->dev, "%s: bar shorter than cap offset+len\n",
+			__func__);
+		return false;
+	}
+
+	region->len = len;
+	region->addr = (u64) phys_addr + offset;
+
+	return true;
+}
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.get		= NULL,
 	.set		= NULL,
@@ -458,6 +552,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
 	.bus_name	= vp_bus_name,
 	.set_vq_affinity = vp_set_vq_affinity,
 	.get_vq_affinity = vp_get_vq_affinity,
+	.get_shm_region  = vp_get_shm_region,
 };
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -474,6 +569,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
 	.bus_name	= vp_bus_name,
 	.set_vq_affinity = vp_set_vq_affinity,
 	.get_vq_affinity = vp_get_vq_affinity,
+	.get_shm_region  = vp_get_shm_region,
 };
 
 /**
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..fe9f43680a1d 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
 #define VIRTIO_PCI_CAP_DEVICE_CFG	4
 /* PCI configuration access */
 #define VIRTIO_PCI_CAP_PCI_CFG		5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
 
 /* This is the PCI capability header: */
 struct virtio_pci_cap {
@@ -121,11 +123,18 @@ struct virtio_pci_cap {
 	__u8 cap_len;		/* Generic PCI field: capability length */
 	__u8 cfg_type;		/* Identifies the structure. */
 	__u8 bar;		/* Where to find it. */
-	__u8 padding[3];	/* Pad to full dword. */
+	__u8 id;		/* Multiple capabilities of the same type */
+	__u8 padding[2];	/* Pad to full dword. */
 	__le32 offset;		/* Offset within bar. */
 	__le32 length;		/* Length of the structure, in bytes. */
 };
 
+struct virtio_pci_cap64 {
+       struct virtio_pci_cap cap;
+       __le32 offset_hi;             /* Most sig 32 bits of offset */
+       __le32 length_hi;             /* Most sig 32 bits of length */
+};
+
 struct virtio_pci_notify_cap {
 	struct virtio_pci_cap cap;
 	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
-- 
2.25.4



* [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (3 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10 14:03   ` Michael S. Tsirkin
  2020-08-07 19:55 ` [PATCH v2 06/20] virtiofs: Provide a helper function for virtqueue initialization Vivek Goyal
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf, kvm,
	Michael S. Tsirkin

From: Sebastien Boeuf <sebastien.boeuf@intel.com>

On MMIO a new set of registers is defined for finding SHM
regions.  Add their definitions and use them to find the region.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: kvm@vger.kernel.org
Cc: "Michael S. Tsirkin" <mst@redhat.com>
---
 drivers/virtio/virtio_mmio.c     | 32 ++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_mmio.h | 11 +++++++++++
 2 files changed, 43 insertions(+)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 627ac0487494..2697c492cf78 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -498,6 +498,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
 	return vm_dev->pdev->name;
 }
 
+static bool vm_get_shm_region(struct virtio_device *vdev,
+			      struct virtio_shm_region *region, u8 id)
+{
+	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+	u64 len, addr;
+
+	/* Select the region we're interested in */
+	writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+	/* Read the region size */
+	len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+	len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+	region->len = len;
+
+	/* Check if region length is -1. If that's the case, the shared memory
+	 * region does not exist and there is no need to proceed further.
+	 */
+	if (len == ~(u64)0) {
+		return false;
+	}
+
+	/* Read the region base address */
+	addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+	addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+	region->addr = addr;
+
+	return true;
+}
+
 static const struct virtio_config_ops virtio_mmio_config_ops = {
 	.get		= vm_get,
 	.set		= vm_set,
@@ -510,6 +541,7 @@ static const struct virtio_config_ops virtio_mmio_config_ops = {
 	.get_features	= vm_get_features,
 	.finalize_features = vm_finalize_features,
 	.bus_name	= vm_bus_name,
+	.get_shm_region = vm_get_shm_region,
 };
 
 
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW	0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH	0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL             0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW         0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH        0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW        0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH       0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION	0x0fc
 
-- 
2.25.4



* [PATCH v2 06/20] virtiofs: Provide a helper function for virtqueue initialization
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (4 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 07/20] fuse: Get rid of no_mount_options Vivek Goyal
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

This reduces code duplication and makes the code a little easier to read.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/virtio_fs.c | 50 +++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4c4ef5d69298..e615edfbd09c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -24,6 +24,8 @@ enum {
 	VQ_REQUEST
 };
 
+#define VQ_NAME_LEN	24
+
 /* Per-virtqueue state */
 struct virtio_fs_vq {
 	spinlock_t lock;
@@ -36,7 +38,7 @@ struct virtio_fs_vq {
 	bool connected;
 	long in_flight;
 	struct completion in_flight_zero; /* No inflight requests */
-	char name[24];
+	char name[VQ_NAME_LEN];
 } ____cacheline_aligned_in_smp;
 
 /* A virtio-fs device instance */
@@ -596,6 +598,26 @@ static void virtio_fs_vq_done(struct virtqueue *vq)
 	schedule_work(&fsvq->done_work);
 }
 
+static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
+			      int vq_type)
+{
+	strncpy(fsvq->name, name, VQ_NAME_LEN);
+	spin_lock_init(&fsvq->lock);
+	INIT_LIST_HEAD(&fsvq->queued_reqs);
+	INIT_LIST_HEAD(&fsvq->end_reqs);
+	init_completion(&fsvq->in_flight_zero);
+
+	if (vq_type == VQ_REQUEST) {
+		INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
+		INIT_DELAYED_WORK(&fsvq->dispatch_work,
+				  virtio_fs_request_dispatch_work);
+	} else {
+		INIT_WORK(&fsvq->done_work, virtio_fs_hiprio_done_work);
+		INIT_DELAYED_WORK(&fsvq->dispatch_work,
+				  virtio_fs_hiprio_dispatch_work);
+	}
+}
+
 /* Initialize virtqueues */
 static int virtio_fs_setup_vqs(struct virtio_device *vdev,
 			       struct virtio_fs *fs)
@@ -611,7 +633,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
 	if (fs->num_request_queues == 0)
 		return -EINVAL;
 
-	fs->nvqs = 1 + fs->num_request_queues;
+	fs->nvqs = VQ_REQUEST + fs->num_request_queues;
 	fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
 	if (!fs->vqs)
 		return -ENOMEM;
@@ -625,29 +647,17 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
 		goto out;
 	}
 
+	/* Initialize the hiprio/forget request virtqueue */
 	callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
-	snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
-			"hiprio");
+	virtio_fs_init_vq(&fs->vqs[VQ_HIPRIO], "hiprio", VQ_HIPRIO);
 	names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
-	INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
-	INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
-	INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].end_reqs);
-	INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
-			virtio_fs_hiprio_dispatch_work);
-	init_completion(&fs->vqs[VQ_HIPRIO].in_flight_zero);
-	spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
 
 	/* Initialize the requests virtqueues */
 	for (i = VQ_REQUEST; i < fs->nvqs; i++) {
-		spin_lock_init(&fs->vqs[i].lock);
-		INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
-		INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
-				  virtio_fs_request_dispatch_work);
-		INIT_LIST_HEAD(&fs->vqs[i].queued_reqs);
-		INIT_LIST_HEAD(&fs->vqs[i].end_reqs);
-		init_completion(&fs->vqs[i].in_flight_zero);
-		snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
-			 "requests.%u", i - VQ_REQUEST);
+		char vq_name[VQ_NAME_LEN];
+
+		snprintf(vq_name, VQ_NAME_LEN, "requests.%u", i - VQ_REQUEST);
+		virtio_fs_init_vq(&fs->vqs[i], vq_name, VQ_REQUEST);
 		callbacks[i] = virtio_fs_vq_done;
 		names[i] = fs->vqs[i].name;
 	}
-- 
2.25.4



* [PATCH v2 07/20] fuse: Get rid of no_mount_options
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (5 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 06/20] virtiofs: Provide a helper function for virtqueue initialization Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 08/20] fuse,virtiofs: Add a mount option to enable dax Vivek Goyal
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

This option was introduced so that virtio_fs does not show any mount
options in fuse_show_options(), because we don't offer any of these
options to be controlled by the mounter.

Very soon we are planning to introduce the option "dax", which the
mounter should be able to specify; no_mount_options will not work
anymore then. What we need is a per-option flag so that the filesystem
can specify which options to show.

Add a few such flags to control the behavior in a more fine-grained
manner and get rid of no_mount_options.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/fuse_i.h    | 14 ++++++++++----
 fs/fuse/inode.c     | 22 ++++++++++++++--------
 fs/fuse/virtio_fs.c |  1 -
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..cf5e675100ec 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -471,18 +471,21 @@ struct fuse_fs_context {
 	int fd;
 	unsigned int rootmode;
 	kuid_t user_id;
+	bool user_id_show;
 	kgid_t group_id;
+	bool group_id_show;
 	bool is_bdev:1;
 	bool fd_present:1;
 	bool rootmode_present:1;
 	bool user_id_present:1;
 	bool group_id_present:1;
 	bool default_permissions:1;
+	bool default_permissions_show:1;
 	bool allow_other:1;
+	bool allow_other_show:1;
 	bool destroy:1;
 	bool no_control:1;
 	bool no_force_umount:1;
-	bool no_mount_options:1;
 	unsigned int max_read;
 	unsigned int blksize;
 	const char *subtype;
@@ -512,9 +515,11 @@ struct fuse_conn {
 
 	/** The user id for this mount */
 	kuid_t user_id;
+	bool user_id_show:1;
 
 	/** The group id for this mount */
 	kgid_t group_id;
+	bool group_id_show:1;
 
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
@@ -698,10 +703,14 @@ struct fuse_conn {
 
 	/** Check permissions based on the file mode or not? */
 	unsigned default_permissions:1;
+	bool default_permissions_show:1;
 
 	/** Allow other than the mounter user to access the filesystem ? */
 	unsigned allow_other:1;
 
+	/** Show allow_other in mount options */
+	bool allow_other_show:1;
+
 	/** Does the filesystem support copy_file_range? */
 	unsigned no_copy_file_range:1;
 
@@ -717,9 +726,6 @@ struct fuse_conn {
 	/** Do not allow MNT_FORCE umount */
 	unsigned int no_force_umount:1;
 
-	/* Do not show mount options */
-	unsigned int no_mount_options:1;
-
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..2ac5713c4c32 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -535,10 +535,12 @@ static int fuse_parse_param(struct fs_context *fc, struct fs_parameter *param)
 
 	case OPT_DEFAULT_PERMISSIONS:
 		ctx->default_permissions = true;
+		ctx->default_permissions_show = true;
 		break;
 
 	case OPT_ALLOW_OTHER:
 		ctx->allow_other = true;
+		ctx->allow_other_show = true;
 		break;
 
 	case OPT_MAX_READ:
@@ -573,14 +575,15 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	if (fc->no_mount_options)
-		return 0;
-
-	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
-	if (fc->default_permissions)
+	if (fc->user_id_show)
+		seq_printf(m, ",user_id=%u",
+			   from_kuid_munged(fc->user_ns, fc->user_id));
+	if (fc->group_id_show)
+		seq_printf(m, ",group_id=%u",
+			   from_kgid_munged(fc->user_ns, fc->group_id));
+	if (fc->default_permissions && fc->default_permissions_show)
 		seq_puts(m, ",default_permissions");
-	if (fc->allow_other)
+	if (fc->allow_other && fc->allow_other_show)
 		seq_puts(m, ",allow_other");
 	if (fc->max_read != ~0)
 		seq_printf(m, ",max_read=%u", fc->max_read);
@@ -1193,14 +1196,17 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	sb->s_flags |= SB_POSIXACL;
 
 	fc->default_permissions = ctx->default_permissions;
+	fc->default_permissions_show = ctx->default_permissions_show;
 	fc->allow_other = ctx->allow_other;
+	fc->allow_other_show = ctx->allow_other_show;
 	fc->user_id = ctx->user_id;
+	fc->user_id_show = ctx->user_id_show;
 	fc->group_id = ctx->group_id;
+	fc->group_id_show = ctx->group_id_show;
 	fc->max_read = max_t(unsigned, 4096, ctx->max_read);
 	fc->destroy = ctx->destroy;
 	fc->no_control = ctx->no_control;
 	fc->no_force_umount = ctx->no_force_umount;
-	fc->no_mount_options = ctx->no_mount_options;
 
 	err = -ENOMEM;
 	root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index e615edfbd09c..afddae8ee2ec 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1096,7 +1096,6 @@ static int virtio_fs_fill_super(struct super_block *sb)
 		.destroy = true,
 		.no_control = true,
 		.no_force_umount = true,
-		.no_mount_options = true,
 	};
 
 	mutex_lock(&virtio_fs_mutex);
-- 
2.25.4



* [PATCH v2 08/20] fuse,virtiofs: Add a mount option to enable dax
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (6 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 07/20] fuse: Get rid of no_mount_options Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 09/20] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

Add a mount option to allow using dax with virtio_fs.
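
For example, with this option a mounter can opt in to dax like this
(tag name hypothetical):

  mount -t virtiofs -o dax myfs /mnt/virtiofs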

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/fuse_i.h    |  7 ++++
 fs/fuse/inode.c     |  3 ++
 fs/fuse/virtio_fs.c | 82 +++++++++++++++++++++++++++++++++++++--------
 3 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cf5e675100ec..04fdd7c41bd1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -486,10 +486,14 @@ struct fuse_fs_context {
 	bool destroy:1;
 	bool no_control:1;
 	bool no_force_umount:1;
+	bool dax:1;
 	unsigned int max_read;
 	unsigned int blksize;
 	const char *subtype;
 
+	/* DAX device, may be NULL */
+	struct dax_device *dax_dev;
+
 	/* fuse_dev pointer to fill in, should contain NULL on entry */
 	void **fudptr;
 };
@@ -761,6 +765,9 @@ struct fuse_conn {
 
 	/** List of device instances belonging to this connection */
 	struct list_head devices;
+
+	/** DAX device, non-NULL if DAX is supported */
+	struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2ac5713c4c32..beac337ccc10 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -589,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->dax_dev)
+		seq_printf(m, ",dax");
 	return 0;
 }
 
@@ -1207,6 +1209,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	fc->destroy = ctx->destroy;
 	fc->no_control = ctx->no_control;
 	fc->no_force_umount = ctx->no_force_umount;
+	fc->dax_dev = ctx->dax_dev;
 
 	err = -ENOMEM;
 	root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index afddae8ee2ec..add31794ca1a 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -10,6 +10,7 @@
 #include <linux/virtio_fs.h>
 #include <linux/delay.h>
 #include <linux/fs_context.h>
+#include <linux/fs_parser.h>
 #include <linux/highmem.h>
 #include "fuse_i.h"
 
@@ -71,6 +72,45 @@ struct virtio_fs_req_work {
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 				 struct fuse_req *req, bool in_flight);
 
+enum {
+	OPT_DAX,
+};
+
+static const struct fs_parameter_spec virtio_fs_parameters[] = {
+	fsparam_flag	("dax",		OPT_DAX),
+	{}
+};
+
+static int virtio_fs_parse_param(struct fs_context *fc,
+				 struct fs_parameter *param)
+{
+	struct fs_parse_result result;
+	struct fuse_fs_context *ctx = fc->fs_private;
+	int opt;
+
+	opt = fs_parse(fc, virtio_fs_parameters, param, &result);
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case OPT_DAX:
+		ctx->dax = 1;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void virtio_fs_free_fc(struct fs_context *fc)
+{
+	struct fuse_fs_context *ctx = fc->fs_private;
+
+	if (ctx)
+		kfree(ctx);
+}
+
 static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
 {
 	struct virtio_fs *fs = vq->vdev->priv;
@@ -1081,23 +1121,27 @@ static const struct fuse_iqueue_ops virtio_fs_fiq_ops = {
 	.release			= virtio_fs_fiq_release,
 };
 
-static int virtio_fs_fill_super(struct super_block *sb)
+static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
+{
+	ctx->rootmode = S_IFDIR;
+	ctx->default_permissions = 1;
+	ctx->allow_other = 1;
+	ctx->max_read = UINT_MAX;
+	ctx->blksize = 512;
+	ctx->destroy = true;
+	ctx->no_control = true;
+	ctx->no_force_umount = true;
+}
+
+static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
 {
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 	struct virtio_fs *fs = fc->iq.priv;
+	struct fuse_fs_context *ctx = fsc->fs_private;
 	unsigned int i;
 	int err;
-	struct fuse_fs_context ctx = {
-		.rootmode = S_IFDIR,
-		.default_permissions = 1,
-		.allow_other = 1,
-		.max_read = UINT_MAX,
-		.blksize = 512,
-		.destroy = true,
-		.no_control = true,
-		.no_force_umount = true,
-	};
 
+	virtio_fs_ctx_set_defaults(ctx);
 	mutex_lock(&virtio_fs_mutex);
 
 	/* After holding mutex, make sure virtiofs device is still there.
@@ -1121,8 +1165,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
 	}
 
 	/* virtiofs allocates and installs its own fuse devices */
-	ctx.fudptr = NULL;
-	err = fuse_fill_super_common(sb, &ctx);
+	ctx->fudptr = NULL;
+	if (ctx->dax)
+		ctx->dax_dev = fs->dax_dev;
+	err = fuse_fill_super_common(sb, ctx);
 	if (err < 0)
 		goto err_free_fuse_devs;
 
@@ -1233,7 +1279,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 		return PTR_ERR(sb);
 
 	if (!sb->s_root) {
-		err = virtio_fs_fill_super(sb);
+		err = virtio_fs_fill_super(sb, fsc);
 		if (err) {
 			deactivate_locked_super(sb);
 			return err;
@@ -1248,11 +1294,19 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 }
 
 static const struct fs_context_operations virtio_fs_context_ops = {
+	.free		= virtio_fs_free_fc,
+	.parse_param	= virtio_fs_parse_param,
 	.get_tree	= virtio_fs_get_tree,
 };
 
 static int virtio_fs_init_fs_context(struct fs_context *fsc)
 {
+	struct fuse_fs_context *ctx;
+
+	ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	fsc->fs_private = ctx;
 	fsc->ops = &virtio_fs_context_ops;
 	return 0;
 }
-- 
2.25.4



* [PATCH v2 09/20] virtio_fs, dax: Set up virtio_fs dax_device
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (7 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 08/20] fuse,virtiofs: Add a mount option to enable dax Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 10/20] fuse,virtiofs: Keep a list of free dax memory ranges Vivek Goyal
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Sebastien Boeuf, Liu Bo

From: Stefan Hajnoczi <stefanha@redhat.com>

Set up a dax device.

Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86).  Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/virtio_fs.c            | 139 +++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_fs.h |   3 +
 2 files changed, 142 insertions(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index add31794ca1a..e54696f778a7 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,9 @@
  */
 
 #include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/pci.h>
+#include <linux/pfn_t.h>
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_fs.h>
@@ -12,6 +15,7 @@
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
 #include <linux/highmem.h>
+#include <linux/uio.h>
 #include "fuse_i.h"
 
 /* List of virtio-fs device instances and a lock for the list. Also provides
@@ -50,6 +54,12 @@ struct virtio_fs {
 	struct virtio_fs_vq *vqs;
 	unsigned int nvqs;               /* number of virtqueues */
 	unsigned int num_request_queues; /* number of request queues */
+	struct dax_device *dax_dev;
+
+	/* DAX memory window where file contents are mapped */
+	void *window_kaddr;
+	phys_addr_t window_phys_addr;
+	size_t window_len;
 };
 
 struct virtio_fs_forget_req {
@@ -726,6 +736,131 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
 	vdev->config->del_vqs(vdev);
 }
 
+/* Map a window offset to a page frame number.  The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+				    long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct virtio_fs *fs = dax_get_private(dax_dev);
+	phys_addr_t offset = PFN_PHYS(pgoff);
+	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+	if (kaddr)
+		*kaddr = fs->window_kaddr + offset;
+	if (pfn)
+		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+					PFN_DEV | PFN_MAP);
+	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+				       pgoff_t pgoff, void *addr,
+				       size_t bytes, struct iov_iter *i)
+{
+	return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+				       pgoff_t pgoff, void *addr,
+				       size_t bytes, struct iov_iter *i)
+{
+	return copy_to_iter(addr, bytes, i);
+}
+
+static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
+				     pgoff_t pgoff, size_t nr_pages)
+{
+	long rc;
+	void *kaddr;
+
+	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+	if (rc < 0)
+		return rc;
+	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
+	dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
+	return 0;
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+	.direct_access = virtio_fs_direct_access,
+	.copy_from_iter = virtio_fs_copy_from_iter,
+	.copy_to_iter = virtio_fs_copy_to_iter,
+	.zero_page_range = virtio_fs_zero_page_range,
+};
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+	struct dax_device *dax_dev = data;
+
+	kill_dax(dax_dev);
+	put_dax(dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+	struct virtio_shm_region cache_reg;
+	struct dev_pagemap *pgmap;
+	bool have_cache;
+
+	if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+		return 0;
+
+	/* Get cache region */
+	have_cache = virtio_get_shm_region(vdev, &cache_reg,
+					   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+	if (!have_cache) {
+		dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
+		return 0;
+	}
+
+	if (!devm_request_mem_region(&vdev->dev, cache_reg.addr, cache_reg.len,
+				     dev_name(&vdev->dev))) {
+		dev_warn(&vdev->dev, "could not reserve region addr=0x%llx"
+			 " len=0x%llx\n", cache_reg.addr, cache_reg.len);
+		return -EBUSY;
+	}
+
+	dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
+		   cache_reg.addr);
+
+	pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+	/* Ideally we would directly use the PCI BAR resource but
+	 * devm_memremap_pages() wants its own copy in pgmap.  So
+	 * initialize a struct resource from scratch (only the start
+	 * and end fields will be used).
+	 */
+	pgmap->res = (struct resource){
+		.name = "virtio-fs dax window",
+		.start = (phys_addr_t) cache_reg.addr,
+		.end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
+	};
+
+	fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
+	if (IS_ERR(fs->window_kaddr))
+		return PTR_ERR(fs->window_kaddr);
+
+	fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
+	fs->window_len = (phys_addr_t) cache_reg.len;
+
+	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
+		" len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
+		cache_reg.len);
+
+	fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops, 0);
+	if (IS_ERR(fs->dax_dev))
+		return PTR_ERR(fs->dax_dev);
+
+	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
+					fs->dax_dev);
+}
+
 static int virtio_fs_probe(struct virtio_device *vdev)
 {
 	struct virtio_fs *fs;
@@ -747,6 +882,10 @@ static int virtio_fs_probe(struct virtio_device *vdev)
 
 	/* TODO vq affinity */
 
+	ret = virtio_fs_setup_dax(vdev, fs);
+	if (ret < 0)
+		goto out_vqs;
+
 	/* Bring the device online in case the filesystem is mounted and
 	 * requests need to be sent before we return.
 	 */
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index b02eb2ac3d99..2f64abce781f 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -16,4 +16,7 @@ struct virtio_fs_config {
 	__u32 num_request_queues;
 } __attribute__((packed));
 
+/* For the id field in virtio_pci_shm_cap */
+#define VIRTIO_FS_SHMCAP_ID_CACHE 0
+
 #endif /* _UAPI_LINUX_VIRTIO_FS_H */
-- 
2.25.4



* [PATCH v2 10/20] fuse,virtiofs: Keep a list of free dax memory ranges
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (8 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 09/20] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 11/20] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Peng Tao

Divide the dax memory range into fixed-size ranges (2MB for now) and
put them in a list. This list tracks the free ranges. Once an inode
requires a free range, we take one from here and put it in the
interval tree of ranges assigned to the inode.
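
A hedged sketch of how a later patch in this series claims a free
range (the helper name and exact locking are assumptions here):

  static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
  {
          struct fuse_dax_mapping *dmap;

          spin_lock(&fc->lock);
          dmap = list_first_entry_or_null(&fc->free_ranges,
                                          struct fuse_dax_mapping, list);
          if (dmap) {
                  list_del_init(&dmap->list);
                  fc->nr_free_ranges--;
          }
          spin_unlock(&fc->lock);
          return dmap;
  }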

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 fs/fuse/fuse_i.h    | 23 ++++++++++++
 fs/fuse/inode.c     | 88 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/virtio_fs.c |  2 ++
 3 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 04fdd7c41bd1..478c940b05b4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,11 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
+/* Default memory range size, 2MB */
+#define FUSE_DAX_SZ	(2*1024*1024)
+#define FUSE_DAX_SHIFT	(21)
+#define FUSE_DAX_PAGES	(FUSE_DAX_SZ/PAGE_SIZE)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -63,6 +68,18 @@ struct fuse_forget_link {
 	struct fuse_forget_link *next;
 };
 
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+	/* Will connect in fc->free_ranges to keep track of free memory */
+	struct list_head list;
+
+	/** Position in DAX window */
+	u64 window_offset;
+
+	/** Length of mapping, in bytes */
+	loff_t length;
+};
+
 /** FUSE inode */
 struct fuse_inode {
 	/** Inode data */
@@ -768,6 +785,12 @@ struct fuse_conn {
 
 	/** DAX device, non-NULL if DAX is supported */
 	struct dax_device *dax_dev;
+
+	/*
+	 * DAX Window Free Ranges
+	 */
+	long nr_free_ranges;
+	struct list_head free_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index beac337ccc10..b82eb61d63cc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -23,6 +23,8 @@
 #include <linux/exportfs.h>
 #include <linux/posix_acl.h>
 #include <linux/pid_namespace.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
 
 MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -620,6 +622,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+	struct fuse_dax_mapping *range, *temp;
+
+	/* Free All allocated elements */
+	list_for_each_entry_safe(range, temp, mem_list, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+				   struct dax_device *dax_dev)
+{
+	long nr_pages, nr_ranges;
+	void *kaddr;
+	pfn_t pfn;
+	struct fuse_dax_mapping *range;
+	LIST_HEAD(mem_ranges);
+	phys_addr_t phys_addr;
+	int ret = 0, id;
+	size_t dax_size = -1;
+	unsigned long i;
+
+	id = dax_read_lock();
+	nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), &kaddr,
+					&pfn);
+	dax_read_unlock(id);
+	if (nr_pages < 0) {
+		pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+		return nr_pages;
+	}
+
+	phys_addr = pfn_t_to_phys(pfn);
+	nr_ranges = nr_pages/FUSE_DAX_PAGES;
+	pr_info("fuse_dax_mem_range_init(): dax mapped %ld pages. nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+	for (i = 0; i < nr_ranges; i++) {
+		range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+		if (!range) {
+			pr_debug("memory allocation for mem_range failed.\n");
+			ret = -ENOMEM;
+			goto out_err;
+		}
+		/* TODO: This offset only works if the virtio-fs driver does
+		 * not have some memory hidden at the beginning. This needs
+		 * better handling.
+		 */
+		range->window_offset = i * FUSE_DAX_SZ;
+		range->length = FUSE_DAX_SZ;
+		list_add_tail(&range->list, &mem_ranges);
+	}
+
+	list_replace_init(&mem_ranges, &fc->free_ranges);
+	fc->nr_free_ranges = nr_ranges;
+	return 0;
+out_err:
+	/* Free All allocated elements */
+	fuse_free_dax_mem_ranges(&mem_ranges);
+	return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+					  struct dax_device *dax_dev)
+{
+	return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 		    const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
@@ -647,6 +719,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
+	INIT_LIST_HEAD(&fc->free_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -655,6 +728,8 @@ void fuse_conn_put(struct fuse_conn *fc)
 	if (refcount_dec_and_test(&fc->count)) {
 		struct fuse_iqueue *fiq = &fc->iq;
 
+		if (fc->dax_dev)
+			fuse_free_dax_mem_ranges(&fc->free_ranges);
 		if (fiq->ops->release)
 			fiq->ops->release(fiq);
 		put_pid_ns(fc->pid_ns);
@@ -1179,11 +1254,19 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	if (sb->s_user_ns != &init_user_ns)
 		sb->s_xattr = fuse_no_acl_xattr_handlers;
 
+	if (ctx->dax_dev) {
+		err = fuse_dax_mem_range_init(fc, ctx->dax_dev);
+		if (err) {
+			pr_debug("fuse_dax_mem_range_init() returned %d\n", err);
+			goto err_free_ranges;
+		}
+	}
+
 	if (ctx->fudptr) {
 		err = -ENOMEM;
 		fud = fuse_dev_alloc_install(fc);
 		if (!fud)
-			goto err;
+			goto err_free_ranges;
 	}
 
 	fc->dev = sb->s_dev;
@@ -1242,6 +1325,9 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
  err_dev_free:
 	if (fud)
 		fuse_dev_free(fud);
+ err_free_ranges:
+	if (ctx->dax_dev)
+		fuse_free_dax_mem_ranges(&fc->free_ranges);
  err:
 	return err;
 }
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index e54696f778a7..c44c7bf7bba2 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -747,6 +747,8 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	phys_addr_t offset = PFN_PHYS(pgoff);
 	size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
 
+	pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%zu\n", nr_pages, max_nr_pages);
+
 	if (kaddr)
 		*kaddr = fs->window_kaddr + offset;
 	if (pfn)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 11/20] fuse: implement FUSE_INIT map_alignment field
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (9 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 10/20] fuse,virtiofs: Keep a list of free dax memory ranges Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 12/20] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field.  Parse this field and
ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are
already 2MB aligned.  Just check the value when the connection is
established.  If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.

The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
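
For intuition, the compatibility check reduces to: is the fixed DAX range
size a multiple of the byte alignment the device advertises? A minimal
standalone sketch of just that check (the constant mirrors FUSE_DAX_SZ from
the diff below; the helper name and test values are illustrative only):

  #include <stdio.h>

  #define FUSE_DAX_SZ (2 * 1024 * 1024)  /* default DAX range size, 2MB */

  /* map_alignment carries log2(byte alignment), as in fuse_init_out */
  static int map_alignment_ok(unsigned int map_alignment)
  {
          unsigned long align = 1UL << map_alignment;

          return (FUSE_DAX_SZ % align) == 0;
  }

  int main(void)
  {
          printf("4KB  (log2=12): %s\n", map_alignment_ok(12) ? "ok" : "bad");
          printf("64KB (log2=16): %s\n", map_alignment_ok(16) ? "ok" : "bad");
          printf("4MB  (log2=22): %s\n", map_alignment_ok(22) ? "ok" : "bad");
          return 0;
  }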

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/fuse_i.h          |  5 ++++-
 fs/fuse/inode.c           | 19 +++++++++++++++++--
 include/uapi/linux/fuse.h |  4 +++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 478c940b05b4..4a46e35222c7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,7 +47,10 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size.  A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
 #define FUSE_DAX_SZ	(2*1024*1024)
 #define FUSE_DAX_SHIFT	(21)
 #define FUSE_DAX_PAGES	(FUSE_DAX_SZ/PAGE_SIZE)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b82eb61d63cc..9b690456de30 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
 {
 	struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
 	struct fuse_init_out *arg = &ia->out;
+	bool ok = true;
 
 	if (error || arg->major != FUSE_KERNEL_VERSION)
-		fc->conn_error = 1;
+		ok = false;
 	else {
 		unsigned long ra_pages;
 
@@ -1045,6 +1046,14 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
 					min_t(unsigned int, FUSE_MAX_MAX_PAGES,
 					max_t(unsigned int, arg->max_pages, 1));
 			}
+			if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+			    (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
+				printk(KERN_ERR "FUSE: map_alignment %u"
+				       " incompatible with dax mem range size"
+				       " %u\n", arg->map_alignment,
+				       FUSE_DAX_SZ);
+				ok = false;
+			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1060,6 +1069,11 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
 	}
 	kfree(ia);
 
+	if (!ok) {
+		fc->conn_init = 0;
+		fc->conn_error = 1;
+	}
+
 	fuse_set_initialized(fc);
 	wake_up_all(&fc->blocked_waitq);
 }
@@ -1082,7 +1096,8 @@ void fuse_send_init(struct fuse_conn *fc)
 		FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
 		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+		FUSE_MAP_ALIGNMENT;
 	ia->args.opcode = FUSE_INIT;
 	ia->args.in_numargs = 1;
 	ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..5b85819e045f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -313,7 +313,9 @@ struct fuse_file_lock {
  * FUSE_CACHE_SYMLINKS: cache READLINK responses
  * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
  * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
- * FUSE_MAP_ALIGNMENT: map_alignment field is valid
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
+ *		       foffset and moffset fields in struct
+ *		       fuse_setupmapping_out and fuse_removemapping_one.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 12/20] fuse: Introduce setupmapping/removemapping commands
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (10 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 11/20] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 13/20] fuse, dax: Implement dax read/write operations Vivek Goyal
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Peng Tao

Introduce two new fuse commands to set up and remove memory mappings. These
will be used to set up and tear down file mappings in the DAX window.
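
As an illustration of how a client might lay out these requests, here is a
hypothetical userspace-style sketch. The struct and flag are local copies of
the uapi additions in the diff below; the fill helper itself is made up for
illustration:

  #include <stdint.h>
  #include <string.h>

  #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)

  /* Local copy of the new wire struct (see diff below) */
  struct fuse_setupmapping_in {
          uint64_t fh;      /* an already open file handle */
          uint64_t foffset; /* offset into the file to map from */
          uint64_t len;     /* length of mapping required */
          uint64_t flags;   /* FUSE_SETUPMAPPING_FLAG_* */
          uint64_t moffset; /* offset in the memory window */
  };

  /* Hypothetical helper: request a writable 2MB mapping */
  static void fill_setupmapping(struct fuse_setupmapping_in *in, uint64_t fh,
                                uint64_t foffset, uint64_t moffset)
  {
          memset(in, 0, sizeof(*in));
          in->fh = fh;
          in->foffset = foffset;
          in->len = 2 * 1024 * 1024;
          in->flags = FUSE_SETUPMAPPING_FLAG_WRITE;
          in->moffset = moffset;
  }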

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 include/uapi/linux/fuse.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5b85819e045f..4c386116978a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -894,4 +894,33 @@ struct fuse_copy_file_range_in {
 	uint64_t	flags;
 };
 
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+	/* An already open handle */
+	uint64_t	fh;
+	/* Offset into the file to start the mapping */
+	uint64_t	foffset;
+	/* Length of mapping required */
+	uint64_t	len;
+	/* Flags, FUSE_SETUPMAPPING_FLAG_* */
+	uint64_t	flags;
+	/* Offset in Memory Window */
+	uint64_t	moffset;
+};
+
+struct fuse_removemapping_in {
+	/* number of fuse_removemapping_one follows */
+	uint32_t        count;
+};
+
+struct fuse_removemapping_one {
+	/* Offset into the dax window to start the unmapping */
+	uint64_t	moffset;
+	/* Length of mapping to remove */
+	uint64_t	len;
+};
+
+#define FUSE_REMOVEMAPPING_MAX_ENTRY   \
+		(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
 #endif /* _LINUX_FUSE_H */
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 13/20] fuse, dax: Implement dax read/write operations
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (11 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 12/20] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10 22:06   ` Dave Chinner
  2020-08-07 19:55 ` [PATCH v2 14/20] fuse,dax: add DAX mmap support Vivek Goyal
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Miklos Szeredi, Liu Bo, Peng Tao

This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch implements the
read/write paths.

We make use of an interval tree to keep track of per-inode dax mappings.

Do not use dax for file extending writes, instead just send a WRITE message
to the daemon (like we do for the direct I/O path). This keeps the write and
i_size change atomic w.r.t. crash.
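
The "file extending write" policy in the last paragraph boils down to a
simple predicate. A standalone restatement (simplified from
file_extending_write() in the diff below; the typedef is a stand-in for the
kernel type):

  #include <stdbool.h>
  #include <stddef.h>

  typedef long long loff_t;  /* stand-in for the kernel type */

  /* A write is "extending" if any written byte lands at or past i_size */
  static bool write_extends_file(loff_t pos, size_t count, loff_t i_size)
  {
          return pos >= i_size || pos + (loff_t)count > i_size;
  }

Extending writes go through the regular FUSE WRITE path, everything else
through dax_iomap_rw().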

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
---
 fs/fuse/file.c            | 550 +++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          |  26 ++
 fs/fuse/inode.c           |   6 +
 include/uapi/linux/fuse.h |   1 +
 4 files changed, 577 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 83d917f7e542..194fe3e404a7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -19,6 +19,9 @@
 #include <linux/falloc.h>
 #include <linux/uio.h>
 #include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/interval_tree_generic.h>
 
 static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
 				      struct fuse_page_desc **desc)
@@ -188,6 +191,228 @@ static void fuse_link_write_file(struct file *file)
 	spin_unlock(&fi->lock);
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+	struct fuse_dax_mapping *dmap = NULL;
+
+	spin_lock(&fc->lock);
+
+	if (fc->nr_free_ranges <= 0) {
+		spin_unlock(&fc->lock);
+		return NULL;
+	}
+
+	WARN_ON(list_empty(&fc->free_ranges));
+
+	/* Take a free range */
+	dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+					list);
+	list_del_init(&dmap->list);
+	fc->nr_free_ranges--;
+	spin_unlock(&fc->lock);
+	return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __dmap_add_to_free_pool(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	list_add_tail(&dmap->list, &fc->free_ranges);
+	fc->nr_free_ranges++;
+}
+
+static void dmap_add_to_free_pool(struct fuse_conn *fc,
+				struct fuse_dax_mapping *dmap)
+{
+	/* Return fuse_dax_mapping to free list */
+	spin_lock(&fc->lock);
+	__dmap_add_to_free_pool(fc, dmap);
+	spin_unlock(&fc->lock);
+}
+
+static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
+				  struct fuse_dax_mapping *dmap, bool writable,
+				  bool upgrade)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_setupmapping_in inarg;
+	loff_t offset = start_idx << FUSE_DAX_SHIFT;
+	FUSE_ARGS(args);
+	ssize_t err;
+
+	WARN_ON(fc->nr_free_ranges < 0);
+
+	/* Ask fuse daemon to setup mapping */
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.foffset = offset;
+	inarg.fh = -1;
+	inarg.moffset = dmap->window_offset;
+	inarg.len = FUSE_DAX_SZ;
+	inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+	if (writable)
+		inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+	args.opcode = FUSE_SETUPMAPPING;
+	args.nodeid = fi->nodeid;
+	args.in_numargs = 1;
+	args.in_args[0].size = sizeof(inarg);
+	args.in_args[0].value = &inarg;
+	err = fuse_simple_request(fc, &args);
+	if (err < 0)
+		return err;
+	dmap->writable = writable;
+	if (!upgrade) {
+		dmap->itn.start = dmap->itn.last = start_idx;
+		/* Protected by fi->i_dmap_sem */
+		interval_tree_insert(&dmap->itn, &fi->dmap_tree);
+		fi->nr_dmaps++;
+	}
+	return 0;
+}
+
+static int
+fuse_send_removemapping(struct inode *inode,
+			struct fuse_removemapping_in *inargp,
+			struct fuse_removemapping_one *remove_one)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	FUSE_ARGS(args);
+
+	args.opcode = FUSE_REMOVEMAPPING;
+	args.nodeid = fi->nodeid;
+	args.in_numargs = 2;
+	args.in_args[0].size = sizeof(*inargp);
+	args.in_args[0].value = inargp;
+	args.in_args[1].size = inargp->count * sizeof(*remove_one);
+	args.in_args[1].value = remove_one;
+	return fuse_simple_request(fc, &args);
+}
+
+static int dmap_removemapping_list(struct inode *inode, unsigned num,
+				   struct list_head *to_remove)
+{
+	struct fuse_removemapping_one *remove_one, *ptr;
+	struct fuse_removemapping_in inarg;
+	struct fuse_dax_mapping *dmap;
+	int ret = 0, i = 0, nr_alloc;
+
+	nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
+	remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
+	if (!remove_one)
+		return -ENOMEM;
+
+	ptr = remove_one;
+	list_for_each_entry(dmap, to_remove, list) {
+		ptr->moffset = dmap->window_offset;
+		ptr->len = dmap->length;
+		ptr++;
+		i++;
+		num--;
+		if (i >= nr_alloc || num == 0) {
+			memset(&inarg, 0, sizeof(inarg));
+			inarg.count = i;
+			ret = fuse_send_removemapping(inode, &inarg,
+						      remove_one);
+			if (ret)
+				goto out;
+			ptr = remove_one;
+			i = 0;
+		}
+	}
+out:
+	kfree(remove_one);
+	return ret;
+}
+
+/*
+ * Cleanup dmap entry and add back to free list. This should be called with
+ * fc->lock held.
+ */
+static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
+					    struct fuse_dax_mapping *dmap)
+{
+	pr_debug("fuse: freeing memory range start_idx=0x%lx end_idx=0x%lx "
+		 "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
+		 dmap->itn.last, dmap->window_offset, dmap->length);
+	dmap->itn.start = dmap->itn.last = 0;
+	__dmap_add_to_free_pool(fc, dmap);
+}
+
+/*
+ * Free inode dmap entries whose range falls inside [start, end].
+ * Does not take any locks. At this point of time it should only be
+ * called from evict_inode() path where we know all dmap entries can be
+ * reclaimed.
+ */
+static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
+				      loff_t start, loff_t end)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap, *n;
+	int err, num = 0;
+	LIST_HEAD(to_remove);
+	unsigned long start_idx = start >> FUSE_DAX_SHIFT;
+	unsigned long end_idx = end >> FUSE_DAX_SHIFT;
+	struct interval_tree_node *node;
+
+	while (1) {
+		node = interval_tree_iter_first(&fi->dmap_tree, start_idx,
+						end_idx);
+		if (!node)
+			break;
+		dmap = node_to_dmap(node);
+		interval_tree_remove(&dmap->itn, &fi->dmap_tree);
+		num++;
+		list_add(&dmap->list, &to_remove);
+	}
+
+	/* Nothing to remove */
+	if (list_empty(&to_remove))
+		return;
+
+	WARN_ON(fi->nr_dmaps < num);
+	fi->nr_dmaps -= num;
+	/*
+	 * During umount/shutdown, fuse connection is dropped first
+	 * and evict_inode() is called later. That means any
+	 * removemapping messages are going to fail. Send messages
+	 * only if connection is up. Otherwise fuse daemon is
+	 * responsible for cleaning up any leftover references and
+	 * mappings.
+	 */
+	if (fc->connected) {
+		err = dmap_removemapping_list(inode, num, &to_remove);
+		if (err) {
+			pr_warn("Failed to remove mappings. start=0x%llx"
+				" end=0x%llx\n", start, end);
+		}
+	}
+	spin_lock(&fc->lock);
+	list_for_each_entry_safe(dmap, n, &to_remove, list) {
+		list_del_init(&dmap->list);
+		dmap_reinit_add_to_free_pool(fc, dmap);
+	}
+	spin_unlock(&fc->lock);
+}
+
+/*
+ * It is called from evict_inode() and by that time inode is going away. So
+ * this function does not take any locks like fi->i_dmap_sem for traversing
+ * the fuse inode interval tree. If that lock is taken then the lock
+ * validator complains of a deadlock situation w.r.t. the fs_reclaim lock.
+ */
+void fuse_cleanup_inode_mappings(struct inode *inode)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	/*
+	 * fuse_evict_inode() has already called truncate_inode_pages_final()
+	 * before we arrive here. So we should not have to worry about
+	 * any pages/exception entries still associated with the inode.
+	 */
+	inode_reclaim_dmap_range(fc, inode, 0, -1);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1535,32 +1760,335 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return res;
 }
 
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
+	struct inode *inode = file->f_mapping->host;
 
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
-		return fuse_cache_read_iter(iocb, to);
-	else
+	if (IS_DAX(inode))
+		return fuse_dax_read_iter(iocb, to);
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
 		return fuse_direct_read_iter(iocb, to);
+
+	return fuse_cache_read_iter(iocb, to);
 }
 
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
 static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct fuse_file *ff = file->private_data;
+	struct inode *inode = file->f_mapping->host;
 
 	if (is_bad_inode(file_inode(file)))
 		return -EIO;
 
-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
-		return fuse_cache_write_iter(iocb, from);
-	else
+	if (IS_DAX(inode))
+		return fuse_dax_write_iter(iocb, from);
+
+	if (ff->open_flags & FOPEN_DIRECT_IO)
 		return fuse_direct_write_iter(iocb, from);
+
+	return fuse_cache_write_iter(iocb, from);
+}
+
+static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
+{
+	iomap->addr = IOMAP_NULL_ADDR;
+	iomap->length = length;
+	iomap->type = IOMAP_HOLE;
+}
+
+static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
+			struct iomap *iomap, struct fuse_dax_mapping *dmap,
+			unsigned flags)
+{
+	loff_t offset, len;
+	loff_t i_size = i_size_read(inode);
+
+	offset = pos - (dmap->itn.start << FUSE_DAX_SHIFT);
+	len = min(length, dmap->length - offset);
+
+	/* If length is beyond end of file, truncate further */
+	if (pos + len > i_size)
+		len = i_size - pos;
+
+	if (len > 0) {
+		iomap->addr = dmap->window_offset + offset;
+		iomap->length = len;
+		if (flags & IOMAP_FAULT)
+			iomap->length = ALIGN(len, PAGE_SIZE);
+		iomap->type = IOMAP_MAPPED;
+	} else {
+		/* Mapping beyond end of file is hole */
+		fuse_fill_iomap_hole(iomap, length);
+	}
+}
+
+static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
+					 loff_t length, unsigned flags,
+					 struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
+	int ret;
+	bool writable = flags & IOMAP_WRITE;
+	unsigned long start_idx = pos >> FUSE_DAX_SHIFT;
+	struct interval_tree_node *node;
+
+	alloc_dmap = alloc_dax_mapping(fc);
+	if (!alloc_dmap)
+		return -EBUSY;
+
+	/*
+	 * Take write lock so that only one caller can try to setup mapping
+	 * and other waits.
+	 */
+	down_write(&fi->i_dmap_sem);
+	/*
+	 * We dropped lock. Check again if somebody else setup
+	 * mapping already.
+	 */
+	node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+	if (node) {
+		dmap = node_to_dmap(node);
+		fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+		dmap_add_to_free_pool(fc, alloc_dmap);
+		up_write(&fi->i_dmap_sem);
+		return 0;
+	}
+
+	/* Setup one mapping */
+	ret = fuse_setup_one_mapping(inode, pos >> FUSE_DAX_SHIFT, alloc_dmap,
+				     writable, false);
+	if (ret < 0) {
+		dmap_add_to_free_pool(fc, alloc_dmap);
+		up_write(&fi->i_dmap_sem);
+		return ret;
+	}
+	fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
+	up_write(&fi->i_dmap_sem);
+	return 0;
+}
+
+static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
+					 loff_t length, unsigned flags,
+					 struct iomap *iomap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	int ret;
+	unsigned long idx = pos >> FUSE_DAX_SHIFT;
+	struct interval_tree_node *node;
+
+	/*
+	 * Take exclusive lock so that only one caller can try to setup
+	 * mapping and others wait.
+	 */
+	down_write(&fi->i_dmap_sem);
+	node = interval_tree_iter_first(&fi->dmap_tree, idx, idx);
+
+	/* We are holding either inode lock or i_mmap_sem, and that should
+	 * ensure that dmap can't be reclaimed or truncated and it should
+	 * still be there in the tree despite the fact we dropped and
+	 * re-acquired the lock.
+	 */
+	ret = -EIO;
+	if (WARN_ON(!node))
+		goto out_err;
+
+	dmap = node_to_dmap(node);
+
+	/* Maybe another thread already upgraded mapping while we were not
+	 * holding lock.
+	 */
+	if (dmap->writable) {
+		ret = 0;
+		goto out_fill_iomap;
+	}
+
+	ret = fuse_setup_one_mapping(inode, pos >> FUSE_DAX_SHIFT, dmap, true,
+				     true);
+	if (ret < 0)
+		goto out_err;
+out_fill_iomap:
+	fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+out_err:
+	up_write(&fi->i_dmap_sem);
+	return ret;
+}
+
+/* This is just for DAX and the mapping is ephemeral, do not use it for other
+ * purposes since there is no block device with a permanent mapping.
+ */
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+			    unsigned flags, struct iomap *iomap,
+			    struct iomap *srcmap)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_dax_mapping *dmap;
+	bool writable = flags & IOMAP_WRITE;
+	unsigned long start_idx = pos >> FUSE_DAX_SHIFT;
+	struct interval_tree_node *node;
+
+	/* We don't support FIEMAP */
+	BUG_ON(flags & IOMAP_REPORT);
+
+	iomap->offset = pos;
+	iomap->flags = 0;
+	iomap->bdev = NULL;
+	iomap->dax_dev = fc->dax_dev;
+
+	/*
+	 * Both the read/write and mmap paths can race here. So we need to make
+	 * sure that if one path is setting up a mapping, the other path waits.
+	 *
+	 * For now, use a semaphore for this. It probably needs to be
+	 * optimized later.
+	 */
+	down_read(&fi->i_dmap_sem);
+	node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+	if (node) {
+		dmap = node_to_dmap(node);
+		if (writable && !dmap->writable) {
+			/* Upgrade read-only mapping to read-write. This will
+			 * require exclusive i_dmap_sem lock as we don't want
+			 * two threads to be trying to do this simultaneously
+			 * for the same dmap. So drop the shared lock and
+			 * acquire the exclusive lock.
+			 */
+			up_read(&fi->i_dmap_sem);
+			pr_debug("%s: Upgrading mapping at offset 0x%llx"
+				 " length 0x%llx\n", __func__, pos, length);
+			return iomap_begin_upgrade_mapping(inode, pos, length,
+							   flags, iomap);
+		} else {
+			fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+			up_read(&fi->i_dmap_sem);
+			return 0;
+		}
+	} else {
+		up_read(&fi->i_dmap_sem);
+		pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
+				__func__, pos, length);
+		if (pos >= i_size_read(inode))
+			goto iomap_hole;
+
+		return iomap_begin_setup_new_mapping(inode, pos, length, flags,
+						     iomap);
+	}
+
+	/*
+	 * If a read beyond end of file happens, fs code seems to return
+	 * it as a hole.
+	 */
+iomap_hole:
+	fuse_fill_iomap_hole(iomap, length);
+	pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
+	return 0;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+			  ssize_t written, unsigned flags,
+			  struct iomap *iomap)
+{
+	/* DAX writes beyond end-of-file aren't handled using iomap, so the
+	 * file size is unchanged and there is nothing to do here.
+	 */
+	return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+	.iomap_begin = fuse_iomap_begin,
+	.iomap_end = fuse_iomap_end,
+};
+
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock_shared(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock_shared(inode);
+	}
+
+	ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
+	inode_unlock_shared(inode);
+
+	/* TODO file_accessed(iocb->ki_filp) */
+	return ret;
+}
+
+static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return (iov_iter_rw(from) == WRITE &&
+		((iocb->ki_pos) >= i_size_read(inode) ||
+		  (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode))));
+}
+
+static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
+	ssize_t ret;
+
+	ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+	if (ret < 0)
+		return ret;
+
+	fuse_invalidate_attr(inode);
+	fuse_write_update_size(inode, iocb->ki_pos);
+	return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+	} else {
+		inode_lock(inode);
+	}
+
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	ret = file_remove_privs(iocb->ki_filp);
+	if (ret)
+		goto out;
+	/* TODO file_update_time() but we don't want metadata I/O */
+
+	/* Do not use dax for file extending writes as write and
+	 * on-disk i_size increase are not atomic otherwise.
+	 */
+	if (file_extending_write(iocb, from))
+		ret = fuse_dax_direct_write(iocb, from);
+	else
+		ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+
+out:
+	inode_unlock(inode);
+
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
 }
 
 static void fuse_writepage_free(struct fuse_writepage_args *wpa)
@@ -2335,6 +2863,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	return -EINVAL; /* TODO */
+}
+
 static int convert_fuse_file_lock(struct fuse_conn *fc,
 				  const struct fuse_file_lock *ffl,
 				  struct file_lock *fl)
@@ -3431,6 +3964,7 @@ static const struct address_space_operations fuse_file_aops  = {
 void fuse_init_file_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	inode->i_fop = &fuse_file_operations;
 	inode->i_data.a_ops = &fuse_file_aops;
@@ -3440,4 +3974,8 @@ void fuse_init_file_inode(struct inode *inode)
 	fi->writectr = 0;
 	init_waitqueue_head(&fi->page_waitq);
 	fi->writepages = RB_ROOT;
+	fi->dmap_tree = RB_ROOT_CACHED;
+
+	if (fc->dax_dev)
+		inode->i_flags |= S_DAX;
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4a46e35222c7..22fb01ba55fb 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -31,6 +31,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
 #include <linux/user_namespace.h>
+#include <linux/interval_tree.h>
 
 /** Default max number of pages that can be used in a single read request */
 #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
@@ -76,11 +77,17 @@ struct fuse_dax_mapping {
 	/* Will connect in fc->free_ranges to keep track of free memory */
 	struct list_head list;
 
+	/* For interval tree in file/inode */
+	struct interval_tree_node itn;
+
 	/** Position in DAX window */
 	u64 window_offset;
 
 	/** Length of mapping, in bytes */
 	loff_t length;
+
+	/* Is this mapping read-only or read-write */
+	bool writable;
 };
 
 /** FUSE inode */
@@ -168,6 +175,15 @@ struct fuse_inode {
 
 	/** Lock to protect write related fields */
 	spinlock_t lock;
+
+	/*
+	 * Semaphore to protect modifications to dmap_tree
+	 */
+	struct rw_semaphore i_dmap_sem;
+
+	/** Sorted rb tree of struct fuse_dax_mapping elements */
+	struct rb_root_cached dmap_tree;
+	unsigned long nr_dmaps;
 };
 
 /** FUSE inode state bits */
@@ -1131,5 +1147,15 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
 void fuse_free_conn(struct fuse_conn *fc);
+void fuse_cleanup_inode_mappings(struct inode *inode);
+
+static inline struct fuse_dax_mapping *
+node_to_dmap(struct interval_tree_node *node)
+{
+	if (!node)
+		return NULL;
+
+	return container_of(node, struct fuse_dax_mapping, itn);
+}
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9b690456de30..41edc377a3df 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -86,7 +86,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->attr_version = 0;
 	fi->orig_ino = 0;
 	fi->state = 0;
+	fi->nr_dmaps = 0;
 	mutex_init(&fi->mutex);
+	init_rwsem(&fi->i_dmap_sem);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
 	if (!fi->forget) {
@@ -114,6 +116,10 @@ static void fuse_evict_inode(struct inode *inode)
 	clear_inode(inode);
 	if (inode->i_sb->s_flags & SB_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
+		if (IS_DAX(inode)) {
+			fuse_cleanup_inode_mappings(inode);
+			WARN_ON(fi->nr_dmaps);
+		}
 		fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
 		fi->forget = NULL;
 	}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4c386116978a..7f74a7f98e37 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -895,6 +895,7 @@ struct fuse_copy_file_range_in {
 };
 
 #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
 struct fuse_setupmapping_in {
 	/* An already open handle */
 	uint64_t	fh;
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 14/20] fuse,dax: add DAX mmap support
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (12 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 13/20] fuse, dax: Implement dax read/write operations Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

From: Stefan Hajnoczi <stefanha@redhat.com>

Add DAX mmap() support.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 194fe3e404a7..be7d90eb5b41 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2841,10 +2841,15 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 	.page_mkwrite	= fuse_page_mkwrite,
 };
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct fuse_file *ff = file->private_data;
 
+	/* DAX mmap is superior to direct_io mmap */
+	if (IS_DAX(file_inode(file)))
+		return fuse_dax_mmap(file, vma);
+
 	if (ff->open_flags & FOPEN_DIRECT_IO) {
 		/* Can't provide the coherency needed for MAP_SHARED */
 		if (vma->vm_flags & VM_MAYSHARE)
@@ -2863,9 +2868,63 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
+				   enum page_entry_size pe_size, bool write)
+{
+	vm_fault_t ret;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	pfn_t pfn;
+
+	if (write)
+		sb_start_pagefault(sb);
+
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+	if (write)
+		sb_end_pagefault(sb);
+
+	return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+				vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+			       enum page_entry_size pe_size)
+{
+	return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+	.fault		= fuse_dax_fault,
+	.huge_fault	= fuse_dax_huge_fault,
+	.page_mkwrite	= fuse_dax_page_mkwrite,
+	.pfn_mkwrite	= fuse_dax_pfn_mkwrite,
+};
+
 static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	return -EINVAL; /* TODO */
+	file_accessed(file);
+	vma->vm_ops = &fuse_dax_vm_ops;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	return 0;
 }
 
 static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3938,6 +3997,7 @@ static const struct file_operations fuse_file_operations = {
 	.release	= fuse_release,
 	.fsync		= fuse_fsync,
 	.lock		= fuse_file_lock,
+	.get_unmapped_area = thp_get_unmapped_area,
 	.flock		= fuse_file_flock,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (13 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 14/20] fuse,dax: add DAX mmap support Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10 22:22   ` Dave Chinner
  2020-08-07 19:55 ` [PATCH v2 16/20] fuse,virtiofs: Define dax address space operations Vivek Goyal
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

We need some kind of locking mechanism here. Normal filesystems like
ext4 and xfs seem to take their own semaphore to protect against
truncate while a fault is going on.

We have an additional requirement to protect against fuse dax memory range
reclaim. When a range has been selected for reclaim, we need to make sure
no other read/write/fault can try to access that memory range while
reclaim is in progress. Once reclaim is complete, the lock will be released
and read/write/fault will trigger allocation of a fresh dax range.

Taking inode_lock() is not an option in the fault path as lockdep complains
about circular dependencies. So define a new fuse_inode->i_mmap_sem.
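
Roughly, the new semaphore is taken shared by page faults and exclusive by
truncate/punch-hole, so an exclusive holder excludes all faults. A
kernel-style sketch under that assumption (handle_dax_fault() is a
hypothetical stand-in, not a real function):

  static vm_fault_t dax_fault_locked(struct fuse_inode *fi,
                                     struct vm_fault *vmf)
  {
          vm_fault_t ret;

          down_read(&fi->i_mmap_sem);   /* many faults may run concurrently */
          ret = handle_dax_fault(vmf);  /* hypothetical fault handler */
          up_read(&fi->i_mmap_sem);
          return ret;
  }

  static void dax_truncate_locked(struct fuse_inode *fi, struct inode *inode,
                                  loff_t size)
  {
          down_write(&fi->i_mmap_sem);  /* excludes all faults above */
          truncate_pagecache(inode, size);
          up_write(&fi->i_mmap_sem);
  }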

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/dir.c    |  2 ++
 fs/fuse/file.c   | 15 ++++++++++++---
 fs/fuse/fuse_i.h |  7 +++++++
 fs/fuse/inode.c  |  1 +
 4 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..f40766c0693b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 	 */
 	if ((is_truncate || !is_wb) &&
 	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+		down_write(&fi->i_mmap_sem);
 		truncate_pagecache(inode, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
+		up_write(&fi->i_mmap_sem);
 	}
 
 	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index be7d90eb5b41..00ad27216cc3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
 
 	if (write)
 		sb_start_pagefault(sb);
-
+	/*
+	 * We need to serialize against not only truncate but also against
+	 * fuse dax memory range reclaim. While a range is being reclaimed,
+	 * we do not want any read/write/mmap to make progress and try
+	 * to populate page cache or access memory we are trying to free.
+	 */
+	down_read(&get_fuse_inode(inode)->i_mmap_sem);
 	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+	up_read(&get_fuse_inode(inode)->i_mmap_sem);
 
 	if (write)
 		sb_end_pagefault(sb);
@@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 			file_update_time(file);
 	}
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		down_write(&fi->i_mmap_sem);
 		truncate_pagecache_range(inode, offset, offset + length - 1);
-
+		up_write(&fi->i_mmap_sem);
+	}
 	fuse_invalidate_attr(inode);
 
 out:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 22fb01ba55fb..1ddf526330a5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -181,6 +181,13 @@ struct fuse_inode {
 	 */
 	struct rw_semaphore i_dmap_sem;
 
+	/**
+	 * Can't take inode lock in fault path (leads to circular dependency).
+	 * So take this in fuse dax fault path to make sure truncate and
+	 * punch hole etc. can't make progress in parallel.
+	 */
+	struct rw_semaphore i_mmap_sem;
+
 	/** Sorted rb tree of struct fuse_dax_mapping elements */
 	struct rb_root_cached dmap_tree;
 	unsigned long nr_dmaps;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 41edc377a3df..4bd965d0ecf6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -88,6 +88,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
 	fi->state = 0;
 	fi->nr_dmaps = 0;
 	mutex_init(&fi->mutex);
+	init_rwsem(&fi->i_mmap_sem);
 	init_rwsem(&fi->i_dmap_sem);
 	spin_lock_init(&fi->lock);
 	fi->forget = fuse_alloc_forget();
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 16/20] fuse,virtiofs: Define dax address space operations
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (14 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 17/20] fuse,virtiofs: Maintain a list of busy elements Vivek Goyal
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need to
be written back.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 00ad27216cc3..54708cb24da0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2669,6 +2669,16 @@ static int fuse_writepages_fill(struct page *page,
 	return err;
 }
 
+static int fuse_dax_writepages(struct address_space *mapping,
+				struct writeback_control *wbc)
+{
+
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	return dax_writeback_mapping_range(mapping, fc->dax_dev, wbc);
+}
+
 static int fuse_writepages(struct address_space *mapping,
 			   struct writeback_control *wbc)
 {
@@ -4030,6 +4040,13 @@ static const struct address_space_operations fuse_file_aops  = {
 	.write_end	= fuse_write_end,
 };
 
+static const struct address_space_operations fuse_dax_file_aops  = {
+	.writepages	= fuse_dax_writepages,
+	.direct_IO	= noop_direct_IO,
+	.set_page_dirty	= noop_set_page_dirty,
+	.invalidatepage	= noop_invalidatepage,
+};
+
 void fuse_init_file_inode(struct inode *inode)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
@@ -4045,6 +4062,8 @@ void fuse_init_file_inode(struct inode *inode)
 	fi->writepages = RB_ROOT;
 	fi->dmap_tree = RB_ROOT_CACHED;
 
-	if (fc->dax_dev)
+	if (fc->dax_dev) {
 		inode->i_flags |= S_DAX;
+		inode->i_data.a_ops = &fuse_dax_file_aops;
+	}
 }
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 17/20] fuse,virtiofs: Maintain a list of busy elements
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (15 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 16/20] fuse,virtiofs: Define dax address space operations Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 18/20] fuse: Release file in process context Vivek Goyal
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

This list will be used for selecting a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.
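
Since ranges are appended to the busy list as they are mapped, the list head
approximates the oldest mapping. A hypothetical selection sketch (the real
reclaim logic, added later in the series, also checks whether a range is
currently in use):

  static struct fuse_dax_mapping *pick_reclaim_victim(struct fuse_conn *fc)
  {
          struct fuse_dax_mapping *dmap = NULL;

          spin_lock(&fc->lock);
          if (!list_empty(&fc->busy_ranges))
                  dmap = list_first_entry(&fc->busy_ranges,
                                          struct fuse_dax_mapping, busy_list);
          spin_unlock(&fc->lock);
          return dmap;
  }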

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c   | 22 ++++++++++++++++++++++
 fs/fuse/fuse_i.h |  7 +++++++
 fs/fuse/inode.c  |  4 ++++
 3 files changed, 33 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 54708cb24da0..ecd2a42f6e30 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -213,6 +213,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+				    struct fuse_dax_mapping *dmap)
+{
+	list_del_init(&dmap->busy_list);
+	WARN_ON(fc->nr_busy_ranges == 0);
+	fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+				  struct fuse_dax_mapping *dmap)
+{
+	spin_lock(&fc->lock);
+	__dmap_remove_busy_list(fc, dmap);
+	spin_unlock(&fc->lock);
+}
+
 /* This assumes fc->lock is held */
 static void __dmap_add_to_free_pool(struct fuse_conn *fc,
 				struct fuse_dax_mapping *dmap)
@@ -266,6 +283,10 @@ static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
 		/* Protected by fi->i_dmap_sem */
 		interval_tree_insert(&dmap->itn, &fi->dmap_tree);
 		fi->nr_dmaps++;
+		spin_lock(&fc->lock);
+		list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+		fc->nr_busy_ranges++;
+		spin_unlock(&fc->lock);
 	}
 	return 0;
 }
@@ -335,6 +356,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
 	pr_debug("fuse: freeing memory range start_idx=0x%lx end_idx=0x%lx "
 		 "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
 		 dmap->itn.last, dmap->window_offset, dmap->length);
+	__dmap_remove_busy_list(fc, dmap);
 	dmap->itn.start = dmap->itn.last = 0;
 	__dmap_add_to_free_pool(fc, dmap);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1ddf526330a5..f84ec9c661ab 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -80,6 +80,9 @@ struct fuse_dax_mapping {
 	/* For interval tree in file/inode */
 	struct interval_tree_node itn;
 
+	/* Will connect in fc->busy_ranges to keep track of busy memory */
+	struct list_head busy_list;
+
 	/** Position in DAX window */
 	u64 window_offset;
 
@@ -812,6 +815,10 @@ struct fuse_conn {
 	/** DAX device, non-NULL if DAX is supported */
 	struct dax_device *dax_dev;
 
+	/* List of memory ranges which are busy */
+	unsigned long nr_busy_ranges;
+	struct list_head busy_ranges;
+
 	/*
 	 * DAX Window Free Ranges
 	 */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4bd965d0ecf6..cfc04c5eda73 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -636,6 +636,8 @@ static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
 	/* Free All allocated elements */
 	list_for_each_entry_safe(range, temp, mem_list, list) {
 		list_del(&range->list);
+		if (!list_empty(&range->busy_list))
+			list_del(&range->busy_list);
 		kfree(range);
 	}
 }
@@ -680,6 +682,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 		 */
 		range->window_offset = i * FUSE_DAX_SZ;
 		range->length = FUSE_DAX_SZ;
+		INIT_LIST_HEAD(&range->busy_list);
 		list_add_tail(&range->list, &mem_ranges);
 	}
 
@@ -727,6 +730,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->user_ns = get_user_ns(user_ns);
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	INIT_LIST_HEAD(&fc->free_ranges);
+	INIT_LIST_HEAD(&fc->busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 18/20] fuse: Release file in process context
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (16 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 17/20] fuse,virtiofs: Maintain a list of busy elements Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10  8:29   ` Miklos Szeredi
  2020-08-07 19:55 ` [PATCH v2 19/20] fuse: Take inode lock for dax inode truncation Vivek Goyal
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

fuse_file_put(sync) can be called with sync=true/false. If sync=true,
it waits for the release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for the release request
response, frees the fuse_file struct immediately, and the req->end function
does the iput().

iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has already let go of its
reference), then iput() will clean up DAX mappings as well, send
REMOVEMAPPING requests and wait for completion. (All this happens in the
worker thread context which is processing fuse replies from the daemon on
the host.)

That means it blocks the worker thread, which then stops processing
further replies, and the system deadlocks.

So for now, force a sync release of the file in case of DAX inodes.
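
To illustrate the hazard (names below are made up, not fuse internals): the
thread that processes replies must never itself end up waiting for a reply.

  static void reply_worker_loop(struct fuse_conn *fc)
  {
          for (;;) {
                  void *req = dequeue_next_reply(fc);  /* hypothetical */

                  /* Runs the request's ->end callback; for the last file
                   * reference that callback ends up calling iput(). */
                  finish_request(fc, req);             /* hypothetical */

                  /*
                   * For a DAX inode, iput() sends REMOVEMAPPING and blocks
                   * until the reply arrives -- but only this loop dequeues
                   * replies, and it is parked inside finish_request().
                   * Deadlock.
                   */
          }
  }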

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ecd2a42f6e30..605976a586c2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -527,6 +527,7 @@ void fuse_release_common(struct file *file, bool isdir)
 	struct fuse_file *ff = file->private_data;
 	struct fuse_release_args *ra = ff->release_args;
 	int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+	bool sync = false;
 
 	fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
@@ -546,8 +547,19 @@ void fuse_release_common(struct file *file, bool isdir)
 	 * Make the release synchronous if this is a fuseblk mount,
 	 * synchronous RELEASE is allowed (and desirable) in this case
 	 * because the server can be trusted not to screw up.
+	 *
+	 * For DAX, the fuse server is trusted. So it should be fine to
+	 * do a sync file put. Doing an async file put is creating
+	 * problems right now because when the request finishes, iput()
+	 * can lead to freeing of the inode. That means it tears down
+	 * mappings backing DAX memory and sends a REMOVEMAPPING message
+	 * to the server and blocks for completion. Currently, waiting
+	 * in req->end context deadlocks the system as the same worker
+	 * thread can't process the REMOVEMAPPING reply it is waiting for.
 	 */
-	fuse_file_put(ff, ff->fc->destroy, isdir);
+	if (IS_DAX(file_inode(file)) || ff->fc->destroy)
+		sync = true;
+	fuse_file_put(ff, sync, isdir);
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 19/20] fuse: Take inode lock for dax inode truncation
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (17 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 18/20] fuse: Release file in process context Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-07 19:55 ` [PATCH v2 20/20] fuse,virtiofs: Add logic to free up a memory range Vivek Goyal
  2020-08-10  7:29 ` [PATCH v2 00/20] virtiofs: Add DAX support Miklos Szeredi
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs; +Cc: vgoyal, miklos, stefanha, dgilbert

When a file is opened with O_TRUNC, we need to make sure that any other
DAX operation is not in progress. DAX expects i_size to be stable.

In fuse_iomap_begin() we check i_size at multiple places and we expect
i_size to not change.

Another problem is, if we set up a mapping in fuse_iomap_begin(), and the
file gets truncated and a dax read/write happens, KVM currently hangs.
It tries to fault in a page which does not exist on the host (the file got
truncated). It probably requires fixing in KVM.

So for now, take the inode lock. Once KVM is fixed, we might have to
have a look at it again.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 605976a586c2..f103355bf71f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -467,7 +467,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
 	int err;
 	bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
 			  fc->atomic_o_trunc &&
-			  fc->writeback_cache;
+			  (fc->writeback_cache || IS_DAX(inode));
 
 	err = generic_file_open(inode, file);
 	if (err)
-- 
2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 20/20] fuse,virtiofs: Add logic to free up a memory range
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (18 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 19/20] fuse: Take inode lock for dax inode truncation Vivek Goyal
@ 2020-08-07 19:55 ` Vivek Goyal
  2020-08-10  7:29 ` [PATCH v2 00/20] virtiofs: Add DAX support Miklos Szeredi
  20 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-07 19:55 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, virtio-fs
  Cc: vgoyal, miklos, stefanha, dgilbert, Liu Bo

Add logic to free up a busy memory range. Freed memory ranges will be
returned to the free pool. Add a worker which can be started to select
and free some busy memory ranges.

A process can also steal one of its busy dax ranges if a free range is not
available. I will refer to this as direct reclaim.

If a free range is not available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.

For reclaiming a range, as of now we need to hold the following locks in
the specified order.

	down_write(&fi->i_mmap_sem);
	down_write(&fi->i_dmap_sem);

We look for a free range in the following order.

A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free (sketched below)
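
A condensed sketch of that order (alloc_dax_mapping() is from an earlier
patch; inline_reclaim_one_dmap() is a hypothetical stand-in for the direct
reclaim code this patch adds):

  static struct fuse_dax_mapping *
  get_dax_range(struct fuse_conn *fc, struct inode *inode)
  {
          struct fuse_dax_mapping *dmap;

          for (;;) {
                  dmap = alloc_dax_mapping(fc);  /* A: free pool */
                  if (dmap)
                          return dmap;

                  /* B: steal one of this inode's own busy ranges */
                  dmap = inline_reclaim_one_dmap(fc, inode);
                  if (dmap)
                          return dmap;

                  /* C: wait for the free worker to return a range, retry */
                  wait_event(fc->dax_range_waitq, fc->nr_free_ranges > 0);
          }
  }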

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/file.c      | 525 +++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h    |  25 +++
 fs/fuse/inode.c     |   4 +
 fs/fuse/virtio_fs.c |   5 +
 4 files changed, 551 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f103355bf71f..edc9c89e4cf2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -8,6 +8,7 @@
 
 #include "fuse_i.h"
 
+#include <linux/delay.h>
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
@@ -35,6 +36,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
 	return pages;
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+							struct inode *inode);
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
 {
@@ -191,6 +194,26 @@ static void fuse_link_write_file(struct file *file)
 	spin_unlock(&fi->lock);
 }
 
+static void
+__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+	unsigned long free_threshold;
+
+	/* If the number of free ranges is below the threshold, start reclaim */
+	free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+				(unsigned long)1);
+	if (fc->nr_free_ranges < free_threshold)
+		queue_delayed_work(system_long_wq, &fc->dax_free_work,
+				   msecs_to_jiffies(delay_ms));
+}
+
+static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+	spin_lock(&fc->lock);
+	__kick_dmap_free_worker(fc, delay_ms);
+	spin_unlock(&fc->lock);
+}
+
 static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 {
 	struct fuse_dax_mapping *dmap = NULL;
@@ -199,7 +222,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 
 	if (fc->nr_free_ranges <= 0) {
 		spin_unlock(&fc->lock);
-		return NULL;
+		goto out_kick;
 	}
 
 	WARN_ON(list_empty(&fc->free_ranges));
@@ -210,6 +233,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 	list_del_init(&dmap->list);
 	fc->nr_free_ranges--;
 	spin_unlock(&fc->lock);
+
+out_kick:
+	kick_dmap_free_worker(fc, 0);
 	return dmap;
 }
 
@@ -236,6 +262,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
 {
 	list_add_tail(&dmap->list, &fc->free_ranges);
 	fc->nr_free_ranges++;
+	wake_up(&fc->dax_range_waitq);
 }
 
 static void dmap_add_to_free_pool(struct fuse_conn *fc,
@@ -279,6 +306,12 @@ static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
 		return err;
 	dmap->writable = writable;
 	if (!upgrade) {
+		/*
+		 * We don't take a reference on the inode. The inode is valid
+		 * right now and when the inode goes away, cleanup logic
+		 * should first clean up the dmap entries.
+		 */
+		dmap->inode = inode;
 		dmap->itn.start = dmap->itn.last = start_idx;
 		/* Protected by fi->i_dmap_sem */
 		interval_tree_insert(&dmap->itn, &fi->dmap_tree);
@@ -357,6 +390,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
 		 "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
 		 dmap->itn.last, dmap->window_offset, dmap->length);
 	__dmap_remove_busy_list(fc, dmap);
+	dmap->inode = NULL;
 	dmap->itn.start = dmap->itn.last = 0;
 	__dmap_add_to_free_pool(fc, dmap);
 }
@@ -384,6 +418,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
 		if (!node)
 			break;
 		dmap = node_to_dmap(node);
+		/* inode is going away. There should not be any users of dmap */
+		WARN_ON(refcount_read(&dmap->refcnt) > 1);
 		interval_tree_remove(&dmap->itn, &fi->dmap_tree);
 		num++;
 		list_add(&dmap->list, &to_remove);
@@ -418,6 +454,21 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
 	spin_unlock(&fc->lock);
 }
 
+static int dmap_removemapping_one(struct inode *inode,
+				  struct fuse_dax_mapping *dmap)
+{
+	struct fuse_removemapping_one forget_one;
+	struct fuse_removemapping_in inarg;
+
+	memset(&inarg, 0, sizeof(inarg));
+	inarg.count = 1;
+	memset(&forget_one, 0, sizeof(forget_one));
+	forget_one.moffset = dmap->window_offset;
+	forget_one.len = dmap->length;
+
+	return fuse_send_removemapping(inode, &inarg, &forget_one);
+}
+
 /*
  * It is called from evict_inode() and by that time inode is going away. So
  * this function does not take any locks like fi->i_dmap_sem for traversing
@@ -1859,6 +1910,16 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
 		if (flags & IOMAP_FAULT)
 			iomap->length = ALIGN(len, PAGE_SIZE);
 		iomap->type = IOMAP_MAPPED;
+		/*
+		 * Increase refcnt so that reclaim code knows this dmap is in
+		 * use. This assumes the i_dmap_sem lock is held either
+		 * shared or exclusive.
+		 */
+		refcount_inc(&dmap->refcnt);
+
+		/* iomap->private should be NULL */
+		WARN_ON_ONCE(iomap->private);
+		iomap->private = dmap;
 	} else {
 		/* Mapping beyond end of file is hole */
 		fuse_fill_iomap_hole(iomap, length);
@@ -1877,8 +1938,28 @@ static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
 	unsigned long start_idx = pos >> FUSE_DAX_SHIFT;
 	struct interval_tree_node *node;
 
-	alloc_dmap = alloc_dax_mapping(fc);
-	if (!alloc_dmap)
+	/*
+	 * Can't do inline reclaim in fault path. We call
+	 * dax_layout_busy_page() before we free a range. And
+	 * fuse_wait_dax_page() drops the fi->i_mmap_sem lock and reacquires it.
+	 * In fault path we enter with fi->i_mmap_sem held and can't drop
+	 * it. Also in fault path we hold fi->i_mmap_sem shared and not
+	 * exclusive, so that creates further issues with fuse_wait_dax_page().
+	 * Hence return -EAGAIN and fuse_dax_fault() will wait for a memory
+	 * range to become free and retry.
+	 */
+	if (flags & IOMAP_FAULT) {
+		alloc_dmap = alloc_dax_mapping(fc);
+		if (!alloc_dmap)
+			return -EAGAIN;
+	} else {
+		alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+		if (IS_ERR(alloc_dmap))
+			return PTR_ERR(alloc_dmap);
+	}
+
+	/* If we are here, we should have memory allocated */
+	if (WARN_ON(!alloc_dmap))
 		return -EBUSY;
 
 	/*
@@ -1930,16 +2011,26 @@ static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
 	node = interval_tree_iter_first(&fi->dmap_tree, idx, idx);
 
 	/* We are holding either inode lock or i_mmap_sem, and that should
-	 * ensure that dmap can't reclaimed or truncated and it should still
-	 * be there in tree despite the fact we dropped and re-acquired the
-	 * lock.
+	 * ensure that dmap can't be truncated. We are holding a reference
+	 * on dmap and that should make sure it can't be reclaimed. So dmap
+	 * should still be there in tree despite the fact we dropped and
+	 * re-acquired the i_dmap_sem lock.
 	 */
 	ret = -EIO;
 	if (WARN_ON(!node))
 		goto out_err;
-
 	dmap = node_to_dmap(node);
 
+	/* We took an extra reference on dmap to make sure it's not reclaimed.
+	 * Now we hold i_dmap_sem lock and that reference is not needed
+	 * anymore. Drop it.
+	 */
+	if (refcount_dec_and_test(&dmap->refcnt)) {
+		/* refcount should not hit 0. This object only goes
+		 * away when fuse connection goes away */
+		WARN_ON_ONCE(1);
+	}
+
 	/* Maybe another thread already upgraded mapping while we were not
 	 * holding lock.
 	 */
@@ -1998,7 +2089,11 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 			 * two threads trying to do this simultaneously
 			 * for the same dmap. So drop shared lock and acquire
 			 * exclusive lock.
+			 *
+			 * Before dropping i_dmap_sem lock, take a reference
+			 * on dmap so that it's not freed by range reclaim.
 			 */
+			refcount_inc(&dmap->refcnt);
 			up_read(&fi->i_dmap_sem);
 			pr_debug("%s: Upgrading mapping at offset 0x%llx"
 				 " length 0x%llx\n", __func__, pos, length);
@@ -2034,6 +2129,16 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 			  ssize_t written, unsigned flags,
 			  struct iomap *iomap)
 {
+	struct fuse_dax_mapping *dmap = iomap->private;
+
+	if (dmap) {
+		if (refcount_dec_and_test(&dmap->refcnt)) {
+			/* refcount should not hit 0. This object only goes
+			 * away when fuse connection goes away */
+			WARN_ON_ONCE(1);
+		}
+	}
+
 	/* DAX writes beyond end-of-file aren't handled using iomap, so the
 	 * file size is unchanged and there is nothing to do here.
 	 */
@@ -2919,9 +3024,17 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct super_block *sb = inode->i_sb;
 	pfn_t pfn;
+	int error = 0;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	bool retry = false;
 
 	if (write)
 		sb_start_pagefault(sb);
+
+retry:
+	if (retry && !(fc->nr_free_ranges > 0))
+		wait_event(fc->dax_range_waitq, (fc->nr_free_ranges > 0));
+
 	/*
 	 * We need to serialize against not only truncate but also against
 	 * fuse dax memory range reclaim. While a range is being reclaimed,
@@ -2929,7 +3042,13 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
 	 * to populate page cache or access memory we are trying to free.
 	 */
 	down_read(&get_fuse_inode(inode)->i_mmap_sem);
-	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+	ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops);
+	if ((ret & VM_FAULT_ERROR) && error == -EAGAIN) {
+		error = 0;
+		retry = true;
+		up_read(&get_fuse_inode(inode)->i_mmap_sem);
+		goto retry;
+	}
 
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
@@ -4101,3 +4220,393 @@ void fuse_init_file_inode(struct inode *inode)
 		inode->i_data.a_ops = &fuse_dax_file_aops;
 	}
 }
+
+static int dmap_writeback_invalidate(struct inode *inode,
+				     struct fuse_dax_mapping *dmap)
+{
+	int ret;
+	loff_t start_pos = dmap->itn.start << FUSE_DAX_SHIFT;
+	loff_t end_pos = (start_pos + FUSE_DAX_SZ - 1);
+
+	ret = filemap_fdatawrite_range(inode->i_mapping, start_pos, end_pos);
+	if (ret) {
+		pr_debug("fuse: filemap_fdatawrite_range() failed. err=%d"
+		         " start_pos=0x%llx, end_pos=0x%llx\n", ret, start_pos,
+		         end_pos);
+		return ret;
+	}
+
+	ret = invalidate_inode_pages2_range(inode->i_mapping,
+					    start_pos >> PAGE_SHIFT,
+					    end_pos >> PAGE_SHIFT);
+	if (ret)
+		pr_debug("fuse: invalidate_inode_pages2_range() failed err=%d\n"
+			 , ret);
+
+	return ret;
+}
+
+static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
+				   struct fuse_dax_mapping *dmap)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	/*
+	 * igrab() was done to make sure the inode won't go away under us, and
+	 * this further avoids the race with evict().
+	 */
+	ret = dmap_writeback_invalidate(inode, dmap);
+	if (ret)
+		return ret;
+
+	/* Remove dax mapping from inode interval tree now */
+	interval_tree_remove(&dmap->itn, &fi->dmap_tree);
+	fi->nr_dmaps--;
+
+	/* It is possible that umount/shutdown has killed the fuse connection
+	 * and the worker thread is trying to reclaim memory in parallel. So
+	 * check if the connection is still up; otherwise don't send the
+	 * removemapping message.
+	 */
+	if (fc->connected) {
+		ret = dmap_removemapping_one(inode, dmap);
+		if (ret) {
+			pr_warn("Failed to remove mapping. offset=0x%llx"
+				" len=0x%llx ret=%d\n", dmap->window_offset,
+				dmap->length, ret);
+		}
+	}
+	return 0;
+}
+
+static void fuse_wait_dax_page(struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	up_write(&fi->i_mmap_sem);
+	schedule();
+	down_write(&fi->i_mmap_sem);
+}
+
+/* Should be called with fi->i_mmap_sem lock held exclusively */
+static int __fuse_break_dax_layouts(struct inode *inode, bool *retry,
+				    loff_t start, loff_t end)
+{
+	struct page *page;
+
+	page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+	if (!page)
+		return 0;
+
+	*retry = true;
+	return ___wait_var_event(&page->_refcount,
+			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+			0, 0, fuse_wait_dax_page(inode));
+}
+
+/* dmap_end == 0 leads to unmapping of whole file */
+static int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start,
+				  u64 dmap_end)
+{
+	bool	retry;
+	int	ret;
+
+	do {
+		retry = false;
+		ret = __fuse_break_dax_layouts(inode, &retry, dmap_start,
+					       dmap_end);
+	} while (ret == 0 && retry);
+
+	return ret;
+}
+
+/* Find first mapped dmap for an inode which is not in use and return it.
+ * Caller needs to hold fi->i_dmap_sem lock either shared or exclusive. */
+static struct fuse_dax_mapping *inode_lookup_first_dmap(struct fuse_conn *fc,
+							struct inode *inode)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	struct interval_tree_node *node;
+
+	for (node = interval_tree_iter_first(&fi->dmap_tree, 0, -1); node;
+	     node = interval_tree_iter_next(node, 0, -1)) {
+		dmap = node_to_dmap(node);
+		/* still in use. */
+		if (refcount_read(&dmap->refcnt) > 1)
+			continue;
+
+		return dmap;
+	}
+
+	return NULL;
+}
+
+/*
+ * Find first mapping in the tree and free it and return it. Do not add
+ * it back to free pool.
+ */
+static struct fuse_dax_mapping *
+inode_inline_reclaim_one_dmap(struct fuse_conn *fc, struct inode *inode,
+			      bool *retry)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	u64 dmap_start, dmap_end;
+	unsigned long start_idx;
+	int ret;
+	struct interval_tree_node *node;
+
+	down_write(&fi->i_mmap_sem);
+
+	/* Lookup a dmap and corresponding file offset to reclaim. */
+	down_read(&fi->i_dmap_sem);
+	dmap = inode_lookup_first_dmap(fc, inode);
+	if (dmap) {
+		start_idx = dmap->itn.start;
+		dmap_start = start_idx << FUSE_DAX_SHIFT;
+		dmap_end = dmap_start + FUSE_DAX_SZ - 1;
+	}
+	up_read(&fi->i_dmap_sem);
+
+	if (!dmap)
+		goto out_mmap_sem;
+	/*
+	 * Make sure there are no references to inode pages using
+	 * get_user_pages()
+	 */
+	ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
+	if (ret) {
+		pr_debug("fuse: fuse_break_dax_layouts() failed. err=%d\n",
+			 ret);
+		dmap = ERR_PTR(ret);
+		goto out_mmap_sem;
+	}
+
+	down_write(&fi->i_dmap_sem);
+	node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+	/* Range already got reclaimed by somebody else */
+	if (!node) {
+		if (retry)
+			*retry = true;
+		goto out_write_dmap_sem;
+	}
+
+	dmap = node_to_dmap(node);
+	/* still in use. */
+	if (refcount_read(&dmap->refcnt) > 1) {
+		dmap = NULL;
+		if (retry)
+			*retry = true;
+		goto out_write_dmap_sem;
+	}
+
+	ret = reclaim_one_dmap_locked(fc, inode, dmap);
+	if (ret < 0) {
+		dmap = ERR_PTR(ret);
+		goto out_write_dmap_sem;
+	}
+
+	/* Clean up dmap. Do not add back to free list */
+	dmap_remove_busy_list(fc, dmap);
+	dmap->inode = NULL;
+	dmap->itn.start = dmap->itn.last = 0;
+
+	pr_debug("fuse: %s: inline reclaimed memory range. inode=%px,"
+		 " window_offset=0x%llx, length=0x%llx\n", __func__,
+		 inode, dmap->window_offset, dmap->length);
+
+out_write_dmap_sem:
+	up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+	up_write(&fi->i_mmap_sem);
+	return dmap;
+}
+
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+					struct inode *inode)
+{
+	struct fuse_dax_mapping *dmap;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
	while (1) {
+		bool retry = false;
+
+		dmap = alloc_dax_mapping(fc);
+		if (dmap)
+			return dmap;
+
+		dmap = inode_inline_reclaim_one_dmap(fc, inode, &retry);
+		/*
+		 * Either we got a mapping or it is an error; return in both
+		 * cases.
+		 */
+		if (dmap)
+			return dmap;
+
+		/* If we could not reclaim a mapping because it
+		 * had a reference or some other temporary failure,
+		 * try again. We want to give up inline reclaim only
+		 * if there is no range assigned to this inode. Otherwise
+		 * a deadlock is possible if we sleep with fi->i_mmap_sem
+		 * held and the worker freeing memory can't make progress
+		 * due to unavailability of the fi->i_mmap_sem lock. So
+		 * sleep only if fi->nr_dmaps == 0.
+		 */
+		if (retry)
+			continue;
+		/*
+		 * There are no mappings which can be reclaimed. Wait for one.
+		 * We are not holding fi->i_dmap_sem, so it is possible
+		 * that a range gets added now. But as we are not holding
+		 * fi->i_mmap_sem, the worker should still be able to free up
+		 * a range and wake us up.
+		 */
+		if (!fi->nr_dmaps && !(fc->nr_free_ranges > 0)) {
+			if (wait_event_killable_exclusive(fc->dax_range_waitq,
+					(fc->nr_free_ranges > 0))) {
+				return ERR_PTR(-EINTR);
+			}
+		}
+	}
+}
+
+static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
+					  struct inode *inode,
+					  unsigned long start_idx)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_dax_mapping *dmap;
+	struct interval_tree_node *node;
+
+	/* Find the fuse dax mapping at this file offset in the inode. */
+	node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+
+	/* Range already got cleaned up by somebody else */
+	if (!node)
+		return 0;
+	dmap = node_to_dmap(node);
+
+	/* still in use. */
+	if (refcount_read(&dmap->refcnt) > 1)
+		return 0;
+
+	ret = reclaim_one_dmap_locked(fc, inode, dmap);
+	if (ret < 0)
+		return ret;
+
+	/* Cleanup dmap entry and add back to free list */
+	spin_lock(&fc->lock);
+	dmap_reinit_add_to_free_pool(fc, dmap);
+	spin_unlock(&fc->lock);
+	return ret;
+}
+
+/*
+ * Free a range of memory.
+ * Locking:
+ * 1. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 2. Take fuse_inode->i_dmap_sem to protect interval tree and also to make
+ *    sure read/write can not reuse a dmap which we might be freeing.
+ */
+static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
+				   unsigned long start_idx,
+				   unsigned long end_idx)
+{
+	int ret;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t dmap_start = start_idx << FUSE_DAX_SHIFT;
+	loff_t dmap_end = (dmap_start + FUSE_DAX_SZ) - 1;
+
+	down_write(&fi->i_mmap_sem);
+	ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
+	if (ret) {
+		pr_debug("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
+			 ret);
+		goto out_mmap_sem;
+	}
+
+	down_write(&fi->i_dmap_sem);
+	ret = lookup_and_reclaim_dmap_locked(fc, inode, start_idx);
+	up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+	up_write(&fi->i_mmap_sem);
+	return ret;
+}
+
+static int try_to_free_dmap_chunks(struct fuse_conn *fc,
+				   unsigned long nr_to_free)
+{
+	struct fuse_dax_mapping *dmap, *pos, *temp;
+	int ret, nr_freed = 0;
+	unsigned long start_idx = 0, end_idx = 0;
+	u64 window_offset = 0;
+	struct inode *inode = NULL;
+
+	/* Pick the first busy range and free it for now */
+	while (1) {
+		if (nr_freed >= nr_to_free)
+			break;
+
+		dmap = NULL;
+		spin_lock(&fc->lock);
+
+		if (!fc->nr_busy_ranges) {
+			spin_unlock(&fc->lock);
+			return 0;
+		}
+
+		list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
+						busy_list) {
+			/* skip this range if it's in use. */
+			if (refcount_read(&pos->refcnt) > 1)
+				continue;
+
+			inode = igrab(pos->inode);
+			/*
+			 * This inode is going away. That will free
+			 * up all the ranges anyway, continue to
+			 * next range.
+			 */
+			if (!inode)
+				continue;
+			/*
+			 * Take this element off the list and add it to the
+			 * tail. If this element can't be freed, it will help
+			 * with selecting a new element in the next iteration.
+			 */
+			dmap = pos;
+			list_move_tail(&dmap->busy_list, &fc->busy_ranges);
+			start_idx = end_idx = dmap->itn.start;
+			window_offset = dmap->window_offset;
+			break;
+		}
+		spin_unlock(&fc->lock);
+		if (!dmap)
+			return 0;
+
+		ret = lookup_and_reclaim_dmap(fc, inode, start_idx, end_idx);
+		iput(inode);
+		if (ret)
+			return ret;
+		nr_freed++;
+	}
+	return 0;
+}
+
+void fuse_dax_free_mem_worker(struct work_struct *work)
+{
+	int ret;
+	struct fuse_conn *fc = container_of(work, struct fuse_conn,
+						dax_free_work.work);
+	ret = try_to_free_dmap_chunks(fc, FUSE_DAX_RECLAIM_CHUNK);
+	if (ret) {
+		pr_debug("fuse: try_to_free_dmap_chunks() failed with err=%d\n",
+			 ret);
+	}
+
+	/* If the number of free ranges is still below the threshold, requeue */
+	kick_dmap_free_worker(fc, 1);
+}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f84ec9c661ab..be8bb438d1c9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -56,6 +56,16 @@
 #define FUSE_DAX_SHIFT	(21)
 #define FUSE_DAX_PAGES	(FUSE_DAX_SZ/PAGE_SIZE)
 
+/* Number of ranges reclaimer will try to free in one invocation */
+#define FUSE_DAX_RECLAIM_CHUNK		(10)
+
+/*
+ * DAX memory reclaim threshold in percentage of total ranges. When the
+ * number of free ranges drops below this threshold, reclaim can trigger.
+ * Default is 20%.
+ */
+#define FUSE_DAX_RECLAIM_THRESHOLD	(20)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -74,6 +84,9 @@ struct fuse_forget_link {
 
 /** Translation information for file offsets to DAX window offsets */
 struct fuse_dax_mapping {
+	/* Pointer to inode where this memory range is mapped */
+	struct inode *inode;
+
 	/* Will connect in fc->free_ranges to keep track of free memory */
 	struct list_head list;
 
@@ -91,6 +104,9 @@ struct fuse_dax_mapping {
 
 	/* Is this mapping read-only or read-write */
 	bool writable;
+
+	/* Reference count. Elevated while the mapping is in use by dax iomap. */
+	refcount_t refcnt;
 };
 
 /** FUSE inode */
@@ -819,11 +835,19 @@ struct fuse_conn {
 	unsigned long nr_busy_ranges;
 	struct list_head busy_ranges;
 
+	/* Worker to free up memory ranges */
+	struct delayed_work dax_free_work;
+
+	/* Wait queue for a dax range to become free */
+	wait_queue_head_t dax_range_waitq;
+
 	/*
 	 * DAX Window Free Ranges
 	 */
 	long nr_free_ranges;
 	struct list_head free_ranges;
+
+	unsigned long nr_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1161,6 +1185,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
  */
 u64 fuse_get_unique(struct fuse_iqueue *fiq);
 void fuse_free_conn(struct fuse_conn *fc);
+void fuse_dax_free_mem_worker(struct work_struct *work);
 void fuse_cleanup_inode_mappings(struct inode *inode);
 
 static inline struct fuse_dax_mapping *
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index cfc04c5eda73..7ac83d145506 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -683,11 +683,13 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 		range->window_offset = i * FUSE_DAX_SZ;
 		range->length = FUSE_DAX_SZ;
 		INIT_LIST_HEAD(&range->busy_list);
+		refcount_set(&range->refcnt, 1);
 		list_add_tail(&range->list, &mem_ranges);
 	}
 
 	list_replace_init(&mem_ranges, &fc->free_ranges);
 	fc->nr_free_ranges = nr_ranges;
+	fc->nr_ranges = nr_ranges;
 	return 0;
 out_err:
 	/* Free All allocated elements */
@@ -712,6 +714,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	refcount_set(&fc->count, 1);
 	atomic_set(&fc->dev_count, 1);
 	init_waitqueue_head(&fc->blocked_waitq);
+	init_waitqueue_head(&fc->dax_range_waitq);
 	fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
@@ -731,6 +734,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
 	fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 	INIT_LIST_HEAD(&fc->free_ranges);
 	INIT_LIST_HEAD(&fc->busy_ranges);
+	INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index c44c7bf7bba2..4666dcb77a2e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1345,6 +1345,11 @@ static void virtio_kill_sb(struct super_block *sb)
 	vfs = fc->iq.priv;
 	fsvq = &vfs->vqs[VQ_HIPRIO];
 
+	/* Stop dax worker. Soon evict_inodes() will be called which will
+	 * free all memory ranges belonging to all inodes.
+	 */
+	cancel_delayed_work_sync(&fc->dax_free_work);
+
 	/* Stop forget queue. Soon destroy will be sent */
 	spin_lock(&fsvq->lock);
 	fsvq->connected = false;
-- 
2.25.4
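
A minimal, self-contained userspace model of the refcount protocol this
patch adds (refcount_inc() in fuse_iomap_begin(), the matching drop in
fuse_iomap_end(), and the "still in use" checks in the reclaim paths).
All names below are hypothetical; this is an illustration, not kernel code:

#include <stdio.h>

struct range {
	int refcnt;			/* 1 == idle, >1 == in use by iomap */
	unsigned long window_offset;
};

/* Models the "still in use" checks in the reclaim paths above. */
static int try_reclaim(struct range *r)
{
	if (r->refcnt > 1) {
		printf("skip range at 0x%lx (in use)\n", r->window_offset);
		return -1;
	}
	printf("reclaim range at 0x%lx\n", r->window_offset);
	return 0;
}

int main(void)
{
	/* busy: a read/write path did the equivalent of refcount_inc() */
	struct range busy = { .refcnt = 2, .window_offset = 0x200000 };
	struct range idle = { .refcnt = 1, .window_offset = 0x400000 };

	try_reclaim(&busy);
	try_reclaim(&idle);
	return 0;
}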


^ permalink raw reply	[flat|nested] 45+ messages in thread
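
For scale, a quick worked example of the reclaim threshold constants from
the fuse_i.h hunk above. It assumes FUSE_DAX_SZ is (1 << FUSE_DAX_SHIFT),
i.e. 2 MiB per range, and the 1 GiB window size is made up; a sketch, not
code from the series:

#include <stdio.h>

#define FUSE_DAX_SHIFT			21
#define FUSE_DAX_SZ			(1UL << FUSE_DAX_SHIFT)	/* assumed */
#define FUSE_DAX_RECLAIM_THRESHOLD	20			/* percent */

int main(void)
{
	unsigned long cache_size = 1UL << 30;	/* hypothetical 1 GiB window */
	unsigned long nr_ranges = cache_size / FUSE_DAX_SZ;
	unsigned long threshold = nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD / 100;

	printf("range size:  %lu MiB\n", FUSE_DAX_SZ >> 20);
	printf("nr_ranges:   %lu\n", nr_ranges);
	/* prints 102 for 512 ranges: the free worker requeues below this */
	printf("worker kicked when nr_free_ranges < %lu\n", threshold);
	return 0;
}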

* Re: [PATCH v2 00/20] virtiofs: Add DAX support
  2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
                   ` (19 preceding siblings ...)
  2020-08-07 19:55 ` [PATCH v2 20/20] fuse,virtiofs: Add logic to free up a memory range Vivek Goyal
@ 2020-08-10  7:29 ` Miklos Szeredi
  2020-08-10 13:08   ` Vivek Goyal
  20 siblings, 1 reply; 45+ messages in thread
From: Miklos Szeredi @ 2020-08-10  7:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs-list, Stefan Hajnoczi,
	Dr. David Alan Gilbert

On Fri, Aug 7, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>

> Most of the changes are limited to fuse/virtiofs. There are couple
> of changes needed in generic dax infrastructure and couple of changes
> in virtio to be able to access shared memory region.

So what's the plan for merging the different subsystems?  I can take
all that into the fuse tree, but would need ACKs from the respective
maintainers.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 18/20] fuse: Release file in process context
  2020-08-07 19:55 ` [PATCH v2 18/20] fuse: Release file in process context Vivek Goyal
@ 2020-08-10  8:29   ` Miklos Szeredi
  2020-08-10 15:48     ` Vivek Goyal
  2020-08-10 19:37     ` Vivek Goyal
  0 siblings, 2 replies; 45+ messages in thread
From: Miklos Szeredi @ 2020-08-10  8:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs-list, Stefan Hajnoczi,
	Dr. David Alan Gilbert

On Fri, Aug 7, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> fuse_file_put(sync) can be called with sync=true/false. If sync=true,
> it waits for release request response and then calls iput() in the
> caller's context. If sync=false, it does not wait for release request
> response, frees the fuse_file struct immediately and req->end function
> does the iput().
>
> iput() can be a problem with DAX if called in req->end context. If this
> is last reference to inode (VFS has let go its reference already), then
> iput() will clean DAX mappings as well and send REMOVEMAPPING requests
> and wait for completion. (All this in the worker thread context which is
> processing fuse replies from the daemon on the host.)
>
> That means it blocks worker thread and it stops processing further
> replies and system deadlocks.

Is this reasoning specific to DAX?  Seems to me this is a virtio-fs
specific issue.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/20] virtiofs: Add DAX support
  2020-08-10  7:29 ` [PATCH v2 00/20] virtiofs: Add DAX support Miklos Szeredi
@ 2020-08-10 13:08   ` Vivek Goyal
  0 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-10 13:08 UTC (permalink / raw)
  To: Miklos Szeredi, Dan Williams, Michael S. Tsirkin
  Cc: linux-fsdevel, linux-kernel, virtio-fs-list, Stefan Hajnoczi,
	Dr. David Alan Gilbert, virtualization

On Mon, Aug 10, 2020 at 09:29:47AM +0200, Miklos Szeredi wrote:
> On Fri, Aug 7, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> 
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> 
> So what's the plan for merging the different subsystems?  I can take
> all that into the fuse tree, but would need ACKs from the respective
> maintainers.

I am assuming for DAX patches we need ACK from Dan Williams and for
virtio patches we need ack from Michael S. Tsirkin.

Dan, Michael, can you please review the dax and virtio patches
respectively and if there are no concerns, please provide ACK. Or
suggest an alternative way of how these patches can be merged.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/20] virtio: Add get_shm_region method
  2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
@ 2020-08-10 13:47   ` Michael S. Tsirkin
  2020-08-10 13:54     ` Vivek Goyal
  2020-08-10 14:02   ` Michael S. Tsirkin
  2020-08-10 14:06   ` Michael S. Tsirkin
  2 siblings, 1 reply; 45+ messages in thread
From: Michael S. Tsirkin @ 2020-08-10 13:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kvm

On Fri, Aug 07, 2020 at 03:55:09PM -0400, Vivek Goyal wrote:
> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> Virtio defines 'shared memory regions' that provide a continuously
> shared region between the host and guest.
> 
> Provide a method to find a particular region on a device.
> 
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: "Michael S. Tsirkin" <mst@redhat.com>

I'm not sure why b4 doesn't pick up the rest of this
patchset. Where can I find it?


IIUC all this is 5.10 material, right?


> ---
>  include/linux/virtio_config.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> index bb4cc4910750..c859f000a751 100644
> --- a/include/linux/virtio_config.h
> +++ b/include/linux/virtio_config.h
> @@ -10,6 +10,11 @@
>  
>  struct irq_affinity;
>  
> +struct virtio_shm_region {
> +       u64 addr;
> +       u64 len;
> +};
> +
>  /**
>   * virtio_config_ops - operations for configuring a virtio device
>   * Note: Do not assume that a transport implements all of the operations
> @@ -65,6 +70,7 @@ struct irq_affinity;
>   *      the caller can then copy.
>   * @set_vq_affinity: set the affinity for a virtqueue (optional).
>   * @get_vq_affinity: get the affinity for a virtqueue (optional).
> + * @get_shm_region: get a shared memory region based on the index.
>   */
>  typedef void vq_callback_t(struct virtqueue *);
>  struct virtio_config_ops {
> @@ -88,6 +94,8 @@ struct virtio_config_ops {
>  			       const struct cpumask *cpu_mask);
>  	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
>  			int index);
> +	bool (*get_shm_region)(struct virtio_device *vdev,
> +			       struct virtio_shm_region *region, u8 id);
>  };
>  
>  /* If driver didn't advertise the feature, it will never appear. */
> @@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
>  	return 0;
>  }
>  
> +static inline
> +bool virtio_get_shm_region(struct virtio_device *vdev,
> +                         struct virtio_shm_region *region, u8 id)
> +{
> +	if (!vdev->config->get_shm_region)
> +		return false;
> +	return vdev->config->get_shm_region(vdev, region, id);
> +}
> +
>  static inline bool virtio_is_little_endian(struct virtio_device *vdev)
>  {
>  	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
> -- 
> 2.25.4


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/20] virtio: Add get_shm_region method
  2020-08-10 13:47   ` Michael S. Tsirkin
@ 2020-08-10 13:54     ` Vivek Goyal
  0 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-10 13:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kvm

On Mon, Aug 10, 2020 at 09:47:15AM -0400, Michael S. Tsirkin wrote:
> On Fri, Aug 07, 2020 at 03:55:09PM -0400, Vivek Goyal wrote:
> > From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > 
> > Virtio defines 'shared memory regions' that provide a continuously
> > shared region between the host and guest.
> > 
> > Provide a method to find a particular region on a device.
> > 
> > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Cc: kvm@vger.kernel.org
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> 
> I'm not sure why doesn't b4 pick up reset of this
> patchset. where can I find it?

What is b4? I think I might not have copied the right list. I cc'd
kvm, but probably I should have sent it to virtualization@lists.linux-foundation.org
instead?

> 
> 
> IIUC all this is 5.10 material, right?

I think so. It probably is too late for 5.9.

Thanks
Vivek

> 
> 
> > ---
> >  include/linux/virtio_config.h | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> > index bb4cc4910750..c859f000a751 100644
> > --- a/include/linux/virtio_config.h
> > +++ b/include/linux/virtio_config.h
> > @@ -10,6 +10,11 @@
> >  
> >  struct irq_affinity;
> >  
> > +struct virtio_shm_region {
> > +       u64 addr;
> > +       u64 len;
> > +};
> > +
> >  /**
> >   * virtio_config_ops - operations for configuring a virtio device
> >   * Note: Do not assume that a transport implements all of the operations
> > @@ -65,6 +70,7 @@ struct irq_affinity;
> >   *      the caller can then copy.
> >   * @set_vq_affinity: set the affinity for a virtqueue (optional).
> >   * @get_vq_affinity: get the affinity for a virtqueue (optional).
> > + * @get_shm_region: get a shared memory region based on the index.
> >   */
> >  typedef void vq_callback_t(struct virtqueue *);
> >  struct virtio_config_ops {
> > @@ -88,6 +94,8 @@ struct virtio_config_ops {
> >  			       const struct cpumask *cpu_mask);
> >  	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
> >  			int index);
> > +	bool (*get_shm_region)(struct virtio_device *vdev,
> > +			       struct virtio_shm_region *region, u8 id);
> >  };
> >  
> >  /* If driver didn't advertise the feature, it will never appear. */
> > @@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
> >  	return 0;
> >  }
> >  
> > +static inline
> > +bool virtio_get_shm_region(struct virtio_device *vdev,
> > +                         struct virtio_shm_region *region, u8 id)
> > +{
> > +	if (!vdev->config->get_shm_region)
> > +		return false;
> > +	return vdev->config->get_shm_region(vdev, region, id);
> > +}
> > +
> >  static inline bool virtio_is_little_endian(struct virtio_device *vdev)
> >  {
> >  	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
> > -- 
> > 2.25.4
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/20] virtio: Add get_shm_region method
  2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
  2020-08-10 13:47   ` Michael S. Tsirkin
@ 2020-08-10 14:02   ` Michael S. Tsirkin
  2020-08-10 14:06   ` Michael S. Tsirkin
  2 siblings, 0 replies; 45+ messages in thread
From: Michael S. Tsirkin @ 2020-08-10 14:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kvm

On Fri, Aug 07, 2020 at 03:55:09PM -0400, Vivek Goyal wrote:
> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> Virtio defines 'shared memory regions' that provide a continuously
> shared region between the host and guest.
> 
> Provide a method to find a particular region on a device.
> 
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: "Michael S. Tsirkin" <mst@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  include/linux/virtio_config.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> index bb4cc4910750..c859f000a751 100644
> --- a/include/linux/virtio_config.h
> +++ b/include/linux/virtio_config.h
> @@ -10,6 +10,11 @@
>  
>  struct irq_affinity;
>  
> +struct virtio_shm_region {
> +       u64 addr;
> +       u64 len;
> +};
> +
>  /**
>   * virtio_config_ops - operations for configuring a virtio device
>   * Note: Do not assume that a transport implements all of the operations
> @@ -65,6 +70,7 @@ struct irq_affinity;
>   *      the caller can then copy.
>   * @set_vq_affinity: set the affinity for a virtqueue (optional).
>   * @get_vq_affinity: get the affinity for a virtqueue (optional).
> + * @get_shm_region: get a shared memory region based on the index.
>   */
>  typedef void vq_callback_t(struct virtqueue *);
>  struct virtio_config_ops {
> @@ -88,6 +94,8 @@ struct virtio_config_ops {
>  			       const struct cpumask *cpu_mask);
>  	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
>  			int index);
> +	bool (*get_shm_region)(struct virtio_device *vdev,
> +			       struct virtio_shm_region *region, u8 id);
>  };
>  
>  /* If driver didn't advertise the feature, it will never appear. */
> @@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
>  	return 0;
>  }
>  
> +static inline
> +bool virtio_get_shm_region(struct virtio_device *vdev,
> +                         struct virtio_shm_region *region, u8 id)
> +{
> +	if (!vdev->config->get_shm_region)
> +		return false;
> +	return vdev->config->get_shm_region(vdev, region, id);
> +}
> +
>  static inline bool virtio_is_little_endian(struct virtio_device *vdev)
>  {
>  	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport
  2020-08-07 19:55 ` [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
@ 2020-08-10 14:03   ` Michael S. Tsirkin
  0 siblings, 0 replies; 45+ messages in thread
From: Michael S. Tsirkin @ 2020-08-10 14:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kvm

On Fri, Aug 07, 2020 at 03:55:11PM -0400, Vivek Goyal wrote:
> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> On MMIO a new set of registers is defined for finding SHM
> regions.  Add their definitions and use them to find the region.
> 
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: kvm@vger.kernel.org
> Cc: "Michael S. Tsirkin" <mst@redhat.com>


Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/virtio/virtio_mmio.c     | 32 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_mmio.h | 11 +++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index 627ac0487494..2697c492cf78 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -498,6 +498,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
>  	return vm_dev->pdev->name;
>  }
>  
> +static bool vm_get_shm_region(struct virtio_device *vdev,
> +			      struct virtio_shm_region *region, u8 id)
> +{
> +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> +	u64 len, addr;
> +
> +	/* Select the region we're interested in */
> +	writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
> +
> +	/* Read the region size */
> +	len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
> +	len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
> +
> +	region->len = len;
> +
> +	/* Check if region length is -1. If that's the case, the shared memory
> +	 * region does not exist and there is no need to proceed further.
> +	 */
> +	if (len == ~(u64)0) {
> +		return false;
> +	}
> +

It might make sense to validate that addr/len do not overlap.
Will be useful for PCI too.
Can be a patch on top.

> +	/* Read the region base address */
> +	addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
> +	addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
> +
> +	region->addr = addr;
> +
> +	return true;
> +}
> +
>  static const struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get		= vm_get,
>  	.set		= vm_set,
> @@ -510,6 +541,7 @@ static const struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get_features	= vm_get_features,
>  	.finalize_features = vm_finalize_features,
>  	.bus_name	= vm_bus_name,
> +	.get_shm_region = vm_get_shm_region,
>  };
>  
>  
> diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
> index c4b09689ab64..0650f91bea6c 100644
> --- a/include/uapi/linux/virtio_mmio.h
> +++ b/include/uapi/linux/virtio_mmio.h
> @@ -122,6 +122,17 @@
>  #define VIRTIO_MMIO_QUEUE_USED_LOW	0x0a0
>  #define VIRTIO_MMIO_QUEUE_USED_HIGH	0x0a4
>  
> +/* Shared memory region id */
> +#define VIRTIO_MMIO_SHM_SEL             0x0ac
> +
> +/* Shared memory region length, 64 bits in two halves */
> +#define VIRTIO_MMIO_SHM_LEN_LOW         0x0b0
> +#define VIRTIO_MMIO_SHM_LEN_HIGH        0x0b4
> +
> +/* Shared memory region base address, 64 bits in two halves */
> +#define VIRTIO_MMIO_SHM_BASE_LOW        0x0b8
> +#define VIRTIO_MMIO_SHM_BASE_HIGH       0x0bc
> +
>  /* Configuration atomicity value */
>  #define VIRTIO_MMIO_CONFIG_GENERATION	0x0fc
>  
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport
  2020-08-07 19:55 ` [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport Vivek Goyal
@ 2020-08-10 14:05   ` Michael S. Tsirkin
  2020-08-10 14:50     ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Michael S. Tsirkin @ 2020-08-10 14:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kbuild test robot, kvm

On Fri, Aug 07, 2020 at 03:55:10PM -0400, Vivek Goyal wrote:
> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> On PCI the shm regions are found using capability entries;
> find a region by searching for the capability.
> 
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: kbuild test robot <lkp@intel.com>
> Cc: kvm@vger.kernel.org
> Cc: "Michael S. Tsirkin" <mst@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>


> ---
>  drivers/virtio/virtio_pci_modern.c | 96 ++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_pci.h    | 11 +++-
>  2 files changed, 106 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index db93cedd262f..3fc0cd848fe9 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -444,6 +444,100 @@ static void del_vq(struct virtio_pci_vq_info *info)
>  	vring_del_virtqueue(vq);
>  }
>  
> +static int virtio_pci_find_shm_cap(struct pci_dev *dev, u8 required_id,
> +				   u8 *bar, u64 *offset, u64 *len)
> +{
> +	int pos;
> +
> +	for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); pos > 0;
> +	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> +		u8 type, cap_len, id;
> +		u32 tmp32;
> +		u64 res_offset, res_length;
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +							 cfg_type), &type);
> +		if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
> +			continue;
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +							 cap_len), &cap_len);
> +		if (cap_len != sizeof(struct virtio_pci_cap64)) {
> +			printk(KERN_ERR "%s: shm cap with bad size offset: %d"
> +			       "size: %d\n", __func__, pos, cap_len);
> +			continue;
> +		}
> +
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +                                                         id), &id);
> +		if (id != required_id)
> +			continue;
> +
> +		/* Type, and ID match, looks good */
> +		pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> +							 bar), bar);
> +
> +		/* Read the lower 32bit of length and offset */
> +		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
> +							  offset), &tmp32);
> +		res_offset = tmp32;
> +		pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
> +							  length), &tmp32);
> +		res_length = tmp32;
> +
> +		/* and now the top half */
> +		pci_read_config_dword(dev,
> +				      pos + offsetof(struct virtio_pci_cap64,
> +                                                     offset_hi), &tmp32);
> +		res_offset |= ((u64)tmp32) << 32;
> +		pci_read_config_dword(dev,
> +				      pos + offsetof(struct virtio_pci_cap64,
> +                                                     length_hi), &tmp32);
> +		res_length |= ((u64)tmp32) << 32;
> +
> +		*offset = res_offset;
> +		*len = res_length;
> +
> +		return pos;
> +	}
> +	return 0;
> +}
> +
> +static bool vp_get_shm_region(struct virtio_device *vdev,
> +			      struct virtio_shm_region *region, u8 id)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> +	u8 bar;
> +	u64 offset, len;
> +	phys_addr_t phys_addr;
> +	size_t bar_len;
> +
> +	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> +		return false;
> +	}
> +
> +	phys_addr = pci_resource_start(pci_dev, bar);
> +	bar_len = pci_resource_len(pci_dev, bar);
> +
> +	if ((offset + len) < offset) {
> +		dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
> +			__func__);
> +		return false;
> +	}
> +
> +	if (offset + len > bar_len) {
> +		dev_err(&pci_dev->dev, "%s: bar shorter than cap offset+len\n",
> +			__func__);
> +		return false;
> +	}

Maybe move this to a common header so the checks can be reused by
other transports? Can be a patch on top.

> +
> +	region->len = len;
> +	region->addr = (u64) phys_addr + offset;
> +
> +	return true;
> +}
> +
>  static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
>  	.get		= NULL,
>  	.set		= NULL,
> @@ -458,6 +552,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
>  	.bus_name	= vp_bus_name,
>  	.set_vq_affinity = vp_set_vq_affinity,
>  	.get_vq_affinity = vp_get_vq_affinity,
> +	.get_shm_region  = vp_get_shm_region,
>  };
>  
>  static const struct virtio_config_ops virtio_pci_config_ops = {
> @@ -474,6 +569,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
>  	.bus_name	= vp_bus_name,
>  	.set_vq_affinity = vp_set_vq_affinity,
>  	.get_vq_affinity = vp_get_vq_affinity,
> +	.get_shm_region  = vp_get_shm_region,
>  };
>  
>  /**
> diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
> index 90007a1abcab..fe9f43680a1d 100644
> --- a/include/uapi/linux/virtio_pci.h
> +++ b/include/uapi/linux/virtio_pci.h
> @@ -113,6 +113,8 @@
>  #define VIRTIO_PCI_CAP_DEVICE_CFG	4
>  /* PCI configuration access */
>  #define VIRTIO_PCI_CAP_PCI_CFG		5
> +/* Additional shared memory capability */
> +#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
>  
>  /* This is the PCI capability header: */
>  struct virtio_pci_cap {
> @@ -121,11 +123,18 @@ struct virtio_pci_cap {
>  	__u8 cap_len;		/* Generic PCI field: capability length */
>  	__u8 cfg_type;		/* Identifies the structure. */
>  	__u8 bar;		/* Where to find it. */
> -	__u8 padding[3];	/* Pad to full dword. */
> +	__u8 id;		/* Multiple capabilities of the same type */
> +	__u8 padding[2];	/* Pad to full dword. */
>  	__le32 offset;		/* Offset within bar. */
>  	__le32 length;		/* Length of the structure, in bytes. */
>  };
>  
> +struct virtio_pci_cap64 {
> +       struct virtio_pci_cap cap;
> +       __le32 offset_hi;             /* Most sig 32 bits of offset */
> +       __le32 length_hi;             /* Most sig 32 bits of length */
> +};
> +
>  struct virtio_pci_notify_cap {
>  	struct virtio_pci_cap cap;
>  	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
> -- 
> 2.25.4
> 
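
For illustration, the check Michael suggests moving to a common header so
other transports can reuse it might look something like this, mirroring
the overflow and BAR-length checks in the quoted vp_get_shm_region(). The
helper name is made up and this compiles standalone, not against kernel
headers:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical shared helper: does [offset, offset + len) fit inside a
 * resource of resource_len bytes without wrapping around? */
static bool shm_region_valid(uint64_t offset, uint64_t len,
			     uint64_t resource_len)
{
	if (offset + len < offset)		/* offset+len overflowed */
		return false;
	if (offset + len > resource_len)	/* region exceeds the BAR */
		return false;
	return true;
}

int main(void)
{
	printf("%d\n", shm_region_valid(~0ULL, 16, 1ULL << 20));	/* 0 */
	printf("%d\n", shm_region_valid(0, 2ULL << 20, 1ULL << 20));	/* 0 */
	printf("%d\n", shm_region_valid(4096, 8192, 1ULL << 20));	/* 1 */
	return 0;
}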


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/20] virtio: Add get_shm_region method
  2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
  2020-08-10 13:47   ` Michael S. Tsirkin
  2020-08-10 14:02   ` Michael S. Tsirkin
@ 2020-08-10 14:06   ` Michael S. Tsirkin
  2 siblings, 0 replies; 45+ messages in thread
From: Michael S. Tsirkin @ 2020-08-10 14:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kvm

On Fri, Aug 07, 2020 at 03:55:09PM -0400, Vivek Goyal wrote:
> From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> 
> Virtio defines 'shared memory regions' that provide a continuously
> shared region between the host and guest.
> 
> Provide a method to find a particular region on a device.
> 
> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: "Michael S. Tsirkin" <mst@redhat.com>

I don't think I can merge it through my tree for 5.9 at this stage,
but if there's a tree where this can be merged for 5.9,
feel free.

> ---
>  include/linux/virtio_config.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> index bb4cc4910750..c859f000a751 100644
> --- a/include/linux/virtio_config.h
> +++ b/include/linux/virtio_config.h
> @@ -10,6 +10,11 @@
>  
>  struct irq_affinity;
>  
> +struct virtio_shm_region {
> +       u64 addr;
> +       u64 len;
> +};
> +
>  /**
>   * virtio_config_ops - operations for configuring a virtio device
>   * Note: Do not assume that a transport implements all of the operations
> @@ -65,6 +70,7 @@ struct irq_affinity;
>   *      the caller can then copy.
>   * @set_vq_affinity: set the affinity for a virtqueue (optional).
>   * @get_vq_affinity: get the affinity for a virtqueue (optional).
> + * @get_shm_region: get a shared memory region based on the index.
>   */
>  typedef void vq_callback_t(struct virtqueue *);
>  struct virtio_config_ops {
> @@ -88,6 +94,8 @@ struct virtio_config_ops {
>  			       const struct cpumask *cpu_mask);
>  	const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
>  			int index);
> +	bool (*get_shm_region)(struct virtio_device *vdev,
> +			       struct virtio_shm_region *region, u8 id);
>  };
>  
>  /* If driver didn't advertise the feature, it will never appear. */
> @@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
>  	return 0;
>  }
>  
> +static inline
> +bool virtio_get_shm_region(struct virtio_device *vdev,
> +                         struct virtio_shm_region *region, u8 id)
> +{
> +	if (!vdev->config->get_shm_region)
> +		return false;
> +	return vdev->config->get_shm_region(vdev, region, id);
> +}
> +
>  static inline bool virtio_is_little_endian(struct virtio_device *vdev)
>  {
>  	return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
> -- 
> 2.25.4
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport
  2020-08-10 14:05   ` Michael S. Tsirkin
@ 2020-08-10 14:50     ` Vivek Goyal
       [not found]       ` <CAAfnVBk+Hmcm2ftd3wOK-P2NyYQ7z4Wrf1JKhLJaNkCZBLoo6g@mail.gmail.com>
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-10 14:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Sebastien Boeuf, kbuild test robot, kvm

On Mon, Aug 10, 2020 at 10:05:17AM -0400, Michael S. Tsirkin wrote:
> On Fri, Aug 07, 2020 at 03:55:10PM -0400, Vivek Goyal wrote:
> > From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > 
> > On PCI the shm regions are found using capability entries;
> > find a region by searching for the capability.
> > 
> > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Signed-off-by: kbuild test robot <lkp@intel.com>
> > Cc: kvm@vger.kernel.org
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 

[..]
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > +			      struct virtio_shm_region *region, u8 id)
> > +{
> > +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > +	struct pci_dev *pci_dev = vp_dev->pci_dev;
> > +	u8 bar;
> > +	u64 offset, len;
> > +	phys_addr_t phys_addr;
> > +	size_t bar_len;
> > +
> > +	if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > +		return false;
> > +	}
> > +
> > +	phys_addr = pci_resource_start(pci_dev, bar);
> > +	bar_len = pci_resource_len(pci_dev, bar);
> > +
> > +	if ((offset + len) < offset) {
> > +		dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
> > +			__func__);
> > +		return false;
> > +	}
> > +
> > +	if (offset + len > bar_len) {
> > +		dev_err(&pci_dev->dev, "%s: bar shorter than cap offset+len\n",
> > +			__func__);
> > +		return false;
> > +	}
> 
> Maybe move this to a common header so the checks can be reused by
> other transports? Can be a patch on top.

Will do as a patch on top once these patches get merged.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 18/20] fuse: Release file in process context
  2020-08-10  8:29   ` Miklos Szeredi
@ 2020-08-10 15:48     ` Vivek Goyal
  2020-08-10 19:37     ` Vivek Goyal
  1 sibling, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-10 15:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, virtio-fs-list, Stefan Hajnoczi,
	Dr. David Alan Gilbert

On Mon, Aug 10, 2020 at 10:29:13AM +0200, Miklos Szeredi wrote:
> On Fri, Aug 7, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > fuse_file_put(sync) can be called with sync=true/false. If sync=true,
> > it waits for release request response and then calls iput() in the
> > caller's context. If sync=false, it does not wait for release request
> > response, frees the fuse_file struct immediately and req->end function
> > does the iput().
> >
> > iput() can be a problem with DAX if called in req->end context. If this
> > is last reference to inode (VFS has let go its reference already), then
> > iput() will clean DAX mappings as well and send REMOVEMAPPING requests
> > and wait for completion. (All this in the worker thread context which is
> > processing fuse replies from the daemon on the host.)
> >
> > That means it blocks worker thread and it stops processing further
> > replies and system deadlocks.
> 
> Is this reasoning specific to DAX?  Seems to me this is a virtio-fs
> specific issue.

I would think it is a virtio-fs + DAX issue. virtio-fs without DAX does
not have this problem.

If making this conditional on DAX is an issue, I am wondering whether
we can set args->may_block instead. virtiofs now does completion
in a separate worker thread if ->may_block is set. That means the
worker will block till REMOVEMAPPING completes and then be woken
up.

Do let me know if you like the args->may_block approach better.
I can give that a try.

Vivek
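
For illustration, the ->may_block dispatch described above (complete a
request inline from the reply-processing context unless it may block, in
which case hand it to a worker) can be modelled in userspace like this.
All names are made up; it is a sketch of the idea, not the fuse code:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct req {
	bool may_block;		/* models args->may_block */
	const char *name;
};

static void complete_req(struct req *r)
{
	printf("complete %s\n", r->name);
}

static void *worker(void *arg)
{
	/* Worker context: free to block, e.g. waiting on REMOVEMAPPING. */
	complete_req(arg);
	return NULL;
}

/* Reply-processing context: must not block, so defer if may_block is set. */
static bool process_reply(struct req *r, pthread_t *t)
{
	if (r->may_block) {
		pthread_create(t, NULL, worker, r);
		return true;
	}
	complete_req(r);
	return false;
}

int main(void)
{
	struct req getattr = { .may_block = false, .name = "GETATTR" };
	struct req release = { .may_block = true,  .name = "RELEASE" };
	pthread_t t;

	process_reply(&getattr, &t);
	if (process_reply(&release, &t))
		pthread_join(t, NULL);
	return 0;
}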


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 18/20] fuse: Release file in process context
  2020-08-10  8:29   ` Miklos Szeredi
  2020-08-10 15:48     ` Vivek Goyal
@ 2020-08-10 19:37     ` Vivek Goyal
  2020-08-10 19:39       ` Miklos Szeredi
  1 sibling, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-10 19:37 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, virtio-fs-list, Stefan Hajnoczi,
	Dr. David Alan Gilbert

On Mon, Aug 10, 2020 at 10:29:13AM +0200, Miklos Szeredi wrote:
> On Fri, Aug 7, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > fuse_file_put(sync) can be called with sync=true/false. If sync=true,
> > it waits for release request response and then calls iput() in the
> > caller's context. If sync=false, it does not wait for release request
> > response, frees the fuse_file struct immediately and req->end function
> > does the iput().
> >
> > iput() can be a problem with DAX if called in req->end context. If this
> > is last reference to inode (VFS has let go its reference already), then
> > iput() will clean DAX mappings as well and send REMOVEMAPPING requests
> > and wait for completion. (All the the worker thread context which is
> > processing fuse replies from daemon on the host).
> >
> > That means it blocks worker thread and it stops processing further
> > replies and system deadlocks.
> 
> Is this reasoning specific to DAX?  Seems to me this is a virtio-fs
> specific issue.

Hi Miklos,

I am looking at this patch closely and maybe we don't need it, given
current code.

The reason is that for virtiofs we now set ctx->destroy=true
and that sets fc->destroy. And if that's set, we don't schedule an async req
in fuse_release_common():

fuse_release_common() {
  fuse_file_put(ff, ff->fc->destroy, isdir);
}

And if we don't schedule the async file put, we don't run the risk of
blocking the worker thread in evict_inode() with DAX enabled.

I ran blogbench multiple times with this patch reverted and I can't
reproduce the deadlock.

So we can drop this patch for now. If need be I will fix it as
post merge patch.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 18/20] fuse: Release file in process context
  2020-08-10 19:37     ` Vivek Goyal
@ 2020-08-10 19:39       ` Miklos Szeredi
  0 siblings, 0 replies; 45+ messages in thread
From: Miklos Szeredi @ 2020-08-10 19:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs-list, Stefan Hajnoczi,
	Dr. David Alan Gilbert

On Mon, Aug 10, 2020 at 9:37 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Aug 10, 2020 at 10:29:13AM +0200, Miklos Szeredi wrote:
> > On Fri, Aug 7, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > fuse_file_put(sync) can be called with sync=true/false. If sync=true,
> > > it waits for release request response and then calls iput() in the
> > > caller's context. If sync=false, it does not wait for release request
> > > response, frees the fuse_file struct immediately and req->end function
> > > does the iput().
> > >
> > > iput() can be a problem with DAX if called in req->end context. If this
> > > is last reference to inode (VFS has let go its reference already), then
> > > iput() will clean DAX mappings as well and send REMOVEMAPPING requests
> > > and wait for completion. (All this in the worker thread context which is
> > > processing fuse replies from the daemon on the host.)
> > >
> > > That means it blocks worker thread and it stops processing further
> > > replies and system deadlocks.
> >
> > Is this reasoning specific to DAX?  Seems to me this is a virtio-fs
> > specific issue.
>
> Hi Miklos,
>
> I am looking at this patch closely and maybe we don't need it, given
> current code.
>
> Reason being that looks like for virtiofs now we set ctx->destroy=true
> and that sets fc->destroy. And if that's set, we don't schedule async req
> in fuse_release_common();
>
> fuse_release_common() {
>   fuse_file_put(ff, ff->fc->destroy, isdir);
> }
>
> And if we don't schedule async file put, we don't run the risk of
> blocking worker thread in evict_inode() with DAX enabled.
>
> I ran blogbench multiple times with this patch reverted and I can't
> reproduce the deadlock.
>
> So we can drop this patch for now. If need be I will fix it as
> post merge patch.

Okay.  Yeah, I was thinking along the lines of no-async-release-ever
for virtiofs, but apparently that's already done.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 13/20] fuse, dax: Implement dax read/write operations
  2020-08-07 19:55 ` [PATCH v2 13/20] fuse, dax: Implement dax read/write operations Vivek Goyal
@ 2020-08-10 22:06   ` Dave Chinner
  2020-08-11 17:44     ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2020-08-10 22:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Miklos Szeredi, Liu Bo, Peng Tao

On Fri, Aug 07, 2020 at 03:55:19PM -0400, Vivek Goyal wrote:
> This patch implements basic DAX support. mmap() is not implemented
> yet and will come in later patches. This patch looks into implementing
> read/write.

....

> +static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> +					 loff_t length, unsigned flags,
> +					 struct iomap *iomap)

please don't use the iomap_* namespace even for static functions in
filesystem code. This really doesn't have anything to do with iomap
except that whatever fuse sets up is used to fill an iomap.
i.e. fuse_setup_new_dax_mapping() would be a far more appropriate
name.

> +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> +					 loff_t length, unsigned flags,
> +					 struct iomap *iomap)

ditto: fuse_upgrade_dax_mapping().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2020-08-07 19:55 ` [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
@ 2020-08-10 22:22   ` Dave Chinner
  2020-08-11 17:55     ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2020-08-10 22:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha, dgilbert

On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> We need some kind of locking mechanism here. Normal file systems like
> ext4 and xfs seem to take their own semaphore to protect against
> truncate while a fault is going on.
> 
> We have an additional requirement to protect against fuse dax memory range
> reclaim. When a range has been selected for reclaim, we need to make sure
> no other read/write/fault can try to access that memory range while
> reclaim is in progress. Once reclaim is complete, the lock will be released
> and read/write/fault will trigger allocation of a fresh dax range.
> 
> Taking inode_lock() is not an option in fault path as lockdep complains
> about circular dependencies. So define a new fuse_inode->i_mmap_sem.

That's precisely why filesystems like ext4 and XFS define their own
rwsem.

Note that this isn't a DAX requirement - the page fault
serialisation is actually a requirement of hole punching...

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/fuse/dir.c    |  2 ++
>  fs/fuse/file.c   | 15 ++++++++++++---
>  fs/fuse/fuse_i.h |  7 +++++++
>  fs/fuse/inode.c  |  1 +
>  4 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 26f028bc760b..f40766c0693b 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>  	 */
>  	if ((is_truncate || !is_wb) &&
>  	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> +		down_write(&fi->i_mmap_sem);
>  		truncate_pagecache(inode, outarg.attr.size);
>  		invalidate_inode_pages2(inode->i_mapping);
> +		up_write(&fi->i_mmap_sem);
>  	}
>  
>  	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index be7d90eb5b41..00ad27216cc3 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
>  
>  	if (write)
>  		sb_start_pagefault(sb);
> -
> +	/*
> +	 * We need to serialize against not only truncate but also against
> +	 * fuse dax memory range reclaim. While a range is being reclaimed,
> +	 * we do not want any read/write/mmap to make progress and try
> +	 * to populate page cache or access memory we are trying to free.
> +	 */
> +	down_read(&get_fuse_inode(inode)->i_mmap_sem);
>  	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
>  
>  	if (ret & VM_FAULT_NEEDDSYNC)
>  		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> +	up_read(&get_fuse_inode(inode)->i_mmap_sem);
>  
>  	if (write)
>  		sb_end_pagefault(sb);
> @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>  			file_update_time(file);
>  	}
>  
> -	if (mode & FALLOC_FL_PUNCH_HOLE)
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		down_write(&fi->i_mmap_sem);
>  		truncate_pagecache_range(inode, offset, offset + length - 1);
> -
> +		up_write(&fi->i_mmap_sem);
> +	}
>  	fuse_invalidate_attr(inode);


I'm not sure this is sufficient. You have to lock page faults out
for the entire time the hole punch is being performed, not just while
the mapping is being invalidated.

That is, once you've taken the inode lock and written back the dirty
data over the range being punched, you can then take a page fault
and dirty the page again. Then after you punch the hole out,
you have a dirty page with non-zero data in it, and that can get
written out before the page cache is truncated.

IOWs, to do a hole punch safely, you have to both lock the inode
and lock out page faults *before* you write back dirty data. Then
you can invalidate the page cache so you know there is no cached
data over the range about to be punched. Once the punch is done,
then you can drop all locks....

The same goes for any other operation that manipulates extents
directly (other fallocate ops, truncate, etc).

/me also wonders if there can be racing AIO+DIO in progress over the
range that is being punched and whether fuse needs to call
inode_dio_wait() before punching holes, running truncates, etc...
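
i.e. the fallocate path wants an ordering something like this (a
sketch only, not the actual fuse code; fuse_punch_hole is a
hypothetical helper name, and fi->i_mmap_sem is the rwsem this
series introduces):

	static long fuse_punch_hole(struct inode *inode, loff_t offset,
				    loff_t length)
	{
		struct fuse_inode *fi = get_fuse_inode(inode);
		loff_t end = offset + length - 1;
		int err;

		inode_lock(inode);
		down_write(&fi->i_mmap_sem);	/* lock out page faults */
		inode_dio_wait(inode);		/* drain in-flight AIO+DIO */
		err = filemap_write_and_wait_range(inode->i_mapping, offset, end);
		if (!err) {
			truncate_pagecache_range(inode, offset, end);
			/* ... now punch out the backing store ... */
		}
		up_write(&fi->i_mmap_sem);
		inode_unlock(inode);
		return err;
	}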

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 13/20] fuse, dax: Implement dax read/write operations
  2020-08-10 22:06   ` Dave Chinner
@ 2020-08-11 17:44     ` Vivek Goyal
  0 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-11 17:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Miklos Szeredi, Liu Bo, Peng Tao

On Tue, Aug 11, 2020 at 08:06:55AM +1000, Dave Chinner wrote:
> On Fri, Aug 07, 2020 at 03:55:19PM -0400, Vivek Goyal wrote:
> > This patch implements basic DAX support. mmap() is not implemented
> > yet and will come in later patches. This patch looks into implementing
> > read/write.
> 
> ....
> 
> > +static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> > +					 loff_t length, unsigned flags,
> > +					 struct iomap *iomap)
> 
> please don't use the iomap_* namespace even for static functions in
> filesystem code. This really doesn't have anything to do with iomap
> except that whatever fuse sets up is used to fill an iomap.
> i.e. fuse_setup_new_dax_mapping() would be a far more appropriate
> name.
> 
> > +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> > +					 loff_t length, unsigned flags,
> > +					 struct iomap *iomap)
> 
> ditto: fuse_upgrade_dax_mapping().
> 

Hi Dave,

Will rename these functions as you suggested.

Vivek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2020-08-10 22:22   ` Dave Chinner
@ 2020-08-11 17:55     ` Vivek Goyal
  2020-08-12  1:23       ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-11 17:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha, dgilbert

On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > We need some kind of locking mechanism here. Normal file systems like
> > ext4 and xfs seem to take their own semaphore to protect against
> > truncate while fault is going on.
> > 
> > We have an additional requirement to protect against fuse dax memory range
> > reclaim. When a range has been selected for reclaim, we need to make sure
> > no other read/write/fault can try to access that memory range while
> > reclaim is in progress. Once reclaim is complete, lock will be released
> > and read/write/fault will trigger allocation of fresh dax range.
> > 
> > Taking inode_lock() is not an option in fault path as lockdep complains
> > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> 
> That's precisely why filesystems like ext4 and XFS define their own
> rwsem.
> 
> Note that this isn't a DAX requirement - the page fault
> serialisation is actually a requirement of hole punching...

Hi Dave,

I noticed that fuse code currently does not seem to have a rwsem which
can provide mutual exclusion between truncation/hole_punch path
and page fault path. I am wondering whether that means there are issues
with existing code, or whether something else makes it unnecessary to provide
this mutual exclusion.

> 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/fuse/dir.c    |  2 ++
> >  fs/fuse/file.c   | 15 ++++++++++++---
> >  fs/fuse/fuse_i.h |  7 +++++++
> >  fs/fuse/inode.c  |  1 +
> >  4 files changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 26f028bc760b..f40766c0693b 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
> >  	 */
> >  	if ((is_truncate || !is_wb) &&
> >  	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> > +		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache(inode, outarg.attr.size);
> >  		invalidate_inode_pages2(inode->i_mapping);
> > +		up_write(&fi->i_mmap_sem);
> >  	}
> >  
> >  	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index be7d90eb5b41..00ad27216cc3 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2878,11 +2878,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
> >  
> >  	if (write)
> >  		sb_start_pagefault(sb);
> > -
> > +	/*
> > +	 * We need to serialize against not only truncate but also against
> > +	 * fuse dax memory range reclaim. While a range is being reclaimed,
> > +	 * we do not want any read/write/mmap to make progress and try
> > +	 * to populate page cache or access memory we are trying to free.
> > +	 */
> > +	down_read(&get_fuse_inode(inode)->i_mmap_sem);
> >  	ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
> >  
> >  	if (ret & VM_FAULT_NEEDDSYNC)
> >  		ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> > +	up_read(&get_fuse_inode(inode)->i_mmap_sem);
> >  
> >  	if (write)
> >  		sb_end_pagefault(sb);
> > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> >  			file_update_time(file);
> >  	}
> >  
> > -	if (mode & FALLOC_FL_PUNCH_HOLE)
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		down_write(&fi->i_mmap_sem);
> >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > -
> > +		up_write(&fi->i_mmap_sem);
> > +	}
> >  	fuse_invalidate_attr(inode);
> 
> 
> I'm not sure this is sufficient. You have to lock page faults out
> for the entire time the hole punch is being performed, not just while
> the mapping is being invalidated.
> 
> That is, once you've taken the inode lock and written back the dirty
> data over the range being punched, you can then take a page fault
> and dirty the page again. Then after you punch the hole out,
> you have a dirty page with non-zero data in it, and that can get
> written out before the page cache is truncated.

Just for my better understanding of the issue, I am wondering what
problem this will lead to. If one process is doing punch_hole and
another is writing in the range being punched, the end result could be
anything. Either we will read zeroes from punched_hole pages or
we will read the data written by the process writing to the mmaped
page, depending on the order in which they executed.

If that's the case, then holding fi->i_mmap_sem for the whole duration
might not matter. What am I missing?

Thanks
Vivek

> 
> IOWs, to do a hole punch safely, you have to both lock the inode
> and lock out page faults *before* you write back dirty data. Then
> you can invalidate the page cache so you know there is no cached
> data over the range about to be punched. Once the punch is done,
> then you can drop all locks....
> 
> The same goes for any other operation that manipulates extents
> directly (other fallocate ops, truncate, etc).
> 
> /me also wonders if there can be racing AIO+DIO in progress over the
> range that is being punched and whether fuse needs to call
> inode_dio_wait() before punching holes, running truncates, etc...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2020-08-11 17:55     ` Vivek Goyal
@ 2020-08-12  1:23       ` Dave Chinner
  2020-08-12 21:10         ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2020-08-12  1:23 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha, dgilbert

On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote:
> On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > > We need some kind of locking mechanism here. Normal file systems like
> > > ext4 and xfs seem to take their own semaphore to protect against
> > > truncate while fault is going on.
> > > 
> > > We have an additional requirement to protect against fuse dax memory range
> > > reclaim. When a range has been selected for reclaim, we need to make sure
> > > no other read/write/fault can try to access that memory range while
> > > reclaim is in progress. Once reclaim is complete, lock will be released
> > > and read/write/fault will trigger allocation of fresh dax range.
> > > 
> > > Taking inode_lock() is not an option in fault path as lockdep complains
> > > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> > 
> > That's precisely why filesystems like ext4 and XFS define their own
> > rwsem.
> > 
> > Note that this isn't a DAX requirement - the page fault
> > serialisation is actually a requirement of hole punching...
> 
> Hi Dave,
> 
> I noticed that fuse code currently does not seem to have a rwsem which
> can provide mutual exclusion between truncation/hole_punch path
> and page fault path. I am wondering whether that means there are issues
> with existing code, or whether something else makes it unnecessary to provide
> this mutual exclusion.

I don't know enough about the fuse implementation to say. What I'm
saying is that nothing in the core mm/ or VFS serialises page cache
access to the data against direct filesystem manipulations of the
underlying filesystem structures.

i.e. nothing in the VFS or page fault IO path prevents this race
condition:

P0				P1
fallocate
page cache invalidation
				page fault
				read data
punch out data extents
				<data exposed to userspace is stale>
				<data exposed to userspace has no
				backing store allocated>


That's where the ext4 and XFS internal rwsem come into play:

fallocate
down_write(mmaplock)
page cache invalidation
				page fault
				down_read(mmaplock)
				<blocks>
punch out data
up_write(mmaplock)
				<unblocks>
				<sees hole>
				<allocates zeroed pages in page cache>

And there's no stale data exposure to userspace.
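
In code, that pairing is roughly this (a sketch; "mmaplock" standing
in for the filesystem-private rwsem):

	/* fallocate/truncate side */
	down_write(&mmaplock);
	truncate_pagecache_range(inode, start, end);
	/* ... punch out data ... */
	up_write(&mmaplock);

	/* page fault side */
	down_read(&mmaplock);
	ret = filemap_fault(vmf);	/* or dax_iomap_fault() for DAX */
	up_read(&mmaplock);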

It's the same reason that we use the i_rwsem to prevent concurrent
IO while a truncate or hole punch is in progress. The IO could map
the extent, then block in the IO path, while the filesystem
re-allocates and writes new data or metadata to those blocks. That's
another potential non-owner data exposure problem.

And if you don't drain AIO+DIO before truncate/hole punch, the
i_rwsem does not protect you against concurrent IO as that gets
dropped after the AIO is submitted and returns EIOCBQUEUED to the
AIO layer. Hence there's IO in flight that isn't tracked by the
i_rwsem or the MMAPLOCK, and if you punch out the blocks and
reallocate them while the IO is in flight....

> > > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> > >  			file_update_time(file);
> > >  	}
> > >  
> > > -	if (mode & FALLOC_FL_PUNCH_HOLE)
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		down_write(&fi->i_mmap_sem);
> > >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > > -
> > > +		up_write(&fi->i_mmap_sem);
> > > +	}
> > >  	fuse_invalidate_attr(inode);
> > 
> > 
> > I'm not sure this is sufficient. You have to lock page faults out
> > for the entire time the hole punch is being performed, not just while
> > the mapping is being invalidated.
> > 
> > That is, once you've taken the inode lock and written back the dirty
> > data over the range being punched, you can then take a page fault
> > and dirty the page again. Then after you punch the hole out,
> > you have a dirty page with non-zero data in it, and that can get
> > written out before the page cache is truncated.
> 
> Just for my better understanding of the issue, I am wondering what
> problem this will lead to.
> If one process is doing punch_hole and another is writing in the
> range being punched, the end result could be anything. Either we will
> read zeroes from punched_hole pages or we will read the data
> written by the process writing to the mmaped page, depending on
> the order in which they executed.
>
> If that's the case, then holding fi->i_mmap_sem for the whole
> duration might not matter. What am I missing?

That it is safe to invalidate the page cache after the hole has been
punched.

There is nothing stopping, say, memory reclaim from reclaiming pages
over the range while the hole is being punched, then having the
application refault them while the backing store is being freed.
While the page fault read IO is in progress, there's nothing
stopping the filesystem from freeing those blocks, nor reallocating
them and writing something else to them (e.g. metadata). So they
could read someone else's data.

Even worse: the page fault is a write fault, it lands in a hole, has
space allocated, the page cache is zeroed, page marked dirty, and
then the hole punch calls truncate_pagecache_range() which tosses
away the zeroed page and the data the userspace application wrote
to the page.

The application then refaults the page, reading stale data off
disk instead of seeing what it had already written to the page....

And unlike truncated pages, the mm/ code cannot reliably detect
invalidation races on lockless lookup of pages that are within EOF.
They rely on truncate changing the file size before page
invalidation to detect races as page->index then points beyond EOF.
Hole punching does not change inode size, so the page cache lookups
cannot tell the difference between a new page that just needs IO to
initialise the data and a page that has just been invalidated....

IOWs, there are many ways things can go wrong with hole punch, and
the only way to avoid them all is to invalidate and lock out the
page cache before starting the fallocate operation. i.e.:

	1. lock up the entire IO path (vfs and page fault)
	2. drain the AIO+DIO path
	3. write back dirty pages
	4. invalidate the page cache

Because this is the only way we can guarantee that nothing can access
the filesystem's backing store for the range we are about to
directly manipulate the data in while we perform an "offloaded" data
transformation on that range...

This isn't just hole punch - the same problems exist with
FALLOC_FL_ZERO_RANGE and FALLOC_FL_{INSERT,COLLAPSE}_RANGE because
they change data with extent manipulations and/or hardware offloads
that provide no guarantees of specific data state or integrity until
they complete....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2020-08-12  1:23       ` Dave Chinner
@ 2020-08-12 21:10         ` Vivek Goyal
  2020-08-13  5:12           ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2020-08-12 21:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha, dgilbert

On Wed, Aug 12, 2020 at 11:23:45AM +1000, Dave Chinner wrote:
> On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote:
> > On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > > > We need some kind of locking mechanism here. Normal file systems like
> > > > ext4 and xfs seem to take their own semaphore to protect against
> > > > truncate while fault is going on.
> > > > 
> > > > We have an additional requirement to protect against fuse dax memory range
> > > > reclaim. When a range has been selected for reclaim, we need to make sure
> > > > no other read/write/fault can try to access that memory range while
> > > > reclaim is in progress. Once reclaim is complete, lock will be released
> > > > and read/write/fault will trigger allocation of fresh dax range.
> > > > 
> > > > Taking inode_lock() is not an option in fault path as lockdep complains
> > > > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> > > 
> > > That's precisely why filesystems like ext4 and XFS define their own
> > > rwsem.
> > > 
> > > Note that this isn't a DAX requirement - the page fault
> > > serialisation is actually a requirement of hole punching...
> > 
> > Hi Dave,
> > 
> > I noticed that fuse code currently does not seem to have a rwsem which
> > can provide mutual exclusion between truncation/hole_punch path
> > and page fault path. I am wondering whether that means there are issues
> > with existing code, or whether something else makes it unnecessary to provide
> > this mutual exclusion.
> 
> I don't know enough about the fuse implementation to say. What I'm
> saying is that nothing in the core mm/ or VFS serialises page cache
> access to the data against direct filesystem manipulations of the
> underlying filesystem structures.

Hi Dave,

Got it. I was checking nfs and they also seem to be calling filemap_fault
and not taking any locks to block faults. fallocate() (nfs42_fallocate)
seems to block read/write/aio/dio but does not seem to do anything
about blocking faults. I am wondering if remote filesystems are a
little different in this aspect. Especially fuse does not maintain
any filesystem block/extent data. It is the file server which is doing
all that.

> 
> i.e. nothing in the VFS or page fault IO path prevents this race
> condition:
> 
> P0				P1
> fallocate
> page cache invalidation
> 				page fault
> 				read data
> punch out data extents
> 				<data exposed to userspace is stale>
> 				<data exposed to userspace has no
> 				backing store allocated>
> 
> 
> That's where the ext4 and XFS internal rwsem come into play:
> 
> fallocate
> down_write(mmaplock)
> page cache invalidation
> 				page fault
> 				down_read(mmaplock)
> 				<blocks>
> punch out data
> up_write(mmaplock)
> 				<unblocks>
> 				<sees hole>
> 				<allocates zeroed pages in page cache>
> 
> And there's no stale data exposure to userspace.

Got it. I noticed that both fuse/nfs seem to have reversed the
order of operations. They call the server to punch out data first
and then truncate the page cache. And that should mean that an
mmap reader will not see stale data after fallocate(punch_hole)
has finished.

> 
> It's the same reason that we use the i_rwsem to prevent concurrent
> IO while a truncate or hole punch is in progress. The IO could map
> the extent, then block in the IO path, while the filesytsem
> re-allocates and writes new data or metadata to those blocks. That's
> another potential non-owner data exposure problem.
> 
> And if you don't drain AIO+DIO before truncate/hole punch, the
> i_rwsem does not protect you against concurrent IO as that gets
> dropped after the AIO is submitted and returns EIOCBQUEUED to the
> AIO layer. Hence there's IO in flight that isn't tracked by the
> i_rwsem or the MMAPLOCK, and if you punch out the blocks and
> reallocate them while the IO is in flight....
> 
> > > > @@ -3849,9 +3856,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> > > >  			file_update_time(file);
> > > >  	}
> > > >  
> > > > -	if (mode & FALLOC_FL_PUNCH_HOLE)
> > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +		down_write(&fi->i_mmap_sem);
> > > >  		truncate_pagecache_range(inode, offset, offset + length - 1);
> > > > -
> > > > +		up_write(&fi->i_mmap_sem);
> > > > +	}
> > > >  	fuse_invalidate_attr(inode);
> > > 
> > > 
> > > I'm not sure this is sufficient. You have to lock page faults out
> > > for the entire time the hole punch is being performed, not just while
> > > the mapping is being invalidated.
> > > 
> > > That is, once you've taken the inode lock and written back the dirty
> > > data over the range being punched, you can then take a page fault
> > > and dirty the page again. Then after you punch the hole out,
> > > you have a dirty page with non-zero data in it, and that can get
> > > written out before the page cache is truncated.
> > 
> > Just for my better understanding of the issue, I am wondering what
> > problem this will lead to.
> > If one process is doing punch_hole and another is writing in the
> > range being punched, the end result could be anything. Either we will
> > read zeroes from punched_hole pages or we will read the data
> > written by the process writing to the mmaped page, depending on
> > the order in which they executed.
> >
> > If that's the case, then holding fi->i_mmap_sem for the whole
> > duration might not matter. What am I missing?
> 
> That it is safe to invalidate the page cache after the hole has been
> punched.

That's precisely what both fuse and nfs seem to be doing: truncate the
page cache after the server has hole punched the file
(nfs42_proc_deallocate() and fuse_file_fallocate()). I don't understand
the nfs code, but that seems to be the case from a quick look.

> 
> There is nothing stopping, say, memory reclaim from reclaiming pages
> over the range while the hole is being punched, then having the
> application refault them while the backing store is being freed.
> While the page fault read IO is in progress, there's nothing
> stopping the filesystem from freeing those blocks, nor reallocating
> them and writing something else to them (e.g. metadata). So they
> could read someone else's data.
> 
> Even worse: the page fault is a write fault, it lands in a hole, has
> space allocated, the page cache is zeroed, page marked dirty, and
> then the hole punch calls truncate_pagecache_range() which tosses
> away the zeroed page and the data the userspace application wrote
> to the page.

But isn't that supposed to happen? If fallocate(hole_punch) and mmaped
write are happening at the same time, then there is no guarantee
in what order they will execute. The app might read back the data it wrote
or might read back zeros depending on the order of execution (even
with proper locking).

> 
> The application then refaults the page, reading stale data off
> disk instead of seeing what it had already written to the page....
> 
> And unlike truncated pages, the mm/ code cannot reliably detect
> invalidation races on lockless lookup of pages that are within EOF.
> They rely on truncate changing the file size before page
> invalidation to detect races as page->index then points beyond EOF.
> Hole punching does not change inode size, so the page cache lookups
> cannot tell the difference between a new page that just needs IO to
> initialise the data and a page that has just been invalidated....
> 
> IOWs, there are many ways things can go wrong with hole punch, and
> the only way to avoid them all is to invalidate and lock out the
> page cache before starting the fallocate operation. i.e.:
> 
> 	1. lock up the entire IO path (vfs and page fault)
> 	2. drain the AIO+DIO path
> 	3. write back dirty pages
> 	4. invalidate the page cache

I see that this is definitely safe. Stop all read/write/faults/aio/dio
before proceeding with punching the hole and invalidating the page cache.

I think for my purpose, I need to take fi->i_mmap_sem in the memory
range freeing path and do exactly all of the above to make
sure that no I/O, fault or AIO/DIO is going on before I take
away the memory range I have allocated for that inode offset. This
is, I think, very similar to assigning blocks/extents and taking
them away. In that code path I am already taking care of
taking the inode lock as well as i_mmap_sem. But I have not taken
care of the AIO/DIO stuff. I will introduce that too.

For the time being I will handle these possible fallocate/ftruncate
races in a separate patch series. To me it makes sense to do what
ext4/xfs are doing. But there might be more to it when it comes
to remote filesystems...

Miklos, WDYT? Shall I modify the fuse fallocate/ftruncate code to
block all faults/AIO/DIO as well before we get down to the
task of writing back pages, truncating the page cache and punching
the hole?

> 
> Because this is the only way we can guarantee that nothing can access
> the filesystem's backing store for the range we are about to
> directly manipulate the data in while we perform an "offloaded" data
> transformation on that range...
> 
> This isn't just hole punch - the same problems exist with
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_{INSERT,COLLAPSE}_RANGE because
> they change data with extent manipulations and/or hardware offloads
> that provide no guarantees of specific data state or integrity until
> they complete....

Ok. As of now fuse seems to have blocked all extra fallocate operations
like ZERO_RANGE, INSERT, COLLAPSE.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault
  2020-08-12 21:10         ` Vivek Goyal
@ 2020-08-13  5:12           ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2020-08-13  5:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha, dgilbert

On Wed, Aug 12, 2020 at 05:10:12PM -0400, Vivek Goyal wrote:
> On Wed, Aug 12, 2020 at 11:23:45AM +1000, Dave Chinner wrote:
> > On Tue, Aug 11, 2020 at 01:55:30PM -0400, Vivek Goyal wrote:
> > > On Tue, Aug 11, 2020 at 08:22:38AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 07, 2020 at 03:55:21PM -0400, Vivek Goyal wrote:
> > > > > We need some kind of locking mechanism here. Normal file systems like
> > > > > ext4 and xfs seem to take their own semaphore to protect against
> > > > > truncate while fault is going on.
> > > > > 
> > > > > We have an additional requirement to protect against fuse dax memory range
> > > > > reclaim. When a range has been selected for reclaim, we need to make sure
> > > > > no other read/write/fault can try to access that memory range while
> > > > > reclaim is in progress. Once reclaim is complete, lock will be released
> > > > > and read/write/fault will trigger allocation of fresh dax range.
> > > > > 
> > > > > Taking inode_lock() is not an option in fault path as lockdep complains
> > > > > about circular dependencies. So define a new fuse_inode->i_mmap_sem.
> > > > 
> > > > That's precisely why filesystems like ext4 and XFS define their own
> > > > rwsem.
> > > > 
> > > > Note that this isn't a DAX requirement - the page fault
> > > > serialisation is actually a requirement of hole punching...
> > > 
> > > Hi Dave,
> > > 
> > > I noticed that fuse code currently does not seem to have a rwsem which
> > > can provide mutual exclusion between truncation/hole_punch path
> > > and page fault path. I am wondering whether that means there are issues
> > > with existing code, or whether something else makes it unnecessary to provide
> > > this mutual exclusion.
> > 
> > I don't know enough about the fuse implementation to say. What I'm
> > saying is that nothing in the core mm/ or VFS serialises page cache
> > access to the data against direct filesystem manipulations of the
> > underlying filesystem structures.
> 
> Hi Dave,
> 
> Got it. I was checking nfs and they also seem to be calling filemap_fault
> and not taking any locks to block faults. fallocate() (nfs42_fallocate)
> seems to block read/write/aio/dio but does not seem to do anything
> about blocking faults. I am wondering if remote filesystems are a
> little different in this aspect. Especially fuse does not maintain
> any filesystem block/extent data. It is the file server which is doing
> all that.

I suspect they have all the same problems, and worse, the behaviour
will largely be dependent on the server side behaviour that is out
of the user's control.

Essentially, nobody except us XFS folks seems to regard hole punching
corrupting data or exposing stale data as being a problem that needs
to be avoided or fixed. The only reason ext4 has the i_mmap_sem is
because ext4 wanted to support DAX, and us XFS developers said "DAX
absolutely requires that the filesystem can lock out physical access
to the storage" and so they had no choice in the matter.

Other than that, nobody really seems to understand or care about all
these nasty little mmap() corner cases that we've seen corrupt user
data or expose stale data to users over many years.....

> > i.e. nothing in the VFS or page fault IO path prevents this race
> > condition:
> > 
> > P0				P1
> > fallocate
> > page cache invalidation
> > 				page fault
> > 				read data
> > punch out data extents
> > 				<data exposed to userspace is stale>
> > 				<data exposed to userspace has no
> > 				backing store allocated>
> > 
> > 
> > That's where the ext4 and XFS internal rwsem come into play:
> > 
> > fallocate
> > down_write(mmaplock)
> > page cache invalidation
> > 				page fault
> > 				down_read(mmaplock)
> > 				<blocks>
> > punch out data
> > up_write(mmaplock)
> > 				<unblocks>
> > 				<sees hole>
> > 				<allocates zeroed pages in page cache>
> > 
> > And there's no stale data exposure to userspace.
> 
> Got it. I noticed that both fuse/nfs seem to have reversed the
> order of operations. They call the server to punch out data first
> and then truncate the page cache. And that should mean that an
> mmap reader will not see stale data after fallocate(punch_hole)
> has finished.

Yes, but that doesn't prevent page fault races from occurring, it
just changes the nature of them. Such as.....

> > There is nothing stopping, say, memory reclaim from reclaiming pages
> > over the range while the hole is being punched, then having the
> > application refault them while the backing store is being freed.
> > While the page fault read IO is in progress, there's nothing
> > stopping the filesystem from freeing those blocks, nor reallocating
> > them and writing something else to them (e.g. metadata). So they
> > could read someone else's data.
> > 
> > Even worse: the page fault is a write fault, it lands in a hole, has
> > space allocated, the page cache is zeroed, page marked dirty, and
> > then the hole punch calls truncate_pagecache_range() which tosses
> > away the zeroed page and the data the userspace application wrote
> > to the page.
> 
> But isn't that supposed to happen?

Right, it isn't supposed to happen, but it can happen if
page_mkwrite doesn't serialise against fallocate(). i.e. without an
i_mmap_sem, nothing in the mm page fault paths serialises the page
fault against the filesystem fallocate operation in progress.

Indeed, looking at fuse_page_mkwrite(), it just locks the page,
checks the page->mapping hasn't changed (that's one of those
"doesn't work for hole punching page invalidation" checks that I
mentioned!) and then it waits for page writeback to complete. IOWs,
fuse will allow a clean page in cache to be dirtied without the
filesystem actually locking anything or doing any sort of internal
serialisation operation.
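
For comparison, an ext4-style ->page_mkwrite that does serialise
against fallocate would look roughly like this (a sketch only,
borrowing the fi->i_mmap_sem proposed in this series;
fuse_wait_on_page_writeback() is the existing fuse helper):

	static vm_fault_t fuse_page_mkwrite(struct vm_fault *vmf)
	{
		struct page *page = vmf->page;
		struct inode *inode = file_inode(vmf->vma->vm_file);
		struct fuse_inode *fi = get_fuse_inode(inode);

		file_update_time(vmf->vma->vm_file);
		/* blocks while a hole punch holds the rwsem exclusively */
		down_read(&fi->i_mmap_sem);
		lock_page(page);
		if (page->mapping != inode->i_mapping) {
			unlock_page(page);
			up_read(&fi->i_mmap_sem);
			return VM_FAULT_NOPAGE;
		}
		fuse_wait_on_page_writeback(inode, page->index);
		up_read(&fi->i_mmap_sem);
		return VM_FAULT_LOCKED;
	}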

IOWs, there is nothing stopping an application on fuse from hitting
this data corruption when a concurrent hole punch is run:

 P0				P1
 <read() loop to find zeros>
 fallocate
 write back dirty data
 punch out data extents
 .....
 				<app reads data via mmap>
				  read fault, clean page in cache!
 				<app writes data via mmap>
 				  write page fault
				  page_mkwrite
				    <page is locked just fine>
				  page is dirtied.
				<app writes new data to page>
 .....
 page cache invalidation
   <app's dirty page thrown away>
				.....
				<app reads data via mmap>
				  read page fault
				    <maps punched hole, returns zeros>
				app detects data corruption

That can happen quite easily - just go put a "sparsify" script into
cron so that runs of zeroes in files are converted into holes to
free up disk space every night....

> If fallocate(hole_punch) and mmaped
> write are happening at the same time, then there is no guarantee
> in what order they will execute.

It's not "order of exceution" that is the problem here - it's
guaranteeing *atomic execution* that is the problem. See the example
above - by not locking out page faults, fallocate() does not execute
atomically w.r.t. mmap() access to the file, and hence we end up
losing the changes made to the data via mmap.

That's precisely what the i_mmap_sem fixes. It *guarantees* the
ordering of the fallocate() operation and the page fault based
access to the underlying data by ensuring that the *operations
execute atomically* with respect to each other. And, by definition,
that atomicity of execution removes all the races that can lead to
data loss, corruption and/or stale data exposure....

> The app might read back the data it wrote
> or might read back zeros depending on the order of execution (even
> with proper locking).

That behaviour is what "proper locking" provides. If you don't
have an i_mmap_sem to guarantee serialisation of page faults against
fallocate (i.e. "unproper locking"), then you also can get stale
data, the wrong data, data loss, access-after-free, overwrite of
metadata or other people's data, etc.

> > The application then refaults the page, reading stale data off
> > disk instead of seeing what it had already written to the page....
> > 
> > And unlike truncated pages, the mm/ code cannot reliably detect
> > invalidation races on lockless lookup of pages that are within EOF.
> > They rely on truncate changing the file size before page
> > invalidation to detect races as page->index then points beyond EOF.
> > Hole punching does not change inode size, so the page cache lookups
> > cannot tell the difference between a new page that just needs IO to
> > initialise the data and a page that has just been invalidated....
> > 
> > IOWs, there are many ways things can go wrong with hole punch, and
> > the only way to avoid them all is to invalidate and lock out the
> > page cache before starting the fallocate operation. i.e.:
> > 
> > 	1. lock up the entire IO path (vfs and page fault)
> > 	2. drain the AIO+DIO path
> > 	3. write back dirty pages
> > 	4. invalidate the page cache
> 
> I see that this is definitely safe. Stop all read/write/faults/aio/dio
> before proceeding with punching the hole and invalidating the page cache.
> 
> I think for my purpose, I need to take fi->i_mmap_sem in the memory
> range freeing path and do exactly all of the above to make
> sure that no I/O, fault or AIO/DIO is going on before I take
> away the memory range I have allocated for that inode offset. This
> is, I think, very similar to assigning blocks/extents and taking
> them away. In that code path I am already taking care of
> taking the inode lock as well as i_mmap_sem. But I have not taken
> care of the AIO/DIO stuff. I will introduce that too.
> 
> For the time being I will handle these possible fallocate/ftruncate
> races in a separate patch series. To me it makes sense to do what
> ext4/xfs are doing. But there might be more to it when it comes
> to remote filesystems...

Remote filesystems introduce a whole new range of data coherency
problems that are outside the scope of mmap() vs fallocate()
serialisation. That is, page fault vs fallocate serialisation is a
local client serialisation condition, not a remote filesystem
data coherency issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
  2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
@ 2020-08-17 16:53   ` Jan Kara
  2020-08-17 17:22     ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2020-08-17 16:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Dan Williams, linux-nvdimm

On Fri 07-08-20 15:55:08, Vivek Goyal wrote:
> virtiofs device has a range of memory which is mapped into file inodes
> using dax. This memory is mapped in qemu on host and maps different
> sections of real file on host. Size of this memory is limited
> (determined by administrator) and depending on filesystem size, we will
> soon reach a situation where all the memory is in use and we need to
> reclaim some.
> 
> As part of reclaim process, we will need to make sure that there are
> no active references to pages (taken by get_user_pages()) on the memory
> range we are trying to reclaim. I am planning to use
> dax_layout_busy_page() for this. But in current form this is per inode
> and scans through all the pages of the inode.
> 
> We want to reclaim only a portion of memory (say a 2MB page). So we want
> to make sure that only that 2MB range of pages has no references
> (and we don't want to unmap all the pages of the inode).
> 
> Hence, create a range version of this function named
> dax_layout_busy_page_range() which can be used to pass a range which
> needs to be unmapped.
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm@lists.01.org
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

The API looks OK. Some comments WRT the implementation below.

> diff --git a/fs/dax.c b/fs/dax.c
> index 11b16729b86f..0d51b0fbb489 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  	return xa_mk_internal(VM_FAULT_FALLBACK);
>  }
>  
> -/**
> - * dax_layout_busy_page - find first pinned page in @mapping
> - * @mapping: address space to scan for a page with ref count > 1
> - *
> - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> - * 'onlined' to the page allocator so they are considered idle when
> - * page->count == 1. A filesystem uses this interface to determine if
> - * any page in the mapping is busy, i.e. for DMA, or other
> - * get_user_pages() usages.
> - *
> - * It is expected that the filesystem is holding locks to block the
> - * establishment of new mappings in this address_space. I.e. it expects
> - * to be able to run unmap_mapping_range() and subsequently not race
> - * mapping_mapped() becoming true.
> +/*
> + * Partial pages are included. If end is LLONG_MAX, pages in the range from
> + * start to end of the file are included.
>   */

I think the big kerneldoc comment should stay with
dax_layout_busy_page_range() since dax_layout_busy_page() will be just a
trivial wrapper around it..

> -struct page *dax_layout_busy_page(struct address_space *mapping)
> +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> +					loff_t start, loff_t end)
>  {
> -	XA_STATE(xas, &mapping->i_pages, 0);
>  	void *entry;
>  	unsigned int scanned = 0;
>  	struct page *page = NULL;
> +	pgoff_t start_idx = start >> PAGE_SHIFT;
> +	pgoff_t end_idx = end >> PAGE_SHIFT;
> +	XA_STATE(xas, &mapping->i_pages, start_idx);
> +	loff_t len, lstart = round_down(start, PAGE_SIZE);
>  
>  	/*
>  	 * In the 'limited' case get_user_pages() for dax is disabled.
> @@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
>  		return NULL;
>  
> +	/* If end == LLONG_MAX, all pages from start till end of file */
> +	if (end == LLONG_MAX) {
> +		end_idx = ULONG_MAX;
> +		len = 0;
> +	} else {
> +		/* length is being calculated from lstart and not start.
> +		 * This is due to behavior of unmap_mapping_range(). If
> +		 * start is, say, 4094 and end is 4096 then we want to
> +		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
> +		 * will unmap only page at idx 0. If we calculate len
> +		 * from the rounded down start, this problem should not
> +		 * happen.
> +		 */
> +		len = end - lstart + 1;
> +	}

Maybe it would be more understandable to use
	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1);
below and avoid all this rounding and special-casing.
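
i.e. something like (a sketch; note unmap_mapping_pages() also takes
an even_cows flag as its fourth argument):

	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);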

> +
>  	/*
>  	 * If we race get_user_pages_fast() here either we'll see the
>  	 * elevated page count in the iteration and wait, or
> @@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	 * guaranteed to either see new references or prevent new
>  	 * references from being established.
>  	 */
> -	unmap_mapping_range(mapping, 0, 0, 0);
> +	unmap_mapping_range(mapping, start, len, 0);
>  
>  	xas_lock_irq(&xas);
> -	xas_for_each(&xas, entry, ULONG_MAX) {
> +	xas_for_each(&xas, entry, end_idx) {
>  		if (WARN_ON_ONCE(!xa_is_value(entry)))
>  			continue;
>  		if (unlikely(dax_is_locked(entry)))
> @@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	xas_unlock_irq(&xas);
>  	return page;
>  }
> +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> +
> +/**
> + * dax_layout_busy_page - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if
> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true.
> + */
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> +	return dax_layout_busy_page_range(mapping, 0, 0);

Should the 'end' rather be LLONG_MAX?

Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev
  2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
@ 2020-08-17 16:57   ` Jan Kara
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2020-08-17 16:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Christoph Hellwig, Dan Williams, linux-nvdimm

On Fri 07-08-20 15:55:07, Vivek Goyal wrote:
> virtiofs does not have a block device but it has dax device.
> Modify bdev_dax_pgoff() to be able to handle that.
> 
> If there is no bdev, that means dax offset is 0. (It can't be a partition
> block device starting at an offset in dax device).
> 
> This is a little hackish. There have been discussions about getting rid
> of dax not supporting partitions.
> 
> https://lore.kernel.org/linux-fsdevel/20200107125159.GA15745@infradead.org/
> 
> IMHO, this path can easily break exisitng users. For example
> ioctl(BLKPG_ADD_PARTITION) will start breaking on block devices
> supporting DAX. Also, I personally find it very useful to be able to
> partition dax devices and still be able to use DAX.
> 
> Alternatively, I tried to store offset into dax device information in iomap
> interface, but that got NACKed.
> 
> https://lore.kernel.org/linux-fsdevel/20200217133117.GB20444@infradead.org/
> 
> I can't think of a good path to solve this issue properly. So to make
> progress, it seems this patch is the least bad option for now and I hope
> we can take it.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-nvdimm@lists.01.org

This patch looks OK to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/super.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 8e32345be0f7..c4bec437e88b 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
>  int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
>  		pgoff_t *pgoff)
>  {
> -	phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
> +	sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> +	phys_addr_t phys_off = (start_sect + sector) * 512;
>  
>  	if (pgoff)
>  		*pgoff = PHYS_PFN(phys_off);
> -- 
> 2.25.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page()
  2020-08-17 16:53   ` Jan Kara
@ 2020-08-17 17:22     ` Vivek Goyal
  0 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-17 17:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-kernel, virtio-fs, miklos, stefanha,
	dgilbert, Dan Williams, linux-nvdimm

On Mon, Aug 17, 2020 at 06:53:39PM +0200, Jan Kara wrote:
> On Fri 07-08-20 15:55:08, Vivek Goyal wrote:
> > virtiofs device has a range of memory which is mapped into file inodes
> > using dax. This memory is mapped in qemu on host and maps different
> > sections of real file on host. Size of this memory is limited
> > (determined by administrator) and depending on filesystem size, we will
> > soon reach a situation where all the memory is in use and we need to
> > reclaim some.
> > 
> > As part of reclaim process, we will need to make sure that there are
> > no active references to pages (taken by get_user_pages()) on the memory
> > range we are trying to reclaim. I am planning to use
> > dax_layout_busy_page() for this. But in current form this is per inode
> > and scans through all the pages of the inode.
> > 
> > We want to reclaim only a portion of memory (say a 2MB page). So we want
> > to make sure that only that 2MB range of pages has no references
> > (and we don't want to unmap all the pages of the inode).
> > 
> > Hence, create a range version of this function named
> > dax_layout_busy_page_range() which can be used to pass a range which
> > needs to be unmapped.
> > 
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-nvdimm@lists.01.org
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> 
> The API looks OK. Some comments WRT the implementation below.
> 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 11b16729b86f..0d51b0fbb489 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
> >  	return xa_mk_internal(VM_FAULT_FALLBACK);
> >  }
> >  
> > -/**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > - * @mapping: address space to scan for a page with ref count > 1
> > - *
> > - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > - * 'onlined' to the page allocator so they are considered idle when
> > - * page->count == 1. A filesystem uses this interface to determine if
> > - * any page in the mapping is busy, i.e. for DMA, or other
> > - * get_user_pages() usages.
> > - *
> > - * It is expected that the filesystem is holding locks to block the
> > - * establishment of new mappings in this address_space. I.e. it expects
> > - * to be able to run unmap_mapping_range() and subsequently not race
> > - * mapping_mapped() becoming true.
> > +/*
> > + * Partial pages are included. If end is LLONG_MAX, pages in the range from
> > + * start to end of the file are included.
> >   */
> 
> I think the big kerneldoc comment should stay with
> dax_layout_busy_page_range() since dax_layout_busy_page() will be just a
> trivial wrapper around it..

Hi Jan,

Thanks for the review.

Will move the kerneldoc comment.


> 
> > -struct page *dax_layout_busy_page(struct address_space *mapping)
> > +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> > +					loff_t start, loff_t end)
> >  {
> > -	XA_STATE(xas, &mapping->i_pages, 0);
> >  	void *entry;
> >  	unsigned int scanned = 0;
> >  	struct page *page = NULL;
> > +	pgoff_t start_idx = start >> PAGE_SHIFT;
> > +	pgoff_t end_idx = end >> PAGE_SHIFT;
> > +	XA_STATE(xas, &mapping->i_pages, start_idx);
> > +	loff_t len, lstart = round_down(start, PAGE_SIZE);
> >  
> >  	/*
> >  	 * In the 'limited' case get_user_pages() for dax is disabled.
> > @@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> >  		return NULL;
> >  
> > +	/* If end == LLONG_MAX, all pages from start till end of file */
> > +	if (end == LLONG_MAX) {
> > +		end_idx = ULONG_MAX;
> > +		len = 0;
> > +	} else {
> > +		/* length is being calculated from lstart and not start.
> > +		 * This is due to behavior of unmap_mapping_range(). If
> > +		 * start is, say, 4094 and end is 4096 then we want to
> > +		 * unmap two pages, idx 0 and 1. But unmap_mapping_range()
> > +		 * will unmap only page at idx 0. If we calculate len
> > +		 * from the rounded down start, this problem should not
> > +		 * happen.
> > +		 */
> > +		len = end - lstart + 1;
> > +	}
> 
> Maybe it would be more understandable to use
> 	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1);
> below and avoid all this rounding and special-casing.

Will do.

> 
> > +
> >  	/*
> >  	 * If we race get_user_pages_fast() here either we'll see the
> >  	 * elevated page count in the iteration and wait, or
> > @@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	 * guaranteed to either see new references or prevent new
> >  	 * references from being established.
> >  	 */
> > -	unmap_mapping_range(mapping, 0, 0, 0);
> > +	unmap_mapping_range(mapping, start, len, 0);
> >  
> >  	xas_lock_irq(&xas);
> > -	xas_for_each(&xas, entry, ULONG_MAX) {
> > +	xas_for_each(&xas, entry, end_idx) {
> >  		if (WARN_ON_ONCE(!xa_is_value(entry)))
> >  			continue;
> >  		if (unlikely(dax_is_locked(entry)))
> > @@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	xas_unlock_irq(&xas);
> >  	return page;
> >  }
> > +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> > +
> > +/**
> > + * dax_layout_busy_page - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> > + * any page in the mapping is busy, i.e. for DMA, or other
> > + * get_user_pages() usages.
> > + *
> > + * It is expected that the filesystem is holding locks to block the
> > + * establishment of new mappings in this address_space. I.e. it expects
> > + * to be able to run unmap_mapping_range() and subsequently not race
> > + * mapping_mapped() becoming true.
> > + */
> > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > +{
> > +	return dax_layout_busy_page_range(mapping, 0, 0);
> 
> Should the 'end' rather be LLONG_MAX?

My bad. I forgot to change this. The previous version of the patches
had the semantic that 'end == 0' signifies the end of the file. Yes,
'end' should be LLONG_MAX now. Will fix it.
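
i.e. the wrapper will become (sketch):

	struct page *dax_layout_busy_page(struct address_space *mapping)
	{
		return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
	}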

Thanks
Vivek

> 
> Otherwise the patch looks good to me.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport
       [not found]       ` <CAAfnVBk+Hmcm2ftd3wOK-P2NyYQ7z4Wrf1JKhLJaNkCZBLoo6g@mail.gmail.com>
@ 2020-08-17 20:29         ` Vivek Goyal
  0 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2020-08-17 20:29 UTC (permalink / raw)
  To: Gurchetan Singh
  Cc: Michael S. Tsirkin, linux-fsdevel, open list, virtio-fs, miklos,
	stefanha, dgilbert, Sebastien Boeuf, kbuild test robot, kvm

On Thu, Aug 13, 2020 at 07:51:56PM -0700, Gurchetan Singh wrote:
> On Mon, Aug 10, 2020 at 7:50 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > On Mon, Aug 10, 2020 at 10:05:17AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Aug 07, 2020 at 03:55:10PM -0400, Vivek Goyal wrote:
> > > > From: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > > >
> > > > On PCI the shm regions are found using capability entries;
> > > > find a region by searching for the capability.
> > > >
> > > > Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > Signed-off-by: kbuild test robot <lkp@intel.com>
> > > > Cc: kvm@vger.kernel.org
> > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > >
> > > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > >
> >
> > [..]
> > > > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > > > +                         struct virtio_shm_region *region, u8 id)
> > > > +{
> > > > +   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > > +   struct pci_dev *pci_dev = vp_dev->pci_dev;
> > > > +   u8 bar;
> > > > +   u64 offset, len;
> > > > +   phys_addr_t phys_addr;
> > > > +   size_t bar_len;
> > > > +
> > > > +   if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > > > +           return false;
> > > > +   }
> > > > +
> > > > +   phys_addr = pci_resource_start(pci_dev, bar);
> > > > +   bar_len = pci_resource_len(pci_dev, bar);
> > > > +
> > > > +   if ((offset + len) < offset) {
> > > > +           dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
> > > > +                   __func__);
> > > > +           return false;
> > > > +   }
> > > > +
> > > > +   if (offset + len > bar_len) {
> > > > +           dev_err(&pci_dev->dev, "%s: bar shorter than cap offset+len\n",
> > > > +                   __func__);
> > > > +           return false;
> > > > +   }
> > >
> > > Maybe move this to a common header so the checks can be reused by
> > > other transports? Can be a patch on top.
> >
> > Will do as patch on top once these patches get merged.
> >
> 
> There are also some minor checkpatch complaints. I fixed them with the
> series I sent out with virtio-gpu.

Thanks Gurchetan. I will post the patch with the V3 posting anyway. We both
need these patches, so whichever gets merged into the tree first, the
other person can use it.

Vivek


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2020-08-17 20:29 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 19:55 [PATCH v2 00/20] virtiofs: Add DAX support Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 01/20] dax: Modify bdev_dax_pgoff() to handle NULL bdev Vivek Goyal
2020-08-17 16:57   ` Jan Kara
2020-08-07 19:55 ` [PATCH v2 02/20] dax: Create a range version of dax_layout_busy_page() Vivek Goyal
2020-08-17 16:53   ` Jan Kara
2020-08-17 17:22     ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 03/20] virtio: Add get_shm_region method Vivek Goyal
2020-08-10 13:47   ` Michael S. Tsirkin
2020-08-10 13:54     ` Vivek Goyal
2020-08-10 14:02   ` Michael S. Tsirkin
2020-08-10 14:06   ` Michael S. Tsirkin
2020-08-07 19:55 ` [PATCH v2 04/20] virtio: Implement get_shm_region for PCI transport Vivek Goyal
2020-08-10 14:05   ` Michael S. Tsirkin
2020-08-10 14:50     ` Vivek Goyal
     [not found]       ` <CAAfnVBk+Hmcm2ftd3wOK-P2NyYQ7z4Wrf1JKhLJaNkCZBLoo6g@mail.gmail.com>
2020-08-17 20:29         ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 05/20] virtio: Implement get_shm_region for MMIO transport Vivek Goyal
2020-08-10 14:03   ` Michael S. Tsirkin
2020-08-07 19:55 ` [PATCH v2 06/20] virtiofs: Provide a helper function for virtqueue initialization Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 07/20] fuse: Get rid of no_mount_options Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 08/20] fuse,virtiofs: Add a mount option to enable dax Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 09/20] virtio_fs, dax: Set up virtio_fs dax_device Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 10/20] fuse,virtiofs: Keep a list of free dax memory ranges Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 11/20] fuse: implement FUSE_INIT map_alignment field Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 12/20] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 13/20] fuse, dax: Implement dax read/write operations Vivek Goyal
2020-08-10 22:06   ` Dave Chinner
2020-08-11 17:44     ` Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 14/20] fuse,dax: add DAX mmap support Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 15/20] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
2020-08-10 22:22   ` Dave Chinner
2020-08-11 17:55     ` Vivek Goyal
2020-08-12  1:23       ` Dave Chinner
2020-08-12 21:10         ` Vivek Goyal
2020-08-13  5:12           ` Dave Chinner
2020-08-07 19:55 ` [PATCH v2 16/20] fuse,virtiofs: Define dax address space operations Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 17/20] fuse,virtiofs: Maintain a list of busy elements Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 18/20] fuse: Release file in process context Vivek Goyal
2020-08-10  8:29   ` Miklos Szeredi
2020-08-10 15:48     ` Vivek Goyal
2020-08-10 19:37     ` Vivek Goyal
2020-08-10 19:39       ` Miklos Szeredi
2020-08-07 19:55 ` [PATCH v2 19/20] fuse: Take inode lock for dax inode truncation Vivek Goyal
2020-08-07 19:55 ` [PATCH v2 20/20] fuse,virtiofs: Add logic to free up a memory range Vivek Goyal
2020-08-10  7:29 ` [PATCH v2 00/20] virtiofs: Add DAX support Miklos Szeredi
2020-08-10 13:08   ` Vivek Goyal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).