LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC 0/2] kvm "fake DAX" device flushing
@ 2018-04-05 10:48 Pankaj Gupta
  2018-04-05 10:48 ` [RFC 1/2] kvm: add virtio pmem driver Pankaj Gupta
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Pankaj Gupta @ 2018-04-05 10:48 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric, hch,
	marcel, mst, niteshnarayanlal, imammedo, pagupta

We are sharing RFC version of 'fake DAX' flushing
interface for feedback. This is still work in progress
and not yet ready for merging.

Prototype implements two major parts:

- Qemu virtio-pmem device
  It exposes a persistent memory range to KVM guest which at host side is file
  backed memory and works as persistent memory device. In addition to this it
  provides a virtio flushing interface for KVM guest to do a Qemu side sync for
  guest DAX persistent memory range.  

- Guest virtio-pmem driver
  Reads persistent memory range from paravirt device and registers with 'nvdimm_bus'.
  'nvdimm/pmem' driver uses this information to allocate persistent memory range. 
  Also, we have implemented guest side of VIRTIO flushing interface.
  
Changes from previous RFC:

- Reuse existing 'pmem' code instead of creating an entirely new block driver.
- Use VIRTIO driver to register memory information with nvdimm_bus and create
  region_type accordingly. 
- Use VIRTIO flushing interface from existing pmem driver code based on
  registered flushign mechanism.

We have done the implementation based on suggestions here [1]. Previous RFC is
shared here [2]. Details of project idea for 'fake DAX' flushing is shared 
here [3] & [4].

[1] https://marc.info/?l=linux-mm&m=150782346802290&w=2
[2] https://marc.info/?l=kvm&m=151630416506527&w=2
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html

Work yet to be done:

- Qemu RAM address handling independent of PC-DIMM so that memory
  operations(get_free_address) can be used for VIRTIO device type as well. 
  (David Hildenbrand CCed has a prototype for this).
- Qemu device flush functionality trigger with guest fsync on file.
- Qemu live migration work when host page cache is used.
- Multiple virtio-pmem disks support.
- Prepare virtio spec after we get feedback on current approach.

 drivers/nvdimm/region_devs.c     |    7 ++
 drivers/virtio/Kconfig           |   12 +++
 drivers/virtio/Makefile          |    1 
 drivers/virtio/virtio_pmem.c     |  122 +++++++++++++++++++++++++++++++++++++++
 include/linux/libnvdimm.h        |    2 
 include/uapi/linux/virtio_ids.h  |    1 
 include/uapi/linux/virtio_pmem.h |   61 +++++++++++++++++++
 7 files changed, 206 insertions(+)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 1/2] kvm: add virtio pmem driver
  2018-04-05 10:48 [RFC 0/2] kvm "fake DAX" device flushing Pankaj Gupta
@ 2018-04-05 10:48 ` Pankaj Gupta
  2018-04-05 10:48 ` [RFC 2/2] pmem: device flush over VIRTIO Pankaj Gupta
  2018-04-05 10:48 ` [RFC] qemu: Add virtio pmem device Pankaj Gupta
  2 siblings, 0 replies; 10+ messages in thread
From: Pankaj Gupta @ 2018-04-05 10:48 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric, hch,
	marcel, mst, niteshnarayanlal, imammedo, pagupta

  This patch adds virtio-pmem driver for KVM guest.
  Guest reads the persistent memory range information
  from Qemu over VIRTIO and registers it on 'nvdimm_bus'.
  It also creates a 'nd_region' object with the persistent 
  memory range information so that existing 'nvdimm/pmem'
  driver can allocate this memeory into system memory map. 
  This way 'virtio-pmem' driver uses existing functionality 
  of pmem driver to register persistent memory compatible 
  for DAX capable filesystems.

  It also provides function to perform guest flush over 
  VIRTIO from 'pmem' driver when userspace performs flush on 
  DAX memory range.

  There is work to do including flush at host side and other
  features of pmem driver.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/virtio/Kconfig           |  12 ++++
 drivers/virtio/Makefile          |   1 +
 drivers/virtio/virtio_pmem.c     | 122 +++++++++++++++++++++++++++++++++++++++
 include/linux/libnvdimm.h        |   2 +
 include/uapi/linux/virtio_ids.h  |   1 +
 include/uapi/linux/virtio_pmem.h |  61 ++++++++++++++++++++
 6 files changed, 199 insertions(+)
 create mode 100644 drivers/virtio/virtio_pmem.c
 create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cff773f..437de03 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -38,6 +38,18 @@ config VIRTIO_PCI_LEGACY
 
 	  If unsure, say Y.
 
+config VIRTIO_PMEM
+	tristate "Virtio pmem driver"
+	depends on VIRTIO
+	---help---
+	 This driver adds persistent memory range to nd_region and registers
+	 with nvdimm bus. NVDIMM 'pmem' driver later allocates a persistent
+	 memory range on the memory information added by this driver. In addition
+	 to this, 'virtio-pmem' driver also provides a paravirt flushing interface
+	 from guest to host.
+
+	 If unsure, say M.
+
 config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5..cbe91c6 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
new file mode 100644
index 0000000..4fffbb5
--- /dev/null
+++ b/drivers/virtio/virtio_pmem.c
@@ -0,0 +1,122 @@
+/*
+ * virtio-pmem driver
+ */
+
+#include <linux/virtio.h>
+#include <linux/swap.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/oom.h>
+#include <linux/wait.h>
+#include <linux/magic.h>
+#include <linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+
+static void pmem_flush_done(struct virtqueue *vq)
+{
+	return;
+};
+
+static int init_vq(struct virtio_pmem *vpmem)
+{
+	struct virtqueue *vq;
+
+	/* single vq */
+	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+				pmem_flush_done, "flush_queue");
+
+	if (IS_ERR(vq))
+		return PTR_ERR(vq);
+
+	return 0;
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+	int err = 0;
+	struct resource res;
+	struct virtio_pmem *vpmem;
+	struct nvdimm_bus *nvdimm_bus;
+	struct nd_region_desc ndr_desc;
+	int nid = dev_to_node(&vdev->dev);
+	static struct nvdimm_bus_descriptor nd_desc;
+
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+			GFP_KERNEL);
+	if (!vpmem) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	vpmem->vdev = vdev;
+	err = init_vq(vpmem);
+	if (err)
+		goto out;
+
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			start, &vpmem->start);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			size, &vpmem->size);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			align, &vpmem->align);
+
+	res.start = vpmem->start;
+	res.end   = vpmem->start + vpmem->size-1;
+
+	memset(&nd_desc, 0, sizeof(nd_desc));
+	nd_desc.provider_name = "virtio-pmem";
+	nd_desc.module = THIS_MODULE;
+	nvdimm_bus = nvdimm_bus_register(&vdev->dev, &nd_desc);
+
+	if (!nvdimm_bus)
+		goto out_nd;
+	dev_set_drvdata(&vdev->dev, nvdimm_bus);
+
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+	ndr_desc.res = &res;
+	ndr_desc.numa_node = nid;
+	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+	set_bit(ND_REGION_VIRTIO, &ndr_desc.flags);
+
+	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+		goto out_nd;
+
+	virtio_device_ready(vdev);
+	return 0;
+
+out_nd:
+	nvdimm_bus_unregister(nvdimm_bus);
+out:
+	dev_err(&vdev->dev, "failed to register virtio pmem ranges\n");
+	vdev->config->del_vqs(vdev);
+	return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	vdev->config->del_vqs(vdev);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtio_pmem_probe,
+	.remove			= virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index f8109dd..ce9bf49 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -47,6 +47,8 @@ enum {
 
 	/* region flag indicating to direct-map persistent memory by default */
 	ND_REGION_PAGEMAP = 0,
+	/* region flag indicating to use VIRTIO flush interface for pmem */
+	ND_REGION_VIRTIO  = 1,
 
 	/* mark newly adjusted resources as requiring a label update */
 	DPA_RESOURCE_ADJUSTED = 1 << 0,
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2..5ebd049 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         21 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 0000000..4a46092
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,61 @@
+/*
+ * Virtio pmem Driver
+ *
+ * Discovers persitent memory information to guest
+ * and provides a virtio flushing interface
+ *
+ */
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/types.h>
+#include <linux/virtio_types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ring.h>
+
+
+struct virtio_pmem_config {
+
+	uint64_t start;
+	uint64_t size;
+	uint64_t align;
+};
+
+struct virtio_pmem {
+
+	struct virtio_device *vdev;
+	struct virtqueue *req_vq;
+
+	uint64_t start;
+	uint64_t size;
+	uint64_t align;
+} __packed;
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+void virtio_pmem_flush(struct device *dev)
+{
+	struct scatterlist sg;
+	struct virtio_device *vdev  = dev_to_virtio(dev->parent->parent);
+	struct virtio_pmem   *vpmem = vdev->priv;
+	char *buf = "FLUSH";
+	int err;
+
+	sg_init_one(&sg, buf, sizeof(buf));
+
+	err = virtqueue_add_outbuf(vpmem->req_vq, &sg, 1, buf, GFP_KERNEL);
+
+	if (err) {
+		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+		return;
+	}
+
+	virtqueue_kick(vpmem->req_vq);
+};
+
+#endif
-- 
2.9.3

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 2/2] pmem: device flush over VIRTIO
  2018-04-05 10:48 [RFC 0/2] kvm "fake DAX" device flushing Pankaj Gupta
  2018-04-05 10:48 ` [RFC 1/2] kvm: add virtio pmem driver Pankaj Gupta
@ 2018-04-05 10:48 ` Pankaj Gupta
  2018-04-05 10:48 ` [RFC] qemu: Add virtio pmem device Pankaj Gupta
  2 siblings, 0 replies; 10+ messages in thread
From: Pankaj Gupta @ 2018-04-05 10:48 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric, hch,
	marcel, mst, niteshnarayanlal, imammedo, pagupta

 This patch adds functionality to perform flush
 from guest to host over VIRTIO when 'ND_REGION_VIRTIO' 
 flag is set on nd_negion. This flag is set by 'virtio-pmem' 
 driver for virtio flush operation. 

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/nvdimm/region_devs.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index abaf38c..1c6cd2a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -20,6 +20,7 @@
 #include <linux/nd.h>
 #include "nd-core.h"
 #include "nd.h"
+#include <linux/virtio_pmem.h>
 
 /*
  * For readq() and writeq() on 32-bit builds, the hi-lo, lo-hi order is
@@ -1043,6 +1044,12 @@ void nvdimm_flush(struct nd_region *nd_region)
 	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i, idx;
 
+       /* call PV device flush */
+	if (test_bit(ND_REGION_VIRTIO, &nd_region->flags)) {
+		virtio_pmem_flush(&nd_region->dev);
+		return;
+	}
+
 	/*
 	 * Try to encourage some diversity in flush hint addresses
 	 * across cpus assuming a limited number of flush hints.
-- 
2.9.3

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC] qemu: Add virtio pmem device
  2018-04-05 10:48 [RFC 0/2] kvm "fake DAX" device flushing Pankaj Gupta
  2018-04-05 10:48 ` [RFC 1/2] kvm: add virtio pmem driver Pankaj Gupta
  2018-04-05 10:48 ` [RFC 2/2] pmem: device flush over VIRTIO Pankaj Gupta
@ 2018-04-05 10:48 ` Pankaj Gupta
  2018-04-05 11:26   ` David Hildenbrand
  2 siblings, 1 reply; 10+ messages in thread
From: Pankaj Gupta @ 2018-04-05 10:48 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric, hch,
	marcel, mst, niteshnarayanlal, imammedo, pagupta

 This patch adds virtio-pmem Qemu device.

 This device configures memory address range information with file
 backend type. It acts like persistent memory device for KVM guest.
 It presents the memory address range to virtio-pmem driver over
 virtio channel and does the block flush whenever there is request
 from guest to flush/sync. (Qemu part for backing file flush
 is yet to be implemented).

 Current code is a RFC to support guest with persistent memory 
 range & DAX. 

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 hw/virtio/Makefile.objs                     |   2 +-
 hw/virtio/virtio-pci.c                      |  44 +++++++++
 hw/virtio/virtio-pci.h                      |  14 +++
 hw/virtio/virtio-pmem.c                     | 133 ++++++++++++++++++++++++++++
 include/hw/pci/pci.h                        |   1 +
 include/hw/virtio/virtio-pmem.h             |  43 +++++++++
 include/standard-headers/linux/virtio_ids.h |   1 +
 7 files changed, 237 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/virtio-pmem.c
 create mode 100644 include/hw/virtio/virtio-pmem.h

diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 765d363c1f..bb5573d2ef 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -5,7 +5,7 @@ common-obj-y += virtio-bus.o
 common-obj-y += virtio-mmio.o
 
 obj-y += virtio.o virtio-balloon.o 
-obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
+obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o virtio-pmem.o
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
 obj-y += virtio-crypto.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio-crypto-pci.o
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c20537f31d..114ca05497 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2491,6 +2491,49 @@ static const TypeInfo virtio_rng_pci_info = {
     .class_init    = virtio_rng_pci_class_init,
 };
 
+/* virtio-pmem-pci */
+
+static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
+{
+    VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
+    DeviceState *vdev = DEVICE(&vpmem->vdev);
+
+    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
+    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
+}
+
+static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
+    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
+    k->realize = virtio_pmem_pci_realize;
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
+    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
+    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
+    pcidev_k->class_id = PCI_CLASS_OTHERS;
+}
+
+static void virtio_pmem_pci_instance_init(Object *obj)
+{
+    VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
+
+    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+                                TYPE_VIRTIO_PMEM);
+    object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
+                              &error_abort);
+}
+
+static const TypeInfo virtio_pmem_pci_info = {
+    .name          = TYPE_VIRTIO_PMEM_PCI,
+    .parent        = TYPE_VIRTIO_PCI,
+    .instance_size = sizeof(VirtIOPMEMPCI),
+    .instance_init = virtio_pmem_pci_instance_init,
+    .class_init    = virtio_pmem_pci_class_init,
+};
+
+
 /* virtio-input-pci */
 
 static Property virtio_input_pci_properties[] = {
@@ -2683,6 +2726,7 @@ static void virtio_pci_register_types(void)
     type_register_static(&virtio_balloon_pci_info);
     type_register_static(&virtio_serial_pci_info);
     type_register_static(&virtio_net_pci_info);
+    type_register_static(&virtio_pmem_pci_info);
 #ifdef CONFIG_VHOST_SCSI
     type_register_static(&vhost_scsi_pci_info);
 #endif
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index 813082b0d7..fe74fcad3f 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -19,6 +19,7 @@
 #include "hw/virtio/virtio-blk.h"
 #include "hw/virtio/virtio-net.h"
 #include "hw/virtio/virtio-rng.h"
+#include "hw/virtio/virtio-pmem.h"
 #include "hw/virtio/virtio-serial.h"
 #include "hw/virtio/virtio-scsi.h"
 #include "hw/virtio/virtio-balloon.h"
@@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
 typedef struct VirtIOGPUPCI VirtIOGPUPCI;
 typedef struct VHostVSockPCI VHostVSockPCI;
 typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
+typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
 
 /* virtio-pci-bus */
 
@@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
     VirtIOBlock vdev;
 };
 
+/*
+ * virtio-pmem-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
+#define VIRTIO_PMEM_PCI(obj) \
+        OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
+
+struct VirtIOPMEMPCI {
+    VirtIOPCIProxy parent_obj;
+    VirtIOPMEM vdev;
+};
+
 /*
  * virtio-balloon-pci: This extends VirtioPCIProxy.
  */
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
new file mode 100644
index 0000000000..28d06fc501
--- /dev/null
+++ b/hw/virtio/virtio-pmem.c
@@ -0,0 +1,133 @@
+/*
+ * Virtio pmem device
+ *
+ */
+
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "hw/virtio/virtio-pmem.h"
+
+
+static void virtio_pmem_system_reset(void *opaque)
+{
+
+}
+
+static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
+{
+    VirtQueueElement *elem;
+
+    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
+        return;
+    }
+    /* todo flush raw file */
+
+    virtio_notify(vdev, vq);
+    g_free(elem);
+
+}
+
+static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
+{
+    VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
+    struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *) config;
+
+    pmemcfg->start = pmem->start;
+    pmemcfg->size  = pmem->size;
+    pmemcfg->align = pmem->align;
+}
+
+static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t features,
+                                        Error **errp)
+{
+    virtio_add_feature(&features, VIRTIO_PMEM_PLUG);
+    return features;
+}
+
+
+static void virtio_pmem_realize(DeviceState *dev, Error **errp)
+{
+    VirtIODevice   *vdev   = VIRTIO_DEVICE(dev);
+    VirtIOPMEM     *pmem   = VIRTIO_PMEM(dev);
+    MachineState   *ms     = MACHINE(qdev_get_machine());
+    MemoryRegion   *mr;
+    PCMachineState *pcms   = PC_MACHINE(object_dynamic_cast(OBJECT(ms), TYPE_PC_MACHINE));
+    uint64_t addr;
+
+    if (!pmem->memdev) {
+        error_setg(errp, "virtio-pmem not set");
+        return;
+    }
+
+    mr               = host_memory_backend_get_memory(pmem->memdev, errp);
+    addr             = pcms->hotplug_memory.base;
+    pmem->start      = addr;
+    pmem->size       = memory_region_size(mr);
+    pmem->align      = memory_region_get_alignment(mr);
+
+    memory_region_init_alias(&pmem->mr, OBJECT(ms),
+                             "virtio_pmem-memory", mr, 0, pmem->size);
+
+    host_memory_backend_set_mapped(pmem->memdev, true);
+    virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
+                sizeof(struct virtio_pmem_config));
+
+    pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
+    qemu_register_reset(virtio_pmem_system_reset, pmem);
+}
+
+static void virtio_mem_check_memdev(Object *obj, const char *name, Object *val,
+                                    Error **errp)
+{
+    if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
+
+        char *path = object_get_canonical_path_component(val);
+        error_setg(errp, "Can't use already busy memdev: %s", path);
+        g_free(path);
+        return;
+    }
+
+    qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
+}
+
+static void virtio_pmem_instance_init(Object *obj)
+{
+
+    VirtIOPMEM *vm = VIRTIO_PMEM(obj);
+
+    object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
+                             (Object **)&vm->memdev,
+                             (void *) virtio_mem_check_memdev,
+                             OBJ_PROP_LINK_UNREF_ON_RELEASE,
+                             &error_abort);
+}
+
+
+static void virtio_pmem_class_init(ObjectClass *klass, void *data)
+{
+    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+
+    vdc->realize      =  virtio_pmem_realize;
+    vdc->get_config   =  virtio_pmem_get_config;
+    vdc->get_features =  virtio_pmem_get_features;
+}
+
+static TypeInfo virtio_pmem_info = {
+    .name          = TYPE_VIRTIO_PMEM,
+    .parent        = TYPE_VIRTIO_DEVICE,
+    .class_size    = sizeof(VirtIOPMEM),
+    .class_init    = virtio_pmem_class_init,
+    .instance_init = virtio_pmem_instance_init,
+};
+
+
+static void virtio_register_types(void)
+{
+    type_register_static(&virtio_pmem_info);
+}
+
+type_init(virtio_register_types)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 15ced9648c..0f9091e6cc 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -85,6 +85,7 @@ extern bool pci_available;
 #define PCI_DEVICE_ID_VIRTIO_RNG         0x1005
 #define PCI_DEVICE_ID_VIRTIO_9P          0x1009
 #define PCI_DEVICE_ID_VIRTIO_VSOCK       0x1012
+#define PCI_DEVICE_ID_VIRTIO_PMEM        0x1013
 
 #define PCI_VENDOR_ID_REDHAT             0x1b36
 #define PCI_DEVICE_ID_REDHAT_BRIDGE      0x0001
diff --git a/include/hw/virtio/virtio-pmem.h b/include/hw/virtio/virtio-pmem.h
new file mode 100644
index 0000000000..f4be2f54f3
--- /dev/null
+++ b/include/hw/virtio/virtio-pmem.h
@@ -0,0 +1,43 @@
+/*
+ * Virtio pmem Device
+ *
+ *
+ */
+
+#ifndef QEMU_VIRTIO_PMEM_H
+#define QEMU_VIRTIO_PMEM_H
+
+#include "hw/virtio/virtio.h"
+#include "exec/memory.h"
+#include "sysemu/hostmem.h"
+#include "standard-headers/linux/virtio_ids.h"
+#include "hw/boards.h"
+#include "hw/i386/pc.h"
+
+#define VIRTIO_PMEM_PLUG 0
+
+#define TYPE_VIRTIO_PMEM "virtio-pmem"
+
+#define VIRTIO_PMEM(obj) \
+        OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
+
+typedef struct VirtIOPMEM {
+
+    VirtIODevice parent_obj;
+    VirtQueue *rq_vq;
+    uint64_t start;
+    uint64_t size;
+    uint64_t align;
+
+    MemoryRegion mr;
+    HostMemoryBackend *memdev;
+} VirtIOPMEM;
+
+struct virtio_pmem_config {
+
+    uint64_t start;
+    uint64_t size;
+    uint64_t align;
+};
+
+#endif
diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
index 6d5c3b2d4f..5ebd04980d 100644
--- a/include/standard-headers/linux/virtio_ids.h
+++ b/include/standard-headers/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         21 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
-- 
2.14.3

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] qemu: Add virtio pmem device
  2018-04-05 10:48 ` [RFC] qemu: Add virtio pmem device Pankaj Gupta
@ 2018-04-05 11:26   ` David Hildenbrand
  2018-04-05 12:09     ` [Qemu-devel] " Pankaj Gupta
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2018-04-05 11:26 UTC (permalink / raw)
  To: Pankaj Gupta, linux-kernel, kvm, qemu-devel, linux-nvdimm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, xiaoguangrong.eric, hch, marcel,
	mst, niteshnarayanlal, imammedo

On 05.04.2018 12:48, Pankaj Gupta wrote:
>  This patch adds virtio-pmem Qemu device.
> 
>  This device configures memory address range information with file
>  backend type. It acts like persistent memory device for KVM guest.
>  It presents the memory address range to virtio-pmem driver over
>  virtio channel and does the block flush whenever there is request
>  from guest to flush/sync. (Qemu part for backing file flush
>  is yet to be implemented).
> 
>  Current code is a RFC to support guest with persistent memory 
>  range & DAX. 
> 
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  hw/virtio/Makefile.objs                     |   2 +-
>  hw/virtio/virtio-pci.c                      |  44 +++++++++
>  hw/virtio/virtio-pci.h                      |  14 +++
>  hw/virtio/virtio-pmem.c                     | 133 ++++++++++++++++++++++++++++
>  include/hw/pci/pci.h                        |   1 +
>  include/hw/virtio/virtio-pmem.h             |  43 +++++++++
>  include/standard-headers/linux/virtio_ids.h |   1 +
>  7 files changed, 237 insertions(+), 1 deletion(-)
>  create mode 100644 hw/virtio/virtio-pmem.c
>  create mode 100644 include/hw/virtio/virtio-pmem.h
> 
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 765d363c1f..bb5573d2ef 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -5,7 +5,7 @@ common-obj-y += virtio-bus.o
>  common-obj-y += virtio-mmio.o
>  
>  obj-y += virtio.o virtio-balloon.o 
> -obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> +obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o virtio-pmem.o
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
>  obj-y += virtio-crypto.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio-crypto-pci.o
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index c20537f31d..114ca05497 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2491,6 +2491,49 @@ static const TypeInfo virtio_rng_pci_info = {
>      .class_init    = virtio_rng_pci_class_init,
>  };
>  
> +/* virtio-pmem-pci */
> +
> +static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> +{
> +    VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
> +    DeviceState *vdev = DEVICE(&vpmem->vdev);
> +
> +    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> +    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> +}
> +
> +static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> +    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> +    k->realize = virtio_pmem_pci_realize;
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> +    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
> +    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> +    pcidev_k->class_id = PCI_CLASS_OTHERS;
> +}
> +
> +static void virtio_pmem_pci_instance_init(Object *obj)
> +{
> +    VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
> +
> +    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> +                                TYPE_VIRTIO_PMEM);
> +    object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
> +                              &error_abort);
> +}
> +
> +static const TypeInfo virtio_pmem_pci_info = {
> +    .name          = TYPE_VIRTIO_PMEM_PCI,
> +    .parent        = TYPE_VIRTIO_PCI,
> +    .instance_size = sizeof(VirtIOPMEMPCI),
> +    .instance_init = virtio_pmem_pci_instance_init,
> +    .class_init    = virtio_pmem_pci_class_init,
> +};
> +
> +
>  /* virtio-input-pci */
>  
>  static Property virtio_input_pci_properties[] = {
> @@ -2683,6 +2726,7 @@ static void virtio_pci_register_types(void)
>      type_register_static(&virtio_balloon_pci_info);
>      type_register_static(&virtio_serial_pci_info);
>      type_register_static(&virtio_net_pci_info);
> +    type_register_static(&virtio_pmem_pci_info);
>  #ifdef CONFIG_VHOST_SCSI
>      type_register_static(&vhost_scsi_pci_info);
>  #endif
> diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> index 813082b0d7..fe74fcad3f 100644
> --- a/hw/virtio/virtio-pci.h
> +++ b/hw/virtio/virtio-pci.h
> @@ -19,6 +19,7 @@
>  #include "hw/virtio/virtio-blk.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "hw/virtio/virtio-rng.h"
> +#include "hw/virtio/virtio-pmem.h"
>  #include "hw/virtio/virtio-serial.h"
>  #include "hw/virtio/virtio-scsi.h"
>  #include "hw/virtio/virtio-balloon.h"
> @@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
>  typedef struct VirtIOGPUPCI VirtIOGPUPCI;
>  typedef struct VHostVSockPCI VHostVSockPCI;
>  typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
> +typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
>  
>  /* virtio-pci-bus */
>  
> @@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
>      VirtIOBlock vdev;
>  };
>  
> +/*
> + * virtio-pmem-pci: This extends VirtioPCIProxy.
> + */
> +#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
> +#define VIRTIO_PMEM_PCI(obj) \
> +        OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
> +
> +struct VirtIOPMEMPCI {
> +    VirtIOPCIProxy parent_obj;
> +    VirtIOPMEM vdev;
> +};
> +
>  /*
>   * virtio-balloon-pci: This extends VirtioPCIProxy.
>   */
> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> new file mode 100644
> index 0000000000..28d06fc501
> --- /dev/null
> +++ b/hw/virtio/virtio-pmem.c
> @@ -0,0 +1,133 @@
> +/*
> + * Virtio pmem device
> + *
> + */
> +
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu-common.h"
> +#include "qemu/error-report.h"
> +#include "hw/virtio/virtio-pmem.h"
> +
> +
> +static void virtio_pmem_system_reset(void *opaque)
> +{
> +
> +}
> +
> +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    VirtQueueElement *elem;
> +
> +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +    if (!elem) {
> +        return;
> +    }
> +    /* todo flush raw file */
> +
> +    virtio_notify(vdev, vq);
> +    g_free(elem);
> +
> +}
> +
> +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> +{
> +    VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> +    struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *) config;
> +
> +    pmemcfg->start = pmem->start;
> +    pmemcfg->size  = pmem->size;
> +    pmemcfg->align = pmem->align;
> +}
> +
> +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t features,
> +                                        Error **errp)
> +{
> +    virtio_add_feature(&features, VIRTIO_PMEM_PLUG);
> +    return features;
> +}
> +
> +
> +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> +{
> +    VirtIODevice   *vdev   = VIRTIO_DEVICE(dev);
> +    VirtIOPMEM     *pmem   = VIRTIO_PMEM(dev);
> +    MachineState   *ms     = MACHINE(qdev_get_machine());
> +    MemoryRegion   *mr;
> +    PCMachineState *pcms   = PC_MACHINE(object_dynamic_cast(OBJECT(ms), TYPE_PC_MACHINE));
> +    uint64_t addr;
> +
> +    if (!pmem->memdev) {
> +        error_setg(errp, "virtio-pmem not set");
> +        return;
> +    }
> +
> +    mr               = host_memory_backend_get_memory(pmem->memdev, errp);
> +    addr             = pcms->hotplug_memory.base;
> +    pmem->start      = addr;
> +    pmem->size       = memory_region_size(mr);
> +    pmem->align      = memory_region_get_alignment(mr);
> +
> +    memory_region_init_alias(&pmem->mr, OBJECT(ms),
> +                             "virtio_pmem-memory", mr, 0, pmem->size);
> +
> +    host_memory_backend_set_mapped(pmem->memdev, true);
> +    virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> +                sizeof(struct virtio_pmem_config));
> +
> +    pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> +    qemu_register_reset(virtio_pmem_system_reset, pmem);


So right now you're just using some memdev for testing.

I assume that the memory region we will provide to the guest will be a
simple memory mapped raw file. Dirty tracking (using the kvm slot) will
be used to detect which blocks actually changed and have to be flushed
to disk.

Will this raw file already have the "disk information header" (no idea
how that stuff is called) encoded? Are there any plans/possible ways to

a) automatically create the headers? (if that's even possible)
b) support anything but raw files?

Please note that under x86, a KVM memory slot still has a (in my
opinion) fairly big overhead depending on the size of the slot (rmap,
page_track). We might have to optimize that.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [RFC] qemu: Add virtio pmem device
  2018-04-05 11:26   ` David Hildenbrand
@ 2018-04-05 12:09     ` Pankaj Gupta
  2018-04-05 12:19       ` David Hildenbrand
  2018-04-09  3:26       ` Stefan Hajnoczi
  0 siblings, 2 replies; 10+ messages in thread
From: Pankaj Gupta @ 2018-04-05 12:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, kvm, qemu-devel, linux-nvdimm, kwolf,
	haozhong zhang, jack, xiaoguangrong eric, riel, niteshnarayanlal,
	mst, ross zwisler, hch, stefanha, imammedo, marcel, pbonzini,
	dan j williams, nilal


Hi David,

> >  This patch adds virtio-pmem Qemu device.
> > 
> >  This device configures memory address range information with file
> >  backend type. It acts like persistent memory device for KVM guest.
> >  It presents the memory address range to virtio-pmem driver over
> >  virtio channel and does the block flush whenever there is request
> >  from guest to flush/sync. (Qemu part for backing file flush
> >  is yet to be implemented).
> > 
> >  Current code is a RFC to support guest with persistent memory
> >  range & DAX.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  hw/virtio/Makefile.objs                     |   2 +-
> >  hw/virtio/virtio-pci.c                      |  44 +++++++++
> >  hw/virtio/virtio-pci.h                      |  14 +++
> >  hw/virtio/virtio-pmem.c                     | 133
> >  ++++++++++++++++++++++++++++
> >  include/hw/pci/pci.h                        |   1 +
> >  include/hw/virtio/virtio-pmem.h             |  43 +++++++++
> >  include/standard-headers/linux/virtio_ids.h |   1 +
> >  7 files changed, 237 insertions(+), 1 deletion(-)
> >  create mode 100644 hw/virtio/virtio-pmem.c
> >  create mode 100644 include/hw/virtio/virtio-pmem.h
> > 
> > diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> > index 765d363c1f..bb5573d2ef 100644
> > --- a/hw/virtio/Makefile.objs
> > +++ b/hw/virtio/Makefile.objs
> > @@ -5,7 +5,7 @@ common-obj-y += virtio-bus.o
> >  common-obj-y += virtio-mmio.o
> >  
> >  obj-y += virtio.o virtio-balloon.o
> > -obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> > +obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o virtio-pmem.o
> >  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
> >  obj-y += virtio-crypto.o
> >  obj-$(CONFIG_VIRTIO_PCI) += virtio-crypto-pci.o
> > diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> > index c20537f31d..114ca05497 100644
> > --- a/hw/virtio/virtio-pci.c
> > +++ b/hw/virtio/virtio-pci.c
> > @@ -2491,6 +2491,49 @@ static const TypeInfo virtio_rng_pci_info = {
> >      .class_init    = virtio_rng_pci_class_init,
> >  };
> >  
> > +/* virtio-pmem-pci */
> > +
> > +static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error
> > **errp)
> > +{
> > +    VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
> > +    DeviceState *vdev = DEVICE(&vpmem->vdev);
> > +
> > +    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> > +    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> > +}
> > +
> > +static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
> > +{
> > +    DeviceClass *dc = DEVICE_CLASS(klass);
> > +    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> > +    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> > +    k->realize = virtio_pmem_pci_realize;
> > +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> > +    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> > +    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
> > +    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> > +    pcidev_k->class_id = PCI_CLASS_OTHERS;
> > +}
> > +
> > +static void virtio_pmem_pci_instance_init(Object *obj)
> > +{
> > +    VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
> > +
> > +    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> > +                                TYPE_VIRTIO_PMEM);
> > +    object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
> > +                              &error_abort);
> > +}
> > +
> > +static const TypeInfo virtio_pmem_pci_info = {
> > +    .name          = TYPE_VIRTIO_PMEM_PCI,
> > +    .parent        = TYPE_VIRTIO_PCI,
> > +    .instance_size = sizeof(VirtIOPMEMPCI),
> > +    .instance_init = virtio_pmem_pci_instance_init,
> > +    .class_init    = virtio_pmem_pci_class_init,
> > +};
> > +
> > +
> >  /* virtio-input-pci */
> >  
> >  static Property virtio_input_pci_properties[] = {
> > @@ -2683,6 +2726,7 @@ static void virtio_pci_register_types(void)
> >      type_register_static(&virtio_balloon_pci_info);
> >      type_register_static(&virtio_serial_pci_info);
> >      type_register_static(&virtio_net_pci_info);
> > +    type_register_static(&virtio_pmem_pci_info);
> >  #ifdef CONFIG_VHOST_SCSI
> >      type_register_static(&vhost_scsi_pci_info);
> >  #endif
> > diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> > index 813082b0d7..fe74fcad3f 100644
> > --- a/hw/virtio/virtio-pci.h
> > +++ b/hw/virtio/virtio-pci.h
> > @@ -19,6 +19,7 @@
> >  #include "hw/virtio/virtio-blk.h"
> >  #include "hw/virtio/virtio-net.h"
> >  #include "hw/virtio/virtio-rng.h"
> > +#include "hw/virtio/virtio-pmem.h"
> >  #include "hw/virtio/virtio-serial.h"
> >  #include "hw/virtio/virtio-scsi.h"
> >  #include "hw/virtio/virtio-balloon.h"
> > @@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
> >  typedef struct VirtIOGPUPCI VirtIOGPUPCI;
> >  typedef struct VHostVSockPCI VHostVSockPCI;
> >  typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
> > +typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
> >  
> >  /* virtio-pci-bus */
> >  
> > @@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
> >      VirtIOBlock vdev;
> >  };
> >  
> > +/*
> > + * virtio-pmem-pci: This extends VirtioPCIProxy.
> > + */
> > +#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
> > +#define VIRTIO_PMEM_PCI(obj) \
> > +        OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
> > +
> > +struct VirtIOPMEMPCI {
> > +    VirtIOPCIProxy parent_obj;
> > +    VirtIOPMEM vdev;
> > +};
> > +
> >  /*
> >   * virtio-balloon-pci: This extends VirtioPCIProxy.
> >   */
> > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > new file mode 100644
> > index 0000000000..28d06fc501
> > --- /dev/null
> > +++ b/hw/virtio/virtio-pmem.c
> > @@ -0,0 +1,133 @@
> > +/*
> > + * Virtio pmem device
> > + *
> > + */
> > +
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "qemu-common.h"
> > +#include "qemu/error-report.h"
> > +#include "hw/virtio/virtio-pmem.h"
> > +
> > +
> > +static void virtio_pmem_system_reset(void *opaque)
> > +{
> > +
> > +}
> > +
> > +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> > +{
> > +    VirtQueueElement *elem;
> > +
> > +    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> > +    if (!elem) {
> > +        return;
> > +    }
> > +    /* todo flush raw file */
> > +
> > +    virtio_notify(vdev, vq);
> > +    g_free(elem);
> > +
> > +}
> > +
> > +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> > +{
> > +    VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > +    struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *)
> > config;
> > +
> > +    pmemcfg->start = pmem->start;
> > +    pmemcfg->size  = pmem->size;
> > +    pmemcfg->align = pmem->align;
> > +}
> > +
> > +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t
> > features,
> > +                                        Error **errp)
> > +{
> > +    virtio_add_feature(&features, VIRTIO_PMEM_PLUG);
> > +    return features;
> > +}
> > +
> > +
> > +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> > +{
> > +    VirtIODevice   *vdev   = VIRTIO_DEVICE(dev);
> > +    VirtIOPMEM     *pmem   = VIRTIO_PMEM(dev);
> > +    MachineState   *ms     = MACHINE(qdev_get_machine());
> > +    MemoryRegion   *mr;
> > +    PCMachineState *pcms   = PC_MACHINE(object_dynamic_cast(OBJECT(ms),
> > TYPE_PC_MACHINE));
> > +    uint64_t addr;
> > +
> > +    if (!pmem->memdev) {
> > +        error_setg(errp, "virtio-pmem not set");
> > +        return;
> > +    }
> > +
> > +    mr               = host_memory_backend_get_memory(pmem->memdev, errp);
> > +    addr             = pcms->hotplug_memory.base;
> > +    pmem->start      = addr;
> > +    pmem->size       = memory_region_size(mr);
> > +    pmem->align      = memory_region_get_alignment(mr);
> > +
> > +    memory_region_init_alias(&pmem->mr, OBJECT(ms),
> > +                             "virtio_pmem-memory", mr, 0, pmem->size);
> > +
> > +    host_memory_backend_set_mapped(pmem->memdev, true);
> > +    virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > +                sizeof(struct virtio_pmem_config));
> > +
> > +    pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> > +    qemu_register_reset(virtio_pmem_system_reset, pmem);
> 
> 
> So right now you're just using some memdev for testing.

yes.

> 
> I assume that the memory region we will provide to the guest will be a
> simple memory mapped raw file. Dirty tracking (using the kvm slot) will
> be used to detect which blocks actually changed and have to be flushed
> to disk.

Not really, we will perform fsync on raw file. As this file is created
on regular storage and not nvdimm, so host page cache radix tree would have 
the dirty pages information which will be used for fsync.
   
> 
> Will this raw file already have the "disk information header" (no idea
> how that stuff is called) encoded? Are there any plans/possible ways to
> 
> a) automatically create the headers? (if that's even possible)

Its raw. Right now we are just supporting raw format.  

As this is direct mapping of memory into guest address space, I don't
think we can have an abstraction of headers for block specific features.
Or may be we can get opinion of others(Qemu block people) it is at all possible?

> b) support anything but raw files?
> 
> Please note that under x86, a KVM memory slot still has a (in my
> opinion) fairly big overhead depending on the size of the slot (rmap,
> page_track). We might have to optimize that.

I have not tried/observed this. Right now I just used single memory slot and cold add
few MB's of memory in Qemu. Can you please provide more details on this?

> 
> --
> 
> Thanks,
> 
> David / dhildenb
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [RFC] qemu: Add virtio pmem device
  2018-04-05 12:09     ` [Qemu-devel] " Pankaj Gupta
@ 2018-04-05 12:19       ` David Hildenbrand
  2018-04-09  3:26       ` Stefan Hajnoczi
  1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2018-04-05 12:19 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: linux-kernel, kvm, qemu-devel, linux-nvdimm, kwolf,
	haozhong zhang, jack, xiaoguangrong eric, riel, niteshnarayanlal,
	mst, ross zwisler, hch, stefanha, imammedo, marcel, pbonzini,
	dan j williams, nilal


>>
>> So right now you're just using some memdev for testing.
> 
> yes.
> 
>>
>> I assume that the memory region we will provide to the guest will be a
>> simple memory mapped raw file. Dirty tracking (using the kvm slot) will
>> be used to detect which blocks actually changed and have to be flushed
>> to disk.
> 
> Not really, we will perform fsync on raw file. As this file is created
> on regular storage and not nvdimm, so host page cache radix tree would have 
> the dirty pages information which will be used for fsync.

Ah right. That makes things a lot easier!

>    
>>
>> Will this raw file already have the "disk information header" (no idea
>> how that stuff is called) encoded? Are there any plans/possible ways to
>>
>> a) automatically create the headers? (if that's even possible)
> 
> Its raw. Right now we are just supporting raw format.  
> 
> As this is direct mapping of memory into guest address space, I don't
> think we can have an abstraction of headers for block specific features.
> Or may be we can get opinion of others(Qemu block people) it is at all possible?
> 
>> b) support anything but raw files?
>>
>> Please note that under x86, a KVM memory slot still has a (in my
>> opinion) fairly big overhead depending on the size of the slot (rmap,
>> page_track). We might have to optimize that.
> 
> I have not tried/observed this. Right now I just used single memory slot and cold add
> few MB's of memory in Qemu. Can you please provide more details on this?
> 

You can have a look at kvm_arch_create_memslot() in arch/x86/kvm/x86.c.

"npages" is used to allocate certain arrays (rmap for shadow page
tables). Also kvm_page_track_create_memslot() allocates data for page_track.

Having a big disk involves a lot of memory overhead due to the big kvm
memory slot. This is already the case for NVDIMMs as of now.

Other architectures (e.g. s390x) don't have this "problem". They don't
allocate any such data depending on the size of a memory slot.

This is certainly something to work on in the future.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [RFC] qemu: Add virtio pmem device
  2018-04-05 12:09     ` [Qemu-devel] " Pankaj Gupta
  2018-04-05 12:19       ` David Hildenbrand
@ 2018-04-09  3:26       ` Stefan Hajnoczi
  2018-04-09  6:42         ` David Hildenbrand
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2018-04-09  3:26 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: David Hildenbrand, kwolf, haozhong zhang, jack,
	xiaoguangrong eric, kvm, riel, linux-nvdimm, mst, ross zwisler,
	linux-kernel, qemu-devel, hch, pbonzini, stefanha,
	niteshnarayanlal, marcel, imammedo, dan j williams, nilal

[-- Attachment #1: Type: text/plain, Size: 929 bytes --]

On Thu, Apr 05, 2018 at 08:09:26AM -0400, Pankaj Gupta wrote:
> > Will this raw file already have the "disk information header" (no idea
> > how that stuff is called) encoded? Are there any plans/possible ways to
> > 
> > a) automatically create the headers? (if that's even possible)
> 
> Its raw. Right now we are just supporting raw format.  
> 
> As this is direct mapping of memory into guest address space, I don't
> think we can have an abstraction of headers for block specific features.
> Or may be we can get opinion of others(Qemu block people) it is at all possible?

memdev and the block layer are completely separate.  The block layer
isn't designed for memory-mapped access.

I think it makes sense to use memdev here.  If the user wants a block
device, they should use an emulated block device, not virtio-pmem,
because buffering is necessary anyway when an image file format is used.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [RFC] qemu: Add virtio pmem device
  2018-04-09  3:26       ` Stefan Hajnoczi
@ 2018-04-09  6:42         ` David Hildenbrand
  0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2018-04-09  6:42 UTC (permalink / raw)
  To: Stefan Hajnoczi, Pankaj Gupta
  Cc: kwolf, haozhong zhang, jack, xiaoguangrong eric, kvm, riel,
	linux-nvdimm, mst, ross zwisler, linux-kernel, qemu-devel, hch,
	pbonzini, stefanha, niteshnarayanlal, marcel, imammedo,
	dan j williams, nilal

On 09.04.2018 05:26, Stefan Hajnoczi wrote:
> On Thu, Apr 05, 2018 at 08:09:26AM -0400, Pankaj Gupta wrote:
>>> Will this raw file already have the "disk information header" (no idea
>>> how that stuff is called) encoded? Are there any plans/possible ways to
>>>
>>> a) automatically create the headers? (if that's even possible)
>>
>> Its raw. Right now we are just supporting raw format.  
>>
>> As this is direct mapping of memory into guest address space, I don't
>> think we can have an abstraction of headers for block specific features.
>> Or may be we can get opinion of others(Qemu block people) it is at all possible?
> 
> memdev and the block layer are completely separate.  The block layer
> isn't designed for memory-mapped access.
> 

Not questioning if this is the right thing to do now. I was wondering if
we could expose any block device in the future as virtio-pmem. And I
think with quite some work it could be possible.

As you said, we will need some buffering. Maybe userfaultfd and friends
(WP) could allow to implement that.

> I think it makes sense to use memdev here.  If the user wants a block
> device, they should use an emulated block device, not virtio-pmem,
> because buffering is necessary anyway when an image file format is used.
> 
> Stefan
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 0/2] KVM "fake DAX" device flushing
@ 2017-10-12 15:50 Pankaj Gupta
  0 siblings, 0 replies; 10+ messages in thread
From: Pankaj Gupta @ 2017-10-12 15:50 UTC (permalink / raw)
  To: linux-kernel, kvm, qemu-devel, linux-nvdimm, linux-mm
  Cc: jack, stefanha, dan.j.williams, riel, haozhong.zhang, nilal,
	kwolf, pbonzini, ross.zwisler, david, xiaoguangrong.eric,
	pagupta

We are sharing the prototype version of 'fake DAX' flushing
interface for the initial feedback. This is still work in progress
and not yet ready for merging.

Prototype right now just implements basic functionality without advanced
features with two major parts:

- Qemu virtio-pmem device
  It exposes a persistent memory range to KVM guest which at host side is file
  backed memory and works as persistent memory device. In addition to this it
  provides a virtio flushing interface for KVM guest to do a Qemu side sync for
  guest DAX persistent memory range.  

- Guest virtio-pmem driver
  Reads persistent memory range from paravirt device and reserves system memory map.
  It also allocates a block device corresponding to the pmem range which is accessed
  by DAX capable file systems. (file system support is still pending).  
  
We shared the project idea for 'fake DAX' flushing interface here [1].
Based on suggestions here [2], we implemented guest 'virtio-pmem'
driver and Qemu paravirt device.

[1] https://www.spinics.net/lists/kvm/msg149761.html
[2] https://www.spinics.net/lists/kvm/msg153095.html

Work yet to be done:

- Separate out the common code used by ACPI pmem interface and
  reuse it.

- In pmem device memmap allocation and working. There is some parallel work
  going on upstream related to 'memory_hotplug restructuring' [3] and also hitting
  a memory section alignment issue [4].
  
  [3] https://lwn.net/Articles/712099/
  [4] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html
  
- Provide DAX capable file-system(ext4 & XFS) support.
- Qemu device flush functionality trigger with guest fsync on file.
- Qemu live migration work when host page cache is used.
- Multiple virtio-pmem disks support.
- Prepare virtio spec after we get feedback on current approach.

 drivers/nvdimm/pfn.h             |   14 -
 drivers/nvdimm/pfn_devs.c        |   20 --
 drivers/nvdimm/pmem.c            |   40 ----
 drivers/nvdimm/pmem.h            |    5 
 drivers/virtio/Kconfig           |   10 +
 drivers/virtio/Makefile          |    1 
 drivers/virtio/virtio_pmem.c     |  322 +++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h         |   23 ++
 include/linux/pfn.h              |   15 +
 include/linux/pmem_common.h      |   52 ++++++
 include/uapi/linux/virtio_pmem.h |   55 ++++++
 11 files changed, 479 insertions(+), 78 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-04-09  6:42 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-05 10:48 [RFC 0/2] kvm "fake DAX" device flushing Pankaj Gupta
2018-04-05 10:48 ` [RFC 1/2] kvm: add virtio pmem driver Pankaj Gupta
2018-04-05 10:48 ` [RFC 2/2] pmem: device flush over VIRTIO Pankaj Gupta
2018-04-05 10:48 ` [RFC] qemu: Add virtio pmem device Pankaj Gupta
2018-04-05 11:26   ` David Hildenbrand
2018-04-05 12:09     ` [Qemu-devel] " Pankaj Gupta
2018-04-05 12:19       ` David Hildenbrand
2018-04-09  3:26       ` Stefan Hajnoczi
2018-04-09  6:42         ` David Hildenbrand
  -- strict thread matches above, loose matches on Subject: below --
2017-10-12 15:50 [RFC 0/2] KVM "fake DAX" device flushing Pankaj Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).