LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* another pmem variant V2
@ 2015-03-26  8:32 Christoph Hellwig
  2015-03-26  8:32 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
                   ` (5 more replies)
  0 siblings, 6 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:32 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, axboe, boaz

Here is another version of the same trivial pmem driver, because two
obviously aren't enough.  The first patch is the same pmem driver
that Ross posted a short time ago, just modified to use platform_devices
to find the persistant memory region instead of hardconding it in the
Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
but still allow auto-discovery.

The other two patches are a heavily rewritten version of the code that
Intel gave to various storage vendors to discover the type 12 (and earlier
type 6) nvdimms, which I massaged into a form that is hopefully suitable
for mainline.

Note that pmem.c really is the minimal version as I think we need something
included ASAP.  We'll eventually need to be able to do other I/O from and
to it, and as most people know everyone has their own preferre method to
do it, which I'd like to discuss once we have the basic driver in.

This has been tested both with a real NVDIMM on a system with a type 12
capable bios, as well as with "fake persistent" memory using the memmap=
option.

Changes since V1:
  - s/E820_PROTECTED_KERN/E820_PMEM/g
  - map the persistent memory as uncached
  - better kernel parameter description
  - various typo fixes
  - MODULE_LICENSE fix


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
@ 2015-03-26  8:32 ` Christoph Hellwig
  2015-03-26 14:12   ` [Linux-nvdimm] " Dan Williams
  2015-03-26  8:32 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:32 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, axboe, boaz

From: Ross Zwisler <ross.zwisler@linux.intel.com>

PMEM is a new driver that presents a reserved range of memory as a
block device.  This is useful for developing with NV-DIMMs, and
can be used with volatile memory as a development platform.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: convert to use a platform_device for discovery, fix partition
 support]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  13 ++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 373 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 393 insertions(+)
 create mode 100644 drivers/block/pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 358eb01..efacf2b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8063,6 +8063,12 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d..9284aaf 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,19 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more block devices.  Memory for PMEM should be
+	  reserved using the "memmap" kernel parameter.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem.
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..545b13b
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,373 @@
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <asm/cacheflush.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+#define PMEM_MINORS		16
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+static int pmem_major;
+static atomic_t pmem_index;
+
+static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+/*
+ * direct translation from (pmem,sector) => void*
+ * We do not require that sector be page aligned.
+ * The return value will point to the beginning of the page containing the
+ * given sector, not to the sector itself.
+ */
+static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+	size_t offset = page_offset << PAGE_SHIFT;
+
+	BUG_ON(offset >= pmem->size);
+	return pmem->virt_addr + offset;
+}
+
+/* sector must be page aligned */
+static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+
+	BUG_ON(sector & (PAGE_SECTORS - 1));
+	return (pmem->phys_addr >> PAGE_SHIFT) + page_offset;
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_to_pmem(struct pmem_device *pmem, const void *src,
+			sector_t sector, size_t n)
+{
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	dst = pmem_lookup_pg_addr(pmem, sector);
+	memcpy(dst + offset, src, copy);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		dst = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_from_pmem(void *dst, struct pmem_device *pmem,
+			  sector_t sector, size_t n)
+{
+	void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	src = pmem_lookup_pg_addr(pmem, sector);
+
+	memcpy(dst, src + offset, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		src = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+
+	if (rw == READ) {
+		copy_from_pmem(mem + off, pmem, sector, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		copy_to_pmem(pmem, mem + off, sector, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		pmem_do_bvec(pmem, bvec.bv_page, len,
+			    bvec.bv_offset, rw, sector);
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int pmem_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
+static long pmem_direct_access(struct block_device *bdev, sector_t sector,
+			      void **kaddr, unsigned long *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	if (!pmem)
+		return -ENODEV;
+
+	*kaddr = pmem_lookup_pg_addr(pmem, sector);
+	*pfn = pmem_lookup_pfn(pmem, sector);
+
+	return pmem->size - (sector * 512);
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		pmem_rw_page,
+	.direct_access =	pmem_direct_access,
+	.getgeo =		pmem_getgeo,
+};
+
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+static int pmem_mapmem(struct pmem_device *pmem)
+{
+	struct resource *res_mem;
+	int err;
+
+	res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
+					       "pmem");
+	if (!res_mem) {
+		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			   pmem->phys_addr, pmem->size);
+		return -EINVAL;
+	}
+
+	/*
+	 * Map the memory as non-cachable, as we can't write back the contents
+	 * of the CPU caches in case of a crash.
+	 */
+	pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);
+	if (unlikely(!pmem->virt_addr)) {
+		err = -ENXIO;
+		goto out_release;
+	}
+	return 0;
+
+out_release:
+	release_mem_region(pmem->phys_addr, pmem->size);
+	return err;
+}
+
+static void pmem_unmapmem(struct pmem_device *pmem)
+{
+	if (unlikely(!pmem->virt_addr))
+		return;
+
+	iounmap(pmem->virt_addr);
+	release_mem_region(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = NULL;
+}
+
+static int pmem_probe(struct platform_device *pdev)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	struct resource *res;
+	int idx, err;
+
+	if (WARN_ON(pdev->num_resources > 1))
+		return -ENXIO;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENXIO;
+
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (unlikely(!pmem))
+		return -ENOMEM;
+
+	pmem->phys_addr = res->start;
+	pmem->size = resource_size(res);
+
+	err = pmem_mapmem(pmem);
+	if (unlikely(err))
+		goto out_free_dev;
+
+	err = -ENOMEM;
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (unlikely(!pmem->pmem_queue))
+		goto out_unmap;
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	disk = alloc_disk(PMEM_MINORS);
+	if (unlikely(!disk))
+		goto out_free_queue;
+
+	idx = atomic_inc_return(&pmem_index) - 1;
+
+	disk->major		= pmem_major;
+	disk->first_minor	= PMEM_MINORS * idx;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", idx);
+	disk->driverfs_dev = &pdev->dev;
+	set_capacity(disk, pmem->size >> SECTOR_SHIFT);
+	pmem->pmem_disk = disk;
+
+	add_disk(disk);
+
+	platform_set_drvdata(pdev, pmem);
+	return 0;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+	pmem_unmapmem(pmem);
+out_free_dev:
+	kfree(pmem);
+	return err;
+}
+
+static int pmem_remove(struct platform_device *pdev)
+{
+	struct pmem_device *pmem = platform_get_drvdata(pdev);
+
+	del_gendisk(pmem->pmem_disk);
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	pmem_unmapmem(pmem);
+	kfree(pmem);
+
+	return 0;
+}
+
+static struct platform_driver pmem_driver = {
+	.probe		= pmem_probe,
+	.remove		= pmem_remove,
+	.driver		= {
+		.owner	= THIS_MODULE,
+		.name	= "pmem",
+	},
+};
+
+static int __init pmem_init(void)
+{
+	int error;
+
+	pmem_major = register_blkdev(0, "pmem");
+	if (pmem_major < 0)
+		return pmem_major;
+
+	error = platform_driver_register(&pmem_driver);
+	if (error)
+		unregister_blkdev(pmem_major, "pmem");
+	return error;
+}
+
+static void pmem_exit(void)
+{
+	platform_driver_unregister(&pmem_driver);
+	unregister_blkdev(pmem_major, "pmem");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+
+module_init(pmem_init);
+module_exit(pmem_exit);
-- 
1.9.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
  2015-03-26  8:32 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
@ 2015-03-26  8:32 ` Christoph Hellwig
  2015-03-26  9:02   ` Ingo Molnar
  2015-03-26  8:32 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:32 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, axboe, boaz

This will allow to deal with persistent memory which needs to be
treated like ram in many, but not all cases.

Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/kernel/e820.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 46201de..de088e3 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -48,6 +48,11 @@ unsigned long pci_mem_start = 0xaeedbabe;
 EXPORT_SYMBOL(pci_mem_start);
 #endif
 
+static inline bool is_e820_ram(__u32 type)
+{
+	return type == E820_RAM;
+}
+
 /*
  * This function checks if any part of the range <start,end> is mapped
  * with type.
@@ -688,7 +693,7 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn)
 			register_nosave_region(pfn, PFN_UP(ei->addr));
 
 		pfn = PFN_DOWN(ei->addr + ei->size);
-		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
+		if (!is_e820_ram(ei->type) && ei->type != E820_RESERVED_KERN)
 			register_nosave_region(PFN_UP(ei->addr), pfn);
 
 		if (pfn >= limit_pfn)
@@ -748,7 +753,7 @@ u64 __init early_reserve_e820(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
+static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -759,7 +764,7 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (ei->type != type)
+		if (!is_e820_ram(ei->type))
 			continue;
 
 		start_pfn = ei->addr >> PAGE_SHIFT;
@@ -784,12 +789,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 }
 unsigned long __init e820_end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
+	return e820_end_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820_end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
+	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
 }
 
 static void early_panic(char *msg)
-- 
1.9.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
  2015-03-26  8:32 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
  2015-03-26  8:32 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
@ 2015-03-26  8:32 ` Christoph Hellwig
  2015-03-26 16:57 ` another pmem variant V2 Boaz Harrosh
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:32 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, axboe, boaz

Various recent BIOSes support NVDIMMs or ADR using a non-standard
e820 memory type, and Intel supplied reference Linux code using this
type to various vendors.

Wire this e820 table type up to export platform devices for the pmem
driver so that we can use it in Linux, and also provide a memmap=
argument to manually tag memory as protected, which can be used
if the BIOSs doesn't use the standard nonstandard interface, or
we just want to test the pmem driver with regular memory.

Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  6 ++++
 arch/x86/Kconfig                    | 13 +++++++
 arch/x86/include/asm/setup.h        |  6 ++++
 arch/x86/include/uapi/asm/e820.h    | 10 ++++++
 arch/x86/kernel/Makefile            |  1 +
 arch/x86/kernel/e820.c              | 21 ++++++++++-
 arch/x86/kernel/pmem.c              | 70 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/setup.c             |  2 ++
 8 files changed, 128 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/pmem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a6..c87122d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			         or
 			         memmap=0x10000$0x18690000
 
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+
 	memory_corruption_check=0/1 [X86]
 			Some BIOSes seem to corrupt the first 64k of
 			memory when doing things like suspend/resume.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..ecc472c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config X86_PMEM_LEGACY
+	bool "Support non-standard NVDIMMs and ADR protected memory"
+	help
+	  Treat memory marked using the non-standard e820 type of 12 as used
+	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
+	  The kernel will offer these regions to the pmem driver so
+	  they can be used for persistent storage.
+
+	  If you say N the kernel will treat the ADR region like an e820
+	  reserved region.
+
+	  Say Y if unsure
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ff4e7b2..2352fde 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,12 @@ extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+#ifdef CONFIG_X86_PMEM_LEGACY
+void reserve_pmem(void);
+#else
+static inline void reserve_pmem(void) { }
+#endif
+
 #ifndef _SETUP
 
 #include <asm/espfix.h>
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index d993e33..91b674e 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -33,6 +33,16 @@
 #define E820_NVS	4
 #define E820_UNUSABLE	5
 
+/*
+ * This is a non-standardized way to represent ADR or NVDIMM regions that
+ * persist over a reboot.  The kernel will ignore their special capabilities
+ * unless the CONFIG_X86_PMEM_LEGACY option is set.
+ *
+ * Note that older platforms also used 6 for the same type of memory,
+ * but newer versions switched to 12 as 6 was assigned differently.  Some
+ * time they will learn..
+ */
+#define E820_PMEM	12
 
 /*
  * reserved RAM used by kernel itself
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cdb1b70..971f18c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
+obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index de088e3..b49193e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -48,10 +48,21 @@ unsigned long pci_mem_start = 0xaeedbabe;
 EXPORT_SYMBOL(pci_mem_start);
 #endif
 
+/*
+ * Persistent memory is accounted as ram for purposes of establishing max_pfn
+ * and mem_map.
+ */
+#ifdef CONFIG_X86_PMEM_LEGACY
+static inline bool is_e820_ram(__u32 type)
+{
+	return type == E820_RAM || type == E820_PMEM;
+}
+#else
 static inline bool is_e820_ram(__u32 type)
 {
 	return type == E820_RAM;
 }
+#endif
 
 /*
  * This function checks if any part of the range <start,end> is mapped
@@ -154,6 +165,9 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PMEM:
+		printk(KERN_CONT "persistent (type %u)", type);
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -871,6 +885,9 @@ static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820_add_region(start_at, mem_size, E820_RESERVED);
+	} else if (*p == '!') {
+		start_at = memparse(p+1, &p);
+		e820_add_region(start_at, mem_size, E820_PMEM);
 	} else
 		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
 
@@ -912,6 +929,7 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
+	case E820_PMEM: return "Protected RAM";
 	default:	return "reserved";
 	}
 }
@@ -946,7 +964,8 @@ void __init e820_reserve_resources(void)
 		 * pcibios_resource_survey()
 		 */
 		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			res->flags |= IORESOURCE_BUSY;
+			if (e820.map[i].type != E820_PMEM)
+				res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
new file mode 100644
index 0000000..734ca51
--- /dev/null
+++ b/arch/x86/kernel/pmem.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2009, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig.
+ */
+#include <linux/memblock.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <asm/e820.h>
+#include <asm/page_types.h>
+#include <asm/setup.h>
+
+void __init reserve_pmem(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type != E820_PMEM)
+			continue;
+
+		memblock_reserve(ei->addr, ei->addr + ei->size);
+		max_pfn_mapped = init_memory_mapping(
+				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
+				ei->addr + ei->size);
+	}
+}
+
+static __init void register_pmem_device(struct resource *res)
+{
+	struct platform_device *pdev;
+	int error;
+
+	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	if (!pdev)
+		return;
+
+	error = platform_device_add_resources(pdev, res, 1);
+	if (error)
+		goto out_put_pdev;
+
+	error = platform_device_add(pdev);
+	if (error)
+		goto out_put_pdev;
+	return;
+out_put_pdev:
+	dev_warn(&pdev->dev, "failed to add pmem device!\n");
+	platform_device_put(pdev);
+}
+
+static __init int register_pmem_devices(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type == E820_PMEM) {
+			struct resource res = {
+				.flags	= IORESOURCE_MEM,
+				.start	= ei->addr,
+				.end	= ei->addr + ei->size - 1,
+			};
+			register_pmem_device(&res);
+		}
+	}
+
+	return 0;
+}
+device_initcall(register_pmem_devices);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 0a2421c..f2bed2b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
+	reserve_pmem();
+
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-- 
1.9.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  8:32 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
@ 2015-03-26  9:02   ` Ingo Molnar
  2015-03-26  9:34     ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2015-03-26  9:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz


* Christoph Hellwig <hch@lst.de> wrote:

> This will allow to deal with persistent memory which needs to be
> treated like ram in many, but not all cases.
> 
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  arch/x86/kernel/e820.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 46201de..de088e3 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -48,6 +48,11 @@ unsigned long pci_mem_start = 0xaeedbabe;
>  EXPORT_SYMBOL(pci_mem_start);
>  #endif
>  
> +static inline bool is_e820_ram(__u32 type)
> +{
> +	return type == E820_RAM;
> +}
> +
>  /*
>   * This function checks if any part of the range <start,end> is mapped
>   * with type.
> @@ -688,7 +693,7 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn)
>  			register_nosave_region(pfn, PFN_UP(ei->addr));
>  
>  		pfn = PFN_DOWN(ei->addr + ei->size);
> -		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
> +		if (!is_e820_ram(ei->type) && ei->type != E820_RESERVED_KERN)
>  			register_nosave_region(PFN_UP(ei->addr), pfn);
>  
>  		if (pfn >= limit_pfn)
> @@ -748,7 +753,7 @@ u64 __init early_reserve_e820(u64 size, u64 align)
>  /*
>   * Find the highest page frame number we have available
>   */
> -static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
> +static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
>  {
>  	int i;
>  	unsigned long last_pfn = 0;
> @@ -759,7 +764,7 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
>  		unsigned long start_pfn;
>  		unsigned long end_pfn;
>  
> -		if (ei->type != type)
> +		if (!is_e820_ram(ei->type))
>  			continue;
>  
>  		start_pfn = ei->addr >> PAGE_SHIFT;
> @@ -784,12 +789,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
>  }
>  unsigned long __init e820_end_of_ram_pfn(void)
>  {
> -	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
> +	return e820_end_pfn(MAX_ARCH_PFN);
>  }
>  
>  unsigned long __init e820_end_of_low_ram_pfn(void)
>  {
> -	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
> +	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
>  }
>  
>  static void early_panic(char *msg)

This is_e820_ram() factoring out becomes really messy in patch #3.

So you left out a bunch of places making comparisons with E820_RAM, 
notably e820_reserve_resources_late() and memblock_x86_fill() - and of 
course those have to be left out, otherwise NVRAM might be registered 
and used as real kernel RAM!

And this shows the weakness and confusion caused by the factoring out 
of is_e820_ram() and then adding E820_PMEM to its definition...

I'd rather you add explicit checks to E820_PMEM (why not E820_PRAM, to 
keep in line with the E820_RAM name?), and not lie about 
is_e820_ram(). It should result in the exact same end result, with 
less confusion.

I have no fundamental objections to the driver otherwise.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  9:02   ` Ingo Molnar
@ 2015-03-26  9:34     ` Christoph Hellwig
  2015-03-26 10:04       ` Ingo Molnar
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26  9:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz

On Thu, Mar 26, 2015 at 10:02:15AM +0100, Ingo Molnar wrote:
> This is_e820_ram() factoring out becomes really messy in patch #3.
> 
> So you left out a bunch of places making comparisons with E820_RAM, 
> notably e820_reserve_resources_late() and memblock_x86_fill() - and of 
> course those have to be left out, otherwise NVRAM might be registered 
> and used as real kernel RAM!
> 
> And this shows the weakness and confusion caused by the factoring out 
> of is_e820_ram() and then adding E820_PMEM to its definition...
> 
> I'd rather you add explicit checks to E820_PMEM (why not E820_PRAM, to 
> keep in line with the E820_RAM name?), and not lie about 
> is_e820_ram(). It should result in the exact same end result, with 
> less confusion.
> 
> I have no fundamental objections to the driver otherwise.

Does this patch (replaces patches 2 and 3) look better to you?

---
>From 4a6fdc8433559d649d7bf707f72aafa4488f2d23 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 25 Mar 2015 12:24:11 +0100
Subject: x86: add support for the non-standard protected e820 type

Various recent BIOSes support NVDIMMs or ADR using a non-standard
e820 memory type, and Intel supplied reference Linux code using this
type to various vendors.

Wire this e820 table type up to export platform devices for the pmem
driver so that we can use it in Linux, and also provide a memmap=
argument to manually tag memory as protected, which can be used
if the BIOSs doesn't use the standard nonstandard interface, or
we just want to test the pmem driver with regular memory.

Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  6 ++++
 arch/x86/Kconfig                    | 10 ++++++
 arch/x86/include/asm/setup.h        |  6 ++++
 arch/x86/include/uapi/asm/e820.h    | 10 ++++++
 arch/x86/kernel/Makefile            |  1 +
 arch/x86/kernel/e820.c              | 31 ++++++++++++----
 arch/x86/kernel/pmem.c              | 70 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/setup.c             |  2 ++
 8 files changed, 130 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/kernel/pmem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a6..c87122d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			         or
 			         memmap=0x10000$0x18690000
 
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+
 	memory_corruption_check=0/1 [X86]
 			Some BIOSes seem to corrupt the first 64k of
 			memory when doing things like suspend/resume.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..c0e8ee3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1430,6 +1430,16 @@ config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config X86_PMEM_LEGACY
+	bool "Support non-standard NVDIMMs and ADR protected memory"
+	help
+	  Treat memory marked using the non-standard e820 type of 12 as used
+	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
+	  The kernel will offer these regions to the pmem driver so
+	  they can be used for persistent storage.
+
+	  Say Y if unsure.
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ff4e7b2..2352fde 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,12 @@ extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+#ifdef CONFIG_X86_PMEM_LEGACY
+void reserve_pmem(void);
+#else
+static inline void reserve_pmem(void) { }
+#endif
+
 #ifndef _SETUP
 
 #include <asm/espfix.h>
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index d993e33..ce0d0bf 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -33,6 +33,16 @@
 #define E820_NVS	4
 #define E820_UNUSABLE	5
 
+/*
+ * This is a non-standardized way to represent ADR or NVDIMM regions that
+ * persist over a reboot.  The kernel will ignore their special capabilities
+ * unless the CONFIG_X86_PMEM_LEGACY option is set.
+ *
+ * Note that older platforms also used 6 for the same type of memory,
+ * but newer versions switched to 12 as 6 was assigned differently.  Some
+ * time they will learn..
+ */
+#define E820_PRAM	12
 
 /*
  * reserved RAM used by kernel itself
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cdb1b70..971f18c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
+obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 46201de..4bd525a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,9 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PRAM:
+		printk(KERN_CONT "persistent (type %u)", type);
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -688,8 +691,15 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn)
 			register_nosave_region(pfn, PFN_UP(ei->addr));
 
 		pfn = PFN_DOWN(ei->addr + ei->size);
-		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
+
+		switch (ei->type) {
+		case E820_RAM:
+		case E820_PRAM:
+		case E820_RESERVED_KERN:
+			break;
+		default:
 			register_nosave_region(PFN_UP(ei->addr), pfn);
+		}
 
 		if (pfn >= limit_pfn)
 			break;
@@ -748,7 +758,7 @@ u64 __init early_reserve_e820(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
+static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -759,7 +769,11 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (ei->type != type)
+		/*
+		 * Persistent memory is accounted as ram for purposes of
+		 * establishing max_pfn and mem_map.
+		 */
+		if (ei->type != E820_RAM && ei->type != E820_PRAM)
 			continue;
 
 		start_pfn = ei->addr >> PAGE_SHIFT;
@@ -784,12 +798,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 }
 unsigned long __init e820_end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
+	return e820_end_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820_end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
+	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
 }
 
 static void early_panic(char *msg)
@@ -866,6 +880,9 @@ static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820_add_region(start_at, mem_size, E820_RESERVED);
+	} else if (*p == '!') {
+		start_at = memparse(p+1, &p);
+		e820_add_region(start_at, mem_size, E820_PRAM);
 	} else
 		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
 
@@ -907,6 +924,7 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
+	case E820_PRAM: return "Persistent RAM";
 	default:	return "reserved";
 	}
 }
@@ -941,7 +959,8 @@ void __init e820_reserve_resources(void)
 		 * pcibios_resource_survey()
 		 */
 		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			res->flags |= IORESOURCE_BUSY;
+			if (e820.map[i].type != E820_PRAM)
+				res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
new file mode 100644
index 0000000..f970048
--- /dev/null
+++ b/arch/x86/kernel/pmem.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2009, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig.
+ */
+#include <linux/memblock.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <asm/e820.h>
+#include <asm/page_types.h>
+#include <asm/setup.h>
+
+void __init reserve_pmem(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type != E820_PRAM)
+			continue;
+
+		memblock_reserve(ei->addr, ei->addr + ei->size);
+		max_pfn_mapped = init_memory_mapping(
+				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
+				ei->addr + ei->size);
+	}
+}
+
+static __init void register_pmem_device(struct resource *res)
+{
+	struct platform_device *pdev;
+	int error;
+
+	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	if (!pdev)
+		return;
+
+	error = platform_device_add_resources(pdev, res, 1);
+	if (error)
+		goto out_put_pdev;
+
+	error = platform_device_add(pdev);
+	if (error)
+		goto out_put_pdev;
+	return;
+out_put_pdev:
+	dev_warn(&pdev->dev, "failed to add pmem device!\n");
+	platform_device_put(pdev);
+}
+
+static __init int register_pmem_devices(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type == E820_PRAM) {
+			struct resource res = {
+				.flags	= IORESOURCE_MEM,
+				.start	= ei->addr,
+				.end	= ei->addr + ei->size - 1,
+			};
+			register_pmem_device(&res);
+		}
+	}
+
+	return 0;
+}
+device_initcall(register_pmem_devices);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 0a2421c..f2bed2b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
+	reserve_pmem();
+
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-- 
1.9.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  9:34     ` Christoph Hellwig
@ 2015-03-26 10:04       ` Ingo Molnar
  2015-03-26 10:19         ` Christoph Hellwig
  2015-03-26 15:49       ` Boaz Harrosh
  2015-03-26 22:59       ` Yinghai Lu
  2 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2015-03-26 10:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz


* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Mar 26, 2015 at 10:02:15AM +0100, Ingo Molnar wrote:
> > This is_e820_ram() factoring out becomes really messy in patch #3.
> > 
> > So you left out a bunch of places making comparisons with E820_RAM, 
> > notably e820_reserve_resources_late() and memblock_x86_fill() - and of 
> > course those have to be left out, otherwise NVRAM might be registered 
> > and used as real kernel RAM!
> > 
> > And this shows the weakness and confusion caused by the factoring out 
> > of is_e820_ram() and then adding E820_PMEM to its definition...
> > 
> > I'd rather you add explicit checks to E820_PMEM (why not E820_PRAM, to 
> > keep in line with the E820_RAM name?), and not lie about 
> > is_e820_ram(). It should result in the exact same end result, with 
> > less confusion.
> > 
> > I have no fundamental objections to the driver otherwise.
> 
> Does this patch (replaces patches 2 and 3) look better to you?

Yeah, the code is much clearer now:

  Acked-by: Ingo Molnar <mingo@kernel.org>

What tree is this intended for? Should I pick up the x86 bits?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 10:04       ` Ingo Molnar
@ 2015-03-26 10:19         ` Christoph Hellwig
  2015-03-26 10:28           ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 10:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz

On Thu, Mar 26, 2015 at 11:04:13AM +0100, Ingo Molnar wrote:
> Yeah, the code is much clearer now:
> 
>   Acked-by: Ingo Molnar <mingo@kernel.org>
> 
> What tree is this intended for? Should I pick up the x86 bits?

The x86 bits really need to go through the x86 tree.  The pmem driver
itself would normally go through the block tree, but if Jens is fine with
it sending it through the x86 tree as well would allow us to have a single
complete tree for testing.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 10:19         ` Christoph Hellwig
@ 2015-03-26 10:28           ` Ingo Molnar
  2015-03-26 10:29             ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2015-03-26 10:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz


* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Mar 26, 2015 at 11:04:13AM +0100, Ingo Molnar wrote:
> > Yeah, the code is much clearer now:
> > 
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> > 
> > What tree is this intended for? Should I pick up the x86 bits?
> 
> The x86 bits really need to go through the x86 tree.  The pmem 
> driver itself would normally go through the block tree, but if Jens 
> is fine with it sending it through the x86 tree as well would allow 
> us to have a single complete tree for testing.

btw., there's half a dozen block drivers in arch/* platform code, so 
in theory even the block driver could be merged there - but I agree 
that it's much cleaner and more generic in drivers/block/.

Jens?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 10:28           ` Ingo Molnar
@ 2015-03-26 10:29             ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 10:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz

On Thu, Mar 26, 2015 at 11:28:10AM +0100, Ingo Molnar wrote:
> btw., there's half a dozen block drivers in arch/* platform code, so 
> in theory even the block driver could be merged there - but I agree 
> that it's much cleaner and more generic in drivers/block/.

The block driver isn't really architecture specific.  With the upcoming
UEFI interface to discover persistent memory arm, arm64 and in theory ia64
will immeditately benefit from it.  I also expect it to show up on powerpc
sooner or later at least.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26  8:32 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
@ 2015-03-26 14:12   ` Dan Williams
  2015-03-26 14:35     ` Christoph Hellwig
  2015-03-26 14:52     ` Boaz Harrosh
  0 siblings, 2 replies; 69+ messages in thread
From: Dan Williams @ 2015-03-26 14:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig <hch@lst.de> wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> PMEM is a new driver that presents a reserved range of memory as a
> block device.  This is useful for developing with NV-DIMMs, and
> can be used with volatile memory as a development platform.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> [hch: convert to use a platform_device for discovery, fix partition
>  support]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  MAINTAINERS            |   6 +
>  drivers/block/Kconfig  |  13 ++
>  drivers/block/Makefile |   1 +
>  drivers/block/pmem.c   | 373 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 393 insertions(+)
>  create mode 100644 drivers/block/pmem.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 358eb01..efacf2b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8063,6 +8063,12 @@ S:       Maintained
>  F:     Documentation/blockdev/ramdisk.txt
>  F:     drivers/block/brd.c
>
> +PERSISTENT MEMORY DRIVER
> +M:     Ross Zwisler <ross.zwisler@linux.intel.com>
> +L:     linux-nvdimm@lists.01.org
> +S:     Supported
> +F:     drivers/block/pmem.c
> +
>  RANDOM NUMBER DRIVER
>  M:     "Theodore Ts'o" <tytso@mit.edu>
>  S:     Maintained
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 1b8094d..9284aaf 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -404,6 +404,19 @@ config BLK_DEV_RAM_DAX
>           and will prevent RAM block device backing store memory from being
>           allocated from highmem (only a problem for highmem systems).
>
> +config BLK_DEV_PMEM
> +       tristate "Persistent memory block device support"
> +       help
> +         Saying Y here will allow you to use a contiguous range of reserved
> +         memory as one or more block devices.  Memory for PMEM should be
> +         reserved using the "memmap" kernel parameter.
> +
> +         To compile this driver as a module, choose M here: the module will be
> +         called pmem.
> +
> +         Most normal users won't need this functionality, and can thus say N
> +         here.
> +
>  config CDROM_PKTCDVD
>         tristate "Packet writing on CD/DVD media"
>         depends on !UML
> diff --git a/drivers/block/Makefile b/drivers/block/Makefile
> index 02b688d..9cc6c18 100644
> --- a/drivers/block/Makefile
> +++ b/drivers/block/Makefile
> @@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)                += ps3vram.o
>  obj-$(CONFIG_ATARI_FLOPPY)     += ataflop.o
>  obj-$(CONFIG_AMIGA_Z2RAM)      += z2ram.o
>  obj-$(CONFIG_BLK_DEV_RAM)      += brd.o
> +obj-$(CONFIG_BLK_DEV_PMEM)     += pmem.o
>  obj-$(CONFIG_BLK_DEV_LOOP)     += loop.o
>  obj-$(CONFIG_BLK_CPQ_DA)       += cpqarray.o
>  obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> new file mode 100644
> index 0000000..545b13b
> --- /dev/null
> +++ b/drivers/block/pmem.c
> @@ -0,0 +1,373 @@
> +/*
> + * Persistent Memory Driver
> + * Copyright (c) 2014, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * This driver is heavily based on drivers/block/brd.c.
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + */
> +
> +#include <asm/cacheflush.h>
> +#include <linux/blkdev.h>
> +#include <linux/hdreg.h>
> +#include <linux/init.h>
> +#include <linux/platform_device.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/slab.h>
> +
> +#define SECTOR_SHIFT           9
> +#define PAGE_SECTORS_SHIFT     (PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS           (1 << PAGE_SECTORS_SHIFT)
> +
> +#define PMEM_MINORS            16
> +
> +struct pmem_device {
> +       struct request_queue    *pmem_queue;
> +       struct gendisk          *pmem_disk;
> +
> +       /* One contiguous memory region per device */
> +       phys_addr_t             phys_addr;
> +       void                    *virt_addr;
> +       size_t                  size;
> +};
> +
> +static int pmem_major;
> +static atomic_t pmem_index;
> +
> +static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
> +{
> +       /* some standard values */
> +       geo->heads = 1 << 6;
> +       geo->sectors = 1 << 5;
> +       geo->cylinders = get_capacity(bd->bd_disk) >> 11;
> +       return 0;
> +}
> +
> +/*
> + * direct translation from (pmem,sector) => void*
> + * We do not require that sector be page aligned.
> + * The return value will point to the beginning of the page containing the
> + * given sector, not to the sector itself.
> + */
> +static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
> +{
> +       size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
> +       size_t offset = page_offset << PAGE_SHIFT;
> +
> +       BUG_ON(offset >= pmem->size);
> +       return pmem->virt_addr + offset;
> +}
> +
> +/* sector must be page aligned */
> +static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector)
> +{
> +       size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
> +
> +       BUG_ON(sector & (PAGE_SECTORS - 1));
> +       return (pmem->phys_addr >> PAGE_SHIFT) + page_offset;
> +}
> +
> +/*
> + * sector is not required to be page aligned.
> + * n is at most a single page, but could be less.
> + */
> +static void copy_to_pmem(struct pmem_device *pmem, const void *src,
> +                       sector_t sector, size_t n)
> +{
> +       void *dst;
> +       unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
> +       size_t copy;
> +
> +       BUG_ON(n > PAGE_SIZE);
> +
> +       copy = min_t(size_t, n, PAGE_SIZE - offset);
> +       dst = pmem_lookup_pg_addr(pmem, sector);
> +       memcpy(dst + offset, src, copy);
> +
> +       if (copy < n) {
> +               src += copy;
> +               sector += copy >> SECTOR_SHIFT;
> +               copy = n - copy;
> +               dst = pmem_lookup_pg_addr(pmem, sector);
> +               memcpy(dst, src, copy);
> +       }
> +}
> +
> +/*
> + * sector is not required to be page aligned.
> + * n is at most a single page, but could be less.
> + */
> +static void copy_from_pmem(void *dst, struct pmem_device *pmem,
> +                         sector_t sector, size_t n)
> +{
> +       void *src;
> +       unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
> +       size_t copy;
> +
> +       BUG_ON(n > PAGE_SIZE);
> +
> +       copy = min_t(size_t, n, PAGE_SIZE - offset);
> +       src = pmem_lookup_pg_addr(pmem, sector);
> +
> +       memcpy(dst, src + offset, copy);
> +
> +       if (copy < n) {
> +               dst += copy;
> +               sector += copy >> SECTOR_SHIFT;
> +               copy = n - copy;
> +               src = pmem_lookup_pg_addr(pmem, sector);
> +               memcpy(dst, src, copy);
> +       }
> +}
> +
> +static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
> +                       unsigned int len, unsigned int off, int rw,
> +                       sector_t sector)
> +{
> +       void *mem = kmap_atomic(page);
> +
> +       if (rw == READ) {
> +               copy_from_pmem(mem + off, pmem, sector, len);
> +               flush_dcache_page(page);
> +       } else {
> +               /*
> +                * FIXME: Need more involved flushing to ensure that writes to
> +                * NVDIMMs are actually durable before returning.
> +                */
> +               flush_dcache_page(page);
> +               copy_to_pmem(pmem, mem + off, sector, len);
> +       }
> +
> +       kunmap_atomic(mem);
> +}
> +
> +static void pmem_make_request(struct request_queue *q, struct bio *bio)
> +{
> +       struct block_device *bdev = bio->bi_bdev;
> +       struct pmem_device *pmem = bdev->bd_disk->private_data;
> +       int rw;
> +       struct bio_vec bvec;
> +       sector_t sector;
> +       struct bvec_iter iter;
> +       int err = 0;
> +
> +       sector = bio->bi_iter.bi_sector;
> +       if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
> +               err = -EIO;
> +               goto out;
> +       }
> +
> +       BUG_ON(bio->bi_rw & REQ_DISCARD);
> +
> +       rw = bio_rw(bio);
> +       if (rw == READA)
> +               rw = READ;
> +
> +       bio_for_each_segment(bvec, bio, iter) {
> +               unsigned int len = bvec.bv_len;
> +
> +               BUG_ON(len > PAGE_SIZE);
> +               pmem_do_bvec(pmem, bvec.bv_page, len,
> +                           bvec.bv_offset, rw, sector);
> +               sector += len >> SECTOR_SHIFT;
> +       }
> +
> +out:
> +       bio_endio(bio, err);
> +}
> +
> +static int pmem_rw_page(struct block_device *bdev, sector_t sector,
> +                      struct page *page, int rw)
> +{
> +       struct pmem_device *pmem = bdev->bd_disk->private_data;
> +
> +       pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
> +       page_endio(page, rw & WRITE, 0);
> +       return 0;
> +}
> +
> +static long pmem_direct_access(struct block_device *bdev, sector_t sector,
> +                             void **kaddr, unsigned long *pfn, long size)
> +{
> +       struct pmem_device *pmem = bdev->bd_disk->private_data;
> +
> +       if (!pmem)
> +               return -ENODEV;
> +
> +       *kaddr = pmem_lookup_pg_addr(pmem, sector);
> +       *pfn = pmem_lookup_pfn(pmem, sector);
> +
> +       return pmem->size - (sector * 512);
> +}
> +
> +static const struct block_device_operations pmem_fops = {
> +       .owner =                THIS_MODULE,
> +       .rw_page =              pmem_rw_page,
> +       .direct_access =        pmem_direct_access,
> +       .getgeo =               pmem_getgeo,
> +};
> +
> +/* pmem->phys_addr and pmem->size need to be set.
> + * Will then set virt_addr if successful.
> + */
> +static int pmem_mapmem(struct pmem_device *pmem)
> +{
> +       struct resource *res_mem;
> +       int err;
> +
> +       res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
> +                                              "pmem");

Isn't request_mem_region() enough?  i.e. it seems
request_mem_region_exclusive() assumes no DAX, at least in theory?

> +       if (!res_mem) {
> +               pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
> +                          pmem->phys_addr, pmem->size);
> +               return -EINVAL;
> +       }
> +
> +       /*
> +        * Map the memory as non-cachable, as we can't write back the contents
> +        * of the CPU caches in case of a crash.
> +        */
> +       pmem->virt_addr = ioremap_nocache(pmem->phys_addr, pmem->size);

This is fine for now, but I think we're going to end up with a
continuum of solutions to this problem based on the platform and the
device.  Some ADR platforms have firmware that takes actions like
flushing caches on a "power going away" signal.  Other platforms have
cache management instructions that we can use on either a per-i/o or
per REQ_FUA/FLUSH request.  Hmm, with this being in the memory map by
default I think this poses a challenge for VIVT caches and aliased
accesses?  We can revisit this when arm support shows up.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26 14:12   ` [Linux-nvdimm] " Dan Williams
@ 2015-03-26 14:35     ` Christoph Hellwig
  2015-03-26 21:37       ` Ross Zwisler
  2015-03-26 14:52     ` Boaz Harrosh
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 14:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Thu, Mar 26, 2015 at 07:12:23AM -0700, Dan Williams wrote:
> > +       struct resource *res_mem;
> > +       int err;
> > +
> > +       res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
> > +                                              "pmem");
> 
> Isn't request_mem_region() enough?  i.e. it seems
> request_mem_region_exclusive() assumes no DAX, at least in theory?

This is 1:1 from the patch Ross sent, but I've been wondering why
request_mem_region_exclusive is used here.  All it does is setting the
IORESOURCE_EXCLUSIVE flag, which prevents /dev/mem and sysfs from accessing
the memory while the driver claims it. Besides pmem only a watchdog driver
and e1000 make use of this flag, and there's various function related to
it that are entirely unused.  It's a weird beast.

> This is fine for now, but I think we're going to end up with a
> continuum of solutions to this problem based on the platform and the
> device.  Some ADR platforms have firmware that takes actions like
> flushing caches on a "power going away" signal.  Other platforms have
> cache management instructions that we can use on either a per-i/o or
> per REQ_FUA/FLUSH request.  Hmm, with this being in the memory map by
> default I think this poses a challenge for VIVT caches and aliased
> accesses?  We can revisit this when arm support shows up.

Yes, I expect us to pass flags related to this through the platform_data
eventually, but I think that starting with the simplest and safest version
is probably the best idea.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26 14:12   ` [Linux-nvdimm] " Dan Williams
  2015-03-26 14:35     ` Christoph Hellwig
@ 2015-03-26 14:52     ` Boaz Harrosh
  2015-03-26 15:59       ` Dan Williams
  1 sibling, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-26 14:52 UTC (permalink / raw)
  To: Dan Williams, Christoph Hellwig
  Cc: linux-fsdevel, linux-nvdimm, X86 ML, Jens Axboe, linux-kernel

On 03/26/2015 04:12 PM, Dan Williams wrote:
> On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig <hch@lst.de> wrote:
>> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>>

Dan something is Broken with you mailer program it keeps dropping the
CC when sending replies.

For example Both me and Ross who were on CC got dropped, Jens Axboe
though got add back.

Its not only this email, it is all the emails in this series, please
check what is going on.

Thanks
Boaz

>> PMEM is a new driver that presents a reserved range of memory as a
>> block device.  This is useful for developing with NV-DIMMs, and
>> can be used with volatile memory as a development platform.
>>
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> [hch: convert to use a platform_device for discovery, fix partition
>>  support]
<>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  9:34     ` Christoph Hellwig
  2015-03-26 10:04       ` Ingo Molnar
@ 2015-03-26 15:49       ` Boaz Harrosh
  2015-03-26 16:02         ` [Linux-nvdimm] " Dan Williams
  2015-03-26 16:43         ` Christoph Hellwig
  2015-03-26 22:59       ` Yinghai Lu
  2 siblings, 2 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-26 15:49 UTC (permalink / raw)
  To: Christoph Hellwig, Ingo Molnar
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz

On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
<>

Please re-post this patch stand alone because git am on this will
Give me the wrong title and commit message

small comments ...

> From: Christoph Hellwig <hch@lst.de>
> Date: Wed, 25 Mar 2015 12:24:11 +0100
> Subject: x86: add support for the non-standard protected e820 type
> 
> Various recent BIOSes support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
> 
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the BIOSs doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.
> 
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  Documentation/kernel-parameters.txt |  6 ++++
>  arch/x86/Kconfig                    | 10 ++++++
>  arch/x86/include/asm/setup.h        |  6 ++++
>  arch/x86/include/uapi/asm/e820.h    | 10 ++++++
>  arch/x86/kernel/Makefile            |  1 +
>  arch/x86/kernel/e820.c              | 31 ++++++++++++----
>  arch/x86/kernel/pmem.c              | 70 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/setup.c             |  2 ++
>  8 files changed, 130 insertions(+), 6 deletions(-)
>  create mode 100644 arch/x86/kernel/pmem.c
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index bfcb1a6..c87122d 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			         or
>  			         memmap=0x10000$0x18690000
>  
> +	memmap=nn[KMG]!ss[KMG]
> +			[KNL,X86] Mark specific memory as protected.
> +			Region of memory to be used, from ss to ss+nn.
> +			The memory region may be marked as e820 type 12 (0xc)
> +			and is NVDIMM or ADR memory.
> +

Do we need to escape "\!" this character on grub command line ? It might
help to note that. I did like the original "|" BTW

>  	memory_corruption_check=0/1 [X86]
>  			Some BIOSes seem to corrupt the first 64k of
>  			memory when doing things like suspend/resume.
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..c0e8ee3 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1430,6 +1430,16 @@ config ILLEGAL_POINTER_VALUE
>  
>  source "mm/Kconfig"
>  
> +config X86_PMEM_LEGACY
> +	bool "Support non-standard NVDIMMs and ADR protected memory"
> +	help
> +	  Treat memory marked using the non-standard e820 type of 12 as used
> +	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
> +	  The kernel will offer these regions to the pmem driver so
> +	  they can be used for persistent storage.
> +
> +	  Say Y if unsure.
> +
>  config HIGHPTE
>  	bool "Allocate 3rd-level pagetables from highmem"
>  	depends on HIGHMEM
> diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
> index ff4e7b2..2352fde 100644
> --- a/arch/x86/include/asm/setup.h
> +++ b/arch/x86/include/asm/setup.h
> @@ -57,6 +57,12 @@ extern void x86_ce4100_early_setup(void);
>  static inline void x86_ce4100_early_setup(void) { }
>  #endif
>  
> +#ifdef CONFIG_X86_PMEM_LEGACY
> +void reserve_pmem(void);
> +#else
> +static inline void reserve_pmem(void) { }
> +#endif
> +
>  #ifndef _SETUP
>  
>  #include <asm/espfix.h>
> diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
> index d993e33..ce0d0bf 100644
> --- a/arch/x86/include/uapi/asm/e820.h
> +++ b/arch/x86/include/uapi/asm/e820.h
> @@ -33,6 +33,16 @@
>  #define E820_NVS	4
>  #define E820_UNUSABLE	5
>  
> +/*
> + * This is a non-standardized way to represent ADR or NVDIMM regions that
> + * persist over a reboot.  The kernel will ignore their special capabilities
> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
> + *
> + * Note that older platforms also used 6 for the same type of memory,
> + * but newer versions switched to 12 as 6 was assigned differently.  Some
> + * time they will learn..
> + */
> +#define E820_PRAM	12

Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
to enable is _PMEM_, the driver stack that gets loaded is pmem,
so PRAM is unexpected.

Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
but there are other not RAM technologies that can be supported exactly
the same way.
MEM is a more general name meaning "on the memory bus". I think.

I would love the consistency.

>  
>  /*
>   * reserved RAM used by kernel itself
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index cdb1b70..971f18c 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -94,6 +94,7 @@ obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
>  obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
>  obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
> +obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
>  
>  obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
>  
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 46201de..4bd525a 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -149,6 +149,9 @@ static void __init e820_print_type(u32 type)
>  	case E820_UNUSABLE:
>  		printk(KERN_CONT "unusable");
>  		break;
> +	case E820_PRAM:
> +		printk(KERN_CONT "persistent (type %u)", type);

This case can only mean 12 in this patch. (I think historically
you had a Kconfig to set its Number

> +		break;
>  	default:
>  		printk(KERN_CONT "type %u", type);
>  		break;
> @@ -688,8 +691,15 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn)
>  			register_nosave_region(pfn, PFN_UP(ei->addr));
>  
>  		pfn = PFN_DOWN(ei->addr + ei->size);
> -		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
> +
> +		switch (ei->type) {
> +		case E820_RAM:
> +		case E820_PRAM:
> +		case E820_RESERVED_KERN:
> +			break;
> +		default:
>  			register_nosave_region(PFN_UP(ei->addr), pfn);
> +		}
>  
>  		if (pfn >= limit_pfn)
>  			break;
> @@ -748,7 +758,7 @@ u64 __init early_reserve_e820(u64 size, u64 align)
>  /*
>   * Find the highest page frame number we have available
>   */
> -static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
> +static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
>  {
>  	int i;
>  	unsigned long last_pfn = 0;
> @@ -759,7 +769,11 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
>  		unsigned long start_pfn;
>  		unsigned long end_pfn;
>  
> -		if (ei->type != type)
> +		/*
> +		 * Persistent memory is accounted as ram for purposes of
> +		 * establishing max_pfn and mem_map.
> +		 */
> +		if (ei->type != E820_RAM && ei->type != E820_PRAM)
>  			continue;
>  
>  		start_pfn = ei->addr >> PAGE_SHIFT;
> @@ -784,12 +798,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
>  }
>  unsigned long __init e820_end_of_ram_pfn(void)
>  {
> -	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
> +	return e820_end_pfn(MAX_ARCH_PFN);
>  }
>  
>  unsigned long __init e820_end_of_low_ram_pfn(void)
>  {
> -	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
> +	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
>  }
>  
>  static void early_panic(char *msg)
> @@ -866,6 +880,9 @@ static int __init parse_memmap_one(char *p)
>  	} else if (*p == '$') {
>  		start_at = memparse(p+1, &p);
>  		e820_add_region(start_at, mem_size, E820_RESERVED);
> +	} else if (*p == '!') {
> +		start_at = memparse(p+1, &p);
> +		e820_add_region(start_at, mem_size, E820_PRAM);
>  	} else
>  		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
>  
> @@ -907,6 +924,7 @@ static inline const char *e820_type_to_string(int e820_type)
>  	case E820_ACPI:	return "ACPI Tables";
>  	case E820_NVS:	return "ACPI Non-volatile Storage";
>  	case E820_UNUSABLE:	return "Unusable memory";
> +	case E820_PRAM: return "Persistent RAM";

if you change E820_PRAM, then Also here

>  	default:	return "reserved";
>  	}
>  }
> @@ -941,7 +959,8 @@ void __init e820_reserve_resources(void)
>  		 * pcibios_resource_survey()
>  		 */
>  		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
> -			res->flags |= IORESOURCE_BUSY;
> +			if (e820.map[i].type != E820_PRAM)
> +				res->flags |= IORESOURCE_BUSY;
>  			insert_resource(&iomem_resource, res);
>  		}
>  		res++;
> diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
> new file mode 100644
> index 0000000..f970048
> --- /dev/null
> +++ b/arch/x86/kernel/pmem.c
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright (c) 2009, Intel Corporation.
> + * Copyright (c) 2015, Christoph Hellwig.
> + */
> +#include <linux/memblock.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +#include <asm/e820.h>
> +#include <asm/page_types.h>
> +#include <asm/setup.h>
> +
> +void __init reserve_pmem(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *ei = &e820.map[i];
> +
> +		if (ei->type != E820_PRAM)
> +			continue;
> +
> +		memblock_reserve(ei->addr, ei->addr + ei->size);
> +		max_pfn_mapped = init_memory_mapping(
> +				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
> +				ei->addr + ei->size);
> +	}
> +}
> +
> +static __init void register_pmem_device(struct resource *res)
> +{
> +	struct platform_device *pdev;
> +	int error;
> +
> +	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
> +	if (!pdev)
> +		return;
> +
> +	error = platform_device_add_resources(pdev, res, 1);
> +	if (error)
> +		goto out_put_pdev;
> +
> +	error = platform_device_add(pdev);
> +	if (error)
> +		goto out_put_pdev;
> +	return;
> +out_put_pdev:
> +	dev_warn(&pdev->dev, "failed to add pmem device!\n");
> +	platform_device_put(pdev);
> +}
> +
> +static __init int register_pmem_devices(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *ei = &e820.map[i];
> +
> +		if (ei->type == E820_PRAM) {

See here it would be cleaner to ask for E820_PMEM in a 
register_pmem_devices member

Just my $0.017

Thanks
Boaz

> +			struct resource res = {
> +				.flags	= IORESOURCE_MEM,
> +				.start	= ei->addr,
> +				.end	= ei->addr + ei->size - 1,
> +			};
> +			register_pmem_device(&res);
> +		}
> +	}
> +
> +	return 0;
> +}
> +device_initcall(register_pmem_devices);
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 0a2421c..f2bed2b 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
>  
>  	early_acpi_boot_init();
>  
> +	reserve_pmem();
> +
>  	initmem_init();
>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>  
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26 14:52     ` Boaz Harrosh
@ 2015-03-26 15:59       ` Dan Williams
  0 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2015-03-26 15:59 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-fsdevel, linux-nvdimm, X86 ML,
	Jens Axboe, linux-kernel

On Thu, Mar 26, 2015 at 7:52 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/26/2015 04:12 PM, Dan Williams wrote:
>> On Thu, Mar 26, 2015 at 1:32 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>>>
>
> Dan something is Broken with you mailer program it keeps dropping the
> CC when sending replies.
>
> For example Both me and Ross who were on CC got dropped, Jens Axboe
> though got add back.
>
> Its not only this email, it is all the emails in this series, please
> check what is going on.

They show up in the archives:
https://lists.01.org/pipermail/linux-nvdimm/2015-March/thread.html

Sometimes vger.kernel.org drops intel.com mails, it's outside my control.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 15:49       ` Boaz Harrosh
@ 2015-03-26 16:02         ` Dan Williams
  2015-03-26 16:07           ` Boaz Harrosh
  2015-03-26 16:43         ` Christoph Hellwig
  1 sibling, 1 reply; 69+ messages in thread
From: Dan Williams @ 2015-03-26 16:02 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Ingo Molnar, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On Thu, Mar 26, 2015 at 8:49 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
>> +/*
>> + * This is a non-standardized way to represent ADR or NVDIMM regions that
>> + * persist over a reboot.  The kernel will ignore their special capabilities
>> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
>> + *
>> + * Note that older platforms also used 6 for the same type of memory,
>> + * but newer versions switched to 12 as 6 was assigned differently.  Some
>> + * time they will learn..
>> + */
>> +#define E820_PRAM    12
>
> Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
> to enable is _PMEM_, the driver stack that gets loaded is pmem,
> so PRAM is unexpected.
>
> Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
> but there are other not RAM technologies that can be supported exactly
> the same way.
> MEM is a more general name meaning "on the memory bus". I think.
>
> I would love the consistency.

One of nice side of effects of having a "PRAM" name is that we can
later add a UEFI "PMEM" type where the distinction is thsy "PRAM" is
included in the system memory map by default and "PMEM" is analogous
to "IOMEM".  Just a thought...

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 16:02         ` [Linux-nvdimm] " Dan Williams
@ 2015-03-26 16:07           ` Boaz Harrosh
  0 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-26 16:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Ingo Molnar, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On 03/26/2015 06:02 PM, Dan Williams wrote:
> On Thu, Mar 26, 2015 at 8:49 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
>>> +/*
>>> + * This is a non-standardized way to represent ADR or NVDIMM regions that
>>> + * persist over a reboot.  The kernel will ignore their special capabilities
>>> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
>>> + *
>>> + * Note that older platforms also used 6 for the same type of memory,
>>> + * but newer versions switched to 12 as 6 was assigned differently.  Some
>>> + * time they will learn..
>>> + */
>>> +#define E820_PRAM    12
>>
>> Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
>> to enable is _PMEM_, the driver stack that gets loaded is pmem,
>> so PRAM is unexpected.
>>
>> Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
>> but there are other not RAM technologies that can be supported exactly
>> the same way.
>> MEM is a more general name meaning "on the memory bus". I think.
>>
>> I would love the consistency.
> 
> One of nice side of effects of having a "PRAM" name is that we can
> later add a UEFI "PMEM" type where the distinction is thsy "PRAM" is
> included in the system memory map by default and "PMEM" is analogous
> to "IOMEM".  Just a thought...
> 

Than lets say E820_PMEM_12, but not PRAM for sure. Also would UEFI be
E820_XXX will it not be a UEFI_PMEM ??

For me I hate RAM because it became to mean a technology, maybe then
the same name as the Kconfig PMEM_LEGACY

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 15:49       ` Boaz Harrosh
  2015-03-26 16:02         ` [Linux-nvdimm] " Dan Williams
@ 2015-03-26 16:43         ` Christoph Hellwig
  2015-03-26 18:46           ` Elliott, Robert (Server Storage)
  2015-03-26 20:53           ` Ross Zwisler
  1 sibling, 2 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 16:43 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Ingo Molnar, linux-nvdimm, linux-fsdevel,
	linux-kernel, x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > +	memmap=nn[KMG]!ss[KMG]
> > +			[KNL,X86] Mark specific memory as protected.
> > +			Region of memory to be used, from ss to ss+nn.
> > +			The memory region may be marked as e820 type 12 (0xc)
> > +			and is NVDIMM or ADR memory.
> > +
> 
> Do we need to escape "\!" this character on grub command line ? It might
> help to note that. I did like the original "|" BTW

No need to escape it on the kvm command line, which is where I tested
this flag only so far.  If there is a strong argument for "|" I'm happy
to change it.

> > +#define E820_PRAM	12
> 
> Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
> to enable is _PMEM_, the driver stack that gets loaded is pmem,
> so PRAM is unexpected.
> 
> Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
> but there are other not RAM technologies that can be supported exactly
> the same way.
> MEM is a more general name meaning "on the memory bus". I think.
> 
> I would love the consistency.

Ingo asked for the PRAM name, I don't really care either way.

> > +	case E820_PRAM:
> > +		printk(KERN_CONT "persistent (type %u)", type);
> 
> This case can only mean 12 in this patch. (I think historically
> you had a Kconfig to set its Number

Indeed.  I kept it in case people hack their kernels, but I can remove
it if people feel it's too ugly.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2015-03-26  8:32 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
@ 2015-03-26 16:57 ` Boaz Harrosh
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
                     ` (2 more replies)
  2015-03-31 22:11 ` Elliott, Robert (Server Storage)
  2015-04-01 19:33 ` Elliott, Robert (Server Storage)
  5 siblings, 3 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-26 16:57 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe, boaz

On 03/26/2015 10:32 AM, Christoph Hellwig wrote:
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.  The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistant memory region instead of hardconding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.
> 

Hi Christoph

So I've been trying to test your version, and play around with it.
I currently have some problems, but this is the end of the week for me
so I will debug and fix it after the weekend on Sunday.

But just a heads up for what I stumbled on.

For one this auto discovery of yours is very (very) nice but is a bit
inconvenience. Before I would reserve a big chuck on each NUMA range
on Kernel's memmap=  And then at pmem map= would slice and dice it
as I want hot style on modprobe with no need for reboot. Now I need
to do it on reboot theoretically. (You know xfstest needs lots of devices
some big some small ;-))

theoretically because if I try:
	memmap=1G!4G,1G!6G

It will not boot at all. (will debug on Sunday) I must have 2 devices
because each one is on another NUMA node.

If I do memmap=2G!4G,1G!6G , which is a contiguous range it will give
me only one device. Just an inconvenience. (And also this one sometimes
does not boot)

{All this in a VM for now did not even get to real HW yet}

Also with the modprob pmem map= I was supporting a PCIE memory card but
I guess I need to throw this one out the door.

Am also posting an RFC cleanup to your pmem driver. RFC because I was not
yet able to boot the all thing in our lab for heavy testing.

Thanks
Boaz

> The other two patches are a heavily rewritten version of the code that
> Intel gave to various storage vendors to discover the type 12 (and earlier
> type 6) nvdimms, which I massaged into a form that is hopefully suitable
> for mainline.
> 
> Note that pmem.c really is the minimal version as I think we need something
> included ASAP.  We'll eventually need to be able to do other I/O from and
> to it, and as most people know everyone has their own preferre method to
> do it, which I'd like to discuss once we have the basic driver in.
> 
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
> 
> Changes since V1:
>   - s/E820_PROTECTED_KERN/E820_PMEM/g
>   - map the persistent memory as uncached
>   - better kernel parameter description
>   - various typo fixes
>   - MODULE_LICENSE fix
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH] SQUASHME: Streamline pmem.c
  2015-03-26 16:57 ` another pmem variant V2 Boaz Harrosh
@ 2015-03-26 17:02   ` Boaz Harrosh
  2015-03-26 17:23     ` Christoph Hellwig
                       ` (3 more replies)
  2015-03-26 17:18   ` another pmem variant V2 Christoph Hellwig
  2015-03-31  9:25   ` Christoph Hellwig
  2 siblings, 4 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-26 17:02 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe


Christoph why did you choose the fat and ugly version of
pmem.c beats me. Anyway, here are the cleanups you need on
top of your pmem patch.

Among other it does:
* Remove getgeo. It is not needed for modern fdisk and was never
  needed for libgparted and cfdisk.

* remove 89 lines of code to do a single memcpy. The reason
  this was so in brd (done badly BTW) is because destination
  memory is page-by-page based. With pmem we have the destination
  contiguous so we can do any size, in one go.

* Remove SECTOR_SHIFT. It is defined in 6 other places
  in the Kernel. I do not like a new one. 9 is used through
  out, including block core. I do not like pmem to blasphemy
  more than needed.

* More style stuff ...

Please squash into your initial submission

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 137 +++++++++++----------------------------------------
 1 file changed, 28 insertions(+), 109 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 545b13b..5a57a06 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -11,7 +11,7 @@
  * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
  * more details.
  *
- * This driver is heavily based on drivers/block/brd.c.
+ * This driver's skeleton is based on drivers/block/brd.c.
  * Copyright (C) 2007 Nick Piggin
  * Copyright (C) 2007 Novell Inc.
  */
@@ -24,11 +24,6 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
-
-#define SECTOR_SHIFT		9
-#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
-#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
-
 #define PMEM_MINORS		16
 
 struct pmem_device {
@@ -44,100 +39,17 @@ struct pmem_device {
 static int pmem_major;
 static atomic_t pmem_index;
 
-static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
-{
-	/* some standard values */
-	geo->heads = 1 << 6;
-	geo->sectors = 1 << 5;
-	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
-	return 0;
-}
-
-/*
- * direct translation from (pmem,sector) => void*
- * We do not require that sector be page aligned.
- * The return value will point to the beginning of the page containing the
- * given sector, not to the sector itself.
- */
-static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
-{
-	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
-	size_t offset = page_offset << PAGE_SHIFT;
-
-	BUG_ON(offset >= pmem->size);
-	return pmem->virt_addr + offset;
-}
-
-/* sector must be page aligned */
-static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector)
-{
-	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
-
-	BUG_ON(sector & (PAGE_SECTORS - 1));
-	return (pmem->phys_addr >> PAGE_SHIFT) + page_offset;
-}
-
-/*
- * sector is not required to be page aligned.
- * n is at most a single page, but could be less.
- */
-static void copy_to_pmem(struct pmem_device *pmem, const void *src,
-			sector_t sector, size_t n)
-{
-	void *dst;
-	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
-	size_t copy;
-
-	BUG_ON(n > PAGE_SIZE);
-
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	dst = pmem_lookup_pg_addr(pmem, sector);
-	memcpy(dst + offset, src, copy);
-
-	if (copy < n) {
-		src += copy;
-		sector += copy >> SECTOR_SHIFT;
-		copy = n - copy;
-		dst = pmem_lookup_pg_addr(pmem, sector);
-		memcpy(dst, src, copy);
-	}
-}
-
-/*
- * sector is not required to be page aligned.
- * n is at most a single page, but could be less.
- */
-static void copy_from_pmem(void *dst, struct pmem_device *pmem,
-			  sector_t sector, size_t n)
-{
-	void *src;
-	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
-	size_t copy;
-
-	BUG_ON(n > PAGE_SIZE);
-
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	src = pmem_lookup_pg_addr(pmem, sector);
-
-	memcpy(dst, src + offset, copy);
-
-	if (copy < n) {
-		dst += copy;
-		sector += copy >> SECTOR_SHIFT;
-		copy = n - copy;
-		src = pmem_lookup_pg_addr(pmem, sector);
-		memcpy(dst, src, copy);
-	}
-}
-
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			unsigned int len, unsigned int off, int rw,
 			sector_t sector)
 {
 	void *mem = kmap_atomic(page);
+	size_t pmem_off = sector << 9;
+
+	BUG_ON(pmem_off >= pmem->size);
 
 	if (rw == READ) {
-		copy_from_pmem(mem + off, pmem, sector, len);
+		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
 		flush_dcache_page(page);
 	} else {
 		/*
@@ -145,7 +57,7 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 		 * NVDIMMs are actually durable before returning.
 		 */
 		flush_dcache_page(page);
-		copy_to_pmem(pmem, mem + off, sector, len);
+		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
 	}
 
 	kunmap_atomic(mem);
@@ -161,25 +73,32 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct bvec_iter iter;
 	int err = 0;
 
-	sector = bio->bi_iter.bi_sector;
-	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+	if (unlikely(bio_end_sector(bio) > get_capacity(bdev->bd_disk))) {
 		err = -EIO;
 		goto out;
 	}
 
-	BUG_ON(bio->bi_rw & REQ_DISCARD);
+	if (WARN_ON(bio->bi_rw & REQ_DISCARD)) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	rw = bio_rw(bio);
 	if (rw == READA)
 		rw = READ;
 
+	sector = bio->bi_iter.bi_sector;
 	bio_for_each_segment(bvec, bio, iter) {
-		unsigned int len = bvec.bv_len;
-
-		BUG_ON(len > PAGE_SIZE);
-		pmem_do_bvec(pmem, bvec.bv_page, len,
-			    bvec.bv_offset, rw, sector);
-		sector += len >> SECTOR_SHIFT;
+		/* NOTE: There is a legend saying that bv_len might be
+		 * bigger than PAGE_SIZE in the case that bv_page points to
+		 * a physical contiguous PFN set. But for us it is fine because
+		 * it means the Kernel virtual mapping is also contiguous. And
+		 * on the pmem side we are always contiguous both virtual and
+		 * physical
+		 */
+		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+			     rw, sector);
+		sector += bvec.bv_len >> 9;
 	}
 
 out:
@@ -200,21 +119,21 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 			      void **kaddr, unsigned long *pfn, long size)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	size_t offset = sector << 9;
 
-	if (!pmem)
+	if (unlikely(!pmem))
 		return -ENODEV;
 
-	*kaddr = pmem_lookup_pg_addr(pmem, sector);
-	*pfn = pmem_lookup_pfn(pmem, sector);
+	*kaddr = pmem->virt_addr + offset;
+	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
 
-	return pmem->size - (sector * 512);
+	return pmem->size - offset;
 }
 
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
 	.direct_access =	pmem_direct_access,
-	.getgeo =		pmem_getgeo,
 };
 
 /* pmem->phys_addr and pmem->size need to be set.
@@ -307,7 +226,7 @@ static int pmem_probe(struct platform_device *pdev)
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "pmem%d", idx);
 	disk->driverfs_dev = &pdev->dev;
-	set_capacity(disk, pmem->size >> SECTOR_SHIFT);
+	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
 	add_disk(disk);
-- 
1.9.3



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-26 16:57 ` another pmem variant V2 Boaz Harrosh
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
@ 2015-03-26 17:18   ` Christoph Hellwig
  2015-03-26 17:31     ` Boaz Harrosh
  2015-03-31  9:25   ` Christoph Hellwig
  2 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 17:18 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
> For one this auto discovery of yours is very (very) nice but is a bit
> inconvenience. Before I would reserve a big chuck on each NUMA range
> on Kernel's memmap=  And then at pmem map= would slice and dice it
> as I want hot style on modprobe with no need for reboot. Now I need
> to do it on reboot theoretically. (You know xfstest needs lots of devices
> some big some small ;-))

Slicing up a block device based on kernel options is not exactly a smart
idea.  We have partitions that are perfectly fine for that.  If you
really don't are about persistance of your partitioning you can just
set up a device mapper table.  No need to reinvent the wheel.

> Also with the modprob pmem map= I was supporting a PCIE memory card but
> I guess I need to throw this one out the door.

Send it my way to support it properly...  Just like any other PCIe device it
should have a proper driver.  For now it can register a pmem platform device,
but when you submit it I'd rather add slightly more low-level versions
of pmem_probe and pmem_remove that take a struct device * and possibly
a resource structure so that you can directly call into it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] SQUASHME: Streamline pmem.c
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
@ 2015-03-26 17:23     ` Christoph Hellwig
  2015-03-26 22:17     ` Ross Zwisler
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 17:23 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 07:02:17PM +0200, Boaz Harrosh wrote:
> Christoph why did you choose the fat and ugly version of
> pmem.c beats me. Anyway, here are the cleanups you need on
> top of your pmem patch.
> 
> Among other it does:
> * Remove getgeo. It is not needed for modern fdisk and was never
>   needed for libgparted and cfdisk.
> 
> * remove 89 lines of code to do a single memcpy. The reason
>   this was so in brd (done badly BTW) is because destination
>   memory is page-by-page based. With pmem we have the destination
>   contiguous so we can do any size, in one go.
> 
> * Remove SECTOR_SHIFT. It is defined in 6 other places
>   in the Kernel. I do not like a new one. 9 is used through
>   out, including block core. I do not like pmem to blasphemy
>   more than needed.
> 
> * More style stuff ...

One patch per items please..

> - * This driver is heavily based on drivers/block/brd.c.
> + * This driver's skeleton is based on drivers/block/brd.c.
>   * Copyright (C) 2007 Nick Piggin
>   * Copyright (C) 2007 Novell Inc.

Looks like there is basically nothing left of brd.c after this patch,
so we might as well drop this.

> -/*
> - * direct translation from (pmem,sector) => void*
> - * We do not require that sector be page aligned.
> - * The return value will point to the beginning of the page containing the
> - * given sector, not to the sector itself.
> - */

not quite related to you patch:  all the pmem and direct_access code uses
normal kernel address pointers, but we're actually dealing with iomem
here which makes sparse a little unhappy..

> -	BUG_ON(bio->bi_rw & REQ_DISCARD);
> +	if (WARN_ON(bio->bi_rw & REQ_DISCARD)) {
> +		err = -EINVAL;
> +		goto out;
> +	}

No need to write additional code here, I'd rather remove it entirely
if the BUG_ON bothers you.  There is no way we'll get a discard without
the driver asking for it.  And then you'd have to check for all the
other non-standard I/O types as well.

> +		/* NOTE: There is a legend saying that bv_len might be
> +		 * bigger than PAGE_SIZE in the case that bv_page points to
> +		 * a physical contiguous PFN set. But for us it is fine because
> +		 * it means the Kernel virtual mapping is also contiguous. And
> +		 * on the pmem side we are always contiguous both virtual and
> +		 * physical
> +		 */

Linux comment style has the opening "/*" on it's own line.  And talking
about legends in comments isn't a very nice style either.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-26 17:18   ` another pmem variant V2 Christoph Hellwig
@ 2015-03-26 17:31     ` Boaz Harrosh
  2015-03-26 18:38       ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-26 17:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/26/2015 07:18 PM, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
>> For one this auto discovery of yours is very (very) nice but is a bit
>> inconvenience. Before I would reserve a big chuck on each NUMA range
>> on Kernel's memmap=  And then at pmem map= would slice and dice it
>> as I want hot style on modprobe with no need for reboot. Now I need
>> to do it on reboot theoretically. (You know xfstest needs lots of devices
>> some big some small ;-))
> 
> Slicing up a block device based on kernel options is not exactly a smart
> idea.  We have partitions that are perfectly fine for that.  If you
> really don't are about persistance of your partitioning you can just
> set up a device mapper table.  No need to reinvent the wheel.
> 

I know! fdisk is my friend, I know.

But I hope you are not ignoring my real problem. any two memmap= ranges
will halt the boot. Specially if they are dis-contiguous.

Also I need the contiguous variant split into two devices because
they might belong to two NUMA nodes. It is very hard to manage
if a NUMA crossing is in a middle of a single pmemX device
The way we like to configure it is that each /dev/pmem belongs to
a single NUMA node. And in a multy device setup each CPU node
allocates from "his" pmem device If there is space.
(And it lets me set application affinity if need to)

BTW: Will device mapper let me call ->direct_access()

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-26 17:31     ` Boaz Harrosh
@ 2015-03-26 18:38       ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 18:38 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 07:31:03PM +0200, Boaz Harrosh wrote:
> But I hope you are not ignoring my real problem. any two memmap= ranges
> will halt the boot. Specially if they are dis-contiguous.

not the case here, verified it in various cofigurations.

> Also I need the contiguous variant split into two devices because
> they might belong to two NUMA nodes. It is very hard to manage
> if a NUMA crossing is in a middle of a single pmemX device
> The way we like to configure it is that each /dev/pmem belongs to
> a single NUMA node. And in a multy device setup each CPU node
> allocates from "his" pmem device If there is space.
> (And it lets me set application affinity if need to)

The hack below ensures two separate type 12 entries stay separate,
but I'm not sure I really want this.  so far it seems like very special
hacks for your very specialized fake-pmem config.

> BTW: Will device mapper let me call ->direct_access()

Right now it doesn't, but it's not hard to add..

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 16:43         ` Christoph Hellwig
@ 2015-03-26 18:46           ` Elliott, Robert (Server Storage)
  2015-03-26 19:25             ` [Linux-nvdimm] " Dan Williams
  2015-03-26 20:53           ` Ross Zwisler
  1 sibling, 1 reply; 69+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-03-26 18:46 UTC (permalink / raw)
  To: Christoph Hellwig, Boaz Harrosh
  Cc: Ingo Molnar, linux-nvdimm, linux-fsdevel, linux-kernel, x86,
	ross.zwisler, axboe, Kani, Toshimitsu



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 11:43 AM
> To: Boaz Harrosh
> Cc: Christoph Hellwig; Ingo Molnar; linux-nvdimm@ml01.01.org; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
> ross.zwisler@linux.intel.com; axboe@kernel.dk
> Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper
> 
> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > > +	memmap=nn[KMG]!ss[KMG]
> > > +			[KNL,X86] Mark specific memory as protected.
> > > +			Region of memory to be used, from ss to ss+nn.
> > > +			The memory region may be marked as e820 type 12 (0xc)
> > > +			and is NVDIMM or ADR memory.
> > > +
> >
> > Do we need to escape "\!" this character on grub command line ? It might
> > help to note that. I did like the original "|" BTW
> 
> No need to escape it on the kvm command line, which is where I tested
> this flag only so far.  If there is a strong argument for "|" I'm happy
> to change it.

I agree with Boaz that ! is a nuisance if loading pmem as a module
with modprobe from bash. 

The ! does work fine in the grub2 command line if the driver is 
built-in.  I added it to /etc/default/grub like this:
    GRUB_CMDLINE_LINUX="<other parameters> memmap=8G!18G"
and grub-mkconfig placed it in /boot/efi/EFI/centos/grub.cfg
with no issues.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 18:46           ` Elliott, Robert (Server Storage)
@ 2015-03-26 19:25             ` Dan Williams
  0 siblings, 0 replies; 69+ messages in thread
From: Dan Williams @ 2015-03-26 19:25 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, Boaz Harrosh, axboe, linux-nvdimm, x86,
	linux-kernel, linux-fsdevel, Ingo Molnar

On Thu, Mar 26, 2015 at 11:46 AM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
>
>
>> -----Original Message-----
>> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
>> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
>> Sent: Thursday, March 26, 2015 11:43 AM
>> To: Boaz Harrosh
>> Cc: Christoph Hellwig; Ingo Molnar; linux-nvdimm@ml01.01.org; linux-
>> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
>> ross.zwisler@linux.intel.com; axboe@kernel.dk
>> Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper
>>
>> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
>> > > + memmap=nn[KMG]!ss[KMG]
>> > > +                 [KNL,X86] Mark specific memory as protected.
>> > > +                 Region of memory to be used, from ss to ss+nn.
>> > > +                 The memory region may be marked as e820 type 12 (0xc)
>> > > +                 and is NVDIMM or ADR memory.
>> > > +
>> >
>> > Do we need to escape "\!" this character on grub command line ? It might
>> > help to note that. I did like the original "|" BTW
>>
>> No need to escape it on the kvm command line, which is where I tested
>> this flag only so far.  If there is a strong argument for "|" I'm happy
>> to change it.
>
> I agree with Boaz that ! is a nuisance if loading pmem as a module
> with modprobe from bash.

This is a core kernel command line, not a module parameter.  I'm not
saying that it should stay "!", but modprobe will not need to deal
with it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 16:43         ` Christoph Hellwig
  2015-03-26 18:46           ` Elliott, Robert (Server Storage)
@ 2015-03-26 20:53           ` Ross Zwisler
  1 sibling, 0 replies; 69+ messages in thread
From: Ross Zwisler @ 2015-03-26 20:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Boaz Harrosh, Ingo Molnar, linux-nvdimm, linux-fsdevel,
	linux-kernel, x86, axboe

On Thu, 2015-03-26 at 17:43 +0100, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > > +#define E820_PRAM	12
> > 
> > Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
> > to enable is _PMEM_, the driver stack that gets loaded is pmem,
> > so PRAM is unexpected.
> > 
> > Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
> > but there are other not RAM technologies that can be supported exactly
> > the same way.
> > MEM is a more general name meaning "on the memory bus". I think.
> > 
> > I would love the consistency.
> 
> Ingo asked for the PRAM name, I don't really care either way.

I also prefer "E820_PMEM", fwiw.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26 14:35     ` Christoph Hellwig
@ 2015-03-26 21:37       ` Ross Zwisler
  0 siblings, 0 replies; 69+ messages in thread
From: Ross Zwisler @ 2015-03-26 21:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Jens Axboe, linux-nvdimm, X86 ML, linux-kernel,
	linux-fsdevel

On Thu, 2015-03-26 at 15:35 +0100, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 07:12:23AM -0700, Dan Williams wrote:
> > > +       struct resource *res_mem;
> > > +       int err;
> > > +
> > > +       res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
> > > +                                              "pmem");
> > 
> > Isn't request_mem_region() enough?  i.e. it seems
> > request_mem_region_exclusive() assumes no DAX, at least in theory?
> 
> This is 1:1 from the patch Ross sent, but I've been wondering why
> request_mem_region_exclusive is used here.  All it does is setting the
> IORESOURCE_EXCLUSIVE flag, which prevents /dev/mem and sysfs from accessing
> the memory while the driver claims it. Besides pmem only a watchdog driver
> and e1000 make use of this flag, and there's various function related to
> it that are entirely unused.  It's a weird beast.

I don't have a compelling reason to use request_mem_region_exclusive()
over request_mem_region().  If the latter is cleaner I'm fine with the
change.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] SQUASHME: Streamline pmem.c
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
  2015-03-26 17:23     ` Christoph Hellwig
@ 2015-03-26 22:17     ` Ross Zwisler
  2015-03-26 22:22     ` Ross Zwisler
  2015-03-26 23:31     ` [Linux-nvdimm] " Dan Williams
  3 siblings, 0 replies; 69+ messages in thread
From: Ross Zwisler @ 2015-03-26 22:17 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86, axboe

On Thu, 2015-03-26 at 19:02 +0200, Boaz Harrosh wrote:
> Christoph why did you choose the fat and ugly version of
> pmem.c beats me. Anyway, here are the cleanups you need on
> top of your pmem patch.
> 
> Among other it does:
> * Remove getgeo. It is not needed for modern fdisk and was never
>   needed for libgparted and cfdisk.
> 
> * remove 89 lines of code to do a single memcpy. The reason
>   this was so in brd (done badly BTW) is because destination
>   memory is page-by-page based. With pmem we have the destination
>   contiguous so we can do any size, in one go.
> 
> * Remove SECTOR_SHIFT. It is defined in 6 other places
>   in the Kernel. I do not like a new one. 9 is used through
>   out, including block core. I do not like pmem to blasphemy
>   more than needed.
> 
> * More style stuff ...
> 
> Please squash into your initial submission
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

I agree with Christoph's comments, but overall I think these changes are
great.  Please send out as a series & you can add:

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH] SQUASHME: Streamline pmem.c
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
  2015-03-26 17:23     ` Christoph Hellwig
  2015-03-26 22:17     ` Ross Zwisler
@ 2015-03-26 22:22     ` Ross Zwisler
  2015-03-26 23:31     ` [Linux-nvdimm] " Dan Williams
  3 siblings, 0 replies; 69+ messages in thread
From: Ross Zwisler @ 2015-03-26 22:22 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86, axboe

On Thu, 2015-03-26 at 19:02 +0200, Boaz Harrosh wrote:
>  static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
>  			unsigned int len, unsigned int off, int rw,
>  			sector_t sector)
>  {
>  	void *mem = kmap_atomic(page);
> +	size_t pmem_off = sector << 9;
> +
> +	BUG_ON(pmem_off >= pmem->size);

This check should take 'len' into account so we don't copy off the end of our
PMEM space.

We should also just return -EIO back up to pmem_make_request() and have that
fail the bio, as opposed to doing the drastic BUG_ON.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  9:34     ` Christoph Hellwig
  2015-03-26 10:04       ` Ingo Molnar
  2015-03-26 15:49       ` Boaz Harrosh
@ 2015-03-26 22:59       ` Yinghai Lu
  2015-03-27  8:10         ` Christoph Hellwig
  2 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2015-03-26 22:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, linux-nvdimm, linux-fsdevel,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	ross.zwisler, Jens Axboe, boaz

On Thu, Mar 26, 2015 at 2:34 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 26, 2015 at 10:02:15AM +0100, Ingo Molnar wrote:
>> This is_e820_ram() factoring out becomes really messy in patch #3.
...
> Does this patch (replaces patches 2 and 3) look better to you?
>
> ---
> From 4a6fdc8433559d649d7bf707f72aafa4488f2d23 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Wed, 25 Mar 2015 12:24:11 +0100
> Subject: x86: add support for the non-standard protected e820 type
>
> Various recent BIOSes support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
>
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the BIOSs doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.
>
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
>
...
> diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
> new file mode 100644
> index 0000000..f970048
> --- /dev/null
> +++ b/arch/x86/kernel/pmem.c
...
> +
> +void __init reserve_pmem(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < e820.nr_map; i++) {
> +               struct e820entry *ei = &e820.map[i];
> +
> +               if (ei->type != E820_PRAM)
> +                       continue;
> +
> +               memblock_reserve(ei->addr, ei->addr + ei->size);
> +               max_pfn_mapped = init_memory_mapping(
> +                               ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
> +                               ei->addr + ei->size);
> +       }
> +}

What do you want to get here?

You did not modify memblock_x86_fill() to treat
E820_PRAM as E820_RAM, so memblock will not have any
entry for E820_PRAM, so you do not need to call memblock_reserve
there.

And the same time,  init_memory_mapping() will call
init_range_memory_mapping/for_each_mem_pfn_range() to
set kernel mapping for memory range in memblock only.
So here calling init_memory_mapping will not do anything.
then just drop calling to that init_memory_mapping.
--- so will not kernel mapping pmem, is that what you intended to have?

After those two changes, You do not need this reserve_pmem at all.
Just drop it.


> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 0a2421c..f2bed2b 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
>
>         early_acpi_boot_init();
>
> +       reserve_pmem();
> +

Not needed.

>         initmem_init();
>         dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH] SQUASHME: Streamline pmem.c
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
                       ` (2 preceding siblings ...)
  2015-03-26 22:22     ` Ross Zwisler
@ 2015-03-26 23:31     ` Dan Williams
  2015-03-31 13:44       ` Boaz Harrosh
  3 siblings, 1 reply; 69+ messages in thread
From: Dan Williams @ 2015-03-26 23:31 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Thu, Mar 26, 2015 at 10:02 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> Christoph why did you choose the fat and ugly version of
> pmem.c beats me.

Boaz, I am so very tired of your snide commentary.  It severely
detracts from the technical merit of your patches.  Please stop.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 22:59       ` Yinghai Lu
@ 2015-03-27  8:10         ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-27  8:10 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Christoph Hellwig, Ingo Molnar, linux-nvdimm, linux-fsdevel,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	ross.zwisler, Jens Axboe, boaz

On Thu, Mar 26, 2015 at 03:59:28PM -0700, Yinghai Lu wrote:
> What do you want to get here?
> 
> You did not modify memblock_x86_fill() to treat
> E820_PRAM as E820_RAM, so memblock will not have any
> entry for E820_PRAM, so you do not need to call memblock_reserve
> there.
> 
> And the same time,  init_memory_mapping() will call
> init_range_memory_mapping/for_each_mem_pfn_range() to
> set kernel mapping for memory range in memblock only.
> So here calling init_memory_mapping will not do anything.
> then just drop calling to that init_memory_mapping.
> --- so will not kernel mapping pmem, is that what you intended to have?

I think the intent of the old Intel code was to indeed map the pmem
into KVA space.  That got broken when I forward ported it to use
memblocks.  However the current pmem infrastructure doesn't need the
KVA mapping, so I can remove it for now.

However we have heated discussions about how to do I/O to pmem, and
KVA mapping is one of the options.  If we got with that option I might
bring this code back in a fixed up version.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-26 16:57 ` another pmem variant V2 Boaz Harrosh
  2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
  2015-03-26 17:18   ` another pmem variant V2 Christoph Hellwig
@ 2015-03-31  9:25   ` Christoph Hellwig
  2015-03-31 10:25     ` Boaz Harrosh
                       ` (2 more replies)
  2 siblings, 3 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-31  9:25 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
> On 03/26/2015 10:32 AM, Christoph Hellwig wrote:
> > Here is another version of the same trivial pmem driver, because two
> > obviously aren't enough.  The first patch is the same pmem driver
> > that Ross posted a short time ago, just modified to use platform_devices
> > to find the persistant memory region instead of hardconding it in the
> > Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> > but still allow auto-discovery.
> > 
> 
> Hi Christoph
> 
> So I've been trying to test your version, and play around with it.
> I currently have some problems, but this is the end of the week for me
> so I will debug and fix it after the weekend on Sunday.

Any news?  I'd really like to resend this ASAP to get it into 4.1..

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31  9:25   ` Christoph Hellwig
@ 2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 10:31       ` Boaz Harrosh
                         ` (2 more replies)
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
  2015-03-31 15:14     ` another pmem variant V2 Boaz Harrosh
  2 siblings, 3 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 10:25 UTC (permalink / raw)
  To: Christoph Hellwig, Boaz Harrosh
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
>> On 03/26/2015 10:32 AM, Christoph Hellwig wrote:
>>> Here is another version of the same trivial pmem driver, because two
>>> obviously aren't enough.  The first patch is the same pmem driver
>>> that Ross posted a short time ago, just modified to use platform_devices
>>> to find the persistant memory region instead of hardconding it in the
>>> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
>>> but still allow auto-discovery.
>>>
>>
>> Hi Christoph
>>
>> So I've been trying to test your version, and play around with it.
>> I currently have some problems, but this is the end of the week for me
>> so I will debug and fix it after the weekend on Sunday.
> 
> Any news?  I'd really like to resend this ASAP to get it into 4.1..

Yes sorry I got stuck with the NUMA thing. I will finally finish today.

For some reason your patch with the memmap=nn!aa behaves differently than
with  memmap=nn\$aa

And also compared to my old e820.c fix + my old pmem. My fixes are effectively
the same as with your Kernel and the memmap=nn\$aa which sets the range "reserved"
like. And less "ram" like as in your patch.

The problem I see is that if I state a memmap=nn!aa that crosses a NUMA
boundary then the machine will not boot.
So BTW for sure I need that "don't merge E820_PMEM ranges" patch because
otherwise I will not be able to boot if I have pmem on both NUMA nodes
and they happen to be contiguous.

I do not understand why this happens, because a contiguous range of
RAM is fine with cross NUMA and pmem not. (Also the pmem defined as
memmap=nn!aa behaves the same)

Also we have another problem with NUMA that I'm researching for a solution
long term. Is that if the second NUMA node has only pmem and no RAM than
the Kernel will not define a second NUMA node. And we are NUMA screwed with
pmem. So one must put v-ram and nv-ram equally spread across his nodes.

Regarding the SQUASHMEs to PMEM. Originally I had them as 3-4 patches.
But I thought since you are squashing them into a single submitted patch
I can just send just the one patch. Tell me what you prefer and I'll
resend (The one vs the three)

And one last issue. I have some configuration "hardness" with the
memmap=nn!aa Kernel command line API, it was better for me with the
pmem map= module param. Will you be OK if I split pmem_probe() into
calling pmem_alloc(addr, length), so I can keep an out-of-tree patch
that adds the map= parameter to pmem?

Will send the patch in one hour. Just tell me if you need just the one
or three.

Thanks, Christoph
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 10:25     ` Boaz Harrosh
@ 2015-03-31 10:31       ` Boaz Harrosh
  2015-03-31 14:21       ` [RFC] SQUASHME: pmem: Split up pmem_probe from pmem_alloc Boaz Harrosh
  2015-03-31 16:08       ` another pmem variant V2 Christoph Hellwig
  2 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 10:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 01:25 PM, Boaz Harrosh wrote:
> On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
<>
> The problem I see is that if I state a memmap=nn!aa that crosses a NUMA
> boundary then the machine will not boot.
> So BTW for sure I need that "don't merge E820_PMEM ranges" patch because
> otherwise I will not be able to boot if I have pmem on both NUMA nodes
> and they happen to be contiguous.
> 

BTW if you need NUMA emulated in your KVM add the below to your virsh edit

  <cpu>
    ....
    <numa>
      <cell cpus='0-3' memory='5242880'/>
      <cell cpus='4-7' memory='3145728'/>
    </numa>
  </cpu>


memory= is in KiB only
The sum of these must exactly match what you have for 
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>

And you probably already have say:
    <topology sockets='2' cores='4' threads='1'/>

Cheers
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [SQUASHME 0/6] Streamline of Initial  pmem submission
  2015-03-31  9:25   ` Christoph Hellwig
  2015-03-31 10:25     ` Boaz Harrosh
@ 2015-03-31 13:18     ` Boaz Harrosh
  2015-03-31 13:23       ` [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections Boaz Harrosh
                         ` (5 more replies)
  2015-03-31 15:14     ` another pmem variant V2 Boaz Harrosh
  2 siblings, 6 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
> 
> Any news?  I'd really like to resend this ASAP to get it into 4.1..
> 

Hi Christoph, list

I'm sending in one SQUASHME patch to the first e820 patch.

And few other patches to the initial pmem.c driver submission.

Please squash those patches, after review.

list of patches:
 [PATCH 1/6] SQUASHME: Don't let e820_PMEM section merge

	Into first patch

 [PATCH 2/6] SQUASHME: pmem: Remove getgeo
 [PATCH 3/6] SQUASHME: pmem: Streamline pmem driver
 [PATCH 4/6] SQUSHME: pmem: Micro cleaning
 [PATCH 5/6] SQUASHME: pmem: Remove SECTOR_SHIFT
 [PATCH 6/6] SQUASHME: pmem: Remove "heavily based on brd.c" +

Thanks for reviewing
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
@ 2015-03-31 13:23       ` Boaz Harrosh
  2015-03-31 17:16         ` [Linux-nvdimm] " Brooks, Adam J
  2015-03-31 13:24       ` [PATCH 2/6] SQUASHME: pmem: Remove getgeo Boaz Harrosh
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

From: Christoph Hellwig <hch@lst.de>

If we need a NUMA crossing a mids a contiguous memory range
we split the range at NUMA boundary so to produce two pmemX
devices. We do not like 2 NUMA IDs at the same device.

TODO: What happens with real type-12 NvDIMM the BIOS splits
these ranges?

Christoph I please need this patch for now. I have booting
problems on some machines, when a contiguous pmem range crosses
a NUMA boundary.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 arch/x86/kernel/e820.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 4bd525a..a098c74 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -346,7 +346,7 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
 		 * continue building up new bios map based on this
 		 * information
 		 */
-		if (current_type != last_type)	{
+		if (current_type != last_type || current_type == E820_PRAM) {
 			if (last_type != 0)	 {
 				new_bios[new_bios_entry].size =
 					change_point[chgidx]->addr - last_addr;
-- 
1.9.3


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 2/6] SQUASHME: pmem: Remove getgeo
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
  2015-03-31 13:23       ` [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections Boaz Harrosh
@ 2015-03-31 13:24       ` Boaz Harrosh
  2015-03-31 13:25       ` [PATCH 3/6] SQUASHME: pmem: Streamline pmem driver Boaz Harrosh
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe


Remove getgeo. It is not needed for modern fdisk and was never
needed for libgparted and cfdisk.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 545b13b..dcb524f 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -44,15 +44,6 @@ struct pmem_device {
 static int pmem_major;
 static atomic_t pmem_index;
 
-static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
-{
-	/* some standard values */
-	geo->heads = 1 << 6;
-	geo->sectors = 1 << 5;
-	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
-	return 0;
-}
-
 /*
  * direct translation from (pmem,sector) => void*
  * We do not require that sector be page aligned.
@@ -214,7 +205,6 @@ static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
 	.direct_access =	pmem_direct_access,
-	.getgeo =		pmem_getgeo,
 };
 
 /* pmem->phys_addr and pmem->size need to be set.
-- 
1.9.3


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/6] SQUASHME: pmem: Streamline pmem driver
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
  2015-03-31 13:23       ` [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections Boaz Harrosh
  2015-03-31 13:24       ` [PATCH 2/6] SQUASHME: pmem: Remove getgeo Boaz Harrosh
@ 2015-03-31 13:25       ` Boaz Harrosh
  2015-03-31 13:27       ` [PATCH 4/6] SQUSHME: pmem: Micro cleaning Boaz Harrosh
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe


remove 89 lines of code to do a single memcpy. The reason
this was so in brd (done badly BTW) is because destination
memory is page-by-page based. With pmem we have the destination
contiguous so we can do any size, in one go.

[v2]
Remove the BUG_ON checks on out of range IO.
The core already does these checks and I did not see these
checks done in other drivers.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 112 ++++++++++-----------------------------------------
 1 file changed, 22 insertions(+), 90 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index dcb524f..6a45fd5 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -44,91 +44,15 @@ struct pmem_device {
 static int pmem_major;
 static atomic_t pmem_index;
 
-/*
- * direct translation from (pmem,sector) => void*
- * We do not require that sector be page aligned.
- * The return value will point to the beginning of the page containing the
- * given sector, not to the sector itself.
- */
-static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
-{
-	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
-	size_t offset = page_offset << PAGE_SHIFT;
-
-	BUG_ON(offset >= pmem->size);
-	return pmem->virt_addr + offset;
-}
-
-/* sector must be page aligned */
-static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector)
-{
-	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
-
-	BUG_ON(sector & (PAGE_SECTORS - 1));
-	return (pmem->phys_addr >> PAGE_SHIFT) + page_offset;
-}
-
-/*
- * sector is not required to be page aligned.
- * n is at most a single page, but could be less.
- */
-static void copy_to_pmem(struct pmem_device *pmem, const void *src,
-			sector_t sector, size_t n)
-{
-	void *dst;
-	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
-	size_t copy;
-
-	BUG_ON(n > PAGE_SIZE);
-
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	dst = pmem_lookup_pg_addr(pmem, sector);
-	memcpy(dst + offset, src, copy);
-
-	if (copy < n) {
-		src += copy;
-		sector += copy >> SECTOR_SHIFT;
-		copy = n - copy;
-		dst = pmem_lookup_pg_addr(pmem, sector);
-		memcpy(dst, src, copy);
-	}
-}
-
-/*
- * sector is not required to be page aligned.
- * n is at most a single page, but could be less.
- */
-static void copy_from_pmem(void *dst, struct pmem_device *pmem,
-			  sector_t sector, size_t n)
-{
-	void *src;
-	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
-	size_t copy;
-
-	BUG_ON(n > PAGE_SIZE);
-
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	src = pmem_lookup_pg_addr(pmem, sector);
-
-	memcpy(dst, src + offset, copy);
-
-	if (copy < n) {
-		dst += copy;
-		sector += copy >> SECTOR_SHIFT;
-		copy = n - copy;
-		src = pmem_lookup_pg_addr(pmem, sector);
-		memcpy(dst, src, copy);
-	}
-}
-
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			unsigned int len, unsigned int off, int rw,
 			sector_t sector)
 {
 	void *mem = kmap_atomic(page);
+	size_t pmem_off = sector << 9;
 
 	if (rw == READ) {
-		copy_from_pmem(mem + off, pmem, sector, len);
+		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
 		flush_dcache_page(page);
 	} else {
 		/*
@@ -136,7 +60,7 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 		 * NVDIMMs are actually durable before returning.
 		 */
 		flush_dcache_page(page);
-		copy_to_pmem(pmem, mem + off, sector, len);
+		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
 	}
 
 	kunmap_atomic(mem);
@@ -152,25 +76,32 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct bvec_iter iter;
 	int err = 0;
 
-	sector = bio->bi_iter.bi_sector;
 	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
 		err = -EIO;
 		goto out;
 	}
 
-	BUG_ON(bio->bi_rw & REQ_DISCARD);
+	if (WARN_ON(bio->bi_rw & REQ_DISCARD)) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	rw = bio_rw(bio);
 	if (rw == READA)
 		rw = READ;
 
+	sector = bio->bi_iter.bi_sector;
 	bio_for_each_segment(bvec, bio, iter) {
-		unsigned int len = bvec.bv_len;
-
-		BUG_ON(len > PAGE_SIZE);
-		pmem_do_bvec(pmem, bvec.bv_page, len,
-			    bvec.bv_offset, rw, sector);
-		sector += len >> SECTOR_SHIFT;
+		/* NOTE: There is a legend saying that bv_len might be
+		 * bigger than PAGE_SIZE in the case that bv_page points to
+		 * a physical contiguous PFN set. But for us it is fine because
+		 * it means the Kernel virtual mapping is also contiguous. And
+		 * on the pmem side we are always contiguous both virtual and
+		 * physical
+		 */
+		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+			     rw, sector);
+		sector += bvec.bv_len >> 9;
 	}
 
 out:
@@ -191,14 +122,15 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 			      void **kaddr, unsigned long *pfn, long size)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	size_t offset = sector << 9;
 
 	if (!pmem)
 		return -ENODEV;
 
-	*kaddr = pmem_lookup_pg_addr(pmem, sector);
-	*pfn = pmem_lookup_pfn(pmem, sector);
+	*kaddr = pmem->virt_addr + offset;
+	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
 
-	return pmem->size - (sector * 512);
+	return pmem->size - offset;
 }
 
 static const struct block_device_operations pmem_fops = {
-- 
1.9.3



^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 4/6] SQUSHME: pmem: Micro cleaning
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
                         ` (2 preceding siblings ...)
  2015-03-31 13:25       ` [PATCH 3/6] SQUASHME: pmem: Streamline pmem driver Boaz Harrosh
@ 2015-03-31 13:27       ` Boaz Harrosh
  2015-03-31 15:17         ` [Linux-nvdimm] " Dan Williams
  2015-03-31 13:28       ` [PATCH 5/6] SQUASHME: pmem: Remove SECTOR_SHIFT Boaz Harrosh
  2015-03-31 13:33       ` [PATCH 6/6] SQUASHME: pmem: Remove "... based on brd.c" + Copyright Boaz Harrosh
  5 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe


Some error checks had unlikely some did not. Put unlikely
on all error handling paths.
(I like unlikely for error paths specially for readability)

Also use bio_data_dir() to extract away the READA flag

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 6a45fd5..209a410 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -76,7 +76,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct bvec_iter iter;
 	int err = 0;
 
-	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+	if (unlikely(bio_end_sector(bio) > get_capacity(bdev->bd_disk))) {
 		err = -EIO;
 		goto out;
 	}
@@ -86,9 +86,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
-	rw = bio_rw(bio);
-	if (rw == READA)
-		rw = READ;
+	rw = bio_data_dir(bio);
 
 	sector = bio->bi_iter.bi_sector;
 	bio_for_each_segment(bvec, bio, iter) {
@@ -124,7 +122,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
 	size_t offset = sector << 9;
 
-	if (!pmem)
+	if (unlikely(!pmem))
 		return -ENODEV;
 
 	*kaddr = pmem->virt_addr + offset;
@@ -149,7 +147,7 @@ static int pmem_mapmem(struct pmem_device *pmem)
 
 	res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
 					       "pmem");
-	if (!res_mem) {
+	if (unlikely(!res_mem)) {
 		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
 			   pmem->phys_addr, pmem->size);
 		return -EINVAL;
@@ -192,7 +190,7 @@ static int pmem_probe(struct platform_device *pdev)
 		return -ENXIO;
 
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
+	if (unlikely(!res))
 		return -ENXIO;
 
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
@@ -273,11 +271,11 @@ static int __init pmem_init(void)
 	int error;
 
 	pmem_major = register_blkdev(0, "pmem");
-	if (pmem_major < 0)
+	if (unlikely(pmem_major < 0))
 		return pmem_major;
 
 	error = platform_driver_register(&pmem_driver);
-	if (error)
+	if (unlikely(error))
 		unregister_blkdev(pmem_major, "pmem");
 	return error;
 }
-- 
1.9.3


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 5/6] SQUASHME: pmem: Remove SECTOR_SHIFT
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
                         ` (3 preceding siblings ...)
  2015-03-31 13:27       ` [PATCH 4/6] SQUSHME: pmem: Micro cleaning Boaz Harrosh
@ 2015-03-31 13:28       ` Boaz Harrosh
  2015-03-31 13:33       ` [PATCH 6/6] SQUASHME: pmem: Remove "... based on brd.c" + Copyright Boaz Harrosh
  5 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe


Remove SECTOR_SHIFT. It is defined in 6 other places
in the Kernel. I do not like a new one. 9 is used through
out, including block core. I do not like pmem to blasphemy
more than needed.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 209a410..508c9a8 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -25,10 +25,6 @@
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
 
-#define SECTOR_SHIFT		9
-#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
-#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
-
 #define PMEM_MINORS		16
 
 struct pmem_device {
@@ -227,7 +223,7 @@ static int pmem_probe(struct platform_device *pdev)
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "pmem%d", idx);
 	disk->driverfs_dev = &pdev->dev;
-	set_capacity(disk, pmem->size >> SECTOR_SHIFT);
+	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
 	add_disk(disk);
-- 
1.9.3


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 6/6] SQUASHME: pmem: Remove "... based on brd.c" + Copyright
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
                         ` (4 preceding siblings ...)
  2015-03-31 13:28       ` [PATCH 5/6] SQUASHME: pmem: Remove SECTOR_SHIFT Boaz Harrosh
@ 2015-03-31 13:33       ` Boaz Harrosh
  5 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe


The driver is no longer similar to brd.c.
Christoph completely changed the 2nd half of the patch
and I completely removed the 1st half.

pmem is its own thing.

Also added Copyright of Christoph and me. Feel free
to remove if it is not so.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 508c9a8..9ecdb79 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -1,6 +1,8 @@
 /*
  * Persistent Memory Driver
  * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig <hch@lst.de>.
+ * Copyright (c) 2015, Boaz Harrosh <boaz@plexistor.com>.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -10,10 +12,6 @@
  * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
  * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
  * more details.
- *
- * This driver is heavily based on drivers/block/brd.c.
- * Copyright (C) 2007 Nick Piggin
- * Copyright (C) 2007 Novell Inc.
  */
 
 #include <asm/cacheflush.h>
-- 
1.9.3


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH] SQUASHME: Streamline pmem.c
  2015-03-26 23:31     ` [Linux-nvdimm] " Dan Williams
@ 2015-03-31 13:44       ` Boaz Harrosh
  0 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 13:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On 03/27/2015 01:31 AM, Dan Williams wrote:
> On Thu, Mar 26, 2015 at 10:02 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>> Christoph why did you choose the fat and ugly version of
>> pmem.c beats me.
> 
> Boaz, I am so very tired of your snide commentary.  It severely
> detracts from the technical merit of your patches.  Please stop.
> 

Hi Dan

snide (snīd)
adj. snid·er, snid·est
 1. Mocking or derogatory in a malicious or ironic way

Please do not take me seriously. I'm just a joke ;-)
There is no "malicious nor Mocking" in any of my words Yes maybe some
"irony"

I think "severely detracts" is a bit of an exaggeration, no?

All I really really care about is like you, the "technical merit"

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC] SQUASHME: pmem: Split up pmem_probe from pmem_alloc
  2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 10:31       ` Boaz Harrosh
@ 2015-03-31 14:21       ` Boaz Harrosh
  2015-03-31 16:10         ` Christoph Hellwig
  2015-03-31 16:08       ` another pmem variant V2 Christoph Hellwig
  2 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 14:21 UTC (permalink / raw)
  To: Boaz Harrosh, Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 01:25 PM, Boaz Harrosh wrote:
<>
> 
> And one last issue. I have some configuration "hardness" with the
> memmap=nn!aa Kernel command line API, it was better for me with the
> pmem map= module param. Will you be OK if I split pmem_probe() into
> calling pmem_alloc(addr, length), so I can keep an out-of-tree patch
> that adds the map= parameter to pmem?
> 

Hi Christoph.

Is this too much ugly for you? The reason I need it is because
I would like to keep out-of-tree a patch that adds back the
map= module param so to have more fine grain control of my
pmem devices.

[And also to have mapping control per pmem device. For example
 one device can be pages-mapped another uncached ioremap, 3rd
 write through mapping, and so on]

In the patch if pmem loads without map= it will load like
yours, but if map= is not empty it will not call platform_driver_register()
and will manually load devices as before.

I can still do this, of course. But with this here split patch
it will be easier to maintain.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 48 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 62cc9d0..0fc5c66 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -193,20 +193,13 @@ static void pmem_unmapmem(struct pmem_device *pmem)
 
 #endif /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */
 
-static int pmem_probe(struct platform_device *pdev)
+static int pmem_alloc(struct resource *res, struct device *dev,
+		      struct pmem_device **o_pmem)
 {
 	struct pmem_device *pmem;
 	struct gendisk *disk;
-	struct resource *res;
 	int idx, err;
 
-	if (WARN_ON(pdev->num_resources > 1))
-		return -ENXIO;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (unlikely(!res))
-		return -ENXIO;
-
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
 	if (unlikely(!pmem))
 		return -ENOMEM;
@@ -240,13 +233,12 @@ static int pmem_probe(struct platform_device *pdev)
 	disk->queue		= pmem->pmem_queue;
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "pmem%d", idx);
-	disk->driverfs_dev = &pdev->dev;
+	disk->driverfs_dev = dev;
 	set_capacity(disk, pmem->size >> 9);
 	pmem->pmem_disk = disk;
 
 	add_disk(disk);
-
-	platform_set_drvdata(pdev, pmem);
+	*o_pmem = pmem;
 	return 0;
 
 out_free_queue:
@@ -255,19 +247,45 @@ out_unmap:
 	pmem_unmapmem(pmem);
 out_free_dev:
 	kfree(pmem);
+	*o_pmem = NULL;
 	return err;
 }
 
-static int pmem_remove(struct platform_device *pdev)
+static void pmem_free(struct pmem_device *pmem)
 {
-	struct pmem_device *pmem = platform_get_drvdata(pdev);
-
 	del_gendisk(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
 	blk_cleanup_queue(pmem->pmem_queue);
 	pmem_unmapmem(pmem);
 	kfree(pmem);
+}
+
+static int pmem_probe(struct platform_device *pdev)
+{
+	struct pmem_device *pmem;
+	struct resource *res;
+	int err;
+
+	if (WARN_ON(pdev->num_resources > 1))
+		return -ENXIO;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (unlikely(!res))
+		return -ENXIO;
+
+	err = pmem_alloc(res, &pdev->dev, &pmem);
+	if (unlikely(err))
+		return err;
+
+	platform_set_drvdata(pdev, pmem);
+	return 0;
+}
+
+static int pmem_remove(struct platform_device *pdev)
+{
+	struct pmem_device *pmem = platform_get_drvdata(pdev);
 
+	pmem_free(pmem);
 	return 0;
 }
 
-- 
1.9.3



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31  9:25   ` Christoph Hellwig
  2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
@ 2015-03-31 15:14     ` Boaz Harrosh
  2015-03-31 16:16       ` Christoph Hellwig
  2 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 15:14 UTC (permalink / raw)
  To: Christoph Hellwig, Boaz Harrosh
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
<>
> 
> Any news?  I'd really like to resend this ASAP to get it into 4.1..


Hi Christoph

I hate to be bearer of bad news but we have a problem with the
e820 patch:
	x86: add support for the non-standard protected e820 type

We can not accept it as is right now.
We have conducted farther tests. And it messes up NUMA.

All the below is based on your latest patches on top of 4.0-rc5

Before any modprobe of pmem.ko, just a clean boot.

In the same exact Kernel, if you use memmap=nn!aa ie add "type-12"
section we have below problems, but if we do memmap=nn\$aa ie add
"reserved" section then everything is fine.
[With my old e820 patches it all works fine because it is closer
 to the memmap=nn\$aa "reserved" section way]

The problems we see in a NUMA machine.
* On some machines we cannot boot if a single memmap=nn!aa crosses
  a NUMA boundary.
  Some VMs sometime boot sometimes do not. Some VMs never boot.
  Some machines boot just fine.
  Doing memmap=nn1!aa1,nn2!aa2 where the split is at the NUMA
  boundary will enable all machines to boot

* Regardless if we use memmap=nn!aa crossing a NUMA boundary or
  if we do memmap=nn1!aa1,nn2!aa2 the output of 
  cat /sys/devices/system/node/node1/meminfo
  Is all ZEROs. Yes very scary everything is ZERO. Even though
  the dmseg prints show the correct numbers the above
  node1/meminfo is broken.
  If with the same Kernel we do memmap=nn\$aa then everything
  is clean.
  So something in the way we defined our new type-12 region
  upsets the NUMA code. And we need farther investigation.

Perhaps you would like to start with my e820 much more conservative
fix, that makes type-12 memory behave exactly as reserved memory.
And go from there, step by step. Until we fix the problem above.
So we can submit something like that for 4.1.

Talk to me, tell me what you need me to experiment with. Should I try
your platform device way but based on my e820 fix? how can I please
help push this through?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
  2015-03-31 13:27       ` [PATCH 4/6] SQUSHME: pmem: Micro cleaning Boaz Harrosh
@ 2015-03-31 15:17         ` Dan Williams
  2015-03-31 15:24           ` Boaz Harrosh
  0 siblings, 1 reply; 69+ messages in thread
From: Dan Williams @ 2015-03-31 15:17 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> Some error checks had unlikely some did not. Put unlikely
> on all error handling paths.
> (I like unlikely for error paths specially for readability)

"unlikely()" is not a readability hint, it's specifically for branches
that profiling shows adding it makes a difference.  Just delete them
all until profiling show they make a difference.  They certainly don't
make a difference in the slow paths.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
  2015-03-31 15:17         ` [Linux-nvdimm] " Dan Williams
@ 2015-03-31 15:24           ` Boaz Harrosh
  2015-03-31 15:30             ` Dan Williams
  0 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 15:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On 03/31/2015 06:17 PM, Dan Williams wrote:
> On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>> Some error checks had unlikely some did not. Put unlikely
>> on all error handling paths.
>> (I like unlikely for error paths specially for readability)
> 
> "unlikely()" is not a readability hint, it's specifically for branches
> that profiling shows adding it makes a difference.  Just delete them
> all until profiling show they make a difference.  They certainly don't
> make a difference in the slow paths.
> 

Why?

So we do not fill up the  branch predictor with useless predictions
that will never matter. What is so bad with that. It may be cold path
but added up all over it will show eventually.

I do not see what is the harm of telling the compiler.
    "never store any prediction for this branch"

So since it can never (ever) harm any one or anything, and at the mass
if everywhere it was done this way it could actually help, then sure
it can be a readability thing. Since no harm done, right?

I still like it

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
  2015-03-31 15:24           ` Boaz Harrosh
@ 2015-03-31 15:30             ` Dan Williams
  2015-03-31 15:43               ` Boaz Harrosh
  0 siblings, 1 reply; 69+ messages in thread
From: Dan Williams @ 2015-03-31 15:30 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On Tue, Mar 31, 2015 at 8:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/31/2015 06:17 PM, Dan Williams wrote:
>> On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>>
>>> Some error checks had unlikely some did not. Put unlikely
>>> on all error handling paths.
>>> (I like unlikely for error paths specially for readability)
>>
>> "unlikely()" is not a readability hint, it's specifically for branches
>> that profiling shows adding it makes a difference.  Just delete them
>> all until profiling show they make a difference.  They certainly don't
>> make a difference in the slow paths.
>>
>
> Why?

Because the compiler and cpu already does a decent job, and if you get
the frequency wrong it can hurt performance [1].

It's pre-mature optimization to sprinkle them around, especially in slow paths.

[1]: https://lwn.net/Articles/420019/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
  2015-03-31 15:30             ` Dan Williams
@ 2015-03-31 15:43               ` Boaz Harrosh
  2015-03-31 19:40                 ` Matthew Wilcox
  0 siblings, 1 reply; 69+ messages in thread
From: Boaz Harrosh @ 2015-03-31 15:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On 03/31/2015 06:30 PM, Dan Williams wrote:
> On Tue, Mar 31, 2015 at 8:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 03/31/2015 06:17 PM, Dan Williams wrote:
>>> On Tue, Mar 31, 2015 at 6:27 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>>>
>>>> Some error checks had unlikely some did not. Put unlikely
>>>> on all error handling paths.
>>>> (I like unlikely for error paths specially for readability)
>>>
>>> "unlikely()" is not a readability hint, it's specifically for branches
>>> that profiling shows adding it makes a difference.  Just delete them
>>> all until profiling show they make a difference.  They certainly don't
>>> make a difference in the slow paths.
>>>
>>
>> Why?
> 
> Because the compiler and cpu already does a decent job, and if you get
> the frequency wrong it can hurt performance [1].
> 
> It's pre-mature optimization to sprinkle them around, especially in slow paths.
> 
> [1]: https://lwn.net/Articles/420019/
> 

Sigh! It looks like a holy war. Again all that was said at above thread was about
statistical prediction yes-or-no. And I agree with all the use cases.

But not here. This is not an optimization this is the *error path*.
What I'm saying is:
	"No compiler nor CPU, even if 99% of the time this branch is taken
         I still consider it cold. Because it is the error case and
         I do not care for it"

And no I did not get it wrong. All these places are "error paths" that I do not
care for.

If any of these places are dependent on some input or code variable then yes
let the smarts do it. But never in the "error path".

That said. the patch is up for grabs. I like it ...

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 10:31       ` Boaz Harrosh
  2015-03-31 14:21       ` [RFC] SQUASHME: pmem: Split up pmem_probe from pmem_alloc Boaz Harrosh
@ 2015-03-31 16:08       ` Christoph Hellwig
  2 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-31 16:08 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 01:25:46PM +0300, Boaz Harrosh wrote:
> The problem I see is that if I state a memmap=nn!aa that crosses a NUMA
> boundary then the machine will not boot.
> So BTW for sure I need that "don't merge E820_PMEM ranges" patch because
> otherwise I will not be able to boot if I have pmem on both NUMA nodes
> and they happen to be contiguous.

Ok.

> Regarding the SQUASHMEs to PMEM. Originally I had them as 3-4 patches.
> But I thought since you are squashing them into a single submitted patch
> I can just send just the one patch. Tell me what you prefer and I'll
> resend (The one vs the three)

The slpit is mostly to get a good description for each change.

> And one last issue. I have some configuration "hardness" with the
> memmap=nn!aa Kernel command line API, it was better for me with the
> pmem map= module param. Will you be OK if I split pmem_probe() into
> calling pmem_alloc(addr, length), so I can keep an out-of-tree patch
> that adds the map= parameter to pmem?

Can't your out of tree patch do that as well?  In fact you might want to
rewrite it to a module that allows your to create/destroy the platform_devices
you use.  And for your PCIe case I'd really prefer a proper in-tree PCI
driver for it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC] SQUASHME: pmem: Split up pmem_probe from pmem_alloc
  2015-03-31 14:21       ` [RFC] SQUASHME: pmem: Split up pmem_probe from pmem_alloc Boaz Harrosh
@ 2015-03-31 16:10         ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-31 16:10 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Christoph Hellwig, linux-nvdimm, linux-fsdevel,
	linux-kernel, x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 05:21:15PM +0300, Boaz Harrosh wrote:
> -static int pmem_probe(struct platform_device *pdev)
> +static int pmem_alloc(struct resource *res, struct device *dev,
> +		      struct pmem_device **o_pmem)
>  {

please return the pmem device or an ERR_PTR() here.

Except for that it looks fine for me, and it's exactly what we'd export
for your PCIe device which would get an almost trivial pci_driver wrapper
around it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 15:14     ` another pmem variant V2 Boaz Harrosh
@ 2015-03-31 16:16       ` Christoph Hellwig
  2015-03-31 16:44         ` Ingo Molnar
  2015-04-01 12:49         ` Boaz Harrosh
  0 siblings, 2 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-31 16:16 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote:
> We can not accept it as is right now.

Who is we?

> We have conducted farther tests. And it messes up NUMA.

Only you if you use the memmap option in weird ways.

Sounds like I should simply remove the memmap= option so people don't
abuse it.  The main point is to parse the e820 tables, which works fine.

And those people having fake pmem, or pcie cards that they are too lazy
to submit proper drivers for can stick to their out of tree hacks?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 16:16       ` Christoph Hellwig
@ 2015-03-31 16:44         ` Ingo Molnar
  2015-03-31 17:24           ` Christoph Hellwig
  2015-04-01 12:49         ` Boaz Harrosh
  1 sibling, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2015-03-31 16:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Boaz Harrosh, linux-nvdimm, linux-fsdevel, linux-kernel, x86,
	ross.zwisler, axboe


* Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote:
> > We can not accept it as is right now.
> 
> Who is we?
> 
> > We have conducted farther tests. And it messes up NUMA.
> 
> Only you if you use the memmap option in weird ways.
> 
> Sounds like I should simply remove the memmap= option so people don't
> abuse it.  The main point is to parse the e820 tables, which works fine.

I'd be fine with that too - mind sending an updated series?

the memmap= option can be added on top of that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [Linux-nvdimm] [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections
  2015-03-31 13:23       ` [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections Boaz Harrosh
@ 2015-03-31 17:16         ` Brooks, Adam J
  0 siblings, 0 replies; 69+ messages in thread
From: Brooks, Adam J @ 2015-03-31 17:16 UTC (permalink / raw)
  To: Boaz Harrosh, Christoph Hellwig
  Cc: axboe, linux-nvdimm, x86, linux-kernel, linux-fsdevel

>If we need a NUMA crossing a mids a contiguous memory range
>we split the range at NUMA boundary so to produce two pmemX
>devices. We do not like 2 NUMA IDs at the same device.
>
>TODO: What happens with real type-12 NvDIMM the BIOS splits
>these ranges?

The physical memory map for an Intel based dual socket system supporting NVDIMMS will be:
Socket 0 Normal Memory
Socket 0 NVDIMMs
Socket 1 Normal Memory
Socket 1 NVDIMMs

If you have both NVDIMMs and normal DIMMs on both sockets you will see two separate ranges. If there are NVDIMMs on both sockets, but no normal memory on socket 1, then you will end up with a single range covering the NVDIMMs on both sockets.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 16:44         ` Ingo Molnar
@ 2015-03-31 17:24           ` Christoph Hellwig
  2015-03-31 17:33             ` [Linux-nvdimm] " Dan Williams
  0 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-31 17:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Boaz Harrosh, linux-nvdimm, linux-fsdevel,
	linux-kernel, x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:
> I'd be fine with that too - mind sending an updated series?

I will send an updated one tonight or early tomorrow.

Btw, do you want to keep the E820_PRAM name instead of E820_PMEM?
Seems like most people either don't care or prefer E820_PMEM. I'm
fine either way.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] another pmem variant V2
  2015-03-31 17:24           ` Christoph Hellwig
@ 2015-03-31 17:33             ` Dan Williams
  2015-04-01  7:50               ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Dan Williams @ 2015-03-31 17:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, Jens Axboe, linux-nvdimm, X86 ML, linux-kernel,
	linux-fsdevel

On Tue, Mar 31, 2015 at 10:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:
>> I'd be fine with that too - mind sending an updated series?
>
> I will send an updated one tonight or early tomorrow.
>
> Btw, do you want to keep the E820_PRAM name instead of E820_PMEM?
> Seems like most people either don't care or prefer E820_PMEM. I'm
> fine either way.

FWIW, I like the idea of having a separate E820_PRAM name for type-12
memory vs future "can't yet disclose" UEFI memory type.  The E820_PRAM
type potentially has the property of being relegated to "legacy"
NVDIMMs.  We can later add E820_PMEM as a memory type that, for
example, is not automatically backed by struct page.  That said, I'm
fine either way.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] [PATCH 4/6] SQUSHME: pmem: Micro cleaning
  2015-03-31 15:43               ` Boaz Harrosh
@ 2015-03-31 19:40                 ` Matthew Wilcox
  0 siblings, 0 replies; 69+ messages in thread
From: Matthew Wilcox @ 2015-03-31 19:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, Jens Axboe, linux-nvdimm, X86 ML, linux-kernel,
	linux-fsdevel, Christoph Hellwig

On Tue, Mar 31, 2015 at 06:43:40PM +0300, Boaz Harrosh wrote:
> But not here. This is not an optimization this is the *error path*.
> What I'm saying is:
> 	"No compiler nor CPU, even if 99% of the time this branch is taken
>          I still consider it cold. Because it is the error case and
>          I do not care for it"

GCC already understands that "if (foo) goto FORWARD_LABEL" should be
predicted unlikely by default.  All you're doing is cluttering the
source code.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: another pmem variant V2
  2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2015-03-26 16:57 ` another pmem variant V2 Boaz Harrosh
@ 2015-03-31 22:11 ` Elliott, Robert (Server Storage)
  2015-04-01  7:26   ` Christoph Hellwig
  2015-04-01 19:33 ` Elliott, Robert (Server Storage)
  5 siblings, 1 reply; 69+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-03-31 22:11 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe, boaz, Kani, Toshimitsu



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvdimm@ml01.01.org; linux-fsdevel@vger.kernel.org; linux-
> kernel@vger.kernel.org; x86@kernel.org
> Cc: ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com
> Subject: another pmem variant V2
> 
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.  The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistant memory region instead of hardconding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.
> 
...
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
> 
> Changes since V1:
>   - s/E820_PROTECTED_KERN/E820_PMEM/g
>   - map the persistent memory as uncached
>   - better kernel parameter description
>   - various typo fixes
>   - MODULE_LICENSE fix

I used fio to test 4 KiB random read and write IOPS 
on a 2-socket x86 DDR4 system.  With various cache attributes:

attr	read		write		notes
----	----		-----		-----
UC	37 K		21 K		ioremap_nocache
WB	3.6 M		2.5 M		ioremap
WC	764 K		3.7 M		ioremap_wc
WT	<not tested yet>		ioremap_wt

So, although UC and WT are the only modes certain to be safe,
the V1 default of UC provides abysmal performance - worse than
a consumer-class SATA SSD.

A solution for x86 is to use the MOVNTI instruction in WB
mode. This non-temporal hint uses a buffer like the write
combining buffer, not filling the cache and not stopping
everything in the CPU.  The kernel function __copy_from_user() 
uses that instruction (with SFENCE at the end) - see
arch/x86/lib/copy_user_nocache_64.S.

If I made the change from memcpy() to __copy_from_user()
correctly, that results in:

attr		read		write		notes
----		----		-----		-----
WB w/NTI	2.4 M		2.6 M		__copy_from_user()
WC w/NTI	3.2 M		2.1 M		__copy_from_user()

There is also a non-temporal streaming load hint instruction
called MOVNTDQA that might be helpful for reads for both WB
and WC. I don't see any existing kernel memcpy-like function 
that utilizes this instruction, so haven't tried it yet.


Intel64 and IA-32 Architectures 
Software Developers Manual excerpts (Jan 2015)
===================================
"The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ,
MOVNTPS, and MOVNTPD) allow data to be moved from the
processor's registers directly into system memory without
being also written into the L1, L2, and/or L3 caches. These
instructions can be used to prevent cache pollution when
operating on data that is going to be modified only once
before being stored back into system memory. ...

MOVNTI
...
The non-temporal hint is implemented by using a write
combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor
does not write the data into the cache hierarchy,
nor does it fetch the corresponding cache line from
memory into the cache hierarchy.
...

MOVNTDQA Provides a non-temporal hint that can cause
adjacent 16-byte items within an aligned 64-byte region
(a streaming line) to be fetched and held in a small
set of temporary buffers ("streaming load buffers"). 
Subsequent streaming loads to other aligned 16-byte 
items in the same streaming line may be supplied from
the streaming load buffer and can improve throughput.
...
A processor implementation may make use of the 
non-temporal hint associated with this instruction if
the memory source is WC (write combining) memory type. 
An implementation may also make use of the non-temporal
hint associated with this instruction if the memory
source is WB (writeback) memory type."



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 22:11 ` Elliott, Robert (Server Storage)
@ 2015-04-01  7:26   ` Christoph Hellwig
  2015-04-02 15:11     ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-04-01  7:26 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz, Kani, Toshimitsu

On Tue, Mar 31, 2015 at 10:11:29PM +0000, Elliott, Robert (Server Storage) wrote:
> I used fio to test 4 KiB random read and write IOPS 
> on a 2-socket x86 DDR4 system.  With various cache attributes:
> 
> attr	read		write		notes
> ----	----		-----		-----
> UC	37 K		21 K		ioremap_nocache
> WB	3.6 M		2.5 M		ioremap
> WC	764 K		3.7 M		ioremap_wc
> WT	<not tested yet>		ioremap_wt
> 
> So, although UC and WT are the only modes certain to be safe,
> the V1 default of UC provides abysmal performance - worse than
> a consumer-class SATA SSD.

It doesn't look quite as bad on my setup, but performance is fairly
bad here as well.

> A solution for x86 is to use the MOVNTI instruction in WB
> mode. This non-temporal hint uses a buffer like the write
> combining buffer, not filling the cache and not stopping
> everything in the CPU.  The kernel function __copy_from_user() 
> uses that instruction (with SFENCE at the end) - see
> arch/x86/lib/copy_user_nocache_64.S.
> 
> If I made the change from memcpy() to __copy_from_user()
> correctly, that results in:
> 
> attr		read		write		notes
> ----		----		-----		-----
> WB w/NTI	2.4 M		2.6 M		__copy_from_user()
> WC w/NTI	3.2 M		2.1 M		__copy_from_user()

That looks a lot better.  It doesn't help us with a pmem device
mapped directly into userspace using mmap with the DAX infrastructure,
though.

Note when we want to move to non-temporal copies we'll need to add
a new prototype, as __copy_from_user isn't guaranteed to use these,
and it is defined to only work on user addresses.  That doesn't matter
on x86 but would blow up on say sparc or s390.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] another pmem variant V2
  2015-03-31 17:33             ` [Linux-nvdimm] " Dan Williams
@ 2015-04-01  7:50               ` Ingo Molnar
  2015-04-01  8:06                 ` Boaz Harrosh
  0 siblings, 1 reply; 69+ messages in thread
From: Ingo Molnar @ 2015-04-01  7:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Mar 31, 2015 at 10:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:
> >> I'd be fine with that too - mind sending an updated series?
> >
> > I will send an updated one tonight or early tomorrow.
> >
> > Btw, do you want to keep the E820_PRAM name instead of E820_PMEM?
> > Seems like most people either don't care or prefer E820_PMEM. I'm
> > fine either way.
> 
> FWIW, I like the idea of having a separate E820_PRAM name for 
> type-12 memory vs future "can't yet disclose" UEFI memory type.  The 
> E820_PRAM type potentially has the property of being relegated to 
> "legacy" NVDIMMs.  We can later add E820_PMEM as a memory type that, 
> for example, is not automatically backed by struct page.  That said, 
> I'm fine either way.

I agree that it's a minor detail, but I think the separation is 
useful in two ways:

 - We have a generic 'pmem' driver, but the low level, platform 
   specific RAM enumeration name does not use that name.

 - 'E820_PRAM' is a more natural extension of 'E820_RAM'.

Later on we can then do a:

    s/E820_PRAM/E820_LEGACY_PRAM

rename or so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Linux-nvdimm] another pmem variant V2
  2015-04-01  7:50               ` Ingo Molnar
@ 2015-04-01  8:06                 ` Boaz Harrosh
  0 siblings, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-04-01  8:06 UTC (permalink / raw)
  To: Ingo Molnar, Dan Williams
  Cc: Jens Axboe, linux-nvdimm, X86 ML, linux-kernel, linux-fsdevel,
	Christoph Hellwig

On 04/01/2015 10:50 AM, Ingo Molnar wrote:
> 
> * Dan Williams <dan.j.williams@intel.com> wrote:
> 
>> On Tue, Mar 31, 2015 at 10:24 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:
>>>> I'd be fine with that too - mind sending an updated series?
>>>
>>> I will send an updated one tonight or early tomorrow.
>>>
>>> Btw, do you want to keep the E820_PRAM name instead of E820_PMEM?
>>> Seems like most people either don't care or prefer E820_PMEM. I'm
>>> fine either way.
>>
>> FWIW, I like the idea of having a separate E820_PRAM name for 
>> type-12 memory vs future "can't yet disclose" UEFI memory type.  The 
>> E820_PRAM type potentially has the property of being relegated to 
>> "legacy" NVDIMMs.  We can later add E820_PMEM as a memory type that, 
>> for example, is not automatically backed by struct page.  That said, 
>> I'm fine either way.
> 
> I agree that it's a minor detail, but I think the separation is 
> useful in two ways:
> 
>  - We have a generic 'pmem' driver, but the low level, platform 
>    specific RAM enumeration name does not use that name.
> 
>  - 'E820_PRAM' is a more natural extension of 'E820_RAM'.
> 
> Later on we can then do a:
> 
>     s/E820_PRAM/E820_LEGACY_PRAM
> 
> rename or so.

If Dan does not like E820_PMEM. Than please let us just call it
E820_PMEM_LEGACY right from the let go. But PRAM is exactly not very
good because it is similar to RAM.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-03-31 16:16       ` Christoph Hellwig
  2015-03-31 16:44         ` Ingo Molnar
@ 2015-04-01 12:49         ` Boaz Harrosh
  1 sibling, 0 replies; 69+ messages in thread
From: Boaz Harrosh @ 2015-04-01 12:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 07:16 PM, Christoph Hellwig wrote:
> On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote:
>> We can not accept it as is right now.
> 
> Who is we?
> 
>> We have conducted farther tests. And it messes up NUMA.
> 
> Only you if you use the memmap option in weird ways.
> 

No weird ways just the single range. I would be very happy if
you can teach me the proper way, believe me I'm trying for 2 days.

> Sounds like I should simply remove the memmap= option so people don't
> abuse it.  The main point is to parse the e820 tables, which works fine.
> 

What abuse? the single range and the problem shows up.

So you are just pasting over the problem sir. The patch that you
are submitting has grave problems and covering it up with not
allowing memmap=! will not fix it.

The Kernel as is after your patch does not like this half baked
beast as we defined it, the defined pmem-memory-range messes
things up.

> And those people having fake pmem, or pcie cards that they are too lazy
> to submit proper drivers for can stick to their out of tree hacks?
> 

I have now conducted more tests on real type-12 DDR3 system and
the exact same problem as I reported exists with real type-12
chips!
And who are we kidding? the "memmap=!" yes or no, makes no
difference at all. All it does is edit the table as if it was
the table the BIOS gave us. There is no extra processing done
on memmap=.

Your e820 patch trashes NUMA.

But I fix it for good this time here is the fix below.
After I apply below patch every thing boots and work just as
expected. All the problems I reported disappear. Any configuration,
any number of ranges, cross NUMA or not, just works, exactly as
before with my patches.

The fix is over your V2, I will post one later that fixes
your V3 and adds back the memmap=!

---
If you inspect my fix below You will see that what happened is
that the original patch was too aggressive in making pmem
look like ram. In fact it started the ARCH side memory
initialization and was only skipping the generic initialization
of memory. This messed up internal real-memory structures.

With this fix below block/drivers/pmem.ko loads just fine
finds its resources and maps them, and everything just
works. including any "abuse" to memmap=! and any NUMA
configurations. Both real HW, type-12 HW and NUMA Vms.
(Preliminary testing we are conducting the full test
 rig as we speak)

(I'm still going over it, I might send some more cleaning)

Cheers

---
git diff --stat -p -M HEAD
 arch/x86/kernel/e820.c  |  7 ++++---
 arch/x86/kernel/pmem.c  | 17 -----------------
 arch/x86/kernel/setup.c |  2 --
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 07246e5..dce0e84 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -963,9 +963,10 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			if (e820.map[i].type != E820_PRAM)
-				res->flags |= IORESOURCE_BUSY;
+		if (((e820.map[i].type != E820_RESERVED) &&
+		     (e820.map[i].type != E820_PRAM)) ||
+		     res->start < (1ULL<<20)) {
+			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index f970048..fcdbc20 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -9,23 +9,6 @@
 #include <asm/page_types.h>
 #include <asm/setup.h>
 
-void __init reserve_pmem(void)
-{
-	int i;
-
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-
-		if (ei->type != E820_PRAM)
-			continue;
-
-		memblock_reserve(ei->addr, ei->addr + ei->size);
-		max_pfn_mapped = init_memory_mapping(
-				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
-				ei->addr + ei->size);
-	}
-}
-
 static __init void register_pmem_device(struct resource *res)
 {
 	struct platform_device *pdev;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f2bed2b..0a2421c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1158,8 +1158,6 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
-	reserve_pmem();
-
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: another pmem variant V2
  2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2015-03-31 22:11 ` Elliott, Robert (Server Storage)
@ 2015-04-01 19:33 ` Elliott, Robert (Server Storage)
  2015-04-02  9:37   ` Christoph Hellwig
  5 siblings, 1 reply; 69+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-04-01 19:33 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe, boaz, Kani, Toshimitsu

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvdimm@ml01.01.org; linux-fsdevel@vger.kernel.org; linux-
> kernel@vger.kernel.org; x86@kernel.org
> Cc: ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com
> Subject: another pmem variant V2
> 

I triggered a paging error in the memcpy call for a block read
from system-udevd (actually in a modified memcpy() for the cache
attribute experiments).  

1. This triggered an illegal schedule() call from an atomic context.
The call trace is shown below.

2. memcpy() doesn't provide exception handling or error reporting.
Some functions like do so, like __copy_user_nocache in
arch/x85/lib/copy_user_nocache_64.S.  

Should pmem only use functions that do so, if available on the
architecture?

pmem_rw_page can pass along the return value from the copy function.
pmem_make_request can report the error, if any, via bio_endio.


Call trace
==========
[62117.317216] BUG: scheduling while atomic: systemd-udevd/22135/0x00000001
[62117.317232] Modules linked in: pmem ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci hpilo xhci_hcd sb_edac edac_core microcode iTCO_wdt iTCO_vendor_support hpwdt ioatdma shpchp pcspkr lpc_ich mfd_core i2c_i801 wmi pcc_cpufreq dca acpi_cpufreq uinput nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs exportfs sr_mod cdrom sd_mod bnx2x tg3 ahci mdio libahci ptp pps_core hpsa libcrc32c dm_mirror dm_region_hash dm_log dm_mod ipv6 autofs4 [last unloaded: pmem]
[62117.317233] CPU: 31 PID: 22135 Comm: systemd-udevd Tainted: G      D         4.0.0-rc6+ #7
[62117.317234] Hardware name: HP ProLiant DL380 Gen9
[62117.317235]  ffff88047f3f3ac0 ffff8804241db2e8 ffffffff815a8866 00000000ff86ff86
[62117.317236]  ffff8804241dbfd8 ffff8804241db2f8 ffffffff815a4b45 ffff8804241db348
[62117.317237]  ffffffff815ab893 ffff880457091050 ffff88047f3fbb20 0000000000000000
[62117.317237] Call Trace:
[62117.317240]  [<ffffffff815a8866>] dump_stack+0x45/0x57
[62117.317245]  [<ffffffff815a4b45>] __schedule_bug+0x46/0x54
[62117.317247]  [<ffffffff815ab893>] __schedule+0x793/0x870
[62117.317251]  [<ffffffff815ac0f0>] ? bit_wait+0x50/0x50
[62117.317252]  [<ffffffff815ab9a7>] schedule+0x37/0x90
[62117.317253]  [<ffffffff815ae0cc>] schedule_timeout+0x1dc/0x260
[62117.317258]  [<ffffffff810bab5e>] ? ktime_get+0x3e/0xa0
[62117.317259]  [<ffffffff815ab06c>] io_schedule_timeout+0xac/0x140
[62117.317261]  [<ffffffff815ac126>] bit_wait_io+0x36/0x50
[62117.317262]  [<ffffffff815abeeb>] __wait_on_bit_lock+0x4b/0xb0
[62117.317263]  [<ffffffff81136f62>] ? find_get_entries+0xe2/0x130
[62117.317265]  [<ffffffff811342ec>] __lock_page+0xac/0xb0
[62117.317269]  [<ffffffff81090830>] ? autoremove_wake_function+0x40/0x40
[62117.317276]  [<ffffffff8114322f>] truncate_inode_pages_range+0x3af/0x620
[62117.317278]  [<ffffffff8128e837>] ? cpumask_next_and+0x37/0x50
[62117.317279]  [<ffffffff811c6d80>] ? __brelse+0x40/0x40
[62117.317283]  [<ffffffff810c8add>] ? smp_call_function_many+0x5d/0x280
[62117.317284]  [<ffffffff8128e929>] ? free_cpumask_var+0x9/0x10
[62117.317285]  [<ffffffff810c8ebd>] ? on_each_cpu_cond+0xbd/0x160
[62117.317286]  [<ffffffff811c6d80>] ? __brelse+0x40/0x40
[62117.317288]  [<ffffffff811434b5>] truncate_inode_pages+0x15/0x20
[62117.317289]  [<ffffffff811cab13>] kill_bdev+0x33/0x40
[62117.317291]  [<ffffffff811cbfa8>] __blkdev_put+0x68/0x210
[62117.317293]  [<ffffffff811cca20>] blkdev_put+0x50/0x130
[62117.317294]  [<ffffffff811ccbb5>] blkdev_close+0x25/0x30
[62117.317296]  [<ffffffff811969f7>] __fput+0xe7/0x220
[62117.317298]  [<ffffffff81196b7e>] ____fput+0xe/0x10
[62117.317302]  [<ffffffff8106c554>] task_work_run+0xc4/0xe0
[62117.317306]  [<ffffffff810533f8>] do_exit+0x2d8/0xb10
[62117.317308]  [<ffffffff810a4b6c>] ? kmsg_dump+0x9c/0xc0
[62117.317312]  [<ffffffff8100634e>] oops_end+0x8e/0xd0
[62117.317313]  [<ffffffff815a428f>] no_context+0x2d4/0x334
[62117.317314]  [<ffffffff815a435c>] __bad_area_nosemaphore+0x6d/0x1c6
[62117.317317]  [<ffffffff811506c0>] ? zone_statistics+0x80/0xa0
[62117.317319]  [<ffffffff815a44c8>] bad_area_nosemaphore+0x13/0x15
[62117.317321]  [<ffffffff81043ea1>] __do_page_fault+0x91/0x430
[62117.317322]  [<ffffffff8104424c>] do_page_fault+0xc/0x10
[62117.317324]  [<ffffffff815b0a62>] page_fault+0x22/0x30
[62117.317325]  [<ffffffffa0078302>] ? pmem_do_bvec.isra.6+0x212/0x3f0 [pmem]
[62117.317326]  [<ffffffffa0078523>] pmem_rw_page+0x43/0x60 [pmem]
[62117.317328]  [<ffffffff81293148>] ? __radix_tree_preload+0x38/0xa0
[62117.317329]  [<ffffffff811ca9de>] bdev_read_page+0x2e/0x40
[62117.317330]  [<ffffffff811d131f>] do_mpage_readpage+0x51f/0x6c0
[62117.317331]  [<ffffffff8114211e>] ? lru_cache_add+0xe/0x10
[62117.317332]  [<ffffffff811d159b>] mpage_readpages+0xdb/0x130
[62117.317333]  [<ffffffff811ca990>] ? I_BDEV+0x10/0x10
[62117.317334]  [<ffffffff811ca990>] ? I_BDEV+0x10/0x10
[62117.317336]  [<ffffffff811cb14d>] blkdev_readpages+0x1d/0x20
[62117.317336]  [<ffffffff811404d4>] __do_page_cache_readahead+0x194/0x210
[62117.317337]  [<ffffffff811408e5>] force_page_cache_readahead+0x75/0xb0
[62117.317338]  [<ffffffff81140963>] page_cache_sync_readahead+0x43/0x50
[62117.317339]  [<ffffffff81136161>] generic_file_read_iter+0x431/0x630
[62117.317341]  [<ffffffff811cb4e7>] blkdev_read_iter+0x37/0x40
[62117.317342]  [<ffffffff8119466e>] new_sync_read+0x7e/0xb0
[62117.317343]  [<ffffffff81195838>] __vfs_read+0x18/0x50
[62117.317344]  [<ffffffff811958f6>] vfs_read+0x86/0x140
[62117.317345]  [<ffffffff811959f6>] SyS_read+0x46/0xb0
[62117.317346]  [<ffffffff810eac94>] ? __audit_syscall_entry+0xb4/0x110
[62117.317348]  [<ffffffff815aeff2>] system_call_fastpath+0x12/0x17
[62121.618505] note: systemd-udevd[22133] exited with preempt_count 1



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-04-01 19:33 ` Elliott, Robert (Server Storage)
@ 2015-04-02  9:37   ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-04-02  9:37 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz, Kani, Toshimitsu

On Wed, Apr 01, 2015 at 07:33:38PM +0000, Elliott, Robert (Server Storage) wrote:
> I triggered a paging error in the memcpy call for a block read
> from system-udevd (actually in a modified memcpy() for the cache
> attribute experiments).  
> 
> 1. This triggered an illegal schedule() call from an atomic context.
> The call trace is shown below.
> 
> 2. memcpy() doesn't provide exception handling or error reporting.
> Some functions like do so, like __copy_user_nocache in
> arch/x85/lib/copy_user_nocache_64.S.  
> 
> Should pmem only use functions that do so, if available on the
> architecture?

We'll need to define an interface for the function to use if it isn't
plain memcpy, which would include that detail.

But I can't see how that memcpy should ever have to handle a page fault,
I'd be curious how your reproduces this issue.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: another pmem variant V2
  2015-04-01  7:26   ` Christoph Hellwig
@ 2015-04-02 15:11     ` Elliott, Robert (Server Storage)
  2015-04-02 16:41       ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-04-02 15:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz, Kani, Toshimitsu



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@lst.de]
> Sent: Wednesday, April 1, 2015 2:26 AM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; linux-nvdimm@ml01.01.org; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
> ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com; Kani,
> Toshimitsu
> Subject: Re: another pmem variant V2
> 
> On Tue, Mar 31, 2015 at 10:11:29PM +0000, Elliott, Robert (Server Storage)
> wrote:
> > I used fio to test 4 KiB random read and write IOPS
> > on a 2-socket x86 DDR4 system.  With various cache attributes:
> >
> > attr	read		write		notes
> > ----	----		-----		-----
> > UC	37 K		21 K		ioremap_nocache
> > WB	3.6 M		2.5 M		ioremap
> > WC	764 K		3.7 M		ioremap_wc
> > WT	<not tested yet>		ioremap_wt
> >
> > So, although UC and WT are the only modes certain to be safe,
> > the V1 default of UC provides abysmal performance - worse than
> > a consumer-class SATA SSD.
> 
> It doesn't look quite as bad on my setup, but performance is fairly
> bad here as well.
> 
> > A solution for x86 is to use the MOVNTI instruction in WB
> > mode. This non-temporal hint uses a buffer like the write
> > combining buffer, not filling the cache and not stopping
> > everything in the CPU.  The kernel function __copy_from_user()
> > uses that instruction (with SFENCE at the end) - see
> > arch/x86/lib/copy_user_nocache_64.S.
> >
> > If I made the change from memcpy() to __copy_from_user()
> > correctly, that results in:
> >
> > attr		read		write		notes
> > ----		----		-----		-----
> > WB w/NTI	2.4 M		2.6 M		__copy_from_user()
> > WC w/NTI	3.2 M		2.1 M		__copy_from_user()
> 
> That looks a lot better.  It doesn't help us with a pmem device
> mapped directly into userspace using mmap with the DAX infrastructure,
> though.
> 
> Note when we want to move to non-temporal copies we'll need to add
> a new prototype, as __copy_from_user isn't guaranteed to use these,
> and it is defined to only work on user addresses.  That doesn't matter
> on x86 but would blow up on say sparc or s390.

Here are some updated numbers including:
* WT (writethrough) cache attribute
* memcpy that uses non-temporal stores (MOVNTDQ) to the 
  persistent memory for block writes (rather than MOVNTI)
* memcpy that uses non-temporal loads (MOVNTDQA) from the 
  persistent memory for block reads

Attr	Copy		Read IOPS		Write IOPS
====	====		=========		==========
UC	memcpy		36 K			22 K
UC	NT rd,wr	513 K			326 K

WB	memcpy		3.4 M			2.5 M
WB	NT rd,wr	3.3 M			3.5 M

WC	memcpy		776 K			3.5 M
WC	NT rd,wr	3.0 M			3.9 M

WT	memcpy		2.1 M			22 K
WT	NT rd,wr	3.3 M			2.1 M

a few other variations yielded the peak numbers:
WC	NT rd only	3.2 M			4.1 M
WC	NT wr only	712 K			4.6 M
WT	NT wr only	2.6 M			4.0 M

There are lots of tuning considerations for those memcpy 
functions - how far to unroll the loop, whether to
include PRFETCHNTA instructions, etc.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-04-02 15:11     ` Elliott, Robert (Server Storage)
@ 2015-04-02 16:41       ` Christoph Hellwig
  2015-04-02 18:03         ` Ingo Molnar
  0 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2015-04-02 16:41 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz, Kani, Toshimitsu

On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> Attr	Copy		Read IOPS		Write IOPS
> ====	====		=========		==========
> UC	memcpy		36 K			22 K
> UC	NT rd,wr	513 K			326 K
> 
> WB	memcpy		3.4 M			2.5 M
> WB	NT rd,wr	3.3 M			3.5 M
> 
> WC	memcpy		776 K			3.5 M
> WC	NT rd,wr	3.0 M			3.9 M
> 
> WT	memcpy		2.1 M			22 K
> WT	NT rd,wr	3.3 M			2.1 M
> 
> a few other variations yielded the peak numbers:
> WC	NT rd only	3.2 M			4.1 M
> WC	NT wr only	712 K			4.6 M
> WT	NT wr only	2.6 M			4.0 M
> 
> There are lots of tuning considerations for those memcpy 
> functions - how far to unroll the loop, whether to
> include PRFETCHNTA instructions, etc.

Looks like we should aіm for WC + NT would be a good start.

Can you prepare a patch to add your NT memcpy variants and
a second one to use them in the pmem driver?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
  2015-04-02 16:41       ` Christoph Hellwig
@ 2015-04-02 18:03         ` Ingo Molnar
  0 siblings, 0 replies; 69+ messages in thread
From: Ingo Molnar @ 2015-04-02 18:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Elliott, Robert (Server Storage),
	linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz, Kani, Toshimitsu


* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> > Attr	Copy		Read IOPS		Write IOPS
> > ====	====		=========		==========
> > UC	memcpy		36 K			22 K
> > UC	NT rd,wr	513 K			326 K
> > 
> > WB	memcpy		3.4 M			2.5 M
> > WB	NT rd,wr	3.3 M			3.5 M
> > 
> > WC	memcpy		776 K			3.5 M
> > WC	NT rd,wr	3.0 M			3.9 M
> > 
> > WT	memcpy		2.1 M			22 K
> > WT	NT rd,wr	3.3 M			2.1 M
> > 
> > a few other variations yielded the peak numbers:
> > WC	NT rd only	3.2 M			4.1 M
> > WC	NT wr only	712 K			4.6 M
> > WT	NT wr only	2.6 M			4.0 M
> > 
> > There are lots of tuning considerations for those memcpy 
> > functions - how far to unroll the loop, whether to
> > include PRFETCHNTA instructions, etc.
> 
> Looks like we should aіm for WC + NT would be a good start.
> 
> Can you prepare a patch to add your NT memcpy variants and
> a second one to use them in the pmem driver?

So we already have NT memcpy variants, see the copy_*_user_*_nocache() 
primitives in arch/x86/. They could be used almost straight away for 
kernel memory as well, as kernel buffers will not fault.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: another pmem variant V2
@ 2015-03-26 18:38 Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2015-03-26 18:38 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

And here's the patch, sorry:


diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 4bd525a..e7bf89e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -346,7 +346,7 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
 		 * continue building up new bios map based on this
 		 * information
 		 */
-		if (current_type != last_type)	{
+		if (current_type != last_type || current_type == E820_PRAM) {
 			if (last_type != 0)	 {
 				new_bios[new_bios_entry].size =
 					change_point[chgidx]->addr - last_addr;

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2015-04-02 18:03 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26  8:32 another pmem variant V2 Christoph Hellwig
2015-03-26  8:32 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
2015-03-26 14:12   ` [Linux-nvdimm] " Dan Williams
2015-03-26 14:35     ` Christoph Hellwig
2015-03-26 21:37       ` Ross Zwisler
2015-03-26 14:52     ` Boaz Harrosh
2015-03-26 15:59       ` Dan Williams
2015-03-26  8:32 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
2015-03-26  9:02   ` Ingo Molnar
2015-03-26  9:34     ` Christoph Hellwig
2015-03-26 10:04       ` Ingo Molnar
2015-03-26 10:19         ` Christoph Hellwig
2015-03-26 10:28           ` Ingo Molnar
2015-03-26 10:29             ` Christoph Hellwig
2015-03-26 15:49       ` Boaz Harrosh
2015-03-26 16:02         ` [Linux-nvdimm] " Dan Williams
2015-03-26 16:07           ` Boaz Harrosh
2015-03-26 16:43         ` Christoph Hellwig
2015-03-26 18:46           ` Elliott, Robert (Server Storage)
2015-03-26 19:25             ` [Linux-nvdimm] " Dan Williams
2015-03-26 20:53           ` Ross Zwisler
2015-03-26 22:59       ` Yinghai Lu
2015-03-27  8:10         ` Christoph Hellwig
2015-03-26  8:32 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
2015-03-26 16:57 ` another pmem variant V2 Boaz Harrosh
2015-03-26 17:02   ` [PATCH] SQUASHME: Streamline pmem.c Boaz Harrosh
2015-03-26 17:23     ` Christoph Hellwig
2015-03-26 22:17     ` Ross Zwisler
2015-03-26 22:22     ` Ross Zwisler
2015-03-26 23:31     ` [Linux-nvdimm] " Dan Williams
2015-03-31 13:44       ` Boaz Harrosh
2015-03-26 17:18   ` another pmem variant V2 Christoph Hellwig
2015-03-26 17:31     ` Boaz Harrosh
2015-03-26 18:38       ` Christoph Hellwig
2015-03-31  9:25   ` Christoph Hellwig
2015-03-31 10:25     ` Boaz Harrosh
2015-03-31 10:31       ` Boaz Harrosh
2015-03-31 14:21       ` [RFC] SQUASHME: pmem: Split up pmem_probe from pmem_alloc Boaz Harrosh
2015-03-31 16:10         ` Christoph Hellwig
2015-03-31 16:08       ` another pmem variant V2 Christoph Hellwig
2015-03-31 13:18     ` [SQUASHME 0/6] Streamline of Initial pmem submission Boaz Harrosh
2015-03-31 13:23       ` [PATCH 1/6] SQUASHME: Don't let e820_PMEM sections Boaz Harrosh
2015-03-31 17:16         ` [Linux-nvdimm] " Brooks, Adam J
2015-03-31 13:24       ` [PATCH 2/6] SQUASHME: pmem: Remove getgeo Boaz Harrosh
2015-03-31 13:25       ` [PATCH 3/6] SQUASHME: pmem: Streamline pmem driver Boaz Harrosh
2015-03-31 13:27       ` [PATCH 4/6] SQUSHME: pmem: Micro cleaning Boaz Harrosh
2015-03-31 15:17         ` [Linux-nvdimm] " Dan Williams
2015-03-31 15:24           ` Boaz Harrosh
2015-03-31 15:30             ` Dan Williams
2015-03-31 15:43               ` Boaz Harrosh
2015-03-31 19:40                 ` Matthew Wilcox
2015-03-31 13:28       ` [PATCH 5/6] SQUASHME: pmem: Remove SECTOR_SHIFT Boaz Harrosh
2015-03-31 13:33       ` [PATCH 6/6] SQUASHME: pmem: Remove "... based on brd.c" + Copyright Boaz Harrosh
2015-03-31 15:14     ` another pmem variant V2 Boaz Harrosh
2015-03-31 16:16       ` Christoph Hellwig
2015-03-31 16:44         ` Ingo Molnar
2015-03-31 17:24           ` Christoph Hellwig
2015-03-31 17:33             ` [Linux-nvdimm] " Dan Williams
2015-04-01  7:50               ` Ingo Molnar
2015-04-01  8:06                 ` Boaz Harrosh
2015-04-01 12:49         ` Boaz Harrosh
2015-03-31 22:11 ` Elliott, Robert (Server Storage)
2015-04-01  7:26   ` Christoph Hellwig
2015-04-02 15:11     ` Elliott, Robert (Server Storage)
2015-04-02 16:41       ` Christoph Hellwig
2015-04-02 18:03         ` Ingo Molnar
2015-04-01 19:33 ` Elliott, Robert (Server Storage)
2015-04-02  9:37   ` Christoph Hellwig
2015-03-26 18:38 Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).