LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v7 0/3] block device interposer
@ 2021-03-12 15:44 Sergei Shtepa
  2021-03-12 15:44 ` [PATCH v7 1/3] block: add blk_mq_is_queue_frozen() Sergei Shtepa
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-12 15:44 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api
  Cc: sergei.shtepa, pavel.tide

Hi all.

I'm joyful to suggest the block device interposer (bdev_interposer) v7.
bdev_interposer allows to redirect bio requests to other block devices.

In this series of patches I suggest a different implementation of the bio
interception mechanism. Now the interposer is a different block device.
Instead of an additional hook, the function fops->submit_bio() of
the interposer device is used.
This implementation greatly simplifies the application of this
bdev_interposer in device-mapper. But there is one limitation - the size
of the interposer device must be greater than or equal to the size of
the original device.

The first patch adds the function blk_mq_is_queue_frozen(). It allows to
check a queue state.

The second patch is dedicated to bdev_interposer itself, which provides
the ability to redirect bio to the interposer device.

The third one adds the DM_INTERPOSED_FLAG flag. When this flag is
applied with the ioctl DM_TABLE_LOAD_CMD, the underlying devices are
opened without the FMODE_EXCL flag and connected via bdev_interposer.

Changes in this patchset v7:
  * the request interception mechanism. Now the interposer is
    a block device that receives requests instead of the original device;
  * code design fixes.

History:
v6 - https://patchwork.kernel.org/project/linux-block/cover/1614774618-22410-1-git-send-email-sergei.shtepa@veeam.com/
  * designed for 5.12;
  * thanks to the new design of the bio structure in v5.12, it is
    possible to perform interception not for the entire disk, but
    for each block device;
  * instead of the new ioctl DM_DEV_REMAP_CMD and the 'noexcl' option,
    the DM_INTERPOSED_FLAG flag for the ioctl DM_TABLE_LOAD_CMD is
    applied.

v5 - https://patchwork.kernel.org/project/linux-block/cover/1612881028-7878-1-git-send-email-sergei.shtepa@veeam.com/
 * rebase for v5.11-rc7;
 * patch set organization;
 * fix defects in documentation;
 * add some comments;
 * change mutex names for better code readability;
 * remove calling bd_unlink_disk_holder() for targets with non-exclusive
   flag;
 * change type for struct dm_remap_param from uint8_t to __u8.

v4 - https://patchwork.kernel.org/project/linux-block/cover/1612367638-3794-1-git-send-email-sergei.shtepa@veeam.com/
Mostly changes were made, due to Damien's comments:
 * on the design of the code;
 * by the patch set organization;
 * bug with passing a wrong parameter to dm_get_device();
 * description of the 'noexcl' parameter in the linear.rst.
Also added remap_and_filter.rst.

v3 - https://patchwork.kernel.org/project/linux-block/cover/1611853955-32167-1-git-send-email-sergei.shtepa@veeam.com/
In this version, I already suggested blk_interposer to apply to dm-linear.
Problems were solved:
 * Interception of bio requests from a specific device on the disk, not
   from the entire disk. To do this, we added the dm_interposed_dev
   structure and an interval tree to store these structures.
 * Implemented ioctl DM_DEV_REMAP_CMD. A patch with changes in the lvm2
   project was sent to the team lvm-devel@redhat.com.
 * Added the 'noexcl' option for dm-linear, which allows you to open
   the underlying block-device without FMODE_EXCL mode.

v2 - https://patchwork.kernel.org/project/linux-block/cover/1607518911-30692-1-git-send-email-sergei.shtepa@veeam.com/
I tried to suggest blk_interposer without using it in device mapper,
but with the addition of a sample of its use. It was then that I learned
about the maintainers' attitudes towards the samples directory :).

v1 - https://lwn.net/ml/linux-block/20201119164924.74401-1-hare@suse.de/
This Hannes's patch can be considered as a starting point, since this is
where the interception mechanism and the term blk_interposer itself
appeared. It became clear that blk_interposer can be useful for
device mapper.

before v1 - https://patchwork.kernel.org/project/linux-block/cover/1603271049-20681-1-git-send-email-sergei.shtepa@veeam.com/
I tried to offer a rather cumbersome blk-filter and a monster-like
blk-snap module for creating snapshots.

Thank you to everyone who was able to take the time to review
the previous versions.
I hope that this time I achieved the required quality.

Thanks,
Sergei.

Sergei Shtepa (3):
  block: add blk_mq_is_queue_frozen()
  block: add bdev_interposer
  dm: add DM_INTERPOSED_FLAG

 block/bio.c                   |  2 ++
 block/blk-core.c              | 57 ++++++++++++++++++++++++++++++++
 block/blk-mq.c                | 13 ++++++++
 block/genhd.c                 | 54 +++++++++++++++++++++++++++++++
 drivers/md/dm-core.h          |  3 ++
 drivers/md/dm-ioctl.c         | 13 ++++++++
 drivers/md/dm-table.c         | 61 +++++++++++++++++++++++++++++------
 drivers/md/dm.c               | 38 +++++++++++++++-------
 include/linux/blk-mq.h        |  1 +
 include/linux/blk_types.h     |  3 ++
 include/linux/blkdev.h        |  9 ++++++
 include/linux/device-mapper.h |  1 +
 include/uapi/linux/dm-ioctl.h |  6 ++++
 13 files changed, 240 insertions(+), 21 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 1/3] block: add blk_mq_is_queue_frozen()
  2021-03-12 15:44 [PATCH v7 0/3] block device interposer Sergei Shtepa
@ 2021-03-12 15:44 ` Sergei Shtepa
  2021-03-12 19:06   ` Mike Snitzer
  2021-03-12 15:44 ` [PATCH v7 2/3] block: add bdev_interposer Sergei Shtepa
  2021-03-12 15:44 ` [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG Sergei Shtepa
  2 siblings, 1 reply; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-12 15:44 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api
  Cc: sergei.shtepa, pavel.tide

blk_mq_is_queue_frozen() allow to assert that the queue is frozen.

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 block/blk-mq.c         | 13 +++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..2f188a865024 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -161,6 +161,19 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
+bool blk_mq_is_queue_frozen(struct request_queue *q)
+{
+	bool frozen;
+
+	mutex_lock(&q->mq_freeze_lock);
+	frozen = percpu_ref_is_dying(&q->q_usage_counter) &&
+		 percpu_ref_is_zero(&q->q_usage_counter);
+	mutex_unlock(&q->mq_freeze_lock);
+
+	return frozen;
+}
+EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
+
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..6f01971abf7b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -533,6 +533,7 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
+bool blk_mq_is_queue_frozen(struct request_queue *q);
 
 int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 2/3] block: add bdev_interposer
  2021-03-12 15:44 [PATCH v7 0/3] block device interposer Sergei Shtepa
  2021-03-12 15:44 ` [PATCH v7 1/3] block: add blk_mq_is_queue_frozen() Sergei Shtepa
@ 2021-03-12 15:44 ` Sergei Shtepa
  2021-03-14  9:28   ` Christoph Hellwig
  2021-03-16  8:09   ` Ming Lei
  2021-03-12 15:44 ` [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG Sergei Shtepa
  2 siblings, 2 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-12 15:44 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api
  Cc: sergei.shtepa, pavel.tide

bdev_interposer allows to redirect bio requests to another devices.

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 block/bio.c               |  2 ++
 block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
 block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h    |  9 +++++++
 5 files changed, 125 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index a1c4d2900c7a..0bfbf06475ee 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 		bio_set_flag(bio, BIO_THROTTLED);
 	if (bio_flagged(bio_src, BIO_REMAPPED))
 		bio_set_flag(bio, BIO_REMAPPED);
+	if (bio_flagged(bio_src, BIO_INTERPOSED))
+		bio_set_flag(bio, BIO_INTERPOSED);
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_write_hint = bio_src->bi_write_hint;
diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..da1abc4c27a9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
 	return ret;
 }
 
+static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
+{
+	blk_qc_t ret = BLK_QC_T_NONE;
+	struct bio_list bio_list[2] = { };
+	struct gendisk *orig_disk;
+
+	if (current->bio_list) {
+		bio_list_add(&current->bio_list[0], bio);
+		return BLK_QC_T_NONE;
+	}
+
+	orig_disk = bio->bi_bdev->bd_disk;
+	if (unlikely(bio_queue_enter(bio)))
+		return BLK_QC_T_NONE;
+
+	current->bio_list = bio_list;
+
+	do {
+		struct block_device *interposer = bio->bi_bdev->bd_interposer;
+
+		if (unlikely(!interposer)) {
+			/* interposer was removed */
+			bio_list_add(&current->bio_list[0], bio);
+			break;
+		}
+		/* assign bio to interposer device */
+		bio_set_dev(bio, interposer);
+		bio_set_flag(bio, BIO_INTERPOSED);
+
+		if (!submit_bio_checks(bio))
+			break;
+		/*
+		 * Because the current->bio_list is initialized,
+		 * the submit_bio callback will always return BLK_QC_T_NONE.
+		 */
+		interposer->bd_disk->fops->submit_bio(bio);
+	} while (false);
+
+	current->bio_list = NULL;
+
+	blk_queue_exit(orig_disk->queue);
+
+	/* Resubmit remaining bios */
+	while ((bio = bio_list_pop(&bio_list[0])))
+		ret = submit_bio_noacct(bio);
+
+	return ret;
+}
+
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
  */
 blk_qc_t submit_bio_noacct(struct bio *bio)
 {
+	/*
+	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
+	 * created by the bdev_interposer do not get to it for processing.
+	 */
+	if (bdev_has_interposer(bio->bi_bdev) &&
+	    !bio_flagged(bio, BIO_INTERPOSED))
+		return submit_bio_interposed(bio);
+
 	if (!submit_bio_checks(bio))
 		return BLK_QC_T_NONE;
 
diff --git a/block/genhd.c b/block/genhd.c
index c55e8f0fced1..c840ecffea68 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -30,6 +30,11 @@
 static struct kobject *block_depr;
 
 DECLARE_RWSEM(bdev_lookup_sem);
+/*
+ * Prevents different block-layer interposers from attaching or detaching
+ * to the block device at the same time.
+ */
+static DEFINE_MUTEX(bdev_interposer_attach_lock);
 
 /* for extended dynamic devt allocation, currently only one major is used */
 #define NR_EXT_DEVT		(1 << MINORBITS)
@@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
 	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
 	kfree(disk->ev);
 }
+
+int bdev_interposer_attach(struct block_device *original,
+			   struct block_device *interposer)
+{
+	int ret = 0;
+
+	if (WARN_ON(((!original) || (!interposer))))
+		return -EINVAL;
+	/*
+	 * interposer should be simple, no a multi-queue device
+	 */
+	if (!interposer->bd_disk->fops->submit_bio)
+		return -EINVAL;
+
+	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
+		return -EPERM;
+
+	mutex_lock(&bdev_interposer_attach_lock);
+
+	if (bdev_has_interposer(original))
+		ret = -EBUSY;
+	else {
+		original->bd_interposer = bdgrab(interposer);
+		if (!original->bd_interposer)
+			ret = -ENODEV;
+	}
+
+	mutex_unlock(&bdev_interposer_attach_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(bdev_interposer_attach);
+
+void bdev_interposer_detach(struct block_device *original)
+{
+	if (WARN_ON(!original))
+		return;
+
+	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
+		return;
+
+	mutex_lock(&bdev_interposer_attach_lock);
+	if (bdev_has_interposer(original)) {
+		bdput(original->bd_interposer);
+		original->bd_interposer = NULL;
+	}
+	mutex_unlock(&bdev_interposer_attach_lock);
+}
+EXPORT_SYMBOL_GPL(bdev_interposer_detach);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..13bda4732cf5 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -19,6 +19,7 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
 struct bio_crypt_ctx;
+struct bdev_interposer;
 
 struct block_device {
 	sector_t		bd_start_sect;
@@ -46,6 +47,7 @@ struct block_device {
 	spinlock_t		bd_size_lock; /* for bd_inode->i_size updates */
 	struct gendisk *	bd_disk;
 	struct backing_dev_info *bd_bdi;
+	struct block_device     *bd_interposer;
 
 	/* The counter of freeze processes */
 	int			bd_fsfreeze_count;
@@ -304,6 +306,7 @@ enum {
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
+	BIO_INTERPOSED,		/* bio was reassigned to another block device */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..90f62b4197da 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -2031,4 +2031,13 @@ int fsync_bdev(struct block_device *bdev);
 int freeze_bdev(struct block_device *bdev);
 int thaw_bdev(struct block_device *bdev);
 
+static inline bool bdev_has_interposer(struct block_device *bdev)
+{
+	return (bdev->bd_interposer != NULL);
+};
+
+int bdev_interposer_attach(struct block_device *original,
+			   struct block_device *interposer);
+void bdev_interposer_detach(struct block_device *original);
+
 #endif /* _LINUX_BLKDEV_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-12 15:44 [PATCH v7 0/3] block device interposer Sergei Shtepa
  2021-03-12 15:44 ` [PATCH v7 1/3] block: add blk_mq_is_queue_frozen() Sergei Shtepa
  2021-03-12 15:44 ` [PATCH v7 2/3] block: add bdev_interposer Sergei Shtepa
@ 2021-03-12 15:44 ` Sergei Shtepa
  2021-03-12 19:00   ` Mike Snitzer
  2021-03-14  9:30   ` Christoph Hellwig
  2 siblings, 2 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-12 15:44 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api
  Cc: sergei.shtepa, pavel.tide

DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
Underlying block device opens without a flag FMODE_EXCL.
DM target receives bio from the original device via
bdev_interposer.

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 drivers/md/dm-core.h          |  3 ++
 drivers/md/dm-ioctl.c         | 13 ++++++++
 drivers/md/dm-table.c         | 61 +++++++++++++++++++++++++++++------
 drivers/md/dm.c               | 38 +++++++++++++++-------
 include/linux/device-mapper.h |  1 +
 include/uapi/linux/dm-ioctl.h |  6 ++++
 6 files changed, 101 insertions(+), 21 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 5953ff2bd260..9eae419c7b18 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -114,6 +114,9 @@ struct mapped_device {
 	bool init_tio_pdu:1;
 
 	struct srcu_struct io_barrier;
+
+	/* attach target via block-layer interposer */
+	bool is_interposed;
 };
 
 void disable_discard(struct mapped_device *md);
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 5e306bba4375..2b4c9557c283 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
 	return mode;
 }
 
+static inline bool get_interposer_flag(struct dm_ioctl *param)
+{
+	return (param->flags & DM_INTERPOSED_FLAG);
+}
+
 static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
 		       struct dm_target_spec **spec, char **target_params)
 {
@@ -1293,6 +1298,10 @@ static int populate_table(struct dm_table *table,
 		DMWARN("populate_table: no targets specified");
 		return -EINVAL;
 	}
+	if (table->md->is_interposed && (param->target_count != 1)) {
+		DMWARN("%s: with interposer should be specified only one target", __func__);
+		return -EINVAL;
+	}
 
 	for (i = 0; i < param->target_count; i++) {
 
@@ -1338,6 +1347,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
 	if (!md)
 		return -ENXIO;
 
+	md->is_interposed = get_interposer_flag(param);
+
 	r = dm_table_create(&t, get_mode(param), param->target_count, md);
 	if (r)
 		goto err;
@@ -2098,6 +2109,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
 	if (r)
 		goto err_hash_remove;
 
+	md->is_interposed = get_interposer_flag(dmi);
+
 	/* add targets */
 	for (i = 0; i < dmi->target_count; i++) {
 		r = dm_table_add_target(t, spec_array[i]->target_type,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..f6e2eb3f8949 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -225,12 +225,13 @@ void dm_table_destroy(struct dm_table *t)
 /*
  * See if we've already got a device in the list.
  */
-static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
+static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)
 {
 	struct dm_dev_internal *dd;
 
 	list_for_each_entry (dd, l, list)
-		if (dd->dm_dev->bdev->bd_dev == dev)
+		if ((dd->dm_dev->bdev->bd_dev == dev) &&
+		    (dd->dm_dev->is_interposed == is_interposed))
 			return dd;
 
 	return NULL;
@@ -358,6 +359,18 @@ dev_t dm_get_dev_t(const char *path)
 }
 EXPORT_SYMBOL_GPL(dm_get_dev_t);
 
+static inline void dm_disk_freeze(struct gendisk *disk)
+{
+	blk_mq_freeze_queue(disk->queue);
+	blk_mq_quiesce_queue(disk->queue);
+}
+
+static inline void dm_disk_unfreeze(struct gendisk *disk)
+{
+	blk_mq_unquiesce_queue(disk->queue);
+	blk_mq_unfreeze_queue(disk->queue);
+}
+
 /*
  * Add a device to the list, or just increment the usage count if
  * it's already present.
@@ -385,7 +398,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 			return -ENODEV;
 	}
 
-	dd = find_device(&t->devices, dev);
+	dd = find_device(&t->devices, dev, t->md->is_interposed);
 	if (!dd) {
 		dd = kmalloc(sizeof(*dd), GFP_KERNEL);
 		if (!dd)
@@ -398,15 +411,38 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 
 		refcount_set(&dd->count, 1);
 		list_add(&dd->list, &t->devices);
-		goto out;
-
 	} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
 		r = upgrade_mode(dd, mode, t->md);
 		if (r)
 			return r;
+		refcount_inc(&dd->count);
 	}
-	refcount_inc(&dd->count);
-out:
+
+	if (t->md->is_interposed) {
+		struct block_device *original = dd->dm_dev->bdev;
+		struct block_device *interposer = t->md->disk->part0;
+
+		if ((ti->begin != 0) || (ti->len < bdev_nr_sectors(original))) {
+			dm_put_device(ti, dd->dm_dev);
+			DMERR("The interposer device should not be less than the original.");
+			return -EINVAL;
+		}
+
+		/*
+		 * Attach mapped interposer device to original.
+		 * It is quite convenient that device mapper creates
+		 * one disk for one block device.
+		 */
+		dm_disk_freeze(original->bd_disk);
+		r = bdev_interposer_attach(original, interposer);
+		dm_disk_unfreeze(original->bd_disk);
+		if (r) {
+			dm_put_device(ti, dd->dm_dev);
+			DMERR("Failed to attach dm interposer.");
+			return r;
+		}
+	}
+
 	*result = dd->dm_dev;
 	return 0;
 }
@@ -446,6 +482,7 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
 {
 	int found = 0;
 	struct list_head *devices = &ti->table->devices;
+	struct mapped_device *md = ti->table->md;
 	struct dm_dev_internal *dd;
 
 	list_for_each_entry(dd, devices, list) {
@@ -456,11 +493,17 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
 	}
 	if (!found) {
 		DMWARN("%s: device %s not in table devices list",
-		       dm_device_name(ti->table->md), d->name);
+		       dm_device_name(md), d->name);
 		return;
 	}
+	if (md->is_interposed) {
+		dm_disk_freeze(d->bdev->bd_disk);
+		bdev_interposer_detach(d->bdev);
+		dm_disk_unfreeze(d->bdev->bd_disk);
+	}
+
 	if (refcount_dec_and_test(&dd->count)) {
-		dm_put_table_device(ti->table->md, d);
+		dm_put_table_device(md, d);
 		list_del(&dd->list);
 		kfree(dd);
 	}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..466bf70a66b0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -762,16 +762,24 @@ static int open_table_device(struct table_device *td, dev_t dev,
 
 	BUG_ON(td->dm_dev.bdev);
 
-	bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
-	if (IS_ERR(bdev))
-		return PTR_ERR(bdev);
+	if (md->is_interposed) {
 
-	r = bd_link_disk_holder(bdev, dm_disk(md));
-	if (r) {
-		blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
-		return r;
+		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode, NULL);
+		if (IS_ERR(bdev))
+			return PTR_ERR(bdev);
+	} else {
+		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
+		if (IS_ERR(bdev))
+			return PTR_ERR(bdev);
+
+		r = bd_link_disk_holder(bdev, dm_disk(md));
+		if (r) {
+			blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
+			return r;
+		}
 	}
 
+	td->dm_dev.is_interposed = md->is_interposed;
 	td->dm_dev.bdev = bdev;
 	td->dm_dev.dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
 	return 0;
@@ -785,20 +793,26 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
 	if (!td->dm_dev.bdev)
 		return;
 
-	bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
-	blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+	if (td->dm_dev.is_interposed)
+		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
+	else {
+		bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
+		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+	}
 	put_dax(td->dm_dev.dax_dev);
 	td->dm_dev.bdev = NULL;
 	td->dm_dev.dax_dev = NULL;
 }
 
 static struct table_device *find_table_device(struct list_head *l, dev_t dev,
-					      fmode_t mode)
+					      fmode_t mode, bool is_interposed)
 {
 	struct table_device *td;
 
 	list_for_each_entry(td, l, list)
-		if (td->dm_dev.bdev->bd_dev == dev && td->dm_dev.mode == mode)
+		if (td->dm_dev.bdev->bd_dev == dev &&
+		    td->dm_dev.mode == mode &&
+		    td->dm_dev.is_interposed == is_interposed)
 			return td;
 
 	return NULL;
@@ -811,7 +825,7 @@ int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
 	struct table_device *td;
 
 	mutex_lock(&md->table_devices_lock);
-	td = find_table_device(&md->table_devices, dev, mode);
+	td = find_table_device(&md->table_devices, dev, mode, md->is_interposed);
 	if (!td) {
 		td = kmalloc_node(sizeof(*td), GFP_KERNEL, md->numa_node_id);
 		if (!td) {
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..76a6dfb1cb29 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -159,6 +159,7 @@ struct dm_dev {
 	struct block_device *bdev;
 	struct dax_device *dax_dev;
 	fmode_t mode;
+	bool is_interposed;
 	char name[16];
 };
 
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index fcff6669137b..fc4d06bb3dbb 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -362,4 +362,10 @@ enum {
  */
 #define DM_INTERNAL_SUSPEND_FLAG	(1 << 18) /* Out */
 
+/*
+ * If set, the underlying device should open without FMODE_EXCL
+ * and attach mapped device via bdev_interposer.
+ */
+#define DM_INTERPOSED_FLAG		(1 << 19) /* In */
+
 #endif				/* _LINUX_DM_IOCTL_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-12 15:44 ` [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG Sergei Shtepa
@ 2021-03-12 19:00   ` Mike Snitzer
  2021-03-15 12:29     ` Sergei Shtepa
  2021-03-14  9:30   ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2021-03-12 19:00 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, pavel.tide

On Fri, Mar 12 2021 at 10:44am -0500,
Sergei Shtepa <sergei.shtepa@veeam.com> wrote:

> DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
> Underlying block device opens without a flag FMODE_EXCL.
> DM target receives bio from the original device via
> bdev_interposer.
> 
> Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> ---
>  drivers/md/dm-core.h          |  3 ++
>  drivers/md/dm-ioctl.c         | 13 ++++++++
>  drivers/md/dm-table.c         | 61 +++++++++++++++++++++++++++++------
>  drivers/md/dm.c               | 38 +++++++++++++++-------
>  include/linux/device-mapper.h |  1 +
>  include/uapi/linux/dm-ioctl.h |  6 ++++
>  6 files changed, 101 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 5953ff2bd260..9eae419c7b18 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -114,6 +114,9 @@ struct mapped_device {
>  	bool init_tio_pdu:1;
>  
>  	struct srcu_struct io_barrier;
> +
> +	/* attach target via block-layer interposer */
> +	bool is_interposed;
>  };

This flag is a mix of uses.  First it is used to store that
DM_INTERPOSED_FLAG was provided as input param during load.

And the same 'is_interposed' name is used in 'struct dm_dev'.

To me this state should be elevated to block core -- awkward for every
driver that might want to use blk-interposer to be sprinkling state
around its core structures.

So I'd prefer you:
1) rename 'struct mapped_device' to 'interpose' _and_ add it just after
   "bool init_tio_pdu:1;" with "bool interpose:1;" -- this reflects
   interpose was requested during load.
2) bdev_interposer_attach() should be made to set some block core state
   that is able to be tested to check if a device is_interposed.
3) Don't add an 'is_interposed' flag to 'struct dm_dev'

>  
>  void disable_discard(struct mapped_device *md);
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index 5e306bba4375..2b4c9557c283 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
>  	return mode;
>  }
>  
> +static inline bool get_interposer_flag(struct dm_ioctl *param)
> +{
> +	return (param->flags & DM_INTERPOSED_FLAG);
> +}
> +

As I mention at the end: rename to DM_INTERPOSE_FLAG

>  static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
>  		       struct dm_target_spec **spec, char **target_params)
>  {
> @@ -1293,6 +1298,10 @@ static int populate_table(struct dm_table *table,
>  		DMWARN("populate_table: no targets specified");
>  		return -EINVAL;
>  	}
> +	if (table->md->is_interposed && (param->target_count != 1)) {
> +		DMWARN("%s: with interposer should be specified only one target", __func__);

This error/warning reads very awkwardly. Maybe?:
"%s: interposer requires only a single target be specified"

> +		return -EINVAL;
> +	}
>  
>  	for (i = 0; i < param->target_count; i++) {
>  
> @@ -1338,6 +1347,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
>  	if (!md)
>  		return -ENXIO;
>  
> +	md->is_interposed = get_interposer_flag(param);
> +
>  	r = dm_table_create(&t, get_mode(param), param->target_count, md);
>  	if (r)
>  		goto err;
> @@ -2098,6 +2109,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
>  	if (r)
>  		goto err_hash_remove;
>  
> +	md->is_interposed = get_interposer_flag(dmi);
> +
>  	/* add targets */
>  	for (i = 0; i < dmi->target_count; i++) {
>  		r = dm_table_add_target(t, spec_array[i]->target_type,
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 95391f78b8d5..f6e2eb3f8949 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -225,12 +225,13 @@ void dm_table_destroy(struct dm_table *t)
>  /*
>   * See if we've already got a device in the list.
>   */
> -static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
> +static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)

Think in make more sense to internalize the need to consider
md->interpose here.

So:

static struct dm_dev_internal *find_device(struct dm_table *t, dev_t dev)
{
        struct list_head *l = &t->devices;
        bool is_interposed = t->md->interpose;
        ...

>  {
>  	struct dm_dev_internal *dd;
>  
>  	list_for_each_entry (dd, l, list)
> -		if (dd->dm_dev->bdev->bd_dev == dev)
> +		if ((dd->dm_dev->bdev->bd_dev == dev) &&
> +		    (dd->dm_dev->is_interposed == is_interposed))
>  			return dd;

But why must this extra state be used/tested?  Seems like quite a deep
embedding of interposer into dm_dev_internal.. feels unnecessary.

>  
>  	return NULL;
> @@ -358,6 +359,18 @@ dev_t dm_get_dev_t(const char *path)
>  }
>  EXPORT_SYMBOL_GPL(dm_get_dev_t);
>  
> +static inline void dm_disk_freeze(struct gendisk *disk)
> +{
> +	blk_mq_freeze_queue(disk->queue);
> +	blk_mq_quiesce_queue(disk->queue);
> +}
> +
> +static inline void dm_disk_unfreeze(struct gendisk *disk)
> +{
> +	blk_mq_unquiesce_queue(disk->queue);
> +	blk_mq_unfreeze_queue(disk->queue);
> +}
> +

These interfaces don't account for bio-based at all (pretty sure we've
been over this and you pointed out that they'll just return early), but
they also don't take steps to properly flush outstanding DM io.
Shouldn't you require DM devices do an internal suspend/resume?  And if
original device isn't DM then fallback to blk_mq calls?

>  /*
>   * Add a device to the list, or just increment the usage count if
>   * it's already present.
> @@ -385,7 +398,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
>  			return -ENODEV;
>  	}
>  
> -	dd = find_device(&t->devices, dev);
> +	dd = find_device(&t->devices, dev, t->md->is_interposed);
>  	if (!dd) {
>  		dd = kmalloc(sizeof(*dd), GFP_KERNEL);
>  		if (!dd)
> @@ -398,15 +411,38 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
>  
>  		refcount_set(&dd->count, 1);
>  		list_add(&dd->list, &t->devices);
> -		goto out;
> -
>  	} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
>  		r = upgrade_mode(dd, mode, t->md);
>  		if (r)
>  			return r;
> +		refcount_inc(&dd->count);
>  	}
> -	refcount_inc(&dd->count);

This looks bogus... you cannot only increment refcount with the mode
check/upgrade branch (IIRC: I've made this same mistake in the past)

> -out:
> +
> +	if (t->md->is_interposed) {
> +		struct block_device *original = dd->dm_dev->bdev;
> +		struct block_device *interposer = t->md->disk->part0;
> +
> +		if ((ti->begin != 0) || (ti->len < bdev_nr_sectors(original))) {
> +			dm_put_device(ti, dd->dm_dev);
> +			DMERR("The interposer device should not be less than the original.");
> +			return -EINVAL;

Can you explain why allowing the device to be larger is meaningful?  Not
saying it isn't I'd just like to understand use-cases you forsee.

> +		}
> +
> +		/*
> +		 * Attach mapped interposer device to original.
> +		 * It is quite convenient that device mapper creates
> +		 * one disk for one block device.
> +		 */
> +		dm_disk_freeze(original->bd_disk);
> +		r = bdev_interposer_attach(original, interposer);
> +		dm_disk_unfreeze(original->bd_disk);
> +		if (r) {
> +			dm_put_device(ti, dd->dm_dev);
> +			DMERR("Failed to attach dm interposer.");
> +			return r;
> +		}
> +	}
> +
>  	*result = dd->dm_dev;
>  	return 0;
>  }
> @@ -446,6 +482,7 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
>  {
>  	int found = 0;
>  	struct list_head *devices = &ti->table->devices;
> +	struct mapped_device *md = ti->table->md;
>  	struct dm_dev_internal *dd;
>  
>  	list_for_each_entry(dd, devices, list) {
> @@ -456,11 +493,17 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
>  	}
>  	if (!found) {
>  		DMWARN("%s: device %s not in table devices list",
> -		       dm_device_name(ti->table->md), d->name);
> +		       dm_device_name(md), d->name);
>  		return;
>  	}
> +	if (md->is_interposed) {
> +		dm_disk_freeze(d->bdev->bd_disk);
> +		bdev_interposer_detach(d->bdev);
> +		dm_disk_unfreeze(d->bdev->bd_disk);
> +	}
> +
>  	if (refcount_dec_and_test(&dd->count)) {
> -		dm_put_table_device(ti->table->md, d);
> +		dm_put_table_device(md, d);
>  		list_del(&dd->list);
>  		kfree(dd);
>  	}
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..466bf70a66b0 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -762,16 +762,24 @@ static int open_table_device(struct table_device *td, dev_t dev,
>  
>  	BUG_ON(td->dm_dev.bdev);
>  
> -	bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
> -	if (IS_ERR(bdev))
> -		return PTR_ERR(bdev);
> +	if (md->is_interposed) {
>  
> -	r = bd_link_disk_holder(bdev, dm_disk(md));
> -	if (r) {
> -		blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
> -		return r;
> +		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode, NULL);
> +		if (IS_ERR(bdev))
> +			return PTR_ERR(bdev);
> +	} else {
> +		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
> +		if (IS_ERR(bdev))
> +			return PTR_ERR(bdev);
> +
> +		r = bd_link_disk_holder(bdev, dm_disk(md));
> +		if (r) {
> +			blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
> +			return r;
> +		}
>  	}
>  
> +	td->dm_dev.is_interposed = md->is_interposed;

This _should_ hopefully get cleaned up by pushing such state into block
core's interposer interfaces.

But again, not seeing what utility/safety this extra flag is providing
to begin with.  Is this state _actually_ needed at all?


>  	td->dm_dev.bdev = bdev;
>  	td->dm_dev.dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
>  	return 0;
> @@ -785,20 +793,26 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
>  	if (!td->dm_dev.bdev)
>  		return;
>  
> -	bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> -	blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> +	if (td->dm_dev.is_interposed)
> +		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
> +	else {
> +		bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> +		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> +	}
>  	put_dax(td->dm_dev.dax_dev);
>  	td->dm_dev.bdev = NULL;
>  	td->dm_dev.dax_dev = NULL;
>  }
>  
>  static struct table_device *find_table_device(struct list_head *l, dev_t dev,
> -					      fmode_t mode)
> +					      fmode_t mode, bool is_interposed)
>  {
>  	struct table_device *td;
>  
>  	list_for_each_entry(td, l, list)
> -		if (td->dm_dev.bdev->bd_dev == dev && td->dm_dev.mode == mode)
> +		if (td->dm_dev.bdev->bd_dev == dev &&
> +		    td->dm_dev.mode == mode &&
> +		    td->dm_dev.is_interposed == is_interposed)
>  			return td;
>  
>  	return NULL;
> @@ -811,7 +825,7 @@ int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
>  	struct table_device *td;
>  
>  	mutex_lock(&md->table_devices_lock);
> -	td = find_table_device(&md->table_devices, dev, mode);
> +	td = find_table_device(&md->table_devices, dev, mode, md->is_interposed);
>  	if (!td) {
>  		td = kmalloc_node(sizeof(*td), GFP_KERNEL, md->numa_node_id);
>  		if (!td) {
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 7f4ac87c0b32..76a6dfb1cb29 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -159,6 +159,7 @@ struct dm_dev {
>  	struct block_device *bdev;
>  	struct dax_device *dax_dev;
>  	fmode_t mode;
> +	bool is_interposed;

Again, I'd like this state to be part of 'struct block_device'

>  	char name[16];
>  };
>  
> diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
> index fcff6669137b..fc4d06bb3dbb 100644
> --- a/include/uapi/linux/dm-ioctl.h
> +++ b/include/uapi/linux/dm-ioctl.h
> @@ -362,4 +362,10 @@ enum {
>   */
>  #define DM_INTERNAL_SUSPEND_FLAG	(1 << 18) /* Out */
>  
> +/*
> + * If set, the underlying device should open without FMODE_EXCL
> + * and attach mapped device via bdev_interposer.
> + */
> +#define DM_INTERPOSED_FLAG		(1 << 19) /* In */

Please rename to DM_INTERPOSE_FLAG

> +
>  #endif				/* _LINUX_DM_IOCTL_H */
> -- 
> 2.20.1
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 1/3] block: add blk_mq_is_queue_frozen()
  2021-03-12 15:44 ` [PATCH v7 1/3] block: add blk_mq_is_queue_frozen() Sergei Shtepa
@ 2021-03-12 19:06   ` Mike Snitzer
  2021-03-14  9:14     ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2021-03-12 19:06 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, pavel.tide

On Fri, Mar 12 2021 at 10:44am -0500,
Sergei Shtepa <sergei.shtepa@veeam.com> wrote:

> blk_mq_is_queue_frozen() allow to assert that the queue is frozen.
> 
> Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> ---
>  block/blk-mq.c         | 13 +++++++++++++
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..2f188a865024 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -161,6 +161,19 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
>  
> +bool blk_mq_is_queue_frozen(struct request_queue *q)
> +{
> +	bool frozen;
> +
> +	mutex_lock(&q->mq_freeze_lock);
> +	frozen = percpu_ref_is_dying(&q->q_usage_counter) &&
> +		 percpu_ref_is_zero(&q->q_usage_counter);
> +	mutex_unlock(&q->mq_freeze_lock);
> +
> +	return frozen;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
> +

This is returning a frozen state that is immediately stale.  I don't
think any code calling this is providing the guarantees you think it
does due to the racey nature of this state once the mutex is dropped.

>  /*
>   * Guarantee no request is in use, so we can change any data structure of
>   * the queue afterward.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2c473c9b8990..6f01971abf7b 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -533,6 +533,7 @@ void blk_freeze_queue_start(struct request_queue *q);
>  void blk_mq_freeze_queue_wait(struct request_queue *q);
>  int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>  				     unsigned long timeout);
> +bool blk_mq_is_queue_frozen(struct request_queue *q);
>  
>  int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
> -- 
> 2.20.1
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 1/3] block: add blk_mq_is_queue_frozen()
  2021-03-12 19:06   ` Mike Snitzer
@ 2021-03-14  9:14     ` Christoph Hellwig
  2021-03-15 12:06       ` Sergei Shtepa
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-14  9:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Sergei Shtepa, Christoph Hellwig, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, pavel.tide

On Fri, Mar 12, 2021 at 02:06:41PM -0500, Mike Snitzer wrote:
> This is returning a frozen state that is immediately stale.  I don't
> think any code calling this is providing the guarantees you think it
> does due to the racey nature of this state once the mutex is dropped.

The code only uses it for asserts in the form of WARN_ONs.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-12 15:44 ` [PATCH v7 2/3] block: add bdev_interposer Sergei Shtepa
@ 2021-03-14  9:28   ` Christoph Hellwig
  2021-03-15 13:06     ` Sergei Shtepa
  2021-03-16  8:09   ` Ming Lei
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-14  9:28 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, pavel.tide

On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> bdev_interposer allows to redirect bio requests to another devices.

I think this warrants a somewhat more detailed description.

The code itself looks pretty good to me now, a bunch of nitpicks and
a question below:

> +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> +{
> +	blk_qc_t ret = BLK_QC_T_NONE;
> +	struct bio_list bio_list[2] = { };
> +	struct gendisk *orig_disk;
> +
> +	if (current->bio_list) {
> +		bio_list_add(&current->bio_list[0], bio);
> +		return BLK_QC_T_NONE;
> +	}

I don't think this case can ever happen:

 - current->bio_list != NULL means a ->submit_bio or blk_mq_submit_bio
   is active.  But if this device is being interposed this means the
   interposer recurses into itself, which should never happen.  So
   I think we'll want a WARN_ON_ONCE here as a debug check instead.

> +
> +	orig_disk = bio->bi_bdev->bd_disk;
> +	if (unlikely(bio_queue_enter(bio)))
> +		return BLK_QC_T_NONE;
> +
> +	current->bio_list = bio_list;
> +
> +	do {
> +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> +
> +		if (unlikely(!interposer)) {
> +			/* interposer was removed */
> +			bio_list_add(&current->bio_list[0], bio);
> +			break;
> +		}
> +		/* assign bio to interposer device */
> +		bio_set_dev(bio, interposer);
> +		bio_set_flag(bio, BIO_INTERPOSED);

Reassigning the bi_bdev here means the original source is lost by the
time we reach the interposer.  This initially seemed a little limiting,
but I guess the interposer driver can just record that information
locally, so we should be fine.  The big upside of this is that no
extra argument to submit_bio_checks, which means less changes to the
normal fast path, so if this works for everyone that is a nice
improvement over my draft.

> +
> +		if (!submit_bio_checks(bio))
> +			break;
> +		/*
> +		 * Because the current->bio_list is initialized,
> +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> +		 */
> +		interposer->bd_disk->fops->submit_bio(bio);
> +	} while (false);

I find the do { ... } while (false) idiom here a little strange.  Normal
kernel style would be a goto done instead of the breaks.

> +int bdev_interposer_attach(struct block_device *original,
> +			   struct block_device *interposer)

A kerneldoc comment for bdev_interposer_attach (and
bdev_interposer_detach) would be nice to explain the API a little more.

> +{
> +	int ret = 0;
> +
> +	if (WARN_ON(((!original) || (!interposer))))
> +		return -EINVAL;

No need for the inner two levels of braces.

> +	 * interposer should be simple, no a multi-queue device
> +	 */
> +	if (!interposer->bd_disk->fops->submit_bio)

Please use queue_is_mq() instead.

> +	if (bdev_has_interposer(original))
> +		ret = -EBUSY;
> +	else {
> +		original->bd_interposer = bdgrab(interposer);

Just thinking out a loud:  what happens if the interposed device
goes away?  Shouldn't we at very least also make sure this
gabs another refererence on bdev as well?

> +struct bdev_interposer;

Not needed any more.

> +static inline bool bdev_has_interposer(struct block_device *bdev)
> +{
> +	return (bdev->bd_interposer != NULL);
> +};

No need for the braces.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-12 15:44 ` [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG Sergei Shtepa
  2021-03-12 19:00   ` Mike Snitzer
@ 2021-03-14  9:30   ` Christoph Hellwig
  2021-03-15 13:25     ` Sergei Shtepa
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-14  9:30 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, pavel.tide

On Fri, Mar 12, 2021 at 06:44:55PM +0300, Sergei Shtepa wrote:
> DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
> Underlying block device opens without a flag FMODE_EXCL.
> DM target receives bio from the original device via
> bdev_interposer.

This is more of a philopical comment, but the idea of just letting the
interposed reopen the device by itself seems like a bad idea.  I think
that is probably better hidden in the block layer interposer attachment
function, which could do the extra blkdev_get_by_dev for the caller.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 1/3] block: add blk_mq_is_queue_frozen()
  2021-03-14  9:14     ` Christoph Hellwig
@ 2021-03-15 12:06       ` Sergei Shtepa
  0 siblings, 0 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-15 12:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, Pavel Tide

The 03/14/2021 12:14, Christoph Hellwig wrote:
> On Fri, Mar 12, 2021 at 02:06:41PM -0500, Mike Snitzer wrote:
> > This is returning a frozen state that is immediately stale.  I don't
> > think any code calling this is providing the guarantees you think it
> > does due to the racey nature of this state once the mutex is dropped.
> 
> The code only uses it for asserts in the form of WARN_ONs.

But perhaps it is possible to come up with a more elegant solution?
I'll think about it.
-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-12 19:00   ` Mike Snitzer
@ 2021-03-15 12:29     ` Sergei Shtepa
  0 siblings, 0 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-15 12:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, Pavel Tide

The 03/12/2021 22:00, Mike Snitzer wrote:
> On Fri, Mar 12 2021 at 10:44am -0500,
> Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
> 
> > DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
> > Underlying block device opens without a flag FMODE_EXCL.
> > DM target receives bio from the original device via
> > bdev_interposer.
> > 
> > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > ---
> >  drivers/md/dm-core.h          |  3 ++
> >  drivers/md/dm-ioctl.c         | 13 ++++++++
> >  drivers/md/dm-table.c         | 61 +++++++++++++++++++++++++++++------
> >  drivers/md/dm.c               | 38 +++++++++++++++-------
> >  include/linux/device-mapper.h |  1 +
> >  include/uapi/linux/dm-ioctl.h |  6 ++++
> >  6 files changed, 101 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> > index 5953ff2bd260..9eae419c7b18 100644
> > --- a/drivers/md/dm-core.h
> > +++ b/drivers/md/dm-core.h
> > @@ -114,6 +114,9 @@ struct mapped_device {
> >  	bool init_tio_pdu:1;
> >  
> >  	struct srcu_struct io_barrier;
> > +
> > +	/* attach target via block-layer interposer */
> > +	bool is_interposed;
> >  };
> 
> This flag is a mix of uses.  First it is used to store that
> DM_INTERPOSED_FLAG was provided as input param during load.
> 
> And the same 'is_interposed' name is used in 'struct dm_dev'.
> 
> To me this state should be elevated to block core -- awkward for every
> driver that might want to use blk-interposer to be sprinkling state
> around its core structures.
> 
> So I'd prefer you:
> 1) rename 'struct mapped_device' to 'interpose' _and_ add it just after
>    "bool init_tio_pdu:1;" with "bool interpose:1;" -- this reflects
>    interpose was requested during load.
> 2) bdev_interposer_attach() should be made to set some block core state
>    that is able to be tested to check if a device is_interposed.
> 3) Don't add an 'is_interposed' flag to 'struct dm_dev'

Ok, but little fix in "rename 'struct mapped_device' to 'interpose'".
Maybe "rename 'bool is_interposed' to 'bool interpose:1'"?
I think I understand your idea, I'll try to implement it.

> 
> >  
> >  void disable_discard(struct mapped_device *md);
> > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> > index 5e306bba4375..2b4c9557c283 100644
> > --- a/drivers/md/dm-ioctl.c
> > +++ b/drivers/md/dm-ioctl.c
> > @@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
> >  	return mode;
> >  }
> >  
> > +static inline bool get_interposer_flag(struct dm_ioctl *param)
> > +{
> > +	return (param->flags & DM_INTERPOSED_FLAG);
> > +}
> > +
> 
> As I mention at the end: rename to DM_INTERPOSE_FLAG

Ok.

> 
> >  static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
> >  		       struct dm_target_spec **spec, char **target_params)
> >  {
> > @@ -1293,6 +1298,10 @@ static int populate_table(struct dm_table *table,
> >  		DMWARN("populate_table: no targets specified");
> >  		return -EINVAL;
> >  	}
> > +	if (table->md->is_interposed && (param->target_count != 1)) {
> > +		DMWARN("%s: with interposer should be specified only one target", __func__);
> 
> This error/warning reads very awkwardly. Maybe?:
> "%s: interposer requires only a single target be specified"

Ok.

> 
> > +		return -EINVAL;
> > +	}
> >  
> >  	for (i = 0; i < param->target_count; i++) {
> >  
> > @@ -1338,6 +1347,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
> >  	if (!md)
> >  		return -ENXIO;
> >  
> > +	md->is_interposed = get_interposer_flag(param);
> > +
> >  	r = dm_table_create(&t, get_mode(param), param->target_count, md);
> >  	if (r)
> >  		goto err;
> > @@ -2098,6 +2109,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
> >  	if (r)
> >  		goto err_hash_remove;
> >  
> > +	md->is_interposed = get_interposer_flag(dmi);
> > +
> >  	/* add targets */
> >  	for (i = 0; i < dmi->target_count; i++) {
> >  		r = dm_table_add_target(t, spec_array[i]->target_type,
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index 95391f78b8d5..f6e2eb3f8949 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -225,12 +225,13 @@ void dm_table_destroy(struct dm_table *t)
> >  /*
> >   * See if we've already got a device in the list.
> >   */
> > -static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
> > +static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)
> 
> Think in make more sense to internalize the need to consider
> md->interpose here.
> 
> So:
> 
> static struct dm_dev_internal *find_device(struct dm_table *t, dev_t dev)
> {
>         struct list_head *l = &t->devices;
>         bool is_interposed = t->md->interpose;
>         ...

Yes.

> 
> >  {
> >  	struct dm_dev_internal *dd;
> >  
> >  	list_for_each_entry (dd, l, list)
> > -		if (dd->dm_dev->bdev->bd_dev == dev)
> > +		if ((dd->dm_dev->bdev->bd_dev == dev) &&
> > +		    (dd->dm_dev->is_interposed == is_interposed))
> >  			return dd;
> 
> But why must this extra state be used/tested?  Seems like quite a deep
> embedding of interposer into dm_dev_internal.. feels unnecessary.

Yeah, I guess so.
Since dm_table is unique for each dm_target and in our case will generally
contain only one device.

> 
> >  
> >  	return NULL;
> > @@ -358,6 +359,18 @@ dev_t dm_get_dev_t(const char *path)
> >  }
> >  EXPORT_SYMBOL_GPL(dm_get_dev_t);
> >  
> > +static inline void dm_disk_freeze(struct gendisk *disk)
> > +{
> > +	blk_mq_freeze_queue(disk->queue);
> > +	blk_mq_quiesce_queue(disk->queue);
> > +}
> > +
> > +static inline void dm_disk_unfreeze(struct gendisk *disk)
> > +{
> > +	blk_mq_unquiesce_queue(disk->queue);
> > +	blk_mq_unfreeze_queue(disk->queue);
> > +}
> > +
> 
> These interfaces don't account for bio-based at all (pretty sure we've
> been over this and you pointed out that they'll just return early), but
> they also don't take steps to properly flush outstanding DM io.
> Shouldn't you require DM devices do an internal suspend/resume?  And if
> original device isn't DM then fallback to blk_mq calls?

I thought that was enough.
The function dm_disk_freeze() guarantees that no bio will be processed,
which means that we can safely attach or detach the interposer.
All bios will continue their processing after dm_disk_unfreeze().
Sorry, but I don't understand why it is required for DM devices to do
internal suspend/resume.
If this is not too difficult, can you explain what problem can cause
a simple queue freeze?

> 
> >  /*
> >   * Add a device to the list, or just increment the usage count if
> >   * it's already present.
> > @@ -385,7 +398,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
> >  			return -ENODEV;
> >  	}
> >  
> > -	dd = find_device(&t->devices, dev);
> > +	dd = find_device(&t->devices, dev, t->md->is_interposed);
> >  	if (!dd) {
> >  		dd = kmalloc(sizeof(*dd), GFP_KERNEL);
> >  		if (!dd)
> > @@ -398,15 +411,38 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
> >  
> >  		refcount_set(&dd->count, 1);
> >  		list_add(&dd->list, &t->devices);
> > -		goto out;
> > -
> >  	} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
> >  		r = upgrade_mode(dd, mode, t->md);
> >  		if (r)
> >  			return r;
> > +		refcount_inc(&dd->count);
> >  	}
> > -	refcount_inc(&dd->count);
> 
> This looks bogus... you cannot only increment refcount with the mode
> check/upgrade branch (IIRC: I've made this same mistake in the past)
> 

I realized my mistake. I wanted to remove the unnecessary goto,
but I change the logic.

> > -out:
> > +
> > +	if (t->md->is_interposed) {
> > +		struct block_device *original = dd->dm_dev->bdev;
> > +		struct block_device *interposer = t->md->disk->part0;
> > +
> > +		if ((ti->begin != 0) || (ti->len < bdev_nr_sectors(original))) {
> > +			dm_put_device(ti, dd->dm_dev);
> > +			DMERR("The interposer device should not be less than the original.");
> > +			return -EINVAL;
> 
> Can you explain why allowing the device to be larger is meaningful?  Not
> saying it isn't I'd just like to understand use-cases you forsee.

Hmm... Maybe strict equality would be better.

> 
> > +		}
> > +
> > +		/*
> > +		 * Attach mapped interposer device to original.
> > +		 * It is quite convenient that device mapper creates
> > +		 * one disk for one block device.
> > +		 */
> > +		dm_disk_freeze(original->bd_disk);
> > +		r = bdev_interposer_attach(original, interposer);
> > +		dm_disk_unfreeze(original->bd_disk);
> > +		if (r) {
> > +			dm_put_device(ti, dd->dm_dev);
> > +			DMERR("Failed to attach dm interposer.");
> > +			return r;
> > +		}
> > +	}
> > +
> >  	*result = dd->dm_dev;
> >  	return 0;
> >  }
> > @@ -446,6 +482,7 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
> >  {
> >  	int found = 0;
> >  	struct list_head *devices = &ti->table->devices;
> > +	struct mapped_device *md = ti->table->md;
> >  	struct dm_dev_internal *dd;
> >  
> >  	list_for_each_entry(dd, devices, list) {
> > @@ -456,11 +493,17 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
> >  	}
> >  	if (!found) {
> >  		DMWARN("%s: device %s not in table devices list",
> > -		       dm_device_name(ti->table->md), d->name);
> > +		       dm_device_name(md), d->name);
> >  		return;
> >  	}
> > +	if (md->is_interposed) {
> > +		dm_disk_freeze(d->bdev->bd_disk);
> > +		bdev_interposer_detach(d->bdev);
> > +		dm_disk_unfreeze(d->bdev->bd_disk);
> > +	}
> > +
> >  	if (refcount_dec_and_test(&dd->count)) {
> > -		dm_put_table_device(ti->table->md, d);
> > +		dm_put_table_device(md, d);
> >  		list_del(&dd->list);
> >  		kfree(dd);
> >  	}
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 50b693d776d6..466bf70a66b0 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -762,16 +762,24 @@ static int open_table_device(struct table_device *td, dev_t dev,
> >  
> >  	BUG_ON(td->dm_dev.bdev);
> >  
> > -	bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
> > -	if (IS_ERR(bdev))
> > -		return PTR_ERR(bdev);
> > +	if (md->is_interposed) {
> >  
> > -	r = bd_link_disk_holder(bdev, dm_disk(md));
> > -	if (r) {
> > -		blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
> > -		return r;
> > +		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode, NULL);
> > +		if (IS_ERR(bdev))
> > +			return PTR_ERR(bdev);
> > +	} else {
> > +		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
> > +		if (IS_ERR(bdev))
> > +			return PTR_ERR(bdev);
> > +
> > +		r = bd_link_disk_holder(bdev, dm_disk(md));
> > +		if (r) {
> > +			blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
> > +			return r;
> > +		}
> >  	}
> >  
> > +	td->dm_dev.is_interposed = md->is_interposed;
> 
> This _should_ hopefully get cleaned up by pushing such state into block
> core's interposer interfaces.
> 
> But again, not seeing what utility/safety this extra flag is providing
> to begin with.  Is this state _actually_ needed at all?
> 

Yes, it is not necessary.

> 
> >  	td->dm_dev.bdev = bdev;
> >  	td->dm_dev.dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> >  	return 0;
> > @@ -785,20 +793,26 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
> >  	if (!td->dm_dev.bdev)
> >  		return;
> >  
> > -	bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> > -	blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> > +	if (td->dm_dev.is_interposed)
> > +		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
> > +	else {
> > +		bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> > +		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> > +	}
> >  	put_dax(td->dm_dev.dax_dev);
> >  	td->dm_dev.bdev = NULL;
> >  	td->dm_dev.dax_dev = NULL;
> >  }
> >  
> >  static struct table_device *find_table_device(struct list_head *l, dev_t dev,
> > -					      fmode_t mode)
> > +					      fmode_t mode, bool is_interposed)
> >  {
> >  	struct table_device *td;
> >  
> >  	list_for_each_entry(td, l, list)
> > -		if (td->dm_dev.bdev->bd_dev == dev && td->dm_dev.mode == mode)
> > +		if (td->dm_dev.bdev->bd_dev == dev &&
> > +		    td->dm_dev.mode == mode &&
> > +		    td->dm_dev.is_interposed == is_interposed)
> >  			return td;
> >  
> >  	return NULL;
> > @@ -811,7 +825,7 @@ int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
> >  	struct table_device *td;
> >  
> >  	mutex_lock(&md->table_devices_lock);
> > -	td = find_table_device(&md->table_devices, dev, mode);
> > +	td = find_table_device(&md->table_devices, dev, mode, md->is_interposed);
> >  	if (!td) {
> >  		td = kmalloc_node(sizeof(*td), GFP_KERNEL, md->numa_node_id);
> >  		if (!td) {
> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 7f4ac87c0b32..76a6dfb1cb29 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -159,6 +159,7 @@ struct dm_dev {
> >  	struct block_device *bdev;
> >  	struct dax_device *dax_dev;
> >  	fmode_t mode;
> > +	bool is_interposed;
> 
> Again, I'd like this state to be part of 'struct block_device'

Ok.

> 
> >  	char name[16];
> >  };
> >  
> > diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
> > index fcff6669137b..fc4d06bb3dbb 100644
> > --- a/include/uapi/linux/dm-ioctl.h
> > +++ b/include/uapi/linux/dm-ioctl.h
> > @@ -362,4 +362,10 @@ enum {
> >   */
> >  #define DM_INTERNAL_SUSPEND_FLAG	(1 << 18) /* Out */
> >  
> > +/*
> > + * If set, the underlying device should open without FMODE_EXCL
> > + * and attach mapped device via bdev_interposer.
> > + */
> > +#define DM_INTERPOSED_FLAG		(1 << 19) /* In */
> 
> Please rename to DM_INTERPOSE_FLAG

Ok.

> 
> > +
> >  #endif				/* _LINUX_DM_IOCTL_H */
> > -- 
> > 2.20.1
> > 
> 

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-14  9:28   ` Christoph Hellwig
@ 2021-03-15 13:06     ` Sergei Shtepa
  0 siblings, 0 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-15 13:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, Pavel Tide

The 03/14/2021 12:28, Christoph Hellwig wrote:
> On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > bdev_interposer allows to redirect bio requests to another devices.
> 
> I think this warrants a somewhat more detailed description.
> 
> The code itself looks pretty good to me now, a bunch of nitpicks and
> a question below:
> 
> > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > +{
> > +	blk_qc_t ret = BLK_QC_T_NONE;
> > +	struct bio_list bio_list[2] = { };
> > +	struct gendisk *orig_disk;
> > +
> > +	if (current->bio_list) {
> > +		bio_list_add(&current->bio_list[0], bio);
> > +		return BLK_QC_T_NONE;
> > +	}
> 
> I don't think this case can ever happen:
> 
>  - current->bio_list != NULL means a ->submit_bio or blk_mq_submit_bio
>    is active.  But if this device is being interposed this means the
>    interposer recurses into itself, which should never happen.  So
>    I think we'll want a WARN_ON_ONCE here as a debug check instead.

Yes, it is.
Completely remove this check or add "BUG_ON(current->bio_list);" for
an emergency?

> 
> > +
> > +	orig_disk = bio->bi_bdev->bd_disk;
> > +	if (unlikely(bio_queue_enter(bio)))
> > +		return BLK_QC_T_NONE;
> > +
> > +	current->bio_list = bio_list;
> > +
> > +	do {
> > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > +
> > +		if (unlikely(!interposer)) {
> > +			/* interposer was removed */
> > +			bio_list_add(&current->bio_list[0], bio);
> > +			break;
> > +		}
> > +		/* assign bio to interposer device */
> > +		bio_set_dev(bio, interposer);
> > +		bio_set_flag(bio, BIO_INTERPOSED);
> 
> Reassigning the bi_bdev here means the original source is lost by the
> time we reach the interposer.  This initially seemed a little limiting,
> but I guess the interposer driver can just record that information
> locally, so we should be fine.  The big upside of this is that no
> extra argument to submit_bio_checks, which means less changes to the
> normal fast path, so if this works for everyone that is a nice
> improvement over my draft.
> 
> > +
> > +		if (!submit_bio_checks(bio))
> > +			break;
> > +		/*
> > +		 * Because the current->bio_list is initialized,
> > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > +		 */
> > +		interposer->bd_disk->fops->submit_bio(bio);
> > +	} while (false);
> 
> I find the do { ... } while (false) idiom here a little strange.  Normal
> kernel style would be a goto done instead of the breaks.
> 

Ok. I'll use the normal kernel style.

> > +int bdev_interposer_attach(struct block_device *original,
> > +			   struct block_device *interposer)
> 
> A kerneldoc comment for bdev_interposer_attach (and
> bdev_interposer_detach) would be nice to explain the API a little more.
> 

Yes, I should add kerneldoc comments.

> > +{
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(((!original) || (!interposer))))
> > +		return -EINVAL;
> 
> No need for the inner two levels of braces.

Ok.

> 
> > +	 * interposer should be simple, no a multi-queue device
> > +	 */
> > +	if (!interposer->bd_disk->fops->submit_bio)
> 
> Please use queue_is_mq() instead.

Ok.

> 
> > +	if (bdev_has_interposer(original))
> > +		ret = -EBUSY;
> > +	else {
> > +		original->bd_interposer = bdgrab(interposer);
> 
> Just thinking out a loud:  what happens if the interposed device
> goes away?  Shouldn't we at very least also make sure this
> gabs another refererence on bdev as well?

If the original device is removed from the system, the interposer device
will be permanently occupied. I need to add an interposer release when
deleting a block device.

> 
> > +struct bdev_interposer;
> 
> Not needed any more.

Yes.

> 
> > +static inline bool bdev_has_interposer(struct block_device *bdev)
> > +{
> > +	return (bdev->bd_interposer != NULL);
> > +};
> 
> No need for the braces.

Ok.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-14  9:30   ` Christoph Hellwig
@ 2021-03-15 13:25     ` Sergei Shtepa
  2021-03-16 15:23       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-15 13:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, Pavel Tide

The 03/14/2021 12:30, Christoph Hellwig wrote:
> On Fri, Mar 12, 2021 at 06:44:55PM +0300, Sergei Shtepa wrote:
> > DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
> > Underlying block device opens without a flag FMODE_EXCL.
> > DM target receives bio from the original device via
> > bdev_interposer.
> 
> This is more of a philopical comment, but the idea of just letting the
> interposed reopen the device by itself seems like a bad idea.  I think
> that is probably better hidden in the block layer interposer attachment
> function, which could do the extra blkdev_get_by_dev for the caller.

I suppose this cannot be implemented, since we need to change the behavior
for block devices that already have been opened.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-12 15:44 ` [PATCH v7 2/3] block: add bdev_interposer Sergei Shtepa
  2021-03-14  9:28   ` Christoph Hellwig
@ 2021-03-16  8:09   ` Ming Lei
  2021-03-16 16:35     ` Sergei Shtepa
  1 sibling, 1 reply; 25+ messages in thread
From: Ming Lei @ 2021-03-16  8:09 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, pavel.tide

On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> bdev_interposer allows to redirect bio requests to another devices.
> 
> Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> ---
>  block/bio.c               |  2 ++
>  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
>  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
>  include/linux/blk_types.h |  3 +++
>  include/linux/blkdev.h    |  9 +++++++
>  5 files changed, 125 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index a1c4d2900c7a..0bfbf06475ee 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>  		bio_set_flag(bio, BIO_THROTTLED);
>  	if (bio_flagged(bio_src, BIO_REMAPPED))
>  		bio_set_flag(bio, BIO_REMAPPED);
> +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> +		bio_set_flag(bio, BIO_INTERPOSED);
>  	bio->bi_opf = bio_src->bi_opf;
>  	bio->bi_ioprio = bio_src->bi_ioprio;
>  	bio->bi_write_hint = bio_src->bi_write_hint;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..da1abc4c27a9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>  	return ret;
>  }
>  
> +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> +{
> +	blk_qc_t ret = BLK_QC_T_NONE;
> +	struct bio_list bio_list[2] = { };
> +	struct gendisk *orig_disk;
> +
> +	if (current->bio_list) {
> +		bio_list_add(&current->bio_list[0], bio);
> +		return BLK_QC_T_NONE;
> +	}
> +
> +	orig_disk = bio->bi_bdev->bd_disk;
> +	if (unlikely(bio_queue_enter(bio)))
> +		return BLK_QC_T_NONE;
> +
> +	current->bio_list = bio_list;
> +
> +	do {
> +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> +
> +		if (unlikely(!interposer)) {
> +			/* interposer was removed */
> +			bio_list_add(&current->bio_list[0], bio);
> +			break;
> +		}
> +		/* assign bio to interposer device */
> +		bio_set_dev(bio, interposer);
> +		bio_set_flag(bio, BIO_INTERPOSED);
> +
> +		if (!submit_bio_checks(bio))
> +			break;
> +		/*
> +		 * Because the current->bio_list is initialized,
> +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> +		 */
> +		interposer->bd_disk->fops->submit_bio(bio);

Given original request queue may become live when calling attach() and
detach(), see below comment. bdev_interposer_detach() may be run
when running ->submit_bio(), meantime the interposer device is
gone during the period, then kernel oops.

> +	} while (false);
> +
> +	current->bio_list = NULL;
> +
> +	blk_queue_exit(orig_disk->queue);
> +
> +	/* Resubmit remaining bios */
> +	while ((bio = bio_list_pop(&bio_list[0])))
> +		ret = submit_bio_noacct(bio);
> +
> +	return ret;
> +}
> +
>  /**
>   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
>   * @bio:  The bio describing the location in memory and on the device.
> @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>   */
>  blk_qc_t submit_bio_noacct(struct bio *bio)
>  {
> +	/*
> +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> +	 * created by the bdev_interposer do not get to it for processing.
> +	 */
> +	if (bdev_has_interposer(bio->bi_bdev) &&
> +	    !bio_flagged(bio, BIO_INTERPOSED))
> +		return submit_bio_interposed(bio);
> +
>  	if (!submit_bio_checks(bio))
>  		return BLK_QC_T_NONE;
>  
> diff --git a/block/genhd.c b/block/genhd.c
> index c55e8f0fced1..c840ecffea68 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -30,6 +30,11 @@
>  static struct kobject *block_depr;
>  
>  DECLARE_RWSEM(bdev_lookup_sem);
> +/*
> + * Prevents different block-layer interposers from attaching or detaching
> + * to the block device at the same time.
> + */
> +static DEFINE_MUTEX(bdev_interposer_attach_lock);
>  
>  /* for extended dynamic devt allocation, currently only one major is used */
>  #define NR_EXT_DEVT		(1 << MINORBITS)
> @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
>  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
>  	kfree(disk->ev);
>  }
> +
> +int bdev_interposer_attach(struct block_device *original,
> +			   struct block_device *interposer)
> +{
> +	int ret = 0;
> +
> +	if (WARN_ON(((!original) || (!interposer))))
> +		return -EINVAL;
> +	/*
> +	 * interposer should be simple, no a multi-queue device
> +	 */
> +	if (!interposer->bd_disk->fops->submit_bio)
> +		return -EINVAL;
> +
> +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> +		return -EPERM;

The original request queue may become live now...

> +
> +	mutex_lock(&bdev_interposer_attach_lock);
> +
> +	if (bdev_has_interposer(original))
> +		ret = -EBUSY;
> +	else {
> +		original->bd_interposer = bdgrab(interposer);
> +		if (!original->bd_interposer)
> +			ret = -ENODEV;
> +	}
> +
> +	mutex_unlock(&bdev_interposer_attach_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(bdev_interposer_attach);
> +
> +void bdev_interposer_detach(struct block_device *original)
> +{
> +	if (WARN_ON(!original))
> +		return;
> +
> +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> +		return;

The original request queue may become live now...


-- 
Ming


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-15 13:25     ` Sergei Shtepa
@ 2021-03-16 15:23       ` Christoph Hellwig
  2021-03-16 15:25         ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-16 15:23 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, Pavel Tide

On Mon, Mar 15, 2021 at 04:25:09PM +0300, Sergei Shtepa wrote:
> The 03/14/2021 12:30, Christoph Hellwig wrote:
> > On Fri, Mar 12, 2021 at 06:44:55PM +0300, Sergei Shtepa wrote:
> > > DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
> > > Underlying block device opens without a flag FMODE_EXCL.
> > > DM target receives bio from the original device via
> > > bdev_interposer.
> > 
> > This is more of a philopical comment, but the idea of just letting the
> > interposed reopen the device by itself seems like a bad idea.  I think
> > that is probably better hidden in the block layer interposer attachment
> > function, which could do the extra blkdev_get_by_dev for the caller.
> 
> I suppose this cannot be implemented, since we need to change the behavior
> for block devices that already have been opened.

That's not what I mean.  Take a look at the patch relative to your
series to let me know what you think.  The new blkdev_interposer_attach
now takes a dev_t + mode for the original device and opens it on
behalf of the interposer.  It also moves the queue freezing into the
API, which should address the concerns about the helper and adds a few
more sanity checks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-16 15:23       ` Christoph Hellwig
@ 2021-03-16 15:25         ` Christoph Hellwig
  2021-03-16 16:20           ` Sergei Shtepa
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2021-03-16 15:25 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, Pavel Tide

On Tue, Mar 16, 2021 at 03:23:14PM +0000, Christoph Hellwig wrote:
> On Mon, Mar 15, 2021 at 04:25:09PM +0300, Sergei Shtepa wrote:
> > The 03/14/2021 12:30, Christoph Hellwig wrote:
> > > On Fri, Mar 12, 2021 at 06:44:55PM +0300, Sergei Shtepa wrote:
> > > > DM_INTERPOSED_FLAG allow to create DM targets on "the fly".
> > > > Underlying block device opens without a flag FMODE_EXCL.
> > > > DM target receives bio from the original device via
> > > > bdev_interposer.
> > > 
> > > This is more of a philopical comment, but the idea of just letting the
> > > interposed reopen the device by itself seems like a bad idea.  I think
> > > that is probably better hidden in the block layer interposer attachment
> > > function, which could do the extra blkdev_get_by_dev for the caller.
> > 
> > I suppose this cannot be implemented, since we need to change the behavior
> > for block devices that already have been opened.
> 
> That's not what I mean.  Take a look at the patch relative to your
> series to let me know what you think.  The new blkdev_interposer_attach
> now takes a dev_t + mode for the original device and opens it on
> behalf of the interposer.  It also moves the queue freezing into the
> API, which should address the concerns about the helper and adds a few
> more sanity checks.

And now actually with the diff:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2f188a865024ac..d4d7c1caa43966 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -161,19 +161,6 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
-bool blk_mq_is_queue_frozen(struct request_queue *q)
-{
-	bool frozen;
-
-	mutex_lock(&q->mq_freeze_lock);
-	frozen = percpu_ref_is_dying(&q->q_usage_counter) &&
-		 percpu_ref_is_zero(&q->q_usage_counter);
-	mutex_unlock(&q->mq_freeze_lock);
-
-	return frozen;
-}
-EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
-
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
diff --git a/block/genhd.c b/block/genhd.c
index fa406b972371ae..64d6338b08cc87 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1944,51 +1944,70 @@ static void disk_release_events(struct gendisk *disk)
 	kfree(disk->ev);
 }
 
-int bdev_interposer_attach(struct block_device *original,
+struct block_device *blkdev_interposer_attach(dev_t dev, fmode_t mode,
 			   struct block_device *interposer)
 {
+	struct block_device *bdev;
 	int ret = 0;
 
-	if (WARN_ON(((!original) || (!interposer))))
-		return -EINVAL;
-	/*
-	 * interposer should be simple, no a multi-queue device
-	 */
-	if (!interposer->bd_disk->fops->submit_bio)
-		return -EINVAL;
+	if (WARN_ON_ONCE(!bdev_is_partition(interposer)))
+		return ERR_PTR(-EINVAL);
+	if (WARN_ON_ONCE(!queue_is_mq(interposer->bd_disk->queue)))
+		return ERR_PTR(-EINVAL);
 
-	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
-		return -EPERM;
+	bdev = blkdev_get_by_dev(dev, mode, NULL);
+	if (IS_ERR(bdev))
+		return bdev;
 
-	mutex_lock(&bdev_interposer_attach_lock);
+	ret = -EINVAL;
+	if (WARN_ON_ONCE(bdev_nr_sectors(bdev) != bdev_nr_sectors(interposer)))
+		goto out;
 
-	if (bdev_has_interposer(original))
-		ret = -EBUSY;
-	else {
-		original->bd_interposer = bdgrab(interposer);
-		if (!original->bd_interposer)
-			ret = -ENODEV;
-	}
+	blk_mq_freeze_queue(bdev->bd_disk->queue);
+	blk_mq_quiesce_queue(bdev->bd_disk->queue);
 
+	mutex_lock(&bdev_interposer_attach_lock);
+	ret = -EBUSY;
+	if (bdev_has_interposer(bdev))
+		goto out_unlock;
+	ret = -ENODEV;
+	bdev->bd_interposer = bdgrab(interposer);
+	if (!bdev->bd_interposer)
+		goto out_unlock;
+	ret = 0;
+out_unlock:
 	mutex_unlock(&bdev_interposer_attach_lock);
 
-	return ret;
+	blk_mq_unquiesce_queue(bdev->bd_disk->queue);
+	blk_mq_unfreeze_queue(bdev->bd_disk->queue);
+out:
+	if (ret) {
+		blkdev_put(bdev, mode);
+		bdev = ERR_PTR(ret);
+	}
+
+	return bdev;
 }
-EXPORT_SYMBOL_GPL(bdev_interposer_attach);
+EXPORT_SYMBOL_GPL(blkdev_interposer_attach);
 
-void bdev_interposer_detach(struct block_device *original)
+void blkdev_interposer_detach(struct block_device *bdev, fmode_t mode)
 {
-	if (WARN_ON(!original))
-		return;
+	struct block_device *interposer;
 
-	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
+	if (WARN_ON_ONCE(!bdev_has_interposer(bdev)))
 		return;
 
+	blk_mq_freeze_queue(bdev->bd_disk->queue);
+	blk_mq_quiesce_queue(bdev->bd_disk->queue);
+
 	mutex_lock(&bdev_interposer_attach_lock);
-	if (bdev_has_interposer(original)) {
-		bdput(original->bd_interposer);
-		original->bd_interposer = NULL;
-	}
+	interposer = bdev->bd_interposer;
+	bdev->bd_interposer = NULL;
 	mutex_unlock(&bdev_interposer_attach_lock);
+
+	blk_mq_unquiesce_queue(bdev->bd_disk->queue);
+	blk_mq_unfreeze_queue(bdev->bd_disk->queue);
+
+	blkdev_put(interposer, mode);
 }
-EXPORT_SYMBOL_GPL(bdev_interposer_detach);
+EXPORT_SYMBOL_GPL(blkdev_interposer_detach);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index f6e2eb3f894940..fde57bb5105025 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -359,18 +359,6 @@ dev_t dm_get_dev_t(const char *path)
 }
 EXPORT_SYMBOL_GPL(dm_get_dev_t);
 
-static inline void dm_disk_freeze(struct gendisk *disk)
-{
-	blk_mq_freeze_queue(disk->queue);
-	blk_mq_quiesce_queue(disk->queue);
-}
-
-static inline void dm_disk_unfreeze(struct gendisk *disk)
-{
-	blk_mq_unquiesce_queue(disk->queue);
-	blk_mq_unfreeze_queue(disk->queue);
-}
-
 /*
  * Add a device to the list, or just increment the usage count if
  * it's already present.
@@ -418,29 +406,11 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
 		refcount_inc(&dd->count);
 	}
 
-	if (t->md->is_interposed) {
-		struct block_device *original = dd->dm_dev->bdev;
-		struct block_device *interposer = t->md->disk->part0;
-
-		if ((ti->begin != 0) || (ti->len < bdev_nr_sectors(original))) {
-			dm_put_device(ti, dd->dm_dev);
-			DMERR("The interposer device should not be less than the original.");
-			return -EINVAL;
-		}
-
-		/*
-		 * Attach mapped interposer device to original.
-		 * It is quite convenient that device mapper creates
-		 * one disk for one block device.
-		 */
-		dm_disk_freeze(original->bd_disk);
-		r = bdev_interposer_attach(original, interposer);
-		dm_disk_unfreeze(original->bd_disk);
-		if (r) {
-			dm_put_device(ti, dd->dm_dev);
-			DMERR("Failed to attach dm interposer.");
-			return r;
-		}
+	if (t->md->is_interposed &&
+	    (ti->begin != 0 || ti->len < bdev_nr_sectors(dd->dm_dev->bdev))) {
+		dm_put_device(ti, dd->dm_dev);
+		DMERR("The interposer device should not be less than the original.");
+		return -EINVAL;
 	}
 
 	*result = dd->dm_dev;
@@ -496,11 +466,6 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
 		       dm_device_name(md), d->name);
 		return;
 	}
-	if (md->is_interposed) {
-		dm_disk_freeze(d->bdev->bd_disk);
-		bdev_interposer_detach(d->bdev);
-		dm_disk_unfreeze(d->bdev->bd_disk);
-	}
 
 	if (refcount_dec_and_test(&dd->count)) {
 		dm_put_table_device(md, d);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c488e9554aa000..532ce17064b1c1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -763,10 +763,12 @@ static int open_table_device(struct table_device *td, dev_t dev,
 	BUG_ON(td->dm_dev.bdev);
 
 	if (md->is_interposed) {
-
-		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode, NULL);
-		if (IS_ERR(bdev))
+		bdev = blkdev_interposer_attach(dev, td->dm_dev.mode,
+						md->disk->part0);
+		if (IS_ERR(bdev)) {
+			DMERR("Failed to attach dm interposer.");
 			return PTR_ERR(bdev);
+		}
 	} else {
 		bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
 		if (IS_ERR(bdev))
@@ -793,9 +795,9 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
 	if (!td->dm_dev.bdev)
 		return;
 
-	if (td->dm_dev.is_interposed)
-		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
-	else {
+	if (td->dm_dev.is_interposed) {
+		blkdev_interposer_detach(td->dm_dev.bdev, td->dm_dev.mode);
+	} else {
 		bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
 		blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
 	}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 6f01971abf7b9b..2c473c9b899089 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -533,7 +533,6 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
-bool blk_mq_is_queue_frozen(struct request_queue *q);
 
 int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 90f62b4197da91..fbc510162c3827 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -2036,8 +2036,8 @@ static inline bool bdev_has_interposer(struct block_device *bdev)
 	return (bdev->bd_interposer != NULL);
 };
 
-int bdev_interposer_attach(struct block_device *original,
+struct block_device *blkdev_interposer_attach(dev_t dev, fmode_t mode,
 			   struct block_device *interposer);
-void bdev_interposer_detach(struct block_device *original);
+void blkdev_interposer_detach(struct block_device *bdev, fmode_t mode);
 
 #endif /* _LINUX_BLKDEV_H */

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG
  2021-03-16 15:25         ` Christoph Hellwig
@ 2021-03-16 16:20           ` Sergei Shtepa
  0 siblings, 0 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-16 16:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Alasdair Kergon, Hannes Reinecke, Jens Axboe,
	dm-devel, linux-block, linux-kernel, linux-api, Pavel Tide

Thanks!
I've already started doing something like that.
I'm glad we're thinking in the same direction.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-16  8:09   ` Ming Lei
@ 2021-03-16 16:35     ` Sergei Shtepa
  2021-03-17  3:03       ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-16 16:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, Pavel Tide

The 03/16/2021 11:09, Ming Lei wrote:
> On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > bdev_interposer allows to redirect bio requests to another devices.
> > 
> > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > ---
> >  block/bio.c               |  2 ++
> >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> >  include/linux/blk_types.h |  3 +++
> >  include/linux/blkdev.h    |  9 +++++++
> >  5 files changed, 125 insertions(+)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index a1c4d2900c7a..0bfbf06475ee 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> >  		bio_set_flag(bio, BIO_THROTTLED);
> >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> >  		bio_set_flag(bio, BIO_REMAPPED);
> > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > +		bio_set_flag(bio, BIO_INTERPOSED);
> >  	bio->bi_opf = bio_src->bi_opf;
> >  	bio->bi_ioprio = bio_src->bi_ioprio;
> >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index fc60ff208497..da1abc4c27a9 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > +{
> > +	blk_qc_t ret = BLK_QC_T_NONE;
> > +	struct bio_list bio_list[2] = { };
> > +	struct gendisk *orig_disk;
> > +
> > +	if (current->bio_list) {
> > +		bio_list_add(&current->bio_list[0], bio);
> > +		return BLK_QC_T_NONE;
> > +	}
> > +
> > +	orig_disk = bio->bi_bdev->bd_disk;
> > +	if (unlikely(bio_queue_enter(bio)))
> > +		return BLK_QC_T_NONE;
> > +
> > +	current->bio_list = bio_list;
> > +
> > +	do {
> > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > +
> > +		if (unlikely(!interposer)) {
> > +			/* interposer was removed */
> > +			bio_list_add(&current->bio_list[0], bio);
> > +			break;
> > +		}
> > +		/* assign bio to interposer device */
> > +		bio_set_dev(bio, interposer);
> > +		bio_set_flag(bio, BIO_INTERPOSED);
> > +
> > +		if (!submit_bio_checks(bio))
> > +			break;
> > +		/*
> > +		 * Because the current->bio_list is initialized,
> > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > +		 */
> > +		interposer->bd_disk->fops->submit_bio(bio);
> 
> Given original request queue may become live when calling attach() and
> detach(), see below comment. bdev_interposer_detach() may be run
> when running ->submit_bio(), meantime the interposer device is
> gone during the period, then kernel oops.

I think that since the bio_queue_enter() function was called,
q->q_usage_counter will not allow the critical code in the attach/detach
functions to be executed, which is located between the blk_freeze_queue
and blk_unfreeze_queue calls.
Please correct me if I'm wrong.

> 
> > +	} while (false);
> > +
> > +	current->bio_list = NULL;
> > +
> > +	blk_queue_exit(orig_disk->queue);
> > +
> > +	/* Resubmit remaining bios */
> > +	while ((bio = bio_list_pop(&bio_list[0])))
> > +		ret = submit_bio_noacct(bio);
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> >   * @bio:  The bio describing the location in memory and on the device.
> > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >   */
> >  blk_qc_t submit_bio_noacct(struct bio *bio)
> >  {
> > +	/*
> > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > +	 * created by the bdev_interposer do not get to it for processing.
> > +	 */
> > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > +		return submit_bio_interposed(bio);
> > +
> >  	if (!submit_bio_checks(bio))
> >  		return BLK_QC_T_NONE;
> >  
> > diff --git a/block/genhd.c b/block/genhd.c
> > index c55e8f0fced1..c840ecffea68 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -30,6 +30,11 @@
> >  static struct kobject *block_depr;
> >  
> >  DECLARE_RWSEM(bdev_lookup_sem);
> > +/*
> > + * Prevents different block-layer interposers from attaching or detaching
> > + * to the block device at the same time.
> > + */
> > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> >  
> >  /* for extended dynamic devt allocation, currently only one major is used */
> >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> >  	kfree(disk->ev);
> >  }
> > +
> > +int bdev_interposer_attach(struct block_device *original,
> > +			   struct block_device *interposer)
> > +{
> > +	int ret = 0;
> > +
> > +	if (WARN_ON(((!original) || (!interposer))))
> > +		return -EINVAL;
> > +	/*
> > +	 * interposer should be simple, no a multi-queue device
> > +	 */
> > +	if (!interposer->bd_disk->fops->submit_bio)
> > +		return -EINVAL;
> > +
> > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > +		return -EPERM;
> 
> The original request queue may become live now...

Yes.
I will remove the blk_mq_is_queue_frozen() function and use a different
approach.

> 
> > +
> > +	mutex_lock(&bdev_interposer_attach_lock);
> > +
> > +	if (bdev_has_interposer(original))
> > +		ret = -EBUSY;
> > +	else {
> > +		original->bd_interposer = bdgrab(interposer);
> > +		if (!original->bd_interposer)
> > +			ret = -ENODEV;
> > +	}
> > +
> > +	mutex_unlock(&bdev_interposer_attach_lock);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(bdev_interposer_attach);
> > +
> > +void bdev_interposer_detach(struct block_device *original)
> > +{
> > +	if (WARN_ON(!original))
> > +		return;
> > +
> > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > +		return;
> 
> The original request queue may become live now...
> 
> 
> -- 
> Ming
> 

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-16 16:35     ` Sergei Shtepa
@ 2021-03-17  3:03       ` Ming Lei
  2021-03-17 12:22         ` Sergei Shtepa
  2021-03-17 14:58         ` Mike Snitzer
  0 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2021-03-17  3:03 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, Pavel Tide

On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> The 03/16/2021 11:09, Ming Lei wrote:
> > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > bdev_interposer allows to redirect bio requests to another devices.
> > > 
> > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > > ---
> > >  block/bio.c               |  2 ++
> > >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> > >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> > >  include/linux/blk_types.h |  3 +++
> > >  include/linux/blkdev.h    |  9 +++++++
> > >  5 files changed, 125 insertions(+)
> > > 
> > > diff --git a/block/bio.c b/block/bio.c
> > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> > >  		bio_set_flag(bio, BIO_THROTTLED);
> > >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> > >  		bio_set_flag(bio, BIO_REMAPPED);
> > > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > >  	bio->bi_opf = bio_src->bi_opf;
> > >  	bio->bi_ioprio = bio_src->bi_ioprio;
> > >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index fc60ff208497..da1abc4c27a9 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > >  	return ret;
> > >  }
> > >  
> > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > +{
> > > +	blk_qc_t ret = BLK_QC_T_NONE;
> > > +	struct bio_list bio_list[2] = { };
> > > +	struct gendisk *orig_disk;
> > > +
> > > +	if (current->bio_list) {
> > > +		bio_list_add(&current->bio_list[0], bio);
> > > +		return BLK_QC_T_NONE;
> > > +	}
> > > +
> > > +	orig_disk = bio->bi_bdev->bd_disk;
> > > +	if (unlikely(bio_queue_enter(bio)))
> > > +		return BLK_QC_T_NONE;
> > > +
> > > +	current->bio_list = bio_list;
> > > +
> > > +	do {
> > > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > +
> > > +		if (unlikely(!interposer)) {
> > > +			/* interposer was removed */
> > > +			bio_list_add(&current->bio_list[0], bio);
> > > +			break;
> > > +		}
> > > +		/* assign bio to interposer device */
> > > +		bio_set_dev(bio, interposer);
> > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > +
> > > +		if (!submit_bio_checks(bio))
> > > +			break;
> > > +		/*
> > > +		 * Because the current->bio_list is initialized,
> > > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > > +		 */
> > > +		interposer->bd_disk->fops->submit_bio(bio);
> > 
> > Given original request queue may become live when calling attach() and
> > detach(), see below comment. bdev_interposer_detach() may be run
> > when running ->submit_bio(), meantime the interposer device is
> > gone during the period, then kernel oops.
> 
> I think that since the bio_queue_enter() function was called,
> q->q_usage_counter will not allow the critical code in the attach/detach
> functions to be executed, which is located between the blk_freeze_queue
> and blk_unfreeze_queue calls.
> Please correct me if I'm wrong.
> 
> > 
> > > +	} while (false);
> > > +
> > > +	current->bio_list = NULL;
> > > +
> > > +	blk_queue_exit(orig_disk->queue);
> > > +
> > > +	/* Resubmit remaining bios */
> > > +	while ((bio = bio_list_pop(&bio_list[0])))
> > > +		ret = submit_bio_noacct(bio);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  /**
> > >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > >   * @bio:  The bio describing the location in memory and on the device.
> > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > >   */
> > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > >  {
> > > +	/*
> > > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > +	 * created by the bdev_interposer do not get to it for processing.
> > > +	 */
> > > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > > +		return submit_bio_interposed(bio);
> > > +
> > >  	if (!submit_bio_checks(bio))
> > >  		return BLK_QC_T_NONE;
> > >  
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index c55e8f0fced1..c840ecffea68 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -30,6 +30,11 @@
> > >  static struct kobject *block_depr;
> > >  
> > >  DECLARE_RWSEM(bdev_lookup_sem);
> > > +/*
> > > + * Prevents different block-layer interposers from attaching or detaching
> > > + * to the block device at the same time.
> > > + */
> > > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> > >  
> > >  /* for extended dynamic devt allocation, currently only one major is used */
> > >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> > >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> > >  	kfree(disk->ev);
> > >  }
> > > +
> > > +int bdev_interposer_attach(struct block_device *original,
> > > +			   struct block_device *interposer)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	if (WARN_ON(((!original) || (!interposer))))
> > > +		return -EINVAL;
> > > +	/*
> > > +	 * interposer should be simple, no a multi-queue device
> > > +	 */
> > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > +		return -EINVAL;
> > > +
> > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > +		return -EPERM;
> > 
> > The original request queue may become live now...
> 
> Yes.
> I will remove the blk_mq_is_queue_frozen() function and use a different
> approach.

Looks what attach and detach needs is that queue is kept as frozen state
instead of being froze simply at the beginning of the two functions, so
you can simply call freeze/unfreeze inside the two functions.

But what if 'original' isn't a MQ queue?  queue usage counter is just
grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
isn't any io activity, is that a problem for bdev_interposer use case?

-- 
Ming


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-17  3:03       ` Ming Lei
@ 2021-03-17 12:22         ` Sergei Shtepa
  2021-03-17 15:04           ` Mike Snitzer
  2021-03-17 14:58         ` Mike Snitzer
  1 sibling, 1 reply; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-17 12:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Mike Snitzer, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, Pavel Tide

The 03/17/2021 06:03, Ming Lei wrote:
> On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > The 03/16/2021 11:09, Ming Lei wrote:
> > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > 
> > > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > > > ---
> > > >  block/bio.c               |  2 ++
> > > >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> > > >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> > > >  include/linux/blk_types.h |  3 +++
> > > >  include/linux/blkdev.h    |  9 +++++++
> > > >  5 files changed, 125 insertions(+)
> > > > 
> > > > diff --git a/block/bio.c b/block/bio.c
> > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > --- a/block/bio.c
> > > > +++ b/block/bio.c
> > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> > > >  		bio_set_flag(bio, BIO_THROTTLED);
> > > >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> > > >  		bio_set_flag(bio, BIO_REMAPPED);
> > > > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > >  	bio->bi_opf = bio_src->bi_opf;
> > > >  	bio->bi_ioprio = bio_src->bi_ioprio;
> > > >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index fc60ff208497..da1abc4c27a9 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > +{
> > > > +	blk_qc_t ret = BLK_QC_T_NONE;
> > > > +	struct bio_list bio_list[2] = { };
> > > > +	struct gendisk *orig_disk;
> > > > +
> > > > +	if (current->bio_list) {
> > > > +		bio_list_add(&current->bio_list[0], bio);
> > > > +		return BLK_QC_T_NONE;
> > > > +	}
> > > > +
> > > > +	orig_disk = bio->bi_bdev->bd_disk;
> > > > +	if (unlikely(bio_queue_enter(bio)))
> > > > +		return BLK_QC_T_NONE;
> > > > +
> > > > +	current->bio_list = bio_list;
> > > > +
> > > > +	do {
> > > > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > +
> > > > +		if (unlikely(!interposer)) {
> > > > +			/* interposer was removed */
> > > > +			bio_list_add(&current->bio_list[0], bio);
> > > > +			break;
> > > > +		}
> > > > +		/* assign bio to interposer device */
> > > > +		bio_set_dev(bio, interposer);
> > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > +
> > > > +		if (!submit_bio_checks(bio))
> > > > +			break;
> > > > +		/*
> > > > +		 * Because the current->bio_list is initialized,
> > > > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > > > +		 */
> > > > +		interposer->bd_disk->fops->submit_bio(bio);
> > > 
> > > Given original request queue may become live when calling attach() and
> > > detach(), see below comment. bdev_interposer_detach() may be run
> > > when running ->submit_bio(), meantime the interposer device is
> > > gone during the period, then kernel oops.
> > 
> > I think that since the bio_queue_enter() function was called,
> > q->q_usage_counter will not allow the critical code in the attach/detach
> > functions to be executed, which is located between the blk_freeze_queue
> > and blk_unfreeze_queue calls.
> > Please correct me if I'm wrong.
> > 
> > > 
> > > > +	} while (false);
> > > > +
> > > > +	current->bio_list = NULL;
> > > > +
> > > > +	blk_queue_exit(orig_disk->queue);
> > > > +
> > > > +	/* Resubmit remaining bios */
> > > > +	while ((bio = bio_list_pop(&bio_list[0])))
> > > > +		ret = submit_bio_noacct(bio);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  /**
> > > >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > > >   * @bio:  The bio describing the location in memory and on the device.
> > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > >   */
> > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > >  {
> > > > +	/*
> > > > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > +	 * created by the bdev_interposer do not get to it for processing.
> > > > +	 */
> > > > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > > > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > > > +		return submit_bio_interposed(bio);
> > > > +
> > > >  	if (!submit_bio_checks(bio))
> > > >  		return BLK_QC_T_NONE;
> > > >  
> > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > index c55e8f0fced1..c840ecffea68 100644
> > > > --- a/block/genhd.c
> > > > +++ b/block/genhd.c
> > > > @@ -30,6 +30,11 @@
> > > >  static struct kobject *block_depr;
> > > >  
> > > >  DECLARE_RWSEM(bdev_lookup_sem);
> > > > +/*
> > > > + * Prevents different block-layer interposers from attaching or detaching
> > > > + * to the block device at the same time.
> > > > + */
> > > > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> > > >  
> > > >  /* for extended dynamic devt allocation, currently only one major is used */
> > > >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > > > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> > > >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> > > >  	kfree(disk->ev);
> > > >  }
> > > > +
> > > > +int bdev_interposer_attach(struct block_device *original,
> > > > +			   struct block_device *interposer)
> > > > +{
> > > > +	int ret = 0;
> > > > +
> > > > +	if (WARN_ON(((!original) || (!interposer))))
> > > > +		return -EINVAL;
> > > > +	/*
> > > > +	 * interposer should be simple, no a multi-queue device
> > > > +	 */
> > > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > +		return -EPERM;
> > > 
> > > The original request queue may become live now...
> > 
> > Yes.
> > I will remove the blk_mq_is_queue_frozen() function and use a different
> > approach.
> 
> Looks what attach and detach needs is that queue is kept as frozen state
> instead of being froze simply at the beginning of the two functions, so
> you can simply call freeze/unfreeze inside the two functions.
> 
> But what if 'original' isn't a MQ queue?  queue usage counter is just
> grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> isn't any io activity, is that a problem for bdev_interposer use case?
> 
> -- 
> Ming
> 

It makes sense to add freeze_bdev/thaw_bdev. This will be useful.
For the main file systems, the freeze functions are defined 
sb->s_op->freeze_super() or sb - >s_op->freeze_fs()
(btrfs, ext2, ext4, f2fs, jfs, nilfs2, reiserfs, xfs).
If the file system is frozen, then no new requests should be received.

But if the file system does not support freeze or the disk is used without
a file system, as for some databases, freeze_bdev seems useless to me.
In this case, we will need to stop working with the disk from user-space,
for example, to freeze the database itself.

I can add dm_suspend() before bdev_interposer_detach(). This will ensure that
all intercepted requests have been processed. Applying dm_suspend() before
bdev_interposer_attach() is pointless. The attachment is made when the target
is created, and at this time the target is not ready to work yet.
There shouldn't be any bio requests, I suppose. In addition,
sb->s_op->freeze_fs() for the interposer will not be called, because the file
system is not mounted for the interposer device. It should not be able to
be mounted. To do this, I will add an exclusive opening of the interposer
device.

I'll add freeze_bdev() for the original device and dm_suspend() for the
interposer to the DM code. For normal operation of bdev_interposer,
it is enough to transfer blk_mq_freeze_queue and blk_mq_quiesce_queue to
bdev_interposer_attach/bdev_interposer_detach.
The lock on the counter q->q_usage_counter is enough to not catch NULL in
bd_interposer.

Do you think this is enough?
I think there are no other ways to stop the block device queue.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-17  3:03       ` Ming Lei
  2021-03-17 12:22         ` Sergei Shtepa
@ 2021-03-17 14:58         ` Mike Snitzer
  1 sibling, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2021-03-17 14:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sergei Shtepa, Christoph Hellwig, Alasdair Kergon,
	Hannes Reinecke, Jens Axboe, dm-devel, linux-block, linux-kernel,
	linux-api, Pavel Tide

On Tue, Mar 16 2021 at 11:03pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > The 03/16/2021 11:09, Ming Lei wrote:
> > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > 
> > > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>

...

> > > > +
> > > > +int bdev_interposer_attach(struct block_device *original,
> > > > +			   struct block_device *interposer)
> > > > +{
> > > > +	int ret = 0;
> > > > +
> > > > +	if (WARN_ON(((!original) || (!interposer))))
> > > > +		return -EINVAL;
> > > > +	/*
> > > > +	 * interposer should be simple, no a multi-queue device
> > > > +	 */
> > > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > +		return -EPERM;
> > > 
> > > The original request queue may become live now...
> > 
> > Yes.
> > I will remove the blk_mq_is_queue_frozen() function and use a different
> > approach.
> 
> Looks what attach and detach needs is that queue is kept as frozen state
> instead of being froze simply at the beginning of the two functions, so
> you can simply call freeze/unfreeze inside the two functions.
> 
> But what if 'original' isn't a MQ queue?  queue usage counter is just
> grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> isn't any io activity, is that a problem for bdev_interposer use case?

Right, I raised the same concern here:
https://listman.redhat.com/archives/dm-devel/2021-March/msg00135.html
(toward bottom inlined after dm_disk_{freeze,unfreeze}

Anyway, this certainly needs to be addressed.

Mike


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-17 12:22         ` Sergei Shtepa
@ 2021-03-17 15:04           ` Mike Snitzer
  2021-03-17 18:14             ` Sergei Shtepa
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2021-03-17 15:04 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Ming Lei, Christoph Hellwig, Alasdair Kergon, Hannes Reinecke,
	Jens Axboe, dm-devel, linux-block, linux-kernel, linux-api,
	Pavel Tide

On Wed, Mar 17 2021 at  8:22am -0400,
Sergei Shtepa <sergei.shtepa@veeam.com> wrote:

> The 03/17/2021 06:03, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > > 
> > > > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > > > > ---
> > > > >  block/bio.c               |  2 ++
> > > > >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> > > > >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> > > > >  include/linux/blk_types.h |  3 +++
> > > > >  include/linux/blkdev.h    |  9 +++++++
> > > > >  5 files changed, 125 insertions(+)
> > > > > 
> > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > --- a/block/bio.c
> > > > > +++ b/block/bio.c
> > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> > > > >  		bio_set_flag(bio, BIO_THROTTLED);
> > > > >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > >  		bio_set_flag(bio, BIO_REMAPPED);
> > > > > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > >  	bio->bi_opf = bio_src->bi_opf;
> > > > >  	bio->bi_ioprio = bio_src->bi_ioprio;
> > > > >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > --- a/block/blk-core.c
> > > > > +++ b/block/blk-core.c
> > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > >  	return ret;
> > > > >  }
> > > > >  
> > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > +{
> > > > > +	blk_qc_t ret = BLK_QC_T_NONE;
> > > > > +	struct bio_list bio_list[2] = { };
> > > > > +	struct gendisk *orig_disk;
> > > > > +
> > > > > +	if (current->bio_list) {
> > > > > +		bio_list_add(&current->bio_list[0], bio);
> > > > > +		return BLK_QC_T_NONE;
> > > > > +	}
> > > > > +
> > > > > +	orig_disk = bio->bi_bdev->bd_disk;
> > > > > +	if (unlikely(bio_queue_enter(bio)))
> > > > > +		return BLK_QC_T_NONE;
> > > > > +
> > > > > +	current->bio_list = bio_list;
> > > > > +
> > > > > +	do {
> > > > > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > > +
> > > > > +		if (unlikely(!interposer)) {
> > > > > +			/* interposer was removed */
> > > > > +			bio_list_add(&current->bio_list[0], bio);
> > > > > +			break;
> > > > > +		}
> > > > > +		/* assign bio to interposer device */
> > > > > +		bio_set_dev(bio, interposer);
> > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > +
> > > > > +		if (!submit_bio_checks(bio))
> > > > > +			break;
> > > > > +		/*
> > > > > +		 * Because the current->bio_list is initialized,
> > > > > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > > > > +		 */
> > > > > +		interposer->bd_disk->fops->submit_bio(bio);
> > > > 
> > > > Given original request queue may become live when calling attach() and
> > > > detach(), see below comment. bdev_interposer_detach() may be run
> > > > when running ->submit_bio(), meantime the interposer device is
> > > > gone during the period, then kernel oops.
> > > 
> > > I think that since the bio_queue_enter() function was called,
> > > q->q_usage_counter will not allow the critical code in the attach/detach
> > > functions to be executed, which is located between the blk_freeze_queue
> > > and blk_unfreeze_queue calls.
> > > Please correct me if I'm wrong.
> > > 
> > > > 
> > > > > +	} while (false);
> > > > > +
> > > > > +	current->bio_list = NULL;
> > > > > +
> > > > > +	blk_queue_exit(orig_disk->queue);
> > > > > +
> > > > > +	/* Resubmit remaining bios */
> > > > > +	while ((bio = bio_list_pop(&bio_list[0])))
> > > > > +		ret = submit_bio_noacct(bio);
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > > > >   * @bio:  The bio describing the location in memory and on the device.
> > > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > >   */
> > > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > > >  {
> > > > > +	/*
> > > > > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > > +	 * created by the bdev_interposer do not get to it for processing.
> > > > > +	 */
> > > > > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > > > > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > > > > +		return submit_bio_interposed(bio);
> > > > > +
> > > > >  	if (!submit_bio_checks(bio))
> > > > >  		return BLK_QC_T_NONE;
> > > > >  
> > > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > > index c55e8f0fced1..c840ecffea68 100644
> > > > > --- a/block/genhd.c
> > > > > +++ b/block/genhd.c
> > > > > @@ -30,6 +30,11 @@
> > > > >  static struct kobject *block_depr;
> > > > >  
> > > > >  DECLARE_RWSEM(bdev_lookup_sem);
> > > > > +/*
> > > > > + * Prevents different block-layer interposers from attaching or detaching
> > > > > + * to the block device at the same time.
> > > > > + */
> > > > > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> > > > >  
> > > > >  /* for extended dynamic devt allocation, currently only one major is used */
> > > > >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > > > > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> > > > >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> > > > >  	kfree(disk->ev);
> > > > >  }
> > > > > +
> > > > > +int bdev_interposer_attach(struct block_device *original,
> > > > > +			   struct block_device *interposer)
> > > > > +{
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	if (WARN_ON(((!original) || (!interposer))))
> > > > > +		return -EINVAL;
> > > > > +	/*
> > > > > +	 * interposer should be simple, no a multi-queue device
> > > > > +	 */
> > > > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > > +		return -EPERM;
> > > > 
> > > > The original request queue may become live now...
> > > 
> > > Yes.
> > > I will remove the blk_mq_is_queue_frozen() function and use a different
> > > approach.
> > 
> > Looks what attach and detach needs is that queue is kept as frozen state
> > instead of being froze simply at the beginning of the two functions, so
> > you can simply call freeze/unfreeze inside the two functions.
> > 
> > But what if 'original' isn't a MQ queue?  queue usage counter is just
> > grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> > isn't any io activity, is that a problem for bdev_interposer use case?
> > 
> > -- 
> > Ming
> > 
> 
> It makes sense to add freeze_bdev/thaw_bdev. This will be useful.
> For the main file systems, the freeze functions are defined 
> sb->s_op->freeze_super() or sb - >s_op->freeze_fs()
> (btrfs, ext2, ext4, f2fs, jfs, nilfs2, reiserfs, xfs).
> If the file system is frozen, then no new requests should be received.
> 
> But if the file system does not support freeze or the disk is used without
> a file system, as for some databases, freeze_bdev seems useless to me.
> In this case, we will need to stop working with the disk from user-space,
> for example, to freeze the database itself.
> 
> I can add dm_suspend() before bdev_interposer_detach(). This will ensure that
> all intercepted requests have been processed. Applying dm_suspend() before
> bdev_interposer_attach() is pointless. The attachment is made when the target
> is created, and at this time the target is not ready to work yet.
> There shouldn't be any bio requests, I suppose. In addition,
> sb->s_op->freeze_fs() for the interposer will not be called, because the file
> system is not mounted for the interposer device. It should not be able to
> be mounted. To do this, I will add an exclusive opening of the interposer
> device.
> 
> I'll add freeze_bdev() for the original device and dm_suspend() for the
> interposer to the DM code. For normal operation of bdev_interposer,
> it is enough to transfer blk_mq_freeze_queue and blk_mq_quiesce_queue to
> bdev_interposer_attach/bdev_interposer_detach.
> The lock on the counter q->q_usage_counter is enough to not catch NULL in
> bd_interposer.
> 
> Do you think this is enough?
> I think there are no other ways to stop the block device queue.

Either you're pretty confused, or I am... regardless.. I think we need
to cover the basics of how interposer is expected to be paired with
an "original" device.

Those "original" device are already active and potentially in use
right?  They may be either request-based blk-mq _or_ bio-based.

So what confuses me is that you're making assertions about how actively
used bio-based DM devices aren't in use until the interposed device
create happens... this is all getting very muddled.

And your lack of understanding of these various IO flushing methods
(freeze/thaw, suspend/resume, etc) is showing.  Please slow down and
approach this more systematically.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-17 15:04           ` Mike Snitzer
@ 2021-03-17 18:14             ` Sergei Shtepa
  2021-03-17 19:13               ` Mike Snitzer
  0 siblings, 1 reply; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-17 18:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Christoph Hellwig, Alasdair Kergon, Hannes Reinecke,
	Jens Axboe, dm-devel, linux-block, linux-kernel, linux-api,
	Pavel Tide

The 03/17/2021 18:04, Mike Snitzer wrote:
> On Wed, Mar 17 2021 at  8:22am -0400,
> Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
> 
> > The 03/17/2021 06:03, Ming Lei wrote:
> > > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > > > 
> > > > > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > > > > > ---
> > > > > >  block/bio.c               |  2 ++
> > > > > >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> > > > > >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> > > > > >  include/linux/blk_types.h |  3 +++
> > > > > >  include/linux/blkdev.h    |  9 +++++++
> > > > > >  5 files changed, 125 insertions(+)
> > > > > > 
> > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > > --- a/block/bio.c
> > > > > > +++ b/block/bio.c
> > > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> > > > > >  		bio_set_flag(bio, BIO_THROTTLED);
> > > > > >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > > >  		bio_set_flag(bio, BIO_REMAPPED);
> > > > > > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > >  	bio->bi_opf = bio_src->bi_opf;
> > > > > >  	bio->bi_ioprio = bio_src->bi_ioprio;
> > > > > >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > > --- a/block/blk-core.c
> > > > > > +++ b/block/blk-core.c
> > > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > >  	return ret;
> > > > > >  }
> > > > > >  
> > > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > > +{
> > > > > > +	blk_qc_t ret = BLK_QC_T_NONE;
> > > > > > +	struct bio_list bio_list[2] = { };
> > > > > > +	struct gendisk *orig_disk;
> > > > > > +
> > > > > > +	if (current->bio_list) {
> > > > > > +		bio_list_add(&current->bio_list[0], bio);
> > > > > > +		return BLK_QC_T_NONE;
> > > > > > +	}
> > > > > > +
> > > > > > +	orig_disk = bio->bi_bdev->bd_disk;
> > > > > > +	if (unlikely(bio_queue_enter(bio)))
> > > > > > +		return BLK_QC_T_NONE;
> > > > > > +
> > > > > > +	current->bio_list = bio_list;
> > > > > > +
> > > > > > +	do {
> > > > > > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > > > +
> > > > > > +		if (unlikely(!interposer)) {
> > > > > > +			/* interposer was removed */
> > > > > > +			bio_list_add(&current->bio_list[0], bio);
> > > > > > +			break;
> > > > > > +		}
> > > > > > +		/* assign bio to interposer device */
> > > > > > +		bio_set_dev(bio, interposer);
> > > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > +
> > > > > > +		if (!submit_bio_checks(bio))
> > > > > > +			break;
> > > > > > +		/*
> > > > > > +		 * Because the current->bio_list is initialized,
> > > > > > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > > > > > +		 */
> > > > > > +		interposer->bd_disk->fops->submit_bio(bio);
> > > > > 
> > > > > Given original request queue may become live when calling attach() and
> > > > > detach(), see below comment. bdev_interposer_detach() may be run
> > > > > when running ->submit_bio(), meantime the interposer device is
> > > > > gone during the period, then kernel oops.
> > > > 
> > > > I think that since the bio_queue_enter() function was called,
> > > > q->q_usage_counter will not allow the critical code in the attach/detach
> > > > functions to be executed, which is located between the blk_freeze_queue
> > > > and blk_unfreeze_queue calls.
> > > > Please correct me if I'm wrong.
> > > > 
> > > > > 
> > > > > > +	} while (false);
> > > > > > +
> > > > > > +	current->bio_list = NULL;
> > > > > > +
> > > > > > +	blk_queue_exit(orig_disk->queue);
> > > > > > +
> > > > > > +	/* Resubmit remaining bios */
> > > > > > +	while ((bio = bio_list_pop(&bio_list[0])))
> > > > > > +		ret = submit_bio_noacct(bio);
> > > > > > +
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > > > > >   * @bio:  The bio describing the location in memory and on the device.
> > > > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > >   */
> > > > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > > > >  {
> > > > > > +	/*
> > > > > > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > > > +	 * created by the bdev_interposer do not get to it for processing.
> > > > > > +	 */
> > > > > > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > > > > > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > > > > > +		return submit_bio_interposed(bio);
> > > > > > +
> > > > > >  	if (!submit_bio_checks(bio))
> > > > > >  		return BLK_QC_T_NONE;
> > > > > >  
> > > > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > > > index c55e8f0fced1..c840ecffea68 100644
> > > > > > --- a/block/genhd.c
> > > > > > +++ b/block/genhd.c
> > > > > > @@ -30,6 +30,11 @@
> > > > > >  static struct kobject *block_depr;
> > > > > >  
> > > > > >  DECLARE_RWSEM(bdev_lookup_sem);
> > > > > > +/*
> > > > > > + * Prevents different block-layer interposers from attaching or detaching
> > > > > > + * to the block device at the same time.
> > > > > > + */
> > > > > > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> > > > > >  
> > > > > >  /* for extended dynamic devt allocation, currently only one major is used */
> > > > > >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > > > > > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> > > > > >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> > > > > >  	kfree(disk->ev);
> > > > > >  }
> > > > > > +
> > > > > > +int bdev_interposer_attach(struct block_device *original,
> > > > > > +			   struct block_device *interposer)
> > > > > > +{
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	if (WARN_ON(((!original) || (!interposer))))
> > > > > > +		return -EINVAL;
> > > > > > +	/*
> > > > > > +	 * interposer should be simple, no a multi-queue device
> > > > > > +	 */
> > > > > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > > > +		return -EPERM;
> > > > > 
> > > > > The original request queue may become live now...
> > > > 
> > > > Yes.
> > > > I will remove the blk_mq_is_queue_frozen() function and use a different
> > > > approach.
> > > 
> > > Looks what attach and detach needs is that queue is kept as frozen state
> > > instead of being froze simply at the beginning of the two functions, so
> > > you can simply call freeze/unfreeze inside the two functions.
> > > 
> > > But what if 'original' isn't a MQ queue?  queue usage counter is just
> > > grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> > > isn't any io activity, is that a problem for bdev_interposer use case?
> > > 
> > > -- 
> > > Ming
> > > 
> > 
> > It makes sense to add freeze_bdev/thaw_bdev. This will be useful.
> > For the main file systems, the freeze functions are defined 
> > sb->s_op->freeze_super() or sb - >s_op->freeze_fs()
> > (btrfs, ext2, ext4, f2fs, jfs, nilfs2, reiserfs, xfs).
> > If the file system is frozen, then no new requests should be received.
> > 
> > But if the file system does not support freeze or the disk is used without
> > a file system, as for some databases, freeze_bdev seems useless to me.
> > In this case, we will need to stop working with the disk from user-space,
> > for example, to freeze the database itself.
> > 
> > I can add dm_suspend() before bdev_interposer_detach(). This will ensure that
> > all intercepted requests have been processed. Applying dm_suspend() before
> > bdev_interposer_attach() is pointless. The attachment is made when the target
> > is created, and at this time the target is not ready to work yet.
> > There shouldn't be any bio requests, I suppose. In addition,
> > sb->s_op->freeze_fs() for the interposer will not be called, because the file
> > system is not mounted for the interposer device. It should not be able to
> > be mounted. To do this, I will add an exclusive opening of the interposer
> > device.
> > 
> > I'll add freeze_bdev() for the original device and dm_suspend() for the
> > interposer to the DM code. For normal operation of bdev_interposer,
> > it is enough to transfer blk_mq_freeze_queue and blk_mq_quiesce_queue to
> > bdev_interposer_attach/bdev_interposer_detach.
> > The lock on the counter q->q_usage_counter is enough to not catch NULL in
> > bd_interposer.
> > 
> > Do you think this is enough?
> > I think there are no other ways to stop the block device queue.
> 
> Either you're pretty confused, or I am... regardless.. I think we need
> to cover the basics of how interposer is expected to be paired with
> an "original" device.

Thank you Mike for your patience. I really appreciate it.
I really may not understand something. Let me get this straight.

> 
> Those "original" device are already active and potentially in use
> right?  They may be either request-based blk-mq _or_ bio-based.

Yes. Exactly.

> 
> So what confuses me is that you're making assertions about how actively
> used bio-based DM devices aren't in use until the interposed device
> create happens... this is all getting very muddled.

The original device is indeed already actively used and already mounted.
This is most likely not a DM device.
If it is a request-based blk-mq, then it is enough to stop its queue by
blk_mq_freeze_queue(). 
If it is a bio-based device, then we can try to stop it by freeze_bdev.
But in both cases, if the blk_mq_freeze_bdev() function was called, bio cannot
get into the critical section between bio_queue_enter() and blk_queue_exit().
This allows to safely change the value of original->bd_interposer.

To intercept requests to the original device, we create a new md with
the DM_INTERPOSE_FLAG flag. It is this interposer device that has not
yet been initialized by this time. It just runs DM_TABLE_LOAD_CMD.
That is why I think that the queue of this device should not be stopped,
since this device has not yet been initialized.

> 
> And your lack of understanding of these various IO flushing methods
> (freeze/thaw, suspend/resume, etc) is showing.  Please slow down and
> approach this more systematically.

For any block device, we can call the freeze_bdev() function. It will 
allow to wait until the processing of previously sent requests is 
completed and block the sending of new ones. blk_mq_freeze_queue() 
allows to change the bd_interposer variable. This allow to attach/detach 
the interposer to original device.
dm_suspend() is used to stop mapped device. This is what I plan to use
before detaching the interposer. It will allow to wait for the
completion of all the bios that were sent for the interposer.

> 
> Thanks,
> Mike
> 

Please correct me if my reasoning is wrong.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-17 18:14             ` Sergei Shtepa
@ 2021-03-17 19:13               ` Mike Snitzer
  2021-03-18 14:56                 ` Sergei Shtepa
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2021-03-17 19:13 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Ming Lei, Christoph Hellwig, Alasdair Kergon, Hannes Reinecke,
	Jens Axboe, dm-devel, linux-block, linux-kernel, linux-api,
	Pavel Tide, Mikulas Patocka

On Wed, Mar 17 2021 at  2:14pm -0400,
Sergei Shtepa <sergei.shtepa@veeam.com> wrote:

> The 03/17/2021 18:04, Mike Snitzer wrote:
> > On Wed, Mar 17 2021 at  8:22am -0400,
> > Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
> > 
> > > The 03/17/2021 06:03, Ming Lei wrote:
> > > > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > > > > 
> > > > > > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > > > > > > ---
> > > > > > >  block/bio.c               |  2 ++
> > > > > > >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> > > > > > >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> > > > > > >  include/linux/blk_types.h |  3 +++
> > > > > > >  include/linux/blkdev.h    |  9 +++++++
> > > > > > >  5 files changed, 125 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > > > --- a/block/bio.c
> > > > > > > +++ b/block/bio.c
> > > > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> > > > > > >  		bio_set_flag(bio, BIO_THROTTLED);
> > > > > > >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > > > >  		bio_set_flag(bio, BIO_REMAPPED);
> > > > > > > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > >  	bio->bi_opf = bio_src->bi_opf;
> > > > > > >  	bio->bi_ioprio = bio_src->bi_ioprio;
> > > > > > >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > > > --- a/block/blk-core.c
> > > > > > > +++ b/block/blk-core.c
> > > > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > > >  	return ret;
> > > > > > >  }
> > > > > > >  
> > > > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > > > +{
> > > > > > > +	blk_qc_t ret = BLK_QC_T_NONE;
> > > > > > > +	struct bio_list bio_list[2] = { };
> > > > > > > +	struct gendisk *orig_disk;
> > > > > > > +
> > > > > > > +	if (current->bio_list) {
> > > > > > > +		bio_list_add(&current->bio_list[0], bio);
> > > > > > > +		return BLK_QC_T_NONE;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	orig_disk = bio->bi_bdev->bd_disk;
> > > > > > > +	if (unlikely(bio_queue_enter(bio)))
> > > > > > > +		return BLK_QC_T_NONE;
> > > > > > > +
> > > > > > > +	current->bio_list = bio_list;
> > > > > > > +
> > > > > > > +	do {
> > > > > > > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > > > > +
> > > > > > > +		if (unlikely(!interposer)) {
> > > > > > > +			/* interposer was removed */
> > > > > > > +			bio_list_add(&current->bio_list[0], bio);
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +		/* assign bio to interposer device */
> > > > > > > +		bio_set_dev(bio, interposer);
> > > > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > > +
> > > > > > > +		if (!submit_bio_checks(bio))
> > > > > > > +			break;
> > > > > > > +		/*
> > > > > > > +		 * Because the current->bio_list is initialized,
> > > > > > > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > > > > > > +		 */
> > > > > > > +		interposer->bd_disk->fops->submit_bio(bio);
> > > > > > 
> > > > > > Given original request queue may become live when calling attach() and
> > > > > > detach(), see below comment. bdev_interposer_detach() may be run
> > > > > > when running ->submit_bio(), meantime the interposer device is
> > > > > > gone during the period, then kernel oops.
> > > > > 
> > > > > I think that since the bio_queue_enter() function was called,
> > > > > q->q_usage_counter will not allow the critical code in the attach/detach
> > > > > functions to be executed, which is located between the blk_freeze_queue
> > > > > and blk_unfreeze_queue calls.
> > > > > Please correct me if I'm wrong.
> > > > > 
> > > > > > 
> > > > > > > +	} while (false);
> > > > > > > +
> > > > > > > +	current->bio_list = NULL;
> > > > > > > +
> > > > > > > +	blk_queue_exit(orig_disk->queue);
> > > > > > > +
> > > > > > > +	/* Resubmit remaining bios */
> > > > > > > +	while ((bio = bio_list_pop(&bio_list[0])))
> > > > > > > +		ret = submit_bio_noacct(bio);
> > > > > > > +
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > > > > > >   * @bio:  The bio describing the location in memory and on the device.
> > > > > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > > >   */
> > > > > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > > > > >  {
> > > > > > > +	/*
> > > > > > > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > > > > +	 * created by the bdev_interposer do not get to it for processing.
> > > > > > > +	 */
> > > > > > > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > > > > > > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > > > > > > +		return submit_bio_interposed(bio);
> > > > > > > +
> > > > > > >  	if (!submit_bio_checks(bio))
> > > > > > >  		return BLK_QC_T_NONE;
> > > > > > >  
> > > > > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > > > > index c55e8f0fced1..c840ecffea68 100644
> > > > > > > --- a/block/genhd.c
> > > > > > > +++ b/block/genhd.c
> > > > > > > @@ -30,6 +30,11 @@
> > > > > > >  static struct kobject *block_depr;
> > > > > > >  
> > > > > > >  DECLARE_RWSEM(bdev_lookup_sem);
> > > > > > > +/*
> > > > > > > + * Prevents different block-layer interposers from attaching or detaching
> > > > > > > + * to the block device at the same time.
> > > > > > > + */
> > > > > > > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> > > > > > >  
> > > > > > >  /* for extended dynamic devt allocation, currently only one major is used */
> > > > > > >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > > > > > > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> > > > > > >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> > > > > > >  	kfree(disk->ev);
> > > > > > >  }
> > > > > > > +
> > > > > > > +int bdev_interposer_attach(struct block_device *original,
> > > > > > > +			   struct block_device *interposer)
> > > > > > > +{
> > > > > > > +	int ret = 0;
> > > > > > > +
> > > > > > > +	if (WARN_ON(((!original) || (!interposer))))
> > > > > > > +		return -EINVAL;
> > > > > > > +	/*
> > > > > > > +	 * interposer should be simple, no a multi-queue device
> > > > > > > +	 */
> > > > > > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > > > > > +		return -EINVAL;
> > > > > > > +
> > > > > > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > > > > +		return -EPERM;
> > > > > > 
> > > > > > The original request queue may become live now...
> > > > > 
> > > > > Yes.
> > > > > I will remove the blk_mq_is_queue_frozen() function and use a different
> > > > > approach.
> > > > 
> > > > Looks what attach and detach needs is that queue is kept as frozen state
> > > > instead of being froze simply at the beginning of the two functions, so
> > > > you can simply call freeze/unfreeze inside the two functions.
> > > > 
> > > > But what if 'original' isn't a MQ queue?  queue usage counter is just
> > > > grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> > > > isn't any io activity, is that a problem for bdev_interposer use case?
> > > > 
> > > > -- 
> > > > Ming
> > > > 
> > > 
> > > It makes sense to add freeze_bdev/thaw_bdev. This will be useful.
> > > For the main file systems, the freeze functions are defined 
> > > sb->s_op->freeze_super() or sb - >s_op->freeze_fs()
> > > (btrfs, ext2, ext4, f2fs, jfs, nilfs2, reiserfs, xfs).
> > > If the file system is frozen, then no new requests should be received.
> > > 
> > > But if the file system does not support freeze or the disk is used without
> > > a file system, as for some databases, freeze_bdev seems useless to me.
> > > In this case, we will need to stop working with the disk from user-space,
> > > for example, to freeze the database itself.
> > > 
> > > I can add dm_suspend() before bdev_interposer_detach(). This will ensure that
> > > all intercepted requests have been processed. Applying dm_suspend() before
> > > bdev_interposer_attach() is pointless. The attachment is made when the target
> > > is created, and at this time the target is not ready to work yet.
> > > There shouldn't be any bio requests, I suppose. In addition,
> > > sb->s_op->freeze_fs() for the interposer will not be called, because the file
> > > system is not mounted for the interposer device. It should not be able to
> > > be mounted. To do this, I will add an exclusive opening of the interposer
> > > device.
> > > 
> > > I'll add freeze_bdev() for the original device and dm_suspend() for the
> > > interposer to the DM code. For normal operation of bdev_interposer,
> > > it is enough to transfer blk_mq_freeze_queue and blk_mq_quiesce_queue to
> > > bdev_interposer_attach/bdev_interposer_detach.
> > > The lock on the counter q->q_usage_counter is enough to not catch NULL in
> > > bd_interposer.
> > > 
> > > Do you think this is enough?
> > > I think there are no other ways to stop the block device queue.
> > 
> > Either you're pretty confused, or I am... regardless.. I think we need
> > to cover the basics of how interposer is expected to be paired with
> > an "original" device.
> 
> Thank you Mike for your patience. I really appreciate it.
> I really may not understand something. Let me get this straight.
> 
> > 
> > Those "original" device are already active and potentially in use
> > right?  They may be either request-based blk-mq _or_ bio-based.
> 
> Yes. Exactly.
> 
> > 
> > So what confuses me is that you're making assertions about how actively
> > used bio-based DM devices aren't in use until the interposed device
> > create happens... this is all getting very muddled.
> 
> The original device is indeed already actively used and already mounted.
> This is most likely not a DM device.
> If it is a request-based blk-mq, then it is enough to stop its queue by
> blk_mq_freeze_queue(). 
> If it is a bio-based device, then we can try to stop it by freeze_bdev.
> But in both cases, if the blk_mq_freeze_bdev() function was called, bio cannot
> get into the critical section between bio_queue_enter() and blk_queue_exit().
> This allows to safely change the value of original->bd_interposer.

Even though bios cannot get into underlying blk-mq they are already
inflight on behalf of the upper-layer bio-based device. I'll look closer
at the code but it seems like there is potential for the original
device's bios to still be queued to original, past the ->submit_bio
entry, and waiting for blk-mq to unfreeze.  Meaning upon return from
what I _think_ you're saying will be sufficient: DM bio-based device
will carry on submitting IO to the blk-mq device that has since been
interposed.. that IO will _not_ complete in terms of the interposed
device.. so you'll have a split-brain dual completion of IO from the
original bio-based DM device _and_ the interposed device (for any new io
that hits ->submit_bio after the interposed device is in place).

I think you need to have original bio-based DM suspend, interpose
device, and then resume the original.  Anything entering original's
->submit_bio from that point will all get sent to interposed
device. Right?

> To intercept requests to the original device, we create a new md with
> the DM_INTERPOSE_FLAG flag. It is this interposer device that has not
> yet been initialized by this time. It just runs DM_TABLE_LOAD_CMD.
> That is why I think that the queue of this device should not be stopped,
> since this device has not yet been initialized.
> 
> > 
> > And your lack of understanding of these various IO flushing methods
> > (freeze/thaw, suspend/resume, etc) is showing.  Please slow down and
> > approach this more systematically.
> 
> For any block device, we can call the freeze_bdev() function. It will 
> allow to wait until the processing of previously sent requests is 
> completed and block the sending of new ones. blk_mq_freeze_queue() 
> allows to change the bd_interposer variable. This allow to attach/detach 
> the interposer to original device.

freeze_bdev/thaw_bdev are only relevant if a filesystem is layered
ontop.  A bio-based DM device can be used directly (by a database or
whatever).

> dm_suspend() is used to stop mapped device. This is what I plan to use
> before detaching the interposer. It will allow to wait for the
> completion of all the bios that were sent for the interposer.

Yes, but you need to suspend before attaching the interposer too, to
flush any in-flight bios that might be in-flight within the various DM
target code.

DM should be able to internalize all this when handling the
DM_INTERPOSE_FLAG during the new table load.  It'd call into
dm_internal_suspend_fast and then dm_internal_resume_fast for the
original md device.

Mike


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 2/3] block: add bdev_interposer
  2021-03-17 19:13               ` Mike Snitzer
@ 2021-03-18 14:56                 ` Sergei Shtepa
  0 siblings, 0 replies; 25+ messages in thread
From: Sergei Shtepa @ 2021-03-18 14:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Christoph Hellwig, Alasdair Kergon, Hannes Reinecke,
	Jens Axboe, dm-devel, linux-block, linux-kernel, linux-api,
	Pavel Tide, Mikulas Patocka

The 03/17/2021 22:13, Mike Snitzer wrote:
> On Wed, Mar 17 2021 at  2:14pm -0400,
> Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
> 
> > The 03/17/2021 18:04, Mike Snitzer wrote:
> > > On Wed, Mar 17 2021 at  8:22am -0400,
> > > Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
> > > 
> > > > The 03/17/2021 06:03, Ming Lei wrote:
> > > > > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > > > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > > > > > 
> > > > > > > > Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> > > > > > > > ---
> > > > > > > >  block/bio.c               |  2 ++
> > > > > > > >  block/blk-core.c          | 57 +++++++++++++++++++++++++++++++++++++++
> > > > > > > >  block/genhd.c             | 54 +++++++++++++++++++++++++++++++++++++
> > > > > > > >  include/linux/blk_types.h |  3 +++
> > > > > > > >  include/linux/blkdev.h    |  9 +++++++
> > > > > > > >  5 files changed, 125 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > > > > --- a/block/bio.c
> > > > > > > > +++ b/block/bio.c
> > > > > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> > > > > > > >  		bio_set_flag(bio, BIO_THROTTLED);
> > > > > > > >  	if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > > > > >  		bio_set_flag(bio, BIO_REMAPPED);
> > > > > > > > +	if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > > >  	bio->bi_opf = bio_src->bi_opf;
> > > > > > > >  	bio->bi_ioprio = bio_src->bi_ioprio;
> > > > > > > >  	bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > > > > --- a/block/blk-core.c
> > > > > > > > +++ b/block/blk-core.c
> > > > > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > > > >  	return ret;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > > > > +{
> > > > > > > > +	blk_qc_t ret = BLK_QC_T_NONE;
> > > > > > > > +	struct bio_list bio_list[2] = { };
> > > > > > > > +	struct gendisk *orig_disk;
> > > > > > > > +
> > > > > > > > +	if (current->bio_list) {
> > > > > > > > +		bio_list_add(&current->bio_list[0], bio);
> > > > > > > > +		return BLK_QC_T_NONE;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	orig_disk = bio->bi_bdev->bd_disk;
> > > > > > > > +	if (unlikely(bio_queue_enter(bio)))
> > > > > > > > +		return BLK_QC_T_NONE;
> > > > > > > > +
> > > > > > > > +	current->bio_list = bio_list;
> > > > > > > > +
> > > > > > > > +	do {
> > > > > > > > +		struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > > > > > +
> > > > > > > > +		if (unlikely(!interposer)) {
> > > > > > > > +			/* interposer was removed */
> > > > > > > > +			bio_list_add(&current->bio_list[0], bio);
> > > > > > > > +			break;
> > > > > > > > +		}
> > > > > > > > +		/* assign bio to interposer device */
> > > > > > > > +		bio_set_dev(bio, interposer);
> > > > > > > > +		bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > > > +
> > > > > > > > +		if (!submit_bio_checks(bio))
> > > > > > > > +			break;
> > > > > > > > +		/*
> > > > > > > > +		 * Because the current->bio_list is initialized,
> > > > > > > > +		 * the submit_bio callback will always return BLK_QC_T_NONE.
> > > > > > > > +		 */
> > > > > > > > +		interposer->bd_disk->fops->submit_bio(bio);
> > > > > > > 
> > > > > > > Given original request queue may become live when calling attach() and
> > > > > > > detach(), see below comment. bdev_interposer_detach() may be run
> > > > > > > when running ->submit_bio(), meantime the interposer device is
> > > > > > > gone during the period, then kernel oops.
> > > > > > 
> > > > > > I think that since the bio_queue_enter() function was called,
> > > > > > q->q_usage_counter will not allow the critical code in the attach/detach
> > > > > > functions to be executed, which is located between the blk_freeze_queue
> > > > > > and blk_unfreeze_queue calls.
> > > > > > Please correct me if I'm wrong.
> > > > > > 
> > > > > > > 
> > > > > > > > +	} while (false);
> > > > > > > > +
> > > > > > > > +	current->bio_list = NULL;
> > > > > > > > +
> > > > > > > > +	blk_queue_exit(orig_disk->queue);
> > > > > > > > +
> > > > > > > > +	/* Resubmit remaining bios */
> > > > > > > > +	while ((bio = bio_list_pop(&bio_list[0])))
> > > > > > > > +		ret = submit_bio_noacct(bio);
> > > > > > > > +
> > > > > > > > +	return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > > > > > > >   * @bio:  The bio describing the location in memory and on the device.
> > > > > > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > > > >   */
> > > > > > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > > > > > >  {
> > > > > > > > +	/*
> > > > > > > > +	 * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > > > > > +	 * created by the bdev_interposer do not get to it for processing.
> > > > > > > > +	 */
> > > > > > > > +	if (bdev_has_interposer(bio->bi_bdev) &&
> > > > > > > > +	    !bio_flagged(bio, BIO_INTERPOSED))
> > > > > > > > +		return submit_bio_interposed(bio);
> > > > > > > > +
> > > > > > > >  	if (!submit_bio_checks(bio))
> > > > > > > >  		return BLK_QC_T_NONE;
> > > > > > > >  
> > > > > > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > > > > > index c55e8f0fced1..c840ecffea68 100644
> > > > > > > > --- a/block/genhd.c
> > > > > > > > +++ b/block/genhd.c
> > > > > > > > @@ -30,6 +30,11 @@
> > > > > > > >  static struct kobject *block_depr;
> > > > > > > >  
> > > > > > > >  DECLARE_RWSEM(bdev_lookup_sem);
> > > > > > > > +/*
> > > > > > > > + * Prevents different block-layer interposers from attaching or detaching
> > > > > > > > + * to the block device at the same time.
> > > > > > > > + */
> > > > > > > > +static DEFINE_MUTEX(bdev_interposer_attach_lock);
> > > > > > > >  
> > > > > > > >  /* for extended dynamic devt allocation, currently only one major is used */
> > > > > > > >  #define NR_EXT_DEVT		(1 << MINORBITS)
> > > > > > > > @@ -1940,3 +1945,52 @@ static void disk_release_events(struct gendisk *disk)
> > > > > > > >  	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
> > > > > > > >  	kfree(disk->ev);
> > > > > > > >  }
> > > > > > > > +
> > > > > > > > +int bdev_interposer_attach(struct block_device *original,
> > > > > > > > +			   struct block_device *interposer)
> > > > > > > > +{
> > > > > > > > +	int ret = 0;
> > > > > > > > +
> > > > > > > > +	if (WARN_ON(((!original) || (!interposer))))
> > > > > > > > +		return -EINVAL;
> > > > > > > > +	/*
> > > > > > > > +	 * interposer should be simple, no a multi-queue device
> > > > > > > > +	 */
> > > > > > > > +	if (!interposer->bd_disk->fops->submit_bio)
> > > > > > > > +		return -EINVAL;
> > > > > > > > +
> > > > > > > > +	if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > > > > > +		return -EPERM;
> > > > > > > 
> > > > > > > The original request queue may become live now...
> > > > > > 
> > > > > > Yes.
> > > > > > I will remove the blk_mq_is_queue_frozen() function and use a different
> > > > > > approach.
> > > > > 
> > > > > Looks what attach and detach needs is that queue is kept as frozen state
> > > > > instead of being froze simply at the beginning of the two functions, so
> > > > > you can simply call freeze/unfreeze inside the two functions.
> > > > > 
> > > > > But what if 'original' isn't a MQ queue?  queue usage counter is just
> > > > > grabed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> > > > > isn't any io activity, is that a problem for bdev_interposer use case?
> > > > > 
> > > > > -- 
> > > > > Ming
> > > > > 
> > > > 
> > > > It makes sense to add freeze_bdev/thaw_bdev. This will be useful.
> > > > For the main file systems, the freeze functions are defined 
> > > > sb->s_op->freeze_super() or sb - >s_op->freeze_fs()
> > > > (btrfs, ext2, ext4, f2fs, jfs, nilfs2, reiserfs, xfs).
> > > > If the file system is frozen, then no new requests should be received.
> > > > 
> > > > But if the file system does not support freeze or the disk is used without
> > > > a file system, as for some databases, freeze_bdev seems useless to me.
> > > > In this case, we will need to stop working with the disk from user-space,
> > > > for example, to freeze the database itself.
> > > > 
> > > > I can add dm_suspend() before bdev_interposer_detach(). This will ensure that
> > > > all intercepted requests have been processed. Applying dm_suspend() before
> > > > bdev_interposer_attach() is pointless. The attachment is made when the target
> > > > is created, and at this time the target is not ready to work yet.
> > > > There shouldn't be any bio requests, I suppose. In addition,
> > > > sb->s_op->freeze_fs() for the interposer will not be called, because the file
> > > > system is not mounted for the interposer device. It should not be able to
> > > > be mounted. To do this, I will add an exclusive opening of the interposer
> > > > device.
> > > > 
> > > > I'll add freeze_bdev() for the original device and dm_suspend() for the
> > > > interposer to the DM code. For normal operation of bdev_interposer,
> > > > it is enough to transfer blk_mq_freeze_queue and blk_mq_quiesce_queue to
> > > > bdev_interposer_attach/bdev_interposer_detach.
> > > > The lock on the counter q->q_usage_counter is enough to not catch NULL in
> > > > bd_interposer.
> > > > 
> > > > Do you think this is enough?
> > > > I think there are no other ways to stop the block device queue.
> > > 
> > > Either you're pretty confused, or I am... regardless.. I think we need
> > > to cover the basics of how interposer is expected to be paired with
> > > an "original" device.
> > 
> > Thank you Mike for your patience. I really appreciate it.
> > I really may not understand something. Let me get this straight.
> > 
> > > 
> > > Those "original" device are already active and potentially in use
> > > right?  They may be either request-based blk-mq _or_ bio-based.
> > 
> > Yes. Exactly.
> > 
> > > 
> > > So what confuses me is that you're making assertions about how actively
> > > used bio-based DM devices aren't in use until the interposed device
> > > create happens... this is all getting very muddled.
> > 
> > The original device is indeed already actively used and already mounted.
> > This is most likely not a DM device.
> > If it is a request-based blk-mq, then it is enough to stop its queue by
> > blk_mq_freeze_queue(). 
> > If it is a bio-based device, then we can try to stop it by freeze_bdev.
> > But in both cases, if the blk_mq_freeze_bdev() function was called, bio cannot
> > get into the critical section between bio_queue_enter() and blk_queue_exit().
> > This allows to safely change the value of original->bd_interposer.
> 
> Even though bios cannot get into underlying blk-mq they are already
> inflight on behalf of the upper-layer bio-based device. I'll look closer
> at the code but it seems like there is potential for the original
> device's bios to still be queued to original, past the ->submit_bio
> entry, and waiting for blk-mq to unfreeze. Meaning upon return from
> what I _think_ you're saying will be sufficient: DM bio-based device
> will carry on submitting IO to the blk-mq device that has since been
> interposed.. that IO will _not_ complete in terms of the interposed
> device.. so you'll have a split-brain dual completion of IO from the
> original bio-based DM device _and_ the interposed device (for any new io
> that hits ->submit_bio after the interposed device is in place).
> 

Yes, You right. I looked closer at function submit_bio_noacct().
Indeed, the bio can wait to enter to the queue after checking that
the device has a interposer.
This means that some bio requests can go to the original device after
attaching the interposer. Conversely, bio requests can fall into
the function submit_bio_interposer() at a time when the interposer
has already been detached. In submit_bio_interposer() for this case,
there is a re-check that the interposer is there.

I don't see what kind of problems this can cause when attaching
the interposer, but detaching it bothers me.
I need to take a timeout and think it through.

> I think you need to have original bio-based DM suspend, interpose
> device, and then resume the original.  Anything entering original's
> ->submit_bio from that point will all get sent to interposed
> device. Right?

A small remark. The original device is not a DM device. The DM device
plays the role of the interposer. And it really needs to be suspended.

> 
> > To intercept requests to the original device, we create a new md with
> > the DM_INTERPOSE_FLAG flag. It is this interposer device that has not
> > yet been initialized by this time. It just runs DM_TABLE_LOAD_CMD.
> > That is why I think that the queue of this device should not be stopped,
> > since this device has not yet been initialized.
> > 
> > > 
> > > And your lack of understanding of these various IO flushing methods
> > > (freeze/thaw, suspend/resume, etc) is showing.  Please slow down and
> > > approach this more systematically.
> > 
> > For any block device, we can call the freeze_bdev() function. It will 
> > allow to wait until the processing of previously sent requests is 
> > completed and block the sending of new ones. blk_mq_freeze_queue() 
> > allows to change the bd_interposer variable. This allow to attach/detach 
> > the interposer to original device.
> 
> freeze_bdev/thaw_bdev are only relevant if a filesystem is layered
> ontop.  A bio-based DM device can be used directly (by a database or
> whatever).
> 
> > dm_suspend() is used to stop mapped device. This is what I plan to use
> > before detaching the interposer. It will allow to wait for the
> > completion of all the bios that were sent for the interposer.
> 
> Yes, but you need to suspend before attaching the interposer too, to
> flush any in-flight bios that might be in-flight within the various DM
> target code.
> 
> DM should be able to internalize all this when handling the
> DM_INTERPOSE_FLAG during the new table load.  It'd call into
> dm_internal_suspend_fast and then dm_internal_resume_fast for the
> original md device.

The dm_internal_suspend_fast() function looks very useful. I'll try it.

> 
> Mike
> 

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-03-18 14:57 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-12 15:44 [PATCH v7 0/3] block device interposer Sergei Shtepa
2021-03-12 15:44 ` [PATCH v7 1/3] block: add blk_mq_is_queue_frozen() Sergei Shtepa
2021-03-12 19:06   ` Mike Snitzer
2021-03-14  9:14     ` Christoph Hellwig
2021-03-15 12:06       ` Sergei Shtepa
2021-03-12 15:44 ` [PATCH v7 2/3] block: add bdev_interposer Sergei Shtepa
2021-03-14  9:28   ` Christoph Hellwig
2021-03-15 13:06     ` Sergei Shtepa
2021-03-16  8:09   ` Ming Lei
2021-03-16 16:35     ` Sergei Shtepa
2021-03-17  3:03       ` Ming Lei
2021-03-17 12:22         ` Sergei Shtepa
2021-03-17 15:04           ` Mike Snitzer
2021-03-17 18:14             ` Sergei Shtepa
2021-03-17 19:13               ` Mike Snitzer
2021-03-18 14:56                 ` Sergei Shtepa
2021-03-17 14:58         ` Mike Snitzer
2021-03-12 15:44 ` [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG Sergei Shtepa
2021-03-12 19:00   ` Mike Snitzer
2021-03-15 12:29     ` Sergei Shtepa
2021-03-14  9:30   ` Christoph Hellwig
2021-03-15 13:25     ` Sergei Shtepa
2021-03-16 15:23       ` Christoph Hellwig
2021-03-16 15:25         ` Christoph Hellwig
2021-03-16 16:20           ` Sergei Shtepa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).