LKML Archive on lore.kernel.org
From: Matteo Croce <mcroce@linux.microsoft.com>
To: linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	"Lennart Poettering" <lennart@poettering.net>,
	"Luca Boccassi" <bluca@debian.org>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Damien Le Moal" <damien.lemoal@wdc.com>,
	"Tejun Heo" <tj@kernel.org>,
	"Javier González" <javier@javigon.com>,
	"Niklas Cassel" <niklas.cassel@wdc.com>,
	"Johannes Thumshirn" <johannes.thumshirn@wdc.com>,
	"Hannes Reinecke" <hare@suse.de>,
	"Matthew Wilcox" <willy@infradead.org>,
	JeffleXu <jefflexu@linux.alibaba.com>
Subject: [PATCH v5 0/5] block: add a sequence number to disks
Date: Tue, 13 Jul 2021 01:05:24 +0200
Message-ID: <20210712230530.29323-1-mcroce@linux.microsoft.com>

From: Matteo Croce <mcroce@microsoft.com>

Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow or overloaded systems it
has very high latency. Block devices do not have exclusive owners in
userspace; any process can set one up (e.g. loop devices). Moreover, device
names can be reused (e.g. loop0 can be reused again and again). A userspace
process that sets up a block device and watches for its events thus cannot
reliably tell whether an event relates to the device it just set up or to
an earlier instance with the same name.

Being able to set a UUID on a loop device would solve the race conditions,
but it would not allow deriving an ordering from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for, you
cannot tell whether the right uevent has not arrived yet or was already
sent and you missed it. So you cannot tell whether you should wait for it
or not.

Being able to set devices up in a namespace would solve the race conditions
too, but it can work only if being namespaced is feasible in the first
place. Many userspace processes need to set devices up for the root
namespace, so this solution cannot always work.

Changing the loop device naming implementation to always use
monotonically increasing device numbers, instead of reusing the lowest
free number, would also solve the problem, but it would be very disruptive
to userspace and would likely break many existing use cases. It would also
be quite awkward on long-running machines, as loop device names would
quickly grow to many digits.

Furthermore, this problem does not affect only loop devices - partition
probing is asynchronous and very slow on busy systems. It is very easy to
enter races when using LO_FLAGS_PARTSCAN and watching for the partitions to
show up, as it can take a long time for the uevents to be delivered after
setting them up.

Associating a unique, monotonically increasing sequence number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately after setting the device up, makes it possible to solve the
race conditions with uevents. It also lets userspace processes know whether
they should keep waiting for the uevent they need or whether it was dropped
and they should move on.
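
As a rough illustration of how a userspace consumer might use this, the
sketch below reads the sequence number right after setup and classifies
later uevents for the same device name against it. It is not taken from
the patches. The ioctl name and number (BLKGETDISKSEQ, _IOR(0x12, 128,
__u64)) are assumed from the mainline uapi header rather than from this
cover letter, the encoding assumes the asm-generic ioctl layout (e.g. x86,
arm64), and read_diskseq() and classify_event() are hypothetical helpers.

```python
import enum
import fcntl
import os
import struct

# Assumed value: _IOR(0x12, 128, __u64) with the asm-generic ioctl encoding
# (direction in bits 30-31, size in bits 16-29, type in bits 8-15, nr in 0-7).
_IOC_READ = 2
BLKGETDISKSEQ = (_IOC_READ << 30) | (struct.calcsize("=Q") << 16) | (0x12 << 8) | 128


def read_diskseq(path):
    """Return the current sequence number of the block device at `path`."""
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        # Passing an immutable 8-byte buffer makes ioctl() return the
        # filled-in bytes instead of an integer status.
        raw = fcntl.ioctl(fd, BLKGETDISKSEQ, bytes(8))
    finally:
        os.close(fd)
    return struct.unpack("=Q", raw)[0]


class Verdict(enum.Enum):
    STALE = "stale"    # earlier instance of the name: ignore, keep waiting
    MATCH = "match"    # the instance we set up
    MISSED = "missed"  # later instance: our event will not come, move on


def classify_event(own_seq, event_seq):
    """Compare a uevent's sequence number with the one read after setup.

    Only meaningful for events that carry the same device name, since the
    counter is shared by all disks.
    """
    if event_seq < own_seq:
        return Verdict.STALE
    if event_seq == own_seq:
        return Verdict.MATCH
    return Verdict.MISSED
```

A caller would typically run read_diskseq("/dev/loop0") once after
attaching, then feed each incoming uevent for loop0 through
classify_event() and stop waiting on either MATCH or MISSED.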

This benefits not only loop devices and block devices with multiple
partitions, but also, for example, removable media such as USB sticks or
cdroms/dvdroms/etc.

The first patch is the core one, patches 2..4 expose the information in
different ways (uevents, an ioctl and sysfs), patch 5 adds a helper to
raise a media changed event, and the last one makes the loop device
generate a media changed event upon attach, detach or reconfigure, so that
the sequence number is increased.
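
For the uevent and sysfs paths, a minimal sketch of the consumer side
could look like the following. The property name DISKSEQ and the attribute
path /sys/block/<disk>/diskseq are assumptions based on the mainline
kernel, not on text in this cover letter. The parser assumes raw
kernel-format uevents ("action@devpath" followed by NUL-separated
KEY=value fields), not messages re-broadcast by udev. All function names
are hypothetical.

```python
from pathlib import Path


def parse_uevent(payload):
    """Split a raw kernel uevent into a dict of its KEY=value properties."""
    props = {}
    # The first NUL-terminated field is the "action@devpath" header.
    for field in payload.split(b"\0")[1:]:
        key, sep, value = field.partition(b"=")
        if sep:  # skip empty trailing fields
            props[key.decode()] = value.decode()
    return props


def event_diskseq(payload):
    """Return the event's DISKSEQ as an int, or None if it carries none."""
    value = parse_uevent(payload).get("DISKSEQ")
    return int(value) if value is not None else None


def sysfs_diskseq(disk, sysfs_root="/sys"):
    """Read the sequence number of a whole disk (e.g. "loop0") from sysfs."""
    return int(Path(sysfs_root, "block", disk, "diskseq").read_text())
```

The sysfs_root parameter exists only so the reader can be pointed at a
test directory; real callers would leave it at its default.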

If merged, this feature will immediately be used by userspace:
https://github.com/systemd/systemd/issues/17469#issuecomment-762919781

v4 -> v5:
- introduce a helper to raise media changed events
- use the new helper in loop instead of the full event code
- unexport inc_diskseq() which is only used by the block code now
- rebase on top of 5.14-rc1

v3 -> v4:
- rebased on top of 5.13
- hook the seqnum increase into the media change event
- make the loop device raise media change events
- merge 1/6 and 5/6
- move the uevent part of 1/6 into a separate one
- drop the now unneeded sysfs refactor
- change 'diskseq' to a global static variable
- add more comments
- refactor commit messages

v2 -> v3:
- rebased on top of 5.13-rc7
- resend because it appeared archived on patchwork

v1 -> v2:
- increase seqnum on media change
- increase on loop detach

Matteo Croce (6):
  block: add disk sequence number
  block: export the diskseq in uevents
  block: add ioctl to read the disk sequence number
  block: export diskseq in sysfs
  block: add a helper to raise a media changed event
  loop: raise media_change event

 Documentation/ABI/testing/sysfs-block | 12 ++++++
 block/disk-events.c                   | 62 +++++++++++++++++++++------
 block/genhd.c                         | 43 +++++++++++++++++++
 block/ioctl.c                         |  2 +
 drivers/block/loop.c                  |  5 +++
 include/linux/genhd.h                 |  3 ++
 include/uapi/linux/fs.h               |  1 +
 7 files changed, 114 insertions(+), 14 deletions(-)

-- 
2.31.1

