LKML Archive on lore.kernel.org
From: Luca Boccassi <bluca@debian.org>
To: Matteo Croce <mcroce@linux.microsoft.com>,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig <hch@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	"Lennart Poettering" <lennart@poettering.net>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Damien Le Moal" <damien.lemoal@wdc.com>,
	"Tejun Heo" <tj@kernel.org>,
	"Javier González" <javier@javigon.com>,
	"Niklas Cassel" <niklas.cassel@wdc.com>,
	"Johannes Thumshirn" <johannes.thumshirn@wdc.com>,
	"Hannes Reinecke" <hare@suse.de>,
	"Matthew Wilcox" <willy@infradead.org>,
	JeffleXu <jefflexu@linux.alibaba.com>
Subject: Re: [PATCH v5 0/5] block: add a sequence number to disks
Date: Tue, 20 Jul 2021 18:27:19 +0100
Message-ID: <3ca56654449b53814a22e3f06179292bc959ae72.camel@debian.org>
In-Reply-To: <20210712230530.29323-1-mcroce@linux.microsoft.com>

On Tue, 2021-07-13 at 01:05 +0200, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Associating uevents with block devices in userspace is difficult and racy:
> the uevent netlink socket is lossy, and on slow and overloaded systems has
> a very high latency. Block devices do not have exclusive owners in
> userspace: any process can set one up (e.g. loop devices). Moreover, device
> names can be reused (e.g. loop0 can be reused again and again). A userspace
> process setting up a block device and watching for its events thus cannot
> reliably tell whether an event relates to the device it just set up or
> another earlier instance with the same name.
> 
> Being able to set a UUID on a loop device would solve the race conditions.
> But it does not allow deriving an ordering from uevents: if you see a uevent
> with a UUID that does not match the device you are waiting for, you cannot
> tell whether it's because the right uevent has not arrived yet, or it was
> already sent and you missed it. So you cannot tell whether you should wait
> for it or not.
> 
> Being able to set devices up in a namespace would solve the race conditions
> too, but it can work only if being namespaced is feasible in the first
> place. Many userspace processes need to set devices up for the root
> namespace, so this solution cannot always work.
> 
> Changing the loop devices naming implementation to always use
> monotonically increasing device numbers, instead of reusing the lowest
> free number, would also solve the problem, but it would be very disruptive
> to userspace and likely break many existing use cases. It would also be
> quite awkward to use on long-running machines, as the loop device name
> would quickly grow to many-digits length.
> 
> Furthermore, this problem does not affect only loop devices - partition
> probing is asynchronous and very slow on busy systems. It is very easy to
> enter races when using LO_FLAGS_PARTSCAN and watching for the partitions to
> show up, as it can take a long time for the uevents to be delivered after
> setting them up.
> 
> Associating a unique, monotonically increasing sequence number with the
> lifetime of each block device, which can be retrieved with an ioctl
> immediately upon setting it up, solves the race conditions with uevents,
> and also lets userspace processes know whether they should wait for the
> uevent they need or whether it was dropped and they should move on.
> 
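
To make the wait-or-move-on decision described above concrete, here is a
minimal Python sketch. It is not taken from the series or from systemd:
the function and enum names are hypothetical, the uevent property name
DISKSEQ is an assumption about what patch 2/6 exports, and the caller is
assumed to have already filtered uevents down to the device name it
cares about.

```python
from enum import Enum
from typing import Mapping


class Verdict(Enum):
    MATCH = "match"      # uevent belongs to the device instance we set up
    STALE = "stale"      # uevent is from an earlier instance: keep waiting
    MISSED = "missed"    # a newer instance exists: our uevent was lost, move on
    UNKNOWN = "unknown"  # uevent carries no sequence number


def classify_uevent(expected_diskseq: int, props: Mapping[str, str]) -> Verdict:
    """Compare the sequence number read right after setup with a uevent's.

    `props` holds the KEY=VALUE properties of one uevent for our device
    name. "DISKSEQ" is the assumed name of the exported property.
    """
    raw = props.get("DISKSEQ")
    if raw is None:
        return Verdict.UNKNOWN
    seen = int(raw)
    if seen == expected_diskseq:
        return Verdict.MATCH
    if seen < expected_diskseq:
        # Sequence numbers only grow, so this event predates our setup.
        return Verdict.STALE
    # The name has been reused since our setup; our own event will not come.
    return Verdict.MISSED
```

A caller would keep reading from the uevent socket while the verdict is
STALE or UNKNOWN, and stop on MATCH or MISSED.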
> This benefits not only loop devices and block devices with multiple
> partitions, but also, for example, removable media such as USB sticks or
> CD-ROMs/DVD-ROMs.
> 
> The first patch is the core one, patches 2..4 expose the information in
> different ways, patch 5 adds a helper to raise media changed events, and
> the last one makes the loop device generate a media changed event upon
> attach, detach or reconfigure, so the sequence number is increased.
> 
> If merged, this feature will immediately be used by userspace:
> https://github.com/systemd/systemd/issues/17469#issuecomment-762919781
> 
> v4 -> v5:
> - introduce a helper to raise media changed events
> - use the new helper in loop instead of the full event code
> - unexport inc_diskseq() which is only used by the block code now
> - rebase on top of 5.14-rc1
> 
> v3 -> v4:
> - rebased on top of 5.13
> - hook the seqnum increase into the media change event
> - make the loop device raise media change events
> - merge 1/6 and 5/6
> - move the uevent part of 1/6 into a separate one
> - drop the now unneeded sysfs refactor
> - change 'diskseq' to a global static variable
> - add more comments
> - refactor commit messages
> 
> v2 -> v3:
> - rebased on top of 5.13-rc7
> - resend because it appeared archived on patchwork
> 
> v1 -> v2:
> - increase seqnum on media change
> - increase on loop detach
> 
> Matteo Croce (6):
>   block: add disk sequence number
>   block: export the diskseq in uevents
>   block: add ioctl to read the disk sequence number
>   block: export diskseq in sysfs
>   block: add a helper to raise a media changed event
>   loop: raise media_change event
> 
>  Documentation/ABI/testing/sysfs-block | 12 ++++++
>  block/disk-events.c                   | 62 +++++++++++++++++++++------
>  block/genhd.c                         | 43 +++++++++++++++++++
>  block/ioctl.c                         |  2 +
>  drivers/block/loop.c                  |  5 +++
>  include/linux/genhd.h                 |  3 ++
>  include/uapi/linux/fs.h               |  1 +
>  7 files changed, 114 insertions(+), 14 deletions(-)

For the series:

Tested-by: Luca Boccassi <bluca@debian.org>

I have implemented the basic systemd support for this (ioctl + uevent;
sysfs will be done later), and tested it with this series on x86_64 with
a Debian 11 userspace. Everything seems to work great. Thanks Matteo!

Here's the implementation, in draft state until the kernel side is
merged:

https://github.com/systemd/systemd/pull/20257
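
For anyone who wants to poke at the ioctl without building systemd, a
rough Python sketch follows. It is not the code from the pull request
above. It assumes the ioctl added by patch 3/6 is BLKGETDISKSEQ, defined
as _IOR(0x12, 128, __u64), and that the generic Linux ioctl number
encoding applies (as on x86_64; a few architectures encode the direction
bits differently). The helper names are hypothetical.

```python
import fcntl
import os
import struct

# Assumed definition: _IOR(0x12, 128, __u64), generic ioctl encoding.
_IOC_READ = 2
BLKGETDISKSEQ = (
    (_IOC_READ << 30) | (struct.calcsize("=Q") << 16) | (0x12 << 8) | 128
)


def decode_diskseq(buf: bytes) -> int:
    """Decode the 8-byte, native-endian value the kernel writes back."""
    return struct.unpack("=Q", bytes(buf))[0]


def get_diskseq(path: str) -> int:
    """Return the disk sequence number of the block device at `path`.

    Raises OSError if the device cannot be opened or the running kernel
    does not support the ioctl.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        buf = bytearray(8)
        # With a mutable buffer, fcntl.ioctl fills it in place.
        fcntl.ioctl(fd, BLKGETDISKSEQ, buf)
        return decode_diskseq(buf)
    finally:
        os.close(fd)
```

Calling get_diskseq("/dev/loop0") right after attaching a backing file
should give the number to compare against later uevents for that device.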

-- 
Kind regards,
Luca Boccassi

Thread overview: 12+ messages
2021-07-12 23:05 Matteo Croce
2021-07-12 23:05 ` [PATCH v5 1/6] block: add disk sequence number Matteo Croce
2021-07-12 23:05 ` [PATCH v5 2/6] block: export the diskseq in uevents Matteo Croce
2021-07-12 23:05 ` [PATCH v5 3/6] block: add ioctl to read the disk sequence number Matteo Croce
2021-07-12 23:05 ` [PATCH v5 4/6] block: export diskseq in sysfs Matteo Croce
2021-07-12 23:05 ` [PATCH v5 5/6] block: add a helper to raise a media changed event Matteo Croce
2021-07-12 23:05 ` [PATCH v5 6/6] loop: raise media_change event Matteo Croce
2021-07-13  6:03   ` Christoph Hellwig
2021-07-20 17:27 ` Luca Boccassi [this message]
2021-07-22 11:41   ` [PATCH v5 0/5] block: add a sequence number to disks Matteo Croce
2021-07-28 19:01   ` Lennart Poettering
2021-07-28 19:22 ` Jens Axboe
