LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()
@ 2021-08-24 10:59 Stefan Hajnoczi
  2021-08-24 10:59 ` [RFC PATCH 1/1] fuse: disable local irqs when processing vq completions Stefan Hajnoczi
  2021-08-24 11:31 ` [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Michael S. Tsirkin
  0 siblings, 2 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 10:59 UTC (permalink / raw)
  To: virtualization
  Cc: Gerd Hoffmann, linux-kernel, David Airlie, vgoyal, jasowang,
	Michael S. Tsirkin, Stefan Hajnoczi

While investigating an unhandled irq from vring_interrupt() with virtiofs I\r
stumbled onto a possible race that also affects virtio_gpu. This theory is\r
based on code inspection and hopefully you can point out something that makes\r
this a non-issue in practice :).\r
\r
The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is\r
taken with no used (completed) buffers in the virtqueue. The kernel disables\r
the irq and the driver is no longer receives irqs when this happens:\r
\r
  irq 77: nobody cared (try booting with the "irqpoll" option)\r
  ...\r
  handlers:\r
  [<00000000a40a49bb>] vring_interrupt\r
  Disabling IRQ #77\r
\r
Consider the following:\r
\r
1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work\r
   function is scheduled to dequeue used buffers:\r
   vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()\r
\r
2. The device adds more used requests and just before...\r
\r
3. ...virtio_fs_requests_done_work() empties the virtqueue with\r
   virtqueue_get_buf().\r
\r
4. The device raises the irq and vring_interrupt() is called after\r
   virtio_fs_requests_done_work emptied the virtqueue:\r
\r
   irqreturn_t vring_interrupt(int irq, void *_vq)\r
   {\r
       struct vring_virtqueue *vq = to_vvq(_vq);\r
\r
       if (!more_used(vq)) {\r
           pr_debug("virtqueue interrupt with no work for %p\n", vq);\r
           return IRQ_NONE;\r
           ^^^^^^^^^^^^^^^^\r
\r
I have included a patch that switches virtiofs from spin_lock() to\r
spin_lock_irqsave() to prevent vring_interrupt() from running while the\r
virtqueue is processed from a work function.\r
\r
virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and\r
virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().\r
I think this can result in the same false unhandled irq problem as virtiofs.\r
\r
This race condition could in theory affect all drivers. The VIRTIO\r
specification says:\r
\r
  Neither of these notification suppression methods are reliable, as they are\r
  not synchronized with the device, but they serve as useful optimizations.\r
\r
If virtqueue_disable_cb() is just a hint and might not disable virtqueue irqs\r
then virtio_net and other drivers have a problem because because an irq could\r
be raised while the driver is dequeuing used buffers. I think we haven't seen\r
this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware\r
devices might cache the value and not disable notifications for some time...\r
\r
Have I missed something?\r
\r
The virtiofs patch I attached is being stress tested to see if the unhandled\r
irqs still occur.\r
\r
Stefan Hajnoczi (1):\r
  fuse: disable local irqs when processing vq completions\r
\r
 fs/fuse/virtio_fs.c | 15 ++++++++++-----\r
 1 file changed, 10 insertions(+), 5 deletions(-)\r
\r
-- \r
2.31.1\r
\r


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 1/1] fuse: disable local irqs when processing vq completions
  2021-08-24 10:59 [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Stefan Hajnoczi
@ 2021-08-24 10:59 ` Stefan Hajnoczi
  2021-08-24 11:31 ` [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Michael S. Tsirkin
  1 sibling, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 10:59 UTC (permalink / raw)
  To: virtualization
  Cc: Gerd Hoffmann, linux-kernel, David Airlie, vgoyal, jasowang,
	Michael S. Tsirkin, Stefan Hajnoczi, Xiaoling Gao

The virtqueue completion handler function runs on a work queue and local
irqs are still enabled. There is a race where the completion handler
function grabs the next completed request just before vring_interrupt()
runs. vring_interrupt() sees an empty virtqueue and returns IRQ_NONE,
falsely declaring this interrupt unhandled.

The unhandled irq causes the kernel to disable the irq:

  irq 77: nobody cared (try booting with the "irqpoll" option)
  ...
  handlers:
  [<00000000d33eeed7>] vring_interrupt
  Disabling IRQ #77

The driver hangs afterwards since virtqueue irqs are now ignored.

Disable local irqs before calling virtqueue_get_buf() and re-enable them
afterwards so that vring_interrupt() doesn't run during the race window.

Reported-by: Xiaoling Gao <xiagao@redhat.com>
Cc: Michael Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
I'm not 100% convinced this fixes everything because vring_interrupt()
can still run after our critical section and find the virtqueue empty.
virtqueue_disable_cb() should minimize that but it's only a hint and
there is a small window when the race condition can happen before it's
called.
---
 fs/fuse/virtio_fs.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8f52cdaa8445..57e1f264b0a8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -319,9 +319,10 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
 	struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
 						 done_work);
 	struct virtqueue *vq = fsvq->vq;
+	unsigned long flags;
 
 	/* Free completed FUSE_FORGET requests */
-	spin_lock(&fsvq->lock);
+	spin_lock_irqsave(&fsvq->lock, flags);
 	do {
 		unsigned int len;
 		void *req;
@@ -333,7 +334,7 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
 			dec_in_flight_req(fsvq);
 		}
 	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
-	spin_unlock(&fsvq->lock);
+	spin_unlock_irqrestore(&fsvq->lock, flags);
 }
 
 static void virtio_fs_request_dispatch_work(struct work_struct *work)
@@ -601,11 +602,15 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
 	struct virtqueue *vq = fsvq->vq;
 	struct fuse_req *req;
 	struct fuse_req *next;
+	unsigned long flags;
 	unsigned int len;
 	LIST_HEAD(reqs);
 
-	/* Collect completed requests off the virtqueue */
-	spin_lock(&fsvq->lock);
+	/*
+	 * Collect completed requests off the virtqueue with irqs disabled to
+	 * prevent races with vring_interrupt().
+	 */
+	spin_lock_irqsave(&fsvq->lock, flags);
 	do {
 		virtqueue_disable_cb(vq);
 
@@ -615,7 +620,7 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
 			spin_unlock(&fpq->lock);
 		}
 	} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
-	spin_unlock(&fsvq->lock);
+	spin_unlock_irqrestore(&fsvq->lock, flags);
 
 	/* End requests */
 	list_for_each_entry_safe(req, next, &reqs, list) {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()
  2021-08-24 10:59 [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Stefan Hajnoczi
  2021-08-24 10:59 ` [RFC PATCH 1/1] fuse: disable local irqs when processing vq completions Stefan Hajnoczi
@ 2021-08-24 11:31 ` Michael S. Tsirkin
  2021-08-24 16:36   ` Stefan Hajnoczi
                     ` (2 more replies)
  1 sibling, 3 replies; 6+ messages in thread
From: Michael S. Tsirkin @ 2021-08-24 11:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtualization, Gerd Hoffmann, linux-kernel, David Airlie,
	vgoyal, jasowang

On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> While investigating an unhandled irq from vring_interrupt() with virtiofs I
> stumbled onto a possible race that also affects virtio_gpu. This theory is
> based on code inspection and hopefully you can point out something that makes
> this a non-issue in practice :).
> 
> The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> taken with no used (completed) buffers in the virtqueue. The kernel disables
> the irq and the driver is no longer receives irqs when this happens:
> 
>   irq 77: nobody cared (try booting with the "irqpoll" option)
>   ...
>   handlers:
>   [<00000000a40a49bb>] vring_interrupt
>   Disabling IRQ #77
> 
> Consider the following:
> 
> 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
>    function is scheduled to dequeue used buffers:
>    vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> 
> 2. The device adds more used requests and just before...
> 
> 3. ...virtio_fs_requests_done_work() empties the virtqueue with
>    virtqueue_get_buf().
> 
> 4. The device raises the irq and vring_interrupt() is called after
>    virtio_fs_requests_done_work emptied the virtqueue:
> 
>    irqreturn_t vring_interrupt(int irq, void *_vq)
>    {
>        struct vring_virtqueue *vq = to_vvq(_vq);
> 
>        if (!more_used(vq)) {
>            pr_debug("virtqueue interrupt with no work for %p\n", vq);
>            return IRQ_NONE;
>            ^^^^^^^^^^^^^^^^
> 
> I have included a patch that switches virtiofs from spin_lock() to
> spin_lock_irqsave() to prevent vring_interrupt() from running while the
> virtqueue is processed from a work function.
> 
> virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> I think this can result in the same false unhandled irq problem as virtiofs.
> 
> This race condition could in theory affect all drivers. The VIRTIO
> specification says:
> 
>   Neither of these notification suppression methods are reliable, as they are
>   not synchronized with the device, but they serve as useful optimizations.
> 
> If virtqueue_disable_cb() is just a hint and might not disable virtqueue irqs
> then virtio_net and other drivers have a problem because because an irq could
> be raised while the driver is dequeuing used buffers. I think we haven't seen
> this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> devices might cache the value and not disable notifications for some time...
> 
> Have I missed something?
> 
> The virtiofs patch I attached is being stress tested to see if the unhandled
> irqs still occur.
> 
> Stefan Hajnoczi (1):
>   fuse: disable local irqs when processing vq completions
> 
>  fs/fuse/virtio_fs.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)

Fundamentally it is not a problem to have an unhandled IRQ
once in a while. It's only a problem if this happens time
after time.


Does the kernel you are testing include
commit 8d622d21d24803408b256d96463eac4574dcf067
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Tue Apr 13 01:19:16 2021 -0400

    virtio: fix up virtio_disable_cb

?

If not it's worth checking whether the latest kernel still
has the issue.

Note cherry picking that commit isn't trivial there are
a bunch of dependencies.

If you want to try on an old kernel, disable event index.

> -- 
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()
  2021-08-24 11:31 ` [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Michael S. Tsirkin
@ 2021-08-24 16:36   ` Stefan Hajnoczi
  2021-08-26 15:42   ` Stefan Hajnoczi
  2021-09-06 15:18   ` Stefan Hajnoczi
  2 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2021-08-24 16:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, Gerd Hoffmann, linux-kernel, David Airlie,
	vgoyal, jasowang

[-- Attachment #1: Type: text/plain, Size: 4269 bytes --]

On Tue, Aug 24, 2021 at 07:31:29AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> > While investigating an unhandled irq from vring_interrupt() with virtiofs I
> > stumbled onto a possible race that also affects virtio_gpu. This theory is
> > based on code inspection and hopefully you can point out something that makes
> > this a non-issue in practice :).
> > 
> > The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> > taken with no used (completed) buffers in the virtqueue. The kernel disables
> > the irq and the driver is no longer receives irqs when this happens:
> > 
> >   irq 77: nobody cared (try booting with the "irqpoll" option)
> >   ...
> >   handlers:
> >   [<00000000a40a49bb>] vring_interrupt
> >   Disabling IRQ #77
> > 
> > Consider the following:
> > 
> > 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
> >    function is scheduled to dequeue used buffers:
> >    vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> > 
> > 2. The device adds more used requests and just before...
> > 
> > 3. ...virtio_fs_requests_done_work() empties the virtqueue with
> >    virtqueue_get_buf().
> > 
> > 4. The device raises the irq and vring_interrupt() is called after
> >    virtio_fs_requests_done_work emptied the virtqueue:
> > 
> >    irqreturn_t vring_interrupt(int irq, void *_vq)
> >    {
> >        struct vring_virtqueue *vq = to_vvq(_vq);
> > 
> >        if (!more_used(vq)) {
> >            pr_debug("virtqueue interrupt with no work for %p\n", vq);
> >            return IRQ_NONE;
> >            ^^^^^^^^^^^^^^^^
> > 
> > I have included a patch that switches virtiofs from spin_lock() to
> > spin_lock_irqsave() to prevent vring_interrupt() from running while the
> > virtqueue is processed from a work function.
> > 
> > virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> > virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> > I think this can result in the same false unhandled irq problem as virtiofs.
> > 
> > This race condition could in theory affect all drivers. The VIRTIO
> > specification says:
> > 
> >   Neither of these notification suppression methods are reliable, as they are
> >   not synchronized with the device, but they serve as useful optimizations.
> > 
> > If virtqueue_disable_cb() is just a hint and might not disable virtqueue irqs
> > then virtio_net and other drivers have a problem because because an irq could
> > be raised while the driver is dequeuing used buffers. I think we haven't seen
> > this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> > devices might cache the value and not disable notifications for some time...
> > 
> > Have I missed something?
> > 
> > The virtiofs patch I attached is being stress tested to see if the unhandled
> > irqs still occur.
> > 
> > Stefan Hajnoczi (1):
> >   fuse: disable local irqs when processing vq completions
> > 
> >  fs/fuse/virtio_fs.c | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> Fundamentally it is not a problem to have an unhandled IRQ
> once in a while. It's only a problem if this happens time
> after time.
> 
> 
> Does the kernel you are testing include
> commit 8d622d21d24803408b256d96463eac4574dcf067
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date:   Tue Apr 13 01:19:16 2021 -0400
> 
>     virtio: fix up virtio_disable_cb
> 
> ?
> 
> If not it's worth checking whether the latest kernel still
> has the issue.
> 
> Note cherry picking that commit isn't trivial there are
> a bunch of dependencies.
> 
> If you want to try on an old kernel, disable event index.

Thanks for pointing out this commit. My kernel does not have it. The
device is using the split vring layout (probably with EVENT_IDX) so
virtqueue_disable_cb() has no effect.

kernel/irq/spurious.c:note_interrupt() disables the irq after 99.9k
unhandled irqs. It's surprising that virtiofs is hitting this limit
through the scenario I described, but maybe.

I'll try different kernel versions and/or disable EVENT_IDX as you
suggested. Thanks!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()
  2021-08-24 11:31 ` [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Michael S. Tsirkin
  2021-08-24 16:36   ` Stefan Hajnoczi
@ 2021-08-26 15:42   ` Stefan Hajnoczi
  2021-09-06 15:18   ` Stefan Hajnoczi
  2 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2021-08-26 15:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, Gerd Hoffmann, linux-kernel, David Airlie,
	vgoyal, jasowang

[-- Attachment #1: Type: text/plain, Size: 4059 bytes --]

On Tue, Aug 24, 2021 at 07:31:29AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> > While investigating an unhandled irq from vring_interrupt() with virtiofs I
> > stumbled onto a possible race that also affects virtio_gpu. This theory is
> > based on code inspection and hopefully you can point out something that makes
> > this a non-issue in practice :).
> > 
> > The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> > taken with no used (completed) buffers in the virtqueue. The kernel disables
> > the irq and the driver is no longer receives irqs when this happens:
> > 
> >   irq 77: nobody cared (try booting with the "irqpoll" option)
> >   ...
> >   handlers:
> >   [<00000000a40a49bb>] vring_interrupt
> >   Disabling IRQ #77
> > 
> > Consider the following:
> > 
> > 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
> >    function is scheduled to dequeue used buffers:
> >    vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> > 
> > 2. The device adds more used requests and just before...
> > 
> > 3. ...virtio_fs_requests_done_work() empties the virtqueue with
> >    virtqueue_get_buf().
> > 
> > 4. The device raises the irq and vring_interrupt() is called after
> >    virtio_fs_requests_done_work emptied the virtqueue:
> > 
> >    irqreturn_t vring_interrupt(int irq, void *_vq)
> >    {
> >        struct vring_virtqueue *vq = to_vvq(_vq);
> > 
> >        if (!more_used(vq)) {
> >            pr_debug("virtqueue interrupt with no work for %p\n", vq);
> >            return IRQ_NONE;
> >            ^^^^^^^^^^^^^^^^
> > 
> > I have included a patch that switches virtiofs from spin_lock() to
> > spin_lock_irqsave() to prevent vring_interrupt() from running while the
> > virtqueue is processed from a work function.
> > 
> > virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> > virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> > I think this can result in the same false unhandled irq problem as virtiofs.
> > 
> > This race condition could in theory affect all drivers. The VIRTIO
> > specification says:
> > 
> >   Neither of these notification suppression methods are reliable, as they are
> >   not synchronized with the device, but they serve as useful optimizations.
> > 
> > If virtqueue_disable_cb() is just a hint and might not disable virtqueue irqs
> > then virtio_net and other drivers have a problem because because an irq could
> > be raised while the driver is dequeuing used buffers. I think we haven't seen
> > this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> > devices might cache the value and not disable notifications for some time...
> > 
> > Have I missed something?
> > 
> > The virtiofs patch I attached is being stress tested to see if the unhandled
> > irqs still occur.
> > 
> > Stefan Hajnoczi (1):
> >   fuse: disable local irqs when processing vq completions
> > 
> >  fs/fuse/virtio_fs.c | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> Fundamentally it is not a problem to have an unhandled IRQ
> once in a while. It's only a problem if this happens time
> after time.

That virtiofs patch in this series passed the stress test. We must have
been getting too many unhandled interrupts due to processing the
virtqueue from a work function that's not synchronized with
vring_interrupt() irq handler.

> Does the kernel you are testing include
> commit 8d622d21d24803408b256d96463eac4574dcf067
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date:   Tue Apr 13 01:19:16 2021 -0400
> 
>     virtio: fix up virtio_disable_cb
> 
> ?
> 
> If not it's worth checking whether the latest kernel still
> has the issue.
> 
> Note cherry picking that commit isn't trivial there are
> a bunch of dependencies.

We'll test a recent kernel with just your patch to see if it solves the
issue.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt()
  2021-08-24 11:31 ` [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Michael S. Tsirkin
  2021-08-24 16:36   ` Stefan Hajnoczi
  2021-08-26 15:42   ` Stefan Hajnoczi
@ 2021-09-06 15:18   ` Stefan Hajnoczi
  2 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2021-09-06 15:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtualization, Gerd Hoffmann, linux-kernel, David Airlie,
	vgoyal, jasowang

[-- Attachment #1: Type: text/plain, Size: 3785 bytes --]

On Tue, Aug 24, 2021 at 07:31:29AM -0400, Michael S. Tsirkin wrote:
> On Tue, Aug 24, 2021 at 11:59:43AM +0100, Stefan Hajnoczi wrote:
> > While investigating an unhandled irq from vring_interrupt() with virtiofs I
> > stumbled onto a possible race that also affects virtio_gpu. This theory is
> > based on code inspection and hopefully you can point out something that makes
> > this a non-issue in practice :).
> > 
> > The vring_interrupt() function returns IRQ_NONE when an MSI-X interrupt is
> > taken with no used (completed) buffers in the virtqueue. The kernel disables
> > the irq and the driver is no longer receives irqs when this happens:
> > 
> >   irq 77: nobody cared (try booting with the "irqpoll" option)
> >   ...
> >   handlers:
> >   [<00000000a40a49bb>] vring_interrupt
> >   Disabling IRQ #77
> > 
> > Consider the following:
> > 
> > 1. An virtiofs irq is handled and the virtio_fs_requests_done_work() work
> >    function is scheduled to dequeue used buffers:
> >    vring_interrupt() -> virtio_fs_vq_done() -> schedule_work()
> > 
> > 2. The device adds more used requests and just before...
> > 
> > 3. ...virtio_fs_requests_done_work() empties the virtqueue with
> >    virtqueue_get_buf().
> > 
> > 4. The device raises the irq and vring_interrupt() is called after
> >    virtio_fs_requests_done_work emptied the virtqueue:
> > 
> >    irqreturn_t vring_interrupt(int irq, void *_vq)
> >    {
> >        struct vring_virtqueue *vq = to_vvq(_vq);
> > 
> >        if (!more_used(vq)) {
> >            pr_debug("virtqueue interrupt with no work for %p\n", vq);
> >            return IRQ_NONE;
> >            ^^^^^^^^^^^^^^^^
> > 
> > I have included a patch that switches virtiofs from spin_lock() to
> > spin_lock_irqsave() to prevent vring_interrupt() from running while the
> > virtqueue is processed from a work function.
> > 
> > virtio_gpu has a similar case where virtio_gpu_dequeue_ctrl_func() and
> > virtio_gpu_dequeue_cursor_func() work functions only use spin_lock().
> > I think this can result in the same false unhandled irq problem as virtiofs.
> > 
> > This race condition could in theory affect all drivers. The VIRTIO
> > specification says:
> > 
> >   Neither of these notification suppression methods are reliable, as they are
> >   not synchronized with the device, but they serve as useful optimizations.
> > 
> > If virtqueue_disable_cb() is just a hint and might not disable virtqueue irqs
> > then virtio_net and other drivers have a problem because because an irq could
> > be raised while the driver is dequeuing used buffers. I think we haven't seen
> > this because software VIRTIO devices honor virtqueue_disable_cb(). Hardware
> > devices might cache the value and not disable notifications for some time...
> > 
> > Have I missed something?
> > 
> > The virtiofs patch I attached is being stress tested to see if the unhandled
> > irqs still occur.
> > 
> > Stefan Hajnoczi (1):
> >   fuse: disable local irqs when processing vq completions
> > 
> >  fs/fuse/virtio_fs.c | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> Fundamentally it is not a problem to have an unhandled IRQ
> once in a while. It's only a problem if this happens time
> after time.
> 
> 
> Does the kernel you are testing include
> commit 8d622d21d24803408b256d96463eac4574dcf067
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date:   Tue Apr 13 01:19:16 2021 -0400
> 
>     virtio: fix up virtio_disable_cb
> 
> ?
> 
> If not it's worth checking whether the latest kernel still
> has the issue.

A new kernel with your patch doesn't have this issue.

Please disregard the patch I posted, your patch seems to be enough.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-09-06 15:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24 10:59 [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Stefan Hajnoczi
2021-08-24 10:59 ` [RFC PATCH 1/1] fuse: disable local irqs when processing vq completions Stefan Hajnoczi
2021-08-24 11:31 ` [RFC PATCH 0/1] virtio: false unhandled irqs from vring_interrupt() Michael S. Tsirkin
2021-08-24 16:36   ` Stefan Hajnoczi
2021-08-26 15:42   ` Stefan Hajnoczi
2021-09-06 15:18   ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).