LKML Archive on lore.kernel.org
* Re: [PATCH 0/2] Fix hangup after creating checkpoint on Xen. [not found] <20110207.180718.885783824287845425.kaz@jp.fujitsu.com> @ 2011-02-07 9:35 ` Rafael J. Wysocki 2011-02-08 11:22 ` Ian Campbell 0 siblings, 1 reply; 14+ messages in thread From: Rafael J. Wysocki @ 2011-02-07 9:35 UTC (permalink / raw) To: SUZUKI, Kazuhiro; +Cc: linux-pm, xen-devel, Greg KH, LKML On Monday, February 07, 2011, SUZUKI, Kazuhiro wrote: > Hi, Hi, > The following patch series fixes hangup after creating checkpoint on > Xen. The Linux Xen guest can be saved the state to restore later, and > also created snapshot like checkpoint via the hypervisor. > But, when the snapshot is created for the PV guest, it will hangup. > > We added 'PMSG_CANCEL' message and 'cancel' handler in dev_pm_ops > struct in the pm-linux part. Please don't do that, unless you can convince me there's no other way to fix the problem you're trying to address. In my opinion it's highly unrealistic to assume that device drivers (or even subsystems) will implement the ->cancel() callback just for the benefit of Xen. And if the only subsystem that needs to implement ->cancel() is Xen, then the issue should be addressed without modifying the device core code, in a different way. > In creating checkpoint mode, the resume handler of xenbus should not > be called. In this case, it is recognized that the suspend was canceled > in drivers/xen/manage.c and call dpm_resume_end() with PMSG_CANCEL. > If the 'cancel' handler is defined, it is called instead of resume(). > > [1/2] - Fix hangup after creating checkpoint on Xen -- pm-linux part. > [2/2] - Fix hangup after creating checkpoint on Xen -- Xen part. Thanks, Rafael ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-07 9:35 ` [PATCH 0/2] Fix hangup after creating checkpoint on Xen Rafael J. Wysocki @ 2011-02-08 11:22 ` Ian Campbell 2011-02-08 16:46 ` [linux-pm] " Alan Stern 0 siblings, 1 reply; 14+ messages in thread From: Ian Campbell @ 2011-02-08 11:22 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: SUZUKI, Kazuhiro, linux-pm, xen-devel, Greg KH, LKML On Mon, 2011-02-07 at 10:35 +0100, Rafael J. Wysocki wrote: > On Monday, February 07, 2011, SUZUKI, Kazuhiro wrote: > > Hi, > > Hi, > > > The following patch series fixes hangup after creating checkpoint on > > Xen. The Linux Xen guest can be saved the state to restore later, and > > also created snapshot like checkpoint via the hypervisor. > > But, when the snapshot is created for the PV guest, it will hangup. > > > > We added 'PMSG_CANCEL' message and 'cancel' handler in dev_pm_ops > > struct in the pm-linux part. > > Please don't do that, unless you can convince me there's no other way to fix > the problem you're trying to address. Sorry, it was my advice to Kazuhiro that solving the underlying issue by extending the core would be preferable to making Xen-specific hacks. The problem is that currently we have: dpm_suspend_start(PMSG_SUSPEND); dpm_suspend_noirq(PMSG_SUSPEND); sysdev_suspend(PMSG_SUSPEND); /* suspend hypercall */ sysdev_resume(); dpm_resume_noirq(PMSG_RESUME); dpm_resume_end(PMSG_RESUME); However the suspend hypercall can return a value indicating that the suspend didn't actually happen (e.g. was cancelled). This is used e.g. when checkpointing guests, because in that case you want the original guest to continue. When the suspend didn't happen the drivers need to recover differently from if it did. The originally proposed solution was to only call dpm_resume_end if the suspend was not cancelled. My concern with this was that unbalancing the dpm_suspend_* and dpm_resume_* did not seem like a correct interaction with the core.
For example dpm_suspend_* adds stuff to dpm_suspended_list and if dpm_resume_* is not called presumably this all gets out of sync for next time. Hence my suggestion to add a cancel message type. > In my opinion it's highly unrealistic to assume that device drivers > (or even subsystems) will implement the ->cancel() callback just for the > benefit of Xen. I did vaguely wonder if a similar message might be of interest for e.g. cancelling hibernations or similar. > And if the only subsystem that needs to implement ->cancel() is Xen, then the > issue should be addressed without modifying the device core code, in a different > way. I thought it would be preferable to make use of/extend core functionality where possible but if that's not the case we can find another way. Do you have any suggestions for how to correctly interact with the core functions? Is adding a suspend_cancel operation at just the struct xenbus_driver level and introducing a Xen-specific function to walk the bus the sort of thing you were thinking of? (it seems reasonable). Should we be doing anything with dpm_*_list in that case? (FWIW original thread is on xen-devel at http://thread.gmane.org/gmane.comp.emulators.xen.devel/95265/) Ian. -- Ian Campbell Current Noise: Crowbar - Dead Sun Why on earth do people buy old bottles of wine when they can get a fresh one for a quarter of the price? ^ permalink raw reply [flat|nested] 14+ messages in thread
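The imbalance concern raised in the message above can be illustrated with a small toy model. This is only a sketch, not kernel code: `toy_suspend_start()`, `toy_resume_end()` and the `suspended_count` counter are hypothetical stand-ins for dpm_suspend_start(), dpm_resume_end() and dpm_suspended_list, and the model assumes (as the message itself only presumes) that the suspend side parks entries that the resume side is expected to take back.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the PM core's bookkeeping; not the real data structure. */
struct toy_pm_core {
    size_t active_count;     /* devices currently operational */
    size_t suspended_count;  /* devices parked by the suspend side */
};

/* Hypothetical analogue of dpm_suspend_start(): park every active device. */
static void toy_suspend_start(struct toy_pm_core *core)
{
    core->suspended_count += core->active_count;
    core->active_count = 0;
}

/* Hypothetical analogue of dpm_resume_end(): return parked devices to service. */
static void toy_resume_end(struct toy_pm_core *core)
{
    core->active_count += core->suspended_count;
    core->suspended_count = 0;
}

/*
 * Run one checkpoint cycle. When skip_resume_on_cancel is nonzero and the
 * suspend was cancelled, the resume half is skipped, which models the
 * originally proposed fix. Returns the number of devices left parked.
 */
static size_t toy_checkpoint(struct toy_pm_core *core, int cancelled,
                             int skip_resume_on_cancel)
{
    assert(core != NULL);
    toy_suspend_start(core);
    if (!(cancelled && skip_resume_on_cancel))
        toy_resume_end(core);
    return core->suspended_count;
}
```

In this model a balanced cycle leaves nothing parked, while skipping the resume half after a cancelled suspend leaves every device on the parked side for the next cycle, which is the "out of sync" state the message worries about.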
* Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-08 11:22 ` Ian Campbell @ 2011-02-08 16:46 ` Alan Stern 2011-02-08 17:35 ` Ian Campbell 0 siblings, 1 reply; 14+ messages in thread From: Alan Stern @ 2011-02-08 16:46 UTC (permalink / raw) To: Ian Campbell Cc: Rafael J. Wysocki, linux-pm, xen-devel, SUZUKI, Kazuhiro, LKML On Tue, 8 Feb 2011, Ian Campbell wrote: > The problem is that currently we have: > > dpm_suspend_start(PMSG_SUSPEND); > > dpm_suspend_noirq(PMSG_SUSPEND); > > sysdev_suspend(PMSG_SUSPEND); > /* suspend hypercall */ > sysdev_resume(); > > dpm_resume_noirq(PMSG_RESUME); > > dpm_resume_end(PMSG_RESUME); > > However the suspend hypercall can return a value indicating that the > suspend didn't actually happen (e.g. was cancelled). This is used e.g. > when checkpointing guests, because in that case you want the original > guest to continue. When the suspend didn't happen the drivers need to > recover differently from if it did. That is odd, and it is quite different from the intended design of the PM core. Drivers are supposed to put their devices into a known suspended state; then afterwards they put the devices back into an operational state. What happens while the devices are in the suspended state isn't supposed to matter -- the system transition can fail, but devices get treated exactly the same way as if it succeeded. Why do your drivers need to recover differently based on the success of the hypercall? Alan Stern ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-08 16:46 ` [linux-pm] " Alan Stern @ 2011-02-08 17:35 ` Ian Campbell 2011-02-09 23:16 ` Brendan Cully 0 siblings, 1 reply; 14+ messages in thread From: Ian Campbell @ 2011-02-08 17:35 UTC (permalink / raw) To: Alan Stern Cc: Rafael J. Wysocki, linux-pm, xen-devel, SUZUKI, Kazuhiro, LKML, Brendan Cully On Tue, 2011-02-08 at 11:46 -0500, Alan Stern wrote: > On Tue, 8 Feb 2011, Ian Campbell wrote: > > > The problem is that currently we have: > > > > dpm_suspend_start(PMSG_SUSPEND); > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > sysdev_suspend(PMSG_SUSPEND); > > /* suspend hypercall */ > > sysdev_resume(); > > > > dpm_resume_noirq(PMSG_RESUME); > > > > dpm_resume_end(PMSG_RESUME); > > > > However the suspend hypercall can return a value indicating that the > > suspend didn't actually happen (e.g. was cancelled). This is used e.g. > > when checkpointing guests, because in that case you want the original > > guest to continue. When the suspend didn't happen the drivers need to > > recover differently from if it did. > > That is odd, and it is quite different from the intended design of the > PM core. Drivers are supposed to put their devices into a known > suspended state; then afterwards they put the devices back into an > operational state. What happens while the devices are in the suspended > state isn't supposed to matter -- the system transition can fail, but > devices get treated exactly the same way as if it succeeded. > > Why do your drivers need to recover differently based on the success of > the hypercall? checkpointing isn't really my area but AIUI you don't want to do a full device teardown and reconnect like you would with a proper suspend because of the time that takes which prevents you from doing continuous rolling checkpoints at granularity which people want to implement various disaster recovery schemes. 
Hopefully one of the Xen checkpointing folks will chime in and explain why this is not possible to achieve at the individual driver level (or, even better, with a patch which does it that way ;-)). Ian. -- Ian Campbell Current Noise: Buckcherry - King Of Kings Beauty? What's that? -- Larry Wall in <199710221937.MAA25131@wall.org> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-08 17:35 ` Ian Campbell @ 2011-02-09 23:16 ` Brendan Cully 2011-02-09 23:42 ` Alan Stern 2011-02-10 11:31 ` Ian Campbell 0 siblings, 2 replies; 14+ messages in thread From: Brendan Cully @ 2011-02-09 23:16 UTC (permalink / raw) To: Ian Campbell Cc: Alan Stern, Rafael J. Wysocki, linux-pm, xen-devel, SUZUKI, Kazuhiro, LKML On Tuesday, 08 February 2011 at 17:35, Ian Campbell wrote: > On Tue, 2011-02-08 at 11:46 -0500, Alan Stern wrote: > > On Tue, 8 Feb 2011, Ian Campbell wrote: > > > > > The problem is that currently we have: > > > > > > dpm_suspend_start(PMSG_SUSPEND); > > > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > > > sysdev_suspend(PMSG_SUSPEND); > > > /* suspend hypercall */ > > > sysdev_resume(); > > > > > > dpm_resume_noirq(PMSG_RESUME); > > > > > > dpm_resume_end(PMSG_RESUME); > > > > > > However the suspend hypercall can return a value indicating that the > > > suspend didn't actually happen (e.g. was cancelled). This is used e.g. > > > when checkpointing guests, because in that case you want the original > > > guest to continue. When the suspend didn't happen the drivers need to > > > recover differently from if it did. > > > > That is odd, and it is quite different from the intended design of the > > PM core. Drivers are supposed to put their devices into a known > > suspended state; then afterwards they put the devices back into an > > operational state. What happens while the devices are in the suspended > > state isn't supposed to matter -- the system transition can fail, but > > devices get treated exactly the same way as if it succeeded. > > > > Why do your drivers need to recover differently based on the success of > > the hypercall? 
> > checkpointing isn't really my area but AIUI you don't want to do a full > device teardown and reconnect like you would with a proper suspend > because of the time that takes which prevents you from doing continuous > rolling checkpoints at granularity which people want to implement > various disaster recovery schemes. > > Hopefully one of the Xen checkpointing folks will chime in and explain > why this is not possible to achieve at the individual driver level (or, > even better, with a patch which does it that way ;-)). As Ian says, Xen has suspend_cancel because while the normal full suspend/resume path works, it is much slower, and the work done during resume is redundant. I don't remember the numbers offhand, but when we added suspend_cancel I think we could do full suspend/resume in under 100us, which was maybe a couple of orders of magnitude faster than full resume (largely due to slow xenstore handshaking on resume, IIRC). It made a big difference for our Remus HA project, which was checkpointing tens of times per second. I'd like to keep the fast resume option, and expect that it can be contained entirely in Xen-specific code. I'll try to get someone to look into it here. I think fast resume is somewhat orthogonal to the problem of hanging on resume, which just sounds like a xen-specific bug in the slow path. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-09 23:16 ` Brendan Cully @ 2011-02-09 23:42 ` Alan Stern 2011-02-10 11:40 ` [Xen-devel] " Ian Campbell 2011-02-10 11:31 ` Ian Campbell 1 sibling, 1 reply; 14+ messages in thread From: Alan Stern @ 2011-02-09 23:42 UTC (permalink / raw) To: Brendan Cully Cc: Ian Campbell, Rafael J. Wysocki, linux-pm, xen-devel, SUZUKI, Kazuhiro, LKML On Wed, 9 Feb 2011, Brendan Cully wrote: > > > That is odd, and it is quite different from the intended design of the > > > PM core. Drivers are supposed to put their devices into a known > > > suspended state; then afterwards they put the devices back into an > > > operational state. What happens while the devices are in the suspended > > > state isn't supposed to matter -- the system transition can fail, but > > > devices get treated exactly the same way as if it succeeded. > > > > > > Why do your drivers need to recover differently based on the success of > > > the hypercall? > > > > checkpointing isn't really my area but AIUI you don't want to do a full > > device teardown and reconnect like you would with a proper suspend > > because of the time that takes which prevents you from doing continuous > > rolling checkpoints at granularity which people want to implement > > various disaster recovery schemes. > > > > Hopefully one of the Xen checkpointing folks will chime in and explain > > why this is not possible to achieve at the individual driver level (or, > > even better, with a patch which does it that way ;-)). > > As Ian says, Xen has suspend_cancel because while the normal full > suspend/resume path works, it is much slower, and the work done during > resume is redundant. I don't remember the numbers offhand, but when we > added suspend_cancel I think we could do full suspend/resume in under > 100us, which was maybe a couple of orders of magnitude faster than > full resume (largely due to slow xenstore handshaking on resume, > IIRC). 
It made a big difference for our Remus HA project, which was > checkpointing tens of times per second. > > I'd like to keep the fast resume option, and expect that it can be > contained entirely in Xen-specific code. I'll try to get someone to > look into it here. > > I think fast resume is somewhat orthogonal to the problem of hanging > on resume, which just sounds like a xen-specific bug in the slow > path. In fact there already is a "fast suspend & resume" path in the PM core. It's the freeze/thaw procedure used when starting to hibernate. The documentation specifically says that drivers' freeze methods are supposed to quiesce their devices but not change power levels. In addition, the thaw method is invoked as part of recovery from a failed hibernation attempt, so it already has the "cancel" semantics that xen seems to want. Alan Stern ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-09 23:42 ` Alan Stern @ 2011-02-10 11:40 ` Ian Campbell 2011-02-10 16:00 ` Alan Stern 0 siblings, 1 reply; 14+ messages in thread From: Ian Campbell @ 2011-02-10 11:40 UTC (permalink / raw) To: Alan Stern Cc: Brendan Cully, xen-devel, LKML, Rafael J. Wysocki, linux-pm, SUZUKI, Kazuhiro On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote: > In fact there already is a "fast suspend & resume" path in the PM core. > It's the freeze/thaw procedure used when starting to hibernate. The > documentation specifically says that drivers' freeze methods are > supposed to quiesce their devices but not change power levels. In > addition, the thaw method is invoked as part of recovery from a failed > hibernation attempt, so it already has the "cancel" semantics that xen > seems to want. Sounds like that would work and I would much prefer to simply make correct use of the core functionality. So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending on whether the suspend was cancelled or not? So the sequence of events is something like: dpm_suspend_start(PMSG_FREEZE); dpm_suspend_noirq(PMSG_FREEZE); sysdev_suspend(PMSG_QUIESCE); cancelled = suspend_hypercall() sysdev_resume(); dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW); dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW); ? (For comparison we currently have: > > > dpm_suspend_start(PMSG_SUSPEND); > > > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > > > sysdev_suspend(PMSG_SUSPEND); > > > /* suspend hypercall */ > > > sysdev_resume(); > > > > > > dpm_resume_noirq(PMSG_RESUME); > > > > > > dpm_resume_end(PMSG_RESUME); ) Ian. -- Ian Campbell Current Noise: Neil Young - Angry World If only one could get that wonderful feeling of accomplishment without having to accomplish anything. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-10 11:40 ` [Xen-devel] " Ian Campbell @ 2011-02-10 16:00 ` Alan Stern 2011-02-10 16:26 ` Rafael J. Wysocki 2011-02-10 16:34 ` Ian Campbell 0 siblings, 2 replies; 14+ messages in thread From: Alan Stern @ 2011-02-10 16:00 UTC (permalink / raw) To: Ian Campbell Cc: Brendan Cully, xen-devel, LKML, Rafael J. Wysocki, linux-pm, SUZUKI, Kazuhiro On Thu, 10 Feb 2011, Ian Campbell wrote: > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote: > > In fact there already is a "fast suspend & resume" path in the PM core. > > It's the freeze/thaw procedure used when starting to hibernate. The > > documentation specifically says that drivers' freeze methods are > > supposed to quiesce their devices but not change power levels. In > > addition, the thaw method is invoked as part of recovery from a failed > > hibernation attempt, so it already has the "cancel" semantics that xen > > seems to want. > > Sounds like that would work and I would much prefer to simply make > correct use of the core functionality. It seems like a reasonable approach. Whether it will actually _work_ is a harder question... :-) > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending > on whether the suspend was cancelled or not? Basically yes. It is also "balanced" by PMSG_RESTORE, which is used after a memory image has been restored (although this isn't relevant to your snapshotting). See the comments in include/linux/pm.h. > So the sequence of events > is something like: > dpm_suspend_start(PMSG_FREEZE); > > dpm_suspend_noirq(PMSG_FREEZE); > > sysdev_suspend(PMSG_QUIESCE); This should say sysdev_suspend(PMSG_FREEZE). > cancelled = suspend_hypercall() At this point swsusp_arch_suspend() is called. If that translates to suspend_hypercall() in your setting, then yes. > sysdev_resume(); > > dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW); > > dpm_resume_end(cancelled ? 
PMSG_RECOVER : PMSG_THAW); > ? Yes. > (For comparison we currently have: > > > > dpm_suspend_start(PMSG_SUSPEND); > > > > > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > > > > > sysdev_suspend(PMSG_SUSPEND); > > > > /* suspend hypercall */ > > > > sysdev_resume(); > > > > > > > > dpm_resume_noirq(PMSG_RESUME); > > > > > > > > dpm_resume_end(PMSG_RESUME); > ) Right. The sequence of calls is the same, but the PMSG_ argument is different so drivers are expected to act differently in response. Alan Stern ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-10 16:00 ` Alan Stern @ 2011-02-10 16:26 ` Rafael J. Wysocki 2011-02-10 16:34 ` Ian Campbell 1 sibling, 0 replies; 14+ messages in thread From: Rafael J. Wysocki @ 2011-02-10 16:26 UTC (permalink / raw) To: Alan Stern Cc: Ian Campbell, Brendan Cully, xen-devel, LKML, linux-pm, SUZUKI, Kazuhiro On Thursday, February 10, 2011, Alan Stern wrote: > On Thu, 10 Feb 2011, Ian Campbell wrote: > > > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote: > > > In fact there already is a "fast suspend & resume" path in the PM core. > > > It's the freeze/thaw procedure used when starting to hibernate. The > > > documentation specifically says that drivers' freeze methods are > > > supposed to quiesce their devices but not change power levels. In > > > addition, the thaw method is invoked as part of recovery from a failed > > > hibernation attempt, so it already has the "cancel" semantics that xen > > > seems to want. > > > > Sounds like that would work and I would much prefer to simply make > > correct use of the core functionality. > > It seems like a reasonable approach. Whether it will actually _work_ > is a harder question... :-) > > > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending > > on whether the suspend was cancelled or not? That's correct, but from drivers' point of view PMSG_THAW is equivalent to PMSG_RECOVER, because both of them cause ->thaw() callbacks to be executed. > Basically yes. It is also "balanced" by PMSG_RESTORE, which is used > after a memory image has been restored (although this isn't relevant to > your snapshotting). See the comments in include/linux/pm.h. Yup. > > So the sequence of events > > is something like: > > dpm_suspend_start(PMSG_FREEZE); > > > > dpm_suspend_noirq(PMSG_FREEZE); > > > > sysdev_suspend(PMSG_QUIESCE); > > This should say sysdev_suspend(PMSG_FREEZE). Yes, PMSG_QUIESCE is restore-specific.
> > cancelled = suspend_hypercall() > > At this point swsusp_arch_suspend() is called. If that translates to > suspend_hypercall() in your setting, then yes. > > > sysdev_resume(); > > > > dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW); > > > > dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW); > > ? > > Yes. Actually, I think PMSG_THAW can be used in both cases. The resume-side routines only use the 'state' argument for diagnostics. > > (For comparison we currently have: > > > > > dpm_suspend_start(PMSG_SUSPEND); > > > > > > > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > > > > > > > sysdev_suspend(PMSG_SUSPEND); > > > > > /* suspend hypercall */ > > > > > sysdev_resume(); > > > > > > > > > > dpm_resume_noirq(PMSG_RESUME); > > > > > > > > > > dpm_resume_end(PMSG_RESUME); > > ) > > Right. The sequence of calls is the same, but the PMSG_ argument is > different so drivers are expected to act differently in response. That's correct. Thanks, Rafael ^ permalink raw reply [flat|nested] 14+ messages in thread
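The point made in the reply above, that drivers cannot tell PMSG_THAW from PMSG_RECOVER because both end up in ->thaw(), can be sketched as a lookup from event to callback name. This is an illustrative model only: the `toy_pm_event` enum and both helper functions are hypothetical, and the table covers just the events named in this thread rather than the full set in include/linux/pm.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the PMSG_* events named in this thread. */
enum toy_pm_event {
    TOY_SUSPEND, TOY_RESUME, TOY_FREEZE, TOY_THAW, TOY_RECOVER, TOY_RESTORE
};

/* Name of the dev_pm_ops callback a driver would see for each event. */
static const char *toy_callback_for(enum toy_pm_event ev)
{
    switch (ev) {
    case TOY_SUSPEND: return "suspend";
    case TOY_RESUME:  return "resume";
    case TOY_FREEZE:  return "freeze";
    case TOY_THAW:    /* fall through: same callback as RECOVER */
    case TOY_RECOVER: return "thaw";
    case TOY_RESTORE: return "restore";
    }
    assert(!"unhandled toy_pm_event value");
    return "unknown";
}

/* Nonzero when a driver has no way to distinguish the two events. */
static int toy_indistinguishable(enum toy_pm_event a, enum toy_pm_event b)
{
    return strcmp(toy_callback_for(a), toy_callback_for(b)) == 0;
}
```

Under this model `toy_indistinguishable(TOY_THAW, TOY_RECOVER)` is true, so choosing between those two on the resume side conveys nothing to a driver; only an event that maps to a different callback can.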
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-10 16:00 ` Alan Stern 2011-02-10 16:26 ` Rafael J. Wysocki @ 2011-02-10 16:34 ` Ian Campbell 2011-02-10 17:01 ` Rafael J. Wysocki 2011-02-10 18:56 ` Alan Stern 1 sibling, 2 replies; 14+ messages in thread From: Ian Campbell @ 2011-02-10 16:34 UTC (permalink / raw) To: Alan Stern Cc: Brendan Cully, xen-devel, LKML, Rafael J. Wysocki, linux-pm, SUZUKI, Kazuhiro On Thu, 2011-02-10 at 11:00 -0500, Alan Stern wrote: > On Thu, 10 Feb 2011, Ian Campbell wrote: > > > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote: > > > In fact there already is a "fast suspend & resume" path in the PM core. > > > It's the freeze/thaw procedure used when starting to hibernate. The > > > documentation specifically says that drivers' freeze methods are > > > supposed to quiesce their devices but not change power levels. In > > > addition, the thaw method is invoked as part of recovery from a failed > > > hibernation attempt, so it already has the "cancel" semantics that xen > > > seems to want. > > > > Sounds like that would work and I would much prefer to simply make > > correct use of the core functionality. > > It seems like a reasonable approach. Whether it will actually _work_ > is a harder question... :-) Heh. > > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending > > on whether the suspend was cancelled or not? > > Basically yes. It is also "balanced" by PMSG_RESTORE, which is used > after a memory image has been restored (although this isn't relevant to > your snapshotting). See the comments in include/linux/pm.h. The documentation of the individual events in pm.h is good. Is there a reference for the sequence of events for the different types of suspend/hibernate/etc? 
> > So the sequence of events > > is something like: > > dpm_suspend_start(PMSG_FREEZE); > > > > dpm_suspend_noirq(PMSG_FREEZE); > > > > sysdev_suspend(PMSG_QUIESCE); > > This should say sysdev_suspend(PMSG_FREEZE). > > > cancelled = suspend_hypercall() > > At this point swsusp_arch_suspend() is called. If that translates to > suspend_hypercall() in your setting, then yes. > > > sysdev_resume(); > > > > dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW); > > > > dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW); > > ? > > Yes. Both of those call ->thaw ->complete. Did I mean "cancelled ? PMSG_THAW : PMSG_RESTORE"? (or s/THAW/RECOVER?) If the suspend was cancelled then we want the devices to simply pick up where they were before the freeze, whereas if we really did suspend (or migrate or whatever) then they need to do a more complete reset and reconnect operation so we want some sort of indication to the driver which happened. > > (For comparison we currently have: > > > > > dpm_suspend_start(PMSG_SUSPEND); > > > > > > > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > > > > > > > sysdev_suspend(PMSG_SUSPEND); > > > > > /* suspend hypercall */ > > > > > sysdev_resume(); > > > > > > > > > > dpm_resume_noirq(PMSG_RESUME); > > > > > > > > > > dpm_resume_end(PMSG_RESUME); > > ) > > Right. The sequence of calls is the same, but the PMSG_ argument is > different so drivers are expected to act differently in response. The drivers don't actually see the PMSG_* though right? They only see a differing sequence of hooks from dev_pm_ops called. Thanks, Ian. > > Alan Stern > > -- Ian Campbell Current Noise: Sworn Amongst - Useless Accident, n.: A condition in which presence of mind is good, but absence of body is better. -- Foolish Dictionary ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-10 16:34 ` Ian Campbell @ 2011-02-10 17:01 ` Rafael J. Wysocki 2011-02-10 18:56 ` Alan Stern 1 sibling, 0 replies; 14+ messages in thread From: Rafael J. Wysocki @ 2011-02-10 17:01 UTC (permalink / raw) To: Ian Campbell Cc: Alan Stern, Brendan Cully, xen-devel, LKML, linux-pm, SUZUKI, Kazuhiro On Thursday, February 10, 2011, Ian Campbell wrote: > On Thu, 2011-02-10 at 11:00 -0500, Alan Stern wrote: > > On Thu, 10 Feb 2011, Ian Campbell wrote: > > > > > On Wed, 2011-02-09 at 23:42 +0000, Alan Stern wrote: > > > > In fact there already is a "fast suspend & resume" path in the PM core. > > > > It's the freeze/thaw procedure used when starting to hibernate. The > > > > documentation specifically says that drivers' freeze methods are > > > > supposed to quiesce their devices but not change power levels. In > > > > addition, the thaw method is invoked as part of recovery from a failed > > > > hibernation attempt, so it already has the "cancel" semantics that xen > > > > seems to want. > > > > > > Sounds like that would work and I would much prefer to simply make > > > correct use of the core functionality. > > > > It seems like a reasonable approach. Whether it will actually _work_ > > is a harder question... :-) > > Heh. > > > > So PMSG_FREEZE is balanced by either PMSG_RECOVER or PMSG_THAW depending > > > on whether the suspend was cancelled or not? > > > > Basically yes. It is also "balanced" by PMSG_RESTORE, which is used > > after a memory image has been restored (although this isn't relevant to > > your snapshotting). See the comments in include/linux/pm.h. > > The documentation of the individual events in pm.h is good. Is there a > reference for the sequence of events for the different types of > suspend/hibernate/etc? 
> > > > So the sequence of events > > > is something like: > > > dpm_suspend_start(PMSG_FREEZE); > > > > > > dpm_suspend_noirq(PMSG_FREEZE); > > > > > > sysdev_suspend(PMSG_QUIESCE); > > > > This should say sysdev_suspend(PMSG_FREEZE). > > > > > cancelled = suspend_hypercall() > > > > At this point swsusp_arch_suspend() is called. If that translates to > > suspend_hypercall() in your setting, then yes. > > > > > sysdev_resume(); > > > > > > dpm_resume_noirq(cancelled ? PMSG_RECOVER : PMSG_THAW); > > > > > > dpm_resume_end(cancelled ? PMSG_RECOVER : PMSG_THAW); > > > ? > > > > Yes. > > Both of those call ->thaw ->complete. Did I mean "cancelled ? > PMSG_THAW : PMSG_RESTORE"? (or s/THAW/RECOVER?) > > If the suspend was cancelled then we want the devices to simply pick up > where they were before the freeze, whereas if we really did suspend (or > migrate or whatever) then they need to do a more complete reset and > reconnect operation so we want some sort of indication to the driver > which happened. In that case you should probably use PMSG_THAW (or PMSG_RECOVER) for the "cancel" case and PMSG_RESTORE for the "success" case (pretty much what hibernation does). And please don't forget to update the comments in pm.h to cover your usage case. :-) > > > (For comparison we currently have: > > > > > > dpm_suspend_start(PMSG_SUSPEND); > > > > > > > > > > > > dpm_suspend_noirq(PMSG_SUSPEND); > > > > > > > > > > > > sysdev_suspend(PMSG_SUSPEND); > > > > > > /* suspend hypercall */ > > > > > > sysdev_resume(); > > > > > > > > > > > > dpm_resume_noirq(PMSG_RESUME); > > > > > > > > > > > > dpm_resume_end(PMSG_RESUME); > > > ) > > > > Right. The sequence of calls is the same, but the PMSG_ argument is > > different so drivers are expected to act differently in response. > > The drivers don't actually see the PMSG_* though right? They only see a > differing sequence of hooks from dev_pm_ops called. That's correct.
Thanks, Rafael ^ permalink raw reply [flat|nested] 14+ messages in thread
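The sequence suggested in the reply above (freeze on the way down, then PMSG_THAW for a cancelled suspend or PMSG_RESTORE for a real one) can be written out as a short trace generator. This is a sketch under stated assumptions, not the eventual drivers/xen/manage.c change: the `toy_*` names are hypothetical, `suspend_hypercall` is the placeholder name used earlier in the thread, and no real PM core function is called; the code only formats the order of calls as text.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the PMSG_* values discussed above. */
enum toy_pmsg { TOY_PMSG_FREEZE, TOY_PMSG_THAW, TOY_PMSG_RESTORE };

static const char *toy_pmsg_name(enum toy_pmsg msg)
{
    switch (msg) {
    case TOY_PMSG_FREEZE:  return "FREEZE";
    case TOY_PMSG_THAW:    return "THAW";
    case TOY_PMSG_RESTORE: return "RESTORE";
    }
    return "?";
}

/* Cancelled checkpoint: continue in place. Real suspend: full reconnect. */
static enum toy_pmsg toy_resume_msg(int cancelled)
{
    return cancelled ? TOY_PMSG_THAW : TOY_PMSG_RESTORE;
}

/*
 * Write the ordered call sequence into buf. Returns the number of
 * characters snprintf() needed, excluding the terminating NUL.
 */
static int toy_trace_sequence(int cancelled, char *buf, size_t len)
{
    const char *down = toy_pmsg_name(TOY_PMSG_FREEZE);
    const char *up = toy_pmsg_name(toy_resume_msg(cancelled));

    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "dpm_suspend_start(%s) dpm_suspend_noirq(%s) "
                    "sysdev_suspend(%s) suspend_hypercall sysdev_resume "
                    "dpm_resume_noirq(%s) dpm_resume_end(%s)",
                    down, down, down, up, up);
}
```

The suspend half is identical in both cases; only the event passed to the two resume calls changes, so drivers see ->thaw() after a cancelled checkpoint and ->restore() after a real suspend or migration.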
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-10 16:34 ` Ian Campbell 2011-02-10 17:01 ` Rafael J. Wysocki @ 2011-02-10 18:56 ` Alan Stern 1 sibling, 0 replies; 14+ messages in thread From: Alan Stern @ 2011-02-10 18:56 UTC (permalink / raw) To: Ian Campbell Cc: Brendan Cully, xen-devel, LKML, Rafael J. Wysocki, linux-pm, SUZUKI, Kazuhiro On Thu, 10 Feb 2011, Ian Campbell wrote: > The documentation of the individual events in pm.h is good. Is there a > reference for the sequence of events for the different types of > suspend/hibernate/etc? There's Documentation/power/devices.txt. Alan Stern ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-09 23:16 ` Brendan Cully 2011-02-09 23:42 ` Alan Stern @ 2011-02-10 11:31 ` Ian Campbell 2011-02-10 17:53 ` Brendan Cully 1 sibling, 1 reply; 14+ messages in thread From: Ian Campbell @ 2011-02-10 11:31 UTC (permalink / raw) To: Brendan Cully Cc: xen-devel, LKML, Rafael J. Wysocki, Alan Stern, linux-pm, SUZUKI, Kazuhiro On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote: > I'd like to keep the fast resume option, and expect that it can be > contained entirely in Xen-specific code. I'll try to get someone to > look into it here. Thanks, please put them in contact with Kazuhiro who has already been looking into this. > I think fast resume is somewhat orthogonal to the problem of hanging > on resume, which just sounds like a xen-specific bug in the slow > path. The bug on the Xen side is taking the slow path when the suspend was actually cancelled, which is what the original patch tries to fix. Or should/could the slow path be able to cope either way? Ian. -- Ian Campbell Current Noise: Neil Young - Love And War For every complex problem, there is a solution that is simple, neat, and wrong. -- H. L. Mencken ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Xen-devel] Re: [linux-pm] [PATCH 0/2] Fix hangup after creating checkpoint on Xen. 2011-02-10 11:31 ` Ian Campbell @ 2011-02-10 17:53 ` Brendan Cully 0 siblings, 0 replies; 14+ messages in thread From: Brendan Cully @ 2011-02-10 17:53 UTC (permalink / raw) To: Ian Campbell Cc: xen-devel, LKML, Rafael J. Wysocki, Alan Stern, linux-pm, SUZUKI, Kazuhiro On Thursday, 10 February 2011 at 11:31, Ian Campbell wrote: > On Wed, 2011-02-09 at 23:16 +0000, Brendan Cully wrote: > > I'd like to keep the fast resume option, and expect that it can be > > contained entirely in Xen-specific code. I'll try to get someone to > > look into it here. > > Thanks, please put them in contact with Kazuhiro who has already been > looking into this. > > > I think fast resume is somewhat orthogonal to the problem of hanging > > on resume, which just sounds like a xen-specific bug in the slow > > path. > > The bug on the Xen side is taking the slow path when the suspend was > actually cancelled, which is what the original patch tries to fix. > > Or should/could the slow path be able to cope either way? I haven't looked at what's actually happening yet, but what should be happening is that the pvops kernel should not advertise SUSPEND_CANCEL since it doesn't yet support it. So xm save -c should call xc_domain_resume with the fast argument set to 0, to do slow path resumption, and that slow path should be able to handle resuming on the same host where it suspended. Slow path resumption from checkpoint was supported in the pre-pvops kernel. ^ permalink raw reply [flat|nested] 14+ messages in thread
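The toolstack-side rule described in the message above can be stated as a one-line decision. The sketch below is hypothetical throughout: `struct toy_guest` and `toy_fast_resume_arg()` are invented for illustration, and how a real toolstack discovers whether a guest advertises SUSPEND_CANCEL is not shown here. It only computes the value that would be passed as the "fast" argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of what a guest kernel advertises. */
struct toy_guest {
    int advertises_suspend_cancel;  /* nonzero if SUSPEND_CANCEL is supported */
};

/*
 * Value for the "fast" argument of the resume call made after a checkpoint:
 * 1 selects the fast (cancel) path, 0 the slow path with a full reconnect.
 */
static int toy_fast_resume_arg(const struct toy_guest *guest)
{
    assert(guest != NULL);
    return guest->advertises_suspend_cancel ? 1 : 0;
}
```

For a pvops guest that does not advertise the feature, this yields 0, matching the expectation above that such a guest is resumed through the slow path.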