LKML Archive on lore.kernel.org
* Hibernation considerations
@ 2007-07-15 12:33 Rafael J. Wysocki
  2007-07-15 12:51 ` Nigel Cunningham
                   ` (5 more replies)
  0 siblings, 6 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-15 12:33 UTC (permalink / raw)
  To: LKML
  Cc: Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, david, Al Boldi

Hi,

Since many alternative approaches to hibernation are now being considered and
discussed, I thought it might be a good idea to list some things that in my not
so humble opinion should be taken care of by any hibernation framework.  They
are listed below, not in any particular order, because I think they all are
important.  Still, I might have forgotten something, so everyone with
experience in implementing hibernation, especially Pavel and Nigel, please
check if the list is complete.

(1) Filesystems mounted before the hibernation are untouchable

    When there's a memory snapshot, either in the form of a hibernation image,
    or in the form of the "old" kernel and processes available to the "new"
    kexeced kernel responsible for saving their memory, the filesystems mounted
    before the hibernation should not be accessed, even for reading, because
    that would cause their on-disk state to be inconsistent with the snapshot
    and might lead to a filesystem corruption.

(2) Swap space in use before the hibernation must be handled with care

    If swap space is used for saving the memory snapshot, the snapshot-saving
    application (or kernel) must be careful enough not to overwrite swap pages
    that contain valid memory contents stored in there before the hibernation.

(3) There are memory regions that must not be saved or restored

    Some memory regions contain data that shouldn't be overwritten during the
    restore, because that might lead to the system not working correctly
    afterwards.  Also, on some systems there are valid 'struct page'
    structures that in fact correspond to memory holes and we should not attempt
    to save those pages.

(4) The user should be able to limit the size of a hibernation image

    There are a couple of reasons for that.  For example, the storage space
    used for saving the image may be smaller than the entire RAM or the user
    may want the image to be saved more quickly.

(5) Hibernation should be transparent from the applications' point of view

    Generally, applications should not notice that hibernation took place.
    [Note that I don't regard all processes as applications and I think that
    there may be processes which need to handle the hibernation in a special
    way.]  Ideally, for example, if some audio is being played when a
    hibernation starts, the audio player should be able to continue playing the
    same audio after the restore from the point at which it was
    interrupted by the hibernation.  Also, the CPU affinities and similar
    settings requested by the applications before a hibernation should be
    binding after the restore.

(6) State of devices from before hibernation should be restored, if possible

    If possible, during a restore devices should be brought back to the same
    state in which they were before the corresponding hibernation.  Of course
    in some situations it might be impossible to do that (eg. the user
    connected the hibernated system to a different IP subnet and then
    restored), but as a general rule, we should do our best to restore the
    state of devices, which is directly related to point (5) above.

(7) On ACPI systems special platform-related actions have to be carried out at
    the right points, so that the platform works correctly after the restore

    The ACPI specification requires us to invoke some global ACPI methods
    during the hibernation and during the restore.  Moreover, the ordering of
    code related to these ACPI methods may not be arbitrary (eg. some of
    them have to be executed after devices are put into low power states etc.).

(8) Hibernation and restore should not be too slow

    In my opinion, if more than one minute is needed to hibernate the system
    with the help of a certain hibernation framework, then this framework is not
    very useful in practice.  It might be useful to perform some special tasks
    (eg. moving a server to another place without taking it down), but it is
    not very useful, for example, to notebook users.

(9) Hibernation framework should not be too difficult to set up

    It follows from my experience that if the users are required to do too much
    work to set up a hibernation framework, they will not use it as long as
    there are simpler alternatives (some of them will not use hibernation at
    all if it's too difficult to get to work).  On the other hand, if the users
    are provided with a working hibernation framework by their distribution
    and they find it useful, they are not likely to use kernel.org kernels if
    it's too difficult to replace the distribution kernel with a generic one due
    to the hibernation framework's requirements.
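
To make points (2) and (4) a bit more concrete, here is a minimal sketch of
how an image writer might pick swap slots while skipping the ones that were in
use before the hibernation, and refuse an image that is larger than a
user-set limit.  All of the names in it (plan_image, in_use, limit_pages) are
hypothetical and do not correspond to any existing kernel or user space
interface; the real frameworks track swap usage with their own data
structures.

```python
def plan_image(total_slots, in_use, image_pages, limit_pages):
    """Return the swap slots an image may occupy, or raise ValueError.

    total_slots -- number of page-sized slots in the swap area (assumed)
    in_use      -- set of slot indices holding pre-hibernation swap data
    image_pages -- number of pages the image needs
    limit_pages -- user-chosen upper bound on the image size, point (4)
    """
    if image_pages > limit_pages:
        raise ValueError("image exceeds the user-set size limit")
    # Point (2): only slots that held no valid data may be written.
    free = [s for s in range(total_slots) if s not in in_use]
    if image_pages > len(free):
        raise ValueError("not enough free swap for the image")
    return free[:image_pages]
```

With 8 slots of which 0, 2 and 3 are occupied, a 3-page image under a 4-page
limit would land in slots 1, 4 and 5, and the occupied slots stay untouched.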

All of the existing hibernation frameworks have been written with the above
points in mind and that's why they are what they are.  In particular, the
existence of the tasks freezer, hated by some people to the point of insanity,
follows directly from points (1), (4) and (5).

In my opinion any hibernation framework that doesn't take the above
requirements into account in any way will be a failure.  Moreover, the existing
frameworks fail to follow some of them too, so I consider all of these
frameworks as a work in progress.  For this reason, I would much rather see
ideas allowing us to improve the existing frameworks in a more or less
evolutionary way than attempts to replace them all with something entirely
new.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
@ 2007-07-15 12:51 ` Nigel Cunningham
  2007-07-15 12:58 ` Dr. David Alan Gilbert
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-15 12:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Pavel Machek, pm list,
	david, Al Boldi

Hi.

On Sunday 15 July 2007 22:33:32 Rafael J. Wysocki wrote:
> Hi,
>
> Since many alternative approaches to hibernation are now being considered and
> discussed, I thought it might be a good idea to list some things that in my not
> so humble opinion should be taken care of by any hibernation framework.  They
> are listed below, not in any particular order, because I think they all are
> important.  Still, I might have forgotten something, so everyone with
> experience in implementing hibernation, especially Pavel and Nigel, please
> check if the list is complete.
>
> (1) Filesystems mounted before the hibernation are untouchable
>
>     When there's a memory snapshot, either in the form of a hibernation image,
>     or in the form of the "old" kernel and processes available to the "new"
>     kexeced kernel responsible for saving their memory, the filesystems mounted
>     before the hibernation should not be accessed, even for reading, because
>     that would cause their on-disk state to be inconsistent with the snapshot
>     and might lead to a filesystem corruption.
>
> (2) Swap space in use before the hibernation must be handled with care
>
>     If swap space is used for saving the memory snapshot, the snapshot-saving
>     application (or kernel) must be careful enough not to overwrite swap pages
>     that contain valid memory contents stored in there before the hibernation.
>
> (3) There are memory regions that must not be saved or restored
>
>     Some memory regions contain data that shouldn't be overwritten during the
>     restore, because that might lead to the system not working correctly
>     afterwards.  Also, on some systems there are valid 'struct page'
>     structures that in fact correspond to memory holes and we should not attempt
>     to save those pages.
>
> (4) The user should be able to limit the size of a hibernation image
>
>     There are a couple of reasons for that.  For example, the storage space
>     used for saving the image may be smaller than the entire RAM or the user
>     may want the image to be saved more quickly.
>
> (5) Hibernation should be transparent from the applications' point of view
>
>     Generally, applications should not notice that hibernation took place.
>     [Note that I don't regard all processes as applications and I think that
>     there may be processes which need to handle the hibernation in a special
>     way.]  Ideally, for example, if some audio is being played when a
>     hibernation starts, the audio player should be able to continue playing the
>     same audio after the restore from the point at which it was
>     interrupted by the hibernation.  Also, the CPU affinities and similar
>     settings requested by the applications before a hibernation should be
>     binding after the restore.
>
> (6) State of devices from before hibernation should be restored, if possible
>
>     If possible, during a restore devices should be brought back to the same
>     state in which they were before the corresponding hibernation.  Of course
>     in some situations it might be impossible to do that (eg. the user
>     connected the hibernated system to a different IP subnet and then
>     restored), but as a general rule, we should do our best to restore the
>     state of devices, which is directly related to point (5) above.
>
> (7) On ACPI systems special platform-related actions have to be carried out at
>     the right points, so that the platform works correctly after the restore
>
>     The ACPI specification requires us to invoke some global ACPI methods
>     during the hibernation and during the restore.  Moreover, the ordering of
>     code related to these ACPI methods may not be arbitrary (eg. some of
>     them have to be executed after devices are put into low power states etc.).
>
> (8) Hibernation and restore should not be too slow
>
>     In my opinion, if more than one minute is needed to hibernate the system
>     with the help of a certain hibernation framework, then this framework is not
>     very useful in practice.  It might be useful to perform some special tasks
>     (eg. moving a server to another place without taking it down), but it is
>     not very useful, for example, to notebook users.
>
> (9) Hibernation framework should not be too difficult to set up
>
>     It follows from my experience that if the users are required to do too much
>     work to set up a hibernation framework, they will not use it as long as
>     there are simpler alternatives (some of them will not use hibernation at
>     all if it's too difficult to get to work).  On the other hand, if the users
>     are provided with a working hibernation framework by their distribution
>     and they find it useful, they are not likely to use kernel.org kernels if
>     it's too difficult to replace the distribution kernel with a generic one due
>     to the hibernation framework's requirements.
>
> All of the existing hibernation frameworks have been written with the above
> points in mind and that's why they are what they are.  In particular, the
> existence of the tasks freezer, hated by some people to the point of insanity,
> follows directly from points (1), (4) and (5).
>
> In my opinion any hibernation framework that doesn't take the above
> requirements into account in any way will be a failure.  Moreover, the existing
> frameworks fail to follow some of them too, so I consider all of these
> frameworks as a work in progress.  For this reason, I would much rather see
> ideas allowing us to improve the existing frameworks in a more or less
> evolutionary way than attempts to replace them all with something entirely
> new.

Sounds good to me. Nothing extra occurs to me immediately.

Regards,

Nigel
-- 
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.

* Re: Hibernation considerations
  2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
  2007-07-15 12:51 ` Nigel Cunningham
@ 2007-07-15 12:58 ` Dr. David Alan Gilbert
  2007-07-15 22:38   ` Rafael J. Wysocki
  2007-07-15 15:10 ` Hibernation considerations Al Boldi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 220+ messages in thread
From: Dr. David Alan Gilbert @ 2007-07-15 12:58 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, david, Al Boldi

* Rafael J. Wysocki (rjw@sisk.pl) wrote:

> (5) Hibernation should be transparent from the applications' point of view
> 
>     Generally, applications should not notice that hibernation took place.
>     [Note that I don't regard all processes as applications and I think that
>     there may be processes which need to handle the hibernation in a special
>     way.]  Ideally, for example, if some audio is being played when a
>     hibernation starts, the audio player should be able to continue playing the
>     same audio after the restore from the point at which it was
>     interrupted by the hibernation.  Also, the CPU affinities and similar

That would be _so_ embarrassing in a library; I'd rather the audio
player had the opportunity to consider whether restarting was a good idea.

> (6) State of devices from before hibernation should be restored, if possible
> 
>     If possible, during a restore devices should be brought back to the same
>     state in which they were before the corresponding hibernation.  Of course
>     in some situations it might be impossible to do that (eg. the user
>     connected the hibernated system to a different IP subnet and then
>     restored), but as a general rule, we should do our best to restore the
>     state of devices, which is directly related to point (5) above.

Or the user unplugs their flash drive after hibernation rather than before.

Two things which I think would be nice to consider are:
   1) Encryption - I'd actually prefer if my luks device did not
       remember the key across a hibernation; I want to be forced to
       reenter the phrase.  However I don't know what the best thing
       to do to partitions/applications using the luks device is.

   2) Some level of debugging needs to be available so that users can
      provide something so you can see why something hasn't hibernated
      or why (as in the case of this tosh laptop) it still takes power
      during hibernation.

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

* Re: Hibernation considerations
  2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
  2007-07-15 12:51 ` Nigel Cunningham
  2007-07-15 12:58 ` Dr. David Alan Gilbert
@ 2007-07-15 15:10 ` Al Boldi
  2007-07-15 15:35   ` jimmy bahuleyan
  2007-07-15 16:29   ` Alan Stern
  2007-07-15 20:13 ` david
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 220+ messages in thread
From: Al Boldi @ 2007-07-15 15:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, Alan Stern, Andrew Morton, Eric W. Biederman,
	Huang, Ying, Jeremy Maitin-Shepard, Kyle Moffett,
	Nigel Cunningham, Pavel Machek, pm list, david

Rafael J. Wysocki wrote:
> (3) There are memory regions that must not be saved or restored
>
>     Some memory regions contain data that shouldn't be overwritten during
> the restore, because that might lead to the system not working correctly
> afterwards.  Also, on some systems there are valid 'struct page'
> structures that in fact correspond to memory holes and we should not
> attempt to save those pages.

That's only true if we use the current swsusp code to restore the image.  If 
this becomes an issue, we can always use the kexec approach to restore the 
image, even if that required a double kexec boot to slot in the hibernation 
kernel, as kexec boot overhead is low.

> (5) Hibernation should be transparent from the applications' point of view
>
>     Generally, applications should not notice that hibernation took place.
>     [Note that I don't regard all processes as applications and I think
> that there may be processes which need to handle the hibernation in a
> special way.]  Ideally, for example, if some audio is being played when a
> hibernation starts, the audio player should be able to continue playing
> the same audio after the restore from the point at which it was
> interrupted by the hibernation.  Also, the CPU affinities and similar
> settings requested by the applications before a hibernation should be
> binding after the restore.

Using kexec may be as transparent as it gets.  What's critical here is to
never try memory-dependent operations from within the normal kernel, not
even backup, as that would get you into an inter-dependency mess we dearly 
want to avoid.

> (6) State of devices from before hibernation should be restored, if
> possible
>
>     If possible, during a restore devices should be brought back to the
> same state in which they were before the corresponding hibernation.  Of
> course in some situations it might be impossible to do that (eg. the user
> connected the hibernated system to a different IP subnet and then
> restored), but as a general rule, we should do our best to restore the
> state of devices, which is directly related to point (5) above.

This part could easily be handled by the normal kernel before and after 
resume.

> (7) On ACPI systems special platform-related actions have to be carried
> out at the right points, so that the platform works correctly after the
> restore
>
>     The ACPI specification requires us to invoke some global ACPI methods
>     during the hibernation and during the restore.  Moreover, the ordering
> of code related to these ACPI methods may not be arbitrary (eg. some of
> them have to be executed after devices are put into low power states
> etc.).

This should be the responsibility of the kexec'd hibernating kernel.  Note 
though in (6), the normal kernel takes care of preparing devices, then the 
hibernating kernel dumps the image and either calls S4 or S3.  On resume 
from S3 it can immediately switch over to the normal kernel, and from S4 the 
known bootup would occur.

> (8) Hibernation and restore should not be too slow
>
>     In my opinion, if more than one minute is needed to hibernate the
> system with the help of a certain hibernation framework, then this framework
> is not very useful in practice.  It might be useful to perform some
> special tasks (eg. moving a server to another place without taking it
> down), but it is not very useful, for example, to notebook users.

The latest hibernating kexec patches boot a kexec'd modular kernel with 
initramfs into crashkernel=16M@16M in less than one second.  Switch-back is 
almost instant.  Add to this the time required to either store or restore 
the image, and it may be obvious that this approach isn't slower, but maybe 
even faster than the current swsusp.
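
A back-of-the-envelope version of that estimate can be written down as
follows.  The numbers are purely illustrative: the 512 MB image and the
50 MB/s sustained disk throughput are assumptions, not measurements from the
patches, and the one-second kexec boot is taken as an upper bound from the
figure above.

```python
def hibernate_seconds(image_mb, disk_mb_per_s, kexec_boot_s=1.0):
    """Estimate total time: kexec boot plus writing the image to disk."""
    return kexec_boot_s + image_mb / disk_mb_per_s

# Hypothetical notebook: 512 MB image, 50 MB/s sustained write speed.
estimate = hibernate_seconds(512, 50)
within_budget = estimate <= 60  # the one-minute bound from point (8)
```

Under these assumptions the image I/O dominates the total, so the extra
kexec boot only matters if it stays in the range of a second or so.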


Thanks!

--
Al


* Re: Hibernation considerations
  2007-07-15 15:10 ` Hibernation considerations Al Boldi
@ 2007-07-15 15:35   ` jimmy bahuleyan
  2007-07-15 17:40     ` Al Boldi
  2007-07-15 16:29   ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: jimmy bahuleyan @ 2007-07-15 15:35 UTC (permalink / raw)
  To: Al Boldi
  Cc: Rafael J. Wysocki, linux-kernel, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

Al Boldi wrote:
> 
> This should be the responsibility of the kexec'd hibernating kernel.  Note 
> though in (6), the normal kernel takes care of preparing devices, then the 
> hibernating kernel dumps the image and either calls S4 or S3.  On resume 
> from S3 it can immediately switch over to the normal kernel, and from S4 the 
> known bootup would occur.
> 
>> (8) Hibernation and restore should not be too slow
>>
>>     In my opinion, if more than one minute is needed to hibernate the
>> system with the help of a certain hibernation framework, then this framework
>> is not very useful in practice.  It might be useful to perform some
>> special tasks (eg. moving a server to another place without taking it
>> down), but it is not very useful, for example, to notebook users.
> 
> The latest hibernating kexec patches boot a kexec'd modular kernel with 
> initramfs into crashkernel=16M@16M in less than one second.  Switch-back is 
> almost instant.  Add to this the time required to either store or restore 
> the image, and it may be obvious that this approach isn't slower, but maybe 
> even faster than the current swsusp.
> 

What about (9)? Would it be that a user choosing to build a kernel with
hibernate support gets an additional modular kernel built (which he
should then use for resumption) or he should configure & build the
modular kernel independent of main kernel?

Or will the Linux boot procedure change so that it always goes thru a
modular part followed by kexec (just to be uniform)?

Although the kexec approach seems interesting, the final user-scenario
seems a bit complex (or confusing).

-jb
-- 
Tact is the art of making a point without making an enemy.

* Re: Hibernation considerations
  2007-07-15 15:10 ` Hibernation considerations Al Boldi
  2007-07-15 15:35   ` jimmy bahuleyan
@ 2007-07-15 16:29   ` Alan Stern
  2007-07-15 17:40     ` Al Boldi
  2007-07-15 19:52     ` david
  1 sibling, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-15 16:29 UTC (permalink / raw)
  To: Al Boldi
  Cc: Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

On Sun, 15 Jul 2007, Al Boldi wrote:

> >     If possible, during a restore devices should be brought back to the
> > same state in which they were before the corresponding hibernation.  Of
> > course in some situations it might be impossible to do that (eg. the user
> > connected the hibernated system to a different IP subnet and then
> > restored), but as a general rule, we should do our best to restore the
> > state of devices, which is directly related to point (5) above.
> 
> This part could easily be handled by the normal kernel before and after 
> resume.

I agree with you except for the word "easily".  And there are some 
things the kernel simply punts on (I'm thinking of the current VGA 
font).

> > (7) On ACPI systems special platform-related actions have to be carried
> > out at the right points, so that the platform works correctly after the
> > restore
> >
> >     The ACPI specification requires us to invoke some global ACPI methods
> >     during the hibernation and during the restore.  Moreover, the ordering
> > of code related to these ACPI methods may not be arbitrary (eg. some of
> > them have to be executed after devices are put into low power states
> > etc.).
> 
> This should be the responsibility of the kexec'd hibernating kernel.  Note 
> though in (6), the normal kernel takes care of preparing devices, then the 
> hibernating kernel dumps the image and either calls S4 or S3.  On resume 
> from S3 it can immediately switch over to the normal kernel, and from S4 the 
> known bootup would occur.

Is it really that simple?  Somehow I doubt it.  In order for some 
devices to remain available for the kexec'd kernel to use, they cannot 
be suspended at the ACPI level.  So the kexec'd kernel will have to 
handle the ACPI requirements for those devices.  Likewise, it would 
have to handle the ACPI interactions which need to be done after all 
devices are prepared for the transition to S3 or S4.

> > (8) Hibernation and restore should not be too slow
> >
> >     In my opinion, if more than one minute is needed to hibernate the
> > system with the help of a certain hibernation framework, then this framework
> > is not very useful in practice.  It might be useful to perform some
> > special tasks (eg. moving a server to another place without taking it
> > down), but it is not very useful, for example, to notebook users.
> 
> The latest hibernating kexec patches boot a kexec'd modular kernel with 
> initramfs into crashkernel=16M@16M in less than one second.  Switch-back is 
> almost instant.  Add to this the time required to either store or restore 
> the image, and it may be obvious that this approach isn't slower, but maybe 
> even faster than the current swsusp.

Does that include the time required for probing PCI buses?  On my
desktop system, PCI probing incurs a five-second timeout delay because
of a bug in the BIOS's USB firmware.  Don't be so sure that kexec will
always be lightning fast; it is always better to avoid unnecessary
boots.

Alan Stern


* Re: Hibernation considerations
  2007-07-15 15:35   ` jimmy bahuleyan
@ 2007-07-15 17:40     ` Al Boldi
  0 siblings, 0 replies; 220+ messages in thread
From: Al Boldi @ 2007-07-15 17:40 UTC (permalink / raw)
  To: jimmy bahuleyan
  Cc: Rafael J. Wysocki, linux-kernel, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

jimmy bahuleyan wrote:
> Al Boldi wrote:
> > This should be the responsibility of the kexec'd hibernating kernel. 
> > Note though in (6), the normal kernel takes care of preparing devices,
> > then the hibernating kernel dumps the image and either calls S4 or S3. 
> > On resume from S3 it can immediately switch over to the normal kernel,
> > and from S4 the known bootup would occur.
> >
> >> (8) Hibernation and restore should not be too slow
> >>
> >>     In my opinion, if more than one minute is needed to hibernate the
> >> system with the help of a certain hibernation framework, then this
> >> framework is not very useful in practice.  It might be useful to
> >> perform some special tasks (eg. moving a server to another place
> >> without taking it down), but it is not very useful, for example, to
> >> notebook users.
> >
> > The latest hibernating kexec patches boot a kexec'd modular kernel with
> > initramfs into crashkernel=16M@16M in less than one second.  Switch-back
> > is almost instant.  Add to this the time required to either store or
> > restore the image, and it may be obvious that this approach isn't
> > slower, but maybe even faster than the current swsusp.
>
> What about (9)? Would it be that a user choosing to build a kernel with
> hibernate support gets an additional modular kernel built (which he
> should then use for resumption) or he should configure & build the
> modular kernel independent of main kernel?
>
> Or will the Linux boot procedure change so that it always goes thru a
> modular part followed by kexec (just to be uniform)?
>
> Although the kexec approach seems interesting, the final user-scenario
> seems a bit complex (or confusing).

Well, it may sound confusing because it is so unexpectedly simple.  I didn't
respond to (9) because from a user's point of view nothing should change, and
everything should be scriptable such that the user wouldn't even notice the
kernel using a new hibernation approach.


Thanks!

--
Al


* Re: Hibernation considerations
  2007-07-15 16:29   ` Alan Stern
@ 2007-07-15 17:40     ` Al Boldi
  2007-07-15 23:28       ` Alan Stern
  2007-07-15 19:52     ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Al Boldi @ 2007-07-15 17:40 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

Alan Stern wrote:
> On Sun, 15 Jul 2007, Al Boldi wrote:
> > >     If possible, during a restore devices should be brought back to
> > > the same state in which they were before the corresponding
> > > hibernation.  Of course in some situations it might be impossible to
> > > do that (eg. the user connected the hibernated system to a different
> > > IP subnet and then restored), but as a general rule, we should do our
> > > best to restore the state of devices, which is directly related to
> > > point (5) above.
> >
> > This part could easily be handled by the normal kernel before and after
> > resume.
>
> I agree with you except for the word "easily".  And there are some
> things the kernel simply punts on (I'm thinking of the current VGA
> font).

Why; can you explain?

> > > (7) On ACPI systems special platform-related actions have to be
> > > carried out at the right points, so that the platform works correctly
> > > after the restore
> > >
> > >     The ACPI specification requires us to invoke some global ACPI
> > > methods during the hibernation and during the restore.  Moreover, the
> > > ordering of code related to these ACPI methods may not be arbitrary
> > > (eg. some of them have to be executed after devices are put into low
> > > power states etc.).
> >
> > This should be the responsibility of the kexec'd hibernating kernel. 
> > Note though in (6), the normal kernel takes care of preparing devices,
> > then the hibernating kernel dumps the image and either calls S4 or S3. 
> > On resume from S3 it can immediately switch over to the normal kernel,
> > and from S4 the known bootup would occur.
>
> Is it really that simple?  Somehow I doubt it.  In order for some
> devices to remain available for the kexec'd kernel to use, they cannot
> be suspended at the ACPI level.  So the kexec'd kernel will have to
> handle the ACPI requirements for those devices.  Likewise, it would
> have to handle the ACPI interactions which need to be done after all
> devices are prepared for the transition to S3 or S4.

Ok, after applying the latest kexec patches, I was able to use the kexec'd 
kernel to suspend to ram and resume to the normal kernel, while working 
under a full-blown X session.  It went without a hitch.  All that is needed 
now are the dump/restore hibernation-image routines.

> > > (8) Hibernation and restore should not be too slow
> > >
> > >     In my opinion, if more than one minute is needed to hibernate the
> > > system with the help of a certain hibernation framework, then this
> > > framework is not very useful in practice.  It might be useful to
> > > perform some special tasks (eg. moving a server to another place
> > > without taking it down), but it is not very useful, for example, to
> > > notebook users.
> >
> > The latest hibernating kexec patches boot a kexec'd modular kernel with
> > initramfs into crashkernel=16M@16M in less than one second.  Switch-back
> > is almost instant.  Add to this the time required to either store or
> > restore the image, and it may be obvious that this approach isn't
> > slower, but maybe even faster than the current swsusp.
>
> Does that include the time required for probing PCI buses?  On my
> desktop system, PCI probing incurs a five-second timeout delay because
> of a bug in the BIOS's USB firmware.

Using a modular kernel you would only insmod those modules that you need to 
dump the image, which is mainly the disk driver.  If you wanted to dump it
onto USB flash, then you would insmod that driver, and if that driver is 
slow due to a bug, then a fix should be in order.

> Don't be so sure that kexec will
> always be lightning fast; it is always better to avoid unnecessary
> boots.

Agreed, but what we want to achieve right now is a proof of concept.  Later,
I could imagine a specially stub'd device driver to be kexec'd instead of 
the full kernel.


Thanks!

--
Al



* Re: Hibernation considerations
  2007-07-15 20:35 ` Cornelius Riemenschneider
@ 2007-07-15 19:46   ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-15 19:46 UTC (permalink / raw)
  To: Cornelius Riemenschneider
  Cc: Rafael J. Wysocki, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007, Cornelius Riemenschneider wrote:

> Hi,
> I think a (10) is needed, to be clear:
>
> (10) This depends directly on (9)
> Easy user interface for the average user: no one wants to edit several
> files and use a long command line for a suspend (not everyone knows about
> alias(P)), so please keep the effort needed to suspend the system low (eg.
> with my TuxOnIce I just type 'hibernate' as root, or 'sudo hibernate' as a
> normal user, and the suspend begins; the rest is done automatically by the
> configuration, which for me means shutting down my wireless interface).
>
> Cornelius Riemenschneider

for both #9 and #10, if there are complex command lines that need to be 
used, they can be put into scripts (including things like 'sudo 
hibernate')

this doesn't prevent people from using kernel.org kernels, in fact the
separation between the main system and what's used for hibernation may make
it easier to change the main kernel (all that you would need is to enable
kexec in your main kernel and 'everything else would just work')

now if you need/want to change out your hibernate kernel you will have to 
change the hibernate scripts to tell them to use the new kernel. yes this 
is an extra step, but it can still be simple. for that matter, it may be 
possible for the distros to agree enough to add 'make install_hibernate' 
to the kernel makefiles like they have 'make install' that will just 'do 
the right thing' for each particular distro.

making things explicit and flexible may require complex command lines to
implement, but nothing says that the user (even a moderately advanced user)
needs to deal with those command lines directly.

David Lang


* Re: Hibernation considerations
  2007-07-15 16:29   ` Alan Stern
  2007-07-15 17:40     ` Al Boldi
@ 2007-07-15 19:52     ` david
  1 sibling, 0 replies; 220+ messages in thread
From: david @ 2007-07-15 19:52 UTC (permalink / raw)
  To: Alan Stern
  Cc: Al Boldi, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Sun, 15 Jul 2007, Alan Stern wrote:

>>> (7) On ACPI systems special platform-related actions have to be carried
>>> out at the right points, so that the platform works correctly after the
>>> restore
>>>
>>>     The ACPI specification requires us to invoke some global ACPI methods
>>>     during the hibernation and during the restore.  Moreover, the ordering
>>> of code related to these ACPI methods may not be arbitrary (eg. some of
>>> them have to be executed after devices are put into low power states
>>> etc.).
>>
>> This should be the responsibility of the kexec'd hibernating kernel.  Note
>> though in (6), the normal kernel takes care of preparing devices, then the
>> hibernating kernel dumps the image and either calls S4 or S3.  On resume
>> from S3 it can immediately switch over to the normal kernel, and from S4 the
>> known bootup would occur.
>
> Is it really that simple?  Somehow I doubt it.  In order for some
> devices to remain available for the kexec'd kernel to use, they cannot
> be suspended at the ACPI level.  So the kexec'd kernel will have to
> handle the ACPI requirements for those devices.  Likewise, it would
> have to handle the ACPI interactions which need to be done after all
> devices are prepared for the transition to S3 or S4.

so if there are two states for hardware

1. able to be detected and initialized with the standard boot routines

2. ACPI suspend mode, requiring ACPI specifics to wake up

is it possible to detect which is needed? can you try the ACPI wakeup and
if it doesn't work go to the standard boot routine (or vice-versa)?

>>> (8) Hibernation and restore should not be too slow
>>>
>>>     In my opinion, if more than one minute is needed to hibernate the
>>> system with the help of certain hibernation framework, then this framework
>>> is not very useful in practice.  It might be useful to perform some
>>> special tasks (eg. moving a server to another place without taking it
>>> down), but it is not very useful, for example, to notebook users.
>>
>> The latest hibernating kexec patches boot a kexec'd modular kernel with
>> initramfs into crashkernel=16M@16M in less than one second.  Switch-back is
>> almost instant.  Add to this the time required to either store or restore
>> the image, and it may be obvious that this approach isn't slower, but maybe
>> even faster than the current swsusp.
>
> Does that include the time required for probing PCI buses?  On my
> desktop system, PCI probing incurs a five-second timeout delay because
> of a bug in the BIOS's USB firmware.  Don't be so sure that kexec will
> always be lightning fast; it is always better to avoid unnecessary
> boots.

some drivers will have delays on some machines (especially when you 
consider hardware bugs that have to be worked around), but at the moment 
nothing is remotely close to the 'limit' of 60 seconds being talked about 
here.

although with some machines the mere transfer times for the data could
cause noticeable delays (if you have a server with 128G of ram it may take
a while to get it to a drive)
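
As a rough back-of-the-envelope sketch of that transfer time (the two
throughput figures below are assumed purely for illustration, they are not
measurements from any machine in this thread):

```python
def transfer_seconds(ram_gib: float, throughput_mib_s: float) -> float:
    """Seconds needed to stream ram_gib GiB to storage at a sustained rate."""
    return ram_gib * 1024 / throughput_mib_s


# Assumed sustained write rates, in MiB/s (hypothetical values).
ASSUMED_RATES = (
    ("single disk", 60.0),
    ("fast array", 400.0),
)

for label, rate in ASSUMED_RATES:
    secs = transfer_seconds(128, rate)
    print(f"128 GiB to {label} at {rate:.0f} MiB/s: {secs / 60:.1f} minutes")
```

With either assumed rate, the 128G case is past the one-minute figure from
(8) on transfer time alone, before any device probing is counted.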

David Lang


* Re: Hibernation considerations
  2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
                   ` (2 preceding siblings ...)
  2007-07-15 15:10 ` Hibernation considerations Al Boldi
@ 2007-07-15 20:13 ` david
  2007-07-15 22:47   ` Rafael J. Wysocki
  2007-07-15 23:17   ` Alan Stern
  2007-07-15 20:35 ` Cornelius Riemenschneider
  2007-07-16  0:51 ` Matthew Garrett
  5 siblings, 2 replies; 220+ messages in thread
From: david @ 2007-07-15 20:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:

> Hi,
>
> Since many alternative approaches to hibernation are now being considered and
> discussed, I thought it might be a good idea to list some things that in my not
> so humble opinion should be taken care of by any hibernation framework.  They
> are listed below, not in any particular order, because I think they all are
> important.  Still, I might have forgotten something, so everyone with
> experience in implementing hibernation, especially Pavel and Nigel, please
> check if the list is complete.
>
> (1) Filesystems mounted before the hibernation are untouchable
>
>    When there's a memory snapshot, either in the form of a hibernation image,
>    or in the form of the "old" kernel and processes available to the "new"
>    kexeced kernel responsible for saving their memory, the filesystems mounted
>    before the hibernation should not be accessed, even for reading, because
>    that would cause their on-disk state to be inconsistent with the snapshot
>    and might lead to a filesystem corruption.

AFAIK this is only the case with ext3, all other filesystems could be 
accessed read-only safely

this is arguably a bug with ext3 (and has been discussed as such), but 
right now the ext3 team has decided not to change this behavior so
hibernate needs to work around it. but don't mistake a work-around for a 
single (admittedly very popular) filesystem with a hard and fast 
directive.

> (2) Swap space in use before the hibernation must be handled with care
>
>    If swap space is used for saving the memory snapshot, the snapshot-saving
>    application (or kernel) must be careful enough not to overwrite swap pages
>    that contain valid memory contents stored in there before the hibernation.

true, in fact, given that many distros and live-CD's autodetect swap 
partitions and consider them fair game, I would argue that the best thing 
to do would be to have the main system free up its swap partitions before
going into hibernation.

however, this could be a decision of the particular hibernate routines.

for the kexec approach the mapping of what swap pages are in use is one 
more chunk of data that needs to be assembled and made available through a 
defined interface.
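
A minimal sketch of what the consumer of such a map might do, assuming the
interface hands over the set of in-use swap slot numbers.  The function name
and the slot-number representation are hypothetical; this is not an existing
kernel or kexec interface.

```python
from typing import List, Set


def pick_image_slots(total_slots: int, in_use: Set[int],
                     pages_needed: int) -> List[int]:
    """Choose swap slots for image pages, never reusing an in-use slot.

    in_use holds the slots that still contain memory swapped out before
    the hibernation; overwriting any of them would lose that data.
    """
    free = [slot for slot in range(total_slots) if slot not in in_use]
    if len(free) < pages_needed:
        raise ValueError("not enough free swap slots for the image")
    return free[:pages_needed]


# Slots 1, 3 and 4 hold pre-hibernation data, so the image must avoid them.
print(pick_image_slots(8, {1, 3, 4}, 3))
```

The point of the sketch is only that the saving side needs the in-use map
before it writes anything; how that map is exported is left open here.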

> (4) The user should be able to limit the size of a hibernation image
>
>    There are a couple of reasons of that.  For example, the storage space
>    used for saving the image may be smaller than the entire RAM or the user
>    may want the image to be saved quickier.

it may make sense for this to be split into hard and soft limits.

if you try to save more than the storage space can hold you cannot
continue, but if you are just a little over the arbitrary size limit that
was set to make things fast you are better off saving things as-is than
punting, going back to the system, trying to free more ram, and trying a
hibernate again.

with the kexec approach the enforcement of these limits is also split into
two sections.

when the hibernate command is given in the main kernel, its userspace
needs to follow some policy to decide how much (if any) memory to free.

this could be anything from 'none, try and save all caches' to 'anything 
you can to minimize the amount of data to be saved, trash all caches' to 
something in between like 'try and free up enough memory to get the saved 
data below 1G, but save caches beyond that point'

then when the second kernel runs, its userspace tools get the list of
what memory should be saved that the main kernel handed to it, and then
decide if this is acceptable (probably mostly the hard limits of 'can
this work') and proceed to save it somewhere.

but since the kexec command and the preparation of the devices can change
the memory, the estimates done by the first kernel's userspace are just 
that, estimates.
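
The hard/soft split described above could be sketched roughly like this.  The
names and the three-way result are assumptions made for illustration, not
part of any existing hibernation tool.

```python
from enum import Enum


class Decision(Enum):
    SAVE = "within both limits, save"
    SAVE_OVER_SOFT = "over the soft limit, save anyway"
    ABORT = "over the hard limit, cannot continue"


def check_image(image_bytes: int, storage_bytes: int,
                soft_limit_bytes: int) -> Decision:
    """Apply the hard limit (storage size) before the soft limit (speed)."""
    if image_bytes > storage_bytes:
        # Hard limit: the image simply does not fit.
        return Decision.ABORT
    if image_bytes > soft_limit_bytes:
        # Soft limit: slower than wanted, but cheaper than going back to
        # the main system, freeing more memory and trying again.
        return Decision.SAVE_OVER_SOFT
    return Decision.SAVE


GIB = 1024 ** 3
print(check_image(int(1.1 * GIB), 4 * GIB, 1 * GIB).value)
```

Because the first kernel's numbers are only estimates, a check of this kind
would have to run in the second kernel against the actual image size.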

> (7) On ACPI systems special platform-related actions have to be carried out at
>    the right points, so that the platform works correctly after the restore
>
>    The ACPI specification requires us to invoke some global ACPI methods
>    during the hibernation and during the restore.  Moreover, the ordering of
>    code related to these ACPI methods may not be arbitrary (eg. some of
>    them have to be executed after devices are put into low power states etc.).

for a pure hibernate mode, you will be powering off the box after saving 
the suspend image. why are there any special ACPI modes involved?

now, for suspend-to-ram you need to be aware of every possible power 
saving option you have and the cost of each of them, and here the ACPI 
modes are heavily used.

I think this is mixing suspend and hibernate still.

David Lang


* Re: Hibernation considerations
  2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
                   ` (3 preceding siblings ...)
  2007-07-15 20:13 ` david
@ 2007-07-15 20:35 ` Cornelius Riemenschneider
  2007-07-15 19:46   ` david
  2007-07-16  0:51 ` Matthew Garrett
  5 siblings, 1 reply; 220+ messages in thread
From: Cornelius Riemenschneider @ 2007-07-15 20:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, david, Al Boldi

Hi,
I think a (10) is needed, to be clear:

(10) This depends directly on (9)
Easy user interface for the average user: no one wants to edit several
files and use a long command line for a suspend (not everyone knows about
alias(P)), so please keep the effort needed to suspend the system low (eg.
with my TuxOnIce I just type 'hibernate' as root, or 'sudo hibernate' as a
normal user, and the suspend begins; the rest is done automatically by the
configuration, which for me means shutting down my wireless interface).

Cornelius Riemenschneider

My source of power: www.humppa.com


* Re: Hibernation considerations
  2007-07-15 22:38   ` Rafael J. Wysocki
@ 2007-07-15 22:27     ` david
  2007-07-17 17:40       ` Dr. David Alan Gilbert
  2007-07-29  6:53     ` Vojtech Pavlik
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-15 22:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Dr. David Alan Gilbert, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:

> On Sunday, 15 July 2007 14:58, Dr. David Alan Gilbert wrote:
>> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
>>
>>> (5) Hibernation should be transparent from the applications' point of view
>>>
>>>     Generally, applications should not notice that hibernation took place.
>>>     [Note that I don't regard all processes as applications and I think that
>>>     there may be processes which need to handle the hibernation in a special
>>>     way.]  Ideally, for example, if some audio is being played when a
>>>     hibernation starts, the audio player should be able to continue playing the
>>>     same audio after the restore from the point in which it has been
>>>     interrupted by the hibernation.  Also, the CPU affinities and similar
>>
>> That would be _so_ embarrassing in a library; I'd rather the audio
>> player had the opportunity to consider whether restarting was a good idea.
>>
>>> (6) State of devices from before hibernation should be restored, if possible
>>>
>>>     If possible, during a restore devices should be brought back to the same
>>>     state in which they were before the corresponding hibernation.  Of course
>>>     in some situations it might be impossible to do that (eg. the user
>>>     connected the hibernated system to a different IP subnet and then
>>>     restored), but as a general rule, we should do our best to restore the
>>>     state of devices, which is directly related to point (5) above.
>>
>> Or the user unplugs their flash drive after hibernation rather than before.
>>
>> Two things which I think would be nice to consider are:
>>    1) Encryption - I'd actually prefer if my luks device did not
>>        remember the key accross a hibernation; I want to be forced to
>>        reenter the phrase.  However I don't know what the best thing
>>        to do to partitions/applications using the luks device is.
>
> Encryption is possible with both the userland hibernation (aka uswsusp) and
> TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> have" feature for a framework to be generally useful (many users don't use it
> anyway).

he's talking about the main system using an encrypted device/partition,
not the hibernate image being stored encrypted.

This would require the main system to 'forget' the keys when it does the
hibernate and prompt for them again during the wake-up phase.

David Lang



* Re: Hibernation considerations
  2007-07-15 12:58 ` Dr. David Alan Gilbert
@ 2007-07-15 22:38   ` Rafael J. Wysocki
  2007-07-15 22:27     ` david
  2007-07-29  6:53     ` Vojtech Pavlik
  0 siblings, 2 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-15 22:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, david, Al Boldi

On Sunday, 15 July 2007 14:58, Dr. David Alan Gilbert wrote:
> * Rafael J. Wysocki (rjw@sisk.pl) wrote:
> 
> > (5) Hibernation should be transparent from the applications' point of view
> > 
> >     Generally, applications should not notice that hibernation took place.
> >     [Note that I don't regard all processes as applications and I think that
> >     there may be processes which need to handle the hibernation in a special
> >     way.]  Ideally, for example, if some audio is being played when a
> >     hibernation starts, the audio player should be able to continue playing the
> >     same audio after the restore from the point in which it has been
> >     interrupted by the hibernation.  Also, the CPU affinities and similar
> 
> That would be _so_ embarrassing in a library; I'd rather the audio
> player had the opportunity to consider whether restarting was a good idea.
> 
> > (6) State of devices from before hibernation should be restored, if possible
> > 
> >     If possible, during a restore devices should be brought back to the same
> >     state in which they were before the corresponding hibernation.  Of course
> >     in some situations it might be impossible to do that (eg. the user
> >     connected the hibernated system to a different IP subnet and then
> >     restored), but as a general rule, we should do our best to restore the
> >     state of devices, which is directly related to point (5) above.
> 
> Or the user unplugs their flash drive after hibernation rather than before.
> 
> Two things which I think would be nice to consider are:
>    1) Encryption - I'd actually prefer if my luks device did not
>        remember the key accross a hibernation; I want to be forced to
>        reenter the phrase.  However I don't know what the best thing
>        to do to partitions/applications using the luks device is.

Encryption is possible with both the userland hibernation (aka uswsusp) and
TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
have" feature for a framework to be generally useful (many users don't use it
anyway).

>    2) Some level of debugging needs to be available so that users can
>       provide something so you can see why something hasn't hibernated
>       or why (as in the case of this tosh laptop) it still takes power
>       during hibernation.

I agree.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-15 22:47   ` Rafael J. Wysocki
@ 2007-07-15 22:42     ` david
  2007-07-15 23:15       ` Alan Stern
  2007-07-15 23:22       ` Rafael J. Wysocki
  0 siblings, 2 replies; 220+ messages in thread
From: david @ 2007-07-15 22:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:

> On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
>> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> Hi,
>>>
>>> Since many alternative approaches to hibernation are now being considered and
>>> discussed, I thought it might be a good idea to list some things that in my not
>>> so humble opinion should be taken care of by any hibernation framework.  They
>>> are listed below, not in any particular order, because I think they all are
>>> important.  Still, I might have forgotten something, so everyone with
>>> experience in implementing hibernation, especially Pavel and Nigel, please
>>> check if the list is complete.
>>>
>>> (1) Filesystems mounted before the hibernation are untouchable
>>>
>>>    When there's a memory snapshot, either in the form of a hibernation image,
>>>    or in the form of the "old" kernel and processes available to the "new"
>>>    kexeced kernel responsible for saving their memory, the filesystems mounted
>>>    before the hibernation should not be accessed, even for reading, because
>>>    that would cause their on-disk state to be inconsistent with the snapshot
>>>    and might lead to a filesystem corruption.
>>
>> AFAIK this is only the case with ext3, all other filesystems could be
>> accessed read-only safely
>>
>> this is arguably a bug with ext3 (and has been discussed as such), but
>> right now the ext3 team has decided not to change this behavior so
>> hibernate needs to work around it. but don't mistake a work-around for a
>> single (admittedly very popular) filesystem with a hard and fast
>> directive.
>>
>>> (2) Swap space in use before the hibernation must be handled with care
>>>
>>>    If swap space is used for saving the memory snapshot, the snapshot-saving
>>>    application (or kernel) must be careful enough not to overwrite swap pages
>>>    that contain valid memory contents stored in there before the hibernation.
>>
>> true, in fact, given that many distros and live-CD's autodetect swap
>> partitions and consider them fair game, I would argue that the best thing
>> to do would be to have the main system free up its swap partitions before
>> going into hibernation.
>>
>> however, this could be a decision of the particular hibernate routines.
>>
>> for the kexec approach the mapping of what swap pages are in use is one
>> more chunk of data that needs to be assembled and made available through a
>> defined interface.
>>
>>> (4) The user should be able to limit the size of a hibernation image
>>>
>>>    There are a couple of reasons of that.  For example, the storage space
>>>    used for saving the image may be smaller than the entire RAM or the user
>>>    may want the image to be saved quickier.
>>
>> it may make sense for this to be split into hard and soft limits.
>>
>> if you try to save more than the storage space can hold you cannot
>> continue, but if you are just a little over the arbitrary size limit that
>> was set to make things fast you are better off saving things as-is than
>> punting, going back to the system, trying to free more ram, and trying a
>> hibernate again.
>>
>> with the kexec approach the enforcement of these limits is also split into
>> two sections.
>>
>> when the hibernate command is given in the main kernel, its userspace
>> needs to follow some policy to decide how much (if any) memory to free.
>
> How are you going to achieve this without (a) having hibernation-aware
> user space or (b) the freezer?

the hibernate command is a userspace command, but the fact that other 
things in userspace are running at the same time is exactly why this is 
only an estimate and best-effort as I said in the paragraph below.

>> but since the kexec command and the preparation of the devices can change
>> the memory, the estimates done by the first kernel's userspace are just
>> that, estimates.
>>
>>> (7) On ACPI systems special platform-related actions have to be carried out at
>>>    the right points, so that the platform works correctly after the restore
>>>
>>>    The ACPI specification requires us to invoke some global ACPI methods
>>>    during the hibernation and during the restore.  Moreover, the ordering of
>>>    code related to these ACPI methods may not be arbitrary (eg. some of
>>>    them have to be executed after devices are put into low power states etc.).
>>
>> for a pure hibernate mode, you will be powering off the box after saving
>> the suspend image. why are there any special ACPI modes involved?
>
> Because, for example, on my machine the status of power supply (present
> vs not present) is not updated correctly after the restore if ACPI callbacks
> aren't used during the hibernation.  That's just experience and it's in line
> with the ACPI spec.

so if a machine is actually powered off the /dev/suspend process won't 
work?

remember that the system may run a different OS between the hibernate and 
the resume, making any assumptions about what state the hardware is in
when you start the resume is a problem.

David Lang

>> now, for suspend-to-ram you need to be aware of every possible power
>> saving option you have and the cost of each of them, and here the ACPI
>> modes are heavily used.
>>
>> I think this is mixing suspend and hibernate still.
>
> Yes, it is, but it's not us doing the mixing.  We just need to handle some
> systems built with ACPI in mind.
>
> Greetings,
> Rafael
>
>
>


* Re: Hibernation considerations
  2007-07-15 20:13 ` david
@ 2007-07-15 22:47   ` Rafael J. Wysocki
  2007-07-15 22:42     ` david
  2007-07-15 23:17   ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-15 22:47 UTC (permalink / raw)
  To: david
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
> 
> > Hi,
> >
> > Since many alternative approaches to hibernation are now being considered and
> > discussed, I thought it might be a good idea to list some things that in my not
> > so humble opinion should be taken care of by any hibernation framework.  They
> > are listed below, not in any particular order, because I think they all are
> > important.  Still, I might have forgotten something, so everyone with
> > experience in implementing hibernation, especially Pavel and Nigel, please
> > check if the list is complete.
> >
> > (1) Filesystems mounted before the hibernation are untouchable
> >
> >    When there's a memory snapshot, either in the form of a hibernation image,
> >    or in the form of the "old" kernel and processes available to the "new"
> >    kexeced kernel responsible for saving their memory, the filesystems mounted
> >    before the hibernation should not be accessed, even for reading, because
> >    that would cause their on-disk state to be inconsistent with the snapshot
> >    and might lead to a filesystem corruption.
> 
> AFAIK this is only the case with ext3, all other filesystems could be 
> accessed read-only safely
> 
> this is arguably a bug with ext3 (and has been discussed as such), but 
> right now the ext3 team has decided not to change this behavior so
> hibernate needs to work around it. but don't mistake a work-around for a 
> single (admittedly very popular) filesystem with a hard and fast 
> directive.
> 
> > (2) Swap space in use before the hibernation must be handled with care
> >
> >    If swap space is used for saving the memory snapshot, the snapshot-saving
> >    application (or kernel) must be careful enough not to overwrite swap pages
> >    that contain valid memory contents stored in there before the hibernation.
> 
> true, in fact, given that many distros and live-CD's autodetect swap 
> partitions and consider them fair game, I would argue that the best thing 
> to do would be to have the main system free up its swap partitions before
> going into hibernation.
> 
> however, this could be a decision of the particular hibernate routines.
> 
> for the kexec approach the mapping of what swap pages are in use is one 
> more chunk of data that needs to be assembled and made available through a 
> defined interface.
> 
> > (4) The user should be able to limit the size of a hibernation image
> >
> >    There are a couple of reasons of that.  For example, the storage space
> >    used for saving the image may be smaller than the entire RAM or the user
> >    may want the image to be saved quickier.
> 
> it may make sense for this to be split into hard and soft limits.
> 
> if you try to save more than the storage space can hold you cannot
> continue, but if you are just a little over the arbitrary size limit that
> was set to make things fast you are better off saving things as-is than
> punting, going back to the system, trying to free more ram, and trying a
> hibernate again.
>
> with the kexec approach the enforcement of these limits is also split into
> two sections.
>
> when the hibernate command is given in the main kernel, its userspace
> needs to follow some policy to decide how much (if any) memory to free.

How are you going to achieve this without (a) having hibernation-aware
user space or (b) the freezer?

> this could be anything from 'none, try and save all caches' to 'anything 
> you can to minimize the amount of data to be saved, trash all caches' to 
> something in between like 'try and free up enough memory to get the saved 
> data below 1G, but save caches beyond that point'
> 
> then when the second kernel runs, its userspace tools get the list of
> what memory should be saved that the main kernel handed to it, and then
> decide if this is acceptable (probably mostly the hard limits of 'can
> this work') and proceed to save it somewhere.
>
> but since the kexec command and the preparation of the devices can change
> the memory, the estimates done by the first kernel's userspace are just 
> that, estimates.
> 
> > (7) On ACPI systems special platform-related actions have to be carried out at
> >    the right points, so that the platform works correctly after the restore
> >
> >    The ACPI specification requires us to invoke some global ACPI methods
> >    during the hibernation and during the restore.  Moreover, the ordering of
> >    code related to these ACPI methods may not be arbitrary (eg. some of
> >    them have to be executed after devices are put into low power states etc.).
> 
> for a pure hibernate mode, you will be powering off the box after saving 
> the suspend image. why are there any special ACPI modes involved?

Because, for example, on my machine the status of power supply (present
vs not present) is not updated correctly after the restore if ACPI callbacks
aren't used during the hibernation.  That's just experience and it's in line
with the ACPI spec.

> now, for suspend-to-ram you need to be aware of every possible power 
> saving option you have and the cost of each of them, and here the ACPI 
> modes are heavily used.
> 
> I think this is mixing suspend and hibernate still.

Yes, it is, but it's not us doing the mixing.  We just need to handle some
systems built with ACPI in mind.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-15 22:42     ` david
@ 2007-07-15 23:15       ` Alan Stern
  2007-07-15 23:38         ` Nigel Cunningham
  2007-07-15 23:41         ` david
  2007-07-15 23:22       ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-15 23:15 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007 david@lang.hm wrote:

> >> for a pure hibernate mode, you will be powering off the box after saving
> >> the suspend image. why are there any special ACPI modes involved?
> >
> > Because, for example, on my machine the status of power supply (present
> > vs not present) is not updated correctly after the restore if ACPI callbacks
> > aren't used during the hibernation.  That's just experience and it's in line
> > with the ACPI spec.
> 
> so if a machine is actually powered off the /dev/suspend process won't 
> work?
> 
> remember that the system may run a different OS between the hibernate and 
> the resume, makeing any assumptions about what state the hardware is in 
> when you start the resume is a problem.

As I understand it, running a different OS between the hibernate and 
the resume would violate the ACPI spec.

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 20:13 ` david
  2007-07-15 22:47   ` Rafael J. Wysocki
@ 2007-07-15 23:17   ` Alan Stern
  2007-07-15 23:53     ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-15 23:17 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007 david@lang.hm wrote:

> > (1) Filesystems mounted before the hibernation are untouchable
> >
> >    When there's a memory snapshot, either in the form of a hibernation image,
> >    or in the form of the "old" kernel and processes available to the "new"
> >    kexeced kernel responsible for saving their memory, the filesystems mounted
> >    before the hibernation should not be accessed, even for reading, because
> >    that would cause their on-disk state to be inconsistent with the snapshot
> >    and might lead to a filesystem corruption.
> 
> AFAIK this is only the case with ext3, all other filesystems could be 
> accessed read-only safely
> 
> this is arguably a bug with ext3 (and has been discussed as such), but 
> right now the ext3 team has decided not to change this bahavior so 
> hibernate needs to work around it. but don't mistake a work-around for a 
> single (admittedly very popular) filesystem with a hard and fast 
> directive.

Isn't it possible to avoid this problem by mounting an ext3 filesystem 
as readonly ext2?  Provided the filesystem isn't dirty it should be 
doable.  (And provided the filesystem doesn't use any ext3 extensions 
that are incompatible with ext2.)

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 22:42     ` david
  2007-07-15 23:15       ` Alan Stern
@ 2007-07-15 23:22       ` Rafael J. Wysocki
  2007-07-15 23:49         ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-15 23:22 UTC (permalink / raw)
  To: david
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Monday, 16 July 2007 00:42, david@lang.hm wrote:
> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
> >> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
> >>
> >>> Hi,
> >>>
> >>> Since many alternative approaches to hibernation are now being considered and
> >>> discussed, I thought it might be a good idea to list some things that in my not
> >>> so humble opinion should be taken care of by any hibernation framework.  They
> >>> are listed below, not in any particular order, because I think they all are
> >>> important.  Still, I might have forgotten something, so everyone with
> >>> experience in implementing hibernation, especially Pavel and Nigel, please
> >>> check if the list is complete.
> >>>
> >>> (1) Filesystems mounted before the hibernation are untouchable
> >>>
> >>>    When there's a memory snapshot, either in the form of a hibernation image,
> >>>    or in the form of the "old" kernel and processes available to the "new"
> >>>    kexeced kernel responsible for saving their memory, the filesystems mounted
> >>>    before the hibernation should not be accessed, even for reading, because
> >>>    that would cause their on-disk state to be inconsistent with the snapshot
> >>>    and might lead to a filesystem corruption.
> >>
> >> AFAIK this is only the case with ext3, all other filesystems could be
> >> accessed read-only safely
> >>
> >> this is arguably a bug with ext3 (and has been discussed as such), but
> >> right now the ext3 team has decided not to change this bahavior so
> >> hibernate needs to work around it. but don't mistake a work-around for a
> >> single (admittedly very popular) filesystem with a hard and fast
> >> directive.
> >>
> >>> (2) Swap space in use before the hibernation must be handled with care
> >>>
> >>>    If swap space is used for saving the memory snapshot, the snapshot-saving
> >>>    application (or kernel) must be careful enough not to overwrite swap pages
> >>>    that contain valid memory contents stored in there before the hibernation.
> >>
> >> true, in fact, given that many distros and live-CD's autodetect swap
> >> partitions and consider them fair game, I would argue that the best thing
> >> to do would be to have the main system free up it's swap partitions before
> >> going into hibernation.
> >>
> >> however, this could be a decision of the particular hibernate routines.
> >>
> >> for the kexec approach the mapping of what swap pages are in use is one
> >> more chunk of data that needs to be assembled and made available through a
> >> defined interface.
> >>
> >>> (4) The user should be able to limit the size of a hibernation image
> >>>
> >>>    There are a couple of reasons of that.  For example, the storage space
> >>>    used for saving the image may be smaller than the entire RAM or the user
> >>>    may want the image to be saved quickier.
> >>
> >> it may make sense for this to be split into hard and soft limits.
> >>
> >> if you try to save more then the storage space can hold you cannot
> >> continue, but if you are just a little over the arbatrary size limit that
> >> was set to make things fast you are better off saving things as-is then
> >> punting, going back to the system, trying to free more ram, and trying a
> >> hibernate again.
> >>
> >> with the kexec approach the enforcment of these limits is also split into
> >> two sections.
> >>
> >> when the hibernate command is given in the main kernel, it's userspace
> >> needs to follow some policy to decide how much (if any) memory to free.
> >
> > How are you going to achieve this without (a) having hibernation-aware
> > user space or (b) the freezer?
> 
> the hibernate command is a userspace command, but the fact that other 
> things in userspace are running at the same time is exactly why this is 
> only an estimate and best-effort as I said in the paragraph below.
> 
> >> but since the kexec command and the preporation of the devices can change
> >> the memory, the estimates done by the first kernel's userspace are just
> >> that, estimates.
> >>
> >>> (7) On ACPI systems special platform-related actions have to be carried out at
> >>>    the right points, so that the platform works correctly after the restore
> >>>
> >>>    The ACPI specification requires us to invoke some global ACPI methods
> >>>    during the hibernation and during the restore.  Moreover, the ordering of
> >>>    code related to these ACPI methods may not be arbitrary (eg. some of
> >>>    them have to be executed after devices are put into low power states etc.).
> >>
> >> for a pure hibernate mode, you will be powering off the box after saving
> >> the suspend image. why are there any special ACPI modes involved?
> >
> > Because, for example, on my machine the status of power supply (present
> > vs not present) is not updated correctly after the restore if ACPI callbacks
> > aren't used during the hibernation.  That's just experience and it's in line
> > with the ACPI spec.
> 
> so if a machine is actually powered off the /dev/suspend process won't 
> work?

No, it sort of works as usual, but after the restore the platform is not in the
correct state.

> remember that the system may run a different OS between the hibernate and 
> the resume, makeing any assumptions about what state the hardware is in 
> when you start the resume is a problem.

True, that's problematic.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 17:40     ` Al Boldi
@ 2007-07-15 23:28       ` Alan Stern
  2007-07-15 23:58         ` david
  2007-07-16  5:02         ` Al Boldi
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-15 23:28 UTC (permalink / raw)
  To: Al Boldi
  Cc: Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

On Sun, 15 Jul 2007, Al Boldi wrote:

> Alan Stern wrote:
> > On Sun, 15 Jul 2007, Al Boldi wrote:
> > > >     If possible, during a restore devices should be brought back to
> > > > the same state in which they were before the corresponding
> > > > hibernation.  Of course in some situations it might be impossible to
> > > > do that (eg. the user connected the hibernated system to a different
> > > > IP subnet and then restored), but as a general rule, we should do our
> > > > best to restore the state of devices, which is directly related to
> > > > point (5) above.
> > >
> > > This part could easily be handled by the normal kernel before and after
> > > resume.
> >
> > I agree with you except for the word "easily".  And there are some
> > things the kernel simply punts on (I'm thinking of the current VGA
> > font).
> 
> Why; can you explain?

From personal experience I can assure you that it hasn't been easy
getting the USB subsystem to restore devices following a hibernate.  
(In fact the current implementation goes against the spirit, if not the
letter, of the USB spec.)  And making it work requires user
intervention.

As for the VGA font, the effect is easy to see: Run setfont before 
hibernating; when you resume the original font will be back.  The 
kernel simply does not bother to save the VGA font information across a 
hibernate.

> > > This should be the responsibility of the kexec'd hibernating kernel. 
> > > Note though in (6), the normal kernel takes care of preparing devices,
> > > then the hibernating kernel dumps the image and either calls S4 or S3. 
> > > On resume from S3 it can immediately switch over to the normal kernel,
> > > and from S4 the known bootup would occur.
> >
> > Is it really that simple?  Somehow I doubt it.  In order for some
> > devices to remain available for the kexec'd kernel to use, they cannot
> > be suspended at the ACPI level.  So the kexec'd kernel will have to
> > handle the ACPI requirements for those devices.  Likewise, it would
> > have to handle the ACPI interactions which need to be done after all
> > devices are prepared for the transition to S3 or S4.
> 
> Ok, after applying the latest kexec patches, I was able to use the kexec'd 
> kernel to suspend to ram and resume to the normal kernel, while working 
> under a full-blown X session.  It went without a hitch.  All that is needed 
> now are the dump/restore hibernation-image routines.

That's exactly my point.  While doing suspend-to-RAM from a kexec'd 
kernel may be simple, saving the hibernation image will add 
complications.

> > > The latest hibernating kexec patches boot a kexec'd modular kernel with
> > > initramfs into crashkernel=16M@16M in less than one second.  Switch-back
> > > is almost instant.  Add to this the time required to either store or
> > > restore the image, and it may be obvious that this approach isn't
> > > slower, but maybe even faster than the current swsusp.
> >
> > Does that include the time required for probing PCI buses?  On my
> > desktop system, PCI probing incurs a five-second timeout delay because
> > of a bug in the BIOS's USB firmware.
> 
> Using a modular kernel you would only insmod those modules that you need to 
> dump the image, which is mainly the diskdriver.  If you wanted to dump it 
> onto USB flash, then you would insmod that driver, and if that driver is 
> slow due to a bug, then a fix should be in order.

I said nothing about dumping onto USB flash.  I was referring to PCI 
probing; presumably any reasonable kernel for desktop/laptop systems 
will include PCI support.

And the bug isn't in Linux; it is in the firmware.  There's no way to 
fix it short of a BIOS upgrade (and this particular BIOS is no longer 
supported).

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:15       ` Alan Stern
@ 2007-07-15 23:38         ` Nigel Cunningham
  2007-07-16 14:15           ` Alan Stern
  2007-07-15 23:41         ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-15 23:38 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman,
	Huang, Ying, Jeremy Maitin-Shepard, Kyle Moffett, Pavel Machek,
	pm list, Al Boldi

[-- Attachment #1: Type: text/plain, Size: 1267 bytes --]

Hi.

On Monday 16 July 2007 09:15:47 Alan Stern wrote:
> On Sun, 15 Jul 2007 david@lang.hm wrote:
> 
> > >> for a pure hibernate mode, you will be powering off the box after saving
> > >> the suspend image. why are there any special ACPI modes involved?
> > >
> > > Because, for example, on my machine the status of power supply (present
> > > vs not present) is not updated correctly after the restore if ACPI callbacks
> > > aren't used during the hibernation.  That's just experience and it's in line
> > > with the ACPI spec.
> > 
> > so if a machine is actually powered off the /dev/suspend process won't 
> > work?
> > 
> > remember that the system may run a different OS between the hibernate and 
> > the resume, makeing any assumptions about what state the hardware is in 
> > when you start the resume is a problem.
> 
> As I understand it, running a different OS between the hibernate and 
> the resume would violate the ACPI spec.

Well then, I know one or two people who would argue that the ACPI spec is 
faulty. :\

Regards,

Nigel
-- 
Nigel Cunningham
Christian Reformed Church of Cobden
103 Curdie Street, Cobden 3266, Victoria, Australia
Ph. +61 3 5595 1185 / +61 417 100 574
Communal Worship: 11 am Sunday.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:15       ` Alan Stern
  2007-07-15 23:38         ` Nigel Cunningham
@ 2007-07-15 23:41         ` david
  2007-07-16 14:21           ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-15 23:41 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007, Alan Stern wrote:

> On Sun, 15 Jul 2007 david@lang.hm wrote:
>
>>>> for a pure hibernate mode, you will be powering off the box after saving
>>>> the suspend image. why are there any special ACPI modes involved?
>>>
>>> Because, for example, on my machine the status of power supply (present
>>> vs not present) is not updated correctly after the restore if ACPI callbacks
>>> aren't used during the hibernation.  That's just experience and it's in line
>>> with the ACPI spec.
>>
>> so if a machine is actually powered off the /dev/suspend process won't
>> work?
>>
>> remember that the system may run a different OS between the hibernate and
>> the resume, makeing any assumptions about what state the hardware is in
>> when you start the resume is a problem.
>
> As I understand it, running a different OS between the hibernate and
> the resume would violate the ACPI spec.

then we need a third mode of operation.

mode 1: Suspend-to-ram

   the system is paused and put into a low-power mode but data remains in 
memory and the system stays awake enough to keep the memory refreshed.

mode 2: new

   the system is paused, data is stored to permanent media, and the system 
is put into an ultra-low power mode.

mode 3: hibernate

   the system is paused, data is stored to permanent media, and the system 
is powered off

with mode 3 there are no requirements or limitations about what can be 
done with the hardware before a resume (the resume could even take place 
on a different piece of identical hardware)

mode 2 could be what you are talking about doing, although I don't see any 
advantage of creating it in addition to mode 3, it doesn't use any less 
power and it locks the system so that it can't be used for anything else 
in the meantime. I guess if it was significantly faster to do than mode 3 
there may be _some_ reason to consider it, but I don't see the speed 
difference.
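
The three modes above can be laid out side by side. What follows is only a
rough sketch of the comparison: the names SleepMode, MODES and
hardware_reusable are made up for illustration and do not correspond to any
kernel interface, and the value for mode 1 is an inference rather than
something stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SleepMode:
    """One of the three modes described above (hypothetical structure)."""
    name: str
    image_location: str      # where the paused system's data lives
    power_state: str         # what the hardware does while asleep
    hardware_reusable: bool  # may the machine do anything else before resume?


MODES = {
    # mode 1: data stays in memory, so the machine cannot be reused (inferred)
    1: SleepMode("suspend-to-ram", "memory", "low power, RAM refreshed", False),
    # mode 2: image on permanent media, but the system stays locked
    2: SleepMode("new", "permanent media", "ultra-low power", False),
    # mode 3: image on permanent media, no limits on the hardware before resume
    3: SleepMode("hibernate", "permanent media", "powered off", True),
}

for number, mode in MODES.items():
    print(number, mode.name, mode.image_location, mode.hardware_reusable)
```

Seen this way, modes 2 and 3 differ only in the power state and in whether
the hardware is free in the meantime, which is the reason for doubting that
mode 2 is worth having.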

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:22       ` Rafael J. Wysocki
@ 2007-07-15 23:49         ` david
  2007-07-16 12:06           ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-15 23:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:

> On Monday, 16 July 2007 00:42, david@lang.hm wrote:
>> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
>>>> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
>>>>
>>>>>    The ACPI specification requires us to invoke some global ACPI methods
>>>>>    during the hibernation and during the restore.  Moreover, the ordering of
>>>>>    code related to these ACPI methods may not be arbitrary (eg. some of
>>>>>    them have to be executed after devices are put into low power states etc.).
>>>>
>>>> for a pure hibernate mode, you will be powering off the box after saving
>>>> the suspend image. why are there any special ACPI modes involved?
>>>
>>> Because, for example, on my machine the status of power supply (present
>>> vs not present) is not updated correctly after the restore if ACPI callbacks
>>> aren't used during the hibernation.  That's just experience and it's in line
>>> with the ACPI spec.
>>
>> so if a machine is actually powered off the /dev/suspend process won't
>> work?
>
> No, it sort of works as usual, but after the restore the platform is not in the
> correct state.

this is not hibernate as I and many others are thinking of it.

hibernate as we are thinking would work on basically any hardware, including 
things with no ACPI or power savings support. and the system could be in 
hibernate mode for any time period.

for that matter, after a system is put into hibernate mode the system 
could be completely disassembled and any components replaced and the 
system would work after a resume (assuming you still have access to the 
suspend image)

>> remember that the system may run a different OS between the hibernate and
>> the resume, makeing any assumptions about what state the hardware is in
>> when you start the resume is a problem.
>
> True, that's problematic.

putting it mildly.

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:17   ` Alan Stern
@ 2007-07-15 23:53     ` david
  2007-07-16  5:18       ` Jeremy Maitin-Shepard
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-15 23:53 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007, Alan Stern wrote:

> On Sun, 15 Jul 2007 david@lang.hm wrote:
>
>>> (1) Filesystems mounted before the hibernation are untouchable
>>>
>>>    When there's a memory snapshot, either in the form of a hibernation image,
>>>    or in the form of the "old" kernel and processes available to the "new"
>>>    kexeced kernel responsible for saving their memory, the filesystems mounted
>>>    before the hibernation should not be accessed, even for reading, because
>>>    that would cause their on-disk state to be inconsistent with the snapshot
>>>    and might lead to a filesystem corruption.
>>
>> AFAIK this is only the case with ext3, all other filesystems could be
>> accessed read-only safely
>>
>> this is arguably a bug with ext3 (and has been discussed as such), but
>> right now the ext3 team has decided not to change this bahavior so
>> hibernate needs to work around it. but don't mistake a work-around for a
>> single (admittedly very popular) filesystem with a hard and fast
>> directive.
>
> Isn't is possible to avoid this problem by mounting an ext3 filesystem
> as readonly ext2?  Provided the filesystem isn't dirty it should be
> doable.  (And provided the filesystem doesn't use any ext3 extensions
> that are incompatible with ext2.)

from the last discussion I saw on the kernel mailing list, no. the act of 
mounting the ext3 filesystem as ext2 read-only will change it as the 
unsupported extensions get turned off (and I think the journal contents at 
least are lost as part of this)

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:28       ` Alan Stern
@ 2007-07-15 23:58         ` david
  2007-07-16  5:02         ` Al Boldi
  1 sibling, 0 replies; 220+ messages in thread
From: david @ 2007-07-15 23:58 UTC (permalink / raw)
  To: Alan Stern
  Cc: Al Boldi, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Sun, 15 Jul 2007, Alan Stern wrote:

> On Sun, 15 Jul 2007, Al Boldi wrote:
>
>> Alan Stern wrote:
>>> On Sun, 15 Jul 2007, Al Boldi wrote:
>
>>>> This should be the responsibility of the kexec'd hibernating kernel.
>>>> Note though in (6), the normal kernel takes care of preparing devices,
>>>> then the hibernating kernel dumps the image and either calls S4 or S3.
>>>> On resume from S3 it can immediately switch over to the normal kernel,
>>>> and from S4 the known bootup would occur.
>>>
>>> Is it really that simple?  Somehow I doubt it.  In order for some
>>> devices to remain available for the kexec'd kernel to use, they cannot
>>> be suspended at the ACPI level.  So the kexec'd kernel will have to
>>> handle the ACPI requirements for those devices.  Likewise, it would
>>> have to handle the ACPI interactions which need to be done after all
>>> devices are prepared for the transition to S3 or S4.
>>
>> Ok, after applying the latest kexec patches, I was able to use the kexec'd
>> kernel to suspend to ram and resume to the normal kernel, while working
>> under a full-blown X session.  It went without a hitch.  All that is needed
>> now are the dump/restore hibernation-image routines.
>
> That's exactly my point.  While doing suspend-to-RAM from a kexec'd
> kernel may be simple, saving the hibernation image will add
> complications.

suspend-to-RAM should not involve kexec, the only reason for doing the 
kexec is to get a separate userspace to use for suspend-to-disk operations 
instead of trying to partially freeze the system and keep using it.

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
                   ` (4 preceding siblings ...)
  2007-07-15 20:35 ` Cornelius Riemenschneider
@ 2007-07-16  0:51 ` Matthew Garrett
  2007-07-16  0:51   ` david
  5 siblings, 1 reply; 220+ messages in thread
From: Matthew Garrett @ 2007-07-16  0:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, david, Al Boldi

On Sun, Jul 15, 2007 at 02:33:32PM +0200, Rafael J. Wysocki wrote:

(snip)

Many of these points are based on the assumption that we want to 
save a full image of RAM. I'm not convinced that this is true. The two 
things that we need are application state and hardware state. 
Application state can clearly be saved without kernel involvement 
(though restoring some of it may need some help from the kernel...), so 
hardware state is a more interesting question.

The obvious argument for saving the entirety of memory is that we have 
no mechanism for picking apart hardware state from any other part of the 
kernel. In reality, we're looking at implementing a set of hibernation 
operations anyway - it would be possible to utilise those to save as 
much state as needed. You also get fringe benefits, like being able to 
freeze a process that's accessing a piece of flaky hardware, swap the 
card out (assuming hotplug PCI), restore some amount of state and then 
let the process continue.

I appreciate that this suggestion sounds kind of fragile and 
complicated, but I think that's true of most descriptions of suspend to 
disk :) The main benefit is that it means we can use the hibernation 
infrastructure for other purposes (checkpointing, swapping hardware, 
that kind of thing) and reduce the damage caused by users doing 
seemingly reasonable things (like suspending Linux, booting Windows and 
then writing to a shared partition...). 

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-16  0:51 ` Matthew Garrett
@ 2007-07-16  0:51   ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-16  0:51 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007, Matthew Garrett wrote:

> On Sun, Jul 15, 2007 at 02:33:32PM +0200, Rafael J. Wysocki wrote:
>
> (snip)
>
> Many of these assumptions are based on the assumption that we want to
> save a full image of RAM. I'm not convinced that this is true. The two
> things that we need are application state and hardware state.
> Application state can clearly be saved without kernel involvement
> (though restoring some of it may need some help from the kernel...), so
> hardware state is a more interesting question.

one other reason for saving all the ram is that some people want to save 
all the system caches so that the machine is just as responsive immediately 
after the resume as it was before the hibernate.

> The obvious argument for saving the entirity of memory is that we have
> no mechanism for picking apart hardware state from any other part of the
> kernel. In reality, we're looking at implementing a set of hibernation
> operations anyway - it would be possible to utilise those to save as
> much state as needed. You also get fringe benefits, like being able to
> freeze a process that's accessing a piece of flaky hardware, swap the
> card out (assuming hotplug PCI), restore some amount of state and then
> let the process continue.
>
> I appreciate that this suggestion sounds kind of fragile and
> complicated, but I think that's true of most descriptions of suspend to
> disk :) The main benefit is that it means we can use the hibernation
> infrastructure for other purposes (checkpointing, swapping hardware,
> that kind of thing) and reduce the damage caused by users doing
> seemingly reasonable things (like suspending Linux, booting Windows and
> then writing to a shared partition...).

I see the order being a little different.

if anyone implements a reliable checkpoint/restore of applications then 
that could be used to pause specific applications, move applications 
from one machine to another, move applications from one kernel to another, 
and as a side effect, as a form of hibernation when you are willing to 
lose your cache in favor of storing as little info as possible (since you 
wouldn't need to store the cache memory or any kernel memory/state)

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:28       ` Alan Stern
  2007-07-15 23:58         ` david
@ 2007-07-16  5:02         ` Al Boldi
  2007-07-16  6:49           ` david
  2007-07-16 14:53           ` Alan Stern
  1 sibling, 2 replies; 220+ messages in thread
From: Al Boldi @ 2007-07-16  5:02 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

Alan Stern wrote:
> As for the VGA font, the effect is easy to see: Run setfont before
> hibernating; when you resume the original font will be back.  The
> kernel simply does not bother to save the VGA font information across a
> hibernate.

This could probably be handled by a device suspend/resume call; but this 
problem is not specific to kexec.

> > Ok, after applying the latest kexec patches, I was able to use the
> > kexec'd kernel to suspend to ram and resume to the normal kernel, while
> > working under a full-blown X session.  It went without a hitch.  All
> > that is needed now are the dump/restore hibernation-image routines.
>
> That's exactly my point.  While doing suspend-to-RAM from a kexec'd
> kernel may be simple, saving the hibernation image will add
> complications.

From a kexec'd hibernation kernel's point of view, both S3 and S4 look 
conceptually exactly the same.  The only difference is, in S3 the memory 
contents stay in RAM and in S4 they are on storage.  All device handling is 
exactly the same, so if there is a problem with device handling between the 
kexec'd hibernation kernel and the normal kernel, then that would have made 
itself visible.

david@lang.hm wrote:
> suspend-to-RAM should not involve kexec, the only reason for doing the
> kexec to to get a seperate userspace to use for suspend-to-disk operations
> instead of trying to partially freeze the sustem and keep useing it.

Or you could do suspend-to-disk-and-RAM.  But in the above case, it was meant 
to test kexec compatibility with device suspend/resume calls.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-15 23:53     ` david
@ 2007-07-16  5:18       ` Jeremy Maitin-Shepard
  0 siblings, 0 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-16  5:18 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, Rafael J. Wysocki, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

david@lang.hm writes:

[snip]

>> Isn't is possible to avoid this problem by mounting an ext3 filesystem
>> as readonly ext2?  Provided the filesystem isn't dirty it should be
>> doable.  (And provided the filesystem doesn't use any ext3 extensions
>> that are incompatible with ext2.)

> from the last discussion I saw on the kernel mailing list, no. the act of
> mounting the ext3 filesystem as ext2 read-only will change it as the unsupported
> extentions get turned off (and I think the journal contents at least are lost as
> part of this)

The fact of the matter is that it really doesn't matter whether mounting
it read-only actually corrupts the data on disk or not.  Regardless, it
should not be done, because you are accessing a dirty filesystem that is
still in use, and consequently there are no guarantees that either the
metadata or the file contents are consistent.  It isn't necessary for
hibernation to be able to access mounted partitions anyway.

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-16  5:02         ` Al Boldi
@ 2007-07-16  6:49           ` david
  2007-07-16 13:32             ` Al Boldi
  2007-07-16 14:53           ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-16  6:49 UTC (permalink / raw)
  To: Al Boldi
  Cc: Alan Stern, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Mon, 16 Jul 2007, Al Boldi wrote:

>
> david@lang.hm wrote:
>> suspend-to-RAM should not involve kexec, the only reason for doing the
>> kexec is to get a separate userspace to use for suspend-to-disk operations
>> instead of trying to partially freeze the system and keep using it.
>
> Or you could do suspend-to-disk-and-RAM.  But in the above case, it was meant
> to test kexec compatibility with device suspend/resume calls.

the point I am trying to make here is that there is no reason that the 
kexec approach needs to do _any_ suspend/resume calls.

all that is needed is the ability of the new kernel to initialize the 
devices it needs.

suspend-to-disk-and-ram could be implemented as three
separate steps

1. suspend-to-disk

2. resume-from-disk

3. suspend-to-ram

followed by either

4. resume-from-ram

or

4. battery dies and laptop powers off completely

5. power-on boot.

6. resume-from-disk

all that you need to do is to make sure that the system doesn't run 
anything that would affect permanent media or the outside world between 
steps #2 and #3

yes it's more steps, but each step is logically separate, and each step will
be exercised on a regular basis, so the combination of the steps will
also be reliable.

this is far better than creating yet another way to pause the system that
only a few people will use.
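
A minimal sketch of the sequence above, written as a small state table.  The
state and event names are invented for illustration and nothing here touches
real kernel or ACPI interfaces; it only shows that both endings of the scheme
(resume-from-ram, or power loss followed by resume-from-disk) lead back to a
running system, and that steps taken out of order are rejected.

```python
# Hypothetical model of the suspend-to-disk-and-ram sequence described above.
# States and events are illustrative names, not kernel interfaces.
TRANSITIONS = {
    ("running", "suspend-to-disk"): "image-saved",
    ("image-saved", "resume-from-disk"): "restored-quiet",
    ("restored-quiet", "suspend-to-ram"): "in-ram",
    ("in-ram", "resume-from-ram"): "running",
    ("in-ram", "battery-dies"): "powered-off",
    ("powered-off", "power-on-boot"): "booted",
    ("booted", "resume-from-disk"): "running",
}


def run(events, state="running"):
    """Apply events in order; raise ValueError on a step the scheme forbids."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not allowed in state {state!r}")
        state = TRANSITIONS[key]
    return state


# Steps 1-3 are shared by both endings.
COMMON = ["suspend-to-disk", "resume-from-disk", "suspend-to-ram"]
WAKE_FROM_RAM = COMMON + ["resume-from-ram"]
WAKE_AFTER_POWER_LOSS = COMMON + [
    "battery-dies", "power-on-boot", "resume-from-disk"
]
```

The "restored-quiet" state stands for the window between steps #2 and #3, in
which nothing may touch permanent media or the outside world.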

David Lang


* Re: Hibernation considerations
  2007-07-15 23:49         ` david
@ 2007-07-16 12:06           ` Rafael J. Wysocki
       [not found]             ` <20070716123849.GC14212@grifter.jdc.home>
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-16 12:06 UTC (permalink / raw)
  To: david
  Cc: LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Monday, 16 July 2007 01:49, david@lang.hm wrote:
> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Monday, 16 July 2007 00:42, david@lang.hm wrote:
> >> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
> >>
> >>> On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
> >>>> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
> >>>>
> >>>>>    The ACPI specification requires us to invoke some global ACPI methods
> >>>>>    during the hibernation and during the restore.  Moreover, the ordering of
> >>>>>    code related to these ACPI methods may not be arbitrary (eg. some of
> >>>>>    them have to be executed after devices are put into low power states etc.).
> >>>>
> >>>> for a pure hibernate mode, you will be powering off the box after saving
> >>>> the suspend image. why are there any special ACPI modes involved?
> >>>
> >>> Because, for example, on my machine the status of power supply (present
> >>> vs not present) is not updated correctly after the restore if ACPI callbacks
> >>> aren't used during the hibernation.  That's just experience and it's in line
> >>> with the ACPI spec.
> >>
> >> so if a machine is actually powered off the /dev/suspend process won't
> >> work?
> >
> > No, it sort of works as usual, but after the restore the platform is not in the
> > correct state.
> 
> this is not hibernate as I and many others are thinking of it.
> 
> hibernate as we are thinking would work on basically any hardware, including 
> things with no ACPI or power savings support. and the system could be in 
> hibernate mode for any time period.
> 
> for that matter, after a system is put into hibernate mode the system 
> could be completely disassembled and any components replaced and the 
> system would work after a resume (assuming you still have access to the 
> suspend image)

Well, this is not how ACPI defines the S4 sleep state.  If the system is in
S4, which corresponds to our hibernation, you are _not_ allowed to disassemble
it.

I've just done an experiment on my test desktop.  I had enabled suspend support
in the CMOS setup and afterwards I made Linux hibernate in the "platform" mode.
Then, when the system was powered on, the BIOS showed me a nice "Resume from
hibernation" screen that is not normally displayed during boot.  This clearly
means that some information has been preserved by the platform across the
hibernate/restore cycle.  We are supposed to handle that.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-16  6:49           ` david
@ 2007-07-16 13:32             ` Al Boldi
  2007-07-17  4:33               ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Al Boldi @ 2007-07-16 13:32 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

david@lang.hm wrote:
> On Mon, 16 Jul 2007, Al Boldi wrote:
> > david@lang.hm wrote:
> >> suspend-to-RAM should not involve kexec, the only reason for doing the
> >> kexec is to get a separate userspace to use for suspend-to-disk
> >> operations instead of trying to partially freeze the system and keep
> >> using it.
> >
> > Or you could do suspend-to-disk-and-RAM.  But in the above case, it was
> > meant to test kexec compatibility with device suspend/resume calls.
>
> the point I am trying to make here is that there is no reason that the
> kexec approach needs to do _any_ suspend/resume calls.

When you go through ACPI, then that implies calling suspend/resume calls.

> all that is needed is the ability of the new kernel to initialize the
> devices it needs.

We have to go through ACPI, for wakeup functions to succeed.  A simple 
power-off won't do.

> suspend-to-disk-and-ram could be implemented as three
> separate steps
>
> 1. suspend-to-disk
>
> 2. resume-from-disk
>
> 3. suspend-to-ram
>
> followed by either
>
> 4. resume-from-ram
>
> or
>
> 4. battery dies and laptop powers off completely
>
> 5. power-on boot.
>
> 6. resume-from-disk
>
> all that you need to do is to make sure that the system doesn't run
> anything that would affect permanent media or the outside world between
> steps #2 and #3

Exactly, which is why your scheme would break down on #3, and that's why you 
need to call S3 from within the kexec'd hibernation kernel after saving the 
hibernation image.


Thanks!

--
Al



* Re: Hibernation considerations
  2007-07-15 23:38         ` Nigel Cunningham
@ 2007-07-16 14:15           ` Alan Stern
  2007-07-16 15:25             ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-16 14:15 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: david, Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman,
	Huang, Ying, Jeremy Maitin-Shepard, Kyle Moffett, Pavel Machek,
	pm list, Al Boldi

On Mon, 16 Jul 2007, Nigel Cunningham wrote:

> > As I understand it, running a different OS between the hibernate and 
> > the resume would violate the ACPI spec.
> 
> Well then, I know one or two people who would argue that the ACPI spec is 
> faulty. :\

It's hard to argue against that.  In fact there are people who would 
argue that the ACPI spec is incomplete and inconsistent period, 
regardless of any questions about what happens between hibernate and 
resume.  :-)

Alan Stern



* Re: Hibernation considerations
  2007-07-15 23:41         ` david
@ 2007-07-16 14:21           ` Alan Stern
  2007-07-17  4:45             ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-16 14:21 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Sun, 15 Jul 2007 david@lang.hm wrote:

> then we need a third mode of operation.
> 
> mode 1: Suspend-to-ram
> 
>    the system is paused and put into a low-power mode but data remains in 
> memory and the system stays awake enough to keep the memory refreshed.
> 
> mode 2: new
> 
>    the system is paused, data is stored to permanent media, and the system 
> is put into an ultra-low power mode.
> 
> mode 3: hibernate
> 
>    the system is paused, data is stored to permanent media, and the system 
> is powered off
> 
> with mode 3 there are no requirements or limitations about what can be 
> done with the hardware before a resume (the resume could even take place 
> on a different piece of identical hardware)
> 
> mode 2 could be what you are talking about doing, although I don't see any 
> advantage of creating it in addition to mode 3, it doesn't use any less 
> power and it locks the system so that it can't be used for anything else 
> in the meantime. I guess if it was significantly faster to do than mode 3 
> there may be _some_ reason to consider it, but I don't see the speed 
> difference.

Part of the problem here is that ACPI already has its own terminology, 
and you're trying to invent a new one instead of using the existing 
one.

I agree, it would be good to have a non-ACPI-specific hibernation mode,
something which would look to ACPI like a normal shutdown.  But I'm not 
so sure this is possible.

You have to understand that the ACPI spec is weird and complex.  The
mere fact that you have written a system image to disk changes the way
ACPI regards the shutdown procedure.  Even though you may treat all the
devices and the rest of the hardware exactly the same, it's a different
operation as far as ACPI is concerned, with different requirements.

Yes, it's bizarre.  Why do you think so many people have complained so 
vehemently about ACPI for all these years?

Alan Stern



* Re: Hibernation considerations
  2007-07-16  5:02         ` Al Boldi
  2007-07-16  6:49           ` david
@ 2007-07-16 14:53           ` Alan Stern
  2007-07-16 16:51             ` Al Boldi
  1 sibling, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-16 14:53 UTC (permalink / raw)
  To: Al Boldi
  Cc: Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

On Mon, 16 Jul 2007, Al Boldi wrote:

> From a kexec'd hibernation kernel pov, both S3 and S4 look conceptually 
> exactly the same.  The only difference is, in S3 the memory is in memory and 
> in S4 the memory is on storage.  All device handling is exactly the same, so 
> if there is a problem with device handling between the kexec'd hibernation 
> kernel and the normal kernel, then that would have made itself visible.

You have contradicted yourself.  "In S3 the memory is in memory and in
S4 the memory is on storage".  How does the memory get onto storage?  
The kexec'd hibernation kernel writes it there.  To do so it accesses a 
storage device.

Consequently the device handling _cannot_ be exactly the same in S3 and
S4.

Alan Stern



* Re: Hibernation considerations
  2007-07-16 14:15           ` Alan Stern
@ 2007-07-16 15:25             ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-16 15:25 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, david, LKML, Andrew Morton, Eric W. Biederman,
	Huang, Ying, Jeremy Maitin-Shepard, Kyle Moffett, Pavel Machek,
	pm list, Al Boldi

On Monday, 16 July 2007 16:15, Alan Stern wrote:
> On Mon, 16 Jul 2007, Nigel Cunningham wrote:
> 
> > > As I understand it, running a different OS between the hibernate and 
> > > the resume would violate the ACPI spec.
> > 
> > Well then, I know one or two people who would argue that the ACPI spec is 
> > faulty. :\
> 
> It's hard to argue against that.  In fact there are people who would 
> argue that the ACPI spec is incomplete and inconsistent period, 
> regardless of any questions about what happens between hibernate and 
> resume.  :-)

That's true, but OTOH it's difficult to argue with my notebook's firmware
about that. ;-)

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
       [not found]             ` <20070716123849.GC14212@grifter.jdc.home>
@ 2007-07-16 15:29               ` Rafael J. Wysocki
  2007-07-17  4:28                 ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-16 15:29 UTC (permalink / raw)
  To: Jim Crilly
  Cc: david, LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Monday, 16 July 2007 14:38, Jim Crilly wrote:
> On 07/16/07 02:06:27PM +0200, Rafael J. Wysocki wrote:
> > On Monday, 16 July 2007 01:49, david@lang.hm wrote:
> > > On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
> > > 
> > > > On Monday, 16 July 2007 00:42, david@lang.hm wrote:
> > > >> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
> > > >>
> > > >>> On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
> > > >>>> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
> > > >>>>
> > > >>>>>    The ACPI specification requires us to invoke some global ACPI methods
> > > >>>>>    during the hibernation and during the restore.  Moreover, the ordering of
> > > >>>>>    code related to these ACPI methods may not be arbitrary (eg. some of
> > > >>>>>    them have to be executed after devices are put into low power states etc.).
> > > >>>>
> > > >>>> for a pure hibernate mode, you will be powering off the box after saving
> > > >>>> the suspend image. why are there any special ACPI modes involved?
> > > >>>
> > > >>> Because, for example, on my machine the status of power supply (present
> > > >>> vs not present) is not updated correctly after the restore if ACPI callbacks
> > > >>> aren't used during the hibernation.  That's just experience and it's in line
> > > >>> with the ACPI spec.
> > > >>
> > > >> so if a machine is actually powered off the /dev/suspend process won't
> > > >> work?
> > > >
> > > > No, it sort of works as usual, but after the restore the platform is not in the
> > > > correct state.
> > > 
> > > this is not hibernate as I and many others are thinking of it.
> > > 
> > > hibernate as we are thinking would work on basically any hardware, including 
> > > things with no ACPI or power savings support. and the system could be in 
> > > hibernate mode for any time period.
> > > 
> > > for that matter, after a system is put into hibernate mode the system 
> > > could be completely disassembled and any components replaced and the 
> > > system would work after a resume (assuming you still have access to the 
> > > suspend image)
> > 
> > Well, this is not how ACPI defines the S4 sleep state.  If the system is in
> > S4, that corresponds to our hibernation, you are _not_ allowed to disassemble
> > it.
> > 
> > I've just done an experiment on my test desktop.  I had enabled suspend support
> > in the CMOS setup and afterwards I made Linux hibernate in the "platform" mode.
> > Then, when the system was powered on, the BIOS showed me a nice "Resume from
> > hibernation" screen that is not normally displayed during boot.  This clearly
> > means that some information has been preserved by the platform across the
> > hibernate/restore cycle.  We are supposed to handle that.
> > 
> 
> What I believe he's getting at is that Linux hibernation shouldn't be tied
> to any ACPI states. Yes, when available and working most people will want
> to enter ACPI S4 but we should still have the option of doing a normal
> poweroff. With the latter method it would look just like regular power off/on
> cycle to the firmware. And that would definitely be useful for things like
> working around buggy ACPI implementations or supporting platforms that don't
> do ACPI at all. That is the difference between the platform and shutdown
> options in /sys/power/disk, isn't it?

Yes, but this is not my point.

The point is that there are systems that _require_ the ACPI handling to work
correctly after the restore and we need to take _that_ into consideration.

IOW, there are people for whom the non-ACPI framework won't work as expected.
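
For reference, the two behaviours being contrasted are the "platform" and
"shutdown" entries of /sys/power/disk mentioned above.  A minimal sketch of
reading that file, assuming its usual format (space-separated mode names with
the currently selected one in square brackets); the helper name and the sample
string are made up for illustration:

```python
def parse_power_disk(text):
    """Return (selected_mode, available_modes) from /sys/power/disk contents.

    Assumes space-separated mode names with the active one in [brackets].
    """
    selected = None
    modes = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        modes.append(token)
    return selected, modes


# Sample contents (illustrative, not captured from a real machine).
sample = "[platform] shutdown reboot\n"
selected, modes = parse_power_disk(sample)
print(selected, modes)
```

On a real system the text would come from reading /sys/power/disk; writing
"shutdown" to that file before writing "disk" to /sys/power/state selects the
plain power-off path instead of the ACPI one.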

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-16 14:53           ` Alan Stern
@ 2007-07-16 16:51             ` Al Boldi
  2007-07-17  4:37               ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Al Boldi @ 2007-07-16 16:51 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david

Alan Stern wrote:
> On Mon, 16 Jul 2007, Al Boldi wrote:
> > From a kexec'd hibernation kernel pov, both S3 and S4 look conceptually
> > exactly the same.  The only difference is, in S3 the memory is in memory
> > and in S4 the memory is on storage.  All device handling is exactly the
> > same, so if there is a problem with device handling between the kexec'd
> > hibernation kernel and the normal kernel, then that would have made
> > itself visible.
>
> You have contradicted yourself.  "In S3 the memory is in memory and in
> S4 the memory is on storage".  How does the memory get onto storage?
> The kexec'd hibernation kernel writes it there.  To do so it accesses a
> storage device.
>
> Consequently the device handling _cannot_ be exactly the same in S3 and
> S4.

Ok, you should have read this in the context of suspending/resuming from/to 
the normal kernel, and in that case they are exactly the same, i.e. kexec -e 
for suspend and kexec -j for resume.

BTW, it would be really helpful if people would actually try the kexec 
hibernation patches, as this may yield a much more constructive discussion.


Thanks!

--
Al



* Re: Hibernation considerations
  2007-07-16 15:29               ` Rafael J. Wysocki
@ 2007-07-17  4:28                 ` david
  2007-07-17 10:42                   ` Matthew Garrett
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17  4:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jim Crilly, LKML, Alan Stern, Andrew Morton, Eric W. Biederman,
	Huang, Ying, Jeremy Maitin-Shepard, Kyle Moffett,
	Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:

> On Monday, 16 July 2007 14:38, Jim Crilly wrote:
>> On 07/16/07 02:06:27PM +0200, Rafael J. Wysocki wrote:
>>> On Monday, 16 July 2007 01:49, david@lang.hm wrote:
>>>> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
>>>>
>>>>> On Monday, 16 July 2007 00:42, david@lang.hm wrote:
>>>>>> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
>>>>>>
>>>>>>> On Sunday, 15 July 2007 22:13, david@lang.hm wrote:
>>>>>>>> On Sun, 15 Jul 2007, Rafael J. Wysocki wrote:
>>>>>>>>
>>>>>>>>>    The ACPI specification requires us to invoke some global ACPI methods
>>>>>>>>>    during the hibernation and during the restore.  Moreover, the ordering of
>>>>>>>>>    code related to these ACPI methods may not be arbitrary (eg. some of
>>>>>>>>>    them have to be executed after devices are put into low power states etc.).
>>>>>>>>
>>>>>>>> for a pure hibernate mode, you will be powering off the box after saving
>>>>>>>> the suspend image. why are there any special ACPI modes involved?
>>>>>>>
>>>>>>> Because, for example, on my machine the status of power supply (present
>>>>>>> vs not present) is not updated correctly after the restore if ACPI callbacks
>>>>>>> aren't used during the hibernation.  That's just experience and it's in line
>>>>>>> with the ACPI spec.
>>>>>>
>>>>>> so if a machine is actually powered off the /dev/suspend process won't
>>>>>> work?
>>>>>
>>>>> No, it sort of works as usual, but after the restore the platform is not in the
>>>>> correct state.
>>>>
>>>> this is not hibernate as I and many others are thinking of it.
>>>>
>>>> hibernate as we are thinking would work on basically any hardware, including
>>>> things with no ACPI or power savings support. and the system could be in
>>>> hibernate mode for any time period.
>>>>
>>>> for that matter, after a system is put into hibernate mode the system
>>>> could be completely disassembled and any components replaced and the
>>>> system would work after a resume (assuming you still have access to the
>>>> suspend image)
>>>
>>> Well, this is not how ACPI defines the S4 sleep state.  If the system is in
>>> S4, that corresponds to our hibernation, you are _not_ allowed to disassemble
>>> it.
>>>
>>> I've just done an experiment on my test desktop.  I had enabled suspend support
>>> in the CMOS setup and afterwards I made Linux hibernate in the "platform" mode.
>>> Then, when the system was powered on, the BIOS showed me a nice "Resume from
>>> hibernation" screen that is not normally displayed during boot.  This clearly
>>> means that some information has been preserved by the platform across the
>>> hibernate/restore cycle.  We are supposed to handle that.
>>>
>>
>> What I believe he's getting at is that Linux hibernation shouldn't be tied
>> to any ACPI states. Yes, when available and working most people will want
>> to enter ACPI S4 but we should still have the option of doing a normal
>> poweroff. With the latter method it would look just like regular power off/on
>> cycle to the firmware. And that would definitely be useful for things like
>> working around buggy ACPI implementations or supporting platforms that don't
>> do ACPI at all. That is the difference between the platform and shutdown
>> options in /sys/power/disk, isn't it?
>
> Yes, but this is not my point.
>
> The point is that there are systems that _require_ the ACPI handling to work
> correctly after the restore and we need to take _that_ into consideration.
>
> IOW, there are people for whom the non-ACPI framework won't work as expected.

why would the type of hibernate that I'm talking about (power off, not S4 
mode) not work on a box that has ACPI?

you are saying that it wouldn't work right after a restore, but that's
only if the restored image is expecting to have its devices in ACPI sleep
modes. my answer is "don't do that then". S4 mode does not save power
compared to powering off, and when doing a hibernate you should be
prepared to run out of battery before you wake up, which would then look
like a power-on

the S4 mode sounds like a deeper version of suspend-to-ram, with all of
the limitations of STR because you still require power continuously and
must remain in complete control of the machine.

if you want to do a suspend to S4 mode, go for it, but we need to have a
different name for that than for the suspend-to-disk-and-power-off mode
that I thought was hibernate

David Lang


* Re: Hibernation considerations
  2007-07-16 13:32             ` Al Boldi
@ 2007-07-17  4:33               ` david
  2007-07-17 12:08                 ` Al Boldi
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17  4:33 UTC (permalink / raw)
  To: Al Boldi
  Cc: Alan Stern, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Mon, 16 Jul 2007, Al Boldi wrote:

> david@lang.hm wrote:
>> On Mon, 16 Jul 2007, Al Boldi wrote:
>>> david@lang.hm wrote:
>>>> suspend-to-RAM should not involve kexec, the only reason for doing the
>>>> kexec is to get a separate userspace to use for suspend-to-disk
>>>> operations instead of trying to partially freeze the system and keep
>>>> using it.
>>>
>>> Or you could do suspend-to-disk-and-RAM.  But in the above case, it was
>>> meant to test kexec compatibility with device suspend/resume calls.
>>
>> the point I am trying to make here is that there is no reason that the
>> kexec approach needs to do _any_ suspend/resume calls.
>
> When you go through ACPI, then that implies calling suspend/resume calls.

then don't go through ACPI

>> all that is needed is the ability of the new kernel to initialize the
>> devices it needs.
>
> We have to go through ACPI, for wakeup functions to succeed.  A simple
> power-off won't do.

the kexec switch being posted requires ACPI be disabled, so it's clearly 
possible to switch kernels and initialize devices without ACPI

>> suspend-to-disk-and-ram could be implemented as three
>> separate steps
>>
>> 1. suspend-to-disk
>>
>> 2. resume-from-disk
>>
>> 3. suspend-to-ram
>>
>> followed by either
>>
>> 4. resume-from-ram
>>
>> or
>>
>> 4. battery dies and laptop powers off completely
>>
>> 5. power-on boot.
>>
>> 6. resume-from-disk
>>
>> all that you need to do is to make sure that the system doesn't run
>> anything that would affect permanent media or the outside world between
>> steps #2 and #3
>
> Exactly, which is why your scheme would break down on #3, and that's why you
> need to call S3 from within the kexec'd hibernation kernel after saving the
> hibernation image.

when a kexec is called, how does the kernel know what to execute? 
something needs to tell it what to do, and I think that something is 
either something in the kexec image, or it's something passed as a 
parameter to that image.

all that would be needed to do #3 safely is to have the kernel that you 
restarted on #2 do a suspend-to-ram before it does anything else.

David Lang


* Re: Hibernation considerations
  2007-07-16 16:51             ` Al Boldi
@ 2007-07-17  4:37               ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-17  4:37 UTC (permalink / raw)
  To: Al Boldi
  Cc: Alan Stern, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Mon, 16 Jul 2007, Al Boldi wrote:

> Alan Stern wrote:
>> On Mon, 16 Jul 2007, Al Boldi wrote:
>>> From a kexec'd hibernation kernel pov, both S3 and S4 look conceptually
>>> exactly the same.  The only difference is, in S3 the memory is in memory
>>> and in S4 the memory is on storage.  All device handling is exactly the
>>> same, so if there is a problem with device handling between the kexec'd
>>> hibernation kernel and the normal kernel, then that would have made
>>> itself visible.
>>
>> You have contradicted yourself.  "In S3 the memory is in memory and in
>> S4 the memory is on storage".  How does the memory get onto storage?
>> The kexec'd hibernation kernel writes it there.  To do so it accesses a
>> storage device.
>>
>> Consequently the device handling _cannot_ be exactly the same in S3 and
>> S4.
>
> Ok, you should have read this in the context of suspending/resuming from/to
> the normal kernel, and in that case they are exactly the same, i.e. kexec -e
> for suspend and kexec -j for resume.
>
> BTW, it would be really helpful if people would actually try the kexec
> hibernation patches, as this may yield a much more constructive discussion.

I would love to, but so far I don't see the necessary pieces

once I kexec to the new kernel, how can it find out what pages of memory 
(and swap) need to be saved?

If I knew that then I could write a trivial perl program to save those
pages with the appropriate headers to make a suspend file.

however, I'm now being told that that suspend file won't work if the
machine is actually powered off, so there's a need for something more for
the wake-up side of things as well.
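
As a rough illustration of the "trivial program" idea above: which pages need
saving is exactly the open question, so this sketch simply assumes a list of
(page frame number, page contents) pairs is handed to it.  The header layout,
the magic string and the function names are invented here and do not
correspond to any existing image format.

```python
import io
import struct

MAGIC = b"HIBSKTCH"               # invented 8-byte tag, not a real format
HEADER = struct.Struct("<8sIQ")   # magic, page size, page count


def save_pages(stream, pages, page_size=4096):
    """Write a header, then each page as an 8-byte pfn plus its raw bytes."""
    stream.write(HEADER.pack(MAGIC, page_size, len(pages)))
    for pfn, data in pages:
        if len(data) != page_size:
            raise ValueError(
                f"page {pfn} has {len(data)} bytes, expected {page_size}"
            )
        stream.write(struct.pack("<Q", pfn))
        stream.write(data)


def load_pages(stream):
    """Read back what save_pages wrote; returns a list of (pfn, bytes)."""
    magic, page_size, count = HEADER.unpack(stream.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a sketch image")
    pages = []
    for _ in range(count):
        (pfn,) = struct.unpack("<Q", stream.read(8))
        pages.append((pfn, stream.read(page_size)))
    return pages
```

A round trip through an in-memory buffer (io.BytesIO) is enough to check the
layout; nothing here addresses the wake-up side.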

David Lang


* Re: Hibernation considerations
  2007-07-16 14:21           ` Alan Stern
@ 2007-07-17  4:45             ` david
  2007-07-17 14:15               ` Alan Stern
  2007-07-21 10:17               ` Pavel Machek
  0 siblings, 2 replies; 220+ messages in thread
From: david @ 2007-07-17  4:45 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007, Alan Stern wrote:

> On Sun, 15 Jul 2007 david@lang.hm wrote:
>
>> then we need a third mode of operation.
>>
>> mode 1: Suspend-to-ram
>>
>>    the system is paused and put into a low-power mode but data remains in
>> memory and the system stays awake enough to keep the memory refreshed.
>>
>> mode 2: new
>>
>>    the system is paused, data is stored to permanent media, and the system
>> is put into an ultra-low power mode.
>>
>> mode 3: hibernate
>>
>>    the system is paused, data is stored to permanent media, and the system
>> is powered off
>>
>> with mode 3 there are no requirements or limitations about what can be
>> done with the hardware before a resume (the resume could even take place
>> on a different piece of identical hardware)
>>
>> mode 2 could be what you are talking about doing, although I don't see any
>> advantage of creating it in addition to mode 3, it doesn't use any less
>> power and it locks the system so that it can't be used for anything else
>> in the meantime. I guess if it was significantly faster to do than mode 3
>> there may be _some_ reason to consider it, but I don't see the speed
>> difference.
>
> Part of the problem here is that ACPI already has its own terminology,
> and you're trying to invent a new one instead of using the existing
> one.
>
> I agree, it would be good to have a non-ACPI-specific hibernation mode,
> something which would look to ACPI like a normal shutdown.  But I'm not
> so sure this is possible.

why would it not be possible?

> You have to understand that the ACPI spec is weird and complex.  The
> mere fact that you have written a system image to disk changes the way
> ACPI regards the shutdown procedure.  Even though you may treat all the
> devices and the rest of the hardware exactly the same, it's a different
> operation as far as ACPI is concerned, with different requirements.
>
> Yes, it's bizarre.  Why do you think so many people have complained so
> vehemently about ACPI for all these years?

so let's act as if ACPI doesn't exist and make a suspend-to-disk that 
works without it and looks to ACPI like a complete power off/on cycle (but 
looks to the user like a suspend/resume cycle)

this should avoid all the headaches about ACPI completely because you just
don't make any ACPI calls at all.

this will also work on any type of system, and it will work in the
presence of dead batteries, failed components, booting other OS's, moving
the image to different hardware, etc.

I can't think of anything much more frustrating than thinking that I
suspended a system and then discovering that because the battery went dead
(a complete power loss) the system wouldn't boot up properly. to me
this would be a fairly common condition (when I'm mobile I use the machine
until I am out of battery, then stop, and it may be a long time (days)
before I can charge the thing up again). this would not be a reliable
suspend as far as I'm concerned.

for suspend-to-ram you have to worry about ACPI states and what you are
doing with them; for suspend-to-disk you can ignore them and completely
power the system off instead.

David Lang


* Re: Hibernation considerations
  2007-07-17  4:28                 ` david
@ 2007-07-17 10:42                   ` Matthew Garrett
  2007-07-17 15:19                     ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Matthew Garrett @ 2007-07-17 10:42 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Jim Crilly, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Mon, Jul 16, 2007 at 09:28:13PM -0700, david@lang.hm wrote:

> why would the type of hibernate that I'm talking about (power off, not S4 
> mode) not work on a box that has ACPI?

Powering off rather than using S4 means you lose most wakeup device 
support. That would be a functional regression compared to the current 
code.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Hibernation considerations
  2007-07-17  4:33               ` david
@ 2007-07-17 12:08                 ` Al Boldi
  2007-07-17 14:18                   ` Rafael J. Wysocki
  2007-07-17 15:23                   ` david
  0 siblings, 2 replies; 220+ messages in thread
From: Al Boldi @ 2007-07-17 12:08 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

david@lang.hm wrote:
> On Mon, 16 Jul 2007, Al Boldi wrote:
> > We have to go through ACPI, for wakeup functions to succeed.  A simple
> > power-off won't do.
>
> the kexec switch being posted requires ACPI be disabled, so it's clearly
> possible to switch kernels and initialize devices without ACPI

It's a given that kexec works in the absence of ACPI; what we have to handle 
is the ACPI states across kernel invocations, to ensure wakeup functions 
succeed.  If you don't need this, then just power off.

> >> suspend-to-disk-and-ram could be implemented as three
> >> separate steps
> >>
> >> 1. suspend-to-disk
> >>
> >> 2. resume-from-disk
> >>
> >> 3. suspend-to-ram
> >>
> >> followed by either
> >>
> >> 4. resume-from-ram
> >>
> >> or
> >>
> >> 4. battery dies and laptop powers off completely
> >>
> >> 5. power-on boot.
> >>
> >> 6. resume-from-disk
> >>
> >> all that you need to do is to make sure that the system doesn't run
> >> anything that would affect permanent media or the outside world between
> >> steps #2 and #3
> >
> > Exactly, which is why your scheme would break down on #3, and that's why
> > you need to call S3 from within the kexec'd hibernation kernel after
> > saving the hibernation image.
>
> when a kexec is called, how does the kernel know what to execute?
> something needs to tell it what to do, and I think that something is
> either something in the kexec image, or it's something passed as a
> parameter to that image.
>
> all that would be needed to do #3 safely is to have the kernel that you
> restarted on #2 do a suspend-to-ram before it does anything else.

If you mean by kernel 'the normal kernel', then this won't work, because it 
would imply a change of state after saving its image.

If you mean by kernel 'the kexec'd hibernation kernel', then you wouldn't 
need to do #2, but rather do #3 right after dumping the image in #1.

[...insert from another post...]
> > BTW, it would be really helpful if people would actually try the kexec
> > hibernation patches, as this may yield a much more constructive
> > discussion.
>
> I would love to, but so far I don't see the necessary pieces
>
> once I kexec to the new kernel, how can it find out what pages of memory
> (and swap) need to be saved?

No need to save the swap, all you need to do is to dump /dev/oldmem onto 
storage, and if that dump image is compatible with swsusp, then a normal 
kernel should be able to resume from this image via /dev/snapshot.


Thanks!

--
Al



* Re: Hibernation considerations
  2007-07-17  4:45             ` david
@ 2007-07-17 14:15               ` Alan Stern
  2007-07-17 14:40                 ` Rafael J. Wysocki
  2007-07-21 10:17               ` Pavel Machek
  1 sibling, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-17 14:15 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Mon, 16 Jul 2007 david@lang.hm wrote:

> > I agree, it would be good to have a non-ACPI-specific hibernation mode,
> > something which would look to ACPI like a normal shutdown.  But I'm not
> > so sure this is possible.
> 
> why would it not be possible?

> I can't think of anything much more frustrating than thinking that I
> suspended a system and then discovering that because the battery went dead
> (a complete power loss) that the system wouldn't boot up properly. to me 
> this would be a fairly common condition (when I'm mobile I use the machine 
> until I am out of battery, then stop and it may be a long time (days) 
> before I can charge the thing up again) this would not be a reliable 
> suspend as far as I'm concerned.
> 
> for suspend-to-ram you have to worry about ACPI states and what you are 
> doing with them, for suspend-to-disk you can ignore them and completely 
> power the system off instead.

If the only problem with doing this would be lack of wakeup support
then I'm all for it.  There must be a lot of people who would like
their computers to hibernate with power drain as close to 0 as possible
and who don't care about remote wakeup.  In fact they might even prefer
not to have wakeup support, so the computer doesn't resume at
unexpected times.

Alan Stern



* Re: Hibernation considerations
  2007-07-17 12:08                 ` Al Boldi
@ 2007-07-17 14:18                   ` Rafael J. Wysocki
  2007-07-17 15:23                   ` david
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 14:18 UTC (permalink / raw)
  To: Al Boldi
  Cc: david, Alan Stern, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Tuesday, 17 July 2007 14:08, Al Boldi wrote:
> david@lang.hm wrote:
> > On Mon, 16 Jul 2007, Al Boldi wrote:
> > > We have to go through ACPI, for wakeup functions to succeed.  A simple
> > > power-off won't do.
> >
> > the kexec switch being posted requires ACPI be disabled, so it's clearly
> > possible to switch kernels and initialize devices without ACPI
> 
> It's a given that kexec works in the absence of ACPI; what we have to handle 
> is the ACPI states across kernel invocations, to ensure wakeup functions 
> succeed.  If you don't need this, then just power off.
> 
> > >> suspend-to-disk-and-ram could be implemented as three
> > >> separate steps
> > >>
> > >> 1. suspend-to-disk
> > >>
> > >> 2. resume-from-disk
> > >>
> > >> 3. suspend-to-ram
> > >>
> > >> followed by either
> > >>
> > >> 4. resume-from-ram
> > >>
> > >> or
> > >>
> > >> 4. battery dies and laptop powers off completely
> > >>
> > >> 5. power-on boot.
> > >>
> > >> 6. resume-from-disk
> > >>
> > >> all that you need to do is to make sure that the system doesn't run
> > >> anything that would affect permanent media or the outside world between
> > >> steps #2 and #3
> > >
> > > Exactly, which is why your scheme would break down on #3, and that's why
> > > you need to call S3 from within the kexec'd hibernation kernel after
> > > saving the hibernation image.
> >
> > when a kexec is called, how does the kernel know what to execute?
> > something needs to tell it what to do, and I think that something is
> > either something in the kexec image, or it's something passed as a
> > parameter to that image.
> >
> > all that would be needed to do #3 safely is to have the kernel that you
> > restarted on #2 do a suspend-to-ram before it does anything else.
> 
> If you mean by kernel 'the normal kernel', then this won't work, because it 
> would imply a change of state after saving its image.
> 
> If you mean by kernel 'the kexec'd hibernation kernel', then you wouldn't 
> need to do #2, but rather do #3 right after dumping the image in #1.
> 
> [...insert from another post...]
> > > BTW, it would be really helpful if people would actually try the kexec
> > > hibernation patches, as this may yield a much more constructive
> > > discussion.
> >
> > I would love to, but so far I don't see the necessary pieces
> >
> > once I kexec to the new kernel, how can it find out what pages of memory
> > (and swap) need to be saved?
> 
> No need to save the swap,

Correct.

> all you need to do is to dump /dev/oldmem onto storage,

I'm not sure of that.

> and if that dump image is compatible with swsusp,

No, it's not.  swsusp additionally needs to know the PFNs (page frame numbers)
to restore the pages into.
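
To illustrate the PFN point, here is a minimal sketch.  It is not swsusp's
actual image format; the struct and function names are made up for
illustration and the "pages" are tiny.  It only shows why each saved page has
to carry the frame number it came from: once some frames are skipped, a plain
concatenation of page contents no longer says where each page belongs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16   /* tiny "pages" to keep the example small */

/* Hypothetical image record: page contents plus the frame they came from. */
struct image_page {
	unsigned long pfn;
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Save only the frames flagged in 'wanted', tagging each with its PFN. */
static size_t snapshot_save(const unsigned char *mem, size_t nr_frames,
			    const int *wanted, struct image_page *img)
{
	size_t n = 0;

	for (size_t pfn = 0; pfn < nr_frames; pfn++) {
		if (!wanted[pfn])
			continue;	/* e.g. free or must-not-save frames */
		img[n].pfn = pfn;
		memcpy(img[n].data, mem + pfn * SKETCH_PAGE_SIZE,
		       SKETCH_PAGE_SIZE);
		n++;
	}
	return n;
}

/* Restore: each page goes back to the frame recorded at save time. */
static void snapshot_restore(unsigned char *mem, size_t nr_frames,
			     const struct image_page *img, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(img[i].pfn < nr_frames);
		memcpy(mem + img[i].pfn * SKETCH_PAGE_SIZE, img[i].data,
		       SKETCH_PAGE_SIZE);
	}
}
```

Without the pfn field the restore side would have to assume that saved page i
belongs in frame i, which is wrong as soon as any frame is left out.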

> then a normal kernel should be able to resume from this image via
> /dev/snapshot. 

Nope.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 14:15               ` Alan Stern
@ 2007-07-17 14:40                 ` Rafael J. Wysocki
  2007-07-17 15:29                   ` david
                                     ` (2 more replies)
  0 siblings, 3 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 14:40 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
> On Mon, 16 Jul 2007 david@lang.hm wrote:
> 
> > > I agree, it would be good to have a non-ACPI-specific hibernation mode,
> > > something which would look to ACPI like a normal shutdown.  But I'm not
> > > so sure this is possible.
> > 
> > why would it not be possible?
> 
> > I can't think of anything much more frustrating than thinking that I
> > suspended a system and then discovering that because the battery went dead
> > (a complete power loss) that the system wouldn't boot up properly. to me 
> > this would be a fairly common condition (when I'm mobile I use the machine 
> > until I am out of battery, then stop and it may be a long time (days) 
> > before I can charge the thing up again) this would not be a reliable 
> > suspend as far as I'm concerned.
> > 
> > for suspend-to-ram you have to worry about ACPI states and what you are 
> > doing with them, for suspend-to-disk you can ignore them and completely 
> > power the system off instead.
> 
> If the only problem with doing this would be lack of wakeup support
> then I'm all for it.  There must be a lot of people who would like
> their computers to hibernate with power drain as close to 0 as possible
> and who don't care about remote wakeup.  In fact they might even prefer
> not to have wakeup support, so the computer doesn't resume at
> unexpected times.

I'm afraid of one thing, though.

If we create a framework without ACPI (well, ACPI needs to be enabled in the
kernel anyway for other reasons, like the ability to suspend to RAM) and then
it turns out that we have to add some ACPI hooks to it, that might be difficult
to do cleanly.

Thus, it seems reasonable to think of the ACPI handling in advance.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 10:42                   ` Matthew Garrett
@ 2007-07-17 15:19                     ` david
  2007-07-18  2:18                       ` Matthew Garrett
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 15:19 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Jim Crilly, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Matthew Garrett wrote:

> On Mon, Jul 16, 2007 at 09:28:13PM -0700, david@lang.hm wrote:
>
>> why would the type of hibernate that I'm talking about (power off, not S4
>> mode) not work on a box that has ACPI?
>
> Powering off rather than using S4 means you lose most wakeup device
> support. That would be a functional regression compared to the current
> code.

only if the kexec isn't able to initialize those devices.

David Lang


* Re: Hibernation considerations
  2007-07-17 12:08                 ` Al Boldi
  2007-07-17 14:18                   ` Rafael J. Wysocki
@ 2007-07-17 15:23                   ` david
  1 sibling, 0 replies; 220+ messages in thread
From: david @ 2007-07-17 15:23 UTC (permalink / raw)
  To: Al Boldi
  Cc: Alan Stern, Rafael J. Wysocki, linux-kernel, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list

On Tue, 17 Jul 2007, Al Boldi wrote:

> david@lang.hm wrote:
>> On Mon, 16 Jul 2007, Al Boldi wrote:
>>> We have to go through ACPI, for wakeup functions to succeed.  A simple
>>> power-off won't do.
>>
>> the kexec switch being posted requires ACPI be disabled, so it's clearly
>> possible to switch kernels and initialize devices without ACPI
>
> It's a given that kexec works in the absence of ACPI; what we have to handle
> is the ACPI states across kernel invocations, to ensure wakeup functions
> succeed.  If you don't need this, then just power off.
>
>>>> suspend-to-disk-and-ram could be implemented as three
>>>> separate steps
>>>>
>>>> 1. suspend-to-disk
>>>>
>>>> 2. resume-from-disk
>>>>
>>>> 3. suspend-to-ram
>>>>
>>>> followed by either
>>>>
>>>> 4. resume-from-ram
>>>>
>>>> or
>>>>
>>>> 4. battery dies and laptop powers off completely
>>>>
>>>> 5. power-on boot.
>>>>
>>>> 6. resume-from-disk
>>>>
>>>> all that you need to do is to make sure that the system doesn't run
>>>> anything that would affect permanent media or the outside world between
>>>> steps #2 and #3
>>>
>>> Exactly, which is why your scheme would break down on #3, and that's why
>>> you need to call S3 from within the kexec'd hibernation kernel after
>>> saving the hibernation image.
>>
>> when a kexec is called, how does the kernel know what to execute?
>> something needs to tell it what to do, and I think that something is
>> either something in the kexec image, or it's something passed as a
>> parameter to that image.
>>
>> all that would be needed to do #3 safely is to have the kernel that you
>> restarted on #2 do a suspend-to-ram before it does anything else.
>
> If you mean by kernel 'the normal kernel', then this won't work, because it
> would imply a change of state after saving its image.

yes, it would change the state, but if it only changes the state in ways
that aren't visible to the outside world why would it matter?

if power dies you restore from the disk image (using the non-ACPI
approach), and the changes that you made are just lost
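
as a rough sketch of the wake-up decision in this scheme (the names below are
made up for illustration, they are not from any posted patch, and how the two
flags would actually be detected is left open):

```c
/* What survived while the machine was "asleep" (hypothetical flags). */
struct wake_info {
	int ram_intact;		/* suspend-to-ram state survived, power never dropped */
	int disk_image_valid;	/* a hibernation image was written in step 1 */
};

enum wake_action {
	WAKE_RESUME_FROM_RAM,	/* step 4, first variant */
	WAKE_RESUME_FROM_DISK,	/* steps 5 and 6 after a full power loss */
	WAKE_NORMAL_BOOT	/* nothing to resume from */
};

/* Prefer the RAM state; fall back to the disk image if power was lost. */
static enum wake_action choose_wake_action(const struct wake_info *w)
{
	if (w->ram_intact)
		return WAKE_RESUME_FROM_RAM;
	if (w->disk_image_valid)
		return WAKE_RESUME_FROM_DISK;
	return WAKE_NORMAL_BOOT;
}
```

in the resume-from-disk case whatever happened between steps #2 and #3 is
simply lost, which is why nothing in that window may touch permanent media or
the outside world.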

> If you mean by kernel 'the kexec'd hibernation kernel', then you wouldn't
> need to do #2, but rather do #3 right after dumping the image in #1.
>
> [...insert from another post...]
>>> BTW, it would be really helpful if people would actually try the kexec
>>> hibernation patches, as this may yield a much more constructive
>>> discussion.
>>
>> I would love to, but so far I don't see the necessary pieces
>>
>> once I kexec to the new kernel, how can it find out what pages of memory
>> (and swap) need to be saved?
>
> No need to save the swap, all you need to do is to dump /dev/oldmem onto
> storage, and if that dump image is compatible with swsusp, then a normal
> kernel should be able to resume from this image via /dev/snapshot.

Rafael is saying that there's more involved: you can't just dump
/dev/oldmem, you have to avoid specific pages.

as for swap, saving that may be required, depending on how clean you want
to leave the box for other OSes in the meantime.

David Lang


* Re: Hibernation considerations
  2007-07-17 14:40                 ` Rafael J. Wysocki
@ 2007-07-17 15:29                   ` david
  2007-07-17 16:02                     ` Rafael J. Wysocki
  2007-07-17 16:09                   ` Jeremy Maitin-Shepard
  2007-07-17 18:32                   ` Alan Stern
  2 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 15:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
>> On Mon, 16 Jul 2007 david@lang.hm wrote:
>>
>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
>>>> something which would look to ACPI like a normal shutdown.  But I'm not
>>>> so sure this is possible.
>>>
>>> why would it not be possible?
>>
>>> I can't think of anything much more frustrating than thinking that I
>>> suspended a system and then discovering that because the battery went dead
>>> (a complete power loss) that the system wouldn't boot up properly. to me
>>> this would be a fairly common condition (when I'm mobile I use the machine
>>> until I am out of battery, then stop and it may be a long time (days)
>>> before I can charge the thing up again) this would not be a reliable
>>> suspend as far as I'm concerned.
>>>
>>> for suspend-to-ram you have to worry about ACPI states and what you are
>>> doing with them, for suspend-to-disk you can ignore them and completely
>>> power the system off instead.
>>
>> If the only problem with doing this would be lack of wakeup support
>> then I'm all for it.  There must be a lot of people who would like
>> their computers to hibernate with power drain as close to 0 as possible
>> and who don't care about remote wakeup.  In fact they might even prefer
>> not to have wakeup support, so the computer doesn't resume at
>> unexpected times.
>
> I'm afraid of one thing, though.
>
> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> it turns out that we have to add some ACPI hooks to it, that might be difficult
> to do cleanly.

doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
people will want both, some won't.

at the moment kexec doesn't work with ACPI. that is a limitation that
should be fixed, but making it able to work with ACPI enabled doesn't
mean that it needs to be changed to depend on ACPI, and it especially
doesn't mean that it should pick up the limitations of the existing
ACPI-based hibernation approaches.

if there is no ACPI on the system it should work; if there is ACPI on the
system it should still work.

> Thus, it seems reasonable to think of the ACPI handling in advance.

but don't become dependent on ACPI.

David Lang


* Re: Hibernation considerations
  2007-07-17 15:29                   ` david
@ 2007-07-17 16:02                     ` Rafael J. Wysocki
  2007-07-17 17:06                       ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 16:02 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
> >> On Mon, 16 Jul 2007 david@lang.hm wrote:
> >>
> >>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
> >>>> something which would look to ACPI like a normal shutdown.  But I'm not
> >>>> so sure this is possible.
> >>>
> >>> why would it not be possible?
> >>
> >>> I can't think of anything much more frustrating than thinking that I
> >>> suspended a system and then discovering that because the battery went dead
> >>> (a complete power loss) that the system wouldn't boot up properly. to me
> >>> this would be a fairly common condition (when I'm mobile I use the machine
> >>> until I am out of battery, then stop and it may be a long time (days)
> >>> before I can charge the thing up again) this would not be a reliable
> >>> suspend as far as I'm concerned.
> >>>
> >>> for suspend-to-ram you have to worry about ACPI states and what you are
> >>> doing with them, for suspend-to-disk you can ignore them and completely
> >>> power the system off instead.
> >>
> >> If the only problem with doing this would be lack of wakeup support
> >> then I'm all for it.  There must be a lot of people who would like
> >> their computers to hibernate with power drain as close to 0 as possible
> >> and who don't care about remote wakeup.  In fact they might even prefer
> >> not to have wakeup support, so the computer doesn't resume at
> >> unexpected times.
> >
> > I'm afraid of one thing, though.
> >
> > If we create a framework without ACPI (well, ACPI needs to be enabled in the
> > kernel anyway for other reasons, like the ability to suspend to RAM) and then
> > it turns out that we have to add some ACPI hooks to it, that might be difficult
> > to do cleanly.
> 
> > doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
> > people will want both, some won't.
> >
> > at the moment kexec doesn't work with ACPI, that is a limitation that
> > should be fixed, but making it able to work with ACPI enabled doesn't
> > mean that it needs to be changed to depend on ACPI and it especially
> > doesn't mean that it should pick up the limitations of the existing ACPI
> > based hibernation approaches.
> >
> > if there is no ACPI on the system it should work, if there is ACPI on the
> > system it should still work.
> >
> > > Thus, it seems reasonable to think of the ACPI handling in advance.
> >
> > but don't become dependent on ACPI.

Not dependent, but with the possibility of ACPI support taken into account.

Arguably you can create a framework that, for example, will not allow the user
to adjust the size of the image, but then adding such functionality may
require you to change the entire design.  Same thing with ACPI.

I would rather avoid such pitfalls, if I could.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 14:40                 ` Rafael J. Wysocki
  2007-07-17 15:29                   ` david
@ 2007-07-17 16:09                   ` Jeremy Maitin-Shepard
  2007-07-17 19:54                     ` Rafael J. Wysocki
  2007-07-17 18:32                   ` Alan Stern
  2 siblings, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 16:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, david, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

"Rafael J. Wysocki" <rjw@sisk.pl> writes:

[snip]

> I'm afraid of one thing, though.

> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> it turns out that we have to add some ACPI hooks to it, that might be difficult
> to do cleanly.

> Thus, it seems reasonable to think of the ACPI handling in advance.

As far as I understand, ACPI support is only useful for hibernate to the
extent that it allows some or all of the following features:

 - possibly shows a nice looking "hibernate" LED
 - possibly allows the BIOS to show something about hibernate
 - possibly allows the lid or keyboard to "wake up" (turn on) the system

Note that properly restoring device state (or even properly determining
whether the system is on external/mains power vs. battery) on resume is not
something that should require special hibernate ACPI support, since it
should be possible to make hibernate look exactly like a reboot to the
BIOS/ACPI/devices (and in general that is what it will look like).
The problem that you mentioned on your system regarding power source
information would seem to just be a problem with how ACPI is
reinitialized after resuming from hibernation, which is not at all
surprising since we know it (the use of driver calls for hibernate) is
currently broken in many ways.

It seems that enabling S4 mode should just be treated as a special
shutdown mode, independent of hibernate.  In practice, it may likely
only be useful in conjunction with hibernate, but there doesn't seem to
be any reason it needs to be coupled.

It would be useful to determine whether it is necessary to initialize
ACPI specially after "resuming" from S4 mode, though, or whether they
can be initialized normally (i.e. by a normal kernel for instance,
completely unaware of hibernate).  If they can be initialized normally,
then it seems that it is unnecessary to have any ACPI S4 mode support in
the resume path, and it can merely exist as a special shutdown mode.
Note that it seems a bit odd if ACPI can't be initialized normally after
resume from S4 (and still work), since the "load image" kernel
initializes everything normally before attempting to resume the
hibernated system.

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 16:02                     ` Rafael J. Wysocki
@ 2007-07-17 17:06                       ` david
  2007-07-17 19:50                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 17:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
>>>> On Mon, 16 Jul 2007 david@lang.hm wrote:
>>>>
>>>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
>>>>>> something which would look to ACPI like a normal shutdown.  But I'm not
>>>>>> so sure this is possible.
>>>>>
>>>>> why would it not be possible?
>>>>
>>>>> I can't think of anything much more frustrating than thinking that I
>>>>> suspended a system and then discovering that because the battery went dead
>>>>> (a complete power loss) that the system wouldn't boot up properly. to me
>>>>> this would be a fairly common condition (when I'm mobile I use the machine
>>>>> until I am out of battery, then stop and it may be a long time (days)
>>>>> before I can charge the thing up again) this would not be a reliable
>>>>> suspend as far as I'm concerned.
>>>>>
>>>>> for suspend-to-ram you have to worry about ACPI states and what you are
>>>>> doing with them, for suspend-to-disk you can ignore them and completely
>>>>> power the system off instead.
>>>>
>>>> If the only problem with doing this would be lack of wakeup support
>>>> then I'm all for it.  There must be a lot of people who would like
>>>> their computers to hibernate with power drain as close to 0 as possible
>>>> and who don't care about remote wakeup.  In fact they might even prefer
>>>> not to have wakeup support, so the computer doesn't resume at
>>>> unexpected times.
>>>
>>> I'm afraid of one thing, though.
>>>
>>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
>>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
>>> it turns out that we have to add some ACPI hooks to it, that might be difficult
>>> to do cleanly.
>>
>> doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
>> people will want both, some won't.
>>
>> at the moment kexec doesn't work with ACPI, that is a limitation that
>> should be fixed, but making it able to work with ACPI enabled doesn't
>> mean that it needs to be changed to depend on ACPI and it especially
>> doesn't mean that it should pick up the limitations of the existing ACPI
>> based hibernation approaches.
>>
>> if there is no ACPI on the system it should work, if there is ACPI on the
>> system it should still work.
>>
>>> Thus, it seems reasonable to think of the ACPI handling in advance.
>>
>> but don't become dependent on ACPI.
>
> Not dependent, but with the possibility of ACPI support taken into account.
>
> Arguably you can create a framework that, for example, will not allow the user
> to adjust the size of the image, but then adding such a functionality may
> require you to change the entire design.  Same thing with ACPI.
>
> I would rather avoid such pitfalls, if I could.

Ok, what is it that you think ACPI fundamentally changes in this process?

keep in mind that we are not making the assumption that the hardware
will remain powered (even a little bit), or the assumption that nothing
else will run on the hardware (eliminating any possibility that the
hardware is in a known ACPI state).

David Lang


* Re: Hibernation considerations
  2007-07-15 22:27     ` david
@ 2007-07-17 17:40       ` Dr. David Alan Gilbert
  2007-07-17 17:49         ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Dr. David Alan Gilbert @ 2007-07-17 17:40 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

* david@lang.hm (david@lang.hm) wrote:
> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:

> >Encryption is possible with both the userland hibernation (aka uswsusp) and
> >TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a
> >"must have" feature for a framework to be generally useful (many users
> >don't use it anyway).
> 
> he's talking about the main system using an encrypted device/partition,
> not the hibernate image being stored encrypted.
>
> This would require the main system 'forget' the keys when it does the
> hibernate and prompt for it again during the wake-up phase.

Indeed - although as I say I really don't know what you would do with
apps using the mounts at that point.   Still it seems like a
sensible request from the security side.

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: Hibernation considerations
  2007-07-17 17:40       ` Dr. David Alan Gilbert
@ 2007-07-17 17:49         ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-17 17:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Rafael J. Wysocki, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Dr. David Alan Gilbert wrote:

> * david@lang.hm (david@lang.hm) wrote:
>> On Mon, 16 Jul 2007, Rafael J. Wysocki wrote:
>
>>> Encryption is possible with both the userland hibernation (aka uswsusp) and
>>> TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a
>>> "must
>>> have" feature for a framework to be generally useful (many users don't use
>>> it
>>> anyway).
>>
>> he's talking about the main system using an encrypted device/partition,
>> not the hibernate image being stored encrypted.
>>
>> This would require the main system to 'forget' the keys when it does the
>> hibernate and prompt for them again during the wake-up phase.
>
> Indeed - although as I say I really don't know what you would do with
> apps using the mounts at that point.  Still, it seems like a
> sensible request from the security side.

along the same lines, it would probably be a good idea to have the ability
for a system to re-ask for the pass phrase periodically while the system is
running.

I see two possible approaches to these issues.

1. implement the periodic re-request capability, and when going into 
hibernate time-out any known pass phrases.

this is a lot of work overall, but the suspend portion is trivial so there 
would not be any suspend surprises.

2. flush the keyring on hibernate and have the resume process re-populate
it (either by poking directly into the memory, or by providing a table
that the resuming kernel reads from during wake-up to re-populate it)

this is less work, but it's all suspend related so it will get less 
testing.
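
to make approach 2 a bit more concrete, here is a minimal Python sketch of
the "table" variant. It is only a toy model: the Keyring class, hibernate(),
resume() and the key name are made up for illustration and are not a real
kernel or library interface. The point it shows is that the snapshot is
taken after the flush, so the image never contains the secrets.

```python
class Keyring:
    """Toy stand-in for a kernel keyring (hypothetical, not a real API)."""

    def __init__(self):
        self._keys = {}

    def add(self, name, secret):
        self._keys[name] = secret

    def get(self, name):
        return self._keys.get(name)

    def flush(self):
        """Forget every secret and return the names that must be re-entered."""
        names = sorted(self._keys)
        self._keys.clear()
        return names


def hibernate(keyring):
    """Flush before the snapshot, so only key names survive in the image."""
    return keyring.flush()


def resume(keyring, table):
    """Re-populate the keyring from a table supplied by the resume process.

    `table` maps key name -> pass phrase; how the resume process obtains it
    (for example by prompting the user) is outside this sketch.
    """
    for name, secret in table.items():
        keyring.add(name, secret)


if __name__ == "__main__":
    kr = Keyring()
    kr.add("root-disk", "example pass phrase")
    needed = hibernate(kr)
    print("keys to ask for again:", needed)
    resume(kr, {name: "re-entered pass phrase" for name in needed})
```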

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 14:40                 ` Rafael J. Wysocki
  2007-07-17 15:29                   ` david
  2007-07-17 16:09                   ` Jeremy Maitin-Shepard
@ 2007-07-17 18:32                   ` Alan Stern
  2007-07-17 20:17                     ` Rafael J. Wysocki
  2007-07-17 20:27                     ` david
  2 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-17 18:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: david, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> I'm afraid of one thing, though.
> 
> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> it turns out that we have to add some ACPI hooks to it, that might be difficult
> to do cleanly.
> 
> Thus, it seems reasonable to think of the ACPI handling in advance.

Absolutely.  This needs to be done in such a way that it will work:

	On platforms without ACPI;

	On platforms with ACPI where we do a non-ACPI type of shutdown
	to whatever extent it is possible (or perhaps an ACPI-aware
	shutdown rather than change to S4);

	On platforms with ACPI where we do an ACPI-aware transition
	to S4.

Rafael, for those of us who aren't thoroughly familiar with all the ins
and outs of the ACPI spec, could you please summarize a list of the
ACPI calls needed in the second and third cases above?  Indicate which
ones need to be done from within the original kernel and which should
be done from within a kexec'd hibernation kernel.


I'm still not entirely clear on how "suspend-to-both" ought to be
handled.  Presumably it will start off as a normal hibernation.  But
instead of shutting down, wouldn't the kexec'd kernel return to the
original kernel?  After all, the original kernel knows about all the
devices and can put them into a low-power state, while the kexec'd
kernel might not have sufficient information.

But what about the freezer?  The original reason for using kexec was to
avoid the need for the freezer.  With no freezer, while the original
kernel is busy powering down its devices, user tasks will be free to
carry out I/O -- which will make the memory snapshot inconsistent with
the on-disk data structures.

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 17:06                       ` david
@ 2007-07-17 19:50                         ` Rafael J. Wysocki
  2007-07-17 20:18                           ` david
  2007-07-17 20:24                           ` Jeremy Maitin-Shepard
  0 siblings, 2 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 19:50 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 19:06, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
> >> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>
> >>> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
> >>>> On Mon, 16 Jul 2007 david@lang.hm wrote:
> >>>>
> >>>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
> >>>>>> something which would look to ACPI like a normal shutdown.  But I'm not
> >>>>>> so sure this is possible.
> >>>>>
> >>>>> why would it not be possible?
> >>>>
> >>>>> I can't think of anything much more frustrating than thinking that I
> >>>>> suspended a system and then discovering that because the battery went dead
> >>>>> (a complete power loss) that the system wouldn't boot up properly. to me
> >>>>> this would be a fairly common condition (when I'm mobile I use the machine
> >>>>> until I am out of battery, then stop and it may be a long time (days)
> >>>>> before I can charge the thing up again) this would not be a reliable
> >>>>> suspend as far as I'm concerned.
> >>>>>
> >>>>> for suspend-to-ram you have to worry about ACPI states and what you are
> >>>>> doing with them, for suspend-to-disk you can ignore them and completely
> >>>>> power the system off instead.
> >>>>
> >>>> If the only problem with doing this would be lack of wakeup support
> >>>> then I'm all for it.  There must be a lot of people who would like
> >>>> their computers to hibernate with power drain as close to 0 as possible
> >>>> and who don't care about remote wakeup.  In fact they might even prefer
> >>>> not to have wakeup support, so the computer doesn't resume at
> >>>> unexpected times.
> >>>
> >>> I'm afraid of one thing, though.
> >>>
> >>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> >>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> >>> it turns out that we have to add some ACPI hooks to it, that might be difficult
> >>> to do cleanly.
> >>
> >> doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
> >> people will want both, some won't.
> >>
> >> at the moment kexec doesn't work with ACPI, that is a limitation that
> >> should be fixed, but making it able to work with ACPI enabled doesn't
> >> mean that it needs to be changed to depend on ACPI and it especially
> >> doesn't mean that it should pick up the limitations of the existing ACPI
> >> based hibernation approaches.
> >>
> >> if there is no ACPI on the system it should work, if there is ACPI on the
> >> system it should still work.
> >>
> >>> Thus, it seems reasonable to think of the ACPI handling in advance.
> >>
> >> but don't become dependent on ACPI.
> >
> > Not dependent, but with the possibility of ACPI support taken into account.
> >
> > Arguably you can create a framework that, for example, will not allow the user
> > to adjust the size of the image, but then adding such a functionality may
> > require you to change the entire design.  Same thing with ACPI.
> >
> > I would rather avoid such pitfalls, if I could.
> 
> Ok, what is it that you think ACPI fundamentally changes in this process?
> 
> keep in mind that we are not making the assumption that the hardware
> will remain powered (even a little bit), or the assumption that nothing 
> else will run on the hardware (eliminating any possibility that the 
> hardware is in a known ACPI state)

Well, first, the fact is that _some_ systems _will_ be powered while in
hibernation (the majority of notebooks, for example) and you should assume
that the platform _may_ retain some information across the hibernation/restore
cycle.  In that case you _should_ _not_ trash the information retained by the
platform.

Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
state as a result of the hibernation procedure.  In my opinion this may be done
after saving the image, but still this means, for example, that the
image-saving kernel needs to support ACPI.

Next, during the restore, we should first check if the image is present (and
valid) _without_ turning ACPI on (note that this is not done by the current
hibernation code and that leads to strange problems on some systems).  Then,
if the image is present (and valid), we should first load it, jump to the
hibernated kernel and _then_ turn ACPI on and execute the _BFS and
_WAK ACPI global methods (again, this is not done by the current code in that
order, which is wrong).  Only after that is the hibernated kernel supposed to
continue.

[Please refer to section 15.3 of the 3.0b ACPI spec for details.]

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 16:09                   ` Jeremy Maitin-Shepard
@ 2007-07-17 19:54                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 19:54 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Alan Stern, david, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

On Tuesday, 17 July 2007 18:09, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> 
> [snip]
> 
> > I'm afraid of one thing, though.
> 
> > If we create a framework without ACPI (well, ACPI needs to be enabled in the
> > kernel anyway for other reasons, like the ability to suspend to RAM) and then
> > it turns out that we have to add some ACPI hooks to it, that might be difficult
> > to do cleanly.
> 
> > Thus, it seems reasonable to think of the ACPI handling in advance.
> 
> As far as I understand, ACPI support is only useful for hibernate to the
> extent that it allows some or all of the following features:
> 
>  - possibly shows a nice looking "hibernate" LED
>  - possibly allows the BIOS to show something about hibernate
>  - possibly allows the lid or keyboard to "wake up" (turn on) the system
> 
> Note that properly restoring device state (or even properly determining
> whether on external/mains power vs. battery) on resume is not something
> that should require special hibernate ACPI support, since it should be
> possible to make hibernate (and in general it will be the case that
> hibernate will) look exactly like a reboot to the BIOS/ACPI/devices.
> The problem that you mentioned on your system regarding power source
> information would seem to just be a problem with how ACPI is
> reinitialized after resuming from hibernation, which is not at all
> surprising since we know it (the use of driver calls for hibernate) is
> currently broken in many ways.
> 
> It seems that enabling S4 mode should just be treated as a special
> shutdown mode, independent of hibernate.  In practice, it may likely
> only be useful in conjunction with hibernate, but there doesn't seem to
> be any reason it needs to be coupled.
> 
> It would be useful to determine whether it is necessary to initialize
> ACPI specially after "resuming" from S4 mode, though, or whether they
> can be initialized normally (i.e. by a normal kernel for instance,
> completely unaware of hibernate).  If they can be initialized normally,
> then it seems that it is unnecessary to have any ACPI S4 mode support in
> the resume path, and it can merely exist as a special shutdown mode.
> Note that it seems a bit odd if ACPI can't be initialized normally after
> resume from S4 (and still work), since the "load image" kernel
> initializes everything normally before attempting to resume the
> hibernated system.

Unfortunately, this is more complicated (please see my recent reply to David).

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 18:32                   ` Alan Stern
@ 2007-07-17 20:17                     ` Rafael J. Wysocki
  2007-07-17 20:34                       ` david
  2007-07-17 20:34                       ` Jeremy Maitin-Shepard
  2007-07-17 20:27                     ` david
  1 sibling, 2 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 20:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 20:32, Alan Stern wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > I'm afraid of one thing, though.
> > 
> > If we create a framework without ACPI (well, ACPI needs to be enabled in the
> > kernel anyway for other reasons, like the ability to suspend to RAM) and then
> > it turns out that we have to add some ACPI hooks to it, that might be difficult
> > to do cleanly.
> > 
> > Thus, it seems reasonable to think of the ACPI handling in advance.
> 
> Absolutely.  This needs to be done in such a way that it will work:
> 
> 	On platforms without ACPI;
> 
> 	On platforms with ACPI where we do a non-ACPI type of shutdown
> 	to whatever extent it is possible (or perhaps an ACPI-aware
> 	shutdown rather than change to S4);
> 
> 	On platforms with ACPI where we do an ACPI-aware transition
> 	to S4.
> 
> Rafael, for those of us who aren't thoroughly familiar with all the ins
> and outs of the ACPI spec, could you please summarize a list of the
> ACPI calls needed in the second and third cases above?  Indicate which
> ones need to be done from within the original kernel and which should
> be done from within a kexec'd hibernation kernel.

Sure.

In the third case (ie. transition to S4) we are supposed to do the following:

(1) Upon entering the sleep state, which IMO can be done _after_ the image
    has been saved:
  * figure out which devices can wake up
  * put devices into low power states (wake-up devices are placed in the Dx
    states compatible with the wake capability, the others are powered off)
  * execute the _PTS global control method
  * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
  * execute the _GTS global control method
  * set the GPE enable registers corresponding to the wake-up devices
  * make the platform enter S4 (there's a well defined procedure for that)
  I think that this should be done by the image-saving kernel.

(2) Upon start-up (by which I mean what happens after the user has pressed
    the power button or something like that):
  * check if the image is present (and valid) _without_ enabling ACPI (we don't
    do that now, but I see no reason for not doing it in the new framework)
  * if the image is present (and valid), load it
  * turn on ACPI (unless already turned on by the BIOS, that is)
  * execute the _BFS global control method
  * execute the _WAK global control method
  * continue
  Here, the first two things should be done by the image-loading kernel, but
  the remaining operations have to be carried out by the restored kernel.

In the remaining two cases we generally don't need to bother with the global
ACPI handling.
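
As a rough illustration of the ordering in the two lists above, here is a
minimal Python sketch.  It is only a model of the sequence: the Kernel labels,
the step strings and s4_resume_steps() are made up for illustration and do not
correspond to real kernel functions; only the ACPI method names (_PTS, _GTS,
_BFS, _WAK) are taken from the lists.

```python
from enum import Enum


class Kernel(Enum):
    """Which kernel carries out a step (labels are illustrative only)."""
    IMAGE_SAVING = "image-saving kernel"
    IMAGE_LOADING = "image-loading kernel"
    RESTORED = "restored kernel"


# Steps for entering S4, carried out after the image has been saved.
S4_ENTRY_STEPS = [
    (Kernel.IMAGE_SAVING, "figure out which devices can wake up"),
    (Kernel.IMAGE_SAVING, "put devices into low power states"),
    (Kernel.IMAGE_SAVING, "execute _PTS"),
    (Kernel.IMAGE_SAVING, "switch off the nonboot CPUs"),
    (Kernel.IMAGE_SAVING, "execute _GTS"),
    (Kernel.IMAGE_SAVING, "set GPE enable registers for wake-up devices"),
    (Kernel.IMAGE_SAVING, "enter S4"),
]


def s4_resume_steps(image_valid):
    """Return the start-up steps in order, given whether a valid image exists."""
    steps = [(Kernel.IMAGE_LOADING, "check for image without enabling ACPI")]
    if not image_valid:
        # No usable image: the rest of the S4 resume sequence does not apply.
        return steps
    steps += [
        (Kernel.IMAGE_LOADING, "load image"),
        (Kernel.RESTORED, "turn on ACPI (unless the BIOS already did)"),
        (Kernel.RESTORED, "execute _BFS"),
        (Kernel.RESTORED, "execute _WAK"),
        (Kernel.RESTORED, "continue"),
    ]
    return steps


if __name__ == "__main__":
    for kernel, step in S4_ENTRY_STEPS + s4_resume_steps(image_valid=True):
        print(f"{kernel.value}: {step}")
```

The only design point the sketch tries to capture is that the image check and
load happen before ACPI is touched, and that everything from turning ACPI on
belongs to the restored kernel.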

> I'm still not entirely clear on how "suspend-to-both" ought to be
> handled.  Presumably it will start off as a normal hibernation.  But
> instead of shutting down, wouldn't the kexec'd kernel return to the
> original kernel?

No, I think the image-saving kernel should suspend.  Then, on resume the
platform will go back to it and it will jump back to the hibernated kernel.

> After all, the original kernel knows about all the devices and can put them
> into a low-power state, while the kexec'd kernel might not have sufficient
> information.

That's correct, but ...

> But what about the freezer?  The original reason for using kexec was to
> avoid the need for the freezer.  With no freezer, while the original
> kernel is busy powering down its devices, user tasks will be free to
> carry out I/O -- which will make the memory snapshot inconsistent with
> the on-disk data structures.

... we can't return to the hibernated kernel unless we are going to cancel the
hibernation.

That's why I think that for the suspend-to-both the image-saving kernel will
need to support the same set of devices as the hibernated kernel.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 19:50                         ` Rafael J. Wysocki
@ 2007-07-17 20:18                           ` david
  2007-07-17 20:39                             ` Jeremy Maitin-Shepard
                                               ` (3 more replies)
  2007-07-17 20:24                           ` Jeremy Maitin-Shepard
  1 sibling, 4 replies; 220+ messages in thread
From: david @ 2007-07-17 20:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 19:06, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
>>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>>>
>>>>> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
>>>>>> On Mon, 16 Jul 2007 david@lang.hm wrote:
>>>>>>
>>>>>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
>>>>>>>> something which would look to ACPI like a normal shutdown.  But I'm not
>>>>>>>> so sure this is possible.
>>>>>>>
>>>>>>> why would it not be possible?
>>>>>>
>>>>>>> I can't think of anything much more frustrating than thinking that I
>>>>>>> suspended a system and then discovering that because the battery went dead
>>>>>>> (a complete power loss) that the system wouldn't boot up properly. to me
>>>>>>> this would be a fairly common condition (when I'm mobile I use the machine
>>>>>>> until I am out of battery, then stop and it may be a long time (days)
>>>>>>> before I can charge the thing up again) this would not be a reliable
>>>>>>> suspend as far as I'm concerned.
>>>>>>>
>>>>>>> for suspend-to-ram you have to worry about ACPI states and what you are
>>>>>>> doing with them, for suspend-to-disk you can ignore them and completely
>>>>>>> power the system off instead.
>>>>>>
>>>>>> If the only problem with doing this would be lack of wakeup support
>>>>>> then I'm all for it.  There must be a lot of people who would like
>>>>>> their computers to hibernate with power drain as close to 0 as possible
>>>>>> and who don't care about remote wakeup.  In fact they might even prefer
>>>>>> not to have wakeup support, so the computer doesn't resume at
>>>>>> unexpected times.
>>>>>
>>>>> I'm afraid of one thing, though.
>>>>>
>>>>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
>>>>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
>>>>> it turns out that we have to add some ACPI hooks to it, that might be difficult
>>>>> to do cleanly.
>>>>
>>>> doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
>>>> people will want both, some won't.
>>>>
>>>> at the moment kexec doesn't work with ACPI, that is a limitation that
>>>> should be fixed, but making it able to work with ACPI enabled doesn't
>>>> mean that it needs to be changed to depend on ACPI and it especially
>>>> doesn't mean that it should pick up the limitations of the existing ACPI
>>>> based hibernation approaches.
>>>>
>>>> if there is no ACPI on the system it should work, if there is ACPI on the
>>>> system it should still work.
>>>>
>>>>> Thus, it seems reasonable to think of the ACPI handling in advance.
>>>>
>>>> but don't become dependent on ACPI.
>>>
>>> Not dependent, but with the possibility of ACPI support taken into account.
>>>
>>> Arguably you can create a framework that, for example, will not allow the user
>>> to adjust the size of the image, but then adding such a functionality may
>>> require you to change the entire design.  Same thing with ACPI.
>>>
>>> I would rather avoid such pitfalls, if I could.
>>
>> Ok, what is it that you think ACPI fundamentally changes in this process?
>>
>> keep in mind that we are not making the assumption that the hardware
>> will remain powered (even a little bit), or the assumption that nothing
>> else will run on the hardware (eliminating any possibility that the
>> hardware is in a known ACPI state)
>
> Well, first, the fact is that _some_ systems _will_ be powered while in
> hibernation (the majority of notebooks, for example) and you should assume
> that the platform _may_ retain some information across the hibernation/restore
> cycle.  In that case you _should_ _not_ trash the information retained by the
> platform.

no, systems that remain powered while asleep are a different type of
suspend than ones that don't.

> Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
> state as a result of the hibernation procedure.  In my opinion this may be done
> after saving the image, but still this means, for example, that the
> image-saving kernel needs to support ACPI.
>
> Next, during the restore, we should first check if the image is present (and
> valid) _without_ turning ACPI on (note that this is not done by the current
> hibernation code and that leads to strange problems on some systems).  Then,
> if the image is present (and valid), we should first load it, jump to the
> hibernated kernel and _then_ turn ACPI on and execute the _BFS and
> _WAK ACPI global methods (again, this is not done by the current code in that
> order, which is wrong).  Only after that is the hibernated kernel supposed to
> continue.
>
> [Please refer to section 15.3 of the 3.0b ACPI spec for details.]

you are starting from the assumption that ACPI S4 mode should be used.

I'm saying that a suspend that uses ACPI S4 mode is fundamentally 
different from one that does a power off instead.

from my point of view the ACPI S4 sleep mode has far more in common with
suspend-to-ram than with the suspend-to-disk that I'm talking about.

non-ACPI hibernate

   since the box powers off:
     it uses zero power while suspended
     another OS could be run before a resume
     hardware can be swapped, suspend image could be sent around the
       world to be restored on another system
     restore makes no assumptions about the state of the hardware when
       it is restored
     restore is slower (full BIOS boot is required)
   should be able to work on just about any hardware (the limit is the
     ability to initialize the devices)


ACPI suspends

   since the box never completely powers off:
     a complete power failure breaks the suspend
     the OS must remain in control so other uses must be prevented
     hardware must remain in the ACPI state from suspend until restore
     restore can be faster (some initialization may be able to be skipped)
   requires ACPI hardware support

under the category of ACPI suspends you have

   fast suspend-to-ram (stop scheduling, put the CPU to sleep, as long as
     the memory keeps getting refreshed)
   slow suspend-to-ram (stop scheduling, put as much of the hardware as
     possible to sleep, including spinning down disks and other things that
     take a while to undo)
   suspend-to-disk (stop scheduling, copy the ram somewhere so that it
     doesn't need to be refreshed, put everything into low-power mode)

and there are probably quite a few others as well. but they are all in
the same family in that you have to worry about ACPI states, and they all
have the same restrictions on what can happen between suspend and resume.

the non-ACPI hibernate behaves very differently, and for some people (and
I think I am one of them) it will meet their needs better than _any_ of
the ACPI suspends.

David Lang


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 19:50                         ` Rafael J. Wysocki
  2007-07-17 20:18                           ` david
@ 2007-07-17 20:24                           ` Jeremy Maitin-Shepard
  2007-07-17 20:44                             ` david
  2007-07-17 21:00                             ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 20:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: david, Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

"Rafael J. Wysocki" <rjw@sisk.pl> writes:

[snip]

> Well, first, the fact is that _some_ systems _will_ be powered while in
> hibernation (the majority of notebooks, for example) and you should assume
> that the platform _may_ retain some information across the hibernation/restore
> cycle.  In that case you _should_ _not_ trash the information retained by the
> platform.

I'm not sure the majority of notebook users will want wakeup support in
exchange for some power consumption while the system is off.  I think
many people would not consider the trouble of having to press the power
button instead of merely opening the lid too great.

Furthermore, S4 mode is of course also not suitable if you intend to
replace the battery while the system is hibernated.

It does seem that it is useful to provide S4 as an option, but certainly
just shutting down should also be an option on all systems.

> Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
> state as a result of the hibernation procedure.  In my opinion this may be done
> after saving the image, but still this means, for example, that the
> image-saving kernel needs to support ACPI.

It seems that it most certainly must be done AFTER saving the image, as
the image obviously cannot be saved after entering S4 state, since S4
state is nearly the same as powering off completely and all memory will
be lost.

> Next, during the restore, we should first check if the image is present (and
> valid) _without_ turning ACPI on (note that this is not done by the current
> hibernation code and that leads to strange problems on some systems).  Then,
> if the image is present (and valid), we should first load it, jump to the
> hibernated kernel and _then_ turn ACPI on and execute the _BFS and
> _WAK ACPI global methods (again, this is not done by the current code in that
> order, which is wrong).  Only after that is the hibernated kernel supposed to
> continue.

It seems that the implementation of that behavior for Linux cannot be
quite so simple, since resume from hibernation is driven (in general)
from an initrd/initramfs rather than directly from the kernel
initialization sequence, in order to support modular drivers and
features like DM and LVM.

Thus, there would have to be a new "delay_acpi_initialization" kernel
command-line option.  Additionally, there would be a sysfs interface to
tell the kernel to proceed with the ACPI initialization as normal.  This
would be used by an initrd/initramfs after determining that a resume
from hibernate will not be done.  If a resume from hibernate is done,
this hook won't be used, and instead the resumed kernel will call the
ACPI hibernate resume stuff if S4 state was used; otherwise, the resumed
kernel will just re-initialize ACPI as normal.  Also, if the in-kernel
code for checking if a resume can be done does not find a hibernate
image, it will also invoke the delayed ACPI initialization.
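
To spell out the cases, here is a minimal Python sketch of the decision
logic.  It is only an illustration of the proposal: acpi_action() and its
return strings are made up, and "delay_acpi_initialization" is the option
suggested above, not something that exists today.

```python
def acpi_action(image_found, resume_chosen, used_s4):
    """Say who initializes ACPI, and how, once ACPI init has been delayed.

    image_found   -- the check for a hibernate image succeeded
    resume_chosen -- the initrd/initramfs decided to resume from that image
    used_s4       -- the hibernated system entered ACPI S4 (not a plain power-off)
    """
    if not image_found or not resume_chosen:
        # No resume will happen, so the boot kernel runs the delayed,
        # ordinary ACPI initialization (triggered in-kernel or via sysfs).
        return "boot kernel: normal ACPI initialization"
    if used_s4:
        # Resume from S4: the resumed kernel runs the ACPI hibernate resume path.
        return "resumed kernel: ACPI S4 resume path"
    # Resume after a plain power-off: the resumed kernel re-initializes ACPI.
    return "resumed kernel: normal ACPI re-initialization"


if __name__ == "__main__":
    for case in [(False, False, False), (True, False, True),
                 (True, True, True), (True, True, False)]:
        print(case, "->", acpi_action(*case))
```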

-- 
Jeremy Maitin-Shepard

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 18:32                   ` Alan Stern
  2007-07-17 20:17                     ` Rafael J. Wysocki
@ 2007-07-17 20:27                     ` david
  2007-07-17 21:20                       ` Rafael J. Wysocki
  2007-07-17 22:38                       ` Alan Stern
  1 sibling, 2 replies; 220+ messages in thread
From: david @ 2007-07-17 20:27 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Alan Stern wrote:

> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>
>> I'm afraid of one thing, though.
>>
>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
>> it turns out that we have to add some ACPI hooks to it, that might be difficult
>> to do cleanly.
>>
>> Thus, it seems reasonable to think of the ACPI handling in advance.
>
> Absolutely.  This needs to be done in such a way that it will work:
>
> 	On platforms without ACPI;
>
> 	On platforms with ACPI where we do a non-ACPI type of shutdown
> 	to whatever extent it is possible (or perhaps an ACPI-aware
> 	shutdown rather than change to S4);
>
> 	On platforms with ACPI where we do an ACPI-aware transition
> 	to S4.
>
> Rafael, for those of us who aren't thoroughly familiar with all the ins
> and outs of the ACPI spec, could you please summarize a list of the
> ACPI calls needed in the second and third cases above?  Indicate which
> ones need to be done from within the original kernel and which should
> be done from within a kexec'd hibernation kernel.
>

there was just a link on slashdot to a primer on the subject of power
management

http://www.techarp.com/showarticle.aspx?artno=420

>
> I'm still not entirely clear on how "suspend-to-both" ought to be
> handled.  Presumably it will start off as a normal hibernation.  But
> instead of shutting down, wouldn't the kexec'd kernel return to the
> original kernel?  After all, the original kernel knows about all the
> devices and can put them into a low-power state, while the kexec'd
> kernel might not have sufficient information.

this is what I'm thinking, but the issue here is that the original kernel 
needs to go into suspend-to-ram mode instead of resuming operation. per 
the e-mail I got from Ying last night this should not be hard to 
implement.

> But what about the freezer?  The original reason for using kexec was to
> avoid the need for the freezer.  With no freezer, while the original
> kernel is busy powering down its devices, user tasks will be free to
> carry out I/O -- which will make the memory snapshot inconsistent with
> the on-disk data structures.

no, user tasks just don't get scheduled during shutdown.

the big problem with the freezer isn't stopping anything from happening,
it's _selectively_ stopping things.

with kexec you don't need to let any portion of the original kernel or
userspace operate so you don't have a problem.

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 20:17                     ` Rafael J. Wysocki
@ 2007-07-17 20:34                       ` david
  2007-07-17 20:54                         ` Jeremy Maitin-Shepard
  2007-07-17 21:23                         ` Rafael J. Wysocki
  2007-07-17 20:34                       ` Jeremy Maitin-Shepard
  1 sibling, 2 replies; 220+ messages in thread
From: david @ 2007-07-17 20:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 20:32, Alan Stern wrote:
>
>> I'm still not entirely clear on how "suspend-to-both" ought to be
>> handled.  Presumably it will start off as a normal hibernation.  But
>> instead of shutting down, wouldn't the kexec'd kernel return to the
>> original kernel?
>
> No, I think the image-saving kernel should suspend.  Then, on resume the
> platform will go back to it and it will jump back to the hibernated kernel.
>
>> After all, the original kernel knows about all the devices and can put them
>> into a low-power state, while the kexec'd kernel might not have sufficient
>> information.
>
> That's correct, but ...
>
>> But what about the freezer?  The original reason for using kexec was to
>> avoid the need for the freezer.  With no freezer, while the original
>> kernel is busy powering down its devices, user tasks will be free to
>> carry out I/O -- which will make the memory snapshot inconsistent with
>> the on-disk data structures.
>
> ... we can't return to the hibernated kernel unless we are going to cancel the
> hibernation.

this is where we disagree.

why not? if all that the hibernated kernel does is to suspend-to-ram, and it 
makes no changes to disks or TCP connections, anything that it does do 
would be lost if power were to fail and you instead did a restore from 
disk.

there is only a problem if something takes place that would prevent the 
restore-from-disk from working. if this is done in a non-ACPI way that 
will work across a power cycle you don't have to worry about the hardware 
state not matching anyway.

> That's why I think that for the suspend-to-both the image-saving kernel will
> need to support the same set of devices as the hibernated kernel.

suspend-to-both doesn't really make sense if the suspend-to-disk portion 
is using the ACPI S4 mode.

if you don't run out of power you will restore-from-ram

if you do run out of power the restore-from-disk won't work either because 
devices are not in the right ACPI states.
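
to spell the argument above out, here is a small sketch. it only encodes the claim as stated in this mail (it is not a statement about what any given platform actually does), and the function name, parameters and return strings are made up for illustration.

```python
def resume_path(power_lost: bool, disk_image_used_s4: bool) -> str:
    """Return the resume path left after a suspend-to-both, per the argument above.

    Hypothetical helper: the names and strings are illustrative, not kernel interfaces.
    """
    if not power_lost:
        # RAM contents survived, so the disk image is never needed.
        return "restore-from-ram"
    if disk_image_used_s4:
        # Claim made above: after a full power loss the devices are no
        # longer in the ACPI states that an S4 resume expects.
        return "restore-from-disk not expected to work"
    # Image saved without relying on S4: a cold boot can restore it.
    return "restore-from-disk"


if __name__ == "__main__":
    for lost in (False, True):
        for s4 in (False, True):
            print(f"power_lost={lost!s:5} s4={s4!s:5} -> {resume_path(lost, s4)}")
```

under that assumption the S4 disk image never helps in the suspend-to-both case, which is the point being made.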

David Lang


* Re: Hibernation considerations
  2007-07-17 20:17                     ` Rafael J. Wysocki
  2007-07-17 20:34                       ` david
@ 2007-07-17 20:34                       ` Jeremy Maitin-Shepard
  2007-07-17 20:37                         ` david
  2007-07-17 21:11                         ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 20:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, david, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

"Rafael J. Wysocki" <rjw@sisk.pl> writes:

[snip]

>> Rafael, for those of us who aren't thoroughly familiar with all the ins
>> and outs of the ACPI spec, could you please summarize a list of the
>> ACPI calls needed in the second and third cases above?  Indicate which
>> ones need to be done from within the original kernel and which should
>> be done from within a kexec'd hibernation kernel.

> Sure.

> In the third case (ie. transition to S4) we are supposed to do the following:

> (1) Upon entering the sleep state, which IMO can be done _after_ the image
>     has been saved:

I assume you mean "in order to enter the sleep state", rather than "upon
entering the sleep state".  I still don't understand what you mean by
"which IMO can be done _after_ the image has been saved"; as far as I
understand, the last step of this process, "make the platform enter S4",
is almost like a shutdown as far as the kernel is concerned (except for
the tiny detail of having to call those special ACPI methods on resume);
consequently, it would seem that nothing can be done after that step.

>   * figure out which devices can wake up
>   * put devices into low power states (wake-up devices are placed in the Dx
>     states compatible with the wake capability, the others are powered off)
>   * execute the _PTS global control method
>   * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
>   * execute the _GTS global control method
>   * set the GPE enable registers corresponding to the wake-up devices
>   * make the platform enter S4 (there's a well defined procedure for that)
>   I think that this should be done by the image-saving kernel.

I agree.

> (2) Upon start-up (by which I mean what happens after the user has pressed
>     the power button or something like that):
>   * check if the image is present (and valid) _without_ enabling ACPI (we don't
>     do that now, but I see no reason for not doing it in the new framework)
>   * if the image is present (and valid), load it
>   * turn on ACPI (unless already turned on by the BIOS, that is)
>   * execute the _BFS global control method
>   * execute the _WAK global control method
>   * continue
>   Here, the first two things should be done by the image-loading kernel, but
>   the remaining operations have to be carried out by the restored
>     kernel.

It doesn't seem like a problem for that to be the case, but out of
curiosity, why do those methods need to be executed by the "restored"
kernel, rather than the "image loading" kernel?  Do they require some
information from ACPI-related kernel data structures that were populated
by the normal ACPI initialization?
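
As a rough illustration only (this is not kernel code), the ordering quoted above can be written down as data plus a small check of where control passes from one kernel to another.  The step descriptions are paraphrased from the two lists above; the `Step` type, the kernel labels and the `handoff_index` helper are all invented for this sketch.

```python
from dataclasses import dataclass

# Hypothetical labels for the kernels discussed in this thread.
IMAGE_SAVING = "image-saving kernel"
IMAGE_LOADING = "image-loading kernel"
RESTORED = "restored kernel"


@dataclass(frozen=True)
class Step:
    description: str
    owner: str


# Entering S4 once the image has been saved, in the order listed above.
ENTER_S4 = [
    Step("figure out which devices can wake up", IMAGE_SAVING),
    Step("put devices into low power states", IMAGE_SAVING),
    Step("execute _PTS", IMAGE_SAVING),
    Step("switch off the nonlocal CPUs", IMAGE_SAVING),
    Step("execute _GTS", IMAGE_SAVING),
    Step("set the GPE enable registers for the wake-up devices", IMAGE_SAVING),
    Step("make the platform enter S4", IMAGE_SAVING),
]

# Start-up after the user presses the power button, in the order listed above.
RESUME_FROM_S4 = [
    Step("check for a valid image without enabling ACPI", IMAGE_LOADING),
    Step("load the image", IMAGE_LOADING),
    Step("turn on ACPI (unless the BIOS already did)", RESTORED),
    Step("execute _BFS", RESTORED),
    Step("execute _WAK", RESTORED),
    Step("continue", RESTORED),
]


def handoff_index(steps):
    """Index of the first step owned by a different kernel than step 0, or None."""
    first_owner = steps[0].owner
    for i, step in enumerate(steps):
        if step.owner != first_owner:
            return i
    return None


if __name__ == "__main__":
    for step in ENTER_S4 + RESUME_FROM_S4:
        print(f"{step.owner:22} {step.description}")
```

In this sketch the only hand-off is on the resume side, between loading the image and turning ACPI on, which is the point my question above is about.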

[snip]

> ... we can't return to the hibernated kernel unless we are going to cancel the
> hibernation.

I agree.

> That's why I think that for the suspend-to-both the image-saving kernel will
> need to support the same set of devices as the hibernated kernel.

If all of the devices that the image writing kernel doesn't know about
have already been shut down/powered off by the hibernated kernel, then
does the "image writing" kernel still need to know about them in order
to suspend to RAM properly (i.e. without leaving some devices on wasting
power)?

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 20:34                       ` Jeremy Maitin-Shepard
@ 2007-07-17 20:37                         ` david
  2007-07-17 20:56                           ` Jeremy Maitin-Shepard
  2007-07-17 21:24                           ` Rafael J. Wysocki
  2007-07-17 21:11                         ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: david @ 2007-07-17 20:37 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:

> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
>
> [snip]
>
>>> Rafael, for those of us who aren't thoroughly familiar with all the ins
>>> and outs of the ACPI spec, could you please summarize a list of the
>>> ACPI calls needed in the second and third cases above?  Indicate which
>>> ones need to be done from within the original kernel and which should
>>> be done from within a kexec'd hibernation kernel.
>
>> Sure.
>
>> In the third case (ie. transition to S4) we are supposed to do the following:
>
>> (1) Upon entering the sleep state, which IMO can be done _after_ the image
>>     has been saved:
>
> I assume you mean "in order to enter the sleep state", rather than "upon
> entering the sleep state".  I still don't understand what you mean by
> "which IMO can be done _after_ the image has been saved"; as far as I
> understand, the last step of this process, "make the platform enter S4",
> is almost like a shutdown as far as the kernel is concerned (except for
> the tiny detail of having to call those special ACPI methods on resume);
> consequently, it would seem that nothing can be done after that step.
>
>>   * figure out which devices can wake up
>>   * put devices into low power states (wake-up devices are placed in the Dx
>>     states compatible with the wake capability, the others are powered off)

this can't be done by the image-saving kernel if that kernel doesn't know 
about the device.

David Lang


* Re: Hibernation considerations
  2007-07-17 20:39                             ` Jeremy Maitin-Shepard
@ 2007-07-17 20:39                               ` david
  2007-07-17 20:58                               ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: david @ 2007-07-17 20:39 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:

> david@lang.hm writes:
>
> [snip]
>
>> the non-ACPI hibernate behaves very differently, and for some people (and I
>> think I am one of them) it will meet their needs better than _any_ of the ACPI
>> suspends.
>
> It may have certain differences from the user point of view, but from
> the implementation view, it seems that it is nearly exactly the same.
> The only differences seem to be:
>
> - rather than shutting down, do whatever is necessary to stick the
>   system in S4 state.
>
> - make sure ACPI isn't initialized by the "load image" kernel
>
> - rather than "resume from hibernate" ACPI by initializing it normally,
>   issue the special hibernate-related methods.
>
> Thus, it seems that supporting ACPI S4 will have a very minimal effect
> on the hibernate implementation.

from what Rafael is saying supporting ACPI S4 mode requires a very 
fundamentally different restore approach, and in addition imposes very 
different restrictions on what can be done with the machine while it's 
suspended.

David Lang


* Re: Hibernation considerations
  2007-07-17 20:18                           ` david
@ 2007-07-17 20:39                             ` Jeremy Maitin-Shepard
  2007-07-17 20:39                               ` david
  2007-07-17 20:58                               ` Rafael J. Wysocki
  2007-07-17 20:57                             ` Rafael J. Wysocki
                                               ` (2 subsequent siblings)
  3 siblings, 2 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 20:39 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

david@lang.hm writes:

[snip]

> the non-ACPI hibernate behaves very differently, and for some people (and I
> think I am one of them) it will meet their needs better than _any_ of the ACPI
> suspends.

It may have certain differences from the user point of view, but from
the implementation view, it seems that it is nearly exactly the same.
The only differences seem to be: 

 - rather than shutting down, do whatever is necessary to stick the
   system in S4 state.

 - make sure ACPI isn't initialized by the "load image" kernel

 - rather than "resume from hibernate" ACPI by initializing it normally,
   issue the special hibernate-related methods.

Thus, it seems that supporting ACPI S4 will have a very minimal effect
on the hibernate implementation.

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 20:24                           ` Jeremy Maitin-Shepard
@ 2007-07-17 20:44                             ` david
  2007-07-17 21:00                             ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: david @ 2007-07-17 20:44 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:

> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
>
> [snip]
>
>> Well, first, the fact is that _some_ systems _will_ be powered while in
>> hibernation (the majority of notebooks, for example) and you should assume
>> that the platform _may_ retain some information across the hibernation/restore
>> cycle.  In that case you _should_ _not_ trash the information retained by the
>> platform.
>
> I'm not sure the majority of notebook users will want wakeup support in
> exchange for some power consumption while the system is off.  I think
> many people would not consider the trouble of having to press the power
> button instead of merely opening the lid too great.

I think you mean to say that they would be willing to trade wakeup support 
in exchange for lower power consumption...

> Furthermore, S4 mode is of course also not suitable if you intend to
> replace the battery while the system is hibernated.

exactly, and how many users realize that replacing the battery, or 
allowing the battery to die completely will cause the restore to fail?

> It does seem that it is useful to provide S4 as an option, but certainly
> just shutting down should also be an option on all systems.

absolutely.

>> Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
>> state as a result of the hibernation procedure.  In my opinion this may be done
>> after saving the image, but still this means, for example, that the
>> image-saving kernel needs to support ACPI.
>
> It seems that it most certainly must be done AFTER saving the image, as
> the image obviously cannot be saved after entering S4 state, since S4
> state is nearly the same as powering off completely and all memory will
> be lost.

and the image-saving kernel could kexec back to the main kernel if the 
main kernel then knows that it should execute the S4 mode instead of 
restoring.

>> Next, during the restore, we should first check if the image is present (and
>> valid) _without_ turning ACPI on (note that this is not done by the current
>> hibernation code and that leads to strange problems on some systems).  Then,
>> if the image is present (and valid), we should first load it, jump to the
>> hibernated kernel and _then_ turn ACPI on and execute the _BFS and
>> _WAK ACPI global methods (again, this is not done by the current code in that
>> order, which is wrong).  Only after that is the hibernated kernel supposed to
>> continue.
>
> It seems that the implementation of that behavior for Linux cannot be
> quite so simple, since resume from hibernation is driven (in general)
> from an initrd/initramfs rather than directly from the kernel
> initialization sequence, in order to support modular drivers and
> features like DM and LVM.
>
> Thus, there would have to be a new "delay_acpi_initialization" kernel
> command-line option.  Additionally, there would be a sysfs interface to
> tell the kernel to proceed with the ACPI initialization as normal.  This
> would be used by an initrd/initramfs after determining that a resume
> from hibernate will not be done.  If a resume from hibernate is done,
> this hook won't be used, and instead the resumed kernel will call the
> ACPI hibernate resume stuff if S4 state was used; otherwise, the resumed
> kernel will just re-initialize ACPI as normal.  Also, if the in-kernel
> code for checking if a resume can be done does not find a hibernate
> image, it will also invoke the delayed ACPI initialization.

remember, one of the thoughts for a good hibernate implementation is that 
the image may be over the network, not on the local disk. you don't want 
the kernel trying to implement everything necessary to get the image back.
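
for reference, the decision flow in the quoted proposal can be sketched roughly like this. it is purely illustrative: `delay_acpi_initialization` is only the option name proposed above, and the function name and action strings below are invented here, not real kernel interfaces.

```python
from typing import List


def resume_actions(image_found: bool, s4_was_used: bool) -> List[str]:
    """Ordered actions for a boot started with ACPI initialization delayed.

    Hypothetical sketch of the proposal quoted above.
    """
    if not image_found:
        # No resume will happen, so the delayed ACPI init proceeds as normal.
        return [
            "boot kernel: initialize ACPI normally",
            "boot kernel: continue normal boot",
        ]

    actions = [
        "boot kernel: load image",
        "boot kernel: jump to hibernated kernel",
    ]
    if s4_was_used:
        # The resumed kernel runs the ACPI hibernate-resume methods.
        actions += [
            "resumed kernel: turn on ACPI",
            "resumed kernel: execute _BFS",
            "resumed kernel: execute _WAK",
        ]
    else:
        actions.append("resumed kernel: re-initialize ACPI as normal")
    return actions


if __name__ == "__main__":
    for found, s4 in ((False, False), (True, False), (True, True)):
        print(found, s4, resume_actions(found, s4))
```

note that nothing in the sketch says where the image comes from, which is the part that should stay outside the kernel.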

David Lang


* Re: Hibernation considerations
  2007-07-17 20:57                             ` Rafael J. Wysocki
@ 2007-07-17 20:53                               ` david
  2007-07-17 21:37                                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 20:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 22:18, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Tuesday, 17 July 2007 19:06, david@lang.hm wrote:
>>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>>>
>>>>> On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
>>>>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>>>>>
>>>>>>> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
>>>>>>>> On Mon, 16 Jul 2007 david@lang.hm wrote:
>>>>>>>>
>>>>>>>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
>>>>>>>>>> something which would look to ACPI like a normal shutdown.  But I'm not
>>>>>>>>>> so sure this is possible.
>>>>>>>>>
>>>>>>>>> why would it not be possible?
>>>>>>>>
>>>>>>>>> I can't think of anything much more frustrating than thinking that I
>>>>>>>>> suspended a system and then discovering that because the battery went dead
>>>>>>>>> (a complete power loss) that the system wouldn't boot up properly. to me
>>>>>>>>> this would be a fairly common condition (when I'm mobile I use the machine
>>>>>>>>> until I am out of battery, then stop and it may be a long time (days)
>>>>>>>>> before I can charge the thing up again) this would not be a reliable
>>>>>>>>> suspend as far as I'm concerned.
>>>>>>>>>
>>>>>>>>> for suspend-to-ram you have to worry about ACPI states and what you are
>>>>>>>>> doing with them, for suspend-to-disk you can ignore them and completely
>>>>>>>>> power the system off instead.
>>>>>>>>
>>>>>>>> If the only problem with doing this would be lack of wakeup support
>>>>>>>> then I'm all for it.  There must be a lot of people who would like
>>>>>>>> their computers to hibernate with power drain as close to 0 as possible
>>>>>>>> and who don't care about remote wakeup.  In fact they might even prefer
>>>>>>>> not to have wakeup support, so the computer doesn't resume at
>>>>>>>> unexpected times.
>>>>>>>
>>>>>>> I'm afraid of one thing, though.
>>>>>>>
>>>>>>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
>>>>>>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
>>>>>>> it turns out that we have to add some ACPI hooks to it, that might be difficult
>>>>>>> to do cleanly.
>>>>>>
>>>>>> doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
>>>>>> people will want both, some won't.
>>>>>>
>>>>>> at the moment kexec doesn't work with ACPI, that is a limitation that
>>>>>> should be fixed, but making it able to work with ACPI enabled doesn't
>>>>>> mean that it needs to be changed to depend on ACPI and it especially
>>>>>> doesn't mean that it should pick up the limitations of the existing ACPI
>>>>>> based hibernation approaches.
>>>>>>
>>>>>> if there is no ACPI on the system it should work, if there is ACPI on the
>>>>>> system it should still work.
>>>>>>
>>>>>>> Thus, it seems reasonable to think of the ACPI handling in advance.
>>>>>>
>>>>>> but don't become dependent on ACPI.
>>>>>
>>>>> Not dependent, but with the possibility of ACPI support taken into account.
>>>>>
>>>>> Arguably you can create a framework that, for example, will not allow the user
>>>>> to adjust the size of the image, but then adding such a functionality may
>>>>> require you to change the entire design.  Same thing with ACPI.
>>>>>
>>>>> I would rather avoid such pitfalls, if I could.
>>>>
>>>> Ok, what is it that you think ACPI fundamentally changes in this process?
>>>>
>>>> keep in mind that we are not making the assumption that the hardware
>>>> will remain powered (even a little bit), or the assumption that nothing
>>>> else will run on the hardware (eliminating any possibility that the
>>>> hardware is in a known ACPI state)
>>>
>>> Well, first, the fact is that _some_ systems _will_ be powered while in
>>> hibernation (the majority of notebooks, for example) and you should assume
>>> that the platform _may_ retain some information across the hibernation/restore
>>> cycle.  In that case you _should_ _not_ trash the information retained by the
>>> platform.
>>
>> no, systems that remain powered while asleep are a different type of
>> suspend than ones that don't.
>>
>>> Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
>>> state as a result of the hibernation procedure.  In my opinion this may be done
>>> after saving the image, but still this means, for example, that the
>>> image-saving kernel needs to support ACPI.
>>>
>>> Next, during the restore, we should first check if the image is present (and
>>> valid) _without_ turning ACPI on (note that this is not done by the current
>>> hibernation code and that leads to strange problems on some systems).  Then,
>>> if the image is present (and valid), we should first load it, jump to the
>>> hibernated kernel and _then_ turn ACPI on and execute the _BFS and
>>> _WAK ACPI global methods (again, this is not done by the current code in that
>>> order, which is wrong).  Only after that is the hibernated kernel supposed to
>>> continue.
>>>
>>> [Please refer to section 15.3 of the 3.0b ACPI spec for details.]
>>
>> you are starting from the assumption that ACPI S4 mode should be used.
>>
>> I'm saying that a suspend that uses ACPI S4 mode is fundamentally
>> different from one that does a power off instead.
>
> It is different, but not fundamentally.
>
>> from my point of view the ACPI S4 sleep mode has far more in common with
>> suspend-to-ram than with the suspend-to-disk that I'm talking about
>>
>> non-ACPI hibernate
>>
>>    since the box powers off
>>      it uses zero power while suspended
>>      another OS could be run before a resume
>>      hardware can be swapped, suspend image could be sent around the world to be restored on another system.
>>      restore makes no assumptions about the state of the hardware when it is restored
>>      restore is slower (full BIOS boot is required)
>>    should be able to work on just about any hardware (the limit is the ability to initialize the devices)
>>
>>
>> ACPI suspends
>>
>>    since the box never completely powers off:
>>      a complete power failure breaks the suspend
>>      the OS must remain in control so other uses must be prevented.
>>      hardware must remain in the ACPI state from suspend until restore.
>>      restore can be faster (some initialization may be able to be skipped)
>>    requires ACPI hardware support
>>
>> under the category of ACPI suspends you have
>>
>>    fast suspend-to-ram (stop scheduling, put the CPU to sleep, as long as
>>      the memory keeps getting refreshed)
>>    slow suspend-to-ram (stop scheduling, put as much of the hardware as
>>      possible to sleep, including spinning down disks and other things that
>>      take a while to undo)
>>    suspend-to-disk (stop scheduling, copy the ram somewhere so that it
>>      doesn't need to be refreshed, put everything into low-power mode)
>>
>>    and there are probably quite a few others as well. but they are all in
>> the same family in that you have to worry about ACPI states, and they all
>> have the same restrictions on what can happen between suspend and resume
>>
>> the non-ACPI hibernate behaves very differently, and for some people (and
>> I think I am one of them) it will meet their needs better than _any_ of
>> the ACPI suspends.
>
> OTOH, there are many people who would want the ACPI suspends to be handled
> and they don't really care for the power-off-only hibernation.
>
> If you aren't going to support the ACPI hibernation, your framework will be
> incomplete and therefore not generally useful.

if you make the framework limited by the ACPI requirement, your framework 
will not be able to be used in all cases and is therefore incomplete and 
not generally useful.

see, I can make authoritative sounding declarations too. :-)

I agree that some people want ACPI suspends, but you don't seem to allow 
for the fact that some people don't, and those people don't want to have the 
ACPI based limits. they _especially_ don't want those limits when it 
appears as if supporting those limits is what's preventing their much 
simpler case from working reliably.

I strongly suspect that the majority of users don't care about ACPI, they 
want to be able to pause and resume their machine. they may want a couple 
options for how fast the resume is (trading resume speed against how much 
power the system eats), but the deep sleep modes (suspend-to-disk, 
hibernate) probably have restore times that are close enough to each other 
that very few people would care enough to opt for an ACPI S4 mode that 
won't survive a loss of battery power over a non-ACPI mode that would.

David Lang


* Re: Hibernation considerations
  2007-07-17 20:34                       ` david
@ 2007-07-17 20:54                         ` Jeremy Maitin-Shepard
  2007-07-17 21:04                           ` david
  2007-07-17 21:23                         ` Rafael J. Wysocki
  1 sibling, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 20:54 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

david@lang.hm writes:

[snip]

> this is where we disagree.

> why not? if all that the hibernated kernel does is to suspend-to-ram and makes
> no changes to disks or TCP connections anything that it does do would be lost if
> power were to fail and you instead did a restore from disk.

It would be okay to switch to the "hibernated" kernel in order to
e.g. initiate a suspend to ram provided that everything is done
atomically with interrupts off, for instance.  It is not clear, though,
that it is possible to suspend to ram atomically like that.

There is also the question of what state the devices will be in when
switching back from the "save image" kernel to the "hibernated" kernel.

[snip]

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 20:37                         ` david
@ 2007-07-17 20:56                           ` Jeremy Maitin-Shepard
  2007-07-17 21:06                             ` david
  2007-07-17 21:24                           ` Rafael J. Wysocki
  1 sibling, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 20:56 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

david@lang.hm writes:

[snip]

>>> * figure out which devices can wake up
>>> * put devices into low power states (wake-up devices are placed in the Dx
>>> states compatible with the wake capability, the others are powered off)

> this can't be done by the image-saving kernel if that kernel doesn't know about
> the device.

The image-saving kernel can be made to know about all of the "wake up"
devices; all other devices should have already been powered off by the
"hibernated" kernel.

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 20:18                           ` david
  2007-07-17 20:39                             ` Jeremy Maitin-Shepard
@ 2007-07-17 20:57                             ` Rafael J. Wysocki
  2007-07-17 20:53                               ` david
  2007-07-21 10:25                             ` Pavel Machek
  2007-08-01 16:58                             ` Stefan Seyfried
  3 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 20:57 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 22:18, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Tuesday, 17 July 2007 19:06, david@lang.hm wrote:
> >> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>
> >>> On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
> >>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>>>
> >>>>> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
> >>>>>> On Mon, 16 Jul 2007 david@lang.hm wrote:
> >>>>>>
> >>>>>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
> >>>>>>>> something which would look to ACPI like a normal shutdown.  But I'm not
> >>>>>>>> so sure this is possible.
> >>>>>>>
> >>>>>>> why would it not be possible?
> >>>>>>
> >>>>>>> I can't think of anything much more frustrating than thinking that I
> >>>>>>> suspended a system and then discovering that because the battery went dead
> >>>>>>> (a complete power loss) that the system wouldn't boot up properly. to me
> >>>>>>> this would be a fairly common condition (when I'm mobile I use the machine
> >>>>>>> until I am out of battery, then stop and it may be a long time (days)
> >>>>>>> before I can charge the thing up again) this would not be a reliable
> >>>>>>> suspend as far as I'm concerned.
> >>>>>>>
> >>>>>>> for suspend-to-ram you have to worry about ACPI states and what you are
> >>>>>>> doing with them, for suspend-to-disk you can ignore them and completely
> >>>>>>> power the system off instead.
> >>>>>>
> >>>>>> If the only problem with doing this would be lack of wakeup support
> >>>>>> then I'm all for it.  There must be a lot of people who would like
> >>>>>> their computers to hibernate with power drain as close to 0 as possible
> >>>>>> and who don't care about remote wakeup.  In fact they might even prefer
> >>>>>> not to have wakeup support, so the computer doesn't resume at
> >>>>>> unexpected times.
> >>>>>
> >>>>> I'm afraid of one thing, though.
> >>>>>
> >>>>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> >>>>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> >>>>> it turns out that we have to add some ACPI hooks to it, that might be difficult
> >>>>> to do cleanly.
> >>>>
> >>>> doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
> >>>> people will want both, some won't.
> >>>>
> >>>> at the moment kexec doesn't work with ACPI, that is a limitation that
> >>>> should be fixed, but making it able to work with ACPI enabled doesn't
> >>>> mean that it needs to be changed to depend on ACPI and it especially
> >>>> doesn't mean that it should pick up the limitations of the existing ACPI
> >>>> based hibernation approaches.
> >>>>
> >>>> if there is no ACPI on the system it should work, if there is ACPI on the
> >>>> system it should still work.
> >>>>
> >>>>> Thus, it seems reasonable to think of the ACPI handling in advance.
> >>>>
> >>>> but don't become dependent on ACPI.
> >>>
> >>> Not dependent, but with the possibility of ACPI support taken into account.
> >>>
> >>> Arguably you can create a framework that, for example, will not allow the user
> >>> to adjust the size of the image, but then adding such a functionality may
> >>> require you to change the entire design.  Same thing with ACPI.
> >>>
> >>> I would rather avoid such pitfalls, if I could.
> >>
> >> Ok, what is it that you think ACPI fundamentally changes in this process?
> >>
> >> keep in mind that we are not making the assumption that the hardware
> >> will remain powered (even a little bit), or the assumption that nothing
> >> else will run on the hardware (eliminating any possibility that the
> >> hardware is in a known ACPI state)
> >
> > Well, first, the fact is that _some_ systems _will_ be powered while in
> > hibernation (the majority of notebooks, for example) and you should assume
> > that the platform _may_ retain some information across the hibernation/restore
> > cycle.  In that case you _should_ _not_ trash the information retained by the
> > platform.
> 
> no, systems that remain powered while asleep are a different type of 
> suspend than ones that don't.
> 
> > Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
> > state as a result of the hibernation procedure.  In my opinion this may be done
> > after saving the image, but still this means, for example, that the
> > image-saving kernel needs to support ACPI.
> >
> > Next, during the restore, we should first check if the image is present (and
> > valid) _without_ turning ACPI on (note that this is not done by the current
> > hibernation code and that leads to strange problems on some systems).  Then,
> > if the image is present (and valid), we should first load it, jump to the
> > hibernated kernel and _then_ turn ACPI on and execute the _BFS and
> > _WAK ACPI global methods (again, this is not done by the current code in that
> > order, which is wrong).  Only after that is the hibernated kernel supposed to
> > continue.
> >
> > [Please refer to section 15.3 of the 3.0b ACPI spec for details.]
> 
> you are starting from the assumption that ACPI S4 mode should be used.
> 
> I'm saying that a suspend that uses ACPI S4 mode is fundamentally 
> different from one that does a power off instead.

It is different, but not fundamentally.

> from my point of view the ACPI S4 sleep mode has far more in common with 
> suspend-to-ram than with the suspend-to-disk that I'm talking about
> 
> non-ACPI hibernate
> 
>    since the box powers off
>      it uses zero power while suspended
>      another OS could be run before a resume
>      hardware can be swapped, suspend image could be sent around the world to be restored on another system.
>      restore makes no assumptions about the state of the hardware when it is restored
>      restore is slower (full BIOS boot is required)
>    should be able to work on just about any hardware (the limit is the ability to initialize the devices)
> 
> 
> ACPI suspends
> 
>    since the box never completely powers off:
>      a complete power failure breaks the suspend
>      the OS must remain in control so other uses must be prevented.
>      hardware must remain in the ACPI state from suspend until restore.
>      restore can be faster (some initialization may be able to be skipped)
>    requires ACPI hardware support
> 
> under the category of ACPI suspends you have
> 
>    fast suspend-to-ram (stop scheduling, put the CPU to sleep, as long as 
> the memory keeps getting refreshed)
>    slow suspend-to-ram (stop scheduling, put as much of the hardware as 
> possible to sleep, including spinning down disks and other things that 
> take a while to undo)
>    suspend-to-disk (stop scheduling, copy the ram somewhere so that it 
> doesn't need to be refreshed, put everything into low-power mode)
> 
>    and there are probably quite a few others as well. but they are all in 
> the same family in that you have to worry about ACPI states, and they all 
> have the same restrictions on what can happen between suspend and resume
> 
> the non-ACPI hibernate behaves very differently, and for some people (and 
> I think I am one of them) it will meet their needs better than _any_ of 
> the ACPI suspends.

OTOH, there are many people who would want the ACPI suspends to be handled
and they don't really care for the power-off-only hibernation.

If you aren't going to support the ACPI hibernation, your framework will be
incomplete and therefore not generally useful.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 20:39                             ` Jeremy Maitin-Shepard
  2007-07-17 20:39                               ` david
@ 2007-07-17 20:58                               ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 20:58 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: david, Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

On Tuesday, 17 July 2007 22:39, Jeremy Maitin-Shepard wrote:
> david@lang.hm writes:
> 
> [snip]
> 
> > the non-ACPI hibernate behaves very differently, and for some people (and I
> > think I am one of them) it will meet their needs better than _any_ of the ACPI
> > suspends.
> 
> It may have certain differences from the user point of view, but from
> the implementation view, it seems that it is nearly exactly the same.
> The only differences seem to be: 
> 
>  - rather than shutting down, do whatever is necessary to stick the
>    system in S4 state.
> 
>  - make sure ACPI isn't initialized by the "load image" kernel
> 
>  - rather than "resume from hibernate" ACPI by initializing it normally,
>    issue the special hibernate-related methods.
> 
> Thus, it seems that supporting ACPI S4 will have a very minimal effect
> on the hibernate implementation.

Still, you need to take it into account.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 20:24                           ` Jeremy Maitin-Shepard
  2007-07-17 20:44                             ` david
@ 2007-07-17 21:00                             ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:00 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: david, Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

On Tuesday, 17 July 2007 22:24, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> 
> [snip]
> 
> > Well, first, the fact is that _some_ systems _will_ be powered while in
> > hibernation (the majority of notebooks, for example) and you should assume
> > that the platform _may_ retain some information across the hibernation/restore
> > cycle.  In that case you _should_ _not_ trash the information retained by the
> > platform.
> 
> I'm not sure the majority of notebook users will want wakeup support in
> exchange for some power consumption while the system is off.  I think
> many people would not consider the trouble of having to press the power
> button instead of merely opening the lid too great.
> 
> Furthermore, S4 mode is of course also not suitable if you intend to
> replace the battery while the system is hibernated.
> 
> It does seem that it is useful to provide S4 as an option, but certainly
> just shutting down should also be an option on all systems.
> 
> > Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
> > state as a result of the hibernation procedure.  In my opinion this may be done
> > after saving the image, but still this means, for example, that the
> > image-saving kernel needs to support ACPI.
> 
> It seems that it most certainly must be done AFTER saving the image, as
> the image obviously cannot be saved after entering S4 state, since S4
> state is nearly the same as powering off completely and all memory will
> be lost.
> 
> > Next, during the restore, we should first check if the image is present (and
> > valid) _without_ turning ACPI on (note that this is not done by the current
> > hibernation code and that leads to strange problems on some systems).  Then,
> > if the image is present (and valid), we should first load it, jump to the
> > hibernated kernel and _then_ turn ACPI on and execute the _BFS and
> > _WAK ACPI global methods (again, this is not done by the current code in that
> > order, which is wrong).  Only after that is the hibernated kernel supposed to
> > continue.
> 
> It seems that the implementation of that behavior for Linux cannot be
> quite so simple, since resume from hibernation is driven (in general)
> from an initrd/initramfs rather than directly from the kernel
> initialization sequence, in order to support modular drivers and
> features like DM and LVM.

That's correct.

> Thus, there would have to be a new "delay_acpi_initialization" kernel
> command-line option.  Additionally, there would be a sysfs interface to
> tell the kernel to proceed with the ACPI initialization as normal.  This
> would be used by an initrd/initramfs after determining that a resume
> from hibernate will not be done.  If a resume from hibernate is done,
> this hook won't be used, and instead the resumed kernel will call the
> ACPI hibernate resume stuff if S4 state was used; otherwise, the resumed
> kernel will just re-initialize ACPI as normal.  Also, if the in-kernel
> code for checking if a resume can be done does not find a hibernate
> image, it will also invoke the delayed ACPI initialization.

Yes, something like this.

My point is, though, that it really requires some thought and needs to be
kept in mind.
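
The decision flow described above can be sketched roughly as follows.  This
is only an illustration in Python, not kernel code: the helper name
acpi_action and the strings it returns are made up for the sketch, and
"delay_acpi_initialization" is the option proposed in this thread, not an
existing one.

```python
def acpi_action(image_found: bool, resumed_from_s4: bool) -> str:
    """Say what should happen to ACPI at boot under the proposed scheme.

    Hypothetical helper: models a boot with the proposed
    delay_acpi_initialization option, where ACPI is left alone until it
    is known whether a hibernation image will be restored.
    """
    if not image_found:
        # No resume: the initrd (through the proposed sysfs hook) or the
        # in-kernel image check triggers the delayed, normal ACPI init.
        return "boot kernel: normal ACPI init"
    if resumed_from_s4:
        # Image restored and the platform was in S4: the restored kernel
        # turns ACPI on and runs the S4 wake-up methods.
        return "restored kernel: enable ACPI, run _BFS then _WAK"
    # Image restored after a plain power-off: the platform retained
    # nothing, so the restored kernel re-initializes ACPI as usual.
    return "restored kernel: normal ACPI init"


for image, s4 in [(False, False), (True, True), (True, False)]:
    print(image, s4, "->", acpi_action(image, s4))
```

The point of the sketch is only that the boot kernel never touches ACPI
before the image check, whichever branch is taken afterwards.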

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 20:54                         ` Jeremy Maitin-Shepard
@ 2007-07-17 21:04                           ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-17 21:04 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:

> david@lang.hm writes:
>
> [snip]
>
>> this is where we disagree.
>
>> why not? if all that the hibernated kernel does is to suspend-to-ram and makes
>> no changes to disks or TCP connections anything that it does do would be lost if
>> power were to fail and you instead did a restore from disk.
>
> It would be okay to switch the "hibernated" kernel in order to
> e.g. initiate a suspend to ram provided that everything is done
> atomically with interrupts off, for instance.  It is not clear, though,
> that it is possible to suspend to ram atomically like that.

why would it need to be with interrupts off?

I am arguing that it wouldn't matter if the "hibernated" kernel changed 
every bit of ram, as long as it didn't change anything that would be 
visible when the ram is overwritten by the saved image.

> There is also the question of what state the devices will be in when
> switching back from the "save image" kernel to the "hibernated" kernel.

yes, this is a key factor.

if the saved image assumes that the hardware is in some ACPI mode instead 
of re-initializing the hardware then the suspend-to-ram operation could 
leave them in a different mode.

but if the saved image doesn't make assumptions about the hardware modes 
and initializes the hardware then it shouldn't matter.

David Lang

* Re: Hibernation considerations
  2007-07-17 20:56                           ` Jeremy Maitin-Shepard
@ 2007-07-17 21:06                             ` david
  2007-07-17 21:40                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 21:06 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:

> david@lang.hm writes:
>
> [snip]
>
>>>> * figure out which devices can wake up
>>>> * put devices into low power states (wake-up devices are placed in the Dx
>>>> states compatible with the wake capability, the others are powered off)
>
>> this can't be done by the image-saving kernel if that kernel doesn't know about
>> the device.
>
> The image-saving kernel can be made to know about all of the "wake up"
> devices; all other devices should have already been powered off by the
> "hibernated" kernel.

not necessarily.

for example, you don't want the "hibernated" kernel to spin down your 
disks in the general case, but your image-writing kernel may not have 
drivers in it to talk to some of the disks. when things are powered 
off this just isn't an issue, but if you use S4 mode instead it is.
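
A minimal sketch of the gap, in Python and with made-up device names: the
devices left powered on by the "hibernated" kernel, minus the ones the
image-writing kernel has drivers for, are the ones nobody can put into a
low power state before entering S4.  The function name unmanaged_devices
is hypothetical.

```python
def unmanaged_devices(still_powered, image_kernel_drivers):
    """Devices left powered on that the image-writing kernel cannot handle."""
    return sorted(set(still_powered) - set(image_kernel_drivers))


# Hypothetical device names, for illustration only.
still_powered = ["disk0", "disk1", "nic0"]   # left on by the hibernated kernel
image_kernel_drivers = ["disk0"]             # drivers in the image-writing kernel

print(unmanaged_devices(still_powered, image_kernel_drivers))
```

With a plain power-off the result does not matter; with S4 every entry in
it is a device left in an unknown power state.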

David Lang

* Re: Hibernation considerations
  2007-07-17 20:34                       ` Jeremy Maitin-Shepard
  2007-07-17 20:37                         ` david
@ 2007-07-17 21:11                         ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:11 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Alan Stern, david, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

On Tuesday, 17 July 2007 22:34, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> 
> [snip]
> 
> >> Rafael, for those of us who aren't thoroughly familiar with all the ins
> >> and outs of the ACPI spec, could you please summarize a list of the
> >> ACPI calls needed in the second and third cases above?  Indicate which
> >> ones need to be done from within the original kernel and which should
> >> be done from within a kexec'd hibernation kernel.
> 
> > Sure.
> 
> > In the third case (ie. transition to S4) we are supposed to do the following:
> 
> > (1) Upon entering the sleep state, which IMO can be done _after_ the image
> >     has been saved:
> 
> I assume you mean "in order to enter the sleep state", rather than "upon
> entering the sleep state".

Yes, that seems to be more accurate.

> I still don't understand what you mean by 
> "which IMO can be done _after_ the image has been saved"; as far as I
> understand, the last step of this process, "make the platform enter S4",
> is almost like a shutdown as far as the kernel is concerned (except for
> the tiny detail of having to call those special ACPI methods on resume);
> consequently, it would seem that nothing can be done after that step.

Well, the ACPI spec suggests saving the image somewhere after the devices
have been put into low power states, which is kind of unreasonable. :-)

> >   * figure out which devices can wake up
> >   * put devices into low power states (wake-up devices are placed in the Dx
> >     states compatible with the wake capability, the others are powered off)
> >   * execute the _PTS global control method
> >   * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
> >   * execute the _GTS global control method
> >   * set the GPE enable registers corresponding to the wake-up devices
> >   * make the platform enter S4 (there's a well defined procedure for that)
> >   I think that this should be done by the image-saving kernel.
> 
> I agree.
> 
> > (2) Upon start-up (by which I mean what happens after the user has pressed
> >     the power button or something like that):
> >   * check if the image is present (and valid) _without_ enabling ACPI (we don't
> >     do that now, but I see no reason for not doing it in the new framework)
> >   * if the image is present (and valid), load it
> >   * turn on ACPI (unless already turned on by the BIOS, that is)
> >   * execute the _BFS global control method
> >   * execute the _WAK global control method
> >   * continue
> >   Here, the first two things should be done by the image-loading kernel, but
> >   the remaining operations have to be carried out by the restored
> >     kernel.
> 
> It doesn't seem like a problem for that to be the case, but out of
> curiosity why do those methods need to be executed by the "restored"
> kernel, rather than the "image loading" kernel.  Do they require some
> information from ACPI-related kernel data structures that were populated
> by the normal ACPI initialization?

Well, there are some complications.  For example, _BFS and _WAK should be
executed with interrupts on (I'm told that the AML interpreter might not work
with interrupts disabled) and the nonboot CPUs should be offline while they
are being executed.  Perhaps we should also avoid playing with APICs and
things like that after executing _WAK, so it's better to execute them from the
restored kernel.
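
To make the split of responsibilities easier to see, here is a rough sketch
of the two sequences quoted above as plain Python data.  It is not kernel
code; the constant and function names are invented for the illustration,
and putting "save the image" first reflects my opinion stated above rather
than the order suggested by the spec.

```python
IMAGE_SAVING = "image-saving kernel"
IMAGE_LOADING = "image-loading kernel"
RESTORED = "restored kernel"

# Steps needed to enter S4, all done by the image-saving kernel.
ENTER_S4 = [
    (IMAGE_SAVING, "save the image"),
    (IMAGE_SAVING, "figure out which devices can wake up"),
    (IMAGE_SAVING, "put devices into low power states"),
    (IMAGE_SAVING, "execute _PTS"),
    (IMAGE_SAVING, "switch off the nonboot CPUs"),
    (IMAGE_SAVING, "execute _GTS"),
    (IMAGE_SAVING, "set the GPE enable registers for wake-up devices"),
    (IMAGE_SAVING, "make the platform enter S4"),
]

# Steps on start-up; _BFS and _WAK run with interrupts on and with the
# nonboot CPUs offline, which is why the restored kernel owns them.
RESUME_FROM_S4 = [
    (IMAGE_LOADING, "check for a valid image without enabling ACPI"),
    (IMAGE_LOADING, "load the image"),
    (RESTORED, "turn on ACPI (unless the BIOS already did)"),
    (RESTORED, "execute _BFS"),
    (RESTORED, "execute _WAK"),
    (RESTORED, "continue"),
]


def steps_for(sequence, kernel):
    """Return the steps of a sequence that the given kernel carries out."""
    return [step for owner, step in sequence if owner == kernel]


def handoff_index(sequence):
    """Index of the first step carried out by the restored kernel."""
    return next(i for i, (owner, _) in enumerate(sequence) if owner == RESTORED)


print(steps_for(RESUME_FROM_S4, IMAGE_LOADING))
print("control passes to the restored kernel at step", handoff_index(RESUME_FROM_S4))
```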

> [snip]
> 
> > ... we can't return to the hibernated kernel unless we are going to cancel the
> > hibernation.
> 
> I agree.
> 
> > That's why I think that for the suspend-to-both the image-saving kernel will
> > need to support the same set of devices as the hibernated kernel.
> 
> If all of the devices that the image writing kernel doesn't know about
> have already been shut down/powered off by the hibernated kernel, then
> does the "image writing" kernel still need to know about them in order
> to suspend to RAM properly (i.e. without leaving some devices on wasting
> power)?

For some devices it may be illegal to leave them powered on while entering
S3.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 21:23                         ` Rafael J. Wysocki
@ 2007-07-17 21:17                           ` david
  2007-07-17 21:27                             ` Jeremy Maitin-Shepard
  2007-07-17 21:43                             ` Rafael J. Wysocki
  0 siblings, 2 replies; 220+ messages in thread
From: david @ 2007-07-17 21:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 22:34, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Tuesday, 17 July 2007 20:32, Alan Stern wrote:
>>>
>>>> I'm still not entirely clear on how "suspend-to-both" ought to be
>>>> handled.  Presumably it will start off as a normal hibernation.  But
>>>> instead of shutting down, wouldn't the kexec'd kernel return to the
>>>> original kernel?
>>>
>>> No, I think the image-saving kernel should suspend.  Then, on resume the
>>> platform will go back to it and it will jump back to the hibernated kernel.
>>>
>>>> After all, the original kernel knows about all the devices and can put them
>>>> into a low-power state, while the kexec'd kernel might not have sufficient
>>>> information.
>>>
>>> That's correct, but ...
>>>
>>>> But what about the freezer?  The original reason for using kexec was to
>>>> avoid the need for the freezer.  With no freezer, while the original
>>>> kernel is busy powering down its devices, user tasks will be free to
>>>> carry out I/O -- which will make the memory snapshot inconsistent with
>>>> the on-disk data structures.
>>>
>>> ... we can't return to the hibernated kernel unless we are going to cancel the
>>> hibernation.
>>
>> this is where we disagree.
>>
>> why not? if all that the hibernated kernel does is to suspend-to-ram and
>> makes no changes to disks or TCP connections anything that it does do
>> would be lost if power were to fail and you instead did a restore from
>> disk.
>
> How do you guarantee that no tasks are scheduled when you get back to the
> hibernated kernel?

just don't schedule any userspace tasks. all you need to do is to execute 
the ACPI sleep functions. you normally do that after stopping userspace 
anyway.

>> there is only a problem if something takes place that would prevent the
>> restore-from-disk from working. if this is done in a non-ACPI way that
>> will work across a power cycle you don't have to worry about the hardware
>> state not matching anyway.
>>
>>> That's why I think that for the suspend-to-both the image-saving kernel will
>>> need to support the same set of devices as the hibernated kernel.
>>
>> suspend-to-both doesn't really make sense if the suspend-to-disk portion
>> is using the ACPI S4 mode.
>
> Well, not exactly.  If your battery runs out of power while you're suspended,
> but you have the image saved, it's still better to restore from the image, even
> if something may not work correctly after the restore, than to risk a loss of
> data.

if things don't work correctly you are still risking the loss of data, the 
user just doesn't know it.

>> if you don't run out of power you will restore-from-ram
>>
>> if you do run out of power the restore-from-disk won't work either because
>> devices are not in the right ACPI states.
>
> See above.
>
> Greetings,
> Rafael
>
>
>

* Re: Hibernation considerations
  2007-07-17 20:27                     ` david
@ 2007-07-17 21:20                       ` Rafael J. Wysocki
       [not found]                         ` <ea7a437ca4038d408ac544bbc3c2434a@bga.com>
       [not found]                         ` <40fa2626aff7b6b590ad6aa4737fc873@bga.com>
  2007-07-17 22:38                       ` Alan Stern
  1 sibling, 2 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:20 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Alan Stern wrote:
> 
> > On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >
> >> I'm afraid of one thing, though.
> >>
> >> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> >> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> >> it turns out that we have to add some ACPI hooks to it, that might be difficult
> >> to do cleanly.
> >>
> >> Thus, it seems reasonable to think of the ACPI handling in advance.
> >
> > Absolutely.  This needs to be done in such a way that it will work:
> >
> > 	On platforms without ACPI;
> >
> > 	On platforms with ACPI where we do a non-ACPI type of shutdown
> > 	to whatever extent it is possible (or perhaps an ACPI-aware
> > 	shutdown rather than change to S4);
> >
> > 	On platforms with ACPI where we do an ACPI-aware transition
> > 	to S4.
> >
> > Rafael, for those of us who aren't thoroughly familiar with all the ins
> > and outs of the ACPI spec, could you please summarize a list of the
> > ACPI calls needed in the second and third cases above?  Indicate which
> > ones need to be done from within the original kernel and which should
> > be done from within a kexec'd hibernation kernel.
> >
> 
> there was just a link on slashdot to a primer on the subject of power 
> management
> 
> http://www.techarp.com/showarticle.aspx?artno=420
> 
> >
> > I'm still not entirely clear on how "suspend-to-both" ought to be
> > handled.  Presumably it will start off as a normal hibernation.  But
> > instead of shutting down, wouldn't the kexec'd kernel return to the
> > original kernel?  After all, the original kernel knows about all the
> > devices and can put them into a low-power state, while the kexec'd
> > kernel might not have sufficient information.
> 
> this is what I'm thinking, but the issue here is that the original kernel 
> needs to go into suspend-to-ram mode instead of resuming operation. per 
> the e-mail I got from Ying last night this should not be hard to 
> implement.
> 
> > But what about the freezer?  The original reason for using kexec was to
> > avoid the need for the freezer.  With no freezer, while the original
> > kernel is busy powering down its devices, user tasks will be free to
> > carry out I/O -- which will make the memory snapshot inconsistent with
> > the on-disk data structures.
> 
> no, user tasks just don't get scheduled during shutdown.
> 
> the big problem with the freezer isn't stopping anything from happening, 
> it's _selectively_ stopping things.

It's selectively stopping kernel threads, which is just about right.  If you
think that _this_ is a main problem with the freezer, then think again.

> with kexec you don't need to let any portion of the original kernel or 
> userspace operate so you don't have a problem.

In fact, the main problem with the freezer is that it is a coarse-grained
solution.  Therefore, what I believe we should do is to evolve in the direction
of more fine-grained solutions and gradually phase out the freezer.

The kexec-based approach is an attempt to replace one coarse-grained solution
(the freezer) with an even more coarse-grained solution (stopping the entire
kernel with everything), which IMO doesn't address the main problem.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 20:34                       ` david
  2007-07-17 20:54                         ` Jeremy Maitin-Shepard
@ 2007-07-17 21:23                         ` Rafael J. Wysocki
  2007-07-17 21:17                           ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:23 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 22:34, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Tuesday, 17 July 2007 20:32, Alan Stern wrote:
> >
> >> I'm still not entirely clear on how "suspend-to-both" ought to be
> >> handled.  Presumably it will start off as a normal hibernation.  But
> >> instead of shutting down, wouldn't the kexec'd kernel return to the
> >> original kernel?
> >
> > No, I think the image-saving kernel should suspend.  Then, on resume the
> > platform will go back to it and it will jump back to the hibernated kernel.
> >
> >> After all, the original kernel knows about all the devices and can put them
> >> into a low-power state, while the kexec'd kernel might not have sufficient
> >> information.
> >
> > That's correct, but ...
> >
> >> But what about the freezer?  The original reason for using kexec was to
> >> avoid the need for the freezer.  With no freezer, while the original
> >> kernel is busy powering down its devices, user tasks will be free to
> >> carry out I/O -- which will make the memory snapshot inconsistent with
> >> the on-disk data structures.
> >
> > ... we can't return to the hibernated kernel unless we are going to cancel the
> > hibernation.
> 
> this is where we disagree.
> 
> why not? if all that the hibernated kernel does is to suspend-to-ram and 
> makes no changes to disks or TCP connections anything that it does do 
> would be lost if power were to fail and you instead did a restore from 
> disk.

How do you guarantee that no tasks are scheduled when you get back to the
hibernated kernel?

> there is only a problem if something takes place that would prevent the 
> restore-from-disk from working. if this is done in a non-ACPI way that 
> will work across a power cycle you don't have to worry about the hardware 
> state not matching anyway.
> 
> > That's why I think that for the suspend-to-both the image-saving kernel will
> > need to support the same set of devices as the hibernated kernel.
> 
> suspend-to-both doesn't really make sense if the suspend-to-disk portion 
> is using the ACPI S4 mode.

Well, not exactly.  If your battery runs out of power while you're suspended,
but you have the image saved, it's still better to restore from the image, even
if something may not work correctly after the restore, than to risk a loss of
data.

> if you don't run out of power you will restore-from-ram
> 
> if you do run out of power the restore-from-disk won't work either because 
> devices are not in the right ACPI states.

See above.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 20:37                         ` david
  2007-07-17 20:56                           ` Jeremy Maitin-Shepard
@ 2007-07-17 21:24                           ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:24 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 22:37, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:
> 
> > "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> >
> > [snip]
> >
> >>> Rafael, for those of us who aren't thoroughly familiar with all the ins
> >>> and outs of the ACPI spec, could you please summarize a list of the
> >>> ACPI calls needed in the second and third cases above?  Indicate which
> >>> ones need to be done from within the original kernel and which should
> >>> be done from within a kexec'd hibernation kernel.
> >
> >> Sure.
> >
> >> In the third case (ie. transition to S4) we are supposed to do the following:
> >
> >> (1) Upon entering the sleep state, which IMO can be done _after_ the image
> >>     has been saved:
> >
> > I assume you mean "in order to enter the sleep state", rather than "upon
> > entering the sleep state".  I still don't understand what you mean by
> > "which IMO can be done _after_ the image has been saved"; as far as I
> > understand, the last step of this process, "make the platform enter S4",
> > is almost like a shutdown as far as the kernel is concerned (except for
> > the tiny detail of having to call those special ACPI methods on resume);
> > consequently, it would seem that nothing can be done after that step.
> >
> >>   * figure out which devices can wake up
> >>   * put devices into low power states (wake-up devices are placed in the Dx
> >>     states compatible with the wake capability, the others are powered off)
> 
> this can't be done by the image-saving kernel if that kernel doesn't know 
> about the device.

Good observation. :-)

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: Hibernation considerations
  2007-07-17 21:27                             ` Jeremy Maitin-Shepard
@ 2007-07-17 21:27                               ` david
  2007-07-17 21:54                                 ` Rafael J. Wysocki
  2007-07-17 21:45                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 21:27 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:

> david@lang.hm writes:
>
> [snip]
>
>>> How do you guarantee that no tasks are scheduled when you get back to the
>>> hibernated kernel?
>
>> just don't schedule any userspace tasks. all you need to do is to execute the
>> ACPI sleep functions. you normally do that after stopping userspace
>> anyway.
>
> What does "stopping userspace" mean?  You already said it does not mean
> disabling interrupts.  But using the freezer is also not an option,
> since the avoidance of that is the main reason for the kexec approach in
> the first place.

just don't schedule any non-kernel threads.

remember that the normal shutdown/suspend procedure is (from another 
related thread)

> >>sys_reboot(LINUX_REBOOT_CMD_KEXEC)
> >>     kernel_kexec
> >>         kernel_restart_prepare
> >>             device_shutdown
> >>         machine_shutdown
> >>         machine_kexec

I'm just saying that instead of going back to the normal operation of the 
kernel you just go directly to the new shutdown routine instead.

> [snip]
>
>>> Well, not exactly.  If your battery runs out of power while you're suspended,
>>> but you have the image saved, it's still better to restore from the image,
>>> even if something may not work correctly after the restore, than to risk a
>>> loss of data.
>
>> if things don't work correctly you are still risking the loss of data, the user
>> just doesn't know it.
>
> It should be possible on any system to do a hibernate followed by a
> shutdown (and then resume properly, without any problems).  Thus, for
> handling suspend to both, you resume as if the system had been shut down,
> rather than resuming as if the system came from S4.

I agree with this, but according to Rafael if the "hibernated" image is 
assuming that the devices were put into low-power mode by ACPI and you 
boot up instead the system doesn't work right.

David Lang


* Re: Hibernation considerations
  2007-07-17 21:17                           ` david
@ 2007-07-17 21:27                             ` Jeremy Maitin-Shepard
  2007-07-17 21:27                               ` david
  2007-07-17 21:45                               ` Rafael J. Wysocki
  2007-07-17 21:43                             ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 21:27 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

david@lang.hm writes:

[snip]

>> How do you guarantee that no tasks are scheduled when you get back to the
>> hibernated kernel?

> just don't schedule any userspace tasks. all you need to do is to execute the
> ACPI sleep functions. you normally do that after stopping userspace
> anyway.

What does "stopping userspace" mean?  You already said it does not mean
disabling interrupts.  But using the freezer is also not an option,
since the avoidance of that is the main reason for the kexec approach in
the first place.

[snip]

>> Well, not exactly.  If your battery runs out of power while you're suspended,
>> but you have the image saved, it's still better to restore from the image,
>> even if something may not work correctly after the restore, than to risk a
>> loss of data.

> if things don't work correctly you are still risking the loss of data, the user
> just doesn't know it.

It should be possible on any system to do a hibernate followed by a
shutdown (and then resume properly, without any problems).  Thus, for
handling suspend to both, you resume as if the system had been shut down,
rather than resuming as if the system came from S4.

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 20:53                               ` david
@ 2007-07-17 21:37                                 ` Rafael J. Wysocki
  2007-07-17 21:42                                   ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:37 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 22:53, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Tuesday, 17 July 2007 22:18, david@lang.hm wrote:
> >> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>
> >>> On Tuesday, 17 July 2007 19:06, david@lang.hm wrote:
> >>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>>>
> >>>>> On Tuesday, 17 July 2007 17:29, david@lang.hm wrote:
> >>>>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>>>>>
> >>>>>>> On Tuesday, 17 July 2007 16:15, Alan Stern wrote:
> >>>>>>>> On Mon, 16 Jul 2007 david@lang.hm wrote:
> >>>>>>>>
> >>>>>>>>>> I agree, it would be good to have a non-ACPI-specific hibernation mode,
> >>>>>>>>>> something which would look to ACPI like a normal shutdown.  But I'm not
> >>>>>>>>>> so sure this is possible.
> >>>>>>>>>
> >>>>>>>>> why would it not be possible?
> >>>>>>>>
> >>>>>>>>> I can't think of anything much more frustrating than thinking that I
> >>>>>>>>> suspended a system and then discovering that because the battery went dead
> >>>>>>>>> (a complete power loss) that the system wouldn't boot up properly. to me
> >>>>>>>>> this would be a fairly common condition (when I'm mobile I use the machine
> >>>>>>>>> until I am out of battery, then stop and it may be a long time (days)
> >>>>>>>>> before I can charge the thing up again) this would not be a reliable
> >>>>>>>>> suspend as far as I'm concerned.
> >>>>>>>>>
> >>>>>>>>> for suspend-to-ram you have to worry about ACPI states and what you are
> >>>>>>>>> doing with them, for suspend-to-disk you can ignore them and completely
> >>>>>>>>> power the system off instead.
> >>>>>>>>
> >>>>>>>> If the only problem with doing this would be lack of wakeup support
> >>>>>>>> then I'm all for it.  There must be a lot of people who would like
> >>>>>>>> their computers to hibernate with power drain as close to 0 as possible
> >>>>>>>> and who don't care about remote wakeup.  In fact they might even prefer
> >>>>>>>> not to have wakeup support, so the computer doesn't resume at
> >>>>>>>> unexpected times.
> >>>>>>>
> >>>>>>> I'm afraid of one thing, though.
> >>>>>>>
> >>>>>>> If we create a framework without ACPI (well, ACPI needs to be enabled in the
> >>>>>>> kernel anyway for other reasons, like the ability to suspend to RAM) and then
> >>>>>>> it turns out that we have to add some ACPI hooks to it, that might be difficult
> >>>>>>> to do cleanly.
> >>>>>>
> >>>>>> doing suspend-to-ram should be orthogonal to doing hibernate-to-disk. some
> >>>>>> people will want both, some won't.
> >>>>>>
> >>>>>> at the moment kexec doesn't work with ACPI, that is a limitation that
> >>>>>> should be fixed, but making it able to work with ACPI enabled doesn't
> >>>>>> mean that it needs to be changed to depend on ACPI and it especially
> >>>>>> doesn't mean that it should pick up the limitations of the existing ACPI
> >>>>>> based hibernation approaches.
> >>>>>>
> >>>>>> if there is no ACPI on the system it should work, if there is ACPI on the
> >>>>>> system it should still work.
> >>>>>>
> >>>>>>> Thus, it seems reasonable to think of the ACPI handling in advance.
> >>>>>>
> >>>>>> but don't become dependent on ACPI.
> >>>>>
> >>>>> Not dependent, but with the possibility of ACPI support taken into account.
> >>>>>
> >>>>> Arguably you can create a framework that, for example, will not allow the user
> >>>>> to adjust the size of the image, but then adding such a functionality may
> >>>>> require you to change the entire design.  Same thing with ACPI.
> >>>>>
> >>>>> I would rather avoid such pitfalls, if I could.
> >>>>
> >>>> Ok, what is it that you think ACPI fundamentally changes in this process?
> >>>>
> >>>> keep in mind that we are not making the assumption that the hardware
> >>>> will remain powered (even a little bit), or the assumption that nothing
> >>>> else will run on the hardware (eliminating any possibility that the
> >>>> hardware is in a known ACPI state)
> >>>
> >>> Well, first, the fact is that _some_ systems _will_ be powered while in
> >>> hibernation (the majority of notebooks, for example) and you should assume
> >>> that the platform _may_ retain some information across the hibernation/restore
> >>> cycle.  In that case you _should_ _not_ trash the information retained by the
> >>> platform.
> >>
> >> no, systems that remain powered while asleep are a different type of
> >> suspend than ones that don't.
> >>
> >>> Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
> >>> state as a result of the hibernation procedure.  In my opinion this may be done
> >>> after saving the image, but still this means, for example, that the
> >>> image-saving kernel needs to support ACPI.
> >>>
> >>> Next, during the restore, we should first check if the image is present (and
> >>> valid) _without_ turning ACPI on (note that this is not done by the current
> >>> hibernation code and that leads to strange problems on some systems).  Then,
> >>> if the image is present (and valid), we should first load it, jump to the
> >>> hibernated kernel and _then_ turn ACPI on and execute the _BFS and
> >>> _WAK ACPI global methods (again, this is not done by the current code in that
> >>> order, which is wrong).  Only after that is the hibernated kernel supposed to
> >>> continue.
> >>>
> >>> [Please refer to section 15.3 of the 3.0b ACPI spec for details.]
> >>
> >> you are starting from the assumption that ACPI S4 mode should be used.
> >>
> >> I'm saying that a suspend that uses ACPI S4 mode is fundamentally
> >> different from one that does a power off instead.
> >
> > It is different, but not fundamentally.
> >
> >> from my point of view the ACPI S4 sleep mode has far more in common with
> >> suspend-to-ram than with the suspend-to-disk that I'm talking about
> >>
> >> non-ACPI hibernate
> >>
> >>    since the box powers off
> >>      it uses zero power while suspended
> >>      another OS could be run before a resume
> >>      hardware can be swapped, suspend image could be sent around the world to be restored on another system.
> >>      restore makes no assumptions about the state of the hardware when it is restored
> >>      restore is slower (full BIOS boot is required)
> >>    should be able to work on just about any hardware (the limit is the ability to initialize the devices)
> >>
> >>
> >> ACPI suspends
> >>
> >>    since the box never completely powers off:
> >>      a complete power failure breaks the suspend
> >>      the OS must remain in control so other uses must be prevented.
> >>      hardware must remain in the ACPI state from suspend until restore.
> >>      restore can be faster (some initialization may be able to be skipped)
> >>    requires ACPI hardware support
> >>
> >> under the category of ACPI suspends you have
> >>
> >>    fast suspend-to-ram (stop scheduling, put the CPU to sleep, as long as
> >> the memory keeps getting refreshed)
> >>    slow suspend-to-ram (stop scheduling, put as much of the hardware as
> >> possible to sleep, including spinning down disks and other things that
> >> take a while to undo)
> >>    suspend-to-disk (stop scheduling, copy the ram somewhere so that it
> >> doesn't need to be refreshed, put everything into low-power mode)
> >>
> >>    and there are probably quite a few others as well. but they are all in
> >> the same family in that you have to worry about ACPI states, and they all
> >> have the same restrictions on what can happen between suspend and resume
> >>
> >> the non-ACPI hibernate behaves very differently, and for some people (and
> >> I think I am one of them) it will meet their needs better than _any_ of
> >> the ACPI suspends.
> >
> > OTOH, there are many people who would want the ACPI suspends to be handled
> > and they don't really care for the power-off-only hibernation.
> >
> > If you aren't going to support the ACPI hibernation, your framework will be
> > incomplete and therefore not generally useful.
> 
> if you make the framework limited by the ACPI requirement, your framework
> will not be able to be used in all cases and is therefore incomplete and
> not generally useful.
> 
> see, I can make authoritative sounding declarations too. :-)

Well, being able to support ACPI need not imply being unable to work in the
other cases.  Conversely, being able to work in non-ACPI cases need not imply
being unable to support ACPI.  Hence, you can support ACPI and be able to
work in the other cases at the same time.

Try again. ;-)

> I agree that some people want ACPI suspends, but you don't seem to allow 
> the fact that some people don't,

No, I do.

What I'm trying to say is that we should support the ACPI S4 transition and
resume, which DOES NOT MEAN that we should support ONLY that.

> and those people don't want to have the ACPI based limits.

There are NO ACPI LIMITS!  There only are things that you need to implement
if you're going to support ACPI, but they need not be used ALWAYS, no?

> they _especially_ don't want those limits when it appears as if supporting
> those limits is what's preventing their much simpler case from working
> reliably. 
> 
> I strongly suspect that the majority of users don't care about ACPI, they 
> want to be able to pause and resume their machine. they may want a couple 
> options for how fast the resume is (trading resume speed against how much 
> power the system eats), but the deep sleep modes (suspend-to-disk, 
> hibernate) probably have restore times that are close enough to each other 
> that very few people would care enough to opt for an ACPI S4 mode that
> won't survive a loss of battery power over a non-ACPI mode that would.

Please don't argue like that, these are not arguments for avoiding the ACPI
support.  Arguably, you can support _both_ ACPI and non-ACPI hibernation, so
design for being able to support both.  End of story (as far as I'm concerned).
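
One way to picture "design for being able to support both" is a core
hibernation path that takes an optional set of platform callbacks.  The
following is only a minimal sketch in Python, used here as a small model rather
than kernel code; `PlatformHooks`, `acpi_enter_s4` and `hibernate` are
hypothetical names made up for the illustration and are not an existing kernel
interface.  It assumes, as suggested earlier in the thread, that the S4
transition can happen after the image has been saved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PlatformHooks:
    """Optional platform callbacks; hypothetical, not an existing kernel API."""
    name: str
    enter_sleep: Callable[[List[str]], None]  # e.g. the ACPI S4 transition


def acpi_enter_s4(log: List[str]) -> None:
    # Stand-in for the real platform work; here it only records the step.
    log.append("enter ACPI S4")


def hibernate(hooks: Optional[PlatformHooks] = None) -> List[str]:
    """Return the ordered steps taken; the image is saved the same way in both cases."""
    log: List[str] = ["save image"]
    if hooks is not None:
        hooks.enter_sleep(log)      # platform-specific final step
    else:
        log.append("power off")     # plain power-off, no platform involved
    return log


print(hibernate())
print(hibernate(PlatformHooks("acpi", acpi_enter_s4)))
```

The point of the sketch is only that the image-saving step does not depend on
whether hooks are present; the ACPI-specific work is confined to the callback.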

Greetings,
Rafael
 

-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 21:06                             ` david
@ 2007-07-17 21:40                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:40 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 23:06, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:
> 
> > david@lang.hm writes:
> >
> > [snip]
> >
> >>>> * figure out which devices can wake up
> >>>> * put devices into low power states (wake-up devices are placed in the Dx
> >>>> states compatible with the wake capability, the others are powered off)
> >
> >> this can't be done by the image-saving kernel if that kernel doesn't know about
> >> the device.
> >
> > The image-saving kernel can be made to know about all of the "wake up"
> > devices; all other devices should have already been powered off by the
> > "hibernated" kernel.
> 
> not necessarily.

More than that.  The hibernated kernel should not power off any devices,
because that is _wasteful_ (unless, of course, powering off a device is the
only way to quiesce it).

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 21:37                                 ` Rafael J. Wysocki
@ 2007-07-17 21:42                                   ` david
  2007-07-17 21:53                                     ` Jeremy Maitin-Shepard
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 21:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:

> On Tuesday, 17 July 2007 22:53, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Tuesday, 17 July 2007 22:18, david@lang.hm wrote:
>>>> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
>>>>
>>>>> Now, with that in mind, ACPI requires us to make the system enter the S4 sleep
>>>>> state as a result of the hibernation procedure.  In my opinion this may be done
>>>>> after saving the image, but still this means, for example, that the
>>>>> image-saving kernel needs to support ACPI.
>>>>>
>>>>> Next, during the restore, we should first check if the image is present (and
>>>>> valid) _without_ turning ACPI on (note that this is not done by the current
>>>>> hibernation code and that leads to strange problems on some systems).  Then,
>>>>> if the image is present (and valid), we should first load it, jump to the
>>>>> hibernated kernel and _then_ turn ACPI on and execute the _BFS and
>>>>> _WAK ACPI global methods (again, this is not done by the current code in that
>>>>> order, which is wrong).  Only after that is the hibernated kernel supposed to
>>>>> continue.
>>>>>
>>>>> [Please refer to section 15.3 of the 3.0b ACPI spec for details.]
>>>>
>>>> you are starting from the assumption that ACPI S4 mode should be used.
>>>>
>>>> I'm saying that a suspend that uses ACPI S4 mode is fundamentally
>>>> different from one that does a power off instead.
>>>
>>> It is different, but not fundamentally.
>>>
>>>> from my point of view the ACPI S4 sleep mode has far more in common with
>>>> suspend-to-ram than with the suspend-to-disk that I'm talking about
>>>>
>>>> non-ACPI hibernate
>>>>
>>>>    since the box powers off
>>>>      it uses zero power while suspended
>>>>      another OS could be run before a resume
>>>>      hardware can be swapped, suspend image could be sent around the world to be restored on another system.
>>>>      restore makes no assumptions about the state of the hardware when it is restored
>>>>      restore is slower (full BIOS boot is required)
>>>>    should be able to work on just about any hardware (the limit is the ability to initialize the devices)
>>>>
>>>>
>>>> ACPI suspends
>>>>
>>>>    since the box never completely powers off:
>>>>      a complete power failure breaks the suspend
>>>>      the OS must remain in control so other uses must be prevented.
>>>>      hardware must remain in the ACPI state from suspend until restore.
>>>>      restore can be faster (some initialization may be able to be skipped)
>>>>    requires ACPI hardware support
>>>>
>>>> under the category of ACPI suspends you have
>>>>
>>>>    fast suspend-to-ram (stop scheduling, put the CPU to sleep, as long as
>>>> the memory keeps getting refreshed)
>>>>    slow suspend-to-ram (stop scheduling, put as much of the hardware as
>>>> possible to sleep, including spinning down disks and other things that
>>>> take a while to undo)
>>>>    suspend-to-disk (stop scheduling, copy the ram somewhere so that it
>>>> doesn't need to be refreshed, put everything into low-power mode)
>>>>
>>>>    and there are probably quite a few others as well. but they are all in
>>>> the same family in that you have to worry about ACPI states, and they all
>>>> have the same restrictions on what can happen between suspend and resume
>>>>
>>>> the non-ACPI hibernate behaves very differently, and for some people (and
>>>> I think I am one of them) it will meet their needs better than _any_ of
>>>> the ACPI suspends.
>>>
>>> OTOH, there are many people who would want the ACPI suspends to be handled
>>> and they don't really care for the power-off-only hibernation.
>>>
>>> If you aren't going to support the ACPI hibernation, your framework will be
>>> incomplete and therefore not generally useful.
>>
>> if you make the framework limited by the ACPI requirement, your framework
>> will not be able to be used in all cases and is therefore incomplete and
>> not generally useful.
>>
>> see, I can make authoritative sounding declarations too. :-)
>
> Well, being able to support ACPI need not imply being unable to work in the
> other cases.  Conversely, being able to work in non-ACPI cases need not imply
> being unable to support ACPI.  Hence, you can support ACPI and be able to
> work in the other cases at the same time.
>
> Try again. ;-)

I thought it was you who was saying that the restore operation required
ACPI if you used ACPI at all.

>> I agree that some people want ACPI suspends, but you don't seem to allow
>> the fact that some people don't,
>
> No, I do.
>
> What I'm trying to say is that we should support the ACPI S4 transition and
> resume, which DOES NOT MEAN that we should support ONLY that.

it has sounded like you are saying that if you support ACPI S4 sleep then 
it's incompatible with powering off.

>> and those people don't want to have the ACPI based limits.
>
> There are NO ACPI LIMITS!  There only are things that you need to implement
> if you're going to support ACPI, but they need not be used ALWAYS, no?

yes there are limits. the fact that you can't remove the battery in S4 
mode without messing things up is a limit, the fact that it's against ACPI 
specs to boot another OS before resume is a limit. you seem to keep 
pointing out requirements of ACPI that limit what can be done. I do not 
disagree about these being requirements _if you are using ACPI_, but if
you don't need ACPI they should not limit you.

>> they _especially_ don't want those limits when it appears as if supporting
>> those limits is what's preventing their much simpler case from working
>> reliably.
>>
>> I strongly suspect that the majority of users don't care about ACPI, they
>> want to be able to pause and resume their machine. they may want a couple
>> options for how fast the resume is (trading resume speed against how much
>> power the system eats), but the deep sleep modes (suspend-to-disk,
>> hibernate) probably have restore times that are close enough to each other
>> that very few people would care enough to opt for an ACPI S4 mode that
>> won't survive a loss of battery power over a non-ACPI mode that would.
>
> Please don't argue like that, these are not arguments for avoiding the ACPI
> support.  Arguably, you can support _both_ ACPI and non-ACPI hibernation, so
> design for being able to support both.  End of story (as far as I'm concerned).

I wish I understood how you are saying this, because it does not match
what I thought I have been reading for the last few days.

David Lang


* Re: Hibernation considerations
  2007-07-17 21:17                           ` david
  2007-07-17 21:27                             ` Jeremy Maitin-Shepard
@ 2007-07-17 21:43                             ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:43 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 23:17, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Tuesday, 17 July 2007 22:34, david@lang.hm wrote:
> >> On Tue, 17 Jul 2007, Rafael J. Wysocki wrote:
> >>
> >>> On Tuesday, 17 July 2007 20:32, Alan Stern wrote:
> >>>
> >>>> I'm still not entirely clear on how "suspend-to-both" ought to be
> >>>> handled.  Presumably it will start off as a normal hibernation.  But
> >>>> instead of shutting down, wouldn't the kexec'd kernel return to the
> >>>> original kernel?
> >>>
> >>> No, I think the image-saving kernel should suspend.  Then, on resume the
> >>> platform will go back to it and it will jump back to the hibernated kernel.
> >>>
> >>>> After all, the original kernel knows about all the devices and can put them
> >>>> into a low-power state, while the kexec'd kernel might not have sufficient
> >>>> information.
> >>>
> >>> That's correct, but ...
> >>>
> >>>> But what about the freezer?  The original reason for using kexec was to
> >>>> avoid the need for the freezer.  With no freezer, while the original
> >>>> kernel is busy powering down its devices, user tasks will be free to
> >>>> carry out I/O -- which will make the memory snapshot inconsistent with
> >>>> the on-disk data structures.
> >>>
> >>> ... we can't return to the hibernated kernel unless we are going to cancel the
> >>> hibernation.
> >>
> >> this is where we disagree.
> >>
> >> why not? if all that the hibernated kernel does is to suspend-to-ram and
> >> makes no changes to disks or TCP connections anything that it does do
> >> would be lost if power were to fail and you instead did a restore from
> >> disk.
> >
> > How do you guarantee that no tasks are scheduled when you get back to the
> > hibernated kernel?
> 
> just don't schedule any userspace tasks. all you need to do is to execute 
> the ACPI sleep functions. you normally do that after stopping userspace 
> anyway.

This is plain reinventing the freezer under another guise, sorry. ;-)

> >> there is only a problem if something takes place that would prevent the
> >> restore-from-disk from working. if this is done in a non-ACPI way that
> >> will work across a power cycle you don't have to worry about the hardware
> >> state not matching anyway.
> >>
> >>> That's why I think that for the suspend-to-both the image-saving kernel will
> >>> need to support the same set of devices as the hibernated kernel.
> >>
> >> suspend-to-both doesn't really make sense if the suspend-to-disk portion
> >> is using the ACPI S4 mode.
> >
> > Well, not exactly.  If your battery runs out of power while you're suspended,
> > but you have the image saved, it's still better to restore from the image, even
> > if something may not work correctly after the restore, than to risk a loss of
> > data.
> 
> if things don't work correctly you are still risking the loss of data, the 
> user just doesn't know it.

Experience shows that this is not the case.  The disks usually work (you
have to initialize them without ACPI anyway to load the image, no?).

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 21:27                             ` Jeremy Maitin-Shepard
  2007-07-17 21:27                               ` david
@ 2007-07-17 21:45                               ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:45 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: david, Alan Stern, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list,
	Al Boldi

On Tuesday, 17 July 2007 23:27, Jeremy Maitin-Shepard wrote:
> david@lang.hm writes:
> 
> [snip]
> 
> >> How do you guarantee that no tasks are scheduled when you get back to the
> >> hibernated kernel?
> 
> > just don't schedule any userspace tasks. all you need to do is to execute the
> > ACPI sleep functions. you normally do that after stopping userspace
> > anyway.
> 
> What does "stopping userspace" mean?  You already said it does not mean
> disabling interrupts.  But using the freezer is also not an option,
> since the avoidance of that is the main reason for the kexec approach in
> the first place.
> 
> [snip]
> 
> >> Well, not exactly.  If your battery runs out of power while you're suspended,
> >> but you have the image saved, it's still better to restore from the image,
> >> even if something may not work correctly after the restore, than to risk a
> >> loss of data.
> 
> > if things don't work correctly you are still risking the loss of data, the user
> > just doesn't know it.
> 
> It should be possible on any system to do a hibernate followed by a
> shutdown (and then resume properly, without any problems).  Thus, for
> > handling suspend to both, you resume as if the system had been shut down,
> rather than resuming as if the system came from S4.

Exactly.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 21:42                                   ` david
@ 2007-07-17 21:53                                     ` Jeremy Maitin-Shepard
  0 siblings, 0 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-17 21:53 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

david@lang.hm writes:

[snip]
 
>> There are NO ACPI LIMITS!  There only are things that you need to implement
>> if you're going to support ACPI, but they need not be used ALWAYS, no?

> yes there are limits. the fact that you can't remove the battery in S4 mode
> without messing things up is a limit,

You won't mess things up as long as the resuming kernel knows that it
should resume as if the system were shut down, rather than sent to S4
state.  Maybe it is even possible to detect what type of resuming is
needed automatically.  Similarly, booting another OS shouldn't be a
problem, except that if you do it without powering off the system first,
some devices might not work under the other OS if the other OS doesn't
initialize them properly.
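
As a rough illustration of such automatic detection, here is a minimal sketch
in Python, used as a small model rather than kernel code.  It assumes two
pieces of information that the thread does not specify: a flag stored with the
image saying whether the hibernating kernel went on to enter S4, and some
indication from the platform that the machine really woke up from S4.  Both
inputs and the name `choose_resume_path` are hypothetical.

```python
def choose_resume_path(image_entered_s4: bool, platform_woke_from_s4: bool) -> str:
    """Pick how the restored kernel should treat the hardware.

    Both arguments are assumptions made for this sketch: the first would be
    recorded with the image, the second would come from the platform at boot.
    """
    if image_entered_s4 and platform_woke_from_s4:
        # The platform state was retained, so the S4 resume steps apply.
        return "resume-from-S4"
    # Power was lost, or S4 was never entered: treat the hardware as if the
    # system had been shut down and reinitialize everything.
    return "resume-as-after-shutdown"


print(choose_resume_path(True, True))
print(choose_resume_path(True, False))   # e.g. the battery was removed
```

With a check like this, a lost battery turns an intended S4 resume into the
shutdown-style resume instead of leaving the kernel with wrong assumptions
about device state.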

[snip]

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-17 21:27                               ` david
@ 2007-07-17 21:54                                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-17 21:54 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tuesday, 17 July 2007 23:27, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Jeremy Maitin-Shepard wrote:
> 
> > david@lang.hm writes:
> >
> > [snip]
> >
> >>> How do you guarantee that no tasks are scheduled when you get back to the
> >>> hibernated kernel?
> >
> >> just don't schedule any userspace tasks. all you need to do is to execute the
> >> ACPI sleep functions. you normally do that after stopping userspace
> >> anyway.
> >
> > What does "stopping userspace" mean?  You already said it does not mean
> > disabling interrupts.  But using the freezer is also not an option,
> > since the avoidance of that is the main reason for the kexec approach in
> > the first place.
> 
> just don't schedule any non-kernel threads.
> 
> remember that the normal shutdown/suspend procedure is (from another 
> related thread)
> 
> > >>sys_reboot(LINUX_REBOOT_CMD_KEXEC)
> > >>     kernel_kexec
> > >>         kernel_restart_prepare
> > >>             device_shutdown
> > >>         machine_shutdown
> > >>         machine_kexec
> 
> I'm just saying that instead of going back to the normal operation of the 
> kernel you just go directly to the new shutdown routine instead.
> 
> > [snip]
> >
> >>> Well, not exactly.  If your battery runs out of power while you're suspended,
> >>> but you have the image saved, it's still better to restore from the image,
> >>> even if something may not work correctly after the restore, than to risk a
> >>> loss of data.
> >
> >> if things don't work correctly you are still risking the loss of data, the user
> >> just doesn't know it.
> >
> > It should be possible on any system to do a hibernate followed by a
> > shutdown (and then resume properly, without any problems).  Thus, for
> > > handling suspend to both, you resume as if the system had been shut down,
> > rather than resuming as if the system came from S4.
> 
> I agree with this, but according to Rafael if the "hibernated" image is 
> assuming that the devices were put into low-power mode by ACPI and you 
> boot up instead the system doesn't work right.

That's correct, some devices don't work right, but these are not disks.  They
usually are devices closely related to the platform, like the embedded
controller.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-17 22:38                       ` Alan Stern
@ 2007-07-17 22:37                         ` david
  2007-07-18 14:29                           ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-17 22:37 UTC (permalink / raw)
  To: Alan Stern
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007, Alan Stern wrote:

> On Tue, 17 Jul 2007 david@lang.hm wrote:
>
>>> But what about the freezer?  The original reason for using kexec was to
>>> avoid the need for the freezer.  With no freezer, while the original
>>> kernel is busy powering down its devices, user tasks will be free to
>>> carry out I/O -- which will make the memory snapshot inconsistent with
>>> the on-disk data structures.
>>
>> no, user tasks just don't get scheduled during shutdown.
>
> But a user task may be holding a lock which is needed for putting some
> device into low-power mode.  It can't release that lock if it doesn't
> get scheduled.

then you can't suspend that box. if you schedule it, it could get another 
lock (or another process gets another lock)

if you can't power down or put hardware into low-power mode without the 
approval of userspace, you are in serious trouble.

David Lang


* Re: Hibernation considerations
  2007-07-17 20:27                     ` david
  2007-07-17 21:20                       ` Rafael J. Wysocki
@ 2007-07-17 22:38                       ` Alan Stern
  2007-07-17 22:37                         ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-17 22:38 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007 david@lang.hm wrote:

> > But what about the freezer?  The original reason for using kexec was to
> > avoid the need for the freezer.  With no freezer, while the original
> > kernel is busy powering down its devices, user tasks will be free to
> > carry out I/O -- which will make the memory snapshot inconsistent with
> > the on-disk data structures.
> 
> no, user tasks just don't get scheduled during shutdown.

But a user task may be holding a lock which is needed for putting some 
device into low-power mode.  It can't release that lock if it doesn't 
get scheduled.
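
A toy model of the situation described above, written in Python for brevity
rather than as kernel code.  The names `device_lock`, `user_task_enters_ioctl`
and `suspend_device` are made up for this sketch; the only assumption is that
the suspend path and a user-initiated path contend for the same lock:

```python
import threading

# Hypothetical stand-in for a lock that a driver takes both on behalf of a
# user task (e.g. in an ioctl path) and in its suspend callback.
device_lock = threading.Lock()

def user_task_enters_ioctl():
    """The user task takes the lock; in this model it is then never run again."""
    device_lock.acquire()

def suspend_device(timeout_s=0.05):
    """Suspend path: needs the same lock before it can power the device down."""
    got_it = device_lock.acquire(timeout=timeout_s)
    if got_it:
        device_lock.release()
    return got_it

user_task_enters_ioctl()      # task is descheduled while holding the lock
print(suspend_device())       # suspend cannot make progress
device_lock.release()         # model the task being scheduled and finishing
print(suspend_device())       # now the suspend path gets the lock
```

In this model the suspend path can only succeed once the lock holder has run
again, which is the case a "just don't schedule user tasks" scheme would have
to handle somehow.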

Alan Stern



* Re: Hibernation considerations
  2007-07-17 15:19                     ` david
@ 2007-07-18  2:18                       ` Matthew Garrett
  2007-07-18  3:54                         ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Matthew Garrett @ 2007-07-18  2:18 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Jim Crilly, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Tue, Jul 17, 2007 at 08:19:32AM -0700, david@lang.hm wrote:
> On Tue, 17 Jul 2007, Matthew Garrett wrote:
> >Powering off rather than using S4 means you lose most wakeup device
> >support. That would be a functional regression compared to the current
> >code.
> 
> only if the kexec isn't able to initialize those devices.

If you aren't using ACPI, you probably don't know how to.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Hibernation considerations
  2007-07-18  2:18                       ` Matthew Garrett
@ 2007-07-18  3:54                         ` david
  2007-07-18 11:10                           ` Matthew Garrett
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-18  3:54 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Jim Crilly, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Wed, 18 Jul 2007, Matthew Garrett wrote:

> On Tue, Jul 17, 2007 at 08:19:32AM -0700, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Matthew Garrett wrote:
>>> Powering off rather than using S4 means you lose most wakeup device
>>> support. That would be a functional regression compared to the current
>>> code.
>>
>> only if the kexec isn't able to initialize those devices.
>
> If you aren't using ACPI, you probably don't know how to.

the current kexec patch to allow you to move back to the original kernel 
requires ACPI be disabled to work.

that states pretty strongly that kexec isn't dependent on ACPI to 
initialize devices.

David Lang


* Re: Hibernation considerations
  2007-07-18  3:54                         ` david
@ 2007-07-18 11:10                           ` Matthew Garrett
  2007-07-18 12:56                             ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Matthew Garrett @ 2007-07-18 11:10 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Jim Crilly, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Tue, Jul 17, 2007 at 08:54:24PM -0700, david@lang.hm wrote:
> On Wed, 18 Jul 2007, Matthew Garrett wrote:
> 
> >On Tue, Jul 17, 2007 at 08:19:32AM -0700, david@lang.hm wrote:
> >>On Tue, 17 Jul 2007, Matthew Garrett wrote:
> >>>Powering off rather than using S4 means you lose most wakeup device
> >>>support. That would be a functional regression compared to the current
> >>>code.
> >>
> >>only if the kexec isn't able to initialize those devices.
> >
> >If you aren't using ACPI, you probably don't know how to.
> 
> the current kexec patch to allow you to move back to the original kernel 
> requires ACPI be disabled to work.

Which means it isn't putting the hardware into S4, which means that you 
don't get the platform wakeup events.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: Hibernation considerations
  2007-07-18 11:10                           ` Matthew Garrett
@ 2007-07-18 12:56                             ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-18 12:56 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Jim Crilly, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, Al Boldi

On Wed, 18 Jul 2007, Matthew Garrett wrote:

> On Tue, Jul 17, 2007 at 08:54:24PM -0700, david@lang.hm wrote:
>> On Wed, 18 Jul 2007, Matthew Garrett wrote:
>>
>>> On Tue, Jul 17, 2007 at 08:19:32AM -0700, david@lang.hm wrote:
>>>> On Tue, 17 Jul 2007, Matthew Garrett wrote:
>>>>> Powering off rather than using S4 means you lose most wakeup device
>>>>> support. That would be a functional regression compared to the current
>>>>> code.
>>>>
>>>> only if the kexec isn't able to initialize those devices.
>>>
>>> If you aren't using ACPI, you probably don't know how to.
>>
>> the current kexec patch to allow you to move back to the original kernel
>> requires ACPI be disabled to work.
>
> Which means it isn't putting the hardware into S4, which means that you
> don't get the platform wakeup events.

Ok, I was misunderstanding what you meant by wakeup device support. If you 
mean things like wake-on-lan then yes, you do lose that with no 
ACPI support, but that is acceptable to many people. I understand it's not 
for some, and I'm not saying that this should be the only type of suspend, 
but I'm also saying that this should be one option.

David Lang


* Re: Hibernation considerations
  2007-07-17 22:37                         ` david
@ 2007-07-18 14:29                           ` Alan Stern
  2007-07-18 14:47                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-18 14:29 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Tue, 17 Jul 2007 david@lang.hm wrote:

> On Tue, 17 Jul 2007, Alan Stern wrote:
> 
> > On Tue, 17 Jul 2007 david@lang.hm wrote:
> >
> >>> But what about the freezer?  The original reason for using kexec was to
> >>> avoid the need for the freezer.  With no freezer, while the original
> >>> kernel is busy powering down its devices, user tasks will be free to
> >>> carry out I/O -- which will make the memory snapshot inconsistent with
> >>> the on-disk data structures.
> >>
> >> no, user tasks just don't get scheduled during shutdown.
> >
> > But a user task may be holding a lock which is needed for putting some
> > device into low-power mode.  It can't release that lock if it doesn't
> > get scheduled.
> 
> then you can't suspend that box. if you schedule it, it could get another 
> lock (or another process gets another lock)
> 
> if you can't power down or put hardware into low-power mode without the 
> approval of userspace, you are in serious trouble.

You don't seem to appreciate the issues involved here.  Part of the 
justification for the freezer is that it doesn't need userspace 
approval and it freezes tasks at controlled points where they don't 
hold any locks.

Never mind.  It seems clear that this approach will suffer the same 
drawback as the proposal for removing the freezer from the 
suspend-to-RAM pathway.  Namely, device drivers will have to be changed 
to prevent user I/O requests from proceeding while devices are supposed 
to be quiescent or in a low-power state.

If a driver fails to handle this properly, its device could be 
reactivated in order to service a user request before the memory 
snapshot is made.  This could easily ruin the snapshot.
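
A minimal sketch of what "preventing user I/O requests from proceeding" could
look like, as a Python toy rather than a real driver.  `ToyDriver` and all of
its members are hypothetical; the assumption is simply that a quiesced driver
holds incoming requests in a queue instead of touching the hardware:

```python
from collections import deque

class ToyDriver:
    """Toy model: requests are queued, not serviced, while quiesced."""

    def __init__(self):
        self.quiesced = False
        self.pending = deque()   # requests held back while quiesced
        self.serviced = []       # requests that reached the "hardware"

    def submit(self, request):
        if self.quiesced:
            # Do not reactivate the device; just remember the request.
            self.pending.append(request)
            return False
        self.serviced.append(request)
        return True

    def quiesce(self):
        self.quiesced = True

    def resume(self):
        self.quiesced = False
        # Replay held requests in arrival order.
        while self.pending:
            self.serviced.append(self.pending.popleft())

drv = ToyDriver()
drv.submit("read A")
drv.quiesce()
drv.submit("write B")        # held back, device stays quiescent
drv.resume()
print(drv.serviced)
```

A driver that skipped the `quiesced` check would service "write B" between
the quiesce and the snapshot, which is the failure mode described above.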

Alan Stern



* Re: Hibernation considerations
  2007-07-18 14:29                           ` Alan Stern
@ 2007-07-18 14:47                             ` Rafael J. Wysocki
  2007-07-20  4:40                               ` Al Boldi
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-18 14:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list, Al Boldi

On Wednesday, 18 July 2007 16:29, Alan Stern wrote:
> On Tue, 17 Jul 2007 david@lang.hm wrote:
> 
> > On Tue, 17 Jul 2007, Alan Stern wrote:
> > 
> > > On Tue, 17 Jul 2007 david@lang.hm wrote:
> > >
> > >>> But what about the freezer?  The original reason for using kexec was to
> > >>> avoid the need for the freezer.  With no freezer, while the original
> > >>> kernel is busy powering down its devices, user tasks will be free to
> > >>> carry out I/O -- which will make the memory snapshot inconsistent with
> > >>> the on-disk data structures.
> > >>
> > >> no, user tasks just don't get scheduled during shutdown.
> > >
> > > But a user task may be holding a lock which is needed for putting some
> > > device into low-power mode.  It can't release that lock if it doesn't
> > > get scheduled.
> > 
> > then you can't suspend that box. if you schedule it, it could get another 
> > lock (or another process gets another lock)
> > 
> > if you can't power down or put hardware into low-power mode without the 
> > approval of userspace, you are in serious trouble.
> 
> You don't seem to appreciate the issues involved here.  Part of the 
> justification for the freezer is that it doesn't need userspace 
> approval and it freezes tasks at controlled points where they don't 
> hold any locks.
> 
> Never mind.  It seems clear that this approach will suffer the same 
> drawback as the proposal for removing the freezer from the 
> suspend-to-RAM pathway.  Namely, device drivers will have to be changed 
> to prevent user I/O requests from proceeding while devices are supposed 
> to be quiescent or in a low-power state.

I agree.
 
> If a driver fails to handle this properly, its device could be 
> reactivated in order to service a user request before the memory 
> snapshot is made.  This could easily ruin the snapshot.

That's why I've been saying for quite some time that we first need to take care
of the drivers. :-)

IMO we've reached the point at which, whatever we want to do next, the drivers
are in the way.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
       [not found]                         ` <ea7a437ca4038d408ac544bbc3c2434a@bga.com>
@ 2007-07-19 17:31                           ` david
  2007-07-20 14:24                             ` Milton Miller
  2007-07-19 20:28                           ` Rafael J. Wysocki
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-19 17:31 UTC (permalink / raw)
  To: Milton Miller
  Cc: linux-pm, LKML, Rafael J. Wysocki, Alan Stern, Huang, Ying,
	Jeremy Maitin-Shepard

On Thu, 19 Jul 2007, Milton Miller wrote:

>>  (2) Upon start-up (by which I mean what happens after the user has pressed
>>      the power button or something like that):
>>    * check if the image is present (and valid) _without_ enabling ACPI (we
>>      don't do that now, but I see no reason for not doing it in the new
>>      framework)
>>    * if the image is present (and valid), load it
>>    * turn on ACPI (unless already turned on by the BIOS, that is)
>>    * execute the _BFS global control method
>>    * execute the _WAK global control method
>>    * continue
>>    Here, the first two things should be done by the image-loading kernel,
>>    but the remaining operations have to be carried out by the restored kernel.
>
> Here I agree.
>
> Here is my proposal.  Instead of trying to both write the image and suspend, 
> I think this all becomes much simpler if we limit the scope of the work of the 
> second kernel.  Its purpose is to write the image.   After that it's done. 
> The platform can be powered off if we are going to S5.   However, to support 
> suspend to ram and suspend to disk, we return to the first kernel.
>
> This means that the first kernel will need to know why it got resumed.  Was 
> the system powered off, and this is the resume from the user?   Or was it 
> restarted because the image has been saved, and it's now time to actually 
> suspend until woken up?  If you look at it, this is the same interface we 
> have with the magic arch_suspend hook -- did we just suspend and it's time to 
> write the image, or did we just resume and it's time to wake everything up.
>
> I think this can be easily solved by giving the image saving kernel two 
> resume points: one for the image has been written, and one for we rebooted 
> and have restored the image.  I'm not familiar with ACPI.  Perhaps we need a 
> third to differentiate we read the image from S4 instead of from S5, but that 
> information must be available to the OS because it needs that to know if it 
> should resume from hibernate.

are we sure that there are only 2-3 possible actions? or should this be 
made into a simple jump table so that it's extendable?
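
To make the jump-table idea concrete, here is a small sketch in Python (not
kernel code).  The reason names, the handler functions and the strings they
return are all placeholders invented for illustration; only the three cases
themselves come from the quoted text:

```python
from enum import Enum

class ResumeReason(Enum):
    IMAGE_WRITTEN = 1        # save kernel finished writing the image
    RESTORED_AFTER_S5 = 2    # image restored after a full power-off
    RESTORED_AFTER_S4 = 3    # image restored after platform S4

def after_image_written():
    return "enter the sleep state or power off"

def after_restore_from_s5():
    return "reinitialize devices, then continue"

def after_restore_from_s4():
    return "run platform wake-up methods, then continue"

# The "jump table": adding a new resume reason means adding one entry here.
RESUME_HANDLERS = {
    ResumeReason.IMAGE_WRITTEN: after_image_written,
    ResumeReason.RESTORED_AFTER_S5: after_restore_from_s5,
    ResumeReason.RESTORED_AFTER_S4: after_restore_from_s4,
}

def dispatch_resume(reason):
    """Look up and run the handler for a resume reason."""
    try:
        handler = RESUME_HANDLERS[reason]
    except KeyError:
        raise ValueError(f"unknown resume reason: {reason!r}") from None
    return handler()

print(dispatch_resume(ResumeReason.IMAGE_WRITTEN))
```

Compared with two or three hard-coded entry points, a table like this keeps
the interface between the kernels to a single entry plus a reason code.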

> As noted in  the thread
>
> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
> Subject: [linux-pm] Re: hibernation/snapshot design
> on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
>
>>  (3) how to communicate where to save the memory
>
> This is an interesting topic.  The suspended kernel has most IO and disk 
> space.  It also knows how much space is to be occupied by the kernel.   So 
> communicating a block map to the second kernel would be the obvious choice. 
> But the second kernel must be able to find the image to restore it, and it 
> must have drivers for the media.  Also, this is not feasible for storing to 
> nfs.
>
> I think we will end up with several methods.
>
> One would be supply a list of blocks, and implement a file system that reads 
> the file by reading the scatter list from media.  The restore kernel then 
> only needs to read an anchor, and can build upon that until the image is read 
> into memory.  Or do this in userspace.
>
> I don't know how this compares to the current restore path.   I wasn't able 
> to identify the code that creates the on disk structure in my 10 minute 
> perusal of kernel/power/.
>
> A second method will be to supply a device and file that will be mounted by 
> the save kernel, then unmounted and restored.  This would require a partition 
> that is not mounted or open by the suspended kernel (or use nfs or a similar 
> protocol that is designed for multiple client concurrent access).
>
> A third method would be to allocate a file with the first kernel, and make 
> sure the blocks are flushed to disk.  The save and restore kernels map the 
> file system using a snapshot device.  Writing would map the blocks and use 
> the block offset to write to the real device using the method from the first 
> option; reading could be done directly from the snapshot device.
>
> The first and third option are dead on log based file systems (where the data 
> is stored in the log).

remember that the save and restore kernel can access the memory of the 
suspending kernel, so as long as the data is in a known format and there 
is a pointer to the data in a known location, the save and restore kernel 
can retrieve the data from memory; there's no need to involve media.

> Simplifying kjump: the proposal for v3.
>
> The current code is trying to use crash dump area as a safe, reserved area to 
> run the second kernel.   However, that means that the kernel has to be linked 
> specially to run in the reserved area.   I think we need to finish separating 
> kexec_jump from the other code paths.

on x86 at least it's possible to compile a relocatable kernel, so it 
doesn't need to be compiled specifically for a particular reserved area. 
This would allow you to use the same kernel build as the suspending kernel 
if you wanted to (I think that the config of the save and restore kernel 
is going to be trivial enough to consider auto-configuring and building a 
specific kernel for each box a real possibility)

> As a first stage of suspend and resume, we can save to dedicated partitions 
> all memory (as supplied to crash_dump) that is not marked nosave and not part 
> of the save kernel's image.   The fancy block lists and memory lists can be 
> added later.

if the suspending kernel needs to tell the save and restore kernel what 
memory is not marked nosave, have it do so using a memory list of some 
kind. you need to set up a mechanism for communicating the data anyway, 
so set up a mechanism that's usable in the long term.
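
As an illustration of "data in a known format", here is one possible encoding
of such a memory list, sketched in Python.  The `HIBL` magic value, the field
layout and the function names are assumptions made up for this example, not
an existing kernel format:

```python
import struct

MAGIC = b"HIBL"                   # hypothetical 4-byte tag for the list
HEADER = struct.Struct("<4sI")    # magic, number of entries
ENTRY = struct.Struct("<QQ")      # start page frame number, page count

def pack_memory_list(ranges):
    """Serialize [(start_pfn, page_count), ...] into a flat byte string."""
    parts = [HEADER.pack(MAGIC, len(ranges))]
    parts += [ENTRY.pack(start, count) for start, count in ranges]
    return b"".join(parts)

def unpack_memory_list(blob):
    """Parse a list produced by pack_memory_list(); reject unknown data."""
    magic, n = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a memory list")
    return [ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
            for i in range(n)]

# The suspending kernel would list only ranges that are not marked nosave.
to_save = [(0, 256), (1024, 4096)]
blob = pack_memory_list(to_save)
print(unpack_memory_list(blob) == to_save)
```

The same layout could later carry extra entry types (for example block
lists) without changing how the save and restore kernel locates it.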

> If we want to keep the second kernel booted, then we need to add a save area 
> for the booted jump target.   Note that the save and restore lists to 
> relocate_new_kernel can be computed once and saved.   Longer term we could 
> implement sys_kexec_load(UNLOAD) that would retrieve the saved list back to 
> application space to save to disk in a file.   This means you could save the 
> booted save kernel, it just couldn't have any shared storage open.

since the kexec to the second kernel needs to handle the device 
initialization, do you really save much by doing this? from a reliability 
point of view it would seem simpler (and therefore more reliable) to 
initialize the save and restore kernel each time it's used, so that it 
always does the same thing (as opposed to carrying state from one use to 
the next)

David Lang



* Re: [linux-pm] Re: Hibernation considerations
       [not found]                         ` <ea7a437ca4038d408ac544bbc3c2434a@bga.com>
  2007-07-19 17:31                           ` [linux-pm] " david
@ 2007-07-19 20:28                           ` Rafael J. Wysocki
  2007-07-19 23:07                             ` david
  2007-07-20 16:08                             ` Milton Miller
  1 sibling, 2 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-19 20:28 UTC (permalink / raw)
  To: Milton Miller
  Cc: linux-pm, LKML, Alan Stern, David Lang, Huang, Ying,
	Jeremy Maitin-Shepard

On Thursday, 19 July 2007 17:46, Milton Miller wrote:
> 
> Hi.  I've found this thread from the kjump thread on the kexec mailing 
> list.   I'll respond to that one later, but I wanted to respond to 
> several messages in this thread.  [Actually, there is a brief outline 
> of a response near the bottom of this note].  I downloaded the archive 
> to get message-ids and references, hopefully I don't break the 
> threading badly.
> 
> First, my background is hardware and low level software.  I wrote the 
> initial ppc64 support for kexec.  I try not to learn too much about x86 
> details, but have read this thread and the tutorial article mentioned 
> in it.  I've studied the suspend code at various times in the past but 
> not in the last year.  Several years ago I mused that kexec could be used 
> to restore the suspend image, but I was told that would not be swsuspend.
> 
> 
> Next: lets define what we are trying to solve with the kexec approach.
> 
> We can solve the case of getting the drivers to quiesce the hardware 
> with requests from userspace suspended in queues.  This is what the 
> powermac suspend has been doing for years, and I think it's agreed that 
> it will be something similar when we remove the freezer.  In the 
> powermac code, there are two notifications to drivers; in the first 
> stage they can allocate memory and interrupts are enabled, in the 
> second they copy enough device state to memory to restart the device.
> 
> The problem with all the suspend methods is how do we select what needs 
> to be re-enabled to write the image to stable storage (be it disk, 
> network, or other medium).
> 
> The kjump kexec proposal says after we have the io quiesced, we can 
> jump to a totally new kernel, let it initialize the io it needs, and 
> write the image.  This has the advantage that there is no confusion as 
> far as which requests should be serviced or not serviced by the driver 
> and subsystem stacks.   It also reuses all the drivers, which means we 
> don't get untested code paths.  It also has the advantages that we can 
> use any complicated user stack to access a file system and run any 
> desired access methods (eg encryption, raid, etc).
> 
> The currently identified problems under discussion include:
> (1) how to interact with acpi to enter into S4.
> (2) how to identify which memory needs to be saved
> (3) how to communicate where to save the memory
> (4) what state should devices be in when switching kernels
> (5) the complicated setup required with the current patch
> (6) what code restores the image

(7) how to avoid corrupting filesystems mounted by the hibernated kernel

> I'll now start with quotes from several articles in this thread and my 
> responses.
> 
> 
> Message-ID: <200707172217.01890.rjw@sisk.pl>
> On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:
> > (1) Upon entering the sleep state, which IMO can be done _after_ the image
> >     has been saved:
> >   * figure out which devices can wake up
> >   * put devices into low power states (wake-up devices are placed in the Dx
> >     states compatible with the wake capability, the others are powered off)
> >   * execute the _PTS global control method
> >   * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
> >   * execute the _GTS global control method
> >   * set the GPE enable registers corresponding to the wake-up devices
> >   * make the platform enter S4 (there's a well defined procedure for that)
> >   I think that this should be done by the image-saving kernel.
> 
> Message-ID: <87odiag45q.fsf@jbms.ath.cx>
> On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
> expressed his agreement with this block but also confusion on the other 
> blocks.
> 
> 
> I strongly disagree.
> 
> (1) as has been pointed out, this requires the new kernel to understand 
> all io devices in the first kernel.
> (2) it requires both kernels to talk to ACPI.   This is doomed to 
> failure.  How can the second kernel initialize ACPI?   The platform 
> thinks it has already been initialized.  Do we plan to always undo all 
> acpi initialization?

Good question.  I don't know.

> > (2) Upon start-up (by which I mean what happens after the user has pressed
> >     the power button or something like that):
> >   * check if the image is present (and valid) _without_ enabling ACPI (we
> >     don't do that now, but I see no reason for not doing it in the new
> >     framework)
> >   * if the image is present (and valid), load it
> >   * turn on ACPI (unless already turned on by the BIOS, that is)
> >   * execute the _BFS global control method
> >   * execute the _WAK global control method
> >   * continue
> >   Here, the first two things should be done by the image-loading kernel,
> >   but the remaining operations have to be carried out by the restored kernel.
> 
> Here I agree.
> 
> Here is my proposal.  Instead of trying to both write the image and 
> suspend, I think this all becomes much simpler if we limit the scope 
> of the work of the second kernel.  Its purpose is to write the image.   
> After that it's done.   The platform can be powered off if we are going 
> to S5.   However, to support suspend to ram and suspend to disk, we 
> return to the first kernel.

We can't do this unless we have frozen tasks (this way, or another) before
carrying out the entire operation.  In that case, however, the kexec-based
approach would have only one advantage over the current one.  Namely, it
would allow us to create bigger images.
 
> This means that the first kernel will need to know why it got resumed.  
> Was the system powered off, and this is the resume from the user?   Or 
> was it restarted because the image has been saved, and it's now time to 
> actually suspend until woken up?  If you look at it, this is the same 
> interface we have with the magic arch_suspend hook -- did we just 
> suspend and it's time to write the image, or did we just resume and it's 
> time to wake everything up.
> 
> I think this can be easily solved by giving the image saving kernel two 
> resume points: one for the image has been written, and one for we 
> rebooted and have restored the image.  I'm not familiar with ACPI.  
> Perhaps we need a third to differentiate we read the image from S4 
> instead of from S5, but that information must be available to the OS 
> because it needs that to know if it should resume from hibernate.
> 
> By making the split at image save and restore we have several 
> advantages:
> 
> (1) the kernel always initializes with devices in the init or quiesced 
> but active state.
> 
> (2) the kernel always resumes with devices in the init or quiesced but 
> active state.
> 
> (3) the kjump save and restore kernel does not need to know how to 
> suspend all devices in the platform.
> 
> (4) we have a merged path for suspend to disk, suspend to ram, and 
> suspend to both.
> 
> (5) because of (4), we can implement sleep policies where we save the 
> image to disk but try to stay in ram based on expected remaining 
> battery life.
> 
> (6) we confine all platform (acpi) interaction to the main kernel
> 
> (7) we limit the knowledge needed in the second kernel.   It needs to 
> know how to do its job and then put the hardware back how it found it.  
> Nothing more.

This would have been nice if we had been able to do it.
 
> For the suspend to ram and then woken up case, we simply need to 
> invalidate the image before restarting normal kernel operation.
> 
> People have worried about how to boot and restore the kernel, and what 
> to do if reading the image fails.   They worry about needing memory 
> hotplug or delayed acpi parsing.  They are forgetting one thing.  This 
> kernel has support for kexec.
> 
> This is all easily solved by having the bootloader from the bios always 
> boot the restore kernel.

Well, I think this is not generally acceptable, although I agree that it would
be simpler.

> It will boot with limited usable memory and  
> no acpi support.  If the restore kernel userspace detects that there is 
> no restore image, it simply loads the normal main kernel and initrd / 
> initramfs and calls the normal kexec.  The cost is the time to init the 
> restore kernel, read the kernel with full drivers (vs reading it from 
> the bootloader).  If you want a boot menu, use kboot (on sourceforge).

Well, I'm afraid of adding more and more infrastructure to the mix.
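
For reference, the boot flow proposed in the quoted text can be sketched
roughly as follows.  This is a Python toy, not a boot loader; the image
layout (a payload plus a CRC), the function names and the returned tuples
are assumptions chosen only to show the decision being made:

```python
import zlib

def image_is_valid(image):
    """Hypothetical validity check: payload must match its stored CRC."""
    return image is not None and zlib.crc32(image["payload"]) == image["crc"]

def choose_boot_action(image):
    """Decide what the always-booted restore kernel does next."""
    if image_is_valid(image):
        # A usable hibernation image exists: restore it.
        return ("restore", image["payload"])
    # No image (or a damaged one): kexec into the normal kernel instead.
    return ("kexec", "main kernel + initrd")

payload = b"snapshot"
good = {"payload": payload, "crc": zlib.crc32(payload)}
bad = {"payload": payload, "crc": 0}

print(choose_boot_action(good)[0])
print(choose_boot_action(bad)[0])
print(choose_boot_action(None)[0])
```

The extra infrastructure being objected to is everything needed to make the
"kexec" branch work on every boot, including boots with no image at all.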

> On Jul 17, 2007, at 2:13 PM, Rafael J. Wysocki wrote:
> > On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
> >> On Tue, 17 Jul 2007, Alan Stern wrote:
> >>> But what about the freezer?  The original reason for using kexec was to
> >>> avoid the need for the freezer.  With no freezer, while the original
> >>> kernel is busy powering down its devices, user tasks will be free to
> >>> carry out I/O -- which will make the memory snapshot inconsistent with
> >>> the on-disk data structures.
> >>
> >> no, user tasks just don't get scheduled during shutdown.
> >>
> >> the big problem with the freezer isn't stopping anything from happening,
> >> it's _selectively_ stopping things.
> 
> Agreed.   Or rather, selectively not stopping and resuming things.

I don't quite understand this statement.  Can you please elaborate?

> > It's selectively stopping kernel threads, which is just about right.  If you
> > think that _this_ is a main problem with the freezer, then think again.
> >
> >> with kexec you don't need to let any portion of the original kernel or
> >> userspace operate so you don't have a problem.
> >
> > In fact, the main problem with the freezer is that it is a coarse-grained
> > solution.  Therefore, what I believe we should do is to evolve in the direction
> > of more fine-grained solutions and gradually phase out the freezer.
> >
> > The kexec-based approach is an attempt to replace one coarse-grained solution
> > (the freezer) with an even more coarse-grained solution (stopping the entire
> > kernel with everything), which IMO doesn't address the main problem.
> >
> 
> I think this addresses the problem.   It's probably a bit harder than 
> powermac because we have to fully quiesce devices; we can't cheat by 
> leaving interrupts off.   But once the drivers save the state of their 
> devices and stop their queues, it should be easy to audit the paths to 
> powerdown devices and call the platform suspend and ram wakeup paths.
> 
> 
> Going back to the requirements document that started this thread:
> 
> Message-ID: <200707151433.34625.rjw@sisk.pl>
> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> > (1) Filesystems mounted before the hibernation are untouchable
> 
> This is because some file systems do a fsck or other activity even when 
> mounted read only.  For the kexec case, however, this should be "file 
> systems mounted by the hibernated system must not be written".   As has 
> been mentioned in the past, we should be able to use something like dm 
> snapshot to allow fsck and the file system to see the cleaned copy 
> while not actually writing the media.

We can't _require_ users to use the dm snapshot in order for the hibernation
to work, sorry.

And by _reading_ from a filesystem you generally update metadata. 
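
To illustrate the copy-on-write idea behind the snapshot suggestion quoted
above (and why even metadata updates triggered by reads would have to be
caught by it), here is a toy model in Python.  `CowOverlay` is a made-up
name and this is not how device-mapper is implemented; it only shows writes
being diverted away from the underlying media:

```python
class CowOverlay:
    """Toy copy-on-write view: the base blocks are never modified."""

    def __init__(self, base_blocks):
        self._base = base_blocks   # stands in for the on-disk filesystem
        self._overlay = {}         # block number -> data written via the view

    def read(self, n):
        # Prefer the overlay copy if this block was written through the view.
        return self._overlay.get(n, self._base[n])

    def write(self, n, data):
        # Writes (including metadata updates caused by reads, such as access
        # times) land in the overlay, not on the base.
        self._overlay[n] = data

disk = ("superblock", "inode table", "data")
view = CowOverlay(disk)
view.write(1, "inode table with updated atime")

print(view.read(1))
print(disk[1])
```

The objection above still stands in this model: it only protects the
filesystem if every access by the second kernel goes through the overlay.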

> The kjump kernel must not have any knowledge retained if we reuse it.
> 
> > (2) Swap space in use before the hibernation must be handled with care
> 
> Yes.  Actually, even though they have been used by the write-in-the 
> kernel users, they will be among the most difficult devices to use for 
> snapshots by a userspace second kernel.
> 
> > (3) There are memory regions that must not be saved or restored
> 
> because they may not exist.   This means that we must identify the 
> memory to be saved and restored in a format to be passed between the 
> kernel.
> 
> > (4) The user should be able to limit the size of a hibernation image
> 
> This means the suspending kernel must arrange to reduce its active 
> memory.  The limited save can be done by providing a limited list in 
> (3).

It seems to me that you don't understand the problem here.

Assume you have 90% of RAM allocated before the hibernation and the user has
requested the image to be not greater than 50% of RAM.  In that case you have
to free some memory _before_ identifying memory to save and you must not
race with applications that attempt to allocate memory while you're doing it.
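
To put numbers on that example, here is a rough sketch only.  The 2 GiB RAM
size and the 4 KiB page size are assumptions made for illustration, and
pages_to_free() is a hypothetical helper, not a kernel function.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_to_free(total_ram_bytes, allocated_pct, image_limit_pct):
    """Pages that have to be released before the pages to save are identified.

    Percentages are integers so that the arithmetic stays exact.
    """
    total_pages = total_ram_bytes // PAGE_SIZE
    allocated_pages = total_pages * allocated_pct // 100
    limit_pages = total_pages * image_limit_pct // 100
    return max(0, allocated_pages - limit_pages)


# 90% allocated, image limited to 50% of RAM, on an assumed 2 GiB machine
to_free = pages_to_free(2 * 1024**3, 90, 50)
print(to_free, "pages, about", to_free * PAGE_SIZE // 2**20, "MiB")
```

The point of the sketch is only the ordering: this amount has to be gone
before the list of pages to save is built, and nothing may allocate in the
meantime.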

> > (5) Hibernation should be transparent from the applications' point of 
> > view
> 
> People have pointed out they may want userspace to be aware of the 
> suspend.   I believed this can be done with /proc/apm emulation today 
> or by other means; it seems that should be hooked up to dbus in some 
> fashion.

Not a solution, because there still will be programs not needing to know
anything about hibernation.  After all, we don't require all applications to
know anything about SMP, even if they are executed on an SMP system.

> > (6) State of devices from before hibernation should be restored, if 
> > possible
> 
> related to suspend should be transparent ... yes.
> 
> > (7) On ACPI systems special platform-related actions have to be 
> > carried out at
> >     the right points, so that the platform works correctly after the 
> > restore
> 
> I believe I have explained my suggestion.
> 
> > (8) Hibernation and restore should not be too slow
> 
> We control the added code.   We are using full runtime drivers and will 
> run at hardware speeds.

That may not be enough.  If you're going to save, say, 80% of RAM on a 2 GB
machine, then you'll have to be using image compression.

> > (9) Hibernation framework should not be too difficult to set up
> 
> Ok the current patch is presently too difficult.  But I think it will 
> be much simpler with a few small changes.
> 
> As noted in  the thread
> 
> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
> Subject: [linux-pm] Re: hibernation/snapshot design
> on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
> >>  Both would work. One would eat 8-64MB of your RAM, permanently;
> >
> > As I have stated in other messages, the kdump approach would not waste
> > any RAM permanently.
> ...
> > Immediately before jumping to the new kernel, the first X bytes (where 
> > X
> > is the amount of memory the new kernel will get, typically 16MB or 
> > 64MB)
> > of physical memory are backed up into the arbitrary discontiguous pages
> > that are made available.  This will not take very long, because copying
> > even 64MB of memory is extremely fast.  Then the new kernel is free to
> > use the first X bytes of contiguous physical memory.  Problem solved.
> 
> 
> Ok, now let's look at my list again:
> 
> > (1) how to interact with acpi to enter into S4.
> 
> This was discussed.
> 
> > (2) how to identify which memory needs to be saved
> 
> We need to generate a list.  We need it to fit in a computable size so
> that we can free and allocate the pages before suspending IO in the 
> first kernel.
> 
> One possibility is to use something like the kexec copy list.  If we 
> are imaging a small fraction of ram this is appropriate, but if we are 
> doing dense saves we need something extent based.  We should be able to 
> extend the list.
> 
> > (3) how to communicate where to save the memory
> 
> This is an interesting topic.  The suspended kernel has most IO and disk
> space.  It also knows how much space is to be occupied by the kernel.   
> So communicating a block map to the second kernel would be the obvious 
> choice.   But the second kernel must be able to find the image to 
> restore it, and it must have drivers for the media.  Also, this is not 
> feasible for storing to nfs.
> 
> I think we will end up with several methods.
> 
> One would be supply a list of blocks, and implement a file system that 
> reads the file by reading the scatter list from media.  The restore 
> kernel then only needs to read an anchor, and can build upon that until 
> the image is read into memory.  Or do this in userspace.
> 
> I don't know how this compares to the current restore path.   I wasn't 
> able to identify the code that creates the on disk structure in my 10 
> minute perusal of kernel/power/.

The structure is created at two levels.

First, the code in snapshot.c makes the image available to the code in swap.c
as a stream of pages.  The first page is the header, followed by some pages
containing the PFNs of the page frames to which the image data pages are to be
restored, followed by the image data pages themselves (the ordering of the PFNs
must be the same as the ordering of data pages that correspond to them).
Still, the low-level image format only needs to be known by the restore code in
snapshot.c.

Second, the code in swap.c writes the image pages to storage, adding some
metadata making it possible to reproduce their original ordering during the
restore.
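
A toy model of the first level may help here.  It is a sketch of the ordering
described above (header page, then PFN pages, then data pages in the same
order as the PFNs) and not the actual on-disk format: the header fields, the
64-bit little-endian PFN slots and both function names are assumptions made
up for this illustration.

```python
import struct

PAGE_SIZE = 4096                 # assumed page size
PFNS_PER_PAGE = PAGE_SIZE // 8   # assumed: one 64-bit PFN per slot


def build_stream(frames):
    """frames maps PFN -> PAGE_SIZE bytes; returns the image as a list of pages."""
    pfns = sorted(frames)
    # Page 0: header (here just the number of data pages and the page size).
    header = struct.pack("<QQ", len(pfns), PAGE_SIZE).ljust(PAGE_SIZE, b"\0")
    stream = [header]
    # Next: pages holding the PFNs the data pages must be restored to.
    for i in range(0, len(pfns), PFNS_PER_PAGE):
        chunk = pfns[i:i + PFNS_PER_PAGE]
        packed = struct.pack("<%dQ" % len(chunk), *chunk)
        stream.append(packed.ljust(PAGE_SIZE, b"\0"))
    # Last: the data pages, in the same order as their PFNs.
    stream.extend(frames[pfn] for pfn in pfns)
    return stream


def restore_stream(stream):
    """Inverse of build_stream(): returns a dict mapping PFN -> page contents."""
    count, _page_size = struct.unpack_from("<QQ", stream[0])
    n_pfn_pages = -(-count // PFNS_PER_PAGE)  # ceiling division
    pfns = []
    for page in stream[1:1 + n_pfn_pages]:
        take = min(PFNS_PER_PAGE, count - len(pfns))
        pfns.extend(struct.unpack_from("<%dQ" % take, page))
    return dict(zip(pfns, stream[1 + n_pfn_pages:]))
```

Only the two functions need to agree on the layout, which mirrors the point
that the low-level format is private to the code producing and consuming the
stream.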

The fact that we use swap spaces as the storage is related to implementation
simplicity rather than anything else.

> A second method will be to supply a device and file that will be 
> mounted by the save kernel, then unmounted and restored.  This would 
> require a partition that is not mounted or open by the suspended kernel 
> (or use nfs or a similar protocol that is designed for multiple client 
> concurrent access).
> 
> A third method would be to allocate a file with the first kernel, and 
> make sure the blocks are flushed to disk.  The save and restore kernels 
> map the file system using a snapshot device.  Writing would map the 
> blocks and use the block offset to write to the real device using the 
> method from the first option; reading could be done directly from the 
> snapshot device.
> 
> The first and third option are dead on log based file systems (where 
> the data is stored in the log).

All in all, we have three different and working implementations of the
image-writing and image-reading code at our disposal.  Why would you want to
break the open doors?

> > (4) what state should devices be in when switching kernels
> 
> My proposal is either initialized and untouched or quiesced.

This is reasonable, but in general we also need to save some information
about the pre-hibernation state of devices, so that we can put them into the
same state, if reasonably possible, during the restore.

> > (5) the complicated setup required with the current patch
> 
> I think a few simple changes to kjump will make this much simpler.  See 
> below.
> 
> > (6) what code restores the image
> 
> The save kernel, loaded at boot.   People have suggested booting the 
> first kernel, and using current restore code.   However, I think that 
> ignores that (1) we saved from a different kernel, so the backed up 
> region will be restored to its backed up random pages,

This problem has already been solved.

> (2) the code was written to restore the same kernel,

Not exactly.  In fact, the current implementation only relies on the tiny
portion of the restore code being in the same place in both kernels, but
we can change the code not to make this assumption (it'll be more complicated,
but that's perfectly doable).

> so the text and data will be replaced by identical text.  It's much simpler
> conceptually to use the same kernel to save and restore the image.

Here I agree. :-)

> Simplifying kjump: the proposal for v3.
> 
> The current code is trying to use crash dump area as a safe, reserved 
> area to run the second kernel.   However, that means that the kernel 
> has to be linked specially to run in the reserved area.   I think we 
> need to finish separating kexec_jump from the other code paths.
> 
> (1) add a new command line argument that specifies the kexec_jump 
> target area.
> 
> (2) add a kjump flag to the flags parameter, used by kexec_load.   When 
> loading a jump kernel, it is loaded like a normal kernel, however, 
> additional control pages are allocated to (a) save the kexec_jump 
> target area (b) save the backed up region that is used by all kernels 
> like crash dump, and (c) space for invoking relocate_new_kernel that 
> will get its args from the execution entry point and will restore the 
> kernel then call resume and suspend.
> 
> (3) replace jump_huf_pfn with two command line addresses that specify 
> the (a) return point for after resume, and (b) the return point for 
> after image save.   Actually these can be done in userspace; the second 
> restore kernel can just specify the null copy list and the entry points 
> supplied by the suspended kernel.  To do resume we also need (c) where 
> to store resume address for the save kernel.
> 
> 
> As a first stage of suspend and resume, we can save to dedicated 
> partitions all memory (as supplied to crash_dump) that is not marked 
> nosave and not part of the save kernel's image.

A little problem here: there are "nosave" areas that are not marked as nosave.

> The fancy block lists and memory lists can be added later.

On the majority of systems that will work.  On some of them it won't.

> Making these changes will allow us to use a normal kernel invoked with
> acpi=off apm=off mem=xxk as the save and restore kernel.
> 
> If we want to keep the second kernel booted, then we need to add a save 
> area for the booted jump target.   Note that the save and restore lists 
> to relocate_new_kernel can be computed once and saved.   Longer term we 
> could implement sys_kexec_load(UNLOAD) that would retrieve the saved 
> list back to application space to save to disk in a file.   This means 
> you could save the booted save kernel, it just couldn't have any shared 
> storage open.
> 
> I'll try to expand on this in the jump v2 thread, but it may be 36+ 
> hours before I do so.

Well, I have no experience with kexec, so I really can't comment on your
kexec-related suggestions.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-19 20:28                           ` Rafael J. Wysocki
@ 2007-07-19 23:07                             ` david
  2007-07-20 11:17                               ` Rafael J. Wysocki
       [not found]                               ` <20070720152744.GH20529@grifter.jdc.home>
  2007-07-20 16:08                             ` Milton Miller
  1 sibling, 2 replies; 220+ messages in thread
From: david @ 2007-07-19 23:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Milton Miller, linux-pm, LKML, Alan Stern, Huang, Ying,
	Jeremy Maitin-Shepard

On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:

> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>>
>> The currently identified problems under discussion include:
>> (1) how to interact with acpi to enter into S4.
>> (2) how to identify which memory needs to be saved
>> (3) how to communicate where to save the memory
>> (4) what state should devices be in when switching kernels
>> (5) the complicated setup required with the current patch
>> (6) what code restores the image
>
> (7) how to avoid corrupting filesystems mounted by the hibernated kernel

I didn't realize this was a discussion item. I thought the options were 
clear, for some filesystem types you can mount them read-only, but for 
ext3 (and possibly other less common ones) you just plain cannot touch
them.

>>> (2) Upon start-up (by which I mean what happens after the user has
>>> pressed
>>>     the power button or something like that):
>>>   * check if the image is present (and valid) _without_ enabling ACPI
>>> (we don't
>>>     do that now, but I see no reason for not doing it in the new
>>> framework)
>>>   * if the image is present (and valid), load it
>>>   * turn on ACPI (unless already turned on by the BIOS, that is)
>>>   * execute the _BFS global control method
>>>   * execute the _WAK global control method
>>>   * continue
>>>   Here, the first two things should be done by the image-loading
>>> kernel, but
>>>   the remaining operations have to be carried out by the restored
>>> kernel.
>>
>> Here I agree.
>>
>> Here is my proposal.  Instead of trying to both write the image and
>> suspend, I think this all becomes much simpler if we limit the scope
>> of the work of the second kernel.  Its purpose is to write the image.
>> After that it's done.   The platform can be powered off if we are going
>> to S5.   However, to support suspend to ram and suspend to disk, we
>> return to the first kernel.
>
> We can't do this unless we have frozen tasks (this way, or another) before
> carrying out the entire operation.  In that case, however, the kexec-based
> approach would have only one advantage over the current one.  Namely, it
> would allow us to create bigger images.

we all agree that tasks cannot run during the suspend-to-ram state, but 
the disagreement is over what this means

at one extreme it could mean that you would need the full freezer as per 
the current suspend projects.

at the other extreme it could mean that all that's needed is to invoke the 
suspend-to-ram routine before anything else on the suspended kernel on the 
return from the save and restore kernel.

we just need to figure out which it is (or if it's somewhere in between).

>>> It's selectively stopping kernel threads, which is just about right.
>>> If you think
>>> that _this_ is a main problem with the freezer, then think again.
>>>
>>>> with kexec you don't need to let any portion of the original kernel
>>>> or
>>>> userspace operate so you don't have a problem.
>>>
>>> In fact, the main problem with the freezer is that it is a
>>> coarse-grained
>>> solution.  Therefore, what I believe we should do is to evolve in the
>>> direction
>>> of more fine-grained solutions and gradually phase out the freezer.
>>>
>>> The kexec-based approach is an attempt to replace one coarse-grained
>>> solution
>>> (the freezer) with even more coarse-grained solution (stopping the
>>> entire
>>> kernel with everything), which IMO doesn't address the main problem.
>>>
>>
>> I think this addresses the problem.   It's probably a bit harder than
>> powermac because we have to fully quiesce devices; we can't cheat by
>> leaving interrupts off.   But once the drivers save the state of their
>> devices and stop their queues, it should be easy to audit the paths to
>> powerdown devices and call the platform suspend and ram wakeup paths.
>>
>>
>> Going back to the requirements document that started this thread:
>>
>> Message-ID: <200707151433.34625.rjw@sisk.pl>
>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>>> (1) Filesystems mounted before the hibernation are untouchable
>>
>> This is because some file systems do a fsck or other activity even when
>> mounted read only.  For the kexec case, however, this should be "file
>> systems mounted by the hibernated system must not be written".   As has
>> been mentioned in the past, we should be able to use something like dm
>> snapshot to allow fsck and the file system to see the cleaned copy
>> while not actually writing the media.
>
> We can't _require_ users to use the dm snapshot in order for the hibernation
> to work, sorry.
>
> And by _reading_ from a filesystem you generally update metadata.

not if the filesystem is mounted read-only (except on ext3)

>> The kjump kernel must not have any knowledge retained if we reuse it.
>>
>>> (2) Swap space in use before the hibernation must be handled with care
>>
>> Yes.  Actually, even though they have been used by the write-in-the
>> kernel users, they will be among the most difficult devices to use for
>> snapshots by a userspace second kernel.
>>
>>> (3) There are memory regions that must not be saved or restored
>>
>> because they may not exist.   This means that we must identify the
>> memory to be saved and restored in a format to be passed between the
>> kernel.
>>
>>> (4) The user should be able to limit the size of a hibernation image
>>
>> This means the suspending kernel must arrange to reduce its active
>> memory.  The limited save can be done by providing a limited list in
>> (3).
>
> It seems to me that you don't understand the problem here.
>
> Assume you have 90% of RAM allocated before the hibernation and the user has
> requested the image to be not greater than 50% of RAM.  In that case you have
> to free some memory _before_ identifying memory to save and you must not
> race with applications that attempt to allocate memory while you're doing it.

I disagree a little bit.

first off, only the suspending kernel can know what can be freed and what 
is needed to do so (remember this is kernel internals, it can change from 
patch to patch, let alone version to version)

second, if you have a lot of memory to free, and you can't just throw away 
caches to do so, you don't know what is going to be involved in freeing 
the memory, it's very possible that it is going to involve userspace, so
you can't freeze any significant portion of the system, so you can't 
eliminate all chance of races

what you can do is

1. try to free stuff
2. stop the system and account for memory, is enough free
if not goto 1

if userspace is dirtying memory fast enough, or is just using enough
memory that you can't meet your limit you just won't be able to suspend.

but under any other conditions you will eventually get enough memory free.

so try several times and if you still fail tell the user they have too 
much stuff running and they need to kill something.
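
roughly, as a sketch (the three callbacks and the attempt limit are 
hypothetical names made up for this example, they are not existing kernel 
interfaces):

```python
def shrink_until_fits(free_some, stop_and_count, resume, limit_pages,
                      max_attempts=5):
    """Retry loop: free memory, stop the system, check, and start over if needed.

    free_some()      -- try to release memory while the system is running
    stop_and_count() -- stop the system and return the number of pages in use
    resume()         -- let the system run again before the next attempt
    Returns True with the system still stopped if the limit was met,
    False if it was never met within max_attempts.
    """
    for _ in range(max_attempts):
        free_some()
        used = stop_and_count()
        if used <= limit_pages:
            return True      # small enough, the image can be taken now
        resume()             # too big, go back to step 1
    return False             # tell the user to kill something
```

the False case is the "too much stuff running" answer, it is reported 
instead of looping forever.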

>>> (6) State of devices from before hibernation should be restored, if
>>> possible
>>
>> related to suspend should be transparent ... yes.
>>
>>> (7) On ACPI systems special platform-related actions have to be
>>> carried out at
>>>     the right points, so that the platform works correctly after the
>>> restore
>>
>> I believe I have explained my suggestion.
>>
>>> (8) Hibernation and restore should not be too slow
>>
>> We control the added code.   We are using full runtime drivers and will
>> run at hardware speeds.
>
> That may not be enough.  If you're going to save, say, 80% of RAM on a 2 GB
> machine, then you'll have to be using image compression.

this doesn't make sense, 20% of 2G is 400M, if you can't make a kernel and 
userspace that can run in 400M you have a serious problem.

even if you wanted to save 99% of RAM on a 2G system, you have 20M of ram 
to play with, which should easily be enough.

remember, linux runs on really small systems as well, and while you do 
have to load some drivers for the big system, there are a lot of other 
things that aren't needed.

> All in all, we have three different and working implementations of the
> image-writing and image-reading code at our disposal.  Why would you want to
> break the open doors?

because you say that the current methods won't work without ACPI support.

David Lang


* Re: Hibernation considerations
  2007-07-18 14:47                             ` Rafael J. Wysocki
@ 2007-07-20  4:40                               ` Al Boldi
  2007-07-20 10:59                                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Al Boldi @ 2007-07-20  4:40 UTC (permalink / raw)
  To: Rafael J. Wysocki, Alan Stern
  Cc: david, LKML, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list

Rafael J. Wysocki wrote:
> On Wednesday, 18 July 2007 16:29, Alan Stern wrote:
> >
> > Never mind.  It seems clear that this approach will suffer the same
> > drawback as the proposal for removing the freezer from the
> > suspend-to-RAM pathway.  Namely, device drivers will have to be changed
> > to prevent user I/O requests from proceeding while devices are supposed
> > to be quiescent or in a low-power state.
>
> I agree.
>
> > If a driver fails to handle this properly, its device could be
> > reactivated in order to service a user request before the memory
> > snapshot is made.  This could easily ruin the snapshot.
>
> That's why I've been saying for quite some time that we first need to take
> care of the drivers. :-)
>
> IMO we've reached the point at which, whatever we want to do next, the
> drivers are in the way.

Correct, but only if we want ACPI support.  Granted, we need a separation of 
the hibernate/suspend PM functions, but in the absence of ACPI, all we need 
right now are dump/restore routines for the crashkernel.

Next, we should be looking into reducing the kexec'd kernel environment size, 
which currently, at 16MB, is way too big, and even at 1MB would be 
problematic for small systems.

So, ACPI should really be the least of our worries, and the reason why people 
are fixating on ACPI is probably because they have nothing else to fixate 
on.  They probably haven't even tried the kexec patches to appreciate how 
easy the kexec approach is and how little interference ACPI represents when
suspending and resuming.


Thanks!

--
Al



* Re: Hibernation considerations
  2007-07-20  4:40                               ` Al Boldi
@ 2007-07-20 10:59                                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 10:59 UTC (permalink / raw)
  To: Al Boldi
  Cc: Alan Stern, david, LKML, Andrew Morton, Eric W. Biederman, Huang,
	Ying, Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham,
	Pavel Machek, pm list

On Friday, 20 July 2007 06:40, Al Boldi wrote:
> Rafael J. Wysocki wrote:
> > On Wednesday, 18 July 2007 16:29, Alan Stern wrote:
> > >
> > > Never mind.  It seems clear that this approach will suffer the same
> > > drawback as the proposal for removing the freezer from the
> > > suspend-to-RAM pathway.  Namely, device drivers will have to be changed
> > > to prevent user I/O requests from proceeding while devices are supposed
> > > to be quiescent or in a low-power state.
> >
> > I agree.
> >
> > > If a driver fails to handle this properly, its device could be
> > > reactivated in order to service a user request before the memory
> > > snapshot is made.  This could easily ruin the snapshot.
> >
> > That's why I've been saying for quite some time that we first need to take
> > care of the drivers. :-)
> >
> > IMO we've reached the point at which, whatever we want to do next, the
> > drivers are in the way.
> 
> Correct, but only if we want ACPI support.

No, in general.

> Granted, we need a separation of  
> the hibernate/suspend PM functions, but in the absence of ACPI, all we need 
> right now are dump/restore routines for the crashkernel.

IMO you aren't right, but I guess there's no point in trying to convince you.

> Next, we should be looking into reducing the kexec'd kernel environment size, 
> which currently, at 16MB, is way too big, and even at 1MB would be 
> problematic for small systems.
> 
> So, ACPI should really be the least of our worries, and the reason why people 
> are fixating on ACPI is probably because they have nothing else to fixate 
> on.

Yeah, right.  Please read the $subject message again.  And sorry, but IMO your
previous replies to it haven't addressed any of the original points.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-19 23:07                             ` david
@ 2007-07-20 11:17                               ` Rafael J. Wysocki
  2007-07-20 15:35                                 ` david
  2007-07-20 16:56                                 ` Milton Miller
       [not found]                               ` <20070720152744.GH20529@grifter.jdc.home>
  1 sibling, 2 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 11:17 UTC (permalink / raw)
  To: david
  Cc: Milton Miller, linux-pm, LKML, Alan Stern, Huang, Ying,
	Jeremy Maitin-Shepard

On Friday, 20 July 2007 01:07, david@lang.hm wrote:
> On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Thursday, 19 July 2007 17:46, Milton Miller wrote:
> >>
> >> The currently identified problems under discussion include:
> >> (1) how to interact with acpi to enter into S4.
> >> (2) how to identify which memory needs to be saved
> >> (3) how to communicate where to save the memory
> >> (4) what state should devices be in when switching kernels
> >> (5) the complicated setup required with the current patch
> >> (6) what code restores the image
> >
> > (7) how to avoid corrupting filesystems mounted by the hibernated kernel
> 
> I didn't realize this was a discussion item. I thought the options were 
> clear, for some filesystem types you can mount them read-only, but for 
> ext3 (and possibly other less common ones) you just plain cannot touch
> them.

That's correct.  And since you cannot touch ext3, you need either to assume
that you won't touch filesystems at all, or to have code to recognize the
filesystem you're dealing with.
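
As a rough idea of what such recognition code might look like, the sketch
below checks a raw buffer for an ext2/ext3-style superblock with the journal
feature bit set.  The offsets and constants are the ones I believe the
published ext2/ext3 on-disk layout uses, so treat them as assumptions to be
checked against the filesystem headers; the function name is made up, and the
caller is assumed to have obtained the first bytes of the device without
mounting anything.

```python
import struct

SUPERBLOCK_OFFSET = 1024      # assumed: superblock starts 1024 bytes into the device
S_MAGIC_OFFSET = 56           # assumed offset of s_magic within the superblock
S_FEATURE_COMPAT_OFFSET = 92  # assumed offset of s_feature_compat
EXT_MAGIC = 0xEF53            # ext2/ext3 magic number
COMPAT_HAS_JOURNAL = 0x0004   # "has journal" bit in s_feature_compat


def looks_like_journaled_ext(first_bytes):
    """Return True if the buffer looks like an ext filesystem with a journal."""
    if len(first_bytes) < SUPERBLOCK_OFFSET + S_FEATURE_COMPAT_OFFSET + 4:
        return False
    (magic,) = struct.unpack_from(
        "<H", first_bytes, SUPERBLOCK_OFFSET + S_MAGIC_OFFSET)
    (compat,) = struct.unpack_from(
        "<I", first_bytes, SUPERBLOCK_OFFSET + S_FEATURE_COMPAT_OFFSET)
    return magic == EXT_MAGIC and bool(compat & COMPAT_HAS_JOURNAL)
```

Even in this form it only covers one filesystem family, which is why assuming
that filesystems are not touched at all is the simpler rule.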

> >>> (2) Upon start-up (by which I mean what happens after the user has
> >>> pressed
> >>>     the power button or something like that):
> >>>   * check if the image is present (and valid) _without_ enabling ACPI
> >>> (we don't
> >>>     do that now, but I see no reason for not doing it in the new
> >>> framework)
> >>>   * if the image is present (and valid), load it
> >>>   * turn on ACPI (unless already turned on by the BIOS, that is)
> >>>   * execute the _BFS global control method
> >>>   * execute the _WAK global control method
> >>>   * continue
> >>>   Here, the first two things should be done by the image-loading
> >>> kernel, but
> >>>   the remaining operations have to be carried out by the restored
> >>> kernel.
> >>
> >> Here I agree.
> >>
> >> Here is my proposal.  Instead of trying to both write the image and
> >> suspend, I think this all becomes much simpler if we limit the scope
> >> of the work of the second kernel.  Its purpose is to write the image.
> >> After that it's done.   The platform can be powered off if we are going
> >> to S5.   However, to support suspend to ram and suspend to disk, we
> >> return to the first kernel.
> >
> > We can't do this unless we have frozen tasks (this way, or another) before
> > carrying out the entire operation.  In that case, however, the kexec-based
> > approach would have only one advantage over the current one.  Namely, it
> > would allow us to create bigger images.
> 
> we all agree that tasks cannot run during the suspend-to-ram state, but 
> the disagreement is over what this means
> 
> at one extreme it could mean that you would need the full freezer as per 
> the current suspend projects.
> 
> at the other extreme it could mean that all that's needed is to invoke the 
> suspend-to-ram routine before anything else on the suspended kernel on the 
> return from the save and restore kernel.
> 
> we just need to figure out which it is (or if it's somewhere in between).

Well, I think that the "invoke the suspend-to-ram routine before anything else
on the suspended kernel" thing won't be easy to implement in practice.

> >>> It's selectively stopping kernel threads, which is just about right.
> >>> If you think
> >>> that _this_ is a main problem with the freezer, then think again.
> >>>
> >>>> with kexec you don't need to let any portion of the original kernel
> >>>> or
> >>>> userspace operate so you don't have a problem.
> >>>
> >>> In fact, the main problem with the freezer is that it is a
> >>> coarse-grained
> >>> solution.  Therefore, what I believe we should do is to evolve in the
> >>> direction
> >>> of more fine-grained solutions and gradually phase out the freezer.
> >>>
> >>> The kexec-based approach is an attempt to replace one coarse-grained
> >>> solution
> >>> (the freezer) with even more coarse-grained solution (stopping the
> >>> entire
> >>> kernel with everything), which IMO doesn't address the main problem.
> >>>
> >>
> >> I think this addresses the problem.   It's probably a bit harder than
> >> powermac because we have to fully quiesce devices; we can't cheat by
> >> leaving interrupts off.   But once the drivers save the state of their
> >> devices and stop their queues, it should be easy to audit the paths to
> >> powerdown devices and call the platform suspend and ram wakeup paths.
> >>
> >>
> >> Going back to the requirements document that started this thread:
> >>
> >> Message-ID: <200707151433.34625.rjw@sisk.pl>
> >> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> >>> (1) Filesystems mounted before the hibernation are untouchable
> >>
> >> This is because some file systems do a fsck or other activity even when
> >> mounted read only.  For the kexec case, however, this should be "file
> >> systems mounted by the hibernated system must not be written".   As has
> >> been mentioned in the past, we should be able to use something like dm
> >> snapshot to allow fsck and the file system to see the cleaned copy
> >> while not actually writing the media.
> >
> > We can't _require_ users to use the dm snapshot in order for the hibernation
> > to work, sorry.
> >
> > And by _reading_ from a filesystem you generally update metadata.
> 
> not if the filesystem is mounted read-only (except on ext3)

Well, if the filesystem in question is a journaling one and the hibernated
kernel has mounted this fs read-write, this seems to be tricky anyway.
 
> >> The kjump kernel must not have any knowledge retained if we reuse it.
> >>
> >>> (2) Swap space in use before the hibernation must be handled with care
> >>
> >> Yes.  Actually, even though they have been used by the write-in-the
> >> kernel users, they will be among the most difficult devices to use for
> >> snapshots by a userspace second kernel.
> >>
> >>> (3) There are memory regions that must not be saved or restored
> >>
> >> because they may not exist.   This means that we must identify the
> >> memory to be saved and restored in a format to be passed between the
> >> kernel.
> >>
> >>> (4) The user should be able to limit the size of a hibernation image
> >>
> >> This means the suspending kernel must arrange to reduce its active
> >> memory.  The limited save can be done by providing a limited list in
> >> (3).
> >
> > It seems to me that you don't understand the problem here.
> >
> > Assume you have 90% of RAM allocated before the hibernation and the user has
> > requested the image to be not greater than 50% of RAM.  In that case you have
> > to free some memory _before_ identifying memory to save and you must not
> > race with applications that attempt to allocate memory while you're doing it.
> 
> I disagree a little bit.
> 
> first off, only the suspending kernel can know what can be freed and what 
> is needed to do so (remember this is kernel internals, it can change from 
> patch to patch, let alone version to version)
> 
> second, if you have a lot of memory to free, and you can't just throw away 
> caches to do so, you don't know what is going to be involved in freeing 
> the memory, it's very possible that it is going to involve userspace, so
> you can't freeze any significant portion of the system, so you can't 
> eliminate all chance of races
> 
> what you can do is
> 
> 1. try to free stuff
> 2. stop the system and account for memory, is enough free
> if not goto 1
> 
> if userspace is dirtying memory fast enough, or is just using enough
> memory that you can't meet your limit you just won't be able to suspend.

This means unreliable hibernation for some workloads.  While I agree that
shouldn't be a problem in a common case, there are users who will complain. ;-)

> but under any other conditions you will eventually get enough memory free.
> 
> so try several times and if you still fail tell the user they have too 
> much stuff running and they need to kill something.

Well, with the freezer that's much simpler (and more reliable, I'd say): you
freeze tasks and _then_ you shrink memory.
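
To make the difference between the two strategies concrete, here is a minimal
sketch in Python.  It is only a toy model under stated assumptions: ToySystem,
free_some() and run_tasks() are hypothetical stand-ins, not kernel interfaces,
and memory is counted in abstract pages.  shrink_with_retries() follows the
quoted "try to free, stop and account, goto 1" proposal, while
freeze_then_shrink() follows the freezer ordering.

```python
class ToySystem:
    """Toy model (hypothetical): 'used' pages out of 'total'."""

    def __init__(self, total, used, alloc_per_round):
        self.total = total
        self.used = used
        self.alloc_per_round = alloc_per_round  # pages tasks grab per round
        self.frozen = False

    def run_tasks(self):
        # Unfrozen tasks keep allocating while we are trying to shrink.
        if not self.frozen:
            self.used = min(self.total, self.used + self.alloc_per_round)

    def free_some(self, pages):
        self.used = max(0, self.used - pages)


def shrink_with_retries(system, limit, free_per_round, max_tries):
    """Retry loop: free, let tasks run, re-check, give up after max_tries."""
    for _ in range(max_tries):
        system.free_some(free_per_round)  # 1. try to free stuff
        system.run_tasks()                # tasks race with the shrinker
        if system.used <= limit:          # 2. account for memory
            return True
    return False                          # caller tells the user to kill something


def freeze_then_shrink(system, limit, free_per_round):
    """Freezer ordering: stop tasks first, then shrink without races."""
    system.frozen = True
    while system.used > limit:
        system.free_some(free_per_round)
        system.run_tasks()                # no effect: tasks are frozen
    return True
```

With tasks that allocate as fast as memory is freed, the retry loop gives up,
while the frozen variant always terminates.  The model assumes that freeing
never needs a frozen task to run, which is the assumption the retry proposal
calls into question.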

> >>> (6) State of devices from before hibernation should be restored, if
> >>> possible
> >>
> >> related to suspend should be transparent ... yes.
> >>
> >>> (7) On ACPI systems special platform-related actions have to be
> >>> carried out at
> >>>     the right points, so that the platform works correctly after the
> >>> restore
> >>
> >> I believe I have explained my suggestion.
> >>
> >>> (8) Hibernation and restore should not be too slow
> >>
> >> We control the added code.   We are using full runtime drivers and will
> >> run at hardware speeds.
> >
> > That may not be enough.  If you're going to save, say, 80% of RAM on a 2 GB
> > machine, then you'll have to be using image compression.
> 
> this doesn't make sense, 20% of 2G is 400M, if you can't make a kernel and 
> userspace that can run in 400M you have a serious problem.

I was talking about the _speed_ of writing and reading.

> even if you wanted to save 99% of RAM on a 2G system, you have 20M of ram 
> to play with, which should easily be enough.
> 
> remember, linux runs on really small systems as well, and while you do 
> have to load some drivers for the big system, there are a lot of other 
> things that aren't needed.
> 
> > All in all, we have three different and working implementations of the
> > image-writing and image-reading code at our disposal.  Why would you want to
> > break the open doors?
> 
> because you say that the current methods won't work without ACPI support.

I didn't say that.  [Or if I did, please point me to this message.]

Anyway, this wouldn't be true even if I did.

What I've been trying to say from the very beginning is that the current
frameworks _support_ hibernation a la ACPI S4 (although that's not exactly
ACPI S4) and if we are going to introduce a new framework, then it should
be designed to _support_ ACPI S4 fully _from_ _the_ _start_.

This DOESN'T mean that the non-ACPI hibernation should be unsupported and
it DOESN'T mean that the non-ACPI hibernation is not supported currently.
IT IS SUPPORTED.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-19 17:31                           ` [linux-pm] " david
@ 2007-07-20 14:24                             ` Milton Miller
  2007-07-20 15:44                               ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Milton Miller @ 2007-07-20 14:24 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, LKML, Rafael J. Wysocki, Huang, Ying, linux-pm,
	Jeremy Maitin-Shepard

On Jul 19, 2007, at 12:31 PM, david@lang.hm wrote:
> On Thu, 19 Jul 2007, Milton Miller wrote:
>>>  (2) Upon start-up (by which I mean what happens after the user has 
>>> pressed
>>>      the power button or something like that):
>>>    * check if the image is present (and valid) _without_ enabling 
>>> ACPI (we
>>>  don't
>>>      do that now, but I see no reason for not doing it in the new
>>>  framework)
>>>    * if the image is present (and valid), load it
>>>    * turn on ACPI (unless already turned on by the BIOS, that is)
>>>    * execute the _BFS global control method
>>>    * execute the _WAK global control method
>>>    * continue
>>>    Here, the first two things should be done by the image-loading
>>>  kernel, but
>>>    the remaining operations have to be carried out by the restored 
>>> kernel.
>>
>> Here I agree.
>>
>> Here is my proposal.  Instead of trying to both write the image and 
>> suspend, I think this all becomes much simpler if we limit the scope 
>> of the work of the second kernel.  Its purpose is to write the image.   
>> After that it's done. The platform can be powered off if we are going 
>> to S5.   However, to support suspend to ram and suspend to disk, we 
>> return to the first kernel.
>>
>> This means that the first kernel will need to know why it got 
>> resumed.  Was the system powered off, and this is the resume from the 
>> user?   Or was it restarted because the image has been saved, and it's 
>> now time to actually suspend until woken up?  If you look at it, this 
>> is the same interface we have with the magic arch_suspend hook -- did 
>> we just suspend and it's time to write the image, or did we just 
>> resume and it's time to wake everything up.
>>
>> I think this can be easily solved by giving the image saving kernel 
>> two resume points: one for the image has been written, and one for we 
>> rebooted and have restored the image.  I'm not familiar with ACPI.  
>> Perhaps we need a third to differentiate we read the image from S4 
>> instead of from S5, but that information must be available to the OS 
>> because it needs that to know if it should resume from hibernate.
>
> are we sure that there are only 2-3 possible actions? or should this 
> be made into a simple jump table so that it's extendable?

At 2 I don't think we need a jump table.   Even if we had a table, we 
have to identify what each entry means.  If we start getting more then 
we can change from command line to table.
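
A minimal sketch of the "two resume points" idea, written in Python for
brevity.  The names ResumeReason, resume() and the two handlers are
hypothetical; the proposal only says that the resumed kernel must be told
which of the two cases it is in.  With two reasons a plain branch is enough,
and a table keyed by reason could replace it if more cases appear.

```python
from enum import Enum


class ResumeReason(Enum):
    # Hypothetical names for the two entry points discussed above.
    IMAGE_SAVED = 1     # save kernel finished writing; now really suspend
    IMAGE_RESTORED = 2  # machine was restarted and the image was reloaded


def on_image_saved(log):
    log.append("enter sleep state or power off")


def on_image_restored(log):
    log.append("wake devices and continue")


def resume(reason, log):
    """Dispatch on why the first kernel got control back."""
    if reason is ResumeReason.IMAGE_SAVED:
        on_image_saved(log)
    elif reason is ResumeReason.IMAGE_RESTORED:
        on_image_restored(log)
    else:
        raise ValueError("unknown resume reason: %r" % (reason,))
    return log
```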

>> As noted in  the thread
>>
>> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
>> Subject: [linux-pm] Re: hibernation/snapshot design
>> on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
>>
>>>  (3) how to communicate where to save the memory
>>
>> This is an interesting topic.  The suspended kernel has most IO and 
>> disk space.  It also knows how much space is to be occupied by the 
>> kernel.   So communicating a block map to the second kernel would be 
>> the obvious choice. But the second kernel must be able to find the 
>> image to restore it, and it must have drivers for the media.  Also, 
>> this is not feasible for storing to nfs.
>>
>> I think we will end up with several methods.
>>
>> One would be supply a list of blocks, and implement a file system 
>> that reads the file by reading the scatter list from media.  The 
>> restore kernel then only needs to read an anchor, and can build upon 
>> that until the image is read into memory.  Or do this in userspace.
>>
>> I don't know how this compares to the current restore path.   I 
>> wasn't able to identify the code that creates the on disk structure 
>> in my 10 minute perusal of kernel/power/.
>>
>> A second method will be to supply a device and file that will be 
>> mounted by the save kernel, then unmounted and restored.  This would 
>> require a partition that is not mounted or open by the suspended 
>> kernel (or use nfs or a similar protocol that is designed for 
>> multiple client concurrent access).
>>
>> A third method would be to allocate a file with the first kernel, and 
>> make sure the blocks are flushed to disk.  The save and restore 
>> kernels map the file system using a snapshot device.  Writing would 
>> map the blocks and use the block offset to write to the real device 
>> using the method from the first option; reading could be done 
>> directly from the snapshot device.
>>
>> The first and third option are dead on log based file systems (where 
>> the data is stored in the log).
>
> remember that the save and restore kernel can access the memory of the 
> suspending kernel, so as long as the data is in a known format and 
> there is a pointer to the data in a known location, the save and 
> restore kernel can retrieve the data from memory, there's no need to 
> involve media.

I agree that the save kernel can read the list from the being-saved 
kernel.

However, when restoring, the being-saved (being-restored) kernel is not 
accessible, so the save list has to be stored as part of the image.
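
One way to picture a save list that travels with the image is a toy format
in which a header listing page frame numbers precedes the page data.  This is
a sketch under assumptions: the magic string, the 4096-byte page size and the
little-endian layout are invented for illustration and do not describe the
format used by any existing implementation.

```python
import struct

PAGE_SIZE = 4096        # assumed page size
MAGIC = b"TOYIMG01"     # hypothetical 8-byte marker


def write_image(pages):
    """pages maps pfn -> PAGE_SIZE bytes.  Returns the image as bytes.

    Layout: MAGIC, page count, list of pfns, then page data in list order.
    """
    pfns = sorted(pages)
    for pfn in pfns:
        if len(pages[pfn]) != PAGE_SIZE:
            raise ValueError("page %d has the wrong size" % pfn)
    parts = [MAGIC, struct.pack("<Q", len(pfns))]
    parts += [struct.pack("<Q", pfn) for pfn in pfns]
    parts += [pages[pfn] for pfn in pfns]
    return b"".join(parts)


def read_image(blob):
    """Rebuild the pfn -> page mapping using only what is in the image."""
    if blob[:8] != MAGIC:
        raise ValueError("not a toy image")
    (count,) = struct.unpack_from("<Q", blob, 8)
    pfns = struct.unpack_from("<%dQ" % count, blob, 16)
    data = 16 + 8 * count
    return {
        pfn: blob[data + i * PAGE_SIZE: data + (i + 1) * PAGE_SIZE]
        for i, pfn in enumerate(pfns)
    }
```

The restore side never consults the suspended kernel's memory: everything it
needs to put pages back is read from the image itself.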

>> Simplifying kjump: the proposal for v3.
>>
>> The current code is trying to use crash dump area as a safe, reserved 
>> area to run the second kernel.   However, that means that the kernel 
>> has to be linked specially to run in the reserved area.   I think we 
>> need to finish separating kexec_jump from the other code paths.
>
> on x86 at least it's possible to compile a relocatable kernel, so it 
> doesn't need to be compiled specifically for a particular reserved area. 
> This would allow you to use the same kernel build as the suspending 
> kernel if you wanted to (I think that the config of the save and 
> restore kernel is going to be trivial enough to consider 
> auto-configuring and building a specific kernel for each box a real 
> possibility)

Yes, one *can* build x86 relocatable.  But there are funny restrictions 
like it has to be a bzImage or be loaded by kexec or something.   And 
not all architectures have relocatable support.  I think making the 
lists for the existing code to swap memory will not be that difficult 
and it will make the solution have fewer restrictions.  Maybe I should 
shut up and write some code this weekend.

Actually, I think we can have the dedicated area as an option.  If you 
suspend frequently keep a relocated kernel booted.  If you need more 
ram or suspend infrequently allocate the pages on the fly.


>> As a first stage of suspend and resume, we can save to dedicated 
>> partitions all memory (as supplied to crash_dump) that is not marked 
>> nosave and not part of the save kernel's image.   The fancy block 
>> lists and memory lists can be added later.
>
> if the suspending kernel needs to tell the save and restore kernel 
> what memory is not marked nosave have it do so using a memory list of 
> some kind. you need to set up a mechanism for communicating the data 
> anyway, set up a mechanism that's usable in the long term.

I'm saying we can have people start to test by the simple save all ram 
to dedicated while we figure out what the long term list looks like.

>> If we want to keep the second kernel booted, then we need to add a 
>> save area for the booted jump target.   Note that the save and 
>> restore lists to relocate_new_kernel can be computed once and saved.  
>>  Longer term we could implement sys_kexec_load(UNLOAD) that would 
>> retrieve the saved list back to application space to save to disk in 
>> a file.   This means you could save the booted save kernel, it just 
>> couldn't have any shared storage open.
>
> since the kexec to the second kernel needs to handle the device 
> initialization, do you really save much by doing this? from a 
> reliability point of view it would seem simpler (and therefore more 
> reliable) to initialize the save and restore kernel each time it's 
> used, so that it always does the same thing (as opposed to carrying 
> state from one use to the next)

You can save a bit of run time initialization, at the cost of saving 
the whole image with the initialized pages instead of zeroing 
uninitialized pages.   The code to restore the devices is the same code 
path as the code for the main kernel to restore the devices (as 
implemented in the current patch), so we get more testing of that path.

> David Lang
milton



* Re: [linux-pm] Re: Hibernation considerations
       [not found]                         ` <40fa2626aff7b6b590ad6aa4737fc873@bga.com>
@ 2007-07-20 14:48                           ` Huang, Ying
  2007-07-20 15:48                             ` david
  2007-07-20 21:34                             ` Rafael J. Wysocki
  0 siblings, 2 replies; 220+ messages in thread
From: Huang, Ying @ 2007-07-20 14:48 UTC (permalink / raw)
  To: Milton Miller
  Cc: linux-pm, LKML, Rafael J.Wysocki, Alan Stern, David Lang,
	Jeremy Maitin-Shepard

On Fri, 2007-07-20 at 09:01 -0500, Milton Miller wrote:
> Simplifying kjump: the proposal for v3.
> 
> The current code is trying to use crash dump area as a safe, reserved 
> area to run the second kernel.   However, that means that the kernel 
> has to be linked specially to run in the reserved area.   I think we 
> need to finish separating kexec_jump from the other code paths.
> 
> (1) add a new command line argument that specifies the kexec_jump 
> target area (or just size?)
> 
> (2) add a kjump flag to the flags parameter, used by kexec_load.   When 
> loading a jump kernel, it is loaded like a normal kernel, however, 
> additional control pages are allocated to (a) save this kernel's use of 
> the kexec_jump target area (b) save the backed up region that is used 
> by all kernels like crash dump, and (c) space for invoking 
> relocate_new_kernel that will get its args from the execution entry 
> point and will restore the kernel then call resume and suspend.

Backing up target memory before kexec and restoring it after kexec is a
planned feature for kexec jump. But I will work on image writing/reading
first.

> (3) replace jump_buf_pfn with two command line addresses that specify 
> the (a) return point for after resume, and (b) the return point for 
> after image save.   Actually these can be done in userspace; the second 
> restore kernel can just specify the null copy list and the entry points 
> supplied by the suspended kernel.  To do resume we also need (c) where 
> to store resume address for the save kernel.

There is plenty of free space in the jump_buf_pfn page now. I think passing
the needed information through jump_buf_pfn is more convenient than through
the kernel command line. That is, jump_buf_pfn can be seen as a meta
interface, which is passed to the kexeced kernel through the command line,
while other information can be passed through the jump_buf_pfn page.

> The separation should be whoever builds a scatter copy list builds the 
> inverse list.  This is why I propose simple jump entry points.  I 
> expect just a few instructions to establish arguments for the call to 
> the existing relocate_new_kernel code.

If the "scatter copy" is replaced by a "scatter swap", we do not need the
inverse list, and the state of the kexeced kernel can be backed up too.
There is "scatter copy" support in the normal kexec implementation, in
"relocate_kernel".

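The difference between the two list types can be shown on a toy memory
array.  This is a sketch only: memory is modelled as a Python list of page
labels, and scatter_copy(), scatter_swap() and undo_scatter_swap() are
hypothetical helpers, not the relocate_kernel code.  A copy destroys the
destination page, so undoing it needs a separately built inverse list plus a
backup; a swap keeps both pages, so replaying the same list in reverse order
restores the original layout.

```python
def scatter_copy(memory, pairs):
    """For each (src, dst), overwrite page dst with page src."""
    for src, dst in pairs:
        memory[dst] = memory[src]


def scatter_swap(memory, pairs):
    """For each (a, b), exchange pages a and b; nothing is lost."""
    for a, b in pairs:
        memory[a], memory[b] = memory[b], memory[a]


def undo_scatter_swap(memory, pairs):
    """Replay the same swap list backwards to restore the old layout."""
    for a, b in reversed(pairs):
        memory[a], memory[b] = memory[b], memory[a]
```

After scatter_swap() the pages that used to sit in the target area are still
present at the source locations, which is what allows the state of the
kexeced kernel to be kept as well.
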
> As a first stage of suspend and resume, we can save to dedicated 
> partitions all memory (as supplied to crash_dump) that is not marked 
> nosave and not part of the save kernel's image.   The fancy block lists 
> and memory lists can be added later.
> 
> Making these changes will allow us to use a normal kernel invoked with 
> acpi=off apm=off mem=xxk as the save and restore kernel.

Yes, I am working on this.

> If we want to keep the second kernel booted, then we need to add a save 
> area for the booted jump target.   Note that the save and restore lists 
> to relocate_new_kernel can be computed once and saved.   Longer term we 
> could implement sys_kexec_load(UNLOAD) that would retrieve the saved 
> list back to application space to save to disk in a file.   This means 
> you could save the booted save kernel, it just couldn't have any shared 
> storage open.

Yes, this is also planned, but with lower priority, and it will only be
added if necessary.

Best Regards,
Huang Ying


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 11:17                               ` Rafael J. Wysocki
@ 2007-07-20 15:35                                 ` david
  2007-07-20 16:15                                   ` Alan Stern
  2007-07-20 16:56                                 ` Milton Miller
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-20 15:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Milton Miller, linux-pm, LKML, Alan Stern, Huang, Ying,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Rafael J. Wysocki wrote:

> On Friday, 20 July 2007 01:07, david@lang.hm wrote:
>> On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>>>>
>>>> The currently identified problems under discussion include:
>>>> (1) how to interact with acpi to enter into S4.
>>>> (2) how to identify which memory needs to be saved
>>>> (3) how to communicate where to save the memory
>>>> (4) what state should devices be in when switching kernels
>>>> (5) the complicated setup required with the current patch
>>>> (6) what code restores the image
>>>
>>> (7) how to avoid corrupting filesystems mounted by the hibernated kernel
>>
>> I didn't realize this was a discussion item. I thought the options were
>> clear, for some filesystem types you can mount them read-only, but for
>> ext3 (and possibly other less common ones) you just plain cannot touch
>> them.
>
> That's correct.  And since you cannot touch ext3, you need either to assume
> that you won't touch filesystems at all, or to have code to recognize the
> filesystem you're dealing with.

filesystem detection routines are already available

however, I'm the type to say that in a case like this where it matters you 
should explicitly give the filesystem type anyway.

or have the userspace helper functions that set up the instructions for the 
hibernate warn you if you are telling it to mount a filesystem that it 
knows is ext3 and is in use by the system going to sleep.

in any case there needs to be a big warning about this issue, but that's 
different from saying "can't touch any filesystem that was mounted"
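
As a rough sketch of what such a userspace helper check could look like, the
function below scans a mount table in the /proc/mounts text format (device,
mount point, type, options, ...) taken from the system going to sleep.  The
function name, the return strings and the UNSAFE_EVEN_READ_ONLY set are
assumptions made for illustration; only ext3 is listed because it is the one
filesystem named in this discussion.

```python
# Filesystem types that must not be touched at all while they are in use
# by the hibernated system (assumption: only ext3, as named in the thread).
UNSAFE_EVEN_READ_ONLY = {"ext3"}


def check_target(mounts_text, device):
    """Classify a device the save kernel was asked to mount.

    Returns "refuse" if the device is in use with an unsafe filesystem,
    "read-only" if it is in use with another filesystem type, and "ok"
    if the hibernated system does not have it mounted.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue                      # skip malformed lines
        dev, _mountpoint, fstype, _options = fields[:4]
        if dev != device:
            continue
        if fstype in UNSAFE_EVEN_READ_ONLY:
            return "refuse"
        return "read-only"
    return "ok"
```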


>> but under any other conditions you will eventually get enough memory free.
>>
>> so try several times and if you still fail tell the user they have too
>> much stuff running and they need to kill something.
>
> Well, with the freezer that's much simpler (and more reliable, I'd say): you
> freeze tasks and _then_ you shrink memory.

this only works if you don't need tasks to do anything to free the memory.

>>>>> (6) State of devices from before hibernation should be restored, if
>>>>> possible
>>>>
>>>> related to suspend should be transparent ... yes.
>>>>
>>>>> (7) On ACPI systems special platform-related actions have to be
>>>>> carried out at
>>>>>     the right points, so that the platform works correctly after the
>>>>> restore
>>>>
>>>> I believe I have explained my suggestion.
>>>>
>>>>> (8) Hibernation and restore should not be too slow
>>>>
>>>> We control the added code.   We are using full runtime drivers and will
>>>> run at hardware speeds.
>>>
>>> That may not be enough.  If you're going to save, say, 80% of RAM on a 2 GB
>>> machine, then you'll have to be using image compression.
>>
>> this doesn't make sense, 20% of 2G is 400M, if you can't make a kernel and
>> userspace that can run in 400M you have a serious problem.
>
> I was talking about the _speed_ of writing and reading.

if you are just talking about the I/O time for writing 2G of data, I 
wouldn't worry about it. suspend isn't supposed to change the capabilities 
of the system. if it takes a long time to do the I/O then that's how long 
it takes.

being able to do compression or send the data elsewhere is a good thing, 
but that's not something that is required to be supported in order to meet 
some performance goal or consider the approach a failure.
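
For a feel of the numbers behind the speed argument, here is a
back-of-the-envelope sketch.  The disk throughput and compression ratio used
in the example are purely illustrative assumptions, not measurements from any
machine, and the model ignores the CPU time spent compressing.

```python
def image_write_seconds(ram_mib, saved_fraction, disk_mib_per_s,
                        compression_ratio=1.0):
    """Estimated time to write a hibernation image.

    ram_mib:            total RAM in MiB
    saved_fraction:     fraction of RAM that ends up in the image
    disk_mib_per_s:     sustained write throughput (assumed)
    compression_ratio:  uncompressed size / compressed size (assumed)
    """
    image_mib = ram_mib * saved_fraction / compression_ratio
    return image_mib / disk_mib_per_s


# 80% of a 2 GB machine, assuming a disk that sustains 40 MiB/s:
plain = image_write_seconds(2048, 0.8, 40)
# Same image, assuming it compresses 2:1:
packed = image_write_seconds(2048, 0.8, 40, compression_ratio=2.0)
```

Under these assumptions the uncompressed write takes about 41 seconds and the
compressed one about half that; the same ratio applies when reading the image
back.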

>>> All in all, we have three different and working implementations of the
>>> image-writing and image-reading code at our disposal.  Why would you want to
>>> break the open doors?
>>
>> because you say that the current methods won't work without ACPI support.
>
> I didn't say that.  [Or if I did, please point me to this message.]

I may have misunderstood you (I have deleted 90% of the messages in this 
thread so I won't try to go back and find where I thought you said this)

> Anyway, this wouldn't be true even if I did.
>
> What I've been trying to say from the very beginning is that the current
> frameworks _support_ hibernation a la ACPI S4 (although that's not exactly
> ACPI S4) and if we are going to introduce a new framework, then it should
> be designed to _support_ ACPI S4 fully _from_ _the_ _start_.

here is where there is some disagreement (although it may just be 
misunderstanding on the 'fully support' phrase)

it sounds like you are saying that the ACPI support requires a lot of work 
(the phrase I've seen some people use is a requirement to 'fix all the 
drivers'). we aren't wanting to have this work prevent the non-ACPI 
hibernation from progressing.

this isn't that we don't want the ACPI support eventually, it's in the 
spirit of 'perfection is the enemy of good enough'

David Lang

> This DOESN'T mean that the non-ACPI hibernation should be unsupported and
> it DOESN'T mean that the non-ACPI hibernation is not supported currently.
> IT IS SUPPORTED.


* Re: [linux-pm] Re: Hibernation considerations
       [not found]                               ` <20070720152744.GH20529@grifter.jdc.home>
@ 2007-07-20 15:36                                 ` david
  2007-07-20 21:43                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-20 15:36 UTC (permalink / raw)
  To: Jim Crilly
  Cc: Rafael J. Wysocki, Milton Miller, linux-pm, LKML, Alan Stern,
	Huang, Ying, Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Jim Crilly wrote:

>>> has
>>> requested the image to be not greater than 50% of RAM.  In that case you
>>> have
>>> to free some memory _before_ identifying memory to save and you must not
>>> race with applications that attempt to allocate memory while you're doing
>>> it.
>>
>> I disagree a little bit.
>>
>> first off, only the suspending kernel can know what can be freed and what
>> is needed to do so (remember this is kernel internals, it can change from
>> patch to patch, let alone version to version)
>>
>> second, if you have a lot of memory to free, and you can't just throw away
>> caches to do so, you don't know what is going to be involved in freeing
>> the memory, it's very possible that it is going to involve userspace, so
>> you can't freeze any significant portion of the system, so you can't
>> eliminate all chance of races
>>
>> what you can do is
>>
>> 1. try to free stuff
>> 2. stop the system and account for memory, is enough free
>> if not goto 1
>>
>> if userspace is dirtying memory fast enough, or is just using enough
>> memory that you can't meet your limit you just won't be able to suspend.
>>
>> but under any other conditions you will eventually get enough memory free.
>>
>> so try several times and if you still fail tell the user they have too
>> much stuff running and they need to kill something.
>
> Which would be a pretty big regression from what we have now. With the
> current implementation I can hibernate under virtually any workload because
> the freezer stops everything and there's no competition for resources.

as long as what you are trying to save is <=50% of ram (at least with some 
implementations). if you are trying to save more than 50% of ram with some 
current implementations you just can't

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 14:24                             ` Milton Miller
@ 2007-07-20 15:44                               ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-20 15:44 UTC (permalink / raw)
  To: Milton Miller
  Cc: Alan Stern, LKML, Rafael J. Wysocki, Huang, Ying, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Milton Miller wrote:

> On Jul 19, 2007, at 12:31 PM, david@lang.hm wrote:
>>  On Thu, 19 Jul 2007, Milton Miller wrote:
>> > 
>> >  This means that the first kernel will need to know why it got resumed. 
>> >  Was the system powered off, and this is the resume from the user?   Or 
>> >  was it restarted because the image has been saved, and its now time to 
>> >  actually suspend until woken up?  If you look at it, this is the same 
>> >  interface we have with the magic arch_suspend hook -- did we just 
>> >  suspend and its time to write the image, or did we just resume and its 
>> >  time to wake everything up.
>> > 
>> >  I think this can be easily solved by giving the image saving kernel two 
>> >  resume points: one for the image has been written, and one for we 
>> >  rebooted and have restored the image.  I'm not familiar with ACPI. 
>> >  Perhaps we need a third to differentiate we read the image from S4 
>> >  instead of from S5, but that information must be available to the OS 
>> >  because it needs that to know if it should resume from hibernate.
>>
>>  are we sure that there are only 2-3 possible actions? or should this be
>>  made into a simple jump table so that it's extendable?
>
> At 2 I don't think we need a jump table.   Even if we had a table, we have to 
> identify what each entry means.  If we start getting more then we can change 
> from command line to table.

Ok, I was just looking to future-proof things so that these features can 
work on older kernels (as opposed to having two interfaces and when we 
switch from one to the next kernels older than that can't be used)

>>  remember that the save and restore kernel can access the memory of the
>>  suspending kernel, so as long as the data is in a known format and there
>>  is a pointer to the data in a known location, the save and restore kernel
>>  can retrieve the data from memory, there's no need to involve media.
>
> I agree that the save kernel can read the list from the being-saved 
> kernel.
>
> However, when restoring, the being-saved (being-restored) kernel is not 
> accessible, so the save list has to be stored as part of the image.

at that point it's less a save list than just the record of where the 
memory pages belong. you can use the same list, but you can store it along 
with the memory image. there's still no need for the suspending kernel to 
save it to permanent media.

>> >  Simplifying kjump: the proposal for v3.
>> > 
>> >  The current code is trying to use crash dump area as a safe, reserved 
>> >  area to run the second kernel.   However, that means that the kernel has 
>> >  to be linked specially to run in the reserved area.   I think we need to 
>> >  finish separating kexec_jump from the other code paths.
>>
>>  on x86 at least it's possible to compile a relocatable kernel, so it
>>  doesn't need to be compiled specifically for a particular reserved area.
>>  This would allow you to use the same kernel build as the suspending kernel
>>  if you wanted to (I think that the config of the save and restore kernel
>>  is going to be trivial enough to consider auto-configuring and building a
>>  specific kernel for each box a real possibility)
>
> Yes, one *can* build x86 relocatable.  But there are funny restrictions like 
> it has to be a bzImage or be loaded by kexec or something.   And not all 
> architectures have relocatable support.  I think making the lists for the 
> existing code to swap memory will not be that difficult and it will make the 
> solution have fewer restrictions.  Maybe I should shut up and write some code 
> this weekend.
>
> Actually, I think we can have the dedicated area as an option.  If you 
> suspend frequently keep a relocated kernel booted.  If you need more ram or 
> suspend infrequently allocate the pages on the fly.

for the proof of concept that we are trying for now there's no need to 
implement the capability to free memory to make room for the kexec kernel. 
after we show that this can work that can be added.

>
>> >  As a first stage of suspend and resume, we can save to dedicated 
>> >  partitions all memory (as supplied to crash_dump) that is not marked 
>> >  nosave and not part of the save kernel's image.   The fancy block lists 
>> >  and memory lists can be added later.
>>
>>  if the suspending kernel needs to tell the save and restore kernel what
>>  memory is not marked nosave have it do so using a memory list of some
>>  kind. you need to set up a mechanism for communicating the data anyway,
>>  set up a mechanism that's usable in the long term.
>
> I'm saying we can have people start to test by the simple save all ram to 
> dedicated while we figure out what the long term list looks like.

the only problem with this is that Rafael is saying that if you try to 
save all the ram you will fail (in some cases because you are trying to 
save ram that doesn't exist). so at the very least we need to get a list 
that tells us what _can_ be saved.
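
A minimal sketch of how such a list could be derived, assuming the suspending
kernel can describe RAM that actually exists and the regions marked nosave as
half-open page-frame ranges.  The function name and the range representation
are hypothetical; the point is only that holes in the memory map and nosave
regions are both excluded before anything is written.

```python
def saveable_ranges(ram_ranges, nosave_ranges):
    """Return the (start_pfn, end_pfn) ranges that exist and may be saved.

    Both inputs are lists of half-open (start_pfn, end_pfn) ranges.
    Anything outside ram_ranges does not exist; anything inside
    nosave_ranges must not be saved or restored.
    """
    nosave = sorted(nosave_ranges)
    result = []
    for start, end in sorted(ram_ranges):
        cur = start
        for ns_start, ns_end in nosave:
            if ns_end <= cur or ns_start >= end:
                continue                  # no overlap with what is left
            if ns_start > cur:
                result.append((cur, ns_start))
            cur = max(cur, ns_end)        # skip past the nosave region
            if cur >= end:
                break
        if cur < end:
            result.append((cur, end))
    return result
```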

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 14:48                           ` Huang, Ying
@ 2007-07-20 15:48                             ` david
  2007-07-22  2:17                               ` Huang, Ying
  2007-07-20 21:34                             ` Rafael J. Wysocki
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-20 15:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Milton Miller, linux-pm, LKML, Rafael J.Wysocki, Alan Stern,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Huang, Ying wrote:

> 
> On Fri, 2007-07-20 at 09:01 -0500, Milton Miller wrote:
>> Simplifying kjump: the proposal for v3.
>>
>> The current code is trying to use crash dump area as a safe, reserved
>> area to run the second kernel.   However, that means that the kernel
>> has to be linked specially to run in the reserved area.   I think we
>> need to finish separating kexec_jump from the other code paths.
>>
>> (1) add a new command line argument that specifies the kexec_jump
>> target area (or just size?)
>>
>> (2) add a kjump flag to the flags parameter, used by kexec_load.   When
>> loading a jump kernel, it is loaded like a normal kernel, however,
>> additional control pages are allocated to (a) save this kernel's use of
>> the kexec_jump target area (b) save the backed up region that is used
>> by all kernels like crash dump, and (c) space for invoking
>> relocate_new_kernel that will get its args from the execution entry
>> point and will restore the kernel then call resume and suspend.
>
> Backing up target memory before kexec and restoring it after kexec is a
> planned feature for kexec jump. But I will work on image writing/reading
> first.

if we can get a list of what memory is safe to backup/restore then the 
reading/writing of the image should be able to be done in userspace.

>> (3) replace jump_buf_pfn with two command line addresses that specify
>> the (a) return point for after resume, and (b) the return point for
>> after image save.   Actually these can be done in userspace; the second
>> restore kernel can just specify the null copy list and the entry points
>> supplied by the suspended kernel.  To do resume we also need (c) where
>> to store resume address for the save kernel.
>
> There is plenty of free space in the jump_buf_pfn page now. I think passing
> the needed information through jump_buf_pfn is more convenient than through
> the kernel command line. That is, jump_buf_pfn can be seen as a meta
> interface, which is passed to the kexeced kernel through the command line,
> while other information can be passed through the jump_buf_pfn page.



>> The separation should be whoever builds a scatter copy list builds the
>> inverse list.  This is why I propose simple jump entry points.  I
>> expect just a few instructions to establish arguments for the call to
>> the existing relocate_new_kernel code.
>
> If the "scatter copy" is replaced by a "scatter swap", we do not need the
> inverse list, and the state of the kexeced kernel can be backed up too.
> There is "scatter copy" support in the normal kexec implementation, in
> "relocate_kernel".

what do you mean by "scatter swap"?

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-19 20:28                           ` Rafael J. Wysocki
  2007-07-19 23:07                             ` david
@ 2007-07-20 16:08                             ` Milton Miller
  2007-07-20 16:20                               ` Alan Stern
  2007-07-20 21:02                               ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: Milton Miller @ 2007-07-20 16:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Ying Huang, LKML, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:
> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>> The currently identified problems under discussion include:
>> (1) how to interact with acpi to enter into S4.
>> (2) how to identify which memory needs to be saved
>> (3) how to communicate where to save the memory
>> (4) what state should devices be in when switching kernels
>> (5) the complicated setup required with the current patch
>> (6) what code restores the image
>
> (7) how to avoid corrupting filesystems mounted by the hibernated 
> kernel
>

OK, I talked about this too.

>> I'll now start with quotes from several articles in this thread and my
>> responses.
>>
>> Message-ID: <200707172217.01890.rjw@sisk.pl>
>> On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:
>>> (1) Upon entering the sleep state, which IMO can be done _after_ the
>>>     image has been saved:
>>>   * figure out which devices can wake up
>>>   * put devices into low power states (wake-up devices are placed in
>>>     the Dx states compatible with the wake capability, the others are
>>>     powered off)
>>>   * execute the _PTS global control method
>>>   * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
>>>   * execute the _GTS global control method
>>>   * set the GPE enable registers corresponding to the wake-up devices
>>>   * make the platform enter S4 (there's a well defined procedure for
>>>     that)
>>>   I think that this should be done by the image-saving kernel.
>>
>> Message-ID: <87odiag45q.fsf@jbms.ath.cx>
>> On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
>> expressed his agreement with this block but also confusion on the 
>> other
>> blocks.
>>
>>
>> I strongly disagree.
>>
>> (1) as has been pointed out, this requires the new kernel to 
>> understand
>> all io devices in the first kernel.
>> (2) it requires both kernels to talk to ACPI.   This is doomed to
>> failure.  How can the second kernel initialize ACPI?   The platform
>> thinks it has already been initialized.  Do we plan to always undo all
>> acpi initialization?
>
> Good question.  I don't know.


>>> (2) Upon start-up (by which I mean what happens after the user has
>>>     pressed the power button or something like that):
>>>   * check if the image is present (and valid) _without_ enabling ACPI
>>>     (we don't do that now, but I see no reason for not doing it in the
>>>     new framework)
>>>   * if the image is present (and valid), load it
>>>   * turn on ACPI (unless already turned on by the BIOS, that is)
>>>   * execute the _BFS global control method
>>>   * execute the _WAK global control method
>>>   * continue
>>>   Here, the first two things should be done by the image-loading
>>>   kernel, but the remaining operations have to be carried out by the
>>>   restored kernel.
>>
>> Here I agree.
>>
>> Here is my proposal.  Instead of trying to both write the image and
>> suspend, I think this all becomes much simpler if we limit the scope
>> of the work of the second kernel.  Its purpose is to write the image.
>> After that it's done.   The platform can be powered off if we are going
>> to S5.   However, to support suspend to ram and suspend to disk, we
>> return to the first kernel.
>
> We can't do this unless we have frozen tasks (this way, or another) 
> before
> carrying out the entire operation.

What can't we do?   We've already worked with the drivers to quiesce the
hardware and put any information to resume the device in ram.  Now we 
ask them to put their device in low power mode so we can go to sleep.  
Even if we schedule, the only thing userspace could touch is memory.   
If we resume, they just run those computations again.

> In that case, however, the kexec-based
> approach would have only one advantage over the current one.  Namely, 
> it
> would allow us to create bigger images.

The advantage is we don't have to come up with a way to teach drivers 
"wake up to run these requests, but no other requests".  We don't have 
to figure out what we need to resume to allow them to process a 
request.

>> This means that the first kernel will need to know why it got resumed.
>> Was the system powered off, and this is the resume from the user?   Or
>> was it restarted because the image has been saved, and it's now time to
>> actually suspend until woken up?  If you look at it, this is the same
>> interface we have with the magic arch_suspend hook -- did we just
>> suspend and it's time to write the image, or did we just resume and it's
>> time to wake everything up.
>>
>> I think this can be easily solved by giving the image saving kernel 
>> two
>> resume points: one for the image has been written, and one for we
>> rebooted and have restored the image.  I'm not familiar with ACPI.
>> Perhaps we need a third to differentiate we read the image from S4
>> instead of from S5, but that information must be available to the OS
>> because it needs that to know if it should resume from hibernate.
>>
>> By making the split at image save and restore we have several
>> advantages:
>>
>> (1) the kernel always initializes with devices in the init or quiesced
>> but active state.
>>
>> (2) the kernel always resumes with devices in the init or quiesced but
>> active state.
>>
>> (3) the kjump save and restore kernel does not need to know how to
>> suspend all devices in the platform.
>>
>> (4) we have a merged path for suspend to disk, suspend to ram, and
>> suspend to both.
>>
>> (5) because of (4), we can implement sleep policies where we save the
>> image to disk but try to stay in ram based on expected remaining
>> battery life.
>>
>> (6) we confine all platform (acpi) interaction to the main kernel
>>
>> (7) we limit the knowledge needed in the second kernel.   It needs to
>> know how to do its job and then put the hardware back how it found it.
>> Nothing more.
>
> This would have been nice if we had been able to do it.

I don't understand this comment.   "if we had been able"?  I don't 
think we have tried yet.
>
>> For the suspend to ram and then woken up case, we simply need to
>> invalidate the image before restarting normal kernel operation.
>>
>> People have worried about how to boot and restore the kernel, and what
>> to do if reading the image fails.   They worry about needing memory
>> hotplug or delayed acpi parsing.  They are forgetting one thing.  This
>> kernel has support for kexec.
>>
>> This is all easily solved by having the bootloader from the bios 
>> always
>> boot the restore kernel.
>
> Well, I think this is not generally acceptable, although I agree that 
> it would
> be simpler.

Those who don't find it acceptable can teach their bootloader
when they may have an image to resume.

>> It will boot with limited useable memory and
>> no acpi support.  If the restore kernel userspace detects that there 
>> is
>> no restore image, it simply loads the normal main kernel and initrd /
>> initramfs and calls the normal kexec.  The cost is the time to init 
>> the
>> restore kernel, read the kernel with full drivers (vs reading it from
>> the bootloader).  If you want a boot menu, use kboot (on sourceforge).
>
> Well, I'm afraid of adding more and more infrastructure to the mix.

Requiring the hibernated kernel to be able to start from kexec should 
not be bad.   If you were referring to adding kboot, that is just an 
option.

One can still use bootloaders menus to select alternate kernels.   
However, as you said, you want to boot differently for resume (no acpi 
until after image loaded) from full boot.
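
A minimal sketch of the start-up decision discussed here, written in
Python purely for illustration (the real logic would live in the restore
kernel's userspace).  The names BootAction and choose_boot_action are
hypothetical, and the actual image check and kexec invocation are left
out.

```python
from enum import Enum


class BootAction(Enum):
    # Load the hibernation image, then jump back into the hibernated kernel.
    RESTORE_IMAGE = "restore"
    # No usable image: load the normal kernel and initrd and kexec into it.
    KEXEC_MAIN_KERNEL = "kexec"


def choose_boot_action(image_present: bool, image_valid: bool) -> BootAction:
    """Decide what the restore kernel's userspace does at start-up.

    The restore kernel is assumed to have booted with limited memory and
    without ACPI, so either outcome is still possible at this point.
    """
    if image_present and image_valid:
        return BootAction.RESTORE_IMAGE
    return BootAction.KEXEC_MAIN_KERNEL
```

An invalid or missing image falls through to the ordinary boot path, so
a failed image read costs only the time spent in the restore kernel.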

>> On Jul 17, 2007, at 2:13 PM, Rafael J. Wysocki wrote:
>>> On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
>>>> On Tue, 17 Jul 2007, Alan Stern wrote:
>>>>> But what about the freezer?  The original reason for using kexec 
>>>>> was
>>>>> to
>>>>> avoid the need for the freezer.  With no freezer, while the 
>>>>> original
>>>>> kernel is busy powering down its devices, user tasks will be free 
>>>>> to
>>>>> carry out I/O -- which will make the memory snapshot inconsistent
>>>>> with
>>>>> the on-disk data structures.
>>>>
>>>> no, user tasks just don't get scheduled during shutdown.
>>>>
>>>> the big problem with the freezer isn't stopping anything from
>>>> happening,
>>>> it's _selectively_ stopping things.
>>
>> Agreed.   Or rather, selectively not stopping and resuming things.
>
> I don't quite understand this statement.  Can you please elaborate?

Feel free to list other problems with the freezer, but I'm saying that
the problems stem from trying to freeze most of userspace and
some selection of kernel threads so that new requests to the outside
are not made, but then turning around and saying "ok now do some io,
but only what this thread of execution originates".  It's "originates",
not "generates", so we are trying to teach the whole stack these limits,
including going back to userspace for FUSE.

>>> It's selectively stopping kernel threads, which is just about right.
>>> If you think that _this_ is a main problem with the freezer, then think
>>> again.
>>>
>>>> with kexec you don't need to let any portion of the original kernel
>>>> or userspace operate so you don't have a problem.
>>>
>>> In fact, the main problem with the freezer is that it is a
>>> coarse-grained solution.  Therefore, what I believe we should do is to
>>> evolve in the direction of more fine-grained solutions and gradually
>>> phase out the freezer.
>>>
>>> The kexec-based approach is an attempt to replace one coarse-grained
>>> solution (the freezer) with an even more coarse-grained solution
>>> (stopping the entire kernel with everything), which IMO doesn't address
>>> the main problem.
>>>
>>
>> I think this addresses the problem.   It's probably a bit harder than
>> powermac because we have to fully quiesce devices; we can't cheat by
>> leaving interrupts off.   But once the drivers save the state of their
>> devices and stop their queues, it should be easy to audit the paths to
>> powerdown devices and call the platform suspend and ram wakeup paths.

In other words, I'm replacing a coarse-grained solution with an
absolute solution.  "From this point on you can only write to ram."

>> Going back to the requirements document that started this thread:
>>
>> Message-ID: <200707151433.34625.rjw@sisk.pl>
>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>>> (1) Filesystems mounted before the hibernation are untouchable
>>
>> This is because some file systems do a fsck or other activity even 
>> when
>> mounted read only.  For the kexec case, however, this should be "file
>> systems mounted by the hibernated system must not be written".   As 
>> has
>> been mentioned in the past, we should be able to use something like dm
>> snapshot to allow fsck and the file system to see the cleaned copy
>> while not actually writing the media.
>
> We can't _require_ users to use the dm snapshot in order for the 
> hibernation
> to work, sorry.

I actually listed three ways to start.  Not all of them required 
dm-snapshot.  I was proposing "if you need to read ext3, then use 
dm-snapshot".

> And by _reading_ from a filesystem you generally update metadata.

not on ones mounted read-only.   I'll reply more later in the thread.

>> The kjump kernel must not have any knowledge retained if we reuse it.
>>
>>> (2) Swap space in use before the hibernation must be handled with 
>>> care
>>
>> Yes.  Actually, even though they have been used by the write-in-the
>> kernel users, they will be among the most difficult devices to use for
>> snapshots by a userspace second kernel.
>>
>>> (3) There are memory regions that must not be saved or restored
>>
>> because they may not exist.   This means that we must identify the
>> memory to be saved and restored in a format to be passed between the
>> kernel.
>>
>>> (4) The user should be able to limit the size of a hibernation image
>>
>> This means the suspending kernel must arrange to reduce its active
>> memory.  The limited save can be done by providing a limited list in
>> (3).
>
> It seems to me that you don't understand the problem here.
>
> Assume you have 90% of RAM allocated before the hibernation and the 
> user has
> requested the image to be not greater than 50% of RAM.  In that case 
> you have
> to free some memory _before_ identifying memory to save and you must 
> not
> race with applications that attempt to allocate memory while you're 
> doing it.

Hmm... I didn't say how to reduce the memory or identify it, did I?

Ok fine.   I'll allocate a bunch of memory and put it on a list.  
Normal memory pressure will swap things out or drop filesystem pages.   
When I build the list of memory to backup, I filter out this list.  
After resume, I'll free it back.

We can arrange for this "task" to be preferred by the oom killer, in
case the user is trying to suspend into less memory than can be
freed.
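
A minimal sketch of that filtering step, in Python for illustration
only.  Page frames are modelled as plain integers, and pages_to_save and
fits_limit are hypothetical names, not existing kernel interfaces.

```python
def pages_to_save(in_use_pfns, balloon_pfns, nosave_pfns=()):
    """Return the sorted PFNs that belong in the image.

    balloon_pfns are the pages allocated only to create memory pressure;
    their contents are meaningless, so they are filtered out, as are any
    pages known to be nosave.
    """
    excluded = set(balloon_pfns) | set(nosave_pfns)
    return sorted(pfn for pfn in set(in_use_pfns) if pfn not in excluded)


def fits_limit(save_list, total_pages, limit_fraction):
    """Check the save list against the user's image size limit."""
    return len(save_list) <= total_pages * limit_fraction
```

With 10 pages of RAM, 9 of them allocated, and a 50% limit, ballooning
4 pages leaves 5 to save, which just fits:

```python
save = pages_to_save(range(9), balloon_pfns=[5, 6, 7, 8])
print(save, fits_limit(save, total_pages=10, limit_fraction=0.5))
```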

>>> (5) Hibernation should be transparent from the applications' point of
>>> view
>>
>> People have pointed out they may want userspace to be aware of the
>> suspend.   I believe this can be done with /proc/apm emulation today
>> or by other means; it seems that should be hooked up to dbus in some
>> fashion.
>
> Not a solution, because there still will be programs not needing to 
> know
> anything about hibernation.  After all, we don't require all 
> applications to
> know anything about SMP, even if they are executed on an SMP system.

How do any of those methods require userspace to know anything about
hibernation?   I was talking about a general framework consistent with
today's kernel to user communication for those parts of userspace that
*want* to know about suspend and hibernation.

>>> (6) State of devices from before hibernation should be restored, if
>>> possible
>>
>> related to suspend should be transparent ... yes.
>>
>>> (7) On ACPI systems special platform-related actions have to be
>>> carried out at
>>>     the right points, so that the platform works correctly after the
>>> restore
>>
>> I believe I have explained my suggestion.
>>
>>> (8) Hibernation and restore should not be too slow
>>
>> We control the added code.   We are using full runtime drivers and 
>> will
>> run at hardware speeds.
>
> That may not be enough.  If you're going to save, say, 80% of RAM on a 
> 2 GB
> machine, then you'll have to be using image compression.

Yea, so?  We have a full kernel and userspace, adding compression 
before writing should be easy.  There is no struct page for memory in the
old kernel, so we likely need to be copying them in userspace anyways.  
Adding compression should be easy.
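
As a rough illustration of compressing in userspace before writing, here
is a sketch using Python's zlib module.  The 4096-byte page size and the
function names are assumptions made for the example; nothing here is an
existing hibernation tool.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for the sketch


def compress_image(pages, level=1):
    """Stream fixed-size pages through one zlib compressor.

    A low compression level is used on the assumption that write
    throughput matters more than ratio.
    """
    comp = zlib.compressobj(level)
    out = bytearray()
    for page in pages:
        if len(page) != PAGE_SIZE:
            raise ValueError("every page must be exactly PAGE_SIZE bytes")
        out += comp.compress(page)
    out += comp.flush()
    return bytes(out)


def decompress_image(blob):
    """Inverse of compress_image: return the list of original pages."""
    data = zlib.decompress(blob)
    return [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)]
```

Streaming page by page means the saver never needs a second full copy of
the image in memory.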

>>> (9) Hibernation framework should not be too difficult to set up
>>
>> Ok the current patch is presently too difficult.  But I think it will
>> be much simpler with a few small changes.
>>
>> As noted in  the thread
>>
>> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
>> Subject: [linux-pm] Re: hibernation/snapshot design
>> on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
>>>>  Both would work. One would eat 8-64MB of your RAM, permanently;
>>>
>>> As I have stated in other messages, the kdump approach would not 
>>> waste
>>> any RAM permanently.
>> ...
>>> Immediately before jumping to the new kernel, the first X bytes 
>>> (where
>>> X
>>> is the amount of memory the new kernel will get, typically 16MB or
>>> 64MB)
>>> of physical memory are backed up into the arbitrary discontiguous 
>>> pages
>>> that are made available.  This will not take very long, because 
>>> copying
>>> even 64MB of memory is extremely fast.  Then the new kernel is free 
>>> to
>>> use the first X bytes of contiguous physical memory.  Problem solved.
>>
>>
>> Ok, now let's look at my list again:
>>
>>> (1) how to interact with acpi to enter into S4.
>>
>> This was discussed.
>>
>>> (2) how to identify which memory needs to be saved
>>
>> We need to generate a list.  We need it to fit in a computable size
>> so
>> that we can free and allocate the pages before suspending IO in the
>> first kernel.
>>
>> One possibility is to use something like the kexec copy list.  If we
>> are imaging a small fraction of ram this is appropriate, but if we are
>> doing dense saves we need something extent based.  We should be able 
>> to
>> extend the list.
>>
>>> (3) how to communicate where to save the memory
>>
>> This is an interesting topic.  The suspended kernel has most IO and
>> disk
>> space.  It also knows how much space is to be occupied by the kernel.
>> So communicating a block map to the second kernel would be the obvious
>> choice.   But the second kernel must be able to find the image to
>> restore it, and it must have drivers for the media.  Also, this is not
>> feasible for storing to nfs.
>>
>> I think we will end up with several methods.
>>
>> One would be supply a list of blocks, and implement a file system that
>> reads the file by reading the scatter list from media.  The restore
>> kernel then only needs to read an anchor, and can build upon that 
>> until
>> the image is read into memory.  Or do this in userspace.
>>
>> I don't know how this compares to the current restore path.   I wasn't
>> able to identify the code that creates the on disk structure in my 10
>> minute perusal of kernel/power/.
>
> The structure is created at two levels.
>
> First, the code in snapshot.c makes the image available to the code in 
> swap.c
> as a stream of pages.  The first page is the header, followed by some 
> pages
> containing the PFNs of the page frames to which the image data pages 
> are to be
> restored, followed by the image data pages themselves (the ordering of 
> the PFNs
> must be the same as the ordering of data pages that correspond to 
> them).
> Still, the low-level image format only needs to be known by the 
> restore code in
> snapshot.c .

Ok sounds like this code could be reused.  I'll look into it.

> Second, the code in swap.c writes the image pages to a storage adding 
> some
> metadata making it possible to reproduce their original ordering 
> during the
> restore.

So you are allocating the blocks as you go ... and adding meta data 
along the way?

> The fact that we use swap spaces as the storage is related to 
> implementation
> simplicity rather than anything else.

Ok ... this only supports uncompressed hibernation?

The first kernel is going to specify (1) what to backup.  It can
specify (2) where to backup, although we have to be careful to identify
the device in a persistent way.

>> A second method will be to supply a device and file that will be
>> mounted by the save kernel, then unmounted and restored.  This would
>> require a partition that is not mounted or open by the suspended 
>> kernel
>> (or use nfs or a similar protocol that is designed for multiple client
>> concurrent access).
>>
>> A third method would be to allocate a file with the first kernel, and
>> make sure the blocks are flushed to disk.  The save and restore 
>> kernels
>> map the file system using a snapshot device.  Writing would map the
>> blocks and use the block offset to write to the real device using the
>> method from the first option; reading could be done directly from the
>> snapshot device.
>>
>> The first and third option are dead on log based file systems (where
>> the data is stored in the log).
>
> All in all, we have three different and working implementation of the
> image-writing and image-reading code at our disposal.  Why would you 
> want to
> break the open doors?

The problem I'm saying kexec solves is how to get the data to the
device while most of the kernel is trying not to do anything permanent.

If we can reuse existing code, great.

>>> (4) what state should devices be in when switching kernels
>>
>> My proposal is either initialized and untouched or quiesced.
>
> This is reasonable, but in general we also need to save some 
> information
> about the pre-hibernation state of devices, so that we can put them 
> into the
> same state, if reasonably possible, during the restore.

What state are you referring to?

Yes, there is state that the drivers have to store to ram, but this is the
same state they need to store when suspending to ram if the device can 
be powered off.

Maybe we need to teach drivers to store more state, like remember that 
a hard drive was spun down.

So we may need a flag saying "we powered off", "we resumed from 
suspend".
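
A sketch of how such a flag could drive the first kernel when it regains
control, in Python for illustration.  The reason names and the step
descriptions are my own shorthand for the cases discussed in this
thread, not an existing interface.

```python
from enum import Enum, auto


class ResumeReason(Enum):
    IMAGE_SAVED = auto()               # save kernel finished writing the image
    WOKEN_FROM_RAM = auto()            # slept in RAM with an image on disk
    RESTORED_AFTER_POWER_OFF = auto()  # image was read back after power-off


def next_steps(reason):
    """List what the first kernel does next for each resume reason."""
    if reason is ResumeReason.IMAGE_SAVED:
        # The image is safely on disk; now actually go to sleep.
        return ["put devices in low power state", "enter platform sleep state"]
    if reason is ResumeReason.WOKEN_FROM_RAM:
        # Memory is still current, so the on-disk image is now stale.
        return ["invalidate saved image", "resume devices", "continue"]
    if reason is ResumeReason.RESTORED_AFTER_POWER_OFF:
        return ["restore device state saved in ram", "continue"]
    raise ValueError(f"unknown resume reason: {reason!r}")
```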

>>> (5) the complicated setup required with the current patch
>>
>> I think a few simple changes to kjump will make this much simpler.  
>> See
>> below.
>>
>>> (6) what code restores the image
>>
>> The save kernel, loaded at boot.   People have suggested booting the
>> first kernel, and using current restore code.   However, I think that
>> ignores that (1) we saved from a different kernel, so the backed up
>> region will be restored to its backed up random pages,
>
> This problem has already been solved.
>
>> (2) the code was written to restore the same kernel,
>
> Not exactly.  In fact, the current implementation only relies on the 
> tiny
> portion of the restore code being in the same place in both kernels, 
> but
> we can change the code not to make this assumption (it'll be more 
> complicated,
> but that's perfectly doable).

If the save kernel is different from the run kernel (to make it
smaller), it's likely the image saving code will move.  I view restoring
from a different kernel than saving as an advanced feature.

Let's get resuming from the save kernel working first.

>> so the text and data will be replaced by identical text.  It's much
>> simpler
>> conceptually to use the same kernel to save and restore the image.
>
> Here I agree. :-)
>
>> Simplifying kjump: the proposal for v3.
>>
>> The current code is trying to use crash dump area as a safe, reserved
>> area to run the second kernel.   However, that means that the kernel
>> has to be linked specially to run in the reserved area.   I think we
>> need to finish separating kexec_jump from the other code paths.
>>
>> (1) add a new command line argument that specifies the kexec_jump
>> target area.
>>
>> (2) add a kjump flag to the flags parameter, used by kexec_load.   
>> When
>> loading a jump kernel, it is loaded like a normal kernel, however,
>> additional control pages are allocated to (a) save the kexec_jump
>> target area (b) save the backed up region that is used by all kernels
>> like crash dump, and (c) space for invoking relocate_new_kernel that
>> will get its args from the execution entry point and will restore the
>> kernel then call resume and suspend.
>>
>> (3) replace jump_huf_pfn with two command line addresses that specify
>> the (a) return point for after resume, and (b) the return point for
>> after image save.   Actually these can be done in userspace; the 
>> second
>> restore kernel can just specify the null copy list and the entry 
>> points
>> supplied by the suspended kernel.  To do resume we also need (c) where
>> to store resume address for the save kernel.
>>
>>
>> As a first stage of suspend and resume, we can save to dedicated
>> partitions all memory (as supplied to crash_dump) that is not marked
>> nosave and not part of the save kernel's image.
>
> A little problem here: there are "nosave" areas that are not marked as 
> nosave.

If crash_dump is going to work the memory must exist.

>> The fancy block lists and memory lists can be added later.
>
> On the majority of systems that will work.  On some of them it won't.

Ok .... well, my point is we can get started while we work out what the
list format is.   If we decide to reuse the pfn lists above that may
come quickly.
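
For the dense-save case, an extent-based list is a small step from a
per-page PFN list.  A sketch of the conversion, in Python for
illustration; the (start, count) tuple layout and the function names are
assumptions, not a proposed ABI.

```python
def pfns_to_extents(pfns):
    """Coalesce page frame numbers into sorted (start, count) extents."""
    extents = []
    for pfn in sorted(set(pfns)):
        if extents and extents[-1][0] + extents[-1][1] == pfn:
            # This page directly follows the last extent: grow it by one.
            start, count = extents[-1]
            extents[-1] = (start, count + 1)
        else:
            extents.append((pfn, 1))
    return extents


def extents_to_pfns(extents):
    """Expand (start, count) extents back into individual PFNs."""
    return [start + i for start, count in extents for i in range(count)]
```

A sparse save degrades to one extent per page, so the same format can
cover both cases at the cost of one extra word per entry.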

>> Making these changes will allow us to use a normal kernel invoked
>> with
>> acpi=off apm=off mem=xxk as the save and restore kernel.
>>
>> If we want to keep the second kernel booted, then we need to add a 
>> save
>> area for the booted jump target.   Note that the save and restore 
>> lists
>> to relocate_new_kernel can be computed once and saved.   Longer term 
>> we
>> could implement sys_kexec_load(UNLOAD) that would retrieve the saved
>> list back to application space to save to disk in a file.   This means
>> you could save the booted save kernel, it just couldn't have any 
>> shared
>> storage open.
>>
>> I'll try to expand on this in the jump v2 thread, but it may be 36+
>> hours before I do so.
>
> Well, I have no experience with kexec, so I really can't comment your
> kexec-related suggestions.
>
> Greetings,
> Rafael

Thanks,
milton



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 15:35                                 ` david
@ 2007-07-20 16:15                                   ` Alan Stern
  2007-07-20 21:46                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-20 16:15 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Milton Miller, linux-pm, LKML, Huang, Ying,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007 david@lang.hm wrote:

> or the userspace helper functions that setup the instructions for the 
> hibernate warn you if you are telling it to mount a filesystem that it 
> knows is ext3 and is in use by the system going to sleep.

One can argue that the ext3 implementation is inadequate.  We should be
able to give it a mount option requiring it to fail rather than play
back the journal and write to the disk.


> > What I've been trying to say from the very beginning is that the current
> > frameworks _support_ hibernation a la ACPI S4 (although that's not exactly
> > ACPI S4) and if we are going to introduce a new framework, then it should
> > be designed to _support_ ACPI S4 fully _from_ _the_ _start_.
> 
> here is where there is some disagreement (although it may just be 
> misunderstanding on the 'fully support' phrase)
> 
> it sounds like you are saying that the ACPI support requires a lot of work 
> (the phrase I've seen some people use is a requirement to 'fix all the 
> drivers'). we aren't wanting to have this work prevent the non-ACPI 
> hibernation from progressing.

You have completely misunderstood.  That phrase "fix all the drivers" 
has nothing whatsoever to do with ACPI.  It is a prerequisite for 
removing the freezer.

And unless I'm mistaken, removing the freezer was the main reason for 
doing all this kexec-style work in the first place.

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:08                             ` Milton Miller
@ 2007-07-20 16:20                               ` Alan Stern
  2007-07-20 17:32                                 ` Milton Miller
  2007-07-20 20:31                                 ` david
  2007-07-20 21:02                               ` Rafael J. Wysocki
  1 sibling, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-20 16:20 UTC (permalink / raw)
  To: Milton Miller
  Cc: Rafael J. Wysocki, Ying Huang, LKML, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Milton Miller wrote:

> > We can't do this unless we have frozen tasks (this way, or another) 
> > before
> > carrying out the entire operation.
> 
> What can't we do?   We've already worked with the drivers to quiesce the 
> hardware and put any information to resume the device in ram.  Now we 
> ask them to put their device in low power mode so we can go to sleep.  
> Even if we schedule, the only thing userspace could touch is memory.   

Userspace can submit I/O requests.  Someone will have to audit every 
driver to make sure that such I/O requests don't cause a quiesced 
device to become active.  If the device is active, it will make the 
memory snapshot inconsistent with the on-device data.

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 11:17                               ` Rafael J. Wysocki
  2007-07-20 15:35                                 ` david
@ 2007-07-20 16:56                                 ` Milton Miller
  2007-07-20 17:31                                   ` Jeremy Maitin-Shepard
                                                     ` (2 more replies)
  1 sibling, 3 replies; 220+ messages in thread
From: Milton Miller @ 2007-07-20 16:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ying Huang, Alan Stern, LKML, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:
> On Friday, 20 July 2007 01:07, david@lang.hm wrote:
>> On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
>>> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>>>> The currently identified problems under discussion include:
>>>> (1) how to interact with acpi to enter into S4.
>>>> (2) how to identify which memory needs to be saved
>>>> (3) how to communicate where to save the memory
>>>> (4) what state should devices be in when switching kernels
>>>> (5) the complicated setup required with the current patch
>>>> (6) what code restores the image
>>>
>>> (7) how to avoid corrupting filesystems mounted by the hibernated 
>>> kernel
>>
>> I didn't realize this was a discussion item. I thought the options
>> were clear: for some filesystem types you can mount them read-only, but
>> for ext3 (and possibly other less common ones) you just plain cannot
>> touch them.
>
> That's correct.  And since you cannot touch ext3, you need either to
> assume that you won't touch filesystems at all, or to have code to
> recognize the filesystem you're dealing with.

Or add a small bit of infrastructure that errors writes at make_request 
if you don't have a magic "i am a direct block device write from 
userspace" flag on the bio.

The hibernate may fail, but you don't corrupt the media.

If you don't get the image out, resume back to the "this is resume" 
instead of the power-down path.
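
A toy model of that write gate, in Python for illustration.  The Bio
class, its direct_userspace_write field and this make_request function
only mimic the shape of the idea; they are not the kernel's block layer
structures, and the flag itself is hypothetical.

```python
from dataclasses import dataclass


class WriteRejected(Exception):
    """Raised when a write arrives without the direct-write flag."""


@dataclass
class Bio:
    sector: int
    is_write: bool
    data: bytes = b""
    direct_userspace_write: bool = False  # hypothetical "magic" flag


def make_request(bio, media):
    """Apply one request to media (a dict of sector -> bytes).

    Reads always pass.  Writes pass only when tagged as a direct block
    device write from userspace, so a filesystem replaying a journal
    fails instead of modifying the media.
    """
    if bio.is_write:
        if not bio.direct_userspace_write:
            raise WriteRejected(f"untagged write to sector {bio.sector}")
        media[bio.sector] = bio.data
        return None
    return media.get(bio.sector, b"")
```

The failure is loud rather than silent, which is what lets the caller
abandon the save and take the "this is resume" path.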

>>>>> (2) Upon start-up (by which I mean what happens after the user has
>>>>>     pressed the power button or something like that):
>>>>>   * check if the image is present (and valid) _without_ enabling
>>>>>     ACPI (we don't do that now, but I see no reason for not doing it
>>>>>     in the new framework)
>>>>>   * if the image is present (and valid), load it
>>>>>   * turn on ACPI (unless already turned on by the BIOS, that is)
>>>>>   * execute the _BFS global control method
>>>>>   * execute the _WAK global control method
>>>>>   * continue
>>>>>   Here, the first two things should be done by the image-loading
>>>>>   kernel, but the remaining operations have to be carried out by the
>>>>>   restored kernel.
>>>>
>>>> Here I agree.
>>>>
>>>> Here is my proposal.  Instead of trying to both write the image and
>>>> suspend, I think this all becomes much simpler if we limit the scope
>>>> of the work of the second kernel.  Its purpose is to write the image.
>>>> After that it's done.   The platform can be powered off if we are
>>>> going to S5.   However, to support suspend to ram and suspend to disk,
>>>> we return to the first kernel.
>>>
>>> We can't do this unless we have frozen tasks (this way, or another) 
>>> before
>>> carrying out the entire operation.  In that case, however, the 
>>> kexec-based
>>> approach would have only one advantage over the current one.  
>>> Namely, it
>>> would allow us to create bigger images.
>>
>> we all agree that tasks cannot run during the suspend-to-ram state, 
>> but
>> the disagreement is over what this means
>>
>> at one extreme it could mean that you would need the full freezer as 
>> per
>> the current suspend projects.
>>
>> at the other extreme it could mean that all that's needed is to 
>> invoke the
>> suspend-to-ram routine before anything else on the suspended kernel 
>> on the
>> return from the save and restore kernel.
>>
>> we just need to figure out which it is (or if it's somewhere in 
>> between).
>
> Well, I think that the "invoke the suspend-to-ram routine before 
> anything else
> on the suspended kernel" thing won't be easy to implement in practice.

Why?  You don't expect suspend-to-ram in drivers to be implemented?  We 
need more separation of the quiesce drivers from power-down devices?

Note that we are just talking about "suspend devices and put their 
state in ram", not actually invoking the platform to suspend to ram.

And I'm actually saying we free memory and maybe allocate disk blocks 
for the save before we suspend (see below).

>>>> Message-ID: <200707151433.34625.rjw@sisk.pl>
>>>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>>>>> (1) Filesystems mounted before the hibernation are untouchable
>>>>
>>>> This is because some file systems do a fsck or other activity even 
>>>> when
>>>> mounted read only.  For the kexec case, however, this should be 
>>>> "file
>>>> systems mounted by the hibernated system must not be written".   As 
>>>> has
>>>> been mentioned in the past, we should be able to use something like 
>>>> dm
>>>> snapshot to allow fsck and the file system to see the cleaned copy
>>>> while not actually writing the media.
>>>
>>> We can't _require_ users to use the dm snapshot in order for the 
>>> hibernation
>>> to work, sorry.
>>>
>>> And by _reading_ from a filesystem you generally update metadata.
>>
>> not if the filesystem is mounted read-only (except on ext3)
>
> Well, if the filesystem in question is a journaling one and the 
> hibernated
> kernel has mounted this fs read-write, this seems to be tricky anyway.

Yes.  I would argue writing to existing blocks of a file (not through 
the filesystem, just getting their blocks from the file system) should 
be safe, but it occurs to me that may not be the case if your fsck and 
bmap move data blocks from some update log to the file system.

But we know the (maximum) image size.   So we could allocate the blocks 
in the first image before suspending the drivers and memory 
allocations, and supply the list to the second kernel.  We could 
even write to the first block with a signature "suspend to here", or 
even the whole block list to the beginning (it will have to be saved to 
disk for restore anyways).
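
A rough sketch of what handing over such a block list could look like.  The signature string, the header layout, and both function names are assumptions made up for this example; nothing here is an existing on-disk format.

```python
import struct

SIGNATURE = b"SUSPEND2HERE"        # assumed 12-byte marker, not a real format
HEADER = struct.Struct("<12sI")    # signature + number of block entries


def pack_block_list(blocks):
    """Serialize the preallocated block numbers behind the signature."""
    body = struct.pack(f"<{len(blocks)}Q", *blocks)
    return HEADER.pack(SIGNATURE, len(blocks)) + body


def unpack_block_list(data):
    """Return the block list, or None if the signature is not present."""
    if len(data) < HEADER.size:
        return None
    signature, count = HEADER.unpack_from(data)
    if signature != SIGNATURE:
        return None
    return list(struct.unpack_from(f"<{count}Q", data, HEADER.size))


blocks = [100, 101, 205]
first_block = pack_block_list(blocks)
```

The saving kernel would only need the unpack side: it reads the first block, checks the marker, and writes the image to the listed blocks without consulting the filesystem.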

>>>> The kjump kernel must not have any knowledge retained if we reuse 
>>>> it.
>>>>
>>>>> (2) Swap space in use before the hibernation must be handled with 
>>>>> care
>>>>
>>>> Yes.  Actually, even though they have been used by the write-in-the
>>>> kernel users, they will be among the most difficult devices to use 
>>>> for
>>>> snapshots by a userspace second kernel.

If we use the "write to these blocks" then this is as easy as writing 
to a file in a mounted filesystem.

>>>>> (4) The user should be able to limit the size of a hibernation 
>>>>> image
>>>>
>>>> This means the suspending kernel must arrange to reduce its active
>>>> memory.  The limited save can be done by providing a limited list in
>>>> (3).
>>>
>>> It seems to me that you don't understand the problem here.
>>>
>>> Assume you have 90% of RAM allocated before the hibernation and the 
>>> user has
>>> requested the image to be not greater than 50% of RAM.  In that case 
>>> you have
>>> to free some memory _before_ identifying memory to save and you must 
>>> not
>>> race with applications that attempt to allocate memory while you're 
>>> doing it.
>>
>> I disagree a little bit.
>>
>> first off, only the suspending kernel can know what can be freed and 
>> what
>> is needed to do so (remember this is kernel internals, it can change 
>> from
>> patch to patch, let alone version to version)
>>
>> second, if you have a lot of memory to free, and you can't just throw 
>> away
>> caches to do so, you don't know what is going to be involved in 
>> freeing
>> the memory, it's very possible that it is going to involve userspace, 
>> so
>> you can't freeze any significant portion of the system, so you can't
>> eliminate all chance of races
>>
>> what you can do is
>>
>> 1. try to free stuff
>> 2. stop the system and account for memory, is enough free
>> if not goto 1
>>
>> if userspace is dirtying memory fast enough, or is just using enough
>> memory that you can't meet your limit you just won't be able to 
>> suspend.
>
> This means unreliable hibernation for some workloads.  While I agree 
> that
> shouldn't be a problem in a common case, there are users who will 
> complain. ;-)

With my "allocate memory as a task and don't save that task's memory" 
approach, we can get to this point while userspace is running.   It 
could be controlled by userspace, or even be userspace 
(sys_do_not_save_me() waits for resume, and dies as the kernel 
resumes).
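
The free-then-check loop quoted above can be modelled roughly as follows.  This is only a sketch: `FakeSystem`, its page counts, and the attempt limit are invented for the example, and the "stop the system" part of step 2 is not modelled.

```python
def shrink_until_fits(free_some, used_pages, limit_pages, max_attempts=5):
    """Bounded retry: free memory, then re-check against the limit."""
    for _ in range(max_attempts):
        free_some()                      # step 1: try to free stuff
        if used_pages() <= limit_pages:  # step 2: account for memory
            return True
    return False                         # give up and tell the user


class FakeSystem:
    """Toy workload: each pass reclaims pages while userspace dirties more."""

    def __init__(self, used, reclaim, dirtied):
        self.used = used
        self.reclaim = reclaim
        self.dirtied = dirtied

    def free_some(self):
        self.used = max(0, self.used - self.reclaim) + self.dirtied


quiet = FakeSystem(used=900, reclaim=200, dirtied=0)
busy = FakeSystem(used=900, reclaim=200, dirtied=200)
quiet_ok = shrink_until_fits(quiet.free_some, lambda: quiet.used, 500)
busy_ok = shrink_until_fits(busy.free_some, lambda: busy.used, 500)
```

The quiet workload converges after a couple of passes, while the busy one, which dirties memory as fast as it is reclaimed, never meets the limit.  That is the unreliable case being argued about.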

>> but under any other conditions you will eventually get enough memory 
>> free.
>>
>> so try several times and if you still fail tell the user they have too
>> much stuff running and they need to kill something.
>
> Well, with the freezer that's much simpler (and more reliable, I'd 
> say): you
> freeze tasks and _then_ you shrink memory.

It means you are committed to suspend before you try to shrink memory.  
What happens when the user requested a smaller image than the memory in 
use?

>>>>> (8) Hibernation and restore should not be too slow
>>>>
>>>> We control the added code.   We are using full runtime drivers and 
>>>> will
>>>> run at hardware speeds.
>>>
>>> That may not be enough.  If you're going to save, say, 80% of RAM on 
>>> a 2 GB
>>> machine, then you'll have to be using image compression.
>>
>> this doesn't make sense, 20% of 2G is 400M, if you can't make a 
>> kernel and
>> userspace that can run in 400M you have a serious problem.
>
> I was talking about the _speed_ of writing and reading.

Yes.  As I said, adding a compression step as we copy the pages into the 
saving kernel for writeout should be easy.
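
To make the speed argument concrete, here is a back-of-the-envelope sketch.  Only the "80% of RAM on a 2 GB machine" figure comes from the thread; the disk throughput and the compression ratio are assumed values picked purely for illustration, and CPU time spent compressing is ignored.

```python
def write_seconds(image_mb, disk_mb_per_s, ratio=1.0):
    """Seconds to write an image that compresses to `ratio` of its size."""
    return image_mb * ratio / disk_mb_per_s


image_mb = 0.8 * 2048          # 80% of 2 GB, the case from the thread
ASSUMED_DISK_MB_PER_S = 40     # assumption, not a measurement
ASSUMED_RATIO = 0.5            # assumption: image compresses to half size

plain = write_seconds(image_mb, ASSUMED_DISK_MB_PER_S)
packed = write_seconds(image_mb, ASSUMED_DISK_MB_PER_S, ASSUMED_RATIO)
```

Under these assumptions the write time scales directly with the compression ratio, which is why compression matters for speed even when there is plenty of RAM to run the saving kernel in.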

>> even if you wanted to save 99% of RAM on a 2G system, you have 20M of 
>> ram
>> to play with, which should easily be enough.
>>
>> remember, linux runs on really small systems as well, and while you do
>> have to load some drivers for the big system, there are a lot of other
>> things that aren't needed.
>>
>>> All in all, we have three different and working implementation of the
>>> image-writing and image-reading code at our disposal.  Why would you 
>>> want to
>>> break the open doors?
>>
>> because you say that the current methods won't work without ACPI 
>> support.
>
> I didn't say that.  [Or if I did, please point me to this message.]
>
> Anyway, this wouldn't be true even if I did.
>
> What I've been trying to say from the very beginning is that the 
> current
> frameworks _support_ hibernation a la ACPI S4 (although that's not 
> exactly
> ACPI S4) and if we are going to introduce a new framework, then it 
> should
> be designed to _support_ ACPI S4 fully _from_ _the_ _start_.
>
> This DOESN'T mean that the non-ACPI hibernation should be unsupported 
> and
> it DOESN'T mean that the non-ACPI hibernation is not supported 
> currently.
> IT IS SUPPORTED.
>

As I said, I see kjump as a way to solve the "ok i am at a save point, 
now how do I write this image to media without allowing any other io" 
problem.  As you know by now, my solution for ACPI support is that after 
the image is written we go back to the kernel that started the suspend 
and it puts the machine in S4.

If this works, we get down to 1 hibernate implementation in the kernel 
:-).

milton



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:56                                 ` Milton Miller
@ 2007-07-20 17:31                                   ` Jeremy Maitin-Shepard
  2007-07-20 21:30                                     ` Rafael J. Wysocki
  2007-07-20 19:26                                   ` david
  2007-07-20 21:28                                   ` Rafael J. Wysocki
  2 siblings, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 17:31 UTC (permalink / raw)
  To: Milton Miller
  Cc: Rafael J. Wysocki, Ying Huang, Alan Stern, LKML, David Lang, linux-pm

Milton Miller <miltonm@bga.com> writes:

[snip]

>>>> (7) how to avoid corrupting filesystems mounted by the hibernated kernel
>>> 
>>> I didn't realize this was a discussion item. I thought the options were
>>> clear, for some filesystem types you can mount them read-only, but for
>>> ext3 (and possibly other less common ones) you just plain cannot touch
>>> them.
>> 
>> That's correct.  And since you cannot touch ext3, you need either to assume
>> that you won't touch filesystems at all, or to have a code to recognize the
>> filesystem you're dealing with.

> Or add a small bit of infrastructure that errors writes at make_request if you
> don't have a magic "i am a direct block device write from userspace" flag on the
> bio.

I still don't understand why there is this fixation on accessing dirty
filesystems in use by the hibernated system.  Even if you avoid
corrupting the filesystem by avoiding writing to the block device, there
isn't any real guarantee about the state of the data, except for a
filesystem that specifically makes guarantees about such data (and I
don't believe any of the existing ones do).

It isn't necessary to be able to access such filesystems: everything can
be done from an initramfs/initrd.

[snip]

-- 
Jeremy Maitin-Shepard


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:20                               ` Alan Stern
@ 2007-07-20 17:32                                 ` Milton Miller
  2007-07-20 18:17                                   ` Alan Stern
  2007-07-20 20:31                                 ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Milton Miller @ 2007-07-20 17:32 UTC (permalink / raw)
  To: Alan Stern
  Cc: Ying Huang, LKML, Rafael J. Wysocki, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:
> On Fri, 20 Jul 2007, Milton Miller wrote:
>>> We can't do this unless we have frozen tasks (this way, or another)
>>> before
>>> carrying out the entire operation.
>>
>> What can't we do?   We've already worked with the drivers to quiesce 
>> the
>> hardware and put any information to resume the device in ram.  Now we
>> ask them to put their device in low power mode so we can go to sleep.
>> Even if we schedule, the only thing userspace could touch is memory.
>
> Userspace can submit I/O requests.  Someone will have to audit every
> driver to make sure that such I/O requests don't cause a quiesced
> device to become active.  If the device is active, it will make the
> memory snapshot inconsistent with the on-device data.

If a driver is waking a device between the time it was told by 
hibernation "suspend all operations and save your device state to ram" 
and "resume your device" then it is a buggy driver.

I argue the process can make the io request after we write to disk, we 
just can't service it.  If we are suspended it will go to the request 
queue, and eventually the process will wait for normal throttling 
mechanisms until the driver is woken up.

It may mean the driver has to set a flag so that it knows it had an 
I/O request arrive while it was suspended and needs to wake the queue 
during its resume function.
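
A small model of that behaviour, under the assumption that the driver does have a queue to park requests on.  `QuiescedDriver` and its fields are hypothetical names for illustration, not an existing kernel interface.

```python
from collections import deque


class QuiescedDriver:
    """Toy driver that parks requests while suspended."""

    def __init__(self):
        self.suspended = False
        self.pending = deque()
        self.saw_request_while_suspended = False
        self.completed = []

    def submit(self, request):
        if self.suspended:
            # Do not touch the hardware; just remember work is waiting.
            self.pending.append(request)
            self.saw_request_while_suspended = True
            return
        self.completed.append(request)

    def suspend(self):
        self.suspended = True

    def resume(self):
        self.suspended = False
        if self.saw_request_while_suspended:
            # The flag tells resume() that the queue needs a kick.
            self.saw_request_while_suspended = False
            while self.pending:
                self.completed.append(self.pending.popleft())


driver = QuiescedDriver()
driver.submit("a")
driver.suspend()
driver.submit("b")
parked = list(driver.pending)
driver.resume()
```

The request submitted during suspend never reaches the device until resume() drains the queue, so the device state saved in RAM stays consistent with the hardware.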


Actually, my point was more "what kernel services do the drivers need 
to transition from quiesced to low power for acpi S4 or 
suspend-to-ram"?  We can't give them allocate-memory (but we give them 
a call "we are going to suspend" when they can), but does "run this 
tasklet" help?  What timer facilities are needed?

Do we need to differentiate init (por by bios) and resume from quiesced 
(for reboot, kexec start/resume)?  I hope not.

milton



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 17:32                                 ` Milton Miller
@ 2007-07-20 18:17                                   ` Alan Stern
  2007-07-20 19:08                                     ` Milton Miller
  2007-07-20 20:03                                     ` Oliver Neukum
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-20 18:17 UTC (permalink / raw)
  To: Milton Miller
  Cc: Ying Huang, LKML, Rafael J. Wysocki, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Milton Miller wrote:

> On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:
> > On Fri, 20 Jul 2007, Milton Miller wrote:
> >>> We can't do this unless we have frozen tasks (this way, or another)
> >>> before
> >>> carrying out the entire operation.
> >>
> >> What can't we do?   We've already worked with the drivers to quiesce 
> >> the
> >> hardware and put any information to resume the device in ram.  Now we
> >> ask them to put their device in low power mode so we can go to sleep.
> >> Even if we schedule, the only thing userspace could touch is memory.
> >
> > Userspace can submit I/O requests.  Someone will have to audit every
> > driver to make sure that such I/O requests don't cause a quiesced
> > device to become active.  If the device is active, it will make the
> > memory snapshot inconsistent with the on-device data.
> 
> If a driver is waking a device between the time it was told by 
> hibernation "suspend all operations and save your device state to ram" 
> and "resume your device" then it is a buggy driver.

That's exactly my point.  As far as I know nobody has done a survey,
but I bet you'd find _many_ drivers are buggy either in this way or the
converse (forcing an I/O request to fail immediately instead of waiting
until the suspend is over when it could succeed).  They have this bug 
because they were written -- those which include any suspend/resume 
support at all -- under the assumption that they could rely on the 
freezer.

And that's why Rafael said "We can't do this unless we have frozen
tasks (this way, or another) before carrying out the entire operation."  
Until the drivers are fixed -- which seems like a tremendous job --
none of this will work.

> I argue the process can make the io request after we write to disk, we 
> just can't service it.  If we are suspended it will go to the request 
> queue, and eventually the process will wait for normal throttling 
> mechanisms until the driver is woken up.

Many drivers don't have request queues.  Even for the ones that do, 
there are I/O pathways that bypass the queue (think of ioctl or sysfs).

> Actually, my point was more "what kernel services do the drivers need 
> to transition from quiesced to low power for acpi S4 or 
> suspend-to-ram"?  We can't give them allocate-memory (but we give them 
> a call "we are going to suspend" when they can), but does "run this 
> tasklet" help?  What timer facilities are needed?

Some drivers need the ability to schedule.  Some will need the ability 
to allocate memory (although GFP_ATOMIC is probably sufficient).  Some 
will need timers to run.

> Do we need to differentiate init (por by bios) and resume from quiesced 
> (for reboot, kexec start/resume)?  I hope not.

Yes we do.

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 18:17                                   ` Alan Stern
@ 2007-07-20 19:08                                     ` Milton Miller
  2007-07-20 19:37                                       ` Alan Stern
  2007-07-20 20:03                                     ` Oliver Neukum
  1 sibling, 1 reply; 220+ messages in thread
From: Milton Miller @ 2007-07-20 19:08 UTC (permalink / raw)
  To: Alan Stern
  Cc: Ying Huang, LKML, Rafael J. Wysocki, David Lang, linux-pm,
	Jeremy Maitin-Shepard


On Jul 20, 2007, at 1:17 PM, Alan Stern wrote:

> On Fri, 20 Jul 2007, Milton Miller wrote:
>
>> On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:
>>> On Fri, 20 Jul 2007, Milton Miller wrote:
>>>>> We can't do this unless we have frozen tasks (this way, or another)
>>>>> before
>>>>> carrying out the entire operation.
>>>>
>>>> What can't we do?   We've already worked with the drivers to quiesce
>>>> the
>>>> hardware and put any information to resume the device in ram.  Now 
>>>> we
>>>> ask them to put their device in low power mode so we can go to 
>>>> sleep.
>>>> Even if we schedule, the only thing userspace could touch is memory.
>>>
>>> Userspace can submit I/O requests.  Someone will have to audit every
>>> driver to make sure that such I/O requests don't cause a quiesced
>>> device to become active.  If the device is active, it will make the
>>> memory snapshot inconsistent with the on-device data.
>>
>> If a driver is waking a device between the time it was told by
>> hibernation "suspend all operations and save your device state to ram"
>> and "resume your device" then it is a buggy driver.
>
> That's exactly my point.  As far as I know nobody has done a survey,
> but I bet you'd find _many_ drivers are buggy either in this way or the
> converse (forcing an I/O request to fail immediately instead of waiting
> until the suspend is over when it could succeed).  They have this bug
> because they were written -- those which include any suspend/resume
> support at all -- under the assumption that they could rely on the
> freezer.
>
> And that's why Rafael said "We can't do this unless we have frozen
> tasks (this way, or another) before carrying out the entire operation."
> Until the drivers are fixed -- which seems like a tremendous job --
> none of this will work.

So this is in the way of removing the freezer ... but we are not 
relying on doing any io other than suspend device operation, save state 
to ram, then later put device in low power mode for s3 and/or s4, and 
finally restore and resume to running.

>> I argue the process can make the io request after we write to disk, we
>> just can't service it.  If we are suspended it will go to the request
>> queue, and eventually the process will wait for normal throttling
>> mechanisms until the driver is woken up.
>
> Many drivers don't have request queues.  Even for the ones that do,
> there are I/O pathways that bypass the queue (think of ioctl or sysfs).

So it's not a flag in make_request, fine.

>> Actually, my point was more "what kernel services do the drivers need
>> to transition from quiesced to low power for acpi S4 or
>> suspend-to-ram"?  We can't give them allocate-memory (but we give them
>> a call "we are going to suspend" when they can), but does "run this
>> tasklet" help?  What timer facilities are needed?
>
> Some drivers need the ability to schedule.  Some will need the ability
> to allocate memory (although GFP_ATOMIC is probably sufficient).  Some
> will need timers to run.

Can they allocate the memory in advance?  (Call them when we know we 
want to suspend, they make the allocations they will need; we later 
call them again to release the allocations).

If you need timers, you probably want some scheduling?

>> Do we need to differentiate init (por by bios) and resume from quiesced
>> (for reboot, kexec start/resume)?  I hope not.
>
> Yes we do.

Can you elaborate?   Note I was not asking about resume-from-low power vs 
init-from-por.  We still get that distinction.

How do these drivers work today when we kexec?

The reason I'm asking is it's hard to tell the first kernel what 
happened.  We can say "we powered off, and we were restarted", but it 
becomes much harder when each device may or may not have a driver in 
the save kernel if we have to differentiate for each device if it was 
initialized and later quiesced by the jump kernel during save or never 
touched.  And we need to tell the resume from hibernate code "i touched 
it" "no i didn't" and "we resumed from s4" "no it was from s5".

This is why I've been proposing that we don't create the suspend image 
with devices in the low power state, but only in a quiesced state 
similar to the initial state.


I'm proposing a sequence like:

(1) start allocating pinned memory to reduce saved image size
(2) allocate and map blocks to save maximum image (we know how much ram 
is not in 1, so the max size)
(3) tell drivers we are going to suspend.   userspace is still running, 
swapping still active, etc.  now is the time to allocate memory to save 
device state.
(4) do what we want to slow down userspace making requests (ie run 
freezer today)
(5) call drivers while still scheduling with interrupts "save 
opportunity".  From this point, any new request should be queued or 
the process put on a wait queue.
(6) suspend timers, turn off interrupts
(7) call drivers with interrupts off (final save)
(8) jump to other kernel to save the image
(9) call drivers to transition to low power
(10) finish operations to platform suspend on hibernate
(11) call drivers to resume, telling them if from suspend-to-ram or 
suspend-to-disk, possibly in two stages (interrupts off no scheduling 
and interrupts on scheduling allowed)
(12) unfreeze processes, kill the thread holding the extra memory 
used to reserve
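
The ordering of the twelve steps above can be written down as a small model.  The step names are shorthand invented for this sketch, and the failed-save branch (skip 9 and 10, go straight to resume) is one reading of the "from failed image save" case mentioned for step 11, not something spelled out in the list.

```python
from enum import IntEnum


class Step(IntEnum):
    PIN_MEMORY = 1
    ALLOCATE_IMAGE_BLOCKS = 2
    NOTIFY_DRIVERS = 3
    SLOW_USERSPACE = 4
    DRIVER_SAVE_WITH_SCHEDULING = 5
    STOP_TIMERS_AND_IRQS = 6
    DRIVER_FINAL_SAVE = 7
    JUMP_TO_SAVE_KERNEL = 8
    DRIVERS_LOW_POWER = 9
    PLATFORM_SUSPEND = 10
    DRIVERS_RESUME = 11
    UNFREEZE = 12


def plan(image_saved=True):
    """Return the steps in order; a failed save skips the power-down part."""
    steps = [s for s in Step if s <= Step.JUMP_TO_SAVE_KERNEL]
    if image_saved:
        steps += [Step.DRIVERS_LOW_POWER, Step.PLATFORM_SUSPEND]
    steps += [Step.DRIVERS_RESUME, Step.UNFREEZE]
    return steps


normal_path = plan()
failed_save_path = plan(image_saved=False)
```

Either way the drivers see the same resume and unfreeze steps at the end, which is what keeps the question about step 9 separate from the rest of the sequence.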

So I'm asking what needs to happen in 9.  If we have to turn interrupts 
on and schedule, that's ok.  If the low power state is the initial 
state then fine.

Note that in 11, we could further differentiate "from image restore in 
S4" and "from image restore in S5", and "from failed image save", but 
what needs to happen differently?


I'm guessing that the work that will take some time is separating the 
go to low power from quiesce operations for snapshot, as it sounds like 
this is done with one driver call today?  Making this separation will 
give us our driver audit :-), but only if we decide on the requirements 
before the start.

milton



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:56                                 ` Milton Miller
  2007-07-20 17:31                                   ` Jeremy Maitin-Shepard
@ 2007-07-20 19:26                                   ` david
  2007-07-20 21:28                                   ` Rafael J. Wysocki
  2 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-20 19:26 UTC (permalink / raw)
  To: Milton Miller
  Cc: Rafael J. Wysocki, Ying Huang, Alan Stern, LKML, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Milton Miller wrote:

> On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:
>>  On Friday, 20 July 2007 01:07, david@lang.hm wrote:
>> >  On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
>> > >  On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>> > > >  The currently identified problems under discussion include:
>> > > >  (1) how to interact with acpi to enter into S4.
>> > > >  (2) how to identify which memory needs to be saved
>> > > >  (3) how to communicate where to save the memory
>> > > >  (4) what state should devices be in when switching kernels
>> > > >  (5) the complicated setup required with the current patch
>> > > >  (6) what code restores the image
>> > > 
>> > >  (7) how to avoid corrupting filesystems mounted by the hibernated 
>> > >  kernel
>> > 
>> >  I didn't realize this was a discussion item. I thought the options were
>> >  clear, for some filesystem types you can mount them read-only, but for
>> >  ext3 (and possibly other less common ones) you just plain cannot touch
>> >  them.
>>
>>  That's correct.  And since you cannot touch ext3, you need either to
>>  assume
>>  that you won't touch filesystems at all, or to have a code to recognize
>>  the
>>  filesystem you're dealing with.
>
> Or add a small bit of infrastructure that errors writes at make_request if 
> you don't have a magic "i am a direct block device write from userspace" flag 
> on the bio.

the problem is that the filesystem code will replay the journal when you 
mount the partition, even if you mount it read-only (I seem to remember 
that you could avoid this if you put the entire block device into 
read-only mode, but that doesn't help in this case)

> The hibernate may fail, but you don't corrupt the media.
>
> If you don't get the image out, resume back to the "this is resume" instead 
> of the power-down path.
>
>> > > > >  (2) Upon start-up (by which I mean what happens after the user has
>> > > > >  pressed
>> > > > >      the power button or something like that):
>> > > > >    * check if the image is present (and valid) _without_ enabling 
>> > > > >  ACPI
>> > > > >  (we don't
>> > > > >      do that now, but I see no reason for not doing it in the new
>> > > > >  framework)
>> > > > >    * if the image is present (and valid), load it
>> > > > >    * turn on ACPI (unless already turned on by the BIOS, that is)
>> > > > >    * execute the _BFS global control method
>> > > > >    * execute the _WAK global control method
>> > > > >    * continue
>> > > > >    Here, the first two things should be done by the image-loading
>> > > > >  kernel, but
>> > > > >    the remaining operations have to be carried out by the restored
>> > > > >  kernel.
>> > > > 
>> > > >  Here I agree.
>> > > > 
>> > > >  Here is my proposal.  Instead of trying to both write the image and
>> > > >  suspend, I think this all becomes much simpler if we limit the scope
>> > > >  of the work of the second kernel.  Its purpose is to write the image.
>> > > >  After that it's done.   The platform can be powered off if we are 
>> > > >  going
>> > > >  to S5.   However, to support suspend to ram and suspend to disk, we
>> > > >  return to the first kernel.
>> > > 
>> > >  We can't do this unless we have frozen tasks (this way, or another) 
>> > >  before
>> > >  carrying out the entire operation.  In that case, however, the 
>> > >  kexec-based
>> > >  approach would have only one advantage over the current one.  Namely, 
>> > >  it
>> > >  would allow us to create bigger images.
>> > 
>> >  we all agree that tasks cannot run during the suspend-to-ram state, but
>> >  the disagreement is over what this means
>> > 
>> >  at one extreme it could mean that you would need the full freezer as per
>> >  the current suspend projects.
>> > 
>> >  at the other extreme it could mean that all that's needed is to invoke 
>> >  the
>> >  suspend-to-ram routine before anything else on the suspended kernel on 
>> >  the
>> >  return from the save and restore kernel.
>> > 
>> >  we just need to figure out which it is (or if it's somewhere in 
>> >  between).
>>
>>  Well, I think that the "invoke the suspend-to-ram routine before anything
>>  else
>>  on the suspended kernel" thing won't be easy to implement in practice.
>
> Why?  You don't expect suspend-to-ram in drivers to be implemented?  We need 
> more separation of the quiesce drivers from power-down devices?
>
> Note that we are just talking about "suspend devices and put their state in 
> ram", not actually invoking the platform to suspend to ram.

I thought we were talking about actually invoking the suspend-to-ram

>> > > >  Message-ID: <200707151433.34625.rjw@sisk.pl>
>> > > >  On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>> > > > >  (1) Filesystems mounted before the hibernation are untouchable
>> > > > 
>> > > >  This is because some file systems do a fsck or other activity even 
>> > > >  when
>> > > >  mounted read only.  For the kexec case, however, this should be 
>> > > >  "file
>> > > >  systems mounted by the hibernated system must not be written".   As 
>> > > >  has
>> > > >  been mentioned in the past, we should be able to use something like 
>> > > >  dm
>> > > >  snapshot to allow fsck and the file system to see the cleaned copy
>> > > >  while not actually writing the media.
>> > > 
>> > >  We can't _require_ users to use the dm snapshot in order for the 
>> > >  hibernation
>> > >  to work, sorry.
>> > > 
>> > >  And by _reading_ from a filesystem you generally update metadata.
>> > 
>> >  not if the filesystem is mounted read-only (except on ext3)
>>
>>  Well, if the filesystem in question is a journaling one and the hibernated
>>  kernel has mounted this fs read-write, this seems to be tricky anyway.
>
> Yes.  I would argue writing to existing blocks of a file (not through the 
> filesystem, just getting their blocks from the file system) should be safe, 
> but it occurs to me that may not be the case if your fsck and bmap move data 
> blocks from some update log to the file system.

right, and the answer is that the filesystem blocks allocated for the 
suspend image are not allowed to be accessed in any way from the main 
system.

this is a good argument for saving the data somewhere else ;-)

> But we know the (maximum) image size.   So we could allocate the blocks in 
> the first image before suspending the drivers and memory allocations, and 
> supplying the list to the second kernel.  We could even write to the first 
> block with a signature "suspend to here", or even the whole block list to the 
> beginning (it will have to be saved to disk for restore anyways).

no, you want to make the blocks that are allocated for the suspend image 
be like the blocks allocated to the journal, allocate them once and never 
touch them again

you especially do not want to try and write something to them from the 
main system just before suspending, you don't know enough about what it 
takes to get the data to the media to be absolutely sure that it's there 
when the save-and-restore kernel goes to look.

>> > > >  The kjump kernel must not have any knowledge retained if we reuse 
>> > > >  it.
>> > > > 
>> > > > >  (2) Swap space in use before the hibernation must be handled with 
>> > > > >  care
>> > > > 
>> > > >  Yes.  Actually, even though they have been used by the write-in-the
>> > > >  kernel users, they will be among the most difficult devices to use 
>> > > >  for
>> > > >  snapshots by a userspace second kernel.
>
> If we use the "write to these blocks" then this is as easy as writing to a 
> file in a mounted filesystem.

and keep in mind that "write to these blocks" can be done in userspace, it 
doesn't require the kernel to do this.

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 19:08                                     ` Milton Miller
@ 2007-07-20 19:37                                       ` Alan Stern
  0 siblings, 0 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-20 19:37 UTC (permalink / raw)
  To: Milton Miller
  Cc: Ying Huang, LKML, Rafael J. Wysocki, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Milton Miller wrote:

> > That's exactly my point.  As far as I know nobody has done a survey,
> > but I bet you'd find _many_ drivers are buggy either in this way or the
> > converse (forcing an I/O request to fail immediately instead of waiting
> > until the suspend is over when it could succeed).  They have this bug
> > because they were written -- those which include any suspend/resume
> > support at all -- under the assumption that they could rely on the
> > freezer.
> >
> > And that's why Rafael said "We can't do this unless we have frozen
> > tasks (this way, or another) before carrying out the entire operation."
> > Until the drivers are fixed -- which seems like a tremendous job --
> > none of this will work.
> 
> So this is in the way of removing the freezer ... but as we are not 
> relying on doing any io other than suspend device operation, save state 
> to ram, then later put device in low power mode for s3 and/or s4, and 
> finally restore and resume to running.

We aren't relying on doing any other I/O... and we have to prevent any
other I/O from taking place.  That's the hard part.

> > Some drivers need the ability to schedule.  Some will need the ability
> > to allocate memory (although GFP_ATOMIC is probably sufficient).  Some
> > will need timers to run.
> 
> Can they allocate the memory in advance?  (Call them when we know we 
> want to suspend, they make the allocations they will need; we later 
> call them again to release the allocations).

Some yes, some no.  The ones that can't generally don't need very much.

> If you need timers, you probably want some scheduling?

Yes, scheduling was one of the items I listed above.

> >> Do we need to differentiate init (por by bios) and resume from quiesced
> >> (for reboot, kexec start/resume)?  I hope not.
> >
> > Yes we do.
> 
> Can you elaborate?  Note I was not asking about resume-from-low-power vs 
> init-from-por.  We still get that distinction.

To be more precise, drivers need to know whether they are doing a
complete initialization, a resume from low-power, or a resume from
hibernate.  Currently there's no way to distinguish the last two (they
both involve calling the resume() method), but that's going to change.  
The first can be told apart because it involves probe() rather than 
resume().
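
The three cases above can be modelled outside the kernel.  What follows is
only a toy sketch in Python, not kernel code: ToyDriver, WakeReason and wake()
are made-up names, and what the driver does in each branch is an assumption
chosen for illustration.  It only shows the dispatch, with complete
initialization going through probe() and the two kinds of resume sharing
resume() but being told apart by an argument.

```python
from enum import Enum, auto


class WakeReason(Enum):
    COLD_INIT = auto()         # complete initialization, reached via probe()
    RESUME_SUSPEND = auto()    # resume from a low-power state
    RESUME_HIBERNATE = auto()  # resume after a hibernation image was restored


class ToyDriver:
    """Hypothetical driver model; it only records what it would do."""

    def __init__(self):
        self.log = []

    def probe(self):
        self.log.append("probe: full hardware init")

    def resume(self, reason):
        # Assumed behaviour: after hibernation the device may have been
        # touched by another kernel, so it is reset before state is reloaded.
        if reason is WakeReason.RESUME_SUSPEND:
            self.log.append("resume: reload saved state")
        elif reason is WakeReason.RESUME_HIBERNATE:
            self.log.append("resume: reset device, reload saved state")
        else:
            raise ValueError("resume() called for a cold init")


def wake(driver, reason):
    """Route a wake-up to probe() or resume() depending on the reason."""
    if reason is WakeReason.COLD_INIT:
        driver.probe()
    else:
        driver.resume(reason)
```

For example, wake(ToyDriver(), WakeReason.RESUME_HIBERNATE) takes the
resume() path and records the reset-then-reload step.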

> How do these drivers work today when we kexec?

We don't kexec during a resume from hibernation.  When kexec does run, 
drivers in the new kernel do a complete reinitialization.

> The reason I'm asking is it's hard to tell the first kernel what 
> happened.  We can say "we powered off, and we were restarted", but it 
> becomes much harder when each device may or may not have a driver in 
> the save kernel if we have to differentiate for each device if it was 
> initialized and later quiesced by the jump kernel during save or never 
> touched.  And we need to tell the resume from hibernate code "I touched 
> it" "no I didn't" and "we resumed from s4" "no it was from s5".

You merely have to distinguish between suspend and hibernate.

> I'm guessing that the work that will take some time is separating the 
> go to low power from quiesce operations for snapshot, as it sounds like 
> this is done with one driver call today?

There's a single call with different arguments.  How much do you know
about the way the Power Management core actually works now?  Have you 
read the files in Documentation/power?

>  Making this separation will 
> give us our driver audit :-), but only if we decide on the requirements 
> before the start.

No it won't, although it will be a good start.

Alan Stern


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 18:17                                   ` Alan Stern
  2007-07-20 19:08                                     ` Milton Miller
@ 2007-07-20 20:03                                     ` Oliver Neukum
  2007-07-20 20:12                                       ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: Oliver Neukum @ 2007-07-20 20:03 UTC (permalink / raw)
  To: Alan Stern
  Cc: Milton Miller, Ying Huang, LKML, Rafael J. Wysocki, David Lang,
	linux-pm, Jeremy Maitin-Shepard

Am Freitag 20 Juli 2007 schrieb Alan Stern:
> Some drivers need the ability to schedule.  Some will need the ability 
> to allocate memory (although GFP_ATOMIC is probably sufficient).  Some 
> will need timers to run.

Some will have to request firmware. It can add up to some megabytes.
In addition, if we don't freeze, some drivers, eg. video drivers, can
do allocations in the megabyte range.

It seems to me that without the freezer we will end up with many drivers
needing a two step notification process. Furthermore there are requirements
on the order of shutting down system facilities, eg. device addition must
be stopped before drivers allocate firmware.

	Regards
		Oliver


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 20:03                                     ` Oliver Neukum
@ 2007-07-20 20:12                                       ` Alan Stern
  2007-07-20 21:35                                         ` Oliver Neukum
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-20 20:12 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Milton Miller, Ying Huang, LKML, Rafael J. Wysocki, David Lang,
	linux-pm, Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Oliver Neukum wrote:

> Am Freitag 20 Juli 2007 schrieb Alan Stern:
> > Some drivers need the ability to schedule.  Some will need the ability 
> > to allocate memory (although GFP_ATOMIC is probably sufficient).  Some 
> > will need timers to run.
> 
> Some will have to request firmware. It can add up to some megabytes.
> In addition, if we don't freeze, some drivers, eg. video drivers, can
> do allocations in the megabyte range.
> 
> It seems to me that without the freezer we will end up with many drivers
> needing a two step notification process. Furthermore there are requirements
> on the order of shutting down system facilities, eg. device addition must
> be stopped before drivers allocate firmware.

These are really separate issues, since they refer to things that have 
to happen well before the memory snapshot is captured.

We already have a pre-suspend notification available for drivers that 
need to allocate large amounts of memory.

You are correct about the need to delay/stop device addition.  I don't
know how this can be done in general; each code path calling
device_add() may have to be treated individually.

Alan Stern


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:20                               ` Alan Stern
  2007-07-20 17:32                                 ` Milton Miller
@ 2007-07-20 20:31                                 ` david
  2007-07-20 21:24                                   ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-20 20:31 UTC (permalink / raw)
  To: Alan Stern
  Cc: Milton Miller, Rafael J. Wysocki, Ying Huang, LKML, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Alan Stern wrote:

> On Fri, 20 Jul 2007, Milton Miller wrote:
>
>>> We can't do this unless we have frozen tasks (this way, or another)
>>> before
>>> carrying out the entire operation.
>>
>> What can't we do?   We've already worked with the drivers to quiesce the
>> hardware and put any information to resume the device in ram.  Now we
>> ask them to put their device in low power mode so we can go to sleep.
>> Even if we schedule, the only thing userspace could touch is memory.
>
> Userspace can submit I/O requests.  Someone will have to audit every
> driver to make sure that such I/O requests don't cause a quiesced
> device to become active.  If the device is active, it will make the
> memory snapshot inconsistent with the on-device data.

Assuming this is the suspend-to-ram after a kexec back from the 
write-to-disk kernel, I don't think you are correct.

When doing a suspend-to-ram you get to a point where you just don't use 
any userspace.  From that point on you are just walking the device tree 
putting things into low-power mode.  This is the point we are talking 
about jumping to.

david Lang

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:08                             ` Milton Miller
  2007-07-20 16:20                               ` Alan Stern
@ 2007-07-20 21:02                               ` Rafael J. Wysocki
  2007-07-21 11:44                                 ` Miklos Szeredi
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:02 UTC (permalink / raw)
  To: Milton Miller
  Cc: Alan Stern, Ying Huang, LKML, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Friday, 20 July 2007 18:08, Milton Miller wrote:
> On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:
> > On Thursday, 19 July 2007 17:46, Milton Miller wrote:
> >> The currently identified problems under discussion include:
> >> (1) how to interact with acpi to enter into S4.
> >> (2) how to identify which memory needs to be saved
> >> (3) how to communicate where to save the memory
> >> (4) what state should devices be in when switching kernels
> >> (5) the complicated setup required with the current patch
> >> (6) what code restores the image
> >
> > (7) how to avoid corrupting filesystems mounted by the hibernated 
> > kernel
> >
> 
> Ok I talked on this too.
> 
> >> I'll now start with quotes from several articles in this thread and my
> >> responses.
> >>
> >> Message-ID: <200707172217.01890.rjw@sisk.pl>
> >> On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:
> >>> (1) Upon entering the sleep state, which IMO can be done _after_ the
> >>> image
> >>>     has been saved:
> >>>   * figure out which devices can wake up
> >>>   * put devices into low power states (wake-up devices are placed in
> >>> the Dx
> >>>     states compatible with the wake capability, the others are 
> >>> powered
> >>> off)
> >>>   * execute the _PTS global control method
> >>>   * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
> >>>   * execute the _GTS global control method
> >>>   * set the GPE enable registers corresponding to the wake-up 
> >>> devices)
> >>>   * make the platform enter S4 (there's a well defined procedure for
> >>> that)
> >>>   I think that this should be done by the image-saving kernel.
> >>
> >> Message-ID: <87odiag45q.fsf@jbms.ath.cx>
> >> On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
> >> expressed his agreement with this block but also confusion on the 
> >> other
> >> blocks.
> >>
> >>
> >> I strongly disagree.
> >>
> >> (1) as has been pointed out, this requires the new kernel to 
> >> understand
> >> all io devices in the first kernel.
> >> (2) it requires both kernels to talk to ACPI.   This is doomed to
> >> failure.  How can the second kernel initialize ACPI?   The platform
> >> thinks it has already been initialized.  Do we plan to always undo all
> >> acpi initialization?
> >
> > Good question.  I don't know.
> 
> 
> >>> (2) Upon start-up (by which I mean what happens after the user has
> >>> pressed
> >>>     the power button or something like that):
> >>>   * check if the image is present (and valid) _without_ enabling ACPI
> >>> (we don't
> >>>     do that now, but I see no reason for not doing it in the new
> >>> framework)
> >>>   * if the image is present (and valid), load it
> >>>   * turn on ACPI (unless already turned on by the BIOS, that is)
> >>>   * execute the _BFS global control method
> >>>   * execute the _WAK global control method
> >>>   * continue
> >>>   Here, the first two things should be done by the image-loading
> >>> kernel, but
> >>>   the remaining operations have to be carried out by the restored
> >>> kernel.
> >>
> >> Here I agree.
> >>
> >> Here is my proposal.  Instead of trying to both write the image and
> >> suspend, I think this all becomes much simpler if we limit the scope
> >> of the work of the second kernel.  Its purpose is to write the image.
> >> After that it's done.   The platform can be powered off if we are going
> >> to S5.   However, to support suspend to ram and suspend to disk, we
> >> return to the first kernel.
> >
> > We can't do this unless we have frozen tasks (this way, or another) 
> > before
> > carrying out the entire operation.
> 
> What can't we do?   We've already worked with the drivers to quiesce the 
> hardware and put any information to resume the device in ram.  Now we 
> ask them to put their device in low power mode so we can go to sleep.  

For that to work, we have to require the image-saving kernel to leave devices
in the same state they were in when it got control, or in a state compatible
with that one.

> Even if we schedule, the only thing userspace could touch is memory.   
> If we resume, they just run those computations again.
> 
> > In that case, however, the kexec-based
> > approach would have only one advantage over the current one.  Namely, 
> > it
> > would allow us to create bigger images.
> 
> The advantage is we don't have to come up with a way to teach drivers 
> "wake up to run these requests, but no other requests".  We don't have 
> to figure out what we need to resume to allow them to process a 
> request.

I'm not sure what you mean here.  Please explain.

> >> This means that the first kernel will need to know why it got resumed.
> >> Was the system powered off, and this is the resume from the user?   Or
> >> was it restarted because the image has been saved, and it's now time to
> >> actually suspend until woken up?  If you look at it, this is the same
> >> interface we have with the magic arch_suspend hook -- did we just
> >> suspend and it's time to write the image, or did we just resume and it's
> >> time to wake everything up.
> >>
> >> I think this can be easily solved by giving the image saving kernel 
> >> two
> >> resume points: one for the image has been written, and one for we
> >> rebooted and have restored the image.  I'm not familiar with ACPI.
> >> Perhaps we need a third to differentiate we read the image from S4
> >> instead of from S5, but that information must be available to the OS
> >> because it needs that to know if it should resume from hibernate.
> >>
> >> By making the split at image save and restore we have several
> >> advantages:
> >>
> >> (1) the kernel always initializes with devices in the init or quiesced
> >> but active state.
> >>
> >> (2) the kernel always resumes with devices in the init or quiesced but
> >> active state.
> >>
> >> (3) the kjump save and restore kernel does not need to know how to
> >> suspend all devices in the platform.
> >>
> >> (4) we have a merged path for suspend to disk, suspend to ram, and
> >> suspend to both.
> >>
> >> (5) because of (4), we can implement sleep policies where we save the
> >> image to disk but try to stay in ram based on expected remaining
> >> battery life.
> >>
> >> (6) we confine all platform (acpi) interaction to the main kernel
> >>
> >> (7) we limit the knowledge needed in the second kernel.   It needs to
> >> know how to do its job and then put the hardware back how it found it.
> >> Nothing more.
> >
> > This would have been nice if we had been able to do it.
> 
> I don't understand this comment.   "if we had been able"?  I don't 
> think we have tried yet.

That's related to the discussion above.  If we are unable to do (3) and (6)
without freezing tasks, and I'm not convinced that we are able to, the entire
scheme won't be viable.

Well, we might be able to do it provided that drivers will block the tasks
on I/O effectively, but I see a big 'if' here ...

> >> For the suspend to ram and then woken up case, we simply need to
> >> invalidate the image before restarting normal kernel operation.
> >>
> >> People have worried about how to boot and restore the kernel, and what
> >> to do if reading the image fails.   They worry about needing memory
> >> hotplug or delayed acpi parsing.  They are forgetting one thing.  This
> >> kernel has support for kexec.
> >>
> >> This is all easily solved by having the bootloader from the bios 
> >> always
> >> boot the restore kernel.
> >
> > Well, I think this is not generally acceptable, although I agree that 
> > it would
> > be simpler.
> 
> For those that don't find it acceptable they can teach their bootloader 
> when they may have a image to resume.

Yes, and I think we need to seriously consider this possibility.

> >> It will boot with limited useable memory and
> >> no acpi support.  If the restore kernel userspace detects that there 
> >> is
> >> no restore image, it simply loads the normal main kernel and initrd /
> >> initramfs and calls the normal kexec.  The cost is the time to init 
> >> the
> >> restore kernel, read the kernel with full drivers (vs reading it from
> >> the bootloader).  If you want a boot menu, use kboot (on sourceforge).
> >
> > Well, I'm afraid of adding more and more infrastructure to the mix.
> 
> Requiring the hibernated kernel to be able to start from kexec should 
> not be bad.   If you were referring to adding kboot, that is just an 
> option.

Yes, I was.
 
> One can still use bootloaders menus to select alternate kernels.   
> However, as you said, you want to boot differently for resume (no acpi 
> until after image loaded) from full boot.

That's correct and I think some kind of cooperation with the bootloader is
needed for that.

> >> On Jul 17, 2007, at 2:13 PM, Rafael J. Wysocki wrote:
> >>> On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
> >>>> On Tue, 17 Jul 2007, Alan Stern wrote:
> >>>>> But what about the freezer?  The original reason for using kexec 
> >>>>> was
> >>>>> to
> >>>>> avoid the need for the freezer.  With no freezer, while the 
> >>>>> original
> >>>>> kernel is busy powering down its devices, user tasks will be free 
> >>>>> to
> >>>>> carry out I/O -- which will make the memory snapshot inconsistent
> >>>>> with
> >>>>> the on-disk data structures.
> >>>>
> >>>> no, user tasks just don't get scheduled during shutdown.
> >>>>
> >>>> the big problem with the freezer isn't stopping anything from
> >>>> happening,
> >>>> it's _selectivly_ stopping things.
> >>
> >> Agreed.   Or rather, selectively not stopping and resuming things.
> >
> > I don't quite understand this statement.  Can you please elaborate?
> 
> Feel free to list other problems with the freezer, but I'm saying that 
> the problems are stemming from trying to freeze most of userspace and 
> some selection of kernel threads so that new requests to the outside 
> are not made, but then turning around and saying "ok now do some io, 
> but only what this thread of execution originates".

We're _not_ doing anything like this.

> It's "originates", not "generates", so we are trying to teach the whole stack
> these limits, including going back to userspace for FUSE.

Again, I don't understand what you're talking about.  This is not how things
work right now, that's for sure. :-)

The problem with FUSE is related to the fact that the freezer can't freeze
uninterruptible tasks and we said that perhaps we might avoid it if FUSE
was made freezing-aware.  Still, no one has gone in this direction and I don't
know of any plans to do that.

Please, stop trying to blame the freezer for all evil.

Also, it's better if you know how the things that you want to improve really
work.

> >>> It's selectively stopping kernel threads, which is just about right.
> >>> If you think
> >>> that _this_ is a main problem with the freezer, then think again.
> >>>
> >>>> with kexec you don't need to let any portion of the original kernel
> >>>> or userspace operate so you don't have a problem.
> >>>
> >>> In fact, the main problem with the freezer is that it is a
> >>> coarse-grained
> >>> solution.  Therefore, what I believe we should do is to evolve in the
> >>> direction
> >>> of more fine-grained solutions and gradually phase out the freezer.
> >>>
> >>> The kexec-based approach is an attempt to replace one coarse-grained
> >>> solution
> >>> (the freezer) with even more coarse-grained solution (stopping the
> >>> entire
> >>> kernel with everything), which IMO doesn't address the main problem.
> >>>
> >>
> >> I think this addresses the problem.   It's probably a bit harder than
> >> powermac because we have to fully quiesce devices; we can't cheat by
> >> leaving interrupts off.   But once the drivers save the state of their
> >> devices and stop their queues, it should be easy to audit the paths to
> >> powerdown devices and call the platform suspend and ram wakeup paths.
> 
> In other words, I'm replacing a coarse-grained solution with an 
> absolute solution.  "From this point on you can only write to ram."

Which means that we need to take care of the drivers _before_ doing anything
else.

I agree with that, of course. :-)

> >> Going back to the requirements document that started this thread:
> >>
> >> Message-ID: <200707151433.34625.rjw@sisk.pl>
> >> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> >>> (1) Filesystems mounted before the hibernation are untouchable
> >>
> >> This is because some file systems do a fsck or other activity even 
> >> when
> >> mounted read only.  For the kexec case, however, this should be "file
> >> systems mounted by the hibernated system must not be written".   As 
> >> has
> >> been mentioned in the past, we should be able to use something like dm
> >> snapshot to allow fsck and the file system to see the cleaned copy
> >> while not actually writing the media.
> >
> > We can't _require_ users to use the dm snapshot in order for the 
> > hibernation
> > to work, sorry.
> 
> I actually listed three ways to start.  Not all of them required 
> dm-snapshot.  I was proposing "if you need to read ext3, then use 
> dm-snapshot".

I don't think we should differentiate filesystems this way.  We should just do
the same thing with all of them.

> > And by _reading_ from a filesystem you generally update metadata.
> 
> not on ones mounted read-only.   I'll reply more later in the thread.

OK

> >> The kjump kernel must not have any knowledge retained if we reuse it.
> >>
> >>> (2) Swap space in use before the hibernation must be handled with 
> >>> care
> >>
> >> Yes.  Actually, even though they have been used by the write-in-the
> >> kernel users, they will be among the most difficult devices to use for
> >> snapshots by a userspace second kernel.
> >>
> >>> (3) There are memory regions that must not be saved or restored
> >>
> >> because they may not exist.   This means that we must identify the
> >> memory to be saved and restored in a format to be passed between the
> >> kernel.
> >>
> >>> (4) The user should be able to limit the size of a hibernation image
> >>
> >> This means the suspending kernel must arrange to reduce its active
> >> memory.  The limited save can be done by providing a limited list in
> >> (3).
> >
> > It seems to me that you don't understand the problem here.
> >
> > Assume you have 90% of RAM allocated before the hibernation and the 
> > user has
> > requested the image to be not greater than 50% of RAM.  In that case 
> > you have
> > to free some memory _before_ identifying memory to save and you must 
> > not
> > race with applications that attempt to allocate memory while you're 
> > doing it.
> 
> Hmm... I didn't say how to reduce the memory or identify it, did I?
> 
> Ok fine.   I'll allocate a bunch of memory and put it on a list.  

And you cause the OOM killer to show up.  Not good.
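
To put rough numbers on the 90%/50% example above, here is a small Python
calculation.  It is only a sketch under simplifying assumptions that are not
from this thread: 4 KiB pages, and every allocated page would otherwise end up
in the image.  The function name pages_to_free is made up.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_to_free(total_pages, used_pct, limit_pct):
    """Pages that must be freed before the snapshot so the image fits the limit.

    Integer percentages are used so the result does not depend on
    floating-point rounding.
    """
    used = total_pages * used_pct // 100
    limit = total_pages * limit_pct // 100
    return max(0, used - limit)


# 2 GiB of RAM, 90% allocated, image limited to 50% of RAM
total = (2 * 1024 ** 3) // PAGE_SIZE
need = pages_to_free(total, 90, 50)
```

The arithmetic says nothing about how those pages get freed; the point of the
example above is that the freeing has to finish before the memory to save is
identified, and must not race with tasks that are still allocating.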

> Normal memory pressure will swap things out or drop filesystem pages.   
> When I build the list of memory to backup, I filter out this list.  
> After resume, I'll free it back.
> 
> We can arrange for this "task" to be preferred by the oom killer, in 
> case the user is trying to suspend into less memory than can be 
> freed.
> 
> >>> (5) Hibernation should be transparent from the applications' point of
> >>> view
> >>
> >> People have pointed out they may want userspace to be aware of the
> >> suspend.   I believe this can be done with /proc/apm emulation today
> >> or by other means; it seems that should be hooked up to dbus in some
> >> fashion.
> >
> > Not a solution, because there still will be programs not needing to 
> > know
> > anything about hibernation.  After all, we don't require all 
> > applications to
> > know anything about SMP, even if they are executed on an SMP system.
> 
> How do any of those methods require userspace to know anything about 
> hibernation?   I was talking about a general framework consistent with 
> today's kernel to user communication for those parts of userspace that 
> *want* to know about suspend and hibernation.

OK, I didn't understand, then.

> >>> (6) State of devices from before hibernation should be restored, if
> >>> possible
> >>
> >> related to suspend should be transparent ... yes.
> >>
> >>> (7) On ACPI systems special platform-related actions have to be
> >>> carried out at
> >>>     the right points, so that the platform works correctly after the
> >>> restore
> >>
> >> I believe I have explained my suggestion.
> >>
> >>> (8) Hibernation and restore should not be too slow
> >>
> >> We control the added code.   We are using full runtime drivers and 
> >> will
> >> run at hardware speeds.
> >
> > That may not be enough.  If you're going to save, say, 80% of RAM on a 
> > 2 GB
> > machine, then you'll have to be using image compression.
> 
> Yea, so?  We have a full kernel and userspace, adding compression 
> before writing should be easy.  The is no struct page for memory in the 
> old kernel, so we likely need to be copying them in userspace anyways.  
> Adding compression should be easy.

Yes, it's not that difficult.
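
As an illustration of per-page compression done from userspace, a minimal
Python sketch follows.  zlib is used here only because it ships with Python;
this is not a statement about which algorithm any existing hibernation code
uses, and compress_pages/decompress_pages are hypothetical helper names.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


def compress_pages(pages, level=1):
    """Compress each page independently; a low level favours speed."""
    blobs = []
    for page in pages:
        if len(page) != PAGE_SIZE:
            raise ValueError("every page must be exactly PAGE_SIZE bytes")
        blobs.append(zlib.compress(page, level))
    return blobs


def decompress_pages(blobs):
    """Inverse of compress_pages(); the page order is preserved."""
    return [zlib.decompress(blob) for blob in blobs]
```

Compressing page by page keeps the restore side simple, because each page can
be decompressed without reference to any other one, at the cost of a worse
ratio than compressing larger chunks.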

> >>> (9) Hibernation framework should not be too difficult to set up
> >>
> >> Ok the current patch is presently too difficult.  But I think it will
> >> be much simpler with a few small changes.
> >>
> >> As noted in  the thread
> >>
> >> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
> >> Subject: [linux-pm] Re: hibernation/snapshot design
> >> on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
> >>>>  Both would work. One would eat 8-64MB of your RAM, permanently;
> >>>
> >>> As I have stated in other messages, the kdump approach would not 
> >>> waste
> >>> any RAM permanently.
> >> ...
> >>> Immediately before jumping to the new kernel, the first X bytes 
> >>> (where
> >>> X
> >>> is the amount of memory the new kernel will get, typically 16MB or
> >>> 64MB)
> >>> of physical memory are backed up into the arbitrary discontiguous 
> >>> pages
> >>> that are made available.  This will not take very long, because 
> >>> copying
> >>> even 64MB of memory is extremely fast.  Then the new kernel is free 
> >>> to
> >>> use the first X bytes of contiguous physical memory.  Problem solved.
> >>
> >>
> >> Ok, now let's look at my list again:
> >>
> >>> (1) how to interact with acpi to enter into S4.
> >>
> >> This was discussed.
> >>
> >>> (2) how to identify which memory needs to be saved
> >>
> >> We need to generate a list.  We need it to fit in a computable size 
> >> so
> >> that we can free and allocate the pages before suspending IO in the
> >> first kernel.
> >>
> >> One possibility is to use something like the kexec copy list.  If we
> >> are imaging a small fraction of ram this is appropriate, but if we are
> >> doing dense saves we need something extent based.  We should be able 
> >> to
> >> extend the list.
> >>
> >>> (3) how to communicate where to save the memory
> >>
> >> This is an interesting topic.  The suspended kernel has most IO and 
> >> disk
> >> space.  It also knows how much space is to be occupied by the kernel.
> >> So communicating a block map to the second kernel would be the obvious
> >> choice.   But the second kernel must be able to find the image to
> >> restore it, and it must have drivers for the media.  Also, this is not
> >> feasible for storing to nfs.
> >>
> >> I think we will end up with several methods.
> >>
> >> One would be supply a list of blocks, and implement a file system that
> >> reads the file by reading the scatter list from media.  The restore
> >> kernel then only needs to read an anchor, and can build upon that 
> >> until
> >> the image is read into memory.  Or do this in userspace.
> >>
> >> I don't know how this compares to the current restore path.   I wasn't
> >> able to identify the code that creates the on disk structure in my 10
> >> minute perusal of kernel/power/.
> >
> > The structure is created at two levels.
> >
> > First, the code in snapshot.c makes the image available to the code in 
> > swap.c
> > as a stream of pages.  The first page is the header, followed by some 
> > pages
> > containing the PFNs of the page frames to which the image data pages 
> > are to be
> > restored, followed by the image data pages themselves (the ordering of 
> > the PFNs
> > must be the same as the ordering of data pages that correspond to 
> > them).
> > Still, the low-level image format only needs to be known by the 
> > restore code in
> > snapshot.c .
> 
> Ok sounds like this code could be reused.  I'll look into it.
> 
> > Second, the code in swap.c writes the image pages to a storage adding 
> > some
> > metadata making it possible to reproduce their original ordering 
> > during the
> > restore.
> 
> So you are allocating the blocks as you go ... and adding meta data 
> along the way?

Something like this.  I can't say how Nigel does it, though.
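
The page stream described above (a header page, then pages of PFNs, then the
data pages in the same order as the PFNs) can be modelled in a few lines of
Python.  This is only a sketch: the header here carries nothing but a page
count, and PFNs are stored as 64-bit little-endian integers.  Both are
assumptions made for the example, not the actual format used by snapshot.c.

```python
import struct

PAGE_SIZE = 4096
PFNS_PER_PAGE = PAGE_SIZE // 8  # 64-bit PFNs, an assumption of this sketch


def build_stream(pages_by_pfn):
    """Return [header page, PFN pages..., data pages...] for a {pfn: page} dict."""
    pfns = sorted(pages_by_pfn)
    stream = [struct.pack("<Q", len(pfns)).ljust(PAGE_SIZE, b"\0")]
    for i in range(0, len(pfns), PFNS_PER_PAGE):
        chunk = pfns[i:i + PFNS_PER_PAGE]
        packed = struct.pack("<%dQ" % len(chunk), *chunk)
        stream.append(packed.ljust(PAGE_SIZE, b"\0"))
    # Data pages follow in exactly the order in which their PFNs were written.
    stream.extend(pages_by_pfn[pfn] for pfn in pfns)
    return stream


def restore_stream(stream):
    """Rebuild the {pfn: page} mapping from a stream made by build_stream()."""
    (count,) = struct.unpack_from("<Q", stream[0])
    n_pfn_pages = -(-count // PFNS_PER_PAGE)  # ceiling division
    pfns = []
    for page in stream[1:1 + n_pfn_pages]:
        take = min(PFNS_PER_PAGE, count - len(pfns))
        pfns.extend(struct.unpack_from("<%dQ" % take, page))
    return dict(zip(pfns, stream[1 + n_pfn_pages:]))
```

Because the PFN list and the data pages share one ordering, the reader needs
no per-page metadata beyond the list itself.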
 
> > The fact that we use swap spaces as the storage is related to 
> > implementation
> > simplicity rather than anything else.
> 
> Ok ... this only supports uncompressed hibernation?

The in-kernel version doesn't support compression (again, this was a choice
made to keep the code relatively simple), but the userland version supports
compression (and image encryption).

> The first kernel is going to specify (1) what to backup.  It can 
> specify (2) where to backup, although we have to be careful identify 
> the device in a persistent way.

Yes, that seems doable.

> >> A second method will be to supply a device and file that will be
> >> mounted by the save kernel, then unmounted and restored.  This would
> >> require a partition that is not mounted or open by the suspended 
> >> kernel
> >> (or use nfs or a similar protocol that is designed for multiple client
> >> concurrent access).
> >>
> >> A third method would be to allocate a file with the first kernel, and
> >> make sure the blocks are flushed to disk.  The save and restore 
> >> kernels
> >> map the file system using a snapshot device.  Writing would map the
> >> blocks and use the block offset to write to the real device using the
> >> method from the first option; reading could be done directly from the
> >> snapshot device.
> >>
> >> The first and third option are dead on log based file systems (where
> >> the data is stored in the log).
> >
> > All in all, we have three different and working implementations of the
> > image-writing and image-reading code at our disposal.  Why would you 
> > want to
> > break the open doors?
> 
> The problem I'm saying kexec solves is how to get the data to the 
> device while most of the kernel is trying not do anything permanent.
> 
> If we can reuse existing code, great.

I think we can.

> >>> (4) what state should devices be in when switching kernels
> >>
> >> My proposal is either initialized and untouched or quiesced.
> >
> > This is reasonable, but in general we also need to save some 
> > information
> > about the pre-hibernation state of devices, so that we can put them 
> > into the
> > same state, if reasonably possible, during the restore.
> 
> What state are you referring to?
> 
> Yes, there is state that the drivers have to store to ram, but this is the 
> same state they need to store when suspending to ram if the device can 
> be powered off.

Yes.

> Maybe we need to teach drivers to store more state, like remember that 
> a hard drive was spun down.

I'm not sure about that.

> So we may need a flag saying "we powered off", "we resumed from 
> suspend".

There already is something like this.

Generally, we're going to have a special callback that will be used by the core
after the restore, so the driver will always know what it's supposed to do.

> >>> (5) the complicated setup required with the current patch
> >>
> >> I think a few simple changes to kjump will make this much simpler.  
> >> See
> >> below.
> >>
> >>> (6) what code restores the image
> >>
> >> The save kernel, loaded at boot.   People have suggested booting the
> >> first kernel, and using current restore code.   However, I think that
> >> ignores that (1) we saved from a different kernel, so the backed up
> >> region will be restored to its backed up random pages,
> >
> > This problem has already been solved.
> >
> >> (2) the code was written to restore the same kernel,
> >
> > Not exactly.  In fact, the current implementation only relies on the 
> > tiny
> > portion of the restore code being in the same place in both kernels, 
> > but
> > we can change the code not to make this assumption (it'll be more 
> > complicated,
> > but that's perfectly doable).
> 
> If the save kernel is different from the run kernel (to make it 
> smaller), it's likely the image saving code will move.  I view restoring 
> from a different kernel than saving as an advanced feature.
> 
> Let's get resuming from the save kernel working first.
> 
> >> so the text and data will be replaced by identical text.  It's much 
> >> simpler
> >> conceptually to use the same kernel to save and restore the image.
> >
> > Here I agree. :-)
> >
> >> Simplifying kjump: the proposal for v3.
> >>
> >> The current code is trying to use crash dump area as a safe, reserved
> >> area to run the second kernel.   However, that means that the kernel
> >> has to be linked specially to run in the reserved area.   I think we
> >> need to finish separating kexec_jump from the other code paths.
> >>
> >> (1) add a new command line argument that specifies the kexec_jump
> >> target area.
> >>
> >> (2) add a kjump flag to the flags parameter, used by kexec_load.   
> >> When
> >> loading a jump kernel, it is loaded like a normal kernel, however,
> >> additional control pages are allocated to (a) save the kexec_jump
> >> target area (b) save the backed up region that is used by all kernels
> >> like crash dump, and (c) space for invoking relocate_new_kernel that
> >> will get its args from the execution entry point and will restore the
> >> kernel then call resume and suspend.
> >>
> >> (3) replace jump_huf_pfn with two command line addresses that specify
> >> the (a) return point for after resume, and (b) the return point for
> >> after image save.   Actually these can be done in userspace; the 
> >> second
> >> restore kernel can just specify the null copy list and the entry 
> >> points
> >> supplied by the suspended kernel.  To do resume we also need (c) where
> >> to store resume address for the save kernel.
> >>
> >>
> >> As a first stage of suspend and resume, we can save to dedicated
> >> partitions all memory (as supplied to crash_dump) that is not marked
> >> nosave and not part of the save kernel's image.
> >
> > A little problem here: there are "nosave" areas that are not marked as 
> > nosave.
> 
> If crash_dump is going to work, the memory must exist.
> 
> >> The fancy block lists and memory lists can be added later.
> >
> > On the majority of systems that will work.  On some of them it won't.
> 
> Ok .... well, my point is we can get started while we work out what the 
> list format is.   If we decide to reuse the pfn lists above that may 
> come quickly.

I think it's generally reasonable (a) not to save the entire memory (free RAM
areas, for example, can be skipped) and (b) to include information about the
original location of each data page in the image, one way or another.

This doesn't complicate things all that much.
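
To illustrate points (a) and (b), here is a minimal user-space sketch in
Python, not kernel code.  It assumes memory can be modelled as a mapping from
page frame number (pfn) to page contents; the names ImagePage, build_image and
restore_image are hypothetical and do not come from swsusp or kexec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImagePage:
    """One saved page together with its original location (hypothetical format)."""
    pfn: int      # page frame number the data came from
    data: bytes   # page contents


def build_image(memory, free_pfns, nosave_pfns):
    """Collect only the pages worth saving.

    memory is a dict {pfn: bytes}.  Free pages and nosave pages are skipped,
    so the image is smaller than the whole of RAM (point (a)).
    """
    skip = set(free_pfns) | set(nosave_pfns)
    return [ImagePage(pfn, data)
            for pfn, data in sorted(memory.items())
            if pfn not in skip]


def restore_image(image, memory):
    """Put every saved page back at its original pfn (point (b))."""
    for page in image:
        memory[page.pfn] = page.data
    return memory
```

Because each record carries its own pfn, the restoring side does not need to
know how the saving side laid out memory; pages that were skipped are simply
left as the restoring kernel has them.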

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 20:31                                 ` david
@ 2007-07-20 21:24                                   ` Alan Stern
  2007-07-20 21:34                                     ` david
  2007-07-20 21:37                                     ` Jeremy Maitin-Shepard
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-20 21:24 UTC (permalink / raw)
  To: david
  Cc: Milton Miller, Rafael J. Wysocki, Ying Huang, LKML, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007 david@lang.hm wrote:

> > Userspace can submit I/O requests.  Someone will have to audit every
> > driver to make sure that such I/O requests don't cause a quiesced
> > device to become active.  If the device is active, it will make the
> > memory snapshot inconsistent with the on-device data.
> 
> assuming this is the suspend-from-ram after a kexec back from the 
> write-to-disk kernel I don't think you are correct.
> 
> when doing a suspend-to-ram you get to a point where you just don't use 
> any userspace.

What do you mean?  How can you prevent user tasks from running?  That's 
basically what the freezer does, and the whole point of this approach 
is to eliminate the freezer.  Right?

> from that point on you are just walking the device tree 
> putting things into low-power mode. This is the point where we are talking 
> about jumping to.

Yes.  And putting things into low-power mode requires the ability to 
run the scheduler, which means that user tasks can be scheduled, which 
means that they can run.

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:56                                 ` Milton Miller
  2007-07-20 17:31                                   ` Jeremy Maitin-Shepard
  2007-07-20 19:26                                   ` david
@ 2007-07-20 21:28                                   ` Rafael J. Wysocki
  2007-07-20 21:33                                     ` Jeremy Maitin-Shepard
  2 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:28 UTC (permalink / raw)
  To: Milton Miller
  Cc: Ying Huang, Alan Stern, LKML, David Lang, linux-pm,
	Jeremy Maitin-Shepard

On Friday, 20 July 2007 18:56, Milton Miller wrote:
> On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:
> > On Friday, 20 July 2007 01:07, david@lang.hm wrote:
> >> On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
> >>> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
> >>>> The currently identified problems under discussion include:
> >>>> (1) how to interact with acpi to enter into S4.
> >>>> (2) how to identify which memory needs to be saved
> >>>> (3) how to communicate where to save the memory
> >>>> (4) what state should devices be in when switching kernels
> >>>> (5) the complicated setup required with the current patch
> >>>> (6) what code restores the image
> >>>
> >>> (7) how to avoid corrupting filesystems mounted by the hibernated 
> >>> kernel
> >>
> >> I didn't realize this was a discussion item. I thought the options 
> >> were
> >> clear, for some filesystem types you can mount them read-only, but for
> >> ext3 (and possibly other less common ones) you just plain cannot touch
> >> them.
> >
> > That's correct.  And since you cannot touch ext3, you need either to 
> > assume
> > that you won't touch filesystems at all, or to have a code to 
> > recognize the
> > filesystem you're dealing with.
> 
> Or add a small bit of infrastructure that errors writes at make_request 
> if you don't have a magic "i am a direct block device write from 
> userspace" flag on the bio.
> 
> The hibernate may fail, but you don't corrupt the media.
> 
> If you don't get the image out, resume back to the "this is resume" 
> instead of the power-down path.

Well, I don't think that is much prettier than the freezer ...
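
For reference, the write filter proposed in the quoted text can be modelled
roughly as follows.  This is a toy user-space sketch in Python under stated
assumptions: GuardedDisk, WriteBlocked and the direct_from_userspace flag are
hypothetical names, and nothing here corresponds to the real block layer or
to make_request itself.

```python
class WriteBlocked(Exception):
    """Raised when a write arrives without the 'direct write' marker."""


class GuardedDisk:
    """Toy block device that errors unmarked writes while the guard is on."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.guard = False   # turned on once the memory snapshot exists

    def submit(self, op, block, data=b"", direct_from_userspace=False):
        if op == "write":
            # Only writes explicitly marked as direct block-device writes
            # (i.e. the image writer) get through while the guard is on.
            if self.guard and not direct_from_userspace:
                raise WriteBlocked(f"write to block {block} rejected")
            self.blocks[block] = data
            return None
        # Reads are always allowed in this model.
        return self.blocks[block]
```

In this model a filesystem write during image saving fails instead of
changing the media, which matches "the hibernate may fail, but you don't
corrupt the media".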

> >>>>> (2) Upon start-up (by which I mean what happens after the user has
> >>>>> pressed
> >>>>>     the power button or something like that):
> >>>>>   * check if the image is present (and valid) _without_ enabling 
> >>>>> ACPI
> >>>>> (we don't
> >>>>>     do that now, but I see no reason for not doing it in the new
> >>>>> framework)
> >>>>>   * if the image is present (and valid), load it
> >>>>>   * turn on ACPI (unless already turned on by the BIOS, that is)
> >>>>>   * execute the _BFS global control method
> >>>>>   * execute the _WAK global control method
> >>>>>   * continue
> >>>>>   Here, the first two things should be done by the image-loading
> >>>>> kernel, but
> >>>>>   the remaining operations have to be carried out by the restored
> >>>>> kernel.
> >>>>
> >>>> Here I agree.
> >>>>
> >>>> Here is my proposal.  Instead of trying to both write the image and
> >>>> suspend, I think this all becomes much simpler if we limit the scope of
> >>>> the work of the second kernel.  Its purpose is to write the image.
> >>>> After that it's done.   The platform can be powered off if we are 
> >>>> going
> >>>> to S5.   However, to support suspend to ram and suspend to disk, we
> >>>> return to the first kernel.
> >>>
> >>> We can't do this unless we have frozen tasks (this way, or another) 
> >>> before
> >>> carrying out the entire operation.  In that case, however, the 
> >>> kexec-based
> >>> approach would have only one advantage over the current one.  
> >>> Namely, it
> >>> would allow us to create bigger images.
> >>
> >> we all agree that tasks cannot run during the suspend-to-ram state, 
> >> but
> >> the disagreement is over what this means
> >>
> >> at one extreme it could mean that you would need the full freezer as 
> >> per
> >> the current suspend projects.
> >>
> >> at the other extreme it could mean that all that's needed is to 
> >> invoke the
> >> suspend-to-ram routine before anything else on the suspended kernel 
> >> on the
> >> return from the save and restore kernel.
> >>
> >> we just need to figure out which it is (or if it's somewhere in 
> >> between).
> >
> > Well, I think that the "invoke the suspend-to-ram routine before 
> > anything else
> > on the suspended kernel" thing won't be easy to implement in practice.
> 
> Why?  You don't expect suspend-to-ram in drivers to be implemented?  We 
> need more separation of the quiesce drivers from power-down devices?

No.  I'm saying that when you go back from the image-saving kernel to the
hibernated kernel, you need to make sure that no task will cause any
filesystem's on-disk state to be actually updated.  If you can't make such
a guarantee, you just can't do that.

With the current state of the drivers, it's not doable without the freezer.

> Note that we are just talking about "suspend devices and put their 
> state in ram", not actually invoking the platform to suspend to ram.
> 
> And I'm actually saying we free memory and maybe allocate disk blocks 
> for the save before we suspend (see below).

Well, I've already written about the OOM killer ...

> >>>> Message-ID: <200707151433.34625.rjw@sisk.pl>
> >>>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> >>>>> (1) Filesystems mounted before the hibernation are untouchable
> >>>>
> >>>> This is because some file systems do a fsck or other activity even 
> >>>> when
> >>>> mounted read only.  For the kexec case, however, this should be 
> >>>> "file
> >>>> systems mounted by the hibernated system must not be written".   As 
> >>>> has
> >>>> been mentioned in the past, we should be able to use something like 
> >>>> dm
> >>>> snapshot to allow fsck and the file system to see the cleaned copy
> >>>> while not actually writing the media.
> >>>
> >>> We can't _require_ users to use the dm snapshot in order for the 
> >>> hibernation
> >>> to work, sorry.
> >>>
> >>> And by _reading_ from a filesystem you generally update metadata.
> >>
> >> not if the filesystem is mounted read-only (except on ext3)
> >
> > Well, if the filesystem in question is a journaling one and the 
> > hibernated
> > kernel has mounted this fs read-write, this seems to be tricky anyway.
> 
> Yes.  I would argue writing to existing blocks of a file (not through 
> the filesystem, just getting their blocks from the file system) should 
> be safe, but it occurs to me that may not be the case if your fsck and 
> bmap move data blocks from some update log to the file system.
> 
> But we know the (maximum) image size.   So we could allocate the blocks 
> in the first image before suspending the drivers and memory 
> allocations, and supplying the list to the second kernel.  We could 
> even write to the first block with a signature "suspend to here", or 
> even the whole block list to the beginning (it will have to be saved to 
> disk for restore anyways).

The writing is easy (we're doing that already, just fine).

The tricky part would be if the image-saving kernel tried to mount a journaling
filesystem in use by the hibernated kernel.

> >>>> The kjump kernel must not have any knowledge retained if we reuse 
> >>>> it.
> >>>>
> >>>>> (2) Swap space in use before the hibernation must be handled with 
> >>>>> care
> >>>>
> >>>> Yes.  Actually, even though they have been used by the write-in-the
> >>>> kernel users, they will be among the most difficult devices to use 
> >>>> for
> >>>> snapshots by a userspace second kernel.
> 
> If we use the "write to these blocks" then this is as easy as writing 
> to a file in a mounted filesystem.
> 
> >>>>> (4) The user should be able to limit the size of a hibernation 
> >>>>> image
> >>>>
> >>>> This means the suspending kernel must arrange to reduce its active
> >>>> memory.  The limited save can be done by providing a limited list in
> >>>> (3).
> >>>
> >>> It seems to me that you don't understand the problem here.
> >>>
> >>> Assume you have 90% of RAM allocated before the hibernation and the 
> >>> user has
> >>> requested the image to be not greater than 50% of RAM.  In that case 
> >>> you have
> >>> to free some memory _before_ identifying memory to save and you must 
> >>> not
> >>> race with applications that attempt to allocate memory while you're 
> >>> doing it.
> >>
> >> I disagree a little bit.
> >>
> >> first off, only the suspending kernel can know what can be freed and 
> >> what
> >> is needed to do so (remember this is kernel internals, it can change 
> >> from
> >> patch to patch, let alone version to version)
> >>
> >> second, if you have a lot of memory to free, and you can't just throw 
> >> away
> >> caches to do so, you don't know what is going to be involved in 
> >> freeing
> >> the memory, it's very possible that it is going to involve userspace, 
> >> so
> >> you can't freeze any significant portion of the system, so you can't
> >> eliminate all chance of races
> >>
> >> what you can do is
> >>
> >> 1. try to free stuff
> >> 2. stop the system and account for memory, is enough free
> >> if not goto 1
> >>
> >> if userspace is dirtying memory fast enough, or is just using enough
> >> memory that you can't meet your limit you just won't be able to 
> >> suspend.
> >
> > This means unreliable hibernation for some workloads.  While I agree 
> > that
> > shouldn't be a problem in a common case, there are users who will 
> > complain. ;-)
> 
> With my allocate memory as a task and don't save that task's memory 
> approach, we can get to this point while userspace is running.   It 
> could be controlled by userspace, or even be userspace 
> (sys_do_not_save_me() waits for resume, and dies as the kernel 
> resumes).
> 
> >> but under any other conditions you will eventually get enough memory 
> >> free.
> >>
> >> so try several times and if you still fail tell the user they have too
> >> much stuff running and they need to kill something.
> >
> > Well, with the freezer that's much simpler (and more reliable, I'd 
> > say): you
> > freeze tasks and _then_ you shrink memory.
> 
> It means you are committed to suspend before you try to shrink memory.  
> What happens when the user requested a smaller image than memory in 
> use?

You mean the user wanted the image to be so small that we can't create it?
Well, we don't. :-)

In fact, we cheat a little.  Namely, we check if there's enough storage space
and if so, we create an image that's bigger than requested by the user.
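
The "cheat" can be sketched as a simple decision, shown here as a user-space
Python model.  The function name, its parameters and the megabyte units are
assumptions made for illustration; this is not the actual swsusp code.

```python
def choose_image_size(requested_limit, size_after_shrink, storage_free):
    """Decide what to do once memory has been shrunk as far as possible.

    All sizes are in the same unit (say, megabytes).  Returns None if the
    image cannot be stored at all, otherwise (image_size, exceeded_limit).
    """
    if size_after_shrink > storage_free:
        # Not enough room on the storage device: hibernation must fail.
        return None
    # The image fits.  If it is bigger than the user asked for, create it
    # anyway and just note that the requested limit was not met.
    exceeded_limit = size_after_shrink > requested_limit
    return size_after_shrink, exceeded_limit
```

For example, with a requested limit of 1024, a shrunk size of 1500 and 2048
of free storage, the model writes a 1500 image and flags the limit as
exceeded rather than refusing to hibernate.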

> >>>>> (8) Hibernation and restore should not be too slow
> >>>>
> >>>> We control the added code.   We are using full runtime drivers and 
> >>>> will
> >>>> run at hardware speeds.
> >>>
> >>> That may not be enough.  If you're going to save, say, 80% of RAM on 
> >>> a 2 GB
> >>> machine, then you'll have to be using image compression.
> >>
> >> this doesn't make sense, 20% of 2G is 400M, if you can't make a 
> >> kernel and
> >> userspace that can run in 400M you have a serious problem.
> >
> > I was talking about the _speed_ of writing and reading.
> 
> Yes.  As I said, adding a compress as we copy the pages into the saving 
> kernel for writeout should be easy.

Easy or not, that's one more thing you should remember.

> >> even if you wanted to save 99% of RAM on a 2G system, you have 20M of 
> >> ram
> >> to play with, which should easily be enough.
> >>
> >> remember, linux runs on really small systems as well, and while you do
> >> have to load some drivers for the big system, there are a lot of other
> >> things that aren't needed.
> >>
> >>> All in all, we have three different and working implementations of the
> >>> image-writing and image-reading code at our disposal.  Why would you 
> >>> want to
> >>> break the open doors?
> >>
> >> because you say that the current methods won't work without ACPI 
> >> support.
> >
> > I didn't say that.  [Or if I did, please point me to this message.]
> >
> > Anyway, this wouldn't be true even if I did.
> >
> > What I've been trying to say from the very beginning is that the 
> > current
> > frameworks _support_ hibernation a la ACPI S4 (although that's not 
> > exactly
> > ACPI S4) and if we are going to introduce a new framework, then it 
> > should
> > be designed to _support_ ACPI S4 fully _from_ _the_ _start_.
> >
> > This DOESN'T mean that the non-ACPI hibernation should be unsupported 
> > and
> > it DOESN'T mean that the non-ACPI hibernation is not supported 
> > currently.
> > IT IS SUPPORTED.
> >
> 
> As I said, I see kjump as a way to solve the "ok i am at a save point, 
> now how do I write this image to media without allowing any other io".  
> As you know by now, my solution for ACPI support is after the image is 
> written we go back to the kernel that started the suspend and it puts 
> the machine in S4.
> 
> If this works, we get down to 1 hibernate implementation in the kernel 
> :-).

For now, we have one implementation in the kernel (swsusp) that may be used
with some external (user space) tools (and is called uswsusp in that case,
quite confusingly), the other complete one that's waiting for merging with
the first one, at least in part (tuxonice, formerly known as suspend2), and
your proposed _third_ one (in a number of variants, perhaps).

I _think_ we can get down to one, but not by creating something entirely new
from scratch.  If you can think of introducing the kexec-based approach
in such a way that it uses _as_ _much_ _of_ _existing_ _code_ _as_ _reasonably_
_possible_, then the result might be a candidate for the one common
implementation, as far as I'm concerned.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 17:31                                   ` Jeremy Maitin-Shepard
@ 2007-07-20 21:30                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:30 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Milton Miller, Ying Huang, Alan Stern, LKML, David Lang, linux-pm

On Friday, 20 July 2007 19:31, Jeremy Maitin-Shepard wrote:
> Milton Miller <miltonm@bga.com> writes:
> 
> [snip]
> 
> >>>> (7) how to avoid corrupting filesystems mounted by the hibernated kernel
> >>> 
> >>> I didn't realize this was a discussion item. I thought the options were
> >>> clear, for some filesystem types you can mount them read-only, but for
> >>> ext3 (and possibly other less common ones) you just plain cannot touch
> >>> them.
> >> 
> >> That's correct.  And since you cannot touch ext3, you need either to assume
> >> that you won't touch filesystems at all, or to have a code to recognize the
> >> filesystem you're dealing with.
> 
> > Or add a small bit of infrastructure that errors writes at make_request if you
> > don't have a magic "i am a direct block device write from userspace" flag on the
> > bio.
> 
> I still don't understand why there is this fixation on accessing dirty
> filesystems in use by the hibernated system.  Even if you avoid
> corrupting the filesystem by avoiding writing to the block device, there
> isn't any real guarantee about the state of the data, except for a
> filesystem that specifically makes guarantees about such data (and I
> don't believe any of the existing ones do).
> 
> It isn't necessary to be able to access such filesystems: everything can
> be done from an initramfs/initrd.

That's correct, but you need an additional ramdisk for that (yet another
complication).

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:28                                   ` Rafael J. Wysocki
@ 2007-07-20 21:33                                     ` Jeremy Maitin-Shepard
  2007-07-20 22:19                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 21:33 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Milton Miller, Ying Huang, Alan Stern, LKML, David Lang, linux-pm

"Rafael J. Wysocki" <rjw@sisk.pl> writes:

[snip]

>> Or add a small bit of infrastructure that errors writes at make_request 
>> if you don't have a magic "i am a direct block device write from 
>> userspace" flag on the bio.
>> 
>> The hibernate may fail, but you don't corrupt the media.
>> 
>> If you don't get the image out, resume back to the "this is resume" 
>> instead of the power-down path.

> Well, I don't think that is much prettier than the freezer ...

It seems that a better solution to the "how do we write to a file on an
in-use partition" has been suggested, which also handles swap partitions
and swap files, and does not require mounting filesystems, so it seems
that the filesystem issue need not be considered.

[snip]

> No.  I'm saying that when you go back from the image-saving kernel to the
> hibernated kernel, you need to make sure that no task will cause any
> filesystem's on-disk state to be actually updated.  If you can't make such
> a guarantee, you just can't do that.

> With the current state of the drivers, it's not doable without the
> freezer.

It seems that it should be feasible to fix the drivers so that

1. they can be taken from normal state to quiesced state without
   requiring the freezer;

2. they can be taken from normal state to low power state without
   requiring the freezer;

3. they can be taken from quiesced state to low power state without
   requiring the freezer.

In particular, it seems that it should be possible to do (3) without
needing to schedule tasks.

It seems likely that (2) may in fact be almost exactly the same as, or
at least similar to, (1) followed by (3), at least for many drivers.
(1) is required by the kexec hibernate approach even ignoring suspend to
both or S4.  (2) is required for suspend to ram without the freezer,
which seems to be desired anyway.
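
The three transitions can be pictured with a small state machine.  The
following is a user-space Python sketch only; DevState, Device and the method
names are hypothetical and do not correspond to any existing driver API.

```python
from enum import Enum


class DevState(Enum):
    NORMAL = "normal"
    QUIESCED = "quiesced"
    LOW_POWER = "low power"


class Device:
    """Toy device that records which transitions were performed."""

    def __init__(self, name):
        self.name = name
        self.state = DevState.NORMAL
        self.log = []

    def quiesce(self):
        """Transition (1): normal -> quiesced."""
        if self.state is not DevState.NORMAL:
            raise RuntimeError(f"{self.name}: cannot quiesce from {self.state}")
        self.log.append("quiesce")
        self.state = DevState.QUIESCED

    def power_down(self):
        """Transition (3): quiesced -> low power."""
        if self.state is not DevState.QUIESCED:
            raise RuntimeError(f"{self.name}: cannot power down from {self.state}")
        self.log.append("power_down")
        self.state = DevState.LOW_POWER

    def suspend(self):
        """Transition (2): normal -> low power, modelled as (1) then (3)."""
        self.quiesce()
        self.power_down()
```

In this model, suspend() on a normal device leaves exactly the same log as
calling quiesce() and then power_down(), which is the sense in which (2) may
be "(1) followed by (3)" for many drivers.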

[snip]

-- 
Jeremy Maitin-Shepard


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:24                                   ` Alan Stern
@ 2007-07-20 21:34                                     ` david
  2007-07-20 22:15                                       ` Rafael J. Wysocki
  2007-07-20 21:37                                     ` Jeremy Maitin-Shepard
  1 sibling, 1 reply; 220+ messages in thread
From: david @ 2007-07-20 21:34 UTC (permalink / raw)
  To: Alan Stern
  Cc: Milton Miller, Rafael J. Wysocki, Ying Huang, LKML, linux-pm,
	Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Alan Stern wrote:

> On Fri, 20 Jul 2007 david@lang.hm wrote:
>
>>> Userspace can submit I/O requests.  Someone will have to audit every
>>> driver to make sure that such I/O requests don't cause a quiesced
>>> device to become active.  If the device is active, it will make the
>>> memory snapshot inconsistent with the on-device data.
>>
>> assuming this is the suspend-from-ram after a kexec back from the
>> write-to-disk kernel I don't think you are correct.
>>
>> when doing a suspend-to-ram you get to a point where you just don't use
>> any userspace.
>
> What do you mean?  How can you prevent user tasks from running?  That's
> basically what the freezer does, and the whole point of this approach
> is to eliminate the freezer.  Right?
>
>> from that point on you are just walking the device tree
>> putting things into low-power mode. This is the point where we are talking
>> about jumping to.
>
> Yes.  And putting things into low-power mode requires the ability to
> run the scheduler, which means that user tasks can be scheduled, which
> means that they can run.

I did not know that getting into low-power mode required scheduling.

does it require userspace?

if so this is a problem and I say punt on suspend-to-disk-and-ram until 
suspend-to-ram is working independently ;-)

if not, then can you schedule but not consider non-kernel tasks runnable?

freezing all of userspace is easy (see above)

freezing all of kernelspace is easy (unplug all non-boot CPUs and don't 
schedule)

where freezing gets hard is when you need to partially freeze either one 
of these.

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 14:48                           ` Huang, Ying
  2007-07-20 15:48                             ` david
@ 2007-07-20 21:34                             ` Rafael J. Wysocki
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Milton Miller, linux-pm, LKML, Alan Stern, David Lang,
	Jeremy Maitin-Shepard

On Friday, 20 July 2007 16:48, Huang, Ying wrote:
> On Fri, 2007-07-20 at 09:01 -0500, Milton Miller wrote:
> > Simplifying kjump: the proposal for v3.
> > 
> > The current code is trying to use crash dump area as a safe, reserved 
> > area to run the second kernel.   However, that means that the kernel 
> > has to be linked specially to run in the reserved area.   I think we 
> > need to finish separating kexec_jump from the other code paths.
> > 
> > (1) add a new command line argument that specifies the kexec_jump 
> > target area (or just size?)
> > 
> > (2) add a kjump flag to the flags parameter, used by kexec_load.   When 
> > loading a jump kernel, it is loaded like a normal kernel, however, 
> > additional control pages are allocated to (a) save this kernel's use of 
> > the kexec_jump target area (b) save the backed up region that is used 
> > by all kernels like crash dump, and (c) space for invoking 
> > relocate_new_kernel that will get its args from the execution entry 
> > point and will restore the kernel then call resume and suspend.
> 
> Backing up target memory before kexec and restoring it after kexec is
> a planned feature for kexec jump. But I will work on image writing/reading
> first.

Have you thought about using any existing code, when you're at it?

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 20:12                                       ` Alan Stern
@ 2007-07-20 21:35                                         ` Oliver Neukum
  2007-07-20 22:25                                           ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: Oliver Neukum @ 2007-07-20 21:35 UTC (permalink / raw)
  To: Alan Stern
  Cc: Milton Miller, Ying Huang, LKML, Rafael J. Wysocki, David Lang,
	linux-pm, Jeremy Maitin-Shepard

Am Freitag 20 Juli 2007 schrieb Alan Stern:
> On Fri, 20 Jul 2007, Oliver Neukum wrote:
> 
> > Am Freitag 20 Juli 2007 schrieb Alan Stern:
> > > Some drivers need the ability to schedule.  Some will need the ability 
> > > to allocate memory (although GFP_ATOMIC is probably sufficient).  Some 
> > > will need timers to run.
> > 
> > Some will have to request firmware. It can add up to some megabytes.
> > In addition, if we don't freeze, some drivers, eg. video drivers, can
> > do allocations in the megabyte range.
> > 
> > It seems to me that without the freezer we will end up with many drivers
> > needing a two step notification process. Furthermore there are requirements
> > on the order of shutting down system facilities, eg. device addition must
> > be stopped before drivers allocate firmware.
> 
> These are really separate issues, since they refer to things that have 
> to happen well before the memory snapshot is captured.
> 
> We already have a pre-suspend notification available for drivers that 
> need to allocate large amounts of memory.

Is that facility fine grained enough?

> You are correct about the need to delay/stop device addition.  I don't
> know how this can be done in general; each code path calling
> device_add() may have to be treated individually.

What about the old API? Do we have to block module loading?
What happens if a scsi error handler is woken? If it cannot be woken,
how are errors handled?

	Regards
		Oliver


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:24                                   ` Alan Stern
  2007-07-20 21:34                                     ` david
@ 2007-07-20 21:37                                     ` Jeremy Maitin-Shepard
  2007-07-20 22:35                                       ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 21:37 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, Milton Miller, Rafael J. Wysocki, Ying Huang, LKML, linux-pm

Alan Stern <stern@rowland.harvard.edu> writes:

> On Fri, 20 Jul 2007 david@lang.hm wrote:
>> > Userspace can submit I/O requests.  Someone will have to audit every
>> > driver to make sure that such I/O requests don't cause a quiesced
>> > device to become active.  If the device is active, it will make the
>> > memory snapshot inconsistent with the on-device data.
>> 
>> assuming this is the suspend-from-ram after a kexec back from the 
>> write-to-disk kernel I don't think you are correct.
>> 
>> when doing a suspend-to-ram you get to a point where you just don't use 
>> any userspace.

> What do you mean?  How can you prevent user tasks from running?  That's 
> basically what the freezer does, and the whole point of this approach 
> is to eliminate the freezer.  Right?

Presumably no tasks at all would be scheduled.

>> from that point on you are just walking the device tree 
>> putting things into low-power mode. This is the point where we are talking 
>> about jumping to.

> Yes.  And putting things into low-power mode requires the ability to 
> run the scheduler, which means that user tasks can be scheduled, which 
> means that they can run.

Does it really (fundamentally) require scheduling tasks, particularly in
the case that the devices have already been put in the "quiesced" state?

-- 
Jeremy Maitin-Shepard


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:43                                   ` Rafael J. Wysocki
@ 2007-07-20 21:39                                     ` david
  2007-07-20 22:22                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-20 21:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jim Crilly, Milton Miller, linux-pm, LKML, Alan Stern, Huang,
	Ying, Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Rafael J. Wysocki wrote:

> On Friday, 20 July 2007 17:36, david@lang.hm wrote:
>> On Fri, 20 Jul 2007, Jim Crilly wrote:
>>
>>>>> has
>>>>> requested the image to be not greater than 50% of RAM.  In that case you
>>>>> have
>>>>> to free some memory _before_ identifying memory to save and you must not
>>>>> race with applications that attempt to allocate memory while you're doing
>>>>> it.
>>>>
>>>> I disagree a little bit.
>>>>
>>>> first off, only the suspending kernel can know what can be freed and what
>>>> is needed to do so (remember this is kernel internals, it can change from
>>>> patch to patch, let alone version to version)
>>>>
>>>> second, if you have a lot of memory to free, and you can't just throw away
>>>> caches to do so, you don't know what is going to be involved in freeing
>>>> the memory, it's very possible that it is going to involve userspace, so
>>>> you can't freeze any significant portion of the system, so you can't
>>>> eliminate all chance of races
>>>>
>>>> what you can do is
>>>>
>>>> 1. try to free stuff
>>>> 2. stop the system and account for memory, is enough free
>>>> if not goto 1
>>>>
>>>> if userspace is dirtying memory fast enough, or is just using enough
>>>> memory that you can't meet your limit you just won't be able to suspend.
>>>>
>>>> but under any other conditions you will eventually get enough memory free.
>>>>
>>>> so try several times and if you still fail tell the user they have too
>>>> much stuff running and they need to kill something.
>>>
>>> Which would be a pretty big regression from what we have now. With the
>>> current implementation I can hibernate under virtually any workload because
>>> the freezer stops everything and there's no competition for resources.
>>
>> as long as what you are trying to save is <=50% of ram (at least with some
>> implementations). if you are trying to save more than 50% of ram with some
>> current implementations you just can't
>
> With some, you can't, with the others, you can. :-)
>
> The argument given was about the freezer and IMO it was valid.
>
> Why didn't you address it directly?

I thought it had been covered in other messages (with as big as this 
thread is I'm trying to avoid repeating the same thing more than a couple 
times a day :-)

there was another message talking about ways that you could reduce the 
image size without it being racy (allocate pinned memory until the 
remainder is small enough, then don't back up the pinned memory)

that's a much cleaner answer than what I was thinking, so I'll go with it 
instead ;-)

David Lang

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 15:36                                 ` david
@ 2007-07-20 21:43                                   ` Rafael J. Wysocki
  2007-07-20 21:39                                     ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:43 UTC (permalink / raw)
  To: david
  Cc: Jim Crilly, Milton Miller, linux-pm, LKML, Alan Stern, Huang,
	Ying, Jeremy Maitin-Shepard

On Friday, 20 July 2007 17:36, david@lang.hm wrote:
> On Fri, 20 Jul 2007, Jim Crilly wrote:
> 
> >>> has
> >>> requested the image to be not greater than 50% of RAM.  In that case you
> >>> have
> >>> to free some memory _before_ identifying memory to save and you must not
> >>> race with applications that attempt to allocate memory while you're doing
> >>> it.
> >>
> >> I disagree a little bit.
> >>
> >> first off, only the suspending kernel can know what can be freed and what
> >> is needed to do so (remember this is kernel internals, it can change from
> >> patch to patch, let alone version to version)
> >>
> >> second, if you have a lot of memory to free, and you can't just throw away
> >> caches to do so, you don't know what is going to be involved in freeing
> >> the memory, it's very possible that it is going to involve userspace, so
> >> you can't freeze any significant portion of the system, so you can't
> >> eliminate all chance of races
> >>
> >> what you can do is
> >>
> >> 1. try to free stuff
> >> 2. stop the system and account for memory, is enough free
> >> if not goto 1
> >>
> >> if userspace is dirtying memory fast enough, or is just using enough
> >> memory that you can't meet your limit you just won't be able to suspend.
> >>
> >> but under any other conditions you will eventually get enough memory free.
> >>
> >> so try several times and if you still fail tell the user they have too
> >> much stuff running and they need to kill something.
> >
> > Which would be a pretty big regression from what we have now. With the
> > current implementation I can hibernate under virtually any workload because
> > the freezer stops everything and there's no competition for resources.
> 
> as long as what you are trying to save is <=50% of ram (at least with some 
> implementations). if you are trying to save more than 50% of ram with some 
> current implementations you just can't

With some, you can't, with the others, you can. :-)

The argument given was about the freezer and IMO it was valid.

Why didn't you address it directly?

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 16:15                                   ` Alan Stern
@ 2007-07-20 21:46                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:46 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, Milton Miller, linux-pm, LKML, Huang, Ying, Jeremy Maitin-Shepard

On Friday, 20 July 2007 18:15, Alan Stern wrote:
> On Fri, 20 Jul 2007 david@lang.hm wrote:
> 
> > or the userspace helper functions that setup the instructions for the 
> > hibernate warn you if you are telling it to mount a filesystem that it 
> > knows is ext3 and is in use by the system going to sleep.
> 
> One can argue that the ext3 implementation is inadequate.  We should be
> able to give it a mount option requiring it to fail rather than play
> back the journal and write to the disk.
> 
> 
> > > What I've been trying to say from the very beginning is that the current
> > > frameworks _support_ hibernation a la ACPI S4 (although that's not exactly
> > > ACPI S4) and if we are going to introduce a new framework, then it should
> > > be designed to _support_ ACPI S4 fully _from_ _the_ _start_.
> > 
> > here is where there is some disagreement (although it may just be 
> > misunderstanding on the 'fully support' phrase)
> > 
> > it sounds like you are saying that the ACPI support requires a lot of work 
> > (the phrase I've seen some people use is a requirement to 'fix all the 
> > drivers'). we aren't wanting to have this work prevent the non-ACPI 
> > hibernation from progressing.
> 
> You have completely misunderstood.  That phrase "fix all the drivers" 
> has nothing whatsoever to do with ACPI.  It is a prerequisite for 
> removing the freezer.

Yes.

> And unless I'm mistaken, removing the freezer was the main reason for 
> doing all this kexec-style work in the first place.

Yes, that also is my impression.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:34                                     ` david
@ 2007-07-20 22:15                                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 22:15 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, Milton Miller, Ying Huang, LKML, linux-pm,
	Jeremy Maitin-Shepard

On Friday, 20 July 2007 23:34, david@lang.hm wrote:
> On Fri, 20 Jul 2007, Alan Stern wrote:
> 
> > On Fri, 20 Jul 2007 david@lang.hm wrote:
> >
> >>> Userspace can submit I/O requests.  Someone will have to audit every
> >>> driver to make sure that such I/O requests don't cause a quiesced
> >>> device to become active.  If the device is active, it will make the
> >>> memory snapshot inconsistent with the on-device data.
> >>
> >> assuming this is the suspend-from-ram after a kexec back from the
> >> write-to-disk kernel I don't think you are correct.
> >>
> >> when doing a suspend-to-ram you get to a point where you just don't use
> >> any userspace.
> >
> > What do you mean?  How can you prevent user tasks from running?  That's
> > basically what the freezer does, and the whole point of this approach
> > is to eliminate the freezer.  Right?
> >
> >> from that point on you are just walking the device tree
> >> putting things into low-power mode. This is the point where we are talking
> >> about jumping to.
> >
> > Yes.  And putting things into low-power mode requires the ability to
> > run the scheduler, which means that user tasks can be scheduled, which
> > means that they can run.
> 
> I did not know that getting into low-power mode required scheduling.
> 
> does it require userspace?
> 
> if so this is a problem and I say punt on suspend-to-disk-and-ram until 
> suspend-to-ram is working independently ;-)
> 
> if not, then can you schedule but not consider non-kernel tasks runnable?
> 
> freezing all of userspace is easy (see above)
> 
> freezing all of kernelspace is easy (unplug all non-boot CPUs and don't 
> schedule)
> 
> where freezing gets hard is when you need to partially freeze either one 
> of these.

If you use the scheduler to "freeze" tasks, you never know where they are
stopped and what locks they may hold.

We would have done that already if that was so easy, because we really want
to freeze _all_ user space tasks (even if not all kernel threads).

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:33                                     ` Jeremy Maitin-Shepard
@ 2007-07-20 22:19                                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 22:19 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Milton Miller, Ying Huang, Alan Stern, LKML, David Lang, linux-pm

On Friday, 20 July 2007 23:33, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
> 
> [snip]
> 
> >> Or add a small bit of infrastructure that errors writes at make_request 
> >> if you don't have a magic "i am a direct block device write from 
> >> userspace" flag on the bio.
> >> 
> >> The hibernate may fail, but you don't corrupt the media.
> >> 
> >> If you don't get the image out, resume back to the "this is resume" 
> >> instead of the power-down path.
> 
> > Well, I don't think that is much prettier than the freezer ...
> 
> It seems that a better solution to the "how do we write to a file on an
> in-use partition" has been suggested, which also handles swap partitions
> and swap files, and does not require mounting filesystems, so it seems
> that the filesystem issue need not be considered.
> 
> [snip]
> 
> > No.  I'm saying that when you go back from the image-saving kernel to the
> > hibernated kernel, you need to make sure that no task will cause any
> > filesystem's on-disk state to be actually updated.  If you can't make such
> > a guarantee, you just can't do that.
> 
> > With the current state of the drivers, it's not doable without the
> > freezer.
> 
> It seems that it should be feasible to fix the drivers so that
> 
> 1. they can be taken from normal state to quiesced state without
>    requiring the freezer;
> 
> 2. they can be taken from normal state to low power state without
>    requiring the freezer;

Yes, that's correct.

> 3. they can be taken from quiesced state to low power state without
>    requiring the freezer.
>
> In particular, it seems that it should be possible to do (3) without
> needing to schedule tasks.

For that, you'd have to forbid the drivers from calling schedule() in the
relevant callbacks, which means, e.g., no timeouts in there.

> It seems likely that (2) may in fact be almost exactly the same as, or
> at least similar to, (1) followed by (3), at least for many drivers.
> (1) is required by the kexec hibernate approach even ignoring suspend to
> both or S4.  (2) is required for suspend to ram without the freezer,
> which seems to be desired anyway.

Yes, (2) is needed anyway.
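
The three transitions listed above can be pictured as a small state machine.
What follows is only a toy model in userspace Python, with made-up names
(ToyDevice, quiesce, power_down, suspend); it illustrates the suggestion that
(2) might be composed from (1) followed by (3), and says nothing about how any
real driver is structured:

```python
from enum import Enum


class DevState(Enum):
    NORMAL = 1
    QUIESCED = 2
    LOW_POWER = 3


class ToyDevice:
    """Hypothetical device model; not a real kernel structure."""

    def __init__(self):
        self.state = DevState.NORMAL

    def _move(self, expected, new):
        # Refuse any transition that does not start from the expected state.
        if self.state is not expected:
            raise RuntimeError(f"cannot go from {self.state.name} to {new.name}")
        self.state = new

    def quiesce(self):       # transition (1): stop I/O, keep power
        self._move(DevState.NORMAL, DevState.QUIESCED)

    def power_down(self):    # transition (3): quiesced -> low power
        self._move(DevState.QUIESCED, DevState.LOW_POWER)

    def suspend(self):       # transition (2), composed from (1) and (3)
        self.quiesce()
        self.power_down()
```

In this model, calling suspend() on a fresh ToyDevice leaves it in LOW_POWER,
while calling power_down() directly from NORMAL is rejected.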

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:39                                     ` david
@ 2007-07-20 22:22                                       ` Rafael J. Wysocki
  2007-07-20 22:39                                         ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 22:22 UTC (permalink / raw)
  To: david
  Cc: Jim Crilly, Milton Miller, linux-pm, LKML, Alan Stern, Huang,
	Ying, Jeremy Maitin-Shepard

On Friday, 20 July 2007 23:39, david@lang.hm wrote:
> On Fri, 20 Jul 2007, Rafael J. Wysocki wrote:
> 
> > On Friday, 20 July 2007 17:36, david@lang.hm wrote:
> >> On Fri, 20 Jul 2007, Jim Crilly wrote:
> >>
> >>>>> has
> >>>>> requested the image to be not greater than 50% of RAM.  In that case you
> >>>>> have
> >>>>> to free some memory _before_ identifying memory to save and you must not
> >>>>> race with applications that attempt to allocate memory while you're doing
> >>>>> it.
> >>>>
> >>>> I disagree a little bit.
> >>>>
> >>>> first off, only the suspending kernel can know what can be freed and what
> >>>> is needed to do so (remember this is kernel internals, it can change from
> >>>> patch to patch, let alone version to version)
> >>>>
> >>>> second, if you have a lot of memory to free, and you can't just throw away
> >>>> caches to do so, you don't know what is going to be involved in freeing
> >>>> the memory, it's very possible that it is going to involve userspace, so
> >>>> you can't freeze any significant portion of the system, so you can't
> >>>> eliminate all chance of races
> >>>>
> >>>> what you can do is
> >>>>
> >>>> 1. try to free stuff
> >>>> 2. stop the system and account for memory, is enough free
> >>>> if not goto 1
> >>>>
> >>>> if userspace is dirtying memory fast enough, or is just using enough
> >>>> memory that you can't meet your limit you just won't be able to suspend.
> >>>>
> >>>> but under any other conditions you will eventually get enough memory free.
> >>>>
> >>>> so try several times and if you still fail tell the user they have too
> >>>> much stuff running and they need to kill something.
> >>>
> >>> Which would be a pretty big regression from what we have now. With the
> >>> current implementation I can hibernate under virtually any workload because
> >>> the freezer stops everything and there's no competition for resources.
> >>
> >> as long as what you are trying to save is <=50% of ram (at least with some
> >> implementations). if you are trying to save more than 50% of ram with some
> >> current implementations you just can't
> >
> > With some, you can't, with the others, you can. :-)
> >
> > The argument given was about the freezer and IMO it was valid.
> >
> > Why didn't you address it directly?
> 
> I thought it had been covered in other messages (with as big as this 
> thread is I'm trying to avoid repeating the same thing more than a couple 
> times a day :-)
> 
> there was another message talking about ways that you could reduce the 
> image size without it being racy (allocate pinned memory until the 
> remainder is small enough, then don't back up the pinned memory)
> 
> that's a much cleaner answer than what I was thinking, so I'll go with it 
> instead ;-)

Wouldn't that cause the OOM killer to act, in some cases?

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:35                                         ` Oliver Neukum
@ 2007-07-20 22:25                                           ` Alan Stern
  2007-07-23 14:23                                             ` Oliver Neukum
  2007-08-01  9:34                                             ` [linux-pm] Re: Hibernation considerations Pavel Machek
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-20 22:25 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Milton Miller, Ying Huang, LKML, Rafael J. Wysocki, David Lang,
	linux-pm, Jeremy Maitin-Shepard

On Fri, 20 Jul 2007, Oliver Neukum wrote:

> > We already have a pre-suspend notification available for drivers that 
> > need to allocate large amounts of memory.
> 
> Is that facility fine grained enough?

It's a notifier chain that gets called at several points during the 
suspend transition.  One of those points is right at the start, while 
userspace is still running and reasonably large amounts of memory can 
be allocated.

Is it fine-grained enough?  I don't know -- hard to tell, since nothing 
much is using it yet.
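
As a rough illustration of the notifier-chain idea described above, here is a
userspace Python toy model. The names NotifierChain, ToyDriver, PREPARE and
POST are invented for the sketch and are not the kernel's own interface; the
only point is that a "prepare" event fires early enough for a driver to
allocate memory, and a matching event fires once the transition is over:

```python
PREPARE = "prepare"   # hypothetical event: transition is starting
POST = "post"         # hypothetical event: transition is finished


class NotifierChain:
    """Ordered list of callbacks invoked with an event name."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def call(self, event):
        for callback in self._callbacks:
            callback(event)


class ToyDriver:
    """Driver that needs a large buffer while the system is suspending."""

    def __init__(self, nbytes):
        self.nbytes = nbytes
        self.buffer = None

    def pm_event(self, event):
        if event == PREPARE:
            # Allocation is still safe here: userspace is running.
            self.buffer = bytearray(self.nbytes)
        elif event == POST:
            self.buffer = None


def suspend_transition(chain, do_suspend):
    """Notify, run the transition, then notify again even on failure."""
    chain.call(PREPARE)
    try:
        do_suspend()
    finally:
        chain.call(POST)
```

A driver would register its pm_event method on the chain once, and the buffer
then exists only for the duration of suspend_transition().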

> > You are correct about the need to delay/stop device addition.  I don't
> > know how this can be done in general; each code path calling
> > device_add() may have to be treated individually.
> 
> What about the old API?

What old API do you mean?

>  Do we have to block module loading?

No.  Registering new drivers is okay, registering new devices is bad.

Of course, some modules do want to register a new device in their init 
method.  I don't know what we should do about them.  Force the 
registration to fail, I suppose.  How often will people suspend while a 
module is loading?

> What happens if a scsi error handler is woken? If it cannot be woken,
> how are errors handled?

Why should the error handler wake up?  There isn't supposed to be any 
I/O going on, hence no errors to handle.

Alan Stern


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:37                                     ` Jeremy Maitin-Shepard
@ 2007-07-20 22:35                                       ` Alan Stern
  2007-07-20 22:43                                         ` david
  2007-07-20 22:48                                         ` Jeremy Maitin-Shepard
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-20 22:35 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: david, Milton Miller, Rafael J. Wysocki, Ying Huang, LKML, linux-pm

On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:

> >> when doing a suspend-to-ram you get to a point where you just don't use 
> >> any userspace.
> 
> > What do you mean?  How can you prevent user tasks from running?  That's 
> > basically what the freezer does, and the whole point of this approach 
> > is to eliminate the freezer.  Right?
> 
> Presumably no tasks at all would be scheduled.

How would you prevent tasks from being scheduled?  How would you
prevent drivers from deadlocking because in order to put their device
in a low-power state they need to acquire a lock which is held by a
user task?

> >> from that point on you are just walking the device tree 
> >> putting things into low-power mode. This is the point where we are talking 
> >> about jumping to.
> 
> > Yes.  And putting things into low-power mode requires the ability to 
> > run the scheduler, which means that user tasks can be scheduled, which 
> > means that they can run.
> 
> Does it really (fundamentally) require scheduling tasks, particularly in
> the case that the devices have already been put in the "quiesced" state?

I can't say for sure.  That's the way we have been doing it.  It
wouldn't be easy to change, because the driver would have to busy-wait
during delays -- which would mean it would need to use different code
for system-wide suspend and runtime suspend.
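
To make the busy-wait versus sleep distinction concrete, here is a userspace
Python sketch. The function wait_for_hardware and its may_sleep parameter are
hypothetical; in kernel terms the two branches correspond roughly to the
difference between msleep() and mdelay(). The sketch only shows why one code
path would need two behaviours depending on whether scheduling is allowed:

```python
import time


def wait_for_hardware(is_ready, timeout_s, may_sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    may_sleep=True  models a path where other tasks may be scheduled.
    may_sleep=False models a path that must spin without yielding.
    Returns True if the hardware became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            return False
        if may_sleep:
            time.sleep(0.001)   # yields the CPU between polls
        # otherwise fall through and poll again immediately (busy-wait)
    return True
```

A driver written this way would pass may_sleep=False only on the path where
nothing else is allowed to run, at the cost of burning CPU during the delay.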

Alan Stern


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:22                                       ` Rafael J. Wysocki
@ 2007-07-20 22:39                                         ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-20 22:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jim Crilly, Milton Miller, linux-pm, LKML, Alan Stern, Huang,
	Ying, Jeremy Maitin-Shepard

On Sat, 21 Jul 2007, Rafael J. Wysocki wrote:

> On Friday, 20 July 2007 23:39, david@lang.hm wrote:
>> On Fri, 20 Jul 2007, Rafael J. Wysocki wrote:
>>
>>> On Friday, 20 July 2007 17:36, david@lang.hm wrote:
>>>> On Fri, 20 Jul 2007, Jim Crilly wrote:
>>>>
>>>>>>> has
>>>>>>> requested the image to be not greater than 50% of RAM.  In that case you
>>>>>>> have
>>>>>>> to free some memory _before_ identifying memory to save and you must not
>>>>>>> race with applications that attempt to allocate memory while you're doing
>>>>>>> it.
>>>>>>
>>>>>> I disagree a little bit.
>>>>>>
>>>>>> first off, only the suspending kernel can know what can be freed and what
>>>>>> is needed to do so (remember this is kernel internals, it can change from
>>>>>> patch to patch, let alone version to version)
>>>>>>
>>>>>> second, if you have a lot of memory to free, and you can't just throw away
>>>>>> caches to do so, you don't know what is going to be involved in freeing
>>>>>> the memory, it's very possible that it is going to involve userspace, so
>>>>>> you can't freeze any significant portion of the system, so you can't
>>>>>> eliminate all chance of races
>>>>>>
>>>>>> what you can do is
>>>>>>
>>>>>> 1. try to free stuff
>>>>>> 2. stop the system and account for memory, is enough free
>>>>>> if not goto 1
>>>>>>
>>>>>> if userspace is dirtying memory fast enough, or is just using enough
>>>>>> memory that you can't meet your limit you just won't be able to suspend.
>>>>>>
>>>>>> but under any other conditions you will eventually get enough memory free.
>>>>>>
>>>>>> so try several times and if you still fail tell the user they have too
>>>>>> much stuff running and they need to kill something.
>>>>>
>>>>> Which would be a pretty big regression from what we have now. With the
>>>>> current implementation I can hibernate under virtually any workload because
>>>>> the freezer stops everything and there's no competition for resources.
>>>>
>>>> as long as what you are trying to save is <=50% of ram (at least with some
>>>> implementations). if you are trying to save more than 50% of ram with some
>>>> current implementations you just can't
>>>
>>> With some, you can't, with the others, you can. :-)
>>>
>>> The argument given was about the freezer and IMO it was valid.
>>>
>>> Why didn't you address it directly?
>>
>> I thought it had been covered in other messages (with as big as this
>> thread is I'm trying to avoid repeating the same thing more than a couple
>> times a day :-)
>>
>> there was another message talking about ways that you could reduce the
>> image size without it being racy (allocate pinned memory until the
>> remainder is small enough, then don't back up the pinned memory)
>>
>> that's a much cleaner answer than what I was thinking, so I'll go with it
>> instead ;-)
>
> Wouldn't that cause the OOM killer to act, in some cases?

only in the case where the image absolutely cannot be made small enough.

and this should be detectable by the process that's pinning memory (this 
can be a kernel process) so that it stops before the OOM killer is 
triggered, even if that means that it returns 'unable to fit'
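
A minimal sketch of that idea, as a userspace Python toy model working in
abstract page counts. The function name, its parameters and the one-page-at-a-
time accounting are all assumptions made for illustration; real reclaim is far
more involved. The model pins free pages, lets the resulting pressure evict
reclaimable pages, and gives up before an allocation would need the OOM killer:

```python
def shrink_image_by_pinning(total, in_use, reclaimable, limit, reserve=0):
    """Pin pages until the pages left to save (in_use) fit within limit.

    total       -- all pages in the toy system
    in_use      -- pages that would have to go into the image
    reclaimable -- part of in_use that can be evicted (e.g. caches)
    limit       -- maximum image size in pages
    reserve     -- free pages that must never be pinned

    Returns the number of pinned pages, or None for 'unable to fit'.
    """
    pinned = 0
    while in_use > limit:
        free = total - in_use - pinned
        if free > reserve:
            pinned += 1          # take a free page; it will not be saved
        elif reclaimable > 0:
            reclaimable -= 1     # memory pressure evicts one cache page
            in_use -= 1
        else:
            return None          # stop here instead of triggering OOM
    return pinned
```

With 100 pages, 70 in use and a 50-page limit, the model succeeds when 30 of
the used pages are reclaimable and reports 'unable to fit' when only 10 are.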

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:35                                       ` Alan Stern
@ 2007-07-20 22:43                                         ` david
  2007-07-21  5:21                                           ` Nigel Cunningham
  2007-07-21 14:10                                           ` Alan Stern
  2007-07-20 22:48                                         ` Jeremy Maitin-Shepard
  1 sibling, 2 replies; 220+ messages in thread
From: david @ 2007-07-20 22:43 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Fri, 20 Jul 2007, Alan Stern wrote:

> On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
>
>>>> when doing a suspend-to-ram you get to a point where you just don't use
>>>> any userspace.
>>
>>> What do you mean?  How can you prevent user tasks from running?  That's
>>> basically what the freezer does, and the whole point of this approach
>>> is to eliminate the freezer.  Right?
>>
>> Presumably no tasks at all would be scheduled.
>
> How would you prevent tasks from being scheduled?  How would you
> prevent drivers from deadlocking because in order to put their device
> in a low-power state they need to acquire a lock which is held by a
> user task?

you give up on the suspend because you have no way of getting the user 
task to give up the lock.

however, kernel locks should not be held by user tasks, user tasks are not 
expected to behave in rational ways, allowing them to compete with kernel 
tasks for locks is a sure way to get a deadlock or indefinite stall.

what locks are accessed this way?

>>>> from that point on you are just walking the device tree
>>>> putting things into low-power mode. This is the point where we are talking
>>>> about jumping to.
>>
>>> Yes.  And putting things into low-power mode requires the ability to
>>> run the scheduler, which means that user tasks can be scheduled, which
>>> means that they can run.
>>
>> Does it really (fundamentally) require scheduling tasks, particularly in
>> the case that the devices have already been put in the "quiesced" state?
>
> I can't say for sure.  That's the way we have been doing it.  It
> wouldn't be easy to change, because the driver would have to busy-wait
> during delays -- which would mean it would need to use different code
> for system-wide suspend and runtime suspend.

please define terms so that we are all on the same page

what do you mean by
system-wide suspend
runtime suspend

David Lang

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:35                                       ` Alan Stern
  2007-07-20 22:43                                         ` david
@ 2007-07-20 22:48                                         ` Jeremy Maitin-Shepard
  1 sibling, 0 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 22:48 UTC (permalink / raw)
  To: Alan Stern
  Cc: david, Milton Miller, Rafael J. Wysocki, Ying Huang, LKML, linux-pm

Alan Stern <stern@rowland.harvard.edu> writes:

> On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
>> >> when doing a suspend-to-ram you get to a point where you just don't use 
>> >> any userspace.
>> 
>> > What do you mean?  How can you prevent user tasks from running?  That's 
>> > basically what the freezer does, and the whole point of this approach 
>> > is to eliminate the freezer.  Right?
>> 
>> Presumably no tasks at all would be scheduled.

> How would you prevent tasks from being scheduled?  How would you
> prevent drivers from deadlocking because in order to put their device
> in a low-power state they need to acquire a lock which is held by a
> user task?

Perhaps this isn't an issue once the device is already quiesced.  I'm
just conjecturing.

[snip]

-- 
Jeremy Maitin-Shepard

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:43                                         ` david
@ 2007-07-21  5:21                                           ` Nigel Cunningham
  2007-07-21 14:10                                           ` Alan Stern
  1 sibling, 0 replies; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-21  5:21 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, Jeremy Maitin-Shepard, Milton Miller,
	Rafael J. Wysocki, Ying Huang, LKML, linux-pm

Hi.

On Saturday 21 July 2007 08:43:20 david@lang.hm wrote:
> On Fri, 20 Jul 2007, Alan Stern wrote:
> 
> > On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
> >
> >>>> when doing a suspend-to-ram you get to a point where you just don't use
> >>>> any userspace.
> >>
> >>> What do you mean?  How can you prevent user tasks from running?  That's
> >>> basically what the freezer does, and the whole point of this approach
> >>> is to eliminate the freezer.  Right?
> >>
> >> Presumably no tasks at all would be scheduled.
> >
> > How would you prevent tasks from being scheduled?  How would you
> > prevent drivers from deadlocking because in order to put their device
> > in a low-power state they need to acquire a lock which is held by a
> > user task?
> 
> you give up on the suspend because you have no way of getting the user 
> task to give up the lock.
> 
> however, kernel locks should not be held by user tasks, user tasks are not 
> expected to behave in rational ways, allowing them to compete with kernel 
> tasks for locks is a sure way to get a deadlock or indefinite stall.
> 
> what locks are accessed this way?

Any userspace process can do a syscall. In the process of the syscall, it can 
take kernel locks, and it can schedule (eg, while seeking to take a second 
lock).
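
The situation can be modelled in a few lines of userspace Python. This is a
toy, with invented names (syscall_body, driver_suspend) and a threading.Lock
standing in for a kernel lock: a task inside a syscall takes the lock and then
stops running, and a suspend callback that needs the same lock cannot make
progress until the task is allowed to run again:

```python
import threading


def syscall_body(lock):
    """Generator modelling a task inside a syscall.

    The yield marks the point where the task calls schedule() while
    still holding the lock; it only releases the lock once resumed.
    """
    lock.acquire()
    try:
        yield "descheduled with lock held"
    finally:
        lock.release()


def driver_suspend(lock):
    """Suspend callback needing the same lock.

    Uses a non-blocking attempt so the model reports the conflict
    instead of hanging the way a blocking acquire would.
    """
    if not lock.acquire(blocking=False):
        return "would block"
    lock.release()
    return "suspended"
```

Advancing the generator once with next() leaves the lock held, and
driver_suspend() then returns "would block"; resuming the task with
next(task, None) releases the lock, after which driver_suspend() succeeds.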

Regards,

Nigel

* Re: Hibernation considerations
  2007-07-17  4:45             ` david
  2007-07-17 14:15               ` Alan Stern
@ 2007-07-21 10:17               ` Pavel Machek
  1 sibling, 0 replies; 220+ messages in thread
From: Pavel Machek @ 2007-07-21 10:17 UTC (permalink / raw)
  To: david
  Cc: Alan Stern, Rafael J. Wysocki, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, pm list, Al Boldi

Hi!

> >Part of the problem here is that ACPI already has its own terminology,
> >and you're trying to invent a new one instead of using the existing
> >one.
> >
> >I agree, it would be good to have a non-ACPI-specific hibernation mode,
> >something which would look to ACPI like a normal shutdown.  But I'm not
> >so sure this is possible.
> 
> why would it not be possible?
> 
> >You have to understand that the ACPI spec is weird and complex.  The
> >mere fact that you have written a system image to disk changes the way
> >ACPI regards the shutdown procedure.  Even though you may treat all the
> >devices and the rest of the hardware exactly the same, it's a different
> >operation as far as ACPI is concerned, with different requirements.
> >
> >Yes, it's bizarre.  Why do you think so many people have complained so
> >vehemently about ACPI for all these years?
> 
> so let's act as if ACPI doesn't exist and make a suspend-to-disk that 
> works without it and looks to ACPI like a complete power off/on cycle (but 
> looks to the user like a suspend/resume cycle)

...if you act as if ACPI does not exist, you'll lose AC/battery power
status upon resume, and maybe more, because ACPI handles that.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Hibernation considerations
  2007-07-17 20:18                           ` david
  2007-07-17 20:39                             ` Jeremy Maitin-Shepard
  2007-07-17 20:57                             ` Rafael J. Wysocki
@ 2007-07-21 10:25                             ` Pavel Machek
  2007-07-21 15:35                               ` Jeremy Maitin-Shepard
  2007-08-01 16:58                             ` Stefan Seyfried
  3 siblings, 1 reply; 220+ messages in thread
From: Pavel Machek @ 2007-07-21 10:25 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, pm list, Al Boldi

Hi!

> from my point of view the ACPI S4 sleep mode has far more in common with 
> suspend-to-ram then with the suspend-to-disk that I'm talking about
> 
> non-ACPI hibernate
> 
>   since the box powers off
>     it uses zero power while suspended
>     another OS could be run before a resume
>     hardware can be swapped, suspend image could be sent around the world 
>     to be restored on another system.
>     restore makes no assumptions about the state of the hardware when it is 
>     restored
>     restore is slower (full BIOS boot is required)
>   should be able to work on just about any hardware (the limit is the 
>   ability to initialize the devices)

So it will break at least battery status and "AC plugged in"
status, because those are handled by ACPI and we do not know how to
control them by hand.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 21:02                               ` Rafael J. Wysocki
@ 2007-07-21 11:44                                 ` Miklos Szeredi
  2007-07-21 12:43                                   ` Nigel Cunningham
  0 siblings, 1 reply; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-21 11:44 UTC (permalink / raw)
  To: rjw; +Cc: miltonm, stern, ying.huang, linux-kernel, david, linux-pm, jbms

> The problem with FUSE is related to the fact that the freezer can't
> freeze uninterruptible tasks and we said that perhaps we might avoid
> it if FUSE was made freezing-aware.  Still, no one has gone in this
> direction and I don't know of any plans to do that.

I thought we had fully explored this direction.  Lots of emails, and
an IRC session with Pavel.  Conclusion:

 - It can't be done without VFS surgery + adding various hacks to fuse

 - VFS surgery for the sake of a working suspend is not realistic

Although removing the freezer seems the cleanest solution, I'm not
saying the freezer can't be fixed up in the mean time.

Allowing tasks to remain in uninterruptible sleep seemed a nice way to
get around the fuse issues.  What was the problem with that patch?  It
was something that was supposed to have been tested in suspend2,
wasn't it?

The other one (trying to wake up tasks so that other tasks may become
freezable) didn't seem such a good approach to me.

The theory is quite simple: while and after suspending devices, no
tasks must be touching said devices.

The very cleanest way to do this is in the drivers.  The very simplest
way is the current freezer.  But maybe there are possibilities
between these two extremes.

But I can almost guarantee you that any attempt at fixing the issues
through fuse will just result in an even bigger mess than what we
currently have.

Miklos


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 11:44                                 ` Miklos Szeredi
@ 2007-07-21 12:43                                   ` Nigel Cunningham
  2007-07-21 13:56                                     ` Alan Stern
                                                       ` (2 more replies)
  0 siblings, 3 replies; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-21 12:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: rjw, miltonm, stern, ying.huang, linux-kernel, david, linux-pm, jbms


Hi.

On Saturday 21 July 2007 21:44:32 Miklos Szeredi wrote:
> > The problem with FUSE is related to the fact that the freezer can't
> > freeze uninterruptible tasks and we said that perhaps we might avoid
> > it if FUSE was made freezing-aware.  Still, no one has gone in this
> > direction and I don't know of any plans to do that.
> 
> I thought we have fully explored this direction.  Lots of emails, and
> an IRC session with Pavel.  Conclusion:

What am I missing in the following suggested solution?

1) In the freezer code, we implement a new TIF_LATEFREEZE process flag, which, 
when set, causes a userspace process to be frozen with kernel threads 
instead of with userspace ones. When freezing, we freeze !TIF_LATEFREEZE tasks, 
sync and then freeze TIF_LATEFREEZE tasks and freezable kernel threads.

2) In the fuse code, the PID of the process that will do the work gets passed 
to the fuse kernel code when the mount is done. The kernel code sets the 
TIF_LATEFREEZE flag, and resets it on umount.
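
A toy userspace model of the ordering proposed above may make it easier to
poke holes in.  It is a sketch in Python, not kernel code: Task, late_freeze
and freeze_all are names invented for illustration, and the sync callback
only stands in for flushing dirty data while the flagged daemons still run.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kernel_thread: bool = False
    late_freeze: bool = False   # models the proposed TIF_LATEFREEZE flag
    frozen: bool = False


def freeze_all(tasks, sync):
    """Freeze in two phases and return the order in which tasks were frozen."""
    order = []

    # Phase 1: ordinary userspace tasks (flag not set).
    for t in tasks:
        if not t.kernel_thread and not t.late_freeze:
            t.frozen = True
            order.append(t.name)

    # Flush dirty data while the flagged daemons can still service requests.
    sync()

    # Phase 2: flagged userspace tasks together with freezable kernel threads.
    for t in tasks:
        if t.kernel_thread or t.late_freeze:
            t.frozen = True
            order.append(t.name)

    return order


tasks = [
    Task("editor"),
    Task("sshfs", late_freeze=True),
    Task("kjournald", kernel_thread=True),
    Task("shell"),
]
# Record which tasks are still running at the moment sync is called.
running_at_sync = []
order = freeze_all(
    tasks,
    sync=lambda: running_at_sync.extend(t.name for t in tasks if not t.frozen),
)
print(order)
print(running_at_sync)
```

In this model the flagged daemon is still running when sync is called, which
is the whole point of the flag.  The model says nothing about one flagged
task depending on another.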

Sorry, but this is a hit-and-run email - I'm off to bed now.

Regards,

Nigel



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 12:43                                   ` Nigel Cunningham
@ 2007-07-21 13:56                                     ` Alan Stern
  2007-07-21 16:13                                     ` Jeremy Maitin-Shepard
  2007-08-01  9:19                                     ` Pavel Machek
  2 siblings, 0 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-21 13:56 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Miklos Szeredi, rjw, miltonm, ying.huang, linux-kernel, david,
	linux-pm, jbms

On Sat, 21 Jul 2007, Nigel Cunningham wrote:

> What am I missing in the following suggested solution?
> 
> 1) In the freezer code, we implement a new TIF_LATEFREEZE process flag, which, 
> when set, causes a userspace process to be frozen with kernel threads 
> instead of with userspace ones. When freezing, we freeze !TIF_LATEFREEZE tasks, 
> sync and then freeze TIF_LATEFREEZE tasks and freezable kernel threads.
> 
> 2) In the fuse code, the PID of the process that will do the work gets passed 
> to the fuse kernel code when the mount is done. The kernel code sets the 
> TIF_LATEFREEZE flag, and resets it on umount.

What happens when one FUSE filesystem makes use of another?  You'll 
still end up with unfreezable processes, except that now you won't 
detect them until the LATEFREEZE stage.
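
The failure can be shown with a very small model.  This is a hedged Python
sketch under simplifying assumptions: unfreezable, waits_on and the task
names are hypothetical, and a task counts as unfreezable exactly when the
task it is blocked on has already been frozen.

```python
def unfreezable(freeze_order, waits_on):
    """Return tasks that cannot freeze because they wait on a frozen task.

    freeze_order: task names in the order the freezer visits them.
    waits_on: maps a task to the task it is currently blocked on
              (uninterruptible sleep), if any.
    """
    frozen = set()
    stuck = []
    for task in freeze_order:
        blocker = waits_on.get(task)
        if blocker in frozen:
            # The request it waits for will never be answered, so the task
            # never reaches a point where it can be frozen.
            stuck.append(task)
        else:
            # Either not blocked, or the blocker is still running and will
            # eventually answer, after which the task can freeze.
            frozen.add(task)
    return stuck


# Hypothetical stack: "upper_fs" is a FUSE daemon whose backing store lives
# on another FUSE filesystem served by "lower_fs".
waits_on = {"upper_fs": "lower_fs"}

bad_order = unfreezable(["lower_fs", "upper_fs"], waits_on)
good_order = unfreezable(["upper_fs", "lower_fs"], waits_on)
print(bad_order, good_order)
```

Both daemons would carry the same flag, so nothing in the two-phase scheme
picks the second order over the first.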

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:43                                         ` david
  2007-07-21  5:21                                           ` Nigel Cunningham
@ 2007-07-21 14:10                                           ` Alan Stern
  2007-07-22  3:43                                             ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-21 14:10 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Fri, 20 Jul 2007 david@lang.hm wrote:

> > How would you prevent tasks from being scheduled?  How would you
> > prevent drivers from deadlocking because in order to put their device
> > in a low-power state they need to acquire a lock which is held by a
> > user task?
> 
> you give up on the suspend because you have no way of getting the user 
> task to give up the lock.

Once the deadlock has occurred it's too late.  You can't give up; in 
fact you can't do anything at all.  The system has hung.

> however, kernel locks should not be held by user tasks, user tasks are not 
> expected to behave in rational ways, allowing them to compete with kernel 
> tasks for locks is a sure way to get a deadlock or indefinite stall.

What on Earth are you talking about?  "Kernel locks should not be held 
by user tasks"?  Then who _should_ hold them?  You are aware, I hope, 
that down() and mutex_lock() can be called only in process context?

> what locks are accessed this way?

Lots of them.  For example, most drivers won't want a suspend to occur
right in the middle of an I/O transfer.  To prevent this, the driver
might use a mutex.  The task doing the I/O (which will be a user task)
acquires the mutex during a transfer and the suspend routine acquires
the mutex while quiescing the device.
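
The pattern can be modelled in userspace.  This is a hedged Python sketch,
not driver code: ToyDriver, transfer and suspend are invented names, a
threading.Lock stands in for the kernel mutex, and the timeout exists only
so that the demonstration terminates (a plain mutex_lock() would just block).

```python
import threading


class ToyDriver:
    def __init__(self):
        self.io_mutex = threading.Lock()
        self.suspended = False

    def transfer(self, started, finish):
        # The task doing the I/O holds the mutex for the whole transfer.
        with self.io_mutex:
            started.set()
            finish.wait()

    def suspend(self, timeout):
        # The suspend routine needs the same mutex to quiesce the device.
        if not self.io_mutex.acquire(timeout=timeout):
            return False        # the I/O task has not released the mutex
        try:
            self.suspended = True
            return True
        finally:
            self.io_mutex.release()


drv = ToyDriver()
started, finish = threading.Event(), threading.Event()
io_task = threading.Thread(target=drv.transfer, args=(started, finish))
io_task.start()
started.wait()

first_try = drv.suspend(timeout=0.1)    # transfer in flight: cannot quiesce
finish.set()                            # let the I/O task finish and unlock
io_task.join()
second_try = drv.suspend(timeout=0.1)   # the mutex is free now
print(first_try, second_try)
```

If the I/O task were kept from running while it held the mutex, the first
call would never succeed without the timeout, which is the deadlock in
question.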

> >> Does it really (fundamentally) require scheduling tasks, particularly in
> >> the case that the devices have already been put in the "quiesced" state?
> >
> > I can't say for sure.  That's the way we have been doing it.  It
> > wouldn't be easy to change, because the driver would have to busy-wait
> > during delays -- which would mean it would need to use different code
> > for system-wide suspend and runtime suspend.
> 
> please define terms so that we are all on the same page

Please read Documentation/power/devices.txt.

> what do you mean by
> system-wide suspend

That's what you would call standby, suspend-to-RAM, or hibernate.  The
entire system goes to sleep.

> runtime suspend

That's when an individual device is placed in a low-power state to 
save energy while it isn't being used.  The system as a whole remains 
awake and the device will be resumed the next time it is needed for 
anything.

Alan Stern



* Re: Hibernation considerations
  2007-07-21 10:25                             ` Pavel Machek
@ 2007-07-21 15:35                               ` Jeremy Maitin-Shepard
  2007-07-21 17:56                                 ` Pavel Machek
  0 siblings, 1 reply; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-21 15:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: david, Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	pm list, Al Boldi

Pavel Machek <pavel@ucw.cz> writes:

[snip]

> So it will break at least battery status and "AC plugged in"
> status, because those are handled by ACPI and we do not know how to
> control them by hand.

It seems that it should be possible to initialize ACPI as if the system
just booted up normally.  Then battery status and such should be
correct, since they are correct after normal initialization.

It should be possible to make hibernate look just like a reboot to all
of the devices, including ACPI stuff.

-- 
Jeremy Maitin-Shepard


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 12:43                                   ` Nigel Cunningham
  2007-07-21 13:56                                     ` Alan Stern
@ 2007-07-21 16:13                                     ` Jeremy Maitin-Shepard
  2007-07-21 18:12                                       ` Miklos Szeredi
  2007-07-21 22:16                                       ` Nigel Cunningham
  2007-08-01  9:19                                     ` Pavel Machek
  2 siblings, 2 replies; 220+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-21 16:13 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Miklos Szeredi, rjw, miltonm, stern, ying.huang, linux-kernel,
	david, linux-pm

It seems that you could still potentially get a failure to freeze if one
FUSE process depends on another, and the one that is frozen second just
happens to be waiting on the one that is frozen first when it is frozen.
I admit that this situation is unlikely, and perhaps acceptable.

A larger concern is that it seems that freezing FUSE processes at all
_will_ generate deadlocks if a non-synchronous or memory-map-supporting
filesystem is loopback mounted from a FUSE filesystem.  In that case, if
you attempt to sync or free memory once FUSE is frozen, you are sure to
get a deadlock.

-- 
Jeremy Maitin-Shepard


* Re: Hibernation considerations
  2007-07-21 15:35                               ` Jeremy Maitin-Shepard
@ 2007-07-21 17:56                                 ` Pavel Machek
  2007-07-21 19:35                                   ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Pavel Machek @ 2007-07-21 17:56 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: david, Rafael J. Wysocki, Alan Stern, LKML, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Kyle Moffett, Nigel Cunningham,
	pm list, Al Boldi

Hi!

> Pavel Machek <pavel@ucw.cz> writes:
> 
> [snip]
> 
> > So it will break at least battery status and "AC plugged in"
> > status, because those are handled by ACPI and we do not know how to
> > control them by hand.
> 
> It seems that it should be possible to initialize ACPI as if the system
> just booted up normally.  Then battery status and such should be
> correct, since they are correct after normal initialization.
> 
> It should be possible to make hibernate look just like a reboot to all
> of the devices, including ACPI stuff.

A patch to make that work with the swsusp/shutdown method would indeed be
welcome. It does not work today.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 16:13                                     ` Jeremy Maitin-Shepard
@ 2007-07-21 18:12                                       ` Miklos Szeredi
  2007-07-21 19:20                                         ` Rafael J. Wysocki
  2007-07-21 22:21                                         ` Nigel Cunningham
  2007-07-21 22:16                                       ` Nigel Cunningham
  1 sibling, 2 replies; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-21 18:12 UTC (permalink / raw)
  To: jbms
  Cc: nigel, miklos, rjw, miltonm, stern, ying.huang, linux-kernel,
	david, linux-pm

> It seems that you could still potentially get a failure to freeze if one
> FUSE process depends on another, and the one that is frozen second just
> happens to be waiting on the one that is frozen first when it is frozen.
> I admit that this situation is unlikely, and perhaps acceptable.

It isn't all that unlikely.  There's sshfs for example, that depends
on a separate ssh process for transport.

Oh, there are also userspace network transports, like tun/tap,
nfqueue, etc.  They could block any network filesystem (not just fuse)
if frozen first, making the freezer fail.

Hmm, wonder why this isn't affecting people with VPNs?  Probably
network mounts over VPN are rare, and even rarer to have fs activity
on them during suspend.

Anyway, I think it's long overdue to stop thinking about how to "fix"
fuse, and concentrate on fixing the underlying problem instead ;)

> A larger concern is that it seems that freezing FUSE processes at all
> _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> you attempt to sync or free memory once FUSE is frozen, you are sure to
> get a deadlock.

Well, it would deadlock, if

 a) memory reclaim was synchronous, or
 b) a large part of the memory was used for dirty file data

I can't remember if (a) was ever true.  And now the dirty ratio is 10%
by default, so if we go OOM because that 10% can't be reclaimed, there
is a more serious problem.

Swap over loop over fuse would be problematic, but that won't work for
some time yet ;)

Miklos


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 18:12                                       ` Miklos Szeredi
@ 2007-07-21 19:20                                         ` Rafael J. Wysocki
  2007-08-01  9:22                                           ` Pavel Machek
  2007-07-21 22:21                                         ` Nigel Cunningham
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-21 19:20 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: jbms, nigel, miltonm, stern, ying.huang, linux-kernel, david, linux-pm

On Saturday, 21 July 2007 20:12, Miklos Szeredi wrote:
> > It seems that you could still potentially get a failure to freeze if one
> > FUSE process depends on another, and the one that is frozen second just
> > happens to be waiting on the one that is frozen first when it is frozen.
> > I admit that this situation is unlikely, and perhaps acceptable.
> 
> It isn't all that unlikely.  There's sshfs for example, that depends
> on a separate ssh process for transport.
> 
> Oh, there are also userspace network transports, like tun/tap,
> nfqueue, etc.  They could block any network filesystem (not just fuse)
> if frozen first, making the freezer fail.
> 
> Hmm, wonder why this isn't affecting people with VPNs?  Probably
> network mounts over VPN are rare, and even rarer to have fs activity
> on them during suspend.
> 
> Anyway, I think it's long overdue to stop thinking about how to "fix"
> fuse, and concentrate on fixing the underlying problem instead ;)

To conclude this branch of the thread, I have a patch in the works that may
help a bit with unfreezable FUSE filesystems and it only affects the freezer.
I'll post it when 2.6.23-rc1 is out, because it's on top of some other patches
that need to go first.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-21 17:56                                 ` Pavel Machek
@ 2007-07-21 19:35                                   ` david
  2007-07-21 19:49                                     ` Pavel Machek
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-21 19:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jeremy Maitin-Shepard, Rafael J. Wysocki, Alan Stern, LKML,
	Andrew Morton, Eric W. Biederman, Huang, Ying, Kyle Moffett,
	Nigel Cunningham, pm list, Al Boldi

On Sat, 21 Jul 2007, Pavel Machek wrote:

> Hi!
>
>> Pavel Machek <pavel@ucw.cz> writes:
>>
>> [snip]
>>
>>> So it will break at least battery status and "AC plugged in"
>>> status, because those are handled by ACPI and we do not know how to
>>> control them by hand.
>>
>> It seems that it should be possible to initialize ACPI as if the system
>> just booted up normally.  Then battery status and such should be
>> correct, since they are correct after normal initialization.
>>
>> It should be possible to make hibernate look just like a reboot to all
>> of the devices, including ACPI stuff.
>
> Patch to make that work with swsusp/shutdown method would be indeed
> welcome. It does not work today.

is this a problem in the restore path?

with the kexec approach (and ignoring suspend to ram and disk for the 
moment) the system will actually get shutdown completely after the image 
is written. on resume it gets cold booted. at this point the ACPI stuff 
should have no problem

now if the ACPI drivers are storing something in ram about the battery 
status and AC power status, but don't re-check after the resume, it seems 
to me that they can't possibly be reliable anyway. if they do re-check 
after the resume then where's the problem?

David Lang


* Re: Hibernation considerations
  2007-07-21 19:35                                   ` david
@ 2007-07-21 19:49                                     ` Pavel Machek
  2007-07-21 22:14                                       ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Pavel Machek @ 2007-07-21 19:49 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Rafael J. Wysocki, Alan Stern, LKML,
	Andrew Morton, Eric W. Biederman, Huang, Ying, Kyle Moffett,
	Nigel Cunningham, pm list, Al Boldi

Hi!

> >>>So it will break at least battery status and "AC plugged in"
> >>>status, because those are handled by ACPI and we do not know how to
> >>>control them by hand.
> >>
> >>It seems that it should be possible to initialize ACPI as if the system
> >>just booted up normally.  Then battery status and such should be
> >>correct, since they are correct after normal initialization.
> >>
> >>It should be possible to make hibernate look just like a reboot to all
> >>of the devices, including ACPI stuff.
> >
> >Patch to make that work with swsusp/shutdown method would be indeed
> >welcome. It does not work today.
> 
> is this a problem in the restore path?
> 
> with the kexec approach (and ignoring suspend to ram and disk for the 
> moment) the system will actually get shutdown completely after the image 
> is written. on resume it gets cold booted. at this point the ACPI stuff 
> should have no problem
> 
> now if the ACPI drivers are storing something in ram about the battery 
> status and AC power status, but don't re-check after the resume, it
> seems

That seems to be the problem. They store something in ram, and we
don't tell them that we resumed. That's why platform mode is
important, and the way to go on ACPI systems.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: Hibernation considerations
  2007-07-21 19:49                                     ` Pavel Machek
@ 2007-07-21 22:14                                       ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-21 22:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jeremy Maitin-Shepard, Rafael J. Wysocki, Alan Stern, LKML,
	Andrew Morton, Eric W. Biederman, Huang, Ying, Kyle Moffett,
	Nigel Cunningham, pm list, Al Boldi

On Sat, 21 Jul 2007, Pavel Machek wrote:

>>>>> So it will break at least battery status and "AC plugged in"
>>>>> status, because those are handled by ACPI and we do not know how to
>>>>> control them by hand.
>>>>
>>>> It seems that it should be possible to initialize ACPI as if the system
>>>> just booted up normally.  Then battery status and such should be
>>>> correct, since they are correct after normal initialization.
>>>>
>>>> It should be possible to make hibernate look just like a reboot to all
>>>> of the devices, including ACPI stuff.
>>>
>>> Patch to make that work with swsusp/shutdown method would be indeed
>>> welcome. It does not work today.
>>
>> is this a problem in the restore path?
>>
>> with the kexec approach (and ignoring suspend to ram and disk for the
>> moment) the system will actually get shutdown completely after the image
>> is written. on resume it gets cold booted. at this point the ACPI stuff
>> should have no problem
>>
>> now if the ACPI drivers are storing something in ram about the battery
>> status and AC power status, but don't re-check after the resume, it
>> seems
>
> That seems to be the problem. They store something in ram, and we
> don't tell them that we resumed. That's why platform mode is
> important, and way to go on ACPI systems.

this sounds like the few drivers that do this sort of thing (and this 
should only be the things that report status, drivers that enable devices 
should be taken care of) need a 'forget what you think you know, check the 
reality of the hardware' function.

even without suspend I've seen these drivers get out of sync with reality, 
and so such a function call would be useful to get them back in sync in 
any case.

if such a 'check reality' function was available it should be 
straightforward to call it after a non-ACPI hibernate/resume
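
as a rough illustration of the idea (a Python sketch, not a proposal for the 
actual driver interface; CachedPowerStatus, status and recheck are made-up 
names, and the dictionary stands in for the real hardware):

```python
class CachedPowerStatus:
    """Toy status driver that caches what it last read from the hardware."""

    def __init__(self, read_hw):
        self._read_hw = read_hw          # callable returning the real state
        self._cache = read_hw()

    def status(self):
        return self._cache               # may be stale after a resume

    def recheck(self):
        """Forget the cached state and read the hardware again."""
        self._cache = self._read_hw()
        return self._cache


hw = {"ac_online": True}
drv = CachedPowerStatus(lambda: dict(hw))

hw["ac_online"] = False                  # unplugged while the box was off
stale = drv.status()                     # still reports the old state
fresh = drv.recheck()                    # what a post-resume hook would call
print(stale, fresh)
```

the resume path would only need to call the recheck step for each status 
driver; whether the real drivers can be reread this simply is the open 
question.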

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 16:13                                     ` Jeremy Maitin-Shepard
  2007-07-21 18:12                                       ` Miklos Szeredi
@ 2007-07-21 22:16                                       ` Nigel Cunningham
  2007-07-22 15:26                                         ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-21 22:16 UTC (permalink / raw)
  To: Jeremy Maitin-Shepard
  Cc: Miklos Szeredi, rjw, miltonm, stern, ying.huang, linux-kernel,
	david, linux-pm


Hi.

On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> It seems that you could still potentially get a failure to freeze if one
> FUSE process depends on another, and the one that is frozen second just
> happens to be waiting on the one that is frozen first when it is frozen.
> I admit that this situation is unlikely, and perhaps acceptable.
> 
> A larger concern is that it seems that freezing FUSE processes at all
> _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> you attempt to sync or free memory once FUSE is frozen, you are sure to
> get a deadlock.

Ok. So then (in response to Alan too), how about keeping a tree of mounts, 
akin to the device tree, and working from the deepest nodes up? (In 
conjunction with what I already suggested)?
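
For concreteness, "deepest nodes up" is a post-order walk of the mount tree. 
A minimal Python sketch, assuming the dependencies really do form a tree; 
freeze_order and the example mounts are hypothetical:

```python
def freeze_order(children, root):
    """Post-order walk: a mount is listed after every mount stacked on it."""
    order = []

    def visit(node):
        for child in children.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


# Hypothetical mount tree: key = mount, value = mounts that depend on it.
children = {
    "/": ["/home", "/mnt/sshfs"],
    "/mnt/sshfs": ["/mnt/sshfs/loop"],   # loopback image stored on the FUSE mount
}
order = freeze_order(children, "/")
print(order)
```

Here the loopback mount is handled before the FUSE mount that backs it. 
Dependencies that are not visible as mounts (a separate transport process, 
for instance) are outside this model.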

Regards,

Nigel
-- 
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 18:12                                       ` Miklos Szeredi
  2007-07-21 19:20                                         ` Rafael J. Wysocki
@ 2007-07-21 22:21                                         ` Nigel Cunningham
  1 sibling, 0 replies; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-21 22:21 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: jbms, rjw, miltonm, stern, ying.huang, linux-kernel, david, linux-pm


Hi.

On Sunday 22 July 2007 04:12:22 Miklos Szeredi wrote:
> > It seems that you could still potentially get a failure to freeze if one
> > FUSE process depends on another, and the one that is frozen second just
> > happens to be waiting on the one that is frozen first when it is frozen.
> > I admit that this situation is unlikely, and perhaps acceptable.
> 
> It isn't all that unlikely.  There's sshfs for example, that depends
> on a separate ssh process for transport.
> 
> Oh, there are also userspace network transports, like tun/tap,
> nfqueue, etc.  They could block any network filesystem (not just fuse)
> if frozen first, making the freezer fail.
> 
> Hmm, wonder why this isn't affecting people with VPNs?  Probably
> network mounts over VPN are rare, and even rarer to have fs activity
> on them during suspend.
> 
> Anyway, I think it's long overdue to stop thinking about how to "fix"
> fuse, and concentrate on fixing the underlying problem instead ;)

That's what I'm seeking to do :)

> > A larger concern is that it seems that freezing FUSE processes at all
> > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > get a deadlock.
> 
> Well, it would deadlock, if
> 
>  a) memory reclaim was synchronous, or
>  b) large part of the memory was used for dirty file data

These are problems in normal operation, aren't they?
 
> I can't remember if (a) was ever true.  And now the dirty ratio is 10%
> by default, so if we go OOM because that 10% can't be reclaimed, there
> is a more serious problem.
> 
> Swap over loop over fuse would be problematic, but that won't work for
> some time yet ;)

Hopefully people will wake up to the problems with Fuse and get rid of it 
before then :|. Of course I don't really expect that to happen.

Nigel
-- 
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 15:48                             ` david
@ 2007-07-22  2:17                               ` Huang, Ying
  2007-07-22  2:32                                 ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Huang, Ying @ 2007-07-22  2:17 UTC (permalink / raw)
  To: david
  Cc: Milton Miller, linux-pm, LKML, Rafael J.Wysocki, Alan Stern,
	Jeremy Maitin-Shepard

On Fri, 2007-07-20 at 08:48 -0700, david@lang.hm wrote:
> > Backing up target memory before kexec and restoring it after kexec is
> > a planned feature for kexec jump. But I will work on image writing/reading
> > first.
> 
> if we can get a list of what memory is safe to backup/restore then the 
> reading/writing of the image should be able to be done in userspace.

The backup/restore here has nothing to do with the read/write of the
image. It means that instead of preserving memory for a new kernel, as
crash-dump does, the memory for the new kernel is backed up before kexec
and restored after kexec by the kexec kernel.

> > If the "scatter copy" is replaced by "scatter swap", we need not the
> > inverse list, and the state of kexeced kernel can be backuped too. There
> > are "scatter copy" support in normal kexec implementation in
> > "relocate_kernel".
> 
> what do you mean by "scatter swap"

copy:	dest=src
swap:	tmp=dest; dest=src; src=tmp

If memory is swapped, no information is lost: neither that of the kexec
kernel nor that of the kexeced kernel.
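
A small model of the difference, as a Python sketch rather than the
relocate_kernel code: memory is a list of page contents, pairs is a made-up
list of (dest, src) page indices, and the pairs are assumed not to overlap.

```python
def scatter_copy(mem, pairs):
    """copy: dest = src for each pair; the old contents of dest are lost."""
    for dest, src in pairs:
        mem[dest] = mem[src]


def scatter_swap(mem, pairs):
    """swap: exchange dest and src for each pair; nothing is lost."""
    for dest, src in pairs:
        mem[dest], mem[src] = mem[src], mem[dest]


pairs = [(0, 4), (1, 5)]                 # (dest page, src page), illustrative
original = ["old0", "old1", "x", "y", "new0", "new1"]

copied = list(original)
scatter_copy(copied, pairs)

swapped = list(original)
scatter_swap(swapped, pairs)

restored = list(swapped)
scatter_swap(restored, pairs)            # the same list undoes the swap
print(copied)
print(swapped)
print(restored == original)
```

With copy, "old0" and "old1" are gone; with swap they end up in the source
pages, and running the same list again restores the original layout, so no
inverse list is needed.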

Best Regards,
Huang, Ying


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22  2:17                               ` Huang, Ying
@ 2007-07-22  2:32                                 ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-22  2:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Milton Miller, linux-pm, LKML, Rafael J.Wysocki, Alan Stern,
	Jeremy Maitin-Shepard

On Sun, 22 Jul 2007, Huang, Ying wrote:

> On Fri, 2007-07-20 at 08:48 -0700, david@lang.hm wrote:
>>> Backing up target memory before kexec and restoring it after kexec is
>>> a planned feature for kexec jump. But I will work on image writing/reading
>>> first.
>>
>> if we can get a list of what memory is safe to backup/restore then the
>> reading/writing of the image should be able to be done in userspace.
>
> The backup/restore here has nothing to do with the read/write of the
> image. It means that instead of preserving memory for a new kernel, as
> crash-dump does, the memory for the new kernel is backed up before kexec
> and restored after kexec by the kexec kernel.

Ok, I see the miscommunication here. you are talking about freeing up 
memory for the second kernel instead of reserving it from boot time.

I'm talking about getting the second kernel a list of what memory pages it 
should write to the image

if we can get the info for the list I'm looking for we should be able to 
demonstrate the kexec based hibernate.

the change you are talking about is an enhancement that is useful after 
that point to save some memory.

>>> If the "scatter copy" is replaced by "scatter swap", we need not the
>>> inverse list, and the state of kexeced kernel can be backuped too. There
>>> are "scatter copy" support in normal kexec implementation in
>>> "relocate_kernel".
>>
>> what do you mean by "scatter swap"
>
> copy:	dest=src
> swap:	tmp=dest; dest=src; src=tmp
>
> If memory is swapped, no information is lost, both that of kexec kernel
> and kexeced kernel.

I'm missing why you need to preserve this memory

if you are talking about memory that will be used by the second kernel 
when you kexec to it then you don't need to preserve it (since it will be 
overwritten by the second kernel). if you aren't talking about memory that 
will be used by the second kernel why do you need to move it?

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 14:10                                           ` Alan Stern
@ 2007-07-22  3:43                                             ` david
  2007-07-22 16:00                                               ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-22  3:43 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Sat, 21 Jul 2007, Alan Stern wrote:

> On Fri, 20 Jul 2007 david@lang.hm wrote:
>
>>> How would you prevent tasks from being scheduled?  How would you
>>> prevent drivers from deadlocking because in order to put their device
>>> in a low-power state they need to acquire a lock which is held by a
>>> user task?
>>
>> you give up on the suspend because you have no way of getting the user
>> task to give up the lock.
>
> Once the deadlock has occurred it's too late.  You can't give up; in
> fact you can't do anything at all.  The system has hung.
>
>> however, kernel locks should not be held by user tasks, user tasks are not
>> expected to behave in rational ways, allowing them to compete with kernel
>> tasks for locks is a sure way to get a deadlock or indefinite stall.
>
> What on Earth are you talking about?  "Kernel locks should not be held
> by user tasks"?  Then who _should_ hold them?  You are aware, I hope,
> that down() and mutex_lock() can be called only in process context?
>
>> what locks are accessed this way?
>
> Lots of them.  For example, most drivers won't want a suspend to occur
> right in the middle of an I/O transfer.  To prevent this, the driver
> might use a mutex.  The task doing the I/O (which will be a user task)
> acquires the mutex during a transfer and the suspend routine acquires
> the mutex while quiescing the device.

wait a minute here, it's possible we are misunderstanding each other.

as I see it.

if userspace can acquire locks that prevent the kernel from shutting off 
(or doing anything else in particular) then it's possible for misbehaving 
userspace code to stop the kernel by simply choosing to never release the 
lock.

this would be a trivial DoS from userspace.

now, if you are talking instead about the fact that when userspace makes a 
system call, the execution of that system call involves acquiring locks 
that are released before the system call completes you have a very 
different situation.

if you have locks that are held across system calls then you should 
already have problems, because you can't count on userspace ever taking 
whatever action is appropriate to release the lock.

what am I missing that concerns you so much?

>>>> Does it really (fundamentally) require scheduling tasks, particularly in
>>>> the case that the devices have already been put in the "quiesced" state?
>>>
>>> I can't say for sure.  That's the way we have been doing it.  It
>>> wouldn't be easy to change, because the driver would have to busy-wait
>>> during delays -- which would mean it would need to use different code
>>> for system-wide suspend and runtime suspend.
>>
>> please define terms so that we are all on the same page
>
> Please read Documentation/power/devices.txt.

I have done so.

>> what do you mean by
>> system-wide suspend
>
> That's what you would call standby, suspend-to-RAM, or hibernate.  The
> entire system goes to sleep.
>
>> runtime suspend
>
> That's when an individual device is placed in a low-power state to
> save energy while it isn't being used.  The system as a whole remains
> awake and the device will be resumed the next time it is needed for
> anything.

thanks for the definitions.

having read through Documentation/power/devices.txt I remain convinced 
that you are making a fundamental mistake.

you are designing a system that will only work if everything (every 
driver, every state transition) participates fully in the process at all 
times. You started with the facts 'this is the info that ACPI provides and 
this is how it is designed to be used' and worked from there instead of 
looking to see what the kernel really needed and figuring how to provide a 
good interface for that that happens to be implemented (today) with ACPI. 
(a proper power management framework shouldn't care if you have ACPI, APM, 
or some other method of controlling the devices)

this leads to resume functions that can only work if the proper suspend 
function was called rather than making 'resume' just mean 'go to full 
operation', which is the same thing that gets called when the device is 
first initialized. internally it can examine the hardware and follow 
different paths depending on what it finds the current state of the 
hardware is, but the outside world (including the rest of the kernel) 
should not care. the fact that the rest of the kernel needs to know if it 
should call 'resume' or 'initialize' is a failure in the abstraction.

in fact, a better abstraction would be something like

report_power_modes
   which would return a series of modes (sorted only by modeID)
   modeID, %power_used_in_this_mode, %capability_in_this_mode
   (I would make mode 0 always be complete power off, and mode 1 always be
   full capacity)

report_power_mode_speed
   which would return a matrix giving how long it takes to transition from
   any mode to any other mode. this should be a relative number, not an
   absolute number since it will be different at different clock speeds.

set_operational_mode(modeID)
   which would take you from whatever mode you are in now to the requested
   mode.

most devices would report the simple list of modes

0,0,0
1,100,100

with a mode_speed matrix of
  | 0 1
--+----
0 | 0 1
1 | 1 0

it may be that there is more info needed for the power management engine to 
decide what modes it wants to put things into, if so identify what type of 
info you need and add another column to the modes list.
for example:
   you may want to add a flag for 'does this mode allow downstream devices
   to operate?'
   you may want to make a mode for 'this mode doesn't allow any new
   requests, but continues to process pending requests' and have a flag that
   indicates this

currently it looks like there's no way to find out what modes are 
available, and you have to know what mode something is in currently before 
you can request it change to a different mode. both of these prevent 
effective power management without encoding intimate knowledge of the 
capability of the particular hardware in your management tool.

some of this may be discoverable via the ACPI interface (it's not talked 
about much in the devices.txt file), but the mode setting is still wrong.

note that in the example above it's acceptable for a driver to cache what 
mode it thinks the device is in, but it needs to properly set the new 
mode even if its cached data is incorrect.
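
A rough sketch of what the three calls proposed above might look like for the simple two-mode device. Everything here is hypothetical: the class names, the field names and the way hardware state is modelled are assumptions for illustration only, not an existing kernel interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    mode_id: int
    power_pct: int        # % power used in this mode
    capability_pct: int   # % capability available in this mode


class SimpleDevice:
    """Hypothetical two-mode device: mode 0 is off, mode 1 is full capacity."""

    def __init__(self):
        self.hw_mode = 0       # stands in for the real hardware state
        self.cached_mode = 0   # what the driver believes; may be stale

    def report_power_modes(self):
        return [PowerMode(0, 0, 0), PowerMode(1, 100, 100)]

    def report_power_mode_speed(self):
        # Relative cost of moving from the row mode to the column mode.
        return [[0, 1],
                [1, 0]]

    def set_operational_mode(self, mode_id):
        if mode_id not in {m.mode_id for m in self.report_power_modes()}:
            raise ValueError("unknown mode %d" % mode_id)
        # Never trust the cache: always program the hardware, so the call
        # works no matter what state the device was really in.
        self.hw_mode = mode_id
        self.cached_mode = mode_id


dev = SimpleDevice()
dev.cached_mode = 1            # simulate a stale cache: hardware is still off
dev.set_operational_mode(1)    # must still bring the hardware to mode 1
```

The point of the last two lines is the caching rule: the caller does not need to know the current mode, and a wrong cached value does not stop the transition.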

this approach would allow the transition of ALL drivers to the new mode of 
operation in one fell swoop, and then adding additional power management 
features is just adding to the existing list rather than implementing new 
functions.

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 22:16                                       ` Nigel Cunningham
@ 2007-07-22 15:26                                         ` Alan Stern
  2007-07-22 16:27                                           ` Miklos Szeredi
  2007-07-22 22:42                                           ` Nigel Cunningham
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-22 15:26 UTC (permalink / raw)
  To: nigel
  Cc: Jeremy Maitin-Shepard, Miklos Szeredi, rjw, miltonm, ying.huang,
	linux-kernel, david, linux-pm

On Sun, 22 Jul 2007, Nigel Cunningham wrote:

> Hi.
> 
> On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > It seems that you could still potentially get a failure to freeze if one
> > FUSE process depends on another, and the one that is frozen second just
> > happens to be waiting on the one that is frozen first when it is frozen.
> > I admit that this situation is unlikely, and perhaps acceptable.
> > 
> > A larger concern is that it seems that freezing FUSE processes at all
> > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > get a deadlock.
> 
> Ok. So then (in response to Alan too), how about keeping a tree of mounts, 
> akin to the device tree, and working from the deepest nodes up? (In 
> conjunction with what I already suggested)?

Face it, Nigel, this is a losing battle.  You can try to come up with
ever-more complex schemes to try and force FUSE into the freezer's
framework, but it just won't fit.  Or if it does, the next filesystem
to come along will require an even more baroque type of special-case 
handling.

The general problem is that task A may be in an unfreezable state,
waiting for task B to do something, while task B is already frozen.  
Since there's no reasonable way to determine that A really is waiting
for B, you're just stuck.  (To make matters worse, A may not even
realize which task it is waiting for; it may know only that it's
waiting for somebody to do something!)  A and B could be user tasks, 
kernel threads, or one of each.

The only thing to do is what Rafael has been working on: unfreeze
things, hope the tasks sort themselves out, and try again.
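
The stuck case and the retry idea can be illustrated with a toy model. This is a sketch under stated assumptions, not the freezer code: tasks are plain strings, the wait-for maps are made up, and wait_states stands in for whatever the tasks do once they are thawed between tries.

```python
def try_freeze(order, waits_on):
    """One freeze pass. waits_on maps a task to the task it is blocked on.

    A task blocked on an already frozen task can never reach its freeze
    point, so the whole pass fails and the caller must thaw everything.
    """
    frozen = set()
    for task in order:
        if waits_on.get(task) in frozen:
            return False
        frozen.add(task)
    return True


def freeze_with_retry(order, wait_states, max_tries=3):
    """Return the number of the try that succeeded, or None after giving up.

    wait_states[i] is the hypothetical wait-for map observed on try i; it
    stands in for tasks running and unblocking each other between tries.
    """
    for attempt, waits_on in enumerate(wait_states[:max_tries], start=1):
        if try_freeze(order, waits_on):
            return attempt
        # Failed pass: thaw all tasks here and let them run before retrying.
    return None


# B is frozen first while A still waits on it; by the second try A has moved on.
first = freeze_with_retry(["B", "A"], [{"A": "B"}, {}])
# A never stops waiting on B, so every try fails and we give up.
second = freeze_with_retry(["B", "A"], [{"A": "B"}] * 5)
```

Nothing in the loop guarantees progress; it only gives the tasks another chance to untangle themselves, and it stops after a bounded number of tries.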

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22  3:43                                             ` david
@ 2007-07-22 16:00                                               ` Alan Stern
  2007-07-22 21:50                                                 ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-22 16:00 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Sat, 21 Jul 2007 david@lang.hm wrote:

> wait a minute here, it's possible we are misunderstanding each other.

I'd describe it as: You are misunderstanding me.  :-)

> as I see it.
> 
> if userspace can acquire locks that prevent the kernel from shutting off 
> (or doing anything else in particular) then it's possible for misbehaving 
> userspace code to stop the kernel by simply choosing to never release the 
> lock.
> 
> this would be a trivial DoS from userspace.

You are confusing "userspace" with "user tasks".  And not only that,
you often use the term "userspace" when you should say "user mode".

If you want I can explain the differences.

> now, if you are talking instead about the fact that when userspace makes a 
> system call, the execution of that system call involves acquiring locks 
> that are released before the system call completes you have a very 
> different situation.

That is exactly what I have been talking about.  It may be different
from what you _thought_, but it's not different from what I actually
_said_.

> if you have locks that are held across system calls then you should 
> already have problems, because you can't count on userspace ever taking 
> whatever action is appropriate to release the lock.
> 
> what am I missing that concerns you so much?

Here's what you are missing:

The new kexec approach eliminates the freezer and relies instead on the
fact that none of the tasks in the original kernel can execute while
the new kexec'd kernel is running.  This means the new kernel can write
out a memory image with no fear of interference or corruption.

But it also means that tasks which otherwise would have been frozen are 
actually free to run before the kexec call is made (and after the call 
returns, if the kexec'd kernel returns back to the original kernel).  
Any driver which was written with the assumption that tasks would be 
frozen at those times will need to be changed.

For example, drivers know that they have to quiesce their device in
preparation for creating the memory snapshot.  But they assume that no
I/O requests will be made while the device is quiesced (because no user
task is capable of generating an I/O request if they are all frozen),
so the driver doesn't try to prevent such requests from reactivating
the device.
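
As a toy illustration of that assumption (not taken from any real driver; the class and method names are invented for the example), compare a driver that relies on the freezer with one that guards itself against requests arriving while quiesced:

```python
class NaiveDriver:
    """Assumes no request can arrive while the device is quiesced."""

    def __init__(self):
        self.active = True

    def quiesce(self):
        self.active = False

    def submit_io(self, req):
        self.active = True          # any request reactivates the device
        return "done:" + req


class GuardedDriver(NaiveDriver):
    """Defers requests that arrive while quiesced instead of waking up."""

    def __init__(self):
        super().__init__()
        self.quiesced = False
        self.pending = []

    def quiesce(self):
        self.quiesced = True
        super().quiesce()

    def submit_io(self, req):
        if self.quiesced:
            self.pending.append(req)   # hold the request until resume
            return None
        return super().submit_io(req)

    def resume(self):
        self.quiesced = False
        pending, self.pending = self.pending, []
        return [self.submit_io(r) for r in pending]


naive = NaiveDriver()
naive.quiesce()
naive.submit_io("read")       # an unfrozen task sneaks in a request

guarded = GuardedDriver()
guarded.quiesce()
deferred = guarded.submit_io("read")
still_quiet = not guarded.active
completed = guarded.resume()
```

With tasks frozen the two behave the same; without the freezer only the second one keeps the device quiet, which is the kind of change drivers would need.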

The situation as regards locking is harder to discuss since I don't 
know of any code examples to use as a guide.  The fact remains that if 
user tasks aren't frozen then they can make system calls, and while 
running in kernel mode they can acquire locks, which might cause 
problems -- even though I can't identify any definite examples.

Because of these problems, it's too early to start trying to use kexec
to avoid the need for the freezer.

Of course, exactly the same possible problems exist when one tries to
remove the freezer from suspend-to-RAM.  It has nothing to do with 
kexec in particular (and certainly nothing to do with ACPI).

> having read through Documentation/power/devices.txt I remain convinced 
> that you are making a fundamental mistake.
> 
> you are designing a system

I'm not designing anything!  _You_ are.  I'm merely pointing out
problems in your design which you haven't considered.

>  that will only work if everything (every 
> driver, every state transition) participates fully in the process at all 
> times. You started with the facts 'this is the info that ACPI provides

Look again; I wasn't talking about ACPI.  You have mixed up the issues
in this email thread.  (Not hard to do, since it has been a very long
and complicated thread.)

> and 
> this is how it is designed to be used' and worked from there instead of 
> looking to see what the kernel really needed and figuring how to provide a 
> good interface for that that happens to be implemented (today) with ACPI. 
> (a proper power management framework shouldn't care if you have ACPI, APM, 
> or some other method of controlling the devices)

This and the rest of your email have no bearing on what I was talking
about, so I have snipped out the remainder.

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 15:26                                         ` Alan Stern
@ 2007-07-22 16:27                                           ` Miklos Szeredi
  2007-07-22 20:09                                             ` Alan Stern
  2007-07-22 22:42                                           ` Nigel Cunningham
  1 sibling, 1 reply; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-22 16:27 UTC (permalink / raw)
  To: stern
  Cc: nigel, jbms, miklos, rjw, miltonm, ying.huang, linux-kernel,
	david, linux-pm

> The only thing to do is what Rafael has been working on: unfreeze
> things, hope the tasks sort themselves out, and try again.

Have we some proof, that this will untangle the freezing tasks in a
limited time?  Or will it just make the problem harder to trigger?

Miklos


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 16:27                                           ` Miklos Szeredi
@ 2007-07-22 20:09                                             ` Alan Stern
  2007-07-22 21:54                                               ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-22 20:09 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: nigel, jbms, rjw, miltonm, ying.huang, linux-kernel, david, linux-pm

On Sun, 22 Jul 2007, Miklos Szeredi wrote:

> > The only thing to do is what Rafael has been working on: unfreeze
> > things, hope the tasks sort themselves out, and try again.
> 
> Have we some proof, that this will untangle the freezing tasks in a
> limited time?  Or will it just make the problem harder to trigger?

Of course there's no proof.  Just the opposite -- if things get hung up
the first time, they might get hung up the second time.  And the
third...

But it ought to make the problem harder to trigger.  For the present 
that's a worthwhile improvement.

Alan Stern



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 16:00                                               ` Alan Stern
@ 2007-07-22 21:50                                                 ` david
  2007-07-23 15:19                                                   ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-22 21:50 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Sun, 22 Jul 2007, Alan Stern wrote:

> On Sat, 21 Jul 2007 david@lang.hm wrote:
>
>> wait a minute here, it's possible we are misunderstanding each other.
>
> I'd describe it as: You are misunderstanding me.  :-)

very possibly :-)

>> as I see it.
>>
>> if userspace can acquire locks that prevent the kernel from shutting off
>> (or doing anything else in particular) then it's possible for misbehaving
>> userspace code to stop the kernel by simply choosing to never release the
>> lock.
>>
>> this would be a trivial DoS from userspace.
>
> You are confusing "userspace" with "user tasks".  And not only that,
> you often use the term "userspace" when you should say "user mode".
>
> If you want I can explain the differences.

please do, I have been treating all three as the same category.

>> now, if you are talking instead about the fact that when userspace makes a
>> system call, the execution of that system call involves acquiring locks
>> that are released before the system call completes you have a very
>> different situation.
>
> That is exactly what I have been talking about.  It may be different
> from what you _thought_, but it's not different from what I actually
> _said_.

Ok, I did misunderstand you. it sounds like all you need to do to make 
sure that locks are not held is to allow system calls to return before 
trying to do the suspend/kexec/etc. that sounds like not only a trivial 
thing to do, but something that would probably be done anyway.

although syscalls that then call out to userspace tasks before they can 
complete cause potential deadlocks (without that issue you can just wait 
until all syscalls have returned, and not allow anything to issue new 
syscalls). is this the issue that's killing FUSE+suspend?

>> if you have locks that are held across system calls then you should
>> already have problems, because you can't count on userspace ever taking
>> whatever action is appropriate to release the lock.
>>
>> what am I missing that concerns you so much?
>
> Here's what you are missing:
>
> The new kexec approach eliminates the freezer and relies instead on the
> fact that none of the tasks in the original kernel can execute while
> the new kexec'd kernel is running.  This means the new kernel can write
> out a memory image with no fear of interference or corruption.

correct

> But it also means that tasks which otherwise would have been frozen are
> actually free to run before the kexec call is made (and after the call
> returns, if the kexec'd kernel returns back to the original kernel).
> Any driver which was written with the assumption that tasks would be
> frozen at those times will need to be changed.

here is where you lose me.

why should jumping back to the original kernel immediately start running 
these processes? the process of doing a kexec requires things to happen in 
the drivers before normal activity can happen, so there is a phase in 
there where the kernel being jumped to has drivers initializing, but still 
does not allow anything else to run. why can't this phase be extended to 
allow for the possibility of transitioning these drivers to a sleep mode 
instead of to full operation?

> For example, drivers know that they have to quiesce their device in
> preparation for creating the memory snapshot.  But they assume that no
> I/O requests will be made while the device is quiesced (because no user
> task is capable of generating an I/O request if they are all frozen),
> so the driver doesn't try to prevent such requests from reactivating
> the device.
>
> The situation as regards locking is harder to discuss since I don't
> know of any code examples to use as a guide.  The fact remains that if
> user tasks aren't frozen then they can make system calls, and while
> running in kernel mode they can acquire locks, which might cause
> problems -- even though I can't identify any definite examples.

yes, if userspace is running jobs and submitting I/O and system calls 
while drivers are trying to initialize there is a big problem, but I am 
missing the reason this must be the case.

> Because of these problems, it's too early to start trying to use kexec
> to avoid the need for the freezer.
>
> Of course, exactly the same possible problems exist when one tries to
> remove the freezer from suspend-to-RAM.  It has nothing to do with
> kexec in particular (and certainly nothing to do with ACPI).

the part of the freezer that everyone is trying to eliminate is the 
exceptions (freeze everything except X,Y,Z because we will need to use 
those later for A)

>> having read through Documentation/power/devices.txt I remain convinced
>> that you are making a fundamental mistake.
>>
>> you are designing a system
>
> I'm not designing anything!  _You_ are.  I'm merely pointing out
> problems in your design which you haven't considered.

a better way of phrasing what I meant goes more along the lines of 'the 
current design of the system...'

>>  that will only work if everything (every
>> driver, every state transition) participates fully in the process at all
>> times. You started with the facts 'this is the info that ACPI provides
>
> Look again; I wasn't talking about ACPI.  You have mixed up the issues
> in this email thread.  (Not hard to do, since it has been a very long
> and complicated thread.)

very possibly. there are so many different sub-threads that part of my 
answer was to you, and part is addressing other things brought up during 
the thread

>> and
>> this is how it is designed to be used' and worked from there instead of
>> looking to see what the kernel really needed and figuring how to provide a
>> good interface for that that happens to be implemented (today) with ACPI.
>> (a proper power management framework shouldn't care if you have ACPI, APM,
>> or some other method of controlling the devices)
>
> This and the rest of your email have no bearing on what I was talking
> about, so I have snipped out the remainder.

this was in reaction to reading the power/devices.txt. my first thought 
was along the lines of "no wonder device driver authors don't implement 
all this, it's obviously evolved from the needs of the people doing the 
suspend, one call at a time" and from there I started thinking about what 
would make sense to driver authors and provide the capability that's 
needed for the job. I broke that off into a separate thread anyway.

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 20:09                                             ` Alan Stern
@ 2007-07-22 21:54                                               ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-22 21:54 UTC (permalink / raw)
  To: Alan Stern
  Cc: Miklos Szeredi, nigel, jbms, rjw, miltonm, ying.huang,
	linux-kernel, linux-pm

On Sun, 22 Jul 2007, Alan Stern wrote:

> On Sun, 22 Jul 2007, Miklos Szeredi wrote:
>
>>> The only thing to do is what Rafael has been working on: unfreeze
>>> things, hope the tasks sort themselves out, and try again.
>>
>> Have we some proof, that this will untangle the freezing tasks in a
>> limited time?  Or will it just make the problem harder to trigger?
>
> Of course there's no proof.  Just the opposite -- if things get hung up
> the first time, they might get hung up the second time.  And the
> third...
>
> But it ought to make the problem harder to trigger.  For the present
> that's a worthwhile improvement.

it gives the system more tries to find a spot in time where the deadlock 
doesn't happen, if you find one you can continue.

but even if things keep getting hung up, at least you are backing out of 
each try safely and can eventually tell the user "I give up, try shutting 
some things down and suspending again"

David Lang


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 15:26                                         ` Alan Stern
  2007-07-22 16:27                                           ` Miklos Szeredi
@ 2007-07-22 22:42                                           ` Nigel Cunningham
  2007-07-22 23:09                                             ` Rafael J. Wysocki
                                                               ` (3 more replies)
  1 sibling, 4 replies; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-22 22:42 UTC (permalink / raw)
  To: Alan Stern
  Cc: nigel, Jeremy Maitin-Shepard, Miklos Szeredi, rjw, miltonm,
	ying.huang, linux-kernel, david, linux-pm


Hi Alan.

On Monday 23 July 2007 01:26:23 Alan Stern wrote:
> On Sun, 22 Jul 2007, Nigel Cunningham wrote:
> 
> > Hi.
> > 
> > On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > > It seems that you could still potentially get a failure to freeze if one
> > > FUSE process depends on another, and the one that is frozen second just
> > > happens to be waiting on the one that is frozen first when it is frozen.
> > > I admit that this situation is unlikely, and perhaps acceptable.
> > > 
> > > A larger concern is that it seems that freezing FUSE processes at all
> > > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > > filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> > > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > > get a deadlock.
> > 
> > Ok. So then (in response to Alan too), how about keeping a tree of mounts, 
> > akin to the device tree, and working from the deepest nodes up? (In 
> > conjunction with what I already suggested)?
> 
> Face it, Nigel, this is a losing battle.  You can try to come up with
> ever-more complex schemes to try and force FUSE into the freezer's
> framework, but it just won't fit.  Or if it does, the next filesystem
> to come along will require an even more baroque type of special-case 
> handling.

It does seem to be a losing battle, but I'm wondering whether that's really 
because it's an intractable problem, or because people have given up on it 
before its time. We are talking about a computer system, so things should be 
predictable.
 
> The general problem is that task A may be in an unfreezable state,
> waiting for task B to do something, while task B is already frozen.  
> Since there's no reasonable way to determine that A really is waiting
> for B, you're just stuck.  (To make matters worse, A may not even
> realize which task it is waiting for; it may know only that it's
> waiting for somebody to do something!)  A and B could be user tasks, 
> kernel threads, or one of each.

I guess I want to persist because all of these issues aren't utterly 
unsolvable. It's just that we don't have the infrastructure yet to figure out 
the solutions to these issues trivially. Take, for example, the locking 
issue. If we could call some function to say "What process holds this lock?", 
then task A could know that it's waiting on task B and put that information 
somewhere. We could then use the information to freeze task B before task A.
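
If such wait-for information were recorded, deriving an order from it is the easy part. The sketch below is only an assumption about how that could look: the waits_on map is hypothetical, the standard graphlib module does the sorting, and the direction (lock holder before waiter) simply follows the wording above. Whether that order really avoids the hang is part of the open question.

```python
from graphlib import CycleError, TopologicalSorter


def freeze_order(waits_on):
    """waits_on maps a task to the set of tasks holding locks it waits for.

    Returns a list in which every lock holder comes before its waiters,
    or None when the tasks wait on each other in a cycle.
    """
    try:
        return list(TopologicalSorter(waits_on).static_order())
    except CycleError:
        return None


# A waits on B, and B waits on C.
chain = freeze_order({"A": {"B"}, "B": {"C"}})
# A and B wait on each other: no order exists.
cycle = freeze_order({"A": {"B"}, "B": {"A"}})
```

The cyclic case returns None, and the sketch says nothing about a task that does not know whom it is waiting for, which is the harder part raised earlier in the thread.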

 
> The only thing to do is what Rafael has been working on: unfreeze
> things, hope the tasks sort themselves out, and try again.

That's what I'm questioning. Is there a more reliable way and we've just given 
up too quickly?

Regards,

Nigel
-- 
Nigel Cunningham
Christian Reformed Church of Cobden
103 Curdie Street, Cobden 3266, Victoria, Australia
Ph. +61 3 5595 1185 / +61 417 100 574
Communal Worship: 11 am Sunday.



* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 22:42                                           ` Nigel Cunningham
@ 2007-07-22 23:09                                             ` Rafael J. Wysocki
  2007-07-22 23:18                                               ` Nigel Cunningham
  2007-07-23  0:04                                             ` Paul Mackerras
                                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-22 23:09 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Alan Stern, nigel, Jeremy Maitin-Shepard, Miklos Szeredi,
	miltonm, ying.huang, linux-kernel, david, linux-pm

Hi,

On Monday, 23 July 2007 00:42, Nigel Cunningham wrote:
> Hi Alan.
> 
> On Monday 23 July 2007 01:26:23 Alan Stern wrote:
> > On Sun, 22 Jul 2007, Nigel Cunningham wrote:
> > 
> > > Hi.
> > > 
> > > On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > > > It seems that you could still potentially get a failure to freeze if one
> > > > FUSE process depends on another, and the one that is frozen second just
> > > > happens to be waiting on the one that is frozen first when it is frozen.
> > > > I admit that this situation is unlikely, and perhaps acceptable.
> > > > 
> > > > A larger concern is that it seems that freezing FUSE processes at all
> > > > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > > > filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> > > > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > > > get a deadlock.
> > > 
> > > Ok. So then (in response to Alan too), how about keeping a tree of mounts, 
> > > akin to the device tree, and working from the deepest nodes up? (In 
> > > conjunction with what I already suggested)?
> > 
> > Face it, Nigel, this is a losing battle.  You can try to come up with
> > ever-more complex schemes to try and force FUSE into the freezer's
> > framework, but it just won't fit.  Or if it does, the next filesystem
> > to come along will require an even more baroque type of special-case 
> > handling.
> 
> It does seem to be a losing battle, but I'm wondering whether that's really 
> because it's an intractable problem, or because people have given up on it 
> before its time. We are talking about a computer system, so things should be 
> predictable.
>  
> > The general problem is that task A may be in an unfreezable state,
> > waiting for task B to do something, while task B is already frozen.  
> > Since there's no reasonable way to determine that A really is waiting
> > for B, you're just stuck.  (To make matters worse, A may not even
> > realize which task it is waiting for; it may know only that it's
> > waiting for somebody to do something!)  A and B could be user tasks, 
> > kernel threads, or one of each.
> 
> I guess I want to persist because all of these issues aren't utterly 
> unsolvable. It's just that we don't have the infrastructure yet to figure out 
> the solutions to these issues trivially. Take, for example, the locking 
> issue. If we could call some function to say "What process holds this lock?", 
> then task A could know that it's waiting on task B and put that information 
> somewhere. We could then use the information to freeze task B before task A.
> 
>  
> > The only thing to do is what Rafael has been working on: unfreeze
> > things, hope the tasks sort themselves out, and try again.
> 
> That's what I'm questioning. Is there a more reliable way and we've just given 
> up too quickly?

Well, there probably is one, but it likely would require us to make changes
that wouldn't be accepted by some people and thus would never be merged.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 23:09                                             ` Rafael J. Wysocki
@ 2007-07-22 23:18                                               ` Nigel Cunningham
  0 siblings, 0 replies; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-22 23:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, nigel, Jeremy Maitin-Shepard, Miklos Szeredi,
	miltonm, ying.huang, linux-kernel, david, linux-pm


On Monday 23 July 2007 09:09:21 Rafael J. Wysocki wrote:
> Hi,
> 
> On Monday, 23 July 2007 00:42, Nigel Cunningham wrote:
> > Hi Alan.
> > 
> > On Monday 23 July 2007 01:26:23 Alan Stern wrote:
> > > On Sun, 22 Jul 2007, Nigel Cunningham wrote:
> > > 
> > > > Hi.
> > > > 
> > > > On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > > > > It seems that you could still potentially get a failure to freeze if one
> > > > > FUSE process depends on another, and the one that is frozen second just
> > > > > happens to be waiting on the one that is frozen first when it is frozen.
> > > > > I admit that this situation is unlikely, and perhaps acceptable.
> > > > > 
> > > > > A larger concern is that it seems that freezing FUSE processes at all
> > > > > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > > > > filesystem is loopback mounted from a FUSE filesystem.  In that case, if
> > > > > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > > > > get a deadlock.
> > > > 
> > > > Ok. So then (in response to Alan too), how about keeping a tree of mounts,
> > > > akin to the device tree, and working from the deepest nodes up? (In
> > > > conjunction with what I already suggested)?
> > > 
> > > Face it, Nigel, this is a losing battle.  You can try to come up with
> > > ever-more complex schemes to try and force FUSE into the freezer's
> > > framework, but it just won't fit.  Or if it does, the next filesystem
> > > to come along will require an even more baroque type of special-case
> > > handling.
> > 
> > It does seem to be a losing battle, but I'm wondering whether that's really
> > because it's an intractable problem, or because people have given up on it
> > before its time. We are talking about a computer system, so things should be
> > predictable.
> > 
> > > The general problem is that task A may be in an unfreezable state,
> > > waiting for task B to do something, while task B is already frozen.
> > > Since there's no reasonable way to determine that A really is waiting
> > > for B, you're just stuck.  (To make matters worse, A may not even
> > > realize which task it is waiting for; it may know only that it's
> > > waiting for somebody to do something!)  A and B could be user tasks,
> > > kernel threads, or one of each.
> > 
> > I guess I want to persist because all of these issues aren't utterly
> > unsolvable. It's just that we don't have the infrastructure yet to figure out
> > the solutions to these issues trivially. Take, for example, the locking
> > issue. If we could call some function to say "What process holds this lock?",
> > then task A could know that it's waiting on task B and put that information
> > somewhere. We could then use the information to freeze task B before task A.
> > 
> > 
> > > The only thing to do is what Rafael has been working on: unfreeze
> > > things, hope the tasks sort themselves out, and try again.
> > 
> > That's what I'm questioning. Is there a more reliable way and we've just given
> > up too quickly?
> 
> Well, there probably is one, but it likely would require us to make changes
> that wouldn't be accepted by some people and thus would never be merged.

Well, doesn't that imply that we should at least look into what changes would 
be needed? If they wouldn't be accepted by some people, then either the 
objections would be reasonable or they wouldn't (and would hopefully be 
overridden). But we can't know if we don't try.

Regards,

Nigel

-- 
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 22:42                                           ` Nigel Cunningham
  2007-07-22 23:09                                             ` Rafael J. Wysocki
@ 2007-07-23  0:04                                             ` Paul Mackerras
  2007-07-23  3:11                                               ` Nigel Cunningham
  2007-07-23  5:31                                             ` david
  2007-07-23 10:24                                             ` Miklos Szeredi
  3 siblings, 1 reply; 220+ messages in thread
From: Paul Mackerras @ 2007-07-23  0:04 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Alan Stern, david, Miklos Szeredi, nigel, linux-kernel, miltonm,
	ying.huang, linux-pm, Jeremy Maitin-Shepard

Nigel Cunningham writes:

> I guess I want to persist because all of these issues aren't utterly
> unsolvable. It's just that we don't have the infrastructure yet to
> figure out the solutions to these issues trivially. Take, for example,

Ever heard of the halting problem? :)  It's not just a matter of
infrastructure.  You very quickly get into questions that are
mathematically undecidable.

> the locking issue. If we could call some function to say "What process
> holds this lock?", then task A could know that it's waiting on task B
> and put that information somewhere. We could then use the information
> to freeze task B before task A.

But how would that help?  If task B holds the lock, then we can't
freeze it until it's released the lock.  Then the question is, what
does task B need in order to get to the point where it releases the
lock?  And so on.  It rapidly gets not just extremely messy, but
actually impossible to compute in general.
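
To make the "and so on" concrete, here is a minimal sketch in Python (a userspace model, not kernel code) of following such a chain.  It assumes a complete map of who waits on whom already exists, which is exactly the information this thread says the kernel does not have in general; the names blocking_chain, waits_on and frozen are hypothetical.

```python
def blocking_chain(task, waits_on, frozen):
    """Follow 'X waits for Y' links starting at task.

    waits_on maps a task name to the single task it is waiting for
    (hypothetical: assumes every waiter knows its owner).
    frozen is the set of tasks that have already been frozen.
    Returns (chain, verdict), where verdict is one of:
      "stuck"    - the chain reaches an already-frozen task
      "cycle"    - the chain loops back on itself
      "runnable" - the last task in the chain waits on nobody we know of
    """
    chain = [task]
    seen = {task}
    while True:
        nxt = waits_on.get(chain[-1])
        if nxt is None:
            return chain, "runnable"
        chain.append(nxt)
        if nxt in seen:
            return chain, "cycle"
        if nxt in frozen:
            return chain, "stuck"
        seen.add(nxt)
```

Even this toy version only answers the question for the links it has been told about; it says nothing about what the task at the end of the chain needs in order to make progress.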

Paul.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23  0:04                                             ` Paul Mackerras
@ 2007-07-23  3:11                                               ` Nigel Cunningham
  2007-07-23 15:23                                                 ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-23  3:11 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alan Stern, david, Miklos Szeredi, nigel, linux-kernel, miltonm,
	ying.huang, linux-pm, Jeremy Maitin-Shepard


Hi.

On Monday 23 July 2007 10:04:43 Paul Mackerras wrote:
> Nigel Cunningham writes:
> 
> > I guess I want to persist because all of these issues aren't utterly
> > unsolvable. It's just that we don't have the infrastructure yet to
> > figure out the solutions to these issues trivially. Take, for example,
> 
> Ever heard of the halting problem? :)  It's not just a matter of
> infrastructure.  You very quickly get into questions that are
> mathematically undecidable.

Is this the halting problem, though?

> > the locking issue. If we could call some function to say "What process
> > holds this lock?", then task A could know that it's waiting on task B
> > and put that information somewhere. We could then use the information
> > to freeze task B before task A.
> 
> But how would that help?  If task B holds the lock, then we can't
> freeze it until it's released the lock.  Then the question is, what
> does task B need in order to get to the point where it releases the
> lock?  And so on.  It rapidly gets not just extremely messy, but
> actually impossible to compute in general.

Take a step back for a second.

The problem we're facing now is that we're getting some userspace threads, 
used in processing I/O, that are functioning as exceptions to the "freeze 
userspace, then freezeable kernel threads" rule. They are only exceptions 
because of that role in processing I/O - because they're de facto kernel 
threads. So, if we orient our thinking more in terms of I/O processing and 
less in terms of the userspace/kernelspace distinction, we'll have a 
solution:

1) Freeze processes that aren't fs related (i.e. stop them generating I/O).
2) Flush pending I/O.
3) Freeze filesystems in reverse order of dependency, the primary purpose 
being to stop them generating further I/O on their metadata.

Locks that are being held are only being held because work is being done. If 
we progressively focus on threads in terms of their create/process work 
dependencies, we'll see that the problem isn't at all intractable.
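
Step 3 above can be sketched as a topological sort.  This is a minimal Python model, not kernel code, and it assumes the dependencies between filesystems are known and acyclic, which is itself the contested point in this thread.  The function name freeze_order and the backing map are hypothetical.

```python
from collections import deque

def freeze_order(backing):
    """Return an order in which every filesystem is frozen before
    anything it depends on.

    backing[fs] is the set of filesystems (or other I/O providers)
    that fs needs in order to complete its own I/O.
    Raises ValueError if the dependencies form a cycle.
    """
    nodes = set(backing)
    for deps in backing.values():
        nodes.update(deps)

    # pending[x] = number of not-yet-frozen filesystems that still need x
    pending = {n: 0 for n in nodes}
    for deps in backing.values():
        for d in deps:
            pending[d] += 1

    # Start with filesystems that nothing else depends on.
    ready = deque(sorted(n for n in nodes if pending[n] == 0))
    order = []
    while ready:
        fs = ready.popleft()
        order.append(fs)
        for d in sorted(backing.get(fs, ())):
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)

    if len(order) != len(nodes):
        raise ValueError("dependency cycle: no safe freeze order exists")
    return order
```

For an ext3 image loop-mounted from a FUSE filesystem, backing would be {"loop-ext3": {"fuse"}, "fuse": set()}, and the loop mount is frozen first.  A cycle raises an error rather than returning a partial order.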

Regards,

Nigel
-- 
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 22:42                                           ` Nigel Cunningham
  2007-07-22 23:09                                             ` Rafael J. Wysocki
  2007-07-23  0:04                                             ` Paul Mackerras
@ 2007-07-23  5:31                                             ` david
  2007-07-23 10:24                                             ` Miklos Szeredi
  3 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-07-23  5:31 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Alan Stern, nigel, Jeremy Maitin-Shepard, Miklos Szeredi, rjw,
	miltonm, ying.huang, linux-kernel, linux-pm

On Mon, 23 Jul 2007, Nigel Cunningham wrote:

> Hi Alan.
>
> On Monday 23 July 2007 01:26:23 Alan Stern wrote:
>> On Sun, 22 Jul 2007, Nigel Cunningham wrote:
>>
>>> Hi.
>>>
>>> On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
>>>> It seems that you could still potentially get a failure to freeze if one
>>>> FUSE process depends on another, and the one that is frozen second just
>>>> happens to be waiting on the one that is frozen first when it is frozen.
>>>> I admit that this situation is unlikely, and perhaps acceptable.
>>>>
>>>> A larger concern is that it seems that freezing FUSE processes at all
>>>> _will_ generate deadlocks if a non-synchronous or memory-map-supporting
>>>> filesystem is loopback mounted from a FUSE filesystem.  In that case, if
>>>> you attempt to sync or free memory once FUSE is frozen, you are sure to
>>>> get a deadlock.
>>>
>>> Ok. So then (in response to Alan too), how about keeping a tree of mounts,
>>> akin to the device tree, and working from the deepest nodes up? (In
>>> conjunction with what I already suggested)?
>>
>> Face it, Nigel, this is a losing battle.  You can try to come up with
>> ever-more complex schemes to try and force FUSE into the freezer's
>> framework, but it just won't fit.  Or if it does, the next filesystem
>> to come along will require an even more baroque type of special-case
>> handling.
>
> It does seem to be a losing battle, but I'm wondering whether that's really
> because it's an intractable problem, or because people have given up on it
> before its time. We are talking about a computer system, so things should be
> predictable.
>
>> The general problem is that task A may be in an unfreezable state,
>> waiting for task B to do something, while task B is already frozen.
>> Since there's no reasonable way to determine that A really is waiting
>> for B, you're just stuck.  (To make matters worse, A may not even
>> realize which task it is waiting for; it may know only that it's
>> waiting for somebody to do something!)  A and B could be user tasks,
>> kernel threads, or one of each.
>
> I guess I want to persist because all of these issues aren't utterly
> unsolvable. It's just that we don't have the infrastructure yet to figure out
> the solutions to these issues trivially. Take, for example, the locking
> issue. If we could call some function to say "What process holds this lock?",
> then task A could know that it's waiting on task B and put that information
> somewhere. We could then use the information to freeze task B before task A.
>

This sounds like the standard priority inversion problem taken to
extremes. Ingo has been working on this issue, but IIRC the problem is that
tracking what owns the lock so that you can get that thing to run ends up
being enough overhead that it's not acceptable in the general case.

David Lang

>> The only thing to do is what Rafael has been working on: unfreeze
>> things, hope the tasks sort themselves out, and try again.
>
> That's what I'm questioning. Is there a more reliable way and we've just given
> up too quickly?
>
> Regards,
>
> Nigel
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 22:42                                           ` Nigel Cunningham
                                                               ` (2 preceding siblings ...)
  2007-07-23  5:31                                             ` david
@ 2007-07-23 10:24                                             ` Miklos Szeredi
  2007-07-23 12:08                                               ` Rafael J. Wysocki
  3 siblings, 1 reply; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-23 10:24 UTC (permalink / raw)
  To: nigel
  Cc: stern, nigel, jbms, miklos, rjw, miltonm, ying.huang,
	linux-kernel, david, linux-pm

> > The only thing to do is what Rafael has been working on: unfreeze
> > things, hope the tasks sort themselves out, and try again.
> 
> That's what I'm questioning. Is there a more reliable way and we've
> just given up too quickly?

There obviously _are_ more reliable ways.  A trivial one seems to be
to just not require user tasks to finish syscalls.

Yeah, stopping user processes outside the kernel is convenient, but
there's no fundamental reason why it is the only place where those
tasks can be stopped.

And there are very fundamental reasons to _not_ require this.  Not
just in the fuse case, but in any case where a syscall requires
another user task to run before it can be finished (e.g. NFS over
OpenVPN).

Miklos

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 10:24                                             ` Miklos Szeredi
@ 2007-07-23 12:08                                               ` Rafael J. Wysocki
  2007-07-23 12:14                                                 ` Miklos Szeredi
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 12:08 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: nigel, stern, nigel, jbms, miltonm, ying.huang, linux-kernel,
	david, linux-pm

On Monday, 23 July 2007 12:24, Miklos Szeredi wrote:
> > > The only thing to do is what Rafael has been working on: unfreeze
> > > things, hope the tasks sort themselves out, and try again.
> > 
> > That's what I'm questioning. Is there a more reliable way and we've
> > just given up too quickly?
> 
> There obviously _are_ more reliable ways.  A trivial one seems to be
> to just not require user tasks to finish syscalls.
> 
> Yeah, stopping user processes outside the kernel is convenient, but
> there's no fundamental reason why it is the only place where those
> tasks can be stopped.

The reason is that we want them to "park" in safe places, i.e. where there
are no locks held, etc.  Thus, these safe places need to be chosen somehow
and since they are not marked throughout the code, we choose the obvious
one. :-)

> And there are very fundamental reasons to _not_ require this.  Not
> just in the fuse case, but in any case where a syscall requires
> another user task to run before it can be finished (e.g. NFS over
> OpenVPN).

Yeah.  Mark the safe places for us and we'll use them.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 12:08                                               ` Rafael J. Wysocki
@ 2007-07-23 12:14                                                 ` Miklos Szeredi
  2007-07-23 12:27                                                   ` Rafael J. Wysocki
  2007-07-23 12:31                                                   ` Oliver Neukum
  0 siblings, 2 replies; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-23 12:14 UTC (permalink / raw)
  To: rjw
  Cc: miklos, nigel, stern, nigel, jbms, miltonm, ying.huang,
	linux-kernel, david, linux-pm

> On Monday, 23 July 2007 12:24, Miklos Szeredi wrote:
> > > > The only thing to do is what Rafael has been working on: unfreeze
> > > > things, hope the tasks sort themselves out, and try again.
> > > 
> > > That's what I'm questioning. Is there a more reliable way and we've
> > > just given up too quickly?
> > 
> > There obviously _are_ more reliable ways.  A trivial one seems to be
> > to just not require user tasks to finish syscalls.
> > 
> > Yeah, stopping user processes outside the kernel is convenient, but
> > there's no fundamental reason why it is the only place where those
> > tasks can be stopped.
> 
> The reason is that we want them to "park" in safe places, ie. where there
> are no locks held etc.  Thus, these safe places need to be chosen somehow
> and since they are not marked throughout the code, we choose the obvious
> one. :-)

Why shouldn't locks be held?

Locks that are required for suspend must not be held, sure.  But
otherwise holding locks doesn't matter at all.

And I'm not saying that is trivial to do, but it might not be too hard
either.

Rafael, can you please tell me what happened to that patch that did not
wait for tasks in uninterruptible sleep to be frozen?

That seemed like a magnificent approach compared to anything that has
been proposed since.

Miklos

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 12:14                                                 ` Miklos Szeredi
@ 2007-07-23 12:27                                                   ` Rafael J. Wysocki
  2007-07-23 12:31                                                   ` Oliver Neukum
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 12:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: nigel, stern, nigel, jbms, miltonm, ying.huang, linux-kernel,
	david, linux-pm

On Monday, 23 July 2007 14:14, Miklos Szeredi wrote:
> > On Monday, 23 July 2007 12:24, Miklos Szeredi wrote:
> > > > > The only thing to do is what Rafael has been working on: unfreeze
> > > > > things, hope the tasks sort themselves out, and try again.
> > > > 
> > > > That's what I'm questioning. Is there a more reliable way and we've
> > > > just given up too quickly?
> > > 
> > > There obviously _are_ more reliable ways.  A trivial one seems to be
> > > to just not require user tasks to finish syscalls.
> > > 
> > > Yeah, stopping user processes outside the kernel is convenient, but
> > > there's no fundamental reason why it is the only place where those
> > > tasks can be stopped.
> > 
> > The reason is that we want them to "park" in safe places, ie. where there
> > are no locks held etc.  Thus, these safe places need to be chosen somehow
> > and since they are not marked throughout the code, we choose the obvious
> > one. :-)
> 
> Why shouldn't locks be held?
> 
> No locks which are required for suspend must be held, sure.  But
> otherwise holding locks doesn't matter at all.
> 
> And I'm not saying that is trivial to do, but it might not be too hard
> either.
> 
> Rafael, can you please tell, what happened to that patch, that did not
> wait for tasks in uninterruptible sleep to be frozen?
> 
> That seemed like a magnificent approach compared to anything that has
> been proposed since.

Well, the freezer has failed to freeze tasks a couple of times in my
test setup and I've had a couple of hangs.

I have an idea how to improve it, but that still requires some pending freezer
patches to go first.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 12:14                                                 ` Miklos Szeredi
  2007-07-23 12:27                                                   ` Rafael J. Wysocki
@ 2007-07-23 12:31                                                   ` Oliver Neukum
  2007-07-23 13:08                                                     ` Miklos Szeredi
  2007-07-23 19:08                                                     ` david
  1 sibling, 2 replies; 220+ messages in thread
From: Oliver Neukum @ 2007-07-23 12:31 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: rjw, nigel, stern, nigel, jbms, miltonm, ying.huang,
	linux-kernel, david, linux-pm

Am Montag 23 Juli 2007 schrieb Miklos Szeredi:
> > The reason is that we want them to "park" in safe places, ie. where there
> > are no locks held etc.  Thus, these safe places need to be chosen somehow
> > and since they are not marked throughout the code, we choose the obvious
> > one. :-)
> 
> Why shouldn't locks be held?
> 
> No locks which are required for suspend must be held, sure.  But
> otherwise holding locks doesn't matter at all.

If you can provide a way to tell them apart, this would work.

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 12:31                                                   ` Oliver Neukum
@ 2007-07-23 13:08                                                     ` Miklos Szeredi
  2007-07-23 14:01                                                       ` Rafael J. Wysocki
  2007-07-23 19:08                                                     ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-23 13:08 UTC (permalink / raw)
  To: oliver
  Cc: miklos, rjw, nigel, stern, nigel, jbms, miltonm, ying.huang,
	linux-kernel, david, linux-pm

> > > The reason is that we want them to "park" in safe places, ie. where there
> > > are no locks held etc.  Thus, these safe places need to be chosen somehow
> > > and since they are not marked throughout the code, we choose the obvious
> > > one. :-)
> > 
> > Why shouldn't locks be held?
> > 
> > No locks which are required for suspend must be held, sure.  But
> > otherwise holding locks doesn't matter at all.
> 
> If you can provide a way to tell them apart, this would work.

Without some marking we can't tell obviously.

Are there many such locks?  We can easily check by adding some
debugging code to the lock primitives, to make them yell if they are
used during suspend.
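
As a rough illustration of that kind of debugging hook, here is a minimal Python model (not kernel code) of a lock wrapper that records every lock taken while a suspend is in progress.  The names DebugLock, suspend_in_progress and offenders are all hypothetical; a real version would live in the kernel's lock primitives and print a warning instead.

```python
import threading

# Hypothetical global flag, set by the suspend code for the duration
# of a suspend/hibernation transition.
suspend_in_progress = threading.Event()

# Names of locks that were acquired while the flag was set.
offenders = set()

class DebugLock:
    """A mutex that records its name if acquired during suspend."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def acquire(self):
        # "Yell" by recording the lock name; the acquire itself is unchanged.
        if suspend_in_progress.is_set():
            offenders.add(self.name)
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
```

After a few suspend cycles, offenders would hold the candidate set of locks that are actually needed during suspend.  It can only ever contain locks that happened to be exercised on the systems tested.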

Miklos

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 14:01                                                       ` Rafael J. Wysocki
@ 2007-07-23 14:01                                                         ` Miklos Szeredi
  0 siblings, 0 replies; 220+ messages in thread
From: Miklos Szeredi @ 2007-07-23 14:01 UTC (permalink / raw)
  To: rjw
  Cc: miklos, stern, oliver, nigel, nigel, jbms, miltonm, ying.huang,
	linux-kernel, david, linux-pm

> Alan has recently proposed to introduce "suspend locks" to be acquired during
> a suspend/hibernation and such that we can leave uninterruptible tasks that
> don't hold any of them.

Sounds sane.  A global rwsem could be acquired for read by drivers,
and for write by suspend/hibernate.  Just need to add it to all
drivers that have PM, but that shouldn't need a heroic effort.
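
A minimal sketch of that idea, as a Python userspace model rather than kernel code: drivers hold the semaphore for read around operations that must not be interrupted by a suspend, and the suspend path takes it for write.  The class name SuspendRWSem is hypothetical; the method names merely echo the kernel's rwsem calls.  Writer starvation is not handled here.

```python
import threading

class SuspendRWSem:
    """Toy reader/writer semaphore: many readers or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        # Driver side: block while a suspend holds the write side.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def up_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def down_write(self):
        # Suspend side: wait until no driver is inside a read section.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def down_write_trylock(self):
        # Non-blocking variant: report failure instead of waiting.
        with self._cond:
            if self._writer or self._readers:
                return False
            self._writer = True
            return True

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With this shape, a task sleeping uninterruptibly while not holding the read side cannot interfere with suspend, whereas one that does hold it makes down_write() wait (or down_write_trylock() fail).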

Miklos

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 13:08                                                     ` Miklos Szeredi
@ 2007-07-23 14:01                                                       ` Rafael J. Wysocki
  2007-07-23 14:01                                                         ` Miklos Szeredi
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 14:01 UTC (permalink / raw)
  To: Miklos Szeredi, stern
  Cc: oliver, nigel, nigel, jbms, miltonm, ying.huang, linux-kernel,
	david, linux-pm

On Monday, 23 July 2007 15:08, Miklos Szeredi wrote:
> > > > The reason is that we want them to "park" in safe places, ie. where there
> > > > are no locks held etc.  Thus, these safe places need to be chosen somehow
> > > > and since they are not marked throughout the code, we choose the obvious
> > > > one. :-)
> > > 
> > > Why shouldn't locks be held?
> > > 
> > > No locks which are required for suspend must be held, sure.  But
> > > otherwise holding locks doesn't matter at all.
> > 
> > If you can provide a way to tell them apart, this would work.
> 
> Without some marking we can't tell obviously.
> 
> Are there many such locks?  We can easily check by adding some
> debugging code to the lock primitives, to make them yell if they are
> used during suspend.

This way we can only obtain information from systems that use hibernation
quite often.

Alan has recently proposed introducing "suspend locks", to be acquired during
a suspend/hibernation, so that we can leave alone uninterruptible tasks that
don't hold any of them.

Unfortunately, I have no link to his original message at hand.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:25                                           ` Alan Stern
@ 2007-07-23 14:23                                             ` Oliver Neukum
  2007-07-23 20:05                                               ` Towards eliminating the freezer Alan Stern
  2007-08-01  9:34                                             ` [linux-pm] Re: Hibernation considerations Pavel Machek
  1 sibling, 1 reply; 220+ messages in thread
From: Oliver Neukum @ 2007-07-23 14:23 UTC (permalink / raw)
  To: Alan Stern
  Cc: Milton Miller, Ying Huang, LKML, Rafael J. Wysocki, David Lang,
	linux-pm, Jeremy Maitin-Shepard

Am Samstag 21 Juli 2007 schrieb Alan Stern:
> On Fri, 20 Jul 2007, Oliver Neukum wrote:
> 
> > > We already have a pre-suspend notification available for drivers that 
> > > need to allocate large amounts of memory.
> > 
> > Is that facility fine grained enough?
> 
> It's a notifier chain that gets called at several points during the 
> suspend transition.  One of those points is right at the start, while 
> userspace is still running and reasonably large amounts of memory can 
> be allocated.
> 
> Is it fine-grained enough?  I don't know -- hard to tell, since nothing 
> much is using it yet.
> 
> > > You are correct about the need to delay/stop device addition.  I don't
> > > know how this can be done in general; each code path calling
> > > device_add() may have to be treated individually.
> > 
> > What about the old API?
> 
> What old API do you mean?

The find_device() stuff.

> >  Do we have to block module loading?
> 
> No.  Registering new drivers is okay, registering new devices is bad.

What if it is a driver for virtual devices that don't need probe()
for actual hardware?

> Of course, some modules do want to register a new device in their init 
> method.  I don't know what we should do about them.  Force the 
> registration to fail, I suppose.  How often will people suspend while a 
> module is loading?
> 
> > What happens if a scsi error handler is woken? If it cannot be woken,
> > how are errors handled?
> 
> Why should the error handler wake up?  There isn't supposed to be any 
> I/O going on, hence no errors to handle.

What about shared busses? Firewire, FibreChannel? They can get external
resets, etc ...

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-22 21:50                                                 ` david
@ 2007-07-23 15:19                                                   ` Alan Stern
  2007-07-23 19:01                                                     ` david
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-23 15:19 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Sun, 22 Jul 2007 david@lang.hm wrote:

> > You are confusing "userspace" with "user tasks".  And not only that,
> > you often use the term "userspace" when you should say "user mode".
> >
> > If you want I can explain the differences.
> 
> please do, I have been treating all three as the same category.

Very briefly then: "User mode" and "kernel mode" refer to the CPU's
hardware privilege level.  A process makes the transition from user
mode to kernel mode by executing a system call.  Interrupt and
exception handlers also run in kernel mode, but they generally are not
considered to be part of any process.  The reverse transition occurs
when a process returns from a system call, or when an interrupt which
occurred while the CPU was in user mode completes.  (It's interesting
to note that system calls are somewhat similar to interrupts; in fact
sometimes they are implemented by a "software interrupt".)

"Kernel threads" are processes that run entirely in kernel mode.  They
usually don't have a memory mapping for any user-owned memory and they
never go into user mode.  All other processes are "user threads".

"Userspace" is a rather general term referring to things not in the
kernel.  It comprises both user tasks (while running in user mode) and
user memory.

> Ok, I did misunderstand you. It sounds like all you need to do to make 
> sure that locks are not held is to allow system calls to return before 
> trying to do the suspend/kexec/etc. that sounds like not only a trivial 
> thing to do, but something that would probably be done anyway.

If you could actually do it, it would work.  But you can't do it.  If 
it were feasible, the freezer would have used that approach in the 
first place.

For one thing, checking for a suspend-in-progress at the beginning of
each and every system call would add overhead to a hot path in the
kernel, one which is already very heavily optimized.  People wouldn't
stand for it.

> although syscalls that then call out to userspace tasks before they can 
> complete cause potential deadlocks (without that issue you can just wait 
> until all syscalls have returned, and not allow anything to issue new 
> syscalls) is this the issue that's killing FUSE+suspend?

You get similar problems from system calls that wait in kernel mode 
until something has happened.  For example, a read() call for the 
console device will wait until somebody types on the keyboard.  At any 
point in time, many (or even most) user threads are blocked in a system 
call.

> > Here's what you are missing:
> >
> > The new kexec approach eliminates the freezer and relies instead on the
> > fact that none of the tasks in the original kernel can execute while
> > the new kexec'd kernel is running.  This means the new kernel can write
> > out a memory image with no fear of interference or corruption.
> 
> correct
> 
> > But it also means that tasks which otherwise would have been frozen are
> > actually free to run before the kexec call is made (and after the call
> > returns, if the kexec'd kernel returns back to the original kernel).
> > Any driver which was written with the assumption that tasks would be
> > frozen at those times will need to be changed.
> 
> here is where you lose me.
> 
> why should jumping back to the original kernel immediately start running 
> these processes?

Let's let kernel K1 be the original kernel, the one which is going into
hibernation.  Kernel K2 is the one started by kexec to write out the
memory image.

Your question becomes: Why should K2 jumping back to K1 cause K1
immediately to start running user tasks?  Answer: Because K1 has been
running user tasks all along (except while K2 was active) and nothing
has told it to stop.  In fact, about the only things which _can_ cause
K1 to stop running user threads are the freezer (which you want to
eliminate) and disabling interrupts (not possible since some drivers
require interrupts to be enabled when putting devices in low-power 
mode).

>  the process of doing a kexec requires things to happen in 
> the drivers before normal activity can happen, so there is a phase in 
> there where the kernel being jumped to has drivers initializing, but still 
> does not allow anything else to run.

So when K2 starts up, it will have a phase in which user threads don't 
run.  That doesn't affect K1.  When K2 returns to K1, K1 does not go 
through this sort of phase.  It simply picks up from where it left off.

> why can't this phase be extended to 
> allow for the possibility of transitioning these drivers to a sleep mode 
> instead of to full operation?

Indeed, Rafael has suggested that K2 be responsible for putting devices
in low-power mode.  This has the disadvantage of requiring K2 to 
include drivers for every device used by K1, but otherwise it would 
work.

However there still remains the problem of user tasks running after 
devices are supposed to be quiescent and before K1 starts.  There's 
currently nothing to stop such tasks from making I/O requests and 
thereby causing a quiescent device to become active again.

> > The situation as regards locking is harder to discuss since I don't
> > know of any code examples to use as a guide.  The fact remains that if
> > user tasks aren't frozen then they can make system calls, and while
> > running in kernel mode they can acquire locks, which might cause
> > problems -- even though I can't identify any definite examples.
> 
> yes, if userspace is running jobs and submitting I/O and system calls 
> while drivers are trying to initialize there is a big problem, but I am 
> missing the reason this must be the case.

We aren't talking about drivers initializing devices.  We are talking
about what happens during the time when drivers are trying to quiesce
devices (i.e., before K1 has started up K2) or power them down (after
K2 has returned to K1).

> the part of the freezer that everyone is trying to eliminate is the 
> exceptions (freeze everything except X,Y,Z because we will need to use 
> those later for A)

Wrong.  People are trying to eliminate the freezer entirely.  Go back 
and reread some of the postings at the beginning of this long thread, 
especially those from Paul Mackerras and Ben Herrenschmidt.

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23  3:11                                               ` Nigel Cunningham
@ 2007-07-23 15:23                                                 ` Alan Stern
  2007-07-23 21:55                                                   ` Nigel Cunningham
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-23 15:23 UTC (permalink / raw)
  To: nigel
  Cc: Paul Mackerras, david, Miklos Szeredi, linux-kernel, miltonm,
	ying.huang, linux-pm, Jeremy Maitin-Shepard

On Mon, 23 Jul 2007, Nigel Cunningham wrote:

> Take a step back for a second.
> 
> The problem we're facing now is that we're getting some userspace threads, 
> used in processing I/O, that are functioning as exceptions to the "freeze 
> userspace, then freezeable kernel threads" rule. They are only exceptions 
> because of that role in processing I/O - because they're de facto kernel 
> threads. So, if we orient our thinking more in terms of I/O processing and 
> less in terms of the userspace/kernelspace distinction, we'll have a 
> solution:
> 
> 1) Freeze processes that aren't fs related (ie stop them generating I/O).

The problem here is that with things like FUSE, _every_ process is 
potentially fs related.  Nothing prevents a FUSE thread from doing IPC 
with any other thread.

> 2) Flush pending I/O.
> 3) Freeze filesystems in reverse order of dependency, the primary purpose 
> being to stop them generating further I/O on their metadata.
> 
> Locks that are being held are only being held because work is being done. If 
> we progressively focus on threads in terms of their create/process work 
> dependencies, we'll see that the problem isn't at all intractable.

As has been mentioned before, keeping track of all that dependency 
information would be very fragile and time-consuming.

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 15:19                                                   ` Alan Stern
@ 2007-07-23 19:01                                                     ` david
  2007-07-23 20:22                                                       ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: david @ 2007-07-23 19:01 UTC (permalink / raw)
  To: Alan Stern
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Mon, 23 Jul 2007, Alan Stern wrote:

> On Sun, 22 Jul 2007 david@lang.hm wrote:
>

>> Ok, I did misunderstand you. it sounds like all you need to do to make
>> sure that locks are not held is to allow system calls to return before
>> trying to do the suspend/kexec/etc. that sounds like not only a trivial
>> thing to do, but something that would probably be done anyway.
>
> If you could actually do it, it would work.  But you can't do it.  If
> it were feasible, the freezer would have used that approach in the
> first place.
>
> For one thing, checking for a suspend-in-progress at the beginning of
> each and every system call would add overhead to a hot path in the
> kernel, one which is already very heavily optimized.  People wouldn't
> stand for it.

I thought that the suspend stuff did this easily, but the freezer really 
starts running into trouble when it wants to freeze some things, but not 
other things.  This seems to be the biggest area of churn and problems.

>> although syscalls that then call out to userspace tasks before they can
>> complete cause potential deadlocks (without that issue you can just wait
>> until all syscalls have returned, and not allow anything to issue new
>> syscalls) is this the issue that's killing FUSE+suspend?
>
> You get similar problems from system calls that wait in kernel mode
> until something has happened.  For example, a read() call for the
> console device will wait until somebody types on the keyboard.  At any
> point in time, many (or even most) user threads are blocked in a system
> call.

but are locks held while they are blocked like this?

>>> But it also means that tasks which otherwise would have been frozen are
>>> actually free to run before the kexec call is made (and after the call
>>> returns, if the kexec'd kernel returns back to the original kernel).
>>> Any driver which was written with the assumption that tasks would be
>>> frozen at those times will need to be changed.
>>
>> here is where you lose me.
>>
>> why should jumping back to the original kernel immediately start running
>> these processes?
>
> Let's let kernel K1 be the original kernel, the one which is going into
> hibernation.  Kernel K2 is the one started by kexec to write out the
> memory image.
>
> Your question becomes: Why should K2 jumping back to K1 cause K1
> immediately to start running user tasks?  Answer: Because K1 has been
> running user tasks all along (except while K2 was active) and nothing
> has told it to stop.  In fact, about the only things which _can_ cause
> K1 to stop running user threads are the freezer (which you want to
> eliminate) and disabling interrupts (not possible since some drivers
> require interrupts to be enabled when putting devices in low-power
> mode).

when you jump to a body of code you jump to a specific point in the code, 
not to some nebulous 'everything running' state.

>>  the process of doing a kexec requires things to happen in
>> the drivers before normal activity can happen, so there is a phase in
>> there where the kernel being jumped to has drivers initializing, but still
>> does not allow anything else to run.
>
> So when K2 starts up, it will have a phase in which user threads don't
> run.  That doesn't affect K1.  When K2 returns to K1, K1 does not go
> through this sort of phase.  It simply picks up from where it left off.

then how can it restart drivers before the user threads need them?

>> why can't this phase be extended to
>> allow for the possibility of transitioning these drivers to a sleep mode
>> instead of to full operation?
>
> Indeed, Rafael has suggested that K2 be responsible for putting devices
> in low-power mode.  This has the disadvantage of requiring K2 to
> include drivers for every device used by K1, but otherwise it would
> work.
>
> However there still remains the problem of user tasks running after
> devices are supposed to be quiescent and before K1 starts.  There's
> currently nothing to stop such tasks from making I/O requests and
> thereby causing a quiescent device to become active again.

but if the devices are in low power mode then K1 needs to get them out of 
low power mode before user tasks try to access them.

>>> The situation as regards locking is harder to discuss since I don't
>>> know of any code examples to use as a guide.  The fact remains that if
>>> user tasks aren't frozen then they can make system calls, and while
>>> running in kernel mode they can acquire locks, which might cause
>>> problems -- even though I can't identify any definite examples.
>>
>> yes, if userspace is running jobs and submitting I/O and system calls
>> while drivers are trying to initialize there is a big problem, but I am
>> missing the reason this must be the case.
>
> We aren't talking about drivers initializing devices.  We are talking
> about what happens during the time when drivers are trying to quiesce
> devices (i.e., before K1 has started up K2) or power them down (after
> K2 has returned to K1).

or if you are doing a resume instead of a suspend to ram the drivers need 
to initialize or otherwise move to full power on K1 before user tasks hit 
them.

>> the part of the freezer that everyone is trying to eliminate is the
>> exceptions (freeze everything except X,Y,Z because we will need to use
>> those later for A)
>
> Wrong.  People are trying to eliminate the freezer entirely.  Go back
> and reread some of the postings at the beginning of this long thread,
> especially those from Paul Mackerras and Ben Herrenschmidt.
>
> Alan Stern
>
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 12:31                                                   ` Oliver Neukum
  2007-07-23 13:08                                                     ` Miklos Szeredi
@ 2007-07-23 19:08                                                     ` david
  1 sibling, 0 replies; 220+ messages in thread
From: david @ 2007-07-23 19:08 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Miklos Szeredi, rjw, nigel, stern, nigel, jbms, miltonm,
	ying.huang, linux-kernel, linux-pm

On Mon, 23 Jul 2007, Oliver Neukum wrote:

> Am Montag 23 Juli 2007 schrieb Miklos Szeredi:
>>> The reason is that we want them to "park" in safe places, ie. where there
>>> are no locks held etc.  Thus, these safe places need to be chosen somehow
>>> and since they are not marked throughout the code, we choose the obvious
>>> one. :-)
>>
>> Why shouldn't locks be held?
>>
>> No locks which are required for suspend must be held, sure.  But
>> otherwise holding locks doesn't matter at all.
>
> If you can provide a way to tell them apart, this would work.

Can you just tell the driver to try to suspend and, if it reports back 
that it failed, back out of the suspend?  Or will the driver deadlock 
instead of reporting a failure if a lock is held?

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Towards eliminating the freezer
  2007-07-23 14:23                                             ` Oliver Neukum
@ 2007-07-23 20:05                                               ` Alan Stern
  2007-07-24  8:21                                                 ` Oliver Neukum
  2007-07-24  9:33                                                 ` Rafael J. Wysocki
  0 siblings, 2 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-23 20:05 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: LKML, linux-pm

[Note changed $SUBJECT]

On Mon, 23 Jul 2007, Oliver Neukum wrote:

> > > > You are correct about the need to delay/stop device addition.  I don't
> > > > know how this can be done in general; each code path calling
> > > > device_add() may have to be treated individually.
> > > 
> > > What about the old API?
> > 
> > What old API do you mean?
> 
> The find_device() stuff.

You mean like bus_find_device() or driver_find_device()?  I don't see 
any problem with them.  They aren't involved in device registration or 
locking.

> > >  Do we have to block module loading?
> > 
> > No.  Registering new drivers is okay, registering new devices is bad.
> 
> What if it is a driver for virtual devices that don't need probe()
> for actual hardware?

Like I said, registering the new driver is okay.  Registering the 
virtual devices could cause a problem.

> > Of course, some modules do want to register a new device in their init 
> > method.  I don't know what we should do about them.  Force the 
> > registration to fail, I suppose.  How often will people suspend while a 
> > module is loading?
> > 
> > > What happens if a scsi error handler is woken? If it cannot be woken,
> > > how are errors handled?
> > 
> > Why should the error handler wake up?  There isn't supposed to be any 
> > I/O going on, hence no errors to handle.
> 
> What about shared busses? Firewire, FibreChannel? They can get external
> resets, etc ...

The same reasoning applies: If no I/O is going on, why should there be
a reset?  If a reset or any other event is generated externally then it
is handled in the kernel by some device driver for the bus, which
should be smart enough not to register new devices or start up an error
handler until I/O is once again permitted.


=============================


Now here's an idea which might work.  Can we require every caller of
device_add() to hold some existing device's semaphore?  Normally it
would be the semaphore of the new device's parent, but it could be a
higher ancestor.  There even could be a single "root" semaphore for
drivers registering a top-level device with no parent.

(Some testing shows that during startup things like ACPI and IDE don't 
fulfill this requirement, so maybe we should require it only after 
userspace has begun running.  After all, the system can't suspend 
until then.)

It seems like a reasonable sort of thing to do.  Hotplugged devices
tend to be registered as they are discovered by their parent's driver,
so it shouldn't be too much to ask that the parent's semaphore be held
when the new device is registered.  Static devices generally aren't
quite so nice; the serial and floppy drivers in particular would need a
little work (and probably some other drivers too).

If we do this, then once the PM core has acquired the semaphore for 
every device it will be guaranteed that no new devices can be added.  
It would be a simple solution to a rather nasty problem.
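
A minimal userspace sketch of this idea, under loud assumptions: the
toy_* names are hypothetical, pthread mutexes stand in for device
semaphores, and registration takes the parent's lock itself (with
trylock, so the example runs in one thread) rather than requiring the
caller to hold it and block.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device; not the kernel's definition. */
struct toy_device {
    struct toy_device *parent;  /* NULL for a top-level device */
    pthread_mutex_t sem;        /* plays the role of the device semaphore */
    bool registered;
};

/* Single "root" lock covering devices registered with no parent. */
static pthread_mutex_t toy_root_sem = PTHREAD_MUTEX_INITIALIZER;

void toy_device_init(struct toy_device *dev, struct toy_device *parent)
{
    assert(dev != NULL);
    dev->parent = parent;
    dev->registered = false;
    pthread_mutex_init(&dev->sem, NULL);
}

/*
 * Register dev under its parent's lock.  A real caller would block here;
 * the sketch uses trylock and reports failure so it can run in one thread.
 */
bool toy_device_add(struct toy_device *dev)
{
    pthread_mutex_t *sem = dev->parent ? &dev->parent->sem : &toy_root_sem;

    if (pthread_mutex_trylock(sem) != 0)
        return false;           /* lock is held, e.g. by the PM core */
    dev->registered = true;
    pthread_mutex_unlock(sem);
    return true;
}

/* PM core side: take the root lock and every registered device's lock. */
void toy_pm_lock_all(struct toy_device **devs, size_t n)
{
    pthread_mutex_lock(&toy_root_sem);
    for (size_t i = 0; i < n; i++)
        pthread_mutex_lock(&devs[i]->sem);
}

/* Release everything in the reverse order it was taken. */
void toy_pm_unlock_all(struct toy_device **devs, size_t n)
{
    for (size_t i = n; i > 0; i--)
        pthread_mutex_unlock(&devs[i - 1]->sem);
    pthread_mutex_unlock(&toy_root_sem);
}
```

While toy_pm_lock_all() holds every lock, any toy_device_add() fails;
once the locks are dropped, registration succeeds again.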

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 19:01                                                     ` david
@ 2007-07-23 20:22                                                       ` Alan Stern
  2007-07-24 13:26                                                         ` Huang, Ying
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-23 20:22 UTC (permalink / raw)
  To: david
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	Ying Huang, LKML, linux-pm

On Mon, 23 Jul 2007 david@lang.hm wrote:

> > For one thing, checking for a suspend-in-progress at the beginning of
> > each and every system call would add overhead to a hot path in the
> > kernel, one which is already very heavily optimized.  People wouldn't
> > stand for it.
> 
> I thought that the suspend stuff did this easily,

It does not do it at all.  Do you know how the freezer works?

>  but the freezer really 
> starts running into trouble when it wants to freeze some things, but not 
> other things.  This seems to be the biggest area of churn and problems.

No.  The freezer starts running into trouble when it wants to freeze a
thread but can't, because that thread is waiting for some event to
occur and the only thread which can cause the event is already frozen.  
Or is itself waiting for a third thread which is already frozen...
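
To make the chain concrete, here is a small sketch (not freezer code;
the toy_* names and the array representation are assumptions made for
illustration).  Each thread records which thread it is waiting on, and
the helper walks that chain looking for an already-frozen thread.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_WAIT (-1)

/*
 * waits_on[i] is the thread that thread i is blocked on, or TOY_NO_WAIT.
 * Returns true if thread t cannot reach a freeze point because it is
 * waiting, directly or through a chain of waiters, on a frozen thread.
 * The walk is bounded by nthreads, so a cycle of unfrozen waiters simply
 * ends the walk and yields false.
 */
bool toy_blocked_on_frozen(const int *waits_on, const bool *frozen,
                           int nthreads, int t)
{
    int steps = 0;
    int cur;

    assert(t >= 0 && t < nthreads);
    cur = waits_on[t];
    while (cur != TOY_NO_WAIT && steps++ < nthreads) {
        if (frozen[cur])
            return true;
        cur = waits_on[cur];
    }
    return false;
}
```

With thread 0 waiting on thread 1 and thread 1 waiting on thread 2,
freezing thread 2 first leaves both 0 and 1 unable to be frozen.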

> > You get similar problems from system calls that wait in kernel mode
> > until something has happened.  For example, a read() call for the
> > console device will wait until somebody types on the keyboard.  At any
> > point in time, many (or even most) user threads are blocked in a system
> > call.
> 
> but are locks held while they are blocked like this?

Sometimes they are, sometimes they aren't.

> > Let's let kernel K1 be the original kernel, the one which is going into
> > hibernation.  Kernel K2 is the one started by kexec to write out the
> > memory image.
> >
> > Your question becomes: Why should K2 jumping back to K1 cause K1
> > immediately to start running user tasks?  Answer: Because K1 has been
> > running user tasks all along (except while K2 was active) and nothing
> > has told it to stop.  In fact, about the only things which _can_ cause
> > K1 to stop running user threads are the freezer (which you want to
> > eliminate) and disabling interrupts (not possible since some drivers
> > require interrupts to be enabled when putting devices in low-power
> > mode).
> 
> when you jump to a body of code you jump to a specific point in the code, 
> not to some nebulous 'everything running' state.

How is that relevant?  When K2 jumps back to K1, it jumps to some 
designated location in K1.  It might be just after the place where K1 
called K2; I'm not familiar with the details of kexec.  In any event, 
K1 will still be in the same state as it was when it called K2.

> > So when K2 starts up, it will have a phase in which user threads don't
> > run.  That doesn't affect K1.  When K2 returns to K1, K1 does not go
> > through this sort of phase.  It simply picks up from where it left off.
> 
> then how can it restart drivers before the user threads need them?

It can't.  Indeed, in the absence of a freezer, user threads will need 
devices (more accurately, will submit I/O requests for devices) that 
have to be kept quiescent or low-power.  Drivers will need to delay 
those requests until the devices are returned to full operation.

That's exactly what I've been saying all along: Drivers will need to 
be changed to delay I/O requests, if there is no freezer.
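
As a rough illustration of what "delay I/O requests" could look like,
here is a userspace sketch.  Everything in it is an assumption: the
toy_* names, the fixed-size queue, and the integer "requests" are
placeholders, and a real driver would block or apply back-pressure
instead of rejecting requests when the queue is full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16

enum toy_io_status { TOY_IO_DONE, TOY_IO_DEFERRED, TOY_IO_REJECTED };

struct toy_driver {
    bool suspended;
    int pending[TOY_QUEUE_MAX];    /* requests held back while suspended */
    size_t npending;
    int completed[TOY_QUEUE_MAX];  /* log of requests that reached the device */
    size_t ncompleted;
};

void toy_driver_init(struct toy_driver *d)
{
    d->suspended = false;
    d->npending = 0;
    d->ncompleted = 0;
}

/* Pretend to carry out one request by appending it to the log. */
static void toy_do_io(struct toy_driver *d, int req)
{
    assert(!d->suspended);         /* never touch a quiescent device */
    if (d->ncompleted < TOY_QUEUE_MAX)
        d->completed[d->ncompleted++] = req;
}

/* Carry out the request now, or queue it if the device is quiescent. */
enum toy_io_status toy_submit(struct toy_driver *d, int req)
{
    if (!d->suspended) {
        toy_do_io(d, req);
        return TOY_IO_DONE;
    }
    if (d->npending == TOY_QUEUE_MAX)
        return TOY_IO_REJECTED;    /* the sketch has no back-pressure */
    d->pending[d->npending++] = req;
    return TOY_IO_DEFERRED;
}

void toy_suspend(struct toy_driver *d)
{
    d->suspended = true;
}

/* Return to full operation, then replay deferred requests in order. */
void toy_resume(struct toy_driver *d)
{
    d->suspended = false;
    for (size_t i = 0; i < d->npending; i++)
        toy_do_io(d, d->pending[i]);
    d->npending = 0;
}
```

User tasks keep calling toy_submit() throughout; nothing reaches the
device between toy_suspend() and toy_resume().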

> > However there still remains the problem of user tasks running after
> > devices are supposed to be quiescent and before K1 starts.  There's
> > currently nothing to stop such tasks from making I/O requests and
> > thereby causing a quiescent device to become active again.
> 
> but if the devices are in low power mode then K1 needs to get them out of 
> low power mode before user tasks try to access them.

No -- which is good because it can't.  If a user task is running
there's no way to stop it from submitting I/O requests.  K1 needs to
delay these requests until after the device has returned to full 
operation.

> > We aren't talking about drivers initializing devices.  We are talking
> > about what happens during the time when drivers are trying to quiesce
> > devices (i.e., before K1 has started up K2) or power them down (after
> > K2 has returned to K1).
> 
> or if you are doing a resume instead of a suspend to ram the drivers need 
> to initialize or otherwise move to full power on K1 before user tasks hit 
> them.

Correct.  User tasks are allowed to submit requests, but the requests 
can't be carried out until the device returns to full operation.

Alan Stern


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 15:23                                                 ` Alan Stern
@ 2007-07-23 21:55                                                   ` Nigel Cunningham
  2007-07-23 22:10                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Nigel Cunningham @ 2007-07-23 21:55 UTC (permalink / raw)
  To: Alan Stern
  Cc: nigel, Paul Mackerras, david, Miklos Szeredi, linux-kernel,
	miltonm, ying.huang, linux-pm, Jeremy Maitin-Shepard

Hi.

On Tuesday 24 July 2007 01:23:15 Alan Stern wrote:
> On Mon, 23 Jul 2007, Nigel Cunningham wrote:
> 
> > Take a step back for a second.
> > 
> > The problem we're facing now is that we're getting some userspace threads, 
> > used in processing I/O, that are functioning as exceptions to the "freeze 
> > userspace, then freezeable kernel threads" rule. They are only exceptions 
> > because of that role in processing I/O - because they're de facto kernel 
> > threads. So, if we orient our thinking more in terms of I/O processing and 
> > less in terms of the userspace/kernelspace distinction, we'll have a 
> > solution:
> > 
> > 1) Freeze processes that aren't fs related (ie stop them generating I/O).
> 
> The problem here is that with things like FUSE, _every_ process is 
> potentially fs related.  Nothing prevents a FUSE thread from doing IPC 
> with any other thread.

Yes, but the fuse thread is going to know what other thread it's doing IPC 
with, so it can get that thread flagged too.

> > 2) Flush pending I/O.
> > 3) Freeze filesystems in reverse order of dependency, the primary purpose 
> > being to stop them generating further I/O on their metadata.
> > 
> > Locks that are being held are only being held because work is being done.
> > If we progressively focus on threads in terms of their create/process work 
> > dependencies, we'll see that the problem isn't at all intractable.
> 
> As has been mentioned before, keeping track of all that dependency 
> information would be very fragile and time-consuming.

I disagree. It's at least going to be less fragile and time-consuming than 
maintaining new/extra code for kexec.

Nigel


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-23 21:55                                                   ` Nigel Cunningham
@ 2007-07-23 22:10                                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 22:10 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Alan Stern, nigel, Paul Mackerras, david, Miklos Szeredi,
	linux-kernel, miltonm, ying.huang, linux-pm,
	Jeremy Maitin-Shepard

On Monday, 23 July 2007 23:55, Nigel Cunningham wrote:
> Hi.
> 
> On Tuesday 24 July 2007 01:23:15 Alan Stern wrote:
> > On Mon, 23 Jul 2007, Nigel Cunningham wrote:
> > 
> > > Take a step back for a second.
> > > 
> > > The problem we're facing now is that we're getting some userspace threads, 
> > > used in processing I/O, that are functioning as exceptions to the "freeze 
> > > userspace, then freezeable kernel threads" rule. They are only exceptions 
> > > because of that role in processing I/O - because they're de facto kernel 
> > > threads. So, if we orient our thinking more in terms of I/O processing and 
> > > less in terms of the userspace/kernelspace distinction, we'll have a 
> > > solution:
> > > 
> > > 1) Freeze processes that aren't fs related (ie stop them generating I/O).
> > 
> > The problem here is that with things like FUSE, _every_ process is 
> > potentially fs related.  Nothing prevents a FUSE thread from doing IPC 
> > with any other thread.
> 
> Yes, but the fuse thread is going to know what other thread it's doing IPC 
> with, so it can get that thread flagged too.

Yes, but that thread may do IPC with yet another one and so on.
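
The "and so on" is a transitive closure.  A sketch of what flagging
would have to compute, assuming (hypothetically) that the IPC
relationships were available as an adjacency matrix, which in practice
they are not; all toy_* names are made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_THREADS 8

/*
 * ipc[a][b] is true if thread a may do IPC with thread b.  Flags every
 * thread reachable from `start` through any chain of IPC links and
 * returns how many threads were flagged (including `start` itself).
 */
int toy_flag_ipc_closure(bool ipc[TOY_MAX_THREADS][TOY_MAX_THREADS],
                         int n, int start, bool flagged[TOY_MAX_THREADS])
{
    int stack[TOY_MAX_THREADS];  /* each thread is pushed at most once */
    int top = 0, count = 0;

    assert(n <= TOY_MAX_THREADS && start >= 0 && start < n);
    for (int i = 0; i < n; i++)
        flagged[i] = false;
    flagged[start] = true;
    stack[top++] = start;

    while (top > 0) {
        int t = stack[--top];

        count++;
        for (int u = 0; u < n; u++) {
            if (ipc[t][u] && !flagged[u]) {
                flagged[u] = true;
                stack[top++] = u;
            }
        }
    }
    return count;
}
```

If almost every thread can reach almost every other one, the flagged
set degenerates into "everything".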

> > > 2) Flush pending I/O.
> > > 3) Freeze filesystems in reverse order of dependency, the primary purpose 
> > > being to stop them generating further I/O on their metadata.
> > > 
> > > Locks that are being held are only being held because work is being done.
> > > If we progressively focus on threads in terms of their create/process work 
> > > dependencies, we'll see that the problem isn't at all intractable.
> > 
> > As has been mentioned before, keeping track of all that dependency 
> > information would be very fragile and time-consuming.
> 
> I disagree. It's at least going to be less fragile and time-consuming than 
> maintaining new/extra code for kexec.

Well, I think the issue is real, so we need to find a solution (the simpler,
the better) and that need not be related to kexec. ;-)

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Towards eliminating the freezer
  2007-07-23 20:05                                               ` Towards eliminating the freezer Alan Stern
@ 2007-07-24  8:21                                                 ` Oliver Neukum
  2007-07-24 14:27                                                   ` Alan Stern
  2007-07-24  9:33                                                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 220+ messages in thread
From: Oliver Neukum @ 2007-07-24  8:21 UTC (permalink / raw)
  To: Alan Stern; +Cc: LKML, linux-pm

Am Montag 23 Juli 2007 schrieb Alan Stern:
> Now here's an idea which might work.  Can we require every caller of
> device_add() to hold some existing device's semaphore?  Normally it
> would be the semaphore of the new device's parent, but it could be a
> higher ancestor.  There even could be a single "root" semaphore for
> drivers registering a top-level device with no parent.

What prevents us from having a device addition semaphore?
Adding a device is not critical to performance, is it?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Towards eliminating the freezer
  2007-07-23 20:05                                               ` Towards eliminating the freezer Alan Stern
  2007-07-24  8:21                                                 ` Oliver Neukum
@ 2007-07-24  9:33                                                 ` Rafael J. Wysocki
  2007-07-24 14:29                                                   ` Alan Stern
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-24  9:33 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, LKML, linux-pm

On Monday, 23 July 2007 22:05, Alan Stern wrote:
> [Note changed $SUBJECT]
[--snip--]
> =============================
> 
> 
> Now here's an idea which might work.  Can we require every caller of
> device_add() to hold some existing device's semaphore?  Normally it
> would be the semaphore of the new device's parent, but it could be a
> higher ancestor.  There even could be a single "root" semaphore for
> drivers registering a top-level device with no parent.
> 
> (Some testing shows that during startup things like ACPI and IDE don't 
> fulfill this requirement, so maybe we should require it only after 
> userspace has begun running.  After all, the system can't suspend 
> until then.)
> 
> It seems like a reasonable sort of thing to do.  Hotplugged devices
> tend to be registered as they are discovered by their parent's driver,
> so it shouldn't be too much to ask that the parent's semaphore be held
> when the new device is registered.  Static devices generally aren't
> quite so nice; the serial and floppy drivers in particular would need a
> little work (and probably some other drivers too).
> 
> If we do this, then once the PM core has acquired the semaphore for 
> every device it will be guaranteed that no new devices can be added.  
> It would be a simple solution to a rather nasty problem.

Hmm, in device_pm_add() and device_pm_remove() we acquire dpm_list_mtx which
also is acquired by device_suspend() and device_resume().  Thus, every attempt
to register a new device or unregister an existing one will be blocked while
either device_suspend() or device_resume() is running.

If we arrange things so that dpm_list_mtx is acquired, but not released, by
device_suspend() and released, but not acquired, by device_resume(), then it
won't be possible to register/unregister a device during a suspend-resume
cycle.
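
A userspace sketch of that arrangement, with assumptions stated up
front: the toy_* functions are stand-ins, not the real
device_pm_add()/device_suspend()/device_resume(); a pthread mutex plays
the role of dpm_list_mtx; suspend and resume run in the same thread;
and registration uses trylock so the example can show the refusal
instead of blocking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t toy_dpm_list_mtx = PTHREAD_MUTEX_INITIALIZER;
static bool toy_in_cycle;   /* true between suspend and resume */
static int toy_ndevices;

/* Stand-in for device_pm_add(): needs the list mutex to touch the list. */
bool toy_device_pm_add(void)
{
    if (pthread_mutex_trylock(&toy_dpm_list_mtx) != 0)
        return false;       /* a suspend-resume cycle is in progress */
    toy_ndevices++;
    pthread_mutex_unlock(&toy_dpm_list_mtx);
    return true;
}

/* Acquires the mutex and deliberately returns with it still held. */
void toy_device_suspend(void)
{
    pthread_mutex_lock(&toy_dpm_list_mtx);
    toy_in_cycle = true;
    /* ... suspend every device on the list ... */
}

/* Releases the mutex that toy_device_suspend() left held. */
void toy_device_resume(void)
{
    assert(toy_in_cycle);   /* only valid after toy_device_suspend() */
    /* ... resume every device on the list ... */
    toy_in_cycle = false;
    pthread_mutex_unlock(&toy_dpm_list_mtx);
}

int toy_device_count(void)
{
    return toy_ndevices;
}
```

The list cannot change between toy_device_suspend() and
toy_device_resume(), at the cost of one lock being held across the
whole cycle.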

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* RE: [linux-pm] Re: Hibernation considerations
  2007-07-23 20:22                                                       ` Alan Stern
@ 2007-07-24 13:26                                                         ` Huang, Ying
  2007-07-24 14:50                                                           ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: Huang, Ying @ 2007-07-24 13:26 UTC (permalink / raw)
  To: Alan Stern, david
  Cc: Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki, LKML,
	linux-pm, nigel, Pavel Machek

>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>It can't.  Indeed, in the absence of a freezer, user threads will need
>devices (more accurately, will submit I/O requests for devices) that
>have to be kept quiescent or low-power.  Drivers will need to delay
>those requests until the devices are returned to full operation.
>
>That's exactly what I've been saying all along: Drivers will need to
>be changed to delay I/O requests, if there is no freezer.

If it is too much work to implement "delaying I/O requests" in every
driver, would it be possible to implement it as follows:

1. A suspend to RAM/disk is triggered.
2. Replace the driver related syscall entries (such as sys_read,
sys_write, sys_ioctl, etc) in sys_call_table with special wrapper
entries provided by "suspend to RAM/DISK" subsystem, which will delay
I/O requests if appropriate.
3. When devices are quiesced, they are put into "low power" state and
system is put into suspend state; or the image is written to disk
(through snapshot/uswsusp or kexeced kernel).
4. After resuming from RAM/DISK, devices are put into "normal" state and
the syscall entries replaced in step 2 are restored.
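
The steps above can be sketched in userspace as a table of function
pointers that is swapped and later restored.  This is only a model
under stated assumptions: toy_table is a hypothetical stand-in for
sys_call_table, the handlers are dummies, and the wrapper returns a
"deferred" code where a real wrapper would have to make the caller wait
until resume.

```c
#include <assert.h>
#include <stdbool.h>

typedef long (*toy_syscall_fn)(long arg);

enum { TOY_SYS_READ, TOY_SYS_WRITE, TOY_NR_SYSCALLS };

#define TOY_DEFERRED (-1L)

/* Dummy handlers standing in for the driver-related syscalls. */
static long toy_sys_read(long arg)  { return arg; }
static long toy_sys_write(long arg) { return arg * 2; }

/* Wrapper installed during suspend (step 2); it never reaches a driver. */
static long toy_deferred(long arg)
{
    (void)arg;
    return TOY_DEFERRED;
}

static toy_syscall_fn toy_table[TOY_NR_SYSCALLS] = {
    toy_sys_read, toy_sys_write
};
static toy_syscall_fn toy_saved[TOY_NR_SYSCALLS];
static bool toy_wrapped;

/* Step 2: save the real entries and install the wrappers. */
void toy_suspend_begin(void)
{
    if (toy_wrapped)
        return;             /* keep the saved entries intact */
    for (int i = 0; i < TOY_NR_SYSCALLS; i++) {
        toy_saved[i] = toy_table[i];
        toy_table[i] = toy_deferred;
    }
    toy_wrapped = true;
}

/* Step 4: put the original entries back. */
void toy_suspend_end(void)
{
    if (!toy_wrapped)
        return;
    for (int i = 0; i < TOY_NR_SYSCALLS; i++)
        toy_table[i] = toy_saved[i];
    toy_wrapped = false;
}

/* Dispatch through the table, as a syscall entry path would. */
long toy_syscall(int nr, long arg)
{
    assert(nr >= 0 && nr < TOY_NR_SYSCALLS);
    return toy_table[nr](arg);
}
```

The sketch says nothing about tasks that are already inside a syscall
when the table is swapped; those would still need to be handled some
other way.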

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Towards eliminating the freezer
  2007-07-24  8:21                                                 ` Oliver Neukum
@ 2007-07-24 14:27                                                   ` Alan Stern
  0 siblings, 0 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-24 14:27 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: LKML, linux-pm

On Tue, 24 Jul 2007, Oliver Neukum wrote:

> Am Montag 23 Juli 2007 schrieb Alan Stern:
> > Now here's an idea which might work.  Can we require every caller of
> > device_add() to hold some existing device's semaphore?  Normally it
> > would be the semaphore of the new device's parent, but it could be a
> > higher ancestor.  There even could be a single "root" semaphore for
> > drivers registering a top-level device with no parent.
> 
> What prevents us from having a device addition semaphore?
> Adding a device is not critical to performance, is it?

It would create a locking order violation.  Many drivers hold a device
semaphore while registering a child device, so they would acquire your
new semaphore while holding a device sem.  But the PM core needs to
prevent registration while calling suspend() methods, so it would need
to acquire the device sems while holding your new semaphore.
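
To make the inversion concrete, here is a minimal userspace sketch of
an order checker, loosely in the spirit of lock-order validation but
not derived from any kernel implementation.  The lock classes and the
functions order_reset() and order_acquire() are hypothetical names
chosen for this illustration.  It records which class was taken while
another was held and reports when the opposite order shows up.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for this model; not kernel identifiers. */
enum lock_class { DEVICE_SEM, ADD_SEM, NUM_CLASSES };

/* seen[a][b] is true once some path took class b while holding class a. */
static bool seen[NUM_CLASSES][NUM_CLASSES];

/* Forget all recorded orderings. */
static void order_reset(void)
{
	memset(seen, 0, sizeof(seen));
}

/*
 * Record "inner acquired while outer is held".  Returns false if the
 * opposite ordering was recorded earlier (an AB-BA inversion that can
 * deadlock when both paths run concurrently), or if outer == inner.
 */
static bool order_acquire(enum lock_class outer, enum lock_class inner)
{
	assert(outer < NUM_CLASSES && inner < NUM_CLASSES);
	seen[outer][inner] = true;
	return !seen[inner][outer];
}
```

A driver registering a child would call order_acquire(DEVICE_SEM,
ADD_SEM); the PM core would call order_acquire(ADD_SEM, DEVICE_SEM),
and that second call is the one reported as an inversion.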

Alan Stern



* Re: Towards eliminating the freezer
  2007-07-24  9:33                                                 ` Rafael J. Wysocki
@ 2007-07-24 14:29                                                   ` Alan Stern
  2007-07-24 15:24                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-24 14:29 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Oliver Neukum, LKML, linux-pm

On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:

> > Now here's an idea which might work.  Can we require every caller of
> > device_add() to hold some existing device's semaphore?  Normally it
> > would be the semaphore of the new device's parent, but it could be a
> > higher ancestor.  There even could be a single "root" semaphore for
> > drivers registering a top-level device with no parent.
> > 
> > (Some testing shows that during startup things like ACPI and IDE don't 
> > fulfill this requirement, so maybe we should require it only after 
> > userspace has begun running.  After all, the system can't suspend 
> > until then.)
> > 
> > It seems like a reasonable sort of thing to do.  Hotplugged devices
> > tend to be registered as they are discovered by their parent's driver,
> > so it shouldn't be too much to ask that the parent's semaphore be held
> > when the new device is registered.  Static devices generally aren't
> > quite so nice; the serial and floppy drivers in particular would need a
> > little work (and probably some other drivers too).
> > 
> > If we do this, then once the PM core has acquired the semaphore for 
> > every device it will be guaranteed that no new devices can be added.  
> > It would be a simple solution to a rather nasty problem.
> 
> Hmm, in device_pm_add() and device_pm_remove() we acquire dpm_list_mtx which
> also is acquired by device_suspend() and device_resume().  Thus, every attempt
> to register a new device or unregister an existing one will be blocked while
> either device_suspend() or device_resume() is running.
> 
> If we arrange things so that dpm_list_mtx is acquired, but not released, by
> device_suspend() and released, but not acquired, by device_resume(), then it
> won't be possible to register/unregister a device during a suspend-resume
> cycle.

As with Oliver's suggestion, this would create a locking order 
violation.  Drivers registering children (and thus acquiring 
dpm_list_mtx) will often already hold the parent's sem.  But 
device_suspend() needs to acquire device sems while holding 
dpm_list_mtx.

Alan Stern



* RE: [linux-pm] Re: Hibernation considerations
  2007-07-24 13:26                                                         ` Huang, Ying
@ 2007-07-24 14:50                                                           ` Alan Stern
  0 siblings, 0 replies; 220+ messages in thread
From: Alan Stern @ 2007-07-24 14:50 UTC (permalink / raw)
  To: Huang, Ying
  Cc: david, Jeremy Maitin-Shepard, Milton Miller, Rafael J. Wysocki,
	LKML, linux-pm, nigel, Pavel Machek

On Tue, 24 Jul 2007, Huang, Ying wrote:

> >From: Alan Stern [mailto:stern@rowland.harvard.edu]
> >It can't.  Indeed, in the absence of a freezer, user threads will need
> >devices (more accurately, will submit I/O requests for devices) that
> >have to be kept quiescent or low-power.  Drivers will need to delay
> >those requests until the devices are returned to full operation.
> >
> >That's exactly what I've been saying all along: Drivers will need to
> >be changed to delay I/O requests, if there is no freezer.
> 
> If implementing "delaying I/O requests" for every driver is too much
> work, would it be possible to implement it as follows:
> 
> 1. It is triggered to suspend to RAM/DISK.
> 2. Replace the driver related syscall entries (such as sys_read,
> sys_write, sys_ioctl, etc) in sys_call_table with special wrapper
> entries provided by "suspend to RAM/DISK" subsystem, which will delay
> I/O requests if appropriate.
> 3. When devices are quiesced, they are put into "low power" state and
> system is put into suspend state; or the image is written to disk
> (through snapshot/uswsusp or kexeced kernel).
> 4. After resuming from RAM/DISK, devices are put into "normal" state and
> the syscall entries replaced in step 2 are restored.

Ha!  I made exactly this same suggestion (URL lost in the mists of 
time), except that I proposed changing the syscall entries for every 
system call, not just the driver-related ones.

Nobody seemed to think it would work very well.

It leaves a few loose ends.  For example, suppose a user thread is 
already in the middle of a system call and is about to start doing some 
I/O (maybe it's waiting for a timer to expire).

In the end, this doesn't seem to be very different from freezing all 
user threads.

Alan Stern



* Re: Towards eliminating the freezer
  2007-07-24 14:29                                                   ` Alan Stern
@ 2007-07-24 15:24                                                     ` Rafael J. Wysocki
  2007-07-24 16:06                                                       ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-24 15:24 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, LKML, linux-pm

On Tuesday, 24 July 2007 16:29, Alan Stern wrote:
> On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:
> 
> > > Now here's an idea which might work.  Can we require every caller of
> > > device_add() to hold some existing device's semaphore?  Normally it
> > > would be the semaphore of the new device's parent, but it could be a
> > > higher ancestor.  There even could be a single "root" semaphore for
> > > drivers registering a top-level device with no parent.
> > > 
> > > (Some testing shows that during startup things like ACPI and IDE don't 
> > > fulfill this requirement, so maybe we should require it only after 
> > > userspace has begun running.  After all, the system can't suspend 
> > > until then.)
> > > 
> > > It seems like a reasonable sort of thing to do.  Hotplugged devices
> > > tend to be registered as they are discovered by their parent's driver,
> > > so it shouldn't be too much to ask that the parent's semaphore be held
> > > when the new device is registered.  Static devices generally aren't
> > > quite so nice; the serial and floppy drivers in particular would need a
> > > little work (and probably some other drivers too).
> > > 
> > > If we do this, then once the PM core has acquired the semaphore for 
> > > every device it will be guaranteed that no new devices can be added.  
> > > It would be a simple solution to a rather nasty problem.
> > 
> > Hmm, in device_pm_add() and device_pm_remove() we acquire dpm_list_mtx which
> > also is acquired by device_suspend() and device_resume().  Thus, every attempt
> > to register a new device or unregister an existing one will be blocked while
> > either device_suspend() or device_resume() is running.
> > 
> > If we arrange things so that dpm_list_mtx is acquired, but not released, by
> > device_suspend() and released, but not acquired, by device_resume(), then it
> > won't be possible to register/unregister a device during a suspend-resume
> > cycle.
> 
> As with Oliver's suggestion, this would create a locking order 
> violation.  Drivers registering children (and thus acquiring 
> dpm_list_mtx) will often already hold the parent's sem.  But 
> device_suspend() needs to acquire device sems while holding 
> dpm_list_mtx.

Hmm, but this is done already (ie. device_suspend() acquires device sems
while holding dpm_list_mtx in the current code).

What I'm suggesting is not to let device_suspend() release dpm_list_mtx
when it's finished.  The appended patch illustrates what I mean.

Greetings,
Rafael


---
dpm_list_mtx is used by the PM core to prevent device objects from being
added/removed while device_suspend() and device_resume() are running.  However,
it should also be impossible to add a device after device_suspend() has
finished, because at that time the dpm_active list is empty and adding new
devices to it would break the ordering of devices during the next suspend.
Thus, it seems reasonable to leave device_suspend() with dpm_list_mtx held
and release it at the end of device_resume().

In that case device_suspend() and device_resume() cannot be run concurrently
and dpm_mtx is no longer needed.  Also, it's a bug to run device_resume() after
a failing device_suspend(), so the APM code needs to be updated.

---
 arch/i386/kernel/apm.c       |    5 ++++-
 drivers/base/power/main.c    |    1 -
 drivers/base/power/power.h   |    5 -----
 drivers/base/power/resume.c  |    3 ---
 drivers/base/power/suspend.c |    3 ---
 5 files changed, 4 insertions(+), 13 deletions(-)

Index: linux-2.6.23-rc1/arch/i386/kernel/apm.c
===================================================================
--- linux-2.6.23-rc1.orig/arch/i386/kernel/apm.c	2007-07-23 22:28:35.000000000 +0200
+++ linux-2.6.23-rc1/arch/i386/kernel/apm.c	2007-07-24 11:04:45.000000000 +0200
@@ -1202,7 +1202,9 @@ static int suspend(int vetoable)
 		printk(KERN_CRIT "apm: suspend was vetoed, but suspending anyway.\n");
 	}
 
-	device_suspend(PMSG_SUSPEND);
+	err = device_suspend(PMSG_SUSPEND);
+	if (err)
+		goto send_resume;
 	local_irq_disable();
 	device_power_down(PMSG_SUSPEND);
 
@@ -1224,6 +1226,7 @@ static int suspend(int vetoable)
 	device_power_up();
 	local_irq_enable();
 	device_resume();
+ send_resume:
 	pm_send_all(PM_RESUME, (void *)0);
 	queue_event(APM_NORMAL_RESUME, NULL);
  out:
Index: linux-2.6.23-rc1/drivers/base/power/resume.c
===================================================================
--- linux-2.6.23-rc1.orig/drivers/base/power/resume.c	2007-07-23 22:06:42.000000000 +0200
+++ linux-2.6.23-rc1/drivers/base/power/resume.c	2007-07-24 11:18:04.000000000 +0200
@@ -72,7 +72,6 @@ static int resume_device_early(struct de
  */
 void dpm_resume(void)
 {
-	mutex_lock(&dpm_list_mtx);
 	while(!list_empty(&dpm_off)) {
 		struct list_head * entry = dpm_off.next;
 		struct device * dev = to_device(entry);
@@ -99,9 +98,7 @@ void dpm_resume(void)
 void device_resume(void)
 {
 	might_sleep();
-	mutex_lock(&dpm_mtx);
 	dpm_resume();
-	mutex_unlock(&dpm_mtx);
 }
 
 EXPORT_SYMBOL_GPL(device_resume);
Index: linux-2.6.23-rc1/drivers/base/power/suspend.c
===================================================================
--- linux-2.6.23-rc1.orig/drivers/base/power/suspend.c	2007-07-23 22:06:42.000000000 +0200
+++ linux-2.6.23-rc1/drivers/base/power/suspend.c	2007-07-24 11:17:52.000000000 +0200
@@ -127,7 +127,6 @@ int device_suspend(pm_message_t state)
 	int error = 0;
 
 	might_sleep();
-	mutex_lock(&dpm_mtx);
 	mutex_lock(&dpm_list_mtx);
 	while (!list_empty(&dpm_active) && error == 0) {
 		struct list_head * entry = dpm_active.prev;
@@ -153,11 +152,9 @@ int device_suspend(pm_message_t state)
 				error == -EAGAIN ? " (please convert to suspend_late)" : "");
 		put_device(dev);
 	}
-	mutex_unlock(&dpm_list_mtx);
 	if (error)
 		dpm_resume();
 
-	mutex_unlock(&dpm_mtx);
 	return error;
 }
 
Index: linux-2.6.23-rc1/drivers/base/power/main.c
===================================================================
--- linux-2.6.23-rc1.orig/drivers/base/power/main.c	2007-07-23 22:06:42.000000000 +0200
+++ linux-2.6.23-rc1/drivers/base/power/main.c	2007-07-24 11:17:16.000000000 +0200
@@ -28,7 +28,6 @@ LIST_HEAD(dpm_active);
 LIST_HEAD(dpm_off);
 LIST_HEAD(dpm_off_irq);
 
-DEFINE_MUTEX(dpm_mtx);
 DEFINE_MUTEX(dpm_list_mtx);
 
 int (*platform_enable_wakeup)(struct device *dev, int is_on);
Index: linux-2.6.23-rc1/drivers/base/power/power.h
===================================================================
--- linux-2.6.23-rc1.orig/drivers/base/power/power.h	2007-07-23 22:06:42.000000000 +0200
+++ linux-2.6.23-rc1/drivers/base/power/power.h	2007-07-24 11:17:33.000000000 +0200
@@ -12,11 +12,6 @@ extern void device_shutdown(void);
  */
 
 /*
- * Used to synchronize global power management operations.
- */
-extern struct mutex dpm_mtx;
-
-/*
  * Used to serialize changes to the dpm_* lists.
  */
 extern struct mutex dpm_list_mtx;


* Re: Towards eliminating the freezer
  2007-07-24 15:24                                                     ` Rafael J. Wysocki
@ 2007-07-24 16:06                                                       ` Alan Stern
  2007-07-24 19:20                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-24 16:06 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Oliver Neukum, LKML, linux-pm

On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:

> > As with Oliver's suggestion, this would create a locking order 
> > violation.  Drivers registering children (and thus acquiring 
> > dpm_list_mtx) will often already hold the parent's sem.  But 
> > device_suspend() needs to acquire device sems while holding 
> > dpm_list_mtx.
> 
> Hmm, but this is done already (ie. device_suspend() acquires device sems
> while holding dpm_list_mtx in the current code).
> 
> What I'm suggesting is not to let device_suspend() release dpm_list_mtx
> when it's finished.  The appended patch illustrates what I mean.

Oh, okay, I see what you mean.

I should have explained earlier that my proposal was meant to be in the 
context of a previous discussion, where I suggested that 
device_suspend() should go through a preliminary step of acquiring all 
the device semaphores.  This would have the beneficial effect of 
blocking all attempts at driver binding or unbinding while a suspend is 
underway.

Still, this isn't a bad approach.  Maybe the following algorithm could 
be used:

 get_more:
	For each device on dpm_list
		Acquire dev->sem
		Move dev from dpm_list to a temporary list
	Lock dpm_list_mutex
	If (!list_empty(dpm_list)) {
		Unlock dpm_list_mutex
		Goto get_more
	}

(The "For each" loop would have to be written carefully to allow for 
device removal.)

The total number of iterations should never be large.  At the end the 
PM core would own all the device semaphores and no more devices could 
be added.  Then it would be safe to call each device's suspend() 
method.

This will remove one of the barriers to eliminating the freezer.

Alan Stern



* Re: Towards eliminating the freezer
  2007-07-24 16:06                                                       ` Alan Stern
@ 2007-07-24 19:20                                                         ` Rafael J. Wysocki
  2007-07-24 20:24                                                           ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-24 19:20 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, LKML, linux-pm

On Tuesday, 24 July 2007 18:06, Alan Stern wrote:
> On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:
> 
> > > As with Oliver's suggestion, this would create a locking order 
> > > violation.  Drivers registering children (and thus acquiring 
> > > dpm_list_mtx) will often already hold the parent's sem.  But 
> > > device_suspend() needs to acquire device sems while holding 
> > > dpm_list_mtx.
> > 
> > Hmm, but this is done already (ie. device_suspend() acquires device sems
> > while holding dpm_list_mtx in the current code).
> > 
> > What I'm suggesting is not to let device_suspend() release dpm_list_mtx
> > when it's finished.  The appended patch illustrates what I mean.
> 
> Oh, okay, I see what you mean.
> 
> I should have explained earlier that my proposal was meant to be in the 
> context of a previous discussion, where I suggested that 
> device_suspend() should go through a preliminary step of acquiring all 
> the device semaphores.  This would have the beneficial effect of 
> blocking all attempts at driver binding or unbinding while a suspend is 
> underway.
> 
> Still, this isn't a bad approach.  Maybe the following algorithm could 
> be used:
> 
>  get_more:
> 	For each device on dpm_list
> 		Acquire dev->sem
> 		Move dev from dpm_list to a temporary list
> 	Lock dpm_list_mutex
> 	If (!list_empty(dpm_list)) {
> 		Unlock dpm_list_mutex
> 		Goto get_more
> 	}
> 
> (The "For each" loop would have to be written carefully to allow for 
> device removal.)

Hmm, I still don't understand why we can't lock dpm_list_mutex before the
"For each" loop (we already do something like this in device_suspend() and
device_resume()) and that would simplify things.

It seems that we can do something like this:

device_suspend:
	Lock dpm_list_mutex (from now on, new devices cannot be added)
	For each device on dpm_active, reverse
		acquire dev->sem (from now on, no new drivers can bind to dev)
		suspend(dev)
		move dev to dpm_off

device_resume:
	For each device on dpm_off
		move dev to dpm_active
		resume(dev) (this cannot fail)
		release dev->sem (allow new drivers to bind to dev)
	Unlock dpm_list_mutex (allow new devices to be added)

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Towards eliminating the freezer
  2007-07-24 19:20                                                         ` Rafael J. Wysocki
@ 2007-07-24 20:24                                                           ` Alan Stern
  2007-07-24 21:14                                                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-24 20:24 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Oliver Neukum, LKML, linux-pm

On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:

> Hmm, I still don't understand why we can't lock dpm_list_mutex before the
> "For each" loop (we already do something like this in device_suspend() and
> device_resume()) and that would simplify things.
> 
> It seems that we can do something like this:
> 
> device_suspend:
> 	Lock dpm_list_mutex (from now on, new devices cannot be added)
> 	For each device on dpm_active, reverse
> 		acquire dev->sem (from now on, no new drivers can bind to dev)
> 		suspend(dev)
> 		move dev to dpm_off

You have a minor error there; it's necessary to unlock dpm_list_mutex 
while acquiring dev->sem and then lock it again.  But more importantly, 
this code acquires the device semaphores in the wrong order.  They have 
to be acquired going forward (from the top of the device tree down), 
not backward.


Here's my proposal in a more explicit form.  Before doing
device_suspend() we call lock_all_devices():

LIST_HEAD(dpm_locked);

static void lock_all_devices(void)
{
	mutex_lock(&dpm_list_mtx);
	while (!list_empty(&dpm_active)) {
		struct list_head *entry = dpm_active.next;
		struct device *dev = to_device(entry);

		get_device(dev);
		mutex_unlock(&dpm_list_mtx);
		down(&dev->sem);
		mutex_lock(&dpm_list_mtx);

		if (list_empty(entry))		/* Device was removed */
			up(&dev->sem);
		else			/* Move it to the dpm_locked list */
			list_move_tail(entry, &dpm_locked);
		put_device(dev);
	}
}

Then device_suspend() can be simplified:

int device_suspend(pm_message_t state)
{
	int error = 0;

	might_sleep();
	list_for_each_entry_reverse(dev, &dpm_locked, power.entry) {
		error = suspend_device(dev, state);

		if (error) {
			printk(KERN_ERR "Could not suspend device %s: "
				"error %d%s\n",
				kobject_name(&dev->kobj), error,
				error == -EAGAIN ? " (please convert to suspend_late)" : "");
			break;
		}
		list_move(&dev->power.entry, &dpm_off);
	}
	if (error)
		dpm_resume();
	return error;
}

Appropriate changes are needed in the resume pathway as well, together 
with an unlock_all_devices() routine:

static void unlock_all_devices(void)
{
	while (!list_empty(&dpm_locked)) {
		struct list_head *entry = dpm_locked.prev;
		struct device *dev = to_device(entry);

		list_move(entry, &dpm_active);
		up(&dev->sem);
	}
	mutex_unlock(&dpm_list_mtx);
}


Incidentally, what is dpm_mtx for?  It doesn't seem to do anything 
useful.  Is it a relic of the former runtime PM support?

Alan Stern



* Re: Towards eliminating the freezer
  2007-07-24 20:24                                                           ` Alan Stern
@ 2007-07-24 21:14                                                             ` Rafael J. Wysocki
  2007-07-24 22:14                                                               ` Alan Stern
  0 siblings, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-24 21:14 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, LKML, linux-pm

On Tuesday, 24 July 2007 22:24, Alan Stern wrote:
> On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:
> 
> > Hmm, I still don't understand why we can't lock dpm_list_mutex before the
> > "For each" loop (we already do something like this in device_suspend() and
> > device_resume()) and that would simplify things.
> > 
> > It seems that we can do something like this:
> > 
> > device_suspend:
> > 	Lock dpm_list_mutex (from now on, new devices cannot be added)
> > 	For each device on dpm_active, reverse
> > 		acquire dev->sem (from now on, no new drivers can bind to dev)
> > 		suspend(dev)
> > 		move dev to dpm_off
> 
> You have a minor error there; it's necessary to unlock dpm_list_mutex 
> while acquiring dev->sem and then lock it again.

Ah, right, now I see that.

> But more importantly, this code acquires the device semaphores in the wrong
> order.  They have to be acquired going forward (from the top of the device
> tree down), not backward.

Yes, I've overlooked that too.

> Here's my proposal in a more explicit form.  Before doing
> device_suspend() we call lock_all_devices():
> 
> LIST_HEAD(dpm_locked);
> 
> static void lock_all_devices(void)
> {
> 	mutex_lock(&dpm_list_mtx);
> 	while (!list_empty(&dpm_active)) {
> 		struct list_head *entry = dpm_active.next;
> 		struct device *dev = to_device(entry);
> 
> 		get_device(dev);
> 		mutex_unlock(&dpm_list_mtx);
> 		down(&dev->sem);
> 		mutex_lock(&dpm_list_mtx);
> 
> 		if (list_empty(entry))		/* Device was removed */
> 			up(&dev->sem);
> 		else			/* Move it to the dpm_locked list */
> 			list_move_tail(entry, &dpm_locked);
> 		put_device(dev);
> 	}
> }
> 
> Then device_suspend() can be simplified:
> 
> int device_suspend(pm_message_t state)
> {
> 	int error = 0;
> 
> 	might_sleep();
> 	list_for_each_entry_reverse(dev, &dpm_locked, power.entry) {
> 		error = suspend_device(dev, state);
> 
> 		if (error) {
> 			printk(KERN_ERR "Could not suspend device %s: "
> 				"error %d%s\n",
> 				kobject_name(&dev->kobj), error,
> 				error == -EAGAIN ? " (please convert to suspend_late)" : "");
> 			break;
> 		}
> 		list_move(&dev->power.entry, &dpm_off);

Is that safe with list_for_each_entry_reverse?

> 	}
> 	if (error)
> 		dpm_resume();
> 	return error;
> }
> 
> Appropriate changes are needed in the resume pathway as well, together 
> with an unlock_all_devices() routine:

Sure.

> static void unlock_all_devices(void)
> {
> 	while (!list_empty(&dpm_locked)) {
> 		struct list_head *entry = dpm_locked.prev;
> 		struct device *dev = to_device(entry);
> 
> 		list_move(entry, &dpm_active);
> 		up(&dev->sem);
> 	}
> 	mutex_unlock(&dpm_list_mtx);
> }

Yes, that looks fine. 

So, who's writing the patch? ;-)

> Incidentally, what is dpm_mtx for?  It doesn't seem to do anything 
> useful.  Is it a relic of the former runtime PM support?

I think so.  IMO it can be removed.

I also think it would be nicer to have all of the functions in
drivers/base/power/{main|suspend|resume}.c moved to one file.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Towards eliminating the freezer
  2007-07-24 21:14                                                             ` Rafael J. Wysocki
@ 2007-07-24 22:14                                                               ` Alan Stern
  2007-07-25 12:23                                                                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Alan Stern @ 2007-07-24 22:14 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Oliver Neukum, LKML, linux-pm

On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:

> > Then device_suspend() can be simplified:
> > 
> > int device_suspend(pm_message_t state)
> > {
> > 	int error = 0;
> > 
> > 	might_sleep();
> > 	list_for_each_entry_reverse(dev, &dpm_locked, power.entry) {
> > 		error = suspend_device(dev, state);
> > 
> > 		if (error) {
> > 			printk(KERN_ERR "Could not suspend device %s: "
> > 				"error %d%s\n",
> > 				kobject_name(&dev->kobj), error,
> > 				error == -EAGAIN ? " (please convert to suspend_late)" : "");
> > 			break;
> > 		}
> > 		list_move(&dev->power.entry, &dpm_off);
> 
> Is that safe with list_for_each_entry_reverse?

No.  I guess it'll have to resemble the other code.
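
For illustration, a minimal userspace sketch of that loop shape: keep
taking the tail of dpm_locked until the list is empty, so that moving
an entry to dpm_off never invalidates an iterator.  The list helpers
are simplified stand-ins for the kernel's list.h, and the fail field,
list_add_back() and suspend_locked_devices() are hypothetical names
for this model.  The resume-on-error path is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };
#define LIST_HEAD_INIT(name) { &(name), &(name) }

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Add entry at the tail of the list (hypothetical helper). */
static void list_add_back(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Like the kernel's list_move(): unlink entry, re-add it at the head. */
static void list_move(struct list_head *entry, struct list_head *head)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

struct device {
	struct list_head entry;
	int fail;		/* hypothetical: nonzero makes suspend fail */
	int suspended;
};

#define to_device(ptr) \
	((struct device *)((char *)(ptr) - offsetof(struct device, entry)))

static struct list_head dpm_locked = LIST_HEAD_INIT(dpm_locked);
static struct list_head dpm_off = LIST_HEAD_INIT(dpm_off);

static int suspend_device(struct device *dev)
{
	if (dev->fail)
		return dev->fail;
	dev->suspended = 1;
	return 0;
}

/* Suspend from the tail (children first); stop at the first error. */
static int suspend_locked_devices(void)
{
	int error = 0;

	while (!list_empty(&dpm_locked) && error == 0) {
		struct list_head *entry = dpm_locked.prev;
		struct device *dev = to_device(entry);

		error = suspend_device(dev);
		if (!error)
			list_move(entry, &dpm_off);
	}
	return error;
}
```

Because each suspended device is re-added at the head of dpm_off, that
list ends up in the same forward order dpm_locked had, which is the
order a resume pass would want to walk.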

> Yes, that looks fine. 
> 
> So, who's writing the patch? ;-)

I can do it.  You haven't made any changes to this part of the code, 
have you?  My work tends to be based on Linus's tree, not -mm.

Something to watch out for: With all the extra locking, we run the risk
of blocking the keventd workqueue.  This may or may not matter, but to
be safe perhaps there should be a new general-purpose workqueue which
_expects_ to block (or freeze) during suspends.  Any work routine that 
involves adding or removing a device should go on the new workqueue.

> > Incidentally, what is dpm_mtx for?  It doesn't seem to do anything 
> > useful.  Is it a relic of the former runtime PM support?
> 
> I think so.  IMO it can be removed.
> 
> I also think it would be nicer to have all of the functions in
> drivers/base/power/{main|suspend|resume}.c moved to one file.

Yes, they are all similar enough that there isn't much point keeping 
them separate.

Alan Stern



* Re: Towards eliminating the freezer
  2007-07-24 22:14                                                               ` Alan Stern
@ 2007-07-25 12:23                                                                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-25 12:23 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, LKML, linux-pm

On Wednesday, 25 July 2007 00:14, Alan Stern wrote:
> On Tue, 24 Jul 2007, Rafael J. Wysocki wrote:
> 
> > > Then device_suspend() can be simplified:
> > > 
> > > int device_suspend(pm_message_t state)
> > > {
> > > 	int error = 0;
> > > 
> > > 	might_sleep();
> > > 	list_for_each_entry_reverse(dev, &dpm_locked, power.entry) {
> > > 		error = suspend_device(dev, state);
> > > 
> > > 		if (error) {
> > > 			printk(KERN_ERR "Could not suspend device %s: "
> > > 				"error %d%s\n",
> > > 				kobject_name(&dev->kobj), error,
> > > 				error == -EAGAIN ? " (please convert to suspend_late)" : "");
> > > 			break;
> > > 		}
> > > 		list_move(&dev->power.entry, &dpm_off);
> > 
> > Is that safe with list_for_each_entry_reverse?
> 
> No.  I guess it'll have to resemble the other code.
> 
> > Yes, that looks fine. 
> > 
> > So, who's writing the patch? ;-)
> 
> I can do it.  You haven't made any changes to this part of the code, 
> have you?

Yes, I have, quite recently. :-)

> My work tends to be based on Linus's tree, not -mm. 

At the moment they are pretty much in line, at least as far as this code is
concerned.  Anyway, I'm trying to keep track of PM-related patches,
at http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.23-rc1/

> Something to watch out for: With all the extra locking, we run the risk
> of blocking the keventd workqueue.  This may or may not matter, but to
> be safe perhaps there should be a new general-purpose workqueue which
> _expects_ to block (or freeze) during suspends.  Any work routine that 
> involves adding or removing a device should go on the new workqueue.

Yes, this sounds like a good idea.  Still, I think we can check if there are
problems with the keventd workqueue alone, first.

> > > Incidentally, what is dpm_mtx for?  It doesn't seem to do anything 
> > > useful.  Is it a relic of the former runtime PM support?
> > 
> > I think so.  IMO it can be removed.
> > 
> > I also think it would be nicer to have all of the functions in
> > drivers/base/power/{main|suspend|resume}.c moved to one file.
> 
> Yes, they are all similar enough that there isn't much point keeping 
> them separate.

Plus some variables might be made static, like dpm_off or even dpm_list_mtx.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


* Re: Hibernation considerations
  2007-07-15 22:38   ` Rafael J. Wysocki
  2007-07-15 22:27     ` david
@ 2007-07-29  6:53     ` Vojtech Pavlik
  2007-07-29  9:56       ` Rafael J. Wysocki
  2007-08-05 19:56       ` encrypted hibernation (was Re: Hibernation considerations) Pavel Machek
  1 sibling, 2 replies; 220+ messages in thread
From: Vojtech Pavlik @ 2007-07-29  6:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Dr. David Alan Gilbert, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david,
	Al Boldi

On Mon, Jul 16, 2007 at 12:38:11AM +0200, Rafael J. Wysocki wrote:

> > Or the user unplugs their flash drive after hibernation rather than before.
> > 
> > Two things which I think would be nice to consider are:
> >    1) Encryption - I'd actually prefer if my luks device did not
> >        remember the key across a hibernation; I want to be forced to
> >        reenter the phrase.  However I don't know what the best thing
> >        to do to partitions/applications using the luks device is.
> 
> Encryption is possible with both the userland hibernation (aka uswsusp) and
> TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> have" feature for a framework to be generally useful (many users don't use it
> anyway).

If a user uses an encrypted filesystem, then he also needs an encrypted
swap and an encrypted hibernation image; otherwise the filesystem
encryption is not very useful.

Forgetting the filesystem/swap decryption keys before hibernation is
probably harder to do - there may be sensitive data in the kernel memory
image that weren't cleared - even if the key itself is not there.

In my opinion, encrypted hibernation is what every notebook user should
want - that's the only way to make sure data from the notebook
aren't available when the notebook is physically stolen.

-- 
Vojtech Pavlik
Director SuSE Labs


* Re: Hibernation considerations
  2007-07-29  6:53     ` Vojtech Pavlik
@ 2007-07-29  9:56       ` Rafael J. Wysocki
  2007-08-05 19:56       ` encrypted hibernation (was Re: Hibernation considerations) Pavel Machek
  1 sibling, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-07-29  9:56 UTC (permalink / raw)
  To: Vojtech Pavlik
  Cc: Dr. David Alan Gilbert, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, Pavel Machek, pm list, david,
	Al Boldi

On Sunday, 29 July 2007 08:53, Vojtech Pavlik wrote:
> On Mon, Jul 16, 2007 at 12:38:11AM +0200, Rafael J. Wysocki wrote:
> 
> > > Or the user unplugs their flash drive after hibernation rather than before.
> > > 
> > > Two things which I think would be nice to consider are:
> > >    1) Encryption - I'd actually prefer if my luks device did not
> > >        remember the key accross a hibernation; I want to be forced to
> > >        reenter the phrase.  However I don't know what the best thing
> > >        to do to partitions/applications using the luks device is.
> > 
> > Encryption is possible with both the userland hibernation (aka uswsusp) and
> > TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> > have" feature for a framework to be generally useful (many users don't use it
> > anyway).
> 
> If a user uses an encrypted filesystem, then he also needs an encrypted
> swap and encrypted hibernation image: Otherwise the fileystem encryption
> is not very useful.

I was talking about hibernation image encryption.  Arguably, if the image is
encrypted, you don't need to worry about its contents, including the keys for
other kinds of encryption (eg. fs encryption).
 
> Forgetting the filesystem/swap decryption keys before hibernation is
> probably harder to do - there may be sensitive data in the kernel memory
> image that weren't cleared - even if the key itself is not there.

If the image is encrypted, its contents are not available to anyone
unauthorized and that includes the filesystem/swap decryption keys.

> In my opinion, encrypted hibernation is what every notebook user should
> want - that's the only way how to make sure data from the notebook
> aren't available when the notebook is physically stolen.

Provided that there are any sensitive (to the user or her employer etc.) data
in the notebook.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 12:43                                   ` Nigel Cunningham
  2007-07-21 13:56                                     ` Alan Stern
  2007-07-21 16:13                                     ` Jeremy Maitin-Shepard
@ 2007-08-01  9:19                                     ` Pavel Machek
  2 siblings, 0 replies; 220+ messages in thread
From: Pavel Machek @ 2007-08-01  9:19 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Miklos Szeredi, rjw, miltonm, stern, ying.huang, linux-kernel,
	david, linux-pm, jbms

Hi!

> > > The problem with FUSE is related to the fact that the freezer can't
> > > freeze uninterruptible tasks and we said that perhaps we might avoid
> > > it if FUSE was made freezing-aware.  Still, no one has gone in this
> > > direction and I don't know of any plans to do that.
> > 
> > I thought we have fully explored this direction.  Lots of emails, and
> > an IRC session with Pavel.  Conclusion:
> 
> What am I missing in the following suggested solution?
> 
> 1) In the freezer code, we implement a new TIF_LATEFREEZE process flag, which, 
> when set, causes a  userspace process to be frozen with kernel threads 
> instead of with userspace ones. When freezing, we freezing !TIF_LATEFREEZE, 
> sync and then freeze TIF_LATEFREEZE and freezable kernel threads.
> 
> 2) In the fuse code, the PID of the process that will do the work gets passed 

The list of necessary PIDs is not known to the kernel. FUSE servers
may depend on other parts of userland.



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-21 19:20                                         ` Rafael J. Wysocki
@ 2007-08-01  9:22                                           ` Pavel Machek
  2007-08-02 17:02                                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 220+ messages in thread
From: Pavel Machek @ 2007-08-01  9:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Miklos Szeredi, jbms, nigel, miltonm, stern, ying.huang,
	linux-kernel, david, linux-pm

Hi!

> > Hmm, wonder why this isn't affecting people with VPNs?  Probably
> > network mounts over VPN are rare, and ever rarer to have fs activity
> > on them during suspend.
> > 
> > Anyway, I think it's long overdue to stop thinking about how to "fix"
> > fuse, and concentrate on fixing the underlying problem instead ;)
> 
> To conclude this branch of the thread, I have a patch in the works that may
> help a bit with unfreezable FUSE filesystems and it only affects the freezer.
> I'll post it when 2.6.23-rc1 is out, because it's on top of some other patches
> that need to go first.

I'm interested... which one is that?
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-07-20 22:25                                           ` Alan Stern
  2007-07-23 14:23                                             ` Oliver Neukum
@ 2007-08-01  9:34                                             ` Pavel Machek
  2007-08-03  3:50                                               ` david
  1 sibling, 1 reply; 220+ messages in thread
From: Pavel Machek @ 2007-08-01  9:34 UTC (permalink / raw)
  To: Alan Stern
  Cc: Oliver Neukum, Milton Miller, Ying Huang, LKML,
	Rafael J. Wysocki, David Lang, linux-pm, Jeremy Maitin-Shepard

Hi!

> >  Do we have to block module loading?
> 
> No.  Registering new drivers is okay, registering new devices is bad.
> 
> Of course, some modules do want to register a new device in their init 
> method.  I don't know what we should do about them.  Force the 
> registration to fail, I suppose.  How often will people suspend while a 
> module is loading?

Well... plug this pcmcia card into the slot so that I do not have to
carry it separately, close the lid and go?

...not that impossible to imagine...
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: Hibernation considerations
  2007-07-17 20:18                           ` david
                                               ` (2 preceding siblings ...)
  2007-07-21 10:25                             ` Pavel Machek
@ 2007-08-01 16:58                             ` Stefan Seyfried
  3 siblings, 0 replies; 220+ messages in thread
From: Stefan Seyfried @ 2007-08-01 16:58 UTC (permalink / raw)
  To: david
  Cc: Rafael J. Wysocki, LKML, Kyle Moffett, Al Boldi,
	Eric W. Biederman, Pavel Machek, Huang, Ying, Andrew Morton,
	pm list, Jeremy Maitin-Shepard

Hi,

Sorry for joining late, just a small annotation:

On Tue, Jul 17, 2007 at 01:18:13PM -0700, david@lang.hm wrote:
 
> non-ACPI hibernate
> 
>   since the box powers off
>     it uses zero power while suspended
>     another OS could be run before a resume
>     hardware can be swapped, suspend image could be sent around the world to be restored on another system.
>     restore makes no assumptions about the state of the hardware when it is restored
>     restore is slower (full BIOS boot is required)
>   should be able to work on just about any hardware (the limit is the ability to initialize the devices)
> 
> 
> ACPI suspends
> 
>   since the box never completely powers off

wrong

>     a complete power failure breaks the suspend

wrong

>     the OS must remain in control so other uses must be prevented.
>     hardware must remain in the ACPI state from suspend until restore.
>     restore can be faster (some initialization may be able to be skipped)
>   requires ACPI hardware support
> 
> under the catagory of ACPI suspends you have

ACPI S4 turns off the machine completely and you can remove the battery (this
is even required somewhere in the spec). Any state saving is done in CMOS RAM
or flash.

But for example many notebooks resume much faster if they go through the
ACPI S4 hooks during suspend (less than one second from "lid open" to "grub",
while they need ~10 seconds through the BIOS on a "normal" boot).
My Toughbook resumes on "Lid Opened" after S4, it doesn't after a shutdown.

So there will be differences.
I'm not saying that they are too important, but 20% faster resume still is
a good saving for me.

No need to restart this thread btw ;-)

Have fun,

    Stefan
-- 
Stefan Seyfried
QA / R&D Team Mobile Devices        |              "Any ideas, John?"
SUSE LINUX Products GmbH, Nürnberg  | "Well, surrounding them's out." 

This footer brought to you by insane German lawmakers:
SUSE Linux Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-08-01  9:22                                           ` Pavel Machek
@ 2007-08-02 17:02                                             ` Rafael J. Wysocki
  0 siblings, 0 replies; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-08-02 17:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Miklos Szeredi, jbms, nigel, miltonm, stern, ying.huang,
	linux-kernel, david, linux-pm

On Wednesday, 1 August 2007 11:22, Pavel Machek wrote:
> Hi!
> 
> > > Hmm, wonder why this isn't affecting people with VPNs?  Probably
> > > network mounts over VPN are rare, and ever rarer to have fs activity
> > > on them during suspend.
> > > 
> > > Anyway, I think it's long overdue to stop thinking about how to "fix"
> > > fuse, and concentrate on fixing the underlying problem instead ;)
> > 
> > To conclude this branch of the thread, I have a patch in the works that may
> > help a bit with unfreezable FUSE filesystems and it only affects the freezer.
> > I'll post it when 2.6.23-rc1 is out, because it's on top of some other patches
> > that need to go first.
> 
> I'm interested... which one is that?

Appended, on top of this:
https://lists.linux-foundation.org/pipermail/linux-pm/2007-July/014521.html

Greetings,
Rafael


---
 kernel/power/process.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

Index: linux-2.6.23-rc1/kernel/power/process.c
===================================================================
--- linux-2.6.23-rc1.orig/kernel/power/process.c	2007-07-24 00:14:07.000000000 +0200
+++ linux-2.6.23-rc1/kernel/power/process.c	2007-07-24 00:14:17.000000000 +0200
@@ -30,6 +30,14 @@
  */
 #define MAX_WAITS 5
 
+/*
+ * If the freezing of tasks fails, we attempt to thaw tasks that have already
+ * been frozen to give the other tasks a chance to freeze, in case one or more
+ * of them are blocked by the frozen ones.  If this fails MAX_ATTEMPTS times
+ * in a row, we give up.
+ */
+#define MAX_ATTEMPTS 10
+
 #define FREEZER_KERNEL_THREADS 0
 #define FREEZER_USER_SPACE 1
 
@@ -192,14 +200,21 @@ static void cancel_freezing(struct task_
 static int try_to_freeze_tasks(int freeze_user_space)
 {
 	struct task_struct *g, *p;
-	unsigned int todo, waits;
+	unsigned int todo, waits, attempts;
 	unsigned long ret;
 	struct timeval start, end;
 	s64 elapsed_csecs64;
 	unsigned int elapsed_csecs;
+	char *tick = "-\\|/";
+
+	printk(" ");
+	attempts = 0;
 
 	do_gettimeofday(&start);
 
+ Repeat:
+	printk("\b%c", tick[attempts++ % 4]);
+
 	refrigerator_called = 0;
 	waits = 0;
 	do {
@@ -235,11 +250,43 @@ static int try_to_freeze_tasks(int freez
 		}
 	} while (todo);
 
+	if (todo && attempts <= MAX_ATTEMPTS) {
+		/*
+		 * Some tasks have not been able to freeze.  They might be stuck
+		 * in TASK_UNINTERRUPTIBLE waiting for the frozen tasks.  Try to
+		 * thaw the tasks that have frozen without clearing the freeze
+		 * requests of the remaining tasks and repeat.
+		 */
+		read_lock(&tasklist_lock);
+		do_each_thread(g, p) {
+			if (frozen(p)) {
+				p->flags &= ~PF_FROZEN;
+				wake_up_process(p);
+			}
+		} while_each_thread(g, p);
+		read_unlock(&tasklist_lock);
+
+		ret = wait_event_timeout(refrigerator_waitq,
+						refrigerator_called, TIMEOUT);
+		if (!ret) {
+			/*
+			 * There is little hope that we will succeed, but at
+			 * least we want to know which tasks have not been
+			 * frozen.  Thus, we are going to repeat once.
+			 */
+			attempts = MAX_ATTEMPTS;
+		}
+
+		goto Repeat;
+	}
+
 	do_gettimeofday(&end);
 	elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
 	do_div(elapsed_csecs64, NSEC_PER_SEC / 100);
 	elapsed_csecs = elapsed_csecs64;
 
+	printk("\b");
+
 	if (todo) {
 		/* This does not unfreeze processes that are already frozen
 		 * (we have slightly ugly calling convention in that respect,
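
As a rough user-space illustration of the thaw-and-retry idea in the patch
above (not kernel code): the sketch below models tasks that cannot freeze
while they wait on a task that has already been frozen.  The struct task
layout and the helpers freeze_pass(), thaw_frozen(), serve_requests() and
freeze_all() are hypothetical names made up for this example; only the
MAX_ATTEMPTS bound is taken from the patch.  Unlike the patch, the sketch
leaves every task thawed when it gives up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ATTEMPTS 10	/* same bound as in the patch */

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct task {
	int waiting_on;	/* index of the task this one is blocked on, -1 if none */
	bool frozen;
};

/* One freezing pass: freeze every task that is not blocked, count the rest. */
static int freeze_pass(struct task *t, int n)
{
	int todo = 0;

	for (int i = 0; i < n; i++) {
		if (t[i].frozen)
			continue;
		if (t[i].waiting_on < 0)
			t[i].frozen = true;
		else
			todo++;
	}
	return todo;
}

/* Thaw the tasks that froze so that they can serve the blocked ones. */
static void thaw_frozen(struct task *t, int n)
{
	for (int i = 0; i < n; i++)
		t[i].frozen = false;
}

/*
 * A blocked task is released once the task it waits on is running and is
 * not itself blocked.
 */
static void serve_requests(struct task *t, int n)
{
	for (int i = 0; i < n; i++) {
		int srv = t[i].waiting_on;

		if (srv >= 0 && !t[srv].frozen && t[srv].waiting_on < 0)
			t[i].waiting_on = -1;
	}
}

/* Returns the number of tasks that could not be frozen (0 on success). */
int freeze_all(struct task *t, int n)
{
	int todo = 0;

	assert(t != NULL && n > 0);
	for (int attempt = 0; attempt <= MAX_ATTEMPTS; attempt++) {
		todo = freeze_pass(t, n);
		if (todo == 0)
			break;
		/* Some tasks are stuck: thaw, let them make progress, retry. */
		thaw_frozen(t, n);
		serve_requests(t, n);
	}
	return todo;
}
```

With a chain of three tasks (0 waits on 1, 1 waits on 2) freeze_all() needs
three passes; a task that waits on itself is never released and is reported
as unfreezable after the last attempt.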

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [linux-pm] Re: Hibernation considerations
  2007-08-01  9:34                                             ` [linux-pm] Re: Hibernation considerations Pavel Machek
@ 2007-08-03  3:50                                               ` david
  0 siblings, 0 replies; 220+ messages in thread
From: david @ 2007-08-03  3:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Stern, Oliver Neukum, Milton Miller, Ying Huang, LKML,
	Rafael J. Wysocki, linux-pm, Jeremy Maitin-Shepard

On Wed, 1 Aug 2007, Pavel Machek wrote:

> Hi!
>
>>>  Do we have to block module loading?
>>
>> No.  Registering new drivers is okay, registering new devices is bad.
>>
>> Of course, some modules do want to register a new device in their init
>> method.  I don't know what we should do about them.  Force the
>> registration to fail, I suppose.  How often will people suspend while a
>> module is loading?
>
> Well... plug this pcmcia card into the slot so that I do not have to
> carry it separately, close the lid and go?
>
> ...not that impossible to imagine...

I usually leave my broadband card in the slot, but not seated. I wouldn't
bet against it getting pushed in enough to be detected while putting the 
laptop in the bag.

David Lang

^ permalink raw reply	[flat|nested] 220+ messages in thread

* encrypted hibernation (was Re: Hibernation considerations)
  2007-07-29  6:53     ` Vojtech Pavlik
  2007-07-29  9:56       ` Rafael J. Wysocki
@ 2007-08-05 19:56       ` Pavel Machek
  2007-08-11 23:43         ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 220+ messages in thread
From: Pavel Machek @ 2007-08-05 19:56 UTC (permalink / raw)
  To: Vojtech Pavlik, seife
  Cc: Rafael J. Wysocki, Dr. David Alan Gilbert, LKML, Alan Stern,
	Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham, pm list,
	david, Al Boldi

Hi!

> > > Two things which I think would be nice to consider are:
> > >    1) Encryption - I'd actually prefer if my luks device did not
> > >        remember the key accross a hibernation; I want to be forced to
> > >        reenter the phrase.  However I don't know what the best thing
> > >        to do to partitions/applications using the luks device is.
> > 
> > Encryption is possible with both the userland hibernation (aka uswsusp) and
> > TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> > have" feature for a framework to be generally useful (many users don't use it
> > anyway).
> 
> If a user uses an encrypted filesystem, then he also needs an encrypted
> swap and encrypted hibernation image: Otherwise the fileystem encryption
> is not very useful.

Actually, we can do most of that stuff already. 

We can encrypt filesystems, encrypt swaps (LVM), and encrypt hibernation.

What we _can't_ do is hibernate to an LVM encrypted partition, and we
can only suspend to a swap partition. Bad combination, but here's a way
out: just use a separate (raw) partition for hibernation.

Ok, that needs re-partitioning; if that's bad, just swapoff before
hibernation and mkswap/swapon after it's done.

Index: suspend.c
===================================================================
RCS file: /cvsroot/suspend/suspend/suspend.c,v
retrieving revision 1.82
diff -u -u -r1.82 suspend.c
--- suspend.c	29 Jul 2007 12:48:10 -0000	1.82
+++ suspend.c	5 Aug 2007 19:49:05 -0000
@@ -59,6 +59,7 @@
 static unsigned long pref_image_size = IMAGE_SIZE;
 static int suspend_loglevel = SUSPEND_LOGLEVEL;
 static char compute_checksum;
+static int raw_partition = 1;
 #ifdef CONFIG_COMPRESS
 static char compress;
 #else
@@ -184,6 +185,9 @@
 	int error;
 	loff_t free_swap;
 
+	if (raw_partition)
+		return 1*1024*1024*1024;
+
 	error = ioctl(dev, SNAPSHOT_AVAIL_SWAP, &free_swap);
 	if (!error)
 		return free_swap;
@@ -197,6 +201,12 @@
 	int error;
 	loff_t offset;
 
+	if (raw_partition) {
+		static int cur_offset = 0;
+		cur_offset += page_size;
+		return cur_offset;
+	}
+
 	error = ioctl(dev, SNAPSHOT_GET_SWAP_PAGE, &offset);
 	if (!error)
 		return offset;
@@ -205,6 +215,8 @@
 
 static inline int free_swap_pages(int dev)
 {
+	if (raw_partition)
+		return 0;
 	return ioctl(dev, SNAPSHOT_FREE_SWAP_PAGES, 0);
 }
 
@@ -213,6 +225,8 @@
 	struct resume_swap_area swap;
 	int error;
 
+	if (raw_partition)
+		return 0;
 	swap.dev = blkdev;
 	swap.offset = offset;
 	error = ioctl(dev, SNAPSHOT_SET_SWAP_AREA, &swap);
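
The raw-partition path in the patch above hands out page offsets
sequentially instead of asking the kernel for swap pages.  A minimal
stand-alone sketch of that allocation scheme follows.  It is an
illustration only: struct raw_alloc, raw_get_page() and raw_free_space()
are hypothetical names, and two things are assumptions rather than part of
the patch: the 64-bit offset type and the bounds check that returns 0 when
the partition is full.  As in the patch, offset 0 is never handed out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical allocator state for a raw (non-swap) hibernation partition. */
struct raw_alloc {
	int64_t next;		/* last offset handed out; 0 means none yet */
	int64_t limit;		/* partition size in bytes (assumed known) */
	int64_t page_size;	/* size of one image page in bytes */
};

/*
 * Return the byte offset of a fresh page, or 0 if the partition is full.
 * Offsets are pre-incremented, so the first page returned is at page_size.
 */
int64_t raw_get_page(struct raw_alloc *a)
{
	int64_t off;

	assert(a->page_size > 0);
	off = a->next + a->page_size;
	if (off + a->page_size > a->limit)
		return 0;
	a->next = off;
	return off;
}

/* Bytes still available, not counting the reserved page at offset 0. */
int64_t raw_free_space(const struct raw_alloc *a)
{
	int64_t used = a->next + a->page_size;

	return used < a->limit ? a->limit - used : 0;
}
```

For a partition of four 4096-byte pages this yields offsets 4096, 8192 and
12288 and then reports that the partition is full.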



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: encrypted hibernation (was Re: Hibernation considerations)
  2007-08-05 19:56       ` encrypted hibernation (was Re: Hibernation considerations) Pavel Machek
@ 2007-08-11 23:43         ` Dr. David Alan Gilbert
  2007-08-12 22:12           ` Rafael J. Wysocki
  2007-08-13  2:30           ` Michael Chang
  0 siblings, 2 replies; 220+ messages in thread
From: Dr. David Alan Gilbert @ 2007-08-11 23:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Vojtech Pavlik, seife, Rafael J. Wysocki, Dr. David Alan Gilbert,
	LKML, Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham, pm list,
	david, Al Boldi

* Pavel Machek (pavel@ucw.cz) wrote:
> Hi!
> 
> > > > Two things which I think would be nice to consider are:
> > > >    1) Encryption - I'd actually prefer if my luks device did not
> > > >        remember the key accross a hibernation; I want to be forced to
> > > >        reenter the phrase.  However I don't know what the best thing
> > > >        to do to partitions/applications using the luks device is.
> > > 
> > > Encryption is possible with both the userland hibernation (aka uswsusp) and
> > > TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> > > have" feature for a framework to be generally useful (many users don't use it
> > > anyway).
> > 
> > If a user uses an encrypted filesystem, then he also needs an encrypted
> > swap and encrypted hibernation image: Otherwise the fileystem encryption
> > is not very useful.
> 
> Actually, we can do most of that stuff already. 
> 
> We can encrypt filesystems, encrypt swaps (LVM), and encrypt hibernation.

But can you do what my original question was; find a way to lose a luks
encrypted device key and cleanly unmount the filesystem that was
using it?  (and preferably put it all back together after resume).

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: encrypted hibernation (was Re: Hibernation considerations)
  2007-08-11 23:43         ` Dr. David Alan Gilbert
@ 2007-08-12 22:12           ` Rafael J. Wysocki
  2007-08-18 19:37             ` Dr. David Alan Gilbert
  2007-08-13  2:30           ` Michael Chang
  1 sibling, 1 reply; 220+ messages in thread
From: Rafael J. Wysocki @ 2007-08-12 22:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pavel Machek, Vojtech Pavlik, seife, LKML, Alan Stern,
	Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham, pm list,
	david, Al Boldi

On Sunday, 12 August 2007 01:43, Dr. David Alan Gilbert wrote:
> * Pavel Machek (pavel@ucw.cz) wrote:
> > Hi!
> > 
> > > > > Two things which I think would be nice to consider are:
> > > > >    1) Encryption - I'd actually prefer if my luks device did not
> > > > >        remember the key accross a hibernation;

Why exactly (assuming that the hibernation image is encrypted)?

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: encrypted hibernation (was Re: Hibernation considerations)
  2007-08-11 23:43         ` Dr. David Alan Gilbert
  2007-08-12 22:12           ` Rafael J. Wysocki
@ 2007-08-13  2:30           ` Michael Chang
  2007-08-13  4:53             ` alon.barlev
  1 sibling, 1 reply; 220+ messages in thread
From: Michael Chang @ 2007-08-13  2:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pavel Machek, Vojtech Pavlik, seife, Rafael J. Wysocki, LKML,
	Alan Stern, Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham, pm list,
	david, Al Boldi

On 8/11/07, Dr. David Alan Gilbert <linux@treblig.org> wrote:
> * Pavel Machek (pavel@ucw.cz) wrote:
> > Hi!
> >
> > > > > Two things which I think would be nice to consider are:
> > > > >    1) Encryption - I'd actually prefer if my luks device did not
> > > > >        remember the key accross a hibernation; I want to be forced to
> > > > >        reenter the phrase.  However I don't know what the best thing
> > > > >        to do to partitions/applications using the luks device is.
> > > >
> > > > Encryption is possible with both the userland hibernation (aka uswsusp) and
> > > > TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> > > > have" feature for a framework to be generally useful (many users don't use it
> > > > anyway).
> > >
> > > If a user uses an encrypted filesystem, then he also needs an encrypted
> > > swap and encrypted hibernation image: Otherwise the fileystem encryption
> > > is not very useful.
> >
> > Actually, we can do most of that stuff already.
> >
> > We can encrypt filesystems, encrypt swaps (LVM), and encrypt hibernation.
>
> But can you do what my original question was; find a way to lose a luks
> encrypted device key and cleanly unmount the filesystem that was
> using it?  (and preferably put it all back together after resume).
>

If you lose the device key, how are you going to get luks to find it
again when resuming? Wouldn't it make more sense to have it remember
the key? I can't see it being advisable to allow input or similar
before resume has completed...

-- 
Michael Chang

Please avoid sending me Word or PowerPoint attachments. Send me ODT,
RTF, or HTML instead.
See http://www.gnu.org/philosophy/no-word-attachments.html
Thank you.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: encrypted hibernation (was Re: Hibernation considerations)
  2007-08-13  2:30           ` Michael Chang
@ 2007-08-13  4:53             ` alon.barlev
  0 siblings, 0 replies; 220+ messages in thread
From: alon.barlev @ 2007-08-13  4:53 UTC (permalink / raw)
  To: Michael Chang
  Cc: Dr. David Alan Gilbert, Pavel Machek, Vojtech Pavlik, seife,
	Rafael J. Wysocki, LKML, Alan Stern, Andrew Morton,
	Eric W. Biederman, Huang, Ying, Jeremy Maitin-Shepard,
	Kyle Moffett, Nigel Cunningham, pm list, david, Al Boldi

Hello,

We already have a sample at:
http://wiki.tuxonice.net/EncryptedSwapAndRoot
It stores the keys of mounted partitions on an encrypted swap, which
uses the same encryption with a different keyset.
It also shows how to resume from encrypted swap, and you can
optionally store the keys on a hardware device such as a smartcard.

Best Regards,
Alon Bar-Lev

On 8/13/07, Michael Chang <thenewme91@gmail.com> wrote:
> On 8/11/07, Dr. David Alan Gilbert <linux@treblig.org> wrote:
> > * Pavel Machek (pavel@ucw.cz) wrote:
> > > Hi!
> > >
> > > > > > Two things which I think would be nice to consider are:
> > > > > >    1) Encryption - I'd actually prefer if my luks device did not
> > > > > >        remember the key accross a hibernation; I want to be forced to
> > > > > >        reenter the phrase.  However I don't know what the best thing
> > > > > >        to do to partitions/applications using the luks device is.
> > > > >
> > > > > Encryption is possible with both the userland hibernation (aka uswsusp) and
> > > > > TuxOnIce (formerly known as suspend2).  Still, I don't consider it as a "must
> > > > > have" feature for a framework to be generally useful (many users don't use it
> > > > > anyway).
> > > >
> > > > If a user uses an encrypted filesystem, then he also needs an encrypted
> > > > swap and encrypted hibernation image: Otherwise the fileystem encryption
> > > > is not very useful.
> > >
> > > Actually, we can do most of that stuff already.
> > >
> > > We can encrypt filesystems, encrypt swaps (LVM), and encrypt hibernation.
> >
> > But can you do what my original question was; find a way to lose a luks
> > encrypted device key and cleanly unmount the filesystem that was
> > using it?  (and preferably put it all back together after resume).
> >
>
> If you lose the device key, how are you going to get luks to find it
> again when resuming? Wouldn't it make more sense to have it remember
> the key? I can't see it being advisable to allow input or similar
> before resume has completed...
>
> --
> Michael Chang
>
> Please avoid sending me Word or PowerPoint attachments. Send me ODT,
> RTF, or HTML instead.
> See http://www.gnu.org/philosophy/no-word-attachments.html
> Thank you.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: encrypted hibernation (was Re: Hibernation considerations)
  2007-08-12 22:12           ` Rafael J. Wysocki
@ 2007-08-18 19:37             ` Dr. David Alan Gilbert
  2007-08-21  7:29               ` Pavel Machek
  0 siblings, 1 reply; 220+ messages in thread
From: Dr. David Alan Gilbert @ 2007-08-18 19:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Vojtech Pavlik, seife, LKML, Alan Stern,
	Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham, pm list,
	david, Al Boldi

* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> On Sunday, 12 August 2007 01:43, Dr. David Alan Gilbert wrote:
> > * Pavel Machek (pavel@ucw.cz) wrote:
> > > Hi!
> > > 
> > > > > > Two things which I think would be nice to consider are:
> > > > > >    1) Encryption - I'd actually prefer if my luks device did not
> > > > > >        remember the key accross a hibernation;
> 
> Why exactly (assuming that the hibernation image is encrypted)?

I was assuming the hibernation image was not encrypted.
Certainly if it meant a penalty during normal operation (e.g. encrypted swap)
it wouldn't be.

(I have a small amount of encrypted data in a luks partition,
most of the time it isn't used, only rarely do apps have it open
and I'm not actually worried about crawling through swap to find out
what is there - this is just a personal laptop; I appreciate these
concerns are different depending what you are storing).

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: encrypted hibernation (was Re: Hibernation considerations)
  2007-08-18 19:37             ` Dr. David Alan Gilbert
@ 2007-08-21  7:29               ` Pavel Machek
  0 siblings, 0 replies; 220+ messages in thread
From: Pavel Machek @ 2007-08-21  7:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Rafael J. Wysocki, Vojtech Pavlik, seife, LKML, Alan Stern,
	Andrew Morton, Eric W. Biederman, Huang, Ying,
	Jeremy Maitin-Shepard, Kyle Moffett, Nigel Cunningham, pm list,
	david, Al Boldi

Hi!

> > > > > > > Two things which I think would be nice to consider are:
> > > > > > >    1) Encryption - I'd actually prefer if my luks device did not
> > > > > > >        remember the key accross a hibernation;
> > 
> > Why exactly (assuming that the hibernation image is encrypted)?
> 
> I was assuming the hibernation image was not encrypted.
> Certainly if it meant a penalty during normal operation (e.g. encrypted swap)
> it wouldn't be.

uswsusp works the way you want. Don't encrypt normal swap, and uswsusp
will still be encrypted.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 220+ messages in thread

end of thread, other threads:[~2007-08-21  9:35 UTC | newest]

Thread overview: 220+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-15 12:33 Hibernation considerations Rafael J. Wysocki
2007-07-15 12:51 ` Nigel Cunningham
2007-07-15 12:58 ` Dr. David Alan Gilbert
2007-07-15 22:38   ` Rafael J. Wysocki
2007-07-15 22:27     ` david
2007-07-17 17:40       ` Dr. David Alan Gilbert
2007-07-17 17:49         ` david
2007-07-29  6:53     ` Vojtech Pavlik
2007-07-29  9:56       ` Rafael J. Wysocki
2007-08-05 19:56       ` encrypted hibernation (was Re: Hibernation considerations) Pavel Machek
2007-08-11 23:43         ` Dr. David Alan Gilbert
2007-08-12 22:12           ` Rafael J. Wysocki
2007-08-18 19:37             ` Dr. David Alan Gilbert
2007-08-21  7:29               ` Pavel Machek
2007-08-13  2:30           ` Michael Chang
2007-08-13  4:53             ` alon.barlev
2007-07-15 15:10 ` Hibernation considerations Al Boldi
2007-07-15 15:35   ` jimmy bahuleyan
2007-07-15 17:40     ` Al Boldi
2007-07-15 16:29   ` Alan Stern
2007-07-15 17:40     ` Al Boldi
2007-07-15 23:28       ` Alan Stern
2007-07-15 23:58         ` david
2007-07-16  5:02         ` Al Boldi
2007-07-16  6:49           ` david
2007-07-16 13:32             ` Al Boldi
2007-07-17  4:33               ` david
2007-07-17 12:08                 ` Al Boldi
2007-07-17 14:18                   ` Rafael J. Wysocki
2007-07-17 15:23                   ` david
2007-07-16 14:53           ` Alan Stern
2007-07-16 16:51             ` Al Boldi
2007-07-17  4:37               ` david
2007-07-15 19:52     ` david
2007-07-15 20:13 ` david
2007-07-15 22:47   ` Rafael J. Wysocki
2007-07-15 22:42     ` david
2007-07-15 23:15       ` Alan Stern
2007-07-15 23:38         ` Nigel Cunningham
2007-07-16 14:15           ` Alan Stern
2007-07-16 15:25             ` Rafael J. Wysocki
2007-07-15 23:41         ` david
2007-07-16 14:21           ` Alan Stern
2007-07-17  4:45             ` david
2007-07-17 14:15               ` Alan Stern
2007-07-17 14:40                 ` Rafael J. Wysocki
2007-07-17 15:29                   ` david
2007-07-17 16:02                     ` Rafael J. Wysocki
2007-07-17 17:06                       ` david
2007-07-17 19:50                         ` Rafael J. Wysocki
2007-07-17 20:18                           ` david
2007-07-17 20:39                             ` Jeremy Maitin-Shepard
2007-07-17 20:39                               ` david
2007-07-17 20:58                               ` Rafael J. Wysocki
2007-07-17 20:57                             ` Rafael J. Wysocki
2007-07-17 20:53                               ` david
2007-07-17 21:37                                 ` Rafael J. Wysocki
2007-07-17 21:42                                   ` david
2007-07-17 21:53                                     ` Jeremy Maitin-Shepard
2007-07-21 10:25                             ` Pavel Machek
2007-07-21 15:35                               ` Jeremy Maitin-Shepard
2007-07-21 17:56                                 ` Pavel Machek
2007-07-21 19:35                                   ` david
2007-07-21 19:49                                     ` Pavel Machek
2007-07-21 22:14                                       ` david
2007-08-01 16:58                             ` Stefan Seyfried
2007-07-17 20:24                           ` Jeremy Maitin-Shepard
2007-07-17 20:44                             ` david
2007-07-17 21:00                             ` Rafael J. Wysocki
2007-07-17 16:09                   ` Jeremy Maitin-Shepard
2007-07-17 19:54                     ` Rafael J. Wysocki
2007-07-17 18:32                   ` Alan Stern
2007-07-17 20:17                     ` Rafael J. Wysocki
2007-07-17 20:34                       ` david
2007-07-17 20:54                         ` Jeremy Maitin-Shepard
2007-07-17 21:04                           ` david
2007-07-17 21:23                         ` Rafael J. Wysocki
2007-07-17 21:17                           ` david
2007-07-17 21:27                             ` Jeremy Maitin-Shepard
2007-07-17 21:27                               ` david
2007-07-17 21:54                                 ` Rafael J. Wysocki
2007-07-17 21:45                               ` Rafael J. Wysocki
2007-07-17 21:43                             ` Rafael J. Wysocki
2007-07-17 20:34                       ` Jeremy Maitin-Shepard
2007-07-17 20:37                         ` david
2007-07-17 20:56                           ` Jeremy Maitin-Shepard
2007-07-17 21:06                             ` david
2007-07-17 21:40                               ` Rafael J. Wysocki
2007-07-17 21:24                           ` Rafael J. Wysocki
2007-07-17 21:11                         ` Rafael J. Wysocki
2007-07-17 20:27                     ` david
2007-07-17 21:20                       ` Rafael J. Wysocki
     [not found]                         ` <ea7a437ca4038d408ac544bbc3c2434a@bga.com>
2007-07-19 17:31                           ` [linux-pm] " david
2007-07-20 14:24                             ` Milton Miller
2007-07-20 15:44                               ` david