LKML Archive on lore.kernel.org
* boot cgroup questions
@ 2008-03-12  1:23 Max Krasnyansky
  2008-03-12  1:27 ` Paul Menage
  0 siblings, 1 reply; 42+ messages in thread
From: Max Krasnyansky @ 2008-03-12  1:23 UTC (permalink / raw)
  To: Paul Jackson, Paul Menage, Ingo Molnar, Peter Zijlstra; +Cc: LKML

Folks,

The concept of a 'boot' cgroup was discussed as part of the cpuset/cpuisol lkml threads.
In short, the 'boot' group is very much like the 'root' or toplevel group, i.e. it
contains all tasks, and the 'boot' cpuset contains all cpus, mem nodes, irqs, etc.
The difference is that it can be easily shrunk if needed, whereas the
toplevel/root group cannot.

I just wanted to make sure that we still want to create 'boot' cgroup during
kernel init instead of doing it in the user-space.

After looking into this a little bit I'm thinking of creating 'boot' cgroup
right after cpuset_init_smp() (init/main.c:841). Just before do_basic_setup()
which creates work queues and stuff.

The thing is though that the very next thing we do there is run early
userspace. Which begs the question, shouldn't we just do it from early
user-space then ?
It'd be very simple to mount cgroup, create 'boot' group and move all the
tasks in there.
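
Illustrative sketch (not code from this thread) of the early-userspace route
described above. It assumes the cpuset filesystem is already mounted (the
thread uses /dev/cpuset), that each cpuset directory exposes a 'tasks' file
with one PID per line, and that the 'boot' directory has already been created
with its 'cpus' and 'mems' populated. The name move_tasks is hypothetical.

```python
import os


def move_tasks(src_dir, dst_dir):
    """Try to move every PID listed in src_dir/tasks into dst_dir/tasks.

    Returns (moved, failed) as two lists of integer PIDs.
    """
    with open(os.path.join(src_dir, "tasks")) as f:
        pids = [int(line) for line in f if line.strip()]

    moved, failed = [], []
    dst_tasks = os.path.join(dst_dir, "tasks")
    for pid in pids:
        try:
            # One PID per write: the kernel attaches a single task per write.
            with open(dst_tasks, "a") as out:
                out.write(f"{pid}\n")
            moved.append(pid)
        except OSError:
            # Assumption: some tasks (pinned kernel threads, tasks that have
            # exited in the meantime) may refuse the move; record and go on.
            failed.append(pid)
    return moved, failed
```

On a real system this would presumably be invoked as
move_tasks("/dev/cpuset", "/dev/cpuset/boot"); the 'failed' list is where the
"be smart about pinned threads" logic would have to go.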

So kernel or early-userspace ?

If kernel:
Paul M, do you have a suggestion as to the best way of creating a
cgroup without mounting the cgroup fs? There seems to be no easy way
of doing that currently, but I may have missed it.

Thanx
Max




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12  1:23 boot cgroup questions Max Krasnyansky
@ 2008-03-12  1:27 ` Paul Menage
  2008-03-12  2:34   ` Max Krasnyansky
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-03-12  1:27 UTC (permalink / raw)
  To: Max Krasnyansky; +Cc: Paul Jackson, Ingo Molnar, Peter Zijlstra, LKML

On Tue, Mar 11, 2008 at 6:23 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>  The thing is though that the very next thing we do there is run early
>  userspace. Which begs the question, shouldn't we just do it from early
>  user-space then ?

Seems simplest to me. We have an early boot script that creates a
"system" cpuset and moves all tasks into it. It seems to work fine for
us.

Paul


* Re: boot cgroup questions
  2008-03-12  1:27 ` Paul Menage
@ 2008-03-12  2:34   ` Max Krasnyansky
  2008-03-12  2:36     ` Paul Menage
  0 siblings, 1 reply; 42+ messages in thread
From: Max Krasnyansky @ 2008-03-12  2:34 UTC (permalink / raw)
  To: Paul Menage; +Cc: Paul Jackson, Ingo Molnar, Peter Zijlstra, LKML



Paul Menage wrote:
> On Tue, Mar 11, 2008 at 6:23 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>>  The thing is though that the very next thing we do there is run early
>>  userspace. Which begs the question, shouldn't we just do it from early
>>  user-space then ?
> 
> Seems simplest to me. We have an early boot script that creates a
> "system" cpuset and moves all tasks into it. It seems to work fine for
> us.

Suppose we were to do it from kernel. What's the right way to create a cgroup
without mounting a cgroupfs ?
I just want to play with it. There are a couple of advantages that I see for
doing it from kernel. We can move 'kthreadd' and idle threads into the 'boot'
cgroup early on and therefore later on won't even have to iterate through the
tasks and stuff. Whereas user-space has to iterate through tasks and be smart
about threads that are pinned and stuff. Not a big deal but if kernel code is
simple enough maybe it makes sense.

So, any pointers? How do I do create_cgroup() without the fs mounted ?

Max


* Re: boot cgroup questions
  2008-03-12  2:34   ` Max Krasnyansky
@ 2008-03-12  2:36     ` Paul Menage
  2008-03-12  2:53       ` Max Krasnyansky
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-03-12  2:36 UTC (permalink / raw)
  To: Max Krasnyansky; +Cc: Paul Jackson, Ingo Molnar, Peter Zijlstra, LKML

On Tue, Mar 11, 2008 at 7:34 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>
>  Suppose we were to do it from kernel. What's the right way to create a cgroup
>  without mounting a cgroupfs ?

There isn't really a way, but you could always kern_mount() a
filesystem inside the kernel.

>  I just want to play with it. There are a couple of advantages that I see for
>  doing it from kernel. We can move 'kthreadd' and idle threads into the 'boot'
>  cgroup early on and therefore later on won't even have to iterate through the
>  tasks and stuff.

Would this be done based on some boot commandline option? I don't
think you'd want to do it unconditionally.

Paul


* Re: boot cgroup questions
  2008-03-12  2:36     ` Paul Menage
@ 2008-03-12  2:53       ` Max Krasnyansky
  2008-03-12  3:09         ` Paul Menage
  0 siblings, 1 reply; 42+ messages in thread
From: Max Krasnyansky @ 2008-03-12  2:53 UTC (permalink / raw)
  To: Paul Menage; +Cc: Paul Jackson, Ingo Molnar, Peter Zijlstra, LKML



Paul Menage wrote:
> On Tue, Mar 11, 2008 at 7:34 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>>  Suppose we were to do it from kernel. What's the right way to create a cgroup
>>  without mounting a cgroupfs ?
> 
> There isn't really a way, but you could always kern_mount() a
> filesystem inside the kernel.
Aha, that's what I was missing. kern_mount(). Cool :).

>>  I just want to play with it. There are a couple of advantages that I see for
>>  doing it from kernel. We can move 'kthreadd' and idle threads into the 'boot'
>>  cgroup early on and therefore later on won't even have to iterate through the
>>  tasks and stuff.
> 
> Would this be done based on some boot commandline option? I don't
> think you'd want to do it unconditionally.
Hmm, I believe the original discussion was about doing it unconditionally.
Why not I guess ? It probably won't even affect your existing scripts since
they will be able to move tasks into another set just like they do now. The
only thing I can think of is that if your scripts use sched_load_balance then
they will now have to unset it in the 'boot' set as well. Otherwise since the
'boot' set will be non-exclusive (cpus and mems) it should not really affect
anything.
So what's your concern with unconditional 'boot' cgroup/cpuset ?

Max


* Re: boot cgroup questions
  2008-03-12  2:53       ` Max Krasnyansky
@ 2008-03-12  3:09         ` Paul Menage
  2008-03-12  3:39           ` Max Krasnyansky
  2008-03-12  4:59           ` Paul Jackson
  0 siblings, 2 replies; 42+ messages in thread
From: Paul Menage @ 2008-03-12  3:09 UTC (permalink / raw)
  To: Max Krasnyansky; +Cc: Paul Jackson, Ingo Molnar, Peter Zijlstra, LKML

On Tue, Mar 11, 2008 at 7:53 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
> It probably won't even affect your existing scripts since
> they will be able to move tasks into another set just like they do now.

My boot scripts look in /dev/cpuset/tasks to find processes to move
into the system cpuset. So that would break them.

>  they will now have to unset it in the 'boot' set as well.

That can break existing userspace, so I presume PaulJ isn't in favour
of this change.

> Otherwise since the
>  'boot' set will be non-exclusive (cpus and mems) it should not really affect
>  anything.

Apart from other cpusets that *are* mem_exclusive or cpu_exclusive.

>  So what's your concern with unconditional 'boot' cgroup/cpuset ?

The exclusivity problem, as above.

Which subsystems are you going to include in this boot hierarchy?
Userspace is going to have to be aware of the fact that there's a
cpusets hierarchy which might have to be dismantled if it wants to set
up something different.

Paul


* Re: boot cgroup questions
  2008-03-12  3:09         ` Paul Menage
@ 2008-03-12  3:39           ` Max Krasnyansky
  2008-03-12  4:59           ` Paul Jackson
  1 sibling, 0 replies; 42+ messages in thread
From: Max Krasnyansky @ 2008-03-12  3:39 UTC (permalink / raw)
  To: Paul Menage; +Cc: Paul Jackson, Ingo Molnar, Peter Zijlstra, LKML

Paul Menage wrote:
> On Tue, Mar 11, 2008 at 7:53 PM, Max Krasnyansky <maxk@qualcomm.com> wrote:
>> It probably won't even affect your existing scripts since
>> they will be able to move tasks into another set just like they do now.
> 
> My boot scripts look in /dev/cpuset/tasks to find processes to move
> into the system cpuset. So that would break them.
I see. I assumed you just iterate through /proc/[0-9]*

>>  they will now have to unset it in the 'boot' set as well.
> 
> That can break existing userspace, so I presume PaulJ isn't in favour
> of this change.
My impression was that he was ok with changing his stuff. But I may be
completely wrong of course. I'm actually perfectly fine with making it
conditional.
Maybe something like
	bootcpuset=1
?

>> Otherwise since the
>>  'boot' set will be non-exclusive (cpus and mems) it should not really affect
>>  anything.
> 
> Apart from other cpusets that *are* mem_exclusive or cpu_exclusive.
Hold on, if you move all the tasks ... Oh, never mind :). You mean that you
won't be able to create any cpusets that must be exclusive unless you nuke
'boot' set. Makes sense.
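
Illustrative sketch (not from the thread) of why an all-CPU 'boot' sibling
gets in the way. It models a simplified form of the cpuset rule that two
sibling cpusets may not share CPUs if either one is cpu_exclusive. The
function name and the dictionary layout are hypothetical.

```python
def exclusive_conflicts(siblings, new_cpus, new_exclusive):
    """Return the names of sibling cpusets that would block a new cpuset.

    siblings: dict mapping name -> (set of cpu numbers, cpu_exclusive flag)
    """
    blocked_by = []
    for name, (cpus, exclusive) in siblings.items():
        overlaps = bool(cpus & set(new_cpus))
        # Overlap is only a problem if at least one side is cpu_exclusive.
        if overlaps and (exclusive or new_exclusive):
            blocked_by.append(name)
    return sorted(blocked_by)
```

With a non-exclusive 'boot' set covering cpus 0-3, a new cpu_exclusive set on
cpus 2-3 is blocked until 'boot' is shrunk (or removed), whereas a
non-exclusive set on the same cpus is not.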

>>  So what's your concern with unconditional 'boot' cgroup/cpuset ?
> 
> The exclusivity problem, as above.
Yes I agree. If this 'boot' set is unconditional user-space tools will have to
change. As I mentioned above I totally do not mind if it is conditional. Any
other opinions out there ?

> 
> Which subsystems are you going to include in this boot hierarchy?
> Userspace is going to have to be aware of the fact that there's a
> cpusets hierarchy which might have to be dismantled if it wants to set
> up something different.
I was going to only include 'cpusets'. Does it make sense for anything else ?

Max


* Re: boot cgroup questions
  2008-03-12  3:09         ` Paul Menage
  2008-03-12  3:39           ` Max Krasnyansky
@ 2008-03-12  4:59           ` Paul Jackson
  2008-03-12 18:24             ` Max Krasnyanskiy
  2008-03-12 19:16             ` Paul Menage
  1 sibling, 2 replies; 42+ messages in thread
From: Paul Jackson @ 2008-03-12  4:59 UTC (permalink / raw)
  To: Paul Menage; +Cc: maxk, mingo, a.p.zijlstra, linux-kernel

Paul M wrote:
> >  they will now have to unset it in the 'boot' set as well.
> 
> That can break existing userspace, so I presume PaulJ isn't in favour
> of this change.

You're right - I don't favor it.

Using the 'cpus' in one or more cpusets to determine both:
 1) which CPUs can receive an irq, and
 2) resolving conflicts in such irq placement,
excessively overloads the cpuset hierarchy, breaking existing
userspace, as Paul M notes.

If you don't have any other cpuset hierarchy you need to use, and
so don't really otherwise care what your cpuset hierarchy is, then
I suppose this works just fine.

But if you also need to use the cpuset hierarchy to define nested
subsets of CPUs and Memory Nodes, for the purposes of controlling
which tasks can run where (the original and still primary motivation
for cpusets) then one can only conveniently specify those trivial
irq configurations that happen to exactly conform with that hierarchy
(that exactly want to make use of some of the same sets of CPUs, and
that don't depend on the hierarchy to resolve conflicts in overlapping
irq directives).

Almost any non-trivial use of cpusets for both irq directivity and CPU
and Memory placement would complicate both hierarchies, forcing
unending confusion and breakage on the existing cpuset users.

Some examples:

    Let's say I have three cpusets defining the CPU and Memory Node
    sets in which I want to place my tasks:

	    /dev/cpuset/A
	    /dev/cpuset/B
	    /dev/cpuset/C

    and I want a particular set of irqs to be directed to the CPUs in A
    and B, but not C.  Well -- guess I can duplicate the irqs settings.

    But don't tell me to use a 'boot' cpuset, as in:

	    /dev/cpuset/boot/A
	    /dev/cpuset/boot/B
	    /dev/cpuset/C

    to accomplish this, as that intrudes in the hierarchy, breaking
    user code.

    If my irq isolation needs don't exactly partition along the
    'cpus' settings in A, B and C, then not even duplication helps.

    If the 'irqs' in /dev/cpuset/A/Z (where Z's cpus are a proper
    subset of A's) don't match the 'irqs' in /dev/cpuset/A, then I
    have further confusions resulting from conflicting irq directives.

(If your proposal handles all the above, without forcing changes
on the cpuset hierarchy, then I misread it - in that case, sorry.)

Paul M has already proposed pulling apart the binding of CPUs and
Memory Nodes, in the underlying cgroups, as he apparently has cases in
which the legacy connection of those two into a single cpuset hierarchy
is an undesired constraint on (complication of) the hierarchy.
That's more likely the direction in which we should be proceeding --
making these hierarchies independent, not entwining them.

This additional overloading of the current cpuset hierarchy might
handle the simple case you need.  But that's only because you don't
have conflicting needs for the cpuset hierarchy.

Hopefully, Paul M will be able to view with some sense of humor that I
am complaining that this proposal of yours (and Peter Z's earlier
patches) isn't general enough, even as I have complained of some
other recent cgroup proposals of Paul M that their increased
generality isn't sufficient to justify their subtle incompatibilities.

At a minimum, as in my proposal (http://lkml.org/lkml/2008/3/6/512) of
last week, one needs some mechanism independent of the cpuset hierarchy
to resolve conflicts in these irq directives.  As you may recall,
that proposal named each set of irqs, let each cpuset specify which
named set of irqs applied to its CPUs, and encoded the precedence N
of each named list of irqs in the filename '/dev/cpuset/irqs.N.name'
of the file listing the irqs in that named set.  Then one can specify
irqs for each cpuset, and have some way to specify the precedence of
these irq specifications, without overloading the cpuset hierarchy.
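
Illustrative sketch (a loose reading of the scheme summarized above, not code
from that proposal). It assumes that a smaller N means higher precedence,
which the text here does not state, and that each cpuset selects at most one
named irq set. All function and variable names are hypothetical.

```python
import re

IRQ_FILE = re.compile(r"^irqs\.(\d+)\.(\w+)$")


def parse_irq_set_name(filename):
    """Split 'irqs.N.name' into (precedence N, set name)."""
    m = IRQ_FILE.match(filename)
    if m is None:
        raise ValueError(f"not an irq set file: {filename!r}")
    return int(m.group(1)), m.group(2)


def resolve_irq_cpus(irq_files, cpuset_irq_set, cpuset_cpus):
    """Map each irq to the CPUs of the cpusets that use its winning set.

    irq_files:      dict 'irqs.N.name' -> iterable of irq numbers
    cpuset_irq_set: dict cpuset path -> irq set name it selects
    cpuset_cpus:    dict cpuset path -> set of CPU numbers
    """
    winner = {}  # irq -> (precedence, set name)
    for filename, irqs in irq_files.items():
        prec, name = parse_irq_set_name(filename)
        for irq in irqs:
            # Assumption: the smaller N takes precedence on conflict.
            if irq not in winner or prec < winner[irq][0]:
                winner[irq] = (prec, name)

    result = {}
    for irq, (_, name) in winner.items():
        cpus = set()
        for path, selected in cpuset_irq_set.items():
            if selected == name:
                cpus |= cpuset_cpus[path]
        result[irq] = cpus
    return result
```

In this model, cpusets A and B can both select the same named set, so an irq
is directed to the CPUs of A and B but not C without adding a parent cpuset.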

Even this minimum proposal might be insufficient, if one has needs
to specify irq directives for sets of CPUs that are not otherwise
present in the cpuset hierarchy.  Observe that this proposal does
not handle the next to the last example case above.  I am not yet
convinced that this deficiency is a show stopper.  It might be.

The other direction considered, making this its own cgroup, -seemed-
to fail as well, as someone, I forget whom, noted.  Cgroups attach
tasks to sets of things.  We aren't trying to attach tasks to anything.
We're trying to attach irqs to CPUs.  We are trying now to treat irqs
as 'pseudo-tasks', but that forces the irq hierarchy to be a subset
of the CPU hierarchy, due to overloading the 'cpus' set.  This is the
problem noted above.

Paul M -- could we take a different tack here -- extend cgroups to map
-either- tasks or irqs to the managed resources?  Then irqs would be
managed by a cgroup hierarchy that mapped irqs to a subsystem specific
attribute of 'cpus' (resembling the cpuset 'cpus').  If the hierarchy
one needed for irqs was a nice subset of one's cpuset hierarchy, one
might even mount both cgroup subsystems on the same mount, so long
as we could work out what it means for two cgroup subsystems to share
the same subsystem specific attribute, 'cpus' in this case.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: boot cgroup questions
  2008-03-12  4:59           ` Paul Jackson
@ 2008-03-12 18:24             ` Max Krasnyanskiy
  2008-03-12 18:57               ` Paul Jackson
  2008-03-12 19:16             ` Paul Menage
  1 sibling, 1 reply; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-12 18:24 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Paul Menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:
> Paul M wrote:
>>>  they will now have to unset it in the 'boot' set as well.
>> That can break existing userspace, so I presume PaulJ isn't in favour
>> of this change.
> 
> You're right - I don't favor it.
Hmm, I think we're mixing two different threads here.
	1. How to map irq affinity handling onto cpusets.
	2. Whether and how to create in kernel 'boot' cgroup/cpuset.
They are somewhat orthogonal imho, in the sense that no matter how we decide to
handle irqs (even if we do not do them under cpusets at all) we may still want
a 'boot' group. As I mentioned at the beginning of this thread, the 'boot' group/set
is basically just a convenience feature. The only difference from the root/top
group is that the 'boot' group can be dynamically resized and moved.

Ok. So the rest of the email is mostly about irqs. It'd be nice if it was in 
the other thread (cpuset: irq affinity support) but I'm ok with replying here.

> Using the 'cpus' in one or more cpusets to determine both:
>  1) which CPUs can receive an irq, and
>  2) resolving conflicts in such irq placement,
> excessively overloads the cpuset hierarchy, breaking existing
> userspace, as Paul M notes.
> If you don't have any other cpuset hierarchy you need to use, and
> so don't really otherwise care what your cpuset hierarchy is, then
> I suppose this works just fine.
I'm not sure #2 is a concern. With the latest cpuset irq handling patches
conflict resolution is very simple: an irq can belong to a single cpuset at a time.

> But if you also need to use the cpuset hierarchy to define nested
> subsets of CPUs and Memory Nodes, for the purposes of controlling
> which tasks can run where (the original and still primary motivation
> for cpusets) then one can only conveniently specify those trivial
> irq configurations that happen to exactly conform with that hierarchy
> (that exactly want to make use of some of the same sets of CPUs, and
> that don't depend on the hierarchy to resolve conflicts in overlapping
> irq directives).
I do not think we need overlapping irq directives.

> Almost any non-trivial use of cpusets for both irq directivity and CPU
> and Memory placement would complicate both hierarchies, forcing
> unending confusion and breakage on the existing cpuset users.
I'm not sure what breakage you're talking about. But let's talk examples, I
guess. See below.

> Some examples:
>     Let's say I have three cpusets defining the CPU and Memory Node
>     sets in which I want to place my tasks:
> 
> 	    /dev/cpuset/A
> 	    /dev/cpuset/B
> 	    /dev/cpuset/C
> 
>     and I want a particular set of irqs to be directed to the CPUs in A
>     and B, but not C.  Well -- guess I can duplicate the irqs settings.
> 
>     But don't tell me to use a 'boot' cpuset, as in:
> 
> 	    /dev/cpuset/boot/A
> 	    /dev/cpuset/boot/B
> 	    /dev/cpuset/C
> 
>     to accomplish this, as that intrudes in the hierarchy, breaking
>     user code.
> 
>     If my irq isolation needs don't exactly partition along the
>     'cpus' settings in A, B and C, then not even duplication helps.
> 
>     If the 'irqs' in /dev/cpuset/A/Z (where Z's cpus are a proper
>     subset of A's) don't match the 'irqs' in /dev/cpuset/A, then I
>     have further confusions resulting from conflicting irq directives.
How is that any different from tasks ? Exact same example right back at you.
Suppose I have a task that needs to run in A and B but not C. In fact if you 
look at the example that I provided in the other thread I already have such an 
app. In my current apps different threads have to run in different cpusets.

And yes I think the way to solve that is to use more complex cpuset hierarchy 
like the one you used above. I would not necessarily mix in the 'boot' set 
here. I mean if people want to subdivide it that's fine but they do not have 
to. I mean people can just nuke the 'boot' group/set and create something else.

> (If your proposal handles all the above, without forcing changes
> on the cpuset hierarchy, then I misread it - in that case, sorry.)
It does not force any changes. irqs are handled just like tasks, and if people have
complex partitioning requirements they may have to use more complicated
hierarchies.

> Paul M has already proposed pulling apart the binding of CPUs and
> Memory Nodes, in the underlying cgroups, as he apparently has cases in
> which the legacy connection of those two into a single cpuset hierarchy
> is an undesired constraint on (complication of) the hierarchy.
> That's more likely the direction in which we should be proceeding --
> making these hierarchies independent, not entwining them.
> 
> This additional overloading of the current cpuset hierarchy might
> handle the simple case you need.  But that's only because you don't
> have conflicting needs for the cpuset hierarchy.
> 
> Hopefully, Paul M will be able to view with some sense of humor that I
> am complaining that this proposal of yours (and Peter Z's earlier
> patches) isn't general enough, even as I have complained of some
> other recent cgroup proposals of Paul M that their increased
> generality isn't sufficient to justify their subtle incompatibilities.
> 
> At a minimum, as in my proposal (http://lkml.org/lkml/2008/3/6/512) of
> last week, one needs some mechanism independent of the cpuset hierarchy
> to resolve conflicts in these irq directives.  As you may recall,
> that proposal named each set of irqs, let each cpuset specify which
> named set of irqs applied to its CPUs, and encoded the precedence N
> of each named list of irqs in the filename '/dev/cpuset/irqs.N.name'
> of the file listing the irqs in that named set.  Then one can specify
> irqs for each cpuset, and have some way to specify the precedence of
> these irq specifications, without overloading the cpuset hierarchy.
> 
> Even this minimum proposal might be insufficient, if one has needs
> to specify irq directives for sets of CPUs that are not otherwise
> present in the cpuset hierarchy.  Observe that this proposal does
> not handle the next to the last example case above.  I am not yet
> convinced that this deficiency is a show stopper.  It might be.
That (i.e. additional sets of irqs) seems like major overkill to me,
probably because I do not think that there are any conflicts to resolve in the
first place. As I explained above, if we treat irqs just like tasks (from the
cpuset perspective) then the exact same rules and limitations apply. An irq can be
assigned to a single cpuset at a time. Complex requirements can be solved
either by deeper cgroup/cpuset hierarchies or, in the worst case, if there is some
totally wacky constraint, people always have the option of assigning the irq to
the top cpuset and using the /proc/irq/N/smp_affinity interface to select which
cpus it can run on.
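
Illustrative sketch (not from the thread) of the fallback mentioned above.
/proc/irq/N/smp_affinity takes a hexadecimal CPU bitmask, written as
comma-separated 32-bit groups when there are more than 32 CPUs. The helper
below only builds that string; the name cpus_to_affinity_mask is hypothetical,
and actually writing the value to /proc is left out.

```python
def cpus_to_affinity_mask(cpus):
    """Build a hex bitmask string from an iterable of CPU numbers.

    The result is a comma-separated list of 32-bit groups, most
    significant group first, each padded to 8 hex digits.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"invalid cpu number: {cpu}")
        mask |= 1 << cpu

    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    groups.reverse()
    return ",".join(f"{g:08x}" for g in groups)
```

For example, restricting an irq to cpus 0 and 1 would mean writing the string
returned by cpus_to_affinity_mask([0, 1]) to that irq's smp_affinity file.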

> The other direction considered, making this its own cgroup, -seemed-
> to fail as well, as someone, I forget whom, noted.  Cgroups attach
> tasks to sets of things.  We aren't trying to attach tasks to anything.
> We're trying to attach irqs to CPUs.  We are trying now to treat irqs
> as 'pseudo-tasks', but that forces the irq hierarchy to be a subset
> of the CPU hierarchy, due to overloading the 'cpus' set.  This is the
> problem noted above.
> 
> Paul M -- could we take a different tack here -- extend cgroups to map
> -either- tasks or irqs to the managed resources?  Then irqs would be
> managed by a cgroup hierarchy that mapped irqs to a subsystem specific
> attribute of 'cpus' (resembling the cpuset 'cpus').  If the hierarchy
> one needed for irqs was a nice subset of one's cpuset hierarchy, one
> might even mount both cgroup subsystems on the same mount, so long
> as we could work out what it means for two cgroup subsystems to share
> the same subsystem specific attribute, 'cpus' in this case.
Hold on. How does this help if at the end of the day 'cpus' are still shared
between the irq and task groups ? We'd still have the exact same constraints.

btw I'm starting to gravitate back towards my original solution (i.e. a cpu_map
that tells which cpus can be used by the kernel and irqs). If you remember, I was
totally against using cpusets/cgroups exactly because they are designed to
handle tasks. You guys convinced me that we can extend them and that it's a
better way to go about it. Yet after weeks of discussion we seem to be talking
about adding more and more stuff.

Anyway, I think treating irqs as tasks and enforcing the same rules and 
constraints should handle most scenarios even complex ones at the expense of 
deeper cpuset hierarchies. Adding new 'cgroups' or 'sets' just for irqs seems 
overkill.

Max


* Re: boot cgroup questions
  2008-03-12 18:24             ` Max Krasnyanskiy
@ 2008-03-12 18:57               ` Paul Jackson
  2008-03-12 19:11                 ` Max Krasnyanskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-12 18:57 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max K wrote:
> How is that any different from tasks ? Exact same example right back at you.
> Suppose I have a task that needs to run in A and B but not C.

Can't happen.

Each task belongs to exactly one cpuset, no exceptions.

That's why you can't "treat irqs just like tasks".

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: boot cgroup questions
  2008-03-12 18:57               ` Paul Jackson
@ 2008-03-12 19:11                 ` Max Krasnyanskiy
  2008-03-12 19:32                   ` Paul Jackson
  0 siblings, 1 reply; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-12 19:11 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:
> Max K wrote:
>> How is that any different from tasks ? Exact same example right back at you.
>> Suppose I have a task that needs to run in A and B but not C.
> 
> Can't happen.
Of course it can. See below.

> Each task belongs to exactly one cpuset, no exceptions.
Sure. Same for irqs.

> That's why you can't "treat irqs just like tasks".
Sure you can.

I was talking about running on the _cpus_ that belong to the "sets A and B but
not C" and not that a task must belong to more than one cpuset. Unless I
misinterpreted your example, you were talking about the exact same thing. In other
words, that an irq needs to be assigned to the _cpus_ in the sets A and B but not C.
Makes sense ?

Max




* Re: boot cgroup questions
  2008-03-12  4:59           ` Paul Jackson
  2008-03-12 18:24             ` Max Krasnyanskiy
@ 2008-03-12 19:16             ` Paul Menage
  2008-03-12 19:24               ` Paul Jackson
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Menage @ 2008-03-12 19:16 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, mingo, a.p.zijlstra, linux-kernel

On Tue, Mar 11, 2008 at 9:59 PM, Paul Jackson <pj@sgi.com> wrote:
>
>  Paul M -- could we take a different tack here -- extend cgroups to map
>  -either- tasks or irqs to the managed resources?

Not cgroups, no. If you really wanted to extend cpusets specifically
to allow irqs to be assigned to a cpuset to control which cpus they
could execute on, then that might be a possibility. But I don't see
how this would be useful for any other cgroup subsystem, so it doesn't
belong in the generic framework.

My feeling is that just using a simple bitmask assignment, unrelated
to cpusets or cgroups, as Max suggested in his later email is the way
to go.

Paul


* Re: boot cgroup questions
  2008-03-12 19:16             ` Paul Menage
@ 2008-03-12 19:24               ` Paul Jackson
  2008-03-12 19:30                 ` Max Krasnyanskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-12 19:24 UTC (permalink / raw)
  To: Paul Menage; +Cc: maxk, mingo, a.p.zijlstra, linux-kernel

Paul M wrote:
> Not cgroups, no. If you really wanted to extend cpusets specifically
> to allow irqs to be assigned to a cpuset to control which cpus they
> could execute on, then that might be a possibility. But I don't see
> how this would be useful for any other cgroup subsystem, so it doesn't
> belong in the generic framework.

Ok - a sensible decision.

> My feeling is that just using a simple bitmask assignment, unrelated
> to cpusets or cgroups, as Max suggested in his later email is the way
> to go.

I'll have to have another go at reading his replies.  I seem to have
more difficulty making sense of his posts ... not sure why.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: boot cgroup questions
  2008-03-12 19:24               ` Paul Jackson
@ 2008-03-12 19:30                 ` Max Krasnyanskiy
  0 siblings, 0 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-12 19:30 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Paul Menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:
> Paul M wrote:
>> Not cgroups, no. If you really wanted to extend cpusets specifically
>> to allow irqs to be assigned to a cpuset to control which cpus they
>> could execute on, then that might be a possibility. But I don't see
>> how this would be useful for any other cgroup subsystem, so it doesn't
>> belong in the generic framework.
> 
> Ok - a sensible decision.
> 
>> My feeling is that just using a simple bitmask assignment, unrelated
>> to cpusets or cgroups, as Max suggested in his later email is the way
>> to go.
> 
> I'll have to have another go at reading his replies.  I seem to have
> more difficulty making sense of his posts ... not sure why.
I'm sure it's because of gazillion typos in them :).

Max


* Re: boot cgroup questions
  2008-03-12 19:11                 ` Max Krasnyanskiy
@ 2008-03-12 19:32                   ` Paul Jackson
  2008-03-12 20:08                     ` Max Krasnyanskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-12 19:32 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max wrote:
> I was talking about running on the _cpus_ that belong to the "sets A and B but 
> not C" and not that a task must belong to more than one cpuset.

This doesn't make sense to me.

If a task is to run on the CPUs in both sets A and B, then it has to be
in both those cpusets, which isn't allowed, or in some super set of both
A and B (that is, in this example, in the top cpuset), which doesn't
restrict the task to just A or B or their union.

I have no idea what distinction you are seeing between what _cpus_ a task
can run on, and what cpuset it belongs to.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: boot cgroup questions
  2008-03-12 19:32                   ` Paul Jackson
@ 2008-03-12 20:08                     ` Max Krasnyanskiy
  2008-03-12 20:37                       ` Paul Jackson
  0 siblings, 1 reply; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-12 20:08 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:
> Max wrote:
>> I was talking about running on the _cpus_ that belong to the "sets A and B but 
>> not C" and not that a task must belong to more than one cpuset.
> 
> This doesn't make sense to me.
> 
> If a task is to run on the CPUs in both sets A and B, then it has to be
> in both those cpusets, which isn't allowed, or in some super set of both
> A and B (that is, in this example, in the top cpuset), which doesn't
> restrict the task to just A or B or their union.
> 
> I have no idea what distinction you are seeing between what _cpus_ a task
> can run on, and what cpuset it belongs to.

Paul, we are in 100% agreement here about the tasks. All I'm saying is that 
the same exact thing applies to the irqs. Again let me try your example.

Suppose we have
	/dev/cpuset/A
	/dev/cpuset/B
	/dev/cpuset/C

Now suppose that for whatever reason I must run task1 on the cpus that belong 
to sets A and B but not C. The only way to do that with cpusets is

	/dev/cpuset/X
	          |-- A
	          `-- B
	/dev/cpuset/C

i.e. create parent cpuset X and assign task1 into cpuset X.
Of course if A and B are not cpu_exclusive then X does not have to be their 
parent.

Makes sense so far ?

Now the same exact thing can be said about the irqs. If I need to assign irq1 
to the cpus in sets A and B but not C I have to create set X that is the union 
of A and B, and assign irq1 to the set X.
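
To make the union concrete, here is a minimal Python sketch. The CPU numbers
are made up for illustration (nothing above says which CPUs A, B and C hold),
and the dictionary is only a stand-in for the real cpuset filesystem.

```python
# Hypothetical CPU numbers for the three cpusets in the example.
cpusets = {
    "A": {0, 1},
    "B": {2, 3},
    "C": {4, 5, 6, 7},
}


def parent_cpus(children):
    """Return the CPUs a parent cpuset must hold to cover the named children."""
    cpus = set()
    for name in children:
        cpus |= cpusets[name]
    return cpus


# X is the parent of A and B; C stays a sibling of X.
cpusets["X"] = parent_cpus(["A", "B"])

# Anything placed in X (task1, or irq1 under the "irq as pseudo-task" idea)
# may use the CPUs of A and B, and none of the CPUs of C.
print(sorted(cpusets["X"]))
```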

This is what I meant by "deeper hierarchies" in the earlier emails.

Did I do a better job explaining this time :) ?

Max

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12 20:08                     ` Max Krasnyanskiy
@ 2008-03-12 20:37                       ` Paul Jackson
  2008-03-12 22:29                         ` Max Krasnyanskiy
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-12 20:37 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

> This is what I meant by "deeper hierarchies" in the earlier emails.

These deeper hierarchies create an incompatibility in some common uses
of cpusets.

When my example had cpusets A, B and C, that was as stated, not as
might be modified to X, X/A, X/B and C.

If the user has or would have set up cpusets A, B and C because that's
what they needed to manage the CPU and Memory Node placement of their
tasks, then that's what they might have set up, and there is a good
chance that they would find the imposition of the extra 'X' cpuset to
be a problem, to require more code and to be a cause of bugs.

Adding irqs to the cpuset hierarchy isn't free; it can further overload
the hierarchy, with "deeper hierarchies" as you state.

If instead of deeper hierarchies, we allow the same irq to be listed in
more than one cpuset (unlike tasks, which only get one cpuset) then we
need some way, independent of the cpuset hierarchy, to determine how to
resolve conflicts.  We can't just add all the cpus together, allowing an
irq to be directed to any CPU which is listed in any cpuset that
accepts that irq, because a major use for this is to remove irqs from
certain realtime CPUs.

So ... if the natural hierarchy needed to map irqs to CPUs is not a
subset of the natural hierarchy needed to map tasks to sets of CPUs and
Nodes, then we either deepen the hierarchy (cross product of the
two maps, essentially) or we allow the same irq to be listed in
multiple cpusets and provide some alternative mechanism, outside the
hierarchy, to resolve the resulting conflicts in the irq to CPU map.
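
A small Python sketch of why simply adding the CPUs together does not work.
The cpuset names, CPU numbers and irq numbers are all hypothetical, and the
"cpuset lists the irqs it accepts" model is only the one assumed in the
paragraphs above, not an existing kernel interface.

```python
# Hypothetical layout: "rt" holds realtime CPUs that must not see irq 19.
cpuset_cpus = {
    "top": {0, 1, 2, 3, 4, 5},  # the top cpuset overlaps everything
    "sys": {0, 1, 2, 3},
    "rt": {4, 5},
}

# Irqs each cpuset accepts, with the same irq allowed in several cpusets.
cpuset_irqs = {
    "top": {19, 42},
    "sys": {19},
    "rt": {42},  # rt deliberately leaves irq 19 out
}


def naive_irq_cpus(irq):
    """Union of the CPUs of every cpuset that lists this irq."""
    cpus = set()
    for name, irqs in cpuset_irqs.items():
        if irq in irqs:
            cpus |= cpuset_cpus[name]
    return cpus


# irq 19 still reaches the realtime CPUs through the overlapping top cpuset,
# so some rule outside the hierarchy would have to resolve the conflict.
leaked = naive_irq_cpus(19) & cpuset_cpus["rt"]
print(sorted(leaked))
```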

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12 20:37                       ` Paul Jackson
@ 2008-03-12 22:29                         ` Max Krasnyanskiy
  2008-03-12 23:30                           ` Paul Jackson
  2008-03-12 23:32                           ` Paul Jackson
  0 siblings, 2 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-12 22:29 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:

>> This is what I meant by "deeper hierarchies" in the earlier emails.
> 
> These deeper hierarchies create an incompatibility in some common uses
> of cpusets.
> 
> When my example had cpusets A, B and C, that was as stated, not as
> might be modified to X, X/A, X/B and C.
> 
> If the user has or would have set up cpusets A, B and C because that's
> what they needed to manage the CPU and Memory Node placement of their
> tasks, then that's what they might have set up, and there is a good
> chance that they would find the imposition of the extra 'X' cpuset to
> be a problem, to require more code and to be a cause of bugs.

Isn't that just an issue of planning ? Those cpusets are not cast in stone, are
they ? I mean yes, users have set up A, B, C the way they did because that's what
they needed. Now their plans/requirements have changed. They now want to also
manage irqs via cpusets and in order to do that they need to replan/redo the
partitioning.

In order to manage irqs the code has to change anyway because currently there
is no way to do that via cpusets. The users would have two options:
1. keep all irqs in the top set and manage them individually via /proc
2. lay out cpusets differently

btw I still do not see the "incompatibility" argument. Probably because I have
no idea how the software you're talking about is designed. Are you saying that
the software relies on a flat cpuset partitioning ? i.e. that it will break if
users add extra cpuset levels ?

> Adding irqs to the cpuset hierarchy isn't free; it can further overload
> the hierarchy, with "deeper hierarchies" as you state.
> 
> If instead of deeper hierarchies, we allow the same irq to be listed in
> more than one cpuset (unlike tasks, which only get one cpuset) then we
> need some way, independent of the cpuset hierarchy, to determine how to
> resolve conflicts.  We can't just add all the cpus together, allowing an
> irq to be directed to any CPU which is listed in any cpuset that
> accepts that irq, because a major use for this is to remove irqs from
> certain realtime CPUs.
This sounds like overkill and, as you pointed out, it is not even clear how it'd
work.

Looks like we have a trade-off here:
1. use simple "irq == pseudo-task" concept and potentially break some existing
software. We do have a working solution.
2. come up with something that requires more complex irq management rules.
We do not have a working solution.

My vote goes for #1 :).

> So ... if the natural hierarchy needed to map irqs to CPUs is not a
> subset of the natural hierarchy needed to map tasks to sets of CPUs and
> Nodes, then we either deepen the hierarchy (cross product of the
> two maps, essentially) or we allow the same irq to be listed in
> multiple cpusets and provide some alternative mechanism, outside the
> hierarchy, to resolve the resulting conflicts in the irq to CPU map.
I think by natural you mean "compatible with existing sw". What is unnatural
about extra levels of cpusets ? As I read the cgroup/cpuset documentation, it
implies that nested cgroups/cpusets are allowed.

Max



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12 22:29                         ` Max Krasnyanskiy
@ 2008-03-12 23:30                           ` Paul Jackson
  2008-03-13  0:57                             ` Max Krasnyanskiy
  2008-03-12 23:32                           ` Paul Jackson
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-12 23:30 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max K wrote:
> btw I still do not see the "incompatibility" argument.

It's similar, perhaps, to what happens when we try to accommodate two
architectures in one file system, with things like:
	/x86_64/bin
	/ia64/bin
replacing the well known /bin.

Things break.  Apps such as the major batch schedulers (PBS and LSF)
and various other tools and scripts buried here and there have become
used to developing particular cpuset hierarchies over the last couple
of years.

Any time you force another dimension into such an existing hierarchy,
things break, and people get annoyed.

Sure ... the kernel doesn't care ... it can handle whatever hierarchy
you like.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12 22:29                         ` Max Krasnyanskiy
  2008-03-12 23:30                           ` Paul Jackson
@ 2008-03-12 23:32                           ` Paul Jackson
  2008-03-13  0:46                             ` Max Krasnyanskiy
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-12 23:32 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max K wrote:
> 1. use simple "irq == pseudo-task" concept and potentially break some existing
> software. We do have a working solution.

Breaking existing software is not what I call working.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12 23:32                           ` Paul Jackson
@ 2008-03-13  0:46                             ` Max Krasnyanskiy
  0 siblings, 0 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-13  0:46 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:
> Max K wrote:
>> 1. use simple "irq == pseudo-task" concept and potentially break some existing
>> software. We do have a working solution.
> 
> Breaking existing software is not what I call working.
> 
Ok ok I get it :)
You know what I meant though. In the other scheme it's not even clear how it'd 
work in general.

Max



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-12 23:30                           ` Paul Jackson
@ 2008-03-13  0:57                             ` Max Krasnyanskiy
  2008-03-13  7:03                               ` Paul Jackson
  2008-03-13  7:12                               ` Paul Jackson
  0 siblings, 2 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-03-13  0:57 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Paul Jackson wrote:
> Max K wrote:
>> btw I still do not see the "incompatibility" argument.
> 
> It's similar, perhaps, to what happens when we try to accommodate two
> architectures in one file system, with things like:
> 	/x86_64/bin
> 	/ia64/bin
> replacing the well known /bin.
> 
> Things break.  Apps such as the major batch schedulers (PBS and LSF)
> and various other tools and scripts buried here and there have become
> used to developing particular cpuset hierarchies over the last couple
> of years.
> 
> Any time you force another dimension into such an existing hierarchy,
> things break, and people get annoyed.
> 
> Sure ... the kernel doesn't care ... it can handle whatever hierarchy
> you like.

Crazy idea. How about we add support for sym links to the cgroup fs ?
It's still much cleaner imo than dealing with complex irq grouping schemes.

In other words with symlinks we could do
`-- cpuset
     |-- A -> X/A
     |-- B -> X/B
     |-- C
     `-- X
         |-- A
         `-- B

The software that is used to the flat structure won't know the difference.

Max

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-13  0:57                             ` Max Krasnyanskiy
@ 2008-03-13  7:03                               ` Paul Jackson
  2008-04-10 18:03                                 ` Max Krasnyanskiy
  2008-03-13  7:12                               ` Paul Jackson
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-13  7:03 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max K wrote:
> cleaner imo than dealing with complex irq grouping schemes.

What's this "complex irq grouping scheme" that you're referring to?

If it's what I posted last week, with named sets of irqs, and each
cpuset naming which set it belonged to, that seems to me to actually
fit the usage pattern rather well.

The jobs running in particular cpusets need only know the 'name' of
the set of irqs it makes sense to send to its CPUs (the realtime
irqs, a particular piece of hardware's irqs, the ordinary system
irqs, the absolute minimum set of irqs, ...) and the system admin
gets to specify, one time, which irq numbers are in which named
set, or to change, later on, which set a particular irq is in, all
without having to have detailed knowledge of the jobs that want
particular irq sets directed to their CPUs.
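
As a rough illustration of that usage pattern, here is a Python sketch. The
set names, cpuset names and irq numbers are invented, and the two tables stand
in for whatever interface would actually hold this data.

```python
# Admin-maintained table: set name -> irq numbers (all values hypothetical).
irq_sets = {
    "realtime": {42},
    "system": {1, 12, 19},
    "minimal": {0},
}

# Each cpuset names one irq set; it never lists irq numbers itself.
cpuset_irq_set = {
    "rt_job": "realtime",
    "batch1": "system",
    "batch2": "system",
}


def irqs_for_cpuset(cpuset):
    """Irqs that should be directed to the CPUs of the given cpuset."""
    return irq_sets[cpuset_irq_set[cpuset]]


# Later on the admin moves irq 19 into the realtime set. No cpuset is edited
# and no job needs to know the irq number.
irq_sets["system"].discard(19)
irq_sets["realtime"].add(19)

print(sorted(irqs_for_cpuset("rt_job")))
print(sorted(irqs_for_cpuset("batch1")))
```

Note that batch1 and batch2 share one named set, so several cpusets can
receive the same irqs without any change to the cpuset hierarchy.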

We tend to label whatever makes sense to us as "simple", and whatever
doesn't seem necessary in our experience, or doesn't make sense, as
"complex".

Such labels are losing their meaning these days, other than to help
others figure out what we favor, or disfavor.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-13  0:57                             ` Max Krasnyanskiy
  2008-03-13  7:03                               ` Paul Jackson
@ 2008-03-13  7:12                               ` Paul Jackson
  2008-04-10 17:24                                 ` Max Krasnyanskiy
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-03-13  7:12 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

> How about we add support for sym links to the cgroup fs ?

Still pollutes the primary cpuset name space ... you have all
the directories X, X/A, and X/B as well as the symlinks A and B.

Symlinks allow for one path that needs to be 'aliased' to another,
but they are a one-way map; without an exhaustive search of the
potential namespace, one can't invert them, or determine if they
can't be inverted.

Tools have to constantly make heuristic decisions whether to
default to dereferencing the symlink, or not, and often have to
provide alternatives for the non-default choice.

They are a pain in the backside even if designed in and expected
up front.

If added as critical structure after the fact, something breaks,
pretty much for sure.

For one minor example, code I've probably buried someplace that
does "find /dev/cpuset -type d" to find all cpusets would break.

Or the one-line /sbin/cpuset_release_agent script:
	rmdir /dev/cpuset/$1
is broken -- fails to clean-up associated symlinks, and can't
avoid race conditions if it tries to add code to do that.
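
Both failure modes can be reproduced on an ordinary filesystem. The Python
sketch below uses a temporary directory as a stand-in for /dev/cpuset (an
assumption; a real cpuset filesystem is not involved) and assumes a POSIX
system where unprivileged symlinks are allowed.

```python
import os
import tempfile


def real_dirs(root):
    """Like `find root -mindepth 1 -type d`: real directories, no symlinks."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                found.append(os.path.relpath(full, root))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    # Layout from the picture: X/A, X/B, C, plus symlinks A and B.
    for path in ("X/A", "X/B", "C"):
        os.makedirs(os.path.join(root, path))
    for name in ("A", "B"):
        os.symlink(os.path.join("X", name), os.path.join(root, name))

    # A "-type d" scan no longer sees the flat A, B, C layout.
    flat_view = real_dirs(root)

    # A release agent that only removes the directory leaves the link behind.
    os.rmdir(os.path.join(root, "X", "A"))
    link = os.path.join(root, "A")
    dangling = os.path.islink(link) and not os.path.exists(link)

print(flat_view)
print(dangling)
```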

> Crazy idea.

Agreed ;)

But nice picture ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-13  7:12                               ` Paul Jackson
@ 2008-04-10 17:24                                 ` Max Krasnyanskiy
  2008-04-10 17:37                                   ` Paul Jackson
  0 siblings, 1 reply; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-04-10 17:24 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Sorry for disappearing on you guys. I'm working on releasing the user-space 
framework and engine that uses cpu isolation for hard-RT. Once that's done I'm 
going to resurrect these efforts. In the meantime let me reply to your last
comments.

Paul Jackson wrote:
>> How about we add support for sym links to the cgroup fs ?
> 
> Still pollutes the primary cpuset name space ... you have all
> the directories X, X/A, and X/B as well as the symlinks A and B.
> 
> Symlinks allow for one path that needs to be 'aliased' to another,
> but they are a one-way map; without an exhaustive search of the
> potential namespace, one can't invert them, or determine if they
> can't be inverted.
> 
> Tools have to constantly make heuristic decisions whether to
> default to dereferencing the symlink, or not, and often have to
> provide alternatives for the non-default choice.
> 
> They are a pain in the backside even if designed in and expected
> up front.
> 
> If added as critical structure after the fact, something breaks,
> pretty much for sure.
> 
> For one minor example, code I've probably buried someplace that
> does "find /dev/cpuset -type d" to find all cpusets would break.
> 
> Or the one-line /sbin/cpuset_release_agent script:
> 	rmdir /dev/cpuset/$1
> is broken -- fails to clean-up associated symlinks, and can't
> avoid race conditions if it tries to add code to do that.
> 
>> Crazy idea.
> 
> Agreed ;)

Got it. Symlinks are out :)

Max



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-04-10 17:24                                 ` Max Krasnyanskiy
@ 2008-04-10 17:37                                   ` Paul Jackson
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Jackson @ 2008-04-10 17:37 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max K wrote:
> > Agreed ;)
> 
> Got it. Symlinks are out :)

Good ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-03-13  7:03                               ` Paul Jackson
@ 2008-04-10 18:03                                 ` Max Krasnyanskiy
  2008-04-14 18:39                                   ` Paul Jackson
  2008-04-14 18:42                                   ` boot cgroup questions Paul Jackson
  0 siblings, 2 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-04-10 18:03 UTC (permalink / raw)
  To: Paul Jackson; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

The context here was that we were talking about a way to group irqs and assign 
them to the cpusets. I was proposing to just treat IRQs as tasks, and you were 
proposing to add some additional grouping. Replies inline below.

Paul Jackson wrote:
> Max K wrote:
>> cleaner imo than dealing with complex irq grouping schemes.
> 
> What's this "complex irq grouping scheme" that you're referring to?
> 
> If it's what I posted last week, with named sets of irqs, and each
> cpuset naming which set it belonged to, that seems to me to actually
> fit the usage pattern rather well.
I was just saying that cpuset already provides a nice grouping. After thinking 
about this some more I still do not see a need to group IRQs before assigning 
them to the cpusets. That's the complexity I was talking about.

> The jobs running in particular cpusets need only know the 'name' of
> the set of irqs it makes sense to send to its CPUs (the realtime
> irqs, a particular piece of hardware's irqs, the ordinary system
> irqs, the absolute minimum set of irqs, ...) and the system admin
> gets to specify, one time, which irq numbers are in which named
> set, or to change, later on, which set a particular irq is in, all
> without having to have detailed knowledge of the jobs that want
> particular irq sets directed to their CPUs.
> 
> We tend to label whatever makes sense to us as "simple", and whatever
> doesn't seem necessary in our experience, or doesn't make sense, as
> "complex".
> 
> Such labels are losing their meaning these days, other than to help
> others figure out what we favor, or disfavor.
I agree in general. In this particular case additional grouping introduces
even more hierarchy. It seems to me that
	"irqN -> cpu1, cpu2, cpu3"
is a very simple, straightforward relationship. Whereas
	"irqN -> groupX"
	"groupX -> cpu1"
	"groupX -> cpu2"
	"groupX -> cpu3"
is not that straightforward.

Anyway. I think it all boils down to the compatibility with existing
user-space apps. I still like the simple approach of treating irqs like tasks
when it comes to assigning them to the cpusets. Which, as we discussed earlier,
in some cases may require an extra level in the cpuset hierarchy. The question
is, is that really such a big problem ? If we make the in-kernel boot set optional,
by default all irqs will be in the root cpuset. Which means people can still
use /proc/irq/N/smp_affinity and manage irqs just like they do now. There are
no compatibility issues in that case.
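
For reference, /proc/irq/N/smp_affinity takes a hexadecimal CPU bitmask, with
bit N standing for CPU N. A minimal Python sketch of the conversion follows;
the helper names are made up here, and the writer side assumes fewer than 32
CPUs so that no comma-separated 32-bit groups are needed.

```python
def cpus_to_mask(cpus):
    """Hex bitmask string with bit N set for every CPU N in `cpus`."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(text):
    """Parse a hex mask, tolerating the commas used between 32-bit groups."""
    mask = int(text.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


# "irqN -> cpu1, cpu2, cpu3" from the example above.
print(cpus_to_mask([1, 2, 3]))
print(mask_to_cpus("0000000f"))
```

Writing the resulting string into /proc/irq/N/smp_affinity needs root and is
left out of the sketch on purpose.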

So do you think app compatibility is an issue in that case ?
Also isn't it likely that the apps will gradually adapt to handling 
multi-level cpusets anyway ? I mean you guys were talking about how wonderful 
and flexible cpusets are, but we cannot seem to use the flexibility because 
the apps are designed for a flat layout.

Max

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-04-10 18:03                                 ` Max Krasnyanskiy
@ 2008-04-14 18:39                                   ` Paul Jackson
  2008-05-09 10:45                                     ` Peter Zijlstra
  2008-04-14 18:42                                   ` boot cgroup questions Paul Jackson
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-04-14 18:39 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max wrote:
> I mean you guys were talking about how wonderful 
> and flexible cpusets are, but we cannot seem to use the flexibility because 
> the apps are designed for a flat layout

No.  Not flat.  Not at all flat.

We routinely and normally have an interesting hierarchy of cpusets
below /dev/cpuset.  However that hierarchy is determined by the
nesting of subsets of the nodes (CPUs and/or Memory) on the system.

These subsets of nodes in the /dev/cpuset hierarchy may well map
nicely into the subsets of CPUs that can receive a particular set
of IRQs, however that map is not bijective.  Of particular interest
here, it's not injective, meaning that multiple cpusets might and
will commonly receive the same set of IRQs.  You can force this map
to be injective by elaborating the cpuset hierarchy to reflect both
this new assignment of IRQs and the (CPU and/or Memory) node subset
hierarchy that it currently reflects, but that will break code that
was expecting the directory tree below /dev/cpuset to directly and
only reflect the node hierarchy.

In less mathematically obtuse wording, sure you can add more directory
layers below /dev/cpuset, to handle IRQ assignments, but that will
break code that was expecting the /dev/cpuset directory tree to only
reflect the nesting of (CPU and/or Memory) nodes.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-04-10 18:03                                 ` Max Krasnyanskiy
  2008-04-14 18:39                                   ` Paul Jackson
@ 2008-04-14 18:42                                   ` Paul Jackson
  1 sibling, 0 replies; 42+ messages in thread
From: Paul Jackson @ 2008-04-14 18:42 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: menage, mingo, a.p.zijlstra, linux-kernel

Max K wrote:
> I agree in general. In this particular case additional grouping introduces 
> even more hierarchy. It seems to me that
> 	"irqN -> cpu1, cpu2, cpu3"
> is a very simple, straightforward relationship. Whereas
> 	"irqN -> groupX"
> 	"groupX -> cpu1"
> 	"groupX -> cpu2"
> 	"groupX -> cpu3"
> Is not that straightforward.

Clearly, yes, the first is simpler than the second.

The question is which is correct.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: boot cgroup questions
  2008-04-14 18:39                                   ` Paul Jackson
@ 2008-05-09 10:45                                     ` Peter Zijlstra
  2008-05-09 11:17                                       ` IRQ affinities (was: boot cgroup questions) Paul Jackson
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2008-05-09 10:45 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Max Krasnyanskiy, menage, mingo, linux-kernel

On Mon, 2008-04-14 at 13:39 -0500, Paul Jackson wrote:
> Max wrote:
> > I mean you guys were talking about how wonderful 
> > and flexible cpusets are, but we cannot seem to use the flexibility because 
> > the apps are designed for a flat layout
> 
> No.  Not flat.  Not at all flat.
> 
> We routinely and normally have an interesting hierarchy of cpusets
> below /dev/cpuset.  However that hierarchy is determined by the
> nesting of subsets of the nodes (CPUs and/or Memory) on the system.
> 
> These subsets of nodes in the /dev/cpuset hierarchy may well map
> nicely into the subsets of CPUs that can receive a particular set
> of IRQs, however that map is not bijective.  Of particular interest
> here, it's not injective, meaning that multiple cpusets might and
> will commonly receive the same set of IRQs.  You can force this map
> to be injective by elaborating the cpuset hierarchy to reflect both
> this new assignment of IRQs and the (CPU and/or Memory) node subset
> hierarchy that it currently reflects, but that will break code that
> was expecting the directory tree below /dev/cpuset to directly and
> only reflect the node hierarchy.
> 
> In less mathematically obtuse wording, sure you can add more directory
> layers below /dev/cpuset, to handle IRQ assignments, but that will
> break code that was expecting the /dev/cpuset directory tree to only
> reflect the nesting of (CPU and/or Memory) nodes.

Sorry for being rather late to the game - other stuff keeps me from
doing anything much here :-(.

Anyway, the current applications don't support IRQ assignment.
That's a new feature; and it's quite common that new features require
code changes.

So I'm not seeing the problem - don't change code and stuff works as
before - change code and you get new stuff.

So I'm arguing in favour of the IRQs as tasks idea that might need extra
hierarchy levels.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* IRQ affinities (was: boot cgroup questions)
  2008-05-09 10:45                                     ` Peter Zijlstra
@ 2008-05-09 11:17                                       ` Paul Jackson
  2008-05-09 11:48                                         ` Peter Zijlstra
  2008-05-21  1:14                                         ` Max Krasnyanskiy
  0 siblings, 2 replies; 42+ messages in thread
From: Paul Jackson @ 2008-05-09 11:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, menage, mingo, linux-kernel

Peter wrote:
> That's a new feature; and it's quite common that new features require
> code changes.

It's common for new features to require code changes to take advantage
of the new features.

It's less desirable that taking advantage of such new features breaks
existing, basically unrelated, code.

My gut sense is that, in a misguided effort to find a "simple" answer
to irq distribution, we (well, y'all) are trying to attach this
feature to cpusets or cgroups.

Let me ask a different question:

  What solutions would you (Max, Peter, Ingo, lurkers, ...) be
  suggesting for this 'IRQ affinity' problem if cpusets and
  cgroups didn't exist in any form whatsoever?

The answer to that question might help me contribute to this discussion
in another way ... it might help me understand better what we're really
trying to do here.  You guys were proposing mechanisms that don't fit
my architecture sense of cpusets, but I was having problems figuring out
what are the essential underlying requirements, independent of choice
of mechanism.

Perhaps by describing one or two possible alternative, cpuset-free,
mechanisms that come more or less close to meeting our needs, I will
glean a better understanding of these elusive requirements, and can
better contribute to the discussion of design trade offs facing us.

So could you describe some possible cpuset-free solutions?  If they are
flawed in some critical way, that's ok, just point out said flaw(s).
Either way, this could help illuminate what's needed here.

It might be, once I better understand the requirements, possible
solutions and their tradeoffs, that I come to agree that cpusets or
cgroups present the best mechanism, given the tradeoffs and what's
needed.  Or it might be we find a better way to meet our needs.

Actually, if for no other reason than to bring any lurkers up to speed,
if you (Max or Peter, likely) wanted to describe, from the beginning,
what this discussion is about, that would be good too.  I doubt anyone
outside of three or four of us even recalls that long discussion of
February and March, 2008.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities (was: boot cgroup questions)
  2008-05-09 11:17                                       ` IRQ affinities (was: boot cgroup questions) Paul Jackson
@ 2008-05-09 11:48                                         ` Peter Zijlstra
  2008-05-09 12:03                                           ` Paul Jackson
  2008-05-21  1:14                                         ` Max Krasnyanskiy
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2008-05-09 11:48 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, menage, mingo, linux-kernel

On Fri, 2008-05-09 at 06:17 -0500, Paul Jackson wrote:
> Peter wrote:
> > That's a new feature; and it's quite common that new features require
> > code changes.
> 
> It's common for new features to require code changes to take advantage
> of the new features.
> 
> It's less desirable that taking advantage of such new features breaks
> existing, basically unrelated, code.
> 
> My gut sense is that, in a misguided effort to find a "simple" answer
> to irq distribution, we (well, y'all) are trying to attach this
> feature to cpusets or cgroups.
> 
> Let me ask a different question:
> 
>   What solutions would you (Max, Peter, Ingo, lurkers, ...) be
>   suggesting for this 'IRQ affinity' problem if cpusets and
>   cgroups didn't exist in any form whatsoever?
> 
> The answer to that question might help me contribute to this discussion
> in another way ... it might help me understand better what we're really
> trying to do here.  You guys were proposing mechanisms that don't fit
> my architecture sense of cpusets, but I was having problems figuring out
> what are the essential underlying requirements, independent of choice
> of mechanism.
> 
> Perhaps by describing one or two possible alternative, cpuset-free,
> mechanisms that come more or less close to meeting our needs, I will
> glean a better understanding of these elusive requirements, and can
> better contribute to the discussion of design trade offs facing us.
> 
> So could you describe some possible cpuset-free solutions?  If they are
> flawed in some critical way, that's ok, just point out said flaw(s).
> Either way, this could help illuminate what's needed here.
> 
> It might be, once I better understand the requirements, possible
> solutions and their tradeoffs, that I come to agree that cpusets or
> cgroups present the best mechanism, given the tradeoffs and what's
> needed.  Or it might be we find a better way to meet our needs.
> 
> Actually, if for no other reason than to bring any lurkers up to speed,
> if you (Max or Peter, likely) wanted to describe, from the beginning,
> what this discussion is about, that would be good too.  I doubt anyone
> outside of three or four of us even recalls that long discussion of
> February and March, 2008.


I see two use-cases:

 - Isolation
 - NUMA node devices

With isolation you want to move all of your 'normal' system tasks off to
one side of your machine and use the other side for 'special - rt' tasks.

For IRQs this means that you want to move all the 'normal' IRQs along
with the 'normal' tasks, and move the special IRQs into the rt side.

Of course you can do this by setting IRQ affinities one by one, but
being able to group the IRQs seems a sensible thing to me.

One thing here is that we'd like to also provide a default group for new
IRQs, so that when a new device appears it's not allowed into the
'special' side of your machine.

This is what Max focused on, and it provides a binary division of your
machine: special and not special.


Now I was thinking that if we generalize this whole thing it might be
useful for other purposes such as IRQ placement near the nodes that host
the device and/or the application using them.


So what we'd end up with is named affinity groups that contain (unique)
IRQs. 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities (was: boot cgroup questions)
  2008-05-09 11:48                                         ` Peter Zijlstra
@ 2008-05-09 12:03                                           ` Paul Jackson
  2008-05-09 12:14                                             ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-05-09 12:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, menage, mingo, linux-kernel

Peter wrote:
> I see two use-cases:
> 
>  - Isolation
>  - NUMA node devices

Ok ... so let me propose an entirely different solution.

No doubt it has some terrible flaw, but I'll just have to
await your replies to see what that is.

How about we have:

 1) Yet another text config file in /etc, this one containing
    lines having two fields:
	* a list of IRQs, and
	* a cpumask.
    This file would specify which CPUs should handle which IRQs.

 2) A utility that can be run, after changing the above file, 
    to poke the proper cpumask to each IRQ, as specified in
    the file.

(Obligatory "simple" marketing claim: the above requires no
kernel changes.)
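
A minimal sketch of what such a utility might look like, in Python. The
file format (an IRQ list such as "16,17,24-26", whitespace, then a hex
cpumask, with "#" comments) and the function names are assumptions made
up for illustration; only the /proc/irq/N/smp_affinity interface is
taken from the existing kernel.

```python
from pathlib import Path


def parse_irq_list(text):
    """Expand an IRQ list such as '16,17,24-26' into a list of ints."""
    irqs = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            irqs.extend(range(int(lo), int(hi) + 1))
        else:
            irqs.append(int(part))
    return irqs


def parse_config(text):
    """Parse lines of '<irq list> <hex cpumask>' into {irq: mask}.

    The two-field layout is a hypothetical format, not an existing one.
    """
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        irq_field, mask = line.split()
        int(mask.replace(",", ""), 16)  # raises ValueError on a bad mask
        for irq in parse_irq_list(irq_field):
            table[irq] = mask
    return table


def apply_config(table, proc_irq=Path("/proc/irq")):
    """Write each mask to <proc_irq>/<N>/smp_affinity.

    Returns the IRQs whose directory does not exist yet, which is
    exactly the case a pure user-space tool cannot cover.
    """
    skipped = []
    for irq, mask in sorted(table.items()):
        target = proc_irq / str(irq) / "smp_affinity"
        if not target.exists():
            skipped.append(irq)
            continue
        target.write_text(mask + "\n")
    return skipped
```

The list returned by apply_config() is the interesting part: any IRQ
named in the file but not yet allocated has no /proc/irq/N directory,
so the utility can only report it and try again later.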

What am I missing?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities (was: boot cgroup questions)
  2008-05-09 12:03                                           ` Paul Jackson
@ 2008-05-09 12:14                                             ` Peter Zijlstra
  2008-05-09 12:36                                               ` Paul Jackson
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2008-05-09 12:14 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, menage, mingo, linux-kernel

On Fri, 2008-05-09 at 07:03 -0500, Paul Jackson wrote:
> Peter wrote:
> > I see two use-cases:
> > 
> >  - Isolation
> >  - NUMA node devices
> 
> Ok ... so let me propose an entirely different solution.
> 
> No doubt it has some terrible flaw, but I'll just have to
> await your replies to see what that is.
> 
> How about we have:
> 
>  1) Yet another text config file in /etc, this one containing
>     lines having two fields:
> 	* a list of IRQs, and
> 	* a cpumask.
>     This file would specify which CPUs should handle which IRQs.
> 
>  2) A utility that can be run, after changing the above file, 
>     to poke the proper cpumask to each IRQ, as specified in
>     the file.
> 
> (Obligatory "simple" marketing claim: the above requires no
> kernel changes.)
> 
> What am I missing?

Two points:

 - we can't currently set irq affinities for non-existent (aka new) IRQs
 - it's a shame to duplicate the masks - most of this information would
   also be used in the cpuset structure used to place the tasks.





^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities (was: boot cgroup questions)
  2008-05-09 12:14                                             ` Peter Zijlstra
@ 2008-05-09 12:36                                               ` Paul Jackson
  2008-05-09 17:43                                                 ` Paul Jackson
  2008-05-21  1:21                                                 ` IRQ affinities Max Krasnyanskiy
  0 siblings, 2 replies; 42+ messages in thread
From: Paul Jackson @ 2008-05-09 12:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, menage, mingo, linux-kernel

Peter, responding to pj:
> > What am I missing?
> 
> Two points:
> 
>  - we can't currently set irq affinities for non-existent (aka new) IRQs
>  - it's a shame to duplicate the masks - most of this information would
>    also be used in the cpuset structure used to place the tasks.

Ok.  Let me twist this a turn tighter then.

The first of your two points, a default affinity mask for new irqs,
would seem to require a kernel change.  But that change could be a
single cpumask, settable in /sys somewhere, specifying the default
affinity.  If that's all we needed, it would be easy.

The second of your two points, "duplicating masks", seems more delicate.

The space of named cpusets (the directory pathnames below the usual
mount point, /dev/cpuset) is not really much more compact than the
set of interesting cpumasks.  But I suppose your point is that some
of the -particular- cpumasks already named by the cpuset hierarchy
are tantalizingly close to the set of interesting cpumasks needed for
irq affinity ... close given some combination of union, intersection,
set difference and complement operations, given my usual bias toward
looking at such things as this using set theory mechanisms.  That is,
for example, one might want all the CPUs in cpusets foo, bar and baz,
except the CPUs in cpuset blip, to handle IRQs so and so.

Let me think on that ... it's my nap time now.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities (was: boot cgroup questions)
  2008-05-09 12:36                                               ` Paul Jackson
@ 2008-05-09 17:43                                                 ` Paul Jackson
  2008-05-21  1:21                                                 ` IRQ affinities Max Krasnyanskiy
  1 sibling, 0 replies; 42+ messages in thread
From: Paul Jackson @ 2008-05-09 17:43 UTC (permalink / raw)
  To: Paul Jackson; +Cc: a.p.zijlstra, maxk, menage, mingo, linux-kernel

pj, talking to himself:
> That is, for example, one might want all the CPUs in cpusets
> foo, bar and baz, except the CPUs in cpuset blip, to handle
> IRQs so and so.

Ahh!  Perhaps that example has the keys to this kingdom.

How about this.  We add two files to each cpuset:

    irq_affinity_include	# IRQs to direct to CPUs in this cpuset
    irq_affinity_exclude	# IRQs -not- to direct to these CPUs

where irq_affinity_exclude overrides irq_affinity_include.

So, to determine to which CPUs a given interrupt (IRQ) can be directed:
 1) Combine (union) the 'cpus' of all the cpusets for which
    that IRQ is in that cpuset's irq_affinity_include, then
 2) Remove (set subtraction) the 'cpus' of any cpuset for which
    that IRQ is in that cpuset's irq_affinity_exclude.

In the simplest case of just wanting to isolate some CPUs with their
own special list of interrupts, one would:
 1) include all interrupts in the top cpuset's irq_affinity_include, and
 2) include the interrupts you don't want in the isolated cpuset's
    irq_affinity_exclude.

Observe that there is no dependency on the cpuset hierarchy in the above.
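
The include/exclude rule can be sketched in a few lines of Python. This
is only an illustration of the set arithmetic: the dictionary layout
and the function name allowed_cpus are invented for the example, and
nothing here reflects how the kernel would store these lists.

```python
def allowed_cpus(irq, cpusets):
    """Return the CPUs to which `irq` may be directed.

    `cpusets` maps a cpuset name to a dict with three sets:
      "cpus"    - CPU numbers in that cpuset
      "include" - IRQs listed in its irq_affinity_include
      "exclude" - IRQs listed in its irq_affinity_exclude
    The names are looked up flat; the hierarchy plays no part.
    """
    included = set()
    excluded = set()
    for cs in cpusets.values():
        if irq in cs["include"]:
            included |= cs["cpus"]  # step 1: union of including cpusets
        if irq in cs["exclude"]:
            excluded |= cs["cpus"]  # step 2: gather CPUs to remove
    return included - excluded      # exclude overrides include


# The simplest isolation case: every IRQ is included in the top cpuset,
# and IRQs 16 and 17 are excluded from the isolated cpuset (CPUs 2-3).
example = {
    "/":         {"cpus": {0, 1, 2, 3}, "include": {16, 17, 18}, "exclude": set()},
    "/isolated": {"cpus": {2, 3},       "include": set(),        "exclude": {16, 17}},
}
```

With this example, IRQ 16 may go to CPUs 0 and 1 only, while IRQ 18,
which is not excluded anywhere, may still go to all four CPUs. An IRQ
that appears in no include list ends up with an empty set, which is one
of the corner cases a real design would have to define.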

The contents of the files irq_affinity_include and irq_affinity_exclude
would be inherited by child cpusets on creation from their parents.

The one detail that puzzles me at the moment is what ownership and
permissions these two irq_affinity_* files would have.  I am concerned
that the usual permissions, which allow a job to write its own cpuset
files, would allow a job to affect the overall system to a greater
degree than is desired.  Perhaps an additional inheritance rule would
be useful and appropriate, such as a rule that a given cpuset's
irq_affinity_include must be a subset of its parent's or a rule that
a given cpuset's irq_affinity_exclude must be a -superset- of its
parent's; I'm unsure here.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities
  2008-05-09 11:17                                       ` IRQ affinities (was: boot cgroup questions) Paul Jackson
  2008-05-09 11:48                                         ` Peter Zijlstra
@ 2008-05-21  1:14                                         ` Max Krasnyanskiy
  2008-05-21  4:45                                           ` Arjan van de Ven
  2008-05-21  6:34                                           ` Paul Jackson
  1 sibling, 2 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-05-21  1:14 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Peter Zijlstra, menage, mingo, linux-kernel

Paul Jackson wrote:
> Peter wrote:
>> That's a new feature; and it's quite common that new features require
>> code changes.
> 
> It's common for new features to require code changes to take advantage
> of the new features.
> 
> It's less desirable that taking advantage of such new features breaks
> existing, basically unrelated, code.
> 
> My gut sense is that, in a misguided effort to find a "simple" answer
> to irq distribution, we (well, y'all) are trying to attach this
> feature to cpusets or cgroups.
> 
> Let me ask a different question:
> 
>   What solutions would you (Max, Peter, Ingo, lurkers, ...) be
>   suggesting for this 'IRQ affinity' problem if cpusets and
>   cgroups didn't exist in any form whatsoever?

As Peter explained, I'm focusing on the "CPU isolation" aspect, i.e. shielding a
CPU (or a set of CPUs) from various kernel activities (load balancing, soft
and hard irq handling, workqueues, etc).

For the IRQs specifically all I need is to be able to tell the kernel to not
route IRQs to certain CPUs. That mostly works already via
/proc/irq/N/smp_affinity; the problem is dynamically allocated irqs, because the
/proc/irq/N directory does not exist until those IRQs are allocated/enabled.

Originally I introduced a global cpu_isolated_map. The IRQ code was using that map
to exclude CPU(s) from IRQ routing. What I realized now is that all I need is
/proc/irq/default_smp_affinity. In other words I just need to export the default
mask used by the IRQ layer. I think this makes sense regardless of what cpuset
based solution we'll come up with.
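
As a rough illustration of the mask such a default file would carry,
here is a Python sketch that builds the hex cpumask for "all online
CPUs except the isolated ones". The helper names are hypothetical, and
the output is assumed to use the same hex format as
/proc/irq/N/smp_affinity, with comma-separated 32-bit groups on larger
machines.

```python
def cpumask_hex(cpus):
    """Format a set of CPU numbers as a hex cpumask string.

    Masks wider than 32 bits are split into comma-separated 32-bit
    groups, most significant group first.
    """
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    groups = []
    while True:
        groups.append(bits & 0xFFFFFFFF)  # least significant group first
        bits >>= 32
        if not bits:
            break
    first, rest = groups[-1], groups[-2::-1]
    return ",".join([f"{first:x}"] + [f"{g:08x}" for g in rest])


def default_mask_excluding(online, isolated):
    """Mask of every online CPU that is not in the isolated set."""
    return cpumask_hex(set(online) - set(isolated))
```

For an 8-CPU machine with CPUs 6 and 7 isolated,
default_mask_excluding(range(8), {6, 7}) gives "3f". Writing that
string to the proposed default file at boot, before drivers allocate
their IRQs, is what would keep newly created IRQs off the isolated
CPUs.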

Max

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities
  2008-05-09 12:36                                               ` Paul Jackson
  2008-05-09 17:43                                                 ` Paul Jackson
@ 2008-05-21  1:21                                                 ` Max Krasnyanskiy
  1 sibling, 0 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-05-21  1:21 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Peter Zijlstra, menage, mingo, linux-kernel

Paul Jackson wrote:
> Peter, responding to pj:
>>> What am I missing?
>> Two points:
>>
>>  - we can't currently set irq affinities for non-existent (aka new) IRQs
>>  - its a shame to duplicate the masks - most of this information would
>>    also be used in the cpuset structure used to place the tasks.
> 
> Ok.  Let me twist this a turn tighter then.
> 
> The first of your two points, a default affinity mask for new irqs,
> would seem to require a kernel change.  But that change could be a
> single cpumask, settable in /sys somewhere, specifying the default
> affinity.  If that's all we needed, it would be easy.
Looks like we arrived at the same conclusion. See my prev reply.
I'm in the process of making a patch for exposing default affinity mask.

> The second of your two points, "duplicating masks", seems more delicate.
There is actually no duplication as far as I can see because IRQ layer already 
has the default_mask variable. It just needs to be exposed via /proc or /sys.

> The space of named cpusets (the directory pathnames below the usual
> mount point, /dev/cpuset) is not really much more compact than the
> set of interesting cpumasks.  But I suppose your point is that some
> of the -particular- cpumasks already named by the cpuset hierarchy
> are tantalizingly close to the set of interesting cpumasks needed for
> irq affinity ... close given some combination of union, intersection,
> set difference and complement operations, given my usual bias toward
> looking at such things as this using set theory mechanisms.  That is,
> for example, one might want all the CPUs in cpusets foo, bar and baz,
> except the CPUs in cpuset blip, to handle IRQs so and so.
> 
> Let me think on that ... it's my nap time now.
This would be overkill imho.

Max


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities
  2008-05-21  1:14                                         ` Max Krasnyanskiy
@ 2008-05-21  4:45                                           ` Arjan van de Ven
  2008-05-21 16:18                                             ` Max Krasnyanskiy
  2008-05-21  6:34                                           ` Paul Jackson
  1 sibling, 1 reply; 42+ messages in thread
From: Arjan van de Ven @ 2008-05-21  4:45 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Paul Jackson, Peter Zijlstra, menage, mingo, linux-kernel

On Tue, 20 May 2008 18:14:58 -0700
Max Krasnyanskiy <maxk@qualcomm.co
> 

> For the IRQs specifically all I need is to be able to tell the kernel
> to not route IRQs to certain CPUs. That mostly works already via
> /proc/irq/N/smp_affinity, the problem is dynamically allocated irqs
> because /proc/irq/N directory does not exist until those IRQs are
> allocated/enabled.

why don't you tell irqbalance instead? it'll make sure the irq stays
out of the wind...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities
  2008-05-21  1:14                                         ` Max Krasnyanskiy
  2008-05-21  4:45                                           ` Arjan van de Ven
@ 2008-05-21  6:34                                           ` Paul Jackson
  2008-05-21 17:58                                             ` Max Krasnyanskiy
  1 sibling, 1 reply; 42+ messages in thread
From: Paul Jackson @ 2008-05-21  6:34 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: a.p.zijlstra, menage, mingo, linux-kernel

Max wrote:
> What I realized now is that all I need is
> /proc/irq/default_smp_affinity.

I suspect that something like you're proposing to do here will answer
your needs, to "tell the kernel to not route IRQs to certain CPUs."

I suspect that other folks will have some additional needs, that perhaps
my idea of May 9, 2008:

       How about this.  We add two files to each cpuset:
       
           irq_affinity_include	# IRQs to direct to CPUs in this cpuset
           irq_affinity_exclude	# IRQs -not- to direct to these CPUs
       
       where irq_affinity_exclude overrides irq_affinity_include.

could meet.

It makes sense to me to deal with your "default_smp_affinity" patch
first, and then come back around and see what remains to be done, and
how to do it, perhaps with additional cpuset based mechanisms such as
the above two irq_affinity_* IRQ masks.

> I'm in the process of making a patch for exposing default affinity mask.

Peter, et al: how does Max's planned "default_smp_affinity" patch sound
to you, as the next step we take on this?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities
  2008-05-21  4:45                                           ` Arjan van de Ven
@ 2008-05-21 16:18                                             ` Max Krasnyanskiy
  0 siblings, 0 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-05-21 16:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Paul Jackson, Peter Zijlstra, menage, mingo, linux-kernel

Arjan van de Ven wrote:
> On Tue, 20 May 2008 18:14:58 -0700
> Max Krasnyanskiy <maxk@qualcomm.co
> 
>> For the IRQs specifically all I need is to be able to tell the kernel
>> to not route IRQs to certain CPUs. That mostly works already via
>> /proc/irq/N/smp_affinity, the problem is dynamically allocated irqs
>> because /proc/irq/N directory does not exist until those IRQs are
>> allocated/enabled.
> \\
> 
> why don't you tell irqbalance instead? it'll make sure the irq stays
> out of the wind...
> 

That will be too late. By the time irqbalance sees that IRQ it may have 
already fired (possibly several times) on the "wrong" processor.

Max


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: IRQ affinities
  2008-05-21  6:34                                           ` Paul Jackson
@ 2008-05-21 17:58                                             ` Max Krasnyanskiy
  0 siblings, 0 replies; 42+ messages in thread
From: Max Krasnyanskiy @ 2008-05-21 17:58 UTC (permalink / raw)
  To: Paul Jackson; +Cc: a.p.zijlstra, menage, mingo, linux-kernel

Hi Paul,

Paul Jackson wrote:
> Max wrote:
>> What I realized now is that all I need is
>> /proc/irq/default_smp_affinity.
> 
> I suspect that something like you're proposing to do here will answer
> your needs, to "tell the kernel to not route IRQs to certain CPUs."
> 
> I suspect that other folks will have some additional needs, that perhaps
> my idea of May 9, 2008:
> 
>        How about this.  We add two files to each cpuset:
>        
>            irq_affinity_include	# IRQs to direct to CPUs in this cpuset
>            irq_affinity_exclude	# IRQs -not- to direct to these CPUs
>        
>        where irq_affinity_exclude overrides irq_affinity_include.
> 
> could meet.

I saw your earlier email with that proposal. Just had to digest it a bit :) 
(still catching up with things after vacation).

> So, to determine to which CPUs a given interrupt (IRQ) can be directed:
>  1) Combine (union) the 'cpus' of all the cpusets for which
>     that IRQ is in that cpuset's irq_affinity_include, then
>  2) Remove (set subtraction) the 'cpus' of any cpuset for which
>     that IRQ is in that cpuset's irq_affinity_exclude.

That would work. But wouldn't it be hard for users to debug things? I
mean, if you have a complex cpuset hierarchy it may be hard to figure out why a
certain irq is not getting to cpuX and vice versa.
Btw, how would we represent "all irqs"? Are you implying that those files
contain masks?
We'll also need to handle conflicts like "irq excluded from all cpusets", etc.
I still prefer the "irq as a task" approach. It's a very simple and
straightforward mapping of an irq -> cpuset, no conflicts, etc. It is easy for
the user to figure out where an irq will end up.

Btw, I did not quite get the idea behind the "exclude" part. Why is "include"
not enough? Can you give me an example?
> It makes sense to me to deal with your "default_smp_affinity" patch
> first, and then come back around and see what remains to be done, and
> how to do it, perhaps with additional cpuset based mechanisms such as
> the above two irq_affinity_* IRQ masks.
> 
>> I'm in the process of making a patch for exposing default affinity mask.
> 
> Peter, et al: how does Max's planned "default_smp_affinity" patch sound
> to you, as the next step we take on this?

I think it makes sense regardless of the cpuset based approach. Seems like a 
logical extension of the existing interface (ie per IRQ mask plus the default).

Max



^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2008-05-21 17:58 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-12  1:23 boot cgroup questions Max Krasnyansky
2008-03-12  1:27 ` Paul Menage
2008-03-12  2:34   ` Max Krasnyansky
2008-03-12  2:36     ` Paul Menage
2008-03-12  2:53       ` Max Krasnyansky
2008-03-12  3:09         ` Paul Menage
2008-03-12  3:39           ` Max Krasnyansky
2008-03-12  4:59           ` Paul Jackson
2008-03-12 18:24             ` Max Krasnyanskiy
2008-03-12 18:57               ` Paul Jackson
2008-03-12 19:11                 ` Max Krasnyanskiy
2008-03-12 19:32                   ` Paul Jackson
2008-03-12 20:08                     ` Max Krasnyanskiy
2008-03-12 20:37                       ` Paul Jackson
2008-03-12 22:29                         ` Max Krasnyanskiy
2008-03-12 23:30                           ` Paul Jackson
2008-03-13  0:57                             ` Max Krasnyanskiy
2008-03-13  7:03                               ` Paul Jackson
2008-04-10 18:03                                 ` Max Krasnyanskiy
2008-04-14 18:39                                   ` Paul Jackson
2008-05-09 10:45                                     ` Peter Zijlstra
2008-05-09 11:17                                       ` IRQ affinities (was: boot cgroup questions) Paul Jackson
2008-05-09 11:48                                         ` Peter Zijlstra
2008-05-09 12:03                                           ` Paul Jackson
2008-05-09 12:14                                             ` Peter Zijlstra
2008-05-09 12:36                                               ` Paul Jackson
2008-05-09 17:43                                                 ` Paul Jackson
2008-05-21  1:21                                                 ` IRQ affinities Max Krasnyanskiy
2008-05-21  1:14                                         ` Max Krasnyanskiy
2008-05-21  4:45                                           ` Arjan van de Ven
2008-05-21 16:18                                             ` Max Krasnyanskiy
2008-05-21  6:34                                           ` Paul Jackson
2008-05-21 17:58                                             ` Max Krasnyanskiy
2008-04-14 18:42                                   ` boot cgroup questions Paul Jackson
2008-03-13  7:12                               ` Paul Jackson
2008-04-10 17:24                                 ` Max Krasnyanskiy
2008-04-10 17:37                                   ` Paul Jackson
2008-03-12 23:32                           ` Paul Jackson
2008-03-13  0:46                             ` Max Krasnyanskiy
2008-03-12 19:16             ` Paul Menage
2008-03-12 19:24               ` Paul Jackson
2008-03-12 19:30                 ` Max Krasnyanskiy
