LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
@ 2018-04-19 19:08 Alexey Dobriyan
  2018-04-19 19:28 ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-19 19:08 UTC (permalink / raw)
  To: longman; +Cc: linux-kernel, rdunlap

> Therefore, application performance can be impacted if the application
> reads /proc/stat rather frequently.

[nods]
Text interfaces can be designed in a very stupid way.

> For example, reading /proc/stat in a certain 2-socket Skylake server
> took about 4.6ms because it had over 5k irqs.

Is this top(1)? What is this application doing?
If it needs percpu usage stats, then maybe /proc/stat should be
converted away from single_open() so that core seq_file code doesn't
generate everything at once.

> -
> -	/* sum again ? it could be updated? */
> -	for_each_irq_nr(j)
> -		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> -

This is direct userspace breakage.

Proper fix is to start strategic switch away from /proc.
It is a fun toy but its time has come.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 19:08 [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs Alexey Dobriyan
@ 2018-04-19 19:28 ` Waiman Long
  2018-04-19 19:55   ` Alexey Dobriyan
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2018-04-19 19:28 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-kernel, rdunlap

On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
>> Therefore, application performance can be impacted if the application
>> reads /proc/stat rather frequently.
> [nods]
> Text interfaces can be designed in a very stupid way.
>
>> For example, reading /proc/stat in a certain 2-socket Skylake server
>> took about 4.6ms because it had over 5k irqs.
> Is this top(1)? What is this application doing?
> If it needs percpu usage stats, then maybe /proc/stat should be
> converted away from single_open() so that core seq_file code doesn't
> generate everything at once.

The application is actually a database benchmarking tool used by a
customer. The reading of /proc/stat is an artifact of the benchmarking
tool that can actually be turned off. Without doing that, about 20% of
CPU time were spent reading /proc/stat and the trashing of cachelines
slowed the benchmark number quite significantly. However, I was also
told that there are legitimate cases where reading /proc/stat was
necessary in some of their applications.

>> -
>> -	/* sum again ? it could be updated? */
>> -	for_each_irq_nr(j)
>> -		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
>> -
> This is direct userspace breakage.

Yes, I am aware of that. That is the cost of improving the performance
of applications that read /proc/stat, but don't need the individual irq
counts.

>
> Proper fix is to start strategic switch away from /proc.
> It is a fun toy but its time has come.

Migration from procfs is easier said then done. Many existing customers
are reluctant to do that.

-Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 19:28 ` Waiman Long
@ 2018-04-19 19:55   ` Alexey Dobriyan
       [not found]     ` <eb7c1569-e445-5cbb-6d10-2694b625232a@redhat.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-19 19:55 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-kernel, rdunlap

On Thu, Apr 19, 2018 at 03:28:40PM -0400, Waiman Long wrote:
> On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
> >> Therefore, application performance can be impacted if the application
> >> reads /proc/stat rather frequently.
> > [nods]
> > Text interfaces can be designed in a very stupid way.
> >
> >> For example, reading /proc/stat in a certain 2-socket Skylake server
> >> took about 4.6ms because it had over 5k irqs.
> > Is this top(1)? What is this application doing?
> > If it needs percpu usage stats, then maybe /proc/stat should be
> > converted away from single_open() so that core seq_file code doesn't
> > generate everything at once.
> 
> The application is actually a database benchmarking tool used by a
> customer.

So it probably needs lines before "intr" line.

> The reading of /proc/stat is an artifact of the benchmarking
> tool that can actually be turned off. Without doing that, about 20% of
> CPU time were spent reading /proc/stat and the trashing of cachelines
> slowed the benchmark number quite significantly. However, I was also
> told that there are legitimate cases where reading /proc/stat was
> necessary in some of their applications.
> 
> >> -
> >> -	/* sum again ? it could be updated? */
> >> -	for_each_irq_nr(j)
> >> -		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> >> -
> > This is direct userspace breakage.
> 
> Yes, I am aware of that. That is the cost of improving the performance
> of applications that read /proc/stat, but don't need the individual irq
> counts.

Yeah, but all it takes is one script which cares.

I have an idea.

Maintain "maximum registered irq #", it should be much smaller than
"nr_irqs":

intr 4245359 151 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 44330 182364 57741 0 0 0 0 0 0 0 0 85 89124 0 0 0 0 0 323360 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Or maintain array of registered irqs and iterate over them only.

I have another idea.

perf record shows mutex_lock/mutex_unlock at the top.
Most of them are irq mutex not seqfile mutex as there are many more
interrupts than reads. Take it once.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
       [not found]     ` <eb7c1569-e445-5cbb-6d10-2694b625232a@redhat.com>
@ 2018-04-19 20:39       ` Alexey Dobriyan
  2018-04-19 20:58         ` Waiman Long
  2018-04-19 21:05         ` Waiman Long
  0 siblings, 2 replies; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-19 20:39 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-kernel, rdunlap, akpm

On Thu, Apr 19, 2018 at 04:21:14PM -0400, Waiman Long wrote:
> On 04/19/2018 03:55 PM, Alexey Dobriyan wrote:
> > On Thu, Apr 19, 2018 at 03:28:40PM -0400, Waiman Long wrote:
> >> On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
> >>>> Therefore, application performance can be impacted if the application
> >>>> reads /proc/stat rather frequently.
> >>> [nods]
> >>> Text interfaces can be designed in a very stupid way.
> >>>
> >>>> For example, reading /proc/stat in a certain 2-socket Skylake server
> >>>> took about 4.6ms because it had over 5k irqs.
> >>> Is this top(1)? What is this application doing?
> >>> If it needs percpu usage stats, then maybe /proc/stat should be
> >>> converted away from single_open() so that core seq_file code doesn't
> >>> generate everything at once.
> >> The application is actually a database benchmarking tool used by a
> >> customer.
> > So it probably needs lines before "intr" line.
> >
> >> The reading of /proc/stat is an artifact of the benchmarking
> >> tool that can actually be turned off. Without doing that, about 20% of
> >> CPU time were spent reading /proc/stat and the trashing of cachelines
> >> slowed the benchmark number quite significantly. However, I was also
> >> told that there are legitimate cases where reading /proc/stat was
> >> necessary in some of their applications.
> >>
> >>>> -
> >>>> -	/* sum again ? it could be updated? */
> >>>> -	for_each_irq_nr(j)
> >>>> -		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> >>>> -
> >>> This is direct userspace breakage.
> >> Yes, I am aware of that. That is the cost of improving the performance
> >> of applications that read /proc/stat, but don't need the individual irq
> >> counts.
> > Yeah, but all it takes is one script which cares.
> >
> > I have an idea.
> >
> > Maintain "maximum registered irq #", it should be much smaller than
> > "nr_irqs":
> >
> > intr 4245359 151 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 44330 182364 57741 0 0 0 0 0 0 0 0 85 89124 0 0 0 0 0 323360 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 
> Yes, that can probably help.
> 
> This is the data from the problematic skylake server:
> 
> model name    : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
> 56 sosreport-carevalo.02076935-20180413085327/proc/stat
> Interrupts: 5370
> Interrupts without "0" entries: 1011
> 
> There are still quite a large number of non-zero entries, though.
> 
> > Or maintain array of registered irqs and iterate over them only.
> 
> Right, we can allocate a bitmap of used irqs to do that.
> 
> >
> > I have another idea.
> >
> > perf record shows mutex_lock/mutex_unlock at the top.
> > Most of them are irq mutex not seqfile mutex as there are many more
> > interrupts than reads. Take it once.
> >
> How many cpus are in your test system? In that skylake server, it was
> the per-cpu summing operation of the irq counts that was consuming most
> of the time for reading /proc/stat. I think we can certainly try to
> optimize the lock taking.

It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
Given that irq registering is rare operation, maintaining sorted array
of irq should be the best option.

> For the time being, I think I am going to have a clone /proc/stat2 as
> suggested in my earlier email. Alternatively, I can put that somewhere
> in sysfs if you have a good idea of where I can put it.

sysfs is strictly one-value-per-file.

> I will also look into ways to optimize the current per-IRQ stats
> handling, but it will come later.

There is always a time-honored way of ioctl(2) switching irq info off
/proc supports that.

There are many options.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 20:39       ` Alexey Dobriyan
@ 2018-04-19 20:58         ` Waiman Long
  2018-04-19 23:23           ` Joel Fernandes (Google)
  2018-04-19 21:05         ` Waiman Long
  1 sibling, 1 reply; 18+ messages in thread
From: Waiman Long @ 2018-04-19 20:58 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-kernel, rdunlap, akpm

On 04/19/2018 04:39 PM, Alexey Dobriyan wrote:
> On Thu, Apr 19, 2018 at 04:21:14PM -0400, Waiman Long wrote:
>> On 04/19/2018 03:55 PM, Alexey Dobriyan wrote:
>>> On Thu, Apr 19, 2018 at 03:28:40PM -0400, Waiman Long wrote:
>>>> On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
>>>>>> Therefore, application performance can be impacted if the application
>>>>>> reads /proc/stat rather frequently.
>>>>> [nods]
>>>>> Text interfaces can be designed in a very stupid way.
>>>>>
>>>>>> For example, reading /proc/stat in a certain 2-socket Skylake server
>>>>>> took about 4.6ms because it had over 5k irqs.
>>>>> Is this top(1)? What is this application doing?
>>>>> If it needs percpu usage stats, then maybe /proc/stat should be
>>>>> converted away from single_open() so that core seq_file code doesn't
>>>>> generate everything at once.
>>>> The application is actually a database benchmarking tool used by a
>>>> customer.
>>> So it probably needs lines before "intr" line.
>>>
>>>> The reading of /proc/stat is an artifact of the benchmarking
>>>> tool that can actually be turned off. Without doing that, about 20% of
>>>> CPU time were spent reading /proc/stat and the trashing of cachelines
>>>> slowed the benchmark number quite significantly. However, I was also
>>>> told that there are legitimate cases where reading /proc/stat was
>>>> necessary in some of their applications.
>>>>
>>>>>> -
>>>>>> -	/* sum again ? it could be updated? */
>>>>>> -	for_each_irq_nr(j)
>>>>>> -		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
>>>>>> -
>>>>> This is direct userspace breakage.
>>>> Yes, I am aware of that. That is the cost of improving the performance
>>>> of applications that read /proc/stat, but don't need the individual irq
>>>> counts.
>>> Yeah, but all it takes is one script which cares.
>>>
>>> I have an idea.
>>>
>>> Maintain "maximum registered irq #", it should be much smaller than
>>> "nr_irqs":
>>>
>>> intr 4245359 151 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 44330 182364 57741 0 0 0 0 0 0 0 0 85 89124 0 0 0 0 0 323360 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> Yes, that can probably help.
>>
>> This is the data from the problematic skylake server:
>>
>> model name    : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
>> 56 sosreport-carevalo.02076935-20180413085327/proc/stat
>> Interrupts: 5370
>> Interrupts without "0" entries: 1011
>>
>> There are still quite a large number of non-zero entries, though.
>>
>>> Or maintain array of registered irqs and iterate over them only.
>> Right, we can allocate a bitmap of used irqs to do that.
>>
>>> I have another idea.
>>>
>>> perf record shows mutex_lock/mutex_unlock at the top.
>>> Most of them are irq mutex not seqfile mutex as there are many more
>>> interrupts than reads. Take it once.
>>>
>> How many cpus are in your test system? In that skylake server, it was
>> the per-cpu summing operation of the irq counts that was consuming most
>> of the time for reading /proc/stat. I think we can certainly try to
>> optimize the lock taking.
> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
> Given that irq registering is rare operation, maintaining sorted array
> of irq should be the best option.
>> For the time being, I think I am going to have a clone /proc/stat2 as
>> suggested in my earlier email. Alternatively, I can put that somewhere
>> in sysfs if you have a good idea of where I can put it.
> sysfs is strictly one-value-per-file.
>
>> I will also look into ways to optimize the current per-IRQ stats
>> handling, but it will come later.
> There is always a time-honored way of ioctl(2) switching irq info off
> /proc supports that.
>
> There are many options.

OK, it is good to know. Do you have any existing code snippet in the
kernel that I can use as reference on how to use ioctl(2) switching?

I will look into how to optimize the existing per-IRQ stats code first
before venturing into cloning /proc/stat.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 20:39       ` Alexey Dobriyan
  2018-04-19 20:58         ` Waiman Long
@ 2018-04-19 21:05         ` Waiman Long
  1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2018-04-19 21:05 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-kernel, rdunlap, akpm

On 04/19/2018 04:39 PM, Alexey Dobriyan wrote:
>>
>> Yes, that can probably help.
>>
>> This is the data from the problematic skylake server:
>>
>> model name    : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
>> 56 sosreport-carevalo.02076935-20180413085327/proc/stat
>> Interrupts: 5370
>> Interrupts without "0" entries: 1011
>>
>> There are still quite a large number of non-zero entries, though.
>>
>>> Or maintain array of registered irqs and iterate over them only.
>> Right, we can allocate a bitmap of used irqs to do that.
>>
>>> I have another idea.
>>>
>>> perf record shows mutex_lock/mutex_unlock at the top.
>>> Most of them are irq mutex not seqfile mutex as there are many more
>>> interrupts than reads. Take it once.
>>>
>> How many cpus are in your test system? In that skylake server, it was
>> the per-cpu summing operation of the irq counts that was consuming most
>> of the time for reading /proc/stat. I think we can certainly try to
>> optimize the lock taking.
> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
> Given that irq registering is rare operation, maintaining sorted array
> of irq should be the best option.
>

BTW, the skylake server is 2-socket 24-core 48-thread.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 20:58         ` Waiman Long
@ 2018-04-19 23:23           ` Joel Fernandes (Google)
  2018-04-21 20:34             ` Alexey Dobriyan
  0 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes (Google) @ 2018-04-19 23:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexey Dobriyan, Linux Kernel Mailing List, rdunlap, Andrew Morton

Hi,

>>>> Or maintain array of registered irqs and iterate over them only.
>>> Right, we can allocate a bitmap of used irqs to do that.
>>>
>>>> I have another idea.
>>>>
>>>> perf record shows mutex_lock/mutex_unlock at the top.
>>>> Most of them are irq mutex not seqfile mutex as there are many more
>>>> interrupts than reads. Take it once.
>>>>
>>> How many cpus are in your test system? In that skylake server, it was
>>> the per-cpu summing operation of the irq counts that was consuming most
>>> of the time for reading /proc/stat. I think we can certainly try to
>>> optimize the lock taking.
>> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
>> Given that irq registering is rare operation, maintaining sorted array
>> of irq should be the best option.
>>> For the time being, I think I am going to have a clone /proc/stat2 as
>>> suggested in my earlier email. Alternatively, I can put that somewhere
>>> in sysfs if you have a good idea of where I can put it.
>> sysfs is strictly one-value-per-file.
>>
>>> I will also look into ways to optimize the current per-IRQ stats
>>> handling, but it will come later.
>> There is always a time-honored way of ioctl(2) switching irq info off
>> /proc supports that.
>>
>> There are many options.
>
> OK, it is good to know. Do you have any existing code snippet in the
> kernel that I can use as reference on how to use ioctl(2) switching?
>
> I will look into how to optimize the existing per-IRQ stats code first
> before venturing into cloning /proc/stat.

Can we not just remove per-IRQ stats from /proc/stat (since I gather
from this discussion it isn't scalable), and just have applications
that need per-IRQ stats use /proc/interrupts ?

thanks,

- Joel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 23:23           ` Joel Fernandes (Google)
@ 2018-04-21 20:34             ` Alexey Dobriyan
  2018-04-21 20:36               ` Alexey Dobriyan
  0 siblings, 1 reply; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-21 20:34 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Waiman Long, Linux Kernel Mailing List, rdunlap, Andrew Morton

On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> Can we not just remove per-IRQ stats from /proc/stat (since I gather
> from this discussion it isn't scalable), and just have applications
> that need per-IRQ stats use /proc/interrupts ?

If you can prove noone is using them in /proc/stat...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-21 20:34             ` Alexey Dobriyan
@ 2018-04-21 20:36               ` Alexey Dobriyan
  2018-04-24  5:54                 ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-21 20:36 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Waiman Long, Linux Kernel Mailing List, rdunlap, Andrew Morton

On Sat, Apr 21, 2018 at 11:34:22PM +0300, Alexey Dobriyan wrote:
> On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > from this discussion it isn't scalable), and just have applications
> > that need per-IRQ stats use /proc/interrupts ?
> 
> If you can prove noone is using them in /proc/stat...

And you can't even stick WARN into /proc/stat to find out.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-21 20:36               ` Alexey Dobriyan
@ 2018-04-24  5:54                 ` David Rientjes
  2018-04-24  6:18                   ` Alexey Dobriyan
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2018-04-24  5:54 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Joel Fernandes (Google),
	Waiman Long, Linux Kernel Mailing List, rdunlap, Andrew Morton

On Sat, 21 Apr 2018, Alexey Dobriyan wrote:

> > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > from this discussion it isn't scalable), and just have applications
> > > that need per-IRQ stats use /proc/interrupts ?
> > 
> > If you can prove noone is using them in /proc/stat...
> 
> And you can't even stick WARN into /proc/stat to find out.
> 

FWIW, removing per irq counts from /proc/stat would break some of our 
scripts.  We could adapt to that, but everybody else would have to as 
well, so I'm afraid it's not going to be possible.

It would probably be better to extract out the stats that you're actually 
interested in to a new file.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-24  5:54                 ` David Rientjes
@ 2018-04-24  6:18                   ` Alexey Dobriyan
  2018-05-02  0:02                     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-24  6:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joel Fernandes (Google),
	Waiman Long, Linux Kernel Mailing List, rdunlap, Andrew Morton

On Mon, Apr 23, 2018 at 10:54:18PM -0700, David Rientjes wrote:
> On Sat, 21 Apr 2018, Alexey Dobriyan wrote:
> 
> > > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > > from this discussion it isn't scalable), and just have applications
> > > > that need per-IRQ stats use /proc/interrupts ?
> > > 
> > > If you can prove noone is using them in /proc/stat...
> > 
> > And you can't even stick WARN into /proc/stat to find out.
> > 
> 
> FWIW, removing per irq counts from /proc/stat would break some of our
> scripts.  We could adapt to that, but everybody else would have to as
> well, so I'm afraid it's not going to be possible.

Excellent!

> It would probably be better to extract out the stats that you're actually
> interested in to a new file.

This is the worst scenario. Individual IRQ stats are going to live in 2 places.
And /proc/stat still would be slow.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-24  6:18                   ` Alexey Dobriyan
@ 2018-05-02  0:02                     ` Andrew Morton
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2018-05-02  0:02 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: David Rientjes, Joel Fernandes (Google),
	Waiman Long, Linux Kernel Mailing List, rdunlap

On Tue, 24 Apr 2018 09:18:59 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:

> On Mon, Apr 23, 2018 at 10:54:18PM -0700, David Rientjes wrote:
> > On Sat, 21 Apr 2018, Alexey Dobriyan wrote:
> > 
> > > > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > > > from this discussion it isn't scalable), and just have applications
> > > > > that need per-IRQ stats use /proc/interrupts ?
> > > > 
> > > > If you can prove noone is using them in /proc/stat...
> > > 
> > > And you can't even stick WARN into /proc/stat to find out.
> > > 
> > 
> > FWIW, removing per irq counts from /proc/stat would break some of our
> > scripts.  We could adapt to that, but everybody else would have to as
> > well, so I'm afraid it's not going to be possible.
> 
> Excellent!
> 
> > It would probably be better to extract out the stats that you're actually
> > interested in to a new file.
> 
> This is the worst scenario. Individual IRQ stats are going to live in 2 places.
> And /proc/stat still would be slow.

No, a new /proc/stat2 which omits the `intr' line would be fast(er). 
Although if we're going to do such a thing we might choose to
reorganize and prune /proc/stat2 even further, and actually put some
thought into it - the current one is a bit of a dog's breakfast.

Dumb question(s):

a) if we moved the `intr' line to the very end of /proc/stat, would
   anything break?  Maybe.

b) if an application were then to read stuff from /proc/stat and
   were to stop reading before it read the `intr' line, would
   its read from /proc/stat still be slow?

c) if the answer to b) is "yes" then can we change that?  Only go
   off and prepare the `intr' line if the application is really truly
   reading it in?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 19:43 ` Andrew Morton
  2018-04-19 19:57   ` Waiman Long
@ 2018-04-19 20:02   ` Alexey Dobriyan
  1 sibling, 0 replies; 18+ messages in thread
From: Alexey Dobriyan @ 2018-04-19 20:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Waiman Long, Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman,
	linux-kernel

On Thu, Apr 19, 2018 at 12:43:19PM -0700, Andrew Morton wrote:
> On Thu, 19 Apr 2018 13:09:29 -0400 Waiman Long <longman@redhat.com> wrote:
> 
> > It was found that reading /proc/stat could be time consuming on
> > systems with a lot of irqs. For example, reading /proc/stat in a
> > certain 2-socket Skylake server took about 4.6ms because it had over
> > 5k irqs. In that particular case, the majority of the CPU cycles for
> > reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
> > application performance can be impacted if the application reads
> > /proc/stat rather frequently.
> > 
> > The "intr" line within /proc/stat contains a sum total of all the irqs
> > that have happened followed by a list of irq counts for each individual
> > irq number. In many cases, the first number is good enough. The
> > individual irq counts may not provide that much more information.
> > 
> > In order to avoid this kind of performance issue, all these individual
> > irq counts are now separated into a new /proc/stat_irqs file. The
> > sum total irq count will stay in /proc/stat and be duplicated in
> > /proc/stat_irqs. Applications that need to look up individual irq counts
> > will now have to look into /proc/stat_irqs instead of /proc/stat.
> > 
> 
> (cc /proc maintainer)
> 
> It's a non-backward-compatible change.  For something which has
> existing for so long, it would be a mighty task to demonstrate that no
> existing userspace will be disrupted by this change.
> 
> So we need to think again.  A new interface which omits the per-IRQ
> stats might be acceptable.

Here is profile of open+read+close /proc/stat

30% is taking mutex only to print "0".

+   98.80%     0.04%  a.out    [kernel.vmlinux]  [k] entry_SYSCALL_64                                                                            ▒
+   98.75%     0.10%  a.out    [kernel.vmlinux]  [k] do_syscall_64                                                                               ▒
+   95.56%     0.04%  a.out    libc-2.25.so      [.] __GI___libc_read                                                                            ◆
+   95.09%     0.01%  a.out    [kernel.vmlinux]  [k] sys_read                                                                                    ▒
+   95.04%     0.03%  a.out    [kernel.vmlinux]  [k] vfs_read                                                                                    ▒
+   94.98%     0.05%  a.out    [kernel.vmlinux]  [k] proc_reg_read                                                                               ▒
+   94.98%     0.00%  a.out    [kernel.vmlinux]  [k] __vfs_read                                                                                  ▒
+   94.92%     0.06%  a.out    [kernel.vmlinux]  [k] seq_read                                                                                    ▒
+   94.52%     3.65%  a.out    [kernel.vmlinux]  [k] show_stat                                                                                   ▒
+   48.62%     2.59%  a.out    [kernel.vmlinux]  [k] kstat_irqs_usr                                                                              ▒
+   33.52%     9.55%  a.out    [kernel.vmlinux]  [k] seq_put_decimal_ull                                                                         ▒
+   19.63%    19.59%  a.out    [kernel.vmlinux]  [k] memcpy_erms                                                                                 ▒
+   17.34%     9.53%  a.out    [kernel.vmlinux]  [k] kstat_irqs                                                                                  ▒
-   15.45%    15.43%  a.out    [kernel.vmlinux]  [k] mutex_lock                                                                                  ▒
     15.43% __GI___libc_read                                                                                                                     ▒
        entry_SYSCALL_64                                                                                                                         ▒
        do_syscall_64                                                                                                                            ▒
        sys_read                                                                                                                                 ▒
        vfs_read                                                                                                                                 ▒
        __vfs_read                                                                                                                               ▒
        proc_reg_read                                                                                                                            ▒
      - seq_read                                                                                                                                 ▒
         - 15.41% show_stat                                                                                                                      ▒
              kstat_irqs_usr                                                                                                                     ▒
              mutex_lock                                                                                                                         ▒
+   13.32%    13.27%  a.out    [kernel.vmlinux]  [k] mutex_unlock                                                                                ▒
+    4.60%     1.35%  a.out    [kernel.vmlinux]  [k] cpumask_next                                                                                ▒
+    3.03%     3.03%  a.out    [kernel.vmlinux]  [k] __radix_tree_lookup                                                                         ▒
+    2.96%     0.08%  a.out    [kernel.vmlinux]  [k] seq_printf                                                                                  ▒
+    2.92%     0.02%  a.out    libc-2.25.so      [.] __GI___libc_open                                                                            ▒
+    2.89%     0.07%  a.out    [kernel.vmlinux]  [k] seq_vprintf                                                                                 ▒
+    2.81%     0.70%  a.out    [kernel.vmlinux]  [k] vsnprintf                                                                                   ▒
+    2.66%     2.66%  a.out    [kernel.vmlinux]  [k] _find_next_bit                                                                              ▒
+    2.42%     1.36%  a.out    [kernel.vmlinux]  [k] num_to_str                                                                                  ▒
+    2.41%     0.19%  a.out    [kernel.vmlinux]  [k] get_idle_time                                                                               ▒
+    2.39%     0.02%  a.out    [kernel.vmlinux]  [k] do_sys_open

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 19:43 ` Andrew Morton
@ 2018-04-19 19:57   ` Waiman Long
  2018-04-19 20:02   ` Alexey Dobriyan
  1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2018-04-19 19:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman, linux-kernel,
	Alexey Dobriyan

On 04/19/2018 03:43 PM, Andrew Morton wrote:
> On Thu, 19 Apr 2018 13:09:29 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> It was found that reading /proc/stat could be time consuming on
>> systems with a lot of irqs. For example, reading /proc/stat in a
>> certain 2-socket Skylake server took about 4.6ms because it had over
>> 5k irqs. In that particular case, the majority of the CPU cycles for
>> reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
>> application performance can be impacted if the application reads
>> /proc/stat rather frequently.
>>
>> The "intr" line within /proc/stat contains a sum total of all the irqs
>> that have happened followed by a list of irq counts for each individual
>> irq number. In many cases, the first number is good enough. The
>> individual irq counts may not provide that much more information.
>>
>> In order to avoid this kind of performance issue, all these individual
>> irq counts are now separated into a new /proc/stat_irqs file. The
>> sum total irq count will stay in /proc/stat and be duplicated in
>> /proc/stat_irqs. Applications that need to look up individual irq counts
>> will now have to look into /proc/stat_irqs instead of /proc/stat.
>>
> (cc /proc maintainer)
>
> It's a non-backward-compatible change.  For something which has
> existing for so long, it would be a mighty task to demonstrate that no
> existing userspace will be disrupted by this change.
>
> So we need to think again.  A new interface which omits the per-IRQ
> stats might be acceptable.

OK, how about a new /proc/stat2 file that is the same as /proc/stat
except that it omits per-IRQ stats. BTW, do you have any suggestion for
a better name?

> Or, conceivably, a new /proc knob which disables the per-IRQ stats in
> /proc/stat.  That would allow operators to opt in to this disabling and
> would avoid the need to alter
> whatever-application-it-is-that-is-having-trouble.  This seems a bit ugly
> though.
>
> Also, the changelog is rather vague.  "application performance can be
> impacted".  Well, *are* applications impacted?  What is the real-world
> performance gain which this change provides, in a real-world workload?

Will clarify that in the next version.

-Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 17:09 Waiman Long
  2018-04-19 17:38 ` Randy Dunlap
@ 2018-04-19 19:43 ` Andrew Morton
  2018-04-19 19:57   ` Waiman Long
  2018-04-19 20:02   ` Alexey Dobriyan
  1 sibling, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2018-04-19 19:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman, linux-kernel,
	Alexey Dobriyan

On Thu, 19 Apr 2018 13:09:29 -0400 Waiman Long <longman@redhat.com> wrote:

> It was found that reading /proc/stat could be time consuming on
> systems with a lot of irqs. For example, reading /proc/stat in a
> certain 2-socket Skylake server took about 4.6ms because it had over
> 5k irqs. In that particular case, the majority of the CPU cycles for
> reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
> application performance can be impacted if the application reads
> /proc/stat rather frequently.
> 
> The "intr" line within /proc/stat contains a sum total of all the irqs
> that have happened followed by a list of irq counts for each individual
> irq number. In many cases, the first number is good enough. The
> individual irq counts may not provide that much more information.
> 
> In order to avoid this kind of performance issue, all these individual
> irq counts are now separated into a new /proc/stat_irqs file. The
> sum total irq count will stay in /proc/stat and be duplicated in
> /proc/stat_irqs. Applications that need to look up individual irq counts
> will now have to look into /proc/stat_irqs instead of /proc/stat.
> 

(cc /proc maintainer)

It's a non-backward-compatible change.  For something which has
existing for so long, it would be a mighty task to demonstrate that no
existing userspace will be disrupted by this change.

So we need to think again.  A new interface which omits the per-IRQ
stats might be acceptable.

Or, conceivably, a new /proc knob which disables the per-IRQ stats in
/proc/stat.  That would allow operators to opt in to this disabling and
would avoid the need to alter
whatever-application-it-is-that-is-having-trouble.  This seems a bit ugly
though.

Also, the changelog is rather vague.  "application performance can be
impacted".  Well, *are* applications impacted?  What is the real-world
performance gain which this change provides, in a real-world workload?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 17:38 ` Randy Dunlap
@ 2018-04-19 18:44   ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2018-04-19 18:44 UTC (permalink / raw)
  To: Randy Dunlap, Andrew Morton, Kate Stewart, Thomas Gleixner,
	Greg Kroah-Hartman
  Cc: linux-kernel

On 04/19/2018 01:38 PM, Randy Dunlap wrote:
> On 04/19/18 10:09, Waiman Long wrote:
>> It was found that reading /proc/stat could be time consuming on
>> systems with a lot of irqs. For example, reading /proc/stat in a
>> certain 2-socket Skylake server took about 4.6ms because it had over
>> 5k irqs. In that particular case, the majority of the CPU cycles for
>> reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
>> application performance can be impacted if the application reads
>> /proc/stat rather frequently.
>>
>> The "intr" line within /proc/stat contains a sum total of all the irqs
>> that have happened followed by a list of irq counts for each individual
>> irq number. In many cases, the first number is good enough. The
>> individual irq counts may not provide that much more information.
>>
>> In order to avoid this kind of performance issue, all these individual
>> irq counts are now separated into a new /proc/stat_irqs file. The
>> sum total irq count will stay in /proc/stat and be duplicated in
>> /proc/stat_irqs. Applications that need to look up individual irq counts
>> will now have to look into /proc/stat_irqs instead of /proc/stat.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 41 insertions(+), 7 deletions(-)
> Also please update Documentation/filesystems/proc.txt.
>
> thanks,

Right. I forgot to do that. Will send out v2 to fix that.

-Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
  2018-04-19 17:09 Waiman Long
@ 2018-04-19 17:38 ` Randy Dunlap
  2018-04-19 18:44   ` Waiman Long
  2018-04-19 19:43 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Randy Dunlap @ 2018-04-19 17:38 UTC (permalink / raw)
  To: Waiman Long, Andrew Morton, Kate Stewart, Thomas Gleixner,
	Greg Kroah-Hartman
  Cc: linux-kernel

On 04/19/18 10:09, Waiman Long wrote:
> It was found that reading /proc/stat could be time consuming on
> systems with a lot of irqs. For example, reading /proc/stat in a
> certain 2-socket Skylake server took about 4.6ms because it had over
> 5k irqs. In that particular case, the majority of the CPU cycles for
> reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
> application performance can be impacted if the application reads
> /proc/stat rather frequently.
> 
> The "intr" line within /proc/stat contains a sum total of all the irqs
> that have happened followed by a list of irq counts for each individual
> irq number. In many cases, the first number is good enough. The
> individual irq counts may not provide that much more information.
> 
> In order to avoid this kind of performance issue, all these individual
> irq counts are now separated into a new /proc/stat_irqs file. The
> sum total irq count will stay in /proc/stat and be duplicated in
> /proc/stat_irqs. Applications that need to look up individual irq counts
> will now have to look into /proc/stat_irqs instead of /proc/stat.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 41 insertions(+), 7 deletions(-)

Also please update Documentation/filesystems/proc.txt.

thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
@ 2018-04-19 17:09 Waiman Long
  2018-04-19 17:38 ` Randy Dunlap
  2018-04-19 19:43 ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Waiman Long @ 2018-04-19 17:09 UTC (permalink / raw)
  To: Andrew Morton, Kate Stewart, Thomas Gleixner, Greg Kroah-Hartman
  Cc: linux-kernel, Waiman Long

It was found that reading /proc/stat could be time consuming on
systems with a lot of irqs. For example, reading /proc/stat in a
certain 2-socket Skylake server took about 4.6ms because it had over
5k irqs. In that particular case, the majority of the CPU cycles for
reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
application performance can be impacted if the application reads
/proc/stat rather frequently.

The "intr" line within /proc/stat contains a sum total of all the irqs
that have happened followed by a list of irq counts for each individual
irq number. In many cases, the first number is good enough. The
individual irq counts may not provide that much more information.

In order to avoid this kind of performance issue, all these individual
irq counts are now separated into a new /proc/stat_irqs file. The
sum total irq count will stay in /proc/stat and be duplicated in
/proc/stat_irqs. Applications that need to look up individual irq counts
will now have to look into /proc/stat_irqs instead of /proc/stat.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 59749df..79e3c03 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -155,11 +155,6 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_putc(p, '\n');
 	}
 	seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
-
 	seq_printf(p,
 		"\nctxt %llu\n"
 		"btime %llu\n"
@@ -181,15 +176,46 @@ static int show_stat(struct seq_file *p, void *v)
 	return 0;
 }
 
+/*
+ * Showing individual irq counts can be expensive if there are a lot of
+ * irqs. So it is done in a separate procfs file to reduce performance
+ * overhead of reading other statistical counts.
+ */
+static int show_stat_irqs(struct seq_file *p, void *v)
+{
+	int i, j;
+	u64 sum = 0;
+
+	for_each_possible_cpu(i) {
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+	}
+	sum += arch_irq_stat();
+
+	seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
+
+	for_each_irq_nr(j)
+		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
+
+	seq_putc(p, '\n');
+
+	return 0;
+}
+
 static int stat_open(struct inode *inode, struct file *file)
 {
 	size_t size = 1024 + 128 * num_online_cpus();
 
-	/* minimum size to display an interrupt count : 2 bytes */
-	size += 2 * nr_irqs;
 	return single_open_size(file, show_stat, NULL, size);
 }
 
+static int stat_irqs_open(struct inode *inode, struct file *file)
+{
+	size_t size = 1024 + 16 * nr_irqs;
+
+	return single_open_size(file, show_stat_irqs, NULL, size);
+}
+
 static const struct file_operations proc_stat_operations = {
 	.open		= stat_open,
 	.read		= seq_read,
@@ -197,9 +223,17 @@ static int stat_open(struct inode *inode, struct file *file)
 	.release	= single_release,
 };
 
+static const struct file_operations proc_stat_irqs_operations = {
+	.open		= stat_irqs_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int __init proc_stat_init(void)
 {
 	proc_create("stat", 0, NULL, &proc_stat_operations);
+	proc_create("stat_irqs", 0, NULL, &proc_stat_irqs_operations);
 	return 0;
 }
 fs_initcall(proc_stat_init);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-05-02  0:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-19 19:08 [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs Alexey Dobriyan
2018-04-19 19:28 ` Waiman Long
2018-04-19 19:55   ` Alexey Dobriyan
     [not found]     ` <eb7c1569-e445-5cbb-6d10-2694b625232a@redhat.com>
2018-04-19 20:39       ` Alexey Dobriyan
2018-04-19 20:58         ` Waiman Long
2018-04-19 23:23           ` Joel Fernandes (Google)
2018-04-21 20:34             ` Alexey Dobriyan
2018-04-21 20:36               ` Alexey Dobriyan
2018-04-24  5:54                 ` David Rientjes
2018-04-24  6:18                   ` Alexey Dobriyan
2018-05-02  0:02                     ` Andrew Morton
2018-04-19 21:05         ` Waiman Long
  -- strict thread matches above, loose matches on Subject: below --
2018-04-19 17:09 Waiman Long
2018-04-19 17:38 ` Randy Dunlap
2018-04-19 18:44   ` Waiman Long
2018-04-19 19:43 ` Andrew Morton
2018-04-19 19:57   ` Waiman Long
2018-04-19 20:02   ` Alexey Dobriyan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).