Netdev Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value
@ 2020-09-08  2:09 Dust Li
  2020-09-09  2:49 ` David Miller
  0 siblings, 1 reply; 3+ messages in thread
From: Dust Li @ 2020-09-08  2:09 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller, Eric Dumazet, Satoru Moriya; +Cc: netdev

sysctl_mem is an point, and tracepoint entry do not support
been visited like an array. Use an __array(3) to store sysctl_mem
instead.

tracpoint output with and without this fix:
- without fix:
   28821.074 sock:sock_exceed_buf_limit:proto:UDP
   sysctl_mem=-1741233440,19,322156906942464 allocated=19 sysctl_rmem=4096
   rmem_alloc=75008 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0
   kind=SK_MEM_RECV

- with fix:
  2126.136 sock:sock_exceed_buf_limit:proto:UDP
  sysctl_mem=18,122845,184266 allocated=19 sysctl_rmem=4096
  rmem_alloc=73728 sysctl_wmem=4096 wmem_alloc=1 wmem_queued=0
  kind=SK_MEM_RECV

v2: use __array(3) instead of 3 long type to store sysctl_mem

Fixes: 3847ce32aea9fdf ("core: add tracepoints for queueing skb to rcvbuf")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
 include/trace/events/sock.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index a966d4b5ab37..914e58938480 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_STRUCT__entry(
 		__array(char, name, 32)
-		__field(long *, sysctl_mem)
+		__array(long, sysctl_mem, 3)
 		__field(long, allocated)
 		__field(int, sysctl_rmem)
 		__field(int, rmem_alloc)
@@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_fast_assign(
 		strncpy(__entry->name, prot->name, 32);
-		__entry->sysctl_mem = prot->sysctl_mem;
+		__entry->sysctl_mem[0] = prot->sysctl_mem[0];
+		__entry->sysctl_mem[1] = prot->sysctl_mem[1];
+		__entry->sysctl_mem[2] = prot->sysctl_mem[2];
 		__entry->allocated = allocated;
 		__entry->sysctl_rmem = sk_get_rmem0(sk, prot);
 		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value
  2020-09-08  2:09 [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value Dust Li
@ 2020-09-09  2:49 ` David Miller
  2020-09-10 14:34   ` dust.li
  0 siblings, 1 reply; 3+ messages in thread
From: David Miller @ 2020-09-09  2:49 UTC (permalink / raw)
  To: dust.li; +Cc: kuba, edumazet, satoru.moriya, netdev

From: Dust Li <dust.li@linux.alibaba.com>
Date: Tue,  8 Sep 2020 10:09:39 +0800

> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_STRUCT__entry(
>  		__array(char, name, 32)
> -		__field(long *, sysctl_mem)
> +		__array(long, sysctl_mem, 3)
>  		__field(long, allocated)
>  		__field(int, sysctl_rmem)
>  		__field(int, rmem_alloc)
> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_fast_assign(
>  		strncpy(__entry->name, prot->name, 32);
> -		__entry->sysctl_mem = prot->sysctl_mem;
> +		__entry->sysctl_mem[0] = prot->sysctl_mem[0];
> +		__entry->sysctl_mem[1] = prot->sysctl_mem[1];
> +		__entry->sysctl_mem[2] = prot->sysctl_mem[2];

I can't understand at all why the current code doesn't work.

We assign a pointer to entry->sysctl_mem and then print out the
three words pointed to by that.

It's so wasteful to copy this over every tracepoint entry so
the pointer approach is very desirable.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value
  2020-09-09  2:49 ` David Miller
@ 2020-09-10 14:34   ` dust.li
  0 siblings, 0 replies; 3+ messages in thread
From: dust.li @ 2020-09-10 14:34 UTC (permalink / raw)
  To: David Miller; +Cc: kuba, edumazet, satoru.moriya, netdev

On Tue, Sep 08, 2020 at 07:49:52PM -0700, David Miller wrote:
>From: Dust Li <dust.li@linux.alibaba.com>
>Date: Tue,  8 Sep 2020 10:09:39 +0800
>
>> @@ -98,7 +98,7 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>  
>>  	TP_STRUCT__entry(
>>  		__array(char, name, 32)
>> -		__field(long *, sysctl_mem)
>> +		__array(long, sysctl_mem, 3)
>>  		__field(long, allocated)
>>  		__field(int, sysctl_rmem)
>>  		__field(int, rmem_alloc)
>> @@ -110,7 +110,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>>  
>>  	TP_fast_assign(
>>  		strncpy(__entry->name, prot->name, 32);
>> -		__entry->sysctl_mem = prot->sysctl_mem;
>> +		__entry->sysctl_mem[0] = prot->sysctl_mem[0];
>> +		__entry->sysctl_mem[1] = prot->sysctl_mem[1];
>> +		__entry->sysctl_mem[2] = prot->sysctl_mem[2];
>
>I can't understand at all why the current code doesn't work.
>
>We assign a pointer to entry->sysctl_mem and then print out the
>three words pointed to by that.
>
>It's so wasteful to copy this over every tracepoint entry so
>the pointer approach is very desirable.

Thanks for your reply!

I took a close look at the code generated by tracepoint and
found the problem is not the tracepoint itself, but `perf trace`.

My previous output was got by running:
  `perf trace -e sock:sock_exceed_buf_limit`

This time, I tried directly read from /sys/kernel/debug/tracing/trace,
and everything is right :)

So I checked the code of perf tool, and found the foundamatal difference
is `perf trace` did the string formatting in the userspace, but raw ftrace
did it in the kernel.

When using `perf trace`, the kernel passes the string format and the
data to perf using the perf ringbuffer, and no one in the kernel will
try to visit the pointer sysctl_mem is pointed to, so the the userspace
perf got the original pointer of sysctl_mem and tries to do the formating,
which result in the wrong data in the commit log.

The key call trace when using `perf trace` is this:

trace_sock_exceed_buf_limit()
    --> perf_trace_sock_exceed_buf_limit()
        {
	        ...
                perf_fetch_caller_regs(__regs);
                {
                        strncpy(entry->name, prot->name, 32);
                        entry->sysctl_mem = prot->sysctl_mem;
                        entry->allocated = allocated;
                        entry->sysctl_rmem = sk_get_rmem0(sk, prot);
                        entry->rmem_alloc = atomic_read(&sk->sk_backlog.rmem_alloc);
                        entry->sysctl_wmem = sk_get_wmem0(sk, prot);
                        entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc);
                        entry->wmem_queued = sk->sk_wmem_queued;
                        entry->kind = kind;
                }
                perf_trace_run_bpf_submit(entry, __entry_size, rctx, event_call, \
					  __count, __regs, head, __task);
        }


Here *entry* is directly passed in to perf_trace_run_bpf_submit()
as raw data, and perf_trace_run_bpf_submit() won't do string formatting
but just pass them to the userspace perf, which will finnally did the
formatting, but it's already too late to get sysctl_mem[x].

So, any pointer dereference in tracepoint entry should failed in
`perf trace`.

Thanks.
Dust.Li

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-09-10 21:03 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-08  2:09 [PATCH v2] net: tracepoint: fix print wrong sysctl_mem value Dust Li
2020-09-09  2:49 ` David Miller
2020-09-10 14:34   ` dust.li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).