LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Pavlos Parissis <pavlos.parissis@gmail.com>
To: Jan Kara <jack@suse.cz>, Guillaume Morin <guillaume@morinfr.org>
Cc: stable@vger.kernel.org, decui@microsoft.com, jack@suse.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	mszeredi@redhat.com
Subject: Re: kernel panics with 4.14.X versions
Date: Tue, 17 Apr 2018 01:31:24 +0200	[thread overview]
Message-ID: <9b11cfba-4bdc-8a3e-cd33-2f7e8d513bdf@gmail.com> (raw)
In-Reply-To: <20180416144041.t2mt7ugzwqr56ka3@quack2.suse.cz>


[-- Attachment #1.1: Type: text/plain, Size: 5046 bytes --]

On 16/04/2018 04:40 μμ, Jan Kara wrote:
> On Mon 16-04-18 15:25:50, Guillaume Morin wrote:
>> Fwiw, there have been already reports of similar soft lockups in
>> fsnotify() on 4.14: https://lkml.org/lkml/2018/3/2/1038
>>
>> We have also noticed similar softlockups with 4.14.22 here.
> 
> Yeah.
>  
>> On 16 Apr 13:54, Pavlos Parissis wrote:
>>>
>>> Hi all,
>>>

[..snip..]

>>> [373782.361064] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [kube-apiserver:24261]
>>> [373782.378225] Modules linked in: binfmt_misc sctp_diag sctp dccp_diag dccp tcp_diag udp_diag
>>> inet_diag unix_diag cfg80211 rfkill dell_rbu 8021q garp mrp xfs libcrc32c loop x86_pkg_temp_thermal
>>> intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
>>> pcbc aesni_intel vfat fat crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf iTCO_wdt ses
>>> iTCO_vendor_support mxm_wmi ipmi_si dcdbas enclosure mei_me pcspkr ipmi_devintf lpc_ich sg mei
>>> ipmi_msghandler mfd_core shpchp wmi acpi_power_meter netconsole nfsd auth_rpcgss nfs_acl lockd grace
>>> sunrpc ip_tables ext4 mbcache jbd2 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
>>> fb_sys_fops sd_mod ttm crc32c_intel ahci libahci mlx5_core drm mlxfw mpt3sas ptp libata raid_class
>>> pps_core scsi_transport_sas
>>> [373782.516807]  dm_mirror dm_region_hash dm_log dm_mod dax
>>> [373782.531739] CPU: 24 PID: 24261 Comm: kube-apiserver Not tainted 4.14.32-1.el7.x86_64 #1
>>> [373782.549848] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.4.3 01/17/2017
>>> [373782.567486] task: ffff882f66d28000 task.stack: ffffc9002120c000
>>> [373782.583441] RIP: 0010:fsnotify+0x197/0x510
>>> [373782.597319] RSP: 0018:ffffc9002120fdb8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
>>> [373782.615308] RAX: 0000000000000000 RBX: ffff882f9ec65c20 RCX: 0000000000000002
>>> [373782.632950] RDX: 0000000000028700 RSI: 0000000000000002 RDI: ffffffff8269a4e0
>>> [373782.650616] RBP: ffffc9002120fe98 R08: 0000000000000000 R09: 0000000000000000
>>> [373782.668287] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>> [373782.685918] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>> [373782.703302] FS:  000000c42009f090(0000) GS:ffff882fbf900000(0000) knlGS:0000000000000000
>>> [373782.721887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [373782.737741] CR2: 00007f82b6539244 CR3: 0000002f3de2a005 CR4: 00000000003606e0
>>> [373782.755247] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [373782.772722] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [373782.790043] Call Trace:
>>> [373782.802041]  vfs_write+0x151/0x1b0
>>> [373782.815081]  ? syscall_trace_enter+0x1cd/0x2b0
>>> [373782.829175]  SyS_write+0x55/0xc0
>>> [373782.841870]  do_syscall_64+0x79/0x1b0
>>> [373782.855073]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Can you please run RIP through ./scripts/faddr2line to see where exactly
> are we looping? I expect the loop iterating over marks to notify but better
> be sure.
> 

I am very newbie on this and I tried with:
 ../repo/Linux/linux/scripts/faddr2line ./vmlinuz-4.14.32-1.el7.x86_64
0010:fsnotify+0x197/0x510
readelf: Error: Not an ELF file - it has the wrong magic bytes at the start
size: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag
IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag
IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: no symbols
size: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag
IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: Warning: Ignoring section flag
IMAGE_SCN_MEM_NOT_PAGED in section .bss
nm: ./vmlinuz-4.14.32-1.el7.x86_64: no symbols
no match for 0010:fsnotify+0x197/0x510

Obviously, I am doing something very wrong.

> How easily can you hit this?

Very easily, I only need to wait 1-2 days for a crash to occur.

> Are you able to run debug kernels

Well, I was under the impression I do as I have:
  grep -E 'DEBUG_KERNEL|DEBUG_INFO' /boot/config-4.14.32-1.el7.x86_64
  CONFIG_DEBUG_INFO=y
  # CONFIG_DEBUG_INFO_REDUCED is not set
  # CONFIG_DEBUG_INFO_SPLIT is not set
  # CONFIG_DEBUG_INFO_DWARF4 is not set
  CONFIG_DEBUG_KERNEL=y

Do you think that my kernel doesn't produce a proper crash dump?
I have a production cluster where I can run any kernel we need, so if I need
to compile again with different settings I can certainly do that.

> / inspect
> crash dumps when the issue occurs?

I can't do that as the server isn't responsive and I can only power cycle it.

> Also testing with the latest mainline
> kernel (4.16) would be welcome whether this isn't just an issue with the
> backport of fsnotify fixes from Miklos.

I can try the kernel-ml-4.16.2 from elrepo (we use CentOS 7).

Thanks a lot for your reply.
Pavlos Parissis


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

  parent reply	other threads:[~2018-04-16 23:31 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20180416132550.d25jtdntdvpy55l3@bender.morinfr.org>
2018-04-16 14:40 ` Jan Kara
2018-04-16 16:06   ` Guillaume Morin
2018-04-16 21:10   ` Dexuan Cui
2018-04-17 10:33     ` Greg KH
2018-04-17 11:48       ` Amir Goldstein
2018-04-17 12:03         ` Jan Kara
2018-04-17 17:42       ` Dexuan Cui
2018-04-16 23:31   ` Pavlos Parissis [this message]
2018-04-17 11:18     ` Pavlos Parissis
2018-04-17 12:04       ` Jan Kara
2018-04-17 12:12     ` Jan Kara
2018-04-18  8:32       ` Pavlos Parissis
2018-04-19 20:23         ` Jan Kara
2018-04-19 21:16           ` Pavlos Parissis
2018-04-19 21:24             ` Robert Kolchmeyer
2018-04-19 21:37           ` Dexuan Cui
2018-04-20 10:21             ` Jan Kara
2018-04-20 17:43               ` Dexuan Cui
2018-04-16 11:54 Pavlos Parissis

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9b11cfba-4bdc-8a3e-cd33-2f7e8d513bdf@gmail.com \
    --to=pavlos.parissis@gmail.com \
    --cc=decui@microsoft.com \
    --cc=guillaume@morinfr.org \
    --cc=jack@suse.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=stable@vger.kernel.org \
    --subject='Re: kernel panics with 4.14.X versions' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).