LKML Archive on lore.kernel.org
* Linux 5.2-rc5
@ 2019-06-16 19:06 Linus Torvalds
  2019-06-19 12:39 ` NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5] Chris Wilson
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2019-06-16 19:06 UTC
  To: Linux List Kernel Mailing

"It's Sunday afternoon somewhere in the world".

In fact, it's _barely_ Sunday afternoon back home, where I'll be later
today. But not quite yet, and I continue my slightly flaky release
schedule due to my normal release time being spent on an airplane once
again.

In fact, that will happen the _next_ two weekends too due to yet more
travel. So the releases will not be quite the clockwork they usually
are.

But the good news is that we're getting to the later parts of the rc
series, and things do seem to be calming down. I was hoping rc5 would
end up smaller than rc4, and so it turned out. There's some pending
stuff still, but it all looks quite small and nothing seems to be
particularly scary-looking.

And this time around we don't even have any huge SPDX updates, so the
diffstat looks nice and small and clean too. Normal changes all over
(with drivers being the bulk of it as it should be: sound stands out,
but there's gpu, HID, USB, block.. ). Outside of driver fixes there's
the usual noise all over: arch updates, documentation, and small misc
fixes spread out.

As mentioned, nothing particularly stands out as being scary. Shortlog
appended for details for those of you who want to scan over it
quickly, it's not big.

Go forth and test,

               Linus

---

Alex Deucher (1):
      drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware

Alex Levin (1):
      ASoC: Intel: sst: fix kmalloc call with wrong flags

Alexandre Belloni (1):
      usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC

Amadeusz Sławiński (2):
      ALSA: hdac: fix memory release for SST and SOF drivers
      SoC: rt274: Fix internal jack assignment in set_jack callback

Andrea Arcangeli (1):
      coredump: fix race condition between collapse_huge_page() and core dumping

Andreas Gruenbacher (1):
      gfs2: Fix rounding error in gfs2_iomap_page_prepare

Andreas Herrmann (2):
      block/switching-sched.txt: Update to blk-mq schedulers
      blkio-controller.txt: Remove references to CFQ

Andrey Ryabinin (1):
      x86/kasan: Fix boot with 5-level paging and KASAN

Andrey Smirnov (1):
      usb: phy: mxs: Disable external charger detect in mxs_phy_hw_init()

Andrzej Pietrasiewicz (1):
      usb: gadget: dwc2: fix zlp handling

Axel Lin (1):
      regulator: tps6507x: Fix boot regression due to testing wrong init_data pointer

Baoquan He (1):
      x86/mm/KASLR: Compute the size of the vmemmap section properly

Benjamin Tissoires (4):
      HID: multitouch: handle faulty Elo touch device
      Revert "HID: Increase maximum report size allowed by hid_field_extract()"
      Revert "HID: core: Do not call request_module() in async context"
      Revert "HID: core: Call request_module before doing device_add"

Boris Brezillon (1):
      drm/gem_shmem: Use a writecombine mapping for ->vaddr

Borislav Petkov (2):
      RAS/CEC: Fix binary search function
      x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback

Błażej Szczygieł (1):
      HID: a4tech: fix horizontal scrolling

Casey Schaufler (1):
      Smack: Restore the smackfsdef mount option and add missing prefixes

Chaitanya Kulkarni (1):
      null_blk: remove duplicate check for report zone

Chris Packham (1):
      USB: serial: pl2303: add Allied Telesis VT-Kit3

Christoph Hellwig (1):
      x86/fpu: Don't use current->mm to check for a kthread

Christophe Leroy (3):
      spi: spi-fsl-spi: call spi_finalize_current_message() at the end
      powerpc: Fix kexec failure on book3s/32
      powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX

Coly Li (2):
      bcache: fix stack corruption by PRECEDING_KEY()
      bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached

Cong Wang (1):
      RAS/CEC: Convert the timer callback to a workqueue

Curtis Malainey (1):
      ASoC: rt5677-spi: Handle over reading when flipping bytes

Damien Le Moal (1):
      block: force select mq-deadline for zoned block devices

Dan Carpenter (1):
      drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()

Dan Williams (6):
      drivers/base/devres: introduce devm_release_action()
      mm/devm_memremap_pages: introduce devm_memunmap_pages
      PCI/P2PDMA: fix the gen_pool_add_virt() failure path
      lib/genalloc: introduce chunk owners
      PCI/P2PDMA: track pgmap references per resource, not globally
      mm/devm_memremap_pages: fix final page put race

Daniele Palmas (1):
      USB: serial: option: add Telit 0x1260 and 0x1261 compositions

Dave Jiang (1):
      iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock

Dave Martin (1):
      arm64/sve: Fix missing SVE/FPSIMD endianness conversions

Don Brace (1):
      scsi: hpsa: correct ioaccel2 chaining

Douglas Anderson (1):
      usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)

Eiichi Tsukata (3):
      tracing: Fix out-of-range read in trace_stack_print()
      tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
      tracing/uprobe: Fix obsolete comment on trace_uprobe_create()

Eric Biggers (1):
      io_uring: fix memory leak of UNIX domain socket inode

Eric W. Biederman (1):
      signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO

Ezequiel Garcia (1):
      drm/panfrost: Require the simple_ondemand governor

Geert Uytterhoeven (1):
      block/ps3vram: Use %llu to format sector_t after LBDAF removal

Gen Zhang (2):
      selinux: fix a missing-check bug in selinux_add_mnt_opt( )
      selinux: fix a missing-check bug in selinux_sb_eat_lsm_opts()

Georgii Staroselskii (1):
      ASoC: sun4i-codec: fix first delay on Speaker

Greg Kroah-Hartman (1):
      blk-mq: no need to check return value of debugfs_create functions

Guennadi Liakhovetski (1):
      ASoC: SOF: ipc: fix a race, leading to IPC timeouts

Gustavo A. R. Silva (1):
      usb: typec: ucsi: ccg: fix memory leak in do_flash

H. Nikolaus Schaller (1):
      gpio: pca953x: hack to fix 24 bit gpio expanders

Hans de Goede (9):
      HID: logitech-dj: add support for the Logitech MX5500's Bluetooth Mini-Receiver
      HID: logitech-hidpp: add support for the MX5500 keyboard
      HID: logitech-hidpp: Add support for the S510 remote control
      HID: logitech-dj: Fix 064d:c52f receiver support
      drm: panel-orientation-quirks: Add quirk for GPD pocket2
      drm: panel-orientation-quirks: Add quirk for GPD MicroPC
      drm/i915/dsi: Use a fuzzy check for burst mode clock check
      platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
      libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk

Heikki Krogerus (1):
      usb: typec: Make sure an alt mode exist before getting its partner

Hsin-Yi Wang (5):
      drm/mediatek: fix unbind functions
      drm/mediatek: unbind components in mtk_drm_unbind()
      drm/mediatek: call drm_atomic_helper_shutdown() when unbinding driver
      drm/mediatek: clear num_pipes when unbind driver
      drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable()

Hugh Dickins (1):
      x86/fpu: Use fault_in_pages_writeable() for pre-faulting

Hui Wang (1):
      Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops"

James Morse (1):
      x86/resctrl: Don't stop walking closids when a locksetup group is found

Jani Nikula (2):
      drm/edid: abstract override/firmware EDID retrieval
      drm: add fallback override/firmware EDID modes workaround

Jann Horn (1):
      ptrace: restore smp_rmb() in __ptrace_may_access()

Jason Gerecke (5):
      HID: wacom: Don't set tool type until we're in range
      HID: wacom: Don't report anything prior to the tool entering range
      HID: wacom: Send BTN_TOUCH in response to INTUOSP2_BT eraser contact
      HID: wacom: Correct button numbering 2nd-gen Intuos Pro over Bluetooth
      HID: wacom: Sync INTUOSP2_BT touch state after each frame if necessary

Jens Axboe (1):
      cgroup/bfq: revert bfq.weight symlink change

Joel Savitz (1):
      cpuset: restore sanity to cpuset_cpus_allowed_fallback()

Johannes Weiner (1):
      mm: memcontrol: don't batch updates of local VM stats and events

Jon Hunter (2):
      ASoC: simple-card: Fix configuration of DAI format
      ASoC: simple-card: Restore original configuration of DAI format

Joseph Salisbury (1):
      HID: hyperv: Add a module description line

Josh Poimboeuf (1):
      module: Fix livepatch/ftrace module text permissions race

Julien Thierry (1):
      clocksource/drivers/arm_arch_timer: Don't trace count reader functions

Jörgen Storvist (1):
      USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode

Kai Vehmanen (2):
      ASoC: SOF: fix race in FW boot timeout handling
      ASoC: SOF: fix DSP oops definitions in FW ABI

Kai-Heng Feng (2):
      HID: i2c-hid: add iBall Aer3 to descriptor override
      USB: usb-storage: Add new ID to ums-realtek

Kailang Yang (1):
      ALSA: hda/realtek - Update headset mode for ALC256

Kan Liang (1):
      x86/CPU: Add more Icelake model numbers

Keyon Jie (1):
      ASoC: SOF: control: correct the copy size for bytes kcontrol put

Kirill Tkhai (1):
      mm/vmscan.c: fix recent_rotated history

Kovács Tamás (1):
      ASoC: Intel: Baytrail: add quirk for Aegex 10 (RU2) tablet

Kuninori Morimoto (2):
      ASoC: soc-dpm: fixup DAI active unbalance
      ASoC: soc-core: fixup references at soc_cleanup_card_resources()

Libin Yang (2):
      ASoC: soc-pcm: BE dai needs prepare when pause release after resume
      ASoC: SOF: pcm: clear hw_params_upon_resume flag correctly

Linus Torvalds (1):
      Linux 5.2-rc5

Linus Walleij (1):
      i2c: pca-platform: Fix GPIO lookup code

Lionel Landwerlin (1):
      drm/i915/perf: fix whitelist on Gen10+

Lu Baolu (2):
      iommu: Add missing new line for dma type
      iommu/vt-d: Set the right field for Page Walk Snoop

Lucas De Marchi (1):
      drm/i915/dmc: protect against reading random memory

Manuel Traut (1):
      scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE

Marco Zatta (1):
      USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.

Marcus Cooper (2):
      ASoC: sun4i-i2s: Fix sun8i tx channel offset mask
      ASoC: sun4i-i2s: Add offset to RX channel select

Mark Brown (1):
      spi: Fix Raspberry Pi breakage

Martin Schiller (1):
      usb: dwc2: Fix DMA cache alignment issues

Mathew King (1):
      platform/x86: intel-vbtn: Report switch events when event wakes device

Matt Flax (1):
      ASoC : cs4265 : readable register too low

Matt Mullins (1):
      x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()

Minas Harutyunyan (1):
      usb: dwc2: Set actual frame number for completed ISOC transfer for none DDMA

Minchan Kim (1):
      mm/vmscan.c: fix trying to reclaim unevictable LRU page

Ming Lei (1):
      blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests

Nathan Chancellor (1):
      arm64: Don't unconditionally add -Wno-psabi to KBUILD_CFLAGS

Neil Armstrong (4):
      drm/meson: fix G12A HDMI PLL settings for 4K60 1000/1001 variations
      drm/meson: fix primary plane disabling
      drm/meson: fix G12A primary plane disabling
      drm/panfrost: make devfreq optional again

Nicholas Piggin (2):
      powerpc/64s: Fix THP PMD collapse serialisation
      powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()

Nikolay Borisov (1):
      btrfs: Always trim all unallocated space in btrfs_trim_free_extents

Odin Ugedal (1):
      docs cgroups: add another example size for hugetlb

Ondrej Mosnacek (1):
      selinux: log raw contexts as untrusted strings

Pan Xiuli (1):
      ASoC: SOF: soundwire: add initial soundwire support

Parav Pandit (3):
      vfio/mdev: Improve the create/remove sequence
      vfio/mdev: Avoid creating sysfs remove file on stale device removal
      vfio/mdev: Synchronize device create/remove with parent removal

Philippe Mazenauer (1):
      clocksource/drivers/timer-ti-dm: Change to new style declaration

Pierre-Louis Bossart (9):
      ASoC: SOF: nocodec: fix undefined reference
      ASoC: SOF: core: fix error handling with the probe workqueue
      ASoC: SOF: pcm: remove warning - initialize workqueue on open
      ASoC: SOF: uapi: mirror firmware changes
      ASoC: SOF: bump to ABI 3.6
      ASoC: Intel: cht_bsw_max98090: fix kernel oops with platform_name override
      ASoC: Intel: bytcht_es8316: fix kernel oops with platform_name override
      ASoC: Intel: cht_bsw_nau8824: fix kernel oops with platform_name override
      ASoC: Intel: cht_bsw_rt5672: fix kernel oops with platform_name override

Potyra, Stefan (1):
      mm/mlock.c: mlockall error for flag MCL_ONFAULT

Prarit Bhargava (1):
      x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled

Ranjani Sridharan (6):
      ASoC: SOF: fix error in verbose ipc command parsing
      ASoC: core: lock client_mutex while removing link components
      ASoC: SOF: core: remove DSP after unregistering machine driver
      ASoC: SOF: core: remove snd_soc_unregister_component in case of error
      ASoC: hda: fix unbalanced codec dev refcount for HDA_DEV_ASOC
      ASoC: core: Fix deadlock in snd_soc_instantiate_card()

Rick Edgecombe (2):
      mm/vmalloc: Fix calculation of direct map addr range
      mm/vmalloc: Avoid rare case of flushing TLB with weird arguments

Robin Murphy (1):
      iommu/arm-smmu: Avoid constant zero in TLBI writes

Rui Nuno Capela (1):
      ALSA: ice1712: Check correct return value to snd_i2c_sendbytes (EWS/DMX 6Fire)

Russell King (1):
      i2c: acorn: fix i2c warning

S.j. Wang (2):
      ASoC: fsl_asrc: Fix the issue about unsupported rate
      ASoC: cs42xx8: Add regcache mask dirty

Sathya Prakash M R (3):
      ASoC: Intel: soc-acpi: Fix machine selection order
      ASoC: Intel: sof-rt5682: fix for codec button mapping
      ASoC: Intel: sof-rt5682: fix AMP quirk support

Sean Young (1):
      media: dvb: warning about dvb frequency limits produces too much noise

Sebastian Andrzej Siewior (1):
      x86/fpu: Update kernel's FPU state before using for the fsave header

Shakeel Butt (1):
      mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node

Shirish S (1):
      drm/amdgpu/{uvd,vcn}: fetch ring's read_ptr after alloc

Slawomir Blauciak (1):
      ASoC: SOF: ipc: replace fw ready bitfield with explicit bit ordering

Stanimir Varbanov (1):
      media: venus: hfi_parser: fix a regression in parser

Stefano Stabellini (1):
      xen/swiotlb: don't initialize swiotlb twice on arm64

Super Liu (1):
      spi: abort spi_sync if failed to prepare_transfer_hardware

Takashi Sakamoto (2):
      ALSA: firewire-motu: fix destruction of data for isochronous resources
      ALSA: oxfw: allow PCM capture for Stanton SCS.1m

Tejun Heo (6):
      cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
      cgroup: Call cgroup_release() before __exit_signal()
      cgroup: Implement css_task_iter_skip()
      cgroup: Include dying leaders with live threads in PROCS iterations
      cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
      cgroup: Fix css_task_iter_advance_css_set() cset skip condition

Thomas Gleixner (1):
      timekeeping: Repair ktime_get_coarse*() granularity

Tobias Auerochs (1):
      HID: rmi: Use SET_REPORT request on control endpoint for Acer Switch 3 and 5

Tzung-Bi Shih (1):
      ASoC: core: move DAI pre-links initiation to snd_soc_instantiate_card

Vadim Pasternak (2):
      platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration
      platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow

Vasily Gorbik (1):
      tracing: avoid build warning with HAVE_NOP_MCOUNT

Ville Syrjälä (2):
      drm/i915: Fix per-pixel alpha with CCS
      drm/i915/sdvo: Implement proper HDMI audio support for SDVO

Viorel Suman (2):
      ASoC: ak4458: add return value for ak4458_probe
      ASoC: ak4458: rstn_control - return a non-zero on error only

Wei Li (1):
      ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()

Wei Yongjun (1):
      usb: gadget: udc: lpc32xx: fix return value check in lpc32xx_udc_probe()

Wengang Wang (1):
      fs/ocfs2: fix race in ocfs2_dentry_attach_lock()

Will Deacon (1):
      arm64: tlbflush: Ensure start/end of address range are aligned to stride

Yang Shi (1):
      mm: mmu_gather: remove __tlb_reset_range() for force flush

Yongqiang Niu (2):
      drm/mediatek: adjust ddp clock control flow
      drm/mediatek: respect page offset for PRIME mmap calls

Young Xiao (1):
      usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]

Yu-Hsuan Hsu (1):
      ASoC: max98090: remove 24-bit format support if RJ is 0

YueHaibing (4):
      spi: bitbang: Fix NULL pointer dereference in spi_unregister_master
      ASoC: SOF: Intel: hda: Fix COMPILE_TEST build error
      ASoC: da7219: Fix build error without CONFIG_I2C
      tracing: Make two symbols static

Zhu Yingjiang (2):
      ASoC: SOF: Intel: hda: fix the hda init chip
      ASoC: SOF: Intel: hda: use the defined ppcap functions

swkhack (1):
      mm/mlock.c: change count_mm_mlocked_page_nr return type

* NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-16 19:06 Linux 5.2-rc5 Linus Torvalds
@ 2019-06-19 12:39 ` Chris Wilson
  2019-06-19 18:49   ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2019-06-19 12:39 UTC
  To: Linus Torvalds, Linux List Kernel Mailing

I haven't bisected this, but with the merge of rc5 into our CI we
started hitting an issue that resulted in an oops and the NMI watchdog
firing as we dumped the ftrace. This NMI watchdog locks up prior to the
backtraces being printed, preventing the machine from rebooting, and can
be avoided with hardlockup_all_cpu_backtrace=0. Running on arch/x86.
-Chris

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-19 12:39 ` NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5] Chris Wilson
@ 2019-06-19 18:49   ` Linus Torvalds
  2019-06-19 19:19     ` Chris Wilson
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2019-06-19 18:49 UTC
  To: Chris Wilson; +Cc: Linux List Kernel Mailing

On Wed, Jun 19, 2019 at 5:40 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> I haven't bisected this, but with the merge of rc5 into our CI we
> started hitting an issue that resulted in an oops and the NMI watchdog
> firing as we dumped the ftrace.

Do you have the oops itself at all?

           Linus

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-19 18:49   ` Linus Torvalds
@ 2019-06-19 19:19     ` Chris Wilson
  2019-06-19 20:42       ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2019-06-19 19:19 UTC
  To: Linus Torvalds; +Cc: Linux List Kernel Mailing

Quoting Linus Torvalds (2019-06-19 19:49:37)
> On Wed, Jun 19, 2019 at 5:40 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > I haven't bisected this, but with the merge of rc5 into our CI we
> > started hitting an issue that resulted in an oops and the NMI watchdog
> > firing as we dumped the ftrace.
> 
> Do you have the oops itself at all?

An example at
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log

The bug causing the oops is clearly a driver problem. The rc5 fallout
just seems to be because of some shrinker changes affecting some object
reaping that were unfortunately still active. What perturbed the CI
team was the machine failed to panic & reboot.
-Chris

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-19 19:19     ` Chris Wilson
@ 2019-06-19 20:42       ` Linus Torvalds
  2019-06-21 15:30         ` Thomas Gleixner
  2019-06-25  3:03         ` Josh Poimboeuf
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2019-06-19 20:42 UTC
  To: Chris Wilson
  Cc: Linux List Kernel Mailing, Steven Rostedt, Josh Poimboeuf,
	Thomas Gleixner

On Wed, Jun 19, 2019 at 12:19 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
> > Do you have the oops itself at all?
>
> An example at
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log
>
> The bug causing the oops is clearly a driver problem. The rc5 fallout
> just seems to be because of some shrinker changes affecting some object
> reaping that were unfortunately still active. What perturbed the CI
> team was the machine failed to panic & reboot.

Hmm. It's hard to guess at the cause of that. The oopses themselves
don't look like they are happening in any particularly bad context, so
all the normal reboot-on-oops etc stuff _should_ work.

So it would help a lot if you could bisect the bad problem at least a
bit, if it is at all reproducible. Because with no other clues, it's
hard to even guess at what might be up.

The fact that you say "NMI watchdog firing as we dumped the ftrace"
means that maybe it might be some ftrace / stacktrace issue where the
dumping itself leads to some endless loop, but who knows.

For example, one thing that has happened during this development cycle
is the stacktrace common infrastructure changes (arch_stack_walk() and
friends). I'm not seeing why that would cause your issues, but I'm
adding a few random people for ftrace / stacktrace changes.

                     Linus

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-19 20:42       ` Linus Torvalds
@ 2019-06-21 15:30         ` Thomas Gleixner
  2019-06-21 18:37           ` Chris Wilson
  2019-06-25  3:03         ` Josh Poimboeuf
  1 sibling, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2019-06-21 15:30 UTC
  To: Linus Torvalds
  Cc: Chris Wilson, Linux List Kernel Mailing, Steven Rostedt,
	Josh Poimboeuf, Joerg Roedel

On Wed, 19 Jun 2019, Linus Torvalds wrote:

> On Wed, Jun 19, 2019 at 12:19 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > > Do you have the oops itself at all?
> >
> > An example at
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log

The second one contains a lockdep splat which warns about a potential
deadlock in the iommu code.

> > The bug causing the oops is clearly a driver problem. The rc5 fallout
> > just seems to be because of some shrinker changes affecting some object
> > reaping that were unfortunately still active. What perturbed the CI
> > team was the machine failed to panic & reboot.
> 
> Hmm. It's hard to guess at the cause of that. The oopses themselves
> don't look like they are happening in any particularly bad context, so
> all the normal reboot-on-oops etc stuff _should_ work.
> 
> So it would help a lot if you could bisect the bad problem at least a
> bit, if it is at all reproducible. Because with no other clues, it's
> hard to even guess at what might be up.
> 
> The fact that you say "NMI watchdog firing as we dumped the ftrace"
> means that maybe it might be some ftrace / stacktrace issue where the
> dumping itself leads to some endless loop, but who knows.
> 
> For example, one thing that has happened during this development cycle
> is the stacktrace common infrastructure changes (arch_stack_walk() and
> friends). I'm not seeing why that would cause your issues, but I'm
> adding a few random people for ftrace / stacktrace changes.

/me whistles innocently.

Chris, do you have the actual NMI lockup detector splats somewhere?

Thanks,

	tglx

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-21 15:30         ` Thomas Gleixner
@ 2019-06-21 18:37           ` Chris Wilson
  2019-06-21 19:33             ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2019-06-21 18:37 UTC
  To: Linus Torvalds, Thomas Gleixner
  Cc: Linux List Kernel Mailing, Steven Rostedt, Josh Poimboeuf, Joerg Roedel

Quoting Thomas Gleixner (2019-06-21 16:30:52)
> Chris, do you have the actual NMI lockup detector splats somewhere?

Sorry, I'm having a hard time reproducing this at will now. The test
case depends on the right timing of the wrong event to cause the GPU to
hang.

From memory, I got the
	"Watchdog detected hard LOCKUP on cpu foo"
followed by the register dump and then nothing. At which point I had to
power cycle the machine.
-Chris

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-21 18:37           ` Chris Wilson
@ 2019-06-21 19:33             ` Thomas Gleixner
  2019-06-21 19:56               ` Chris Wilson
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2019-06-21 19:33 UTC
  To: Chris Wilson
  Cc: Linus Torvalds, Linux List Kernel Mailing, Steven Rostedt,
	Josh Poimboeuf, Joerg Roedel

On Fri, 21 Jun 2019, Chris Wilson wrote:

> Quoting Thomas Gleixner (2019-06-21 16:30:52)
> > Chris, do you have the actual NMI lockup detector splats somewhere?
> 
> Sorry, I'm having a hard time reproducing this at will now. The test
> case depends on the right timing of the wrong event to cause the GPU to
> hang.
> 
> From memory, I got the
> 	"Watchdog detected hard LOCKUP on cpu foo"
> followed by the register dump and then nothing. At which point I had to
> power cycle the machine.

Hmm. Do you have a serial log of that incident?

Thanks,

	tglx

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-21 19:33             ` Thomas Gleixner
@ 2019-06-21 19:56               ` Chris Wilson
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2019-06-21 19:56 UTC
  To: Thomas Gleixner
  Cc: Linus Torvalds, Linux List Kernel Mailing, Steven Rostedt,
	Josh Poimboeuf, Joerg Roedel

Quoting Thomas Gleixner (2019-06-21 20:33:36)
> On Fri, 21 Jun 2019, Chris Wilson wrote:
> 
> > Quoting Thomas Gleixner (2019-06-21 16:30:52)
> > > Chris, do you have the actual NMI lockup detector splats somewhere?
> > 
> > Sorry, I'm having a hard time reproducing this at will now. The test
> > case depends on the right timing of the wrong event to cause the GPU to
> > hang.
> > 
> > From memory, I got the
> >       "Watchdog detected hard LOCKUP on cpu foo"
> > followed by the register dump and then nothing. At which point I had to
> > power cycle the machine.
> 
> Hmm. Do you have a serial log of that incident?

I use netconsole. I think Tomi has a serial console for most things
available, but not permanently hooked up. And I didn't have it in a tee
as it was late, with the lockup an annoyance to the bug I was trying
to solve. I'll keep trying to recreate that bug as once I do have that
recipe, it should be possible to bisect. I can check with Tomi on Monday
if he can pull a machine out of the farm and see how it locked up.
-Chris

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-19 20:42       ` Linus Torvalds
  2019-06-21 15:30         ` Thomas Gleixner
@ 2019-06-25  3:03         ` Josh Poimboeuf
  2019-06-26 12:26           ` Steven Rostedt
  1 sibling, 1 reply; 11+ messages in thread
From: Josh Poimboeuf @ 2019-06-25  3:03 UTC
  To: Linus Torvalds
  Cc: Chris Wilson, Linux List Kernel Mailing, Steven Rostedt, Thomas Gleixner

On Wed, Jun 19, 2019 at 01:42:53PM -0700, Linus Torvalds wrote:
> On Wed, Jun 19, 2019 at 12:19 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >
> > > Do you have the oops itself at all?
> >
> > An example at
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
> > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log
> >
> > The bug causing the oops is clearly a driver problem. The rc5 fallout
> > just seems to be because of some shrinker changes affecting some object
> > reaping that were unfortunately still active. What perturbed the CI
> > team was the machine failed to panic & reboot.
> 
> Hmm. It's hard to guess at the cause of that. The oopses themselves
> don't look like they are happening in any particularly bad context, so
> all the normal reboot-on-oops etc stuff _should_ work.

Looking at the dmesg, panic_on_oops doesn't seem to be enabled: it went
through the rewind_stack_do_exit() path instead of the panic() path.  So
the system is apparently not configured to reboot on oops.

So I'd say the hang was presumably caused by a lock held by the oopsing
code.  So it looks normal to me, other than the original oops.
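
The runtime side of this can be inspected through the kernel.panic_on_oops
and kernel.panic sysctls. The following is a minimal Python sketch, assuming
a Linux machine that exposes them under /proc/sys/kernel; the helper names
read_sysctl_int and describe_oops_policy are made up for illustration and
are not part of any kernel tooling:

```python
from pathlib import Path


def read_sysctl_int(name, root="/proc/sys/kernel"):
    """Return the integer value of a sysctl file, or None if unreadable."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_oops_policy(panic_on_oops, panic_timeout):
    """Summarise what happens after an oops, given the two sysctl values."""
    if panic_on_oops is None or panic_timeout is None:
        return "unknown"
    if panic_on_oops == 0:
        # Default behaviour: only the oopsing task is killed.
        return "oops kills the offending task; no panic"
    if panic_timeout == 0:
        # panic=0 means no automatic reboot after a panic.
        return "oops panics; machine then waits forever"
    return "oops panics; machine reboots"


if __name__ == "__main__":
    policy = describe_oops_policy(
        read_sysctl_int("panic_on_oops"),
        read_sysctl_int("panic"),
    )
    print(policy)
```

With panic_on_oops at 0 the panic timeout never comes into play for an oops,
which matches the rewind_stack_do_exit() path seen in the dmesg.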

-- 
Josh

* Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]
  2019-06-25  3:03         ` Josh Poimboeuf
@ 2019-06-26 12:26           ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2019-06-26 12:26 UTC
  To: Josh Poimboeuf
  Cc: Linus Torvalds, Chris Wilson, Linux List Kernel Mailing, Thomas Gleixner

On Mon, 24 Jun 2019 22:03:45 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> Looking at the dmesg, panic_on_oops doesn't seem to be enabled: it went
> through the rewind_stack_do_exit() path instead of the panic() path.  So
> the system is apparently not configured to reboot on oops.

"Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/sda1 rootwait fsck.repair=yes intel_iommu=igfx_off nmi_watchdog=panic,auto panic=5 softdog.soft_panic=5 drm.debug=0xe log_buf_len=1M 3 ro"

> 
> So I'd say the hang was presumably caused by a lock held by the oopsing
> code.  So it looks normal to me, other than the original oops.
> 

Looks like it's missing "oops=panic", as the documentation says:


        oops=panic      Always panic on oopses. Default is to just kill the
                        process, but there is a small probability of
                        deadlocking the machine.
                        This will also cause panics on machine check exceptions.
                        Useful together with panic=30 to trigger a reboot.
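
As a rough illustration of that check, here is a small Python sketch that
scans the command line quoted above for the two options. It assumes simple
whitespace splitting is good enough (it ignores quoted values), and the
function names parse_cmdline and reboots_on_oops are hypothetical. It also
looks only at boot parameters, so a panic_on_oops set later via sysctl would
not be seen:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def reboots_on_oops(params):
    """True only if an oops escalates to a panic and a panic leads to a reboot."""
    panics_on_oops = params.get("oops") == "panic"
    timeout = params.get("panic")
    try:
        # panic=0 means "wait forever"; any other value reboots.
        reboot_after_panic = timeout is not None and int(timeout) != 0
    except ValueError:
        reboot_after_panic = False
    return panics_on_oops and reboot_after_panic


# Command line copied from the boot log quoted in this thread.
CI_CMDLINE = (
    "BOOT_IMAGE=/boot/drm_intel root=/dev/sda1 rootwait fsck.repair=yes "
    "intel_iommu=igfx_off nmi_watchdog=panic,auto panic=5 "
    "softdog.soft_panic=5 drm.debug=0xe log_buf_len=1M 3 ro"
)

params = parse_cmdline(CI_CMDLINE)
print("oops=panic present:", params.get("oops") == "panic")
print("would reboot on oops:", reboots_on_oops(params))
```

Appending oops=panic to the same string flips the second result, since
panic=5 is already there to trigger the reboot.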

-- Steve
