From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752395AbeDXQs2 (ORCPT ); Tue, 24 Apr 2018 12:48:28 -0400 Received: from mail.kernel.org ([198.145.29.99]:36144 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750827AbeDXQs0 (ORCPT ); Tue, 24 Apr 2018 12:48:26 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1BCCF217CE Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=frederic@kernel.org Date: Tue, 24 Apr 2018 18:48:23 +0200 From: Frederic Weisbecker To: Thomas Gleixner Cc: "Wan, Kaike" , "Marciniszyn, Mike" , "Dalessandro, Dennis" , "Weiny, Ira" , "Fleck, John" , "linux-kernel@vger.kernel.org" , "linux-rdma@vger.kernel.org" , Peter Zijlstra , Anna-Maria Gleixner , Frederic Weisbecker , Ingo Molnar Subject: Re: hrtimer (rdmavt RNR timer) was lost Message-ID: <20180424164821.GA31724@lerouge> References: <3F128C9216C9B84BB6ED23EF16290AFB634C8C87@CRSMSX101.amr.corp.intel.com> <3F128C9216C9B84BB6ED23EF16290AFB634C9420@CRSMSX101.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 24, 2018 at 04:54:58PM +0200, Thomas Gleixner wrote: > On Mon, 23 Apr 2018, Thomas Gleixner wrote: > > On Mon, 23 Apr 2018, Wan, Kaike wrote: > > > > Can you apply the following debug patch and enable the hrtimer_start trace > > > > point and send me the full trace or upload it somewhere? > > > > > > The original trace was about 29GB and I filtered it with > > > "0000000066dda1ea" (the offending base) to generate a 1.4GB file that I > > > could open and investigate. I am not sure how I can send them to you. Do > > > you have somewhere I can upload to? > > > > > > I can try your debug patch and again I am anticipating a big trace file. > > > > Well, you can find the spot where the fail happens and then extract the > > full thing from 2s before that point to 1s after. That should be reasonably > > small and good enough. Let me know when you have it and how big it is > > (compressed) and we'll figure something out how to transport it. > > Thanks for the more complete data which actually let me decode the wreckage. > > So you have NO_HZ enabled and looking at the trace then this is related: > > -0 [003] dN.. 3598605307236: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=712171000000 softexpires=712171000000mode=ABS|PINNED base=0000000066dda1ea next=00000000708914d7 > -0 [003] dN.. 3598605307601: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=0000000004346740 > > -0 [002] d.h. 3598605313255: hrtimer_start: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] expires=712172915662 softexpires=712172915662mode=REL base=0000000066dda1ea next=0000000004346740 > -0 [002] d.h. 3599538740786: hrtimer_post_add: hrtimer=00000000662d2dd8 function=rvt_rc_rnr_retry [rdmavt] base=0000000066dda1ea next=00000000662d2dd8 > > -0 [003] dn.. 3599538742242: hrtimer_pre_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 > -0 [003] dn.. 3599538743084: hrtimer_post_remove: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 > > -0 [003] dn.. 3599538743830: hrtimer_start: hrtimer=0000000004346740 function=tick_sched_timer expires=716767000000 softexpires=716767000000mode=ABS|PINNED base=0000000066dda1ea next=00000000662d2dd8 > -0 [003] dn.. 3599538744560: hrtimer_post_add: hrtimer=0000000004346740 function=tick_sched_timer base=0000000066dda1ea next=00000000662d2dd8 > > And staring at the NOHZ code I'm pretty sure, I know what happens. > > CPU 3 CPU 2 > > idle > start tick_sched_timer 712171000000 > start rdmavt timer 712172915662 > lock(base) > tick_nohz_stop_tick() > tick = 716767000000 timerqueue_add(tmr) > > hrtimer_set_expires(&ts->sched_timer, tick); <---- FAIL > > if (tmr->exp < queue->next->exp) > hrtimer_start(sched_timer) queue->next = tmr; > lock(base) > unlock(base) > timerqueue_remove() > timerqueue_add() > .... > > So ts->sched_timer is still queued and queue->next is pointing to it, but > then expires is modified. So the other side sees the new expiry time and > makes the rdmavt timer the new queue->next. All f*cked from there. > > The problem was introduced with: > > d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync") > > The changelog is not really helpful, but I can't see why the expiry time of > ts->sched_timer should be updated before the timer is [re]started in case > of HIGHRES + NOHZ. hrtimer_start() sets the expiry time, so that should be > sufficient. Of course the tick needs to be stored in ts->sched_timer for > the !HIGHRES + HOHZ case, but that's trivial enough to do. Patch against > Linus tree below. It's easy to backport in case you want to run it against > the kernel which made the observation simpler. > > I need to come up with integration of hrtimer_set_expires() into debug > objects to catch this kind of horrors. Groan.... Sorry for all that, that must have been hairy enough to debug :-( I thought that I could fiddle with the sched_timer because nothing else is supposed to touch it concurrently, but that forgot about the whole queue locked under cpu_base. My fault. The fix looks all good to me, thanks! ACK.