On 27/10/2016 16:20, Andriy Gapon wrote: > > I observe a problem on a relatively recent, but not the latest, head. > r306752 amd64 on AMD hardware. > If I run > pmcstat -T -P instructions -t $pid > with a pid of a busy userland process, then I shortly get a (stray) NMI. > Apparently hwpmc does not recognize that NMI. Because the problem was readily reproducible, I managed to gather some hwpmc debug traces. The following are last messages collected from a crash dump that I made after getting an NMI on CPU#0: index cpu timestamp trace ------ --- ---------------- ----- 97230 1 1089795818106 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97229 1 1089795792315 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97228 1 1089795767010 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97227 1 1089795741843 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97226 1 1089795716970 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97225 1 1089795691881 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97224 1 1089795666756 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97223 1 1089795641637 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97222 1 1089795616422 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97221 1 1089795590715 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97220 0 1089795590076 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=0 97219 1 1089795560493 MDP:INT:1: cpu=1 tf=0xfffffe03c45ecf30 um=1 97218 1 1089795520646 MDP:SWI:1: pc=0xfffff8002f355e00 pp=0xfffff80021f56400 enable-msr=0 97217 1 1089795519677 MDP:STA:2: amd-start config=0x5300c0 97216 1 1089795519420 MDP:STA:1: amd-start cpu=1 ri=0 97215 1 1089795518154 MDP:WRI:1: amd-write cpu=1 ri=0 v=ffffffffffff00fe 97214 1 1089795517311 CSW:SWI:1: cpu=1 ri=17 new=65282 97213 1 1089795514866 MDP:CFG:1: cpu=1 ri=0 pm=0xfffff8001c6e0500 97212 1 1089795510620 CSW:SWI:1: cpu=1 proc=0xfffff8013b1fe000 (1666, burnK7) pp=0xfffff80021f56400 97211 0 1089795494715 MDP:SWO:1: pc=0xfffff8001c6e0800 pp=0xfffff80021f56400 enable-msr=0 97210 0 1089795494448 MDP:CFG:1: cpu=0 ri=0 pm=0x0 97209 0 1089795494003 CSW:SWO:1: cpu=0 ri=17 tmp=-2839 (samp) 97208 0 1089795493811 MDP:REA:2: amd-read (post-munge) id=0 -> 65282 97207 0 1089795493644 MDP:REA:2: amd-read (pre-munge) id=0 -> 281474976645374 97206 0 1089795493299 MDP:REA:1: amd-read id=0 class=2 97205 0 1089795491453 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=0 97204 0 1089795490487 MDP:STO:1: amd-stop ri=0 97203 0 1089795489399 CSW:SWO:1: cpu=0 proc=0xfffff8013b1fe000 (1666, burnK7) pp=0xfffff80021f56400 97202 0 1089795460401 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=1 97201 0 1089795433574 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=1 97200 0 1089795371727 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=0 97199 0 1089795328556 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=1 97198 0 1089795303542 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=1 97197 0 1089795278288 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=1 97196 0 1089795253226 MDP:INT:1: cpu=0 tf=0xffffffff81d769d0 um=1 I think that what we see here is the target process being migrated from CPU#0 to CPU#1 and the corresponding reconfiguration of the performance counters on both CPUs. I interpret event #97220 to be the unclaimed NMI. Events #97204 and #97205 look curious. They seem like a possible "race" between amd_stop_pmc() and amd_intr(). As I understand, amd_stop_pmc() is called from the context switch code when the target process gets off CPU. I suspect that under the right conditions it's possible for wrmsr to cause a counter overflow, such that an interrupt (if enabled) is generated after wrmsr is executed, even if wrmsr disables the counter. In amd_intr() we have this code: wrmsr(evsel, config & ~AMD_PMC_ENABLE); wrmsr(perfctr, AMD_RELOAD_COUNT_TO_PERFCTR_VALUE(v)); /* Restart the counter if logging succeeded. */ error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, TRAPF_USERMODE(tf)); if (error == 0) wrmsr(evsel, config | AMD_PMC_ENABLE); I suspect that in the scenario above, if it is indeed possible, the last wrmsr would re-enable the counter that's supposed to be stopped. I think that writing back the original value should be more correct, that is: wrmsr(evsel, config); I'll test if this change would help. -- Andriy GaponReceived on Fri Oct 28 2016 - 11:02:13 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:08 UTC