Re: Safe-mode on amd64 broken

From: David Naylor <naylor.b.david_at_gmail.com>
Date: Thu, 30 Sep 2010 17:43:01 +0200

On Thursday 30 September 2010 08:52:39 Alexander Motin wrote:
> David Naylor wrote:
> > On Thursday 30 September 2010 07:23:34 Alexander Motin wrote:
> >> David Naylor wrote:
> >>> On Wednesday 29 September 2010 18:25:13 Alexander Motin wrote:
> >>>> David Naylor wrote:
> >>>>> On Wednesday 29 September 2010 16:19:08 Andriy Gapon wrote:
> >>>>>> What do you try to actually achieve?
> >>>>> 
> >>>>> I was trying to boot a system and it was panicking due to stray
> >>>>> interrupts. It turned out to be caused by HPET.  I found
> >>>>> `hint.hpet.0.clock=0' which fixed the problem.
> >>>>> 
> >>>>> This means HPET does not work on any of my machines.  The other one's
> >>>>> symptoms are hda losing interrupts after a period of up-time.
> >>>> 
> >>>> What chipset do you use? Nvidia MCP5x? Could you send me your verbose
> >>>> dmesg?
> >>> 
> >>> Yes, one is an MCP51, the other is an ICH8M.
> >>> 
> >>> The desktop is a Gigabyte N650SLI-DS4L.  Its symptom is hda losing
> >>> interrupts after a period of time.
> >> 
> >> There are too many reports of various lost-interrupt problems on
> >> different controllers of MCP5x. I don't know the reason. The attached
> >> patch should disable the use of regular HPET interrupts on NVidia
> >> chipsets. I hope it will work as a workaround. Maybe it is too
> >> aggressive, but better to be safe than sorry. I assume that
> >> legacy_route mode may still work fine there. It would be nice to test it.
> > 
> > I assume you mean hint.hpet.0.legacy_route=1?  I'll give that a try later
> > today on both machines.
> 
> Make sure that both attimer and atrtc are disabled, as mentioned in hpet(4).

legacy_route worked on the desktop but not on the laptop (boot stalled).  

Here is the vmstat -i output using default settings for the desktop:
interrupt                          total       rate
irq1: atkbd0                          64          0
irq12: psm0                          756          3
irq14: ata0                         1255          5
irq16: vgapci0                     13576         54
irq17: dc0                          1546          6
irq18: hpet0                      456756       1834
irq20: atapci2                     11557         46
irq21: hdac0 ohci0                 17038         68
irq23: atapci1                     11534         46
Total                             514082       2064

I moved hpet to irq22 (allowed_irqs="0x400000") and that also worked for the 
desktop.  
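
For anyone else picking values for this hint, here is a minimal sketch of how
an IRQ list maps to a mask. It assumes allowed_irqs is a 32-bit mask in which
bit N allows IRQ N, which matches 0x400000 selecting irq22 above. The helper
names irq_mask and irqs_in_mask are made up for this illustration and are not
part of any FreeBSD tool.

```python
def irq_mask(irqs):
    """Build a bitmask with bit N set for every IRQ number N in irqs.

    Assumption: the hint is a 32-bit mask, so only IRQs 0..31 are accepted.
    """
    mask = 0
    for irq in irqs:
        if not 0 <= irq < 32:
            raise ValueError(f"IRQ {irq} does not fit in a 32-bit mask")
        mask |= 1 << irq
    return mask


def irqs_in_mask(mask):
    """Return the sorted list of IRQ numbers whose bits are set in mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


if __name__ == "__main__":
    # Only irq22 allowed, as in the desktop test.
    print(hex(irq_mask([22])))                 # 0x400000
    # irq21..irq23 allowed, i.e. everything in that range except irq20.
    print(f"{irq_mask([21, 22, 23]):#010x}")   # 0x00e00000
    print(irqs_in_mask(0x00e00000))            # [21, 22, 23]
```

The resulting value would go into /boot/device.hints or be set at the loader
prompt as hint.hpet.0.allowed_irqs.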

> > Is your patch the same as hint.hpet.0.clock=0?
> 
> By default, effectively yes. But it still allows legacy_route to be
> configured, which is, for example, the default on Linux.
> 
> >>> The laptop is an Acer 2920.  Its symptom with a GENERIC kernel is a
> >>> panic reporting a stray interrupt (irq7); with a custom kernel, booting
> >>> stalls.
> >> 
> >> This is strange, as my Acer with the same ICH8M works fine in all
> >> possible modes. Also IMHO stray interrupts are not a reason to panic.
> >> Could you show what it looks like?
> > 
> > See http://markmail.org/message/smxnofrdmmkxyvnd for my previous email
> > that includes the backtrace from that panic.  When I booted in i386 safe
> > mode the kernel reported stray interrupts on irq7.  vmstat -i shows irq7
> > as "stray irq7".
> 
> I am not sure "stray irq7" is related here. More suspicious is the
> probable sharing of irq20 between HPET and uhci0, and the fact that the
> system panicked while uhci0 was registering its interrupt handler. I
> can't be sure which IRQ HPET used there, as it was disabled in the only
> dmesg available, but since HPET registers early, I think it grabbed the
> first possible one, irq20. On my system HPET also uses irq20, but
> uhci0 lives on irq16, so irq20 is not shared.

On the laptop uhci0 and ehci0 live on irq20.  

> To collect more data you may try to hint the HPET driver to avoid irq20 by
> setting hint.hpet.0.allowed_irqs=0x00e00000 or other values. I've tried
> the same recipe to create sharing on my system, but still found no problem.

This fixes the problem for the laptop.  This also allows one-shot timing to 
work.  Moving hpet to irq22 also worked.  Here is the vmstat -i using the 
above hint:

interrupt                          total       rate
irq1: atkbd0                         407          0
irq9: acpi0                         1857          2
irq12: psm0                         1005          1
irq14: ata0                         1870          2
irq18: uhci4                        2183          2
irq20: uhci0 ehci0                  2421          3
irq21: hpet0 uhci1                502330        667
irq23: uhci2 ehci1                     3          0
irq256: vgapci0                    25023         33
irq257: hdac0                        236          0
irq258: bge0                          79          0
irq259: ahci0                      27356         36
Total                             564770        750
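
Since the question in this thread is which devices end up sharing a line with
hpet0, here is a rough sketch that picks the shared IRQs out of vmstat -i
output. It assumes the column layout shown above (name, device list, total,
rate); the function name shared_irqs and the SAMPLE string are only for
illustration, with the sample rows copied from the laptop output.

```python
import re

# A few rows copied from the laptop's "vmstat -i" output above.
SAMPLE = """\
interrupt                          total       rate
irq14: ata0                         1870          2
irq20: uhci0 ehci0                  2421          3
irq21: hpet0 uhci1                502330        667
irq23: uhci2 ehci1                     3          0
irq256: vgapci0                    25023         33
Total                             564770        750
"""

# irq name, device list, then the two trailing numeric columns (total, rate).
LINE_RE = re.compile(r"^(irq\d+):\s+(.*?)\s+(\d+)\s+(\d+)\s*$")


def shared_irqs(vmstat_text):
    """Map each IRQ that has more than one device to its list of devices."""
    shared = {}
    for line in vmstat_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # header, "Total" row, or anything in another format
        devices = match.group(2).split()
        if len(devices) > 1:
            shared[match.group(1)] = devices
    return shared


if __name__ == "__main__":
    for irq, devices in shared_irqs(SAMPLE).items():
        print(irq, "shared by", ", ".join(devices))
```

On the sample rows this reports irq20, irq21 and irq23 as shared, with hpet0
sitting on irq21 next to uhci1.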

Received on Thu Sep 30 2010 - 13:44:10 UTC
