John Baldwin wrote:

> On Dec 6, 2005, at 5:57 PM, Julian Elischer wrote:
>
>> In short words for the likes of me,
>> can someone give a quick roundup on PCI routing in 4.x and -current?
>
> My, what a set of questions. :)  I'll do my best, but this will probably be a long and perhaps wandering e-mail.
>
> First off, interrupts for PCI devices are roughly split up into two categories (currently): INTx interrupt lines and MSI interrupts.  MSI is relatively new and I won't cover it much here.  No versions of FreeBSD currently support MSI either (though it's on my todo list), so I'll limit this discussion to INTx interrupts.  For INTx interrupts, each PCI device (or slot) has 4 interrupt lines: INTA, INTB, INTC, and INTD.  Thus, you can describe any individual PCI interrupt as a tuple of (bus, slot, pin).  For example, device 4's INTA pin on PCI bus 0 would be (0, 4, INTA).  Each PCI function is allowed to have one INTx interrupt.  The bus and slot come from the location of that function in the PCI hierarchy, and the pin comes from the intpin PCI config register.  PCI doesn't define, beyond the INTx pin, how an interrupt is delivered to the CPU; that is all a property of the architecture, chipset, etc.
>
> On x86, there are two disparate sets of hardware for managing interrupt signals.  The first is the pair of 8259A interrupt controllers found on all PC-AT compatible machines.  The second is the APIC subsystem.  Each processor contains a local APIC that can receive messages from other APICs and send messages to other local APICs.  In addition to the local APICs, the chipset contains one or more I/O APICs.  Each I/O APIC contains anywhere from 4 to 32 individual interrupt pins; common numbers are 4 (somewhat rare), 16, 24, and 32.  Conceptually, on x86 a given interrupt source can be described by the tuple (pic, pin).
>
> Simply put, PCI interrupt routing is the mapping of (bus, slot, pin) PCI interrupt tuples to (pic, pin) x86 interrupt tuples.
>
> Now, before delving deeper into the specifics of routing on x86, let me digress about IRQs on FreeBSD.  Basically, an IRQ value is a cookie useful for binding a device interrupt (such as a PCI (bus, slot, pin) tuple or an ISA IRQ) to an x86 interrupt tuple (pic, pin).  BIOSes don't operate with APICs at all, at least not for handling device interrupts.  Thus, they all use a simple mapping where IRQs 0-7 correspond to pins 0-7 on the master 8259A, and IRQs 8-15 map to pins 0-7 on the slave 8259A.  All versions of FreeBSD use the same mapping for IRQ cookie values when using the 8259As to route interrupts.  For the APIC case the mapping of IRQ cookies to (pic, pin) tuples is slightly more complicated.  First, the simple case.  FreeBSD 5.2 and later follow the ACPI model (even when not using ACPI) where IRQs 0 to n correspond to pins 0 to n of the first I/O APIC, IRQs n+1 to (n+1)+m map to pins 0 to m of the second I/O APIC, etc.  (There is one possible exception with ACPI I'll cover later.)
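> To make the 5.2+ numbering concrete, here is a rough sketch in C (illustrative only, not the actual kernel code; the function and its arguments are made up for the example):
>
>     /*
>      * Illustrative only: compute the ACPI-style global interrupt
>      * number (the IRQ cookie 5.2+ uses) for pin 'pin' of the I/O APIC
>      * at index 'apic', given the pin counts of all the I/O APICs.
>      */
>     static int
>     global_irq(int apic, int pin, const int *npins)
>     {
>         int base = 0, i;
>
>         for (i = 0; i < apic; i++)
>             base += npins[i];       /* pins of earlier I/O APICs */
>         return (base + pin);        /* e.g. 24-pin first APIC: apic 1, pin 0 -> IRQ 24 */
>     }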
> FreeBSD 4.x is more complicated.  The reason is that, due to the cpl and spl interrupt masks being 32-bit integers with 8 bits set aside for software interrupts (SWIs), cpl only has 24 bits available for hardware interrupts.  Therefore, FreeBSD releases before 5.2 are limited to IRQ values 0-23 and can't use the simple (and intuitive) model that FreeBSD 5.2+ and ACPI use.  What FreeBSD 4.x does is map the ISA interrupts attached to the first I/O APIC to IRQs 0, 1, and 3-15.  This just leaves IRQs 2 and 16-23 available for all the other APIC interrupt pins.  As each PCI device registers an interrupt handler for a specific (apic, pin) tuple, that x86 interrupt is mapped to one of that remaining set of IRQs.  If all of them have been used already, then the kernel starts assigning multiple (apic, pin) tuples to the same IRQ, resulting in interrupts being shared in software because of the cpl limitation even though they aren't shared in hardware.  This is why your IRQ values are different on 4.x than on FreeBSD 5.2+ and Linux, which use the ACPI global interrupt number model.

but if I change the code that does this, I may be able to get my devices that collide with the 'boot interrupt' to go elsewhere?  That would be good..

> Now, back to how routing of PCI device interrupts on x86 actually works.  I'll cover non-ACPI first.  There are two cases to consider.  First, the easy case is that a PCI device interrupt (bus, slot, pin) is wired directly to an individual pin on a PIC.  This is often how interrupts are wired when using APICs.  If you look at the mptable output and look at the interrupt section, this is fairly obvious, as you will see entries that map the interrupt for a given PCI bus, slot, and pin to a given APIC ID and intpin on that APIC.  Thus, there is the mapping from (bus, slot, pin) to (pic, pin) directly.  The way interrupt routing is implemented in this case is that when we go to route an interrupt for a given PCI device, we search the MP Table for a matching entry.  We then look up the associated APIC via its APIC ID, ask it for the specified pin, and then ask that pin for its IRQ (via the pic_vector method of the ioapic interrupt source object that describes the specific pin).  When nexus(4) does bus_setup_intr(), it passes that IRQ to the x86 intr_machdep code, which uses the IRQ as an index into its interrupt source array and ends up with the interrupt source object for the (apic, pin) tuple being used.  (Thus, IRQs are just a cookie that is the index into the global array of interrupt sources on x86.)  Note that interrupts routed this way are hardwired into the motherboard design.  There's no chance for the OS to change which (pic, pin) a PCI device interrupt is hooked up to.

but from my memory, many PCI devices can select between A, B, C and D, so maybe by going to the device and selecting a different one of those you can force it to go elsewhere...

> For the non-APIC case (non-ACPI still), PCI device interrupts are usually wired up to a pin on a programmable interrupt router.  Each of these pins is called a PCI link device.  Multiple PCI device interrupts may be wired up to the same link device, and systems typically have anywhere from 4 to 8 (sometimes even more) link devices.  Each link device can be independently routed to a given (pic, pin), and it is limited to a fixed set of possible IRQs.  If multiple link devices are routed to the same IRQ, then all of the devices attached to those link devices end up sharing the same IRQ (and thus the same ithread, etc.).  Because the link devices are independently steerable, this is the one way in which the OS has limited flexibility in routing interrupts.  However, the way it works is that you route the link devices, not individual PCI device interrupts.  The table the BIOS provides with the information about the link devices is called the $PIR (since that's the 4-byte signature you search for in RAM to find it).  You can see it in the dmesg output of a verbose boot.
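> Finding it really is just a signature scan plus a checksum.  Roughly like this (an illustrative sketch, not the actual sys/i386/pci/pci_pir.c code):
>
>     #include <stdint.h>
>     #include <string.h>
>
>     /*
>      * Illustrative only: scan a copy of the BIOS area (0xF0000-0xFFFFF)
>      * on 16-byte boundaries for the "$PIR" signature and accept a hit
>      * only if the whole table checksums to zero.
>      */
>     static uint8_t *
>     pir_search(uint8_t *start, uint8_t *end)
>     {
>         uint8_t *p, sum;
>         uint16_t size;
>         int i;
>
>         for (p = start; p + 32 <= end; p += 16) {
>             if (memcmp(p, "$PIR", 4) != 0)
>                 continue;
>             size = p[6] | (p[7] << 8);  /* table size in bytes */
>             sum = 0;
>             for (i = 0; i < size; i++)
>                 sum += p[i];            /* wraps mod 256 */
>             if (sum == 0)
>                 return (p);
>         }
>         return (NULL);
>     }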
> The $PIR is a table that maps a given (bus, slot, intpin) PCI tuple to a link index.  Each entry also has a bitmask of the valid ISA IRQs ($PIR only allows for the 16 ISA IRQs) that the specified link index can be routed to.  Thus, the way that interrupt routing works with $PIR is that when a PCI interrupt is routed, you search the table for a matching entry to get a link index.  The $PIR code in sys/i386/pci/pci_pir.c basically has a list of link objects that maintain state about each link.  The code finds the data associated with the link index and sees if it has an IRQ routed already.  If so, that's the IRQ that that PCI device interrupt is assigned to.  If an IRQ isn't routed already, then it has to use an algorithm to pick one, make a BIOS call to route the link to the chosen IRQ, and then assign the PCI device interrupt to that IRQ.

so, is a "link device" a physical piece of hardware or a software abstraction?

> Now that you understand that, ACPI routing can make some sense.  The way that ACPI routing works is that each PCI bus in the ACPI namespace has a _PRT method that returns a table of routing entries.  Each entry contains the slot and intpin that it handles, so that you can build the (bus, slot, intpin) PCI tuple (the bus comes from the PCI bus device the _PRT is a child of; in FreeBSD the _PRT is actually a child of the pcib(4) device that is the bridge that is the parent of the PCI bus, but I digress), as well as a reference to a link device in the ACPI namespace and a source index.  If the link device reference is empty or NULL, then the interrupt is a hard-wired interrupt such as the ones used with MP Table routing, the source index is the global interrupt number (== IRQ) that you use for this interrupt, and you are done.  If the link device reference isn't empty, then it is the name of an ACPI device object that manages a single PCI link device.  Example names look like \_SB_.PCI0.LPC0.LNKA.  Each link device object includes methods to query which IRQ it is currently routed to (though in practice this is unreliable), get the list of possible IRQs, disable the link device altogether, and route the link device to a specified IRQ.  This is similar to the link objects we have in the $PIR code except that these end up being full-blown devices on the ACPI side.  ACPI adds another twist in that the BIOS is free to use link devices with APICs (MP Table has no way of handling that), and in fact in practice there are some nvidia chipsets for amd64 that do route some PCI device interrupts to link devices that in APIC mode can be routed to any of the IRQs 20-23.
>
> Now some of the minor trivia and exceptions.  On the first I/O APIC, IRQ 0 is generally routed to intpin 2, not intpin 0 (though many motherboards don't actually hook up the IRQ 0 output from the ISA timer to intpin 2 but do claim to do so in the MP Table and MADT).  Instead, intpin 0 is a special ExtINT pin that listens to the 8259As and can forward interrupts from the 8259As to one or more CPUs.  This is what "mixed mode" is, and on FreeBSD 4.x, if we discover via a test that the motherboard did not wire IRQ 0 up to intpin 2, we use mixed mode to deliver it via the 8259A, bounced through the ExtINT pin 0 on the first I/O APIC.  Blech.  Also, for ACPI, the SCI is generally tied to IRQ 9; however, the SCI may be routed to another intpin in APIC mode.  Rather than change the IRQ value in the FADT (or whichever table the SCI INT is in), ACPI will include an entry in the MADT that maps IRQ 9 to some other intpin such as pin 13 or pin 20.
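> The override entry itself is tiny.  Roughly like this (the field names here are mine, not ACPICA's; the layout follows the ACPI spec's Interrupt Source Override entry):
>
>     #include <stdint.h>
>
>     /* Illustrative layout of a MADT Interrupt Source Override entry. */
>     struct madt_intr_override {
>         uint8_t  type;          /* 2 = Interrupt Source Override */
>         uint8_t  length;        /* always 10 */
>         uint8_t  bus;           /* 0 = ISA */
>         uint8_t  source;        /* ISA IRQ, e.g. 9 for the SCI */
>         uint32_t global_irq;    /* global interrupt (intpin) it really uses */
>         uint16_t flags;         /* MPS INTI flags: polarity, trigger mode */
>     } __attribute__((packed));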
> If the new intpin is not an ISA IRQ (> 15), we use a backdoor to override the IRQ ACPI uses.  If the new intpin is an ISA IRQ though, we actually rename the destination IRQ (such as IRQ 13 on one of my boxes) to IRQ 9, and the original IRQ 9 becomes a "dead" interrupt pin with no IRQ associated with it.  Note that except for a few rare and very old SMP boxes, no FreeBSD x86 machine has an IRQ 2.  Another odd case is that some very old SMP boxes did not route PCI device interrupts to the APICs at all.  Instead, they routed the outputs of the link devices to the pins on the first I/O APIC corresponding to the same IRQ as on the 8259A (the I/O APIC only had 16 pins).  Thus, on these boxes, PCI interrupts are still routed via link devices via $PIR, and end up triggering IRQ X via intpin X on the first I/O APIC.  One final twist: if a PCI bus behind a PCI-PCI bridge is not listed in a BIOS table ($PIR or MP Table) or does not have a _PRT in ACPI, the interrupts are routed by applying the swizzle defined in the PCI standard, which forwards the interrupt via one of the four INTx pins on the PCI-PCI bridge's parent PCI bus.  The standard defines this behavior for add-in cards, but some built-in busses do this as well.  (I've seen several AGP busses that actually use this method to route the VGA IRQ.)
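> That swizzle is simple enough to show.  Something like this (an illustrative sketch of the standard formula, not the actual pcib code), with pin 1-4 meaning INTA-INTD:
>
>     /*
>      * Map a child device's (slot, INTx pin) to the INTx pin used on
>      * the parent bus, per the standard PCI-PCI bridge swizzle.
>      */
>     static int
>     pci_swizzle_intpin(int slot, int pin)
>     {
>         /* e.g. slot 2, INTA (pin 1) -> INTC (pin 3) on the parent bus */
>         return (((pin - 1) + slot) % 4 + 1);
>     }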
>> Also, if the "boot interrupt" was previously set to 2, is that likely to have changed in -current?
>> Am I now going to get clobbered on IRQ 16?  If yes, is this something that the BIOS writers decided, or something that the motherboard designers decided?
>
> The "boot interrupt" issue on some of the PXHs used for PCI-X and PCI-e host bridges is an unpleasant mess.  I think it comes from Intel assuming all the world is windows (imagine that) and ignoring standards (such as MP Table and ACPI) that it helps to author.  (Yay Intel!)  The issue there is that the PXHs include a dedicated I/O APIC for each of the two busses the PXH serves, and the PCI device interrupts are routed to intpins on those APICs.  To handle the non-APIC case, the PXHs forward any device interrupts to the INTx pins on the parent side of the PCI-PCI bridge if the APIC is disabled.  The problem is that Intel chose a hack to figure out if the APIC was disabled, and that hack interacts badly with FreeBSD.  Basically, if the individual intpin is masked in the APIC, the PXH assumes you aren't using the APIC to handle interrupts, so it forwards the interrupt to the INTx pin on its bridge's parent side.  The problem is that after an interrupt comes in on 4.x and later, we mask the interrupt in the APIC until we have run the interrupt handler.  The reason is that PCI interrupts are level triggered, so they won't "shut up" until the ISR has run and pacified the PCI device.  4.x masks the interrupt because it wants to run ISRs not with all interrupts disabled, but at the same cpl that the interrupt was registered at, so that higher priority interrupts can still preempt an ISR.  5.x and later need to mask the interrupt so that the processor doesn't have to keep interrupts disabled until the ithread finishes.  Trying to do that would become complicated and quite painful since it would also mean deferring the EOI to the lapic (which has to happen on the same CPU that received the interrupt), and it has other nastiness since ithreads can block on locks, etc.  Other OS's that use ithreads, such as BSD/OS and probably Solaris/x86 and Darwin/x86, likely have the same issue.  The sucky part is that Intel didn't have to do this gross hack.  ACPI requires that the OS call a method _PIC if it wants to use APIC mode, and the _PIC method is free to write to registers, PCI config space, etc., so Intel could have provided a register to specify whether the PXH's APIC was being used and included the code to manage that in _PIC in their sample BIOS.  But they didn't.
>
> One possible workaround for this issue would be to provide a hacked PCI-PCI bridge driver for the PXHs that overrode the PCI interrupt routing such that the PCI device interrupts for child devices didn't use the APICs in the PXH at all, but used the IRQs that they get aliased to (such as IRQ 16 on 5.2+).  Getting that to work on 4.x might be quite painful since the 4.x PCI interrupt routing code is rather gross and hacky already.
>
> Hopefully this at least answers some questions and gives a good overview of what PCI interrupt routing is and how it works, etc.

My head hurts, but a lot makes more sense now.  I'll need to read this a few more times however.  if you made this into a web page and added a few diagrams, that would be amazing..  also you use a few acronyms without saying what they are..

Thanks!