From 4b5ff469234b8ab5cd05f4a201cbb229896729d0 Mon Sep 17 00:00:00 2001
From: Randy Dunlap
Date: Mon, 10 Mar 2008 17:16:32 -0700
Subject: PCI: doc/pci: create Documentation/PCI/ and move files into it

Create Documentation/PCI/ and move PCI-related files to it.
Fix a few instances of trailing whitespace.
Update references to the new file locations.

Signed-off-by: Randy Dunlap
Cc: Jesse Barnes
Signed-off-by: Greg Kroah-Hartman
---
 Documentation/00-INDEX                   |  10 -
 Documentation/PCI/00-INDEX               |  12 +
 Documentation/PCI/PCIEBUS-HOWTO.txt      | 217 +++++++++++
 Documentation/PCI/pci-error-recovery.txt | 396 +++++++++++++++++++
 Documentation/PCI/pci.txt                | 646 +++++++++++++++++++++++++++++++
 Documentation/PCI/pcieaer-howto.txt      | 253 ++++++++++++
 Documentation/PCIEBUS-HOWTO.txt          | 217 -----------
 Documentation/memory-barriers.txt        |   4 +-
 Documentation/pci-error-recovery.txt     | 396 -------------------
 Documentation/pci.txt                    | 646 -------------------------------
 Documentation/pcieaer-howto.txt          | 253 ------------
 11 files changed, 1526 insertions(+), 1524 deletions(-)
 create mode 100644 Documentation/PCI/00-INDEX
 create mode 100644 Documentation/PCI/PCIEBUS-HOWTO.txt
 create mode 100644 Documentation/PCI/pci-error-recovery.txt
 create mode 100644 Documentation/PCI/pci.txt
 create mode 100644 Documentation/PCI/pcieaer-howto.txt
 delete mode 100644 Documentation/PCIEBUS-HOWTO.txt
 delete mode 100644 Documentation/pci-error-recovery.txt
 delete mode 100644 Documentation/pci.txt
 delete mode 100644 Documentation/pcieaer-howto.txt

(limited to 'Documentation')

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index f7923a42e769..a82a113b4a4b 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -25,8 +25,6 @@ DMA-API.txt
 	- DMA API, pci_ API & extensions for non-consistent memory machines.
 DMA-ISA-LPC.txt
 	- How to do DMA with ISA (and LPC) devices.
-DMA-mapping.txt
-	- info for PCI drivers using DMA portably across all platforms.
 DocBook/
 	- directory with DocBook templates etc. for kernel documentation.
 HOWTO
@@ -43,8 +41,6 @@ ManagementStyle
 	- how to (attempt to) manage kernel hackers.
 MSI-HOWTO.txt
 	- the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
-PCIEBUS-HOWTO.txt
-	- a guide describing the PCI Express Port Bus driver.
 RCU/
 	- directory with info on RCU (read-copy update).
 README.DAC960
@@ -285,12 +281,6 @@ parport.txt
 	- how to use the parallel-port driver.
 parport-lowlevel.txt
 	- description and usage of the low level parallel port functions.
-pci-error-recovery.txt
-	- info on PCI error recovery.
-pci.txt
-	- info on the PCI subsystem for device driver authors.
-pcieaer-howto.txt
-	- the PCI Express Advanced Error Reporting Driver Guide HOWTO.
 pcmcia/
 	- info on the Linux PCMCIA driver.
 pi-futex.txt
diff --git a/Documentation/PCI/00-INDEX b/Documentation/PCI/00-INDEX
new file mode 100644
index 000000000000..49f43946c6b6
--- /dev/null
+++ b/Documentation/PCI/00-INDEX
@@ -0,0 +1,12 @@
+00-INDEX
+	- this file
+PCI-DMA-mapping.txt
+	- info for PCI drivers using DMA portably across all platforms
+PCIEBUS-HOWTO.txt
+	- a guide describing the PCI Express Port Bus driver
+pci-error-recovery.txt
+	- info on PCI error recovery
+pci.txt
+	- info on the PCI subsystem for device driver authors
+pcieaer-howto.txt
+	- the PCI Express Advanced Error Reporting Driver Guide HOWTO
diff --git a/Documentation/PCI/PCIEBUS-HOWTO.txt b/Documentation/PCI/PCIEBUS-HOWTO.txt
new file mode 100644
index 000000000000..9a07e38631b0
--- /dev/null
+++ b/Documentation/PCI/PCIEBUS-HOWTO.txt
@@ -0,0 +1,217 @@
+		The PCI Express Port Bus Driver Guide HOWTO
+	Tom L Nguyen tom.l.nguyen@intel.com
+			11/03/2004
+
+1. About this guide
+
+This guide describes the basics of the PCI Express Port Bus driver
+and provides information on how to enable the service drivers to
+register/unregister with the PCI Express Port Bus Driver.
+
+2. Copyright 2004 Intel Corporation
+
+3. What is the PCI Express Port Bus Driver
+
+A PCI Express Port is a logical PCI-PCI Bridge structure. There
+are two types of PCI Express Port: the Root Port and the Switch
+Port. The Root Port originates a PCI Express link from a PCI Express
+Root Complex and the Switch Port connects PCI Express links to
+internal logical PCI buses. The Switch Port, which has its secondary
+bus representing the switch's internal routing logic, is called the
+switch's Upstream Port. The switch's Downstream Port is bridging from
+switch's internal routing bus to a bus representing the downstream
+PCI Express link from the PCI Express Switch.
+
+A PCI Express Port can provide up to four distinct functions,
+referred to in this document as services, depending on its port type.
+PCI Express Port's services include native hotplug support (HP),
+power management event support (PME), advanced error reporting
+support (AER), and virtual channel support (VC). These services may
+be handled by a single complex driver or be individually distributed
+and handled by corresponding service drivers.
+
+4. Why use the PCI Express Port Bus Driver?
+
+In existing Linux kernels, the Linux Device Driver Model allows a
+physical device to be handled by only a single driver. The PCI
+Express Port is a PCI-PCI Bridge device with multiple distinct
+services. To maintain a clean and simple solution each service
+may have its own software service driver. In this case several
+service drivers will compete for a single PCI-PCI Bridge device.
+For example, if the PCI Express Root Port native hotplug service
+driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
+kernel therefore does not load other service drivers for that Root
+Port. In other words, it is impossible to have multiple service
+drivers load and run on a PCI-PCI Bridge device simultaneously
+using the current driver model.
+
+To enable multiple service drivers running simultaneously requires
+having a PCI Express Port Bus driver, which manages all populated
+PCI Express Ports and distributes all provided service requests
+to the corresponding service drivers as required. Some key
+advantages of using the PCI Express Port Bus driver are listed below:
+
+	- Allow multiple service drivers to run simultaneously on
+	  a PCI-PCI Bridge Port device.
+
+	- Allow service drivers to be implemented in an independent
+	  staged approach.
+
+	- Allow one service driver to run on multiple PCI-PCI Bridge
+	  Port devices.
+
+	- Manage and distribute resources of a PCI-PCI Bridge Port
+	  device to requested service drivers.
+
+5. Configuring the PCI Express Port Bus Driver vs. Service Drivers
+
+5.1 Including the PCI Express Port Bus Driver Support into the Kernel
+
+Including the PCI Express Port Bus driver depends on whether the PCI
+Express support is included in the kernel config. The kernel will
+automatically include the PCI Express Port Bus driver as a kernel
+driver when the PCI Express support is enabled in the kernel.
+
+5.2 Enabling Service Driver Support
+
+PCI device drivers are implemented based on Linux Device Driver Model.
+All service drivers are PCI device drivers. As discussed above, it is
+impossible to load any service driver once the kernel has loaded the
+PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver
+Model requires some minimal changes on existing service drivers that
+impose no impact on the functionality of existing service drivers.
+
+A service driver is required to use the two APIs shown below to
+register its service with the PCI Express Port Bus driver (see
+section 5.2.1 & 5.2.2). It is important that a service driver
+initializes the pcie_port_service_driver data structure, included in
+header file /include/linux/pcieport_if.h, before calling these APIs.
+Failure to do so will result in an identity mismatch, which prevents
+the PCI Express Port Bus driver from loading a service driver.
+
+5.2.1 pcie_port_service_register
+
+int pcie_port_service_register(struct pcie_port_service_driver *new)
+
+This API replaces the Linux Driver Model's pci_module_init API. A
+service driver should always call pcie_port_service_register at
+module init. Note that after the service driver is loaded, calls
+such as pci_enable_device(dev) and pci_set_master(dev) are no longer
+necessary since these calls are executed by the PCI Port Bus driver.
+
+5.2.2 pcie_port_service_unregister
+
+void pcie_port_service_unregister(struct pcie_port_service_driver *new)
+
+pcie_port_service_unregister replaces the Linux Driver Model's
+pci_unregister_driver. It's always called by service driver when a
+module exits.
+
+5.2.3 Sample Code
+
+Below is sample service driver code to initialize the port service
+driver data structure.
+
+static struct pcie_port_service_id service_id[] = { {
+	.vendor = PCI_ANY_ID,
+	.device = PCI_ANY_ID,
+	.port_type = PCIE_RC_PORT,
+	.service_type = PCIE_PORT_SERVICE_AER,
+	}, { /* end: all zeroes */ }
+};
+
+static struct pcie_port_service_driver root_aerdrv = {
+	.name		= (char *)device_name,
+	.id_table	= &service_id[0],
+
+	.probe		= aerdrv_load,
+	.remove		= aerdrv_unload,
+
+	.suspend	= aerdrv_suspend,
+	.resume		= aerdrv_resume,
+};
+
+Below is sample code for registering/unregistering a service
+driver.
+
+static int __init aerdrv_service_init(void)
+{
+	int retval = 0;
+
+	retval = pcie_port_service_register(&root_aerdrv);
+	if (!retval) {
+		/*
+		 * FIX ME
+		 */
+	}
+	return retval;
+}
+
+static void __exit aerdrv_service_exit(void)
+{
+	pcie_port_service_unregister(&root_aerdrv);
+}
+
+module_init(aerdrv_service_init);
+module_exit(aerdrv_service_exit);
+
+6. Possible Resource Conflicts
+
+Since all service drivers of a PCI-PCI Bridge Port device are
+allowed to run simultaneously, a few possible resource conflicts
+are listed below with proposed solutions.
+
+6.1 MSI Vector Resource
+
+The MSI capability structure enables a device software driver to call
+pci_enable_msi to request MSI based interrupts. Once MSI interrupts
+are enabled on a device, it stays in this mode until a device driver
+calls pci_disable_msi to disable MSI interrupts and revert back to
+INTx emulation mode. Since service drivers of the same PCI-PCI Bridge
+port share the same physical device, if an individual service driver
+calls pci_enable_msi/pci_disable_msi it may result in unpredictable
+behavior. For example, two service drivers run simultaneously on the
+same physical Root Port. Both service drivers call pci_enable_msi to
+request MSI based interrupts. A service driver may not know whether
+any other service drivers have run on this Root Port. If either one
+of them calls pci_disable_msi, it puts the other service driver
+in a wrong interrupt mode.
+
+To avoid this situation, service drivers are not permitted to
+switch the interrupt mode on their device. The PCI Express Port Bus
+driver is responsible for determining the interrupt mode and this
+should be transparent to service drivers. Service drivers need to know
+only the vector IRQ assigned to the field irq of struct pcie_device,
+which is passed in when the PCI Express Port Bus driver probes each
+service driver. Service drivers should use (struct pcie_device*)dev->irq
+to call request_irq/free_irq. In addition, the interrupt mode is stored
+in the field interrupt_mode of struct pcie_device.
+
+6.2 MSI-X Vector Resources
+
+Similar to the MSI a device driver for an MSI-X capable device can
+call pci_enable_msix to request MSI-X interrupts. Service drivers
+are not permitted to switch the interrupt mode on their device. The
+PCI Express Port Bus driver is responsible for determining the interrupt
+mode and this should be transparent to service drivers. Any attempt
+by a service driver to call pci_enable_msix/pci_disable_msix may
+result in unpredictable behavior. Service drivers should use
+(struct pcie_device*)dev->irq and call request_irq/free_irq.
+
+6.3 PCI Memory/IO Mapped Regions
+
+Service drivers for PCI Express Power Management (PME), Advanced
+Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
+PCI configuration space on the PCI Express port. In all cases the
+registers accessed are independent of each other. This patch assumes
+that all service drivers will be well behaved and not overwrite
+other service driver's configuration settings.
+
+6.4 PCI Config Registers
+
+Each service driver runs its PCI config operations on its own
+capability structure except the PCI Express capability structure, in
+which Root Control register and Device Control register are shared
+between PME and AER. This patch assumes that all service drivers
+will be well behaved and not overwrite other service driver's
+configuration settings.
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
new file mode 100644
index 000000000000..6650af432523
--- /dev/null
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -0,0 +1,396 @@
+
+                       PCI Error Recovery
+                       ------------------
+                        February 2, 2006
+
+                 Current document maintainer:
+                     Linas Vepstas
+
+
+Many PCI bus controllers are able to detect a variety of hardware
+PCI errors on the bus, such as parity errors on the data and address
+busses, as well as SERR and PERR errors. Some of the more advanced
+chipsets are able to deal with these errors; these include PCI-E chipsets,
+and the PCI-host bridges found on IBM Power4 and Power5-based pSeries
+boxes. A typical action taken is to disconnect the affected device,
+halting all I/O to it. The goal of a disconnection is to avoid system
+corruption; for example, to halt system memory corruption due to DMA's
+to "wild" addresses. Typically, a reconnection mechanism is also
+offered, so that the affected PCI device(s) are reset and put back
+into working condition. The reset phase requires coordination
+between the affected device drivers and the PCI controller chip.
+This document describes a generic API for notifying device drivers
+of a bus disconnection, and then performing error recovery.
+This API is currently implemented in the 2.6.16 and later kernels.
+
+Reporting and recovery is performed in several steps. First, when
+a PCI hardware error has resulted in a bus disconnect, that event
+is reported as soon as possible to all affected device drivers,
+including multiple instances of a device driver on multi-function
+cards. This allows device drivers to avoid deadlocking in spinloops,
+waiting for some i/o-space register to change, when it never will.
+It also gives the drivers a chance to defer incoming I/O as
+needed.
+
+Next, recovery is performed in several stages. Most of the complexity
+is forced by the need to handle multi-function devices, that is,
+devices that have multiple device drivers associated with them.
+In the first stage, each driver is allowed to indicate what type
+of reset it desires, the choices being a simple re-enabling of I/O
+or requesting a hard reset (a full electrical #RST of the PCI card).
+If any driver requests a full reset, that is what will be done.
+
+After a full reset and/or a re-enabling of I/O, all drivers are
+again notified, so that they may then perform any device setup/config
+that may be required. After these have all completed, a final
+"resume normal operations" event is sent out.
+
+The biggest reason for choosing a kernel-based implementation rather
+than a user-space implementation was the need to deal with bus
+disconnects of PCI devices attached to storage media, and, in particular,
+disconnects from devices holding the root file system. If the root
+file system is disconnected, a user-space mechanism would have to go
+through a large number of contortions to complete recovery. Almost all
+of the current Linux file systems are not tolerant of disconnection
+from/reconnection to their underlying block device. By contrast,
+bus errors are easy to manage in the device driver. Indeed, most
+device drivers already handle very similar recovery procedures;
+for example, the SCSI-generic layer already provides significant
+mechanisms for dealing with SCSI bus errors and SCSI bus resets.
+
+
+Detailed Design
+---------------
+Design and implementation details below, based on a chain of
+public email discussions with Ben Herrenschmidt, circa 5 April 2005.
+
+The error recovery API support is exposed to the driver in the form of
+a structure of function pointers pointed to by a new field in struct
+pci_driver. A driver that fails to provide the structure is "non-aware",
+and the actual recovery steps taken are platform dependent. The
+arch/powerpc implementation will simulate a PCI hotplug remove/add.
+
+This structure has the form:
+struct pci_error_handlers
+{
+	int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
+	int (*mmio_enabled)(struct pci_dev *dev);
+	int (*link_reset)(struct pci_dev *dev);
+	int (*slot_reset)(struct pci_dev *dev);
+	void (*resume)(struct pci_dev *dev);
+};
+
+The possible channel states are:
+enum pci_channel_state {
+	pci_channel_io_normal,  /* I/O channel is in normal state */
+	pci_channel_io_frozen,  /* I/O to channel is blocked */
+	pci_channel_io_perm_failure, /* PCI card is dead */
+};
+
+Possible return values are:
+enum pci_ers_result {
+	PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
+	PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
+	PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
+	PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
+	PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
+};
+
+A driver does not have to implement all of these callbacks; however,
+if it implements any, it must implement error_detected(). If a callback
+is not implemented, the corresponding feature is considered unsupported.
+For example, if mmio_enabled() and resume() aren't there, then it
+is assumed that the driver is not doing any direct recovery and requires
+a reset. If link_reset() is not implemented, the card is assumed not to
+care about link resets. Typically a driver will want to know about
+a slot_reset().
+
+The actual steps taken by a platform to recover from a PCI error
+event will be platform-dependent, but will follow the general
+sequence described below.
+
+STEP 0: Error Event
+-------------------
+A PCI bus error is detected by the PCI hardware. On powerpc, the slot
+is isolated, in that all I/O is blocked: all reads return 0xffffffff,
+all writes are ignored.
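
The callback rules above can be sketched in plain userspace C. This is
only an illustration, not kernel code: the types are simplified
stand-ins for the structures quoted above, and handlers_valid(),
assumes_reset() and the demo_* stubs are hypothetical names invented
for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel types quoted above. */
struct pci_dev;				/* opaque in this sketch */

enum pci_channel_state {
	pci_channel_io_normal,
	pci_channel_io_frozen,
	pci_channel_io_perm_failure,
};

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

/*
 * Hypothetical helper: a table is acceptable if it is absent (the
 * driver is "non-aware"), empty, or provides error_detected().
 */
static bool handlers_valid(const struct pci_error_handlers *h)
{
	bool any_other;

	if (h == NULL)
		return true;
	any_other = h->mmio_enabled || h->link_reset ||
		    h->slot_reset || h->resume;
	return !any_other || h->error_detected != NULL;
}

/*
 * Hypothetical helper: without mmio_enabled() and resume() the driver
 * is assumed to do no direct recovery and to require a reset.
 */
static bool assumes_reset(const struct pci_error_handlers *h)
{
	return h->mmio_enabled == NULL && h->resume == NULL;
}

/* Stub callbacks, used only to populate example tables. */
static int demo_error_detected(struct pci_dev *dev, enum pci_channel_state state)
{
	(void)dev;
	(void)state;
	return 0;
}

static int demo_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return 0;
}
```

A table that sets only slot_reset would be rejected by handlers_valid(),
while one that sets error_detected and slot_reset passes and is treated
as reset-only by assumes_reset().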
+
+
+STEP 1: Notification
+--------------------
+Platform calls the error_detected() callback on every instance of
+every driver affected by the error.
+
+At this point, the device might not be accessible anymore, depending on
+the platform (the slot will be isolated on powerpc). The driver may
+already have "noticed" the error because of a failing I/O, but this
+is the proper "synchronization point", that is, it gives the driver
+a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
+to complete; it can take semaphores, schedule, etc... everything but
+touch the device. Within this function and after it returns, the driver
+shouldn't do any new IOs. Called in task context. This is sort of a
+"quiesce" point. See note about interrupts at the end of this doc.
+
+All drivers participating in this system must implement this call.
+The driver must return one of the following result codes:
+		- PCI_ERS_RESULT_CAN_RECOVER:
+		  Driver returns this if it thinks it might be able to recover
+		  the HW by just banging IOs or if it wants to be given
+		  a chance to extract some diagnostic information (see
+		  mmio_enabled, below).
+		- PCI_ERS_RESULT_NEED_RESET:
+		  Driver returns this if it can't recover without a hard
+		  slot reset.
+		- PCI_ERS_RESULT_DISCONNECT:
+		  Driver returns this if it doesn't want to recover at all.
+
+The next step taken will depend on the result codes returned by the
+drivers.
+
+If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
+then the platform should re-enable IOs on the slot (or do nothing in
+particular, if the platform doesn't isolate slots), and recovery
+proceeds to STEP 2 (MMIO Enable).
+
+If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
+then recovery proceeds to STEP 4 (Slot Reset).
+
+If the platform is unable to recover the slot, the next step
+is STEP 6 (Permanent Failure).
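
The decision rules of STEP 1 can be sketched as a small userspace C
function. This is not how any platform actually implements it;
next_step_after_detect() and enum recovery_step are hypothetical names.
The text does not say what happens when no driver asks for a reset but
not all of them report CAN_RECOVER, so the sketch assumes that case
ends in permanent failure.

```c
#include <stddef.h>

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical labels for the steps named in the text. */
enum recovery_step {
	STEP_MMIO_ENABLED = 2,
	STEP_SLOT_RESET = 4,
	STEP_PERM_FAILURE = 6,
};

/*
 * Combine the error_detected() results of all drivers on a segment.
 * platform_can_recover models "the platform is unable to recover the
 * slot" when it is zero.
 */
static enum recovery_step
next_step_after_detect(const enum pci_ers_result *votes, size_t n,
		       int platform_can_recover)
{
	int all_can_recover = 1;
	int any_need_reset = 0;
	size_t i;

	if (!platform_can_recover)
		return STEP_PERM_FAILURE;

	for (i = 0; i < n; i++) {
		if (votes[i] == PCI_ERS_RESULT_NEED_RESET)
			any_need_reset = 1;
		if (votes[i] != PCI_ERS_RESULT_CAN_RECOVER)
			all_can_recover = 0;
	}

	if (any_need_reset)
		return STEP_SLOT_RESET;		/* one reset request is enough */
	if (all_can_recover)
		return STEP_MMIO_ENABLED;

	/* ASSUMPTION: remaining mixes (e.g. a DISCONNECT) are not recovered. */
	return STEP_PERM_FAILURE;
}
```

A single NEED_RESET result among several CAN_RECOVER results sends the
whole segment to the slot reset step, which matches the rule that a
reset requested by any driver is what will be done.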
+
+>>> The current powerpc implementation assumes that a device driver will
+>>> *not* schedule or semaphore in this routine; the current powerpc
+>>> implementation uses one kernel thread to notify all devices;
+>>> thus, if one device sleeps/schedules, all devices are affected.
+>>> Doing better requires complex multi-threaded logic in the error
+>>> recovery implementation (e.g. waiting for all notification threads
+>>> to "join" before proceeding with recovery.)  This seems excessively
+>>> complex and not worth implementing.
+
+>>> The current powerpc implementation doesn't much care if the device
+>>> attempts I/O at this point, or not.  I/O's will fail, returning
+>>> a value of 0xff on read, and writes will be dropped. If the device
+>>> driver attempts more than 10K I/O's to a frozen adapter, it will
+>>> assume that the device driver has gone into an infinite loop, and
+>>> it will panic the kernel. There doesn't seem to be any other
+>>> way of stopping a device driver that insists on spinning on I/O.
+
+STEP 2: MMIO Enabled
+-------------------
+The platform re-enables MMIO to the device (but typically not the
+DMA), and then calls the mmio_enabled() callback on all affected
+device drivers.
+
+This is the "early recovery" call. IOs are allowed again, but DMA is
+not (hrm... to be discussed, I prefer not), with some restrictions. This
+is NOT a callback for the driver to start operations again, only to
+peek/poke at the device, extract diagnostic information, if any, and
+eventually do things like trigger a device local reset or some such,
+but not restart operations. This callback is made if all drivers on
+a segment agree that they can try to recover and if no automatic link reset
+was performed by the HW. If the platform can't just re-enable IOs without
+a slot reset or a link reset, it won't call this callback, and instead
+will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
+
+>>> The following is proposed; no platform implements this yet:
+>>> Proposal: All I/O's should be done _synchronously_ from within
+>>> this callback, errors triggered by them will be returned via
+>>> the normal pci_check_whatever() API, no new error_detected()
+>>> callback will be issued due to an error happening here. However,
+>>> such an error might cause IOs to be re-blocked for the whole
+>>> segment, and thus invalidate the recovery that other devices
+>>> on the same segment might have done, forcing the whole segment
+>>> into one of the next states, that is, link reset or slot reset.
+
+The driver should return one of the following result codes:
+		- PCI_ERS_RESULT_RECOVERED
+		  Driver returns this if it thinks the device is fully
+		  functional and thinks it is ready to start
+		  normal driver operations again. There is no
+		  guarantee that the driver will actually be
+		  allowed to proceed, as another driver on the
+		  same segment might have failed and thus triggered a
+		  slot reset on platforms that support it.
+
+		- PCI_ERS_RESULT_NEED_RESET
+		  Driver returns this if it thinks the device is not
+		  recoverable in its current state and it needs a slot
+		  reset to proceed.
+
+		- PCI_ERS_RESULT_DISCONNECT
+		  Same as above. Total failure, no recovery even after
+		  reset; driver dead. (To be defined more precisely)
+
+The next step taken depends on the results returned by the drivers.
+If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
+proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
+
+If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
+proceeds to STEP 4 (Slot Reset).
+
+>>> The current powerpc implementation does not implement this callback.
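
As a rough illustration of what an mmio_enabled() callback might do, the
userspace C sketch below peeks at one status register and picks a result
code. Everything about the device is hypothetical: struct demo_dev,
DEMO_STATUS_FATAL and demo_read_status() model an imaginary adapter, and
a real driver would read the register through its MMIO mapping instead.

```c
#include <stdint.h>

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical device model: a single 32-bit status register. */
struct demo_dev {
	uint32_t status_reg;
};

#define DEMO_STATUS_FATAL 0x1u	/* hypothetical "fatal error latched" bit */

static uint32_t demo_read_status(const struct demo_dev *d)
{
	return d->status_reg;	/* stands in for an MMIO register read */
}

/*
 * Early-recovery callback sketch: only peek at the device and report
 * its condition; normal operation is not restarted here.
 */
static enum pci_ers_result demo_mmio_enabled(struct demo_dev *d)
{
	uint32_t status = demo_read_status(d);

	/* All-ones is what a read from a still-isolated slot returns. */
	if (status == 0xffffffffu)
		return PCI_ERS_RESULT_NEED_RESET;

	if (status & DEMO_STATUS_FATAL)
		return PCI_ERS_RESULT_NEED_RESET;

	return PCI_ERS_RESULT_RECOVERED;
}
```

The callback only inspects the device and reports; restarting normal
I/O is left to the later resume() step.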
+
+
+STEP 3: Link Reset
+------------------
+The platform resets the link, and then calls the link_reset() callback
+on all affected device drivers.  This is a PCI-Express specific state
+and is done whenever a non-fatal error has been detected that can be
+"solved" by resetting the link. This call informs the driver of the
+reset and the driver should check to see if the device appears to be
+in working condition.
+
+The driver is not supposed to restart normal driver I/O operations
+at this point.  It should limit itself to "probing" the device to
+check its recoverability status. If all is right, then the platform
+will call resume() once all drivers have ack'd link_reset().
+
+	Result codes:
+		(identical to STEP 2 (MMIO Enabled))
+
+The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
+(Resume Operations).
+
+>>> The current powerpc implementation does not implement this callback.
+
+
+STEP 4: Slot Reset
+------------------
+The platform performs a soft or hard reset of the device, and then
+calls the slot_reset() callback.
+
+A soft reset consists of asserting the adapter #RST line and then
+restoring the PCI BAR's and PCI configuration header to a state
+that is equivalent to what it would be after a fresh system
+power-on followed by power-on BIOS/system firmware initialization.
+If the platform supports PCI hotplug, then the reset might be
+performed by toggling the slot electrical power off/on.
+
+It is important for the platform to restore the PCI config space
+to the "fresh poweron" state, rather than the "last state". After
+a slot reset, the device driver will almost always use its standard
+device initialization routines, and an unusual config space setup
+may result in hung devices, kernel panics, or silent data corruption.
+
+This call gives drivers the chance to re-initialize the hardware
+(re-download firmware, etc.).  At this point, the driver may assume
+that the card is in a fresh state and is fully functional. In
+particular, interrupt generation should work normally.
+
+Drivers should not yet restart normal I/O processing operations
+at this point.  If all device drivers report success on this
+callback, the platform will call resume() to complete the sequence,
+and let the driver restart normal I/O processing.
+
+A driver can still return a critical failure for this function if
+it can't get the device operational after reset.  If the platform
+previously tried a soft reset, it might now try a hard reset (power
+cycle) and then call slot_reset() again.  If the device still can't
+be recovered, there is nothing more that can be done;  the platform
+will typically report a "permanent failure" in such a case.  The
+device will be considered "dead" in this case.
+
+Drivers for multi-function cards will need to coordinate among
+themselves as to which driver instance will perform any "one-shot"
+or global device initialization. For example, the Symbios sym53c8xx_2
+driver performs device init only from PCI function 0:
+
++	if (PCI_FUNC(pdev->devfn) == 0)
++		sym_reset_scsi_bus(np, 0);
+
+	Result codes:
+		- PCI_ERS_RESULT_DISCONNECT
+		Same as above.
+
+Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
+Failure).
+
+>>> The current powerpc implementation does not currently try a
+>>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
+>>> However, it probably should.
+
+
+STEP 5: Resume Operations
+-------------------------
+The platform will call the resume() callback on all affected device
+drivers if all drivers on the segment have returned
+PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
+The goal of this callback is to tell the driver to restart activity,
+that everything is back and running. This callback does not return
+a result code.
+
+At this point, if a new error happens, the platform will restart
+a new error recovery sequence.
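
Seen from the driver side, the reset path (error_detected, slot_reset,
resume) amounts to a small state machine. The userspace C sketch below
models it with two flags. struct demo_drv and the demo_* functions are
hypothetical and only track state; a real driver would also stop timers,
re-download firmware and so on at the marked points.

```c
#include <stdbool.h>

enum pci_channel_state {
	pci_channel_io_normal,
	pci_channel_io_frozen,
	pci_channel_io_perm_failure,
};

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical per-device driver state. */
struct demo_drv {
	bool io_allowed;	/* may the driver start new I/O? */
	bool hw_initialized;	/* has device init been (re)done? */
};

/* STEP 1: quiesce and ask for a slot reset. */
static enum pci_ers_result
demo_error_detected(struct demo_drv *drv, enum pci_channel_state state)
{
	(void)state;		/* this sketch always requests a reset */
	drv->io_allowed = false;	/* no new I/O from here on */
	drv->hw_initialized = false;
	return PCI_ERS_RESULT_NEED_RESET;
}

/* STEP 4: re-initialize the hardware, but do not restart normal I/O. */
static enum pci_ers_result demo_slot_reset(struct demo_drv *drv)
{
	drv->hw_initialized = true;	/* device init would run here */
	return PCI_ERS_RESULT_RECOVERED;
}

/* STEP 5: only now may normal I/O processing restart. */
static void demo_resume(struct demo_drv *drv)
{
	if (drv->hw_initialized)
		drv->io_allowed = true;
}
```

Keeping io_allowed false until demo_resume() runs mirrors the rule that
drivers should not restart normal I/O processing from slot_reset().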
+
+STEP 6: Permanent Failure
+-------------------------
+A "permanent failure" has occurred, and the platform cannot recover
+the device.  The platform will call error_detected() with a
+pci_channel_state value of pci_channel_io_perm_failure.
+
+The device driver should, at this point, assume the worst. It should
+cancel all pending I/O, refuse all new I/O, returning -EIO to
+higher layers. The device driver should then clean up all of its
+memory and remove itself from kernel operations, much as it would
+during system shutdown.
+
+The platform will typically notify the system operator of the
+permanent failure in some way.  If the device is hotplug-capable,
+the operator will probably want to remove and replace the device.
+Note, however, not all failures are truly "permanent". Some are
+caused by over-heating, some by a poorly seated card. Many
+PCI error events are caused by software bugs, e.g. DMA's to
+wild addresses or bogus split transactions due to programming
+errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
+for additional detail on real-life experience of the causes of
+software errors.
+
+
+Conclusion; General Remarks
+---------------------------
+The way those callbacks are called is platform policy. A platform with
+no slot reset capability may want to just "ignore" drivers that can't
+recover (disconnect them) and try to let other cards on the same segment
+recover. Keep in mind that in most real life cases, though, there will
+be only one driver per segment.
+
+Now, a note about interrupts. If you get an interrupt and your
+device is dead or has been isolated, there is a problem :)
+The current policy is to turn this into a platform policy.
+That is, the recovery API only requires that:
+
+ - There is no guarantee that interrupt delivery can proceed from any
+device on the segment starting from the error detection and until the
+resume callback is sent, at which point interrupts are expected to be
+fully operational.
+
+ - There is no guarantee that interrupt delivery is stopped, that is,
+a driver that gets an interrupt after detecting an error, or that detects
+an error within the interrupt handler such that it prevents proper
+ack'ing of the interrupt (and thus removal of the source) should just
+return IRQ_NONE. It's up to the platform to deal with that
+condition, typically by masking the IRQ source during the duration of
+the error handling. It is expected that the platform "knows" which
+interrupts are routed to error-management capable slots and can deal
+with temporarily disabling that IRQ number during error processing (this
+isn't terribly complex). That means some IRQ latency for other devices
+sharing the interrupt, but there is simply no other way. High end
+platforms aren't supposed to share interrupts between many devices
+anyway :)
+
+>>> Implementation details for the powerpc platform are discussed in
+>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
+
+>>> As of this writing, there are six device drivers with patches
+>>> implementing error recovery. Not all of these patches are in
+>>> mainline yet. These may be used as "examples":
+>>>
+>>> drivers/scsi/ipr.c
+>>> drivers/scsi/sym53c8xx_2
+>>> drivers/net/e100.c
+>>> drivers/net/e1000
+>>> drivers/net/ixgb
+>>> drivers/net/s2io.c
+
+The End
+-------
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt
new file mode 100644
index 000000000000..8d4dc6250c58
--- /dev/null
+++ b/Documentation/PCI/pci.txt
@@ -0,0 +1,646 @@
+
+			How To Write Linux PCI Drivers
+
+		by Martin Mares on 07-Feb-2000
+	updated by Grant Grundler on 23-Dec-2006
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The world of PCI is vast and full of (mostly unpleasant) surprises.
+Since each CPU architecture implements different chip-sets and PCI devices +have different requirements (erm, "features"), the result is the PCI support +in the Linux kernel is not as trivial as one would wish. This short paper +tries to introduce all potential driver authors to Linux APIs for +PCI device drivers. + +A more complete resource is the third edition of "Linux Device Drivers" +by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. +LDD3 is available for free (under Creative Commons License) from: + + http://lwn.net/Kernel/LDD3/ + +However, keep in mind that all documents are subject to "bit rot". +Refer to the source code if things are not working as described here. + +Please send questions/comments/patches about Linux PCI API to the +"Linux PCI" mailing list. + + + +0. Structure of PCI drivers +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +PCI drivers "discover" PCI devices in a system via pci_register_driver(). +Actually, it's the other way around. When the PCI generic code discovers +a new device, the driver with a matching "description" will be notified. +Details on this below. + +pci_register_driver() leaves most of the probing for devices to +the PCI layer and supports online insertion/removal of devices [thus +supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. +pci_register_driver() call requires passing in a table of function +pointers and thus dictates the high level structure of a driver. + +Once the driver knows about a PCI device and takes ownership, the +driver generally needs to perform the following initialization: + + Enable the device + Request MMIO/IOP resources + Set the DMA mask size (for both coherent and streaming DMA) + Allocate and initialize shared control data (pci_allocate_coherent()) + Access device configuration space (if needed) + Register IRQ handler (request_irq()) + Initialize non-PCI (i.e. 
LAN/SCSI/etc parts of the chip) + Enable DMA/processing engines + +When done using the device, and perhaps the module needs to be unloaded, +the driver needs to take the following steps: + Disable the device from generating IRQs + Release the IRQ (free_irq()) + Stop all DMA activity + Release DMA buffers (both streaming and coherent) + Unregister from other subsystems (e.g. scsi or netdev) + Release MMIO/IOP resources + Disable the device + +Most of these topics are covered in the following sections. +For the rest look at LDD3 or <linux/pci.h>. + +If the PCI subsystem is not configured (CONFIG_PCI is not set), most of +the PCI functions described below are defined as inline functions either +completely empty or just returning appropriate error codes to avoid +lots of ifdefs in the drivers. + + + +1. pci_register_driver() call +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +PCI device drivers call pci_register_driver() during their +initialization with a pointer to a structure describing the driver +(struct pci_driver): + + field name Description + ---------- ------------------------------------------------------ + id_table Pointer to table of device ID's the driver is + interested in. Most drivers should export this + table using MODULE_DEVICE_TABLE(pci,...). + + probe This probing function gets called (during execution + of pci_register_driver() for already existing + devices or later if a new device gets inserted) for + all PCI devices which match the ID table and are not + "owned" by the other drivers yet. This function gets + passed a "struct pci_dev *" for each device whose + entry in the ID table matches the device. The probe + function returns zero when the driver chooses to + take "ownership" of the device or an error code + (negative number) otherwise. + The probe function always gets called from process + context, so it can sleep.
+ + remove The remove() function gets called whenever a device + being handled by this driver is removed (either during + deregistration of the driver or when it's manually + pulled out of a hot-pluggable slot). + The remove function always gets called from process + context, so it can sleep. + + suspend Put device into low power state. + suspend_late Put device into low power state. + + resume_early Wake device from low power state. + resume Wake device from low power state. + + (Please see Documentation/power/pci.txt for descriptions + of PCI Power Management and the related functions.) + + shutdown Hook into reboot_notifier_list (kernel/sys.c). + Intended to stop any idling DMA operations. + Useful for enabling wake-on-lan (NIC) or changing + the power state of a device before reboot. + e.g. drivers/net/e100.c. + + err_handler See Documentation/PCI/pci-error-recovery.txt + + +The ID table is an array of struct pci_device_id entries ending with an +all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred +method of declaring the table. Each entry consists of: + + vendor,device Vendor and device ID to match (or PCI_ANY_ID) + + subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID) + subdevice, + + class Device class, subclass, and "interface" to match. + See Appendix D of the PCI Local Bus Spec or + include/linux/pci_ids.h for a full list of classes. + Most drivers do not need to specify class/class_mask + as vendor/device is normally sufficient. + + class_mask limit which sub-fields of the class field are compared. + See drivers/scsi/sym53c8xx_2/ for example of usage. + + driver_data Data private to the driver. + Most drivers don't need to use driver_data field. + Best practice is to use driver_data as an index + into a static list of equivalent device types, + instead of using it as a pointer. + + +Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up +a pci_device_id table. 
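The matching rule behind the ID table can be sketched in userspace C. This is not kernel code: every demo_ name and every ID value below is made up for illustration, and the struct only models the pci_device_id fields listed above. The sketch assumes an entry matches when each ID field either equals the device's value or is the wildcard, and the class bits selected by class_mask agree.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PCI_ANY_ID 0xffffffffu	/* models PCI_ANY_ID */

/* Userspace model of struct pci_device_id (hypothetical, for illustration). */
struct demo_pci_device_id {
	uint32_t vendor, device;	/* vendor and device ID, or wildcard */
	uint32_t subvendor, subdevice;	/* subsystem IDs, or wildcard */
	uint32_t dev_class, class_mask;	/* "class" and its compare mask */
	unsigned long driver_data;	/* private to the driver */
};

/* Models a PCI_DEVICE()-style entry: vendor/device only, rest wildcarded. */
#define DEMO_PCI_DEVICE(vend, dev) \
	{ (vend), (dev), DEMO_PCI_ANY_ID, DEMO_PCI_ANY_ID, 0, 0, 0 }

/* driver_data used as an index into a list of board types, not a pointer. */
enum demo_board { DEMO_BOARD_A = 0, DEMO_BOARD_B = 1 };

/* Example table; the IDs are arbitrary. It ends with an all-zero entry. */
static const struct demo_pci_device_id demo_ids[] = {
	DEMO_PCI_DEVICE(0x1234, 0x0001),
	{ 0x1234, 0x0002, DEMO_PCI_ANY_ID, DEMO_PCI_ANY_ID, 0, 0, DEMO_BOARD_B },
	{ 0, 0, 0, 0, 0, 0, 0 }
};

/* The identity of a device as seen by the matching code. */
struct demo_pci_dev {
	uint32_t vendor, device, subvendor, subdevice, dev_class;
};

static int demo_field_ok(uint32_t want, uint32_t have)
{
	return want == DEMO_PCI_ANY_ID || want == have;
}

/* Return the first table entry matching dev, or NULL if none matches. */
static const struct demo_pci_device_id *
demo_match(const struct demo_pci_device_id *ids, const struct demo_pci_dev *dev)
{
	for (; ids->vendor || ids->subvendor || ids->class_mask; ids++) {
		if (demo_field_ok(ids->vendor, dev->vendor) &&
		    demo_field_ok(ids->device, dev->device) &&
		    demo_field_ok(ids->subvendor, dev->subvendor) &&
		    demo_field_ok(ids->subdevice, dev->subdevice) &&
		    !((ids->dev_class ^ dev->dev_class) & ids->class_mask))
			return ids;
	}
	return NULL;
}
```

A probe-style caller would pass the matched entry's driver_data on as the board index, which is why an index is easier to keep valid than a pointer.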
+ +New PCI IDs may be added to a device driver pci_ids table at runtime +as shown below: + +echo "vendor device subvendor subdevice class class_mask driver_data" > \ +/sys/bus/pci/drivers/{driver}/new_id + +All fields are passed in as hexadecimal values (no leading 0x). +The vendor and device fields are mandatory, the others are optional. Users +need pass only as many optional fields as necessary: + o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) + o class and class_mask fields default to 0 + o driver_data defaults to 0UL. + +Once added, the driver probe routine will be invoked for any unclaimed +PCI devices listed in its (newly updated) pci_ids list. + +When the driver exits, it just calls pci_unregister_driver() and the PCI layer +automatically calls the remove hook for all devices handled by the driver. + + +1.1 "Attributes" for driver functions/data + +Please mark the initialization and cleanup functions where appropriate +(the corresponding macros are defined in <linux/init.h>): + + __init Initialization code. Thrown away after the driver + initializes. + __exit Exit code. Ignored for non-modular drivers. + + + __devinit Device initialization code. + Identical to __init if the kernel is not compiled + with CONFIG_HOTPLUG, normal function otherwise. + __devexit The same for __exit. + +Tips on when/where to use the above attributes: + o The module_init()/module_exit() functions (and all + initialization functions called _only_ from these) + should be marked __init/__exit. + + o Do not mark the struct pci_driver. + + o The ID table array should be marked __devinitconst; this is done + automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE(). + + o The probe() and remove() functions should be marked __devinit + and __devexit respectively. All initialization functions + exclusively called by the probe() routine, can be marked __devinit. + Ditto for remove() and __devexit.
+ + o If mydriver_remove() is marked with __devexit(), then all address + references to mydriver_remove must use __devexit_p(mydriver_remove) + (in the struct pci_driver declaration for example). + __devexit_p() will generate the function name _or_ NULL if the + function will be discarded. For an example, see drivers/net/tg3.c. + + o Do NOT mark a function if you are not sure which mark to use. + Better to not mark the function than mark the function wrong. + + + +2. How to find PCI devices manually +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +PCI drivers should have a really good reason for not using the +pci_register_driver() interface to search for PCI devices. +The main reason PCI devices are controlled by multiple drivers +is because one PCI device implements several different HW services. +E.g. combined serial/parallel port/floppy controller. + +A manual search may be performed using the following constructs: + +Searching by vendor and device ID: + + struct pci_dev *dev = NULL; + while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) + configure_device(dev); + +Searching by class ID (iterate in a similar way): + + pci_get_class(CLASS_ID, dev) + +Searching by both vendor/device and subsystem vendor/device ID: + + pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). + +You can use the constant PCI_ANY_ID as a wildcard replacement for +VENDOR_ID or DEVICE_ID. This allows searching for any device from a +specific vendor, for example. + +These functions are hotplug-safe. They increment the reference count on +the pci_dev that they return. You must eventually (possibly at module unload) +decrement the reference count on these devices by calling pci_dev_put(). + + + +3. 
Device Initialization Steps +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As noted in the introduction, most PCI drivers need the following steps +for device initialization: + + Enable the device + Request MMIO/IOP resources + Set the DMA mask size (for both coherent and streaming DMA) + Allocate and initialize shared control data (pci_allocate_coherent()) + Access device configuration space (if needed) + Register IRQ handler (request_irq()) + Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) + Enable DMA/processing engines. + +The driver can access PCI config space registers at any time. +(Well, almost. When running BIST, config space can go away...but +that will just result in a PCI Bus Master Abort and config reads +will return garbage). + + +3.1 Enable the PCI device +~~~~~~~~~~~~~~~~~~~~~~~~~ +Before touching any device registers, the driver needs to enable +the PCI device by calling pci_enable_device(). This will: + o wake up the device if it was in suspended state, + o allocate I/O and memory regions of the device (if BIOS did not), + o allocate an IRQ (if BIOS did not). + +NOTE: pci_enable_device() can fail! Check the return value. + +[ OS BUG: we don't check resource allocations before enabling those + resources. The sequence would make more sense if we called + pci_request_resources() before calling pci_enable_device(). + Currently, the device drivers can't detect the bug when two + devices have been allocated the same range. This is not a common + problem and unlikely to get fixed soon. + + This has been discussed before but not changed as of 2.6.19: + http://lkml.org/lkml/2006/3/2/194 +] + +pci_set_master() will enable DMA by setting the bus master bit +in the PCI_COMMAND register. It also fixes the latency timer value if +it's set to something bogus by the BIOS. + +If the PCI device can use the PCI Memory-Write-Invalidate transaction, +call pci_set_mwi().
This enables the PCI_COMMAND bit for Mem-Wr-Inval +and also ensures that the cache line size register is set correctly. +Check the return value of pci_set_mwi() as not all architectures +or chip-sets may support Memory-Write-Invalidate. Alternatively, +if Mem-Wr-Inval would be nice to have but is not required, call +pci_try_set_mwi() to have the system do its best effort at enabling +Mem-Wr-Inval. + + +3.2 Request MMIO/IOP resources +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Memory (MMIO), and I/O port addresses should NOT be read directly +from the PCI device config space. Use the values in the pci_dev structure +as the PCI "bus address" might have been remapped to a "host physical" +address by the arch/chip-set specific kernel support. + +See Documentation/IO-mapping.txt for how to access device registers +or device memory. + +The device driver needs to call pci_request_region() to verify +no other device is already using the same address resource. +Conversely, drivers should call pci_release_region() AFTER +calling pci_disable_device(). +The idea is to prevent two devices colliding on the same address range. + +[ See OS BUG comment above. Currently (2.6.19), The driver can only + determine MMIO and IO Port resource availability _after_ calling + pci_enable_device(). ] + +Generic flavors of pci_request_region() are request_mem_region() +(for MMIO ranges) and request_region() (for IO Port ranges). +Use these for address resources that are not described by "normal" PCI +BARs. + +Also see pci_request_selected_regions() below. + + +3.3 Set the DMA mask size +~~~~~~~~~~~~~~~~~~~~~~~~~ +[ If anything below doesn't make sense, please refer to + Documentation/DMA-API.txt. This section is just a reminder that + drivers need to indicate DMA capabilities of the device and is not + an authoritative source for DMA interfaces. ] + +While all drivers should explicitly indicate the DMA capability +(e.g. 
32 or 64 bit) of the PCI bus master, devices with more than +32-bit bus master capability for streaming data need the driver +to "register" this capability by calling pci_set_dma_mask() with +appropriate parameters. In general this allows more efficient DMA +on systems where System RAM exists above 4G _physical_ address. + +Drivers for all PCI-X and PCIe compliant devices must call +pci_set_dma_mask() as they are 64-bit DMA devices. + +Similarly, drivers must also "register" this capability if the device +can directly address "consistent memory" in System RAM above 4G physical +address by calling pci_set_consistent_dma_mask(). +Again, this includes drivers for all PCI-X and PCIe compliant devices. +Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are +64-bit DMA capable for payload ("streaming") data but not control +("consistent") data. + + +3.4 Setup shared control data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) +memory. See Documentation/DMA-API.txt for a full description of +the DMA APIs. This section is just a reminder that it needs to be done +before enabling DMA on the device. + + +3.5 Initialize device registers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Some drivers will need specific "capability" fields programmed +or other "vendor specific" register initialized or reset. +E.g. clearing pending interrupts. + + +3.6 Register IRQ handler +~~~~~~~~~~~~~~~~~~~~~~~~ +While calling request_irq() is the last step described here, +this is often just another intermediate step to initialize a device. +This step can often be deferred until the device is opened for use. + +All interrupt handlers for IRQ lines should be registered with IRQF_SHARED +and use the devid to map IRQs to devices (remember that all PCI IRQ lines +can be shared). + +request_irq() will associate an interrupt handler and device handle +with an interrupt number. 
Historically interrupt numbers represent +IRQ lines which run from the PCI device to the Interrupt controller. +With MSI and MSI-X (more below) the interrupt number is a CPU "vector". + +request_irq() also enables the interrupt. Make sure the device is +quiesced and does not have any interrupts pending before registering +the interrupt handler. + +MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" +which deliver interrupts to the CPU via a DMA write to a Local APIC. +The fundamental difference between MSI and MSI-X is how multiple +"vectors" get allocated. MSI requires contiguous blocks of vectors +while MSI-X can allocate several individual ones. + +MSI capability can be enabled by calling pci_enable_msi() or +pci_enable_msix() before calling request_irq(). This causes +the PCI support to program CPU vector data into the PCI device +capability registers. + +If your PCI device supports both, try to enable MSI-X first. +Only one can be enabled at a time. Many architectures, chip-sets, +or BIOSes do NOT support MSI or MSI-X and the call to pci_enable_msi/msix +will fail. This is important to note since many drivers have +two (or more) interrupt handlers: one for MSI/MSI-X and another for IRQs. +They choose which handler to register with request_irq() based on the +return value from pci_enable_msi/msix(). + +There are (at least) two really good reasons for using MSI: +1) MSI is an exclusive interrupt vector by definition. + This means the interrupt handler doesn't have to verify + its device caused the interrupt. + +2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed + to be visible to the host CPU(s) when the MSI is delivered. This + is important for both data coherency and avoiding stale control data. + This guarantee allows the driver to omit MMIO reads to flush + the DMA stream. + +See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples +of MSI/MSI-X usage. + + + +4. 
PCI device shutdown +~~~~~~~~~~~~~~~~~~~~~~~ + +When a PCI device driver is being unloaded, most of the following +steps need to be performed: + + Disable the device from generating IRQs + Release the IRQ (free_irq()) + Stop all DMA activity + Release DMA buffers (both streaming and consistent) + Unregister from other subsystems (e.g. scsi or netdev) + Disable device from responding to MMIO/IO Port addresses + Release MMIO/IO Port resource(s) + + +4.1 Stop IRQs on the device +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +How to do this is chip/device specific. If it's not done, it opens +the possibility of a "screaming interrupt" if (and only if) +the IRQ is shared with another device. + +When the shared IRQ handler is "unhooked", the remaining devices +using the same IRQ line will still need the IRQ enabled. Thus if the +"unhooked" device asserts IRQ line, the system will respond assuming +it was one of the remaining devices asserted the IRQ line. Since none +of the other devices will handle the IRQ, the system will "hang" until +it decides the IRQ isn't going to get handled and masks the IRQ (100,000 +iterations later). Once the shared IRQ is masked, the remaining devices +will stop functioning properly. Not a nice situation. + +This is another reason to use MSI or MSI-X if it's available. +MSI and MSI-X are defined to be exclusive interrupts and thus +are not susceptible to the "screaming interrupt" problem. + + +4.2 Release the IRQ +~~~~~~~~~~~~~~~~~~~ +Once the device is quiesced (no more IRQs), one can call free_irq(). +This function will return control once any pending IRQs are handled, +"unhook" the drivers IRQ handler from that IRQ, and finally release +the IRQ if no one else is using it. + + +4.3 Stop all DMA activity +~~~~~~~~~~~~~~~~~~~~~~~~~ +It's extremely important to stop all DMA operations BEFORE attempting +to deallocate DMA control data. Failure to do so can result in memory +corruption, hangs, and on some chip-sets a hard crash. 
+ +Stopping DMA after stopping the IRQs can avoid races where the +IRQ handler might restart DMA engines. + +While this step sounds obvious and trivial, several "mature" drivers +didn't get this step right in the past. + + +4.4 Release DMA buffers +~~~~~~~~~~~~~~~~~~~~~~~ +Once DMA is stopped, clean up streaming DMA first. +I.e. unmap data buffers and return buffers to "upstream" +owners if there is one. + +Then clean up "consistent" buffers which contain the control data. + +See Documentation/DMA-API.txt for details on unmapping interfaces. + + +4.5 Unregister from other subsystems +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Most low level PCI device drivers support some other subsystem +like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your +driver isn't losing resources from that other subsystem. +If this happens, typically the symptom is an Oops (panic) when +the subsystem attempts to call into a driver that has been unloaded. + + +4.6 Disable Device from responding to MMIO/IO Port addresses +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +io_unmap() MMIO or IO Port resources and then call pci_disable_device(). +This is the symmetric opposite of pci_enable_device(). +Do not access device registers after calling pci_disable_device(). + + +4.7 Release MMIO/IO Port Resource(s) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Call pci_release_region() to mark the MMIO or IO Port range as available. +Failure to do so usually results in the inability to reload the driver. + + + +5. How to access PCI config space +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can use pci_(read|write)_config_(byte|word|dword) to access the config +space of a device represented by struct pci_dev *. All these functions return 0 +when successful or an error code (PCIBIOS_...) which can be translated to a text +string by pcibios_strerror. Most drivers expect that accesses to valid PCI +devices don't fail. 
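The calling pattern for config space accessors can be sketched in userspace C. This is only a model under stated assumptions: the demo_ functions and the 256-byte array stand in for a real device, the error value is illustrative, and the register offsets and bus master bit are assumed to mirror the standard config header. It shows symbolic offsets, a read-modify-write of the command register, and checking the return value of every access.

```c
#include <stdint.h>

/* Illustrative return codes, modelled on the PCIBIOS_... style. */
#define DEMO_SUCCESSFUL			0x00
#define DEMO_BAD_REGISTER_NUMBER	0x87

/* Symbolic offsets in the standard header (assumed for this model). */
#define DEMO_PCI_VENDOR_ID		0x00
#define DEMO_PCI_COMMAND		0x04
#define DEMO_PCI_COMMAND_MASTER		0x4	/* bus master enable bit */

/* Hypothetical device: 256 bytes of little-endian config space. */
struct demo_pci_dev {
	uint8_t cfg[256];
};

/* Read a 16-bit register; rejects out-of-range or misaligned offsets. */
static int demo_read_config_word(const struct demo_pci_dev *dev, int where,
				 uint16_t *val)
{
	if (where < 0 || where > 254 || (where & 1))
		return DEMO_BAD_REGISTER_NUMBER;
	*val = (uint16_t)(dev->cfg[where] | (dev->cfg[where + 1] << 8));
	return DEMO_SUCCESSFUL;
}

/* Write a 16-bit register with the same offset checks. */
static int demo_write_config_word(struct demo_pci_dev *dev, int where,
				  uint16_t val)
{
	if (where < 0 || where > 254 || (where & 1))
		return DEMO_BAD_REGISTER_NUMBER;
	dev->cfg[where] = (uint8_t)(val & 0xff);
	dev->cfg[where + 1] = (uint8_t)(val >> 8);
	return DEMO_SUCCESSFUL;
}

/* Read-modify-write of the command register to set the bus master bit. */
static int demo_set_bus_master(struct demo_pci_dev *dev)
{
	uint16_t cmd;
	int err = demo_read_config_word(dev, DEMO_PCI_COMMAND, &cmd);

	if (err)
		return err;
	return demo_write_config_word(dev, DEMO_PCI_COMMAND,
				      cmd | DEMO_PCI_COMMAND_MASTER);
}
```

In a real driver pci_set_master() already does this particular update; the sketch is only meant to show the shape of an access.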
+ +If you don't have a struct pci_dev available, you can call +pci_bus_(read|write)_config_(byte|word|dword) to access a given device +and function on that bus. + +If you access fields in the standard portion of the config header, please +use symbolic names of locations and bits declared in <linux/pci.h>. + +If you need to access Extended PCI Capability registers, just call +pci_find_capability() for the particular capability and it will find the +corresponding register block for you. + + + +6. Other interesting functions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +pci_find_slot() Find pci_dev corresponding to given bus and + slot numbers. +pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) +pci_find_capability() Find specified capability in device's capability + list. +pci_resource_start() Returns bus start address for a given PCI region +pci_resource_end() Returns bus end address for a given PCI region +pci_resource_len() Returns the byte length of a PCI region +pci_set_drvdata() Set private driver data pointer for a pci_dev +pci_get_drvdata() Return private driver data pointer for a pci_dev +pci_set_mwi() Enable Memory-Write-Invalidate transactions. +pci_clear_mwi() Disable Memory-Write-Invalidate transactions. + + + +7. Miscellaneous hints +~~~~~~~~~~~~~~~~~~~~~~ + +When displaying PCI device names to the user (for example when a driver wants +to tell the user what card it has found), please use pci_name(pci_dev). + +Always refer to the PCI devices by a pointer to the pci_dev structure. +All PCI layer functions use this identification and it's the only +reasonable one. Don't use bus/slot/function numbers except for very +special purposes -- on systems with multiple primary buses their semantics +can be pretty complex. + +Don't try to turn on Fast Back to Back writes in your driver. All devices +on the bus need to be capable of doing it, so this is something which needs +to be handled by platform and generic code, not individual drivers. + + + +8.
Vendor and device identifications +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +One is not required to add new device ids to include/linux/pci_ids.h. +Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids. + +PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary +hex numbers (vendor controlled) and normally used only in a single +location, the pci_device_id table. + +Please DO submit new vendor/device ids to pciids.sourceforge.net project. + + + +9. Obsolete functions +~~~~~~~~~~~~~~~~~~~~~ + +There are several functions which you might come across when trying to +port an old driver to the new PCI interface. They are no longer present +in the kernel as they aren't compatible with hotplug or PCI domains or +having sane locking. + +pci_find_device() Superseded by pci_get_device() +pci_find_subsys() Superseded by pci_get_subsys() +pci_find_slot() Superseded by pci_get_slot() + + +The alternative is the traditional PCI device driver that walks PCI +device lists. This is still possible but discouraged. + + + +10. MMIO Space and "Write Posting" +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Converting a driver from using I/O Port space to using MMIO space +often requires some additional changes. Specifically, "write posting" +needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) +already do this. I/O Port space guarantees write transactions reach the PCI +device before the CPU can continue. Writes to MMIO space allow the CPU +to continue before the transaction reaches the PCI device. HW weenies +call this "Write Posting" because the write completion is "posted" to +the CPU before the transaction has reached its destination. + +Thus, timing sensitive code should add readl() where the CPU is +expected to wait before doing other work.
The classic "bit banging" +sequence works fine for I/O Port space: + + for (i = 0; i < 8; i++, val >>= 1) { + outb(val & 1, ioport_reg); /* write bit */ + udelay(10); + } + +The same sequence for MMIO space should be: + + for (i = 0; i < 8; i++, val >>= 1) { + writeb(val & 1, mmio_reg); /* write bit */ + readb(safe_mmio_reg); /* flush posted write */ + udelay(10); + } + +It is important that "safe_mmio_reg" not have any side effects that +interfere with the correct operation of the device. + +Another case to watch out for is when resetting a PCI device. Use PCI +Configuration space reads to flush the writel(). This will gracefully +handle the PCI master abort on all platforms if the PCI device is +expected to not respond to a readl(). Most x86 platforms will allow +MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage +(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail"). + diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt new file mode 100644 index 000000000000..16c251230c82 --- /dev/null +++ b/Documentation/PCI/pcieaer-howto.txt @@ -0,0 +1,253 @@ + The PCI Express Advanced Error Reporting Driver Guide HOWTO + T. Long Nguyen + Yanmin Zhang + 07/29/2006 + + +1. Overview + +1.1 About this guide + +This guide describes the basics of the PCI Express Advanced Error +Reporting (AER) driver and provides information on how to use it, as +well as how to enable the drivers of endpoint devices to conform with +PCI Express AER driver. + +1.2 Copyright © Intel Corporation 2006. + +1.3 What is the PCI Express AER Driver? + +PCI Express error signaling can occur on the PCI Express link itself +or on behalf of transactions initiated on the link. PCI Express +defines two error reporting paradigms: the baseline capability and +the Advanced Error Reporting capability. The baseline capability is +required of all PCI Express components providing a minimum defined +set of error reporting requirements.
Advanced Error Reporting +capability is implemented with a PCI Express advanced error reporting +extended capability structure providing more robust error reporting. + +The PCI Express AER driver provides the infrastructure to support PCI +Express Advanced Error Reporting capability. The PCI Express AER +driver provides three basic functions: + +- Gathers the comprehensive error information if errors occurred. +- Reports error to the users. +- Performs error recovery actions. + +AER driver only attaches to root ports which support PCI-Express AER +capability. + + +2. User Guide + +2.1 Include the PCI Express AER Root Driver into the Linux Kernel + +The PCI Express AER Root driver is a Root Port service driver attached +to the PCI Express Port Bus driver. If a user wants to use it, the driver +has to be compiled. Option CONFIG_PCIEAER supports this capability. It +depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and +CONFIG_PCIEAER = y. + +2.2 Load PCI Express AER Root Driver +There is a case where a system has AER support in BIOS. Enabling the AER +Root driver and having AER support in BIOS may result in unpredictable +behavior. To avoid this conflict, a successful load of the AER Root driver +requires ACPI _OSC support in the BIOS to allow the AER Root driver to +request native control of AER. See the PCI FW 3.0 Specification for +details regarding OSC usage. Currently, lots of firmwares don't provide +_OSC support while they use PCI Express. To support such firmwares, +forceload, a parameter of type bool, could enable AER to continue to +be initiated although firmwares have no _OSC support. To enable the +workaround, please add aerdriver.forceload=y to kernel boot parameter line +when booting kernel. Note that forceload=n by default. + +2.3 AER error output +When a PCI-E AER error is captured, an error message will be output to +console. If it's a correctable error, it is output as a warning. +Otherwise, it is printed as an error.
So users could choose different +log level to filter out correctable error messages. + +Below shows an example. ++------ PCI-Express Device Error -----+ +Error Severity : Uncorrected (Fatal) +PCIE Bus Error type : Transaction Layer +Unsupported Request : First +Requester ID : 0500 +VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h +TLP Header: +04000001 00200a03 05010000 00050100 + +In the example, 'Requester ID' means the ID of the device that sends +the error message to root port. Please refer to pci express specs for +other fields. + + +3. Developer Guide + +Enabling AER-aware support requires a software driver to configure +the AER capability structure within its device and to provide callbacks. + +To support AER better, developers first need to understand how AER +works. + +PCI Express errors are classified into two types: correctable errors +and uncorrectable errors. This classification is based on the impacts +of those errors, which may result in degraded performance or function +failure. + +Correctable errors pose no impacts on the functionality of the +interface. The PCI Express protocol can recover without any software +intervention or any loss of data. These errors are detected and +corrected by hardware. Unlike correctable errors, uncorrectable +errors impact functionality of the interface. Uncorrectable errors +can cause a particular transaction or a particular PCI Express link +to be unreliable. Depending on those error conditions, uncorrectable +errors are further classified into non-fatal errors and fatal errors. +Non-fatal errors cause the particular transaction to be unreliable, +but the PCI Express link itself is fully functional. Fatal errors, on +the other hand, cause the link to be unreliable. + +When AER is enabled, a PCI Express device will automatically send an +error message to the PCIE root port above it when the device captures +an error.
The Root Port, upon receiving an error reporting message, +internally processes and logs the error message in its PCI Express +capability structure. Error information being logged includes storing +the error reporting agent's requestor ID into the Error Source +Identification Registers and setting the error bits of the Root Error +Status Register accordingly. If AER error reporting is enabled in Root +Error Command Register, the Root Port generates an interrupt if an +error is detected. + +Note that the errors as described above are related to the PCI Express +hierarchy and links. These errors do not include any device specific +errors because device specific errors will still get sent directly to +the device driver. + +3.1 Configure the AER capability structure + +AER aware drivers of PCI Express component need change the device +control registers to enable AER. They also could change AER registers, +including mask and severity registers. Helper function +pci_enable_pcie_error_reporting could be used to enable AER. See +section 3.3. + +3.2. Provide callbacks + +3.2.1 callback reset_link to reset pci express link + +This callback is used to reset the pci express physical link when a +fatal error happens. The root port aer service driver provides a +default reset_link function, but different upstream ports might +have different specifications to reset pci express link, so all +upstream ports should provide their own reset_link functions. + +In struct pcie_port_service_driver, a new pointer, reset_link, is +added. + +pci_ers_result_t (*reset_link) (struct pci_dev *dev); + +Section 3.2.2.2 provides more detailed info on when to call +reset_link. + +3.2.2 PCI error-recovery callbacks + +The PCI Express AER Root driver uses error callbacks to coordinate +with downstream device drivers associated with a hierarchy in question +when performing error recovery actions. 
+ +Data struct pci_driver has a pointer, err_handler, to point to +pci_error_handlers who consists of a couple of callback function +pointers. AER driver follows the rules defined in +pci-error-recovery.txt except pci express specific parts (e.g. +reset_link). Pls. refer to pci-error-recovery.txt for detailed +definitions of the callbacks. + +Below sections specify when to call the error callback functions. + +3.2.2.1 Correctable errors + +Correctable errors pose no impacts on the functionality of +the interface. The PCI Express protocol can recover without any +software intervention or any loss of data. These errors do not +require any recovery actions. The AER driver clears the device's +correctable error status register accordingly and logs these errors. + +3.2.2.2 Non-correctable (non-fatal and fatal) errors + +If an error message indicates a non-fatal error, performing link reset +at upstream is not required. The AER driver calls error_detected(dev, +pci_channel_io_normal) to all drivers associated within a hierarchy in +question. for example, +EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. +If Upstream port A captures an AER error, the hierarchy consists of +Downstream port B and EndPoint. + +A driver may return PCI_ERS_RESULT_CAN_RECOVER, +PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on +whether it can recover or the AER driver calls mmio_enabled as next. + +If an error message indicates a fatal error, kernel will broadcast +error_detected(dev, pci_channel_io_frozen) to all drivers within +a hierarchy in question. Then, performing link reset at upstream is +necessary. As different kinds of devices might use different approaches +to reset link, AER port service driver is required to provide the +function to reset link. Firstly, kernel looks for if the upstream +component has an aer driver. If it has, kernel uses the reset_link +callback of the aer driver. 
If the upstream component has no AER driver +and the port is a downstream port, the kernel uses the AER driver of the +root port that reports the AER error. As for upstream ports, +they should provide their own AER service drivers with a reset_link +function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and +reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes +to mmio_enabled. + +3.3 Helper functions + +3.3.1 int pci_find_aer_capability(struct pci_dev *dev); +pci_find_aer_capability locates the PCI Express AER capability +in the device configuration space. If the device doesn't support +PCI Express AER, the function returns 0. + +3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); +pci_enable_pcie_error_reporting enables the device to send error +messages to the root port when an error is detected. Note that devices +don't enable error reporting by default, so device drivers need to +call this function to enable it. + +3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); +pci_disable_pcie_error_reporting stops the device from sending error +messages to the root port when an error is detected. + +3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); +pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable +error status register. + +3.4 Frequently Asked Questions + +Q: What happens if a PCI Express device driver does not provide an +error recovery handler (pci_driver->err_handler is equal to NULL)? + +A: The devices attached to the driver won't be recovered. If the +error is fatal, the kernel will print out warning messages. Please refer +to section 3 for more information. + +Q: What happens if an upstream port service driver does not provide +the reset_link callback? + +A: Fatal error recovery will fail if the errors are reported by the +upstream ports that the service driver is attached to. + +Q: How does this infrastructure deal with a driver that is not PCI +Express aware?
+ +A: This infrastructure calls the error callback functions of the +driver when an error happens. But if the driver is not aware of +PCI Express, the device might not report its own errors to the root +port. + +Q: What modifications will that driver need to make it compatible +with the PCI Express AER Root driver? + +A: It could call the helper functions to enable AER in devices and +clean up the uncorrectable status register. Please refer to section 3.3. + diff --git a/Documentation/PCIEBUS-HOWTO.txt b/Documentation/PCIEBUS-HOWTO.txt deleted file mode 100644 index c93f42a74d7e..000000000000 --- a/Documentation/PCIEBUS-HOWTO.txt +++ /dev/null @@ -1,217 +0,0 @@ - The PCI Express Port Bus Driver Guide HOWTO - Tom L Nguyen tom.l.nguyen@intel.com - 11/03/2004 - -1. About this guide - -This guide describes the basics of the PCI Express Port Bus driver -and provides information on how to enable the service drivers to -register/unregister with the PCI Express Port Bus Driver. - -2. Copyright 2004 Intel Corporation - -3. What is the PCI Express Port Bus Driver - -A PCI Express Port is a logical PCI-PCI Bridge structure. There -are two types of PCI Express Port: the Root Port and the Switch -Port. The Root Port originates a PCI Express link from a PCI Express -Root Complex and the Switch Port connects PCI Express links to -internal logical PCI buses. The Switch Port, which has its secondary -bus representing the switch's internal routing logic, is called the -switch's Upstream Port. The switch's Downstream Port is bridging from -switch's internal routing bus to a bus representing the downstream -PCI Express link from the PCI Express Switch. - -A PCI Express Port can provide up to four distinct functions, -referred to in this document as services, depending on its port type. -PCI Express Port's services include native hotplug support (HP), -power management event support (PME), advanced error reporting -support (AER), and virtual channel support (VC).
These services may -be handled by a single complex driver or be individually distributed -and handled by corresponding service drivers. - -4. Why use the PCI Express Port Bus Driver? - -In existing Linux kernels, the Linux Device Driver Model allows a -physical device to be handled by only a single driver. The PCI -Express Port is a PCI-PCI Bridge device with multiple distinct -services. To maintain a clean and simple solution each service -may have its own software service driver. In this case several -service drivers will compete for a single PCI-PCI Bridge device. -For example, if the PCI Express Root Port native hotplug service -driver is loaded first, it claims a PCI-PCI Bridge Root Port. The -kernel therefore does not load other service drivers for that Root -Port. In other words, it is impossible to have multiple service -drivers load and run on a PCI-PCI Bridge device simultaneously -using the current driver model. - -To enable multiple service drivers running simultaneously requires -having a PCI Express Port Bus driver, which manages all populated -PCI Express Ports and distributes all provided service requests -to the corresponding service drivers as required. Some key -advantages of using the PCI Express Port Bus driver are listed below: - - - Allow multiple service drivers to run simultaneously on - a PCI-PCI Bridge Port device. - - - Allow service drivers implemented in an independent - staged approach. - - - Allow one service driver to run on multiple PCI-PCI Bridge - Port devices. - - - Manage and distribute resources of a PCI-PCI Bridge Port - device to requested service drivers. - -5. Configuring the PCI Express Port Bus Driver vs. Service Drivers - -5.1 Including the PCI Express Port Bus Driver Support into the Kernel - -Including the PCI Express Port Bus driver depends on whether the PCI -Express support is included in the kernel config. 
The kernel will -automatically include the PCI Express Port Bus driver as a kernel -driver when the PCI Express support is enabled in the kernel. - -5.2 Enabling Service Driver Support - -PCI device drivers are implemented based on the Linux Device Driver Model. -All service drivers are PCI device drivers. As discussed above, it is -impossible to load any service driver once the kernel has loaded the -PCI Express Port Bus Driver. Meeting the PCI Express Port Bus Driver -Model requires some minimal changes to existing service drivers that -have no impact on their functionality. - -A service driver is required to use the two APIs shown below to -register its service with the PCI Express Port Bus driver (see -sections 5.2.1 & 5.2.2). It is important that a service driver -initializes the pcie_port_service_driver data structure, included in -header file /include/linux/pcieport_if.h, before calling these APIs. -Failure to do so will result in an identity mismatch, which prevents -the PCI Express Port Bus driver from loading a service driver. - -5.2.1 pcie_port_service_register - -int pcie_port_service_register(struct pcie_port_service_driver *new) - -This API replaces the Linux Driver Model's pci_module_init API. A -service driver should always call pcie_port_service_register at -module init. Note that after the service driver is loaded, calls -such as pci_enable_device(dev) and pci_set_master(dev) are no longer -necessary since these calls are executed by the PCI Port Bus driver. - -5.2.2 pcie_port_service_unregister - -void pcie_port_service_unregister(struct pcie_port_service_driver *new) - -pcie_port_service_unregister replaces the Linux Driver Model's -pci_unregister_driver. It's always called by a service driver when a -module exits. - -5.2.3 Sample Code - -Below is sample service driver code to initialize the port service -driver data structure.
- -static struct pcie_port_service_id service_id[] = { { - .vendor = PCI_ANY_ID, - .device = PCI_ANY_ID, - .port_type = PCIE_RC_PORT, - .service_type = PCIE_PORT_SERVICE_AER, - }, { /* end: all zeroes */ } -}; - -static struct pcie_port_service_driver root_aerdrv = { - .name = (char *)device_name, - .id_table = &service_id[0], - - .probe = aerdrv_load, - .remove = aerdrv_unload, - - .suspend = aerdrv_suspend, - .resume = aerdrv_resume, -}; - -Below is a sample code for registering/unregistering a service -driver. - -static int __init aerdrv_service_init(void) -{ - int retval = 0; - - retval = pcie_port_service_register(&root_aerdrv); - if (!retval) { - /* - * FIX ME - */ - } - return retval; -} - -static void __exit aerdrv_service_exit(void) -{ - pcie_port_service_unregister(&root_aerdrv); -} - -module_init(aerdrv_service_init); -module_exit(aerdrv_service_exit); - -6. Possible Resource Conflicts - -Since all service drivers of a PCI-PCI Bridge Port device are -allowed to run simultaneously, below lists a few of possible resource -conflicts with proposed solutions. - -6.1 MSI Vector Resource - -The MSI capability structure enables a device software driver to call -pci_enable_msi to request MSI based interrupts. Once MSI interrupts -are enabled on a device, it stays in this mode until a device driver -calls pci_disable_msi to disable MSI interrupts and revert back to -INTx emulation mode. Since service drivers of the same PCI-PCI Bridge -port share the same physical device, if an individual service driver -calls pci_enable_msi/pci_disable_msi it may result unpredictable -behavior. For example, two service drivers run simultaneously on the -same physical Root Port. Both service drivers call pci_enable_msi to -request MSI based interrupts. A service driver may not know whether -any other service drivers have run on this Root Port. If either one -of them calls pci_disable_msi, it puts the other service driver -in a wrong interrupt mode. 
- -To avoid this situation all service drivers are not permitted to -switch interrupt mode on its device. The PCI Express Port Bus driver -is responsible for determining the interrupt mode and this should be -transparent to service drivers. Service drivers need to know only -the vector IRQ assigned to the field irq of struct pcie_device, which -is passed in when the PCI Express Port Bus driver probes each service -driver. Service drivers should use (struct pcie_device*)dev->irq to -call request_irq/free_irq. In addition, the interrupt mode is stored -in the field interrupt_mode of struct pcie_device. - -6.2 MSI-X Vector Resources - -Similar to the MSI a device driver for an MSI-X capable device can -call pci_enable_msix to request MSI-X interrupts. All service drivers -are not permitted to switch interrupt mode on its device. The PCI -Express Port Bus driver is responsible for determining the interrupt -mode and this should be transparent to service drivers. Any attempt -by service driver to call pci_enable_msix/pci_disable_msix may -result unpredictable behavior. Service drivers should use -(struct pcie_device*)dev->irq and call request_irq/free_irq. - -6.3 PCI Memory/IO Mapped Regions - -Service drivers for PCI Express Power Management (PME), Advanced -Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access -PCI configuration space on the PCI Express port. In all cases the -registers accessed are independent of each other. This patch assumes -that all service drivers will be well behaved and not overwrite -other service driver's configuration settings. - -6.4 PCI Config Registers - -Each service driver runs its PCI config operations on its own -capability structure except the PCI Express capability structure, in -which Root Control register and Device Control register are shared -between PME and AER. This patch assumes that all service drivers -will be well behaved and not overwrite other service driver's -configuration settings. 
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 1f506f7830ec..e5a819a4f0c9 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -430,8 +430,8 @@ There are certain things that the Linux kernel memory barriers do not guarantee: [*] For information on bus mastering DMA and coherency please read: - Documentation/pci.txt - Documentation/DMA-mapping.txt + Documentation/PCI/pci.txt + Documentation/PCI/PCI-DMA-mapping.txt Documentation/DMA-API.txt diff --git a/Documentation/pci-error-recovery.txt b/Documentation/pci-error-recovery.txt deleted file mode 100644 index 6650af432523..000000000000 --- a/Documentation/pci-error-recovery.txt +++ /dev/null @@ -1,396 +0,0 @@ - - PCI Error Recovery - ------------------ - February 2, 2006 - - Current document maintainer: - Linas Vepstas - - -Many PCI bus controllers are able to detect a variety of hardware -PCI errors on the bus, such as parity errors on the data and address -busses, as well as SERR and PERR errors. Some of the more advanced -chipsets are able to deal with these errors; these include PCI-E chipsets, -and the PCI-host bridges found on IBM Power4 and Power5-based pSeries -boxes. A typical action taken is to disconnect the affected device, -halting all I/O to it. The goal of a disconnection is to avoid system -corruption; for example, to halt system memory corruption due to DMA's -to "wild" addresses. Typically, a reconnection mechanism is also -offered, so that the affected PCI device(s) are reset and put back -into working condition. The reset phase requires coordination -between the affected device drivers and the PCI controller chip. -This document describes a generic API for notifying device drivers -of a bus disconnection, and then performing error recovery. -This API is currently implemented in the 2.6.16 and later kernels. - -Reporting and recovery is performed in several steps. 
First, when -a PCI hardware error has resulted in a bus disconnect, that event -is reported as soon as possible to all affected device drivers, -including multiple instances of a device driver on multi-function -cards. This allows device drivers to avoid deadlocking in spinloops, -waiting for some i/o-space register to change, when it never will. -It also gives the drivers a chance to defer incoming I/O as -needed. - -Next, recovery is performed in several stages. Most of the complexity -is forced by the need to handle multi-function devices, that is, -devices that have multiple device drivers associated with them. -In the first stage, each driver is allowed to indicate what type -of reset it desires, the choices being a simple re-enabling of I/O -or requesting a hard reset (a full electrical #RST of the PCI card). -If any driver requests a full reset, that is what will be done. - -After a full reset and/or a re-enabling of I/O, all drivers are -again notified, so that they may then perform any device setup/config -that may be required. After these have all completed, a final -"resume normal operations" event is sent out. - -The biggest reason for choosing a kernel-based implementation rather -than a user-space implementation was the need to deal with bus -disconnects of PCI devices attached to storage media, and, in particular, -disconnects from devices holding the root file system. If the root -file system is disconnected, a user-space mechanism would have to go -through a large number of contortions to complete recovery. Almost all -of the current Linux file systems are not tolerant of disconnection -from/reconnection to their underlying block device. By contrast, -bus errors are easy to manage in the device driver. Indeed, most -device drivers already handle very similar recovery procedures; -for example, the SCSI-generic layer already provides significant -mechanisms for dealing with SCSI bus errors and SCSI bus resets. 
- - -Detailed Design ---------------- -Design and implementation details below, based on a chain of -public email discussions with Ben Herrenschmidt, circa 5 April 2005. - -The error recovery API support is exposed to the driver in the form of -a structure of function pointers pointed to by a new field in struct -pci_driver. A driver that fails to provide the structure is "non-aware", -and the actual recovery steps taken are platform dependent. The -arch/powerpc implementation will simulate a PCI hotplug remove/add. - -This structure has the form: -struct pci_error_handlers -{ - int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); - int (*mmio_enabled)(struct pci_dev *dev); - int (*link_reset)(struct pci_dev *dev); - int (*slot_reset)(struct pci_dev *dev); - void (*resume)(struct pci_dev *dev); -}; - -The possible channel states are: -enum pci_channel_state { - pci_channel_io_normal, /* I/O channel is in normal state */ - pci_channel_io_frozen, /* I/O to channel is blocked */ - pci_channel_io_perm_failure, /* PCI card is dead */ -}; - -Possible return values are: -enum pci_ers_result { - PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ - PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ - PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ - PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ - PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ -}; - -A driver does not have to implement all of these callbacks; however, -if it implements any, it must implement error_detected(). If a callback -is not implemented, the corresponding feature is considered unsupported. -For example, if mmio_enabled() and resume() aren't there, then it -is assumed that the driver is not doing any direct recovery and requires -a reset. If link_reset() is not implemented, the card is assumed to -not care about link resets.
Typically a driver will want to know about -a slot_reset(). - -The actual steps taken by a platform to recover from a PCI error -event will be platform-dependent, but will follow the general -sequence described below. - -STEP 0: Error Event -------------------- -A PCI bus error is detected by the PCI hardware. On powerpc, the slot -is isolated, in that all I/O is blocked: all reads return 0xffffffff, -all writes are ignored. - - -STEP 1: Notification --------------------- -The platform calls the error_detected() callback on every instance of -every driver affected by the error. - -At this point, the device might not be accessible anymore, depending on -the platform (the slot will be isolated on powerpc). The driver may -already have "noticed" the error because of a failing I/O, but this -is the proper "synchronization point", that is, it gives the driver -a chance to clean up, waiting for pending stuff (timers, whatever, etc...) -to complete; it can take semaphores, schedule, etc... everything but -touch the device. Within this function and after it returns, the driver -shouldn't do any new IOs. Called in task context. This is sort of a -"quiesce" point. See note about interrupts at the end of this doc. - -All drivers participating in this system must implement this call. -The driver must return one of the following result codes: - - PCI_ERS_RESULT_CAN_RECOVER: - Driver returns this if it thinks it might be able to recover - the HW by just banging IOs or if it wants to be given - a chance to extract some diagnostic information (see - mmio_enabled, below). - - PCI_ERS_RESULT_NEED_RESET: - Driver returns this if it can't recover without a hard - slot reset. - - PCI_ERS_RESULT_DISCONNECT: - Driver returns this if it doesn't want to recover at all. - -The next step taken will depend on the result codes returned by the -drivers.
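A minimal user-space sketch of an error_detected() implementation may help make the three result codes concrete. The enums copy the names given in this document; struct demo_dev, demo_error_detected and the policy it applies are hypothetical. A real driver receives a struct pci_dev pointer and decides according to its own hardware.

```c
#include <stdbool.h>

/* Channel states and result codes as listed in this document. */
enum pci_channel_state {
	pci_channel_io_normal,
	pci_channel_io_frozen,
	pci_channel_io_perm_failure
};

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED
};

/* Hypothetical per-device driver state, standing in for struct pci_dev. */
struct demo_dev {
	bool io_stopped;	/* set once the driver stops issuing new I/O */
	bool can_recover_mmio;	/* driver believes MMIO-only recovery may work */
};

/*
 * Illustrative policy only:
 *  - always quiesce first (no new I/O within or after this call);
 *  - a permanently failed channel is not worth recovering;
 *  - otherwise ask for MMIO-based recovery if the driver supports it,
 *    and for a slot reset if it does not.
 */
static enum pci_ers_result demo_error_detected(struct demo_dev *dev,
					       enum pci_channel_state state)
{
	dev->io_stopped = true;

	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	if (dev->can_recover_mmio)
		return PCI_ERS_RESULT_CAN_RECOVER;
	return PCI_ERS_RESULT_NEED_RESET;
}
```

The callback never touches the device itself; it only records that I/O must stop and reports what kind of recovery the driver would like.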
- -If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, -then the platform should re-enable IOs on the slot (or do nothing in -particular, if the platform doesn't isolate slots), and recovery -proceeds to STEP 2 (MMIO Enable). - -If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), -then recovery proceeds to STEP 4 (Slot Reset). - -If the platform is unable to recover the slot, the next step -is STEP 6 (Permanent Failure). - ->>> The current powerpc implementation assumes that a device driver will ->>> *not* schedule or semaphore in this routine; the current powerpc ->>> implementation uses one kernel thread to notify all devices; ->>> thus, if one device sleeps/schedules, all devices are affected. ->>> Doing better requires complex multi-threaded logic in the error ->>> recovery implementation (e.g. waiting for all notification threads ->>> to "join" before proceeding with recovery.) This seems excessively ->>> complex and not worth implementing. - ->>> The current powerpc implementation doesn't much care if the device ->>> attempts I/O at this point, or not. I/O's will fail, returning ->>> a value of 0xff on read, and writes will be dropped. If the device ->>> driver attempts more than 10K I/O's to a frozen adapter, it will ->>> assume that the device driver has gone into an infinite loop, and ->>> it will panic the kernel. There doesn't seem to be any other ->>> way of stopping a device driver that insists on spinning on I/O. - -STEP 2: MMIO Enabled -------------------- -The platform re-enables MMIO to the device (but typically not the -DMA), and then calls the mmio_enabled() callback on all affected -device drivers. - -This is the "early recovery" call. IOs are allowed again, but DMA is -not (hrm... to be discussed, I prefer not), with some restrictions. 
This -is NOT a callback for the driver to start operations again, only to -peek/poke at the device, extract diagnostic information, if any, and -eventually do things like trigger a device local reset or some such, -but not restart operations. This callback is made if all drivers on -a segment agree that they can try to recover and if no automatic link reset -was performed by the HW. If the platform can't just re-enable IOs without -a slot reset or a link reset, it won't call this callback, and instead -will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset). - ->>> The following is proposed; no platform implements this yet: ->>> Proposal: All I/O's should be done _synchronously_ from within ->>> this callback, errors triggered by them will be returned via ->>> the normal pci_check_whatever() API, no new error_detected() ->>> callback will be issued due to an error happening here. However, ->>> such an error might cause IOs to be re-blocked for the whole ->>> segment, and thus invalidate the recovery that other devices ->>> on the same segment might have done, forcing the whole segment ->>> into one of the next states, that is, link reset or slot reset. - -The driver should return one of the following result codes: - - PCI_ERS_RESULT_RECOVERED - Driver returns this if it thinks the device is fully - functional and thinks it is ready to start - normal driver operations again. There is no - guarantee that the driver will actually be - allowed to proceed, as another driver on the - same segment might have failed and thus triggered a - slot reset on platforms that support it. - - - PCI_ERS_RESULT_NEED_RESET - Driver returns this if it thinks the device is not - recoverable in its current state and it needs a slot - reset to proceed. - - - PCI_ERS_RESULT_DISCONNECT - Same as above. Total failure, no recovery even after - reset; driver dead. (To be defined more precisely) - -The next step taken depends on the results returned by the drivers.
-If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform -proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations). - -If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform -proceeds to STEP 4 (Slot Reset). - ->>> The current powerpc implementation does not implement this callback. - - -STEP 3: Link Reset ------------------- -The platform resets the link, and then calls the link_reset() callback -on all affected device drivers. This is a PCI-Express specific state -and is done whenever a non-fatal error has been detected that can be -"solved" by resetting the link. This call informs the driver of the -reset and the driver should check to see if the device appears to be -in working condition. - -The driver is not supposed to restart normal driver I/O operations -at this point. It should limit itself to "probing" the device to -check its recoverability status. If all is right, then the platform -will call resume() once all drivers have ack'd link_reset(). - - Result codes: - (identical to STEP 2 (MMIO Enabled)) - -The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5 -(Resume Operations). - ->>> The current powerpc implementation does not implement this callback. - - -STEP 4: Slot Reset ------------------- -The platform performs a soft or hard reset of the device, and then -calls the slot_reset() callback. - -A soft reset consists of asserting the adapter #RST line and then -restoring the PCI BAR's and PCI configuration header to a state -that is equivalent to what it would be after a fresh system -power-on followed by power-on BIOS/system firmware initialization. -If the platform supports PCI hotplug, then the reset might be -performed by toggling the slot electrical power off/on. - -It is important for the platform to restore the PCI config space -to the "fresh poweron" state, rather than the "last state".
After -a slot reset, the device driver will almost always use its standard -device initialization routines, and an unusual config space setup -may result in hung devices, kernel panics, or silent data corruption. - -This call gives drivers the chance to re-initialize the hardware -(re-download firmware, etc.). At this point, the driver may assume -that the card is in a fresh state and is fully functional. In -particular, interrupt generation should work normally. - -Drivers should not yet restart normal I/O processing operations -at this point. If all device drivers report success on this -callback, the platform will call resume() to complete the sequence, -and let the driver restart normal I/O processing. - -A driver can still return a critical failure for this function if -it can't get the device operational after reset. If the platform -previously tried a soft reset, it might now try a hard reset (power -cycle) and then call slot_reset() again. If the device still can't -be recovered, there is nothing more that can be done; the platform -will typically report a "permanent failure" in such a case. The -device will be considered "dead" in this case. - -Drivers for multi-function cards will need to coordinate among -themselves as to which driver instance will perform any "one-shot" -or global device initialization. For example, the Symbios sym53c8xx_2 -driver performs device init only from PCI function 0: - -+ if (PCI_FUNC(pdev->devfn) == 0) -+ sym_reset_scsi_bus(np, 0); - - Result codes: - - PCI_ERS_RESULT_DISCONNECT - Same as above. - -Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent -Failure). - ->>> The current powerpc implementation does not currently try a ->>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT. ->>> However, it probably should.
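The transitions out of STEP 1 and STEP 2 described above can be summarized in a small user-space sketch. The result codes use the names from this document; enum recovery_step and the three functions are hypothetical. Two simplifications are assumptions of the sketch and are not stated by the text: a single PCI_ERS_RESULT_DISCONNECT sends the whole segment to permanent failure, and the optional link reset (STEP 3) is not modeled.

```c
#include <stddef.h>

/* Result codes as listed in this document. */
enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED
};

/* Hypothetical labels; the values match the STEP numbers in the text. */
enum recovery_step {
	STEP_MMIO_ENABLED = 2,
	STEP_SLOT_RESET = 4,
	STEP_RESUME = 5,
	STEP_PERM_FAILURE = 6
};

/*
 * Hypothetical helper: choose the step that follows one callback round.
 * 'proceed' is the result every driver must return to avoid a reset,
 * and 'on_success' is the step taken when all of them do.
 */
static enum recovery_step pick_next_step(const enum pci_ers_result *results,
					 size_t count,
					 enum pci_ers_result proceed,
					 enum recovery_step on_success)
{
	int need_reset = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		/* Assumption: one DISCONNECT fails the whole segment. */
		if (results[i] == PCI_ERS_RESULT_DISCONNECT)
			return STEP_PERM_FAILURE;
		/* Anything other than 'proceed' is treated as a reset request. */
		if (results[i] != proceed)
			need_reset = 1;
	}
	return need_reset ? STEP_SLOT_RESET : on_success;
}

/* After STEP 1: all CAN_RECOVER leads to STEP 2, any NEED_RESET to STEP 4. */
static enum recovery_step after_error_detected(const enum pci_ers_result *r,
					       size_t n)
{
	return pick_next_step(r, n, PCI_ERS_RESULT_CAN_RECOVER,
			      STEP_MMIO_ENABLED);
}

/* After STEP 2: all RECOVERED leads to STEP 5, any NEED_RESET to STEP 4. */
static enum recovery_step after_mmio_enabled(const enum pci_ers_result *r,
					     size_t n)
{
	return pick_next_step(r, n, PCI_ERS_RESULT_RECOVERED, STEP_RESUME);
}
```

Because a reset request from any one driver overrides the others, the helper scans the whole array before deciding, rather than stopping at the first driver that can recover.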
- - -STEP 5: Resume Operations -------------------------- -The platform will call the resume() callback on all affected device -drivers if all drivers on the segment have returned -PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. -The goal of this callback is to tell the driver to restart activity, -that everything is back and running. This callback does not return -a result code. - -At this point, if a new error happens, the platform will restart -a new error recovery sequence. - -STEP 6: Permanent Failure -------------------------- -A "permanent failure" has occurred, and the platform cannot recover -the device. The platform will call error_detected() with a -pci_channel_state value of pci_channel_io_perm_failure. - -The device driver should, at this point, assume the worst. It should -cancel all pending I/O, refuse all new I/O, returning -EIO to -higher layers. The device driver should then clean up all of its -memory and remove itself from kernel operations, much as it would -during system shutdown. - -The platform will typically notify the system operator of the -permanent failure in some way. If the device is hotplug-capable, -the operator will probably want to remove and replace the device. -Note, however, not all failures are truly "permanent". Some are -caused by over-heating, some by a poorly seated card. Many -PCI error events are caused by software bugs, e.g. DMA's to -wild addresses or bogus split transactions due to programming -errors. See the discussion in powerpc/eeh-pci-error-recovery.txt -for additional detail on real-life experience of the causes of -software errors. - - -Conclusion; General Remarks ---------------------------- -The way those callbacks are called is platform policy. A platform with -no slot reset capability may want to just "ignore" drivers that can't -recover (disconnect them) and try to let other cards on the same segment -recover. 
Keep in mind that in most real life cases, though, there will -be only one driver per segment. - -Now, a note about interrupts. If you get an interrupt and your -device is dead or has been isolated, there is a problem :) -The current policy is to turn this into a platform policy. -That is, the recovery API only requires that: - - - There is no guarantee that interrupt delivery can proceed from any -device on the segment starting from the error detection and until the -resume callback is sent, at which point interrupts are expected to be -fully operational. - - - There is no guarantee that interrupt delivery is stopped, that is, -a driver that gets an interrupt after detecting an error, or that detects -an error within the interrupt handler such that it prevents proper -ack'ing of the interrupt (and thus removal of the source) should just -return IRQ_NONE. It's up to the platform to deal with that -condition, typically by masking the IRQ source for the duration of -the error handling. It is expected that the platform "knows" which -interrupts are routed to error-management capable slots and can deal -with temporarily disabling that IRQ number during error processing (this -isn't terribly complex). That means some IRQ latency for other devices -sharing the interrupt, but there is simply no other way. High end -platforms aren't supposed to share interrupts between many devices -anyway :) - ->>> Implementation details for the powerpc platform are discussed in ->>> the file Documentation/powerpc/eeh-pci-error-recovery.txt - ->>> As of this writing, there are six device drivers with patches ->>> implementing error recovery. Not all of these patches are in ->>> mainline yet.
These may be used as "examples": ->>> ->>> drivers/scsi/ipr.c ->>> drivers/scsi/sym53c8xx_2 ->>> drivers/net/e100.c ->>> drivers/net/e1000 ->>> drivers/net/ixgb ->>> drivers/net/s2io.c - -The End -------- diff --git a/Documentation/pci.txt b/Documentation/pci.txt deleted file mode 100644 index d2c2e6e2b224..000000000000 --- a/Documentation/pci.txt +++ /dev/null @@ -1,646 +0,0 @@ - - How To Write Linux PCI Drivers - - by Martin Mares on 07-Feb-2000 - updated by Grant Grundler on 23-Dec-2006 - -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The world of PCI is vast and full of (mostly unpleasant) surprises. -Since each CPU architecture implements different chip-sets and PCI devices -have different requirements (erm, "features"), the result is that PCI support -in the Linux kernel is not as trivial as one would wish. This short paper -tries to introduce all potential driver authors to Linux APIs for -PCI device drivers. - -A more complete resource is the third edition of "Linux Device Drivers" -by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. -LDD3 is available for free (under Creative Commons License) from: - - http://lwn.net/Kernel/LDD3/ - -However, keep in mind that all documents are subject to "bit rot". -Refer to the source code if things are not working as described here. - -Please send questions/comments/patches about Linux PCI API to the -"Linux PCI" mailing list. - - - -0. Structure of PCI drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -PCI drivers "discover" PCI devices in a system via pci_register_driver(). -Actually, it's the other way around. When the PCI generic code discovers -a new device, the driver with a matching "description" will be notified. -Details on this below. - -pci_register_driver() leaves most of the probing for devices to -the PCI layer and supports online insertion/removal of devices [thus -supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
-pci_register_driver() call requires passing in a table of function -pointers and thus dictates the high level structure of a driver. - -Once the driver knows about a PCI device and takes ownership, the -driver generally needs to perform the following initialization: - - Enable the device - Request MMIO/IOP resources - Set the DMA mask size (for both coherent and streaming DMA) - Allocate and initialize shared control data (pci_allocate_coherent()) - Access device configuration space (if needed) - Register IRQ handler (request_irq()) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Enable DMA/processing engines - -When done using the device, and perhaps the module needs to be unloaded, -the driver needs to take the follow steps: - Disable the device from generating IRQs - Release the IRQ (free_irq()) - Stop all DMA activity - Release DMA buffers (both streaming and coherent) - Unregister from other subsystems (e.g. scsi or netdev) - Release MMIO/IOP resources - Disable the device - -Most of these topics are covered in the following sections. -For the rest look at LDD3 or <linux/pci.h>. - -If the PCI subsystem is not configured (CONFIG_PCI is not set), most of -the PCI functions described below are defined as inline functions either -completely empty or just returning an appropriate error codes to avoid -lots of ifdefs in the drivers. - - - -1. pci_register_driver() call -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -PCI device drivers call pci_register_driver() during their -initialization with a pointer to a structure describing the driver -(struct pci_driver): - - field name Description - ---------- ------------------------------------------------------ - id_table Pointer to table of device ID's the driver is - interested in. Most drivers should export this - table using MODULE_DEVICE_TABLE(pci,...).
- - probe This probing function gets called (during execution - of pci_register_driver() for already existing - devices or later if a new device gets inserted) for - all PCI devices which match the ID table and are not - "owned" by the other drivers yet. This function gets - passed a "struct pci_dev *" for each device whose - entry in the ID table matches the device. The probe - function returns zero when the driver chooses to - take "ownership" of the device or an error code - (negative number) otherwise. - The probe function always gets called from process - context, so it can sleep. - - remove The remove() function gets called whenever a device - being handled by this driver is removed (either during - deregistration of the driver or when it's manually - pulled out of a hot-pluggable slot). - The remove function always gets called from process - context, so it can sleep. - - suspend Put device into low power state. - suspend_late Put device into low power state. - - resume_early Wake device from low power state. - resume Wake device from low power state. - - (Please see Documentation/power/pci.txt for descriptions - of PCI Power Management and the related functions.) - - shutdown Hook into reboot_notifier_list (kernel/sys.c). - Intended to stop any idling DMA operations. - Useful for enabling wake-on-lan (NIC) or changing - the power state of a device before reboot. - e.g. drivers/net/e100.c. - - err_handler See Documentation/pci-error-recovery.txt - - -The ID table is an array of struct pci_device_id entries ending with an -all-zero entry; use of the macro DEFINE_PCI_DEVICE_TABLE is the preferred -method of declaring the table. Each entry consists of: - - vendor,device Vendor and device ID to match (or PCI_ANY_ID) - - subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID) - subdevice, - - class Device class, subclass, and "interface" to match. - See Appendix D of the PCI Local Bus Spec or - include/linux/pci_ids.h for a full list of classes. 
- Most drivers do not need to specify class/class_mask - as vendor/device is normally sufficient. - - class_mask limit which sub-fields of the class field are compared. - See drivers/scsi/sym53c8xx_2/ for example of usage. - - driver_data Data private to the driver. - Most drivers don't need to use driver_data field. - Best practice is to use driver_data as an index - into a static list of equivalent device types, - instead of using it as a pointer. - - -Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up -a pci_device_id table. - -New PCI IDs may be added to a device driver pci_ids table at runtime -as shown below: - -echo "vendor device subvendor subdevice class class_mask driver_data" > \ -/sys/bus/pci/drivers/{driver}/new_id - -All fields are passed in as hexadecimal values (no leading 0x). -The vendor and device fields are mandatory, the others are optional. Users -need pass only as many optional fields as necessary: - o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) - o class and classmask fields default to 0 - o driver_data defaults to 0UL. - -Once added, the driver probe routine will be invoked for any unclaimed -PCI devices listed in its (newly updated) pci_ids list. - -When the driver exits, it just calls pci_unregister_driver() and the PCI layer -automatically calls the remove hook for all devices handled by the driver. - - -1.1 "Attributes" for driver functions/data - -Please mark the initialization and cleanup functions where appropriate -(the corresponding macros are defined in <linux/init.h>): - - __init Initialization code. Thrown away after the driver - initializes. - __exit Exit code. Ignored for non-modular drivers. - - - __devinit Device initialization code. - Identical to __init if the kernel is not compiled - with CONFIG_HOTPLUG, normal function otherwise. - __devexit The same for __exit.
- -Tips on when/where to use the above attributes: - o The module_init()/module_exit() functions (and all - initialization functions called _only_ from these) - should be marked __init/__exit. - - o Do not mark the struct pci_driver. - - o The ID table array should be marked __devinitconst; this is done - automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE(). - - o The probe() and remove() functions should be marked __devinit - and __devexit respectively. All initialization functions - exclusively called by the probe() routine, can be marked __devinit. - Ditto for remove() and __devexit. - - o If mydriver_remove() is marked with __devexit(), then all address - references to mydriver_remove must use __devexit_p(mydriver_remove) - (in the struct pci_driver declaration for example). - __devexit_p() will generate the function name _or_ NULL if the - function will be discarded. For an example, see drivers/net/tg3.c. - - o Do NOT mark a function if you are not sure which mark to use. - Better to not mark the function than mark the function wrong. - - - -2. How to find PCI devices manually -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -PCI drivers should have a really good reason for not using the -pci_register_driver() interface to search for PCI devices. -The main reason PCI devices are controlled by multiple drivers -is because one PCI device implements several different HW services. -E.g. combined serial/parallel port/floppy controller. - -A manual search may be performed using the following constructs: - -Searching by vendor and device ID: - - struct pci_dev *dev = NULL; - while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) - configure_device(dev); - -Searching by class ID (iterate in a similar way): - - pci_get_class(CLASS_ID, dev) - -Searching by both vendor/device and subsystem vendor/device ID: - - pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). 
- -You can use the constant PCI_ANY_ID as a wildcard replacement for -VENDOR_ID or DEVICE_ID. This allows searching for any device from a -specific vendor, for example. - -These functions are hotplug-safe. They increment the reference count on -the pci_dev that they return. You must eventually (possibly at module unload) -decrement the reference count on these devices by calling pci_dev_put(). - - - -3. Device Initialization Steps -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -As noted in the introduction, most PCI drivers need the following steps -for device initialization: - - Enable the device - Request MMIO/IOP resources - Set the DMA mask size (for both coherent and streaming DMA) - Allocate and initialize shared control data (pci_allocate_coherent()) - Access device configuration space (if needed) - Register IRQ handler (request_irq()) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Enable DMA/processing engines. - -The driver can access PCI config space registers at any time. -(Well, almost. When running BIST, config space can go away...but -that will just result in a PCI Bus Master Abort and config reads -will return garbage). - - -3.1 Enable the PCI device -~~~~~~~~~~~~~~~~~~~~~~~~~ -Before touching any device registers, the driver needs to enable -the PCI device by calling pci_enable_device(). This will: - o wake up the device if it was in suspended state, - o allocate I/O and memory regions of the device (if BIOS did not), - o allocate an IRQ (if BIOS did not). - -NOTE: pci_enable_device() can fail! Check the return value. - -[ OS BUG: we don't check resource allocations before enabling those - resources. The sequence would make more sense if we called - pci_request_resources() before calling pci_enable_device(). - Currently, the device drivers can't detect the bug when when two - devices have been allocated the same range. This is not a common - problem and unlikely to get fixed soon. 
- - This has been discussed before but not changed as of 2.6.19: - http://lkml.org/lkml/2006/3/2/194 -] - -pci_set_master() will enable DMA by setting the bus master bit -in the PCI_COMMAND register. It also fixes the latency timer value if -it's set to something bogus by the BIOS. - -If the PCI device can use the PCI Memory-Write-Invalidate transaction, -call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval -and also ensures that the cache line size register is set correctly. -Check the return value of pci_set_mwi() as not all architectures -or chip-sets may support Memory-Write-Invalidate. Alternatively, -if Mem-Wr-Inval would be nice to have but is not required, call -pci_try_set_mwi() to have the system do its best effort at enabling -Mem-Wr-Inval. - - -3.2 Request MMIO/IOP resources -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Memory (MMIO), and I/O port addresses should NOT be read directly -from the PCI device config space. Use the values in the pci_dev structure -as the PCI "bus address" might have been remapped to a "host physical" -address by the arch/chip-set specific kernel support. - -See Documentation/IO-mapping.txt for how to access device registers -or device memory. - -The device driver needs to call pci_request_region() to verify -no other device is already using the same address resource. -Conversely, drivers should call pci_release_region() AFTER -calling pci_disable_device(). -The idea is to prevent two devices colliding on the same address range. - -[ See OS BUG comment above. Currently (2.6.19), The driver can only - determine MMIO and IO Port resource availability _after_ calling - pci_enable_device(). ] - -Generic flavors of pci_request_region() are request_mem_region() -(for MMIO ranges) and request_region() (for IO Port ranges). -Use these for address resources that are not described by "normal" PCI -BARs. - -Also see pci_request_selected_regions() below. 
- - -3.3 Set the DMA mask size -~~~~~~~~~~~~~~~~~~~~~~~~~ -[ If anything below doesn't make sense, please refer to - Documentation/DMA-API.txt. This section is just a reminder that - drivers need to indicate DMA capabilities of the device and is not - an authoritative source for DMA interfaces. ] - -While all drivers should explicitly indicate the DMA capability -(e.g. 32 or 64 bit) of the PCI bus master, devices with more than -32-bit bus master capability for streaming data need the driver -to "register" this capability by calling pci_set_dma_mask() with -appropriate parameters. In general this allows more efficient DMA -on systems where System RAM exists above 4G _physical_ address. - -Drivers for all PCI-X and PCIe compliant devices must call -pci_set_dma_mask() as they are 64-bit DMA devices. - -Similarly, drivers must also "register" this capability if the device -can directly address "consistent memory" in System RAM above 4G physical -address by calling pci_set_consistent_dma_mask(). -Again, this includes drivers for all PCI-X and PCIe compliant devices. -Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are -64-bit DMA capable for payload ("streaming") data but not control -("consistent") data. - - -3.4 Setup shared control data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) -memory. See Documentation/DMA-API.txt for a full description of -the DMA APIs. This section is just a reminder that it needs to be done -before enabling DMA on the device. - - -3.5 Initialize device registers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Some drivers will need specific "capability" fields programmed -or other "vendor specific" register initialized or reset. -E.g. clearing pending interrupts. - - -3.6 Register IRQ handler -~~~~~~~~~~~~~~~~~~~~~~~~ -While calling request_irq() is the last step described here, -this is often just another intermediate step to initialize a device. 
-This step can often be deferred until the device is opened for use. - -All interrupt handlers for IRQ lines should be registered with IRQF_SHARED -and use the devid to map IRQs to devices (remember that all PCI IRQ lines -can be shared). - -request_irq() will associate an interrupt handler and device handle -with an interrupt number. Historically interrupt numbers represent -IRQ lines which run from the PCI device to the Interrupt controller. -With MSI and MSI-X (more below) the interrupt number is a CPU "vector". - -request_irq() also enables the interrupt. Make sure the device is -quiesced and does not have any interrupts pending before registering -the interrupt handler. - -MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" -which deliver interrupts to the CPU via a DMA write to a Local APIC. -The fundamental difference between MSI and MSI-X is how multiple -"vectors" get allocated. MSI requires contiguous blocks of vectors -while MSI-X can allocate several individual ones. - -MSI capability can be enabled by calling pci_enable_msi() or -pci_enable_msix() before calling request_irq(). This causes -the PCI support to program CPU vector data into the PCI device -capability registers. - -If your PCI device supports both, try to enable MSI-X first. -Only one can be enabled at a time. Many architectures, chip-sets, -or BIOSes do NOT support MSI or MSI-X and the call to pci_enable_msi/msix -will fail. This is important to note since many drivers have -two (or more) interrupt handlers: one for MSI/MSI-X and another for IRQs. -They choose which handler to register with request_irq() based on the -return value from pci_enable_msi/msix(). - -There are (at least) two really good reasons for using MSI: -1) MSI is an exclusive interrupt vector by definition. - This means the interrupt handler doesn't have to verify - its device caused the interrupt. - -2) MSI avoids DMA/IRQ race conditions. 
DMA to host memory is guaranteed - to be visible to the host CPU(s) when the MSI is delivered. This - is important for both data coherency and avoiding stale control data. - This guarantee allows the driver to omit MMIO reads to flush - the DMA stream. - -See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples -of MSI/MSI-X usage. - - - -4. PCI device shutdown -~~~~~~~~~~~~~~~~~~~~~~~ - -When a PCI device driver is being unloaded, most of the following -steps need to be performed: - - Disable the device from generating IRQs - Release the IRQ (free_irq()) - Stop all DMA activity - Release DMA buffers (both streaming and consistent) - Unregister from other subsystems (e.g. scsi or netdev) - Disable device from responding to MMIO/IO Port addresses - Release MMIO/IO Port resource(s) - - -4.1 Stop IRQs on the device -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -How to do this is chip/device specific. If it's not done, it opens -the possibility of a "screaming interrupt" if (and only if) -the IRQ is shared with another device. - -When the shared IRQ handler is "unhooked", the remaining devices -using the same IRQ line will still need the IRQ enabled. Thus if the -"unhooked" device asserts IRQ line, the system will respond assuming -it was one of the remaining devices asserted the IRQ line. Since none -of the other devices will handle the IRQ, the system will "hang" until -it decides the IRQ isn't going to get handled and masks the IRQ (100,000 -iterations later). Once the shared IRQ is masked, the remaining devices -will stop functioning properly. Not a nice situation. - -This is another reason to use MSI or MSI-X if it's available. -MSI and MSI-X are defined to be exclusive interrupts and thus -are not susceptible to the "screaming interrupt" problem. - - -4.2 Release the IRQ -~~~~~~~~~~~~~~~~~~~ -Once the device is quiesced (no more IRQs), one can call free_irq(). 
-This function will return control once any pending IRQs are handled, -"unhook" the drivers IRQ handler from that IRQ, and finally release -the IRQ if no one else is using it. - - -4.3 Stop all DMA activity -~~~~~~~~~~~~~~~~~~~~~~~~~ -It's extremely important to stop all DMA operations BEFORE attempting -to deallocate DMA control data. Failure to do so can result in memory -corruption, hangs, and on some chip-sets a hard crash. - -Stopping DMA after stopping the IRQs can avoid races where the -IRQ handler might restart DMA engines. - -While this step sounds obvious and trivial, several "mature" drivers -didn't get this step right in the past. - - -4.4 Release DMA buffers -~~~~~~~~~~~~~~~~~~~~~~~ -Once DMA is stopped, clean up streaming DMA first. -I.e. unmap data buffers and return buffers to "upstream" -owners if there is one. - -Then clean up "consistent" buffers which contain the control data. - -See Documentation/DMA-API.txt for details on unmapping interfaces. - - -4.5 Unregister from other subsystems -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most low level PCI device drivers support some other subsystem -like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your -driver isn't losing resources from that other subsystem. -If this happens, typically the symptom is an Oops (panic) when -the subsystem attempts to call into a driver that has been unloaded. - - -4.6 Disable Device from responding to MMIO/IO Port addresses -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -io_unmap() MMIO or IO Port resources and then call pci_disable_device(). -This is the symmetric opposite of pci_enable_device(). -Do not access device registers after calling pci_disable_device(). - - -4.7 Release MMIO/IO Port Resource(s) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Call pci_release_region() to mark the MMIO or IO Port range as available. -Failure to do so usually results in the inability to reload the driver. - - - -5. 
How to access PCI config space -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can use pci_(read|write)_config_(byte|word|dword) to access the config -space of a device represented by struct pci_dev *. All these functions return 0 -when successful or an error code (PCIBIOS_...) which can be translated to a text -string by pcibios_strerror. Most drivers expect that accesses to valid PCI -devices don't fail. - -If you don't have a struct pci_dev available, you can call -pci_bus_(read|write)_config_(byte|word|dword) to access a given device -and function on that bus. - -If you access fields in the standard portion of the config header, please -use symbolic names of locations and bits declared in <linux/pci.h>. - -If you need to access Extended PCI Capability registers, just call -pci_find_capability() for the particular capability and it will find the -corresponding register block for you. - - - -6. Other interesting functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -pci_find_slot() Find pci_dev corresponding to given bus and - slot numbers. -pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) -pci_find_capability() Find specified capability in device's capability - list. -pci_resource_start() Returns bus start address for a given PCI region -pci_resource_end() Returns bus end address for a given PCI region -pci_resource_len() Returns the byte length of a PCI region -pci_set_drvdata() Set private driver data pointer for a pci_dev -pci_get_drvdata() Return private driver data pointer for a pci_dev -pci_set_mwi() Enable Memory-Write-Invalidate transactions. -pci_clear_mwi() Disable Memory-Write-Invalidate transactions. - - - -7. Miscellaneous hints -~~~~~~~~~~~~~~~~~~~~~~ - -When displaying PCI device names to the user (for example when a driver wants -to tell the user what card has it found), please use pci_name(pci_dev). - -Always refer to the PCI devices by a pointer to the pci_dev structure. -All PCI layer functions use this identification and it's the only -reasonable one.
Don't use bus/slot/function numbers except for very -special purposes -- on systems with multiple primary buses their semantics -can be pretty complex. - -Don't try to turn on Fast Back to Back writes in your driver. All devices -on the bus need to be capable of doing it, so this is something which needs -to be handled by platform and generic code, not individual drivers. - - - -8. Vendor and device identifications -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -One is not not required to add new device ids to include/linux/pci_ids.h. -Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids. - -PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary -hex numbers (vendor controlled) and normally used only in a single -location, the pci_device_id table. - -Please DO submit new vendor/device ids to pciids.sourceforge.net project. - - - -9. Obsolete functions -~~~~~~~~~~~~~~~~~~~~~ - -There are several functions which you might come across when trying to -port an old driver to the new PCI interface. They are no longer present -in the kernel as they aren't compatible with hotplug or PCI domains or -having sane locking. - -pci_find_device() Superseded by pci_get_device() -pci_find_subsys() Superseded by pci_get_subsys() -pci_find_slot() Superseded by pci_get_slot() - - -The alternative is the traditional PCI device driver that walks PCI -device lists. This is still possible but discouraged. - - - -10. MMIO Space and "Write Posting" -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Converting a driver from using I/O Port space to using MMIO space -often requires some additional changes. Specifically, "write posting" -needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) -already do this. I/O Port space guarantees write transactions reach the PCI -device before the CPU can continue. Writes to MMIO space allow the CPU -to continue before the transaction reaches the PCI device. 
HW weenies -call this "Write Posting" because the write completion is "posted" to -the CPU before the transaction has reached its destination. - -Thus, timing sensitive code should add readl() where the CPU is -expected to wait before doing other work. The classic "bit banging" -sequence works fine for I/O Port space: - - for (i = 8; --i; val >>= 1) { - outb(val & 1, ioport_reg); /* write bit */ - udelay(10); - } - -The same sequence for MMIO space should be: - - for (i = 8; --i; val >>= 1) { - writeb(val & 1, mmio_reg); /* write bit */ - readb(safe_mmio_reg); /* flush posted write */ - udelay(10); - } - -It is important that "safe_mmio_reg" not have any side effects that -interferes with the correct operation of the device. - -Another case to watch out for is when resetting a PCI device. Use PCI -Configuration space reads to flush the writel(). This will gracefully -handle the PCI master abort on all platforms if the PCI device is -expected to not respond to a readl(). Most x86 platforms will allow -MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage -(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail"). - diff --git a/Documentation/pcieaer-howto.txt b/Documentation/pcieaer-howto.txt deleted file mode 100644 index d5da86170106..000000000000 --- a/Documentation/pcieaer-howto.txt +++ /dev/null @@ -1,253 +0,0 @@ - The PCI Express Advanced Error Reporting Driver Guide HOWTO - T. Long Nguyen - Yanmin Zhang - 07/29/2006 - - -1. Overview - -1.1 About this guide - -This guide describes the basics of the PCI Express Advanced Error -Reporting (AER) driver and provides information on how to use it, as -well as how to enable the drivers of endpoint devices to conform with -PCI Express AER driver. - -1.2 Copyright © Intel Corporation 2006. - -1.3 What is the PCI Express AER Driver? - -PCI Express error signaling can occur on the PCI Express link itself -or on behalf of transactions initiated on the link. 
PCI Express -defines two error reporting paradigms: the baseline capability and -the Advanced Error Reporting capability. The baseline capability is -required of all PCI Express components providing a minimum defined -set of error reporting requirements. Advanced Error Reporting -capability is implemented with a PCI Express advanced error reporting -extended capability structure providing more robust error reporting. - -The PCI Express AER driver provides the infrastructure to support PCI -Express Advanced Error Reporting capability. The PCI Express AER -driver provides three basic functions: - -- Gathers the comprehensive error information if errors occurred. -- Reports error to the users. -- Performs error recovery actions. - -AER driver only attaches root ports which support PCI-Express AER -capability. - - -2. User Guide - -2.1 Include the PCI Express AER Root Driver into the Linux Kernel - -The PCI Express AER Root driver is a Root Port service driver attached -to the PCI Express Port Bus driver. If a user wants to use it, the driver -has to be compiled. Option CONFIG_PCIEAER supports this capability. It -depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and -CONFIG_PCIEAER = y. - -2.2 Load PCI Express AER Root Driver -There is a case where a system has AER support in BIOS. Enabling the AER -Root driver and having AER support in BIOS may result unpredictable -behavior. To avoid this conflict, a successful load of the AER Root driver -requires ACPI _OSC support in the BIOS to allow the AER Root driver to -request for native control of AER. See the PCI FW 3.0 Specification for -details regarding OSC usage. Currently, lots of firmwares don't provide -_OSC support while they use PCI Express. To support such firmwares, -forceload, a parameter of type bool, could enable AER to continue to -be initiated although firmwares have no _OSC support. To enable the -walkaround, pls. add aerdriver.forceload=y to kernel boot parameter line -when booting kernel. 
Note that forceload=n by default. - -2.3 AER error output -When a PCI-E AER error is captured, an error message will be outputed to -console. If it's a correctable error, it is outputed as a warning. -Otherwise, it is printed as an error. So users could choose different -log level to filter out correctable error messages. - -Below shows an example. -+------ PCI-Express Device Error -----+ -Error Severity : Uncorrected (Fatal) -PCIE Bus Error type : Transaction Layer -Unsupported Request : First -Requester ID : 0500 -VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h -TLB Header: -04000001 00200a03 05010000 00050100 - -In the example, 'Requester ID' means the ID of the device who sends -the error message to root port. Pls. refer to pci express specs for -other fields. - - -3. Developer Guide - -To enable AER aware support requires a software driver to configure -the AER capability structure within its device and to provide callbacks. - -To support AER better, developers need understand how AER does work -firstly. - -PCI Express errors are classified into two types: correctable errors -and uncorrectable errors. This classification is based on the impacts -of those errors, which may result in degraded performance or function -failure. - -Correctable errors pose no impacts on the functionality of the -interface. The PCI Express protocol can recover without any software -intervention or any loss of data. These errors are detected and -corrected by hardware. Unlike correctable errors, uncorrectable -errors impact functionality of the interface. Uncorrectable errors -can cause a particular transaction or a particular PCI Express link -to be unreliable. Depending on those error conditions, uncorrectable -errors are further classified into non-fatal errors and fatal errors. -Non-fatal errors cause the particular transaction to be unreliable, -but the PCI Express link itself is fully functional. 
Fatal errors, on -the other hand, cause the link to be unreliable. - -When AER is enabled, a PCI Express device will automatically send an -error message to the PCIE root port above it when the device captures -an error. The Root Port, upon receiving an error reporting message, -internally processes and logs the error message in its PCI Express -capability structure. Error information being logged includes storing -the error reporting agent's requestor ID into the Error Source -Identification Registers and setting the error bits of the Root Error -Status Register accordingly. If AER error reporting is enabled in Root -Error Command Register, the Root Port generates an interrupt if an -error is detected. - -Note that the errors as described above are related to the PCI Express -hierarchy and links. These errors do not include any device specific -errors because device specific errors will still get sent directly to -the device driver. - -3.1 Configure the AER capability structure - -AER aware drivers of PCI Express component need change the device -control registers to enable AER. They also could change AER registers, -including mask and severity registers. Helper function -pci_enable_pcie_error_reporting could be used to enable AER. See -section 3.3. - -3.2. Provide callbacks - -3.2.1 callback reset_link to reset pci express link - -This callback is used to reset the pci express physical link when a -fatal error happens. The root port aer service driver provides a -default reset_link function, but different upstream ports might -have different specifications to reset pci express link, so all -upstream ports should provide their own reset_link functions. - -In struct pcie_port_service_driver, a new pointer, reset_link, is -added. - -pci_ers_result_t (*reset_link) (struct pci_dev *dev); - -Section 3.2.2.2 provides more detailed info on when to call -reset_link. 
- -3.2.2 PCI error-recovery callbacks - -The PCI Express AER Root driver uses error callbacks to coordinate -with downstream device drivers associated with a hierarchy in question -when performing error recovery actions. - -Data struct pci_driver has a pointer, err_handler, to point to -pci_error_handlers who consists of a couple of callback function -pointers. AER driver follows the rules defined in -pci-error-recovery.txt except pci express specific parts (e.g. -reset_link). Pls. refer to pci-error-recovery.txt for detailed -definitions of the callbacks. - -Below sections specify when to call the error callback functions. - -3.2.2.1 Correctable errors - -Correctable errors pose no impacts on the functionality of -the interface. The PCI Express protocol can recover without any -software intervention or any loss of data. These errors do not -require any recovery actions. The AER driver clears the device's -correctable error status register accordingly and logs these errors. - -3.2.2.2 Non-correctable (non-fatal and fatal) errors - -If an error message indicates a non-fatal error, performing link reset -at upstream is not required. The AER driver calls error_detected(dev, -pci_channel_io_normal) to all drivers associated within a hierarchy in -question. for example, -EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. -If Upstream port A captures an AER error, the hierarchy consists of -Downstream port B and EndPoint. - -A driver may return PCI_ERS_RESULT_CAN_RECOVER, -PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on -whether it can recover or the AER driver calls mmio_enabled as next. - -If an error message indicates a fatal error, kernel will broadcast -error_detected(dev, pci_channel_io_frozen) to all drivers within -a hierarchy in question. Then, performing link reset at upstream is -necessary. 
As different kinds of devices might use different approaches -to reset link, AER port service driver is required to provide the -function to reset link. Firstly, kernel looks for if the upstream -component has an aer driver. If it has, kernel uses the reset_link -callback of the aer driver. If the upstream component has no aer driver -and the port is downstream port, we will use the aer driver of the -root port who reports the AER error. As for upstream ports, -they should provide their own aer service drivers with reset_link -function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and -reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes -to mmio_enabled. - -3.3 helper functions - -3.3.1 int pci_find_aer_capability(struct pci_dev *dev); -pci_find_aer_capability locates the PCI Express AER capability -in the device configuration space. If the device doesn't support -PCI-Express AER, the function returns 0. - -3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); -pci_enable_pcie_error_reporting enables the device to send error -messages to root port when an error is detected. Note that devices -don't enable the error reporting by default, so device drivers need -call this function to enable it. - -3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); -pci_disable_pcie_error_reporting disables the device to send error -messages to root port when an error is detected. - -3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); -pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable -error status register. - -3.4 Frequent Asked Questions - -Q: What happens if a PCI Express device driver does not provide an -error recovery handler (pci_driver->err_handler is equal to NULL)? - -A: The devices attached with the driver won't be recovered. If the -error is fatal, kernel will print out warning messages. Please refer -to section 3 for more information. 
- -Q: What happens if an upstream port service driver does not provide -callback reset_link? - -A: Fatal error recovery will fail if the errors are reported by the -upstream ports who are attached by the service driver. - -Q: How does this infrastructure deal with driver that is not PCI -Express aware? - -A: This infrastructure calls the error callback functions of the -driver when an error happens. But if the driver is not aware of -PCI Express, the device might not report its own errors to root -port. - -Q: What modifications will that driver need to make it compatible -with the PCI Express AER Root driver? - -A: It could call the helper functions to enable AER in devices and -cleanup uncorrectable status register. Pls. refer to section 3.3. - -- cgit v1.2.3 From 1ba6ab11d8fbd8d29afec4e39236e1255ae0339a Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Wed, 13 Feb 2008 15:06:38 -0800 Subject: PCI: remove initial bios sort of PCI devices on x86 We currently keep 2 lists of PCI devices in the system, one in the driver core, and one all on its own. This second list is sorted at boot time, in "BIOS" order, to try to remain compatible with older kernels (2.2 and earlier days). There was also a "nosort" option to turn this sorting off, to remain compatible with even older kernel versions, but that just ends up being what we have been doing from 2.5 days... Unfortunately, the second list of devices is not really ever used to determine the probing order of PCI devices or drivers[1]. That is done using the driver core list instead. This change happened back in the early 2.5 days. Relying on BIOS ordering for the binding of drivers to specific device names is problematic for many reasons, and userspace tools like udev exist to properly name devices in a persistent manner if that is needed; no reliance on the BIOS is needed.
Matt Domsch and others at Dell noticed this back in 2006, and added a boot option to sort the PCI device lists (both of them) in a breadth-first manner to help remain compatible with the 2.4 order, if needed for any reason. This option is not going away, as some systems rely on it. This patch removes the sorting of the internal PCI device list in "BIOS" mode, as it's not needed at all anymore, and hasn't been for many years. I've also removed the PCI flags for this from some other arches that for some reason defined them, but never used them. This should not change the ordering of any drivers or device probing. [1] The old-style pci_get_device() and pci_find_device() still used this sorting order, but there are very few drivers that use these functions, as they are deprecated for use in this manner. If for some reason a driver relies on the order and uses these functions, the breadth-first boot option will resolve any problem. Cc: Matt Domsch Signed-off-by: Greg Kroah-Hartman --- Documentation/kernel-parameters.txt | 4 -- arch/frv/mb93090-mb00/pci-frv.h | 2 - arch/mn10300/unit-asb2305/pci-asb2305.h | 2 - arch/sh/drivers/pci/pci-sh4.h | 2 - arch/x86/pci/common.c | 7 ---- arch/x86/pci/pcbios.c | 72 --------------------------------- arch/x86/pci/pci.h | 3 -- include/asm-sh/mpc1211/pci.h | 2 - 8 files changed, 94 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4b0f1ae31a4c..e30d8fe4e4b1 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1461,10 +1461,6 @@ and is between 256 and 4096 characters. It is defined in the file nomsi [MSI] If the PCI_MSI kernel config parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide. - nosort [X86-32] Don't sort PCI devices according to - order given by the PCI BIOS. This sorting is - done to get a device order compatible with - older kernels.
biosirq [X86-32] Use PCI BIOS calls to get the interrupt routing table. These calls are known to be buggy on several machines and they hang the machine diff --git a/arch/frv/mb93090-mb00/pci-frv.h b/arch/frv/mb93090-mb00/pci-frv.h index 7481797ab382..0c7bf39dc729 100644 --- a/arch/frv/mb93090-mb00/pci-frv.h +++ b/arch/frv/mb93090-mb00/pci-frv.h @@ -17,8 +17,6 @@ #define PCI_PROBE_BIOS 0x0001 #define PCI_PROBE_CONF1 0x0002 #define PCI_PROBE_CONF2 0x0004 -#define PCI_NO_SORT 0x0100 -#define PCI_BIOS_SORT 0x0200 #define PCI_NO_CHECKS 0x0400 #define PCI_ASSIGN_ROMS 0x1000 #define PCI_BIOS_IRQ_SCAN 0x2000 diff --git a/arch/mn10300/unit-asb2305/pci-asb2305.h b/arch/mn10300/unit-asb2305/pci-asb2305.h index 84634fa3bce6..9763d1ce343a 100644 --- a/arch/mn10300/unit-asb2305/pci-asb2305.h +++ b/arch/mn10300/unit-asb2305/pci-asb2305.h @@ -23,8 +23,6 @@ #define PCI_PROBE_BIOS 1 #define PCI_PROBE_CONF1 2 #define PCI_PROBE_CONF2 4 -#define PCI_NO_SORT 0x100 -#define PCI_BIOS_SORT 0x200 #define PCI_NO_CHECKS 0x400 #define PCI_ASSIGN_ROMS 0x1000 #define PCI_BIOS_IRQ_SCAN 0x2000 diff --git a/arch/sh/drivers/pci/pci-sh4.h b/arch/sh/drivers/pci/pci-sh4.h index 07e29506080f..a83dcf70c13b 100644 --- a/arch/sh/drivers/pci/pci-sh4.h +++ b/arch/sh/drivers/pci/pci-sh4.h @@ -15,8 +15,6 @@ #define PCI_PROBE_BIOS 1 #define PCI_PROBE_CONF1 2 #define PCI_PROBE_CONF2 4 -#define PCI_NO_SORT 0x100 -#define PCI_BIOS_SORT 0x200 #define PCI_NO_CHECKS 0x400 #define PCI_ASSIGN_ROMS 0x1000 #define PCI_BIOS_IRQ_SCAN 0x2000 diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c index 7b6e3bb9b28c..c9ff4ff66739 100644 --- a/arch/x86/pci/common.c +++ b/arch/x86/pci/common.c @@ -427,10 +427,6 @@ static int __init pcibios_init(void) if (pci_bf_sort >= pci_force_bf) pci_sort_breadthfirst(); -#ifdef CONFIG_PCI_BIOS - if ((pci_probe & PCI_BIOS_SORT) && !(pci_probe & PCI_NO_SORT)) - pcibios_sort(); -#endif return 0; } @@ -455,9 +451,6 @@ char * __devinit pcibios_setup(char *str) } else if (!strcmp(str, 
"nobios")) { pci_probe &= ~PCI_PROBE_BIOS; return NULL; - } else if (!strcmp(str, "nosort")) { - pci_probe |= PCI_NO_SORT; - return NULL; } else if (!strcmp(str, "biosirq")) { pci_probe |= PCI_BIOS_IRQ_SCAN; return NULL; diff --git a/arch/x86/pci/pcbios.c b/arch/x86/pci/pcbios.c index 2f7109ac4c15..37472fc6f729 100644 --- a/arch/x86/pci/pcbios.c +++ b/arch/x86/pci/pcbios.c @@ -152,28 +152,6 @@ static int __devinit check_pcibios(void) return 0; } -static int __devinit pci_bios_find_device (unsigned short vendor, unsigned short device_id, - unsigned short index, unsigned char *bus, unsigned char *device_fn) -{ - unsigned short bx; - unsigned short ret; - - __asm__("lcall *(%%edi); cld\n\t" - "jc 1f\n\t" - "xor %%ah, %%ah\n" - "1:" - : "=b" (bx), - "=a" (ret) - : "1" (PCIBIOS_FIND_PCI_DEVICE), - "c" (device_id), - "d" (vendor), - "S" ((int) index), - "D" (&pci_indirect)); - *bus = (bx >> 8) & 0xff; - *device_fn = bx & 0xff; - return (int) (ret & 0xff00) >> 8; -} - static int pci_bios_read(unsigned int seg, unsigned int bus, unsigned int devfn, int reg, int len, u32 *value) { @@ -363,55 +341,6 @@ static struct pci_raw_ops * __devinit pci_find_bios(void) return NULL; } -/* - * Sort the device list according to PCI BIOS. Nasty hack, but since some - * fool forgot to define the `correct' device order in the PCI BIOS specs - * and we want to be (possibly bug-to-bug ;-]) compatible with older kernels - * which used BIOS ordering, we are bound to do this... 
- */ - -void __devinit pcibios_sort(void) -{ - LIST_HEAD(sorted_devices); - struct list_head *ln; - struct pci_dev *dev, *d; - int idx, found; - unsigned char bus, devfn; - - DBG("PCI: Sorting device list...\n"); - while (!list_empty(&pci_devices)) { - ln = pci_devices.next; - dev = pci_dev_g(ln); - idx = found = 0; - while (pci_bios_find_device(dev->vendor, dev->device, idx, &bus, &devfn) == PCIBIOS_SUCCESSFUL) { - idx++; - list_for_each(ln, &pci_devices) { - d = pci_dev_g(ln); - if (d->bus->number == bus && d->devfn == devfn) { - list_move_tail(&d->global_list, &sorted_devices); - if (d == dev) - found = 1; - break; - } - } - if (ln == &pci_devices) { - printk(KERN_WARNING "PCI: BIOS reporting unknown device %02x:%02x\n", bus, devfn); - /* - * We must not continue scanning as several buggy BIOSes - * return garbage after the last device. Grr. - */ - break; - } - } - if (!found) { - printk(KERN_WARNING "PCI: Device %s not found by BIOS\n", - pci_name(dev)); - list_move_tail(&dev->global_list, &sorted_devices); - } - } - list_splice(&sorted_devices, &pci_devices); -} - /* * BIOS Functions for IRQ Routing */ @@ -495,7 +424,6 @@ void __init pci_pcbios_init(void) { if ((pci_probe & PCI_PROBE_BIOS) && ((raw_pci_ops = pci_find_bios()))) { - pci_probe |= PCI_BIOS_SORT; pci_bios_present = 1; } } diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h index 3431518d921a..02b016a98423 100644 --- a/arch/x86/pci/pci.h +++ b/arch/x86/pci/pci.h @@ -19,8 +19,6 @@ #define PCI_PROBE_MASK 0x000f #define PCI_PROBE_NOEARLY 0x0010 -#define PCI_NO_SORT 0x0100 -#define PCI_BIOS_SORT 0x0200 #define PCI_NO_CHECKS 0x0400 #define PCI_USE_PIRQ_MASK 0x0800 #define PCI_ASSIGN_ROMS 0x1000 @@ -101,7 +99,6 @@ extern int pci_direct_probe(void); extern void pci_direct_init(int type); extern void pci_pcbios_init(void); extern void pci_mmcfg_init(int type); -extern void pcibios_sort(void); /* pci-mmconfig.c */ diff --git a/include/asm-sh/mpc1211/pci.h b/include/asm-sh/mpc1211/pci.h index 
5d3712c3a701..d9162c5ed76a 100644 --- a/include/asm-sh/mpc1211/pci.h +++ b/include/asm-sh/mpc1211/pci.h @@ -24,8 +24,6 @@ #define PCI_PROBE_BIOS 1 #define PCI_PROBE_CONF1 2 #define PCI_PROBE_CONF2 4 -#define PCI_NO_SORT 0x100 -#define PCI_BIOS_SORT 0x200 #define PCI_NO_CHECKS 0x400 #define PCI_ASSIGN_ROMS 0x1000 #define PCI_BIOS_IRQ_SCAN 0x2000 -- cgit v1.2.3 From 5e0d2a6fc094a9b5047998deefeb1254c66856ee Mon Sep 17 00:00:00 2001 From: mark gross Date: Tue, 4 Mar 2008 15:22:08 -0800 Subject: PCI: iommu: iotlb flushing This patch is for batching up the flushing of the IOTLB for the DMAR implementation found in the Intel VT-d hardware. It works by building a list of to be flushed IOTLB entries and a bitmap list of which DMAR engine they are from. After either a high water mark (250 accessible via debugfs) or 10ms the list of iova's will be reclaimed and the DMAR engines associated are IOTLB-flushed. This approach recovers 15 to 20% of the performance lost when using the IOMMU for my netperf udp stream benchmark with small packets. It can be disabled with a kernel boot parameter "intel_iommu=strict". Its use does weaken the IOMMU protections a bit. Signed-off-by: Mark Gross Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman --- Documentation/kernel-parameters.txt | 4 + drivers/pci/intel-iommu.c | 147 +++++++++++++++++++++++++++++++----- drivers/pci/iova.h | 2 + 3 files changed, 135 insertions(+), 18 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e30d8fe4e4b1..f7492cd10093 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -847,6 +847,10 @@ and is between 256 and 4096 characters. It is defined in the file than 32 bit addressing. The default is to look for translation below 32 bit and if not available then look in the higher range. 
+ strict [Default Off] + With this option on every unmap_single operation will + result in a hardware IOTLB flush operation as opposed + to batching them for performance. io_delay= [X86-32,X86-64] I/O delay method 0x80 diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index 4cb949f0ebd9..8690a0d45d7f 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -22,6 +22,7 @@ #include #include +#include #include #include #include @@ -31,6 +32,7 @@ #include #include #include +#include #include "iova.h" #include "intel-iommu.h" #include /* force_iommu in this header in x86-64*/ @@ -51,11 +53,32 @@ #define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1) + +static void flush_unmaps_timeout(unsigned long data); + +DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0); + +static struct intel_iommu *g_iommus; +/* bitmap for indexing intel_iommus */ +static unsigned long *g_iommus_to_flush; +static int g_num_of_iommus; + +static DEFINE_SPINLOCK(async_umap_flush_lock); +static LIST_HEAD(unmaps_to_do); + +static int timer_on; +static long list_size; +static int high_watermark; + +static struct dentry *intel_iommu_debug, *debug; + + static void domain_remove_dev_info(struct dmar_domain *domain); static int dmar_disabled; static int __initdata dmar_map_gfx = 1; static int dmar_forcedac; +static int intel_iommu_strict; #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) static DEFINE_SPINLOCK(device_domain_lock); @@ -74,9 +97,13 @@ static int __init intel_iommu_setup(char *str) printk(KERN_INFO "Intel-IOMMU: disable GFX device mapping\n"); } else if (!strncmp(str, "forcedac", 8)) { - printk (KERN_INFO + printk(KERN_INFO "Intel-IOMMU: Forcing DAC for PCI devices\n"); dmar_forcedac = 1; + } else if (!strncmp(str, "strict", 6)) { + printk(KERN_INFO + "Intel-IOMMU: disable batched IOTLB flush\n"); + intel_iommu_strict = 1; } str += strcspn(str, ","); @@ -966,17 +993,13 @@ static int iommu_init_domains(struct intel_iommu *iommu) set_bit(0, 
iommu->domain_ids); return 0; } - -static struct intel_iommu *alloc_iommu(struct dmar_drhd_unit *drhd) +static struct intel_iommu *alloc_iommu(struct intel_iommu *iommu, + struct dmar_drhd_unit *drhd) { - struct intel_iommu *iommu; int ret; int map_size; u32 ver; - iommu = kzalloc(sizeof(*iommu), GFP_KERNEL); - if (!iommu) - return NULL; iommu->reg = ioremap(drhd->reg_base_addr, PAGE_SIZE_4K); if (!iommu->reg) { printk(KERN_ERR "IOMMU: can't map the region\n"); @@ -1404,7 +1427,7 @@ static int dmar_pci_device_match(struct pci_dev *devices[], int cnt, int index; while (dev) { - for (index = 0; index < cnt; index ++) + for (index = 0; index < cnt; index++) if (dev == devices[index]) return 1; @@ -1669,7 +1692,7 @@ int __init init_dmars(void) struct dmar_rmrr_unit *rmrr; struct pci_dev *pdev; struct intel_iommu *iommu; - int ret, unit = 0; + int nlongs, i, ret, unit = 0; /* * for each drhd @@ -1680,7 +1703,35 @@ int __init init_dmars(void) for_each_drhd_unit(drhd) { if (drhd->ignored) continue; - iommu = alloc_iommu(drhd); + g_num_of_iommus++; + /* + * lock not needed as this is only incremented in the single + * threaded kernel __init code path all other access are read + * only + */ + } + + nlongs = BITS_TO_LONGS(g_num_of_iommus); + g_iommus_to_flush = kzalloc(nlongs * sizeof(unsigned long), GFP_KERNEL); + if (!g_iommus_to_flush) { + printk(KERN_ERR "Intel-IOMMU: " + "Allocating bitmap array failed\n"); + return -ENOMEM; + } + + g_iommus = kzalloc(g_num_of_iommus * sizeof(*iommu), GFP_KERNEL); + if (!g_iommus) { + kfree(g_iommus_to_flush); + ret = -ENOMEM; + goto error; + } + + i = 0; + for_each_drhd_unit(drhd) { + if (drhd->ignored) + continue; + iommu = alloc_iommu(&g_iommus[i], drhd); + i++; if (!iommu) { ret = -ENOMEM; goto error; @@ -1713,7 +1764,6 @@ int __init init_dmars(void) * endfor */ for_each_rmrr_units(rmrr) { - int i; for (i = 0; i < rmrr->devices_cnt; i++) { pdev = rmrr->devices[i]; /* some BIOS lists non-exist devices in DMAR table */ @@ -1769,6 
+1819,7 @@ error: iommu = drhd->iommu; free_iommu(iommu); } + kfree(g_iommus); return ret; } @@ -1917,6 +1968,53 @@ error: return 0; } +static void flush_unmaps(void) +{ + struct iova *node, *n; + unsigned long flags; + int i; + + spin_lock_irqsave(&async_umap_flush_lock, flags); + timer_on = 0; + + /* just flush them all */ + for (i = 0; i < g_num_of_iommus; i++) { + if (test_and_clear_bit(i, g_iommus_to_flush)) + iommu_flush_iotlb_global(&g_iommus[i], 0); + } + + list_for_each_entry_safe(node, n, &unmaps_to_do, list) { + /* free iova */ + list_del(&node->list); + __free_iova(&((struct dmar_domain *)node->dmar)->iovad, node); + + } + list_size = 0; + spin_unlock_irqrestore(&async_umap_flush_lock, flags); +} + +static void flush_unmaps_timeout(unsigned long data) +{ + flush_unmaps(); +} + +static void add_unmap(struct dmar_domain *dom, struct iova *iova) +{ + unsigned long flags; + + spin_lock_irqsave(&async_umap_flush_lock, flags); + iova->dmar = dom; + list_add(&iova->list, &unmaps_to_do); + set_bit((dom->iommu - g_iommus), g_iommus_to_flush); + + if (!timer_on) { + mod_timer(&unmap_timer, jiffies + msecs_to_jiffies(10)); + timer_on = 1; + } + list_size++; + spin_unlock_irqrestore(&async_umap_flush_lock, flags); +} + static void intel_unmap_single(struct device *dev, dma_addr_t dev_addr, size_t size, int dir) { @@ -1944,13 +2042,21 @@ static void intel_unmap_single(struct device *dev, dma_addr_t dev_addr, dma_pte_clear_range(domain, start_addr, start_addr + size); /* free page tables */ dma_pte_free_pagetable(domain, start_addr, start_addr + size); - - if (iommu_flush_iotlb_psi(domain->iommu, domain->id, start_addr, - size >> PAGE_SHIFT_4K, 0)) - iommu_flush_write_buffer(domain->iommu); - - /* free iova */ - __free_iova(&domain->iovad, iova); + if (intel_iommu_strict) { + if (iommu_flush_iotlb_psi(domain->iommu, + domain->id, start_addr, size >> PAGE_SHIFT_4K, 0)) + iommu_flush_write_buffer(domain->iommu); + /* free iova */ + __free_iova(&domain->iovad, iova); + 
} else { + add_unmap(domain, iova); + /* + * queue up the release of the unmap to save the 1/6th of the + * cpu used up by the iotlb flush operation... + */ + if (list_size > high_watermark) + flush_unmaps(); + } } static void * intel_alloc_coherent(struct device *hwdev, size_t size, @@ -2274,6 +2380,10 @@ int __init intel_iommu_init(void) if (dmar_table_init()) return -ENODEV; + high_watermark = 250; + intel_iommu_debug = debugfs_create_dir("intel_iommu", NULL); + debug = debugfs_create_u32("high_watermark", S_IWUGO | S_IRUGO, + intel_iommu_debug, &high_watermark); iommu_init_mempool(); dmar_init_reserved_ranges(); @@ -2289,6 +2399,7 @@ int __init intel_iommu_init(void) printk(KERN_INFO "PCI-DMA: Intel(R) Virtualization Technology for Directed I/O\n"); + init_timer(&unmap_timer); force_iommu = 1; dma_ops = &intel_dma_ops; return 0; diff --git a/drivers/pci/iova.h b/drivers/pci/iova.h index 228f6c94b69c..2f1317801b20 100644 --- a/drivers/pci/iova.h +++ b/drivers/pci/iova.h @@ -24,6 +24,8 @@ struct iova { struct rb_node node; unsigned long pfn_hi; /* IOMMU dish out addr hi */ unsigned long pfn_lo; /* IOMMU dish out addr lo */ + struct list_head list; + void *dmar; }; /* holds all the iova translations for a domain */ -- cgit v1.2.3 From 94e6108803469a37ee1e3c92dafdd1d59298602f Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Wed, 5 Mar 2008 16:52:39 +0000 Subject: PCI: Expose PCI VPD through sysfs Vital Product Data (VPD) may be exposed by PCI devices in several ways. It is generally unsafe to read this information through the existing interfaces to user-land because of stateful interfaces. 
This adds: - abstract operations for VPD access (struct pci_vpd_ops) - VPD state information in struct pci_dev (struct pci_vpd) - an implementation of the VPD access method specified in PCI 2.2 (in access.c) - a 'vpd' binary file in sysfs directories for PCI devices with VPD operations defined It adds a probe for PCI 2.2 VPD in pci_scan_device() and release of VPD state in pci_release_dev(). Signed-off-by: Ben Hutchings Signed-off-by: Greg Kroah-Hartman --- Documentation/ABI/testing/sysfs-bus-pci | 11 +++ drivers/pci/access.c | 166 ++++++++++++++++++++++++++++++++ drivers/pci/pci-sysfs.c | 109 ++++++++++++++++++--- drivers/pci/pci.h | 19 ++++ drivers/pci/probe.c | 3 + include/linux/pci.h | 3 + 6 files changed, 297 insertions(+), 14 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-bus-pci (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci new file mode 100644 index 000000000000..ceddcff4082a --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -0,0 +1,11 @@ +What: /sys/bus/pci/devices/.../vpd +Date: February 2008 +Contact: Ben Hutchings +Description: + A file named vpd in a device directory will be a + binary file containing the Vital Product Data for the + device. It should follow the VPD format defined in + PCI Specification 2.1 or 2.2, but users should consider + that some devices may have malformatted data. If the + underlying VPD has a writable section then the + corresponding section of this file will be writable. 
diff --git a/drivers/pci/access.c b/drivers/pci/access.c index fc405f0165d9..ec8f7002b09d 100644 --- a/drivers/pci/access.c +++ b/drivers/pci/access.c @@ -1,3 +1,4 @@ +#include #include #include #include @@ -126,6 +127,171 @@ PCI_USER_WRITE_CONFIG(byte, u8) PCI_USER_WRITE_CONFIG(word, u16) PCI_USER_WRITE_CONFIG(dword, u32) +/* VPD access through PCI 2.2+ VPD capability */ + +#define PCI_VPD_PCI22_SIZE (PCI_VPD_ADDR_MASK + 1) + +struct pci_vpd_pci22 { + struct pci_vpd base; + spinlock_t lock; /* controls access to hardware and the flags */ + u8 cap; + bool busy; + bool flag; /* value of F bit to wait for */ +}; + +/* Wait for last operation to complete */ +static int pci_vpd_pci22_wait(struct pci_dev *dev) +{ + struct pci_vpd_pci22 *vpd = + container_of(dev->vpd, struct pci_vpd_pci22, base); + u16 flag, status; + int wait; + int ret; + + if (!vpd->busy) + return 0; + + flag = vpd->flag ? PCI_VPD_ADDR_F : 0; + wait = vpd->flag ? 10 : 1000; /* read: 100 us; write: 10 ms */ + for (;;) { + ret = pci_user_read_config_word(dev, + vpd->cap + PCI_VPD_ADDR, + &status); + if (ret < 0) + return ret; + if ((status & PCI_VPD_ADDR_F) == flag) { + vpd->busy = false; + return 0; + } + if (wait-- == 0) + return -ETIMEDOUT; + udelay(10); + } +} + +static int pci_vpd_pci22_read(struct pci_dev *dev, int pos, int size, + char *buf) +{ + struct pci_vpd_pci22 *vpd = + container_of(dev->vpd, struct pci_vpd_pci22, base); + u32 val; + int ret; + int begin, end, i; + + if (pos < 0 || pos > PCI_VPD_PCI22_SIZE || + size > PCI_VPD_PCI22_SIZE - pos) + return -EINVAL; + if (size == 0) + return 0; + + spin_lock_irq(&vpd->lock); + ret = pci_vpd_pci22_wait(dev); + if (ret < 0) + goto out; + ret = pci_user_write_config_word(dev, vpd->cap + PCI_VPD_ADDR, + pos & ~3); + if (ret < 0) + goto out; + vpd->busy = true; + vpd->flag = 1; + ret = pci_vpd_pci22_wait(dev); + if (ret < 0) + goto out; + ret = pci_user_read_config_dword(dev, vpd->cap + PCI_VPD_DATA, + &val); +out: + spin_unlock_irq(&vpd->lock); + if 
(ret < 0) + return ret; + + /* Convert to bytes */ + begin = pos & 3; + end = min(4, begin + size); + for (i = 0; i < end; ++i) { + if (i >= begin) + *buf++ = val; + val >>= 8; + } + return end - begin; +} + +static int pci_vpd_pci22_write(struct pci_dev *dev, int pos, int size, + const char *buf) +{ + struct pci_vpd_pci22 *vpd = + container_of(dev->vpd, struct pci_vpd_pci22, base); + u32 val; + int ret; + + if (pos < 0 || pos > PCI_VPD_PCI22_SIZE || pos & 3 || + size > PCI_VPD_PCI22_SIZE - pos || size < 4) + return -EINVAL; + + val = (u8) *buf++; + val |= ((u8) *buf++) << 8; + val |= ((u8) *buf++) << 16; + val |= ((u32)(u8) *buf++) << 24; + + spin_lock_irq(&vpd->lock); + ret = pci_vpd_pci22_wait(dev); + if (ret < 0) + goto out; + ret = pci_user_write_config_dword(dev, vpd->cap + PCI_VPD_DATA, + val); + if (ret < 0) + goto out; + ret = pci_user_write_config_word(dev, vpd->cap + PCI_VPD_ADDR, + pos | PCI_VPD_ADDR_F); + if (ret < 0) + goto out; + vpd->busy = true; + vpd->flag = 0; + ret = pci_vpd_pci22_wait(dev); +out: + spin_unlock_irq(&vpd->lock); + if (ret < 0) + return ret; + + return 4; +} + +static int pci_vpd_pci22_get_size(struct pci_dev *dev) +{ + return PCI_VPD_PCI22_SIZE; +} + +static void pci_vpd_pci22_release(struct pci_dev *dev) +{ + kfree(container_of(dev->vpd, struct pci_vpd_pci22, base)); +} + +static struct pci_vpd_ops pci_vpd_pci22_ops = { + .read = pci_vpd_pci22_read, + .write = pci_vpd_pci22_write, + .get_size = pci_vpd_pci22_get_size, + .release = pci_vpd_pci22_release, +}; + +int pci_vpd_pci22_init(struct pci_dev *dev) +{ + struct pci_vpd_pci22 *vpd; + u8 cap; + + cap = pci_find_capability(dev, PCI_CAP_ID_VPD); + if (!cap) + return -ENODEV; + vpd = kzalloc(sizeof(*vpd), GFP_ATOMIC); + if (!vpd) + return -ENOMEM; + + vpd->base.ops = &pci_vpd_pci22_ops; + spin_lock_init(&vpd->lock); + vpd->cap = cap; + vpd->busy = false; + dev->vpd = &vpd->base; + return 0; +} + /** * pci_block_user_cfg_access - Block userspace PCI config reads/writes * @dev: pci 
device struct diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c index f5b0b622c189..ae9a7695be97 100644 --- a/drivers/pci/pci-sysfs.c +++ b/drivers/pci/pci-sysfs.c @@ -343,6 +343,58 @@ pci_write_config(struct kobject *kobj, struct bin_attribute *bin_attr, return count; } +static ssize_t +pci_read_vpd(struct kobject *kobj, struct bin_attribute *bin_attr, + char *buf, loff_t off, size_t count) +{ + struct pci_dev *dev = + to_pci_dev(container_of(kobj, struct device, kobj)); + int end; + int ret; + + if (off > bin_attr->size) + count = 0; + else if (count > bin_attr->size - off) + count = bin_attr->size - off; + end = off + count; + + while (off < end) { + ret = dev->vpd->ops->read(dev, off, end - off, buf); + if (ret < 0) + return ret; + buf += ret; + off += ret; + } + + return count; +} + +static ssize_t +pci_write_vpd(struct kobject *kobj, struct bin_attribute *bin_attr, + char *buf, loff_t off, size_t count) +{ + struct pci_dev *dev = + to_pci_dev(container_of(kobj, struct device, kobj)); + int end; + int ret; + + if (off > bin_attr->size) + count = 0; + else if (count > bin_attr->size - off) + count = bin_attr->size - off; + end = off + count; + + while (off < end) { + ret = dev->vpd->ops->write(dev, off, end - off, buf); + if (ret < 0) + return ret; + buf += ret; + off += ret; + } + + return count; +} + #ifdef HAVE_PCI_LEGACY /** * pci_read_legacy_io - read byte(s) from legacy I/O port space @@ -611,7 +663,7 @@ int __attribute__ ((weak)) pcibios_add_platform_entries(struct pci_dev *dev) int __must_check pci_create_sysfs_dev_files (struct pci_dev *pdev) { - struct bin_attribute *rom_attr = NULL; + struct bin_attribute *attr = NULL; int retval; if (!sysfs_initialized) @@ -624,22 +676,41 @@ int __must_check pci_create_sysfs_dev_files (struct pci_dev *pdev) if (retval) goto err; + /* If the device has VPD, try to expose it in sysfs. 
*/ + if (pdev->vpd) { + attr = kzalloc(sizeof(*attr), GFP_ATOMIC); + if (attr) { + pdev->vpd->attr = attr; + attr->size = pdev->vpd->ops->get_size(pdev); + attr->attr.name = "vpd"; + attr->attr.mode = S_IRUGO | S_IWUSR; + attr->read = pci_read_vpd; + attr->write = pci_write_vpd; + retval = sysfs_create_bin_file(&pdev->dev.kobj, attr); + if (retval) + goto err_vpd; + } else { + retval = -ENOMEM; + goto err_config_file; + } + } + retval = pci_create_resource_files(pdev); if (retval) - goto err_bin_file; + goto err_vpd_file; /* If the device has a ROM, try to expose it in sysfs. */ if (pci_resource_len(pdev, PCI_ROM_RESOURCE) || (pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW)) { - rom_attr = kzalloc(sizeof(*rom_attr), GFP_ATOMIC); - if (rom_attr) { - pdev->rom_attr = rom_attr; - rom_attr->size = pci_resource_len(pdev, PCI_ROM_RESOURCE); - rom_attr->attr.name = "rom"; - rom_attr->attr.mode = S_IRUSR; - rom_attr->read = pci_read_rom; - rom_attr->write = pci_write_rom; - retval = sysfs_create_bin_file(&pdev->dev.kobj, rom_attr); + attr = kzalloc(sizeof(*attr), GFP_ATOMIC); + if (attr) { + pdev->rom_attr = attr; + attr->size = pci_resource_len(pdev, PCI_ROM_RESOURCE); + attr->attr.name = "rom"; + attr->attr.mode = S_IRUSR; + attr->read = pci_read_rom; + attr->write = pci_write_rom; + retval = sysfs_create_bin_file(&pdev->dev.kobj, attr); if (retval) goto err_rom; } else { @@ -657,12 +728,18 @@ int __must_check pci_create_sysfs_dev_files (struct pci_dev *pdev) err_rom_file: if (pci_resource_len(pdev, PCI_ROM_RESOURCE)) - sysfs_remove_bin_file(&pdev->dev.kobj, rom_attr); + sysfs_remove_bin_file(&pdev->dev.kobj, pdev->rom_attr); err_rom: - kfree(rom_attr); + kfree(pdev->rom_attr); err_resource_files: pci_remove_resource_files(pdev); -err_bin_file: +err_vpd_file: + if (pdev->vpd) { + sysfs_remove_bin_file(&pdev->dev.kobj, pdev->vpd->attr); +err_vpd: + kfree(pdev->vpd->attr); + } +err_config_file: if (pdev->cfg_size < 4096) 
sysfs_remove_bin_file(&pdev->dev.kobj, &pci_config_attr); else @@ -684,6 +761,10 @@ void pci_remove_sysfs_dev_files(struct pci_dev *pdev) pcie_aspm_remove_sysfs_dev_files(pdev); + if (pdev->vpd) { + sysfs_remove_bin_file(&pdev->dev.kobj, pdev->vpd->attr); + kfree(pdev->vpd->attr); + } if (pdev->cfg_size < 4096) sysfs_remove_bin_file(&pdev->dev.kobj, &pci_config_attr); else diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index eabeb1f2ec99..0a497c1b4227 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -18,6 +18,25 @@ extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val); extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val); extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 val); +struct pci_vpd_ops { + int (*read)(struct pci_dev *dev, int pos, int size, char *buf); + int (*write)(struct pci_dev *dev, int pos, int size, const char *buf); + int (*get_size)(struct pci_dev *dev); + void (*release)(struct pci_dev *dev); +}; + +struct pci_vpd { + struct pci_vpd_ops *ops; + struct bin_attribute *attr; /* descriptor for sysfs VPD entry */ +}; + +extern int pci_vpd_pci22_init(struct pci_dev *dev); +static inline void pci_vpd_release(struct pci_dev *dev) +{ + if (dev->vpd) + dev->vpd->ops->release(dev); +} + /* PCI /proc functions */ #ifdef CONFIG_PROC_FS extern int pci_proc_attach_device(struct pci_dev *dev); diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index 284ef392c3ea..c2e99fd87faf 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -794,6 +794,7 @@ static void pci_release_dev(struct device *dev) struct pci_dev *pci_dev; pci_dev = to_pci_dev(dev); + pci_vpd_release(pci_dev); kfree(pci_dev); } @@ -933,6 +934,8 @@ pci_scan_device(struct pci_bus *bus, int devfn) return NULL; } + pci_vpd_pci22_init(dev); + return dev; } diff --git a/include/linux/pci.h b/include/linux/pci.h index e2f46b05cf8b..292491324b01 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ 
-20,6 +20,8 @@ /* Include the pci register defines */ #include +struct pci_vpd; + /* * The PCI interface treats multi-function devices as independent * devices. The slot/function address of each device is encoded @@ -206,6 +208,7 @@ struct pci_dev { #ifdef CONFIG_PCI_MSI struct list_head msi_list; #endif + struct pci_vpd *vpd; }; extern struct pci_dev *alloc_pci_dev(void); -- cgit v1.2.3
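
Note on the VPD patch above: pci_vpd_pci22_read() fetches one 32-bit data word per call and returns only the bytes that fall inside the requested range, while pci_vpd_pci22_write() packs four bytes into one word, lowest address first. The following is a minimal userspace sketch of that byte handling only. The names vpd_unpack_dword and vpd_pack_dword are hypothetical and do not appear in the patch, and the sketch leaves out config-space access, locking, and the wait on the F bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): copy the bytes of one
 * VPD data dword that fall inside [pos, pos + size) into buf.
 * Returns the number of bytes stored, at most 4 - (pos & 3).
 */
static int vpd_unpack_dword(uint32_t val, int pos, int size, unsigned char *buf)
{
	int begin = pos & 3;	/* offset of pos inside its dword */
	int end = begin + size < 4 ? begin + size : 4;
	int i;

	assert(buf != NULL);
	if (size <= 0)
		return 0;

	for (i = 0; i < end; i++) {
		if (i >= begin)
			*buf++ = (unsigned char)(val & 0xff);
		val >>= 8;	/* move on to the next byte lane */
	}
	return end - begin;
}

/*
 * Hypothetical inverse for the write path: pack four bytes into one
 * dword, lowest address in the least significant byte.
 */
static uint32_t vpd_pack_dword(const unsigned char *buf)
{
	return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
	       ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}
```

Because a single call can return fewer bytes than requested, the caller has to loop, advancing the offset and the buffer by the returned count until the range is covered, which appears to be why pci_read_vpd() in the patch is written as a while loop.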