Error Management Solutions Synergy With WHEA
John Strange
Software Design Engineer
Core OS
JohnStra @ microsoft.com
Microsoft Corporation
Session Outline
WHEA Overview
Hardware Error Sources
Hardware Error Management Solutions
WHEA Integration
PCI Express Advanced Error Reporting (AER) Example
Session Goals
Attendees should leave this session with the following:
A good understanding of how platform hardware/firmware, device drivers, and error management software integrate with WHEA
Knowledge of where to find resources for WHEA
WHEA Overview
Architecture - Overview
[Architecture diagram. Labels recoverable from the slide:]
User mode: HW Error Event Consumer (receives ETW Event)
Kernel mode: I/O Bus Driver, WheaReportHwErr, HAL, LLHEHs (MCE LLHEH, CPEI LLHEH, PCIe LLHEH), Platform-specific Hardware Error Driver with Plug-Ins
Other labels: Platform (HW/FW), Other Error Source
Key Components
Platform Specific Hardware Error Driver (PSHED)
Low-Level Hardware Error Handler (LLHEH)
WheaReportHwError – Entry point to OS common error handling
Error Record – Common OS error record
Error Event Consumers
Error Sources
Hardware Error Source
An error source is a mechanism that notifies software of hardware error conditions and provides information to describe the error condition
Notification may be via interrupt, polling of error status registers, or callback from system firmware
Error data may be recorded in hardware registers, mapped to PCI configuration space, provided by a system firmware interface, etc.
Hardware Error Sources and WHEA
WHEA targets platform-level error sources
Platform-level error sources usually aggregate error reporting for multiple devices
Error Source               Hardware
Machine Check              Processor, Cache, TLBs, Memory
Corrected Platform Error   Memory controller
Non-maskable Interrupt     I/O Bus
PCI Express                Device, Root Complex
Managing Error Sources
Managing Hardware Error Sources
WHEA enables management of error sources
A number of attributes associated with a given error source may be manageable
Platform OEMs specify this functionality
They can decide which attributes are exposed to be viewed and/or modified
WHEA enables programmatic control over the attributes associated with an error source
Whether an error source is enabled/disabled
Thresholds associated with an error source
Control register settings of a particular error source
Error Severity Mappings
Error Masking Settings
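As a rough illustration of OEM-controlled attribute exposure, the sketch below models an error source whose attributes can each be hidden, read-only, or read-write. Every name here (`error_source`, `attr_access`, `error_source_get`, `error_source_set`) is hypothetical and is not part of any WHEA interface; the slides do not specify the real data layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute identifiers; not WHEA definitions. */
enum attr_id { ATTR_ENABLED, ATTR_THRESHOLD, ATTR_ERROR_MASK, ATTR_COUNT };

/* Assumed model: the platform OEM chooses the exposure of each attribute. */
enum attr_access { ACCESS_HIDDEN, ACCESS_READ_ONLY, ACCESS_READ_WRITE };

struct error_source {
    uint32_t value[ATTR_COUNT];
    enum attr_access access[ATTR_COUNT];
};

/* Reads an attribute unless the OEM has hidden it. */
bool error_source_get(const struct error_source *src, enum attr_id id,
                      uint32_t *out)
{
    assert(src != NULL && out != NULL);
    if ((unsigned)id >= ATTR_COUNT || src->access[id] == ACCESS_HIDDEN)
        return false;
    *out = src->value[id];
    return true;
}

/* Writes an attribute only when the OEM has marked it modifiable. */
bool error_source_set(struct error_source *src, enum attr_id id,
                      uint32_t value)
{
    assert(src != NULL);
    if ((unsigned)id >= ATTR_COUNT || src->access[id] != ACCESS_READ_WRITE)
        return false;
    src->value[id] = value;
    return true;
}
```

In this model a management application could disable an error source whose enable attribute is read-write, while a read-only threshold would reject the write and keep its platform-defined value.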
Managing Hardware Error Sources (cont’d)
OS queries the PSHED for a table of all the error sources on a given platform
PSHED interfaces with the platform to extract this information and return it to the OS
The OS makes this information available to management applications
Some of this information may be settable only by privileged entities
These interfaces will be available during OS install, so platform-appropriate settings may be applied during setup
This capability solves BIOS/OS conflicts over error source settings
Hardware Error Management Solutions
Existing hardware error management solutions are necessarily proprietary
Even those based on standards such as the Intelligent Platform Management Interface (IPMI) record error information in proprietary format in the SEL (system event log)
A generic SDR (sensor data record) is used and record size constraints limit the richness of the error records
Proprietary applications can consume and perform management operations on the proprietary error data
These applications retrieve the error information in a proprietary manner – usually via a collection of device drivers that present the information to the management application
Hardware Error Management Solutions (cont’d)
WHEA enables generic hardware error management solutions
Published error record format
ETW-based error eventing model allows management applications to subscribe for the events in which they are interested
WHEA permits value-add extensibility by having unstructured (e.g. proprietary) error data added to error records
WHEA error records are potentially very rich in content and include OS context information to aid in problem diagnosis and resolution
WHEA Integration
WHEA Integration
How do solution providers integrate with WHEA?
System firmware/platform support
Implement platform interfaces required by WHEA (e.g. Error Source Discovery and Error Record Serialization)
PSHED Plug-ins
Augment and/or override the behavior of the default per-processor-architecture PSHED
LLHEHs
Device drivers for some hardware error sources may be made WHEA aware to report hardware errors to the system
Consumer Applications
User-mode applications that perform health-monitoring and other higher-level error management functions
WHEA Integration - Platform Support
OEMs will be required to implement at least minimal WHEA support to obtain Logo
Error Source Discovery
Error Record Serialization
Opportunities exist for even tighter integration with the OS
Adopting the WHEA error record format as the platform’s native error record
Improved platform-level mechanisms for reporting error conditions to the OS (e.g. using extended PCI config space and a structured error data format)
WHEA Integration - LLHEHs
Bus drivers might be in charge of error sources that need to be exposed to WHEA
Endpoint devices are not expected to do this
Device drivers that fall into this category implement LLHEHs which handle errors and report them to the kernel
WHEA Integration - PSHED Plug-Ins
The PSHED houses all hardware error related interactions between the OS and the platform
The PSHED represents an opportunity for OEMs to rethink how some error handling features are implemented
Some functionality may be moved into the PSHED rather than BIOS/FW
Portions of the functionality may stay in BIOS/FW and PSHED plug-ins may interface with these functions
WHEA Integration – Management Applications
Management applications implement high-level error monitoring, reporting, and potentially recovery capabilities
These applications subscribe to receive error event notifications via ETW
Generic processing of all error events is possible given the common error record format
Extended processing of error events is possible through unstructured (private) error information recorded in the error record
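The split between generic and extended processing can be sketched as follows. This is a simplified model, not the published WHEA record format: the `error_record` layout, the section type tags, and the `vendor_id` field are all assumptions made for illustration. A consumer handles every standard section generically and decodes a private section only when it recognizes the vendor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SECTIONS 4

/* Hypothetical section tags; real records identify sections differently. */
enum section_type { SECTION_STANDARD = 1, SECTION_PRIVATE = 2 };

struct record_section {
    enum section_type type;
    uint32_t vendor_id;      /* meaningful only for SECTION_PRIVATE */
    const uint8_t *data;
    size_t length;
};

struct error_record {
    int severity;
    size_t section_count;
    struct record_section sections[MAX_SECTIONS];
};

struct record_summary {
    int severity;
    size_t standard_sections;
    size_t private_sections_decoded;
    size_t private_sections_skipped;
};

/* Generic pass over a record: standard sections are always counted,
 * private sections only when the consumer knows the vendor's layout. */
struct record_summary summarize_record(const struct error_record *rec,
                                       uint32_t known_vendor_id)
{
    struct record_summary out = {0};
    assert(rec != NULL && rec->section_count <= MAX_SECTIONS);
    out.severity = rec->severity;
    for (size_t i = 0; i < rec->section_count; i++) {
        const struct record_section *s = &rec->sections[i];
        if (s->type == SECTION_STANDARD)
            out.standard_sections++;
        else if (s->vendor_id == known_vendor_id)
            out.private_sections_decoded++;
        else
            out.private_sections_skipped++;
    }
    return out;
}
```

A generic consumer would pass a vendor ID it never matches and still obtain the severity and standard section count; a vendor-specific consumer gets the same summary plus its own private data.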
PCI Express Example
PCI Express AER Example
PCI Express Advanced Error Reporting (AER) represents a good technology to use in an example
This example will show how PCI Express AER support can be integrated into WHEA
PCI Express AER Example – Platform-Level Support
The platform BIOS must surface PCI Express AER as a platform error source
Possible mechanisms include: ACPI Table or EFI runtime interface
The platform must grant OS control of PCI Express error handling via ACPI _OSC
Assume our example platform implements some non-standard PCI Express error registers that capture platform-specific information in addition to the standardized AER error registers.
PCI Express AER Example – LLHEH
The PCI bus driver will implement the root port interrupt handler which receives error interrupts
Therefore, the PCI bus driver will implement the LLHEH for PCI Express AER
To accomplish this, the PCI bus driver must:
Implement an ErrorSourceInitializer callback routine to initialize error reporting resources
From its DriverEntry routine, register the ErrorSourceInitializer callback by calling WheaRegisterErrSrcInitializer
After the initializer routine has been called, the bus driver can report hardware errors to the kernel
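The ordering constraint described above (register a callback, wait for it to be invoked, only then report errors) can be modeled in plain C. This is not the WheaRegisterErrSrcInitializer signature, which the slides do not give; `initializer_slot`, `register_initializer`, `run_initializer`, and `aer_initializer` are hypothetical stand-ins for the driver and kernel roles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; the real prototype is not given here. */
typedef bool (*error_source_initializer)(void *context);

struct initializer_slot {
    error_source_initializer callback;
    void *context;
    bool ran;   /* true once the callback has completed successfully */
};

/* Stand-in for the registration call made from DriverEntry. */
bool register_initializer(struct initializer_slot *slot,
                          error_source_initializer callback, void *context)
{
    assert(slot != NULL);
    if (callback == NULL || slot->callback != NULL)
        return false;   /* nothing to register, or already registered */
    slot->callback = callback;
    slot->context = context;
    slot->ran = false;
    return true;
}

/* Stand-in for the kernel invoking the registered callback later. */
bool run_initializer(struct initializer_slot *slot)
{
    assert(slot != NULL);
    if (slot->callback == NULL)
        return false;
    slot->ran = slot->callback(slot->context);
    return slot->ran;
}

/* The driver may report errors only after initialization succeeded. */
bool can_report_errors(const struct initializer_slot *slot)
{
    assert(slot != NULL);
    return slot->ran;
}

/* Example initializer: marks hypothetical AER resources as ready. */
struct aer_state { bool interrupts_enabled; };

bool aer_initializer(void *context)
{
    struct aer_state *state = context;
    if (state == NULL)
        return false;
    state->interrupts_enabled = true;
    return true;
}
```

Registering alone does not make the driver ready in this model; reporting stays gated until the callback has actually run.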
PCI Express AER Example – LLHEH (cont’d)
Upon detecting a PCI Express error, the PCI bus driver does the following
Creates and initializes a WHEA_ERROR_PACKET using the error information it extracts from the PCI Express AER error status in extended config space
The driver is responsible for mapping the error severity reported by the device into one of WHEA’s error severity levels
Calls the PSHED’s PshedRetrieveErrorInfo routine, passing a pointer to the WHEA_ERROR_PACKET
Calls WheaReportHwError, supplying a pointer to the WHEA_ERROR_PACKET
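The three steps above can be sketched as a single handler. The packet layout, the severity names, and the hook types below are illustrative assumptions, not the real WHEA_ERROR_PACKET or routine prototypes. The mapping shown (correctable to corrected, non-fatal to recoverable, fatal to fatal) is one plausible choice; the slides leave the actual mapping to the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PCI Express AER classifies errors into these three classes. */
enum aer_error_class { AER_CORRECTABLE, AER_NONFATAL, AER_FATAL };

/* Hypothetical severity levels standing in for WHEA's. */
enum report_severity { SEVERITY_CORRECTED, SEVERITY_RECOVERABLE,
                       SEVERITY_FATAL };

/* Simplified stand-in for an error packet. */
struct error_packet {
    enum report_severity severity;
    uint32_t aer_status;   /* raw AER status read from config space */
    uint32_t extra_info;   /* filled in by the retrieve hook, if any */
};

typedef void (*retrieve_info_hook)(struct error_packet *packet);
typedef void (*report_hook)(const struct error_packet *packet);

/* Driver-side mapping; unknown classes are treated as fatal. */
enum report_severity map_severity(enum aer_error_class cls)
{
    switch (cls) {
    case AER_CORRECTABLE: return SEVERITY_CORRECTED;
    case AER_NONFATAL:    return SEVERITY_RECOVERABLE;
    default:              return SEVERITY_FATAL;
    }
}

/* Build the packet, let the platform add information, then report. */
struct error_packet handle_aer_error(enum aer_error_class cls,
                                     uint32_t status,
                                     retrieve_info_hook retrieve,
                                     report_hook report)
{
    struct error_packet packet = { map_severity(cls), status, 0 };
    assert(report != NULL);
    if (retrieve != NULL)
        retrieve(&packet);   /* stands in for PshedRetrieveErrorInfo */
    report(&packet);         /* stands in for WheaReportHwError */
    return packet;
}

/* Sample hooks used only to exercise the flow. */
unsigned reported_count;
enum report_severity last_reported_severity;

void sample_retrieve(struct error_packet *packet)
{
    packet->extra_info = 0x5A;
}

void sample_report(const struct error_packet *packet)
{
    reported_count++;
    last_reported_severity = packet->severity;
}
```

Passing the hooks as function pointers keeps the order of operations visible: the platform sees the packet before it is reported, never after.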
PCI Express AER Example – PSHED
Remember, our example platform implements a set of non-standard PCI Express error registers
A PSHED plug-in might participate in the error source discovery functionality to ensure that the OS sizes the WHEA_ERROR_PACKET for the PCI Express error source to accommodate the additional error information
PshedRetrieveErrorInfo is called by the LLHEH when it detects an error condition
A plug-in could extract the information in the non-standard error registers and add that information to the error packet
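A minimal sketch of the sizing idea, assuming a packet whose data area holds a fixed block of standard AER data followed by optional plug-in data. The sizes, the `pshed_plugin` structure, and both function names are hypothetical. The point is that discovery-time sizing and retrieve-time filling must agree on how much extra space the plug-in needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STANDARD_AER_BYTES 8   /* hypothetical size of standard AER data */

/* Hypothetical plug-in description. */
struct pshed_plugin {
    size_t extra_bytes;   /* size of the non-standard register block */
    void (*read_registers)(uint8_t *dst, size_t len);
};

/* Discovery time: how large must the packet data area be? */
size_t packet_data_size(const struct pshed_plugin *plugin)
{
    return STANDARD_AER_BYTES + (plugin != NULL ? plugin->extra_bytes : 0);
}

/* Retrieve time: append plug-in data after the standard block.
 * Returns the number of valid bytes, or 0 if the packet is too small. */
size_t retrieve_error_info(const struct pshed_plugin *plugin,
                           uint8_t *data, size_t data_len)
{
    assert(data != NULL && data_len >= STANDARD_AER_BYTES);
    if (plugin == NULL || plugin->read_registers == NULL)
        return STANDARD_AER_BYTES;
    if (data_len < STANDARD_AER_BYTES + plugin->extra_bytes)
        return 0;
    plugin->read_registers(data + STANDARD_AER_BYTES, plugin->extra_bytes);
    return STANDARD_AER_BYTES + plugin->extra_bytes;
}

/* Sample reader that fills the block with a recognizable pattern. */
void sample_read_registers(uint8_t *dst, size_t len)
{
    memset(dst, 0xEE, len);
}
```

Without a plug-in the packet carries only the standard block, which matches a platform that ships without the extended reporting.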
PCI Express AER Example – PSHED (cont’d)
The PSHED will be called by WheaReportHwError to finalize construction of the error record
At this point, a PSHED plug-in could use platform-specific information to populate additional error sections in the error record
Note that the suggested approach gracefully accommodates platform differentiation
An entry-level server line might ship without the PSHED plug-in and its error reporting capabilities would not include the additional non-standard registers
A higher-level server line should ship with the plug-in and therefore offer extended error reporting (and possibly recovery) capabilities
PCI Express AER Example – Consumers
A targeted consumer (management application) might be written with special knowledge of the information contained in the platform’s non-standard PCI Express error registers
The consumer might implement extended error reporting, health monitoring, and even fail-over services
Call To Action
Send us your questions
Watch for WHEA logo requirements for Windows codenamed “Longhorn”
Evaluate how your products will integrate with WHEA
Community Resources
Windows Hardware & Driver Central (WHDC): www.microsoft.com/whdc/default.mspx
Technical Communities: www.microsoft.com/communities/products/default.mspx
Non-Microsoft Community Sites: www.microsoft.com/communities/related/default.mspx
Microsoft Public Newsgroups: www.microsoft.com/communities/newsgroups
Technical Chats and Webcasts: www.microsoft.com/communities/chats/default.mspx and www.microsoft.com/webcasts
Microsoft Blogs: www.microsoft.com/communities/blogs
Additional Resources
Email: Send feedback and questions to WHEAFB @ microsoft.com
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.