Front cover

AIX 5L Kernel Internals (Course Code BE0070XS)
Student Notebook
ERC 4.0
IBM Certified Course Material
eServer UNIX Technical Education
V2.0.0.3
The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2001, 2003. All rights reserved.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users — Documentation related to restricted rights — Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
Trademarks
The reader should recognize that the following terms, which appear in the content of this training document, are official trademarks of IBM or other companies:
IBM® is a registered trademark of International Business Machines Corporation.
The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, or other countries, or both:

AIX®, AIX 5L™, AS/400®, Chipkill™, DB2®, DFS™, Electronic Service Agent™, IBM®, iSeries™, LoadLeveler®, NUMA-Q®, PowerPC®, pSeries™, PTX®, RS/6000®, S/370™, Sequent®, SP™, zSeries™

ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States, other countries, or both.

Intel is a trademark of Intel Corporation in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States and other countries.

Other company, product and service names may be trademarks or service marks of others.
June 2003 Edition
Contents

Trademarks . . . . . . . . . . ix
Course Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Unit 1. Introduction to the AIX 5L Kernel . . . . . . . . . . 1-1
Unit Objectives . . . . . . . . . . 1-2
Operating System and the Kernel . . . . . . . . . . 1-3
Kernel Components . . . . . . . . . . 1-5
Address Space . . . . . . . . . . 1-7
Mode and Context . . . . . . . . . . 1-9
Context Switches . . . . . . . . . . 1-11
Interrupt Processing . . . . . . . . . . 1-13
AIX 5L Kernel Characteristics . . . . . . . . . . 1-16
AIX 5L Execution Environment . . . . . . . . . . 1-18
System Header Files . . . . . . . . . . 1-20
Conditional Compile Values . . . . . . . . . . 1-22
Checkpoint . . . . . . . . . . 1-24
Exercise . . . . . . . . . . 1-25
Unit Summary . . . . . . . . . . 1-26
Unit 2. Kernel Analysis Tools . . . . . . . . . . 2-1
Unit Objectives . . . . . . . . . . 2-2
What tools will you be using in this class? . . . . . . . . . . 2-3
The Major Functions of KDB are: . . . . . . . . . . 2-4
Enabling the Kernel Debugger . . . . . . . . . . 2-6
Verifying the Debugger is Enabled . . . . . . . . . . 2-8
Starting the Debugger . . . . . . . . . . 2-9
System Dumps . . . . . . . . . . 2-10
kdb . . . . . . . . . . 2-13
Checkpoint . . . . . . . . . . 2-15
Exercise . . . . . . . . . . 2-16
Unit Summary . . . . . . . . . . 2-17
Unit 3. Process Management . . . . . . . . . . 3-1
Unit Objectives . . . . . . . . . . 3-2
Parts of a Process . . . . . . . . . . 3-3
Threads . . . . . . . . . . 3-5
1:1 Thread Model . . . . . . . . . . 3-7
M:1 Thread Model . . . . . . . . . . 3-8
M:N Thread Model . . . . . . . . . . 3-9
Creating Processes . . . . . . . . . . 3-11
Creating Threads . . . . . . . . . . 3-13
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
© Copyright IBM Corp. 2001, 2003 Contents iii
Process State Transitions . . . . . . . . . . 3-15
The Process Table . . . . . . . . . . 3-18
pvproc . . . . . . . . . . 3-20
pv_stat . . . . . . . . . . 3-21
Table Management . . . . . . . . . . 3-22
Extending the pvproc . . . . . . . . . . 3-24
PID Format . . . . . . . . . . 3-26
Finding the Slot Number . . . . . . . . . . 3-28
Kernel Processes . . . . . . . . . . 3-29
Thread Table . . . . . . . . . . 3-31
pvthread Elements . . . . . . . . . . 3-33
TID Format . . . . . . . . . . 3-34
u-block . . . . . . . . . . 3-35
Six Structures . . . . . . . . . . 3-37
Thread Scheduling Topics . . . . . . . . . . 3-39
Thread State Transitions . . . . . . . . . . 3-40
Thread Priority . . . . . . . . . . 3-43
Run Queues . . . . . . . . . . 3-45
Dispatcher and Scheduler Functions . . . . . . . . . . 3-46
Dispatcher . . . . . . . . . . 3-47
Scheduler . . . . . . . . . . 3-48
Preemption . . . . . . . . . . 3-49
Preemptive Kernels . . . . . . . . . . 3-51
Scheduling Algorithms . . . . . . . . . . 3-53
SMP - Multiple Run Queues . . . . . . . . . . 3-56
NUMA . . . . . . . . . . 3-58
Memory Affinity . . . . . . . . . . 3-60
Global Run Queues . . . . . . . . . . 3-62
Checkpoint . . . . . . . . . . 3-64
Exercise . . . . . . . . . . 3-65
Unit Summary . . . . . . . . . . 3-66
Unit 4. Addressing Memory . . . . . . . . . . 4-1
Unit Objectives . . . . . . . . . . 4-2
Memory Management Definitions . . . . . . . . . . 4-3
Pages and Frames . . . . . . . . . . 4-4
Address Space . . . . . . . . . . 4-6
Translating Addresses . . . . . . . . . . 4-8
Segments . . . . . . . . . . 4-9
Segment Addressing . . . . . . . . . . 4-11
32-bit Hardware Address Resolution . . . . . . . . . . 4-13
64 Bit Hardware Address Resolution . . . . . . . . . . 4-15
Segment Types . . . . . . . . . . 4-16
Shared Memory . . . . . . . . . . 4-19
shmat Memory Services . . . . . . . . . . 4-21
Memory Mapped Files . . . . . . . . . . 4-23
32-bit User Address Space . . . . . . . . . . 4-26
32-bit Kernel Address Space . . . . . . . . . . 4-28
64-bit User/Kernel Address Space . . . . . . . . . . 4-29
Checkpoint . . . . . . . . . . 4-31
Exercise . . . . . . . . . . 4-32
Unit Summary . . . . . . . . . . 4-33

Unit 5. Memory Management . . . . . . . . . . 5-1
Unit Objectives . . . . . . . . . . 5-2
Virtual Memory Management (VMM) . . . . . . . . . . 5-3
Object Types . . . . . . . . . . 5-5
Demand Paging . . . . . . . . . . 5-7
Data Structures . . . . . . . . . . 5-10
Hardware Page Mapping . . . . . . . . . . 5-12
Page not in Hardware Table . . . . . . . . . . 5-13
Page on Paging Space . . . . . . . . . . 5-15
External Page Table (XPT) . . . . . . . . . . 5-16
Loading Pages From the File System . . . . . . . . . . 5-18
Object Type / Backing Store . . . . . . . . . . 5-20
Paging Space Management Process . . . . . . . . . . 5-21
Paging Space Allocation Policy . . . . . . . . . . 5-23
Free Memory . . . . . . . . . . 5-25
Clock Hand Algorithm . . . . . . . . . . 5-27
Fatal Memory Exceptions . . . . . . . . . . 5-29
Checkpoint . . . . . . . . . . 5-30
Exercise . . . . . . . . . . 5-31
Unit Summary . . . . . . . . . . 5-32
Unit 6. Logical Partitioning . . . . . . . . . . 6-1
Unit Objectives . . . . . . . . . . 6-2
Partitioning . . . . . . . . . . 6-3
Physical Partitioning . . . . . . . . . . 6-5
Logical Partitioning . . . . . . . . . . 6-7
Components Required for LPAR . . . . . . . . . . 6-9
Operating System Interfaces . . . . . . . . . . 6-13
Virtual Memory Manager . . . . . . . . . . 6-14
Real Address Range . . . . . . . . . . 6-15
Real Mode Memory . . . . . . . . . . 6-17
Operating System Real Mode Issues . . . . . . . . . . 6-19
Address Translation . . . . . . . . . . 6-21
Allocating Physical Memory . . . . . . . . . . 6-23
Partition Page Tables . . . . . . . . . . 6-25
Translation Control Entries . . . . . . . . . . 6-27
Hypervisor . . . . . . . . . . 6-29
Dividing Physical Memory . . . . . . . . . . 6-31
Checkpoint . . . . . . . . . . 6-33
Unit Summary . . . . . . . . . . 6-34
Unit 7. LFS, VFS and LVM . . . . . . . . . . 7-1
Unit Objectives . . . . . . . . . . 7-2
What is the Purpose of LFS/VFS? . . . . . . . . . . 7-3
Kernel I/O Layers . . . . . . . . . . 7-5
Major Data Structures . . . . . . . . . . 7-7
Logical File System Structures . . . . . . . . . . 7-9
User File Descriptor . . . . . . . . . . 7-11
The file Structure . . . . . . . . . . 7-13
vnode/vfs Interface . . . . . . . . . . 7-15
vnode . . . . . . . . . . 7-17
vfs . . . . . . . . . . 7-19
root (/) and usr File Systems . . . . . . . . . . 7-21
vmount . . . . . . . . . . 7-23
File and File System Operations . . . . . . . . . . 7-25
gfs . . . . . . . . . . 7-27
vnodeops . . . . . . . . . . 7-29
vfsops . . . . . . . . . . 7-31
gnode . . . . . . . . . . 7-33
kdb devsw Subcommand Output . . . . . . . . . . 7-35
kdb volgrp Subcommand Output . . . . . . . . . . 7-37
AIX lsvg Command Output . . . . . . . . . . 7-39
kdb lvol Subcommand Output . . . . . . . . . . 7-40
AIX lslv Command Output . . . . . . . . . . 7-44
kdb pvol Subcommand Output . . . . . . . . . . 7-46
AIX lspv Command Output . . . . . . . . . . 7-48
Checkpoint (1 of 2) . . . . . . . . . . 7-49
Checkpoint (2 of 2) . . . . . . . . . . 7-50
Exercise . . . . . . . . . . 7-51
Unit Summary . . . . . . . . . . 7-52
Unit 8. Journaled File System . . . . . . . . . . 8-1
Unit Objectives . . . . . . . . . . 8-2
JFS File System . . . . . . . . . . 8-3
Reserved Inodes . . . . . . . . . . 8-7
Disk Inode Structure . . . . . . . . . . 8-9
In-core Inodes . . . . . . . . . . 8-11
Direct (No Indirect Blocks) . . . . . . . . . . 8-15
Single Indirect . . . . . . . . . . 8-17
Double Indirect . . . . . . . . . . 8-18
Checkpoint . . . . . . . . . . 8-19
Unit Summary . . . . . . . . . . 8-20
Unit 9. Enhanced Journaled File System . . . . . . . . . . 9-1
Unit Objectives . . . . . . . . . . 9-2
Numbers . . . . . . . . . . 9-3
Aggregate and Fileset . . . . . . . . . . 9-4
Aggregate . . . . . . . . . . 9-6
Allocation Group . . . . . . . . . . 9-9
Fileset . . . . . . . . . . 9-11
Inode Allocation Map . . . . . . . . . . 9-13
Extents . . . . . . . . . . 9-14
Increasing an Allocation . . . . . . . . . . 9-16
Binary Tree of Extents . . . . . . . . . . 9-18
Inodes . . . . . . . . . . 9-20
Inline Data . . . . . . . . . . 9-26
Binary Trees . . . . . . . . . . 9-27
More Extents . . . . . . . . . . 9-28
Continuing to Add Extents . . . . . . . . . . 9-29
Another Split . . . . . . . . . . 9-30
fsdb Utility . . . . . . . . . . 9-32
Exercise . . . . . . . . . . 9-34
Directory . . . . . . . . . . 9-35
Directory Root Header . . . . . . . . . . 9-37
Directory Slot Array . . . . . . . . . . 9-39
Small Directory Example . . . . . . . . . . 9-41
Adding a File . . . . . . . . . . 9-42
Adding a Leaf Node . . . . . . . . . . 9-43
Adding an Internal Node . . . . . . . . . . 9-44
Checkpoint . . . . . . . . . . 9-45
Exercise . . . . . . . . . . 9-46
Unit Summary . . . . . . . . . . 9-47

Unit 10. Kernel Extensions . . . . . . . . . . 10-1
Unit Objectives . . . . . . . . . . 10-2
Kernel Extensions . . . . . . . . . . 10-3
Relationship With the Kernel Nucleus . . . . . . . . . . 10-5
Global Kernel Name Space . . . . . . . . . . 10-6
Why Export Symbols? . . . . . . . . . . 10-9
Kernel Libraries . . . . . . . . . . 10-11
Configuration Routines . . . . . . . . . . 10-13
Compiling and Linking Kernel Extensions . . . . . . . . . . 10-15
How to Build a Dual Binary Extension . . . . . . . . . . 10-19
Loading Extensions . . . . . . . . . . 10-21
sysconfig() - Loading and Unloading . . . . . . . . . . 10-22
sysconfig() - Configuration . . . . . . . . . . 10-23
sysconfig() - Device Driver Configuration . . . . . . . . . . 10-24
The loadext() Routine . . . . . . . . . . 10-26
System Calls . . . . . . . . . . 10-28
Sample System Call - Export/Import File . . . . . . . . . . 10-30
Sample System Call - question.c . . . . . . . . . . 10-31
Sample System Call - Makefile . . . . . . . . . . 10-32
Argument Passing . . . . . . . . . . 10-33
User Memory Access . . . . . . . . . . 10-35
Checkpoint . . . . . . . . . . 10-38
Exercise . . . . . . . . . . 10-39
Unit Summary . . . . . . . . . . 10-40
Appendix A. Checkpoint Solutions . . . . . . . . . . A-1
Appendix B. KI Crash Dump . . . . . . . . . . B-1
Unit Objectives . . . . . . . . . . B-2
Crash Dumps . . . . . . . . . . B-3
Process Flow . . . . . . . . . . B-5
About This Exercise . . . . . . . . . . B-6
TrademarksThe reader should recognize that the following terms, which appear in the content of this training document, are official trademarks of IBM or other companies:
IBM® is a registered trademark of International Business Machines Corporation.
The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, or other countries, or both:
ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States, other countries, or both.
Intel is a trademark of Intel Corporation in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States and other countries.
Other company, product and service names may be trademarks or service marks of others.
AIX® AIX 5L™ AS/400® Chipkill™ DB2® DFS™ Electronic Service Agent™ IBM® iSeries™ LoadLeveler® NUMA-Q® PowerPC® pSeries™ PTX® RS/6000® S/370™ Sequent® SP™ zSeries™
Course DescriptionAIX 5L Kernel Internals Concepts
Duration: 5 days
Purpose
This is a course in basic AIX 5L Kernel concepts. It is designed to provide background information useful to support engineers and AIX development/application engineers who are new to the AIX 5L Kernel environment as implemented in AIX releases 5.1 and 5.2. This course also provides background knowledge helpful for those planning to attend the AIX 5L Device Driver (Q1330) course.
Audience
— AIX technical support personnel
— Application developers who want to achieve a conceptual understanding of AIX 5L Kernel Internals
Prerequisites
Students are expected to have programming knowledge of the C programming language, working knowledge of AIX system calls, and user-level working knowledge of AIX/UNIX, including editors, shells, pipes, and Input/Output (I/O) redirection. Additionally, basic system administration skills are required, such as using SMIT, configuring file systems, and configuring dump devices. These skills can be obtained by attending the following courses or through equivalent experience:
— Introduction to C Programming - AIX/UNIX (Q1070)
— AIX 5L System Administration II: Problem Determination (AU16/Q1316)
In addition, the following courses are helpful:
— KornShell Programming (AU23/Q1123)
— AIX Application Programming Environment (AU25/Q1125)
Objectives
At the end of this course you will be able to:
— List the major features of the AIX 5L kernel
— Quickly traverse the system header files to find data structures
— Use the kdb command to examine data structures in the memory image of a running system or system dump
— Understand the structures used by the kernel to manage processes and threads, and the relationships between them
— Describe the layout of the segmented addressing model, and how logical to physical address translation is achieved
— Describe the operation of VMM subsystem and the different paging algorithms
— Describe the mechanisms used to implement logical partitioning
— Understand the purpose of the logical file system and virtual file system layers and the data structures they use
— List and describe the components and function of the JFS2 and JFS file systems
— Identify the steps required to compile, link and load kernel extensions
Agenda

Day 1

Welcome
Unit 1 - Introduction to the AIX 5L Kernel lecture
Exercise 1 - Introduction to the AIX 5L Kernel
Unit 2 - Kernel Analysis Tools lecture
Exercise 2 - Kernel Analysis Tools
Day 2
Daily review
Unit 3 - Process Management lecture
Exercise 3 - Process Management
Unit 4 - Addressing Memory lecture
Day 3
Daily review
Exercise 4 - Addressing Memory
Unit 5 - Memory Management lecture
Exercise 5 - Memory Management
Unit 6 - Logical Partitioning lecture
Day 4
Daily review
Unit 7 - LFS, VFS and LVM lecture
Exercise 6 - LFS, VFS and LVM
Unit 8 - Journaled File System lecture
Unit 9 - Enhanced Journaled File System - Topic 1 lecture
Exercise 7 - Enhanced Journaled File System - Topic 1
Unit 9 - Enhanced Journaled File System - Topic 2 lecture
Exercise 8 - Enhanced Journaled File System - Topic 2
Day 5
Daily review
Unit 10 - Kernel Extensions lecture
Exercise 9 - Kernel Extensions
Unit 1. Introduction to the AIX 5L Kernel

What This Unit Is About
This unit describes the purpose, concepts and features of the AIX 5L kernel.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe the role the kernel plays in an operating system
• Define user and kernel mode and list the operations that can only be performed in kernel mode
• Describe when the kernel must make a context switch
• Describe the role of the mstsave area in a context switch
• Name the execution environments available on each of the platforms supported by AIX 5L
• Using the system header files, identify data element types for each of the available kernels in AIX 5L
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Check-point activity
• Unit review
References
The Design of the UNIX Operating System, by Maurice J. Bach, ISBN: 0132017997
AIX Online Documentation: http://publib16.boulder.ibm.com/pseries/en_US/infocenter/base/aix.htm
Figure 1-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this unit you should be able to:
Describe the role the kernel plays in an operating system
Define user and kernel mode and list the operations that can only be performed in kernel mode
Describe when the kernel must make a context switch
Describe the role of the mstsave area in a context switch
Name the execution environments available on each of the platforms supported by AIX 5L
Using the system header files, identify data element types for each of the available kernels in AIX 5L
Figure 1-2. Operating System and the Kernel BE0070XS4.0
Notes:
Operating system
The principal purpose of the AIX operating system is to provide an environment where application programs can be executed. This mainly involves the management of hardware resources including memory, CPU and I/O.
Kernel
The kernel is the base program of the operating system. It acts as an intermediary between the application programs and the computer hardware. It provides the system call interface, allowing programs to request use of the hardware. The kernel prioritizes these requests and manages the hardware through its hardware interface.
Operating System and the Kernel

[Figure: processes call into the kernel through its system call interface; the kernel manages the hardware (CPUs, tty) through its hardware interface.]
The kernel is the key program
The operating system is made up of many programs, including the kernel. It is safe to say that the kernel is the most important part of the operating system; if the kernel is not running, nothing else in the operating system can function. This class discusses the internal workings of the kernel in the AIX 5L operating system.
Figure 1-3. Kernel Components BE0070XS4.0
Notes:
Introduction
The kernel may be broken up into several sections based on the services provided to application programs. Each of these sections is discussed in this class. The kernel components are shown in the visual above.
Process management
The process management function of the kernel is responsible for the creation and termination of processes and threads, along with scheduling threads on CPUs.
Virtual memory management
The Virtual Memory Management (VMM) function of the kernel is responsible for managing all aspects of virtual and physical memory use by processes and the kernel. This includes allocating physical page frames to virtual pages, providing space for file system buffering, and keeping track of which process memory is resident in physical memory and which is stored on disk.

Kernel Components

[Figure: applications run in user mode above the kernel. Within the kernel, process management, virtual memory management, the file systems (buffered I/O), disk space management (LVM), and the I/O subsystem with its device drivers (buffered and raw I/O) sit between the applications and the hardware (CPUs, disk, tty).]
I/O subsystem
Parts of the kernel that interact directly with I/O devices are called device drivers. Typically each type of device installed on the system will require its own device driver. Device drivers are covered in detail in a separate class on writing device drivers.
Disk space management
The management of disk space in AIX is handled by a layer above the disk drivers. The Logical Volume Manager (LVM) provides the function of disk space management.
File system
AIX supports several types of file systems including JFS, JFS2, NFS and several CD-ROM file systems. The file system software interacts with the disk space management software. This class covers the JFS and JFS2 file systems.
Figure 1-4. Address Space BE0070XS4.0
Notes:
Introduction
AIX implements a virtual memory system. Addresses referenced by a user program do not directly reference physical memory; instead they reference a virtual address.
Virtual address space
By using the concept of virtual memory, each process on the system can appear to have its own address space that is separate and isolated from other processes. A process’ address space contains both user- and kernel-memory addresses.
Memory management
Virtual addresses are mapped by the hardware to physical memory addresses. Translation tables are used by the hardware to map virtual to physical addresses; these tables are controlled by the kernel. One set of address translation tables is kept for each process. To switch from one process's address space to another, the kernel loads the appropriate address translation table into the hardware.

Address Space

[Figure: processes A, B and C each have their own address space, each divided into a user region and a kernel region.]
Figure 1-5. Mode and Context BE0070XS4.0
Notes:
Introduction
Two key concepts of mode and environment are described in this section.
Mode
The computer hardware provides two modes of execution: a privileged kernel mode and a less-privileged user mode. Application programs must run in user mode and are thus given limited access to the hardware. The kernel, as you would expect, runs in kernel mode. The following table compares these two modes.
Mode and Environment

[Figure: a grid of mode versus environment. Application code runs in user mode / process environment; a system call moves the process into kernel mode / process environment; a hardware interrupt runs kernel code in kernel mode / interrupt environment. The user mode / interrupt environment combination is invalid: interrupts always run in kernel mode.]
Environment
The AIX kernel may execute in one of two environments: process environment or interrupt environment. In the process environment, the kernel is running on behalf of a user process. This generally occurs when a user program makes a system call, although it is also possible to create a kernel-mode-only process. When the kernel responds to an interrupt, it is running in the interrupt environment. In this context, the kernel cannot access the user address space or any kernel data related to the user process that was running on the processor just before the interrupt occurred.
User mode:
- Memory access is limited to the user's private memory; kernel memory is not accessible.
- I/O instructions are blocked.
- Cannot modify hardware registers related to memory management.

Kernel mode:
- Can access all memory on the system.
- All I/O is performed in kernel mode.
- Memory management registers may be modified.
- Interrupts are handled in kernel mode.
Figure 1-6. Context Switches BE0070XS4.0
Notes:
Introduction
A context switch is the action of exchanging one thread of execution on a CPU for another.
Thread of execution
Threads of execution are simply logical paths through the instructions of a program. The AIX kernel manages many threads of execution by switching the CPUs between the different threads on the system.
Context Switches

[Figure: two threads, each with its own mstsave area holding the saved CPU registers, stack pointer and instruction pointer; a context switch moves the CPU from one thread's saved state to the other's.]
Context switches
Context switches can occur at two points:
a. A hardware interrupt occurs.
b. Execution of the thread is blocked waiting for the completion of an event.
mstsave
The context of the running thread must be saved when a context switch occurs. This context includes information such as the values of the CPU registers, the instruction address register and stack pointer. This information is saved in a structure called the mstsave (machine state save) structure. Each thread of execution has an associated mstsave structure.
Restoring a context
When a thread is restored (switched in), the system register values stored in the mstsave of the thread are loaded into the CPU. The CPU then performs a branch instruction to the address of the saved instruction pointer.
Figure 1-7. Interrupt Processing BE0070XS4.0
Notes:
Introduction
A hardware interrupt results in a temporary context switch. Each time an interrupt occurs, the current context of the processor must be saved so that processing can be continued after handling the interrupt.
mstsave pool
Interrupts can occur while the CPU is already processing an interrupt; therefore, multiple mstsave areas are needed to save the context at each interrupt level. AIX keeps a pool of mstsave areas for this purpose: each thread structure contains an mstsave structure, but an interrupt is a transient entity and does not have a thread structure of its own.
Interrupt Processing

[Figure: the csa (current save area) pointer references an unused mstsave area at the head of a chain; behind it are the mstsave areas used by a high-priority interrupt, a low-priority interrupt, and, at the base interrupt level, the mstsave of the interrupted thread.]
csa pointer
Each processor has a pointer to the mstsave area it should use when an interrupt occurs. This pointer is called the current save area, or csa pointer.
Interrupt history
When AIX receives an interrupt that is of higher priority than the one it is currently handling, it must save the current state in a new mstsave area, linking the new save area to the previous one. This forms a history of interrupt processing.
Interrupt processing

Saving context

When an interrupt occurs, the steps AIX takes to save the currently running context are:

Step Action
1. Save the current context in the mstsave area pointed to by the CPU's csa.
2. Get the next available mstsave area from the pool.
3. Link the just-used mstsave to the new mstsave.
4. Update the CPU's csa pointer to point to the new mstsave area.

Unwinding the interrupts

As the processing of each interrupt completes, the chain of mstsave areas is unlinked, working backwards from the highest-priority interrupt to the lowest, and finally to the base-level mstsave. The last, base-level mstsave in the chain is the mstsave of the thread that was running when the first interrupt occurred. The steps to restore a context are shown in this table.

Step Action
1. If returning to the base interrupt level and the interrupt has made a thread runnable, invoke the dispatcher. The dispatcher moves the thread originally on the end of the MST chain back to the run queue, and places the best runnable thread at the end of the MST chain.
2. Return the current mstsave area to the pool.
3. Set the CPU's csa pointer to the previous mstsave area.
4. Reload the registers from the restored context.
5. Branch to the instruction referenced by the instruction address register.
Finding the current mstsave

The csa always points to an unused mstsave area. This mstsave will be used if a higher-priority interrupt occurs; its data is not valid except for its pointer to the next mstsave in the chain. The last used mstsave area can be located by following the prev pointer from the mstsave pointed to by the csa.
Figure 1-8. AIX 5L Kernel Characteristics BE0070XS4.0
Notes:
Introduction
AIX was the first mainstream UNIX operating system to implement several important kernel features. These features are listed above.
Preemptable
Preemptable means that the kernel can be running in kernel mode (running a system call, for example) and be interrupted by another, more important task. Preemption causes a context switch to another thread inside the kernel. Many other UNIX kernels do not allow preemption to occur when running in kernel mode, which can result in long delays in the processing of real-time threads. AIX improves real-time processing by allowing preemption in kernel mode. As an example, Linux at the time this course was written did not support preemption in kernel mode.
AIX 5L Kernel Characteristics
Preemptable kernel
Pageable kernel memory
Dynamically extensible kernel
Pageable

Not all of the kernel's virtual memory space needs to be resident in physical memory at all times. Portions of kernel memory may be paged out to disk when not needed, allowing better utilization of physical memory. The ability to page kernel memory is not found in all UNIX kernels: most kernels support paging of the user virtual address space, but AIX supports paging both user and kernel address space. As an example, the kernel memory of the Linux operating system is resident in physical memory at all times.
Pinning memory
Some areas of the kernel’s memory must stay resident meaning they may not be paged to disk. Areas of memory that are not subject to paging are called pinned memory; for example, portions of device drivers must be pinned in memory.
Extensible
The AIX kernel is dynamically extensible. This means that not all the code required for the kernel needs to be included in a single binary (/unix). Portions of the kernel’s code will be loaded at runtime. Dynamically loaded modules are called kernel extensions. Kernel extensions typically add functionality that may not be needed by all systems. This keeps the kernel smaller and requires less memory. Kernel extensions can include:
- Device drivers
- Extended system calls
- File systems
Figure 1-9. AIX 5L Execution Environment BE0070XS4.0
Notes:
Introduction
AIX 5L supports both 32-bit and 64-bit execution environments. On 32-bit hardware platforms only the 32-bit environment can be used, but on 64-bit platforms either can be used. The key to this 64-bit platform flexibility is that a 64-bit VMM (Virtual Memory Manager) runs in both cases, using left zero-fill of addresses for the 32-bit kernel environment.
32-bit and 64-bit kernel
The primary advantage of the 64-bit kernel is the increased kernel address space. This allows systems to support increased workloads. However, there is an added cost to managing a 64-bit address space. Not all applications will require the increased address space of the 64-bit kernel. In these cases, a 32-bit kernel is provided.
AIX 5L Execution Environment

[Figure: three configurations. On 32-bit hardware, 32-bit applications run on the 32-bit kernel. On 64-bit hardware, 32-bit and 64-bit applications can run on either the 32-bit kernel or the 64-bit kernel.]
Selecting a kernel

The file /unix is a link to the kernel image file that is loaded at boot time. Depending on the hardware type and kernel type (32-bit or 64-bit), the link will point to the appropriate file as shown in this table.
User applications
Both 32-bit and 64-bit applications are supported when running on 64-bit hardware, regardless of the kernel that is running.
User commands
User level commands included with the AIX 5L operating system are designed to work with either the 32-bit or 64-bit kernel. However, some commands require both a 32-bit and a 64-bit version. These are typically commands that must work directly with the internal structures of the kernel. For these commands, the 32-bit version of the command will determine if a 32-bit or 64-bit kernel is running. If a 64-bit kernel is detected, then a 64-bit version of the command is started. The steps are shown in this table.
Kernel extensions
Only 64-bit kernel extensions are supported under the 64-bit kernel. Only 32-bit kernel extensions are supported under the 32-bit kernel. All kernel extensions must be SMP safe. Earlier versions of AIX supported running non-SMP safe kernel extensions on SMP hardware using a mechanism called funneling. Funneling is not supported on the 64-bit AIX 5L kernel.
Hardware platform    Kernel type    Kernel file
32-bit or 64-bit     32-bit         /usr/lib/boot/unix_mp (multiprocessor) or /usr/lib/boot/unix_up (uniprocessor)
64-bit               64-bit         /usr/lib/boot/unix_64
Step Action
1. The 32-bit version of the command is run by the user.
2. The 32-bit command checks the kernel type (32- or 64-bit).
3. If a 64-bit kernel is detected, then the 64-bit version of the command is run. For example, under the initial release of AIX 5.1 the command vmstat would run the command vmstat64. In later versions of AIX 5.1, and in AIX 5.2, vmstat (along with other performance commands) uses a performance tools API.
4. If a 32-bit kernel is detected, the 32-bit command completes its execution.
Figure 1-10. System Header Files BE0070XS4.0
Notes:
Introduction
The system header files contain the definition of structures that are used by the AIX kernel. We will reference these files throughout this class, since they contain the C language definitions of the structures we will be describing.
Finding header files
The drawing above shows the location of the system header files.
System Header Files

[Figure: a directory tree rooted at /usr/include. General headers such as stdio.h and fcntl.h live in /usr/include itself; sys/ holds system headers such as mode.h, signal.h, proc.h, thread.h, types.h, user.h and uthread.h; jfs/ holds JFS headers such as dir.h, filsys.h, ino.h, inode.h and jfsmount.h; j2/ holds JFS2 headers such as j2_btree.h, j2_dinode.h, j2_inode.h and j2_types.h.]
Uempty
Location of header files

The /usr/include directory contains several sub-directories containing header files. Some of them are described in this table.

Header file directory    Description
/usr/include             General program header files
/usr/include/sys         Header files dealing directly with the operations of the system
/usr/include/jfs         Header files for the JFS file system
/usr/include/j2          Header files for the JFS2 file system
Figure 1-11. Conditional Compile Values BE0070XS4.0
Notes:
Conditional compile values
Several conditional compiler directives are used in the system header files to select the platform and environment (32-bit or 64-bit kernel). This is because certain data types have different sizes depending on the execution environment (for example, 32-bit or 64-bit).
Example
Shown here is a portion of the definition of a struct thread. The compiler directive #ifndef __64BIT_KERNEL is used to create different definitions for the 32-bit and 64-bit kernels.
Conditional Compile Values
Value             Meaning
_POWER_MP         Code is being compiled for a multiprocessor machine. This value should always be used for 64-bit kernel extensions and device drivers.
_KERNSYS          Enables kernel symbols in header files. This value should always be used when compiling kernel code.
_KERNEL           Compiling kernel extension or device driver code. This value should always be used when compiling kernel code.
__64BIT_KERNEL    Code is being compiled for a 64-bit kernel.
__64BIT__         Code is being compiled in 64-bit mode. This value is automatically defined by the compiler if the -q64 option is specified.
struct thread {
    /* identifier fields */
    tid_t t_tid;                     /* unique thread identifier */
    tid_t t_vtid;                    /* Virtual tid */
    /* related data structures */
    struct pvthread *t_pvthreadp;    /* my pvthread struct */
    struct proc *t_procp;            /* owner process */
    struct t_uaddress {
        struct uthread *uthreadp;    /* local data */
        struct user *userp;          /* owner process' ublock (const) */
    } t_uaddress;
    /* user addresses */
#ifndef __64BIT_KERNEL
    uint t_ulock64;                  /* high order 32-bits */
    uint t_ulock;                    /* user addr - lock or cv */
    uint t_uchan64;                  /* high order 32-bits */
    uint t_uchan;                    /* key of user addr */
    uint t_userdata64;               /* high order 32-bits if 64-bit mode */
    int t_userdata;                  /* user-owned data */
    uint t_cv64;                     /* high order 32-bits if 64-bit mode */
    int t_cv;                        /* User condition variable */
    uint t_stackp64;                 /* high order 32-bits if 64-bit mode */
    char *t_stackp;                  /* saved user stack pointer */
    uint t_scp64;                    /* high order 32-bits if 64-bit mode */
    struct sigcontext *t_scp;        /* sigctx location in user space */
#else
    long t_ulock;                    /* user addr - lock or cv */
    long t_uchan;                    /* key of user addr */
    long t_userdata;                 /* user-owned data */
    long t_cv;                       /* User condition variable */
    char *t_stackp;                  /* saved user stack pointer */
    struct sigcontext *t_scp;        /* sigctx location in user space */
#endif
    . . . .
Figure 1-12. Checkpoint BE0070XS4.0
Notes:
Checkpoint
1. The ______ is the base program of the operating system.
2. The processor runs interrupt routines in ______ mode.
3. The AIX kernel is _______, ________ and __________.
4. The 64-bit AIX kernel supports only _______ kernel extensions, and only runs on _______ hardware.
5. The 32-bit kernel supports 64-bit user applications when running on ________ hardware.
Figure 1-13. Exercise BE0070XS4.0
Notes:
Turn to your lab workbook and complete exercise one.
Exercise
Complete exercise one
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
Use the cscope tool to examine system header files
Figure 1-14. Unit Summary BE0070XS4.0
Notes:
Unit Summary
Describe the role the kernel plays in an operating system
Define user and kernel mode and list the operations that can only be performed in kernel mode
Describe when the kernel must make a context switch
Describe the role of the mstsave area in a context
switch
Name the execution environments available on each of the platforms supported by AIX 5L
Using the system header files, identify data element types for each of the available kernels in AIX 5L
Unit 2. Kernel Analysis Tools

What This Unit Is About
This unit describes the different tools that are available to debug the AIX 5L kernel.
What You Should Be Able to Do
After completing this unit, you should be able to:
• List the tools available for analyzing the AIX 5L kernel
• Use KDB to display and modify memory locations and interpret a stack trace
• Use basic kdb navigation to explore a crash dump or a live system
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
References
AIX Documentation: Kernel Extensions and Device Support Programming Concepts
Figure 2-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this unit you should be able to:
List the tools available for analyzing the AIX 5L kernel
Use KDB to display and modify memory locations and interpret a stack trace
Use basic kdb navigation to explore a crash dump or a live system
Figure 2-2. What tools will you be using in this class? BE0070XS4.0
Notes:
Kernel Analysis Tools
Several tools are available in AIX 5L for examining and debugging the kernel. This table lists the primary tools we will be covering in this unit.
Typographic conventions
In this class an uppercase KDB will be used when referring to the kernel debugger, and lowercase kdb is used when referring to the image analysis command.
Tool  Description
KDB   Kernel debugger for live system debugging
kdb   Used for system image analysis
What tools will you be using in this class?
Figure 2-3. The Major Functions of KDB are: BE0070XS4.0
Notes:
Introduction
This section describes the kernel debugger available in AIX 5L.
Overview
The kernel debugger is built into the AIX 5L production kernel. For the debugger to be used it must be enabled prior to booting.
Interfacing with the debugger
Once started, the kernel debugger is operated from a terminal connected to a native serial port of the system. The debugger cannot be operated from the LFT graphics display, or from a serial terminal connected via an 8-port or 128-port adapter.
The Major Functions of KDB are:
Set breakpoints within the kernel or kernel extensions
Execution control through various forms of step execution commands
Format display of selected kernel data structures
Display and modification of kernel data
Display and modification of kernel instructions
Modify the machine state through alteration of system registers
Concept

When KDB is invoked, it is the only running program until you exit the debugger. All processes are stopped and interrupts are disabled. The kernel debugger runs with its own Machine State Save Area (mst) and a special stack. In addition, the kernel debugger does not run operating system routines. Although this requires that kernel code be duplicated within the debugger, it makes it possible to set breakpoints anywhere within the kernel code. When exiting the kernel debugger, all processes continue to run unless the debugger was entered via a system halt.
Figure 2-4. Enabling the Kernel Debugger BE0070XS4.0
Notes:
Kernel flags
The kernel debugger feature is enabled by setting flags in the boot image prior to booting the kernel. After changing these flags you must create a new boot image and reboot the system to use this new image.
Building a new boot image
The bosboot command is used to build boot images. Arguments supplied to the bosboot command will set flags in the boot image causing the kernel debugger to be enabled or disabled. After the boot image has been built the system must be re-booted for the new options to take effect.
Enabling the Kernel Debugger
Perform these steps to enable the kernel debugger:
1. Set Kernel boot Flags (bosdebug -D)
2. Build a new boot image (bosboot -ad /dev/ipldevice)
3. Boot the new image (shutdown -Fr)
4. Verify the debugger is enabled (Check dbg_avail)
bosboot syntax

The syntax of the bosboot command is:
bosboot -a [-D | -I] -d device
Example
The following command will build a new boot image with the kernel debugger loaded:
# bosboot -a -D -d /dev/ipldevice
The system must be rebooted for the change to take effect.
bosdebug
Attributes in the SWservAt ODM database can be set so that bosboot will enable the kernel debugger regardless of the command line argument used when building the boot image. The bosdebug command is used to view or set these attributes. To view the setting of the debug flags in the ODM database use the command:
# bosdebug
Memory debugger        off
Memory sizes           0
Network memory sizes   0
Kernel debugger        on
Real Time Kernel       off
To set the kernel debugger attribute on use the command:
# bosdebug -D
To set the kernel debugger attribute off use the command:
# bosdebug -o
Note: All this command does is set attributes in the SWservAt ODM database. The bosboot command reads these values and sets up the boot image accordingly.
Argument    Description
-d device   Specifies the boot device. The current boot disk is represented by the device /dev/ipldevice.
-D          Loads the kernel debugger. The kernel debugger will not automatically be invoked when the system boots.
-I          Loads and invokes the kernel debugger. The kernel debugger will be invoked immediately on boot.
-a          Creates a complete boot image.
Figure 2-5. Verifying the Debugger is Enabled BE0070XS4.0
Notes:
Verifying the kernel debugger is enabled
Once the kernel is booted, you can use the following procedure to verify that the kernel debugger has been enabled.
Verifying the Debugger is Enabled
Step  Action
1     Start the kdb command:
      # kdb
2     View the dbg_avail memory flag:
      (0)> dw dbg_avail 1
      dbg_avail+000000: 00000002
3     Compare the value of dbg_avail against the mask values in this table.
Mask        Description
0x00000000  Do invoke at bootup.
0x00000001  Don't invoke at boot, but the debugger is still invokable.
0x00000002  The debugger is never to be called.
Figure 2-6. Starting the Debugger BE0070XS4.0
Notes:
Invoke vs. load only
When the kernel debugger is configured to be invoked (the -I option), the debugger will start immediately after booting. If configured to be loaded but not invoked (the -D option), one of the conditions listed under Starting the Debugger must occur after the system is booted for the debugger to be started.
Starting the Debugger
From a native serial port, type the key sequence:
Ctrl-\
From the LFT keyboard, type the key sequence: Ctrl-alt-Numpad4
A kernel extension or application makes a call to brkpoint()
A breakpoint previously set using the debugger has been reached
A fatal system error occurs
Figure 2-7. System Dumps BE0070XS4.0
Notes:
What is in a system dump
Typically, an AIX 5L dump includes all of the information needed to determine the nature of the problem. The dump contains:
- Operating system (kernel) code and data
- Some data from the current running application
- Most of the kernel extensions code and data
Paged memory
The dump facility cannot page in memory, so only what is currently in physical memory can be dumped. Normally this is not a problem since most of the kernel data structures are in memory. The process and thread tables are pinned, and the uthread and ublock structures of the running thread are pinned as well.
System Dumps
A dump image is not actually a full image of the system memory but a set of memory areas copied out by the dump routines.
What is in a system dump?
What is the effect of kernel paging?
What is the role of the Master Dump Table?
What tools are used to analyze system dumps?
The master dump table

The system dump function captures data areas by processing information returned by routines registered in the Master Dump Table. Kernel extensions can specify a routine to be called to include data in a system dump. On AIX 5.1 this is done with the dmp_add() kernel service; AIX 5.2 uses the dmp_ctl() kernel service. Kernel-specific areas to be included in the dump are pre-loaded at kernel initialization.
Analyzing dumps
System dumps can be examined using the kdb command.
Dump Creation Process
Introduction
This section describes the dump process.
Process overview
The following steps are used to write a dump to the dump device:
Step  Action
1.    Interrupts are disabled
2. 0c9 or 0c2 are written to the LED display, if present
3. Header information about the dump is written to the dump device
4.
The kernel steps through each entry in the Master Dump Table, calling each Component Dump routine twice:
• Once to indicate that the kernel is starting to dump this component (1 is passed as a parameter).
• Again to say that the dump process is complete (2 is passed as a parameter).
• After the first call to a Component Dump routine, the kernel processes the CDT that was returned
For each CDT entry, the kernel:
• Checks every page in the identified data area to see if it is in memory or paged out
• Builds a bitmap indicating each page's status
• Writes a header, the bitmap, and those pages which are in memory to the dump device
5. Once all dump routines have been called, the kernel enters an infinite loop, displaying 0c0 or flashing 888
Figure 2-8. kdb . BE0070XS4.0
Notes:
kdb Command
Files needed
The kdb command requires both a memory image (dump device, vmcore or /dev/mem) and a copy of /unix to operate. The /unix file provides the necessary symbol mapping needed to analyze the memory image file. It is imperative that the /unix file supplied is the one that was running at the time the memory image was created. The memory image (whether a device such as /dev/dumplv or a file such as vmcore.0) must not be compressed.
kdb
The kdb command allows examination of an operating
system image
Requires system image and /unix
Can be run on a running system using /dev/mem
Typical invocations:
# kdb -m vmcore.X -u /usr/lib/boot/unix
or
# kdb
Parameters
The kdb command may be used with the following parameters:
Example
To run kdb against a vmcore file use the following command line:
# kdb -m vmcore.X -u /unix
To run kdb against the live (running kernel) no parameters are required.
# kdb
Parameter             Description
(none)                Use /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case root permissions are required.
-m system_image_file  Use the image file provided.
-u kernel_file        Use the kernel file provided. This is required to analyze a system dump on a different system.
-k kernel_modules     Add the kernel_modules listed.
-w                    View XCOFF object.
-v                    Print CDT entries.
-h                    Print help.
-l                    Disable in-line more; useful when running a non-interactive session.
Figure 2-9. Checkpoint BE0070XS4.0
Notes:
Checkpoint
1. _____ is used for live system debugging.
2. _____ is used for system image analysis.
3. The value of the _______ kernel variable indicates how the debugger is loaded.
4. A system dump image contains everything that was in the kernel at the time of the crash. True or False?
Figure 2-10. Exercise BE0070XS4.0
Notes:
Introduction
Turn to your lab workbook and complete exercise two.
Read the information blocks included with the exercises. They will provide you with information needed to do the exercise.
Exercise
Complete exercise two
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
Enable and start the kernel debugger
Display and interpret stack traces
Display and modify variables in kernel memory
Perform basic kdb navigation on live systems and crash dumps
Figure 2-11. Unit Summary BE0070XS4.0
Notes:
Unit Summary
List the tools available for analyzing the AIX 5L kernel
Use KDB to display and modify memory locations and interpret a stack trace
Use basic kdb navigation to explore crash dump and live system
Unit 3. Process Management

What This Unit Is About

This unit describes how processes and threads are managed in AIX 5L.
What You Should Be Able to Do
After completing this unit, you should be able to:
• List the three thread models available in AIX 5L
• Identify the relationship between the six internal structures: pvproc, proc, pv_thread, thread, user and u_thread
• Use the kernel debugging tools in AIX to locate and examine a process’ proc, thread, user and u_thread data structures
• Identify the states of processes and threads on a live system and in a crash dump
• Analyze a crash dump caused by a run-away process
• Identify the features of AIX scheduling algorithms
• Identify the primary features of the AIX scheduler supporting SMP and large system architectures
• Identify the action the threads of a process will take when a signal is received by the process
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Check-point activity
• Unit review
References
AIX Documentation: Performance Management GuideAIX Documentation: System Management Guide: Operating System and Devices
Figure 3-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this unit you should be able to:
List the three thread models available in AIX 5L
Identify the relationship between the six internal structures: pvproc, proc, pv_thread, thread, user and u_thread
Use the kernel debugging tools in AIX to locate and examine a process’ proc, thread, user and u_thread data structures
Identify the states of processes and threads on a live system and in a crash dump
Analyze a crash dump caused by a run-away process
Identify the features of AIX scheduling algorithms
Identify the primary features of the AIX scheduler supporting SMP and large system architectures
Identify the action the threads of a process will take when a signal is received by the process
Figure 3-2. Parts of a Process BE0070XS4.0
Notes:
Processes and threads
A process is a self-contained entity that consists of the information required to run a single program, such as a user application.
Process
A process can be divided into two components:
- A collection of resources
- A set of one or more threads
Parts of a Process

[Slide diagram: a Process box containing shared resources (address space, open file pointers, user credentials, management data) and three Thread boxes, each with its own stack and CPU registers.]
Resources
The resources making up a process are shared by all threads in the process. The resources are:
- Address space (program text, data and heap)
- A set of open files pointers
- User credentials
- Management data
Threads
A thread can be thought of as a path of execution through the instructions of the process. Each thread has a private execution context that includes:
- A stack
- CPU register values (loaded into the CPU when the thread is running)
Figure 3-3. Threads BE0070XS4.0
Notes:
Threads
Threads provide the execution context to the process.
Kernel threads
Kernel threads are not associated with a user process and therefore have no user context. Kernel threads run completely in kernel mode and have their own kernel stack. They are cheap to create and manage, and thus are typically used to perform a specific function, such as asynchronous I/O.
Threads
Three types of threads are available in AIX:
Kernel
Kernel-managed
User
Three thread programming models are available for user threads:
1:1
M:1
M:N
Kernel-managed threads
Kernel-managed threads are sometimes called “Light Weight Processes” or LWPs and are the fundamental unit of execution in AIX. Each user process contains one or more kernel-managed threads.
The scheduling and running of kernel-managed threads is managed by the kernel. Each thread is scheduled to run on a CPU independent of the other threads of the process. On SMP systems, the threads of one process can run concurrently.
User threads
User threads are an abstraction entirely at the user level. The kernel has no knowledge of their existence. They are managed by a user-level threads library and their scheduling and execution are managed at the user level.
Programming models
AIX 5L provides three models for mapping user threads on top of kernel-managed threads. The application developer can choose between the 1:1, M:1 and M:N models.
Figure 3-4. 1:1 Thread Model BE0070XS4.0
Notes:
1:1 Model
In the 1:1 model, each user thread is mapped to a single kernel-managed thread:
1:1 Thread Model

[Slide diagram: three user threads in the thread library, each mapped to its own kernel-managed thread.]
Figure 3-5. M:1 Thread Model BE0070XS4.0
Notes:
M:1
In the M:1 model all user threads are mapped to one kernel-managed thread. The scheduling and management of the user threads are completely handled by the thread library.
M:1 Thread Model

[Slide diagram: three user threads mapped by the thread library's scheduler onto a single kernel-managed thread.]
Figure 3-6. M:N Thread Model BE0070XS4.0
Notes:
M:N

In the M:N model, user threads are mapped to a pool of kernel-managed threads. A user thread may be bound to a specific kernel-managed thread. An additional “hidden” user scheduler thread may be started by the library to handle mapping user threads onto kernel-managed threads.

M:N Thread Model

[Slide diagram: user threads mapped through the thread library onto a pool of kernel-managed threads; a library scheduler thread manages the mapping.]

Thread model for this unit

This unit focuses on the management and scheduling of kernel-managed threads. Primarily, the 1:1 model is discussed. Unless specified, the term “thread” refers to a kernel-managed thread.

Note that the thread model is selectable. The default for AIX 4.3.1 and higher is the M:N model. Using the 1:1 model can improve performance. The following will select the 1:1 model:

# export AIXTHREAD_SCOPE=S
# <your_program>

There are many similar options available for thread tuning. See the Performance Management Guide in the AIX online documentation.
Figure 3-7. Creating Processes BE0070XS4.0
Notes:
Creating processes
A new process is created when an existing process executes a fork() system call. The new process is called a child process; the creating process is the child’s parent.
Exec
When a process is first created it is running the same program as its parent. One of the exec() class of system calls is normally used to load a new program into the process’ address space.
Creating Processes
When a process is created it is given:
A process table entry
Process identifier (PID)
An address space (its contents are copied from the parent process)
User-area
Program text
Data
User and kernel stacks
A single kernel-managed thread (even if the parent process had many threads)
Example
Here is an example of fork and exec to start a new program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child;

    if ((child = fork()) == -1) {
        perror("could not fork a child process");
        exit(1);
    }
    if (child == 0) {                  /* child */
        /* exec a new program; the first execl() argument after the
           path is argv[0] for the new program */
        if (execl("/bin/ls", "ls", "-l", NULL) == -1) {
            perror("error on execl");
            exit(1);
        }
    } else {                           /* parent */
        wait(NULL);   /* ensure the parent terminates after the child */
    }
    return 0;
} /* main */
Figure 3-8. Creating Threads BE0070XS4.0
Notes:
Creating threads
When a process is first created it contains a single kernel-managed thread. A process can create additional threads using the thread_create() system call.
Thread library
AIX provides a thread library to assist programmers with the creation and management of threads. Typically, the library function pthread_create() is used to create threads rather than calling thread_create() directly. The thread library allows for creation and management of both kernel-managed threads and user threads using the same interface.
Creating Threads
A new thread is created by the thread_create()
system call. When created the thread is assigned:
A thread table entry
A thread identifier
An execution context (stack pointer and CPU registers)
pthread_create example
Here is an example of creating a new thread using pthread_create:

#include <pthread.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

void *new_thread(void *arg);

int main()
{
    pthread_t threadId;

    /* start up a new thread */
    if (pthread_create(&threadId, NULL, new_thread, NULL)) {
        perror("pthread_create");
        exit(errno);
    }
    /* main thread code here */
    pthread_join(threadId, NULL);  /* wait for the new thread to finish */
    return 0;
}

void *new_thread(void *arg)
{
    /* new thread code here */
    return NULL;
}
Figure 3-9. Process State Transitions BE0070XS4.0
Notes:
Process states
The illustration above shows the states of a process during its life.
In AIX a process can be in one of five states:
- Idle
- Active
- Stopped
- Swapped
- Zombie
Process State Transitions

[Slide diagram: fork() creates the process in the Idle state; it then becomes Active, may move between Active, Swapped, and Stopped, and passes through Zombie before becoming non-existent.]
States
The five process states are described in this table:
Zombie process
Sometimes a Zombie process will stay in the process list for a long time. One example of this situation could be that a process has exited, but the parent process is busy or waiting in the kernel and unable to read the return code. If the parent process no longer exists when a child process exits, the init process (PID 1) frees the remaining resources held by the child.
State     Description
Idle      A process is started with a fork() system call. During creation the process is in the idle state. This state is temporary until all of the necessary resources have been allocated.
Active    Once the creation of the process is done, it is placed in the active state. This is the normal process state. The threads of the process can now be scheduled to run on a CPU.
Stopped   When a process receives a SIGSTOP signal, it is placed in the stopped state. If a process is stopped, all its threads are stopped and will not be scheduled on a CPU. A stopped process can be restarted by the SIGCONT signal.
Swapped   A swapped process has lost its memory resources and its address space has been moved onto disk. It cannot run until swapped back into memory.
Zombie    When a process terminates, some of its resources are not automatically released. A process is placed in the zombie state until its parent cleans up after it and frees the resources. The parent must execute a wait() system call to retrieve the process’ exit status before the process will be removed from the process table.
Process state on a running system

The state of a process can be found on a running system using the ps command.
# ps -l
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
240001 A 201 17670 16390 0 60 20 61f4 496 pts/3 0:00 ksh
200001 A 0 19172 17670 0 60 20 59da 496 pts/3 0:00 ksh
200001 A 0 19392 19172 3 61 20 2605 308 pts/3 0:00 ps
200011 T 0 19928 19172 0 60 20 4dff 436 pts/3 0:00 vi
Process state in a crash dump
The state of a process can also be found in a crash dump using kdb:
# kdb
(0)> proc *
              SLOT NAME    STATE  PID   PPID  PGRP  UID   ADSPACE
pvproc+000000 0    swapper ACTIVE 00000 00000 00000 00000 00004812
pvproc+000200 1    init    ACTIVE 00001 00000 00000 00000 0000342D
pvproc+000400 2    wait    ACTIVE 00204 00000 00000 00000 00004C13
pvproc+000600 3    netm    ACTIVE 00306 00000 00000 00000 0000282A
S Flag State
O Nonexistent
I Idle
A Active
T Stopped
W Swapped
Z Zombie
Figure 3-10. The Process Table BE0070XS4.0
Notes:
The process table
The kernel maintains a table entry for each process on the system. This table is called the process table. Each process is represented by one entry in the table. Each entry contains:
- A process identifier
- The process state
- A list of threads
- A description of the process’ address space
- Other process management data
The Process Table

[Slide diagram: the process table as an array of slots numbered 0 through NPROC; each slot holds a pvproc structure whose pv_procp pointer references the associated proc structure.]
Process table

The process table is a fixed-length array of pvproc structures allocated from kernel memory. For the 64-bit kernel, this table is divided into a number of sections called zones. At system startup, one zone is allocated on each SRAD (see the later topic, Table Management).
proc structure
The proc structure is an extension of the pvproc structure. The pv_procp field in the pvproc points to its associated proc structure. The proc and pvproc structures are split to accommodate large system architectures.
Slot number
Each entry in the process table is referred to by its slot number.
Figure 3-11. pvproc BE0070XS4.0
Notes:
pvproc structure
The definition of the pvproc structure can be found in /usr/include/sys/proc.h. Some of the key elements are shown above.
pvproc
Element         Description
pv_pid          Unique process identifier (PID)
pv_ppid         Parent’s process identifier (PPID)
pv_uid          User identifier
pv_stat         Process state
pv_flags        Process flags
*pv_procp       Pointer to the proc entry
*pv_threadlist  Head of the list of threads
*pv_child       Head of the list of children
*pv_siblings    NULL-terminated sibling list
Figure 3-12. pv_stat . BE0070XS4.0
Notes:
pv_stat
The process state is stored in the pvproc->pv_stat data element. Values for pv_stat are defined in /usr/include/sys/proc.h as shown in this table.
Process table size
The size of the process table determines how many processes the system can have. The size of the table is defined as NPROC in the file /usr/include/sys/proc.h.
pv_stat
Values Meaning
SNONE Slot is not being used
SIDL Process is being created
SACTIVE Process has at least one active thread
SSWAP Process is swapped out
SSTOP Process is stopped
SZOMB Process is zombie
Figure 3-13. Table Management BE0070XS4.0
Notes:
Table management
If the entire process table were pinned in memory, it would consume a significant amount of memory. In reality, the entire table is rarely needed; therefore, only a portion of the table is pinned in memory at one time.
Zones
The process table used in the 64-bit kernel is split into equal sized sections called zones. Each zone contains a fixed number of process slots. The number of zones, and number of process slots per zone, is version dependent. The details can be determined by examining the value of PM_NUMSRAD_ZONES, defined in the header file <sys/pmzone.h>.
At system startup, one zone is allocated on each SRAD in the system. When a zone on an SRAD fills up (i.e., all of the process slots in that zone are used), another zone is allocated to the SRAD and added to the pool. At the moment, there is only one SRAD per system.

Table Management

[Slide diagram: the process table divided into zones (Zone 0 through Zone 32); slot numbers from Slot 0 to Slot 8192 are shown for Zone 0, and the pages below a zone's high water mark are pinned.]

Pinning pages of the process table
Each zone of the process table contains a high water mark indicating the highest number of slots in the zone that have been in use. The memory pages containing the slots up to the high water mark are pinned in memory. As the table grows the high water mark is moved and additional pages of the table are pinned.
32-bit kernel
The process table on 32-bit kernels has only one zone encompassing the entire process table. A single high water mark is used and pages are pinned as explained above.
Large systems
On some systems (64-bit kernel only) a zone would typically be associated with a single RAD (a group of resources connected together by some physical proximity).
Details
Two structures are used to manage the process table. Both are defined in /usr/include/sys/pmzone.h. The table is defined by a struct pm_heap_global. This structure has pointers to several pm_heap structures, one for each zone in the table. The high water mark for the zone is found in the pm_heap.
Figure 3-14. Extending the pvproc BE0070XS4.0
Notes:
proc structure
The proc structure is an extension to the pvproc structure.
History
In older versions of AIX, the process table was made from an array of proc structures. In AIX 5L, each process is represented by two structures; the proc and a smaller pvproc.
Large systems
In some systems, physical memory is divided into pools that have a degree of physical proximity to particular processors. Access to memory hosted by another processor may be slower than access to memory hosted by the local processor. Using one large proc structure table could result in many "remote" accesses. The AIX 5L design allows the use of RADs (Resource Affinity Domains), a collection of resources grouped by some degree of physical proximity. An SRAD (scheduler RAD) is a RAD large enough to warrant a dedicated scheduler thread. The table of pvproc structures is separated into zones, which allows each zone to reside on its own SRAD and refer to proc structures for processes running on that SRAD.

[Figure: Extending the pvproc — each SRAD holds its own pvproc table zone, and each zone's pvproc entries point to proc structures for the CPUs on that SRAD.]
Figure 3-15. PID Format BE0070XS4.0
Notes:
Process identifier
The process identifier or PID is a unique number assigned to a process when the process is first created. It is composed of the process table slot number and a generation count. The generation count is incremented each time the process table slot is used. This means a process table slot can be used 128 times before a process ID is reused.
PID format
The format of a PID is shown above.
PID Format

32-bit kernel (bit 31 down to bit 0):

bits 31-26   000000
bits 25-8    Process table slot index
bits 7-1     Generation count
bit 0        0

64-bit kernel (bit 63 down to bit 0):

bits 63-26   00 . . . 0
bits 25-13   Low-order bits of process table slot index
bits 12-8    SRAD (upper bits of index)
bits 7-1     Generation count
bit 0        0
pid_t

Process identifiers are stored internally using the pid_t typedef.

Bits                       Description
Bit 0                      Always set to zero, making all PIDs even numbers; init is a special case and always has process ID 1.
Generation count           A generation count used to prevent the rapid re-use of PIDs.
Process table slot index   The process table slot number.
SRAD                       (Scheduler Resource Affinity Domain) These bits select the zone of the process table. The number of SRAD bits is version dependent, defined by PM_NUMSRAD_BITS in <sys/pmzone.h>. AIX 5.1 uses 5 bits; AIX 5.2 uses 4 bits.
Remaining bits             Set to zero.
Figure 3-16. Finding the Slot Number BE0070XS4.0
Notes:
Finding the slot number
In a 32-bit kernel the process table slot number can easily be found from a PID by shifting the PID 8 bits to the right. In a 64-bit kernel the slot number is a combination of the SRAD bits with the index bits as shown above.
On AIX 5.1, the SRAD field is 5 bits long; therefore, the index bits do not line up on a nibble boundary. This makes calculating the slot number in your head a little difficult. On AIX 5.2, the SRAD field is 4 bits long, so the calculation is a little easier.
Why are the fields swapped?
The SRAD and index bits are shifted around so that indexing is partitioned by zones.
Finding The Slot Number

PID:          | 000000 | Process table index bits | SRAD | Generation count | 0 |

Slot number:  | SRAD | Process table index bits |   (the pvproc table slot number)
Figure 3-17. Kernel Processes BE0070XS4.0
Notes:
Kernel Processes
Some processes in the system are kernel processes. Kernel processes are created by the kernel itself and execute independently of user thread action.
Kernel Processes
Kernel processes:
- Are created by the kernel
- Have a private u-area and kernel stack
- Share text and data with the rest of the kernel
- Are not affected by signals
- Cannot use shared library object code or other user-protection domain code
- Run in the Kernel Protection Domain
- Can have multiple threads, as can user processes
- Are scheduled like user processes, but tend to have higher priorities
Listing kernel processes
You can list the current kernel processes with the ps -k command.
# ps -k
   PID  TTY  TIME     CMD
     0    -  0:02     swapper
 16388    -  11:20    wait
 24582    -  5681:27  wait
 . . .
 98334    -  0:00     lvmbb
114718    -  0:00     j2pg
163968    -  0:00     rtcmd
172074    -  0:00     dog
Figure 3-18. Thread Table BE0070XS4.0
Notes:
Thread Table
The kernel maintains a thread table. Each kernel-managed thread is represented by one table entry which contains:
- A thread identifier (TID)
- A thread state
- Thread management data
The thread table is similar to the process table. It is an array of pvthread structures allocated from kernel memory. Each entry in the table is referred to by its slot number. The thread table for 64-bit systems is divided into zones and the zones are allocated on different SRADs, just as with the process table.
[Figure: Thread Table — pvthread slots 1 through NTHREAD; each pvthread's tv_threadp pointer leads to its associated thread structure.]
thread structure
The thread structure is an extension of the pvthread structure. The tv_threadp item in the pvthread points to its associated thread structure. The thread and pvthread structures were split to accommodate large system architectures.
Figure 3-19. pvthread Elements BE0070XS4.0
Notes:
pvthread and thread structures
Definitions for the pvthread and thread structures can be found in /usr/include/sys/thread.h.
Elements
Some of the key elements of the pvthread structure are shown above.
Table management
The memory pages for the thread table are managed using the same mechanism that was described for the process table. The thread table is split into multiple zones. Each zone contains a high water mark representing the largest slot number used since system boot. All memory pages for the slots up to the high water mark are pinned. The size of each zone, and the number of zones are version dependent.
pvthread Elements

Element          Description
tv_tid           Unique thread identifier (TID)
*tv_threadp      Pointer to thread structure
*tv_pvprocp      Pointer to pvproc for this thread
*tv_nextthread   Pointer to next thread (pvthread) in the process
*tv_prevthread   Pointer to previous thread (pvthread) in the process
tv_state         Thread state
Figure 3-20. TID Format BE0070XS4.0
Notes:
Thread identifier
Introduction
The thread identifier or TID is a unique number assigned to a thread. The format of a TID is similar to that of a PID except that all TIDs are odd numbers and PIDs are even numbers. The format of a TID is shown above.
tid_t
Thread identifiers are stored internally using the tid_t typedef.
TID Format

32-bit kernel (bit 31 down to bit 0):

bits 31-27   00 . . . 0
bits 26-8    Thread table slot index
bits 7-1     Generation count
bit 0        1

64-bit kernel (bit 63 down to bit 0):

bits 63-27   00 . . . 0
bits 26-13   Low-order bits of thread table slot index
bits 12-8    SRAD (upper bits of index)
bits 7-1     Generation count
bit 0        1
Figure 3-21. u-block BE0070XS4.0
Notes:
Introduction
Each process (including a kernel process) contains a u-block area. The u-block is made up of a user structure (one per process) and one or more uthreads (one per thread).
Access
The u-block is part of the process private memory segment; however, it is only accessible when in kernel mode. It maintains the process state information which is only required when the process is running; therefore, it need not be accessible when the process is not running. It need not be in memory when the process is swapped out. It is pinned when the process is swapped into memory, and unpinned when the process is swapped out.
[Figure: u-block — located in the process private memory segment and defined in /usr/include/sys/user.h; one user structure shared between all threads in the process, plus one uthread per thread holding thread-private data such as stack pointers and the mstsave area.]
Definitions
The u-block is described in the file /usr/include/sys/user.h.
user
Each process has one user structure. Information stored in the user structure is global and shared between all threads in the process. For example, the file descriptor table and the user credentials are kept in the user structure.
uthread
Each thread of a process has its own uthread structure. Threads are responsible for storing execution context; therefore, the uthread holds execution-specific items like the stack pointers and CPU registers. When a thread is interrupted or a context switch occurs the stack pointers and CPU registers of the interrupted thread are stored in the mst-save area of the uthread. When execution of the thread continues the stack pointers and registers are loaded from the mst-save area.
Figure 3-22. Six Structures BE0070XS4.0
Notes:
Introduction
This unit has discussed the AIX 5L data structures: pvproc, proc, pvthread, thread, uthread and user. This section describes how these six structures are tied together.
Diagram
The above diagram depicts the structures for a single process containing three kernel-managed threads.
proc and thread
From the pvproc structure, the first pvthread can be found by following the pv_threadlist pointer. All the pvthread structures for the process are linked via a circular doubly-linked list (see the tv_nextthread and tv_prevthread pointers). The pvproc is extended into the proc structure via the pv_procp pointer. Similarly, the pvthread structures are extended into the thread structures via tv_threadp.

[Figure: Six Structures — for a process with three threads: pvproc → proc (pv_procp) and pvproc → pvthread list (pv_threadlist); each pvthread → thread (tv_threadp) and pvthread → pvproc (tv_pvprocp); each thread → pvthread (t_pvthreadp), thread → proc (t_procp), thread → uthread (t_uthreadp), and thread → user (t_userp); user → proc (U_procp). The uthreads and the user structure together form the u-block.]
u-block
The u-block is divided into uthread sections, one per thread, plus one process-wide user structure. Pointers in the thread structure point to both of these sections. Data that is private to the thread, like stack pointers, is kept in the uthread. Process-wide data is kept in the user area; for example, the file descriptor table. This allows all threads in a process to share the same open files.
Figure 3-23. Thread Scheduling Topics BE0070XS4.0
Notes:
Introduction
The object of thread scheduling is to manage the CPU resources of the system, sharing these resources between all the threads.
Thread Scheduling Topics

- Thread states
- Thread priorities
- Run queues
- Software components of the kernel: the scheduler and the dispatcher
- Scheduling algorithms
- Support for SMP and large systems
Figure 3-24. Thread State Transitions BE0070XS4.0
Notes:
Introduction
In AIX, the kernel allows many threads to run at the same time, but there can be only one thread actually executing on each CPU at one time. The thread state shows if a thread is currently running or is inactive.
State transitions
Threads can be in one of several states. A thread typically changes its state between running, ready to run, sleeping and stopped several times during its lifetime. The diagram above shows all the state transitions a thread can make.
[Figure: Thread State Transitions — transitions between Idle, Ready to Run, Running, Sleeping, Stopped (by a signal), and Zombie.]
States

All the thread states are described in this table:

State         Description
Idle          When first created, a thread is placed in the idle state. This state is temporary, lasting until all of the necessary resources for the thread have been allocated.
Ready to Run  Once creation of the new thread is complete, it is placed in the ready to run state. The thread waits in this state until it is run.
Running       A thread in the running state is the thread executing on a CPU. The thread state changes between running and ready to run until the thread finishes execution; the thread then goes to the zombie state.
Sleeping      Whenever the thread is waiting for an event, the thread is said to be sleeping.
Stopped       A stopped thread is a thread stopped by the SIGSTOP signal. Stopped threads can be restarted by the SIGCONT signal.
Swapped       Though swapping takes place at the process level and all threads of a process are swapped at the same time, the thread table is updated whenever the thread is swapped.
Zombie        The zombie state is an intermediate state for the thread, lasting only until all the resources owned by the thread are given up.

tv_state

The thread state is kept in the tv_state flag of the pvthread structure. The defined values for this flag are:

Flag      Meaning
TSNONE    Slot is available
TSIDL     Being created (idle)
TSRUN     Runnable (or running)
TSSLEEP   Awaiting an event (sleeping)
TSSWAP    Swapped
TSSTOP    Stopped
TSZOMB    Being deleted (zombie)
Running threads
No tv_state flag value has been defined for the running state. The running state is implied when a thread is currently being run; therefore a flag is not necessary. The value of the tv_state flag for running threads will be shown as ready to run (TSRUN). A thread must be ready to run before it can be run.
A thread that is ready to run has a state of TSRUN, and a wait type of TWCPU, i.e. the thread is waiting for CPU access. A thread that is actually running has a state of TSRUN, and a wait type of TNOWAIT.
Figure 3-25. Thread Priority BE0070XS4.0
Notes:
Introduction
All threads are assigned a priority value and a nice value. The dispatcher examines these values to determine what thread to run.
Thread priority
Each thread is assigned a priority number between 0 and 255. CPU time is made available to threads according to their priority number, with precedence given to the thread with the lowest priority number. The highest priority at which a thread can run in user mode is defined as PUSER, or 40. Priorities above PUSER (that is, numerically lower) are used for real-time threads.
[Figure: Thread Priority — the priority scale runs from 0 (highest priority) to 255 (lowest priority); PUSER = 40 divides kernel priorities from user priorities.]
Lower number means high priority
Do not confuse a high priority value with a high priority thread. The two are inversely related. In other words, a thread with a numerically low priority value is more important than one with a larger value.
nice
Each process is assigned a nice value between 0 and 39. The nice value is used to adjust thread priority. A process’ nice value is saved in the proc structure as p_nice=nice+PUSER. The default value for nice is 20. The nice value of a process can be set using the nice command or changed using the renice command.
Figure 3-26. Run Queues BE0070XS4.0
Notes:
Introduction
All runnable threads on the system (except the currently running threads) are listed on a run queue. A run queue is arranged as a set of doubly-linked lists, with one linked list for each thread priority. Since there are 256 different thread priorities, a single run queue consists of 256 linked lists. AIX selects the next thread to run by searching the run queues for the highest priority (that is, numerically lowest) runnable thread. A single-CPU system has one run queue.
Wait thread
The wait thread is always ready to run, and has a priority value of 255. It is the only thread on the system that will run at priority 255. If AIX finds no other ready to run thread, it will run the wait thread.
[Figure: Run Queues — 256 linked lists, one per priority value from 0 to 255; threads hang off the list for their priority, and the wait thread sits alone at priority 255.]
Figure 3-27. Dispatcher and Scheduler Functions BE0070XS4.0
Notes:
Introduction
The scheduling and running of threads are the jobs of the dispatcher and scheduler. AIX is designed to handle many simultaneous threads.
Clock ticks
A clock tick is 1/100 of a second. The number of clock ticks a thread has accumulated is used by the scheduler to calculate a new priority for the thread. Generally, a thread that has accumulated many clock ticks will have its priority decreased (i.e., its priority value will grow larger).
Dispatcher and Scheduler Functions

Dispatcher
- Searches the run queues for the highest priority thread
- Dispatches the most-favored thread (highest priority)
- Invoked at various points in the kernel, including:
  - By the clock interrupt (every 1/100th of a second)
  - When the running thread gives up the CPU

Scheduler
- Runs once a second
- Recalculates thread priority for all runnable threads based on:
  - The amount of CPU time a thread has received
  - The priority value
  - The nice value
Figure 3-28. Dispatcher BE0070XS4.0
Notes:
Dispatcher
The dispatcher runs under the following circumstances:
- A time interval has passed (1/100 sec).
- A thread has voluntarily given up the CPU.
- A thread (from a non-threaded process) that has been boosted is returning to user mode from kernel mode.
- A thread has been made runnable by an interrupt and the processor is about to finish interrupt processing and return to INTBASE.
The steps the dispatcher takes are listed above.
Dispatcher

Step  Action
1     If invoked because a clock tick has passed, increment the t_cpu element of the currently running thread. t_cpu is limited to a maximum value of T_CPU_MAX:

          if (thread->t_cpu < T_CPU_MAX)
              thread->t_cpu++;

2     Scan the run queue(s) looking for the highest priority ready-to-run thread.
3     If the selected thread is different from the currently running thread, place the currently running thread back on the run queue, and place the selected thread at the end of the MST chain.
4     Resume execution of the thread at the end of the MST chain.
Figure 3-29. Scheduler BE0070XS4.0
Notes:
Scheduler
The scheduler runs every second. Its job is to recalculate the priority of all runnable threads on the system. The priority of a sleeping thread will not be changed. The steps the scheduler uses to calculate thread priorities are shown in the table above.
r and d
The values of r and d can be set using the schedo command. The r and d values control how a process is impacted by the run time; r impacts how severely a process is penalized by used CPU time, while d controls how fast the system “forgives” previous CPU consumption.
0 <= r,d <= 32
The default value for r,d is 16.
Scheduler

Step  Action
1     If the value of nice is greater than the default value of 20, double its value, making it possible to more strongly discriminate against upwardly nice'd threads. Recall that p_nice = nice + PUSER, where PUSER = 40 and 0 <= nice <= 40:

          if (p_nice > 60)
              new_nice = 2 * (p_nice - 60) + 60

2     Calculate the new priority using the equation:

          priority = new_nice + (t_cpu * r / 32) * (new_nice + 4) / 64

3     Degrade the value of t_cpu so that ticks the thread has used in the past have less effect than recent ticks:

          t_cpu = t_cpu * d / 32
Figure 3-30. Preemption BE0070XS4.0
Notes:
Preemption
Definition
When the dispatcher runs and finds a runnable thread with a higher priority than the currently running thread, the running context is switched to the higher priority thread. The thread that was displaced before its time slice expired is said to have been preempted.
Non-preemptive kernel
Most UNIX systems will not allow pre-emption to occur when running in kernel mode. If the current running thread is in kernel mode and a higher priority thread becomes ready to run, it will not be granted CPU time until the running thread returns to user mode and voluntarily gives up the CPU. This can result in long delays in processing high-priority or real-time threads.
Preemption
What is preemption?
Non-preemptive kernel vs. preemptive kernel
Preventing deadlock in preemptive kernels
Priority boost
Preemption in kernel mode
AIX allows thread pre-emption in kernel mode. This feature supports real-time processing where a real-time thread must respond to an action in a known time-frame.
Figure 3-31. Preemptive Kernels BE0070XS4.0
Notes:
Problems with preemptive kernels
The above scenario demonstrates the problem that AIX has solved to make kernel preemption work. In this example, threads A, B, and C are all running in kernel mode.
Step  Action
1.    Thread A, a low priority thread, has obtained access to an exclusive resource lock.
2.    Thread B, running at a higher priority, is waiting to obtain the same resource lock. This thread cannot continue until thread A releases the lock.
3.    Thread C's priority is higher than thread A's, and thread C is ready to run. When the dispatcher runs, thread C preempts thread A.
[Figure: Preemptive Kernels — Thread A (low priority) holding the lock, Thread B (high priority) waiting for the lock, Thread C (medium priority) running.]
Priority boost
To resolve this situation, priority boost was added to AIX. Priority boost increases the priority of threads holding locks.
- When a high priority thread has to wait for a lock, it changes the priority of the thread that is holding the lock to its own priority.
- The priority boost only applies to the “low priority” thread when it is holding the lock. The priority is set back to the original value when either:
— The scheduler notices that the boosted thread is no longer holding any locks.
— The boosted thread returns to user mode from kernel mode.
— The high priority thread that was waiting for the lock obtains the lock.
Priority boost applies to both kernel locks and user (pthreads library) locks. A thread running in kernel mode must release any kernel locks it holds before returning to user mode.
4.    Thread A is still holding the resource lock. Even though thread B is the highest priority thread on the system, it cannot proceed until it obtains the resource held by thread A. Thread A is not running, so it cannot release the lock.
Figure 3-32. Scheduling Algorithms BE0070XS4.0
Notes:
Introduction
AIX has three main types of scheduling algorithms that affect how a thread's priority is calculated by the scheduler. The main algorithms, as defined in <sys/sched.h>, are listed in the visual above.
Scheduling Algorithms

SCHED_RR      Fixed priority; threads are timesliced
SCHED_FIFO    Fixed priority; threads ignore timeslicing
SCHED_OTHER   Default policy; priority based on CPU time and nice value
SCHED_RR
This is a round robin scheduling mechanism in which the thread is time-sliced at a fixed priority. The amount of CPU time and the nice value have no effect on the thread's priority.
- This scheme is similar to creating a fixed-priority, real-time process.
- The thread must have root authority to be able to use this scheduling mechanism.
- It is possible to create a thread with SCHED_RR that has a high enough priority that it could monopolize the processor if it is always runnable and there are no other runnable threads with the same (or higher) priority.
SCHED_FIFO
Similar to SCHED_RR; however:
- The thread runs at fixed priority and is not time-sliced.
- It will be allowed to run on a processor until it voluntarily relinquishes by blocking or yielding, or until a higher priority thread is made runnable.
- A thread using SCHED_FIFO must have root authority to use it.
- It is possible to create a thread with SCHED_FIFO that has a high enough priority that it could monopolize the processor if it is always runnable.
There are actually three other related policies, SCHED_FIFO2, SCHED_FIFO3 and SCHED_FIFO4. The FIFO policies differ in how they return threads to the run queue, and thereby provide a way of differentiating between their effective priorities. See the Performance Management Guide of the AIX online documentation for more details.
SCHED_OTHER
This is the default AIX scheduling policy that was discussed earlier. Thread priority is constantly being adjusted based on the value of nice and the amount of CPU time a thread has received. Priority degrades with CPU usage.
Choosing scheduling algorithms
By default a thread will run with the SCHED_OTHER scheduling algorithm. Threads running as the root user can change scheduling algorithms using the thread_setsched() subroutine.
int thread_setsched (tid, priority, policy)
tid_t tid;
int priority;
int policy;
t_policy

The scheduling policy a thread is using is stored in thread->t_policy.
Figure 3-33. SMP - Multiple Run Queues BE0070XS4.0
Notes:
Introduction
On Symmetric Multi-Processing systems (SMP) per-CPU run queues are used to compensate for the multiple memory caches used on these systems.
Memory cache
Each CPU in a symmetric multi-processing system has its own memory cache. The purpose of the cache is to speed up processing by pre-loading blocks of physical memory into the higher speed cache.
Cache warmth
A thread is said to have gained cache warmth on a CPU when a portion of the process memory has been loaded into that CPU's cache. In an SMP system, threads can be scheduled onto any CPU. The best performance is achieved when a thread runs on a CPU where it has gained some cache warmth. The AIX thread scheduler takes advantage of cache warmth by attempting to schedule a thread on the same CPU it ran on last.

[Figure: SMP - Multiple Run Queues — CPU 0, CPU 1, and CPU 2, each with its own per-CPU run queue, alongside a global run queue.]

Multiple run queues

In addition to a global run queue, each CPU is given its own run queue. Each CPU draws work from its own run queue, selecting the highest priority work from that queue.
Soft cache affinity
By having a run queue for each processor, we allow for some measure of soft cache affinity. As long as the thread is in the same CPU run queue it will run on the same processor.
Hard affinity
Threads can be bound to a single CPU meaning they are never placed in the global run queue. This is called hard affinity. The bindprocessor() subroutine is used to give a single thread or all threads of a process hard affinity to a CPU. Hard affinity (or binding) is recorded in thread->t_cpuid. If t_cpuid is set to PROCESSOR_CLASS_ANY=-1 the thread is not using hard affinity (note that t_cpuid=0 means bound to cpu 0).
RT_GRQ
If a thread has exported the environment variable RT_GRQ=ON, it will sacrifice soft cache affinity. The thread will be placed only in the global run queue and hence run on the first available CPU.
Load balancing
The system uses load balancing techniques to ensure that work is distributed evenly between all of the CPUs in the system.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
© Copyright IBM Corp. 2001, 2003 Unit 3. Process Management 3-57
Student Notebook
Figure 3-34. NUMA BE0070XS4.0
Notes:
In a true SMP architecture, the S stands for symmetric. This means that any CPU can access any piece of memory with virtually the same cost in terms of latency and bandwidth. The SMP architecture has a limit on the size to which it can grow, both in terms of the number of CPUs and the amount of memory. The limits grow over time as individual technologies improve (such as processor speed and memory bandwidth); however, there is still a point at which adding more CPUs, or adding more memory, actually degrades performance.
One approach that has been taken in the past to allow the development of large systems is to use SMP systems as building blocks and couple them together into a single system. A good example of this is the NUMA-Q systems developed by Sequent. NUMA stands for Non-Uniform Memory Access. The memory in a NUMA system is effectively divided into two classes: local memory, which is on the same system building block as the CPU trying to access it, and remote memory, which is located on a different system building block. In a NUMA architecture, there are relatively large differences in access latency (approximately one order of magnitude) and bandwidth between local and remote memory.
[Figure: NUMA: three nodes, each containing local memory, four CPUs, I/O, and a remote cache, connected by a memory interconnect]
Local vs. remote memory access
Access to memory on the same node as the device requesting the access is defined as local access. Accessing memory on a different node is defined as remote access. To the device (CPU or I/O) accessing the memory, remote and local accesses are identical except for speed: remote memory access may be slower.
Figure 3-35. Memory Affinity BE0070XS4.0
Notes:
The visual above shows the system architecture of the pSeries 690. This system is an SMP system that has some characteristics of a NUMA system. Some memory is 'local' to a processor, and other parts of memory are 'remote'. The major difference between this architecture and a true NUMA one is that the latency and bandwidth differences between local and remote access are much smaller.
Looking at this diagram, we could consider this architecture to be a single system (since all of the components are inside a single cabinet). However, if we examine the diagram more closely, we can see that each MCM has two attached memory cards. We could consider an MCM and its two memory cards to be a RAD, since these resources have a degree of physical proximity when compared to other parts of memory or other processors.
[Figure: Memory Affinity: a pSeries 690 with four MCMs (MCM 0-3); each MCM holds eight processors paired over four L2 caches, with attached L3 caches, memory slots, and GX slots]
Definitions
This section defines some additional terms:
- RAD (Resource Affinity Domain): a group of resources connected together by some physical proximity.
- SRAD (Scheduler RAD): the RAD that the scheduler operates on; usually a physical node.
- SDL (System Decomposition Level): RADs exist at multiple levels. The top level is the entire system; the bottom, or atomic, level consists of individual CPUs and memory. The SDL determines how small a RAD will be.
Figure 3-36. Global Run Queues BE0070XS4.0
Notes:
Introduction
This section talks about design enhancements to facilitate future systems. The goal of the thread scheduler is to balance the process load between all the CPUs in the system and to reduce the time a runnable thread waits to be run while other CPUs are idle.
Run queues
The design of the AIX 5L thread scheduler has been extended to allow per-node run queues and one global run queue.
Process placement
For most applications the most frequent memory access is to the process’ text. Other frequent accesses include private data, stack and some kernel data structures. To
[Figure: Global Run Queues: one global run queue feeding two SRADs, each SRAD containing four per-CPU run queues (CPU 0-3 and CPU 4-7)]
minimize memory access time, the process text, data, stack, and kernel data structures are allocated from memory on the RAD containing the CPUs that will execute the threads belonging to that process. This RAD, or set of RADs, is called the process home RAD.
RAD affinity scheduling
The purpose of RAD affinity scheduling is to exploit RAD-local memory and RAD-level caches by allocating a process's private memory and text on the RAD(s) where it will be executed, and conversely, by attempting to execute the process's threads on CPUs where there is cache warmth.
Process migration
In order to keep the system efficient, AIX will occasionally migrate a process between SRADs. For a process to migrate, its memory must be copied to the process’ new home RAD.
Logical attachment
Processes that share resources may be logically attached. Logically attached processes are required to run on the same RAD. An API is provided for the control of logical attachments.
Physical attachment
Processes can be attached to a physical collection of resources (CPU and memory) called an RSet. Processes attached to an RSet can only migrate between members of the RSet.
Figure 3-37. Checkpoint BE0070XS4.0
Notes:
Checkpoint
1. AIX provides _____ programming models for user threads.
2. A new thread is created by the __________system call.
3. The process table is an _____ of _______ structures.
4. All process IDs (except pid 1) are _____.
5. A thread table slot number is included in a thread ID. True or False?
6. A thread holding a lock may have its priority _______.
Figure 3-38. Exercise BE0070XS4.0
Notes:
Introduction
Turn to your lab workbook and complete exercise three.
Read the information blocks contained within the exercise. They provide you with information you need to do the exercise.
Exercise
Complete exercise 3
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
- Examine the process and thread structures using kdb
- Apply what you learned to the analysis of a crash dump
- Learn about and configure system hang detection
- Explore how signal information is stored and used in AIX
Figure 3-39. Unit Summary BE0070XS4.0
Notes:
Unit Summary
The primary unit of execution in AIX is the thread.
AIX has three thread programming models available: 1:1, M:1, M:N
The dispatcher selects which thread to run
The scheduler adjusts thread priority based on: nice value and CPU time
Scheduling algorithms are SCHED_RR, SCHED_FIFO, SCHED_OTHER
The six structures of a process are: pvproc, proc, pv_thread, thread, user, u_thread.
Processes can handle or ignore signals; threads can mask signals.
Unit 4. Addressing Memory
What This Unit Is About
This unit describes how memory is organized and addressed in AIX 5L.
What You Should Be Able to Do
After completing this unit, you should be able to:
• List the types of addressing spaces used by AIX 5L
• List the attributes associated with each segment type
• Given the effective address of a memory object, identify the segment number and object type
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Unit review
References
PowerPC Microprocessor Family: The Programmer's Reference Guide
Available from http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC
Figure 4-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
List the types of addressing spaces used by AIX 5L.
List the attributes associated with each segment type.
Given the effective address of a memory object, identify the segment number and object type.
Figure 4-2. Memory Management Definitions BE0070XS4.0
Notes:
Memory Management Definitions
Introduction
To explore how AIX 5L addresses memory we must first define the terms and concepts listed above.
- Page
- Frame
- Address space: effective address space, virtual address space, physical address space
Figure 4-3. Pages and Frames BE0070XS4.0
Notes:
Introduction
AIX manages memory in 4096-byte chunks called pages. Pages are organized and stored in real (physical) memory chunks called frames.
Page
A page is a fixed-sized chunk of contiguous storage that is treated as the basic entity transferred between memory and disk. Pages stay separate from each other; they do not overlap in virtual address space. AIX 5L uses a fixed page size of 4096 bytes. The smallest unit of memory managed by hardware and software is one page.
[Figure: Pages and Frames: a 4096-byte page stored in a frame of real (physical) memory]
Large page support
POWER4 processors can handle 16 MB pages. AIX 5L can be configured to allow a number of large page segments. See the AIX online documentation for more information.
Frame
The place in real memory used to hold the page is called the frame. Whereas a page is a collection of information, a frame is the place in memory to hold that information.
Figure 4-4. Address Space BE0070XS4.0
Notes:
Introduction
An address space is memory (real or virtual) defined by a range of addresses. AIX 5L defines several different address spaces:
- Effective address space
- Virtual address space
- Physical address space
Effective address space
Effective addresses are those referenced by the machine instructions of a program or kernel. The effective address space is the range of addresses defined by the instruction set. The effective address space is mapped to physical address space or to disk files for each process. However, programs and processes ‘see’ one contiguous address space.
[Figure: Address Space: the effective addresses of Process 1 and Process 2 map into the virtual address space, which is backed by physical memory, paging space, and filesystem pages]
Virtual address space
The virtual address space is the set of all memory objects that could be made addressable by the hardware. The virtual address space is bigger than the effective address space, since it is addressed using more address bits. Processes have access to a limited range of virtual addresses given to them by the kernel.
Physical address space
The physical address space depends on how much physical memory (DRAM) is on the machine. The physical address space is mapped to the machine's hardware memory; however, depending on how much memory is installed and the number of PCI host bus controllers in the machine, it may not be referenced in a single contiguous range. For example, a system with 8 GB of memory installed may use the ranges 0-3 GB and 4-9 GB to reference physical memory, rather than a single 0-8 GB address range. Physical addresses in the range 3-4 GB would be used to access devices connected to PCI host bus controllers. There is more information about this subject in a later unit covering the implementation of LPAR.
Paging space
The paging space is the disk area used by the memory manager to hold inactive memory pages with no other home. In AIX, the paging space is mainly used to hold the pages from working storage (process data pages). If a memory page is not in physical memory, it may be loaded from disk; this is called a page-in. Writing a modified page to disk is called a page-out.
Figure 4-5. Translating Addresses BE0070XS4.0
Notes:
Introduction
When a program accesses an effective address, the hardware translates the address into a physical address using the above process.
Translating Addresses
1. The effective address is referenced by a process or by the kernel.
2. The hardware translates the address into a system-wide virtual address.
3. The page containing the virtual address is located in physical memory or on disk.
4. If the page is currently located on disk, a free frame is found in physical memory and the page is loaded into this frame.
5. The memory operation requested by the process or kernel is completed on the physical memory.
Figure 4-6. Segments BE0070XS4.0
Notes:
Introduction
Effective memory address space in AIX 5L is divided into 256 MB objects called segments.
Segments
The maximum number of segments available to a process depends on the effective address space size (32-bit or 64-bit).
Available memory
A process can control how much of its effective address space is available in two ways. A process can create or destroy segments in its address space. A process can adjust the number of pages in a single segment (up to 256 MB).
[Figure: Segments: the effective address space divided into 256 MB segments, numbered 0 through n]
Sharing address space
The benefit of the segmented addressing model is the high degree of memory sharing that can occur between processes. A segment can be mapped into more than one process's effective address space, allowing the same physical memory to be shared. Once a shared segment is defined, it can be attached or detached by many processes.
Figure 4-7. Segment Addressing BE0070XS4.0
Notes:
Introduction
This section discusses how memory segments are addressed.
Segment addressing
Both the 64-bit and 32-bit effective address spaces are divided into 256 MB segments. Each segment has a segment number or Effective Segment ID (ESID). In the 32-bit model, this number is 4 bits long, allowing for 16 segments. In this case the ESID identifies one of 16 segment registers. In the 64-bit model, 36 bits are used for the ESID, allowing for 2^36 (more than 64 billion) segments. In this case the value identifies an entry in the STAB table, which is pointed to by the ASR (Address Space Register). In both cases the main item in the register/table entry is called a Virtual Segment ID (VSID). The virtual page index and byte offset are used together with the VSID to resolve the effective address. The address resolution information that follows describes this process.
Segment Addressing
An effective address is broken down into three components:
Segment # (4/36 bits) | Virtual Page Index (16 bits) | Byte Offset (12 bits)
- The first 4 bits (32-bit address) or 36 bits (64-bit address), called the ESID, select the segment register or STAB table slot.
- The next 16 bits select the page within the segment.
- The final 12 bits select the offset within the page.
32-bit process on 64-bit hardware
Keeping a consistent segment size in both the 32-bit and 64-bit execution modes allows for a 32-bit environment that is compatible with 64-bit hardware. When running a 32-bit application the 64-bit hardware will zero extend the 32-bit effective address. Therefore, only the first 16 segments (ESID 0 - 15) can be accessed by a 32-bit application. This is consistent with the 32-bit hardware model which only has 16 segments.
Figure 4-8. 32-bit Hardware Address Resolution BE0070XS4.0
Notes:
Introduction
As already noted, the effective address segment number identifies a register or table value. We call this value the Virtual Segment ID (VSID); it is 24 or 52 bits long on 32-bit or 64-bit hardware, respectively. This value, together with the remaining effective address information (segment page number and page offset), is used to resolve our effective address to a machine-usable address. This visual and the following one illustrate the process.
Note that the virtual address space is larger than the effective or real address spaces (it is 52 or 80 bits wide on 32-bit or 64-bit hardware platforms, respectively).
32-bit Hardware Address Resolution
On 32-bit hardware, each 32-bit effective address uses the first 4 bits to select one of the 16 segment registers. The segment register contains a 24-bit Virtual Segment ID (VSID).
[Diagram: the 24-bit Segment ID and the 16-bit Virtual Page Index form a 40-bit Virtual Page Number; with the 12-bit Page Offset this yields a 52-bit virtual address. A lookup through the Translation Look-Aside Buffer (TLB), Hash Anchor Table (HAT), hardware Page Frame Table (PFT), and software Page Frame Table produces a 20-bit Real Page Number and a 32-bit real address.]
These 24 bits are used with the 16-bit segment page number from the original address to yield a 40-bit virtual page number. Combining this with the 12-bit page offset gives a 52-bit virtual address, which is used internally by the processor. The 40-bit virtual page number is then used in a lookup mechanism to find a 20-bit real page number, which is combined with the 12-bit page offset to produce a 32-bit real address.
Figure 4-9. 64 Bit Hardware Address Resolution BE0070XS4.0
Notes:
64-bit Hardware Platform Address Resolution
The visual above illustrates the address resolution process for 64-bit hardware platforms. Note that it is completely analogous to the preceding 32-bit illustration. 64-bit hardware allows the operating system to define a virtual memory space that is significantly larger than the maximum amount of real memory that can be addressed. This is accomplished via the use of a segment table. Each 64-bit effective address uses the first 36 bits as a segment number. The segment number is mapped to a 52-bit Virtual Segment ID (VSID), using either a segment lookaside buffer (SLB) or a segment table (STAB). These 52 bits are used with the segment page number from the original address to yield a 68-bit virtual page number. Combining this with the 12-bit page offset gives an 80-bit virtual address, which is used internally by the processor. The 68-bit virtual page number is then used in a lookup mechanism to find a 52-bit real page number, which is combined with the 12-bit page offset to produce a 64-bit real address.
64-bit Hardware Address Resolution
[Diagram: the 36-bit Segment # is mapped through the Segment Lookaside Buffer or Segment Table (STAB) to a 52-bit Segment ID; with the 16-bit Virtual Page Index this forms a 68-bit Virtual Page Number and, with the 12-bit Page Offset, an 80-bit virtual address. A lookup through the Translation Look-Aside Buffer (TLB), Hash Anchor Table (HAT), hardware Page Frame Table (PFT), and software Page Frame Table produces a 52-bit Real Page Number and a 64-bit real address.]
Figure 4-10. Segment Types BE0070XS4.0
Notes:
Introduction
Several segment types are used in a process’s address space. The segment types are listed in the visual above.
Private vs. shared
Memory in a shared segment may be mapped to the same virtual address in more than one process. This allows the sharing of data between processes. Memory in private segments is only mapped to a single process’ address space. This prevents one process from accessing or altering another process’ private memory.
Kernel segments
Kernel segments are segments that are shared by all processes on the system. These segments can only be accessed by code running in the kernel protection domain.
Segment Types
Kernel Segment
User Text
Process Private
Shared Library Text
Shared Data
Shared Library Data
User text
The user text segments contain the code of the program. Threads in user mode have read-only access to text segments, to prevent modification during program execution. This protection allows a single copy of a text segment to be shared by all processes associated with the same program. For example, if two processes in the system are running the ls command, the instructions of ls are shared between them.
Running a debugger
When running a debugger, a private read/write copy of the text segment is used. This allows debuggers to set breakpoints directly in code. In that case, the status of the text segment is changed from shared to private.
Process private segment
The process private segment is not shared among other processes. The process private segment contains:
- The user data
- The user stack (for 32-bit programs)
- Text and data from explicitly loaded modules (for 32-bit programs)
- Kernel per-process data such as the u-block (accessible only in kernel mode)
- The primary kernel thread stack (accessible only in kernel mode)
- Per-process loader data (accessible only in kernel mode)
Performance advantage
When a process calls fork, the process private segment of the child process is created as a ‘copy-on-write’ segment. It shares its contents with the process private segment of the parent process. Whenever the parent or child process modifies a page that is part of the process private segment, the page is actually copied into the segment for the child process. This results in a major performance advantage for the kernel, especially in the (very common) situation where the newly created child process immediately performs an exec() call to start running a different program.
Shared library text
The shared library text segment contains mappings whose addresses are common across all processes. A shared library segment:
- Contains a copy of the program text (instructions) for the shared libraries currently in use in the system.
- Is added to the user address space by the loader when the first shared library is loaded.
Each process using text from the shared library text segment has a copy of the corresponding data in the per-process shared library data segment.
Executable modules list the shared libraries they need at exec() time. The shared library text is loaded into this segment when a module is loaded via the exec() system call. Alternatively, a program may issue load() calls to bring in additional shared modules.
Shared library data segment
Functions in shared libraries can define variables and other data elements that are private to a process. These elements are placed in the shared library data segment.
- Each process has one shared library data segment.
- Addresses of data items are generally the same across processes.
- Data itself is not shared.
The shared library data segments act as extensions of the process private segment.
Shared data
Mapped memory regions, also called shared memory areas, can serve as large pools for exchanging data among processes.
- A process can create and/or attach a shared data segment that is accessible by other processes.
- A shared data segment can represent a single memory object or a collection of memory objects.
- Shared memory can be attached read-only or read-write.
Figure 4-11. Shared Memory BE0070XS4.0
Notes:
Introduction
Shared memory areas can be most beneficial when the amount of data to be exchanged between processes is too large to transfer with messages, or when many processes maintain a common large database.
Shared memory address
The shared memory is process-based and can be attached at different effective addresses in different processes.
Methods of sharing
The system provides two methods of sharing memory:
- Mapping file data into the process address space (mmap() services).
[Figure: Shared Memory: the same virtual memory segments attached into the effective address spaces of Process A and Process B]
- Mapping to anonymous memory regions that may be shared (shmat() services).
Serialization
There is no implicit serialization support when two or more processes access the same shared data segment. The available subroutines do not provide locks or access control among the processes. Therefore, processes using shared memory areas must set up a signal or semaphore control method to prevent access conflicts and to keep one process from changing data that another is using.
Figure 4-12. shmat Memory Services BE0070XS4.0
Notes:
Introduction
The shmat services are typically used to create and use shared memory objects from a program.
shmat functions
A program can use the following functions to create and manage shared memory segments.
Using shmat
The shmget() system call is used to create a shared memory region; when supporting objects larger than 256 MB, it creates multiple segments.
shmat Memory Services
shmctl() - Controls shared memory operations
shmget() - Gets or creates a shared memory segment
shmat() - Attaches a shared memory segment
shmdt() - Detaches a shared memory segment
disclaim() - Removes a mapping from a specified address range within a shared memory segment
The shmat() system call is used to gain addressability to a shared memory region.
EXTSHM
The environment variable EXTSHM=ON allows shared memory regions to be created with page granularity instead of the default segment granularity. This allows more shared memory regions within the same-sized address space, with no increase in the total amount of shared memory region space.
Figure 4-13. Memory Mapped Files BE0070XS4.0
Notes:
Introduction
Memory segments can be used to map any ordinary file directly into memory. Instead of reading and writing the file using system calls, the program simply accesses variables stored in the segment.
mmap ()
The mmap() service is normally used to map disk files into a process address space; however, shmat() can also be used to map disk files.
Advantages
Memory mapped files provide easy random access, as the file data is always available. This avoids the system call overhead of read() and write(). This single-level store approach can also greatly improve performance by creating a form of
[Figure: Memory Mapped Files: a disk file mapped through virtual memory into the effective address space]
Direct Memory Access (DMA) file access. Instead of buffering the data in the kernel and copying the data from kernel to user, the file data is mapped directly into the user’s address space.
Shared files
A mapped file can be shared between multiple processes, even if some are using mapping and others are using the read/write system call interface. Of course, this may require synchronization between the processes.
mmap services
The mmap() services are typically used for mapping files, although they may also be used for creating shared memory segments.
Both the mmap() and shmat() services provide the capability for multiple processes to map the same region of an object so that they share addressability to that object. However, the mmap() subroutine extends this capability beyond that provided by the shmat() subroutine by allowing a relatively unlimited number of such mappings to be established.
When to use mmap()
Use mmap() under the following circumstances:
- Portability of the application is a concern.
- Many files are mapped simultaneously.
- Page-level protection needs to be set on the mapping.
- Only a portion of a file needs to be mapped.
- Private mapping is required.
When to use shmat()
Use the shmat() services under the following circumstances:
- When mapping files larger than 256 MB.
Service      Description
madvise()    Advises the system of a process' expected paging behavior.
mincore()    Determines residency of memory pages.
mmap()       Maps an object file into virtual memory.
mprotect()   Modifies the access protections of a memory mapping.
msync()      Synchronizes a mapped file with its underlying storage device.
munmap()     Un-maps a mapped memory region.
- For 32-bit applications, when eleven or fewer files are mapped simultaneously and each is smaller than 256 MB.
- When mapping shared memory regions which need to be shared among unrelated processes (no parent-child relationship).
- When mapping entire files.
Mapping types
There are three mapping types:
- Read-write mapping
- Read-only mapping
- Deferred-update mapping
Read-write mapping allows loads and stores in the segment to behave like reads and writes to the corresponding file.
Read-only mapping allows only loads from the segment. The operating system generates a SIGSEGV signal if a program attempts an access that exceeds the access permission given to a memory region. Just as with read-write access, a thread that loads beyond the end of the file loads zero values.
Deferred-update mapping also allows loads and stores to the segment to behave like reads and writes to the corresponding file. The difference between this mapping and read-write mapping is that the modifications are delayed. Any storing into the segment modifies the segment, but does not modify the corresponding file.
With deferred update (O_DEFER flag set on file open), the application can begin modifying the file data (by memory-mapped loads and stores) and then either commit the modifications to the file system (via fsync()) or discard the modifications completely. This can greatly simplify error recovery, and allows the application to avoid a costly temporary file that may otherwise be required.
If all processes that have a file open with the O_DEFER flag set close that file before an fsync() or synchronous update operation is made against the file then that file is not updated.
Figure 4-14. 32-bit User Address Space BE0070XS4.0
Notes:
Introduction
For the 32-bit hardware platform, segment numbers (Effective Segment IDs) have different uses in user and kernel modes.
32-bit user mode
The table above shows the segment layout of a user-mode 32-bit process, that is, how a process running in user mode sees its effective memory. Segment 0 is the first kernel segment; it contains the system call table and the kernel text.
The user program text (application code) is located in segment 1.
Segment 2 contains the data, BSS (uninitialized data), stack and heap for the program. The u-block for the process is also located in segment 2.
32-bit User Address Space

Segment Number   Segment Type and Use                               Attributes
0                Kernel segment                                     shared, read-only
1                User text - application text (code)                shared, read-only
2                Process private segment (data, BSS, stack,         private, read-write
                 u-block, uthread, heap)
3-12             Shared data (shmat or mmap). NOTE: for big data    shared, read-write
                 programs, segments 3-10 can optionally be used
                 as additional heap.
13               Shared library text                                shared, read-only
14               Shared data (shmat or mmap)                        shared, read-write
15               Shared library data segment                        private, read-write
Segments 3-12 are used for shmat() and mmap() areas, and segment 14 provides an additional segment for shmat() and mmap(). Segment 13 contains the text for shared libraries (library code), and segment 15 holds the shared library data.
Big data model
A big data model is supported for 32-bit applications. This allows an application to use more segments for heap, data, and stack. To accomplish this, segments 3-12 are added to the heap, eliminating them as shmat() and mmap() areas. Such a model is required for programs that exceed the limit imposed by the normal 32-bit address space (a single 256 MB segment for heap, data, and stack).
Figure 4-15. 32-bit Kernel Address Space BE0070XS4.0
Notes:
32-bit kernel mode
When a process switches into kernel mode (32-bit kernel) the mapping of segments is changed so that the kernel may access its entire address space. The segment layout for the 32-bit kernel is shown above.
Private process segment
Segment 2 (the private process segment) is mapped the same for both user and kernel modes. This gives the kernel access to this section of the user address space. Data passed between kernel and user is copied in and out of this segment.
32-bit Kernel Address Space

Segment Number   Segment Type and Use
0                Kernel and kernel extension text and data
1                Extended kernel address space (file system and network data)
2                Private process segment
3                Kernel heap segment
7-10             MBUF segments
14               Kernel address space
15               Kernel thread segment
Figure 4-16. 64-bit User/Kernel Address Space BE0070XS4.0
Notes:
64-bit layout
The 64-bit model adds many more segments to the effective address space. Also, for the 64-bit case one segment layout applies to both user and kernel modes.
text, data, bss, stack and heap
A program's text, data, BSS, and heap can occupy segments 0x10 through 0x6FFFFFFF. This allows 64-bit programs to be significantly larger than 32-bit programs.
64-bit User/Kernel Address Space

Segment Number (hex)             Segment Usage
0x0000_0000_0                    System call tables, kernel text
0x0000_0000_1                    Reserved for system use
0x0000_0000_2                    Reserved for user mode loader (process private segment)
0x0000_0000_3 - 0x0000_0000_C    shmat or mmap use
0x0000_0000_D                    Reserved for user mode loader
0x0000_0000_E                    shmat or mmap use
0x0000_0000_F                    Reserved for user mode loader
0x0000_0001_0 - 0x06FF_FFFF_F    Application text, data, BSS and heap
0x0700_0000_0 - 0x07FF_FFFF_F    Default application shmat and mmap area
0x0800_0000_0 - 0x08FF_FFFF_F    Application explicit module load area
0x0900_0000_0 - 0x09FF_FFFF_F    Shared library text and per-process shared library data
0x0A00_0000_0 - 0x0EFF_FFFF_F    Reserved for future use
0x0F00_0000_0 - 0x0FFF_FFFF_F    Application primary thread stack
0x1000_0000_0 - 0xEFFF_FFFF_F    Reserved for future use
0xF000_0000_0 - 0xFFFF_FFFF_F    Additional kernel segments
shmat() and mmap()
For 64-bit applications, the default segments for shmat() and mmap() are segments 0x70000000 - 0x7FFFFFFF. Note that segments 0x3-0xC and segment 0xE are also reserved for shmat() and mmap(); this mirrors the 32-bit segment model.
Kernel segments
Segment 0 is the first kernel segment. The segments from 0xF00000000 and up may be used for additional kernel segments.
Figure 4-17. Checkpoint BE0070XS4.0
Notes:
Checkpoint
1. AIX divides physical memory into ______.
2. The _____________ provides each process with its own _______ address space.
3. A segment can be up to ______ in size.
4. A 32-bit effective address contains a ______ segment number.
5. Shared library data segments can be shared between processes. True or False?
6. The 32-bit user address space layout is the same as the 32-bit kernel address space layout. True or False?
Figure 4-18. Exercise. BE0070XS4.0
Notes:
Turn to your lab workbook and complete exercise four.
Exercise
Complete exercise four
Consists of theory and hands-on
Ask questions at anytime
What you will do: Given the address of a memory object you will identify what segment the address belongs to and speculate as to how the object was created.
Figure 4-19. Unit Summary BE0070XS4.0
Notes:
Unit Summary
Page size = 4096 bytes
Virtual memory management
Address spaces: effective, virtual, physical
Segment size = 256 MB
32-bit vs. 64-bit segment layout
Unit 5. Memory Management

What This Unit Is About
This unit describes how AIX 5L manages memory using demand paging.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Identify the key functions of the AIX virtual memory manager
• Given a memory object type identify the location of the backing store the VMM system will use for this object
• Describe the effect that different paging space allocation policies have on applications and the system
• Find the current paging space usage on the system
• Identify the paging characteristics of a system from a vmcore file
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Unit review
References
PowerPC Microprocessor Family: The Programmer's Reference Guide
Available from http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC
Figure 5-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
Identify the key functions of the AIX virtual memory management system
Given a memory object type identify the location of the backing store the VMM system will use for this object
Describe the effect that different paging space allocation policies have on applications and the system
Find the current paging space usage on the system
Identify the paging characteristics of a system from a vmcore file
Figure 5-2. Virtual Memory Management (VMM) BE0070XS4.0
Notes:
Introduction
In the Addressing Memory lesson we saw how AIX 5L manages the effective address space for both the user and kernel. This lesson focuses on the management of the virtual address space by the Virtual Memory Manager (VMM).
Function of the VMM
The VMM is responsible for keeping track of which program pages are resident in memory and which are on secondary storage (disk). It handles interrupts from the address translation hardware in the system to determine when pages must be retrieved from secondary storage and placed in physical memory.
When all of the physical memory is in use, the VMM decides which programs’ pages are to be replaced and paged out to secondary storage.
Each time a process accesses a virtual address, the virtual address is mapped (if it is not already mapped) by the VMM to a physical address (where the data is located).
Access protection
Another function of the VMM is to provide for access protection that prevents illegal access to data. This function protects programs from incorrectly accessing kernel memory or memory belonging to other programs. Access protection also allows programs to set up memory that may be shared between processes.
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
5-4 Kernel Internals © Copyright IBM Corp. 2001, 2003
Student NotebookV2.0.0.3
Uempty
Figure 5-3. Object Types BE0070XS4.0
Notes:
Memory Object Types
Introduction
Memory objects in AIX 5L are classified based on how the object is used. All memory objects are assigned one of five classification types. The Virtual Memory Management system manages each memory object based on its type.
Working objects
Working objects (also called working storage and working segments) are temporary segments, used during the execution of a program, such as stack and data areas. Process data is created by the loader at run time and is paged in and out of paging space. The working storage segment holds the amount of paging space allocated to
pages in the segment. Part of the AIX kernel is also pageable and is part of the working storage.
Persistent objects
The VMM is used for performing the I/O operations of file systems. Persistent objects hold file data for the local file systems. When a process opens a file, the data pages are paged-in. When the contents of a file change, the page is marked as modified and eventually paged-out directly to the original disk location. File system reads and writes occur by attaching the appropriate file system object and performing loads/stores between the mapped object and the user buffer. File data pages and program text are both part of persistent storage. Program text pages are read-only pages; they are paged-in, but never paged-out to disk. Persistent pages do not use paging space.
Client objects
Client objects are used for pages of client file systems. When remote pages are modified, they are marked and eventually paged-out to the original disk location across the network. Remote program text pages (read-only pages) page-out to paging space, from where they can be paged-in later if needed.
Log objects
Log objects are used for writing or reading journaled file systems file logs during journaling operations.
Mapping objects
Mapping objects are used to support the mmap() interfaces, which allow an application to map multiple objects to the same memory segment.
Figure 5-4. Demand Paging BE0070XS4.0
Notes:
Introduction
AIX is a demand paging system. Physical pages (frames) are not allocated for virtual pages until they are needed (referenced).
How it works
Data is copied into a physical page only when it is referenced by a program or by the kernel. A reference to an unallocated page results in a page fault. Paging is done on the fly and is invisible to the program causing the page fault.
Page faults
A page fault occurs when a thread tries to access a page that is not currently in physical memory.
The mapping of effective addresses to physical addresses is done in the hardware on a page-by-page basis. When the hardware finds that there is no mapping to physical memory, it raises a page fault condition.
Page fault handler
The job of a virtual memory management system (VMM) is to handle page faults so that they are transparent to the thread using effective memory addresses. The steps to resolve a page fault are:
Page validity
The VMM checks to ensure that the effective address being referenced is part of the valid address range of the segment that contains the effective address. There are a number of possible scenarios.
- The effective address is outside the valid address range for the segment. In this case, the page fault cannot be resolved. If the processor is running in kernel mode, an unresolvable page fault results in a system crash. If the processor is running in user mode, the unresolved page fault results in the running process being sent either a SIGSEGV (segmentation violation) or a SIGBUS (bus error), depending on the address being referenced.
- The effective address is within the valid address range for the segment, and the page containing the effective address has already been instantiated. The actions of the VMM in this case are described over the next few pages of the class.
- The effective address is within the valid address range for the segment, but the page containing the effective address has not been instantiated. For example, this happens when an application performs a large malloc() operation. The pages for the malloc'ed space are not instantiated until they are referenced for the first time. In this case, the VMM allocates a physical frame for use by the page, and then updates the segment information to indicate that the page has been allocated. It then updates the hardware page frame table to reflect the physical location of the page, and allows the faulting thread to continue.
Step  Action
1.    The hardware detects a page fault and raises the page fault condition.
2.    Execution of the faulting thread is suspended.
3.    Control is transferred to the page fault handler (part of the virtual memory management system).
4.    The page is loaded from disk into physical memory.
5.    Execution of the faulted thread is resumed.
Advantages

The demand paging system in AIX allows more virtual pages to be allocated than can be stored in physical memory. Demand paging also saves much of the overhead of creating new processes, because pages do not have to be loaded until they are needed. If a process never uses a portion of its virtual space, no physical memory is ever consumed for it.
Physical memory management
Data that has been recently used is kept in physical memory. Data that has not been recently used is kept in paging space. A pager daemon attempts to keep a pool of physical pages free. If the number of pages available goes below a low-water mark threshold, the pager frees the oldest referenced pages, and continues to do so until a high-water mark threshold is reached.
Pageable kernel
AIX’s kernel is pageable. Only some of the kernel is in physical memory at one time. Kernel pages that are not currently being used can be paged out.
Pinned pages
Some parts of the kernel are required to stay in memory because it is not possible to perform a page-in when those pieces of code execute.These pages are said to be pinned. The interrupt processing portion of a device driver is pinned. Only a small part of the kernel is required to be pinned.
Figure 5-5. Data Structures BE0070XS4.0
Notes:
Introduction
The main function of the VMM is to make translations from the effective address to the physical address. Address translation requires both hardware and software components. This section covers the relationship between the hardware and software components of the VMM.
Data structures
The diagram above shows the overall relationships between the major AIX data structures involved in mapping a virtual page to a physical page or to paging space.
Page faults

A page fault causes the AIX VMM to do the bulk of its work. It handles the fault by first verifying that the requested page is valid.
If the page is valid, the VMM determines the location of the page, recovers the page if necessary, and updates the hardware page frame table with the location of the page. A faulted page will be recovered from one of the following locations:
- Physical memory (but not in the hardware PFT).
- Paging disk (working object)
- File system object (persistent object)
Figure 5-6. Hardware Page Mapping BE0070XS4.0
Notes:
Introduction
In a normal situation, an effective address refers to a piece of memory that is currently in real memory. We say the memory is paged in.
Illustration
The flow of the best case address translation is illustrated above.
Hardware Page Frame Table (PFT)
A hardware Page Frame Table (PFT, sometimes “HWPFT”) of address translations is used to make the conversions from effective addresses to physical addresses. These tables only contain a subset of all available translations for the contents of physical memory. If a translation is found in this table the physical page is returned to the requestor. There is no need for a page fault to be generated.
Figure 5-7. Page not in Hardware Table BE0070XS4.0
Notes:
Introduction
The size of the hardware Page Frame Table is limited; therefore, the hardware cannot satisfy all address translation requests. The VMM software must supplement the hardware table with a software-managed page table.
Illustration
When a translation cannot be found in the hardware table, a page fault is generated. The physical page may be resident in memory even though its translation entry is not in the hardware table. The VMM must be called to update the hardware tables; the procedure is shown below.
What happened to the thread?
When this type of page fault is resolved the dispatcher is not run. The faulted thread just continues the execution at the instruction that caused the fault.
Procedure
These steps assume that the memory page is in memory but not in the hardware Page Frame Table.
Software Page Frame Table
The Software Page Frame Table (SWPFT) is an extension of the hardware Page Frame Table, used and managed by the VMM software. Each SWPFT entry holds per-page information, including page-in flags, page-out flags, free-list flags, and the block number, along with the device information needed to obtain the proper page from disk. The software PFT is large enough to hold translation information for every page resident in physical memory.
Step  Action
1.    The hardware Page Frame Table is searched for a page translation, and none is found.
2.    The hardware generates a page fault, causing the VMM to be called.
3.    The VMM first verifies that the requested page is valid. If the page is not valid, a kernel exception is generated.
4.    If the page is valid, the VMM searches the software PFT for the page. This resembles the hardware search, but uses the software page table instead. Only some parts of the software PFT are pinned.
5.    If the page is found:
      - The hardware Page Frame Table is updated with the real page number for this page, and the process resumes execution.
      - No page-in of the page occurs, since it is already in memory.
Figure 5-8. Page on Paging Space BE0070XS4.0
Notes:
Introduction
If a page is not found in physical memory, the VMM determines whether it is on paging space or elsewhere on disk. If the page is in paging space, then the disk block containing the page is located, and the page is loaded into a free memory page.
Waiting for I/O
Copying a page from the paging space to an available frame is not a synchronous process. Any process or thread waiting for a page fault to be handled is put to sleep until the page is available.
Illustration
Working pages are mapped to disk blocks in the paging space. The procedure for loading a page from paging space is shown below.
Figure 5-9. External Page Table (XPT) BE0070XS4.0
Notes:
External Page Table (XPT)
The XPT maps a page within working storage segments to a disk block on external storage. The XPT is a two-level tree structure. There is one XPT for each working storage segment.
Structure
Each segment that is mapped to paging space has the following XPT structure.
Description
The first level of the tree is the XPT root block. The second level consists of 256 direct blocks. Each word in the root block is a pointer to one of the direct blocks. Each word in a direct block represents a single page in the segment. It contains the page’s state and disk block information. Each XPT direct block covers 1 MB of the 256MB segment.
Procedure

In this procedure the faulting thread must be suspended until I/O for the faulting page has completed.
The net effect is that the process or thread has no knowledge that a page fault occurred except for a delay in its processing.
Step  Action
1.    The thread causing the fault is suspended.
2.    The VMM looks up the object ID for this address in the Segment ID table and gets the External Page Table (XPT) root pointer.
3.    The VMM finds the correct XPT direct block from the XPT root.
4.    The VMM gets the paging space disk block number from the XPT direct block.
5.    The VMM takes the first available frame from the free frame list. The free list contains one entry for each free frame of real memory.
6.    The VMM issues an I/O request to the device with the logical block and physical address of the page to be loaded.
7.    When the I/O completes, the VMM is notified. The VMM updates the hardware PFT.
8.    The thread waiting on the frame is awakened and resumes at the faulting instruction.
Figure 5-10. Loading Pages From the File System BE0070XS4.0
Notes:
Introduction
Persistent pages do not use external page tables. The VMM uses the information contained in a file’s inode structure to locate the pages for the file.
Procedure
Persistent pages are mapped to local files located on file systems. The effective address for the mapped page of the local file is indexed in the Segment Information (SID) table. The SID entry points to the file's inode, allowing the VMM to find and page-in the faulting block.
File System I/O

Introduction
The paging functions of the VMM are also used to perform file reads and writes on behalf of processes.
File system objects
File system reads and writes occur by attaching the appropriate file system object and performing loads/stores between the mapped object and the user buffer. This means that file objects are not directly addressable in the current address space; they are temporarily attached instead.
A local file has a segment allocated and has an entry (SID) in the segment information table. A file gnode contains information about which segment belongs to the particular file.
Persistent pages
AIX uses a large portion of memory as the file system buffer cache. The pages for files compete for storage the same way as other pages. The VMM schedules the modified persistent pages to be written to their original location on disk when:
- The VMM needs the frame for another page.
- The file is closed.
- The sync operation is performed.
Scheduling a page to be written does not mean that the data is written to disk immediately. A sync() operation flushes all scheduled pages to disk. The sync() operation is performed by the syncd daemon every 60 seconds by default or by a user running the sync command.
Figure 5-11. Object Type / Backing Store BE0070XS4.0
Notes:
Introduction
Paging provides automatic backup copies of memory objects on disk. This copy is called the backing store and can be located on a paging disk, a regular disk file, or even on a network accessible disk file.
Questions
Using what you know about memory object types, match the object types on the left with the location of their backing store on the right in the visual above.
Object Type Backing Store
A. Working 1. A regular disk file
B. Persistent 2. An NFS disk file
C. Client 3. Paging disk
Figure 5-12. Paging Space Management Process BE0070XS4.0
Notes:
Introduction
Proper management of paging space is required for the system to perform well. Low paging space can result in failed applications and system crashes.
SIGDANGER
Application programs can ask AIX to notify them when paging space runs low by registering to receive a SIGDANGER signal. This feature allows applications to release memory or take other appropriate actions when paging space runs low. The default action for SIGDANGER is to ignore the signal.
Threshold
AIX has two paging space thresholds; they are:
- Paging space warning level
Paging Space Management Process
Step 1: If the number of free paging space blocks falls below the paging space warning level (npswarn), SIGDANGER is sent to all processes (except kprocs) that have registered to handle the signal.
Step 2: If the number of free paging space blocks falls below the paging space kill level (npskill), a SIGKILL is sent to the newest process that does not have a signal handler for SIGDANGER and whose UID is not less than nokilluid.
Step 3: If paging space is still below the paging space kill threshold, SIGKILL continues to be sent to eligible processes until the free paging space rises above the kill threshold.
- Paging space kill level
Application programs can monitor these thresholds and free paging space using the psdanger() function. Both thresholds are set with the vmtune (AIX 5.1) and vmo (AIX 5.2) commands.
Process
The table above describes the actions AIX takes when paging space becomes low.
Nokilluid
The SIGKILL signal is only sent to processes that do not have a handler for SIGDANGER and where the UID of the process is greater than or equal to the kernel variable nokilluid, which can be set with the vmtune (AIX 5.1) and vmo (AIX 5.2) commands. The value of nokilluid is 0 by default, which means processes owned by root are eligible to be sent a SIGKILL.
Age of the process
The kernel sends the SIGKILL signal to the youngest eligible process. This helps prevent long-running processes from being terminated due to a low paging space condition caused by a recently started process.
Example
The init process (pid 1) registers a signal handler for the SIGDANGER signal. The handler prints a warning message on the system console and attempts to free memory by unloading unused modules.
int
danger(void)
{
	if (own_pid == SPECIALPID) {
		console(NOLOG, M_DANGER, "Paging space low!\n");
		unload(L_PURGE);	/* unload and remove any
					 * unused modules in kernel or
					 * library */
	}
	return (0);
}
Figure 5-13. Paging Space Allocation Policy BE0070XS4.0
Notes:
Introduction
Individual processes may select when paging space will be allocated for them. This is called the paging space allocation policy.
PSALLOC
A process that has the environment variable PSALLOC=early set will cause the VMM to allocate paging space for all memory it requests, whether or not the memory is accessed. This is the algorithm that was used in AIX v3.1.
Finding a process allocation policy
Use kdb to examine the process flags in the proc structure to determine a process’s current paging space allocation policy.
Paging Space Allocation Policy
Policy: Early allocation (PSALLOC=early)
Description: Causes paging space to be allocated as soon as the memory request is made. This helps to ensure that the paging space will be available if it is needed. Note that this policy holds only for this process and is not system-wide.
Policy: PSALLOC= (unset)
Description: The system-wide default applies. For AIX 4.3.2 and later releases the system default is Deferred Paging Space Allocation (DPSA), which means that paging space will not be allocated until a page-out occurs. This can be controlled with vmtune -d {0,1}; vmtune -d 0 turns DPSA off, which means paging space will be allocated when requested memory is accessed. Note that this is a system-wide policy and applies to all processes running on the system.
When early allocation is selected, the SPEARLYALLOC flag is set in proc->p_flag. This flag is defined in proc.h as:
#define SPEARLYALLOC 0x04000000 /* allocates paging space early */
This flag can be seen in kdb by running the p <slot_number> subcommand. If the flag is set, it shows up in the second set of “FLAGS”, indicated by the name “SPEARLYALLOC”.
Figure 5-14. Free Memory BE0070XS4.0
Notes:
Introduction
To maintain system performance, the VMM always wants some physical memory to be available for page-ins. This section describes the free memory list and the algorithms used to keep pages on the list.
Free memory list
The VMM maintains a linked list containing all the currently free real memory pages in the system. When a page fault occurs, the VMM just takes the first page from this list and assigns it to the faulting page.
Page stealer
[Figure: Free Memory — when the number of free pages drops below minfree, the page stealer runs until free pages reach or exceed maxfree.]
The page stealer is invoked when the number of memory pages on the free list drops below the threshold defined by the value of minfree. The page stealer attempts to replenish the free list until it reaches the high threshold defined by maxfree. The values of maxfree and minfree can be viewed or adjusted on AIX 5.1 with the vmtune command (/usr/samples/kernel/vmtune), and on AIX 5.2 with the vmo command.
Page replacement algorithm
The method used by the page stealer to select a page which should be placed on the free list is called the Page Replacement Algorithm. The page replacement algorithm used in AIX is called the clock-hand algorithm.
Evidence
The page stealer is visible as the lrud kernel process.
Figure 5-15. Clock Hand Algorithm BE0070XS4.0
Notes:
Clock hand
The algorithm is called the clock-hand algorithm because it behaves like a clock hand constantly sweeping over the frames in order, advancing to the next frame on each step. If a modified page is stolen, the clock-hand algorithm writes the page to disk (to paging space or a file system) before stealing the page.
[Figure: Clock Hand Algorithm — the hand rotates over the physical pages; a page's reference bit is changed to zero when the clock hand passes, and a page whose reference bit is already 0 is eligible to be stolen.]
Process
This algorithm is commonly used in operating systems when the hardware provides only a reference bit for each page in the physical memory. The hardware automatically sets the reference bit for a page translation whenever the page is referenced.
Bucket size
The clock hand algorithm examines a set of frames at a time. If it were to examine all memory frames in the system in one cycle, then it is likely that all frames would have been referenced by the time the algorithm starts its second pass. The number of frames considered in each cycle is known as the lrud bucket size.
Step / Action
1. Each time a page is referenced, the hardware sets the reference bit in the PTE (Page Table Entry) for that page.
2. The clock hand algorithm scans the PTEs, checking the reference bit.
3. If the reference bit is set, the bit is reset.
4. If the reference bit is already reset, the page is stolen.
5. The process continues until the number of free pages reaches maxfree.
Figure 5-16. Fatal Memory Exceptions BE0070XS4.0
Notes:
Introduction
Not all page and protection faults can be handled by the OS. When a fault occurs that cannot be handled by the OS, the system will panic and immediately halt.
Fatal Memory Exceptions
In all of the following cases, the VMM bypasses all kernel exception handlers and immediately halts the system:
A page fault occurs in the interrupt environment.
A page fault occurs with interrupts partially disabled.
A protection fault occurs while in kernel mode on kernel data.
An I/O error occurs when paging in kernel data.
An instruction storage exception occurs while in kernel mode.
A memory exception occurs while in kernel mode without an exception handler set up.
Figure 5-17. Checkpoint BE0070XS4.0
Notes:
Checkpoint
1. The system hardware maintains a table of recently referenced ______ to ______address translations.
2. The S_____ P____ F____ T____ contains information on all pages resident in _______ _______.
3. Each ______ _______ has an XPT.
4. A _________ signal is sent to every process when the free paging space drops below the warning threshold.
5. The ________environment variable can be used to change the paging space policy of a process.
6. A ______ ______ when interrupts are disabled will cause the system to crash.
Figure 5-18. Exercise BE0070XS4.0
Notes:
Turn to your lab workbook and complete exercise five.
Exercise
Complete exercise five
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
- Observe the effect of the AIX paging space allocation policies on an application program
- Investigate what effect running out of paging space has on applications and the system
- Diagnose a crash dump from a system with paging space depletion
Figure 5-19. Unit Summary BE0070XS4.0
Notes:
Unit Summary
Virtual memory management
Memory objects types
Demand paging system
Backing store
Paging space allocation policies
Free memory list - clock hand
Unit 6. Logical Partitioning
What This Unit Is About
This unit describes the implementation of logical partitioning (otherwise known as LPAR) on pSeries systems.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe the implementation of logical partitioning
• List the components required to support partitioning
• Understand the terminology relating to partitions
How You Will Check Your Progress
Accountability:
• Checkpoint questions
• Unit review
References
AIX Documentation: AIX Installation in a Partitioned Environment
Hardware Management Console for pSeries Installation and Operations Guide
Available from http://www-1.ibm.com/servers/eserver/pseries/library/hardware_docs/hmc.html
Figure 6-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
Describe the implementation of logical partitioning
List the components required to support partitioning
Understand the terminology relating to partitions
Figure 6-2. Partitioning BE0070XS4.0
Notes:
Introduction
Partitioning is the term used to describe the ability to run multiple independent operating system images on a single server machine.
Each partition has its own allocation of processors, memory and I/O devices. A large system that can be partitioned to run multiple images offers more flexibility than using a collection of smaller individual systems.
Partitioning
- Subdivision of a single machine to run multiple operating system instances
- A partition: a collection of resources (processors, memory, I/O devices) able to run an operating system image
- Physical partition: built from physical building blocks
- Logical partition: independent assignment of resources
Reasons for partitioning
Partitioning is intended to address a number of pervasive requirements, including:
- Server consolidation: The ability to consolidate a set of disparate workloads and applications onto a smaller number of hardware platforms, in order to reduce total cost of ownership (administrative and physical planning overhead).
- Production and test environments: The ability to have an environment to test and migrate software releases or applications, which runs on exactly the same platform as the production environment to ensure compatibility, but does not cause any exposure to the production environment.
- Data and workload isolation: The ability to support a set of disparate applications and data on the same server, while maintaining very strong isolation of resource utilization and data access.
- Scalability balancing: The ability to create resource configurations appropriate to the scaling characteristics of a particular application, without being limited by hardware upgrade granularities.
- Flexible configuration: The ability to change configurations easily to adapt to changing workload patterns and capacity requirements.
Partitioning types
In the UNIX market place, there are two main types of partitioning available:
- Physical partitioning
- Logical partitioning
There are a number of distinct differences between the two implementations.
Figure 6-3. Physical Partitioning BE0070XS4.0
Notes:
Introduction
Physical partitioning is the term used to describe a system where the partitions are based around physical building blocks. Each building block contains a number of processors, system memory and I/O device connections. A partition consists of one or more physical building blocks.
The diagram shows a system that contains three building block units. The system is currently configured to run two partitions. One partition consists of all of the resources (CPU, memory, I/O) on two physical building blocks. The other partition consists of all of the resources on the remaining building block.
[Figure: Physical Partitioning — three SMP building blocks, each with dedicated CPU, memory and I/O, joined by an interconnect; one operating system / physical partition spans two building blocks, a second runs on the third.]
Properties
A system that implements physical partitioning has the following characteristics:
- Multiple memory coherence domains, each with an OS image.
A memory coherence domain is a group of processors that are accessing the same physical system memory. Memory coherence traffic (such as cache line invalidation and snooping) is shared between the processors in the domain.
- Separation controlled by interfaces between physical units.
Memory coherence information stays within the physical building blocks allocated to the partition. A processor that is part of one building block cannot access the memory on another building block that is not part of the memory coherence domain (partition).
- Strong software isolation, strong hardware fault isolation.
Applications running inside an operating system instance have no impact on applications running inside another partition. A failure of a component on one system building block will not (or should not) impact a partition running on other building blocks. However, the system as a whole still contains components that could impact multiple partitions in the event of failure, for example a failure of the backplane interconnect.
- Granularity of allocation at the physical building block level.
A partition that does not have enough resources can only be grown by incorporating whole building blocks, and therefore will include all of the resources on the building block, even though they may not be desired. For example, a partition that needs more processors will need to add another building block. By doing so, the partition will also incorporate the memory and I/O devices on that building block.
- Resources allocated only by contents of complete physical group.
The granularity of growing individual resources (CPU, memory, I/O) is determined by the amount of each resource on the physical building block being added to the partition. For example, in a system where each building block contains 4 processors, a partition that required more CPU power would receive an increment of 4 processors, even though perhaps only 1 or 2 would be sufficient.
Example
The Sun Enterprise 10000 and Sun Fire 15K are examples of systems that use physical partitioning. In the case of Sun machines, the term domain is used instead of partition.
Figure 6-4. Logical Partitioning BE0070XS4.0
Notes:
Introduction
Logical partitioning is the term used to describe a system where the partitions are created independently of any physical boundaries.
The diagram shows a system configured with three partitions. Each partition contains an amount of resource (CPU, memory, I/O slots) that is independent of the physical layout of the hardware.
In the case of pSeries systems, an additional system, the Hardware Management Console for pSeries (HMC), is required for configuring and administering a partitioned server. The HMC connects to the system through a dedicated serial link connection to the service processor. Additionally, applications running on the HMC communicate over an Ethernet connection with the operating system instances in the partitions to provide service functionality, and in the case of AIX 5.2, dynamic partitioning capabilities.
[Figure: Logical Partitioning — a managed system running up to 16 LPARs; processors, memory, and I/O adapters and devices are assigned to LPAR 1, LPAR 2 and LPAR 3, each running its own OS above the hypervisor; the Hardware Management Console (HMC) connects to the managed system via RS232/RS422 and to the partitions via Ethernet.]
Properties
A system that implements logical partitioning has the following characteristics:
- One memory coherence domain with multiple OS images.
This basically means that all processors in the system are aware of the physical memory addresses being accessed by the other processors, even if they are in a different partition. Since each partition is allocated its own portion of physical memory, this has no real performance impact.
- Separation controlled mainly by address mapping mechanisms.
Rather than using physical boundaries between components to control the memory access available to each partition, a set of address mapping mechanisms provided by hardware and firmware features is used. The operating system running in each partition is restricted in its ability to access physical memory, and is only permitted to access physical memory that has been explicitly assigned to that partition.
- Strong software isolation, fair-to-strong hardware fault isolation.
Applications running inside an operating system instance have no impact on applications running inside another partition. The failure of the operating system in one partition has no impact on the others.
- Granularity of allocation at the logical resource level (or below).
In the case of pSeries systems, the current unit of allocation for each resource type is:
• One CPU
• Individual I/O slot
• 256 MB of memory
- Resources allocated in almost any combinations or amounts.
The amount of memory allocated to a partition is independent of the number of CPUs or I/O slots. Each resource quantity is based on the system administrator’s understanding of the needs of the partition, rather than the physical layout of the machine.
- Some resources can even be shared.
In the case of pSeries systems, some resources are shared by all partitions. These are divided into two classes:
• Physical resources (such as power supplies) that are visible to each partition.
• Logical resources, where each partition is given its own “instance”, for example, the operator panel and virtual console devices provided by the HMC.
Figure 6-5. Components Required for LPAR BE0070XS4.0
Notes:
Introduction
No single feature determines whether a pSeries system is capable of implementing LPAR or not. Rather, it is a combination of features provided by different components, all of which must be present.
Hardware
The following hardware features are required for LPAR support:
[Figure: Components Required for LPAR — all three are required for LPAR operation: Hardware (interrupt controller hardware; processors with RMO, RML and LPAR ID registers and hypervisor support, meaning no LPAR support on older machines), Firmware (global firmware image, partition-specific firmware instance, hypervisor code), and Operating System (use of hypervisor callout by the VMM, meaning no LPAR support for older operating systems such as AIX 4.3).]
- Interrupt controller hardware
The interrupt controller hardware on the system directs interrupts to a CPU for processing. In the case of a partitioned system, the interrupt controller hardware must be capable of maintaining multiple global interrupt queues, one for each partition. The hardware must be capable of recognizing the source of an interrupt and determining which partition should receive the interrupt notification. For example, an interrupt from a SCSI adapter card must be sent to the partition that controls the card and the devices connected to it. If the interrupt were sent to a CPU that is part of a different partition, that CPU would be unable to access the device to process the interrupt.
- Processor support
A processor requires 3 new registers in order to be used in a partitioned environment. The POWER4 processor is the first CPU used in pSeries systems that has the required capabilities. The registers are:
• Real Mode Offset (RMO) register
The RMO register is used by the processor when referencing an address in real mode. All processors in the same partition have the same value loaded in the RMO register. The use of the register is described in detail in a later part of this unit.
• Real Mode Limit (RML) register
The RML register is also used when the processor is referencing an address in real mode. All processors in the same partition have the same value loaded in the RML register. The use of the register is described in detail in a later part of this unit.
• Logical Partition Identity (LPI) register
The LPI register contains a value that indicates the partition to which the processor is assigned. All processors in the same partition have the same value loaded in the LPI register.
- Hypervisor support
In order to implement the required isolation between partitions, a processor must have hypervisor support. The hypervisor is described in detail later. A processor implements hypervisor support by recognizing the HV bit in the Machine Status Register (MSR). The HV bit of the MSR, along with the Problem State bit, indicates if the processor is in hypervisor mode. Hypervisor mode is implemented in a similar fashion to the system call mechanism used to transition the processor between Problem State (user mode) and Supervisor State (kernel mode). Hypervisor mode can only be invoked from Supervisor State; in other words, only kernel code can make hypervisor calls.
Firmware
The job of firmware in a system is to:
- Identify and configure system components
- Create a device tree
- Initialize/Reset system components
- Locate an operating system boot image
- Load the boot image into memory and transfer control
- When the operating system is running, it has control over the hardware. In order to allow AIX to run on different hardware platform types, it uses a component of firmware called Run-Time Abstraction Services (RTAS) to interact with the hardware. The RTAS functions are provided by pSeries RISC Platform Architecture (RPA) platforms to insulate the operating system from having to know about and manipulate a number of key functions which ordinarily would require platform-dependent code. The OS calls these functions rather than manipulating hardware registers directly, reducing the need to hard-code the OS for each platform. Examples of RTAS functions include accessing the time-of-day clock and updating the boot list in NVRAM.
- When the operating system image is terminated, control is returned to firmware.
Since firmware in a partitioned system now has to deal with multiple operating system images, a special version is required that provides additional functionality.
The functionality of firmware is now divided into two parts, known as Global firmware and Partition firmware. The global firmware is initialized when the system is first powered on. It identifies and configures all of the hardware components in the system, and creates a global device tree that contains information on all devices.
When a partition is started, a partition-specific instance of firmware is created. The partition-specific instance contains a device tree that is a subset of the global device tree, containing only the devices that have been assigned to the partition. It then continues with the task of locating an operating system image and loading it. The RTAS functionality provided by partition firmware performs validation checks and locking to ensure that the partition is permitted to access the particular hardware feature being used, and that its use does not conflict with that of another partition.
An additional component of firmware required for LPAR support is the hypervisor function.
Hypervisor
The hypervisor can be considered a special set of RTAS features that run in hypervisor mode on the processor. The hypervisor is trusted code that allows a partition to manipulate physical memory that is outside the region assigned to the partition. The hypervisor code performs partition and argument validation before allowing the requested action to take place. The hypervisor provides the following functions:
- Page Table access
Page tables are described later in this unit when we examine the changes in translating a virtual address to a physical address in the LPAR environment.
- Virtual console serial device
When multiple partitions are running on a system, each partition requires an I/O device to act as the console. Most pSeries systems have two or three native serial ports, so it would be impractical to insist that each partition have its own native serial port, or to enforce the addition of extra serial adapters. The hypervisor provides a virtual serial console interface to each partition. The I/O from the virtual device is communicated to the HMC via the serial link from the service processor in the partitioned system.
- Debugger support
The hypervisor also provides support that permits the debugger running on the system to access specific memory and register locations.
Operating system
The operating system that will run in a partition needs to be modified to use hypervisor calls to manipulate the Page Frame Table (PFT), rather than maintaining the table directly in memory. A few other low-level kernel components are aware of the fact that the OS is running inside a partition. The vast bulk of the kernel, however, is unaware, since there is no need for any changes. This allows the operating system to present a consistent interface to the application layer, regardless of whether it is running in a partition or running as the only operating system on a regular standalone machine.
The net effect of the required changes is that an operating system not designed for use in a partitioned environment will fail to boot. This means that older operating systems (such as AIX 4.3) will not work in a partition.
Figure 6-6. Operating System Interfaces BE0070XS4.0
Notes:
Introduction
The diagram summarizes the interfaces used by the operating system to interact with the hardware platform. It details the different components of the OS that interact with each function provided by the platform firmware. The Platform Adaptation Layer (PAL) is an operating system component similar in function to the RTAS layer provided by firmware. In other words, its job is to mask the differences between hardware platforms from other parts of the kernel.
[Figure: Operating System Interfaces — Applications run on AIX; the kernel's Platform Adaptation Layer (PAL) sits between the operating system and firmware. Boot/Config uses a device tree subset from "Partitioned" Open Firmware (derived from the "Global" Open Firmware); the VMM uses the Hypervisor for virtual page mapping; the kernel debugger and dump use it for register and memory access; the virtual TTY device driver uses it for TTY data streams; Run-Time Abstraction Services (RTAS), with partition validation, handles hardware service calls.]
Figure 6-7. Virtual Memory Manager BE0070XS4.0
Notes:
Introduction
The job of the Virtual Memory Manager (VMM) component of the operating system is to manage the effective address space of each process on the system, and ensure that pages are mapped to physical memory when required so that they can be accessed by the processors.
The translation of a virtual address to a physical address is an area of the operating system that has undergone some changes to allow the implementation of a partitioned environment, since there are now multiple operating system images co-existing in a single machine.
Virtual Memory Manager
[The figure shows two processes whose effective addresses map into a virtual address space, with pages backed by physical memory, paging space, or filesystem pages.]
Figure 6-8. Real Address Range BE0070XS4.0
Notes:
Introduction
Before examining the changes in address translation for the LPAR environment, we first take a closer look at the memory layout on a non-LPAR system.
Device I/O
The hardware provides memory-mapped access to I/O devices. A system has at least one Host Bridge (HB), which is mapped to a region in the address map. When the processor writes to specific addresses, the data is passed to the Host Bridge, rather than being stored in the DRAMs or other components used to implement physical memory. The Host Bridge device allocates portions of its address space to each I/O adapter plugged into a slot it controls. Data written to the Host Bridge is passed to a specific I/O adapter card, based on the address being written. Each HB is allocated a unique portion of the system address space.
Real Address Range
[The figure shows the processor's view of the address range on a non-LPAR system with 2 PCI buses: between 0 and 4GB lie two system memory regions (Sys Mem 0 and Sys Mem 1) and two Host Bridge regions (HB0 and HB1) through which I/O adapters are accessed; the remaining ranges are invalid for load/store.]
Physical memory
Another feature of the diagram worth noting is that the address range of physical memory in the system is not necessarily contiguous. The physical memory in the system always starts at address zero; however, depending on the total amount of memory and the number of Host Bridge devices in the system, the physical address range may be divided into multiple components. In other words, there appear to be 'holes' in the physical address range used by the system. This is perfectly normal, and the VMM of AIX (and of most other modern operating systems) is designed to cope with it.
As an example, a system with 8GB total of physical memory may address 3GB of that memory using physical addresses in the range 0 to 3GB, and the remaining part of memory using addresses 4.5GB to 9.5GB.
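The example layout can be modeled with a small region table; the structure and function names here are illustrative only, not AIX data structures:

```c
#include <stdint.h>
#include <stddef.h>

struct mem_region {
    uint64_t base;   /* first physical address in the region */
    uint64_t size;   /* length of the region in bytes */
};

/* The 8GB example: 0 - 3GB, then 4.5GB - 9.5GB, with a hole between. */
static const struct mem_region regions[] = {
    { 0ULL,          3ULL << 30 },
    { 4608ULL << 20, 5ULL << 30 },
};

/* Returns 1 if addr falls inside one of the populated regions. */
static int addr_is_memory(uint64_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++)
        if (addr >= regions[i].base &&
            addr - regions[i].base < regions[i].size)
            return 1;
    return 0;
}
```

An address at 3.5GB, for instance, falls in the hole between the two regions and is not backed by memory.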
Figure 6-9. Real Mode Memory BE0070XS4.0
Notes:
Introduction
In addition to considering the ranges used when addressing memory, another important distinction to make is the type of access being performed.
The function of the VMM is to translate a virtual address into a real (or physical) address. Address translation can be enabled or disabled, and the status of this is indicated by bits in the MSR. Address translation for instructions and data can be enabled or disabled independently.
Real address
A real address is an address that is generated by the processor when address translation is disabled. Typically, real addresses are used by specialized parts of kernel code, such as the boot process (before the VMM is initialized) or interrupt/exception handler code. Real mode memory starts at address zero. The size of real mode memory is dependent on the requirements of the operating system. Another important thing to note is that on a non-LPAR system, a real address is equivalent to a physical address.
Real Mode Memory
- Real address = address generated when translation is disabled
  - Used by system startup code that runs before the VMM is configured
  - Used by the interrupt vector mechanism
  - Used by the VMM itself to maintain tables
- Real mode memory normally starts at address zero
- Size of the real mode memory region depends on the operating system
- On non-LPAR systems, real address = physical address
- On LPAR systems, real address != physical address:
  - Only one physical address zero in the system
  - Physical address zero is used by the hypervisor
  - Each partition requires its own address zero
  - Requires mapping from the real address generated by the partition to the physical address used by the memory hardware
LPAR changes
The assertion that a real address is the same as a physical address no longer holds true in the partitioned environment however, since a system only has a single overall physical address range (although it may be split into multiple sections).
Each partition requires its own address zero, but there is only one true physical address zero inside a system. In actual fact, physical address zero is used by the hypervisor, but we can generalize the statement as:
For any given address n, each partition expects to be able to access address n. Obviously they can’t all access the same physical address n, so something needs to be done to accommodate this. We explain things later, but for now, just know that:
- For real mode addresses, this is where the RMO register of the processor is used.
- For virtual addresses, partition page tables are used to translate the partition specific address n into a system-wide physical address.
Figure 6-10. Operating System Real Mode Issues BE0070XS4.0
Notes:
Introduction
The amount of real mode memory required by a partition depends upon two factors.
1) The version of the operating system.
2) The amount of memory allocated to the partition.
Alignment
Physical memory allocated in a partitioned environment for use as Real Mode memory by a partition must be contiguous, and aligned on an address boundary that is divisible by the size of the real mode region. For example, a 16GB real mode region must be aligned on an address boundary divisible by 16GB (i.e. 16GB, 32GB, 48GB, 64GB etc.). As we will see later, address 0 cannot be used, since it is used by the hypervisor.
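Because all real mode region sizes are powers of two, "aligned on a boundary divisible by the region size" reduces to a simple mask test. A minimal sketch in C (the function name is ours, not firmware code):

```c
#include <stdint.h>

/* Returns 1 if base is divisible by the region size.
 * size must be a power of two (256MB, 1GB or 16GB). */
static int rm_region_aligned(uint64_t base, uint64_t size)
{
    return (base & (size - 1)) == 0;
}
```

For a 16GB region, bases of 16GB, 32GB, 48GB and so on pass the test; 8GB does not.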
Operating System Real Mode Issues
- Real mode memory is aligned on a same-size address boundary (e.g. a 1GB real mode region is aligned on a 1GB address boundary)
- AIX 5.2 & Linux require 256MB of real mode memory:
  - VMM requires a fixed-size real mode region
  - Most VMM tables are accessed only with address translation enabled
- AIX 5.1 VMM accesses many tables with translation disabled:
  - Some of these tables scale with the amount of memory in the partition
  - Therefore the AIX 5.1 real mode requirement scales with the memory in the partition

AIX 5.1 Real Mode Requirements:
Real mode region size | Supported partition sizes
256MB                 | 256MB - 4GB
1GB                   | 1GB - 16GB
16GB                  | 16GB - 256GB
AIX 5.2 and Linux
Both AIX 5.2 and Linux require only that 256MB of memory be accessible in real mode, since the VMM only uses real mode to maintain tables that do not scale with memory size.
AIX 5.1
The VMM in AIX 5.1 maintains tables in real mode memory that scale with the total amount of memory allocated to the partition. The result of this is that partitions running AIX 5.1 may need 256MB, 1GB or 16GB of real mode memory, rather than the 256MB required by AIX 5.2 and Linux.
Sometimes the alignment requirements of the 1GB and 16GB real mode regions can cause problems on systems that are using a large percentage of their physical memory. In these situations, sometimes the order in which partitions are started can have an impact on whether all partitions can be started.
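The AIX 5.1 requirements (a 256MB region for partitions up to 4GB, 1GB for partitions up to 16GB, 16GB for partitions up to 256GB) can be captured as a small selection function. This is an illustrative sketch, not firmware logic; the function name and MB units are our own:

```c
#include <stdint.h>

/* partition_mb: total memory assigned to the partition, in MB.
 * Returns the smallest suitable real mode region size in MB,
 * or 0 if the partition size is outside the supported range. */
static uint32_t aix51_real_mode_mb(uint64_t partition_mb)
{
    if (partition_mb < 256)
        return 0;                 /* below the minimum partition size */
    if (partition_mb <= 4096)
        return 256;               /* 256MB region: 256MB - 4GB   */
    if (partition_mb <= 16384)
        return 1024;              /* 1GB region:   1GB - 16GB    */
    if (partition_mb <= 262144)
        return 16384;             /* 16GB region:  16GB - 256GB  */
    return 0;
}
```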
Figure 6-11. Address Translation BE0070XS4.0
Notes:
Introduction
The method used by a partition to interpret an address depends on whether virtual address translation is currently enabled or disabled.
Translation enabled
When address translation is enabled, the VMM is in charge. In a normal non-LPAR system, the VMM is effectively translating the virtual address to a real address, but because a real address is the same as a physical address, there is no problem. In a partitioned environment, the VMM uses a slightly different method to convert a virtual address into a true system-wide physical address. The VMM converts the virtual address into a real address, however the real address is a “logical” address within all of the memory assigned to the partition. The VMM then performs an additional step, and converts the partition-specific real address into a system-wide physical address. It accomplishes this using partition page tables.
Address Translation
- If address translation is enabled, the VMM converts the virtual address to a real address:
  - Treats the address as segment ID, page number and page offset
  - Determines the physical page starting address from the segment ID and page number
    - Non-LPAR systems use the software PFT (page frame table)
    - LPAR systems use partition page tables (stored outside the partition)
  - Adds the page offset to the physical page address
  - The value of the RMO register is not used
- If address translation is disabled, the value of the RMO register is added to the address:
  - All processors in the same partition have the same value in the RMO register
  - The RMO value is set by firmware when the partition is activated
Translation disabled
When address translation is disabled, the RMO (Real Memory Offset) register of the processor is used in the address calculation. The processor knows when it is dealing with a real address, as indicated by the status bits in the MSR. When dealing with a real address, the processor automatically (and without the knowledge of the operating system) adds the value loaded in the RMO register to the address, converting the partition-specific real address into a true system-wide physical address before submitting it to the memory controller hardware as part of the request to read or write the memory location. The RML (Real Mode Limit) register is used to limit the amount of memory that a partition can access in real mode.
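The real mode calculation described above can be sketched in C. RMO and RML are processor registers; modeling them as plain function parameters is an assumption for illustration only:

```c
#include <stdint.h>

/* Translate a partition real address to a system-wide physical address.
 * Returns 0 and stores the physical address, or -1 if the access falls
 * beyond the partition's real mode limit. */
static int real_to_physical(uint64_t real, uint64_t rmo, uint64_t rml,
                            uint64_t *phys)
{
    if (real >= rml)
        return -1;            /* RML bounds real mode accesses */
    *phys = real + rmo;       /* offset added transparently by hardware */
    return 0;
}
```

Every processor in a partition carries the same RMO value, so all of them translate a given real address to the same physical location.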
Figure 6-12. Allocating Physical Memory BE0070XS4.0
Notes:
Introduction
The physical memory of a partitioned system must be divided up between the partitions that are to be started.
Terminology
The physical memory of the system is divided up into 256MB chunks called Physical Memory Blocks (PMBs). Each PMB has a unique ID within the system, so that the hypervisor can track which PMBs are allocated for specific purposes. In order to be activated, a partition will be allocated sufficient PMBs to satisfy the minimum memory requirement as indicated by the partition profile being activated. The PMBs assigned to a partition need not be contiguous.
The partition views the memory assigned to it as a number of logical memory blocks (LMBs). Each LMB has an ID that is unique within the partition.
Allocating Physical Memory
- Memory is divided into 256MB Physical Memory Blocks (PMBs)
  - Each PMB has a unique ID
- Multiple PMBs are assigned to provide the logical address space for a partition
  - e.g. a 2GB partition requires 8 PMBs
  - PMBs assigned to a partition need not be contiguous
- Logical Memory Block (LMB) is the name given to a block of memory when viewed from the partition perspective
  - Each LMB has a unique ID within a partition, and is associated with a PMB
- Some PMBs are used for special purposes, and cannot be allocated to partitions:
  - Partition page tables
  - TCE space
  - Hypervisor
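The PMB arithmetic above (a 2GB partition requires 8 PMBs) amounts to a ceiling division by 256MB. A minimal sketch, with names of our own choosing:

```c
#include <stdint.h>

#define PMB_MB 256ULL   /* Physical Memory Block size in MB */

/* Number of PMBs needed to back a partition, rounding up for
 * sizes that are not exact PMB multiples. */
static uint64_t pmbs_needed(uint64_t partition_mb)
{
    return (partition_mb + PMB_MB - 1) / PMB_MB;
}
```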
Some PMBs in the system are used for special purposes, and cannot be allocated for use by partitions. The number of PMBs allocated for these special purposes depends upon many factors.
Figure 6-13. Partition Page Tables BE0070XS4.0
Notes:
Introduction
As mentioned previously, each partition is allocated space for a partition page table. The table is used by the VMM in the partition to translate a partition specific virtual address into a system-wide physical address.
Page table requirements
The page table space for a partition is under the control of the hypervisor. In other words, the operating system instance cannot read or write the page table entries directly, but instead must make a hypervisor call to perform the requested action.
Four 16-byte entries are required in the page table for each 4K page in the partition. This equates to a size equal to 1/64th of the memory allocated to the partition. For example, a partition with 1GB of memory requires a partition page table of 16MB in size. Page tables are allocated in sizes that are powers of two; a page table requirement that is not a power of two is rounded up to the next power of two. So a partition that has 2.5GB of memory has a page table requirement of 40MB, but this would be rounded up to 64MB, the next power of two.
Partition Page Tables
- Used when translating a virtual address to a physical address
- Stored outside the memory area allocated to the partition
- Under control of the hypervisor
  - VMM makes a hypervisor call to read and update the partition page table
- Scale with the size of partition memory
  - Four 16-byte entries per 4K page of memory assigned to the partition (rounded up to a power of 2)
  - Equivalent to 1/64th of partition memory
- Placed in contiguous physical memory
  - Aligned on an address boundary divisible by the table size (e.g. a 64MB page table is aligned on a 64MB address boundary)
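The sizing rule (1/64th of partition maximum memory, rounded up to a power of two) can be sketched as follows; the helper names are hypothetical:

```c
#include <stdint.h>

/* Round v up to the next power of two (v >= 1). */
static uint64_t roundup_pow2(uint64_t v)
{
    uint64_t p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/* max_mem_mb: partition maximum memory in MB.
 * The page table is 1/64th of that, rounded up to a power of two. */
static uint64_t page_table_mb(uint64_t max_mem_mb)
{
    return roundup_pow2(max_mem_mb / 64);
}
```

This reproduces the examples in the text: 1GB of memory needs a 16MB table, while 2.5GB needs 40MB, rounded up to 64MB.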
Page tables must be allocated on an address boundary that is divisible by the size of the page table. In addition, page tables must be allocated in contiguous physical memory. The hypervisor will attempt to place multiple page tables of 128MB or smaller inside a single PMB that has been allocated for page table use. If existing PMBs allocated for page table use do not contain sufficient space (or sufficient contiguous space), then the hypervisor will allocate more PMBs for page table use.
The size of the page table allocated to a partition is large enough to handle the maximum memory amount the partition may grow to. The maximum memory amount is an attribute of the partition that is used in limiting the extent of dynamic LPAR operations.
Performance penalty
There is a small performance penalty associated with the action of the VMM in a partition accessing the partition page tables. This performance penalty is only experienced when a virtual page is mapped into physical memory. If the virtual page is already in physical memory, then the VMM can perform the virtual to physical address translation by accessing the Translation Lookaside Buffer (TLB), a processor specific cache of the most recently accessed virtual to physical translations.
The performance penalty is only really noticeable when a partition is performing heavy paging activity, since this means the page tables are being accessed frequently.
Figure 6-14. Translation Control Entries BE0070XS4.0
Notes:
Introduction
Host bridge devices use Translation Control Entries (TCEs) to allow a PCI adapter that can only generate a 32-bit address (i.e. an address in the range 0 to 4GB) to access system memory above the 4GB address range. The translation entries are used to convert the 32-bit I/O address generated by the adapter card on the I/O bus into a 64-bit address that the host bridge will submit to the system memory controller.
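The translation step can be sketched as a table lookup. This is a conceptual model only, assuming 4KB I/O pages and a flat array standing in for the host bridge's TCE table; none of the names come from actual firmware:

```c
#include <stdint.h>
#include <stddef.h>

#define TCE_PAGE_SHIFT 12        /* assume 4KB I/O pages */
#define TCE_PAGE_MASK  0xFFFULL

/* tce: table mapping 32-bit I/O page numbers to 64-bit physical page
 * addresses. Returns the translated 64-bit address, or all-ones if
 * the I/O address has no mapping. */
static uint64_t tce_translate(const uint64_t *tce, size_t entries,
                              uint32_t io_addr)
{
    size_t idx = io_addr >> TCE_PAGE_SHIFT;   /* I/O page number */
    if (idx >= entries)
        return (uint64_t)-1;
    return tce[idx] | (io_addr & TCE_PAGE_MASK);
}
```

The upper bits of the adapter's 32-bit address select a table entry; the page offset is carried through unchanged.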
TCE tables
TCE tables contain information on the current TCE mappings for each host bridge device. In a standalone system, the operating system controls all host bridge devices in the system, therefore all PCI slots are controlled by a single operating system instance. In this case, the TCE tables exist within the memory image of the operating system.
Translation Control Entries
- Used to allow 32-bit PCI adapters to access a 64-bit memory space
- Similar in concept to partition page tables, but used for device I/O
- Provided as a function of the PCI Host Bridge device
- TCE space is controlled by the hypervisor:
  - Outside the control of a single partition
  - A single PCI Host Bridge may have slots in different partitions
  - Hypervisor control is required for dynamic LPAR operations
- TCE space is allocated at the top of physical memory
- The amount of TCE space depends on the number of PCI slots/drawers:
  - 512MB for 5-8 I/O drawers on a p690
  - 256MB for all others
LPAR changes
In the partitioned environment, there is no requirement for all of the slots of a single host bridge device to be under the control of a single partition. As an example, a single host bridge device may support 4 PCI slots, and each slot may be assigned to a different partition. Since the TCEs need to be manipulated by the operating system as it establishes a mapping to the adapter card, we now have a situation where multiple partitions need to access adjacent memory locations.
Rather than having the TCE tables under the control of a special partition, they are placed under the control of the hypervisor. The memory locations are not under the control of any specific partition. The hypervisor allocates each partition valid "windows" into the TCE address space, which relate to the adapter slots that have been assigned to the partition.
Access to the TCE tables is performed by the partition in a manner similar to accessing partition page tables. The partition makes a hypervisor call (similar to a system call), and after validating the permissions and arguments, the hypervisor performs the requested action on the TCE table entry.
Another benefit of having the TCE space under the control of the hypervisor is that it allows the “windows” that are valid for each partition to be changed on the fly, which is a requirement for the ability to dynamically reassign an I/O slot from one partition to another with a DLPAR operation.
The amount of memory allocated for TCE space depends on the number of host bridge devices (and PCI slots) in the system. Currently a p690 system that has between 5 and 8 I/O drawers will use 512MB of memory (2 PMBs) for TCE space. p690 systems with less than 5 I/O drawers, and all other LPAR capable pSeries systems use 256MB (1 PMB) for TCE space.
TCE space is always located at the top of physical memory.
Figure 6-15. Hypervisor BE0070XS4.0
Notes:
Introduction
The hypervisor is the name given to code that runs under the hypervisor mode of the processor. The hypervisor code is supplied as part of the firmware image loaded onto the system. It is loaded in the first PMB in the system, starting at physical address zero.
Hypervisor mode
Hypervisor mode is entered using a mechanism similar to that used when a user application makes a system call.
When a user application makes a system call, the processor state transitions between Problem State (user mode) and Supervisor State (kernel mode), and the kernel segment becomes visible.
The transition to hypervisor mode can only be made from Supervisor State (i.e. kernel mode). Making a hypervisor call from user mode results in a permission denied error.
Hypervisor
- Similar to the system call mechanism
  - Hypervisor bit in the MSR indicates the processor mode
  - Can only be invoked from Supervisor (kernel) mode
- Used by the operating system to access memory outside the partition
  - e.g. partition page tables
- Hypervisor code validates arguments and ensures each partition can only access its allocated page table & TCE space
  - Checks tables of PMBs allocated to each partition
  - Prevents a partition from accessing physical memory not assigned to the partition
The HV bit in the MSR indicates if the processor is in hypervisor mode.
Purpose
The hypervisor is trusted code that allows a partition to manipulate memory that is outside the bounds of that allocated to the partition.
The operating system must be modified for use in the LPAR environment to make use of hypervisor calls to maintain page frame tables and TCE tables that would normally be managed by the OS directly if it were running on a non-LPAR system. This means that the parts of the VMM used for page table management and device I/O mapping are aware of the fact that the operating system is running within a partition.
The hypervisor routines first validate that the calling partition is permitted to access the requested memory before performing the requested action.
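The validation step might look something like the following sketch. The structure layout and function name are hypothetical; the real hypervisor's bookkeeping is not documented here:

```c
#include <stdint.h>

/* Per-partition limits tracked by the hypervisor (hypothetical layout). */
struct partition_limits {
    uint64_t pt_base;   /* start of the partition's page table space */
    uint64_t pt_size;   /* size of that space in bytes */
};

/* Returns 1 only if [addr, addr+len) lies entirely inside the calling
 * partition's page table region; written to avoid integer overflow. */
static int hcall_pt_access_ok(const struct partition_limits *p,
                              uint64_t addr, uint64_t len)
{
    return addr >= p->pt_base &&
           len <= p->pt_size &&
           addr - p->pt_base <= p->pt_size - len;
}
```

A request that passes this check is performed by the hypervisor on the partition's behalf; anything outside the window is rejected.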
Figure 6-16. Dividing Physical Memory BE0070XS4.0
Notes:
Introduction
The diagram above shows a sample system that has two active partitions. The first PMB is allocated to the hypervisor, and the PMB at the top of physical memory is allocated for TCE space.
LPAR 1
LPAR 1 has 4.5GB of memory allocated to the partition. The partition needs to run AIX 5.1, so this means it has a real mode memory requirement of 1GB, which must be contiguous. This means the first set of PMBs allocated to the partition must be contiguous for at least 1GB, and aligned on a 1GB address boundary. The remaining 3.5GB allocated to the partition consists of 14 PMBs, which may or may not be contiguous.
Dividing Physical Memory
[The figure shows physical memory from address 0 upward: the hypervisor occupies the first PMB at physical address 0, one PMB holds the partition page tables, the PMBs for LPAR 1 (4.5GB, RMO = M) and LPAR 2 (2GB, RMO = N) are scattered through the range, and TCE space occupies the top of physical memory.]
A partition with 4.5GB of memory has a page table requirement of 72MB, which will be rounded up to 128MB. If this is the first partition to be activated, the page table will be placed in a PMB that is marked by the hypervisor for use as page table storage. The partition will be permitted to access the portions of TCE space that are used to map the I/O slots that are assigned to the partition.
LPAR 2
LPAR 2 has 2GB of memory assigned. It is running AIX 5.2, so is quite happy with just 256MB of real mode memory. Since this is the same size as the PMB, it effectively means that a partition running AIX 5.2 can consist of the required number of PMBs to satisfy the requested memory amount. The allocated PMBs need not be contiguous, however the system firmware will allocate them in a contiguous fashion where possible.
A partition with 2GB of memory (and an attribute of a maximum of 2GB) requires a page table of 32MB. This is already a power of 2, and so at partition activation time, the firmware allocates a page table of 32MB. It only allocates a new PMB for page tables if free space inside a PMB already being used for page tables cannot be found. In this example, there was one PMB allocated for page tables, and only 128MB was being used. This means the 32MB page table for LPAR 2 shares the same PMB as the 128MB page table for LPAR 1.
LPAR 2 is permitted to access the portions of TCE space required for mapping the I/O slots assigned to the partition.
Typical example
The example in the diagram depicts a situation where multiple partitions may have been activated and then terminated, resulting in the seemingly sparse allocation of PMBs. The algorithms used by the firmware to allocate PMBs try to make best use of those available, and are careful to avoid encroaching on a 16GB-aligned, 16GB contiguous group of PMBs if it can be avoided, since such groups are required for AIX 5.1 partitions that are 16GB or larger in size.
Figure 6-17. Checkpoint BE0070XS4.0
Notes:
Introduction
Answer all of the questions above. We will review them as a group when everyone has finished.
Checkpoint
1) What processor features are required in a partitioned system?
2) Memory is allocated to partitions in units of __________ MB.
3) All partitions have the same real mode memory requirements. True or False?
4) In a partitioned environment, a real address is the same as a physical address. True or False?
5) Any piece of code can make hypervisor calls. True or False?
6) Which physical addresses in the system can a partition access?
Figure 6-18. Unit Summary BE0070XS4.0
Notes:
Unit Summary
- Hardware and software (operating system) changes are required for LPAR:
  - Can't run LPAR on just any system
  - Can't use just any OS inside a partition
- Resources (CPU, memory, I/O slots) are allocated to partitions independently of one another:
  - A partition can receive as much (or as little) of each resource as it needs
- Multiple partitions on a single machine imply changes to the addressing mechanism used by the operating system:
  - Can't have all partitions using the same physical address range
- The hypervisor is special code called by the operating system that allows it to modify memory outside the partition
Unit 7. LFS, VFS and LVM

What This Unit Is About
This unit describes the organization and operation of the logical and virtual file system, and the LVM structures used by the kernel.
What You Should Be Able to Do
After completing this unit, you should be able to:
• List the design objectives of the logical and virtual file systems.
• Identify the data structures that make up the logical and virtual file systems.
• Use kdb to identify the data structures representing an open file.
• Use kdb to identify the data structures representing a mounted file system.
• Given a file descriptor of a running process, locate the file and the file system the descriptor represents.
• Identify the basic kernel structures for tracking LVM volume groups, logical and physical volumes. Identify the kdb subcommands for displaying these structures.
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
• Unit review
References
AIX Documentation: Kernel Extensions and Device Support Programming Concepts
Figure 7-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
List the design objectives of the logical and virtual file systems.
Identify the data structures that make up the logical and virtual file systems.
Use kdb to identify the data structures representing an open file.
Use kdb to identify the data structures representing a mounted file system.
Given a file descriptor of a running process, locate the file and the file system the descriptor represents.
Identify the basic kernel structures for tracking LVM volume groups, logical and physical volumes. Identify the kdb subcommands for displaying these structures.
Figure 7-2. What is the Purpose of LFS/VFS? BE0070XS4.0
Notes:
Introduction
This unit covers the interface, services and data structures that are provided by the Logical File System (LFS) and the Virtual File System (VFS).
Supported file systems
Using the structure of the logical file system and the virtual file system, AIX 5L can support a number of different file system types that are transparent to application programs. These file systems reside below the LFS/VFS layer and operate relatively independently of each other. The following physical file system implementations are currently supported:
- Enhanced Journaled File System (JFS2)
- Journaled File System (JFS)
- Network File System (NFS)
What is the Purpose of LFS/VFS?
- Provide support for many different file system types simultaneously
- Allow different types of file systems to be mounted together, forming a single homogeneous view
- Provide a consistent user interface to all file-type objects (regular files, special files, sockets, ...)
- Support the sharing of files over the network
- Provide an extensible framework allowing third-party file system types to be added into AIX
- A CD-ROM file system, which supports ISO-9660, High Sierra and Rock Ridge formats
Extensible
The LFS/VFS interface also provides a relatively easy means by which third party file system types can be added without any changes to the LFS.
Figure 7-3. Kernel I/O Layers BE0070XS4.0
Notes:
Introduction
Several layers of the AIX kernel are involved in the support of file systems I/O as described in this section.
Hierarchy
Access to files and directories by a process is controlled by the various layers in the AIX 5L kernel, as illustrated above.
Kernel I/O Layers (figure, top to bottom): system call interface (read(), write()), Logical File System, Virtual File System, file system implementation (JFS, JFS2), VMM fault handler (VMM), device driver (LVM), device.
Layers
The layers involved in file I/O are described in this table:
Level: Purpose
- System call interface: A user application accesses files using the standard interface of the read() and write() system calls.
- Logical file system: The system call interface is supported in the LFS with a standard set of operations.
- Virtual file system: The VFS defines a generic set of operations that can be performed on a file system.
- File system: Different physical file systems can handle the request (JFS, JFS2, NFS). The file system type is invisible to the user.
- VMM fault handler: Files are mapped to virtual memory. I/O to a file causes a page fault, which is resolved by the VMM fault handler.
- Device drivers: Device driver code interfaces with the device. It is invoked by the page fault handler. The LVM is the device driver for JFS2 and JFS.
Figure 7-4. Major Data Structures BE0070XS4.0
Notes:
Introduction
This illustration shows the major data structures that will be discussed in this unit. The illustration is repeated throughout the unit highlighting the areas being discussed.
Logical file system
The LFS is the level of the file system at which users can request file operations by using system calls, such as open(), close(), read() and write(). The system calls implement services that are exported to users to provide a consistent user-mode programming interface that is independent of the underlying file system type.
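As a user-space illustration of this consistent interface, the following short C function (written for these notes, not part of AIX; the file path used below is arbitrary) round-trips data through open(), write(), lseek() and read(). The same code works regardless of which physical file system holds the file.

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

/* Write a short message to 'path' through the generic system call
   interface and read it back; return 1 if the data round-trips. */
int lfs_roundtrip_ok(const char *path)
{
    const char msg[] = "via the LFS";
    char buf[sizeof(msg)];
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return 0;
    if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
        close(fd);
        return 0;
    }
    lseek(fd, 0, SEEK_SET);              /* rewind before reading back */
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    return n == (ssize_t)sizeof(msg) && memcmp(buf, msg, sizeof(msg)) == 0;
}
```

The application never names the file system type; the LFS maps the call to the appropriate implementation below it.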
Major Data Structures (figure): the u-block with its user file descriptor table, plus the system file table, belong to the Logical File System; the vnode, vmount and vfs structures form the Virtual File System (vnode-VFS interface); the gfs, vnodeops, vfsops, gnode and inode structures belong to the file system implementation.
Virtual file system
The Virtual File System (VFS) defines a standard set of operations on an entire file system. Operations performed by a process on a file or file system are mapped through the VFS to the file system below. In this way, the process need not know the specifics of different file systems (such as JFS, J2, NFS or CD-ROM).
File system
Each file system type extension provides functions to perform operations on the file system and its files. Pointers to these functions are stored in the vfsops (file system operations) and vnodeops (file operations) structures.
Figure 7-5. Logical File System Structures BE0070XS4.0
Notes:
Introduction
The user file descriptor table and the system file table are the key data bases used by the LFS. These memory structures and their relationship to vnodes are discussed in this section.
Structures in the LFS
The user file descriptor table (one per process) contains entries for each of the process’ open files. The system open file table has entries for open files on the system. Each entry in the system file table points to a vnode in the virtual file system.
Logical File System Structures (figure): the user file descriptor table in the u-block (process private), the system file table (global), and vnodes (one per file). A call such as n = open("file") fills descriptor slot n; the slot's fp field points to a system file table entry, and that entry's f_data field points to a vnode.
User file descriptor table
The user file descriptor table is private to a process and located in the process's u-area. When a process opens a file, an entry is created in the user's file descriptor table. The index of the entry in the table is returned by open() as the file descriptor.
System open file table
The system file table is a global resource and is shared by all processes on the system. One unique entry is allocated for each unique open of a file, device, or socket in the system. If multiple processes have the same file open (or one process has opened the file several times) a separate entry exists in the table for each unique open.
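The relationship can be sketched with deliberately simplified stand-ins for the kernel structures (the ft_ names below are invented for illustration, not AIX source): each open allocates a fresh file-table entry with its own offset, while all opens of the same object share one vnode.

```c
/* Hypothetical, heavily simplified models of the kernel structures. */
struct ft_vnode { int v_count; };

struct ft_file {
    struct ft_vnode *f_data;   /* object this open refers to */
    long f_offset;             /* per-open read/write offset */
    int  f_count;              /* references to this table entry */
};

/* Simulate open(): allocate a fresh entry referencing 'vp'. */
void ft_open(struct ft_file *fp, struct ft_vnode *vp)
{
    fp->f_data = vp;           /* all opens of one file share a vnode */
    fp->f_offset = 0;          /* ...but each open starts its own offset */
    fp->f_count = 1;
    vp->v_count++;             /* the vnode counts its users */
}
```

Two independent opens of the same file therefore advance their offsets separately, which is exactly why the system file table needs one entry per unique open.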
vnode
The vnode provides the connection between the LFS and the VFS. It is the primary structure the kernel uses to reference files. Each time an object is located, a vnode for that object is created. The vnode will be covered in more detail later.
Figure 7-6. User File Descriptor BE0070XS4.0
Notes:
Introduction
The user file descriptor table is private to a process and located in the process's u-area. When a process opens a file, an entry is created in the user's file descriptor table. The index of the entry in the table is returned by open() as the file descriptor.
Descriptor table definition
The user file descriptor table consists of an array of user file descriptors as defined in /usr/include/sys/user.h in the structure ufd.
Table management
One or more slots of the file descriptor table are used for each open file. The file descriptor table can extend beyond the first page of the u-block, and is pageable. There is a fixed upper limit of 65534 open file descriptors per process (defined as OPEN_MAX in /usr/include/sys/limits.h); this limit may not be changed.

struct ufd {
        struct file     *fp;
        unsigned short  flags;
        unsigned short  count;
#ifdef __64BIT_KERNEL
        unsigned int    reserved;
#endif /* __64BIT_KERNEL */
};
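The slot-reuse behavior can be sketched in user space (an illustrative model with invented names, not AIX source): a descriptor is simply the index of the lowest free slot, and a slot freed by close() becomes available again.

```c
#include <stddef.h>

#define DEMO_OPEN_MAX 8          /* tiny stand-in for OPEN_MAX */

struct demo_ufd { void *fp; };   /* non-NULL fp marks the slot in use */

/* Return the lowest free descriptor, or -1 if the table is full
   (the kernel would fail the open with EMFILE). */
int demo_fd_alloc(struct demo_ufd *tab, void *filep)
{
    for (int fd = 0; fd < DEMO_OPEN_MAX; fd++) {
        if (tab[fd].fp == NULL) {
            tab[fd].fp = filep;
            return fd;
        }
    }
    return -1;
}
```

This is also why closing the lowest-numbered descriptor causes the next open() to return that same number.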
Figure 7-7. The file Structure BE0070XS4.0
Notes:
Introduction
The system file table is a global resource and is shared by all processes on the system. One entry is allocated for each unique open of a file, device, or socket in the system.
Structure definition
The file structure is described in /usr/include/sys/file.h. In the visual above the fileops definitions for __FULL_PROTO have been omitted for clarity.
Table management
The system file table is a large array of file structures. The array is partly initialized. It grows on demand and is never shrunk. Once entries are freed, they are added back onto the free list. The table can contain a maximum of 1,000,000 entries and is not configurable. The head of the free list is pointed to by ffreelist.
The File Structure
struct file {
long f_flag; /* see fcntl.h */
int f_count; /* reference count */
short f_options; /* file flags not passed through vnode layer */
short f_type; /* descriptor type */
union {
struct vnode *f_uvnode; /* pointer to vnode structure */
struct file *f_unext; /* next entry in freelist */
} f_up;
offset_t f_offset; /* read/write character pointer */
off_t f_dir_off; /* BSD style directory offsets */
union {
struct ucred *f_cpcred; /* process credentials at open() */
struct file *f_cpqmnext; /* next quick move chunk on free list*/
} f_cp;
Simple_lock f_lock; /* file structure fields lock */
Simple_lock f_offset_lock; /* file structure offset field lock */
caddr_t f_vinfo; /* any info vfs needs */
struct fileops {
. . .
int (*fo_rw)();
int (*fo_ioctl)();
int (*fo_select)();
int (*fo_close)();
int (*fo_fstat)();
} *f_ops;
};
Table entries
The file table array consists of struct file data elements. Several of the key members of this data structure are described in this table:
Member: Description
- f_count: A reference count giving the current number of opens on the file. The value is incremented each time the file is opened and decremented on each close(). Once the reference count is zero, the slot is considered free and may be reused.
- f_flag: Various flags described in fcntl.h.
- f_type: A type field describing the type of file:
        #define DTYPE_VNODE   1   /* file */
        #define DTYPE_SOCKET  2   /* communications endpoint */
        #define DTYPE_GNODE   3   /* device */
        #define DTYPE_OTHER  -1   /* unknown */
- f_offset: A read/write pointer.
- f_data: Defined as f_up.f_uvnode; a pointer to another data structure representing the object (typically the vnode structure).
- f_ops: A structure containing pointers to functions for the following file operations: rw (read/write), ioctl, select, close and fstat.
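Dispatch through such an operations vector can be sketched as follows (the fo_ names and signatures are invented for illustration; the real fileops prototypes differ): the caller invokes the operation through the pointer without knowing what kind of object the file entry represents.

```c
/* Illustrative operations vector, in the style of struct fileops. */
struct fo_ops {
    int (*fo_rw)(int nbytes);
    int (*fo_close)(void);
};

static int fo_demo_rw(int nbytes) { return nbytes; }  /* pretend transfer */
static int fo_demo_close(void)    { return 0; }

static struct fo_ops fo_demo_ops = { fo_demo_rw, fo_demo_close };

struct fo_file { struct fo_ops *f_ops; };

/* The LFS-style caller knows nothing about the object's type. */
int fo_dispatch_rw(struct fo_file *fp, int nbytes)
{
    return fp->f_ops->fo_rw(nbytes);
}
```

A socket, device or regular file would each install its own vector, and the same dispatch code serves all of them.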
Figure 7-8. vnode/vfs Interface BE0070XS4.0
Notes:
Introduction
The interface between the logical file system and the underlying file system implementations is referred to as the vnode/vfs interface. This interface provides a logical boundary between generic objects understood at the LFS layer, and the file system specific objects that the underlying file system implementation must manage.
Data structures
vnodes and vfs structures are the primary data structures used to communicate through the interface (with help from vmount).
Description
Descriptions of the vnode, vfs and vmount structures are given in this table:
vnode/vfs Interface (figure): the same data structure diagram as Figure 7-4; the vnode, vfs and vmount structures sit on the boundary between the Logical File System (u-block, user file descriptor table, system file table) and the file system implementation (gfs, vnodeops, vfsops, gnode, inode).
Part: Function
- vnode: Represents a single file or directory
- vfs: Represents a mounted file system
- vmount: Contains specifics of the mount request
Figure 7-9. vnode BE0070XS4.0
Notes:
Introduction
A vnode represents an active file or directory in the kernel. Each time a file is located, a vnode for that object is located or created. Several vnodes may be created as a result of path resolution.
Structure definition
The vnode structure is defined in /usr/include/sys/vnode.h.
vnode management
vnodes are created by the file system specific code as needed (typically as the result of path resolution) using the vn_get kernel service, and are deleted with the vn_free kernel service.
vnode
struct vnode {
        ushort          v_flag;
        ulong32int64    v_count;    /* the use count of this vnode */
        int             v_vfsgen;   /* generation number for the vfs */
        Simple_lock     v_lock;     /* lock on the structure */
        struct vfs      *v_vfsp;    /* pointer to the vfs of this vnode */
        struct vfs      *v_mvfsp;   /* pointer to vfs which was mounted over */
                                    /* this vnode; NULL if not mounted */
        struct gnode    *v_gnode;   /* ptr to implementation gnode */
        struct vnode    *v_next;    /* ptr to other vnodes that share same gnode */
        struct vnode    *v_vfsnext; /* ptr to next vnode on list off of vfs */
        struct vnode    *v_vfsprev; /* ptr to prev vnode on list off of vfs */
        union v_data {
                void            *_v_socket;   /* vnode associated data */
                struct vnode    *_v_pfsvnode; /* vnode in pfs for spec */
        } _v_data;
        char            *v_audit;   /* ptr to audit object */
};
Detail
Each time an object (file) within a file system is located (even if it is not opened), a vnode for that object is located (if already in existence), or created, as are the vnodes for any directory that has to be searched to resolve the path to the object. As a file is created, a vnode is also created, and will be re-used for every subsequent reference made to the file by a path name. Every path name known to the logical file system can be associated with, at most, one file system object, and each file system object can have several names because it can be mounted in different locations. Symbolic links and hard links to an object always get the same vnode if accessed through the same mount point.
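The "locate or create" behavior described above can be sketched in user space (this is an illustrative model with invented sk_ names, not the real vn_get): looking up the same object twice returns the same vnode with a raised use count.

```c
#include <stddef.h>

#define SK_VN_TABLE 16

struct sk_vnode {
    int v_count;    /* use count, like vnode.v_count */
    int id;         /* stands in for the object's identity */
    int used;
};

static struct sk_vnode sk_vn_table[SK_VN_TABLE];

/* Find the vnode for object 'id', creating one on first reference. */
struct sk_vnode *sk_vn_get(int id)
{
    struct sk_vnode *freep = NULL;
    for (int i = 0; i < SK_VN_TABLE; i++) {
        if (sk_vn_table[i].used && sk_vn_table[i].id == id) {
            sk_vn_table[i].v_count++;   /* already in core: reuse it */
            return &sk_vn_table[i];
        }
        if (!sk_vn_table[i].used && freep == NULL)
            freep = &sk_vn_table[i];
    }
    if (freep == NULL)
        return NULL;                    /* table exhausted */
    freep->used = 1;
    freep->id = id;
    freep->v_count = 1;
    return freep;
}
```

This is why hard links and symbolic links reached through the same mount point resolve to the same vnode: they identify the same underlying object.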
Figure 7-10. vfs BE0070XS4.0
Notes:
Introduction
There is one vfs structure for each file system currently mounted. The vfs structure connects the vnodes with the vmount information and with the gfs structure, which helps define the operations that can be performed on the file system and its files.
Structure definition
The vfs structure is defined in /usr/include/sys/vfs.h.
Key elements
Several key elements of the vfs structure are described in this table:
Element: Description
- vfs_next: The next mounted file system.
vfs
struct vfs {
        struct vfs      *vfs_next;     /* vfs's are a linked list */
        struct gfs      *vfs_gfs;      /* ptr to gfs of vfs */
        struct vnode    *vfs_mntd;     /* pointer to mounted vnode, */
                                       /* the root of this vfs */
        struct vnode    *vfs_mntdover; /* pointer to mounted-over */
                                       /* vnode */
        struct vnode    *vfs_vnodes;   /* all vnodes in this vfs */
        int             vfs_count;     /* number of users of this vfs */
        caddr_t         vfs_data;      /* private data area pointer */
        unsigned int    vfs_number;    /* serial number to help distinguish between */
                                       /* different mounts of the same object */
        int             vfs_bsize;     /* native block size */
#ifdef _SUN
        short           vfs_exflags;   /* for SUN, exported fs flags */
        unsigned short  vfs_exroot;    /* for SUN, " fs uid 0 mapping */
#else
        short           vfs_rsvd1;     /* Reserved */
        unsigned short  vfs_rsvd2;     /* Reserved */
#endif /* _SUN */
        struct vmount   *vfs_mdata;    /* record of mount arguments */
        Simple_lock     vfs_lock;      /* lock to serialize vnode list */
};
- vfs_mntd: Points to the vnode within the file system which generally represents the root directory of the file system.
- vfs_mntdover: Points to a vnode within another file system, usually representing a directory, which indicates where the file system is mounted.
- vfs_vnodes: Pointer to all vnodes in this file system.
- vfs_gfs: The path back to the gfs structure and its file system specific subroutines.
- vfs_mdata: Pointer to the vmount structure recording mount information for this file system.
Figure 7-11. root (/) and usr File Systems BE0070XS4.0
Notes:
Relationship between vfs and vnodes
This illustration shows the relationship between the vfs and vnode objects for mounted file systems. This example shows the root (/) and usr file systems.
root (/) and usr File Systems (figure): the global rootvfs pointer leads to the vfs for the root file system, whose vfs_next points to the vfs for the usr file system. Each vfs's vfs_mntd points to the vnode for the root of its file system (each vnode's v_vfsp points back to its vfs). The usr vfs's vfs_mntdover points to the vnode for /usr in the root file system, and that vnode's v_mvfsp points to the usr vfs. The numbered callouts 1 through 4 are described in the table that follows.
Description
The numbered items below match the numbers in the illustration:
1. The global address rootvfs points to the vfs for the root file system.
2. The vfs_next pointers create a linked list of mounted file systems.
3. vfs_mntd points to the vnode representing the root of the file system.
4. vfs_mntdover points to the vnode of the directory the file system is mounted over.
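Items 1 and 2 above imply that all mounted file systems can be found by walking the vfs_next chain from rootvfs. A minimal sketch, using a simplified stand-in structure (mini_vfs is invented for these notes):

```c
#include <stddef.h>

/* Simplified stand-in for struct vfs: just the list linkage. */
struct mini_vfs {
    struct mini_vfs *vfs_next;
    const char *name;
};

/* Count mounted file systems starting from the rootvfs pointer. */
int count_mounts(struct mini_vfs *rootvfs)
{
    int n = 0;
    for (struct mini_vfs *v = rootvfs; v != NULL; v = v->vfs_next)
        n++;
    return n;
}
```

Kernel code (and kdb) traverses the real list the same way to enumerate every mount on the system.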
Figure 7-12. vmount BE0070XS4.0
Notes:
Introduction
The vmount structure contains specifics of the mount request. The vfs and vmount are created as pairs and linked together.
Structure definition
The vmount structure is defined in /usr/include/sys/vmount.h.
vmount
struct vmount {
uint vmt_revision; /* I revision level, currently 1 */
uint vmt_length; /* I total length of structure & data */
fsid_t vmt_fsid; /* O id of file system */
int vmt_vfsnumber; /* O unique mount id of file system */
uint vmt_time; /* O time of mount */
uint vmt_timepad; /* O (in future, time is 2 longs) */
int vmt_flags; /* I general mount flags */
/* O MNT_REMOTE is output only */
int vmt_gfstype; /* I type of gfs, see MNT_XXX above */
struct vmt_data {
short vmt_off; /* I offset of data, word aligned */
short vmt_size; /* I actual size of data in bytes */
} vmt_data[VMT_LASTINDEX + 1];
};
vfs management
The mount helper creates the vmount structure and calls the vmount subroutine. The vmount subroutine then creates the vfs structure, partially populates it, and invokes the file system dependent vfs_mount subroutine, which completes the vfs structure and performs any operations required internally by the particular file system implementation.
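The split of work described above, where a generic routine pairs the vfs with its vmount and then hands off to the file system dependent vfs_mount, can be sketched as follows (all mv_ names are invented for illustration; this is not the AIX vmount subroutine):

```c
#include <stddef.h>

/* Simplified stand-ins for the real structures. */
struct mv_vmount { int gfstype; };

struct mv_vfs {
    struct mv_vmount *vfs_mdata;  /* record of mount arguments */
    int populated;                /* set by the fs-dependent step */
};

/* Plays the role of a file system's vfs_mount entry point. */
static int mv_demo_fs_mount(struct mv_vfs *vfsp)
{
    vfsp->populated = 1;          /* fs-specific completion */
    return 0;
}

/* Generic layer: link the vmount data to the vfs, then call the
   file system dependent hook to finish the job. */
int mv_vmount_call(struct mv_vfs *vfsp, struct mv_vmount *vmt,
                   int (*fs_mount)(struct mv_vfs *))
{
    vfsp->vfs_mdata = vmt;        /* vfs and vmount are created as a pair */
    vfsp->populated = 0;          /* partially populated... */
    return fs_mount(vfsp);        /* ...completed by the fs itself */
}
```

The design keeps mount argument handling generic while letting each file system perform whatever internal setup it needs.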
Figure 7-13. File and File System Operations BE0070XS4.0
Notes:
Introduction
Each file system type extension provides functions to perform operations on the file system and its files. Pointers to these functions are stored in the vfsops (file system operations) and vnodeops (file operations) structures.
Data structures
For each file system type installed, one group of these three data structures shown above will be created.
File and File System Operations (figure): the same data structure diagram as Figure 7-4, here highlighting the gfs, vnodeops and vfsops structures on the file system side.
Structure descriptions
Descriptions of gfs, vnodeops, and vfsops are given in this table:
Part: Function
- gfs: Holds pointers to the vnodeops and vfsops structures
- vnodeops: Contains pointers to file system dependent operations on files (open, close, read, write, and so on)
- vfsops: Contains pointers to file system dependent operations on the file system (mount, umount, and so on)
Figure 7-14. gfs BE0070XS4.0
Notes:
Introduction
The gfs structure holds the pointers to the vnodeops and vfsops structures.
gfs (figure): the vfs structure's vfs_gfs pointer leads to the gfs structure; from there, the gn_ops pointer leads to the vnodeops structure and the gfs_ops pointer leads to the vfsops structure.
Structure definition
The gfs structure is defined in /usr/include/sys/gfs.h:
struct gfs {
        struct vfsops   *gfs_ops;
        struct vnodeops *gn_ops;
        int             gfs_type;     /* type of gfs (from vmount.h) */
        char            gfs_name[16]; /* name of vfs (eg. "jfs","nfs") */
        int             (*gfs_init)();/* ( gfsp ) - if ! NULL, */
                                      /* called once to init gfs */
        int             gfs_flags;    /* flags for gfs capabilities */
        caddr_t         gfs_data;     /* gfs private config data */
        int             (*gfs_rinit)();
        int             gfs_hold;     /* count of mounts */
};
gfs management
The gfs structures are stored within a global array accessible only by the kernel. The gfs entries are inserted with the gfsadd() kernel service, and only one gfs entry of a given gfs_type can be inserted into the array. Generally, gfs entries are added by the CFG_INIT section of the configuration code of the file system kernel extension. The gfs entries are removed with the gfsdel() kernel service. This is usually done within the CFG_TERM section of the configuration code of the file system kernel extension.
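The registration rules above (a global array, at most one entry per gfs_type) can be sketched as follows; the reg_ names and return conventions are invented for illustration and do not match the real gfsadd()/gfsdel() signatures:

```c
#include <stddef.h>

#define REG_GFS_MAX 16

struct reg_gfs { int gfs_type; int used; };

static struct reg_gfs reg_gfs_table[REG_GFS_MAX];

/* Register a file system type; return 0 on success, -1 if the type
   is already present or the table is full. */
int reg_gfsadd(int gfs_type)
{
    struct reg_gfs *freep = NULL;
    for (int i = 0; i < REG_GFS_MAX; i++) {
        if (reg_gfs_table[i].used && reg_gfs_table[i].gfs_type == gfs_type)
            return -1;            /* only one entry per type */
        if (!reg_gfs_table[i].used && freep == NULL)
            freep = &reg_gfs_table[i];
    }
    if (freep == NULL)
        return -1;
    freep->used = 1;
    freep->gfs_type = gfs_type;
    return 0;
}

/* Remove a previously registered type, as gfsdel() would. */
int reg_gfsdel(int gfs_type)
{
    for (int i = 0; i < REG_GFS_MAX; i++) {
        if (reg_gfs_table[i].used && reg_gfs_table[i].gfs_type == gfs_type) {
            reg_gfs_table[i].used = 0;
            return 0;
        }
    }
    return -1;
}
```

A kernel extension would perform the add in its CFG_INIT path and the delete in its CFG_TERM path, bracketing the time its file system type is usable.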
Figure 7-15. vnodeops BE0070XS4.0
Notes:
vnodeops
The vnodeops structure contains pointers to the file system dependent operations that can be performed on the vnode, such as link, mkdir, mknod, open, close and remove.
vnodeops (figure): vfs, through vfs_gfs, leads to gfs; its gn_ops pointer leads to the vnodeops structure, which contains pointers such as:
- vn_link()
- vn_mkdir()
- vn_open()
- vn_close()
- vn_remove()
- vn_rmdir()
- vn_lookup()
Structure definition
The vnodeops structure is defined in /usr/include/sys/vnode.h. Due to the size of this structure, only a few lines are detailed below:
struct vnodeops {
        /* creation/naming/deletion */
        int (*vn_link)(struct vnode *, struct vnode *, char *,
                       struct ucred *);
        int (*vn_mkdir)(struct vnode *, char *, int32long64_t,
                        struct ucred *);
        int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t,
                        dev_t, struct ucred *);
        int (*vn_remove)(struct vnode *, struct vnode *, char *,
                         struct ucred *);
        int (*vn_rename)(struct vnode *, struct vnode *, caddr_t,
                         struct vnode *, struct vnode *, caddr_t,
                         struct ucred *);
        . . .
Figure 7-16. vfsops BE0070XS4.0
Notes:
vfsops
The vfsops structure contains pointers to the file system dependent operations that can be performed on the vfs, such as mount, unmount, or sync.
vfsops (figure): vfs, through vfs_gfs, leads to gfs; its gfs_ops pointer leads to the vfsops structure, which contains pointers such as:
- vfs_mount()
- vfs_unmount()
- vfs_root()
- vfs_sync()
- vfs_vget()
- vfs_cntl()
- vfs_quotactl()
Structure definition
The vfsops structure is defined in /usr/include/sys/vfs.h:
struct vfsops {
        /* mount a file system */
        int (*vfs_mount)(struct vfs *, struct ucred *);
        /* unmount a file system */
        int (*vfs_unmount)(struct vfs *, int, struct ucred *);
        /* get the root vnode of a file system */
        int (*vfs_root)(struct vfs *, struct vnode **, struct ucred *);
        /* get file system information */
        int (*vfs_statfs)(struct vfs *, struct statfs *, struct ucred *);
        /* sync all file systems of this type */
        int (*vfs_sync)();
        /* get a vnode matching a file id */
        int (*vfs_vget)(struct vfs *, struct vnode **, struct fileid *,
                        struct ucred *);
        /* do specified command to file system */
        int (*vfs_cntl)(struct vfs *, int, caddr_t, size_t, struct ucred *);
        /* manage file system quotas */
        int (*vfs_quotactl)(struct vfs *, int, uid_t, caddr_t,
                            struct ucred *);
};
Figure 7-17. gnode BE0070XS4.0
Notes:
Introduction
gnodes are generic objects pointed to by vnodes but may be contained in different structures depending on the file system type.
Location
The gnode is contained in an in-core-inode for a file on a local file system. Special files (such as /dev/tty), have gnodes contained in specnodes. NFS files have gnodes contained within rnodes.
Structure definition
The gnode structure is defined in /usr/include/sys/vnode.h:
struct gnode {
        enum vtype      gn_type;      /* type of object: VDIR, VREG etc */
        short           gn_flags;     /* attributes of object */
        ulong           gn_seg;       /* segment into which file is mapped */
        long32int64     gn_mwrcnt;    /* count of map for write */
        long32int64     gn_mrdcnt;    /* count of map for read */
        long32int64     gn_rdcnt;     /* total opens for read */
        long32int64     gn_wrcnt;     /* total opens for write */
        long32int64     gn_excnt;     /* total opens for exec */
        long32int64     gn_rshcnt;    /* total opens for read share */
        struct vnodeops *gn_ops;
        struct vnode    *gn_vnode;    /* ptr to list of vnodes per this gnode */
        dev_t           gn_rdev;      /* for devices, their "dev_t" */
        chan_t          gn_chan;      /* for devices, their "chan", minor's minor */
        Simple_lock     gn_reclk_lock; /* lock for filocks list */
        int             gn_reclk_event; /* event list for file locking */
        struct filock   *gn_filocks;  /* locked region list */
        caddr_t         gn_data;      /* ptr to private data (usually contiguous) */
};

gnode (figure): a vnode's v_gnode pointer leads to a gnode embedded in an in-core inode (local files), a specnode (special files) or an rnode (NFS files).
Key elements
Some of the key elements of the gnode are described below:
Detail
Each file system implementation is responsible for allocating and destroying gnodes. Calls to the file system implementation serve as requests to perform an operation on a specific gnode. A gnode is needed, in addition to the file system inode, because some file system implementations may not include the concept of an inode. Thus the gnode structure substitutes for whatever structure the file system implementation may have used to uniquely identify a file system object.
gnodes are created, as needed by file system specific code at the same time as implementation specific structures are created. This is normally immediately followed by a call to the vn_get kernel service to create a matching vnode. The gnode structure is usually deleted either when the file it refers to is deleted, or when the implementation specific structure is being reused for another file.
Element: Description
- gn_type: Identifies the type of object represented by the gnode; examples are directory, character and block.
- gn_ops: Identifies the set of operations that can be performed on the object.
- gn_seg: Segment number to which the file is mapped.
- gn_data: Pointer to private data; points to the start of the structure (usually an inode) in which the gnode is embedded.
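The embedding arrangement, with gn_data aiming back at the containing structure, can be sketched as follows. The sk_inode layout and helper names are invented for illustration; real file systems each define their own container.

```c
#include <stddef.h>

/* Generic object; gn_data points at the containing structure. */
struct sk_gnode { char *gn_data; };

/* A file-system-specific container with an embedded gnode. */
struct sk_inode {
    int i_number;                     /* fs-specific identity */
    struct sk_gnode i_gnode;          /* embedded generic object */
};

/* On creation, aim the gnode's private pointer at its container. */
void sk_inode_init(struct sk_inode *ip, int number)
{
    ip->i_number = number;
    ip->i_gnode.gn_data = (char *)ip;
}

/* Generic code holding only the gnode can recover the inode. */
struct sk_inode *sk_gn_to_inode(struct sk_gnode *gp)
{
    return (struct sk_inode *)gp->gn_data;
}
```

This is what lets generic LFS/VFS code operate on a gnode while the owning file system can still reach its own per-file data.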
Figure 7-18. kdb devsw Subcommand Output BE0070XS4.0
Notes:
Introduction
The file systems discussed earlier in this unit are contained within Logical Volume Manager (LVM) Logical Volumes. The data defining LVM entities (including Volume Groups, Logical Volumes and Physical Volumes) is maintained both on disks and in the ODM. This architecture is discussed in other classes.
Here we would like to introduce three kernel structures which maintain LVM data, and the kdb commands that display these structures. The structures are volgrp, lvol and pvol (defined in src/bos/kernel/sys/dasd.h, which is not distributed with the AIX product). The kdb subcommands to display these structures have corresponding names: volgrp, lvol and pvol.
In the above visual we will illustrate the structure definitions with example output from the kdb subcommands and corresponding AIX commands. All definitions are from src/bos/kernel/sys/dasd.h unless otherwise noted.
kdb devsw Subcommand Output
(0)> devsw 0xa
Slot address 30057280
MAJOR: 00A
open: 0207DC40
close: 0207D694
read: 0207CDC0
write: 0207CCF4
ioctl: 0207B4DC
strategy: 02095914
ttys: 00000000
select: .nodev (000E12EC)
config: 020795E8
print: .nodev (000E12EC)
dump: 020A7530
mpx: .nodev (000E12EC)
revoke: .nodev (000E12EC)
dsdptr: 310E3000 selptr: 00000000
opts: 0000002A DEV_DEFINED DEV_MPSAFE
(0)>
volgrp structure
The administrative unit of LVM is a volume group. The kernel describes this in the volgrp structure. Portions of the structure definition follows:
struct volgrp {
        Simple_lock     vg_lock;      /* lock for all vg structures */
        struct unique_id vg_id;       /* volume group id */
        int             major_num;    /* major number of volume group */
        . . .
        short           open_count;   /* count of open logical volumes */
        . . .
        struct volgrp   *nextvg;      /* pointer to next volgrp structure */
        . . .
        struct lvol     *lvols[NEW_MAXLVS]; /* logical volume struct array */
        struct pvol     *pvols[NEW_MAXPVS]; /* physical volume struct array */
        . . .
};
The items in bold are defined as:
- vg_id: The 32-character volume group id.
- open_count: The count of active logical volumes in this volume group.
- nextvg: The volgrp linked-list pointer. A value of zero means this is the last or only volume group.
- lvols[NEW_MAXLVS]: The array of pointers to lvol structures for this volume group, indexed by logical volume minor number.
- pvols[NEW_MAXPVS]: The array of pointers to pvol structures for this volume group, indexed by physical volume minor number.
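The indexing scheme in the last two items can be sketched as a direct lookup by minor number (structures and names below are simplified for these notes, not the dasd.h definitions):

```c
#include <stddef.h>

#define SK_MAXLVS 256                   /* stand-in for NEW_MAXLVS */

struct sk_lvol { int lv_status; };

struct sk_volgrp {
    int major_num;
    struct sk_lvol *lvols[SK_MAXLVS];   /* indexed by LV minor number */
};

/* Return the lvol for a minor number, or NULL if not configured. */
struct sk_lvol *sk_find_lvol(struct sk_volgrp *vg, int minor)
{
    if (minor < 0 || minor >= SK_MAXLVS)
        return NULL;
    return vg->lvols[minor];
}
```

This is why, in the kdb output on the next page, the entry for hd3 appears as LVOL[007]: its logical volume minor number is 7.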
volgrp kdb subcommand
volgrp addresses are registered in the devsw table. This table is displayed with the kdb subcommand devsw. At this point we introduce it only to obtain a volgrp address. We use the devsw subcommand with a single parameter, 0xa, the major number of rootvg on this system. In the command output, the dsdptr: field is the address of rootvg's volgrp structure.
Figure 7-19. kdb volgrp Subcommand Output BE0070XS4.0
Notes:
Output of volgrp command
The visual above shows partial output of the kdb subcommand, volgrp, using the address just obtained with devsw.
The volgrp subcommand formats volgrp structure data in a helpful way. The pointer values for pvol and lvol arrays are provided (“pvols” and “lvols”), but in addition the subcommand formats each lvols array entry. So we see an “LVOL” entry for each logical volume in our rootvg.
In the example above there were 10 entries for lvols array data. We have shown only the entry for minor device number 7 which is the entry for hd3, the /tmp file system. We will examine this logical volume with other commands, and describe the bold items at that time.
kdb volgrp Subcommand Output
(0)> volgrp 310e3000
VOLGRP............. 310E3000
. . .
lvols............... @ 310E302C
pvols............... @ 310E382C major_num............. 0000000A
vg_id................. 0001D2CA00004C00000000F11C1697A0
nextvg................ 00000000 opn_pin............. @ 310E3A2C
. . .
sa_hld_lst............ 00000000 vgsa_ptr.............. 31107000
config_wait........... FFFFFFFF sa_lbuf............. @ 310E3B10
sa_pbuf............. @ 310E3B68
. . .
LVOL[007]....... 31108180
work_Q.......... 3110BE00 lv_status....... 00000002
lv_options...... 00001000 nparts.......... 00000001
i_sched......... 00000000 nblocks......... 00010000
parts[0]........ 31108300 pvol@ 310E4600 dev 00190000 start 00DE1100
parts[1]........ 00000000
parts[2]........ 00000000 . . .
LVOL[009]....... 31108380
. . .
Other items above in bold give:
- major_num = 0xA means this is the rootvg volume group.
- The vg_id value is rootvg’s volume group id.
- *nextvg=0 means this volume group is the last or only one on the volgrp linked list.
Figure 7-20. AIX lsvg Command Output BE0070XS4.0
Notes:
AIX lsvg command view of the same data
Now that we have seen the kernel’s view of rootvg data, it is interesting to look at what our command line interface shows. The lsvg command provides a summary of volume group information. This visual shows lsvg output for the same rootvg that we just examined with volgrp.
The items in bold print above correspond to kdb volgrp items described on the prior slide:
“VOLUME GROUP: rootvg” corresponds to major_num=0xA
“VG IDENTIFIER” corresponds to the vg_id value.
AIX lsvg Subcommand Output
# lsvg rootvg
VOLUME GROUP: rootvg VG IDENTIFIER: 0001d2ca00004c00000000f11c1697a0
VG STATE: active PP SIZE: 32 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 542 (17344 megabytes)
MAX LVs: 256 FREE PPs: 497 (15904 megabytes)
LVs: 9 USED PPs: 45 (1440 megabytes)
OPEN LVs: 8 QUORUM: 2
TOTAL PVs: 1 VG DESCRIPTORS: 2
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 1 AUTO ON: yes
MAX PPs per PV: 1016 MAX PVs: 32
LTG size: 128 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
#
Figure 7-21. kdb lvol Subcommand Output BE0070XS4.0
Notes:
kdb lvol Subcommand Output
(0)> lvol 31108180
LVOL............ 31108180
work_Q.......... 3110BE00 lv_status....... 00000002
lv_options...... 00001000 nparts.. 00000001
i_sched......... 00000000 nblocks......... 00010000
parts[0]..31108300 pvol@ 310E4600 dev 00190000 start 00DE1100
parts[1]........ 00000000
parts[2]........ 00000000
maxsize......... 00000000 tot_rds......... 00000000
complcnt........ 00000000 waitlist........ FFFFFFFF
stripe_exp...... 00000000 striping_width.. 00000000
lvol_intlock. @ 311081BC lvol_intlock.... 00000000
(0)>
The lvol structure
Each active logical volume is represented by an lvol structure. The lvol structure is defined as follows:
struct lvol {
struct buf **work_Q; /*work in progress hash table */
short lv_status; /*lv status:closed,closing,open */
ushort lv_options;/*logical dev options (see below)*/
short nparts; /* num of part structures for this*/
/* lv - base 1 */
char i_sched; /* initial scheduler policy state */
char lv_avoid; /* online backup mask indicator */
ulong nblocks; /* LV length in blocks */
struct part *parts[3]; /*partition arrays for each mirror*/
int maxsize; /* max number of pp allowed in lv */
ulong tot_rds; /* total number of reads to LV */
int parent_minor_num;/*if this is an online backupcopy*/
/*this is the minor number of the ’real’*/
/* or ’parent’ logical volume */
/* These fields of the lvol structure are read and/or written by
* the bottom half of the LVDD; and therefore must be carefully
* modified.
*/
int complcnt; /* completion count-used to quiesce */
tid_t waitlist; /* event list for quiesce of LV */
struct file *fp; /*file ptr for lv mir bkp open/close */
unsigned int stripe_exp; /* 2**stripe_block_exp = stripe */
/* block size */
unsigned int striping_width; /* number of disks striped across */
Simple_lock lvol_intlock;
uchar lv_behavior;/* special conditions lv may be under */
struct io_stat *io_stats[3];/* collect io statistics here */
unsigned int syncing; /* Count of SYNC requests */
unsigned int blocked; /* Count of blocked requests */
};
Items shown in bold:
lv_status: 0=> closed, 1=> trying to close, 2=> open, 3=> being deleted
lv_options is a flag word. Some of the flags are: 0x0001=>write verify, 0x0020=>read-only, 0x0040=>dump in progress to this logical volume, 0x0080=>this logical volume is a dump device, 0x1000=>original default (not passive) mwcc (mirror write consistency check) on.
nparts: Number of copies (1=>no mirror, 2=>single mirror, 3=>two mirrors). This gives the number of *parts array elements that are meaningful.
i_sched: Scheduling policy for this logical volume
values include: 0=>regular, non-mirrored LV, 1=>sequential write, sequential read, 2=>parallel write, read closest, 3=>sequential write, read closest, 4=> parallel write, sequential read, 5=>striped
nblocks: Number of 512 byte blocks in this logical volume
*parts[3]: Each parts element is a pointer to an array of part structures, which define the physical volume storage for one logical volume copy.
- Each of these part structures points to a pvol structure and disk start address for one part of the logical volume data. The structure is defined as follows:
struct part {
	struct pvol *pvol;  /* containing physical volume */
	daddr_t start;      /* starting physical disk address */
	int sync_trk;       /* current LTG being resynced */
	char ppstate;       /* physical partition state */
	char sync_msk;      /* current LTG sync mask */
};
kdb lvol subcommand
The kdb subcommand, lvol, formats lvol structure data. The visual above shows the lvol output for lvols[7], from the rootvg volume group. This is the logical volume with minor # 7: hd3.
Items above in bold give:
- lv_status = 2 means the logical volume is open.
- nparts=1 means there is only one parts structure for this logical volume.
- i_sched=0 means the scheduling policy for this logical volume is “regular, non-mirrored”.
- nblocks=0x10000 is the number of 512 byte blocks in this logical volume. This translates to 65536 decimal.
The single part structure is at location 0x31108300. The lvol subcommand summarizes this part structure:
- It points to the pvol structure at 0x310e4600. The physical volume major/minor numbers are 0x19 (decimal 25)/0. The disk start address is 0x00DE1100. The ls -l command on /dev/hd* tells us this is the major/minor number of hdisk0.
Figure 7-22. AIX lslv Command Output BE0070XS4.0
Notes:
lslv command output
The visual above shows lslv command output for rootvg logical volume hd3, the /tmp logical volume.
The items in bold print above correspond to kdb lvol items described on the prior slide:
- “LV STATE: opened/syncd” corresponds to lv_status=2
- “Write Verify: off” corresponds to lv_options=00001000 (flag is 0x0001 for write verify)
- “PP SIZE: 32”, “LPs: 1” and “PPs: 1” correspond to nblocks=00010000 (1 pp x 32 MB/pp = 65536 blocks x 512 bytes/block, and 65536 decimal = 10000 hexadecimal.)
- “MIRROR WRITE CONSISTENCY: on/ACTIVE” corresponds to lv_options=00001000 (flag is 0x1000 for original default mwcc)
AIX lslv Command Output
# lslv hd3
LOGICAL VOLUME: hd3 VOLUME GROUP: rootvg
LV IDENTIFIER: 0001d2ca00004c00000000f11c1697a0.7 PERMISSION: read/write
VG STATE: active/complete LV STATE:opened/syncd
TYPE: jfs WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 32 megabyte(s)
COPIES: 1 SCHED POLICY: parallel
LPs: 1 PPs: 1
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: center UPPER BOUND: 32
MOUNT POINT: /tmp LABEL: /tmp
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
- “SCHED POLICY: parallel” is technically incorrect here, but it has no meaning because this logical volume is not mirrored. The i_sched=00000000 value from kdb correctly reflects this (SCH_REGULAR = 0 => regular, non-mirrored logical volume).
Figure 7-23. kdb pvol Subcommand Output BE0070XS4.0
Notes:
pvol structure
The basic hardware unit for LVM is a physical volume. The kernel describes this in the pvol structure. This structure is defined as follows:
struct pvol {
	dev_t dev;            /* dev_t of physical device */
	struct unique_id pvid;
	short pvstate;        /* PV state */
	short pvnum;          /* LVM PV number 0-31/0-127 */
	int vg_num;           /* VG major number */
	struct file *fp;      /* file pointer from open of PV */
	char flags;           /* place to hold flags */
	short num_bbdir_ent;  /* current number of BB Dir entries */
	daddr_t fst_usr_blk;  /* first available block on the PV */
	                      /* for user data */
	daddr_t beg_relblk;   /* first blkno in reloc pool */
kdb pvol Subcommand Output
(0)> pvol 310e4600
PVOL............... 310E4600
dev................ 00190000 xfcnt.............. 00000000
pvstate............ 00000000
pvnum........ 00000000 vg_num......... 0000000A
fp................. 10000C60 flags.............. 00000000
num_bbdir_ent...... 00000000 fst_usr_blk........ 00001100
beg_relblk......... 021E6B9F next_relblk........ 021E6B9F
max_relblk......... 021E6C9E defect_tbl......... 310E4800
sa_area[0]....... @ 310E4638
sa_area[1]....... @ 310E4640 pv_pbuf.......... @ 310E4648
oclvm............ @ 310E46F0
	daddr_t next_relblk;  /* blkno of next unused relocation */
	                      /* block in reloc blk pool at end */
	                      /* of PV */
	daddr_t max_relblk;   /* largest blkno avail for reloc */
	struct defect_tbl *defect_tbl; /* pointer to defect table */
	struct sa_pv_whl {    /* VGSA information for this PV */
		daddr_t lsn;       /* SA logical sector number - LV 0 */
		ushort sa_seq_num; /* SA wheel sequence number */
		char nukesa;       /* flag set if SA to be deleted */
	} sa_area[2];         /* one for each possible SA on PV */
	struct pbuf pv_pbuf;  /* pbuf struct for writing cache */
	short bad_read;       /* changed to 1 on first bad read */
#ifdef CLVM_2_3
	struct clvm_2_3pv *oclvm; /* ptr to old CLVM pv struct */
#endif /* CLVM_2_3 */
	int xfcnt;            /* transfer count for this pv */
};
Items shown in bold:
- dev: major/minor device number for this disk (dev(31-16) = major, dev(15-0) = minor). Defined in /usr/include/sys/types.h.
- pvstate: Physical volume state (0=>normal, 1=>cannot be accessed, 2=> No hw/sw relocation allowed, 3=> pv involved in snapshot)
- vg_num: volume group major number
pvol kdb subcommand
The visual above shows output of the kdb pvol command. The parameter used is from our volgrp output for rootvg, and is for hdisk0.
Items above in bold give:
- dev=00190000 means major/minor #s are 25/0 (decimal).
- pvstate=0 means normal, accessible physical volume.
- pvnum=0 means physical volume number 0 in this volume group.
- vg_num=0xA is the major number of this volume group (rootvg). This can be confirmed by executing ls -l in /dev: this shows /dev/rootvg as having major number 10 (decimal).
Figure 7-24. AIX lspv Command Output BE0070XS4.0
Notes:
AIX lspv command
The visual above shows output of the AIX lspv command for hdisk0.
The items in bold print above correspond to kdb pvol items described on the prior visual:
- “PHYSICAL VOLUME: hdisk0” corresponds to pvnum=0, the LVM number for hdisk0 in rootvg
- “VOLUME GROUP: rootvg” corresponds to vg_num=0xA, the rootvg major number.
The “VG IDENTIFIER” is maintained in the volgrp structure which points to this pvol structure. It is also maintained in ODM class CuAt. The “PV IDENTIFIER” is maintained in the ODM class CuAt.
AIX lspv Command Output
# lspv hdisk0
PHYSICAL VOLUME :hdisk0 VOLUME GROUP: rootvg
PV IDENTIFIER: 0001d2ca308b4251 VG IDENTIFIER 0001d2ca00004c00000000f11c1697a0
PV STATE: active
STALE PARTITIONS: 0 ALLOCATABLE: yes
PP SIZE: 32 megabyte(s) LOGICAL VOLUMES: 9
TOTAL PPs: 542 (17344 megabytes) VG DESCRIPTORS: 2
FREE PPs: 497 (15904 megabytes) HOT SPARE: no
USED PPs: 45 (1440 megabytes)
FREE DISTRIBUTION: 108..92..80..108..109
USED DISTRIBUTION: 01..16..28..00..00
#
Figure 7-25. Checkpoint (1 of 2) BE0070XS4.0
Notes:
Checkpoint (1 of 2)
Each user process contains a private F___ D______ T____.
The kernel maintains a _______structure and a _______structure for each mounted file system.
There is one gfs structure for each mounted file system. True or False?
The three kernel structures __________, __________ and __________ are used to track LVM volume group, logical volume and physical volume data, respectively.
The kdb subcommand __________ and the AIX command _________ both reflect volume group information.
Figure 7-26. Checkpoint (2 of 2) BE0070XS4.0
Notes:
Checkpoint (2 of 2)
There is one vmount/vfs structure pair for each mounted filesystem. True or False?
Every open file in a filesystem is represented by exactly one file structure. True or False?
The inode number given by ls -id /usr is _____. Why?
Each vnode for an open file points to a _______structure.
Figure 7-27. Exercise BE0070XS4.0
Notes:
Turn to your lab workbook and complete exercise six.
Exercise
Complete exercise six
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
Test what you have learned about the LFS and VFS
Locate the LFS/VFS structures for an open file
Identify what file a process has opened
Figure 7-28. Unit Summary BE0070XS4.0
Notes:
Unit Summary
The LFS and VFS provide support for many different file systems types simultaneously
The LFS/VFS allows for different types of file systems to be mounted together forming a single homogeneous view
The LFS services the system call interface for read() and write()
The VFS defines files (vnodes) and file systems (vfs)
Each file system type provides unique functions for file and file system operations. Operations are defined by the vnodeops and vfsops structures.
The gnode is a generic object connecting the VFS with the file system specific inode
kdb has special subcommands for viewing LFS/VFS structures
The kernel tracks LVM data in structures volgrp, lvol and pvol. There are kdb subcommands for displaying these structures.
Unit 8. Journaled File System
What This Unit Is About
This unit describes the internal structures of the Journaled File System (JFS).
What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe basic concepts of the JFS disk layout
• Describe JFS elements: inodes, allocation groups, superblock, indirect block and double indirect block
• Contrast on disk and incore inode structures
• Describe the relationship between JFS and LVM in performing I/O
How You Will Check Your Progress
Accountability:
• Unit review
References
AIX Documentation: System Management Guide: Operating System and Devices
Figure 8-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
Describe basic concepts of the JFS disk layout
Describe JFS elements: inodes, allocation groups, superblock, indirect block and double indirect block
Contrast on disk and incore inode structures
Describe the relationship between JFS and LVM in performing I/O
Figure 8-2. JFS File System BE0070XS4.0
Notes:
Journaled File System
Introduction
AIX 5L supports two main native file system types: JFS and JFS2. JFS (Journaled File System) is the original native file system for AIX. JFS2 (Enhanced Journaled File System) is a more recent development and is discussed in a following unit.
JFS maintains file data and components that identify where a file or directory's data is located on the disk. These components include inodes, data blocks, superblocks, a boot block, and one or more allocation groups.
An allocation group contains disk inodes and fragments. Each JFS file system occupies one logical volume.
The actual on-disk layout of a JFS file system can be viewed with the fsdb command. The visual illustrates some of the basic components of a JFS.
JFS File System
Boot Block
Inodes
Super Block
Data Blocks
Indirect Blocks
Boot Block
The boot block occupies the first 4096 bytes of a JFS starting at byte offset 0. This area is from the original Berkeley Software Distribution (BSD) Fast File System design, and is not used in AIX.
Superblock
The superblock is 4096 bytes in size and starts at byte offset 4096. The superblock maintains information about the entire JFS and includes the following fields:
- Size
- Number of data blocks
- A flag indicating the state
- Allocation group sizes
The superblock is critical to the JFS and if corrupted will prevent the file system from being mounted. For this reason a backup copy of the superblock is always written in block 31.
Blocks
A block is a 4096 byte data allocation unit.
Fragments
The journaled file system is organized in a contiguous series of fragments. JFS fragments are the basic allocation unit and the disk is addressed at the fragment level.
JFS fragment support allows disk space to be divided into allocation units that are smaller than the default size of 4096 bytes. Smaller allocation units or fragments minimize wasted disk space by more efficiently storing the data in a file or directory's partial logical blocks. The functional behavior of JFS fragment support is based on that provided by Berkeley Software Distribution (BSD) fragment support.
Specifying fragment size
The fragment size for a JFS is specified during its creation. The allowable fragment sizes for JFS are 512, 1024, 2048, and 4096 bytes. The default fragment size is 4096 bytes.
Different JFS file systems can have different fragment sizes, but only one fragment size can be used within a single file system.
Inodes
The disk inode is the anchor for files in a JFS. There is a one to one correspondence between a disk inode, an i-number, and a file. The inode records file information such as size, allocation, owner, and so on. However, it is disjoint from the name, since many different names can refer to the same inode via the inode number. The collection of disk inodes can be referred to as the disk inode table.
Allocation groups
The set of fragments making up a JFS are divided into one or more fixed-sized units of contiguous fragments. These are called allocation groups. An allocation group is similar to BSD cylinder groups.
The first 4096 bytes of the first allocation group holds the boot block and the second 4096 bytes holds the superblock.
Each allocation group contains disk inodes and free blocks. This permits inodes and data blocks to be dispersed throughout the file system and allows file data to lie in closer proximity to its inode. Despite the fact that the inodes are distributed through the disk, a disk inode can be located using a simple formula based on the i-number and the allocation group information contained in the super block.
For the first allocation group, the inodes occupy the fragments immediately following the reserved block area.
For subsequent groups, the inodes are found at the start of each group. Inodes are 128 bytes in size and are identified by a unique inode number. The inode number maps an inode to its location on the disk or to an inode within its allocation group.
Allocation group sizes
Allocation groups are described by three sizes:
- The fragment allocation group size and the inode allocation group size are specified as the number of fragments and inodes that exist in each allocation group.
- The default allocation group size is 8 MB.
- Beginning in Version 4.2, it can be as large as 64 MB.
These three values are stored in the file system superblock, and they are set at JFS creation.
Virtual memory
AIX exploits the segment architecture to implement its JFS physical file system. Just as virtual memory looks contiguous to a user program but may be scattered about real memory or paging space, disk files are made to look contiguous to the user program even though the physical disk blocks may be very scattered. When AIX needs to create a segment of virtual memory, it creates an External Page Table (XPT), which contains a collection of XPT blocks. When the physical file system creates a file, it creates a disk inode and possibly indirect blocks to describe the file. Disk inodes (and indirect blocks), and XPT blocks make their respective user-level resources appear contiguous.
The JFS maps all file system information into virtual memory, including user data blocks. The read and write operations are much simplified in that they merely initialize the mapping and then copy the data. Likewise, a directory lookup operation merely maps the directory into virtual memory and then walks through the directory structure. This greatly simplifies the code by separating the algorithmic problem of searching directory entries from the task of performing disk I/O operations and managing a buffer cache.
The I/O function is handled by the Virtual Memory Manager (VMM). When a page fault occurs on a mapped file object, the VMM is able to determine what file is being accessed, examine the inode to determine where the data is, and initiate a page in to transfer the data from the file system into memory. Once completed, the faulting process can be resumed and the operation continues, oblivious to the fact that a memory mapped access caused a disk operation.
Figure 8-3. Reserved Inodes BE0070XS4.0
Notes:
Reserved Inodes
Introduction
A unique feature of JFS is the implementation of file system data as unnamed files that reside in the file system. Every JFS file system has inodes 0-15 reserved. Most of these file names begin with a dot (“.”), marking them as hidden files. But these ‘hidden’ files do not appear in any directory. This is done by manipulating the inodes so they do not require a directory entry to support their link count value. Every open file is represented by a segment in the VMM. Most of these reserved inodes never actually exist on the disk, but are only present in the VMM when a file system is mounted.
Reserved Inodes
0     Not used
1     Superblock (.superblock)
2     Root directory of file system
3     Disk inodes (.inodes)
4     Indirect blocks (.indirect)
5     Disk inode allocation map (.inodemap)
6     Disk block allocation map (.diskmap)
7     Disk inode extensions (.inodex)
8     Inode extension map (.inodexmap)
9-15  Reserved
Superblock
Inode 1 is reserved for a file named .superblock. The superblock holds a concise description of the JFS: its size, allocation information, and an indication of the consistency of on-disk data structures. The inode points to two data blocks, 1 and 31. Data block 31 is a spare copy of the superblock at data block 1.
Root directory
Inode 2 is always used for the JFS root directory.
Disk inodes
Inode 3 is reserved for a file named .inodes. Every JFS object is described by a disk inode. Each disk inode is a fixed size: 128 bytes.
Indirect blocks
Inode 4 is reserved for a file named .indirect. The most common JFS object is a regular file. For a regular file, the inode holds a list of the data blocks which compose the file. It would be impractical to allocate inodes large enough to directly hold this entire list. The list of physical blocks are held in a tree structure, rather than an array. The intermediate nodes of this tree are the indirect blocks.
Disk inode allocation map
Inode 5 is reserved for a virtual file named .inodemap. This allocation map has bit flags turned on or off showing if an inode is in use or free.
Disk block allocation map
Inode 6 is reserved for a virtual file named .diskmap. This bit map indicates whether each block on the logical volume is in use or free.
Disk inode extensions
Inode 7 is reserved for a virtual file named .inodex. This file contains information about inode extensions which are used by access control lists.
Inode extension map
Inode 8 is reserved for the virtual file named .inodexmap. This bit map is used to keep track of free and allocated inode extensions.
Future use
Inodes 9 through 15 are reserved for future extensions.
Figure 8-4. Disk Inode Structure BE0070XS4.0
Notes:
Disk inode structure
Introduction
Inodes exist in a static form on disk and hold access information for the file in addition to pointers to the real disk addresses of the file’s data blocks. The number of inodes in a JFS file system depends on its size, the allocation group size (8 MB by default), and the number of bytes per inode ratio (4096 by default). The on-disk inode structure is defined in /usr/include/jfs/ino.h. The most basic elements of this structure are shown on the slide above and described in the text that follows.
Disk Inode Structure
struct dinode {
	uint   di_gen;
	mode_t di_mode;
	ushort di_nlink;
	. . .
	uid_t  di_uid;
	gid_t  di_gid;
	. . .
	uint   di_nblocks;
	. . .  di_mtime_ts;
	. . .  di_atime_ts;
	. . .  di_ctime_ts;
	. . .
# define di_rdaddr _di_info._di_file._di_rdaddr
	. . .
# define di_rindirect _di_info._di_file._di_indblk.id_raddr
};
Inode header file
The inode structure is defined in /usr/include/jfs/ino.h and contains:
Inode types
The private portion of the inode depends on its type. The types are defined in /usr/include/sys/mode.h and compose portions of the di_mode field. Inode types are:
Symbol Description
di_gen The disk inode generation number
di_nlink The number of directory entries which refer to the file
di_mode The file type, access permissions and attributes
di_uid User ID of owner
di_gid Group ID
di_size File size
di_nblocks Number of blocks used by file. This does not include indirect blocks.
di_mtime Time at which the contents of the file were last modified
di_atime Time at which the file was last accessed by read
di_ctime Time at which contents of disk inode were last updated
di_rdaddr[8] Real disk addresses of the data
di_rindirect Real disk address of the indirect block, if any
Type Description
S_IFREG
Regular file. The format of the private portion of an inode for a data file (including some symbolic links and directories) depends on the size of the file. The AIX file system always allocates full blocks to data files.
S_IFDIR Directory. The private portion of a directory inode is identical to that of a regular file.
S_IFBLK Block device. Block device inodes have only the dev_t
S_IFCHR Character device. Character device inodes have only the dev_t
S_IFLNK Symbolic link
S_IFSOCK A UNIX domain socket
S_IFIFO FIFO. A FIFO inode has no persistent private data.
Figure 8-5. In-core Inodes BE0070XS4.0
Notes:
In-core inodes
Introduction
When a JFS file is opened, an in-core inode is created in memory. The in-core inode contains a copy of all the fields defined in the disk inode in addition to fields for keeping track of the in-core inode.
In-core inode header file
The in-core inode structure is defined in /usr/include/jfs/inode.h.
There are two parts to each in-core inode:
- First portion of data structure relevant only while the object is accessed
- The last 128 bytes is a copy of the disk inode
In-core Inodes
When a file is opened, an in-core inode is created in memory
The in-core inode structure is defined in /usr/include/jfs/inode.h
In-core inodes include:
An exclusive-use lock
Use count
Open counts
State flags
Exclusion counts
Hash table links
Free list links
Mount table entry

In-core inode states
Active
Cached
Free
In-core inode contents
The in-core inode includes:
Item Notes
Exclusive-use lock
• Must be held before the in-core inode is updated
• Actually implemented with a simple lock
Use count
The in-core inode cannot be destroyed while it has a non-zero use count
Open counts
• Separate reader and writer counts are maintained in the gnode in the in-core inode
• Are incremented at each open, and decremented at close
• A process which has opened the file for both reading and writing is counted as both a reader and writer
State flags
• Maintain miscellaneous in-core inode state
Exclusion counts
• A bit indicates that the file has been opened for exclusive access
• A separate count of the number of readers who have specified read-only sharing (precluded writers) is also maintained
• If a process attempts to open the inode with a mode which conflicts with the current open status, it can be placed on a wait list for the inode (if the O_DELAY open flag was specified)
In-core inode states
There are three states for every in-core inode:
- Active. There is currently a vnode that refers to this inode. This implies that a process has the corresponding file open.
- Cached. There is no vnode that refers to this inode, which implies that the corresponding file is not open anywhere on the system. The data held by the in-core inode and its associated segment is still valid and may be reused if the inode is reopened. This avoids extra disk I/O.
- Free. The structure is available for immediate use.
In-core inode table
Active in-core inodes are maintained in the inode table, a hash table of in-core inodes for recently accessed files. Each entry contains the inode number and a file system number.
Entries are accessed by iget(). If an entry is not already in the table, iget() will call iread() to obtain the entry.
Entries are marked as unused by iput(). If an inode is iput(), no longer has other references to it, and has a zero link count, it is placed on the free list. If an inode is iput() and has no references to it, but still has a non-zero link count, it is placed in the cache list. From this list, it can be easily reacquired should a process need the inode again.
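The iput() decision above can be sketched in a few lines of C. The structure and function names here are illustrative stand-ins, not the real kernel routines: the last dereference sends an unlinked inode to the free list and a still-linked one to the cache list.

```c
#include <assert.h>

/* Illustrative states matching the Active / Cached / Free description. */
enum cache_state { ST_ACTIVE, ST_CACHED, ST_FREE };

struct cache_inode {
    int ref_count;            /* in-core references */
    int link_count;           /* on-disk link count */
    enum cache_state state;
};

/* Sketch of the iput() decision: only the last reference changes state. */
static void sketch_iput(struct cache_inode *ip)
{
    if (--ip->ref_count > 0)
        return;                       /* still referenced: stays active */
    if (ip->link_count == 0)
        ip->state = ST_FREE;          /* unlinked: reusable immediately */
    else
        ip->state = ST_CACHED;        /* data still valid, easily reacquired */
}
```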
Hash table links
• All existing in-core inodes are kept in a hash table, accessed by device and index
• Allows finding an inode by file handle, and assures that multiple inodes are not created for the same object
Free list links
• All unused in-core inodes are kept in a free list
Mount table entry
• If an object is in use, its underlying device must currently be mounted
• Each in-core inode points back to its mount table entry to avoid searching the mount table to find the entry for this object
In-core inode creation
The steps for in-core inode creation are:
Inode locking
The JFS serializes operations by obtaining an exclusive lock on each inode involved in the operation.
For all operations that require locking more than one inode, all involved inodes are known at the start of the operation. The ilocklist() routine sorts these into descending order before locking (the highest inode number is locked first). This prevents deadlock conditions.
Note: The iget() routine does not return a locked inode; nor does iput() free any lock on the inode.
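The ordering rule above can be sketched as follows. This is a hypothetical illustration of the idea behind ilocklist(), not the real routine: every inode involved in the operation is known up front, so sorting them into descending inode-number order guarantees that two operations locking the same pair of inodes always take the locks in the same order and cannot deadlock.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for an inode reference to be locked. */
struct ino_ref {
    unsigned long number;
    int locked;
};

/* qsort comparator: highest inode number first (descending order). */
static int by_number_desc(const void *a, const void *b)
{
    const struct ino_ref *x = a, *y = b;
    if (x->number == y->number)
        return 0;
    return x->number < y->number ? 1 : -1;
}

/* Sketch of the ilocklist() idea: sort, then lock in sorted order. */
static void sketch_ilocklist(struct ino_ref *list, size_t n)
{
    qsort(list, n, sizeof *list, by_number_desc);
    for (size_t i = 0; i < n; i++)
        list[i].locked = 1;    /* take each exclusive lock in order */
}
```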
Step Action
1. When a file is opened, the kernel searches the hash queue to see if there is an in-core inode already associated with the file.
2. If an inode is found in the hash queue, the reference count of the in-core inode is incremented and the file descriptor is returned to the user.
3. Otherwise, an in-core inode is removed from the free list and the disk inode is copied into the in-core inode.
4. The in-core inode is then placed on the hash queue and remains there until the reference count is zero (no processes have the file open).
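The four steps above can be sketched roughly as follows. The helper and structure names are hypothetical (the real path lives in iget()); a single hash bucket stands in for the full hash table.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative in-core inode with hash/free links. */
struct sk_inode {
    unsigned long num;
    int refs;
    struct sk_inode *next;
};

static struct sk_inode *hash_bucket;   /* one bucket, for illustration */
static struct sk_inode *free_list;

static struct sk_inode *sketch_iget(unsigned long num)
{
    /* Step 1: search the hash queue for an existing in-core inode. */
    for (struct sk_inode *ip = hash_bucket; ip; ip = ip->next)
        if (ip->num == num) {
            ip->refs++;                /* Step 2: bump the reference count */
            return ip;
        }

    /* Step 3: take a structure from the free list; the disk inode
     * would be read into it here. */
    struct sk_inode *ip = free_list;
    if (!ip)
        return NULL;
    free_list = ip->next;
    ip->num = num;
    ip->refs = 1;

    /* Step 4: place it on the hash queue. */
    ip->next = hash_bucket;
    hash_bucket = ip;
    return ip;
}
```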
Figure 8-6. Direct (No Indirect Blocks) BE0070XS4.0
Notes:
Indirect blocks
Introduction
JFS uses indirect blocks to address the disk space allocated to larger files. There are three methods for addressing the disk space:
- Direct
- Single indirect
- Double indirect
Beginning in AIX 4.2, file systems enabled for large files allow a maximum file size of slightly less than 64 gigabytes (68589453312 bytes). The first double indirect block contains 4096-byte fragments, and all subsequent double indirect blocks contain (32 x 4096 = 131072)-byte fragments.
Direct (no Indirect Blocks)
The figure shows the inode's eight disk addresses, di_raddr[0] through di_raddr[7], each holding a logical volume block number that points directly to one of data blocks 0 through 7, for file sizes of 32 KB or less.
The following produces the maximum file size for file systems enabling large files:
(1 * (1024 * 4096)) + (511 * (1024 * 131072)) = 68589453312
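The arithmetic above can be checked directly: one double-indirect subtree of 1024 entries of 4 KB fragments, plus 511 subtrees of 1024 entries of 128 KB (32 x 4096) fragments.

```c
#include <assert.h>

/* Maximum large-file size: first subtree uses 4 KB fragments, the
 * remaining 511 subtrees use 32 x 4096 = 131072-byte fragments. */
long long jfs_large_file_max(void)
{
    long long first = 1LL * (1024 * 4096);      /* 1024 entries x 4 KB   */
    long long rest  = 511LL * (1024 * 131072);  /* 1024 entries x 128 KB */
    return first + rest;                        /* 68589453312 bytes     */
}
```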
The fragment allocation assigned to a directory is divided into records of 512 bytes each and grows in accordance with the allocation of these records.
Direct
The first eight addresses point directly to a single allocation of disk fragments. Each disk fragment is 4 KB in size (8 x 4 KB = 32 KB). This method is used for files that are 32 KB or smaller in size.
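For the direct case, locating the fragment holding a given byte offset is a simple division. The helper below is a hypothetical illustration (the real lookup lives in the JFS block-map code); it assumes the default 4 KB fragment size.

```c
#include <assert.h>

#define FRAG_SIZE 4096
#define NDADDR    8          /* eight direct disk addresses in the inode */

/* Which di_raddr[] slot covers byte `offset`, or -1 if the offset is
 * past 32 KB and therefore needs indirect blocks. */
int direct_slot(long offset)
{
    if (offset < 0 || offset >= (long)NDADDR * FRAG_SIZE)
        return -1;
    return (int)(offset / FRAG_SIZE);
}
```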
Figure 8-7. Single Indirect BE0070XS4.0
Notes:
Single indirect
The i_rindirect field of the inode contains the address of an indirect block containing 1024 addresses. These addresses point to disk fragments for each allocation. This method is used for files between 32KB and 4MB (1024 x 4KB) in size.
Single Indirect
The figure shows the inode's indirect field holding a page index in .indirect that locates an indirect page; entries indir[0] through indir[1023] of that page hold the logical volume block numbers of data blocks 0 through 1023. This covers file sizes between 32 KB and 4 MB.
Figure 8-8. Double Indirect BE0070XS4.0
Notes:
Double indirect
The i_rindirect field of the inode points to a double indirect block that contains 512 addresses. These 512 addresses do not point to data; instead, each points to an indirect block of 1024 addresses that point to data blocks (512 x (1024 x 4 KB) = 2 GB). This method is used for files in the range from 4 MB to 2 GB.
With large file support enabled, the graphic still holds true. However, in this case all “data blocks” pointed to through indir[1] ... indir[511] are 32 x 4096 = 131072 bytes long, rather than the default fragment size of 4096 bytes.
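The double-indirect lookup described above can be sketched as a pair of divisions. This is a simplified, hypothetical helper assuming the default 4 KB fragments (no large-file support) and a tree that addresses the file from offset zero: the fragment number splits into an index into the 512-entry double indirect block and an index into the chosen 1024-entry indirect block.

```c
#include <assert.h>

#define FRAG 4096L    /* default fragment size */
#define NIND 1024L    /* entries per indirect block */

struct dbl_path {
    long indir;       /* 0..511: slot in the double indirect block */
    long ind;         /* 0..1023: slot in the chosen indirect block */
};

/* Map a byte offset to its (double indirect, indirect) slot pair. */
struct dbl_path double_indirect_path(long long offset)
{
    long long frag = offset / FRAG;      /* which 4 KB fragment */
    struct dbl_path p;
    p.indir = (long)(frag / NIND);
    p.ind   = (long)(frag % NIND);
    return p;
}
```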
Double Indirect
The figure shows the inode's disk addresses for file sizes greater than 4 MB: the inode's indirect field locates an indirect root (a page index in .indirect) with entries indir[0] through indir[511]; each of these locates an indirect page whose entries ind[0] through ind[1023] hold the logical volume block numbers of the data blocks.
Figure 8-9. Checkpoint BE0070XS4.0
Notes:
Checkpoint
1. An allocation group contains __________ and __________.
2. The basic allocation unit in JFS is a disk block. True or False?
3. The root inode number of a filesystem is always 1. True or False?
4. The last 128 bytes of an in-core JFS inode are a copy of the disk inode. True or False?
5. JFS maps user data blocks and directory information into virtual memory. True or False?
Figure 8-10. Unit Summary BE0070XS4.0
Notes:
Unit Summary
The principal components of the JFS are allocation groups, inodes, data blocks, and indirect blocks.
A JFS allocation group contains inodes and related data blocks.
A JFS in-core inode contains the disk inode data together with activity information, such as open count and in-core inode state information. The state information indicates whether the structure is active or available for reuse.
JFS accomplishes I/O by mapping all file system information into virtual memory, thus relying on VMM to do the actual I/O operations.
Unit 9. Enhanced Journaled File System
What This Unit Is About
This unit is about the internal structures of the Enhanced Journaled File System (JFS2).
What You Should Be Able to Do
After completing this unit, you should be able to:
• List the difference between the terms aggregate and fileset
• Identify the various data structures that make up the JFS2 file system
• Use the fsdb command to trace the various data structures that make up files and directories.
How You Will Check Your Progress
Accountability:
• Exercises using your lab system.
References
AIX Documentation: System Management Guide: Operating System and Devices
Figure 9-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
List the difference between the terms aggregate and fileset.
Identify the various data structures that make up the JFS2 filesystem.
Use the fsdb command to trace the various data structures that make up files and directories.
Figure 9-2. Numbers BE0070XS4.0
Notes:
Introduction
The Enhanced Journaled File System (JFS2) is an extent-based journaled file system. It is the default file system for the 64-bit kernel of AIX 5L. The table above lists some general information about JFS2.
Numbers
Function Value
Block size: 512 - 4096 bytes (configurable)
Architectural max. file size (this is not the supported size!): 4 Petabytes
Max. file system size (supported): 1 Terabyte (16 Terabytes on AIX 5.2)
Max. file size (supported): 1 Terabyte (16 Terabytes on AIX 5.2)
Number of inodes: Dynamic, limited by disk space
Directory organization: B+ tree
Figure 9-3. Aggregate and Fileset BE0070XS4.0
Notes:
Introduction
The term aggregate is defined in this section. The layout of a JFS2 aggregate is also described.
Definitions
JFS2 separates the notion of a disk space allocation pool, called an aggregate, from the notion of a mountable file system sub-tree, called a fileset. The rules that define aggregates and filesets in JFS2 are listed above in the visual.
Aggregate block size
An aggregate has a fixed block size (number of bytes per block) that is defined at configuration time. The aggregate block size defines the smallest unit of space allocation supported on the aggregate. The block size cannot be altered, and must be
Aggregate and Fileset
There is exactly one aggregate per logical volume.
There may be multiple filesets per aggregate
Currently only one fileset per aggregate is supported.
The meta-data has been designed to support multiple filesets, and this feature may be introduced in a future release of AIX 5L
Aggregate Block Size:
- 512 bytes
- 1024 bytes
- 2048 bytes
- 4096 bytes
no smaller than the physical block size (currently 512 bytes). Legal aggregate block sizes are:
- 512 bytes
- 1024 bytes
- 2048 bytes
- 4096 bytes
Do not confuse aggregate block size with the logical volume block size, which defines the smallest unit of I/O.
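The rule above is easy to check programmatically: the legal aggregate block sizes are exactly the powers of two from 512 through 4096 bytes. The helper below is a hypothetical validation sketch, not a JFS2 routine.

```c
#include <assert.h>

/* A block size is legal when it is a power of two in [512, 4096]. */
int legal_aggregate_block_size(int bsize)
{
    return bsize >= 512 && bsize <= 4096 &&
           (bsize & (bsize - 1)) == 0;   /* power-of-two test */
}
```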
Figure 9-4. Aggregate BE0070XS4.0
Notes:
Aggregate layout
The diagram above and the table below details the layout of the aggregate.
Part Function
Reserved area: The first 32 KB is not used by JFS2. The first block is used by the LVM.
Primary aggregate superblock
The primary aggregate superblock (defined as a struct superblock) contains aggregate-wide information, such as the:
• Size of the aggregate.
• Size of allocation groups.
• Aggregate block size.
Aggregate
Note: the aggregate block size is 1 KB in this example. The figure shows the reserved area (aggregate blocks 0 - 31, the first block reserved for the LVM), the primary aggregate superblock at block 32, a secondary control page, and the first extent of the Aggregate Inode Table (32 inodes, 16 KB, starting at block 36) together with the first extent of the Aggregate Inode Allocation Map: an IAG (iagnum 0) holding a control section, working map, persistent map, and ixd section (length[0]: 16, addr[0]: 44). Aggregate inode #1 ("self") addresses the inode table itself (offset 0, addr 36, length 8); inode #2 addresses the block map (offset 0, addr 64, length 16); inode #16 addresses fileset 0 (offset 0, addr 240, length 8, with a further extent at offset 8192, addr 10284, length 4); and inode #17 addresses fileset 1 (offset 0, addr 5992, length 8). Each inode carries xad entries along with owner, permission, and size attributes.
Secondary aggregate superblock
The secondary aggregate superblock is a direct copy of the primary aggregate superblock. The secondary aggregate superblock is used if the primary aggregate superblock is corrupted. Both primary and secondary superblocks are located at fixed locations. This allows the superblocks to be found without depending on any other information.
Aggregate inode table: Contains inodes that describe the aggregate-wide control structures. Inodes will be described later.
Secondary aggregate inode table
Contains replicated inodes from the aggregate inode table. Since the inodes in the aggregate inode table are critical for finding file system information they are replicated in the secondary aggregate inode table. The actual data for the inodes will not be repeated, just the addressing structures used to find the data and the inode itself.
Aggregate inode allocation map
Describes the aggregate inode table. It contains allocation state information on the aggregate inodes as well as their on-disk location.
Secondary aggregate inode allocation map
Describes the secondary aggregate inode table.
Block allocation map
Describes the control structures for allocating and freeing aggregate disk blocks within the aggregate. The block allocation map maps one-to-one with the aggregate disk blocks.
fsck working space
Provides space for fsck to track the aggregate block allocations. This space is necessary because, for a very large aggregate, there might not be enough memory to track this information when fsck is run. The space is described by the superblock; one bit is needed for every aggregate block. The fsck working space always exists at the end of the aggregate.
In-line log: Provides space for logging the metadata changes of the aggregate. The space is described by the superblock. The in-line log always exists after the fsck working space.
Aggregate inodes
When the aggregate is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. Each of these aggregate inodes describes certain aspects of the aggregate itself, as follows:
Inode # Description
0 Reserved
1 Called the "self" inode, this inode describes the aggregate disk blocks comprising the aggregate inode map. This is a circular representation, in that aggregate inode one is itself in the file that it describes. The obvious circular representation problem is handled by forcing at least the first aggregate inode extent to appear at a well-known location, namely, 4 KB after the primary aggregate superblock. Therefore, JFS2 can easily find aggregate inode one, and from there it can find the rest of the aggregate inode table by following the B+-tree in inode one.
2 Describes the block allocation map.
3 Describes the in-line log when mounted. This inode is allocated, but no data is saved to disk.
4 - 15 Reserved for future extensions.
16 - Starting at aggregate inode 16, there is one inode per fileset (the fileset allocation map inode). These inodes describe the control structures that represent each fileset. As additional filesets are added to the aggregate, the aggregate inode table itself may have to grow to accommodate additional fileset inodes. Note that as of the AIX 5.2 release there can only be one fileset. The preceding graphic shows a fileset 17; this is included to show design potential, and is not realizable at present.
Figure 9-5. Allocation Group BE0070XS4.0
Notes:
Introduction
Allocation Groups (AG) divide the space on an aggregate into chunks. Allocation groups are used for heuristics only. Allocation groups allow JFS2 resource allocation policies to use well known methods for achieving good JFS2 I/O performance.
Allocation policies
When locating data on the disk, JFS2 will attempt to:
- Group disk blocks for related data and inodes close together.
- Distribute unrelated data throughout the aggregate.
Allocation Group
The maximum number of allocation groups per aggregate is 128.
The minimum allocation group size is 8192 aggregate blocks.
The allocation group size must always be a power-of-2 multiple of the number of blocks described by one dmap page (for example, 1, 2, 4, 8, ... dmap pages).
Allocation group sizes
An allocation group size must be selected that yields allocation groups sufficiently large to provide for contiguous resource allocation over time. The allocation group size is stored in the aggregate superblock. The rules for setting the allocation group size are shown in the visual on the previous page.
Partial allocation group
An aggregate whose size is not a multiple of the allocation group size contains a partial allocation group that is not fully covered by disk blocks. This partial allocation group is treated as a complete allocation group, except that the non-existent disk blocks are marked as allocated in the Block Allocation Map.
Uempty
Figure 9-6. Fileset BE0070XS4.0
Notes:
Introduction
A fileset is a set of files and directories that form an independently mountable sub-tree, equivalent to a UNIX file system file hierarchy. A fileset is completely contained within a single aggregate. The visual illustration above and the table below detail the layout of a fileset.
Part Function
Fileset inode table: Contains inodes describing the fileset-wide control structures. The Fileset Inode Table logically contains an array of inodes.
Fileset inode allocation map: Describes the Fileset Inode Table. It contains allocation state information on the fileset inodes, as well as their on-disk location.
Fileset
The figure shows the on-disk structures of fileset #0: the Fileset Inode Table (first extent at aggregate block 248, second extent at block 10284), and the Fileset Inode Allocation Map with its control page (at block 240) and two IAG entries. The first IAG (iagnum 0, first extent) holds a control section, working map, persistent map, and ixd section (length[0]: 16, addr[0]: 248), with inofree: 1, extfree: 1, numinos: 32, numfree: 28 recorded in the AG free inode list; the second IAG (iagnum 1, second extent) is the first entry of the IAG free list (iagfree: -1). The fileset's root directory is fileset inode #2 (owner root, size 4096, idotdot: 2).
Inodes
Every JFS2 object is represented by an inode, which contains the expected object-specific information such as time stamps and file type (regular or directory). Each inode also "contains" a B+-tree to record the allocation of extents. Note that all JFS2 metadata structures (except for the superblock) are represented as "files." By reusing the inode structure for this data, the data format (on-disk layout) becomes inherently extensible.
Figure 9-7. Inode Allocation Map BE0070XS4.0
Notes:
Super inode
Super inodes, found in the aggregate inode table (inode #16 and greater), describe the fileset inode allocation map and other fileset information. Since the aggregate inode table is replicated, there is also a secondary version of this inode, which points to the same data.
Inodes
Every file and directory in a fileset is described by an on-disk inode. When the fileset is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. The inodes in a fileset are allocated as shown above in the visual.
Inode Allocation Map
The figure shows the first inode extent of the fileset as a grid of 32 inodes, numbered 0 through 31.
Fileset Inode # Description
0 Reserved.
1 Additional fileset information that would not fit in the fileset allocation map inode in the aggregate inode table.
2 The root directory inode for the fileset.
3 The ACL file for the fileset.
4 - Fileset inodes from four onward are used by ordinary fileset objects: user files, directories, and symbolic links.
Figure 9-8. Extents BE0070XS4.0
Notes:
Introduction
Disk space in a JFS2 file system is allocated in a sequence of contiguous aggregate blocks called an extent.
Extents
The xad structure (from /usr/include/j2/j2_xtree.h) is:

struct xad {
        uint8  xad_flag;
        uint16 xad_reserved;
        uint40 xad_offset;
        uint24 xad_length;
        uint40 xad_address;
};

The figure shows the xads for a file, each (flag, reserved, offset, len, addr) entry pointing at a run of file system disk blocks; for example, an extent of length 3 at disk block 101, one of length 4 at block 503, and one of length 2 at block 856.
Extent rules
An extent:
- Is made up of a series of contiguous aggregate blocks.
- Is variable in size, ranging from 1 to 2^24 - 1 aggregate blocks.
- Is wholly contained within a single aggregate.
- Is indexed in a B+-tree.
- May, if large, span multiple allocation groups.
Extent allocation descriptor
Extents are described in an xad structure (a 16 byte structure). The two main values describing an extent are its length, and its address. In an xad, both the length and address are expressed in units of the aggregate block size. Details of the xad data structure are shown in the visual on the previous page.
xad description
The elements of the xad structure are described in this table.
Member Description
xad_flag: Flags set on this extent. See /usr/include/j2/j2_xtree.h for a list of flags.
xad_reserved: Reserved for future use.
xad_offset: Extents are generally grouped together to form a larger group of disk blocks. The xad_offset describes the logical block offset this extent represents in the larger group.
xad_length: A 24-bit field containing the length of the extent in aggregate blocks. An extent can range in size from 1 to 2^24 - 1 aggregate blocks.
xad_address: A 40-bit field containing the address of the first block of the extent. The address is in units of aggregate blocks, and is the block offset from the beginning of the aggregate.
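The 24-bit and 40-bit field widths above fix the value ranges an xad can express. The helper below is an illustrative sketch of those ranges, not the on-disk encoding.

```c
#include <assert.h>

/* Maximum values expressible in the xad size fields. */
#define XAD_LEN_MAX  ((1LL << 24) - 1)   /* 24-bit length, in blocks  */
#define XAD_ADDR_MAX ((1LL << 40) - 1)   /* 40-bit address, in blocks */

/* Byte length covered by an extent of `len` blocks on an aggregate
 * with block size `bsize` (hypothetical helper). */
long long extent_bytes(long long len, int bsize)
{
    return len * (long long)bsize;
}
```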
Figure 9-9. Increasing an Allocation BE0070XS4.0
Notes:
Introduction
In general, the allocation policy for JFS2 tries to maximize contiguous allocation by allocating a minimum number of extents, keeping each extent as large and contiguous as possible. This allows for larger I/O transfer, resulting in improved performance.
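The policy above can be sketched as follows. This is a hypothetical illustration, not a JFS2 routine: when a file grows and the blocks just past an existing extent happen to be free, the extent is grown in place; otherwise a separate extent must be allocated.

```c
#include <assert.h>

/* Illustrative extent descriptor. */
struct ext {
    long long addr, len, off;   /* in aggregate blocks */
};

/* Grow `e` by `more` blocks if the blocks following it are free
 * (decided here by the caller-supplied flag).  Returns 1 on success;
 * on failure the caller must allocate a separate extent instead. */
int grow_extent(struct ext *e, long long more, int next_is_free)
{
    if (!next_is_free)
        return 0;
    e->len += more;    /* one larger extent: bigger contiguous I/O */
    return 1;
}
```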
Increasing an Allocation
The figure contrasts a Before and an After state for a file occupying 100 disk blocks. Where the following file system disk blocks are free, the extent (offset=0, len=100, addr=101) is simply grown in place to (offset=0, len=200, addr=101), maximizing contiguous allocation. Where they are not, the original extent (offset=0, len=100, addr=701) is kept and a second extent (offset=100, len=100, addr=1001) is added.
Exceptions
In special cases, it is not possible to keep extent allocation contiguous. For example, a copy-on-write clone of a segment will cause a contiguous extent to be partitioned into a sequence of smaller contiguous extents. Another case is restriction of the extent size. For example, the extent size is restricted for compressed files, since the entire extent must be read into memory and decompressed; because only a limited amount of memory is available, there must be enough room for the decompressed extent.
Fragmentation
The user can configure a JFS2 aggregate with a small aggregate block size of 512 bytes to minimize internal fragmentation for aggregates with large numbers of small size files.
The defragfs utility can be used to defragment a JFS2 file system.
Figure 9-10. Binary Tree of Extents BE0070XS4.0
Notes:
Introduction
Objects in JFS2 are stored in groups of extents arranged in binary trees. The concepts of binary trees are introduced in this section.
Trees
Binary trees consist of nodes arranged in a tree structure. Each node contains a header describing the node. A flag in the node header identifies the role of the node in the tree.
As we will show in subsequent material, these headers reside in the second inode quadrant and in 4KB blocks referenced by the inode.
Binary Tree of Extents
The figure shows a root node whose header carries flags=BT_ROOT, internal nodes with flags=BT_INTERNAL, and leaf nodes with flags=BT_LEAF. Each node holds an array of extent descriptors (xads); internal nodes point down to leaf nodes, and the leaves describe the extents themselves.
Uempty
Header flags
This table describes the binary tree header flags:
Why B+-tree?
B+–trees are used in JFS2, and help performance by:
- Providing fast reading and writing of extents; the most common operations.
- Providing fast search for reading a particular extent of a file.
- Providing efficient append or insert of an extent in a file.
- Being efficient for traversal of an entire B+–tree.
B+-tree index
There is one generic B+–tree index structure for all index objects in JFS2 (except for directories). The data being indexed depends upon the object. The B+–tree is keyed by the offset of the xad structure of the data being described by the tree. The entries are sorted by the offsets of the xad structures, each of which is an entry in a node of a B+–tree.
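Because the entries in a node are sorted by offset, finding the extent that covers a given file block reduces to a binary search. The sketch below uses an illustrative xad representation (plain long long fields rather than the packed on-disk layout) to show the idea.

```c
#include <assert.h>

/* Illustrative, unpacked xad: offset, length, address in blocks. */
struct xad_s {
    long long off, len, addr;
};

/* Return the index of the xad whose [off, off+len) range covers
 * `block`, or -1 if no extent covers it (a hole in the file). */
int xad_lookup(const struct xad_s *x, int n, long long block)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (block < x[mid].off)
            hi = mid - 1;
        else if (block >= x[mid].off + x[mid].len)
            lo = mid + 1;
        else
            return mid;
    }
    return -1;
}
```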
Flag Description
BT_ROOT: The root or top of the tree.
BT_LEAF: The bottom of a branch of the tree. Leaf nodes point to the extents containing the object's data.
BT_INTERNAL: An internal node points to two or more leaf nodes or other internal nodes.
Figure 9-11. Inodes BE0070XS4.0
Notes:
Overview
Every file on a JFS2 file system is described by an on-disk inode. The inode holds the root header for the extent binary tree. File attribute data and block allocation maps are also kept in the inode.
Inode layout
The inode is a 512 byte structure, split into four 128 byte sections. The sections of the inode are described in this table.
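The fixed 512-byte / four-section layout gives each quadrant a fixed byte offset within the inode. A small hypothetical helper makes this concrete:

```c
#include <assert.h>

#define DINODE_SIZE  512
#define SECTION_SIZE 128

/* Byte offset of section 1..4 within the on-disk inode, -1 if out of
 * range (hypothetical helper, not a JFS2 routine). */
int inode_section_offset(int section)
{
    if (section < 1 || section > 4)
        return -1;
    return (section - 1) * SECTION_SIZE;
}
```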
Section Description
1. This section describes the POSIX attributes of the JFS2 object, including the inode and fileset number, object type, object size, user ID, group ID, access time, modified time, created time, and more.
Inodes
The figure shows the four-section inode layout: Section 1 holds the POSIX attributes; Section 2 holds extended attributes, block allocation maps, inode allocation maps, and headers describing the inode data; Section 3 holds in-line data or xads; Section 4 holds extended attributes, more in-line data, or additional xads.
2. This section contains several parts:
• Descriptors for extended attributes.
• Block allocation maps.
• Inode allocation maps.
• Header pointing to the data (B+-tree root, directory, in-line data).
3. This section can contain one of the following:
• In-line file data for very small files (up to 128 bytes).
• The first eight xad structures describing the extents for this file.
4. This section extends section 3 by providing additional storage for more attributes, xad structures, or in-line data.
Design vs. implementation
Currently section 4 is not used, and section 3 is used only for extent information. The in-line data function of JFS2 is not currently enabled.
Structure
The current definition of the on-disk inode structure is:
struct dinode {
    /*
     * I. base area (128 bytes)
     * ------------------------
     * define generic/POSIX attributes
     */
    ino64_t  di_number;    /* 8: inode number, aka file serial number */
    uint32   di_gen;       /* 4: inode generation number */
    uint32   di_fileset;   /* 4: fileset #, inode # of inode map file */
    uint32   di_inostamp;  /* 4: stamp to show inode belongs to fileset */
    uint32   di_rsv1;      /* 4: */

    pxd_t    di_ixpxd;     /* 8: inode extent descriptor */

    int64    di_size;      /* 8: size */
    int64    di_nblocks;   /* 8: number of blocks allocated */

    uint32   di_uid;       /* 4: uid_t user id of owner */
    uint32   di_gid;       /* 4: gid_t group id of owner */

    int32    di_nlink;     /* 4: number of links to the object */
    uint32   di_mode;      /* 4: mode_t attribute format and permission */

    j2time_t di_atime;     /* 16: time last data accessed */
    j2time_t di_ctime;     /* 16: time last status changed */
    j2time_t di_mtime;     /* 16: time last data modified */
    j2time_t di_otime;     /* 16: time created */

    /*
     * II. extension area (128 bytes)
     * ------------------------------
     */
    /*
     * extended attributes for file system (96);
     */
    ead_t di_ea;           /* 16: ea descriptor */

    union {
        uint8 _data[80];

        /*
         * block allocation map
         */
        struct {
            struct bmap *__bmap;      /* incore bmap descriptor */
        } _bmap;
#define di_bmap _data2._bmap.__bmap

        /*
         * inode allocation map (fileset inode 1st half)
         */
        struct {
            uint32 _gengen;           /* di_gen generator */
            struct inode *__ipimap2;  /* replica */
            struct inomap *__imap;    /* incore imap control */
        } _imap;
    } _data2;
#define di_gengen  _data2._imap._gengen
#define di_ipimap2 _data2._imap.__ipimap2
#define di_imap    _data2._imap.__imap

    /*
     * B+-tree root header (32)
     *
     * B+-tree root node header, or dtroot_t for directory,
     * or data extent descriptor for inline data;
     * N.B. must be on 8-byte boundary.
     */
    union {
        struct {
            int32 _di_rsrvd[4];       /* 16: */
            dxd_t _di_dxd;            /* 16: data extent descriptor */
        } _xd;
        int32   _di_btroot[8];        /* 32: xtpage_t or dtroot_t */
        ino64_t _di_parent;           /* 8: idotdot in dtroot_t */
    } _data2r;
#define di_dxd    _data2r._xd._di_dxd
#define di_btroot _data2r._di_btroot
#define di_dtroot _data2r._di_btroot
#define di_xtroot _data2r._di_btroot
#define di_parent _data2r._di_parent

    /*
     * III. type-dependent area (128 bytes)
     * ------------------------------------
     * B+-tree root node xad array or inline data
     */
    union {
        uint8 _data[128];
#define di_inlinedata _data3._data

        /*
         * regular file or directory
         *
         * B+-tree root node/inline data area
         */
        struct {
            uint8 _xad[128];
        } _file;

        /*
         * device special file
         */
        struct {
            dev64_t _rdev;            /* 8: dev_t device major and minor */
        } _specfile;
#define di_rdev _data3._specfile._rdev

        /*
         * symbolic link.
         *
         * link is stored in inode if its length is less than
         * IDATASIZE. Otherwise stored like a regular file.
         */
        struct {
            uint8 _fastsymlink[128];
        } _symlink;
#define di_fastsymlink _data3._symlink._fastsymlink
    } _data3;

    /*
     * IV. type-dependent extension area (128 bytes)
     * ---------------------------------------------
     * user-defined attribute, or
     * inline data continuation, or
     * B+-tree root node continuation
     */
    union {
        uint8 _data[128];
#define di_inlineea _data4._data
    } _data4;
};

typedef struct dinode dinode_t;
Allocation policy
JFS2 allocates inodes dynamically, which provides the following advantages:
- Allows placement of inode disk blocks at any disk address, which decouples the inode number from the location. This decoupling simplifies supporting aggregate and fileset reorganization (to enable shrinking the aggregate). The inodes can be moved and still retain the same number, which makes it unnecessary to search the directory structure to update the inode numbers.
- There is no need to allocate "ten times as many inodes as you will ever need," as with file systems that contain a fixed number of inodes; thus, file system space utilization is optimized. This is especially important with the larger inode size of 512 bytes in JFS2.
- File allocation for large files can consume multiple allocation groups and still be contiguous. Static allocation forces a gap containing the initially allocated inodes in each allocation group. With dynamic allocation, all the blocks contained in an allocation group can be used for data.
Dynamic inode allocation causes a number of problems, including:
- With static allocation, the geometry of the file system implicitly describes the layout of inodes on disk; with dynamic allocation, separate mapping structures are required.
- The inode mapping structures are critical to JFS2 integrity. Because of the overhead involved in replicating these structures, we accept the risk of losing the maps themselves; however, the B+-tree structures that locate them are replicated, which allows us to find the maps again.
Inode extents
Inodes are allocated dynamically by allocating inode extents, which are simply contiguous chunks of inodes on the disk. By definition, a JFS2 inode extent contains 32 inodes. With a 512-byte inode size, an inode extent occupies 16 KB on the disk.
Inode initialization
When a new inode extent is allocated, the inodes in the extent are initialized, i.e. their inode numbers and extent addresses are set, and the mode and link count fields are set to zero. Information about the inode extent is also added to the inode allocation map.
Inode allocation map
Dynamic inode allocation implies that there is no direct relationship between an inode number and the disk address of the inode. Therefore we must have a means of finding the inodes on disk. The inode allocation map provides this function.
Inode generation numbers
Inode generation numbers are simply counters that will increment each time an inode is reused. Network file system protocols, such as NFS, (implicitly) require them; they form part of the file identifier manipulated by VNOP_FID() and VFS_VGET().
The static inode allocation practice of storing a per-inode generation counter will not work with dynamic inode allocation, because when an inode becomes free its disk space may literally be reused for something other than an inode (e.g., the space may be reclaimed for ordinary file data storage). Therefore, in JFS2, there is simply one inode generation counter that is incremented on every inode allocation (rather than one counter per inode that would be incremented when that inode is reused).
Although a fileset-wide generation counter will recycle faster than a per-inode generation counter, a simple calculation shows that the 32-bit value is still sufficient to meet NFS or DFS requirements.
Figure 9-12. Inline Data BE0070XS4.0
Notes:
In-line data
If a file contains small amounts of data, the data may be stored in the inode itself. This is called in-line storage. The header found in the second section of the inode points to the data that is stored in the third and fourth section of the inode. This design feature has not been implemented yet.
Figure 9-13. Binary Trees BE0070XS4.0
Notes:
Binary trees
When more storage is needed than can be provided in-line, the data must be placed in extents. The header in the inode now becomes the binary tree root header. If there are 8 or fewer extents for the file, then the xad structures describing the extents are contained in the inode. An inode containing 8 or fewer xad structures would look like the figure shown above.
INLINEEA bit
Once the 8 xad structures in the inode are filled, an attempt is made to use the last quadrant of the inode for more xad structures. If the INLINEEA bit is set in the di_mode field of the inode, then the last quadrant of the inode is available for 8 more xad structures. This design feature has not been implemented yet.
[Figure: Binary Trees — the inode holds a B+-tree header followed by in-line xad entries describing three data extents: (offset: 0, addr: 68, length: 16), (offset: 84, addr: 4096, length: 48), and (offset: 256, addr: 26624, length: 48).]
Figure 9-14. More Extents BE0070XS4.0
Notes:
More extents
Once all of the available xad structures in the inode are used, the B+–tree must be split. 4 KB of disk space is allocated for a leaf node of the B+–tree, which is logically an array of xad entries with a header. The 8 xad entries are moved from the inode to the leaf node, and the header is initialized to point to the 9th entry as the first free entry. The first xad structure in the inode is updated to point to the newly allocated leaf node, and the inode header is updated to indicate that only one xad structure is now being used, and that it contains the pure root of a B+-tree. The offset for this new xad structure contains the offset of the first entry in the leaf node.
The organization of the inode now looks like the figure above.
[Figure: More Extents — the inode's B+-tree header is followed by a single xad (offset: 0, addr: 412, length: 4) pointing at a 4 KB leaf node at block 412; the leaf holds a header plus up to 254 xad entries, the first eight copied from the inode, describing the 16 KB, 48 KB, and 8 KB data extents at blocks 68, 4096, and 26624. The inode's remaining xad slots are zeroed (offset: 0, addr: 0, length: 0).]
Figure 9-15. Continuing to Add Extents BE0070XS4.0
Notes:
Continuing to add extents
As new extents are added to the file, they continue to be added to the leaf node in the necessary order until the node fills. Once the node fills, an additional 4 KB of disk space is allocated for another leaf node of the B+-tree, and the second xad structure in the inode is set to point to this newly allocated node. The tree now looks like the figure shown above.
[Figure: Continuing to Add Extents — the inode's B+-tree header is now followed by two xads: (offset: 0, addr: 412, length: 4) pointing at the first 4 KB leaf node at block 412, and (offset: 750, addr: 560, length: 4) pointing at a second leaf node at block 560. Each leaf holds a header plus up to 254 xad entries; the remaining inode xad slots are zeroed.]
Figure 9-16. Another Split BE0070XS4.0
Notes:
Another split
As extents are added to the inode, this behavior continues until all 8 xad structures in the inode contain leaf node xad structures, at which time another split of the B+–tree will occur. This split creates an internal node of the B+–tree, which is used purely to route the searches of the tree. An internal node looks exactly like a leaf node. 4 KB of disk space is allocated for the internal node of the B+–tree, the 8 xads of the leaf nodes are moved from the inode to the newly created internal node, and the internal node header is initialized to point to the 9th entry as the first free entry. The root of the B+–tree is then updated by making the inode’s first xad structure point to the newly allocated internal node, and the header in the inode is updated to indicate that only 1 xad structure is now being used for the B+–tree.
[Figure: Another Split — the inode's B+-tree header is followed by two xads: (offset: 0, addr: 380, length: 4) and (offset: 8340, addr: 212, length: 4), each pointing at a 4 KB internal node (blocks 380 and 212). Each internal node holds a header plus up to 254 xad entries that route searches down to the leaf nodes (blocks 412, 560, ...), which in turn describe the data extents.]
As extents continue to be added, additional leaf nodes are created to contain the xad structures for the extents, and these leaf nodes are added to the internal node. Once the first internal node is filled, a second internal node is allocated, and the inode's second xad structure is updated to point to the new internal node. This behavior continues until all eight of the inode's xad structures contain internal nodes.
Figure 9-17. fsdb Utility BE0070XS4.0
Notes:
Introduction
The fsdb command enables you to examine, alter, and debug a file system.
Starting fsdb
It is best to run fsdb against an unmounted file system. Use the following syntax to start fsdb:

fsdb <path to logical volume>

For example:

# fsdb /dev/lv00
Aggregate Block Size: 512
>
fsdb Utility

# fsdb /dev/lv00
Aggregate Block Size: 512
> help
Xpeek Commands
a[lter] <block> <offset> <hex string>
b[tree] <block> [<offset>]
dir[ectory] <inode number> [<fileset>]
d[isplay] [<block> [<offset> [<format> [<count>]]]]
dm[ap] [<block number>]
dt[ree] <inode number> [<fileset>]
h[elp] [<command>]
ia[g] [<IAG number>] [a | <fileset>]
i[node] [<inode number>] [a | <fileset>]
q[uit]
su[perblock] [p | s]
Supported file systems

fsdb supports both the JFS and JFS2 file systems. The commands available in fsdb differ depending on the file system type it is running against. The following explains how to use fsdb with a JFS2 file system.
Commands
The commands available in fsdb can be viewed with the help command as shown in the visual.
Figure 9-18. Exercise BE0070XS4.0
Notes:
Turn to your lab workbook and complete exercise seven.
Exercise
Complete exercise seven
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
Use the fsdb utility to examine a JFS2 file system
Identify a file's inode number
Identify extent descriptors
Locate the data extents that hold the contents of a file
Figure 9-19. Directory BE0070XS4.0
Notes:
Introduction
In addition to files, an inode can represent a directory. A directory is a journaled meta-data file in JFS2, and is composed of directory entries, which indicate the files and sub-directories contained in the directory.
Directory entry
Stored in an array, the directory entries link the names of the objects in the directory to an inode number. The directory entry is a 32 byte structure and has the members shown here.
Member Description
inumber Inode number.
Directory entry structure definition
Following is the structure definition for a directory entry. It is from /usr/include/j2/j2_dtree.h.
/*
* leaf node entry head/only segment
*/
typedef struct {
ino64_t inumber; /* 8: 4-byte aligned */
int8 next; /* 1: */
uint8 namlen; /* 1: */
#ifdef _J2_UNICODE
UniChar name[11]; /* 22: 2-byte aligned */
#else
char name[22]; /* 22: 2-byte aligned */
#endif
} ldtentry_t; /* (32) */
next If more than 22 characters are needed, additional entries are linked using the next pointer.
namlen Length of the name.
name[22] File name, up to 22 characters per entry.
Figure 9-20. Directory Root Header BE0070XS4.0
Notes:
Root header
In order to improve the performance of locating a specific directory entry, a binary tree sorted by name is used. As with files, the header section of a directory inode contains the binary tree root header. Each header describes an eight element array of directory entries. The root header is a 32 byte structure defined by dtroot_t in /usr/include/j2/j2_dtree.h.
Member Description
idotdot Inode number of parent directory.
flag Indicates if the node is an internal or leaf node, and whether it is the root of the binary tree.
nextindex The last used slot in the directory entry slot array.
freecnt The number of free slots in the directory entry array.
Directory Root Header
typedef union {
    struct {
        ino64_t idotdot;   /* 8: parent inode number */
        int64   rsrvd1;    /* 8: */
        uint8   flag;      /* 1: */
        int8    nextindex; /* 1: next free entry in stbl */
        int8    freecnt;   /* 1: free count */
        int8    freelist;  /* 1: freelist header */
        int32   rsrvd2;    /* 4: */
        int8    stbl[8];   /* 8: sorted entry index table */
    } header;              /* (32) */
    dtslot_t slot[9];
} dtroot_t;
Leaf and internal node header
When more than eight directory entries are needed a leaf or internal node is added. The directory internal and leaf node headers are similar to the root node header, except that they may have up to 128 directory entries (corresponding to a 4096 byte leaf page). The page header is defined by a dtpage_t structure, contained in /usr/include/j2/j2_dtree.h.
freelist The slot number of the head of the free list.
stbl[8] The indices of the directory entry slots currently in use. The entries are sorted alphabetically by name.
slot[9] The array of directory slots. The header occupies the first slot, leaving eight slots for directory entries.
Uempty
Figure 9-21. Directory Slot Array BE0070XS4.0
Notes:
Directory slot array
The directory slot array (stbl[]) is a sorted array of indices of the directory slots that are currently in use. The entries are sorted alphabetically by name. This limits the amount of shifting necessary when directory entries are added or deleted, since the array is much smaller than the entries themselves. A binary search can be used on this array to search for particular directory entries.
Example
In the example shown above, the directory entry table contains four files. The stbl array contains the slot numbers of the entries, ordering them alphabetically.
[Figure: Directory Slot Array — the directory entry table holds hij (slot 1), xyz (slot 2), abc (slot 3), and def (slot 4); STBL[8] = {3,4,1,2,0,0,0,0} lists those slot numbers in alphabetical order of the names.]
. and .. directories
A directory does not contain specific entries for the self (“.”) and parent (“..”) directories. Instead, these will be represented in the inode itself. Self is the directory’s own inode number, and the parent inode number is held in the “idotdot” field in the header.
Growing directory size
As the number of files in the directory grows, the directory tables must increase in size. This table describes the steps used.
Step Action
1. Initial directory entries are stored in the directory inode in-line data area.
2. When the in-line data area of the directory inode becomes full, JFS2 allocates a leaf node the same size as the aggregate block size.
3. When that initial leaf node becomes full and the leaf node is not yet 4 KB, double the current size. First attempt to double the extent in place; if there is not room, allocate a new extent and copy the data from the old extent to the new one. The directory slot array will only have been big enough to reference the slots of the smaller page, so a new slot array must be created. Use the slots from the beginning of the newly allocated space for the larger array, copy the old array data to the new location, update the header to point to this array, and add the slots of the old array to the free list.
4. If the leaf node again becomes full and is still not 4 KB, repeat step 3. Once the leaf node reaches 4 KB, allocate a new leaf node. Every leaf node after the initial one is allocated at 4 KB to start.
5. When all entries in a leaf page are free, the page is removed from the B+-tree. When all the entries in the last leaf page are deleted, the directory shrinks back into the directory inode in-line data area.
Figure 9-22. Small Directory Example BE0070XS4.0
Notes:
Introduction
This section demonstrates how the directory structures change over time.
Small directories
Initial directory entries are stored in the directory inode in-line data area. Examine the example of a small directory. In the example shown above, all the inode information fits into the in-line data area.
Note: the file with a long name has its name split across two slots.
[Figure: Small Directory Example —
Root header: flag: BT_ROOT BT_LEAF  nextindex: 4  freecnt: 3  freelist: 6  idotdot: 2  stbl: {1,2,3,4,0,0,0}
  slot 1: inumber: 69652  next: -1  namelen: 7   name: foobar1
  slot 2: inumber: 69653  next: -1  namelen: 8   name: foobar12
  slot 3: inumber: 69654  next: -1  namelen: 7   name: foobar2
  slot 4: inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
  slot 5: next: -1  cnt: 0  name: 2charsinitsname

# ls -ai
69651 .
    2 ..
69652 foobar1
69653 foobar12
69654 foobar3
69655 longnamedfilewithover22charsinitsname]
Figure 9-23. Adding a File BE0070XS4.0
Notes:
Adding a file
An additional file called “afile” is created. Details for this file are added at the next free slot (slot 6). As this is now, alphabetically, the first file in the directory, the search table array (stbl[]) is reorganized so that the index of slot 6 becomes the first entry.
[Figure: Adding a File —
Root header: flag: BT_ROOT BT_LEAF  nextindex: 5  freecnt: 2  freelist: 7  idotdot: 2  stbl: {6,1,2,3,4,0,0,0}
  slot 1: inumber: 69652  next: -1  namelen: 7   name: foobar1
  slot 2: inumber: 69653  next: -1  namelen: 8   name: foobar12
  slot 3: inumber: 69654  next: -1  namelen: 7   name: foobar2
  slot 4: inumber: 69655  next: 5   namelen: 37  name: longnamedfilewithover2
  slot 5: next: -1  cnt: 0  name: 2charsinitsname
  slot 6: inumber: 69656  next: -1  namelen: 5   name: afile

# ls -ai
69651 .
    2 ..
69656 afile
69652 foobar1
69653 foobar2
69654 foobar3
69655 longnamedfilewithover22charsinitsname]
Figure 9-24. Adding a Leaf Node BE0070XS4.0
Notes:
Adding a leaf node
When the directory grows to the point where there are more entries than can be stored in the in-line data area of the inode, then JFS2 allocates a leaf node the same size as the aggregate block size. The in-line entries are moved to a leaf node, as illustrated above.
Once the leaf is full, an internal node is added at the next free in-line data slot in the inode, which will contain the address of the next leaf node.
Note: the internal node entry contains the name of the first file (in alphabetical order) for that leaf node.
[Figure: Adding a Leaf Node —
Inode root header: flag: BT_ROOT BT_INTERNAL  nextindex: 1  freecnt: 7  freelist: 2  idotdot: 2  stbl: {1,2,3,4,5,6,7,8}
  slot 1: xd.len: 1  xd.addr1: 0  xd.addr2: 52  next: -1  namelen: 0  name: file0

Leaf node, block 52: flag: BT_LEAF  nextindex: 20  freecnt: 103  freelist: 25  maxslot: 128  stbl: {1,2,15, ... 8,13,14}
  slot 1:  inumber: 5   next: -1  namelen: 5  name: file0
  slot 2:  inumber: 6   next: -1  namelen: 5  name: file1
  slot 3:  inumber: 15  next: -1  namelen: 6  name: file10
  ...
  slot 19: inumber: 23  next: -1  namelen: 6  name: file18
  slot 20: inumber: 24  next: -1  namelen: 6  name: file19]
Figure 9-25. Adding an Internal Node BE0070XS4.0
Notes:
Adding an internal node
Once all the in-line slots have been filled by internal nodes, a separate node block is allocated, the entries from the in-line data slots are moved to this new node, and the first in-line data slot updated with the address of the new internal node.
After many extra files have been added to the directory, two layers of internal nodes are required to reference all the files.
Note that the internal node entries in the inode now contain the name of the alphabetically first entry referenced by each of the second-level internal nodes, and each entry in those nodes references the name of the alphabetically first entry in its leaf node.
[Figure: Adding an Internal Node —
Inode root header: flag: BT_ROOT BT_INTERNAL  nextindex: 4  freecnt: 4  freelist: 5  idotdot: 2  stbl: {1,3,4,2,...}
  slot 1: xd.len: 1  xd.addr1: 0  xd.addr2: 118  next: -1  namelen: 0  name: file0

Second-level internal node, block 118: flag: BT_INTERNAL  nextindex: 64  freecnt: 59  freelist: 76  maxslot: 128  stbl: {1,19,18, ... 7,8}
  slot 1: xd.len: 1  xd.addr2: 52    name: file0
  slot 2: xd.len: 1  xd.addr2: 1204  name: file4845
  slot 3: xd.len: 1  xd.addr2: 1991  name: file13833
  slot 4: xd.len: 1  xd.addr2: 2609  name: file17723
  ...
  slot 126: xd.addr2: 1473  name: file1472
  slot 127: xd.addr2: 1472  name: file1017

Leaf node, block 52: flag: BT_LEAF  nextindex: 64  freecnt: 59  freelist: 21  maxslot: 128  stbl: {1,2,15 ... 113,112}
  slot 1: inumber: 5  next: -1  namelen: 5  name: file0
  slot 2: inumber: 6  next: -1  namelen: 5  name: file1
  slot 3: inumber: 15  next: -1  namelen: 6  name: file10
  ...
  slot 126: inumber: 10057  namelen: 9  name: file10052
  slot 127: inumber: 10041  namelen: 9  name: file10036]
Figure 9-26. Checkpoint BE0070XS4.0
Notes:
Checkpoint
There is ____ aggregate per logical volume.
An allocation group is at least ____aggregate blocks.
The number of inodes in a JFS2 file system is fixed. True or False?
The data contents of a file is stored in objects called _____.
A single extent can be up to ____ in size.
A JFS2 directory contains directory entries for the . and .. directories. True or False?
Figure 9-27. Exercise BE0070XS4.0
Notes:
Turn to your lab workbook and complete exercise eight.
Exercise
Complete exercise eight
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
Use fsdb to examine the structures of directories in a JFS2 file system
Figure 9-28. Unit Summary BE0070XS4.0
Notes:
Unit Summary
Aggregate is a pool of space allocated to filesets
A fileset is a mountable file system
The contents of files and directories are stored in extents
Extents are arranged in B+ trees for fast file and directory traversal
Unit 10. Kernel Extensions

What This Unit Is About
This unit describes how the AIX 5L kernel is dynamically extended.
What You Should Be Able to Do
After completing this unit, you should be able to
• List the 3 uses for kernel extensions
• Build a kernel extension from scratch
• Compose an export file
• Create an extended system call
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
References
AIX Documentation: Kernel Extensions and Device Support Programming Concepts
AIX Documentation: Technical Reference: Kernel and Subsystems, Volume 1
AIX Documentation: Technical Reference: Kernel and Subsystems, Volume 2
Figure 10-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this lesson you should be able to:
List the 3 uses for kernel extensions
Build a kernel extension from scratch
Compose an export file
Create an extended system call
Figure 10-2. Kernel Extensions BE0070XS4.0
Notes:
Introduction
The AIX kernel is dynamically extensible and can be extended by adding additional routines called kernel extensions. A kernel extension could best be described as a dynamically loadable module that adds functionality to the kernel.
Kernel protection domain
These modules are extensions to the kernel in the sense that they run within the protection domain of the kernel. User-level code can only access kernel extensions through the system call interface. Kernel extensions add extensibility, configurability, and ease of system administration to AIX.
Kernel Extensions
Kernel extensions can include:
Device drivers
System calls
Virtual file systems
Kernel processes
Other device driver management routines
Kernel extensions run within the protection domain of the kernel
Extensions can be loaded into the kernel during:
system boot
runtime
Extensions can be removed at runtime
Loading extensions
Extensions can be added at system boot or while the system is in operation. Extensions are loaded and removed from the running kernel using the sysconfig() system call.
Advantages
Allowing kernel extensions to be loaded and unloaded allows a system administrator to customize a system for particular environments and applications. Rather than bundling all possible options into the kernel at compile time (and creating a large kernel), kernel extensions allow maximum flexibility. The option of loading and unloading kernel extensions at runtime increases system availability and ease of use. In addition, development time is reduced since a new kernel does not have to be compiled and installed for each development cycle.
Disadvantages
Importing new code into the kernel allows the possibility of an unlimited number of runtime errors to be introduced into the system. Such issues as execution environment, path length, pageability, and serialization must be taken into account when writing extensions to the kernel.
Figure 10-1. Relationship With the Kernel Nucleus BE0070XS4.0
Notes:
Kernel Components
The schematic above illustrates the relationship of kernel extensions to the kernel nucleus.
Relationship With the Kernel Nucleus

[Figure: commands invoke system calls across the kernel protection boundary into the system call interface. Inside the kernel, the nucleus kernel services sit alongside the extensions: virtual file systems, device drivers, extended system calls, private routines, and extended kernel services.]
Figure 10-2. Global Kernel Name Space BE0070XS4.0
Notes:
Introduction
This section describes how symbol names are shared between the kernel and kernel extensions.
Name space
The kernel contains many functions and storage locations that are represented by symbols. The set of symbols used by the kernel makes up the kernel’s name space. Some of these symbols are private to the parts of the kernel that use them. Some of these symbols are made available for other parts of the kernel and kernel extensions to use.
Global Kernel Name Space

[Figure: the core kernel services (/unix) and the extended kernel services export symbols into the global kernel name space; the kernel's exports are listed in /usr/lib/kernex.exp. Kernel extensions (device drivers, extended system calls, and other kernel extensions) import symbols from, and may export symbols to, this name space.]
Exported symbols

The kernel makes symbols available to kernel extensions by exporting them. If a kernel extension or other program wants to reference these symbols, it must import them. Extensions can make symbols they define visible to other extensions by exporting those symbols.
Kernel exports file
The purpose of the kernel exports file is to list the symbols exported by the kernel. The kernel exports file is imported by the kernel extension when the linker command (ld) is run. The linker uses the kernel export file to resolve the kernel symbols used by the kernel extension code. The kernel export file is /usr/lib/kernex.exp.
Exports file format
The first line of the kernel export file indicates the binary where the symbols are being exported from. In the case of the kernel exports file, they are exported from the /unix binary. The remainder of the file lists the symbols that are exported.
Export file
The kernel export file has the following format:
    #!/unix
    * list of kernel exports
    devswadd
    devswchg
    devswdel
    devswqry
    devwrite
    e_assert_wait
    e_block_thread
    e_clear_wait
System calls
There is an additional file that lists the system calls that are exported from the kernel (/usr/lib/syscalls.exp).
The format of the syscalls.exp file is similar to that of the kernel exports file, except for an additional tag for each system call. This tag indicates the ability of the system call to interact with 64-bit processes. Here is a fragment of the syscalls.exp file and a description of the tags.
    absinterval syscall3264
    access syscall3264
    accessx syscall3264
    acct syscall3264
    adjtime syscall3264
    ..
Tag          Description
syscall      This system call does not pass any arguments by reference (address).
syscall32    This system call is a 32-bit system call and passes 32-bit addresses.
syscall64    This system call is only available in the 64-bit kernel.
syscall3264  This system call supports both 32-bit and 64-bit applications.
Figure 10-3. Why Export Symbols? BE0070XS4.0
Notes:
Introduction
Kernel extensions can export symbols that are defined by the extension, which makes these symbols available for reference outside the kernel extension. Symbols are exported by creating an export file.
All symbols within a kernel extension remain private by default. This means that other kernel extensions cannot use the routines and variables within the extension. This default action can be changed by creating an export file for the extension. The export file lists the symbols you want exported from the kernel extension. The format of an exports file is identical to the format of an imports file: an exports file for one kernel extension can be used as an import file by any other kernel extension that wishes to use its exported symbols. Any symbols exported by a kernel extension are automatically added to the global kernel name space when the module is explicitly loaded.
Why Export Symbols?
To make symbols available for use by other extensions
To share private symbols between extensions
To define extended system calls to programs that will call them
Using private routines
Kernel extensions can also consist of several separately link-edited object files that are bound at load time. Load-time binding is useful where several kernel extensions use common routines provided in a separate object file.
For object files that reference each other's symbols, each file should use the other's export file as an import file during link-edit. The export file for the object file providing the services should specify the directory path to the object file as the first line in the exports file. The filename specified should be where the file will be installed when the kernel extension is loaded into the kernel. For example, the first line of the export file should be:
#!/usr/lib/drivers/pci/scsi_ddpin
Extended system calls
When a kernel extension creates a new system call, an export file must be created containing the symbol name of the new system call. An example of such a file is shown here:
#!/unix
sys_call_name syscall
Note that the above will only work if “sys_call_name” has no parameters. If the system call has parameters a different “tag” value such as syscall3264 must be used. This was explained earlier.
Figure 10-4. Kernel Libraries BE0070XS4.0
Notes:
Introduction
Normal C applications are linked with the C library, libc.a, which provides a set of useful programming routines. The C library for application programs is a shared object, and it is not possible to access this user-level library from within the kernel protection domain. For this reason, kernel extensions should not be linked with the normal C library.
Instead, a kernel extension may link with the libraries libcsys.a and libsys.a. These are static libraries (ar format archives of .o files) containing kernel-safe versions of some useful routines, such as atoi() and strlen(), that are normally found in the regular C library.
Note that the routines provided by libcsys.a are only a very small subset of those provided in the normal C library.
Kernel Libraries
libcsys.a:

    a64l     atoi     bcmp     bcopy    bzero
    l64a     memccpy  memchr   memcmp   memcpy
    memmove  memset   ovbcopy  remque   strcat
    strchr   strcmp   strcpy   strcspn  strlen
    strncat  strncmp  strncpy  strpbrk  strrchr
    strspn   strstr   strtok

libsys.a:

    d_align    d_roundup     date_to_jul  date_to_secs
    newstack   secs_to_date  timeout      timeoutcf
    untimeout  xdump
Kernel libraries
Libraries available to kernel extensions are shown in the visual on the previous page.
Reference
Additional information on libcsys.a and libsys.a is available in the AIX online documentation.
Figure 10-5. Configuration Routines BE0070XS4.0
Notes:
Introduction
Unlike a normal user-level C language application, a kernel extension does not have a routine called main. Instead it has a configuration routine and one or more entry points. These routines can have any name, and are automatically exported to the global name space.
In order to avoid conflicts in the kernel name space, it is normally best to prepend the names of exported symbols with something that indicates the extension which defines the symbol. For example, the symbol nfs_config is the entry point routine for the NFS kernel extension.
Configuration Routines
Kernel extension:

    int module_entry(cmd, uiop)
    int cmd;
    struct uio *uiop;

Device driver:

    int dd_entry(dev, cmd, uiop)
    dev_t dev;
    int cmd;
    struct uio *uiop;
Value of cmd Description
CFG_INIT Initialize
CFG_TERM Terminate
Configuration routine
An extension configuration routine is typically executed shortly after the extension is loaded. When linking the extension, the configuration routine is specified with the -e option of the ld command. The format of the configuration routine is shown in the visual above.

The uio structure is used to pass arguments from the configuration method. The value of cmd depends on the operation the configuration method is being requested to perform. See the later section on sysconfig() for details.
Entry points
Kernel extensions typically define one or more entry points. These are routines that could be called as a result of a system call or other action that invokes the kernel extension.
Figure 10-6. Compiling and Linking Kernel Extensions BE0070XS4.0
Notes:
Introduction
Compiling and linking a kernel extension must be split into two phases:
1) Compile each source file to create an object file.
2) Link the required object files to create the extension binary.
Compiler command
A number of different commands can be used to invoke the compiler on AIX. The commands call the same compiler core with a different set of options. In general, kernel code should be compiled with either the cc or xlc commands.
Compiling and Linking Kernel Extensions
Compile:

    cc -q64 -c ext.c -o ext64.o -D__64BIT_KERNEL -D_KERNEL -D_KERNSYS

Link:

    ld -b64 -o ext64 ext64.o -e init_routine -bE:extension.exp \
        -bI:/usr/lib/kernex.exp -lsys -lcsys
Conditional compiler values
One of the main requirements of the compile stage is that the appropriate conditional compile values are used to select the correct code sections. Some conditional compile values vary from extension to extension and are decided by the developer. Others should be used to ensure that the correct sections of the system-provided header files are used for the environment (32-bit or 64-bit kernel) for which the extension is being built. The compiler automatically defines a conditional compile variable to indicate which platform the code is being compiled on. Additional values should be chosen appropriately.
Value           Meaning
_POWER_MP       Code is being compiled for a multiprocessor machine. This value should always be used for 64-bit kernel extensions and device drivers.
_KERNSYS        Enable kernel symbols in header files. This value should always be used.
_KERNEL         Compiling kernel extension or device driver code. This value should always be used.
__64BIT_KERNEL  Code is being compiled for a 64-bit kernel.
__64BIT__       Code is being compiled in 64-bit mode. This value is automatically defined by the compiler if the -q64 option is specified.
Compiler options

The default mode for the compiler is 32-bit. In order to compile 64-bit code, the -q64 option should be used. Other compiler options may be used to generate additional information about the source files being compiled.
Linking
Once you have created all of the object files, use the linker (ld) to create the kernel extension binary. Some linker options will always be used when creating the binary; some are optional, and some are platform dependent.
The general format of the linker command is:
Compiler option   Meaning
-q64              Generate 64-bit object files. (-q32 is the default)
-qlist            Produce an object listing; output goes to a .lst file.
-qsource          Produce a source listing; output goes to a .lst file.
-c                Do not send files to the linkage editor.
-D<name>[=<def>]  Define <name> as in a #define directive. If <def> is not specified, 1 is assumed.
-M                Generate information to be included in a "make" description file.
-O                Generate optimized code.
-S                Produce a .s output file (assembler source).
-v                Display language processing commands as they are invoked by the compiler; output goes to stdout.
-qcpluscmt        Allow C++ style comments (//).
-qwarn64          Enable checking for possible long-to-integer or pointer-to-integer truncation.
Linker option  Meaning
-b64           Generate a 64-bit executable.
-b32           Generate a 32-bit executable.
-eLabel        Set the entry point of the executable to Label.
-lcsys -lsys   Link the libcsys.a and libsys.a libraries with the kernel extension.
-oName         Name the output file Name.
-bE:FileID     Export the external symbols listed in the file FileID.
-bI:FileID     Import the symbols listed in FileID.
    ld -e entry_point [import files] [export files] \
        -o output_file object1.o object2.o -lcsys -lsys

The order of arguments is not important.
Figure 10-7. How to Build a Dual Binary Extension BE0070XS4.0
Notes:
Introduction
Machines with 64-bit hardware can run either the 32-bit kernel or the 64-bit kernel. A kernel extension must be of the same binary type as the kernel. A kernel extension that supports both 32-bit and 64-bit kernels is packaged as an ar format archive library. The library contains both the 32-bit and 64-bit binary versions of the kernel extension. When the extension is loaded, if the kernel detects that the file is an ar format library, it will load the appropriate binary for the type of kernel. For example, a 64-bit kernel will extract the 64-bit binary from the library.
How to Build a Dual Binary Extension
Step 1: Compile a 32-bit object file using the -q32 compiler option.

    cc -q32 -o ext32.o -c ext.c -D_KERNEL -D_KERNSYS

Step 2: Link a 32-bit module using the -b32 linker option.

    ld -b32 -o ext32 ext32.o -e ext_init \
        -bI:/usr/lib/kernex.exp -lcsys

Step 3: Compile a 64-bit object file from the same source file as step 1.

    cc -q64 -o ext64.o -c ext.c -D_KERNEL -D_KERNSYS \
        -D__64BIT_KERNEL

Step 4: Link a 64-bit module using the -b64 linker option.

    ld -b64 -o ext64 ext64.o -e ext_init \
        -bI:/usr/lib/kernex.exp -lcsys

Step 5: Create an archive containing both the 32-bit and 64-bit extensions.

    ar -X32_64 -r -v ext ext32 ext64
Creating a dual binary extension
The table in the visual above describes the steps for building a dual binary kernel extension.

Note: The name of the library file does not need to be of the libnnn.a format.
Figure 10-8. Loading Extensions BE0070XS4.0
Notes:
Introduction
A user-level program called a Configuration Method is used to load a kernel extension into the kernel. The program is normally a 32-bit executable, even on systems running the 64-bit kernel.
sysconfig() and loadtext()
There are two routines available for loading the extension into the kernel as shown in the visual above.
Loading Extensions
sysconfig() system call can be used to:
Load kernel extensions
Unload kernel extensions
Invoke the extension's entry point
Query the kernel to determine if an extension is loaded
loadext() library routine can be used to:
Load kernel extensions
Unload kernel extensions
Query the kernel to determine if an extension is loaded
Figure 10-9. sysconfig() - Loading and Unloading BE0070XS4.0
Notes:
Loading, unloading and querying
When loading, unloading or querying, the sysconfig() subroutine is passed a pointer to a cfg_load structure (defined in <sys/sysconfig.h>) and one of the commands shown in this table.
The caller provides the path value, and the sysconfig routine returns the kmid. The libpath is optional. When unloading, the caller specifies the kmid, and the path and libpath are ignored.
sysconfig() - Loading and Unloading
    sysconfig(Cmd, &cfg_load, sizeof(cfg_load))

Cmd Value       Description
SYS_KLOAD       Loads a kernel extension object file into kernel memory.
SYS_SINGLELOAD  Loads a kernel extension object file only if it is not already loaded.
SYS_QUERYLOAD   Determines if a specified kernel object file is loaded.
SYS_KUNLOAD     Unloads a previously loaded kernel object file.

    struct cfg_load
    {
        caddr_t path;    /* ptr to object module pathname */
        caddr_t libpath; /* ptr to a substitute libpath */
        mid_t   kmid;    /* kernel module id (returned) */
    };
Figure 10-10. sysconfig() - Configuration BE0070XS4.0
Notes:
Calling the entry point
Once the kernel extension has been loaded into the kernel, the next step is to call the entry point or configuration routine. For all extensions other than device drivers, a pointer to a cfg_kmod structure and the SYS_CFGKMOD command is passed to sysconfig().
The cfg_kmod structure is used with the SYS_CFGKMOD command to call the entry point of a kernel extension.
sysconfig() - Configuration
    sysconfig(SYS_CFGKMOD, &cfg_kmod, sizeof(cfg_kmod))

    struct cfg_kmod
    {
        mid_t   kmid;   /* module ID of module to call */
        int     cmd;    /* command parameter for module */
        caddr_t mdiptr; /* pointer to module dependent info */
        int     mdilen; /* length of module dependent info */
    };
Figure 10-11. sysconfig() - Device Driver Configuration BE0070XS4.0
Notes:
Device driver entry point
The cfg_dd structure is used with the SYS_CFGDD command to the sysconfig() routine to call the entry point of a device driver.
Entry point options
A number of commands can be passed to the entry point of a kernel extension in the cmd parameter of the cfg_dd or cfg_kmod structure passed to sysconfig(). Values are defined in <sys/device.h> as follows:
Value      Meaning
CFG_INIT   Initialize the extension
CFG_TERM   Terminate the extension
CFG_QVPD   Query of vital product data
CFG_UCODE  Download of microcode
sysconfig() - Device Driver Configuration
    sysconfig(SYS_CFGDD, &cfg_dd, sizeof(cfg_dd))

    struct cfg_dd
    {
        mid_t   kmid;   /* module ID of device driver */
        dev_t   devno;  /* device major/minor number */
        int     cmd;    /* config command code for device */
        caddr_t ddsptr; /* pointer to DD structure */
        int     ddslen; /* length of DD structure */
    };
sysconfig() commands

This table provides a complete list of commands for the sysconfig() system call:

Cmd Value       Result
SYS_KLOAD       Loads a kernel extension object file into kernel memory.
SYS_SINGLELOAD  Loads a kernel extension object file only if it is not already loaded.
SYS_QUERYLOAD   Determines if a specified kernel object file is loaded.
SYS_KUNLOAD     Unloads a previously loaded kernel object file.
SYS_QDVSW       Checks the status of a device switch entry in the device switch table.
SYS_CFGDD       Calls the specified device driver configuration routine (module entry point).
SYS_CFGKMOD     Calls the specified module at its module entry point for configuration purposes.
SYS_GETPARMS    Returns a structure containing the current values of run-time system parameters found in the var structure.
SYS_SETPARMS    Sets run-time system parameters from a caller-provided structure.
SYS_64BIT       When running on the 32-bit kernel, this flag can be bit-wise OR'ed with the cmd parameter (if the cmd parameter is SYS_KLOAD or SYS_SINGLELOAD). For kernel extensions, this indicates that the kernel extension does not export 64-bit system calls, but that all 32-bit system calls also work for 64-bit applications. For device drivers, this indicates that the device driver can be used by 64-bit applications.
Figure 10-12. The loadext() Routine BE0070XS4.0
Notes:
Introduction
The loadext() routine, defined in the libcfg.a library, is often used to perform the task of loading the extension code into the kernel. It uses a boolean logic interface to perform the query, load and unload of kernel extensions.
dd_name
The dd_name string specifies the pathname of the extension module to load. If the dd_name string is not a relative or absolute path name (in other words, it does not start with “./”, “../”, or a “/”), then it is concatenated to the string “/usr/lib/drivers/”.
For example, PCI device drivers are normally stored in the /usr/lib/drivers/pci directory. The dd_name argument “pci/fred” would result in the loadext routine trying to load the file /usr/lib/drivers/pci/fred into the kernel.
The loadext() Routine
The loadext() routine is defined as follows:
#include <sys/types.h>
    mid_t loadext(dd_name, load, query)
    char *dd_name;
    int load, query;
load and query parameters

The load and query parameters are either TRUE or FALSE, and indicate the action to be taken as follows:

    loadext("pci/fred", FALSE, TRUE);  /* Query of pci/fred */
    loadext("pci/fred", TRUE, FALSE);  /* SYS_SINGLELOAD of pci/fred */
    loadext("pci/fred", FALSE, FALSE); /* Unload pci/fred */
Multiple copies
If you require multiple copies of a kernel extension to be loaded, you should use the sysconfig interface with the SYS_KLOAD command, since loadext uses SYS_SINGLELOAD, which will only load the extension if it is not already loaded.
Calling entry points
Even if you use the loadext() routine to load the kernel extension, you still need to use the sysconfig() routine to call the entry point.
Figure 10-13. System Calls BE0070XS4.0
Notes:
Introduction
A system call is a function, called by user-process code, that executes in the kernel protection domain.
What is a system call?
A system call:
- Provides user access to kernel functions and resources
- Runs with kernel-mode privileges
- Protects the kernel from direct user mode access to the kernel domain
System Calls
[Figure: a user program's main() in user address space calls sys_call(arg1, arg2, ...). The system call mechanism (1) switches the protection domain from user to kernel, (2) switches to the kernel stack, and (3) executes the system call code in kernel address space.]
Differences from a user-mode function

From an external view, the mechanism used to call a system call appears the same as calling a user-mode function. There are, however, several significant differences between a user-mode function and a system call. In a system call:
- Execution mode is switched from user to kernel mode
- Code and data are located in global kernel memory
- Cannot use the shared user libraries
- Cannot reference symbols outside of the kernel protection domain
- System calls can’t be interrupted by signals (must poll for signals)
- Can create kernel process to perform asynchronous processing
Figure 10-14. Sample System Call - Export/Import File BE0070XS4.0
Notes:
Introduction
This section describes the creation of a very simple kernel extension that adds a new system call to the kernel. The extended system call created here is called question().
Export and import files
When creating an extended system call, the function name of the system call must be exported by the kernel extension and imported by any program calling the system call. Shown above is the export and import file used for this example.
Note: The “tag”, syscall, shown above works here because the question() function has no parameters. If it did have parameters we would need to use a “tag” such as syscall3264.
Sample System Call - Export/Import File
question.exp
    #!/unix
    question syscall
Figure 10-15. Sample System Call - question.c BE0070XS4.0
Notes:
Example extension
This is the kernel extension code. The init routine question_init() is run when the extension is loaded. The function question() is the new system call.
The code uses kernel printf() calls. The output from these calls will be displayed on /dev/console if the running kernel image has the kernel debugger loaded.
Sample System Call - question.c
    /* question.c */
    #include <stdio.h>
    #include <sys/device.h>

    question_init(int cmd, struct uio *uio)
    {
        switch (cmd) {
        case CFG_INIT:
            /* do init stuff here */
            printf("question_init: command=CFG_INIT\n");
            break;
        case CFG_TERM:
            /* clean up */
            printf("question_init: command=CFG_TERM\n");
            break;
        default:
            printf("question_init: command=%d\n", cmd);
        }
        return (0);
    }

    question()
    {
        return (42); /* return the answer to the user */
    }
Figure 10-16. Sample System Call - Makefile BE0070XS4.0
Notes:
System call makefile
This is the Makefile used to build the kernel extension. In this example both 32-bit and 64-bit objects are built. The two objects are archived (ar) into a single file. When loaded into the kernel, the object matching the kernel type will be extracted from the archive and loaded.
Sample System Call - Makefile
    question: question.c
    	cc -q32 -D_KERNEL -D_KERNSYS -o question32.o \
    		-c question.c
    	ld -b32 -o question32 question32.o -e question_init \
    		-bE:question.exp -bI:/usr/lib/kernex.exp
    	cc -q64 -D_KERNEL -D_KERNSYS -D__64BIT_KERNEL \
    		-o question64.o -c question.c
    	ld -b64 -o question64 question64.o -e question_init \
    		-bE:question.exp -bI:/usr/lib/kernex.exp
    	rm -f question
    	ar -X32_64 -r -v question question32 question64
Figure 10-17. Argument Passing BE0070XS4.0
Notes:
Introduction
System calls can accept up to 8 arguments. Often these arguments are 64 bits long or are pointers to buffers in the user's address space. Because AIX supports a mix of 32-bit and 64-bit environments, care must be taken when processing 64-bit arguments.
64-bit kernels
When running a 64-bit kernel, pointer arguments passed from a 32-bit process will be zero extended. This case requires no special handling.
32-bit kernels
In the 32-bit kernel, a kernel service that accepts a pointer as a parameter expects a 32-bit value. When dealing with a 64-bit user process, however, things are different. Although the kernel expects (and indeed receives) 32-bit values as the arguments, the parameters in the user process itself are 64 bits wide. The system call handler copies the low-order 32 bits of all parameters onto the kernel stack it creates before entering the system call. The high-order 32 bits are stored elsewhere. A new kernel service called get64bitparm() is used to retrieve the stored high-order 32 bits and reconstruct the 64-bit value inside the kernel.

Argument Passing

[Figure: on the 64-bit kernel, pointer arguments from a 32-bit user process are zero extended, while a 64-bit process's pointers pass through unchanged; on the 32-bit kernel, only the low-order 32 bits of each parameter from a 64-bit user process are passed into the kernel.]
get64bitparm()
The get64bitparm() kernel service is defined in the header file <sys/remap.h> as follows:
    unsigned long long get64bitparm(unsigned long low32, int parmnum);
The get64bitparm() kernel service is used to reconstruct a 64-bit long pointer that was passed (and truncated) from a 64-bit user process to the 32-bit kernel. The 64-bit system call handler stores the high-order 32 bits of all system call arguments. Once the 64-bit value has been reconstructed, the kernel extension may use it for whatever purpose it deems necessary.
In the following material we demonstrate the use of this service in forming a 64-bit address which is then used to read parameter data from a 64-bit process into a 32-bit kernel extension. In this case the get64bitparm() call is used to obtain a user space address which is then accessed by the copyin64() kernel service.
Figure 10-18. User Memory Access BE0070XS4.0
Notes:
Introduction
Within the kernel, a number of services can be used to copy data from user space to kernel space, and from kernel space to user space.
Overview
User applications reside in the user protection domain and cannot directly access kernel memory. Kernel extensions reside in the kernel protection domain and cannot directly access user space memory.
List of services
The following services can be used to transfer data between user and kernel address space. Prototypes are defined for the services in the header file <sys/uio.h>.
User Memory Access

User address space:
    user_buffer
    sys_call( &user_buffer, sizeof(user_buffer) );

Kernel address space:
    kernel_buffer
    sys_call( void *buffer, int count )
    {
        copyin(buffer, &kernel_buffer, count);
        copyout(&kernel_buffer, buffer, count);
    }
Copy data from user to kernel space
Use the following services to copy data from user to kernel space:
copyin(void *uaddr, void *kaddr, size_t count)
- copies count bytes of data

copyinstr(void *uaddr, void *kaddr, size_t max, size_t *actual);
- copies a character string (including the null character)

int fubyte(void *uaddr);
int fuword(void *uaddr);
- fetch a byte and a word, respectively
Copy data from kernel to user space
Use the following services to copy data from kernel to user space:
copyout(void *kaddr, void *uaddr, size_t count)
- copies count bytes of data

subyte(void *uaddr, char val);
suword(void *uaddr, int val);
- store a byte and a word, respectively
32-bit kernels
Additional services can be used by 32-bit kernels when dealing with a 64-bit user process.
copyin64(unsigned long long uaddr, char *kaddr, int count)
copyinstr64(unsigned long long uaddr, caddr_t kaddr, uint max, uint *actual);
fubyte64(unsigned long long uaddr);
fuword64(unsigned long long uaddr);
copyout64(char *kaddr, unsigned long long uaddr, int count)
subyte64(unsigned long long uaddr, uchar val);
suword64(unsigned long long uaddr, int val);
IS64U
The macro IS64U can be used by system call code to determine if the calling process is 64-bit or 32-bit. The macro evaluates to true if the calling process is 64-bit. It checks the U_64bit member of the user structure described earlier. Both the user structure and the IS64U macro are defined in /usr/include/sys/user.h.
64-bit argument code sample

The following code sample shows the logic used in a kernel extension that can handle calls from 64-bit user applications when running in the 32-bit kernel.
int myservice(void *buf, int count, long size)
{
    void *localmem = xmalloc(count, 2, kernel_heap);
    int rc;

#ifndef __64BIT_KERNEL
    /* 32-bit kernel logic */
    unsigned long long lbuf, lsize;
    char is64u = IS64U;

    if (is64u) {
        /* 32-bit kernel & caller is a 64-bit process */
        lbuf  = get64bitparm((unsigned long) buf, 0);
        lsize = get64bitparm(size, 2);
        size  = (long) lsize;
        copyin64(lbuf, localmem, count);
    } else
#endif
    {
        /* this path is taken if 32-bit kernel & 32-bit process
        ** OR any size process if running in 64-bit kernel
        */
        copyin(buf, localmem, count);
    }

    . . .
    /* body of kernel service */

#ifndef __64BIT_KERNEL
    /* 32-bit kernel logic */
    if (is64u) {
        /* 32-bit kernel & caller is a 64-bit process */
        rc = copyout64(localmem, lbuf, count);
    } else
#endif
    {
        rc = copyout(localmem, buf, count);
    }
    if (rc != 0) {
        . . .
Figure 10-19. Checkpoint BE0070XS4.0
Notes:
Checkpoint
Kernel extensions can be loaded at _____ _____ and during _______.
A kernel extension can be compiled and linked like a regular user application. True or False?
A kernel extension must supply a routine called main(). True or False?
Kernel extensions are used mainly for D_____ D_____, F_____ S______ and S______ C______.
The ________ system call is used to invoke the entry point of a kernel extension.
Figure 10-20. Exercise BE0070XS4.0
Notes:
Developing code for the kernel environment is very different from developing a user-level application. In general, kernel services perform very little (or no) checking of arguments for error conditions. The consequences of invoking a kernel service with incorrect arguments include data corruption, or even a system crash. This is in stark contrast to a similar problem in a user-level application, which would normally just cause the application to terminate with a SIGSEGV signal.
Turn to your lab workbook and complete exercise ten.
Exercise
Complete exercise ten
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
• Compile, link and load a kernel extension
• Write your own system call
• Write a kernel extension that creates kernel processes
• Create your own ps command
Figure 10-21. Unit Summary BE0070XS4.0
Notes:
Unit Summary
Kernel extensions are used to implement device drivers, file systems and extended system calls
Kernel extensions can be loaded at boot time or runtime, and can be unloaded at runtime
Kernel extensions require special compile and link steps
Kernel extensions need to match the binary type of the running kernel
Kernel extension code must take into account that the kernel is pageable
Appendix A. Checkpoint Solutions

Unit 1
Checkpoint Solutions
1. The kernel is the base program of the operating system.
2. The processor runs interrupt routines in kernel mode.
3. The AIX kernel is preemptable, pageable and dynamically extendable.
4. The 64-bit AIX kernel supports only 64-bit kernel extensions, and only runs on 64-bit hardware.
5. The 32-bit kernel supports 64-bit user applications when running on 64-bit hardware.
Unit 2
Checkpoint Solutions
1. KDB is used for live system debugging.
2. kdb is used for system image analysis.
3. The value of the dbg_avail kernel variable indicates how the debugger is loaded.
4. A system dump image contains everything that was in the kernel at the time of the crash. True or False? False. The system dump image contains only selected areas of kernel memory.
Unit 3
Checkpoint Solutions
1. AIX provides three programming models for user threads.
2. A new thread is created by the thread_create() system call.
3. The process table is an array of pvproc structures.
4. All process IDs (except pid 1) are even.
5. A thread table slot number is included in a thread ID. True or False? True.
6. A thread holding a lock may have its priority boosted.
Unit 4
Checkpoint Solutions
1. AIX divides physical memory into frames.
2. The virtual memory manager provides each process with its own effective address space.
3. A segment can be up to 256MB in size.
4. A 32-bit effective address contains a 4-bit segment number.
5. Shared library data segments can be shared between processes. True or False? False. The shared library text segments are shared, but the data segments are private.
6. The 32-bit user address space layout is the same as the 32-bit kernel address space layout. True or False? False.
Unit 5
Checkpoint Solutions
1. The system hardware maintains a table of recently referenced virtual to physical address translations.
2. The Software Page Frame Table contains information on all pages resident in physical memory.
3. Each working storage segment has an XPT.
4. A SIGDANGER signal is sent to every process when the free paging space drops below the warning threshold.
5. The PSALLOC environment variable can be used to change the paging space policy of a process.
6. A page fault when interrupts are disabled will cause the system to crash.
Unit 6
Checkpoint Solutions
1) What processor features are required in a partitioned system? RMO, RML and LPI registers are needed in a partitioned system.
2) Memory is allocated to partitions in units of 256 MB.
3) All partitions have the same real mode memory requirements. True or False? The statement is False. AIX 5.2 and Linux need 256MB. AIX 5.1 requires 256MB, 1GB or 16GB, depending on the amount of memory allocated to the partition.
4) In a partitioned environment, a real address is the same as a physical address. True or False? The statement is False. A real address is not equivalent to a physical address in the partitioned environment.
5) Any piece of code can make hypervisor calls. True or False? The statement is False. Only kernel code can make hypervisor calls.
6) Which physical addresses in the system can a partition access? A partition can access the PMBs allocated to the partition, (and with hypervisor assistance) the partition's own page table, and the TCE windows for the allocated I/O slots.
Unit 7
Checkpoint Solutions (1 of 2)
Each user process contains a private File Descriptor Table.
The kernel maintains a vfs structure and a vmount structure for each mounted file system.
There is one gfs structure for each mounted file system. True or False? False. There is one gfs structure for each file system type registered with the kernel.
The three kernel structures volgrp, lvol and pvol are used to track LVM volume group, logical volume and physical volume data, respectively.
The kdb subcommand volgrp and the AIX command lsvg both reflect volume group information.
Unit 7 (continued)
Checkpoint Solutions (2 of 2)
There is one vmount/vfs structure pair for each mounted filesystem. True or False? True.

Every open file in a filesystem is represented by exactly one file structure. True or False? False. There is one file structure (system file table entry) for each unique open() of a file. So, a given file may be represented by several file structures.

The inode number given by ls -id /usr is shown as 2. Why? The reason is that ls is giving us the root inode of the /usr filesystem, not the inode of the /usr directory in the / (root) filesystem. To obtain this directory inode we need to follow the vfs_mntdover pointer in the /usr filesystem vfs structure. This will point us to the vnode structure of directory /usr in the root filesystem, which contains the directory inode number.

Each vnode for an open file points to a gnode structure. The reason for this is that gnode structures are of one format. They are embedded in the corresponding inode/specnode/rnode structure for the file in question. These structures are of different formats.
Unit 8
Checkpoint Solutions
1. An allocation group contains disk inodes and fragments.
2. The basic allocation unit in JFS is a disk block. True or False? False. The basic allocation unit is a fragment.
3. The root inode number of a filesystem is always 1. True or False? False. The root inode number is always 2.
4. The last 128 bytes of an in-core JFS inode is a copy of the disk inode. True or False? True. The first part of an in-core JFS inode contains data relevant only when the associated object is being referenced. This includes such items as open count and in-core inode state.
5. JFS maps user data blocks and directory information into virtual memory. True or False? True. JFS itself does copy operations and relies on VMM to do the actual I/O operations. This is a reason for JFS I/O efficiency.
Unit 9
Checkpoint Solutions
There is one aggregate per logical volume.
An allocation group is at least 8192 aggregate blocks.
The number of inodes in a JFS2 file system is fixed. True or False? False.
The data contents of a file is stored in objects called extents.
A single extent can be up to 2^24 - 1 aggregate blocks in size.
A JFS2 directory contains directory entries for the . and .. directories. True or False? False. The information for . and .. is contained in the inode of the directory.
Unit 10
Checkpoint Solutions
Kernel extensions can be loaded at system boot and during runtime.
A kernel extension can be compiled and linked like a regular user application. True or False? False.
A kernel extension must supply a routine called main(). True or False? False.
Kernel extensions are used mainly for Device Drivers, File Systems and System Calls.
The sysconfig system call is used to invoke the entry point of a kernel extension.
Appendix B. KI Crash Dump

What This Unit Is About
This unit describes how to configure and perform system dumps on a system running a version of the AIX 5L operating system.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Configure a system to perform a system dump
• Test the system dump configuration of a system
• Validate a dump file
How You Will Check Your Progress
Accountability:
• Exercises using your lab system
References
Figure B-1. Unit Objectives BE0070XS4.0
Notes:
Unit Objectives
At the end of this unit you should be able to:
Configure a system to perform a system dump
Test the system dump configuration of a system
Validate a dump file
Figure B-2. Crash Dumps BE0070XS4.0
Notes:
System Dump Facility in AIX 5L
What is a crash dump?
A system crash dump is a snapshot of the operating system state at the time of the crash or manually initiated dump. When a manually-initiated or unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel memory as well as other areas registered in the Master Dump Table by kernel modules or kernel extensions.
When is a crash dump created?
An AIX 5L system will generate a system crash dump when encountering a severe system error, such as unexpected or unrecoverable kernel mode exceptions. It can also be initiated by the system administrator when the system is hung.
Crash Dumps
What is a crash dump?
When is a crash dump created?
What is a crash dump used for?
What is a crash dump used for?
The system dump facility provides a mechanism to capture sufficient information about the AIX 5L kernel for later expert analysis. Once the preserved image is written to disk, the system will be booted and returned to production. The dump is then typically submitted to IBM for analysis.
Figure B-3. Process Flow BE0070XS4.0
Notes:
System dump process
Introduction
The process of performing a system dump is illustrated in the chart. The process involves two stages. In stage one, the contents of memory are copied to a temporary disk location. In stage two, AIX 5L is booted and the memory image is moved to a permanent location in the /var/adm/ras directory.
Process Flow

Stage 1:
• AIX 5L in production
• System panics
• Memory dumper is run
• Memory is copied to the disk location specified in the SWservAt ODM object class

Stage 2:
• System is booted
• copycore copies the dump into /var/adm/ras
• copycore is called by rc.boot
Figure B-4. About This Exercise BE0070XS4.0
Notes:
Exercise
Complete exercise A
Consists of theory and hands-on
Ask questions at any time
Activities are identified by a
What you will do:
• Learn about the sysdumpdev command
• Configure your lab system to perform a system dump
• Test the crash dump configuration
• Verify you have obtained a successful system dump