
Page 1: Memory

Memory Management in Linux

Anand Sivasubramaniam

Page 2: Memory

Two Parts

• Architecture Independent Memory

Should be flexible and portable enough across platforms

• Implementation for a specific architecture

Page 3: Memory

Architecture Independent Memory Model

• Process virtual address space divided into pages

• Page size is given by the PAGE_SIZE macro in asm/page.h (4K for x86 and 8K for Alpha)

• The pages are divided among four segments

• User Code, User Data, Kernel Code, Kernel Data

• In user mode, only User Code and User Data can be accessed

• But in kernel mode, access to User Data is also needed

Page 4: Memory

• put_user(), get_user(), memcpy_tofs(), memcpy_fromfs() allow the kernel to access user data (defined in asm/segment.h)

• Registers cs and ds point to the code and data segments of the current mode

• fs points to the data segment of the calling process in kernel mode.

• get_ds(), get_fs(), and set_fs() are defined in asm/segment.h
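For example, a minimal sketch of how a 2.0-era driver read() handler might use these accessors to copy kernel data out to the calling process. kbuf and sketch_read are hypothetical names; verify_area() is discussed on Page 16.

    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <asm/segment.h>

    static char kbuf[128];      /* hypothetical kernel buffer */

    static int sketch_read(struct inode *inode, struct file *file,
                           char *buf, int count)
    {
            int err;

            if (count > sizeof(kbuf))
                    count = sizeof(kbuf);

            /* check that buf really lies in writable user space */
            err = verify_area(VERIFY_WRITE, buf, count);
            if (err)
                    return err;

            memcpy_tofs(buf, kbuf, count);   /* kernel -> user (fs) */
            return count;
    }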

Page 5: Memory

• Segment base + offset = 32-bit linear address (4 GB)

• Of this, user space = 3 GB (defined by the TASK_SIZE macro) and kernel space = 1 GB

• A linear address is converted to a physical address using three levels:

index into Page Dir. | index into Page Middle Dir. | index into Page Table | page offset

Page 6: Memory

Page Dir. and Middle Dir. Access Functions (in asm/page.h and asm/pgtable.h)

• Structures pgd_t and pmd_t define an entry of these tables.

• pgd_alloc()/pgd_free() allocate and free a page for the page directory

• pmd_alloc(), pmd_alloc_kernel() / pmd_free(), pmd_free_kernel() allocate and free a page middle directory in the user and kernel segments.

• pgd_set(), pgd_clear() / pmd_set(), pmd_clear() set and clear an entry of their tables.

• pgd_present()/pmd_present() check for the presence of what the entries point to.

• pgd_page()/pmd_page() return the base address of the page to which the entry points

• …..
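Putting the table-access functions together, a sketch of a full three-level lookup. This assumes the companion pgd_offset()/pmd_offset()/pte_offset() index macros from asm/pgtable.h, which the slide elides.

    #include <linux/sched.h>
    #include <asm/pgtable.h>

    /* returns the frame base address for a mapped virtual address, or 0 */
    static unsigned long sketch_va_to_frame(struct mm_struct *mm,
                                            unsigned long address)
    {
            pgd_t *pgd = pgd_offset(mm, address);  /* index into page dir. */
            pmd_t *pmd;
            pte_t *pte;

            if (!pgd_present(*pgd))
                    return 0;
            pmd = pmd_offset(pgd, address);        /* index into middle dir. */
            if (!pmd_present(*pmd))
                    return 0;
            pte = pte_offset(pmd, address);        /* index into page table */
            if (!pte_present(*pte))
                    return 0;
            return pte_page(*pte);                 /* base address of frame */
    }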

Page 7: Memory

Page Table Entry (pte_t)

Attributes

• Presence (is page present in VAS?)

• Read, Write and Execute

• Accessed? (age)

• Dirty

• Macros of pgprot_t type:

– PAGE_NONE (invalid)

– PAGE_SHARED (read-write)

– PAGE_COPY / PAGE_READONLY (read only, used by copy-on-write)

– PAGE_KERNEL (accessible only by kernel)

Page 8: Memory

Page Table Functions

• mk_pte(), pte_clear(), set_pte()

• pte_mkclean(), pte_mkdirty(), pte_mkread(), …

• pte_none() (check whether entry is set)

• pte_page() (returns address of page)

• pte_dirty(), pte_present(), pte_young(), pte_read(), pte_write()
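A sketch of these helpers in combination; sketch_install_pte() is a hypothetical name, and PAGE_SHARED comes from the pgprot macros on the previous page.

    #include <asm/pgtable.h>

    /* build a read-write PTE for a frame, install it, read the bits back */
    static int sketch_install_pte(pte_t *ptep, unsigned long page)
    {
            pte_t pte = mk_pte(page, PAGE_SHARED);  /* read-write mapping */

            pte = pte_mkdirty(pte);                 /* mark as modified */
            set_pte(ptep, pte);

            return pte_present(*ptep) && pte_dirty(*ptep);
    }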

Page 9: Memory

Process Address Space (not to scale!)

[Figure: the layout from the top of the address space down — the kernel above 0xC0000000; then the file name, environment, and arguments; the stack, growing downward; shared libraries mapped at 0x84000000; the heap, between _bss_start and bss_end; data, ending at _edata; code, ending at _etext; and the header at the bottom.]

Page 10: Memory

Address Space Descriptor

• mm_struct, defined in the process descriptor (in linux/sched.h).

• On fork this is duplicated, unless CLONE_VM is specified, in which case it is shared.

    struct mm_struct {
            int count;               // no. of processes sharing this descriptor
            pgd_t *pgd;              // page directory ptr
            unsigned long start_code, end_code;
            unsigned long start_data, end_data;
            unsigned long start_brk, brk;
            unsigned long start_stack;
            unsigned long arg_start, arg_end, env_start, env_end;
            unsigned long rss;       // no. of pages resident in memory
            unsigned long total_vm;  // total # of bytes in this address space
            unsigned long locked_vm; // # of bytes locked in memory
            unsigned long def_flags; // status to use when mem regions are created
            struct vm_area_struct *mmap;     // ptr to first region desc.
            struct vm_area_struct *mmap_avl; // AVL tree for faster search of region desc.
    };

Page 11: Memory

Region Descriptors

• Why even allocate all of the VAS? Allocate only on demand.

• Use region descriptors for each allocated region of VAS

• Map allocated but unused regions to same physical page to save space.

• struct vm_area_struct {
          struct mm_struct *vm_mm;         // descriptor of the VAS
          unsigned long vm_start, vm_end;  // extent of this region
          pgprot_t vm_page_prot;           // protection attributes for this region
          short vm_avl_height;
          struct vm_area_struct *vm_avl_left;   // left-hand child in the AVL tree
          struct vm_area_struct *vm_avl_right;  // right-hand child
          struct vm_area_struct *vm_next_share, *vm_prev_share; // doubly linked
          struct vm_operations_struct *vm_ops;
          struct inode *vm_inode;          // of the mapped file, or NULL = "anonymous mapping"
          unsigned long vm_offset;         // offset in file/device
  };

Page 12: Memory

• If vm_inode is NULL (anonymous mapping), all PTEs for this region point to the same page.

• If the process does a write to any of these pages, the faulting mechanism creates a new physical page (copy-on-write).

• This is used by the brk() system call.

• Operations specific to this region (including fault handling) are specified in vm_operations_struct.

• Hence, different regions can have different functions.

Page 13: Memory

    struct vm_operations_struct {
            void (*open)(struct vm_area_struct *);
            void (*close)(struct vm_area_struct *);
            void (*unmap)(…);
            void (*protect)(…);
            void (*sync)(…);
            unsigned long (*nopage)(struct vm_area_struct *, unsigned long address,
                                    unsigned long page, int write_access);
            void (*swapout)(struct vm_area_struct *, unsigned long, pte_t *);
            pte_t (*swapin)(struct vm_area_struct *, unsigned long, unsigned long);
    };
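For instance, a hypothetical region-specific nopage() handler following the four-argument signature above; a real handler would fill the frame at page from the region's backing file or device.

    /* called on a fault for a missing page in this region */
    static unsigned long sketch_nopage(struct vm_area_struct *vma,
                                       unsigned long address,
                                       unsigned long page,
                                       int write_access)
    {
            /* ... copy the region's data for `address' into the
               frame at `page' here ... */
            return page;    /* fault handling maps this frame in */
    }

The handler is installed by pointing the region's vm_ops at an operations table whose nopage slot holds sketch_nopage.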

Page 14: Memory

Traditional mmap()

• int do_mmap(struct file *, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long off);

• Creates a new memory region

• Creates the required PTEs

• Sets the PTEs to fault later

• The handler (nopage) will either copy-on-write, for an anonymous mapping, or bring in the required page of the file (see the sketch below).
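As a sketch (assuming the return convention of the prototype above), the kind of anonymous mapping brk() needs could be created like this:

    #include <linux/mman.h>

    /* map an anonymous, read-write region at a fixed address */
    static int sketch_map_anon(unsigned long addr, unsigned long len)
    {
            /* file == NULL => anonymous mapping: the PTEs are set to
               fault, and nopage() resolves each fault by copy-on-write */
            return do_mmap(NULL, addr, len,
                           PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_FIXED, 0);
    }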

Page 15: Memory

How is brk() implemented?

• Check whether to allocate (deny if there is not enough physical memory, the request exceeds the VA limits, or it crosses the stack).

• Then call do_mmap() for anonymous mapping between the old and new values of brk (in process table).

• Return the new brk value.
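A sketch of that logic; the checks mirror the bullets above rather than the exact kernel source, and enough_physical_memory() is a hypothetical stand-in for the availability check.

    static unsigned long sketch_brk(unsigned long new_brk)
    {
            struct mm_struct *mm = current->mm;

            if (new_brk <= mm->brk)                /* shrinking: always ok */
                    goto set;
            if (new_brk > TASK_SIZE ||             /* exceeds VA limits    */
                new_brk >= mm->start_stack ||      /* crosses the stack    */
                !enough_physical_memory())         /* hypothetical check   */
                    return mm->brk;                /* deny: old brk value  */

            /* anonymous mapping between the old and new break */
            do_mmap(NULL, mm->brk, new_brk - mm->brk,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_FIXED, 0);
    set:
            mm->brk = new_brk;
            return mm->brk;                        /* the new brk value */
    }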

Page 16: Memory

Kernel Segment

• On a sys call, CS points to the kernel segment. DS and ES are set to the kernel segment as well.

• Next, FS is set to the user data segment.

• put_user() and get_user() can then access user space if needed.

• The address parameters to these functions cannot exceed 0xC0000000.

• A violation of this should result in a trap, as should any write to a read-only page (a problem on the 386, where kernel-mode writes to read-only pages do not trap; the 486/Pentium do not have this problem).

• Hence, verify_area() is typically called before performing such operations.

• Physical and virtual addresses are the same, except for memory allocated using vmalloc().

• Kernel segment shared across processes (not switched!)
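A classic idiom built on these functions, sketched under the assumption that get_fs() returns a value set_fs() accepts: temporarily pointing FS at the kernel data segment so the user-space accessors can operate on kernel buffers.

    #include <asm/segment.h>

    static void sketch_fs_to_kernel(void)
    {
            unsigned long old_fs = get_fs();

            set_fs(get_ds());       /* fs := kernel data segment */
            /* ... put_user()/get_user() now act on kernel space ... */
            set_fs(old_fs);         /* restore the caller's segment */
    }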

Page 17: Memory

Memory Allocation for the Kernel Segment

• Static

memory_start = console_init(memory_start, memory_end);

Typically done by drivers to reserve areas, and by some other kernel components.

• Dynamic

void *kmalloc(size, priority), void kfree(void *) // in mm/kmalloc.c

void *vmalloc(size), void vfree(void *) // in mm/vmalloc.c

kmalloc() is used for physically contiguous pages, while vmalloc() does not necessarily allocate physically contiguous pages.

Memory allocated is not initialized (and is not paged out).
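A side-by-side usage sketch; sizes and header names are illustrative (kmalloc() lived in linux/malloc.h in trees of this vintage).

    #include <linux/malloc.h>   /* kmalloc()/kfree() */
    #include <linux/mm.h>       /* vmalloc()/vfree() */

    static void sketch_alloc(void)
    {
            char *small, *big;

            /* physically contiguous: needed for DMA, max ~128 KB */
            small = kmalloc(512, GFP_KERNEL);

            /* only virtually contiguous: fine for large kernel tables */
            big = vmalloc(256 * 1024);

            if (small)
                    kfree(small);
            if (big)
                    vfree(big);
    }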

Page 18: Memory

kmalloc() data structures

[Figure: the sizes[] array holds one size_descriptor per block size — 32, 64, 128, 252, 508, 1020, 2040, 4080, 8176, 16368, 32752, 65520, and 131056 bytes. Each size_descriptor points to a chain of page_descriptors, and each page carries a free list of blocks (bh); exhausted chains end in NULL.]

Page 19: Memory

vmalloc()

• Allocates virtually contiguous pages, but they do not need to be physically contiguous.

• Uses __get_free_page() to allocate physical frames.

• Once all the required physical frames are found, the virtual addresses are created (and the mappings set) in an unused part of the kernel VAS.

• The virtual address search (for unused parts) on x86 begins at the next 8 MB boundary after physical memory.

• One (virtual) page is left free after each allocation as a cushion.

[Figure: vmlist points to a chain of vm_struct descriptors in the kernel VAS; each descriptor holds the addr, size, and next fields describing one allocated area.]
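A sketch of the search over vmlist; the vm_struct layout follows the figure, and VMALLOC_START stands in for the 8 MB-aligned address after physical memory.

    struct vm_struct {                  /* as in the figure */
            unsigned long addr;         /* start of this area */
            unsigned long size;         /* includes the guard page */
            struct vm_struct *next;
    };

    /* find the lowest unused virtual range big enough for the request */
    static unsigned long sketch_find_gap(struct vm_struct *vmlist,
                                         unsigned long size)
    {
            unsigned long addr = VMALLOC_START;   /* assumed constant */
            struct vm_struct *p;

            size += PAGE_SIZE;  /* one free page after each allocation */
            for (p = vmlist; p; p = p->next) {
                    if (addr + size <= p->addr)
                            break;                /* gap found */
                    addr = p->addr + p->size;     /* skip this area */
            }
            return addr;
    }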

Page 20: Memory

vmalloc vs kmalloc

• Contiguous vs non-contiguous physical memory

• kmalloc is faster but less flexible

• vmalloc involves __get_free_page() and may need to block to find a free physical page

• DMA requires contiguous physical memory

Page 21: Memory

Paging

• All kernel segment pages are locked in memory (no swapping)

• User pages can be paged out to:

– a complete block device (swap partition)

– fixed-length files in a file system

• The first page of the swap area holds a bitmap (up to byte 4086); a set bit indicates that the space for that page is available for paging.

• At byte 4086, the string "SWAP-SPACE" is stored.

• Hence, max swap of 4086*8-1 = 32687 pages = 130748 KB per device or file

• MAX_SWAPFILES specifies the number of swap files or devices

• A swap device is more efficient than a swap file.

Page 22: Memory

• The kernel is informed of a swap space using

• int sys_swapon(char *swapfile, int swapflags);

• This creates an entry in the swap_info table.

    struct swap_info_struct {
            unsigned int flags;
            kdev_t swap_device;
            struct inode *swap_file;
            unsigned char *swap_map;      // ptr to table with one byte per page,
                                          // counting the processes referring to that page
            unsigned char *swap_lockmap;  // ptr to bitmap, a bit indicating lock
            int lowest_bit, highest_bit;  // to calculate the maximum page number
            unsigned long max;            // highest_bit + 1
            int prio;                     // priority of this swap space
            int cluster_nr, cluster_next; // to cluster pages on the storage device
            int next;
    };

Page 23: Memory

• For each physical frame (mm.h):

    typedef struct page {
            struct page *prev, *next;            // doubly linked
            struct inode *inode;
            unsigned long offset;                // where to swap
            struct page *prev_hash, *next_hash;  // hash list of pages in the page cache
            atomic_t count;                      // number of users of this page
            unsigned dirty:16, age:8;
            struct buffer_head *buffers;         // if it is part of a block buffer
            unsigned long map_nr;                // frame #
            struct wait_queue *wait;             // tasks waiting for the page to be unlocked
            unsigned flags;
    } mem_map_t;

Page 24: Memory

Finding a Physical Page

• unsigned long __get_free_pages(int priority, unsigned long order, int dma) in mm/page_alloc.c

• Priority =

– GFP_BUFFER (free page returned only if available in physical memory)

– GFP_ATOMIC (return page if possible, do not interrupt current process)

– GFP_USER (current process can be interrupted)

– GFP_KERNEL (kernel can be interrupted)

– GFP_NOBUFFER (do not attempt to reduce buffer cache)

• order says give me 2^order pages (the max is 128 KB)

• dma specifies that it is for DMA purposes
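A usage sketch matching the prototype above; free_pages() is covered on Page 28.

    #include <linux/mm.h>

    static unsigned long sketch_get_dma_frames(void)
    {
            /* 2^2 = 4 physically contiguous, DMA-capable frames */
            unsigned long page = __get_free_pages(GFP_KERNEL, 2, 1);

            if (page)
                    free_pages(page, 2);   /* order must match */
            return page;
    }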

Page 25: Memory

• First tries to find a free frame using Buddy system.

• Table free_area[] keeps appropriate data structures.

[Figure: free_area_list[] heads one doubly linked list of free blocks per order (0, 1, 2, …), and free_area_map holds the companion bitmaps used to track buddies.]
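The buddy rule behind these structures, as a sketch: the buddy of a free block of 2^order frames is found by flipping bit order of its frame number; if the buddy is also free, the two coalesce into a block of order+1.

    /* frame number of the buddy of the block starting at map_nr */
    static unsigned long sketch_buddy(unsigned long map_nr, int order)
    {
            return map_nr ^ (1UL << order);
    }

    /* e.g. frames 4-5 (order 1) have buddy 6-7; together they form
       the order-2 block 4-7 */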

Page 26: Memory

• If a free page cannot be found:

    int try_to_free_page(int priority, int dma, int wait)
    {
            static int state = 0;
            int i = 6;
            int stop;

            /* we don't try as hard if we're not waiting */
            stop = 3;
            if (wait)
                    stop = 0;
            /* the case labels sit inside the do-while, so the loop
               resumes where the previous call left off */
            switch (state) {
                    do {
            case 0:
                    if (shrink_mmap(i, dma))
                            return 1;
                    state = 1;
            case 1:
                    if (shm_swap(i, dma))
                            return 1;
                    state = 2;
            default:
                    if (swap_out(i, dma, wait))
                            return 1;
                    state = 0;
                    i--;
                    } while ((i - stop) >= 0);
            }
            return 0;
    }

Page 27: Memory

• shrink_mmap() tries to discard pages in the page cache or buffer cache that have only one user currently and have not been referenced since the last cycle. The number of examined pages depends on the priority.

• shm_swap() tries to swap out pages allocated as shared memory.

• swap_out()– Uses swap_cnt to determine how many pages to swap out

for current process before moving on to next.

– Always start where you left off last time (Clock algorithm)

– Uses the swap_out_process() function, which then calls try_to_swap_out() for each candidate page that is present in memory (and not locked).

– try_to_swap_out() checks the age attribute in mem_map data structure, and the page is selected if this is 0.

– VM area’s swapout() operation is called.

– Write back if the page is dirty

– Invalidate page table entry.

Page 28: Memory

• The kswapd kernel thread, running in the background, is activated each time the number of free pages falls below a critical level.

• This thread calls the try_to_free_page() function.

• A block of memory is released using free_pages(). When the number of users reaches 0, the frames are entered in free_area[].

Page 29: Memory

Page Fault

• The error code is written onto the stack, and the faulting VA is stored in register CR2

• do_page_fault(struct pt_regs *regs, unsigned long error_code) is now called.

• If the faulting address is in the kernel segment, alarm messages are printed and the process is terminated.

• If the faulting address is not in a virtual memory area, check whether VM_GROWSDOWN is set for the next virtual memory area (i.e., the stack). If so, expand the VM area. If the expansion fails, send SIGSEGV.

• If the faulting address is in a virtual memory area, check the protection bits. If the access is not legal, send SIGSEGV. Else, call do_no_page() or do_wp_page() (see the sketch below).
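A sketch of this dispatch; the structure follows the bullets above, not the exact do_page_fault() source. read_cr2() and protection_ok() are hypothetical helpers, and the find_vma()/expand_stack() region routines are assumed.

    #include <linux/sched.h>
    #include <linux/mm.h>

    void sketch_page_fault(struct pt_regs *regs, unsigned long error_code)
    {
            unsigned long address = read_cr2();    /* hypothetical helper */
            struct vm_area_struct *vma;

            if (address >= TASK_SIZE) {
                    /* kernel segment: print alarm messages, terminate */
                    return;
            }
            vma = find_vma(current->mm, address);
            if (!vma || vma->vm_start > address) {
                    /* not in a region: can the next region (stack) grow? */
                    if (!vma || !(vma->vm_flags & VM_GROWSDOWN) ||
                        expand_stack(vma, address))
                            goto segv;
            }
            if (!protection_ok(vma, error_code))   /* hypothetical check */
                    goto segv;
            if (error_code & 1)                    /* page present: write-protect fault */
                    do_wp_page(current, vma, address, error_code & 2);
            else                                   /* page not present */
                    do_no_page(current, vma, address, error_code & 2);
            return;
    segv:
            send_sig(SIGSEGV, current, 1);
    }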

Page 30: Memory

• void do_wp_page(struct task_struct *task, struct vm_area_struct *vma, unsigned long address, int write_access);

– Check whether the page is mapped in

– If it is referenced only once, change the permissions

– Else, create a copy, and point the page table entry at this copy with write permission on

• void do_no_page(struct task_struct *task, struct vm_area_struct *vma, unsigned long address, int write_access);

– If there is no nopage() handler, an empty page is mapped to this area

– Else, do_swap_page(struct task_struct, struct vm_area_struct, unsigned long address, pte_t, int write_access) is called.

– If the region defines no swapin() handler, the default swap_in(struct task_struct, struct vm_area_struct, pte_t *, unsigned long entry, int write_access) is called.

– entry contains the swap space and page number.

– This calls swap_free() to release the page in swap space, and the swap_map counter is decremented.