Memory Management

Relevant Reading

Chapter 3 of Tanenbaum. These lectures approach memory management in a slightly different order, but the basic content is similar to Tanenbaum.

Outline

The OS memory manager

Conceptually, the memory manager is to physical memory what the process/thread subsystem is to the CPU: it multiplexes a scarce physical resource among many processes. A memory manager provides address translation (protection and relocation) and virtual memory (the illusion of more memory than physically exists). Note that these are two independent ideas, so we study them separately. There are several ways to approach this; we study address translation first and virtual memory second.

Address Translation Mechanisms

Address translation for protection: untranslated addresses allow user programs access to OS memory and other users' memory. An obvious alternative is to implement protection directly in memory hardware (e.g., if the processor is in user mode, it can only access a designated part of memory), but that can be expensive. When multiple processes' address spaces are co-resident in memory, there is also a need for relocation; translation solves that relatively easily.

Generic address translation: hardware assisted, OS managed. MMU does translation based on table loaded by OS. Several approaches have been tried. We discuss these approaches and their advantages and disadvantages.

We already looked at a base-and-bounds approach, which uses the x86 segmented memory with fixed bounds. Idea: each memory reference made by a process is relocated by a base register and checked against a bounds register; the MMU does an add and a compare on every virtual address. What does the OS have to do to load a new program? Find a large enough chunk of memory (which allocation heuristic? what about the fragmentation those heuristics cause and the resulting need for compaction?) and then set the base and bounds appropriately. What about changing the size of a process (extending the heap or stack)? Protection: a process can only access addresses between base and base+bounds. (Obviously, only the OS can change base/bounds.) Problems: sharing (e.g., of program text) is difficult, extending the stack and heap is difficult, and memory management becomes convoluted.
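
As a rough illustration, here is a minimal sketch in C (not modeled on any particular MMU) of the work the hardware performs on every reference: a compare, then an add.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct bb_registers {
        uint32_t base;    /* start of the process's memory in physical memory */
        uint32_t bounds;  /* size of the process's address space              */
    };

    /* Translate a virtual address; a real MMU would raise a protection
       fault to the OS instead of calling exit(). */
    uint32_t translate(const struct bb_registers *r, uint32_t vaddr)
    {
        if (vaddr >= r->bounds) {                 /* compare ... */
            fprintf(stderr, "protection fault at 0x%x\n", (unsigned)vaddr);
            exit(1);
        }
        return r->base + vaddr;                   /* ... then add */
    }

    int main(void)
    {
        struct bb_registers r = { .base = 0x40000, .bounds = 0x10000 };
        printf("vaddr 0x1234 -> paddr 0x%x\n", (unsigned)translate(&r, 0x1234));
        return 0;
    }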

One way to do sharing is to allow each process' virtual address space to be divided into segments; then we can load different segments at different places in physical memory. What happens at the gaps between segments? If a program accesses a gap, it is an error, unless it is the stack segment, which the OS grows on demand. What does the OS do when loading a new program? The hardware needs a segment table, or a pointer to a location in memory containing the segment table. Example of segmentation. Advantages: code segments can be shared easily, but we may need fancier memory allocation schemes. Efficient for sparse address spaces (only as many entries as there are segments). Basic problem: growing segments, such as the stack.
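
A minimal sketch of segment-table translation (the 4-bit segment field and 28-bit offset split is an illustrative assumption, not taken from any particular machine):

    #include <stdint.h>

    #define NSEGS 16

    struct segment {
        uint32_t base;    /* where the segment starts in physical memory   */
        uint32_t length;  /* size of the segment; accesses beyond it fault */
        int      valid;   /* gaps in the address space are invalid entries */
    };

    /* Returns the physical address, or -1 to stand in for a fault. */
    int64_t seg_translate(const struct segment segtab[NSEGS], uint32_t vaddr)
    {
        uint32_t segno  = vaddr >> 28;            /* top 4 bits pick the segment   */
        uint32_t offset = vaddr & 0x0FFFFFFF;     /* remaining bits are the offset */

        if (!segtab[segno].valid || offset >= segtab[segno].length)
            return -1;                            /* error: access to a gap */
        return (int64_t)segtab[segno].base + offset;
    }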

Paging. Divide memory up into fixed-size pages, then represent the entire virtual address space as a collection of pages. Allocation is in units of pages (so we can use a simple bitmap; we couldn't do that with segments, since a segment required contiguous physical memory), and we can still share code between processes. The hardware needs two registers: a pointer to the page table and the page table size. Example of a paged address space. The size of a page is crucial: small pages increase the size of the table, large pages suffer from internal fragmentation. Easy to share, but inefficient for sparse, large address spaces.
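
A minimal sketch of single-level paging with 4 KB pages, plus the bitmap frame allocator mentioned above (the sizes are illustrative):

    #include <stdint.h>

    #define PAGE_SHIFT 12                         /* 4 KB pages */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NFRAMES    1024

    struct pte { uint32_t frame : 20, valid : 1; };

    /* Translate using the page table whose base and size the MMU registers hold. */
    int64_t page_translate(const struct pte *ptbl, uint32_t ptbl_size, uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & (PAGE_SIZE - 1);

        if (vpn >= ptbl_size || !ptbl[vpn].valid)
            return -1;                            /* page fault */
        return ((int64_t)ptbl[vpn].frame << PAGE_SHIFT) | offset;
    }

    /* Frame allocation reduces to finding a clear bit in a bitmap. */
    static uint8_t frame_bitmap[NFRAMES / 8];

    int alloc_frame(void)
    {
        for (int f = 0; f < NFRAMES; f++)
            if (!(frame_bitmap[f / 8] & (1 << (f % 8)))) {
                frame_bitmap[f / 8] |= 1 << (f % 8);
                return f;                         /* frame f is now allocated */
            }
        return -1;                                /* out of physical memory */
    }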

Efficiency in sparse address spaces: use multi-level translation, where the lowest level is paged and the higher levels are segmented. In a two-level scheme, part of the address selects a segment, and another part of the address indexes that segment's page table. Segment tables are small; the page tables have to be stored in memory. Example of paged segmentation: easy memory allocation, sharing, and efficiency, at the cost of an additional lookup.
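
A minimal sketch of such a two-level lookup (the field widths are illustrative assumptions):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    struct pte   { uint32_t frame : 20, valid : 1; };
    struct segte { struct pte *ptbl; uint32_t npages; int valid; };

    /* Two-level lookup: segment table entry -> page table -> frame. */
    int64_t two_level_translate(const struct segte *segtab, uint32_t vaddr)
    {
        uint32_t segno  = vaddr >> 28;                     /* bits 31..28 */
        uint32_t vpn    = (vaddr >> PAGE_SHIFT) & 0xFFFF;  /* bits 27..12 */
        uint32_t offset = vaddr & PAGE_MASK;               /* bits 11..0  */

        const struct segte *s = &segtab[segno];
        if (!s->valid || vpn >= s->npages)
            return -1;                                     /* segment fault */
        if (!s->ptbl[vpn].valid)
            return -1;                                     /* page fault */
        return ((int64_t)s->ptbl[vpn].frame << PAGE_SHIFT) | offset;
    }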

Such a scheme has one page table per segment, so there can be many page tables. For a reason we study later, we can actually put the page tables themselves in memory pages of a separate segment (called the system page table), which is translated but inaccessible to users. The system page table is in physical memory. Resolving a single address can then take up to three memory references.

Improving the performance of address translation

Caching principles
Caching is a very commonly used performance improvement principle: keep recently accessed data around so we can quickly access the copy. This motivates the development of a memory hierarchy: the upper levels are expensive per byte but fast, so we keep large, less frequently accessed data lower in the hierarchy and more frequently/recently accessed data in faster memory.

Terms: hit ratio and expected access time. How often do you find things in the cache? What is the expected access time? More things in the cache means a higher hit ratio but a longer access time on a hit: engineering tradeoffs. The crucial observation about program behavior that lets us get high hit rates from small caches is locality (both temporal and spatial).
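
For example (a standard formula; the symbols are chosen here and do not appear in the notes): with hit ratio h, cache access time T_cache, and next-level access time T_mem, the expected access time is

    E[T_{access}] = T_{cache} + (1 - h) \cdot T_{mem}

So with h = 0.98, T_cache = 1 ns, and T_mem = 100 ns, the expected access time is 1 + 0.02 * 100 = 3 ns, much closer to the speed of the cache than to that of memory.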

Translation Lookaside Buffers
Caching can be applied to address translation. Recall that resolving one virtual address took 2-3 memory accesses. If we cache the virtual-to-physical mapping in the MMU, most references avoid those extra accesses, and TLB access times are small. How do we find out whether there is a hit or a miss? That depends on the TLB organization: similar issues arise as with instruction and data caches in processor architectures.

What does a TLB entry contain? Part of the virtual page number (the tag), the corresponding physical page number, and a valid/invalid bit. We can have direct-mapped TLBs: basically a hash table indexed by some part of the virtual page number. There can be hash collisions, so the choice of hash function is critical. To reduce collisions, we can have set-associative TLBs. Replacement algorithms for the set-associative case come later.
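
A minimal sketch of a direct-mapped TLB lookup (sizes and the simple low-bits hash are illustrative assumptions, not a real design):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64
    #define PAGE_SHIFT  12

    struct tlb_entry { uint32_t tag, frame; bool valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a hit and fills *frame; false means a TLB miss, and the
       translation must come from the page table instead. */
    bool tlb_lookup(uint32_t vaddr, uint32_t *frame)
    {
        uint32_t vpn   = vaddr >> PAGE_SHIFT;
        uint32_t index = vpn % TLB_ENTRIES;       /* the "hash": low VPN bits    */
        uint32_t tag   = vpn / TLB_ENTRIES;       /* rest of VPN kept as the tag */

        if (tlb[index].valid && tlb[index].tag == tag) {
            *frame = tlb[index].frame;
            return true;                          /* hit */
        }
        return false;                             /* miss (or collision) */
    }

    void tlb_insert(uint32_t vpn, uint32_t frame)
    {
        uint32_t index = vpn % TLB_ENTRIES;       /* a collision simply evicts */
        tlb[index] = (struct tlb_entry){ vpn / TLB_ENTRIES, frame, true };
    }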

How does the TLB get populated? Depending on the hardware, either the MMU walks the page table itself on a miss, or (with a software-managed TLB) it traps to the OS, which fills in the translation. The OS can also flush the entire TLB or parts of it (hardware dependent). When must the TLB change? On a context switch (unless we can associate a process ID with each entry), and whenever the virtual-to-physical mapping changes (e.g., when a page is added or removed).

Demand Paged Virtual Memory

So far, we have assumed that the result of a translation is always in physical memory. This is not always necessary, given that most programs exhibit spatial locality: at any given instant, a program's instruction and data references tend to be only to a small, slowly changing working set. (We already argued that programs exhibit temporal locality in referencing virtual addresses.)

Because of this locality, can use physical memory to cache disk-based program code and data. Main benefit: illusion of large virtual address space, allowing OS to run more programs than will simultaneously fit in memory.

Demand Paging Algorithm
To implement demand paging, add an extra bit to each page table entry (the valid bit) to indicate whether the corresponding page frame is in memory or not. If the page frame corresponding to a page is not in memory (invalid), then:

1. The hardware traps to the OS (a page fault).
2. The OS finds a free page frame; if none is free, it picks a resident page to replace, writing it back to disk first if it has been modified.
3. The OS reads the faulting page in from disk into that frame.
4. The OS updates the page table entry to point at the frame and marks it valid.
5. The faulting thread is restarted at the faulting instruction.
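
These steps can be sketched roughly in C; all the helper names here (find_free_frame, choose_victim, frame_to_vpn, read_page_from_disk, write_page_to_disk) are hypothetical and exist only for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    struct pte { uint32_t frame; bool valid, dirty; };

    extern struct pte *page_table;                /* current process's table */
    extern int  find_free_frame(void);            /* -1 if memory is full    */
    extern int  choose_victim(void);              /* the replacement policy  */
    extern uint32_t frame_to_vpn(int frame);      /* reverse (frame) map     */
    extern void read_page_from_disk(uint32_t vpn, int frame);
    extern void write_page_to_disk(uint32_t vpn, int frame);

    void page_fault_handler(uint32_t faulting_vaddr)
    {
        uint32_t vpn = faulting_vaddr >> 12;

        int frame = find_free_frame();
        if (frame < 0) {                          /* no free frame: evict one */
            frame = choose_victim();
            uint32_t victim = frame_to_vpn(frame);
            if (page_table[victim].dirty)         /* write back if modified   */
                write_page_to_disk(victim, frame);
            page_table[victim].valid = false;
        }

        read_page_from_disk(vpn, frame);          /* bring the faulting page in */
        page_table[vpn] = (struct pte){ (uint32_t)frame, true, false };
        /* on return from the trap, the faulting instruction is restarted */
    }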

On some machines, such as the MIPS, the MMU only has a software loaded TLB. When a fault happens with this TLB, we do a similar sequence of operations.

Initially, all of a program's pages are faulted in; page faults are handled transparently. Two issues: (1) restarting a thread faulted in the middle of a complex instruction, and (2) what algorithm is best for page replacement, and how to implement that algorithm quickly (remember, the algorithm runs in the interrupt handler).

Restarting Faulted Thread
Several problems. Some instructions have multiple effects, e.g. updating a memory location and changing the value of a register; if such an instruction faults midway, how do we restart it? Some processors pipeline instructions: what happens if a page fault occurs midway through an instruction? Complex memory-move instructions pose similar problems.

Page Replacement Policies
Different caching mechanisms can use different caching policies. For TLBs we saw that in some cases, the replacement was determined by the choice of the TLB organization.

What are the goals of a page replacement policy? The policy should minimize page faults globally. Can this be achieved? As we will see, it is a bit like trying to find an optimal scheduling policy.

Will a random page replacement policy work? Another possible policy is FIFO, which keeps each page in memory for the same number of page faults. FIFO doesn't distinguish usage: heavily used pages are as likely to be thrown out as rarely used ones.

Key observation: some page faults are inherent because of the fixed memory size; others are caused because a page was replaced and then immediately re-used. Optimal page replacement? MIN: replace the page that will next be referenced furthest in the future. Approximation to MIN: LRU, replace the page that was used furthest in the past. Examples comparing the performance of FIFO, MIN, and LRU. Is the number of page faults always reduced if physical memory is increased (Belady's anomaly)?
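
A small, self-contained experiment for comparing FIFO and LRU on the same reference string (the reference string and frame count are chosen here for illustration):

    #include <stdio.h>
    #include <string.h>

    #define FRAMES 3

    /* Counts page faults for a reference string under FIFO (use_lru == 0) or
       LRU (use_lru == 1), using FRAMES physical page frames. */
    static int simulate(const int *refs, int n, int use_lru)
    {
        int frames[FRAMES], stamp[FRAMES], faults = 0;
        memset(frames, -1, sizeof frames);        /* all frames start free */
        memset(stamp, 0, sizeof stamp);

        for (int t = 0; t < n; t++) {
            int hit = -1;
            for (int f = 0; f < FRAMES; f++)
                if (frames[f] == refs[t]) hit = f;

            if (hit >= 0) {                       /* LRU refreshes the stamp on  */
                if (use_lru) stamp[hit] = t;      /* every use; FIFO keeps the   */
                continue;                         /* original load time          */
            }

            faults++;
            int victim = -1;
            for (int f = 0; f < FRAMES; f++)      /* prefer a free frame */
                if (frames[f] == -1) { victim = f; break; }
            if (victim < 0) {                     /* otherwise evict oldest stamp */
                victim = 0;
                for (int f = 1; f < FRAMES; f++)
                    if (stamp[f] < stamp[victim]) victim = f;
            }
            frames[victim] = refs[t];
            stamp[victim] = t;
        }
        return faults;
    }

    int main(void)
    {
        int refs[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
        int n = sizeof refs / sizeof refs[0];
        printf("FIFO faults: %d\n", simulate(refs, n, 0));
        printf("LRU  faults: %d\n", simulate(refs, n, 1));
        return 0;
    }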

LRU Implementation
Best possible implementation: keep a last-use timestamp per page and order pages by timestamp. Issues: entry size, and the need to update a page's timestamp on every memory reference. Not really feasible.

Most UNIX systems implement something called the clock algorithm, which replaces a page referenced sometime in the past, not necessarily the oldest such page. Suppose the hardware supports a use bit per physical page frame, which is set every time the page frame is referenced. If the use bit is not set, the page has not been referenced in a while. On a page fault, the clock algorithm sweeps the page frames (imagine the page frames arranged in a circle) to find the first page whose use bit is not set (if the use bit is set, it clears it and moves on).

Will it loop indefinitely? No: after at most one full sweep it will find an unused page. What determines how quickly the hand moves? Basically, the use bit partitions pages into young and old, and some old page gets replaced.
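
A minimal sketch of the basic clock sweep (the frame descriptor layout is invented for illustration):

    #include <stdbool.h>

    #define NFRAMES 1024

    struct frame_desc { bool used; /* plus owner, virtual page, dirty, ... */ };

    static struct frame_desc frames[NFRAMES];
    static int hand;                              /* current clock hand position */

    /* Sweep the (conceptually circular) array of frames, clearing use bits,
       until a frame whose use bit is already clear is found.  Cannot loop
       forever: after one full sweep every use bit has been cleared. */
    int clock_replace(void)
    {
        for (;;) {
            if (!frames[hand].used) {             /* not referenced recently */
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            frames[hand].used = false;            /* clear the use bit, move on */
            hand = (hand + 1) % NFRAMES;
        }
    }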

Variant: in addition to the use bit, maintain a count per page. Don't replace a page, even if its use bit is not set, until the clock hand has swept past it N times; this finds a page after at most N sweeps in the worst case. How to select N: a large N comes close to keeping exact timestamps, but it can take a long time to find a page. Another variant: give dirty pages an additional chance.

Other Implementation Issues
Four bits per page: read-only, modified, use, and valid. Can we reduce the number of bits? (Remember, these bits need hardware support, and some hardware may not provide them.)

We can emulate the modified bit using the read-only bit. Maintain two lists: a list of modified page frames and a list of all page frames. Initially, a page frame starts out read-only. On a write access, a protection trap happens, the page is inserted in the modified list (logically marked modified), and it is made writable. The clock algorithm checks whether a page is in the modified list instead of reading a hardware bit.
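
A minimal sketch of this emulation (using a per-frame flag instead of the two lists, which is equivalent; set_page_readonly and set_page_writable are hypothetical hardware-interface names):

    #include <stdbool.h>
    #include <stdint.h>

    #define NFRAMES 1024

    extern void set_page_readonly(uint32_t frame);   /* hypothetical MMU calls */
    extern void set_page_writable(uint32_t frame);

    static bool logically_modified[NFRAMES];         /* the OS's "modified list" */

    /* Pages start out mapped read-only, even if the program may write them. */
    void frame_loaded(uint32_t frame)
    {
        logically_modified[frame] = false;
        set_page_readonly(frame);
    }

    /* Called on the protection trap caused by a write to a read-only page. */
    void write_protection_fault(uint32_t frame)
    {
        logically_modified[frame] = true;            /* logically mark it modified */
        set_page_writable(frame);                    /* future writes won't trap   */
    }

    /* The clock algorithm asks this instead of reading a hardware bit. */
    bool frame_is_modified(uint32_t frame)
    {
        return logically_modified[frame];
    }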

In the same way, we can emulate the use bit with the valid bit: mark a resident page invalid, and on the resulting fault mark it as used and make it valid again.

We need a table of page frame descriptors for the clock algorithm to work.

Thrashing
Each program references only a small subset of its pages at any instant. If the pages currently required by all programs together don't fit into memory, we can have thrashing, characterized by frequent page faults caused by replacement of pages that are immediately needed again. Here caching breaks down differently: each program has locality, but there are too many programs, and the average memory reference time goes up. There is no way to know this a priori.

Users can modify program behavior by rewriting programs or splitting up tasks. How can the OS detect this? Basic idea of a working set: the set of pages that a process currently needs. If the combined working sets of all processes come close to filling physical memory, stop forking new processes. How to determine the working set: the pages referenced over a time interval T. What factors determine the selection of T?
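
A minimal sketch of one way to estimate a working set by sampling use bits every T ticks (the structures are invented for illustration):

    #include <stdbool.h>

    struct page { bool resident, used; };

    /* Called every T clock ticks: the estimated working set is the number of
       resident pages referenced since the previous sample; clearing the use
       bits starts a fresh interval. */
    int sample_working_set(struct page *pages, int npages)
    {
        int wss = 0;
        for (int i = 0; i < npages; i++) {
            if (pages[i].resident && pages[i].used) {
                wss++;
                pages[i].used = false;
            }
        }
        return wss;
    }
    /* If the working set sizes of all processes together approach the number
       of physical page frames, the OS should stop starting new processes (or
       swap some out) to avoid thrashing. */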