Generic address translation: hardware-assisted, OS-managed. The MMU translates addresses based on tables loaded by the OS. Several approaches have been tried; we discuss each along with its advantages and disadvantages.
Already looked at a base-and-bounds approach, which uses the x86 segmented memory (fixed bounds). Idea: each memory reference a process makes is checked against a bounds register and offset by a base register; the MMU does one compare and one add per virtual address (see the sketch below). What does the OS have to do to load a new program? Find a large enough chunk of memory (what heuristics? what about fragmentation, the effects of those heuristics, and the eventual need for compaction?) and then set the base and bounds appropriately. What about changing the size of a process (extending the heap or stack)? Protection: a process can only access addresses between base and base+bounds. (Obviously, only the OS can change base/bounds.) Problems: sharing, even of text only, is difficult; extending the stack and heap is difficult; memory management is convoluted.
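A minimal sketch of the per-reference check, assuming hypothetical register arguments and a hypothetical trap hook:

    #include <stdint.h>

    extern void raise_protection_fault(void);  /* hypothetical trap to the OS */

    /* One base-and-bounds translation, as the MMU performs it on
       every memory reference. */
    uint32_t translate(uint32_t vaddr, uint32_t base, uint32_t bounds)
    {
        if (vaddr >= bounds)          /* compare: outside the process? */
            raise_protection_fault();
        return vaddr + base;          /* add: relocate into physical memory */
    }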
One way to do sharing is to divide each process' virtual address space into segments, which can then be loaded at different places in physical memory. What happens at the gaps between segments? If a program accesses a gap, that is an error, unless it is the stack segment, which the OS grows on demand. What does the OS do when loading a new program? The hardware needs a segment table, or a pointer to a location in memory containing the segment table (see the sketch below). Example of segmentation. Advantages: sharing of code segments is easy, but we may need fancy memory allocation schemes. Efficient for sparse address spaces (only as many entries as there are segments). Basic problem: growing segments such as the stack.
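A sketch of the segment-table lookup; the 4-bit/28-bit address split and the trap hook are illustrative assumptions, not fixed by anything above:

    #include <stdint.h>

    extern void raise_segfault(void);  /* hypothetical trap to the OS */

    struct segment { uint32_t base, limit; int valid; };

    /* Top 4 bits pick the segment, the remaining 28 are an offset. */
    uint32_t translate(uint32_t vaddr, struct segment *segtab, uint32_t nsegs)
    {
        uint32_t seg = vaddr >> 28;
        uint32_t off = vaddr & 0x0FFFFFFFu;
        if (seg >= nsegs || !segtab[seg].valid)
            raise_segfault();          /* reference into a gap */
        if (off >= segtab[seg].limit)
            raise_segfault();          /* beyond the segment's bounds */
        return segtab[seg].base + off;
    }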
Paging. Divide up memory into fixed-size pages, then represent the entire virtual address space as a collection of pages. Allocation is in units of pages (a simple bitmap suffices; we couldn't do that with segments, since a segment needed contiguous physical memory), and sharing (e.g., of code) still works. The hardware needs two registers: a pointer to the page table and the page table size (see the sketch below). Example of a paged address space. The page size is crucial: small pages inflate the table, large pages suffer from internal fragmentation. Easy to share; inefficient for sparse, large address spaces.
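A sketch of one-level paged translation; the 4 KB page size and PTE layout (physical page number in the high bits, flags in the low bits) are assumptions for illustration:

    #include <stdint.h>

    extern void raise_page_fault(void);  /* hypothetical trap to the OS */

    #define PAGE_SHIFT 12                /* 4 KB pages */
    #define PAGE_MASK  0xFFFu
    #define PTE_VALID  0x1u              /* assumed PTE flag layout */

    /* Split the virtual address into virtual page number (VPN) and
       offset, look the VPN up, splice the physical page number onto
       the unchanged offset. */
    uint32_t translate(uint32_t vaddr, const uint32_t *page_table,
                       uint32_t table_size)
    {
        uint32_t vpn = vaddr >> PAGE_SHIFT;
        uint32_t off = vaddr & PAGE_MASK;
        if (vpn >= table_size || !(page_table[vpn] & PTE_VALID))
            raise_page_fault();
        return (page_table[vpn] & ~PAGE_MASK) | off;
    }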
Efficiency in sparse address spaces: use multi-level translation, where the lowest level is paged and the higher levels are segmented. In a two-level scheme, part of the address selects a segment, and another part indexes that segment's page table (see the sketch below). Segment tables are small; page tables have to be stored in memory. Example of paged segmentation: easy memory allocation, sharing, and efficiency, at the cost of an additional lookup.
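A sketch of the two-level lookup; the 4/16/12 bit split (16 segments) and the descriptor layout are illustrative assumptions:

    #include <stdint.h>

    extern void raise_segfault(void);
    extern void raise_page_fault(void);

    #define PTE_VALID 0x1u

    /* Per-segment descriptor: points at that segment's page table,
       which itself lives in main memory. */
    struct segment {
        uint32_t *page_table;
        uint32_t  pt_length;    /* number of pages in the segment */
    };

    uint32_t translate(uint32_t vaddr, struct segment *segtab /* 16 entries */)
    {
        uint32_t seg = vaddr >> 28;             /* segment number */
        uint32_t vpn = (vaddr >> 12) & 0xFFFFu; /* page within segment */
        uint32_t off = vaddr & 0xFFFu;          /* offset within page */
        if (vpn >= segtab[seg].pt_length)
            raise_segfault();
        uint32_t pte = segtab[seg].page_table[vpn];  /* a memory reference */
        if (!(pte & PTE_VALID))
            raise_page_fault();
        return (pte & ~0xFFFu) | off;
    }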
Such a scheme has one page table per segment, so there can be many page tables. For a reason we study later, we can actually put the page tables in memory pages in a separate segment (called the system page table), which is itself translated but inaccessible to users. The system page table is in physical memory. Resolving a single address can take up to three memory references.
Terms: hit ratio and expected access time. How often do you find things in the cache? What is the expected access time? The more entries a cache holds, the higher its hit ratio, but the longer each hit takes: engineering tradeoffs. The crucial observation about program behavior that lets us get high hit rates from small caches: locality, both temporal and spatial.
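A back-of-the-envelope calculation, with made-up numbers: a 1 ns TLB probe, a 100 ns memory access to walk the table on a miss, and a 98% hit ratio h:

    expected access time = h * t_hit + (1 - h) * t_miss
                         = 0.98 * 1 ns + 0.02 * (1 ns + 100 ns)
                         = 0.98 ns + 2.02 ns
                         = 3 ns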
What does a TLB entry contain? Part of the virtual page number, the corresponding physical page number, and a valid/invalid bit. A direct-mapped TLB is basically a hash table indexed by some part of the virtual page number (see the sketch below). Hash collisions are possible, so the choice of hash function is critical. To reduce collisions, TLBs can be made set-associative. Replacement algorithms for the set-associative case come later.
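A sketch of a direct-mapped lookup; the size and field layout are illustrative, not any particular machine's:

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_SIZE 64

    struct tlb_entry {
        uint32_t vpn;       /* tag: (part of) the virtual page number */
        uint32_t ppn;       /* corresponding physical page number */
        bool     valid;
    };

    static struct tlb_entry tlb[TLB_SIZE];

    /* The low bits of the VPN hash straight to a slot, so two VPNs
       sharing those bits collide and evict each other. */
    bool tlb_lookup(uint32_t vpn, uint32_t *ppn_out)
    {
        struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
        if (e->valid && e->vpn == vpn) {
            *ppn_out = e->ppn;
            return true;    /* hit */
        }
        return false;       /* miss */
    }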
How does the TLB get populated? When the MMU incurs a TLB miss, we trap to the OS, which fills in the translation. The OS can also flush the entire TLB or parts of it (hardware dependent). When does the TLB change? On a context switch! (Unless we can associate a process ID with each entry.) Also whenever a virtual mapping changes (e.g., when a page is added or removed).
Because of this locality, physical memory can be used to cache disk-resident program code and data. Main benefit: the illusion of a large virtual address space, allowing the OS to run more programs than will simultaneously fit in memory.
On some machines, such as the MIPS, the MMU has only a software-loaded TLB. When this TLB misses, the OS performs a similar sequence of operations (sketched below).
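A sketch of a software refill handler in the MIPS style; current_page_table, page_fault(), and tlb_write_random() are hypothetical stand-ins (on a real MIPS, the tlbwr instruction does the TLB write):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PTE_VALID  0x1u

    extern uint32_t *current_page_table;         /* hypothetical */
    extern void page_fault(uint32_t bad_vaddr);  /* hypothetical */
    extern void tlb_write_random(uint32_t vpn, uint32_t pte);

    /* Run on every TLB miss: look the PTE up in software, write it
       into a TLB slot, return, and let the access retry. */
    void tlb_refill(uint32_t bad_vaddr)
    {
        uint32_t vpn = bad_vaddr >> PAGE_SHIFT;
        uint32_t pte = current_page_table[vpn];
        if (!(pte & PTE_VALID))
            page_fault(bad_vaddr);       /* a genuine page fault */
        else
            tlb_write_random(vpn, pte);  /* just a TLB refill */
    }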
Initially, a program's pages are all faulted in on demand; each page fault is handled transparently. Two issues: restarting a thread in the middle of a complex instruction, and page replacement. What algorithm is best for page replacement? How do you implement that algorithm quickly (remember, the algorithm runs in the interrupt handler)?
What are the goals of a page replacement policy? The policy should minimize page faults globally. Can this be achieved? As we will see, it is a bit like trying to find an optimal scheduling policy.
Will a random page replacement policy work? Another possible policy is FIFO, which keeps each page in memory for the same number of page faults. But FIFO doesn't distinguish by usage: heavily used pages are as likely to be thrown out as idle ones.
Key observation: some page faults are inherent in the fixed memory size; others occur because a page was replaced and immediately re-used. Optimal page replacement? MIN: replace the page whose next reference is furthest in the future. Approximation to MIN: LRU, which replaces the page that was used furthest in the past. Examples comparing the performance of FIFO, MIN, and LRU. Does increasing physical memory always reduce the number of page faults? Not under FIFO (Belady's anomaly; see the trace below).
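A concrete trace of Belady's anomaly under FIFO, using the classic reference string (F = fault, h = hit):

    reference string:  1 2 3 4 1 2 5 1 2 3 4 5

    FIFO, 3 frames:    F F F F F F F h h F F h   ->  9 faults
    FIFO, 4 frames:    F F F F h h F F F F F F   -> 10 faults

Adding a frame increases the fault count. LRU and MIN are stack algorithms and cannot exhibit this anomaly; FIFO is not.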
Most UNIX systems implement something called the clock algorithm, which replaces some page not referenced recently, not necessarily the oldest such page. Suppose the hardware supports a use bit per physical page frame, set every time the frame is referenced. If the use bit is not set, the page has not been referenced in a while. On a page fault, the clock algorithm sweeps the page frames (assume the frames are arranged in a circle) to find the first page whose use bit is not set, clearing each set use bit as it passes.
Will it loop indefinitely? No: after at most one full sweep, it reaches a page whose use bit it has already cleared (see the sketch below). What determines how quickly the hand moves? Basically, the use bit partitions pages into young and old; some old page gets replaced.
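A sketch of the sweep over a circular array of frame descriptors; the array size and structure layout are illustrative assumptions:

    #define NFRAMES 1024

    struct frame { int use; };          /* the per-frame use bit */
    static struct frame frames[NFRAMES];
    static int hand;                    /* the clock hand */

    /* Sweep until a frame with a clear use bit turns up, clearing set
       bits along the way; terminates within one full revolution. */
    int clock_evict(void)
    {
        while (frames[hand].use) {
            frames[hand].use = 0;       /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        }
        int victim = hand;              /* not referenced since last sweep */
        hand = (hand + 1) % NFRAMES;
        return victim;
    }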
Variant: in addition to the use bit, maintain a count per page. Even if the use bit is not set, don't replace the page until the clock hand has swept past it N times; in the worst case, a page is found after N sweeps. How to select N? A large N approximates LRU (the count acts like a coarse timestamp) but can take a long time to find a page (see the variant sketched below). Another variant: give dirty pages an additional chance.
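A sketch of the N-th chance variant, extending the clock sketch above (same hand and NFRAMES) with an assumed per-frame count field:

    struct nframe { int use, count; };
    static struct nframe nframes[NFRAMES];

    /* Evict only after the hand has swept past a frame N times with
       its use bit clear. */
    int nth_chance_evict(int N)
    {
        for (;;) {
            struct nframe *f = &nframes[hand];
            if (f->use) {
                f->use = 0;                /* referenced: reset its age */
                f->count = 0;
            } else if (++f->count >= N) {
                int victim = hand;         /* swept past N times: evict */
                f->count = 0;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            hand = (hand + 1) % NFRAMES;
        }
    }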
Can emulate the modified (dirty) bit using the read-only bit. Maintain two lists: a list of modified page frames and a list of all page frames. Initially, a page frame starts out read-only; on a write access, a trap happens and the page is inserted into the modified list (logically marked modified). The clock algorithm checks whether a page is in the modified list.
In the same way, can emulate the use bit with the valid bit (both tricks sketched below).
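A sketch of the write-trap path for dirty-bit emulation; every helper and field name here is hypothetical:

    #include <stdint.h>

    struct frame { int logically_writable; };

    extern struct frame *frame_of(uint32_t vaddr);
    extern void list_add_modified(struct frame *f);
    extern void set_pte_writable(uint32_t vaddr);
    extern void segfault(void);

    /* First write to a logically writable page that is mapped
       read-only: record it as modified, then make the PTE writable so
       subsequent writes run at full speed without trapping. */
    void write_protect_trap(uint32_t vaddr)
    {
        struct frame *f = frame_of(vaddr);
        if (f->logically_writable) {
            list_add_modified(f);    /* = setting the "modified bit" */
            set_pte_writable(vaddr);
        } else {
            segfault();              /* a genuine protection violation */
        }
    }

The same pattern emulates the use bit with the valid bit: map pages invalid, and on the first (faulting) touch mark them used and make them valid.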
For the clock algorithm to work, the OS needs a map of page frame descriptors, one per physical frame.
Users can modify program behavior by rewriting programs or splitting up tasks, but how can the OS itself detect that memory is overcommitted? Basic idea of a working set: the set of pages a process currently needs. If the combined working sets of all processes approach the size of physical memory, stop forking new processes. How to determine the working set: the pages referenced over a time window T (see the sketch below). What factors determine the selection of T?
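A sketch of a working-set estimate, assuming the OS samples and clears use bits on a periodic tick; names are illustrative, and a real system would track this per process rather than globally:

    #include <stdint.h>

    #define NFRAMES 1024

    struct wframe { int use; uint64_t last_ref; };
    static struct wframe wframes[NFRAMES];

    /* A frame is in the working set if it was referenced within the
       last T ticks. */
    int working_set_size(uint64_t now, uint64_t T)
    {
        int ws = 0;
        for (int i = 0; i < NFRAMES; i++) {
            if (wframes[i].use) {            /* referenced since last tick */
                wframes[i].last_ref = now;
                wframes[i].use = 0;
            }
            if (now - wframes[i].last_ref <= T)
                ws++;
        }
        return ws;
    }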