Linux Memory Management

Memory Address

For the x86 architecture there are three types of addresses:

  • Logical Address:
    • This is the address of an instruction or data item as used in machine language.
    • It consists of a segment and an offset, i.e. the distance from the segment's start address.
  • Linear address or Virtual address:
    • A binary number in virtual memory that lets a process use a location in main memory independently of other processes, and use more space than actually exists in primary storage by temporarily relegating some contents to a hard disk or internal flash drive.
  • Physical Address:
    • The address of a memory cell in the computer's RAM.

Need for Virtual Addressing

  • The main memory (RAM) available for a computer is limited.
  • Many processes share common code in libraries.
  • Using virtual addressing, the CPU and kernel give each process the impression that memory is effectively unlimited.

Address Translation

  • Since two of the three address types above are virtual, addresses must be translated from logical to linear and from linear to physical.
  • For this reason each CPU contains a hardware unit called the Memory Management Unit (MMU).
  • Segmentation Unit: converts a logical address to a linear address.
  • Paging Unit: converts a linear address to a physical address.
  • The translation of a linear address is done using two translation tables:
    • Page Directory
    • Page Table

memoryTranslation

Image Credit: https://manybutfinite.com/post/memory-translation-and-segmentation/

Address Translation in Intel x86

address_translation

Linear-Address Translation to a 4-KByte Page using IA-32e Paging

address_translation_format

Formats of CR3 and Paging-Structure Entries with PAE Paging

MMU

MMU

Image Credit: By Mdjango, Andrew S. Tanenbaum (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Segmentation and Paging in Linux

  • Segmentation and paging are somewhat redundant, hence Linux uses segmentation only in a limited way.
  • Segmentation can assign a different linear address space to each process
  • Paging can map the same linear address space into different physical address spaces.

Segments and their usage

  • Each segment is described by an 8-byte Segment Descriptor.
  • These descriptors are defined in Global Descriptor Table (GDT) or in the Local Descriptor Table(LDT).

segments

Image credit: By John Källén (jkl at commons) (Own work) [Public domain], via Wikimedia Commons

Types of Segment Descriptors

  • Code Segment Descriptor
  • Data Segment Descriptor
  • Task State Segment Descriptor

Segmentation in Linux

linuxFlexibleAddressSpaceLayout

image credit https://manybutfinite.com/post/anatomy-of-a-program-in-memory/

Data structures for segmentation

  • GDT Global Descriptor Table
    • There is one GDT per CPU in Linux.
    • There are 18 descriptors in each GDT for various purposes as follows:
      • Descriptors for Kernel code and User code.
      • Task State Segment (TSS)
      • default Local Descriptor Table (LDT)
      • Thread-Local Storage (TLS) segments
      • Advanced Power Management (APM)
      • Plug and Play (PnP) BIOS services
      • Special segment used for handling exceptions.

 

Image Credit: Lars H. Rohwedder (User:RokerHRO) – self-made work

Paging


  • The paging unit translates linear addresses into physical ones.
  • For efficiency, linear addresses are grouped into fixed-length intervals called pages. The contiguous linear addresses within a page are mapped to contiguous physical addresses.
  • Page frames: main memory is divided into fixed-length page frames. Each page frame can hold one page; a page is simply a block of data that may reside in memory or on disk.
  • Page tables: the data structures that map linear addresses to physical addresses.
  • Page Size: 4 KB
  • Huge Pages: 2 MB and 1 GB.
  • Linear address (32 bit) = Directory (10 bits) + Page Table (10 bits) + Offset (12 bits)
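
The 10/10/12 split above can be illustrated with a few bit operations. The sketch below is only illustrative; the macro names are made up and are not the kernel's own.

#include <stdio.h>
#include <stdint.h>

/* Illustrative field extraction for 32-bit x86 paging with 4 KB pages:
 * bits 31-22 index the Page Directory, bits 21-12 index the Page Table,
 * bits 11-0 are the offset within the page. Macro names are hypothetical. */
#define PGDIR_INDEX(addr)    (((addr) >> 22) & 0x3ff)
#define PGTBL_INDEX(addr)    (((addr) >> 12) & 0x3ff)
#define PAGE_OFFSET_OF(addr) ((addr) & 0xfff)

int main(void)
{
    uint32_t linear = 0x08049a20;   /* an arbitrary example address */

    printf("directory index: %u\n", PGDIR_INDEX(linear));
    printf("page table index: %u\n", PGTBL_INDEX(linear));
    printf("offset in page:   %u\n", PAGE_OFFSET_OF(linear));
    return 0;
}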

Physical Address Extension (PAE)

  • In order to use more than 4 GB of memory, Intel widened physical addresses to 36 bits, effectively supporting more physical memory (up to 64 GB) even though linear addresses remain 32 bits.

Caching in Hardware

  • Modern microprocessors support at least three levels of caches.
  • Caches exploit the principles of spatial locality and temporal locality.
  • A cache is divided into cache lines, generally of 64 bytes.
  • Most caches are N-way set associative.
  • The cache unit resides between the paging unit and the main memory.
  • It includes both a hardware cache memory and a cache controller.
  • Each cache line has a tag and some flags that store the status of the line.
  • The CPU first looks for an address in the cache before going to main memory.
  • Cache flushing is done using the write-back mechanism, which is more efficient: the CPU updates only the cache entry, and main memory is updated eventually.

Translation Lookaside Buffers (TLB)

  • This is a cache that stores recently translated linear-to-physical address mappings.
  • The address translation unit first looks in the TLB for the physical address of a given linear address; if it is not found, the hardware walks the page tables to find the page.

xeon_cach_structure

Process page tables

  • Linux uses four levels of page tables:
  • Page Global Directory
  • Page Upper Directory
  • Page Middle Directory
  • Page Table
  • Each process has its own Page Global Directory and its own set of Page Tables.
  • Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either User or Kernel Mode.
  • Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in Kernel Mode.
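
The result of this page-table walk can be observed from user space through /proc/<pid>/pagemap: each virtual page has a 64-bit entry whose bits 0-54 hold the page frame number and whose bit 63 says whether the page is present. A minimal sketch follows; note that on recent kernels reading the PFN requires root privileges, otherwise the kernel reports it as 0.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

/* Look up the physical frame backing a virtual address via /proc/self/pagemap.
 * Each entry is 8 bytes; bit 63 = page present, bits 0-54 = page frame number. */
int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = malloc(page_size);
    buf[0] = 1;                                  /* touch it so it is mapped */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t entry = 0;
    off_t index = (uintptr_t)buf / page_size;    /* one 8-byte entry per page */
    pread(fd, &entry, sizeof(entry), index * sizeof(entry));

    if (entry & (1ULL << 63))                    /* present bit */
        printf("virtual %p -> PFN 0x%llx\n", (void *)buf,
               (unsigned long long)(entry & ((1ULL << 55) - 1)));
    else
        printf("page not present\n");

    close(fd);
    free(buf);
    return 0;
}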

Non Uniform Memory Access

NUMA is a shared memory architecture used in today’s multiprocessing systems. Each CPU is assigned its local memory and can access memory from other CPUs in the system. Local memory access provides the best performance; it provides low latency and high bandwidth. Accessing memory that is owned by the other CPU has a performance penalty, higher latency, and lower bandwidth.

  • Access to local memory is fast; remote memory accesses have higher latency.
  • Practically all multi-socket systems have NUMA
  • Most servers have 1 NUMA node / socket
  • Some AMD systems have 2 NUMA nodes / socket
  • Sometimes optimal performance still requires manual tuning.
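
A hedged sketch of how a program can query the NUMA topology and place memory on a specific node using libnuma (link with -lnuma). The node number and allocation size here are arbitrary example values.

#include <stdio.h>
#include <numa.h>      /* libnuma; link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {            /* kernel or hardware without NUMA support */
        printf("NUMA not available\n");
        return 1;
    }

    printf("highest NUMA node: %d\n", numa_max_node());
    printf("CPU 0 belongs to node %d\n", numa_node_of_cpu(0));

    /* Allocate 4 MB of memory bound to node 0 (arbitrary example values). */
    size_t size = 4UL << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf) {
        /* ... use the memory: it will be backed by node-0 page frames ... */
        numa_free(buf, size);
    }
    return 0;
}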

Typical 4 node NUMA system.

Screen Shot 2018-03-14 at 7.51.21 PM

Image credit: https://videos.cdn.redhat.com/summit2015/presentations/15284_performance-analysis-tuning-of-red-hat-enterprise-linux.pdf

Processors and Memory layout in NUMA system

Screen Shot 2018-03-14 at 11.53.58 PM

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

QPI

Screen Shot 2018-03-14 at 11.58.39 PM

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

To see the NUMA nodes and which CPUs belong to each node:

Screen Shot 2018-03-14 at 8.02.02 PM

To see the CPUs under NUMA

Screen Shot 2018-03-14 at 8.05.09 PM

NUMA system represented on my machine.

NUMAf

NUMA related tools

Screen Shot 2018-03-14 at 11.00.34 PM

Numa Settings

  • Turn off NUMA balancing: echo 0 > /proc/sys/kernel/numa_balancing
  • NUMA locality (hit vs miss, local vs foreign)
    • Number of NUMA faults & page migrations
    • /proc/vmstat numa_* fields
  • Location of process memory in NUMA system
    • /proc/<pid>/numa_maps
  • Numa scans, migrations & numa faults by node
    • /proc/<pid>/sched

 

More info here

 

Superscalar Processor Design

Superscalar processors are designed to fetch and issue multiple instructions every machine cycle, whereas scalar processors fetch and issue a single instruction per machine cycle.

ISA

  • The instruction set architecture provides a contract between software and hardware, i.e. between the program and the machine.
  • The ISA is an abstraction over the hardware implementation; programs can be written with knowledge of the ISA alone.
  • The ISA ensures portability.
  • For hardware developers, the ISA is a specification.
  • The set of instructions defined by the ISA forms the assembly language.
  • Dynamic-static interface: the separation between what can be done statically (at compile time) and what must be done dynamically (at runtime).

Processor performance is measured as CPI, cycles per instruction. The following approaches trade off CPI, instruction count, and cycle time:

  • Executing multiple instructions per cycle using pipelining. The deeper the pipeline, the higher the branch misprediction penalty, since the processor has to flush the pipeline and refill it with new instructions. A deeper pipeline also increases hardware and latency overhead.
  • Decreasing the instruction count by moving complexity into hardware may increase cycle time.

Stages of Execution in Scalar Pipelined Processors

procsimu-dlx-pipeline
Fetch, instruction decode, execute, memory, and writeback stages in a scalar pipelined processor.

Image credit: http://www.oberle.org/procsimu-index.html

Screen Shot 2018-03-17 at 11.31.11 PM
The CRAY-1 is an example of a superpipelined processor.
Screen Shot 2018-03-17 at 11.38.49 PM
Superscalar Instruction level parallelism machines
Screen Shot 2018-03-17 at 11.38.39 PM
Very Long Instruction Word (VLIW) machines

Stages in scalar Pipelined processors

  1. Fetch: Since main memory is slow, the fetch stage is divided into two or more sub-stages. This ensures that once data/instructions start flowing into the processor, more than one sub-stage executes in parallel. But since all later stages depend on this stage, the pipeline stalls until data/instructions become available here. This stage is part of the in-order front end.
    1. A superscalar processor can fetch more than one instruction in parallel.
  2. Decode: The instructions are broken down into micro-instructions (micro-ops) at this level. Various caching and optimization techniques are used to complete this stage faster. This stage is also part of the in-order front end.
    1. For CISC processors this stage is complex and is itself divided into multiple sub-stages.
    2. Since decoding is extremely complex, predecoding is often implemented.
  3. Dispatch: decoded instructions are handed to the functional units, possibly out of order, through the dispatch buffer.
  4. Execution: the execution units of the processor provide the most parallelism and form the out-of-order execution stage. Intel x86 processors have multiple ALUs, FPUs, and vector units in this stage.
  5. Complete: results are collected, in any order, in the completion buffer.
  6. Retire (Writeback): the results of execution are written back to the registers in program order.

640px-MIPS_Architecture_(Pipelined).svg

Image credit: By Inductiveload – Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=5769084

The instruction types are

  1. Arithmetic Operations
  2. Load/store – data movement operations between memory, caches and registers.
  3. Branch Instructions

Superscalar Pipelines

0708.cs-323016
Image credit: http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323009.html

This is an example of a superscalar processor design. The pipeline is not only deep but also parallel.

Dynamic Pipelines

figure1
image credit: http://www.cs.utah.edu/~wyman/classes/arch/athlon.html
  • Buffers are used to hold data between the stages of a pipelined design.
  • In parallel pipelined processors, multi-entry buffers are used.
  • A parallel pipelined design that supports out-of-order execution of instructions is called a dynamic pipeline.
  • Complex multi-entry buffer designs allow instructions to flow in a different order.
    • The first buffer in the pipeline is the dispatch buffer. It receives instructions from the program in order but can dispatch them out of order to the functional units.
    • A similar buffer, the completion buffer, sits at the back end of the pipeline. It can receive computation results in any order, but it retires instructions in order and passes the results to the writeback stage.

Buffers in superscalar design

haswell-5
Image credit: https://www.realworldtech.com/haswell-cpu/6/

Example of latest Intel x86 Microprocessor

Each core has following number of hardware units:

  • Reorder Buffer
    • The ROB is used for register renaming and for reordering the results of out-of-order execution.
    • 192 entries
  • Reservation Station
    • has 60 entries
  • Register file
    • 168 integer registers
  • Vector registers
    • 168
  • Loopback buffer
    • 56 entries.
  • μop cache
    • 1536 μops, 8 way, 6 μop line size, per core
  • L1I cache
    • 32 KB, 8 way set associative 64 sets
    • 64 bytes cache line
    • 32 bytes read and write per clock cycle
  • L1D cache
    • 32 KB, 8 way set associative 64 sets
    • 64 bytes cache line
  • L2 cache
    • 256KB, 8 way set associative 512 sets
    • 64 bytes cache line
  • L3 cache
    • 45 MB (ring shaped shared)
  • Instruction Fetch Rate
    • 16 bytes per clock cycle.

Trace Cache

After instructions are fetched, they are decoded and divided into μops. Instructions are stored in the trace cache after being decoded into μops. An x86 instruction can be between 1 and 15 bytes long.

Here is a nice video about breaking ISA instruction set.

Branch Prediction Techniques

Predicting branches correctly is important for superscalar processor performance. Branch prediction has become quite accurate thanks to the various techniques explained below:

  • Branch Target Speculation
    • A fully associative cache called the Branch Target Buffer (BTB) stores the target address of the last taken branch. The cache is consulted on subsequent lookups.
    • image007image009
  • Branch Conditional Speculation
  • Multi-level adaptive branch predictor
  • Image13
    • The Two-Level Branch Predictor, also referred to as a Correlation-Based Branch Predictor, uses a two-dimensional table of counters, also called the “Pattern History Table”. It was introduced by Yeh and Patt, based on the observation that the outcome of a branch depends not only on the branch address but also on the outcomes of other recent branches (inter-branch correlation) and on a longer history of the same branch itself (intra-branch correlation). A small gshare-style sketch of this idea appears after this list.
      • A Global Branch History is a shift register in which the outcome of every branch is stored: a one for a taken branch and a zero for a non-taken one. The register is shifted along as the newest value is stored. To address the table, the last n branch outcomes are considered.
      • The Local History Table is a table of shift registers of the same kind as the global branch history. Each shift register, however, records the last outcomes of one single branch. Since the local history table is accessed like a one-level branch prediction table, overlapping of branches is not prevented, so one shift register may hold the history of different branches.
      • Since the table has only two dimensions, two of the three information sources have to be selected to address the rows and columns. Another method is to merge two sources into one, which is covered later.
      • In general a two-level branch predictor is more accurate than a one-level branch predictor, but this advantage comes at the cost of a more expensive implementation and a much longer warm-up phase, i.e. the time until the table entries contain usable values.
  • Static Branch Prediction

    • Static branch prediction algorithms do not base their prediction on past behavior, hence they are simple (for example, always predict taken, or predict backward branches as taken and forward branches as not taken).
  • Dynamic Branch Prediction

    • Dynamic branch prediction typically achieves prediction accuracies of 80% to 95%.
    • The past outcomes are used as input for branch prediction.
      • One level branch predictor
      • Various predictor features
      • Two level branch predictor
      • Hashing techniques
      • Difficulties
      • Hybrid branch predictor
      • Multiple component Hybrid branch predictor
      • Branch classification
      • Industry branch prediction implementations
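
As referenced above, here is a small gshare-style sketch of a two-level predictor in C: a global branch history register is XORed with branch address bits to index a pattern history table of 2-bit saturating counters. This is an illustrative model only, not any particular processor's implementation; the table size and example branch address are arbitrary.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Simple gshare-style two-level predictor model (illustrative only). */
#define HISTORY_BITS 12
#define PHT_SIZE     (1u << HISTORY_BITS)

static uint8_t  pht[PHT_SIZE];      /* 2-bit counters: 0,1 = not taken, 2,3 = taken */
static uint32_t global_history;     /* last HISTORY_BITS branch outcomes */

static bool predict(uint32_t branch_addr)
{
    uint32_t index = (branch_addr ^ global_history) & (PHT_SIZE - 1);
    return pht[index] >= 2;
}

static void update(uint32_t branch_addr, bool taken)
{
    uint32_t index = (branch_addr ^ global_history) & (PHT_SIZE - 1);

    if (taken  && pht[index] < 3) pht[index]++;   /* saturate upward   */
    if (!taken && pht[index] > 0) pht[index]--;   /* saturate downward */

    /* shift the newest outcome into the global history register */
    global_history = ((global_history << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
}

int main(void)
{
    /* Train on a branch that is taken nine times out of ten. */
    for (int i = 0; i < 100; i++) {
        bool actual = (i % 10) != 9;
        update(0x400123, actual);
    }
    printf("prediction for the branch now: %s\n",
           predict(0x400123) ? "taken" : "not taken");
    return 0;
}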

Register Renaming

Register renaming is controlled by the reorder buffer and the scheduler. Register renaming is a technique that eliminates the false data dependencies arising from the reuse of architectural registers by successive instructions that do not have any real data dependencies between them.

Machine language programs specify reads and writes to a limited set of registers specified by the instruction set architecture (ISA). For instance, the Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide. These are the architectural registers.

 

 

Anatomy of an Assembly Program

Here is a very simple C/C++ program and the assembly produced by compiling it with the -S flag.
There are no headers included.

int main(){
  return 0;
}

Following is the output generated.

This output shows the anatomy of an assembly program.

; represents sections in the program like .bss section, .text section etc.
 .file "simple.cpp"
 .text
 .globl main
 .type main, @function
; entry to main function
main:
; save the caller's base pointer register (rbp) on the stack so it can be restored at the end
 pushq %rbp
; move the contents of stack pointer to rbp register.
 movq %rsp, %rbp
; move the exit code 0 into the eax register (the return value).
 movl $0, %eax
; restore the previous rbp by popping it off the stack.
 popq %rbp
; return from the program.
 ret

 

Virtual Memory

What is Virtual Memory?

It allows us to run more applications on the system than the physical memory alone could support. Virtual memory gives each process its own large address space; portions that do not fit in RAM can be temporarily written out to a swap area on disk.

Why Virtual Memory?

Most modern CPUs have a built-in memory management unit to translate virtual memory addresses. Using virtual memory has the following advantages:

  • Main memory is limited: even small devices like the Raspberry Pi running Linux can run multiple processes, each with 32-bit addressing (4 GB per process).
  • Sharing: it makes it easy for library code, for example glibc, to be shared across applications.
  • Memory protection: memory corruption in one program does not affect the memory of another program.
  • Only a part of a program needs to be in memory for execution.

virtual_to_physical_mapping

Virtual Memory is implemented with

  • Demand Paging
  • Demand Segmentation

Physical memory management


Zones are described by the data structure struct zone, which is defined in <linux/mmzone.h>. On x86, physical memory is divided into zones as follows:

  • ZONE_DMA First 16MB of memory
  • This zone is used by (legacy) hardware that can address only the first 16 MB of memory.
  • ZONE_NORMAL 16MiB – 896MB
  • The DMA and NORMAL zone page frames can be accessed directly by Kernel.
  • ZONE_HIGHMEM 896 MB – End
  • On 32-bit systems, the page frames belonging to HIGHMEM cannot be accessed directly by the kernel.
  • On 64-bit systems this problem does not exist because the available linear address space is vastly larger than the RAM present in the system.

Each zone has a zone descriptor with following fields:

  • free_pages: Number of free pages
  • pages_min: Number of reserved pages
  • pages_low: Low watermark for reclaiming page frames
  • pages_high: High watermark for reclaiming page frames
  • free_area: Blocks of free pages
  • lru_lock: spinlock protecting operations on the zone's LRU lists
  • wait_table: wait queues of processes waiting on one of the pages in the zone.
  • zone_mem_map: pointer to first page descriptor
  • name: “DMA”, “Normal” or “HighMem”
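
A much-simplified user-space sketch of such a zone descriptor, using only the fields listed above. The real struct zone in <linux/mmzone.h> uses kernel types (spinlock_t, wait queues), has many more fields, and its layout changes across kernel versions.

/* Much-simplified sketch of the zone descriptor fields listed above. */
struct free_area;               /* buddy-system free lists (opaque here) */
struct page;                    /* page descriptor (opaque here)         */

struct zone_sketch {
    unsigned long      free_pages;     /* number of free pages in the zone      */
    unsigned long      pages_min;      /* number of reserved pages              */
    unsigned long      pages_low;      /* low watermark for page-frame reclaim  */
    unsigned long      pages_high;     /* high watermark for page-frame reclaim */
    struct free_area  *free_area;      /* blocks of free pages                  */
    int                lru_lock;       /* stand-in for the zone spinlock        */
    void              *wait_table;     /* stand-in for the page wait queues     */
    struct page       *zone_mem_map;   /* pointer to the first page descriptor  */
    const char        *name;           /* "DMA", "Normal" or "HighMem"          */
};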

Screen Shot 2018-03-14 at 8.27.48 PM

Zones in NUMA System

Screen Shot 2018-03-14 at 8.22.46 PM

Memory Management in Linux


  • Some part of main memory is permanently assigned to the kernel.
  • The remaining memory is called dynamic memory.
  • The efficiency of the system depends on how effectively this memory is managed.

Page Frame sizes

  • Linux uses a 4 KB page size by default; 4 MB huge pages are also available on x86.
  • With PAE (Physical Address Extension) enabled it can also support 2 MB pages.
  • 32-bit architectures have 4 KB pages.
  • Some 64-bit architectures use 8 KB pages, although x86-64 uses 4 KB.
  • This means that with 4 KB pages and 1 GB of memory, physical memory is divided into 262,144 distinct page frames.

The reason to support 4 KB page size

  • Data transfer between disk and main memory is more efficient when a smaller page size is used.

Page Descriptors

  • Information about each page frame is kept in a struct page, defined in <linux/mm_types.h>.
  • This page descriptor data structure stores the state of the page frame.
  • A page descriptor stores the following information:
    • Flags
      • Various flags store the information about page.
      • The flag values are defined in <linux/page-flags.h>.
    • Frame’s reference counter
      • This is an important field. When its value is -1, the page frame is free and can be assigned to any process or to the kernel. A value of 0 or greater means the page frame is assigned to one or more processes.
    • Number of page entries
    • Index
    • Pointer to the least recently used (LRU) doubly linked list

Memory Allocation

  • The Zoned Page Frame Allocator is used by the kernel to handle memory allocation requests for groups of contiguous page frames. The algorithm used is called the “Buddy System”. The kernel provides the following macros for allocating page frames (a hedged usage sketch appears at the end of this section):
    • alloc_pages(gfp_mask, order)
    • alloc_page(gfp_mask)
    • _ _get_free_pages(gfp_mask, order)
    • _ _get_free_page(gfp_mask)
    • get_zeroed_page(gfp_mask)
    • _ _get_dma_pages(gfp_mask, order)

    Deallocation of page frames is done using the following:

    • _ _free_pages(page, order)
    • free_pages(addr, order)
    • _ _free_page(page)
    • free_page(addr)

    High-Memory Page frame mappings:

    There are 3 mechanisms to map page frames in high memory:

    • Permanent Kernel mapping
    • Temporary Kernel Mapping
    • Non Contiguous Memory Allocation

    Fragmentation:

    • Frequent allocation and deallocation of page frames of different sizes may leave several small blocks of free page frames scattered throughout memory, making it impossible to allocate a large chunk of contiguous page frames.
    • To address this, the kernel keeps track of the existing blocks of free contiguous page frames and avoids splitting a large block when a request for a smaller chunk arrives.

    Buddy System

    • This is one of the algorithms for allocating memory from a fixed-size segment consisting of physically contiguous pages.
    • Memory is allocated from the segment using a power-of-2 allocator.
      • This method satisfies requests in units sized as powers of 2.
      • Requests are rounded up to the next highest power of 2.
      • When a smaller allocation is needed than is available, the current chunk is split into two buddies of the next-lower power of 2.
      • This continues until a chunk of the appropriate size is available.
    • For example, assume a 256 KB chunk is available and the kernel requests 21 KB:
      • It is split into AL and AR of 128 KB each.
        • One is further divided into BL and BR of 64 KB each.
          • One is further divided into CL and CR of 32 KB each, and one of these is used to satisfy the request.
    • Advantage: unused buddies can be quickly coalesced into larger chunks.
    • Disadvantage: internal fragmentation.

    buddy_system_algorithm
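
As referenced above, a minimal sketch of how kernel code might use these page-frame allocation calls. This is illustrative kernel-module-style code under stated assumptions (the function name and the order value are made up), not taken from any real driver.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/errno.h>

/* Illustrative sketch: allocate 2^2 = 4 contiguous page frames from the
 * buddy allocator, use them through their kernel virtual address, free them. */
static int example_alloc(void)
{
    unsigned int order = 2;                         /* 4 contiguous pages   */
    struct page *pages = alloc_pages(GFP_KERNEL, order);
    if (!pages)
        return -ENOMEM;

    void *addr = page_address(pages);               /* kernel virtual address */
    memset(addr, 0, PAGE_SIZE << order);            /* use the memory         */

    __free_pages(pages, order);                     /* return them to the buddy system */
    return 0;
}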

Slab Allocator

  • The slab allocator is an alternative strategy for memory allocation.
  • A slab allocator keeps lists of available, already allocated, data structures.
  • When needed, a data structure is allocated from the slab; when no longer needed, it is returned to the slab rather than being freed.
  • The slab allocator thus works as a generic data-structure caching layer.
  • Slab is one or more physically contiguous pages
  • Cache consists of one or more slabs
  • Single cache for each unique kernel data structure
    • Each cache filled with objects – instantiations of the kernel data structure
  • Slab-allocation algorithm uses caches to store kernel objects
    • When cache created, filled with objects marked as free
    • When a new object for a kernel data structure is needed, the allocator can assign any free object from the cache to satisfy the request.
    • The object assigned from the cache is marked as used.
    • A slab can be in one of three states: a full slab has no free objects, an empty slab has no allocated objects, and a partial slab has some free and some allocated objects.
  • If a slab is full of used objects, the next object is allocated from an empty slab.
    • If there is no empty slab, a new slab is allocated from contiguous physical pages and assigned to the cache.
  • Benefits include no fragmentation and fast satisfaction of memory requests.
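
A hedged sketch of how a kernel subsystem might create and use its own slab cache. The object type, cache name, and function name below are invented for illustration; only the kmem_cache_* calls themselves are real kernel APIs.

#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical object type; any fixed-size kernel structure would do. */
struct my_object {
    int  id;
    char payload[56];
};

static struct kmem_cache *my_cache;

static int my_cache_example(void)
{
    /* Create a cache dedicated to my_object instances. */
    my_cache = kmem_cache_create("my_object_cache",
                                 sizeof(struct my_object),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    /* Grab an object from the cache (taken from a partial or empty slab). */
    struct my_object *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (obj) {
        obj->id = 42;
        /* ... use the object ... */
        kmem_cache_free(my_cache, obj);   /* returned to the slab, not freed */
    }

    kmem_cache_destroy(my_cache);
    return 0;
}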

Screen Shot 2018-02-28 at 11.53.26 PM.png

Screen Shot 2018-03-13 at 8.06.41 PMScreen Shot 2018-03-13 at 8.07.37 PM

Demand Paging


Software programs are generally very large and cannot fit entirely in main memory. Hence only some of a program's pages are brought into main memory from disk, and only when they are absolutely required.

Advantages of demand paging

  • Similar to a paging system with swapping
  • Less I/O needed, no unnecessary I/O
  • Less memory needed. Example: Raspberry pi.
  • Faster response because of shared libraries.
  • More users can use the operating system at a time.

Details of demand paging

  • Extreme case: start a process with no pages in memory.
    • The OS sets the instruction pointer to the first instruction of the process; that page is not memory-resident, so a page fault occurs.
    • Every other page of the process likewise faults on its first access.
    • Pure demand paging never brings a page into memory until it is required.
  • A single instruction could access multiple pages and hence cause multiple page faults.
    • One page for the instruction and many for data: this could give unacceptable system performance.
    • Fortunately, locality of reference reduces this pain and gives reasonable performance.
  • Hardware support needed for demand paging
    • Page table with valid / invalid bit
    • Secondary memory, usually a high-speed disk. (swap device with swap space)
    • Instruction restart capability after a page fault
      • We must be able to restart the process in exactly the same place and state,
      • With the saved state (registers, condition code, instruction counter) of the interrupted process

Stages in Demand Paging

  1. Trap to the operating system
  2. Save the user registers and process state
  3. Determine that the interrupt was a page fault
  4. Check that the page reference was legal and determine the location of the page on the disk
  5. Issue a read from the disk to a free frame:
    1. Wait in a queue for this device until the read request is serviced
    2. Wait for the device seek and/or latency time
    3. Begin the transfer of the page to a free frame
  6. While waiting, allocate the CPU to some other user : context switch
  7. Receive an interrupt from the disk I/O subsystem (I/O completed)
  8. Save the registers and process state for the other user
  9. Determine that the interrupt was from the disk
  10. Correct the page table and other tables to show page is now in memory: set to v
  11. Wait for the CPU to be allocated to this process again : context switch
  12. Restore the user registers, process state, and new page table, and then resume the interrupted instruction: restart

Copy on Write

  • Copy-on-Write (COW) allows both parent and child processes to initially share the same pages in memory after the fork() system call.
    • If either process modifies a shared page, only then is the page copied
  • COW allows more efficient process creation as only modified pages are copied
  • In general, free pages are allocated from a pool of zero-fill-on-demand pages
    • Why zero-out a page before allocating it?
  • vfork(), a variation on the fork() system call, suspends the parent process while the child uses the parent's address space.
    • This is different from fork() with copy-on-write.
    • It does not use copy-on-write; changes made by the child to any pages of the parent’s address space will be visible to the parent once it resumes.
    • It must be used with caution to prevent the child from modifying the parent's address space.
    • vfork() is intended to be used when the child calls exec() immediately after creation.
    • No copy of pages, very efficient method of process creation.
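
A small user-space sketch of fork() and copy-on-write behaviour: after fork(), parent and child initially share the same physical pages; when the child writes to the buffer, only the touched page is copied, so the parent does not see the change.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    /* One page-sized buffer that both processes share after fork(). */
    char *buf = malloc(4096);
    strcpy(buf, "written by parent");

    pid_t pid = fork();                    /* pages are now shared copy-on-write */
    if (pid == 0) {
        /* Child: the first write triggers a fault and the kernel copies
         * just this page for the child. */
        strcpy(buf, "written by child");
        printf("child sees:  %s\n", buf);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    /* Parent still sees its original data: the child's copy was private. */
    printf("parent sees: %s\n", buf);
    free(buf);
    return 0;
}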

Handling Page Fault


  1. Check an internal table (within PCB) to decide:
    1. Invalid reference ⇒ abort, terminate the process
    2. Valid reference ⇒ page it in when it is not in memory
  2. Find a free frame (taking one from the free-frame list)
  3. Schedule a disk operation to read the desired page into the new frame
  4. Reset tables to indicate the page is now in memory
  5. Restart the instruction that was interrupted by the trap caused by the page fault

Screen Shot 2018-02-28 at 11.40.07 PMScreen Shot 2018-02-28 at 11.40.25 PM

Page Replacement Algorithms

  1. Find the location of the desired page on disk
  2. Find a free frame:
    1. If there is a free frame, use it
    2. If there is no free frame, use a page replacement algorithm to select a victim frame
    3. Write the victim frame to the disk if dirty
  3. Bring  the desired page into the (newly) free frame; update the page and frame tables
  4. Continue the process by restarting the instruction that caused the trap

Note that there are now potentially two page transfers (one out and one in) per page fault, which increases the effective access time (EAT).

LRU Algorithm

  • This is generally a good algorithm and is the most widely used.
  • Use past knowledge rather than future
  • Replace page that has not been used for the longest period of time
  • Associate time of last use with each page

Implementation of the LRU algorithm

  • Need hardware assistance to determine an order for the frames defined by the time of last use.
  • Counter implementation (a toy sketch of this appears after this list):
    • Every page entry has a time-of-use field, and a logical clock (counter) is added to the CPU.
    • Every time a page is referenced, the clock is incremented and the clock value is copied into the time-of-use field in the page's page-table entry.
    • When a page needs to be replaced, look at the counters to find the smallest value.
      • A search through the table is needed.
  • Stack implementation
    • Keep a stack of page numbers in a doubly linked list
    • When a page referenced:
      • Removed from the stack and put on the top
      • Most recently used page is always at the top of the stack.
      • requires 6 pointers to be changed at worst
    • But each update is more expensive.
    • No search for replacement; the tail points to the bottom of the stack, the least recently used (LRU) page.
  • LRU and OPT are cases of stack algorithms that don’t have Belady’s Anomaly
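
As referenced above, a toy sketch of the counter-based LRU implementation: each frame records the logical time of its last use, and the victim is the frame with the smallest timestamp. The frame count and reference string are arbitrary example values.

#include <stdio.h>

#define NFRAMES 3

/* Toy counter-based LRU: each frame remembers the "time" of its last use;
 * the victim is the frame with the smallest last-use time. */
int main(void)
{
    int frames[NFRAMES];
    unsigned long last_used[NFRAMES] = {0};
    int used = 0;
    unsigned long clock = 0;
    int faults = 0;

    int refs[] = {7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2};
    int nrefs = sizeof(refs) / sizeof(refs[0]);

    for (int i = 0; i < nrefs; i++) {
        int page = refs[i];
        int slot = -1;
        clock++;

        for (int j = 0; j < used; j++)           /* is the page already resident? */
            if (frames[j] == page) { slot = j; break; }

        if (slot < 0) {                           /* page fault */
            faults++;
            if (used < NFRAMES) {
                slot = used++;                    /* free frame available */
            } else {
                slot = 0;                         /* LRU victim: smallest last_used */
                for (int j = 1; j < NFRAMES; j++)
                    if (last_used[j] < last_used[slot]) slot = j;
            }
            frames[slot] = page;
        }
        last_used[slot] = clock;                  /* record time of last use */
    }

    printf("page faults: %d\n", faults);
    return 0;
}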

Modern Microprocessors

Characteristics of Modern Microprocessors

  • Multicore processors: modern processors can have anywhere from 2 (LCC, low core count) to 24 (HCC, high core count) cores for parallelism at the hardware level.
  • Out-of-order execution:
    • To hide the ever-widening gap between the CPU and main memory, all modern CPUs are pipelined and reorder execution.
    • While the CPU is waiting for data or instructions from memory, it can work on something else. Complex logic in hardware checks the input dependencies between instructions.
  • Multi-level caches: there are at least three levels of cache built into the processor, which act as local memory for the CPU.
  • Speculative execution: modern CPUs predict which instructions are likely to execute next (for example, across a branch) and execute them ahead of time.
  • Micro-ops: assembly instructions are divided into micro-operations, which are easier to execute; the results are then combined.
  • Register renaming: compilers generate code using architectural registers such as EAX, EBX, EDX, etc. The processor has many more physical (temporary) registers and renames the architectural names onto them.

Architecture diagram for Intel Nehalem Processor

Intel_Nehalem_arch

Image credit: By Appaloosa (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons

CPU Package Overview

Screen Shot 2018-03-15 at 12.00.58 AM

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

Intel Xeon Uncore Elements

Each entry lists the uncore element, its description, and what it is responsible for:

  • QPI Agent (Quick Path Interconnect): QPI caching agent; manages R3QPI and the QPI Link Interface.
  • PCU (Power Controller): Core/Uncore power unit and thermal manager; governs the P-state of the CPU and the C-state of the core and package. It enables Turbo Mode and can throttle cores when a thermal violation occurs.
  • Ubox (System Config Controller): Intermediary for interrupt traffic between system and core.
  • IIO (Integrated IO): Provides the interface to PCIe devices.
  • R2PCI (Ring to PCI Interface): Provides the interface to the ring for PCIe access.
  • IMC (Integrated Memory Controller): Provides the interface to RAM and communicates with the Uncore through the Home Agent.
  • HA (Home Agent): Responsible for ordering reads/writes coming from the ring to the IMC. Provides directory cache coherency.
  • SMI (Scalable Memory Interface): Provides IMC access to DIMMs.

High Core Count Architecture

Screen Shot 2018-03-15 at 12.04.59 AM.png

Image Credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

Intel Xeon Processor Overview

Each entry lists generation: branding, year, process, cadence, and maximum core count.

  • Nehalem: X5500, 2008, 45 nm, max 4 cores
  • Westmere: X5600, 2010, 32 nm, Tick, max 6 cores
  • Sandy Bridge: E5-2600-v1, 2012, 32 nm, Tock, max 8 cores
  • Ivy Bridge: E5-2600-v2, 2013, 22 nm, Tick, max 12 cores
  • Haswell: E5-2600-v3, 2014, 22 nm, Tock, max 18 cores
  • Broadwell: E5-2600-v4, 2016, 14 nm, Tick-Progress, max 22 cores
  • Skylake: 2P, 2017, 14 nm, Architecture, max 28 cores

Image Credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

 

Interrupts, Signals and Exceptions

What is an Interrupt

An interrupt is an event that changes the program flow, i.e. the instruction stream being executed by the CPU. Interrupts are generated by the various devices connected to the CPU, or they are caused by the software itself (for example by bugs). Interrupts are a way for hardware to signal the processor.

Interrupts and Exceptions

  1. Exceptions (synchronous interrupts):
    Caused by software and produced by the control unit of the CPU.
    Example: a bug in software or a page fault. The kernel handles these exceptions by following the steps defined in kernel code to recover from the condition.
  2. Interrupts (asynchronous interrupts):
    Caused by hardware devices.
    Example: a keypress or mouse movement by the user.

Interesting Points about Interrupts

  • Interrupts are asynchronous and can be nested.
  • An interrupt can occur while the kernel is handling another interrupt. When the kernel is executing a critical region, interrupts are disabled and the critical region is kept as small as possible.
  • By disabling interrupts, the kernel guarantees that an interrupt handler will not preempt the critical code.
  • Interrupts and exceptions are identified by a number between 0 and 255.
  • The code executed by an interrupt handler is not a process switch; rather, it runs at the expense of the process that was running when the interrupt was received.
  • Interrupt handling is critical for the kernel, but handling can take a long time in the case of slow I/O devices. Hence interrupt handling is divided into two parts:
    • Urgent: Kernel executes this right away.
    • Bottom Halves: Deferred to execute later. (using various techniques like soft irqs, Tasklets, Task/ Work Queues)

Classification of Interrupts

  • Maskable
    • These are interrupt requests issued by I/O devices.
    • There are two states for a maskable interrupt.
      • Masked
      • Unmasked
    • The vectors of maskable interrupts are altered by programming the Interrupt Controller.
  • Nonmaskable
    • They are always recognized by CPU.
    • The vectors of non-maskable interrupts are fixed.

Classification of Exceptions

  • Processor Detected
    • CPU detects an anomalous condition while executing an instruction
  • Faults:
    • Example: Page faults
    • Faults can be corrected, and once corrected, the program can be resumed.
  • Traps:
    • They can be reported immediately at the next instruction.
    • Used mainly for debugging.
  • Aborts:
    • These are severe errors like hardware failure
    • The process is terminated on receiving this signal.
  • Programmable Exceptions:
    • Often called software interrupts.
    • Used to implement system calls and debugging.

Exception vectors (number: explanation, type):

  • 0: Divide-by-zero error (Fault)
  • 1: Debug (Trap or Fault)
  • 2: Not used
  • 3: Breakpoint (Trap)
  • 4: Overflow (Trap)
  • 5: Bounds check (Fault)
  • 6: Invalid opcode (Fault)
  • 7: Device not available (Fault)
  • 8: Double fault (Fault)
  • 9: Coprocessor segment overrun (Abort)
  • 10: Invalid TSS (Fault)
  • 11: Segment not present (Fault)
  • 12: Stack segment fault (Fault)
  • 13: General protection (Fault)
  • 14: Page fault (Fault)
  • 15: Reserved by Intel
  • 16: Floating-point error (Fault)
  • 17: Alignment check (Fault)
  • 18: Machine check (Abort)
  • 19: SIMD floating-point exception (Fault)

A Page Fault occurs when a process tries to access a page in its address space that is not currently in RAM. While handling this exception, the kernel may suspend the current process and switch to another process until the page is available in RAM. The process switch is done because of the high latency involved (several hundred CPU cycles even for RAM).

Hardware for Interrupt Handling

  • Each hardware device connected to a computer has an output line named the Interrupt Request (IRQ) line.
  • All the IRQ lines are connected to a hardware circuit called the Programmable Interrupt Controller (PIC).
  • The PIC monitors the IRQ lines for raised signals.
  • If multiple signals are raised simultaneously, the signal on the lower-numbered pin is selected.
  • When a signal is raised, it is converted to a vector, the vector is made available to the CPU, and a signal is raised on the CPU's INTR pin; the PIC then waits until the CPU acknowledges the interrupt.

 

PIC_Hardware_interrupt_path

Image credit: By Jfmantis – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=18168230

APIC – Advanced Programmable Interrupt Controller

In modern multiprocessor systems, there is a local APIC chip per CPU. The APIC has the following components:

  1. 32 bit registers
  2. Internal clocks
  3. Local timer device
  4. Two additional lines LINT 0 and LINT 1

local_apics_intel_xeon

Image credit: [Intel Software Developer manual vol 3.](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf)

Categories of Interrupts

  • I/O Interrupts
  • Timer Interrupts
  • Interprocessor interrupts

Types of Actions taken by Linux on interrupts

  • Critical
    • Critical actions are executed within the interrupt handler immediately.
  • Noncritical
    • These are quick to finish and hence are also executed by the interrupt handler immediately.
  • Noncritical deferrable
    • These may be delayed for a long time interval without affecting kernel operations.

IRQ Distribution in Multiprocessor System

The kernel tries to distribute the IRQ signals coming from the hardware devices in a round-robin fashion among all the CPUs.

The interrupts coming from external hardware can be distributed among CPUs in the following ways:

  • Static Distribution
  • Dynamic Distribution

IRQ affinity

The kernel provides functionality to redirect all interrupts of a given IRQ to a particular CPU. This is achieved by modifying the Interrupt Redirection Table entries of the I/O APIC. The IRQ affinity of a particular interrupt can also be changed by writing a new CPU bitmap mask into the /proc/irq/n/smp_affinity file.

Interrupt Handling

When an interrupt is received, the kernel runs the interrupt handler, or interrupt service routine, which is a C function. A data structure named the Interrupt Descriptor Table (IDT) associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler. This table must be properly initialized before the kernel enables interrupts.

Top Halves vs Bottom Halves

  • Interrupt handling should be fast, but there may be a large amount of work involved; hence the handling is divided into two parts:
  • Top Half: executed immediately; performs time-critical work such as acknowledging the interrupt.
  • Bottom Half: work that can be deferred, such as communicating with I/O devices.

Deferred Part of Interrupt Handling

  • As stated above, interrupt handling has two parts: critical and non-critical (deferred handling).
  • SoftIRQs, tasklets, work queues, etc. are ways to process the deferred part of interrupt handling, also called the bottom half.

SoftIRQs vs. Tasklets:

  • SoftIRQs are statically allocated; tasklets can also be allocated and initialized at runtime.
  • SoftIRQs are reentrant functions and must explicitly protect their data structures with spin locks; tasklets do not need such synchronization because the kernel handles it for them.
  • SoftIRQs provide the least serialization; tasklets of the same type are always serialized, i.e. the same type of tasklet cannot be executed by two CPUs at the same time.
  • Tasklets are easier to code.

Work Queues

  • They defer work into a kernel queue.
  • Functions in work queues run in process context and hence can block or sleep.
  • A function in a work queue is executed by a kernel thread, so there is no user-mode address space to access.
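
A minimal kernel-module-style sketch of deferring bottom-half work to the default kernel work queue. The module, work, and handler names are invented for illustration; only the workqueue calls themselves are real kernel APIs.

#include <linux/module.h>
#include <linux/workqueue.h>

/* Bottom-half work deferred to a kernel thread: runs in process context,
 * so it is allowed to sleep (e.g. to wait for I/O). */
static void my_deferred_work(struct work_struct *work)
{
    pr_info("running deferred work in process context\n");
}

static DECLARE_WORK(my_work, my_deferred_work);

static int __init wq_example_init(void)
{
    /* In a real driver this would typically be called from the
     * interrupt handler's top half. */
    schedule_work(&my_work);
    return 0;
}

static void __exit wq_example_exit(void)
{
    cancel_work_sync(&my_work);   /* make sure the work is no longer pending */
}

module_init(wq_example_init);
module_exit(wq_example_exit);
MODULE_LICENSE("GPL");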

Exception Handling

  • The exceptions raised by the CPU are handled by Linux as error conditions.
  • The kernel sends a signal to the process about the erroneous condition.
  • Steps taken to handle an exception:
    • Save the registers on the kernel stack.
    • Invoke a C-level function to handle the exception.
    • Call the ret_from_exception() function and exit.

Signals

Signals are software generated interrupts. A signal is generated for a process (or sent to a process) when the event that causes the signal occurs. When the signal is generated, the kernel usually sets a flag of some form in the process table. A signal is delivered to a process when the action for a signal is taken. Between the time of generation and delivery, the signal is pending.

When a process receives a signal, it can do either of following.

  1. Ignore it: except for SIGKILL and SIGSTOP, all signals can be ignored.
  2. Catch the signal: invoke a callback (signal handler) on receiving the signal. Again, SIGKILL and SIGSTOP cannot be caught or blocked.
  3. Apply the default action.

On abnormal termination of a process, a memory image of the process (a core dump) is stored in the process's current working directory.

Reentrant functions: functions that are guaranteed to be safe to call from within a signal handler. These are async-signal-safe functions, meaning they either avoid shared state or block signals before entering a critical region.
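
A small sketch of catching a signal with sigaction(); the handler calls only write(), which is async-signal-safe, in keeping with the note on reentrant functions above.

#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>

/* Signal handler: only calls write(), which is async-signal-safe. */
static void on_sigint(int signo)
{
    const char msg[] = "caught SIGINT\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    (void)signo;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;       /* catch the signal with a callback      */
    sigemptyset(&sa.sa_mask);        /* no extra signals blocked in handler   */
    sa.sa_flags = SA_RESTART;        /* restart interrupted system calls      */

    if (sigaction(SIGINT, &sa, NULL) < 0) {
        perror("sigaction");
        return 1;
    }

    printf("press Ctrl-C to send SIGINT (SIGKILL/SIGSTOP cannot be caught)\n");
    pause();                          /* wait for a signal to arrive */
    return 0;
}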