Caching in Modern Microprocessors

Modern microprocessors run at clock speeds around 3 GHz, while main memory (DRAM) runs at an effective clock closer to 133 MHz. Across the DDR, DDR2, and DDR3 generations, memory bandwidth has grown substantially, but latency has improved only slightly. Caches were introduced as a level of indirection to hide this latency gap. Caches rely on two very important properties of data access patterns (a short C sketch after the list illustrates both):

  1. Spatial locality: if a memory location is accessed, nearby locations are likely to be accessed soon; example: sequential array traversal.
  2. Temporal locality: if a memory location is accessed, the same location is likely to be accessed again; example: loop counters and accumulators.
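
A minimal C sketch of both properties (the 1024x1024 matrix and function names are illustrative, not from the original post): the row-major loop walks consecutive addresses and benefits from spatial locality, and the accumulator is reused on every iteration (temporal locality); the column-major loop does the same work but defeats spatial locality.

    #include <stddef.h>

    #define N 1024

    /* Row-major traversal: consecutive elements of a row are adjacent in
     * memory, so each 64-byte cache line fetched serves several iterations
     * (spatial locality). The accumulator `sum` is reused on every
     * iteration and stays in a register or the L1 cache (temporal locality). */
    long sum_rows(const int m[N][N]) {
        long sum = 0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }

    /* Column-major traversal touches a new cache line on almost every access
     * when N is large, defeating spatial locality; it performs the same number
     * of additions but runs noticeably slower. */
    long sum_cols(const int m[N][N]) {
        long sum = 0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }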

Modern x86 processors have a hierarchy of L1 and L2 caches plus a distributed last-level cache (LLC), accessed via the on-die scalable ring interconnect. The L1 and L2 caches are private to each core, while the LLC is shared among all the cores.

  • Access latency in CPU cycles
    • L1 – 4 cycles
    • L2 – 12 cycles
    • LLC – 26 to 31 cycles

  • Cache sizes
    • L1 – 32 KB data cache (L1D) + 32 KB instruction cache (L1I)
    • L2 – 256 KB
    • LLC – 8 MB to 56 MB
  • Cache line
    • The unit of transfer between the cache and memory, and the unit of coherence between cores (see the false-sharing sketch after the MESI table below).
    • On x86 the cache line size is 64 bytes.
  • Data prefetchers
    • The spatial prefetcher attempts to pair every cache line fetched into the L2 cache with the adjacent line, completing a 128-byte aligned chunk.
    • The streamer prefetcher monitors read requests from the L1D cache and fetches the corresponding data and instructions ahead of use. CPU vendors may use their own designations for the L1 and L2 prefetchers.

  • Cache snooping protocol (MESI) states:

    Cache state | Definition | State definition | Cache line exists in
    M | Modified | The cache line has been updated relative to memory (it is dirty). | A single core
    E | Exclusive | The cache line is consistent with memory and is held only by this core. | A single core
    S | Shared | The cache line may also be held by other cores; the copies are consistent with memory. | Multiple cores
    I | Invalid | The cache line is not present (valid) in this core's L1 or L2. | Not present in this core
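
Because the 64-byte line is also the unit of coherence, two threads that write to different variables sharing one line will bounce that line between cores: each write forces the other core's copy into the Invalid state. The following sketch (POSIX threads; the structure layout, counter type, and iteration count are my own illustrative choices) shows the common fix of padding per-thread data out to a full cache line.

    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define ITERS 10000000L

    /* Padding each counter to a full cache line keeps the two threads'
     * writes on different lines, so neither write invalidates the other
     * core's copy. Remove the padding and both counters will usually share
     * one line, and every increment forces a coherence miss on the other
     * core (false sharing). */
    struct padded_counter {
        volatile long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static _Alignas(CACHE_LINE) struct padded_counter counters[2];

    static void *worker(void *arg) {
        struct padded_counter *c = arg;
        for (long i = 0; i < ITERS; i++)
            c->value++;                 /* repeated writes to one cache line */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }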


Types of Caches

  1. Direct-mapped caches:
    • each memory block can reside in exactly one line of the cache.
    • Direct mapping is simple and inexpensive to implement, but if a program repeatedly accesses two blocks that map to the same line, the cache thrashes, reloading that line over and over, and the miss rate becomes very high.
    • The lookup logic is simple: the index bits select a single line and one tag comparison decides hit or miss.
  2. Fully associative caches
    • any block can go into any line of the cache.
  3. Set-associative caches
    • Set-associative mapping addresses the thrashing problem of the direct-mapped method: instead of exactly one line that a block can map to, a few lines are grouped together into a set, and a block can be placed in any line of its set. Each block still maps to exactly one set (see the address-decoding sketch after this list).
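
To make the mapping concrete, here is a small C sketch that splits an address into offset, set index, and tag for a 32 KB, 8-way cache with 64-byte lines (the same parameters as the L1 caches above); the helper names and the sample address are illustrative. A direct-mapped cache is the special case of one way per set, and a fully associative cache is the special case of a single set.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative parameters: 32 KB, 8-way set-associative, 64-byte lines,
     * giving 32768 / (8 * 64) = 64 sets. A direct-mapped cache is the special
     * case WAYS == 1; a fully associative cache is the case NUM_SETS == 1. */
    #define LINE_SIZE  64u
    #define CACHE_SIZE (32u * 1024u)
    #define WAYS       8u
    #define NUM_SETS   (CACHE_SIZE / (WAYS * LINE_SIZE))

    static unsigned offset_of(uint64_t addr) { return addr % LINE_SIZE; }
    static unsigned set_of(uint64_t addr)    { return (addr / LINE_SIZE) % NUM_SETS; }
    static uint64_t tag_of(uint64_t addr)    { return addr / (LINE_SIZE * NUM_SETS); }

    int main(void) {
        uint64_t addr = 0x7ffdc0011a48;      /* arbitrary sample address */
        printf("addr %#llx -> set %u, offset %u, tag %#llx\n",
               (unsigned long long)addr, set_of(addr), offset_of(addr),
               (unsigned long long)tag_of(addr));
        return 0;
    }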

Cache Eviction Policy

There are several algorithms for choosing which block to evict when a conflict occurs, for example:

  • FIFO – First in first out
    • keeps track of insertion order and evicts the block that was inserted earliest.
  • LRU – least recently used
    • keeps track of when each block was last referenced and evicts the one that has gone unused the longest (a minimal sketch follows this list).
  • Random
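
A minimal C sketch of LRU replacement within one set, assuming a logical timestamp per way (real hardware approximates this with a few status bits per line); the structure and function names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 8

    struct way {
        bool     valid;
        uint64_t tag;
        uint64_t last_used;   /* logical timestamp of the most recent access */
    };

    /* Look up `tag` in one set at logical time `now`. Returns true on a hit;
     * on a miss, the least recently used way is evicted and refilled. */
    bool access_set(struct way set[WAYS], uint64_t tag, uint64_t now) {
        /* Hit path: refresh the recency timestamp. */
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {
                set[w].last_used = now;
                return true;
            }
        }
        /* Miss path: prefer an empty way, otherwise evict the way with the
         * smallest (oldest) timestamp, i.e. the least recently used one. */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) { victim = w; break; }
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        }
        set[victim].valid = true;
        set[victim].tag = tag;
        set[victim].last_used = now;
        return false;
    }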

Cache Write Policy

When data is updated in the cache, the following policies determine when main memory is updated. Because main memory is slow, write-back designs defer the memory update as long as possible (a small sketch of both policies follows the list).

  • Write-through:
    • the cache pushes every change to main memory immediately.
    • Advantages: the implementation is simple and less error prone, and memory always holds up-to-date data.
    • Disadvantages: every store consumes memory bandwidth, and writes incur the latency of the slow main memory.
    • write-no-allocate:
      • on a write miss, the block is modified in main memory and not loaded into the cache.
    • write-allocate:
      • on a write miss, the block is first loaded into the cache, and the write then proceeds as a write hit.
  • Write-back:
    • the cached copy is updated and marked dirty; the memory update is deferred until the dirty line is evicted.
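
The difference can be sketched in C as two store routines over a per-line structure; `memory_write` is a hypothetical stand-in for the slow DRAM path, and all names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid;
        bool     dirty;                 /* meaningful only for write-back */
        uint64_t tag;
        uint8_t  data[64];
    };

    /* Hypothetical stand-in for the slow write path to DRAM. */
    static void memory_write(uint64_t addr, uint8_t byte) { (void)addr; (void)byte; }

    /* Write-through: cache and main memory are updated together, so evictions
     * never need a write-back, at the cost of memory traffic on every store. */
    void store_write_through(struct line *l, uint64_t addr, uint8_t byte) {
        l->data[addr % 64] = byte;
        memory_write(addr, byte);
    }

    /* Write-back: only the cached copy is updated and the line is marked dirty;
     * memory sees the new data later, when the dirty line is evicted. */
    void store_write_back(struct line *l, uint64_t addr, uint8_t byte) {
        l->data[addr % 64] = byte;
        l->dirty = true;
    }

    /* On eviction of a dirty line the deferred update finally reaches memory. */
    void evict(struct line *l, uint64_t line_base_addr) {
        if (l->dirty)
            for (unsigned i = 0; i < 64; i++)
                memory_write(line_base_addr + i, l->data[i]);
        l->valid = false;
        l->dirty = false;
    }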

Cache Misses

  • Cold or compulsory misses:
    • misses caused by the first reference to a block in the program.
  • Capacity misses:
    • misses caused by the working set being larger than the cache.
  • Conflict misses:
    • misses caused by too many blocks mapping to the same cache set (or line, in a direct-mapped cache).
  • Coherence misses:
    • the coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread's cache has been invalidated by a write from another thread.
  • Coverage misses:
    • the coverage miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the processor's cache has been invalidated as a consequence of a directory eviction.

Cache Hierarchy

Image credit: http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms?

Cache hierarchy latency vs capacity.

Translation Lookaside Buffer – TLB

  • The TLB is another form of cache: it stores recently used virtual-to-physical address translations.
  • TLBs are typically highly (often fully) associative.
  • There are generally two levels of TLB, each with a relatively small number of entries (on the order of 128 in the first level).
  • On every memory access the hardware must translate the virtual address. It first checks the TLB; if the address was translated recently, the translation is returned immediately.
  • If the address is not found in the TLB, the hardware walks the page-table entries in main memory (four levels on x86-64, five with 57-bit addressing) to find the physical address of the page (a toy model of this lookup order follows the list).
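
A toy C model of the lookup order (TLB first, then the page-table walk); the single-level, fully associative TLB and the `walk_page_tables` helper are simplifications of the multi-level structures described above, and all names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT  12              /* 4 KB pages */
    #define TLB_ENTRIES 128

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;                   /* virtual page number */
        uint64_t pfn;                   /* physical frame number */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Hypothetical stand-in for the hardware page-table walk through the
     * in-memory page tables (the slow path taken on a TLB miss). */
    static uint64_t walk_page_tables(uint64_t vpn) {
        return vpn;                     /* identity mapping, for illustration only */
    }

    /* Translate a virtual address: hit in the TLB if possible, otherwise walk
     * the page tables and cache the translation for next time. */
    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        uint64_t off = vaddr & ((1u << PAGE_SHIFT) - 1);

        for (int i = 0; i < TLB_ENTRIES; i++)    /* fully associative: check every entry */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].pfn << PAGE_SHIFT) | off;

        uint64_t pfn = walk_page_tables(vpn);    /* TLB miss: slow path */
        tlb[vpn % TLB_ENTRIES] = (struct tlb_entry){ true, vpn, pfn };
        return (pfn << PAGE_SHIFT) | off;
    }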


Superscalar Processor Design

Superscalar processors are designed to fetch and issue multiple instructions every machine cycle, whereas scalar processors fetch and issue a single instruction per machine cycle.

ISA

  • The instruction set architecture (ISA) provides a contract between software and hardware, i.e. between the program and the machine.
  • The ISA is an abstraction over the hardware implementation; programs can be written with knowledge of the ISA alone.
  • The ISA ensures portability.
  • For hardware developers, the ISA is a specification.
  • The set of instructions defined by the ISA is exposed as an assembly language.
  • Dynamic-static interface: the separation between what is done statically (at compile time) and what is done dynamically (at run time).

Processor performance is commonly measured as CPI (cycles per instruction). Execution time depends on the instruction count, the CPI, and the cycle time, and the following techniques trade these factors against one another (see the identity after the list):

  • Executing multiple instructions per cycle using pipelining lowers the effective CPI. The deeper the pipeline, however, the higher the branch misprediction penalty, because the processor has to flush the pipeline and refill it with new instructions; a deeper pipeline also adds hardware and latency overhead.
  • Decreasing the instruction count by making instructions more powerful moves complexity into the hardware and may increase the cycle time.
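
Both bullets are instances of the standard execution-time identity (a textbook relation, not stated explicitly in the original post):

    Execution time = Instruction count x CPI x Cycle time

Lowering one factor only improves performance if the other two do not grow enough to compensate.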

Stages of Execution in Scalar Pipelined Processors

Fetch, instruction decode, execute, memory, and write-back stages in scalar pipelined processors.

Image credit: http://www.oberle.org/procsimu-index.html

CRAY-1 is an example of a superpipelined processor.
Superscalar machines exploit instruction-level parallelism by issuing multiple instructions per cycle.
Very Long Instruction Word (VLIW) machines.

Stages in Scalar Pipelined Processors

  1. Fetch: since main memory is slow, the fetch stage is divided into two or more sub-stages, so that once data and instructions start streaming into the processor, more than one sub-stage works in parallel. Because all later stages depend on this one, the pipeline stalls until data/instructions become available here. This stage is part of the in-order front end.
    1. A superscalar processor can fetch more than one instruction in parallel.
  2. Decode: the instructions are broken into micro-instructions (micro-ops) at this stage. Various caching and optimization techniques are used to complete this stage faster. It is also part of the in-order front end.
    1. For CISC processors this stage is complex and is itself divided into multiple sub-stages.
    2. Because decoding is so complex, pre-decoding (e.g. marking instruction boundaries as lines are fetched into the cache) is often implemented.
  3. Dispatch: micro-ops are placed into a dispatch buffer and sent, possibly out of order, to the appropriate functional units (see Dynamic Pipelines below).
  4. Execute: the execution units provide the most parallelism and operate out of order. Intel x86 processors have multiple ALUs, FPUs, and vector units at this stage.
  5. Complete: results are collected, in whatever order they finish, in the completion (reorder) buffer.
  6. Retire (write-back): the results of execution are written back to the registers in program order.

Pipelined MIPS architecture.

Image credit: By Inductiveload – Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=5769084

The instruction types are

  1. Arithmetic Operations
  2. Load/store – data movement operations between memory, caches and registers.
  3. Branch Instructions

Superscalar Pipelines

Image credit: http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323009.html

This is an example of a superscalar processor design: the pipeline is not only deep but also parallel.

Dynamic Pipelines

Image credit: http://www.cs.utah.edu/~wyman/classes/arch/athlon.html
  • Buffers are used to hold data between stages in a pipelined design.
  • Parallel pipelined processors use multi-entry buffers.
  • A parallel pipelined design that supports out-of-order execution of instructions is called a dynamic pipeline.
  • Complex multi-entry buffer designs allow instructions to flow through in a different order.
    • The first buffer in the pipeline is the dispatch buffer. It receives instructions from the program in order but can dispatch them out of order to the functional units.
    • A similar buffer, the completion buffer, sits at the back end of the pipeline. It can receive results of computation in any order, but it retires instructions in order and passes the results on to the write-back stage.

Buffers in superscalar design

Image credit: https://www.realworldtech.com/haswell-cpu/6/

Example: A Recent Intel x86 Microprocessor (Haswell)

Each core has the following hardware resources:

  • Reorder Buffer
    • the ROB tracks in-flight micro-ops so that results produced out of order can be retired in program order; it also supports register renaming
    • 192 entries
  • Reservation Station
    • has 60 entries
  • Register file
    • 168 integer registers
  • Vector registers
    • 168
  • Loop buffer (loop stream detector)
    • 56 entries.
  • μop cache
    • 1536 μops, 8 way, 6 μop line size, per core
  • L1I cache
    • 32 KB, 8 way set associative 64 sets
    • 64 bytes cache line
    • 32 bytes read and write per clock cycle
  • L1D cache
    • 32 KB, 8 way set associative 64 sets
    • 64 bytes cache line
  • L2 cache
    • 256KB, 8 way set associative 512 sets
    • 64 bytes cache line
  • L3 cache
    • 45 MB, shared, distributed along the ring
  • Instruction Fetch Rate
    • 16 bytes per clock cycle.

Trace Cache

After instructions are fetched, they are decoded into μops, and the decoded μops are stored in the trace cache. An x86 instruction can be anywhere from 1 to 15 bytes long, which is part of what makes decoding expensive.

Here is a nice video about breaking the x86 ISA.

Branch Prediction Techniques

Predicting branches correctly is important for superscalar processor performance. Modern predictors achieve high accuracy using the techniques explained below:

  • Branch Target Speculation
    • A fully associative cache called the branch target buffer (BTB) stores the target address of the last taken occurrence of each branch; on the next lookup of that branch, the cached target is used.
  • Branch Conditional Speculation
  • Multi-level adaptive branch predictor
    • The two-level branch predictor, also referred to as a correlation-based branch predictor, uses a two-dimensional table of counters called the pattern history table. It was introduced by Yeh and Patt, motivated by the observation that the outcome of a branch depends not only on the branch address but also on the outcomes of other recent branches (inter-branch correlation) and on a longer history of the same branch itself (intra-branch correlation).
      • A global branch history is a shift register in which the outcome of every branch is recorded: a one is shifted in for a taken branch and a zero for a not-taken one. The n most recent outcomes are used to index the table.
      • The local history table is a table of shift registers of the same kind as the global branch history, except that each shift register records the recent outcomes of a single branch. Because this table is indexed like a one-level branch prediction table, there is no guarantee against aliasing, so one shift register may end up holding the history of several different branches.
      • Since the pattern history table has only two dimensions, two of the three information sources (branch address, global history, local history) have to be selected to index its rows and columns. Another approach is to merge two sources into one, for example by hashing.
      • In general a two-level branch predictor is more accurate than a one-level branch predictor, but this advantage comes with a more costly implementation and a longer warm-up phase, i.e. the time until the table entries contain usable values.
  • Static Branch Prediction

    • Static branch prediction algorithms do not base the prediction on past run-time behavior, so they are simple; examples include always predicting not-taken, or predicting backward branches (loops) as taken.
  • Dynamic Branch Prediction

    • Dynamic branch prediction typically achieves accuracies of roughly 80% to 95%.
    • Past branch outcomes are used as input to the predictor (a sketch of the simplest one-level scheme follows this list).
      • One level branch predictor
      • Various predictor features
      • Two level branch predictor
      • Hashing techniques
      • Difficulties
      • Hybrid branch predictor
      • Multiple component Hybrid branch predictor
      • Branch classification
      • Industry branch prediction implementations
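
As a concrete example of the one-level scheme mentioned above, here is a minimal C sketch of a table of 2-bit saturating counters indexed by the low bits of the branch address; the table size and names are illustrative. Two-level predictors extend this by folding branch history into the index, as described earlier.

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_ENTRIES 4096

    /* One 2-bit saturating counter per entry:
     * 0,1 predict not-taken; 2,3 predict taken. */
    static uint8_t pht[PHT_ENTRIES];    /* all counters start at 0 (strongly not-taken) */

    static unsigned index_of(uint64_t branch_addr) {
        return (branch_addr >> 2) % PHT_ENTRIES;   /* low bits of the branch address */
    }

    bool predict_taken(uint64_t branch_addr) {
        return pht[index_of(branch_addr)] >= 2;
    }

    /* After the branch resolves, nudge its counter toward the actual outcome.
     * Saturation provides hysteresis: one anomalous outcome does not flip a
     * strongly biased prediction. */
    void train(uint64_t branch_addr, bool taken) {
        uint8_t *c = &pht[index_of(branch_addr)];
        if (taken) {
            if (*c < 3) (*c)++;
        } else {
            if (*c > 0) (*c)--;
        }
    }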

Register Renaming

Register renaming is controlled by the reorder buffer and the scheduler. It eliminates the false data dependencies that arise when successive instructions reuse architectural registers even though there is no real data dependence between them (a toy sketch follows below).

Machine language programs specify reads and writes to a limited set of registers defined by the instruction set architecture (ISA). For instance, the Alpha ISA specifies 32 integer registers and 32 floating-point registers, each 64 bits wide. These are the architectural registers.
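
A toy rename stage can be modeled in C as a map from architectural to physical registers plus a free list; every write to an architectural register is assigned a fresh physical register, which removes write-after-write and write-after-read hazards. The sizes and names below are illustrative (the 168-entry figure echoes the physical register file mentioned above), and reclaiming registers at retirement is omitted.

    #define ARCH_REGS 16     /* architectural registers visible to the ISA */
    #define PHYS_REGS 168    /* physical registers available for renaming */

    static int rename_map[ARCH_REGS];   /* current architectural -> physical mapping */
    static int free_list[PHYS_REGS];
    static int free_top;

    void rename_init(void) {
        for (int a = 0; a < ARCH_REGS; a++)
            rename_map[a] = a;                   /* start with a 1:1 mapping */
        free_top = 0;
        for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--)
            free_list[free_top++] = p;           /* the remaining physical regs are free */
    }

    /* Rename one instruction "dst = src1 op src2": the sources read the current
     * mapping, and the destination gets a fresh physical register. A later
     * instruction that reuses the same architectural dst therefore no longer
     * conflicts with earlier readers of the old value (false dependencies vanish). */
    void rename_instr(int dst, int src1, int src2,
                      int *pdst, int *psrc1, int *psrc2) {
        *psrc1 = rename_map[src1];
        *psrc2 = rename_map[src2];
        *pdst  = free_list[--free_top];          /* allocate a fresh physical register */
        rename_map[dst] = *pdst;
    }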


Modern Microprocessors

Characteristics of Modern Microprocessors

  • Multicore: modern processors have anywhere from 2 (LCC, low core count) to 24 or more (HCC, high core count) cores, providing parallelism at the hardware level.
  • Out-of-order execution:
    • To hide the ever-widening gap between the CPU and main memory, all modern CPUs are pipelined and reorder execution.
    • While the CPU is waiting for data or instructions from memory, it can work on something else. Complex hardware logic checks the input dependencies between instructions.
  • Multi-level caches: there are at least three levels of cache built into the processor, which act as local memory for the CPU.
  • Speculative execution: modern CPUs predict which way a branch will go and execute (and fetch data for) the likely path before the branch is resolved.
  • Micro-ops: assembly instructions are broken into micro-operations that are easier to execute; their results are then combined.
  • Register renaming: compilers target the architectural (logical) registers such as EAX, EBX, and EDX, while the processor has many more physical (temporary) registers; it maps each logical register onto one of these physical registers on the fly.

Architecture diagram for Intel Nehalem Processor


Image credit: By Appaloosa (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons

CPU Package Overview


Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

Intel Xeon Uncore Elements

UNCORE ELEMENT | DESCRIPTION | RESPONSIBLE FOR
QPI Agent | Quick Path Interconnect | QPI caching agent; manages R3QPI and the QPI Link Interface.
PCU | Power Controller | Core/uncore power unit and thermal manager; governs the P-state of the CPU and the C-state of the core and package. It enables Turbo Mode and can throttle cores when a thermal violation occurs.
Ubox | System Config Controller | Intermediary for interrupt traffic between system and core.
IIO | Integrated IO | Provides the interface to PCIe devices.
R2PCI | Ring to PCI Interface | Provides the interface to the ring for PCIe access.
IMC | Integrated Memory Controller | Provides the interface to RAM and communicates with the uncore through the Home Agent.
HA | Home Agent | Responsible for ordering reads/writes coming from the ring to the IMC. Provides directory-based cache coherency.
SMI | Scalable Memory Interface | Provides the IMC with access to the DIMMs.

High Core Count Architecture


Image Credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

Intel Xeon Processor Overview

GENERATION   | BRANDING   | YEAR | PROCESS | CADENCE             | MAX CORES
Nehalem      | X5500      | 2008 | 45nm    | –                   | 4
Westmere     | X5600      | 2010 | 32nm    | Tick                | 6
Sandy Bridge | E5-2600-v1 | 2012 | 32nm    | Tock                | 8
Ivy Bridge   | E5-2600-v2 | 2013 | 22nm    | Tick                | 12
Haswell      | E5-2600-v3 | 2014 | 22nm    | Tock                | 18
Broadwell    | E5-2600-v4 | 2016 | 14nm    | Tick (Process)      | 22
Skylake      | 2P         | 2017 | 14nm    | Tock (Architecture) | 28

Image Credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf