Non-Uniform Memory Access (NUMA)

NUMA is a shared-memory architecture used in today’s multiprocessing systems. Each CPU is assigned its own local memory and can also access memory attached to the other CPUs in the system. Local memory access gives the best performance: low latency and high bandwidth. Accessing memory owned by another CPU carries a performance penalty: higher latency and lower bandwidth.

  • Access to local memory is fast; remote memory has higher latency
  • Practically all multi-socket systems have NUMA
  • Most servers have 1 NUMA node per socket
  • Some AMD systems have 2 NUMA nodes per socket
  • Optimal performance sometimes still requires manual tuning

A typical 4-node NUMA system:

[Image: typical 4-node NUMA system]

Image credit: https://videos.cdn.redhat.com/summit2015/presentations/15284_performance-analysis-tuning-of-red-hat-enterprise-linux.pdf

Processor and memory layout in a NUMA system:

[Image: processor and memory layout in a NUMA system]

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

QPI (Intel QuickPath Interconnect):

[Image: QPI]

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

To see the NUMA nodes and which CPUs belong to each node:

[Screenshot: NUMA nodes and the CPUs under each node]

To see the CPUs under each NUMA node:

[Screenshot: CPUs under each NUMA node]
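The same node-to-CPU mapping can also be read programmatically. The following is a minimal sketch, assuming a Linux system that exposes NUMA nodes under /sys/devices/system/node; the function names parse_cpulist and numa_topology are hypothetical helpers made up for this illustration, not part of any tool shown above.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means the node has no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]}; empty if no node directories exist."""
    topology = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            topology[node_id] = parse_cpulist(f.read())
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: CPUs {cpus}")
```

On a machine without NUMA support in sysfs the script simply prints nothing, since numa_topology() returns an empty dictionary.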

The NUMA system as represented on my machine:

[Image: NUMA layout of my machine]

NUMA-related tools

[Screenshot: NUMA-related tools]

NUMA settings

  • Balancing
    • Turn off: echo 0 > /proc/sys/kernel/numa_balancing
  • NUMA locality (hit vs miss, local vs foreign)
    • Number of NUMA faults & page migrations
    • /proc/vmstat numa_* fields
  • Location of process memory in NUMA system
    • /proc/<pid>/numa_maps
  • NUMA scans, migrations & NUMA faults by node
    • /proc/<pid>/sched
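As a rough illustration of how the files listed above can be inspected, here is a minimal sketch that pulls the numa_* counters out of /proc/vmstat and sums the per-node page counts (the N<node>=<pages> fields) in /proc/<pid>/numa_maps. The helper names numa_counters and pages_per_node are hypothetical, and the sketch assumes a Linux kernel built with NUMA support; on other systems the files may be missing.

```python
import re
from collections import Counter


def numa_counters(vmstat_text):
    """Return the numa_* counters from /proc/vmstat content as {name: int}."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("numa_"):
            counters[fields[0]] = int(fields[1])
    return counters


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields of /proc/<pid>/numa_maps content.

    Counts are in pages of each mapping's own page size, so mappings
    backed by huge pages are not directly comparable to 4 kB ones.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = re.fullmatch(r"N(\d+)=(\d+)", token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    try:
        with open("/proc/vmstat") as f:
            for name, value in sorted(numa_counters(f.read()).items()):
                print(f"{name}: {value}")
        # "self" inspects this Python process; substitute a real PID as needed.
        with open("/proc/self/numa_maps") as f:
            print("pages per node:", pages_per_node(f.read()))
    except OSError as err:
        print("NUMA statistics not available:", err)
```

Comparing numa_hit with numa_miss, or numa_local with numa_other, gives a quick feel for locality, while the per-node totals show where a process's memory actually lives.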

 
