Linux Performance Optimization

There are 3 main levels for tuning:

  1. CPU/BIOS tuning
  2. OS tuning
  3. Network tuning

Configuration using tuned

  • tuned is available starting with RHEL 7
  • It makes changes through sysctl as well as in the /sys directory
  • Profiles are available for specific needs, such as network-latency, latency-performance, network-performance, throughput-performance, desktop, and balanced; profiles are listed and applied with tuned-adm, as shown below
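
For example (a minimal sketch, assuming the tuned service is installed and running):

    tuned-adm list                            # show all available profiles
    tuned-adm active                          # show the currently active profile
    tuned-adm profile latency-performance     # switch to the latency-performance profile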

[Screenshot: tuned profiles]

Examples of the profiles

[Screenshot: examples of the profiles]

Latency Performance Settings in Linux Tuned

  • force_latency=1 : caps the allowed CPU wake-up latency, effectively keeping the processor in C-state C1
  • governor=performance : the cpufreq governor keeps the CPU at the highest performance state
  • energy_perf_bias=performance : biases the hardware energy/performance policy toward performance
  • min_perf_pct=100 : comes from the intel_pstate driver, which adds sysfs files for controlling P-state selection under /sys/devices/system/cpu/intel_pstate; 100 keeps the minimum P-state at the maximum supported performance level
  • kernel.sched_min_granularity_ns=10000000 : minimal preemption granularity for CPU-bound tasks
  • vm.dirty_ratio=10 : a process generating dirty data starts writeback itself once dirty pages reach this percentage of memory
  • vm.dirty_background_ratio=3 : background writeback (via the writeback threads) starts at this percentage
  • vm.swappiness=10 : controls the kernel's tendency to move anonymous memory out to swap; 0 tells the kernel to avoid swapping for as long as possible, 100 tells it to swap aggressively
  • kernel.sched_migration_cost_ns=5000000 : how long the scheduler considers a migrated task "cache hot" and therefore less likely to be migrated again
  • net.core.busy_read=50 : the number of microseconds to busy-poll the device queue for socket reads; it also sets the default value of the SO_BUSY_POLL socket option
  • net.core.busy_poll=50 : the number of microseconds to busy-poll the device queue for socket poll() and select()
  • kernel.numa_balancing=0 : disables automatic NUMA balancing
  • net.ipv4.tcp_fastopen=3 : controls TCP Fast Open via /proc/sys/net/ipv4/tcp_fastopen (net.ipv4.tcp_fastopen in sysctl). The value is a bit mask: the first bit enables client support (default on), the second bit enables server support (default off), and the third bit allows data in the SYN packet without a TFO cookie. So a value of 1 enables TFO only on outgoing connections (client only), 2 allows TFO only on listening sockets (server only), and 3 enables TFO for both client and server
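
After switching to this profile, the kernel-side values can be verified directly (a quick sketch; the list of keys is just a sample):

    tuned-adm active                          # confirm which profile is in effect
    sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio \
           kernel.numa_balancing net.ipv4.tcp_fastopen net.core.busy_read net.core.busy_poll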

More information about Tunables:

  • Dependent on NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/swappiness
      • /proc/sys/vm/min_free_kbytes
  • Independent of NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/vfs_cache_pressure
    • Writeback Parameters
      • /proc/sys/vm/dirty_background_ratio
      • /proc/sys/vm/dirty_ratio
    • Readahead parameters
      • /sys/block/<device>/queue/read_ahead_kb
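
The current values of these tunables can be read straight out of /proc and /sys, for example (sda is only an example device name):

    cat /proc/sys/vm/swappiness /proc/sys/vm/min_free_kbytes
    cat /proc/sys/vm/vfs_cache_pressure
    cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
    cat /sys/block/sda/queue/read_ahead_kb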

top utility

  • Available on practically all machines
  • The default measurement tool; it exposes a lot of information
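
top can also be run non-interactively, which is handy when capturing a snapshot for later comparison:

    top -b -n 1 | head -n 20      # one batch-mode iteration, first 20 lines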

Parameters important for Optimization


Swappiness

  • Controls how aggressively the system reclaims anonymous memory versus
    page cache memory:

    • Anonymous memory – swapping and freeing
    • File pages – writing if dirty and freeing
    • System V shared memory – swapping and freeing
  • Default is 60
  • If it is decreased: more aggressive reclaiming of page cache memory
  • If it is Increased: more aggressive swapping of anonymous memory
  • The low latency optimization guide at http://wiki.dreamrunner.org/public_html/Low_Latency_Programming/low-latency-programming.html recommends setting it to 0
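
A minimal sketch of checking and lowering swappiness (the value 10 and the file name are only examples; persistent settings belong in /etc/sysctl.d):

    cat /proc/sys/vm/swappiness                                   # current value (default 60)
    sysctl -w vm.swappiness=10                                    # change it until the next reboot
    echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf  # make it persistent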

Memory reclaim watermarks:

[Diagram: memory reclaim watermarks]

zone_reclaim_mode

  • Controls NUMA specific memory allocation policy
  • When set and node memory is exhausted:
    • Reclaim memory from local node rather than allocating
      from next node
    • Slower initial allocation, higher NUMA hit ratio
  • When clear and node memory is exhausted:
    • Allocate from all nodes before reclaiming memory
    • Faster initial allocation, higher NUMA miss ratio
  • To see current setting: cat /proc/sys/vm/zone_reclaim_mode
  • Turn ON: echo 1 > /proc/sys/vm/zone_reclaim_mode
  • Turn OFF: echo 0 > /proc/sys/vm/zone_reclaim_mode
  • It is recommended that this setting be off for large in-memory database servers.
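
To keep the recommended off setting across reboots, a sysctl drop-in can be used (the file name is only an example):

    cat /proc/sys/vm/zone_reclaim_mode                              # verify the current value
    echo 'vm.zone_reclaim_mode = 0' > /etc/sysctl.d/99-zone-reclaim.conf
    sysctl -p /etc/sysctl.d/99-zone-reclaim.conf                    # apply it without rebooting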

CPU Tuning

  • p-states
    • A P-state is an Advanced Configuration and Power Interface (ACPI) defined processor performance state.
    • P0, P1, P2, ..., P11, etc. are the P-state values. For highest performance the CPU should run in P0.
  • c-states
    • C-states allow the CPU package to shut down cores and parts of the CPU microarchitecture to save energy while balancing response times.

    • C0, C1, C3, C6, etc. are the C-state values (an inspection example follows the tables below). Here is a chart with more information:

      Mode | Name | What it does | CPUs
      C0 | Operating State | CPU fully turned on | All CPUs
      C1 | Halt | Stops CPU main internal clocks via software; bus interface unit and APIC are kept running at full speed | 486DX4 and above
      C1E | Enhanced Halt | Stops CPU main internal clocks via software and reduces CPU voltage; bus interface unit and APIC are kept running at full speed | All socket LGA775 CPUs
      C1E | | Stops all CPU internal clocks | Turion 64, 65-nm Athlon X2 and Phenom CPUs
      C2 | Stop Grant | Stops CPU main internal clocks via hardware; bus interface unit and APIC are kept running at full speed | 486DX4 and above
      C2 | Stop Clock | Stops CPU internal and external clocks via hardware | Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
      C2E | Extended Stop Grant | Stops CPU main internal clocks via hardware and reduces CPU voltage; bus interface unit and APIC are kept running at full speed | Core 2 Duo and above (Intel only)
      C3 | Sleep | Stops all CPU internal clocks | Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
      C3 | Deep Sleep | Stops all CPU internal and external clocks | Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
      C3 | AltVID | Stops all CPU internal clocks and reduces CPU voltage | AMD Turion 64
      C4 | Deeper Sleep | Reduces CPU voltage | Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
      C4E/C5 | Enhanced Deeper Sleep | Reduces CPU voltage even more and turns off the memory cache | Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
      C6 | Deep Power Down | Reduces the CPU internal voltage to any value, including 0 V | 45-nm mobile Core 2 Duo only
    • Latency for C states

      C-state   Residency   Wake-up latency
      C0        active      active
      C1        1 μs        1 μs
      C3        106 μs      80 μs
      C6        345 μs      104 μs
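
These states can be inspected from sysfs; here is a minimal sketch, assuming an Intel system using the intel_pstate and cpuidle drivers (paths may differ on other hardware):

    # current frequency governor and minimum P-state percentage
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    cat /sys/devices/system/cpu/intel_pstate/min_perf_pct

    # C-states exposed by the idle driver, with their wake-up latencies in microseconds
    for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
        printf '%s: %s us\n' "$(cat "$d/name")" "$(cat "$d/latency")"
    done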

How to Optimize

  1. Use a real production system for measurement
  2. Note the state of the system before the change (a sketch of capturing state follows this list)
  3. Apply the optimizations that come later in this post
  4. Compare the state of the system after applying the optimizations
  5. Tune until you get the desired results
  6. Document the changes
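
A minimal sketch of steps 2 and 4, snapshotting system state to files for a before/after diff (file names are only examples):

    # before tuning
    sysctl -a > /tmp/sysctl.before
    tuned-adm active > /tmp/tuned.before

    # ...apply the optimizations...

    # after tuning
    sysctl -a > /tmp/sysctl.after
    diff /tmp/sysctl.before /tmp/sysctl.after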

Types of Tools needed

  • Tools that gather system or hardware information (sample invocations follow this list)
    • lscpu
    • uname
    • systemctl list-units -t service
  • Tools for measuring Performance
  • Tools for Microbenchmarking
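
For example, for gathering system and hardware information:

    lscpu                               # CPU model, sockets, cores, threads, NUMA nodes
    uname -r                            # running kernel version
    systemctl list-units -t service     # currently loaded services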

Optimization includes benchmarking, profiling, and tracing

  • Benchmarking
    • Comparing performance against an industry standard
    • Tools used for benchmarking (sample invocations follow this list)
      • vmstat for virtual memory statistics
      • iostat for I/O statistics
      • mpstat for per-processor statistics
  • Profiling: gathering information about hot spots
  • Tracing
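
For example:

    vmstat 1 5          # memory, run queue and CPU statistics, 5 samples at 1-second intervals
    iostat -x 1 3       # extended per-device I/O statistics, 3 samples
    mpstat -P ALL 1 3   # per-CPU utilization, 3 samples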

Based on the benchmarking, profiling, and tracing, the system is tuned in the following ways (sample commands follow this list):

  • Using echo into /proc
    • The settings are not persistent; they last only until reboot.
    • The /proc filesystem has per-PID directories for processes, configuration files, and the sys directory
      • /proc/sys has some important subdirectories such as kernel, net, vm, and user
      • /proc/sys/vm has the settings for virtual memory
  • Using the sysctl command
    • sysctl settings are applied by a service
    • Persistent settings
    • The service runs at boot time
    • It reads settings from the configuration file /etc/sysctl.conf
    • There are some more settings in
      • /usr/lib/sysctl.d, which stores the defaults
      • /run/sysctl.d, which stores runtime settings
    • /etc/sysctl.d stores local (administrator-defined) settings
  • Loading/unloading kernel modules and configuring kernel module parameters
    • Use modinfo to gather information about available parameters
    • Use modprobe to set parameters: modprobe modulename key=val
    • For persistent settings, create a modulename.conf file under /etc/modprobe.d
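
For example (the module name, parameter, and file names below are only illustrative):

    # non-persistent: write directly into /proc (lost at reboot)
    echo 10 > /proc/sys/vm/swappiness

    # persistent: set via sysctl and record it in /etc/sysctl.d
    sysctl -w vm.swappiness=10
    echo 'vm.swappiness = 10' > /etc/sysctl.d/99-tuning.conf

    # kernel module parameters
    modinfo -p ixgbe                                              # list the module's parameters
    modprobe ixgbe max_vfs=4                                      # load the module with a parameter
    echo 'options ixgbe max_vfs=4' > /etc/modprobe.d/ixgbe.conf   # make it persistent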

Non Uniform Memory Access

NUMA is a shared memory architecture used in today’s multiprocessing systems. Each CPU is assigned its local memory and can access memory from other CPUs in the system. Local memory access provides the best performance; it provides low latency and high bandwidth. Accessing memory that is owned by the other CPU has a performance penalty, higher latency, and lower bandwidth.

  • Access to local memory is fast; remote memory access has higher latency
  • Practically all multi-socket systems have NUMA
  • Most servers have 1 NUMA node / socket
  • Some AMD systems have 2 NUMA nodes / socket
  • Sometimes optimal performance still requires manual tuning.

Typical 4 node NUMA system.

[Diagram: typical 4-node NUMA system]

Image credit: https://videos.cdn.redhat.com/summit2015/presentations/15284_performance-analysis-tuning-of-red-hat-enterprise-linux.pdf

Processors and Memory layout in NUMA system

[Diagram: processor and memory layout in a NUMA system]

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

QPI (QuickPath Interconnect)

[Diagram: QPI]

Image credit: http://pages.rubrik.com/rs/794-OHF-673/images/vSphere_6.5_Host_Resources_Deep_Dive.pdf

To see the NUMA nodes and which CPUs belong to each node:

[Screenshot: NUMA nodes and their CPUs]

To see the CPUs under each NUMA node:

[Screenshot: CPUs under each NUMA node]
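
The same information can also be gathered from the command line, for example:

    numactl --hardware      # nodes, their CPUs and memory, and inter-node distances
    lscpu | grep -i numa    # NUMA node count and the CPU list per node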

The NUMA system as represented on my machine:

[Diagram: NUMA topology of my machine]

NUMA related tools

[Screenshot: NUMA-related tools]

NUMA Settings

  • Automatic NUMA balancing: turn off with echo 0 > /proc/sys/kernel/numa_balancing
  • NUMA locality (hit vs miss, local vs foreign)
    • Number of NUMA faults & page migrations
    • /proc/vmstat numa_* fields
  • Location of process memory in NUMA system
    • /proc/<pid>/numa_maps
  • NUMA scans, migrations & NUMA faults by node
    • /proc/<pid>/sched
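
These counters and files can be inspected directly; a quick sketch (replace <pid> with the PID of the process of interest):

    grep ^numa /proc/vmstat       # system-wide NUMA hit/miss/local/foreign counters
    cat /proc/<pid>/numa_maps     # where the process's memory is placed across nodes
    grep numa /proc/<pid>/sched   # NUMA scan/fault statistics (present when NUMA balancing is active)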

 

More info here