C++ Operator Overloading

I recently came across a tricky situation in operator overloading. The idea is overload ++ operator for a class. Since ++ works as both prefix and postfix.

int x = 0;
 auto y = ++x; // This is prefix 
 std::cout << "x: "<< x++ << "\n"; // this is postfix and value of 
// x is changed after the execution of this statement x = 1
 std::cout << "y: " << y << "\n"; // y = 1

When it comes to overloading of ++ operator, things becomes little tricky. Following is the way to overload ++ operator for both prefix and postfix.

struct X{
  int i = 0;
  // The prefix is coded as ++x
  X & operator++(){
    return *this;
  // this is postfix as x++
  X operator ++(int){
    X y(*this); 
    return y;

the prefix solution is straight forward but the postfix returns a copy of instance of X and hence its first copied, prefix increment is done and return the instance.

When overloading another operator << following code can be used.

struct X{
  friend std::iostream& operator<<(std::iostream & os, const X & x){
    os << "x.i: "<< x.i << "\n";
    return os;




Caching in Modern Microprocessors

The modern microprocessors have a speed of 3GHz vs the main memory (RAM) has a speed of 133 MHz. With the newest advancements in hardware, the memory bandwidth only increased between DDR to DDR2 and DDR3 but the latency decrease by a very little factor. Caches were introduced as a level of indirection in order to hide the latency gap.  Caches work on two very important principles of data access patterns:

  1. Spatial Locality: if a memory location is accessed, the nearby memory location is most probably accessed, example: arrays
  2. Temporal Locality: if memory location is accessed, there is probability that the same location will be accessed again, example: loops

Modern x86 processors have the hierarchy of an L1, L2 and a distributed LLC accessed via the on-die scalable ring architecture. L1 and L2 cache are private to the core and the LLC is shared within all the cores.

  • The Access Timing in terms of CPU cycles
    • L1 – 4 cycles
    • L2 – 12 cycles
    • LLC – 26 to 31 cycles

  • The cache size
    • L1 32 KB data cache (L1I) + 32 KB instruction cache (L1D)
    • L2 – 256 KB

    • LLC – 8 MB to 56 MB.
  • Cache line
    • It is unit of transport between memory.
    • on x86 the cache line size is 64 bytes
  • Data Prefetchers
    • spatial prefetcher attempts to complete every cache line fetched to the L2 cache with another cache line to fill a 128-byte aligned chunk.

    • streamer prefetcher monitors read requests from the L1D cache and fetches the appropriate data and instructions. Server vendors might use their own designation for L1 and L2 prefetchers

  • Cache Snooping protocols:







    The cache line is updated relative to memory

    Single core



    The cache line is consistent with memory

    Single core



    The cache line is shared with other cores. (The cache line is consistent with other cores, but may not be consistent with memory)

    Multiple cores



    The cache line is not present in this core L1 or L2

    Multiple cores



Types of Caches

  1. Directly mapped caches:
    • an address can reside only at a particular address in cache.
    • Screen Shot 2018-03-18 at 6.23.01 PM
    • Direct mapping is simple and inexpensive to implement, but if a program accesses 2 blocks that map to the same line repeatedly, the cache begins to thrash back and forth reloading the line over and over again meaning misses are very high.
    • Processor registers use logic as direct mapped.
  2. Fully associative caches
    • any block can go into any line of the cache.
  3. Set-associative caches
    • set associative addresses the problem of possible thrashing in the direct mapping method. It does this by saying that instead of having exactly one line that a block can map to in the cache, we will group a few lines together creating a set. Then a block in memory can map to any one of the lines of a specific set. There is still only one set that the block can map to.
    • Screen Shot 2018-03-18 at 6.25.44 PM

Cache Eviction Policy

There are many algorithms for cache eviction in case of conflict example:

  • FIFO – First in first out
    • keep track of insertion time and evicts block that is oldest.
  • LRU – least recently used
    • keep track of references to the data in cache and data with lowest references will be evicted
  • Random

Cache Flush Policy

When the data is updated in cache, there are following policies for updating main memory with updates to cache. Since the main memory is slow, the update is deferred as much as possible.

  • Write-through:
    • Cache pushes all the changes to the main memory immediately
    • advantages are the implementation is simple and less error prone.
    • disadvantages are requirement of bandwidth because of subsequence writes and also write backs take latency because of slowness of main memory.
    • write-no-allocate:
      • the block is modified in the main memory and not loaded into the cache.
    • write-allocate:
      •  the block is loaded on a write miss, followed by the write-hit action.
  • Write back:
    • the update is deferred as much as possible.

Cache Misses

  • Cold or compulsory misses:
    • Cache misses because of first reference to the block in program
  • Capacity misses:
    • Cache misses because of capacity is not enough.
  • Conflict misses:
    • Cache misses because of conflicting cache location.
  • Coherent misses:
    • The coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread’s cache has been invalidated by a write from another thread.
  • Coverage misses:
    • he coverage miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the processor’s cache has been invalidated as a consequence of a directory eviction.

Cache Hierarchy

Image credit: http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms?

Cache hierarchy latency vs capacity.

Translation Lookaside Buffer – TLB

  • The TLB is an other form of cache that stores the virtual to physical mappings of addresses.
  • TLB is fully associative.
  • There are generally 2 levels of TLB and less number of entires (128).
  • When a page fault occurs, the hardware starts address translation. Since the address translation in x86 is 5 level, the hardware start looking into the TLB if the address has been translated recently and if found, the translation returns immediately.
  • If the address is not found in the TLB, the hardware walks through the page tables entries in the main memory in order to find the physical address for that page.




Understanding glibc malloc


I always got fascinated by heap memory. Questions such as

How heap memory is obtained from kernel?
How efficiently memory is managed?
Is it managed by kernel or by library or by application itself?
Can heap memory be exploited?

were in my mind for quite some time. But only recently I got time to understand about it. So here I would like to share my fascination turned knowledge!! Out there in the wild, many memory allocators are available:

  • dlmalloc – General purpose allocator
  • ptmalloc2 – glibc
  • jemalloc – FreeBSD and Firefox
  • tcmalloc – Google
  • libumem – Solaris

Every memory allocator claims they are fast, scalable and memory efficient!! But not all allocators can be suited well for our application. Memory hungry application’s performance largely depends on memory allocator performance. In this post, I will only talk about ‘glibc malloc’ memory allocator. In future, I am hoping to cover up other memory allocators. Throughout…

View original post 3,632 more words

Commons Problems in Parallel Programming

Dining Philosophers Problem

140px-An_illustration_of_the_dining_philosophers_problemThis is a classic problem with data sharing and signaling.

Five (male) philosophers spend their lives thinking and eating. The philosophers share a common circular table surrounded by five chairs, each belonging to one philosopher. In the centre of the table there is a bowl of spaghetti, and the table is laid with five forks, as shown in figure. When a philosopher thinks, he does not interact with other philosophers. From time to time, a philosopher gets hungry. In order to eat he must try to pick up the two forks that are closest (and are shared with his left and right neighbors), but may only pick up one fork at a time. He cannot pick up a fork already held by a neighbor. When a hungry philosopher has both his forks at the same time, he eats without releasing them, and when he has finished eating, he puts down both forks and starts thinking again.

Solutions to this kind of problem are as follows:

  1. Use resource hierarchy: assign numbers to the forks (resources). The philosopher can only request resource with lower number first then the higher number.
  2. Use a central entity for assigning permissions: use a arbitrator (waiter) as a mutex which will allow the philosopher to periodically use the forks(resources).
  3. Distributed message passing between the philosophers without a central authority.
    1. For every pair of philosophers contending for a resource, create a fork and give it to the philosopher with the lower ID (n for agent Pn). Each fork can either be dirtyor clean. Initially, all forks are dirty.
    2. When a philosopher wants to use a set of resources (i.e. eat), said philosopher must obtain the forks from their contending neighbors. For all such forks the philosopher does not have, they send a request message.
    3. When a philosopher with a fork receives a request message, they keep the fork if it is clean, but give it up when it is dirty. If the philosopher sends the fork over, they clean the fork before doing so.
    4. After a philosopher is done eating, all their forks become dirty. If another philosopher had previously requested one of the forks, the philosopher that has just finished eating cleans the fork and sends it.

Sleeping Barber Problem

206px-Sleeping_barberThis is a classic single procedure multiple consumer problem.

The analogy is based upon a hypothetical barber shop with one barber. The barber has one barber chair and a waiting room with a number of chairs in it. When the barber finishes cutting a customer’s hair, they dismiss the customer and go to the waiting room to see if there are other customers waiting. If there are, they bring one of them back to the chair and cut their hair. If there are no other customers waiting, they return to their chair and sleep in it.

Each customer, when they arrive, looks to see what the barber is doing. If the barber is sleeping, the customer wakes them up and sits in the chair. If the barber is cutting hair, the customer goes to the waiting room. If there is a free chair in the waiting room, the customer sits in it and waits their turn. If there is no free chair, the customer leaves.

Process Synchronization and IPC

Why Synchronization

Since Linux 2.0 support Symmetric Multiprocessing with multicore modern microprocessors. But with this new dimension of problems arose like race conditions, dead/live locks etc. In order to mitigate these issues, various synchronization mechanisms were introduced in Linux. The 2.6 version of Linux supported full preemption, that means scheduler can preempt kernel code at virtually any point and reschedule another task.

  • critical path: code that access and manipulate shared data

Reasons for Synchronization:

  • Interrupts. An interrupt can occur asynchronously at almost any time, interrupting the currently executing code.
  • Softirqs and tasklets. The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the currently executing code.
  • Kernel preemption. Because the kernel is preemptive, one task in the kernel can preempt another.
  • Sleeping and synchronization with user-space. A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process.
  • Symmetrical multiprocessing. Two or more processors can execute kernel code at exactly the same time.

Ways to Synchronization

  • Shared Memory
  • Sockets
  • Atomics and lock-free programming
  • Pipes/FIFOs
  • Mutexes
    • recursive
    • reader/writer
  • Spinlocks
  •  Semaphores
    • Binary and counting
  • Condition variables
  • Messages Queues


Since all threads run in the same address space, they all have access to the same data and variables. If two threads simultaneously attempt to update a global counter variable, it is possible for their operations to interleave in such way that the global state is not correctly modified. Although such a case may only arise only one time out of thousands, a concurrent program needs to coordinate the activities of multiple threads using something more reliable that just depending on the fact that such interference is rare.

A semaphore is somewhat like an integer variable, but is special in that its operations (increment and decrement) are guaranteed to be atomic—you cannot be halfway through incrementing the semaphore and be interrupted and waylaid by another thread trying to do the same thing. That means you can increment and decrement the semaphore from multiple threads without interference. By convention, when a semaphore is zero it is “locked” or “in use”.

Semaphores vs. mutexes (from wikipedia)
A mutex is essentially the same thing as a binary semaphore and sometimes uses the same basic implementation. The differences between them are in how they are used. While a binary semaphore may be used as a mutex, a mutex is a more specific use-case, in that only the thread that locked the mutex is supposed to unlock it. This constraint makes it possible to implement some additional features in mutexes:

  • Since only the thread that locked the mutex is supposed to unlock it, a mutex may store the id of thread that locked it and verify the same thread unlocks it.
  • Mutexes may provide priority inversion safety. If the mutex knows who locked it and is supposed to unlock it, it is possible to promote the priority of that thread whenever a higher-priority task starts waiting on the mutex.
  • Mutexes may also provide deletion safety, where the thread holding the mutex cannot be accidentally deleted.
  • Alternately, if the thread holding the mutex is deleted (perhaps due to an unrecoverable error), the mutex can be automatically released.
  • A mutex may be recursive: a thread is allowed to lock it multiple times without causing a deadlock.

Semaphore is one of the thread synchronization mechanisms. Unlike mutex, semaphore allows more than one thread to access similar shared resources at the same time. There are two types of semaphores

  • binary semaphore
  • counting semaphore.

Pipes and FIFOs

Pipes are used for communicating data between the processes. Output of a process is piped as an input to another process.

cat doctum.txt | wc

Linux: Pipes no Linux
Output of the cat command goes to word count

image credit: https://www.vivaolinux.com.br/dica/Pipes-no-Linux





FIFOs are named pipes. FIFOs can be used as follows:

int mkfifo(const char *path, mode_t mode);
int mkfifoat(int fd, const char *path, mode_t mode);

once the FIFOs are made, they can be opened using open command and the normal file I/O functions (e.g., closereadwriteunlink) all work with FIFOs.

Linux Processes

process is a program in execution. Since linux represent everything in terms of files, currently running processes are also represented using /proc file system. The proc/<pid> contains open files (sockets), pending signals, process state, kernel data structures, threads of execution and data section.

Processes provides two types of virtualization

  • Virtual Processor
  • Virtual Memory

Processes are created on linux using fork() system call. When fork() is called, a process is created which is child of process who is called fork(). The reason it’s called as a child because it gets copy of resources (data, variables, code, pages, sockets etc) from it’s parent process.

The list of processes are stored in a doubly linked list called as task list. The process descriptor is type  struct task_struct defined in <linux/sched.h> .

Processes are identified by PID. PID is a defined by pid_t data structure which is an integer up to 32,768. Generally processes up to number 1024 is kernel processes and rest of them are user processes. When a new process is forked, the number is not sequential but its randomized in order to avoid being guesses. When number is processes reaches max limit, the number is reseted and starts from 1024. The limit of number of processes can be increased by setting value at proc/sys/kernel/pid_max.

Process creation

  • fork()
    • a process can create another process using fork() function.
    • It is implemented using clone()
    • fork(), vfork() and __clone() calls clone() which calls do_fork() which calls copy_process()
    • the process created is called as child process
    • Example of simple fork call:
      pid_t pid = fork();
    • if return value of fork() call is 0, fork is successful and control block is in child process.
       if (pid == 0)
            printf("child process");
    • if return value of fork() call is greater than 0, fork is successful and control block is in parent process process.
      else if (pid > 0)
            printf("parent process");
    • if return value of fork() is negative, the call is failed.
            printf("fork failed");
    • fork() creates a child process that is a copy of the current task. It differs from the parent only in its PID, its PPID (parent’s PID), and certain resources and statistics (e.g. pending signals) which are not inherited
    • Besides the open files, other properties of the parent are inherited by the child:
      • Real user ID, real group ID, effective user ID, and effective group ID
      • Supplementary group IDs
      • Process group ID
      • Session ID
      • Controlling terminal
      • The set-user-ID and set-group-ID flags
      • Current working directory
      • Root directory
      • File mode creation mask
      • Signal mask and dispositions
      • The close-on-exec flag for any open file descriptors
      • Environment
      • Attached shared memory segments
      • Memory mappings
      • Resource limits
  • exec():
    • The exec() family of function calls creates a new address space and loads a new program into the newborn child immediately after a fork.
    • There are following variations of exec calls
      • int execl(const char *path, const char *arg, ...
         /* (char *) NULL */);
      • int execlp(const char *file, const char *arg, ...
         /* (char *) NULL */);
      • int execle(const char *path, const char *arg, ...
         /*, (char *) NULL, char * const envp[] */);
      • int execv(const char *path, char *const argv[]);
      • int execvp(const char *file, char *const argv[]);
      • int execvpe(const char *file, char *const argv[],
         char *const envp[]);
Function pathname filename fd Arg list argv[] environ envp[]
execl * * *
execlp * * *
execle * * *
execv * * *
execvp * * *
execve * * *
fexecve * * *
(letter in name) p f l v e

Process 0 and 1

Process 0 is ancestor of all processes which is also known as swapper process. This process is build from scratch during boot up. Process 0 initializes all the data structures needed by kernel, enables interrupts, and creates init process which is process 1.init process, invoked by the kernel at the end of the bootstrap procedure. It is responsible for bringing up the system after the kernel has been bootstrapped. init usually reads the system-dependent initialization files (/etc/rc* files or /etc/inittab and the files in /etc/init.d) and brings the system to a certain state. Process 1 never dies. It is a normal user process, not a system process within the kernel, but runs with superuser privileges.

State of a Process

Process can be any one of following states:

    • Process is being executed on CPU or waiting
    • Process is sleeping/suspended.
    • Process is in sleeping state but signal to these processes will not change state of the process.
    • Process being stopped.
    • Process is being debugged
    • Child process is being removed because parent process issued wait4() or waitpid()
    • Child process being terminated but parent has not issued wait() or waitpid()

Copy on Write – an optimization on fork()

All the data and resources that belong to the parent are not copied to child. This is done is steps and as needed. This is an optimization in order to delay copying data as needed. The another reason is in calls such as exec(), none of the parent data is actually needed, there should not be a need to copy any pages from parent.

Process Switch

During process switch the registers are saved in kernel memory area for process and loaded back when process resumes on CPU. This is was done using far jmp assembly instruction on x86. On linux 2.6 onwards software is used for context switching instead of hardware context switching. Software context switch is safer and faster.


Processes work on time slicing basis giving user the illusion that only the user has access to the CPU and the memory is unlimited. The scheduling of processes depend upon scheduling policy or algorithm. The scheduling algorithm must schedule in such a way that processes should have fast response, good throughput, limited use of resources and honor the priorities.

Linux processes are preemptive. A process with high priority than low priority runs first by interruption of low priority process is one is running. There are following classes of Linux processes:


Scheduling factors for a process

  • static priority: between 100 (high priority) to 139 (low priority)
  • nice value: -20 (high priority) to 19 (low priority)
  • dynamic priority: between 100 (high priority) to 139 (low priority)
  • real-time priority: 1 (highest priority) to 99 (lowest priority)

Kernel Preemption

In non-preemptive kernels, kernel code runs until completion. That is, the scheduler cannot reschedule a task while it is in the kernel: kernel code is scheduled cooperatively, not preemptively. Kernel code runs until it finishes (returns to user-space) or explicitly blocks.

The Linux kernel (since 2.6), unlike most other Unix variants and many other operating systems, is a fully preemptive kernel. It is possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.


Linux threads are light weight processes i.e. Linux does not make any difference between threads and processes. They share resources with parent like heap space, file descriptors, sockets  but each thread gets own stack.

  • each thread contain TCB- a thread control block which contains following
    • Thread Identifier:Unique id (tid) is assigned to every new thread
    • Stack pointer: Points to thread’s stack in the process
    • Program counter: Points to the current program instruction of the thread
    • State of the thread (running, ready, waiting, start, done)
    • Thread’s register values
    • Pointer to the Process control block (PCB) of the process that the thread lives on
  • threads are also created with clone() system call like other processes except for some flags to clone like

Address Space of a Process

The address space of a process consist of linear addresses the process is allowed to use. The linear address spaces are different per process. The linear addresses are managed in form of Memory Regions. Following are examples of system calls to manage memory regions:

Syscall usage
brk()/sbrk() changing the heap size of process
execve Load a new program in the memory in the currently running process
_exit() exit the current process and delete the memory space occupied
fork() create a new (child) process, copying/referencing page tables from parent process
mmap(), mmap2() creates memory mapped file, enlarge the memory address space of the process.
mremap() Expand or shrink memory region
munmap() destroys previously mmaped memory area and shrinks memory space of the process
shmat() Attaches new memory region to the process
shmdt() Detaches new memory region to the process

List all processes

Since linux uses proc file system to represent processes, listing all the processes can be done using ls /proc/

Screen Shot 2018-03-04 at 11.24.18 PM

listing /proc/<PID>/

Screen Shot 2018-03-04 at 11.28.01 PM

The directories and files listed here represent resources like file descriptors /fd/ etc

Memory Descriptor

The address associated with a process is stored in memory descriptor with type mm_struct. All the descriptors are stored in a doubly linked list with type mmlist.

Heap Management

malloc(size): c stdlib method to request size of Dynamic memory. On success returns pointer to  first linear address. On failure returns non zero integer.

calloc(n, size):c stdlib method to request size of Dynamic memory. On success returns pointer to  first linear address, set the memory to 0. On failure returns non zero integer.

realloc(size):c stdlib method that changes the size of memory location of previously allocated memory region.

free(): frees the allocated memory.

brk(address): Modifies the heap directly

sbrk(incr): increases/decreases the memory region

brk() System call:

  • brk() is the only function that is a syscall. All the other functions listed above are implemented using brk() and mmap()
  • it allocated and deallocated whole pages since it acts on memory region.

System calls a.k.a syscalls:

syscalls are implemented in Kernel and can be called from a process running in user space with appropriate privileges. There are around 330 number of system calls in Linux. The Linux API is written in c.

Process Termination

Process termination can be normal (finishing intended code execution) or can be abnormal. The Abnormal execution is because of following reasons:

  • Kernel, the process itself or any other process send certain signal.
  • Sigfault: Process tries to access memory location which it does not have privilege of access.
  • Abnormal conditions is exit with non-zero code.

On process exit Kernel releases the memory and other resources held by the process like file descriptors.

When Parent process dies, the children of that process becomes orphan processes and the process init becomes parent of those processes. Sometimes parents do not wait for child processes after fork(). These child processes becomes zombi processes after termination. A process can wait (block) when its children are in the running state using wait or waitpid.

  • The wait function can block the caller until a child process terminates, whereas waitpid has an option that prevents it from blocking.
  • The waitpid function doesn’t wait for the child that terminates first; it has a number of options that control which process it waits for.

Linux Performance Optimization

How to Optimize

  1. Use real production system for measurement
  2. Note the state of system before change
  3. Apply the optimizations later going to come in the post.
  4. Compare state of system after applying the optimizations
  5. Tune until you get desired results.
  6. Document the changes.

Types of Tools needed

  • Tools that gather system information or Hardware
    • lscpu
    • uname
    • systemctl list -units -t service
  • Tools for measuring Performance
  • Tools for Microbenchmarking

Optimization include Benchmarking, Profiling and Tracing

  • Benchmarking
    • this is comparing performance with industry standard
    • Tools used for benchmarking
      • vmstat used for benchmarking virtual memory
      • iostat used for benchmarking io
      • mpstat multiple processor
  • Profiling gathering information about hot spots
  • Tracing

Based on the Benchmarking, Profiling and Tracking the system is tuned as follows;

  • using echo into /proc
    • The settings are not persistent. Stay until reboot.
    • /proc files system has pid directories for processes, configuration files and sys directory
      • /proc/sys directory has some important directories like /kernel, /net, /vm , /user etc.
      • /proc/sys/vm has settings for with virtual memory.
  • Using sysctl command
    • sysctl it’s a service
    • Persistent settings.
    • It’s started at the boot time’s
    • It reads settings from a configuration file at /etc directory
    • There are some more settings in
      • usr/lib/sysctl.d stores the default
      • /run/sysctl.d stores the runtime
    • /etc/sysctl.d stores runtime settings
  • Using loading/unloading of kernel modules and configure kernel parameters
    • Using modinfo to gather information about available parameters
    • Using modprobe for modifying parameters
    • modprobe modulename key=val
    • for persistent settings, /etc/modprobe.d modulename.conf

Configuration using tuned

  • tuned is available from RHEL7
  • Make changes in sysctl as well as /sys directory
  • there are profiles available for specific needs like network-latency, latency-performance, network-performance, throughput-performance, desktop and balanced etc

Screen Shot 2018-03-14 at 7.49.16 PM

Examples of the profiles

Screen Shot 2018-03-14 at 7.50.11 PM

Latency Performance Settings in Linux Tuned

Settings  Meaning
force_latency=1 processor state C1
governor=performance CPU is at higher performance state
energy_perf_bias= performance Higher performance state
min_perf_pct=100 comes from the P-state drivers. The interfaces provided by the cpufreq core for controlling frequency the driver provides sysfs files for controlling P state selection. These files have been added to /sys/devices/system/cpu/intel_pstate.
kernel.sched_min_granularity_ns=10000000 Minimal preemption granularity for CPU-bound tasks
vm.dirty_ratio=10 The generator of dirty data starts writeback at this percentage
vm.dirty_background_ratio=3 Start background writeback (via writeback threads) at this percentage
vm.swappiness=10 The swappiness parameter controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. 0 tells the kernel to avoid swapping processes out of physical memory for as long as possible 100 tells the kernel to aggressively swap processes out of physical memory and move them to swap cache
kernel.sched_migration_cost_ns=5000000 The total time the scheduler will consider a migrated process “cache hot” and thus less likely to be re-migrated
net.core.busy_read=50 This parameter controls the number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option.
sysctl.net.core.busy_poll=50 This parameter controls the number of microseconds to wait for packets on the device queue for socket poll and selects
kernel.numa_balancing=0 disable NUMA balancing
net.ipv4.tcp_fastopen=3 Linux supports configuring both overall client and server support via /proc/sys/net/ipv4/tcp_fastopen (net.ipv4.tcp_fastopen via sysctl). The options are a bit mask, where the first bit enables or disables client support (default on), 2nd bit sets server support (default off), 3rd bit sets whether data in SYN packet is permitted without TFO cookie option. Therefore a value of 1 TFO can only be enabled on outgoing connections (client only), value 2 allows TFO only on listening sockets (server only), and value 3 enables TFO for both client and server.

More Tunables:

  • Dependent on NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/swappiness
      • /proc/sys/vm/min_free_kbytes
  • Independent of NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/vfs_cache_pressure
      • Writeback Parameters
        • /proc/sys/vm/dirty_background_ratio
        • /proc/sys/vm/dirty_ratio
      • Readahead parameters
        • /sys/block/queue/read_ahead_kb

top utility

  • available on all the machines
  • default tool for measurement that has lot of information.

Parameters important for Optimization


  • Controls how aggressively the system reclaims anonymous memory versus
    page cache memory:

    • Anonymous memory – swapping and freeing
    • File pages – writing if dirty and freeing
    • System V shared memory – swapping and freeing
  • Default is 60
  • If it is decreased: more aggressive reclaiming of page cache memory
  • If it is Increased: more aggressive swapping of anonymous memory
  • Should be set to 0 as per the low latency optimization guide http://wiki.dreamrunner.org/public_html/Low_Latency_Programming/low-latency-programming.html

Memory reclaim watermarks:

Screen Shot 2018-03-14 at 10.54.08 PM


  • Controls NUMA specific memory allocation policy
  • When set and node memory is exhausted:
    • Reclaim memory from local node rather than allocating
      from next node
    • Slower initial allocation, higher NUMA hit ratio
  • When clear and node memory is exhausted:
    • Allocate from all nodes before reclaiming memory
    • Faster initial allocation, higher NUMA miss ratio
  • To see current setting: cat /proc/sys/vm/zone_reclaim_mode
  • Turn ON: echo 1 > /proc/sys/vm/zone_reclaim_mode
  • Turn OFF: echo 0 > /proc/sys/vm/zone_reclaim_mode
  • It is recommended that the settings should be off for large in memory database server.

CPU Tuning

  • p-states
    • It’s an Advanced Configuration and Power Interface (ACPI) defined processor performance state.
    • P0, P1, P2..p11 etc are values for P-states. For highest performance the p-states should be set to P0
  • c-states
    • It allow the CPU package to shut down cores and parts of the CPU microarchitecture to save energy while balancing response times.

    • C0, C1, C3, C6 etc are the values for c-states. Here is the chart for more information

      Mode Name What it does CPUs
      C0 Operating State CPU fully turned on All CPUs
      C1 Halt Stops CPU main internal clocks via software; bus interface unit and APIC are kept running at full speed. 486DX4 and above
      C1E Enhanced Halt Stops CPU main internal clocks via software and reduces CPU voltage; bus interface unit and APIC are kept running at full speed. All socket LGA775 CPUs
      C1E Stops all CPU internal clocks. Turion 64, 65-nm Athlon X2 and Phenom CPUs
      C2 Stop Grant Stops CPU main internal clocks via hardware; bus interface unit and APIC are kept running at full speed. 486DX4 and above
      C2 Stop Clock Stops CPU internal and external clocks via hardware Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
      C2E Extended Stop Grant Stops CPU main internal clocks via hardware and reduces CPU voltage; bus interface unit and APIC are kept running at full speed. Core 2 Duo and above (Intel only)
      C3 Sleep Stops all CPU internal clocks Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
      C3 Deep Sleep Stops all CPU internal and external clocks Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
      C3 AltVID Stops all CPU internal clocks and reduces CPU voltage AMD Turion 64
      C4 Deeper Sleep Reduces CPU voltage Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
      C4E/C5 Enhanced Deeper Sleep Reduces CPU voltage even more and turns off the memory cache Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
      C6 Deep Power Down Reduces the CPU internal voltage to any value, including 0 V 45-nm mobile Core 2 Duo only
    • Latency for C states
    • C-STATE







      1 μs

      1 μs


      106 μs

      80 μs


      345 μs

      104 μs