c++ multithreading code

Screen Shot 2018-04-12 at 11.33.23 PM
By I, Cburnett, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2233446

Simple example of defining work as () operator overload on a struct.

#include <thread>
#include <iostream>

struct work{
 void operator ()() const{
 std::cout << "this is test from work" << std::endl;

int main(){
   work w;
   std::thread t([&](){
   t.join(); // wait in the main thread for finishing the thread t.
   return 0;

std::thread objects are not copyable but they are movable.

std::thread t1([]{std::cout << std::this_thread::get_id() << std::endl;});
std::thread t2 = t1; // NOT going to work since constructor 
for thread class is private
std::thread t2 = std::move(t1); // this work!

Detach thread: background/long running jobs which does not to report the status back to parent threads can be detached.

t.detach(); // once detached, can not be joined.

RAII applied to thread class


#include <thread>
#include <iostream>

struct RThread{
     std::thread t([](){std::cout << "Hello from thread " << std::this_thread::get_id() << std::endl;});
     t_= std::move(t);

   // Rule of 5

  // delete the copy constructor and copy assignment operator
  RThread(RThread const &) = delete;
  RThread& operator=(RThread const &) = delete;

  // use default move operator and move assignment operator
  RThread(RThread const &&) = default;
  RThread & operator=(RThread const && ) = default;

   std::thread t_;

int main(){
 return 0;

Threads pool and Atomic shared variable

#include <thread>
#include <iostream>
#include <vector>
#include <atomic>

std::atomic<int> total(0);

int main(){
   std::vector<std::thread> workers;
   for(int i = 0; i <= 20; ++i){
         for(int j = 0; j <= 20; ++j){
           // can also use ++total same effect
           // Memory order does not matter on x86
           total.fetch_add(1, std::memory_order_relaxed);

       [](auto & w){

  std::cout << total << std::endl;
  return 0;

Using non-atomic variables with mutexes to prevent data races

#include <mutex>
#include <thread>
#include <iostream>
#include <vector>

static int total = 0;
static int worker_count = 20;
std::mutex mtx;

int main(){
  std::vector<std::thread> workers;
  for(int i = 0; i < worker_count; ++i){
    for(int j = 0; j < worker_count; ++j){ 
      std::lock_guard<std::mutex> lkg(mtx);

for(auto & worker: workers){
   if (worker.joinable()){

std::cout << total << std::endl;
 return 0;

DCLP – Double Checked Linked Pattern

As per the paper by Scott Meyers and Andrei Alexandrescu, the DCLP pattern implementation is given below.

In order to get instance of a singleton class, there are two checks, first check without the lock, then the lock is acquired and then again checked if the instance exists.

Singleton* Singleton::instance() { 
if (pInstance == 0) { // 1st test 
    Lock lock; // acquire the lock 
    if (pInstance == 0) { // 2nd test 
      pInstance = new Singleton; 
  return pInstance; 

In order to simplify this situation, c++ standard came with call_once and once_flag.

std::once_flag pInstance_initialized; // Flag
std::call_once(pInstance_initialized, {pInstance =new Singleton;});

Here is an example code called once as per the once flag:

#include <thread>
#include <iostream>
#include <vector>
#include <chrono>
#include <atomic>

int worker_count = 20;

int main(){
  std::vector<std::thread> workers;
  std::once_flag is_called;
  for(int i = 0; i < worker_count; ++i){
      std::call_once(is_called, [](){
      std::cout << "once_flag is set in the thread id " << std::this_thread::get_id() << std::endl;

  for(auto & worker: workers){
   if (worker.joinable()){

  return 0;

Condition Variables: used for intercommunication between the threads. Condition Variables work with mutex as a way of synchronization between threads.

Following is a classic example of printing number series like 0102030405… using 3 threads. One thread prints 0, another prints all odd numbers and 3rd thread prints even numbers. Condition variables are used to solve this problem.

#include <thread>
#include <iostream>
#include <mutex>
#include <condition_variable>

std::condition_variable cv;
std::mutex mtx;
int n = 1;
int max = 100;
bool digit_printed = false;

int main(){
  std::thread t_zero([](){
  while(n < max){
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [](){
      return digit_printed == true;
    std::cout << 0<< "\n";
    digit_printed = false;

std::thread t_odd([](){
  while(n < max){
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [](){
      return digit_printed == false && n%2 == 1;
    std::cout << n++ << "\n";
    digit_printed = true;

std::thread t_even([](){
  while(n < max){
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [](){
      return digit_printed == false && n%2 == 0;
    std::cout << n++ << "\n";
    digit_printed = true;
 return 0;


c++ Multithreading Theory and Concepts

Concurrency vs parallelism

Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. For example, multitasking on a single-core machine.

Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.

following image explains the difference correctly.

image credit: https://medium.com/@deepshig/concurrency-vs-parallelism-4a99abe9efb8

Why Threading

  • Forking processes has lot of overhead like creating the data structures for processes.
  • Interprocess communication is also complex. Because of overhead, its not appropriate for latency sensitive applications.
  • Threading makes parallelism easy with simplified coding API.
  • Threading makes data sharing simple.

Threading in Linux

Linux represents threads as lightweight processes. POSIX threads library provides API for thread management using the header pthread.h. A simple program to create threads is shown as

#include <pthread.h>
#include <iostream>

void * work(void * void_ptr)
  std::cout << "in the work function" << std::endl;
  return NULL;

int main(){
  pthread_t thread_id;
  pthread_create(&thread_id, NULL, work, NULL);
  pthread_join(thread_id, NULL);
  return 0;

Threading using Modern c++

#include <thread>
#include <iostream>

void work(){
 std::cout << "Hello from thread " << std::this_thread::get_id() << std::endl;

int main(){
 std::thread t(work);
 return 0;

Modern c++ 11 introduced following in regards to multi-threading

  • std::thread and related classes
  • memory model: this defines concepts such as memory location, threads, synchronization primitives and most importantly the memory order.
    • memory order: memory order means how access to memory is managed for atomic and non-atomic variables and operations. The memory order guarantees the memory read or write operations will be done in the order specified. There are following types of memory orders.
      • memory_order_relaxed
      • memory_order_consume
      • memory_order_acquire
      • memory_order_release
      • memory_order_acq_rel
      • memory_order_seq_cst: this is default memory order if no memory order is specified,

Why memory ordering

The program written in c/c++ gets compiled to assembly. The compiler reorders the instructions are all the times for optimization. In a single threaded program, if  load or store instruction reordered for a variable, there is generally no problem because the effects of reordering is observed at the end of program. But in a multi-threaded program if load or store instructions for a shared variable get reordered, then the threads may observe the shared variable in inconsistent state.

Screen Shot 2018-04-08 at 11.43.49 PM
The instruction reordering as per Herb Sutter’s talk: https://www.youtube.com/watch?v=KeLBd2EJLOU

Here is an excellent talk on memory reordering:

Reordering can be achieved using fences. Fences can be Compiler fences or hardware fences. Memory order types in c++ memory model:

  • memory_order_relaxed: instructions can be ordered in sequence, there is no strict ordering applied.
  • memory_order_acquire: A load operation guarantees that the instructions after acquire operations happen after the load happens i.e. the operations will NOT be moved before current load happens.
  • memory_order_release: The store operation guarantees that the instructions happening before release operations happen before store happens, i.e. the instructions will not be moved after the store.
  • memory_order_acq_rel: This combines effects of both memory_order_acquire and memory_order_release. This provides the read/write ordering relative to atomic variables.
  • memory_order_seq_cst: This is most strictest of all the memory ordering.  This provides read/write ordering relative to all variables including non-acomic variables too.

Here are couple more interesting talks on this subject


Hardware Fences/Barriers in x86

x86 ISA provides strictest memory ordering guarantees.

There are 3 types of fences:

  • SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.

  • LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.2

  • MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

The memory ordering in x86

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads
  • Reads may be reordered with older writes to different locations but not with older writes to the same location.

  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

  •  Reads cannot pass earlier LFENCE and MFENCE instructions.

  •  Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

  • LFENCE instructions cannot pass earlier reads. 

  • SFENCE instructions cannot pass earlier writes or executions of CLFLUSH and CLFLUSHOPT.

  • MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.

In a multiple-processor system, the following ordering principles apply:

  • Individual processors use the same ordering principles as in a single-processor system.
  • Writes by a single processor are observed in the same order by all processors.
  • Writes from an individual processor are NOT ordered with respect to the writes from other processors.
  • Memory ordering obeys causality (memory ordering respects transitive visibility).
  • Any two stores are seen in a consistent order by processors other than those performing the stores
  • Locked instructions have a total order.


Issues in Multithreading

  • Data sharing can be complicated
  • Race conditions can be difficult to debug
  • Dead locks
  • Live locks

c++ std::vector objects

Find unique numbers in the vector if there is ONLY one element that is unique

#include <vector>
#include <accumulate>

std::vector <int> v {1,2,1,1,1,1,1,1,1,1,1};

int u = std::accumulate(v.begin(), v.end(), (uint32_t)0, [](uint32_t a, uint32_t b) { return a ^ b; });
// The reason this works is because xoring the number with itself is 0 and xroing number with 0 is itself.
>> 2


C++ Operator Overloading

I recently came across a tricky situation in operator overloading. The idea is overload ++ operator for a class. Since ++ works as both prefix and postfix.

int x = 0;
 auto y = ++x; // This is prefix 
 std::cout << "x: "<< x++ << "\n"; // this is postfix and value of 
// x is changed after the execution of this statement x = 1
 std::cout << "y: " << y << "\n"; // y = 1

When it comes to overloading of ++ operator, things becomes little tricky. Following is the way to overload ++ operator for both prefix and postfix.

struct X{
  int i = 0;
  // The prefix is coded as ++x
  X & operator++(){
    return *this;
  // this is postfix as x++
  X operator ++(int){
    X y(*this); 
    return y;

the prefix solution is straight forward but the postfix returns a copy of instance of X and hence its first copied, prefix increment is done and return the instance.

When overloading another operator << following code can be used.

struct X{
  friend std::iostream& operator<<(std::iostream & os, const X & x){
    os << "x.i: "<< x.i << "\n";
    return os;




Caching in Modern Microprocessors

The modern microprocessors have a speed of 3GHz vs the main memory (RAM) has a speed of 133 MHz. With the newest advancements in hardware, the memory bandwidth only increased between DDR to DDR2 and DDR3 but the latency decrease by a very little factor. Caches were introduced as a level of indirection in order to hide the latency gap.  Caches work on two very important principles of data access patterns:

  1. Spatial Locality: if a memory location is accessed, the nearby memory location is most probably accessed, example: arrays
  2. Temporal Locality: if memory location is accessed, there is probability that the same location will be accessed again, example: loops

Modern x86 processors have the hierarchy of an L1, L2 and a distributed LLC accessed via the on-die scalable ring architecture. L1 and L2 cache are private to the core and the LLC is shared within all the cores.

  • The Access Timing in terms of CPU cycles
    • L1 – 4 cycles
    • L2 – 12 cycles
    • LLC – 26 to 31 cycles

  • The cache size
    • L1 32 KB data cache (L1I) + 32 KB instruction cache (L1D)
    • L2 – 256 KB

    • LLC – 8 MB to 56 MB.
  • Cache line
    • It is unit of transport between memory.
    • on x86 the cache line size is 64 bytes
  • Data Prefetchers
    • spatial prefetcher attempts to complete every cache line fetched to the L2 cache with another cache line to fill a 128-byte aligned chunk.

    • streamer prefetcher monitors read requests from the L1D cache and fetches the appropriate data and instructions. Server vendors might use their own designation for L1 and L2 prefetchers

  • Cache Snooping protocols:







    The cache line is updated relative to memory

    Single core



    The cache line is consistent with memory

    Single core



    The cache line is shared with other cores. (The cache line is consistent with other cores, but may not be consistent with memory)

    Multiple cores



    The cache line is not present in this core L1 or L2

    Multiple cores



Types of Caches

  1. Directly mapped caches:
    • an address can reside only at a particular address in cache.
    • Screen Shot 2018-03-18 at 6.23.01 PM
    • Direct mapping is simple and inexpensive to implement, but if a program accesses 2 blocks that map to the same line repeatedly, the cache begins to thrash back and forth reloading the line over and over again meaning misses are very high.
    • Processor registers use logic as direct mapped.
  2. Fully associative caches
    • any block can go into any line of the cache.
  3. Set-associative caches
    • set associative addresses the problem of possible thrashing in the direct mapping method. It does this by saying that instead of having exactly one line that a block can map to in the cache, we will group a few lines together creating a set. Then a block in memory can map to any one of the lines of a specific set. There is still only one set that the block can map to.
    • Screen Shot 2018-03-18 at 6.25.44 PM

Cache Eviction Policy

There are many algorithms for cache eviction in case of conflict example:

  • FIFO – First in first out
    • keep track of insertion time and evicts block that is oldest.
  • LRU – least recently used
    • keep track of references to the data in cache and data with lowest references will be evicted
  • Random

Cache Flush Policy

When the data is updated in cache, there are following policies for updating main memory with updates to cache. Since the main memory is slow, the update is deferred as much as possible.

  • Write-through:
    • Cache pushes all the changes to the main memory immediately
    • advantages are the implementation is simple and less error prone.
    • disadvantages are requirement of bandwidth because of subsequence writes and also write backs take latency because of slowness of main memory.
    • write-no-allocate:
      • the block is modified in the main memory and not loaded into the cache.
    • write-allocate:
      •  the block is loaded on a write miss, followed by the write-hit action.
  • Write back:
    • the update is deferred as much as possible.

Cache Misses

  • Cold or compulsory misses:
    • Cache misses because of first reference to the block in program
  • Capacity misses:
    • Cache misses because of capacity is not enough.
  • Conflict misses:
    • Cache misses because of conflicting cache location.
  • Coherent misses:
    • The coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread’s cache has been invalidated by a write from another thread.
  • Coverage misses:
    • he coverage miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the processor’s cache has been invalidated as a consequence of a directory eviction.

Cache Hierarchy

Image credit: http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms?

Cache hierarchy latency vs capacity.

Translation Lookaside Buffer – TLB

  • The TLB is an other form of cache that stores the virtual to physical mappings of addresses.
  • TLB is fully associative.
  • There are generally 2 levels of TLB and less number of entires (128).
  • When a page fault occurs, the hardware starts address translation. Since the address translation in x86 is 5 level, the hardware start looking into the TLB if the address has been translated recently and if found, the translation returns immediately.
  • If the address is not found in the TLB, the hardware walks through the page tables entries in the main memory in order to find the physical address for that page.




Understanding glibc malloc


I always got fascinated by heap memory. Questions such as

How heap memory is obtained from kernel?
How efficiently memory is managed?
Is it managed by kernel or by library or by application itself?
Can heap memory be exploited?

were in my mind for quite some time. But only recently I got time to understand about it. So here I would like to share my fascination turned knowledge!! Out there in the wild, many memory allocators are available:

  • dlmalloc – General purpose allocator
  • ptmalloc2 – glibc
  • jemalloc – FreeBSD and Firefox
  • tcmalloc – Google
  • libumem – Solaris

Every memory allocator claims they are fast, scalable and memory efficient!! But not all allocators can be suited well for our application. Memory hungry application’s performance largely depends on memory allocator performance. In this post, I will only talk about ‘glibc malloc’ memory allocator. In future, I am hoping to cover up other memory allocators. Throughout…

View original post 3,632 more words