Detecting and fixing Memory Issues


There are two main tools I like to use for anything memory-related in my C++ code. The following is an example I wrote as my own implementation of a shared pointer. I thought the code worked as expected, but I was not sure how much memory it was leaking, if any.

#include <iostream>

template <typename T>
class MySharedPtr{
  public:
    MySharedPtr(T val){
      resource_ptr = new T(val);
      c_ptr = new int(1);  // reference count starts at 1
    }

    MySharedPtr(const MySharedPtr<T> & ptr){
      std::cout << "copy constructor \n";
      resource_ptr = ptr.resource_ptr;
      (*ptr.c_ptr)++;
      this->c_ptr = ptr.c_ptr;
    }

    MySharedPtr<T>& operator=(const MySharedPtr<T> & ptr){
      std::cout << "copy assignment operator\n";
      if(this != &ptr){
        (*ptr.c_ptr)++;
        this->c_ptr = ptr.c_ptr;
        this->resource_ptr = ptr.resource_ptr;
      }
      return *this;
    }

    // move operations are deleted...
    MySharedPtr(MySharedPtr && ptr) = delete;
    MySharedPtr & operator=(MySharedPtr && ptr) = delete;

    T & operator*(){
      return *resource_ptr;
    }

    T* operator->(){
      return resource_ptr;
    }

    virtual ~MySharedPtr(){
      std::cout << *c_ptr << '\n';
      if(*c_ptr > 0){
        (*c_ptr)--;
      }else{
        delete c_ptr;
        delete resource_ptr;
      }
    }

    int use_count(){
      return *c_ptr;
    }

  private:
    int * c_ptr;
    T * resource_ptr;
};

int main(){
  int x;
  MySharedPtr<decltype(x)> sp(1);
  {
    MySharedPtr<decltype(x)> sp1(2);
    sp1 = sp;
    std::cout << "use count is " << sp.use_count() << '\n';
  }

  sp = sp;
  std::cout << "use count is " << sp.use_count() << '\n';
  return 0;
}

Google AddressSanitizer (ASan)

The command:

/usr/local/Cellar/gcc/8.2.0/bin/g++-8 -std=c++17 -g -O2 -fsanitize=address -fno-omit-frame-pointer shared_ptr.cpp -o shared_ptr && ASAN_OPTIONS=detect_leaks=1 ./shared_ptr

The output:

copy assignment operator
use count is 2
2
copy assignment operator
use count is 1
1

=================================================================
==25533==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b50c in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:9:19
    #2 0x52b112 in main /home/aarna/devel/practice/smart_ptrs.cpp:62:28
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b47b in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:8:26
    #2 0x52b134 in main /home/aarna/devel/practice/smart_ptrs.cpp:64:32
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b50c in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:9:19
    #2 0x52b134 in main /home/aarna/devel/practice/smart_ptrs.cpp:64:32
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b47b in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:8:26
    #2 0x52b112 in main /home/aarna/devel/practice/smart_ptrs.cpp:62:28
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

SUMMARY: AddressSanitizer: 16 byte(s) leaked in 4 allocation(s).

The explanation: AddressSanitizer pinpoints the exact allocation sites, with file names and line numbers, which makes it very useful for coming up with a fix.

Valgrind

Command:

valgrind --tool=memcheck  --leak-check=full --show-leak-kinds=all --track-origins=yes  ./smart_ptrs

The output:

[aarna@localhost practice]$ valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all --track-origins=yes ./smart_ptrs
==25564== Memcheck, a memory error detector
==25564== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==25564== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==25564== Command: ./smart_ptrs
==25564==
==25564==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==25564==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==25564==Process memory map follows:
0x000000400000-0x00000055d000 /home/aarna/devel/practice/smart_ptrs
0x00000075c000-0x00000075d000 /home/aarna/devel/practice/smart_ptrs
0x00000075d000-0x000000760000 /home/aarna/devel/practice/smart_ptrs
0x000000760000-0x000001447000
0x000004000000-0x000004022000 /usr/lib64/ld-2.17.so
0x000004022000-0x000004036000
0x00000403f000-0x000004055000
0x000004221000-0x000004222000 /usr/lib64/ld-2.17.so
0x000004222000-0x000004223000 /usr/lib64/ld-2.17.so
0x000004223000-0x000004224000
0x000004224000-0x000004225000
0x000004a24000-0x000004a25000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004a25000-0x000004c24000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004c24000-0x000004c25000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004c25000-0x000004c26000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004c26000-0x000004c34000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004c34000-0x000004e34000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004e34000-0x000004e35000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004e35000-0x000004e36000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004e36000-0x000004f1f000 /usr/lib64/libstdc++.so.6.0.19
0x000004f1f000-0x00000511e000 /usr/lib64/libstdc++.so.6.0.19
0x00000511e000-0x000005126000 /usr/lib64/libstdc++.so.6.0.19
0x000005126000-0x000005128000 /usr/lib64/libstdc++.so.6.0.19
0x000005128000-0x00000513d000
0x00000513d000-0x00000523e000 /usr/lib64/libm-2.17.so
0x00000523e000-0x00000543d000 /usr/lib64/libm-2.17.so
0x00000543d000-0x00000543e000 /usr/lib64/libm-2.17.so
0x00000543e000-0x00000543f000 /usr/lib64/libm-2.17.so
0x00000543f000-0x000005456000 /usr/lib64/libpthread-2.17.so
0x000005456000-0x000005655000 /usr/lib64/libpthread-2.17.so
0x000005655000-0x000005656000 /usr/lib64/libpthread-2.17.so
0x000005656000-0x000005657000 /usr/lib64/libpthread-2.17.so
0x000005657000-0x00000565b000
0x00000565b000-0x000005662000 /usr/lib64/librt-2.17.so
0x000005662000-0x000005861000 /usr/lib64/librt-2.17.so
0x000005861000-0x000005862000 /usr/lib64/librt-2.17.so
0x000005862000-0x000005863000 /usr/lib64/librt-2.17.so
0x000005863000-0x000005865000 /usr/lib64/libdl-2.17.so
0x000005865000-0x000005a65000 /usr/lib64/libdl-2.17.so
0x000005a65000-0x000005a66000 /usr/lib64/libdl-2.17.so
0x000005a66000-0x000005a67000 /usr/lib64/libdl-2.17.so
0x000005a67000-0x000005a7c000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005a7c000-0x000005c7b000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005c7b000-0x000005c7c000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005c7c000-0x000005c7d000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005c7d000-0x000005e40000 /usr/lib64/libc-2.17.so
0x000005e40000-0x00000603f000 /usr/lib64/libc-2.17.so
0x00000603f000-0x000006043000 /usr/lib64/libc-2.17.so
0x000006043000-0x000006045000 /usr/lib64/libc-2.17.so
0x000006045000-0x00000639c000
0x00000639c000-0x00000679c000
0x000058000000-0x00005823b000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/memcheck-amd64-linux
0x00005843b000-0x00005843e000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/memcheck-amd64-linux
0x00005843e000-0x000059e40000
0x001002001000-0x001008b0a000
0x001008b8c000-0x001008bac000
0x001008bac000-0x001008bae000
0x001008bae000-0x001008cae000
0x001008cae000-0x001008cb0000
0x001008cb0000-0x001008cb1000 /tmp/vgdb-pipe-shared-mem-vgdb-25564-by-aarna-on-localhost.localdomain
0x001008cb1000-0x00100b0fd000
0x00100b1fd000-0x00100b3fd000
0x00100b4fd000-0x00100b5fd000
0x00100b7f2000-0x00100ba1b000
0x00100bbdb000-0x00100bcdb000
0x001ffeffd000-0x001fff001000
0x7ffd1941d000-0x7ffd1943e000 [stack]
0xffffffffff600000-0xffffffffff601000 [vsyscall]
==25564==End of process memory map.
==25564==
==25564== HEAP SUMMARY:
==25564== in use at exit: 32 bytes in 1 blocks
==25564== total heap usage: 1 allocs, 0 frees, 32 bytes allocated
==25564==
==25564== 32 bytes in 1 blocks are still reachable in loss record 1 of 1
==25564== at 0x4C2B955: calloc (vg_replace_malloc.c:711)
==25564== by 0x586454F: _dlerror_run (dlerror.c:141)
==25564== by 0x5864057: dlsym (dlsym.c:70)
==25564== by 0x50233B: __interception::GetRealFunctionAddress(char const*, unsigned long*, unsigned long, unsigned long) (in /home/aarna/devel/practice/smart_ptrs)
==25564== by 0x4DD462: __asan::InitializeAsanInterceptors() (in /home/aarna/devel/practice/smart_ptrs)
==25564== by 0x41A49F: __asan::AsanInitInternal() [clone .part.1] (in /home/aarna/devel/practice/smart_ptrs)
==25564== by 0x400FCA2: _dl_init (dl-init.c:116)
==25564== by 0x4001029: ??? (in /usr/lib64/ld-2.17.so)
==25564==
==25564== LEAK SUMMARY:
==25564== definitely lost: 0 bytes in 0 blocks
==25564== indirectly lost: 0 bytes in 0 blocks
==25564== possibly lost: 0 bytes in 0 blocks
==25564== still reachable: 32 bytes in 1 blocks
==25564== suppressed: 0 bytes in 0 blocks
==25564==
==25564== For counts of detected and suppressed errors, rerun with: -v
==25564== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The explanation:

Valgrind's output shows the entire picture of the process's virtual memory and how the binary and libraries are laid out in it. Note, though, that this run was made against the ASan-instrumented binary: ASan's shadow memory conflicts with Valgrind, which is why the run aborts right after printing the memory map, and the 32 still-reachable bytes come from ASan's own initialization (the dlsym/calloc frames in the stack trace above), not from my code. For a clean Valgrind run, the binary should be rebuilt without -fsanitize=address.

The fix:

The output from AddressSanitizer was useful at the micro level. It pointed at the precise line numbers, and I was able to apply my first fix to the memory leak, in the destructor, as follows:

virtual ~MySharedPtr(){
  if(*c_ptr > 0){
    (*c_ptr)--;
  }
  
  if(*c_ptr == 0){
    if(resource_ptr != nullptr){
      delete resource_ptr;
      resource_ptr = nullptr;
    }
    delete c_ptr;
    c_ptr = nullptr;
  }
}

I still had the following leaks:

=================================================================
==9130==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x1026a53d0 in wrap__Znwm (libasan.5.dylib:x86_64+0x6e3d0)
#1 0x102324526 in MySharedPtr<int>::MySharedPtr(int) shared_ptr.cpp:9
#2 0x10232428a in main shared_ptr.cpp:68
#3 0x7fff7661b014 in start (libdyld.dylib:x86_64+0x1014)

Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x1026a53d0 in wrap__Znwm (libasan.5.dylib:x86_64+0x6e3d0)
#1 0x1023244a4 in MySharedPtr<int>::MySharedPtr(int) shared_ptr.cpp:8
#2 0x10232428a in main shared_ptr.cpp:68
#3 0x7fff7661b014 in start (libdyld.dylib:x86_64+0x1014)

SUMMARY: AddressSanitizer: 8 byte(s) leaked in 2 allocation(s).

Going through the code, I fixed my copy assignment operator by deleting the old resource pointer and counter pointer before taking over the new ones:

MySharedPtr<T>& operator=(const MySharedPtr<T> & ptr){
  std::cout << "copy assignment operator\n";
  if(this != &ptr){
    delete resource_ptr;  // deleting fix 1
    delete c_ptr;         // deleting fix 2
    (*ptr.c_ptr)++;
    this->c_ptr = ptr.c_ptr;
    this->resource_ptr = ptr.resource_ptr;
   }
   return *this;
}

Note that deleting unconditionally works here only because each MySharedPtr owns its resource exclusively before the assignment; a complete implementation would decrement the old reference count and delete only when it drops to zero. With both fixes in, my code generated error-free output:

copy assignment operator
use count is 2
copy assignment operator
use count is 1

Memory Barriers

Why Memory Barriers?

Because the CPU and/or the compiler can reorder instructions from their program order. Modern processors and compilers reorder instructions all the time for optimization purposes, but the observed effects (of loads and stores on memory locations) are kept consistent for a single thread of execution; it is other threads that may observe them out of order.

Sequential Consistency

is defined as: "the result of any execution is the same as if the read and write operations by all processes were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program" [Lamport, 1979].
which means:

  • The instructions executed by the same CPU appear in the order in which they were written.
  • All threads observe the effects of loads/stores to any shared variable in one single order.

Sequential consistency matters especially in multi-threaded programs: when one thread changes a shared variable, the other threads should see the variable in a consistent and valid state.
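To make this concrete, here is a small store-buffering litmus test (my own sketch, not from the original text). Under std::memory_order_seq_cst the outcome r1 == 0 && r2 == 0 is impossible; weaker orderings would permit it:

#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

int main(){
  std::thread a([](){ x.store(1, std::memory_order_seq_cst);
                      r1 = y.load(std::memory_order_seq_cst); });
  std::thread b([](){ y.store(1, std::memory_order_seq_cst);
                      r2 = x.load(std::memory_order_seq_cst); });
  a.join();
  b.join();
  // Sequential consistency forbids r1 == 0 && r2 == 0.
  std::cout << "r1=" << r1 << " r2=" << r2 << '\n';
  return 0;
}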

Total Store Ordering

Modern processors have a buffer, called the store buffer, where they hold pending updates to memory. Updates do not go to main memory directly because writes to main memory are slow, and a value sitting in the store buffer can be reused thanks to locality.

Architectures with strong store-ordering guarantees: x86, SPARC
Architectures with weak store-ordering guarantees: ARM, POWER, Alpha, etc.

Types of Barriers:

  • Compiler Barriers
    • These prevent the compiler from reordering instructions.
    • In modern C++ (C++11), std::atomic_signal_fence() acts as a compiler-only barrier, while std::atomic_thread_fence() additionally emits a hardware fence where the target needs one.
    • A fence takes a memory-order argument: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel or memory_order_seq_cst.
  • Mandatory Barriers (Hardware)
    • Example x86 instructions are mfence, lfence and sfence.
    • General barriers
    • Read/Write barriers
  • SMP (Symmetric Multiprocessing) Conditional Barriers
    • Used on multiprocessor systems; mainly in the Linux kernel.
  • Acquire/Release Barriers
  • Data Dependency Barriers
  • Device (I/O) Barriers
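As an illustration, here is a minimal sketch (my own; the names writer, reader and ready are illustrative) that pairs a release fence with an acquire fence around a relaxed flag, which is the pattern these barriers exist to support:

#include <atomic>
#include <thread>
#include <cassert>

int data = 0;
std::atomic<bool> ready{false};

void writer(){
  data = 42;                                            // plain write
  std::atomic_thread_fence(std::memory_order_release);  // nothing above may move below
  ready.store(true, std::memory_order_relaxed);
}

void reader(){
  while(!ready.load(std::memory_order_relaxed)){}       // spin until published
  std::atomic_thread_fence(std::memory_order_acquire);  // nothing below may move above
  assert(data == 42);                                   // guaranteed to observe 42
}

int main(){
  std::thread t1(writer), t2(reader);
  t1.join();
  t2.join();
  return 0;
}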


Atomics & Lock-Free Data Structures in C++

  • Modern microprocessor pipelines are deep (on the order of 14 stages), and instructions are reordered all the time for optimization purposes.
  • Linux 2.6 onwards supports full preemption, i.e. a thread can be suspended at almost any point.
  • Atomic types are accessed exclusively: no other thread or process can observe an atomic operation half-done.
  • Barriers are used for ordering the accesses to memory locations.
  • Atomic operations are provided at the hardware level in order to make the operations indivisible.
  • The implementation is highly dependent upon the hardware; x86 has among the strictest rules around memory ordering.
  • Atomic operations combined with memory fences prevent reordering of instructions around them.
  • Atomic operations are expensive because the compiler and hardware cannot apply all their usual optimizations around them.

The <atomic> header provides various atomic types. The following is a non-exhaustive list:

atomic_bool std::atomic<bool>
atomic_char std::atomic<char>
atomic_schar std::atomic<signed char>
atomic_uchar std::atomic<unsigned char>
atomic_int std::atomic<int>
atomic_uint std::atomic<unsigned>
atomic_short std::atomic<short>
atomic_ushort std::atomic<unsigned short>
atomic_long std::atomic<long>
atomic_ulong std::atomic<unsigned long>
atomic_llong std::atomic<long long>
atomic_ullong std::atomic<unsigned long long>
atomic_char16_t std::atomic<char16_t>
atomic_char32_t std::atomic<char32_t>
atomic_wchar_t std::atomic<wchar_t>

Operations on Atomic types

These operations take an argument specifying the memory order, which can be one of std::memory_order_relaxed, std::memory_order_consume, std::memory_order_acquire, std::memory_order_release, std::memory_order_acq_rel or std::memory_order_seq_cst.

  • load: a read operation on an atomic type.
  • store: a write operation on an atomic type.
  • exchange: a read-modify-write operation that replaces the stored value and returns the previous one.
  • compare_exchange_weak / compare_exchange_strong: called as compare_exchange_*(expected, desired, <optional memory order>); returns true if the exchange succeeded, false otherwise.
    • compare_exchange_weak: may fail spuriously on architectures where the read-modify-write is not guaranteed atomic in a single instruction (e.g. LL/SC machines), so it is normally called in a loop. On x86 it has the same effect as compare_exchange_strong.
    • compare_exchange_strong: guaranteed to fail only if the stored value did not match expected.
  • fetch_ versions of add, sub, and, or, xor.
  • overloaded operators such as +=, -=, |= and ++.
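To make the compare-exchange loop concrete, here is a minimal sketch (the names counter, kMax and try_increment are mine, purely illustrative) of a lock-free capped increment:

#include <atomic>

std::atomic<int> counter{0};
constexpr int kMax = 100;

// Atomically increment counter, but never past kMax.
bool try_increment(){
  int current = counter.load(std::memory_order_relaxed);
  while(current < kMax){
    // On failure, compare_exchange_weak reloads `current` with the value
    // actually observed, so the loop retries with fresh data.
    if(counter.compare_exchange_weak(current, current + 1,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed)){
      return true;  // we won the race and performed the increment
    }
  }
  return false;     // counter already at the cap
}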

Lock-Based Implementation of a Multi-Producer, Multi-Consumer Queue


It is important to understand lock-based data structures before implementing lock-free ones.

/*
Code for a lock-based queue.
std::queue is wrapped in a struct Queue.
The internal_queue member holds the elements.
The work to do is an instance of struct Work.
Queue has pop and push functions.
push takes a Work instance by rvalue reference and
pushes it onto the internal queue.
pop returns a shared pointer to Work, so the caller
safely owns the element once it has been removed
from the queue.
pop is a non-blocking function: if the queue is empty,
the thread just yields.
A sleep is added to the consumer in order to simulate
slow work.

Disadvantage of this kind of queue:
 * The whole queue is locked during any operation in
order to guarantee synchronized access.
*/

#include <queue> // std::queue
#include <mutex> // for locking
#include <thread>
#include <memory> // smart pointers
#include <utility> 
#include <functional> // passing and storing lambda 
#include <iostream> // for std::cout
#include <chrono>

static long work_number = 0;

struct Work{
 // take lambda as a work and call lambda to process.
  Work(std::function<void(void)> lambda):f(lambda){};
  void operator()(){
    f(); 
   }
private:
  std::function<void(void)> f; 
};

struct Queue{
  Queue() = default;
  // Queue is not copyable or copy constructible.
  Queue(const Queue &) = delete;
  Queue & operator=(const Queue &) = delete;
  std::shared_ptr<Work> pop();
  void push(Work &&);
private:
  std::mutex mtx;
  std::queue<Work> internal_queue;
};

void Queue::push(Work && work){
  std::lock_guard<std::mutex> lg(mtx);
  internal_queue.push(std::move(work));
}

std::shared_ptr<Work> Queue::pop(){
  Work w([](){}); // no-op work, returned when the queue is empty
  bool was_empty = false;
  {
    std::lock_guard<std::mutex> lg(mtx);
    if(!internal_queue.empty()){
      w = std::move(internal_queue.front());
      internal_queue.pop();
    }else{
      was_empty = true;
    }
  }
  if(was_empty){
    std::this_thread::yield(); // let the other threads work (lock already released)
  }
  return std::make_shared<Work>(std::move(w));
}

struct Producer{
 Producer(Queue & q):q(q){};
 void produce(){
   for(;;){
     Work w([&](){std::cout << "work number : " << ++work_number << " is called.." << "\n";});
     q.push(std::move(w));
   }
 }
private:
 Queue & q;
};

struct Consumer{
  Consumer(Queue & q):q(q){};
   void consume(){
   for(;;){
     std::shared_ptr<Work> w_ptr = q.pop();
     Work * w = w_ptr.get();
     (*w)();
     std::this_thread::sleep_for(std::chrono::milliseconds(100));
   }
 }
private:
 Queue & q;
};

int main(){
  Queue q;
  std::thread producer_thread1([&q](){
    Producer p(q);
    p.produce();
  });

  std::thread producer_thread2([&q](){
    Producer p(q);
    p.produce();
  });

  std::thread consumer_thread1([&q](){
    Consumer c(q);
    c.consume();
  });

  std::thread consumer_thread2([&q](){
    Consumer c(q);
    c.consume();
  });

  std::thread consumer_thread3([&q](){
    Consumer c(q);
    c.consume();
  });

  producer_thread1.join();
  producer_thread2.join();
  consumer_thread1.join();
  consumer_thread2.join();
  consumer_thread3.join();
  return 0;
}

Lock Free Data Structures

In the above implementation, the data structure is lock-based, and hence no two threads can access it concurrently.
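To give a flavor of what replaces the lock, below is a minimal sketch of a Treiber-style lock-free stack built on compare_exchange_weak (my own illustration, not from the original post). It is deliberately simplified: as written, pop() is exposed to exactly the ABA and memory-reclamation problems that real lock-free designs solve with hazard pointers or reference counting.

#include <atomic>
#include <utility>

template <typename T>
class LockFreeStack{
  struct Node{ T value; Node * next; };
  std::atomic<Node*> head{nullptr};
public:
  void push(T v){
    Node * n = new Node{std::move(v), head.load(std::memory_order_relaxed)};
    // Retry until head is swung from n->next to n.
    while(!head.compare_exchange_weak(n->next, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed)){}
  }
  bool pop(T & out){
    Node * n = head.load(std::memory_order_acquire);
    while(n && !head.compare_exchange_weak(n, n->next,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)){}
    if(!n) return false;
    out = std::move(n->value);
    delete n;  // unsafe if another thread still dereferences n (see caveat above)
    return true;
  }
};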

Understanding glibc malloc

Reblogged from sploitF-U-N:

I always got fascinated by heap memory. Questions such as

How heap memory is obtained from kernel?
How efficiently memory is managed?
Is it managed by kernel or by library or by application itself?
Can heap memory be exploited?

were in my mind for quite some time. But only recently I got time to understand about it. So here I would like to share my fascination turned knowledge!! Out there in the wild, many memory allocators are available:

  • dlmalloc – General purpose allocator
  • ptmalloc2 – glibc
  • jemalloc – FreeBSD and Firefox
  • tcmalloc – Google
  • libumem – Solaris

Every memory allocator claims they are fast, scalable and memory efficient!! But not all allocators can be suited well for our application. Memory hungry application’s performance largely depends on memory allocator performance. In this post, I will only talk about ‘glibc malloc’ memory allocator. In future, I am hoping to cover up other memory allocators. Throughout…


Process Synchronization and IPC

Why Synchronization

Linux has supported Symmetric Multiprocessing (SMP) on modern multicore microprocessors since version 2.0. But with it, a new dimension of problems arose: race conditions, deadlocks/livelocks, etc. In order to mitigate these issues, various synchronization mechanisms were introduced in Linux. Version 2.6 added full preemption, which means the scheduler can preempt kernel code at virtually any point and reschedule another task.

  • critical region (also called critical section): code that accesses and manipulates shared data

Reasons for Synchronization:

  • Interrupts. An interrupt can occur asynchronously at almost any time, interrupting the currently executing code.
  • Softirqs and tasklets. The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the currently executing code.
  • Kernel preemption. Because the kernel is preemptive, one task in the kernel can preempt another.
  • Sleeping and synchronization with user-space. A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process.
  • Symmetrical multiprocessing. Two or more processors can execute kernel code at exactly the same time.

Ways to Synchronization

  • Shared Memory
  • Sockets
  • Atomics and lock-free programming
  • Pipes/FIFOs
  • Mutexes
    • recursive
    • reader/writer
  • Spinlocks
  • Semaphores
    • Binary and counting
  • Condition variables
  • Message Queues

Semaphores

Since all threads run in the same address space, they all have access to the same data and variables. If two threads simultaneously attempt to update a global counter variable, it is possible for their operations to interleave in such a way that the global state is not correctly modified. Although such a case may arise only one time out of thousands, a concurrent program needs to coordinate the activities of multiple threads using something more reliable than just depending on the fact that such interference is rare.

A semaphore is somewhat like an integer variable, but is special in that its operations (increment and decrement) are guaranteed to be atomic—you cannot be halfway through incrementing the semaphore and be interrupted and waylaid by another thread trying to do the same thing. That means you can increment and decrement the semaphore from multiple threads without interference. By convention, when a semaphore is zero it is “locked” or “in use”.

Semaphores vs. mutexes (from wikipedia)
A mutex is essentially the same thing as a binary semaphore and sometimes uses the same basic implementation. The differences between them are in how they are used. While a binary semaphore may be used as a mutex, a mutex is a more specific use-case, in that only the thread that locked the mutex is supposed to unlock it. This constraint makes it possible to implement some additional features in mutexes:

  • Since only the thread that locked the mutex is supposed to unlock it, a mutex may store the ID of the thread that locked it and verify that the same thread unlocks it.
  • Mutexes may provide priority inversion safety. If the mutex knows who locked it and is supposed to unlock it, it is possible to promote the priority of that thread whenever a higher-priority task starts waiting on the mutex.
  • Mutexes may also provide deletion safety, where the thread holding the mutex cannot be accidentally deleted.
  • Alternately, if the thread holding the mutex is deleted (perhaps due to an unrecoverable error), the mutex can be automatically released.
  • A mutex may be recursive: a thread is allowed to lock it multiple times without causing a deadlock.

A semaphore is one of the thread synchronization mechanisms. Unlike a mutex, a semaphore allows more than one thread to access the shared resource at the same time. There are two types of semaphores:

  • binary semaphore
  • counting semaphore.
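As a sketch of these semantics (my own illustration in C++11, not a specific library API), a counting semaphore can be built from a mutex and a condition variable:

#include <mutex>
#include <condition_variable>

// wait() decrements the count, blocking while it is zero;
// post() increments it and wakes one waiter.
class Semaphore{
public:
  explicit Semaphore(int initial):count(initial){}

  void wait(){
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [this](){ return count > 0; });  // sleep until a unit is free
    --count;
  }

  void post(){
    {
      std::lock_guard<std::mutex> lg(mtx);
      ++count;
    }
    cv.notify_one();
  }

private:
  std::mutex mtx;
  std::condition_variable cv;
  int count;
};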

Pipes and FIFOs

Pipes are used for communicating data between processes: the output of one process is piped as the input to another.

cat doctum.txt | wc

(Image: the output of the cat command is piped into word count. Credit: https://www.vivaolinux.com.br/dica/Pipes-no-Linux)


FIFOs are named pipes. FIFOs can be used as follows:

int mkfifo(const char *path, mode_t mode);
int mkfifoat(int fd, const char *path, mode_t mode);

Once a FIFO is made, it can be opened using open, and the normal file I/O functions (e.g., close, read, write, unlink) all work with FIFOs.
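Here is a minimal sketch of the writing side (the path /tmp/demo_fifo and the message are mine, for illustration; a reader such as cat /tmp/demo_fifo must open the other end, since opening a FIFO for writing blocks until a reader appears):

#include <fcntl.h>     // open, O_WRONLY
#include <sys/stat.h>  // mkfifo
#include <unistd.h>    // write, close, unlink
#include <cstring>     // strlen

int main(){
  const char * path = "/tmp/demo_fifo";
  mkfifo(path, 0666);                  // create the named pipe (EEXIST is fine)
  int fd = open(path, O_WRONLY);       // blocks until a reader opens the FIFO
  const char * msg = "hello through a FIFO\n";
  write(fd, msg, strlen(msg));
  close(fd);
  unlink(path);                        // remove the FIFO from the filesystem
  return 0;
}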

Linux Processes

A process is a program in execution. Since Linux represents everything in terms of files, currently running processes are also represented via the /proc file system. /proc/<pid> holds information about the process: open files (sockets), pending signals, process state, kernel data structures, threads of execution and the data section.

Processes provide two types of virtualization:

  • Virtual Processor
  • Virtual Memory

Processes are created on Linux using the fork() system call. When fork() is called, a new process is created as a child of the process that called fork(). It is called a child because it gets a copy of its parent's resources (data, variables, code, pages, sockets, etc.).

The list of processes is stored in a doubly linked list called the task list. The process descriptor has type struct task_struct, defined in <linux/sched.h>.

Processes are identified by a PID, of type pid_t, which is an integer; by default it goes up to 32,768. Low-numbered PIDs are typically kernel threads and long-running system daemons. PIDs are handed out sequentially; when the counter reaches the maximum, allocation wraps around and restarts above the reserved range, reusing PIDs that have been freed. The limit on the number of PIDs can be raised by writing to /proc/sys/kernel/pid_max.

Process creation

  • fork()
    • a process can create another process using fork() function.
    • It is implemented using clone()
    • fork(), vfork() and __clone() calls clone() which calls do_fork() which calls copy_process()
    • the process created is called as child process
    • Example of simple fork call:
      pid_t pid = fork();
    • if the return value of fork() is 0, the fork succeeded and control is in the child process.
       if (pid == 0)
          {
            printf("child process");
          }
    • if the return value of fork() is greater than 0, the fork succeeded and control is in the parent process (the value is the child's PID).
      else if (pid > 0)
          {
            printf("parent process");
          }
    • if the return value of fork() is negative, the call failed.
      else
          {
            printf("fork failed");
          }
    • fork() creates a child process that is a copy of the current task. It differs from the parent only in its PID, its PPID (parent’s PID), and certain resources and statistics (e.g. pending signals) which are not inherited
    • Besides the open files, other properties of the parent are inherited by the child:
      • Real user ID, real group ID, effective user ID, and effective group ID
      • Supplementary group IDs
      • Process group ID
      • Session ID
      • Controlling terminal
      • The set-user-ID and set-group-ID flags
      • Current working directory
      • Root directory
      • File mode creation mask
      • Signal mask and dispositions
      • The close-on-exec flag for any open file descriptors
      • Environment
      • Attached shared memory segments
      • Memory mappings
      • Resource limits
  • exec():
    • The exec() family of function calls creates a new address space and loads a new program into it; this is typically done in the newborn child immediately after a fork.
    • There are following variations of exec calls
      • int execl(const char *path, const char *arg, ...
         /* (char *) NULL */);
      • int execlp(const char *file, const char *arg, ...
         /* (char *) NULL */);
      • int execle(const char *path, const char *arg, ...
         /*, (char *) NULL, char * const envp[] */);
      • int execv(const char *path, char *const argv[]);
      • int execvp(const char *file, char *const argv[]);
      • int execvpe(const char *file, char *const argv[],
         char *const envp[]);
Function   pathname  filename  fd  arg list  argv[]  environ  envp[]
execl         *                         *                *
execlp                  *               *                *
execle        *                         *                         *
execv         *                                   *      *
execvp                  *                         *      *
execve        *                                   *               *
fexecve                           *               *               *
(letter)                p         f     l         v               e
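Putting fork() and exec() together, here is a minimal sketch of the classic spawn-and-wait pattern (error handling abbreviated):

#include <sys/wait.h>  // waitpid, WIFEXITED
#include <unistd.h>    // fork, execlp
#include <cstdio>

int main(){
  pid_t pid = fork();
  if(pid == 0){
    // Child: replace this process image with /bin/ls.
    execlp("ls", "ls", "-l", (char *)nullptr);
    perror("execlp");  // reached only if exec failed
    _exit(127);
  }else if(pid > 0){
    // Parent: wait for the child and report its exit status.
    int status = 0;
    waitpid(pid, &status, 0);
    if(WIFEXITED(status))
      printf("child exited with %d\n", WEXITSTATUS(status));
  }else{
    perror("fork");    // fork failed
  }
  return 0;
}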

Process 0 and 1

Process 0 is the ancestor of all processes and is also known as the swapper process. It is built from scratch during boot. Process 0 initializes all the data structures needed by the kernel, enables interrupts, and creates process 1, the init process. init is invoked by the kernel at the end of the bootstrap procedure and is responsible for bringing up the system after the kernel has been bootstrapped. It usually reads the system-dependent initialization files (/etc/rc* files, or /etc/inittab and the files in /etc/init.d) and brings the system to a certain state. Process 1 never dies. It is a normal user process, not a system process within the kernel, but it runs with superuser privileges.

State of a Process

A process can be in any one of the following states:

  • TASK_RUNNING
    • The process is executing on a CPU or waiting in a run queue to be executed.
  • TASK_INTERRUPTIBLE
    • The process is sleeping/suspended; a signal will wake it up.
  • TASK_UNINTERRUPTIBLE
    • The process is sleeping, but a signal sent to it will not change its state.
  • TASK_STOPPED
    • The process has been stopped.
  • TASK_TRACED
    • The process is being debugged.
  • EXIT_DEAD
    • The child process is being removed because the parent issued wait4() or waitpid().
  • EXIT_ZOMBIE
    • The child process has terminated but the parent has not yet issued wait() or waitpid().

Copy on Write – an optimization on fork()

Not all of the data and resources that belong to the parent are copied to the child immediately. Copying is done in steps and as needed; this is an optimization that delays (and often avoids) copying the data. Another reason: after a call such as exec(), none of the parent's data is actually needed, so there should be no need to copy any pages from the parent at all.

Process Switch

During a process switch the registers are saved in a per-process kernel memory area and loaded back when the process resumes on the CPU. This used to be done using the far jmp assembly instruction on x86. From Linux 2.6 onwards, software-based context switching is used instead of hardware context switching; it is safer and faster.

Scheduling

Processes run on a time-slicing basis, giving each user the illusion of exclusive access to the CPU (and, via virtual memory, of unlimited memory). Which process runs when depends on the scheduling policy or algorithm. The scheduling algorithm must ensure fast response, good throughput and limited use of resources, and it must honor priorities.

Linux processes are preemptive: a higher-priority process runs first, interrupting a lower-priority process if one is running. There are the following scheduling classes of Linux processes:

  • SCHED_FIFO
  • SCHED_RR
  • SCHED_NORMAL

Scheduling factors for a process

  • static priority: between 100 (highest priority) and 139 (lowest priority)
  • nice value: -20 (highest priority) to 19 (lowest priority)
  • dynamic priority: between 100 (highest priority) and 139 (lowest priority)
  • real-time priority: 1 (lowest priority) to 99 (highest priority)

Kernel Preemption

In non-preemptive kernels, kernel code runs until completion. That is, the scheduler cannot reschedule a task while it is in the kernel: kernel code is scheduled cooperatively, not preemptively. Kernel code runs until it finishes (returns to user-space) or explicitly blocks.

The Linux kernel (since 2.6), unlike most other Unix variants and many other operating systems, is a fully preemptive kernel. It is possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.

Threads

Linux threads are lightweight processes, i.e. Linux makes no fundamental distinction between threads and processes. Threads share resources with their parent, such as the heap, file descriptors and sockets, but each thread gets its own stack.

  • each thread contains a TCB, a thread control block, which holds the following:
    • Thread identifier: a unique id (tid) assigned to every new thread
    • Stack pointer: points to the thread's stack in the process
    • Program counter: points to the current program instruction of the thread
    • State of the thread (running, ready, waiting, start, done)
    • Thread's register values
    • Pointer to the process control block (PCB) of the process that the thread lives in
  • threads are also created with the clone() system call, like other processes, except that flags ask for sharing (see the sketch below), e.g.:
    clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
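Here is a minimal sketch of creating such a thread-like task directly with clone() (illustrative only; the stack size and names are mine, and glibc's clone() wrapper is assumed; g++ on Linux defines _GNU_SOURCE by default, which exposes it):

#include <sched.h>     // clone, CLONE_* flags
#include <sys/wait.h>  // waitpid
#include <csignal>     // SIGCHLD
#include <cstdio>
#include <cstdlib>

// The child shares the parent's memory, filesystem info, file
// descriptors and signal handlers, just as a thread would.
static int child_fn(void * arg){
  printf("child sees shared value: %d\n", *(int *)arg);
  return 0;
}

int main(){
  const size_t stack_size = 1024 * 1024;
  char * stack = (char *)malloc(stack_size);  // the child needs its own stack
  int shared = 42;
  // The stack grows downwards on x86, so pass the top of the allocation.
  pid_t tid = clone(child_fn, stack + stack_size,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                    &shared);
  if(tid == -1){ perror("clone"); return 1; }
  waitpid(tid, nullptr, 0);                   // reap the child
  free(stack);
  return 0;
}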

Address Space of a Process

The address space of a process consists of the linear addresses the process is allowed to use. The linear address spaces differ per process. Linear addresses are managed in the form of memory regions. The following are examples of system calls that manage memory regions:

Syscall usage
brk()/sbrk() changes the heap size of the process
execve() loads a new program into the memory of the currently running process
_exit() exits the current process and frees the memory space it occupied
fork() creates a new (child) process, copying/referencing the parent's page tables
mmap(), mmap2() create a memory-mapped file, enlarging the address space of the process
mremap() expands or shrinks a memory region
munmap() destroys a previously mmapped memory area, shrinking the address space of the process
shmat() attaches a shared memory region to the process
shmdt() detaches a shared memory region from the process

List all processes

Since Linux uses the proc file system to represent processes, all processes can be listed with ls /proc/. Listing /proc/<PID>/ shows directories and files that represent the resources of that process, such as its open file descriptors under fd/.

Memory Descriptor

The address space associated with a process is stored in a memory descriptor of type struct mm_struct. All the descriptors are stored in a doubly linked list headed at mmlist.

Heap Management

malloc(size): C stdlib function that requests size bytes of dynamic memory. On success it returns a pointer to the first linear address; on failure it returns NULL.

calloc(n, size): C stdlib function that requests an array of n elements of size bytes each and sets the memory to 0. On success it returns a pointer to the first linear address; on failure it returns NULL.

realloc(ptr, size): C stdlib function that changes the size of a previously allocated memory region (possibly moving it).

free(ptr): frees the allocated memory.

brk(addr): sets the end of the heap (the program break) directly.

sbrk(incr): increases/decreases the heap by incr bytes.

brk() System call:

  • brk() is the only one of these that is a system call. All the other functions listed above are implemented on top of brk() and mmap().
  • it allocates and deallocates whole pages, since it acts on memory regions.
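A toy sketch that watches the program break move (glibc still exposes sbrk, though real programs should go through malloc):

#include <unistd.h>  // sbrk
#include <cstdio>

int main(){
  void * before = sbrk(0);  // current end of the heap
  sbrk(4096);               // grow the heap by one page
  void * after = sbrk(0);
  printf("program break moved from %p to %p\n", before, after);
  return 0;
}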

System calls a.k.a syscalls:

syscalls are implemented in the kernel and can be called from a process running in user space with the appropriate privileges. There are around 330 system calls in Linux. The Linux API is written in C.

Process Termination

Process termination can be normal (finishing the intended code execution) or abnormal. Abnormal termination happens for reasons such as:

  • The kernel, the process itself or any other process sends a certain signal.
  • Segfault: the process tries to access a memory location it does not have the privilege to access.
  • Exiting with a non-zero code to signal an abnormal condition.

On process exit Kernel releases the memory and other resources held by the process like file descriptors.

When a parent process dies, its children become orphan processes, and init becomes their new parent. Sometimes parents do not wait for their child processes after fork(); such child processes become zombie processes after termination. A process can wait (block) while its children are in the running state using wait or waitpid.

  • The wait function can block the caller until a child process terminates, whereas waitpid has an option that prevents it from blocking.
  • The waitpid function doesn’t wait for the child that terminates first; it has a number of options that control which process it waits for.

Linux Performance Optimization

3 main levels for tuning

  1. CPU/BIOS tuning:
  2. OS tuning:
  3. Network Tuning:

Configuration using tuned

  • tuned is available from RHEL 7 onwards
  • It makes changes through sysctl as well as under the /sys directory
  • Profiles are available for specific needs, such as network-latency, latency-performance, network-performance, throughput-performance, desktop and balanced

(Screenshots: the list of available tuned profiles, and examples of the profiles.)

Latency Performance Settings in Linux Tuned

Settings  Meaning
force_latency=1 processor state C1
governor=performance CPU is at higher performance state
energy_perf_bias=performance Higher performance state
min_perf_pct=100 comes from the intel_pstate driver: besides the interfaces provided by the cpufreq core for controlling frequency, the driver provides sysfs files for controlling P-state selection under /sys/devices/system/cpu/intel_pstate; setting 100 pins the minimum performance level at 100%.
kernel.sched_min_granularity_ns=10000000 Minimal preemption granularity for CPU-bound tasks
vm.dirty_ratio=10 The generator of dirty data starts writeback at this percentage
vm.dirty_background_ratio=3 Start background writeback (via writeback threads) at this percentage
vm.swappiness=10 The swappiness parameter controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. 0 tells the kernel to avoid swapping processes out of physical memory for as long as possible 100 tells the kernel to aggressively swap processes out of physical memory and move them to swap cache
kernel.sched_migration_cost_ns=5000000 The total time the scheduler will consider a migrated process “cache hot” and thus less likely to be re-migrated
net.core.busy_read=50 This parameter controls the number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option.
sysctl.net.core.busy_poll=50 This parameter controls the number of microseconds to wait for packets on the device queue for socket poll and selects
kernel.numa_balancing=0 disable NUMA balancing
net.ipv4.tcp_fastopen=3 Linux supports configuring both overall client and server support via /proc/sys/net/ipv4/tcp_fastopen (net.ipv4.tcp_fastopen via sysctl). The options are a bit mask, where the first bit enables or disables client support (default on), 2nd bit sets server support (default off), 3rd bit sets whether data in SYN packet is permitted without TFO cookie option. Therefore a value of 1 TFO can only be enabled on outgoing connections (client only), value 2 allows TFO only on listening sockets (server only), and value 3 enables TFO for both client and server.

More information about Tunables:

  • Dependent on NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/swappiness
      • /proc/sys/vm/min_free_kbytes
  • Independent of NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/vfs_cache_pressure
      • Writeback Parameters
        • /proc/sys/vm/dirty_background_ratio
        • /proc/sys/vm/dirty_ratio
      • Readahead parameters
        • /sys/block/<dev>/queue/read_ahead_kb

top utility

  • available on all the machines
  • the default tool for measurement; it surfaces a lot of information.

Parameters important for Optimization


Swappiness

  • Controls how aggressively the system reclaims anonymous memory versus page cache memory:
    • Anonymous memory – swapping and freeing
    • File pages – writing if dirty and freeing
    • System V shared memory – swapping and freeing
  • Default is 60
  • If it is decreased: more aggressive reclaiming of page cache memory
  • If it is Increased: more aggressive swapping of anonymous memory
  • Should be set to 0 as per the low latency optimization guide http://wiki.dreamrunner.org/public_html/Low_Latency_Programming/low-latency-programming.html

Memory reclaim watermarks: (figure) the pages_min, pages_low and pages_high thresholds at which the kernel starts and stops reclaiming pages.

zone_reclaim_mode

  • Controls NUMA specific memory allocation policy
  • When set and node memory is exhausted:
    • Reclaim memory from the local node rather than allocating from the next node
    • Slower initial allocation, higher NUMA hit ratio
  • When clear and node memory is exhausted:
    • Allocate from all nodes before reclaiming memory
    • Faster initial allocation, higher NUMA miss ratio
  • To see current setting: cat /proc/sys/vm/zone_reclaim_mode
  • Turn ON: echo 1 > /proc/sys/vm/zone_reclaim_mode
  • Turn OFF: echo 0 > /proc/sys/vm/zone_reclaim_mode
  • It is recommended that this setting be off for large in-memory database servers.

CPU Tuning

  • p-states
    • A P-state is an Advanced Configuration and Power Interface (ACPI)-defined processor performance state.
    • P0, P1, P2, ..., P11, etc. are values for P-states. For highest performance the P-state should be set to P0.
  • c-states
    • C-states allow the CPU package to shut down cores and parts of the CPU microarchitecture to save energy while balancing response times.

    • C0, C1, C3, C6, etc. are the values for C-states. Here is a chart with more information:

      Mode Name What it does CPUs
      C0 Operating State CPU fully turned on All CPUs
      C1 Halt Stops CPU main internal clocks via software; bus interface unit and APIC are kept running at full speed. 486DX4 and above
      C1E Enhanced Halt Stops CPU main internal clocks via software and reduces CPU voltage; bus interface unit and APIC are kept running at full speed. All socket LGA775 CPUs
      C1E Stops all CPU internal clocks. Turion 64, 65-nm Athlon X2 and Phenom CPUs
      C2 Stop Grant Stops CPU main internal clocks via hardware; bus interface unit and APIC are kept running at full speed. 486DX4 and above
      C2 Stop Clock Stops CPU internal and external clocks via hardware Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
      C2E Extended Stop Grant Stops CPU main internal clocks via hardware and reduces CPU voltage; bus interface unit and APIC are kept running at full speed. Core 2 Duo and above (Intel only)
      C3 Sleep Stops all CPU internal clocks Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
      C3 Deep Sleep Stops all CPU internal and external clocks Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
      C3 AltVID Stops all CPU internal clocks and reduces CPU voltage AMD Turion 64
      C4 Deeper Sleep Reduces CPU voltage Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
      C4E/C5 Enhanced Deeper Sleep Reduces CPU voltage even more and turns off the memory cache Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
      C6 Deep Power Down Reduces the CPU internal voltage to any value, including 0 V 45-nm mobile Core 2 Duo only
    • Latency for C-states:

      C-STATE   RESIDENCY   WAKE-UP LATENCY
      C0        ACTIVE      ACTIVE
      C1        1 μs        1 μs
      C3        106 μs      80 μs
      C6        345 μs      104 μs

How to Optimize

  1. Use the real production system for measurement.
  2. Note the state of the system before any change.
  3. Apply the optimizations (covered in this post).
  4. Compare the state of the system after applying the optimizations.
  5. Tune until you get the desired results.
  6. Document the changes.

Types of Tools needed

  • Tools that gather system or hardware information
    • lscpu
    • uname
    • systemctl list-units -t service
  • Tools for measuring Performance
  • Tools for Microbenchmarking

Optimization includes benchmarking, profiling and tracing

  • Benchmarking
    • comparing performance against an industry standard
    • Tools used for benchmarking:
      • vmstat for virtual memory
      • iostat for I/O
      • mpstat for multiple processors
  • Profiling: gathering information about hot spots
  • Tracing

Based on the benchmarking, profiling and tracing, the system is tuned as follows:

  • using echo into /proc
    • These settings are not persistent; they last until reboot.
    • The /proc file system has PID directories for processes, configuration files and the sys directory
      • /proc/sys has some important directories such as kernel/, net/, vm/ and user/.
      • /proc/sys/vm has the settings for virtual memory.
  • Using the sysctl command
    • sysctl is a service
    • Persistent settings
    • It is started at boot time
    • It reads settings from configuration files under the /etc directory (e.g. /etc/sysctl.conf)
    • There are some more settings in:
      • /usr/lib/sysctl.d, which stores the defaults
      • /run/sysctl.d, which stores runtime settings
    • /etc/sysctl.d stores local, administrator-defined settings
  • Using loading/unloading of kernel modules to configure kernel parameters
    • Use modinfo to gather information about available parameters
    • Use modprobe to modify parameters:
    • modprobe modulename key=val
    • for persistent settings, create /etc/modprobe.d/modulename.conf

Linux Memory Management

Memory Address

For x86 Architecture there are 3 types of addresses:

  • Logical address:
    • the address of an instruction or datum in machine language
    • consists of a segment and an offset, i.e. the distance from the segment's start address
  • Linear address (virtual address):
    • a binary number in virtual memory that enables a process to use a location in main memory independently of other processes, and to use more space than actually exists in primary storage by temporarily relegating some contents to a hard disk or internal flash drive
  • Physical address:
    • the address of the memory cells in the RAM of the computer

Need for Virtual Addressing

  • The main memory (RAM) available on a computer is limited.
  • Many processes use common code, in libraries.
  • Using virtual addressing, the CPU and kernel give each process the impression that memory is unlimited.

Address Translation

  • Since two of the three address types above are not physical, addresses must be translated from logical to linear and from linear to physical.
  • For this reason each CPU contains a piece of hardware called the Memory Management Unit (MMU).
  • Segmentation unit: converts a logical address to a linear one.
  • Paging unit: converts a linear address to a physical one.
  • The address translation from linear address is done using two translation tables
    • Page Directory
    • Page Table


Image Credit: https://manybutfinite.com/post/memory-translation-and-segmentation/

Address Translation in Intel x86

(Figure: Linear-Address Translation to a 4-KByte Page using IA-32e Paging)

(Figure: Formats of CR3 and Paging-Structure Entries with PAE Paging)

MMU


Image Credit: By Mdjango, Andrew S. Tanenbaum (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Segmentation and Paging in Linux

  • Segmentation and paging are redundant with each other, and hence Linux uses segmentation only in a limited way.
  • Segmentation can assign a different linear address space to each process
  • Paging can map the same linear address space into different physical address spaces.

Segments and their usage

  • Each segment is described by 8-byte Segment Descriptor.
  • These descriptors are defined in Global Descriptor Table (GDT) or in the Local Descriptor Table(LDT).


Image credit: By John Källén (jkl at commons) (Own work) [Public domain], via Wikimedia Commons

Types of Segment Descriptors

  • Code Segment Descriptor
  • Data Segment Descriptor
  • Task State Segment Descriptor

Segmentation in Linux

linuxFlexibleAddressSpaceLayout

image credit https://manybutfinite.com/post/anatomy-of-a-program-in-memory/

Data structures for segmentation

  • GDT Global Descriptor Table
    • There is one GDT per CPU in Linux.
    • There are 18 descriptors in each GDT for various purposes as follows:
      • Descriptors for Kernel code and User code.
      • Task State Segment (TSS)
      • default Local Descriptor Table (LDT)
      • Thread-Local Storage (TLS) segments
      • Advanced Power Management (APM )
      • Plug and Play (PnP ) BIOS services
      • Special segment used for handling exceptions.

(Image credit: Lars H. Rohwedder (User:RokerHRO), own work.)

Paging


  • paging unit translates linear addresses into physical ones
  • For efficiency, linear addresses are divided into fixed-length intervals called pages; contiguous linear addresses within a page are mapped to contiguous physical addresses.
  • Page frames: main memory is divided into fixed-length page frames. Each page frame can contain a page; a page is just a block of data in memory or on disk.
  • Page tables: the data structures that map linear to physical addresses.
  • Page size: 4 KB
  • Huge pages: 2 MB and 1 GB
  • Page address (32-bit) = Directory (10 bits) + Page Table (10 bits) + Offset (12 bits)
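As a quick illustration of that 10/10/12 split, this snippet (the address value is arbitrary) extracts the three fields from a 32-bit linear address:

#include <cstdint>
#include <cstdio>

int main(){
  uint32_t lin = 0x0804a123;                 // an arbitrary example address
  uint32_t dir    = (lin >> 22) & 0x3FF;     // top 10 bits: page directory index
  uint32_t table  = (lin >> 12) & 0x3FF;     // next 10 bits: page table index
  uint32_t offset =  lin        & 0xFFF;     // low 12 bits: offset within the page
  printf("directory=%u table=%u offset=0x%x\n", dir, table, offset);
  return 0;
}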

Physical Address Extension (PAE)

  • In order to use more than 4 GB of memory, Intel widened physical addresses to 36 bits (more address pins), effectively supporting more physical memory (up to 64 GB).

Caching in Hardware

  • There are at least 3 levels of cache in a modern microprocessor.
  • Caches work on the principles of spatial and temporal locality.
  • A cache is divided into cache lines, generally of 64 bytes.
  • Most caches are N-way set associative.
  • The cache unit resides between the paging unit and the main memory.
  • It includes both a hardware cache memory and a cache controller.
  • Each cache line has a tag and some flags that store the status of the line.
  • CPUs first look for an address in the cache before looking in main memory.
  • Flushing the cache is done using the write-back mechanism, which is more efficient: only the cache entry is updated by the CPU, and main memory is updated eventually.

Translation Lookaside Buffers (TLB)

  • The TLB is a cache used for storing recently translated linear-to-physical address mappings.
  • The address translation unit first looks in the TLB for the physical address for a given linear address; if it is not found, the hardware walks the page tables to find the page.


Process page tables

  • Linux uses four levels of page tables:
  • Page Global Directory
  • Page Upper Directory
  • Page Middle Directory
  • Page Table
  • Each process has its own Page Global Directory and its own set of Page Tables.
  • Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either User or Kernel Mode.
  • Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in Kernel Mode.

Virtual Memory

What is Virtual Memory?

It allows us to run more applications on the system than the physical memory alone could support. Pages that do not fit in RAM can be written out to a swap file on the hard drive.

Why Virtual Memory?

Most modern CPUs have a built-in memory management unit to translate virtual memory addresses. Using virtual memory has the following advantages.

  • Main memory is limited: even small devices such as the Raspberry Pi running Linux can run multiple processes, each with a 32-bit address space (4 GB/process).
  • Sharing: it makes it easy for library code to be shared across applications, e.g. glibc.
  • Memory protection: memory corruption in one program does not affect the memory of another program.
  • Only part of a program needs to be in memory for execution.


Virtual Memory is implemented with

  • Demand Paging
  • Demand Segmentation

Physical memory management


Zones are defined using the data structure struct zone, which is defined in <linux/mmzone.h>. On x86, physical memory is divided into zones as follows:

  • ZONE_DMA: first 16 MB of memory
    • used by (legacy) hardware that can address only the first 16 MB.
  • ZONE_NORMAL: 16 MB – 896 MB
    • DMA and NORMAL zone page frames can be accessed directly by the kernel.
  • ZONE_HIGHMEM: 896 MB – end
    • On 32-bit systems, page frames belonging to ZONE_HIGHMEM cannot be accessed directly by the kernel.
    • On 64-bit systems this problem does not exist, because the available linear space is vastly larger than the RAM in the system.

Each zone has a zone descriptor with the following fields:

  • free_pages: Number of free pages
  • pages_min: Number of reserved pages
  • pages_low: Low watermark for reclaiming page frames
  • pages_high: High watermark for reclaiming page frames
  • free_area: Blocks of free pages
  • lru_lock: spinlock for any operation
  • wait_table: wait queues of processes waiting on one of the pages in the zone.
  • zone_mem_map: pointer to first page descriptor
  • name: “DMA”, “Normal” or “HighMem”

(Figures: the zone descriptor fields, and zones in a NUMA system.)

Memory Management in Linux


  • Some part of main memory is assigned permanently to the kernel.
  • The remaining memory is called dynamic memory.
  • The efficiency of the system depends upon how effectively that memory is managed.

Page Frame sizes

  • Linux uses 4 KB (and 4 MB) page sizes by default.
  • With PAE (Physical Address Extension) enabled it can also support 2 MB pages.
  • Most 32-bit architectures have 4 KB pages.
  • Some 64-bit architectures use 8 KB pages.
  • This means: with 4 KB pages and 1 GB of memory, physical memory is divided into 262,144 distinct page frames.
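A trivial sketch to check the page size on the machine at hand, via sysconf:

#include <unistd.h>  // sysconf
#include <cstdio>

int main(){
  long page_size = sysconf(_SC_PAGESIZE);  // page size in bytes, typically 4096
  printf("page size: %ld bytes\n", page_size);
  return 0;
}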

The reason to support 4 KB page size

  • Data transfer between disk and main memory is more efficient when a smaller page size is used.

Page Descriptors

  • The page information structure is defined using struct page structure. This structure is defined in <linux/mm_types.h>
  • Information about the state of the page frame is stored in page descriptor data structure.
  • Page descriptor store following information:
    • Flags
      • Various flags store the information about page.
      • The flag values are defined in <linux/page-flags.h>.
    • Frame's reference counter
      • This is an important field. When its value is -1, the page frame is free and can be acquired by any process or the kernel. A value of 0 or greater means the page is assigned to one or more processes.
    • Number of page entries
    • Index
    • pointer to least recently used doubly linked list

Memory Allocation

  • The Zoned Page Frame Allocator is used by the kernel to handle memory allocation requests for groups of contiguous page frames. The algorithm used is called the "Buddy System". The kernel provides various macros for allocating page frames:
    • alloc_pages(gfp_mask, order)
    • alloc_page(gfp_mask)
    • __get_free_pages(gfp_mask, order)
    • __get_free_page(gfp_mask)
    • get_zeroed_page(gfp_mask)
    • __get_dma_pages(gfp_mask, order)

    Deallocation of page frames is done using the following:

    • __free_pages(page, order)
    • free_pages(addr, order)
    • __free_page(page)
    • free_page(addr)

    High-Memory Page frame mappings:

    There are 3 mechanisms to map page frames in high memory:

    • Permanent Kernel mapping
    • Temporary Kernel Mapping
    • Non Contiguous Memory Allocation

    Fragmentation:

    • Frequent allocations and deallocations of page frames of different sizes may leave several small blocks of free page frames scattered inside larger blocks, making it impossible to allocate a large chunk of contiguous page frames.
    • To address this, the kernel keeps track of existing blocks of free contiguous page frames and avoids splitting up a large block when a request arrives for a smaller chunk.

Buddy System

  • This is one of the algorithms for allocating memory from a fixed-size segment consisting of physically contiguous pages.
  • Memory is allocated from the segment using a power-of-2 allocator:
    • This method satisfies requests in units sized as a power of 2.
    • Requests are rounded up to the next highest power of 2.
    • When a smaller allocation is needed than is available, the current chunk is split into two buddies of the next-lower power of 2.
    • Splitting continues until an appropriately sized chunk is available.
  • For example, assume a 256KB chunk is available and the kernel requests 21KB:
    • Split into AL and AR of 128KB each.
      • One is further divided into BL and BR of 64KB each.
        • One is further divided into CL and CR of 32KB each – one of these satisfies the request (see the sketch below).
  • Advantage – unused chunks are quickly coalesced into larger chunks.
  • Disadvantage – fragmentation.

buddy_system_algorithm
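
The rounding in the example can be reproduced with a few lines of userspace C. This is a toy calculation of the buddy “order” (a chunk of 2^order pages) for a request, not kernel code.

#include <stddef.h>
#include <stdio.h>

/* Toy calculation: round a request up to a power-of-2 number of 4KB pages
 * and report the buddy "order" (a chunk of 2^order pages). */
static unsigned int order_for(size_t bytes, size_t page_size)
{
    size_t pages = (bytes + page_size - 1) / page_size;  /* round up to whole pages */
    unsigned int order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

int main(void)
{
    unsigned int order = order_for(21 * 1024, 4096);     /* the 21KB request */
    printf("21KB -> order %u = %luKB chunk\n", order,
           (unsigned long)(((size_t)1 << order) * 4096 / 1024));
    /* prints: 21KB -> order 3 = 32KB chunk */
    return 0;
}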

Slab Allocator

  • An alternative memory-allocation strategy is the Slab Allocator.
  • The slab allocator maintains lists of blocks of available, already allocated data structures.
  • When needed, a data structure is allocated from the slab; when no longer needed, it is returned to the slab rather than being freed.
  • The slab allocator works as a generic data-structure caching layer.
  • Slab is one or more physically contiguous pages
  • Cache consists of one or more slabs
  • Single cache for each unique kernel data structure
    • Each cache filled with objects – instantiations of the kernel data structure
  • Slab-allocation algorithm uses caches to store kernel objects
    • When a cache is created, it is filled with objects marked as free.
    • When a new object for a kernel data structure is needed, the allocator can assign any free object from the cache to satisfy the request.
    • The object assigned from the cache is marked as used.
    • A slab is in one of 3 states: a full slab has no free objects, an empty slab has no allocated objects, and a partial slab has some free and some allocated objects.
  • If a slab is full of used objects, the next object is allocated from an empty slab
    • If no empty slabs, a new slab is allocated from contiguous physical pages and assigned to a cache
  • Benefits include no fragmentation and fast satisfaction of memory requests (a usage sketch follows).
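
A kernel-module-style sketch of the slab interface is below. struct my_thing and demo_slab() are hypothetical; kmem_cache_create(), kmem_cache_alloc(), kmem_cache_free() and kmem_cache_destroy() are the real calls.

#include <linux/slab.h>

struct my_thing {                 /* hypothetical kernel data structure */
    int id;
    char name[32];
};

static struct kmem_cache *my_cache;

static int demo_slab(void)
{
    struct my_thing *t;

    /* one cache for this unique data structure, backed by slabs */
    my_cache = kmem_cache_create("my_thing_cache", sizeof(struct my_thing),
                                 0, 0, NULL);
    if (!my_cache)
        return -1;                               /* sketch: no ENOMEM plumbing */

    t = kmem_cache_alloc(my_cache, GFP_KERNEL);  /* object comes from a slab */
    if (t)
        kmem_cache_free(my_cache, t);            /* returned to the slab, not freed */

    kmem_cache_destroy(my_cache);
    return 0;
}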

Screen Shot 2018-02-28 at 11.53.26 PM.png

Screen Shot 2018-03-13 at 8.06.41 PM

Screen Shot 2018-03-13 at 8.07.37 PM

Demand Paging


Programs are generally very large and cannot fit entirely in main memory. Hence only some of their pages are brought into main memory from disk, and only when they are actually required. The userspace sketch below illustrates the same idea with mmap().
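
The effect is easy to observe from userspace: mmap() a large file and the kernel sets up the mapping without reading anything; each page is then faulted in from disk only on first touch. The file name big.dat below is an arbitrary placeholder.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.dat", O_RDONLY);          /* placeholder file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    /* mapping is created instantly; no file data is read yet */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long sum = 0;
    /* touching one byte per 4096 faults each page in on demand */
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    printf("checksum %ld\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}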

Advantages of demand paging

  • Similar to a paging system with swapping
  • Less I/O needed, no unnecessary I/O
  • Less memory needed. Example: Raspberry pi.
  • Faster response, for example because shared-library pages already in memory need not be loaded again.
  • More users can be using the operating system at a time

Details of demand paging

  • Extreme case – start a process with no pages in memory
    • The OS sets the instruction pointer to the first instruction of the process; that page is not memory-resident, so a page fault occurs immediately.
    • Every other page likewise faults on its first access.
    • Pure demand paging never brings a page into memory until it is required.
  • In fact, a single instruction could access multiple pages, causing multiple page faults
    • One page for the instruction and many for data: this would mean unacceptable system performance.
    • Fortunately, the pain is reduced by locality of reference, giving reasonable performance.
  • Hardware support needed for demand paging
    • Page table with valid / invalid bit
    • Secondary memory, usually a high-speed disk. (swap device with swap space)
    • Instruction restart capability after a page fault
      • We must be able to restart the process in exactly the same place and state,
      • With the saved state (registers, condition code, instruction counter) of the interrupted process

Stages in Demand Paging

  1. Trap to the operating system
  2. Save the user registers and process state
  3. Determine that the interrupt was a page fault
  4. Check that the page reference was legal and determine the location of the page on the disk
  5. Issue a read from the disk to a free frame:
    1. Wait in a queue for this device until the read request is serviced
    2. Wait for the device seek and/or latency time
    3. Begin the transfer of the page to a free frame
  6. While waiting, allocate the CPU to some other user : context switch
  7. Receive an interrupt from the disk I/O subsystem (I/O completed)
  8. Save the registers and process state for the other user
  9. Determine that the interrupt was from the disk
  10. Correct the page table and other tables to show page is now in memory: set to v
  11. Wait for the CPU to be allocated to this process again : context switch
  12. Restore the user registers, process state, and new page table, and then resume the interrupted instruction: restart

Copy on Write

  • Copy-on-Write (COW) allows both parent and child processes to initially share the same pages in memory: fork() system call
    • If either process modifies a shared page, only then is the page copied
  • COW allows more efficient process creation as only modified pages are copied
  • In general, free pages are allocated from a pool of zero-fill-on-demand pages
    • Why zero-out a page before allocating it?
  • vfork(), a variation on the fork() system call, suspends the parent process while the child uses the parent’s address space.
    • Different from fork() with copy-on-write.
    • vfork() does not use copy-on-write; any changes the child makes to pages of the parent’s address space will be visible to the parent once it resumes.
    • Must be used with caution to prevent the child from corrupting the parent’s address space.
    • vfork() is intended to be used when the child calls exec() immediately after creation.
    • No pages are copied, making it a very efficient method of process creation. (A fork()-based copy-on-write demonstration follows.)
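
A small userspace demonstration of copy-on-write with fork(): parent and child initially share pages, and the child’s write triggers a private copy, so the parent’s value is unchanged.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int value = 42;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {              /* child */
        value = 99;              /* write faults; the page is copied for the child */
        printf("child sees %d\n", value);
        exit(0);
    }
    wait(NULL);                  /* parent */
    printf("parent still sees %d\n", value);   /* prints 42 */
    return 0;
}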

Handling Page Fault


  1. Check an internal table (within PCB) to decide:
    1. Invalid reference ⇒ abort, terminate the process
    2. Valid reference ⇒ page it in when it is not in memory
  2. Find a free frame (taking one from the free-frame list)
  3. Schedule a disk operation to read the desired page into the new frame
  4. Reset tables to indicate the page is now in memory
  5. Restart the instruction that was interrupted by the trap caused by the page fault

Screen Shot 2018-02-28 at 11.40.07 PM

Screen Shot 2018-02-28 at 11.40.25 PM

Page Replacement Algorithms

  1. Find the location of the desired page on disk
  2. Find a free frame:
    1. If there is a free frame, use it
    2. If there is no free frame, use a page replacement algorithm to select a victim frame
    3. Write the victim frame to the disk if dirty
  3. Bring the desired page into the (newly) free frame; update the page and frame tables
  4. Continue the process by restarting the instruction that caused the trap

Note that a page fault may now require two page transfers (one out and one in), increasing the effective access time (EAT).

LRU Algorithm

  • This is generally a good algorithm and is the most widely used.
  • Use past knowledge rather than future
  • Replace page that has not been used for the longest period of time
  • Associate time of last use with each page

Implementation of the LRU algorithm

  • Need hardware assistance to determine an order for the frames defined by the time of last use.
  • Counter implementation
    • Every page entry has a time-of-use field and a logical clock (counter) is added to the CPU;
    • Every time a page is referenced, the clock is incremented and the clock register value is copied into the time-of-use field in its page-table entry.
    • When a page needs to be replaced, look at the counters to find the smallest value
      • A search through the table is needed
  • Stack implementation
    • Keep a stack of page numbers in a doubly linked list
    • When a page referenced:
      • Removed from the stack and put on the top
      • Most recently used page is always at the top of the stack.
      • requires 6 pointers to be changed at worst
    • But each update more expensive
    • No search is needed for replacement; the tail points to the bottom of the stack, the least recently used (LRU) page (see the sketch after this list).
  • LRU and OPT are cases of stack algorithms that don’t have Belady’s Anomaly
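
A minimal sketch of the stack implementation, using a doubly linked list over a handful of page nodes. Referencing a page moves it to the head with at most six pointer updates, and the tail is always the replacement victim; all names here are made up for illustration.

#include <stdio.h>

#define NPAGES 4

struct node { int page; struct node *prev, *next; };

static struct node nodes[NPAGES];
static struct node *head, *tail;

static void init(void)
{
    /* link pages 0..3 in order; 0 is most recent, 3 is LRU */
    for (int i = 0; i < NPAGES; i++) {
        nodes[i].page = i;
        nodes[i].prev = (i > 0) ? &nodes[i - 1] : NULL;
        nodes[i].next = (i < NPAGES - 1) ? &nodes[i + 1] : NULL;
    }
    head = &nodes[0];
    tail = &nodes[NPAGES - 1];
}

/* Reference a page: unlink it and push it on top of the stack.
 * At most six pointer updates; no search on replacement. */
static void reference(struct node *n)
{
    if (n == head)
        return;
    n->prev->next = n->next;          /* unlink */
    if (n->next)
        n->next->prev = n->prev;
    else
        tail = n->prev;
    n->prev = NULL;                   /* push on top */
    n->next = head;
    head->prev = n;
    head = n;
}

int main(void)
{
    init();
    reference(&nodes[2]);
    reference(&nodes[1]);
    printf("LRU victim is page %d\n", tail->page);  /* page 3 */
    return 0;
}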

Interrupts, Signals and Exceptions

What is an Interrupt

An interrupt is an event that changes the program flow, i.e., the instruction stream being executed by the CPU. Interrupts are generated by the various devices connected to the CPU, or by software itself (for example, a bug that raises an exception). Interrupts are the way hardware signals the processor.

Interrupts and Exceptions

  1. Exceptions (synchronous interrupts):
    Caused by software and produced by the control unit of the CPU.
    Example: a bug in software, or a page fault. The kernel handles these exceptions by following steps defined (in kernel code) to recover from the condition.
  2. Interrupts (asynchronous interrupts):
    Caused by hardware devices.
    Example: a keypress or mouse movement by the user.

Interesting Points about Interrupts

  • Interrupts are asynchronous, and they can be nested.
  • An interrupt can occur while the kernel is handling another interrupt. While the kernel executes a critical region, interrupts are disabled, and critical regions are kept as small as possible.
  • By disabling interrupts, the kernel guarantees that an interrupt handler will not preempt the critical code.
  • Interrupts and exceptions are identified by a number between 0 and 255.
  • The code executed by an interrupt handler is not a process switch; rather, it runs at the expense of the process that was running when the interrupt was received.
  • Interrupt handling is critical for the kernel, but handling can take a long time in the case of slow I/O devices. Hence interrupt handling is divided into two parts:
    • Urgent: the kernel executes this right away.
    • Bottom halves: deferred for later execution (using techniques like softirqs, tasklets, and task/work queues).

Classification of Interrupts

  • Maskable
    • These are interrupt requests issued by I/O devices.
    • There are two states for a maskable interrupt.
      • Masked
      • Unmasked
    • The vectors of maskable interrupts can be altered by programming the interrupt controller.
  • Nonmaskable
    • They are always recognized by CPU.
    • The vectors of non-maskable interrupts are fixed.

Classification of Exceptions

  • Processor Detected
    • CPU detects an anomalous condition while executing an instruction
  • Faults:
    • Example: Page faults
    • Faults can be corrected and once corrected, program can be resumed.
  • Traps:
    • They are reported at the instruction following the one that trapped.
    • Used mainly for debugging.
  • Aborts:
    • These are severe errors like hardware failure
    • The process is terminated on receiving this signal.
  • Programmable Exceptions:
    • Often called software interrupts.
    • Used to implement system calls and debugging.
Number  Exception                      Type
0       Divide-by-zero error           Fault
1       Debug                          Trap or Fault
2       (not used)
3       Breakpoint                     Trap
4       Overflow                       Trap
5       Bounds check                   Fault
6       Invalid opcode                 Fault
7       Device not available           Fault
8       Double fault                   Fault
9       Coprocessor segment overrun    Abort
10      Invalid TSS                    Fault
11      Segment not present            Fault
12      Stack segment fault            Fault
13      General protection             Fault
14      Page fault                     Fault
15      (reserved by Intel)
16      Floating-point error           Fault
17      Alignment check                Fault
18      Machine check                  Abort
19      SIMD floating-point exception  Fault

A page fault occurs when a process tries to access a page that belongs to its address space but is not currently in RAM. While the kernel handles this exception, it may suspend the current process and switch to another until the page is available in RAM. The process switch is done because of the high latency involved: even a RAM access costs roughly 200 ns (several hundred CPU cycles), and fetching the page from disk costs far more.

Hardware for Interrupt Handling

  • Each hardware device connected to a computer has an output line named the Interrupt Request (IRQ) line.
  • All the IRQ lines are connected to a hardware circuit called the Programmable Interrupt Controller (PIC).
  • The PIC monitors the IRQ lines for raised signals.
  • If multiple signals are raised simultaneously, the signal on the lower-numbered pin is selected.
  • When a signal is raised, it is converted into a vector and stored; the vector is sent to the CPU, and a signal is raised on the CPU’s INTR pin until the CPU acknowledges the interrupt.


PIC_Hardware_interrupt_path

Image credit: By Jfmantis – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=18168230

APIC – Advanced Programmable Interrupt Controller

In modern multiprocessor systems, there is a local APIC chip per CPU. The APIC has the following components:

  1. 32-bit registers
  2. Internal clocks
  3. A local timer device
  4. Two additional interrupt lines, LINT0 and LINT1

local_apics_intel_xeon

Image credit: [Intel Software Developer manual vol 3.](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf)

Categories of Interrupts

  • I/O Interrupts
  • Timer Interrupts
  • Interprocessor interrupts

Types of Actions taken by Linux on interrupts

  • Critical
    • Critical actions are executed within the interrupt handler immediately.
  • Noncritical
    • These are quick to finish and hence are also executed by the interrupt handler immediately.
  • Noncritical deferrable
    • These may be delayed for a long time interval without affecting the kernel operations

IRQ Distribution in Multiprocessor System

The kernel tries to distribute the IRQ signals coming from the hardware devices in a round-robin fashion among all the CPUs.

The interrupts coming from external hardware can be distributed among CPUs in the following ways:

  • Static Distribution
  • Dynamic Distribution

IRQ affinity

The kernel provides functionality to redirect all occurrences of an interrupt to a particular CPU. This is achieved by modifying the Interrupt Redirection Table entries of the I/O APIC. The IRQ affinity of a particular interrupt can also be changed by writing a new CPU bitmap mask into the /proc/irq/n/smp_affinity file, as sketched below.
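
For example, the following userspace sketch pins a hypothetical IRQ 32 to CPU 1 by writing the bitmap 0x2 to its smp_affinity file (root privileges required; the IRQ number is just an example).

#include <stdio.h>

int main(void)
{
    /* IRQ 32 is a placeholder; pick a real IRQ from /proc/interrupts */
    FILE *f = fopen("/proc/irq/32/smp_affinity", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "2\n");   /* bitmask 0x2 => CPU 1 only */
    fclose(f);
    return 0;
}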

Interrupt Handling

When an interrupt is received, the kernel runs an interrupt handler or interrupt service routine; these are C functions. A data structure named the Interrupt Descriptor Table (IDT) stores each interrupt or exception vector together with the address of the corresponding interrupt or exception handler. The table must be properly initialized before the kernel enables interrupts.

Top Halves vs Bottom Halves

  • Interrupt handling should be fast, but there may be a large amount of work involved; hence the handling is divided into two parts:
  • Top half: executed immediately, performing time-critical work like acknowledging the interrupt.
  • Bottom half: work that can be deferred, like communicating with the I/O device.

Deferred Part of Interrupt Handling

  • As stated above, interrupt handling has two parts: critical, and noncritical (deferred) handling.
  • SoftIRQs, tasklets, work queues, etc. are ways to process the deferred part of interrupt handling, also called bottom halves. They compare as follows (a tasklet sketch follows the comparison):
SoftIRQs:
  • Statically allocated.
  • Reentrant functions; they must explicitly protect their data structures with spinlocks.
  • Provide the least serialization.

Tasklets:
  • Can also be allocated and initialized at runtime.
  • Do not need synchronization of their own, because the kernel handles it for them.
  • Tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time.
  • Easy to code.
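
For illustration, a tasklet using the classic (pre-5.9) kernel API; my_tasklet_fn and the calling function are hypothetical, while DECLARE_TASKLET() and tasklet_schedule() are the real (older) interface.

#include <linux/interrupt.h>

/* Runs in softirq context soon after being scheduled: must not sleep. */
static void my_tasklet_fn(unsigned long data)
{
    /* short, atomic bottom-half work goes here */
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* Hypothetical top half: defer the slow part and return quickly. */
static void handle_irq_bottom_half(void)
{
    tasklet_schedule(&my_tasklet);
}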

Work Queues

  • They defer work into a kernel queue.
  • Functions in work queues run in process context, and hence may block or sleep.
  • A function in a work queue is executed by a kernel thread, so there is no user-mode address space to access (see the sketch below).
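
A kernel-module-style sketch of deferring work to the shared kernel work queue; the handler function is hypothetical, while DECLARE_WORK() and schedule_work() are the real interface.

#include <linux/printk.h>
#include <linux/workqueue.h>

/* Runs later in a kernel worker thread (process context); may sleep. */
static void my_work_fn(struct work_struct *work)
{
    pr_info("deferred work executed\n");
}

static DECLARE_WORK(my_work, my_work_fn);

/* Hypothetical top half: queue the work and return immediately. */
static void handle_interrupt_bottom_half(void)
{
    schedule_work(&my_work);
}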

Exception Handling

  • Exceptions raised by the CPU are handled by Linux as error conditions.
  • The kernel sends a signal to the process about the erroneous condition.
  • Steps taken to handle an exception:
    • Save the registers to the kernel stack.
    • Invoke a C-level function to handle the exception.
    • Call the ret_from_exception() function and exit.

Signals

Signals are software generated interrupts. A signal is generated for a process (or sent to a process) when the event that causes the signal occurs. When the signal is generated, the kernel usually sets a flag of some form in the process table. A signal is delivered to a process when the action for a signal is taken. Between the time of generation and delivery, the signal is pending.

When a process receives a signal, it can do one of the following (a minimal example of catching a signal appears after the list).

  1. Ignore it: all signals except SIGKILL and SIGSTOP can be ignored.
  2. Catch the signal: invoke a callback on receiving the signal. Again, SIGKILL and SIGSTOP cannot be caught or blocked.
  3. Apply the default action.
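
A minimal sketch of catching a signal with sigaction(): the handler for SIGINT calls only write(), which is async-signal-safe. SIGKILL and SIGSTOP could not be installed this way.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_sigint(int signo)
{
    (void)signo;
    const char msg[] = "caught SIGINT\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);  /* async-signal-safe */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGINT, &sa, NULL) < 0) { perror("sigaction"); return 1; }

    for (;;)
        pause();   /* wait for signals; Ctrl-C triggers the handler */
    return 0;
}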

On abnormal termination of a process, a memory image of the process (a core dump) is written to the current working directory of the process.

Reentrant functions: functions that are guaranteed to be safe to call from within a signal handler. These are async-signal-safe functions, meaning they block signals before entering a critical region.