Detecting and fixing Memory Issues


There are two main tools I like to use for anything memory-related in my C++ code: AddressSanitizer and Valgrind. Below is some example code I wrote as my own implementation of a shared pointer. I think it works as expected, but I am not sure how much memory it leaks, if any.

#include <iostream>

template <typename T>
class MySharedPtr{
  public:
    MySharedPtr(T val){
      resource_ptr = new T(val);
      c_ptr = new int(1);
    };

  MySharedPtr<T>(const MySharedPtr<T> & ptr){
    std::cout << "copy constructor \n";
    resource_ptr = ptr.resource_ptr;
    (*ptr.c_ptr)++;
    this->c_ptr = ptr.c_ptr;
  }

  MySharedPtr<T>& operator=(const MySharedPtr<T> & ptr){
    std::cout << "copy assignment operator\n";
    if(this != &ptr){
      (*ptr.c_ptr)++;
      this->c_ptr = ptr.c_ptr;
      this->resource_ptr = ptr.resource_ptr;
     }
    return *this;
  }

  // delete move constructors...
  MySharedPtr(MySharedPtr && ptr) = delete;
  MySharedPtr & operator=(MySharedPtr && ptr) = delete;

  T & operator *(){
    return *resource_ptr;
  }

  T* operator ->(){
    return resource_ptr;
  }

  virtual ~MySharedPtr(){
    std::cout << *c_ptr << '\n';
    if(*c_ptr > 0){
      (*c_ptr)--;
    }else{
      delete c_ptr;
      delete resource_ptr;
    }
  }

  int use_count(){
    return *c_ptr;
  }

  private:
    int * c_ptr;
    T * resource_ptr;
};

int main(){
  int x;
  MySharedPtr<decltype(x)> sp(1);
  {
    MySharedPtr<decltype(x)> sp1(2);
    sp1 = sp;
    std::cout << "use count is " << sp.use_count() << '\n';
  }
  
  sp = sp;
  std::cout << "use count is " << sp.use_count() << '\n';
  return 0;
}

Google AddressSanitizer

The command:

/usr/local/Cellar/gcc/8.2.0/bin/g++-8 -std=c++17 -g -O2 -fsanitize=address -fno-omit-frame-pointer shared_ptr.cpp -o shared_ptr && ASAN_OPTIONS=detect_leaks=1 ./shared_ptr

The output:

copy assignment operator
use count is 2
2
copy assignment operator
use count is 1
1

=================================================================
==25533==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b50c in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:9:19
    #2 0x52b112 in main /home/aarna/devel/practice/smart_ptrs.cpp:62:28
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b47b in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:8:26
    #2 0x52b134 in main /home/aarna/devel/practice/smart_ptrs.cpp:64:32
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b50c in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:9:19
    #2 0x52b134 in main /home/aarna/devel/practice/smart_ptrs.cpp:64:32
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x526b18 in operator new(unsigned long) (/home/aarna/devel/practice/smart_ptrs+0x526b18)
    #1 0x52b47b in MySharedPtr<int>::MySharedPtr(int) /home/aarna/devel/practice/smart_ptrs.cpp:8:26
    #2 0x52b112 in main /home/aarna/devel/practice/smart_ptrs.cpp:62:28
    #3 0x7f6507f483d4 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:274

SUMMARY: AddressSanitizer: 16 byte(s) leaked in 4 allocation(s).

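The explanation: AddressSanitizer (through its LeakSanitizer component) reports four direct leaks of 4 bytes each and points at the exact lines that allocated them: the two new expressions in the constructor, reached once for each of the two objects constructed in main. Neither the resource nor the counter is ever freed, and knowing that is very useful in coming up with a fix.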

Valgrind

Command:

valgrind --tool=memcheck  --leak-check=full --show-leak-kinds=all --track-origins=yes  ./smart_ptrs

The output:

[aarna@localhost practice]$ valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all --track-origins=yes ./smart_ptrs
==25564== Memcheck, a memory error detector
==25564== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==25564== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==25564== Command: ./smart_ptrs
==25564==
==25564==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==25564==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==25564==Process memory map follows:
0x000000400000-0x00000055d000 /home/aarna/devel/practice/smart_ptrs
0x00000075c000-0x00000075d000 /home/aarna/devel/practice/smart_ptrs
0x00000075d000-0x000000760000 /home/aarna/devel/practice/smart_ptrs
0x000000760000-0x000001447000
0x000004000000-0x000004022000 /usr/lib64/ld-2.17.so
0x000004022000-0x000004036000
0x00000403f000-0x000004055000
0x000004221000-0x000004222000 /usr/lib64/ld-2.17.so
0x000004222000-0x000004223000 /usr/lib64/ld-2.17.so
0x000004223000-0x000004224000
0x000004224000-0x000004225000
0x000004a24000-0x000004a25000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004a25000-0x000004c24000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004c24000-0x000004c25000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004c25000-0x000004c26000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_core-amd64-linux.so
0x000004c26000-0x000004c34000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004c34000-0x000004e34000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004e34000-0x000004e35000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004e35000-0x000004e36000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
0x000004e36000-0x000004f1f000 /usr/lib64/libstdc++.so.6.0.19
0x000004f1f000-0x00000511e000 /usr/lib64/libstdc++.so.6.0.19
0x00000511e000-0x000005126000 /usr/lib64/libstdc++.so.6.0.19
0x000005126000-0x000005128000 /usr/lib64/libstdc++.so.6.0.19
0x000005128000-0x00000513d000
0x00000513d000-0x00000523e000 /usr/lib64/libm-2.17.so
0x00000523e000-0x00000543d000 /usr/lib64/libm-2.17.so
0x00000543d000-0x00000543e000 /usr/lib64/libm-2.17.so
0x00000543e000-0x00000543f000 /usr/lib64/libm-2.17.so
0x00000543f000-0x000005456000 /usr/lib64/libpthread-2.17.so
0x000005456000-0x000005655000 /usr/lib64/libpthread-2.17.so
0x000005655000-0x000005656000 /usr/lib64/libpthread-2.17.so
0x000005656000-0x000005657000 /usr/lib64/libpthread-2.17.so
0x000005657000-0x00000565b000
0x00000565b000-0x000005662000 /usr/lib64/librt-2.17.so
0x000005662000-0x000005861000 /usr/lib64/librt-2.17.so
0x000005861000-0x000005862000 /usr/lib64/librt-2.17.so
0x000005862000-0x000005863000 /usr/lib64/librt-2.17.so
0x000005863000-0x000005865000 /usr/lib64/libdl-2.17.so
0x000005865000-0x000005a65000 /usr/lib64/libdl-2.17.so
0x000005a65000-0x000005a66000 /usr/lib64/libdl-2.17.so
0x000005a66000-0x000005a67000 /usr/lib64/libdl-2.17.so
0x000005a67000-0x000005a7c000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005a7c000-0x000005c7b000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005c7b000-0x000005c7c000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005c7c000-0x000005c7d000 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
0x000005c7d000-0x000005e40000 /usr/lib64/libc-2.17.so
0x000005e40000-0x00000603f000 /usr/lib64/libc-2.17.so
0x00000603f000-0x000006043000 /usr/lib64/libc-2.17.so
0x000006043000-0x000006045000 /usr/lib64/libc-2.17.so
0x000006045000-0x00000639c000
0x00000639c000-0x00000679c000
0x000058000000-0x00005823b000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/memcheck-amd64-linux
0x00005843b000-0x00005843e000 /opt/rh/devtoolset-7/root/usr/lib64/valgrind/memcheck-amd64-linux
0x00005843e000-0x000059e40000
0x001002001000-0x001008b0a000
0x001008b8c000-0x001008bac000
0x001008bac000-0x001008bae000
0x001008bae000-0x001008cae000
0x001008cae000-0x001008cb0000
0x001008cb0000-0x001008cb1000 /tmp/vgdb-pipe-shared-mem-vgdb-25564-by-aarna-on-localhost.localdomain
0x001008cb1000-0x00100b0fd000
0x00100b1fd000-0x00100b3fd000
0x00100b4fd000-0x00100b5fd000
0x00100b7f2000-0x00100ba1b000
0x00100bbdb000-0x00100bcdb000
0x001ffeffd000-0x001fff001000
0x7ffd1941d000-0x7ffd1943e000 [stack]
0xffffffffff600000-0xffffffffff601000 [vsyscall]
==25564==End of process memory map.
==25564==
==25564== HEAP SUMMARY:
==25564== in use at exit: 32 bytes in 1 blocks
==25564== total heap usage: 1 allocs, 0 frees, 32 bytes allocated
==25564==
==25564== 32 bytes in 1 blocks are still reachable in loss record 1 of 1
==25564== at 0x4C2B955: calloc (vg_replace_malloc.c:711)
==25564== by 0x586454F: _dlerror_run (dlerror.c:141)
==25564== by 0x5864057: dlsym (dlsym.c:70)
==25564== by 0x50233B: __interception::GetRealFunctionAddress(char const*, unsigned long*, unsigned long, unsigned long) (in /home/aarna/devel/practice/smart_ptrs)
==25564== by 0x4DD462: __asan::InitializeAsanInterceptors() (in /home/aarna/devel/practice/smart_ptrs)
==25564== by 0x41A49F: __asan::AsanInitInternal() [clone .part.1] (in /home/aarna/devel/practice/smart_ptrs)
==25564== by 0x400FCA2: _dl_init (dl-init.c:116)
==25564== by 0x4001029: ??? (in /usr/lib64/ld-2.17.so)
==25564==
==25564== LEAK SUMMARY:
==25564== definitely lost: 0 bytes in 0 blocks
==25564== indirectly lost: 0 bytes in 0 blocks
==25564== possibly lost: 0 bytes in 0 blocks
==25564== still reachable: 32 bytes in 1 blocks
==25564== suppressed: 0 bytes in 0 blocks
==25564==
==25564== For counts of detected and suppressed errors, rerun with: -v
==25564== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

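The explanation:

This run says more about the tooling than about my code. The binary was still built with -fsanitize=address, and an ASan-instrumented program cannot run under Valgrind: ASan could not set up its shadow memory, printed the process memory map and aborted before main ever ran. The 32 bytes that Valgrind reports as still reachable at exit were allocated by ASan's own start-up code (see the __asan frames in the stack), not by MySharedPtr. To get a meaningful Memcheck report, rebuild without -fsanitize=address and run Valgrind on that binary.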

The fix:

The output of Google's AddressSanitizer was useful at the micro level. It pointed precisely at the line numbers, and I was able to apply my first fix for the memory leak, in the destructor, as follows:

virtual ~MySharedPtr(){
  if(*c_ptr > 0){
    (*c_ptr)--;
  }
  
  if(*c_ptr == 0){
    if(resource_ptr != nullptr){
      delete resource_ptr;
      resource_ptr = nullptr;
    }
    delete c_ptr;
    c_ptr = nullptr;
  }
}
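To see why the original destructor never freed anything, it helps to trace the counter by hand. The sketch below models only the branching logic of the two destructors on a plain int; old_release and new_release are hypothetical helper names made up for this illustration, not part of the class.

```cpp
// Models the ORIGINAL destructor logic: returns true if it would call delete.
bool old_release(int & count){
  if(count > 0){
    count--;        // the last owner arrives with count == 1 and takes this branch...
    return false;   // ...so the delete in the else branch is never reached
  }
  return true;
}

// Models the FIXED destructor logic: decrement first, then free on zero.
bool new_release(int & count){
  if(count > 0){
    count--;
  }
  return count == 0;
}
```

With a single owner (count of 1), old_release decrements to 0 and reports that nothing is deleted, while new_release reports a delete. That matches the leak of both allocations for every object in the first sanitizer report.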

I still have the following leaks:

=================================================================
==9130==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x1026a53d0 in wrap__Znwm (libasan.5.dylib:x86_64+0x6e3d0)
#1 0x102324526 in MySharedPtr<int>::MySharedPtr(int) shared_ptr.cpp:9
#2 0x10232428a in main shared_ptr.cpp:68
#3 0x7fff7661b014 in start (libdyld.dylib:x86_64+0x1014)

Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x1026a53d0 in wrap__Znwm (libasan.5.dylib:x86_64+0x6e3d0)
#1 0x1023244a4 in MySharedPtr<int>::MySharedPtr(int) shared_ptr.cpp:8
#2 0x10232428a in main shared_ptr.cpp:68
#3 0x7fff7661b014 in start (libdyld.dylib:x86_64+0x1014)

SUMMARY: AddressSanitizer: 8 byte(s) leaked in 2 allocation(s).

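Going through the code, I fixed my copy assignment operator so that it gives up its current resource pointer and counter pointer before taking over the new ones. They are deleted only when this object was the last owner; deleting them unconditionally would free memory that other copies still use:

MySharedPtr<T>& operator=(const MySharedPtr<T> & ptr){
  std::cout << "copy assignment operator\n";
  if(this != &ptr){
    (*ptr.c_ptr)++;
    if(--(*c_ptr) == 0){    // were we the last owner of the old resource?
      delete resource_ptr;  // deleting fix 1
      delete c_ptr;         // deleting fix 2
    }
    this->c_ptr = ptr.c_ptr;
    this->resource_ptr = ptr.resource_ptr;
   }
   return *this;
}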

And my code generated error free output:

copy assignment operator
use count is 2
copy assignment operator
use count is 1
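For reference, here is a consolidated sketch of the class with the destructor and assignment fixes folded into one place. It is a minimal illustration rather than a replacement for std::shared_ptr: release() is a hypothetical helper I introduce here, the logging is dropped, and the sketch is neither thread-safe nor exception-safe. Assignment gives up the left-hand side's share first and frees the old resource only when no other owner remains.

```cpp
template <typename T>
class MySharedPtr{
  public:
    explicit MySharedPtr(T val)
      : c_ptr(new int(1)), resource_ptr(new T(val)) {}

    MySharedPtr(const MySharedPtr & ptr)
      : c_ptr(ptr.c_ptr), resource_ptr(ptr.resource_ptr){
      ++(*c_ptr);                 // one more owner of the same resource
    }

    MySharedPtr & operator=(const MySharedPtr & ptr){
      if(this != &ptr){
        ++(*ptr.c_ptr);           // take the new share first; safe even if
        release();                // both objects already share one counter
        c_ptr = ptr.c_ptr;
        resource_ptr = ptr.resource_ptr;
      }
      return *this;
    }

    ~MySharedPtr(){ release(); }

    T & operator*(){ return *resource_ptr; }
    T * operator->(){ return resource_ptr; }
    int use_count() const { return *c_ptr; }

  private:
    // Drops this object's share; frees both allocations when it was the last.
    void release(){
      if(--(*c_ptr) == 0){
        delete resource_ptr;
        delete c_ptr;
      }
    }

    int * c_ptr;
    T * resource_ptr;
};
```

Keeping the decrement-and-maybe-delete step in one private function means the destructor and the assignment operator cannot drift apart again, which is how the two leaks above crept in.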

Linux Performance Optimization

Three main levels of tuning

  1. CPU/BIOS tuning
  2. OS tuning
  3. Network tuning

Configuration using tuned

  • tuned is available from RHEL 7 onwards
  • It makes changes through sysctl as well as the /sys directory
  • There are profiles available for specific needs, such as network-latency, latency-performance, network-throughput, throughput-performance, desktop and balanced


[Figure: examples of the profiles]

Latency Performance Settings in Linux Tuned

Setting | Meaning
force_latency=1 | Processor state C1
governor=performance | CPU is at a higher performance state
energy_perf_bias=performance | Higher performance state
min_perf_pct=100 | Comes from the P-state driver. In addition to the interfaces provided by the cpufreq core for controlling frequency, the driver provides sysfs files for controlling P-state selection. These files have been added to /sys/devices/system/cpu/intel_pstate.
kernel.sched_min_granularity_ns=10000000 | Minimal preemption granularity for CPU-bound tasks
vm.dirty_ratio=10 | The generator of dirty data starts writeback at this percentage
vm.dirty_background_ratio=3 | Background writeback (via writeback threads) starts at this percentage
vm.swappiness=10 | Controls the tendency of the kernel to move processes out of physical memory and onto the swap disk. 0 tells the kernel to avoid swapping processes out of physical memory for as long as possible; 100 tells the kernel to aggressively swap processes out of physical memory and move them to swap cache.
kernel.sched_migration_cost_ns=5000000 | The total time the scheduler will consider a migrated process “cache hot” and thus less likely to be re-migrated
net.core.busy_read=50 | The number of microseconds to wait for packets on the device queue for socket reads. It also sets the default value of the SO_BUSY_POLL option.
net.core.busy_poll=50 | The number of microseconds to wait for packets on the device queue for socket poll and select
kernel.numa_balancing=0 | Disables NUMA balancing
net.ipv4.tcp_fastopen=3 | Linux supports configuring both overall client and server support via /proc/sys/net/ipv4/tcp_fastopen (net.ipv4.tcp_fastopen via sysctl). The options are a bit mask: the first bit enables or disables client support (default on), the second bit sets server support (default off), and the third bit sets whether data in the SYN packet is permitted without the TFO cookie option. Therefore a value of 1 enables TFO only on outgoing connections (client only), a value of 2 allows TFO only on listening sockets (server only), and a value of 3 enables TFO for both client and server.
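The bit-mask reading of net.ipv4.tcp_fastopen described above can be spelled out in a few lines. This is only a sketch: TfoFlags and decode_tcp_fastopen are hypothetical names, and only the three low bits mentioned in the description are decoded.

```cpp
// Decoded view of the three low bits of net.ipv4.tcp_fastopen.
struct TfoFlags {
  bool client;               // bit 0 (value 1): TFO on outgoing connections
  bool server;               // bit 1 (value 2): TFO on listening sockets
  bool data_without_cookie;  // bit 2 (value 4): data in SYN without a TFO cookie
};

TfoFlags decode_tcp_fastopen(unsigned value){
  return TfoFlags{
    (value & 0x1u) != 0,
    (value & 0x2u) != 0,
    (value & 0x4u) != 0
  };
}
```

For the profile value of 3, the client and server flags come out set and the third flag stays clear.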

More information about Tunables:

  • Dependent on NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/swappiness
      • /proc/sys/vm/min_free_kbytes
  • Independent of NUMA:
    • Reclaim Ratios
      • /proc/sys/vm/vfs_cache_pressure
    • Writeback Parameters
      • /proc/sys/vm/dirty_background_ratio
      • /proc/sys/vm/dirty_ratio
    • Readahead parameters
      • /sys/block/<device>/queue/read_ahead_kb
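Each sysctl name maps onto a file below /proc/sys, so the current value of a tunable can be checked from a program before and after tuning. The following is a minimal sketch with hypothetical helper names (sysctl_path, read_sysctl); it assumes a Linux /proc file system and simply replaces every dot with a slash, which does not hold for the rare names that contain a literal dot (for example some network interface names).

```cpp
#include <algorithm>
#include <fstream>
#include <optional>
#include <string>

// Maps a sysctl name such as "vm.dirty_ratio" to "/proc/sys/vm/dirty_ratio".
std::string sysctl_path(std::string name){
  std::replace(name.begin(), name.end(), '.', '/');
  return "/proc/sys/" + name;
}

// Returns the first line of the tunable, or std::nullopt if it cannot be read
// (unknown name, missing permissions, or not running on Linux).
std::optional<std::string> read_sysctl(const std::string & name){
  std::ifstream in(sysctl_path(name));
  std::string line;
  if(in && std::getline(in, line)){
    return line;
  }
  return std::nullopt;
}
```

Calling read_sysctl("vm.swappiness") on a Linux machine returns the current swappiness as text; an empty optional signals that the file could not be read.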

top utility

  • Available on all machines
  • The default measurement tool; it shows a lot of information

Parameters important for Optimization


Swappiness

  • Controls how aggressively the system reclaims anonymous memory versus
    page cache memory:

    • Anonymous memory – swapping and freeing
    • File pages – writing if dirty and freeing
    • System V shared memory – swapping and freeing
  • Default is 60
  • If it is decreased: more aggressive reclaiming of page cache memory
  • If it is increased: more aggressive swapping of anonymous memory
  • Should be set to 0 according to the low-latency optimization guide at http://wiki.dreamrunner.org/public_html/Low_Latency_Programming/low-latency-programming.html

[Figure: memory reclaim watermarks]

zone_reclaim_mode

  • Controls NUMA specific memory allocation policy
  • When set and node memory is exhausted:
    • Reclaim memory from local node rather than allocating
      from next node
    • Slower initial allocation, higher NUMA hit ratio
  • When clear and node memory is exhausted:
    • Allocate from all nodes before reclaiming memory
    • Faster initial allocation, higher NUMA miss ratio
  • To see current setting: cat /proc/sys/vm/zone_reclaim_mode
  • Turn ON: echo 1 > /proc/sys/vm/zone_reclaim_mode
  • Turn OFF: echo 0 > /proc/sys/vm/zone_reclaim_mode
  • It is recommended to turn this setting off for large in-memory database servers.

CPU Tuning

  • P-states
    • A P-state is a processor performance state defined by the Advanced Configuration and Power Interface (ACPI).
    • P0, P1, P2 ... P11 etc. are the values for P-states. For the highest performance, the P-state should be set to P0.
  • C-states
    • C-states allow the CPU package to shut down cores and parts of the CPU microarchitecture to save energy while balancing response times.

    • C0, C1, C3, C6 etc. are the values for C-states. Here is a chart with more information:

      Mode | Name | What it does | CPUs
      C0 | Operating State | CPU fully turned on | All CPUs
      C1 | Halt | Stops CPU main internal clocks via software; bus interface unit and APIC are kept running at full speed. | 486DX4 and above
      C1E | Enhanced Halt | Stops CPU main internal clocks via software and reduces CPU voltage; bus interface unit and APIC are kept running at full speed. | All socket LGA775 CPUs
      C1E | - | Stops all CPU internal clocks. | Turion 64, 65-nm Athlon X2 and Phenom CPUs
      C2 | Stop Grant | Stops CPU main internal clocks via hardware; bus interface unit and APIC are kept running at full speed. | 486DX4 and above
      C2 | Stop Clock | Stops CPU internal and external clocks via hardware | Only 486DX4, Pentium, Pentium MMX, K5, K6, K6-2, K6-III
      C2E | Extended Stop Grant | Stops CPU main internal clocks via hardware and reduces CPU voltage; bus interface unit and APIC are kept running at full speed. | Core 2 Duo and above (Intel only)
      C3 | Sleep | Stops all CPU internal clocks | Pentium II, Athlon and above, but not on Core 2 Duo E4000 and E6000
      C3 | Deep Sleep | Stops all CPU internal and external clocks | Pentium II and above, but not on Core 2 Duo E4000 and E6000; Turion 64
      C3 | AltVID | Stops all CPU internal clocks and reduces CPU voltage | AMD Turion 64
      C4 | Deeper Sleep | Reduces CPU voltage | Pentium M and above, but not on Core 2 Duo E4000 and E6000 series; AMD Turion 64
      C4E/C5 | Enhanced Deeper Sleep | Reduces CPU voltage even more and turns off the memory cache | Core Solo, Core Duo and 45-nm mobile Core 2 Duo only
      C6 | Deep Power Down | Reduces the CPU internal voltage to any value, including 0 V | 45-nm mobile Core 2 Duo only
    • Latency for C states

      C-STATE | RESIDENCY | WAKE-UP LATENCY
      C0 | ACTIVE | ACTIVE
      C1 | 1 μs | 1 μs
      C3 | 106 μs | 80 μs
      C6 | 345 μs | 104 μs
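The wake-up latencies listed above explain why a latency cap such as force_latency=1 ends up at C1: it is the deepest state that can still wake within 1 μs. The sketch below encodes that lookup. It is a simplified illustration with hypothetical names (CState, kStates, deepest_allowed); it treats C0 as zero latency and ignores the residency column, which a real idle governor also takes into account.

```cpp
#include <array>
#include <string>

struct CState {
  const char * name;
  double wakeup_us;  // wake-up latency in microseconds, from the table above
};

// Ordered from shallowest to deepest; C0 is the active state.
constexpr std::array<CState, 4> kStates{{
  {"C0", 0.0}, {"C1", 1.0}, {"C3", 80.0}, {"C6", 104.0}
}};

// Deepest state whose wake-up latency does not exceed the given budget.
std::string deepest_allowed(double max_latency_us){
  std::string chosen = kStates[0].name;
  for(const CState & s : kStates){
    if(s.wakeup_us <= max_latency_us){
      chosen = s.name;
    }
  }
  return chosen;
}
```

With a budget of 1 μs the function returns "C1"; only budgets of 80 μs or more let the core enter C3, and 104 μs or more are needed for C6.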

How to Optimize

  1. Use real production system for measurement
  2. Note the state of system before change
  3. Apply the optimizations described later in this post.
  4. Compare state of system after applying the optimizations
  5. Tune until you get desired results.
  6. Document the changes.
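Steps 2 and 4 come down to comparing two sets of measurements. One simple way to do that, sketched below under the assumption that the samples are latencies (lower is better) and that both sets are non-empty, is to compare medians so that a few outliers do not dominate. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of the samples. The vector is taken by value on purpose so the
// caller's data is left unsorted. Assumes at least one sample.
double median(std::vector<double> samples){
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  return n % 2 ? samples[n / 2]
               : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

// Relative change of the tuned median versus the baseline median, in percent.
// A negative result means the tuned system showed lower latency.
double percent_change(const std::vector<double> & baseline,
                      const std::vector<double> & tuned){
  const double before = median(baseline);
  return (median(tuned) - before) / before * 100.0;
}
```

Recording the baseline samples before any change, as step 2 asks, is what makes this comparison possible later.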

Types of Tools needed

  • Tools that gather system information or Hardware
    • lscpu
    • uname
    • systemctl list-units -t service
  • Tools for measuring Performance
  • Tools for Microbenchmarking

Optimization includes benchmarking, profiling and tracing

  • Benchmarking
    • Comparing performance against an industry standard
    • Tools used for benchmarking
      • vmstat: virtual memory statistics
      • iostat: I/O statistics
      • mpstat: per-processor statistics
  • Profiling: gathering information about hot spots
  • Tracing

Based on the benchmarking, profiling and tracing, the system is tuned as follows:

  • Using echo into /proc
    • The settings are not persistent; they last until reboot.
    • The /proc file system has PID directories for processes, configuration files and the sys directory
      • The /proc/sys directory has some important subdirectories such as kernel, net, vm and user.
      • /proc/sys/vm holds the settings for virtual memory.
  • Using the sysctl command
    • The sysctl settings are applied by a service
    • The settings are persistent.
    • The service is started at boot time
    • It reads settings from a configuration file in the /etc directory
    • There are some more settings in
      • /usr/lib/sysctl.d stores the defaults
      • /run/sysctl.d stores the runtime settings
      • /etc/sysctl.d stores the administrator's settings
  • Loading/unloading kernel modules and configuring their parameters
    • Use modinfo to gather information about the available parameters
    • Use modprobe to modify parameters
      • modprobe modulename key=val
    • For persistent settings, use /etc/modprobe.d/modulename.conf
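The sysctl configuration files mentioned above are plain text with one "key = value" pair per line, and lines starting with # or ; are comments. To make the format concrete, here is a small sketch of a line parser. The names trim and parse_sysctl_line are hypothetical, and the real sysctl tool handles more cases than this.

```cpp
#include <optional>
#include <string>
#include <utility>

// Strips leading and trailing spaces and tabs.
std::string trim(const std::string & s){
  const auto first = s.find_first_not_of(" \t");
  if(first == std::string::npos) return "";
  const auto last = s.find_last_not_of(" \t");
  return s.substr(first, last - first + 1);
}

// Parses "key = value". Returns std::nullopt for blank lines, comments
// and lines without an '=' sign.
std::optional<std::pair<std::string, std::string>>
parse_sysctl_line(const std::string & line){
  const std::string t = trim(line);
  if(t.empty() || t[0] == '#' || t[0] == ';') return std::nullopt;
  const auto eq = t.find('=');
  if(eq == std::string::npos) return std::nullopt;
  return std::make_pair(trim(t.substr(0, eq)), trim(t.substr(eq + 1)));
}
```

A line such as "vm.swappiness = 10" yields the pair ("vm.swappiness", "10"), which is the same name and value one would echo into /proc/sys/vm/swappiness for a non-persistent change.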