性能优化篇（1）：几种简单的访存优化手段

Author: stormQ

Created: Sunday, 10. November 2019 11:36AM

Last Modified: Wednesday, 28. October 2020 08:17PM

摘要

本文描述了几种简单的访存优化手段，并从高速缓存的角度定量分析这些手段提升程序性能的原理。

减少不必要的内存引用

示例：

void poor(const int *src, int n, int *dest)
{
    for (int i = 0; i < n; ++i)
    {
        *dest += src[i];
    }
}

void better(const int *src, int n, int *dest)
{
    int sum = *dest;
    for (int i = 0; i < n; ++i)
    {
        sum += src[i];
    }
    *dest = sum;
}

执行时间统计：

启动程序方式	第一次执行耗时(us)	第二次执行耗时(us)	第三次执行耗时(us)	第四次执行耗时(us)	第五次执行耗时(us)
./main_Og	poor:176463 better:67489	poor:172714 better:69189	poor:170618 better:67232	poor:176130 better:68082	poor:170447 better:69484
valgrind --tool=cachegrind ./main_Og	poor:2403126 better:1540885	poor:2401457 better:1543162	poor:2404605 better:1541964	poor:2410699 better:1546153	poor:2409636 better:1543493

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 2.5 倍左右。

从汇编的角度分析：

; With -Og optimization
; Dump of assembler code for function poor(int const*, int, int*):
   0x000000000040080c <+0>:    mov w3, #0x0                    // #0
   0x0000000000400810 <+4>:    cmp w3, w1
   0x0000000000400814 <+8>:    b.ge    0x400830 <poor(int const*, int, int*)+36>
   0x0000000000400818 <+12>:    ldr w5, [x2]
   0x000000000040081c <+16>:    ldr w4, [x0,w3,sxtw #2]
   0x0000000000400820 <+20>:    add w4, w5, w4
   0x0000000000400824 <+24>:    str w4, [x2]
   0x0000000000400828 <+28>:    add w3, w3, #0x1
   0x000000000040082c <+32>:    b   0x400810 <poor(int const*, int, int*)+4>
   0x0000000000400830 <+36>:    ret

; With -Og optimization
; Dump of assembler code for function better(int const*, int, int*):
   0x0000000000400834 <+0>:    ldr w4, [x2]
   0x0000000000400838 <+4>:    mov w3, #0x0                    // #0
   0x000000000040083c <+8>:    cmp w3, w1
   0x0000000000400840 <+12>:    b.ge    0x400854 <better(int const*, int, int*)+32>
   0x0000000000400844 <+16>:    ldr w5, [x0,w3,sxtw #2]
   0x0000000000400848 <+20>:    add w4, w4, w5
   0x000000000040084c <+24>:    add w3, w3, #0x1
   0x0000000000400850 <+28>:    b   0x40083c <better(int const*, int, int*)+8>
   0x0000000000400854 <+32>:    str w4, [x2]
   0x0000000000400858 <+36>:    ret

从汇编代码中可以看出：

使用-Og优化选项编译时，poor() 函数的 for 循环的一次迭代中：读内存操作数量为2，写内存操作数量为1。
使用-Og优化选项编译时，better() 函数的 for 循环的一次迭代中：读内存操作数量为1，写内存操作数量为0。better() 函数的内存读写数量较少，因此执行速度更快。

从 cache 性能的角度分析：

--------------------------------------------------------------------------------
I1 cache:         16384 B, 64 B, 4-way associative
D1 cache:         16384 B, 64 B, 4-way associative
LL cache:         262144 B, 64 B, 8-way associative
Command:          ./b_Og
Data file:        cachegrind.out.18226
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
           Ir  I1mr ILmr          Dr       D1mr       DLmr          Dw      D1mw      DLmw 
--------------------------------------------------------------------------------
1,993,889,018 1,004  864 315,036,517 13,121,792 13,115,591 209,899,117 6,555,946 6,555,006  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir I1mr ILmr          Dr      D1mr      DLmr          Dw      D1mw      DLmw  file:function
--------------------------------------------------------------------------------
838,860,804    0    0 209,715,200 6,553,601 6,553,601 104,857,600         0         0  /home/b.cpp:poor(int const*, int, int*)
629,145,606    0    0 104,857,601 6,553,601 6,553,601           1         1         1  /home/b.cpp:better(int const*, int, int*)
524,288,004    1    1           0         0         0 104,857,600 6,553,600 6,553,600  /home/b.cpp:init(int*, int)

输出结果解析：

poor() 函数的读内存操作的数量为 209,715,200（Dr 列），读内存操作在 L1 级缓存中不命中的数量为 6,553,601（D1mr 列），读内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 6,553,601（DLmr 列）。
poor() 函数的写内存操作的数量为 104,857,600（Dw 列），写内存操作在 L1 级缓存中不命中的数量为 0（D1mw 列），写内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 0（DLmw 列）。
better() 函数的读内存操作的数量为 104,857,601（Dr 列），读内存操作在 L1 级缓存中不命中的数量为 6,553,601（D1mr 列），读内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 6,553,601（DLmr 列）。
better() 函数的写内存操作的数量为 1（Dw 列），写内存操作在 L1 级缓存中不命中的数量为 1（D1mw 列），写内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 1（DLmw 列）。

从上述数据中可以看出：1）poor() 函数的读内存操作的数量为 better() 函数的两倍；2）poor() 函数的写内存操作的数量比 better() 函数多 104,857,599（104,857,599 = 104,857,600 - 1）。这也验证了better() 函数执行速度较快的原因。

按顺序访问数据

按数据在内存中排列先后顺序进行访问（读或写）时，通常会降低 cache 的缺失率，即减少访问内存的次数，从而执行速度更快。比如：C/C++ 中的多维数组是以行主序（存储完一行后再存储下一行）存储在内存中的，那么在循环中访问完一行后再访问下一行的方式更高效。Fortran 中的多维数组是以列主序（存储完一列后再存储下一列）存储在内存中的，那么在循环中访问完一列后再访问下一列的方式更高效。

示例：访问二维整型数组

// 按列访问数组元素
long long poor(const int *src, int rows, int cols)
{
    long long sum = 0;
    for (int i = 0; i < cols; ++i)
    {
        for (int j = 0; j < rows; ++j)
        {
            sum += *(src + j * cols + i);
        }
    }
    return sum;
}

// 按行访问数组元素
long long better(const int *src, int rows, int cols)
{
    long long sum = 0;
    for (int i = 0; i < rows; ++i)
    {
        for (int j = 0; j < cols; ++j)
        {
            sum += *(src + i * cols + j);
        }
    }
    return sum;
}

执行时间统计：

启动程序方式	第一次执行耗时(us)	第二次执行耗时(us)	第三次执行耗时(us)	第四次执行耗时(us)	第五次执行耗时(us)
./main_Og	poor:12575 better:2479	poor:12661 better:2240	poor:12687 better:2313	poor:12387 better:2241	poor:12493 better:2230
valgrind --tool=cachegrind ./main_Og	poor:101736 better:54882	poor:99056 better:56708	poor:96098 better:57031	poor:97853 better:57691	poor:95651 better:57502

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 5 倍左右。

统计 cache 使用情况：

--------------------------------------------------------------------------------
       Ir I1mr ILmr        Dr   D1mr   DLmr        Dw   D1mw  DLmw 

         .    .    .         .         .      .         .       .       .  long long poor(const int *src, int rows, int cols)
         .    .    .         .         .      .         .       .       .  {
         2    0    0         0         0      0         0       0       0      long long sum = 0;
     4,323    0    0         0         0      0         0       0       0      for (int i = 0; i < cols; ++i)
         .    .    .         .         .      .         .       .       .      {
 8,296,560    1    1         0         0      0         0       0       0          for (int j = 0; j < rows; ++j)
         .    .    .         .         .      .         .       .       .          {
14,515,200    0    0 2,073,600 2,073,600 76,630         0       0       0              sum += *(src + j * cols + i);
         .    .    .         .         .      .         .       .       .          }
         .    .    .         .         .      .         .       .       .      }
         .    .    .         .         .      .         .       .       .      return sum;
         1    0    0         1         1      1         0       0       0  }
         .    .    .         .         .      .         .       .       .  
         .    .    .         .         .      .         .       .       .  long long better(const int *src, int rows, int cols)
         .    .    .         .         .      .         .       .       .  {
         2    0    0         0         0      0         0       0       0      long long sum = 0;
     7,683    0    0         0         0      0         0       0       0      for (int i = 0; i < rows; ++i)
         .    .    .         .         .      .         .       .       .      {
 8,298,240    0    0         0         0      0         0       0       0          for (int j = 0; j < cols; ++j)
         .    .    .         .         .      .         .       .       .          {
14,515,200    0    0 2,073,600   129,601 74,916         0       0       0              sum += *(src + i * cols + j);
         .    .    .         .         .      .         .       .       .          }
         .    .    .         .         .      .         .       .       .      }
         .    .    .         .         .      .         .       .       .      return sum;
         1    0    0         1         1      1         0       0       0  }

输出结果解析：

poor() 和 better() 函数的读内存操作的数量是相同的。
poor() 和 better() 函数在 L1 级缓存的读数据操作不命中数量差距很大，前者的读数据操作不命中数量为后者的 16 倍。这正是 better() 函数执行速度快于 poor() 函数的原因。

按顺序存储同时要访问的数据

示例：

void poor(const int *src, int *dest, int n)
{
    for (int i = 0; i < n; ++i)
    {
        // 同时要访问的数据（src[i]、dest[i]）在两个数组中，即同时要访问的数据不是连续存储的
        dest[i] = src[i];
    }
}

struct Vec2
{
    Vec2()
    {
        static long long count = 0;
        a = count++;
    }
    int a;
    int b;
};

void better(struct Vec2 *data, int n)
{
    for (int i = 0; i < n; ++i)
    {
        // 同时要访问的数据（data[i].a、data[i].b）存储在同一个结构体中，即同时要访问的数据是连续存储的
        data[i].b = data[i].a;
    }
}

执行时间统计：

启动程序方式	第一次执行耗时(us)	第二次执行耗时(us)	第三次执行耗时(us)	第四次执行耗时(us)	第五次执行耗时(us)
./main_Og	poor:8591 better:2714	poor:4657 better:1936	poor:7424 better:3436	poor:8976 better:1937	poor:4866 better:1931
valgrind --tool=cachegrind ./main_Og	poor:60573 better:52037	poor:59379 better:51119	poor:60963 better:51360	poor:59742 better:51622	poor:64674 better:52330

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 2 倍左右。

统计 cache 使用情况：

--------------------------------------------------------------------------------
         Ir I1mr ILmr          Dr      D1mr      DLmr          Dw      D1mw      DLmw 

         .    .    .         .       .       .         .       .       .  void poor(const int *src, int *dest, int n)
         .    .    .         .       .       .         .       .       .  {
 8,294,404    0    0         1       1       1         0       0       0      for (int i = 0; i < n; ++i)
         .    .    .         .       .       .         .       .       .      {
 6,220,800    0    0 2,073,600 129,601 129,601 2,073,600 129,601 129,601          dest[i] = src[i];
         .    .    .         .       .       .         .       .       .      }
         .    .    .         .       .       .         .       .       .  }
         .    .    .         .       .       .         .       .       .  
         .    .    .         .       .       .         .       .       .  struct Vec2
         .    .    .         .       .       .         .       .       .  {
         .    .    .         .       .       .         .       .       .      Vec2()
         .    .    .         .       .       .         .       .       .      {
         .    .    .         .       .       .         .       .       .          static long long count = 0;
 8,294,400    0    0 2,073,600       1       1 4,147,200 259,200 259,200          a = count++;
         .    .    .         .       .       .         .       .       .      }
         .    .    .         .       .       .         .       .       .      int a;
         .    .    .         .       .       .         .       .       .      int b;
         .    .    .         .       .       .         .       .       .  };
         .    .    .         .       .       .         .       .       .  
         .    .    .         .       .       .         .       .       .  void better(struct Vec2 *data, int n)
         .    .    .         .       .       .         .       .       .  {
 8,294,404    0    0         1       1       1         0       0       0      for (int i = 0; i < n; ++i)
         .    .    .         .       .       .         .       .       .      {
 8,294,400    0    0 2,073,600 259,201 259,201 2,073,600       0       0          data[i].b = data[i].a;
         .    .    .         .       .       .         .       .       .      }
         .    .    .         .       .       .         .       .       .  }

输出结果解析：

poor() 和 better() 函数的内存读操作的数量是相同的，而 better() 函数的 D1mr、DLmr 比 poor() 函数分别多 129600 次、129600 次。
poor() 和 better() 函数的内存写操作的数量是相同的，而 poor() 函数的 D1mw、DLmw 比 better() 函数分别多 129601 次、129601 次。
从内存读和写操作的不命中总数量来看，两者是相同的。但为什么 better() 函数的执行速度比 poor() 函数快 2 倍？

下一篇：性能优化篇（2）：小心“STL 低效率用法”所带来的性能开销

首页

性能优化篇（1）：几种简单的访存优化手段

目录

摘要

减少不必要的内存引用

按顺序访问数据

按顺序存储同时要访问的数据