Performance Optimization: Compiler Optimizations

This post summarizes the compiler-level performance optimization techniques I have used at work.

1. Profile-Guided Optimization (PGO)

This post uses bubble sort as the example to show how PGO is used. The bubble sort code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <utility>  // std::swap
#include <vector>

void bubble_sort(std::vector<int> &nums)
{
    int n = nums.size();
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {
            if (nums[j] < nums[i]) {
                std::swap(nums[i], nums[j]);
            }
        }
    }
}

int main()
{
    srand(time(nullptr));
    int n = 30000;
    std::vector<int> nums(n);
    for (int i = 0; i < n; ++i) {
        nums[i] = rand();
    }
    bubble_sort(nums);
    return 0;
}

Compile with -fprofile-generate, then run the program as a training run to produce the profile (.gcda files).

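Note: training is usually done in an environment close to real production. The .gcda files are normally written only when the program exits, so for a long-running process you have to flush them by hand once it has trained long enough, for example by running call (void)__gcov_flush() from an attached gdb session. Otherwise no .gcda file is generated.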
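For a long-running service, the same flush can be triggered from inside the process instead of from a debugger. The following is only a sketch under several assumptions: PGO_TRAINING is a hypothetical macro you would define yourself for the training build (together with -fprofile-generate), the toolchain's libgcov is assumed to provide __gcov_dump (older GCC releases expose __gcov_flush, as in the note above), and SIGUSR1 assumes a POSIX system. The signal handler only sets a flag, and the actual dump happens from normal code.

```cpp
#include <csignal>

#if defined(PGO_TRAINING)
// Assumption: provided by libgcov when building with -fprofile-generate.
extern "C" void __gcov_dump(void);
#endif

// Writes the counters collected so far; a no-op in non-training builds.
bool dump_profile_now()
{
#if defined(PGO_TRAINING)
    __gcov_dump();
    return true;
#else
    return false;
#endif
}

// Set by the signal handler, consumed by the main loop.
static volatile std::sig_atomic_t g_dump_requested = 0;

extern "C" void on_dump_signal(int)
{
    g_dump_requested = 1;  // only set a flag inside the handler
}

void install_dump_signal()
{
    std::signal(SIGUSR1, on_dump_signal);
}

// Call this periodically from the main loop.
// Returns true if a pending dump request was handled.
bool poll_profile_dump()
{
    if (!g_dump_requested) {
        return false;
    }
    g_dump_requested = 0;
    dump_profile_now();
    return true;
}
```

With this in place, the training binary would call install_dump_signal() at startup and poll_profile_dump() in its main loop, and an operator could request a dump with kill -USR1 on the process.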

Recompile using the profile (*.gcda files) by passing -fprofile-use.

Disassemble the binaries:

Disassembly of the original function:

0000000000400740 <_Z11bubble_sortRSt6vectorIiSaIiEE>:
  400740:   4c 8b 07                mov    (%rdi),%r8
  400743:   48 8b 47 08             mov    0x8(%rdi),%rax
  400747:   4c 29 c0                sub    %r8,%rax
  40074a:   48 c1 f8 02             sar    $0x2,%rax
  40074e:   89 c7                   mov    %eax,%edi
  400750:   85 c0                   test   %eax,%eax
  400752:   7e 44                   jle    400798 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x58>
  400754:   4c 89 c6                mov    %r8,%rsi
  400757:   44 8d 50 ff             lea    -0x1(%rax),%r10d
  40075b:   45 31 c9                xor    %r9d,%r9d
  40075e:   66 90                   xchg   %ax,%ax
  400760:   4c 89 c8                mov    %r9,%rax
  400763:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400768:   41 8b 0c 80             mov    (%r8,%rax,4),%ecx
  40076c:   8b 16                   mov    (%rsi),%edx
  40076e:   39 d1                   cmp    %edx,%ecx
  400770:   7d 06                   jge    400778 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x38>
  400772:   89 0e                   mov    %ecx,(%rsi)
  400774:   41 89 14 80             mov    %edx,(%r8,%rax,4)
  400778:   48 83 c0 01             add    $0x1,%rax
  40077c:   39 c7                   cmp    %eax,%edi
  40077e:   7f e8                   jg     400768 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x28>
  400780:   49 8d 41 01             lea    0x1(%r9),%rax
  400784:   48 83 c6 04             add    $0x4,%rsi
  400788:   4d 39 d1                cmp    %r10,%r9
  40078b:   74 0b                   je     400798 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x58>
  40078d:   49 89 c1                mov    %rax,%r9
  400790:   eb ce                   jmp    400760 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x20>
  400792:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400798:   c3                      retq   
  400799:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

Disassembly after PGO:

0000000000400650 <_Z11bubble_sortRSt6vectorIiSaIiEE>:
  400650:   48 8b 37                mov    (%rdi),%rsi
  400653:   48 8b 47 08             mov    0x8(%rdi),%rax
  400657:   45 31 c9                xor    %r9d,%r9d
  40065a:   48 29 f0                sub    %rsi,%rax
  40065d:   4c 8d 56 04             lea    0x4(%rsi),%r10
  400661:   48 c1 f8 02             sar    $0x2,%rax
  400665:   41 89 c3                mov    %eax,%r11d
  400668:   44 8d 40 ff             lea    -0x1(%rax),%r8d
  40066c:   45 39 cb                cmp    %r9d,%r11d
  40066f:   7e 42                   jle    4006b3 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x63>
  400671:   44 89 c2                mov    %r8d,%edx
  400674:   48 89 f0                mov    %rsi,%rax
  400677:   44 29 ca                sub    %r9d,%edx
  40067a:   4c 01 ca                add    %r9,%rdx
  40067d:   49 8d 3c 92             lea    (%r10,%rdx,4),%rdi
  400681:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  400688:   8b 0e                   mov    (%rsi),%ecx
  40068a:   8b 10                   mov    (%rax),%edx
  40068c:   39 ca                   cmp    %ecx,%edx
  40068e:   7d 18                   jge    4006a8 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x58>
  400690:   89 16                   mov    %edx,(%rsi)
  400692:   48 83 c0 04             add    $0x4,%rax
  400696:   89 48 fc                mov    %ecx,-0x4(%rax)
  400699:   48 39 f8                cmp    %rdi,%rax
  40069c:   75 ea                   jne    400688 <_Z11bubble_sortRSt6vectorIiSaIiEE+0x38>
  40069e:   49 83 c1 01             add    $0x1,%r9
  4006a2:   48 83 c6 04             add    $0x4,%rsi
  4006a6:   eb c4                   jmp    40066c <_Z11bubble_sortRSt6vectorIiSaIiEE+0x1c>
  4006a8:   48 83 c0 04             add    $0x4,%rax
  4006ac:   48 39 f8                cmp    %rdi,%rax
  4006af:   75 d9                   jne    40068a <_Z11bubble_sortRSt6vectorIiSaIiEE+0x3a>
  4006b1:   eb eb                   jmp    40069e <_Z11bubble_sortRSt6vectorIiSaIiEE+0x4e>
  4006b3:   c3                      retq
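
Comparing disassembly shows that the code changed, but not whether it got faster. A small timing harness along the following lines could be built once as a baseline and once with PGO to measure the difference. This is a sketch, not the author's benchmark: the file name bench.cpp, the struct TimedRun, and the function time_bubble_sort are made up for illustration, and a fixed seed is used so that every run sorts the same data.

```cpp
// Build sketch (GCC; file and binary names are hypothetical):
//   g++ -O2 bench.cpp -o bench_base                 # baseline
//   g++ -O2 -fprofile-generate bench.cpp -o bench   # instrumented build
//   ./bench                                         # training run, writes .gcda
//   g++ -O2 -fprofile-use bench.cpp -o bench        # PGO build, same output name
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Same sort as in the article.
void bubble_sort(std::vector<int> &nums)
{
    int n = static_cast<int>(nums.size());
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {
            if (nums[j] < nums[i]) {
                std::swap(nums[i], nums[j]);
            }
        }
    }
}

struct TimedRun {
    double millis;  // wall-clock time spent in bubble_sort
    bool sorted;    // sanity check on the result
};

// Sorts n pseudo-random values generated from a fixed seed and times the sort.
TimedRun time_bubble_sort(int n, unsigned seed)
{
    std::mt19937 gen(seed);
    std::vector<int> nums(static_cast<std::size_t>(n));
    for (int &v : nums) {
        v = static_cast<int>(gen() & 0x7fffffffu);
    }

    auto start = std::chrono::steady_clock::now();
    bubble_sort(nums);
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return TimedRun{elapsed.count(), std::is_sorted(nums.begin(), nums.end())};
}
```

A main function would call time_bubble_sort(30000, 42u) and print the millis field. Running it several times per build and comparing the medians gives a more stable picture than a single run.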

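Drawbacks of this approach:
1. It needs training data, and it is hard to guarantee that the training data matches the real production workload.
2. Generating the profile is expensive.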
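
To see why the first drawback matters, note that the profile records how often each branch is taken, and that depends on the input. The sketch below counts how often the swap branch of the article's sort fires for two different inputs. The helper names count_swaps, ascending_input, and descending_input are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Runs the article's sort on a copy of the input and reports how often
// the swap branch was taken.
std::size_t count_swaps(std::vector<int> nums)
{
    std::size_t swaps = 0;
    int n = static_cast<int>(nums.size());
    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {
            if (nums[j] < nums[i]) {
                std::swap(nums[i], nums[j]);
                ++swaps;
            }
        }
    }
    return swaps;
}

// 0, 1, 2, ..., n-1: already sorted, so the swap branch never fires.
std::vector<int> ascending_input(int n)
{
    std::vector<int> v(static_cast<std::size_t>(n));
    std::iota(v.begin(), v.end(), 0);
    return v;
}

// n-1, ..., 1, 0: reversed, so the swap branch fires repeatedly.
std::vector<int> descending_input(int n)
{
    std::vector<int> v(static_cast<std::size_t>(n));
    std::iota(v.rbegin(), v.rend(), 0);
    return v;
}
```

A profile trained only on already-sorted input would tell the compiler that the swap branch is never taken, which is the wrong hint if production data arrives in random or reversed order.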

The second approach: AutoFDO
1. Collect a profile with perf record -b
2. Convert it with the autofdo tool (available on GitHub)
create_gcov --binary=xxx --profile=perf.data --gcov=xxx.gcov --gcov-version=1
3. g++ xxx.cpp -g -O2 -fauto-profile=xxx.gcov -o xxx

2. Compiler option tuning

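Some compiler options improve performance directly. This post covers the inlining options used in our project, which gave a noticeable gain in the real production environment.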

1. Add the static qualifier to functions that are only called within their own file, and enable -finline-functions-called-once
2. -finline-functions
3. -finline-small-functions
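
As an illustration of the first item, here is a minimal sketch of a file-local helper with a single call site. The function names clamp_to_byte and sum_clamped are made up for this example. Because the helper is static, the compiler knows that no other translation unit can call it, so with -finline-functions-called-once GCC can fold it into its only caller and, as far as I understand the option, need not emit a separate copy of the function.

```cpp
#include <vector>

// static gives the helper internal linkage: it is visible only in this file.
static int clamp_to_byte(int v)
{
    if (v < 0) {
        return 0;
    }
    if (v > 255) {
        return 255;
    }
    return v;
}

// The only call site of clamp_to_byte.
int sum_clamped(const std::vector<int> &values)
{
    int total = 0;
    for (int v : values) {
        total += clamp_to_byte(v);
    }
    return total;
}
```

Whether the call was actually inlined can be checked the same way as in the PGO section, by disassembling the object file and looking for a remaining call to clamp_to_byte.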

coderocku
