PERF
perf is a tool to profile a process using Hardware Performance Counters. Each counter counts events in the CPU such as cycles, executed instructions, loads from a given level of the memory caches, branches…
perf list
lists the events that are predefined inside the tool.
The easiest way to use perf is to profile the whole application (say ./a.out) using a default set of events:
perf stat --delay 0 -d ./a.out
or even, for a more detailed set of events,
perf stat --delay 0 -d -d -d ./a.out
One can choose a set of events and list them on the command line, for example:
perf stat --delay 0 -e task-clock -e cycles -e instructions \
-e branch-instructions -e branch-misses \
-e idle-cycles-frontend -e idle-cycles-backend \
-e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses ./a.out
The doPerf script is also available: it runs perf on an executable, counting those events.
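The raw counts reported for these events are usually easier to interpret as ratios. As a minimal sketch (the struct and function names below are illustrative and are not part of perf or of doPerf), the most common derived metrics can be computed as:

```cpp
#include <cstdint>

// Raw event counts as printed by `perf stat` (hypothetical container,
// to be filled in by hand from the report).
struct PerfCounts {
  std::uint64_t cycles;
  std::uint64_t instructions;
  std::uint64_t branches;
  std::uint64_t branchMisses;
  std::uint64_t cacheReferences;
  std::uint64_t cacheMisses;
};

// Ratio helper that returns 0 instead of dividing by zero.
inline double ratio(std::uint64_t num, std::uint64_t den) {
  return den == 0 ? 0.0 : static_cast<double>(num) / static_cast<double>(den);
}

// Instructions executed per cycle.
inline double ipc(const PerfCounts& c) {
  return ratio(c.instructions, c.cycles);
}

// Fraction of branches that were mispredicted.
inline double branchMissRate(const PerfCounts& c) {
  return ratio(c.branchMisses, c.branches);
}

// Fraction of cache references that missed.
inline double cacheMissRate(const PerfCounts& c) {
  return ratio(c.cacheMisses, c.cacheReferences);
}
```

For instance, a made-up run with 2000 instructions in 1000 cycles gives an IPC of 2.0. perf stat prints some of these ratios itself (for example "insn per cycle") next to the raw counts, so the helpers are mainly useful when post-processing several runs.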
Exercise
Architecture: Front-end
Consider branchPredictor.cpp. Compile it with
g++ -Wall -g -march=native branchPredictor.cpp -o branchPredictor -O2 -funroll-all-loops
and measure its performance with
perf stat -d --delay 0 ./branchPredictor
Try to enable and disable sorting. What happens? Explain the behaviour. Modify the code to make it “branchless”.
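As a hint for the last step, here is a minimal sketch of what “branchless” means. It assumes that the hot loop sums the elements of an array that are at or above a threshold; this is an assumption made for illustration and may not match what branchPredictor.cpp actually does. The function names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Version with a data-dependent branch: on unsorted input the outcome of
// the comparison is hard for the branch predictor to guess.
std::int64_t sumBranchy(const std::vector<int>& data, int threshold) {
  std::int64_t sum = 0;
  for (int v : data) {
    if (v >= threshold) sum += v;
  }
  return sum;
}

// Branchless version: the comparison result (0 or 1) is used as a
// multiplier, so the selection is expressed as arithmetic and the compiler
// does not need a conditional jump that depends on the data.
std::int64_t sumBranchless(const std::vector<int>& data, int threshold) {
  std::int64_t sum = 0;
  for (int v : data) {
    sum += static_cast<std::int64_t>(v >= threshold) * v;
  }
  return sum;
}
```

Both functions return the same value; only the generated code differs. Depending on the optimisation level, the compiler may already turn the first version into a conditional move or vectorise it, so it is worth checking the branch-misses count reported by perf stat rather than assuming.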
toplev
Use backend.cpp. Compile it with
g++ -Wall -g -march=native
and different compiler options (-O2, -O3, -Ofast, -funroll-loops), measure the performance, identify the “hotspot”, and modify the code to speed it up.
You can also try the toplev analysis. Install pmu-tools with:
git clone https://github.com/andikleen/pmu-tools.git
cd pmu-tools/
export PATH=$PATH:$(pwd)
Then run your program with toplev.
You can set the affinity of a program by using taskset. For example,
taskset -c 0 ./out
will force the program to run on core 0.
Typical invocations of toplev are
toplev --single-thread ./out
or
toplev --all --core C0 taskset -c 0,1 program
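taskset sets the affinity from outside the program; the same effect can be obtained from inside it. The following is a minimal sketch, assuming Linux with glibc (pinToCore is a hypothetical helper name, not part of pmu-tools or perf):

```cpp
#include <sched.h>  // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET

// Pin the calling thread to one logical CPU. Returns true on success.
// These are glibc extensions; g++ defines _GNU_SOURCE by default, which
// makes them visible.
bool pinToCore(int core) {
  if (core < 0 || core >= CPU_SETSIZE) return false;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  // A pid of 0 means "the calling thread".
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Calling pinToCore(0) at the start of main would then play the role of taskset -c 0, provided core 0 is among the CPUs the process is allowed to use; threads created afterwards inherit the mask.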
A wrapper defining more user-friendly names for Intel counters can be downloaded as part of pmu-tools.
Try:
ocperf.py list
to get a list of all available counters (and their meaning). The actual names of the counters keep changing, so for a detailed analysis one has to tailor the events to the actual hardware…
For an example (tailored to the Skylake-avx512 machine available for the exercise), see doOCPerfSX.
The “TopDown” methodology and the use of the toplev tool are documented in https://github.com/andikleen/pmu-tools/wiki/toplev-manual and in the excellent slides by Ahmad Yasin himself: http://www.cs.technion.ac.il/~erangi/TMA_using_Linux_perf__Ahmad_Yasin.pdf
Further information
For large applications, more details can be obtained by running
perf record
which produces a file (perf.data by default) containing all sampled events and their location in the application.
perf record --call-graph=dwarf
will produce a full call-graph. On more recent Intel hardware (since Haswell) one can use
perf record --call-graph=lbr
which is faster and produces a more compact report.
perf report
can be used to display the detailed profile.
To identify performance “hotspots” at code level, compile with -g and perf report will interleave source code with assembler.
perf record/report is well documented in
https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report
Another interesting read is https://stackoverflow.com/questions/27742462/understanding-linux-perf-report-output