libcpucycles-20240114/
libcpucycles-20240114/Makefile
default:
	cd build && $(MAKE)

install:
	cd build && $(MAKE) install

clean:
	cd build && $(MAKE) clean
libcpucycles-20240114/autogen/
libcpucycles-20240114/autogen/html
#!/usr/bin/env python3

import os
import datetime
import markdown

def load(fn):
  with open(fn) as f:
    return f.read()

style = load('autogen/html-style')
sitetitle = load('autogen/html-title')

files = []
with open('autogen/html-files') as f:
  for line in f:
    line = line.strip()
    line = line.split(':')
    if len(line) != 3: continue
    files += [line]

for md,html,pagetitle in files:
  fnmd = 'doc/%s.md' % md
  fnhtml = 'doc/html/%s.html' % html

  output = ''

  x = load(fnmd)
  x = markdown.markdown(x,extensions=['markdown.extensions.extra','markdown.extensions.tables'])
  mtime = datetime.datetime.utcfromtimestamp(os.path.getmtime(fnmd)).strftime('%Y.%m.%d')

  output += '\n\n'
  output += style
  output += '
cpucycles - count CPU cycles
#include <cpucycles.h>
long long count = cpucycles();
long long persecond = cpucycles_persecond();
const char *implementation = cpucycles_implementation();
const char *version = cpucycles_version();
Link with -lcpucycles. Old systems may also need -lrt.
cpucycles() returns an estimate for the number of CPU cycles that have occurred since an unspecified time in the past (perhaps system boot, perhaps program startup).
Accessing true cycle counters can be difficult on some CPUs and operating systems. cpucycles() does its best to produce accurate results, but selects a low-precision counter if the only other option is failure.
cpucycles_persecond() returns an estimate for the number of CPU cycles per second. This estimate comes from /etc/cpucyclespersecond if that file exists, otherwise from various OS mechanisms, otherwise from the cpucyclespersecond environment variable if that is set, otherwise 2399987654.
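The fallback order described for cpucycles_persecond() can be modelled in a few lines. This is only a sketch, not the library's code: the function name estimate_persecond, the os_probes parameter (standing in for the unspecified "various OS mechanisms"), and the handling of an unreadable or malformed file are all assumptions made for illustration.

```python
import os

DEFAULT_PERSECOND = 2399987654  # last-resort value quoted in the text


def estimate_persecond(path="/etc/cpucyclespersecond", os_probes=(), environ=None):
    """Model of the documented fallback order (hypothetical helper)."""
    environ = os.environ if environ is None else environ

    # 1. The system-wide file, if it exists. Treating an unparsable file
    #    as "absent" is an assumption of this sketch.
    try:
        with open(path) as f:
            value = int(f.read().strip())
        if value > 0:
            return value
    except (OSError, ValueError):
        pass

    # 2. OS mechanisms, modelled as callables returning a number or None.
    for probe in os_probes:
        value = probe()
        if value:
            return int(value)

    # 3. The cpucyclespersecond environment variable, if set.
    try:
        value = int(environ.get("cpucyclespersecond", ""))
        if value > 0:
            return value
    except ValueError:
        pass

    # 4. Hard-coded default.
    return DEFAULT_PERSECOND
```

With no file, no probes, and an empty environment, this returns 2399987654, the same persecond value that appears in the cfarm23, cfarm91, and cfarm92 outputs in this document.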
cpucycles_implementation() returns the name of the counter in use: e.g., "amd64-pmc".
cpucycles_version() returns the libcpucycles version number as a string: e.g., "20240114". Results of cpucycles_implementation() should be interpreted relative to cpucycles_version().
cpucycles is actually a function pointer. The first call to cpucycles() or cpucycles_persecond() or cpucycles_implementation() selects one of the available counters and updates the cpucycles pointer accordingly. Subsequent calls to cpucycles() are thread-safe.
gettimeofday(2), clock_gettime(2)
Currently libcpucycles supports the following cycle counters. Some cycle counters are actually other forms of counters that libcpucycles scales to imitate a cycle counter. There is separate documentation for how libcpucycles makes a choice of cycle counter. See also security considerations regarding enabling or disabling counters and regarding Turbo Boost.
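As a rough illustration of what "scales to imitate a cycle counter" can mean, the sketch below multiplies raw ticks of a fixed-frequency clock by the ratio of the assumed cycles per second to the clock's ticks per second. The function name scale_to_cycles and the assumption that the scale factor is exactly this ratio are illustrative, not taken from the library source.

```python
from fractions import Fraction


def scale_to_cycles(ticks, ticks_per_second, cycles_per_second):
    """Convert raw ticks of a fixed-frequency clock into estimated cycles."""
    # Fraction keeps the ratio exact; the result is truncated to an integer.
    return int(ticks * Fraction(cycles_per_second, ticks_per_second))


# One tick of a 24 MHz clock, with 2064000000 cycles per second assumed:
print(scale_to_cycles(1, 24_000_000, 2_064_000_000))  # 86

# 1000 microseconds from a microsecond clock at 3100000000 cycles per second:
print(scale_to_cycles(1000, 1_000_000, 3_100_000_000))  # 3100000
```

The first case matches the cfarm103 output in this document, where arm64-vct reports scaling 86.000000 with persecond 2064000000. A scaled counter can only advance in steps of the scale factor, which is consistent with the +86 and +0 entries in that machine's median line.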
amd64-pmc: Requires a 64-bit Intel/AMD platform. Requires the Linux perf_event interface. Accesses a cycle counter through RDPMC. Requires /proc/sys/kernel/perf_event_paranoid to be at most 2 for user-level RDPMC access. This counter runs at the clock frequency of the CPU core.
amd64-tsc, amd64-tscasm: Requires a 64-bit Intel/AMD platform. Requires RDTSC to be enabled, which it is by default. Uses RDTSC to access the CPU's time-stamp counter. On current CPUs, this is an off-core clock rather than a cycle counter, but it is typically a very fast off-core clock, making it adequate for seeing cycle counts if overclocking and underclocking are disabled. The difference between tsc and tscasm is that tsc uses the compiler's __rdtsc() while tscasm uses inline assembly.
arm32-cortex: Requires a 32-bit ARMv7-A platform. Uses mrc p15, 0, %0, c9, c13, 0 to read the cycle counter. Requires user access to the cycle counter, which is not enabled by default but can be enabled under Linux via a kernel module. This counter is natively 32 bits, but libcpucycles watches how the counter and gettimeofday increase to compute a 64-bit extension of the counter.
arm32-1176: Requires a 32-bit ARM1176 platform. Uses mrc p15, 0, %0, c15, c12, 1 to read the cycle counter. Requires user access to the cycle counter, which is not enabled by default but can be enabled under Linux via a kernel module. This counter is natively 32 bits, but libcpucycles watches how the counter and gettimeofday increase to compute a 64-bit extension of the counter.
arm64-pmc: Requires a 64-bit ARMv8-A platform. Uses mrs %0, PMCCNTR_EL0 to read the cycle counter. Requires user access to the cycle counter, which is not enabled by default but can be enabled under Linux via a kernel module.
arm64-vct: Requires a 64-bit ARMv8-A platform. Uses mrs %0, CNTVCT_EL0 to read a "virtual count" timer. This is an off-core clock, typically running at 24MHz. Results are scaled by libcpucycles.
mips64-cc: Requires a 64-bit MIPS platform. (Maybe the same code would also work as mips32-cc, but this has not been tested yet.) Uses RDHWR to read the hardware cycle counter (hardware register 2 times a constant scale factor in hardware register 3). This counter is natively 32 bits, but libcpucycles watches how the counter and gettimeofday increase to compute a 64-bit extension of the counter.
ppc32-mftb: Requires a 32-bit PowerPC platform. Uses mftb and mftbu to read the "time base". This is an off-core clock, typically running at 24MHz.
ppc64-mftb: Requires a 64-bit PowerPC platform. Uses mftb and mftbu to read the "time base". This is an off-core clock, typically running at 24MHz.
riscv32-rdcycle: Requires a 32-bit RISC-V platform. Uses rdcycle and rdcycleh to read a cycle counter.
riscv64-rdcycle: Requires a 64-bit RISC-V platform. Uses rdcycle to read a cycle counter.
s390x-stckf: Requires a 64-bit z/Architecture platform. Uses stckf to read the TOD clock, which is documented to run at 4096MHz. On the z15, this looks like a doubling of an off-core 2048MHz clock. Results are scaled by libcpucycles.
sparc64-rdtick: Requires a 64-bit SPARC platform. Uses rd %tick to read a cycle counter.
x86-tsc, x86-tscasm: Same as amd64-tsc and amd64-tscasm, but for 32-bit Intel/AMD platforms instead of 64-bit Intel/AMD platforms.
default-gettimeofday: Reasonably portable. Resolution is limited to 1 microsecond. Results are scaled by libcpucycles.
default-mach: Requires an OS with mach_absolute_time(). Typically runs at 24MHz. Results are scaled by libcpucycles.
default-monotonic: Requires CLOCK_MONOTONIC. Reasonably portable, although might fail on older systems where default-gettimeofday works. Resolution is limited to 1 nanosecond. Can be almost as good as a cycle counter, or orders of magnitude worse, depending on the OS and CPU. Results are scaled by libcpucycles.
default-perfevent: Requires the Linux perf_event interface, and a CPU where perf_event supports PERF_COUNT_HW_CPU_CYCLES. Similar variations in quality to default-monotonic, without the 1-nanosecond limitation.
default-zero: The horrifying last resort if nothing else works.
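Several of the counters above (arm32-cortex, arm32-1176, mips64-cc) are natively 32 bits and are extended to 64 bits with help from a wall clock. One plausible way to do this is sketched below, assuming a known cycles-per-second estimate: the wall-clock interval predicts roughly how far the counter should have advanced, which tells how many 32-bit wraparounds were missed between readings. The class name Extender32 and this particular rounding rule are assumptions for illustration and are not taken from the libcpucycles source.

```python
MASK32 = (1 << 32) - 1


class Extender32:
    """Extend a wrapping 32-bit counter to 64 bits using wall-clock time."""

    def __init__(self, persecond):
        self.persecond = persecond  # assumed cycles per second
        self.total = None           # extended count so far
        self.last_raw = None        # previous 32-bit reading
        self.last_time = None       # previous wall-clock time in seconds

    def update(self, raw, now):
        raw &= MASK32
        if self.total is None:
            self.total = raw
        else:
            # Advance visible in the low 32 bits (handles a single wrap).
            low_delta = (raw - self.last_raw) & MASK32
            # Advance predicted by the wall clock.
            expected = (now - self.last_time) * self.persecond
            # Whole wraps needed to bring low_delta close to the prediction.
            wraps = max(0, round((expected - low_delta) / 2**32))
            self.total += low_delta + (wraps << 32)
        self.last_raw, self.last_time = raw, now
        return self.total
```

For example, with persecond 1000000000, a reading of 0xFFFFFF00 followed immediately by 0x00000100 extends to 0x100000100. A further reading taken 10 seconds later accounts for the two full wraps that happened in between, provided the frequency estimate is accurate to well under half a wrap over that interval.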
These are examples of cpucycles-info output on various machines. The machines named cfarm* are from the GCC Compile Farm.
A median line saying, e.g., 47 +47+28+0+2-5+0+2-5... means that the differences between adjacent cycle counts were 47+47, 47+28, 47+0, 47+2, 47−5, 47+0, 47+2, 47−5, etc., with median difference 47. The first few differences are typically larger because of cache effects.
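The median-line encoding just described is easy to undo mechanically. The following sketch parses such a line back into the actual differences; the helper name decode_median_line is made up for this example, and it assumes the line has exactly the shape shown in the outputs in this document.

```python
import re


def decode_median_line(line):
    """Return (median, differences) for a cpucycles-info median line."""
    fields = line.split()
    # Accept the line with or without the leading "cpucycles median" words.
    if fields[:2] == ["cpucycles", "median"]:
        fields = fields[2:]
    median = int(fields[0])
    # Each signed token is an offset from the median.
    offsets = [int(tok) for tok in re.findall(r"[+-]\d+", fields[1])]
    return median, [median + off for off in offsets]


median, diffs = decode_median_line("cpucycles median 47 +47+28+0+2-5+0+2-5")
print(median)  # 47
print(diffs)   # [94, 75, 47, 49, 42, 47, 49, 42]
```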
berry0, Broadcom BCM2835:
cpucycles version 20240114
cpucycles tracesetup 0 arm32-cortex precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 1 arm32-1176 precision 22 scaling 1.000000 only32 1
cpucycles tracesetup 2 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-monotonic precision 1199 scaling 1.000000 only32 0
cpucycles tracesetup 5 default-gettimeofday precision 1200 scaling 1000.000000 only32 0
cpucycles tracesetup 6 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 1000000000
cpucycles implementation arm32-1176
cpucycles median 720 +942+124+1+0+2+0+0+0+0+0+0+0+0+0+0+0+0+0+0+1+2+0+0+2+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+222+300+1+0+0+2+0+0+0+0+0+0+0+0+0+0+0+0+0
cpucycles observed persecond 798307692...2045181819 with 1024 loops 12 microseconds
cpucycles observed persecond 915478260...1260523810 with 2048 loops 22 microseconds
cpucycles observed persecond 947809523...1106100000 with 4096 loops 41 microseconds
cpucycles observed persecond 966353658...1129037500 with 8192 loops 81 microseconds
cpucycles observed persecond 988490566...1030019109 with 16384 loops 158 microseconds
cpucycles observed persecond 995169327...1002034063 with 32768 loops 2379 microseconds
cpucycles observed persecond 996871019...1012568691 with 65536 loops 627 microseconds
cpucycles observed persecond 997832134...1004212170 with 131072 loops 1250 microseconds
cpucycles observed persecond 997740918...1000887780 with 262144 loops 5009 microseconds
cpucycles observed persecond 998528349...1001961164 with 524288 loops 5537 microseconds
cpucycles observed persecond 999202882...1001166794 with 1048576 loops 10547 microseconds
pi3aplus, Broadcom BCM2837B0:
cpucycles version 20230105
cpucycles tracesetup 0 arm64-pmc precision 9 scaling 1.000000 only32 0
cpucycles tracesetup 1 arm64-vct precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 2 default-perfevent precision 189 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-monotonic precision 272 scaling 1.400000 only32 0
cpucycles tracesetup 5 default-gettimeofday precision 1600 scaling 1400.000000 only32 0
cpucycles tracesetup 6 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 1400000000
cpucycles implementation arm64-pmc
cpucycles median 10 +10+8+3+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0
cpucycles observed persecond 1032000000...4224666667 with 1024 loops 4 microseconds
cpucycles observed persecond 1286000000...1756000000 with 2048 loops 7 microseconds
cpucycles observed persecond 1368266666...1598000000 with 4096 loops 14 microseconds
cpucycles observed persecond 1366700000...1473428572 with 8192 loops 29 microseconds
cpucycles observed persecond 1366100000...1417534483 with 16384 loops 59 microseconds
cpucycles observed persecond 1332739837...1357132232 with 32768 loops 122 microseconds
cpucycles observed persecond 1354483471...1366945834 with 65536 loops 241 microseconds
cpucycles observed persecond 1385684989...1392195330 with 131072 loops 472 microseconds
cpucycles observed persecond 1347223021...1350328528 with 262144 loops 972 microseconds
cpucycles observed persecond 1375460125...1377069853 with 524288 loops 1905 microseconds
cpucycles observed persecond 1376527697...1377335961 with 1048576 loops 3808 microseconds
bblack, TI Sitara XAM3359AZCZ100:
cpucycles version 20230105
cpucycles tracesetup 0 arm32-cortex precision 8 scaling 1.000000 only32 1
cpucycles tracesetup 1 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 1283 scaling 1.000000 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 1200 scaling 1000.000000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 1000000000
cpucycles implementation arm32-cortex
cpucycles median 1260 +1506+62+31+7+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+13+7+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0
cpucycles observed persecond 622181818...2101888889 with 1024 loops 10 microseconds
cpucycles observed persecond 806133333...1492615385 with 2048 loops 14 microseconds
cpucycles observed persecond 879880000...1232565218 with 4096 loops 24 microseconds
cpucycles observed persecond 939577777...1130581396 with 8192 loops 44 microseconds
cpucycles observed persecond 956954022...1050047059 with 16384 loops 86 microseconds
cpucycles observed persecond 982878542...1020685715 with 32768 loops 246 microseconds
cpucycles observed persecond 988105105...1012217523 with 65536 loops 332 microseconds
cpucycles observed persecond 993752077...1007159723 with 131072 loops 721 microseconds
cpucycles observed persecond 995364296...1004009448 with 262144 loops 1377 microseconds
cpucycles observed persecond 998216306...1001821536 with 524288 loops 2685 microseconds
cpucycles observed persecond 998991848...1000914196 with 1048576 loops 5397 microseconds
hiphop, Intel Xeon E3-1220 v3:
cpucycles version 20230105
cpucycles tracesetup 0 amd64-pmc precision 40 scaling 1.000000 only32 0
cpucycles tracesetup 1 amd64-tsc precision 124 scaling 1.000000 only32 0
cpucycles tracesetup 2 amd64-tscasm precision 124 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-perfevent precision 160 scaling 1.000000 only32 0
cpucycles tracesetup 4 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 5 default-monotonic precision 272 scaling 3.100000 only32 0
cpucycles tracesetup 6 default-gettimeofday precision 3300 scaling 3100.000000 only32 0
cpucycles tracesetup 7 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3100000000
cpucycles implementation amd64-pmc
cpucycles median 44 +38+23+23+23-4+0-4+0-4+0-4+0+10-4-2+1-4+1-4+1+17+1-4+1-4+1-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4+0-4
cpucycles observed persecond 2066500000...4235000000 with 8192 loops 3 microseconds
cpucycles observed persecond 2760833333...4200250000 with 16384 loops 5 microseconds
cpucycles observed persecond 2743416666...3313100000 with 32768 loops 11 microseconds
cpucycles observed persecond 2986227272...3295000000 with 65536 loops 21 microseconds
cpucycles observed persecond 3052069767...3206073171 with 131072 loops 42 microseconds
cpucycles observed persecond 3050395348...3125523810 with 262144 loops 85 microseconds
cpucycles observed persecond 3085123529...3123059524 with 524288 loops 169 microseconds
cpucycles observed persecond 3084561764...3103434912 with 1048576 loops 339 microseconds
nucnuc, Intel Pentium N3700:
cpucycles version 20230105
cpucycles tracesetup 0 amd64-pmc precision 26 scaling 1.000000 only32 0
cpucycles tracesetup 1 amd64-tsc precision 120 scaling 1.000000 only32 0
cpucycles tracesetup 2 amd64-tscasm precision 120 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-perfevent precision 427 scaling 1.000000 only32 0
cpucycles tracesetup 4 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 5 default-monotonic precision 320 scaling 1.600000 only32 0
cpucycles tracesetup 6 default-gettimeofday precision 1800 scaling 1600.000000 only32 0
cpucycles tracesetup 7 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 1600000000
cpucycles implementation amd64-pmc
cpucycles median 66 +12+12+14+14-1-1+0-1+0-1+0-1+0+1-1+0-1+0-1+0-2+0-1+0-1+0-1+0-2+0-1+0-1+0-1+0-2+0-1+0-1+1-1+0-2-1-1+0-1+0-1+0-2+0-1+2+0-1+0-1+0+0-1
cpucycles observed persecond 1060500000...2325000000 with 2048 loops 3 microseconds
cpucycles observed persecond 1387166666...2208250000 with 4096 loops 5 microseconds
cpucycles observed persecond 1376083333...1705500000 with 8192 loops 11 microseconds
cpucycles observed persecond 1495727272...1671800000 with 16384 loops 21 microseconds
cpucycles observed persecond 1563428571...1655100000 with 32768 loops 41 microseconds
cpucycles observed persecond 1580807228...1626234568 with 65536 loops 82 microseconds
cpucycles observed persecond 1589539393...1612619632 with 131072 loops 164 microseconds
cpucycles observed persecond 1598841463...1610230062 with 262144 loops 327 microseconds
cpucycles observed persecond 1564336810...1569988042 with 524288 loops 670 microseconds
cpucycles observed persecond 1599759725...1602608098 with 1048576 loops 1310 microseconds
saber214, AMD FX-8350:
cpucycles version 20230105
cpucycles tracesetup 0 amd64-pmc precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 1 amd64-tsc precision 167 scaling 1.000000 only32 0
cpucycles tracesetup 2 amd64-tscasm precision 168 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 5 default-monotonic precision 376 scaling 4.013452 only32 0
cpucycles tracesetup 6 default-gettimeofday precision 4213 scaling 4013.452000 only32 0
cpucycles tracesetup 7 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 4013452000
cpucycles implementation amd64-tsc
cpucycles median 77 +87-2+21+7+4+1+0+2-2-7-4+0+1+4-2+3+1-2-2+5-6+2+2+2+2+1-1-1+0-4+0-1-1-1-2+3-1-1+2-2+0+0+2+0+0+2-2-2+1-1-2+2-5+2+0+2+0+1+0+3-2-1-1
cpucycles observed persecond 2767500000...5759000000 with 4096 loops 3 microseconds
cpucycles observed persecond 3426000000...4893800000 with 8192 loops 6 microseconds
cpucycles observed persecond 3724076923...4446363637 with 16384 loops 12 microseconds
cpucycles observed persecond 3977833333...4363318182 with 32768 loops 23 microseconds
cpucycles observed persecond 3984854166...4168739131 with 65536 loops 47 microseconds
cpucycles observed persecond 3981709923...4048193799 with 131072 loops 130 microseconds
cpucycles observed persecond 3982716417...4026914573 with 262144 loops 200 microseconds
cpucycles observed persecond 4001637602...4025136987 with 524288 loops 366 microseconds
cpucycles observed persecond 4007411111...4018600248 with 1048576 loops 809 microseconds
cfarm14, Intel Xeon E5-2620 v3, Debian testing (bookworm), Linux kernel 6.0.0-6-amd64:
cpucycles version 20230105
cpucycles tracesetup 0 amd64-pmc precision 41 scaling 1.000000 only32 0
cpucycles tracesetup 1 amd64-tsc precision 148 scaling 1.000000 only32 0
cpucycles tracesetup 2 amd64-tscasm precision 148 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-perfevent precision 159 scaling 1.000000 only32 0
cpucycles tracesetup 4 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 5 default-monotonic precision 289 scaling 3.200000 only32 0
cpucycles tracesetup 6 default-gettimeofday precision 3400 scaling 3200.000000 only32 0
cpucycles tracesetup 7 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3200000000
cpucycles implementation amd64-pmc
cpucycles median 47 +47+28+0+2-5+0+2-5+16+2-5+0+2-5+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0+1-4+0
cpucycles observed persecond 1653800000...2819333334 with 8192 loops 4 microseconds
cpucycles observed persecond 1832111111...2389285715 with 16384 loops 8 microseconds
cpucycles observed persecond 1936058823...2207200000 with 32768 loops 16 microseconds
cpucycles observed persecond 2052843750...2196200000 with 65536 loops 31 microseconds
cpucycles observed persecond 2050750000...2120048388 with 131072 loops 63 microseconds
cpucycles observed persecond 2081896825...2117048388 with 262144 loops 125 microseconds
cpucycles observed persecond 2089478087...2107044177 with 524288 loops 250 microseconds
cpucycles observed persecond 2093343313...2102124249 with 1048576 loops 500 microseconds
cfarm23, Cavium Octeon II V0.1, Debian 8.11, Linux kernel 4.1.4:
cpucycles version 20240114
cpucycles tracesetup 0 mips64-cc precision 24 scaling 1.000000 only32 1
cpucycles tracesetup 1 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 46649 scaling 2.399988 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 45799 scaling 2399.987654 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 2399987654
cpucycles implementation mips64-cc
cpucycles median 2206 +581+5+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+83791+581+18+18+18+18+18+18
cpucycles observed persecond 634500000...1843500000 with 1024 loops 9 microseconds
cpucycles observed persecond 746142857...1361500000 with 2048 loops 13 microseconds
cpucycles observed persecond 846318181...1222000000 with 4096 loops 21 microseconds
cpucycles observed persecond 897717948...1105432433 with 8192 loops 38 microseconds
cpucycles observed persecond 954521126...1065971015 with 16384 loops 70 microseconds
cpucycles observed persecond 979395454...1018958716 with 32768 loops 219 microseconds
cpucycles observed persecond 986875354...1011415955 with 65536 loops 352 microseconds
cpucycles observed persecond 994412144...1005722798 with 131072 loops 773 microseconds
cpucycles observed persecond 997076363...1003483613 with 262144 loops 1374 microseconds
cpucycles observed persecond 959310151...1001940950 with 524288 loops 2846 microseconds
cpucycles observed persecond 993951907...1000833365 with 1048576 loops 5426 microseconds
cfarm26, Intel Core i5-4570 in 32-bit mode under KVM, Debian 12.4, Linux kernel 6.1.0-17-686-pae:
cpucycles version 20240114
cpucycles tracesetup 0 x86-tsc precision 118 scaling 1.000000 only32 0
cpucycles tracesetup 1 x86-tscasm precision 118 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-perfevent precision 627 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-monotonic precision 2335 scaling 3.192606 only32 0
cpucycles tracesetup 5 default-gettimeofday precision 3392 scaling 3192.606000 only32 0
cpucycles tracesetup 6 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3192606000
cpucycles implementation x86-tsc
cpucycles median 18 +34+0+0+13+0+0+13+0+0+13+0+0+13+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+0
cpucycles observed persecond 1950500000...6176000000 with 8192 loops 3 microseconds
cpucycles observed persecond 2591000000...5117000000 with 16384 loops 5 microseconds
cpucycles observed persecond 2824090909...4013333334 with 32768 loops 10 microseconds
cpucycles observed persecond 2993757575...3362258065 with 65536 loops 32 microseconds
cpucycles observed persecond 3093644067...3286807018 with 131072 loops 58 microseconds
cpucycles observed persecond 3126202531...3270727273 with 262144 loops 78 microseconds
cpucycles observed persecond 3144248407...3216322581 with 524288 loops 156 microseconds
cpucycles observed persecond 3172426332...3209545742 with 1048576 loops 318 microseconds
cfarm27, Intel Core i5-4570 in 32-bit mode under KVM, Alpine 3.19.0, Linux kernel 6.6.7-0-lts:
cpucycles version 20240114
cpucycles tracesetup 0 x86-tsc precision 118 scaling 1.000000 only32 0
cpucycles tracesetup 1 x86-tscasm precision 118 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-perfevent precision 631 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-monotonic precision 1084 scaling 3.192606 only32 0
cpucycles tracesetup 5 default-gettimeofday precision 3392 scaling 3192.606000 only32 0
cpucycles tracesetup 6 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3192606000
cpucycles implementation x86-tsc
cpucycles median 18 +113+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+0+13+0+13+0+0+13+0+0+13+0+0+13+0+0+13
cpucycles observed persecond 2404000000...4642666667 with 8192 loops 4 microseconds
cpucycles observed persecond 2617333333...4441250000 with 16384 loops 5 microseconds
cpucycles observed persecond 3001312500...3606857143 with 32768 loops 15 microseconds
cpucycles observed persecond 3096870967...3394000000 with 65536 loops 30 microseconds
cpucycles observed persecond 3123943661...3244913044 with 131072 loops 70 microseconds
cpucycles observed persecond 3173264150...3225305733 with 262144 loops 158 microseconds
cpucycles observed persecond 3170094339...3210561905 with 524288 loops 211 microseconds
cpucycles observed persecond 3178732087...3205529781 with 1048576 loops 320 microseconds
cfarm29, IBM POWER9, Debian 12.4, Linux kernel 6.1.0-17-powerpc64le:
cpucycles version 20240114
cpucycles tracesetup 0 ppc64-mftb precision 218 scaling 7.421875 only32 0
cpucycles tracesetup 1 default-perfevent precision 292 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 355 scaling 3.800000 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 4000 scaling 3800.000000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3800000000
cpucycles implementation ppc64-mftb
cpucycles median 207 +52+31+1-14+0+1-14-6+8-7-6+8-7+1-7+1+8-21-7+1+16-22+1+8-21-7+16-7-21+1+8-21+0+1+1-14+8-6-14+0+9-7+1+1-14-14+8-7-21+1+8-7+1-7+9-22+8-6+1-14+8-7-6
cpucycles observed persecond 3267500000...6865000000 with 4096 loops 3 microseconds
cpucycles observed persecond 3246125000...4445666667 with 8192 loops 7 microseconds
cpucycles observed persecond 3435333333...4016307693 with 16384 loops 14 microseconds
cpucycles observed persecond 3674892857...3984115385 with 32768 loops 27 microseconds
cpucycles observed persecond 3734963636...3888641510 with 65536 loops 54 microseconds
cpucycles observed persecond 3768266055...3845158879 with 131072 loops 108 microseconds
cpucycles observed persecond 3783654377...3822125582 with 262144 loops 216 microseconds
cpucycles observed persecond 3791669745...3810830627 with 524288 loops 432 microseconds
cpucycles observed persecond 3795847398...3805719583 with 1048576 loops 864 microseconds
cfarm45, AMD Athlon II X4 640, Debian 8.11, Linux kernel 3.16.0-11-686-pae:
cpucycles version 20230105
cpucycles tracesetup 0 x86-tsc precision 199 scaling 1.000000 only32 0
cpucycles tracesetup 1 x86-tscasm precision 199 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-perfevent precision 170 scaling 1.000000 only32 0
cpucycles tracesetup 3 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-monotonic precision 941 scaling 3.000000 only32 0
cpucycles tracesetup 5 default-gettimeofday precision 3200 scaling 3000.000000 only32 0
cpucycles tracesetup 6 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3000000000
cpucycles implementation default-perfevent
cpucycles median 72 +12+0+0+0+0+0+0+0+5+0+0+0+0+0+0+0+2+0+0+0+0+0+0+0+1+0+0+0+0+0+0+0+2+0+0+0+0+0+0+0+1+0+0+0+0+0+0+0+2+0+0+0+0+0+0+0+1+0+0+0+0+0+0
cpucycles observed persecond 541500000...1812000000 with 1024 loops 3 microseconds
cpucycles observed persecond 712333333...1212250000 with 2048 loops 5 microseconds
cpucycles observed persecond 1193285714...1733600000 with 4096 loops 6 microseconds
cpucycles observed persecond 1689176470...1804562500 with 8192 loops 33 microseconds
cpucycles observed persecond 1713074626...1770600000 with 16384 loops 66 microseconds
cpucycles observed persecond 1765107692...1795140625 with 32768 loops 129 microseconds
cpucycles observed persecond 1785369649...1800603922 with 65536 loops 256 microseconds
cpucycles observed persecond 1781377862...1796288462 with 131072 loops 261 microseconds
cpucycles observed persecond 1772647398...1778247827 with 262144 loops 691 microseconds
cpucycles observed persecond 1789670493...1794149598 with 524288 loops 870 microseconds
cpucycles observed persecond 1860276211...1861561332 with 1048576 loops 3156 microseconds
cfarm91, StarFive JH7100, Linux trixie/sid, Linux kernel 5.18.11-starfive:
cpucycles version 20240114
cpucycles tracesetup 0 riscv64-rdcycle precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 1 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 1351 scaling 2.399988 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 2599 scaling 2399.987654 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 2399987654
cpucycles implementation default-monotonic
cpucycles median 1536 -384+0+384+0-384+0-384+0+0-384+384-1-384+0+0-381+0-384+384+0+0-384+0-384+0+0+384+0+0+0+0+0-384+0+384+0-384+0-384+0+0-384+384+0-384+0+0-384+0+0+0+0+0-384+0-382+0+0+384-384+0-384+0
cpucycles observed persecond 1590857142...4147200000 with 1024 loops 6 microseconds
cpucycles observed persecond 1954909090...3157333334 with 2048 loops 10 microseconds
cpucycles observed persecond 2142421052...2755882353 with 4096 loops 18 microseconds
cpucycles observed persecond 2293085714...2606606061 with 8192 loops 34 microseconds
cpucycles observed persecond 2337970588...2496090910 with 16384 loops 67 microseconds
cpucycles observed persecond 2358522388...2443712122 with 32768 loops 133 microseconds
cpucycles observed persecond 2382335849...2423813689 with 65536 loops 264 microseconds
cpucycles observed persecond 2385986013...2405815790 with 131072 loops 571 microseconds
cpucycles observed persecond 2395157522...2405531915 with 262144 loops 1129 microseconds
cpucycles observed persecond 2397798685...2402770560 with 524288 loops 2433 microseconds
cpucycles observed persecond 2398637218...2401114855 with 1048576 loops 4572 microseconds
cfarm92, SiFive Freedom U740, Ubuntu 22.04.3, Linux kernel 5.19.0-1021-generic:
cpucycles version 20240114
cpucycles tracesetup 0 riscv64-rdcycle precision 8 scaling 1.000000 only32 0
cpucycles tracesetup 1 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 2599 scaling 2.399988 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 2599 scaling 2399.987654 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 2399987654
cpucycles implementation riscv64-rdcycle
cpucycles median 8 +168+20+2+2+0+0+0+0+570+0+0+0+0+0+0+0+144+0+0+0+0+0+0+0+160+0+0+0+0+0+0+0+160+0+0+0+0+0+0+0+154+0+0+0+0+0+0+0+154+0+0+0+0+0+0+0+152+0+0+0+0+0+0
cpucycles observed persecond 571500000...2198000000 with 1024 loops 3 microseconds
cpucycles observed persecond 833600000...2094000000 with 2048 loops 4 microseconds
cpucycles observed persecond 921888888...1445142858 with 4096 loops 8 microseconds
cpucycles observed persecond 1029625000...1320642858 with 8192 loops 15 microseconds
cpucycles observed persecond 1137034482...1284481482 with 16384 loops 28 microseconds
cpucycles observed persecond 1155701754...1227454546 with 32768 loops 56 microseconds
cpucycles observed persecond 1177464285...1217163637 with 65536 loops 111 microseconds
cpucycles observed persecond 1188018099...1207858448 with 131072 loops 220 microseconds
cpucycles observed persecond 1189925170...1200519363 with 262144 loops 440 microseconds
cpucycles observed persecond 1193962457...1199117446 with 524288 loops 878 microseconds
cpucycles observed persecond 1194051324...1196780111 with 1048576 loops 1811 microseconds
cfarm103, Apple M1 (Icestorm-M1 + Firestorm-M1), Debian trixie/sid, Linux kernel 6.5.0-asahi-00780-g62806c2c6f29:
cpucycles version 20240114
cpucycles tracesetup 0 arm64-pmc precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 1 arm64-vct precision 186 scaling 86.000000 only32 0
cpucycles tracesetup 2 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 4 default-monotonic precision 285 scaling 2.064000 only32 0
cpucycles tracesetup 5 default-gettimeofday precision 2264 scaling 2064.000000 only32 0
cpucycles tracesetup 6 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 2064000000
cpucycles implementation arm64-vct
cpucycles median 0 +86+0+0+0+86+0+0+0+0+0+0+0+0+0+0+0+0+86+0+0+0+0+0+0+0+0+0+0+0+86+0+0+0+0+0+0+0+0+0+0+0+0+86+0+0+0+0+0+0+0+0+0+0+0+86+0+0+0+0+0+0+0+0
cpucycles observed persecond 1440500000...3010000000 with 4096 loops 3 microseconds
cpucycles observed persecond 1621714285...2339200000 with 8192 loops 6 microseconds
cpucycles observed persecond 1884833333...2296200000 with 16384 loops 11 microseconds
cpucycles observed persecond 1963043478...2166380953 with 32768 loops 22 microseconds
cpucycles observed persecond 2004755555...2106000000 with 65536 loops 44 microseconds
cpucycles observed persecond 2051295454...2103000000 with 131072 loops 87 microseconds
cpucycles observed persecond 2054549450...2080722223 with 262144 loops 181 microseconds
cpucycles observed persecond 2056159544...2068681949 with 524288 loops 350 microseconds
cpucycles observed persecond 2061174285...2067573066 with 1048576 loops 699 microseconds
cfarm110 (gcc1-power7), IBM POWER7, CentOS 7.9 AltArch, Linux kernel 3.10.0-862.14.4.el7.ppc64:
cpucycles version 20240114
cpucycles tracesetup 0 ppc64-mftb precision 212 scaling 7.000000 only32 0
cpucycles tracesetup 1 default-perfevent precision 236 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 346 scaling 3.550000 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 3750 scaling 3550.000000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3550000000
cpucycles implementation ppc64-mftb
cpucycles median 168 -49+56-21+21-14+14-28+28-28+28-7+0-14+21-28+28-28+28-42+28-35+28-35+35-35+21-21+56-49+42-49+21-21+0+0-21+21-49+28-35+7-7-14+14-42+42-7+7+0+0-7+7-21+21-28+28-35+35-42+28-35+28-35
cpucycles observed persecond 3136000000...6569500000 with 4096 loops 3 microseconds
cpucycles observed persecond 3108000000...4233833334 with 8192 loops 7 microseconds
cpucycles observed persecond 3322666666...3878538462 with 16384 loops 14 microseconds
cpucycles observed persecond 3423000000...3698592593 with 32768 loops 28 microseconds
cpucycles observed persecond 3480842105...3616327273 with 65536 loops 56 microseconds
cpucycles observed persecond 3571702702...3641862386 with 131072 loops 110 microseconds
cpucycles observed persecond 3571387387...3605986364 with 262144 loops 221 microseconds
cpucycles observed persecond 3570914414...3588307693 with 524288 loops 443 microseconds
cpucycles observed persecond 3578817155...3587452489 with 1048576 loops 885 microseconds
cfarm112 (gcc2-power8), IBM POWER8E, CentOS 7.9 AltArch, Linux kernel 3.10.0-1127.13.1.el7.ppc64le:
cpucycles version 20240114
cpucycles tracesetup 0 ppc64-mftb precision 194 scaling 7.250000 only32 0
cpucycles tracesetup 1 default-perfevent precision 308 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 414 scaling 3.690000 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 3890 scaling 3690.000000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3690000000
cpucycles implementation ppc64-mftb
cpucycles median 123 +1871+7+0+1+0+7-7+1+7+0+1-7+0+0+8+0+0+0+1+7-7+8+0+0+0+8+0+0+1+7+0+1+7+0+1+0+7-7+8+0+0+1+7-7+8+7-7+8-7-7+0+7+1+0+0+8-7+0+0+8+0+0+0
cpucycles observed persecond 2903666666...4451500000 with 4096 loops 5 microseconds
cpucycles observed persecond 3475700000...4630875000 with 8192 loops 9 microseconds
cpucycles observed persecond 3640684210...4205882353 with 16384 loops 18 microseconds
cpucycles observed persecond 3545051282...3800189190 with 32768 loops 38 microseconds
cpucycles observed persecond 3683973333...3816780822 with 65536 loops 74 microseconds
cpucycles observed persecond 3682366666...3747662163 with 131072 loops 149 microseconds
cpucycles observed persecond 3706476510...3739236487 with 262144 loops 297 microseconds
cpucycles observed persecond 3706573825...3722984849 with 524288 loops 595 microseconds
cpucycles observed persecond 3709504617...3717714046 with 1048576 loops 1190 microseconds
cfarm120, IBM POWER10, AlmaLinux 9.3, Linux kernel 5.14.0-284.11.1.el9_2.ppc64le:
cpucycles version 20240114
cpucycles tracesetup 0 ppc64-mftb precision 123 scaling 5.750000 only32 0
cpucycles tracesetup 1 default-perfevent precision 203 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 226 scaling 2.950000 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 3150 scaling 2950.000000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 2950000000
cpucycles implementation ppc64-mftb
cpucycles median 69 +6-40+6+5-40+0-40+11+0-40+6-46+6+5-46+12-40+0+5-40+6-46+11+6-46+6+6-41+0-40+12-46+5+6-46+6+11-46+0-40+0+12-41+0-40+6-46+6+11-40+0+6-41+0-40+12+0-41+6-46+6-40+5
cpucycles observed persecond 2103666666...3215500000 with 8192 loops 5 microseconds
cpucycles observed persecond 2827666666...3662714286 with 16384 loops 8 microseconds
cpucycles observed persecond 2821000000...3185125000 with 32768 loops 17 microseconds
cpucycles observed persecond 2818305555...2989823530 with 65536 loops 35 microseconds
cpucycles observed persecond 2897014285...2985852942 with 131072 loops 69 microseconds
cpucycles observed persecond 2920582733...2964649636 with 262144 loops 138 microseconds
cpucycles observed persecond 2930339350...2952341819 with 524288 loops 276 microseconds
cpucycles observed persecond 2941188405...2952218182 with 1048576 loops 551 microseconds
cfarm202, UltraSparc T5, Debian unstable (bookworm), Linux kernel 5.19.0-2-sparc64-smp:
cpucycles version 20230105
cpucycles tracesetup 0 sparc64-rdtick precision 65 scaling 1.000000 only32 0
cpucycles tracesetup 1 default-perfevent precision 386 scaling 1.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 442 scaling 3.599910 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 3799 scaling 3599.910000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 3599910000
cpucycles implementation sparc64-rdtick
cpucycles median 73 +24+0+24+24+24+24+24+24+0+1+24+0+1+24+0+1+24+0+0+1+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+0+1+0+0+0+0+0+0+0+0+0+0+0+0+0
cpucycles observed persecond 2751500000...4258250000 with 4096 loops 5 microseconds
cpucycles observed persecond 3289200000...4206875000 with 8192 loops 9 microseconds
cpucycles observed persecond 3454789473...3900823530 with 16384 loops 18 microseconds
cpucycles observed persecond 3452026315...3659888889 with 32768 loops 37 microseconds
cpucycles observed persecond 3543770270...3650916667 with 65536 loops 73 microseconds
cpucycles observed persecond 3567299319...3620662069 with 131072 loops 146 microseconds
cpucycles observed persecond 3591373287...3618220690 with 262144 loops 291 microseconds
cpucycles observed persecond 3597353344...3610774527 with 524288 loops 582 microseconds
cpucycles observed persecond 3595899403...3603058071 with 1048576 loops 1172 microseconds
IBM z15:
cpucycles version 20230106
cpucycles tracesetup 0 s390x-stckf precision 250 scaling 1.269531 only32 0
cpucycles tracesetup 1 default-perfevent precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 2 default-mach precision 0 scaling 0.000000 only32 0
cpucycles tracesetup 3 default-monotonic precision 272 scaling 5.200000 only32 0
cpucycles tracesetup 4 default-gettimeofday precision 5400 scaling 5200.000000 only32 0
cpucycles tracesetup 5 default-zero precision 0 scaling 0.000000 only32 0
cpucycles persecond 5200000000
cpucycles implementation s390x-stckf
cpucycles median 48 +87+8+0-2+0+0+38-2+0+1-3+1+28+0+3-3+1+0+28+0-2+3+0-2+36+0+0+0+1+0+28+0-2+0+3-2+35+1+0-2+0+3+28+0-2+0+0-2+3+25+3+0-2+0+1+35+1+0+0-2+0+28+0
cpucycles observed persecond 4948941176...5627733334 with 8192 loops 16 microseconds
cpucycles observed persecond 4104125000...5515666667 with 16384 loops 7 microseconds
cpucycles observed persecond 5047076923...5987818182 with 32768 loops 12 microseconds
cpucycles observed persecond 5044846153...5475708334 with 65536 loops 25 microseconds
cpucycles observed persecond 5141313725...5357428572 with 131072 loops 50 microseconds
cpucycles observed persecond 5150892156...5257250000 with 262144 loops 101 microseconds
cpucycles observed persecond 5183421568...5236549505 with 524288 loops 203 microseconds
cpucycles observed persecond 5190282555...5216582717 with 1048576 loops 406 microseconds
To download and unpack the latest version of libcpucycles:
wget -m https://cpucycles.cr.yp.to/libcpucycles-latest-version.txt
version=$(cat cpucycles.cr.yp.to/libcpucycles-latest-version.txt)
wget -m https://cpucycles.cr.yp.to/libcpucycles-$version.tar.gz
tar -xzf cpucycles.cr.yp.to/libcpucycles-$version.tar.gz
cd libcpucycles-$version
Then install.
libcpucycles-20240114.tar.gz
Add arm32-1176 counter.
Allow slop 0.2 rather than 0.1 for FINDMULTIPLIER.
Improve platform detection.
Port to FreeBSD.
Use blue boldface during compilation for "skipping option that did not compile".
doc/install.md: headings; note manual pages.
Add doc/license.md.
Update HTML style for better tt visibility and copy-paste.
libcpucycles-20230115.tar.gz
Update actual cpucycles_version behavior to match documentation.
libcpucycles-20230110.tar.gz
doc/api.md: Document cpucycles_version().
Add s390x-stckf counter.
cpucycles/default-perfevent.c: Read into int64_t instead of long long.
Add comment explaining issues with PERF_FORMAT_TOTAL_TIME_RUNNING.
configure: Improve uname handling.
doc/api.md: Update description of default frequency.
libcpucycles-20230105.tar.gz
libcpucycles is a microlibrary for counting CPU cycles. Cycle counts are not as detailed as Falk diagrams but are the most precise timers available to typical software; they are central tools used in understanding and improving software performance.
The libcpucycles API is simple: include <cpucycles.h>, call cpucycles()
to receive a long long whenever desired, and link with -lcpucycles.
Internally, libcpucycles understands machine-level
cycle counters for amd64 (both PMC and TSC), arm32, arm64 (both PMC and
VCT), mips64, ppc32, ppc64, riscv32, riscv64, s390x, sparc64, and x86.
libcpucycles also understands four OS-level mechanisms, which give
varying levels of accuracy: mach_absolute_time, perf_event,
CLOCK_MONOTONIC, and, as a fallback, microsecond-resolution
gettimeofday.
When the program first calls cpucycles(), libcpucycles automatically
benchmarks the available mechanisms and selects the mechanism that does
the best job. Subsequent cpucycles() calls are thread-safe and very
fast. An accompanying cpucycles-info program prints a summary of
cycle-counter accuracy.
For comparison, there is a simple-sounding __rdtsc() API provided by
compilers, but this works only on Intel/AMD CPUs and is generally noisier
than PMC. There is a __builtin_readcyclecounter() that works on more
CPUs, but this works only with clang and has the same noise problems.
Both of these mechanisms put the burden on the caller to figure out what
can be done on other CPUs. Various packages include their own more
portable abstraction layers for counting cycles (see, e.g., FFTW's
cycle.h, used to automatically select from among multiple implementations
provided by FFTW), but this creates per-package effort to keep up with
the latest cycle counters. The goal of libcpucycles is to provide
state-of-the-art cycle counting centrally for all packages to use.
Prerequisites: python3; gcc and/or clang. Currently tested only
under Linux, but porting to other systems shouldn't be difficult.
To install in /usr/local/{include,lib,bin,man}:
./configure && make -j8 install
Typically you'll already have
export LD_LIBRARY_PATH="$HOME/lib"
export LIBRARY_PATH="$HOME/lib"
export CPATH="$HOME/include"
export PATH="$HOME/bin:$PATH"
in $HOME/.profile. To install in $HOME/{include,lib,bin,man}:
./configure --prefix=$HOME && make -j8 install
Run
./configure --prefix=/usr && make -j8
and then follow your usual packaging procedures for the
build/0/package files:
build/0/package/man/man3/cpucycles.3
build/0/package/include/cpucycles.h
build/0/package/lib/libcpucycles*
build/0/package/bin/cpucycles-info
There are some old systems where libcpucycles requires -lrt for
clock_gettime; currently libcpucycles.so doesn't link to -lrt, so it's
up to the caller to link to -lrt.
You can run
./configure --host=amd64
to override ./configure's guess of the architecture that it should
compile for. The architecture controls which cycle counters to try
compiling: e.g., amd64 tries compiling cpucycles/amd64* and
cpucycles/default*.
Inside the build directory, 0 is symlinked to amd64 for
--host=amd64. Running make clean removes build/amd64. Re-running
./configure automatically starts with make clean.
A subsequent ./configure --host=arm64 will create build/arm64 and
symlink 0 -> arm64, without touching an existing build/amd64.
However, cross-compilers aren't yet selected automatically.
Compilers tried are listed in compilers/default. Each compiler
includes -fPIC to create a shared library, -fvisibility=hidden to
hide non-public symbols in the library, and -fwrapv to switch to a
slightly less dangerous version of C. The first compiler that seems to
work is used to compile everything.
libcpucycles is hereby placed into the public domain.
SPDX-License-Identifier: LicenseRef-PD-hp OR CC0-1.0 OR 0BSD OR MIT-0 OR MIT
Many security systems have been shown to be breakable by "timing attacks". These attacks extract secrets by analyzing timings of the legitimate user's operations on secret data. See the June 2022 survey page https://timing.attacks.cr.yp.to for an overview and further references.
Sometimes these attacks are used as motivation to disable the attacker's
access to various timing mechanisms. For example, Firefox rounds its
performance.now timer to 1-millisecond resolution "to mitigate
potential security threats".
As another example, reducing /proc/sys/kernel/perf_event_paranoid
under Linux to 2 (from 3 or higher), so that libcpucycles has access to
the best available Intel/AMD cycle counter (RDPMC), also means making
this cycle counter and other performance-monitoring counters available
to any attacker-controlled software running on the computer. Perhaps
this helps timing attacks, not to mention the possibility of opening up
other vulnerabilities via the complicated perf_event interface.
As yet another example, ARM CPUs disable user access to the main CPU cycle counter by default. Installing a kernel module to enable user access to the cycle counter could help attacks.
Given the availability of simple mechanisms to disable RDPMC etc., it is easy to recommend using those mechanisms. To avoid creating unnecessary tension between those recommendations and the use of libcpucycles, applications that use libcpucycles should be structured so that high-resolution timers are used only on controlled development and benchmarking machines, not on general end-user machines.
This structure might seem incompatible with using cycle counts to automatically select the best of multiple options, as in FFTW. However, new infrastructure introduced in lib25519 automatically selects options on end-user machines based on cycle counts that were collected on benchmarking machines.
The above text should not be understood as endorsing the idea that disabling timers is an effective defense against timing attacks. Certainly disabling high-resolution timers is not sufficient for security: there are many ways for attackers to amplify timing signals and to statistically filter out noise from low-resolution timers. Disabling every standard timing mechanism on the machine does not stop the attacker from accessing a remote timer or a counter maintained by the attacker's software. Perhaps disabling timers sometimes makes the difference between a feasible attack and an infeasible attack, but evaluating this is extremely difficult.
Meanwhile there is an auditable methodology available to stop timing attacks: constant-time programming, which systematically cuts off data flow from secrets to timings.
For example, secrets affect a CPU's power consumption, and Turbo Boost creates data flow from power consumption to timings, as illustrated by the Hertzbleed attack extracting secret keys from the SIKE cryptosystem (before SIKE was broken in other ways), and an independent attack extracting secret AES keys. Consequently, the constant-time methodology does not allow Turbo Boost.
This is why https://timing.attacks.cr.yp.to recommends turning off Turbo Boost "right now", and explains the mechanisms available to do this. One non-security reason that it was already normal (although not universal) for manufacturers to provide these mechanisms to end users is that Turbo Boost has a reputation for causing premature hardware failures. Turbo Boost also provides very little speed benefit for modern multithreaded vectorized applications.
Another reaction to timing attacks is to apply "masking" techniques. These techniques seem to make it more difficult for attackers to extract secrets from power consumption and other side channels. However, as https://timing.attacks.cr.yp.to explains, it is "practically impossible for an auditor to obtain any real assurance that these techniques are secure". See the December 2022 paper "Breaking a fifth-order masked implementation of CRYSTALS-Kyber by copy-paste" for a newer example of a security failure in a masked implementation.
Here is how libcpucycles decides which cycle counter to use. The underlying principles are as follows:
Failure is not allowed. Using a low-resolution timer such as
gettimeofday() to estimate cycle counts is not desirable but is better
than providing no information.
A counter that does well on some CPUs and OSes can do badly on others.
The counter selection in libcpucycles is based not just on rules set
at compile time but also on measurements of how well the counters
perform when the program first calls cpucycles().
A critical application of cycle counting is collecting cycle counts for multiple options to see which option is faster. It is the caller's responsibility to compute medians of cycle counts for many runs of whatever is being benchmarked: medians filter out occasional cycle-count jumps caused by migration to another core (if the benchmark is not pinned to a single core) or interrupts from other OS activity. libcpucycles does not reject an otherwise attractive counter merely because of occasional jumps.
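A minimal sketch of the caller-side median computation described above. The helper names median_ll and median_cycles are hypothetical and not part of libcpucycles. The counter is passed as a function pointer so that the sketch compiles on its own; a real caller would include <cpucycles.h> and pass cpucycles.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for long long values, written to avoid overflow. */
static int cmp_ll(const void *a, const void *b)
{
  long long x = *(const long long *) a;
  long long y = *(const long long *) b;
  return (x > y) - (x < y);
}

/* Median of t[0..n-1]; sorts t in place. Requires n > 0.
   For even n this returns the upper of the two middle values. */
long long median_ll(long long *t, size_t n)
{
  assert(n > 0);
  qsort(t, n, sizeof t[0], cmp_ll);
  return t[n / 2];
}

/* Time `runs` calls of work() with the given counter and return the
   median count. Hypothetical helper: a real caller would pass cpucycles
   as `counter`. Returns -1 if runs is 0 or allocation fails. */
long long median_cycles(long long (*counter)(void),
                        void (*work)(void), size_t runs)
{
  long long *t, result;
  size_t i;
  if (runs == 0) return -1;
  t = malloc(runs * sizeof t[0]);
  if (!t) return -1;
  for (i = 0; i < runs; ++i) {
    long long start = counter();
    work();
    t[i] = counter() - start;
  }
  result = median_ll(t, runs);
  free(t);
  return result;
}
```

A single outlier, such as one run interrupted by the OS, moves the mean of the samples but leaves the median unchanged, which is why the median is the better summary here.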
Cycle-counting overhead is not desirable, but does not directly affect comparisons of multiple options measured using the same cycle counter, so it is less important than consistent major errors such as treating 2^32 + x cycles as x cycles. (Performance experts seeing a function that takes billions of cycles usually focus on smaller subroutines, but libcpucycles should not break larger measurements.) This is why libcpucycles does not provide direct access to 32-bit cycle counters: it provides wrappers that combine the counters with gettimeofday() to produce 64 bits, even though this incurs some extra overhead.
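The following sketch illustrates one common way to turn a 32-bit counter into a 64-bit count using a coarse estimate, for example a gettimeofday() reading multiplied by the estimated frequency. It is not taken from the libcpucycles wrappers, whose details are not described here; the function name extend32 is hypothetical.

```c
#include <stdint.h>

/* Return the 64-bit count that agrees with the 32-bit reading `low`
   modulo 2^32 and lies closest to the coarse estimate `est`.
   Correct whenever est is within 2^31 cycles of the true count.
   Assumes the usual two's-complement conversion to int32_t. */
uint64_t extend32(uint32_t low, uint64_t est)
{
  /* Signed distance from est to the nearest value with the right low bits. */
  int32_t delta = (int32_t) (uint32_t) (low - (uint32_t) est);
  return est + (uint64_t) (int64_t) delta;
}
```

With this approach a measurement of 2^32 + x cycles is reported as 2^32 + x rather than x, as long as the coarse clock is accurate to within about 2^31 cycles.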
The noise introduced by typical off-core clocks, such as multiplying a 24MHz clock by 86 to estimate cycles on a 2.064GHz CPU core, comes in small part from low resolution but much more from changes in CPU frequency: e.g., a 10000-cycle computation might be measured as 20000 cycles when the CPU enters a power-saving mode. When libcpucycles has access to what is believed to be an on-core cycle counter, it uses that even when its measurements show some noise. (Choosing an on-core cycle counter does not magically eliminate the change in the relative speed of the CPU and DRAM; the usual advice to warm up the CPU and set constant frequencies if possible still applies.)
When cpucycles() is first called, libcpucycles tries running each
cycle counter that has been compiled into the library. For example, for
64-bit ARM CPUs, libcpucycles will try arm64-pmc, arm64-vct,
default-gettimeofday, default-mach, default-monotonic, and
default-perfevent, minus any of those that failed to compile.
Cycle counters that fail at run time with SIGILL (or SIGFPE or SIGBUS or
SIGSEGV) are eliminated from the list. For example, arm64-pmc will
fail with SIGILL if the kernel does not allow user access to
PMCCNTR_EL0. Beware that libcpucycles does not catch SIGILL after its
initial tests: if the kernel initially allows user access to
PMCCNTR_EL0 but later turns it off then arm64-pmc will crash.
Independently of these counters, libcpucycles uses various OS mechanisms
to obtain an estimate of the CPU frequency. This estimate is also
available to the caller as cpucycles_persecond().
The methods that libcpucycles uses to ask the OS for an estimated CPU
frequency fail on some OS-CPU combinations, in which case libcpucycles
falls back to a cpucyclespersecond environment variable, or, if that
variable does not exist, an estimate of 2399987654 cycles per second.
(This estimate is in a realistic range of CPU speeds, and is close to
multiples of 24MHz, 25MHz, and 19.2MHz, which are common crystal
frequencies.) The sysadmin can create /etc/cpucyclespersecond to
override all of the OS mechanisms.
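As a quick check of the arithmetic, 2400000000 equals 100 times 24MHz, 96 times 25MHz, and 125 times 19.2MHz, and the default of 2399987654 is about 5 parts per million below that. The last two steps of the fallback chain can be sketched as follows. The function names and the parsing rules (decimal digits only, positive value) are assumptions made for illustration, not the library's own code.

```c
#include <stdlib.h>

#define DEFAULT_PERSECOND 2399987654LL

/* Parse a cycles-per-second string; fall back to the default if the
   string is missing, empty, not a plain decimal number, or not positive.
   These parsing rules are an assumption of this sketch. */
long long parse_persecond(const char *s)
{
  if (s && *s) {
    char *end;
    long long v = strtoll(s, &end, 10);
    if (*end == 0 && v > 0) return v;
  }
  return DEFAULT_PERSECOND;
}

/* Used only after the OS mechanisms have given no answer. */
long long persecond_fallback(void)
{
  return parse_persecond(getenv("cpucyclespersecond"));
}
```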
For counters that do not ask for scaling, the estimated CPU frequency is
shown in cpucycles-info as a double-check on the counter results. For
counters that ask for scaling, libcpucycles uses the estimated CPU
frequency to compute the scaling, so this is not a double-check. If a
counter asks for scaling and the estimated CPU frequency does not seem
close to a multiple of the counter frequency (possibly with a small
power-of-2 denominator) then libcpucycles will throw the counter away,
except in the case of fixed-resolution OS counters such as
gettimeofday and CLOCK_MONOTONIC.
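One way to picture the "close to a multiple, possibly with a small power-of-2 denominator" test is the sketch below. It is a hypothetical illustration only: the function name find_multiplier, the parameters max_denom and slop, the way the tolerance is applied, and the search order are all assumptions, and the library's actual rule may differ.

```c
#include <assert.h>

/* Look for a multiplier m/d, with d a power of 2 no larger than
   max_denom, such that (cpu_hz/counter_hz)*d is within `slop` of the
   integer m. Returns m/d for the smallest such d, or 0.0 if there is
   none, in which case the counter would be discarded. */
double find_multiplier(double cpu_hz, double counter_hz,
                       int max_denom, double slop)
{
  double ratio;
  int d;
  assert(cpu_hz > 0 && counter_hz > 0);
  ratio = cpu_hz / counter_hz;
  for (d = 1; d <= max_denom; d *= 2) {
    double scaled = ratio * d;
    long long m = (long long) (scaled + 0.5); /* round to nearest */
    double diff = scaled - (double) m;
    if (diff < 0) diff = -diff;
    if (m >= 1 && diff <= slop) return (double) m / d;
  }
  return 0.0;
}
```

For a 24MHz counter and an estimated 2064000000 cycles per second, the ratio is exactly 86, so the sketch returns a scaling of 86 with denominator 1.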
libcpucycles computes a precision estimate for each counter (times any applicable scaling) as follows. Call the counter 1000 times. Check that the counter has never decreased, and has increased at least once. (A counter where the decrease/increase checks fail is retried 10 times, so 10000 calls overall, and removed if it fails all 10 times.) The precision estimate is then the smallest nonzero difference between adjacent counter results, plus a penalty explained below.
The penalty is 100 cycles for off-core counters (including RDTSC) and
default-perfevent, and 200 cycles for fixed-resolution OS counters.
For example, an on-core CPU cycle counter will be selected even if it
actually has, e.g., a resolution of 8 cycles and 50 cycles of overhead.
Finally, libcpucycles selects the counter where the precision estimate is the smallest number of cycles. Note that an inaccurate estimate of CPU frequency can influence the choice between a scaled counter and an unscaled counter.
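The precision estimate and the final selection can be sketched as follows, working on one batch of already-scaled readings. The names are hypothetical, the retry loop is omitted, and the use of -1 to mark a rejected counter is a choice made for this sketch. The zero penalty for on-core counters is implied by the text rather than stated.

```c
#include <stddef.h>

/* Penalties in cycles. PENALTY_ONCORE = 0 is an assumption. */
enum { PENALTY_ONCORE = 0, PENALTY_OFFCORE = 100, PENALTY_OS_FIXED = 200 };

/* Precision estimate for n adjacent counter readings. Returns -1 if
   the counter ever decreased or never increased; otherwise the smallest
   nonzero difference between adjacent readings, plus the penalty. */
long long precision_estimate(const long long *r, size_t n, long long penalty)
{
  long long best = 0;
  size_t i;
  for (i = 1; i < n; ++i) {
    long long d = r[i] - r[i - 1];
    if (d < 0) return -1;                        /* counter went backwards */
    if (d > 0 && (best == 0 || d < best)) best = d;
  }
  if (best == 0) return -1;                      /* counter never moved */
  return best + penalty;
}

/* Index of the smallest valid estimate, or -1 if every counter failed.
   Ties go to the counter listed first. */
int select_counter(const long long *est, int count)
{
  int best = -1, i;
  for (i = 0; i < count; ++i)
    if (est[i] >= 0 && (best < 0 || est[i] < est[best])) best = i;
  return best;
}
```

For instance, readings that advance in steps of 86 from an off-core counter give 86 + 100 = 186, which is consistent with the arm64-vct precision of 186 in the Apple M1 trace above.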
libcpucycles does not carry out its counter selection (typically tens
of milliseconds, sometimes even more) as a static initializer; callers
are presumed to not want to incur the cost of initialization unless and
until they are actually using cpucycles(). A multithreaded caller thus
has to place locks around any possibly-first call to cpucycles(), or
create its own static initializer (an __attribute__((constructor))
function) with an initial cpucycles() call so that all subsequent
cpucycles() calls are thread-safe.
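A minimal sketch of the static-initializer approach, assuming gcc or clang. To keep the example compilable without libcpucycles, counter_stub is a hypothetical stand-in for cpucycles() whose first call models the one-time counter selection; a real program would include <cpucycles.h> and call cpucycles() inside the constructor instead.

```c
/* Stand-in for cpucycles() so that this sketch compiles on its own.
   The first call models the one-time counter selection. */
static int selection_runs = 0;
static long long ticks = 0;

long long counter_stub(void)
{
  if (ticks == 0) ++selection_runs; /* unsynchronized, like a first cpucycles() call */
  return ++ticks;
}

/* gcc and clang run constructor functions before main(), hence before
   the program has created any threads of its own. */
__attribute__((constructor))
static void warm_up_counter(void)
{
  (void) counter_stub();
}
```

The cost is that every run of the program pays for counter selection at startup, even if it never measures anything. A caller that wants to avoid that can instead guard its possibly-first call with a lock, as the paragraph above notes.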