Tooling for Automated Benchmarking and Visualization

Clara Habermeier, Jan Winkelmann
July 8, 2025

Maintaining peak software performance is a critical aspect of our development process, and early regression detection is non-negotiable. At Cryspen, we’ve addressed this by implementing an automated, multi-platform benchmarking system. This post details the enhancements we’ve made to our workflows, which allow us to identify performance issues before changes are merged. Beyond benchmarks for individual algorithms, we use a tracing strategy for our protocol code, which lets us measure the performance of code as it actually runs. Finally, we explore the tools and methods we designed for creating comprehensive, informative visualizations of benchmark data, applicable across our wide range of projects and repositories.

When looking at the code we needed benchmarks for, we realized that it can be broadly divided into two categories. The first category is calls to cryptographic algorithms, which generally sit at the leaves of the call stack. This sort of code naturally lends itself to being run in a tight loop while measuring how long each iteration takes. The durations involved are usually in the micro- or nanosecond range, so the measurement process has to tolerate very little noise. This category of code can be measured with existing tools such as Criterion, or with manual benchmarking loops.
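For illustration, a manual benchmarking loop can be as simple as the following sketch; the closure here is a dummy stand-in for the cryptographic routine under test.

use std::hint::black_box;
use std::time::Instant;

// A rough sketch of a manual benchmarking loop; the closure is a
// stand-in for the cryptographic routine under test.
fn bench<T>(label: &str, iterations: u32, mut f: impl FnMut() -> T) {
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the compiler from optimizing the call away.
        black_box(f());
    }
    println!("{label}: {:?}/iter", start.elapsed() / iterations);
}

fn main() {
    // Example with a dummy workload; a real benchmark would call, e.g.,
    // an ML-KEM key generation function here.
    bench("dummy workload", 10_000, || (0u64..1_000).sum::<u64>());
}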

The other category is protocol code, which often contains different sections whose performance we want to understand individually, but which can only really be run as part of the full protocol run. We use tracing to measure this class of code. The idea is that we annotate certain functions or code segments such that when we enter or leave these sections during execution, we write events with timestamps and some metadata to a trace. Later we can go through the captured events and recover the elapsed time for each section. The benefits of this approach are that we don’t need to set up mock protocol states for running the benchmarks, and that we benchmark the actual code as it is run, instead of a stand-alone function call that is detached from its context. On the other hand, this approach adds some clutter to our protocol code, because we need to annotate each section to be measured.

Tracing with proc macros

In order to enable tracing for a function, we annotate it as follows:

// in the crate root - could also be somewhere else
use std::sync::LazyLock;
use std::time::Instant;

// MutexTrace and the trace_span attribute are provided by our tracing crate.
static TRACE: LazyLock<MutexTrace<&'static str, Instant>> =
    LazyLock::new(|| MutexTrace::default());

// at the function to be measured
#[cfg_attr(test, trace_span("step1", crate::TRACE))]
fn protocol_step1(&mut self, msg: &[u8], w: &mut impl io::Write) -> Result<usize, Error> {
  // ... protocol code ...
}

When built for tests, this attribute proc macro roughly expands to:

fn protocol_step1(&mut self, msg: &[u8], w: &mut impl io::Write) -> Result<usize, Error> {
  let _span_guard = crate::TRACE.emit_span("step1");
  // ... protocol code ...
}

The emit_span function inserts a start marker for “step1” into the trace and returns a guard. At the end of the function, the guard is dropped, at which point an end marker for the same label will be inserted.
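Under the hood this is a standard RAII pattern. Below is a minimal sketch of how such a guard mechanism can be implemented; the real MutexTrace in our tracing crate may differ in its details.

use std::sync::Mutex;
use std::time::Instant;

// Illustrative stand-ins for the real types in our tracing crate.
enum Marker {
    Start,
    End,
}

struct Trace {
    events: Mutex<Vec<(&'static str, Marker, Instant)>>,
}

struct SpanGuard<'a> {
    label: &'static str,
    trace: &'a Trace,
}

impl Trace {
    fn emit_span(&self, label: &'static str) -> SpanGuard<'_> {
        // Insert the start marker immediately...
        self.events.lock().unwrap().push((label, Marker::Start, Instant::now()));
        // ...and hand back a guard that inserts the end marker when dropped.
        SpanGuard { label, trace: self }
    }
}

impl Drop for SpanGuard<'_> {
    fn drop(&mut self) {
        self.trace
            .events
            .lock()
            .unwrap()
            .push((self.label, Marker::End, Instant::now()));
    }
}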

To run this as a benchmark, we need a benchmark function that runs the protocol and afterwards inspects the trace; a small sketch of the trace-inspection step follows below. A complete example can be found in the Rosenpass benchmarks, which are based on our tracing code. That example also produces a format ready for consumption by our GitHub Actions, which we cover in the next section.
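Reusing the Trace and Marker stand-ins from the sketch above, the trace-inspection step could look roughly like this; the accessor for the captured events in our real tracing crate may differ.

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Walk the captured events and recover the elapsed time per label.
// A benchmark function would run the protocol first and then call
// something like this on the contents of the trace.
fn elapsed_per_label(
    events: &[(&'static str, Marker, Instant)],
) -> HashMap<&'static str, Vec<Duration>> {
    let mut open: HashMap<&'static str, Instant> = HashMap::new();
    let mut durations: HashMap<&'static str, Vec<Duration>> = HashMap::new();
    for (label, marker, at) in events {
        match marker {
            // A start marker opens a span for this label...
            Marker::Start => {
                open.insert(*label, *at);
            }
            // ...and the matching end marker closes it, yielding a duration.
            Marker::End => {
                let start = open.remove(*label).expect("unmatched end marker");
                durations.entry(*label).or_default().push(*at - start);
            }
        }
    }
    durations
}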

Benchmarking on CI and visualizing on GitHub Pages

We wanted to enhance our benchmarking approach with automated, multi-platform benchmark runs that identify performance regressions before PRs are merged, and with tools that automatically generate visualizations of the benchmark results with minimal configuration. We also wanted a unified visualization interface for all benchmarks that can be re-used across different repositories and sub-crates, and that displays the data in an easily navigable manner.

Existing CI tooling, namely the combination of github-action-benchmark and Criterion benchmarks, already allowed us to run benchmarks automatically and visualize them on GitHub Pages. More specifically, the original action, which runs in the context of a GitHub Actions workflow, takes the output of cargo bench and other tools as its input, transforms it into a common format, and stores it in a GitHub Pages branch of the repository, alongside a chart interface that displays one chart per group of observations, showing how one benchmark evolves over time. However, we often wanted to group multiple benchmark runs into combined charts to compare them visually, for example all runs for a given OS, and to detect performance regressions against the base git revision of a PR; neither of these features was supported out of the box by the existing tooling.

Screenshots of the dashboard charts for libcrux ML-KEM, with the os windows-latest_64 and the key sizes 768 and 1024

Taking this existing benchmarking tooling as a starting point, we forked the benchmarking action and split its functionality into two modular actions. The first action extracts the raw data from Criterion (or other benchmarking tools') results into a list of JSON objects; the second action (the chart action) uploads these results in a common JSON format to the GitHub Pages branch and additionally adds the index.html for the chart interface to that branch.

First, as a prerequisite to the data extraction in the first action, we looked at how best to associate user-defined metadata, e.g. key size, API, library, and platform, with the benchmark results. We accomplished this by encoding any relevant metadata in the Criterion benchmark names, using a key=value-based specification. These key=value pairs are then automatically parsed by the first action and included in its output, which the second action uploads to the GitHub Pages branch.
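As a concrete illustration, such metadata-carrying names can be built directly with Criterion's benchmark groups. The sketch below mirrors the output shown further down, but the actual libcrux benchmark code may be organized differently.

use criterion::{criterion_group, criterion_main, Criterion};

fn key_generation(c: &mut Criterion) {
    // The group name carries the first half of the metadata...
    let mut group = c.benchmark_group("category=ML-KEM,keySize=512,name=Key Generation");
    // ...and the per-benchmark name carries the rest, so the full ID becomes
    // "category=ML-KEM,keySize=512,name=Key Generation/platform=portable,api=external random".
    group.bench_function("platform=portable,api=external random", |b| {
        b.iter(|| {
            // the key generation routine under test would be called here
        });
    });
    group.finish();
}

criterion_group!(benches, key_generation);
criterion_main!(benches);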

Run benchmarks step:

      - name: 🏃🏻‍♀️ Benchmarks
        run: cargo bench --verbose $RUST_TARGET_FLAG -- --output-format bencher | tee bench.txt

Output of the above:

test category=ML-KEM,keySize=512,name=Key Generation/platform=portable,api=external random ... bench:      13,675 ns/iter (+/- 95)
test category=ML-KEM,keySize=512,name=Key Generation/platform=portable,api=unpacked (external random) ... bench:      13,314 ns/iter (+/- 68)
test category=ML-KEM,keySize=512,name=Key Generation/platform=avx2,api=external random ... bench:       9,613 ns/iter (+/- 130)

Extract action step in workflow:

      - name: Extract benchmarks
        uses: cryspen/benchmark-data-extract-transform@v2
        with:
          name: ML-KEM Benchmark
          tool: 'cargo'
          os: ${{ matrix.os }}_${{ matrix.bits }}
          output-file-path: libcrux-ml-kem/bench.txt
          data-out-path: libcrux-ml-kem/bench-processed.json
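Internally, the extract action parses the key=value pairs back out of each benchmark name. The following Rust sketch only illustrates that parsing; the action itself is implemented separately and may handle more cases.

use std::collections::BTreeMap;

// Pull the key=value metadata out of a benchmark name such as
// "category=ML-KEM,keySize=512,name=Key Generation/platform=portable,api=external random".
fn parse_metadata(bench_name: &str) -> BTreeMap<String, String> {
    bench_name
        .split(|c: char| c == '/' || c == ',')
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

fn main() {
    let meta = parse_metadata(
        "category=ML-KEM,keySize=512,name=Key Generation/platform=portable,api=external random",
    );
    assert_eq!(meta["keySize"], "512");
    assert_eq!(meta["api"], "external random");
}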

Next, within the chart interface, we support using the metadata described above to group benchmark data series into different charts. For this, two pieces of information are needed: a schema, which lists all the relevant metadata fields, and a group-by, which is a subset of the schema and is used to divide the results into charts, one chart for each unique combination of group-by values, e.g. os=ubuntu-latest,keySize=44. Both are provided to the chart action. The chart interface we built uses these fields to group data series into charts, so that we can, for example, compare different key generation algorithms across one OS/platform.
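To make the grouping concrete, here is an illustrative Rust sketch of how results can be bucketed into charts by the group-by fields; the chart interface's actual implementation may differ.

use std::collections::BTreeMap;

// Each benchmark result carries its parsed metadata; the group-by
// fields (e.g. ["os", "keySize"]) determine which chart it lands in.
fn group_into_charts<'a>(
    results: &'a [BTreeMap<String, String>],
    group_by: &[&str],
) -> BTreeMap<String, Vec<&'a BTreeMap<String, String>>> {
    let mut charts: BTreeMap<String, Vec<&BTreeMap<String, String>>> = BTreeMap::new();
    for result in results {
        // Build a key like "os=ubuntu-latest,keySize=512"; one chart per key.
        let chart_key = group_by
            .iter()
            .map(|field| {
                let value = result.get(*field).map(String::as_str).unwrap_or("");
                format!("{field}={value}")
            })
            .collect::<Vec<_>>()
            .join(",");
        charts.entry(chart_key).or_default().push(result);
    }
    charts
}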

Upload step in workflow:

      - name: Upload benchmarks
        uses: cryspen/benchmark-upload-and-plot-action@v3
        with:
          name: ML-KEM Benchmark
          input-data-path: libcrux-ml-kem/bench-processed.json
          github-token: ${{ secrets.GITHUB_TOKEN }}
          gh-repository: github.com/${{ github.repository }}
          group-by: os,keySize
          schema: category,keySize,name,platform,api,os
          auto-push: true
          fail-on-alert: true
          alert-threshold: 200%

Because the content of the key=value metadata and the schema and group-by fields are entirely user-specified, they are compatible with any benchmarking use case, which means that this tooling can be re-used across different repositories and workspace crates without any additional configuration. The resulting data is written to the GitHub Pages branch, where it is retrieved by the HTML dashboard, although other storage targets could be used as well.

Additionally, in order to identify regressions on CI, we needed to be able to compare against data from the base ref and SHA of a PR. To do this, we designed a new file structure for the results stored on the GitHub Pages branch, in which a separate file is maintained for each ref or PR. This means that, when a workflow using our benchmarking actions runs in the merge queue, data for the underlying base revision (e.g. corresponding to a specific commit on main) can be pulled and used as the basis of comparison.
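This also enables the automated regression alerts: conceptually, each new value is compared against the matching value from the base data, and an alert fires when the increase exceeds the configured alert-threshold (200% in the workflow above). The sketch below illustrates that check with hypothetical names; it is not the action's actual implementation.

// A simplified sketch of a threshold check consistent with the
// `alert-threshold: 200%` setting above; names are illustrative.
struct Measurement {
    value: f64, // e.g. ns/iter, where lower is better
}

fn regressed(current: &Measurement, base: &Measurement, alert_threshold_percent: f64) -> bool {
    // With a 200% threshold, an alert fires when the PR's value is more
    // than twice the value measured for the base revision.
    current.value > base.value * (alert_threshold_percent / 100.0)
}

fn main() {
    let base = Measurement { value: 13_675.0 };    // e.g. the base revision's ns/iter
    let current = Measurement { value: 29_000.0 }; // hypothetical regressed value
    assert!(regressed(&current, &base, 200.0));    // 29,000 > 2 × 13,675
}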

Each of these datasets can also be visualized in the chart interface using the ref/PR selection dropdown and filtered for the charts most relevant to a given question.

A screenshot of the dataset selection and chart filtering panel

Conclusion

To conclude, this tooling greatly reduces the amount of manual configuration needed for setting up benchmark workflows on CI and visualizing the data, as well as the amount of code we need to write to measure time spent in individual functions.

The libcrux benchmarks for ML-KEM and ML-DSA can be viewed here: libcrux.cryspen.com

The benchmarking GitHub Actions we developed can be found on GitHub: cryspen/benchmark-data-extract-transform and cryspen/benchmark-upload-and-plot-action.