ahrav/scratch-scanner-rs

scanner-rs

A secret scanner for git repositories and filesystems (for now).

This started as a question: how much faster can a secret scanner get if you design it around the CPU instead of around programmer ergonomics? What if you take ideas from data-oriented design, mechanical sympathy, and TigerBeetle-style pre-allocation — and apply them to regex-based secret detection? This project is the attempt to find out.

The short answer: quite a bit faster, it turns out. But not for free.

What We Measured

We (Claude Code, Codex, and I) benchmarked scanner-rs against three established secret scanners — Kingfisher (Rust), TruffleHog (Go), and Gitleaks (Go) — across 8 open-source repositories of varying size and character, in both git history and filesystem modes, with cold and warm page caches (on EBS-backed NVMe storage, where kernel readahead behavior matters). That's 128 configurations in total, each run on an ARM Graviton3 instance (16 vCPUs, 61 GiB RAM).

Speedups vary — a lot

Scan time comparison

Warm-cache git mode, showing how many times slower each scanner is vs scanner-rs:

| Repo | vs Kingfisher | vs TruffleHog | vs Gitleaks |
|------|---------------|---------------|-------------|
| node | 5.8x | 35.6x | 30.0x |
| vscode | 2.3x | 13.1x | 8.3x |
| linux | 2.3x | 10.6x | 7.8x |
| rocksdb | 2.3x | 10.8x | 6.7x |
| tensorflow | 2.4x | 16.0x | 10.5x |
| Babylon.js | 1.3x | 11.6x | 11.1x |
| gcc | 1.9x | 12.4x | 60.0x |
| jdk | 1.7x | 18.3x | 16.1x |

The range is wide. Against Kingfisher (the closest Rust scanner), the advantage is 1.3x on Babylon.js and 5.8x on node. Against the Go scanners, 6.7x to 60x. The variance matters — it tells us the advantage depends on workload characteristics, not a single trick.

A few patterns we noticed:

  • Git mode advantages are larger than filesystem mode. Git scanning involves decompressing and traversing commit history, which generates more CPU-bound work where our DFA prefilter can skip more input. scanner-rs reads pack files directly with custom pure-Rust parsers — no libgit2, no git CLI subprocess. Object IDs are resolved through a zero-copy MIDX index, and pack objects are decoded in offset order for sequential I/O. TruffleHog and Gitleaks shell out to git log --patch and parse unified diffs; Kingfisher uses the gix library. See architecture-comparison.md Section 4.9 for the full comparison. In filesystem mode, the bottleneck shifts toward I/O on cold caches and the gaps narrow (though scanner-rs still shows the largest cold-to-warm speedup — see below).
  • Warm-cache filesystem mode isolates compute throughput. When I/O isn't the bottleneck, scanner-rs reaches 1.2–3.3 GiB/s on large repos (linux, gcc, Babylon.js).
  • The Gitleaks-on-gcc outlier (60x) is real but context-dependent. Gitleaks runs at 0.5–0.6 MiB/s on gcc across all configurations, suggesting a pathological interaction between its sequential rule loop and gcc's repository structure.

Cold vs warm: what the page cache reveals

Cold vs warm cache ratio

Filesystem mode shows the starkest cold/warm contrast. The ratio of cold-cache to warm-cache wall time tells you whether a scanner is I/O-bound (large ratio: slow when data isn't in memory, fast when it is) or CPU-bound (ratio near 1.0: I/O was never the bottleneck).

| Repo | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|------|-----------|------------|------------|----------|
| node | 10.1x | 4.0x | 1.3x | 1.1x |
| vscode | 3.3x | 1.5x | 1.1x | 0.9x |
| linux | 13.1x | 6.8x | 1.0x | 1.1x |
| rocksdb | 1.1x | 1.1x | 1.3x | 1.0x |
| tensorflow | 9.5x | 3.0x | 1.1x | 1.0x |
| Babylon.js | 2.1x | 1.5x | 1.1x | 1.0x |
| gcc | 19.8x | 8.7x | 1.4x | 1.0x |
| jdk | 13.8x | 4.1x | 1.9x | 1.3x |
| Average | 9.1x | 3.8x | 1.3x | 1.0x |

scanner-rs averages a 9.1x cold/warm speedup; Gitleaks averages 1.0x. A large cold/warm ratio indicates the scanner processes data faster than I/O can deliver it — so warm cache removes the bottleneck. A ratio near 1.0 indicates the scanner is CPU-bound regardless of cache state. Gitleaks shows almost no cold/warm delta, consistent with CPU being its bottleneck even on cold storage.

Several design choices likely contribute to scanner-rs's large cold/warm delta: posix_fadvise(POSIX_FADV_SEQUENTIAL) on every file open, madvise(MADV_SEQUENTIAL) on every mmap, overlap-carry I/O (no re-reading), and the work-stealing scheduler keeping I/O pipelined. We haven't isolated each factor's individual contribution — the ratio captures all of them together.

One context note: the benchmark storage is EBS (Elastic Block Store), which presents as NVMe but is network-attached. Kernel readahead hints matter more on EBS than on local NVMe, where device-level prefetching is already aggressive. These ratios would likely look different on local SSDs.

Throughput across repository sizes

Throughput comparison

A selection of throughput numbers showing how performance scales:

| Repo | Mode | Cache | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|------|------|-------|------------|------------|------------|----------|
| linux | fs | warm | 3.3 GiB/s | 1.4 GiB/s | 125 MiB/s | 112 MiB/s |
| gcc | fs | warm | 1.8 GiB/s | 716 MiB/s | 109 MiB/s | 0.6 MiB/s |
| vscode | fs | warm | 1.5 GiB/s | 284 MiB/s | 98 MiB/s | 125 MiB/s |
| node | git | warm | 106 MiB/s | 18 MiB/s | 3.0 MiB/s | 3.5 MiB/s |
| vscode | git | warm | 84 MiB/s | 37 MiB/s | 6.4 MiB/s | 10 MiB/s |
| linux | git | warm | 39 MiB/s | 17 MiB/s | 3.7 MiB/s | 5.0 MiB/s |

Filesystem warm-cache is the most compute-bound scenario. Git mode is slower across all scanners because of the decompression and object-traversal overhead. scanner-rs showed the highest throughput in every configuration we tested.

The memory cost

Peak memory usage

This is the part we can't hand-wave away. scanner-rs uses substantially more RSS than every other scanner:

| Repo | scanner-rs | Kingfisher | TruffleHog | Gitleaks |
|------|-----------|------------|------------|----------|
| node | 5.5 GiB | 2.3 GiB | 1.7 GiB | 1.6 GiB |
| vscode | 5.4 GiB | 2.1 GiB | 1.6 GiB | 1.3 GiB |
| linux | 22.9 GiB | 8.1 GiB | 8.3 GiB | 7.2 GiB |
| tensorflow | 7.2 GiB | 2.4 GiB | 1.8 GiB | 1.4 GiB |
| gcc | 15.8 GiB | 5.6 GiB | 4.8 GiB | 4.5 GiB |

Roughly 2–3x more memory across the board. This is the direct cost of pre-allocated pools, per-worker scratch buffers, and fixed-capacity arenas. It's a deliberate tradeoff — but it means scanner-rs needs a bigger instance or won't fit where a lighter scanner would. More on this in The Memory Tradeoff below.

What the finding counts tell us (and don't)

Finding counts diverge wildly across scanners on the same repo — scanner-rs reports 98,584 findings on vscode git mode while Gitleaks reports 116. scanner-rs's counts are inflated because it currently lacks several false-positive reduction filters that other scanners ship: entropy gates on rule matches (checking the secret span, not the full match window), safelists for known-benign patterns, and confidence scoring. These filters are planned but not yet implemented. Once entropy gating and safelists land, we expect a significant drop in reported findings.

The throughput numbers above compare scan speed on the same input and are unaffected by finding counts.

Why: The CPU-Level Story

The benchmark results told us that scanner-rs was faster, but not why. So we ran perf stat on the vscode repo (1.12 GiB, git mode, warm cache) with 24 hardware counters across all four scanners, capturing cycles, instructions, cache misses, branch mispredictions, TLB behavior, and pipeline stalls.

This was a deep-dive on one representative workload. The findings below explain the mechanisms we think drive the broader benchmark results.

Scan less: anchor-first prefiltering

The single biggest win had nothing to do with memory layout or cache lines. It was algorithmic: don't run regex on input that can't possibly match.

scanner-rs compiles all 223 detection rules into a single Vectorscan (Hyperscan) multi-pattern DFA. This DFA scans the input buffer in one SIMD-accelerated pass, identifying narrow anchor windows where a match could exist. Only those windows — typically a small fraction of the input — get fed to the full regex engine. Everything else is skipped.

The perf counters show the impact: scanner-rs executes 3.5x fewer instructions than Kingfisher and 26x fewer than Gitleaks on the same input. The other scanners mostly run each rule's regex over the full input. That instruction reduction alone accounts for most of the wall-clock difference — and it's likely why the advantage holds consistently across all 8 benchmark repos, not just vscode.

  • src/engine/buffer_scan.rs:1 — Prefilter → normalize → confirm → validate pipeline
  • src/engine/vectorscan_prefilter.rs:112 — VsPrefilterDb: compiled database holding all patterns
  • src/engine/core.rs:30 — Scan algorithm: prefilter seeds windows, regex only runs in hit windows
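The idea can be illustrated with a toy sketch — this is not the real Vectorscan pipeline, and the anchor literal, window width, and confirm check are invented stand-ins. A cheap pass locates anchor bytes; only the narrow window after each hit reaches the expensive confirm step, so most of the input is never touched by it:

```rust
/// Cheap prefilter: byte offsets of every occurrence of `anchor`.
/// (Vectorscan does this for 223 patterns in one SIMD pass; a naive
/// windows() scan stands in for it here.)
fn prefilter(haystack: &[u8], anchor: &[u8]) -> Vec<usize> {
    haystack
        .windows(anchor.len())
        .enumerate()
        .filter(|(_, w)| *w == anchor)
        .map(|(i, _)| i)
        .collect()
}

/// Expensive confirm step, run only inside anchor windows. A stand-in
/// check (anchor followed by at least 8 hex digits) replaces the full
/// regex confirm of the real pipeline.
fn confirm(window: &[u8]) -> bool {
    window.len() >= 8 && window.iter().take(8).all(|b| b.is_ascii_hexdigit())
}

/// Offsets where the prefilter hit AND the confirm step agreed.
fn scan(input: &[u8], anchor: &[u8]) -> Vec<usize> {
    prefilter(input, anchor)
        .into_iter()
        .filter(|&hit| confirm(&input[hit + anchor.len()..]))
        .collect()
}

fn main() {
    let input = b"noise noise AKIA deadbeef42 more noise AKIA xyz";
    // Only the first anchor is followed by 8 hex digits.
    println!("{:?}", scan(input, b"AKIA "));
}
```

The point of the structure: `confirm` runs on a handful of bytes per anchor hit, not on the whole buffer, which is where the instruction-count reduction comes from.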

Deterministic state transitions: fewer branch misses

The DFA approach has a secondary benefit beyond instruction count: it replaces per-rule branching with table lookups. Each byte advances the automaton state via a deterministic table index — the branch predictor doesn't need to speculate about which rule will match next.

The counters show 4.2x fewer branch mispredictions than TruffleHog (which dispatches to per-detector regex engines) and 4.4x fewer than Kingfisher (Vectorscan + per-rule regex). On Graviton3, each misprediction costs ~10-15 cycles of pipeline flush, so 6 billion fewer misses adds up.

This is really the same bet as anchor-first scanning — use a single DFA instead of N separate regex engines — but showing up in a different counter.

  • src/engine/vectorscan_prefilter.rs:89 — RawPatternMeta: 12-byte #[repr(C)] per-pattern metadata
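To see why table-driven transitions are kind to the branch predictor, here is a hand-built sketch (not the Vectorscan automaton — a minimal 3-state DFA recognizing the literal "ab"): the per-byte inner loop contains a single indexed load instead of a chain of per-rule conditionals.

```rust
const STATES: usize = 3; // 0 = start, 1 = saw 'a', 2 = accept

/// Transition table: next_state = table[state][byte]. Unlisted
/// entries fall back to state 0.
fn build_table() -> [[u8; 256]; STATES] {
    let mut t = [[0u8; 256]; STATES];
    t[0][b'a' as usize] = 1;
    t[1][b'b' as usize] = 2;
    t[1][b'a' as usize] = 1; // "aa" stays one byte from a match
    t
}

/// Count non-overlapping occurrences of "ab" with one table lookup
/// per input byte — no per-rule branching in the hot loop.
fn count_matches(input: &[u8]) -> usize {
    let table = build_table();
    let mut state = 0u8;
    let mut matches = 0;
    for &b in input {
        state = table[state as usize][b as usize];
        if state == 2 {
            matches += 1;
            state = 0; // restart after an accept
        }
    }
    matches
}

fn main() {
    println!("{}", count_matches(b"xxabyyaabzz")); // two matches
}
```

The data-dependent part is an array index, which the CPU handles as a load, not a speculated branch — the same property that shows up in the misprediction counters above.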

Work-stealing scheduler: keep data warm

The third-largest measured effect was backend stall cycles — 2.2x fewer than Kingfisher, 3.6x fewer than TruffleHog. Some of this is simply a consequence of executing fewer instructions (fewer memory operations = fewer stalls). But the scheduling strategy likely contributes too.

scanner-rs uses a custom work-stealing executor with Chase-Lev deques. The key property is LIFO-local scheduling: when a worker spawns a subtask, it pushes to its own deque and pops from the same end. This means recently-touched data (still warm in L1/L2) gets reused immediately. Stealing is FIFO from remote workers, randomized to avoid correlated contention, with a tiered idle strategy (spin → yield → park with 200us timeout).

Contrast this with Go's goroutine scheduler (used by TruffleHog and Gitleaks), which may migrate goroutines between OS threads, or Rayon (used by Kingfisher), which provides work-stealing parallelism but without the LIFO-local scheduling or tiered idle strategy.

How much of the 77 billion fewer stall cycles is scheduling vs just doing less work? Hard to say definitively. But the warm-cache filesystem throughput numbers (1.2–3.3 GiB/s) are consistent with the scheduler keeping workers fed with work.

  • src/scheduler/executor.rs:3 — Architecture diagram and design rationale
  • src/scheduler/executor.rs:74 — ExecutorConfig with tuning knobs
  • src/scheduler/executor.rs:472 — WorkerCtx: per-worker deque + scratch + metrics
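The LIFO-local / FIFO-steal discipline can be sketched with a plain VecDeque standing in for the real Chase-Lev deque (single-threaded here for brevity; the names WorkerQueue, push_local, etc. are invented for this sketch, not the executor's API):

```rust
use std::collections::VecDeque;

/// One worker's task queue. The owner operates on the back (LIFO:
/// newest task first, its data still warm in L1/L2); thieves take
/// from the front (FIFO: oldest task, least likely to be cache-hot
/// on the victim).
struct WorkerQueue<T> {
    deque: VecDeque<T>,
}

impl<T> WorkerQueue<T> {
    fn new() -> Self {
        Self { deque: VecDeque::new() }
    }
    /// Owner: spawn a subtask onto the LIFO end.
    fn push_local(&mut self, task: T) {
        self.deque.push_back(task);
    }
    /// Owner: run the most recently spawned task next.
    fn pop_local(&mut self) -> Option<T> {
        self.deque.pop_back()
    }
    /// Thief: steal the oldest task from a remote worker.
    fn steal(&mut self) -> Option<T> {
        self.deque.pop_front()
    }
}

fn main() {
    let mut q = WorkerQueue::new();
    for task in ["parse", "decode", "scan"] {
        q.push_local(task);
    }
    // Owner resumes "scan" (its input was just produced); a thief
    // would instead walk away with "parse".
    println!("owner pops {:?}", q.pop_local());
    println!("thief steals {:?}", q.steal());
}
```

Opposite ends for owner and thief is the whole trick: locality for the owner, low contention and cold-data handoff for the thief.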

Pre-allocate everything: stable address space

Borrowed from TigerBeetle's philosophy: if you know the maximum size at startup, allocate it once and never touch the allocator again.

  • ScratchVec: page-aligned, fixed capacity, zero reallocation (src/scratch_memory.rs:43)
  • NodePoolType: contiguous arena with bitset free-list, O(1) allocate/free (src/pool/node_pool.rs:44)
  • BufferPool: fixed-size 8 MiB chunk buffers, Rc-backed (src/runtime.rs:570)
  • AllocGuard: runtime enforcement that hot paths perform zero allocations (src/scheduler/alloc.rs:1)
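A minimal sketch of the arena-with-bitset-free-list idea (the FixedPool name and the 64-slot cap are simplifications for this example; the real NodePoolType is more general): all slots are allocated once up front, a u64 bitset tracks which are free, and `trailing_zeros` gives O(1) allocation with no allocator traffic afterward.

```rust
/// Fixed-capacity pool: one upfront Vec allocation, then zero
/// allocator calls. Bit i of `free` set means slot i is available.
struct FixedPool<T> {
    slots: Vec<Option<T>>, // allocated once, never resized
    free: u64,
}

impl<T> FixedPool<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity <= 64, "toy pool tracks at most 64 slots");
        let mut slots = Vec::with_capacity(capacity);
        slots.resize_with(capacity, || None);
        let free = if capacity == 64 { u64::MAX } else { (1u64 << capacity) - 1 };
        Self { slots, free }
    }

    /// O(1): claim the lowest free slot, or None if exhausted —
    /// the pool fails loudly rather than growing.
    fn alloc(&mut self, value: T) -> Option<usize> {
        if self.free == 0 {
            return None;
        }
        let idx = self.free.trailing_zeros() as usize;
        self.free &= !(1 << idx);
        self.slots[idx] = Some(value);
        Some(idx)
    }

    /// O(1): return a slot to the pool, handing back its value.
    fn release(&mut self, idx: usize) -> Option<T> {
        self.free |= 1 << idx;
        self.slots[idx].take()
    }
}

fn main() {
    let mut pool: FixedPool<&str> = FixedPool::new(2);
    let a = pool.alloc("chunk-a").unwrap();
    let _b = pool.alloc("chunk-b").unwrap();
    assert!(pool.alloc("chunk-c").is_none()); // full: no hidden realloc
    pool.release(a);
    assert!(pool.alloc("chunk-d").is_some()); // slot reused in place
}
```

Because slots never move, the addresses handed out stay stable for the process lifetime — the property behind the dTLB result below.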

The measurable effect on vscode: 1.9x fewer dTLB misses than Kingfisher. The theory is that stable virtual addresses keep TLB entries warm — no GC relocation, no realloc copying to new pages. In absolute terms this saves an estimated ~11 billion cycles on that workload, meaningful but modest compared to the algorithmic wins above.

This is also the primary driver of the memory cost discussed earlier.

I/O hints: helping the kernel help you

The cold/warm ratios above hint that scanner-rs handles I/O differently. One concrete difference: scanner-rs calls posix_fadvise(POSIX_FADV_SEQUENTIAL) on every file descriptor and madvise(MADV_SEQUENTIAL) on every mmap'd region. No other scanner does this (verified by searching the Kingfisher, TruffleHog, and Gitleaks codebases for posix_fadvise, madvise, fadvise, and MADV_).

  • src/scheduler/local_fs_owner.rs:1044 — hint_sequential(): posix_fadvise(POSIX_FADV_SEQUENTIAL) on local FS file reads
  • src/git_scan/runner_exec.rs:517 — advise_sequential(): posix_fadvise + madvise(MADV_SEQUENTIAL) on pack file mmaps
  • src/git_scan/pack_io.rs:421 — Same pattern on pack cache entries
  • src/git_scan/spill_arena.rs:266 — Same pattern on spill arenas

On Linux, POSIX_FADV_SEQUENTIAL doubles the kernel's default readahead window (from 128 KiB to 256+ KiB). For sequential scans over large files, this reduces the number of I/O round-trips. MADV_SEQUENTIAL does the same for mmap'd regions and additionally lets the kernel proactively drop already-read pages, reducing memory pressure.

On EBS storage (network-attached, presenting as NVMe), each I/O round-trip carries higher latency than local SSD, so reducing their count via larger readahead windows has proportionally more impact.

Honest caveat: we haven't isolated the effect of fadvise/madvise from the other I/O choices (overlap-carry reads, work-stealing I/O pipelining). The cold/warm ratio captures all of them together. We can say that scanner-rs is the only scanner making explicit prefetch hints, and the cold/warm ratios are consistent with this mattering — but we can't say "fadvise alone explains the difference."
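For reference, the hint itself is a one-liner. This Linux-only sketch declares posix_fadvise directly (so no external crate is needed) and wraps it in a simplified stand-in for the codebase's hint_sequential(); the constant value 2 comes from <fcntl.h>.

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

const POSIX_FADV_SEQUENTIAL: i32 = 2; // from <fcntl.h>

extern "C" {
    // int posix_fadvise(int fd, off_t offset, off_t len, int advice);
    fn posix_fadvise(fd: i32, offset: i64, len: i64, advice: i32) -> i32;
}

/// Tell the kernel we will read `file` front to back, letting it
/// enlarge the readahead window. Returns 0 on success.
fn hint_sequential(file: &File) -> i32 {
    // offset 0, len 0 => the advice covers the whole file
    unsafe { posix_fadvise(file.as_raw_fd(), 0, 0, POSIX_FADV_SEQUENTIAL) }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("fadvise_demo");
    std::fs::write(&path, b"demo")?;
    let file = File::open(&path)?;
    println!("posix_fadvise returned {}", hint_sequential(&file));
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The call is advisory: it never fails a read, it only changes how aggressively the kernel prefetches, which is why it is cheap to issue on every open.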

The smaller optimizations

The remaining design decisions showed measurable but smaller effects in the vscode perf profile. Whether they're worth their complexity is debatable — we think they are, mainly because they were cheap to get right at design time and they compound.

Cache-line aligned atomics (src/engine/core.rs:142): Shared atomic counters padded to 64 bytes each, verified by compile-time assertion. Prevents false sharing. 2.1x fewer L2 writebacks, ~24-40B cycles saved.

Per-worker scratch (src/scheduler/executor.rs:472): Each worker owns its scratch buffers via Rc, never shared. 1.5x fewer L2 refills, ~5-8B cycles saved. Also simplifies reasoning about thread safety.

Compact packed metadata (src/engine/hit_pool.rs:82): 4-byte PairMeta (16 per cache line), 12-byte RawPatternMeta (5 per cache line). 2.0x fewer L1D misses, ~12B cycles saved — roughly 3% of the total cycle delta. The algorithmic decision to scan less input matters far more than how tightly you pack the metadata for the input you do scan.
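The alignment trick from the first item can be shown in a few lines (a sketch mirroring the pattern, not the actual core.rs code): #[repr(align(64))] pads each counter to its own cache line, and a const assertion catches any regression at compile time.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// An 8-byte atomic padded out to a full 64-byte cache line, so two
/// workers bumping adjacent counters never invalidate each other's
/// line (false sharing).
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

// Compile-time check, in the spirit of the assertion in core.rs:
// fails the build if the padding ever shrinks below a cache line.
const _: () = assert!(std::mem::size_of::<PaddedCounter>() == 64);

fn main() {
    let counters = [
        PaddedCounter(AtomicU64::new(0)),
        PaddedCounter(AtomicU64::new(0)),
    ];
    // These two increments touch different cache lines.
    counters[0].0.fetch_add(1, Ordering::Relaxed);
    counters[1].0.fetch_add(2, Ordering::Relaxed);
    println!("size = {} bytes", std::mem::size_of::<PaddedCounter>());
}
```

The cost is visible too: 56 of every 64 bytes are padding, which is exactly the kind of memory-for-speed trade tallied in the next section.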

The memory tradeoff

Every pre-allocation decision above trades memory for speed. Per-worker scratch, Vectorscan scratch, buffer pools, node arenas, cache-line padding — they all consume RSS whether fully utilized or not.

The 2-3x memory overhead is consistent across all 8 repos, which makes sense: it's proportional to worker count and pool configuration, not input size. On a 61 GiB machine scanning the linux kernel, 22.9 GiB of RSS is a lot. Whether that's acceptable depends on deployment context.

We think it's the right tradeoff for a batch scanning tool where you provision the machine for the job. It would be the wrong tradeoff for a memory-constrained CI environment. There's room to add configurable pool sizing, but we haven't prioritized it.

Measure, then measure again

The measurement infrastructure turned out to be important. Intuition about performance is unreliable — the impact ordering above surprised us; we expected the memory-layout optimizations to matter more than they did. The broad benchmark told us what was faster; the perf counters told us why; and the gap between our expectations and the data told us where our mental models were wrong.

  • 128-run benchmark matrix: 8 repos x 2 modes x 2 cache states x 4 scanners (findings.md)
  • 24 hardware counters on vscode: cycles, instructions, L1/L2/LLC misses, branches, TLB, stalls (perf_analysis.md)
  • Conditional compilation: perf_stats feature gate enables per-operation instrumentation without runtime cost in production
  • Side-by-side code analysis: every design decision mapped to other scanners' code excerpts (architecture-comparison.md)

Design Decisions Summary

Ordered by estimated cycle impact (from the vscode perf deep-dive):

| # | Decision | Observed Impact | Est. Cycle Share | Code |
|---|----------|-----------------|------------------|------|
| 1 | Anchor-first scanning + Vectorscan DFA | 3.5x fewer instructions | Dominant | src/engine/buffer_scan.rs:1 |
| 2 | Deterministic DFA transitions | 4.2x fewer branch misses | Large | src/engine/vectorscan_prefilter.rs:89 |
| 3 | Work-stealing scheduler | 2.2x fewer backend stalls | Large (partly from #1) | src/scheduler/executor.rs:472 |
| 4 | Cache-line aligned atomics | 2.1x fewer L2 writebacks | Moderate | src/engine/core.rs:142 |
| 5 | Pre-allocated pools | 1.9x fewer dTLB misses | Small | src/scratch_memory.rs:43 |
| 6 | Compact packed metadata | 2.0x fewer L1D misses | Small | src/engine/hit_pool.rs:82 |
| 7 | Per-worker scratch memory | 1.5x fewer L2 refills | Small | src/scheduler/executor.rs:472 |
| 8 | I/O hints (fadvise + madvise) | 9.1x avg cold/warm ratio | FS-mode | src/scheduler/local_fs_owner.rs:1044 |

Quick Start

```bash
# Build (requires Vectorscan/Hyperscan development headers)
cargo build --release

# Scan a git repository
./target/release/scanner-rs scan git /path/to/repo

# Scan a filesystem path
./target/release/scanner-rs scan fs /path/to/directory
```

Documentation

| Document | Contents |
|----------|----------|
| Architecture Overview | System architecture, component relationships |
| Detection Engine | Vectorscan prefilter, rule compilation, scan pipeline |
| Memory Management | Pools, scratch memory, allocation strategy |
| Detection Rules | Rule format, YAML schema, built-in rules |
| Window Validation | Anchor-first scanning, gate sequence |
| Transform Chain | Base64/hex/URL decode pipeline |
| Pipeline Flow | End-to-end data flow |
| Git Scanning | Git object traversal, commit scanning |

Benchmarks

| Report | What it contains |
|--------|------------------|
| findings.md | 128-run benchmark: wall time, throughput, peak memory across 8 repos |
| perf_analysis.md | CPU-level profiling: 24 hardware counters on vscode |
| architecture-comparison.md | Side-by-side code analysis mapping design decisions to hardware counters |
