Fission

Batch Benchmark

This is the canonical operator-facing benchmark guide for decompilation quality and Ghidra-parity work. Criterion/perf helpers under scripts/benchmark/ are non-canonical and should be treated as microbenchmark tooling only.

Canonical benchmark script root:

Default artifact root:

Fission keeps two benchmark entrypoints:

full_decomp_benchmark.py compares:

The runner now supports both:

Requirements

Example Usage

# Historical fixed-seed comparison
python3 benchmark/full_benchmark/compare_legacy_preview.py \
  samples/windows/x64/putty.exe \
  --addresses 0x140006380 \
  --with-ghidra \
  --repeat 3 \
  --fission-bin target/release/fission_cli \
  --output-dir benchmark/artifacts/full_benchmark/compare_legacy_preview/putty-fixed

# Whole-binary benchmark
python3 benchmark/full_benchmark/full_decomp_benchmark.py \
  samples/windows/x64/putty.exe \
  --fission-bin target/release/fission_cli \
  --ghidra-dir vendor/ghidra/ghidra-Ghidra_12.0.4_build \
  --output-dir benchmark/artifacts/full_benchmark/putty-balanced-latest

# Faster validation: first N canonical seed functions
python3 benchmark/full_benchmark/full_decomp_benchmark.py \
  samples/windows/x64/test_control_flow_x64_O0.exe \
  --limit 30 \
  --timeout 300

# Smoke corpus benchmark
python3 benchmark/full_benchmark/full_decomp_benchmark.py \
  --corpus-manifest benchmark/config/benchmark_corpus/smoke_corpus.json \
  --fission-bin target/release/fission_cli \
  --ghidra-dir vendor/ghidra/ghidra-Ghidra_12.0.4_build \
  --output-dir benchmark/artifacts/full_benchmark/fission-smoke-windows-samples-balanced-latest

# Parity corpus benchmark for Ghidra-reference work
python3 benchmark/full_benchmark/full_decomp_benchmark.py \
  --corpus-manifest benchmark/config/benchmark_corpus/parity_corpus.json \
  --fission-bin target/release/fission_cli \
  --ghidra-dir vendor/ghidra/ghidra-Ghidra_12.0.4_build \
  --output-dir benchmark/artifacts/full_benchmark/fission-ghidra-parity-windows-workbench-balanced-latest \
  --baseline-dir benchmark/artifacts/full_benchmark/fission-ghidra-parity-windows-workbench-balanced-baseline

# Release corpus benchmark against a previously accepted corpus baseline
python3 benchmark/full_benchmark/full_decomp_benchmark.py \
  --corpus-manifest benchmark/config/benchmark_corpus/release_corpus.json \
  --fission-bin target/release/fission_cli \
  --ghidra-dir vendor/ghidra/ghidra-Ghidra_12.0.4_build \
  --output-dir benchmark/artifacts/full_benchmark/fission-release-windows-samples-balanced-latest \
  --baseline-dir benchmark/artifacts/full_benchmark/fission-release-windows-samples-balanced-baseline

Output Naming

Default output directories are split by workflow:

Generated Artifacts

Regression Validation (limit 2 / 20)

Use the helper below to run full_decomp_benchmark.py twice, once with --limit 2 and once with --limit 20, and then validate the two runs:

python3 benchmark/full_benchmark/validate_limit_regression.py \
  samples/windows/x64/test_control_flow_x64_O0.exe \
  --fission-bin target/debug/fission_cli \
  --ghidra-dir vendor/ghidra/ghidra-Ghidra_12.0.4_build
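
As a rough illustration of the two runs described above, the sketch below builds one full_decomp_benchmark.py command line per --limit value. It is not the implementation of validate_limit_regression.py: the function name build_limit_commands is hypothetical, only flags already shown in this guide are used, and the validation step is left out because its checks are not described here.

```python
def build_limit_commands(binary, fission_bin, ghidra_dir, limits=(2, 20)):
    """Return one full_decomp_benchmark.py argv list per --limit value.

    Hypothetical helper for illustration; it only assembles arguments
    and does not run or validate anything.
    """
    script = "benchmark/full_benchmark/full_decomp_benchmark.py"
    return [
        ["python3", script, binary,
         "--limit", str(limit),
         "--fission-bin", fission_bin,
         "--ghidra-dir", ghidra_dir]
        for limit in limits
    ]


if __name__ == "__main__":
    for argv in build_limit_commands(
        "samples/windows/x64/test_control_flow_x64_O0.exe",
        "target/debug/fission_cli",
        "vendor/ghidra/ghidra-Ghidra_12.0.4_build",
    ):
        print(" ".join(argv))
```

Each returned list can be passed directly to subprocess.run if you want to execute the runs by hand.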

Corpus Manifest

The corpus manifest is a JSON file with an entries array. Each entry keeps the benchmark contract minimal:

Optional:

Checked-in defaults:

Current checked-in suites are intentionally constrained to samples/windows so Ghidra-parity work stays on Windows x86/x64 binaries only.
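
For a quick sanity check of a manifest before a long run, a minimal sketch like the following may help. It relies only on the one structural fact stated above, a JSON file with a top-level entries array; the function name load_manifest_entries is hypothetical, and no per-entry field names are assumed.

```python
import json
from pathlib import Path


def load_manifest_entries(manifest_path):
    """Load a corpus manifest and return its entries list.

    Only the top-level "entries" array is checked; the shape of each
    entry is deliberately not validated in this sketch.
    """
    data = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    entries = data.get("entries") if isinstance(data, dict) else None
    if not isinstance(entries, list):
        raise ValueError(f"{manifest_path}: expected a top-level 'entries' array")
    return entries


if __name__ == "__main__":
    path = "benchmark/config/benchmark_corpus/smoke_corpus.json"
    print(f"{path}: {len(load_manifest_entries(path))} entries")
```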

Top-level manifest metadata:

Quality Metrics

Owner-Aware Metrics

Corpus and per-binary summaries now surface stable owner-facing counters already present in summary.engines.fission / preview_build_stats.

Current owner metrics:

These appear in:

Shape-Drift Proxies

The benchmark also carries presentation-oriented proxy metrics so semantic owner drift and surface-shape drift are easier to separate.

Current proxy set:

synthetic_helper_call_total is derived conservatively: it counts only call sites of the form __fission_*(. These metrics are not semantic truth; they exist to highlight surfacing drift.
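
To make the conservative counting rule concrete, here is a minimal sketch of one way to count __fission_*( call sites in decompiled text. It is an assumption about how such a counter could look, not the benchmark's actual code: the function name, the regular expression, and the helper names in the sample string are all hypothetical.

```python
import re

# An identifier that starts with "__fission_" and is immediately followed
# by "(". The lookbehind rejects matches inside a longer identifier, which
# keeps the count conservative.
_HELPER_CALL = re.compile(r"(?<![A-Za-z0-9_])__fission_\w*\(")


def count_synthetic_helper_calls(decompiled_text):
    """Count call sites that look like __fission_*( in the given text."""
    return len(_HELPER_CALL.findall(decompiled_text))


if __name__ == "__main__":
    # The helper names below are made up for illustration.
    sample = "x = __fission_rotl(a, 3); y = my__fission_x(1); z = __fission_flag;"
    print(count_synthetic_helper_calls(sample))
```

In this sketch a bare reference without a following parenthesis, or a name that merely contains __fission_, does not count.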

Speed Metrics

Release Model

The release owner is no longer putty.exe alone.

Suite Purposes

Watchlists

Row fidelity is no longer intended to be putty-only, but the checked-in suites are still Windows-only.

Dynamic row selection reasons are explicit and stable:

Advisory Rollout

Corpus suites currently default to gate_mode=advisory.

Recommended workflow:

  1. local unit / invariant tests
  2. smoke corpus benchmark
  3. parity corpus benchmark for Windows reference-guided work
  4. release corpus benchmark only for Windows promotion candidates
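
The ordering above can be sketched as a small staged runner. This is an illustrative assumption rather than a tool shipped with Fission: run_stages is a hypothetical name, stopping at the first failing stage is a choice made for this sketch, and the command for the local unit tests is not given in this guide, so the caller has to supply it.

```python
import subprocess
import sys


def run_stages(stages):
    """Run (name, argv) stages in order and stop at the first non-zero exit.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, argv in stages:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"stage failed: {name}", file=sys.stderr)
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    script = "benchmark/full_benchmark/full_decomp_benchmark.py"
    manifests = "benchmark/config/benchmark_corpus"
    # Stage 1 (local unit / invariant tests) is omitted here because its
    # command is not specified in this guide.
    done = run_stages([
        ("smoke", ["python3", script, "--corpus-manifest", f"{manifests}/smoke_corpus.json"]),
        ("parity", ["python3", script, "--corpus-manifest", f"{manifests}/parity_corpus.json"]),
    ])
    print("completed:", done)
```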

Investigating Timeout Causes

If --limit 20 results in a 900-second timeout:

# Identify the culprit function by testing each one individually
python3 benchmark/full_benchmark/find_timeout_culprit.py \
  samples/windows/x64/putty.exe \
  --limit 20 \
  --timeout 120 \
  --verbose

For the full procedure, see docs/debug/TIMEOUT_DEBUG_GUIDE.md.
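
The idea of testing each function individually can be sketched as follows. This is not the implementation of find_timeout_culprit.py, whose internals are not described here; find_timeouts and make_argv are hypothetical names, and the sketch only shows the general pattern of running one command per item under a per-item timeout.

```python
import subprocess


def find_timeouts(items, make_argv, timeout_s):
    """Run one command per item and return the items whose command timed out.

    make_argv(item) must return the argv list to execute for that item.
    subprocess.run kills the child process when the timeout expires.
    """
    culprits = []
    for item in items:
        try:
            subprocess.run(
                make_argv(item),
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            culprits.append(item)
    return culprits
```

A caller would supply make_argv to build a single-function command line for each address and then inspect the returned list for the slow functions.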

Current Validation Snapshot