
Verifiable AI You Can Prove: A Practical Guide to ZKML Inference and Deployment

February 23, 2026

Most AI systems ask for trust: trust the model weights, trust the server, trust the logs. But what if a prediction could prove itself? Zero‑knowledge machine learning (ZKML) does that. A model runs, and alongside the output it produces a compact cryptographic proof that says, in effect: “This answer is correct for these committed weights and (optionally) these hidden inputs.” No peeking at your data. No blind faith in a vendor’s server. Just math you can verify anywhere, including on a blockchain or in a mobile app.

This piece is a practical, step‑by‑step guide to shipping verifiable AI. You’ll learn what ZKML is, when to use it, how to pick an approach, and how to build a small, shippable system without drowning in theory. We’ll keep the language plain and the focus on what works today.

What ZKML Actually Proves

Zero‑knowledge proofs assure a verifier that a computation was performed correctly, without revealing the private parts of that computation. In ZKML, the “computation” is an ML inference:

  • You have a model with specific weights and a known architecture.
  • You have inputs (features, an image, or audio).
  • You run inference, producing outputs (scores, classes, or continuous values).

A ZK proof binds all of this together. It certifies: “Given these committed weights and (optionally) hidden inputs, the output is exactly what the forward pass of this model yields.” The verifier checks the proof quickly without redoing the heavy compute.

Why not just use trusted hardware?

Trusted Execution Environments (TEEs) are useful but different. TEEs rely on hardware vendors and attestation services. ZK proofs are mathematical and vendor‑neutral. You can combine both, but ZKML lets you verify claims even if you don’t control the other side’s hardware or cloud.

Where ZKML Fits Today

ZKML is not for every workload yet. It shines when:

  • Verifiability is a feature: compliance audits, provable scoring, or a guarantee that a specific model was used.
  • Privacy is necessary: users prove eligibility or model output without revealing the raw inputs.
  • Cross‑trust environments exist: different companies, or apps that don’t trust each other’s servers.
  • On‑chain or public verification is helpful: settle rewards, gate access, or certify fairness.

Real use cases you can build now

  • Private eligibility proofs: A student proves a discount classifier would accept them, without leaking their personal details.
  • Model‑as‑a‑service with receipts: A vendor returns a prediction and a proof that they used the promised weights and code path.
  • On‑device filters: A phone runs a tiny CNN for image moderation and ships a proof to a server to avoid re‑scoring the upload.
  • Provable recommendations: A marketplace proves a ranking respected policy constraints (e.g., safety scores over a threshold).

The Building Blocks, Without the Hype

ZKML has a few core parts. Understanding them once saves you weeks of confusion later.

Circuits: how math becomes provable

Your model must be expressed as a set of constraints called a circuit. Two common families exist:

  • Arithmetic/Plonkish circuits (Halo2, Plonky2, gnark): You describe the forward pass with additions, multiplications, lookups, and range checks. Great for small to mid‑sized neural nets.
  • zkVMs (RISC Zero, zkWASM): You write normal code (Rust, C, Wasm). The VM proves the program ran correctly. Easier to port general logic; slower per operation right now than specialized circuits.

SNARKs vs STARKs

  • SNARKs (e.g., KZG‑based): Small proofs, quick verification, sometimes need trusted setup. Many mature libraries.
  • STARKs: Transparent (no trusted setup), scalable, larger proofs. Often faster provers at scale.

Both are fine. Your choice depends on ecosystem and constraints. If you want Ethereum on‑chain verification fees to be tiny, SNARKs help. If you want easy setup and horizontal scaling, STARKs are appealing.

Quantization: getting real numbers into finite fields

Most ZK systems work over finite fields, not floating‑point. You will typically use fixed‑point integer math:

  • Quantize weights and activations to int8 or int16 with per‑tensor or per‑channel scales.
  • Track scales carefully. Every multiply increases magnitude; clamp or rescale to avoid overflow.
  • Use lookup tables for activations like ReLU or sigmoid approximations.

This is where many first projects fail: they forget to assert value ranges, or they lose accuracy through poor calibration. Spend time here.
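To make the scale bookkeeping concrete, here is a minimal sketch of per-tensor symmetric int8 quantization with an explicit overflow check. The function names (`quantize`, `requantize`) are illustrative, not from any particular library:

```python
# Sketch: per-tensor symmetric int8 quantization with explicit scale tracking.

def quantize(values, num_bits=8):
    """Map floats to signed integers with a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for int8
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def requantize(acc, scale_in, scale_w, scale_out, num_bits=8):
    """Rescale a wide accumulator back into the int8 range.

    Every multiply combines scales (scale_in * scale_w); rescaling after
    each layer prevents the magnitude blow-up described above."""
    qmax = 2 ** (num_bits - 1) - 1
    real = acc * scale_in * scale_w                   # back to the real-number domain
    q = round(real / scale_out)
    assert -qmax - 1 <= q <= qmax, "range check failed: overflow"   # fail closed
    return q
```

In a circuit, the `assert` becomes an explicit range constraint; the point of sketching it here is that the check exists at all, rather than letting values wrap silently in the field.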

Commitments: pin the exact model

You don’t want someone to swap a different model after you verify. Hash the weights and architecture metadata using a SNARK‑friendly hash such as Poseidon, and include that hash in your circuit. You’ll hear the term model commitment: it’s just this binding.
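A minimal sketch of such a commitment follows. SHA-256 stands in for a SNARK-friendly hash like Poseidon (which would come from your proving library); the domain-separation label and metadata fields are illustrative:

```python
# Sketch: a model commitment over weights plus architecture metadata.
# SHA-256 is a stand-in for a SNARK-friendly hash such as Poseidon.
import hashlib
import json

def model_commitment(weight_bytes: bytes, metadata: dict, label: str = "MODEL_V1") -> str:
    """Bind weights AND architecture so neither can be swapped independently."""
    meta_canonical = json.dumps(metadata, sort_keys=True).encode()   # canonical encoding
    h = hashlib.sha256()
    h.update(label.encode())          # domain separation: not reusable in other contexts
    h.update(meta_canonical)
    h.update(weight_bytes)
    return h.hexdigest()

meta = {"ops": ["Gemm", "Relu", "Gemm"], "shapes": [[20, 32], [32, 1]], "version": "1.0"}
c1 = model_commitment(b"\x01\x02", meta)
c2 = model_commitment(b"\x01\x03", meta)   # one weight byte changed
assert c1 != c2
```

Changing a single weight byte, a shape, or the label produces a different commitment, which is exactly the binding property you want.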

Pick an Approach You Can Ship

There are three pragmatic paths for a first ZKML deployment.

1) Use a neural‑net‑aware circuit tool (fastest path)

Tools like ezkl and gnark‑based stacks let you import ONNX models and generate circuits with common layers (conv, matmul, ReLU). This is the most straightforward way to prove small CNNs or MLPs.

  • Pros: Friendly to ML engineers. Built‑in quantization and layer support. Good documentation.
  • Cons: You still need to reason about ranges, scales, and supported ops. Big models won’t fly.

2) Write a bespoke circuit in Halo2/Plonky2/gnark (max control)

If your model is small and structured (e.g., a 2‑layer MLP), hand‑rolling a circuit can be worth it. You will get tighter constraints and smaller proofs.

  • Pros: Highly efficient for the exact model. Easy to add custom checks.
  • Cons: Learning curve. Circuit bugs are subtle. Not portable across teams.

3) Use a zkVM (best for mixed logic)

If your pipeline has a lot of preprocessing or non‑ML logic, consider a zkVM such as RISC Zero or zkWASM. You can reuse ordinary code, then optimize hot math later.

  • Pros: Regular dev workflow. Good for signature checks, parsing, policy rules around the model.
  • Cons: Heavier proofs. You still need to quantize math for performance.

Design a Minimal, Useful ZKML Project

Let’s sketch a small, shippable system: a binary classifier that decides whether a listing in a marketplace meets quality standards. A seller wants proof that the platform’s “quality pass” was computed by the promised model. The platform wants to keep its inputs and full model hidden to avoid gaming. ZKML fits.

Model choice

  • Inputs: 20 normalized numeric features (e.g., text length, photo count, structured checks).
  • Model: A 2‑layer MLP: 20→32 (ReLU) → 1 (sigmoid approximation), trained in PyTorch.
  • Quantization: Post‑training to int8 with calibration set; per‑layer scales.
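The integer forward pass the circuit would constrain can be sketched in a few lines. Dimensions, the shift-based rescale, and the helper names are illustrative; a real circuit would also range-check every intermediate value:

```python
# Sketch: the quantized forward pass for the 2-layer MLP above.

def int_linear(x, w, b):
    """Integer matmul + bias: x has length n, w is m rows of length n, b has length m."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bj for row, bj in zip(w, b)]

def relu(v):
    return [max(0, a) for a in v]

def mlp_forward(x_q, w1, b1, w2, b2, shift=7):
    """Two-layer quantized forward pass. The >> shift stands in for a
    per-layer rescale; in the circuit each value is also range-checked."""
    h = relu(int_linear(x_q, w1, b1))
    h = [a >> shift for a in h]          # rescale accumulators back toward int8 range
    out = int_linear(h, w2, b2)
    return out[0]                        # raw logit; sigmoid approximation comes next
```

Everything here is integer arithmetic, additions, multiplications, and comparisons, which is what makes the pass straightforward to express as circuit constraints.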

What the proof should reveal

  • Public outputs: The binary decision bit (pass/fail) and a confidence score bucket (e.g., low/med/high).
  • Private inputs: The raw features remain hidden. Only their hashes and range checks are inside the proof.
  • Bound model: The proof binds to a Poseidon hash of the exact ONNX weights and a version string.

Workflow

  1. Train and export: Train the MLP. Export to ONNX. Save an audit bundle (training seeds, metrics).
  2. Quantize: Use ONNX Runtime or a toolkit to int8. Verify accuracy on a holdout set.
  3. Commit: Compute a Poseidon hash of the quantized weights plus metadata (op list, shapes, version).
  4. Prover service: The platform runs inference and creates a proof using an ezkl‑style circuit with input range checks.
  5. Receipt to user: The platform returns {decision, proof, model_commitment} to the seller.
  6. Verification: Anyone can verify the proof offline or via a lightweight HTTP verifier, without revealing features.
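The receipt from step 5 and the check from step 6 can be sketched as follows. `verify_proof` is a stand-in for your proving library's verifier, and the commitment value is illustrative:

```python
# Sketch: the {decision, proof, model_commitment} receipt and its verification.
import json

EXPECTED_COMMITMENT = "a3f1c0de"   # published in your docs (illustrative value)

def make_receipt(decision: bool, proof_bytes: bytes, commitment: str) -> str:
    return json.dumps({
        "decision": decision,
        "proof": proof_bytes.hex(),
        "model_commitment": commitment,
    })

def check_receipt(receipt_json: str, verify_proof) -> bool:
    r = json.loads(receipt_json)
    if r["model_commitment"] != EXPECTED_COMMITMENT:
        return False                    # wrong or swapped model: reject before verifying
    return verify_proof(bytes.fromhex(r["proof"]), r["decision"])
```

Checking the commitment before the proof keeps the failure modes distinct: a mismatched model and an invalid proof are different problems with different fixes.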

Why this is practical

It is small enough to prove in seconds to minutes and needs no specialized hardware. You’ve made the decision publicly verifiable while keeping both first‑party data and proprietary weights secret.

Engineering Details That Matter

Quantization and accuracy

  • Calibrate scales with a representative dataset. Auto‑calibration usually helps; inspect edge cases.
  • Check overflow: Add explicit range constraints for layer outputs. Fail closed instead of clamping silently.
  • Approximate activations: ReLU is easy. Sigmoid can be a piecewise linear lookup. Document its error bound.
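As a sketch of the lookup approach with a documented error bound: precompute the activation for every representable int8 input, then measure the worst-case error. The scale is illustrative:

```python
# Sketch: a lookup-table sigmoid over int8 inputs with a measured error bound.
import math

SCALE = 16.0   # fixed-point scale: integer q represents q / SCALE

# Precompute sigmoid for every representable int8 input. In a Plonkish
# circuit this table becomes a lookup argument, not per-entry constraints.
TABLE = {q: round(1.0 / (1.0 + math.exp(-q / SCALE)) * SCALE) for q in range(-128, 128)}

def sigmoid_lut(q: int) -> int:
    return TABLE[max(-128, min(127, q))]

# Document the worst-case quantization error over the table's domain.
max_err = max(abs(TABLE[q] / SCALE - 1.0 / (1.0 + math.exp(-q / SCALE)))
              for q in range(-128, 128))
assert max_err <= 0.5 / SCALE   # rounding error is at most half a step
```

The measured `max_err` is the number you publish: users reasoning about a decision threshold need to know how far the proved score can drift from the float model's score.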

Model commitments done right

  • Commit to weights + architecture + opset version. If you only hash weights, someone could swap the graph.
  • Use a domain‑separated hash label (e.g., “MODEL_V1”) so commitments aren’t reused out of context.
  • Expose the hash publicly in docs. Your users should know what to expect.

Prover performance tips

  • Batch proofs when the same model runs over many inputs. Amortize constraint setup.
  • Use lookups for expensive non‑linear ops. They cut constraints a lot in Plonkish systems.
  • GPU acceleration: For polynomial commitments or FFTs, enable CUDA backends if the library supports them.
  • Separate preprocessing: Do normalization, tokenization, or resizing outside the circuit if it doesn’t affect trust assumptions.

Security and integrity checks

  • Code path pinning: Prove not just math, but the exact path taken (e.g., which activation is used). zkVMs make this easier.
  • Input bounds: If inputs must be in [0, 1], prove it. Otherwise, attackers can skew the model with out‑of‑range values.
  • Versioning: Include a version string in the public inputs. Rotate keys and commitments on updates.
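A plain-Python sketch of these checks, where each `assert` stands in for a circuit constraint (equality for the version, range for each input):

```python
# Sketch: input and version checks the circuit should enforce.
# In a real circuit these are constraints, not runtime asserts.

def check_inputs(features, version: str, expected_version: str):
    assert version == expected_version, "version pinning failed"
    for f in features:
        assert 0 <= f <= 1, "input out of range [0, 1]"   # fail closed

check_inputs([0.2, 0.9], "MODEL_V1", "MODEL_V1")
```

If these checks live only in application code rather than the circuit, the proof says nothing about them, and an attacker can skip them.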

Client vs Server vs On‑Chain Verification

Client‑proved, server‑verified

The user runs the model locally (say, on a laptop) and sends {output, proof} to your server. This is ideal when you never want raw inputs to leave the device. The server can accept or reject based on the verified output only.

Server‑proved, client‑verified

Your API returns proofs for its responses. This builds trust in a marketplace or ad network where counterparties don’t control your stack. Latency is higher than a normal API, so pre‑compute or batch when possible.

On‑chain verified

Smart contracts can check SNARK or STARK proofs to gate rewards or settle claims. Today, keep on‑chain circuits tiny and use recursive or aggregated proofs to shrink verification cost. You’ll likely do the heavy proving off‑chain and post a succinct receipt on‑chain.

A Concrete Build Plan (Week by Week)

Week 1: Scope and model

  • Pick a binary or small multi‑class task with 10–50 features or a 1–2 layer CNN.
  • Train and export to ONNX. Lock a baseline accuracy.

Week 2: Quantization and circuit selection

  • Quantize to int8. Validate accuracy delta is acceptable.
  • Choose a tool: start with ezkl (neural nets) or a zkVM (mixed logic).

Week 3: Commitments and constraints

  • Implement Poseidon weight commitment. Pin architecture metadata.
  • Add input range assertions, activation lookups, and any policy checks you want to prove.

Week 4: Prover service and API

  • Wrap the prover in a microservice. Expose endpoints: /predict (returns output+proof) and /verify (idempotent public verifier).
  • Log proof size, time, and memory. Track failures with actionable error messages.
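The two endpoints can be sketched as plain handlers, wired into whatever web framework you use. `prove` and `verify` are stand-ins for your proving library's calls, and the logged fields mirror the metrics above:

```python
# Sketch: framework-agnostic handlers for /predict and /verify.
import json
import time

def handle_predict(features, model, prove):
    t0 = time.time()
    output, proof = prove(model, features)            # the heavy step
    return json.dumps({
        "output": output,
        "proof": proof.hex(),
        "model_commitment": model["commitment"],
        "prove_seconds": round(time.time() - t0, 3),  # track proving time
        "proof_bytes": len(proof),                    # track proof size
    })

def handle_verify(payload_json, verify, expected_commitment):
    """Idempotent: the same payload always yields the same answer."""
    p = json.loads(payload_json)
    if p["model_commitment"] != expected_commitment:
        return {"ok": False, "error": "unknown model commitment"}   # actionable error
    return {"ok": verify(bytes.fromhex(p["proof"]), p["output"])}
```

Keeping `/verify` free of side effects means anyone can re-run it safely, which is the whole point of a public verifier.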

Week 5: Verification clients

  • Publish a tiny verifier SDK for Node and Python. Include one example script and a checksum for known good commits.
  • Optionally, deploy a demo contract on a low‑fee testnet to verify small SNARK receipts.

Testing: Don’t Skip the Boring Parts

Property tests

  • Generate random valid inputs; assert that plain inference == proved inference within quantization error.
  • Fuzz inputs near bounds to catch off‑by‑one errors and overflow.
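A toy version of this property test, using a dot product as the "inference" so the whole thing is self-contained; swap in your real float path and circuit path. The weights, scale, and tolerance are illustrative:

```python
# Sketch: property test that the float path matches the quantized (provable)
# path within the quantization error budget, including inputs near the bounds.
import random

SCALE = 128
W_FLOAT = [0.5, -0.25, 0.125]
W_Q = [round(w * SCALE) for w in W_FLOAT]

def float_infer(x):
    return sum(w * xi for w, xi in zip(W_FLOAT, x))

def quant_infer(x):
    xq = [round(xi * SCALE) for xi in x]
    acc = sum(w * xi for w, xi in zip(W_Q, xq))      # integer accumulator
    return acc / (SCALE * SCALE)                      # dequantize for comparison

random.seed(0)
for _ in range(1000):
    x = [random.uniform(0, 1) for _ in range(3)]                      # valid-range inputs
    assert abs(float_infer(x) - quant_infer(x)) < 0.02
    edge = [random.choice([0.0, 1.0, 1.0 - 1 / SCALE]) for _ in range(3)]  # fuzz the bounds
    assert abs(float_infer(edge) - quant_infer(edge)) < 0.02
```

The tolerance should come from your documented error analysis, not from loosening the bound until the test passes.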

Negative tests

  • Flip one bit in the weight file and confirm verification fails.
  • Change the architecture metadata; the commitment should break.
  • Feed out‑of‑range inputs; the circuit must reject them.
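The first two negative tests are cheap to automate against the commitment function. `commit` here is a stand-in for your Poseidon-based commitment:

```python
# Sketch: negative tests for the commitment path.
import hashlib
import json

def commit(weights: bytes, meta: dict) -> str:
    return hashlib.sha256(json.dumps(meta, sort_keys=True).encode() + weights).hexdigest()

weights = bytes(range(16))
meta = {"ops": ["Gemm", "Relu"], "version": "1.0"}
baseline = commit(weights, meta)

# Flip one bit in the weight file: the commitment must change.
flipped = bytes([weights[0] ^ 0x01]) + weights[1:]
assert commit(flipped, meta) != baseline

# Change the architecture metadata: the commitment should break too.
assert commit(weights, {**meta, "ops": ["Gemm", "Gemm"]}) != baseline
```

Tests like these catch the most embarrassing failure mode: a verifier that happily accepts proofs for a model nobody committed to.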

Audit bundle

  • Bundle: training code commit, dataset hash, quantization config, model commitment, circuit version, and proof transcript for a canonical test input.

Performance and Cost Reality

Proofs take longer than plain inference. Plan for it:

  • Latency targets: Expect seconds to a few minutes for small models. Hide latency with async receipts or precomputation.
  • Compute cost: Provers are heavy. Use spot GPUs for batch jobs. Auto‑scale by queue depth.
  • Proof size: SNARK proofs are small (a few hundred bytes to a few KB). STARK proofs are larger (often hundreds of KB). This affects network costs and on‑chain fees.

Privacy Choices: What to Reveal

Zero‑knowledge gives you knobs:

  • Hide inputs: Prove you used valid inputs within ranges without exposing raw values.
  • Hide weights: Commit to weights but don’t reveal them; this protects IP.
  • Release only thresholds: Instead of a full score, prove the score > T. This enables fine‑grained access control without oversharing.

Be explicit with partners. Document which fields are public and which are private, and why.

Pitfalls to Avoid

  • Ignoring preprocessing: If your model expects standardized inputs, either prove the standardization or make inputs pre‑standardized by contract.
  • Floating‑point leakage: Mixing float and fixed‑point paths leads to mismatches. Keep a single authoritative quantized path.
  • Unbounded activations: Forgetting range checks lets adversaries force wraparound or undefined behavior in finite fields.
  • Silent model drift: Re‑exported weights with different ordering can still hash differently. Lock a canonical export recipe and stick to it.

How This Scales Next

As you grow beyond tiny models, consider:

  • Recursion: Prove many small inferences and fold them into a single succinct proof.
  • Hybrid stacks: Use zkVM for control flow and a specialized circuit for the hot matmul/conv kernels.
  • Model distillation: Distill a large model into a small, provable student. Trade some accuracy for tractable proofs.
  • Hardware acceleration: GPUs and specialized libraries now speed up MSMs, FFTs, and FRI steps. Turn them on.

Quick Start: Your First ZKML Prototype

Here’s a condensed checklist you can follow to get to “hello world” with verifiable inference:

  • Pick a 1D MLP (≤64 hidden units) or a 1‑layer conv net. Export to ONNX.
  • Quantize to int8. Save the scale factors per layer.
  • Use a neural‑net‑aware prover (e.g., ezkl) to generate a circuit; enable ReLU lookup tables.
  • Hash the ONNX file and metadata using Poseidon; include the hash as a public input.
  • Run the prover on a sample. Compare outputs between plain and proved runs within tolerance.
  • Ship a small CLI: “predict --model M.onnx --features f.json --proof out.proof --public pub.json”.
  • Write a 50‑line verifier in the language your customers use. Teach it to parse pub.json and check out.proof.

Once that works end‑to‑end, layer on batching, better observability, and a REST API.
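A sketch of that small verifier: parse pub.json, check the pinned commitment, then hand the proof to the proving library's verify call (represented here by a caller-supplied function):

```python
# Sketch: a minimal file-based receipt verifier.
import json

def verify_receipt(pub_path, proof_path, expected_commitment, verify_proof):
    with open(pub_path) as f:
        pub = json.load(f)
    if pub.get("model_commitment") != expected_commitment:
        print("reject: unexpected model commitment")
        return False
    with open(proof_path, "rb") as f:
        proof = f.read()
    ok = verify_proof(proof, pub)
    print("accept" if ok else "reject: proof invalid")
    return ok
```

Note the asymmetry: the verifier never loads the model, the weights, or the private features. It only needs the public inputs, the proof, and the published commitment.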

Governance and Trust, Without the Hand‑Waving

A proof is only as good as your contract with users about what’s inside:

  • Publish commitments and versions so others can monitor for drift.
  • Explain quantization error bounds and how they affect thresholds.
  • Document update policy: How often can the model change? How do you announce new commitments?

Verifiability is a product surface. Treat it with the same polish you give your UI or API ergonomics.

Summary:

  • ZKML lets you ship AI predictions with cryptographic proofs of correctness, without revealing private inputs or weights.
  • Start small: int8 MLPs or tiny CNNs using circuit libraries like ezkl or a zkVM for mixed logic.
  • Get quantization and range checks right. Commit to the exact weights and architecture with a SNARK‑friendly hash.
  • Pick a verification pattern: client‑proved, server‑proved, or on‑chain. Hide latency with batching and async receipts.
  • Test like you mean it: property tests, negative tests, and an audit bundle keep you honest.
  • Scale with recursion, hybrid stacks, and hardware acceleration when your use case proves its value.


Andy Ewing, originally from coastal Maine, is a tech writer fascinated by AI, digital ethics, and emerging science. He blends curiosity and clarity to make complex ideas accessible.