
Proof-Carrying Models: Practical ZKML for Verifying AI Without Trusting the Server

December 04, 2025

Why verify AI at all?

Most of us accept AI results on faith. A server says “this image is a cat,” your app agrees, and that’s the end of it. But when the stakes move beyond curiosity—payments, marketplace moderation, identity checks, trading, or safety systems—faith is not enough. You want a result you can independently verify without trusting whoever ran the model.

That’s where proof‑carrying models come in. Instead of sending only an answer, the system sends the answer plus a compact proof that the answer was produced by a specific model on specific inputs. The verifier checks the proof in milliseconds, even if the original computation took seconds or minutes. This approach, often called ZKML (zero‑knowledge machine learning), lets you build AI services that are verifiable anywhere—inside another company’s backend, on a low‑power device, or even on a public blockchain.

Real-world cases where you’d want this

  • Marketplaces and content platforms: Require a proof that a moderation model flagged content using a specific version, to avoid “shadow” models with different rules.
  • Finance and rewards: Pay out only if a proof confirms the score or risk rating came from the committed model, not a tuned variant.
  • Games and online competitions: Accept tournament entries only if a proof shows the inference ran under a known policy, blocking server‑side tweaks.
  • Edge devices: Gate actions (like door unlocks) on proofs that an approved model evaluated sensor data, not a local hack.
  • On-chain ecosystems: Write smart contracts that verify an AI decision without trusting an oracle server. The chain checks the math, not the operator.

The result is a new security layer: you don’t just know what the model said—you can verify how it was obtained.

What ZKML actually proves

Zero‑knowledge systems allow one computer (the prover) to convince another (the verifier) that a computation was done correctly, without revealing the full details or redoing the work. In ZKML, the computation is the forward pass of a model.

Two useful patterns

  • Circuit-level proofs: You translate the model’s math directly into an arithmetic circuit, a giant graph of adds, multiplies, and lookups. Tools produce a proof from that circuit. This yields fast verification and small proofs for predictable models (e.g., small CNNs or tree ensembles). It takes work to translate the model and tune constraints, but you get great performance.
  • zkVMs (virtual machines): You compile regular code (C/C++/Rust or specialized languages) and run it inside a zero‑knowledge virtual machine. The VM proves that the code executed correctly. It’s easier to onboard because you reuse normal programming models, but proofs can be larger and slower than custom circuits.
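
To make the circuit-level idea concrete, here is a minimal sketch in plain Python of how a single dot product can be expressed as field arithmetic plus explicit constraints. It is illustrative only; real toolchains (circom, gnark, Halo2) generate and prove these constraints for you, and the modulus here is just a stand-in.

    # Illustrative sketch: a dot product expressed as constraints over a prime field.
    # This only shows the shape of the arithmetic a prover must satisfy.
    P = 2**61 - 1  # stand-in prime modulus for the proof system's field

    def dot_constraints(weights, inputs, claimed_output):
        """Return constraints for y = w . x (mod P) and whether they are satisfied."""
        constraints = []
        acc = 0
        for i, (w, x) in enumerate(zip(weights, inputs)):
            prod = (w * x) % P
            constraints.append(("mul", i, w, x, prod))   # each product is one gate
            acc = (acc + prod) % P
        constraints.append(("sum_equals_output", acc, claimed_output))
        return constraints, acc == claimed_output % P

    w, x = [3, 5, 2], [7, 1, 4]
    cs, ok = dot_constraints(w, x, sum(a * b for a, b in zip(w, x)))
    print(len(cs), "constraints; satisfied:", ok)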

Model commitments: locking the rules

To prevent model swapping, ZKML systems commit to model parameters via a cryptographic hash or a Merkle root. The proof guarantees: “I used weights whose hash equals H.” Version your models and publish the hash. Consumers can then require that exact version to accept results. If you need to rotate models, publish a new commitment and a deprecation date for old ones.
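
A minimal sketch of the commitment step, assuming the weights ship as a single file (ONNX, safetensors, or similar). The file name and record format are illustrative; only the digest is published, never the weights.

    # Sketch: commit to a model file by hashing its bytes and publishing the digest.
    import hashlib, json

    def commit_model(path: str, version: str) -> dict:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return {"version": version, "sha256": h.hexdigest()}

    # Publish this record; verifiers accept proofs only if they reference this hash.
    record = commit_model("moderation_v27.onnx", "v27")   # hypothetical file name
    print(json.dumps(record))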

What can’t be proved easily?

You can’t prove everything for free. Some parts are tricky or expensive:

  • Floating point: Zero‑knowledge systems prefer finite field arithmetic. Floating‑point emulation is heavy. Most pipelines use fixed‑point or integer arithmetic (quantization).
  • Nonlinear ops and big activations: ReLU, softmax, and attention can be expensive. Builders use lookup tables, piecewise polynomials, or different architectures to keep costs down.
  • Huge models: Proving a 7B‑parameter model in real time is out of reach today. ZKML shines with small to medium models and selective verification (e.g., classification heads or distilled submodels).

Choose models that prove well

The fastest route to production ZKML is not “prove a giant model.” It’s “design the smallest honest model that meets your product goal.” This often means adapting the model and pre/post‑processing so the proof stays cheap while the decision stays useful.

Linear models and trees

Logistic regression, linear SVMs, and tree ensembles are ZK‑friendly. The math is simple, constraints are modest, and quantization is straightforward. Tree models—especially gradient boosted trees—do surprisingly well for tabular problems, fraud scoring, or lightweight ranking. They’re easy to explain and to prove.

Compact CNNs and “transformer‑lite”

For images and audio, small CNNs and shallow attention blocks are practical with lookup tables and quantized layers. Use techniques like:

  • Integer or fixed‑point weights: Represent weights as scaled integers.
  • Lookup-accelerated activations: Replace expensive ops with table lookups enforced by permutations or commitment checks.
  • Depthwise separable convolutions: Fewer multiplies per output.
  • Pooling carefully: Average pooling is cheap; max pooling needs comparisons, which cost constraints. Benchmark both.

Transformers are heavy, but a small transformer head on top of frozen features or embeddings can be provable, especially if the head is linear and the token count is capped.

Quantization and activation choices

Quantize early. If you need 8‑bit or 16‑bit precision, fix it in training so accuracy holds. ReLU in circuits becomes “prove x ≥ 0” and “y = x if x ≥ 0 else 0,” which adds constraints. Alternatives include:

  • Hard‑swish/hard‑sigmoid approximations: Piecewise linear functions with cheap checks.
  • Small LUTs: Commit to a table and prove membership for each activation.

Track range proofs. Every intermediate value must be constrained to an allowed range to prevent overflow and maliciously crafted wraparound. This is core ZKML hygiene.
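
The sketch below, in plain Python with illustrative scale and range values, shows the kind of integer arithmetic a circuit actually sees: weights scaled to fixed point, ReLU expressed as a comparison, and an explicit range check on every intermediate.

    # Sketch: fixed-point inference with explicit range checks, mirroring circuit hygiene.
    SCALE = 2**8          # fixed-point scaling factor (assumed; match it to training)
    RANGE_BITS = 20       # allowed bit-width for intermediates (assumed)

    def quantize(x: float) -> int:
        return round(x * SCALE)

    def assert_in_range(v: int) -> int:
        # In a circuit this becomes a range proof; here it is a runtime check.
        assert -(1 << RANGE_BITS) <= v < (1 << RANGE_BITS), "overflow would break the proof"
        return v

    def relu(v: int) -> int:
        # ReLU as the circuit sees it: output v if v >= 0, else 0.
        return v if v >= 0 else 0

    def linear_relu(weights, inputs, bias):
        acc = quantize(bias) * SCALE                      # bias at the product scale
        for w, x in zip(weights, inputs):
            acc = assert_in_range(acc + quantize(w) * quantize(x))
        # Rescaling (division) needs its own constraints in a real circuit.
        return relu(acc // SCALE)

    print(linear_relu([0.5, -1.25], [2.0, 0.75], bias=0.1))   # ~0.1625 at scale 256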

Build a proof‑friendly pipeline

Proving the model pass is half the story. The rest is convincing the verifier that the inputs and outputs are tied to the right real‑world data and business rules.

Input integrity

  • Content hashing: Compute a hash of the raw input (image, audio, text) and feed it into the proof as a public value. The verifier can check that the input they see matches the hashed content the model received.
  • Merkle trees for streams: When the input is a chunk from a longer stream, include a Merkle proof that this chunk belongs to a known sequence. You can verify an excerpt without exposing the entire asset.
  • Simple, provable pre‑processing: Stick to resizing, normalization, tokenization, and feature extraction that are easy to encode and check in the proof. Avoid operations that are non‑deterministic or rely on large external state.
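
A sketch of the input side, assuming SHA-256 commitments: the digest of the raw input becomes a public value in the proof, and a Merkle root over chunks lets a verifier check that one excerpt belongs to a longer stream.

    # Sketch: hash the raw input and build a Merkle root over stream chunks.
    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(chunks: list[bytes]) -> bytes:
        level = [h(c) for c in chunks]
        while len(level) > 1:
            if len(level) % 2:                  # duplicate the last node on odd levels
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    frames = [b"frame-0", b"frame-1", b"frame-2"]    # illustrative stream chunks
    public_input_hash = h(frames[1])                 # the chunk the model actually saw
    stream_root = merkle_root(frames)                # published once for the whole stream
    print(public_input_hash.hex(), stream_root.hex())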

Output constraints

  • Argmax and thresholds: Don’t just output logits. Prove which index was max or that a score crossed a threshold that triggers a business action.
  • Top‑K consistency: If you display multiple labels, prove that the list is correctly ordered and that ties are broken deterministically.
  • Policy checks: You can fold simple policy logic into the proof: for example, “accept only if class ∈ {A,B,C} and score ≥ S.”
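
Here is a sketch of the output-side claim, written as ordinary Python for clarity; inside a circuit the same checks become constraints, and the verifier only learns whether they held. The class set and threshold are illustrative policy values.

    # Sketch: the business claim the proof should encode, mirrored as plain checks.
    ALLOWED = {0, 1, 2}          # class indices standing in for {A, B, C}
    THRESHOLD = 0.82             # illustrative acceptance threshold

    def decision_claim(scores: list[float]) -> tuple[int, bool]:
        # Deterministic tie-breaking: on equal scores, the lowest index wins.
        top = max(range(len(scores)), key=lambda i: (scores[i], -i))
        accepted = top in ALLOWED and scores[top] >= THRESHOLD
        return top, accepted

    print(decision_claim([0.05, 0.88, 0.07]))   # -> (1, True)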

Model IP hygiene

If your model is proprietary, don’t reveal weights. Commit to their hash. Only the proof and public commitment are visible. If you distribute the prover, use licensing and anti‑extraction measures, and rotate weights periodically with new commitments.

Who proves? Where to verify?

With proof‑carrying models, you have design choices. Each has trade‑offs in latency, cost, and trust boundaries.

Client proves, server verifies

The user’s device runs inference and produces a proof. The server verifies quickly and accepts only if the proof matches the published model. This pattern prevents server‑side tampering and is ideal for privacy‑sensitive inputs you don’t want to upload in raw form. Costs move to the client’s compute budget. Great for small on‑device models and occasional proofs.

Server proves, client verifies

The server runs inference and returns both result and proof. The client (or a third party) verifies. This is common for APIs and on‑chain uses. It adds server cost but enables portable trust; the result can be audited anywhere later. Logs become cryptographically meaningful.

Hybrid proving

Split the pipeline: client proves input preprocessing and commitment, server proves the model pass, a contract proves post‑processing checks. With recursion, you can fold multiple proofs into one compact object for end‑users or chains.

Performance in plain terms

Zero‑knowledge isn’t magic; it’s math with budgets. Expect:

  • Verification time: Usually milliseconds to tens of milliseconds. Predictable and cheap.
  • Proof size: From a few hundred bytes for the most succinct SNARKs to hundreds of kilobytes or a few megabytes for STARK‑style systems and zkVM receipts, depending on the system and circuit size.
  • Proving time: Ranges from sub‑second for tiny models to many seconds or minutes for heavier graphs. Throughput improves with batching and GPUs.

Use recursion to batch many micro‑inferences into a single proof. For streaming scenarios (e.g., frames or audio windows), prove each chunk and fold them into a rolling proof. You pay small per‑chunk time plus occasional “folding” cost.

Deployment patterns you can ship today

Proof‑gated API results

Offer a classification endpoint that returns {label, score, proof}. Clients verify before acting. Publish model commitments and SLA policies. Include a cache key that combines the input hash and model version so clients can deduplicate requests.
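
A sketch of the client side of such an endpoint. The field names, the published commitment, and the verify_proof callable are placeholders for whatever proof system and response schema you adopt, not a specific library's API.

    # Sketch: client-side handling of a proof-gated classification response.
    import hashlib

    EXPECTED_COMMITMENT = "<published hash of Model v53>"   # placeholder value

    def cache_key(input_hash: str, model_version: str) -> str:
        # Deduplicate requests by input hash plus model version.
        return hashlib.sha256(f"{input_hash}:{model_version}".encode()).hexdigest()

    def accept(response: dict, input_hash: str, verify_proof) -> bool:
        if response["model_commitment"] != EXPECTED_COMMITMENT:
            return False                         # wrong or revoked model version
        if response["input_hash"] != input_hash:
            return False                         # proof is about a different input
        return verify_proof(response["proof"],
                            public_inputs=[input_hash,
                                           response["model_commitment"],
                                           response["label"],
                                           response["score"]])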

On‑chain decisions without oracles

Wrap your verifier in a smart contract. For decisions like “accept only if score ≥ 0.8,” the chain checks the proof and flips a state bit. No one has to trust an off‑chain admin. Watch gas costs: pick proof systems with small verification footprints, and keep circuits tight.

Attested edges plus proofs

Combine hardware attestation on edge devices with ZK proofs. The device attests it’s genuine; the proof certifies the model ran correctly. This defends against both device spoofing and model swapping. If attestation fails, discard the proof.

Audited moderation pipelines

Moderation often mixes models, thresholds, and heuristics. Create a verifiable slice for the decisive step: “Was the final block triggered by Model v27 with threshold T?” Keep the rest flexible. This gives accountability without freezing your entire stack.

Tooling to start (with practical notes)

You don’t have to write provers from scratch. Pick a stack that fits your team’s skills and your target platform.

  • Circom + snarkjs: A mature circuit DSL and toolkit for Groth16 proofs. Great for small to medium circuits and on‑chain verification. Needs a trusted setup per circuit.
  • gnark: Go library for building circuits (Groth16, PlonK). Good for teams who like Go and strong type safety.
  • Halo2: A flexible proving system used in several projects. Suitable for custom gates and lookup arguments that help with ML ops.
  • Plonky2: Fast recursive proofs; useful when you need to fold many steps. Popular in high‑throughput contexts.
  • STARK frameworks (Cairo, Winterfell): Transparent setups, great for scalable or iterative computations. Proofs can be larger; verification remains fast and transparent.
  • zkVMs (RISC Zero, zkSync VM, others): Write normal code, let the VM handle proving. Excellent for complex control flow or when you need to reuse existing codebases.
  • ML bridges (ezkl and friends): Tools that convert ONNX models into circuits and handle quantization/LUTs. Handy for pilots and for discovering what changes your model needs.

Whichever stack you choose, build a minimal proof first: a single linear layer with an argmax. Verify end‑to‑end latency and proof size in your real environment. Then add layers and constraints until accuracy meets your target.
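
As a starting point, this sketch defines that "single linear layer with an argmax" in PyTorch and exports it to ONNX, the format that bridge tools such as ezkl typically consume. The dimensions and file name are arbitrary; the bridge-specific steps come afterwards.

    # Sketch: the minimal provable model, a single linear layer whose argmax is the claim.
    import torch
    import torch.nn as nn

    class TinyClassifier(nn.Module):
        def __init__(self, n_features: int = 16, n_classes: int = 3):
            super().__init__()
            self.linear = nn.Linear(n_features, n_classes)

        def forward(self, x):
            # The circuit only needs to expose the winning index, not raw logits.
            return torch.argmax(self.linear(x), dim=-1)

    model = TinyClassifier().eval()
    dummy = torch.randn(1, 16)
    torch.onnx.export(model, dummy, "tiny_classifier.onnx", opset_version=17)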

Costs, risks, and common myths

“Zero‑knowledge makes models private.”

Not by default. ZKML proves correctness, not secrecy. You can keep weights hidden by using commitments, but outputs may still leak information about inputs. If privacy is essential, combine ZK with techniques like differential privacy, secure enclaves, or private information retrieval—use the right tool for each job.

“We can prove any LLM on demand.”

Not today. Massive generative models are hard to prove in real time. Narrow tasks with small models are viable. If you must involve a big model, try:

  • Distillation: Train a small verifier model on the large model’s outputs for your specific task; prove only the small model.
  • Selective verification: Prove the final scoring or filtering step rather than the entire generative pipeline.

“Trusted setup kills trust.”

Groth16 and some PlonK variants need a trusted setup. You can mitigate with multi‑party ceremonies and documented transcripts. Or choose transparent systems like STARKs. Pick based on your threat model and deployment needs.

Operational risks

  • Upgrades and version drift: If you update the model but forget to update commitments or verifiers, clients will reject proofs. Automate version publishing and sunset policies.
  • Prover bottlenecks: Proving can be compute‑intensive. Use batch queues, GPUs, and job isolation to prevent contention with other workloads.
  • Side channels: Proofs hide data, but timing and system logs may leak patterns. Sanitize logs and apply consistent batching.

Security and governance basics

ZKML adds a strong cryptographic layer, but you still need the usual lifecycle controls.

  • Model registry: Store model versions, hashes, and activation dates. Expose a public endpoint for clients to fetch the current allowed commitments.
  • Revocation: If a model is found deficient, mark its hash as revoked. Verifiers should reject proofs using revoked hashes.
  • Proof telemetry: Track acceptance rates, proof sizes, proving times, and failure causes. This helps you detect abuse and capacity issues.
  • Key management: If you sign commitments or policies, rotate keys and publish revocation lists. Don’t hardcode keys inside apps without update paths.
  • Human review gates: For sensitive actions, require a proof plus a human checkpoint until you build enough confidence.
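
A minimal sketch of the registry and revocation logic described above; the storage, dates, and field names are illustrative.

    # Sketch: in-memory model registry with activation dates and revocation.
    from datetime import date

    REGISTRY = {
        "v53": {"sha256": "<commitment hash>", "active_from": date(2025, 11, 1), "revoked": False},
        "v27": {"sha256": "<old hash>", "active_from": date(2025, 6, 1), "revoked": True},
    }

    def commitment_is_acceptable(version: str, commitment: str, today: date) -> bool:
        entry = REGISTRY.get(version)
        if entry is None or entry["revoked"]:
            return False                          # unknown or revoked model: reject the proof
        return entry["sha256"] == commitment and today >= entry["active_from"]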

Design checklist for proof‑carrying models

  • Define the smallest useful claim: “I used Model v53 to compute class C for input hash H, and score ≥ 0.82.” Keep it narrow.
  • Quantize at training time: Lock precision early so accuracy holds in the proof domain.
  • Pick activation and pooling strategies that are cheap to prove (piecewise linear, average pooling where possible).
  • Commit to the model: Publish a hash/root and embed it in your verifier and contracts.
  • Constrain inputs and outputs: Hash inputs, prove argmax/thresholds, and include policy checks in the circuit.
  • Start with a pilot: One task, one circuit, real data. Measure proving latency, proof size, and verification cost.
  • Plan for upgrades: Version commitments, deprecate old models, and automate distribution of accepted hashes.
  • Consider recursion: Batch many micro‑inferences into one proof to control costs.
  • Monitor live: Collect metrics on proof acceptance and prover load to avoid silent failures.

Where this is heading

As hardware and proof systems improve, proof‑carrying models will handle more complex tasks. Expect better lookup tables for nonlinear ops, GPU‑first provers, and libraries that turn ONNX graphs into circuits with minimal manual tuning. We’ll also see more hybrid trust designs: enclaves or attestations for speed paired with ZK proofs for portability and audit.

Even now, you can deliver tangible value. If your product hinges on accepting or rejecting user actions, or if you’re shipping AI decisions into an adversarial setting, the ability to verify before you act changes the rules. You move from “we think the server did the right thing” to “we can prove it—instantly, anywhere.”

Summary:

  • Proof‑carrying models attach a zero‑knowledge proof to AI results so verifiers don’t have to trust the runner.
  • ZKML works best with small to medium models, quantized arithmetic, and simple activations.
  • Commit to model weights by hash to prevent silent model swapping and to enable version control.
  • Constrain inputs and outputs: hash inputs, prove argmax/thresholds, and enforce policy rules inside the proof.
  • Choose who proves (client or server) based on latency, privacy, and cost; consider recursion for batching.
  • Use existing toolchains (circom/snarkjs, gnark, Halo2, Plonky2, STARK frameworks, zkVMs, ezkl) to accelerate pilots.
  • Plan for upgrades, revocation, and telemetry; ZKML adds integrity but not automatic privacy.
  • Ship narrowly scoped, verifiable claims today; extend coverage as performance and tooling improve.

Andy Ewing, originally from coastal Maine, is a tech writer fascinated by AI, digital ethics, and emerging science. He blends curiosity and clarity to make complex ideas accessible.