WebGPU + On-Device AI: How to Build Fast, Private, Offline Apps in the Browser

Run real ML models directly on users’ GPUs with WebGPU. Cut latency from hundreds of milliseconds to single-digit milliseconds, avoid server costs, and ship privacy-preserving features that work offline.

TL;DR Use WebGPU (via ONNX Runtime Web or TensorFlow.js) to run quantized models in the browser. Stream weights, cache with Service Workers, keep PII on-device, and fall back to WebGL/WebAssembly when WebGPU isn’t available.

Why On-Device AI in the Browser?

  • Latency: Inference happens locally → no network round-trip. Great for real-time UX.
  • Privacy: User data stays on device; easier compliance for sensitive inputs.
  • Cost: Fewer GPU servers; ship models as static assets behind a CDN.
  • Reach: Runs on Chrome, Edge, Safari TP, and soon Firefox; no install required.

Recommended Stack (2025)

| Layer | Option | Why |
| --- | --- | --- |
| Runtime | ONNX Runtime Web (WebGPU backend) | Fast, production-ready, supports quantization & operators. |
| Alt runtime | TensorFlow.js (WebGPU) | Great docs; easy for JS teams. |
| Model formats | ONNX, TF.js Graph, GGUF (converted) | Portable; wide tooling support. |
| Caching | Service Worker + Cache Storage + HTTP ranges | Stream & resume weight downloads; offline-first. |

Which Models Run Well On-Device?

  • Vision: MobileNetV3, EfficientNet-Lite, YOLO-N variants (quantized INT8).
  • Text: Small LLMs (1–3B params) distilled/quantized for summarization & extraction.
  • Audio: Keyword spotting, small ASR, noise suppression.
  • Multimodal: CLIP-like encoders for on-device search & tagging.

Tip: Prefer encoder-style models for latency. Use quantization (INT8/FP16) and operator fusions.

Hands-On: Image Tagging PWA with WebGPU (ONNX Runtime Web)

  1. Detect WebGPU and gracefully fall back to WebGL/WASM.
  2. Stream the model over HTTP; cache with a Service Worker.
  3. Preprocess the image on the GPU where possible.
  4. Run inference; show top-k labels in roughly 30 ms on modern laptops.

<!-- index.html -->
<input type="file" id="file" accept="image/*">
<script type="module">
// Note: depending on your onnxruntime-web version, the WebGPU backend may ship in a
// separate bundle (e.g. dist/ort.webgpu.min.mjs); check the package docs for the bundle you need.
import * as ort from 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.mjs';

// Register the Service Worker (below) for offline caching
if ('serviceWorker' in navigator) navigator.serviceWorker.register('/service-worker.js');

const hasWebGPU = !!navigator.gpu;
const providers = hasWebGPU ? ['webgpu'] : ['wasm']; // fallback

// Load model (fetched and streamed by the runtime)
const session = await ort.InferenceSession.create('/models/mobilenet-int8.onnx', {
  executionProviders: providers
});

// Warm up with a dummy input so the first real request doesn't pay shader/kernel setup cost
await session.run({
  [session.inputNames[0]]: new ort.Tensor('float32', new Float32Array(3 * 224 * 224), [1, 3, 224, 224])
});

// Simple image → tensor: resize to 224×224, scale to [0,1], channel-first (NCHW)
async function toTensor(imgEl){
  const w = 224, h = 224, canvas = new OffscreenCanvas(w,h);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(imgEl, 0,0,w,h);
  const { data } = ctx.getImageData(0,0,w,h);
  // Note: many MobileNet exports also expect mean/std normalization; match your model's preprocessing.
  const float = new Float32Array(3*w*h);
  for(let i=0;i<w*h;i++){
    float[i] = data[i*4]/255;          // R
    float[i+w*h] = data[i*4+1]/255;    // G
    float[i+2*w*h] = data[i*4+2]/255;  // B
  }
  return new ort.Tensor('float32', float, [1,3,h,w]);
}

async function classify(file){
  const img = new Image(); img.src = URL.createObjectURL(file);
  await img.decode();
  URL.revokeObjectURL(img.src);
  const input = await toTensor(img);
  // Use the model's actual input/output names instead of hard-coding them
  const outputs = await session.run({ [session.inputNames[0]]: input });
  const logits = Array.from(outputs[session.outputNames[0]].data); // plain array, so .map can return objects
  // softmax + top-k (k=3)
  const maxLogit = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - maxLogit));
  const sum = exps.reduce((a,b)=>a+b,0);
  const top = exps.map((e,i)=>({ i, p: e/sum })).sort((a,b)=>b.p-a.p).slice(0,3);
  console.log('Top-3', top); // map indices to labels via a labels file in production
}

document.querySelector('#file').addEventListener('change', e => classify(e.target.files[0]));
</script>

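Step 3 suggests doing the preprocessing on the GPU itself. Below is a minimal sketch of a WebGPU compute pass that replaces the canvas loop above: it assumes the image has already been drawn to a 224×224 canvas and converts the RGBA pixels into the channel-first float layout. The shader, buffer names, and the gpuPreprocess helper are illustrative assumptions (it skips mean/std normalization and assumes a little-endian platform).

// gpu-preprocess.js (sketch)
const W = 224, H = 224;

const shaderSrc = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> rgba : array<u32>;        // packed RGBA, one u32 per pixel
  @group(0) @binding(1) var<storage, read_write> chw : array<f32>;   // planar CHW output

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let n = ${W * H}u;
    if (gid.x >= n) { return; }
    let px = rgba[gid.x];
    chw[gid.x]          = f32( px         & 0xFFu) / 255.0; // R
    chw[gid.x + n]      = f32((px >>  8u) & 0xFFu) / 255.0; // G
    chw[gid.x + 2u * n] = f32((px >> 16u) & 0xFFu) / 255.0; // B
  }
`;

// imageData: a 224×224 ImageData (e.g. from the OffscreenCanvas above).
// Assumes WebGPU is available; gate this behind the navigator.gpu check.
async function gpuPreprocess(imageData) {
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter.requestDevice();
  const outBytes = 3 * W * H * 4;

  const inBuf  = device.createBuffer({ size: imageData.data.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST });
  const outBuf = device.createBuffer({ size: outBytes, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC });
  const readBuf = device.createBuffer({ size: outBytes, usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST });
  device.queue.writeBuffer(inBuf, 0, imageData.data);

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: device.createShaderModule({ code: shaderSrc }), entryPoint: 'main' }
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: inBuf } },
      { binding: 1, resource: { buffer: outBuf } }
    ]
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil((W * H) / 64));
  pass.end();
  encoder.copyBufferToBuffer(outBuf, 0, readBuf, 0, outBytes);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const floats = new Float32Array(readBuf.getMappedRange().slice(0));
  readBuf.unmap();
  return new ort.Tensor('float32', floats, [1, 3, H, W]);
}

The Service Worker below then makes the whole thing offline-first.
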
// service-worker.js (sketch)
self.addEventListener('install', e => {
  e.waitUntil(caches.open('ai-cache-v1').then(c => c.addAll([
    '/', '/index.html', '/models/mobilenet-int8.onnx'
  ])));
});
self.addEventListener('fetch', e => {
  e.respondWith(caches.match(e.request).then(r => r || fetch(e.request)));
});

This skeleton demonstrates provider selection, streaming, and offline caching. For production, add range requests, integrity checks, and a labels file.
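
A hypothetical loader covering the first two of those, assuming the server honors HTTP Range requests and that you generate the total size and a SHA-256 hash at build time (chunk size, parameter names, and fetchModelChunked itself are illustrative):

// model-loader.js (sketch)
async function fetchModelChunked(url, totalBytes, expectedSha256Hex, chunkBytes = 4 * 1024 * 1024) {
  const model = new Uint8Array(totalBytes);
  for (let offset = 0; offset < totalBytes; offset += chunkBytes) {
    const end = Math.min(offset + chunkBytes, totalBytes) - 1;
    const res = await fetch(url, { headers: { Range: `bytes=${offset}-${end}` } });
    if (res.status !== 206) throw new Error(`Expected a partial response, got ${res.status}`);
    model.set(new Uint8Array(await res.arrayBuffer()), offset);
  }
  // Verify integrity before handing the bytes to the runtime
  const digest = await crypto.subtle.digest('SHA-256', model);
  const hex = [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
  if (hex !== expectedSha256Hex) throw new Error('Model integrity check failed');
  return model; // ort.InferenceSession.create accepts raw bytes as well as a URL
}

Writing each verified chunk into Cache Storage as it arrives gives you resume-after-reload essentially for free.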

Performance Playbook

  • Quantize: Prefer INT8/FP16; measure accuracy delta.
  • Chunk your weights: 1–4 MB parts for early start + resume.
  • Warm up: Run a tiny dummy tensor after session creation; re-warm when the tab regains visibility (Page Visibility API).
  • Tiled preprocess: Use WebGPU compute shaders for resize/normalize on GPU.
  • Memory budget: Avoid >500 MB models on mainstream laptops; stream outputs when possible.
  • Fallbacks: WebNN (where available) > WebGL > WASM to maximize reach (see the sketch below).
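
The fallback chain from the last bullet is just a loop over execution providers. A minimal sketch, assuming the provider names onnxruntime-web exposes ('webgpu', 'webnn', 'webgl', 'wasm') and that an unavailable backend rejects session creation (what's actually available depends on the build and the browser):

// providers.js (sketch)
async function createSession(modelUrl) {
  const preferred = ['webgpu', 'webnn', 'webgl', 'wasm'];
  for (const ep of preferred) {
    try {
      return await ort.InferenceSession.create(modelUrl, { executionProviders: [ep] });
    } catch (err) {
      console.warn(`Execution provider "${ep}" failed, trying the next one`, err);
    }
  }
  throw new Error('No usable execution provider on this device');
}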

Privacy & Security Notes

  • Keep all inference on device. Only send anonymized telemetry (opt-in).
  • Sign model files; validate Subresource Integrity (SRI) on load.
  • Use COOP/COEP headers to unlock high-perf features securely (example below).
  • For regulated data, stick to client-side storage (IndexedDB) with encryption at rest.
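
On the COOP/COEP point: cross-origin isolation (which the multi-threaded WASM fallback relies on) requires serving your pages with these two response headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp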

UX Patterns That Win

  1. Progressive disclosure: Light model first; load heavier models on demand.
  2. Offline-first messaging: “Private, runs on your device.” Convert privacy into a feature.
  3. Optimistic UI: Start rendering partial results; refine as layers complete.
  4. Device capability check: Offer “Performance” vs “Battery saver” modes (see the sketch below).
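
For the capability check, a rough heuristic is usually enough. A sketch using a WebGPU adapter request plus the (Chrome-only) navigator.deviceMemory hint; the 8 GB threshold and mode names are illustrative assumptions:

// mode-select.js (sketch)
async function pickDefaultMode() {
  const adapter = await navigator.gpu?.requestAdapter({ powerPreference: 'high-performance' });
  const memGB = navigator.deviceMemory ?? 4; // hint not available everywhere; assume 4 GB
  return (adapter && memGB >= 8) ? 'performance' : 'battery-saver';
}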

Roadmap: MVP → Production

  1. Week 1: Choose model; export to ONNX; quantize; create minimal demo.
  2. Week 2: Add Service Worker caching; implement fallback providers.
  3. Week 3: Ship PWA install; collect anonymized perf metrics.
  4. Week 4: Optimize (INT8, shader preprocess); add feature flags and A/B tests.

Where This Shines (Use-Case Ideas)

  • Photo moderation/tagging inside creator tools.
  • Meeting notes summarization that never leaves the browser.
  • Real-time product recommendations in a storefront without server inference.
  • Assistive features (captions, glare removal) on low-connectivity devices.

Bottom Line

WebGPU makes serious AI in the browser practical. Start with a small, quantized model, ship it as a PWA, and measure. You’ll get lower latency, lower cost, and a privacy story users actually love.

