WebGPU + On-Device AI: How to Build Fast, Private, Offline Apps in the Browser

Run real ML models directly on users’ GPUs with WebGPU. Cut latency from hundreds of milliseconds to single-digit milliseconds, avoid server costs, and ship privacy-preserving features that work offline.

TL;DR Use WebGPU (via ONNX Runtime Web or TensorFlow.js) to run quantized models in the browser. Stream weights, cache with Service Workers, keep PII on-device, and fall back to WebGL/WebAssembly when WebGPU isn’t available.

Why On-Device AI in the Browser?

  • Latency: Inference happens locally → no network round-trip. Great for real-time UX.
  • Privacy: User data stays on device; easier compliance for sensitive inputs.
  • Cost: Fewer GPU servers; ship models as static assets behind a CDN.
  • Reach: Runs on Chrome, Edge, Safari TP, and soon Firefox; no install required.

Recommended Stack (2025)

| Layer | Option | Why |
| --- | --- | --- |
| Runtime | ONNX Runtime Web (WebGPU backend) | Fast, production-ready, supports quantization & operators. |
| Alt runtime | TensorFlow.js (WebGPU) | Great docs; easy for JS teams. |
| Model formats | ONNX, TF.js Graph, GGUF (converted) | Portable; wide tooling support. |
| Caching | Service Worker + Cache Storage + HTTP ranges | Stream & resume weight downloads; offline-first. |

Which Models Run Well On-Device?

  • Vision: MobileNetV3, EfficientNet-Lite, YOLO-N variants (quantized INT8).
  • Text: Small LLMs (1–3B params) distilled/quantized for summarization & extraction.
  • Audio: Keyword spotting, small ASR, noise suppression.
  • Multimodal: CLIP-like encoders for on-device search & tagging.

Tip: Prefer encoder-style models for latency. Use quantization (INT8/FP16) and operator fusions.

Hands-On: Image Tagging PWA with WebGPU (ONNX Runtime Web)

  1. Detect WebGPU and gracefully fall back to WebGL/WASM.
  2. Stream the model over HTTP; cache with a Service Worker.
  3. Preprocess the image on the GPU where possible.
  4. Run inference; show top-k labels in roughly 30 ms on modern laptops.

<!-- index.html -->
<input type="file" id="file" accept="image/*">
<script type="module">
// Note: depending on your onnxruntime-web version, the WebGPU backend may ship in a
// separate bundle (e.g. dist/ort.webgpu.min.mjs); check the package docs for the bundle you need.
import * as ort from 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.mjs';

// Register the Service Worker (below) for offline caching
if ('serviceWorker' in navigator) navigator.serviceWorker.register('/service-worker.js');

const hasWebGPU = !!navigator.gpu;
const providers = hasWebGPU ? ['webgpu'] : ['wasm']; // fallback

// Load model (fetched and streamed by the runtime)
const session = await ort.InferenceSession.create('/models/mobilenet-int8.onnx', {
  executionProviders: providers
});

// Warm up with a dummy input so the first real request doesn't pay shader/kernel setup cost
await session.run({
  [session.inputNames[0]]: new ort.Tensor('float32', new Float32Array(3 * 224 * 224), [1, 3, 224, 224])
});

// Simple image → tensor: resize to 224×224, scale to [0,1], channel-first (NCHW)
async function toTensor(imgEl){
  const w = 224, h = 224, canvas = new OffscreenCanvas(w,h);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(imgEl, 0,0,w,h);
  const { data } = ctx.getImageData(0,0,w,h);
  // Note: many MobileNet exports also expect mean/std normalization; match your model's preprocessing.
  const float = new Float32Array(3*w*h);
  for(let i=0;i<w*h;i++){
    float[i] = data[i*4]/255;          // R
    float[i+w*h] = data[i*4+1]/255;    // G
    float[i+2*w*h] = data[i*4+2]/255;  // B
  }
  return new ort.Tensor('float32', float, [1,3,h,w]);
}

async function classify(file){
  const img = new Image(); img.src = URL.createObjectURL(file);
  await img.decode();
  URL.revokeObjectURL(img.src);
  const input = await toTensor(img);
  // Use the model's actual input/output names instead of hard-coding them
  const outputs = await session.run({ [session.inputNames[0]]: input });
  const logits = Array.from(outputs[session.outputNames[0]].data); // plain array, so .map can return objects
  // softmax + top-k (k=3)
  const maxLogit = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - maxLogit));
  const sum = exps.reduce((a,b)=>a+b,0);
  const top = exps.map((e,i)=>({ i, p: e/sum })).sort((a,b)=>b.p-a.p).slice(0,3);
  console.log('Top-3', top); // map indices to labels via a labels file in production
}

document.querySelector('#file').addEventListener('change', e => classify(e.target.files[0]));
</script>

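Step 3 suggests doing the preprocessing on the GPU itself. Below is a minimal sketch of a WebGPU compute pass that replaces the canvas loop above: it assumes the image has already been drawn to a 224×224 canvas and converts the RGBA pixels into the channel-first float layout. The shader, buffer names, and the gpuPreprocess helper are illustrative assumptions (it skips mean/std normalization and assumes a little-endian platform).

// gpu-preprocess.js (sketch)
const W = 224, H = 224;

const shaderSrc = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> rgba : array<u32>;        // packed RGBA, one u32 per pixel
  @group(0) @binding(1) var<storage, read_write> chw : array<f32>;   // planar CHW output

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let n = ${W * H}u;
    if (gid.x >= n) { return; }
    let px = rgba[gid.x];
    chw[gid.x]          = f32( px         & 0xFFu) / 255.0; // R
    chw[gid.x + n]      = f32((px >>  8u) & 0xFFu) / 255.0; // G
    chw[gid.x + 2u * n] = f32((px >> 16u) & 0xFFu) / 255.0; // B
  }
`;

// imageData: a 224×224 ImageData (e.g. from the OffscreenCanvas above).
// Assumes WebGPU is available; gate this behind the navigator.gpu check.
async function gpuPreprocess(imageData) {
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter.requestDevice();
  const outBytes = 3 * W * H * 4;

  const inBuf  = device.createBuffer({ size: imageData.data.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST });
  const outBuf = device.createBuffer({ size: outBytes, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC });
  const readBuf = device.createBuffer({ size: outBytes, usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST });
  device.queue.writeBuffer(inBuf, 0, imageData.data);

  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: device.createShaderModule({ code: shaderSrc }), entryPoint: 'main' }
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: inBuf } },
      { binding: 1, resource: { buffer: outBuf } }
    ]
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil((W * H) / 64));
  pass.end();
  encoder.copyBufferToBuffer(outBuf, 0, readBuf, 0, outBytes);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const floats = new Float32Array(readBuf.getMappedRange().slice(0));
  readBuf.unmap();
  return new ort.Tensor('float32', floats, [1, 3, H, W]);
}

The Service Worker below then makes the whole thing offline-first.
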
// service-worker.js (sketch)
self.addEventListener('install', e => {
  e.waitUntil(caches.open('ai-cache-v1').then(c => c.addAll([
    '/', '/index.html', '/models/mobilenet-int8.onnx'
  ])));
});
self.addEventListener('fetch', e => {
  e.respondWith(caches.match(e.request).then(r => r || fetch(e.request)));
});

This skeleton demonstrates provider selection, streaming, and offline caching. For production, add range requests, integrity checks, and a labels file.
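
A hypothetical loader covering the first two of those, assuming the server honors HTTP Range requests and that you generate the total size and a SHA-256 hash at build time (chunk size, parameter names, and fetchModelChunked itself are illustrative):

// model-loader.js (sketch)
async function fetchModelChunked(url, totalBytes, expectedSha256Hex, chunkBytes = 4 * 1024 * 1024) {
  const model = new Uint8Array(totalBytes);
  for (let offset = 0; offset < totalBytes; offset += chunkBytes) {
    const end = Math.min(offset + chunkBytes, totalBytes) - 1;
    const res = await fetch(url, { headers: { Range: `bytes=${offset}-${end}` } });
    if (res.status !== 206) throw new Error(`Expected a partial response, got ${res.status}`);
    model.set(new Uint8Array(await res.arrayBuffer()), offset);
  }
  // Verify integrity before handing the bytes to the runtime
  const digest = await crypto.subtle.digest('SHA-256', model);
  const hex = [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
  if (hex !== expectedSha256Hex) throw new Error('Model integrity check failed');
  return model; // ort.InferenceSession.create accepts raw bytes as well as a URL
}

Writing each verified chunk into Cache Storage as it arrives gives you resume-after-reload essentially for free.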

Performance Playbook

  • Quantize: Prefer INT8/FP16; measure accuracy delta.
  • Chunk your weights: 1–4 MB parts for early start + resume.
  • Warm up: Run a tiny dummy tensor after session creation; re-warm when the tab regains visibility (Page Visibility API).
  • Tiled preprocess: Use WebGPU compute shaders for resize/normalize on GPU.
  • Memory budget: Avoid >500 MB models on mainstream laptops; stream outputs when possible.
  • Fallbacks: WebNN (where available) > WebGL > WASM to maximize reach (see the sketch below).
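
The fallback chain from the last bullet is just a loop over execution providers. A minimal sketch, assuming the provider names onnxruntime-web exposes ('webgpu', 'webnn', 'webgl', 'wasm') and that an unavailable backend rejects session creation (what's actually available depends on the build and the browser):

// providers.js (sketch)
async function createSession(modelUrl) {
  const preferred = ['webgpu', 'webnn', 'webgl', 'wasm'];
  for (const ep of preferred) {
    try {
      return await ort.InferenceSession.create(modelUrl, { executionProviders: [ep] });
    } catch (err) {
      console.warn(`Execution provider "${ep}" failed, trying the next one`, err);
    }
  }
  throw new Error('No usable execution provider on this device');
}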

Privacy & Security Notes

  • Keep all inference on device. Only send anonymized telemetry (opt-in).
  • Sign model files; validate Subresource Integrity (SRI) on load.
  • Use COOP/COEP headers to unlock high-perf features securely (example below).
  • For regulated data, stick to client-side storage (IndexedDB) with encryption at rest.
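
On the COOP/COEP point: cross-origin isolation (which the multi-threaded WASM fallback relies on) requires serving your pages with these two response headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp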

UX Patterns That Win

  1. Progressive disclosure: Light model first; load heavier models on demand.
  2. Offline-first messaging: “Private, runs on your device.” Convert privacy into a feature.
  3. Optimistic UI: Start rendering partial results; refine as layers complete.
  4. Device capability check: Offer “Performance” vs “Battery saver” modes (see the sketch below).
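
For the capability check, a rough heuristic is usually enough. A sketch using a WebGPU adapter request plus the (Chrome-only) navigator.deviceMemory hint; the 8 GB threshold and mode names are illustrative assumptions:

// mode-select.js (sketch)
async function pickDefaultMode() {
  const adapter = await navigator.gpu?.requestAdapter({ powerPreference: 'high-performance' });
  const memGB = navigator.deviceMemory ?? 4; // hint not available everywhere; assume 4 GB
  return (adapter && memGB >= 8) ? 'performance' : 'battery-saver';
}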

Roadmap: MVP → Production

  1. Week 1: Choose model; export to ONNX; quantize; create minimal demo.
  2. Week 2: Add Service Worker caching; implement fallback providers.
  3. Week 3: Ship PWA install; collect anonymized perf metrics.
  4. Week 4: Optimize (INT8, shader preprocess); add feature flags and A/B tests.

Where This Shines (Use-Case Ideas)

  • Photo moderation/tagging inside creator tools.
  • Meeting notes summarization that never leaves the browser.
  • Real-time product recommendations in a storefront without server inference.
  • Assistive features (captions, glare removal) on low-connectivity devices.

Bottom Line

WebGPU makes serious AI in the browser practical. Start with a small, quantized model, ship it as a PWA, and measure. You’ll get lower latency, lower cost, and a privacy story users actually love.

