Run real ML models directly on users’ GPUs with WebGPU. Eliminate network round-trips, cutting latency from hundreds of milliseconds to tens of milliseconds or less for small models, avoid server costs, and ship privacy-preserving features that work offline.
Why On-Device AI in the Browser?
Latency
Inference happens locally → no network round-trip. Great for real-time UX.
Privacy
User data stays on device; easier compliance for sensitive inputs.
Cost
Fewer GPU servers; ship models as static assets behind a CDN.
Reach
Runs in Chrome, Edge, and Safari Technology Preview, with Firefox support on the way; no install required.
Recommended Stack (2025)
ONNX Runtime Web with the WebGPU execution provider for inference, WASM/WebGL as fallbacks, a Service Worker for model caching, and a PWA shell for installability and offline use.
Which Models Run Well On-Device?
- Vision: MobileNetV3, EfficientNet-Lite, YOLO-N variants (quantized INT8).
- Text: Small LLMs (1–3B params) distilled/quantized for summarization & extraction.
- Audio: Keyword spotting, small ASR, noise suppression.
- Multimodal: CLIP-like encoders for on-device search & tagging.
Tip: Prefer encoder-style models for latency. Use quantization (INT8/FP16) and operator fusions.
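Whether the device supports FP16 in shaders can inform which quantized variant you ship. A minimal capability probe, assuming WebGPU is present; the FP16 model path is hypothetical, only the INT8 file appears later in this article:

// Probe for FP16 shader support to pick between FP16 and INT8 weights (sketch).
async function pickModelVariant() {
  const adapter = navigator.gpu ? await navigator.gpu.requestAdapter() : null;
  if (!adapter) return '/models/mobilenet-int8.onnx'; // no WebGPU: WASM path, keep INT8
  return adapter.features.has('shader-f16')
    ? '/models/mobilenet-fp16.onnx'  // hypothetical FP16 build
    : '/models/mobilenet-int8.onnx';
}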
Hands-On: Image Tagging PWA with WebGPU (ONNX Runtime Web)
- Detect WebGPU and gracefully fall back to WebGL/WASM.
- Stream the model over HTTP; cache with a Service Worker.
- Preprocess the image on the GPU where possible.
- Run inference and show the top-k labels, typically in around 30 ms on modern laptops.
<!-- index.html -->
<input type="file" id="file" accept="image/*">
<script type="module">
  import * as ort from 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.mjs';

  // Prefer WebGPU, fall back to WASM
  const hasWebGPU = !!navigator.gpu;
  const providers = hasWebGPU ? ['webgpu'] : ['wasm'];

  // Load the model (streamed over HTTP) once per page
  const session = await ort.InferenceSession.create('/models/mobilenet-int8.onnx', {
    executionProviders: providers
  });

  // Simple image → NCHW float32 tensor
  function toTensor(imgEl) {
    const w = 224, h = 224;
    const canvas = new OffscreenCanvas(w, h);
    const ctx = canvas.getContext('2d');
    ctx.drawImage(imgEl, 0, 0, w, h);
    const { data } = ctx.getImageData(0, 0, w, h);
    // Normalize to [0, 1], channel-first layout
    const float = new Float32Array(3 * w * h);
    for (let i = 0; i < w * h; i++) {
      float[i]             = data[i * 4]     / 255; // R
      float[i + w * h]     = data[i * 4 + 1] / 255; // G
      float[i + 2 * w * h] = data[i * 4 + 2] / 255; // B
    }
    return new ort.Tensor('float32', float, [1, 3, h, w]);
  }

  async function classify(file) {
    const img = new Image();
    img.src = URL.createObjectURL(file);
    await img.decode();
    URL.revokeObjectURL(img.src);

    const input = toTensor(img);
    const outputs = await session.run({ [session.inputNames[0]]: input });
    const logits = Array.from(outputs[session.outputNames[0]].data);

    // Softmax + top-k (k = 3)
    const max = Math.max(...logits);
    const exps = logits.map(v => Math.exp(v - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    const top = exps
      .map((e, i) => ({ i, p: e / sum }))
      .sort((a, b) => b.p - a.p)
      .slice(0, 3);
    console.log('Top-3', top);
  }

  document.querySelector('#file').addEventListener('change', e => classify(e.target.files[0]));
</script>
// service-worker.js (sketch)
self.addEventListener('install', e => {
  e.waitUntil(
    caches.open('ai-cache-v1').then(c => c.addAll([
      '/',
      '/index.html',
      '/models/mobilenet-int8.onnx'
    ]))
  );
});

self.addEventListener('fetch', e => {
  e.respondWith(
    caches.match(e.request).then(r => r || fetch(e.request))
  );
});
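The page also has to register the worker before the cache takes effect. A minimal registration sketch, assuming the file name used above:

// index.html (module script): register the Service Worker so the app shell
// and model are served from the cache on repeat visits and offline.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/service-worker.js').catch(err => {
    console.warn('Service Worker registration failed', err);
  });
}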
This skeleton demonstrates provider selection, streaming, and offline caching. For production, add range requests, integrity checks, and a labels file.
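Range requests let you download large weight files in resumable parts (see also “Chunk your weights” below). A hedged sketch, assuming the server honors the Range header and exposes Content-Length:

// Download a model in resumable chunks via HTTP Range requests (sketch).
async function fetchModelInChunks(url, chunkSize = 4 * 1024 * 1024) {
  const head = await fetch(url, { method: 'HEAD' });
  const total = Number(head.headers.get('Content-Length'));
  const buffer = new Uint8Array(total);

  for (let offset = 0; offset < total; offset += chunkSize) {
    const end = Math.min(offset + chunkSize, total) - 1;
    const res = await fetch(url, { headers: { Range: `bytes=${offset}-${end}` } });
    buffer.set(new Uint8Array(await res.arrayBuffer()), offset);
    // A production version would persist completed chunks (e.g. in IndexedDB)
    // so an interrupted download can resume here instead of restarting.
  }
  return buffer; // can be passed to ort.InferenceSession.create(buffer, { executionProviders })
}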
Performance Playbook
- Quantize: Prefer INT8/FP16; measure accuracy delta.
- Chunk your weights: 1–4 MB parts for early start + resume.
- Warm up: Run a tiny dummy tensor after session creation (sketched after this list); use the Page Visibility API to pause and resume work when the tab is backgrounded.
- Tiled preprocess: Use WebGPU compute shaders for resize/normalize on GPU.
- Memory budget: Avoid models larger than ~500 MB on mainstream laptops; stream outputs when possible.
- Fallbacks: WebNN (where available) > WebGL > WASM to maximize reach (see the sketch after this list).
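A combined sketch of the warm-up and fallback items above, assuming ONNX Runtime Web as in the demo; the 'webnn' provider is experimental and only usable where the browser exposes WebNN:

// Provider cascade + warm-up (sketch). Tries the fastest backend first and
// falls back until session creation succeeds, then runs one dummy inference
// so the first real request doesn't pay shader-compilation cost.
async function createWarmSession(modelUrl) {
  const cascade = [['webnn'], ['webgpu'], ['webgl'], ['wasm']];
  for (const executionProviders of cascade) {
    try {
      const session = await ort.InferenceSession.create(modelUrl, { executionProviders });
      const dummy = new ort.Tensor('float32', new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);
      await session.run({ [session.inputNames[0]]: dummy }); // warm-up pass
      return session;
    } catch (err) {
      console.warn(`Provider ${executionProviders[0]} unavailable, trying next`, err);
    }
  }
  throw new Error('No execution provider available');
}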
Privacy & Security Notes
- Keep all inference on device. Only send anonymized telemetry (opt-in).
- Sign model files and verify integrity on load: SRI for scripts, a manual hash check for fetched weights (sketched after this list).
- Use COOP/COEP headers for cross-origin isolation, which securely unlocks SharedArrayBuffer and multi-threaded WASM.
- For regulated data, stick to client-side storage (IndexedDB) with encryption at rest.
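SRI attributes cover scripts and stylesheets; for fetched model weights you can verify a pinned digest yourself with Web Crypto. A minimal sketch, assuming you publish the expected SHA-256 hex alongside the model:

// Verify a downloaded model against a pinned SHA-256 digest (sketch).
async function fetchVerifiedModel(url, expectedHex) {
  const bytes = await (await fetch(url)).arrayBuffer();
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  const hex = [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
  if (hex !== expectedHex) throw new Error('Model integrity check failed');
  return new Uint8Array(bytes); // safe to hand to InferenceSession.create
}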
UX Patterns That Win
- Progressive disclosure: Light model first; load heavier models on demand.
- Offline-first messaging: “Private, runs on your device.” Convert privacy into a feature.
- Optimistic UI: Start rendering partial results; refine as layers complete.
- Device capability check: Offer “Performance” vs “Battery saver” modes (see the sketch below).
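A hedged capability check for choosing a default mode; navigator.deviceMemory is Chromium-only, so treat both signals as hints rather than guarantees:

// Pick a default mode from rough device signals (sketch).
async function pickDefaultMode() {
  const adapter = navigator.gpu ? await navigator.gpu.requestAdapter() : null;
  const memoryGB = navigator.deviceMemory ?? 4; // hint; absence just means "unknown"
  return adapter && memoryGB >= 8 ? 'performance' : 'battery-saver';
}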
Roadmap: MVP → Production
- Week 1: Choose model; export to ONNX; quantize; create minimal demo.
- Week 2: Add Service Worker caching; implement fallback providers.
- Week 3: Ship PWA install; collect anonymized perf metrics.
- Week 4: Optimize (INT8, shader preprocess); add feature flags and A/B tests.
Where This Shines (Use-Case Ideas)
- Photo moderation/tagging inside creator tools.
- Meeting notes summarization that never leaves the browser.
- Real-time product recommendations in a storefront without server inference.
- Assistive features (captions, glare removal) on low-connectivity devices.