Building a Mobile-First Virtual Try-On Experience

Marcus Webb September 1, 2025 7 min read

Over 70% of beauty e-commerce sessions originate on mobile — and in some DTC beauty cohorts, that number sits closer to 80%. Building an AR try-on experience that actually works on mobile is not a responsive-design problem. It's a performance architecture, camera API, and rendering pipeline problem that requires different decisions than desktop-first design.

This piece addresses what mobile-first AR try-on actually requires in practice: the constraints of mobile camera APIs, the performance budgets that determine whether a shopper completes a try-on session or abandons it, and the UX surface decisions that make the difference between a feature that gets used and one that gets ignored.

The camera API landscape on mobile browsers

AR try-on in a browser context depends on the getUserMedia API, the media capture mechanism (commonly grouped with WebRTC) that requests camera access and returns a live video stream. That stream is typically attached to a video element, and its frames are drawn to a canvas for processing. The API is supported across all modern mobile browsers, but with behavioral differences that matter for production implementations.

iOS Safari has the most restrictive implementation. Camera access via getUserMedia requires: (1) HTTPS — no exceptions, including on development and staging domains; (2) an explicit user gesture to trigger the permission prompt — the call must originate from a user-initiated event like a button tap, not from page load or a programmatic trigger; (3) no background tab — if the user switches away from the tab mid-session, the camera stream is paused by the browser and must be restarted. These are iOS security constraints, not bugs. Your implementation must handle the restart case gracefully rather than showing an error state.
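The restart case can be sketched as follows. This is a minimal illustration, not a definitive implementation: createCameraSession and the onStream callback are hypothetical names, and mediaDevices and doc are injected stand-ins for navigator.mediaDevices and document. It assumes that on returning to the tab the video track is either ended or still muted, and that a restart after permission has already been granted does not re-prompt.

```javascript
// Sketch: restart the camera stream when the user returns to a backgrounded tab.
// `mediaDevices` is normally navigator.mediaDevices and `doc` is normally document;
// they are injected here so the logic can be exercised outside a browser.
function createCameraSession({ mediaDevices, doc, constraints, onStream }) {
  let stream = null;

  // A stream counts as live only if it has a video track that is running and unmuted.
  function isLive() {
    return stream !== null &&
      stream.getVideoTracks().some((t) => t.readyState === 'live' && !t.muted);
  }

  // Call this from a user gesture (for example, the try-on button's click handler).
  async function start() {
    if (isLive()) return stream;
    if (stream) stream.getTracks().forEach((t) => t.stop()); // release the stale stream
    stream = await mediaDevices.getUserMedia(constraints);
    onStream(stream); // hypothetical hook: attach the stream to the <video> element
    return stream;
  }

  // When the tab becomes visible again, restart only if a session had been started.
  doc.addEventListener('visibilitychange', () => {
    if (doc.visibilityState === 'visible' && stream !== null && !isLive()) {
      start().catch(() => {
        // Assumption: on failure, show a "tap to resume" control instead of an error.
      });
    }
  });

  return { start, isLive };
}
```

In a page, start() would be wired to the try-on button so that the first request originates from a tap, and the visibility handler covers the return-from-background case without an error state.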

Android Chrome applies the same secure-context rule, with http://localhost treated as secure during development, and the user-gesture pattern is worth following there as well so that the prompt appears in context. The primary Android-specific consideration is variance in camera hardware: Android devices span a much wider range of sensor quality, autofocus behavior, and color science than the iOS fleet. Implementations tested only on high-end Android flagship hardware often have rendering quality issues on mid-range devices with less capable camera sensors.

Both platforms support the facingMode: 'user' constraint in getUserMedia to select the front-facing camera, which is the correct default for try-on. However, on some Android devices, the front camera stream delivers frames at a lower resolution than the rear camera. Set a minimum resolution constraint of 640×480 to prevent the SDK from receiving frames that are too small for accurate face segmentation. Note that min values are hard requirements: if the camera cannot satisfy them, the request is rejected with an OverconstrainedError rather than silently downgraded, so that case needs its own handling:

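// Request the front camera. `min` values are hard requirements;
// `ideal` values are preferences the browser tries to match.
const streamPromise = navigator.mediaDevices.getUserMedia({
  video: {
    facingMode: 'user',
    width:  { min: 640, ideal: 1280 },
    height: { min: 480, ideal: 720 }
  }
});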
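One way to handle a camera that cannot meet the minimum is to retry with relaxed constraints and report what was actually delivered. The sketch below assumes that approach; openFrontCamera, PREFERRED, RELAXED, and the meetsMinimum flag are hypothetical names, and mediaDevices stands in for navigator.mediaDevices.

```javascript
// Constraints from the article: 640x480 minimum, 1280x720 preferred.
const PREFERRED = {
  video: {
    facingMode: 'user',
    width: { min: 640, ideal: 1280 },
    height: { min: 480, ideal: 720 }
  }
};

// Fallback with no hard minimums, so the request cannot be overconstrained on size.
const RELAXED = {
  video: { facingMode: 'user', width: { ideal: 640 }, height: { ideal: 480 } }
};

// Returns the stream plus a flag saying whether it meets the 640x480 minimum.
async function openFrontCamera(mediaDevices) {
  try {
    const stream = await mediaDevices.getUserMedia(PREFERRED);
    return { stream, meetsMinimum: true };
  } catch (err) {
    // Only the "camera cannot satisfy min" case is retried; denial etc. is rethrown.
    if (err.name !== 'OverconstrainedError') throw err;
    const stream = await mediaDevices.getUserMedia(RELAXED);
    const { width = 0, height = 0 } = stream.getVideoTracks()[0].getSettings();
    // Compare long and short edges so portrait-oriented frames are judged fairly.
    const meetsMinimum = Math.max(width, height) >= 640 && Math.min(width, height) >= 480;
    return { stream, meetsMinimum };
  }
}
```

When meetsMinimum comes back false, the caller can decide whether to run live tracking at reduced quality or route the shopper to a non-live preview.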

Face mesh on mobile: MediaPipe and the performance budget

Real-time face tracking for AR overlays uses a face landmark detection model to locate the relevant anatomy — lip boundary, eye region, face perimeter — with per-frame precision sufficient to composite an overlay naturally without visible drift. MediaPipe Face Mesh, the open-source library developed by Google, is the most widely used foundation for this layer in browser-based AR implementations.

MediaPipe Face Mesh delivers 468 3D face landmarks at up to 30fps on capable mobile hardware when the model runs via WebAssembly and the inference is GPU-accelerated via WebGL. The performance reality on actual device hardware:

High-end mobile (current-gen flagship iOS/Android): 28–32fps face mesh inference at 1280×720 input resolution. Comfortable budget for rendering pipeline on top.
Mid-range mobile (2022–2023 Android, older iOS): 18–25fps at 640×480. Still acceptable for a smooth try-on experience; shade switching may have brief latency on older devices.
Low-end mobile: Below 15fps face mesh inference, which produces visible landmark jitter and overlay drift. This segment represents a small but non-zero share of users — the correct response is a graceful fallback to a photo-based overlay or a static "shade on swatch" preview, not a broken experience.
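The fallback decision for the low-end tier can be driven by measured frame rate rather than device detection. A minimal sketch, assuming the caller records a timestamp for each completed face mesh result during a short warm-up window; averageFps, chooseRenderMode, and the mode labels are hypothetical names, and the 15fps threshold is the one given above.

```javascript
// Average frames per second over a list of per-frame timestamps (milliseconds).
function averageFps(frameTimestampsMs) {
  if (frameTimestampsMs.length < 2) return 0;
  const elapsed = frameTimestampsMs[frameTimestampsMs.length - 1] - frameTimestampsMs[0];
  // N timestamps span N - 1 frame intervals.
  return elapsed > 0 ? ((frameTimestampsMs.length - 1) * 1000) / elapsed : 0;
}

// Below the threshold, switch to the photo-based or static swatch preview.
function chooseRenderMode(fps, minLiveFps = 15) {
  return fps >= minLiveFps ? 'live-overlay' : 'static-fallback';
}
```

Sampling for a second or two after the first tracked frame is usually enough to make the call before the shopper notices jitter.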

The WebAssembly model file for MediaPipe Face Mesh is approximately 2.5MB. Pre-fetching it as part of the SDK initialization sequence — before the user activates the camera — prevents a perceptible delay at session start. The Time to First Rendered Shade should be under 2 seconds from camera button tap on mid-range hardware. Above 3 seconds, abandonment rates increase sharply.
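The pre-fetch and the timing targets can be sketched as below. This is an illustration under assumptions: prefetchAssets and classifyTimeToFirstShade are hypothetical helpers, fetchImpl stands in for the browser's fetch, the asset URLs are whatever the SDK actually loads, and whether a fetched file is later served from cache depends on the HTTP caching headers on those assets.

```javascript
// Warm the HTTP cache for model assets during SDK initialization,
// before the shopper taps the camera button. Failures are collected, not thrown,
// so a failed pre-fetch never blocks page load.
function prefetchAssets(urls, fetchImpl) {
  return Promise.allSettled(
    urls.map((url) =>
      fetchImpl(url).then((res) => {
        if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
        return res.arrayBuffer(); // read the body so the download completes
      })
    )
  );
}

// Classify Time to First Rendered Shade against the targets in the text:
// under 2 seconds is the goal, above 3 seconds abandonment rises sharply.
function classifyTimeToFirstShade(tapMs, firstShadeMs) {
  const elapsed = firstShadeMs - tapMs;
  if (elapsed < 2000) return 'on-target';
  if (elapsed <= 3000) return 'at-risk';
  return 'over-budget';
}
```

In a browser, both timestamps would typically come from performance.now(), taken in the button's tap handler and after the first composited frame.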

ARKit and ARCore: native apps vs. browser context

Apple's ARKit and Google's ARCore provide native-level AR capabilities (positional tracking, scene understanding, plane detection) that exceed what browser-based AR can achieve via WebXR or WebGL. For a beauty try-on use case, that higher ceiling matters most in two areas: (1) face mesh accuracy at the lip and eye boundaries, where ARKit's depth sensing produces more accurate segmentation than camera-only methods; and (2) ambient lighting estimation, where ARKit can estimate the scene's lighting environment from the camera feed with better precision than the color temperature estimation approach available in a browser context.

However, native implementation (via ARKit for iOS or ARCore for Android) requires a native app — not a browser embed. This is a significant distribution trade-off for a beauty brand: app downloads add friction that a browser-embedded SDK eliminates. The typical calculus for an indie DTC brand is that the additional accuracy of a native implementation does not justify requiring a separate app download, given that conversion from app store listing to installed and opened represents a multi-step funnel loss most brands can't sustain.

We're not saying native AR try-on is worse — for brands with an existing app and an engaged installed base, a native implementation is worth pursuing. We're saying the accuracy differential between a well-built browser SDK and a native implementation is smaller than the distribution differential, for the majority of beauty brands who don't have a large existing app user base.

Mobile UX: the decisions that determine session completion

Beyond the rendering pipeline, the UX decisions that govern whether mobile users complete a try-on session are worth addressing explicitly. Three decisions account for most of the variance in session completion rates:

Where the try-on trigger appears in the page. On mobile, the product image gallery occupies the top portion of the viewport, followed by the product name, price, and shade selector. The try-on trigger button should appear within the first two scrolls — ideally immediately after or within the shade selector UI. A try-on button buried below product description copy gets activation rates of under 5% on mobile. A try-on button placed next to or immediately below the shade selector gets activation rates 4–8× higher.

How shade switching works within the session. A shade grid that requires the user to exit the camera view to switch shades breaks the session flow and results in most users not exploring additional shades. In-session shade switching — a scrollable shade strip visible while the camera is active — dramatically increases the number of shades a shopper tests per session. Each additional shade tested correlates with higher add-to-cart probability on the shades explored.

What happens on permission denial. A meaningful share of mobile users, varying by platform but roughly 15–25% in observed implementations, decline camera permission on first request, either out of habit or caution. The correct response is a non-alarming explanation of why camera access is needed, with a one-tap path to re-request. Where the browser has remembered the denial and will not prompt again, that path should instead point the user to the site's camera setting. A generic "camera access required" error with no next step produces permanent session abandonment for that user.
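The denial path can be sketched as a mapping from the error that getUserMedia rejects with to a UI state. The error names below (NotAllowedError, NotFoundError, NotReadableError) are the standard ones; cameraErrorToUiState, the state labels, and the message copy are hypothetical and would be replaced by the brand's own.

```javascript
// Map a getUserMedia rejection to a UI state with a clear next step.
function cameraErrorToUiState(err) {
  switch (err && err.name) {
    case 'NotAllowedError': // user or browser policy declined camera access
      return {
        state: 'permission-denied',
        canRetry: true,
        message: 'We only use your camera to show the shade on you. Nothing is recorded. Tap to try again.'
      };
    case 'NotFoundError': // no camera matched the request
      return {
        state: 'no-camera',
        canRetry: false,
        message: 'No camera was found. You can still preview shades on a model photo.'
      };
    case 'NotReadableError': // camera exists but is unavailable, e.g. in use elsewhere
      return {
        state: 'camera-busy',
        canRetry: true,
        message: 'Your camera seems to be in use by another app. Close it and tap to retry.'
      };
    default:
      return {
        state: 'unknown-error',
        canRetry: true,
        message: 'Something went wrong starting the camera. Tap to retry.'
      };
  }
}
```

The retry button should call getUserMedia again from its own tap handler; if that second call rejects immediately with NotAllowedError, the browser has likely persisted the denial and the UI can switch to instructions for the site's camera setting.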

Performance budget as a design constraint

The mobile performance budget for a try-on session, meaning the total computation and network load the device can sustain while maintaining a smooth rendering experience, is the constraint that should drive SDK architecture decisions, not the capabilities available on desktop.

The practical budget on a mid-range 2023 Android device: approximately 16ms per frame (the frame interval at a 60fps target; at the more realistic 30fps AR ceiling the interval is about 33ms) for the full pipeline including camera frame acquisition, face landmark inference, shade rendering, and canvas composite. MediaPipe Face Mesh consumes approximately 8–10ms per frame of that budget on this hardware class. The remaining 6–8ms is what the shade rendering pipeline has to work with per frame.
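The arithmetic behind these figures is simple enough to keep in code next to the pipeline. A small sketch with hypothetical helper names; with the exact 60fps interval of about 16.7ms, the remainder after 8–10ms of inference comes out at roughly 6.7–8.7ms, in line with the rounded 6–8ms above.

```javascript
// Frame interval in milliseconds for a target frame rate.
function frameBudgetMs(targetFps) {
  return 1000 / targetFps;
}

// Time left for shade rendering and compositing after landmark inference.
function remainingBudgetMs(targetFps, inferenceMs) {
  return frameBudgetMs(targetFps) - inferenceMs;
}

// Figures from the text: 60fps target, 8-10ms of face mesh inference per frame.
const bestCase = remainingBudgetMs(60, 8);   // inference at the fast end
const worstCase = remainingBudgetMs(60, 10); // inference at the slow end
```

Logging these next to measured per-frame timings makes it obvious when a rendering change has pushed the pipeline over budget.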

This budget forces a design decision that desktop implementations don't face: the per-frame shade compositing must be done in WebGL — not in Canvas 2D API, which is too slow for this budget — and the rendering code must be structured to minimize GPU state changes between frames. Implementing the compositing as a single-pass GLSL fragment shader, with shade color and opacity uniforms updated only when the active shade changes, keeps per-frame GPU cost within budget on mid-range hardware.
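A minimal sketch of that structure, under stated assumptions: the shader blends a flat shade color into the camera frame using a coverage mask that has already been rasterized from the face landmarks into a texture (u_mask), which is a simplification of real cosmetic rendering. SHADE_FRAGMENT_SHADER, createShadeUniforms, and the uniform names are hypothetical; gl is a WebGL rendering context with the program already linked and made current via gl.useProgram.

```javascript
// Single-pass WebGL 1 fragment shader: camera frame + mask + shade color in one draw.
const SHADE_FRAGMENT_SHADER = `
precision mediump float;
uniform sampler2D u_frame;   // current camera frame
uniform sampler2D u_mask;    // assumed: per-pixel coverage of the target region (0..1)
uniform vec3 u_shadeColor;   // active shade, linear RGB in 0..1
uniform float u_opacity;     // active shade opacity in 0..1
varying vec2 v_uv;
void main() {
  vec3 frame = texture2D(u_frame, v_uv).rgb;
  float coverage = texture2D(u_mask, v_uv).r * u_opacity;
  gl_FragColor = vec4(mix(frame, u_shadeColor, coverage), 1.0);
}`;

// Returns a setter that touches GPU uniform state only when the shade changes.
// Uniform values persist on the program between frames, so unchanged shades
// cost nothing per frame.
function createShadeUniforms(gl, program) {
  const colorLoc = gl.getUniformLocation(program, 'u_shadeColor');
  const opacityLoc = gl.getUniformLocation(program, 'u_opacity');
  let activeKey = null;

  return function setShade({ r, g, b, opacity }) {
    const key = `${r},${g},${b},${opacity}`;
    if (key === activeKey) return false; // same shade: skip the uniform uploads
    gl.uniform3f(colorLoc, r, g, b);
    gl.uniform1f(opacityLoc, opacity);
    activeKey = key;
    return true;
  };
}
```

The render loop can call setShade every frame with the currently selected shade; only a tap on the shade strip results in new uniform uploads, while the per-frame work is limited to updating the frame and mask textures and issuing one draw call.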

Designing for the mid-range device — not the flagship — is the discipline that produces a mobile try-on experience that actually works for the majority of a beauty brand's mobile audience. The users who need the confidence signal most are not always the users with the most capable hardware.