
AI Shade Matching: Why Accuracy Matters More Than Speed

Marcus Webb, May 12, 2025

When we talk about shade accuracy in AR try-on, we're not talking about whether the lipstick looks red or pink. We're talking about whether the shade on screen matches the physical product closely enough that a customer can make a confident purchase decision — and that's a significantly harder problem than it might appear from outside the rendering pipeline.

Speed gets more attention in AR discussions than accuracy. Frame rate metrics are easy to report. Shade accuracy is harder to quantify and harder to market, which is probably why the field has under-invested in it relative to rendering performance. But for a beauty brand, an AR try-on that renders the wrong shade smoothly at 30 frames per second is strictly worse than one that renders the right shade at 24 frames per second. Speed without accuracy is a confident wrong answer.

The color space problem

Cosmetics shade data lives in multiple color spaces depending on how it was captured, and translating between them without accuracy loss is the first challenge in building a reliable shade matching pipeline.

Most brands describe their shades as HEX values (sRGB triplets), the same color reference used in web design. HEX is convenient for digital workflows but problematic for precision color work: it is device-dependent in practice (without color management, the same HEX value displays differently on a calibrated wide-gamut monitor than on a smartphone with a standard sRGB panel), it is not perceptually uniform, and it carries no information about the viewing conditions under which the shade was photographed or specified.

The CIE LAB color space — commonly abbreviated L*a*b* — is the standard for perceptual color work in cosmetics and materials science. In LAB, the L* axis encodes lightness (0 = black, 100 = white), the a* axis encodes red-green opposition, and the b* axis encodes yellow-blue opposition. The critical property of LAB is perceptual uniformity: equal numerical distances in LAB space correspond to roughly equal perceived color differences, which is not true in sRGB.
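To make the HEX-to-LAB step concrete, here is a minimal sketch of the standard conversion, assuming the HEX value is a plain sRGB triplet and a D65 reference white. The function name `hex_to_lab` is ours for illustration; a production pipeline would more likely rely on a color-management library.

```python
def hex_to_lab(hex_code: str) -> tuple[float, float, float]:
    """Convert an sRGB hex string such as '#B03060' to CIE L*a*b* (D65 white)."""
    h = hex_code.lstrip("#")
    rgb = [int(h[i:i + 2], 16) / 255.0 for i in (0, 2, 4)]

    # Undo the sRGB transfer curve to get linear-light values.
    lin = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4 for c in rgb]

    # Linear sRGB -> CIE XYZ using the standard sRGB (D65) matrix.
    r, g, b = lin
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    # Normalise by the D65 reference white, then apply the CIELAB function.
    xn, yn, zn = 0.95047, 1.0, 1.08883
    delta = 6 / 29

    def f(t: float) -> float:
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))
```

Note that this conversion only changes the coordinate system. It cannot recover viewing-condition information that the HEX value never carried.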

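Delta-E (ΔE), which in its original CIE76 form is the Euclidean distance between two points in LAB space, is the standard metric for measuring perceived color difference. Later formulas such as CIEDE2000 add weighting terms to track perception more closely, but the interpretation is similar. ΔE values carry specific interpretive meaning: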

ΔE < 1.0: Imperceptible to most observers under any conditions.
ΔE 1–2: Perceptible only on close, side-by-side comparison under controlled lighting.
ΔE 2–3: Perceptible to a trained observer; threshold for most commercial color-matching applications.
ΔE 3–5: Clearly perceptible as a shade difference to most people.
ΔE > 5: Obviously different shades.
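The bands above can be sketched as a small helper. This uses the plain Euclidean (CIE76) distance; the function names and the handling of exact boundary values are illustrative assumptions, not a published standard.

```python
import math


def delta_e76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance between two L*a*b* triples."""
    return math.dist(lab1, lab2)


def describe_delta_e(de):
    """Map a Delta-E value onto the interpretive bands listed above."""
    if de < 1.0:
        return "imperceptible"
    if de < 2.0:
        return "perceptible side by side"
    if de < 3.0:
        return "perceptible to a trained observer"
    if de <= 5.0:
        return "clearly perceptible"
    return "obviously different"


rendered = (52.0, 41.0, 18.0)   # hypothetical rendered shade, L*a*b*
swatch = (53.0, 43.0, 20.0)     # hypothetical measured swatch, L*a*b*
de = delta_e76(rendered, swatch)
print(round(de, 2), describe_delta_e(de))  # 3.0 clearly perceptible
```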

For AR shade rendering to be accurate enough to influence purchase decisions, the rendered shade should match the physical product swatch at ΔE < 3.0 under typical ambient lighting conditions. This is the threshold we use for LumeCore's shade calibration validation. Rendering at ΔE 4–6 — which is where undertested systems often land on darker shades or under non-standard lighting — produces an obvious-enough mismatch to undermine the try-on's value as a purchase confidence tool.

Undertone classification: where most systems fail first

Undertone, the secondary color temperature of a person's skin, is the most consequential variable in foundation and complexion shade matching, and the one the industry addresses least adequately. Skin undertones are broadly categorized as warm (yellow-peach dominant), cool (pink-blue dominant), or neutral (balanced), with olive as a common additional category that sits between warm and neutral but behaves distinctly under cosmetic overlays.

The challenge for an automated classifier is that undertone is not directly measurable from a standard camera image. What the camera captures is surface reflectance, which is a product of undertone combined with Fitzpatrick depth, ambient lighting color temperature, and the spectral characteristics of the camera sensor. Disentangling undertone from those confounders requires more than color histogram analysis.

Our classification pipeline runs in the first two frames of a try-on session using a dual-signal approach: analyzing the color characteristics of the neck and inner wrist regions (which carry less melanin surface variation than the face) to improve undertone signal quality, alongside a learned spectral correction model that compensates for estimated ambient lighting color temperature. The neck/wrist insight is borrowed from how trained makeup artists and colorists actually assess undertone in professional settings — not from the face, which carries too much confounding variation.
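To illustrate the general idea of classifying from a low-variation skin patch, here is a deliberately simplified heuristic. It is not the learned pipeline described above: it assumes the neck or wrist pixels have already been lighting-corrected and converted to L*a*b*, it uses the LAB hue angle (higher means yellower, lower means pinker) as its only signal, and the two thresholds are placeholder values that would need tuning against labeled data. Olive is not handled.

```python
import math
from statistics import fmean


def mean_lab(samples):
    """Average a list of (L*, a*, b*) pixel samples from a neck or wrist patch."""
    return tuple(fmean(channel) for channel in zip(*samples))


def hue_angle_deg(lab):
    """Hue angle in the a*-b* plane, in degrees from the +a* (red) axis."""
    _, a, b = lab
    return math.degrees(math.atan2(b, a)) % 360


def classify_undertone(samples, cool_below=50.0, warm_above=60.0):
    """Toy classifier. The thresholds are illustrative assumptions, not measured values."""
    hue = hue_angle_deg(mean_lab(samples))
    if hue < cool_below:
        return "cool"
    if hue > warm_above:
        return "warm"
    return "neutral"
```

A single hue threshold cannot separate undertone from depth and lighting on its own, which is exactly why the confounders discussed above call for a learned correction rather than a rule like this.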

A concrete failure case illustrates why this matters. Consider a customer with Fitzpatrick Type IV skin and a strongly cool undertone. Without an accurate undertone classifier, an AR system may render warm-toned foundation shades in her depth range as reasonable matches, when the actual product would pull noticeably orange on her skin. The difference between a correctly matched cool foundation and an incorrectly matched warm one at similar depth can reach ΔE 6–9, which is obviously wrong even if the shade depth is right. A product return is almost guaranteed.

Ambient lighting normalization

The same shade rendering pipeline will produce different visible results under different ambient lighting conditions, and those differences can be large enough to cross the ΔE 3.0 accuracy threshold. Warm incandescent lighting (approximately 2700K color temperature) saturates warm pigments and desaturates cool ones. Cool fluorescent or blue-sky daylight (5000–6500K) does the reverse. A lip color that looks precisely accurate under 4000K neutral-white office lighting may render visibly orange-shifted under 2700K home lighting.

Compensating for this requires estimating the ambient lighting color temperature from the camera feed and applying an inverse correction to the rendering pipeline before shade output. The estimation step uses regions of the camera frame that should be close to white (typically the sclera of the eye, the wall behind the subject if visible, or a t-shirt collar) as a white-balance reference for the ambient scene.
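One textbook way to turn a near-white reference patch into a color temperature estimate is McCamy's approximation, which maps CIE xy chromaticity to correlated color temperature. The sketch below assumes the patch's mean color is given as 8-bit sRGB and that the patch really is neutral; it is not necessarily the estimator used in any particular product, and the approximation is only reasonable for light sources near the daylight and incandescent range.

```python
def estimate_cct(rgb):
    """Estimate correlated colour temperature (K) from the mean 8-bit sRGB
    colour of a patch that is assumed to be neutral white or grey."""
    # Undo the sRGB transfer curve to get linear-light values.
    lin = []
    for v in rgb:
        c = v / 255.0
        lin.append(c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = lin

    # Linear sRGB -> CIE XYZ (D65), then xy chromaticity.
    x_ = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y_ = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z_ = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    total = x_ + y_ + z_
    if total == 0:
        raise ValueError("reference patch is black; cannot estimate chromaticity")
    cx, cy = x_ / total, y_ / total

    # McCamy's cubic approximation of CCT from xy chromaticity.
    n = (cx - 0.3320) / (0.1858 - cy)
    return 449.0 * n ** 3 + 3525.0 * n ** 2 + 6823.3 * n + 5520.33
```

A pure sRGB white patch comes out near 6500K, as expected for D65, and a patch with a yellowish cast yields a lower estimate.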

This is a well-studied problem in computational photography, but applying it correctly in the context of cosmetics rendering requires calibrating the correction strength specifically against pigment behavior. Over-correcting creates its own artifacts — a rendering that looks accurate in one lighting environment but unnaturally desaturated in another. The tuning is empirical and requires test data across a range of real-world ambient conditions, not just standard controlled test environments.
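One simple way to expose correction strength as a tunable parameter is to interpolate between no correction and full white-balance gains. The sketch below assumes linear RGB values in [0, 1] and a reference patch that should be neutral; the function names and the `strength` parameter are illustrative, and the right value would come from the kind of empirical tuning described above.

```python
def white_balance_gains(ref_rgb, strength=1.0):
    """Per-channel gains that pull a near-white reference patch toward neutral.

    strength=0.0 applies no correction; strength=1.0 fully neutralises the patch.
    Green is held fixed, a common white-balance convention.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    r, g, b = ref_rgb
    if min(r, g, b) <= 0:
        raise ValueError("reference patch channels must be positive")
    full = (g / r, 1.0, g / b)
    # Blend each gain between 1.0 (no change) and its full-correction value.
    return tuple(1.0 + strength * (k - 1.0) for k in full)


def apply_gains(rgb, gains):
    """Scale a linear RGB colour by the gains, clipping to the displayable range."""
    return tuple(min(1.0, c * k) for c, k in zip(rgb, gains))
```

With a warm-tinted reference of (0.8, 0.7, 0.5), full strength maps the patch to an even grey, while a strength of 0.5 removes only half of the cast.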

The catalog calibration step that most brands skip

Accurate shade rendering depends on accurate shade reference data in the catalog. Most brand shade catalogs — the HEX values or color references that feed the SDK — were assembled from photography under specific studio conditions, and may have accumulated inconsistencies across multiple shoots, product reformulations, or catalog migrations.

Before a shade catalog goes into an AR rendering pipeline, it should be validated against physical swatches using a spectrophotometer — the same class of instrument used in print and textile color matching. A spectrophotometer measures the full spectral reflectance curve of a physical sample, which can be converted to LAB values and compared against the catalog reference. Discrepancies above ΔE 2.0 between catalog reference and measured swatch indicate a calibration issue that will propagate into the AR rendering as a systematic shade error.
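A catalog audit of this kind reduces to a per-shade ΔE comparison. The sketch below assumes both the catalog references and the spectrophotometer readings are already available as L*a*b* triples keyed by a shade identifier, and uses the plain CIE76 distance; the shade IDs and values are made up for illustration.

```python
import math


def audit_catalog(catalog_lab, measured_lab, threshold=2.0):
    """Return {shade_id: delta_e} for shades whose catalog reference differs
    from the measured swatch by more than `threshold` (CIE76 Delta-E)."""
    flagged = {}
    for shade_id, reference in catalog_lab.items():
        measured = measured_lab.get(shade_id)
        if measured is None:
            continue  # no physical measurement available for this shade
        de = math.dist(reference, measured)
        if de > threshold:
            flagged[shade_id] = round(de, 2)
    return flagged


catalog = {"rose-01": (45.0, 50.0, 20.0), "nude-02": (70.0, 12.0, 18.0)}
measured = {"rose-01": (45.5, 50.5, 20.5), "nude-02": (68.0, 14.0, 21.0)}
print(audit_catalog(catalog, measured))  # {'nude-02': 4.12}
```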

We're not saying every brand needs to run their own spectrophotometric analysis before launching try-on — a colorimetric audit service can do this in a few days. What we are saying is that the accuracy of the rendering output is bounded by the accuracy of the input data. A precisely calibrated rendering engine fed imprecise shade references will produce precise renderings of the wrong colors. Getting the input data right is a prerequisite, not an optional enhancement.

Speed and accuracy are not in opposition at production scale

The frequent framing of accuracy versus speed as a trade-off is misleading for production-grade platforms. The computationally expensive parts of accurate shade rendering — undertone classification, LAB conversion, lighting normalization — can all be run in the initialization phase of a session rather than on every frame. Once the calibration values are established in the first 2–3 frames, per-frame rendering is primarily a compositing operation that can run within a 30ms frame budget on mid-range mobile hardware.
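The split between setup and render can be sketched as a session object that runs calibration once and reuses the cached result for every subsequent frame. Everything here is a toy stand-in: `calibrate` is a hypothetical callable representing the expensive steps, frames are short lists of linear RGB pixels, and compositing is a plain alpha blend.

```python
FRAME_BUDGET_MS = 1000 / 30  # about 33.3 ms per frame at 30 fps


class TryOnSession:
    """Runs an expensive calibration once, then reuses it for every frame."""

    def __init__(self, calibrate):
        self._calibrate = calibrate   # hypothetical: undertone + lighting estimation
        self._gains = None            # cached calibration result
        self.calibration_runs = 0

    def render(self, frame, shade, alpha=0.6):
        """Blend `shade` over each pixel of `frame` (linear RGB in [0, 1])."""
        if self._gains is None:
            # Setup phase: pay the calibration cost a single time.
            self._gains = self._calibrate(frame)
            self.calibration_runs += 1
        # Render phase: cheap per-pixel compositing using the cached gains.
        lit = tuple(c * k for c, k in zip(shade, self._gains))
        return [
            tuple((1 - alpha) * p + alpha * s for p, s in zip(pixel, lit))
            for pixel in frame
        ]
```

However many frames are rendered, `calibration_runs` stays at 1, so the per-frame cost is only the compositing loop.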

The accuracy work happens in setup; the speed work happens in render. Building a pipeline that conflates these two phases — and sacrifices accuracy to hit per-frame latency targets — is making the wrong trade-off. A session that takes 600ms to initialize and then runs at 30fps is a better product than one that starts immediately and renders at 30fps with ΔE 5.0 shade error throughout.