Skin Tone Diversity in AR: The Technical Challenge Beauty Tech Must Solve

Marcus Webb, February 17, 2025 (8 min read)

The technical challenges of AR try-on are not evenly distributed across skin tones. This is one of the most consistently underacknowledged engineering problems in beauty tech, and it's worth being direct about why it exists, what it takes to fix, and where most platforms have chosen not to prioritize it.

The short version: AR beauty systems that render accurately across the full Fitzpatrick scale require substantially more calibration work than systems optimized for lighter tones. Most systems were not built with the full range as a primary requirement. The result is a technology class that works well for some customers and poorly for others — which is, in the context of beauty and self-representation, a significant failure.

What the Fitzpatrick scale actually measures

The Fitzpatrick scale was developed by dermatologist Thomas Fitzpatrick in 1975 as a tool for predicting skin response to UV radiation, not as a color system. It classifies skin by how it reacts to sun exposure (burning and tanning), which tracks melanin content only indirectly. The 1975 version covered four types for lighter skin; Types V and VI were added later, giving the six types used today: Type I (very fair, always burns, never tans) through Type VI (deeply pigmented, never burns). It's the most widely used classification system in dermatology and has been adopted broadly by the beauty and cosmetics industry as a practical reference for shade range design.

For AR rendering purposes, what the Fitzpatrick scale provides is a rough taxonomy of surface reflectance and absorption behavior. Lighter skin types (I–III) have higher diffuse reflectance: they reflect a larger share of incident light across the visible spectrum. Darker skin types (IV–VI) have higher melanin concentration, which absorbs more light overall and most strongly at shorter wavelengths, and this changes how pigment overlays visually interact with the skin substrate.

The engineering implication is this: a color overlay pipeline calibrated primarily against lighter skin behaves incorrectly on darker skin not because of a trivial mistake, but because the physics of the interaction are genuinely different. Compensating for it requires specific calibration, not a general brightness adjustment.

Where the Monk Skin Tone Scale adds precision

The Fitzpatrick scale has a practical limitation for AR and cosmetics applications: it was never designed as a perceptual color system, and it groups many distinct skin appearances into broad categories. Fitzpatrick IV, for example, encompasses a wide range of olive through medium-brown tones that behave quite differently under cosmetic overlays.

The Monk Skin Tone Scale (MST), developed by sociologist Ellis Monk and released by Google in 2022, provides a 10-tone progression designed for use in technology products. It is finer-grained than Fitzpatrick and better suited to the practical task of training and validating visual AI systems. For a rendering pipeline, the MST gives more granular calibration checkpoints: we can tune behavior at 10 reference points rather than 6, catching edge cases between Fitzpatrick categories that would otherwise fall through.

Our validation process uses both. The Fitzpatrick categories are the baseline requirement — every product must render accurately across I–VI. The MST provides the intermediate calibration targets that catch the errors that appear between Fitzpatrick categories rather than at the named points.

The specific rendering problems that appear at darker tones

Three failure modes appear repeatedly when an AR beauty pipeline is undertested at Fitzpatrick V–VI:

Ashy overlay rendering. When a warm-pigmented cosmetic (a coral or warm-nude lip color, for instance) is applied as a multiplicative blend over a high-melanin skin substrate using a naively calibrated blending mode, the result often renders with a desaturated, grayish cast rather than the warm tone the product actually delivers. This happens because the blending math is derived from lighter-skin assumptions about how the substrate interacts with added pigment.

Loss of edge detail on deeper skin tones. Face segmentation models (the neural network layer that identifies the lip region, eyelid region, and skin surface for overlay targeting) tend to produce lower confidence scores on darker skin tones when those tones are underrepresented in training. Lower segmentation confidence means the edge between the overlay region and the surrounding skin becomes approximate rather than precise. The result is a try-on that looks slightly blurred or mis-bordered on darker skin, even when the color rendering is otherwise correct.

Undertone misclassification. Undertone classification (warm, cool, or neutral) is the step that determines which foundation shades to recommend. Classification models trained on datasets skewed toward lighter skin tones often perform poorly at detecting undertone on deeper complexions, where the surface color cues the model has learned to read are weaker relative to the overall pigmentation. The practical result: a deeply pigmented customer with a distinctly warm undertone gets classified as neutral or cool, and the shade recommendation is wrong.
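To make the first failure mode concrete, here is a minimal sketch of a per-channel multiply blend applied to two substrates. The RGB values are hypothetical illustrations, not measured skin reflectance, and the function names are ours for this example. It does not reproduce the ashy cast itself, which depends on the full pipeline; it only shows the underlying arithmetic: the substrate scales every channel of the pigment, so the color contrast the pigment can contribute shrinks on a darker base, and a uniform brightness gain does not recover what the same pigment produces on a lighter base.

```python
def multiply_blend(base, pigment, opacity=1.0):
    """Per-channel multiply blend of a pigment over a base color (RGB in [0, 1])."""
    return tuple(b * (1 - opacity) + b * p * opacity for b, p in zip(base, pigment))


def chroma(rgb):
    """Crude colorfulness proxy: spread between the strongest and weakest channel."""
    return max(rgb) - min(rgb)


# Hypothetical illustrative values, NOT measured skin reflectance.
LIGHT_SKIN = (0.90, 0.72, 0.62)
DEEP_SKIN = (0.30, 0.18, 0.12)
CORAL = (1.00, 0.50, 0.40)

light_result = multiply_blend(LIGHT_SKIN, CORAL)
deep_result = multiply_blend(DEEP_SKIN, CORAL)

# Naive "fix": a uniform gain that matches the brightest channel of the light result.
gain = max(light_result) / max(deep_result)
brightened = tuple(min(1.0, c * gain) for c in deep_result)

print("chroma on light substrate:", round(chroma(light_result), 3))
print("chroma on deep substrate: ", round(chroma(deep_result), 3))
print("light result:             ", tuple(round(c, 3) for c in light_result))
print("deep result after gain:   ", tuple(round(c, 3) for c in brightened))
```

After the gain, the green and blue channels of the deep result still sit below those of the light result, so the hue has shifted. That is the arithmetic behind the earlier point that a general brightness adjustment is not a substitute for tone-specific calibration.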

What training data diversity actually means in practice

Building a rendering pipeline that performs correctly across the full skin tone range requires two things: a training dataset that is genuinely representative, and a validation protocol that explicitly tests at both ends of the range rather than only at the center.
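As a rough sketch of what testing at both ends of the range can look like, the snippet below slices evaluation results by MST tone instead of reporting one aggregate number. The record format, the function names, and the 0.8 accuracy floor are assumptions made for illustration, not a description of any particular pipeline.

```python
from collections import defaultdict


def accuracy_by_tone(records):
    """records: iterable of (mst_tone, is_correct) pairs. Returns {tone: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tone, is_correct in records:
        totals[tone] += 1
        hits[tone] += int(is_correct)
    return {tone: hits[tone] / totals[tone] for tone in sorted(totals)}


def failing_tones(records, min_accuracy, required_tones=range(1, 11)):
    """Tones whose accuracy falls below min_accuracy.

    A required tone with no samples is reported as failing:
    untested is not the same as passing.
    """
    per_tone = accuracy_by_tone(records)
    return [t for t in required_tones if per_tone.get(t, 0.0) < min_accuracy]


# Hypothetical results: tone 1 is 9/10 correct, tone 10 is 6/10 correct.
records = [(1, True)] * 9 + [(1, False)] + [(10, True)] * 6 + [(10, False)] * 4

overall = sum(ok for _, ok in records) / len(records)
print("overall accuracy:", overall)
print("per tone:", accuracy_by_tone(records))
print("failing at 0.8 floor:", failing_tones(records, 0.8, required_tones=(1, 10)))
```

The aggregate here is 0.75, which hides the gap between the two tones. Treating a missing tone as a failure is a deliberate design choice, so that an untested end of the range cannot pass by omission.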

On the training data side, the challenge is that publicly available facial image datasets — the ones commonly used to train computer vision models — have historically been heavily weighted toward lighter Fitzpatrick types, reflecting the demographics of the populations where they were collected and labeled. Building a pipeline that works well at Fitzpatrick V–VI often requires either sourcing or generating supplemental training data specifically at those tone ranges, or employing data augmentation techniques that synthetically expand the training distribution.
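Before sourcing supplemental data, it helps to quantify the skew. The sketch below counts samples per Fitzpatrick type and reports how many more each type would need to match a target count. The label format and the equal-count target are assumptions for illustration; a real project might set targets differently, for example per MST tone or per lighting condition.

```python
from collections import Counter

FITZPATRICK_TYPES = ("I", "II", "III", "IV", "V", "VI")


def supplemental_samples_needed(labels, types=FITZPATRICK_TYPES, target=None):
    """Return {type: extra samples needed} so every type reaches `target`.

    If `target` is None, the size of the largest existing bucket is used
    (an assumption made for this sketch, not a recommendation).
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.get(t, 0) for t in types)
    return {t: max(0, target - counts.get(t, 0)) for t in types}


# Hypothetical label counts for a small, skewed dataset.
labels = ["I"] * 5 + ["II"] * 8 + ["III"] * 6 + ["IV"] * 3 + ["V"] * 1
print(supplemental_samples_needed(labels))
```

In this toy dataset, Type VI has no samples at all, so it needs the full target count. A gap like that is invisible if only the total dataset size is reported.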

We're not saying dataset diversity is easy to achieve — it isn't, and it requires deliberate resource allocation rather than defaulting to available public datasets. What we are saying is that "we tested it across skin tones" means very little without specifics: how many reference tones, measured against what standard, and what was the acceptance threshold for accuracy?

Our internal threshold for shade rendering accuracy is a CIE ΔE value of less than 3.0 between the rendered color and the physical swatch under the same lighting conditions. ΔE (delta-E) is the standard colorimetric measure of perceived color difference. As a rule of thumb, a difference below ΔE 1 is imperceptible to most observers, below ΔE 3 is considered acceptable for most commercial applications, and above ΔE 5 is clearly visible; the exact bands depend on which ΔE formula is used. We require that threshold to hold across MST tones 1 through 10, not just at the lighter end of the range.
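A check of this kind can be sketched in a few lines. The example below uses the CIE76 formula (Euclidean distance in CIELAB), which is an assumption: the text above does not say which ΔE formula is used, and CIEDE2000 would give different numbers. The input format (one rendered and one reference Lab measurement per MST tone) and the function names are also hypothetical.

```python
import math

DELTA_E_MAX = 3.0          # threshold stated in the article
MST_TONES = range(1, 11)   # MST tones 1 through 10


def delta_e_cie76(lab_a, lab_b):
    """CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab_a, lab_b)


def tones_over_threshold(measurements, threshold=DELTA_E_MAX):
    """measurements: {tone: (rendered_lab, reference_lab)}.

    Returns {tone: delta_e} for every tone that fails the threshold.
    A tone with no measurement maps to None, since untested is not passing.
    """
    failures = {}
    for tone in MST_TONES:
        if tone not in measurements:
            failures[tone] = None
            continue
        rendered, reference = measurements[tone]
        delta_e = delta_e_cie76(rendered, reference)
        if delta_e >= threshold:
            failures[tone] = delta_e
    return failures


# Hypothetical measurements: every tone matches except tone 9.
reference = (50.0, 10.0, 10.0)
measurements = {tone: (reference, reference) for tone in MST_TONES}
measurements[9] = ((50.0, 13.0, 14.0), reference)

print(tones_over_threshold(measurements))
```

The useful property is that the result is keyed by tone, so a pass at the lighter end cannot average out a failure at the deeper end.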

Why this matters beyond accuracy metrics

There's a reason Rihanna's launch of Fenty Beauty in 2017 with 40 foundation shades was a commercial and cultural moment. It demonstrated, with commercial evidence, that the beauty industry had been systematically under-serving customers with deeper skin tones — and that correcting that underservice was both the right thing to do and good business.

AR try-on that doesn't work for darker skin tones doesn't just fail as a technical product. It signals to those customers that the tool wasn't built with them in mind — which, in the context of beauty and self-representation, lands as a specific and meaningful form of exclusion. Brands that deploy undertested AR try-on for their darker shade ranges may be creating a worse experience for exactly the customer segments they claim to be serving.

The technical challenge is real and it is solvable. It requires treating skin tone diversity as a first-order engineering requirement from the beginning of the pipeline design, not a QA checklist item at the end. The difference in outcome between those two approaches is the difference between a try-on that your whole customer base can rely on and one that works for most of them.