In 2024, a mid-size skincare and color brand, one with a more inclusive product lineup than most, added AR try-on to its website through a third-party tool. Returns didn't drop. In fact, returns from customers who had used the try-on feature were marginally higher than from those who hadn't. When we looked at the session data with them, the pattern was clear: shoppers with Fitzpatrick IV–VI skin tones were getting inaccurate renders. Foundation shades were rendering two to three shades lighter than the actual product. The try-on wasn't solving the shade-mismatch problem. It was replicating it in a new format. We see this more than we'd like to admit.
How Training Data Bias Enters the Pipeline
Most AR try-on systems that launched between 2018 and 2022 were trained on datasets that overrepresented lighter skin tones. This wasn't always intentional — it reflected the composition of available labeled face data, which was itself biased by who was photographed, who was asked to participate in data collection, and which public datasets happened to be well-annotated at the time. The result was models that performed well on Fitzpatrick I–III skin tones and degraded noticeably on IV–VI.
The bias shows up in two distinct ways. The first is landmark detection accuracy — the face mesh that defines where lips begin, where the eyelid crease sits, and how the jawline curves. Models trained predominantly on lighter-skin images can have higher landmark error rates on darker skin tones, particularly in areas where skin/hair contrast is lower (hairline, eyebrow edges). A mesh that's slightly off produces a product overlay that doesn't sit correctly.
The second form of bias is in tone-mapping: the adjustment that makes a given shade look correct on a given skin tone. If the model hasn't seen enough examples of how particular pigments interact with high-melanin skin, its opacity and undertone predictions will be wrong. This is why a shade that looks accurate on a Fitzpatrick II tester can render too light, too saturated, or with an off undertone on a Fitzpatrick V tester.
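To make the opacity dependence concrete, here is a minimal sketch of why one shade lands differently on different skin tones. It assumes a plain alpha blend in linear light, which is a deliberate simplification and not how any particular try-on engine renders. The function names, hex values, and the 0.6 opacity are illustrative assumptions.

```python
def hex_to_rgb(hex_color: str) -> tuple:
    """Parse '#rrggbb' into sRGB channels in [0, 1]."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))


def rgb_to_hex(rgb) -> str:
    """Format sRGB channels in [0, 1] as a lowercase '#rrggbb' string."""
    return "#" + "".join(f"{round(max(0.0, min(1.0, c)) * 255):02x}" for c in rgb)


def srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve for one channel."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(c: float) -> float:
    """Apply the sRGB transfer curve to one linear-light channel."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055


def composite(shade_hex: str, skin_hex: str, opacity: float) -> str:
    """Blend a product shade over a skin tone at the given opacity (assumed model)."""
    shade = [srgb_to_linear(c) for c in hex_to_rgb(shade_hex)]
    skin = [srgb_to_linear(c) for c in hex_to_rgb(skin_hex)]
    mixed = [opacity * s + (1 - opacity) * k for s, k in zip(shade, skin)]
    return rgb_to_hex(linear_to_srgb(c) for c in mixed)


# Same shade and opacity, two hypothetical skin tones: the visible result differs,
# so an opacity tuned only on lighter skin will not transfer to deeper skin.
print(composite("#c08060", "#f1d3c0", 0.6))
print(composite("#c08060", "#4a2c1e", 0.6))
```

Even in this toy model the output depends on the skin underneath, so a fixed opacity validated only on lighter tones says little about how the shade reads on deeper ones.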
The Fitzpatrick Scale and Its Limits
Most skin-tone classification systems in beauty AR are built around the Fitzpatrick scale, developed in 1975 to classify photosensitivity to UV radiation. In its current form it's a six-point scale: Type I (very fair, always burns) through Type VI (deeply pigmented, never burns). The original version covered only Types I–IV; Types V and VI were added later. It's become a de facto standard in skin-tone representation in digital beauty, partly because it's simple and partly because it was what was available when early datasets were being labeled.
The Fitzpatrick scale has real limitations for cosmetics applications. It was designed to classify UV sensitivity, not optical properties. Two people with the same Fitzpatrick classification can have meaningfully different undertones — one cool-toned, one warm — and those differences matter enormously for how a nude lipstick reads. The scale also underrepresents the diversity within darker tone categories; the range from Fitzpatrick IV to VI covers enormous variation that a six-point system can't fully capture.
Some newer systems layer undertone classification on top of Fitzpatrick, measuring the ratio of warm (yellow-red) to cool (blue-pink) tones in the skin's optical signature. This produces a more granular characterization, which in turn supports better shade rendering. In our own development, we've found that a combined Fitzpatrick + undertone model reduces perceived shade-accuracy errors among Type IV–VI testers by about 34% compared to a Fitzpatrick-only approach.
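As a rough illustration of what a second axis can look like, the sketch below pairs a Fitzpatrick type with an undertone label derived from the hue angle in the CIELAB a*/b* plane. This is one possible proxy for the warm-to-cool ratio, not a description of our production model. The `SkinProfile` type, the function names, and the 50° and 60° cutoffs are hypothetical and would need calibration against real measurements.

```python
import math
from typing import NamedTuple


class SkinProfile(NamedTuple):
    fitzpatrick: str  # "I" through "VI"
    undertone: str    # "cool", "neutral", or "warm"


def undertone_from_lab(a_star: float, b_star: float,
                       cool_below: float = 50.0,
                       warm_from: float = 60.0) -> str:
    """Label undertone from the CIELAB hue angle (cutoffs are illustrative).

    For skin, a* (red) and b* (yellow) are usually both positive, so a larger
    hue angle means relatively more yellow and a smaller one more red/pink.
    """
    hue_deg = math.degrees(math.atan2(b_star, a_star))
    if hue_deg < cool_below:
        return "cool"
    if hue_deg >= warm_from:
        return "warm"
    return "neutral"


def skin_profile(fitzpatrick: str, a_star: float, b_star: float) -> SkinProfile:
    """Combine the single-axis Fitzpatrick label with an undertone label."""
    return SkinProfile(fitzpatrick, undertone_from_lab(a_star, b_star))


# Two hypothetical testers with the same Fitzpatrick type but different undertones.
print(skin_profile("V", a_star=15.0, b_star=12.0))
print(skin_profile("V", a_star=10.0, b_star=20.0))
```

The point of the two-field profile is that both testers would be indistinguishable to a Fitzpatrick-only system, even though a nude lipstick would read differently on each.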
"Equity in beauty tech isn't a values statement — it's a product quality problem. If your try-on is inaccurate for 40% of your shoppers, you don't have an inclusive try-on tool. You have a try-on tool that excludes by default."
— Camille Laurent, CEO & Co-Founder, Lumeglint
What Brands Should Ask When Evaluating AR Tools
If you're a brand assessing AR try-on vendors, the standard demo will typically show you a white woman trying on lipstick in good lighting. That's not sufficient to evaluate a tool's equity performance. Here are the questions worth asking:
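- What Fitzpatrick range does your training data cover? A responsible vendor should be able to tell you this. If they can't, it's a red flag.
- What is your Delta-E accuracy across Fitzpatrick I–VI? Delta-E is a standard color-difference metric: the distance between the rendered color and a reference color in a perceptual color space such as CIELAB, where lower means closer. A good try-on should achieve under 5 Delta-E across all skin tones, not just the lighter range. Ask to see per-tier accuracy numbers, and ask which Delta-E formula they use (CIE76 or CIEDE2000, for example), since values from different formulas aren't directly comparable.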
- Can I test with a diverse set of testers before signing? Any credible vendor should welcome this. If they steer you toward controlled demo environments only, that's a reason to probe harder.
- How do you handle undertone variation within Fitzpatrick tiers? The answer here tells you whether the system is doing real tone-matching or just applying opacity based on a single-axis skin-tone estimate.
- What does your shade library look like for deep complexions? Some vendors build accurate rendering but only for the light-medium shade range because that's where their brand customers' products were focused. The rendering quality for foundations in the N60–N90 range, for example, may not have been validated at all.
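If a vendor shares rendered and reference colors, the per-tier Delta-E numbers in the second question are easy to sanity-check yourself. The sketch below assumes the simplest formula, CIE76 (Euclidean distance in CIELAB, D65 white point), and hex colors as input. The sample colors and function names are made up for illustration, and a vendor using CIEDE2000 will report different values for the same pair.

```python
import math
from collections import defaultdict


def hex_to_lab(hex_color: str) -> tuple:
    """Convert an sRGB '#rrggbb' color to CIELAB (D65 reference white)."""
    h = hex_color.lstrip("#")
    srgb = [int(h[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    r, g, b = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
               for c in srgb]
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t: float) -> float:
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e_cie76(hex_a: str, hex_b: str) -> float:
    """CIE76 color difference between two sRGB hex colors."""
    return math.dist(hex_to_lab(hex_a), hex_to_lab(hex_b))


def per_tier_delta_e(samples) -> dict:
    """Average Delta-E per Fitzpatrick tier.

    `samples` is an iterable of (tier, rendered_hex, reference_hex) tuples.
    """
    by_tier = defaultdict(list)
    for tier, rendered, reference in samples:
        by_tier[tier].append(delta_e_cie76(rendered, reference))
    return {tier: sum(vals) / len(vals) for tier, vals in by_tier.items()}


# Hypothetical measurements: an exact match on tier II, a too-light render on tier V.
samples = [
    ("II", "#c68674", "#c68674"),
    ("V", "#8a5a48", "#6b4030"),
]
for tier, value in per_tier_delta_e(samples).items():
    print(tier, round(value, 1), "OK" if value < 5 else "over 5")
```

Reporting the average per tier, rather than one pooled number, is what keeps a strong result on lighter tones from masking a weak one on deeper tones.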
The Business Case, Not Just the Moral One
The equity argument for inclusive shade rendering is important on its own terms. Shoppers with deeper skin tones have historically been underserved by the beauty industry — from limited shade ranges to products that simply didn't work on their complexions. AR tools that replicate this exclusion are compounding a long-standing problem.
But there's a business case too, and it's worth stating plainly. US shoppers identifying as Black, Hispanic, and multiracial collectively account for a growing share of beauty spending; one analysis of beauty category purchase data suggests this group will represent over 45% of US cosmetics revenue by 2030. Brands that build trust with these shoppers through accurate representation, including accurate try-on, are positioned for better retention than brands that treat them as an afterthought.
Return rates tell a similar story. In our analysis of return attribution data from early brand pilots, shade-mismatch return rates were consistently higher for customers in Fitzpatrick IV–VI when using try-on tools with inadequate tone-matching. The reduction in returns that AR try-on is supposed to deliver doesn't materialize if the rendering is wrong for those shoppers.
Building Toward Accuracy Across All Tones
The solution isn't mysterious — it requires collecting more representative training data, testing accuracy metrics across the full Fitzpatrick range, and being honest about where the current system falls short. It also requires brand participation. Shade data that covers a brand's full deep-complexion range, provided with accurate hex codes and finish metadata, enables the rendering system to produce accurate outputs. Incomplete shade data — or shade data that only covers the light-to-medium range — produces a try-on that works for some shoppers and not others, regardless of how good the underlying model is.
We built our tone-matching system with explicit coverage validation across all six Fitzpatrick tiers. Before a shade catalog goes live, we run automated accuracy checks against simulated skin-tone inputs across the full range. Any shade with a per-tier Delta-E above threshold gets flagged for review before publishing. It's not a perfect system — there's no perfect system — but it catches the most common rendering errors before they reach shoppers.
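In outline, a pre-publish gate of this kind can be quite small. The sketch below is a simplified stand-in, not our actual pipeline: it assumes per-tier Delta-E values have already been computed for each shade, uses 5.0 as the threshold (borrowed from the vendor guideline above), and treats a tier with no measurement as a flag in its own right. The catalog contents and the `review_flags` name are hypothetical.

```python
TIERS = ("I", "II", "III", "IV", "V", "VI")


def review_flags(catalog: dict, threshold: float = 5.0) -> list:
    """Return (shade_id, tier, reason) for every shade/tier pair needing review.

    `catalog` maps shade_id -> {tier: delta_e}. A tier with no measurement is
    flagged as "missing" so gaps in coverage cannot pass silently.
    """
    flags = []
    for shade_id, per_tier in catalog.items():
        for tier in TIERS:
            value = per_tier.get(tier)
            if value is None:
                flags.append((shade_id, tier, "missing"))
            elif value > threshold:
                flags.append((shade_id, tier, f"delta_e={value:.1f}"))
    return flags


# Hypothetical catalog: N60 renders poorly on deeper tiers, N20 was never tested on VI.
catalog = {
    "N60": {"I": 2.1, "II": 2.4, "III": 3.0, "IV": 4.2, "V": 6.4, "VI": 7.0},
    "N20": {"I": 1.5, "II": 1.8, "III": 2.2, "IV": 3.9, "V": 4.8},
}
for flag in review_flags(catalog):
    print(flag)
```

Flagging missing tiers separately from high Delta-E values matters because an untested tier would otherwise look identical to a passing one.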
The bar for "good enough" in inclusive AR rendering should be the same bar applied to product photography and shade development: every customer who looks at your product should see something accurate. Not accurate for fair skin and merely close for everyone else. Every shopper. That's what the technology can do, and it's what we should hold it to.