The traffic-light colour on your watch is making promises the data can't keep

Tuesday morning.

You wake up rough. Slept badly. Legs are heavy. There is a vague tightness in your left hamstring that was not there on Monday.

You open the watch.

Green. Recovery 78. Body Battery 84. Ready to perform.

You shake your head, decide the watch knows what it is doing, and head out to do the threshold session that was on the plan.

By Thursday you are in a hole. Whatever did not feel right on Tuesday has now turned into something you have to back off from for a week. Maybe two.

The green was not about you. The green was about the average of people like you. Those are different things.

What Green Actually Means

The colour on a wearable recovery card is the output of two questions stitched together.

The first question is: what is this number today?

The second question is: what is a normal range of this number for someone of your age, sex, and fitness profile, based on what the device has collected about you over time?

The colour reports where the first answer sits relative to the second. It does not judge the number itself. Green means today is in the normal range. Amber means today is below that range. Red means today is well below it.

This is not what athletes assume green means.

Athletes assume green means good to train. Good to push. Body is ready. The device did not say that. The device said, this number is in a normal place for someone like you.

A normal HRV is not the same thing as a recovered HRV. A normal resting heart rate is not the same thing as a fresh resting heart rate. The colour is reporting on where you are inside a band, not on whether you should train hard today.
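
To make the band idea concrete, here is a minimal sketch in Python. The function name and the thresholds (55 ms and 45 ms) are invented for illustration and are not any vendor's actual logic. It assumes a metric where higher is better, such as HRV.

```python
def band_colour(today: float, normal_low: float, well_below: float) -> str:
    """Colour a higher-is-better metric by where it sits against a normal band."""
    if today >= normal_low:
        return "green"   # inside (or above) the normal range
    if today >= well_below:
        return "amber"   # below the normal range
    return "red"         # well below the normal range


# Hypothetical HRV band: normal starts at 55 ms, "well below" starts under 45 ms.
for hrv in (72, 56, 50, 40):
    print(hrv, band_colour(hrv, normal_low=55, well_below=45))
```

A reading of 72 and a reading of 56 both come back green. The colour cannot tell you which of those two mornings you are having.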

The Baseline Problem

The other layer is whose normal the device is comparing you to.

Whoop builds a personal baseline from your own rolling history. If your HRV has been quietly drifting down for three weeks, your baseline drifts with it. Today’s HRV is compared to that drifted baseline. Today reads green even though it would have read amber against your baseline from six weeks ago.
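
The drift is easy to reproduce with made-up numbers. The sketch below assumes a 14-day rolling mean as the personal baseline and a 10% tolerance before the colour changes. Both are illustrative choices, not a description of how any particular device works.

```python
from statistics import mean


def rolling_baseline(history: list[float], window: int = 14) -> float:
    """Personal baseline: mean of the most recent `window` readings."""
    return mean(history[-window:])


def colour(today: float, baseline: float, tolerance: float = 0.10) -> str:
    """Green if today is no more than `tolerance` below the baseline."""
    return "green" if today >= baseline * (1 - tolerance) else "amber"


# Made-up HRV (ms): three steady weeks at 70, then three weeks sliding 0.5 ms a day.
history = [70.0] * 21 + [70.0 - 0.5 * day for day in range(1, 22)]
today = 59.0

drifted = rolling_baseline(history)      # follows the slide downward
six_weeks_ago = mean(history[:14])       # frozen at the start of the history

print(colour(today, drifted))            # green
print(colour(today, six_weeks_ago))      # amber
```

Same athlete, same reading, two different colours. The only thing that changed is which baseline the reading was held up against.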

Garmin and Apple lean more on population norms. Your HRV is compared to that of, say, other 38-year-old male athletes who run 50 km a week. The problem there is the inverse. Your individual normal might be twenty points off the population norm and the device does not know it.

There is no perfect baseline. Personal baselines drift. Population baselines miss the individual. Every system picks one of the two trade-offs.

The athlete who reads the colour without knowing which trade-off is in play is reading a number with no idea what it is being compared against.

The Other Bug Inside the Baseline

A specific version of this problem is worth calling out, because we ran into it ourselves and the same mistake may well exist inside other systems too.

Our restoration score, which measures how well overnight recovery is restoring the body’s resting state, was using an all-day average heart rate as the comparison point for the overnight low.

An all-day average HR for an athlete is inflated by their training. If you do an hour of zone three running in the afternoon, your daily average HR goes up by five or six beats per minute. When you then sleep, the gap between your overnight low and your all-day average looks wider. The score reads that wider gap as better restoration. So the day you trained hardest produces the highest restoration score.

This is exactly backwards. The day you trained hardest produces the inflated baseline, which makes the comparison meaningless. We were rewarding the athlete for the training, not the recovery.

The fix is to compare overnight low against resting heart rate, which is a calmer, training-independent baseline. The numbers come out very differently.
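
A worked example shows the reversal. The numbers below are invented, and the `gap` function is a stand-in for whatever the real score does with the comparison. It only measures how far the overnight low sits below the chosen reference.

```python
from dataclasses import dataclass


@dataclass
class Day:
    label: str
    all_day_avg_hr: float    # bpm, includes training time
    resting_hr: float        # bpm, measured at rest
    overnight_low_hr: float  # bpm, lowest overnight value


def gap(reference_hr: float, overnight_low_hr: float) -> float:
    """How far the overnight low sits below the chosen reference, in bpm."""
    return reference_hr - overnight_low_hr


# Invented numbers: the hard day lifts the all-day average by 6 bpm,
# and the overnight low is assumed to settle 2 bpm higher after hard training.
rest_day = Day("rest", all_day_avg_hr=62, resting_hr=52, overnight_low_hr=48)
hard_day = Day("hard", all_day_avg_hr=68, resting_hr=52, overnight_low_hr=50)

for day in (rest_day, hard_day):
    buggy = gap(day.all_day_avg_hr, day.overnight_low_hr)
    fixed = gap(day.resting_hr, day.overnight_low_hr)
    print(f"{day.label}: vs all-day average {buggy} bpm, vs resting HR {fixed} bpm")
```

Against the all-day average, the hard day shows the wider gap (18 bpm versus 14) and looks like the better night. Against resting heart rate, the ordering flips (2 bpm versus 4) and the hard day looks like the worse night, which is the direction you would expect.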

We have no way of knowing how often this kind of bug exists inside other people’s recovery scores. The point is that the baseline a wearable uses to compute the colour is often the part hiding the problem. If you cannot see what the comparison is being made against, you cannot tell whether the colour is honest.

Status Has to Follow Favourability, Not Magnitude

The other change we made is harder to summarise but more important.

A wearable status indicator usually works on magnitude. Number high, colour up. Number low, colour down. This works for measurements where higher is better, like sleep duration or HRV.

It does not work for measurements where the relationship is more complex.

Take restoration. A score below the athlete's target is bad, and the further below, the worse. But a magnitude-based status indicator that sees the score drop while it is still inside the usual range will report a small downward step, not a colour change.

We rebuilt our status logic so it follows favourability, not magnitude. The question the status answers is, is this number on the good side or the bad side of where it should be for this athlete. Not, is this number in a normal range across the population.

The two answers diverge most exactly when the athlete most needs the answer.
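
Here is one way the difference could look in code. This is a sketch under assumptions, not our production logic: the 10% cut-offs, the targets, and the function names are all invented for illustration. The token names are the ones this post uses for its own taxonomy.

```python
def favourability(value: float, target: float, higher_is_better: bool) -> float:
    """Relative distance from target, positive on the good side."""
    delta = (value - target) / target
    return delta if higher_is_better else -delta


def status(value: float, target: float, higher_is_better: bool) -> str:
    """Map favourability to a token. The 10% cut-offs are illustrative only."""
    score = favourability(value, target, higher_is_better)
    if score >= 0.10:
        return "strong"
    if score >= 0.0:
        return "ok"
    if score >= -0.10:
        return "watch"
    return "low"


def in_usual_range(value: float, low: float, high: float) -> bool:
    """The magnitude-style check: is the number inside a typical range?"""
    return low <= value <= high


# Hypothetical restoration score: athlete's target 70, usual range 55 to 85.
print(in_usual_range(66, 55, 85))                 # True: no colour change
print(status(66, 70, higher_is_better=True))      # watch
# Hypothetical resting HR: target 52 bpm, and here lower is better.
print(status(58, 52, higher_is_better=False))     # low
```

The range check sees nothing to report at 66. The favourability check sees a number on the wrong side of this athlete's target and says so. The `higher_is_better` flag is what lets the same rule handle resting heart rate, where a rise is the bad direction.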

How to Argue With Your Watch

When the colour on your watch and your felt sense disagree, do not assume either one is right.

Ask three questions.

What is the actual underlying number? Not the colour. The HRV value, the resting HR, the sleep score components. The colour is a compression of these. The underlying numbers are where the disagreement comes from.

What is the device comparing this number to? Your own baseline or a population baseline. Both have failure modes. Knowing which one is in play tells you what the colour is actually claiming.

Has anything in my life changed in the last 14 days that the device cannot see? New job stress. Worse sleep environment. A different bed because you are travelling. An injury you have been managing. The device does not know about any of these, and the colour will not reflect them.

Three checks. Two minutes. The colour stops being a verdict and becomes a data point you can argue with.

What We Did

Our brief now ships a status block for every metric that follows favourability, not magnitude. Each metric carries a token, one of low, watch, ok, or strong. The token follows what the number should be doing for this athlete, not what is typical across the population.

The screens render from the token. The colour the athlete sees is the same colour across the brief, the metric cards, and the recovery score breakdown. Three screens cannot disagree about a metric because they are all reading the same value from the same place.
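
The shape of that arrangement can be sketched in a few lines. Everything named here is hypothetical: the `Metric` type, the three render functions, and the colour palette are stand-ins, since this post does not specify the real ones. The point is only that the token is computed once and every screen reads it.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    LOW = "low"
    WATCH = "watch"
    OK = "ok"
    STRONG = "strong"


# Assumed palette; the real colours are not specified in this post.
STATUS_COLOUR = {
    Status.LOW: "red",
    Status.WATCH: "amber",
    Status.OK: "green",
    Status.STRONG: "green",
}


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    status: Status  # computed once, upstream of every screen


def render_brief(m: Metric) -> str:
    return f"[{STATUS_COLOUR[m.status]}] {m.name}"


def render_card(m: Metric) -> str:
    return f"{m.name}: {m.value} ({STATUS_COLOUR[m.status]})"


def render_breakdown(m: Metric) -> str:
    return f"{m.name} -> {m.status.value} -> {STATUS_COLOUR[m.status]}"


restoration = Metric("Restoration", 66, Status.WATCH)
for render in (render_brief, render_card, render_breakdown):
    print(render(restoration))
```

No screen recomputes the status from the raw value, so there is no second place for a different answer to come from.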

The change is small in the code. The thing it cleans up is large. Athletes were getting traffic-light cues from us that did not mean what they thought they meant, and we had three different screens occasionally disagreeing about the same number. Both of those are fixed.

The Headline

The colour on your watch is a useful prompt to look closer.

It is not a verdict.

Treat green as: the data did not flag anything obvious, so go ahead and check how you feel. Treat amber as: the data flagged something, so pause and decide. Treat red as: the data is loud enough that you should not need to be talked into backing off.

But never treat any colour as the answer. The answer is in the underlying number, the baseline it is being compared against, and the context the device cannot see.

P247 ships a single status taxonomy across the dashboard. Status follows whether the number is good for you, not whether it is typical for the population. The basis for each call is surfaced so you can argue with it instead of trusting it blindly.

X Thread

1/ Tuesday morning. You feel rough. Watch says green. You do the session. By Thursday you are in a hole.

2/ Green does not mean ready to push. Green means this number is in a normal range for someone like you. Those are different statements.

3/ Whoop builds a personal baseline that drifts down with you. Garmin uses population norms that miss the individual. Every system picks a trade-off you cannot see.

4/ When the colour and your felt sense disagree, do not pick one. Look at the underlying number, the baseline it is being compared against, and the context the device cannot see.

5/ P247 ships a status taxonomy that follows whether the number is good for you, not whether it is typical for the population. With the basis surfaced so you can argue with it.