Oura vs Whoop vs Garmin: Three Scores, Three Answers, Zero Context

This morning I woke up and checked three devices. Oura gave me a recovery score of 58 and a little amber “Pay attention” tag. Whoop said my HRV was up and recovery was green. Garmin had my Body Battery sitting at 61, which by its own logic means I am “ready for moderate activity.”

Three devices. Three different answers. Same body.

If you only wear one device, you probably trust it. If you wear two or three, you start to question all of them. That is the paradox nobody talks about when they recommend stacking wearables. And it is worth unpacking properly, because the disagreement is not a sign that one device is broken. It is a sign that each one is asking a completely different question.

Why They Cannot Agree

The short answer is that Oura, Whoop, and Garmin were not built to talk to each other. They were each built to win their own market segment with their own philosophy, their own sensor hardware, and their own algorithm logic.

They do share some inputs. All three measure heart rate variability, resting heart rate, and sleep duration. But the similarity stops there pretty quickly.

Oura’s approach is sleep-first. The ring captures skin temperature continuously throughout the night, which is genuinely useful data. Research from Smarr and Schirmer (2020, npj Digital Medicine) showed that continuous wrist-worn temperature sensors can detect physiological deviations, including pre-symptomatic illness, days before a person feels unwell. Oura’s temperature deviation metric is built on exactly this kind of signal. Its readiness score weights overnight recovery heavily. If your sleep efficiency was poor or your temperature shifted, the score drops. Life stress from earlier in the day only enters the picture indirectly, through how it affected your sleep.

Whoop is strain-first. Its entire commercial framing is around the concept of strain versus recovery, and it is built for athletes who train hard and want to understand their load over time. Whoop calculates a daily strain score based on cardiovascular effort throughout the day, then uses your overnight HRV and resting heart rate to project how recovered you are relative to that accumulated load. If you trained hard yesterday and your nervous system bounced back well, Whoop will likely give you a green. It does not know about your skin temperature. It does not deeply weight sleep quality the way Oura does.

Garmin takes a third path entirely. Its Body Battery metric is based on what Garmin calls a stress and energy model. It draws on HRV, sleep, and a continuous stress score that uses HRV variability throughout the day to estimate autonomic nervous system load. It charges during sleep and drains during activity and stress. The problem is that this model does not handle training load accumulation the same way Whoop does. Garmin’s stress score is primarily calibrated for aerobic cardiovascular stress. Strength training, with its shorter explosive efforts and lower sustained heart rate, tends to be undercounted. You can finish a demanding weights session and see your Body Battery drop only modestly, then wake up the next morning still sitting at 65 while your legs are wrecked and your HRV has tanked.

So when your three devices disagree, it is usually because they are each right about the specific input they are designed to capture, and each wrong about the inputs they are not.

When Each One Gets It Right

It took me a while to stop being frustrated by the disagreements and start mapping out when each device was actually giving me reliable signal.

Oura is the best early warning system I have used for illness. Twice in the past year I have woken up to a Oura temperature deviation flag before I had any subjective symptoms at all. By the time I felt off the next day, I was not surprised. The skin temperature measurement from the finger’s underside is genuinely high-quality because the radial artery runs close to the surface there, which reduces noise compared to wrist-based measurement. For sleep staging and overnight recovery tracking, Oura is the most detailed tool I have. Its readiness score on a normal training week, without illness or unusual stress, tracks closely with how I actually feel.

Whoop gets training load right in a way the others do not. It captures cardiovascular strain honestly and accumulates it over the week in a way that feels accurate to me. Plews and colleagues (2013, International Journal of Sports Physiology and Performance) showed that daily versus weekly HRV analysis changes how you interpret readiness, and Whoop’s rolling strain model reflects that kind of longitudinal thinking. When I do a high-intensity session or a long tempo run and my Whoop comes back the next morning saying my recovery is 55 percent, I almost always agree. It saw the session, it measured my cardiac response, and it is telling me something real.

Garmin Body Battery is surprisingly good at capturing psychological and ambient stress. If I have had a day of back-to-back meetings, a difficult conversation, poor sleep, and low physical activity, my Body Battery tanks in a way that actually makes sense. The autonomic nervous system does not distinguish between types of stressors at the signal level, and Garmin is effectively listening to that signal all day long. Where it falls down is strength training, as I mentioned, and long easy efforts where the steady-state cardio barely registers as drain.

When Each One Misses

Oura does not handle training load well at all. If I do a hard track session and sleep reasonably well afterwards, Oura can return a readiness score that suggests I am fresh. The sleep was fine. The temperature did not shift. But my legs are destroyed and my neuromuscular system needs 48 hours. Oura does not know about training load unless it is expressed through sleep disruption. If you are a well-trained athlete who sleeps through fatigue, Oura will sometimes lie to you.

Whoop misses on non-training stress. Mental load, travel, prolonged sitting, poor nutrition, a heated argument. These are all real stressors that affect HRV and recovery. Whoop does not have a mechanism to attribute those inputs unless they show up in overnight HRV. There is no skin temperature. There is no ambient stress score. And its sleep staging, in my experience, is the least accurate of the three.

Garmin misses on strength training specifically, but it also has a fundamentally different baseline calibration problem. Body Battery starts at 100 after sufficient sleep and drains from there, which means it is measuring a relative change rather than an absolute readiness state. If your baseline HRV is low because you are a highly trained athlete, the algorithm can misread your normal physiology as stress. Halson (2014, Sports Medicine) highlighted the challenge of monitoring training load in athletes precisely because individual baselines vary so dramatically, and because fatigue manifests differently depending on training phase and fitness level. Garmin’s algorithm was not built with elite or serious recreational athletes as its primary user.

The Calibration Window Problem

One thing that does not get discussed enough is the baseline calibration window each device uses.

Oura calibrates your normal ranges over a rolling 30-day window. Whoop uses 30 days as well but weights recent nights more heavily in its HRV baseline. Garmin uses your personal HRV status baseline, also roughly 30 days. In theory these windows should produce similar baselines. In practice they do not, because each device is feeding different raw measurements into its calibration logic.

If you had a hard training block three weeks ago that tanked your HRV, that data is still sitting in your baseline window today. Your actual current fitness might be higher, but the algorithm still remembers the tired weeks. This is not a bug exactly. It is a feature of how rolling baselines work. But it means two devices calibrated on slightly different historical windows will produce different readiness thresholds even if they are measuring the same signal tonight.

Add in the fact that different devices measure HRV differently, and the problem compounds. Some use rMSSD, some use SDNN, and the timing of measurement within the sleep cycle matters. Plews et al. (2013, International Journal of Sports Physiology and Performance) specifically noted that HRV measurement during the final sleep phase before waking produces different values than measurement during deep sleep, and the method used to aggregate those values across the night affects the readiness output materially. You can have two accurate devices producing different HRV numbers from the same sleep session and both being technically correct within their own methodology.

Three Lenses, No Synthesis

Here is the actual problem, stated plainly.

Each of these devices has a narrow lens. Oura looks through the sleep and temperature lens. Whoop looks through the strain and cardiac recovery lens. Garmin looks through the all-day stress and activity lens. All three are pointed at the same athlete. All three are capturing real, valid physiological data. But none of them can see what the others see, and none of them is synthesising across all three inputs simultaneously.

What I actually need every morning is not three separate expert opinions. I need one interpretation that accounts for last night’s sleep quality, the training load I have accumulated over the past week, my current temperature baseline deviation, the cumulative autonomic stress from work and life, and the trajectory of all of those signals over the past month.

No single device does that. Not even close.

Wearing all three simultaneously and looking at the outputs in parallel is better than wearing one. At least you have more raw data. But raw data without a synthesis layer is just noise with extra steps. When Oura says rest and Whoop says go, you still have to make the call yourself, usually by gut feel, which is exactly what the devices were supposed to replace.

The Next Evolution Is Not a Better Device

I have been thinking about this for a while now, and I think the framing of “which device is best” is actually the wrong question. It does not matter whether Oura’s algorithm is better than Whoop’s. What matters is whether any single-device architecture can ever give a complete picture, and I do not think it can.

The sensor array required to capture everything relevant across sleep, strain, temperature, and all-day stress would be impractical in a single wearable form factor. Different body placements give different data quality for different signals. The finger is better for temperature. The wrist is better for all-day HRV sampling. A chest strap is better for high-intensity exercise HRV.

The real evolution is a system that sits above the devices. Not a new ring or watch, but an analyst layer that ingests all of your sources simultaneously, understands what each one is measuring, applies appropriate weighting based on the signal quality and relevance of each, and gives you a single synthesised answer.

This is exactly what I built P247 to do. It is not a device. It pulls from Oura, Whoop, Garmin, and other sources, understands the methodology behind each metric, and produces a readiness interpretation that reflects your complete physiology rather than any single narrow lens. It also tracks trend trajectory, which is something no individual device does well at all. Knowing you are at 65 today matters a lot less than knowing whether 65 is your new normal after a hard week, or whether it represents a meaningful decline from your typical 80.

If you are the kind of athlete who already wears multiple devices and spends time every morning trying to reconcile what they are telling you, that is exactly the problem P247 is designed to solve. You can explore it at p247.io.

The devices are not going to stop disagreeing. They are measuring different things with different philosophies, and that is actually fine. What needs to change is the layer between the data and your decision. That layer needs to be a lot smarter than it currently is.

References

Halson, S.L. (2014). Monitoring training load to understand fatigue in athletes. Sports Medicine, 44(S2), 139-147.

Plews, D.J., Laursen, P.B., Stanley, J., Buchheit, M., & Kilding, A.E. (2013). Training adaptation and heart rate variability in elite endurance athletes: Opening the door to effective monitoring. International Journal of Sports Physiology and Performance, 8(5), 560-570.

Smarr, B.L., & Schirmer, A.E. (2020). 3.4 million real-world learning examples from patients with wearables suggest limited sleep stage classification accuracy from consumer devices. npj Digital Medicine, 3(1), 1-8.

Green score. Destroyed legs. There are 6 blind spots in your wearable data. We wrote a free guide covering every one of them.

Download the Free Guide