COOPER FLAGG · KNEE-FLEXION VALIDATION vs HITL
measured pose vs the 44-record human ground truth · honest by design
What this is, and isn't. The HITL prose has no degree labels, no per-shot alignment
to the video (no shared game/clock key), and no subject-matter expert (SME) has confirmed it. So this is a
grounding + plausibility check, not an accuracy validation. It shows whether the metric is meaningful
and where the pipeline is (and isn't) trustworthy. It does not prove any single angle is correct:
that needs an SME rating depth on aligned shots.
1 · Is measuring knee flexion meaningful?
The human annotators already track knee/hip load qualitatively, so quantifying it
gives the first numeric measurement of a feature they already describe. The prose spans a real range:
2 · One trustworthy measurement TRUSTED
—°
3 · At scale, GPU-free NOT YET
0 / 48 plausible trusted values without a per-shot ball anchor
Chart: the 48-shot batch (current GPU-free pipeline), broken down by outcome:
- abstained (mostly not side-on, which is the correct call)
- trusted but implausible angle
- trusted + plausible
4 · Same player / shot population? sanity, w/ a catch
Make / miss — both miss-heavy (contested draft-eval 3s). Consistent.
Footwork — rescan is suspiciously uniform vs the varied human labels:
Flag: the Gemini rescan labels 44/48 shots "1-2", while the humans see a 1-2 / hop / other mix.
That is consistent with Gemini fabricating structured fields (as it did with phase timing).
Trust the human footwork, not the rescan's.
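The uniformity flag above can be made concrete by checking how much of a label set is taken up by its single most common value. The sketch below is only illustrative: `dominant_share` is a hypothetical helper, and the four non-"1-2" rescan labels are not given in the source, so they appear as a placeholder. The human label split is not quoted here, so it is not reproduced.

```python
from collections import Counter


def dominant_share(labels):
    """Return (most common label, its share of all labels)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Rescan split from the text: 44 of 48 shots labeled "1-2".
# The remaining four labels are NOT in the source; "other" is a placeholder.
rescan = ["1-2"] * 44 + ["other"] * 4

label, share = dominant_share(rescan)
print(f"{label}: {share:.1%}")  # 1-2: 91.7%
```

Running the same helper on the human footwork labels would give a direct comparison; a much lower dominant share there would back up the "suspiciously uniform" reading.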
5 · What a real validation needs
- Per-shot ball anchoring (SAM 3) so the crop locks onto Cooper, not the central-largest player — the lone anchored shot is the only trustworthy one.
- A dip-location gate + the [55–150°] plausibility band: the current gate vets pose quality, not whether the measured frame is the actual deepest knee bend, so straight-leg frames slip through.
- An SME to rate flexion depth on a set of aligned shots — the only way to turn "plausible" into "accurate."
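
The second requirement can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's actual code: it assumes the angle is the interior hip-knee-ankle angle (180° is a straight leg), it uses the smallest angle across a shot's frames as a crude stand-in for a dip-location gate, and `knee_angle_deg`, `deepest_frame`, `classify_shot`, and the `pose_ok` flag are hypothetical names. Only the 55–150° band and the three outcome labels come from the text.

```python
import math

# Plausibility band quoted in the text, in degrees.
BAND_DEG = (55.0, 150.0)


def knee_angle_deg(hip, knee, ankle):
    """Interior angle at the knee from 2D keypoints given as (x, y).

    Assumed convention: 180 is a straight leg, smaller is deeper flexion.
    """
    thigh = (hip[0] - knee[0], hip[1] - knee[1])
    shank = (ankle[0] - knee[0], ankle[1] - knee[1])
    norm = math.hypot(*thigh) * math.hypot(*shank)
    if norm == 0:
        raise ValueError("degenerate keypoints: zero-length limb segment")
    cos_a = (thigh[0] * shank[0] + thigh[1] * shank[1]) / norm
    # Clamp to guard against floating-point drift just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def deepest_frame(angles):
    """Return (frame index, angle) of the smallest knee angle, or None.

    None entries stand for frames with no usable pose.
    """
    valid = [(a, i) for i, a in enumerate(angles) if a is not None]
    if not valid:
        return None
    angle, idx = min(valid)
    return idx, angle


def classify_shot(angles, pose_ok):
    """Sort one shot into the three outcomes used for the 48-shot batch."""
    dip = deepest_frame(angles) if pose_ok else None
    if dip is None:
        return "abstained"
    lo, hi = BAND_DEG
    if lo <= dip[1] <= hi:
        return "trusted + plausible"
    return "trusted but implausible angle"


# A shot whose frames never leave a near-straight leg fails the band.
print(classify_shot([175.0, 172.0, 171.0], pose_ok=True))
```

Taking the minimum over frames only helps if the frames actually cover the dip; a real gate would also need to confirm that the dip falls inside the shot window, which this sketch does not attempt.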