Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses

Abstract

Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g. golf clubs, bats, hockey sticks).

Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools.
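To make the two cues concrete, the sketch below shows one plausible way to express them as residuals for a least-squares solver. It is an illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with known intrinsics `K`, and the function names (`project`, `reprojection_residuals`, `stick_length_residuals`) are hypothetical.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates.

    Assumes known intrinsics K and extrinsics (R, t) with x_cam = R @ X + t.
    """
    Xc = X @ R.T + t            # world frame -> camera frame
    uv = Xc @ K.T               # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]

def reprojection_residuals(K, R, t, X, observed_uv):
    """Cue (i) and (ii): 2D error of human or stick keypoints in one camera."""
    return (project(K, R, t, X) - observed_uv).ravel()

def stick_length_residuals(stick_a, stick_b, known_length_m):
    """Cue (ii) only: deviation of the reconstructed stick length, per frame."""
    return np.linalg.norm(stick_a - stick_b, axis=1) - known_length_m

# Toy check with hypothetical values: a point on the optical axis
# projects to the principal point, and a 0.86 m stick has zero residual.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
X = np.array([[0.0, 0.0, 2.0]])
reproj = reprojection_residuals(K, R, t, X, np.array([[640.0, 360.0]]))
stick = stick_length_residuals(np.zeros((1, 3)),
                               np.array([[0.86, 0.0, 0.0]]), 0.86)
```

Reprojection residuals alone are invariant to a global rescaling of the scene, which is why the stick-length term is the one that fixes the metric scale.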

To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories captured with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers state-of-the-art performance, achieving low rotation and translation errors.

Methodology

**Illustration of our proposal.** For sports involving stick-like implements (e.g., golf, baseball, hockey, and kendo), we propose a tool-free multi-camera extrinsic calibration method that leverages both the stick and the human poses.

**Three-stage optimization.** (1) An initial, unscaled 3D pose is reconstructed from multi-view 2D keypoints via Bundle Adjustment (BA). (2) The real-world scale is recovered using a known measurement, such as the length of a baseball bat (0.86 m). (3) A final, scale-aware BA is performed to refine the metric 3D reconstruction.
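Stage (2) can be sketched as follows. This is a minimal illustration, assuming the unscaled reconstruction provides the two stick endpoints per frame; the use of the median and the helper names (`recover_metric_scale`, `apply_scale`) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def recover_metric_scale(stick_a, stick_b, known_length_m):
    """Return the factor mapping unscaled reconstruction units to metres.

    stick_a, stick_b: (T, 3) arrays of the stick endpoints per frame,
    expressed in the arbitrary units of the stage-1 reconstruction.
    """
    lengths = np.linalg.norm(stick_a - stick_b, axis=1)
    # Assumption: the median is used so a few badly reconstructed
    # frames do not dominate the estimate.
    return known_length_m / np.median(lengths)

def apply_scale(scale, points, cam_translations):
    """Scale 3D points and camera translations; rotations are unaffected."""
    return scale * points, scale * cam_translations

# Toy example: a 0.86 m bat reconstructed at half its true size.
rng = np.random.default_rng(0)
num_frames = 50
true_len = 0.86
a = rng.normal(size=(num_frames, 3))
direction = rng.normal(size=(num_frames, 3))
direction /= np.linalg.norm(direction, axis=1, keepdims=True)
b = a + 0.5 * true_len * direction

scale = recover_metric_scale(a, b, true_len)
a_m, _ = apply_scale(scale, a, np.zeros((3, 3)))
b_m, _ = apply_scale(scale, b, np.zeros((3, 3)))
metric_lengths = np.linalg.norm(a_m - b_m, axis=1)
```

Scaling the points and the camera translations by the same factor leaves every 2D projection unchanged, so the result of stage (1) stays consistent with the observations and can serve as the starting point for the scale-aware BA in stage (3).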

Sports-Stick-Syn Dataset

Quantitative Results

**Per-camera Error Distribution.** The human + stick approach yields lower errors and variance in both rotation and translation compared to human-only setups.
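This page does not define the error metrics. A common choice for extrinsic calibration, assumed here purely for illustration, is the geodesic angle between estimated and ground-truth rotations and the Euclidean distance between translations; the function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    cos_theta = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors (same units)."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

# Toy example: a 10-degree rotation about the z-axis and a 5 cm offset.
angle = np.radians(10.0)
R_est = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
rot_err = rotation_error_deg(R_est, np.eye(3))
trans_err = translation_error([0.05, 0.0, 0.0], [0.0, 0.0, 0.0])
```

Because extrinsics from self-calibration are defined only up to a global rigid transform, such metrics are typically computed on relative poses or after aligning the estimate to the ground truth.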

**Per-noise-level Error Distribution.** As noise increases, the human + stick approach continues to perform robustly, maintaining a tight error profile.

**Runtime distribution.** Execution time scales reasonably well with the number of cameras across all trials.

Downstream Applications in Real Scenes

**Real Scene Evaluation.** We evaluated our method in an uncontrolled outdoor scene using time-synchronized cameras. Real-world factors such as illumination variation, background clutter, and detector noise pose challenges that are largely absent in simulation. Our optimization effectively recovers the relative camera extrinsics and reconstructs a coherent, metric-scale 3D human-club motion sequence from uncalibrated inputs.

**Enabling precise 3D sports gesture analysis via tool-free self-calibration.** Our method utilizes rigid stick constraints to self-calibrate multi-camera setups (Left & Center), enabling scale-aware 3D pose reconstruction. This supports robust downstream applications in sports analytics (Right), allowing for accurate analysis of sport gestures and movement.

Citation

If you find this project useful for your research, please use the following BibTeX entry:

@inproceedings{yang2026multicamera,
      title={Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses}, 
      author={Fan Yang and Changsoo Jung and Ryosuke Kawamura and Hon Yung Wong},
      booktitle={2026 International Conference on Automatic Face and Gesture Recognition (FG)},
      year={2026}
}