Multi-Camera Self-Calibration in Sports Motion Capture:
Leveraging Human and Stick Poses

Fan Yang1, Changsoo Jung1,2, Ryosuke Kawamura1, Hon Yung Wong1

1Fujitsu Research, 2Colorado State University
FG 2026

Abstract

Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g. golf clubs, bats, hockey sticks).

Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools.

To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers state-of-the-art performance, achieving low rotation and translation errors.

Methodology

Illustration of our proposal. For sports involving stick-like implements (e.g., golf, baseball, hockey, and kendo), we perform tool-free multi-camera extrinsic calibration by leveraging both the stick and the human poses.
Three-stage optimization. (1) An initial, unscaled 3D pose is reconstructed from multi-view 2D keypoints via Bundle Adjustment (BA). (2) The real-world scale is recovered using a known measurement, such as the length of the baseball bat (0.86 m). (3) A final, scale-aware BA is performed to refine the metric 3D reconstruction.
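
Stage (2) can be illustrated with a minimal sketch. It assumes the unscaled reconstruction provides the two 3D endpoints of the stick in every frame, and that a single global scale factor is estimated as the known length divided by the median reconstructed length. The function names, the array layout, and the use of the median are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def recover_metric_scale(stick_ends, known_length_m):
    """Estimate a global scale from per-frame stick endpoints.

    stick_ends: array of shape (T, 2, 3), unscaled 3D endpoints per frame
                (hypothetical layout chosen for this sketch).
    known_length_m: physical stick length in metres, e.g. 0.86 for a bat.
    """
    stick_ends = np.asarray(stick_ends, dtype=float)
    lengths = np.linalg.norm(stick_ends[:, 1] - stick_ends[:, 0], axis=1)
    # The median is an assumed robust choice against noisy frames.
    return known_length_m / np.median(lengths)

def apply_scale(scale, points_3d, cam_translations):
    """Scale 3D points and camera translations; rotations are unaffected."""
    return (scale * np.asarray(points_3d, dtype=float),
            scale * np.asarray(cam_translations, dtype=float))

# Toy example: a stick reconstructed at half its true length in three frames.
unscaled_ends = np.array([
    [[0.0, 0.0, 0.0], [0.43, 0.0, 0.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.43, 1.0]],
    [[2.0, 0.0, 1.0], [2.0, 0.0, 1.43]],
])
unscaled_t = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 2.0]])

scale = recover_metric_scale(unscaled_ends, known_length_m=0.86)
metric_ends, metric_t = apply_scale(scale, unscaled_ends, unscaled_t)
metric_lengths = np.linalg.norm(metric_ends[:, 1] - metric_ends[:, 0], axis=1)
```

Scaling the 3D points and the camera translations by the same factor leaves every 2D reprojection unchanged, which is why the scale cannot be observed from keypoints alone and a known length is needed to fix it.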

Sports-Stick-Syn Dataset

Our Sports-Stick-Syn dataset. Top: Statistics of our dataset, including sports categories, number of cameras, and noise levels. Bottom: An example visualization from the dataset.

Quantitative Results

Per-camera Error Distribution. The human + stick approach yields lower errors and variance in both rotation and translation compared to human-only setups.
Per-noise-level Error Distribution. As noise increases, the human + stick approach continues to perform robustly, maintaining a tight error profile.
Runtime distribution. Execution time scales reasonably well with the number of cameras across all trials.
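
This page does not spell out how the rotation and translation errors above are defined. The sketch below shows one common convention, assumed here for illustration only: the geodesic angle between estimated and ground-truth rotation matrices, and the Euclidean distance between translation vectors, both taken after the two camera sets are expressed in a common reference frame. The function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = np.asarray(R_est, dtype=float) @ np.asarray(R_gt, dtype=float).T
    # Clip guards against values slightly outside [-1, 1] from round-off.
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors (same units)."""
    return float(np.linalg.norm(np.asarray(t_est, dtype=float)
                                - np.asarray(t_gt, dtype=float)))

# Toy example: the estimate is off by a 10-degree rotation about the z-axis
# and a 5 cm offset in the xy-plane.
angle = np.radians(10.0)
R_gt = np.eye(3)
R_est = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
rot_err = rotation_error_deg(R_est, R_gt)
trans_err = translation_error([0.03, 0.04, 0.0], [0.0, 0.0, 0.0])
```

The translation error is only meaningful in metres once the global scale has been fixed, which is one reason the stick-length constraint matters for evaluation as well as for reconstruction.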

Downstream Applications in Real Scenes

Real Scene Evaluation. We evaluated our method in an uncontrolled outdoor scene using time-synchronized cameras. Real-world factors such as illumination variation, background clutter, and detector noise pose challenges that are largely absent in simulation. Our optimization effectively recovers the relative camera extrinsics and reconstructs a coherent, metric-scale 3D human-club motion sequence from uncalibrated inputs.
Enabling precise 3D sports gesture analysis via tool-free self-calibration. Our method utilizes rigid stick constraints to self-calibrate multi-camera setups (Left & Center), enabling scale-aware 3D pose reconstruction. This supports robust downstream applications in sports analytics (Right), allowing for accurate analysis of sport gestures and movement.

Citation

If you find this project useful for your research, please use the following BibTeX entry:

@inproceedings{yang2026multicamera,
      title={Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses}, 
      author={Fan Yang and Changsoo Jung and Ryosuke Kawamura and Hon Yung Wong},
      booktitle={2026 International Conference on Automatic Face and Gesture Recognition (FG)},
      year={2026}
}