MaskBench - A Comprehensive Benchmark Framework for 2D Pose Estimation and Video De-Identification

Authors

  • Tim Riedel (Hasso Plattner Institute, University of Potsdam, Germany)
  • Zainab Zafari (Hasso Plattner Institute, University of Potsdam, Germany)
  • Sharjeel Shaik (University of Potsdam, Germany)
  • Babajide Alamu Owoyele (Hasso Plattner Institute, University of Potsdam, Germany)
  • Wim Pouw (Tilburg University, Netherlands)

Published

August 8, 2025

1 Abstract

Pose estimation plays a critical role in numerous computer vision applications but remains challenging in scenarios involving privacy-sensitive data and in real-world, unconstrained videos such as TED Talks that are not recorded under controlled laboratory conditions. To address the issue of sharing datasets across academic institutions without compromising privacy, we explore how masking strategies such as blurring, pixelation, contour overlays, and solid fills impact pose estimation performance. We introduce MaskBench, a modular and extensible benchmarking framework designed to evaluate pose estimators under varying conditions, including masked video inputs. MaskBench integrates a total of seven pose estimators, including YoloPose, MediaPipePose, OpenPose, and both automated and human-in-the-loop variants of MaskAnyone, a multi-stage pipeline combining segmentation and pose estimation through a mixture-of-experts approach. Our evaluation covers four datasets of increasing scene complexity and uses both kinematic metrics such as velocity, acceleration, and jerk and accuracy-based metrics such as Percentage of Correct Keypoints (PCK) and Root Mean Square Error (RMSE). Results show that the MaskAnyone variants significantly improve the visual quality of the estimated poses by reducing jitter and improving keypoint stability, especially the human-in-the-loop MaskAnyone-MediaPipe variant. These visual results are supported by quantitative metrics, with the aforementioned models achieving the lowest acceleration and jerk values across all datasets. YoloPose consistently ranks as the most robust standalone model. Regarding masking techniques, preliminary results suggest that blurring offers a promising balance between privacy and pose estimation quality. However, since this experiment was conducted on a limited set of videos, further investigation is needed to draw general conclusions. These findings highlight the potential of pipelines like MaskAnyone and the extensibility of MaskBench for future research on pose estimation under privacy-preserving constraints.

2 Getting Started

2.1 🛠️ Installation

System Requirements

  • 🐳 Docker: Latest stable version
  • 🎮 GPU: NVIDIA (CUDA-enabled)
  • 💾 Memory: 20 GB or more

Follow the instructions below to install and run experiments with MaskBench:

  1. Install Docker and ensure the daemon is running.

  2. Clone this repo:

    git clone https://github.com/maskbench/maskbench.git
  3. Switch to the git repository

    cd maskbench
  4. Set up the folder structure. For a quick start, create a dataset folder with a name of your choice in the assets/datasets/ folder. Create a videos folder inside it and place one or more videos there. For storing datasets, output or weights in different locations, see “Editing the .env file”. The labels, maskanyone_ui_mediapipe and maskanyone_ui_openpose folders are optional and not required for a quick start. The structure of a dataset is outlined below and also detailed in the “Usage - Dataset Structure” section:

    maskbench/
    ├── src
    ├── config/
    │   └── your-experiment-config.yml
    └── assets/
        ├── weights
        ├── output
        └── datasets/
            └── your-dataset-name/
                ├── videos/
                │   └── video_name1.mp4
                ├── labels/
                │   └── video_name1.json
                ├── maskanyone_ui_mediapipe/
                │   └── video_name1.json
                └── maskanyone_ui_openpose/
                    └── video_name1.json
  5. Create the environment file. This file is used to tell MaskBench about your dataset, output and weights directory, as well as the configuration file to use for an experiment. Copy the .env file using:

    cp .env.dist .env
  6. Edit the .env file. Open it using vim .env or nano .env. Adjust the following variables:

    • MASKBENCH_CONFIG_FILE: The configuration file used to define your experiment setup. By default, it is set to config/getting-started.yml, but you can copy any of the provided configuration files to config/ and edit it to your needs.
    • MASKBENCH_GPU_ID: If you are on a multi-GPU setup, tell MaskBench which GPU to use. Either specify a number (0, 1, …) or “all”, in which case all available GPUs on the system are used. Currently, MaskBench only supports inference on a single GPU or on all GPUs.

    The following variables only need to be adjusted, if you use a different asset folder structure than the one proposed above (for example, if your dataset is large and you want to store it on a separate disk):

    • MASKBENCH_DATASET_DIR: The directory where entire datasets are located. MaskBench supports video files with .mp4 and .avi extensions.
    • MASKBENCH_OUTPUT_DIR: The directory where experiment results will be saved.
    • MASKBENCH_WEIGHTS_DIR: The directory where model weights are stored, including user-specific weights for custom pose estimators.
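
    A minimal example .env might look like this (the values below are placeholders based on the default folder layout; adjust them to your setup):

    MASKBENCH_CONFIG_FILE=config/getting-started.yml
    MASKBENCH_GPU_ID=0
    MASKBENCH_DATASET_DIR=./assets/datasets
    MASKBENCH_OUTPUT_DIR=./assets/output
    MASKBENCH_WEIGHTS_DIR=./assets/weights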
  7. Edit the configuration file “getting-started.yml” to use the videos folder of your dataset. See section “Usage - Configuration Files” for more details.

    dataset:
        name: GettingStarted
        code_file: datasets.dataset.Dataset
        video_folder: /datasets/<your-dataset-name>/videos  # Edit this line to point to the videos folder of your dataset.
  8. Build and run the MaskBench Docker container.

    docker compose build
    docker compose up

    If multiple users run MaskBench simultaneously, use docker compose -p $USER up.

  9. Install MaskAnyone. If you plan on using the UI version of MaskAnyone to create smooth poses and masked videos, or to improve the output of raw pose estimation models, follow the MaskAnyone installation instructions.

2.2 🚀 Usage

The following paragraphs describe how to structure your dataset, configure the application, and understand the output of MaskBench. Following these guidelines ensures the application runs smoothly and recognizes your data correctly.

2.2.1 📂 Dataset structure

  1. Videos: Place all videos you want to evaluate in the videos folder.

    your-dataset/
    └── videos/
         ├── video_name1.mp4
         └── video_name2.mp4
  2. Labels (Optional): If you provide labels, there must be exactly one label file for each video, with the same file name. Example:

    your-dataset/
    ├── videos/
    │   └── video_name1.mp4
    └── labels/
        └── video_name1.json
  3. MaskAnyoneUI Output: If you use MaskAnyoneUI, run the application, download the resulting pose file, store it in either the maskanyone_ui_openpose or maskanyone_ui_mediapipe folder and once again name it exactly like the corresponding video file.

2.2.2 ⚙️ Configuration Files

We provide four sample configuration files from our experiments. Feel free to copy and adapt them to your needs. The following note explains some parameters in more detail.

# A directory name (MaskBench run name) inside the output directory from which to load existing results.
# If set, inference is skipped and results are loaded. To run inference from scratch, comment out or set to "None".
inference_checkpoint_name: None
execute_evaluation: true                    # Set to false to skip calculating evaluation metrics and plotting.
execute_rendering: true                     # Set to false to skip rendering the videos.

dataset:
  name: TragicTalkers                                               # User-definable name of the dataset
  code_file: datasets.tragic_talkers_dataset.TragicTalkersDataset   # Module and class name of the dataset to instantiate
  video_folder: /datasets/tragic-talkers/videos                     # Location of the dataset videos folder (always starts with /datasets, because this refers to the mounted folder in the docker container). You only need to adjust the name of the dataset folder.
  gt_folder: /datasets/tragic-talkers/labels                        # Path to the ground truth poses folder
  config:
    convert_gt_keypoints_to_coco: true                              # Whether to convert the ground truth keypoints to COCO format

pose_estimators:                            # List of pose estimators (specify as many as needed)
  - name: YoloPose                          # User-definable name of the pose estimator. 
    enabled: true                           # Enable or disable this pose estimator.
    code_file: models.yolo_pose_estimator.YoloPoseEstimator # Module and class name of the pose estimator to instantiate
    config:                                 # Pose estimator specific configuration variables.
      weights: yolo11l-pose.pt              # Weights file name inside the specified weights directory.
      save_keypoints_in_coco_format: true   # Whether to store keypoints in COCO format (18 keypoints) or not
      confidence_threshold: 0.3             # Confidence threshold below which keypoints are considered undetected.

  - name: MaskAnyoneUI-MediaPipe
    enabled: true
    code_file: models.maskanyone_ui_pose_estimator.MaskAnyoneUiPoseEstimator
    config:
      dataset_poses_folder: /datasets/tragic-talkers/maskanyone_ui_mediapipe # Folder of MaskAnyone poses
      overlay_strategy: mp_pose
      save_keypoints_in_coco_format: true
      confidence_threshold: 0               # Confidence thresholds not supported by MaskAnyone

metrics:                                    # List of metrics (specify as many as needed)
  - name: PCK                               # User-definable name of the metric
    code_file: evaluation.metrics.pck.PCKMetric
    config:                                 # Metric specific configuration variables.
      normalize_by: bbox
      threshold: 0.2

  - name: Velocity
    code_file: evaluation.metrics.velocity.VelocityMetric
    config:
      time_unit: frame

2.2.3 🏋️ Model Variants and Weights

You only need to modify the weights if you are adding a new pose estimator to MaskBench. In that case, place your weights in the weights/ folder. By default, MaskBench automatically downloads the following weights:

  • MediaPipe: pose_landmarker_{lite, full, heavy}.task
  • Yolo11: yolo11{n, s, m, l, x}-pose
  • OpenPose overlay_strategy: BODY_25, BODY_25B, COCO
  • MaskAnyone overlay_strategy: mp_pose, openpose, openpose_body25b
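
For example, to benchmark a lighter YOLO variant, you can point the weights parameter of a YoloPose entry in your configuration file to one of the automatically downloaded files. The estimator name below is user-definable; the remaining keys mirror the sample configuration above:

pose_estimators:
  - name: YoloPose-Nano                     # User-definable name
    enabled: true
    code_file: models.yolo_pose_estimator.YoloPoseEstimator
    config:
      weights: yolo11n-pose.pt              # Any of the yolo11{n, s, m, l, x}-pose weights
      save_keypoints_in_coco_format: true
      confidence_threshold: 0.3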

2.2.4 📊 Output

All results, including plots, pose files, inference times and renderings, will be saved in the output directory. For each run of MaskBench a folder is created with the name of the dataset and a timestamp. Example:

output/
 └── TedTalks_2025-08-11_15-42-10/
      ├── plots/
      ├── poses/
      ├── renderings/
      ├── inference_times.json
      └── config.yml

4 MaskBench Architecture

Figure 2: MaskBench Architecture

The general workflow of MaskBench is shown in Figure 2. It begins with loading the dataset, pose estimators, and evaluation metrics. The application then creates a checkpoint folder in the specified output directory, named according to the dataset and a timestamp (e.g., /output/TedTalks-20250724-121127). Subsequently, inference is performed on all videos in the dataset using the pose estimators specified in the configuration file. For the MaskAnyoneUI pose estimators, the user is required to perform semi-automatic annotation of the videos using MaskAnyone before starting the MaskBench run. A poses folder is created within the checkpoint, containing a subfolder for each pose estimator and a single JSON file for each video. The application then evaluates all specified metrics and generates plots, which are stored in the plots folder within the checkpoint. Finally, for each video, the application produces a set of rendered videos, one for each pose estimator, which are stored in the renderings folder in the checkpoint.

Each component of MaskBench is implemented in a modular way, so it can be easily extended and modified. We will discuss this in the following sections.

4.1 Dataset

The dataset provides video data for pose estimation and, if available, ground truth data for evaluation. The user can either use the generic dataset class, which requires a videos folder containing all video files and an optional labels folder containing one ground truth JSON pose file per video, or implement a custom class that inherits from the Dataset class and overrides the load_videos method, which generates one VideoSample object for each video in the dataset. The latter is useful if the dataset has a more complex structure with nested subfolders of videos or ground truth data. If the dataset provides ground truth, the user must also override the get_gt_pose_results and get_gt_keypoint_pairs methods. For each video, the get_gt_pose_results method should return a VideoPoseResult object. The get_gt_keypoint_pairs method is used to render the ground truth keypoints and returns a list of tuples, each specifying the indices of two keypoints to be connected in the rendered video. Default keypoint pairs for YoloPose, MediaPipePose, and various OpenPose models are provided in the keypoint_pairs.py file.

Below is the code implementation for the generic dataset class, and the more complex TragicTalkers dataset (with custom pseudo-ground truth data loading).

src/datasets/dataset.py
import os
from abc import ABC
from typing import Dict, List

from .video_sample import VideoSample
from inference import VideoPoseResult
from keypoint_pairs import COCO_KEYPOINT_PAIRS


class Dataset(ABC):
    def __init__(self, name: str, video_folder: str, gt_folder: str = None, config: dict = None):
        self.name = name
        self.config = config
        self.video_folder = video_folder
        self.gt_folder = gt_folder  # Optional - None if dataset has no ground truth
        self.video_samples = self.load_videos()

    def load_videos(self) -> List[VideoSample]:
        """
        Default implementation to load video samples from the videos folder.
        Expects videos to be directly in the videos folder.
        """
        video_extensions = (".avi", ".mp4")
        samples = []

        if not os.path.exists(self.video_folder):
            raise ValueError(f"Videos folder not found at {self.video_folder}")

        for filename in os.listdir(self.video_folder):
            video_path = os.path.join(self.video_folder, filename)
            if filename.endswith(video_extensions):
                samples.append(VideoSample(video_path))

        return samples

    def get_gt_pose_results(self) -> Dict[str, VideoPoseResult]:
        """
        Default implementation to load ground truth pose results from the gt_folder.
        Expects one JSON file per video with the same name as the video file.
        The format of the ground truth files should be consistent with `VideoPoseResult` structure, otherwise overwrite
        this method in a subclass and implement your own logic to load the ground truth data.
        The returned dictionary should map video names to ground truth `VideoPoseResult` objects.
        Returns empty dict if no gt_folder is specified or doesn't exist.
        """
        if self.gt_folder is None or not os.path.exists(self.gt_folder):
            return {}

        gt_pose_results = {}
        for sample in self.video_samples:
            video_name = os.path.splitext(os.path.basename(sample.video_path))[0]
            json_path = os.path.join(self.gt_folder, f"{video_name}.json")

            if not os.path.exists(json_path):
                raise ValueError(f"Ground truth JSON file missing for video `{video_name}`.")
            
            gt_pose_results[video_name] = VideoPoseResult.from_json(json_path, video_name)

        return gt_pose_results

    def get_gt_keypoint_pairs(self) -> None | List[tuple]:
        """
        Default implementation to return COCO keypoint pairs if gt_folder is specified and exists,
        otherwise returns None.
        """
        if self.gt_folder is not None and os.path.exists(self.gt_folder):
            return COCO_KEYPOINT_PAIRS
        return None

    def __iter__(self):
        return iter(self.video_samples)

    def __len__(self):
        return len(self.video_samples)
src/datasets/tragic_talkers_dataset.py
import os
import json
import glob
from typing import Dict, List

from inference import FramePoseResult, PersonPoseResult, PoseKeypoint, VideoPoseResult
from keypoint_pairs import COCO_KEYPOINT_PAIRS, COCO_TO_OPENPOSE_BODY25, OPENPOSE_BODY25_KEYPOINT_PAIRS
from utils import convert_keypoints_to_coco_format
from .dataset import Dataset
from .video_sample import VideoSample

class TragicTalkersDataset(Dataset):
    def __init__(self, name: str, video_folder: str, gt_folder: str = None, config: dict = None):
        super().__init__(name, video_folder, gt_folder, config)
        self.convert_gt_keypoints_to_coco = config.get("convert_gt_keypoints_to_coco", False) if config else False
    
    def load_videos(self) -> List[VideoSample]:
        samples = []
        video_extensions = (".avi", ".mp4")
        list_of_videos = glob.glob(os.path.join(self.video_folder, "*", "*"))

        for video in list_of_videos:
            if video.endswith(video_extensions):
                samples.append(VideoSample(video))
      
        return samples

    def get_gt_keypoint_pairs(self) -> List[tuple]:
        if self.convert_gt_keypoints_to_coco:
            return COCO_KEYPOINT_PAIRS
        else:
            # Tragic Talkers uses the BODY_25 model
            return OPENPOSE_BODY25_KEYPOINT_PAIRS

    def get_gt_pose_results(self) -> Dict[str, VideoPoseResult]:
        gt_pose_results = {}
        video_json_folders = glob.glob(os.path.join(self.gt_folder, "*", "*")) # for every video & camera angle
        for video_json_folder in video_json_folders:
            video_name = self._extract_video_name_from_labels_folder(video_json_folder)
            gt_pose_result = self.combine_json_files_for_video(video_json_folder, video_name)
            if self.convert_gt_keypoints_to_coco:
                gt_pose_result.frames = convert_keypoints_to_coco_format(gt_pose_result.frames, COCO_TO_OPENPOSE_BODY25)
            gt_pose_results[video_name] = gt_pose_result
        return gt_pose_results

    def combine_json_files_for_video(self, video_json_folder: str, video_name: str) -> VideoPoseResult:
        all_json_files = glob.glob(os.path.join(video_json_folder, "*"))
        all_json_files = sorted(all_json_files, key=lambda x: int(os.path.basename(x).split('-')[1].split('_')[0])) # we need to sort them by frame number

        all_frames_keypoints = []
        for frame_idx, file in enumerate(all_json_files): # for every frame
            frame_keypoints = [] 
            with open(file, 'r') as f:
                data = json.load(f)
                people = data.get('people', [])
                if people:
                    for person in people: # for every person in the frame
                        person_keypoints = []
                        pose_keypoints = person.get('pose_keypoints_2d', [])
                        if pose_keypoints: # get keypoints for that person
                            person_keypoints = [
                                PoseKeypoint(x=pose_keypoints[i], y=pose_keypoints[i+1], confidence=pose_keypoints[i+2])
                                for i in range(0, len(pose_keypoints), 3)]
                        frame_keypoints.append(PersonPoseResult(keypoints=person_keypoints))
            all_frames_keypoints.append(FramePoseResult(persons=frame_keypoints, frame_idx=frame_idx))

        return VideoPoseResult(
            fps=30,
            frame_width=2448,
            frame_height=2048,
            video_name=video_name,
            frames=all_frames_keypoints
        )

    def _extract_video_name_from_labels_folder(self, path: str) -> str:
        """
        Extract video name from a labels folder.
        For example, a path like /datasets/tragic_talkers/labels/conversation1_t3/cam-022 will be converted to conversation1_t3-cam22.
        """
        parts = path.split(os.sep)
        conversation = parts[-2]  # e.g. conversation1_t3
        camera = parts[-1]  # e.g. cam-022
        
        # Extract camera number and format it
        cam_number = camera.split('-')[1]  # e.g. 022
        cam_number = cam_number[1:] # cam number in labels has 3 digits, we need to remove the leading one
        
        return f"{conversation}-cam{cam_number}"

4.2 Inference

4.2.1 Video Pose Result

The VideoPoseResult object represents the standardized output of a pose prediction model. It is a nested structure containing a FramePoseResult object for each frame in the video. Each frame pose result includes a list of PersonPoseResult objects, one for each person detected in the frame. Every person’s result contains a list of PoseKeypoint objects, one for each keypoint in the model’s output format, providing x and y coordinates along with an optional confidence score.

src/inference/pose_result.py
from dataclasses import asdict, dataclass
import json
from typing import List, Optional
import numpy as np
import numpy.ma as ma

np.set_printoptions(threshold=np.inf)


@dataclass
class PoseKeypoint:
    x: float
    y: float
    confidence: Optional[float] = None


@dataclass
class PersonPoseResult:
    keypoints: List[PoseKeypoint]  # Fixed length per pose estimator (e.g., 17 for COCO)
    id: Optional[int] = None  # for tracking across frames


@dataclass
class FramePoseResult:
    persons: List[PersonPoseResult]
    frame_idx: int


class VideoPoseResult:
    """
    This class is the main output of the pose estimation models.
    It contains the pose estimation results for a video.
    It is a nested object that contains a `FramePoseResult` object for each frame in the video.
    Within each frame pose result, there is a list of `PersonPoseResult` objects, one for each person in the frame.
    Every `PersonPoseResult` contains a list of `PoseKeypoint` objects, one for each keypoint in the model output format, with the x, y coordinates and a confidence score.
    """
    def __init__(
        self,
        fps: int,
        frame_width: int,
        frame_height: int,
        frames: List[FramePoseResult],
        video_name: str,
    ):
        self.fps = fps
        self.frame_width = frame_width
        self.frame_height = frame_height
        self.frames = frames
        self.video_name = video_name

    def __info__(self, num_of_sample_frames: int = 3) -> dict:
        return {
            "video_name": self.video_name,
            "fps": self.fps,
            "frame_width": self.frame_width,
            "frame_height": self.frame_height,
            "num_frames": len(self.frames),
            "sample_frames": self.frames[:num_of_sample_frames] if len(self.frames) > num_of_sample_frames else self.frames,
        }
    
    def to_numpy_ma(self) -> np.ndarray:
        """
        Convert the video pose results from a nested object to a masked array.
        This method is useful for evaluation and plotting in order to work
        with arrays rather than nested objects.
        
        Returns:
            Masked array with shape (num_frames, max_persons, num_keypoints, 2)
            where 2 represents x and y coordinates. Max_persons is the maximum number
            of detected persons in the entire video. Values are masked for frames with 
            fewer persons than max_persons, which means that these values are not included
            in computations (e.g. evaluation or plotting).
        """
        if not self.frames:
            print("Warning: No frames in video pose result.")
            return ma.array(np.zeros((0, 0, 0, 2)))
            
        # Get dimensions
        num_frames = len(self.frames)
        max_persons = max(len(frame.persons) for frame in self.frames)
        num_keypoints = max(
            len(person.keypoints)
            for frame in self.frames
            for person in frame.persons
        ) if any(frame.persons for frame in self.frames) else 0
        
        if max_persons == 0 or num_keypoints == 0:
            print("Warning: No persons or keypoints found in video pose result.")
            return ma.array(np.zeros((num_frames, 0, 0, 2)))
        
        # Initialize arrays - all values masked by default
        values = np.zeros((num_frames, max_persons, num_keypoints, 2))
        mask = np.ones_like(values, dtype=bool)  # True means masked
        
        for frame_idx, frame in enumerate(self.frames):
            # Only fill and unmask values for persons that exist
            for person_idx, person in enumerate(frame.persons):
                for kpt_idx, keypoint in enumerate(person.keypoints):
                    values[frame_idx, person_idx, kpt_idx, 0] = keypoint.x
                    values[frame_idx, person_idx, kpt_idx, 1] = keypoint.y
                    mask[frame_idx, person_idx, kpt_idx] = False  # Unmask only existing values
        
        return ma.array(values, mask=mask)

    def to_json(self) -> dict:
        return {
            "fps": self.fps,
            "frame_width": self.frame_width,
            "frame_height": self.frame_height,
            "frames": [asdict(frame) for frame in self.frames],
            "video_name": self.video_name,
        }

    @classmethod
    def from_json(cls, json_path: str, video_name: str = None) -> 'VideoPoseResult':
        """
        Create a VideoPoseResult instance from a JSON file.
        
        Args:
            json_path (str): Path to the JSON file containing the pose result data
            video_name (str, optional): Video name to use. If not provided, uses the one from data.
            
        Returns:
            VideoPoseResult: A new instance created from the JSON data
        """
        with open(json_path, "r") as f:
            data = json.load(f)
            frames = data.get("frames", [])
            frame_pose_results = []
            
            for frame_index, frame in enumerate(frames):
                persons = frame.get("persons", [])
                person_pose_results = []
                for person in persons:
                    keypoints = person.get("keypoints", [])
                    pose_keypoints = [
                        PoseKeypoint(
                            x=k["x"], 
                            y=k["y"], 
                            confidence=k.get("confidence", None)
                        ) for k in keypoints
                    ]
                    person_pose_results.append(PersonPoseResult(keypoints=pose_keypoints))
                frame_pose_results.append(FramePoseResult(persons=person_pose_results, frame_idx=frame_index))
            
            return cls(
                fps=data.get("fps", None),
                frame_width=data.get("frame_width", None),
                frame_height=data.get("frame_height", None),
                video_name=video_name or data.get("video_name"),
                frames=frame_pose_results
            )

    def __str__(self):
        array = self.to_numpy_ma()
        return f"VideoPoseResult(fps={self.fps}, frame_width={self.frame_width}, frame_height={self.frame_height}, video_name={self.video_name}), frame_values: \n{array}"

4.2.2 Pose Estimator

Pose estimators are responsible for predicting the poses of persons in a video by wrapping calls to specific AI models or pose estimation pipelines. Each model is implemented in a separate class that inherits from the abstract PoseEstimator class. The output of each estimator is a standardized VideoPoseResult object.

To add a new pose estimator, users must implement methods for pose estimation and for retrieving keypoint pairs. Special care must be taken to ensure that the output meets the following constraints:

  1. The number of frames in the pose results matches the number of frames in the video.
  2. If no persons are detected in a frame, the persons list should be empty.
  3. For detected persons with missing keypoints, those keypoints should have values x=0, y=0, confidence=None.
  4. The number of keypoints per person remains constant across all frames.
  5. Keypoints with low confidence should be masked out using the confidence_threshold configuration parameter.
  6. Keypoints must be mapped to the COCO format if the save_keypoints_in_coco_format configuration parameter is set to true.

As an example, we provide the implementation of the abstract pose estimator class and the implementation of the YOLO model.

src/models/pose_estimator.py
from abc import ABC, abstractmethod
import cv2

from inference.pose_result import VideoPoseResult


class PoseEstimator(ABC):
    def __init__(self, name: str, config: dict):
        """
        Initialize the PoseEstimator with a name and configuration.
        
        Args:
            name (str): The name of the estimator (e.g. "YoloPose", "MediaPipe", "OpenPose", ...).
            config (dict): Configuration dictionary for the pose estimator. This can include arbitrary parameters for the model that are necessary for inference (e.g. "confidence_threshold", "weights_file_name", ...). The config parameter "confidence_threshold" is required. This has no effect for MaskAnyonePoseEstimators, because they do not provide confidence scores. If you do not want to filter, set confidence_threshold to 0.
        """
        if not config or "confidence_threshold" not in config:
            raise ValueError(f"Config for {name} must include a 'confidence_threshold' key.")

        self.name = name
        self.config = config
        self.confidence_threshold = config["confidence_threshold"]

    @abstractmethod
    def estimate_pose(self, video_path: str) -> VideoPoseResult:
        """
        Abstract method to estimate the pose of a video using the specific pose estimation model.
        This method should be implemented by subclasses.
        It receives the full path to the input video file and returns a VideoPoseResult object.
        The user is responsible for the following three steps after creating the initial VideoPoseResult object:
        1. Assert that the number of frames in the frame results matches the number of frames in the video (call assert_frame_count_is_correct)
        2. Filter out low confidence keypoints (call filter_low_confidence_keypoints)
        3. If the config contains a "save_keypoints_in_coco_format" key, convert the keypoints to the COCO format (call utils.convert_keypoints_to_coco_format, providing a mapping from the model output format to the COCO format)

        Args:
            video_path (str): The full path to the input video file.
        Returns:
            VideoPoseResult: An object containing the pose estimation results for the video.
                  The VideoPoseResult object must contain as many FramePoseResult objects as there are frames in the video (asserted by assert_frame_count_is_correct).
                  The FramePoseResult contains a list of PersonPoseResult objects, one for each person in the frame. If there are no persons in the frame, the list is empty (persons=[]).
                  Every PersonPoseResult contains a list of PoseKeypoints, one for each keypoint in the model output format. If a keypoint is not detected, the PoseKeypoint object should have x=0 and y=0 and confidence=None.
        """
        pass

    @abstractmethod
    def get_keypoint_pairs(self) -> list:
        """
        Abstract method to get the pairs of keypoints that should be connected when visualizing the pose.
        This method should be implemented by subclasses.
        There are pre-defined list of keypoint pairs for various models available in the project.
        
        Returns:
            list: A list of tuples, where each tuple contains two integers representing the indices 
                 of keypoints that should be connected with a line when visualizing the pose.
                 For example, [(0,1), (1,2)] means that keypoint 0 should be connected to keypoint 1,
                 and keypoint 1 should be connected to keypoint 2.
        """
        pass

    def assert_frame_count_is_correct(self, video_pose_result: VideoPoseResult, video_metadata: dict):
        """
        Assert that the number of frames in the frame results matches the number of frames in the video. Should be called at the end of the estimate_pose method.

        Args:
            video_pose_result (VideoPoseResult): The video pose result whose frames are checked.
            video_metadata (dict): A dictionary containing the video metadata with the key "frame_count".
        Raises:
            Exception: If the number of frames in the frame results does not match the number of frames in the video.
        """
        if len(video_pose_result.frames) != video_metadata.get("frame_count"):
            raise Exception(f"Number of frames in the video ({video_metadata.get('frame_count')}) does not match the number of frames in the frame results ({len(video_pose_result.frames)})")

    def filter_low_confidence_keypoints(self, video_pose_result: VideoPoseResult):
        for frame_result in video_pose_result.frames:
            for person_result in frame_result.persons:
                for keypoint in person_result.keypoints:
                    if keypoint.confidence is not None and keypoint.confidence < self.confidence_threshold:
                        keypoint.x = 0
                        keypoint.y = 0
                        keypoint.confidence = None
        return video_pose_result
src/models/yolo_pose_estimator.py
import os
import torch
import utils
from ultralytics import YOLO

from models import PoseEstimator
from inference import FramePoseResult, PersonPoseResult, PoseKeypoint, VideoPoseResult
from keypoint_pairs import COCO_KEYPOINT_PAIRS

class YoloPoseEstimator(PoseEstimator):
    def __init__(self, name: str, config: dict):
        """
        Initialize the YoloPoseEstimator with a model name and configuration.
        Args:
            name (str): The name of the model (e.g. "YoloPose").
            config (dict): Configuration dictionary for the model. It may contain the key "weights" with the path to the weights file relative to the weights folder; otherwise, 'yolo11n-pose.pt' is used.
        """

        super().__init__(name, config)

        weights_file = self.config.get("weights", "yolo11n-pose.pt")
        print("Using weights file: ", weights_file)
        pre_built_weights_file_path = os.path.join("/weights/pre_built", weights_file)
        user_weights_file_path = os.path.join("/weights/user_weights", weights_file)

        if os.path.exists(user_weights_file_path):
            weights_file_path = user_weights_file_path
        elif os.path.exists(pre_built_weights_file_path):
            weights_file_path = pre_built_weights_file_path
        else:
            raise ValueError(
                f"Could not find weights file {weights_file}. Please download the weights from https://docs.ultralytics.com/tasks/pose/ and place them in the weights folder."
            )

        self.model = YOLO(weights_file_path)
        # Move the model to the GPU if available, otherwise fall back to the CPU
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(device)

    def get_keypoint_pairs(self):
        # Yolo keypoints are stored in Coco format
        return COCO_KEYPOINT_PAIRS

    def estimate_pose(self, video_path: str) -> VideoPoseResult:
        """
        Estimate the pose of a video using YOLO pose estimation.

        Args:
            video_path (str): The path to the input video file.
        Returns:
            VideoPoseResult: A standardized result object containing the pose estimation results for the video.
        """

        cap, video_metadata = utils.get_video_metadata(video_path)
        video_name = os.path.splitext(os.path.basename(video_path))[0]
        cap.release()

        results = self.model.track(
            video_path, conf=self.confidence_threshold, stream=True, verbose=False
        )

        frame_results = []
        for frame_idx, frame_result in enumerate(results):
            if not frame_result.keypoints:  # if no keypoints detected
                frame_results.append(FramePoseResult(persons=[], frame_idx=frame_idx))
                continue

            xys = frame_result.keypoints.xy.cpu().numpy()
            confidences = frame_result.keypoints.conf

            if xys.size == 0: # if no persons detected
                frame_results.append(FramePoseResult(persons=[], frame_idx=frame_idx))
                continue

            persons = []
            num_persons = frame_result.keypoints.shape[0]
            num_keypoints = frame_result.keypoints.shape[1]

            for i in range(num_persons):
                keypoints = []
                for j in range(num_keypoints):
                    conf = float(confidences[i, j]) if (confidences is not None) and (xys[i, j, 0] != 0 and xys[i, j, 1] != 0) else None
                    kp = PoseKeypoint(
                        x=float(xys[i, j, 0]),
                        y=float(xys[i, j, 1]),
                        confidence=conf,
                    )

                    keypoints.append(kp)
                persons.append(PersonPoseResult(keypoints=keypoints))
            frame_results.append(FramePoseResult(persons=persons, frame_idx=frame_idx))

        video_pose_result = VideoPoseResult(
            fps=video_metadata.get("fps"),
            frame_width=video_metadata.get("width"),
            frame_height=video_metadata.get("height"),
            frames=frame_results,
            video_name=video_name,
        )

        self.assert_frame_count_is_correct(video_pose_result, video_metadata)
        video_pose_result = self.filter_low_confidence_keypoints(video_pose_result)
        # we do not convert keypoints to coco format, because yolo already stores keypoints in coco format
        return video_pose_result

MaskBench supports seven pose estimators, including pure AI models such as YOLOv11-Pose, MediaPipePose, and OpenPose. Additionally, it incorporates MaskAnyone as a pose estimator, which combines multiple expert models. We distinguish between two variants of the MaskAnyone estimator: the MaskAnyoneAPI pose estimator, which runs fully automatically during inference, and the MaskAnyoneUI pose estimator, which employs a human-in-the-loop approach allowing manual adjustment of the mask for the persons of interest. The latter requires manual execution by the user prior to running MaskBench, with the resulting pose files provided as one file per video.

4.2.3 Inference Engine

The inference engine is responsible for running pose estimators on videos and saving the results as JSON files in the poses folder. If a checkpoint name is specified in the configuration file, the inference engine will load existing results from the checkpoint and skip inference for videos that already have corresponding outputs. This feature allows the user to resume an already started inference process or to bypass the time-consuming inference entirely and perform only metric evaluation and rendering. The inference engine returns a nested dictionary that maps pose estimator names to video names and their corresponding VideoPoseResult objects. Additionally, it records the inference times for each pose estimator and video, saving this information as a JSON file within the checkpoint folder.
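
Conceptually, the core loop of the inference engine produces a structure like the following. This is a schematic sketch, not the actual engine class; checkpoint loading and inference-time recording are omitted, and only interfaces shown elsewhere in this document (estimate_pose, Dataset iteration, VideoSample.video_path) are used:

from typing import Dict

from inference import VideoPoseResult

# pose_estimator_name -> video_name -> VideoPoseResult
InferenceResults = Dict[str, Dict[str, VideoPoseResult]]

def run_inference_sketch(pose_estimators, dataset) -> InferenceResults:
    results: InferenceResults = {}
    for estimator in pose_estimators:
        results[estimator.name] = {}
        for sample in dataset:  # a Dataset iterates over its VideoSample objects
            video_pose_result = estimator.estimate_pose(sample.video_path)
            results[estimator.name][video_pose_result.video_name] = video_pose_result
    return results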

4.3 Evaluation

4.3.1 Metric

Metrics play a crucial role in quantitatively evaluating the accuracy and quality of pose predictions. Each metric inherits from the abstract Metric class and implements a computation method that takes as input a predicted video pose result, an optional ground truth pose result, and the name of the pose estimator. The compute method of a metric outputs a MetricResult object containing the metric values for the video (see Section 4.3.2).

src/evaluation/metrics/metric.py
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Any

import numpy as np
import numpy.ma as ma
from scipy.optimize import linear_sum_assignment

from inference.pose_result import VideoPoseResult
from evaluation.metrics.metric_result import MetricResult, FRAME_AXIS, PERSON_AXIS, KEYPOINT_AXIS


class Metric(ABC):
    """Base class for all metrics in MaskBench."""
    
    def __init__(self, name: str, config: Optional[Dict[str, Any]] = None):
        """
        Initialize a metric.
        
        Args:
            name: Unique name of the metric
            config: Optional configuration dictionary for the metric
        """
        self.name = name
        self.config = config or {}
    
    @abstractmethod
    def compute(
        self,
        video_result: VideoPoseResult,
        gt_video_result: Optional[VideoPoseResult] = None,
        model_name: Optional[str] = None
    ) -> MetricResult:
        """
        Compute the metric for a video.
        
        Args:
            video_result: Pose estimation results for the video
            gt_video_result: Optional ground truth pose results
            model_name: Name of the model being evaluated
            
        Returns:
            MetricResult containing the metric values for the video
        """
        pass

    def _match_person_indices(self, poses_to_match: ma.MaskedArray, reference: ma.MaskedArray) -> ma.MaskedArray:
        """
        Match the predictions to the reference (e.g. ground truth or previous frame) for a single frame.
        This is useful for metrics that are order-dependent, such as PCK, acceleration or RMSE.
        It uses the Hungarian algorithm to find the best match between the predictions and the reference.
        If there are no predictions, it returns an array of infinities of reference shape. Infinities are used instead of nans
        to have an infinitely large mean center of a pose, which will not be matched to any reference (for example in a previous frame for kinematic metrics).

        Args:
            poses_to_match: Predicted poses array of shape (M, K, 2) where M is number of persons
            reference: Reference poses array of shape (N, K, 2) where N is number of persons
            
        Returns:
            Sorted predictions array of shape (max(M,N), K, 2) where:
            - First N positions contain predictions matched to reference (or infinity if no match)
            - Remaining M-N positions (if M>N) contain unmatched predictions
        """
        M, K, _ = poses_to_match.shape
        N, _, _ = reference.shape
        
        # If no predictions, return array of infinities of reference shape
        if M == 0:
            return np.full_like(reference, np.inf)
            
        # If no reference, return predictions as is
        if N == 0:
            return poses_to_match
            

        # Calculate mean positions
        mean_poses_to_match = np.nanmean(poses_to_match, axis=1)
        mean_ref_poses = np.nanmean(reference, axis=1)

        valid_M = M
        valid_N = N
        
        # Limit the cost matrix only to the valid persons (i.e. where the person is not completely masked)
        # Therefore, we need to create a mapping of the original person indices to the valid person indices
        poses_to_match_index_mapping = []
        reference_index_mapping = []
        is_poses_to_match_masked_array = isinstance(poses_to_match, ma.MaskedArray)
        is_reference_masked_array = isinstance(reference, ma.MaskedArray)
        for i in range(M):
            if is_poses_to_match_masked_array and poses_to_match.mask.all(axis=(1,2))[i]: # Reduce the number of valid poses to match (if the person is completely masked)
                valid_M -= 1
            else:
                poses_to_match_index_mapping.append(i)
        for i in range(N):
            if is_reference_masked_array and reference.mask.all(axis=(1,2))[i]: # Reduce the number of valid references (if the person is completely masked)
                valid_N -= 1
            else:
                reference_index_mapping.append(i)

        # Calculate cost matrix based on Euclidean distance between each prediction (valid_M) in the rows and references (valid_N) in the columns
        cost_matrix = np.zeros((valid_M, valid_N))
        for i in range(valid_M):
            for j in range(valid_N):
                pos_to_match_idx = poses_to_match_index_mapping[i]
                ref_idx = reference_index_mapping[j]
                cost_matrix[i, j] = np.linalg.norm(mean_poses_to_match[pos_to_match_idx] - mean_ref_poses[ref_idx])
        # Remove rows where all entries are nan, which might happen if the shape N or M is 
        # greater than the maximum number of persons in the reference or predictions.
        valid_rows = ~np.all(np.isnan(cost_matrix), axis=1)
        cost_matrix = cost_matrix[valid_rows]
                
        # Apply Hungarian algorithm
        row_ind, col_ind = linear_sum_assignment(cost_matrix)
        mapped_row_ind = [poses_to_match_index_mapping[i] for i in row_ind]
        mapped_col_ind = [reference_index_mapping[i] for i in col_ind]
        
        # Create output array that can hold all predictions
        max_persons = max(M, N)
        sorted_poses_to_match = np.full((max_persons, K, 2), np.inf)
        
        # First, fill the matched predictions in reference order and their dedicated mapped index
        used_pred_indices = set()
        for pred_idx, gt_idx in zip(mapped_row_ind, mapped_col_ind):
            if pred_idx < M and gt_idx < N:  # Only use valid matches
                sorted_poses_to_match[gt_idx] = poses_to_match[pred_idx]
                used_pred_indices.add(pred_idx)
        
        # Then append any unused predictions at the end (i.e. additional persons)
        if valid_M > valid_N:
            extra_idx = len(used_pred_indices)  # Start after used predictions
            for pred_idx in range(0, M):
                if pred_idx not in used_pred_indices:
                    sorted_poses_to_match[extra_idx] = poses_to_match[pred_idx]
                    extra_idx += 1
                
        # Create masked array where persons that are all 0 or inf are masked
        masked_poses = ma.array(sorted_poses_to_match)
        person_mask = (
            (masked_poses == 0).all(axis=(1,2)) | 
            (np.isinf(masked_poses)).all(axis=(1,2))
        )
        masked_poses[person_mask] = ma.masked
        return masked_poses


class DummyMetric(Metric):
    """
    A simple metric that just returns the raw pose data.
    Useful for testing and as an example of how to implement a metric.
    """
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(name="DummyMetric", config=config)
    
    def compute(
        self,
        video_result: VideoPoseResult,
        gt_video_result: Optional[VideoPoseResult] = None,
        model_name: Optional[str] = None
    ) -> MetricResult:
        """
        Simply convert the pose data to a MetricResult without any computation.
        """
        # Convert pose data to masked array and take only x coordinates
        values = video_result.to_numpy_ma()[:, :, :, 0]

        return MetricResult(
            values=values,
            axis_names=[FRAME_AXIS, PERSON_AXIS, KEYPOINT_AXIS],
            metric_name=self.name,
            video_name=video_result.video_name,
            model_name=model_name,
        ) 

MaskBench currently implements ground truth-based metrics for Euclidean Distance, Percentage of Correct Keypoints (PCK), and Root Mean Square Error (RMSE). Furthermore, we provide kinematic metrics for velocity, acceleration, and jerk. Section 6 contains a more extensive description of the implemented metrics.
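
As a rough illustration of the PCK idea (a toy sketch, not the repository's PCKMetric implementation): a predicted keypoint counts as correct when its distance to the ground truth keypoint is below the threshold times a normalization length. Interpreting normalize_by: bbox as the diagonal of the ground-truth bounding box is an assumption here:

import numpy as np

def pck_sketch(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.2) -> float:
    """Toy PCK for a single person: pred and gt have shape (num_keypoints, 2)."""
    # Normalization length: diagonal of the ground-truth bounding box
    # (an assumed interpretation of `normalize_by: bbox`).
    bbox_diagonal = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))
    distances = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(distances < threshold * bbox_diagonal))

gt = np.array([[10.0, 10.0], [50.0, 10.0], [30.0, 60.0]])
pred = gt + np.array([[2.0, 1.0], [40.0, 0.0], [1.0, -2.0]])
print(pck_sketch(pred, gt, threshold=0.2))  # 2 of 3 keypoints lie within 0.2 * bbox diagonal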

Matching Person Indices

For some metrics, it is essential to ensure that the order of persons in the predicted video pose results matches that of the reference. The Metric class provides a method called _match_person_indices to align person indices between ground truth and predicted results. This method is used not only in ground-truth-based metrics but also in kinematic metrics, which require consistent person indices across consecutive frames to compute velocity, acceleration, and other temporal measures. The implementation employs the Hungarian algorithm, using the mean position of a person’s keypoints to find the optimal matching between all persons in the reference and predicted pose results.

Let \(N\) denote the number of persons in the reference, \(M\) the number in the prediction, and \(K\) the number of keypoints per person. The output of the _match_person_indices method is an array with shape \(\text{max}(N, M) \times K \times 2\). The first \(N\) entries correspond to persons ordered as in the reference, while the remaining \(M - N\) entries (if \(M > N\)) represent additional persons present only in the prediction.

Edge cases include situations where a person appears in one frame but not in the next. In such cases, the unmatched person is assigned an index with infinite values to indicate absence, while the other persons retain consistent indices. This also applies when the prediction contains fewer persons than the reference (\(M < N\)). Each metric can then handle these infinite values appropriately, for example, by converting them to NaN in kinematic metrics or assigning predefined values in Euclidean distance and ground truth–based metrics.
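
The sketch below matches two predicted persons against a single reference person using the concrete DummyMetric subclass shown above. Plain NumPy arrays are used here for brevity; in MaskBench the inputs come from VideoPoseResult.to_numpy_ma(), and the import assumes the repository's src/ directory is on the Python path:

import numpy as np

from evaluation.metrics.metric import DummyMetric

metric = DummyMetric()

# One reference person (N=1) with K=2 keypoints.
reference = np.array([[[10.0, 10.0], [20.0, 20.0]]])            # shape (1, 2, 2)
# Two predicted persons (M=2); only the second one is close to the reference.
predictions = np.array([[[200.0, 200.0], [210.0, 210.0]],
                        [[11.0, 11.0], [19.0, 21.0]]])          # shape (2, 2, 2)

matched = metric._match_person_indices(predictions, reference)
print(matched.shape)  # (2, 2, 2): max(M, N) persons, ordered like the reference
print(matched[0])     # the nearby prediction, matched to reference index 0
print(matched[1])     # the unmatched extra prediction, appended at the end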

Unit Testing

Implementing unit tests for metric classes is essential to ensure that their outputs are accurate and consistent. We provide unit tests for all metrics in the src/tests folder, which can be executed using the pytest command. Running these tests after any modifications to the metric classes helps guarantee that existing functionality remains intact.
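
For example, a minimal pytest-style check of the to_numpy_ma conversion that the metrics rely on might look like this (a sketch only; the tests in src/tests are more extensive, and the imports assume the repository's src/ directory is on the Python path):

import numpy.ma as ma

from inference import FramePoseResult, PersonPoseResult, PoseKeypoint, VideoPoseResult


def test_to_numpy_ma_shape_and_values():
    # One frame containing a single person with two keypoints.
    person = PersonPoseResult(keypoints=[
        PoseKeypoint(x=1.0, y=2.0, confidence=0.9),
        PoseKeypoint(x=3.0, y=4.0, confidence=0.8),
    ])
    video_result = VideoPoseResult(
        fps=30, frame_width=100, frame_height=100,
        frames=[FramePoseResult(persons=[person], frame_idx=0)],
        video_name="test_video",
    )

    array = video_result.to_numpy_ma()
    assert array.shape == (1, 1, 2, 2)                # (frames, persons, keypoints, x/y)
    assert ma.allequal(array[0, 0, 0], [1.0, 2.0])    # first keypoint coordinates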

4.3.2 Metric Result

The output of a metric’s compute method is a MetricResult object. This object contains metric values stored in a multi-dimensional array, where each axis is labeled with descriptive names such as “frame,” “person,” and “keypoint”. The class provides an aggregate function that reduces these values using a specified method along selected axes only. Currently, MaskBench supports aggregation methods including mean, median, Root Mean Square Error (RMSE), vector magnitude, sum, minimum, and maximum. The result of the aggregation is another MetricResult object with reduced dimensionality, retaining only the axes that were not aggregated.

This flexible approach of storing the results together with their axis names and using those names in the aggregation method allows the results to be visualized in a variety of ways, for example, as a per-keypoint plot, a distribution plot, or as a single scalar value. Furthermore, it allows extending the framework with new metrics (possibly containing different axis names) and with different visualizations.
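
A sketch of how such an aggregation chain might be used is shown below. The MetricResult constructor arguments and the "vector_magnitude" method appear in the code shown in this document; the "mean" method string and the exact aggregation semantics are assumptions:

import numpy as np
import numpy.ma as ma

from evaluation.metrics.metric_result import (
    MetricResult, FRAME_AXIS, PERSON_AXIS, KEYPOINT_AXIS, COORDINATE_AXIS,
)

# A toy kinematic result: 5 frames, 1 person, 17 keypoints, x/y components.
velocity = MetricResult(
    values=ma.array(np.random.rand(5, 1, 17, 2)),
    axis_names=[FRAME_AXIS, PERSON_AXIS, KEYPOINT_AXIS, COORDINATE_AXIS],
    metric_name="Velocity",
    video_name="demo_video",
    model_name="DemoModel",
)

# Collapse the x/y components into per-keypoint magnitudes ("vector_magnitude"
# is also used by MaskBenchVisualizer, see Section 4.4.1), then average over
# frames and persons to obtain one value per keypoint.
magnitude = velocity.aggregate([COORDINATE_AXIS], method="vector_magnitude")
per_keypoint = magnitude.aggregate([FRAME_AXIS, PERSON_AXIS], method="mean")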

4.3.3 Evaluator

Given a list of metrics, the evaluator executes each configured metric on the pose estimation results for all pose estimators and videos. It returns a nested dictionary that maps metric names to pose estimator names, then to video names, and finally to their corresponding MetricResult objects. It does not perform aggregation over the videos or pose estimators in order to allow for more flexibility in the visualization of the results.

4.4 Visualization

After evaluation, the results are visualized in plots and tables.

4.4.1 Visualizer

An abstract BaseVisualizer class defines the interface for all visualization components. We implemented a MaskBench-specific visualizer class tailored to our experiments, which can be reused for other studies or extended to accommodate new types of visualizations.

src/visualization/maskbench_visualizer.py
import os
from typing import Dict

from matplotlib import pyplot as plt

from evaluation.metrics import MetricResult
from evaluation.plots import KinematicDistributionPlot, CocoKeypointPlot, generate_result_table, InferenceTimePlot
from checkpointer import Checkpointer
from evaluation.metrics.metric_result import COORDINATE_AXIS
from .base_visualizer import Visualizer


class MaskBenchVisualizer(Visualizer):
    """
    This class contains specific plots and tables for the MaskBench project evaluation. 
    """
        
    def generate_all_plots(self, pose_results: Dict[str, Dict[str, Dict[str, MetricResult]]]):
        os.makedirs(self.plots_dir, exist_ok=True)

        if "Velocity" in pose_results.keys():
            velocity_distribution_plot = KinematicDistributionPlot(metric_name="Velocity")
            fig, filename = velocity_distribution_plot.draw(pose_results, add_title=False)
            self._save_plot(fig, filename)

        if "Acceleration" in pose_results.keys():
            acceleration_distribution_plot = KinematicDistributionPlot(metric_name="Acceleration")
            fig, filename = acceleration_distribution_plot.draw(pose_results, add_title=False)
            self._save_plot(fig, filename)

            coco_keypoint_plot = CocoKeypointPlot(metric_name="Acceleration")
            fig, filename = coco_keypoint_plot.draw(pose_results, add_title=False)
            self._save_plot(fig, filename)

        if "Jerk" in pose_results.keys():
            jerk_distribution_plot = KinematicDistributionPlot(metric_name="Jerk")
            fig, filename = jerk_distribution_plot.draw(pose_results, add_title=False)
            self._save_plot(fig, filename)

        inference_times = self.checkpointer.load_inference_times()
        if inference_times:
            inference_times = self.set_maskanyone_ui_inference_times(inference_times)
            inference_times = self.sort_inference_times_pose_estimator_order(inference_times, pose_results)
            inference_time_plot = InferenceTimePlot()
            fig, filename = inference_time_plot.draw(inference_times)
            self._save_plot(fig, filename)

        pose_results = self.calculate_kinematic_magnitudes(pose_results)
        table_df = generate_result_table(pose_results)
        self._save_table(table_df, "result_table.csv")

        
    def set_maskanyone_ui_inference_times(self, inference_times: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
        """
        Set the inference times for MaskAnyoneUI to be equal to the corresponding MaskAnyoneAPI models.
        """
        # Create a copy to avoid modifying the original
        mapped_times = inference_times.copy()
        
        # Define the mapping pairs
        ui_to_api_mapping = {
            'MaskAnyoneUI-MediaPipe': 'MaskAnyoneAPI-MediaPipe',
            'MaskAnyoneUI-OpenPose': 'MaskAnyoneAPI-OpenPose'
        }
        
        # For each UI model, set its times to the corresponding API model
        for ui_model, api_model in ui_to_api_mapping.items():
            if ui_model in inference_times and api_model in inference_times:
                mapped_times[ui_model] = mapped_times[api_model].copy()
                    
        return mapped_times

    def calculate_kinematic_magnitudes(self, pose_results: Dict[str, Dict[str, Dict[str, MetricResult]]]) -> Dict[str, Dict[str, Dict[str, MetricResult]]]:
        """
        Calculate the magnitude of the kinematic metrics.
        """
        for metric_name in ["Velocity", "Acceleration", "Jerk"]:
            if metric_name in pose_results.keys():
                for model_name, video_results in pose_results[metric_name].items():
                    for video_name, metric_result in video_results.items():
                        magnitude_values = metric_result.aggregate([COORDINATE_AXIS], method='vector_magnitude')
                        pose_results[metric_name][model_name][video_name] = magnitude_values
        return pose_results

    def sort_inference_times_pose_estimator_order(self, inference_times: Dict[str, Dict[str, float]], pose_results: Dict[str, Dict[str, Dict[str, MetricResult]]]) -> Dict[str, Dict[str, float]]:
        """
        Sort the inference times according to the order in pose_results.
        
        Args:
            inference_times: Dictionary containing inference times for each pose estimator
            pose_results: Dictionary containing pose estimation results, used to determine the order
            
        Returns:
            Dictionary containing sorted inference times
        """
        # Get the list of pose estimators from any metric in pose_results
        first_metric = next(iter(pose_results))
        pose_estimator_order = list(pose_results[first_metric].keys())
        
        sorted_inference_times = {}
        for pose_estimator in pose_estimator_order:
            if pose_estimator in inference_times:
                sorted_inference_times[pose_estimator] = inference_times[pose_estimator]
        return sorted_inference_times

        

The visualizer saves the generated plots and tables in the plots folder of the checkpoint directory.

4.4.2 Plots

Each plot inherits from the abstract Plot class and implements the draw method. This method accepts various forms of input data, most commonly the results produced by the evaluator. Each plot can define a specific approach to aggregating and organizing the data, such as computing the median over all videos for a given pose estimator.

src/evaluation/plots/plot.py
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple

import matplotlib.pyplot as plt
import seaborn as sns
from evaluation.metrics.metric_result import MetricResult
from utils import get_color_palette


class Plot(ABC):
    """Base class for all plots in MaskBench."""
    
    def __init__(self, name: str, config: Optional[Dict[str, Any]] = None):
        """
        Initialize a plot.
        
        Args:
            name: Unique name of the plot
            config: Optional configuration dictionary for the plot
                   Common config options:
                   - style: str for seaborn style (default: white)
                   - figsize: tuple for figure size (default: (10, 5))
                   - dpi: int for figure resolution (default: 300)
                   - title: str for plot title
                   - xlabel: str for x-axis label
                   - ylabel: str for y-axis label
                   - legend: bool for showing legend
        """
        self.name = name
        
        self.config = config or {}
        if 'figsize' not in self.config:
            self.config['figsize'] = (10, 5)
        if 'dpi' not in self.config:
            self.config['dpi'] = 300
        if 'style' not in self.config:
            self.config['style'] = 'white'
        self.config['palette'] = get_color_palette()
        
        sns.set_style(self.config['style'])
        sns.set_palette(self.config['palette'])
        sns.set_context("paper")
        
    @abstractmethod
    def draw(
        self,
        results: Dict[str, Dict[str, Dict[str, MetricResult]]],
        add_title: bool = True,
    ) -> Tuple[plt.Figure, str]:
        """
        Draw the plot using the provided results.
        
        Args:
            results: Dictionary mapping:
                    metric_name -> model_name -> video_name -> MetricResult
            add_title: Whether to add the title to the plot (default: True)
            
        Returns:
            Tuple containing:
                - plt.Figure: The generated matplotlib figure
                - str: The suggested filename for saving the plot
        """
        pass
    
    def _setup_figure(self, add_title: bool = True) -> plt.Figure:
        """
        Set up the figure with standard configuration.
        
        Args:
            add_title: Whether to add the title to the plot (default: True)
        """
        fig = plt.figure(figsize=self.config['figsize'], dpi=self.config['dpi'])
        plt.tight_layout()

        # Remove plot edges
        plt.gca().spines['top'].set_visible(False)
        plt.gca().spines['right'].set_visible(False)
        plt.gca().spines['left'].set_visible(False)
        plt.gca().spines['bottom'].set_visible(False)

        
        if add_title and 'title' in self.config:
            plt.title(self.config['title'])
        if 'xlabel' in self.config:
            plt.xlabel(self.config['xlabel'], labelpad=10)
        if 'ylabel' in self.config:
            plt.ylabel(self.config['ylabel'], labelpad=10)
            
        return fig

    def _group_by_video(self, results: Dict[str, Dict[str, MetricResult]]) -> Dict[str, Dict[str, MetricResult]]:
        """Group the results by video."""
        video_to_models = {}
        for model_name, video_results in results.items():
            for video_name, metric_result in video_results.items():
                if video_name not in video_to_models:
                    video_to_models[video_name] = {}
                video_to_models[video_name][model_name] = metric_result
        return video_to_models

src/evaluation/plots/kinematic_distribution_plot.py
from typing import Dict, List, Tuple
from itertools import cycle

import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt

from evaluation.metrics.metric_result import COORDINATE_AXIS, MetricResult
from .plot import Plot


class KinematicDistributionPlot(Plot):
    """Plot class for visualizing kinematic distributions (velocity, acceleration, jerk) for different models."""
    
    def __init__(self, metric_name: str, kinematic_limit: float = None):
        """
        Initialize the kinematic distribution plot.
        
        Args:
            metric_name: Name of the kinematic metric ('Velocity', 'Acceleration', or 'Jerk')
            kinematic_limit: Optional limit for the kinematic values. All values greater than this value will land in one bucket.
        """
        if metric_name not in ['Velocity', 'Acceleration', 'Jerk']:
            raise ValueError("Metric name must be one of ['Velocity', 'Acceleration', 'Jerk']")
            
        super().__init__(
            name=f"{metric_name}Distribution",
            config={
                'title': f'{metric_name} Keypoint Distribution',
                'xlabel': f'{metric_name}',
                'ylabel': 'Percentage',
            }
        )
        
        self.metric_name = metric_name
        self.unit = None
        self.n_bins = 10
        self.kinematic_limit = kinematic_limit
        
        # Define a variety of marker shapes for different models
        # o: circle, s: square, ^: triangle up, v: triangle down, 
        # D: diamond, p: pentagon, h: hexagon, 8: octagon,
        # *: star, P: plus filled
        self.markers = ['^', '*', 'h', 's', 'D', 'o', 'p', 'v', '8', 'P']

    def _flatten_clip_validate(self, values: np.ndarray) -> np.ndarray:
        """
        Process input values by:
        1. Removing masked and NaN values
        2. Flattening the array
        3. Clipping values to the kinematic limit
        
        Args:
            values: Input numpy masked array with potential NaN values
            
        Returns:
            Flattened array of valid values clipped to kinematic limit
        """
        # Handle masked values if it's a masked array
        if isinstance(values, ma.MaskedArray):
            valid_values = values[~values.mask].data
        else:
            valid_values = values
            
        valid_values = valid_values[~np.isnan(valid_values)] # Remove NaN values
        flattened_values = valid_values.flatten()
        clipped_values = np.clip(flattened_values, -self.kinematic_limit, self.kinematic_limit)
        return clipped_values

    def _compute_distribution(self, values: np.ndarray, bin_edges: np.ndarray) -> np.ndarray:
        hist, _ = np.histogram(values, bins=bin_edges)
        return (hist / len(values)) * 100
    
    def _create_bin_edges_and_labels(self) -> Tuple[np.ndarray, List[str]]:
        """
        Create bin edges and corresponding labels for the kinematic distribution.
        
        Returns:
            Tuple containing:
                - np.ndarray: Bin edges for histogram computation
                - List[str]: Human-readable labels for the bins
        """
        diff = self.kinematic_limit / self.n_bins
        # if n_bins is 10, we want 11 bins; the extra last bin collects all values greater than the kinematic limit
        # np.linspace includes its endpoint, so n_bins + 2 edges yield n_bins + 1 bins
        bin_edges = np.linspace(0, self.kinematic_limit + diff, self.n_bins + 2).astype(int)
        
        bin_labels = []
        for i in range(len(bin_edges) - 1):
            if i == len(bin_edges) - 2:
                bin_labels.append(f'> {bin_edges[i]}')
            else:
                bin_labels.append(f'[{bin_edges[i]},{bin_edges[i+1]}]')
                
        return bin_edges, bin_labels

    def _round_to_nearest_magnitude(self, value: float) -> float:
        """Round a value to the nearest magnitude based on its range.
        
        For values:
        - Between 0-100: Round to nearest 10
        - Between 100-1000: Round to nearest 100 
        - Between 1000-10000: Round to nearest 1000
        And so on up to 1,000,000
        """
        if value <= 0:
            return 0
            
        magnitude = 10 ** (len(str(int(value))) - 1)
        if magnitude < 10:
            magnitude = 10
        return np.ceil(value / magnitude) * magnitude
    
    def draw(
        self,
        results: Dict[str, Dict[str, Dict[str, MetricResult]]],
        add_title: bool = True,
    ) -> Tuple[plt.Figure, str]:

        # First pass: compute the magnitude of the kinematic values and take median over videos
        pose_estimator_results = results[self.metric_name]
        pose_estimator_medians = {} # store the median for each pose estimator over all videos
        pose_estimator_magnitude_results = {} # store the magnitude results for each pose estimator

        for pose_estimator_name, video_results in pose_estimator_results.items():
            self.unit = next(iter(video_results.values())).unit if self.unit is None else self.unit

            video_magnitudes = []
            pose_estimator_magnitude_results[pose_estimator_name] = {}
            for video_name, metric_result in video_results.items():
                magnitude_result = metric_result.aggregate([COORDINATE_AXIS], method='vector_magnitude')

                video_magnitudes.append(magnitude_result.aggregate_all(method='median'))
                pose_estimator_magnitude_results[pose_estimator_name][video_name] = magnitude_result

            pose_estimator_medians[pose_estimator_name] = np.median(video_magnitudes)

        if self.kinematic_limit is None:
            # Calculate the maximum median magnitude over all pose estimators to set the bounds of the plot
            # This maximum value is increased by 20%
            raw_limit = max(pose_estimator_medians.values()) * 1.20
            self.kinematic_limit = self._round_to_nearest_magnitude(raw_limit)
        
        if self.unit:
            self.config['xlabel'] = f'{self.metric_name} ({self.unit})'
        fig = self._setup_figure(add_title=add_title)

        # Store lines for updating legend later
        lines = []
        labels = []
        marker_cycle = cycle(self.markers)
        bin_edges, bin_labels = self._create_bin_edges_and_labels()
        
        # Create x positions that span the full width
        x_positions = np.linspace(0, 1, len(bin_labels))
        
        # Second pass: flatten and clip the values
        for model_name, video_results in pose_estimator_results.items():
            model_values = []
            for metric_result in video_results.values():
                values = metric_result.values
                flattened_valid_clipped_vals = self._flatten_clip_validate(values)
                model_values.extend(np.abs(flattened_valid_clipped_vals.flatten()))
                
            distribution = self._compute_distribution(model_values, bin_edges)
            
            marker = next(marker_cycle)
            plt.plot(x_positions, distribution, 
                    marker=marker,
                    markersize=6,
            )
            
            scatter = plt.scatter([], [], 
                                marker=marker,
                                s=36,
                                label=model_name,
                                color=plt.gca().lines[-1].get_color())
            lines.append(scatter)
            labels.append(model_name)
        
        self._finish_label_grid_axes_styling(x_positions, bin_labels, lines, labels)
        return fig, f"{self.metric_name.lower()}_distribution"

    def _finish_label_grid_axes_styling(self, x_positions: np.ndarray, bin_labels: List[str], lines: List[plt.Line2D], labels: List[str]):
        # Configure x-axis
        plt.xlim(-0.01, 1.01)
        plt.xticks(x_positions, bin_labels, rotation=45)
        
        # Configure y-axis
        y_ticks = np.arange(0, 90, 20)
        plt.yticks(y_ticks, [f'{x:.2f} %' for x in y_ticks])
        plt.ylim(0, 90)
        
        # Configure grid
        plt.grid(True, axis='y', alpha=0.3, linestyle='-', color='gray')
        plt.grid(False, axis='x')
        
        # Ensure all text is black and properly sized
        plt.tick_params(colors='black', which='both')
        for label in plt.gca().get_xticklabels() + plt.gca().get_yticklabels():
            label.set_color('black')
        
        plt.legend(lines, labels)
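
To illustrate how a new plot can be added, the sketch below defines a hypothetical bar-chart plot on top of the same Plot base class and the MetricResult aggregation calls used above. The class name, file placement, and chart type are illustrative and not part of the current codebase.

from typing import Dict, Tuple

import matplotlib.pyplot as plt
import numpy as np

from evaluation.metrics.metric_result import COORDINATE_AXIS, MetricResult
from .plot import Plot


class MedianVelocityBarPlot(Plot):
    """Hypothetical bar chart of the median velocity magnitude per pose estimator."""

    def __init__(self):
        super().__init__(
            name="MedianVelocityBar",
            config={
                'title': 'Median Velocity per Pose Estimator',
                'ylabel': 'Median Velocity',
            }
        )

    def draw(
        self,
        results: Dict[str, Dict[str, Dict[str, MetricResult]]],
        add_title: bool = True,
    ) -> Tuple[plt.Figure, str]:
        # results follows the metric_name -> model_name -> video_name -> MetricResult layout
        medians = {}
        for model_name, video_results in results["Velocity"].items():
            per_video = [
                metric_result.aggregate([COORDINATE_AXIS], method='vector_magnitude')
                             .aggregate_all(method='median')
                for metric_result in video_results.values()
            ]
            medians[model_name] = float(np.median(per_video))

        fig = self._setup_figure(add_title=add_title)
        plt.bar(list(medians.keys()), list(medians.values()))
        plt.xticks(rotation=45, ha='right')
        return fig, "median_velocity_bar"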

We provide the following plots and tables:

  • Kinematic Distribution Plot: Visualizes the distribution of kinematic values for each pose estimator.
  • Per Keypoint Plot: Displays the median kinematic metric values or Euclidean distance for each COCO keypoint. This plot requires keypoints to be stored in COCO format.
  • Inference Time Plot: Visualizes the average inference time associated with each pose estimator.
  • Result Table: Aggregates results per metric and pose estimator across all videos, presenting the data in tabular form.

4.5 Rendering

Video rendering is handled by the Renderer class. For each video in the dataset, the renderer creates a dedicated folder within the renderings directory of the checkpoint folder. Inside each video folder, it generates one video per pose estimator, displaying the rendered keypoints. Special attention was given to maintaining consistent colors for each pose estimator across all videos and plots, using a predefined, color-blind–friendly palette.
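
One simple way to achieve such a fixed mapping is to pin the order of pose estimators and index into a color-blind-friendly palette once, then reuse the resulting dictionary for every plot and rendering. The snippet below illustrates the idea with seaborn's "colorblind" palette; it is a sketch and not taken from the MaskBench source.

import seaborn as sns

# Fixed estimator order: the palette index never changes, so every plot and
# rendered video uses the same color for the same pose estimator.
POSE_ESTIMATORS = [
    "YoloPose", "MediaPipePose", "OpenPose",
    "MaskAnyoneAPI-MediaPipe", "MaskAnyoneAPI-OpenPose",
    "MaskAnyoneUI-MediaPipe", "MaskAnyoneUI-OpenPose",
]

PALETTE = sns.color_palette("colorblind", n_colors=len(POSE_ESTIMATORS))
ESTIMATOR_COLORS = dict(zip(POSE_ESTIMATORS, PALETTE))


def color_for(estimator_name: str):
    """Return the RGB tuple assigned to a pose estimator (sketch)."""
    return ESTIMATOR_COLORS[estimator_name]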

5 Datasets

This study uses four video-based datasets, each representing a different level of complexity, from simple, controlled settings to more dynamic and interactive scenarios. To capture this range, we selected or created four distinct datasets: TED Kid Video (Allen 2017), TED Talks (“TED” 2025), Tragic Talkers (Berghi, Volino, and Jackson 2022), and a masked video dataset. Each dataset was chosen based on specific criteria to evaluate pose estimation models under varying degrees of difficulty.

5.1 TED Kid Video

The TED kid video is a short, 10-second clip featuring a child in a well-lit environment from the TEDx talk “Education for all” (Allen 2017). Throughout the sequence, all body parts remain clearly visible, with no occlusion or obstruction of the subject. This video represents a controlled scenario designed to evaluate pose estimation methods under ideal conditions, serving as a baseline for comparison with more complex datasets.

We first tested our models’ performance and evaluation metrics on this video to verify that our metrics function correctly in an ideal setting and that the implementation is accurate. This initial validation ensures that subsequent experiments on more challenging datasets can be interpreted with confidence in the correctness of our evaluation pipeline.

5.2 TED Talks

For the TED talks dataset (“TED” 2025), we selected ten videos featuring diverse speakers to capture a wide range of conditions. The selection criteria included speaker gender, skin tone, clothing types (e.g., long dresses versus short garments), partial occlusion, and videos where only specific body parts—such as hands, upper body, or lower body—are visible. We also considered variations in movement style and speed.

Our focus was on evaluating model performance under more complex conditions, such as scene changes, background noise (e.g., audience sounds), partial visibility of the body, and situations where body parts are difficult to distinguish (e.g., due to long dresses). We also accounted for visual distractions like images or patterns on the speaker’s clothing. In each TED talk video, our analysis concentrates solely on the primary speaker.

5.3 Tragic Talkers

We aimed to evaluate model performance in scenarios involving multiple people interacting. The Tragic Talkers dataset (Berghi, Volino, and Jackson 2022) was chosen because it provides 2D pseudo-ground truth annotations generated by the OpenPose AI model, allowing us to test metrics such as PCK or RMSE.

The dataset features a man in regular clothing and a woman wearing a long dress. It contains four distinct video scenarios, each originally recorded from twenty-two different camera angles. For our analysis, we used only four angles, as many viewpoints were too similar.

  • Monologue (Male and Female): Individual speakers deliver monologues with relatively simple and slow movements.
  • Conversation: A male and female speaker engage in dialogue with limited movement.
  • Interactive 1: A conversation between a male and female speaker that includes physical interaction (e.g., hand contact), with the man sitting close to the woman.
  • Interactive 4: A more dynamic dialogue featuring faster movements, partial occlusion, and moments of full occlusion.

These scenarios were chosen to reflect a variety of real-world human interactions, allowing us to test how well pose estimation models perform under conditions such as occlusion, multi-person scenes, and varied movement patterns.

However, the ground truth pose estimations were produced by an AI model rather than human annotators, which is why they are imperfect and should not be considered a true ground truth baseline (we refer to it as pseudo-ground truth). Because the original study does not specify which OpenPose variant, parameters, or post-processing steps were used, the PCK and RMSE accuracy values should be interpreted as a measure of how closely pose estimators replicate the OpenPose output rather than as an absolute indicator of pose estimation quality. In this context, a PCK accuracy above 80% is considered a good result, indicating that poses are generally well estimated.

5.4 Masked Video Dataset

The masked video dataset is a collection of three videos. It includes the TED kid video, a segment from the TED talk “Let curiosity lead” (Shahidi 2023), and the video “interactive1_t1-cam06” from the Tragic Talkers dataset. This dataset was created to evaluate the performance of pose estimators on masked videos, addressing the challenge of sharing datasets containing sensitive information among researchers.

For the dataset creation, we used MaskAnyoneUI to manually mask the persons of interest in each video using four different hiding strategies: blurring, pixelation, contours, and solid fill. Including the original unmasked videos, this resulted in a total of 15 videos for evaluation.

5.5 Data Preprocessing

Data preprocessing was carried out on the TED talks to remove unnecessary parts of the videos and to split them into shorter segments compatible with MaskAnyone. Since MaskAnyone cannot process videos longer than 2.5 minutes, and is already resource-intensive even at that limit, we divided the TED Talk videos into chunks of 30 or 50 seconds, depending on the content.
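
For reference, fixed-length chunking of this kind can be scripted with ffmpeg's segment muxer. The helper below is a minimal sketch with illustrative file names and chunk lengths, not the preprocessing script we used.

import subprocess


def split_into_chunks(input_path: str, chunk_seconds: int = 50, output_pattern: str = "chunk_%03d.mp4") -> None:
    """Split a video into fixed-length chunks using ffmpeg (sketch).

    Stream copying (-c copy) is fast but cuts only at keyframes, so chunk boundaries
    may shift slightly; re-encoding instead would give exact cuts at the cost of speed.
    """
    subprocess.run(
        [
            "ffmpeg", "-i", input_path,
            "-c", "copy",
            "-f", "segment",
            "-segment_time", str(chunk_seconds),
            "-reset_timestamps", "1",
            output_pattern,
        ],
        check=True,
    )


# Example: split a TED talk into 50-second chunks (hypothetical file name).
# split_into_chunks("ted_talk.mp4", chunk_seconds=50)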

TED talks also showed some inconsistency in structure. Some videos were straightforward, with only the speaker and audience visible, making them easy to segment at any point. However, others included additional visual content such as slides, pictures, or unrelated scenes, which made it more difficult to determine clean chunking points.

For these more complex videos, we carefully selected segment boundaries to ensure that each chunk started with frames where a human was clearly visible. When necessary, we manually trimmed the beginning of chunks to avoid starting with empty or unrelated frames. This step was critical because if a video starts with non-human content, MaskAnyone may incorrectly classify objects in the first frame as humans and then continue misdetecting them in subsequent frames.

No preprocessing was required for the Tragic Talkers dataset, as the videos were already clean and free of noise or unrelated visual content.

6 Evaluation Metrics

In the following sections, we outline the metrics used for evaluating accuracy, smoothness and jitter of different pose estimators.

6.1 Ground-Truth Metrics

The metrics in this section are based on ground truth data provided by the dataset and primarily evaluate the accuracy of the pose estimation compared to the reference ground truth.

6.1.1 Euclidean Distance

The Euclidean distance metric measures the spatial accuracy of pose estimation by calculating the normalized distance between predicted and ground truth keypoint positions. For each keypoint of a person in a frame, it computes the L2 norm (Euclidean distance) between the predicted position \((x_p, y_p)\) and the ground truth position \((x_{gt}, y_{gt})\):

\[ d = \frac{\sqrt{(x_p - x_{gt})^2 + (y_p - y_{gt})^2}}{s} \]

where \(s\) is a normalization factor. Normalization is essential to make the metric scale-invariant and comparable across persons of different sizes.

The metric is set up to support three normalization strategies, of which we implemented only the bounding box normalization. We outline future work on the implementation of head and torso normalization in Section 9.

  1. Bounding Box Size: The distance is normalized by the maximum of the width and height of the person’s bounding box, computed from the ground truth keypoints. This approach adapts to varying person sizes but may introduce minor pose-dependent scaling variance.
  2. Head Size: Normalization by the head bone link size (not implemented).
  3. Torso Size: Normalization by the torso diameter (not implemented).

Head and torso normalization address the pose-dependent scaling variance of the bounding box normalization. The metric also accounts for several edge cases to ensure robust evaluation:

  • Different Order of Persons: The metric uses the Hungarian algorithm as described in Section 4.3.1 to match person indices between ground truth and predictions, ensuring that distances are calculated between corresponding persons even if they appear in different orders.
  • Keypoint Missing in Ground Truth but not in Prediction: When a keypoint is absent in the ground truth (coordinates (0,0)) but detected in the prediction, the distance is set to NaN and excluded from aggregation, as no valid ground truth reference exists.
  • Keypoint Missing in Prediction but Present in Ground Truth: When a keypoint exists in the ground truth but is missing in the prediction, the distance is assigned a predetermined large fill value (here, 1). This penalizes missing detections while preventing disproportionate impact on aggregated results.
  • Undetected Persons: If a person in the ground truth is completely undetected in the prediction, all their keypoint distances are set to the same fill value to penalize the failure.

Euclidean distance forms the basis for computing the Percentage of Correct Keypoints (PCK) and Root Mean Square Error (RMSE) metrics.
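
As an illustration, a minimal sketch of the bounding-box-normalized distance for a single matched person is shown below. The function name and array conventions are illustrative and not MaskBench's implementation; missing keypoints are assumed to be encoded as (0, 0), as described above.

import numpy as np

FILL_VALUE = 1.0  # penalty for keypoints present in the ground truth but missing in the prediction


def normalized_distances(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Normalized Euclidean distance per keypoint for one matched person (sketch).

    pred, gt: arrays of shape (num_keypoints, 2); missing keypoints are encoded as (0, 0).
    """
    gt_missing = np.all(gt == 0, axis=1)
    pred_missing = np.all(pred == 0, axis=1)

    visible_gt = gt[~gt_missing]
    if visible_gt.size == 0:
        return np.full(len(gt), np.nan)  # no ground truth for this person

    # Normalization factor: maximum of bounding-box width and height from the ground truth keypoints.
    scale = float(np.max(visible_gt.max(axis=0) - visible_gt.min(axis=0)))

    d = np.linalg.norm(pred - gt, axis=1) / scale
    d[gt_missing] = np.nan                      # no valid reference, excluded from aggregation
    d[~gt_missing & pred_missing] = FILL_VALUE  # missed detection, fixed penalty
    return d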

6.1.2 Percentage of Correct Keypoints (PCK)

The Percentage of Correct Keypoints (PCK) metric evaluates pose estimation accuracy by calculating the proportion of predicted keypoints whose normalized Euclidean distance to the ground truth falls within a specified threshold. A keypoint is considered “correct” if its distance is below this threshold, allowing PCK to quantify the reliability of pose predictions at the chosen precision level.

For each frame, PCK is calculated as:

\[ PCK = \frac{\text{number of keypoints with distance < threshold}}{\text{total number of valid keypoints}} \]

PCK values range from zero to one, where one indicates perfect predictions (all keypoints are within the threshold) and zero indicates complete failure (no keypoints within the threshold).
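
Expressed in code, a per-frame PCK over an array of normalized distances could look as follows; this is a sketch, and the 0.2 default threshold is illustrative.

import numpy as np


def pck(distances: np.ndarray, threshold: float = 0.2) -> float:
    """Fraction of valid keypoints whose normalized distance is below the threshold (sketch)."""
    valid = ~np.isnan(distances)  # NaN distances have no ground-truth reference and are excluded
    if not valid.any():
        return float("nan")
    return float(np.mean(distances[valid] < threshold))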

6.1.3 Root Mean Square Error (RMSE)

The Root Mean Square Error (RMSE) provides a single aggregated measure of pose estimation accuracy by calculating the root mean square of normalized Euclidean distances across all valid keypoints and persons in a frame. RMSE is defined as:

\[ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2} \]

where \(N\) is the total number of valid keypoints in the frame, and \(d_i\) is the normalized Euclidean distance of keypoint \(i\). By squaring the distances before averaging, RMSE penalizes larger errors more heavily, making it particularly sensitive to outliers.
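
Using the same array of normalized distances, the per-frame RMSE is a short computation (sketch):

import numpy as np


def rmse(distances: np.ndarray) -> float:
    """Root mean square of the normalized distances, ignoring NaN entries (sketch)."""
    return float(np.sqrt(np.nanmean(np.square(distances))))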

6.2 Kinematic Metrics

Velocity, acceleration, and jerk are key kinematic metrics that help identify unnatural or erratic movements in pose estimations by highlighting rapid changes in motion.

6.2.1 Velocity

The velocity metric measures the rate of change in keypoint positions between consecutive frames. For each keypoint of a person, it quantifies how quickly the keypoint moves in pixels per frame, providing insight into the smoothness and temporal consistency of the pose estimation.

The velocity calculation proceeds in three steps:

  1. Person indices are matched between consecutive frames (as described in Section 4.3.1) to ensure tracking of the same individual over time.
  2. The velocity is then computed with \(v_t = p_{t+1} - p_t\) as the difference between keypoint positions in consecutive frames, where \(p_t\) represents the keypoint position at frame \(t\), and \(v_t\) is the resulting velocity vector.
  3. Finally, the metric can be configured to report velocities in either pixels per frame or pixels per second. In the latter case, the frame-based velocity is divided by the time delta between frames (1/fps).

The metric robustly handles several edge cases:

  • For videos with fewer than two frames, velocity cannot be computed, and the metric returns NaN values.
  • If a keypoint is missing in either of two consecutive frames, the corresponding velocity is set to NaN.
  • Since velocity is derived from frame-to-frame differences, the output contains one fewer frame than the input video.
  • The output includes a coordinate axis (x and y) representing the velocity vector, which serves as a basis for the computation of the acceleration and jerk metrics. For evaluation and visualization, aggregate along this axis using the vector_magnitude method to obtain scalar velocity values.
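
Assuming matched keypoint trajectories of shape (num_frames, num_keypoints, 2) with NaN for missing keypoints, the core computation reduces to a frame-to-frame difference. The helper below is a sketch, not the MaskBench implementation.

from typing import Optional

import numpy as np


def velocity(keypoints: np.ndarray, fps: Optional[float] = None) -> np.ndarray:
    """Frame-to-frame velocity vectors v_t = p_{t+1} - p_t (sketch).

    keypoints: array of shape (num_frames, num_keypoints, 2) with NaN for missing keypoints.
    Returns an array of shape (num_frames - 1, num_keypoints, 2). If fps is given, values are
    converted from pixels/frame to pixels/second by dividing by the time delta 1/fps.
    """
    if keypoints.shape[0] < 2:
        # Velocity cannot be computed for fewer than two frames.
        return np.full((0,) + keypoints.shape[1:], np.nan)
    v = np.diff(keypoints, axis=0)  # NaN propagates when a keypoint is missing in either frame
    return v * fps if fps is not None else v


# Scalar speed per keypoint (the vector_magnitude aggregation over the coordinate axis):
# speed = np.linalg.norm(velocity(keypoints), axis=-1)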

6.2.2 Acceleration

The acceleration metric measures the rate of change in velocity over time, representing how quickly the movement speed of keypoints changes. It is computed as \(a_t = v_{t+1} - v_t\), where \(a_t\) is the acceleration at time \(t\) and \(v_t\) represents the velocity. Acceleration values can be reported in either pixels per frame squared or pixels per second squared, with the latter requiring normalization by the squared time delta between frames (1/fps²).

6.2.3 Jerk

Jerk measures the rate of change of acceleration, offering insights into the smoothness and abruptness of motion by quantifying how quickly acceleration varies. It is calculated with \(j_t = a_{t+1} - a_t\) as the difference between consecutive acceleration values, where \(j_t\) is the jerk at time \(t\) and \(a_t\) represents the acceleration. The metric supports reporting in pixels per frame cubed or pixels per second cubed, with the latter normalized by the cubed time delta between frames (1/fps³).
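
In code, acceleration and jerk follow from repeated finite differences of the keypoint trajectories, with the unit conversion applied once per derivative order. This is a sketch in the spirit of the velocity helper above, not the MaskBench implementation.

from typing import Optional

import numpy as np


def acceleration(keypoints: np.ndarray, fps: Optional[float] = None) -> np.ndarray:
    """a_t = v_{t+1} - v_t, reported in pixels/frame² or pixels/second² (sketch)."""
    a = np.diff(keypoints, n=2, axis=0)
    return a * fps**2 if fps is not None else a


def jerk(keypoints: np.ndarray, fps: Optional[float] = None) -> np.ndarray:
    """j_t = a_{t+1} - a_t, reported in pixels/frame³ or pixels/second³ (sketch)."""
    j = np.diff(keypoints, n=3, axis=0)
    return j * fps**3 if fps is not None else j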

7 Experimental Setup

In this section, we describe the experimental setup used to evaluate pose estimators across four datasets.

General Setup

We evaluated seven pose estimators on the four datasets: TED Kid Video, TED Talks, Tragic Talkers, and the Masked Video Dataset. The pose estimators are: YoloPose (v11-l), MediaPipePose (pose_landmarker_heavy), OpenPose (body_25), MaskAnyoneAPI-MediaPipe, MaskAnyoneAPI-OpenPose, MaskAnyoneUI-MediaPipe, and MaskAnyoneUI-OpenPose. Confidence thresholds were visually determined on a subset of videos and set to 0.3 for YoloPose and MediaPipePose, and 0.15 for OpenPose. Since MaskAnyone-based estimators do not output confidence scores, a threshold of zero was used for them. All keypoints were stored in COCO format to enable per-keypoint comparisons across models.
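
For reference, the confidence thresholds can be summarized in a simple mapping. The dictionary below only mirrors the values stated above; it is a sketch and not MaskBench's actual configuration format.

# Confidence thresholds used in the experiments; MaskAnyone-based estimators do not
# output confidence scores, so their threshold is set to zero.
CONFIDENCE_THRESHOLDS = {
    "YoloPose": 0.30,
    "MediaPipePose": 0.30,
    "OpenPose": 0.15,
    "MaskAnyoneAPI-MediaPipe": 0.0,
    "MaskAnyoneAPI-OpenPose": 0.0,
    "MaskAnyoneUI-MediaPipe": 0.0,
    "MaskAnyoneUI-OpenPose": 0.0,
}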

TED Kid Video and TED Talks

For both datasets, we evaluated the kinematic metrics velocity, acceleration, and jerk for each pose estimator. Due to the absence of ground truth annotations for TED Talks, accuracy metrics could not be computed.

Tragic Talkers

For the Tragic Talkers dataset, we evaluated both accuracy metrics (Euclidean distance, PCK, RMSE) and kinematic metrics (velocity, acceleration, jerk) for each pose estimator.

Inference on Raw vs. Masked Videos

For the Masked Video dataset, inference was first performed on the raw videos with all pose estimators to establish a baseline. Subsequently, inference was repeated on videos masked with different hiding strategies. Performance was compared against the baseline to assess the impact of masking, using PCK and RMSE metrics to quantify accuracy relative to raw videos.

Note that this masking evaluation pipeline is a preliminary implementation outside of MaskBench’s native capabilities, serving as a proof of concept. We intend to integrate full support for this workflow in future MaskBench releases (see Section 9.2.1).

8 Results

We present the experimental results below. To improve readability, kinematic metrics are reported without units in the text: velocity is in pixels/frame, acceleration in pixels/frame², and jerk in pixels/frame³. Our analysis focuses primarily on acceleration and jerk, instead of velocity, as these metrics are more suited to detecting instability and unnatural motion in pose estimation.

8.1 TED Kid Video

Table 1 summarizes the average velocity, acceleration, and jerk for each pose estimator on the TED kid video. Standard pose estimation models like YoloPose, MediaPipePose, and OpenPose exhibit relatively high values across all metrics, indicating more erratic and less stable pose estimations. MediaPipePose has the highest values for velocity (3.36), acceleration (4.10), and jerk (7.18).

In contrast, all evaluated MaskAnyone pose estimators show consistently lower acceleration and jerk, with MaskAnyoneUI-MediaPipe achieving the best results (velocity: 1.97, acceleration: 1.20, jerk: 1.89), representing reductions of approximately 41%, 71%, and 74%, respectively, compared to pure MediaPipePose. This indicates substantially smoother and more stable pose tracking over time. The improvements are more pronounced for MediaPipePose than OpenPose: MaskAnyone reduces MediaPipePose’s acceleration and jerk by 2.9 and 5.31, while OpenPose sees smaller decreases of 1.11 and 2.81, demonstrating the greater effectiveness of MaskAnyone with MediaPipePose.

Pose Estimator Velocity Acceleration Jerk
YoloPose 2.81 2.95 5.04
MediaPipePose 3.36 4.10 7.18
OpenPose 2.71 3.20 5.56
MaskAnyoneAPI-MediaPipe 2.00 1.21 1.87
MaskAnyoneAPI-OpenPose 2.73 2.46 3.53
MaskAnyoneUI-MediaPipe 1.97 1.20 1.89
MaskAnyoneUI-OpenPose 2.62 2.09 2.75
Table 1: Average metric results for different pose estimators on the TED kid video.

Figure 3 (a) and Figure 3 (b) present the distribution of the acceleration and jerk metrics for the different pose estimators. These plots show the percentage of keypoints within fixed value ranges for the acceleration and jerk metrics over all frames. The ideal curve for a stable pose estimation follows an exponential decay curve, with most kinematic values near zero and a few large values. Both plots confirm the results from Table 1, showing that the MaskAnyone pose estimators have a very high concentration of low acceleration and jerk values. The UI and API variants of MaskAnyone-MediaPipe most closely resemble the ideal curve, with over 80% of keypoints having acceleration below 1 pixel/frame². MaskAnyone-OpenPose estimators rank third and fourth, with around 57% of keypoints below this threshold. YoloPose ranks fifth with 52%, followed by OpenPose at 40%. MediaPipePose is the most unstable, with only 30% of keypoints below one pixel/frame² and a relatively flat distribution curve. Similar patterns can be observed for the jerk distribution.

Figure 3 (c) shows the median acceleration per keypoint for the different pose estimators. Each keypoint is shown with a set of seven bars, one for each pose estimator, indicating the median acceleration value for that keypoint and pose estimator. The first notable finding is that keypoints like the wrists, elbows, hips, and ankles exhibit consistently higher median acceleration compared to more stable points like the eyes, ears, and nose. This aligns with expectations since these joints undergo more frequent and pronounced movement. Secondly, MaskAnyoneAPI-MediaPipe and MaskAnyoneUI-MediaPipe consistently achieve the lowest acceleration values across all keypoints. Both MaskAnyoneUI variants improve upon their default counterparts, MediaPipePose and OpenPose, for every keypoint. The most pronounced gains appear at the hips, knees, and ankles, where MaskAnyoneUI-MediaPipe reduces median acceleration from about six pixels/frame² down to less than one.

(a) Acceleration Distribution
(b) Jerk Distribution
(c) Median Acceleration per Keypoint
Figure 3: Comparison of pose estimation models on the TED-kid video. (a, b) Acceleration and jerk distribution showing the proportion of keypoints within a fixed value range. A larger concentration at lower values indicates a smoother and more stable pose, whereas a high number of keypoints with large values indicates more jittery movements. (c) Median acceleration per keypoint, indicating stability across individual body parts. Keypoints like the wrist, elbow, and ankle are expected to have a higher median acceleration than other body parts, which tend to be more stable during movements, like the eyes, ears, and nose.

Last but not least, it is important to not only evaluate pose estimation results analytically but also to visually inspect pose quality. Table 2 shows rendered videos for the seven pose estimators on the TED kid video.

In the MediaPipePose video, the pose estimation appears unstable, showing more jitter and sudden pose changes. At the beginning, the model fails to detect the right elbow joint, which all other estimators detect correctly. Additionally, the hips, ankles, and elbows display rapid, jerky movements throughout the video.

Comparing MaskAnyoneUI-MediaPipe and MaskAnyoneAPI-MediaPipe rendered videos reveals that both are considerably more stable and smoother than pure MediaPipePose. Aside from the person’s natural movement, key points generally remain fixed and steady.

Observing the other pose estimators shows that none are as stable as MaskAnyoneUI-MediaPipe, but most outperform pure MediaPipePose in stability. This visual evidence supports the quantitative results in Table 1 and Figure 3. It also confirms that our kinematic metrics effectively indicate pose estimation stability.

Raw Video
YOLOPose
MediaPipe Pose
OpenPose
MaskAnyoneAPI-MediaPipe
MaskAnyoneAPI-OpenPose
MaskAnyoneUI-MediaPipe
MaskAnyoneUI-OpenPose
Table 2: Rendered result videos of different pose estimators on the TED-kid video.

8.2 TED Talks

The results on ten full TED talks closely mirror those on the single TED kid video, as shown in Table 3. Among the evaluated pose estimators, MaskAnyoneUI-MediaPipe consistently achieved the best stability, with the lowest average velocity, acceleration, and jerk values of 1.25, 1.08, and 1.83, respectively. MaskAnyoneAPI-MediaPipe followed, showing the second-best performance in acceleration and jerk, closely trailed by YoloPose. OpenPose ranked next, while both MaskAnyone-OpenPose variants exhibited greater instability than the pure OpenPose model. Consistent with earlier findings, MediaPipePose was the least stable estimator, with the highest values across all metrics: 3.29 for velocity, 4.52 for acceleration, and 7.94 for jerk. An additional observation is the clear trend that MediaPipe-based MaskAnyone variants generally outperform OpenPose-based ones in stability, as reflected by their consistently lower velocity, acceleration, and jerk values.

Pose Estimator Velocity Acceleration Jerk
YoloPose 1.35 1.46 2.44
MediaPipePose 3.29 4.52 7.94
OpenPose 1.58 2.22 3.44
MaskAnyoneAPI-MediaPipe 1.44 1.25 2.04
MaskAnyoneAPI-OpenPose 2.30 2.42 4.07
MaskAnyoneUI-MediaPipe 1.25 1.08 1.83
MaskAnyoneUI-OpenPose 2.07 2.29 3.72
Table 3: Average metric results for different pose estimators aggregated over all TED talk videos.

(a) Acceleration Distribution
(b) Jerk Distribution
(c) Median Acceleration per Keypoint
Figure 4: Comparison of pose estimation models on the TED talks dataset. (a, b) Acceleration and jerk distribution showing the proportion of keypoints within a fixed value range. A larger concentration at lower values indicates a smoother and more stable pose, whereas a high number of keypoints with large values indicates more jittery movements. (c) Median acceleration per keypoint, indicating stability across individual body parts.

Figure 4 (a) and Figure 4 (b) indicate that pose estimators are less stable in TED talks than in the TED kid video, with high acceleration and jerk values occurring more frequently. This is likely because TED talks include camera movements, scene changes, segments without visible people, and audience views, none of which appear in the TED kid video. We included two particularly challenging video chunks in Table 4. The first column shows results for the qualitatively worst-performing pose estimator, MediaPipePose, while the second column presents the best performer, MaskAnyoneUI-MediaPipe.

In the first scene (Burke 2018) from the TED talk “Me Too is a Movement, Not a Moment”, the woman wears a long dress, and the video contains multiple scene changes, audience views with the speaker in the background, and parts where the speaker is not visible. MaskAnyone substantially improves the stability and visual accuracy of pose estimation in all these scenarios.

In the second scene (Buzz 2023) from the TED talk “Universe / Statues / Liberation”, the main challenges are rapid camera view changes and close-up shots of the singing woman. Both MaskAnyoneUI-MediaPipe and raw MediaPipe struggle with close-ups of the hips and arms. The model attempts to fit a full human pose into the small visible area of an arm or hip, leading to incorrect pose estimation and unstable motion. It appears that once the model detects one joint, it tries to estimate the entire pose, which can cause errors in these conditions. This issue was primarily observed with MediaPipe models, including MaskAnyone-MediaPipe variants, and not with other pose estimators. Despite this, MaskAnyoneUI-MediaPipe still provides more stable and accurate pose estimations than pure MediaPipePose for most frames in this video.

MediaPipePose
MaskAnyoneUI-MediaPipe
Table 4: Two TED Talk chunks overlaid with pose estimations from MediaPipePose and MaskAnyoneUI-MediaPipe, featuring challenging segments with scene changes, camera movements, and periods without visible persons. The first row shows a clip from the TED talk “Me Too is a Movement, Not a Moment” (Burke 2018). The second row shows a clip from the TED talk “Universe / Statues / Liberation” (Buzz 2023).

8.3 Tragic Talkers

Table 5 presents the average metric results of various pose estimators on the Tragic Talkers dataset. Regarding accuracy against the pseudo-ground truth, YoloPose achieves the highest PCK at 96%, followed by OpenPose at 87%. All pose estimators except MediaPipePose reach a PCK of at least 78%, with MediaPipePose detecting only 69% of keypoints correctly. However, these PCK and RMSE values should be interpreted cautiously, as the pseudo-ground truth was generated by an AI model rather than human annotators, making it inherently imperfect. Thus, the results reflect how closely each pose estimator matches the OpenPose output, rather than absolute pose estimation quality.

The kinematic metrics, especially acceleration and jerk, provide clearer results. MaskAnyoneAPI-MediaPipe performs best, with the lowest acceleration and jerk values of approximately 2.9 and 5.0, respectively. MaskAnyoneUI-MediaPipe follows closely behind MaskAnyoneAPI-MediaPipe with slightly increased acceleration and jerk, while YoloPose shows similar acceleration but a somewhat higher jerk. Although MaskAnyone-OpenPose variants outperform standard OpenPose, they still exhibit noticeably greater acceleration and jerk, reflecting less smooth motion. Pure MediaPipePose remains the least stable estimator, with average acceleration and jerk values of approximately 9.8 and 17.6, respectively.

Pose Estimator PCK RMSE Velocity Acceleration Jerk
YoloPose 0.96 0.11 4.36 3.27 5.57
MediaPipePose 0.69 0.47 6.48 9.80 17.58
OpenPose 0.87 0.33 4.63 6.38 10.00
MaskAnyoneAPI-MediaPipe 0.78 0.12 3.46 2.86 5.01
MaskAnyoneAPI-OpenPose 0.85 0.36 5.69 6.08 10.22
MaskAnyoneUI-MediaPipe 0.83 0.07 3.26 2.91 5.10
MaskAnyoneUI-OpenPose 0.85 0.36 5.53 5.12 9.06
Table 5: Average metric results for different pose estimators aggregated over four camera angles of five Tragic Talkers sequences with pseudo-ground truth.

Figure 5 (a) and Figure 5 (b) confirm the results from Table 5. Both plots show that the MaskAnyone-MediaPipe pose estimators achieve the highest proportion of low acceleration and jerk values, followed by YoloPose, the MaskAnyone-OpenPose pose estimators, and OpenPose. MediaPipePose once again has a very flat curve, indicating a lot of large acceleration and jerk values.

Figure 5 (c) shows that MediaPipePose is among the pose estimators with the highest median acceleration values for all keypoints. YoloPose, MaskAnyoneAPI-MediaPipe, and MaskAnyoneUI-MediaPipe achieve consistently low median acceleration values for all keypoints.

(a) Acceleration Distribution
(b) Jerk Distribution
(c) Median Acceleration per Keypoint
Figure 5: Comparison of pose estimation models on the Tragic Talkers dataset. (a, b) Acceleration and jerk distribution showing the proportion of keypoints within a fixed value range. A larger concentration at lower values indicates a smoother and more stable pose, whereas a high number of keypoints with large values indicates more jittery movements. (c) Median acceleration per keypoint, indicating stability across individual body parts.

Interestingly, although the MaskAnyone-OpenPose pose estimators achieve lower acceleration values for nose, eye, ear, shoulder, and ankle keypoints than pure OpenPose, they perform worse for the elbow, hip, and knee keypoints. A potential reason for this could be that MaskAnyone uses a higher confidence threshold for keypoints than our OpenPose implementation, which leads to the elbow, hip, and knee keypoints not being detected or rendered. As an example, consider Table 6, which shows the first seconds of the rendered video for OpenPose, MaskAnyoneAPI-OpenPose, and MaskAnyoneUI-OpenPose for the “conversation1_t3-cam08” sequence. In this scene, both MaskAnyone-OpenPose pose estimators fail to detect the legs of the woman, while OpenPose correctly detects them.

OpenPose
MaskAnyoneAPI-OpenPose
MaskAnyoneUI-OpenPose
Table 6: First 10 seconds of the rendered Tragic Talkers videos for OpenPose, MaskAnyoneAPI-OpenPose, and MaskAnyoneUI-OpenPose for the “conversation1_t3-cam08” sequence.

Last but not least, we qualitatively compare MaskAnyoneAPI-MediaPipe, MaskAnyoneUI-MediaPipe, and YoloPose on the “interactive4_t3-cam08” sequence (Table 7). These three pose estimators have the lowest overall average acceleration values, as shown in Table 5. Two important observations were made:

  1. YoloPose is the only estimator that correctly identifies the woman when she turns around, facing away from the camera. Both MaskAnyone variants fail in this scenario.
  2. At the start of the sequence, where both actors stand with hands stretched out, only YoloPose correctly captures the lower part of the woman’s body. Both MaskAnyone estimators produce an incorrect upper-body pose initially, which improves as the woman lowers her arms, eventually stabilizing in the correct position.

Qualitatively, YoloPose performs best on this sequence.

YoloPose
MaskAnyoneAPI-MediaPipe
MaskAnyoneUI-MediaPipe
Table 7: The most stable pose estimators on the Tragic Talkers “interactive4_t3-cam08” sequence.

8.4 Inference on Raw vs. Masked Videos

Pose Estimator Blurring Pixelation Contours Solid Fill Average
YoloPose 0.95 0.09 0.93 0.32 0.57
MediaPipePose 0.95 0.81 0.56 0.34 0.67
OpenPose 0.88 0.10 0.62 0.01 0.40
MaskAnyoneAPI-MediaPipe 0.85 0.30 0.00 0.00 0.29
MaskAnyoneAPI-OpenPose 0.75 0.00 0.36 0.00 0.28
MaskAnyoneUI-MediaPipe 0.95 0.63 0.00 0.07 0.41
MaskAnyoneUI-OpenPose 0.87 0.03 0.58 0.00 0.37
Average 0.86 0.23 0.44 0.11 /
Table 8: Percentage of correct keypoints (PCK) for different pose estimators on videos masked by different hiding strategies.
Pose Estimator Blurring Pixelation Contours Solid Fill Average
YoloPose 0.12 0.92 0.13 0.74 0.48
MediaPipePose 0.12 0.26 0.49 0.64 0.38
OpenPose 0.25 0.94 0.47 1.00 0.67
MaskAnyoneAPI-MediaPipe 0.27 0.74 1.00 0.99 0.75
MaskAnyoneAPI-OpenPose 0.43 0.99 0.75 1.00 0.79
MaskAnyoneUI-MediaPipe 0.07 0.41 1.00 0.94 0.60
MaskAnyoneUI-OpenPose 0.24 0.98 0.52 1.00 0.69
Average 0.21 0.78 0.62 0.90 /
Table 9: Root mean square error (RMSE) for different pose estimators on videos masked by different hiding strategies.

As described in Section 5.4, three videos were masked using four different hiding strategies. Table 8 and Table 9 present the percentage of correct keypoints (PCK) and root mean square error (RMSE) for various pose estimators on the masked videos, compared to the original videos.

Comparison of pose estimators

MediaPipePose achieves the highest average PCK of 67% and the lowest average RMSE of 0.38, indicating robustness across all hiding strategies. YoloPose also performs well with an average PCK of 57%, particularly on the blurring and contours strategies, where it detects 95% and 93% of the original keypoints, respectively. In contrast, OpenPose performs worse, with an average PCK of only 40% and a high RMSE of 0.67.

Among the MaskAnyone variants, UI-based models generally outperform API-based ones. MaskAnyoneUI-MediaPipe achieves a moderate average PCK of 41% and RMSE of 0.6. The API variants perform poorly, with average PCKs around 28% to 29%, indicating that human input improves performance on masked videos.

However, unlike in other datasets, MaskAnyone UI variants do not improve but rather degrade performance compared to the pure AI models. Masking the videos makes keypoint detection more challenging, often lowering the confidence scores assigned by the models. Because MaskAnyone applies higher confidence thresholds than the base AI models, many keypoints with reduced confidence may be discarded, leading to more undetected keypoints. Additionally, if the first stage of MaskAnyone, where YoloPose detects the person, performs poorly, the second stage, which uses SAM2 (Ravi et al. 2024) to segment and crop the person, also suffers. This cascades to low-quality input for the final pose estimation stage, degrading overall performance.

Comparison of hiding strategies

Last, we compare the hiding strategies in terms of balancing privacy and pose estimation performance on masked videos.

Blurring produced the highest average PCK of 86% across all pose estimators. This shows that models can still recognize and track people accurately, even when the image is partially obscured. The result highlights that pose estimation does not depend solely on the visibility of individual joints. Instead, the models appear to rely on the overall shape and structure of the body, using contextual cues to fill in missing details. They likely infer joint positions by applying spatial relationships and body priors learned during training, such as limb proportions, symmetry, and common human poses. For instance, even when specific features like eyes or hands are hidden, the surrounding geometry, such as head position, shoulder width, or arm direction, provides sufficient context for pose estimation. This suggests that these models depend more on learned pose patterns than on fine-grained pixel information.

The results for other hiding strategies are more mixed. YoloPose achieves an impressive 93% PCK on videos masked with contours, indicating its ability to utilize edge and shape information rather than texture or color. In contrast, other pose estimators perform poorly on this strategy. Both MaskAnyone-MediaPipe variants detect almost no keypoints, and the remaining models achieve between 36% and 62% PCK.

For pixelation, only MediaPipePose (81% PCK) and MaskAnyoneUI-MediaPipe (63% PCK) detect persons reasonably well. YoloPose, OpenPose, and both MaskAnyone-OpenPose variants detect 10% of keypoints or fewer. This suggests that the pixelation level used drastically reduces the information usable for pose estimation and that this method is currently unsuitable for such tasks.

The solid fill hiding strategy is the most challenging, removing nearly all information about the person except the outline. As a result, it yields the lowest average PCK of 11%. MediaPipePose performs best here but reaches only 34% PCK.

In conclusion, blurring offers the best trade-off between privacy and pose estimation performance on masked videos. While it may not fully de-identify individuals, it retains sufficient information for accurate pose predictions. The contour hiding strategy can be considered when stronger privacy is required, though it reduces accuracy for all but YoloPose.

Table 10 presents qualitative results of the best-performing pose estimators on masked videos from the TED talk “Let curiosity lead” and the Tragic Talkers sequence “interactive1_t1-cam06.”

Table 10: The best-performing pose estimators on the masked videos are shown for the TED sequence “Let curiosity lead” and the Tragic Talkers sequence “interactive1_t1-cam06”. Each row corresponds to a different hiding strategy, in the order: Blurring, Pixelation, Contours, and Solid Fill. For the TED sequence (first column), the pose estimators are YoloPose, MaskAnyoneUI-MediaPipe, YoloPose, and YoloPose. For the Tragic Talkers sequence (second column), the pose estimators are YoloPose, MediaPipePose, YoloPose, and MaskAnyoneUI-MediaPipe.

9 Future Work & Limitations

In this section, we outline future work and limitations of MaskBench and MaskAnyone.

9.1 MaskAnyone Limitations

Both MaskAnyone-API and MaskAnyone-UI have introduced improvements in detection accuracy and user interaction. However, they still exhibit notable limitations in complex and long video scenarios such as TED talks, which often feature background noise, occlusions, scene transitions, and unrelated visual elements.

The first and most important limitation of MaskAnyone-API and MaskAnyone-UI is that they do not support long videos such as TED talks. Long videos must be chunked manually; however, chunking introduces another issue. After chunking, some videos start with frames where no human is visible. When a video begins with such frames, MaskAnyone falsely predicts objects in the scene as humans and then carries that false prediction through to the end of the video. For better detection and pose estimation, we would need to remove the start of chunks without visible humans, but this results in a loss of content.

Another major challenge lies in handling abrupt scene changes or shifts in camera perspective. For example, when a video cuts from a close-up to a full-body shot (or vice versa), MaskAnyone fails to maintain consistent detection and tracking, resulting in missed detections or inaccurate pose estimation. MaskAnyone-UI addresses this issue through a human-in-the-loop mechanism that allows users to manually select key frames, ensuring more reliable tracking throughout the video.

Another issue, observed primarily in MaskAnyone-API, is the double overlaying of pose skeletons on the same person. This results in duplicate or misaligned pose renderings. This problem has not been observed in the UI version, as manual frame selection allows users to avoid such misdetections.

Finally, false positive predictions remain a common problem in MaskAnyone-API. They occur not only in scenes without any human, where the system interprets non-human objects such as buildings, cigarettes, or images as people, but also in scenes where a human is actually present yet MaskAnyone-API segments the background instead of the person. False positive predictions occur in MaskAnyone-UI as well, but only on rare occasions.

9.2 MaskBench Outlook

With our promising results, we are laying the groundwork for a more versatile benchmarking framework for pose estimation on masked videos. There are several directions in which we plan to extend our work.

9.2.1 Pipelining

Currently, MaskBench supports a single workflow: running inference on a set of videos, evaluating results using metrics, visualizing them, and rendering the videos with overlaid poses. As demonstrated with the masked video dataset experiment, there are many more potential workflows that could be integrated. We aim to introduce an extensible and customizable pipeline class to MaskBench. Each pipeline would define a specific workflow (e.g., our current workflow, the masked video dataset workflow, or other, yet-to-be-defined workflows) by chaining MaskBench components in a particular order, reusing existing modules and adding new ones where necessary. A minimal sketch of this idea follows the example workflow below.

For example, the masked video dataset workflow could be structured as follows:

  1. Run inference on the raw videos with all pose estimators.
  2. Evaluate results with metrics.
  3. Visualize pose estimator performance on raw videos using plots or tables.
  4. Render the raw videos with overlaid poses.
  5. Reuse the SAM2 masks from MaskAnyone to apply different hiding strategies to the videos. Masking parameters could be adjusted by the user to explore not only different strategies but also varying degrees of masking, helping determine the optimal balance between privacy and performance.
  6. For each pose estimator, run inference on all videos across all hiding strategies.
  7. Evaluate results with metrics.
  8. Visualize performance on masked videos using plots or tables.
  9. Render the masked videos with overlaid poses.
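
A minimal sketch of such a pipeline abstraction is shown below. The PipelineStep and Pipeline names, the shared-context interface, and the step classes in the commented example are hypothetical and only illustrate how MaskBench components could be chained.

from abc import ABC, abstractmethod
from typing import Any, Dict, List


class PipelineStep(ABC):
    """One stage of a workflow, e.g. inference, masking, evaluation, or rendering (hypothetical)."""

    @abstractmethod
    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        """Consume and extend a shared context (videos, poses, metric results, ...)."""
        ...


class Pipeline:
    """Chains MaskBench components into a specific workflow (hypothetical sketch)."""

    def __init__(self, steps: List[PipelineStep]):
        self.steps = steps

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        for step in self.steps:
            context = step.run(context)
        return context


# The masked video dataset workflow could then be assembled from existing and new steps, e.g.:
# Pipeline([InferenceStep(), EvaluationStep(), MaskingStep(strategies=["blur", "contours"]),
#           InferenceStep(on="masked"), EvaluationStep(), RenderingStep()])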

9.2.2 Evaluation of downstream tasks

Estimating a person’s pose can serve as a preliminary step for many downstream tasks, such as gesture recognition (Köpüklü et al. 2019; Molchanov et al. 2016), 3D human reconstruction (Ho, Song, and Hilliges 2024; Choi, Moon, and Lee 2020), and action classification. MaskBench could be extended to evaluate how different upstream pose estimators affect performance on these downstream tasks for both raw and masked videos. This extension would give researchers practical guidance on which pose estimator to choose for a given downstream application. Furthermore, researchers could use MaskBench to mask sensitive datasets and publish the masked videos together with pose outputs derived from the original raw videos. Other researchers could then use the masked videos plus the provided pose outputs as input for downstream tasks without accessing the original raw data. MaskBench should include an evaluation framework that quantifies the potential performance loss in downstream tasks when using masked videos or alternative upstream estimators.

9.2.3 User interface

Adding a web-based user interface to MaskBench would make the framework significantly more accessible. At present, running MaskBench requires technical expertise, such as working with Docker containers, setting environment variables, and editing configuration files. A dedicated interface could replace these steps with an intuitive, visual workflow for configuring and running pipelines. It could also provide built-in visualization panels for metrics, interactive plots, and side-by-side video comparisons, making it easier to explore results without leaving the application. Ultimately, this would lower the entry barrier for non-technical users while speeding up experimentation for advanced users.

9.2.4 Additional improvements

In addition to the major extensions outlined above, several smaller improvements could further enhance MaskBench in the future:

  • Expanded normalization options for the Euclidean distance metric. Currently, normalization is only possible using bounding boxes, but can be extended to support head and torso normalization as described in Section 6.1.1. This requires identifying the relevant head and torso keypoints while keeping the system flexible enough to support multiple keypoint formats beyond COCO.
  • Integrated logging system. A built-in logger could provide cleaner, more structured terminal output. For debugging, an option to display all logs from the underlying Docker containers would make error tracing during development much easier.
  • Support for face and hand keypoints. This would enable evaluation of a broader set of downstream tasks where fine-grained keypoint data is important.
  • Additional ground-truth–independent metrics and plots. Beyond velocity, acceleration, and jerk, metrics could assess the physical plausibility of a pose given human body constraints. This would shift part of the evaluation focus from pure numerical quality to biomechanical realism.
  • 3D pose estimation support as a long-term goal. This could include both evaluating 3D models directly and projecting 3D ground-truth keypoints onto the 2D image plane using camera calibration data (see the projection sketch after this list). Leveraging marker-based motion capture datasets, such as BioCV from the University of Bath (Evans et al. 2024), would allow for more precise real-world benchmarking than 2D pseudo-ground truth data.
  • Integration of SAMURAI (Yang et al. 2024) into MaskAnyone. This would allow for more stable tracking of persons over time by using adapted memory modules that more consistently maintain a person’s identity across frames.
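
As a rough illustration of the first item above, the sketch below shows how a torso-normalized Euclidean distance could be computed. The keypoint indices follow the COCO convention (5 = left shoulder, 12 = right hip); the function name and interface are assumptions made for this example, not existing MaskBench code.

```python
# Hypothetical sketch of torso normalization for the Euclidean distance metric.
# Keypoint indices follow the COCO convention (5 = left shoulder, 12 = right hip);
# the function name and interface are illustrative, not existing MaskBench code.
import numpy as np

LEFT_SHOULDER, RIGHT_HIP = 5, 12


def torso_normalized_distance(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-keypoint Euclidean distance divided by the ground-truth torso diagonal.

    pred, gt: arrays of shape (num_keypoints, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (num_keypoints,) with normalized distances.
    """
    torso_diagonal = np.linalg.norm(gt[LEFT_SHOULDER] - gt[RIGHT_HIP])
    if torso_diagonal == 0:
        # Degenerate pose (e.g. missing keypoints); mark the frame as invalid.
        return np.full(pred.shape[0], np.nan)
    return np.linalg.norm(pred - gt, axis=1) / torso_diagonal
```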
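
Similarly, projecting 3D ground-truth keypoints onto the 2D image plane, as mentioned in the last item, could follow a standard pinhole camera model. The sketch below assumes the calibration matrices `K`, `R`, and `t` are provided by the dataset; all names are illustrative rather than existing MaskBench code.

```python
# Hypothetical sketch of projecting 3D ground-truth keypoints (e.g. from a
# marker-based motion capture system) onto the 2D image plane using a pinhole
# camera model. K, R, and t would come from the dataset's calibration data.
import numpy as np


def project_to_image_plane(points_3d: np.ndarray,
                           K: np.ndarray,
                           R: np.ndarray,
                           t: np.ndarray) -> np.ndarray:
    """Project Nx3 world-space keypoints to Nx2 pixel coordinates.

    K: 3x3 intrinsic matrix, R: 3x3 rotation matrix, t: translation 3-vector.
    """
    # Transform world coordinates into the camera frame: X_cam = R @ X_world + t.
    points_cam = points_3d @ R.T + t
    # Apply the intrinsics and perform the perspective division by depth.
    projected = points_cam @ K.T
    return projected[:, :2] / projected[:, 2:3]
```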

10 Conclusion

This work introduced MaskBench, a modular and extensible benchmarking framework for evaluating pose estimation models under diverse conditions, including privacy-preserving masking strategies. We evaluated four datasets of increasing complexity, including real-world TED Talk recordings, to examine how models perform in unconstrained, natural scenarios rather than under controlled laboratory conditions. The study included popular pose estimators such as YoloPose (Redmon et al. 2016), MediaPipe (Lugaresi et al. 2019), and OpenPose (Z. Cao et al. 2019), alongside the mixture-of-expert pipeline MaskAnyone (Schilling 2024b), to assess their performance across these varied settings.

Our quantitative evaluation, using acceleration and jerk metrics to measure temporal stability, showed that the MaskAnyone pipeline, particularly the human-in-the-loop MediaPipe variant, substantially improves stability by reducing acceleration and jerk compared to standard models. YoloPose was the most robust standalone estimator, while MediaPipePose consistently exhibited the highest instability. Visual inspection of the output poses confirmed these findings, with noticeably smoother and more consistent motion in cases where the metrics indicated high stability.

Our small-scale study on masked videos suggested that blurring offers the best trade-off between privacy and accuracy, maintaining high PCK values across models, whereas pixelation and solid fills substantially degraded performance. The model-specific responses to the masking strategies indicate that pose estimation often relies more on overall body structure than on pixel-level detail.

While results are promising, limitations remain, including reliance on pseudo-ground truth in some datasets and the preliminary implementation of masked-video workflows. Future work will focus on extending MaskBench with flexible pipelining, downstream task evaluation, and user-friendly interfaces, enabling systematic exploration of how privacy-preserving transformations affect pose estimation and subsequent applications.

11 References

Allen, Cameron. 2017. “Education for All.” Youtube.com. TEDxKids@ElCajon. https://www.youtube.com/watch?v=OMbNoo4mCcI.
Berghi, Davide, Marco Volino, and Philip J. B. Jackson. 2022. “Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research.” In Proceedings of the 19th ACM SIGGRAPH European Conference on Visual Media Production. CVMP ’22. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3565516.3565522.
Burke, Tarana. 2018. “Me Too Is a Movement, Not a Moment.” Ted.com. TED Talks. https://www.ted.com/talks/tarana_burke_me_too_is_a_movement_not_a_moment.
Buzz. 2023. “‘Universe’ / ‘Statues’ / ‘Liberation’.” Ted.com. TED Talks. https://www.ted.com/talks/buzz_universe_statues_liberation.
Cao, Zhe, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” In CVPR.
Cao, Z., G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Choi, Hongsuk, Gyeongsik Moon, and Kyoung Mu Lee. 2020. “Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose.” In European Conference on Computer Vision (ECCV).
Evans, Murray, Laurie Needham, Logan Wade, Martin Parsons, Steffi Colyer, Polly McGuigan, James Bilzon, and Darren Cosker. 2024. “Synchronised Video, Motion Capture and Force Plate Dataset for Validating Markerless Human Movement Analysis.” Scientific Data 11 (1): 1300. https://doi.org/10.1038/s41597-024-04077-3.
Ho, Hsuan-I, Jie Song, and Otmar Hilliges. 2024. “SiTH: Single-View Textured Human Reconstruction with Image-Conditioned Diffusion.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Jocher, Glenn, and Jing Qiu. 2024. “Ultralytics YOLO11.” https://github.com/ultralytics/ultralytics.
Kaim, Utsav, Aryan Jaiswal, Vyoum Khare, and Manish M Parmar. 2024. “Comparison of ML Models for Posture.” International Journal of Creative Research Thoughts (IJCRT) 12 (8). https://www.ijcrt.org/papers/IJCRT2408135.pdf.
Köpüklü, Okan, Ahmet Gunduz, Neslihan Kose, and Gerhard Rigoll. 2019. “Real-Time Hand Gesture Detection and Classification Using Convolutional Neural Networks.” In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–8. https://doi.org/10.1109/FG.2019.8756576.
Lugaresi, Camillo, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, et al. 2019. “MediaPipe: A Framework for Perceiving and Processing Reality.” In Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019. https://mixedreality.cs.cornell.edu/s/NewTitle_May1_MediaPipe_CVPR_CV4ARVR_Workshop_2019.pdf.
Molchanov, Pavlo, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. 2016. “Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4207–15. https://doi.org/10.1109/CVPR.2016.456.
Ravi, Nikhila, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, et al. 2024. “SAM 2: Segment Anything in Images and Videos.” arXiv Preprint arXiv:2408.00714. https://arxiv.org/abs/2408.00714.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. “You Only Look Once: Unified, Real-Time Object Detection.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–88. https://doi.org/10.1109/CVPR.2016.91.
Saiwa. 2025. “OpenPose Vs MediaPipe: Comprehensive Comparison & Analysis.” https://saiwa.ai/blog/openpose-vs-mediapipe.
Schilling, Martin. 2024a. “MaskAnyone - the de-Identification Toolbox for Video Data.” GitHub Repository. https://github.com/MaskAnyone/MaskAnyone.
———. 2024b. “MaskAnyone: A Human Segmentation Pipeline for Pose Estimation and de-Identification in Videos.” Master's Thesis, Potsdam, Germany: Hasso Plattner Institute, University of Potsdam.
Shahidi, Yara. 2023. “Let Curiosity Lead.” TED Talks. https://www.ted.com/talks/yara_shahidi_let_curiosity_lead.
“TED.” 2025. TED Talks. https://www.ted.com/.
Yang, Cheng-Yen, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, and Jenq-Neng Hwang. 2024. “SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory.” https://arxiv.org/abs/2411.11922.