Import Camera-Based Datasets in MOT Challenge Format for Object Tracking

This example shows how to read camera image sequences and convert both ground truth and detections into Sensor Fusion and Tracking Toolbox™ formats, using a custom dataset that stores ground truth and detections in the MOT Challenge format [1, 2]. You can modify the example to use any dataset that stores ground truth in the MOT Challenge format. Before using another dataset with this example, check the dataset license to ensure you have sufficient rights to use it for your application.

Overview of MOT Challenge Dataset Format

The dataset in this example is based on a camera recording of moving pedestrians. The dataset contains the video images, annotated ground truth, detections, and video metadata. The data is organized following the 2D MOT Challenge format [1, 2]. Download the Pedestrian Tracking dataset as follows. When running a different dataset, modify the datasetName variable accordingly.

datasetName = "PedestrianTracking";
datasetURL = "";
if ~exist(datasetName,"dir")
    disp("Downloading Pedestrian Tracking dataset (350 MB)")
    unzip(datasetURL, pwd); % Download and extract the dataset archive (URL elided above)
end
Downloading Pedestrian Tracking dataset (350 MB)

The sequence images, saved in PNG format, are named sequentially with zero-padded six-digit file names under the img1 folder. The metadata text file, named seqinfo.ini, contains information such as the number of frames, the number of frames per second, the frame size, and the file extension.
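For instance, the zero-padded file name of a given frame can be built with sprintf, following the folder layout described above:

```matlab
% Build the image file name for frame 7 of the sequence.
% File names are zero-padded to six digits under the img1 folder.
frameIndex = 7;
frameFile = fullfile("PedestrianTracking", "img1", sprintf("%06d.png", frameIndex));
% frameFile is "PedestrianTracking/img1/000007.png" (separator depends on platform)
```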


The ground truth and detection files contain comma-separated values, where each line represents one object instance or one object detection. The fields are, in order: frame number, identifier, bounding box left coordinate, bounding box top coordinate, bounding box width, bounding box height, confidence score (detections) or valid flag (ground truth), class, and visibility.

Only ground truth entries contain a meaningful identifier, valid flag, class, and visibility; detections use -1 placeholders for these fields. The identifier is an integer unique to each true object across the full sequence. If a person (or an object of any other class) disappears for an extended period and later reappears, it receives a new unique identifier. The class ID is an integer between 1 and 12 with the following definitions:

  1. Pedestrian

  2. Person on vehicle

  3. Car

  4. Bicycle

  5. Motorbike

  6. Non motorized vehicle

  7. Static person

  8. Distractor

  9. Occluder

  10. Occluder on the ground

  11. Occluder full

  12. Reflection
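As a quick reference, you can map these numeric class IDs to readable labels with a string array. This is a small convenience sketch, not part of the dataset helper functions:

```matlab
% Map MOT Challenge class IDs (1-12) to readable labels.
motClassNames = ["Pedestrian", "Person on vehicle", "Car", "Bicycle", ...
    "Motorbike", "Non motorized vehicle", "Static person", "Distractor", ...
    "Occluder", "Occluder on the ground", "Occluder full", "Reflection"];

classID = 1;
disp(motClassNames(classID)) % displays "Pedestrian"
```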

The valid flag is either 0 or 1; a value of 1 means that the truth instance is an object of interest for the tracking task and its evaluation. The visibility is a fractional value from 0 (completely occluded) to 1 (fully visible). In this video sequence, visibility values are assigned manually by visual inspection.

The detection entries report a confidence score, which you can use as a parameter for the tracking task. Inspect the first ten lines of the ground truth and detection files.

dbtype(datasetName+filesep+"gt"+filesep+"gt.txt", "1:10")
1     1,1,925.61,357.04,61.68,154.38,1,1,1.0
2     2,1,939.98,355.44,61.31,160.83,1,1,1.0
3     3,1,951.04,354.69,57.98,167.60,1,1,1.0
4     4,1,979.43,353.49,69.72,175.43,1,1,1.0
5     5,1,1000.44,351.30,64.75,188.71,1,1,1.0
6     6,1,1011.07,351.82,78.49,197.55,1,1,1.0
7     7,1,1051.12,348.04,77.49,209.13,1,1,1.0
8     8,1,1086.59,351.90,73.40,220.54,1,1,1.0
9     9,1,1099.16,353.61,91.99,237.83,1,1,1.0
10    10,1,1154.00,350.00,85.00,266.00,1,1,1.0
dbtype(datasetName+filesep+"det"+filesep+"det.txt", "1:10")
1     1,-1,922.00,355.00,63.00,155.00,61.06,-1,-1
2     2,-1,8.00,370.00,41.00,100.00,8.44,-1,-1
3     2,-1,934.00,355.00,63.00,155.00,87.51,-1,-1
4     3,-1,953.00,355.00,63.00,155.00,84.78,-1,-1
5     4,-1,460.00,293.00,53.00,129.00,8.52,-1,-1
6     4,-1,984.00,354.00,69.00,169.00,88.41,-1,-1
7     5,-1,1002.00,355.00,63.00,155.00,82.14,-1,-1
8     6,-1,460.00,293.00,53.00,129.00,9.46,-1,-1
9     6,-1,1015.00,357.00,75.00,184.00,79.75,-1,-1
10    7,-1,8.00,370.00,41.00,100.00,7.23,-1,-1
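Because both files are plain comma-separated values, you can also load them directly with readmatrix and index the columns described earlier. This sketch filters the ground truth down to valid pedestrian entries:

```matlab
% Read the raw ground truth matrix: one row per object instance.
% Columns: frame, id, bb_left, bb_top, bb_width, bb_height, valid, class, visibility
gt = readmatrix(fullfile("PedestrianTracking", "gt", "gt.txt"));

% Keep only valid entries (column 7) with class Pedestrian (column 8).
isValidPedestrian = gt(:,7) == 1 & gt(:,8) == 1;
pedestrians = gt(isValidPedestrian, :);

% Bounding boxes in [left top width height] form, one row per instance.
bboxes = pedestrians(:, 3:6);
```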

The size of each bounding box is determined following the rules described in [1]. Each ground truth instance is annotated using the Video Labeler app from the Computer Vision Toolbox™. Note that the cars at the end of the street, as well as some visible shadows and reflections, are ignored. The detections are obtained with the aggregate channel features (ACF) people detector, trained on the INRIA person dataset. See the peopleDetectorACF function for more details.

Visualize Video Sequences

First, import the sequence info into a MATLAB structure using the helperReadMOTSequenceInfo function provided with this example.

sequenceInfo = helperReadMOTSequenceInfo(datasetName+filesep+"seqinfo.ini")
sequenceInfo = struct with fields:
         FrameRate: 1
    SequenceLength: 169
        ImageWidth: 1288
       ImageHeight: 964
    ImageExtension: ".png"
         ImagePath: "PedestrianTracking\img1\"
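If you prefer not to use the helper, a minimal parser for the seqinfo.ini key-value pairs could look like the sketch below. It assumes standard "key=value" lines and does not rename fields the way the helper does:

```matlab
% Minimal seqinfo.ini reader: parse "key=value" lines into a struct.
% Section headers such as [Sequence] contain no "=" and are skipped.
lines = readlines(fullfile("PedestrianTracking", "seqinfo.ini"));
info = struct();
for k = 1:numel(lines)
    tokens = split(lines(k), "=");
    if numel(tokens) == 2
        info.(strtrim(tokens(1))) = strtrim(tokens(2));
    end
end
% Numeric fields are returned as text and can be converted with
% str2double, for example str2double(info.seqLength).
```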

Write the images to a video file using the VideoWriter object. This step helps you visualize and inspect the data, and is not required for importing the dataset to perform object tracking.

if ~exist(datasetName+"Video.avi","file")
    v = VideoWriter(datasetName+"Video.avi");
    v.FrameRate = sequenceInfo.FrameRate;
    open(v);
    for i=1:sequenceInfo.SequenceLength
        frameName = sequenceInfo.ImagePath + sprintf("%06d",i) + sequenceInfo.ImageExtension;
        writeVideo(v, imread(frameName));
    end
    close(v);
end

Import Ground Truth and Detections

Next, import the ground truth data into the format used by trackCLEARMetrics (Sensor Fusion and Tracking Toolbox). MOT Challenge datasets provide ground truth for training a tracking algorithm and allow you to compute metrics such as the CLEAR metrics. The creation and evaluation of a tracker is shown in the Implement Simple Online and Realtime Tracking (Sensor Fusion and Tracking Toolbox) example.

Use the helperReadMOTGroundTruth function to convert the ground truth dataset.

truths = helperReadMOTGroundTruth(sequenceInfo);
  648×1 struct array with fields:
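Once you have tracker output in a compatible struct format, you could evaluate it against these truths with trackCLEARMetrics. The following is only a sketch: the tracks variable is a placeholder, and the property name and evaluate call should be checked against the trackCLEARMetrics documentation. See the Implement Simple Online and Realtime Tracking example for a complete workflow:

```matlab
% Hypothetical evaluation sketch: "tracks" must be a struct array whose
% fields (Time, TrackID, BoundingBox) mirror the imported truth format.
tcm = trackCLEARMetrics(SimilarityThreshold = 0.5);
% results = evaluate(tcm, tracks, truths); % returns a table of CLEAR metrics
```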


Next, convert the detections into the objectDetection (Sensor Fusion and Tracking Toolbox) format. You can use this format as input to the multi-object trackers in the Sensor Fusion and Tracking Toolbox™. The helperReadMOTDetection function copies the bounding box information from each entry into the Measurement field of an objectDetection object, and uses the frame number and frame rate to fill in the Time property of each detection. The MOT Challenge format reserves class information for the ground truth, so each detection keeps the default ObjectClassID value of 0. The ObjectAttributes field stores the score of each detection as a structure. SensorIndex, ObjectClassParameters, MeasurementParameters, and MeasurementNoise take their default values; you may need to specify these properties when using the detections with a tracker.

detections = helperReadMOTDetection(sequenceInfo);
  740×1 objectDetection array with properties:

  objectDetection with properties:

                     Time: 0
              Measurement: [922 355 63 155]
         MeasurementNoise: [4×4 double]
              SensorIndex: 1
            ObjectClassID: 0
    ObjectClassParameters: []
    MeasurementParameters: {}
         ObjectAttributes: [1×1 struct]
    Score: 61.0600
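To illustrate what the helper produces, here is how a single det.txt row could be converted by hand into an objectDetection. The field choices mirror the description above; the measurement noise is left at its default:

```matlab
% Convert one detection row: frame 1, bounding box [922 355 63 155], score 61.06.
frame = 1; bbox = [922 355 63 155]; score = 61.06;
time = (frame - 1) / sequenceInfo.FrameRate;      % the first frame is at time 0
det = objectDetection(time, bbox, ...
    ObjectClassID = 0, ...                        % class is reserved for ground truth
    ObjectAttributes = struct("Score", score));
```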

Visualize Ground Truth and Detections

Annotate each frame with the bounding boxes of the ground truth and detection data. Use the helperAnnotateGroundTruth and helperAnnotateDetection functions to extract the frame annotation information.

showDetections = false;
showGroundTruth = true;

reader = VideoReader(datasetName+"Video.avi");
groundTruthHistoryDuration = 3/sequenceInfo.FrameRate; % Time persistence (s) of ground truth trajectories
pastTruths = [];
for i=1:sequenceInfo.SequenceLength
    % Find truths and detections in the i-th frame
    time = (i-1)/sequenceInfo.FrameRate;
    curDets = detections(ismembertol([detections.Time],time));
    curTruths = truths(ismembertol([truths.Time],time));

    frame = readFrame(reader);
    if showDetections
        frame = helperAnnotateDetection(frame, curDets);
    end
    if showGroundTruth
        frame = helperAnnotateGroundTruth(frame, curTruths, pastTruths);
    end
    imshow(frame); % Display the annotated frame
    pastTruths = [pastTruths;curTruths]; %#ok<AGROW>
    % Keep only the recent trajectory history for display
    pastTruths = pastTruths([pastTruths.Time] >= time - groundTruthHistoryDuration);
end


In this example, you learned how to import ground truth and detection data saved in the MOT Challenge format into MATLAB. You also visualized the bounding boxes of truths and detections while writing the images to a video file.


[1] Milan, Anton, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. "MOT16: A benchmark for multi-object tracking." arXiv preprint arXiv:1603.00831 (2016).
[2] Leal-Taixé, Laura, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. "MOTChallenge 2015: Towards a benchmark for multi-target tracking." arXiv preprint arXiv:1504.01942 (2015).