Audio-Video Tracking Dataset

We present a new dataset for object tracking using both sound and video data. The dataset consists of three sequences of audio-video data, collected with the DualCam device in both indoor and outdoor scenarios:

1. Drone Sequence

2. Voice Sequence

3. Motorbike Sequence

The aim is to show the potential of acoustic images for target tracking in three challenging scenarios. In particular, the audio-based approach proposed in the paper outperforms, often dramatically, visual tracking with state-of-the-art algorithms, dealing efficiently with occlusions, abrupt variations in visual appearance, and camouflage. These results pave the way to widespread use of acoustic imaging in application scenarios such as security and surveillance.


The complete dataset (3 GB) can be downloaded [ Here ]


Audio data, acquired by the microphones embedded in the DualCam device, are processed with a Filter-and-Sum beamforming algorithm, which generates a three-dimensional Multispectral Acoustic Image, a function of two spatial directions and temporal frequency. Each value in the resulting 3D structure represents the power spectrum at a given frequency bin for the sound arriving from a given spatial location. The final 2D Acoustic Map, which encodes the sound energy for each spatial location, is computed by integrating each power spectrum along the frequency axis.

For each sequence we provide three folders containing the video frames in VGA resolution, acquired by the embedded video camera, and the audio data, acquired by the microphone array, in two formats, each corresponding to a different processing stage:

  • Multispectral Acoustic Image
  • Acoustic Map
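
The relationship between the two formats can be illustrated with a minimal Python sketch. It is not the code used to produce the dataset: the array layout (rows, columns, frequency bins), the function name `acoustic_map`, and the synthetic data are assumptions made for illustration, and the integration along frequency is approximated by a plain sum over bins.

```python
import numpy as np

def acoustic_map(multispectral, freq_axis=-1):
    """Collapse a 3D multispectral acoustic image into a 2D energy map.

    Assumed layout (hypothetical): (rows, cols, n_freq_bins), each value
    being the power in one frequency bin for one spatial direction.
    """
    multispectral = np.asarray(multispectral, dtype=float)
    if multispectral.ndim != 3:
        raise ValueError("expected a 3D array: two spatial axes plus frequency")
    # Discrete integration over frequency: sum the power of every bin.
    return multispectral.sum(axis=freq_axis)

# Synthetic example: a 4 x 5 grid of directions with 8 frequency bins.
rng = np.random.default_rng(0)
image = rng.random((4, 5, 8))
image[1, 3, :] += 10.0  # simulate a strong source in one direction

energy = acoustic_map(image)
# Location of the most energetic direction, as plain Python integers.
peak = tuple(int(i) for i in np.unravel_index(np.argmax(energy), energy.shape))
print(energy.shape, peak)
```

In this toy example the peak of the 2D map falls on the direction where the synthetic source was placed, which is the kind of cue a tracker can use.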

Each sequence folder also includes a MATLAB script that lets users load a multispectral acoustic image and combine it with its corresponding video frame into a single RGB image. For a detailed description of the DualCam sensor and the dataset, please refer to the paper [ Download pdf ].
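
For readers who do not use MATLAB, the idea of such an overlay can be sketched in Python. This is not a port of the supplied script, whose details are not described here. The function name `overlay_acoustic_map`, the 12 x 16 map size, the nearest-neighbour upsampling, and the red tint are all assumptions chosen for illustration, and the input is assumed to be an acoustic map already collapsed over frequency.

```python
import numpy as np

def overlay_acoustic_map(frame_rgb, acoustic_map, alpha=0.5):
    """Blend a low-resolution acoustic map onto an RGB frame (H x W x 3, uint8)."""
    frame = np.asarray(frame_rgb, dtype=float)
    amap = np.asarray(acoustic_map, dtype=float)
    height, width = frame.shape[:2]

    # Normalise energy to [0, 1]; a constant map becomes all zeros.
    span = amap.max() - amap.min()
    norm = (amap - amap.min()) / span if span > 0 else np.zeros_like(amap)

    # Nearest-neighbour upsampling of the map to the frame resolution.
    rows = np.arange(height) * amap.shape[0] // height
    cols = np.arange(width) * amap.shape[1] // width
    upsampled = norm[rows[:, None], cols[None, :]]

    # Per-pixel blending weight: the stronger the sound, the stronger the tint.
    weight = (alpha * upsampled)[..., None]
    red = np.array([255.0, 0.0, 0.0])
    blended = (1.0 - weight) * frame + weight * red
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)

# Synthetic data: a uniform grey VGA frame (640 x 480) and a hypothetical
# 12 x 16 acoustic map with a single active direction.
frame = np.full((480, 640, 3), 100, dtype=np.uint8)
amap = np.zeros((12, 16))
amap[3, 4] = 1.0
out = overlay_acoustic_map(frame, amap)
print(out.shape, out.dtype)
```

Pixels under the active map cell are tinted red while the rest of the frame is left unchanged. A real overlay would also need the geometric calibration between the microphone array and the camera, which this sketch ignores.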



  • A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, V. Murino
    "Seeing the Sound: a New Multimodal Imaging device for Computer Vision"
    3D Reconstruction and Understanding with Video and Sound (ICCV Workshop), 2015 [PDF]