Audio-Visually Indicated Actions Dataset

We introduce a new multimodal dataset comprising visual data (RGB images) and acoustic data: raw audio signals acquired from 128 microphones, and multispectral acoustic images generated by a filter-and-sum beamforming algorithm. The dataset consists of 378 audio-visual sequences, each between 30 and 60 seconds long, depicting different people individually performing a set of actions that produce a characteristic sound. The provided visual and acoustic images are aligned in space and synchronized in time.


We acquired the dataset using the acoustic-optical camera described in [2]. The sensor captures both audio and video data using a planar array of 128 low-cost digital MEMS microphones and a video camera placed at the center of the device. The data provided by the sensor consists of RGB video frames of 640x480 pixels, raw audio data from the 128 microphones sampled at 12 kHz, and 36x48x12 compressed multispectral acoustic images. The acoustic images are obtained from the raw audio signals of all the microphones using a beamforming algorithm, and they summarize the per-direction audio information in the frequency domain.
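To make the idea of an acoustic image more concrete, the following is a minimal sketch of a frequency-domain delay-and-sum beamformer that turns one frame of 128-channel audio into a 36x48x12 energy map. It is not the filter-and-sum algorithm used by the device, only a simpler relative of it. The microphone positions, field of view, speed of sound, frame length, and equal-width band split are all assumptions made for illustration, and the helper names (`steering_grid`, `simulate_plane_wave`, `acoustic_image`) are hypothetical.

```python
import numpy as np

FS = 12_000      # sampling rate stated in the text (Hz)
N_MICS = 128     # number of microphones stated in the text
C = 343.0        # speed of sound in air (m/s); assumed


def steering_grid(rows=36, cols=48, fov_deg=(60.0, 80.0)):
    """Unit look directions for a rows x cols grid; the field of view is assumed."""
    el = np.deg2rad(np.linspace(-fov_deg[0] / 2, fov_deg[0] / 2, rows))
    az = np.deg2rad(np.linspace(-fov_deg[1] / 2, fov_deg[1] / 2, cols))
    el, az = np.meshgrid(el, az, indexing="ij")
    # x points right, y up, z out of the array plane.
    return np.stack(
        [np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)], axis=-1
    )


def simulate_plane_wave(mic_xyz, direction, n_samples=256, seed=0):
    """Synthetic broadband far-field source, returned as (n_mics, n_samples)."""
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / FS)
    source = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
    source[0] = source[-1] = 0.0  # drop DC and Nyquist so the spectrum maps to a real signal
    advance = mic_xyz @ direction / C  # seconds by which each mic leads the array origin
    spec = source[None, :] * np.exp(2j * np.pi * freqs[None, :] * advance[:, None])
    return np.fft.irfft(spec, n=n_samples, axis=1)


def acoustic_image(frame, mic_xyz, dirs, n_bands=12):
    """Delay-and-sum energy map of shape (rows, cols, n_bands) from one audio frame."""
    n_mics, n_samples = frame.shape
    rows, cols = dirs.shape[:2]
    spec = np.fft.rfft(frame, axis=1)                 # (n_mics, n_freqs)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / FS)
    advance = mic_xyz @ dirs.reshape(-1, 3).T / C     # (n_mics, n_pixels)
    band_of_bin = (np.arange(freqs.size) * n_bands) // freqs.size
    image = np.zeros((rows * cols, n_bands))
    for k, f in enumerate(freqs):
        weights = np.exp(-2j * np.pi * f * advance)   # undo each mic's phase lead
        beam = spec[:, k] @ weights / n_mics          # one complex value per pixel
        image[:, band_of_bin[k]] += np.abs(beam) ** 2
    return image.reshape(rows, cols, n_bands)


# Hypothetical geometry: 128 mics scattered over a 0.5 m x 0.5 m plane (z = 0).
rng = np.random.default_rng(1)
mic_xyz = np.column_stack([rng.uniform(-0.25, 0.25, (N_MICS, 2)), np.zeros(N_MICS)])
dirs = steering_grid()
frame = simulate_plane_wave(mic_xyz, dirs[10, 30])
image = acoustic_image(frame, mic_xyz, dirs)
peak = np.unravel_index(image.sum(axis=2).argmax(), image.shape[:2])
print(image.shape, peak)
```

In this toy setup the brightest pixel of the band-summed map falls at the grid cell the synthetic source was placed in, which is the sense in which an acoustic image encodes per-direction audio energy. The actual dataset images are compressed and come from the device's own processing, so their values will not match this sketch.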

The acquisition involved 9 people performing 14 different actions, recorded in 3 different scenarios with varying noise conditions. We chose the following set of actions so that their associated sounds would be as distinct as possible from each other and indicative of the executed action:

  1. Clapping
  2. Snapping fingers
  3. Speaking
  4. Whistling
  5. Playing kendama
  6. Clicking
  7. Typing
  8. Knocking
  9. Hammering
  10. Peanut breaking
  11. Paper ripping
  12. Plastic crumpling
  13. Paper shaking
  14. Stick dropping

To introduce further intra-class variability, we recorded in three different scenarios with increasing noise levels: 1) an anechoic chamber, 2) an open space area, and 3) an outdoor terrace.
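The stated total of 378 sequences is consistent with one recording per combination of person, action, and scenario (9 x 14 x 3 = 378). The short sketch below enumerates those combinations and derives the range of total recording time from the 30 to 60 second sequence length. The one-sequence-per-combination layout and the subject identifiers are assumptions for illustration and do not describe the actual file naming of the dataset.

```python
from itertools import product

ACTIONS = [
    "clapping", "snapping fingers", "speaking", "whistling", "playing kendama",
    "clicking", "typing", "knocking", "hammering", "peanut breaking",
    "paper ripping", "plastic crumpling", "paper shaking", "stick dropping",
]
SCENARIOS = ["anechoic chamber", "open space area", "outdoor terrace"]
SUBJECTS = [f"subject_{i:02d}" for i in range(1, 10)]  # hypothetical IDs for the 9 people

# Assumption: exactly one sequence per (subject, action, scenario) combination.
sequences = list(product(SUBJECTS, ACTIONS, SCENARIOS))
print(len(sequences))  # 378

# Total duration bounds, given that each sequence lasts between 30 and 60 seconds.
min_hours = len(sequences) * 30 / 3600
max_hours = len(sequences) * 60 / 3600
print(f"between {min_hours:.2f} and {max_hours:.2f} hours of recordings")
```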

Speaking in the anechoic chamber
Hammering in the open space area
Playing kendama on the terrace

For a detailed description of the problem and the dataset, please refer to our paper [1].

How to get the dataset

For researchers and educators who wish to use this dataset for non-commercial research and/or educational purposes, we provide access through our site under certain conditions and terms. To obtain this dataset, please follow the steps below:

  1. Download and fill in the request form.
  2. Send it to the contact email address listed on our site, using the subject [AV-Actions Dataset]. Note: send the email from an address that is linked to your research institution or university.
  3. Wait for the credentials, which typically arrive within a couple of days.
  4. Download the dataset using this link.
  5. Finally, please remember to cite our paper [1] if you use this dataset:
        title={Audio-Visual Model Distillation Using Acoustic Images},
        author={P\'erez, Andr\'es F. and Sanguineti, Valentina and Morerio, Pietro and Murino, Vittorio},


The code is available on GitHub at


[1] A. F. Pérez, V. Sanguineti, P. Morerio, and V. Murino. Audio-Visual Model Distillation Using Acoustic Images. arXiv, 2019.

[2] A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, and V. Murino. Seeing the sound: A new multimodal imaging device for computer vision. ICCVW, 2015.