We introduce a new multimodal dataset comprising visual data, in the form of RGB images, and acoustic data, in the form of both raw audio signals acquired by 128 microphones and multispectral acoustic images generated with a filter-and-sum beamforming algorithm. The dataset consists of 378 audio-visual sequences, each lasting between 30 and 60 seconds, depicting different people individually performing a set of actions that produce a characteristic sound. The provided visual and acoustic images are aligned in space and synchronized in time.
We acquired the dataset using the acoustic-optical camera described in [2]. The sensor captures both audio and video using a planar array of 128 low-cost digital MEMS microphones and a video camera placed at the center of the device. The data provided by the sensor consists of RGB video frames of 640x480 pixels, raw audio from the 128 microphones sampled at 12 kHz, and compressed 36x48x12 multispectral acoustic images, obtained from the raw signals of all the microphones by beamforming, which summarize the per-direction audio information in the frequency domain.
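To make the beamforming step concrete, the following Python/NumPy sketch builds a 36x48x12 multispectral acoustic image from one frame of 128-channel audio with a frequency-domain filter-and-sum beamformer. It is only a minimal illustration: the array geometry, field of view, frame length, and band edges below are assumptions, and the sensor's actual pipeline is the one described in [2].

import numpy as np

# Constants stated in the dataset description; the remaining values are assumptions.
FS = 12000          # sampling frequency (Hz)
N_MICS = 128        # number of MEMS microphones
H, W = 36, 48       # spatial resolution of the acoustic image
N_BANDS = 12        # number of frequency bands per pixel
FRAME = 1024        # audio samples per acoustic-image frame (assumption)
C = 343.0           # speed of sound (m/s)

rng = np.random.default_rng(0)

# Hypothetical planar array geometry (meters); the real layout is device-specific.
mic_xy = rng.uniform(-0.2, 0.2, size=(N_MICS, 2))

# One frame of raw multichannel audio; in practice this comes from the dataset.
audio = rng.standard_normal((N_MICS, FRAME))

# FFT of each microphone signal (the "filter" part runs in the frequency domain).
spec = np.fft.rfft(audio, axis=1)                  # (N_MICS, FRAME//2 + 1)
freqs = np.fft.rfftfreq(FRAME, d=1.0 / FS)         # (FRAME//2 + 1,)

# Steering directions: a grid of azimuth/elevation angles (field of view assumed).
az = np.deg2rad(np.linspace(-45, 45, W))
el = np.deg2rad(np.linspace(-35, 35, H))

# 12 log-spaced frequency bands up to Nyquist (band edges are an assumption).
band_edges = np.logspace(np.log10(100), np.log10(FS / 2), N_BANDS + 1)

acoustic_image = np.zeros((H, W, N_BANDS))
for i, e in enumerate(el):
    for j, a in enumerate(az):
        # Unit direction vector projected onto the array plane.
        u = np.array([np.sin(a) * np.cos(e), np.sin(e)])
        delays = mic_xy @ u / C                    # per-microphone delay (s)
        # Phase-align (steer) each microphone, then sum across the array.
        steered = spec * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        beam = steered.sum(axis=0)                 # (FRAME//2 + 1,)
        power = np.abs(beam) ** 2
        # Summarize the beamformed spectrum into the 12 bands.
        for b in range(N_BANDS):
            sel = (freqs >= band_edges[b]) & (freqs < band_edges[b + 1])
            acoustic_image[i, j, b] = power[sel].sum()

print(acoustic_image.shape)  # (36, 48, 12)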
The dataset was recorded with the participation of 9 people performing 14 different actions in 3 different scenarios with varying noise conditions. We chose the set of actions so that their associated sounds were as distinct as possible from each other and indicative of the action being executed: (1) Clapping; (2) Snapping fingers; (3) Speaking; (4) Whistling; (5) Playing kendama; (6) Clicking; (7) Typing; (8) Knocking; (9) Hammering; (10) Peanut breaking; (11) Paper ripping; (12) Plastic crumpling; (13) Paper shaking; and (14) Stick dropping.
To introduce further intra-class variability, we recorded in three different scenarios with increasing noise levels: 1) an anechoic chamber, 2) an open space area, and 3) an outdoor terrace.
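For convenience, the action and scenario labels above can be encoded as simple lookup tables, for example in Python. The integer IDs follow the numbering in this description and are illustrative; they are not necessarily the dataset's official annotation format.

# Hypothetical label maps mirroring the lists above; the dataset's own
# file naming / annotation scheme may differ.
ACTIONS = {
    1: "Clapping", 2: "Snapping fingers", 3: "Speaking", 4: "Whistling",
    5: "Playing kendama", 6: "Clicking", 7: "Typing", 8: "Knocking",
    9: "Hammering", 10: "Peanut breaking", 11: "Paper ripping",
    12: "Plastic crumpling", 13: "Paper shaking", 14: "Stick dropping",
}

SCENARIOS = {1: "anechoic chamber", 2: "open space area", 3: "outdoor terrace"}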
For a detailed description of the problem and the dataset, please refer to our paper [1].
For researchers and educators who wish to use this dataset for non-commercial research and/or educational purposes, we provide access through our site under certain terms and conditions. To obtain the dataset, please follow the steps below:
Send an email to pavistech with the subject [AV-Actions Dataset] (note: the email must be sent from an address linked to your research institution/university).
Wait to receive your credentials, which typically takes a couple of days.
If you use this dataset, please cite our paper:

@article{perez2019,
title={Audio-Visual Model Distillation Using Acoustic Images},
author={P\'erez, Andr\'es F. and Sanguineti, Valentina and Morerio, Pietro and Murino, Vittorio},
journal={arXiv},
year={2019}
}
[1] A. F. Perez, V. Sanguineti, P. Morerio, and V. Murino, "Audio-Visual Model Distillation Using Acoustic Images," arXiv, 2019.
[2] A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, and V. Murino, "Seeing the Sound: A New Multimodal Imaging Device for Computer Vision," ICCVW, 2015.