We introduce a new multimodal dataset comprising visual data, in the form of RGB images, and acoustic data, in the form of both raw audio signals acquired by 128 microphones and multispectral acoustic images generated with a filter-and-sum beamforming algorithm. The dataset consists of 378 audio-visual sequences, each lasting between 30 and 60 seconds, depicting different people individually performing a set of actions that produce a characteristic sound. The provided visual and acoustic images are aligned in space and synchronized in time.
We acquired the dataset using the acoustic-optical camera described in [2]. The sensor captures both audio and video using a planar array of 128 low-cost digital MEMS microphones and a video camera placed at the center of the device. The data provided by the sensor consists of RGB video frames of 640x480 pixels, raw audio from the 128 microphones sampled at 12 kHz, and compressed 36x48x12 multispectral acoustic images, obtained from the raw signals of all the microphones by beamforming, which summarize the per-direction audio information in the frequency domain.
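To make the beamforming step concrete, the following Python/NumPy sketch builds a 36x48x12 multispectral acoustic image from one frame of 128-channel audio with a frequency-domain filter-and-sum beamformer. It is only a minimal illustration: the array geometry, field of view, frame length, and band edges below are assumptions, and the sensor's actual pipeline is the one described in [2].

import numpy as np

# Constants stated in the dataset description; the remaining values are assumptions.
FS = 12000          # sampling frequency (Hz)
N_MICS = 128        # number of MEMS microphones
H, W = 36, 48       # spatial resolution of the acoustic image
N_BANDS = 12        # number of frequency bands per pixel
FRAME = 1024        # audio samples per acoustic-image frame (assumption)
C = 343.0           # speed of sound (m/s)

rng = np.random.default_rng(0)

# Hypothetical planar array geometry (meters); the real layout is device-specific.
mic_xy = rng.uniform(-0.2, 0.2, size=(N_MICS, 2))

# One frame of raw multichannel audio; in practice this comes from the dataset.
audio = rng.standard_normal((N_MICS, FRAME))

# FFT of each microphone signal (the "filter" part runs in the frequency domain).
spec = np.fft.rfft(audio, axis=1)                  # (N_MICS, FRAME//2 + 1)
freqs = np.fft.rfftfreq(FRAME, d=1.0 / FS)         # (FRAME//2 + 1,)

# Steering directions: a grid of azimuth/elevation angles (field of view assumed).
az = np.deg2rad(np.linspace(-45, 45, W))
el = np.deg2rad(np.linspace(-35, 35, H))

# 12 log-spaced frequency bands up to Nyquist (band edges are an assumption).
band_edges = np.logspace(np.log10(100), np.log10(FS / 2), N_BANDS + 1)

acoustic_image = np.zeros((H, W, N_BANDS))
for i, e in enumerate(el):
    for j, a in enumerate(az):
        # Unit direction vector projected onto the array plane.
        u = np.array([np.sin(a) * np.cos(e), np.sin(e)])
        delays = mic_xy @ u / C                    # per-microphone delay (s)
        # Phase-align (steer) each microphone, then sum across the array.
        steered = spec * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        beam = steered.sum(axis=0)                 # (FRAME//2 + 1,)
        power = np.abs(beam) ** 2
        # Summarize the beamformed spectrum into the 12 bands.
        for b in range(N_BANDS):
            sel = (freqs >= band_edges[b]) & (freqs < band_edges[b + 1])
            acoustic_image[i, j, b] = power[sel].sum()

print(acoustic_image.shape)  # (36, 48, 12)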
The dataset was recorded with the participation of 9 people performing 14 different actions in 3 different scenarios with varying noise conditions. We chose the set of actions so that their associated sounds were as distinct as possible from each other and indicative of the action being executed: (1) Clapping; (2) Snapping fingers; (3) Speaking; (4) Whistling; (5) Playing kendama; (6) Clicking; (7) Typing; (8) Knocking; (9) Hammering; (10) Peanut breaking; (11) Paper ripping; (12) Plastic crumpling; (13) Paper shaking; and (14) Stick dropping.
To introduce further intra-class variability, we recorded in three different scenarios with increasing noise levels: 1) an anechoic chamber, 2) an open space area, and 3) an outdoor terrace.
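For convenience, the action and scenario labels above can be encoded as simple lookup tables, for example in Python. The integer IDs follow the numbering in this description and are illustrative; they are not necessarily the dataset's official annotation format.

# Hypothetical label maps mirroring the lists above; the dataset's own
# file naming / annotation scheme may differ.
ACTIONS = {
    1: "Clapping", 2: "Snapping fingers", 3: "Speaking", 4: "Whistling",
    5: "Playing kendama", 6: "Clicking", 7: "Typing", 8: "Knocking",
    9: "Hammering", 10: "Peanut breaking", 11: "Paper ripping",
    12: "Plastic crumpling", 13: "Paper shaking", 14: "Stick dropping",
}

SCENARIOS = {1: "anechoic chamber", 2: "open space area", 3: "outdoor terrace"}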
For a detailed description of the problem and the dataset, please refer to our paper [1].
For researchers and educators who wish to use this dataset for non-commercial research and/or educational purposes, we provide access through our site under certain terms and conditions. To obtain the dataset, please follow the steps below:
Send an email to pavistech with the subject [AV-Actions Dataset] (note: the email must be sent from an address linked to your research institution/university).
Wait to receive your credentials, which typically takes a couple of days.
If you use this dataset, please cite our paper:

@article{perez2019,
title={Audio-Visual Model Distillation Using Acoustic Images},
author={P\'erez, Andr\'es F. and Sanguineti, Valentina and Morerio, Pietro and Murino, Vittorio},
journal={arXiv},
year={2019}
}
[1] A. F. Perez, V. Sanguineti, P. Morerio, and V. Murino, "Audio-Visual Model Distillation Using Acoustic Images," arXiv, 2019.
[2] A. Zunino, M. Crocco, S. Martelli, A. Trucco, A. Del Bue, and V. Murino, "Seeing the Sound: A New Multimodal Imaging Device for Computer Vision," ICCVW, 2015.