We acquired the dataset using the acoustic-optical camera described in [2]. The sensor captures both audio and video data using a planar array of 128 low-cost digital MEMS microphones and a video camera placed at the center of the device. The data provided by the sensor consists of RGB video frames of 640x480 pixels, raw audio from the 128 microphones sampled at 12 kHz, and 36x48x12 compressed multispectral acoustic images computed from the raw signals of all the microphones with a beamforming algorithm, which summarize the per-direction audio information in the frequency domain.
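As a reference for the data layout, the following minimal sketch shows how one synchronized sample could be represented in Python; the array shapes follow the sensor description above, while the container class, dtypes, and the example values are assumptions for illustration only and do not reflect the released file format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AcousticOpticalSample:
    """Hypothetical container for one synchronized sample from the sensor."""
    video_frame: np.ndarray     # RGB frame, shape (480, 640, 3), uint8
    raw_audio: np.ndarray       # 128 microphone signals, shape (128, n_samples), 12 kHz
    acoustic_image: np.ndarray  # multispectral acoustic image, shape (36, 48, 12)

# Placeholder sample with the expected shapes (1 s of audio at 12 kHz).
sample = AcousticOpticalSample(
    video_frame=np.zeros((480, 640, 3), dtype=np.uint8),
    raw_audio=np.zeros((128, 12000), dtype=np.float32),
    acoustic_image=np.zeros((36, 48, 12), dtype=np.float32),
)
```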
For the acquisition, 9 people participated, each performing 14 different actions recorded in 3 different scenarios with varying noise conditions. We chose the set of actions so that their associated sounds were as distinct as possible from each other and indicative of the action being performed: (1) Clapping; (2) Snapping fingers; (3) Speaking; (4) Whistling; (5) Playing kendama; (6) Clicking; (7) Typing; (8) Knocking; (9) Hammering; (10) Peanut breaking; (11) Paper ripping; (12) Plastic crumpling; (13) Paper shaking; and (14) Stick dropping.
To introduce further intra-class variability, we recorded in three different scenarios with increasing noise levels: 1) an anechoic chamber, 2) an open space area, and 3) an outdoor terrace.
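For convenience, the 14 action classes and 3 scenarios listed above can be mapped to integer labels as in the sketch below; the specific names and label ordering are an assumption for illustration and may differ from the encoding used in the released dataset.

```python
# Hypothetical label mappings for the action classes and recording scenarios.
ACTIONS = [
    "clapping", "snapping_fingers", "speaking", "whistling",
    "playing_kendama", "clicking", "typing", "knocking",
    "hammering", "peanut_breaking", "paper_ripping",
    "plastic_crumpling", "paper_shaking", "stick_dropping",
]
SCENARIOS = ["anechoic_chamber", "open_space", "outdoor_terrace"]

ACTION_TO_ID = {name: i for i, name in enumerate(ACTIONS)}
SCENARIO_TO_ID = {name: i for i, name in enumerate(SCENARIOS)}
```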
For a detailed description of the problem and the dataset, please refer to our paper [1].