Arun Balajiee

PhD Student, Intelligent Systems Program, University of Pittsburgh

Detection and Classification of Acoustic Scenes and Events

09 Nov 2020 - Arun Balajiee

Speaker: Tuomas Virtanen

Talk Date: 11/09/2020

This talk covered research extensions from the speaker's work on the DCASE Challenge, and Dr. Virtanen discussed his group's work on its six tasks. He started by presenting their work on classifying different environments from the audio recorded in a scene, that is, predicting the type of location from the sounds generated there, such as an airport, shopping mall, metro station, street, or public square. The process involved annotating the sounds in the training scenes before using the trained model to classify real-world recordings. The model was also compared against human listeners and seemed to perform on par with or better than them. One limitation of this comparison was that the human test subjects were not completely sure about some of the sounds, owing to personal biases and difficulty distinguishing acoustically similar scenes such as trains vs. trams or forests vs. parks.

In another application, Dr. Virtanen talked about scene annotation through sound event detection, for example identifying in a video a person talking on the phone, a car door opening or closing, a dog barking, or a car engine starting. Building on such sequences of events, Dr. Virtanen plans to extend the work using RNNs and Seq2Seq models that can predict a sequence of events, such as a phone ringing followed by a person talking. The third aspect Dr. Virtanen discussed was using sound to localize different objects and their movements in a scene. This was achieved by training a model on audio clips captured by several microphones placed at different locations in the scene, providing the spatial, multichannel cues needed for localization. All of these models follow a neural approach, built specifically on the principles of CNNs and subsampling.
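As a rough illustration of that last point, the sketch below shows what a minimal CNN scene classifier over log-mel spectrograms could look like. This is only a toy example written in PyTorch under my own assumptions (the label set, layer sizes, and input shape are mine); it is not the speaker's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical label set, loosely following the scene types mentioned in the talk.
SCENE_LABELS = ["airport", "shopping_mall", "metro_station",
                "street", "public_square", "park"]

class SceneCNN(nn.Module):
    """Toy CNN over log-mel spectrograms: stacked convolutions with pooling
    (the 'subsampling' mentioned in the talk), then a linear classifier."""
    def __init__(self, n_classes=len(SCENE_LABELS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),   # subsample along frequency and time
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                 # x: (batch, 1, n_mels, n_frames)
        h = self.features(x)
        h = h.mean(dim=(2, 3))            # global average pooling
        return self.classifier(h)         # unnormalised scores per scene class

if __name__ == "__main__":
    model = SceneCNN()
    clips = torch.randn(8, 1, 64, 500)    # 8 fake clips, 64 mel bands, 500 frames
    print(model(clips).shape)             # torch.Size([8, 6])
```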

The final aspect, and the most interesting of them all, was captioning videos based on their audio. For this they used a large crowd-sourced dataset of audio clips that is openly available to the public. They then annotated these clips using a three-stage Amazon Mechanical Turk (MTurk) survey. In the first stage, the MTurk workers wrote captions for the audio clips. In the second stage, the captions were validated and corrected, and in the third stage the captions for each clip were ranked from best to worst. Using this annotated dataset, they trained a model for audio/video captioning and achieved mixed results, with the model underperforming when a clip was noisy or contained sounds easily confused with other sounds.
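To make the captioning setup concrete, here is a minimal encoder-decoder sketch in the same spirit: one recurrent network summarises the audio features, and a second one generates the caption word by word. Again, this is my own simplified assumption of how such a system could be wired up, not the architecture presented in the talk.

```python
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    """Toy encoder-decoder captioner: a GRU encodes log-mel frames into a
    clip-level state, and a second GRU decodes that state into words."""
    def __init__(self, n_mels=64, vocab_size=5000, hidden=256, emb=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, emb)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, mels, captions):
        # mels: (batch, n_frames, n_mels); captions: (batch, n_words) token ids
        _, state = self.encoder(mels)             # state summarises the whole clip
        dec_out, _ = self.decoder(self.embed(captions), state)
        return self.out(dec_out)                  # per-step vocabulary scores

if __name__ == "__main__":
    model = AudioCaptioner()
    mels = torch.randn(4, 500, 64)                # 4 fake clips
    caps = torch.randint(0, 5000, (4, 12))        # 4 fake 12-token captions
    print(model(mels, caps).shape)                # torch.Size([4, 12, 5000])
```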

Overall, it was an interesting presentation, with a large body of research on audio annotation and captioning presented in a way that a person without much background in the field could follow and still appreciate its depth. The work involved a wide range of software tools, programming, machine learning, and logistics such as recording instruments and equipment. The goal throughout was to deploy models that work in real-world scenarios, which is the novelty of their work.