Thesis details | Informática

Publication date: 26/03/2025

Examining board:

Name	Role
ALBERTO FERREIRA DE SOUZA	Internal Examiner
CLAUDINE SANTOS BADUE	Chair
FRANCISCO DE ASSIS BOLDT	Co-advisor
THIAGO MEIRELES PAIXÃO	External Examiner

Summary: This work investigates how two specialized neural networks, a speech transcription model (Whisper) and a general audio captioning model (Prompteus), can be jointly leveraged to process mixed audio inputs containing both speech and non-speech events. We construct the Clotho Voice dataset by merging speech recordings from the Common Voice 5.1 corpus with general sounds from the Clotho 2.1 dataset. Through a series of controlled experiments, we examine how each model’s performance degrades when speech and background sounds overlap. Results show that Whisper excels at transcription when speech dominates the input signal, but its accuracy diminishes in the presence of substantial non-speech noise. Conversely, Prompteus achieves high FENSE scores in purely background-oriented settings but loses descriptive capability as speech levels increase. We also show how preprocessing steps such as normalization and resampling affect borderline cases, revealing that subtle audio features are crucial for robust event detection in challenging acoustic environments. Our findings underscore the importance of tailored training and data augmentation strategies to mitigate performance loss in mixed-audio scenarios. By integrating the complementary strengths of speech-focused and background-focused models, we offer a pathway toward more comprehensive audio understanding systems suitable for noisy, real-world applications, including industrial automation and assistive technologies. This research paves the way for hybrid frameworks that capture both spoken language and context-rich environmental cues in a single, unified approach.
