Name: JOAO VITOR RORIZ DA SILVA
Publication date: 26/03/2025
Examining board:
| Name | Role |
|---|---|
| ALBERTO FERREIRA DE SOUZA | Internal Examiner |
| CLAUDINE SANTOS BADUE | President |
| FRANCISCO DE ASSIS BOLDT | Co-advisor |
| THIAGO MEIRELES PAIXÃO | External Examiner |
Summary: This work investigates how two specialized neural networks, a speech transcription model (Whisper) and a general audio captioning model (Prompteus), can be jointly leveraged to process mixed audio inputs containing both speech and non-speech events. We construct the Clotho Voice dataset by merging speech recordings from the Common Voice 5.1 corpus with general sounds from the Clotho 2.1 dataset. Through a series of controlled experiments, we examine how each model's performance degrades when presented with overlapping speech and background sounds. Results show that Whisper excels at transcription when speech dominates the input signal, yet its accuracy diminishes in the presence of substantial non-speech noise. Conversely, Prompteus achieves high FENSE scores in purely background-oriented settings but exhibits a decline in descriptive capability as speech levels increase. We also highlight how preprocessing steps such as normalization and resampling impact borderline cases, revealing that subtle audio features are crucial for robust event detection in challenging acoustic environments. Our findings underscore the importance of tailored training and data augmentation strategies to mitigate performance loss in mixed audio scenarios. By integrating the complementary strengths of speech-focused and background-focused models, we offer a pathway toward more comprehensive audio understanding systems suitable for noisy, real-world applications, including industrial automation and assistive technologies. This research paves the way for developing hybrid frameworks that capture both spoken language and context-rich environmental cues in a single, unified approach.
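For illustration only, the sketch below shows one way a Common Voice speech clip and a Clotho background clip might be combined after resampling and peak normalization, in the spirit of the mixing procedure summarized above. The file names, the 16 kHz target rate, and the weighting scheme are assumptions for this example, not the dissertation's actual pipeline.

```python
# Hypothetical example: mix a speech clip with a background-sound clip.
# Paths, the 16 kHz rate, and the 0.7 speech weight are illustrative assumptions.
import numpy as np
import librosa
import soundfile as sf

TARGET_SR = 16000  # assumed common sample rate (Whisper operates on 16 kHz audio)

def load_normalized(path: str, sr: int = TARGET_SR) -> np.ndarray:
    """Load an audio file as mono, resample it to a common rate, and peak-normalize it."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def mix(speech: np.ndarray, background: np.ndarray, speech_weight: float) -> np.ndarray:
    """Overlay speech on background; speech_weight in [0, 1] controls how much speech dominates."""
    length = max(len(speech), len(background))
    speech = np.pad(speech, (0, length - len(speech)))
    background = np.pad(background, (0, length - len(background)))
    mixture = speech_weight * speech + (1.0 - speech_weight) * background
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 0 else mixture

if __name__ == "__main__":
    speech = load_normalized("common_voice_clip.wav")   # hypothetical Common Voice 5.1 clip
    background = load_normalized("clotho_clip.wav")     # hypothetical Clotho 2.1 clip
    sf.write("mixed_clip.wav", mix(speech, background, 0.7), TARGET_SR)
```

Sweeping the speech weight from 0 to 1 would reproduce, in spirit, the controlled degradation experiments described in the summary: transcription quality for Whisper should improve as the weight rises, while captioning quality for Prompteus should improve as it falls.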
