I’m thrilled to announce that two of my recent submissions have been accepted for presentation at ICSigSys 2025. Both papers push the envelope in speech processing, blending semi-supervised learning, domain adaptation, and cross-lingual storytelling to tackle real-world challenges. Here’s a closer look at each paper and what makes it special.
Semi-Supervised Acoustic Scene Classification with Label Smoothing and Hard Samples Identification
Acoustic scene classification (ASC) remains a cornerstone task in environmental audio understanding. In this work, we introduce a semi-supervised framework that leverages large amounts of unlabeled audio while focusing the model’s attention on the most informative samples.
- We employ label smoothing to soften the target distribution, reducing overconfidence in noisy or ambiguous audio segments.
- We design a hard sample identification strategy that dynamically selects challenging clips during training, guiding the model to learn discriminative features more robustly (a minimal sketch of both ingredients follows this list).
- Our experiments on standard ASC benchmarks demonstrate a consistent performance boost over fully supervised baselines, especially under limited labeled data regimes.
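To make these two ingredients concrete, here is a minimal PyTorch sketch, assuming a standard classification setup: a label-smoothed cross-entropy combined with a simple "keep the highest-loss clips" selector. The smoothing value, keep ratio, and loss-based selection rule are illustrative assumptions, not the exact recipe from the paper.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels: torch.Tensor, num_classes: int,
                     smoothing: float = 0.1) -> torch.Tensor:
    """Soften one-hot targets: the true class gets 1 - smoothing,
    the remaining mass is spread evenly over the other classes."""
    off_value = smoothing / (num_classes - 1)
    targets = torch.full((labels.size(0), num_classes), off_value,
                         device=labels.device)
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - smoothing)
    return targets

def hard_sample_loss(logits: torch.Tensor, labels: torch.Tensor,
                     num_classes: int, keep_ratio: float = 0.5,
                     smoothing: float = 0.1) -> torch.Tensor:
    """Label-smoothed cross-entropy, back-propagated only through the
    hardest (highest-loss) fraction of the batch."""
    targets = smoothed_targets(labels, num_classes, smoothing)
    per_sample = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1)
    k = max(1, int(keep_ratio * per_sample.size(0)))
    hard_losses, _ = torch.topk(per_sample, k)  # keep only the hardest clips
    return hard_losses.mean()

# Usage: loss = hard_sample_loss(model(batch_audio), batch_labels, num_classes=10)
```

The semi-supervised branch (pseudo-labelling the unlabeled pool and feeding it through the same selector) is omitted here for brevity; this sketch only illustrates how the two loss-side ingredients fit together.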
By integrating these elements, our model adapts more gracefully to new acoustic conditions and requires fewer manual annotations. I look forward to sharing detailed analyses of embedding drift, class confusion matrices, and ablation studies during the conference.
Indonesian Folklore Storytelling in Japanese Language with Text-to-Speech
Cross-lingual storytelling opens up rich cultural exchanges, but high-quality narration across languages is still under-explored. This paper presents a text-to-speech (TTS) system that brings Indonesian folklore into Japanese, preserving narrative style and emotional nuance.
- We start with a Japanese TTS backbone fine-tuned on expressive speech corpora to capture intonation and rhythm.
- We build a lightweight text conversion pipeline that maps Indonesian story scripts to Japanese text, retaining metaphorical and cultural references.
- We evaluate the generated speech with both objective metrics (e.g., mel-cepstral distortion, MCD) and subjective listening tests, showing high naturalness and emotional congruence; a small MCD sketch follows this list.
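For readers unfamiliar with MCD, the snippet below shows the conventional frame-averaged computation in NumPy. It assumes the reference and generated mel-cepstral sequences have already been time-aligned (e.g., by dynamic time warping) and that the 0th (energy) coefficient is excluded, which simplifies the full evaluation pipeline.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_gen: np.ndarray) -> float:
    """Frame-averaged MCD in dB between two time-aligned mel-cepstral sequences.

    Both inputs have shape (frames, coefficients); frame counts must match,
    so align the sequences (e.g., with DTW) before calling this.
    """
    assert mc_ref.shape == mc_gen.shape, "align the sequences first"
    diff = mc_ref[:, 1:] - mc_gen[:, 1:]                  # drop the 0th (energy) coefficient
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))  # per-frame cepstral distance
    return float((10.0 / np.log(10.0)) * per_frame.mean())
```

Lower MCD is better, but it captures spectral closeness rather than expressiveness, which is why the listening tests remain the primary evidence for naturalness and emotional congruence.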
This work bridges two rich oral traditions and showcases how speech technologies can make cultural content accessible across language barriers. I’m excited to demo sample audio clips and discuss potential extensions to other language pairs.
Looking Ahead
- Expand the hard sample identification strategy to multilingual acoustic scenes.
- Incorporate emotion labels into the semi-supervised ASC framework for more nuanced predictions.
- Generalize the folklore TTS pipeline to handle low-resource languages with minimal parallel data.
I’m grateful to my co-authors and colleagues at NAIST for their support and valuable discussions. If you’ll be at ICSigSys 2025, please stop by our sessions—we’d love to hear your feedback and explore collaborations.
Stay tuned for preprint links, code releases, and audio demos. Your insights will help shape the next phase of this research journey!