Abstract: We propose a unified framework for low-resource automatic speech recognition (ASR) tasks named meta audio concatenation (MAC). It is easy to implement and can be carried out in extremely low-resource environments. Mathematically, we give a clear description of the MAC framework from the perspective of Bayesian sampling. We propose a broad notion of a meta audio set for the concatenative synthesis text-to-speech system, to meet the modeling needs of different languages and different scenarios when using the system. With a proper meta audio set, language pronunciation rules can be integrated easily. This also helps reduce the difficulty of forced alignment, improve the diversity of synthesized audio, and solve the out-of-vocabulary (OOV) problem in synthesis. Extensive experiments demonstrate the effectiveness of MAC on low-resource ASR tasks. In the Cantonese, Taiwanese, and Japanese ASR tasks, MAC reduces the character error rate (CER) by more than 15% under the CTC greedy search, CTC prefix, attention, and attention rescoring decoding modes. Furthermore, MAC beats wav2vec2 (with fine-tuning) on the Cantonese Common Voice dataset and achieves highly competitive results on the Taiwanese and Japanese Common Voice datasets.
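To make the idea of a meta audio set more concrete, here is a minimal sketch of concatenative synthesis in Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `concatenate_units`, the dictionary layout (pronunciation unit mapped to a list of recorded variants), the romanized unit labels, and the optional silence gap are all hypothetical. Picking a random variant per unit hints at how diversity could arise, and building an utterance from units rather than whole words hints at why unseen words need not be out of vocabulary.

```python
import random


def concatenate_units(units, meta_audio_set, rng=None, gap_samples=0):
    """Join one randomly chosen recording per unit into a single waveform.

    units          -- sequence of pronunciation-unit labels (hypothetical format)
    meta_audio_set -- dict: unit label -> list of recordings (lists of floats)
    rng            -- optional random.Random for reproducible sampling
    gap_samples    -- number of zero samples inserted between adjacent units
    """
    if rng is None:
        rng = random.Random()
    waveform = []
    for index, unit in enumerate(units):
        if unit not in meta_audio_set:
            raise KeyError(f"unit {unit!r} is missing from the meta audio set")
        if index > 0:
            # Assumed detail: a short silence between units.
            waveform.extend([0.0] * gap_samples)
        # Sampling among variants gives different audio for the same text.
        waveform.extend(rng.choice(meta_audio_set[unit]))
    return waveform


# Toy meta audio set: two variants of one unit, one variant of another.
toy_meta_audio_set = {
    "nei5": [[0.1, 0.2], [0.15, 0.25, 0.2]],
    "hou2": [[-0.3, -0.1, 0.0]],
}

example = concatenate_units(
    ["nei5", "hou2"], toy_meta_audio_set, rng=random.Random(0), gap_samples=1
)
```

In a real pipeline the recordings would be audio arrays cut from aligned speech, and the synthesized waveforms would be paired with their text as additional ASR training data; both steps are outside this sketch.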