Publications

Interspeech 2020, Shanghai

Title : Exploration of End-to-End Synthesisers for Zero Resource Speech Challenge 2020

Authors : Karthik Pandia D.S., Anusha Prakash, Mano Ranjith Kumar M , Hema A. Murthy

Abstract: A spoken dialogue system for an unseen language is referred to as zero resource speech. It is especially beneficial for developing applications for languages that have low digital resources. Zero resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training. Using the acoustic unit sequence, TTS models are trained. The main goal of this work is to improve the synthesis quality of the zero resource TTS system. Four different systems are proposed. All the systems consist of three stages: unit discovery, followed by unit sequence to spectrogram mapping, and finally spectrogram to speech inversion. Modifications are proposed to the spectrogram mapping stage. These modifications include training the mapping on voice data, using x-vectors to improve the mapping, two-stage learning, and gender-specific modelling. Evaluation of the proposed systems in the Zerospeech 2020 challenge shows that good-quality synthesis can be achieved.

Interspeech 2020, Shanghai

Title : A Hybrid HMM-Waveglow Based Text-to-Speech Synthesizer Using Histogram Equalization for Low Resource Indian Languages

Authors : Mano Ranjith Kumar M , Sudhanshu Srivastava, Anusha Prakash, Hema A. Murthy

Abstract: Conventional text-to-speech (TTS) synthesis requires extensive linguistic processing for producing quality output. The advent of end-to-end (E2E) systems has caused a paradigm shift, with better synthesized voices. However, hidden Markov model (HMM) based systems are still popular due to their fast synthesis time, robustness to less training data, and flexible adaptation of voice characteristics, speaking styles, and emotions. This paper proposes a technique that combines the classical parametric HMM-based TTS framework (HTS) with the neural-network-based Waveglow vocoder using histogram equalization (HEQ) in a low resource environment. The two paradigms are combined by performing HEQ across mel-spectrograms extracted from HTS generated audio and source spectra of training data. During testing, the synthesized mel-spectrograms are mapped to the source spectrograms using the learned HEQ. Experiments are carried out on the Hindi male and female datasets of the Indic TTS database. Systems are evaluated based on degradation mean opinion scores (DMOS). Results indicate that the synthesis quality of the hybrid system is better than that of the conventional HTS system. These results are promising, as they pave the way to good quality TTS systems with less data compared to E2E systems.
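
As an illustration of the HEQ idea in the abstract above, here is a minimal sketch of histogram equalization done as per-bin quantile mapping between two sets of mel-spectrogram frames. It is not the paper's implementation: the function names `fit_heq` and `apply_heq`, the number of quantiles, and the choice to equalize each mel bin independently are all assumptions made for the example.

```python
import numpy as np

def fit_heq(synth, natural, n_quantiles=101):
    """Learn quantile tables for each mel bin (hypothetical helper).

    synth, natural: arrays of shape (frames, n_mels), e.g. frames from
    HTS-generated audio and from the original training recordings.
    """
    q = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(synth, q, axis=0)    # shape (n_quantiles, n_mels)
    tgt_q = np.quantile(natural, q, axis=0)  # shape (n_quantiles, n_mels)
    return src_q, tgt_q

def apply_heq(mel, src_q, tgt_q):
    """Map each value to the natural-speech value at the same quantile."""
    mel = np.asarray(mel, dtype=float)
    out = np.empty_like(mel)
    for b in range(mel.shape[1]):
        # Piecewise-linear lookup: synthetic quantile -> natural quantile.
        out[:, b] = np.interp(mel[:, b], src_q[:, b], tgt_q[:, b])
    return out
```

In this sketch the tables would be learned once from training data and then applied to every synthesized mel-spectrogram before it is passed to the vocoder.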

Speech Synthesis Workshop'11 (2021), Hungary

Title : Lipsyncing efforts for transcreating lecture videos in Indian languages

Authors : Mano Ranjith Kumar M , Jom Kuriakose, Hema A. Murthy

Abstract: This paper proposes a novel lip-syncing module for the transcreation of lecture videos from English to Indian languages. The audio from the lecture is transcribed using automatic speech recognition. The text is translated and manually curated before and after translation to avoid mistakes. The curated text is synthesized using the Indian language end-to-end based text-to-speech synthesis systems. The synthesized audio and video are out-of-sync. This paper attempts to automate this process of producing video lectures lip-synced into Indian languages using different techniques. Lip-syncing an educational video with the Indian language audio is challenging owing to (a) the duration of Indian language audio being considerably longer or shorter than that of the original audio, and (b) the extempore speech causing the audio in the source videos to have long silences. Any modification to the speed of audio can be unpleasant to listeners. The proposed system non-uniformly re-samples the video to ensure better lip-syncing. The novelty of this paper is in the use of HMM-GMM alignments in tandem with syllable segmentation using group delay, as visemes are closer to syllables. The proposed lip-syncing techniques are evaluated using subjective evaluation methods. Results indicate that accurate alignment at the syllable level is crucial for lip-syncing.
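
The non-uniform video re-sampling mentioned in the abstract above can be sketched as a piecewise-linear time warp between matching boundaries. This is only an illustration under stated assumptions, not the paper's method: the function `frame_map` is hypothetical, and the boundary times are assumed to be matching syllable (or segment) boundaries already obtained from some alignment step, given in seconds and strictly increasing.

```python
import numpy as np

def frame_map(src_bounds, tgt_bounds, fps=25.0):
    """Pick a source video frame for every output frame (hypothetical helper).

    src_bounds: boundary times in the original audio/video.
    tgt_bounds: the matching boundary times in the synthesized audio.
    Both lists must have the same length and be increasing.
    """
    src = np.asarray(src_bounds, dtype=float)
    tgt = np.asarray(tgt_bounds, dtype=float)
    n_out = int(round(tgt[-1] * fps))        # output length follows the new audio
    out_times = np.arange(n_out) / fps
    # Warp each output time back onto the source timeline, segment by segment.
    src_times = np.interp(out_times, tgt, src)
    return np.floor(src_times * fps + 1e-9).astype(int)
```

For example, if a segment lasts 1 s in the source and 2 s in the synthesized audio, each source frame in that segment is shown twice, while segments of equal duration are left untouched. The audio speed is never modified.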
