Microsoft has launched Indian language Speech Corpus - IT In Your Hand

Speech has become important for localising experiences in areas such as natural language processing, computer vision, and domain-specific sciences. At the same time, according to Microsoft, adequate digital data for text, speech, and linguistic resources is scarce, particularly for languages that are less dominant than English or Hindi. This creates the need for a speech dataset like the Microsoft Speech Corpus.


Microsoft on Thursday launched the Microsoft Indian Language Speech Corpus, a package of conversational and phrasal speech training and test data for Telugu, Tamil, and Gujarati. Claimed to be the largest publicly available Indian language speech dataset, the package includes audio and corresponding transcripts. It is aimed at helping researchers and academia build Indian language speech recognition for applications where speech is required. The dataset is provided through the Microsoft Research Open Data initiative, and the collection is available for free.

"We believe India's increasing digital literacy needs to be supported by a multi-lingual digital world," said Sundar Srinivasan, General Manager, Artificial Intelligence and Research, Microsoft India, in a press statement. "Microsoft Indian Language Speech Corpus is an extension of our on-going efforts to reduce language barriers and empower Indians to harness the full potential of the Internet. Using our technology expertise, we want to accelerate innovation in voice-based computing for India by supporting researchers and academia."

At Interspeech 2018 in Hyderabad, Microsoft tested its Indian Language Speech Corpus. Participants in a Low Resource Speech Recognition Challenge used data from the package to build their automatic speech recognition (ASR) systems and develop new speech recognition models. A baseline system was provided to participants, both as a starting point and as a reference against which to compare their own systems.
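Comparing a system against a baseline usually comes down to scoring both on the same test transcripts. The article does not say which metric the challenge used; the sketch below assumes word error rate (WER), a common measure for ASR, and the function name and sample strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only.
reference = "a b c d"
baseline_output = "a x c"      # one substitution, one deletion
candidate_output = "a b c"     # one deletion

baseline_wer = word_error_rate(reference, baseline_output)
candidate_wer = word_error_rate(reference, candidate_output)
print(f"baseline WER:  {baseline_wer:.2f}")
print(f"candidate WER: {candidate_wer:.2f}")
```

A lower WER means fewer word-level errors, so under this assumption a candidate system improves on the baseline when its WER on the shared test set is smaller.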

The Microsoft Indian Language Speech Corpus is said to address differences in enunciation, accent, diction, and slang that are common across various regions of India. Its audio and corresponding transcripts are meant to help researchers and developers build speech recognition systems without needing to be linguistic experts in these languages. The package can be accessed for free directly from the Microsoft Research Open Data site.
