Keynote: Sharon Goldwater


Abstract

Speech and language processing have advanced enormously in the last decade, with successful applications like machine translation, voice-activated search, and even language-enabled personal assistants. Yet these systems typically still rely on learning from very large quantities of human-annotated data. Because the methods are so resource-intensive, effective technology is available for only a tiny fraction of the world's roughly 7,000 languages, mainly those spoken in large, wealthy countries.

This talk describes our recent work on developing *unsupervised* speech technology, where transcripts and pronunciation dictionaries are not used. The work is inspired by considering both how young infants may begin to acquire the sounds and words of their language, and how we might develop systems to help linguists analyze and document endangered languages. I will first present work on learning from speech audio alone, where the system must learn to segment the speech stream into word tokens and cluster repeated instances of the same word together to learn a lexicon of vocabulary items. The approach combines Bayesian and neural network methods to address learning at the word and sub-word levels. Time permitting, I will also discuss some preliminary work on learning from speech together with text translations (as in a language documentation scenario).
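To make the unsupervised term-discovery task concrete, the toy sketch below (not the speaker's actual model; all function names and parameters are illustrative assumptions) shows the two steps the abstract describes: segmenting a sequence of speech features into word-like chunks, then clustering repeated chunks into a small lexicon. Real systems learn the segment boundaries and use neural acoustic word embeddings; here both are crudely approximated with fixed-length segments and mean pooling.

```python
# Toy sketch of unsupervised word discovery: segment, embed, cluster.
# Hypothetical illustration only, not the method presented in the talk.

import numpy as np


def segment(features: np.ndarray, boundaries: list[int]) -> list[np.ndarray]:
    """Split a (frames x dims) feature matrix at the given frame boundaries."""
    edges = [0] + sorted(boundaries) + [len(features)]
    return [features[a:b] for a, b in zip(edges, edges[1:]) if b > a]


def embed(segment_feats: np.ndarray) -> np.ndarray:
    """Fixed-dimensional embedding of a variable-length segment
    (mean pooling stands in for a learned acoustic word embedding)."""
    return segment_feats.mean(axis=0)


def kmeans(X: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Cluster segment embeddings; each cluster id acts as a lexicon entry."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats = rng.normal(size=(300, 13))                   # stand-in for MFCC frames
    hypothesised_boundaries = list(range(20, 300, 20))   # fake word boundaries
    segments = segment(feats, hypothesised_boundaries)
    embeddings = np.stack([embed(s) for s in segments])
    lexicon_ids = kmeans(embeddings, k=5)
    print("discovered word-type id per segment:", lexicon_ids)
```

In the work described in the talk, the segmentation and clustering are learned jointly rather than fixed in advance, with Bayesian and neural components handling the word and sub-word levels respectively.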