Speech and language processing for resource-poor languages

Katrin Kirchhoff

Dept. of Electrical Engineering
University of Washington, Seattle

Abstract
Current statistical algorithms for speech and natural language processing can achieve surprisingly good performance, but they require large amounts of training data. In many cases, large data sets or data annotations for the language of interest are not available, for instance because the language has few speakers or because it is an oral language without a fixed writing standard. Developing robust models for such applications is not only a practical engineering challenge but also an interesting machine learning problem: how can satisfactory classification accuracy for complex, structured prediction problems be achieved with only small amounts of (labeled) data?

This talk presents novel techniques for two well-known NLP problems, statistical language modeling and part-of-speech tagging. It introduces factored language models, which combine a feature-vector-style representation of words with a generalized backoff scheme for probability estimation; this approach is shown to significantly improve performance on sparse-data tasks compared to standard language models. The part-of-speech tagging problem is addressed within a transductive learning framework that exploits the combination of labeled and unlabeled data, achieving competitive performance while considerably reducing the need for labeled data.
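To make the generalized backoff idea concrete, the following is a minimal illustrative sketch (not the actual implementation from the talk): each word is represented as a bundle of factors, here a hypothetical (surface form, part-of-speech) pair, and the conditional probability of the next word is estimated from the richest available context, backing off to coarser factor combinations when counts are sparse.

```python
from collections import defaultdict

def train(corpus):
    """corpus: list of (word, pos) factor bundles.
    Collect next-word counts for contexts of decreasing richness:
    (word, pos) of the previous token, then pos alone, then empty."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(corpus, corpus[1:]):
        word, pos = prev
        counts[("W", word, "P", pos)][cur[0]] += 1  # full factor context
        counts[("P", pos)][cur[0]] += 1             # backoff: POS only
        counts[()][cur[0]] += 1                     # backoff: unigram
    return counts

def prob(counts, prev, word):
    """Back off along the path (word, pos) -> (pos,) -> () and return
    the relative frequency from the first context with evidence."""
    w, p = prev
    for ctx in [("W", w, "P", p), ("P", p), ()]:
        total = sum(counts[ctx].values())
        if total > 0 and counts[ctx][word] > 0:
            return counts[ctx][word] / total
    return 0.0

# Toy corpus of factor bundles (invented for illustration).
corpus = [("the", "DET"), ("dog", "N"), ("runs", "V"),
          ("a", "DET"), ("dog", "N"), ("barks", "V")]
counts = train(corpus)
print(prob(counts, ("the", "DET"), "dog"))  # full context observed -> 1.0
print(prob(counts, ("an", "DET"), "dog"))   # unseen word, backs off to DET context
```

The point of the generalization is that, unlike standard n-gram backoff, which can only drop whole words from the end of the history, a factored model may drop individual factors in several possible orders, so an unseen surface word can still contribute its part-of-speech (or stem, morphological class, etc.) to the prediction.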


Created by: Anke Weinberger (2005-11-08).
Maintained by: Anke Weinberger (2005-11-08).