Representation learning of linguistic structures with neural networks
In the first part of the thesis, I explore the explicit modelling of word creation and reuse in the context of open-vocabulary language modelling. I propose a neural network augmented with a hierarchical structure and a memory component that explicitly models the generation of new words and supports their frequent reuse. The research question is whether this explicit modelling assumption improves language modelling performance over an implicit model that uses no linguistic structure. The model is evaluated on language modelling performance (i.e. held-out perplexity) across typologically diverse languages and compared with a character-level neural language model that does not explicitly represent any linguistic structure. The results show that the proposed explicit model improves language modelling performance in all tested languages, and an analysis demonstrates that the model uses the memory architecture appropriately.

In the second part, I extend the open-vocabulary language model to discover word-like structures without any supervision of word boundaries. In contrast to previous work on word segmentation and language modelling, which focuses on either structure discovery or language modelling alone, the hypothesis is that it is possible to learn good predictive distributions of language while simultaneously discovering good explicit structures. The proposed model therefore combines the benefits of explicit and implicit modelling by parameterizing an explicit probabilistic model with neural networks. The proposal includes a differentiable learning algorithm that efficiently marginalizes over all possible segmentation decisions, and a regularization method that is crucial for successful structure induction and language modelling. The model is evaluated on both language modelling performance (i.e. held-out perplexity) and the quality of the induced word structures (i.e. precision metrics against a human reference). The results show that the proposed model improves language modelling performance over neural language models and discovers word-like units better than Bayesian word segmentation models. Moreover, conditioning on visual context improves performance on both tasks.

In the last part, I present a method to discover acoustic structures implicitly from raw audio signals and show that the model can learn useful representations from large-scale, real-world data. The aim is to learn representations that are robust to domain shifts (e.g. from read English to spoken English) and generalize well across many languages. Since the structures are not induced explicitly, the representations are evaluated by their impact on downstream speech recognition tasks, which predict the phonetic structure of utterances. The results show that representations learned from diverse and noisy data yield significant improvements on speech recognition tasks in terms of performance, data efficiency, and robustness. Moreover, the representations generalize well to many languages, including tonal and low-resource languages.
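The marginalization over all possible segmentations described in the second part can be computed with a forward-style dynamic program that sums, at each position, over every way the last segment could have ended there. The sketch below is a minimal illustration of that idea, not the thesis's implementation: `log_p_segment` is a hypothetical stand-in for the neural segment model, and `max_len` caps the segment length so the recursion stays efficient.

```python
import math


def log_marginal(chars, log_p_segment, max_len=4):
    """Forward algorithm: log-probability of `chars` summed over all
    segmentations into segments of length at most `max_len`.

    `log_p_segment(seg, j)` returns the log-probability of emitting
    segment `seg` starting at position `j`; here it is a hypothetical
    stand-in for a learned neural segment model.
    """
    T = len(chars)
    alpha = [float("-inf")] * (T + 1)
    alpha[0] = 0.0  # log-probability of the empty prefix
    for t in range(1, T + 1):
        # One term per possible start position j of the segment ending at t.
        terms = [
            alpha[j] + log_p_segment(chars[j:t], j)
            for j in range(max(0, t - max_len), t)
        ]
        # Numerically stable log-sum-exp over those terms.
        m = max(terms)
        alpha[t] = m + math.log(sum(math.exp(x - m) for x in terms))
    return alpha[T]
```

As a sanity check, with a toy segment model that assigns probability 0.5 per character regardless of segmentation, a length-3 string has four segmentations of probability 0.125 each, so the marginal is 0.5. Because every step is a sum (log-sum-exp) rather than a hard choice, the whole quantity is differentiable in the segment model's parameters, which is what permits end-to-end training.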