A robust, unsupervised approach is presented for learning word order models from treebanks for languages with variable degrees of word order freedom. The approach is based on an extension of profile Hidden Markov Models (pHMMs; Durbin et al. 1998). A pHMM builds a model that represents similarities across the structures it is trained over; when trained using syntactic trees, we obtain a model of what constituents may appear in different (relative) positions. This represents a generalization of topological field models (Hoehle 1983), one with which we can model restrictions on and possible variations in word order.
This approach is evaluated using treebanks for English (Wall Street Journal), Dutch (Corpus of Spoken Dutch), German (NEGRA), and Czech (Prague Dependency Treebank). We show that it is possible to define typological differences in word order freedom along a continuous scale of entropy distance intervals (EDI); this is in contrast to the binary or ternary categories of traditional typological research (e.g., Steele 1978).
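The abstract does not define how EDI is computed, so the following is only a minimal sketch of the general idea that entropy can place word order freedom on a continuous scale. It assumes each clause is reduced to a flat sequence of constituent labels and measures the mean Shannon entropy of the labels observed at each position. The function name `positional_entropy`, the labels, and the toy data are hypothetical. This is neither the pHMM-based model nor the EDI metric itself.

```python
import math
from collections import Counter, defaultdict


def positional_entropy(sequences):
    """Mean Shannon entropy (in bits) of the labels seen at each position.

    Illustrative proxy only: NOT the pHMM model or the EDI metric.
    """
    by_position = defaultdict(Counter)
    for seq in sequences:
        for index, label in enumerate(seq):
            by_position[index][label] += 1
    if not by_position:
        return 0.0

    entropies = []
    for counts in by_position.values():
        total = sum(counts.values())
        # Shannon entropy of the label distribution at this position.
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)


# Hypothetical toy data: one fixed order versus four attested orders.
rigid = [["SUBJ", "VERB", "OBJ"]] * 4
free = [
    ["SUBJ", "VERB", "OBJ"],
    ["OBJ", "VERB", "SUBJ"],
    ["SUBJ", "OBJ", "VERB"],
    ["OBJ", "SUBJ", "VERB"],
]

print(positional_entropy(rigid))
print(positional_entropy(free))
```

On this toy data the fixed-order sample has zero entropy and the variable-order sample has a higher value, which is the kind of graded contrast a continuous scale is meant to capture.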
Such a continuous scale for defining degree of word order freedom has interesting consequences for real-time language comprehension. The more rigid the word order of a given language along the EDI scale (e.g., English), the more susceptible it should be to increased processing difficulty when noncanonical orders are encountered. Conversely, the processing of languages located towards the freer end of the EDI continuum (e.g., Czech) should be affected to a lesser extent (or not at all) by noncanonical orders.
To test this prediction, acceptability-rating and self-paced reading experiments on German and Czech are currently in progress; they examine the effect of noncanonical word order on processing difficulty. The EDI scale predicts that noncanonical order in German will be less acceptable (even when an appropriate discourse context is present) than noncanonical order in Czech.
In sum, a metric is proposed for quantifying word order freedom across languages, and it is argued that this metric can correctly predict the different degrees of dispreference for noncanonical order in languages conventionally classified as "free word order" languages.
Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis. Cambridge University Press.
Tilmann Hoehle. 1983. Topologische Felder. PhD thesis, University of Cologne.
Geert-Jan M. Kruijff. 2001. A Categorial-Modal Logical Architecture of Informativity: Dependency Grammar Logic and Information Structure. PhD thesis, Charles University, Prague, Czech Republic.
Susan Steele. 1978. Word order variation: A typological study. In Joseph H. Greenberg, editor, Universals of Language, Volume 4: Syntax, pages 585-624. Stanford University Press, Stanford, CA.