Learning Efficient Disambiguation
Khalil Sima'an

Abstract:

It is widely recognized that the resolution of the syntactic ambiguity of 
natural language grammars demands resources that are usually considered 
extra-linguistic, e.g. discourse, cultural and geographical preferences and 
world-knowledge. Performance models of natural language parsing provide 
mechanisms for modeling the effects of these extra-linguistic factors in 
ambiguity resolution. Most existing performance models acquire probabilistic 
grammars from tree-banks that represent language use in limited domains. In 
the probabilistic approach to ambiguity resolution, a language is considered 
a probability distribution over word-sequences, derivations and parse-trees. 
The probabilities are acquired from a tree-bank (a multi-set of 
sentence-analysis pairs) and are assumed to represent the relative 
plausibility, or grammaticality, of parse-trees and utterances given that 
tree-bank. Among the existing performance models, the Data Oriented Parsing 
(DOP) model represents a memory-based approach to natural language parsing. 
The DOP model casts a whole tree-bank, which is assumed to represent the 
language experience of an adult in some domain of language use, into a 
probabilistic grammar called a Stochastic Tree-Substitution Grammar (STSG).
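
To make the probabilistic machinery concrete, the following equations sketch 
the fragment probabilities of the standard DOP1 instantiation of an STSG; 
the abstract does not spell this instantiation out, so the formulas are 
given as an illustrative assumption. A fragment (subtree) t receives a 
probability proportional to its tree-bank frequency among the fragments 
with the same root label, a derivation d multiplies the probabilities of 
the fragments it substitutes into one another, and a parse-tree T sums over 
all derivations that yield it:

    P(t) = count(t) / \sum_{t' : root(t') = root(t)} count(t')

    P(d = t_1 \circ t_2 \circ ... \circ t_n) = \prod_{i=1}^{n} P(t_i)

    P(T) = \sum_{d derives T} P(d)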

The efficiency of human cognitive capacities is often considered a hallmark 
of human intelligence. A remarkable fact about contemporary performance 
models is their entrenched inefficiency. In principle, existing performance 
models of parsing are based on an exhaustive search of the space of possible 
analyses for whatever input utterance they encounter. It is evident that the 
actual and worst-case time- and space-complexities of these models are 
(almost) independent of the frequencies of individual utterances in the 
tree-bank. In particular, existing models do not account for two appealing 
properties of human language processing:

1. utterances in specific contexts, typical of limited domains of language 
use, are usually less ambiguous than Broad-Coverage Grammars (BCGs) and 
tree-bank annotations assume, and

2. more frequent utterances are usually processed (e.g. understood) more 
efficiently.

We observe that the first property concerns samples of utterances that are 
typically considered limited domains of language use. Hence, this property 
must be available in the form of statistical biases in samples of limited 
domains of language use, e.g. tree-banks. A major theoretical hypothesis of 
the present dissertation states that when these statistical biases are 
exploited within an Information-Theoretic framework of learning in order to 
acquire the first property above, the second property is acquired as a 
side-effect.
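
Purely as an illustration of such statistical biases (the dissertation's own 
Information-Theoretic measures are not reproduced in this abstract, and all 
names below are hypothetical), the bias of a domain tree-bank can be 
quantified as the entropy of the analyses observed for a recurring 
word-sequence; low entropy means the domain strongly prefers a few analyses, 
which is exactly what a specialization algorithm can exploit:

    from collections import Counter
    from math import log2

    def analysis_entropy(treebank):
        """Entropy (bits) of the analyses observed per word-sequence in a
        tree-bank of (sentence, parse) pairs; lower entropy indicates a
        stronger statistical bias toward few analyses."""
        analyses = {}
        for sentence, parse in treebank:
            analyses.setdefault(sentence, Counter())[parse] += 1
        entropies = {}
        for sentence, counts in analyses.items():
            total = sum(counts.values())
            entropies[sentence] = -sum(
                (c / total) * log2(c / total) for c in counts.values()
            )
        return entropies

    # Hypothetical toy domain: one analysis clearly dominates.
    toy = ([("flight to Boston", "NP(N,PP)")] * 9
           + [("flight to Boston", "NP(N,P,NP)")])
    print(analysis_entropy(toy))  # {'flight to Boston': ~0.469 bits}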

The present dissertation addresses issues of complexity and efficiency of 
probabilistic disambiguation in general and under the DOP model in 
particular. It starts out by presenting proofs that some problems of 
probabilistic disambiguation are intractable (NP-hard) under existing 
performance models. Some of these problems underlie some of the most 
relevant technological applications, such as speech-understanding. Clearly, 
these proofs concern worst-case complexities that typically assume 
independence from the properties of individual inputs. Therefore, the 
dissertation also provides empirical evidence further supporting the 
hypothesis that contemporary performance models do not account for the 
efficiency properties of human language processing in limited domains, 
specifically the two properties mentioned above. Throughout the 
dissertation, the DOP model serves as a representative of contemporary 
performance models.

The dissertation then studies solutions to the inefficiency of performance 
models in general and the DOP model in particular. The principal idea for 
removing these sources of inefficiency is to incorporate the desired 
efficiency properties of human behavior in limited domains of language use, 
such as the properties mentioned above, into existing performance models. 
These properties can be incorporated into a performance model through the 
combination of two methods of learning from a domain-specific tree-bank: 
1) an off-line method that constrains the recognition-power and the 
ambiguity of the linguistic annotation of the tree-bank in order to 
specialize it for the domain, and 2) an on-line performance model that 
acquires less ambiguous and more efficient probabilistic grammars from that 
"specialized tree-bank". With this sketch as a point of departure, this 
dissertation provides a theoretical, computational and empirical study of 
both on-line and off-line learning of ambiguity resolution in the context of 
the DOP model. To this end:

 * it presents a framework for specializing performance models, especially 
   the DOP model, and Broad-Coverage Grammars to limited domains of language 
   use. The goal of the automatic specialization algorithms is to minimize 
   the ambiguity of the specialized grammar while retaining sufficient 
   coverage of the domain. Ambiguity-Reduction Specialization (ARS) takes 
   place off-line, on a tree-bank that is representative of a limited domain 
   of language use, resulting in a specialized tree-bank. A Specialized DOP 
   (SDOP) model is then acquired from this specialized tree-bank. The 
   dissertation discusses various algorithms based on Information-Theoretic 
   and other measures of ambiguity and coverage.

 * it presents deterministic polynomial-time algorithms for parsing and 
   disambiguation under the DOP model for various tasks, such as sentence 
   disambiguation and word-graph (speech-recognizer output) disambiguation. 
   Crucially, these algorithms have a time complexity that is linear in the 
   size of the STSG; a minimal sketch of the chart-based idea behind such 
   algorithms follows this list. It is noteworthy that prior to the first 
   publication of these algorithms, parsing and disambiguation under the DOP 
   model took place solely by means of inefficient non-deterministic 
   exponential-time algorithms.

 * it reports on an extensive empirical study of implementations of the DOP 
   model and the specialization algorithms on two independent domains that 
   feature two languages (English and Dutch) and two different tasks. In 
   most experiments the present Specialized DOP model outperforms the DOP 
   model in efficiency while retaining almost the same coverage and accuracy 
   figures. A dedicated experiment demonstrates that the specialized DOP 
   models tend to process more frequent input more efficiently (in contrast 
   to the original DOP model).
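
The dissertation's actual algorithms operate on STSGs and on word-graphs; as 
a simplified, assumed illustration of the chart-based idea referred to in 
the second item above, the sketch below computes the most probable 
derivation under a probabilistic context-free grammar in Chomsky Normal 
Form with CKY-style dynamic programming. The rule format and all names are 
hypothetical. Each chart cell is filled with one pass over the rule set per 
split point, which is where a time complexity linear in grammar size comes 
from:

    from math import log

    def viterbi_cky(words, lexical, binary, start="S"):
        """Most probable derivation under a PCFG in Chomsky Normal Form.
        lexical: dict mapping a word to (lhs, prob) rules lhs -> word.
        binary:  list of (lhs, rhs1, rhs2, prob) rules lhs -> rhs1 rhs2.
        Returns (log-probability, backpointer) for `start`, or None."""
        n = len(words)
        chart = {}  # (i, j) -> {nonterminal: (best log-prob, backpointer)}
        for i, w in enumerate(words):
            cell = chart[(i, i + 1)] = {}
            for lhs, p in lexical.get(w, []):
                if log(p) > cell.get(lhs, (float("-inf"), None))[0]:
                    cell[lhs] = (log(p), w)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                cell = chart[(i, j)] = {}
                for k in range(i + 1, j):  # split point
                    left, right = chart[(i, k)], chart[(k, j)]
                    for lhs, r1, r2, p in binary:  # one pass over the rules
                        if r1 in left and r2 in right:
                            lp = log(p) + left[r1][0] + right[r2][0]
                            if lp > cell.get(lhs, (float("-inf"), None))[0]:
                                cell[lhs] = (lp, (r1, (i, k), r2, (k, j)))
        return chart.get((0, n), {}).get(start)

    # Hypothetical toy grammar: S -> NP VP, NP -> "we", VP -> "run".
    lex = {"we": [("NP", 1.0)], "run": [("VP", 1.0)]}
    rules = [("S", "NP", "VP", 1.0)]
    print(viterbi_cky(["we", "run"], lex, rules))
    # (0.0, ('NP', (0, 1), 'VP', (1, 2)))

Extending the chart from strings to word-graphs amounts to indexing cells by 
graph states rather than string positions; the fragment machinery that 
distinguishes an STSG from a plain context-free grammar is omitted here.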

The conclusion of this dissertation concerns the two subjects that the 
dissertation relates and studies: the computational and algorithmic aspects 
of disambiguation, and the specialization of performance models to limited 
domains of language use. The complexity analysis of disambiguation implies 
that efficient probabilistic disambiguation under the DOP model and various 
similar performance models cannot be achieved using conventional 
optimization techniques; this suggests that it is necessary to involve 
non-conventional optimization techniques and extra-linguistic sources of 
knowledge in order to achieve efficient probabilistic disambiguation. As to 
the disambiguation algorithms, the empirical results clearly show that they 
substantially improve the efficiency of disambiguation under the DOP model. 
Finally, the study of the specialization of performance models to limited 
domains results in new insights concerning the modeling of the efficiency 
properties of human language processing. The empirical results support our 
hypothesis concerning the relation between these efficiency properties and 
the statistical biases in limited domains. Naturally, only further empirical 
exploration will clarify whether the present method can be applied as 
successfully to other languages and other domains of language use.