Data-Oriented Language Processing: An Overview
Rens Bod, Remko Scha

Abstract:
Data-oriented models of language processing embody the assumption that human 
language perception and production work with representations of concrete past 
language experiences, rather than with abstract grammar rules. Such models 
therefore maintain large corpora of linguistic representations of previously 
occurring utterances. When processing a new input utterance, analyses of this 
utterance are constructed by combining fragments from the corpus; the 
occurrence-frequencies of the fragments are used to estimate which analysis is 
the most probable one. This paper motivates the idea of data-oriented language 
processing by considering the problem of syntactic disambiguation. One 
relatively simple parsing/disambiguation model that implements this idea is 
described in some detail. This model assumes a corpus of utterances annotated 
with labelled phrase-structure trees, and parses new input by combining 
subtrees from the corpus; it selects the most probable parse of an input 
utterance by considering the sum of the probabilities of all its derivations. 
The paper discusses some experiments carried out with this model. Finally, it 
reviews some other models that instantiate the data-oriented processing 
approach. Many of these models also employ labelled phrase-structure trees, 
but use different criteria for extracting subtrees from the corpus or employ 
different disambiguation strategies; other models use richer formalisms for 
their corpus annotations.
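
As a brief sketch of the disambiguation criterion mentioned above (the exact 
estimator is defined in the body of the paper; the relative-frequency form 
shown here is an assumption of this sketch), the probability of a parse tree 
$T$ is the sum of the probabilities of its derivations, where a derivation 
$D$ is a sequence of corpus subtrees combined by node substitution and $f(t)$ 
denotes the corpus frequency of subtree $t$:

  $P(T) = \sum_{D \,\text{derives}\, T} P(D), \quad
   P(D) = \prod_{t \in D} P(t), \quad
   P(t) = \frac{f(t)}{\sum_{t' : \mathrm{root}(t') = \mathrm{root}(t)} f(t')}$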