Intelligent Information and Communication Systems

WOCADI Demo - An interactive tour of the WOCADI parser

If you want to test the parser, please send an e-mail to Dr. Sven Hartrumpf.

Introduction

The WOCADI parser (formerly NatLink; WOCADI is an acronym for WOrd ClAss based DIsambiguating) is a computer program written in the Scheme programming language that transforms a German text into a formal semantic representation using the MultiNet (multilayered extended semantic networks) formalism. This representation can be used by a variety of tools and applications, such as natural language interfaces, knowledge engineering systems, machine translation tools, and question answering systems. On the one hand, the WOCADI parser can provide a simple and intuitive front end that lets users communicate with the computer in their native language, so no artificial command language has to be learned. On the other hand, the parser can make available the knowledge hidden in large text collections. The level of natural language understanding is much deeper than in search engines like Google or in traditional information retrieval systems.

This tour is a short demo of WOCADI. For demonstration purposes, you can select one of several German texts, each with an English translation. After your selection, the parser generates intermediate and final results for the selected text, provided your computer has Internet access. The reply can take a few seconds because an illustrative image is produced; the actual parsing typically takes less than one second. The parse results are presented in a newly generated web page. Explanations of the results can be found by following the links or by reading the rest of this page.

Feel free to experiment and report your findings and comments to Sven Hartrumpf.

Morpho-lexical analysis of the text

The first intermediate result presented originates from two processing modules: a word and sentence tokenizer, and a morpho-lexical analyzer that uses two computer lexica. The first is a semantically deep lexicon called HaGenLex (HAgen GErmaN LEXicon), which is constantly extended using the lexicon workbench LIA; the second is a semantically flat lexicon. In addition, several dozen name lexica are consulted.

The tokenizer decides where words and sentences start and end: it segments the user input into words and groups the words into sentences. For humans, this is trivial; for computers, it is not always so. (Consider, for example, that a period may or may not end a sentence, depending on many contextual factors.)
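
The period ambiguity can be illustrated with a minimal Python sketch. This is not WOCADI's tokenizer (which is written in Scheme and uses far richer context); the abbreviation list and the function name `tokenize` are assumptions made only for this illustration, and punctuation such as commas or ordinal numbers like "3." is deliberately not handled.

```python
import re

# Hypothetical abbreviation list; a real tokenizer weighs many context factors.
ABBREVIATIONS = {"Dr.", "z.B.", "usw.", "Nr."}

def tokenize(text):
    """Split text into sentences, each represented as a list of tokens."""
    sentences, current = [], []
    for raw in text.split():
        if raw in ABBREVIATIONS:
            # The period belongs to the abbreviation and does not end a sentence.
            current.append(raw)
            continue
        # Detach one trailing sentence-final punctuation mark from the word.
        match = re.fullmatch(r"(.+?)([.!?])", raw)
        if match:
            current.extend([match.group(1), match.group(2)])
            sentences.append(current)
            current = []
        else:
            current.append(raw)
    if current:
        # Keep trailing words that were not closed by punctuation.
        sentences.append(current)
    return sentences

result = tokenize("Dr. Hartrumpf schrieb den Parser. Er analysiert deutsche Texte.")
```

Here the period after "Dr" stays inside the token, while the periods after "Parser" and "Texte" become separate tokens that close a sentence, so `result` contains two sentences.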

The morpho-lexical analyzer determines the base form of each word and the morphological information that the inflectional suffixes (or prefixes or infixes) add to the information originating from the base form. The analyzer returns a large feature structure for every word (containing around 20 to 80 feature values); for simplicity, only a small part of each feature structure is presented in this demo. A compound analysis module analyzes the structure and semantics of compounds, which are common in German texts. An example of a nominal compound is Programmiersprache (programming language).
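
As a rough illustration of compound analysis, the following Python sketch splits a compound into parts found in a toy lexicon and returns a small feature structure for each part. The lexicon entries, the feature names (`base`, `cat`, `gender`), and the function `analyze_compound` are hypothetical; the actual WOCADI module, HaGenLex entries, and feature inventory are much richer and are not reproduced here.

```python
# Toy lexicon (hypothetical entries): lowercase surface form -> feature structure.
LEXICON = {
    "programmier": {"base": "programmieren", "cat": "verb-stem"},
    "sprache": {"base": "Sprache", "cat": "noun", "gender": "fem"},
}

def analyze_compound(word, lexicon=LEXICON):
    """Return feature structures for the parts of a word, or None if no split fits."""
    lower = word.lower()
    if lower in lexicon:
        return [lexicon[lower]]
    # Try the longest possible first part, then analyze the rest recursively.
    for i in range(len(lower) - 1, 0, -1):
        head, rest = lower[:i], lower[i:]
        if head in lexicon:
            tail = analyze_compound(rest, lexicon)
            if tail is not None:
                return [lexicon[head]] + tail
    return None

parts = analyze_compound("Programmiersprache")
bases = [part["base"] for part in parts]
```

With this toy lexicon, `bases` is `["programmieren", "Sprache"]`. The sketch covers only the structural split; determining the semantic relation between the parts, which the text also attributes to the compound analysis module, is outside its scope.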

Syntactico-semantic analysis of the text

The meaning of the user input (a German text) is automatically determined by a parser based on word class functions (WCFs). The results of the parser are semantic networks in the MultiNet (multilayered extended semantic networks) formalism. This representation is formal, so computers can process it directly.

A semantic network contains two basic kinds of elements. First, there are concepts, like computer or peach. Second, there are relations between concepts (shown as directed edges), e.g. that a peach is a fruit or that Armstrong was an actively acting person (or AG(EN)T) when he stepped on the moon in July 1969. For more details on the MultiNet paradigm, please see the MultiNet tour. Semantic networks can be created and maintained using the workbench MWR.
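
The idea of concepts connected by labeled, directed edges can be sketched in Python as follows. This is only a simplified illustration, not MultiNet itself: the class `SemanticNetwork`, the node names, and the use of the label SUB for "is a" are assumptions made here; only the agent relation (AGT) is taken from the text above, and MultiNet's layers and attributes are omitted.

```python
from collections import defaultdict

class SemanticNetwork:
    """Concepts as nodes, relations as labeled directed edges."""

    def __init__(self):
        # source concept -> list of (relation label, target concept)
        self.edges = defaultdict(list)

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def related(self, source, relation):
        """Return all targets reachable from source via the given relation."""
        return [t for r, t in self.edges.get(source, []) if r == relation]

net = SemanticNetwork()
net.add("peach", "SUB", "fruit")             # a peach is a fruit (label assumed)
net.add("step_on_moon", "AGT", "armstrong")  # Armstrong is the agent of the event
```

A query such as `net.related("peach", "SUB")` then yields `["fruit"]`, and an application can traverse the edges in the same way to answer questions over the parsed text.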

The semantic representation is sent to applications in a textual format. The graphical format is only important if the WOCADI results are to be communicated to humans, as in this demo.

Architecture of the WOCADI parser

The following diagram shows the structure of the WOCADI parser and the main data flows. One possible embedding in an application is indicated.

[Figure: diagram of the WOCADI architecture]

Bibliography

Here are some publications related to WOCADI:

Hartrumpf (2003)
Hartrumpf, Sven (2003). Hybrid Disambiguation in Natural Language Analysis. Osnabrück, Germany: Der Andere Verlag. ISBN 3-89959-080-5
Hartrumpf and Helbig (2002)
Hartrumpf, Sven; Helbig, Hermann (2002). The generation and use of layer information in multilayered extended semantic networks. In: Proceedings of the 5th International Conference on Text, Speech and Dialogue (TSD 2002) (edited by Sojka, Petr; Kopecek, Ivan; Pala, Karel), number 2448 in Lecture Notes in Artificial Intelligence LNCS/LNAI, pp. 89-98. Brno, Czech Republic.
Hartrumpf (2001)
Hartrumpf, Sven (2001). Coreference resolution with syntactico-semantic rules and corpus statistics. In: Proceedings of the Fifth Computational Natural Language Learning Workshop (CoNLL-2001), pp. 137-144, Toulouse, France.
Hartrumpf (1999)
Hartrumpf, Sven (1999). Hybrid disambiguation of prepositional phrase attachment and interpretation. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), pp. 111-120, College Park, Maryland.
Helbig and Hartrumpf (1997)
Helbig, Hermann; Hartrumpf, Sven (1997). Word class functions for syntactic-semantic analysis. In: Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-97), pp. 312-317, Tzigov Chark, Bulgaria.

A longer list can be found in the publications of IICS members.


IICS (Intelligent Information and Communication Systems), University of Hagen (FernUniversität in Hagen)