


NAME

Alvis::NLPPlatform::NLPWrappers - Perl extension for the wrappers used for linguistically annotating XML documents in Alvis


SYNOPSIS

use Alvis::NLPPlatform::NLPWrappers;

Alvis::NLPPlatform::NLPWrappers::tokenize($h_config,$doc_hash);


DESCRIPTION

This module provides the default wrappers for the Natural Language Processing (NLP) tools. These wrappers are called by the ALVIS NLP Platform (see Alvis::NLPPlatform).

The default wrappers can be overridden by defining new wrappers in a local UserNLPWrappers module.
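The override mechanism can be sketched as follows. This is an illustrative stand-alone example, not the Alvis code: the stand-in package DefaultWrappers plays the role of Alvis::NLPPlatform::NLPWrappers, and the dummy return value is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package DefaultWrappers;    # stand-in for Alvis::NLPPlatform::NLPWrappers
sub tokenize {
    my ($h_config, $doc_hash) = @_;
    return 42;              # pretend: the number of tokens found
}

package UserNLPWrappers;    # local module overriding the default wrapper
sub tokenize {
    my ($h_config, $doc_hash) = @_;
    # Custom pre-processing could happen here ...
    my $nb_tokens = DefaultWrappers::tokenize($h_config, $doc_hash);
    # ... and custom post-processing here, before returning.
    return $nb_tokens;
}

package main;
print UserNLPWrappers::tokenize({}, {}), "\n";
```

A user wrapper typically delegates to the default implementation and adds processing before or after the call, as sketched here.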


METHODS

tokenize()


    tokenize($h_config, $doc_hash);

This method carries out the tokenization of the input document. $doc_hash is the hash table containing all the annotations of the input document.

The tokenization has been written for ALVIS. This task depends largely on what counts as a token for our purpose; hence, this function is not a wrapper but the specific tokenizing tool itself. Its input is the plain-text corpus, which is segmented into tokens. A token is a group of consecutive characters belonging to the same category: alphabetic characters, numerical characters, separator characters, or symbol characters.

During the tokenization process, all tokens are stored in memory via a hash table (%hash_tokens).

$h_config is the reference to the hash table containing the variables defined in the configuration file.

The method returns the number of tokens.
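The run-based tokenization described above can be sketched as follows. This is an illustrative stand-alone sketch, not the Alvis implementation; the character categories and the layout of the token hash table are simplified for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Character categories used to delimit tokens (simplified).
sub char_category {
    my ($c) = @_;
    return 'alpha' if $c =~ /[A-Za-z]/;
    return 'num'   if $c =~ /[0-9]/;
    return 'sep'   if $c =~ /\s/;
    return 'symb';
}

# Segment text into maximal runs of same-category characters and
# store them in a hash table keyed by token number.
sub tokenize_text {
    my ($text) = @_;
    my %hash_tokens;
    my $id = 0;
    my ($current, $cat) = ('', '');
    for my $c (split //, $text) {
        my $c_cat = char_category($c);
        if ($current ne '' && $c_cat ne $cat) {
            $hash_tokens{++$id} = { content => $current, type => $cat };
            $current = '';
        }
        $current .= $c;
        $cat = $c_cat;
    }
    $hash_tokens{++$id} = { content => $current, type => $cat }
        if $current ne '';
    return ($id, \%hash_tokens);    # returns the number of tokens
}

my ($nb, $tokens) = tokenize_text("Bacillus subtilis");
print "$nb\n";                      # 3 tokens: word, space, word
print $tokens->{1}{content}, "\n";  # Bacillus
```

"Bacillus subtilis" thus yields three tokens: the two alphabetic runs and the separator between them.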

scan_ne()


    scan_ne($h_config, $doc_hash);

This method wraps the default named-entity recognition and tags the input document. $doc_hash is the hash table containing all the annotations of the input document. It aims at annotating semantic units with syntactic and semantic types. Each text sequence corresponding to a named entity is tagged with a unique tag corresponding to its semantic value (for example a "gene" type for gene names, a "species" type for species names, etc.). All these text sequences are also assumed to be equivalent to nouns: the tagger dynamically produces linguistic units equivalent to words or noun phrases.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

We integrated TagEN (Jean-Francois Berroyer. TagEN, un analyseur d'entites nommees : conception, developpement et evaluation. Memoire de D.E.A. d'Intelligence Artificielle, Universite Paris-Nord, France, 2004) as the default named-entity tagger. It is based on a set of linguistic resources and grammars, and can be downloaded here: http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/TagEN.tar.gz

word_segmentation()


    word_segmentation($h_config, $doc_hash);

This method wraps the default word segmentation step. $doc_hash is the hash table containing all the annotations of the input document.

We use simple regular expressions, based on the algorithm proposed in G. Grefenstette and P. Tapanainen. What is a word, what is a sentence? Problems of tokenization. The 3rd International Conference on Computational Lexicography, pages 79-87, 1994, Budapest. The method is a wrapper for the awk script implementing the approach, which was posted on the Corpora mailing list (see the archives: http://torvald.aksis.uib.no/corpora/ ). The script carries out word segmentation as well as sentence segmentation. Information related to the sentence segmentation is used in the default sentence_segmentation method.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

In the default wrapper, segmented words are then aligned with tokens and named entities. For example, let "Bacillus subtilis" be a named entity made of three tokens: "Bacillus", the space character and "subtilis". The word segmenter finds two words: "Bacillus" and "subtilis". The wrapper however creates a single word, since "Bacillus subtilis" was found to be a named entity, and should thus be considered a single word made of the same three tokens.
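The alignment described above can be sketched as follows. This is an illustrative example with a hypothetical data layout (word and entity spans as pairs of token offsets); the real module works on the annotations stored in $doc_hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Words found by the segmenter, as [start_token, end_token] spans.
my @words = ([1, 1], [3, 3]);          # "Bacillus", "subtilis"
# Named entity covering tokens 1..3 ("Bacillus" + space + "subtilis").
my @named_entities = ([1, 3]);

sub align_words_with_ne {
    my ($words, $nes) = @_;
    my (@aligned, %consumed);
    for my $ne (@$nes) {
        my ($s, $e) = @$ne;
        # Mark every word falling inside the entity span as consumed ...
        for my $i (0 .. $#$words) {
            $consumed{$i} = 1
                if $words->[$i][0] >= $s && $words->[$i][1] <= $e;
        }
        # ... and emit a single word spanning the whole entity.
        push @aligned, [$s, $e];
    }
    for my $i (0 .. $#$words) {
        push @aligned, $words->[$i] unless $consumed{$i};
    }
    return [sort { $a->[0] <=> $b->[0] } @aligned];
}

my $aligned = align_words_with_ne(\@words, \@named_entities);
printf "%d word(s), spanning tokens %d-%d\n",
    scalar(@$aligned), $aligned->[0][0], $aligned->[0][1];
```

The two segmenter words are replaced by one word spanning tokens 1 to 3, matching the "Bacillus subtilis" example above.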

sentence_segmentation()


    sentence_segmentation($h_config, $doc_hash);

This method wraps the default sentence segmentation step. $doc_hash is the hash table containing all the annotations of the input document.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

The sentence segmentation function does not invoke any external tool (see the word_segmentation() method for more explanation). It scans the token hash table for full stops, i.e. dots that were not considered part of a word. Each full stop marks the end of a sentence. Each sentence is then assigned an identifier and two offsets: that of its starting token and that of its ending token.
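The scan described above can be sketched as follows. This is an illustrative stand-alone example: the toy token table maps token ids to their content, and a bare "." stands for a full stop already known not to be part of a word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy token table: id => content ('.' marks a full stop here).
my %hash_tokens = (
    1 => 'Spores', 2 => ' ', 3 => 'germinate', 4 => '.',
    5 => ' ', 6 => 'They', 7 => ' ', 8 => 'grow', 9 => '.',
);

# Scan the tokens in order; each full stop closes a sentence, which
# gets an identifier plus starting and ending token offsets.
sub sentence_segmentation {
    my ($tokens) = @_;
    my @sentences;
    my $start = 1;
    for my $id (sort { $a <=> $b } keys %$tokens) {
        if ($tokens->{$id} eq '.') {
            push @sentences, { id    => scalar(@sentences) + 1,
                               start => $start,
                               end   => $id };
            $start = $id + 1;
        }
    }
    return \@sentences;
}

my $sentences = sentence_segmentation(\%hash_tokens);
for my $s (@$sentences) {
    print "sentence $s->{id}: tokens $s->{start}-$s->{end}\n";
}
```

On this toy input the sketch finds two sentences, spanning tokens 1-4 and 5-9.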

pos_tag()


    pos_tag($h_config, $doc_hash);

This method wraps the Part-of-Speech (POS) tagging. $doc_hash is the hash table containing all the annotations of the input document. It works as follows: every word is passed to the external Part-of-Speech tagger, which outputs a tag for each input word. The wrapper then builds a hash table associating each word with its tag. It assumes that word and sentence segmentation have been performed.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

By default, we use the probabilistic Part-of-Speech tagger TreeTagger (Helmut Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. New Methods in Language Processing, Studies in Computational Linguistics, 1997, edited by Daniel Jones and Harold Somers. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ ).

As this POS tagger also carries out lemmatization, the method also adds annotations at the lemma level.

The GeniaTagger (Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou and Jun'ichi Tsujii. Developing a Robust Part-of-Speech Tagger for Biomedical Text. Proceedings of Advances in Informatics - 10th Panhellenic Conference on Informatics, pages 382-392, 2005, LNCS 3746) can also be used, by modifying the column order (see the definition of the command line in client.pl).
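The tag/word association described above can be sketched as follows. This is an illustrative stand-alone example: a trivial mock stands in for the external tagger (TreeTagger would be invoked instead), and the lexicon entries are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mock tagger: returns one [tag, lemma] pair per input word.
sub mock_tagger {
    my (@words) = @_;
    my %lexicon = (
        Spores    => ['NNS', 'spore'],
        germinate => ['VBP', 'germinate'],
    );
    return map { $lexicon{$_} || ['NN', lc $_] } @words;
}

my @words  = ('Spores', 'germinate');
my @output = mock_tagger(@words);

# Build the hash table associating each word with its tag (and lemma),
# keyed by word position so repeated words stay distinct.
my %tag_of;
for my $i (0 .. $#words) {
    $tag_of{$i + 1} = { word  => $words[$i],
                        tag   => $output[$i][0],
                        lemma => $output[$i][1] };
}
print "$tag_of{1}{word}/$tag_of{1}{tag}/$tag_of{1}{lemma}\n";
```

Because TreeTagger also outputs a lemma per word, the same pass can fill in the lemma level, as noted above.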

lemmatization()


    lemmatization($h_config, $doc_hash);

This method wraps the default lemmatizer. $doc_hash is the hash table containing all the annotations of the input document. However, as the POS tagger TreeTagger already outputs lemmas, this method does nothing. It is here only for conformance.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

term_tag()


    term_tag($h_config, $doc_hash);

This method wraps the term tagging step of the ALVIS NLP Platform. $doc_hash is the hash table containing all the annotations of the input document. This step aims at recognizing terms in the documents that differ from named entities (see Alvis::TermTagger), such as gene expression or spore coat cell. Term lists can be provided as terminological resources such as the Gene Ontology (http://www.geneontology.org/ ), MeSH (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=mesh ) or, more broadly, the UMLS (http://umlsinfo.nlm.nih.gov/ ). They can also be acquired through corpus analysis.

Term matching in the document takes into account typographical and inflectional variation. Handling typographical variation requires a slight preprocessing of the terms.

We first assume a less strict use of the dash character: for instance, the term UDP-glucose can appear in documents as UDP glucose, and vice versa. Inflectional variation requires lemmatization of the input documents; it makes it possible to match transcription factors against the term transcription factor. Both variation types can be taken into account together or separately during term matching. Previous annotation levels, such as lemmatization and word segmentation, but also named entities, are required.

$h_config is the reference to the hash table containing the variables defined in the configuration file.
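Both variation types described above can be sketched as follows. This is an illustrative stand-alone example, not the Alvis::TermTagger code: dashes and spaces are normalized to handle typographical variation, and matching is done on lemmas to handle inflectional variation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Typographical variation: dash and space are interchangeable.
sub normalize {
    my ($s) = @_;
    $s =~ s/[-\s]+/ /g;
    return lc $s;
}

# Terms from a terminological resource, given in lemmatized form.
my @terms = ('UDP-glucose', 'transcription factor');

# Document words with their lemmas (from the lemmatization step).
my @doc = (
    { form => 'UDP',           lemma => 'UDP' },
    { form => 'glucose',       lemma => 'glucose' },
    { form => 'transcription', lemma => 'transcription' },
    { form => 'factors',       lemma => 'factor' },
);

# Inflectional variation: match against the lemma sequence, not the
# surface forms, so "factors" matches "factor".
my $doc_lemmas = normalize(join ' ', map { $_->{lemma} } @doc);
my @matched;
for my $term (@terms) {
    push @matched, $term if index($doc_lemmas, normalize($term)) >= 0;
}
print "matched: $_\n" for @matched;
```

Here "UDP-glucose" matches the document sequence "UDP glucose", and "transcription factor" matches the inflected "transcription factors" via its lemmas.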

syntactic_parsing()


    syntactic_parsing($h_config, $doc_hash);

This method wraps the default sentence parsing. It aims at building the graph of syntactic dependency relations between the words of each sentence. $doc_hash is the hash table containing all the annotations of the input document.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

The Link Grammar Parser (Daniel D. Sleator and Davy Temperley. Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies, 1993. http://www.link.cs.cmu.edu/link/ ) is currently the integrated parser.

Processing time is a critical point for syntactic parsing, but we expect that good recognition of terms can significantly reduce the number of possible parses and consequently the parsing time. Term identification is therefore performed prior to parsing. The word level of annotation is required. Depending on the choice of parser, the morphosyntactic level may also be needed.

semantic_feature_tagging()


    semantic_feature_tagging($h_config, $doc_hash)

The semantic typing function attaches a semantic type to the words, terms and named entities (referred to as lexical items in the following) in documents, according to the conceptual hierarchies of the domain ontology. $doc_hash is the hash table containing all the annotations of the input document.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

Currently, this step is not integrated in the platform.

semantic_relation_tagging()


    semantic_relation_tagging($h_config, $doc_hash)

This method wraps the semantic relation identification step. $doc_hash is the hash table containing all the annotations of the input document. In the Alvis project, the default behaviour is the identification of domain-specific semantic relations, i.e. relations occurring between instances of the ontological concepts in the document. These instances are identified and tagged accordingly by the semantic typing. As a result, these semantic relation annotations give another level of semantic representation of the document, making explicit the role that these semantic units (usually named entities and/or terms) play with respect to each other, pertaining to the domain ontology. However, this annotation depends on previous document annotations, and two different tagging strategies, corresponding to the two processing lines (annotation of web documents, and acquisition of the resources used for that annotation), impact the implementation of the semantic relation tagging.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

Currently, this step is not integrated in the platform.

anaphora_resolution()


    anaphora_resolution($h_config, $doc_hash)

This method wraps the tool that aims at identifying and resolving the anaphoras present in a document. $doc_hash is the hash table containing all the annotations of the input document. We restrict the resolution to anaphoras of the pronoun "it". The anaphora resolution takes as input an annotated document, in the ALVIS format, coming from the semantic type tagging, and produces a text augmented with XML tags, in the ALVIS format, encoding the anaphora relations between antecedents and pronouns.

$h_config is the reference to the hash table containing the variables defined in the configuration file.

Currently, this step is not integrated in the platform.



SEE ALSO

Alvis web site: http://www.alvis.info


AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Julien Deriviere <julien.deriviere@lipn.univ-paris13.fr>


LICENSE

Copyright (C) 2005 by Thierry Hamon and Julien Deriviere

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.
