
Plucene::Analysis::CJKTokenizer
Tokenizer for CJK texts


NAME

Plucene::Analysis::CJKTokenizer - Tokenizer for CJK texts


SYNOPSIS


        # isa Plucene::Analysis::Tokenizer

        my $next = $chartokenizer->next;

        

DESCRIPTION

This module tokenizes CJK text by creating unigram (single-character) tokens from it (see also PROBLEMS). Because I do not know much Japanese or Korean, I simply apply the same method to them as well. Patches are always welcome.
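As a rough illustration of what unigram tokenization means, the sketch below splits a string into one token per CJK character. The helper `unigrams` and the shortened pattern are hypothetical and are not taken from the module's source; the sketch also simply skips non-CJK characters, which may or may not match what the module does.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain UTF-8 characters

# Hypothetical, shortened pattern: a few Unicode blocks only.
my $cjk = qr/\p{InCJKUnifiedIdeographs}|\p{InHiragana}|\p{InKatakana}|\p{InHangulSyllables}/;

# Hypothetical helper: return one token per character matching $cjk.
sub unigrams {
    my ($text) = @_;
    my @tokens = $text =~ /($cjk)/g;
    return @tokens;
}

my @tokens = unigrams("中文 text 日本語");

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', @tokens), "\n";    # prints: 中 文 日 本 語
```

The Latin word and the spaces are dropped here only because the pattern does not match them.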


METHODS

next


        my $next = $chartokenizer->next;

This will return the next token in the string, or undef at the end of the string.
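The calling convention described above (a value per call, then undef) is usually consumed with a `while (defined ...)` loop. The sketch below shows that loop against a hypothetical stand-in class, `My::UnigramTokenizer`, which is not Plucene's implementation; it returns plain strings, whereas the real tokenizer may return token objects, and its constructor arguments are an assumption.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in class, NOT the module's implementation.
package My::UnigramTokenizer;

sub new {
    my ($class, $text) = @_;
    # Queue up every Han character; other characters are ignored here.
    my @chars = $text =~ /(\p{Han})/g;
    return bless { queue => \@chars }, $class;
}

# Return the next token, or undef once the input is exhausted.
sub next {
    my ($self) = @_;
    return shift @{ $self->{queue} };
}

package main;

my $tokenizer = My::UnigramTokenizer->new("漢字 and 仮名");

my @seen;
while (defined(my $token = $tokenizer->next)) {
    push @seen, $token;    # runs four times for this input
}
```

Testing with `defined` rather than plain truthiness keeps the loop from stopping early on a token such as "0".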






GLOBAL VARIABLE

Here is one pattern variable that you can modify to customize your tokenizer for a specific collection.

$InCJK

Default pattern for CJK characters.

Default value is

        qr(
            \p{InCJKUnifiedIdeographs} |
            \p{InCJKUnifiedIdeographsExtensionA} |
            \p{InCJKUnifiedIdeographsExtensionB} |
            \p{InCJKCompatibilityForms} |
            \p{InCJKCompatibilityIdeographs} |
            \p{InCJKCompatibilityIdeographsSupplement} |
            \p{InCJKRadicalsSupplement} |
            \p{InCJKSymbolsAndPunctuation} |

            \p{InHiragana} |
            \p{InKatakana} |
            \p{InKatakanaPhoneticExtensions} |

            \p{InHangulCompatibilityJamo} |
            \p{InHangulJamo} |
            \p{InHangulSyllables}
        )x;
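One way to customize such a pattern for a single collection is to override the package variable with `local`, so the change is undone at the end of the enclosing block. The sketch below uses a hypothetical stand-in package, `My::Tokenizer`, so that it runs without Plucene installed. With the real module the assignment would presumably target `$Plucene::Analysis::CJKTokenizer::InCJK`, but that fully qualified name is an assumption.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in package that reads a package-level pattern
# each time it is called. It only mimics the idea of $InCJK.
package My::Tokenizer;

our $InCJK = qr/\p{InCJKUnifiedIdeographs}|\p{InHiragana}|\p{InKatakana}/;

sub tokens {
    my ($text) = @_;
    my @found = $text =~ /($InCJK)/g;
    return @found;
}

package main;

my $text = "東京タワー";

# Default pattern: ideographs and kana both match (5 tokens).
my @default = My::Tokenizer::tokens($text);

my @kana_only;
{
    # Narrow the pattern to katakana for this block only (3 tokens).
    local $My::Tokenizer::InCJK = qr/\p{InKatakana}/;
    @kana_only = My::Tokenizer::tokens($text);
}

# Outside the block the default pattern is back in effect.
my @restored = My::Tokenizer::tokens($text);
```

Using `local` rather than a plain assignment avoids leaking the customized pattern into other code that relies on the default.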


PROBLEMS

I have experimented with bigram tokens, but that code keeps failing, so it has been left out of the current release.

Speed is another issue.
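For readers unfamiliar with the term, overlapping character bigrams can be sketched as follows. This is an illustration only and is not the bigram code mentioned above; the helper `bigrams` is hypothetical and assumes its argument is already a run of CJK characters.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: overlapping two-character tokens from one run.
sub bigrams {
    my ($run) = @_;
    my @chars = split //, $run;
    return @chars if @chars < 2;    # a lone character stays a unigram
    return map { $chars[$_] . $chars[ $_ + 1 ] } 0 .. $#chars - 1;
}

my @pairs = bigrams("中文分词");    # three tokens: 中文, 文分, 分词
```

A run of n characters yields n - 1 bigrams, so every adjacent pair of characters appears in exactly one token.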


SEE ALSO

Plucene

the Plucene::Analysis::CJKAnalyzer manpage

the MIME::Base64 manpage


COPYRIGHT

Copyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>




This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


