NAME

Plucene::Analysis::CJKTokenizer - Tokenizer for CJK texts

SYNOPSIS

        # isa Plucene::Analysis::Tokenizer

        my $next = $chartokenizer->next;

DESCRIPTION

This module tokenizes CJK (Chinese, Japanese, and Korean) text. It creates uni-gram tokens, that is, one token per CJK character (see also "PROBLEMS"). I know little Japanese or Korean, so I have simply applied the same method to those languages without adapting it. Patches are always welcome.
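To make "uni-gram tokens" concrete, here is a minimal sketch of the idea. It is not the module's implementation: the pattern is a reduced, illustrative subset of the default $InCJK, and the name unigram_tokens is hypothetical.

```perl
use strict;
use warnings;

# Reduced, illustrative pattern (an assumption, not the module's $InCJK).
my $cjk_char = qr/
    \p{InCJKUnifiedIdeographs} |
    \p{InHiragana} |
    \p{InKatakana} |
    \p{InHangulSyllables}
/x;

# Hypothetical helper: return one token per CJK character.
# Characters that do not match the pattern are skipped.
sub unigram_tokens {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /($cjk_char)/g) {
        push @tokens, $1;
    }
    return @tokens;
}

# "\x{6771}\x{4EAC}" is two ideographs; "Perl " contains no CJK characters.
my @tokens = unigram_tokens("Perl \x{6771}\x{4EAC}");
printf "%d tokens\n", scalar @tokens;    # prints "2 tokens"
```

Each ideograph becomes its own token, which is why no dictionary or word segmentation is needed for this approach.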

METHODS

next

        my $next = $chartokenizer->next;

Returns the next token in the string, or undef once the end of the string has been reached.
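The contract above (a token, or undef at the end) suggests the usual drain loop. The sketch below uses a hypothetical stand-in class, My::ToyTokenizer, so that it runs without Plucene installed. Only the behaviour of next is taken from this document; the constructor and the helper name drain are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tokenizer object. It only mimics the
# documented contract of next(): a token, or undef at the end of the string.
package My::ToyTokenizer;

sub new {
    my ($class, $text) = @_;
    return bless { chars => [ split //, $text ] }, $class;
}

sub next {
    my ($self) = @_;
    return shift @{ $self->{chars} };    # undef once the list is empty
}

package main;

# Collect every token by calling next() until it returns undef.
sub drain {
    my ($tokenizer) = @_;
    my @tokens;
    while (defined(my $token = $tokenizer->next)) {
        push @tokens, $token;
    }
    return @tokens;
}

my @tokens = drain(My::ToyTokenizer->new("\x{4E2D}\x{6587}"));
```

Testing with defined rather than plain truth matters here: a false but valid value such as "0" would otherwise end the loop early.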

GLOBAL VARIABLE

Here is one pattern variable that you can modify to customize your tokenizer for a specific collection.

$InCJK

The pattern that defines which characters are treated as CJK.

The default value is:

        qr(
            \p{InCJKUnifiedIdeographs} |
            \p{InCJKUnifiedIdeographsExtensionA} |
            \p{InCJKUnifiedIdeographsExtensionB} |

            \p{InCJKCompatibilityForms} |
            \p{InCJKCompatibilityIdeographs} |
            \p{InCJKCompatibilityIdeographsSupplement} |

            \p{InCJKRadicalsSupplement} |
            \p{InCJKSymbolsAndPunctuation} |

            \p{InHiragana} |
            \p{InKatakana} |
            \p{InKatakanaPhoneticExtensions} |

            \p{InHangulCompatibilityJamo} |
            \p{InHangulJamo} |
            \p{InHangulSyllables}
        )x;
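As an illustration of customizing the pattern for a specific collection, the sketch below builds a narrower pattern for a Japanese-only collection and checks what it accepts. The fully qualified variable name in the comment is an assumption based on the module name, and matches_pattern is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# A narrower pattern for a Japanese-only collection: ideographs plus kana.
my $japanese_only = qr(
    \p{InCJKUnifiedIdeographs} |
    \p{InHiragana} |
    \p{InKatakana}
)x;

# Assumption: $InCJK is a package global of the module, so an override
# might look like this once the module is loaded:
#   $Plucene::Analysis::CJKTokenizer::InCJK = $japanese_only;

# Hypothetical helper: check whether a single character is accepted
# by the custom pattern before installing it.
sub matches_pattern {
    my ($char) = @_;
    return $char =~ /\A(?:$japanese_only)\z/ ? 1 : 0;
}
```

With this pattern, Hangul syllables are no longer accepted, while kana and ideographs still are.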

PROBLEMS

I have experimented with bigram tokens, but the tests keep failing, so bigram support has been removed from the current release.
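For readers unfamiliar with the term, bigram tokens are overlapping pairs of adjacent characters. The sketch below only illustrates that idea and is not the code that was removed from this release. The name bigrams and the handling of a lone character are assumptions.

```perl
use strict;
use warnings;

# Illustrative only: overlapping two-character tokens over a run of
# characters. A run with a single character is returned as one token.
sub bigrams {
    my ($run) = @_;
    my @chars = split //, $run;
    return @chars if @chars < 2;
    return map { $chars[$_] . $chars[ $_ + 1 ] } 0 .. $#chars - 1;
}

# Three ideographs yield two overlapping pairs.
my @pairs = bigrams("\x{4E2D}\x{56FD}\x{4EBA}");
```

Compared with uni-grams, this produces one fewer token per run but each token carries more context.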

Speed is another issue.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.