NAME
    Plucene::Analysis::CJKTokenizer - Tokenizer for CJK texts

SYNOPSIS
        # isa Plucene::Analysis::Tokenizer
        my $next = $chartokenizer->next;

DESCRIPTION
    This module tokenizes CJK (Chinese, Japanese, and Korean) text by
    creating uni-gram tokens, one token per CJK character. (See also
    "PROBLEMS".)

    Because I do not know much Japanese or Korean, I simply apply the same
    method to them. Patches are always welcome.

METHODS
  next
        my $next = $chartokenizer->next;

    This returns the next token in the string, or undef at the end of the
    string.

GLOBAL VARIABLE
    There is one pattern variable that you can modify to customize the
    tokenizer for a specific collection.

  $InCJK
    Default pattern for CJK characters. The default value is

        qr(
            \p{InCJKUnifiedIdeographs}
          | \p{InCJKUnifiedIdeographsExtensionA}
          | \p{InCJKUnifiedIdeographsExtensionB}
          | \p{InCJKCompatibilityForms}
          | \p{InCJKCompatibilityIdeographs}
          | \p{InCJKCompatibilityIdeographsSupplement}
          | \p{InCJKRadicalsSupplement}
          | \p{InCJKSymbolsAndPunctuation}
          | \p{InHiragana}
          | \p{InKatakana}
          | \p{InKatakanaPhoneticExtensions}
          | \p{InHangulCompatibilityJamo}
          | \p{InHangulJamo}
          | \p{InHangulSyllables}
        )x;

PROBLEMS
    I have tested bigram tokens, but they keep failing, so bigram support
    has been snipped from the current release.

    Speed is another issue.

SEE ALSO
    Plucene, Plucene::Analysis::CJKAnalyzer, MIME::Base64

COPYRIGHT
    Copyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.
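
EXAMPLE
    The uni-gram approach described under DESCRIPTION can be illustrated with a short standalone sketch. It does not use Plucene and is not this module's implementation: the `unigrams` helper and the `$cjk` pattern are hypothetical names made up for illustration, and the pattern covers only a small subset of the blocks listed for $InCJK, written in the `\p{Block=...}` form.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Hypothetical stand-in for $InCJK, limited to four Unicode blocks.
my $cjk = qr(
    \p{Block=CJK_Unified_Ideographs}
  | \p{Block=Hiragana}
  | \p{Block=Katakana}
  | \p{Block=Hangul_Syllables}
)x;

# Return one token per character that matches the pattern (uni-grams).
# Characters outside the pattern, such as ASCII letters, are skipped.
sub unigrams {
    my ($text) = @_;
    my @tokens = $text =~ /($cjk)/g;
    return @tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
print join('|', unigrams("Perl 全文検索")), "\n";
```

    Each matching character becomes its own token, so the four ideographs in the sample string yield four tokens and the ASCII word is dropped. Widening or narrowing the pattern changes which characters are tokenized, which is the kind of per-collection customization that $InCJK is meant to allow.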