NAME
    Text::Unaccent::PurePerl - remove accents from characters

SYNOPSIS
        use Text::Unaccent::PurePerl qw(unac_string);

        $unaccented = unac_string($string);

        # For compatibility with Text::Unaccent, and
        # for dealing with strings of raw octets:

        $unaccented = unac_string($charset, $octets);
        $unaccented = unac_string_utf16($octets);

        # For compatibility with Text::Unaccent, but
        # these have no useful purpose in this module.

        $version = unac_version();
        unac_debug($level);

DESCRIPTION
  Conversions
    Text::Unaccent::PurePerl is a module for "unaccenting" characters, i.e., removing accents and other diacritic marks from characters. Here, the term unaccenting has a rather loose meaning, since this module does a lot more than just remove accents. Here are some examples:

        Á → A       Latin letter
        Æ → AE      single letter split in two
        ƒ → f       simpler variant of same letter
        Ĳ → IJ      ligature split in two
        ¹ → 1       superscript
        ½ → 1⁄2     fraction
        ώ → ω       Greek letter
        Й → И       Cyrillic letter
        ™ → TM      various symbols

  Comparison to Text::Unaccent
    Text::Unaccent::PurePerl is a pure Perl equivalent to the Text::Unaccent module, but with the additional feature of handling modern Perl character strings. Text::Unaccent only deals with raw octet strings with an associated character encoding. In addition, this module, as the name suggests, does not require a C compiler to build. The disadvantage is that this module is slower than Text::Unaccent.

  Other conversions
    The conversions done by Text::Unaccent seem inconsistent. For instance,

        Æ → AE      Text::Unaccent will convert this ...
        Œ → OE      ... but not this

    One might expect the following conversions

        … → ...     U+2026 HORIZONTAL ELLIPSIS
        Œ → OE      U+0152 LATIN CAPITAL LIGATURE OE
        œ → oe      U+0153 LATIN SMALL LIGATURE OE
        ′ → '       U+2032 PRIME
        ″ → "       U+2033 DOUBLE PRIME

    and more, but these aren't implemented in Text::Unaccent, so they aren't implemented in Text::Unaccent::PurePerl either. This might change in the future.
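
    Until then, a caller who wants the five conversions listed above could apply them in a separate pass. The following is only a sketch, not part of this module: the names %EXTRA and extra_conversions() are hypothetical, and it assumes, as stated above, that unac_string() passes these five characters through unchanged.

```perl
use strict;
use warnings;

# Hypothetical supplementary table (not part of Text::Unaccent::PurePerl):
# the five conversions that the documentation lists as not implemented.
my %EXTRA = (
    "\x{2026}" => '...',    # U+2026 HORIZONTAL ELLIPSIS
    "\x{0152}" => 'OE',     # U+0152 LATIN CAPITAL LIGATURE OE
    "\x{0153}" => 'oe',     # U+0153 LATIN SMALL LIGATURE OE
    "\x{2032}" => "'",      # U+2032 PRIME
    "\x{2033}" => '"',      # U+2033 DOUBLE PRIME
);

# Replace each character found in %EXTRA with its plain replacement and
# return the result; every other character is left untouched.
sub extra_conversions {
    my ($string) = @_;
    $string =~ s/([\x{2026}\x{0152}\x{0153}\x{2032}\x{2033}])/$EXTRA{$1}/g;
    return $string;
}

# Intended use together with this module (requires the module to be
# installed, so it is left as a comment here):
#
#   use Text::Unaccent::PurePerl qw(unac_string);
#   my $plain = extra_conversions(unac_string($string));
```

    The helper expects a Perl character string, not raw octets, so any decoding must happen before it is called.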

  Comparison to Text::Unidecode
    If you want a full transliteration to ASCII, use the Text::Unidecode module.

        "Русский"     (input)
        "Русскии"     (output from Text::Unaccent::PurePerl::unac_string)
        "Russkii"     (output from Text::Unidecode::unidecode)

        "Ελληνικά"    (input)
        "Ελληνικα"    (output from Text::Unaccent::PurePerl::unac_string)
        "Ellinika"    (output from Text::Unidecode::unidecode)

EXPORT
    Functions exported by default: "unac_string", "unac_string_utf16", "unac_version", and "unac_debug".

FUNCTIONS

EXAMPLES
  French
        $str1 = "déjà vu";
        $str2 = unac_string($str1);
            # = "deja vu";

  Greek
        $str1 = "νέα";
            # = "\x{03BD}\x{03AD}\x{03B1}";
        $str2 = unac_string($str1);
            # = "νεα";
            # = "\x{03BD}\x{03B5}\x{03B1}";

    The unaccented string $str2 is made up of the three letters nu, epsilon (without the tonos), and alpha. In contrast, the version of unac_string() in the Text::Unaccent module gives

        $oct2 = unac_string("UTF-8", $str1);
            # = "\xCE\xBD\xCE\xB5\xCE\xB1";

    These octets are the UTF-8 encoded equivalent of "\x{03BD}\x{03B5}\x{03B1}".

BUGS
    There are currently no known bugs.

    Please report any bugs or feature requests to "bug-text-unaccent-pureperl at rt.cpan.org", or through the web interface at <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Unaccent-PurePerl>. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT
    You can find documentation for this module with the perldoc command.

        perldoc Text::Unaccent::PurePerl

    You can also look for information at:

SEE ALSO
    Text::Unaccent(3).

AUTHOR
    Peter John Acklam, <pjacklam@online.no>

COPYRIGHT & LICENSE
    Copyright 2008,2013 Peter John Acklam.

    This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.