NAME
    unichars - list characters for one or more properties

SYNOPSIS
    unichars [options] criterion ...

    Each criterion is either a square-bracketed character class, a regex
    starting with a backslash, or an arbitrary Perl expression. See the
    EXAMPLES section below.

    OPTIONS:

    Selection Options:

        --bmp               include the Basic Multilingual Plane (plane 0) [DEFAULT]
        --smp               include the Supplementary Multilingual Plane (plane 1)
        --astral    -a      include planes above the BMP (planes 1-15)
        --unnamed   -u      include various unnamed characters (see DESCRIPTION)
        --locale    -l      specify the locale used for UCA functions

    Display Options:

        --category  -c      include the general category (GC=)
        --script    -s      include the script name (SC=)
        --block     -b      include the block name (BLK=)
        --bidi      -B      include the bidi class (BC=)
        --combining -C      include the canonical combining class (CCC=)
        --numeric   -n      include the numeric value (NV=)
        --casefold  -f      include the casefold status
        --decimal   -d      include the decimal representation of the code point

    Miscellaneous Options:

        --version   -v      print version information and exit
        --help      -h      this message
        --man       -m      full manpage
        --debug     -d      show debugging of criteria and examined code point span

    Special Functions:

        $_ is the current code point
        ord is the current code point's ordinal

        NAME is charnames::viacode(ord)
        NUM is Unicode::UCD::num(ord), not code point number
        CF is casefold->{status}
        NFD, NFC, NFKD, NFKC, FCD, FCC  (normalization)
        UCA, UCA1, UCA2, UCA3, UCA4     (binary sort keys)
        Singleton, Exclusion, NonStDecomp, Comp_Ex
        checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
        NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE

DESCRIPTION
    The unichars program reports which characters match all selection
    criteria anded together.

    A criterion beginning with a square bracket or a backslash is assumed to
    be a regular expression. Anything else is a Perl expression such as you
    might pass to the Perl "grep" function.

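The selection rule just described can be sketched in Python, which is not the language unichars itself is written in. This is only an illustration under stated assumptions: the names compile_criterion, matching_chars, and describe are hypothetical, Python callables stand in for Perl expressions, and Python's re and unicodedata modules stand in for Perl's regex engine and character-name lookup, so results need not agree with what unichars prints.

```python
import re
import unicodedata


def compile_criterion(criterion):
    """Turn one criterion into a predicate over a single character.

    Assumption: strings starting with '[' or a backslash are regexes,
    mirroring the rule in the text; a Python callable stands in for
    an arbitrary Perl expression.
    """
    if callable(criterion):
        return criterion
    if criterion[:1] in ("[", "\\"):
        pattern = re.compile(criterion)
        return lambda ch: pattern.search(ch) is not None
    raise ValueError(f"unsupported criterion in this sketch: {criterion!r}")


def matching_chars(criteria, start=0x0000, stop=0xFFFF):
    """Yield each character in [start, stop] that satisfies every criterion."""
    predicates = [compile_criterion(c) for c in criteria]
    for cp in range(start, stop + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        ch = chr(cp)
        if all(pred(ch) for pred in predicates):  # criteria are ANDed
            yield ch


def describe(ch):
    """Format a character as: glyph, decimal code point, hex, name."""
    name = unicodedata.name(ch, "<unnamed codepoint>")
    return f"{ch} {ord(ch)} {ord(ch):05X} {name}"


if __name__ == "__main__":
    # A regex criterion and an expression-style criterion, both required.
    criteria = [
        r"\D",
        lambda ch: unicodedata.name(ch, "").startswith("ROMAN NUMERAL"),
    ]
    for ch in matching_chars(criteria, start=0x2160, stop=0x2163):
        print(describe(ch))
```

The default range covers only plane 0, matching the program's default; passing a wider start and stop plays the role of the plane-selection options in this sketch.
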
    The $_ variable is set to each successive Unicode character, and if all
    criteria match, that character is displayed. The numeric code point is
    therefore accessible as "ord". The special token "NAME" is set to the
    full name of the current code point. Also, the tokens "NFD", "NFKD",
    "NFC", and "NFKC" are set to the corresponding normalization form.

    By default only plane 0, the Basic Multilingual Plane, is examined. For
    plane 1, the Supplementary Multilingual Plane, use --smp. To examine
    both planes, specify both the --bmp and --smp options, or -bs. To include
    all valid code points, use the -a or --astral option.

    Unless the --unnamed option is given, characters with any of the
    properties Unassigned, PrivateUse, Han, or InHangulSyllables will be
    excluded.

EXAMPLES
    Count all non-ASCII digits:

        $ unichars -a '\d' '\P{ASCII}' | wc -l
        401

    Find all line terminators:

        $ unichars '\R'
        --    10 0000A LINE FEED (LF)
        --    11 0000B LINE TABULATION
        --    12 0000C FORM FEED (FF)
        --    13 0000D CARRIAGE RETURN (CR)
        --   133 00085 NEXT LINE (NEL)
        --  8232 02028 LINE SEPARATOR
        --  8233 02029 PARAGRAPH SEPARATOR

    Find what is not "\s" but is "[\h\v]":

        $ unichars '\S' '[\h\v]'
        --    11 0000B LINE TABULATION

    Count how many code points in the Basic Multilingual Plane are not marks
    but are diacritics:

        $ unichars '\PM' '\p{Diacritic}' | wc -l
        209

    Count how many code points in the Basic Multilingual Plane are marks but
    are not diacritics:

        $ unichars '\pM' '\P{Diacritic}' | wc -l
        750

    Find all code points that are Letters, are in the Greek script, have
    differing canonical and compatibility decompositions, and whose name
    contains "SYMBOL":

        $ unichars -a '\pL' '\p{Greek}' 'NFD ne NFKD' 'NAME =~ /SYMBOL/'
        ϐ   976 003D0 GREEK BETA SYMBOL
        ϑ   977 003D1 GREEK THETA SYMBOL
        ϒ   978 003D2 GREEK UPSILON WITH HOOK SYMBOL
        ϓ   979 003D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
        ϔ   980 003D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
        ϕ   981 003D5 GREEK PHI SYMBOL
        ϖ   982 003D6 GREEK PI SYMBOL
        ϰ  1008 003F0 GREEK KAPPA SYMBOL
        ϱ  1009 003F1 GREEK RHO SYMBOL
        ϲ  1010 003F2 GREEK LUNATE SIGMA SYMBOL
        ϴ  1012 003F4 GREEK CAPITAL THETA SYMBOL
        ϵ  1013 003F5 GREEK LUNATE EPSILON SYMBOL
        Ϲ  1017 003F9 GREEK CAPITAL LUNATE SIGMA SYMBOL

    Find all numeric nondigits in the Latin script (within the BMP):

        $ unichars '\pN' '\D' '\p{Latin}'
        Ⅰ  8544 02160 ROMAN NUMERAL ONE
        Ⅱ  8545 02161 ROMAN NUMERAL TWO
        Ⅲ  8546 02162 ROMAN NUMERAL THREE
        Ⅳ  8547 02163 ROMAN NUMERAL FOUR
        Ⅴ  8548 02164 ROMAN NUMERAL FIVE
        Ⅵ  8549 02165 ROMAN NUMERAL SIX
        Ⅶ  8550 02166 ROMAN NUMERAL SEVEN
        Ⅷ  8551 02167 ROMAN NUMERAL EIGHT
        (etc)

    Find the first three alphanumunderish code points with no assigned name:

        $ unichars -au '\w' '!length NAME' | head -3
        㐀 13312 003400 <unnamed codepoint>
        㐁 13313 003401 <unnamed codepoint>
        㐂 13314 003402 <unnamed codepoint>

    Count the combining characters in the Supplementary Multilingual Plane:

        $ unichars -s '\pM' | wc -l
        61

ENVIRONMENT
    If your environment smells like it's in a Unicode encoding, program
    arguments will be in UTF-8.

BUGS
    The --man option does not correctly process the page for UTF-8, because
    it does not pass the necessary --utf8 option to pod2man.

SEE ALSO
    uniprops, uninames, perluniprops, perlunicode, perlrecharclass, perlre

AUTHOR
    Tom Christiansen <tchrist@perl.com>

COPYRIGHT AND LICENCE
    Copyright 2010 Tom Christiansen.

    This program is free software; you may redistribute it and/or modify it
    under the same terms as Perl itself.