NAME
enca -- detect and convert encoding of text files

SYNOPSIS
enca [-L LANGUAGE] [OPTION]... [FILE]...
enconv [-L LANGUAGE] [OPTION]... [FILE]...

INTRODUCTION AND EXAMPLES
If you are lucky enough, the only two things you will ever need to know are: the command enca FILE will tell you which encoding file FILE uses (without changing it), and enconv FILE will convert file FILE to your locale's native encoding. To convert the file to some other encoding, use the -x option (see the -x entry in section OPTIONS and sections CONVERSION and ENCODINGS for details). Both work with multiple files and standard input (output) too. E.g.

    enca -x latin2 <sometext | lpr

ensures file `sometext' is in ISO Latin 2 when it is sent to the printer.

The main reason why these commands may fail and turn your files into garbage is that Enca needs to know their language to detect the encoding. It tries to determine your language and preferred charset from locale settings, which might not be what you want. You can (or have to) use the -L option to tell it the right language. Suppose you downloaded some Russian HTML file, `file.htm'; it claims it is windows-1251 but it isn't. So you run

    enca -L ru file.htm

and find out it is KOI8-R (for example). Be warned, currently there are not many supported languages (see section LANGUAGES).

Another warning concerns the fact that several of Enca's features, namely its charset conversion capabilities, strongly depend on what other tools are installed on your system (see section CONVERSION)--run enca --version to get a list of features (see section FEATURES). Also try enca --help to get a description of all other Enca options (and to find the rest of this manual page redundant).
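The detect-then-convert workflow above can be sketched as a short shell script. This is a hedged illustration only: `sample.txt' is a throwaway file created here, and if enca itself is not installed, a stub function stands in for it (the stub's answer is invented, not a real detection).

```shell
# Create a throwaway file to examine (hypothetical, for illustration only).
printf 'Hello, world\n' > sample.txt

# If enca is not installed, substitute a stub so the sketch still runs;
# the stub's output is made up, not a real detection.
if ! command -v enca >/dev/null 2>&1; then
    enca() { printf '7bit ASCII characters\n'; }
fi

# Detect the encoding without modifying the file (-L none: no
# language-specific 8bit guessing, see section LANGUAGES):
encoding=$(enca -L none sample.txt)
echo "detected: $encoding"

# Conversion would be requested with -x, e.g.:
#   enca -L none -x UTF-8 sample.txt
# while enconv sample.txt converts to the locale's native charset.

rm -f sample.txt
```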
DESCRIPTION
Enca reads the given text files, or standard input when none are given, and uses knowledge about their language (which must be told by you) and a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings, which it then prints to standard output (or it confesses it doesn't have any idea what the encoding could be). By default, Enca presents results as multiline human-readable descriptions; several other formats are available--see Output type selectors below.

Enca can also convert files to some other encoding ENC when you ask for it--either using a built-in converter, some conversion library, or by calling an external converter. Enca's primary goal is to be usable unattended, as an automatic conversion tool, though it perhaps has not reached this point yet (please see section SECURITY).

Please note that except in rare cases Enca really has to know the language of the input files to give you a reliable answer. On the other hand, it can then cope quite well with files that are not purely textual, or even detect the charset of text strings inside some binary file; of course, this depends on the character of the non-text component. Enca doesn't care about the structure of input files; it views them as a uniform piece of text/data. In the case of multipart files (e.g. mailboxes), you have to use some tool that knows the structure to extract the individual parts first. This is the cost of the ability to detect the encodings of any damaged, incomplete or otherwise incorrect files.

OPTIONS
There are several categories of options: operation mode options, output type selectors, guessing parameters, conversion parameters, general options and listings. All long options can be abbreviated as long as they are unambiguous; mandatory parameters of long options are mandatory for short options too.

Operation modes
are the following:
Output type selectors
select what action Enca will take when it determines the encoding; most of them just choose between different names, formats and conventions in which encodings can be printed, but one of them (-x) is special: it tells Enca to recode files to some other encoding ENC. These options are mutually exclusive; if you specify more than one output type selector, the last one takes precedence. Several output types represent the charset name used by some other program, but not all these programs know all the charsets which Enca recognises. Be warned, in such situations Enca makes no distinction between an unrecognised charset and a charset having no name in the given namespace.
Guessing parameters
There's only one: -L, setting the language of input files. This option is mandatory (but see below).
Conversion parameters
give you finer control over how the charset conversion will be performed. They don't affect anything when -x is not specified as the output type. Please see section CONVERSION for the gory conversion details.
General options
don't fit into the other option categories...
Listings
are all terminal, i.e. when Enca encounters one of them it prints the required listing and terminates without processing any following options.
CONVERSION
Though Enca was originally designed as a tool for guessing encodings only, it now features several methods of charset conversion. You can control which of them will be used with -C. Enca sequentially tries the converters from the list specified by -C until it finds one that is able to perform the required conversion, or until it exhausts the list. You should specify preferred converters first, less preferred later. The external converter (extern) should always be specified last, only as a last resort, since it's usually not possible to recover when it fails. The default list of converters always starts with built-in and then continues with the first one available from: librecode, iconv, nothing.

It should be noted that when Enca says it is not able to perform the conversion, it only means that none of the converters is able to perform it. It may still be possible to perform the required conversion in several steps, using several converters, but to figure out how, human intelligence is probably needed.

Built-in converter
is the simplest and by far the fastest of all; it can perform only a few byte-to-byte conversions and modifies files directly in place (which may be considered dangerous, but is pretty efficient). You can get a list of all encodings it can convert with

    enca --list built-in

Besides speed, its main advantage (and also disadvantage) is that it doesn't care: it simply converts the characters having a representation in the target encoding, doesn't touch anything else and never prints any error message. This converter can be specified as built-in with -C.

Librecode converter
is an interface to the GNU recode library, which does the actual recoding job. It may or may not be compiled in; run enca --version to find out its availability in your enca build (feature +librecode-interface). You should be familiar with recode(1) before using it, since recode is a quite sophisticated and powerful charset conversion tool.
You may run into problems using it together with Enca, particularly because Enca's support for surfaces is not 100% compatible, because recode tries too hard to make the transformation reversible, because it sometimes silently ignores I/O errors, and because it's incredibly buggy. Please see the GNU recode info pages for details about the recode library. This converter can be specified as librecode with -C.

Iconv converter
is an interface to the UNIX98 iconv(3) conversion functions, which do the actual recoding job. It may or may not be compiled in; run enca --version to find out its availability in your enca build (feature +iconv-interface). While iconv is present on most systems today, it only rarely offers a useful set of available conversions, the only notable exception being iconv from GNU libc. It is usually quite picky about surfaces, too (while, at the same time, not implementing surface conversion). It however probably represents the only standard(ized) tool able to perform conversion from/to Unicode. Please see the iconv documentation for details about its capabilities on your particular system. This converter can be specified as iconv with -C.

External converter
is an arbitrary external conversion tool that can be specified with the -E option (at most one can be defined at a time). There are some standard ones, provided together with enca: cstocs, recode, map, umap, and piconv. All are wrapper scripts: for cstocs(1), recode(1), map(1), umap(1), and piconv(1).

Please note that enca has little control over what the external converter really does. If you set it to /bin/rm you are fully responsible for the consequences. If you want to make your own converter to use with enca, you should know it is always called

    CONVERTER ENC_CURRENT ENC FILE [-]

where CONVERTER is what has been set by -E, ENC_CURRENT is the detected encoding, ENC is what has been specified with -x, and FILE is the file to convert, i.e. it is called for each file separately.
The optional fourth parameter, -, when present, should cause the result of the conversion to be sent to standard output instead of overwriting the file FILE. The converter should also take care not to change file permissions, to return error code 1 when it fails, and to clean up its temporary files. Please see the standard external converters for examples. This converter can be specified as extern with -C.

Default target charset
The straightforward way of specifying the target charset is the -x option, which overrides any defaults. When Enca is called as enconv, the default target charset is selected exactly the same way recode(1) does it. If the DEFAULT_CHARSET environment variable is set, it's used as the target charset. Otherwise, if your system provides the nl_langinfo(3) function, the current locale's native charset is used as the target charset. When both methods fail, Enca complains and terminates.

Reversibility notes
If reversibility is crucial for you, you shouldn't use enca as a converter at all (or maybe you can, with a very specifically designed recode(1) wrapper). Otherwise you should at least know that there are four basic means of handling inconvertible character entities:

fail--this is a possibility, too, and incidentally it's exactly what the current GNU libc iconv implementation does (recode can also be told to do it)

don't touch them--this is what enca's internal converter always does and recode can do; though it is not reversible, a human being is usually able to reconstruct the original (at least in principle)

approximate them--this is what cstocs can do, and recode too, though differently; and the best choice if you just want to make the accursed text readable

drop them out--this is what both recode and cstocs can do (cstocs can also replace these characters with some fixed character instead of merely ignoring them); useful when the to-be-omitted characters contain only noise.

Please consult your favourite converter's manual for details of this issue.
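The calling convention and the requirements above (send output to stdout when `-' is given, keep file permissions, return error code 1 on failure, clean up temporaries) can be sketched as a small wrapper around iconv(1). The name myconv and the whole script are hypothetical: an illustration of the protocol, not one of the wrappers shipped with Enca.

```shell
# myconv ENC_CURRENT ENC FILE [-]: hypothetical external converter
# delegating the actual recoding work to iconv(1).
myconv() {
    from=$1; to=$2; file=$3; dash=$4
    if [ "$dash" = "-" ]; then
        # Optional fourth parameter "-": result goes to standard output.
        iconv -f "$from" -t "$to" < "$file" || return 1
    else
        # Convert in place via a temporary file, so a failed conversion
        # leaves the original untouched; rewriting with cat (truncate
        # and write) keeps the original file's permissions.
        tmp=$(mktemp) || return 1
        if iconv -f "$from" -t "$to" < "$file" > "$tmp"; then
            cat "$tmp" > "$file"
            rm -f "$tmp"
        else
            rm -f "$tmp"
            return 1
        fi
    fi
}

# Demonstration: convert a UTF-8 file to Latin-1 in place, then print
# a copy re-encoded back to UTF-8 on stdout.
printf 'caf\303\251\n' > demo.txt        # "cafe" with e-acute, in UTF-8
myconv UTF-8 LATIN1 demo.txt             # in-place conversion
myconv LATIN1 UTF-8 demo.txt -           # converted copy on stdout
```

With such a script made executable, it would be plugged in with something like enca -E ./myconv -x TARGET FILE (with extern enabled via -C); enca then invokes it once per file with the arguments shown.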
Generally, if you are not lucky enough to have all the characters in your file convertible, manual intervention is needed anyway.

Performance notes
Poor performance of the available converters has been one of the main reasons for including a built-in converter in enca. Try to use it whenever possible, i.e. when the files in question are charset-clean enough, or charset-messy enough, that its zero built-in intelligence doesn't matter. It requires no extra disk space nor extra memory, and can outperform recode(1) more than 10 times on large files and the Perl version (i.e. the faster one) of cstocs(1) more than 400 times on small files (in fact it's almost as fast as mere cp(1)). Try to avoid external converters when not absolutely necessary, since all the forking and moving stuff around is incredibly slow.

ENCODINGS
You can get a list of recognised character sets with

    enca --list charsets

and using the --name parameter you can select any name you want to be used in the listing. You can also list all surfaces with

    enca --list surfaces

Encoding and surface names are case insensitive and non-alphanumeric characters are not taken into account. However, non-alphanumeric characters are mostly not allowed at all. The only ones allowed are: `-', `_', `.', `:', and `/' (as charset/surface separator). So `ibm852' and `IBM-852' are the same, while `IBM 852' is not accepted.

Charsets
The following list of recognised charsets uses Enca's names (-e) and verbal descriptions as reported by Enca (-f):
where unknown is not any real encoding; it's reported when Enca is not able to give a reliable answer.

Surfaces
Enca has some experimental support for so-called surfaces (see below). It detects the following surfaces (not all can be applied to all charsets):
Note some surfaces have N.A. in place of an identifier--they cannot be specified on the command line, they can only be reported by Enca. This is intentional: they only inform you why the file cannot be considered surface-consistent instead of representing a real surface. Each charset has its natural surface (called `implied' in recode) which is not reported; e.g., for the IBM 852 charset it's `CRLF line terminators'. For UCS encodings, big endian is considered the natural surface; unusual byte orders are constructed from 21 and 4321 permutations: 2143 is reported simply as 21, while 3412 is reported as a combination of 4321 and 21. Doubly-encoded UTF-8 is neither a charset nor a surface; it's just reported.

About charsets, encodings and surfaces
A charset is a set of character entities, while an encoding is its representation in terms of bytes and bits. In Enca, the word encoding means the same as `representation of text', i.e. the relation between the sequence of character entities constituting the text and the sequence of bytes (bits) constituting the file. So, an encoding is both a character set and a so-called surface (line terminators, byte order, combining, Base64 transformation, etc.). Nevertheless, it proves convenient to work with some {charset,surface} pairs as with genuine charsets. So, as in recode(1), all UCS- and UTF- encodings of the Universal character set are called charsets. Please see the recode documentation for more details of this issue. The only good thing about surfaces is: when you don't start playing with them, neither will Enca, and it will try to behave as much as possible like a surface-unaware program, even when talking to recode.

LANGUAGES
Enca needs to know the language of input files to work reliably, at least in the case of regular 8bit encodings. Multibyte encodings should be recognised for any Latin, Cyrillic or Greek language. You can (or have to) use the -L option to tell Enca the language.
Since people most often work with files in the same language for which they have configured locales, Enca tries to guess the language by examining the value of LC_CTYPE and other locale categories (please see locale(7)) and uses it as the language when you don't specify any. Of course, it may be completely wrong, and will then give you nonsense answers and damage your files, so please don't forget to use the -L option. You can also use the ENCAOPT environment variable to set a default language (see section ENVIRONMENT). The following languages are supported by Enca (each language is listed together with its supported 8bit encodings).
The special language none can be shortened to __; it contains no 8bit encodings, so only multibyte encodings are detected. You can also use locale names instead of languages:
FEATURES
Several of Enca's features depend on what is available on your system and how it was compiled. You can get their list with

    enca --version

A plus sign before a feature name means it's available; a minus sign means this build lacks the particular feature.

librecode-interface. Enca has an interface to the GNU recode library charset conversion functions.

iconv-interface. Enca has an interface to the UNIX98 iconv charset conversion functions.

external-converter. Enca can use external conversion programs (if you have some suitable ones installed).

language-detection. Enca tries to guess the language (-L) from locales. You don't need the --language option, at least in principle.

locale-alias. Enca is able to decrypt locale aliases used for language names.

target-charset-auto. Enca tries to detect your preferred charset from locales. The option --auto-convert and calling Enca as enconv work, at least in principle.

ENCAOPT. Enca is able to correctly parse this environment variable before the command line parameters. Simple stuff like ENCAOPT="-L uk" will work even without this feature.

ENVIRONMENT
The variable ENCAOPT can hold a set of default Enca options. Its content is interpreted before command line arguments. Unfortunately, this doesn't work everywhere (must have the +ENCAOPT feature). LC_CTYPE, LC_COLLATE, LC_MESSAGES (possibly inherited from LC_ALL or LANG) are used for guessing your language (must have the +language-detection feature). The variable DEFAULT_CHARSET can be used by enconv as the default target charset.

DIAGNOSTICS
Enca returns exit code 0 when all input files were successfully processed (i.e. all encodings were detected and all files were converted to the required encoding, if conversion was asked for). Exit code 1 is returned when Enca wasn't able to either guess the encoding of, or perform the conversion on, some input file, because it's not clever enough. Exit code 2 is returned in case of serious (e.g. I/O) troubles.

SECURITY
It should be possible to let Enca work unattended; that is its goal.
However:

There's no warranty the detection works 100%. Don't bet on it; you can easily lose valuable data.

Don't use enca (the program); link to libenca instead if you want anything resembling security. You then have to perform the eventual conversion yourself.

Don't use external converters. Ideally, disable them at compile time.

Be aware of ENCAOPT and all the built-in automagic guessing of various things from the environment, namely locales.

SEE ALSO
autoconvert(1), cstocs(1), file(1), iconv(1), iconv(3), nl_langinfo(3), map(1), piconv(1), recode(1), locale(5), locale(7), ltt(1), umap(1), unicode(7), utf-8(7), xcode(1)

KNOWN BUGS
It has too many unknown bugs.

The idea of using the LC_* value for the language is certainly braindead. However, I like it.

It can't back up files before mangling them.

In certain situations, it may behave incorrectly on >31bit file systems and/or over NFS (both untested, but this shouldn't cause problems in practice).

The built-in converter does not convert the character `ch' from KOI8-CS2, and possibly some other characters you've probably never heard about anyway.

EOL type recognition works poorly on Quoted-printable encoded files. This should be fixed someday.

There are no command line options to tune libenca parameters. This is intentional (Enca should DWIM), but sometimes it is a nuisance.

The manual page is too long, especially this section. This doesn't matter since nobody reads it anyway.

Send bug reports to <https://github.com/nijel/enca/issues>.

TRIVIA
Enca is the Extremely Naive Charset Analyser. Nevertheless, the `enc' originally comes from `encoding', so the leading `e' should be read as in `encoding', not as in `extreme'.

AUTHORS
David Necas (Yeti) <yeti@physics.muni.cz>
Michal Cihar <michal@cihar.com>

Unicode data has been generated from various (free) on-line resources or using GNU recode. Statistical data has been generated from various texts on the Net; I hope character counting doesn't break anyone's copyright.
ACKNOWLEDGEMENTS
Please see the file THANKS in the distribution.

COPYRIGHT
Copyright (C) 2000-2003 David Necas (Yeti).
Copyright (C) 2009 Michal Cihar <michal@cihar.com>.

Enca is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation. Enca is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Enca; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.