MIME::Charset(3)    User Contributed Perl Documentation    MIME::Charset(3)
MIME::Charset - Charset Information for MIME
use MIME::Charset;
$charset = MIME::Charset->new("euc-jp");
Getting charset information:
$benc = $charset->body_encoding; # e.g. "Q"
$cset = $charset->as_string; # e.g. "US-ASCII"
$henc = $charset->header_encoding; # e.g. "S"
$cset = $charset->output_charset; # e.g. "ISO-2022-JP"
Translating text data:
($text, $charset, $encoding) =
$charset->header_encode(
"\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
"\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
Charset => 'euc-jp');
# ...returns e.g. (<converted>, "ISO-2022-JP", "B").
($text, $charset, $encoding) =
$charset->body_encode(
"Collectioneur path\xe9tiquement ".
"\xe9clectique de d\xe9chets",
Charset => 'latin1');
# ...returns e.g. (<original>, "ISO-8859-1", "QUOTED-PRINTABLE").
$len = $charset->encoded_header_len(
"Perl\xe8\xa8\x80\xe8\xaa\x9e",
Charset => 'utf-8',
Encoding => "b");
# ...returns e.g. 28.
Manipulating module defaults:
MIME::Charset::alias("csEUCKR", "euc-kr");
MIME::Charset::default("iso-8859-1");
MIME::Charset::fallback("us-ascii");
Non-OO functions (may be deprecated in the near future):
use MIME::Charset qw(:info);
$benc = body_encoding("iso-8859-2"); # "Q"
$cset = canonical_charset("ANSI X3.4-1968"); # "US-ASCII"
$henc = header_encoding("utf-8"); # "S"
$cset = output_charset("shift_jis"); # "ISO-2022-JP"
use MIME::Charset qw(:trans);
($text, $charset, $encoding) =
header_encode(
"\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
"\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
"euc-jp");
# ...returns (<converted>, "ISO-2022-JP", "B");
($text, $charset, $encoding) =
body_encode(
"Collectioneur path\xe9tiquement ".
"\xe9clectique de d\xe9chets",
"latin1");
# ...returns (<original>, "ISO-8859-1", "QUOTED-PRINTABLE");
$len = encoded_header_len(
"Perl\xe8\xa8\x80\xe8\xaa\x9e", "b", "utf-8"); # 28
MIME::Charset provides information about character sets used for MIME messages
on the Internet.
A charset is a ``character set'' in the MIME sense: a method of
converting a sequence of octets into a sequence of characters. It covers
both the ``coded character set'' (CCS) and the ``character encoding
scheme'' (CES) concepts of ISO/IEC.
An encoding, in the MIME sense, is a method of
representing a body part or a header body as sequence(s) of printable
US-ASCII characters.
- $charset = MIME::Charset->new([CHARSET [, OPTS]])
- Create a charset object.
OPTS may contain the following key-value pair. NOTE: When
Unicode/multibyte support is disabled (see "USE_ENCODE"),
conversion is not performed, so this option has no
effect.
- Mapping => MAPTYPE
- Selects whether the mappings used for charset names are extended.
"EXTENDED" uses extended mappings.
"STANDARD" uses standardized strict
mappings. The default is "EXTENDED".
- $charset->body_encoding
- body_encoding CHARSET
- Get the recommended transfer-encoding of CHARSET for the message body.
The returned value is one of
"B" (BASE64),
"Q" (QUOTED-PRINTABLE),
"S" (the shorter of the two) or
"undef" (may be left without
transfer-encoding; either 7BIT or 8BIT). This may differ from the encoding
for the message header.
- $charset->as_string
- canonical_charset CHARSET
- Get the canonical name of the charset.
- $charset->decoder
- Get the "Encode::Encoding" object that decodes strings to Unicode using
this charset. If the charset is not specified or is not known to this module,
undef is returned.
- $charset->dup
- Get a copy of the charset object.
- $charset->encoder([CHARSET])
- Get the "Encode::Encoding" object that encodes Unicode strings using the
compatible charset recommended for messages on the Internet.
If the optional CHARSET is specified, the encoder (and output
charset name) of the $charset object is replaced with those of
CHARSET. The $charset object then acts as a
converter between the original charset and the new CHARSET.
- $charset->header_encoding
- header_encoding CHARSET
- Get the recommended encoding scheme of CHARSET for the message header.
The returned value is one of
"B",
"Q",
"S" (the shorter of the two) or
"undef" (may be left unencoded). This
may differ from the encoding for the message body.
- $charset->output_charset
- output_charset CHARSET
- Get a charset that is compatible with the given CHARSET and is recommended
for MIME messages on the Internet (if this module knows of one).
When Unicode/multibyte support is disabled (see
"USE_ENCODE"), this function simply returns the result of
"canonical_charset".
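The accessor methods described above can be combined into a small profile lookup. The following is a minimal sketch, assuming MIME::Charset is installed; charset_profile is a hypothetical helper name used only for illustration and is not part of the module.

```perl
use strict;
use warnings;
use MIME::Charset;

# Hypothetical helper: gather what the accessors report for one charset.
sub charset_profile {
    my ($name) = @_;
    my $cs = MIME::Charset->new($name);
    return {
        canonical => $cs->as_string,
        header    => $cs->header_encoding,    # "B", "Q", "S" or undef
        body      => $cs->body_encoding,      # "B", "Q", "S" or undef
        output    => $cs->output_charset,
    };
}

for my $name (qw(utf-8 shift_jis iso-8859-2)) {
    my $p = charset_profile($name);
    # Print undef values as the word "undef" to avoid warnings.
    my @fields = map { (defined $_) ? $_ : 'undef' }
        @{$p}{qw(canonical header body output)};
    printf "%-12s header=%s body=%s output=%s\n", @fields;
}
```

Because header_encoding and body_encoding may legitimately return undef, callers should test the result with defined before comparing it.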
- $charset->body_encode(STRING [, OPTS])
- body_encode STRING, CHARSET [, OPTS]
- Get the converted (if needed) data of STRING and the recommended
transfer-encoding of that data for the message body. CHARSET is the charset
in which STRING is encoded.
OPTS may contain the following key-value pairs. NOTE: When
Unicode/multibyte support is disabled (see "USE_ENCODE"),
conversion is not performed, so these options have no
effect.
- Detect7bit => YESNO
- Try to auto-detect a 7-bit charset when CHARSET is not given. The default is
"YES".
- Replacement => REPLACEMENT
- Specifies the error handling scheme. See "Error Handling".
A 3-item list of (converted string, charset for
output, transfer-encoding) is returned.
The transfer-encoding is one of
"BASE64",
"QUOTED-PRINTABLE",
"7BIT" or
"8BIT". If the charset for output could
not be determined and the converted string contains non-ASCII byte(s),
the charset for output is "undef"
and the transfer-encoding is
"BASE64". The charset for output is
"US-ASCII" if and only if the string does
not contain any non-ASCII bytes.
- $charset->decode(STRING [,CHECK])
- Decode STRING to Unicode.
Note: When Unicode/multibyte support is disabled (see
"USE_ENCODE"), this function will die.
- detect_7bit_charset STRING
- Guess the 7-bit charset that may encode the string STRING. If STRING contains
any 8-bit bytes, "undef" is
returned. Otherwise, if the charset cannot be determined, the default
charset is returned.
- $charset->encode(STRING [, CHECK])
- Encode STRING (Unicode or non-Unicode) using the compatible charset
recommended for messages on the Internet (if this module knows of one).
Note that the string is decoded to Unicode and then encoded even if the
compatible charset is the same as the original charset.
Note: When Unicode/multibyte support is disabled (see
"USE_ENCODE"), this function will die.
- $charset->encoded_header_len(STRING [, ENCODING])
- encoded_header_len STRING, ENCODING, CHARSET
- Get the length of the encoded STRING for a message header (without folding).
ENCODING may be one of "B", "Q" or "S" (the shorter of "B" and "Q").
- $charset->header_encode(STRING [, OPTS])
- header_encode STRING, CHARSET [, OPTS]
- Get the converted (if needed) data of STRING and the recommended encoding
scheme of that data for message headers. CHARSET is the charset in which
STRING is encoded.
OPTS may contain the following key-value pairs. NOTE: When
Unicode/multibyte support is disabled (see "USE_ENCODE"),
conversion is not performed, so these options have no
effect.
- Detect7bit => YESNO
- Try to auto-detect a 7-bit charset when CHARSET is not given. The default is
"YES".
- Replacement => REPLACEMENT
- Specifies the error handling scheme. See "Error Handling".
A 3-item list of (converted string, charset for
output, encoding scheme) is returned. The encoding scheme
is one of "B",
"Q" or
"undef" (may be left unencoded). If the
charset for output could not be determined and the converted
string contains non-ASCII byte(s), the charset for output is
"8BIT" (this is not a charset name
but a special value representing unencodable data) and the encoding
scheme is "undef" (should not be
encoded). The charset for output is
"US-ASCII" if and only if the string does not
contain any non-ASCII bytes.
- $charset->undecode(STRING [,CHECK])
- Encode the Unicode string STRING to a byte string using the input charset of
$charset. This is equivalent to
"$charset->decoder->encode()".
Note: When Unicode/multibyte support is disabled (see
"USE_ENCODE"), this function will die.
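header_encode only reports which scheme to use; applying the scheme is left to the caller. The sketch below shows one way to turn its 3-item result into a single RFC 2047 encoded-word. It assumes MIME::Charset and the core MIME::Base64 module are available; encoded_word is a hypothetical helper, the Q branch is a deliberately conservative hand-written encoder, and no line folding or splitting into multiple encoded-words is attempted.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);
use MIME::Charset;

# Hypothetical helper: build one unfolded encoded-word from a byte string.
sub encoded_word {
    my ($bytes, $charset_name) = @_;
    my $cs = MIME::Charset->new($charset_name);
    my ($data, $out_charset, $scheme) = $cs->header_encode($bytes);

    # An undefined scheme means the data should not be encoded
    # (plain US-ASCII, or the special "8BIT" case for unencodable data).
    return $data unless defined $scheme;

    my $payload;
    if ($scheme eq 'B') {
        $payload = encode_base64($data, '');    # '' suppresses the newline
    } else {
        # Conservative Q encoding: keep only letters, digits and ! * + - /
        # and write every other byte as =XX.
        ($payload = $data) =~ s/([^A-Za-z0-9!*+\-\/])/sprintf('=%02X', ord $1)/ge;
    }
    return "=?$out_charset?$scheme?$payload?=";
}

# The UTF-8 bytes from the synopsis ("Perl" followed by two kanji).
print encoded_word("Perl\xe8\xa8\x80\xe8\xaa\x9e", "utf-8"), "\n";
print encoded_word("Hello", "us-ascii"), "\n";
```

Real header generation also has to respect the 75-character limit on encoded-words, which is what encoded_header_len is meant to help with.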
- alias ALIAS [, CHARSET]
- Get/set a charset alias for the canonical names determined by
"canonical_charset".
If CHARSET is given and is not false, ALIAS is assigned as
an alias of CHARSET. Otherwise, the alias is not changed. In both cases,
the charset name that ALIAS currently refers to is returned.
- default [CHARSET]
- Get/set the default charset.
The default charset is used by this module when the charset
context is unknown. Modules using this module are encouraged to use
this charset when the charset context is unknown or an implicit default is
expected. By default, it is
"US-ASCII".
If CHARSET is given and is not false, it becomes the default
charset. Otherwise, the default charset is not changed. In both cases,
the current default charset is returned.
NOTE: The default charset should not be changed.
- fallback [CHARSET]
- Get/set the fallback charset.
The fallback charset is used by this module when conversion
by the given charset fails and the
"FALLBACK" error handling scheme is
specified. Modules using this module may use this charset as a last resort
for conversion. By default, it is
"UTF-8".
If CHARSET is given and is not false, it becomes the
fallback charset. If CHARSET is
"NONE", the fallback charset becomes
undefined. Otherwise, the fallback charset is not changed. In any case,
the current fallback charset is returned.
NOTE: Specifying
"US-ASCII" as the fallback
charset is useful, since the result of conversion remains readable without
charset information.
- recommended CHARSET [, HEADERENC, BODYENC [, ENCCHARSET]]
- Get/set charset profiles.
If optional arguments are given and any of them is not false,
the profile for CHARSET is set from those arguments. Otherwise, the profile
is not changed. In both cases, the current profile for CHARSET is
returned as a 3-item list of (HEADERENC, BODYENC, ENCCHARSET).
HEADERENC is the recommended encoding scheme for the message header.
It may be one of "B",
"Q",
"S" (the shorter of the two) or
"undef" (may be left unencoded).
BODYENC is the recommended transfer-encoding for the message body. It
may be one of "B",
"Q",
"S" (the shorter of the two) or
"undef" (may be left without
transfer-encoding).
ENCCHARSET is a charset that is compatible with the given CHARSET
and is recommended for MIME messages on the Internet. If
conversion is not needed (or this module does not know an appropriate
charset), ENCCHARSET is "undef".
NOTE: In future releases this function may accept
more optional arguments (for example, properties to handle character
widths, line folding behavior, ...), so the format of the returned value
may change. Use "header_encoding",
"body_encoding" or "output_charset" to get
a particular profile item.
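Since alias, default and fallback change module-wide state and always return the current value, a caller can save and restore a setting around a piece of work. The sketch below illustrates this for the fallback charset; with_fallback is a hypothetical helper, not something the module provides.

```perl
use strict;
use warnings;
use MIME::Charset;

# Hypothetical helper: run $code with a temporary fallback charset,
# then restore whatever was configured before.
sub with_fallback {
    my ($charset_name, $code) = @_;
    my $previous = MIME::Charset::fallback();    # no argument: query only
    MIME::Charset::fallback($charset_name);      # set the temporary value
    my @result = eval { $code->() };
    my $error  = $@;
    # "NONE" undefines the fallback; use it when none was set before.
    MIME::Charset::fallback($previous || "NONE");
    die $error if $error;
    return @result;
}

# Registering an alias returns the charset name it now refers to.
my $canonical = MIME::Charset::alias("csEUCKR", "euc-kr");
print "csEUCKR is an alias of $canonical\n";

my ($inside) = with_fallback("us-ascii", sub { MIME::Charset::fallback() });
print "fallback inside the block: $inside\n";
```

The default charset is deliberately left untouched here, in line with the note that it should not be changed.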
- USE_ENCODE
- Unicode/multibyte support flag. It is a non-empty string when Unicode
and multibyte support is enabled. Currently, this flag is non-empty
on Perl 5.7.3 or later and an empty string on earlier versions of Perl.
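Because decode, encode and undecode die when Unicode/multibyte support is disabled, code that must run on very old Perls can check the flag first. The following is a minimal sketch; safe_decode is a hypothetical helper, and calling the flag as MIME::Charset::USE_ENCODE() assumes it is reachable as a package-level constant.

```perl
use strict;
use warnings;
use MIME::Charset;

# Hypothetical helper: decode bytes to a Unicode string when the module
# reports Unicode support, otherwise hand the bytes back unchanged.
sub safe_decode {
    my ($bytes, $charset_name) = @_;
    # Assumption: USE_ENCODE can be called as a package-level constant.
    return $bytes unless MIME::Charset::USE_ENCODE();
    my $cs = MIME::Charset->new($charset_name);
    return $cs->decode($bytes);
}

my $text = safe_decode("caf\xc3\xa9", "utf-8");
printf "%d characters\n", length $text;
```

Returning the undecoded bytes is only one possible policy; a caller may prefer to report the missing support instead.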
"body_encode" and "header_encode" accept the following values for the
"Replacement" option:
- "DEFAULT"
- Put a substitution character in place of a malformed character. For
UCM-based encodings, <subchar> is used.
- "FALLBACK"
- Try the "DEFAULT" scheme using the fallback
charset (see "fallback"). When the fallback charset is undefined
and the conversion causes an error, the code dies with an error
message.
- "CROAK"
- The code dies immediately on error with an error message. You
should therefore trap the fatal error with eval{} unless you really want
it to die. A synonym is "STRICT".
- "PERLQQ"
- "HTMLCREF"
- "XMLCREF"
- Use the "FB_PERLQQ",
"FB_HTMLCREF" or
"FB_XMLCREF" scheme defined by the Encode
module.
- numeric values
- Numeric values are also allowed. For more details see "Handling
Malformed Data" in Encode.
If the error handling scheme is not specified or an unknown scheme is
specified, "DEFAULT" is assumed.
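The effect of the different schemes is easiest to see by converting a string that the target charset cannot represent. The sketch below wraps body_encode in eval, as recommended for "CROAK"; try_body_encode is a hypothetical helper, and the sample text is an assumption chosen because U+263A has no ISO-8859-1 mapping.

```perl
use strict;
use warnings;
use MIME::Charset;

# Hypothetical helper: convert $text under the given Replacement scheme.
# Returns (1, $converted, $charset, $encoding) on success,
# or (0, $error_message) when the conversion dies.
sub try_body_encode {
    my ($text, $charset_name, $scheme) = @_;
    my $cs  = MIME::Charset->new($charset_name);
    my @out = eval { $cs->body_encode($text, Replacement => $scheme) };
    return (0, $@) if $@;
    return (1, @out);
}

# A Unicode string: U+00E9 fits ISO-8859-1, U+263A does not.
my $text = "caf\x{e9} \x{263a}";

for my $scheme (qw(DEFAULT PERLQQ CROAK)) {
    my ($ok, @rest) = try_body_encode($text, "iso-8859-1", $scheme);
    print "$scheme: ", ($ok ? "converted" : "died"), "\n";
}
```

Which scheme is appropriate depends on whether silently substituted characters are acceptable in the outgoing message.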
Built-in defaults for option parameters can be overridden by a configuration
file, MIME/Charset/Defaults.pm. For more details, read
MIME/Charset/Defaults.pm.sample.
For the version of this module, consult the $VERSION variable.
Development versions of this module may be found at
<http://hatuka.nezumi.nu/repos/MIME-Charset/>.
- Release 1.001
- The new() method returns an object when the CHARSET argument is not
specified.
- Release 1.005
- Restrict characters in encoded-words according to RFC 2047 section 5 (3).
This also affects the return value of the encoded_header_len() method.
- Release 1.008.2
- The body_encoding() method may also return
"S".
- The return value of the body_encode() method for UTF-8 may include the
"QUOTED-PRINTABLE" encoding item, which in
earlier versions was fixed to
"BASE64".
See also: Multipurpose Internet Mail Extensions (MIME).
Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>
Copyright (C) 2006-2017 Hatuka*nezumi - IKEDA Soji. This program is free
software; you can redistribute it and/or modify it under the same terms as
Perl itself.