- encoding_from_content_type($content_type)
- Takes a byte string, uses HTTP::Headers::Util to extract the charset
parameter from the "Content-Type" header value, and returns that
parameter's value. Returns "undef" (or an empty list in list context) if
there is no such value. Only the first component is examined (HTTP/1.1
allows only one component), backslash escapes in strings are unescaped,
leading and trailing quote marks and white-space characters are removed,
runs of white-space are collapsed to a single space, empty charset values
are ignored, and no case folding is performed.
Examples:
+-----------------------------------------+-----------+
| encoding_from_content_type(...)         | returns   |
+-----------------------------------------+-----------+
| "text/html"                             | undef     |
| "text/html,text/plain;charset=utf-8"    | undef     |
| "text/html;charset="                    | undef     |
| "text/html;charset=\"\\u\\t\\f\\-\\8\"" | 'utf-8'   |
| "text/html;charset=utf\\-8"             | 'utf\\-8' |
| "text/html;charset='utf-8'"             | 'utf-8'   |
| "text/html;charset=\" UTF-8 \""         | 'UTF-8'   |
+-----------------------------------------+-----------+
If you pass a string with the UTF-8 flag turned on, the string
will be converted to bytes before it is passed to HTTP::Headers::Util.
The return value will thus never have the UTF-8 flag turned on (this
might change in future versions).
- encoding_from_byte_order_mark($octets [, %options])
- Takes a sequence of octets and attempts to read a byte order mark at the
beginning of the octet sequence. It will go through the list of
$options{encodings} or the list of default
encodings if no encodings are specified and match the beginning of the
string against any byte order mark octet sequence found.
The result can be ambiguous; for example, qq(\xFF\xFE\x00\x00)
could be either a complete UTF-32LE BOM or a UTF-16LE BOM followed by
a U+0000 character. It is also possible that
$octets starts with something that looks like a
byte order mark but actually is not.
encoding_from_byte_order_mark sorts the list of possible
encodings by the length of their BOM octet sequence and returns in
scalar context only the encoding with the longest match, and all
encodings ordered by length of their BOM octet sequence in list
context.
Examples:
+-------------------------+------------+-----------------------+
| Input                   | Encodings  | Result                |
+-------------------------+------------+-----------------------+
| "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE)          |
| "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE UTF-16LE) |
| "\xEF\xBB\xBF"          | default    | qw(UTF-8)             |
| "Hello World!"          | default    | undef                 |
| "\xDD\x73\x66\x73"      | default    | undef                 |
| "\xDD\x73\x66\x73"      | UTF-EBCDIC | qw(UTF-EBCDIC)        |
| "\x2B\x2F\x76\x38\x2D"  | default    | undef                 |
| "\x2B\x2F\x76\x38\x2D"  | UTF-7      | qw(UTF-7)             |
+-------------------------+------------+-----------------------+
The first two rows show the same input in scalar and in list context,
respectively.
Note however that for UTF-7 it is in theory possible that the
U+FEFF combines with other characters in which case such detection would
fail, for example consider:
+--------------------------------------+-----------+-----------+
| Input                                | Encodings | Result    |
+--------------------------------------+-----------+-----------+
| "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | default   | undef     |
| "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | UTF-7     | undef     |
+--------------------------------------+-----------+-----------+
This might change in future versions, although it is not
very relevant for most applications, as there should never be a need to use
UTF-7 in the encoding list for existing documents.
If no BOM can be found it returns
"undef" in scalar context and an empty
list in list context. This routine should not be used with strings with
the UTF-8 flag turned on.
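The longest-match rule described above can be sketched in a few lines of
core Perl. This is an illustration only, not the module's implementation:
the %BOM table and the bom_candidates helper are hypothetical names, and
the table covers just five encodings rather than the module's default list.

```perl
use strict;
use warnings;

# Hypothetical BOM table for illustration; the module's defaults may differ.
my %BOM = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
    'UTF-32BE' => "\x00\x00\xFE\xFF",
);

# Hypothetical helper: returns every encoding whose BOM prefixes $octets,
# longest BOM first, in list context; only the longest match in scalar
# context; undef or an empty list if nothing matches.
sub bom_candidates {
    my ($octets) = @_;
    my @found = sort { length($BOM{$b}) <=> length($BOM{$a}) or $a cmp $b }
                grep { index($octets, $BOM{$_}) == 0 }
                keys %BOM;
    return unless @found;
    return wantarray ? @found : $found[0];
}

my @all  = bom_candidates("\xFF\xFE\x00\x00");   # ('UTF-32LE', 'UTF-16LE')
my $best = bom_candidates("\xFF\xFE\x00\x00");   # 'UTF-32LE'
```

Sorting by BOM length is what makes the ambiguous "\xFF\xFE\x00\x00" case
resolve to UTF-32LE in scalar context while still reporting UTF-16LE in
list context.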
- encoding_from_xml_declaration($declaration)
- Attempts to extract the value of the encoding pseudo-attribute in an XML
declaration or text declaration in the character string
$declaration. If there does not appear to be such
a value it returns nothing. This would typically be used with the return
values of xml_declaration_from_octets. White-space is normalized in the
same way as in encoding_from_content_type.
Examples:
+-------------------------------------------+---------+
| encoding_from_xml_declaration(...)        | Result  |
+-------------------------------------------+---------+
| "<?xml version='1.0' encoding='utf-8'?>"  | 'utf-8' |
| "<?xml encoding='utf-8'?>"                | 'utf-8' |
| "<?xml encoding=\"utf-8\"?>"              | 'utf-8' |
| "<?xml foo='bar' encoding='utf-8'?>"      | 'utf-8' |
| "<?xml encoding='a' encoding='b'?>"       | 'a'     |
| "<?xml encoding=' a b '?>"                | 'a b'   |
| "<?xml-stylesheet encoding='utf-8'?>"     | undef   |
| " <?xml encoding='utf-8'?>"               | undef   |
| "<?xml encoding =\x{2028}'utf-8'?>"       | 'utf-8' |
| "<?xml version='1.0' encoding=utf-8?>"    | undef   |
| "<?xml x='encoding=\"a\"' encoding='b'?>" | 'a'     |
+-------------------------------------------+---------+
Note that encoding_from_xml_declaration() determines
the encoding even if the XML declaration is not well-formed or violates
other requirements of the relevant XML specification as long as it can
find an encoding pseudo-attribute in the provided string. This means XML
processors must apply further checks to determine whether the entity is
well-formed, etc.
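To make the lenient matching concrete, here is a minimal sketch of a
routine with similar behaviour for several of the rows above. It is an
assumption about how such matching could work, not the module's code;
sketch_xml_encoding is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: finds the first quoted
# encoding pseudo-attribute without checking well-formedness.
sub sketch_xml_encoding {
    my ($decl) = @_;

    # The string must begin with "<?xml" followed by white-space.
    return unless $decl =~ /^<\?xml\s/;

    # The first quoted value after "encoding" wins, wherever it occurs.
    return unless $decl =~ /encoding\s*=\s*(["'])(.*?)\1/s;
    my $value = $2;

    $value =~ s/^\s+|\s+$//g;   # trim leading and trailing white-space
    $value =~ s/\s+/ /g;        # collapse runs of white-space
    return length($value) ? $value : ();
}

my $enc = sketch_xml_encoding(q{<?xml version='1.0' encoding='utf-8'?>});
```

Because the pattern is applied to the whole string, a value nested inside
another pseudo-attribute is picked up first, which is why callers still
need their own well-formedness checks.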
- xml_declaration_from_octets($octets [, %options])
- Attempts to find a ">" character in the byte string $octets using each of
the suspected encodings and, upon success, attempts to find a preceding
"<" character. Returns all the strings found this way, ordered by number
of successful matches, in list context and the best match in scalar
context. Should probably be combined with the only user of this routine,
encoding_from_xml_declaration. You can modify the list of suspected
encodings using $options{encodings}.
- encoding_from_first_chars($octets [, %options])
- Assuming that documents start with "<" optionally preceded by
whitespace characters, encoding_from_first_chars attempts to determine an
encoding by matching $octets against something
like /^[@{$options{whitespace}}]*</ in the various suspected
$options{encodings}.
This is useful to distinguish e.g. UTF-16LE from UTF-8 if the
byte string starts with neither a byte order mark nor an XML declaration
(e.g. if the document is an HTML document) to get at least a base
encoding which can be used to decode enough of the document to find
<meta> elements using encoding_from_meta_element.
$options{whitespace} defaults to qw/CR LF SP
TB/. Returns nothing if unsuccessful. Returns the matching encodings in
order of the number of octets matched in list context and the best match
in scalar context.
Examples:
+---------------+----------+---------------------+
| String        | Encoding | Result              |
+---------------+----------+---------------------+
| '<!DOCTYPE '  | UTF-16LE | UTF-16LE            |
| ' <!DOCTYPE ' | UTF-16LE | UTF-16LE            |
| '...'         | UTF-16LE | undef               |
| '...<'        | UTF-16LE | undef               |
| '<'           | UTF-8    | ISO-8859-1 or UTF-8 |
| "<!--\xF6-->" | UTF-8    | ISO-8859-1 or UTF-8 |
+---------------+----------+---------------------+
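The idea of matching "optional white-space, then <" in a candidate
encoding can be sketched with the core Encode module. This is only an
illustration under assumptions: starts_with_lt is a hypothetical helper,
it tests a single encoding at a time, and it does not rank several
candidates the way the routine described above does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if $octets, read as $encoding, consists of
# optional CR/LF/SP/TAB characters followed by "<".
sub starts_with_lt {
    my ($octets, $encoding) = @_;
    my $lt = encode($encoding, '<');
    my @ws = map { encode($encoding, $_) } ("\x0D", "\x0A", "\x20", "\x09");

    # Skip encoded white-space characters for as long as one matches.
    my $pos      = 0;
    my $advanced = 1;
    while ($advanced) {
        $advanced = 0;
        for my $w (@ws) {
            next unless substr($octets, $pos, length($w)) eq $w;
            $pos += length($w);
            $advanced = 1;
        }
    }
    return substr($octets, $pos, length($lt)) eq $lt ? 1 : 0;
}

my $doc    = encode('UTF-16LE', " <!DOCTYPE ");
my $utf16  = starts_with_lt($doc, 'UTF-16LE');   # 1
my $as_utf8 = starts_with_lt($doc, 'UTF-8');     # 0
```

Comparing encoded octets rather than decoding the input keeps the check
cheap and avoids decoding errors on data that is not in the candidate
encoding.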
- encoding_from_meta_element($octets, $encname [, %options])
- Attempts to find <meta> elements in the document using HTML::Parser.
It will attempt to decode chunks of the byte string using
$encname to characters before passing the data to
HTML::Parser. An optional %options hash can be
provided which will be passed to the HTML::Parser constructor. It will
stop processing the document if it encounters
* </head>
* encoding errors
* the end of the input
* ... (see todo)
If relevant <meta> elements, i.e. something like
<meta http-equiv=Content-Type content='...'>
are found, uses encoding_from_content_type to extract the
charset parameter. It returns all such encodings it could find in
document order in list context or the first encoding in scalar context
(it will currently look for others regardless of calling context) or
nothing if that fails for some reason.
Note that there are many edge cases where this does not yield
"proper" results, depending on the capabilities of the
HTML::Parser version and the options you pass to it, for example,
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY content_type "text/html;charset=utf-8">
]>
<meta http-equiv="Content-Type" content="&content_type;">
<title></title>
<p>...</p>
This would likely not detect the
"utf-8" value if HTML::Parser does not
resolve the entity. This should however only be a concern for documents
specifically crafted to break the encoding detection.
- encoding_from_xml_document($octets [, %options])
- Uses encoding_from_byte_order_mark to detect the encoding using a byte
order mark in the byte string and returns the return value of that routine
if it succeeds. Uses xml_declaration_from_octets and
encoding_from_xml_declaration and returns the encoding for which the
latter routine found most matches in scalar context, and all encodings
ordered by number of occurrences in list context. It does not return a
value if neither a byte order mark nor inbound declarations declare a
character encoding.
Examples:
+----------------------------+----------+-----------+----------+
| Input                      | Encoding | Encodings | Result   |
+----------------------------+----------+-----------+----------+
| "<?xml?>"                  | UTF-16   | default   | UTF-16BE |
| "<?xml?>"                  | UTF-16LE | default   | undef    |
| "<?xml encoding='utf-8'?>" | UTF-16LE | default   | utf-8    |
| "<?xml encoding='utf-8'?>" | UTF-16   | default   | UTF-16BE |
| "<?xml encoding='cp37'?>"  | CP37     | default   | undef    |
| "<?xml encoding='cp37'?>"  | CP37     | CP37      | cp37     |
+----------------------------+----------+-----------+----------+
Lacking a return value from this routine and higher-level
protocol information (such as protocol encoding defaults) processors
would be required to assume that the document is UTF-8 encoded.
Note however that the return value depends on the set of
suspected encodings you pass to it. For example, by default, EBCDIC
encodings would not be considered and thus for
<?xml version='1.0' encoding='cp37'?>
this routine would return the undefined value. You can modify
the list of suspected encodings using
$options{encodings}.
- encoding_from_html_document($octets [, %options])
- Uses encoding_from_xml_document and encoding_from_meta_element to
determine the encoding of HTML documents. If
$options{xhtml} is set to a false value uses
encoding_from_byte_order_mark and encoding_from_meta_element to determine
the encoding. The xhtml option is on by default. The
$options{encodings} can be used to modify the
suspected encodings and $options{parser_options}
can be used to modify the HTML::Parser options in
encoding_from_meta_element (see the relevant documentation).
Returns nothing if no declaration could be found, the winning
declaration in scalar context and a list of encoding source and encoding
name in list context, see ENCODING SOURCES.
...
Other problems arise from differences between HTML and XHTML
syntax and encoding detection rules, for example, the input could be
Content-Type: text/html
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>
This is a perfectly legal HTML 4.01 document and
implementations might be expected to consider the document ISO-8859-2
encoded as XML rules for encoding detection do not apply to HTML
documents. This module avoids deciding which rules apply to a specific
document and would thus, by default, return 'utf-8'
for this input.
On the other hand, if the input omits the encoding
declaration,
Content-Type: text/html
<?xml version='1.0'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>
it would return 'iso-8859-2'. Similar problems would arise
from other differences between HTML and XHTML, for example consider
Content-Type: text/html
<?foo >
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html ...
?>
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
...
If this is processed using HTML rules, the first > will end
the processing instruction and the XHTML document type declaration would
be the relevant declaration for the document. If it is processed using
XHTML rules, the ?> will end the processing instruction and the HTML
document type declaration would be the relevant declaration.
In other words, an application would need to assume a certain character
encoding (family) to process enough of the document to determine whether
it is XHTML or HTML and the result of this detection would depend on
which processing rules are assumed in order to process it. It is thus in
essence not possible to write a "perfect" detection algorithm,
which is why this routine attempts to avoid making any decisions on this
matter.
- encoding_from_http_message($message [, %options])
- Determines the encoding of HTML / XML / XHTML documents enclosed in an HTTP
message. $message is an object compatible with
HTTP::Message, e.g. an HTTP::Response object.
%options is a hash with the following possible
entries: