$doc = $parser->parse_file( $html_file_name [,\%opts] );
This method parses an HTML document from a file or the network;
$html_file_name can be either a filename or a
URL.
Options include 'encoding' to indicate file encoding (e.g.
'utf-8') and 'user_agent' which should be a blessed
"LWP::UserAgent" (or HTTP::Tiny)
object to be used when retrieving URLs.
If requesting a URL and the response Content-Type header
indicates an XML-based media type (such as XHTML), XML::LibXML::Parser
will be used automatically (instead of the tag soup parser). The XML
parser can be told to use a DTD catalogue by setting the option
'xml_catalogue' to the filename of the catalogue.
HTML (tag soup) parsing can be forced using the option
'force_html', even when an XML media type is returned.
If an options hashref was passed, parse_file will set
$options->{'parser_used'} to the name of the
class used to parse the document, so that the calling code can check
afterwards which parser was used. It will also set
$options->{'response'} to the HTTP::Response
object obtained by retrieving the URI.
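As a rough illustration of the options described above, the sketch below parses a local file and then inspects the hashref that was passed in. The temporary file and its contents are made up for the example, and the values printed depend on the installed version, so none are shown.

```perl
use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small HTML file to parse (hypothetical sample content).
my ($fh, $html_file_name) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} '<!DOCTYPE html><title>X</title><p>Hello';
close $fh;

my $parser = HTML::HTML5::Parser->new;

# Keep our own reference to the options hash so that we can inspect
# what parse_file stored in it afterwards.
my %opts = (encoding => 'utf-8');
my $doc  = $parser->parse_file($html_file_name, \%opts);

say ref $doc;                            # class of the returned document
say $opts{parser_used} // '(not set)';   # parser class that handled the input
say defined $opts{response} ? 'response recorded' : 'no response recorded';
```

Passing a reference to a named hash, rather than an anonymous hashref literal, is what makes 'parser_used' and 'response' readable after the call.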
$fragment = $parser->parse_balanced_chunk( $string [,\%opts] );
This method is roughly equivalent to XML::LibXML's method of
the same name but, unlike XML::LibXML's and despite its name, it does
not require the chunk to be "balanced". This method is somewhat
black magic, but should work and do the proper thing in most cases. Of
course, the proper thing might not be what you'd expect! I'll try to
keep this explanation as brief as possible...
Consider the following string:
<b>Hello</b></td></tr> <i>World</i>
What is the proper way to parse that? If it were found in a
document like this:
<html>
<head><title>X</title></head>
<body>
<div>
<b>Hello</b></td></tr> <i>World</i>
</div>
</body>
</html>
Then the document would end up equivalent to the following
XHTML:
<html>
<head><title>X</title></head>
<body>
<div>
<b>Hello</b> <i>World</i>
</div>
</body>
</html>
The superfluous
"</td></tr>" is simply
ignored. However, if it were found in a document like this:
<html>
<head><title>X</title></head>
<body>
<table><tbody><tr><td>
<b>Hello</b></td></tr> <i>World</i>
</td></tr></tbody></table>
</body>
</html>
Then the result would be:
<html>
<head><title>X</title></head>
<body>
<i>World</i>
<table><tbody><tr><td>
<b>Hello</b></td></tr>
</tbody></table>
</body>
</html>
Yes,
"<i>World</i>" gets
hoisted up before the "<table>".
This is weird, I know, but it's how browsers do it in real life.
So what should:
$string = q{<b>Hello</b></td></tr> <i>World</i>};
$fragment = $parser->parse_balanced_chunk($string);
actually return? Well, you can choose...
$string = q{<b>Hello</b></td></tr> <i>World</i>};
$frag1 = $parser->parse_balanced_chunk($string, {within=>'div'});
say $frag1->toString; # <b>Hello</b> <i>World</i>
$frag2 = $parser->parse_balanced_chunk($string, {within=>'td'});
say $frag2->toString; # <i>World</i><b>Hello</b>
If you don't pass a "within" option, then the chunk
is parsed as if it were within a
"<div>" element. This is often
the most sensible option. If you pass something like
"{ within => "foobar" }"
where "foobar" is not a real HTML element name (as found in
the HTML5 spec), then this method will croak; if you pass the name of a
void element (e.g. "br" or
"meta") then this method will croak;
there are a handful of other unsupported elements which will croak
(namely: "noscript",
"noembed",
"noframes").
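Because unsupported "within" values croak, calling code that takes the container name from elsewhere may want to guard the call. A minimal sketch, assuming only the behaviour described above (the %error_for hash is just a name chosen for this example):

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# No "within" option: the chunk is parsed as if inside a <div>.
my $default = $parser->parse_balanced_chunk($string);
say $default->toString;

# Unsupported containers croak, so wrap each attempt in eval.
my %error_for;
for my $container (qw(foobar br noscript)) {
    my $ok = eval {
        $parser->parse_balanced_chunk($string, { within => $container });
        1;
    };
    $error_for{$container} = $ok ? undef : ($@ || 'unknown error');
    say "$container: ", defined $error_for{$container} ? 'croaked' : 'parsed';
}
```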
Note that the second time around, although we parsed the
string "as if it were within a
"<td>" element", the
"<i>World</i>" bit did not
strictly end up within the
"<td>" element (not even within
the "<table>" element!) yet it
still gets returned. We'll call things such as this
"outliers". There is a "force_within" option which
tells parse_balanced_chunk to ignore outliers:
$frag3 = $parser->parse_balanced_chunk($string,
{force_within=>'td'});
say $frag3->toString; # <b>Hello</b>
There is a boolean option "mark_outliers" which
marks each outlier with an attribute
("data-perl-html-html5-parser-outlier")
to indicate its outlier status. Clearly, this is ignored when you use
"force_within" because no outliers are returned. Some outliers
may be XML::LibXML::Text nodes; text nodes don't have attributes, so
these will not be marked.
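One way to separate outliers from the rest is to call the method in list context and test each element for the marker attribute. This is only a sketch based on the description above; exactly which nodes come back, and in what order, may vary between versions.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};
my $attr   = 'data-perl-html-html5-parser-outlier';

# List context returns the individual nodes rather than a fragment.
my @nodes = $parser->parse_balanced_chunk(
    $string, { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    if ($node->isa('XML::LibXML::Element')) {
        my $status = $node->hasAttribute($attr) ? 'outlier' : 'inside';
        say $node->nodeName, ": $status";
    }
    else {
        # Text nodes cannot carry the marker attribute.
        say 'non-element node (cannot be marked)';
    }
}
```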
A last note is to mention what gets returned by this method.
Normally it's an XML::LibXML::DocumentFragment object, but if you call
the method in list context, a list of the individual nodes is
returned. Alternatively you can request the data to be returned as an
XML::LibXML::NodeList object:
# Get an XML::LibXML::NodeList
my $list = $parser->parse_balanced_chunk($str, {as=>'list'});
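The three return forms can be compared side by side. The sketch below assumes nothing beyond the behaviour just described; the sample string is invented for the example.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $str    = q{<b>Hello</b> <i>World</i>};

# Scalar context: a single XML::LibXML::DocumentFragment.
my $fragment = $parser->parse_balanced_chunk($str);
say ref $fragment;

# List context: the individual nodes.
my @nodes = $parser->parse_balanced_chunk($str);
say scalar(@nodes), ' node(s)';

# as => 'list': an XML::LibXML::NodeList object.
my $list = $parser->parse_balanced_chunk($str, { as => 'list' });
say $list->size, ' node(s) in list';
say $_->toString for $list->get_nodelist;
```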
The exact implementation of this method may change from
version to version, but the long-term goal will be to approach how
common desktop browsers parse HTML fragments when implementing the
setter for DOM's "innerHTML"
attribute.