|
NAME

HTML::Parser::Simple - Parse nice HTML files without needing a compiler

Synopsis

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use HTML::Parser::Simple;

    # -------------------------

    # Method 1:

    my($p) = HTML::Parser::Simple -> new
    (
        input_file  => 'data/s.1.html',
        output_file => 'data/s.2.html',
    );

    $p -> parse_file;

    # Method 2:

    my($p) = HTML::Parser::Simple -> new;

    $p -> parse_file('data/s.1.html', 'data/s.2.html');

    # Method 3:

    my($p) = HTML::Parser::Simple -> new;

    print $p -> parse('<html>...</html>') -> traverse($p -> root) -> result;

Of course, these can be abbreviated by using method chaining. E.g. Method 2 could be:

    HTML::Parser::Simple -> new -> parse_file('data/s.1.html', 'data/s.2.html');

See scripts/parse.html.pl and scripts/parse.xhtml.pl.

Description

"HTML::Parser::Simple" is a pure Perl module.

It parses HTML V 4 files, and generates a tree of nodes, with 1 node per HTML tag.

The data associated with each node is documented in the "FAQ".

See also HTML::Parser::Simple::Attributes and HTML::Parser::Simple::Reporter.

Distributions

This module is available as a Unix-style distro (*.tgz).

See <http://savage.net.au/Perl-modules.html> for details.

See <http://savage.net.au/Perl-modules/html/installing-a-module.html> for help on unpacking and installing.

Constructor and initialization

new(...) returns an object of type "HTML::Parser::Simple".

This is the class constructor.

Usage: "HTML::Parser::Simple -> new".

This method takes a hash of options.

Call "new()" as "new(option_1 => value_1, option_2 => value_2, ...)".

Available options (each one of which is also a method):
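A minimal sketch of passing options to "new()", using only the four names this document describes as parameters to "new()" (input_file, output_file, verbose, xhtml). The file names are the ones from the Synopsis, and the chosen values are illustrative assumptions, not recommended defaults.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Parser::Simple;

# Options are passed as a hash of name => value pairs.
my($p) = HTML::Parser::Simple -> new
(
    input_file  => 'data/s.1.html',
    output_file => 'data/s.2.html',
    verbose     => 1, # log() will print to STDERR.
    xhtml       => 0,
);

# Each option is also a method, so it can be read back...
print 'Input file: ', $p -> input_file, "\n";

# ...or changed after the object has been created.
$p -> xhtml(1);
```

Setting an option through its method after construction is equivalent in intent to passing it to "new()"; in both cases explicit parameters to "parse_file()" still take precedence over the file names.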
Methods

block()

Returns a hashref where the keys are the names of block-level HTML tags.

The corresponding values in the hashref are just 1.

Typical keys: address, form, p, table, tr.

Note: Some keys, e.g. tr, are also returned by "self_close()".

current_node()

Returns the Tree::Simple object which the parser calls the current node.

depth()

Returns the nesting depth of the current tag.

The method is just here in case you need it.

empty()

Returns a hashref where the keys are the names of HTML tags of type empty.

The corresponding values in the hashref are just 1.

Typical keys: area, base, input, wbr.

inline()

Returns a hashref where the keys are the names of HTML tags of type inline.

The corresponding values in the hashref are just 1.

Typical keys: a, em, img, textarea.

input_file($in_file_name)

Gets or sets the input file name used by "parse_file($input_file_name, $output_file_name)".

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

'input_file' is a parameter to "new()". See "Constructor and initialization" for details.

log($msg)

Prints $msg to STDERR if "new()" was called as "new(verbose => 1)", or if "$p -> verbose(1)" was called.

Otherwise, prints nothing.

new()

This is the constructor. See "Constructor and initialization" for details.

node_type()

Returns the type of the most recently created node: global, head, or body.

See the first question in the "FAQ" for details.

output_file($out_file_name)

Gets or sets the output file name used by "parse_file($input_file_name, $output_file_name)".

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".
'output_file' is a parameter to "new()". See "Constructor and initialization" for details.

parse($html)

Returns the invocant. Thus "$p -> parse" returns $p. This allows for method chaining. See the "Synopsis".

Parses the string of HTML in $html, and builds a tree of nodes.

After calling "$p -> parse($html)", you must call "$p -> traverse($p -> root)" before calling "$p -> result".

Alternatively, use "$p -> parse_file", which calls all these methods for you.

Note: "parse()" may be called directly or via "parse_file()".

parse_file($input_file_name, $output_file_name)

Returns the invocant. Thus "$p -> parse_file" returns $p. This allows for method chaining. See the "Synopsis".

Parses the HTML in the input file, and writes the result to the output file.

"parse_file()" calls "parse($html)" and "traverse($node)", using "$p -> root" for $node.

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

Lastly, the parameters passed in to "parse_file($input_file_name, $output_file_name)" are used to update the internal values set with the input_file and output_file parameters passed in to "new()", or set with calls to "input_file($in_file_name)" and "output_file($out_file_name)".

result()

Returns the string which is the result of the parse.

See scripts/parse.html.pl.

root()

Returns the Tree::Simple object which the parser calls the root of the tree of nodes.

self_close()

Returns a hashref where the keys are the names of HTML tags of type self close.

The corresponding values in the hashref are just 1.

Typical keys: dd, dt, p, tr.

Note: Some keys, e.g. tr, are also returned by "block()".

tagged_attribute()

Returns a string to be used as a regexp, to capture tags and their optional attributes.

It does not return qr/$s/; it just returns $s.
This regexp takes one of two forms, depending on the state of the xhtml option. See "xhtml($Boolean)". The regexp has four (4) sets of capturing parentheses:
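Because "tagged_attribute()" returns a plain string rather than a compiled regexp, the caller has to compile it before (or while) matching. The sketch below shows only that step. The pattern in $s is a deliberately simplified, hypothetical stand-in with two capture groups; it is not the string the module returns, and it says nothing about what the module's four groups capture.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the string returned by tagged_attribute().
# This is NOT the module's actual pattern.
my($s) = q{<(\w+)((?:\s+[^>]*)?)>};

# Compile the string once, so it can be reused across many matches.
my($re) = qr/$s/;

my($html) = q{<a href="#NAME">NAME</a>};

# In list context a successful match returns the captured groups.
my($tag, $attributes) = $html =~ $re;

print "Tag: $tag. Attributes:$attributes\n" if (defined $tag);
```

Compiling with qr// up front is a design choice for the caller: the same string could equally be interpolated directly into a match, but then it is recompiled whenever the string changes, e.g. after the xhtml option is toggled.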
traverse($node)

Returns the invocant. Thus "$p -> traverse" returns $p. This allows for method chaining. See the "Synopsis".

Traverses the tree of nodes, starting at $node.

You normally call this as "$p -> traverse($p -> root)", to ensure all nodes are visited.

See the "Synopsis" for sample code.

Or, see scripts/traverse.file.pl, which uses HTML::Parser::Simple::Reporter, and calls "traverse($node)" via "traverse_file($input_file_name)" in HTML::Parser::Simple::Reporter.

verbose($Boolean)

Gets or sets the verbose parameter.

'verbose' is a parameter to "new()". See "Constructor and initialization" for details.

xhtml($Boolean)

Gets or sets the xhtml parameter.

If you call this after object creation, the trigger feature of Moos is used to call "tagged_attribute()" so as to correctly set the regexp which recognises xhtml.

'xhtml' is a parameter to "new()". See "Constructor and initialization" for details.

FAQ

What is the format of the data stored in each node of the tree?

The data of each node is a hash ref. The keys/values of this hash ref are:
How are tags and attributes handled?

Tags are stored in lower-case, in a tree managed by Tree::Simple.

Attributes are stored in the same case as in the original HTML.

The root of the tree is returned by "root()".

How are HTML comments handled?

They are treated as content. This includes the prefix '<!--' and the suffix '-->'.

How is DOCTYPE handled?

It is treated as content belonging to the root of the tree.

How is the XML declaration handled?

It is treated as content belonging to the root of the tree.

Does this module handle all HTML pages?

No, never.

Which versions of HTML does this module handle?

Up to V 4.

What do I do if this module does not handle my HTML page?

Make yourself a nice cup of tea, and then fix your page.

Does this validate the HTML input?

No.

For example, if you feed in an HTML page without the title tag, this module does not care.

How do I view the output HTML?

There are various ways.
How do I test this module (or my file)?

Preferably, see the previous question, or...

Suggested steps:

Note: There are quite a few files involved. Proceed with caution.
Will you implement a 'quirks' mode to handle my special HTML file?

No, never.

Help with quirks: <http://www.quirksmode.org/sitemap.html>.

Is there anything I should be aware of?

Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.

In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.

The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.

I do not define 'a' to be inline; others do, e.g. <http://www.w3.org/TR/html401/> and hence HTML::Tagset.

Inline means:

    <a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>

will not be parsed as an 'a' containing a 'div'.

The 'a' tag will be closed before the 'div' is opened. So, the result will look like:

    <a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>

To achieve what was presumably intended, use 'span':

    <a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>

Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.

Of course, this is just one of a vast set of possible problems.

You have been warned.

Why did you use Tree::Simple but not Tree or Tree::Fast or Tree::DAG_Node?

During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.

Late news: Tree does not cope with an arrayref stored in the metadata, so I have switched to Tree::DAG_Node.

Stop press: As an experiment I switched to Tree::Simple. Since it also works I will just keep using it.

Why is this module not called HTML::Parser::PurePerl?
How do I output my own stuff while traversing the tree?
How is the source formatted?

I edit with UltraEdit. That means, in general, leading 4-space tabs.

All vertical alignment within lines is done manually with spaces.

Perl::Critic is off the agenda.

Why did you choose Moos?

For the 2012 Google Code-in, I had a quick look at 122 class-building classes, and decided Moos was suitable, given it is pure-Perl and has the trigger feature I needed.

See <http://savage.net.au/Module-reviews/html/gci.2012.class.builder.modules.html>.

Credits

This Perl HTML parser has been converted from a JavaScript one written by John Resig.

<http://ejohn.org/files/htmlparser.js>.

Well done John!

Note also the comments published here:

<http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58>.

Repository

<https://github.com/ronsavage/HTML-Parser-Simple>

Support

Email the author, or log a bug on RT:

<https://rt.cpan.org/Public/Dist/Display.html?Name=HTML::Parser::Simple>.

Author

"HTML::Parser::Simple" was written by Ron Savage <ron@savage.net.au> in 2009.

Home page: <http://savage.net.au/index.html>.

Copyright

Australian copyright (c) 2009 Ron Savage.

All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License, a copy of which is available at: http://www.opensource.org/licenses/index.html
|