HTML::DOM(3)          User Contributed Perl Documentation          HTML::DOM(3)
HTML::DOM - A Perl implementation of the HTML Document Object Model
Version 0.058 (alpha)
WARNING: This module is still at an experimental stage. The
API is subject to change without notice.
use HTML::DOM;
my $dom_tree = new HTML::DOM; # empty tree
$dom_tree->write($source_code);
$dom_tree->close;
my $other_dom_tree = new HTML::DOM;
$other_dom_tree->parse_file($filename);
$dom_tree->getElementsByTagName('body')->[0]->appendChild(
    $dom_tree->createElement('input')
);
print $dom_tree->innerHTML, "\n";
my $text = $dom_tree->createTextNode('text');
$text->data; # get attribute
$text->data('new value'); # set attribute
This module implements the HTML Document Object Model by extending the
HTML::Tree modules. The HTML::DOM class serves both as an HTML parser and as
the document class.
The following DOM modules are currently supported:
Feature           Version (aka level)
-------           -------------------
HTML              2.0
Core              2.0
Events            2.0
UIEvents          2.0
MouseEvents       2.0
MutationEvents    2.0
HTMLEvents        2.0
StyleSheets       2.0
CSS               2.0 (partially)
CSS2              2.0
Views             2.0
StyleSheets, CSS and CSS2 are actually provided by CSS::DOM. This
list corresponds to CSS::DOM versions 0.02 to 0.14.
- $tree = new HTML::DOM %options;
- This class method constructs and returns a new HTML::DOM object. The
%options, which are all optional, are as
follows:
- url
- The value that the "URL" method will
return. This value is also used by the
"domain" method.
- referrer
- The value that the "referrer" method
will return.
- response
- An HTTP::Response object. This will be used for information needed for
writing cookies. It is expected to have a reference to a request object
(accessible via its "request"
method--see HTTP::Response). Passing a parameter to the 'cookie' method
will be a no-op without this.
- weaken_response
- If this is passed a true value, then the HTML::DOM object will hold a weak
reference to the response.
- cookie_jar
- An HTTP::Cookies object. As with
"response", if you omit this, arguments
passed to the "cookie" method will be
ignored.
- charset
- The original character set of the document. This does not affect parsing
via the "write" method (which always
assumes Unicode). "parse_file" will use
this, if specified, or HTML::Encoding otherwise. HTML::DOM::Form's
"make_request" method uses this to
encode form data unless the form has a valid 'accept-charset'
attribute.
If "referrer" and
"url" are omitted, they can be inferred
from "response".
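As a minimal sketch of the constructor options described above, the following passes "url", "referrer" and "charset" and reads them back through the corresponding methods. The URLs and the charset name are placeholders chosen for illustration only.

```perl
use strict;
use warnings;
use HTML::DOM;

# Placeholder values; substitute the real page URL and referrer.
my $doc = HTML::DOM->new(
    url      => 'http://www.example.com/docs/index.html',
    referrer => 'http://www.example.com/',
    charset  => 'iso-8859-1',
);

print $doc->URL,      "\n";   # the value passed as "url"
print $doc->referrer, "\n";   # the value passed as "referrer"
print $doc->charset,  "\n";   # the value passed as "charset"
```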
- $tree->elem_handler($elem_name => sub { ... })
- If you call this method first, then, when the DOM tree is in the process
of being built (as a result of a call to
"write" or
"parse_file"), the subroutine will be
called after each $elem_name element is added to
the tree. If you give '*' as the element name, the subroutine will be
called for each element that does not have a handler. The subroutine's two
arguments will be the tree itself and the element in question. The
subroutine can call the DOM object's
"write" method to insert HTML code into
the source after the element.
Here is a lame example (which does not take
Content-Script-Type headers or security into account):
    $tree->elem_handler(script => sub {
        my($document,$elem) = @_;
        # Skip scripts that have no type or are not marked as Perl.
        return unless ($elem->attr('type') || '') eq 'application/x-perl';
        eval($elem->firstChild->data);
    });
    $tree->write(
        '<p>The time is
             <script type="application/x-perl">
                  $document->write(scalar localtime)
             </script>
             precisely.
         </p>'
    );
    $tree->close;
    print $tree->documentElement->as_text, "\n";
(Note: HTML::DOM::Element's
"content_offset" method might come in
handy for reporting line numbers for script errors.)
- css_url_fetcher( \&sub )
- With this method you can provide a subroutine that fetches URLs referenced
by 'link' tags. Its sole argument is the URL, which is made absolute based
on the HTML page's own base URL (it is assumed that this is absolute). It
should return "undef" or an empty list
on failure. Upon success, it should return just the CSS code, if it has
been decoded (and is in Unicode), or, if it has not been decoded, the CSS
code followed by "decode => 1". See
"STYLE SHEET ENCODING" in CSS::DOM for details on when you
should or should not decode it. (Note that HTML::DOM automatically
provides an encoding hint based on the HTML document.)
HTML::DOM passes the result of the url fetcher to CSS::DOM and
turns it into a style sheet object accessible via the link element's
"sheet" method.
- $tree->write(...) (DOM method)
- This parses the HTML code passed to it, adding it to the end of the
document. It assumes that its input is a normal Perl Unicode string. Like
HTML::TreeBuilder's "parse" method, it
can take a coderef.
When it is called from an element handler (see
"elem_handler", above), the value
passed to it will be inserted into the HTML code after the current
element when the element handler returns. (In this case a coderef won't
do--maybe that will be added later.)
If the "close" method has
been called, "write" will call
"open" before parsing the HTML code
passed to it.
- $tree->writeln(...) (DOM method)
- Just like "write" except that it appends
"\n" to its argument and does not work with code refs. (Rather
pointless, if you ask me. :-)
- $tree->close() (DOM method)
- Call this method to signal to the parser that the end of the HTML code has
been reached. It will then parse any residual HTML that happens to be
buffered. It also makes the next "write"
call "open".
- $tree->open (DOM method)
- Deletes the HTML tree, resetting it so that it has just an <html>
element, and a parser hungry for HTML code.
- $tree->parse_file($file)
- This method takes a file name or handle and parses the content,
(effectively) calling "close"
afterwards. In the former case (a file name), HTML::Encoding will be used
to detect the encoding. In the latter (a file handle), you'll have to
"binmode" it yourself. This could be
considered a bug. If you have a solution to this (how to make
HTML::Encoding detect an encoding from a file handle), please let me know.
As of version 0.12, this method returns true upon success, or
undef/empty list on failure.
- $tree->charset
- This method returns the name of the character set that was passed to
"new", or, if that was not given, that
which "parse_file" used.
It returns undef if "new"
was not given a charset and if
"parse_file" was not used or was
passed a file handle.
You can also set the charset by passing an argument, in which
case the old value is returned.
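The interplay of "write", "close" and the implicit "open" described above can be sketched as follows. The markup and the variable names are invented for the example; the behaviour shown is only what the method descriptions above state.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;

# Feed the parser in chunks; the second chunk finishes the paragraph.
$doc->write('<title>First draft</title><p>Hel');
$doc->write('lo</p>');
$doc->close;                        # parse anything still buffered
my $first_title = $doc->title;

# close() was called, so this write() calls open() first and
# starts a fresh document instead of appending to the old one.
$doc->write('<title>Second draft</title><p>Replaced</p>');
$doc->close;
my $second_title = $doc->title;

print "$first_title\n$second_title\n";
```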
- doctype
- Returns nothing.
- implementation
- Returns the HTML::DOM::Implementation object.
- documentElement
- Returns the <html> element.
- createElement ( $tag )
- createDocumentFragment
- createTextNode ( $text )
- createComment ( $text )
- createAttribute ( $name )
- Each of these creates a node of the appropriate type.
- createProcessingInstruction
- createEntityReference
- These two throw an exception.
- getElementsByTagName ( $name )
- $name can be the name of the tag, or '*', to match
all tag names. This returns a node list object in scalar context, or a
list in list context.
- importNode ( $node, $deep )
- Clones the $node, setting its
"ownerDocument" attribute to the
document with which this method is called. If
$deep is true, the $node
will be cloned recursively.
- alinkColor
- background
- bgColor
- fgColor
- linkColor
- vlinkColor
- These six methods return (optionally set) the corresponding attributes of
the body element. Note that most of the names do not map directly to the
names of the attributes. "fgColor"
refers to the "text" attribute. Those
that end with 'linkColor' refer to the attributes of the same name but
without the 'Color' on the end.
- title
- Returns (or optionally sets) the title of the page.
- referrer
- Returns the page's referrer.
- domain
- Returns the domain name portion of the document's URL.
- URL
- Returns the document's URL.
- body
- Returns the body element, or the outermost frame set if the document has
frames. You can set the body by passing an element as an argument, in
which case the old body element is returned.
- images
- applets
- links
- forms
- anchors
- These five methods each return a list of the appropriate elements in list
context, or an HTML::DOM::Collection object in scalar context. In this
latter case, the object will update automatically when the document is
modified.
In the case of "forms" you
can access those by using the HTML::DOM object itself as a hash. I.e.,
you can write "$doc->{f}" instead
of "$doc->forms->{f}".
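A small sketch of the hash shortcut, assuming a form whose name is "login" and whose action is "/session" (both made up for this example):

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<form name="login" action="/session"><input name="user"></form>');
$doc->close;

# Both expressions look up the form named "login".
my $via_hash  = $doc->{login};
my $via_forms = $doc->forms->{login};

print $via_hash->getAttribute('action'),  "\n";
print $via_forms->getAttribute('action'), "\n";
```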
- cookie
- This returns a string containing the document's cookies (the format may
still change). If you pass an argument, it will set a cookie as well. Both
Netscape-style and RFC2965-style cookie headers are supported.
- getElementById
- getElementsByName
- getElementsByClassName
- These three do what their names imply. The last two will return a list in
list context, or a node list object in scalar context. Calling them in
list context is probably more efficient.
- createEvent ( $category )
- Creates a new event object, believe it or not.
The $category is the DOM event
category, which determines what type of event object will be returned.
The currently supported event categories are MouseEvents, UIEvents,
HTMLEvents and MutationEvents.
You can omit the $category to create
an instance of the event base class (not officially part of the
DOM).
- defaultView
- Returns the HTML::DOM::View object associated with the document.
There is no such object by default; you have to put one there
yourself. Although the DOM specifies this attribute as read-only,
you can set it by passing an argument to it. It is
still marked as read-only in
%HTML::DOM::Interface.
If you do set it, it is recommended that the object be a
subclass of HTML::DOM::View.
This attribute holds a weak reference to the object.
- styleSheets
- Returns a CSS::DOM::StyleSheetList of the document's style sheets, or a
simple list in list context.
- innerHTML
- Serialises and returns the HTML document. If you pass an argument, it will
set the contents of the document via
"open",
"write" and
"close", returning a serialisation of
the old contents.
- location
- set_location_object (non-DOM)
- "location" returns the location object,
if you've put one there with
"set_location_object". HTML::DOM doesn't
actually implement such an object itself, but provides the appropriate
magic to make "$doc->location($foo)"
translate into
"$doc->location->href($foo)".
BTW, the location object had better be true when used as a
boolean, or HTML::DOM will think it doesn't exist.
- lastModified
- This method returns the document's modification date as gleaned from the
response object passed to the constructor, in MM/DD/YYYY HH:MM:SS format.
If there is no modification date, an empty string is returned,
but this may change in the future.
(See also "EVENT HANDLING", below.)
- $tree->base
- Returns the base URL of the page; either from a <base href=...> tag,
from the response object passed to
"new", or the URL passed to
"new".
- $tree->magic_forms
- This is mainly for internal use. It returns a boolean indicating whether
the parser needed to associate formies with a form that did not contain
them. This happens when a closing </form> tag is missing and the
form is closed implicitly, but a formie is encountered later.
You can use an HTML::DOM object as a hash ref to access its form elements by
name. So "$doc->{yayaya}" is short for
"$doc->forms->{yayaya}".
HTML::DOM supports both the DOM Level 2 event model and the HTML 4 event model.
Throughout this documentation, we make use of HTML 5's distinction
between handlers and listeners: An event handler is the result of an HTML
attribute whose name begins with 'on', e.g. onsubmit. These are also accessible via
the DOM. (We also use the word 'handler' in other contexts, such as the
'default event handler'.) Event listeners are registered solely with the
"addEventListener" method and can be
removed with "removeEventListener".
HTML::DOM accepts as an event handler a coderef, an object with a
"call_with" method, or an object with
"&{}" overloading. If the
"call_with" method is present, it is
called with the current event target as the first argument and the event
object as the second. This is to allow for objects that wrap JavaScript
functions (which must be called with the event target as the this
value).
An event listener is a coderef, an object with a
"handleEvent" method or an object with
"&{}" overloading. HTML::DOM does not
implement any classes that provide a
"handleEvent" method, but will support any
object that has one.
Listeners and handlers differ in one important aspect. A listener
has to call "preventDefault" on the event
object to cancel the default action. A handler simply returns a defined
false value (except for mouseover events, which must return a true value to
cancel the default).
Default actions that HTML::DOM is capable of handling internally (such as
triggering a DOMActivate event when an element is clicked, and triggering a
form's submit event when the submit button is activated) are dealt with
automatically. You don't have to worry about those. For others, read on....
To specify the default actions associated with an event, provide a
subroutine via the "default_event_handler_for" and
"default_event_handler" methods. (Because these methods are not part of
the DOM, an object with a "handleEvent" method cannot be used here.)
With the former, you can specify the default action to be taken
when a particular type of event occurs. The currently supported types
are:
    submit    when a form is submitted
    link      called when a link is activated (DOMActivate event)
Pass the type of event as the first argument and a code ref as the
second argument. When the code ref is called, its sole argument will be the
event object. For instance:
    $dom_tree->default_event_handler_for( link => sub {
        my $event = shift;
        go_to( $event->target->href );
    });
    sub go_to { ... }
"default_event_handler_for" with
just one argument returns the currently assigned coderef. With two arguments
it returns the old one after assigning the new one.
Use "default_event_handler"
(without the "_for") to specify a fallback
subroutine that will be used for events not in the list above, and for
events in the list above that do not have subroutines assigned to them.
Without any arguments it will return the currently assigned coderef. With an
argument it will return the old one after assigning the new one.
HTML::DOM::Node's "dispatchEvent" method
triggers the appropriate event listeners, but does not call any default
actions associated with it. The return value is a boolean that indicates
whether the default action should be taken.
HTML::DOM::Node's "trigger_event"
method will trigger the event for real. It will call
"dispatchEvent" and, provided it returns
true, will call the default event handler.
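The return value of "dispatchEvent" can be illustrated with a listener that vetoes the default action. This is a sketch only: the form markup is invented, and the event is built with the standard DOM Level 2 "createEvent" and "initEvent" calls.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<form name="f"><input name="q"></form>');
$doc->close;
my ($form) = $doc->forms;           # list context returns the elements

# A listener cancels the default action with preventDefault.
$form->addEventListener(submit => sub {
    my $event = shift;
    $event->preventDefault;
}, 0);

my $event = $doc->createEvent('HTMLEvents');
$event->initEvent('submit', 1, 1);  # type, bubbles, cancelable

# dispatchEvent runs the listeners only and reports whether the
# default action should still be taken.
my $proceed = $form->dispatchEvent($event);
print $proceed ? "default action would run\n"
               : "default action cancelled\n";
```

Calling "trigger_event" instead would additionally run the default event handler whenever the dispatch returns true.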
The "event_attr_handler" method can be used to assign
a coderef that will turn text assigned to an event attribute (e.g.,
"onclick") into an event handler. The
arguments to the routine will be (0) the element, (1) the name (aka type) of
the event (without the initial 'on'), (2) the value of the attribute and (3)
the offset within the source of the attribute's value. (Actually, if the value
is within quotes, it is the offset of the first quotation mark. Also, it will
be "undef" for generated HTML [source code
passed to the "write" method by an element
handler].) As with "default_event_handler",
you can replace an existing handler with a new one, in which case the old
handler is returned. If you call this method without arguments, it returns the
current handler. Here is an example of its use, that assumes that handlers are
Perl code:
    $dom_tree->event_attr_handler(sub {
        my($elem, $name, $code, $offset) = @_;
        my $sub = eval "sub { $code }";
        return sub {
            local *_ = \$elem;
            &$sub;
        };
    });
The event attribute handler will be called whenever an element
attribute whose name begins with 'on' (case-tolerant) is modified. (For
efficiency's sake, I may change it to call the event attribute handler only
when the event is triggered, so it is not called unnecessarily.)
Use "error_handler" to assign a coderef that
will be called whenever an event listener (or handler) raises an error. The
error will be contained in $@.
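As a hedged sketch of "error_handler", the following collects errors raised by a listener instead of letting them vanish. The @errors array, the markup and the error message are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc = HTML::DOM->new;
$doc->write('<p>Click me</p>');
$doc->close;

# Record every error raised by a listener or handler; the error is in $@.
my @errors;
$doc->error_handler(sub { push @errors, $@ });

my ($para) = $doc->getElementsByTagName('p');
$para->addEventListener(click => sub { die "listener failed\n" }, 0);

my $event = $doc->createEvent('MouseEvents');
$event->initEvent('click', 1, 1);   # type, bubbles, cancelable
$para->dispatchEvent($event);

print "caught: $_" for @errors;
```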
- $tree->event_parent
- $tree->event_parent( $new_val )
- This method lets you provide an object that is added to the top of the
event dispatch chain. E.g., if you want the view object (the value of
"defaultView", aka the window) to have
event handlers called before the document in the capture phase, and after
it in the bubbling phase, you can set it like this (see also
"defaultView", above):
$tree->event_parent( $tree->defaultView );
This holds a weak reference.
- $tree->event_listeners_enabled
- $tree->event_listeners_enabled( $new_val )
- This attribute, which is true by default, can be used to disable event
handlers and listeners. (Default event handlers [see above] still run,
though.)
Here is the inheritance hierarchy of HTML::DOM's various classes and the DOM
interfaces those classes implement. The classes in the left column all begin
with 'HTML::DOM::', which is omitted for brevity, except for HTML::DOM itself,
which is listed with its full name. Items in brackets have not yet been
implemented. (See also HTML::DOM::Interface for a machine-readable list of
standard methods.)
Class Inheritance Hierarchy   Interfaces
---------------------------   ----------
Exception                     DOMException, EventException
Implementation                DOMImplementation,
                                [DOMImplementationCSS]
Node                          Node, EventTarget
DocumentFragment              DocumentFragment
HTML::DOM                     Document, HTMLDocument,
                                DocumentEvent, DocumentView,
                                DocumentStyle, [DocumentCSS]
CharacterData                 CharacterData
Text                          Text
Comment                       Comment
Element                       Element, HTMLElement,
                                ElementCSSInlineStyle
Element::HTML                 HTMLHtmlElement
Element::Head                 HTMLHeadElement
Element::Link                 HTMLLinkElement, LinkStyle
Element::Title                HTMLTitleElement
Element::Meta                 HTMLMetaElement
Element::Base                 HTMLBaseElement
Element::IsIndex              HTMLIsIndexElement
Element::Style                HTMLStyleElement, LinkStyle
Element::Body                 HTMLBodyElement
Element::Form                 HTMLFormElement
Element::Select               HTMLSelectElement
Element::OptGroup             HTMLOptGroupElement
Element::Option               HTMLOptionElement
Element::Input                HTMLInputElement
Element::TextArea             HTMLTextAreaElement
Element::Button               HTMLButtonElement
Element::Label                HTMLLabelElement
Element::FieldSet             HTMLFieldSetElement
Element::Legend               HTMLLegendElement
Element::UL                   HTMLUListElement
Element::OL                   HTMLOListElement
Element::DL                   HTMLDListElement
Element::Dir                  HTMLDirectoryElement
Element::Menu                 HTMLMenuElement
Element::LI                   HTMLLIElement
Element::Div                  HTMLDivElement
Element::P                    HTMLParagraphElement
Element::Heading              HTMLHeadingElement
Element::Quote                HTMLQuoteElement
Element::Pre                  HTMLPreElement
Element::Br                   HTMLBRElement
Element::BaseFont             HTMLBaseFontElement
Element::Font                 HTMLFontElement
Element::HR                   HTMLHRElement
Element::Mod                  HTMLModElement
Element::A                    HTMLAnchorElement
Element::Img                  HTMLImageElement
Element::Object               HTMLObjectElement
Element::Param                HTMLParamElement
Element::Applet               HTMLAppletElement
Element::Map                  HTMLMapElement
Element::Area                 HTMLAreaElement
Element::Script               HTMLScriptElement
Element::Table                HTMLTableElement
Element::Caption              HTMLTableCaptionElement
Element::TableColumn          HTMLTableColElement
Element::TableSection         HTMLTableSectionElement
Element::TR                   HTMLTableRowElement
Element::TableCell            HTMLTableCellElement
Element::FrameSet             HTMLFrameSetElement
Element::Frame                HTMLFrameElement
Element::IFrame               HTMLIFrameElement
NodeList                      NodeList
NodeList::Radio
NodeList::Magic               NodeList
NamedNodeMap                  NamedNodeMap
Attr                          Node, Attr, EventTarget
Collection                    HTMLCollection
Collection::Elements
Collection::Options
Event                         Event
Event::UI                     UIEvent
Event::Mouse                  MouseEvent
Event::Mutation               MutationEvent
View                          AbstractView, ViewCSS
The EventListener interface is not implemented by HTML::DOM, but
is supported. See "EVENT HANDLING", above.
Not listed above is HTML::DOM::EventTarget, which is a base class
both for HTML::DOM::Node and HTML::DOM::Attr. The format I'm using above
doesn't allow for multiple inheritance, so I probably need to redo it.
HTML::DOM::Node also implements the HTML::Element interface, but
with a few differences. In particular:
- Any methods that expect text nodes to be just strings are unreliable. See
the note under "objectify_text" in HTML::Element.
- HTML::Element's tree-manipulation methods don't trigger mutation
events.
- HTML::Element's "delete" method is not
necessary, because HTML::DOM uses weak references (for 'upward' references
in the object tree).
- Objects' attributes are accessed via methods of the same name. When the
method is invoked, the current value is returned. If an argument is
supplied, the attribute is set (unless it is read-only) and its old value
returned.
- Where the DOM spec. says to use null, undef or an empty list is used.
- Instead of UTF-16 strings, HTML::DOM uses Perl's Unicode strings (which
happen to be stored as UTF-8 internally). The only significant difference
this makes is to "length",
"substringData" and other methods of
Text and Comment nodes. These methods behave in a Perlish way (i.e., the
offsets and lengths are specified in Unicode characters, not in UTF-16
code units). The alternate methods
"length16",
"substringData16" et al. use
UTF-16 for offsets and are standards-compliant in that regard (but the
string returned by "substringData16" is
still a regular Perl string).
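The difference between the two families of methods only shows up for characters outside the Basic Multilingual Plane. A minimal sketch, using U+1F600 as an arbitrary example of such a character:

```perl
use strict;
use warnings;
use HTML::DOM;

my $doc  = HTML::DOM->new;

# Three characters; the middle one needs a surrogate pair in UTF-16.
my $text = $doc->createTextNode("a\x{1F600}b");

my $chars = $text->length;     # counted in Unicode characters
my $units = $text->length16;   # counted in UTF-16 code units

print "characters: $chars, UTF-16 code units: $units\n";
```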
- Each method that returns a NodeList will return a NodeList object in
scalar context, or a simple list in list context. You can use the object
as an array ref in addition to calling its
"item" and
"length" methods.
- In cases where a method is supposed to return something implementing the
DOMTimeStamp interface, a simple Perl scalar is returned, containing the
time as returned by Perl’s built-in
"time" function.
Much of the code was stolen from HTML::Tree. In fact, HTML::DOM used to extend
HTML::Tree, but the two were merged to allow a whole pile of hacks to be
removed.
perl 5.8.3 or later
Exporter 5.57 or later
URI.pm
LWP 5.13 or later
CSS::DOM 0.06 or later
Scalar::Util 1.14 or later
HTML::Tagset 3.02 or later
HTML::Parser 3.46 or later
HTML::Encoding is required if a file name is passed to
"parse_file".
Tie::RefHash::Weak 0.08 or higher, if you are using perl 5.8.x
- Element handlers are not currently called during assignments to
"innerHTML".
- HTML::DOM::View's "getComputedStyle"
does not currently return a read-only style object; nor are lengths
converted to absolute values. Currently there is no way to specify the
medium. Any style rules that apply to specific media are ignored.
To report bugs, please e-mail the author.
Copyright (C) 2007-16 Father Chrysostomos
$text = new HTML::DOM ->createTextNode('sprout');
$text->appendData('@');
$text->appendData('cpan.org');
print $text->data, "\n";
This program is free software; you may redistribute it and/or
modify it under the same terms as perl.
Each of the classes listed above under "CLASSES AND DOM INTERFACES"
HTML::DOM::Exception, HTML::DOM::Node, HTML::DOM::Event,
HTML::DOM::Interface
HTML::Tree, HTML::TreeBuilder, HTML::Element, HTML::Parser, LWP,
WWW::Mechanize, HTTP::Cookies, WWW::Mechanize::Plugin::JavaScript,
HTML::Form, HTML::Encoding
The DOM Level 1 specification at
<http://www.w3.org/TR/REC-DOM-Level-1>
The DOM Level 2 Core specification at
<http://www.w3.org/TR/DOM-Level-2-Core>
The DOM Level 2 Events specification at
<http://www.w3.org/TR/DOM-Level-2-Events>
etc.