NAME

HTML::TableParser - Extract data from an HTML table

VERSION

version 0.43

SYNOPSIS

    use HTML::TableParser;

    @reqs = (
        {
            id    => 1.1,                    # id for embedded table
            hdr   => \&header,               # function callback
            row   => \&row,                  # function callback
            start => \&start,                # function callback
            end   => \&end,                  # function callback
            udata => { Snack => 'Food' },    # arbitrary user data
        },
        {
            id   => 1,                                # table id
            cols => [ 'Object Type', qr/object/ ],    # column name matches
            obj  => $obj,                             # method callbacks
        },
    );

    # create parser object
    $p = HTML::TableParser->new( \@reqs,
        { Decode => 1, Trim => 1, Chomp => 1 } );
    $p->parse_file( 'foo.html' );

    # function callbacks
    sub start {
        my ( $id, $line, $udata ) = @_;
        #...
    }

    sub end {
        my ( $id, $line, $udata ) = @_;
        #...
    }

    sub header {
        my ( $id, $line, $cols, $udata ) = @_;
        #...
    }

    sub row {
        my ( $id, $line, $cols, $udata ) = @_;
        #...
    }

DESCRIPTION

HTML::TableParser uses HTML::Parser to extract data from an HTML table. The data is returned via a series of user-defined callback functions or methods. Specific tables may be selected either by matching a unique table id or by matching against the column names. Multiple (even nested) tables may be parsed in a document in one pass.

Table Identification

Each table is given a unique id, relative to its parent, based upon its order and nesting. The first top level table has id 1, the second 2, etc. The first table nested in table 1 has id 1.1, the second 1.2, etc. The first table nested in table 1.1 has id 1.1.1, etc. These, as well as the tables' column names, may be used to identify which tables to parse.

Data Extraction

As the parser traverses a selected table, it will pass data to user-provided callback functions or methods after it has digested particular structures in the table. All functions are passed the table id (as described above), the line number in the HTML source where the table was found, and a reference to any table-specific user-provided data.
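The numbering scheme described under Table Identification can be sketched in a few lines of plain Perl. This is only an illustration of the scheme, not code taken from HTML::TableParser: the table_ids() function and its 'open'/'close' event list are hypothetical stand-ins for the <table> and </table> tags met while reading a document.

```perl
use strict;
use warnings;

# Hypothetical helper: derive table ids from a stream of 'open' and
# 'close' events, one per <table> and </table> tag, in document order.
sub table_ids {
    my @events = @_;
    my @path;             # numbers of the currently open tables, outermost first
    my @count = (0);      # last entry counts tables seen under the current parent
    my @ids;

    for my $event (@events) {
        if ( $event eq 'open' ) {
            push @path, ++$count[-1];      # next sibling number at this level
            push @ids, join '.', @path;    # e.g. (1, 2, 1) becomes "1.2.1"
            push @count, 0;                # fresh counter for nested tables
        }
        elsif ( $event eq 'close' && @path ) {
            pop @path;
            pop @count;
        }
    }
    return @ids;
}

# Table 1 holds two nested tables, the second of which holds one more;
# a second top-level table follows.
my @ids = table_ids(
    qw( open open close open open close close close open close ) );
print "$_\n" for @ids;
```

Each id is the path of sibling positions from the outermost table inwards, which is why an id such as 1.2.1 identifies both the table and all of its ancestors.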
Callback API

Callbacks may be functions or methods or a mixture of both. In the latter case, an object must be passed to the constructor. (More on that later.)

The callbacks are invoked as follows:

    start( $tbl_id, $line_no, $udata );
    end( $tbl_id, $line_no, $udata );
    hdr( $tbl_id, $line_no, \@col_names, $udata );
    row( $tbl_id, $line_no, \@data, $udata );
    warn( $tbl_id, $line_no, $message, $udata );
    new( $tbl_id, $udata );

Data Cleanup

There are several cleanup operations that may be performed automatically:
Data Organization

Column names are derived from cells delimited by the <th> and </th> tags. Some tables have header cells which span one or more columns or rows to make things look nice. HTML::TableParser determines the actual number of columns used and provides column names for each column, repeating names for spanned columns and concatenating spanned rows and columns. For example, if the table header looks like this:

    +----+--------+----------+-------------+-------------------+
    |    |        | Eq J2000 |             | Velocity/Redshift |
    | No | Object |----------| Object Type |-------------------|
    |    |        | RA | Dec |             | km/s |  z  | Qual |
    +----+--------+----------+-------------+-------------------+

The columns will be:

    No
    Object
    Eq J2000 RA
    Eq J2000 Dec
    Object Type
    Velocity/Redshift km/s
    Velocity/Redshift z
    Velocity/Redshift Qual

Row data are derived from cells delimited by the <td> and </td> tags. Cells which span more than one column or row are handled correctly, i.e. the values are duplicated in the appropriate places.

METHODS
Table Requests

A table request is a hash used by HTML::TableParser to determine which tables are to be parsed, the callbacks to be invoked, and any data cleanup. There may be multiple requests processed by one call to the parser; each table is associated with a single request (even if several requests match the table).

A single request may match several tables; however, unless the MultiMatch attribute is specified for that request, it will be used for the first matching table only.

A table request which matches a table id of "DEFAULT" will be used as a catch-all request, and will match all tables not matched by other requests. Please note that tables are compared to the requests in the order that the latter are passed to the new() method; place the DEFAULT request last for proper behavior.

Identifying tables to parse

HTML::TableParser needs to be told which tables to parse. This can be done by matching table ids or column names, or a combination of both. The table request hash elements dedicated to this are:

"id" may be passed an array containing any combination of the above:

    id => [ '1.2', qr/1\.\d+\.2/, sub { ... } ]

Elements in the array may be preceded by a modifier indicating the action to be taken if the table matches on that element. The modifiers and their meanings are:
An example:

    id => [ '-', '1.2', 'DEFAULT' ]

indicates that this request should be used for all tables, except for table 1.2.

    id => [ '--', '1.2' ]

Table 1.2 is just plain skipped altogether.
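A minimal sketch of an id list that combines a '-' modifier with a regular expression is given here. The sample HTML, the @seen array, and the temporary file are made up for illustration; the behaviour assumed is that '-' excludes the following element from this request only, as the first example above suggests.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TableParser;

# Three top-level tables, which should receive the ids 1, 2 and 3.
my $html = join '', map {
    "<table><tr><th>Name</th></tr>"
      . "<tr><td>row in table $_</td></tr></table>\n"
} 1 .. 3;

# parse_file() reads from a file, so store the sample in a temporary one.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} $html;
close $fh or die "cannot close $filename: $!";

my @seen;    # ids of the tables this request was used for

my @reqs = (
    {
        # '-' excludes table 2 from this request; the regular
        # expression then accepts any other top-level table id.
        id         => [ '-', '2', qr/^\d+$/ ],
        MultiMatch => 1,    # let the request serve more than one table
        start      => sub {
            my ( $id, $line, $udata ) = @_;
            push @seen, $id;
        },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse_file($filename);

print "request used for table $_\n" for @seen;
```

MultiMatch is set because a request is otherwise used for the first matching table only.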
"cols" may be passed an arrayref containing any combination of the above: cols => [ 'Snacks01', qr/Snacks\d+/, sub { ... } ] Elements in the array may be preceded by a modifier indicating the action to be taken if the table matches on that element. They are the same as the table id modifiers mentioned above.
More than one of these may be used for a single table request. A request may match more than one table. By default a request is used only once (even the "DEFAULT" id match!). Set the "MultiMatch" attribute to enable multiple matches per request.

When attempting to match a table, the following steps are taken:
Specifying the data callbacks

Callback functions are specified with the callback attributes "start", "end", "hdr", "row", and "warn". They should be set to code references, i.e.

    %table_req = ( ..., start => \&start_func, end => \&end_func )

To use methods, specify the object with the "obj" key, and the method names via the callback attributes, which should be set to strings. If you don't specify method names they will default to (you guessed it) "start", "end", "hdr", "row", and "warn".

    $obj = SomeClass->new();
    # ...
    %table_req_1 = ( ..., obj => $obj );
    %table_req_2 = ( ..., obj => $obj, start => 'start',
                          end => 'end' );

You can also have HTML::TableParser create a new object for you for each table by specifying the "class" attribute. By default the constructor is assumed to be the class new() method; if not, specify it using the "new" attribute:

    use MyClass;
    %table_req = ( ..., class => 'MyClass', new => 'mynew' );

To use a function instead of a method for a particular callback, set the callback attribute to a code reference:

    %table_req = ( ..., obj => $obj, end => \&end_func );

You don't have to provide all the callbacks. You should not use both "obj" and "class" in the same table request.

HTML::TableParser automatically determines if your object or class has one of the required methods. If you wish it not to use a particular method, set it equal to "undef". For example

    %table_req = ( ..., obj => $obj, end => undef )

indicates the object's end method should not be called, even if it exists.

You can specify arbitrary data to be passed to the callback functions via the "udata" attribute:

    %table_req = ( ..., udata => \%hash_of_my_special_stuff )

Specifying Data cleanup operations

Data cleanup operations may be specified uniquely for each table. The available keys are "Chomp", "Decode", "Trim".
They should be set to a non-zero value if the operation is to be performed.

Other Attributes

The "MultiMatch" key is used when a request is capable of handling multiple tables in the document. Ordinarily, a request will process a single table only (even "DEFAULT" requests). Set it to a non-zero value to allow the request to handle more than one table.

BUGS

Please report any bugs or feature requests on the bugtracker website <https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-TableParser> or by email to bug-HTML-TableParser@rt.cpan.org <mailto:bug-HTML-TableParser@rt.cpan.org>.

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

SOURCE

The development version is on GitHub at <https://github.com/djerius/html-tableparser> and may be cloned from <git://github.com/djerius/html-tableparser.git>.

AUTHOR

Diab Jerius <djerius@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018 by Smithsonian Astrophysical Observatory.

This is free software, licensed under:

    The GNU General Public License, Version 3, June 2007