NAME
    HTML::ContentExtractor - extract the main content from a web page by
    analysing the DOM tree

VERSION
    Version 0.03

SYNOPSIS
        use HTML::ContentExtractor;
        use LWP::UserAgent;

        my $extractor = HTML::ContentExtractor->new();
        my $agent     = LWP::UserAgent->new;
        my $url  = 'http://sports.sina.com.cn/g/2007-03-23/16572821174.shtml';
        my $res  = $agent->get($url);
        my $HTML = $res->decoded_content();

        $extractor->extract($url, $HTML);
        print $extractor->as_html();
        print $extractor->as_text();

DESCRIPTION
    Web pages often contain clutter (ads, unnecessary images, extraneous
    links) around the body of an article that distracts the user from the
    actual content. This module reduces the noise in a web page and so
    identifies its content-rich regions.

    A web page is first parsed by an HTML parser, which corrects the markup
    and builds a DOM (Document Object Model) tree. The tree is then walked
    in a depth-first traversal, during which noise nodes are identified and
    removed, leaving the main content. Three kinds of nodes are removed:

    *   useless nodes such as "script" and "style" elements;

    *   container nodes ("table", "div", etc.) whose link/text ratio
        exceeds a threshold, where the link/text ratio is the number of
        links divided by the number of non-linked words;

    *   nodes that contain any string from a predefined spam string list.

    Note that the input HTML must be encoded in UTF-8 (as must the spam
    words), so the module can handle web pages in any language (I've used
    it to process English, Chinese, and Japanese web pages).
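    The pruning heuristic is easy to sketch in plain Perl. The code below
    illustrates the idea only and is not the module's internal
    implementation: it assumes HTML::TreeBuilder as the parser, and the
    0.5 threshold is an arbitrary value for illustration, not the module's
    actual default.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TreeBuilder;

        # Link/text ratio: number of links divided by the number of
        # non-linked words under a DOM node.
        sub link_text_ratio {
            my ($node)  = @_;
            my @links   = $node->look_down(_tag => 'a');
            my $n_links = scalar @links;

            # Count all words, then subtract the words inside links.
            my $n_words  = () = $node->as_text =~ /\S+/g;
            my $n_linked = 0;
            $n_linked += () = $_->as_text =~ /\S+/g for @links;
            my $n_content = $n_words - $n_linked;

            # A node with links but no content words is maximally
            # link-heavy.
            return $n_content > 0 ? $n_links / $n_content : 9e9;
        }

        # Depth-first traversal: delete container nodes whose link/text
        # ratio exceeds the threshold, recurse into the rest.
        sub prune {
            my ($node, $threshold) = @_;
            for my $child ($node->content_list) {
                next unless ref $child;    # plain strings are text nodes
                if ($child->tag =~ /^(?:table|div|td|ul)$/
                    && link_text_ratio($child) > $threshold) {
                    $child->delete;
                }
                else {
                    prune($child, $threshold);
                }
            }
        }

        my $html = '<html><body>'
                 . '<div><a href="#">ad</a> <a href="#">more ads</a></div>'
                 . '<div>This paragraph is the actual article content.</div>'
                 . '</body></html>';

        my $tree = HTML::TreeBuilder->new_from_content($html);
        $_->delete for $tree->look_down(_tag => qr/^(?:script|style)$/);
        prune($tree, 0.5);    # 0.5 is an illustrative threshold
        print $tree->as_HTML(undef, '  '), "\n";

    Run against the sample markup, the first "div" (two links, no
    non-linked words) is pruned, while the second "div" (no links) is kept.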
AUTHOR
    Zhang Jun, "<jzhang533 at gmail.com>"

COPYRIGHT & LICENSE
    Copyright 2007 Zhang Jun, all rights reserved.

    This program is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.