NAME
Gungho - Yet Another High Performance Web Crawler Framework

SYNOPSIS
    use Gungho;
    Gungho->run($config);

DESCRIPTION
Gungho provides a complete out-of-the-box web crawler framework with high performance and great flexibility.

Please note that Gungho is in beta. It has been stable for some time, but its internals, including the API, may still change.

Gungho comes with many features that solve recurring problems when building a spider:
HISTORY
First there were a bunch of scripts that were used to scrape a bunch of RSS feeds. Then I got tired of writing scripts, so I decided a framework was the way to go, and Xango was born.

Xango was my first attempt at harnessing the full power of an event-based framework. It was fast. It wasn't fun to extend. It had a nightmarish way of dealing with robots.txt.

A couple more attempts later, with inspiration and lessons learned from Catalyst, Plagger, and DBIx::Class, Gungho was born.

Since its inception, Gungho has been used successfully in crawlers that fetch anywhere from hundreds of thousands to a few million URLs per day.

PLEASE READ BEFORE USE
Gungho is designed so that it can handle a massive amount of traffic. If you're careful enough with your Provider and Handler implementations, you can in fact hit millions of URLs with this crawler.

So PLEASE DO NOT LET IT LOOSE. DO NOT OVERLOAD your crawl targets. You are STRONGLY advised to use Gungho::Component::Throttle to throttle your fetches.

Also PLEASE CHANGE THE USER AGENT NAME OF YOUR CRAWLER. If you hit your targets hard with the default name (Gungho/VERSION X.XXXX), it will look as though a service called Gungho is hitting their site, which really isn't the case. Whatever it is, please specify at least a simple user agent in your config.

STRUCTURE
Gungho is composed of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched page; and an Engine, which controls the entire process.

There are also "hooks". Hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which you specify when you call register_hook().

All components (engine, provider, handler) are overridable and switchable. However, if you plan on customizing things, be aware that Gungho uses Class::C3 extensively, and hence you may see warnings about the code you use.
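To make the division of labor in STRUCTURE concrete, here is a minimal plain-Perl sketch of a provider, a handler, an engine loop, and hooks. It is not Gungho's implementation: the Toy::Crawler package, the fetcher callback (a stand-in for the network fetch), the synchronous loop, and the register_hook() signature shown here are all assumptions made for illustration. Only the hook names engine.send_request and engine.handle_response come from this manual. Gungho's real engine is event driven.

```perl
use strict;
use warnings;

# Hypothetical toy crawler core, for illustration only.
package Toy::Crawler;

sub new {
    my ($class, %args) = @_;
    return bless {
        provider => $args{provider},   # code ref: returns next request, or undef when done
        handler  => $args{handler},    # code ref: receives (request, response)
        fetcher  => $args{fetcher},    # code ref: assumed stand-in for the actual fetch
        hooks    => {},
    }, $class;
}

# Register one or more callbacks under a hook name (assumed signature).
sub register_hook {
    my ($self, $name, @callbacks) = @_;
    push @{ $self->{hooks}{$name} }, @callbacks;
    return $self;
}

# Run every callback registered under the given hook name.
sub run_hook {
    my ($self, $name, @args) = @_;
    $_->($self, @args) for @{ $self->{hooks}{$name} || [] };
    return;
}

# The "engine": pull requests from the provider, fetch them,
# and pass each result to the handler, firing hooks along the way.
sub run {
    my ($self) = @_;
    while (defined(my $request = $self->{provider}->())) {
        $self->run_hook('engine.send_request', $request);
        my $response = $self->{fetcher}->($request);
        $self->run_hook('engine.handle_response', $request, $response);
        $self->{handler}->($request, $response);
    }
    return;
}

package main;

my @queue = ('http://example.com/a', 'http://example.com/b');
my @log;

my $crawler = Toy::Crawler->new(
    provider => sub { shift @queue },
    fetcher  => sub { "body of $_[0]" },
    handler  => sub { push @log, "handled $_[0]" },
);
$crawler->register_hook('engine.send_request' => sub { push @log, "sending $_[1]" });
$crawler->run;

print "$_\n" for @log;
```

Because the provider, handler, and fetcher are plain code references in this sketch, any of them can be swapped without touching the loop, which mirrors the "overridable and switchable" idea described above.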
HOW *NOT* TO USE Gungho
One note about Gungho: don't use it if you are planning on accessing a single URL. It's usually not worth it, so you might as well use LWP::UserAgent or an equivalent module.

Gungho's event-driven engine works best when you are accessing hundreds, if not thousands, of URLs. It may in fact be slower than LWP::UserAgent if you are accessing just a single URL.

Of course, you may wish to use features other than speed that Gungho provides, so at that point it's simply up to you.

RUNNING IN DISTRIBUTED ENVIRONMENT
Gungho has experimental support for running in distributed environments.

Strictly speaking, each crawler needs its own strategy to enable it to run in a distributed environment. What Gungho offers is a "good enough" solution that may work for you. If what Gungho offers isn't enough, at least what comes with it might help show you what needs to be tweaked for your particular environment.

Roughly speaking, there are three components you need to worry about in order to make a well-behaved, distributed crawler. Check out the list below and the documentation for each component.
GLOBAL CONFIGURATION OPTIONS
COMPONENTS
Components add new functionality to Gungho. Components are loaded at startup time from the config file / hash given to the Gungho constructor.

    Gungho->run({
        components => [
            'Throttle::Simple'
        ],
        throttle => {
            max_interval => ...,
        }
    });

Components modify Gungho's inheritance structure at run time to add extra functionality, and therefore should only be loaded before starting the engine.

Please refer to each component's documentation for details.
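As a rough illustration of what "modifying the inheritance structure at run time" means, the plain-Perl sketch below mixes a component into a class by editing its @ISA when run() is called. The Toy::App and Toy::Component::Greeting packages are hypothetical, and load_components() here is a simplified stand-in written for this example. Gungho itself relies on Class::C3::Componentised and C3 method resolution, which this sketch does not reproduce. The component_base_class method is used the way this manual names it, to supply the namespace prefix for short component names.

```perl
use strict;
use warnings;

# Hypothetical component: a package whose methods get mixed into the app.
package Toy::Component::Greeting;
sub greet { my ($class) = @_; return "hello from $class" }

# Hypothetical application class standing in for the crawler.
package Toy::App;
our @ISA;

# Namespace prefix applied to short component names.
sub component_base_class { 'Toy::Component' }

# Simplified stand-in for a component loader: expand each short name
# and put the resulting package at the front of this class's @ISA.
sub load_components {
    my ($class, @names) = @_;
    no strict 'refs';
    for my $name (@names) {
        my $component = join '::', $class->component_base_class, $name;
        next if $class->isa($component);    # already loaded
        unshift @{"${class}::ISA"}, $component;
    }
    return $class;
}

# Load components from the config before doing anything else.
sub run {
    my ($class, $config) = @_;
    $class->load_components(@{ $config->{components} || [] });
    return $class;
}

package main;

my $before = Toy::App->can('greet') ? 1 : 0;
Toy::App->run({ components => ['Greeting'] });
my $after  = Toy::App->can('greet') ? 1 : 0;

print "greet available before: $before, after: $after\n";
print Toy::App->greet, "\n";
```

Since the method only becomes visible once @ISA has been changed, anything that depends on a component's methods has to run after loading, which is one way to read the advice to load components before starting the engine.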
INLINE
If you're looking to write simple crawlers, you may want to look at Gungho::Inline:

    Gungho::Inline->run({
        provider => sub { ... },
        handler  => sub { ... }
    });

See the manual for Gungho::Inline for details.

PLUGINS
Plugins are different from components in that, whereas components require the developer to explicitly call the methods, plugins are loaded and are not touched afterwards.

Please refer to the documentation of each plugin for details.
HOOKS
Currently available hooks are:

    engine.send_request
    engine.handle_response

METHODS
component_base_class
    Used for Class::C3::Componentised

CODE
You can obtain the current code base from

    http://gungho-crawler.googlecode.com/svn/trunk

AUTHOR
Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>

CONTRIBUTORS
LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html