NAME
Net::OAI::Harvester - A package for harvesting metadata using OAI-PMH

SYNOPSIS
    ## create a harvester for the Library of Congress
    my $harvester = Net::OAI::Harvester->new(
        'baseURL' => 'http://memory.loc.gov/cgi-bin/oai2_0'
    );

    ## list all the records in a repository
    my $records = $harvester->listRecords(
        'metadataPrefix' => 'oai_dc'
    );
    while ( my $record = $records->next() ) {
        my $header = $record->header();
        my $metadata = $record->metadata();
        print "identifier: ", $header->identifier(), "\n";
        print "title: ", $metadata->title(), "\n";
    }

    ## find out the name for a repository
    my $identity = $harvester->identify();
    print "name: ",$identity->repositoryName(),"\n";

    ## get a list of identifiers
    my $identifiers = $harvester->listIdentifiers(
        'metadataPrefix' => 'oai_dc'
    );
    while ( my $header = $identifiers->next() ) {
        print "identifier: ",$header->identifier(), "\n";
    }

    ## GetRecord, ListSets, ListMetadataFormats also supported

DESCRIPTION
Net::OAI::Harvester is a Perl extension for easily querying OAI-PMH repositories. OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting. OAI-PMH allows data repositories to share metadata about their digital assets. Net::OAI::Harvester is an OAI-PMH client, so it does for OAI-PMH what LWP::UserAgent does for HTTP.

You create a Net::OAI::Harvester object which you can then use to retrieve metadata from a selected repository. Net::OAI::Harvester tries to keep things simple by providing an API to get at the data you want; but it also has a framework which is easy to extend should you need to get more fancy.
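To make the LWP::UserAgent comparison concrete: on the wire, an OAI-PMH request is an HTTP GET on the base URL with a "verb" query parameter plus that verb's arguments. The following is a minimal sketch of building such a URL by hand. It is not part of Net::OAI::Harvester, and the helper name oai_request_url() is hypothetical; it only illustrates the kind of request the module issues and parses for you.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Net::OAI::Harvester): build an OAI-PMH
# request URL from a base URL and an ordered list of key/value pairs.
sub oai_request_url {
    my ( $base_url, @pairs ) = @_;
    my @query;
    while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
        # Percent-encode everything outside the RFC 3986 unreserved set.
        # This sketch assumes plain ASCII values.
        $value =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
        push @query, "$key=$value";
    }
    return $base_url . '?' . join '&', @query;
}

my $url = oai_request_url(
    'http://memory.loc.gov/cgi-bin/oai2_0',
    verb           => 'ListRecords',
    metadataPrefix => 'oai_dc',
);
print "$url\n";
```

With Net::OAI::Harvester the same request is simply a call to listRecords( metadataPrefix => 'oai_dc' ), and the XML response is parsed into objects for you.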
The guiding principle behind OAI-PMH is to allow metadata about online resources to be shared by data providers, so that the metadata can be harvested by interested parties. The protocol is essentially XML over HTTP (much like XMLRPC or SOAP). Net::OAI::Harvester does XML parsing for you (using XML::SAX internally), but you can get at the raw XML if you want to do your own XML processing, and you can drop in your own XML::SAX handler if you would like to do your own parsing of metadata elements.

An OAI-PMH repository supports 6 verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets. The verbs translate directly into methods that you can call on a Net::OAI::Harvester object. More details about these methods are supplied below; for the real story, however, please consult the spec at http://www.openarchives.org.

Net::OAI::Harvester has a few features that are worth mentioning:

METHODS
All the Net::OAI::Harvester methods return other objects. As you would expect, new() returns a Net::OAI::Harvester object; similarly getRecord() returns a Net::OAI::Record object, listIdentifiers() returns a Net::OAI::ListIdentifiers object, identify() returns a Net::OAI::Identify object, and so on. So when you use one of these methods you'll probably want to check out the docs for the object that gets returned so you can see what to do with it. Many of these classes inherit from Net::OAI::Base, which provides some base functionality for retrieving errors, getting the raw XML, and finding the temporary file where the XML is stored (see Net::OAI::Base documentation for more details).

new()
The constructor, which returns a Net::OAI::Harvester object. You must supply the baseURL parameter to tell Net::OAI::Harvester what data repository you are going to be harvesting. For a list of data providers check out the directory available on the Open Archives Initiative homepage.

    my $harvester = Net::OAI::Harvester->new(
        baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
    );

If you want to pull down all the XML files and keep them in a directory, rather than having them stored as transient temp files, pass in the dumpDir parameter.

    my $harvester = Net::OAI::Harvester->new(
        baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0',
        dumpDir => 'american-memory'
    );

Also, if you would like to fine tune the HTTP client used by Net::OAI::Harvester you can pass in a configured LWP::UserAgent object. For example, this can be handy if you want to adjust the client timeout:

    my $ua = LWP::UserAgent->new();
    $ua->timeout(20); ## set timeout to 20 seconds
    my $harvester = Net::OAI::Harvester->new(
        baseURL   => 'http://memory.loc.gov/cgi-bin/oai2_0',
        userAgent => $ua
    );

identify()
identify() is the OAI verb that tells a metadata repository to provide a description of itself. A call to identify() returns a Net::OAI::Identify object which you can then call methods on to retrieve the information you are interested in.
For example:

    my $identity = $harvester->identify();
    print "repository name: ",$identity->repositoryName(),"\n";
    print "protocol version: ",$identity->protocolVersion(),"\n";
    print "earliest date stamp: ",$identity->earliestDatestamp(),"\n";
    print "admin email(s): ", join( ", ", $identity->adminEmail() ), "\n";
    ...

For more details see the Net::OAI::Identify documentation.

listMetadataFormats()
listMetadataFormats() asks the repository to return a list of metadata formats that it supports. A call to listMetadataFormats() returns a Net::OAI::ListMetadataFormats object.

    my $list = $harvester->listMetadataFormats();
    print "archive supports metadata prefixes: ",
        join( ',', $list->prefixes() ),"\n";

If you are interested in the metadata formats available for a particular resource identifier then you can pass in that identifier.

    my $list = $harvester->listMetadataFormats( identifier => '1234567' );
    print "record identifier 1234567 can be retrieved as ",
        join( ',', $list->prefixes() ),"\n";

See documentation for Net::OAI::ListMetadataFormats for more details.

getRecord()
getRecord() is used to retrieve a single record from a repository. You must pass in the "identifier" parameter to identify the record, and may pass the optional "metadataPrefix" parameter to select the flavor of metadata you would like. Net::OAI::Harvester includes a parser for OAI DublinCore, so if you do not specify a metadataPrefix, 'oai_dc' will be assumed. If you would like to drop in your own XML::Handler for another type of metadata, use either the "metadataHandler" or the "recordHandler" parameter, passing either the name of the class as a string or an already instantiated object of that class.

    my $result = $harvester->getRecord(
        identifier => 'abc123',
    );

    ## did something go wrong?
    if ( my $oops = $result->errorCode() ) {
        ...
    };

    ## get the result as Net::OAI::Record object
    my $record = $result->record(); # undef if error

    ## directly get the Net::OAI::Record::Header object
    my $header = $result->header(); # undef if error
    ## same as
    my $header = $result->record()->header(); # undef if error

    ## get the metadata object
    my $metadata = $result->metadata(); # undef if error or harvested with recordHandler

    ## or if you would rather use your own XML::Handler
    ## pass in the package name for the object you would like to create
    my $result = $harvester->getRecord(
        identifier      => 'abc123',
        metadataHandler => 'MyHandler'
    );
    my $metadata = $result->metadata();

    my $result = $harvester->getRecord(
        identifier    => 'abc123',
        recordHandler => 'MyCompleteHandler'
    );
    my $complete_record = $result->recorddata(); # undef if error or harvested with metadataHandler

listRecords()
listRecords() allows you to retrieve all the records in a data repository. You must supply the "metadataPrefix" parameter to tell your Net::OAI::Harvester which type of records you are interested in. listRecords() returns a Net::OAI::ListRecords object. There are four other optional parameters, "from", "until", "set", and "resumptionToken", which are better described in the OAI-PMH spec.

    my $records = $harvester->listRecords(
        metadataPrefix => 'oai_dc'
    );

    ## iterate through the results with next()
    while ( my $record = $records->next() ) {
        my $metadata = $record->metadata();
        ...
    }

If you would like to use your own metadata handler then you can specify the package name of the handler as the "metadataHandler" parameter (the handler will be exposed to events below the "metadata" element) or as the "recordHandler" parameter (the handler will be exposed to the "record" element and all its children), passing either

    my $records = $harvester->listRecords(
        metadataPrefix  => 'mods',
        metadataHandler => 'MODS::Handler'
    );

    while ( my $record = $records->next() ) {
        my $metadata = $record->metadata();
        # $metadata will be a MODS::Handler object
    }

If you want to automatically handle resumption tokens you can achieve this with the listAllRecords() method. In this case the "next()" method transparently causes the next response to be fetched from the repository if the current response ran out of records and contained a resumptionToken.

If you prefer, you can handle resumption tokens yourself with a loop and the resumptionToken() method. You might want to do this if you are working with a repository that wants you to wait between requests, or if connectivity problems become an issue during particularly long harvesting runs and you want to implement a retransmission strategy for failing requests.

    my $records = $harvester->listRecords( metadataPrefix => 'oai_dc' );
    my $responseDate = $records->responseDate();
    my $finished = 0;

    while ( ! $finished ) {

        while ( my $record = $records->next() ) { # a Net::OAI::Record object
            my $metadata = $record->metadata();
            # do interesting stuff here
        }

        my $rToken = $records->resumptionToken();
        if ( $rToken ) {
            $records = $harvester->listRecords(
                resumptionToken => $rToken->token()
            );
        } else {
            $finished = 1;
        }

    }

Please note: Since "listRecords()" stashes away the individual records it encounters with "Storable", special care has to be taken if the handlers you provide make use of XS modules, since these objects cannot be reliably serialized. In that case you will have to provide the special serializing and deserializing methods "STORABLE_freeze()" and "STORABLE_thaw()" for the objects used by your filter(s).
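
As an illustration of the note above, here is a minimal sketch of a handler class that supplies its own Storable hooks. The class name MyHandler, its fields, and its methods are hypothetical, and a code reference stands in for an XS-backed object (Storable refuses to serialize code references by default, much as it cannot reliably handle XS objects). A real handler would also implement the XML::SAX event methods, which are omitted here.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical metadata handler; name and fields are illustrative only.
package MyHandler;

sub new {
    my ($class) = @_;
    # 'native' stands in for an XS-backed object that Storable cannot
    # serialize; a code reference is used because Storable rejects it too.
    return bless { titles => [], native => sub { 1 } }, $class;
}

sub add_title { my ( $self, $t ) = @_; push @{ $self->{titles} }, $t; return $self }
sub titles    { return @{ $_[0]{titles} } }

# Called by Storable in place of its default serialization:
# return only the plain data, as a single string.
sub STORABLE_freeze {
    my ( $self, $cloning ) = @_;
    return join "\x00", @{ $self->{titles} };
}

# Called by Storable on a freshly blessed empty object:
# restore the plain data and rebuild the non-serializable part.
sub STORABLE_thaw {
    my ( $self, $cloning, $serialized ) = @_;
    $self->{titles} = [ split /\x00/, $serialized ];
    $self->{native} = sub { 1 };
    return;
}

package main;

my $handler = MyHandler->new->add_title('First title')->add_title('Second title');
my $copy    = thaw( freeze($handler) );
print join( ', ', $copy->titles ), "\n";
```

The design choice is to serialize only the data the handler has collected and to recreate anything tied to native code inside STORABLE_thaw(), so that the round trip never touches the part Storable cannot handle.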

listAllRecords()
Does exactly what listRecords() does, except the "next()" method will automatically submit resumption tokens as needed.

    my $records = $harvester->listAllRecords( metadataPrefix => 'oai_dc' );

    while ( my $record = $records->next() ) { # a Net::OAI::Record object until undef
        my $metadata = $record->metadata();
        # do interesting stuff here
    }

listIdentifiers()
listIdentifiers() takes the same parameters that listRecords() takes, but it returns only the record headers, allowing you to quickly retrieve all the record identifiers for a particular repository. The object returned is a Net::OAI::ListIdentifiers object.

    my $identifiers = $harvester->listIdentifiers(
        metadataPrefix => 'oai_dc'
    );

    ## iterate through the results with next()
    while ( my $header = $identifiers->next() ) { # a Net::OAI::Record::Header object
        print "identifier: ", $header->identifier(), "\n";
    }

If you want to automatically handle resumption tokens use listAllIdentifiers(). If you are working with a repository that encourages pauses between requests you can handle the tokens yourself using the technique described above in listRecords().

listAllIdentifiers()
Does exactly what listIdentifiers() does, except "next()" will automatically submit resumption tokens as needed.

listSets()
listSets() takes an optional "resumptionToken" parameter, and returns a Net::OAI::ListSets object. The set specs it reports can be passed as the "set" parameter of listRecords() to harvest a subset of a particular repository. For more information see the OAI-PMH spec and the Net::OAI::ListSets docs.

    my $sets = $harvester->listSets();
    foreach ( $sets->setSpecs() ) {
        print "set spec: $_ ; set name: ", $sets->setName( $_ ), "\n";
    }

baseURL()
Gets or sets the base URL for the repository being harvested (as "" in URI).

    $harvester->baseURL( 'http://memory.loc.gov/cgi-bin/oai2_0' );

Or if you want to know what the current baseURL is:

    $baseURL = $harvester->baseURL();

userAgent()
Gets or sets the LWP::UserAgent object being used to perform the HTTP transactions.
This method could be useful if you wanted to change the agent string, timeout, or some other feature.

DIAGNOSTICS
If you would like to see diagnostic information when harvesting is running then set $Net::OAI::Harvester::DEBUG to a true value.

    $Net::OAI::Harvester::DEBUG = 1;

PERFORMANCE
XML::SAX is used for parsing, but it presents a generalized interface to many parsers. It comes with XML::SAX::PurePerl by default, which is nice since you don't have to worry about getting the right libraries installed. However, XML::SAX::PurePerl is rather slow compared to XML::LibXML. If you are a speed freak install XML::LibXML from CPAN today.

If you have a particular parser you want to use you can set the $XML::SAX::ParserPackage variable appropriately. See XML::SAX::ParserFactory documentation for details.

ENVIRONMENT
The modules use LWP for HTTP operations, thus "PERL_LWP_ENV_PROXY" controls whether the "_proxy" environment settings shall be honored.

TODO
SEE ALSO
AUTHORS
Ed Summers <ehs@pobox.com>
Martin Emmerich <Martin.Emmerich@oew.de>
Thomas Berger <ThB@gymel.com>

LICENSE
This is free software, you may use it and distribute it under the same terms as Perl itself.