NAME

RDFStore::Parser::SiRPAC - This module implements a streaming RDF Parser as a direct implementation of XML::Parser::Expat(3)

SYNOPSIS

    use RDFStore::Parser::SiRPAC;
    use RDFStore::NodeFactory;

    my $p=new RDFStore::Parser::SiRPAC(
        ErrorContext => 2,
        Handlers     => {
            Init   => sub { print "INIT\n"; },
            Final  => sub { print "FINAL\n"; },
            Assert => sub { print "STATEMENT - @_\n"; }
        },
        NodeFactory  => new RDFStore::NodeFactory() );

    $p->parsefile('http://www.gils.net/bsr-gils.rdfs');
    $p->parsefile('http://www.gils.net/rdf/bsr-gils.rdfs');
    $p->parsefile('/some/where/my.rdf');
    $p->parsefile('file:/some/where/my.rdf');

    # parse a stream with *blocking* Expat (see the example below
    # for non-blocking parsing using XML::Parser::ExpatNB)
    $p->parse(*STDIN);

    use RDFStore::Parser::SiRPAC;
    use RDFStore::NodeFactory;

    my $pstore=new RDFStore::Parser::SiRPAC(
        ErrorContext  => 2,
        Style         => 'RDFStore::Parser::Styles::RDFStore::Model',
        NodeFactory   => new RDFStore::NodeFactory(),
        style_options => {
            persistent    => 1,
            seevalues     => 1,
            store_options => { Name => '/tmp/test' }
        } );

    my $rdfstore_model = $pstore->parsefile('http://www.gils.net/bsr-gils.rdfs');

    # using the Expat non-blocking feature (generally for large
    # XML streams) - see XML::Parser::Expat(3)
    my $rdfstore_stream_model = $pstore->parsestream(*STDIN);

DESCRIPTION

This module implements a Resource Description Framework (RDF) streaming parser completely in Perl using the XML::Parser::Expat(3) module. The actual RDF parsing is done by an instance of XML::Parser::Expat with the Namespaces option enabled and the start, stop and char handlers set. The RDF-specific code is based on the modified Java version of SiRPAC by Sergey Melnik; many changes and adaptations were needed to run it under Perl.

Expat options may be provided when the RDFStore::Parser::SiRPAC object is created.
These options are then passed on to the Expat object on each parse call.

Exactly as in XML::Parser(3), the behavior of the parser is controlled either by the Style and/or Handlers options, or by the setHandlers method. These all provide mechanisms for RDFStore::Parser::SiRPAC to set the handlers needed by Expat. If neither Style nor Handlers is specified, parsing just checks the RDF document syntax against the W3C RDF Recommendation. When the underlying handlers get called, they receive the Expat object as their first parameter, not the Parser object.

For examples of how to use the module, see the sections below and the samples and utils directories that come with this software distribution.

E.g. with RDFStore::Parser::SiRPAC you can easily write an rdfingest.pl script to do something like this:

    fetch -o - -q http://dmoz.org/rdf/content.rdf.u8.gz | \
        gunzip - | \
        sed -f dmoz.content.sed | rdfingest.pl -

METHODS
All the other XML::Parser and XML::Parser::Expat options should work freely with RDFStore::Parser::SiRPAC; see XML::Parser(3) and XML::Parser::Expat(3).

HANDLERS

Like Expat, SiRPAC is an event-based parser. As the parser recognizes parts of the RDF document, any handlers registered for that type of event are called with suitable parameters. All handlers receive an instance of XML::Parser::Expat as their first argument. See "METHODS" in XML::Parser::Expat for a discussion of the methods that can be called on this object.

Init (Expat)

This is called just before the parsing of the document starts.

Final (Expat)

This is called just after parsing has finished, but only if no errors occurred during the parse. Parse returns what this returns.

Assert (Expat, Statement)

This event is generated when a new RDF statement has been generated by the parser. Statement is of type RDFStore::Statement(3), as generated by the RDFStore::NodeFactory(3) passed as an argument to the RDFStore::Parser::SiRPAC constructor.

Start_XML_Literal (Expat, Element [, Attr, Val [,...]])

This event is generated when an XML start tag is recognized within an RDF property with parseType="Literal". Element is the name of the XML element type that is opened with the start tag. The Attr and Val pairs are generated for each attribute in the start tag.

This handler should return a string containing either the original XML chunk or one of its transformations, perhaps using XSLT.

Stop_XML_Literal (Expat, Element)

This event is generated when an XML end tag is recognized within an RDF property with parseType="Literal". Note that an XML empty tag (<foo/>) generates both a Start_XML_Literal and a Stop_XML_Literal event.

Char_XML_Literal (Expat, String)

This event is generated when non-markup is recognized within an RDF property with parseType="Literal". The non-markup sequence of characters is in String. A single non-markup sequence of characters may generate multiple calls to this handler. Whatever the encoding of the string in the original document, it is given to the handler in UTF-8.

This handler should return the processed text as a string.

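As an illustration of the three XML literal handlers, here is a minimal sketch that simply re-serializes the events back into an XML string. The sub names and the xml_escape helper are hypothetical, and the sketch assumes that Element and the Attr names can be used as they are received (it makes no attempt to restore namespace prefixes).

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are significant
# in XML text and in double-quoted attribute values.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Start_XML_Literal (Expat, Element [, Attr, Val [,...]])
sub start_xml_literal {
    my ($expat, $element, @attrs) = @_;
    my $tag = "<$element";
    # attributes arrive as a flat list of name/value pairs
    while (my ($name, $value) = splice(@attrs, 0, 2)) {
        $tag .= sprintf(' %s="%s"', $name, xml_escape($value));
    }
    return "$tag>";
}

# Stop_XML_Literal (Expat, Element)
sub stop_xml_literal {
    my ($expat, $element) = @_;
    return "</$element>";
}

# Char_XML_Literal (Expat, String)
sub char_xml_literal {
    my ($expat, $string) = @_;
    return xml_escape($string);
}

# Handler table in the shape expected by the Handlers option
my %literal_handlers = (
    Start_XML_Literal => \&start_xml_literal,
    Stop_XML_Literal  => \&stop_xml_literal,
    Char_XML_Literal  => \&char_xml_literal,
);
```

Such a table could then be passed as the Handlers option of the constructor or to setHandlers. Keeping the subs free of parser state makes them easy to exercise on their own.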
manage_bNodes (Expat, Factory, SystembNode)

This event is triggered when a new anonymous resource (bNode) needs to be generated by the system, e.g. 'genidrdfstoreS302e313439323736363337373935353039P22533T106968250320N21', or when an rdf:nodeID attribute is found in the input RDF/XML.

When there is no rdf:nodeID, the system by default uses the 'GenidNumber' base passed to the parser constructor to count the bNode identifiers sequentially, and internally creates an anonymous RDF resource node. If no 'GenidNumber' is given, the counter starts from zero (0). The system then concatenates this sequential counter to another unique string built from a hex-encoded random number between 0 and 1 (i.e. unpack("H*", rand())), the current system process ID (PID) and the system timestamp. For example, if 'GenidNumber' is set to '45', rand()=0.149276637795509, PID=2233 and system timestamp=1069672550, the parser will generate 'genidrdfstoreS302e313439323736363337373935353039P2233T1069672550N45' as the identifier.

Note that such a unique ID (apart from the 'genidrdfstore' prefix) can be post-processed to extract the PID, the timestamp and the counter by using the 'S', 'P', 'T' and 'N' separator characters. In addition, the prefix of such an identifier should be unique across different parser runs, even on the same file/source, i.e. the random seed, PID and timestamp uniquely identify the parse run (process).

If an rdf:nodeID is encountered, the system simply copies the bNode identifier through, e.g. rdf:nodeID="blue" will remain unchanged as 'blue' and so on.

NOTE: 'manage_bNodes' could be called several times for the same 'SystembNode'.

If this handler is undefined the system behaves by default as outlined above; otherwise the end-application can interact with this process. By using this handler the end-application can control how identifiers for anonymous resources are generated OR how specific bNodes are rewritten as normal URI-qualified resources.

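The post-processing of system-generated identifiers described above can be sketched as follows. The split_genid function and the hash keys it returns are hypothetical names chosen for illustration; the pattern assumes the identifier layout shown in the worked example (lowercase hex seed, then decimal PID, timestamp and counter).

```perl
use strict;
use warnings;

# Split a parser-generated bNode identifier into its parts.
# Returns nothing (undef in scalar context) for identifiers that
# do not follow the layout, e.g. plain rdf:nodeID values.
sub split_genid {
    my ($id) = @_;
    return unless $id =~ /^genidrdfstoreS([0-9a-f]+)P(\d+)T(\d+)N(\d+)$/;
    return {
        seed      => pack('H*', $1),   # reverse of unpack("H*", rand())
        pid       => $2,
        timestamp => $3,
        counter   => $4,
    };
}

my $parts = split_genid(
    'genidrdfstoreS302e313439323736363337373935353039P2233T1069672550N45');
printf "seed=%s pid=%s timestamp=%s counter=%s\n",
    @{$parts}{qw(seed pid timestamp counter)};
```

For the identifier above this yields the seed 0.149276637795509, PID 2233, timestamp 1069672550 and counter 45, matching the worked example.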
The end-application receives a 'manage_bNodes' event for each new (system-generated) bNode and whenever an rdf:nodeID attribute is found in the input RDF/XML source. The system-wide generated 'SystembNode' identifier is also passed to the handler code. As already pointed out, multiple events could result for the same 'SystembNode'. The 'SystembNode' parameter contains either the bare system-generated identifier, like 'genidrdfstoreS302e313439323736363337373935353039P2233T1069672550N45', or the rdf:nodeID, like 'blue'. The end-application is recommended to keep the 'genidrdfstore' prefix for sequentially generated identifiers; this will make it possible in the future to immediately distinguish bNodes generated by the RDF/XML parser from rdf:nodeID or application-specific ones.

By using the 'manage_bNodes' event the application could, for example, keep track internally of system-unique (and/or sequential) identifiers for bNodes, or rewrite a given bNode. The end-application must use the 'Factory' parameter (which should correspond to the 'NodeFactory' parameter passed to the parser constructor) to return to the parser (the caller) the RDFStore::RDFNode(3) to use in place of the specific generated event. For example, the handler could keep a look-up table from system-generated bNodes or input-source rdf:nodeID values to application-specific URIs, in which case the end-application would rewrite input anonymous resources into valid, world-wide unique resources.

Here are three examples. The first simply delegates the generation of an anonymous resource to the parser:

    sub manage_bNodes {
        # does nothing special, just passes the identifier through
        return $_[1]->createAnonymousResource( $_[2] );
    }

The second example rewrites system-wide generated bNodes to application-specific bNodes:

    sub manage_bNodes {
        my ($expat, $factory, $systemid) = @_;
        $systemid =~ s/^genidrdfstore/genidmyapplication/;
        return $factory->createAnonymousResource( $systemid );
    }

The last example rewrites parser-generated and rdf:nodeID bNodes to an application-specific URI list:

    my %app_uri_map = (
        'genidrdfstoreS302e313439323736363337373935353039P2233T1069672550N45' => 'http://www.asemantics.com/index.html',
        'alberto' => 'http://foaf.asemantics.com/alberto',
        'zac'     => 'http://foaf.asemantics.com/zac',
        'dirkx'   => 'http://foaf.asemantics.com/dirkx' );

    sub manage_bNodes {
        my ($expat, $factory, $systemid) = @_;
        # identifiers missing from the map stay anonymous
        return $factory->createAnonymousResource( $systemid )
            unless exists $app_uri_map{$systemid};
        return $factory->createResource( $app_uri_map{$systemid} );
    }

This handler must return a valid RDFStore::Resource(3) node.

WRITE YOUR OWN PARSER

Writing an extension module for your needs is as easy as writing one for XML::Parser :) Have a look at http://www.xml.com/xml/pub/98/09/xml-perl.html and http://wwwx.netheaven.com/~coopercc/xmlparser/intro.html.

You can either turn your Perl script itself into a parser by embedding the needed function hooks, or write a custom Style module for RDFStore::Parser::SiRPAC.

*.pl scripts

    use RDFStore::Parser::SiRPAC;
    use RDFStore::NodeFactory;

    my $p=new RDFStore::Parser::SiRPAC(
        Handlers    => {
            Init   => sub { print "INIT\n"; },
            Final  => sub { print "FINAL\n"; },
            Assert => sub { print "STATEMENT - @_\n"; }
        },
        NodeFactory => new RDFStore::NodeFactory() );

or something like:

    use RDFStore::Parser::SiRPAC;
    use RDFStore::NodeFactory;

    my $p=new RDFStore::Parser::SiRPAC( NodeFactory => new RDFStore::NodeFactory() );
    $p->setHandlers(
        Init   => sub { print "INIT\n"; },
        Final  => sub { print "FINAL\n"; },
        Assert => sub { print join(",",@_),"\n"; } );

Style modules

A more sophisticated solution is to write a complete Perl5 Style module for RDFStore::Parser::SiRPAC that can be easily reused in your code. E.g. a Perl script could use this piece of code:

    use RDFStore::Parser::SiRPAC;
    use RDFStore::Parser::SiRPAC::MyStyle;
    use RDFStore::NodeFactory;

    my $p=new RDFStore::Parser::SiRPAC(
        Style       => 'RDFStore::Parser::SiRPAC::MyStyle',
        NodeFactory => new RDFStore::NodeFactory() );
    $p->parsefile('http://www.gils.net/bsr-gils.rdfs');

The Style module itself could be stored in a file such as MyStyle.pm, like this:

    package RDFStore::Parser::SiRPAC::MyStyle;

    sub Init  { print "INIT\n"; }
    sub Final { print "FINAL\n"; }
    sub Assert {
        print "ASSERT: ",
            $_[1]->subject()->toString(),
            $_[1]->predicate()->toString(),
            $_[1]->object()->toString(), "\n";
    }
    sub Start_XML_Literal { print "STARTAG: ",$_[1],"\n"; }
    sub Stop_XML_Literal  { print "ENDTAG: ",$_[1],"\n"; }
    sub Char_XML_Literal  { print "UTF8 chrs: ",$_[1],"\n"; }

    1;

For a more complete and useful example see RDFStore::Parser::SiRPAC::RDFStore(3).

BUGS

This module implements most of the W3C RDF Recommendation, like its Java counterpart SiRPAC from the Stanford University Database Group by Sergey Melnik (see http://www-db.stanford.edu/~melnik/rdf/api.html). This version is conformant to the latest RDF API Draft of 2000-11-13.
It does not yet support:

* aboutEach

SEE ALSO

RDFStore::Parser::SiRPAC(3), DBMS(3), XML::Parser(3), XML::Parser::Expat(3), RDFStore::Model(3) and RDFStore::NodeFactory(3)

RDF Model and Syntax Specification - http://www.w3.org/TR/rdf-syntax-grammar/

RDF Schema Specification 1.0 - http://www.w3.org/TR/rdf-schema/

Benchmarking XML Parsers by Clark Cooper - http://www.xml.com/pub/Benchmark/article.html

See also http://www.w3.org/RDF/Implementations/SiRPAC/SiRPAC-defects.html

RDF::Parser(3) from http://www.pro-solutions.com

AUTHOR

Alberto Reggiori <areggiori@webweaving.org>

Sergey Melnik <melnik@db.stanford.edu> is the original author of the streaming version of SiRPAC in Java.

Clark Cooper is the author of the XML::Parser(3) module, together with Larry Wall.