![]() |
![]()
| ![]() |
![]()
NAMEPDF::Tiny - Minimal Lightweight PDF LibraryVERSIONVersion 0.09This is an alpha version. The API is still subject to change. SYNOPSISuse PDF::Tiny; # Count pages my $filename = "filename.pdf" my $pdf = new PDF::Tiny $filename; my $page_count = $pdf->page_count; print "$filename has $page_count page", 's'x($page_count!=1), ".\n"; # Extract pages my $source_pdf = new PDF::Tiny "other.pdf"; my $new_pdf = new PDF::Tiny version => $source_pdf->version; # Get just the first three for (0..2) { $new_pdf->import_page($source_pdf, $_); } $new_pdf->print(fh => \*STDOUT); # or filename => "out.pdf" # Change title (low-level stuff) my $pdf = new PDF::Tiny "foo.pdf"; $pdf->trailer->[1]->{Info} ||= PDF::Tiny::make_ref($pdf->add_obj(PDF::Tiny::make_dict({}))); $pdf->vivify_obj('str', '/Info', '/Title')->[1] = "Tie Tool"; $pdf->append; DESCRIPTIONThis is a very lightweight (and limited) PDF parser. If you need to do some simple PDF processing on a web server with limited RAM and CPU, and if slurping the entire file into memory is not an option, this module may well be for you, at the cost of far less functionality than other solutions out there.This module really does assume you know something about the PDF format. This documentation includes a brief section on "THE PDF FILE FORMAT", if you are too lazy to read the specification. LIMITATIONS
INTERFACEThis section uses many terms that are explained in "THE PDF FILE FORMAT" and the "GLOSSARY", below.ConstructorYou can create a PDF from scratch:$pdf = new PDF::Tiny; Or open an existing file: $pdf = new PDF::Tiny $filename; The constructor also accepts hash-style arguments: $pdf = new PDF::Tiny filename => "foo.pdf", empty => 0, version => "1.4", ; If "filename" is absent, it is assumed that a PDF file is being created from scratch. "empty" is a boolean parameter (ignored if a file name is given). When it is false (the default), the new PDF::Tiny object will contain a root and a pages object (the latter containing a Kids array and a Count of 0). If "empty" is true, the root and the pages array will be absent. Only use this if you are going to add the root yourself. (A PDF without a root and a page tree is not well-formed. High-level methods may not work until you have added those.) "version" specifies the version of the PDF format. It is ignored when opening an existing file. It defaults to 1.4 when creating a PDF from scratch. If you are going to make a PDF file that contains pages from another PDF, you should probably set it to the version of the other PDF file. High-Level Methods
Low-Level Methodsmodified
$pdf->modified; # get the hash $pdf->modified("1 0"); # mark 1 0 as modified $pdf->modified("/Info", "/Title"); # mark the indirect object con- # taining the title as modified This method returns a reference to a hash of modified indirect objects. This is used to determine which objects need to be included in an incremental update. If you pass arguments, the object in question will be marked as modified. If you are doing an incremental update and modifying objects by hand, you will need to call this for every object that is modified. The exceptions are as follows:
All of those methods themselves mark the objects as modified. The arguments follow the same format as described below for "get_obj", except that the first argument must be an object id ("1 0"), a trailer entry ("/Info") or the word "trailer". Warning: Getting this right can be tricky. If a modified object is not marked as such, the changes made will not be saved by "append". If the file is small enough, it might be better to use the "print" method and avoid this pitfall. The hash format is as follows: # obj num value should always be true { '1 0' => 1, '2017 0' => 1, ... } (Yes, you can edit the hash manually.) objects Returns a reference to a hash of all parsed indirect objects. See "GUTS". { '1 0' => $obj, '2 0' => $obj2, ... } trailer Returns the PDF trailer dictionary as a parsed object. If the PDF has cross-reference streams, then the trailer is actually the dictionary of the last cross-reference stream. In the latter case, any entries specific to cross-reference streams will be omitted by "print". (The list used is based on the PDF 1.7 reference.) read_obj $pdf->read_obj("4 0") Reads the specified object from the file and stores it as a token sequence or flat object (see "GUTS") in the "objects" hash. The stored object is also returned. If the object already exists in memory, it is simply returned. get_obj $pdf->get_obj("4 0") Reads an indirect object from the file (if necessary), and parses and returns it. If it is already in the "objects" hash, it is upgraded to a first-class object (if it was flat or a token sequence) and then returned. If the object cannot be found, or if it is null, nothing is returned ("undef" or empty list). See also "vivify_obj". $pdf->get_obj("4 0", "/Pages", "/Kids", 3); Dereferences several levels of objects. In this example, "4 0" is probably the root object (a dictionary), with a Pages entry (also a dictionary), with a Kids entry (an array), and it returns element 3 of the Kids array. If element 3 is a reference, the object it points to is returned. The slash is not optional. It is used to distinguish between dictionary and array elements. The characters following the slash must not be escaped (whereas in PDF source they can be escaped). $pdf->get_obj("/Root", "/Pages", "/Kids"); The first argument may be a dictionary element, in which case the lookup begins at the trailer dictionary. $root = $pdf->get_obj("/Root"); $pdf->get_obj($root, "/Pages", "/Kids"); The first argument may also be a parsed object. $pdf->get_obj($root); If you pass just a single parsed object, it will be returned, unless it is actually an indirect reference, in which case the object will be looked up and returned. vivify_obj $pdf->vivify_obj($type, ...) The first argument must be one of the types listed in "GUTS". The remaining arguments are those accepted by "get_obj". An object of the specified type will be vivified, along with all the intervening objects (whose type, array or dictionary, is determined by whether the element begins with a slash). None of the vivified objects will be indirect objects. Any object returned by "vivify_obj" (whether vivified or not) will be marked as modified, under the assumption that you are going to modify it. get_page $pdf->get_page($num) Returns the parsed object associated with the page in question. Pages are numbered from 0. Negative numbers count from the end. -1 means the last page. import_obj $pdf->import_obj($other_pdf, $object) Imports an object, and all other objects it references, from another PDF file. (This entails making sure that each imported object gets renumbered to a number that the destination $pdf is not already using.) This method keeps track of which indirect objects have been imported already, so it can be called multiple times without duplicating the same objects. (It also means that subsequent changes to objects in the source PDF will go unnoticed.) $object may be a string or a parsed object. If $object is a string, it must be an object id ("1 0"). The return value will also be an object id. If it is a parsed object, the object itself will not be added to the $pdf’s list of indirect objects, because it is assumed that it will be inserted directly inside another object. (Or you can pass the return value to "add_obj".) All objects it references though (by numeric id) will be imported. The object itself will be cloned and the new value returned (with all references to other objects updated). Warning: If you try to import pages from another PDF document with this, watch out for the ‘/Parent’ link from each page to its parent page array. You’ll end up pulling in the parent page array, too, bloating your PDF with page data that will not be displayed. This can also be used to renumber all the objects in a PDF (excluding orphans), to avoid bloated cross-reference tables (but this does entail reading the entire file into memory): my $new_pdf = new PDF'Tiny version => $old_pdf->version, empty => 1; my $new_trailer = $new_pdf->trailer->[1]; my $old_trailer = $old_pdf->trailer->[1]; $new_trailer->{Root} = $new_pdf->import_obj($old_pdf, $old_trailer->{Root}); $new_trailer->{Info} = $new_pdf->import_obj($old_pdf, $old_trailer->{Info}) if $old_trailer->{Info}; add_obj $pdf->add_obj($parsed_obj); Adds a new indirect object to the PDF and returns the id that got used. decode_stream $pdf->decode_stream($parsed_obj) $pdf->decode_stream("10 0") $pdf->decode_stream(qw< /Root /Pages /Kids 0 /Content >) Decodes a stream and returns it. Currently the only filter supported is FlateDecode and the only predictor function supported is PNG. This is an lvalue function, so you can call it like this to avoid copying the stream yet again after decoding: $streamref = \$self->decode_stream(...) (This also means you can assign to the "decode_stream" call, which is pointless.) FunctionsNone of these functions is exported. Call each one with a "PDF::Tiny::" prefix.
THE PDF FILE FORMATThe body of a PDF file consists of a collection of what are called objects, in no particular order. Each has a numeric id that consists of two numbers separated by space. Here is an example of what they look like:23 0 obj << /Type /Catalog /Pages 3 0 R >> endobj This is object number 23 0. The "<<" and ">>" indicate a dictionary object (like a hash). The value of the ‘Type’ entry is the name ‘Catalog’ (a slash before it indicates a name, or identifier). The value of the ‘Pages’ entry is a reference to object number 3 0. Everything is an object. In PDF parlance, even something as simple as a number is called an object. An object embedded directly inside another one (such as the number in "<< /Length 1 >>" is called a direct object. An object with a numeric ID (such as 23 0) is called an indirect object. An object ID followed by the keyword "R" is called an indirect reference. Since the indirect objects can be in any order, there follows a cross-reference table giving the exact location in the file of each indirect object. After the cross-reference table is the trailer, an example of which is the following: trailer << /Size 34 /Root 23 0 R /Info 1 0 R /ID [ <e2ca9df8c15ea42d17d5d724f61808b1> <e2ca9df8c15ea42d17d5d724f61808b1> ] >> startxref 7749 %%EOF Try opening an existing PDF file in a text editor (make sure it is PDF 1.4 [Acrobat 5] format or earlier). To see the structure of a PDF file, start with the trailer dictionary. Metadata are stored in the ‘Info’ entry, which here refers to object 1 0. For the actual document structure, you want to look at the ‘Root’ entry, which points to the document root or catalogue. That is object 23 0, shown above. If you find object 3 0 you will see that it is a dictionary containing a ‘Kids’ array of references to page objects, etc. Now, the actual content of pages is in a different language from the PDF structure, which is described here. It goes inside a stream object, referenced by the pages, which is usually compressed with Deflate encoding. (This module does not handle streams per se, but its low-level functions will allow you to get to them.) Even though the language for drawing pages is not the same as that used for PDF structure, it follows the same tokenization rules, so you can use this module’s "tokenize" function if you are writing your own stream-processing code. GUTSGuts of the PDF::Tiny objects.Nothing is an object. By that I mean that PDF::Tiny does not use Perl objects to represent PDF objects. Rather, it uses array refs. These are referred to throughout this documentation as parsed objects. A parsed object looks like this: [ $type, $value ] [ 'num', 3 ] # a number [ 'dict', {...} ] # a PDF dictionary (i.e., hash) [ 'str', 'foo' ] # The string 'foo' [ 'array', [...] ] # An array [ 'bool', 1 ] # A boolean [ 'name', "Root" ] # The name (identifier) /Root [ 'ref', "2 0" ] # A reference to an indirect object [ 'null' ] # This special value has one element [ 'stream', $dict, $content ] # This exception has a parsed dictionary # object for element 1 and the stream # content for element 2 The values of dictionaries and arrays are also parsed objects. The value of an indirect reference (not really an object) consists of two integers without leading zeroes (except 0 itself) separated by a space. Even though "000\n001 R" is a valid reference in PDF syntax, this module always parses it as "0 1", which is important since it is used as a hash key. The various "make_*" functions can be used to create these. There are also two special cases, which are handled transparently like the other objects (and converted into them if necessary by "get_obj"): [ 'flat', '.......' ] # A flattened (serialized) object [ 'tokens', [$token1, $token2, ...]] # A sequence of tokens So the following three are equivalent: [ 'dict', { Type => [ 'name', 'Catalog' ], Pages => ['ref', '2 0'] } ] [ 'flat', '<</Type/Catalog/Pages 2 0 R>>' ] [ 'tokens', [qw '<< /Type /Catalog /Pages 2 0 R >>'] ] (This does not apply to the trailer. Do not flatten the trailer.) Also, streams cannot be stored in flat or tokenized format, but their dictionaries can: [ 'stream', ['dict', { Length => ['num', 9] }], 'scream!!!' ] [ 'stream', ['tokens', ['<<','/Length','9','>>']], 'scream!!!' ] [ 'stream', ['flat', '<</Length 9>>'], 'scream!!!' ] # Invalid: [ 'flat', "<</Length 9>>stream\nscream!!!" ] GLOSSARY
OTHER PDF MODULESWhy yet another PDF module, considering how many there are? Most of the other solutions were insufficiently lightweight for my needs. (In particular, I needed to write a web service that would serve a single page at a time from a collection of PDFs, some of which are 200MB. I needed it to be very responsive.)PDF::API2 (probably the best PDF module), CAM::PDF, and PDF::Extract all read the entire file into memory. Text::PDF is quite fast compared with the others, but I could not figure out how to use it, except to get a page count. PDF::Reuse is fast, but it has trouble with some PDF files, and it provides no page count feature (something I needed). PDF::Xtract I did not find out about until after writing this module. It is a fork of PDF::Extract. That said, there is no reason why you could not use PDF::Tiny in conjunction with other modules. CAM::PDF, for example, can generate PDFs from scratch fairly efficiently, but is slow at extracting pages from large PDFs. You could use it to generate a PDF, and then use PDF::Tiny to add pages afterwards from some other PDF. (Or extract pages with PDF::Tiny to a small PDF, and then import them with CAM::PDF.) CompatibilityUnfortunately, many of the modules mentioned above do not fully understand PDF syntax, or interpret the spec too strictly, such that they are unable to read certain PDFs. I have a large scanned book in the form of a PDF produced by ABBYY FineReader. I tried rewriting it with "PDF::Tiny->new($old)->print(filename => $new)", and then I tested both PDFs with the above modules. The results (with an Acrobat 6 column as a bonus):PDF producer PDF reader | FineReader | PDF::Tiny | Acrobat 6+ ------------------+------------+-----------+----------- CAM::PDF 1.60 | no | yes | yes PDF::API2 2.032 | yes | yes | yes PDF::Tiny 0.05 | yes | yes | yes Text::PDF 0.31 | no | no | no PDF::Reuse 0.39 | yes* | no | yes PDF::Extract 3.04 | yes | no | no PDF::Xtract 0.08 | yes | no | no Adobe Acrobat | yes | yes | yes Apple Preview | yes | yes | yes * It has trouble with the cross-reference table, such that it may or may not be able to extract the information you want. It happened to work for my purposes, but was slow and produced a bloated file. (The bug is fixed in the git repository and may be gone by the time you read this.) Part of the reason for the large number of noes in the middle column is that PDF::Tiny tries to get the files as compact as possible as fast as is possible with a reasonably small amount of code. To avoid reaching the PDF line length limit (which means entering a more complex and slower code path), it emits line breaks between tokens wherever whitespace is mandatory. It is probably the only PDF producer that does that. I have filed bug reports against all the modules that have a no in either column. I hope I do not have to slow down PDF::Tiny to work with these other modules. (If this proves to be a problem for anyone, let me know, and I can change the way it outputs whitespace.) BenchmarksOkay, so I took the PDF mentioned above (169.5 MB in size, containing 253 scanned pages) and benchmarked (1) fetching a page count and (2) extracting a single page, which are the two tasks for which I am using PDF::Tiny. The benchmark code (which uses Dumbbench) can be found in the benchmark file in the distribution. The results (on a 2.8 GHz Intel Core 2 Duo):Page count | Page extraction | Resulting file size ------------------+------------+-----------------+-------------------- PDF::Tiny 0.07 | 0.010197 s | 0.08735 s | 952 KB CAM::PDF 1.60 | 0.6923 s | 1.1849 s | 995 KB PDF::API2 2.033 | 1.7974 s | 2.1135 s | 953 KB PDF::Xtract 0.08 | 3.7902 s | 3.7805 s | 992 KB PDF::Reuse 0.39 | N/A | 15.36 s | 169.5 MB PDF::Extract 3.04 | N/A | 158.54 s | 954 KB Some explanation of the numbers: CAM::PDF and PDF::Xtract do not renumber objects, so they end up with a bloated 40K cross-reference table. PDF::Reuse drags in the entire contents of the source PDF but only includes one of its pages in the page tree. PREREQUISITESperl 5.10 or higher and core modules.This modules tries to be true to its Tininess by not loading any other modules unless absolutely necessary. When loaded it loads warnings. The "import_obj" method loads Hash::Util::FieldHash. And "import_page" calls "import_obj". The "decode_stream" method loads Compress::Zlib if the stream is compressed. The constructor calls "decode_stream" if the PDF has cross-reference streams (PDF 1.5/Acrobat 6). BUGSProbably lots. Most of the limitations could be considered bugs. Most of the limitations could also be considered features, because they make the module Tiny.This module is badly in need of tests. (Or it needs to be tested badly.) No doubt more tests will uncover bugs. Please send bug reports to bug-PDF-Tiny@rt.cpan.org <mailto:bug-PDF-Tiny@rt.cpan.org>. ACKNOWLEDGEMENTSThanks to Pali for his contributions.AUTHOR & COPYRIGHTCopyright (C) 2017 Father Chrysostomos <sprout [at] cpan [dot] org>This program is free software; you may redistribute it, modify it or both under the same terms as perl. The full text of the license can be found in the LICENSE file included with this module. POD ERRORSHey! The above document had some coding errors, which are explained below:
|