CAM::PDF(3)    User Contributed Perl Documentation    CAM::PDF(3)
CAM::PDF - PDF manipulation library
Copyright 2002-2006 Clotho Advanced Media, Inc., <http://www.clotho.com/>
Copyright 2007-2008 Chris Dolan
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
use CAM::PDF;
my $pdf = CAM::PDF->new('test1.pdf');
my $page1 = $pdf->getPageContent(1);
# [ ... mess with page ... ]
$pdf->setPageContent(1, $page1);
# [ ... create some new content ... ]
$pdf->appendPageContent(1, $newcontent);
my $anotherpdf = CAM::PDF->new('test2.pdf');
$pdf->appendPDF($anotherpdf);
my @prefs = $pdf->getPrefs();
$prefs[$CAM::PDF::PREF_OPASS] = 'mypassword';
$prefs[$CAM::PDF::PREF_UPASS] = 'mypassword';
$pdf->setPrefs(@prefs);
$pdf->cleanoutput('out1.pdf');
print $pdf->toPDF();
Many example programs are included in this distribution to do
useful tasks. See the "bin"
subdirectory.
This package reads and writes any document that conforms to the PDF
specification generously provided by Adobe at
<http://partners.adobe.com/public/developer/pdf/index_reference.html>
(link last checked Oct 2005).
The file format through PDF 1.5 is well-supported, with the
exception of the "linearized" or "optimized" output
format, which this module can read but not write. Many specific aspects of
the document model are not manipulable with this package (like fonts), but
if the input document is correctly written, then this module will preserve
the model integrity.
The PDF writing feature saves documents in a PDF 1.4-compatible
form, which means that it cannot write compressed object streams. As a
consequence, reading and then writing a PDF 1.5+ document may noticeably
enlarge the resulting file.
This library grants you some power over the PDF security model.
Note that applications editing PDF documents via this library MUST respect
the security preferences of the document. Failing to do so is contrary to
Adobe's intellectual property position, as stated in the reference manual
at the above URL.
Technical detail regarding corrupt PDFs: This library adheres
strictly to the PDF specification. Adobe's Acrobat Reader is more lenient,
allowing some corrupted PDFs to be viewable. Therefore, some PDFs that
Acrobat can read may be unreadable by this library. In particular, files
which have had line endings converted to or from DOS/Windows style
(i.e. CR-LF) may be rendered unusable even though Acrobat does not
complain. Future library versions may relax the parser, but not yet.
$self = CAM::PDF->new(content | filename | '-')
$self->toPDF()
$self->needsSave()
$self->save()
$self->cleansave()
$self->output(filename | '-')
$self->cleanoutput(filename | '-')
$self->previousRevision()
$self->allRevisions()
$self->preserveOrder()
$self->appendObject(olddoc, oldnum, [follow=(1|0)])
$self->replaceObject(newnum, olddoc, oldnum, [follow=(1|0)])
(olddoc can be undef in the above for adding new objects)
$self->numPages()
$self->getPageText(pagenum)
$self->getPageDimensions(pagenum)
$self->getPageContent(pagenum)
$self->setPageContent(pagenum, content)
$self->appendPageContent(pagenum, content)
$self->deletePage(pagenum)
$self->deletePages(pagenum, pagenum, ...)
$self->extractPages(pagenum, pagenum, ...)
$self->appendPDF(CAM::PDF object)
$self->prependPDF(CAM::PDF object)
$self->wrapString(string, width, fontsize, page, fontlabel)
$self->getFontNames(pagenum)
$self->addFont(page, fontname, fontlabel, [fontmetrics])
$self->deEmbedFont(page, fontname, [newfontname])
$self->deEmbedFontByBaseName(page, basename, [newfont])
$self->getPrefs()
$self->setPrefs()
$self->canPrint()
$self->canModify()
$self->canCopy()
$self->canAdd()
$self->getFormFieldList()
$self->fillFormFields(fieldname, value, [fieldname, value, ...])
or $self->fillFormFields(%values)
$self->clearFormFieldTriggers(fieldname, fieldname, ...)
Note: 'clean' as in cleansave() and cleanoutput()
means write a fresh PDF document. The alternative (e.g. save())
reuses the existing doc and just appends to it. Also note that 'clean'
functions sort the objects numerically. If you prefer that the new PDF docs
more closely resemble the old ones, call preserveOrder() before
cleansave() or cleanoutput().
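As a rough illustration of the two write styles described in this note, the sketch below picks between a fresh write and an appending write. The helper name write_pdf and its option keys (fresh, keep_order) are invented for this example; only preserveOrder(), cleanoutput() and output() come from the method list above, and $pdf is assumed to be a CAM::PDF instance.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF). $pdf is assumed to be a
# CAM::PDF instance; the option keys 'fresh' and 'keep_order' are invented.
sub write_pdf {
    my ($pdf, $outfile, %opt) = @_;
    if ($opt{fresh}) {
        # Optionally ask for the original object order to be kept.
        $pdf->preserveOrder() if $opt{keep_order};
        # Write a fresh document.
        return $pdf->cleanoutput($outfile);
    }
    # Reuse the existing document and append the changes to it.
    return $pdf->output($outfile);
}
```

A caller might use write_pdf($pdf, 'out.pdf', fresh => 1, keep_order => 1) when the output should resemble the input as closely as possible.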
$self->toString()
$self->getPage(pagenum)
$self->getFont(pagenum, fontname)
$self->getFonts(pagenum)
$self->getStringWidth(fontdict, string)
$self->getFormField(fieldname)
$self->getFormFieldDict(object)
$self->isLinearized()
$self->decodeObject(objectnum)
$self->decodeAll(any-node)
$self->decodeOne(dict-node)
$self->encodeObject(objectnum, filter)
$self->encodeOne(any-node, filter)
$self->changeString(obj-node, hashref)
$self->pageAddName(pagenum, name, objectnum)
$self->getPageObjnum(pagenum)
$self->getPropertyNames(pagenum)
$self->getProperty(pagenum, propname)
$self->getValue(any-node)
$self->dereference(objectnum) or $self->dereference(name,pagenum)
$self->deleteObject(objectnum)
$self->copyObject(obj-node)
$self->cacheObjects()
$self->setObjNum(obj-node, num)
$self->getRefList(obj-node)
$self->changeRefKeys(obj-node, hashref)
$self->getObjValue(objectnum)
$self->_startdoc()
$self->delinearize()
$self->build*()
$self->parse*()
$self->write*()
$self->*CB()
$self->traverse()
$self->fixDecode()
$self->abbrevInlineImage()
$self->unabbrevInlineImage()
$self->cleanse()
$self->clean()
$self->createID()
- $doc->new($package, $content)
- $doc->new($package, $content, $ownerpass, $userpass)
- $doc->new($package, $content, $ownerpass, $userpass, $prompt)
- $doc->new($package, $content, $ownerpass, $userpass, $options)
- Instantiate a new CAM::PDF object. $content can be
a document in a string, a filename, or '-'. The latter indicates that the
document should be read from standard input. If the document is password
protected, the passwords should be passed as additional arguments. If they
are not known, a boolean $prompt argument allows
the programmer to suggest that the constructor prompt the user for a
password. This is rudimentary prompting: passwords are in the clear on the
console.
This constructor takes an optional final argument which is a
hash reference. This hash can contain any of the following optional
parameters:
- prompt_for_password => $boolean
- This is the same as the $prompt argument described
above.
- fault_tolerant => $boolean
- This flag causes the instance to be more lenient when reading the input
PDF. Currently, this only affects PDFs which cannot be successfully
decrypted.
- $doc->toPDF()
- Serializes the data structure as a PDF document stream and returns it as a
scalar.
- $doc->toString()
- Returns a serialized representation of the data structure. Implemented via
Data::Dumper.
(all of these functions are intended for internal use only)
- $doc->getRootDict()
- Returns the Root dictionary for the PDF.
- $doc->getPagesDict()
- Returns the root Pages dictionary for the PDF.
- $doc->parseObj($string)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return an
object Node. This can be called as a class method in most circumstances,
but is intended as an instance method.
- $doc->parseInlineImage($string)
- $doc->parseInlineImage($string, $objnum)
- $doc->parseInlineImage($string, $objnum, $gennum)
- Given a fragment of PDF page content, parse it and return an object Node.
This can be called as a class method in some cases, but is intended as an
instance method.
- $doc->writeInlineImage($objectnode)
- This is the inverse of parseInlineImage(), intended for use only in
the CAM::PDF::Content class.
- $doc->parseStream($string, $objnum, $gennum, $dictnode)
- This should only be used by parseObj(), or other specialized cases.
Given a fragment of PDF page content, parse it and return a
stream Node. This can be called as a class method in most circumstances,
but is intended as an instance method.
The dictionary Node argument is typically the body of the
object Node that precedes this stream.
- $doc->parseDict($string)
- $doc->parseDict($string, $objnum)
- $doc->parseDict($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
dictionary Node. This can be called as a class method in most
circumstances, but is intended as an instance method.
- $doc->parseArray($string)
- $doc->parseArray($string, $objnum)
- $doc->parseArray($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return an
array Node. This can be called as a class or instance method.
- $doc->parseLabel($string)
- $doc->parseLabel($string, $objnum)
- $doc->parseLabel($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
label Node. This can be called as a class or instance method.
- $doc->parseRef($string)
- $doc->parseRef($string, $objnum)
- $doc->parseRef($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
reference Node. This can be called as a class or instance method.
- $doc->parseNum($string)
- $doc->parseNum($string, $objnum)
- $doc->parseNum($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
number Node. This can be called as a class or instance method.
- $doc->parseString($string)
- $doc->parseString($string, $objnum)
- $doc->parseString($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
string Node. This can be called as a class or instance method.
- $doc->parseHexString($string)
- $doc->parseHexString($string, $objnum)
- $doc->parseHexString($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
hex string Node. This can be called as a class or instance method.
- $doc->parseBoolean($string)
- $doc->parseBoolean($string, $objnum)
- $doc->parseBoolean($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
boolean Node. This can be called as a class or instance method.
- $doc->parseNull($string)
- $doc->parseNull($string, $objnum)
- $doc->parseNull($string, $objnum, $gennum)
- Use parseAny() instead of this, if possible.
Given a fragment of PDF page content, parse it and return a
null Node. This can be called as a class or instance method.
- $doc->parseAny($string)
- $doc->parseAny($string, $objnum)
- $doc->parseAny($string, $objnum, $gennum)
- Given a fragment of PDF page content, parse it and return a Node of the
appropriate type. This can be called as a class or instance method.
- $doc->getValue($object)
- For INTERNAL use
Dereference a data object and return a value. Given a node
object of any kind, returns a raw scalar object: hashref, arrayref,
string, number. This function follows all references, and descends into
all objects.
- $doc->getObjValue($objectnum)
- For INTERNAL use
Dereference a data object, and return a value. Behaves just
like the getValue() function, but used when all you know is the
object number.
- $doc->dereference($objectnum)
- $doc->dereference($name, $pagenum)
- For INTERNAL use
Dereference a data object, return a PDF object as a node. This
function makes heavy use of the internal object cache. Most (if not all)
object requests should go through this function.
$name should look something like
'/R12'.
- $doc->getPropertyNames($pagenum)
- $doc->getProperty($pagenum, $propertyname)
- Each PDF page contains a list of resources that it uses (images, fonts,
etc). getPropertyNames() returns an array of the names of those
resources. getProperty() returns a node representing a named
property (most likely a reference node).
- $doc->getFont($pagenum, $fontname)
- For INTERNAL use
Returns a dictionary for a given font identified by its label,
referenced by page.
- $doc->getFontNames($pagenum)
- For INTERNAL use
Returns a list of fonts for a given page.
- $doc->getFonts($pagenum)
- For INTERNAL use
Returns an array of font objects for a given page.
- $doc->getFontByBaseName($pagenum, $fontname)
- For INTERNAL use
Returns a dictionary for a given font, referenced by page and
the name of the base font.
- $doc->getFontMetrics($properties, $fontname)
- For INTERNAL use
Returns a data structure representing the font metrics for the
named font. The property list is the result of something like the
following:
$self->_buildNameTable($pagenum);
my $properties = $self->{Names}->{$pagenum};
Alternatively, if you know the page number, it might be easier
to do:
my $font = $self->dereference($fontlabel, $pagenum);
my $fontmetrics = $font->{value}->{value};
where the $fontlabel is something like
'/Helv'. The getFontMetrics() method is useful in the cases where
you've forgotten which page number you are working on (e.g. in
CAM::PDF::GS), or if your property list isn't part of any page (e.g.
working with form field annotation objects).
- $doc->addFont($pagenum, $fontname, $fontlabel)
- $doc->addFont($pagenum, $fontname, $fontlabel, $fontmetrics)
- Adds a reference to the specified font to the page.
If a font metrics hash is supplied (it is required for a font
other than the 14 core fonts), then it is cloned and inserted into the
new font structure. Note that if those font metrics contain references
(e.g. to the "FontDescriptor"), the
referred objects are not copied -- you must do that part yourself.
For Type1 fonts, the font metrics must minimally contain the
following fields: "Subtype",
"FirstChar",
"LastChar",
"Widths",
"FontDescriptor".
- $doc->deEmbedFont($pagenum, $fontname)
- $doc->deEmbedFont($pagenum, $fontname, $basefont)
- Removes embedded font data, leaving font reference intact. Returns true if
the font exists and 1) font is not embedded or 2) embedded data was
successfully discarded. Returns false if the font does not exist, or the
embedded data could not be discarded.
The optional $basefont parameter
allows you to change the font. This is useful when some applications
embed a standard font (see below) and give it a funny name, like
"SYLXNP+Helvetica". In this example,
it's important to change the basename back to the standard
"Helvetica" when de-embedding.
De-embedding the font does NOT remove it from the PDF
document; it just removes references to it. To get a size reduction by
throwing away unused font data, you should use the following code
sometime after this method.
$self->cleanse();
For reference, the standard fonts are
"Times-Roman",
"Helvetica", and
"Courier" (and their bold, italic and
bold-italic forms) plus "Symbol" and
"Zapfdingbats". (Adobe PDF Reference
v1.4, p.319)
- $doc->deEmbedFontByBaseName($pagenum, $fontname)
- $doc->deEmbedFontByBaseName($pagenum, $fontname, $basefont)
- Just like deEmbedFont(), except that the font name parameter refers
to the name of the current base font instead of the PDF label for the
font.
- $doc->wrapString($string, $width, $fontsize, $fontmetrics)
- $doc->wrapString($string, $width, $fontsize, $pagenum, $fontlabel)
- Returns an array of strings wrapped to the specified width.
- $doc->getStringWidth($fontmetrics, $string)
- For INTERNAL use
Returns the width of the string, using the font metrics if
possible.
- $doc->numPages()
- Returns the number of pages in the PDF document.
- $doc->getPage($pagenum)
- For INTERNAL use
Returns a dictionary for a given numbered page.
- $doc->getPageObjnum($pagenum)
- For INTERNAL use
Return the number of the PDF object in which the specified
page occurs.
- $doc->getPageText($pagenum)
- Extracts the text from a PDF page as a string.
- $doc->getPageContentTree($pagenum)
- Retrieves a parsed page content data structure, or undef if there is a
syntax error or if the page does not exist.
- $doc->getPageContent($pagenum)
- Return a string with the layout contents of one page.
- $doc->getPageDimensions($pagenum)
- Returns an array of "x",
"y",
"width" and
"height" numbers that define the
dimensions of the specified page in points (1/72 inches). Technically,
this is the "MediaBox" dimensions, which
explains why it's possible for "x" and
"y" to be non-zero, but that's a rare
case.
For example, given a simple 8.5 by 11 inch page, this method
will return "(0,0,612,792)".
This method will die() if the specified page number
does not exist.
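Since the dimensions are reported in points, a small conversion is often needed. The sketch below turns the documented (x, y, width, height) list into a size in inches. The helper name page_size_in_inches is made up for this example, and the input is the letter-size case quoted above rather than data read from a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: convert an (x, y, width, height) list in points,
# as described above, to a width and height in inches (72 points per inch).
sub page_size_in_inches {
    my ($x, $y, $width, $height) = @_;
    return ($width / 72, $height / 72);
}

# The 8.5 by 11 inch example from the text.
my ($w, $h) = page_size_in_inches(0, 0, 612, 792);
printf "%.2f x %.2f inches\n", $w, $h;    # prints "8.50 x 11.00 inches"
```

With a real document, the argument list would presumably come straight from $doc->getPageDimensions($pagenum).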
- $doc->getName($object)
- For INTERNAL use
Given a PDF object reference, return its name, if it has one.
This is useful for indirect references to images in particular.
- $doc->getPrefs()
- Return an array of security information for the document:
owner password
user password
print boolean
modify boolean
copy boolean
add boolean
See the PDF reference for the intended use of the latter four
booleans.
This module publishes the array indices of these values for
your convenience:
$CAM::PDF::PREF_OPASS
$CAM::PDF::PREF_UPASS
$CAM::PDF::PREF_PRINT
$CAM::PDF::PREF_MODIFY
$CAM::PDF::PREF_COPY
$CAM::PDF::PREF_ADD
So, you can retrieve the value of the Copy boolean via:
my ($canCopy) = ($self->getPrefs())[$CAM::PDF::PREF_COPY];
- $doc->canPrint()
- Return a boolean indicating whether the Print permission is enabled on the
PDF.
- $doc->canModify()
- Return a boolean indicating whether the Modify permission is enabled on
the PDF.
- $doc->canCopy()
- Return a boolean indicating whether the Copy permission is enabled on the
PDF.
- $doc->canAdd()
- Return a boolean indicating whether the Add permission is enabled on the
PDF.
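The four permission checks above can be combined into a short report. This is only a sketch: the helper name enabled_permissions is invented, and $pdf is assumed to be a CAM::PDF instance (any object that provides the four can*() methods would do).

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the permissions that are
# enabled on $pdf, which is assumed to be a CAM::PDF instance.
sub enabled_permissions {
    my ($pdf) = @_;
    my @names;
    push @names, 'Print'  if $pdf->canPrint();
    push @names, 'Modify' if $pdf->canModify();
    push @names, 'Copy'   if $pdf->canCopy();
    push @names, 'Add'    if $pdf->canAdd();
    return @names;
}
```

An application could print join(', ', enabled_permissions($pdf)) before deciding whether it is allowed to edit or write the document.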
- $doc->getFormFieldList()
- Return an array of the names of all of the PDF form fields. The names are
the full hierarchical names constructed as explained in the PDF reference
manual. These names are useful for the fillFormFields()
function.
- $doc->getFormField($name)
- For INTERNAL use
Return the object containing the form field definition for the
specified field name. $name can be either the
full name or the "short/alternate" name.
- $doc->getFormFieldDict($formfieldobject)
- For INTERNAL use
Return a hash reference representing the accumulated property
list for a form field, including all of its inherited properties. This
should be treated as a read-only hash! It ONLY retrieves the properties
it knows about.
- $doc->setPrefs($ownerpass, $userpass, $print?, $modify?, $copy?,
$add?)
- Alter the document's security information. Note that modifying these
parameters must be done respecting the intellectual property of the
original document. See Adobe's statement in the introduction of the
reference manual.
Important Note: Most PDF readers (Acrobat, Preview.app)
only offer one password field for opening documents. So, if the
$ownerpass and $userpass
are different, those applications cannot read the documents. (Perhaps
this is a bug in CAM::PDF?)
Note: any omitted booleans default to false. So, these two are
equivalent:
$doc->setPrefs('password', 'password');
$doc->setPrefs('password', 'password', 0, 0, 0, 0);
- $doc->setName($object, $name)
- For INTERNAL use
Change the name of a PDF object structure.
- $doc->removeName($object)
- For INTERNAL use
Delete the name of a PDF object structure.
- $doc->pageAddName($pagenum, $name, $objectnum)
- For INTERNAL use
Append a named object to the metadata for a given page.
- $doc->setPageContent($pagenum, $content)
- $doc->setPageContent($pagenum, $tree->toString)
- Replace the content of the specified page with a new version. This
function is often used after the getPageContent() function and some
manipulation of the returned string from that function.
If your content is a parsed tree (i.e. the result of
getPageContentTree) then you should serialize it via toString first.
- $doc->appendPageContent($pagenum, $content)
- Add more content to the specified page. Note that this function does NOT
do any page metadata work for you (like creating font objects for any
newly defined fonts).
- $doc->extractPages($pages...)
- Remove all pages from the PDF except the specified ones. Like
deletePages(), the pages can be multiple arguments, comma-separated
lists, ranges (open or closed).
- $doc->deletePages($pages...)
- Remove the specified pages from the PDF. The pages can be multiple
arguments, comma-separated lists, ranges (open or closed).
- $doc->deletePage($pagenum)
- Remove the specified page from the PDF. If the PDF has only one page, this
method will fail.
- $doc->decachePages($pagenum, $pagenum, ...)
- Clears cached copies of the specified page data structures. This is useful
if an operation has been performed that changes a page.
- $doc->addPageResources($pagenum, $resourcehash)
- Add the resources from the given object to the page resource dictionary.
If the page does not have a resource dictionary, create one. This function
avoids duplicating resources where feasible.
- $doc->appendPDF($pdf)
- Append pages from another PDF document to this one. No optimization is
done -- the pieces are just appended and the internal table of contents is
updated.
Note that this can break documents with annotations. See the
appendpdf.pl script for a workaround.
- $doc->prependPDF($pdf)
- Just like appendPDF() except the new document is inserted on page 1
instead of at the end.
- $doc->duplicatePage($pagenum)
- $doc->duplicatePage($pagenum, $leaveblank)
- Inserts an identical copy of the specified page into the document. The new
page's number will be "$pagenum + 1".
If $leaveblank is true, the new page
does not get any content. Thus, the document is broken until you
subsequently call setPageContent().
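The two-step use of $leaveblank might look like the sketch below. The helper name insert_page_after is invented for this example, $pdf is assumed to be a CAM::PDF instance, and $content is assumed to be a valid page content string prepared by the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: add a new page right after $pagenum and fill it.
# $pdf is assumed to be a CAM::PDF instance.
sub insert_page_after {
    my ($pdf, $pagenum, $content) = @_;
    # Duplicate the page but leave the copy without content;
    # the copy becomes page $pagenum + 1.
    $pdf->duplicatePage($pagenum, 1);
    # Supply content for the new page so the document is no longer broken.
    $pdf->setPageContent($pagenum + 1, $content);
    return $pagenum + 1;
}
```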
- $doc->createStreamObject($content)
- $doc->createStreamObject($content, $filter ...)
- For INTERNAL use
Create a new Stream object. This object is NOT added to the
document. Use the appendObject() function to do that after
calling this function.
- $doc->uninlineImages()
- $doc->uninlineImages($pagenum)
- Search the content of the specified page (or all pages if the page number
is omitted) for embedded images. If there are any, replace them with
indirect objects. This procedure uses heuristics to detect in-line images,
and is subject to confusion in extremely rare cases of text that uses
"BI" and
"ID" a lot.
- $doc->appendObject($doc, $objectnum, $recurse?)
- $doc->appendObject($undef, $object, $recurse?)
- Duplicate an object from another PDF document and add it to this document,
optionally descending into the object and copying any other objects it
references.
Like replaceObject(), the second form allows you to
append a newly-created block to the PDF.
- $doc->replaceObject($objectnum, $doc, $objectnum, $recurse?)
- $doc->replaceObject($objectnum, $undef, $object)
- Duplicate an object from another PDF document and insert it into this
document, replacing an existing object. Optionally descend into the
original object and copy any other objects it references.
If the other document is undefined, then the object to copy is
taken to be an anonymous object that is not part of any other document.
This is useful when you've just created that anonymous object.
- $doc->deleteObject($objectnum)
- Remove an object from the document. This function does NOT take care of
dependencies on this object.
- $doc->cleanse()
- Remove unused objects. WARNING: this function breaks some PDF
documents because it removes objects that are strictly part of the page
model hierarchy, but which are required anyway (like some font definition
objects).
- $doc->createID()
- For INTERNAL use
Generate a new document ID. Contrary to the Adobe recommendation,
this is a random number.
- $doc->fillFormFields($name => $value, ...)
- $doc->fillFormFields($opts_hash, $name => $value, ...)
- Set the default values of PDF form fields. The name should be the full
hierarchical name of the field as output by the getFormFieldList()
function. The argument list can be a hash if you like. A simple way to use
this function is something like this:
my %fields = (fname => 'John', lname => 'Smith', state => 'WI');
$fields{zip} = 53703;
$self->fillFormFields(%fields);
If the first argument is a hash reference, it is interpreted
as options for how to render the filled data:
- background_color => 'none' | $gray | [$r, $g, $b]
- Specify the background color for the text field.
- max_autoscale_fontsize => $size
- min_autoscale_fontsize => $size
- If the form field is set to auto-size the text to fit, then you may use
these options to constrain the limits of that autoscaling. Otherwise, for
example, a very long string will become arbitrarily small to fit in the
box.
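To show how the options hash reference and the field list fit together in one call, here is a minimal sketch. The helper name fill_form and the font size limits (6 and 12) are arbitrary choices for illustration; $pdf is assumed to be a CAM::PDF instance, and the field names would normally come from getFormFieldList().

```perl
use strict;
use warnings;

# Hypothetical helper: fill form fields with rendering options.
# $pdf is assumed to be a CAM::PDF instance; the sizes are arbitrary.
sub fill_form {
    my ($pdf, %fields) = @_;
    my %opts = (
        background_color       => 'none',
        max_autoscale_fontsize => 12,
        min_autoscale_fontsize => 6,
    );
    # The options hash reference goes first, followed by name => value pairs.
    $pdf->fillFormFields(\%opts, %fields);
    return;
}
```

For example, fill_form($pdf, fname => 'John', lname => 'Smith') would pass both fields along with the same three options.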
- $doc->clearFormFieldTriggers($name, $name, ...)
- Disable any triggers set on data entry for the specified form field names.
This is useful in the case where, for example, the data entry JavaScript
forbids punctuation and you want to prefill with a hyphenated word. If you
don't clear the trigger, the prefill may not happen.
- $doc->clearAnnotations()
- Remove all annotations from the document. If form fields are encountered,
their text is added to the appropriate page.
- $doc->previousRevision()
- If this PDF was previously saved in append mode (that is, if
"clean()" was not invoked on it), return
a new instance representing that previous version. Otherwise return void.
If this is an encrypted PDF, this method assumes that previous revisions
were encrypted with the same password, which may be an incorrect
assumption.
- $doc->allRevisions()
- Accumulate CAM::PDF instances returned by
"previousRevision" until there are no
more previous revisions. Returns a list of instances from newest to oldest
including this instance as the newest.
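One possible use of the revision list is to see how the page count changed between saves. In the sketch below, the helper name revision_page_counts is invented and $pdf is assumed to be a CAM::PDF instance; it relies only on allRevisions() and numPages() as described in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: page count of every revision, newest first.
# $pdf is assumed to be a CAM::PDF instance.
sub revision_page_counts {
    my ($pdf) = @_;
    my @counts;
    for my $revision ($pdf->allRevisions()) {
        push @counts, $revision->numPages();
    }
    return @counts;
}
```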
- $doc->preserveOrder()
- Try to recreate the original document as much as possible. This may help
in recreating documents which use undocumented tricks of saving font
information in adjacent objects.
- $doc->isLinearized()
- Returns a boolean indicating whether this PDF is linearized (aka
"optimized").
- $doc->delinearize()
- For INTERNAL use
Undo the tweaks used to make the document 'optimized'. This
function is automatically called on every save or output since this
library does not yet support linearized documents.
- $doc->clean()
- Cache all parts of the document and throw away its old structure. This is
useful for writing PDFs anew, instead of simply appending changes to the
existing documents. This is called by cleansave() and
cleanoutput().
- $doc->needsSave()
- Returns a boolean indicating whether the save() method needs to be
called. Like save(), this has nothing to do with whether the
document has been saved to disk, but whether the in-memory representation
of the document has been serialized.
- $doc->save()
- Serialize the document into a single string. All changed document elements
are normalized, and a new index and an updated trailer are created.
This function operates solely in memory. It DOES NOT write the
document to a file. See the output() function for that.
- $doc->cleansave()
- Call the clean() function, then call the save()
function.
- $doc->output($filename)
- $doc->output()
- Save the document to a file. The save() function is called first to
serialize the data structure. If no filename is specified, or if the
filename is '-', the document is written to standard output.
Note: it is the responsibility of the application to ensure
that the PDF document has either the Modify or Add permission. You can
do this like the following:
if ($self->canModify()) {
$self->output($outfile);
} else {
die "The PDF file denies permission to make modifications\n";
}
- $doc->cleanoutput($file)
- $doc->cleanoutput()
- Call the clean() function, then call the output() function
to write a fresh copy of the document to a file.
- $doc->writeObject($objnum)
- Return the serialization of the specified object.
- $doc->writeString($string)
- Return the serialization of the specified string. Works on normal or hex
strings. If encryption is desired, the string should be encrypted before
being passed here.
- $doc->writeAny($node)
- Returns the serialization of the specified node. This handles all Node
types, including object Nodes.
- $doc->traverse($dereference, $node, $callbackfunc, $callbackdata)
- Recursive traversal of a PDF data structure.
In many cases, it's useful to apply one action to every node
in an object tree. The routines below all use this traverse()
function. One of the most important parameters is the first: the
$dereference boolean. If true, the traversal
follows reference Nodes. If false, it does not descend into reference
Nodes.
Optionally, you can pass in a hashref as a final argument to
reduce redundant traversing across multiple calls. Just pass in an empty
hashref the first time and pass in the same hashref each time. See
"changeRefKeys()" for an example.
- $doc->decodeObject($objectnum)
- For INTERNAL use
Remove any filters (like compression, etc) from a data stream
indicated by the object number.
- $doc->decodeAll($object)
- For INTERNAL use
Remove any filters from any data stream in this object or any
object referenced by it.
- $doc->decodeOne($object)
- $doc->decodeOne($object, $save?)
- For INTERNAL use
Remove any filters from an object. The boolean flag
$save (defaults to false) indicates whether this
removal should be permanent or just this once. If true, the function
returns success or failure. If false, the function returns the
defiltered content.
- $doc->fixDecode($streamdata, $filter, $params)
- This is a utility method to do any tweaking after removing the filter from
a data stream.
- $doc->encodeObject($objectnum, $filter)
- Apply the specified filter to the object.
- $doc->encodeOne($object, $filter)
- Apply the specified filter to the object.
- $doc->setObjNum($object, $objectnum, $gennum)
- Descend into an object and change all of the INTERNAL object number flags
to a new number. This is just for consistency of internal accounting.
- $doc->getRefList($object)
- For INTERNAL use
Return an array of all objects referred to in this object.
- $doc->changeRefKeys($object, $hashref)
- For INTERNAL use
Renumber all references in an object.
- $doc->abbrevInlineImage($object)
- Contract all image keywords to inline abbreviations.
- $doc->unabbrevInlineImage($object)
- Expand all inline image abbreviations.
- $doc->changeString($object, $hashref)
- Alter all instances of a given string. The hashref is a dictionary of
from-string and to-string. If the from-string looks like
"regex(...)" then it is interpreted as a
Perl regular expression and is eval'ed. Otherwise the search-and-replace
is literal.
- $doc->rangeToArray($min, $max, $list...)
- Converts string lists of numbers to an array. For example,
CAM::PDF->rangeToArray(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2');
becomes
(1,3,4,5,12,9,14,15,8,7,6,1,2)
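To make the range syntax concrete, the following is an independent sketch of the expansion rules implied by the example: open-ended ranges fall back to $min or $max, and a range whose start exceeds its end counts downward. The function expand_ranges is written for this illustration only and is not the library's implementation; its handling of malformed input is a guess.

```perl
use strict;
use warnings;

# Hypothetical re-implementation for illustration; not CAM::PDF code.
sub expand_ranges {
    my ($min, $max, @lists) = @_;
    my @out;
    for my $list (@lists) {
        for my $item (split /,/, $list) {
            (my $spec = $item) =~ s/\s+//g;    # ignore embedded whitespace
            next if $spec eq q{};
            if ($spec =~ /\A(\d*)-(\d*)\z/) {
                my ($from, $to) = ($1, $2);
                $from = $min if !length($from);    # open start, e.g. '-2'
                $to   = $max if !length($to);      # open end, e.g. '14-'
                push @out, $from <= $to ? ($from .. $to) : reverse($to .. $from);
            }
            elsif ($spec =~ /\A\d+\z/) {
                push @out, $spec;                  # single page number
            }
            else {
                die "Unrecognized range '$item'\n";
            }
        }
    }
    return @out;
}

# Same arguments as the documented example.
print join(',', expand_ranges(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2')), "\n";
# prints "1,3,4,5,12,9,14,15,8,7,6,1,2"
```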
- $doc->trimstr($string)
- Used solely for debugging. Trims a string to a max of 40 characters,
handling nulls and non-Unix line endings.
- $doc->copyObject($node)
- Clones a node via Data::Dumper and eval().
- $doc->cacheObjects()
- Parses all object Nodes and stores them in the cache. This is useful for
cases where you intend to do some global manipulation and want all of the
data conveniently in RAM.
- $doc->asciify($string)
- Helper class/instance method to massage a string, cleaning up some
non-ASCII problems. Only a small, very incomplete set of cases is handled.
This library was primarily developed against the 3rd edition of the reference
(PDF v1.4) with several important updates from 4th edition (PDF v1.5). This
library focuses most deeply on PDF v1.2 features. Nonetheless, it should be
forward and backward compatible in the majority of cases.
This module is written with good speed and flexibility in mind, often at the
expense of memory consumption. Entire PDF documents are typically slurped into
RAM. As an example, simply calling
"new('PDFReference15_v15.pdf')" (the 13.5 MB
Adobe PDF Reference V1.5 document) pushes Perl to consume 89 MB of RAM on my
development machine.
There are several other PDF modules on CPAN. Below is a brief description of a
few of them. If these comments are out of date, please inform me.
- PDF::API2
- As of v0.46.003, LGPL license.
This is the leading PDF library, in my opinion.
Excellent text and font support. This is the highest level
library of the bunch, and is the most complete implementation of the
Adobe PDF spec. The author is amazingly responsive and patient.
- Text::PDF
- As of v0.25, Artistic license.
Excellent compression support (CAM::PDF cribs off this
Text::PDF feature). This has not been developed since 2003.
- PDF::Reuse
- As of v0.32, Artistic/GPL license, like Perl itself.
This library is not object oriented, so it can only process
one PDF at a time, while storing all data in global variables. I'm not
fond of it, but it's quite popular, so don't take my word for it!
CAM::PDF is the only one of these that has regression tests.
Currently, CAM::PDF has test coverage of about 50%, as reported by
"Build testcover".
Additionally, PDFLib is a commercial package not on CPAN
(www.pdflib.com). It is a C-based library with a Perl interface. It is
designed for PDF creation, not for reuse.
The data structure used to represent the PDF document is composed primarily of a
hierarchy of Node objects. Every node in the document tree has this structure:
type => <type>
value => <value>
objnum => <object number>
gennum => <generation number>
where the <value> depends on the <type>, and
<type> is one of
Type        Value
----        -----
object      Node
stream      byte string
string      byte string
hexstring   byte string
number      number
reference   integer (object number)
boolean     "true" | "false"
label       string
array       arrayref of Nodes
dictionary  hashref of (string => Node)
null        undef
All of these except "stream" are directly related to the
PDF data types of the same name. Streams are treated as special cases in
this library since they have a non-general syntax and placement in the
document body. Internally, streams are very much like strings, except that
they have filters applied to them.
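The Node layout can be pictured with plain Perl data. The sketch below builds a small tree out of ordinary hash references and collapses it into raw values, loosely in the spirit of getValue(). Everything here is an assumption made for illustration: the helpers node() and plain_value(), the dictionary keys, and the object number 4 are invented, and real Nodes come from the library's parser rather than from hand-built hashes.

```perl
use strict;
use warnings;

# Hypothetical constructor: a plain hashref standing in for a Node.
sub node {
    my ($type, $value, $objnum, $gennum) = @_;
    return { type => $type, value => $value, objnum => $objnum, gennum => $gennum };
}

# Hypothetical helper: collapse a Node tree into plain Perl data.
# Reference Nodes are returned as their object number, since resolving
# them would need a document.
sub plain_value {
    my ($node) = @_;
    my ($type, $value) = @{$node}{qw(type value)};
    return plain_value($value) if $type eq 'object';
    if ($type eq 'array') {
        my @items;
        push @items, plain_value($_) for @{$value};
        return \@items;
    }
    if ($type eq 'dictionary') {
        my %plain;
        $plain{$_} = plain_value($value->{$_}) for keys %{$value};
        return \%plain;
    }
    return $value;    # string, number, label, boolean, reference, null
}

# An invented page-like dictionary wrapped in an object Node.
my $dict = node('dictionary', {
    Type     => node('label', 'Page', 4, 0),
    MediaBox => node('array', [ map { node('number', $_, 4, 0) } 0, 0, 612, 792 ], 4, 0),
}, 4, 0);
my $plain = plain_value(node('object', $dict, 4, 0));
print "$plain->{Type}: @{ $plain->{MediaBox} }\n";    # prints "Page: 0 0 612 792"
```

Note how every nested Node carries the same objnum and gennum as its enclosing object, which is the "pointer back to the root" idea described in the text.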
All objects are referenced indirectly by their numbers, as defined
in the PDF document. In all cases, the dereference() function should
be used to deserialize objects into their internal representation. This
function is also useful for looking up named objects in the page model
metadata. Every node in the hierarchy contains its object and generation
number. You can think of this as a sort of a pointer back to the root of
each node tree. This serves in place of a "parent" link for every
node, which would be harder to maintain.
The PDF document itself is represented internally as a hash
reference with many components, including the document content, the document
metadata (index, trailer and root node), the object cache, and several other
caches, in addition to a few assorted bookkeeping structures.
The core of the document is represented in the object cache, which
is only populated as needed, thus avoiding the overhead of parsing the whole
document at read time.
Chris Dolan
This module was originally developed by me at Clotho Advanced
Media Inc. Now I maintain it in my spare time.
Thanks to all the people who have submitted bug reports over the years! I've
belatedly started crediting people in the CHANGES file. Apologies to
contributors I've overlooked...