CAM::PDF::PageText - Extract text from PDF page tree
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract text from a PDF page in reading order. The
process is not robust, because a PDF positions text graphically and may store
it in arbitrary order. The module uses a few heuristics to guess which pieces
of text belong next to each other, but it is easily fooled by subscripts,
non-horizontal text, font changes, form fields, etc.
All those disclaimers aside, it is useful for a quick dump of text
from a simple PDF file.
- $pkg->render($pagetree)
- $pkg->render($pagetree, $verbose)
- Turn a page content tree into a string. This is a class method that should
be called like:
CAM::PDF::PageText->render($pagetree);
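
As a rough illustration of the class-method call above, the following sketch dumps the text of every page in a file. The helper name extract_all_text is hypothetical and not part of this module; the sketch assumes CAM::PDF's numPages() and getPageContentTree() methods and the $CAM::PDF::errstr variable, and it skips any page whose content tree cannot be built.

```perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Hypothetical helper (not part of CAM::PDF::PageText): return the
# extracted text of every page in $filename, joined in page order.
sub extract_all_text {
    my ($filename) = @_;

    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename: $CAM::PDF::errstr\n";

    my @pages;
    for my $pagenum (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($pagenum);
        next if !$tree;    # skip pages whose content could not be parsed

        # render() is a class method, so call it on the package name.
        push @pages, CAM::PDF::PageText->render($tree);
    }
    return join "\n", @pages;
}

# Print the text of the file named on the command line, if any.
print extract_all_text($ARGV[0]) if @ARGV;
```

Because the output depends on the layout heuristics described earlier, treat it as a quick dump rather than a faithful reconstruction of the page.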