Getting data from a pdf using Python


In this series on handling different types of documents we have already looked at, among others, structured files in YAML format. But what about one of the most common document formats, namely pdf? How can we grab the data from these semi-structured files in an easy way? If you search Google for Python and pdf you will find many candidate libraries for handling these files. In this short article we will see how to use one of them: PyPDF2.

What is PyPDF2?

PyPDF2 is a widely used Python library for reading and manipulating pdf files. It comes from the pyPdf project and is currently maintained by Phaseit, Inc. It lets you extract data from pdf files, or manipulate existing pdfs to produce a new pdf file (concatenation, page filtering, etc.). PyPDF2 is compatible with Python versions 2.6, 2.7 and 3.2 – 3.5.

NB: I used it with Python 3.7 and did not run into any particular problems.

Installation is simple via the pip utility:

pip install pypdf2

Open pdf file

Using the library is pretty simple. It is based on two main objects: one for reading and one for writing pdf files.

from PyPDF2 import PdfFileReader
from PyPDF2 import PdfFileWriter

Then, to read a pdf file, nothing could be simpler: just open it like any file in Python and pass it to the PdfFileReader object, as below.

In the context of our example, we will read a very simple file:

[Image: PDF content]

myFile = "test.pdf"
document = PdfFileReader(open(myFile, 'rb'))

The file is now open, let’s take a look at the metadata:

metadata = document.getDocumentInfo()
print(metadata)
{'/Author': 'Benoit Cayla',
 '/Creator': 'Microsoft® Word pour Office\xa0365',
 '/CreationDate': "D:20201110115707+01'00'",
 '/ModDate': "D:20201110115707+01'00'",
 '/Producer': 'Microsoft® Word pour Office\xa0365'}
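
The '/CreationDate' and '/ModDate' values above are plain strings in the pdf date format. Here is a minimal sketch, using only the standard library, of how such a string could be turned into a Python datetime. The helper name parse_pdf_date is hypothetical (it is not part of PyPDF2), and the pattern only covers fully specified dates like the one shown above.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for dates such as "D:20201110115707+01'00'": year, month, day,
# hour, minute, second, then an optional UTC offset or a trailing "Z".
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'?(\d{2})'?|Z)?$"
)


def parse_pdf_date(raw):
    """Return a datetime for a fully specified pdf date string, else None."""
    match = _PDF_DATE.match(raw)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_hours, off_minutes = match.groups()[6:]
    if sign is None:
        # No numeric offset: "Z" means UTC, otherwise the result is naive.
        tz = timezone.utc if raw.endswith("Z") else None
    else:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20201110115707+01'00'")
print(created.isoformat())  # 2020-11-10T11:57:07+01:00
```

With the real library you would pass metadata['/CreationDate'] instead of the literal string.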

We can also read the metadata values directly, like this:

author = metadata.author if metadata.author else u'Unknown'
title = metadata.title if metadata.title else myFile
subject = metadata.subject if metadata.subject else "No Subject"
print(author + "|" + title + "|" + subject)
Benoit Cayla|test.pdf|No Subject
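
The same fallback logic can be wrapped in a small function so that it is easy to reuse and to try out without a pdf file. This is only a sketch: describe is a hypothetical helper, and the SimpleNamespace object stands in for what getDocumentInfo() returns. The helper also accepts None, on the assumption that a file without an information dictionary gives no metadata object at all.

```python
from types import SimpleNamespace


def describe(metadata, filename):
    """Build an "author|title|subject" line, with a fallback for each missing value."""
    author = getattr(metadata, "author", None) or "Unknown"
    title = getattr(metadata, "title", None) or filename
    subject = getattr(metadata, "subject", None) or "No Subject"
    return "|".join([author, title, subject])


# Stand-in for the object returned by getDocumentInfo(), with only the author set.
fake_metadata = SimpleNamespace(author="Benoit Cayla", title=None, subject=None)
print(describe(fake_metadata, "test.pdf"))  # Benoit Cayla|test.pdf|No Subject
```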

Other methods allow you to retrieve important information such as the number of pages:

print(document.getNumPages())
1

Retrieving interactive fields:

fields = document.getFields()
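
To get something simpler to work with than the raw field objects, one option is to keep only the value of each field. The sketch below assumes that each field behaves like a dictionary whose '/V' entry holds the value (that is the key the pdf format uses for a field value), and that getFields() may return None when the document has no form. field_values is a hypothetical helper, and the sample data is made up.

```python
def field_values(fields):
    """Map each field name to its '/V' entry (the field value), or None if unset."""
    if not fields:
        return {}
    return {name: field.get("/V") for name, field in fields.items()}


# Plain dictionaries standing in for the objects returned by getFields().
sample = {"name": {"/FT": "/Tx", "/V": "Benoit"}, "agree": {"/FT": "/Btn"}}
print(field_values(sample))  # {'name': 'Benoit', 'agree': None}
print(field_values(None))    # {}
```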

Two other methods are worth knowing:

  • The display type with getPageLayout()
  • The display mode with getPageMode()

Do not hesitate to consult the PyPDF2 documentation.

Now, read the text content

We read the content, of course, after opening the pdf file with PdfFileReader. Pay attention to pagination: the text is read page by page, like this:

pdftext = ""
for page in range(document.numPages):
    pageObj = document.getPage(page)
    pdftext += pageObj.extractText().replace('\n','')
print(pdftext)
Bonjour ceci est un test !  Benoit Cayla 

You will notice in the last line of the loop that I remove the newline characters (\n), because the extraction returns them as is.
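
One thing to watch out for: deleting the newlines outright can glue together the words on either side of a line break. A possible alternative, sketched below with the standard library only, is to collapse every run of whitespace into a single space. The sample string is made up for the illustration, and normalize_whitespace is a hypothetical helper.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines included) into one space."""
    return " ".join(text.split())


# Hypothetical extracted text with line breaks in the middle of sentences.
raw = "Bonjour ceci\nest un test !\n\nBenoit Cayla\n"
print(raw.replace("\n", ""))      # Bonjour ceciest un test !Benoit Cayla
print(normalize_whitespace(raw))  # Bonjour ceci est un test ! Benoit Cayla
```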

You can also display the retrieved page object in detail (as a hierarchy). This is not of much use as is, but it shows you that all the content and metadata have been recovered.

print(pageObj)
{'/Type': '/Page',
 '/Parent': {'/Type': '/Pages',
  '/Count': 3,
  '/Kids': [IndirectObject(3, 0), IndirectObject(4, 0), IndirectObject(5, 0)]},
 '/Resources': {'/Font': {'/F1': {'/Type': '/Font',
    '/Subtype': '/TrueType',
    '/Name': '/F1',
    '/BaseFont': '/BCDEEE+Calibri',
    '/Encoding': '/WinAnsiEncoding',
    '/FontDescriptor': {'/Type': '/FontDescriptor',
     '/FontName': '/BCDEEE+Calibri',
     '/Flags': 32,
     '/ItalicAngle': 0,
     '/Ascent': 750,
     '/Descent': -250,
     '/CapHeight': 750,
     '/AvgWidth': 521,
     '/MaxWidth': 1743,
     '/FontWeight': 400,
     '/XHeight': 250,
     '/StemV': 52,
     '/FontBBox': [-503, -250, 1240, 750],
     '/FontFile2': {'/Filter': '/FlateDecode', '/Length1': 93840}},
    '/FirstChar': 32,
    '/LastChar': 117,
    '/Widths': [226,
     0,
     ...
     0,
     525]}},
  '/ExtGState': {'/GS7': {'/Type': '/ExtGState', '/BM': '/Normal', '/ca': 1},
   '/GS8': {'/Type': '/ExtGState', '/BM': '/Normal', '/CA': 1}},
  '/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI']},
 '/MediaBox': [0, 0, 595.2, 841.92],
 '/Contents': {'/Filter': '/FlateDecode'},
 '/Group': {'/Type': '/Group', '/S': '/Transparency', '/CS': '/DeviceRGB'},
 '/Tabs': '/S',
 '/StructParents': 0}

Now let’s create a new pdf file

In this example we are going to concatenate 3 pdf files into one. Here are the 3 pdf files that we will first open:

pdflist = ["test.pdf" , "test2.pdf", "test3.pdf"]

We then create a new pdf file with PdfFileWriter, opening each source file in turn:

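pdfWriter = PdfFileWriter()
openedFiles = []
for filename in pdflist:
    # Keep each source file open until the output has been written
    pdfFileObj = open(filename, 'rb')
    openedFiles.append(pdfFileObj)
    pdfReader = PdfFileReader(pdfFileObj)
    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# Write the merged document, then release the source files
with open('final.pdf', 'wb') as pdfOutput:
    pdfWriter.write(pdfOutput)

for pdfFileObj in openedFiles:
    pdfFileObj.close()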

We simply add the pages of each existing pdf to the new pdf.
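
The same loop also covers the page filtering mentioned earlier: it is enough to decide, page by page, whether to call addPage. Here is a minimal sketch of that idea. copy_pages is a hypothetical helper that relies only on the numPages, getPage and addPage names used above, and the two Fake classes are stand-ins so that the example runs without any pdf file.

```python
def copy_pages(reader, writer, keep=lambda index: True):
    """Copy the pages of reader whose index passes keep() into writer."""
    copied = 0
    for index in range(reader.numPages):
        if keep(index):
            writer.addPage(reader.getPage(index))
            copied += 1
    return copied


# Minimal stand-ins for PdfFileReader and PdfFileWriter.
class FakeReader:
    def __init__(self, pages):
        self._pages = pages
        self.numPages = len(pages)

    def getPage(self, index):
        return self._pages[index]


class FakeWriter:
    def __init__(self):
        self.pages = []

    def addPage(self, page):
        self.pages.append(page)


writer = FakeWriter()
# Keep only the pages with an even index (the first, the third, and so on).
count = copy_pages(FakeReader(["p0", "p1", "p2"]), writer, keep=lambda i: i % 2 == 0)
print(count, writer.pages)  # 2 ['p0', 'p2']
```

With real files, the reader and writer would be the PdfFileReader and PdfFileWriter objects from the example above. Note that more recent releases of the library have renamed these classes and methods, so the names may need adapting to the version you have installed.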

Conclusion

We have seen in this article how easily we can read text data from a pdf file. Just be careful: we assumed that this data was “real” text and not (scanned) images of text placed in a pdf. In that case, it would be necessary to extract the image from the pdf and then use an OCR engine such as Tesseract. For that, I suggest you read my article on the subject.
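
A simple way to spot that situation is to check which pages gave back no text at all. The sketch below is only a heuristic, under the assumption that a scanned page yields an empty (or nearly empty) string. pages_without_text and its min_chars parameter are hypothetical, and the sample list stands in for the strings collected page by page with extractText().

```python
def pages_without_text(page_texts, min_chars=1):
    """Return the indexes of pages whose extracted text is shorter than min_chars."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# Hypothetical extraction results: the second page yielded only whitespace.
texts = ["Bonjour ceci est un test !", "  \n", "Benoit Cayla"]
print(pages_without_text(texts))  # [1]
```

The pages flagged this way are the candidates to send to the OCR step.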

As usual you will find the sources for this article on GitHub.


Benoit Cayla

