A Beginner's Guide to Extracting Text From a PDF in Python Using PyPDF2

PyPDF2 is a Python library for processing PDF files. It can read a document's page count and title, and merge multiple pages. In this tutorial, we will show how to extract text from a PDF page. You can follow the steps below.

Install PyPDF2

pip install PyPDF2

Import library

import PyPDF2

Open a PDF file

file = r'F:\google-pdf\1664-Apress.Pro.dotNET.4.Parallel.Programming.in.CSharp.May.2010.pdf'
pdfFileObject = open(file, 'rb')
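
Before handing the file to PyPDF2, it can help to confirm that the path exists and that the file looks like a PDF. The sketch below is an optional addition and not part of PyPDF2: the helper name looks_like_pdf is hypothetical, and the check relies only on the fact that PDF files normally begin with the bytes %PDF-.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if path is an existing file that starts with the PDF signature."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    # Read only the first five bytes; a PDF normally starts with b"%PDF-".
    with candidate.open('rb') as handle:
        return handle.read(5) == b'%PDF-'
```

Calling looks_like_pdf(file) before open(file, 'rb') gives a simple yes or no answer instead of a parsing error later on.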

Get a PDF reader object

pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

Get a PDF page object

pageObject = pdfReader.getPage(0)

In this tutorial, we only get the first page of the PDF file. Page indexes start at 0, so getPage(0) returns the first page.

Extract text from the PDF page object

print(pageObject.extractText())
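
The tutorial reads only the first page, but the same calls can be repeated for every page index. The sketch below assumes a reader created with the pre-3.0 PyPDF2 API used in this tutorial, together with its getNumPages() method; the function name extract_all_pages is made up for illustration.

```python
def extract_all_pages(pdfReader):
    """Return the extracted text of every page, in page order.

    Expects a reader object from the pre-3.0 PyPDF2 API (PdfFileReader).
    """
    texts = []
    for index in range(pdfReader.getNumPages()):
        pageObject = pdfReader.getPage(index)
        texts.append(pageObject.extractText())
    return texts
```

With the pdfReader object from the earlier step, extract_all_pages(pdfReader) returns a list in which item 0 is the text of the first page. The file must still be open when the function is called.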

Close the PDF file object

pdfFileObject.close()

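You will then see the text extracted from the first page. The names used here (PdfFileReader, getPage, extractText) come from PyPDF2 releases before 3.0; version 3.0 removed them in favor of PdfReader, reader.pages and extract_text.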
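
The open, read, extract and close steps can also be combined into a single function. The sketch below is one possible arrangement rather than the tutorial's own code: it assumes PyPDF2 3.x and its renamed API, uses a with block so the file is closed automatically, and introduces a hypothetical helper name, first_page_text. PyPDF2 is imported inside the function so that the definition loads even where the package is not installed.

```python
def first_page_text(path):
    """Open a PDF, return the text of its first page, and close the file."""
    # Assumes PyPDF2 3.x, where PdfReader replaces PdfFileReader.
    from PyPDF2 import PdfReader

    # The with block closes the file even if reading or extraction fails.
    with open(path, 'rb') as pdf_file:
        reader = PdfReader(pdf_file)
        page = reader.pages[0]  # pages are indexed from 0
        return page.extract_text()
```

Called as print(first_page_text(file)), it should print the same first-page text as the step-by-step version, with no explicit close() call.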
