Python Extract PDF Paper Title By Content, not By Metadata: A Step Guide – Python Tutorial

By | June 2, 2020

The title of a PDF file is important, but it is not always easy to get. In this tutorial, we will introduce a simple way to extract a PDF title from the file's content.

PDF Metadata

PDF metadata can also contain the title. However, some PDF files have no title metadata, or the value stored there is wrong, so metadata alone is not a reliable way to get the title.
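To see the problem for yourself, here is a minimal sketch that reads the title metadata with PyMuPDF and rejects values that look missing or wrong. The function names, the "untitled" check and the list of file-name endings are our own assumptions for illustration, not a complete test.

```python
def read_metadata_title(pdf_path):
    """Return the raw title string stored in the PDF metadata, or None."""
    import fitz  # PyMuPDF, imported here so the helper below works without it

    with fitz.open(pdf_path) as doc:
        return (doc.metadata or {}).get("title")


# Assumed examples of endings that suggest a file name instead of a title.
SUSPICIOUS_ENDINGS = (".doc", ".docx", ".tex", ".dvi", ".pdf")


def usable_metadata_title(raw_title):
    """Return a cleaned title, or None if the value looks missing or wrong."""
    title = (raw_title or "").strip()
    if not title or title.lower() == "untitled":
        return None
    if title.lower().endswith(SUSPICIOUS_ENDINGS):
        return None  # looks like a file name, not a paper title
    return title
```

If `usable_metadata_title(read_metadata_title("paper.pdf"))` returns None, the metadata gives us nothing to work with and the title has to come from the content.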

How to extract a PDF title from its content

In this tutorial, we focus only on PDF papers.

Please look at a paper. It contains an important feature.

the feature of the pdf title in a pdf paper

We can see that the title uses the largest font size in the whole PDF.

We can then extract the PDF title with the following steps.

1. Get text font size

To get the font size of text in a PDF file, we can first convert the PDF to HTML, which records the font size of each piece of text.

Python HTML Text From PDF with PyMuPDF – Python PDF Operation
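As a rough sketch of this step, the code below converts one page to HTML with PyMuPDF and then reads the font size of each text span with regular expressions. It assumes that every line is a `<p>` element and every span carries a `font-size` value in its `style` attribute. The function names and the regular expressions are our own, so check them against the HTML your files actually produce.

```python
import html
import re

# Assumed HTML layout: one <p> per text line, spans styled with font-size.
P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.S | re.I)
SPAN_RE = re.compile(
    r"<span\b[^>]*font-size:\s*([\d.]+)\s*(?:pt|px)[^>]*>(.*?)</span>",
    re.S | re.I,
)
TAG_RE = re.compile(r"<[^>]+>")


def page_to_html(pdf_path, page_number=0):
    """Convert one PDF page to HTML text with PyMuPDF."""
    import fitz  # PyMuPDF, imported here so the parser below works without it

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("html")


def text_with_font_size(page_html):
    """Return a list of (line_no, font_size, text) tuples from page HTML."""
    rows = []
    for line_no, paragraph in enumerate(P_RE.findall(page_html)):
        for size, inner in SPAN_RE.findall(paragraph):
            # Drop nested tags such as <b> and decode entities like &amp;
            text = html.unescape(TAG_RE.sub("", inner)).strip()
            if text:
                rows.append((line_no, float(size), text))
    return rows
```

For a paper, the first page is usually enough, so `text_with_font_size(page_to_html("paper.pdf"))` is a reasonable starting point.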

2. Extract text by font size

Once we have the font size of each piece of text, we can extract text in order of font size, from largest to smallest. This step produces some candidate titles. For candidates with the same font size, we use their line numbers to decide whether to join them.
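Here is one possible way to build the candidates. It assumes the input is a list of (line_no, font_size, text) tuples and joins text of the same font size only when the lines are next to each other. The `max_line_gap` parameter and the exact joining rule are our own assumptions, so you may need to tune them.

```python
from itertools import groupby


def candidate_titles(rows, max_line_gap=1):
    """Return (font_size, text) candidates, largest font size first.

    rows is a list of (line_no, font_size, text) tuples. Pieces with the
    same font size are joined when their line numbers are at most
    max_line_gap apart; otherwise they become separate candidates.
    """
    candidates = []
    # Largest font first, then top-to-bottom within each font size.
    ordered = sorted(rows, key=lambda row: (-row[1], row[0]))
    for size, group in groupby(ordered, key=lambda row: row[1]):
        parts, last_line = [], None
        for line_no, _, text in group:
            if parts and line_no - last_line > max_line_gap:
                # Too far from the previous line: close the current candidate.
                candidates.append((size, " ".join(parts)))
                parts = []
            parts.append(text)
            last_line = line_no
        candidates.append((size, " ".join(parts)))
    return candidates
```

Font sizes are compared exactly here. If your files contain sizes such as 17.2 and 17.21 for the same style, rounding them first may work better.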

3. Create a rule to evaluate candidate titles

We can create a rule to evaluate these candidate titles. For example, a valid title should not contain strings such as: table of content, <img, contents at a glance, etc.
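A simple version of such a rule might look like the sketch below. The blocked strings come from the list above, while the minimum length check and the function names are our own assumptions. Real files will probably need more rules.

```python
# Strings that a valid title should not contain (compared in lower case).
BLOCKED_STRINGS = ("table of content", "<img", "contents at a glance")


def is_valid_title(text, min_chars=4):
    """Return True if text passes our simple title rule.

    min_chars is an assumed threshold that drops page numbers and
    other very short fragments printed in a large font.
    """
    cleaned = text.strip()
    if len(cleaned) < min_chars:
        return False
    lowered = cleaned.lower()
    return not any(blocked in lowered for blocked in BLOCKED_STRINGS)


def pick_title(candidates):
    """Return the first valid candidate text, or None if none is valid.

    candidates is a list of (font_size, text) tuples, largest font first.
    """
    for _, text in candidates:
        if is_valid_title(text):
            return text.strip()
    return None
```

Because the candidates are ordered from the largest font size to the smallest, the first one that passes the rule is taken as the title.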

Finally, we get the PDF title. Here is an example of titles we have extracted from some PDF files.

extract pdf title from pdf content

Out of 1114 PDF files, we extracted 1099 titles correctly, an accuracy of 98.7%.
