The title of a PDF file is important, but it is not easy to obtain reliably. In this tutorial, we introduce a simple way to extract a PDF's title from its content.
PDF Metadata
PDF metadata can also contain a title. However, some PDF files have no title metadata, and in others the value is wrong. For this reason, relying on metadata alone is not a good way to extract the title.
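To see why metadata is unreliable, it can help to read the metadata title and check it. The following is a minimal sketch, assuming PyMuPDF is installed (imported as `fitz`). The helper names `read_metadata_title` and `metadata_title_is_usable`, and the placeholder lists, are hypothetical and only illustrate typical bad values.

```python
# Assumed examples of metadata titles that are not real titles.
PLACEHOLDER_TITLES = {"", "untitled", "title"}
PLACEHOLDER_SUFFIXES = (".doc", ".docx", ".tex", ".dvi", ".pdf")


def read_metadata_title(pdf_path):
    """Return the title stored in the PDF metadata, or an empty string."""
    import fitz  # PyMuPDF; imported here so the checker below works without it

    with fitz.open(pdf_path) as doc:
        metadata = doc.metadata or {}
    return (metadata.get("title") or "").strip()


def metadata_title_is_usable(title):
    """Reject missing titles and values that look like file names."""
    cleaned = (title or "").strip().lower()
    if cleaned in PLACEHOLDER_TITLES:
        return False
    return not cleaned.endswith(PLACEHOLDER_SUFFIXES)
```

A value such as "Microsoft Word - draft.doc" is rejected by this check, which is one reason to fall back to the content-based method.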
How to extract the PDF title from its content
In this tutorial, we focus only on papers in PDF format.
Look at the first page of a typical paper. It has one important feature.
We will find that the title usually has the largest font size in the whole PDF.
We can therefore extract the title with the following steps.
1. Get the font size of the text
To get the font size of the text in a PDF file, we can first convert the PDF to HTML, which records the font size of each piece of text.
For details, see: Python HTML Text From PDF with PyMuPDF – Python PDF Operation
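Step 1 can be sketched as follows. This assumes PyMuPDF is installed (imported as `fitz`) and that its HTML output wraps text in `<span>` elements whose `style` attribute carries a `font-size` value. The exact markup can differ between PyMuPDF versions, so the regular expression is an assumption to check against real output. The helper names `first_page_html` and `spans_from_html` are hypothetical.

```python
import html
import re

# Assumed markup: <span style="...;font-size:17.2pt">text</span>
SPAN_RE = re.compile(
    r"<span[^>]*?font-size:\s*(\d+(?:\.\d+)?)p[tx][^>]*>(.*?)</span>",
    re.S | re.I,
)
TAG_RE = re.compile(r"<[^>]+>")


def first_page_html(pdf_path):
    """Convert the first page of a PDF to HTML text with PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the parser below works without it

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text("html")


def spans_from_html(page_html):
    """Return (font_size, text) pairs in the order they appear in the HTML."""
    spans = []
    for size, inner in SPAN_RE.findall(page_html):
        # Drop any nested tags and decode entities such as &amp;
        text = html.unescape(TAG_RE.sub("", inner)).strip()
        if text:
            spans.append((float(size), text))
    return spans
```

Calling `spans_from_html(first_page_html("paper.pdf"))` would give the font size of each piece of text on the first page, where `paper.pdf` is a placeholder path.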
2. Extract text by font size
Once we have the font size of each piece of text, we can extract text from the PDF in order of font size, from largest to smallest. This step produces a list of candidate titles. When several candidates have the same font size, we decide whether to join them based on their line numbers: for example, consecutive lines probably belong to the same title.
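Step 2 might look like the sketch below. It assumes the input is a list of `(line_number, font_size, text)` tuples, where the line number is simply the position of the line on the page. The joining rule (merge lines of the same size whose line numbers differ by at most 1) and the `max_sizes` cutoff are assumptions for illustration, not the exact rule used in this tutorial.

```python
from collections import defaultdict


def candidate_titles(lines, max_sizes=3):
    """Return candidate titles, largest font size first.

    lines: iterable of (line_number, font_size, text) tuples.
    """
    by_size = defaultdict(list)
    for line_no, size, text in lines:
        if text.strip():
            by_size[round(size, 1)].append((line_no, text.strip()))

    candidates = []
    # Visit the largest font sizes first.
    for size in sorted(by_size, reverse=True)[:max_sizes]:
        group = sorted(by_size[size])
        prev_no, first_text = group[0]
        parts = [first_text]
        for line_no, text in group[1:]:
            if line_no - prev_no <= 1:
                # Adjacent line with the same size: same candidate.
                parts.append(text)
            else:
                candidates.append(" ".join(parts))
                parts = [text]
            prev_no = line_no
        candidates.append(" ".join(parts))
    return candidates
```

A two-line title is merged into one candidate, while a heading with the same font size much later on the page stays separate.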
3. Create a rule to evaluate candidate titles
We can create a rule to evaluate these candidate titles. For example, a valid title should not contain strings such as: table of content, <img, contents at a glance, etc.
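A rule of this kind can be sketched as a simple filter. The blocked phrases come from the list above. The length limits `min_chars` and `max_chars` and the function names are additional assumptions added only for illustration.

```python
# Phrases that a valid title should not contain (from the rule above).
BLOCKED_PHRASES = ("table of content", "<img", "contents at a glance")


def is_valid_title(candidate, min_chars=4, max_chars=300):
    """Return True if the candidate passes the blocklist and length checks."""
    text = candidate.strip().lower()
    if not (min_chars <= len(text) <= max_chars):
        return False
    return not any(phrase in text for phrase in BLOCKED_PHRASES)


def pick_title(candidates):
    """Return the first valid candidate, or None if all are rejected."""
    for candidate in candidates:
        if is_valid_title(candidate):
            return candidate.strip()
    return None
```

Because the candidates are ordered from largest to smallest font size, `pick_title` returns the largest-font text that is not rejected by the rule.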
Finally, we get the PDF title. Here is an example in which we extracted titles from some PDF files.
Of 1114 PDF files, we extracted the title correctly from 1099, an accuracy of 98.7%.
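The accuracy figure can be checked with a short calculation that uses only the two counts reported above.

```python
correct, total = 1099, 1114

accuracy = correct / total
failures = total - correct

print(f"accuracy: {accuracy:.1%}")  # accuracy: 98.7%
print(f"failures: {failures}")      # failures: 15
```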