A Simple Guide to Detecting Whether a PDF File Is Corrupted or Incomplete in Python – Python Tutorial

September 19, 2019

When processing pdf files with python, we should check whether each file is complete or corrupted. In this tutorial, we will introduce a simple way to detect this. You can use the example in your own application.

Some features of complete pdf files

PDF file 1.

This pdf file ends with NUL. The last line contains many NUL characters.

The second-to-last line contains %%EOF.

There is also a %%EOF in the middle of this pdf file.

PDF file 2.

This pdf file also ends with NUL, but there is only a single NUL in the last line.

The second-to-last line also contains %%EOF.

PDF file 3.

This pdf file ends with an unknown symbol. However, the second-to-last line contains %%EOF.

PDF file 4.

This pdf file ends with %%EOF.
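
The four endings observed above can be told apart by looking at the bytes that follow the last %%EOF. Here is a minimal sketch, which is not part of the function shown later in this tutorial: the name classify_pdf_tail, the tail_size parameter and the returned labels are assumptions chosen for illustration.

```python
import os

def classify_pdf_tail(path, tail_size=1024):
    """Describe how a file ends, based on its last `tail_size` bytes.

    Hypothetical helper: the name and the labels are illustrative only.
    """
    size = os.path.getsize(path)
    with open(path, 'rb') as fin:
        # Seek back from the end, but never before the start of the file.
        fin.seek(max(0, size - tail_size))
        tail = fin.read()
    if b'%%EOF' not in tail:
        return 'no %%EOF'
    # Everything that follows the last %%EOF marker in the tail.
    after = tail.rsplit(b'%%EOF', 1)[1]
    if after.strip(b'\r\n') == b'':
        return 'ends with %%EOF'
    if after.strip(b'\r\n\x00') == b'':
        return '%%EOF followed by NUL'
    return '%%EOF followed by other bytes'
```

PDF files 1 and 2 would fall under '%%EOF followed by NUL' as long as the marker sits within the last tail_size bytes, PDF file 3 under '%%EOF followed by other bytes', and PDF file 4 under 'ends with %%EOF'.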

Next, check the start of a pdf file

PDF file 5.

This pdf file starts with: %PDF
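
To see where the %PDF marker sits in a given file, we can search the first bytes directly. This is only a sketch: the function name find_pdf_header and the head_size parameter are assumptions, and the 1024-byte window is chosen to match the function shown later in this tutorial.

```python
def find_pdf_header(path, head_size=1024):
    """Return the offset of %PDF within the first `head_size` bytes, or -1.

    Hypothetical helper: the name and default window size are illustrative.
    """
    with open(path, 'rb') as fin:
        head = fin.read(head_size)
    # bytes.find returns -1 when the marker is absent.
    return head.find(b'%PDF')
```

A result of 0 means the file begins with %PDF, a positive value means some bytes come before the marker, and -1 means the marker was not found in the window.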

So the features of a complete pdf file are:

1. The pdf file ends with %%EOF or NUL.

2. The pdf file contains one or more %%EOF markers.

3. The start of the pdf file contains %PDF.

We can create a python function to detect whether a pdf file is complete or not.

import os

def isFullPdf(f):
    # Files smaller than 1 KB are treated as incomplete.
    size = os.path.getsize(f)
    if size < 1024:
        return False
    with open(f, 'rb') as fin:
        # Read the first 1024 bytes to look for %PDF.
        start_content = fin.read(1024)
        # Read the last 1024 bytes to look for %%EOF or NUL.
        fin.seek(-1024, 2)
        end_content = fin.read()
    # The start of the file must contain %PDF.
    if b'%PDF' not in start_content:
        return False
    # The end of the file must contain %%EOF ...
    if b'%%EOF' in end_content:
        return True
    # ... or the file must end with a NUL byte.
    if end_content.endswith(b'\x00'):
        return True
    return False

I have tested this function on more than 1,000 pdf files, and it works well.
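
To run a check like this over many files, one option is a small loop over a folder. The sketch below is an assumption about how you might organize it: find_incomplete_pdfs is a hypothetical name, and the checker argument is any function that takes a path and returns True for a complete file, such as isFullPdf.

```python
import os

def find_incomplete_pdfs(folder, checker):
    """Return paths of .pdf files in `folder` for which `checker` is False.

    Hypothetical helper: `checker` is any callable taking a file path and
    returning True when the file looks complete.
    """
    incomplete = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # Skip sub-folders and files without a .pdf extension.
        if not name.lower().endswith('.pdf') or not os.path.isfile(path):
            continue
        if not checker(path):
            incomplete.append(path)
    return incomplete

# Example call (the folder name is hypothetical):
# find_incomplete_pdfs('downloads', isFullPdf)
```

Passing the checker as an argument keeps the loop independent of the detection rules, so the rules can be changed without touching the scan.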
