A Simple Guide to Python Detect PDF File is Corrupted or Incompleted

When we are processing pdf files with python, we should check a pdf is completed or corrupted. In this tutorial, we will introduce you a simple way to how to detect. You can use this tutorial example in your application.

Some features of completed pdf files

PPF file 1.

The pdf file ends with NUL. Meanwhile, there are many NUL in last line.

The last second line contains: %%EOF

At the middle of this pdf file, there are also a %%EOF.

PDF file 2.

This pdf file ends with NUL, there are only a NUL in the last line.

The last second line also contains a %%EOF.

PDF file 3.

The pdf file ends with unknown symbol. However, the last second line contains a %%EOF.

PDF file 4.

This pdf file ends with %%EOF.

Then check the start of pdf

PDF file 5.

This pdf start with: %PDF

So as to a completed pdf, the feature of it is:

1.The pdf file ends with %%EOF or NUL.

2.This file contain more than one %%EOF symbol.

3. The content of pdf file contains %PDF.

We can create a python function to detect a pdf file is completed or not.

def isFullPdf(f):
    end_content = ''
    start_content = ''
    size = os.path.getsize(f)
    if size < 1024: return False 
    with open(f, 'rb') as fin: 
        #start content 
        fin.seek(0, 0)
        start_content = fin.read(1024)
        start_content = start_content.decode("ascii", 'ignore' )
        fin.seek(-1024, 2)
        end_content = fin.read()
        end_content = end_content.decode("ascii", 'ignore' )
    start_flag = False
    #%PDF
    if start_content.count('%PDF') > 0:
        start_flag = True
    
        
    if end_content.count('%%EOF') and start_flag > 0:
        return True
    eof = bytes([0])
    eof = eof.decode("ascii")
    if end_content.endswith(eof) and start_flag:
        return True
    return False

I have test this function on more than 1,000 pdf files, it works well.

A Simple Guide to Python Detect PDF File is Corrupted or Incompleted – Python Tutorial

Some features of completed pdf files

Leave a Reply Cancel reply