Python can split a large PDF file into several smaller ones, and it can also merge several small PDF files into a single one. In this tutorial, we will introduce how to split and merge PDF files using the Python PyMuPDF library.
Preliminary
You should install the Python PyMuPDF library first.
pip install pymupdf
Open a source pdf file
To split or merge a PDF file, you should open a source PDF first. To open a PDF file with PyMuPDF, we can do this:
import sys
import fitz  # PyMuPDF

file = '231420-digitalimageforensics.pdf'
try:
    doc = fitz.open(file)
except Exception as e:
    print(e)
    sys.exit(1)  # stop here: doc is undefined if the file could not be opened

page_count = doc.pageCount  # named page_count in newer PyMuPDF releases
print(page_count)
Run this code and you will find that the source document (231420-digitalimageforensics.pdf) has 199 pages in total.
Next, we can copy some pages from the source PDF into a new PDF.
To split or merge PDF files in PyMuPDF, we can use the Document.insertPDF() function (named insert_pdf() in newer PyMuPDF releases).
insertPDF(docsrc, from_page=-1, to_page=-1, start_at=-1, rotate=-1, links=True, annots=True)
This function selects some pages from docsrc and inserts them into the PDF document it is called on.
The index of pages in a pdf document
In PyMuPDF, page indexes start at 0, which means valid page indexes lie in [0, total_page - 1].
This is very important if you plan to select specific pages from a source PDF file.
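Because people and PDF viewers usually count pages from 1, an off-by-one slip is easy to make here. The following is a minimal sketch of a conversion helper; the name page_number_to_index is hypothetical and is not part of PyMuPDF.

```python
def page_number_to_index(page_number, total_pages):
    """Convert a 1-based page number to a 0-based page index."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(
            f"page number {page_number} is outside 1..{total_pages}"
        )
    return page_number - 1


# With a 199-page document such as the one opened earlier:
print(page_number_to_index(1, 199))    # first page -> index 0
print(page_number_to_index(199, 199))  # last page  -> index 198
```

Validating the page number before subtracting keeps a bad input such as 0 from silently turning into index -1.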
Important parameters explained
docsrc: the source PDF document, from which we select the pages in [from_page, to_page].
For example, [from_page = 3, to_page = 5] means we will select 3 pages (page 4, page 5 and page 6) from the source PDF.
from_page: int, the index of the first page to select from docsrc.
to_page: int, the index of the last page to select from docsrc. Note that the page at this index is also selected.
start_at: int, this parameter determines where the pages from docsrc are inserted in the destination PDF.
For example, start_at = 1 means the pages from docsrc are inserted between page index 0 and page index 1 of the destination PDF file.
Meanwhile, start_at should be smaller than the total number of pages in the destination PDF file.
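To make the interplay of from_page, to_page and start_at easier to see, here is a small model that uses plain Python lists instead of real PDF files. It is only an illustration of the behaviour described above, not PyMuPDF's implementation; the function name model_insert is hypothetical, and the model covers only explicit from_page and to_page values. It also assumes the documented default start_at = -1, which appends the pages to the end.

```python
def model_insert(dest_pages, src_pages, from_page, to_page, start_at=-1):
    """Model the page selection and placement with plain Python lists."""
    selected = src_pages[from_page:to_page + 1]  # to_page is inclusive
    if start_at < 0:
        start_at = len(dest_pages)  # assumed default: append at the end
    return dest_pages[:start_at] + selected + dest_pages[start_at:]


src = [f"S{i}" for i in range(8)]  # stand-in for an 8-page source
dest = ["D0", "D1", "D2"]          # stand-in for a 3-page destination

print(model_insert(dest, src, from_page=3, to_page=5, start_at=1))
# ['D0', 'S3', 'S4', 'S5', 'D1', 'D2']
```

The three selected pages land between the destination's index 0 and index 1, and the pages that followed are pushed back.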
For example:
doc2 = fitz.open("new-doc-1.pdf")
doc2.insertPDF(doc, from_page=3, to_page=5, start_at=1)
doc2.save("new-doc-4.pdf")
This code selects 3 pages (page indexes 3 to 5) from 231420-digitalimageforensics.pdf, inserts them after the first page of new-doc-1.pdf, and saves the result as a new PDF document, new-doc-4.pdf.
In this way, the same function can both split a PDF file and merge two PDF files into a new one.
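Building on the same function, the sketch below splits a PDF into fixed-size chunks and merges several PDFs into one. It assumes a recent PyMuPDF release, where the method is spelled insert_pdf and the page count attribute is page_count. The helper names chunk_ranges, split_pdf and merge_pdfs, as well as the "part" file name prefix, are hypothetical and chosen only for this example.

```python
try:
    import fitz  # PyMuPDF
except ImportError:  # lets the range helper run without PyMuPDF installed
    fitz = None


def chunk_ranges(total_pages, chunk_size):
    """Return inclusive (from_page, to_page) index pairs covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src_path, chunk_size, prefix="part"):
    """Write each chunk of src_path to its own file; return the file names."""
    src = fitz.open(src_path)
    names = []
    ranges = chunk_ranges(src.page_count, chunk_size)
    for n, (first, last) in enumerate(ranges, 1):
        part = fitz.open()  # new, empty PDF
        part.insert_pdf(src, from_page=first, to_page=last)
        name = f"{prefix}-{n}.pdf"
        part.save(name)
        part.close()
        names.append(name)
    src.close()
    return names


def merge_pdfs(paths, out_path):
    """Append every page of each file in paths, in order, to a new PDF."""
    merged = fitz.open()  # new, empty PDF
    for path in paths:
        src = fitz.open(path)
        merged.insert_pdf(src)  # defaults copy all pages and append them
        src.close()
    merged.save(out_path)
    merged.close()


print(chunk_ranges(199, 50))
# [(0, 49), (50, 99), (100, 149), (150, 198)]
```

With these helpers, split_pdf("231420-digitalimageforensics.pdf", 50) would be expected to write part-1.pdf through part-4.pdf, and passing the returned list to merge_pdfs(names, "merged.pdf") should rebuild a single document with the pages in their original order.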