Add bookmarks to a PDF file in batch
On Linux, the tool JPdfBookmarks can be used to add a collection of bookmarks to a PDF book in batch by loading them from a text file that stores the table of contents and page numbers. The steps below summarize how I prepare this text file by hand.
- Take snapshots of the content pages and save them as images using ImageMagick.

      convert -density 300 "book.pdf[${page_no}]" content_page.png

  Here ${page_no} should be replaced with the numeric page number of the content page. Note that ImageMagick counts PDF pages from 0, so book.pdf[0] is the first page. The quotes keep the shell from treating the square brackets as a glob pattern.
- Extract text from the image of the content page using the OCR tool Tesseract.

      tesseract content_page.png stdout -l eng >> bookmark_text.txt

  The option stdout tells Tesseract to write the extracted text to standard output, and -l eng specifies the OCR language as English. To check the list of languages supported by Tesseract, execute tesseract --list-langs.

      $ tesseract --list-langs
      List of available languages (8):
      chi_sim
      chi_sim_vert
      chi_tra
      chi_tra_vert
      deu
      eng
      fra
      osd
- Open the generated text file bookmark_text.txt and clean up and reorganize it manually if needed. I suggest doing this work in Emacs, where the command regexp-builder helps construct and verify the regular expressions used for efficient text matching and replacement.
- Open the target PDF file in JPdfBookmarks and load the text file bookmark_text.txt by clicking the menu item Load in the Tools menu.
- Check the validity of the PDF page targets associated with those bookmark items and finally save the PDF file.
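
When the table of contents spans several pages, the snapshot and OCR steps can be wrapped in a loop. The following is only a minimal sketch: the function name ocr_toc_pages and its arguments are my own invention for illustration, and it assumes that convert and tesseract are on the PATH and that the page range is given as zero-based indices, as ImageMagick expects.

```shell
#!/bin/sh
# Hypothetical helper (not part of ImageMagick or Tesseract):
# OCR a range of content pages into one text file.
# Usage: ocr_toc_pages PDF FIRST LAST OUT
#   FIRST and LAST are zero-based page indices.
ocr_toc_pages() {
    pdf=$1
    first=$2
    last=$3
    out=$4
    : > "$out"                      # start with an empty output file
    page=$first
    while [ "$page" -le "$last" ]; do
        img="content_page_${page}.png"
        # Render one PDF page to a 300 dpi PNG.
        convert -density 300 "${pdf}[${page}]" "$img" || return 1
        # Append the recognized English text to the output file.
        tesseract "$img" stdout -l eng >> "$out" || return 1
        page=$((page + 1))
    done
}
```

For example, `ocr_toc_pages book.pdf 4 6 bookmark_text.txt` would process the fifth to seventh pages of book.pdf. The intermediate PNG files are kept so that a badly recognized page can be inspected afterwards.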
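
Part of the manual cleaning step can also be scripted. The sketch below rewrites typical OCR lines such as "1 Introduction ........ 3" into "1 Introduction/3". It assumes the plain "Title/page" line layout that, as far as I know, JPdfBookmarks reads by default; it is worth checking this against a file dumped by the tool itself before relying on it. The function name toc_to_bookmarks and the optional offset argument (the difference between the PDF page number and the printed page number) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn OCR'd table-of-contents lines into
# "Title/page" lines. Reads standard input, writes standard output.
# Usage: toc_to_bookmarks [OFFSET] < bookmark_text.txt
#   OFFSET (default 0) is added to every page number.
toc_to_bookmarks() {
    offset=${1:-0}
    awk -v off="$offset" '
        # Drop blank lines left over from OCR.
        /^[ \t]*$/ { next }
        # A run of dots/spaces followed by a trailing number marks the page.
        match($0, /[ \t.]+[0-9]+[ \t]*$/) {
            title = substr($0, 1, RSTART - 1)
            page = substr($0, RSTART)
            gsub(/[^0-9]/, "", page)      # keep only the digits
            print title "/" (page + off)
            next
        }
        # Lines without a page number are passed through for manual review.
        { print }
    '
}
```

For a book whose printed page 1 is the 11th page of the PDF, `toc_to_bookmarks 10 < bookmark_text.txt > bookmarks_clean.txt` would shift every target by 10. The result still needs a manual pass, since OCR errors and nested headings are not handled here.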