Experimental, use with care.
pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based. It reconstructs the original continuous text with the help of machine learning.
pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extract tables with Camelot and Tabula. It’s built upon the output of Parsr, which detects hierarchies of text and splits the text into words, lines and paragraphs.
Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., because words are hyphenated across line breaks. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, newlines and/or spaces. It uses language models to guess what the original text looked like.
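To illustrate the idea of rejoining hyphenated lines, here is a minimal sketch in Python. It is not pd3f-core's actual API or algorithm: the names `TOY_VOCAB`, `toy_score` and `join_lines` are hypothetical, and a tiny hand-made word list with invented scores stands in for a real language model. For every line that ends in a hyphen, the sketch builds two candidates (hyphen removed, hyphen kept) and keeps whichever the scoring function rates higher.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: made-up scores for a few words.
TOY_VOCAB = {
    "krankenversicherung": 5.0,
    "e-mail": 4.0,
    "email": 1.0,
}


def toy_score(word: str) -> float:
    """Score a candidate word; unknown words get 0.0."""
    return TOY_VOCAB.get(word.strip(".,;:!?").lower(), 0.0)


def join_lines(lines: List[str], score: Callable[[str], float] = toy_score) -> str:
    """Merge the lines of one paragraph into continuous text."""
    text = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not text:
            text = line
        elif text.endswith("-"):
            # Split off the last word of the text so far and the first word
            # of the new line, then compare both ways of joining them.
            head, sep, last = text.rpartition(" ")
            first, sep2, rest = line.partition(" ")
            dehyphenated = last[:-1] + first
            hyphenated = last + first
            # On a tie (e.g. both unknown), the hyphen is removed.
            if score(dehyphenated) >= score(hyphenated):
                best = dehyphenated
            else:
                best = hyphenated
            text = head + sep + best + sep2 + rest
        else:
            # No trailing hyphen: the line break becomes a single space.
            text = text + " " + line
    return text


print(join_lines(["Die Kranken-", "versicherung schickt eine E-", "Mail."]))
```

In this toy setup the first hyphen is dropped ("Krankenversicherung") while the second is kept ("E-Mail"), because the made-up scores favour those forms. A real language model would score the candidates in the context of the surrounding sentence rather than looking up isolated words.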
pd3f is especially useful for languages with long words such as German. It was mainly developed to parse German letters and official documents. Besides German, pd3f supports English, Spanish, French and Italian. More languages will be added at a later stage.