Converting scanned PDF files to Markdown and EPUB
There are many tools for converting PDFs and other documents, and many more for getting your documents ready for generative AI. In an earlier story I introduced Docling, which is one of the best. Recently I found a new open-source project that focuses on converting scanned PDF files to Markdown and EPUB, formats that suit many models. It is called PDF Craft.
PDF Craft mainly uses DocLayout-YOLO (a real-time layout detection model) to extract the text from book pages and filter out elements such as headers, footers, footnotes, and page numbers. It also uses OnnxOCR (an optical character recognition system) for text recognition.
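To make the filtering idea concrete, here is a minimal sketch of what "detect layout regions, drop the noise, keep the body" can look like. It is not PDF Craft's code: the `Region` class, the label names in `NOISE_KINDS`, and the `body_text` helper are all hypothetical names invented for this illustration, and a real layout model defines its own label set.

```python
from dataclasses import dataclass

# Hypothetical region labels; a real layout model defines its own label set.
NOISE_KINDS = {"header", "footer", "footnote", "page_number"}


@dataclass
class Region:
    kind: str   # label assigned by the layout detector
    top: float  # vertical position on the page, smaller means higher up
    text: str   # text recognized by OCR inside this region


def body_text(regions: list[Region]) -> str:
    """Drop noise regions and join the remaining ones in top-to-bottom order."""
    kept = [r for r in regions if r.kind not in NOISE_KINDS]
    kept.sort(key=lambda r: r.top)
    return "\n\n".join(r.text for r in kept)


# A made-up page, listed out of reading order on purpose.
page = [
    Region("page_number", 0.97, "42"),
    Region("text", 0.30, "Second paragraph."),
    Region("header", 0.02, "Chapter 3"),
    Region("title", 0.10, "A Heading"),
    Region("text", 0.20, "First paragraph."),
]

print(body_text(page))
```

The header and the page number are dropped, and the remaining regions come out in reading order, which is roughly the shape of text you want to hand to a model.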
Requirements
You need Python 3.10 or above (3.10.16 is recommended). Then install the packages:
pip install pdf-craft
pip install onnxruntime==1.21.0
# if using CUDA, install the GPU build instead of the CPU one
pip install onnxruntime-gpu==1.21.0
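Before running anything, it can help to confirm that the interpreter is new enough and to see which ONNX Runtime build is active. The sketch below is my own helper, not part of PDF Craft: `python_ok`, `available_providers`, and `pick_device` are hypothetical names. The only library call it relies on is `onnxruntime.get_available_providers()`, and it falls back to an empty list when ONNX Runtime is not installed.

```python
import sys


def python_ok(version: tuple = tuple(sys.version_info[:3])) -> bool:
    """Return True when the interpreter is Python 3.10 or newer."""
    return version >= (3, 10)


def available_providers() -> list[str]:
    """List ONNX Runtime execution providers, or [] if it is not installed."""
    try:
        import onnxruntime
    except ImportError:
        return []
    return onnxruntime.get_available_providers()


def pick_device(providers: list[str]) -> str:
    """Suggest a device string based on the reported providers."""
    return "cuda" if "CUDAExecutionProvider" in providers else "cpu"


providers = available_providers()
print("Python version OK:", python_ok())
print("ONNX Runtime providers:", providers)
print("Suggested device:", pick_device(providers))
```

The suggested value is the kind of string you would pass as `device`. Note that a listed CUDA provider only means the installed build supports it; a missing or mismatched CUDA installation can still fail at run time.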
Here is an example of converting a PDF to Markdown:
from pdf_craft import PDFPageExtractor, MarkDownWriter
extractor = PDFPageExtractor(
    device="cpu",  # set device="cuda" to run on CUDA
    model_dir_path="/path/to/model/dir/path",  # folder where the AI models are downloaded and stored
)
with MarkDownWriter(markdown_path, "images", "utf-8") as md…