Converting scanned PDF files to Markdown and EPUB

2 min read · Apr 14, 2025

There are many tools for converting PDFs and other documents, and many more for getting documents ready for generative AI. In an earlier story I introduced Docling, one of the best. Recently I found a new open-source project that focuses on converting scanned PDF files to Markdown and EPUB, formats that suit many models. It is called PDF Craft.

PDF Craft mainly uses DocLayout-YOLO (a real-time layout detection model) to extract the text from book pages and filter out elements such as headers, footers, footnotes, and page numbers. It also uses OnnxOCR (an optical character recognition system) for text recognition.

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
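To make the filtering idea concrete, here is a minimal sketch of how detected page regions could be filtered before their text is joined. It is only an illustration, not PDF Craft's actual code: the `Region` class, the label names, and `keep_body_regions` are all hypothetical, and the real DocLayout-YOLO class names may differ.

```python
from dataclasses import dataclass

# Hypothetical label names; the real layout model's classes may differ.
DROPPED_LABELS = {"header", "footer", "footnote", "page_number"}


@dataclass
class Region:
    label: str  # layout class assigned by a layout detection step
    top: float  # vertical position on the page (0 = top, 1 = bottom)
    text: str   # text that an OCR step would have produced for this region


def keep_body_regions(regions):
    """Drop non-body regions and return the rest in top-to-bottom order."""
    body = [r for r in regions if r.label not in DROPPED_LABELS]
    return sorted(body, key=lambda r: r.top)


# A made-up page with a header, a footer, and two body regions.
page = [
    Region("footer", 0.97, "Page 12"),
    Region("title", 0.10, "Chapter 1"),
    Region("header", 0.02, "My Book"),
    Region("plain_text", 0.20, "It was a quiet morning."),
]

body_text = "\n\n".join(r.text for r in keep_body_regions(page))
print(body_text)
```

Only the title and the paragraph survive, which is roughly what you want when feeding book text to a model.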

Requirements

You need Python 3.10 or above (3.10.16 is recommended). Then install the packages:

pip install pdf-craft
pip install onnxruntime==1.21.0

# if using CUDA, install the GPU build instead
pip install onnxruntime-gpu==1.21.0
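If you are not sure which build ended up in your environment, a small check like the one below can help. This is a sketch under my own assumptions: `pick_device` is a helper name I made up, and it simply maps ONNX Runtime's reported execution providers to the "cpu" or "cuda" string used later in the article.

```python
import importlib.util
import sys


def pick_device():
    """Return "cuda" if onnxruntime reports a CUDA provider, else "cpu"."""
    if importlib.util.find_spec("onnxruntime") is None:
        return "cpu"  # onnxruntime is not installed; fall back to CPU
    try:
        import onnxruntime as ort
    except ImportError:
        return "cpu"  # installed but not importable; fall back to CPU
    providers = ort.get_available_providers()
    return "cuda" if "CUDAExecutionProvider" in providers else "cpu"


python_ok = sys.version_info >= (3, 10)
device = pick_device()
print(f"Python 3.10+: {python_ok}, suggested device: {device}")
```

The result can then be passed as the `device` argument when creating the extractor.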

Here is an example of converting a PDF to Markdown:

from pdf_craft import PDFPageExtractor, MarkDownWriter

extractor = PDFPageExtractor(
    device="cpu",  # Change to device="cuda" if you want to use CUDA.
    model_dir_path="/path/to/model/dir/path",  # Folder where the AI models are downloaded and installed
)
with MarkDownWriter(markdown_path, "images", "utf-8") as md…



Written by Emad Dehnavi

With 8 years as a software engineer, I write about AI and technology in a simple way. My goal is to make these topics easy and interesting for everyone.
