Marker

Using advanced OCR for layout, text, and math detection, Marker converts PDFs to Markdown while running locally.

Overview

Contributor
Vik Paruchuri
GitHub
Code

More Information

Marker converts PDFs to Markdown quickly and accurately.

  • Supports a wide range of documents (optimized for books and scientific papers)
  • Supports all languages
  • Removes headers/footers/other artifacts
  • Formats tables and code blocks
  • Extracts and saves images along with the markdown
  • Converts most equations to LaTeX
  • Works on GPU, CPU, or MPS

Marker is a pipeline of deep learning models:

  • Extract text, OCR if necessary (heuristics, surya, tesseract)
  • Detect page layout and find reading order (surya)
  • Clean and format each block (heuristics, texify)
  • Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)

It only uses models where necessary, which improves speed and accuracy.
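
As a rough illustration of the "models only where necessary" idea in the text-extraction step, the sketch below keeps a page's embedded text when it looks usable and falls back to OCR only otherwise. This is not Marker's actual code: `Page`, `needs_ocr`, `extract_pages`, `fake_ocr`, and the thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    # Text pulled directly from the PDF; may be empty or garbled (hypothetical type).
    embedded_text: str


def needs_ocr(text: str, min_chars: int = 20, min_alnum_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: request OCR when the embedded text is too short
    or consists mostly of non-alphanumeric characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    readable = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_alnum_ratio


def extract_pages(pages: List[Page], ocr: Callable[[Page], str]) -> List[str]:
    """Use embedded text when it passes the heuristic; call the OCR model otherwise."""
    return [ocr(p) if needs_ocr(p.embedded_text) else p.embedded_text for p in pages]


ocr_calls = 0


def fake_ocr(page: Page) -> str:
    """Stand-in for a real OCR model; counts how often it is invoked."""
    global ocr_calls
    ocr_calls += 1
    return "[OCR output]"


pages = [
    Page("Marker converts PDF to markdown quickly and accurately."),  # clean text
    Page(""),                                                         # scanned page
    Page("@@##$$%%^^&&**(())__++==~~"),                               # garbled text
]
result = extract_pages(pages, fake_ocr)
```

Here only the second and third pages reach the OCR stand-in, so the expensive model runs on two of three pages. A real pipeline would presumably use more careful checks, but the shape is the same: a cheap heuristic gates each model call.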

Contribute

Join the Discord

Contributors

About Vik

Vik is the founder of Marker. Previously, he founded Dataquest, a site that has taught AI and data skills to 1M+ people. He has also won several Kaggle competitions and was an early team member at edX, a pioneer in online education. His work has been featured on the front page of the New York Times, Wall Street Journal, LA Times, and other newspapers.

Other Projects