Marker
Using advanced OCR for layout, text, and math detection, Marker converts PDFs to Markdown while running locally.
Overview
- Contributor
- Vik Paruchuri
- GitHub
- Code
More Information
Marker converts PDFs to Markdown quickly and accurately.
- Supports a wide range of documents (optimized for books and scientific papers)
- Supports all languages
- Removes headers/footers/other artifacts
- Formats tables and code blocks
- Extracts and saves images along with the Markdown
- Converts most equations to LaTeX
- Works on GPU, CPU, or MPS
Marker is a pipeline of deep learning models:
- Extract text, OCR if necessary (heuristics, surya, tesseract)
- Detect page layout and find reading order (surya)
- Clean and format each block (heuristics, texify)
- Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)
It only uses models where necessary, which improves speed and accuracy.
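The four steps above, and the idea of calling a model only when a cheap heuristic says it is needed, can be sketched roughly as follows. This is an illustrative sketch and not Marker's actual code: `Page`, `needs_ocr`, `detect_layout`, `clean_block`, `convert`, and the `min_chars` threshold are all hypothetical names, and the layout and cleanup stages are plain-Python stand-ins for the real models (surya, texify).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    # Text layer already embedded in the PDF page; empty for scanned pages.
    embedded_text: str


def needs_ocr(page: Page, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a nearly empty text layer suggests a scan."""
    return len(page.embedded_text.strip()) < min_chars


def extract_text(page: Page, ocr: Callable[[Page], str]) -> str:
    """Step 1: use the embedded text, and call the OCR model only if needed."""
    return ocr(page) if needs_ocr(page) else page.embedded_text


def detect_layout(text: str) -> List[str]:
    """Step 2 stand-in: split into blocks on blank lines, in reading order."""
    return [block for block in text.split("\n\n") if block.strip()]


def clean_block(block: str) -> str:
    """Step 3 stand-in: collapse runs of whitespace inside a block."""
    return " ".join(block.split())


def convert(pages: List[Page], ocr: Callable[[Page], str]) -> str:
    """Step 4: combine the cleaned blocks from every page into one document."""
    cleaned: List[str] = []
    for page in pages:
        text = extract_text(page, ocr)
        cleaned.extend(clean_block(b) for b in detect_layout(text))
    return "\n\n".join(cleaned)
```

In this sketch, a page with a usable text layer never reaches the `ocr` callable, which is the kind of shortcut the speed claim above refers to.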
Contribute
Join the Discord
Contributors
About Vik
Vik is the founder of Marker. He previously founded Dataquest, a site that has taught AI and data skills to more than 1M people. He has also won several Kaggle competitions and was an early team member at edX, a pioneer in online education. His work has been featured on the front page of the New York Times, Wall Street Journal, LA Times, and other newspapers.
Other Projects
Web Applets is an open specification for building software that both humans and AI can understand and use together.
- Contributor
- Rupert Manfredi
Transformer Lab is an open source platform that allows anyone to build, tune, & run Large Language Models locally, without writing code.
- Contributors
- Ali Asaria, Tony Salomone
A creative tool for interactive art, Tölvera empowers artists to create and interact with dynamic, self-organizing systems. It is inspired by fields such as artificial life (ALife) and self-organizing systems.
- Contributor
- Jack Armitage