Chunkr

Open source API service to parse complex documents

Active · Y Combinator

Updated: Dec 2, 2025

About

Battle-tested + highly modular vision infrastructure to convert PDFs, PPTs, Word, Excel, PNG, and JPEGs into LLM-ready data.

We started by building lumina.sh - where we needed to parse ~600M pages of scientific literature. The researchers didn't care - but devs wanted our ingestion pipeline. So we built chunkr instead.

We offer high-quality layout analysis, OCR, bounding boxes, granular VLM controls, semantic chunking, and all the last-mile engineering that goes into building standout AI applications. Common use cases include RAG and automating document workflows, such as turning invoices or medical reports into database records.

Financial History

Total Raised

N/A

Valuation

N/A

Leadership Team

Key people at Chunkr.

Deep Dive

High-Level Overview

Chunkr is an open-source API service designed to parse complex documents such as PDFs, Word, Excel, PPT, and images into structured, LLM-ready data formats. It provides advanced document layout analysis, OCR, semantic chunking, and schema-driven data extraction with granular citations and confidence scoring. The platform targets AI and developer teams needing precise, customizable document ingestion pipelines for applications like Retrieval-Augmented Generation (RAG), visual search, and AI agents that reason over complex documents[2][4][6].
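Schema-driven extraction with citations and confidence scores can be illustrated with a small post-processing sketch. This is not Chunkr's API: the `INVOICE_SCHEMA` fields, the `ExtractedField` record, the 0-to-1 confidence scale, and the page-number citation are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical JSON Schema naming the fields to pull from an invoice.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}


@dataclass
class ExtractedField:
    """One extracted value plus its provenance (assumed shape)."""
    name: str
    value: object
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (full trust)
    page: int          # assumed citation: page the value was read from


def split_by_confidence(fields, threshold=0.8):
    """Return (accepted, needs_review) using a confidence cutoff."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


def missing_required(schema, fields):
    """List required schema fields that are absent from `fields`."""
    present = {f.name for f in fields}
    return [name for name in schema.get("required", []) if name not in present]
```

A caller might accept high-confidence fields automatically and route the rest, together with their page citations, to a human reviewer.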

Chunkr is a mission-driven technology company focused on enabling scalable, accurate document intelligence to power AI applications. Its approach centers on open-source infrastructure that accelerates AI adoption across sectors such as legal, finance, scientific research, and enterprise software. By offering a modular pipeline that can be self-hosted or consumed through a cloud API, Chunkr lowers the barriers to building document-centric AI products and encourages innovation in data ingestion and processing.

Chunkr builds a production-ready API and tooling for developers, AI teams, and enterprises dealing with messy, complex documents. It addresses the limits of one-size-fits-all document parsing by giving users granular control over speed, quality, and feature trade-offs, so pipelines can be tailored to diverse use cases. The company shows growth momentum through open-source adoption, cloud API offerings, and enterprise deployments, supported by a Rust-based infrastructure that reportedly processes over 11 million pages per month on a single GPU[2][4][6].
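The monthly throughput figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes continuous 24/7 operation and a 30-day month, neither of which is stated in the source; the 4 pages per second input is the approximate single-GPU rate this profile quotes for an RTX 4090 under Core Differentiators.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_month(pages_per_second, days=30):
    """Pages processed in `days` days of uninterrupted operation."""
    return pages_per_second * SECONDS_PER_DAY * days


def required_rate(monthly_target, days=30):
    """Sustained pages/second needed to reach `monthly_target`."""
    return monthly_target / (SECONDS_PER_DAY * days)


print(pages_per_month(4))                    # 10368000
print(round(required_rate(11_000_000), 2))   # 4.24
```

At exactly 4 pages per second the result is about 10.4 million pages, so the 11 million figure implies a sustained rate closer to 4.2 pages per second, which is consistent with "about 4".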

Origin Story

Chunkr was co-founded by Mehul, Ishaan, and Akhilesh, who identified the challenge of inflexible document parsing solutions that fail to meet varied AI application needs. The idea emerged from the need for a modular, customizable pipeline that balances speed, accuracy, and features without vendor lock-in. The founders leveraged their expertise to build Chunkr as an open-source project with a production-ready API, emphasizing self-hosting and enterprise readiness. Early traction came from developer adoption and integration into AI workflows requiring precise document chunking and extraction[2].

The project is maintained by Lumina AI Inc., which continues to evolve Chunkr from an open-source tool into a fully managed cloud API and enterprise-grade solution with proprietary models for enhanced accuracy and speed. This evolution reflects a focus on serving both community users and high-security regulated industries[4].

Core Differentiators

Modular Pipeline Control: Users can customize processing at the segment level (titles, tables, formulas, captions) with different OCR and vision-language model (VLM) strategies per segment, balancing speed and quality[2].
Open Source with Enterprise Options: The AGPL open-source version allows transparency and local hosting, while the Cloud API offers proprietary models and enterprise features for higher accuracy and reliability[4].
Production-Ready Infrastructure: Built in Rust for performance and reliability, with features like image conversion, page parallelization, failure handling, and batching handled out-of-the-box[2].
Schema-Driven Data Extraction: Supports JSON Schema-based extraction with granular source citations and confidence scores, enabling precise structured outputs tailored to user needs[1].
Scalability and Cost Efficiency: Processes about 4 pages per second on a single RTX 4090 GPU, enabling cost-effective large-scale document processing (over 11 million pages/month)[2].
Developer Experience: Provides powerful Python and TypeScript SDKs, a web interface for testing, and comprehensive API documentation for easy integration[6].
Community and Support: Active open-source community with Discord support, plus dedicated support and migration assistance for enterprise customers[4].
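
The per-segment control described in the first item above can be pictured as a lookup from segment type to processing strategy. This is a minimal sketch and not Chunkr's actual API: the `Strategy` enum, the `SEGMENT_STRATEGIES` table, and `pick_strategy` are hypothetical names, and only the segment types (title, table, formula, caption) come from the text.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical processing strategies trading speed for quality."""
    FAST_OCR = "fast_ocr"  # cheaper, suited to plain text
    VLM = "vlm"            # slower vision-language model pass


DEFAULT_STRATEGY = Strategy.FAST_OCR

# Assumed defaults: spend the expensive model only on hard segments.
SEGMENT_STRATEGIES = {
    "title": Strategy.FAST_OCR,
    "caption": Strategy.FAST_OCR,
    "table": Strategy.VLM,
    "formula": Strategy.VLM,
}


def pick_strategy(segment_type, overrides=None):
    """Resolve the strategy for a segment, letting callers override defaults."""
    table = {**SEGMENT_STRATEGIES, **(overrides or {})}
    return table.get(segment_type.lower(), DEFAULT_STRATEGY)
```

Passing `overrides={"title": Strategy.VLM}` would upgrade titles alone, which is the kind of speed versus quality trade-off the list describes.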

Role in the Broader Tech Landscape

Chunkr rides the growing trend of AI-driven document intelligence and Retrieval-Augmented Generation (RAG) systems, where structured, high-quality document data is critical for effective language model applications. The timing is favorable due to increasing demand for automating knowledge extraction from diverse document types in industries like legal, finance, healthcare, and scientific research.

Market forces such as the proliferation of large language models (LLMs), the need for explainability via citations, and the push for self-hosted, privacy-compliant AI infrastructure work in Chunkr’s favor. By enabling granular control and modularity, Chunkr influences the ecosystem by empowering developers to build tailored AI pipelines without vendor lock-in, fostering innovation and accelerating AI adoption in document-heavy workflows[2][6].

Quick Take & Future Outlook

Looking ahead, Chunkr is poised to expand its influence by enhancing its proprietary cloud API models, improving enterprise features, and deepening integrations with AI platforms. Trends shaping its journey include the rise of specialized AI agents, increasing regulatory demands for data provenance, and the growing importance of hybrid cloud/self-hosted solutions.

Chunkr’s open-source foundation combined with enterprise-grade offerings positions it uniquely to serve both community developers and large organizations. Its modular, customizable approach to document parsing and extraction will likely become a standard for AI-powered document intelligence, driving further innovation in how machines understand and utilize complex documents.

This trajectory ties back to Chunkr’s core mission: to provide flexible, high-quality document ingestion infrastructure that feels like a custom-built tool but without the associated complexity or lock-in[2][4][6].

Sources

Frequently Asked Questions

Who founded Chunkr?

Chunkr was founded in 2023 by Mehul Chadda (Co-founder & CEO), Akhilesh Sharma (Founder), and Ishaan Kapoor (Founder).
