Extract JSON from PDF: tables, structured text or raw content.
Free PDF to JSON converter. Three modes: detect tables automatically, extract structured text with page layout, or get raw text per page. Powered by PDF.js — runs in your browser, no upload needed.
Three steps — upload your PDF and choose the output that fits your use case.
Drop or select a PDF created by software: one exported from Excel, Google Docs or a reporting tool, or generated by an application. PDF.js reads the file in your browser and shows the page count and metadata immediately. Scanned PDFs (images of text) are not supported, because they contain no text layer to extract.
Select Table extraction to detect tabular data and output an array of objects with column headers as keys. Choose Structured text to get pages with lines and their approximate vertical positions, preserving reading order. Use Raw text for the simplest output: a JSON array with one object per page, each holding that page's plain text.
Click Extract and copy to clipboard or download a .json file. The output is valid JSON ready for further processing — load it into a pipeline, flatten it with the JSON to CSV tool, or inspect it in the JSON viewer.
Three ways to extract content from a PDF — choose based on your data and use case.
| Mode | Best for | Output shape |
|---|---|---|
| Table extraction | PDFs from Excel, reporting tools, financial exports | Array of objects where each row becomes {"Column A": "value", "Column B": "value"}. Column headers are taken from the first detected row. Multiple tables on a page produce multiple arrays. |
| Structured text | Documents, reports, articles where layout matters | Array of pages, each with an array of lines. Each line contains the text content and its approximate Y position on the page — useful for preserving reading order and detecting section boundaries. |
| Raw text | Simple text extraction, NLP pipelines, search indexing | Array of page objects with a single text field containing all text from that page concatenated. The simplest output — one object per page, suitable for feeding into text processing tools. |
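
As an illustration of how the Y positions in Structured text output can be used to detect section boundaries, here is a minimal Python sketch. The field names `lines`, `text` and `y` are assumptions about the output shape, not confirmed names, and the `gap_factor` threshold is an arbitrary choice. It flags any line that follows a vertical gap noticeably larger than the page's typical line spacing.

```python
from statistics import median


def find_section_starts(lines, gap_factor=1.8):
    """Return indices of lines that follow an unusually large vertical gap.

    `lines` is a list of dicts with an assumed numeric "y" key. Absolute
    differences are used, so it does not matter whether Y grows up or down.
    """
    if len(lines) < 3:
        return [0] if lines else []
    gaps = [abs(lines[i]["y"] - lines[i - 1]["y"]) for i in range(1, len(lines))]
    typical = median(gaps)
    starts = [0]  # the first line always opens a section
    for i, gap in enumerate(gaps, start=1):
        if typical > 0 and gap > gap_factor * typical:
            starts.append(i)
    return starts


# Hypothetical page in the assumed Structured text shape.
page = {
    "lines": [
        {"text": "Quarterly report", "y": 760.0},
        {"text": "Revenue grew in all regions.", "y": 740.0},
        {"text": "Costs were flat.", "y": 726.0},
        {"text": "Headcount was unchanged.", "y": 712.0},
        {"text": "Outlook", "y": 680.0},
        {"text": "We expect steady growth.", "y": 666.0},
        {"text": "No major risks are known.", "y": 652.0},
    ]
}

starts = find_section_starts(page["lines"])
print([page["lines"][i]["text"] for i in starts])
# → ['Quarterly report', 'Outlook']
```

The median is used as the "typical" spacing so that a few large gaps do not distort the threshold. Pages with mixed font sizes may need a different `gap_factor`.
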
Common workflows where extracting JSON from a PDF is the first step.
Bank statements, invoices, expense reports and financial summaries often arrive as PDF. Extracting to JSON lets you process the data programmatically — load it into a spreadsheet, import it into an accounting tool, or feed it to a data pipeline without manual re-entry.
Older ERP, CRM and reporting systems often only export to PDF. Extracting the tabular data as JSON is the first step in migrating that data to a modern system — convert to JSON, then use the JSON to CSV tool to get a spreadsheet-ready file.
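
For scripted migrations, the same JSON to CSV step can be done in a few lines of Python. This is a minimal sketch that assumes the Table extraction output is a flat array of row objects, as described above. The sample rows and the `table_to_csv` helper are hypothetical.

```python
import csv
import io
import json


def table_to_csv(rows):
    """Convert a list of row objects to CSV text.

    Columns are the union of all keys, in first-seen order, so rows with
    missing cells still line up under the right header.
    """
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted table, as it might be pasted from the clipboard.
extracted = json.loads(
    '[{"Invoice": "1001", "Customer": "Acme, Inc.", "Total": "250.00"},'
    ' {"Invoice": "1002", "Customer": "Globex", "Total": "99.50"}]'
)

csv_text = table_to_csv(extracted)
print(csv_text)
```

Values stay as strings throughout, so amounts and identifiers are not reinterpreted on the way to the spreadsheet.
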
Research papers, legal documents and technical manuals in PDF format need text extraction before processing with NLP tools. Raw text mode extracts clean text per page — ready for tokenisation, embedding generation or keyword extraction.
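
As a rough illustration of the tokenisation and keyword extraction step, the sketch below counts word frequencies across Raw text pages using only the Python standard library. The `text` field follows the output shape described above. The regular expression, the stopword list and the sample pages are illustrative assumptions, and a real NLP pipeline would likely use a dedicated tokeniser.

```python
import re
from collections import Counter

# Lowercase words and numbers, optionally with an apostrophe suffix ("don't").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def top_keywords(pages, stopwords, limit=3):
    """Count non-stopword tokens across all pages and return the most common."""
    counts = Counter()
    for page in pages:
        counts.update(t for t in tokenize(page["text"]) if t not in stopwords)
    return counts.most_common(limit)


# Hypothetical Raw text output: one object per page with a "text" field.
pages = [
    {"text": "The contract term is twelve months."},
    {"text": "Either party may end the contract with notice."},
]
stopwords = {"the", "is", "may", "with", "either"}

print(top_keywords(pages, stopwords, limit=1))
# → [('contract', 2)]
```
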
Building a search index over a document corpus requires extracting text from PDFs. Raw text mode produces a clean JSON structure with page-level text that can be indexed directly by Elasticsearch, Typesense or any full-text search engine.
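
To show what page-level indexing might look like, here is a sketch that turns Raw text pages into a newline-delimited payload in the format expected by Elasticsearch's bulk API (an action line followed by a document line). The index name, the ID scheme and the `source` and `page` fields are assumptions chosen for the example. Sending the payload to a cluster is left out.

```python
import json


def to_bulk_ndjson(pages, index_name, doc_prefix):
    """Build an Elasticsearch bulk payload with one document per PDF page."""
    lines = []
    for page_number, page in enumerate(pages, start=1):
        # Action line: tells the bulk endpoint where to index the next document.
        action = {
            "index": {"_index": index_name, "_id": f"{doc_prefix}-p{page_number}"}
        }
        # Document line: the page text plus enough metadata to cite the hit.
        document = {"source": doc_prefix, "page": page_number, "text": page["text"]}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical Raw text output for a two-page manual.
pages = [
    {"text": "Installation steps for the pump."},
    {"text": "Maintenance schedule and spare parts."},
]

payload = to_bulk_ndjson(pages, index_name="manuals", doc_prefix="pump-guide")
print(payload)
```

Indexing one document per page keeps search hits traceable to a page number. Other engines use different import formats, so the action lines would need adapting.
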
Be aware of these constraints before using the tool.
Process the extracted JSON with these tools next.
PDF parsed in your browser. No upload.
JSONshift uses PDF.js, the open-source PDF engine developed by Mozilla and used in Firefox to render PDFs, to extract content directly in your browser. Your PDF file is never transmitted to any server. Close the tab and it's gone.
Table extraction uses coordinate clustering: text items that share the same vertical position (within a 3-point tolerance) are grouped into rows, and significant horizontal gaps between items are used to detect column boundaries. This approach works reliably for PDFs created by software with precise text positioning.
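
The coordinate clustering idea described above can be sketched in Python as follows. This is an illustrative approximation, not the tool's actual implementation. The item shape (`text`, `x`, `y`, `width`), the 10-point `min_gap` and the assumption that larger Y values are nearer the top of the page (as in default PDF user space) are all assumptions made for the example. Only the 3-point row tolerance and the first-row-as-headers rule come from the description.

```python
def group_rows(items, y_tolerance=3.0):
    """Group text items into rows by vertical position, top of page first."""
    rows = []
    for item in sorted(items, key=lambda it: -it["y"]):
        # An item joins the current row if it sits within the tolerance
        # of that row's first item; otherwise it starts a new row.
        if rows and abs(rows[-1]["y"] - item["y"]) <= y_tolerance:
            rows[-1]["items"].append(item)
        else:
            rows.append({"y": item["y"], "items": [item]})
    for row in rows:
        row["items"].sort(key=lambda it: it["x"])  # left to right
    return [row["items"] for row in rows]


def split_cells(row_items, min_gap=10.0):
    """Merge neighbouring items into one cell unless a wide gap separates them."""
    cells = []
    prev_end = None
    for item in row_items:
        if prev_end is not None and item["x"] - prev_end < min_gap:
            cells[-1] += " " + item["text"]  # small gap: same cell
        else:
            cells.append(item["text"])  # wide gap: column boundary
        prev_end = item["x"] + item["width"]
    return cells


def extract_table(items, y_tolerance=3.0, min_gap=10.0):
    """Return row objects keyed by the cells of the first detected row."""
    rows = [split_cells(r, min_gap) for r in group_rows(items, y_tolerance)]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


# Hypothetical text items with positions in PDF points.
items = [
    {"text": "Item", "x": 50, "y": 700, "width": 24},
    {"text": "Price", "x": 200, "y": 700.5, "width": 28},
    {"text": "Blue", "x": 50, "y": 680, "width": 22},
    {"text": "widget", "x": 75, "y": 681, "width": 34},
    {"text": "4.50", "x": 200, "y": 680, "width": 22},
]

print(extract_table(items))
# → [{'Item': 'Blue widget', 'Price': '4.50'}]
```

This sketch finds column boundaries row by row, so a row with an empty cell would shift its values one column to the left. Aligning cells to column positions shared across rows would be needed to handle that case.
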
