What output modes are available?

Three modes: Table extraction detects tabular data by analysing text coordinates and outputs an array of objects. Structured text outputs pages with paragraphs and lines preserving the reading order. Raw text outputs the plain text content of each page as a JSON array.

Does the PDF to JSON converter work with scanned PDFs?

No. Scanned PDFs contain images of text, not actual text characters. PDF.js can only extract text from PDFs that were created digitally: exported from Word, Excel, Google Docs, or generated by software. Scanned PDFs require OCR (optical character recognition), which is not available in this tool.

Convert PDF to JSON — Extract Tables & Structured Text

Q: How do I convert a PDF to JSON?

Select or drop your PDF file, choose an output mode (table extraction, structured text or raw text), and click Extract. The converter uses PDF.js to read the PDF in your browser and outputs the extracted content as JSON. The file is never sent to a server.

Q: What types of PDFs work best with table extraction?

Table extraction works best with PDFs created by spreadsheet software (Excel, Google Sheets, LibreOffice Calc) or reporting tools (Crystal Reports, Jasper, SSRS). These PDFs have precise text positioning that makes column detection reliable. PDFs from Word documents or websites may have less consistent column alignment.

PDF to JSON Converter 100% client-side

Uses PDF.js (Mozilla), the same engine that renders PDFs in Firefox. Works with digital PDFs only, not scanned images. No data is uploaded.

How to extract JSON from a PDF

Three steps — upload your PDF and choose the output that fits your use case.

Upload a digital PDF

Drop or select a PDF created by software — exported from Excel, Google Docs, a reporting tool or generated by an application. PDF.js reads the file in your browser and shows the page count and metadata immediately. Scanned PDFs (images of text) are not supported.

Choose output mode

Select Table extraction to detect tabular data and output an array of objects with column headers as keys. Choose Structured text to get pages with lines and coordinates — preserving reading order. Use Raw text for the simplest output: plain text per page as a JSON array.

Copy or download the JSON

Click Extract and copy to clipboard or download a .json file. The output is valid JSON ready for further processing — load it into a pipeline, flatten it with the JSON to CSV tool, or inspect it in the JSON viewer.
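For Raw text mode, the first two steps can be approximated with the sketch below. It is not the tool's actual code: it assumes only the parts of the PDF.js document API that it declares as interfaces (`numPages`, `getPage`, `getTextContent`), and the function name `extractRawText` and the choice to join items with a space are illustrative assumptions.

```typescript
// Structural subset of the PDF.js API used in this sketch.
interface TextContentLike {
  items: Array<{ str?: string }>; // marked-content items carry no `str`
}
interface PageLike {
  getTextContent(): Promise<TextContentLike>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Hypothetical helper: one `{ text }` object per page, as in Raw text mode.
async function extractRawText(doc: DocLike): Promise<Array<{ text: string }>> {
  const pages: Array<{ text: string }> = [];
  // PDF.js page numbers start at 1.
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    // Assumption: items are joined with a single space, in the order returned.
    const text = content.items.map((item) => item.str ?? "").join(" ");
    pages.push({ text });
  }
  return pages;
}
```

In a browser, `doc` would typically come from `pdfjsLib.getDocument({ data: await file.arrayBuffer() }).promise`, where `file` is the dropped or selected `File`.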

Output modes explained

Three ways to extract content from a PDF — choose based on your data and use case.

Mode	Best for	Output shape
Table extraction	PDFs from Excel, reporting tools, financial exports	Array of objects where each row becomes `{"Column A": "value", "Column B": "value"}`. Column headers are taken from the first detected row. Multiple tables on a page produce multiple arrays.
Structured text	Documents, reports, articles where layout matters	Array of pages, each with an array of lines. Each line contains the text content and its approximate Y position on the page — useful for preserving reading order and detecting section boundaries.
Raw text	Simple text extraction, NLP pipelines, search indexing	Array of page objects with a single `text` field containing all text from that page concatenated. The simplest output — one object per page, suitable for feeding into text processing tools.
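The Table extraction shape in the table above (first detected row becomes the keys) can be illustrated with a small helper. This is a sketch under stated assumptions, not the converter's implementation: the name `rowsToObjects`, the positional fallback for blank headers and the empty string for missing cells are all hypothetical choices.

```typescript
// One extracted table row: column header -> cell text.
type TableRow = Record<string, string>;

// Hypothetical helper: turns detected rows (arrays of cell strings) into
// objects keyed by the first row.
function rowsToObjects(rows: string[][]): TableRow[] {
  if (rows.length < 2) return []; // need a header row plus at least one data row
  const header = rows[0] ?? [];
  // Assumption: a blank header falls back to a positional name.
  const keys = header.map((h, i) => h.trim() || `Column ${i + 1}`);
  return rows.slice(1).map((cells) => {
    const obj: TableRow = {};
    keys.forEach((key, i) => {
      obj[key] = (cells[i] ?? "").trim(); // assumption: missing cells become ""
    });
    return obj;
  });
}

const sampleTable = rowsToObjects([
  ["Date", "Amount"],
  ["2024-01-05", "120.00"],
  ["2024-01-06", "75.50"],
]);
console.log(JSON.stringify(sampleTable, null, 2));
```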

When do you need PDF to JSON?

Common workflows where extracting JSON from a PDF is the first step.

Financial reports and statements

Bank statements, invoices, expense reports and financial summaries often arrive as PDF. Extracting to JSON lets you process the data programmatically — load it into a spreadsheet, import it into an accounting tool, or feed it to a data pipeline without manual re-entry.

Data exports from legacy systems

Older ERP, CRM and reporting systems often only export to PDF. Extracting the tabular data as JSON is the first step in migrating that data to a modern system — convert to JSON, then use the JSON to CSV tool to get a spreadsheet-ready file.
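If the JSON-to-CSV step has to run inside your own pipeline instead, it might look like the sketch below. This is not the site's JSON to CSV tool; the function name `toCsv` is hypothetical, and the sketch assumes every value is already a string, as in the table-extraction output.

```typescript
// Flatten an array of row objects into CSV text with RFC 4180-style quoting.
function toCsv(rows: Array<Record<string, string>>): string {
  // Collect the union of keys in first-seen order so ragged rows still align.
  const headers: string[] = [];
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (!headers.includes(key)) headers.push(key);
    }
  }
  // Quote a field only when it contains a comma, quote or line break.
  const quote = (value: string): string =>
    /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;

  const lines = [headers.map(quote).join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => quote(row[h] ?? "")).join(","));
  }
  return lines.join("\r\n");
}

console.log(toCsv([
  { Customer: "Acme, Inc.", Total: "120.00" },
  { Customer: "Globex", Total: "75.50" },
]));
```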

NLP and text processing

Research papers, legal documents and technical manuals in PDF format need text extraction before processing with NLP tools. Raw text mode extracts clean text per page — ready for tokenisation, embedding generation or keyword extraction.

Search and indexing

Building a search index over a document corpus requires extracting text from PDFs. Raw text mode produces a clean JSON structure with page-level text that can be indexed directly by Elasticsearch, Typesense or any full-text search engine.
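As a rough illustration of the indexing step, the sketch below turns Raw text output into a newline-delimited body of the kind the Elasticsearch bulk API accepts (one action line followed by one document line per page). The function name, the index name, the document id scheme and the `doc_id` and `page` fields are assumptions made for this example; only the per-page `text` field comes from the output described above.

```typescript
// Raw text output as described: one object per page with a `text` field.
type RawPage = { text: string };

// Hypothetical helper: builds an NDJSON bulk body, one indexed document per page.
function toBulkNdjson(pages: RawPage[], index: string, docId: string): string {
  if (pages.length === 0) return "";
  const lines: string[] = [];
  pages.forEach((page, i) => {
    const pageNumber = i + 1; // assumption: array order equals page order
    // Action line: tells the bulk endpoint where to index the next document.
    lines.push(JSON.stringify({ index: { _index: index, _id: `${docId}-p${pageNumber}` } }));
    // Source line: the document itself.
    lines.push(JSON.stringify({ doc_id: docId, page: pageNumber, text: page.text }));
  });
  // A bulk body must end with a newline.
  return lines.join("\n") + "\n";
}

console.log(toBulkNdjson([{ text: "Annual report" }, { text: "Balance sheet" }], "docs", "report-2024"));
```

Indexing one document per page keeps search hits traceable to a page number; indexing whole files would be an equally valid design.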

Limitations — what this tool cannot do

Be aware of these constraints before using the tool.

Scanned PDFs: If your PDF was created by scanning a physical document, it contains images — not text characters. PDF.js cannot extract text from images. You will need an OCR tool (like Adobe Acrobat, Tesseract or Google Document AI) to first convert the scanned images to text.
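A simple way to notice a scanned file after the fact is to check whether any page yielded text at all. The sketch below works on the Raw text output (one `text` field per page); the function names and the rule "every page is empty" are illustrative assumptions, not the tool's own detection logic.

```typescript
// Raw text output as described: one object per page with a `text` field.
type RawPage = { text: string };

// Hypothetical helper: 1-based numbers of pages with no extractable text.
function emptyPages(pages: RawPage[]): number[] {
  const result: number[] = [];
  pages.forEach((page, i) => {
    if (page.text.trim().length === 0) result.push(i + 1);
  });
  return result;
}

// Assumption: a file is "probably scanned" when every page came back empty.
function looksScanned(pages: RawPage[]): boolean {
  return pages.length > 0 && emptyPages(pages).length === pages.length;
}

console.log(looksScanned([{ text: "" }, { text: "  \n" }]));
console.log(emptyPages([{ text: "Invoice 42" }, { text: "" }]));
```

A digital PDF can still contain individual image-only pages, so the per-page list is often more useful than the single yes/no answer.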

Table detection accuracy: Table extraction works by clustering text items that share the same vertical position. This works reliably for PDFs exported from spreadsheet software. PDFs from Word, web pages or poorly formatted reports may produce misaligned columns or merged cells that the detector cannot handle correctly.

Complex layouts: Multi-column page layouts (newspapers, magazines, academic papers with sidebars) confuse the reading order reconstruction. Use Raw text mode for these — it concatenates all text items in the order PDF.js returns them, which may not match the visual reading order.

Password-protected PDFs: Encrypted PDFs cannot be opened by PDF.js without the password. The converter will report an error — you will need to remove the password protection first using a tool like Adobe Acrobat or PDFtk.

PDF parsed in your browser. No upload.

JSONshift uses PDF.js, the open-source PDF engine developed by Mozilla and used in Firefox to render PDFs, to extract content directly in your browser. Your PDF file is never transmitted to any server. Close the tab and it's gone.

Table extraction uses coordinate clustering: text items that share the same vertical position (within a 3-point tolerance) are grouped into rows, and significant horizontal gaps between items are used to detect column boundaries. This approach works reliably for PDFs created by software with precise text positioning.
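The clustering described above can be sketched as follows. This is a simplified illustration, not JSONshift's code: the 3-point row tolerance comes from the description, while the 12-point column gap, the type and function names, and the decision to split each row independently (rather than deriving column boundaries shared by the whole table) are assumptions.

```typescript
// Positioned text item. With PDF.js, x and y would be read from
// transform[4] and transform[5] of each text item.
interface PositionedItem {
  str: string;
  x: number;
  y: number;
  width: number;
}

const ROW_TOLERANCE = 3; // points, per the description above
const COLUMN_GAP = 12; // points; assumed threshold for a "significant" gap

// Group items whose y lies within ROW_TOLERANCE of the first item in the row.
function clusterRows(items: PositionedItem[]): PositionedItem[][] {
  // PDF y grows upward, so sort descending to walk the page top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: PositionedItem[][] = [];
  for (const item of sorted) {
    const current = rows[rows.length - 1];
    const anchor = current?.[0];
    if (current && anchor && Math.abs(anchor.y - item.y) <= ROW_TOLERANCE) {
      current.push(item);
    } else {
      rows.push([item]);
    }
  }
  // Order each row left to right.
  return rows.map((row) => row.sort((a, b) => a.x - b.x));
}

// Split one row into cells: a gap wider than COLUMN_GAP starts a new cell,
// smaller gaps are treated as word spacing inside the same cell.
function splitCells(row: PositionedItem[]): string[] {
  const cells: string[] = [];
  let current: string[] = [];
  let prevEnd = 0;
  for (const item of row) {
    if (current.length > 0 && item.x - prevEnd > COLUMN_GAP) {
      cells.push(current.join(" "));
      current = [];
    }
    current.push(item.str);
    prevEnd = item.x + item.width;
  }
  if (current.length > 0) cells.push(current.join(" "));
  return cells;
}

const demoItems: PositionedItem[] = [
  { str: "Name", x: 50, y: 700, width: 30 },
  { str: "Qty", x: 200, y: 701, width: 20 },
  { str: "Apple", x: 50, y: 680, width: 28 },
  { str: "Pie", x: 80, y: 680.5, width: 15 },
  { str: "3", x: 200, y: 679, width: 6 },
];
console.log(clusterRows(demoItems).map(splitCells));
```

Because the rows here are split independently, a row with an empty cell will produce fewer cells than its neighbours; a fuller implementation would align cells to column boundaries shared across rows.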

Mozilla's open-source PDF engine, the same one used in Firefox. Handles complex PDF internals including embedded fonts, encoding tables and page transforms.

Coordinate-based table detection

Analyses text item positions (x, y, width) to detect rows and columns — no visual rendering needed. Works with PDFs that have no visible grid lines.

Honest about limitations

Scanned PDFs, complex layouts and encrypted files are clearly flagged — no false promises about extraction quality.

47 tools, always free

No file size limits, no watermarks, no account. Funded by non-intrusive display advertising only.

Frequently asked questions

Common questions about extracting JSON from PDF files.

How do I convert a PDF to JSON?

Upload your PDF, choose an output mode (table extraction, structured text or raw text), and click Extract. PDF.js reads the file in your browser and outputs the extracted content as JSON. No upload to any server is required.

Does the converter work with scanned PDFs?

No. Scanned PDFs contain images of text — PDF.js can only extract text from digital PDFs created by software. If your PDF was created by scanning a physical document, you need an OCR tool first (Adobe Acrobat, Tesseract, or Google Document AI) to extract the text.

What types of PDFs work best with table extraction?

PDFs exported from Excel, Google Sheets, LibreOffice Calc, Crystal Reports or similar reporting tools work best. These have precise text positioning that makes column detection reliable. PDFs from Word documents or web pages may produce less accurate results due to inconsistent text alignment.

What is the difference between the three output modes?

Table extraction detects rows and columns by coordinate analysis and outputs an array of objects — best for spreadsheet-style data. Structured text preserves page layout with lines and positions — best for documents where reading order matters. Raw text outputs plain text per page — simplest output, best for NLP and search indexing.

Is my PDF safe when using this converter?

Yes. PDF.js runs entirely in your browser, and your PDF file is never uploaded to any server. To verify, open the Network tab in your browser's developer tools during conversion: apart from the one-time load of PDF.js from its CDN on first use, you will see no outbound data requests.

Is the PDF to JSON converter free?

Yes, completely free. No file size limits, no account required. JSONshift is funded by non-intrusive display advertising.

Extract JSON from PDF: tables, structured text or raw content.
