A web interface to extract data tables from PDFs
Excalibur is a web interface to extract tabular data from PDFs.
An open-source tool to extract tables from PDFs into Excels - Vinayak Mehta
I have also published this post on Hacker Noon. Borrowing the first three paragraphs from my previous blog post since they perfectly explain why extracting tables from PDFs is hard. The PDF (Portable Document Format) was born out of The Camelot Project to create "a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks".
Hello Product Hunt! I'm Vinayak, creator of Excalibur, a web interface to extract tabular data from PDFs! There are both open-source (Tabula, pdf-table-extract) and closed-source (Smallpdf, Docparser) tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy, and it leads to the creation of ad-hoc table extraction scripts for each type of PDF table. Excalibur uses Camelot under the hood, a Python library I created to offer users complete control over table extraction. If you can't get the desired output with default settings, you can tweak them and get the job done! You can install Excalibur using "pip install excalibur-py" or just download and run the Windows/Linux executable from the releases page here:
Great documentation is available here:
I would be really grateful for your feedback that can help me improve it! You can follow the development on GitHub here:
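To make the "tweak the settings" idea above concrete, here is a minimal sketch of what that could look like when calling Camelot directly from Python instead of through the Excalibur web interface. It assumes Camelot is installed separately (for example with "pip install camelot-py[cv]") and that your version accepts the `flavor` and `line_scale` keyword arguments and exposes a `parsing_report` with an `accuracy` entry on each table. The helper name `extract_with_fallback`, the `ATTEMPTS` list, the 90% threshold, and the file name in the usage comment are illustrative choices, not part of Excalibur or Camelot.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Settings to try in order: Camelot's default "lattice" parser first,
# then two tweaked variants. Which tweaks help depends on the PDF;
# these three are only examples.
ATTEMPTS: List[Dict[str, Any]] = [
    {"flavor": "lattice"},                    # default: tables with ruling lines
    {"flavor": "lattice", "line_scale": 40},  # try to pick up shorter lines
    {"flavor": "stream"},                     # tables separated by whitespace
]


def extract_with_fallback(
    pdf_path: str,
    pages: str = "1",
    min_accuracy: float = 90.0,
    read_pdf: Optional[Callable[..., Any]] = None,
) -> Tuple[Optional[Dict[str, Any]], List[Any]]:
    """Return the first settings that yield acceptable tables, plus those tables.

    `read_pdf` can be injected for testing; by default camelot.read_pdf is used.
    Returns (None, []) if no attempt reaches `min_accuracy`.
    """
    if read_pdf is None:
        import camelot  # third-party; imported lazily so the module loads without it

        read_pdf = camelot.read_pdf

    for settings in ATTEMPTS:
        tables = read_pdf(pdf_path, pages=pages, **settings)
        # Keep only tables whose reported accuracy meets the threshold.
        good = [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]
        if good:
            return settings, good
    return None, []


# Hypothetical usage (requires Camelot and a real PDF on disk):
#   settings, tables = extract_with_fallback("report.pdf", pages="1")
#   if tables:
#       tables[0].df.to_csv("report-table-1.csv", index=False)
```

The accuracy threshold is a crude stand-in for the "fuzzy" judgement described above: in practice you would still want to look at the extracted table, which is what the Excalibur interface is for.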