Super HN - Super Hacker News

PDF to Text, a Challenging Problem (marginalia.nu) 14 points by ingve 39 minutes ago

Yeah, getting text, even structured text, out of PDFs is no picnic. Scraping a table of HTML is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where a table is simply a loose assemblage of graphical and text elements that is only discernible as a table once it's rendered.

I've actually had decent luck converting PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out the columns and extract the values for each row.
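Roughly, the column-assignment step looks like the sketch below. This is an illustration, not my actual script: it assumes the positioned text has already been pulled out of the converted HTML into (left, top, text) tuples, and the function name `extract_table`, the tolerances, and the sample header are all made up for the example. It also assumes left-aligned columns; right-aligned numbers would need matching on the right edge instead.

```python
from bisect import bisect_right


def extract_table(fragments, header, y_tol=3, x_slack=5):
    """Rebuild table rows from positioned text fragments.

    fragments: iterable of (left, top, text) tuples (hypothetical input shape).
    header:    list of the expected column titles, left to right.
    """
    # Group fragments into visual lines: tops within y_tol count as one line.
    lines = []
    for left, top, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(lines[-1][0] - top) <= y_tol:
            lines[-1][1].append((left, text))
        else:
            lines.append((top, [(left, text)]))
    for _, cells in lines:
        cells.sort()  # order each line left to right

    # Find the line whose texts match the expected header exactly.
    for i, (_, cells) in enumerate(lines):
        if [text for _, text in cells] == header:
            starts = [left for left, _ in cells]
            break
    else:
        return []  # header not found

    # Assign every later fragment to the column whose header starts at or
    # just before its x-coordinate.
    rows = []
    for _, cells in lines[i + 1:]:
        row = [""] * len(header)
        for left, text in cells:
            col = max(0, bisect_right(starts, left + x_slack) - 1)
            row[col] = (row[col] + " " + text).strip()
        rows.append(dict(zip(header, row)))
    return rows
```

Everything after the header line is treated as table data, so in practice you'd also want some rule for where the table ends.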

It's kind of manky but it seems reliable for what I need. by bartread 1 minute ago

So many of these problems have already been solved by Mozilla's PDF.js together with its viewer implementation: https://mozilla.github.io/pdf.js/. by rad_gruchalski 8 minutes ago