Raleigh Public Record – DocHive
Via Raleigh Public Record, Inc.
To build a tool to extract structured data from image-based PDF files
The Record will release a new open-source program to help journalists turn PDF files into structured data. The new software will enable reporters to take an image containing data — say a scanned campaign finance return — and turn that into a spreadsheet.
This may sound boring, but it’s a problem that we at the Record have been trying to overcome for more than two years. The story started with Wake County campaign finance returns. The returns are filed as paper, and staff at the Wake County Board of Elections scan them in and put the images online. The problem is, the only way to view the data is to look at it page by page, and the only way to analyze it is to go through by hand and enter the data into a spreadsheet one row at a time.
We’re a small news organization; we don’t have the staff to do data entry for hundreds of pages of campaign finance information. We also don’t have the budget to hire some unfortunate college students to do it for us.
Edward Duncan, my brother and a full-time programmer, and I have been thinking about how to tackle this problem since 2010. We had been kicking ideas back and forth until Edward stumbled across this solution last summer.
The new program aims to pull the data from the documents and put it into a spreadsheet.
It’s called DocHive, and here’s how it works: the program uses XML, a markup language for describing structured data, to break a page up into smaller sections.
For example, in the campaign finance documents, it will make separate sections for donor name, occupation, donation amount and all the other fields. Then, it will take each of those sections and turn it into a separate image file. The software takes each small image and uses optical character recognition technology, known by the acronym OCR, to read the few words or numbers it contains and insert them into a text file.
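For readers who want to see the idea in code, here is a rough sketch of those steps in Python. It is not DocHive's actual code. The XML template layout, the field names, the helper functions and the `fake_ocr` stand-in are all assumptions made up for illustration, and the "page" is a small grid of characters rather than a real scanned image, so the example runs without any OCR software installed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical template: each <field> names one rectangle on the page.
# The element and attribute names are invented for this sketch.
TEMPLATE_XML = """
<template>
  <field name="donor_name" x="0" y="0" width="4" height="1"/>
  <field name="occupation" x="4" y="0" width="4" height="1"/>
  <field name="amount" x="0" y="1" width="3" height="1"/>
</template>
"""


def load_fields(xml_text):
    """Parse the template into a list of (name, (x, y, width, height))."""
    root = ET.fromstring(xml_text)
    fields = []
    for node in root.iter("field"):
        box = tuple(int(node.get(k)) for k in ("x", "y", "width", "height"))
        fields.append((node.get("name"), box))
    return fields


def crop(page, box):
    """Cut one rectangular section out of the page (a list of rows)."""
    x, y, w, h = box
    return [row[x:x + w] for row in page[y:y + h]]


def extract_row(page, fields, ocr):
    """Crop every field and pass each small section to the OCR function."""
    return {name: ocr(crop(page, box)) for name, box in fields}


def to_csv(rows, fieldnames):
    """Write the extracted rows as CSV text, ready for a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def fake_ocr(region):
    """Stand-in for a real OCR engine: the toy page already holds characters."""
    return "".join("".join(row) for row in region).strip()


if __name__ == "__main__":
    fields = load_fields(TEMPLATE_XML)
    # A toy two-line "page"; a real one would be a scanned image.
    page = [list("JaneChef"), list("250     ")]
    row = extract_row(page, fields, fake_ocr)
    print(to_csv([row], [name for name, _ in fields]))
```

In a real pipeline, `crop` would cut a region out of an image file and `fake_ocr` would be replaced by a call to an actual OCR engine. The template is kept separate from the code so that a different form layout only needs a new XML file, not a new program.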