
Re: OCR?

Posted: Mon Mar 18, 2024 1:41 am
by bamakojeff
Don't know if this is helpful or not, but I built an application (which runs on Windows) for a friend of mine in the publishing industry to create royalty reports for individual authors from a single giant PDF that contains all the data for thousands of authors. The LiveCode app shells out to pdfinfo.exe, pdftk.exe, and pdftotext.exe to do all the actual PDF processing. It converts each page of the PDF to text, reads the text off each page to determine where one author's report ends and another's begins, and then splits the massive PDF into separate PDF files, one per author's report, which can then be emailed to the appropriate recipient.

All these are standalone executables. I store them in a folder "below" the main app. So to get all the text off of page "tPageNum" from the pdf "sFile", I run:

Code:

-- Quoted path to the bundled pdftotext executable
put quote & specialFolderPath("resources") & slash & "library/pdftotext.exe" & quote into sPDFtotext
-- Extract only page tPageNum, keep the layout, and send the text to stdout ("-")
put shell(sPDFtotext && "-f" && tPageNum && "-l" && tPageNum && "-layout" && sFile && "-") into tShellResults

Now I have the text of that page in tShellResults.
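
A minimal sketch of how the page-by-page grouping described above might look in LiveCode. It is not the actual application code. The handler names (authorOnPage, reportRanges), their parameters, and the "Author:" line the sketch searches for are all hypothetical, since the real report layout isn't described in this thread. It also assumes the page count has already been obtained (for example from pdfinfo) and that both paths are already quoted.

```livecode
-- Hypothetical helper: returns the author named on a page, ASSUMING each
-- report page carries a line such as "Author: Jane Doe".
function authorOnPage pPageText
   local tLine, tTrimmed
   repeat for each line tLine in pPageText
      put word 1 to -1 of tLine into tTrimmed -- strip surrounding whitespace
      if tTrimmed begins with "Author:" then
         return word 2 to -1 of tTrimmed
      end if
   end repeat
   return empty
end authorOnPage

-- Returns one line per report: author, tab, first page, tab, last page.
-- pPDFtotext and pFile are expected to be quoted paths already.
function reportRanges pPDFtotext, pFile, pPageCount
   local tPage, tText, tAuthor, tCurrent, tFirst, tRanges
   repeat with tPage = 1 to pPageCount
      put shell(pPDFtotext && "-f" && tPage && "-l" && tPage && "-layout" && pFile && "-") into tText
      put authorOnPage(tText) into tAuthor
      -- A page with no author line, or the same author, continues the current report.
      if tAuthor is empty or tAuthor is tCurrent then next repeat
      -- A new author starts here, so close the previous report on the page before.
      if tCurrent is not empty then
         put tCurrent & tab & tFirst & tab & (tPage - 1) & return after tRanges
      end if
      put tAuthor into tCurrent
      put tPage into tFirst
   end repeat
   -- Close the final report at the last page of the file.
   if tCurrent is not empty then
      put tCurrent & tab & tFirst & tab & pPageCount after tRanges
   end if
   return tRanges
end reportRanges
```

Each range could then be handed to pdftk, whose cat operation takes page ranges in the form: pdftk in.pdf cat 3-5 output out.pdf. Note that this sketch ignores any pages that come before the first author line.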

I don't know for sure, but I expect that all these utilities also run on Mac. (They are all Linux utils originally, I believe.)

If that's helpful to you, I'm happy to share more about it. And if not, no worries. :-)

Jeff

Re: OCR?

Posted: Mon Mar 18, 2024 9:55 am
by stam
Hi Jeff, I presume your answer is directed to me?
Thanks if that’s the case, but the PDFs I had to work with were scans rather than documents converted to PDF, which makes PDF utilities useless - there is only picture data.

Having said that, circumstances have changed and this is now on hold indefinitely.

Re: OCR?

Posted: Mon Mar 18, 2024 10:35 am
by richmond62

Re: OCR?

Posted: Wed Mar 20, 2024 5:57 pm
by bamakojeff
Stam, I saw this thread come across the digest and didn't bother to look at the original posting date when I replied. :-)

If the project ever comes around again, I've had good luck using the open source version of Tesseract (https://github.com/tesseract-ocr/tesseract) for OCR on images.

Jeff

Re: OCR?

Posted: Wed Mar 20, 2024 8:16 pm
by stam
Thanks Jeff. But the results I posted at the start of this thread were from using Tesseract. Not great and certainly not usable in an automated process in a medical context sadly…