Is there a way to OCR incoming faxed PDFs to make them searchable?

Question

aaron11128 asked Aug 2, '18 John Wang Deactivated commented Aug 3, '18

Is there a way to OCR incoming PDFs that arrive by fax so that they become searchable?

fax

Answer 1 · 2018-08-02T17:46:34Z

John Wang Deactivated answered Aug 2, '18

You can do this by retrieving the PDF and running it through an OCR API or the open source Tesseract package.

One API that can be used is the Google Vision API:

https://cloud.google.com/vision/docs/pdf

The Tesseract open source OCR engine is generally considered one of the best open source solutions, if not the best:

https://github.com/tesseract-ocr/tesseract
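As a rough sketch of the Tesseract route, the script below assumes the fax PDF has already been downloaded to disk and that the `tesseract` and `pdftoppm` command-line tools are installed. `pdftoppm` (from Poppler) is not mentioned in this thread; it is used here only because Tesseract reads images rather than PDFs, so the pages need to be rasterized first. The function names, the 300 DPI setting, and the file names are illustrative assumptions.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG.

    300 DPI is an assumed starting point, not a recommendation from the thread.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def ocr_command(image_list_path, output_base, language="eng"):
    """Build the tesseract command that writes <output_base>.pdf with a text layer.

    The input is a text file listing one page image per line.
    """
    return ["tesseract", str(image_list_path), str(output_base), "-l", language, "pdf"]


def make_searchable(pdf_path, output_base, language="eng"):
    """Hypothetical helper: rasterize a faxed PDF, then OCR it into a searchable PDF."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")

    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_command(pdf_path, prefix), check=True)

        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        listing = Path(tmp) / "pages.txt"
        listing.write_text("\n".join(str(p) for p in pages) + "\n")

        subprocess.run(ocr_command(listing, output_base, language), check=True)

    return Path(f"{output_base}.pdf")
```

Calling `make_searchable("fax.pdf", "fax_searchable")` would then be expected to produce `fax_searchable.pdf`. OCR quality on low-resolution faxes varies, so it is worth checking the result on a few real messages.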

Answer 2 · 2018-08-03T01:34:52Z

Tyler Liu answered Aug 3, '18 John Wang Deactivated commented Aug 3, '18

Just want to mention that OCR is not the only way to extract text from a PDF. If the PDF's content is text rather than an image, you can use a library to extract the text directly. Search GitHub for "pdf to text".
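For illustration, here is a minimal sketch of direct text extraction. It assumes the `pdftotext` command-line tool from Poppler is installed; that tool is only one of many "pdf to text" options and is not named in this thread. The helper names and the 20-character threshold are hypothetical. The `needs_ocr` check is a simple heuristic for spotting PDFs that contain almost no extractable text.

```python
import shutil
import subprocess


def extract_text_command(pdf_path):
    """Build the pdftotext command; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat the PDF as image-only when it has almost no text.

    min_chars is an assumed threshold; whitespace and page breaks are ignored.
    """
    return len("".join(extracted_text.split())) < min_chars


def extract_text(pdf_path):
    """Hypothetical helper: return the text layer of a PDF, which may be empty."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        extract_text_command(pdf_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A caller could run `extract_text("fax.pdf")` first and fall back to OCR only when `needs_ocr(...)` returns `True`.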

John Wang ♦♦ commented · Aug 03, 2018 at 05:14 AM

The PDF content depends on what generates the PDF. If you use a program like MS Word to generate a PDF, the PDF can contain text content. A fax transmission, however, will typically result in a PDF that contains only an image and therefore requires OCR, because the fax transmission process sends pages as images.
