OCR and non-English characters.

Anonymous
Not applicable

I ran a test of the OCR functionality, see picture below. I wanted to document how the choice of PDF printer influences how the OCR works. All the uploaded documents were 100% identical, except for one special case where no non-English characters were to be recognized.

 

Test:

  • Uploading a drawing printed from Revit using FoxitPDF or Bluebeam was successful; the OCR recognized all the characters.
  • Uploading a drawing printed from AutoCAD using FoxitPDF or Bluebeam was successful as well.
  • Uploading a drawing printed from AutoCAD using the built-in DWG to PDF printer was unsuccessful if the fields to be recognized included non-English characters. I also tested an instance where the fields only included English characters, and then it worked as expected.

 

Conclusion:

The OCR does not seem to have a problem with recognizing non-English characters in drawings printed from AutoCAD or Revit, using Foxit or Bluebeam. 

However, using Autodesk's own DWG to PDF writer, fields that include non-English characters become unrecognizable. 

 

 

1. Are there any plans to fix this issue?

2. Has this behavior been documented using other PDF writers, such as Acrobat?

ocr test.jpg

 

ian_turner
Autodesk

Hello

Thank you for your post.

Would it be possible to get your files so we can take a look please?

 

We get the text in one of two ways.

If the drawing is vector, we don't use OCR; we just extract the text that is already in the file. (You can mimic this yourself on a vector drawing by selecting text and copy-pasting it into another document.)

In that case the language doesn't matter for extraction, because we just take whatever is there.

If the drawing is raster (a picture), we first need to use OCR to recognize the text and then extract it. Currently the OCR language is determined by the browser language setting.

So for non-English text you need to change your browser language to match the language of the text you want to extract.

(We are looking at improving that experience in the future, but no date has been set yet.)
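
As a rough illustration of the two paths described above, the choice could be sketched in Python like this. This is only a sketch and not Autodesk's actual implementation: the `Page` type, the `choose_extraction` function, and the language codes are hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Page:
    # Hypothetical model of an uploaded sheet: the embedded text of a
    # vector PDF page, or None when the page is only a raster image.
    embedded_text: Optional[str]


def choose_extraction(page: Page, browser_language: str) -> Tuple[str, Optional[str]]:
    """Return (method, ocr_language) for a page."""
    if page.embedded_text:
        # Vector page: the text is taken as-is, so no OCR language is involved.
        return ("extract", None)
    # Raster page: OCR is needed, and its language follows the browser setting.
    return ("ocr", browser_language)


print(choose_extraction(Page("Tegning A-101"), "en"))
print(choose_extraction(Page(None), "nb"))
```

Under these assumptions, only the raster path depends on the browser language, which is why a language mismatch would not be expected to affect a vector PDF.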

 

So it could be that your first tests were vector drawings, where non-English text worked fine, but the last test was raster, which made it fail.

Or we may have an issue to address...

 

Either way please send us the info and we will investigate.

 

Thanks, Ian Turner

 

Anonymous
Not applicable

Thank you for the reply.

 

My browser language was set to English in all cases.

 

Actually, the browser language didn't seem to matter in my case because the text was recognized from both vector and raster PDFs.

 

It was only with a PDF file prepared with the "DWG to PDF" printer in AutoCAD that the non-English characters were not recognized, which caused the whole string where they appeared to come out corrupted. Curiously, that particular PDF file is vector based.

 

I'll PM you a zip-file with all the files.

vincent.carignan
Contributor

Hi,

 

Thanks eij for the testing.

 

I ran into the same issue on multiple occasions with French characters. Could we get an update on that issue please?

 

I uploaded the same drawing set (all vector based) multiple times to different test projects to see how the results varied from one upload to another, and the problematic sheets were never the same. I then reprinted the PDF with a PDF printer (CutePDF), and that appeared to work, so eij's theory about the PDF exporter might be a good guess.

 

Thanks.

Anonymous
Not applicable

Thanks for the follow up,

 

I sent the files to Ian Turner in November but I haven't heard back from him. 

 

It would be good to get an update from Autodesk on this matter.

 

Thanks.

 

 

ian_turner
Autodesk

Hello Eij

I apologize if I am wrong, but I don't remember receiving any files.

How did you send them originally, and can you please resend them or attach them here?

Thanks, Ian

anil_mistry
Autodesk Support

Hi @Anonymous,

 

I’m just following up to see if you were able to gather the information @ian_turner had previously requested to assist him in troubleshooting your issue.

 

Thank you and have a great day!



Anil Mistry
Technical Support Specialist

Anonymous
Not applicable

Hi all,

 

I sent all the relevant files to Ian via private message last November, see screenshot below.

 

forum.jpg

 

 

 

Jason_Kai_Jiang
Autodesk

Hi @vincent.carignan,

 

Thank you very much for sharing the PDF files with us. The PDF file from which text cannot be extracted contains a font whose encoding is shown as "Encoding: Identity-H".

Encoding.jpg

 

We are investigating how to improve this scenario. In the meantime, until we find a proper solution, please use the "Built-in" encoding.

Build-in_Encoding.jpg
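
For anyone who wants to check a PDF for this encoding before uploading, here is a minimal Python sketch using only the standard library. It rests on two assumptions that are not confirmed in this thread: that the font dictionaries are stored uncompressed in the file (PDFs that pack objects into compressed object streams would need a real PDF parser), and that the extraction problem is tied to Identity-H fonts that lack a `/ToUnicode` map, which is a common cause of unextractable text but may not be the cause here. The function name `fonts_without_unicode_map` is made up for the example.

```python
import re

# Matches "N G obj ... endobj" and captures the object number and its body.
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.S)
IDENTITY_H = re.compile(rb"/Encoding\s*/Identity-H")


def fonts_without_unicode_map(pdf_bytes: bytes) -> list:
    """Return object numbers of Identity-H fonts with no /ToUnicode entry.

    Only sees objects stored as plain text in the file; objects inside
    compressed object streams are not inspected.
    """
    suspects = []
    for match in OBJ.finditer(pdf_bytes):
        body = match.group(3)
        if IDENTITY_H.search(body) and b"/ToUnicode" not in body:
            suspects.append(int(match.group(1)))
    return suspects


# Tiny hand-written sample standing in for the contents of a real PDF file.
sample = (
    b"1 0 obj\n<< /Type /Font /Subtype /Type0 /Encoding /Identity-H >>\nendobj\n"
    b"2 0 obj\n<< /Type /Font /Subtype /Type0 /Encoding /Identity-H"
    b" /ToUnicode 3 0 R >>\nendobj\n"
    b"4 0 obj\n<< /Type /Font /Subtype /TrueType /Encoding /WinAnsiEncoding >>\nendobj\n"
)
print(fonts_without_unicode_map(sample))
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function; an empty result does not prove the file is fine, given the limitations above.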

 

 

Reference:

https://forums.adobe.com/thread/758316

 

Thanks and best regards,

 

Jason Jiang

Autodesk Document Management team

Jason Jiang
Senior Product Manager


anil_mistry
Autodesk Support

Hi @vincent.carignan,

 

I'm just checking in to see if you need more help with this. Did the suggestion that @Jason_Kai_Jiang provided work for you?

If so, please click Accept as Solution on the posts that helped you so others in the community can find them easily.



Anil Mistry
Technical Support Specialist
vincent.carignan
Contributor

Hi @anil_mistry,

 

Depends on your definition of solution 😛 To be fair, since your main customers (contractors) are at the receiving end of the design, the proposed workaround is quite cumbersome: they either have to ask professionals to change the way they publish their drawings (which they might not do) or re-print the drawings on their side to be able to use the platform. I'll let you be the judge.

 

Also, I can't accept the suggestion as a solution as I'm not the one who submitted the initial question.

 

Thanks for the support.

Anonymous
Not applicable

Hello,

I agree with @vincent.carignan, the workaround is not really practical. 

 

Besides, I haven't found a way to change the encoding using the DWGtoPDF printer that ships with AutoCAD.

 

For now I think the only thing we can do is ask our clients/partners to use one of the PDF printers that we've found to work. As @vincent.carignan points out, they might not do that anyway, so this is definitely still an issue.

 

Thanks

 

 

Jason_Kai_Jiang
Autodesk

Hi @vincent.carignan and @Anonymous,

 

Thanks for your comments. We will continue to investigate how to resolve the encoding issue without requiring the PDF files to be reprinted.

 

Thanks and best regards,

 

Jason Jiang

Autodesk BIM 360 Document Management team

Jason Jiang
Senior Product Manager

