
Why does OCR separate Text into Words?

Serbring about 10 hours ago
Edited: dpb about an hour ago
Hi all,
I am trying to retrieve specific text from scanned documents that contain tables of numbers. Since the number of columns in the table can vary, I use the following approach:
1 - detect the units of measure with the ocr function,
2 - from the units I need (for example, kg/kW.h), compute a region of interest in which the ocr function is used again to retrieve the needed numbers.
This works fairly well, but the ocr function does not behave consistently. In some cases all the units are properly separated into individual Words, while in others several units are grouped into a single Word. The code below, which works with the attached data sample, shows the issue: the 16th element of txt1.Words contains '(kg/kW.h)(kW.h/)' rather than two Words (one for '(kg/kW.h)' and one for '(kW.h/)'), each with its own entry in WordBoundingBoxes. I do not understand why the units are sometimes separate Words and sometimes merged into one. Is it possible to control how the ocr function generates Words?
clear all
load('test.mat')
figure
imshow(I)
roi=[250.5 526 1300 142];
Iocr=insertShape(I,'rectangle',roi,'ShapeColor','blue');
hold on
imshow(Iocr)
txt1=ocr(I,roi,CharacterSet=".()kWrpmlhgh/");%,LayoutAnalysis='word');
UnitString=regexp(txt1.Words,'(?<=\()[\w\.\/]*(?=\))','match');
hasUnit=not(cellfun(@isempty,UnitString)); % mask must be computed before empty entries are removed
UnitBox=txt1.WordBoundingBoxes(hasUnit,:);
UnitString=UnitString(hasUnit);
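As a possible stopgap while the splitting behaviour is unexplained, a merged Word could be split after the fact. The following is only a minimal sketch, not part of the original code: the variable names and the box values are made up for illustration, and it assumes that every unit is wrapped in its own pair of parentheses and that all glyphs are roughly equally wide, so the bounding box can be apportioned by character count.

```matlab
% Hypothetical input: one merged OCR word and its [x y width height] box.
% The box values are invented for illustration only.
word = '(kg/kW.h)(kW.h/)';
box  = [400 530 320 40];

% Extract each parenthesized group with its start/end character positions.
[parts, s, e] = regexp(word, '\([^()]*\)', 'match', 'start', 'end');

% Apportion the box horizontally by character position
% (assumption: glyphs are roughly equally wide).
n = numel(word);
x = box(1) + (s(:) - 1) / n * box(3);
w = (e(:) - s(:) + 1) / n * box(3);
partBox = [x, repmat(box(2), numel(parts), 1), w, repmat(box(4), numel(parts), 1)];
```

Each row of partBox is an approximate box for the corresponding entry of parts. Since the widths are only estimates, some padding is probably needed before using them to place a region of interest.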
  3 Comments
Serbring about 3 hours ago
Hi,
Thanks for the message. On my computer the image displays with good quality; I am attaching a screenshot of the figure so that you can see it. The text is also recognized well, but it is not split well, and I have not understood what splitting criterion MATLAB's ocr function uses.
dpb about 3 hours ago
Edited: dpb about an hour ago
Again, I think you would have to provide both a "good" and a "bad" image for anyone to have a chance, since the cause is going to be whatever differs between the two in the particular region of interest.
See the Tips section of the ocr documentation for some hints about changing unexpected or unwanted behavior. Probably the only way to learn much more about the internals is through the references; it looks as though MathWorks is using the open-source implementation as their engine. I don't have the toolbox, so I can't try anything locally, but one last time: without the two images that behave differently, there is nothing anybody who comes by will be able to do.
BTW, it's been a long time since I've looked at the Nebraska tests, but they're a lot of fun to look at and very informative in a decision-making process when looking at a new (to the owner, not just brand-new) equipment purchase. We're in the SW corner of KS, and such data is very big here, although a final purchase decision may end up relying more on the quality and identity of the nearby dealerships and less on the test data themselves.


Answers (0)
