Why does OCR separate Text into Words?
30 views (last 30 days)
Show older comments
Hi all,
I am trying to retrieve specific text from scanned documents reporting tables of numbers. Since the table can change in the amount of column, I use the following approach:
1 - detection of the units of measure through OCR function,
2 - from the units I need (for example, kg/kW.h), calculation of a proper region of interest where OCR function is used to retrieve the needed numbers
This works rather fine but I do not obtain a consistent behaviour of OCR function. In particular, some cases, all the units are well separated into words by OCR function while in others they are grouped together in a single word. In the code below working with the attached data sample, you can see the issue. In particular, the 16th element of txt1.Words reports the units '(kg/kW.h)(kW.h/)' rather than having two Words (one for '(kg/kW.h)' and the other for '(kW.h/)') with their own WordBoundingBoxes. I do not understand why in some case, the units are in the same Word and in other they are bounded together in a single Word. Is it possible to control the generation process of Words in OCR function?
clear all
load('test.mat')
figure
imshow(I)
roi=[250.5 526 1300 142];
Iocr=insertShape(I,'rectangle',roi,'ShapeColor','blue');
hold on
imshow(Iocr)
txt1=ocr(I,roi,CharacterSet=".()kWrpmlhgh/");%,LayoutAnalysis='word');
UnitString=regexp(txt1.Words,'(?<=\()[\w\.\/]*(?=\))','match');
UnitString(cellfun(@isempty,UnitString))=[];
UnitBox=txt1.WordBoundingBoxes(not(cellfun(@isempty,UnitString)),:);
3 Comments
dpb
ungefär en timme ago
I think again you would have to provide both a "good" and a "bad" image for folks to have any chance whatever, as it's going to be related to whatever is different between the two in the particular region of interest.
BTW, it's been a long time since I've looked at the Nebraska tests, but they're a lot of fun to look at and very informative in a decision-making process if looking at new (to owner, not just brand new) equipment purchase. We're in the SW corner of KS and such is very big data here although a final purchase decision may end up relying more upon the quality and who are the nearby dealerships and less on the test data temselves.
Answers (0)
See Also
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!