Regexp to extract standalone numbers from string

Hello,
I'm trying to extract numbers from a txt file that contains tables whose elements are separated by varying amounts of white space.
The content might look like the example below, with a variable number of rows and columns. However, the number of "free" numbers is always the same.
To get the file into MATLAB, I read it line by line using fgetl.
str{1,1} = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
str = 1×1 cell array
{'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'}
My goal is to extract only the numbers that are not part of a text string. That would be 21, 202, 203.02, -204.001, 1, 01, so both numbers with and without a decimal point.
I've played a bit with regexp patterns, and the closest I get is:
rxpPat = '\d+\.?\d*';
regexp(str{1,1},rxpPat,'match')
ans = 1×14 cell array
{'0123'} {'21'} {'20'} {'00'} {'200.1'} {'21'} {'2222'} {'202'} {'203.02'} {'204.001'} {'2'} {'31'} {'1'} {'01'}
The problem is that it also catches the digits inside tokens such as X?YYx0123, which distorts my result.
Do you have an idea how I can approach the problem?

Accepted Answer

Cris LaPierre on 11 Dec 2022
Borrowing heavily from this answer and this doc page.
str{1,1} = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -';
regexp(str{1,1},'(?<=\s)[+-]?\d+\.?\d*(?=\s)', 'match')
ans = 1×6 cell array
{'21'} {'202'} {'203.02'} {'-204.001'} {'1'} {'01'}
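For readers unfamiliar with lookaround assertions, the pattern above can be assembled piece by piece. This is only an illustrative sketch: the variable names (lookBehind, signPart, and so on) are made up for this breakdown, and the pattern itself is unchanged.

```matlab
str = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -';

% Hypothetical names, used only to label the parts of the pattern
lookBehind = '(?<=\s)';   % whitespace must precede the number (not included in the match)
signPart   = '[+-]?';     % optional leading sign
intPart    = '\d+';       % one or more digits
fracPart   = '\.?\d*';    % optional decimal point, then optional digits
lookAhead  = '(?=\s)';    % whitespace must follow the number (not included in the match)

pat    = [lookBehind, signPart, intPart, fracPart, lookAhead];
tokens = regexp(str, pat, 'match');   % cell array of char vectors
values = str2double(tokens);          % numeric row vector
```

Because both assertions require a whitespace character, a number at the very start or very end of the string is not matched by this pattern.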
  2 Comments
Dan on 12 Dec 2022
That seems to be working fine for what I need. Thanks!
Walter Roberson on 12 Dec 2022
str{1,1} = '404 X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - - 92';
christ = regexp(str{1,1},'(?<=\s)[+-]?\d+\.?\d*(?=\s)', 'match')
christ = 1×6 cell array
{'21'} {'202'} {'203.02'} {'-204.001'} {'1'} {'01'}
wdr = str2double(regexp(str{1,1}, '(?<=^|\s)[+-]?\d+(\.\d*)?(?=\s|$)', 'match'))
wdr = 1×8
404.0000 21.0000 202.0000 203.0200 -204.0010 1.0000 1.0000 92.0000
That is, the version Cris posted does not find the numbers if they are first or last in the string, but the version I posted in my Answer does.


More Answers (4)

Steven Lord on 11 Dec 2022
I wouldn't use regexp here. I'd use string, strsplit, and double.
S = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
S = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
S = string(S);
parts = strsplit(S, ' ')
parts = 1×15 string array
"X?YYx0123" "[un]" "21ZZz20AaaB00" "A200.1" "21" "Xx2222" "202" "203.02" "-204.001" "A(2)" "B(V31)" "1" "01" "-" "-"
Because we converted S from a char vector into a string array above, we can use double to turn those elements of parts that are the text representation of valid numbers into those numbers while turning the other strings into NaN. If we'd left them as a char array we'd get the values of the characters that make up the text representations of those numbers, not the numbers themselves.
notWhatWeWant = double(char(parts(5))) % double('21') is not 21
notWhatWeWant = 1×2
50 49
D = double(parts) % double("21") is 21
D = 1×15
NaN NaN NaN NaN 21.0000 NaN 202.0000 203.0200 -204.0010 NaN NaN 1.0000 1.0000 NaN NaN
Now just remove the NaN values. This does assume that NaN is not a valid numeric value in your string that you want to extract.
validparts = D(~isnan(D))
validparts = 1×6
21.0000 202.0000 203.0200 -204.0010 1.0000 1.0000

Voss on 11 Dec 2022
Edited: Voss on 11 Dec 2022
Very similar to Steven Lord's answer, but using str2double() instead of converting to string and using double():
str{1,1} = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
str = 1×1 cell array
{'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'}
D = str2double(strsplit(str{1,1}));
D = D(~isnan(D))
D = 1×6
21.0000 202.0000 203.0200 -204.0010 1.0000 1.0000
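Since the question mentions reading the file line by line with fgetl, here is a rough sketch of how the split-and-convert approach might be applied to every line. The temporary file and its second row are invented purely for the demonstration, and stacking the rows with vertcat relies on the statement in the question that every line holds the same number of standalone values.

```matlab
% Write a small demo file (hypothetical content; the second row is made up)
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -');
fprintf(fid, '%s\n', 'Q7 [un] 9ZZ A1.5 22 Xx1 302 303.5 -304.25 A(3) B(V32) 2 02 - -');
fclose(fid);

% Read the file line by line and keep only tokens that convert to numbers
fid = fopen(fname, 'r');
rows = {};
tline = fgetl(fid);
while ischar(tline)                      % fgetl returns -1 at end of file
    vals = str2double(strsplit(strtrim(tline)));
    vals = vals(~isnan(vals));           % drop tokens that are not plain numbers
    if ~isempty(vals)
        rows{end+1} = vals;              %#ok<AGROW>
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);

M = vertcat(rows{:});                    % one row of values per line of the file
```

One caveat: str2double also converts tokens such as Inf, 1e3, or i, so if text like that can appear in the file, the regexp approach gives tighter control.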

Image Analyst on 11 Dec 2022
I don't understand what the problem is. What's wrong with getting the numbers from X?YYx0123?
By the way, here is the new way to get numbers:
str{1,1} = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
str = 1×1 cell array
{'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'}
pat = digitsPattern
pat = pattern
Matching: digitsPattern
numbers = extract(str{1,1}, pat)
numbers = 17×1 cell array
{'0123'} {'21' } {'20' } {'00' } {'200' } {'1' } {'21' } {'2222'} {'202' } {'203' } {'02' } {'204' } {'001' } {'2' } {'31' } {'1' } {'01' }

Walter Roberson on 11 Dec 2022
Edited: Walter Roberson on 11 Dec 2022
format short
S = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
S = 'X?YYx0123 [un] 21ZZz20AaaB00 A200.1 21 Xx2222 202 203.02 -204.001 A(2) B(V31) 1 01 - -'
D = str2double(regexp(S, '(?<=^|\s)[+-]?\d+(\.\d*)?(?=\s|$)', 'match'))
D = 1×6
21.0000 202.0000 203.0200 -204.0010 1.0000 1.0000
  • This supports an optional positive or negative sign
  • This supports the possibility that the value is an integer with no decimal point
  • This supports the possibility that the value has a decimal point but there are no digits after the decimal point
  • This specifically checks for whitespace (or the start or end of the string) before and after the number, so the A200.1 would not be matched. But that also means that a comma directly after a number is not supported.
  • This does not support exponent notation with d or D or e or E, and with an optional + or - before the exponent value
  • This does not support numbers starting directly with the decimal point without a 0 before the decimal point, such as .2
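If the last two limitations matter, the pattern could be extended along the following lines. This is an untested-in-production sketch, not part of the answer above: it adds an alternative for a leading decimal point and an optional e/E exponent, and it deliberately leaves out d/D exponents. The test string is invented for illustration.

```matlab
% Hypothetical input containing the extra number formats
S = '.5 1e3 -2.5E-2 A200.1 7. x1e2 42';

% mantissa: digits with optional fraction, OR a leading decimal point with digits
% exponent: optional e/E, optional sign, digits
pat = '(?<=^|\s)[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(?=\s|$)';

tokens = regexp(S, pat, 'match');   % A200.1 and x1e2 are skipped
D = str2double(tokens);
```

The lookbehind and lookahead are the same as in the pattern above, so values glued to letters or followed directly by a comma are still rejected.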
