You are now following this question
- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.
How do I find the original indices of a text array after adding new elements?
106 views (last 30 days)
Show older comments
I'm working a project which reads a text file and searches for user-defined phrases. At first glance this sounds rather easy, but due to complexities such as having phrases wrap around lines (with newlines between search words) or having multiple spaces between words, I found I have to search in several different ways.
I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search. I keep a record of the indices of added spaces and removed newlines, all relative to the updated text array. I then map each index of the updated array to the index of the original text array using two loops (one for the added spaces, one for the removed newlines). Note that I first remove any newlines, then add spaces to create the updated search text. I believe the order is important.
I use regexp to search the updated text and return findings and starting indices. These starting indices of findings can then be easily mapped to the original text location, which is what I need to output.
My code works, but because of the loops, doing the initial index mapping takes a very long time for large text files (>20min for a 1 Mb file).
I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else.
Here's the relevant mapping loops. Note that SearchText, OriginalText, AddedSpaces and DeletedNewLines are inputs from the calling function.
SearchTextMap = 1:length(SearchText);
for spaceInd = 1:length(AddedSpaces)
AddedSpaceInd = AddedSpaces(spaceInd);
SearchTextMap(AddedSpaceInd:end) = SearchTextMap(AddedSpaceInd:end) - 1;
end
for newlineInd = 1:length(DeletedNewLines)
DeletedNewLineInd = DeletedNewLines(newlineInd);
SearchTextIndex = find(SearchTextMap == DeletedNewLineInd);
SearchTextMap(SearchTextIndex:end) = SearchTextMap(SearchTextIndex:end)+1;
end
Any help would be greatly appreciated.
20 Comments
Paul
on 30 Sep 2025 at 0:33
Can you provide a small sample of text and a few phrases to look for? Best would be to save those in a .mat file and attach that file to your question, or a new comment, using the Paperclip icon in the Insert portion of the tool strip.
Stephen23
on 30 Sep 2025 at 1:02
"I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else."
ARRAYFUN would likely be slower than a well-written FOR-loop.
"Any help would be greatly appreciated."
As Paul stated, it would help us if you provided some examples of the text and the phrases. Provide these in their native format and do not modify them to make them "simpler" for us. Upload them as the original text file or in a MAT file.
dpb
on 30 Sep 2025 at 15:28
It would seem more fruitful approach would be to refine the pattern searching algorithms to handle the spaces and newlines rather than try to modify the original text to match some pattern.
A real regexp guru could undoubtedly design something that would fit your needs; that would not be me, however.
If you aren't a whizard yourself, I would point you to the relatively recent pattern function that lets you build various search patterns.
Without a little more background on just what input and matching universes might consist of, my initial thought might be to do a recursive search that could begin by finding partial matches and then refining those to discard incomplete total matches.
Stephen23
on 30 Sep 2025 at 17:37
Edited: Stephen23
on 30 Sep 2025 at 17:42
"It would seem more fruitful approach would be to refine the pattern searching algorithms to handle the spaces and newlines rather than try to modify the original text to match some pattern."
Without knowing the details of the OP's data, one simple approach would be to replace spaces with \s+:
txt = sprintf('%s\n','xxx yyy hello','world zzz') % hello world separated by newline
txt =
'xxx yyy hello
world zzz
'
pat = 'hello world';
rgx = regexprep(pat,'\s+','\\s+'); % assumes no special characters in PAT.
[idx,idy] = regexp(txt,rgx) % index where "hello world" starts and ends
idx = 9
idy = 19
Art
on 30 Sep 2025 at 18:27
My apologies, I'm developing this on a standalone computer so it's time consuming to copy code/examples here. But I think my question boils down to one specific item:
If I have an array: StartArray = [1 2 3 4 5 6 7 8 9];
and I have a set of indices that need to be added to the array: AddedSpaceInds = [1 6 12 13];
how do I insert the indices specified by AddedSpaceInds while simultaneously shifting the existing data by one index, resulting in a new array: [0 1 2 3 4 4 5 6 7 8 9 9 9].
Note that the value of the DATA at each added index is the (original DATA value of StartArray at that index) - 1, or if the added index value is > the length of StartArray, it's the value of StartArray(end).
I can accomplish this with the following loop, which builds a new array TextMapTemp from scratch, but it's time consuming.
Any ideas to make this loop more efficient or do this without a loop at all?
StartArray = [1 2 3 4 5 6 7 8 9];
AddedSpaceInds = [1 6 12 13];
% preallocate a temp array the full array size:
TextMapTemp = zeros(1,length(StartArray)+length(AddedSpaceInds));
% loop through each temp index, determine its value based on the
% value of StartArray and whether a space has been added at this index:
StartArrayInd = 1;
for ind = 1:length(TextMapTemp)
if ismember(ind, AddedSpaceInds)
if StartArrayInd > length(StartArray)
TextMapTemp(ind) = StartArray(end);
else
TextMapTemp(ind) = StartArray(StartArrayInd) - 1;
end
else
TextMapTemp(ind) = StartArray(StartArrayInd);
StartArrayInd = StartArrayInd + 1;
end
end
Art
on 30 Sep 2025 at 19:44
Edited: Art
on 30 Sep 2025 at 19:48
And Stephen23, 'pat' is something the user inputs, it can be any word/phrase which may or may not include standard regexp wildcards like "\s+".
I take the user input search term, read the user supplied input file, then search the file text as-is, and with various combinations of adding spaces/removing newlines from the file text. Using regexp on just the original text can miss many instances of a search phrase, e.g. regexp only finds one instance of the word "test" in this text (due to the wildcards):
SearchText = 'xyz xyz test test xyz'
pat = '[^a-zA-Z]test[^a-zA-Z]'
[idx, idy] = regexp(SearchText,pat);
idx = 8
idy = 13
Stephen23
on 30 Sep 2025 at 19:45
"Any ideas to make this loop more efficient or do this without a loop at all?"
Convert the pattern into a regular expression.
Stephen23
on 30 Sep 2025 at 20:25
Edited: Stephen23
on 30 Sep 2025 at 20:32
"e.g. regexp only finds one instance of the word "test" in this text (due to the wildcards)"
I guess you mean the metacharacter square brackets.
In any case, if the user is capable of defining a regular expression like that then they must be aware that their pattern matches not just the literal "test" but also two non-letter characters on either side of that literal text. So the match does not miss anything: once "_text_" is matched (where the underscore stands in for any non-letter) then the following "test" in your example does not have any leading non-letter character to match, it has already been consumed by the first match. In short, it does not match because that is exactly what that user requested.
So it appears that you want to override what the user specifies: what are your specific requirements (given that they are not those of the user nor of regular expressions): how do you decide which parts of the user's regular expression to ignore? Which characters or types of characters? In what combinations?
Do the users really know how regular expressions work? Are they aware that their pattern will miss matching the first word? How badly-formed do you want to allow the patterns to be?
Why can the user not just use \< and \> ? Or lookarounds ? Then you could avoid all of this bother.
Would some user pattern hints and input restrictions be a viable route ? The task still seems rather ill-defined.
"Any ideas to make this loop more efficient or do this without a loop at all?"
I don't see any obvious changes that would significantly improve the efficiency.
Art
on 30 Sep 2025 at 20:33
ok, "no obvious changes that would significantly improve the efficiency" is probably the answer to the question.
Thanks all for taking the time to help. I'll review the pattern function and dive into regexp to see if I can come up with anything else. I'll update here if I find anything noteworthy!
Stephen23
on 30 Sep 2025 at 21:02
There must be some limits to what we are prepared to accept from the user. For example, if we assume that they only specify up to one non-letter at both ends of their badly-formed regular expression:
pat = '[^a-zA-Z]hello world[^a-zA-Z]'; % badly-formed user pattern
txt = sprintf('%s\n','aaa bbb ccc','xxx yyy hello','world hello world','www hello world.')
txt =
'aaa bbb ccc
xxx yyy hello
world hello world
www hello world.
'
numel(txt)
ans = 61
rgx = regexprep(pat,'\s+','\\v+');
tmp = regexprep(txt,'\s','\v\v\v');
[idx,idy] = regexp(tmp,rgx);
ixv = regexp(tmp,'\v');
[~,idb] = ismember(idx,ixv);
idx = idx - 2*idb/3
idx = 1×3
20 32 48
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
[~,ide] = ismember(idy,ixv);
idy = idy - 2*idb/3 - 2
idy = 1×3
32 44 60
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
Avoiding this situation would be better: e.g. restrict what the user can provide for the pattern.
Art
on 30 Sep 2025 at 21:57
Restricting inputs could be the best solution. However, if the user wants to find every instance of "Hello World" in your example, but NOT find any instance of "Hello Worlds", then they would likely include the metacharacter brackets -and at this point there's nothing I could do about it! :(
Thanks again
dpb
on 30 Sep 2025 at 22:26
Perhaps you make a logical tree for user input that defines what are looking for and then create the regexp expression instead of letting them try to write regexp expressions themselves?
Stephen23
on 1 Oct 2025 at 4:37
Edited: Stephen23
on 1 Oct 2025 at 7:43
What you attempting to write here is a Do-What-I-Want-And-Not-What-I-Tell-You engine: to ignore what the user explicitly tells REGEXP to do and override it with something else. But this gives you two essentially unsolvable problems to solve: 1) unravel what the user explicitly requested, categorising parts of it as wanted and other parts as unwanted, followed by 2) perfom the text matching based on what you believe they wanted...
In your example you essentially want to ignore the explicit request by the user to match and consume one non-letter at both ends of some literal text. But what happens if the user provide this pattern, where they explicitly match two non-letters at both ends:
pat = '[^a-zA-Z]{2}test[^a-zA-Z]{2}'
Would those get ignored too? What about three non-letters? Or ten? It seems to me that this task is ill-defined.
I suspect that restricting what patterns the user can input (either input checking or via some regex-creating tool as dpb suggested) might be a more robust approach.
dpb
on 1 Oct 2025 at 13:18
Edited: dpb
on 1 Oct 2025 at 17:23
"...doing the initial index mapping takes a very long time for large text files"
As Knuth pointed out during a colloquium at ORNL many years ago when asked a similar question about a particular problem a lab employee had, the most time saving code construct is to not do what isn't needed.
@Stephen23 has a very good characteerization of the problem, in concert with that, if one is adamant about doing a more general search trying to account for possible irregularities in the text being searched, it would be far more efficient to make changes in the search pattern rather than modifiy the large text.
dpb
on 2 Oct 2025 at 16:24
Edited: dpb
on 2 Oct 2025 at 18:14
"I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search...."
It should be possible to directly calcuate the change in position from the original text location without any looping, or at least without physically rearranging the text in the loop.
For the first substitution/replacement of the newline with a blank, the position is the same if the initial text uses only the newline character so that is known.
If you remove redundant blanks, the length of the text is shortened by the number removed so the location of the newline within the original text is that number past the present.
The location of each substring past each removal is just one more than the prior up to the number in the original line; so the offset to the word after the removal location is the cumsum up to that point of the number removed.
While I'm sure the above would work, the better way would still be to make the search pattern ignore the amount of white space between words in the phrase unless you build in a way for the user to explicitly say to only match identical character strings. But if the idea is context despite formatting, the above should work reasonably simply.
BTW, you may find <Run_Length> by @Jan from FEX useful in the rearrangement parsing. It would start with the input text as a char() vector.
Art
on 2 Oct 2025 at 18:32
Lots of good feedback for this question, thanks.
I took @Stephen23's suggestion and limited the user metacharacter inputs to whitespace, digits, not a digit, not a word char: \s+, \d+, \D, [^a-zA-Z], [^a-zA-Z]+
@dpb your suggestion to remove redundant blanks gave me the answer that replaces the loop that handles deleted newlines indices, which basically halves the time. Now I'm down to one loop that updates TextMapTemp based on the number of added space indices. I guess this is about the best I can do timewise.
Art
on 3 Oct 2025 at 16:36
Ok, good points above. I looked into \< \> but I'm not sure they'd work for all user input cases, I'll check them out further.
After trying to understand how to compute the location offsets as stated above, I came up with this (after plumbing through some additional variables like OriginalText):
% Define an initial array the size of the original text array with each original index listed in order:
OrigTextMap = 1:length(OriginalText);
% Remove any deleted newline indices:
OrigTextMap(DeletedNewLines) = [];
% Initialize the map array the length of the updated text array (after any newlines are removed and spaces added):
TextMap = zeros(size(SearchText));
% Set all TextMap indices that = added space indices to NaNs:
TextMap(AddedSpaces) = NaN;
% Set remaining indices in TextMap to the indices in OrigTextMap.
NotNamInds = ~isnan(TextMap);
TextMap(NotNanInds) = TextMap;
I believe this does what I need without the loops, I just have to account for the NaNs. Thanks for helping me walk through the logic!
dpb
ungefär 24 timmar ago
So how much of a time reduction have you accomplished so far...inquiring minds, and all that! <g>
Answers (0)
See Also
Categories
Find more on Characters and Strings in Help Center and File Exchange
Tags
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!An Error Occurred
Unable to complete the action because of changes made to the page. Reload the page to see its updated state.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom(English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)