How do I find the original indices of a text array after adding new elements?
I'm working on a project which reads a text file and searches for user-defined phrases. At first glance this sounds rather easy, but due to complexities such as phrases wrapping around lines (with newlines between search words) or multiple spaces between words, I found I have to search in several different ways.
I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search. I keep a record of the indices of added spaces and removed newlines, all relative to the updated text array. I then map each index of the updated array to the index of the original text array using two loops (one for the added spaces, one for the removed newlines). Note that I first remove any newlines, then add spaces to create the updated search text. I believe the order is important.
I use regexp to search the updated text and return findings and starting indices. These starting indices of findings can then be easily mapped to the original text location, which is what I need to output.
My code works, but because of the loops, the initial index mapping takes a very long time for large text files (>20 min for a 1 MB file).
I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else.
Here are the relevant mapping loops. Note that SearchText, OriginalText, AddedSpaces and DeletedNewLines are inputs from the calling function.
SearchTextMap = 1:length(SearchText);
for spaceInd = 1:length(AddedSpaces)
    AddedSpaceInd = AddedSpaces(spaceInd);
    SearchTextMap(AddedSpaceInd:end) = SearchTextMap(AddedSpaceInd:end) - 1;
end
for newlineInd = 1:length(DeletedNewLines)
    DeletedNewLineInd = DeletedNewLines(newlineInd);
    SearchTextIndex = find(SearchTextMap == DeletedNewLineInd, 1); % first matching position
    SearchTextMap(SearchTextIndex:end) = SearchTextMap(SearchTextIndex:end) + 1;
end
Any help would be greatly appreciated.
20 Comments
Paul
on 30 Sep 2025
Can you provide a small sample of text and a few phrases to look for? It would be best to save those in a .mat file and attach that file to your question, or to a new comment, using the Paperclip icon in the Insert portion of the toolstrip.
Stephen23
on 30 Sep 2025
"I'm hoping someone can help me figure out how to do the array mapping without the loops, maybe with arrayfun or something else."
ARRAYFUN would likely be slower than a well-written FOR-loop.
"Any help would be greatly appreciated."
As Paul stated, it would help us if you provided some examples of the text and the phrases. Provide these in their native format and do not modify them to make them "simpler" for us. Upload them as the original text file or in a MAT file.
dpb
on 30 Sep 2025
It would seem a more fruitful approach would be to refine the pattern-searching algorithms to handle the spaces and newlines, rather than trying to modify the original text to match some pattern.
A real regexp guru could undoubtedly design something that would fit your needs; that would not be me, however.
If you aren't a wizard yourself, I would point you to the relatively recent pattern function that lets you build various search patterns.
Without a little more background on just what input and matching universes might consist of, my initial thought might be to do a recursive search that could begin by finding partial matches and then refining those to discard incomplete total matches.
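As a sketch of that approach (assuming R2020b or later for pattern objects; the sample text and phrase here are made up):

```matlab
txt = sprintf('%s\n', 'xxx yyy hello', 'world zzz');  % phrase wraps across a line break
pat = "hello" + whitespacePattern + "world";          % any run of whitespace between words
idx = strfind(txt, pat)                               % start index of each match in TXT
```

whitespacePattern matches one or more whitespace characters (spaces, newlines, carriage returns), so the same pattern finds the phrase whether the words are separated by spaces or a line break.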
"It would seem more fruitful approach would be to refine the pattern searching algorithms to handle the spaces and newlines rather than try to modify the original text to match some pattern."
Without knowing the details of the OP's data, one simple approach would be to replace spaces with \s+:
txt = sprintf('%s\n','xxx yyy hello','world zzz') % hello world separated by newline
pat = 'hello world';
rgx = regexprep(pat,'\s+','\\s+'); % assumes no special characters in PAT.
[idx,idy] = regexp(txt,rgx) % index where "hello world" starts and ends
Art
on 30 Sep 2025
Stephen23
on 30 Sep 2025
"Any ideas to make this loop more efficient or do this without a loop at all?"
Convert the pattern into a regular expression.
"e.g. regexp only finds one instance of the word "test" in this text (due to the wildcards)"
I guess you mean the metacharacter square brackets.
In any case, if the user is capable of defining a regular expression like that then they must be aware that their pattern matches not just the literal "test" but also two non-letter characters on either side of that literal text. So the match does not miss anything: once "_test_" is matched (where the underscore stands in for any non-letter), the following "test" in your example does not have any leading non-letter character to match; it has already been consumed by the first match. In short, it does not match because that is exactly what that user requested.
So it appears that you want to override what the user specifies: what are your specific requirements (given that they are not those of the user nor of regular expressions): how do you decide which parts of the user's regular expression to ignore? Which characters or types of characters? In what combinations?
Do the users really know how regular expressions work? Are they aware that their pattern will miss matching the first word? How badly-formed do you want to allow the patterns to be?
Why can the user not just use \< and \> ? Or lookarounds ? Then you could avoid all of this bother.
Would some user pattern hints and input restrictions be a viable route ? The task still seems rather ill-defined.
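For reference, both suggestions applied to the earlier "test" example (the sample text here is made up):

```matlab
txt = 'test testing a test.';
% Word boundaries: \< matches the start of a word, \> the end, and
% neither consumes a character:
idx1 = regexp(txt, '\<test\>')                       % [1 16]
% Lookarounds: assert a non-letter on either side without consuming it:
idx2 = regexp(txt, '(?<![a-zA-Z])test(?![a-zA-Z])')  % [1 16]
```

Because neither construct consumes the boundary characters, adjacent matches are not swallowed the way they are with `[^a-zA-Z]test[^a-zA-Z]`.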
"Any ideas to make this loop more efficient or do this without a loop at all?"
I don't see any obvious changes that would significantly improve the efficiency.
Art
on 30 Sep 2025
There must be some limits to what we are prepared to accept from the user. For example, if we assume that they only specify up to one non-letter at both ends of their badly-formed regular expression:
pat = '[^a-zA-Z]hello world[^a-zA-Z]'; % badly-formed user pattern
txt = sprintf('%s\n','aaa bbb ccc','xxx yyy hello','world hello world','www hello world.')
numel(txt)
rgx = regexprep(pat,'\s+','\\v+');  % whitespace in the pattern -> one or more vertical tabs
tmp = regexprep(txt,'\s','\v\v\v'); % each whitespace in the text -> three vertical tabs,
                                    % so a boundary character consumed by one match cannot
                                    % hide an adjacent, overlapping match
[idx,idy] = regexp(tmp,rgx);
ixv = regexp(tmp,'\v');             % positions of all vertical tabs in TMP
[~,idb] = ismember(idx,ixv);        % rank of each match start among them
idx = idx - 2*idb/3                 % map starts back to TXT (two extra chars per whitespace)
idy = idy - 2*idb/3 - 2             % map ends back (assumes one whitespace inside each match)
Avoiding this situation would be better: e.g. restrict what the user can provide for the pattern.
Art
on 30 Sep 2025
dpb
on 30 Sep 2025
Perhaps you could make a logical tree for user input that defines what they are looking for, and then create the regexp expression instead of letting them try to write regexp expressions themselves?
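One way to sketch that idea (the phrase here is made up): take a literal phrase from the user and build the regular expression for them, so they never write regex syntax at all. regexptranslate escapes any regex metacharacters in the literal text:

```matlab
phrase = 'hello world';                    % literal user input, not a regex
rgx = regexptranslate('escape', phrase);   % escape any regex metacharacters
rgx = regexprep(rgx, '\s+', '\\s+');       % whitespace matches any whitespace run
rgx = ['\<' rgx '\>'];                     % whole-word boundaries, nothing consumed
idx = regexp(sprintf('say hello\nworld!'), rgx)  % finds the phrase across the newline
```

This sidesteps the badly-formed-pattern problem entirely, since the only thing the user supplies is plain text.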
What you are attempting to write here is a Do-What-I-Want-And-Not-What-I-Tell-You engine: one that ignores what the user explicitly tells REGEXP to do and overrides it with something else. But this gives you two essentially unsolvable problems: 1) unravel what the user explicitly requested, categorising parts of it as wanted and other parts as unwanted, followed by 2) perform the text matching based on what you believe they wanted...
In your example you essentially want to ignore the explicit request by the user to match and consume one non-letter at both ends of some literal text. But what happens if the user provides this pattern, where they explicitly match two non-letters at both ends:
pat = '[^a-zA-Z]{2}test[^a-zA-Z]{2}'
Would those get ignored too? What about three non-letters? Or ten? It seems to me that this task is ill-defined.
I suspect that restricting what patterns the user can input (either input checking or via some regex-creating tool as dpb suggested) might be a more robust approach.
"...doing the initial index mapping takes a very long time for large text files"
As Knuth pointed out during a colloquium at ORNL many years ago when asked a similar question about a particular problem a lab employee had, the most time saving code construct is to not do what isn't needed.
@Stephen23 has a very good characterization of the problem. In concert with that, if one is adamant about doing a more general search that tries to account for possible irregularities in the text being searched, it would be far more efficient to make changes in the search pattern rather than modify the large text.
"I do combinations of things like adding spaces and removing newlines/carriage returns from the original text to create an updated text array to search...."
It should be possible to directly calculate the change in position from the original text location without any looping, or at least without physically rearranging the text in the loop.
For the first substitution/replacement of the newline with a blank, the position is the same if the initial text uses only the newline character so that is known.
If you remove redundant blanks, the length of the text is shortened by the number removed so the location of the newline within the original text is that number past the present.
The location of each substring past each removal is just one more than the prior, up to the number in the original line; so the offset to the word after the removal location is the cumsum, up to that point, of the number removed.
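The cumsum idea above can be sketched against the loops in the question (an illustration only, using the question's variable names, and assuming AddedSpaces and DeletedNewLines are sorted ascending as the sequential loops imply):

```matlab
N   = numel(SearchText);
% First loop: each added space decrements every map entry at or after it,
% so the total decrement at position i is the number of AddedSpaces <= i:
dec = cumsum(accumarray(AddedSpaces(:), 1, [N,1])).';
SearchTextMap = (1:N) - dec;
% Second loop: the j-th deleted newline (value d(j)) is looked up in a map
% that earlier iterations have already incremented j-1 times beyond their
% shift points, so its threshold against the pre-shift values is d(j)-(j-1):
d   = DeletedNewLines(:).';
thr = d - (0:numel(d)-1);
SearchTextMap = SearchTextMap + sum(thr.' <= SearchTextMap, 1);
```

The last line builds a numel(thr)-by-N logical array via implicit expansion (R2016b+); if that is too much memory for very large files, the same counts can be obtained in a single sorted pass, since both thr and the map values are nondecreasing.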
While I'm sure the above would work, the better way would still be to make the search pattern ignore the amount of white space between words in the phrase, unless you build in a way for the user to explicitly say to match only identical character strings. But if the idea is to match content despite formatting, the above should work reasonably simply.
BTW, you may find Run_Length by @Jan on the File Exchange useful in the rearrangement parsing. It would start with the input text as a char() vector.
Art
on 2 Oct 2025
Art
on 3 Oct 2025
dpb
on 3 Oct 2025
So how much of a time reduction have you accomplished so far...inquiring minds, and all that! <g>
Answers (0)