How to extract part of a text file in MATLAB?

Okay, so I have opened some XML files and want to extract the relevant text stored in them. The relevant text starts after a certain string of characters in each file, so I tried to use an if statement to extract the text from that point until another marker is reached. That way I would skip most of the meaningless text and get only the text that I want. I tried the following code:
if true
File1 = fopen('Factual1.xml','r');
File2 = fopen('Factual2.xml','r');
File3 = fopen('Colloquial1.xml','r');
File4 = fopen('Colloquial2.xml','r');
File5 = fopen('Hello.xml','r');
File6 = fopen('Hello2.xml','r');
Filenames = {'File1';'File2';'File3';'File4';'File5';'File6'};
B = {0};
for i=File1:File6
A = fscanf(i,'%s');
if ~(strcmp(A,'<w:pw:rsidR="00E3286E"w:rsidRDefault="'))
while((B = fscanf(i,'%c')) ~='\')
B
end
end
end
end
But I keep getting an error saying that the statement B = fscanf(i,'%c') is not valid. Is there any other way to scan the contents of each file, character by character, so that I can extract the text that I want?

Answers (2)

Ken Atwell on 3 Jun 2013
I'm guessing you're a C programmer. You can't assign B in the while loop's conditional like you are attempting to do. Use two lines:
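B = fscanf(i, '%c', 1);            % the size argument 1 reads a single character
while ~isempty(B) && B ~= '\'      % stop at end of file or at the backslash
...
B = fscanf(i, '%c', 1);
end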
BTW, I believe your for loop is working "accidentally" because MATLAB tends to assign file handles in numeric order, but that is perhaps not guaranteed.
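As a minimal sketch of the two-line pattern above, the following reads a file one character at a time and stops at a backslash. The sample file and its contents are made up for illustration. Note that fscanf(fid, '%c') without a size argument reads all remaining characters at once, so the sketch passes 1 as the third argument and also checks for an empty result at end of file.

```matlab
% Create a small sample file so the sketch is self-contained
% (file name and contents are hypothetical).
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '%s', 'keep this\drop this');
fclose(fid);

fid = fopen(sampleFile, 'r');
extracted = '';
ch = fscanf(fid, '%c', 1);          % read exactly one character
while ~isempty(ch) && ch ~= '\'     % stop at end of file or at the backslash
    extracted = [extracted ch];     %#ok<AGROW> acceptable for short strings
    ch = fscanf(fid, '%c', 1);      % read the next character
end
fclose(fid);
delete(sampleFile);

disp(extracted)
```

For larger files, reading the whole file into a character vector with fileread and locating the markers with strfind is likely to be simpler and faster than a character-by-character loop.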
  4 Comments
Samyukta Ramnath on 4 Jun 2013
But I checked, and they always were consecutive integers! They just didn't always start from one. But will do this anyway, to be sure.
Walter Roberson on 4 Jun 2013
MATLAB appears to follow what POSIX does, which is to allocate the first available (lowest numbered) file descriptor. But that does not mean that the results will always be consecutive.
fid1 = fopen('file1');
fid2 = fopen('file2');
fid3 = fopen('file3');
fclose(fid1);
fclose(fid2);
nfid1 = fopen('nfile1');
nfid2 = fopen('nfile2');
nfid3 = fopen('nfile3');
If we assume nothing had been opened before, fid1 will be 3, fid2 will be 4, and fid3 will be 5. Then 3 and 4 are released, so nfid1 will be 3 and nfid2 will be 4, but nfid3 will be the next available identifier, 6, rather than the consecutive 5.
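One way to avoid depending on consecutive identifiers is to store whatever fopen returns in an array and loop over that array. The sketch below assumes three of the file names from the question and creates stand-in files in a temporary folder so that it runs anywhere; the file contents are invented for illustration.

```matlab
% File names taken from the question; the contents are placeholders.
names = {'Factual1.xml', 'Factual2.xml', 'Colloquial1.xml'};
folder = tempname;
mkdir(folder);

fids = zeros(1, numel(names));
for k = 1:numel(names)
    fullName = fullfile(folder, names{k});
    fid = fopen(fullName, 'w');
    fprintf(fid, 'contents of %s', names{k});
    fclose(fid);
    fids(k) = fopen(fullName, 'r');  % keep whichever identifier MATLAB returns
end

firstLines = cell(1, numel(fids));
for k = 1:numel(fids)                % iterate over the stored ids, not File1:File6
    firstLines{k} = fgetl(fids(k));  % read the first line of each file
    fclose(fids(k));
end
rmdir(folder, 's');                  % clean up the temporary folder
```

The loop index k also gives a direct way to pair each identifier with its file name, which a numeric range of identifiers does not.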



Paul Metcalf on 4 Jun 2013
You are defining B as a cell array, then overwriting it with a different data type, so the initial value is never used. Try initializing B properly first, e.g. B = cell(m,n); then assign data into each cell of the array with B{1,1} = 'first line of data'; and so on. Your code is really poorly constructed in general. If I have time tonight I'll look at sending you some more tips.
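A short sketch of the initialization described above; the sizes m and n and the stored values are chosen only for illustration.

```matlab
m = 3; n = 1;                        % illustrative sizes
B = cell(m, n);                      % m-by-n cell array of empty cells
B{1,1} = 'first line of data';       % curly braces assign the cell contents
B{2,1} = 'second line of data';
B{3,1} = 42;                         % cells may hold values of different types

filled = ~cellfun(@isempty, B);      % logical vector marking non-empty cells
disp(B)
```

Because each cell can hold a character vector of any length, a cell array is a convenient container for extracted pieces of text that differ in length.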
  1 Comment
Samyukta Ramnath on 4 Jun 2013
I think I get your point. You mean that I should first initialize B as a two dimensional matrix, then I can print the text character by character, after applying a checking condition (i.e. keep printing the characters if they aren't equal to some specific character?)
