How to deal with files in which rows have different numbers of columns

Question

Richard Wood on 8 Dec 2023

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/2058209-how-to-deal-with-files-in-which-rows-have-different-numbers-of-columns

Edited: Voss on 9 Dec 2023

Hello everyone

I'm trying to deal with the loading and manipulation of a file (with a different extension than .txt and .data), which has the following type of structure:

#unit cell size 
500	500	26.408
#unit cell vectors
1     0	   0
0     1	   0
0     0	   1
#Atoms
46212	2
0	0	0	0	0	0	0
...
46211	0.996549	0.997788	0.75	1	0	0
808700	tensorial
0	0	1 0 0 0 1.13755e-22	0	0	0	1.13755e-22	0	0	0	1.13755e-22
...
808699	46138	46141 0 0 0 -3.83401e-23	0	1.17279e-23	-0	-3.83401e-23	1.1713e-30	-1.17279e-23	-1.1713e-30	-3.83401e-23

where each ellipsis stands for the omitted rows, whose first column runs through every integer between the value just before the ellipsis and the value just after it. As can be seen, all rows between "46212 2" and "808700 tensorial" have seven columns, while all rows from "808700 tensorial" to the end of the file have fifteen columns. Note that the first column of the line "46212 2" gives the number of seven-column rows that follow (46212), and the first column of the line "808700 tensorial" gives the number of fifteen-column rows that follow (808700).

My goal is to create two variables in my MATLAB workspace: the first containing the 46212 seven-column rows between the lines "46212 2" and "808700 tensorial", and the second containing the 808700 fifteen-column rows between the line "808700 tensorial" and the end of the file. How could this be done?

I would also be very interested to know whether MATLAB itself can detect the number of seven-column and fifteen-column rows, so that I can use the same method for files of this type that do not necessarily have 46212 rows of seven columns and 808700 rows of fifteen columns.

Answer 1

Voss on 9 Dec 2023

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/2058209-how-to-deal-with-files-in-which-rows-have-different-numbers-of-columns#answer_1368379

Edited: Voss on 9 Dec 2023

test_file.txt

% a concrete example file, with smaller matrices, is attached:
filename = 'test_file.txt';
% show the file's contents, for reference:
type(filename)
#unit cell size 
500	500	26.408
#unit cell vectors
1     0	   0
0     1	   0
0     0	   1
#Atoms
3	2
0	0	0	0	0	0	0
1	0	0	0	0	0	0
2	0.996549	0.997788	0.75	1	0	0
4	tensorial
0	0	1 0 0 0 1.13755e-22	0	0	0	1.13755e-22	0	0	0	1.13755e-22
1	0	1 0 0 0 1.13755e-22	0	0	0	1.13755e-22	0	0	0	1.13755e-22
2	46138	46141 0 0 0 -3.83401e-23	0	1.17279e-23	-0	-3.83401e-23	1.1713e-30	-1.17279e-23	-1.1713e-30	-3.83401e-23
3	46138	46141 0 0 0 -3.83401e-23	0	1.17279e-23	-0	-3.83401e-23	1.1713e-30	-1.17279e-23	-1.1713e-30	-3.83401e-23

Here is one method that reads the entire file at once and interprets the matrices contained within.

It is assumed that the line immediately following the last line that starts with '#Atoms' starts with a positive integer defining how many immediately subsequent lines contain the first matrix (as you describe).
After the first matrix is read, the code continues to read matrices the same way (one line of text defining the number of rows in the matrix followed by that many lines of text containing the matrix).
Matrices are stored in the cell array "result".

% read the file's contents into a character vector:
fid = fopen(filename,'r');
ch = fread(fid,[1 Inf],'*char');
fclose(fid);
% split on line endings (\n or \r\n):
str = regexp(ch,'\r?\n','split');
% number of lines of text in the file:
n_lines = numel(str);
% index of the line that immediately follows the last 
% line in the file that starts with '#Atoms':
idx = find(startsWith(str,'#Atoms'),1,'last')+1;
% initialize the matrix counter:
ii = 1;
% start from an empty cell array so matrices from an earlier run do not linger:
result = {};
% go as long as idx does not exceed the number of lines in the file
% (idx indicates the line immediately before the next matrix to read (and 
% immediately after the previously-read matrix, if any)):
while idx <= n_lines
    
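    % read line #idx to get the number of rows/lines in the next matrix
    % (stop at a blank or non-numeric line, e.g. the empty string left
    % after a trailing newline at the end of the file):
    num = sscanf(str{idx},'%d');
    if isempty(num), break, end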
    
    % matrix start and end line indices:
    start_idx = idx+1;
    end_idx = idx+num(1);
    
    % section (i.e., consecutive lines of text) defining the matrix:
    section = str(start_idx:end_idx);
    
    % trim each line, then split on runs of whitespace to get the numbers
    % (still in char form):
    C = regexp(strtrim(section),'\s+','split');
    C = vertcat(C{:});
    
    % convert each char to double (numeric) and store the resulting 
    % matrix in the "result" cell array:
    result{ii} = str2double(C);
    
    % increment the matrix counter:
    ii = ii+1;
    
    % move to the next line after the matrix ends, and continue the loop
    % (if more lines are left):
    idx = end_idx+1;
    
end
% see the matrices:
disp(result)
    {3×7 double}    {4×15 double}
result{:}
ans = 3×7
         0         0         0         0         0         0         0
    1.0000         0         0         0         0         0         0
    2.0000    0.9965    0.9978    0.7500    1.0000         0         0
ans = 4×15
1.0e+04 *

         0         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0001         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0002    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
    0.0003    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
% if you really want to make two variables, one for each of the first two
% matrices in result:
M1 = result{1};
M2 = result{2};
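
A note on reading the 4×15 display above: the entries printed as 0.0000 and -0.0000 are not zero. MATLAB's default short format applies one common scale factor (here 1.0e+04) to the whole matrix, which hides values of order 1e-22. The following is a minimal sketch of how to confirm that the stored values are intact; the variable name row is hypothetical, and its three entries are copied from the last row of the example file.

```matlab
% Three entries from the last row of the second example matrix:
row = [3 46138 -3.83401e-23];

% short format applies one scale factor to the whole row, so the
% tiny entry is displayed as -0.0000:
disp(row)

% printing the element directly shows the stored value:
fprintf('%g\n', row(3))     % prints -3.83401e-23

% count the nonzero entries:
fprintf('%d\n', nnz(row))   % prints 3
```

Switching the display with format long g (or format short g) is another way to see such values in the Command Window.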

Here is another method that interprets the matrix sections the same way as in the previous method, but it reads the file line-by-line (so it will work for larger files than the first method will), and it expects the matrix definitions to start immediately following the first '#Atoms' line (not the last as in the previous method).

% open the file:
fid = fopen(filename,'r');
% read lines one at a time until you get to the 
% first line that starts with '#Atoms':
while ~feof(fid)
    ch = fgetl(fid);
    if startsWith(ch,'#Atoms')
        break
    end
end
% initialize the matrix counter:
ii = 1;
% start from an empty cell array so matrices from an earlier run do not linger:
result = {};
% continue until the end of the file:
while ~feof(fid)
    
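    % get the next line (fgetl returns -1 at the end of the file):
    ch = fgetl(fid);
    if ~ischar(ch), break, end
    
    % read off the number of matrix rows expected to follow
    % (stop at a blank or non-numeric line):
    num = sscanf(ch,'%d');
    if isempty(num), break, end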
    
    % read that many lines into the cell array "section":
    section = cell(num(1),1);
    for jj = 1:num(1)
        section{jj} = fgetl(fid);
    end
    
    % trim each line, then split on runs of whitespace to get the numbers
    % (still in char form):
    C = regexp(strtrim(section),'\s+','split');
    C = vertcat(C{:});
    
    % convert each char to double (numeric) and store the 
    % resulting matrix in the "result" cell array:
    result{ii} = str2double(C);
    
    % increment the matrix counter:
    ii = ii+1;
    
end
% close the file:
fclose(fid);
% see the matrices:
disp(result)
    {3×7 double}    {4×15 double}
result{:}
ans = 3×7
         0         0         0         0         0         0         0
    1.0000         0         0         0         0         0         0
    2.0000    0.9965    0.9978    0.7500    1.0000         0         0
ans = 4×15
1.0e+04 *

         0         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0001         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0002    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
    0.0003    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000

Answer 2

Matt J on 8 Dec 2023

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/2058209-how-to-deal-with-files-in-which-rows-have-different-numbers-of-columns#answer_1368334

I think you would probably have to read the file in as a string array using readlines, and then do your own text parsing.
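
As a rough illustration of this suggestion, here is a minimal sketch that reads the file with readlines (available from R2020b) and parses the counted blocks by hand. It is one possible way to do the text parsing, not necessarily what was meant. The sample file name and the variable names (sample, txt, blocks, parts) are assumptions made for the example; the sample values are copied from the question's layout, with spaces instead of tabs.

```matlab
% write a small sample file in the same layout as the question's file:
sample = {'#Atoms', '2 2', '0 0 0 0 0 0 0', ...
          '1 0.996549 0.997788 0.75 1 0 0', '1 tensorial', ...
          '0 0 1 0 0 0 1.13755e-22 0 0 0 1.13755e-22 0 0 0 1.13755e-22'};
filename = fullfile(tempdir,'sample_blocks.txt');
fid = fopen(filename,'w');
fprintf(fid,'%s\n',sample{:});
fclose(fid);

% read every line into a string array (one element per line):
txt = readlines(filename);

% the line after the last '#Atoms' header holds the first row count:
idx = find(startsWith(txt,"#Atoms"),1,'last') + 1;

blocks = {};   % one numeric matrix per counted block
while idx <= numel(txt)
    % leading integer on the count line = number of rows that follow:
    n = sscanf(char(txt(idx)),'%d',1);
    % stop at a blank/non-numeric line or if the file is too short:
    if isempty(n) || idx+n > numel(txt), break, end
    rows = txt(idx+1:idx+n);
    % let sscanf pull all whitespace-separated numbers out of each row:
    parts = arrayfun(@(s) sscanf(char(s),'%f').', rows, 'UniformOutput', false);
    blocks{end+1} = vertcat(parts{:}); %#ok<AGROW>
    idx = idx + n + 1;
end

disp(cellfun(@size, blocks, 'UniformOutput', false))
```

Because the row counts are read from the file itself, the same loop should work for files of this type with different block sizes, provided every row within a block has the same number of columns.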

Answer 3

Walter Roberson on 8 Dec 2023

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/2058209-how-to-deal-with-files-in-which-rows-have-different-numbers-of-columns#answer_1368339

You could use readcell, but I wouldn't recommend it.

I would suggest using textscan with appropriate format for what you are expecting to read, and using explicit repeat counts. Something along the lines of

atomsinfocell = textscan(fid, AtomsFmt, Num_Atoms);

possibly with the CollectOutput option turned on.
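
To make the idea concrete, here is a minimal sketch under stated assumptions: the number of columns in each block (7, then 15) is known in advance, and the row count is the first integer on the line preceding each block, as described in the question. The sample file and the names nCols, data, fmt, and nRows are hypothetical and chosen only for this example.

```matlab
% write a small sample file in the same layout as the question's file:
sample = {'#Atoms', '2 2', '0 0 0 0 0 0 0', ...
          '1 0.996549 0.997788 0.75 1 0 0', '1 tensorial', ...
          '0 0 1 0 0 0 1.13755e-22 0 0 0 1.13755e-22 0 0 0 1.13755e-22'};
filename = fullfile(tempdir,'sample_textscan.txt');
fid = fopen(filename,'w');
fprintf(fid,'%s\n',sample{:});
fclose(fid);

fid = fopen(filename,'r');
% skip ahead to the '#Atoms' header:
line = fgetl(fid);
while ischar(line) && ~startsWith(line,'#Atoms')
    line = fgetl(fid);
end

nCols = [7 15];                  % expected columns per block (assumed known)
data = cell(1,numel(nCols));
for k = 1:numel(nCols)
    % the next non-blank line holds the row count; skipping blank lines
    % also steps past any line ending left unread by the previous
    % textscan call:
    line = fgetl(fid);
    while ischar(line) && isempty(strtrim(line))
        line = fgetl(fid);
    end
    nRows = sscanf(line,'%d',1);
    % one %f per column, applied exactly nRows times:
    fmt = repmat('%f',1,nCols(k));
    c = textscan(fid, fmt, nRows, 'CollectOutput', true);
    data{k} = c{1};              % nRows-by-nCols(k) double
end
fclose(fid);
```

With CollectOutput set to true, textscan returns the consecutive numeric columns as a single matrix rather than one cell per column, so data{1} and data{2} can be used directly as the two variables asked for in the question.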
