How to deal with files in which rows have different numbers of columns

Hello everyone
I'm trying to load and manipulate a file (with an extension other than .txt or .data) that has the following structure:
#unit cell size
500 500 26.408
#unit cell vectors
1 0 0
0 1 0
0 0 1
#Atoms
46212 2
0 0 0 0 0 0 0
...
46211 0.996549 0.997788 0.75 1 0 0
808700 tensorial
0 0 1 0 0 0 1.13755e-22 0 0 0 1.13755e-22 0 0 0 1.13755e-22
...
808699 46138 46141 0 0 0 -3.83401e-23 0 1.17279e-23 -0 -3.83401e-23 1.1713e-30 -1.17279e-23 -1.1713e-30 -3.83401e-23
Each ellipsis stands for the omitted rows, whose first column runs through the integers between the first-column values shown just before and just after it. Between the rows "46212 2" and "808700 tensorial", every row has seven columns; from the row "808700 tensorial" to the end of the file, every row has fifteen columns. Note that the first number on the line "46212 2" gives the number of seven-column rows that follow (46212), and the first number on the line "808700 tensorial" gives the number of fifteen-column rows that follow (808700).
My goal is to create two variables in my MATLAB workspace: the first containing the 46212 rows of seven columns between the rows "46212 2" and "808700 tensorial", and the second containing the 808700 rows of fifteen columns between the row "808700 tensorial" and the end of the file. How could this be done?
I would also like to know whether MATLAB itself can detect the number of rows with seven and fifteen columns, respectively, so that I can use the same method for files of this type that do not necessarily have 46212 rows of seven columns and 808700 rows of fifteen columns.

Accepted Answer

Voss on 9 Dec 2023
Edited: Voss on 9 Dec 2023
% a concrete example file, with smaller matrices, is attached:
filename = 'test_file.txt';
% show the file's contents, for reference:
type(filename)
#unit cell size
500 500 26.408
#unit cell vectors
1 0 0
0 1 0
0 0 1
#Atoms
3 2
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0.996549 0.997788 0.75 1 0 0
4 tensorial
0 0 1 0 0 0 1.13755e-22 0 0 0 1.13755e-22 0 0 0 1.13755e-22
1 0 1 0 0 0 1.13755e-22 0 0 0 1.13755e-22 0 0 0 1.13755e-22
2 46138 46141 0 0 0 -3.83401e-23 0 1.17279e-23 -0 -3.83401e-23 1.1713e-30 -1.17279e-23 -1.1713e-30 -3.83401e-23
3 46138 46141 0 0 0 -3.83401e-23 0 1.17279e-23 -0 -3.83401e-23 1.1713e-30 -1.17279e-23 -1.1713e-30 -3.83401e-23
Here is one method that reads the entire file at once and interprets the matrices contained within.
  • It is assumed that the line immediately following the last line that starts with '#Atoms' starts with a positive integer defining how many immediately subsequent lines contain the first matrix (as you describe).
  • After the first matrix is read, the code continues to read matrices the same way (one line of text defining the number of rows in the matrix followed by that many lines of text containing the matrix).
  • Matrices are stored in the cell array "result".
% read the file's contents into a character vector:
fid = fopen(filename,'r');
ch = fread(fid,[1 Inf],'*char');
fclose(fid);
% split on line endings (\n or \r\n):
str = regexp(ch,'\r?\n','split');
% number of lines of text in the file:
n_lines = numel(str);
% index of the line that immediately follows the last
% line in the file that starts with '#Atoms':
idx = find(startsWith(str,'#Atoms'),1,'last')+1;
% initialize the matrix counter and the output cell array:
ii = 1;
result = {};
% go as long as idx does not exceed the number of lines in the file
% (idx indicates the line immediately before the next matrix to read (and
% immediately after the previously-read matrix, if any)):
while idx <= n_lines
    % read line #idx to get the number of rows/lines in the next matrix:
    num = sscanf(str{idx},'%d');
    % stop if the line has no leading integer (e.g., the empty string
    % left over when the file ends with a newline):
    if isempty(num)
        break
    end
    % matrix start and end line indices:
    start_idx = idx+1;
    end_idx = idx+num(1);
    % section (i.e., consecutive lines of text) defining the matrix:
    section = str(start_idx:end_idx);
    % trim each line and split it on runs of whitespace to get the
    % numbers (still in char form):
    C = regexp(strtrim(section),'\s+','split');
    C = vertcat(C{:});
    % convert each char to double (numeric) and store the resulting
    % matrix in the "result" cell array:
    result{ii} = str2double(C);
    % increment the matrix counter:
    ii = ii+1;
    % move to the next line after the matrix ends, and continue the loop
    % (if more lines are left):
    idx = end_idx+1;
end
% see the matrices:
disp(result)
{3×7 double} {4×15 double}
result{:}
ans = 3×7
         0         0         0         0         0         0         0
    1.0000         0         0         0         0         0         0
    2.0000    0.9965    0.9978    0.7500    1.0000         0         0
ans = 4×15
   1.0e+04 *
         0         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0001         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0002    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
    0.0003    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
% if you really want to make two variables, one for each of the first two
% matrices in result:
M1 = result{1};
M2 = result{2};
Here is another method that interprets the matrix sections the same way as the previous method, but reads the file line by line, so it will work for larger files than the first method will. It expects the matrix definitions to start immediately after the first '#Atoms' line (not the last, as in the previous method).
% open the file:
fid = fopen(filename,'r');
% read lines one at a time until you get to the
% first line that starts with '#Atoms':
while ~feof(fid)
    ch = fgetl(fid);
    if startsWith(ch,'#Atoms')
        break
    end
end
% initialize the matrix counter and the output cell array:
ii = 1;
result = {};
% continue until the end of the file:
while ~feof(fid)
    % get the next line:
    ch = fgetl(fid);
    % read off the number of matrix rows expected to follow:
    num = sscanf(ch,'%d');
    % stop if the line has no leading integer (e.g., a blank line
    % at the end of the file):
    if isempty(num)
        break
    end
    % read that many lines into the cell array "section":
    section = cell(num(1),1);
    for jj = 1:num(1)
        section{jj} = fgetl(fid);
    end
    % trim each line and split it on runs of whitespace to get the
    % numbers (still in char form):
    C = regexp(strtrim(section),'\s+','split');
    C = vertcat(C{:});
    % convert each char to double (numeric) and store the
    % resulting matrix in the "result" cell array:
    result{ii} = str2double(C);
    % increment the matrix counter:
    ii = ii+1;
end
% close the file:
fclose(fid);
% see the matrices:
disp(result)
{3×7 double} {4×15 double}
result{:}
ans = 3×7
         0         0         0         0         0         0         0
    1.0000         0         0         0         0         0         0
    2.0000    0.9965    0.9978    0.7500    1.0000         0         0
ans = 4×15
   1.0e+04 *
         0         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0001         0    0.0001         0         0         0    0.0000         0         0         0    0.0000         0         0         0    0.0000
    0.0002    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
    0.0003    4.6138    4.6141         0         0         0   -0.0000         0    0.0000         0   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
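Whichever method is used, it may be worth sanity-checking the parsed matrices before relying on them. The sketch below is only an illustration: the cell array is a small hypothetical stand-in for "result", expected_cols is an assumed list of column counts (7 and 15, as in the question), and the check that the first column counts up from 0 is based on the sample rows shown in the question rather than on any documented format.

```matlab
% Hypothetical stand-ins for result{1} and result{2}; in practice,
% use the cell array produced by either method above.
result = { [0 0 0 0 0 0 0; ...
            1 0 0 0 0 0 0; ...
            2 0.996549 0.997788 0.75 1 0 0], ...
           [(0:3).' zeros(4,14)] };

% Assumed column counts for the two sections (from the question):
expected_cols = [7 15];

ok = false(size(result));
for k = 1:numel(result)
    M = result{k};
    n = size(M,1);
    % A matrix passes if it has the expected number of columns, no NaN
    % (str2double returns NaN for text it cannot convert), and a first
    % column that counts 0, 1, ..., n-1 as in the sample rows.
    ok(k) = size(M,2) == expected_cols(k) ...
         && ~any(isnan(M(:))) ...
         && isequal(M(:,1), (0:n-1).');
end
disp(ok)
```

A false entry in ok would suggest that a section was split or counted incorrectly, for example because of an unexpected blank line.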

More Answers (2)

Matt J on 8 Dec 2023
I think you would probably have to read the file in as a string array using readlines, and then do your own text parsing.
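A minimal sketch of that idea, assuming R2020b or later (for readlines) and the layout described in the question: each section after '#Atoms' starts with a line whose first number is the row count. The sample file, the variable names (txt, blocks), and the choice to convert with sscanf rather than str2double are illustrative assumptions, not part of the suggestion above.

```matlab
% Write a small sample file so the sketch is self-contained
% (same structure as in the question, with fewer rows).
filename = [tempname '.txt'];
fid = fopen(filename,'w');
fprintf(fid,'%s\n', ...
    '#unit cell size','500 500 26.408', ...
    '#unit cell vectors','1 0 0','0 1 0','0 0 1', ...
    '#Atoms','3 2', ...
    '0 0 0 0 0 0 0','1 0 0 0 0 0 0','2 0.996549 0.997788 0.75 1 0 0', ...
    '2 tensorial', ...
    '0 0 1 0 0 0 1e-22 0 0 0 1e-22 0 0 0 1e-22', ...
    '1 0 1 0 0 0 1e-22 0 0 0 1e-22 0 0 0 1e-22');
fclose(fid);

% Read the file as a string array, one element per line.
txt = strtrim(readlines(filename));
txt(txt == "") = [];                  % drop blank lines (e.g., the last one)

% Start at the line after the last '#Atoms' line.
idx = find(startsWith(txt,"#Atoms"),1,'last') + 1;
blocks = {};
while idx <= numel(txt)
    % The first integer on the header line is the number of rows to follow.
    n = sscanf(char(txt(idx)),'%d',1);
    if isempty(n)
        break
    end
    rows = txt(idx+1:idx+n);
    % Join the rows into one char vector and read every number at once,
    % then reshape so that each text line becomes one matrix row.
    vals = sscanf(char(join(rows," ")),'%f');
    blocks{end+1} = reshape(vals,[],n).';
    idx = idx + n + 1;
end
delete(filename);
disp(blocks)
```

Reading all numbers of a section with one sscanf call avoids converting each token separately, which may matter for sections with hundreds of thousands of rows.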

Walter Roberson on 8 Dec 2023
You could use readcell, but I wouldn't recommend it.
I would suggest using textscan with an appropriate format for what you are expecting to read, and using explicit repeat counts. Something along the lines of
atomsinfocell = textscan(fid, AtomsFmt, Num_Atoms);
possibly with the CollectOutput option turned on.
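To make that concrete, here is a rough sketch under a few assumptions: the column counts (7 and 15) are known in advance, the row count is the first integer on each section's header line, and the sample file and the names ncols and data are made up for illustration. The inner loop that skips empty lines is there because, after a textscan call with a repeat count, the file position may still be at the end of the last data line.

```matlab
% Write a small sample file so the sketch is self-contained.
filename = [tempname '.txt'];
fid = fopen(filename,'w');
fprintf(fid,'%s\n', ...
    '#unit cell size','500 500 26.408', ...
    '#unit cell vectors','1 0 0','0 1 0','0 0 1', ...
    '#Atoms','3 2', ...
    '0 0 0 0 0 0 0','1 0 0 0 0 0 0','2 0.996549 0.997788 0.75 1 0 0', ...
    '2 tensorial', ...
    '0 0 1 0 0 0 1e-22 0 0 0 1e-22 0 0 0 1e-22', ...
    '1 0 1 0 0 0 1e-22 0 0 0 1e-22 0 0 0 1e-22');
fclose(fid);

fid = fopen(filename,'r');
% Skip ahead to the first line that starts with '#Atoms'.
while ~feof(fid)
    if startsWith(fgetl(fid),'#Atoms')
        break
    end
end

ncols = [7 15];                      % assumed columns per section
data = cell(1,numel(ncols));
for k = 1:numel(ncols)
    % Read the section header, skipping any empty remainder of the
    % previous line.
    hdr = '';
    while ischar(hdr) && isempty(strtrim(hdr))
        hdr = fgetl(fid);
    end
    if ~ischar(hdr)
        break                        % reached the end of the file
    end
    nrows = sscanf(hdr,'%d',1);      % explicit repeat count for textscan
    fmt = repmat('%f',1,ncols(k));
    c = textscan(fid,fmt,nrows,'CollectOutput',true);
    data{k} = c{1};                  % nrows-by-ncols(k) double matrix
end
fclose(fid);
delete(filename);
disp(data)
```

With CollectOutput turned on, textscan returns the numeric columns of each section as a single matrix instead of one cell per column.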
