Writing a table too large for MATLAB

Brandon on 27 Jul 2023
Edited: dpb on 27 Jul 2023
Currently I'm filtering out points in multiple files and then combining them into one large table with the following loop. It works fine for the first 13 files, but then MATLAB runs out of memory and I'm unable to continue. Is this an issue with MATLAB, or just with the way I went about it?
arr = [];
for i = 1:9
    A = readtable("LRO_ES_0" + i + ".csv");
    idx = strcmp(A{:,5},"FIRING");
    A = A(idx,:);
    arr = [arr;A];
end
  6 Comments
the cyclist on 27 Jul 2023
Edited: the cyclist on 27 Jul 2023
It would be helpful if you could upload the data files (which I expect you cannot do, or are not allowed to do), or at least tell us
  • the number of files
  • the numbers of rows and columns in those files
  • the typical fraction of the time that idx will be true (i.e. how much filtering you expect to occur)
That would help gauge the magnitude of the problem.
dpb on 27 Jul 2023
"...but my internship mentor wants a full gathered table for future applications."
Well, "if frogs had wings...". We can't always have what we want.
arr = [];
for i = 1:9
    A = readtable("LRO_ES_0" + i + ".csv");
    idx = strcmp(A{:,5},"FIRING");
    A = A(idx,:);
    arr = [arr;A];
end
In the above, you read a table, retain part of it, and append it to the existing table via dynamic reallocation (the [arr;A] syntax). I don't know whether that requires a temporary copy or not; you could try preallocating a large table up front and then addressing into it by the number of rows accumulated so far.
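A rough sketch of that preallocation idea (untested; the filenames, the row bound, and the assumption that all files share the same variables are mine):
nFiles   = 9;
maxRows  = 5e6;                                  % assumed upper bound on total filtered rows
template = readtable("LRO_ES_01.csv");           % first file supplies the column schema
arr      = repmat(template(1,:), maxRows, 1);    % preallocate one big table
nUsed    = 0;
for i = 1:nFiles
    A = readtable("LRO_ES_0" + i + ".csv");
    A = A(strcmp(A{:,5},"FIRING"), :);
    arr(nUsed+1:nUsed+height(A), :) = A;         % write into the preallocated block
    nUsed = nUsed + height(A);
end
arr = arr(1:nUsed, :);                           % trim the unused rows
That avoids the repeated reallocation while appending, but of course the whole result still has to fit in memory at once.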
It's possible a little storage compression might occur if you changed variables like column 5 above to categorical() instead of text; another storage "trick" would be to convert the default double to single if the loss of precision can be tolerated.
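A minimal sketch of those two storage ideas (assuming column 5 holds the status text and that single precision is acceptable for the numeric columns):
A = readtable("LRO_ES_01.csv");
A.(5) = categorical(A.(5));          % repeated text stored as categorical codes
for v = 1:width(A)                   % downcast double columns to single
    if isa(A.(v), 'double')
        A.(v) = single(A.(v));
    end
end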
But, if the files are indeed that large, then even if you get it to read the 10th or 11th, it'll undoubtedly still run out of memory before you get to the end (unless there are only 12 altogether, in which case you might get lucky! :) )
It may just not be feasible to put all together in memory at once...


Answers (3)

Walter Roberson on 27 Jul 2023
You can do slightly better with
numfiles = 9;
arr = cell(numfiles,1);
for i = 1:numfiles
    A = readtable("LRO_ES_0" + i + ".csv");
    idx = strcmp(A{:,5},"FIRING");
    A = A(idx,:);
    arr{i} = A;
end
arr = vertcat(arr{:});
But this really only postpones the problem.
However, I can pretty much guarantee that no matter how much RAM your system has, you could find 9 data files whose total size exceeds the amount of RAM that you have.
If you go to the command window, upper right side, Preferences -> MATLAB -> Workspace -> MATLAB array size limit, and turn off "Limit the maximum array size to a percentage of RAM", then if you add virtual memory to your system, MATLAB will be able to handle arrays that are up to 2^48-1 bytes in total size. However, using virtual memory is typically quite slow.
You should look into tall arrays... but I suspect it isn't going to work for your purposes.
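If you do want to try that route, a hedged sketch of the tall-array approach (the column name Var5 is a placeholder; use whatever header your files actually have):
ds  = tabularTextDatastore("LRO_ES_*.csv");   % out-of-memory datastore over all the files
tt  = tall(ds);                               % tall table, evaluated lazily
sel = tt(strcmp(tt.Var5, "FIRING"), :);       % filter rows without loading everything
write("LRO_ES_filtered", sel);                % spill the filtered result to disk in chunks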
Your mentor is probably going to have to give up on the idea of gathering everything into a single table. (Gathering it into a database might work.)
Unless, that is, you happen to be using a computer with a fairly small amount of RAM by today's standards. For example, if your system only has 4 gigabytes of RAM, you should probably get a RAM upgrade.
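A sketch of the database idea mentioned above, if you have Database Toolbox (the file and table names are placeholders, and sqlwrite with an sqlite connection needs a reasonably recent release):
conn = sqlite("LRO_ES_filtered.db", "create");   % local SQLite file, no server needed
for i = 1:9
    A = readtable("LRO_ES_0" + i + ".csv");
    A = A(strcmp(A{:,5},"FIRING"), :);
    sqlwrite(conn, "firing_events", A);          % append the filtered rows to one table
end
close(conn);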
Side note: Microsoft Excel has a limit of 2^20 = 1048576 rows for an .xlsx file.

Voss on 27 Jul 2023
If you don't mind the full table being in an .xlsx file instead of a .csv, you can use the 'Range' input to writetable to write each table to the output file one at a time, without having to store the full table in memory all at once.
num_files = 13;
output_file = "LRO_ES_all.xlsx";
start_row = 1;
for i = 1:num_files
    A = readtable(sprintf("LRO_ES_%02d.csv",i));
    idx = strcmp(A{:,5},"FIRING");
    A = A(idx,:);
    % write this file's filtered rows starting at the current row;
    % variable names are written only once, with the first file
    writetable(A,output_file,'Range',sprintf('A%d',start_row),'WriteVariableNames',i==1);
    start_row = start_row + size(A,1) + (i==1);   % the +1 accounts for the header row
end
  1 Comment
Walter Roberson on 27 Jul 2023
Note: using 'Range' does not give any way around the limit of 1048576 rows.
Also, .xlsx files are zipped directories of XML text files -- so they need a fair bit of temporary RAM; more than a .mat file would need, for example.



dpb on 27 Jul 2023
Edited: dpb on 27 Jul 2023
It depends upon what you mean by the request "...but my internship mentor wants a full gathered table for future applications."
If it is interpreted/intended as a single MATLAB table, then it's probably been conclusively demonstrated that's not going to happen. But, if the point is to have the filtered data available, then there are alternative routes besides the Excel file; for example:
FILTER_STR="FIRING";
fido=fopen('PickGoodNameForFilteredFile.csv','w'); % open an output file for writing
d=dir("LRO_ES_0*.csv"); % get the list of the matching files
for i=1:numel(d)
  fidi=fopen(fullfile(d(i).folder,d(i).name),'r'); % open each in turn
  while ~feof(fidi)
    l=fgets(fidi); % read line including \n terminator
    if contains(l,FILTER_STR) % find the wanted string (*)
      fwrite(fido,l); % copy the matching line to the output file
    end
  end
  fclose(fidi);
end
fclose(fido);
This will provide a single file containing all the filtered data, which can then be processed -- the tools for large datasets may then come into use later on.
(*) This will find the string anywhere within the record; it may need a little more sophistication in the matching if it is important to only match the 5th (or other) specific column, but that's just a little extra logic.
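A hypothetical version of that extra logic, replacing the contains() test inside the loop (it assumes a simple comma-delimited file with no quoted fields containing commas):
fields = strsplit(strtrim(l), ',');                    % split the raw line on commas
if numel(fields) >= 5 && strcmp(fields{5}, FILTER_STR) % match only the 5th column
    fwrite(fido, l);
end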
This will run very quickly, as it doesn't go through the I/O conversion routines but just moves and compares bytes.
