I am working with a very large dataset (500 GB in total) that is split into more than a thousand individual .txt files, each covering a specific geographic area. Each file has 160 columns/characteristics (a mixture of string and numeric variables) and can have more than a million rows. For files covering large areas, a single .txt file can be as large as 16 GB. To cope with this amount of data, I proceed as follows for each file:
- access the respective .txt file (with "fopen")
- within a while loop, import 250,000 rows at a time using "textscan"
- process the data and export a smaller dataset (appending if it is not the first loop iteration)
- repeat the steps above until the end of the .txt file is reached (while ~feof)
The code that imports the data looks like this:
fileID = fopen(filename) ;
'HeaderLines', double(first_iteration==1),'EndOfLine','\r\n','EmptyValue',-1) ;
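For readers who want to reproduce the pattern, here is a minimal, self-contained sketch of the chunked loop described above. It is not my full code: the small temporary test file, the three-column format string, the comma delimiter, and the chunk size of 4 are all made up for illustration (the real files have 160 columns and I read 250,000 rows per call). Only the "HeaderLines", "EndOfLine", and "EmptyValue" options are taken from the call shown above.

```matlab
% Sketch only: file contents, column layout, delimiter, and chunk size are
% assumptions for illustration, not the real data.
filename = [tempname '.txt'];             % hypothetical small test file
fid = fopen(filename, 'w');
fprintf(fid, 'id,name,value\r\n');        % one header line
for k = 1:10
    fprintf(fid, '%d,area%d,%.1f\r\n', k, k, k/2);
end
fclose(fid);

chunkRows  = 4;                           % stands in for 250000
formatSpec = '%f %s %f';                  % assumed layout: numeric, string, numeric
totalRows  = 0;
first_iteration = 1;

fileID = fopen(filename);
while ~feof(fileID)
    % Skip the header only on the first read; later reads continue from
    % the current position in the file.
    C = textscan(fileID, formatSpec, chunkRows, ...
        'Delimiter', ',', ...
        'HeaderLines', double(first_iteration == 1), ...
        'EndOfLine', '\r\n', ...
        'EmptyValue', -1);
    totalRows = totalRows + numel(C{1});  % count rows read in this chunk
    % ... process C here and append the reduced result to the output file ...
    first_iteration = 0;
end
fclose(fileID);
delete(filename);                         % remove the temporary test file
```

The third argument to "textscan" is what limits each call to a fixed number of rows, so in principle only one chunk should be held in memory at a time.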
Doing so allows me to effectively reduce the size of my dataset such that I can conveniently work with the full dataset later.
My problem is the following: The code works perfectly well for ALL files (including the very large .txt ones) with version 2019a. With version 2021a (if I recall correctly, it did not work with 2020a either), the code works perfectly well UNTIL the code reaches a file that is too large. At this point, the code (instantaneously!) stops with an "out of memory" error:
Out of memory.
I suspect that the newer "textscan" function detects that the file being accessed is too large to load in full (which it is), but does not take into account that I only want 250k lines at a time.
I looked at the "readtable" command, but as far as I know, it does not allow importing the data in smaller chunks, one at a time (only for spreadsheets).
Is there a workaround/fix for my issue? As I work (and have worked) frequently with this type of code, I would otherwise be stuck with the 2019 version forever. Thank you very much in advance for your help.