How to save a tall array / table to a text or csv file?

I'm trying to save the contents of a tall table that doesn't fit into memory to a .txt file.
MATLAB provides the function write. However, it can only write the table contents to .mat files. So far I haven't found an option or another function that could write the data to a text file.
A workaround I'm trying is to repeatedly gather a part of the tall table, save it to a text file, gather the next part, and so on until the end of the table is reached. For that to work with tall syntax, I suppose I need a vector containing the row numbers of the tall table. That way I could find the index of the rows I want to gather and write with:
idx = (RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit);
With the index vector it is then possible to gather the rows I want to save to the text file:
TableToSave = gather(Table(idx,:));
Once the data is gathered, the table could be saved with the writetable function. After that, lowerLimit and upperLimit could be adjusted and the next chunk of the table could be saved.
The point where I'm failing is the construction of this vector containing the row numbers. In theory it's simply
RowNumbers = 1:1:size(Table,1); % Or: RowNumbers = 1:1:gather(size(Table,1));
The first one doesn't work because the 1:X colon syntax doesn't support tall doubles. The second approach doesn't work either, I suppose because the resulting RowNumbers and index vector are entirely in memory while the table is not.
So if I try to
idx = tall((RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit));
and use
gather(head(Table(idx,:)));
The following error appears:
Incompatible tall array arguments. The first dimension in each tall array must have the same size, and each tall array must be based on the same datastore.
To sum up:
1. Is there another way to save tall arrays / tables to text files?
2. How to create an "unevaluated" row number array that then could be used for the described workaround?
Thanks a lot!

Accepted Answer

Edric Ellis on 7 Sep 2017
I think the least inefficient method is probably to combine use of tall/write with writetable. Calling gather repeatedly is going to be inefficient - the approach below takes 2 passes over the data (one of which is in the optimised .mat form, so should be quicker). Here's the sort of thing I mean.
%% Step 1 - create a tall table
varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds1 = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', varnames);
tt = tall(ds1);

%% Step 2 - operate on tall table
tt.TotalDelay = tt.ArrDelay + tt.DepDelay;

%% Step 3 - use tall/write to emit .mat files
writeDir = tempname
mkdir(writeDir);
write(writeDir, tt);

%% Step 4 - iteratively convert the tall/write output to CSV
ds = datastore(writeDir);
csvDir = tempname
mkdir(csvDir);
idx = 0;
while hasdata(ds)
    idx = 1 + idx;
    fname = fullfile(csvDir, sprintf('out_%06d.csv', idx));
    writetable(read(ds), fname);
end
A refinement of this would be to partition the datastore and operate on the partitions in parallel, using the partitioning techniques described in the datastore documentation.
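As a rough sketch of that refinement, the Step 4 loop could be split across partitions along these lines. The demo data, the names srcDir, idx1 and idx2, and the output file-name pattern are assumptions made for illustration; in the real workflow ds would be datastore(writeDir). Without Parallel Computing Toolbox, the parfor loop should simply run serially.

```matlab
% Hypothetical demo data: two small CSV files standing in for the tall/write output
srcDir = tempname;
mkdir(srcDir);
writetable(table((1:5)', (6:10)', 'VariableNames', {'A', 'B'}), ...
    fullfile(srcDir, 'part1.csv'));
writetable(table((11:15)', (16:20)', 'VariableNames', {'A', 'B'}), ...
    fullfile(srcDir, 'part2.csv'));

ds = datastore(srcDir);      % in the real workflow: ds = datastore(writeDir)
csvDir = tempname;
mkdir(csvDir);

% Ask the datastore how many partitions it suggests, then process each one
N = numpartitions(ds);
parfor idx1 = 1:N
    subds = partition(ds, N, idx1);   % note the argument order: N first, then index
    idx2 = 0;
    while hasdata(subds)
        idx2 = idx2 + 1;
        % Include both indices so that partitions never overwrite each other
        fname = fullfile(csvDir, sprintf('out_%06d_%06d.csv', idx1, idx2));
        writetable(read(subds), fname);
    end
end
```

Each partition writes its own numbered files, so the workers do not need to coordinate with each other.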
  2 Comments
Benjamin Imbach on 8 Sep 2017
Edited: Benjamin Imbach on 8 Sep 2017
Thanks! This is what I was looking for. I still wish there was a direct way to write the tall table to a csv but until then this will do!
A small correction:
subds = partition(ds, idx1, N);
should be
subds = partition(ds, N, idx1);
In your case it worked since the datastore is probably so small that N = 1.
Thanks again!


More Answers (1)

Adam Filion on 1 Oct 2018
Edited: Adam Filion on 1 Oct 2018
As of R2018b the tall write command directly supports writing tall arrays to .txt files (also .csv, .xls* and custom formats) in addition to .mat.
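A minimal sketch of what that might look like, assuming R2018b or later and the 'FileType' name-value option of tall write. The small in-memory table and the variable names are made up for illustration and stand in for a real tall table backed by a datastore.

```matlab
% Small in-memory table standing in for a real tall table (demo data only)
T = table((1:6)', (6:-1:1)', 'VariableNames', {'ArrDelay', 'DepDelay'});
tt = tall(T);
tt.TotalDelay = tt.ArrDelay + tt.DepDelay;

% Write the tall table as delimited text files into an empty folder
outDir = tempname;
mkdir(outDir);
write(outDir, tt, 'FileType', 'text');

% List the files that were produced
listing = dir(outDir);
disp({listing(~[listing.isdir]).name}');
```

The output is typically split across several files in the target folder, one or more per chunk of the tall array, rather than collected into a single file.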
