# How to save a tall array / table to a text or csv file?

I'm trying to save the contents of a tall table that doesn't fit into memory to a .txt file.

MATLAB provides the function write. However, it can only write the table contents to .mat files. So far I haven't found an option or another function that could write the data to a text file.

A workaround I'm trying is to repeatedly gather a part of the tall table, save it to a text file, gather the next part, and so on until the end of the table is reached. For that to work with tall syntax, I suppose I need a vector containing the row numbers of the tall table. That way I could find the index of the rows I want to gather and write with:

idx = (RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit);

With the index vector it is then possible to gather the rows I want to save to the text file:

TableToSave = gather(Table(idx,:));

Once the data is gathered, the table could be saved with the writetable function. After that, lowerLimit and upperLimit could be adjusted and a new chunk of the table could be saved.

The point where I'm failing is the construction of this vector containing the row numbers. In theory it's simply

RowNumbers = 1:1:size(Table,1); % Or: RowNumbers = 1:1:gather(size(Table,1));

The first one doesn't work because the 1:X syntax doesn't support 'tall doubles'. The second approach doesn't work, I suppose, because the resulting RowNumbers and index vector are completely in-memory while the table is not.

So if I try to

idx = tall((RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit));

and use

gather(head(DB(idx,:)));

the following error appears:

Incompatible tall array arguments. The first dimension in each tall array must have the same size, and each tall array must be based on the same datastore.

To sum up:

1. Is there another way to save tall arrays / tables to text files?

2. How to create an "unevaluated" row number array that then could be used for the described workaround?

Thanks a lot!

### Accepted Answer

Edric Ellis
on 7 Sep 2017

I think the least inefficient method is probably to combine use of tall/write with writetable. Calling gather repeatedly is going to be inefficient - the approach below takes 2 passes over the data (one of which is in the optimised .mat form, so should be quicker). Here's the sort of thing I mean.

    %% Step 1 - create a tall table
    varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
    ds1 = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
        'SelectedVariableNames', varnames);
    tt = tall(ds1);

    %% Step 2 - operate on tall table
    tt.TotalDelay = tt.ArrDelay + tt.DepDelay;

    %% Step 3 - use tall/write to emit .mat files
    writeDir = tempname
    mkdir(writeDir);
    write(writeDir, tt);

    %% Step 4 - iteratively convert the tall/write output to CSV
    ds = datastore(writeDir);
    csvDir = tempname
    mkdir(csvDir);
    idx = 0;
    while hasdata(ds)
        idx = 1 + idx;
        fname = fullfile(csvDir, sprintf('out_%06d.csv', idx));
        writetable(read(ds), fname);
    end
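
If a single combined file is wanted rather than one CSV per chunk, one possible variation is to append each chunk to the same file. The sketch below is not part of the original answer: it uses a small made-up table and a `tabularTextDatastore` with a tiny `ReadSize` as a stand-in for `datastore(writeDir)`, and it assumes a release in which `writetable` accepts the `'WriteMode'` option (R2020a or later, as far as I know). The names `srcFile`, `outFile` and `isFirst` are hypothetical.

```matlab
% Hypothetical stand-in data: a small table written to disk so that a
% datastore can read it back in chunks, much as read(ds) does in Step 4.
srcFile = [tempname '.csv'];
T = table((1:6)', (11:16)', 'VariableNames', {'A', 'B'});
writetable(T, srcFile);

ds = tabularTextDatastore(srcFile);
ds.ReadSize = 2;                 % tiny chunks, for illustration only

outFile = [tempname '.csv'];     % the single combined output file
isFirst = true;
while hasdata(ds)
    chunk = read(ds);
    if isFirst
        % The first chunk creates the file and writes the header row.
        writetable(chunk, outFile);
        isFirst = false;
    else
        % Later chunks are appended without repeating the header.
        % Assumption: 'WriteMode' is available in your release.
        writetable(chunk, outFile, 'WriteMode', 'append', ...
            'WriteVariableNames', false);
    end
end

% Read the combined file back to check that nothing was lost.
combined = readtable(outFile);
```

With the real data, the loop body should stay the same if `ds` is instead the `datastore(writeDir)` from Step 4. Appending to one file is inherently serial, so this variation does not combine with a parallel loop.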

A refinement of this would be to partition the datastore and operate on it in parallel, using the partitioning techniques described in the documentation.

Edric Ellis
on 7 Sep 2017

Here's the parfor version of Step 4.

    %% Step 5 - use parfor to parallelise the writetable loop
    ds = datastore(writeDir);
    N = numpartitions(ds, gcp);
    csvDir2 = tempname
    mkdir(csvDir2);
    parfor idx1 = 1 : N
        idx2 = 0;
        subds = partition(ds, idx1, N);
        while hasdata(subds)
            idx2 = 1 + idx2;
            fname = fullfile(csvDir2, sprintf('out_%06d_%06d.csv', idx1, idx2));
            writetable(read(subds), fname);
        end
    end
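
Depending on the MATLAB release, the intermediate .mat step may not be needed at all: newer versions of tall `write` accept a `'FileType'` name-value option that, to my knowledge, can emit text files directly. This is a minimal sketch, assuming your release supports that option (check the `write` documentation for your version), with a small made-up in-memory table standing in for the real tall table. `outDir`, `dsOut` and `result` are hypothetical names.

```matlab
% Hypothetical stand-in for the real tall table.
T = table((1:6)', (11:16)', 'VariableNames', {'A', 'B'});
tt = tall(T);

% Assumption: this release of tall/write supports 'FileType', 'text'.
% The output is a folder of text files, not a single file.
outDir = tempname;
write(outDir, tt, 'FileType', 'text');

% Read the text files back through a datastore to inspect the result.
dsOut = datastore(outDir);
result = gather(tall(dsOut));
```

As with the .mat output, the result is a folder containing one or more files, so a single combined file would still need a concatenation step afterwards.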

### More Answers (1)

Adam Filion
on 1 Oct 2018

Edited: Adam Filion
on 1 Oct 2018

