Writing to matfile in a loop

Andrew on 28 Jun 2012
Hello,
I am working on a project involving processing an hour of sampled data, where the data is sampled at several megasamples per second. A single data set can come in at 100 GB+.
In order to determine if my data processing is producing high quality measurements, I want to simulate similar data, and process it. My preference is to generate the data all at once, then process it all at once. Thus, I have a need to generate vectors larger than what my PC can hold in RAM.
I am attempting to generate my large data set one second at a time, then store each second of data to a file. I am doing this in a for loop using the matfile class that is included in MATLAB 7.14 (R2012a), and perhaps earlier releases.
However, the files I am generating are larger, indeed, much larger, than they should be. My issue can be demonstrated with the following snippet. At the end, there are two files on disk, the first 34208152 bytes, the second 776065 bytes. However, there are two vectors in memory, each 100,000x1.
Why is the file written in pieces 44x larger than the file written in one step?
If there is not a fix for this, do you have any suggestions on how I can generate and store a variable much larger than the amount of RAM available on my PC?
Thank you,
Andrew
% Generate Data that is cause for question
testFile = matfile('simData1','Writable',true);
testFile.y1 = [];
filePtr = 1;
for itr = 1:100
    data = randn(1000,1);
    testFile.y1(filePtr:filePtr+1000-1,1) = data;
    filePtr = filePtr + 1000;
end%for
data = randn(1e5,1);
testFile2 = matfile('simData2','Writable',true);
testFile2.y2 = data;
% Show that files are different size on disk
clear
details = dir;
for itr = 1:length(details)
    if strcmp(details(itr).name,'simData1.mat')
        fileIdx1 = itr;
    elseif strcmp(details(itr).name,'simData2.mat')
        fileIdx2 = itr;
    end%if
end%for
details(fileIdx1).bytes
details(fileIdx2).bytes
% show that variables are same size in memory
load('simData1')
load('simData2')
Edit: Note that preallocating testFile.y1 (e.g. testFile.y1 = zeros(1e5,1);) does not resolve the issue.

Accepted Answer

per isakson on 28 Jun 2012
Edited: per isakson on 29 Jun 2012
Strange! I think this is an issue for tech support and the developers at MathWorks.
I've reproduced your results with the function below (on R2012a 64-bit, Windows 7) and made a couple of observations:
  1. each chunk, <1000x1 double>, of y1 has a larger "footprint" in the mat-file than the previous chunk; plot(sz1) displays a quadratic function.
  2. rerunning the function, cssm, without deleting the mat-file beforehand adds another 32 MB to the mat-file, simData1.mat. The file ends up with a lot of "dead data".
function [ sz1, sz_mat1, sz_mat2, sz_y ] = cssm
    % Generate Data that is cause for question
    testFile = matfile('simData1','Writable',true);
    testFile.y1 = [];
    filePtr = 1;
    sz1 = nan( 100, 1 );
    for itr = 1:100
        data = randn(1000,1);
        testFile.y1(filePtr:filePtr+1000-1,1) = data;
        filePtr = filePtr + 1000;
        sad = dir('simData1.mat');
        sz1(itr) = sad.bytes;
    end%for
    data = randn(1e5,1);
    testFile2 = matfile('simData2','Writable',true);
    testFile2.y2 = data;
    [ sz_mat1, sz_mat2, sz_y ] = cssm_();
end
function [ sz_mat1, sz_mat2, sz_y ] = cssm_
    % Show that files are different size on disk
    sad = dir('simData1.mat');
    sz_mat1 = sad.bytes;
    sad = dir('simData2.mat');
    sz_mat2 = sad.bytes;
    % show that variables are same size in memory
    load('simData1')
    load('simData2')
    sz_y = whos('y*');
end
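The "quadratic" observation above can be checked numerically rather than by eye. This is only a minimal sketch: quadratic_fit_check is a hypothetical helper name (not part of MATLAB or of the code above), and it assumes the input is the sz1 vector returned by cssm.

```matlab
function [ p, relErr ] = quadratic_fit_check( sz )
% quadratic_fit_check  Hypothetical helper: fit a 2nd-order polynomial to
% a vector of file sizes and report how well it matches.
%   p      - polynomial coefficients, highest order first (as polyfit returns)
%   relErr - largest absolute residual divided by the largest size
    sz = sz(:);                      % force a column vector
    n  = (1:numel(sz))';             % write number: 1, 2, 3, ...
    p  = polyfit( n, sz, 2 );        % least-squares quadratic fit
    relErr = max( abs( sz - polyval( p, n ) ) ) / max( abs( sz ) );
end
```

Usage would be sz1 = cssm; [p, relErr] = quadratic_fit_check( sz1 ); a small relErr together with a clearly positive p(1) would support the claim that the file size grows quadratically with the number of writes.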
--- Suggestion ---
A 100 GB+ vector is huge for a PC. Hopefully you are using 64-bit.
I've experimented a bit with HDF5 and files of a couple of GB, using both the high-level and the low-level MATLAB HDF5 functions. My suggestions are:
  1. Try to make something simple with fwrite, fread, and fseek. Will be fast.
  2. Use high level HDF5. Choose a fixed sub-vector-size that fits easily in the Matlab workspace and Windows file cache. Use a separate dataset in the hdf-file for each sub-vector. Do not use a "chunked" hdf-file.
  3. The Waterloo File and Matrix Utilities, Malcolm Lidierth, http://www.mathworks.se/matlabcentral/fileexchange/12250-project-waterloo-file-and-matrix-utilities
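Suggestion 2 might look roughly like the sketch below. It is an illustration only: the file name, the dataset naming scheme and the sizes are made up, and h5create is called without the 'ChunkSize' option, which to my understanding leaves a fixed-size dataset unchunked.

```matlab
% Sketch: one fixed-size HDF5 dataset per sub-vector (high-level functions)
fileName = fullfile( tempdir, 'simData_sketch.h5' );   % hypothetical file name
if exist( fileName, 'file' )
    delete( fileName )
end
nSub   = 5;       % number of sub-vectors (tiny here, for illustration)
subLen = 1000;    % samples per sub-vector; should fit easily in memory
for k = 1 : nSub
    dsName = sprintf( '/y1_part%04d', k );      % assumed naming scheme
    h5create( fileName, dsName, [ subLen, 1 ] );   % fixed size, no ChunkSize
    h5write ( fileName, dsName, randn( subLen, 1 ) );
end
% Any sub-vector can later be read on its own
part3 = h5read( fileName, '/y1_part0003' );
info  = h5info( fileName );
```

Each dataset is written once and never grown, so the file should not accumulate the kind of "dead data" seen with the mat-file above.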
--- Experiment with fwrite, fseek and fread ---
I've experimented a bit with fwrite, fseek and fread. With n1 = 1e3; n2 = 1e6; the function below produces an 8 GB file. The write and each of the two reads take approximately 100 seconds. Most of the time the free memory is (close to) zero.
With n1 = 1e3; n2 = 3e6; a 24GB file is produced and the times to write and read are three times as long; it seems to be linear with size.
function [write_tt,read_tt,readr_tt] = write_read_huge_vector( n1, n2 )
% write_read_huge_vector Write and read 10GB+ double vector with fwrite, fseek and fread
%{
n1 = 1e3; n2 = 1e6;
[ write_tt, read_tt, readr_tt ] = write_read_huge_vector( n1, n2 );
figure, plot( write_tt )
figure, plot( read_tt )
figure, plot( readr_tt )
%}
    file_spec = 'c:\temp\write_read_huge_vector.dat';
    write_tt = write( n1, n2, file_spec );
    read_tt = read( n1, n2, file_spec );
    readr_tt = readr( n1, n2, file_spec );
end
function elapse = write( n1, n2, file_spec )
    % Write n1 chunks of n2 doubles; chunk ii holds values in [10*ii, 10*ii+1]
    if exist( file_spec, 'file' )
        delete( file_spec )
    end
    vec = rand( n2, 1 );
    fid = fopen( file_spec, 'a' );
    tic_id = tic;
    elapse = nan( n1, 1 );
    for ii = 1 : n1
        fwrite( fid, vec + 10*ii, 'double' );
        elapse(ii) = toc( tic_id );
    end
    fclose( fid );
end
function elapse = read( n1, n2, file_spec )
    % Read the chunks sequentially
    fid = fopen( file_spec, 'r' );
    tic_id = tic;
    elapse = nan( n1, 1 );
    for ii = 1 : n1
        vec = fread( fid, n2, '*double' ); %#ok<NASGU>
        elapse(ii) = toc( tic_id );
    end
    fclose( fid );
end
function elapse = readr( n1, n2, file_spec )
    % Read the chunks in random order; 8 bytes per double
    fid = fopen( file_spec, 'r' );
    tic_id = tic;
    elapse = nan( n1, 1 );
    jj = 0;
    chunk_list = randperm( n1, n1 );
    for ii = chunk_list
        fseek( fid, (ii-1)*n2*8, 'bof' );
        vec = fread( fid, n2, '*double' );
        % any() is needed: a bare vector condition is true only if ALL elements are
        if any( abs( vec([1,2,3,numel(vec)-1,numel(vec)]) - 10*ii ) > 1 )
            warning( 'write_read_huge_vector:readr:Mismatch' ...
                   , 'reading mismatch for ii=%u', ii )
        end
        jj = jj + 1;
        elapse(jj) = toc( tic_id );
    end
    fclose( fid );
end
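Applied to the sizes in the original question (100 chunks of 1000 doubles), the same fwrite/fseek/fread approach might look like the sketch below. The file name and the chunk index read back are arbitrary choices for illustration. With 8 bytes per double, the file should come out at 100*1000*8 = 800000 bytes regardless of how many pieces it is written in.

```matlab
% Sketch: write 100 chunks of 1000 doubles to a flat binary file
fileSpec = fullfile( tempdir, 'simData1_sketch.dat' );   % hypothetical file name
nChunk   = 100;
chunkLen = 1000;
fid = fopen( fileSpec, 'w' );        % 'w' discards any existing content
assert( fid > 0, 'Could not open %s for writing', fileSpec );
for itr = 1 : nChunk
    data = randn( chunkLen, 1 );
    fwrite( fid, data, 'double' );   % appended at the current file position
end
fclose( fid );
sad = dir( fileSpec );
fprintf( 'File size: %d bytes\n', sad.bytes );
% Read back one chunk (here number 42) without loading the rest
fid = fopen( fileSpec, 'r' );
fseek( fid, (42-1)*chunkLen*8, 'bof' );   % 8 bytes per double
chunk = fread( fid, chunkLen, '*double' );
fclose( fid );
```

A flat file carries no variable names or sizes, so the chunk length and data type have to be tracked separately, for instance in a small companion mat-file.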
  4 Comments
per isakson on 28 Jun 2012
Edited: per isakson on 30 Jun 2012
A search for "matfile" doesn't return anything in MathWorks' Bug Reports, so we need to file a bug/enhancement report.
I've reported: 1-IO95SG
