- Don't use a text file, go binary
- Split your text file in manageable chunks beforehand.
- Use a database instead
Data is not saving to the workspace
I have a large text file composed of a single row of 52480000 numbers separated by semicolons. I'm attempting to organize the data into 51250 rows of 1024 numbers and then separate this into distinct blocks of 1025 x 1024. The numbers need to stay in the same order they were in the original file (with every 1025th number being the start of a new row). I have tried using a while loop with an if statement.
R = 51250;
C = 1024;
fid = fopen( 'TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, 'EndOfLine', ';');
if ~isempty(z{1})
k = k + 1;
s = fprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, 1025, []), ';')
end
end
fclose(fid);
This code does not create an initial cell of 52480000 numbers, which means that none of the subsequent data sets (s & z) are created in the workspace. The problem is that if I textscan the data into Matlab before formatting it, the file creates a memory error. Does anyone notice anything that I don't about this code or have any pointers?
26 Comments
José-Luis
on 10 Feb 2017
Edited: José-Luis
on 10 Feb 2017
What is the size of that file? If the numbers had been stored in a binary file in double precision, that would still be more than 400MB. A text file is bound to be much larger and despite impressive progress GB files are a pain to process.
There are several ways of tackling this. Off the top of my head:
There are other ways but I can't be more specific without knowing what you are trying to achieve.
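A minimal sketch of the binary route, in case it helps. For scale, 52480000 values at 8 bytes each is about 420 MB as doubles, or about 210 MB as int32. The sketch assumes the values are integers that fit in int32; the file name and the sample values are made up for illustration. The point is that a binary file lets you seek straight to any block instead of parsing text:

```matlab
% Assumption: the values are integers that fit in int32.
vals = int32([10 20 30 40 50 60]);       % stand-in for numbers parsed from the text file
fnm  = [tempname '.bin'];                % hypothetical scratch file

fid = fopen(fnm, 'w');
fwrite(fid, vals, 'int32');              % 4 bytes per value
fclose(fid);

fid = fopen(fnm, 'r');
fseek(fid, 2*4, 'bof');                  % skip the first two int32 values
chunk = fread(fid, 3, 'int32=>int32').'; % read the next three values as a row
fclose(fid);
delete(fnm);

disp(chunk)
```

With a layout like this, block number b of R*C values would start at byte offset (b-1)*R*C*4, so no earlier data needs to be read.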
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
See earlier question:
"I'm attempting to organize the data into 51250 rows of 1024 numbers and then separate this into distinct blocks of 1025 x 1024"
Why do you need this intermediate step?
My answer showed you how to simply process exactly those blocks of 1025*1024, avoiding that intermediate matrix entirely. What do you gain by creating that huge matrix that you don't even want? My code shows how you can go directly to the smaller matrices (which seems to be your aim) without having to read the whole file data into MATLAB and without needing to use the intermediate step of rearranging all of the data into one pointlessly huge matrix.
Why not just read the blocks you need (1025*1024) instead of wasting time and memory with that huge matrix?
"The numbers need to stay in the same order they were in in the original file (with every 1025th number being the start of a new row) "
Yes, and that is what my answer does. Change R = 51250; back to R = 1025; and this code will work too.
Aaron Smith
on 10 Feb 2017
Like I said, using your code, there is no output data. z and s do not appear in the workspace, and when I made alterations that did give s and z in the workspace, they were empty cells.
Aaron Smith
on 10 Feb 2017
The same problem occurs with that R value. I changed the values, hedging my bets but it didn't make a difference to the result
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
@Aaron Smith: when my code works properly then the contents of Z will be empty at the end of all iterations. What were you expecting?
>> size(Z)
ans =
1 1
>> size(Z{1})
ans =
0 1
Much more interesting would be the value of k: please tell me what value k has.
Aaron Smith
on 10 Feb 2017
z is 1 in my workspace and k is also 1. There is now an error occurring with reshape: Error using reshape Product of known dimensions, 1025, not divisible into total number of elements, 1.
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
"z is 1" z is actually a cell array, so it cannot be equal to one. What do you really mean?
textscan is not reading the data file. Possibly the format is not as expected. Do the numbers have decimal digits, or exponent notation? Please run this and tell me exactly what values out has (it will be slow):
fid = fopen('file.txt','rt');
out = [];
while ~feof(fid)
tmp = unique(fgets(fid,1e5));
out = union(out,double(tmp));
end
fclose(fid);
disp(out)
And also show exactly what this displays:
fid = fopen('file.txt','rt');
str = fgets(fid,60)
fclose(fid);
Aaron Smith
on 10 Feb 2017
>> fid = fopen( 'TEST_A.asc', 'rt');
>> out = [];
>> while ~feof(fid)
tmp = unique(fgets(fid, 1e5));
out = union(out, double(tmp));
end
>> fclose(fid);
>> tmp
tmp =
067
>> out
out =
10 48 49 50 51 52 53 54 55 56 57
The data are all integers between 0 and 1000, though some may be over 1000. I just haven't been able to spot any numbers over 800. The file does have over 50 million numbers though.
fid = fopen( 'TEST_A.asc', 'rt' );
>> str = fgets(fid, 60)
str =
1
>> fclose(fid);
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
@Aaron Smith: the file contains newline characters (char 10), which means your original description of the file format "I have a very large text file composed of, in essence one row of numbers." is incorrect. Also your original question had code where you used textscan with semicolon delimiter. But there is not one single semicolon in the whole file.
As a result, that code tells textscan to read a file with a particular format, but that is not the format your file actually has. I wrote that code based on what you told me.
You can either experiment with textscan's options (e.g. EndOfLine, Delimiter, etc) yourself, or you can tell us exactly what format the file really has. If you want help then please upload a sample text file (the first two thousand numbers or so) in a new comment.
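As a starting point for that kind of experiment, here is a small sketch that runs textscan on a short character vector rather than on the file. The sample text is invented: it assumes semicolon-separated integers spread over several lines, which may or may not match the real file.

```matlab
% Hypothetical sample: two lines of semicolon-separated integers.
str = sprintf('1;658;671\n2;645;654');

% Treat ';' as the field delimiter and leave EndOfLine at its default,
% so the newline still ends each line.
C = textscan(str, '%d', 'Delimiter', ';');
vals = C{1}.';      % '%d' yields int32; transpose to a row for display

disp(vals)
disp(numel(vals))
```

Trying different 'Delimiter' and 'EndOfLine' settings on a few pasted lines like this is much quicker than rerunning the loop over the whole file.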
Image Analyst
on 10 Feb 2017
Aaron: Please attach 'TEST_A.asc' for further help.
Aaron Smith
on 10 Feb 2017
The file did not unzip correctly today so the file was not correct. I downloaded it again and unzipped it again
>> fid = fopen( 'TEST_A.asc', 'rt' );
str = fgets(fid, 60)
fclose(fid);
str =
1;658;671;661;686;672;662;645;654;669;675;650;688;666;664;66
This is the other test code you wrote
>> fid = fopen( 'TEST_A.asc', 'rt');
out = [];
while ~feof(fid)
tmp = unique(fgets(fid, 1e5));
out = union(out, double(tmp));
end
fclose(fid);
tmp
out
When I tried the original code with the newly properly unzipped file
R = 1025;
C = 1024;
fid = fopen('TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, 'EndOfLine', ';');
if ~isempty(z{1})
k = k + 1;
s = sprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, R, []), ';')
end
end
fclose(fid);
This gave an output for z which was a 1025 x 1 cell. This cell is the first row
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
@Aaron Smith: in your last comment you forgot to show us what values tmp and out have.
Try something like this:
opt = {'EndOfLine',';', 'CollectOutput',true};
...
z = textscan( fid, '%d', R*C, opt{:});
Aaron Smith
on 10 Feb 2017
R = 1025;
C = 1024;
opt = { 'EndofLine', ';', 'CollectOutput', true};
fid = fopen('TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, opt{:});
if ~isempty(z{1})
k = k + 1;
s = sprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, R, []), ';')
end
end
Error using reshape
Product of known dimensions, 1025, not divisible into total number of elements, 1.
I tried it again and got a different error
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
@Aaron Smith: What is k's value when you get that error?
You have been asked twice to upload a sample file. It will be difficult to help you further without it.
I know my code works: I tested it. I even gave you the code that I used to generate the fake data file. If there is any problem then it is because your data file does not match the expected format somehow. So we need to see it.
Could it be that the number of values in the file is not divisible by 50*1025 ? If so then you might need a special case to handle the last matrix. Again, knowing the value of k and a sample file would be helpful.
Stephen23
on 10 Feb 2017
Edited: Stephen23
on 10 Feb 2017
@Aaron Smith: Try this, it saves all blocks of 1025x1024 values in their own files, and if there are any values left over at the end it saves them in one row in a new file:
sbd = 'tempDir';
R = 1025;
C = 1024;
opt = {'EndOfLine',';', 'CollectOutput',true};
fid = fopen(fullfile(sbd,'temp0.txt'),'rt');
k = 0;
while ~feof(fid)
    k = k+1;
    Z = textscan(fid,'%d', R*C, opt{:});
    S = fullfile(sbd,sprintf('temp0_%02d.txt',k));
    if rem(numel(Z{1}),R)==0
        dlmwrite(S,reshape(Z{1},[],R).',';')
    else
        dlmwrite(S,Z{1},';')
    end
end
fclose(fid);
Note that I also added a transpose to get the data in the correct order.
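The effect of that transpose is easiest to see on a tiny vector. The sketch below uses made-up values (1 to 6, with R = 2 rows) and shows that MATLAB fills matrices column by column, so reshaping directly to R rows interleaves the values, while reshaping to R columns and transposing keeps consecutive values together in each row:

```matlab
v = 1:6;    % stand-in for the values in file order
R = 2;      % number of rows wanted

byRow = reshape(v, [], R).';   % [1 2 3; 4 5 6]  consecutive values share a row
byCol = reshape(v, R, []);     % [1 3 5; 2 4 6]  values run down the columns

disp(byRow)
disp(byCol)
```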
Aaron Smith
on 13 Feb 2017
Edited: Aaron Smith
on 13 Feb 2017
Thanks so much Stephen. I got an error on the code but I think it might be a problem with the file itself or the save path.
sbd = 'tempDir';
R = 1025;
C = 1024;
opt = {'EndOfLine', ';', 'CollectOutput', true};
fid = fopen(fullfile( sbd, 'TEST_A.asc' ), 'rt');
k = 0;
while ~feof(fid)
k = k+1;
Z = textscan(fid, '%d', R*C, opt{:});
S = fullfile( sbd, sprintf( 'TEST_A_A.asc', k ));
if rem(numel( z{1}), R)==0
dlmwrite(S, reshape( z{1}, [], R).', ';')
else
dlmwrite( S, z{1}, ';')
end
end
fclose(fid);
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier
I'm sure I'll be able to fix that. What does sbd do? Is it system build which builds the blocks or does it make fullfile create separate files for the blocks rather than build a full file from parts the way fullfile usually does or is it just the temporary name of the files?
Walter Roberson
on 13 Feb 2017
sbd is the name of the subdirectory to save the individual files into. You can set it to '' if you do not want to use a subdirectory to store them
Stephen23
on 13 Feb 2017
Edited: Stephen23
on 13 Feb 2017
sbd = 'tempDir';
is a subdirectory of the current directory. I put all of the files into this subdirectory because I did not want them cluttering up my current directory. You can make the subdir '' if you want to use the current directory, or (even better) learn to use directory paths and put your data in its own subdirectory.
Aaron Smith
on 13 Feb 2017
Yeah, I worked that out from reading pages on Matlab and by writing a description of the code. Thanks guys. Any idea what the problem with the file identifier might be? It came up before and seemed to just go away after a few times typing it out. That hasn't worked this time. It isn't the save path or the file name that is causing the problem as far as I know.
Stephen23
on 13 Feb 2017
@Aaron Smith: get the second output from fopen:
[fileID,errmsg] = fopen(...)
and read the error message. It always turns out to be a spelling mistake, folder permissions, or the file not being in the location where the user is looking.
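A short sketch of that check. The folder name is deliberately one that should not exist, so the failure branch runs; the exact text of errmsg depends on the operating system:

```matlab
% Hypothetical path that is expected to be missing.
fnm = fullfile('no_such_folder', 'TEST_A.asc');

[fid, errmsg] = fopen(fnm, 'rt');
if fid < 0
    % fopen returns -1 on failure, and errmsg says why
    fprintf('Could not open "%s": %s\n', fnm, errmsg);
else
    % ... read from fid here ...
    fclose(fid);
end
```

Testing fid right after fopen means the problem is reported where it happens, rather than later as "Invalid file identifier" from feof.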
Aaron Smith
on 14 Feb 2017
Edited: Aaron Smith
on 14 Feb 2017
When using fopen outside of the code itself, it works fine and doesn't create an error. The only thing I can think it could be is the fullfile and sbd in the fopen command. I tried taking it out, moving it but that creates errors with the code. Is there a way to put the fullfile(sbd, ...) part in a separate line?
sbd = 'tempdir';
R = 1025;
C = 1024;
opt = { 'EndOfLine', ';', 'CollectOutput', true };
>> fid = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
>> k = 0;
while ~feof(fid)
k = k + 1;
Z = textscan( fid, '%d', R*C, opt{:});
S = fullfile( sbd, sprintf( 'TEST_ASA.asc', k ));
if rem( numel( Z{1}), R)==0
dlmwrite( S, reshape( Z{1}, [], R).', ';')
else
dlmwrite( S, Z{1}, ';')
end
end
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier.
>> [fid, errmsg] = fopen( 'TEST_A.asc' )
fid =
9
errmsg =
''
I was thinking, looking at the fullfile page on mathworks, should I set up a folder to be a destination for the file?
f = fullfile('myfolder','mysubfolder','myfile.m')
I'm thinking it may be the subdirectory (sbd) that is causing the error
Stephen23
on 14 Feb 2017
@Aaron Smith: just get rid of the fullfile if you don't want it.
However I would recommend learning to use filepaths to access data files, as it makes your code faster and more reliable (e.g. compared to cd or other buggy ideas). Note that the file path I used is relative to the current directory, and that this may be different for the command window and the code that is being called: that path needs to exist relative to where the code runs from. One simple resolution is to always specify an absolute path. The internet is full of help on understanding relative/absolute paths, but you might as well start here:
"Is there a way to put the fullfile(sbd, ...) part in a separate line" Sure, it is just a function, you can put it wherever you want to.
Aaron Smith
on 15 Feb 2017
Is there a way for me to share my data file with you so that you can try your code with the actual data? The file is approximately 200mb
Stephen23
on 15 Feb 2017
Edited: Stephen23
on 15 Feb 2017
You could register with dropbox, mediafire, google drive, or one of the many other file sharing websites, and send me the link of the file (via my profile page: please also include a link to this thread otherwise the email will get deleted automatically).
Accepted Answer
Stephen23
on 15 Feb 2017
Edited: Stephen23
on 15 Feb 2017
Thank you for the file. What did I learn from the actual data file: that it is not "composed of a single row", but in fact there are 51200 rows in the file that I received.
Why is this important? Because computers are stupid, and they do exactly what they are told to do. Knowing how to read a file correctly requires knowing what format the file has. In this case it is also quite handy for us, because it is trivial to read and write lines without much processing.
The code below worked correctly for me, reading the 200 MB file, and creating 50 smaller files with the rows following the same order as the original file.
sbd = 'temp';
f2d = fopen(fullfile(sbd,'temp_01.asc'),'wt');
f1d = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
k = 0;
while ~feof(f1d)
    str = fgetl(f1d);
    if sscanf(str,'%d')==1
        k = k+1;
        fclose(f2d);
        fnm = fullfile(sbd,sprintf('temp_%02d.asc',k));
        f2d = fopen(fnm,'wt');
    end
    fprintf(f2d,'%s\n',str);
end
fclose(f1d);
fclose(f2d);
Note that:
- the size of the output matrices is 1024x1025 (because there are 1025 numbers per line). This is correct because the first number of each line is simply a line count (check the files and you will see).
- the lines are exactly the same as the original file.
- MATLAB holds only one line in memory at a time: the lines are simply read from the large file and written directly to a new file.
- as a result: no matrix, no converting from string to numeric and back to string.
- it is slow because the file is large... reading and writing 51200 lines of 1025 numbers each will take some time.
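To try the same idea without the 200 MB file, here is a self-contained sketch on a tiny stand-in file. Everything about the sample is an assumption made for illustration: the scratch folder, the file names, and the layout of two blocks of three lines, each line starting with a counter that restarts at 1.

```matlab
% Build a small stand-in file: 2 blocks of 3 lines, each "count;a;b".
sbd = tempname;                          % hypothetical scratch folder
mkdir(sbd);
src = fullfile(sbd, 'demo.asc');
fid = fopen(src, 'wt');
for blk = 1:2
    for row = 1:3
        fprintf(fid, '%d;%d;%d\n', row, 100*blk + row, 200*blk + row);
    end
end
fclose(fid);

% Split: start a new output file whenever the leading counter is 1.
fin  = fopen(src, 'rt');
fout = -1;
k    = 0;
while true
    str = fgetl(fin);
    if ~ischar(str), break; end          % fgetl returns -1 at end of file
    if sscanf(str, '%d', 1) == 1         % first number on the line
        if fout > 0, fclose(fout); end
        k = k + 1;
        fout = fopen(fullfile(sbd, sprintf('demo_%02d.asc', k)), 'wt');
    end
    fprintf(fout, '%s\n', str);          % copy the line unchanged
end
fclose(fin);
if fout > 0, fclose(fout); end

fprintf('Wrote %d block files to %s\n', k, sbd);
```

This variant opens the first output file only when the first counter is seen, so it does not need the initial fopen of the first block file; it assumes the first line of the input starts with 1.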
7 Comments
Aaron Smith
on 16 Feb 2017
Thanks Stephen. I knew about the line count number, I was just attributing it to columns rather than rows. I did think it was all one single row of data. Anyway, thanks so much for your continued help. There is an error message showing up but I'm not sure if there is a fix for it.
>> sbd = 'temp';
>> fid2 = fopen(fullfile( sbd, 'temp_01.asc'), 'w');
>> fid1 = fopen(fullfile( sbd, 'TEST_A.asc' ), 'r');
>> k = 0;
>> while ~feof(fid1)
str = fgetl(fid1);
if sscanf( str, '%d' )==1
k = k + 1;
fclose(fid2);
fnm = fullfile( sbd, sprintf( 'temp_%02d.asc', k));
fid2 = fopen( fnm, 'w');
end
fprintf(fid2, '%s\n', str);
end
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier.
[fid1, errmsg] = fopen( 'TEST_A.asc' )
fid1 =
6
errmsg =
''
>> [fid2, errmsg] = fopen( 'test_01.asc', 'w')
fid2 =
7
errmsg =
''
Stephen23
on 16 Feb 2017
Edited: Stephen23
on 17 Feb 2017
"i'm not sure if there is a fix for it."
You need to provide the correct filepath for your files. I put all of my files into one sub-directory of the current path named "temp". That worked for me. Do you see "temp" at the start of my code?
Imagine that you tell MATLAB (or any other programming language that has ever existed) to open this file 'C:\Temp\myfile.txt'. But what should happen if there is no such file in that location? The programming language cannot read your mind: it cannot guess that you actually meant another location, e.g. 'C:\Temp\testfiles\myfile.txt', or that the file is actually called 'my_mistake.csv'. YOU are the one who has to know where your files are, and YOU have to provide the correct path to fopen (via fullfile if used).
So look at my code: I used a sub-directory named "temp". My files were all in that sub-directory. So I told MATLAB to look in that sub-directory. But when you test for those files like this:
[fid1, errmsg] = fopen( 'TEST_A.asc' )
Where is it looking?: ONLY IN THE CURRENT DIRECTORY. You did not tell fopen to look in any sub-directory, or in any other directories anywhere in your computer, or even anywhere else in the known universe. Just the current directory. Let me ask a question: is the file 'TEST_A.asc' in the current directory? If the answer is no, then why are you telling MATLAB to look for it in the current directory?
fopen failures are most commonly caused by one thing: users not giving the correct path (which includes spelling mistakes of the name).
"i'm not sure if there is a fix for it."
The fix is that you provide fopen with the correct path.
PS: [fid2, errmsg] = fopen( 'test_01.asc', 'w') is a pointless test because it just creates that file wherever you tell it to: see the "w" option? That creates a file. It does not care where.
PPS: Why did you get rid of the t option? You should keep it (unless you plan on doing strange things with EOL characters). Removing random things is not a good way of making code work.
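The path advice above can be sketched as follows. The folder name 'temp' under the current directory is only an example; substitute wherever the data really lives. The path is built on its own line and checked before fopen is called:

```matlab
sbd = fullfile(pwd, 'temp');           % absolute path to a hypothetical subfolder
fnm = fullfile(sbd, 'TEST_A.asc');     % full path, built on its own line

if exist(sbd, 'dir') ~= 7
    fprintf('Folder not found: %s\n', sbd);
elseif exist(fnm, 'file') ~= 2
    fprintf('File not found: %s\n', fnm);
else
    fid = fopen(fnm, 'rt');            % keep the 't' option for text files
    % ... read from fid here ...
    fclose(fid);
end
```

Because the path starts from pwd, printing sbd also shows exactly which directory the code is running from.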
Walter Roberson
on 16 Feb 2017
"fopen failures are caused by one thing: users not giving the correct path"
Well, that and permission errors. And networked file access to a server that is not accessible. And bugs in file sharing applications like DropBox. And bugs in using UNC paths. And VPN setup. And encryption certificate problems. And full disks. ...
Aaron Smith
on 20 Feb 2017
Thanks Stephen. I did eventually get the code working. The problem was with the save path. I had to specify destinations with the entire path (C\ files\ folder\ folder). You mentioned the first number on each line, the line number (1, 2, 3, 4 etc). Is there a way to remove or ignore this the way the headerlines command in the textscan function does?
Stephen23
on 20 Feb 2017
Edited: Stephen23
on 21 Feb 2017
You could use the sscanf call to get an index, e.g.:
>> str = '10;123;456;789;0;123;';
>> [row,~,~,idx] = sscanf(str,'%d')
row =
10
idx =
3
Or in my answer (untested):
sbd = 'temp';
f2d = fopen(fullfile(sbd,'temp_01.asc'),'wt');
f1d = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
k = 0;
while ~feof(f1d)
    str = fgetl(f1d);
    [row,~,~,idx] = sscanf(str,'%d');
    if row==1
        k = k+1;
        fclose(f2d);
        fnm = fullfile(sbd,sprintf('temp_%02d.asc',k));
        f2d = fopen(fnm,'wt');
    end
    fprintf(f2d,'%s\n',str(idx+1:end));
end
fclose(f1d);
fclose(f2d);
Aaron Smith
on 21 Feb 2017
Thanks Stephen, that code works as far as I can see. What may I ask are the two ~ in the code doing?
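On the two ~ characters: a sketch of what they appear to be doing, using the sample string from the comment above. sscanf can return four outputs (the values read, how many were read, an error message, and the index of the next unread character). A ~ in an output list tells MATLAB to discard that output, so only the first and fourth are kept here:

```matlab
str = '10;123;456;789;0;123;';

% Keep the first and fourth outputs, discard the second and third.
[row, ~, ~, idx] = sscanf(str, '%d');

% '%d' stops at the first ';', so idx points at that semicolon
% and everything after it is the line without its leading counter.
rest = str(idx+1:end);

disp(row)
disp(idx)
disp(rest)
```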