File is all numeric, but csv read does not work fully

Question

Jenna P on 6 Apr 2016

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/277582-file-is-all-numeric-but-csv-read-does-not-work-fully

Edited: dpb on 20 Apr 2016

Hi, I'm trying to read in a bunch of data files one at a time as a matrix, use the find function to locate a certain z location and, when I find it, store that row of data in a new matrix. My problem is that no matter what I do, I get this error:

??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 781666, field 4) ==>
Error in ==> csvread at 52
    m=dlmread(filename, ',', r, c);
Error in ==> Velocity_AtPt_vs_Time at 67
    datafile=csvread(fullname,1);

The data files all have the same layout: 1 row of column headers and 14 columns of all-numeric data. I set csvread to skip the first row and read everything else. My files are approximately 1 million rows x 14 columns.

What's happening is that the code executes for 69 data files, doing exactly the steps I want and filling the new matrix properly, and then stops and gives me this error after the 69th. I have tried taking away the 70th and 71st files to see what happens; it now stops at 67 files. Very odd. If anyone has suggestions, please let me know! Thanks for reading.

This is my loop that receives the error message:

for k = 1:numel(filenames) 
  % Create full file name and partial filename
  fullname = [currentfolder filesep NEWFileNames(k).name];
  % Read in data
  datafile=csvread(fullname,1);
  [rr1,cc1] = find(datafile(:,z)==0.0075000000008515);
  firstrow1 = rr1(1,1);
  firscol1 = cc1(1,1);
  dataset1(k,:)=datafile(firstrow1,:);
end

Note: rr1 and cc1 are only there so that I can take the first instance where this value shows up; they are not the source of the error in this code.

15 Comments

Jenna P on 7 Apr 2016

Maximum possible array:              16920 MB (1.774e+010 bytes) *
Memory available for all arrays:     16920 MB (1.774e+010 bytes) *
Memory used by MATLAB:                 780 MB (8.178e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.
Maximum possible array:              17601 MB (1.846e+010 bytes) *
Memory available for all arrays:     17601 MB (1.846e+010 bytes) *
Memory used by MATLAB:                 780 MB (8.178e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.
Maximum possible array:              17325 MB (1.817e+010 bytes) *
Memory available for all arrays:     17325 MB (1.817e+010 bytes) *
Memory used by MATLAB:                 780 MB (8.178e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.
Maximum possible array:              17211 MB (1.805e+010 bytes) *
Memory available for all arrays:     17211 MB (1.805e+010 bytes) *
Memory used by MATLAB:                 778 MB (8.162e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.
Maximum possible array:              17149 MB (1.798e+010 bytes) *
Memory available for all arrays:     17149 MB (1.798e+010 bytes) *
Memory used by MATLAB:                 778 MB (8.162e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.
Maximum possible array:              17068 MB (1.790e+010 bytes) *
Memory available for all arrays:     17068 MB (1.790e+010 bytes) *
Memory used by MATLAB:                 778 MB (8.162e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.
Maximum possible array:              16993 MB (1.782e+010 bytes) *
Memory available for all arrays:     16993 MB (1.782e+010 bytes) *
Memory used by MATLAB:                 776 MB (8.135e+008 bytes)
Physical Memory (RAM):               12222 MB (1.282e+010 bytes)
*  Limited by System Memory (physical + swap file) available.

This is the result of calling memory inside the loop, although I'm not sure how to interpret this output.

dpb on 7 Apr 2016

Edited: dpb on 7 Apr 2016

If you run the script multiple times without changing anything, does it fail in the same place every time?

Also, one thing you could try that might save just a tiny fraction of memory would be to rewrite the find operation a little...

[rr1,cc1] = find(datafile(:,z)==0.0075000000008515,1,'first');
dataset1(k,:)=datafile(rr1,:);

This will stop the lookup after the first match is found. Also, I note that z isn't defined in the code snippet; I presume it is simply a constant?

IA has mentioned the issue of floating-point comparison, but that would leave the result of find empty, which would produce a different error...

Also, since you're looking for a particular value within the file, and the file is quite large, it might be advantageous to use a little more sophistication rather than simply trying to read the whole thing in one swell foop. Instead, use textscan to read the file in sections; if the point happens to be fairly early in the file, you can stop reading before having to process the whole thing.

Is there any way to guesstimate how far into the file the location of interest is?
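To illustrate the floating-point point, here is a minimal sketch of a tolerance-based search. The column values and the tolerance tol are made up for illustration; only the target value comes from the question, and a suitable tolerance depends on the actual data.

```matlab
% Hypothetical stand-in for one column of a data file.
zcol   = [0.0079; 0.0075000000008515 + 1e-15; 0.0075; 0.0071];
target = 0.0075000000008515;   % value searched for in the question
tol    = 1e-13;                % assumed tolerance; choose to suit the data

% Exact equality misses a value that differs only by round-off.
exactRow = find(zcol == target, 1, 'first');

% Comparing within a tolerance finds it.
tolRow = find(abs(zcol - target) < tol, 1, 'first');

if isempty(tolRow)
    error('No row within %g of the target value.', tol);
end
```

Checking isempty before indexing avoids the separate indexing error that an empty find result would otherwise cause.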

dpb on 7 Apr 2016

Edited: dpb on 7 Apr 2016

textscan is quite different in one significant way: it operates on an open file identifier and can be called repeatedly to read the file sequentially, whereas csvread, and dlmread, the base routine it calls to do the actual work, only operate by reading the entire file into memory at once.

As noted before, if the file in question is in fact read correctly on its own, it seems there is not actually an error in the file itself but some memory interaction. That it seems to be exactly reproducible surprises me somewhat, though; I was expecting a somewhat random component to be involved.

But, as far as that goes, I'd suggest submitting the symptoms as a bug report to TMW; perhaps they would be willing to try to reproduce the issue. It's too much data to be able to provide via the Answers forum for any of us to attempt directly.

BUT, back to getting to a workaround to get your project underway, look at the following doc page and consider how you might follow their guidelines there. If you could provide any algorithm by which you could estimate the location of the value in question by knowing something of the way in which the coordinates are laid out, that would help. If you need some help implementing this, attach a smaller subsection of a file and I'd be glad to help (as I'm sure many others will be as well).

Releases newer than mine also have a memory-mapping facility for text files, similar to memmapfile for stream files, that might be of help if you have a recent-enough release. I don't, so I really have no experience with it or how well it might handle your particular problem.

Anyway, there's more than one way to skin the cat here... :)

ADDENDUM

If you know the location you're after is always well down in the file (I suppose you aren't so lucky that it is at the same location in every file, which would make things much simpler), try setting the initial row number to a large integer that is still less than the line being searched for. Then, while I'd expect speed to go way down, maybe you'll resolve the memory-related issue.

I'm wondering, however, if the problem isn't actually memory per se but OS file handles or the like that aren't being released, so that system resources are the ultimate cause of the failure and the error message is a red herring as to the root cause. If that were the case, the above would likely fail as well, but I think it's worth a try since it only requires changing the one constant in the existing script/function.
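The row-offset suggestion in this addendum could look something like the sketch below. The file contents, the offset of 6 and the searched value 22 are all invented for the demonstration; the point is only that csvread accepts a zero-based starting row and that row indices found in the partial read must be shifted back by that offset.

```matlab
% Write a small hypothetical CSV: one header line, then 10 rows x 3 columns.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'a,b,c\n');
fprintf(fid, '%d,%d,%d\n', reshape(1:30, 3, []));
fclose(fid);

rowOffset = 6;                        % zero-based: skips header + 5 data rows
part = csvread(fname, rowOffset, 0);  % reads data rows 6..10 only

% Search the partial read, then convert back to a row number in the
% full data set (header line excluded).
localRow = find(part(:,1) == 22, 1, 'first');
dataRow  = localRow + rowOffset - 1;

delete(fname);
```

If the value can sit anywhere in the file, as turned out to be the case here, a fixed offset risks skipping past it, so this is mainly useful as a diagnostic.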

Jenna P on 7 Apr 2016

Edited: Jenna P on 7 Apr 2016

Hi dpb, thanks for your thoughtful response. I'll give textscan a shot and see how it goes. In the meantime, I was processing some other data which required a similar approach (ironically I used tolerances even before we discussed it), but it works for a small number of files and then stops with many more. Since this is an entirely separate and different code that also processes many files, it seems to be a memory issue at this point. This time the files are about 150,000 rows by 14 columns. The error message is nearly identical, but it complains about a different row and field.

Unfortunately, the coordinates seem to be mapped randomly to the file. I had also entertained the idea of using the same exact row from each file instead of the find function as a test, but the error still persists. I'll give using a lower row number a chance, although that's not the goal of course. Perhaps textscan will be the fix, since csvread is the common factor in both codes.

Edit: Taking files away from the 150,000 by 14 set allowed it to run through with no problem. Clearly a memory issue. Since I'm trying to average results, I may have to limit how many files I average.

Jenna P on 19 Apr 2016

Edited: Jenna P on 19 Apr 2016

I still have not found a solution to this problem. It does not make sense to me.

Edit: Actually, using xlsread instead of csvread may have worked, but it is painfully slow.

dpb on 19 Apr 2016

Did you try the textscan solution? The first response in my earlier Answer should work by simply substituting it (along with the appropriate fopen/fclose pair, of course) for csvread.

I'd surely suggest making that attempt before going to xlsread. If something peculiar about csvread/dlmread is causing the error (which I think is a resource issue), textscan is standalone, and if it also aborts, that's a pretty strong indication that the problem is more fundamental.

Also, did you file a Service Request with TMW Support on the issue?

Answer 1

dpb on 7 Apr 2016

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/277582-file-is-all-numeric-but-csv-read-does-not-work-fully#answer_216937

Edited: dpb on 20 Apr 2016

OK, to separate this from the long-winded chain of comments... this isn't the full answer yet, but a starting point for a textscan solution.

>> fid=fopen('file70_part.csv');  % open the file
>> d=cell2mat(textscan(fid,'','headerlines',1,'delimiter',',','collectoutput',1));
>> whos d
  Name       Size            Bytes  Class     Attributes
    d         26x14             2912  double              
>> fid=fclose(fid);
>>

The above is all that is needed to read the full file. I've done a couple of things worth noting:

Used the empty string '' for the format string. This has the effect that MATLAB determines the number of fields per record automagically and returns the proper shape; otherwise you have to know the number per record and write a specific format string to match.
Used cell2mat around the textscan call to return the data as a double array rather than the cell array otherwise returned. 'collectoutput' serves to make a single array, not 14.

What's not shown here is reading a counted number of records at a time. That can be as simple as the following (after the fopen, of course):

>> fgetl(fid);       % get, throwaway the header row
>> while ~feof(fid)  % until run out of data
    d=cell2mat(textscan(fid,'',5,'delimiter',',','collectoutput',1));
    d(:,14).'
   end
ans =
  0.0079    0.0078    0.0076    0.0070    0.0071
ans =
  0.0072    0.0074    0.0075    0.0088    0.0087
ans =
  0.0085    0.0084    0.0083    0.0081    0.0080
ans =
  0.0079    0.0078    0.0076    0.0075    0.0074
ans =
  0.0071    0.0072    0.0087    0.0085    0.0084
ans =
  0.0083
>>

The last read returns a short group since the number of rows isn't evenly divisible by the group size; there were 26 lines of data. In your case, you'll end the loop early because (hopefully) your search for the particular value will have succeeded; then you break, fclose, do whatever you need with the data you found and go on to the next file.

NB: I had previously forgotten to remove the 'headerlines',1 parameter, so the loop was skipping a record on each pass. The single header record at the beginning of the file is already accounted for by the fgetl call before the loop begins.
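Putting the pieces together, the chunked read with an early exit might be sketched as follows. This is only an illustration under stated assumptions: the test file, the planted value 0.0075 in row 17, the tolerance tol and the chunk size are all invented here, and the search column is taken to be column 14 as in the display above.

```matlab
% Build a small hypothetical file: header line plus 26 rows x 14 columns.
fname = [tempname '.csv'];
data  = reshape(1:26*14, 26, 14) / 1000;
data(17, 14) = 0.0075;                  % plant the value to search for
fid = fopen(fname, 'w');
fprintf(fid, [repmat('h,', 1, 13) 'h\n']);                  % header row
fprintf(fid, [repmat('%.17g,', 1, 13) '%.17g\n'], data.');  % one row per line
fclose(fid);

zcol   = 14;      % column holding the z location (assumed)
target = 0.0075;  % value to look for (assumed)
tol    = 1e-12;   % comparison tolerance (assumed)
chunk  = 5;       % records read per textscan call

found = []; foundRow = []; nRead = 0;
fid = fopen(fname, 'r');
fgetl(fid);                             % discard the header row
while ~feof(fid)
    d = cell2mat(textscan(fid, '', chunk, ...
        'delimiter', ',', 'collectoutput', 1));
    if isempty(d), break; end
    idx = find(abs(d(:,zcol) - target) < tol, 1, 'first');
    if ~isempty(idx)
        found    = d(idx, :);           % the matching row of data
        foundRow = nRead + idx;         % its row number within the file's data
        break                           % stop reading the rest of the file
    end
    nRead = nRead + size(d, 1);
end
fclose(fid);                            % always release the file identifier
delete(fname);
```

In the original loop over files, the body between fopen and fclose would replace the csvread call, with found assigned to dataset1(k,:). A larger chunk size than 5 would be more sensible for files with a million rows.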
