How can I display the progress of textscan?

I am using textscan to read and parse data files that vary widely in size. For small files it is very fast, but it is not uncommon to read files in excess of 700 megabytes.
Since these files can take over a minute to load, it would be helpful to show the user that progress is being made.
Is there a way to get the progress of textscan? Ideally I'd send it to waitbar() or something similar, but any indication I can give my user would be a major improvement.
I have zero ideas at this point without implementing my own textscan, and that's not something I'm excited to undertake.

8 Comments

dpb on 23 Oct 2014
Edited: dpb on 24 Oct 2014
No can do...when you pass control to textscan it's got it until it returns.
You could check the file size first, then split the read into several textscan calls, each handling one block of the file. After each block finishes, update an indicator, then repeat until done.
textscan is a built-in function with no callback functionality, so you can't take the expedient of adding anything to an m-file. I think the above is about all you can do.
ADDENDUM
Do you need the full flexibility of textscan to parse the files? If the structure is regular, perhaps fscanf or one of the other "less developed" input routines would suffice and speed up the process significantly, since they avoid the overhead of textscan's internal complexity.
I was unaware that I could specify what portion of a file textscan processes. I just reviewed the documentation a bit and it looks like I can specify a block size, but it's not clear to me that I can easily do that without inadvertently splitting a line in the file (it's essentially a csv, with newline at the end of each entry).
I'll look at fscanf also - I've never really used it, so that may be a good approach.
Thanks for the ideas
help textscan
...
C = textscan(FID,'FORMAT',N) reads data from the file, using the FORMAT
N times, where N is a positive integer. To read additional data from
the file after N cycles, call textscan again using the original FID.
...
What's the format string you're using?
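As a rough sketch of the block-wise idea discussed above (not a tested solution for the real files): call textscan repeatedly with a repeat count N, and after each call report ftell(fid) divided by the file size. The stand-in file, the block size, and the `report` handle are all made up for illustration; the format string is the single-column one quoted elsewhere in this thread.

```matlab
% Write a tiny stand-in file so the sketch runs as-is (real files are far larger).
filename = [tempname '.csv'];                   % hypothetical temporary file
fid = fopen(filename, 'w');
for k = 1:10
    fprintf(fid, '2014/194/15:57:17.%06d, , ,Stream %d,SC,desc, , ,\n', k, mod(k, 3));
end
fclose(fid);

fmt       = '%*s %*s %*s %s %*[^\n]';           % keep column 4, skip the rest of the line
blockSize = 4;                                  % lines per textscan call (assumed, tune it)
report    = @(frac) fprintf('%3.0f%% read\n', 100*frac);
% With a display available, something like this could replace the line above:
%   h = waitbar(0, 'Reading...'); report = @(frac) waitbar(frac, h);

info = dir(filename);                           % info.bytes is the file size in bytes
fid  = fopen(filename, 'r');
ids  = {};
while ~feof(fid)
    posBefore = ftell(fid);
    C = textscan(fid, fmt, blockSize, 'Delimiter', ',');
    if ftell(fid) == posBefore, break; end      % guard: stop if nothing was consumed
    ids = [ids; C{1}];                          % accumulate this block's column
    report(ftell(fid) / info.bytes);            % progress = bytes consumed / total bytes
end
fclose(fid);
delete(filename);
```

Because each format cycle ends with %*[^\n], a block should always stop at a line boundary, so no line gets split between calls. A larger block size means fewer progress updates but less per-call overhead.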
I have 2 functions that I've been using that are a little different. One pre-processes my large files to look for unique data streams, so it only reads one column, to which I apply unique()
% Read all data stream ID strings
% -------------------------------------------------------------------------
allData = textscan(fid,'%*s %*s %*s %s %*[^\n]','Delimiter',',');
Then, after I've grepped those lines to individual files (it is not uncommon for them to be in excess of 800 megs) I do a full text scan and parse the data out.
Q = textscan(fid, '%s %*s %*s %s %s %s %*s %s %s', 'Delimiter', ',');
I know that's many disk accesses, but when I'm reading gigs of data, my computer has a tendency to choke on the giant cell arrays, so this seemed the most expedient approach. Also, if the routine dies halfway through, I have at least done some usable work and can resume parsing where I left off.
I can probably add %*[^\n] to the second one to ensure a line by line read through… Think that's necessary?
"I can probably add %*[^\n] to the second one to ensure a line by line read through… Think that's necessary?"
Will be if you want to read subsections of the file in order to come back to the calling routine intermittently and update a progress indicator.
Are the data you're after actually strings or are you then also doing str2double or the like?
Are the files fixed-width columns or just comma-delimited variable length fields?
In the first case, they are actually strings (unique ID strings).
In the second case, the data type varies, so I have been importing them as strings and then converting them in a switch block based on the value in one of the columns (we have mixed data types - bool, floats, ints… tons of fun).
Also, in some cases I use str2xxx, but sometimes that appeared to be a bottleneck, so I also use
sscanf(sprintf('%s ', cellArrayofStrings{:,1}),'%f')
which yielded significant speed improvements.
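For anyone reading along, here is a small sketch of how that one-call conversion compares with str2double. The variable names and the tiny cell arrays are invented for illustration; the numeric strings are copied from the sample data in this thread.

```matlab
% Numeric strings as textscan would return them in a cell column.
strs = {'1.000000000000000000E+02'; '2.785333333333330685E+01'; '1'};

% Join everything into one space-separated string and parse it in a single call.
viaSscanf     = sscanf(sprintf('%s ', strs{:}), '%f');
viaStr2double = str2double(strs);               % element-by-element conversion

% Caveat: sscanf returns only what it can parse, so an empty entry
% silently drops out and the result no longer lines up with the rows.
withGap       = {'1'; ''; '3'};
gapSscanf     = sscanf(sprintf('%s ', withGap{:}), '%f');
gapStr2double = str2double(withGap);            % keeps one element per row, NaN for ''
```

The sscanf route only stays row-aligned when every entry is a valid number, which may be why it suits the purely numeric streams better than the mixed ones.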
It would be more complicated to change the initial parsing string, as some of the streams I'm parsing have multiple data types associated… I could conceivably run many more parsing steps and then bundle the results afterward.
Fortunately, for the mixed data cases, they are typically much less data than the other streams; by orders of magnitude.
And finally, they are comma-delimited with variable length. I actually get a second text file from our hardware that is fixed width with no delimiter, but it's so much larger I don't use it for my parser.
One suggestion would be to move the sscanf step to fscanf on the file itself.
I think that to give any further real input I would need to see a section of the file under discussion, along with what you want from it.
It doesn't have to be large, just enough to illustrate the format and whatever "issues" it has, although it sounds like it is at least regular.
Thanks for all your time on this - here is a snippet of data that is representative:
2014/194/15:57:17.272106, , ,RP1 PCVNC-1014 Globe Valve Cmd,SC,FGSE M-P0A-RP1-PCV-1014 RP1 Tanking Low Flow Control PCVNC, , ,
2014/194/15:57:17.272405, , ,RP1 PCVNC-1014 Globe Valve Cmd Def Response,CR,, ,ACK,
2014/194/15:57:17.273198, , ,RP1 PCVNC-1015 Globe Valve Cmd,SC,FGSE M-P0A-RP1-PCV-1015 RP1 Tanking High Flow Control PCVNC, , ,
2014/194/15:57:17.273759, , ,RP1 PCVNC-1015 Globe Valve Cmd Def Response,CR,, ,ACK,
2014/194/15:57:17.283244, , ,AIR RV-0003 Position Mon, A,ECS E-P0A-AIR-RV-0003 AIR Position Sensor,----------------,1.000000000000000000E+02,%open,
2014/194/15:57:17.443121, , ,LO2 PCVNO-2014 Globe Valve Mon, A,FGSE M-P0A-LO2-PCV-2014 LO2 Tanking Low Flow Control PCVNO,----------------,2.785333333333330685E+01,%open,
2014/194/15:57:17.683124, , ,AIR RV-0004 Position Mon, A,ECS E-P0A-AIR-RV-0004 AIR Position Sensor,----------------,1.000000000000000000E+02,%open,
2014/194/15:57:17.783104, , ,AIR RV-0004 Position Mon, A,ECS E-P0A-AIR-RV-0004 AIR Position Sensor,----------------,1.000000000000000000E+02,%open,
2014/194/15:57:17.921458, , ,LO2 PCVNO-2013 Globe Valve Cmd,SC,FGSE M-P0A-LO2-PCV-2013 LO2 Tanking Low Flow Control PCVNO, , ,
2014/194/15:57:17.921643, , ,LO2 PCVNO-2013 Globe Valve Cmd Def Response,CR,, ,ACK,
2014/194/15:57:35.789884, , ,LO2 PCVNC-2059 State,UI,PCVNC-2059 Tank Vapor Sply Press Ctl Valve,----------------,1,
Fortunately, it is regularly structured. But there is a mixture of data types and not every line contains a numerical value to extract.
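A minimal sketch of the "import as strings, then convert by type" step for rows like these. The two cell columns are hard-coded stand-ins for what textscan might return, with values copied from the sample. Treating 'A' and 'UI' as the numeric type codes is only a guess from the sample rows, not something confirmed in the thread.

```matlab
% Type-code and value columns, as they might come back from textscan.
typeCode = {'SC'; 'CR'; 'A'; 'A'; 'UI'};
valueStr = {''; 'ACK'; '1.000000000000000000E+02'; '2.785333333333330685E+01'; '1'};

numVal = nan(size(valueStr));                   % NaN marks rows with no numeric value
for k = 1:numel(valueStr)
    switch typeCode{k}
        case {'A', 'UI'}                        % assumed numeric types
            numVal(k) = str2double(valueStr{k});
        otherwise                               % e.g. 'SC', 'CR': leave as text
            % nothing to convert; valueStr{k} stays available as a string
    end
end
```

Keeping a NaN placeholder for the non-numeric rows means the numeric vector stays the same length as the timestamp column, which may make later indexing simpler.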

Answers (0)

Asked: 22 Oct 2014

Commented: 25 Oct 2014
