How do I determine the number of data rows and columns in a 'messy' CSV and then import it?

I have a CSV file that looks like the following:
"","JOHN DOE","p0123TP01, Approved"
"","Rectum (4)(Volume: 57.77)","CTV (21)(Volume: 32.39)"
"0.0","100.0","100.0"
"0.1","78.01999","100.0"
"0.2","76.2245","100.0"
"0.3","75.21035","100.0"
[skipping to the end of the file]
"58.1","","1.88277"
"58.2","","0.21338"
"58.3","","0.0"
but the number of data sets varies, so I must write a flexible script that accommodates however many data rows and columns exist:
"","DOE, JOHN","revised blurring for Case 16, ANON, original; calculating correlation between blurred dose and toxicity","","","","","","","","","","","","","",""
"","patient (1)(Volume: 18578.06)","Prostate (2)(Volume: 19.80)","Proximal-SV (3)(Volume: 6.40)","Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)","Intestine (6)(Volume: 8.40)","CutLine (7)(Volume: 415.77)","Air (8)(Volume: 5.06)","InsideRectum (9)(Volume: 36.17)","PTV1 (10)(Volume: 76.77)","PTVoriginal (11)(Volume: 80.89)","PTV2 (12)(Volume: 56.52)","Guide1 (13)(Volume: 81.79)","Guide2 (14)(Volume: 68.50)","guide (15)(Volume: 66.91)","CTV-operator (16)(Volume: 26.15)"
"0.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.1","12.88921","100.0","100.0","69.70608","73.36377","2.54878","82.87909","16.40558","73.91824","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.2","12.06455","100.0","100.0","67.50839","69.03589","0.88936","79.92729","16.40558","71.92567","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.3","11.56745","100.0","100.0","65.82671","67.72518","0.32793","77.7594","16.40558","70.44129","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.4","11.09599","100.0","100.0","64.40629","66.56731","0.15624","75.78776","16.40558","69.10236","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.5","10.1524","100.0","100.0","63.25727","65.87648","0.07211","73.65644","16.40558","68.12462","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.6","9.48733","100.0","100.0","61.64318","64.92351","0.0103","67.57963","16.40558","65.96868","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.7","9.1257","100.0","100.0","60.35885","62.43443","0.0","59.25954","16.40558","65.12147","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
[... skipping to end of file ...]
"58.1","0.00498","4.57181","0.23601","","","","","","","1.20431","1.143","1.63577","1.13037","1.34968","1.38174","3.51704"
"58.2","4.9E-4","0.46111","0.0","","","","","","","0.1189","0.11285","0.1615","0.1116","0.13326","0.13642","0.34903"
"58.3","0.0","0.0","","","","","","","","0.0","0.0","0.0","0.0","0.0","0.0","0.0"
What textscan formatSpec do I need to store this data? The following code results in a 1x1 cell array whose only element is a 0x2 empty double:
function [StudyIDcol,DistributionType,OrganSpecification,Doses,Volumes,DoseUnit,VolumeUnit] = CreateTableVariables(folderpath,type,StudyID)
% First we read the file. Code courtesy of Walter Roberson of the MathWorks community.
fid = fopen([folderpath,type,'/',StudyID,'.csv'],'rt'); % rt = 'read in text mode'
if fid == -1 % added for debugging at the suggestion of Jan Hansen of the MathWorks community
    error('Author:Function:OpenFile', 'Cannot open file');
end
num = 0;
% headers = cell(...); % unfinished
while true % endless loop requiring a 'break' command to exit
    H1 = fgetl(fid);
    if feof(fid); break; end
    H2 = fgetl(fid);
    if feof(fid); break; end
    datacell = textscan(fid, '%f%f', 'Delimiter', ',', 'CollectOutput', true); % originally '%f%f'
    if isempty(datacell) || isempty(datacell{1}); break; end
    % if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end % commented out to verify the command above exits the loop
    num = num + 1;
    headers(num,:) = {H1, H2};
    data(num) = datacell; % store it in the data cell array only after checking that it contains data
    % fgetl(fid); % the empty line between organs
end
fclose(fid);
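The empty result is likely because the %f conversions never match the double-quoted fields such as "0.0", so textscan stops at the very first field. A hedged adjustment for the three-column example file above (a sketch only, not the accepted solution) is to read every field as a quoted string with %q and convert afterwards:
% Read each field with %q so the surrounding quotes are stripped, then convert.
datacell = textscan(fid, '%q%q%q', 'Delimiter', ',', 'CollectOutput', true);
numbers  = str2double(datacell{1});   % N-by-3 numeric array; empty fields become NaN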
I am reading the textscan formatSpec documentation and other resources while trying to solve this problem. Please help me understand how to use textscan.
I am now trying to write a script to flexibly import such files. I must learn how to scan the file to count the number of rows and columns of data, which begins at "0.0","100.0","100.0" (the values will not necessarily be the same each time):
%function data = ReadMIMDVH(filepath)
filepath = '/home/sony/Documents/research/data/MIM Cumulative DVH/planned/1.csv';
% We write a script specifically to import MIM data.
% First we scan the file to see how many rows and columns of data it has.
% We construct the formatSpec from the column count; textscan reapplies the
% format to every row, so only 'columns' (still to be determined) is needed here.
formatSpec = ['%*s %*s %*s %*s %q %q', repmat('%f', [1, columns])];
% We read the file using textscan.
fileID = fopen(filepath);
DVH = textscan(fileID, formatSpec, 'Delimiter', {',', '\n'});
fclose(fileID);
%end
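For reference, one possible way to get those row and column counts before building the formatSpec is sketched below; it is only an illustration (not taken from any posted answer) and assumes the first two lines are headers and every remaining line is a quoted, comma-separated data row.
% Hedged sketch: scan the file once to count data rows and columns.
fid = fopen(filepath, 'rt');
fgetl(fid); fgetl(fid);                          % skip the two header lines
firstDataLine = fgetl(fid);
columns = numel(strsplit(firstDataLine, ','));   % fields in one data row
rows = 1;
while ischar(fgetl(fid))
    rows = rows + 1;                             % count the remaining data rows
end
fclose(fid);
formatSpec = repmat('%q', 1, columns);           % every field is double-quoted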

Accepted Answer

Guillaume on 22 Jan 2018
I'd use readtable, which should be able to work out the formatSpec on its own. So just:
t = readtable(filepath); % all done: the file is read and parsed.
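A sketch of how the parsed table might be used afterwards, assuming the numeric block was detected as numeric variables (the names doses and volumes are illustrative only):
t       = readtable(filepath);   % as above
doses   = t{:, 1};               % first column: the dose axis
volumes = t{:, 2:end};           % remaining columns: one curve per structure
plot(doses, volumes);            % quick visual check of the DVH curves
xlabel('Dose'); ylabel('Volume (%)');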
Guillaume on 23 Jan 2018
"My next task is to learn how to parse the Volume variable names to strip the underscore and all characters after it."
A regexprep should do that very easily. Explain exactly what you want.
Daniel Bridges on 23 Jan 2018
Edited: Daniel Bridges on 23 Jan 2018
Please see the question, "How do I parse and erase from a string while importing CSV with tableread?" The goal is to import the volume structure names as efficiently as possible, e.g.
"Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)"
keeping only the names 'Rectum', 'Bladder', etc. I was able to find one method, but it took four lines of code, and I suspect there is a way to do it with one. If you would post a better method, I would unaccept my answer and accept yours.
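A hedged sketch of the regexprep approach mentioned above, assuming the raw header text is available as a cell array of character vectors shaped like 'Name (n)(Volume: x.xx)':
% Drop everything from the first parenthesised part onwards.
headers = {'Rectum (4)(Volume: 70.44)', 'Bladder (5)(Volume: 143.70)'};
names   = regexprep(headers, '\s*\(.*$', '');   % names = {'Rectum','Bladder'}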


More Answers (1)

Walter Roberson on 22 Jan 2018
opts = detectImportOptions(filepath);            % let MATLAB work out the format
nvar = length(opts.VariableNames);
opts.VariableNames = sprintfc('Var%d', 1:nvar);  % replace the detected names with Var1..VarN
test = readtable(filepath, opts);
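sprintfc works but is undocumented; if it is unavailable, a documented equivalent (a sketch, not part of the original answer) builds the same generic names with arrayfun and sprintf:
opts = detectImportOptions(filepath);
nvar = length(opts.VariableNames);
opts.VariableNames = arrayfun(@(k) sprintf('Var%d', k), 1:nvar, 'UniformOutput', false);
test = readtable(filepath, opts);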
