Time and Space Efficiency Problem in Multi-Format .csv file Splicing and Saving

2 views (last 30 days)
Hi,
I'm recently dealing with a bunch of csv files which are stored in different folders named by date sequence. The aim is to splice the data with the same csv filename (among all folders) together and save for further use.
First, I go over all folders and use unique function to derive a list of .csv file names. Next, I splice the data with identical filename according to date sequence.
data=[];
for i = 1:num_folder
data = [data; readtable(tmp_file)]; % Splicing
end
save(strcat(pwd,'\',filename,'.mat'),'data') % Saving
The saving is -v7.3 by default setting. (the file size range from hundreds of Mbs to several Gbs)
The whole process is running quite slow. I checked the profile viewer and it turns out save function consumed 70% of time and 'splicing' the other 30%. I would love to know how to elevate the time&space efficiency of my code.
[P.S.: The original .csv file have 60 columns. The Majority of them are numbers (double). One is filled with 'true' or 'false'; one filled with 'HH:MM:SS.TTT' and the other filled with 'HH:MM:SS.TTTTTT'. (Since I read them with readtable function, they are save in the table format in a cell manner.) I tried to 'simplify' the table by converting all elemets into double (I thought it could save some space?) but failed in correctly attaining all microseconds (TTTTTT). The textscan can only gives me HH, MM and SS.TTTT.]
  4 Comments
Frank
Frank on 1 Feb 2023
Thanks! I tried to saved the spliced data into .csv format instead of .mat and saved some ~85% of space!
I am now doing the whole data-processing in a two-step procedure. First is what I'm doing now, read the original data and splice them by their specific filename (as 'Original Database re-Storage'? my aim was to minimize multiple times of loading when conducting the following data analysis). After 'Archiving' the Original Data, the next step is to standardlize/customize the needed columns of data in preparation for data analysis. It troubles me of how to smoothly convert 'cell' type 'HH:MM:SS.TTT' table elements into datetime/duration format (I am aware of the pure time is comes with 'today' when performing datestr. Is there any other way to derive and keep the simple decimal parts of the pure time alone?)
The time structure example is 'HH:MM:SS.TTT' (e.g. '19:14:23.000') and 'HH:MM:SS.TTTTTT' (e.g. '08:46:51.429869').
My solution now is using:
sscanf(data.Time,'%{HH:mm.ss.SSS}D')
but have no idea how to deal with 'HH:MM:SS.TTTTTT' format without losing precision. (pls show me an example if possible)
Frank
Frank on 1 Feb 2023
One example of my spliced data look like this (attachment). I'm considering turning the whole table into double matrix (true/false -> 1/0, time -> decimals) since it seems easier to use than table format (cell style). Any suggestions?

Sign in to comment.

Accepted Answer

Steven Lord
Steven Lord on 1 Feb 2023
Assuming your data files don't have any header rows, I'd consider avoiding the read/save process and just use operating system commands to staple together the files. If you then want to process the data in MATLAB you'd only need to read it once (or you could assemble a datastore and create a tall array to work with the data.)
For Windows take a look at the last couple examples on the Microsoft documentation page for the copy command. You could assemble the command in MATLAB from the file names then call system to execute it.
  2 Comments
Frank
Frank on 2 Feb 2023
Great! I'll try compare matrix splicing, fwrite function and OS command on the same set of files and see who's the fastest one.
Frank
Frank on 3 Feb 2023
Thanks! I used the copy command via MATLAB system. It is really way faster! It is the perfect way for the stapling.
Extra thanks for the datestore and tall array format. It provide me solution and ideas to some other problem I've encountered before on extra-large data analysis.

Sign in to comment.

More Answers (1)

dpb
dpb on 1 Feb 2023
Edited: dpb on 1 Feb 2023
Per usual when things are not totally plain vanilla in some fashion, you need to make use of the extended features supplied in MATLAB. We didn't need but a half-dozen lines to see the file format; I made a shorter version of the original by the following code (which post as may give you some ideas regarding the "splicing" operations you spoke of)
fid=fopen('OI401.csv','r');
fod=fopen('OI401_1.csv','w');
for i=1:100, fwrite(fod,fgets(fid)); end
fclose('all')
to leave me with a big enough file to do something with but still not be huge...
Anyways, that done,
opt=detectImportOptions('OI401.csv'); % create basic import object
opt=setvartype(opt,{'tradingday','logDay'},"datetime"); % set datetime data type
opt=setvaropts(opt,'tradingday','logDay'},"InputFormat",'yyyyDDDD'); % and the input format
opt=setvartype(opt,'isTrading','logical');
opt.SelectedVariableNames=opt.VariableNames([1:6 end-4:end]); % just pick a small subset for demo
tOI=readtable('OI401_1.csv',opt); % and read in with this help
The above returned
>> head(tOI)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ __________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
And, I didn't think to do it on the import option object, so to prove what we got,
>> ans.tradingday.Format='default'
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ ___________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
shows the dates were imported as datetime.
Now, as far as what to use for analyses, that'll be your call and will depend heavily upon what you intend to do with the data once you have it. There are all kinds of builtin analysis tools for tables that let you select variables by variable name that are quite convenient; whether they'll be of any direct use for your purposes will all depend upon what those purposes are...
  14 Comments
dpb
dpb on 6 Feb 2023
I think you can accept more than one Answer if more than one were useful??? Not positive it will keep both; haven't explored that much; I do know that reputation points aren't taken away from an earlier respondent if a second is selected...
It's not like we get anything other than personal satisfaction out of it, anyways... :) My penchant is to try to teach as well as "just answer"; hence the examples of some less-oftened facilities I notice newcomers tend to overlook/not use...
Frank
Frank on 9 Feb 2023
There's no 'Accept' button anymore after accepting one. Only a 'vote' button. I'd be happy to accept both of the answers as they're all very helpful to me.
And yes exploring useful but less-oftened facilities are fun. Thanks a lot again.

Sign in to comment.

Categories

Find more on Axes Appearance in Help Center and File Exchange

Products


Release

R2017a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!