Filtering multiple .csv files based on data from a table
5 views (last 30 days)
Show older comments
Hello all,
I am trying to filter multiple .csv files (based on data in a separate table. To be more specific I am trying to filter the data from dataTables by d_Data.Actual.Calculated_Flow variable, so that it filters the data that are below minFlow or above maxFlow, which are listed in dataList. dataList has several columns the first one lists Tablename that corresponds with names of the tables in dataTables, the 4th & 5th lists corresponding maxFlow & minFlow that I would like to apply for filtering. How can I achieve this?
here is the code:
dataListDir="zoznam_merani.xlsx";
dataDir='C:\Users\U430746\OneDrive - Danfoss\Desktop\EHA\EHA 1\Matlab\Matlab\Simulation EHA V1\Test Data\OneDrive_2024-08-16\Flow control\EHA_Flow_Control_standard mode\EHA_Flow_Control';
% Loads dataList
dataList=readtable(dataListDir);
% Loads direction to .csv files
dataFiles = dir(fullfile(dataDir, '*.csv'));
% Inicialize a cell array for input of tables
dataTables = cell(1, length(dataFiles));
% Loads tables to the dataTables cell array
for i = 1:length(dataFiles)
fileName = fullfile(dataDir, dataFiles(i).name);
dataTables{i} = readtable(fileName,"VariableNamingRule","preserve");
end
% Inicialize a cell array for filtered data
filteredDataTables = cell(1, length(dataFiles));
% Runs through all the tables in dataTables
for i = 1:length(dataTables)
% Finds a row that corresponds with the TableName
rowIndex = strcmp(dataList.TableName, dataFiles(i).name);
if any(rowIndex)
% Gets minFlow & maxFlow for the actual table
minFlow = dataList.minFlow(rowIndex);
maxFlow = dataList.maxFlow(rowIndex);
% Filters the tables based on minFlow & maxFlow
filteredDataTables{i} = dataTables{i}(dataTables{i}.("d_Data.Actual.Calculated_Flow") >= minFlow & dataTables{i}.("d_Data.Actual.Calculated_Flow") <= maxFlow, :);
else
% If the tableName is not available executes no changes
filteredDataTables{i} = dataTables{i};
end
end
0 Comments
Accepted Answer
Jeremy Hughes
on 20 Aug 2024
It looks like you're loading all the data into memory, then looping to do the filtering. And to the best of my ability to tell, it looks like you're doing it right. (Without files, it's hard to say)
Are you having any specific problems?
You might have better luck doing the filtering on each table as you bring it in, so I reorganized this code a bit:
% Loads direction to .csv files
dataListDir="zoznam_merani.xlsx";
dataList=readtable(dataListDir,"TextType","string");
% Truncated this for readability, also One Drive can be problematic when
% accessed in a loop, but I see that more when writing. This is probably
% fine.
dataDir='C:\Users\...whereever...\';
dataFiles = dir(fullfile(dataDir, '*.csv'));
dataFiles = string({datafiles.name})';
% Loads tables to the dataTables cell array
dataTables = cell(1, height(dataFiles));
filteredDataTables = cell(1, height(dataFiles));
for i = 1:height(dataFiles) % Changed to length => height since length can give strange results
T = readtable(fullfile(dataDir, dataFiles(i)),"VariableNamingRule","preserve");
% I assume this condition is right assuming TableName lists the files
% as they are named in dataDir.
rowIndex = (dataList.TableName == dataFiles(i));
% **** Potentially, this is a problem if there are multiple rows with
% the same file name, but so check here for that case. ****
if any(rowIndex)
% Gets minFlow & maxFlow for the actual table
minFlow = dataList.minFlow(rowIndex);
maxFlow = dataList.maxFlow(rowIndex);
% Filters the tables based on minFlow & maxFlow
aboveMin = T.("d_Data.Actual.Calculated_Flow") >= minFlow;
belowMax = T.("d_Data.Actual.Calculated_Flow") <= maxFlow
filteredDataTables{i} = T(aboveMin & belowMax, :);
else
% If the tableName is not available executes no changes
filteredDataTables{i} = T;
end
dataTables{i} = T;
end
2 Comments
Jeremy Hughes
on 21 Aug 2024
If you accept an answer, it will highlight that answer. Others can still see it and answer, but they will see that you have accepted it.
More Answers (1)
Saurabh
on 20 Aug 2024
Hi Simon,
What I understood is that you want to filter multiple csv files based on the "d_Data.Actual.Calculated_Flow" column, by applying `minFlow` and `maxFlow` values from a separate `dataList` table that matches table names with CSV filename. I looked over your code and didn't see any issues.
Here are some things to check and consider:
- Variable Naming Rule: You are using the "VariableNamingRule","preserve" option in “readtable”. Ensure that the variable name "d_Data.Actual.Calculated_Flow" is exactly as it appears in your CSV files. MATLAB is case-sensitive, and any mismatch will cause an error.
- Table Name Matching: When matching “dataFiles (i)”.name with “dataList.TableName,” ensure that the names are consistent. If “dataList.TableName” does not include the file extension (e.g., .csv), you might need to adjust the comparison to remove the extension from “dataFiles(i).name”.
- Row Index Check: The condition if any(rowIndex) checks if there is any match. Ensure that “dataList.TableName” contains the expected names and that there are no leading/trailing spaces or mismatches.
- Data Types: Verify that “minFlow” and “maxFlow” are numeric values and that “dataTables{i}.("d_Data.Actual.Calculated_Flow")” is a numeric column. Any type mismatch can cause errors during the filtering operation.
Here's a refined version of script with some adjustments:
dataListDir = "zoznam_merani.xlsx";
dataDir = 'C:\Users\U430746\OneDrive - Danfoss\Desktop\EHA\EHA 1\Matlab\Matlab\Simulation EHA V1\Test Data\OneDrive_2024-08-16\Flow control\EHA_Flow_Control_standard mode\EHA_Flow_Control';
% Load dataList
dataList = readtable(dataListDir);
% Load .csv files
dataFiles = dir(fullfile(dataDir, '*.csv'));
% Initialize cell arrays for tables
dataTables = cell(1, length(dataFiles));
filteredDataTables = cell(1, length(dataFiles));
% Load tables into dataTables
for i = 1:length(dataFiles)
fileName = fullfile(dataDir, dataFiles(i).name);
dataTables{i} = readtable(fileName, "VariableNamingRule", "preserve");
end
% Filter dataTables
for i = 1:length(dataTables)
% Strip the .csv extension for comparison
tableName = erase(dataFiles(i).name, '.csv');
rowIndex = strcmp(dataList.TableName, tableName);
if any(rowIndex)
% Get minFlow and maxFlow for the current table
minFlow = dataList.minFlow(rowIndex);
maxFlow = dataList.maxFlow(rowIndex);
% Filter the table
flowData = dataTables{i}.("d_Data.Actual.Calculated_Flow");
filteredDataTables{i} = dataTables{i}(flowData >= minFlow & flowData <= maxFlow, :);
else
% No matching table name
filteredDataTables{i} = dataTables{i};
end
end
Some Debugging tips:
- Use “disp” or “fprintf” to print out variables at different stages.
- Use “class” and “whos” to verify that variables are of the expected types.
I hope this was helpful
See Also
Categories
Find more on Matrices and Arrays in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!