Treat and handle missing hourly data (with daily profile), that might have large gaps

2 views (last 30 days)
I want to treat huge missy temperature data with many missing values (presented as 999.9).
If there is few missing data within the day, I would take average from data before and after. But if I have large missing clusters (almost full-day missing, or up to 100 values in a row), I would take average of 1PM temperature from yesterday and 1PM temperature from tomorrow to get 1PM value for today, and same goes for all hours.
Note: I don't wish to change valid assigned tempratures linked to hours (like what interp1 would do with values order).
What can I use to handle these data?
08/09/2016 4:00:00 26
08/09/2016 5:00:00 26
08/09/2016 6:00:00 25
08/09/2016 6:00:00 999.9
08/09/2016 7:00:00 24
08/09/2016 8:00:00 25
08/09/2016 9:00:00 24
08/09/2016 9:00:00 999.9
08/09/2016 10:00:00 23
  5 Comments
Anwaar Alghamdi
Anwaar Alghamdi on 24 Nov 2022
Also, if I do linear interpolation, the non-999 values will be missed up (at least their order). I don't want to touch the temperatures assigned for each hour. Only estimate the 999 values.
Jiri Hajek
Jiri Hajek on 24 Nov 2022
As for the cluster identification, I can give you some hints - will put them below into an answer. As for the handling of large missing clusters, I would leave themo out, i.e. constrain the scope.

Sign in to comment.

Answers (1)

Jiri Hajek
Jiri Hajek on 24 Nov 2022
To identify the clusters of outliers, one may use logical indexing and the time vector. This is just a skeletal draft of the algorithm, but you can get the idea.
timeColumn % your datatime values
temperatureColumnRaw % your original temperatures
outlierPoints = temperatureColumnRaw > 900;
outlierTimes = timeColumn(outlierPoints);
timeDifsOfOutliers = diff(outlierTimes);
clusterStartsLogical = [1; timeDifsOfOutliers > mode(diff(timeColumn))];
clusterStartTimes = outlierTimes(clusterStartsLogical);
nClusters = length(clusterStart);
if nClusters > 1
clusterStartIndices = find(clusterStartsLogical);
clusterEndPoints = [clusterStartIndices(2:end)-1;length(outlierTimes)];
clusterEndTimes = outlierTimes(clusterEndPoints);
end
clusterDurations = clusterEndTimes-clusterStartTimes;
shortClusterIndices = clusterDurations > hours(3); % you define, what is a short cluster

Categories

Find more on Dates and Time in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!