How I can eliminate days with incomplete hours?

1 view (last 30 days)
Hi everyone. I can't find a solution to my issue in the web. I have hourly averages data from a meteorological station. However there are some missing hours due to the lack of measurements. In this cases, I would like to remove the complete day, in order to keep only the days in which there are measurements in a whole day (24 data per day). Look at this example; The meteorological station is giving the next output, in which columns 1,2,3 and 4 corresponds to year, month, day and hour, respectively, and column 5 correspond to the measurement:
2010 1 1 0 12.3
2010 1 1 1 12.4
2010 1 1 2 3
2010 1 1 3 34.7
2010 1 1 4 54
2010 1 1 5 565
2010 1 1 6 2
2010 1 1 7 334
2010 1 1 8 23
2010 1 1 9 223
2010 1 1 10 32
2010 1 1 11 233
2010 1 1 12 544
2010 1 1 13 23
2010 1 1 16 12
2010 1 1 17 12
2010 1 1 18 3
2010 1 1 19 34
2010 1 1 20 54
2010 1 1 21 565
2010 1 1 22 2
2010 1 1 23 334
2010 1 2 0 23
2010 1 2 1 223
2010 1 2 2 32
2010 1 2 3 233
2010 1 2 4 544
2010 1 2 5 23
2010 1 2 6 12
2010 1 2 7 12
2010 1 2 8 3
2010 1 2 9 34
2010 1 2 10 54
2010 1 2 11 565
2010 1 2 12 2
2010 1 2 13 334
2010 1 2 14 23
2010 1 2 15 223
2010 1 2 16 32
2010 1 2 17 233
2010 1 2 18 544
2010 1 2 19 12
2010 1 2 20 12
2010 1 2 21 3
2010 1 2 22 34
2010 1 2 23 54
2010 1 3 0 565
2010 1 3 1 2
2010 1 3 2 334
2010 1 3 3 12
2010 1 3 5 12
2010 1 3 6 3
2010 1 3 7 34
2010 1 3 8 54
2010 1 3 9 565
2010 1 3 10 2
2010 1 3 11 334
2010 1 3 12 23
2010 1 3 13 223
2010 1 3 14 32
2010 1 3 15 233
2010 1 3 16 544
2010 1 3 17 23
2010 1 3 18 12
2010 1 3 19 12
2010 1 3 23 3
2010 1 4 0 34
2010 1 4 1 54
2010 1 4 2 565
2010 1 4 3 2.2
2010 1 4 4 334
2010 1 4 5 23
2010 1 4 6 223
2010 1 4 7 32
2010 1 4 8 233
2010 1 4 9 544
2010 1 4 10 12
2010 1 4 11 12
2010 1 4 12 3
2010 1 4 13 34
2010 1 4 14 54.7
2010 1 4 15 565
2010 1 4 16 2
2010 1 4 17 334
2010 1 4 18 23
2010 1 4 19 223
2010 1 4 20 32
2010 1 4 21 233
2010 1 4 22 544.6
2010 1 4 23 344.4
In the table of above, there were some missing hours; in day 1: hour 14 and hour 15; in day 3: hour 4, hour 20, hour 21 and hour 22. For that reason, is necessary don't take into account day 1 and day 3, and generate the next output:
2010 1 2 0 23
2010 1 2 1 223
2010 1 2 2 32
2010 1 2 3 233
2010 1 2 4 544
2010 1 2 5 23
2010 1 2 6 12
2010 1 2 7 12
2010 1 2 8 3
2010 1 2 9 34
2010 1 2 10 54
2010 1 2 11 565
2010 1 2 12 2
2010 1 2 13 334
2010 1 2 14 23
2010 1 2 15 223
2010 1 2 16 32
2010 1 2 17 233
2010 1 2 18 544
2010 1 2 19 12
2010 1 2 20 12
2010 1 2 21 3
2010 1 2 22 34
2010 1 2 23 54
2010 1 4 0 34
2010 1 4 1 54
2010 1 4 2 565
2010 1 4 3 2
2010 1 4 4 334
2010 1 4 5 23
2010 1 4 6 223
2010 1 4 7 32
2010 1 4 8 233
2010 1 4 9 544
2010 1 4 10 12
2010 1 4 11 12
2010 1 4 12 3
2010 1 4 13 34
2010 1 4 14 54
2010 1 4 15 565
2010 1 4 16 2
2010 1 4 17 334
2010 1 4 18 23
2010 1 4 19 223
2010 1 4 20 32
2010 1 4 21 233
2010 1 4 22 544
2010 1 4 23 344
Thanks in advance for your help!

Answers (4)

Marieke Klijn
Marieke Klijn on 21 Sep 2016
Hi Miguel,
For your question I loaded your copied data and named it 'data'. I assume that every day contains 24 hours (0-23) and each day has a number which is unique per dataset. So I loop over each day and check if there are 24 hour entries. If this is not the case then the rows are removed from the dataset.
days = unique(data(:,3));
for ii = 1:length(days)
hours_in_day = find(data(:,3) == days(ii));
if length(hours_in_day) < 24
data(hours_in_day,:) = [];
end
clear hours_in_day
end
Hope this works for you!
Cheers, Marieke
  3 Comments
Miguel L
Miguel L on 22 Sep 2016
I found one solution: first of all I filtered the year and secondly filtered each month; after this, I applied your scipt to each month.

Sign in to comment.


Geoff Hayes
Geoff Hayes on 21 Sep 2016
Miguel - depending upon your version of MATLAB, you could use hist or http://www.mathworks.com/help/matlab/ref/histogram.html to determine the number of recorded hours in each day. For example, using your above data and hist we would see that
[hourCount,days] = hist(data(:,3),1:max(data(:,3)))
returns
hourCount =
22 24 20 24
days =
1 2 3 4
So we know that the first and third days are incomplete. We could then remove these days from the data by checking to see which element of column three is a member of those days that are incomplete and set it to an empty element (which removes it from the array)
data(ismember(data(:,3),days(hourCount ~= 24)),:) = [];
In the above, we rely on the fact that
ismember(data(:,3),days(hourCount ~= 24))
returns a logical array of ones and zeros for each row of data where a one indicate that the row is missing an hour and a zero indicates otherwise. There may be a more efficient way of doing the above..I seem to recall that ismember can be expensive.
  1 Comment
Miguel L
Miguel L on 21 Sep 2016
Hi Geoff thanks for your response!
But in my case my data is covering one year. For that reason when I execute the code, it doesn't work properly due to it counts all days 1, all days 2... and so on until day 31... I mean it counts every day 1 from every month, every day 2 from every month and so on.
I attach to this message the file in which I am trying to ignore the incomplete days (<24 hours). Could you help me one more time?
I have 56 files like this! For that reason it is convenient to count with a script which could make the process of information easier!
Thanks in advance!

Sign in to comment.


George
George on 21 Sep 2016
Edited: George on 21 Sep 2016
You can do this with findgroups() and splitapply(). Conceptually you want to do the following:
  1. Find groups on year, month, day.
  2. Validate that your hours meets what you need (e.g., numel == 24 or somesuch)
  3. Extract valid dates
This should mostly do it.
weather = readtable('weather.xlsx');
G = findgroups(weather.Year, weather.Month, weather.Day);
validDays = splitapply(@(x) (numel(x) == 24), weather.Hour, G);
weather(validDays(G),:)
Where this may fall down would be if you had days with a dupe reading, but still 24 readings. You can do different anonymous functino if that concerns you.

Andrei Bobrov
Andrei Bobrov on 22 Sep 2016
[~,~,c] = unique(EXAMPLE(:,1:3),'rows');
t = nonzeros((1:max(c))'.*(accumarray(c,1) == 24));
out = EXAMPLE(ismember(c,t),:);

Categories

Find more on Dates and Time in Help Center and File Exchange

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!