Join tables based on times that are within 3 hours of each other

I have two tables that should be joined based on matching id values, but I would only like to join those values which have data collected within three hours of each other.
The joined table AB does not ned to be in that specific order or format, it is only an example. It just needs to contain the correct dataA and dataB which match both the id and the condition that the time is within 3 hours of each other.
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB' 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
["Red"; "White"; "Green"], ...
[9999; 34 ; 3333],...
'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
A
A = 6×3 table
idA dateA dataA _____ ____________________ _____ "AAA" 07-Jul-2023 10:55:19 9999 "BBB" 28-Jul-2023 09:31:00 12 "BBB" 28-Jul-2023 14:02:11 34 "CCC" 23-Jan-2023 17:35:24 1111 "CCC" 23-Jan-2023 17:31:48 2222 "CCC" 12-Feb-2023 03:10:11 3333
B
B = 4×3 table
idB dateB dataB _____ ____________________ ________ "AAA" 07-Jul-2023 09:30:05 "Red" "BBB" 28-Jul-2023 12:43:36 "White" "CCC" 12-Feb-2023 02:34:17 "Green" "DDD" 29-Sep-2023 05:00:00 "Orange"
AB
AB = 3×5 table
idAB dateA dateB dataB dataA _____ ____________________ ____________________ _______ _____ "AAA" 07-Jul-2023 10:55:19 07-Jul-2023 09:30:05 "Red" 9999 "BBB" 28-Jul-2023 14:02:11 28-Jul-2023 12:43:36 "White" 34 "CCC" 12-Feb-2023 03:10:11 12-Feb-2023 02:34:17 "Green" 3333

4 Comments

are the corresponding rows guaranteed to be 1:1, i.e., there will only ever be 0 or 1 rows in B that will be within 3 hours of a row in A?
NOTA BENE the variable names in B are swapped; should be
B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
B
B = 4×3 table
idB dateB dataB _____ ____________________ ________ "AAA" 07-Jul-2023 09:30:05 "Red" "BBB" 28-Jul-2023 12:43:36 "White" "CCC" 12-Feb-2023 02:34:17 "Green" "DDD" 29-Sep-2023 05:00:00 "Orange"
There will only be one row in B with that unique timestamp and ID combination.
Thanks dpb, fixed the variable names.

Sign in to comment.

 Accepted Answer

This may be one such "optimization", though i have no idea if it is faster, but it maybe is less "brute force"?
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
% B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
% [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
% datetime('2023-09-29 05:00:00')], ...
% ["Red"; "White"; "Green"; "Orange"], ...
% 'VariableNames', {'idB', 'dateB' 'dataB'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
["Red"; "White"; "Green"], ...
[9999; 34 ; 3333],...
'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
%% work with timetables so you can use things like retime
TA = sortrows(table2timetable(A,"RowTimes",A.dateA));
TB = sortrows(table2timetable(B,"RowTimes",B.dateB));
% break up into id's to reduce logic needed for meeting both conditions
idList = intersect(TA.idA,TB.idB);
% pre-allocate for joined tables by ID
TAB = cell(numel(idList),1);
% iterate through id's
for k = 1:numel(idList)
subA = TA(TA.idA==idList(k),:);
subB = TB(TB.idB==idList(k),:);
% retime to the table with fewer entries
% otherwise I think you duplicate entries and have to trim later
if height(subA) < height(subB)
src = subA;
tgt = subB;
else
src = subB;
tgt = subA;
end
% retime the target table to the timestamps of the source table
tgtR = retime(tgt,src.Time,"nearest");
% join the tables using built-in
tmp = innerjoin(src,tgtR);
% discriminate based on difference between times
% this is already the closest match there is
% if there are multiple matches within 3 hrs they will be already discounted
isFar = abs(tmp.dateB-tmp.dateA) > duration(3,0,0);
% remove entries without matches within 3 hours
tmp(isFar,:) = [];
% save the merged table (for a single id)
TAB{k} = tmp;
end
AB_2 = timetable2table(vertcat(TAB{:}))
AB_2 = 3×7 table
Time idB dateB dataB idA dateA dataA ____________________ _____ ____________________ _______ _____ ____________________ _____ 07-Jul-2023 09:30:05 "AAA" 07-Jul-2023 09:30:05 "Red" "AAA" 07-Jul-2023 10:55:19 9999 28-Jul-2023 12:43:36 "BBB" 28-Jul-2023 12:43:36 "White" "BBB" 28-Jul-2023 14:02:11 34 12-Feb-2023 02:34:17 "CCC" 12-Feb-2023 02:34:17 "Green" "CCC" 12-Feb-2023 03:10:11 3333

3 Comments

i'm not so happy about the condition to switch which table to retime based off of...there must be a better way?
Probably "not so much" if either A or B isn't known a priori to be the dominant one to match against, but if it doesn't matter, then picking either A or B for the list of IDs is ok; since there has to be one in each to match, if you get the odd one(s) in the list you parse or not doesn't matter; the list of which are in both has to be complete regardless of which list you use.
I didn't read his solution closely enough to follow the need for retime, anyway; since the time difference between the two is the selection criterion, it doesn't matter which is the reference, it would simply change the sign of the difference depending on which is compared to which and it was not specified the time difference was signed.
One (I assumed probably minor) optimization in mine would be to get the unique list that isn't in both and iterate over it instead of just picking the one list and working down its members one-by-one as I did. I started on that path and then went the straight list route as I read the initial follow on comment of there being only the one unique time as meaning there was only one--hence the issue with duplicates had to fix later...
Retime is my way of finding nearest timestamp, mistakenly thought it would be faster than manually searching for each row...oops ;)

Sign in to comment.

More Answers (1)

A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
A,B
A = 6×3 table
idA dateA dataA _____ ____________________ _____ "AAA" 07-Jul-2023 10:55:19 9999 "BBB" 28-Jul-2023 09:31:00 12 "BBB" 28-Jul-2023 14:02:11 34 "CCC" 23-Jan-2023 17:35:24 1111 "CCC" 23-Jan-2023 17:31:48 2222 "CCC" 12-Feb-2023 03:10:11 3333
B = 5×3 table
idB dateB dataB _____ ____________________ ________ "AAA" 07-Jul-2023 09:30:05 "Red" "BBB" 28-Jul-2023 12:43:36 "White" "BBB" 28-Jul-2023 20:34:17 "Black" "CCC" 12-Feb-2023 02:34:17 "Green" "DDD" 29-Sep-2023 05:00:00 "Orange"
AB=[];
for i=1:height(A)
ixU=(B.idB==A.idA(i));
if any(ixU)
ixT=false(size(ixU));
for j=1:numel(ixU)
if ~ixU(j), continue, end
%[i j], [A.dateA(i),B.dateB(j)]
ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
ixB=ixU&ixT;
if any(ixB)
AB=[AB;table(A.idA(i),A.dateA(i),B.dateB(ixB),A.dateA(i)-B.dateB(ixB),A.dataA(i),B.dataB(ixB))];
end
end
end
end
AB.Properties.VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=unique(AB,'rows');
AB
AB = 3×6 table
idAB DateA DateB DateDiff DataA DataB _____ ____________________ ____________________ ________ _____ _______ "AAA" 07-Jul-2023 10:55:19 07-Jul-2023 09:30:05 01:25:14 9999 "Red" "BBB" 28-Jul-2023 14:02:11 28-Jul-2023 12:43:36 01:18:35 34 "White" "CCC" 12-Feb-2023 03:10:11 12-Feb-2023 02:34:17 00:35:54 3333 "Green"
Above is brute force; depends upon how large the real tables are as to how much optimization might be called for. If numel(ixB) can be > 1 in the wild, then the subsequent assignment needs must be in another loop, of course.

17 Comments

Thanks.
I do have cases where there are multiple idB values with different dateB timestamps and am struggling to rectify that. It fails at the ixB=ixU&ixT; line because ixT is a 2x1 logical. Hopefully I will sort it out.
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
B
See updated Answer -- then one has to iterate over each of the family in B, too...
I've not tested extensively, but I think it does what want...
Thanks so much! I wish I could accept more than one answer, this one works also.
Well, partial credit would be a Vote... :)
Done! Wish I could do more but know that you have accumulated major karma points :)
Out of curiosity, if you have a sizable dataset, let us know the difference in performance between the two...
I have ~10k rows in A and ~6k in B. I got 77s for your method and 199s for the timetable method.
However, the loop brute force method will fail when multiple Bs timestamps with the same id occur within 3 hours of each other with an A in the middle. I did not think that could happen but there it was. So J. Alex's timetabe method of finding the nearest timestamp was better for my poorly thought out question :)
And of course, in the meantime someone found me an even better time field that will be more like 15 minutes maximum apart from the other, and interestingly the time in B will (allegedly) always be before A.
Thanks again.
"... 77s for your method and 199s for the timetable method."
Wow! I thought it might be somewhat better, but not anything close to a factor of 2-3X. It would then be interesting to see what gains could be made by preallocating a table of N rows and storing into it instead of the dynamic reallocation....that might make a sizable difference with 10K rows.
"...the loop brute force method will fail when multiple Bs timestamps with the same id occur within 3 hours of each other with an A in the middle"
That's easy to work around -- just sort on ID first. If it's important to retain original order when done, then save the sort index vector and rearrange back.
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
'VariableNames',VariableNames);
k=0;
[A,ixa]=sortrows(A,1); [B,ixb]=sortrows(B,1);
for i=1:height(A)
ixU=(B.idB==A.idA(i));
if any(ixU)
ixT=false(size(ixU));
for j=1:numel(ixU)
if ~ixU(j), continue, end
%[i j], [A.dateA(i),B.dateB(j)]
ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
ixB=ixU&ixT;
if any(ixB)
k=k+1;
AB.idAB(k)=A.idA(i);
AB.DateA(k)=A.dateA(i); AB.DateB(k)=B.dateB(ixB);
AB.DataA(k)=A.dataA(i); AB.DataB(k)=B.dataB(ixB);
end
end
end
end
AB=AB(~ismissing(AB.idAB),:);
AB=unique(AB,'rows');
AB.DateDiff=AB.DateA-AB.DateB;
AB
AB = 3×6 table
idAB DateA DateB DateDiff DataA DataB _____ ____________________ ____________________ ________ _____ _______ "AAA" 07-Jul-2023 10:55:19 07-Jul-2023 09:30:05 01:25:14 9999 "Red" "BBB" 28-Jul-2023 14:02:11 28-Jul-2023 12:43:36 01:18:35 34 "White" "CCC" 12-Feb-2023 03:10:11 12-Feb-2023 02:34:17 00:35:54 3333 "Green"
Try the above for speed comparison; will lose some owing to sorting both first (if one or both are already sorted can skip that, of course) but should gain quite a bit by the use of indexing into a preallocated table.
This one allocates a table at the beginning by using the size of the larger of A, B. If there is a way to make an a priori reasonable estimate of the fraction that are retained within the time window desired, then that could save some memory by applying that factor to the initial allocation.
Another refinement that might save a little would be to convert the ID and B data to categorical variables from strings.
A bit faster overall- 69s.
I already do convert the real data to categorical by habit.
Thanks again- as usual I am being asked to change the scope and direction so I'm sure I'll have more questions soon!
That's disappointing; I thought/hoped might do somewhat better than that. Just for curiosity, how much was taken up by the sort as opposed to the meat of the process?
According to profiler, the cell.ismember function was by far the longerst at 35s. Next was @categorical\private\convertCodes at 17. Everything else was 3s or less.
does this work, and is it slower or faster?
idList = unique([A.idA;B.idB]);
% pre-allocate for joined tables by ID
TAB = cell(max(height(A),height(B)),1);
% iterate through id's
cntr = 0;
for k = 1:numel(idList)
subA = A(A.idA==idList(k),:);
subB = B(B.idB==idList(k),:);
for i = 1:height(subA)
[t,idx] = min(abs(subA.dateA(i)-subB.dateB));
if t < hours(3)
cntr = cntr + 1;
TAB{cntr} = [subA(i,:),subB(idx,:)];
end
end
end
AB_3 = vertcat(TAB{:})
Slower- 139s. Worked, but found 2 more than the last one. (i did not look too closely, just noted the number of rows was greater)
" the cell.ismember function was by far the longerst at 35s"
But there's not an ismember call anywhere??? I know it has always been a performance dog; surely an "==" wasn't converted to ismember by the interpreter...
I was wondering about that too, but I'm no expert at the profiler. My guess was that it was referringto the ixU=(B.idB==A.idA(i)) line also. You weren't explicitly converting categoricals either.
I guess it is, it is a matching of element within an array which is what ismember does. Have to think about any alternative syntax that could possibly be more efficient.
The conversion to categorical ought to be done initially to the overall tables; that shouldn't take long
>> B=repmat(B,round(10000/height(B)),1);
>> whos B
Name Size Bytes Class Attributes
B 10000x3 1161609 table
>> tic;B=convertvars(B,{'idB','dataB'},'categorical');toc
Elapsed time is 0.015839 seconds.
>>
Remember then to change the variable type in the empty table to match so there's not a conversion to/from during the working loop. From
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
'VariableNames',VariableNames);
to
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'categorical','datetime','datetime','duration','double','categorical'}, ...
'VariableNames',VariableNames);

Sign in to comment.

Categories

Asked:

on 16 Oct 2023

Commented:

dpb
on 18 Oct 2023

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!