Join tables based on times that are within 3 hours of each other
Show older comments
I have two tables that should be joined based on matching id values, but I would only like to join those values which have data collected within three hours of each other.
The joined table AB does not ned to be in that specific order or format, it is only an example. It just needs to contain the correct dataA and dataB which match both the id and the condition that the time is within 3 hours of each other.
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB' 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
["Red"; "White"; "Green"], ...
[9999; 34 ; 3333],...
'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
A
B
AB
4 Comments
J. Alex Lee
on 16 Oct 2023
are the corresponding rows guaranteed to be 1:1, i.e., there will only ever be 0 or 1 rows in B that will be within 3 hours of a row in A?
NOTA BENE the variable names in B are swapped; should be
B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
B
Marcus Glover
on 16 Oct 2023
Marcus Glover
on 16 Oct 2023
Accepted Answer
More Answers (1)
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
A,B
AB=[];
for i=1:height(A)
ixU=(B.idB==A.idA(i));
if any(ixU)
ixT=false(size(ixU));
for j=1:numel(ixU)
if ~ixU(j), continue, end
%[i j], [A.dateA(i),B.dateB(j)]
ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
ixB=ixU&ixT;
if any(ixB)
AB=[AB;table(A.idA(i),A.dateA(i),B.dateB(ixB),A.dateA(i)-B.dateB(ixB),A.dataA(i),B.dataB(ixB))];
end
end
end
end
AB.Properties.VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=unique(AB,'rows');
AB
Above is brute force; depends upon how large the real tables are as to how much optimization might be called for. If numel(ixB) can be > 1 in the wild, then the subsequent assignment needs must be in another loop, of course.
17 Comments
Marcus Glover
on 16 Oct 2023
Edited: Marcus Glover
on 16 Oct 2023
dpb
on 16 Oct 2023
See updated Answer -- then one has to iterate over each of the family in B, too...
I've not tested extensively, but I think it does what want...
Marcus Glover
on 16 Oct 2023
dpb
on 16 Oct 2023
Well, partial credit would be a Vote... :)
Marcus Glover
on 16 Oct 2023
dpb
on 16 Oct 2023
Out of curiosity, if you have a sizable dataset, let us know the difference in performance between the two...
Marcus Glover
on 16 Oct 2023
dpb
on 16 Oct 2023
"... 77s for your method and 199s for the timetable method."
Wow! I thought it might be somewhat better, but not anything close to a factor of 2-3X. It would then be interesting to see what gains could be made by preallocating a table of N rows and storing into it instead of the dynamic reallocation....that might make a sizable difference with 10K rows.
"...the loop brute force method will fail when multiple Bs timestamps with the same id occur within 3 hours of each other with an A in the middle"
That's easy to work around -- just sort on ID first. If it's important to retain original order when done, then save the sort index vector and rearrange back.
A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
[datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
[9999; 12; 34; 1111; 2222; 3333], ...
'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
[datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
["Red"; "White"; "Black"; "Green"; "Orange"], ...
'VariableNames', {'idB', 'dateB', 'dataB'});
VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
'VariableNames',VariableNames);
k=0;
[A,ixa]=sortrows(A,1); [B,ixb]=sortrows(B,1);
for i=1:height(A)
ixU=(B.idB==A.idA(i));
if any(ixU)
ixT=false(size(ixU));
for j=1:numel(ixU)
if ~ixU(j), continue, end
%[i j], [A.dateA(i),B.dateB(j)]
ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
ixB=ixU&ixT;
if any(ixB)
k=k+1;
AB.idAB(k)=A.idA(i);
AB.DateA(k)=A.dateA(i); AB.DateB(k)=B.dateB(ixB);
AB.DataA(k)=A.dataA(i); AB.DataB(k)=B.dataB(ixB);
end
end
end
end
AB=AB(~ismissing(AB.idAB),:);
AB=unique(AB,'rows');
AB.DateDiff=AB.DateA-AB.DateB;
AB
Try the above for speed comparison; will lose some owing to sorting both first (if one or both are already sorted can skip that, of course) but should gain quite a bit by the use of indexing into a preallocated table.
This one allocates a table at the beginning by using the size of the larger of A, B. If there is a way to make an a priori reasonable estimate of the fraction that are retained within the time window desired, then that could save some memory by applying that factor to the initial allocation.
Another refinement that might save a little would be to convert the ID and B data to categorical variables from strings.
Marcus Glover
on 17 Oct 2023
dpb
on 17 Oct 2023
That's disappointing; I thought/hoped might do somewhat better than that. Just for curiosity, how much was taken up by the sort as opposed to the meat of the process?
Marcus Glover
on 18 Oct 2023
J. Alex Lee
on 18 Oct 2023
Edited: J. Alex Lee
on 18 Oct 2023
does this work, and is it slower or faster?
idList = unique([A.idA;B.idB]);
% pre-allocate for joined tables by ID
TAB = cell(max(height(A),height(B)),1);
% iterate through id's
cntr = 0;
for k = 1:numel(idList)
subA = A(A.idA==idList(k),:);
subB = B(B.idB==idList(k),:);
for i = 1:height(subA)
[t,idx] = min(abs(subA.dateA(i)-subB.dateB));
if t < hours(3)
cntr = cntr + 1;
TAB{cntr} = [subA(i,:),subB(idx,:)];
end
end
end
AB_3 = vertcat(TAB{:})
Marcus Glover
on 18 Oct 2023
Edited: Marcus Glover
on 18 Oct 2023
dpb
on 18 Oct 2023
" the cell.ismember function was by far the longerst at 35s"
But there's not an ismember call anywhere??? I know it has always been a performance dog; surely an "==" wasn't converted to ismember by the interpreter...
Marcus Glover
on 18 Oct 2023
dpb
on 18 Oct 2023
I guess it is, it is a matching of element within an array which is what ismember does. Have to think about any alternative syntax that could possibly be more efficient.
The conversion to categorical ought to be done initially to the overall tables; that shouldn't take long
>> B=repmat(B,round(10000/height(B)),1);
>> whos B
Name Size Bytes Class Attributes
B 10000x3 1161609 table
>> tic;B=convertvars(B,{'idB','dataB'},'categorical');toc
Elapsed time is 0.015839 seconds.
>>
Remember then to change the variable type in the empty table to match so there's not a conversion to/from during the working loop. From
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
'VariableNames',VariableNames);
to
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
'VariableTypes',{'categorical','datetime','datetime','duration','double','categorical'}, ...
'VariableNames',VariableNames);
Categories
Find more on Tables in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!