Join tables based on times that are within 3 hours of each other

Question

Marcus Glover on 16 Oct 2023

0
Link

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/2034254-join-tables-based-on-times-that-are-within-3-hours-of-each-other

Commented: dpb on 18 Oct 2023

I have two tables that should be joined based on matching id values, but I would only like to join those values which have data collected within three hours of each other.

The joined table AB does not ned to be in that specific order or format, it is only an example. It just needs to contain the correct dataA and dataB which match both the id and the condition that the time is within 3 hours of each other.

A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
    [datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
    datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
    [9999; 12; 34; 1111; 2222; 3333], ...
    'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
    [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
    datetime('2023-09-29 05:00:00')], ...
    ["Red"; "White"; "Green"; "Orange"], ...
    'VariableNames', {'idB', 'dateB' 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
    [datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
    [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
    ["Red"; "White"; "Green"], ...
    [9999; 34 ; 3333],...
    'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
A
A = 6×3 table
     idA            dateA            dataA
    _____    ____________________    _____

    "AAA"    07-Jul-2023 10:55:19    9999 
    "BBB"    28-Jul-2023 09:31:00      12 
    "BBB"    28-Jul-2023 14:02:11      34 
    "CCC"    23-Jan-2023 17:35:24    1111 
    "CCC"    23-Jan-2023 17:31:48    2222 
    "CCC"    12-Feb-2023 03:10:11    3333 
B
B = 4×3 table
     idB            dateB             dataB  
    _____    ____________________    ________

    "AAA"    07-Jul-2023 09:30:05    "Red"   
    "BBB"    28-Jul-2023 12:43:36    "White" 
    "CCC"    12-Feb-2023 02:34:17    "Green" 
    "DDD"    29-Sep-2023 05:00:00    "Orange"
AB
AB = 3×5 table
    idAB            dateA                   dateB             dataB     dataA
    _____    ____________________    ____________________    _______    _____

    "AAA"    07-Jul-2023 10:55:19    07-Jul-2023 09:30:05    "Red"      9999 
    "BBB"    28-Jul-2023 14:02:11    28-Jul-2023 12:43:36    "White"      34 
    "CCC"    12-Feb-2023 03:10:11    12-Feb-2023 02:34:17    "Green"    3333 

4 Comments
Show 2 older commentsHide 2 older comments

Marcus Glover on 16 Oct 2023

There will only be one row in B with that unique timestamp and ID combination.

Marcus Glover on 16 Oct 2023

Thanks dpb, fixed the variable names.

Sign in to comment.

Sign in to answer this question.

Answer 1

J. Alex Lee on 16 Oct 2023

0
Link

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/2034254-join-tables-based-on-times-that-are-within-3-hours-of-each-other#answer_1334354

Edited: J. Alex Lee on 16 Oct 2023

Open in MATLAB Online

This may be one such "optimization", though i have no idea if it is faster, but it maybe is less "brute force"?

A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
    [datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
    datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
    [9999; 12; 34; 1111; 2222; 3333], ...
    'VariableNames', {'idA', 'dateA', 'dataA'});
% B = table(["AAA"; "BBB"; "CCC"; "DDD"], ...
%     [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17');...
%     datetime('2023-09-29 05:00:00')], ...
%     ["Red"; "White"; "Green"; "Orange"], ...
%     'VariableNames', {'idB', 'dateB' 'dataB'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
    [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
    datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
    ["Red"; "White"; "Black"; "Green"; "Orange"], ...
    'VariableNames', {'idB', 'dateB', 'dataB'});
AB = table(["AAA"; "BBB"; "CCC"], ......
    [datetime('2023-07-07 10:55:19'); datetime('2023-07-28 14:02:11'); datetime('2023-02-12 03:10:11')], ...
    [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-02-12 02:34:17')], ...
    ["Red"; "White"; "Green"], ...
    [9999; 34 ; 3333],...
    'VariableNames', { 'idAB', 'dateA', 'dateB','dataB','dataA' });
%% work with timetables so you can use things like retime
TA = sortrows(table2timetable(A,"RowTimes",A.dateA));
TB = sortrows(table2timetable(B,"RowTimes",B.dateB));
% break up into id's to reduce logic needed for meeting both conditions
idList = intersect(TA.idA,TB.idB);
% pre-allocate for joined tables by ID
TAB = cell(numel(idList),1);
% iterate through id's
for k = 1:numel(idList)
    
    subA = TA(TA.idA==idList(k),:);
    subB = TB(TB.idB==idList(k),:);
    
    % retime to the table with fewer entries
    % otherwise I think you duplicate entries and have to trim later
    if height(subA) < height(subB)
        src = subA;
        tgt = subB;
    else
        src = subB;
        tgt = subA;
    end
    
    % retime the target table to the timestamps of the source table
    tgtR = retime(tgt,src.Time,"nearest");
    
    % join the tables using built-in
    tmp = innerjoin(src,tgtR);
    
    % discriminate based on difference between times
    % this is already the closest match there is
    % if there are multiple matches within 3 hrs they will be already discounted
    isFar = abs(tmp.dateB-tmp.dateA) > duration(3,0,0);
    
    % remove entries without matches within 3 hours
    tmp(isFar,:) = [];
    
    % save the merged table (for a single id)
    TAB{k} = tmp;
end
AB_2 = timetable2table(vertcat(TAB{:}))
AB_2 = 3×7 table
            Time             idB            dateB             dataB      idA            dateA            dataA
    ____________________    _____    ____________________    _______    _____    ____________________    _____

    07-Jul-2023 09:30:05    "AAA"    07-Jul-2023 09:30:05    "Red"      "AAA"    07-Jul-2023 10:55:19    9999 
    28-Jul-2023 12:43:36    "BBB"    28-Jul-2023 12:43:36    "White"    "BBB"    28-Jul-2023 14:02:11      34 
    12-Feb-2023 02:34:17    "CCC"    12-Feb-2023 02:34:17    "Green"    "CCC"    12-Feb-2023 03:10:11    3333 

3 Comments
Show 1 older commentHide 1 older comment

dpb on 16 Oct 2023

Edited: dpb on 17 Oct 2023

Probably "not so much" if either A or B isn't known a priori to be the dominant one to match against, but if it doesn't matter, then picking either A or B for the list of IDs is ok; since there has to be one in each to match, if you get the odd one(s) in the list you parse or not doesn't matter; the list of which are in both has to be complete regardless of which list you use.

I didn't read his solution closely enough to follow the need for retime, anyway; since the time difference between the two is the selection criterion, it doesn't matter which is the reference, it would simply change the sign of the difference depending on which is compared to which and it was not specified the time difference was signed.

One (I assumed probably minor) optimization in mine would be to get the unique list that isn't in both and iterate over it instead of just picking the one list and working down its members one-by-one as I did. I started on that path and then went the straight list route as I read the initial follow on comment of there being only the one unique time as meaning there was only one--hence the issue with duplicates had to fix later...

J. Alex Lee on 16 Oct 2023

Retime is my way of finding nearest timestamp, mistakenly thought it would be faster than manually searching for each row...oops ;)

Sign in to comment.

Answer 2

dpb on 16 Oct 2023

1
Link

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/2034254-join-tables-based-on-times-that-are-within-3-hours-of-each-other#answer_1334314

Edited: dpb on 16 Oct 2023

Open in MATLAB Online

A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
    [datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
    datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
    [9999; 12; 34; 1111; 2222; 3333], ...
    'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
    [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
    datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
    ["Red"; "White"; "Black"; "Green"; "Orange"], ...
    'VariableNames', {'idB', 'dateB', 'dataB'});
A,B
A = 6×3 table
     idA            dateA            dataA
    _____    ____________________    _____

    "AAA"    07-Jul-2023 10:55:19    9999 
    "BBB"    28-Jul-2023 09:31:00      12 
    "BBB"    28-Jul-2023 14:02:11      34 
    "CCC"    23-Jan-2023 17:35:24    1111 
    "CCC"    23-Jan-2023 17:31:48    2222 
    "CCC"    12-Feb-2023 03:10:11    3333 
B = 5×3 table
     idB            dateB             dataB  
    _____    ____________________    ________

    "AAA"    07-Jul-2023 09:30:05    "Red"   
    "BBB"    28-Jul-2023 12:43:36    "White" 
    "BBB"    28-Jul-2023 20:34:17    "Black" 
    "CCC"    12-Feb-2023 02:34:17    "Green" 
    "DDD"    29-Sep-2023 05:00:00    "Orange"
AB=[];
for i=1:height(A)
  ixU=(B.idB==A.idA(i));
  if any(ixU)
    ixT=false(size(ixU));
    for j=1:numel(ixU)
      if ~ixU(j), continue, end
      %[i j], [A.dateA(i),B.dateB(j)]
      ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
      ixB=ixU&ixT;
      if any(ixB)
        AB=[AB;table(A.idA(i),A.dateA(i),B.dateB(ixB),A.dateA(i)-B.dateB(ixB),A.dataA(i),B.dataB(ixB))];
      end
    end
  end
end
AB.Properties.VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=unique(AB,'rows');
AB
AB = 3×6 table
    idAB            DateA                   DateB            DateDiff    DataA     DataB 
    _____    ____________________    ____________________    ________    _____    _______

    "AAA"    07-Jul-2023 10:55:19    07-Jul-2023 09:30:05    01:25:14    9999     "Red"  
    "BBB"    28-Jul-2023 14:02:11    28-Jul-2023 12:43:36    01:18:35      34     "White"
    "CCC"    12-Feb-2023 03:10:11    12-Feb-2023 02:34:17    00:35:54    3333     "Green"

Above is brute force; depends upon how large the real tables are as to how much optimization might be called for. If numel(ixB) can be > 1 in the wild, then the subsequent assignment needs must be in another loop, of course.

17 Comments
Show 15 older commentsHide 15 older comments

dpb on 17 Oct 2023

Edited: dpb on 17 Oct 2023

Open in MATLAB Online

A = table(["AAA"; "BBB"; "BBB"; "CCC"; "CCC"; "CCC"], ...
    [datetime('2023-07-07 10:55:19'); datetime('2023-07-28 09:31:00'); datetime('2023-07-28 14:02:11');
    datetime('2023-01-23 17:35:24'); datetime('2023-01-23 17:31:48'); datetime('2023-02-12 03:10:11')], ...
    [9999; 12; 34; 1111; 2222; 3333], ...
    'VariableNames', {'idA', 'dateA', 'dataA'});
B = table(["AAA"; "BBB"; "BBB";"CCC"; "DDD"], ...
    [datetime('2023-07-07 09:30:05'); datetime('2023-07-28 12:43:36'); datetime('2023-07-28 20:34:17');...
    datetime('2023-02-12 02:34:17'); datetime('2023-09-29 05:00:00')], ...
    ["Red"; "White"; "Black"; "Green"; "Orange"], ...
    'VariableNames', {'idB', 'dateB', 'dataB'});
VariableNames={'idAB','DateA','DateB','DateDiff','DataA','DataB'};
AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
         'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
         'VariableNames',VariableNames);
k=0;
[A,ixa]=sortrows(A,1); [B,ixb]=sortrows(B,1);
for i=1:height(A)
  ixU=(B.idB==A.idA(i));
  if any(ixU)
    ixT=false(size(ixU));
    for j=1:numel(ixU)
      if ~ixU(j), continue, end
      %[i j], [A.dateA(i),B.dateB(j)]
      ixT(j)=(abs(A.dateA(i)-B.dateB(j))<=hours(3));
      ixB=ixU&ixT;
      if any(ixB)
        k=k+1;
        AB.idAB(k)=A.idA(i);
        AB.DateA(k)=A.dateA(i); AB.DateB(k)=B.dateB(ixB);
        AB.DataA(k)=A.dataA(i); AB.DataB(k)=B.dataB(ixB);
      end
    end
  end
end
AB=AB(~ismissing(AB.idAB),:);
AB=unique(AB,'rows');
AB.DateDiff=AB.DateA-AB.DateB;
AB
AB = 3×6 table
    idAB            DateA                   DateB            DateDiff    DataA     DataB 
    _____    ____________________    ____________________    ________    _____    _______

    "AAA"    07-Jul-2023 10:55:19    07-Jul-2023 09:30:05    01:25:14    9999     "Red"  
    "BBB"    28-Jul-2023 14:02:11    28-Jul-2023 12:43:36    01:18:35      34     "White"
    "CCC"    12-Feb-2023 03:10:11    12-Feb-2023 02:34:17    00:35:54    3333     "Green"

Try the above for speed comparison; will lose some owing to sorting both first (if one or both are already sorted can skip that, of course) but should gain quite a bit by the use of indexing into a preallocated table.

This one allocates a table at the beginning by using the size of the larger of A, B. If there is a way to make an a priori reasonable estimate of the fraction that are retained within the time window desired, then that could save some memory by applying that factor to the initial allocation.

Another refinement that might save a little would be to convert the ID and B data to categorical variables from strings.

Marcus Glover on 18 Oct 2023

I was wondering about that too, but I'm no expert at the profiler. My guess was that it was referringto the ixU=(B.idB==A.idA(i)) line also. You weren't explicitly converting categoricals either.

dpb on 18 Oct 2023

Open in MATLAB Online

I guess it is, it is a matching of element within an array which is what ismember does. Have to think about any alternative syntax that could possibly be more efficient.

The conversion to categorical ought to be done initially to the overall tables; that shouldn't take long

>> B=repmat(B,round(10000/height(B)),1);
>> whos B
  Name          Size              Bytes  Class    Attributes
  B         10000x3             1161609  table              
>> tic;B=convertvars(B,{'idB','dataB'},'categorical');toc
Elapsed time is 0.015839 seconds.
>> 

Remember then to change the variable type in the empty table to match so there's not a conversion to/from during the working loop. From

AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
         'VariableTypes',{'string','datetime','datetime','duration','double','string'}, ...
         'VariableNames',VariableNames);

to

AB=table('Size',[max(height(A),height(B)),numel(VariableNames)], ...
         'VariableTypes',{'categorical','datetime','datetime','duration','double','categorical'}, ...
         'VariableNames',VariableNames);

Sign in to comment.

Join tables based on times that are within 3 hours of each other

4 Comments
Show 2 older commentsHide 2 older comments

Accepted Answer

3 Comments
Show 1 older commentHide 1 older comment

More Answers (1)

17 Comments
Show 15 older commentsHide 15 older comments

See Also

Categories

Tags

Community Treasure Hunt

Join tables based on times that are within 3 hours of each other

4 Comments Show 2 older commentsHide 2 older comments

Accepted Answer

3 Comments Show 1 older commentHide 1 older comment

More Answers (1)

17 Comments Show 15 older commentsHide 15 older comments

See Also

Categories

Tags

Community Treasure Hunt

4 Comments
Show 2 older commentsHide 2 older comments

3 Comments
Show 1 older commentHide 1 older comment

17 Comments
Show 15 older commentsHide 15 older comments