excel table data analysis

4 views (last 30 days)
ALDO
ALDO on 9 May 2019
Commented: dpb on 14 May 2019
Hi guys
My data is (are) in an excel table 20000 by 3. I have a list of 358 patients. There are 48 different bacteria that I am looking for. each patent is tested to see if a bacteria is found in their blood. The day each test is conducted is recorded.
1) when was the first day the patient tested positive (regradless of bacteria type) (ex:first time patient A was positive is day 2)
2) List of all the bacterias that have been found for each individual patient.( ex: Patient A shows bactesias:BAC1 BAC2 BAC3)
3) When was the first time each patient tested possitive for each type of bacteria.(ex: First time BAC1 was found is day 4)
4) In what days was each patient exsibiting each bacteria (ex: BAC 1 was forund for patient A in days 300, 4)
The table is in the following format:
patient day they were tested bacteria found
A 300 BAC1
A 2 BAC2
A 4 BAC1
A 8 BAC3
B 66 BAC5
B 55 BAC1
C 208 BAC2
C 77 BAC2
C 51 BAC9
D 90 NAN
I have been struggling with this for a while. I would appriciate any input.
Please let me know if clarifications are needed. Thanks so much for the help in advance!
  2 Comments
dpb
dpb on 10 May 2019
  1. Is the first time really 300, then 2, or is that a typo or are there multiple years in the dataset and the times are day of year?
  2. How do you get 2 for 1.? Looks to me like was 300--there's something reported for first time in the dataset (BAC1).
With a little clarifcation on what the dataset parameters are, doesn't look terribly complex. Convert the Patient and Bacteria to categorical variables, put in a table and use findgroups to identify patients...the when was the first? questions are simple lookups but again what the time range and definition is may muddle the logic if there's wraparound as looks like might be will need to fix that issue first.
The "how many days?" questions are also simple lookup; with categorical data matches are exact so ismember or similar will answer those trivially.
ALDO
ALDO on 10 May 2019
Thanks for your fast responce!
1)'day they were tested' is the day they took the test so, day 0 is the day the patient was admited to the hospital and day 300 is the day they took the test. the tabel order doesnt represent which test was taken first.
2)the answer is 2 because they first time the patient tested positive was on day 2. so out of day-300 day-2 day-4 day-8. the smallest would be 2. the smaller the number the sooner the test was taken.
I did the categorial variables and find groups as you said. This is what i have so far as my code i know i have overcomplicated things but This is my first time working with tables so feel free to point out my mistakes since i am not familiar with what to do:
Table=table(Pateint, Day_of_Culture, Organism,'VariableNames',{'patient' 'day' 'Organism'});
Table.Day_of_Culture=categorical(Table.Day_of_Culture);
Table.Organism=categorical(Table.Organism)
patients=unique(patient);
orgnz=unique(Organism);
G = findgroups(Table.patient);
how_many_pateints=max(G);
func=@(varargin) {table(varargin{:}, 'VariableNames', Table.Properties.VariableNames)};
output = splitapply(func, Table,G);
result= zeros(357,48);
for i=1:how_many_patients
[C,ia,ic] =unique(output{i,1}.Organism); %Unique Organisms found for each %patient
Organism_for_each_patient{i}= C;
for j=1:length(C)
u=output{i,1}.Organism((strcmpi(output{i,1}.Organism,C(j) )));%strcmpi will find the indexes of the variable
v=output{i,1}.Patient((strcmpi(output{i,1}.Organism,C(j) )));
w=output{i,1}.Day_of_Culture((strcmpi(output{i,1}.Organism,C(j) )));
table2=table(v,w,u,'VariableName',{'Patient' 'day' 'Organism'});
result(i,j)=min(w);
end
%first time the first variable name i am looking for ex:{'BAC1'} occured , %so the smallest day of cuture
end

Sign in to comment.

Accepted Answer

dpb
dpb on 10 May 2019
Edited: dpb on 10 May 2019
It's not trivial to learn the tricks, granted...but here's a start for the first...with it and some study of the examples, should get an idea -- it's late; I've got to turn in at this point, though..sorry :)
tBAC=readtable('bacteria.dat','ReadVariableNames',1); % read the data in
tBAC.patient=categorical(tBAC.patient); % fix data types
tBAC.found=categorical(tBAC.found);
[ig,Patient]=findgroups(tBAC.patient); % group index, group names
FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig); % find first infection any kind
FirstInfection(cellfun(@(x) length(x)==0,FirstInfection))={inf}; % clean up missing (no infection)
[table(Patient) cell2table(FirstInfection)] % display result in a table form
results in for the first problem statement...I chose "inf' as the indicator of no infection instead of NaN -- the decimals show up because I'd been doing financial work and had format bank in effect.
ans =
4×2 table
Patient FirstInfection
_______ ______________
A 2.00
B 55.00
C 51.00
D Inf
>>
I slightly modified your data to a file with the header line "patient, testday, found" and made a csv-file of your plain text...I presume you have a file format of your own.
Modifications to the functional for the remainder should suffice I think...altho 3 and 4 need grouping by both patient and bacteria ID.
ADDENDUM:
Added the identifier for which bacteria ID was the first. Having the standalone function means can clean up the return data there instead of afterwards--so there is some payback for the extra code. :)
NB: I got the return arguments from min in wrong order last night; the index is the optional second, not the first. This produces the amplified table:
tBAC=readtable('bacteria.dat','ReadVariableNames',1); % read the data in
tBAC.patient=categorical(tBAC.patient); % fix data types
tBAC.found=categorical(tBAC.found);
% 1. First occurrence of any in each patient
[ig,Patient]=findgroups(tBAC.patient); % group index, group names
%FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig); % find first infection any kind
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig);
table(Patient,FirstInfection,Infection)
% 2. All occurrences in each patient
[AllInfections]=splitapply(@allinfections,tBAC.found,ig);
[table(Patient) cell2table(AllInfections)]
% 3. First occurrence of each bacterium in each patient
[ig,Patient,Bacterium]=findgroups(tBAC.patient,tBAC.found);
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig2);
table(Patient,Bacterium,FirstInfection,Infection)
% 4. Ooccurrences of each bacterium in each patient
% EXERCISE FOR STUDENT :)
function [tFirst,bFirst]=firstinfected(t,r)
% return first time, infection
[tFirst,iFirst]=min(t(r~='NAN'));
if isempty(tFirst)
tFirst=nan;
bFirst='NAN';
else
bFirst=r(iFirst);
end
end
function [b]=allinfections(r)
% return all infections for each
b={unique(r(r~='NAN')).'};
if isempty(b)
b='NAN';
end
end
>> table(Patient,FirstInfection,Infection)
ans =
4×3 table
Patient FirstInfection Infection
_______ ______________ _________
A 2.00 BAC2
B 55.00 BAC1
C 51.00 BAC9
D NaN NAN
ans =
4×2 table
Patient AllInfections
_______ _________________
A [1×3 categorical]
B [1×2 categorical]
C [1×2 categorical]
D [1×0 categorical]
ans =
8×4 table
Patient Bacterium FirstInfection Infection
_______ _________ ______________ _________
A BAC1 4.00 BAC1
A BAC2 2.00 BAC2
A BAC3 8.00 BAC3
B BAC1 55.00 BAC1
B BAC5 66.00 BAC5
C BAC2 77.00 BAC2
C BAC9 51.00 BAC9
D NAN NaN NAN
>>
Unfortunately, the builtin table display function won't show the actual categorical variable values for each patient since they're not the same length of each array--and a table has to be regular in number of variable sfor each row/observation so can't create multiple variables without a lot of ugly NAN values scattered around.
The really cute part is the firstinfection function works for any chosen grouping so that don't have to do anything except use the other grouping variables. You could choose to not populate the tble with the second return or not use the second ID variable since they are the same...
Now, your mission, should you choose to accept it, is last item, #4... :)
  11 Comments
ALDO
ALDO on 14 May 2019
I want to see if the result of the test was positive or negative. So I have to look at each patient and see in what day ranges the bacteria was found. And depending on wether it was within the range it gives a logic 0 1 response. For example for patientA and look at each individual bacterium that was found. Was BAC1 was found on day 1, If yes gives logic answers 1. Also assess if it was present in specific day ranges of days, day 1-28 if yes gives logic answer 1. The specific days i want to look st are: day 1 2 3 The ranges i wasnt to look at are: 1-2 1-14 1-7 1-28 29-56 57-84.
I hope I explained the question correctly please let me know if you need any clarifications.
You mentioned that I could use summary Satistics thats why i was trying to use grpstats.
dpb
dpb on 14 May 2019
OK. that makes sense -- since you are actually counting days independent of the calendar, simply arranging by those numeric days is all that is needed.
I've got to go do some other stuff at the moment, but I'll mull over the idea...the general idea I'm thinking of is that one of the statistics grpstats returns in number elements--if there is a grouping variable for each category, then the answer of logical True is equivalent to number of elements in the group.
However, it seems to me you need a grouping variable for each of the combinations; at least otomh I don't see an easy way to combine results other than just the one case of 1-2 is 1 + 2 but none of the rest are composited uniquely from others so have to group each separately.

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!