Find locations of repeated values?

Jacqueline on 15 Jul 2013
So, I have this function that takes a set of data and checks whether any value repeats for more than 300 seconds in that data set:
function FindRepetition(TruckVariableName)
    setpref('Internet','SMTP_Server','lamb.corning.com');
    data1 = TruckVariableName;
    x = length(TruckVariableName);
    data = reshape(data1, 1, x); % force a row vector
    % Binary data: 1 means the value repeats, 0 means it differs; repeated zeros are excluded.
    % An element is true if the difference at that point is zero and the data at
    % that point is not itself zero. diff(data) has x-1 elements, so it is paired
    % with data(2:x), which also has x-1 elements; & requires matching dimensions.
    datarep = ~diff(data) & data(2:x) ~= 0;
    datarepstr = num2str(datarep); % convert to string
    s = regexprep(datarepstr,' ',''); % remove spaces
    [startindex,runs] = regexp(s,'1+','start','match'); % find all runs and the point where they start
    l = cellfun('length',runs); % find the length of each run
    y = l > 300;
    if any(y) % if any run is longer than 5 minutes, display message
        %sendmail('johnsonlj2@corning.com', '2011 KENWORTH ISX15','A data fault has been detected - Prolonged data repetition');
        disp('--An error has occurred - Prolonged data repetition.');
        disp('Errors occurred at');
    end
end
I want to find WHERE those repeated values start in that set of data. I tried disp(find(y));, but that returns positions within y (the list of runs), not positions in the original data set. Anyone know how I can find the locations in data1 where the data repeats for more than 300 seconds?
  2 Comments
Cedric on 15 Jul 2013
Edited: Cedric on 15 Jul 2013
Could you provide a sample dataset or the content of this TruckVariableName that you pass to your function?
Jacqueline on 15 Jul 2013
One of my variables is engine speed, and the data is collected for over 95,000 seconds. A chunk of the data may look like this...
1055.25000000000 777.250000000000 771.750000000000 1112.37500000000 1151.37500000000 1447 1447 1447 1447 1447 1447 1447 1447 668.625000000000 803.750000000000 850.250000000000 693.625000000000 1069.37500000000 868.500000000000 985.875000000000 1085.87500000000 1148 1065.62500000000 978.250000000000 885.750000000000 723.125000000000 638.125000000000 678.500000000000 807.500000000000 692.750000000000 814.875000000000
See how 1447 is repeated? Say that was repeating for more than 300 seconds. My script would use ~diff to replace the non-repeating numbers with 0s and the repeating numbers with 1s. Then it finds where the ones repeat for more than 300 seconds. When I use find(y) though, it finds locations, but they don't correspond to the original data set.

Accepted Answer

Cedric on 15 Jul 2013
Edited: Cedric on 15 Jul 2013
I think that you can use two approaches. I'll illustrate with a simple example: say we have the following data
>> data = [7 8 8 8 8 6 6 7 8 7 7 7] ;
and we want to get blocks of repeating values with at least 3 elements.
1. Based on your REGEXP method, you would indeed look for the positions of runs of 1's that are at least a given length.
>> rep = ~diff(data) % Add other components if needed.
rep =
0 1 1 1 0 1 0 0 0 1 1
>> repStr = sprintf('%d', rep)
repStr =
01110100011
>> start = regexp(repStr, '1{2,}', 'start') % 3 similar values -> 2 repetitions.
start =
2 10
2. Without conversion to string and REGEXP:
>> buffer = [true, diff(data)~=0]
buffer =
1 1 0 0 0 1 0 1 1 1 0 0
>> groupStart = find(buffer)
groupStart =
1 2 6 8 9 10
>> groupId = cumsum(buffer)
groupId =
1 2 2 2 2 3 3 4 5 6 6 6
>> groupSize = accumarray(groupId.', ones(size(groupId))).'
groupSize =
1 4 2 1 1 3
>> start = groupStart(groupSize > 2)
start =
2 10
EDIT: note that the 2nd method is more than 5 times faster than the 1st on large datasets.
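Applied to the original problem, the 2nd method might look roughly like the sketch below. The names minLen, runStart and runLength are made up for illustration, and the toy data stands in for the real signal. It assumes one sample per second, so a run length in samples equals a duration in seconds. Run sizes are computed here with diff on the start indices instead of accumarray, which gives the same sizes, and runs of zeros are skipped as in the original function.

```matlab
% Sketch only: minLen and the variable names are illustrative, not from the thread.
data   = [7 8 8 8 8 0 0 0 0 7 7 7];   % toy data; replace with the real signal
minLen = 2;                           % use 300 for runs longer than 300 samples

data       = reshape(data, 1, []);                % force a row vector
isStart    = [true, diff(data) ~= 0];             % true where a new run begins
groupStart = find(isStart);                       % index in data where each run starts
groupSize  = diff([groupStart, numel(data) + 1]); % number of samples in each run
keep       = groupSize > minLen & data(groupStart) ~= 0; % long runs, zeros ignored
runStart   = groupStart(keep);                    % positions in the original data
runLength  = groupSize(keep);

for k = 1:numel(runStart)
    fprintf('Value %g repeats for %d samples starting at index %d\n', ...
        data(runStart(k)), runLength(k), runStart(k));
end
```

With the toy data above, runStart is [2 10] and runLength is [4 3]; the run of four zeros is ignored.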
  3 Comments
Cedric on 15 Jul 2013
Edited: Cedric on 15 Jul 2013
In your command window, type
doc sprintf
then, in the SPRINTF documentation, look up formatSpec, which describes all the format conversion specifiers. %d is for integer, which means that elements of rep are interpreted as integers and converted to string as such.
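As a small illustration of why the '%d' specifier is convenient here, the sketch below compares it with the num2str route from the question. The variable names s1 and s2 are arbitrary. With '%d' each element of rep is printed as an integer with nothing in between, whereas the num2str output has to be stripped of spaces afterwards.

```matlab
rep = logical([0 1 1 1 0 1]);   % example mask, as produced by ~diff(data)

s1 = sprintf('%d', rep);        % each element printed as an integer, no separators

s2 = num2str(rep);              % elements separated by spaces
s2 = strrep(s2, ' ', '');       % remove the spaces to get the same string

disp(s1);                       % 011101
disp(strcmp(s1, s2));           % 1 when both routes agree
```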



More Answers (1)

Muthu Annamalai on 15 Jul 2013
Guessing from reading the code, and the comments in the code itself, you are looking for the variable, startindex
[startindex,runs] = regexp(s,'1+','start','match'); %find all runs and the point where they start
So just add this to your return value from the function, and you should be all set.
  1 Comment
Jacqueline on 15 Jul 2013
That finds the starting point of where there are more than one 1s in a data set of 1s and zeros. The length of that string is different than my original string, which is where I need to find the locations of the repeating values
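One observation that may help here: the 0/1 mask from ~diff(data) is only one element shorter than the data, and element k of the mask compares data(k) with data(k+1). Read that way, a run of 1s starting at position k in the mask corresponds to repeated values starting at data(k), so startindex(y) should already index into the original data. A small sketch using the toy data from the accepted answer, with the threshold lowered to 1 for illustration (locations and values are names introduced here):

```matlab
data = [7 8 8 8 8 6 6 7 8 7 7 7];            % toy data from the accepted answer
datarep = ~diff(data) & data(2:end) ~= 0;    % same mask as in the question
s = sprintf('%d', datarep);                  % mask as a string of 0s and 1s
[startindex, runs] = regexp(s, '1+', 'start', 'match');
l = cellfun('length', runs);                 % length of each run of 1s
y = l > 1;                                   % 300 in the original function
locations = startindex(y);                   % where the long runs begin in data
values = data(locations);                    % the values that repeat
disp(locations);                             % 2 10
disp(values);                                % 8 7
```

Note that a run of n ones in the mask means n+1 equal samples in the data.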

