Random Number Generation with Parameters

Dear All,
Kindly asking for your assistance: I am trying to generate a time series of approximately 500 data points with the following parameters
1) values between min=0 and max=1500
2) 0-100 range: 50% of total data; 101-200: 25% of total data; etc.
3) x/x+1 data should be +/-10% at 20% of data, x/x+1 should be between +/-10% and +/-20% at 15% of total data, etc.
Any ideas will be highly appreciated

7 Comments

What does this mean:
3) x/x+1 data should be +/-10% at 20% of data, x/x+1 should be between +/-10% and +/-20% at 15% of total data, etc.
does it say something about the ratio of consecutive terms? And what exactly does it say? I'm sorry, but that line is highly confusing.
Apologies for the poor description...
Concerning the third parameter, let me provide an example.
Take 100, 120, 135, ...: the percentage difference of 100 and 120 is 20% (100 × 1.2), while the percentage difference of 120 and 135 is 12.5%. I would like to specify the number of % differences within the generated data series, e.g. 0%-10% differences for 20% of the entire data (500 × 20% = 100 points), 10%-20% differences for 15% of the entire data (500 × 15% = 75 points), etc. I hope it is clearer now. Since I also want to generate consecutive data with 50% and 60% differences, it would be useful if the generated values did not stay continuously within one range (0-100, 101-200, 201-300, etc.), i.e. jumps like 100, 275, 510 should occur.
Many thanks!!!
This script estimates the parameters for a Markov model of the system. The script converts the engine power sequence to an equal-length sequence of 14 levels, from 1 to 14. Each level is considered a state. It computes the 14x14 matrix of probabilities of each state going to each other state (including the same state) at each time point. Some of the probabilities are zero, because there are no transitions from certain levels to certain other levels in the original sequence. The script also computes the mean lifetime (in samples) of each state, although this is not needed for the next stage of simulation, and the lifetimes can be computed directly from the probability matrix. The transition probability matrix and the mean lifetimes are saved in a .mat file.
You can use the transition probability matrix to construct a 500-point-long sequence.
The script and the .mat file created by the script are attached.
[Edit: Add comments and code at the end to convert the "levels" signal (V2) to a signal (U2) with engine power in the original signal units.]
Load the original sequence, convert it to levels, and display the level signal and its histogram.
load('Data_time_power_values'); % load original sequence
U=Data(:,2); % column 2 = engine power
U=U(U~=0); % remove zeros from U
N1=length(U); % sequence length, after removing zeros
% Define sequence V1 as sequence U, quantized into integer-valued levels
V1=ceil(U/100); % V1=level at every time
Ns=length(unique(V1)); % number of states
figure;
subplot(211), plot(1:N1,V1); title('V1'); ylabel('Level')
subplot(212), histogram(V1);
title('V1 Histogram'); ylabel('Counts'); xlabel('Level')
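For reference, a minimal sketch of how the transition probability matrix described above could be estimated from V1 (the attached script is the authoritative version; this sketch just counts transitions and normalizes each row):

```matlab
% Count transitions V1(k) -> V1(k+1), then normalize each row to sum to 1.
transProb=zeros(Ns);
for k=1:N1-1
    transProb(V1(k),V1(k+1))=transProb(V1(k),V1(k+1))+1;
end
rowSums=sum(transProb,2);
rowSums(rowSums==0)=1; % avoid 0/0 for a level with no outgoing transitions
transProb=transProb./rowSums;
```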
Now simulate the system. Start by loading the Markov model transition probability matrix, computed earlier.
load('Data_time_power_values_MM.mat'); % load transProb
Each row of transProb is like a probability density: the values on each row sum to 1. It will be useful for simulation to have a matrix in which each row is like a cumulative distribution, i.e. the last value on each row is 1.
B=zeros(Ns); % allocate matrix
for i=1:Ns
B(i,:)=cumsum(transProb(i,:)); % cum.dist.function
end
Use B to simulate. Start at level 9, since that is the most probable state, as shown by the histogram above.
N2=500; % desired sequence length
V2=zeros(N2,1); % allocate vector for sequence of levels
V2(1)=9; % initial value: level 9
for i=2:N2
x=rand; % uniform random in (0,1)
j=V2(i-1); % level of previous state is the row of B to use
V2(i)=find(x<B(j,:),1);
end
Let's plot the sequence and its histogram.
figure
subplot(211); plot(1:N2,V2,'-r'); title('V2'); ylabel('Level')
subplot(212); histogram(V2);
title('V2 Histogram'); ylabel('Counts'); xlabel('Level')
The histogram of V2 is similar to the histogram of the original signal. It is not identical, due to randomness and due to the fact that V2 has only 500 elements. For example, since the original sequence had only 7 samples at level 14, out of 30000, it is not surprising that a sequence of length 500 has no samples at level 14. The probabilities of transitions between each level and each other level are also similar for the original sequence (V1) and the new sequence (V2), but not identical, for the same reasons.
Finally, convert V2 (units of levels) to a signal with engine power in the same units as the original signal. Since level 1 = 0-99, level 2 = 100-199, etc., we convert level 1 to 50, level 2 to 150, etc.
U2=100*V2-50; % convert levels to engine power
figure; plot(1:N2,U2,'-b');
xlabel('Samples'); ylabel('Engine Power'); title('U2')
OK
@Demosthenis Kasastogiannis, you're welcome. You can accept the answer if it suits you. Good luck with your work.


Accepted Answer

I share @John D'Errico's question. I am not sure I understand your reply to his question. My understanding of your wishes is below. If my interpretation is wrong, please explain it again, and explain why you want what you want, because maybe that will help me understand.
My interpretation of your reply to @John D'Errico: You would like consecutive numbers in the "random" data set to differ by 0 to 10%, in 20% of cases. You would like consecutive numbers in the data set to differ by 10-20% in 15% of cases. I think you want consecutive numbers to differ by 50% to 60% in some cases, but I am not sure how often.
In your original post, you said you want
A. Approximately 500 random numbers in the range [0,1500].
B. "50% of numbers between 0 and 100, 25% of numbers between 100 and 200, etc."
C. The requirements about differences between consecutive numbers, described above.
Requirement B is met by an exponential distribution with mean mu = 100/ln(2) ≈ 144.27: then P(0 <= x < 100) = 1 - exp(-100/mu) = 1/2, and each successive 100-wide interval has half the probability of the previous one.
Random numbers from this distribution may, rarely, exceed 1500. We can check for any such values and delete them if found, to comply with requirement A.
N=500; % number of initial points
mu=100/log(2); % distribution parameter
x=exprnd(mu,[1,N]);
[~,ind]=find(x<=1500); % indices of elements of x<=1500
y=x(ind); % y=elements of x that are <=1500
M=length(y);
fprintf('length(y)=%d; min=%.1f, max=%.1f.\n',M,min(y),max(y));
length(y)=500; min=0.4, max=988.9.
fprintf('0<=y<100 in %.1f %% of cases.\n',sum(y<100)*100/M)
0<=y<100 in 47.4 % of cases.
fprintf('100<=y<200 in %.1f %% of cases.\n',sum(y>=100 & y<200)*100/M)
100<=y<200 in 23.8 % of cases.
The results above indicate that vector y satisfies requirements A and B above.
For requirement C, let us first see what the distribution of percent differences between successive values looks like, if we use vector y.
percentDiff=100*diff(y)./y(1:end-1);
histogram(percentDiff, 'Normalization','probability')
grid on; ylabel('Probability')
xlabel('Percent Difference'); title('Difference Between Consecutive Values')
The plot shows that the difference between consecutive elements is between -1000% and 0% in approximately 50% of cases, between 0 and +1000% in approximately 45% of cases, and exceeds +1000% in the remaining cases. It makes sense that the difference is negative half the time, for this random sequence. In fact, we know from the definition of percent change, and the fact that the distribution is non-negative, that the successive difference can never be smaller than -100%. Let us make a histogram plot with an expanded horizontal axis to learn more.
figure
histogram(percentDiff,'BinWidth',10,'Normalization','probability')
xlim([-120,100]); grid on; ylabel('Probability')
xlabel('Percent Difference'); title('Difference Between Consecutive Values')
The histogram above shows that the difference between successive elements is between -10% and +10% in about 6.0% of cases (the sum of the heights of the two central bars). (Exact values will differ when you run the code, due to randomness.) The difference is in the range -20% to -10%, or +10% to +20%, in 5.4% of cases. I think you want the difference to be between -10% and +10% in 20% of cases, and you want the difference to be in the range -20% to -10%, or +10% to +20%, in 15% of cases.
I have demonstrated how to make a vector of values that satisfies conditions A and B, and I have demonstrated how to evaluate whether condition C is satisfied.
To make a sequence w() that satisfies condition C (but maybe not conditions A and B), you could try the equation
w(j)=a(j)*w(j-1);
where a(j) is a random number in the range 0.9 to 1.1 in 20% of cases, and a(j) is random in the range (0.8 to 0.9 or 1.1 to 1.2) in 15% of cases, etc. I tried the sequence above. It is tricky to work with. The results are sensitive to the details of the probability distribution of the coefficients a(j). Sequences w(j) that go to 0 (when long) are observed for some a(j) distributions. Some a(j) distributions are likely to produce w(j) sequences that are very large at times.
I would not be surprised if it is impossible to satisfy conditions A, B, and C simultaneously.
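A hedged sketch of the w(j)=a(j)*w(j-1) idea follows. The split of the remaining 65% of cases among the larger changes is purely illustrative, not something specified in the question:

```matlab
% Multiplicative sequence: each value is the previous value times a random
% factor a. The probabilities and ranges below are illustrative only.
N=500;
w=zeros(N,1);
w(1)=100; % arbitrary starting value
for j=2:N
    r=rand;
    s=sign(rand-0.5); % random sign of the change
    if r<0.20
        a=1+s*0.10*rand;        % 0-10% change: 20% of cases
    elseif r<0.35
        a=1+s*(0.10+0.10*rand); % 10-20% change: 15% of cases
    else
        a=1+s*(0.20+0.40*rand); % 20-60% change: remaining 65% (illustrative)
    end
    w(j)=a*w(j-1);
end
```

As noted above, sequences like this can drift toward 0 or blow up over time, depending on the distribution chosen for a(j).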

17 Comments

Hi William, many thanks for your detailed answer and time spent.
My effort is to generate, from a large data series (40,000 points), a smaller series (500 points) of the engine's power profile, having the same characteristics as the initial profile. In my description, a) the ranges, i.e. 0-100, 101-200, ..., are the energy intervals in which the engine works, and b) the % changes represent the increases/decreases in energy demand from the engine between consecutive time points t(x) and t(x+1). My difficulty is in satisfying condition C; the first condition can also be described as a probability of occurrence for each power level, which seems to work:
pmf = [All the probabilities of the power levels occuring on the initial data series (40.000) for each value];
population = 1:1500;
sample_size = 500;
random_number = randsample(population,sample_size,true,pmf);
I am not sure of whether my approach for solving this problem is the correct one.
[Edit: Add ")" in my comments, where I forgot to include it.]
Can you upload a mat file with the 40000 power levels?
You want a vector y, of length 500, whose elements are engine power levels at specific times. The power levels are integers in the range [1:1500].* The values of y are to be selected randomly but should follow certain probabilities. The probabilities of selecting each integer in [1:1500] are obtained from the probability of that integer occurring in the sequence x, which has length 40000. (I assume the sequence x has only integers, and min(x)=1 and max(x)=1500.)
You can generate a sequence y that meets the requirements above, as follows:
% x1=40000 random integers with approximately exponential distribution.
% If I had your file of 40000 integers, I would read it in to get x.
x1=round(0.5+exprnd(100/log(2),[1,40000])); % engine power (integers)
ind=find(x1<=1500);
x=x1(ind); % x=x1, without any values >1500 (engine power)
N=length(x); fprintf('Length(x)=%d, min(x)=%d, max(x)=%d.\n',...
N,min(x),max(x))
Length(x)=39998, min(x)=1, max(x)=1472.
pop=1:1500; % population of random integers
% Next: Compute the weighting vector w, whose elements are
% the number of occurrences in x of each value in pop
w=zeros(1,length(pop)); % allocate w
for i=1:length(pop)
w(i)=sum(x==pop(i));
end
fprintf('w includes %d values=0.\n',sum(w==0))
w includes 603 values=0.
% Next: y=500 samples from pop using the weighting factors w
y=randsample(pop,500,true,w); % engine power
% Plot w=weighting factors and the sequence y.
figure
subplot(211), plot(1:1500,w,'b.')
xlabel('Population value'); ylabel('Weight'); title('w'); grid on
subplot(212), plot(1:500,y,'-r.')
xlabel('Time'); ylabel('Power'); title('y=Engine Power'); grid on
A key feature of y constructed above is that its autocorrelation is basically 0 for lags greater than 0, i.e. y(n) and y(n-1) have no correlation. You want the random variable z(n)=y(n)/y(n-1) to have a certain probability distribution. This is not possible if you generate y() with randsample(). See my previous comment for a suggestion for how to generate a sequence y such that z has a desired probability distribution.
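One quick way to check the autocorrelation claim (a sketch, assuming the vector y generated above):

```matlab
% Lag-1 sample correlation of y; for randsample output this should be
% near 0, i.e. consecutive values are unrelated.
c=corrcoef(y(1:end-1),y(2:end));
fprintf('Lag-1 correlation of y = %.3f\n',c(1,2))
```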
*Is the lower limit for power 0 or 1? You say 1:1500 in one place, but you also say "ranges ie 0-100,101-200...".
Hi William,
Please find attached the actual time series (Data_time_power_values) plus an Excel file with the same data (data.xls) and the distributions of the ranges and % differences. You can ignore all the zero values as they don't affect the final result.
thanks again for all the effort!!
Demosthenis
[Edit: Fix spelling errors. Update x-axis labels and titles on last 2 plots.]
You want to generate a data set with ~500 points from the 43K-point data set, in a way that preserves some of the statistical properties of the large data set - including the probability distribution and the distribution of the percent differences of consecutive values.
The histogram for the energy values (sequence v below, after removing zeros) is very different from the distribution of values you requested above. You requested twice as many values in the range 0-100 as in the range 100-200, etc. I thought you wanted the distribution of the 500 samples to be like the large data set. Was I wrong?
The histogram for the percent differences (sequence z, below), is very different from the distribution you requested. You requested 20% of values to be 0 to 10% (which I assume means -10% to +10%), and you requested 15% of values to be in the range 10 to 20% (which I assume means -20 to -10% and +10 to +20%). About 80% of the z values in the large data set are in the range -10 to +10%. Again, I assumed you wanted the distribution of percent differences in the 500-long-data to be similar to the large data set. Was I wrong?
You say "you can ignore the zero values as they don't affect the final result". (My italics.) Can you explain a) what you mean by the "final result", and b) why are you comfortable ignoring 30% of the data? I'm sure you have a good reason for ignoring the zeros, and I would like to understand it.
There is more than one way to "ignore" the data, and it makes a difference to how we analyze the data and generate a 500 point sequence. Two different ways are below, and I'm sure you can think of other ways. Which way do you want to ignore the zeros?
A. Let u(i) be the time series which has ~43K points, including ~30% zeros. Make a new time series v(i) which is u(i) but with all the zeros removed. Analyze v(i), and the successive differences, as if it has no gaps.
B. Make v(i) from u(i) by replacing all zeros in u(i) with NaNs. This means that successive differences will only be computed from values that were consecutive in the original sequence.
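Option B might be sketched like this (v and z named as elsewhere in this thread):

```matlab
% Option B: zeros become NaN, so percent changes are only kept between
% samples that were truly consecutive in the original series.
v=u;
v(u==0)=NaN;
z=100*diff(v)./v(1:end-1); % NaN wherever either neighbor was zero
z=z(~isnan(z)); % keep only valid consecutive-sample percent changes
```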
Let's examine your data.
load('Data_time_power_values.mat');
t=Data(:,1);
u=Data(:,2); % u=power values
Nu=length(u);
% Display info about u(i)
fprintf('Len(u)=%d, min(u)=%.3f, max(u)=%.3f, numZeros=%d, numNaNs=%d.\n',...
Nu,min(u),max(u),sum(u==0),sum(isnan(u)))
Len(u)=42745, min(u)=0.000, max(u)=1335.600, numZeros=12672, numNaNs=0.
% v(i)=u(i) minus the zeros
v=u(u>0);
Nv=length(v);
% Display info about v(i)
fprintf('Len(v)=%d, min(v)=%.3f, max(v)=%.3f, numZeros=%d, numNaNs=%d.\n',...
Nv,min(v),max(v),sum(v==0),sum(isnan(v)))
Len(v)=30073, min(v)=1.000, max(v)=1335.600, numZeros=0, numNaNs=0.
% Compute z=% difference of consecutive values
z=100*diff(v)./v(1:end-1);
Nz=length(z);
% Display info about ratio z(i)
fprintf('Len(z)=%d, min,max(z)=%.1f,%.1f, numZero=%d, numNaN=%d, num<0=%d, num>0=%d.\n',...
Nz,min(z),max(z),sum(z==0),sum(isnan(z)),sum(z<0),sum(z>0))
Len(z)=30072, min,max(z)=-99.0,7400.0, numZero=11, numNaN=0, num<0=18458, num>0=11603.
% Plot u(i), v(i)
figure
subplot(211), plot(1:Nu,u,'-r',1:Nv,v,'-b')
grid on; xlabel('Index'); ylabel('Power'); title('u,v'); legend('u','v')
% Plot histogram of v
subplot(212), histogram(v,'Normalization','probability');
grid on; xlabel('Power'); ylabel('Probability'); title('Histogram of v')
% Plot histogram of z, twice, with different horizontal scales
figure
subplot(211), histogram(z,'Normalization','probability')
grid on; xlabel('Percent Change'); ylabel('Probability')
title('Histogram of Percent Change Values')
subplot(212), histogram(z,'Normalization','probability','BinWidth',10)
grid on; xlabel('Percent Change'); ylabel('Probability')
title('Histogram of Percent Change Values'); xlim([-100,100])
Apologies for the confusion; initially I was trying to describe my problem in a simpler way, in order not to take too much of your time.
The appropriate approach is a): make a new time series v(i) which is u(i) but with all the zeros removed.
The new 500-point time series should have, to the extent possible, the same a) range distribution and b) % change distribution as the initial one.
Apologies for any inconvenience caused; that was not my intention.
@Demosthenis Kasastogiannis, no problem. Busy tonight. More tomorrow.
Dear William, Any update on the above? Thank you
Call the long sequence, which you posted, U. U has ~3x10^4 points, after the zeros are removed. I found Umax and Umin. I found the distribution of the percent changes, Zu: Zu(i)=100*(U(i+1)-U(i))/U(i). Then I made a function that returns a random value with the same distribution as Zu. I used the function to make a sequence Y, by applying the randomly chosen percent change values: Y(i+1)=Y(i)*(1+Zy(i)/100). If the value of Y obtained at any step was outside the range [Umin,Umax], then I picked another random value of Zy, until I got a Y value that was within the allowed range.
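A sketch of that construction follows. Here drawZ is a placeholder for a sampler of the empirical Zu distribution (such as the attached myRand function), and Umin and Umax are the min and max found above:

```matlab
% Build Y by applying random percent changes drawn from the Zu
% distribution, redrawing whenever a step would leave [Umin,Umax].
N2=500;
Y=zeros(N2,1);
Y(1)=median(U); % arbitrary in-range starting point
for i=1:N2-1
    Ynext=Y(i)*(1+drawZ()/100);
    while Ynext<Umin || Ynext>Umax
        Ynext=Y(i)*(1+drawZ()/100); % reject out-of-range step, redraw
    end
    Y(i+1)=Ynext;
end
```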
Sequences Y and U are below.
Y will be somewhat different each time you run the script, due to randomness. The general appearance of Y, as seen above, is consistent across multiple trials. You can see from the plot above that sequence Y does not look like sequence U. This suggests that there are dynamics governing U that are not captured by the sequential percent changes. You can see an example of the non-random dynamics of U by plotting the last 500 samples of U. (Try it.) This segment shows that U(i) is not governed by applying a sequence of random percent changes.
The plot below shows the distribution of percent changes, Zu and Zy, for U and for Y. Each histogram is shown twice, with different scales, to show different details.
The distributions of Zu and Zy, above, are quite similar but not identical, especially between -20% and +20% on the x-axis. The difference is due to the application of the max/min constraint on Y. (If you remove the constraint Umin<Y<Umax, then the distributions of Zu and Zy look identical, up to random variation. But if you do that, the sequence Y will look nothing like sequence U.)
Hi William,
Your point is completely understood as it is the same issue I am facing with all efforts up to now.
Matlab is trying to satisfy each constraint individually, not across the entire data sample (first the 0-100 range for x numbers, then 101-200 for y numbers, etc., and then the percentage changes). Quite often the generated numbers reach the upper or lower limit of the constraint, in which case the program continuously produces the min or max value (or a random number, if so instructed), violating the constraint itself. The program is instructed to satisfy one objective at a time, e.g. 40 consecutive points with 0-10% change versus the previous point, not across the entire sample but, say, for the first 40 numbers; I believe this is the main issue. So randomness, if any, is very weak in my approach. I am not sure whether Matlab can process these kinds of constraints in a different way.
Many thanks again for your valuable time. I do have a couple of ideas; if they work I will post the results.
Here is the script I used to make the figure above.
I have also attached function myRand(), which the script calls. I have also posted myRand on the Matlab File Exchange.
Why do you want to make a signal similar to the signal in Data_time_power_values.mat? Your answer may help you find the right course to take.
What process produced the numbers in file Data_time_power_values.mat? What was the nature of the process? Why does it show periods of stability, interrupted by abrupt changes? Consider the behavior shown below:
% load data
Data=load('Data_time_power_values');
t=Data.Data(:,1);
U=Data.Data(:,2);
t=t(U~=0);
U=U(U~=0);
Nu=length(U);
% plot data
subplot(311), plot(t,U,'-b'); grid on; xlabel('Time')
subplot(312), plot(t(28900:end),U(28900:end),'-b'); grid on; xlabel('Time')
subplot(313), plot(t(29510:29590),U(29510:29590),'-b.'); grid on; xlabel('Time')
The top plot above shows the full-length sequence. It shows repeated periods of relative stability, with a mean value in the range 600 to 900 during each stable segment. The periods of stability are interrupted by rapid changes.
The middle plot above shows the last ~1200 points. It shows rapid changes, then a period of relative stability lasting for about 1000 samples, followed by another rapid decrease.
The bottom plot shows a zoomed in view during a stable period. The signal oscillates with a cycle that alternates between 9 points per cycle and 7 points per cycle. What is going on here? There must be an explanation. It is not random. If you want to simulate this, use what you know, or use what you suspect, about the experiment that produced this data.
Hi William,
Just for your info, the data are real, actual data taken from an operating engine; the fluctuations you mentioned are the power levels this engine operates at during a particular trip.
Many thanks!!
Demosthenis
@Demosthenis Kasastogiannis, Yes, I have no doubt the data are real. So it is engine power during a two hour drive? That makes sense. The oscillations during the periods of stability have a duration of 7 and 9 samples, i.e. 35 to 45 seconds. The regularity of the limit cycle makes me think it is not due to human intervention. I wonder if the oscillations are due to cruise control being active during these periods.
Do you want to make a signal with 500 samples (2500 seconds) to serve as a test signal? I would try to mimic periods of stability interrupted by periods of deceleration and acceleration, as seen in your data. More later.
Exactly, these are power numbers per 5 sec. Many thanks again and waiting for your feedback!!!
What I am trying to do is to simplify (reduce) the number of observations for computational reasons without losing, to the extent possible, any info.
You say you wish "to simplify (reduce) the number of observations for computational reasons without losing, to the extent possible, any info".
I don't know what computations you plan to do with the sequence of engine power values. Therefore I do not know what the computational reasons are for preferring a sequence of length 500 to a sequence of length 30,000. But the high speed of modern computers makes me think that a slightly longer computation time may be preferable to the potential loss of information that will occur if you construct a much shorter sequence. I believe we do not have a good understanding of the process that generates the changes in engine power levels. Therefore any model we make is likely to lose information, and we won't even know what possibly important information we are losing.
The model you proposed, and which I implemented, was that the sequential percent changes are a sequence of independent random numbers with a distribution obtained from the full-length sequence, and with a certain max and min. When we implemented that model, we saw that the power sequence it produced did not look like the actual power sequence.
You could spend a lot of time and effort trying to make a better model. I think that you would be better off using the full length sequence for your subsequent computations.
By the way, I added one more plot to the script. The new plot shows the full-length sequence as a set of offset traces, in order to expand the horizontal axis scale and allow you to see the abrupt changes in power more clearly. The first part of the full-length sequence is at the top, the last part at the bottom. I use different colors to help distinguish traces that overlap. See below.
The latest script is also attached.
Dear William,
It is essential, for the final models, to use no more than 500 data points. Every effort so far took more than 10 days to produce results when all the data were used (a lot of iteration in the final model), with doubtful validity.
"I believe we do not have a good understanding of the process that generates the changes in engine power levels." Answer: for each time, it is the necessary power the engine should produce in order to support speed, auxiliary power needs, etc.; it is just a power profile. There is nothing more to it.
The most important info is A) the power category the engine operates in at each time, i.e. (0-100, 101-200, ...), and B) the change versus the previous level of operation. These are the only two elements ("patterns") that need to be inherited by the new 500-point time series.
William Rose on 29 Aug 2024 (Moved: Voss on 29 Aug 2024)
Ok, very good. Does each power level span 100 points, or do the levels get wider at higher power? We will make a new time series with only 15 levels. We will make preliminary estimates of the state transition matrix and lifetimes of each state. More later.
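The mean lifetime of each state, mentioned above, can be estimated from run lengths in the level sequence. A sketch, assuming a column-vector level sequence V like V1 earlier:

```matlab
% Mean lifetime (in samples) of each level = mean length of runs of that
% level in the column vector V.
runStart=[true; diff(V)~=0]; % logical: start of each run
runLevels=V(runStart); % level of each run
runLengths=diff([find(runStart); length(V)+1]); % length of each run
meanLife=accumarray(runLevels,runLengths,[],@mean); % mean run length per level
```

For a Markov chain this should agree, up to sampling variation, with the lifetimes implied by the transition matrix, 1./(1-diag(transProb)).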


More Answers (1)

Hi Demosthenis,
I understand that you want to generate random numbers based on a few rules/parameters.
The first and second conditions can be met using the `randi` function. For instance, you can generate 50% of the total data size (i.e., 250 of the 500 numbers) within the range of 0-100. The next 25% of the data can be in the range of 101-200, while the remaining numbers fall within the range of 201-1500. For example, to generate the first 100 numbers in the range of 0-100, you can use:
r = randi([0 100],1,100)
However, the third condition is unclear and somewhat confusing. It would be helpful if you could provide more details or elaborate further on it.
Hope this helps!

1 Comment

Thank you very much Amith!! It is Highly appreciated

