You are now following this question
- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.
Lognormal Fitting with the Pore Size and its Probabilities
7 views (last 30 days)
Show older comments
Dear all,
I have some pore sizes and its probabilites. Visually the pore size distribution follows the lognormal distribution, and the particle size distribution usually follows lognormal distribution, as the black line in the figure shows. So I try to fit it to a lognormal distribution with the following code:
Counts = [71 73 53 77 79 141 158 237 360 594 958 1659 3094 7183 19034 40051 125001 143248 164224 168209 172406 167600 160578 146530 124505 90064 67769 42179 21645 1070];
PoreSize = [89716.4 71264.3 56607.2 44964.7 35716.7 28370.8 22535.7 17900.8 14219.1 11294.6 8971.6 7126.4 5660.7 2837.1 2253.6 1790.1 712.6 566.1 449.6 357.2 283.7 225.4 179 142.2 112.9 89.7 71.3 56.6 45 35.7];
probs = Counts / sum(Counts);
ExpOfY = sum(PoreSize.*probs);
Ysqr = PoreSize.^2;
ExpOfYsqr = sum(Ysqr.*probs);
VarOfY = ExpOfYsqr - ExpOfY1^; % variance of the scores tabulated in vectors 1 & 2
Invalid expression. Check for missing or extra characters.
normu = log(ExpOfY/sqrt(1+VarOfY/ExpOfY^2));
norvar = log(1+VarOfY/ExpOfY^2);
norsigma = sqrt(norvar);
dist = makedist('Lognormal','mu',normu1,'sigma',norsigma1);
x = linspace(0.001,200000,2000000); % Use whatever range is appropriate for data
pdfOfx1 = pdf(dist,x);
figure; hold on; plot(PoreSize,Counts);
plot(x,pdfOfx * 6200); % Multiply 6200 just to compare with the original data.
After fitting, the results look wired, comparing with the raw data point distribution. The agreement between the fitting and the original looks quite bed.I have also tried to use the makedist, the results look similar as well. Does anyone know the reason?
Best
18 Comments
Paul
on 8 Jan 2024
Is anyone else seeing that the graphics figures in the answers by @William Rose and @Torsten are small and kind of blurry (at least compared to what I'm used to seeing on the forum)? The plot in the answer of @Muhammad Hassaan Shah is better, but still not as good as usual. I'm using Win10/Edge.
Torsten
on 8 Jan 2024
I don't know whether there is a coincidence, but it happened after I installed Windows 11.
Paul
on 9 Jan 2024
Are there any known compatibility issues between Answers and Windows 11 with respect to display of graphics?
Rena Berman
on 9 Jan 2024
(Answers Dev) @Paul, I will need to investigate. Can you post a screenshot that shows what you are seeing?
Torsten
on 9 Jan 2024
Edited: Torsten
on 9 Jan 2024
Try running one of the codes under "Answers" and submit the result. I guess the plot will be shown larger (as in Muhammad Hassaan Shah's answer).
William Rose
on 9 Jan 2024
Here is a screenshot of what I am seeing (OS=Windows 11, browser=Firefox).
Adam Danz
on 10 Jan 2024
When I put William's comment into edit mode and re-run the code, I see the figure in its expected size.
OS Win 11; Browser: chrome & Edge
I don't have firefox installed so I didn't test that browswer.
Torsten
on 10 Jan 2024
Edited: Torsten
on 10 Jan 2024
It appears in expected size when generating the plot, but after submission, it shows up in the reduzed size. And it remains in reduced size when the answer is opened again. When re-running the code, the plot is again in its expected size. But after submission ...
At least in my case.
William Rose
on 10 Jan 2024
I have the same experience as @Torsten with plots generated in Matlab Answers: the plot gets small when I submit my answer or comment.
the cyclist
on 15 Feb 2024
@Rena Berman, I am seeing the same behavior. (I learned about this thread after posting my own question.) I see the same behavior that @Torsten describes, where the figure briefly appears as if it is going to be the correct size, but then it gets small.
I am running Mac OSX Sonoma.
William Rose
on 15 Feb 2024
@Rena Berman, the shrinking-graph thing still happens to me. If I insert an image, such as a jpeg, it does not get shrunken. If I make a plot using embedded code, it looks reasonable until I press Submit, and then it gets shrunken.
plot(1:10,1:10)
I run Windows 11 and use up-to-date Firefox.
Rena Berman
on 21 Feb 2024
Edited: Rena Berman
on 21 Feb 2024
(Answers Dev) @William Rose, @the cyclist, I have forwarded this on to the appropriate MW developers.
Accepted Answer
William Rose
on 8 Jan 2024
Take the logarithm of the pore sizes, then compute mu and sigma using the log(pore size) vector. The mu, sigma computed that way are what you should pass to makedist('Lognormal',...). Perhaps the rest of your code will work correctly after that.
27 Comments
William Rose
on 8 Jan 2024
% load data
Counts = [71, 73, 53, 77, 79, 141, 158, 237, 360, 594, 958, 1659, 3094, 7183,19034,40051,125001,143248,164224,168209,172406,167600,160578,146530,124505,90064,67769,42179,21645,1070];
PoreSize=[89716,71264,56607,44965,35717,28371,22536,17901,14219,11295, 8972, 7126, 5661, 2837, 2254, 1790, 712.6, 566.1, 449.6, 357.2,283.7, 225.4, 179.0, 142.2, 112.9, 89.7, 71.3, 56.6, 45.0, 35.7];
% plot data
figure; semilogx(PoreSize,Counts,'-k.');
grid on; xlabel('Pore Size (nm)'); ylabel('Count'); hold on
%estimate distribution
N=sum(Counts); % total number of observations
probs = Counts / N;
Y=log(PoreSize);
EY = sum(Y.*probs);
EY2 = sum(Y.*Y.*probs);
VarY = EY2 - EY^2; % variance log(PoreSize)
sigmaY = sqrt(VarY);
dist = makedist('Lognormal','mu',EY,'sigma',sigmaY);
X=sort(PoreSize); % sort PoreSize into ascending order
Xpdf = pdf(dist,X);
CountPred=Xpdf.*X*N/sum(Xpdf.*X);
% add to plot
semilogx(X,CountPred,'-r.');
legend('Measured','Pred')
When computing the predicted counts, I scale the PDF by an "effective bin width" which is equal to X, the PoreSize:
CountPred=Xpdf.*X*N/sum(Xpdf.*X);
This is necessary because this is a lognormal distribution. The scalar factor N/sum(Xpdf.*X) is needed to make the total number of predicted counts equal the total number of measured counts.
Paul
on 9 Jan 2024
Hi William,
I'm quite curious about this problem in how the parameters of the lognormal distribution are estimated. My concern is that the parameters aren't being estimated from the actual data, of which there are N points. Rather, the parameters are being estimated from discretized data, and the data is discretized into bins that, as far as I can tell, aren't even equally spaced in any sense. Is there any theoretical underpinning for estimating distribution parameters from discretized data? It's not clear to me that disrcetizing and then computing the sample mean and variance as below is the same as what we'd get from computing the sample mean and variance from the un-discretized data.
Here is your code for estimating the pdf:
Counts = [71, 73, 53, 77, 79, 141, 158, 237, 360, 594, 958, 1659, 3094, 7183,19034,40051,125001,143248,164224,168209,172406,167600,160578,146530,124505,90064,67769,42179,21645,1070];
PoreSize = [89716,71264,56607,44965,35717,28371,22536,17901,14219,11295, 8972, 7126, 5661, 2837, 2254, 1790, 712.6, 566.1, 449.6, 357.2,283.7, 225.4, 179.0, 142.2, 112.9, 89.7, 71.3, 56.6, 45.0, 35.7];
Counts = fliplr(Counts);
PoreSize = fliplr(PoreSize);
%estimate distribution
N = sum(Counts); % total number of observations
probs = Counts / N;
Y = log(PoreSize);
EY = sum(Y.*probs);
EY2 = sum(Y.*Y.*probs);
VarY = EY2 - EY^2; % variance log(PoreSize)
sigmaY = sqrt(VarY);
dist1 = makedist('Lognormal','mu',EY,'sigma',sigmaY);
Here is my code for estimating the pdf. The difference is that I'm estimating the sample mean and variance from the discretized data and then finding the parameters of the pdf in the lognormal way, while your code is estimating the sample mean and variance of the log of the discretized data directly.
m = sum(PoreSize.*Counts)/N;
v = sum((PoreSize-m).^2.*Counts)/N;
mu = log(m^2/sqrt(v+m^2));
sigma = sqrt(log(v/m^2+1));
dist2 = makedist('LogNormal','mu',mu,'sigma',sigma);
Comparing the pdfs we see two very different solutions
dist1,dist2
dist1 =
LognormalDistribution
Lognormal distribution
mu = 5.55692
sigma = 0.84966
dist2 =
LognormalDistribution
Lognormal distribution
mu = 4.92834
sigma = 1.47884
figure
x = logspace(-2,5,5000);
semilogx(x,pdf(dist1,x),x,pdf(dist2,x)),grid
Despite appearances, both pdfs integrate to unity.
[trapz(x,pdf(dist1,x)) trapz(x,pdf(dist2,x))]
ans = 1×2
1.0000 1.0000
Comparing dist1 and dist2 to the histogram of the data clearly shows that your solution is superior.
edges = mean([PoreSize(2:end);PoreSize(1:end-1)]);
edges = [2*PoreSize(1)-edges(1), edges, 2*PoreSize(end)-edges(end)];
figure
histogram('BinEdges',edges,'BinCounts',Counts,'Normalization','pdf')
hold on
plot(x,pdf(dist1,x),x,pdf(dist2,x)),grid
legend('data','dist1','dist2')
xlim([0 1000])
Here, I'll generate a full set of PoreSize data in accordance with your pdf.
PoreSize = random(dist2,1,N);
Now use the same code as above to estimate the pdf parameters with this data
Counts = ones(1,N);
probs = Counts / N;
Y = log(PoreSize);
EY = sum(Y.*probs);
EY2 = sum(Y.*Y.*probs);
VarY = EY2 - EY^2; % variance log(PoreSize)
sigmaY = sqrt(VarY);
dist1 = makedist('Lognormal','mu',EY,'sigma',sigmaY);
m = sum(PoreSize.*Counts)/N;
v = sum((PoreSize-m).^2.*Counts)/N;
mu = log(m^2/sqrt(v+m^2));
sigma = sqrt(log(v/m^2+1));
dist2 = makedist('Lognormal','mu',mu,'sigma',sigma);
figure
semilogx(x,pdf(dist1,x),x,pdf(dist2,x)),grid
We essentially get the same results.
Any idea why your method works so much better with the discretized data?
William Rose
on 9 Jan 2024
@Paul, the uneven spa ing of the data did defijnitewly throwe me off. The uneven spacing means it is quite an odd data set. It is not linearly or logarithmically spaced. The plot of number of observations vesus pore size looks kind of like a normal distribution when plotted on a log scale, which is why it seems readonable to use a lognormal distribution. HOwever, this plot should not be thought of as representing a histogram, because the uneven bin width has not been taken into account. I realized this when I was trying to use the fitted distribution to predict the observed numbers. #
Please explain the experimental procedure used to obtain the data. The observed pore numbers at sizes 712 and 1740 nm are particularly interesting and somewhat unexpected. There is a large gap (a factor of 2.5) between these two sizes. If each pore size is considered to be the center or the edge of a bin, then the bins at these two sizes (or at least one of them) would have a considerably higher expected count than what was observed. I have been trying to think of an epxerimental procedure that could generate data like this from a lognormal distribution of pore sizes. So far, I am failing. Thanks.
Paul
on 9 Jan 2024
What really threw me off what that the data had to be fliplr'd, at least for trapz to work as it should.
William Rose
on 9 Jan 2024
The parameter estimation formulas which I used give the maximum likelihood estimator (MLE) for the lognormal distribution. The MLE is (see here)
The estimates I computed are equivalent to formulas above*, if you assume that there were n1 samples with value x1, and n2 samples with value x2, and so on.
You have used the method of moments(MoM): choose mu and sigma so that the moments of the fitted distribution match the experimentally measured moments. For a normal distribution, the MLE and the MoM give the same formulas for the estimates of mu and sigma. For a lognormal distribution, the MLE and MoM estimates are not the same.
@Paul, you also asked "Is there any theoretical underpinning for estimating distribution parameters from discretized data?". Yes there is. For example, the Kolmogorov-Smirnov test, and the Anderson-Darling test, for deciding if a sample came from a specified distribution, are well suited to working with binned data. When doing the K-S test, it is standard to use the MLE parameters, if the parameters must be estimated. I believe the A-D test for the lognormal distributiuon also uses the MLE parameters (see here, p.3, lower left column).
I am certainly not a statistician, so everything I say is based on my student-level understanding of the topics.
Paul
on 10 Jan 2024
Edited: Paul
on 10 Jan 2024
Yes, the MLE and MOM estimates won't be the same, but I think they'll be pretty close when applied to non-discretized data that actually comes from the assumed, continuous distribution, which is what I showed above.
To be clear, my concern is that by discretizing the data there must be some loss of information (musn't there?) that would compromise the parameter estimation for the continuous distribution. It turns out that for one example my concern is not well founded when the discretized data is the result of discretizing the samples from the actual distribution to be estimated.
Here's the example.
Start with the same data in this question
Counts = [71, 73, 53, 77, 79, 141, 158, 237, 360, 594, 958, 1659, 3094, 7183,19034,40051,125001,143248,164224,168209,172406,167600,160578,146530,124505,90064,67769,42179,21645,1070];
PoreSize = [89716,71264,56607,44965,35717,28371,22536,17901,14219,11295, 8972, 7126, 5661, 2837, 2254, 1790, 712.6, 566.1, 449.6, 357.2,283.7, 225.4, 179.0, 142.2, 112.9, 89.7, 71.3, 56.6, 45.0, 35.7];
Counts = fliplr(Counts);
PoreSize = fliplr(PoreSize);
Make the distribution from your procedure as implemented above.
dist1 = makedist('LogNormal','mu',5.55692,'sigma',0.84966);
Generate random data from this actual distribution.
N = sum(Counts);
rng(100);
X = random(dist1,1,N);
Discretize the data assuming that the PoreSise elements define the bin centers and the edges are linearly halfway between the centers (maybe it would be better to assume halfway in a log sense?)
edges = mean([PoreSize(2:end);PoreSize(1:end-1)]);
edges = [2*PoreSize(1)-edges(1), edges, 2*PoreSize(end)-edges(end)];
bincounts = histcounts(X,edges);
Compare the discretized counts from the actual distribution to the Counts data from which dist1 as derived
figure
semilogx(PoreSize,Counts,'-o',PoreSize,bincounts,'-x'),grid
xline(edges)
The discretized counts from sampling dist1 is actually a pretty good match to Counts, execpt for PoreSize = 712, which seems to be the center of the widest bin, probably not a coincidence.
MLE parameter estimates based on the dist1, discretized samples (red points)
probs = bincounts / N;
Y = log(PoreSize);
EY = sum(Y.*probs);
EY2 = sum(Y.*Y.*probs);
VarY = EY2 - EY^2; % variance log(PoreSize)
sigmaY = sqrt(VarY);
dist2 = makedist('LogNormal','mu',EY,'sigma',sigmaY);
MOM parameter estimates based on dist1, discretized samples (red points)
m = sum(PoreSize.*bincounts)/N;
v = sum((PoreSize-m).^2.*bincounts)/N;
mu = log(m^2/sqrt(v+m^2));
sigma = sqrt(log(v/m^2+1));
dist3 = makedist('LogNormal','mu',mu,'sigma',sigma);
Compare the true and estimated pdfs.
x = logspace(-2,5,5000);
figure
semilogx(x,pdf(dist1,x),x,pdf(dist2,x),x,pdf(dist3,x)),grid
For this case, the MLE and MOM both work pretty well when the discretized data is derived from truly log-normal samples.
But for the original Counts data, the MLE and MOM approaches yield very different results, and I'm not sure why that should be the case. I'm sure there's a reason ....
William Rose
on 10 Jan 2024
@Paul, thanks for demonstrating that the MLE and MOM approaches to lognormal parameter estimation give similar results when applied to data that is actually from a lognormal distribnution, whether it is binned or not, and thank you for for showing that the MLE and MOM parameter estimates are quite different for the observed data in this example.
As you note, the predicted number of counts at pore size=712 um is much larger than the observed number. The large predicted number is due to the large binwidth at 712. The observed counts at 712 um and at 1790 um (which is also a wide bin, if you assume the bounds are at the midpoints) are lower than you'd expect, if the distribution is actually lognormal. Thus the counts at 712 um and maybe at 1790 um are outliers, in the sense of being highly improbable , if the data is a random sample from a lognormal distribution. The plot of the probability density shows this, sort of:
Counts = [71, 73, 53, 77, 79, 141, 158, 237, 360, 594, 958, 1659, 3094, 7183,19034,40051,125001,143248,164224,168209,172406,167600,160578,146530,124505,90064,67769,42179,21645,1070];
PoreSize = [89716,71264,56607,44965,35717,28371,22536,17901,14219,11295, 8972, 7126, 5661, 2837, 2254, 1790, 712.6, 566.1, 449.6, 357.2,283.7, 225.4, 179.0, 142.2, 112.9, 89.7, 71.3, 56.6, 45.0, 35.7];
Counts = fliplr(Counts); PoreSize = fliplr(PoreSize);
Y=log(PoreSize);
binEdge=(Y(1:end-1)+Y(2:end))/2; % interior bin edges, at the midpoints
binWidth=diff(binEdge); % interior bin widths
% add bin widths at ends, same as nearest neighbors
binWidth=[binWidth(1),binWidth,binWidth(end)];
% Estimate density. Will be normal, if PoreSize is lognormal.
pdf=(Counts/sum(Counts))./binWidth;
plot(Y,pdf,'-rx'); grid on; xlabel('Y=log(PoreSize)'); ylabel('PDF')
text(Y(14),pdf(14),'\leftarrow 712 \mum');
text(Y(15),pdf(15),'\leftarrow 1790 \mum');
I understand that different methods of parameter estimation have different outlier sensitivity. It seems that the MOM is more sensitive to outliers than the MLE, for this distribution.
Emily
on 18 Jan 2024
I am reading this post after @Torsten linked it on my question here. I have a question about how you calculate mu and sigma from the data. I was getting values that did not make sense when simply taking mean(log(data)) and std(log(data)), and I am wondering if you can explain the logic behind the following lines:
%estimate distribution
N=sum(Counts); % total number of observations
probs = Counts / N;
Y=log(PoreSize);
EY = sum(Y.*probs);
EY2 = sum(Y.*Y.*probs);
VarY = EY2 - EY^2; % variance log(PoreSize)
sigmaY = sqrt(VarY);
dist = makedist('Lognormal','mu',EY,'sigma',sigmaY);
X=sort(PoreSize); % sort PoreSize into ascending order
Xpdf = pdf(dist,X);
CountPred=Xpdf.*X*N/sum(Xpdf.*X);
If there is any source I could read more from that explains these steps, that would be helpful. I am just not understanding the calculation of mu (EY), and though I did read the explanation of the step of scaling the PDF to get correct predicted counts, I am still a bit confused by the logic behind that step. Thank you in advance for any help! I have tried to find sources online for calculating these parameters but haven't had much luck until this thread.
William Rose
on 18 Jan 2024
The code above is for a person who does not have the raw data - they only have a histogram of the data. Does that apply to you? I looked at your other post. It is not obvious to me if you were fitting raw data, or a histogram of the data.
To understand the code above, remember that the raw data (pore sizes) are denoted by X, and Y=log(X). I converted the histogram of X to a histogram of Y. The code above, which you asked about, estimates and . This approach is the maximum likelihood estimator (MLE) for mu and sigma.
@Paul uses a different approach. He computes and var(X). Then he computes the mu and sigma that are consistent with and . This approach is the method of moments estimator for mu and sigma.
If the data are really from a lognormal distribution, the MOM and MLE estimates of mu and sigma will match, or be very close. But if the data contain outliers, or are not from a lognormal distribution, then the MOM estimates and MLE estimates can be quite different, and that is what we observed for the data set in this question.
William Rose
on 18 Jan 2024
@Emily, you say "I was getting values that did not make sense when simply taking mean(log(data)) and std(log(data))". Thus it appears that you are estimating mu and sigma by the MLE method.
My code above, which you asked abut, implements the MLE method, for data in histogram form. The MLE estimates of mu and sigma produced a PDF that matched the data set above better than the PDF based on the method of moments (MOM) estimates of mu and sigma.
Therefore I am somewhat surprised that you obtained non-sensical values "when simply taking mean(log(data)) and std(log(data))". If you wish to share your data and code, feel free. Or send a secure email, by clicking on the envelope icon that appears when you click the WR icon next to my user name.
Torsten
on 18 Jan 2024
Edited: Torsten
on 18 Jan 2024
If you have count measurements of different pore sizes or probabilities for different pore sizes, you can use MATLAB's "mle" function to compute mu and sigma as maximum likelihood estimates of the underlying lognormal distribution.
Note that these mu and sigma are not mean and standard deviation of the distribution, but the parameters mu and sigma in the representation of the probability density function of the lognormal distribution:
and
William Rose
on 18 Jan 2024
Edited: William Rose
on 18 Jan 2024
@Emily, another potential source of error (ask me how I know) when comparing a fitted lognormal distribution to the data involves inconsistent use of X and Y (where Y=log(X)) and their respective PDFs. The pdf of Y is a normal curve. If you plot pdf(Y) versus X, or pdf(X) versus Y, you'll get the wrong results. Plotting X on a log axis (with a command like semilogx(X,pdfY)) does not fix the problem.
Emily
on 18 Jan 2024
Edited: Emily
on 18 Jan 2024
@William Rose I am fitting raw data - an observed aerosol size distribution. One variable is the count of number of aerosols within each "bin" (osmoke_num_dist_norm), and the other variable designates the central diameter of each bin (odiameters2). The only changes I have made from the raw data is to create equal bin spacing across the diameters variable that is then applied to the bin concentrations, and then I normalized the bin concentrations (each individual bin concentration divided by the sum of all bin concentrations). When I make a semilogx plot with odiameters2 as the x-variable and osmoke_num_dist_norm as the y-variable, it resembles a lognormal PDF. I am trying to calculate the mu and sigma from these observations to create a new lognormal PDF that extends past the current min/max of the observations.
In the code below I use two methods to try to calculate the mu and sigma from the raw data. Both methods give a mu = -10.3954 and sigma = 1.8516. I know from guessing/checking that to plot a fitted distribution that matches the observations the values should be mu = -1.8 and sigma = 0.39. I am including the plot that comes out as a result ("calculated" and "lognfit" come up with the same solutions, so they overlap in the plot).
osmoke_num_dist_norm = load('osmoke_num_dist_norm.mat'); osmoke_num_dist_norm = osmoke_num_dist_norm.osmoke_num_dist_norm;
odiameters2 = load('odiameters2.mat'); odiameters2 = odiameters2.odiameters2;
% calculate mu and sigma from the data
ln_x = log(osmoke_num_dist_norm);
mu1 = mean(ln_x);
sig1 = std(ln_x);
% use lognfit to get mu and sigma
pHat = lognfit(osmoke_num_dist_norm);
mu2 = pHat(1,1);
sig2 = pHat(1,2);
% create distributions from both mu/sigma versions
xvalues = 0:0.02:1E02;
pd1 = makedist('Lognormal','mu',mu1,'sigma',sig1);
y1 = pdf(pd1,xvalues);
y1 = y1./sum(y1); % normalize the fitted distribution to match observed
pd2 = makedist('Lognormal','mu',mu2,'sigma',sig2);
y2 = pdf(pd2,xvalues);
y2 = y2./sum(y2); % normalize the fitted distribution to match observed
semilogx(xvalues,y1,'LineWidth',2)
hold on
semilogx(xvalues,y2,'LineWidth',2)
hold on
semilogx(odiameters2,osmoke_num_dist_norm,'LineWidth',2)
legend('calculated','lognfit','data')
I feel like I must be putting the data into the equations incorrectly somehow? Or maybe I'm just confused about something really simple? Sorry in advance if this turns out to be a really stupid error on my end, I just cannot seem to figure it out to save my life.
William Rose
on 18 Jan 2024
Edited: William Rose
on 18 Jan 2024
I will look at this tomorrow or Sunday.
ln_x = log(osmoke_num_dist_norm);
That is an error. osmoke_num_dist_norm is the observed density or probability, corresponding to pdf(X), if the bin sizes have equal width. You should not take the log of the pdf(X). You should take the log of the corresponding diameters, which correspond to X: Y=log(X)=log(diameters). Fix that, adjust the rest of the code as needed, and see what happens.
Torsten
on 18 Jan 2024
Edited: Torsten
on 18 Jan 2024
osmoke_num_dist_norm = load('osmoke_num_dist_norm.mat'); osmoke_num_dist_norm = osmoke_num_dist_norm.osmoke_num_dist_norm;
osmoke_num_dist_norm = osmoke_num_dist_norm*1e7;
odiameters2 = load('odiameters2.mat'); odiameters2 = odiameters2.odiameters2;
diams = [];
for i = 1:numel(osmoke_num_dist_norm)
diams = [diams,odiameters2(i)*ones(1,round(osmoke_num_dist_norm(i)))];
end
phat = mle(diams,'Distribution','LogNormal')
phat = 1×2
-1.5349 0.6892
William Rose
on 19 Jan 2024
Load data:
load osmoke_num_dist_norm
prob=osmoke_num_dist_norm;
load odiameters2
X=odiameters2;
Y=log(X);
Estimate mu and sigma. Call the estimates = muHat, sigmaHat, as a reminder that these are only estimates of true, but unknown, parameters.
muHat=sum(prob.*Y);
sigmaHat=sqrt(sum(prob.*Y.^2)-muHat^2);
Display results.
fprintf('mu=%.3f, sigma=%.3f.\n',muHat,sigmaHat)
mu=-1.535, sigma=0.689.
Make probability distribution object
pd1 = makedist('Lognormal','mu',muHat,'sigma',sigmaHat);
pdfEst = pdf(pd1,X);
Plot results. The values in vector prob are not a density. To convert the values to a density, for plotting versus the fitted denisty, we must divide each value in prob by the corresponding bin width. The bin widths in your data are all the same. (The bin widths were irregular in @Alex Lee's data, which made things more complicated.)
binwid=mean(diff(X)); % mean bin width; widths are all the same
pdfX=prob/binwid; % probability density for observed data
semilogx(X,pdfX,'r.',X,pdfEst,'-b');
xlabel('X=particle size'); ylabel('Probability denisity');
grid on; legend('Data','Fitted Lognormal Dist.')
It's not the greatest fit ever, but it's not totally terrible. The data show some significant deviation from log-normality. For example, note the data (red dots) above the blue line from X=0.8 to X=1.0. Furthermore, we cannot see, with the plot above, what's going on in the tail region. A log-log plot will help us see how the fitted distribution compares to the experimental data in the tail.
figure; loglog(X,pdfX,'r.',X,pdfEst,'-b');
xlabel('X=particle size'); ylabel('Probability denisity');
grid on; legend('Data','Fitted Lognormal Dist.')
The log-log plot helps me understand the fit. The first plot (semilogx plot) made me wonder if you could get a better fit by making sigma smaller. This would have made the blue line come closer to the red dots, at least up to about X=10^0. But the second plot (loglog plot) shows that if sigma were smaller, the mismatch between data and fit would be even worse than it is, for particle sizes > 10^0.
William Rose
on 19 Jan 2024
@Torsten, good point. There's something to be said for the optical (chi-by-eye) fit. Although, if you plot the PDF for mu,sigma=-1.8,0.39 on a log-log plot, you might think again.
The mu,sigma I got are the MLE, if the data are from a lognormal distibution. If this data is not from a log-normal distribution, then maybe we don't want the MLE. (For @Alex Lee's data, the MLE gave a better optical result than the MOM estimate.)
We could ask "what mu and sigma give a lognormal density that is the best approximation to the observed data?" How should we define "best approximation"? A. The chi-squared goodness of fit statistic. B. The squared differences between observed and measured counts in each cell. C. Something else. Minimizing A or B may give different estimates for mu, sigma than the MLE estimates. I suspect that B will give a mu, sigma whose PDF is more optically pleasing (on a plot with a linear y-axis for the PDF) than the MLE, because B will not be influenced by the data above X=10^0 the way the MLE is. I think statisticians would frown on B.
William Rose
on 19 Jan 2024
Edited: William Rose
on 19 Jan 2024
Someone (@Emily) should compute mu,sigma using the method of moments (MOM). Maybe that will yield a more optically pleasing PDF. The MOM is widely used in the stats world, unlike method B above. And it's easy to add to the code above.
I would add lines such as the following. I have not checked the code below by running it, so you must check my formulas and code carefully, and verify.
EX = sum(prob.*X); % E(X)
VX = sum(prob.*X.^2)-EX^2; % var(X)
muhat2 = log(EX^2/sqrt(VX+EX^2)); % MOM estimate for mu
sigmahat2 = sqrt(log(VX/EX^2+1)); % MOM estimate for sigma
fprintf('Method of moments: mu=%.3f, sigma=%.3f.\n',muhat2,sigmahat2)
pd2 = makedist('Lognormal','mu',muhat2,'sigma',sigmahat2);
pdfMOM = pdf(pd2,X);
Then you add the plot of pdfMOM versus X to the existing plot commands.
Torsten
on 19 Jan 2024
Edited: Torsten
on 19 Jan 2024
load osmoke_num_dist_norm
prob=osmoke_num_dist_norm;
load odiameters2
X=odiameters2;
%binwid=mean(diff(X)); % mean bin width; widths are all the same
%pdfX=prob/binwid; % probability density for observed data
pdfX = zeros(1,numel(X));
pdfX(1) = prob(1)/(0.5*(X(2)-X(1)));
pdfX(2:end-1) = prob(2:end-1)./(0.5*(X(2:end-1)-X(1:end-2))+0.5*(X(3:end)-X(2:end-1)));
pdfX(end) = prob(end)/(0.5*(X(end)-X(end-1)));
trapz(X,pdfX)
ans = 1.0000
%MLE
Y=log(X);
muHat=sum(prob.*Y);
sigmaHat=sqrt(sum(prob.*Y.^2)-muHat^2);
fprintf('Maximum Likelihood: mu=%.3f, sigma=%.3f.\n',muHat,sigmaHat)
Maximum Likelihood: mu=-1.535, sigma=0.689.
pd1 = makedist('Lognormal','mu',muHat,'sigma',sigmaHat);
pdfEst = pdf(pd1,X);
%MOM
EX = sum(prob.*X); % E(X)
VX = sum(prob.*X.^2)-EX^2; % var(X)
muhat2 = log(EX^2/sqrt(VX+EX^2)); % MOM estimate for mu
sigmahat2 = sqrt(log(VX/EX^2+1)); % MOM estimate for sigma
fprintf('Method of moments: mu=%.3f, sigma=%.3f.\n',muhat2,sigmahat2)
Method of moments: mu=-2.015, sigma=1.334.
pd2 = makedist('Lognormal','mu',muhat2,'sigma',sigmahat2);
pdfMOM = pdf(pd2,X);
%Function fitting
fun = @lognpdf;
F = @(x,xdata)fun(xdata,x(1),x(2));
ms = lsqcurvefit(F,[-1.8,0.39],X,pdfX);
Local minimum possible.
lsqcurvefit stopped because the final change in the sum of squares relative to
its initial value is less than the value of the function tolerance.
muhat3 = ms(1);
sigmahat3 = ms(2);
fprintf('Function fitting: mu=%.3f, sigma=%.3f.\n',muhat3,sigmahat3)
Function fitting: mu=-1.706, sigma=0.426.
pd3 = makedist('Lognormal','mu',muhat3,'sigma',sigmahat3);
pdfFF = pdf(pd3,X);
%Plot results
semilogx(X,pdfX,'r.',X,pdfEst,'-b',X,pdfMOM,'-g',X,pdfFF,'-c');
xlabel('X=particle size'); ylabel('Probability density');
grid on; legend('Data','Maximum Likelihood','Method of Moments','Function Fitting')
figure; loglog(X,pdfX,'r.',X,pdfEst,'-b',X,pdfMOM,'-g',X,pdfFF,'-c');
xlabel('X=particle size'); ylabel('Probability density');
grid on; legend('Data','Maximum Likelihood','Method of Moments','Function Fitting')
Emily
on 19 Jan 2024
@Paul @William Rose @Torsten Thank you all for your help! I understand where I was going wrong in terms of taking the log of the wrong variable, and I appreciate the insights into the different methods and considerations I can use here. Thank you for your thorough answers/help, I will be going through all of the above to better understand what I'm doing!
William Rose
on 19 Jan 2024
@Torsten has done something very clever, again. Look at how good the cyan line fit looks in his plot above. He used Matlab's lsqcurvefit() to find the mu and sigma that gives the n=best fit, in a least-squares sense, to the data. This is rather similar to my proposal earlier, called alternative B, to minimize the sum squared error between observed and predicted counts in each bin. And it looks really good, in the semilog plot, as I predicted.
Torsten
on 19 Jan 2024
He used Matlab's lsqcurvefit() to find the mu and sigma that gives the n=best fit, in a least-squares sense, to the data.
Although it will be difficult to find a statistical justification to use function fitting methods in distribution fitting ...
William Rose
on 19 Jan 2024
More Answers (2)
Hassaan
on 8 Jan 2024
Edited: Hassaan
on 8 Jan 2024
% Given Data
Counts = [71 73 53 77 79 141 158 237 360 594 958 1659 3094 7183 19034 40051 125001 143248 164224 168209 172406 167600 160578 146530 124505 90064 67769 42179 21645 1070];
PoreSize = [89716.4 71264.3 56607.2 44964.7 35716.7 28370.8 22535.7 17900.8 14219.1 11294.6 8971.6 7126.4 5660.7 2837.1 2253.6 1790.1 712.6 566.1 449.6 357.2 283.7 225.4 179 142.2 112.9 89.7 71.3 56.6 45 35.7];
probs = Counts / sum(Counts);
% Calculating Expected Values and Variance
ExpOfY = sum(PoreSize.*probs);
Ysqr = PoreSize.^2;
ExpOfYsqr = sum(Ysqr.*probs);
VarOfY = ExpOfYsqr - ExpOfY^2; % Corrected the typo here
% Calculating the parameters for the lognormal distribution
normu = log(ExpOfY/sqrt(1+VarOfY/ExpOfY^2));
norvar = log(1+VarOfY/ExpOfY^2);
norsigma = sqrt(norvar);
% Create a lognormal distribution object
dist = makedist('Lognormal','mu',normu,'sigma',norsigma);
% Define the range for your data (adjust as needed)
x = linspace(min(PoreSize), max(PoreSize), 1000);
% Calculate the PDF
pdfOfx = pdf(dist,x);
% Plotting the data
figure; hold on;
scatter(PoreSize,Counts, 'b'); % Scatter plot of the original data
plot(x,pdfOfx * max(Counts), 'r'); % Plot of the fitted distribution
xlabel('Pore Size');
ylabel('Counts');
title('Pore Size Distribution with Lognormal Fit');
legend('Data','Lognormal Fit');
hold off;
- Syntax and Typographical Errors: Ensure that all variable names and operations are correct. The expression VarOfY = ExpOfYsqr - ExpOfY1^; in your original code seems to contain errors. It's corrected here.
- Visualizing Data: I've used a scatter plot for the original data and a line plot for the fitted distribution. Adjust the scaling factor (max(Counts)) as needed to better compare the two.
- Data Range: The range of x in the line x = linspace(min(PoreSize), max(PoreSize), 1000); is set from the minimum to the maximum of your pore sizes. Adjust this as necessary.
- Distribution Parameters: The 'mu' and 'sigma' for a lognormal distribution are derived from the logarithm of your data. Ensure these are calculated correctly.
- Fitting and Model Choice: If the fit is still poor, consider alternative fitting methods or distributions. The fit quality can vary significantly based on the characteristics of your data.
---------------------------------------------------------------------------------------------------------------------------------------------------
If you find the solution helpful and it resolves your issue, it would be greatly appreciated if you could accept the answer. Also, leaving an upvote and a comment are also wonderful ways to provide feedback.
Professional Interests
- Technical Services and Consulting
- Embedded Systems | Firmware Developement | Simulations
- Electrical and Electronics Engineering
1 Comment
Torsten
on 8 Jan 2024
Edited: Torsten
on 8 Jan 2024
Counts = [71 73 53 77 79 141 158 237 360 594 958 1659 3094 7183 19034 40051 125001 143248 164224 168209 172406 167600 160578 146530 124505 90064 67769 42179 21645 1070];
PoreSize = [89716.4 71264.3 56607.2 44964.7 35716.7 28370.8 22535.7 17900.8 14219.1 11294.6 8971.6 7126.4 5660.7 2837.1 2253.6 1790.1 712.6 566.1 449.6 357.2 283.7 225.4 179 142.2 112.9 89.7 71.3 56.6 45 35.7];
data = [];
for i = 1:numel(Counts)
data = [data,PoreSize(i)*ones(1,Counts(i))];
end
phat = mle(data,'Distribution','LogNormal')
phat = 1×2
5.5569 0.8497
x = linspace(0,2000,1000);
hold on
plot(x,lognpdf(x,phat(1),phat(2)),'g')
histogram(data,'Normalization','pdf')
xlim([0 2000])
probs = Counts / sum(Counts);
ExpOfY = sum(PoreSize.*probs);
Ysqr = PoreSize.^2;
ExpOfYsqr = sum(Ysqr.*probs);
VarOfY = ExpOfYsqr - ExpOfY^2; % variance of the scores tabulated in vectors 1 & 2
normu = log(ExpOfY^2/sqrt(VarOfY+ExpOfY^2))
normu = 4.9283
norvar = log(1+VarOfY/ExpOfY^2);
norsigma = sqrt(norvar)
norsigma = 1.4788
plot(x,lognpdf(x,normu,norsigma),'b')
hold off
1 Comment
See Also
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!An Error Occurred
Unable to complete the action because of changes made to the page. Reload the page to see its updated state.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)