PolarSPARC |
Introduction to Statistics - Part 3
Bhaskar S | 07/18/2021 |
In Part 2 of the series, we introduced the concepts around the various probability distributions, namely, the Binomial, the Poisson, and the Normal probability distributions.
In this part of the series, we will delve into the world of Inferential Statistics, which focuses on analyzing sample data to arrive at a conclusion about the population. In particular, we will look at the Central Limit Theorem, Point Estimation, and the Confidence Interval.
Basic Definitions - I
A Sampling Distribution is the probability distribution of a sample statistic (such as the mean, variance, etc) that is formed when samples of size n are repeatedly taken from a population.
Each of the measurements, such as the mean, the standard deviation, etc., from the population is referred to as a Parameter.
Each of the measurements, such as the mean, the standard deviation, etc., from the sample is referred to as a Statistic.
Central Limit Theorem
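The Central Limit Theorem is one of the core foundations for inferential statistics as it provides information about the sample mean $\bar{x}$ that can be used to draw conclusions about the population mean $\mu$. It states the following:
For any population that is NOT normally distributed, the sampling distribution of the sample means is approximately normal when the sample size n is large (a common rule of thumb is n ≥ 30)
For a normally distributed population, the sampling distribution of sample means is normally distributed for any sample size n
The mean of the sample means equals the population mean, that is, $\mu_{\bar{x}} = \mu$
The standard deviation of the sample means is $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, where $\sigma$ is the population standard deviation
The standard deviation of the sample means $\sigma_{\bar{x}}$ is also called the Standard Error of the mean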
To get a better intuition and understanding of the Central Limit Theorem, we will perform some experiments using one year's worth of market close stock data for the symbol UNP.
The following illustration shows the first five entries of market close data for the symbol UNP:
The following illustration shows the last five entries of market close data for the symbol UNP:
The following illustration shows the summary statistics on the Close price for the symbol UNP:
The following illustration shows the distribution of the Close price for the symbol UNP, which is not normally distributed:
The following illustration shows the sampling distribution for 10 sets of samples of size 5 on the Close price for the symbol UNP:
The following illustration shows the sampling distribution for 10 sets of samples of size 15 on the Close price for the symbol UNP:
The following illustration shows the sampling distribution for 100 sets of samples of size 50 on the Close price for the symbol UNP:
The following illustration shows the sampling distribution for 100 sets of samples of size 100 on the Close price for the symbol UNP:
As is evident from the illustrations above, as the sample size increases, the sampling distribution of the sample means becomes approximately normal, and the mean of the sampling distribution approaches the population mean $\mu$.
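The experiment above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the illustrations: the UNP data is not reproduced here, so a synthetic right-skewed population stands in for the Close prices, and the helper name `sample_means`, the seed, and the sample counts are assumptions chosen for the example. Samples are drawn with replacement.

```python
import math
import random
import statistics


def sample_means(population, sample_size, num_samples, rng):
    """Return the means of num_samples random samples drawn with replacement."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the Close prices: a right-skewed (non-normal) population
population = [150.0 + rng.expovariate(1 / 50.0) for _ in range(10_000)]
mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

print(f"population mean={mu:.3f}  population std={sigma:.3f}")
for n in (5, 15, 50, 100):
    means = sample_means(population, n, 2_000, rng)
    print(
        f"n={n:3d}  mean of sample means={statistics.fmean(means):8.3f}  "
        f"std of sample means={statistics.stdev(means):6.3f}  "
        f"sigma/sqrt(n)={sigma / math.sqrt(n):6.3f}"
    )
```

For each sample size, the mean of the sample means should stay close to the population mean, while the spread of the sample means should shrink roughly in line with $\sigma / \sqrt{n}$.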
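In Part 2, we learnt how to find the probability that a normally distributed random variable x lies in an interval by calculating the area under the normal curve using the Z-score. Similarly, we can find the probability that a sample mean $\bar{x}$ lies in an interval of the sampling distribution, using the Z-score $z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$.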
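Example-1 | A study of a fishing lake found that the length of a trout taken at random from the lake has a normal distribution with mean $\mu = 10.2$ inches and standard deviation $\sigma = 1.4$ inches. What is the probability that the mean length $\bar{x}$ of 5 trout taken at random is between 8 and 12 inches? |
---|---|
From the Central Limit Theorem, we know $\mu_{\bar{x}} = \mu = 10.2$. Also, from the Central Limit Theorem, $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 1.4 / \sqrt{5} \approx 0.63$. To find the probability that the sample mean $\bar{x}$ lies between 8 and 12 inches, we convert both endpoints of the interval to Z-scores and look up the area under the normal curve to the left of each. (A) The first endpoint gives $z = (8 - 10.2) / 0.63 \approx -3.49$, with an area of 0.0002. (B) The second endpoint gives $z = (12 - 10.2) / 0.63 \approx 2.86$, with an area of 0.9979. To find the area between z = -3.49 and z = 2.86, we need to subtract (A) from (B). That is, 0.9979 - 0.0002 = 0.9977. Therefore, the probability of the sample mean length (using a sample size of 5) to be between 8 and 12 inches is 0.9977. |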
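The calculation in Example-1 can be checked with Python's standard library. The sketch below assumes $\mu = 10.2$ inches and $\sigma = 1.4$ inches, the values consistent with the Z-scores -3.49 and 2.86 quoted in the solution. The function name `prob_sample_mean_between` is a hypothetical helper introduced for illustration.

```python
import math
from statistics import NormalDist


def prob_sample_mean_between(low, high, mu, sigma, n):
    """P(low <= sample mean <= high) for samples of size n from a normal population."""
    standard_error = sigma / math.sqrt(n)           # sigma / sqrt(n)
    sampling_dist = NormalDist(mu, standard_error)  # distribution of the sample mean
    return sampling_dist.cdf(high) - sampling_dist.cdf(low)


# Assumed inputs for Example-1: mu = 10.2, sigma = 1.4, n = 5, interval 8 to 12 inches
p = prob_sample_mean_between(8, 12, mu=10.2, sigma=1.4, n=5)
print(f"P(8 <= sample mean <= 12) = {p:.4f}")
```

The result should agree closely with the table-based answer of 0.9977. Small differences in the last digit can arise because the worked solution rounds the standard error and the Z-scores to two decimal places.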
Basic Definitions - II
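A Point Estimate is a single value estimate for a population parameter. The sample mean $\bar{x}$ is the point estimate of the population mean $\mu$.
Using the sample mean $\bar{x}$ to estimate the population mean $\mu$, the Margin of Error is the magnitude of the difference between the two, that is, $|\bar{x} - \mu|$.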
An Interval Estimate is an interval range that is used to estimate a population parameter. To create an interval estimate, use the point estimate as the center of the interval, then add and subtract a margin of error from the point estimate.
A Confidence Level denoted by c, with any value between 0 and 1 (typically 0.90, 0.95, or 0.99) is the probability that the interval estimate contains the population parameter assuming that the estimation process is repeated a large number of times. A confidence level is sometimes referred to as the Degree of Confidence OR the Confidence Coefficient.
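For a confidence level of c, the Critical Value $z_c$ is the value such that the area under the standard normal curve between $-z_c$ and $z_c$ equals c.
The following illustration shows the critical values ($-z_c$ and $z_c$) for a confidence level c:
We know the margin of error (also called the Error Tolerance) E = $|\bar{x} - \mu|$.
We also know the Z-score z = $\frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, which lies between $-z_c$ and $z_c$ with probability c.
For a given confidence level denoted by c and given the population standard deviation $\sigma$, the margin of error is therefore at most E = $z_c \frac{\sigma}{\sqrt{n}}$.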
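As a rough sketch of how the critical value and the margin of error could be computed, the following uses the standard formula $E = z_c \frac{\sigma}{\sqrt{n}}$, with $z_c$ taken as the point that leaves an area of $(1 + c) / 2$ to its left under the standard normal curve. The function names `critical_value` and `margin_of_error` are illustrative assumptions.

```python
import math
from statistics import NormalDist


def critical_value(c):
    """Return z_c such that the area between -z_c and z_c is c (standard normal)."""
    return NormalDist().inv_cdf((1 + c) / 2)


def margin_of_error(c, sigma, n):
    """Return E = z_c * sigma / sqrt(n) for a known population standard deviation."""
    return critical_value(c) * sigma / math.sqrt(n)


# Critical values for the typical confidence levels
for c in (0.90, 0.95, 0.99):
    print(f"c={c:.2f}  z_c={critical_value(c):.3f}")
```

Using `inv_cdf` avoids a table lookup, so the critical values carry more decimal places than the two-decimal values found in a printed Z-table.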
Confidence Interval
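Now that we know about the point estimate and the margin of error, one can construct an interval estimate for the population parameter $\mu$. The Confidence Interval for the population mean $\mu$ at a confidence level c is $\bar{x} - E < \mu < \bar{x} + E$.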
The main requirements for the sample distribution are similar to the central limit theorem, that is:
For any population that is not normally distributed, the sample size n must be large
For a normally distributed population, any sample size n is sufficient
The sample mean
Example-2 | John jogs 2 miles per day. The standard deviation of his times is 1.80 minutes. During the past year, John has recorded his times to run 2 miles. He has a random sample of 90 of these times. For these 90 times, the mean was 15.60 minutes. Find a 0.95 confidence interval for the mean jogging time for the entire distribution. |
---|---|
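Given the large sample size n = 90, the sampling distribution of the sample mean is approximately normal. Also, we are given the population standard deviation $\sigma = 1.80$ and the sample mean $\bar{x} = 15.60$. For c = 0.95, the area under the normal curve to the left of the critical value $z_c$ is $(1 + 0.95) / 2 = 0.975$, which gives $z_c = 1.96$. We know the margin of error E = $z_c \frac{\sigma}{\sqrt{n}} = 1.96 \times \frac{1.80}{\sqrt{90}} \approx 0.37$. We know the confidence interval for the population mean $\mu$ is $\bar{x} - E < \mu < \bar{x} + E$. That is, $15.60 - 0.37 < \mu < 15.60 + 0.37$. Therefore, we can conclude with 95% confidence that the mean jogging time lies in the interval from 15.23 mins to 15.97 mins |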
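The steps of Example-2 can be reproduced with a short Python sketch, assuming a known population standard deviation and a sample large enough for the normal approximation. The function name `confidence_interval` is a hypothetical helper, not something from the original article.

```python
import math
from statistics import NormalDist


def confidence_interval(sample_mean, sigma, n, c):
    """Return (low, high) bounds for the population mean at confidence level c."""
    z_c = NormalDist().inv_cdf((1 + c) / 2)  # critical value
    error = z_c * sigma / math.sqrt(n)       # margin of error E
    return sample_mean - error, sample_mean + error


# Inputs from Example-2: mean 15.60 mins, sigma 1.80 mins, n = 90, c = 0.95
low, high = confidence_interval(15.60, 1.80, 90, 0.95)
print(f"95% confidence interval: {low:.2f} mins to {high:.2f} mins")
```

Calling the same function with c = 0.99 gives a wider interval for the same data, which illustrates the trade-off between confidence level and precision.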
For given sample statistics, as the confidence level c increases, the confidence interval widens. As the confidence interval widens, the precision of the estimate decreases. One way to improve the precision of an estimate without decreasing the confidence level c is to increase the sample size. Given the margin of error E = $z_c \frac{\sigma}{\sqrt{n}}$, solving for n gives the minimum sample size $n = \left( \frac{z_c \sigma}{E} \right)^2$, rounded up to the next whole number.
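Example-3 | A research study is designed to find the mean weight of salmon caught by a fishing company. A preliminary study of a random sample of 50 salmon showed a standard deviation of about 2.15 pounds. How large a sample should be taken to be 99% confident that the sample mean is within 0.20 pound of the true mean weight? |
---|---|
For c = 0.99, the area under the normal curve to the left of the critical value $z_c$ is $(1 + 0.99) / 2 = 0.995$, which gives $z_c \approx 2.576$. Given the standard deviation from the preliminary study, we use it as an estimate of the population standard deviation, that is, $\sigma \approx 2.15$. The minimum sample size can be computed using n = $\left( \frac{z_c \sigma}{E} \right)^2 = \left( \frac{2.576 \times 2.15}{0.20} \right)^2 \approx 766.8$, which rounds up to 767 salmon |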
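The sample size calculation can be sketched in Python using the standard formula $n = \left( \frac{z_c \sigma}{E} \right)^2$, rounded up. The function name `min_sample_size` is an assumption for illustration, and the preliminary standard deviation of 2.15 pounds is treated as an estimate of the population standard deviation.

```python
import math
from statistics import NormalDist


def min_sample_size(c, sigma, max_error):
    """Return the smallest n for which the margin of error is at most max_error."""
    z_c = NormalDist().inv_cdf((1 + c) / 2)  # critical value for confidence level c
    return math.ceil((z_c * sigma / max_error) ** 2)


# Inputs from Example-3: c = 0.99, sigma estimated as 2.15 pounds, E = 0.20 pound
n = min_sample_size(0.99, 2.15, 0.20)
print(f"minimum sample size: {n}")
```

The answer is sensitive to how the critical value is rounded: a table value rounded to 2.58 instead of 2.576 gives 770 rather than 767.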