
Monday, November 18, 2013

On Stochastic Variables, Probability Distributions and Normality Tests ~ What Every Business Should Know


In the life of a typical design engineer, the normal distribution sneaks in when his or her designs are produced or manufactured through repeated runs of a process. Suddenly, talk of process capability, normality tests, sample sizes and so on raises issues that the designer never imagined while designing the product. Unless, of course, the product was designed using DFM (Design for Manufacturability). This short note tries to give the perspective and a small dose of the theory behind process analysis and acceptance activities, and it is hoped it will be of use to design engineers.



Fundamentals – Stochastic Variables and Probability Distributions
The natural and artificial phenomena that govern our lives need systematic frameworks to study and explain them. Mathematics, including branches such as statistics, is one such framework.

Fundamental to mathematics is the notion of a variable. Wikipedia describes a variable [1] as “a measurable factor, characteristic, or attribute of an individual or a system—in other words, something that might be expected to vary over time or between individuals”. A variable typically has a range (R) of values associated with it. Further, variables can be discrete or continuous.



Stochastic Variables
Let there be a variable with range R = {x1, x2, …, xn}. If it is a stochastic variable (say X), it associates a probability with each value in its range, so that it can be specified as a set of ordered pairs X = {(x1, p1), (x2, p2), …, (xn, pn)} defined over the range R = {x1, x2, …, xn} with a probability set P = {p1, p2, …, pn}, where the probabilities sum to 1.

X defines a function from the set R to the set P. A stochastic variable always has a probability distribution associated with it: a probability mass function for a discrete variable, or a probability density function (pdf) for a continuous one. This function may be known or unknown. Typically the distribution is represented graphically, with the range of the variable on the x-axis and the probability values on the y-axis. There are some standard probability distributions such as the binomial, gamma, exponential and Poisson distributions.

An example

Let X be defined as in the table below. Let the range of values be R = {25, 26, 27, 28, 29, 30, 31}. The corresponding pdf is plotted in the figure.



S. No.    X value from Range (R)    Probability P
1         25                        0.15
2         26                        0.15
3         27                        0.10
4         28                        0.20
5         29                        0.20
6         30                        0.15
7         31                        0.05
          SUM                       1.00
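As an illustration (a minimal Python sketch, not part of the original note), the stochastic variable X from the table can be represented directly as a set of (value, probability) pairs, and its mean and standard deviation computed from them:

import numpy as np

# X as a mapping from each value in the range R to its probability.
X = {25: 0.15, 26: 0.15, 27: 0.10, 28: 0.20, 29: 0.20, 30: 0.15, 31: 0.05}

values = np.array(list(X.keys()), dtype=float)
probs = np.array(list(X.values()))

# The probabilities must sum to 1 for X to be a valid distribution.
assert abs(probs.sum() - 1.0) < 1e-9

# Mean and standard deviation of the discrete distribution.
mean = (values * probs).sum()
std = np.sqrt(((values - mean) ** 2 * probs).sum())
print(f"mean = {mean:.2f}, std = {std:.2f}")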




The pdf shown in the figure above does not have a clear functional form such as P(x). However, there exist many phenomena that can be approximated by simple, well-defined probability distributions, so that they can be expressed as a mathematical relation. One of the standard and most prevalent of these is the Normal or Gaussian probability distribution.




This is generally expressed as N(μ, σ), where the parameter μ represents the mean of the distribution and σ represents the standard deviation. N(0, 1) is called the standard normal variable (shown as the green line in the figure). The normal distribution has very interesting properties, such as symmetry around the mean, that make it easy to analyze for various parameters of interest regarding the system [2]. The cumulative distribution function of a normal distribution follows an S-curve (shown in the adjacent figure), which readers involved in technology forecasting can easily relate to. The problem is that for many physical and artificial phenomena, e.g., manufacturing a complex product through repeated applications of a process, it is extremely difficult to find the probability density of the random variables that affect the (production) process. Luckily there is a remarkable result known as the Central Limit Theorem (CLT) that comes into the picture in such scenarios.
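For reference (this is the standard closed form, not something specific to this note), the density of N(μ, σ) is

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)

and the cumulative distribution function F(x) = \int_{-\infty}^{x} f(t)\,dt is the S-curve referred to above. F has no closed-form expression, which is why it is tabulated or computed numerically.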


Central Limit Theorem

Once again we look to Wikipedia [3], which states the CLT as: “if the sum of the variables has a finite variance, then it will be approximately normally distributed (i.e. following a normal or Gaussian distribution)”. However, there is a precondition: the random variables should be i.i.d. (independent and identically distributed). Many courses taught in the commercial world on statistical process control (SPC) either miss this important condition or do not give it its due importance.
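To make the theorem tangible, here is a minimal simulation sketch (not from the original note; the uniform(9, 11) distribution and the sample size of 30 are arbitrary choices for illustration). The means of i.i.d. draws from a clearly non-normal distribution are much closer to normal than the raw draws:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw 10,000 samples of size 30 from a uniform distribution (clearly non-normal).
samples = rng.uniform(low=9.0, high=11.0, size=(10_000, 30))

# The sample means (normalized sums) should be approximately normal by the CLT.
means = samples.mean(axis=1)

# Excess kurtosis of a uniform distribution is -1.2; for a normal it is 0.
print("kurtosis of raw draws :", stats.kurtosis(samples.ravel()))
print("kurtosis of the means :", stats.kurtosis(means))
print("skewness of the means :", stats.skew(means))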


How does it help in Statistical Process Control or Component Qualification?

We will try to explain the application through an example. Let us assume there is a design specification stating that the length of a tube should be 10 ± 1 mm. Further, we are told that the supplier produces it on the same line, but there are three shifts, many different specifications for other customers, and different operators. In addition, two different machines of the same type are used in different shifts. Let us assume we get 5 samples each from three lots, i.e., a total of 15 components.




This is a small sample (15 observations) from the universe of all parts produced by the process. Typically, the number of parts produced by the process will be large. For the purpose of analysis, and for extracting any meaningful information about the population, one cannot rely on such a small sample alone. Can one tell what the underlying probability density function is just by observing this data? Can one say what the probability is that the next tube produced by the process will be 10.5 mm in length? Can one say with confidence that the process will produce tubes between 9 mm and 11 mm only? If not, can one say that y% of the time the process will produce tubes between 9 mm and 11 mm? Since we do not know the underlying probability distribution, we cannot make any such statement with confidence.

Each of these observations, however, is a specific value of a stochastic variable. If we know that the underlying probability density function of this family of stochastic variables is the same, and that the variables are independent of each other, then the Central Limit Theorem holds. Thereby the sum of these variables, and in turn their mean (which is nothing but a normalized sum), is drawn from a stochastic variable whose probability density function is close to a normal distribution. Once this is established, one can carry out process capability analysis and make qualified statements about the process.

However, it is difficult to check each observation from the process to see whether there is any variation in the process from one part to the next. Hence the i.i.d. conditions cannot be evaluated directly. Another way to check is as follows: if the data set conforms to a normal distribution, then we can reasonably assume that the variables are i.i.d., and hence one can use process capability analysis, sampling and testing to make confident statements about the process.
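The following Python sketch illustrates this workflow. The 15 measurements are hypothetical values invented here for illustration, and Cp/Cpk are the standard capability indices, not something prescribed by this note:

import numpy as np
from scipy import stats

# Hypothetical measurements (mm) for the 15 sampled tubes; illustrative only.
lengths = np.array([
    9.92, 10.05, 10.11, 9.98, 10.03,   # lot 1
    10.07, 9.95, 10.00, 10.12, 9.97,   # lot 2
    10.02, 9.94, 10.08, 10.01, 9.99,   # lot 3
])

# Step 1: normality test (Shapiro-Wilk is a common choice for small samples).
stat, p_value = stats.shapiro(lengths)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

if p_value > 0.05:
    # Step 2: if normality is not rejected, compute the usual capability
    # indices against the specification limits 10 +/- 1 mm.
    lsl, usl = 9.0, 11.0
    mean, std = lengths.mean(), lengths.std(ddof=1)
    cp = (usl - lsl) / (6 * std)
    cpk = min(usl - mean, mean - lsl) / (3 * std)
    print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
else:
    print("Normality rejected; investigate the process before capability analysis.")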

There are some caveats in this line of thought. It is possible that the data passes normality tests, yet by observing the plots one may still see multiple modes or clusters in the data, for example one cluster per machine or per shift. In such scenarios, there is a need to investigate the reasons further.
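As a simple visual check, a histogram (again only a sketch, reusing the hypothetical lengths from above) is often enough to reveal such clusters:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical tube lengths (mm), as in the sketch above.
lengths = np.array([9.92, 10.05, 10.11, 9.98, 10.03, 10.07, 9.95, 10.00,
                    10.12, 9.97, 10.02, 9.94, 10.08, 10.01, 9.99])

# Distinct clusters in the histogram would suggest a mixture of processes
# (e.g., different machines or shifts) even if a normality test passes.
plt.hist(lengths, bins=8, edgecolor="black")
plt.xlabel("tube length (mm)")
plt.ylabel("count")
plt.show()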



Conclusions

The objective of this short note is to explain the fundamentals and reasoning behind normality tests and process capability computations. Starting from the basic definition of stochastic variables, we have tried to cover the normal distribution, the Central Limit Theorem and the rationale underlying normality tests. The reader can refer to the references given below for more information on these topics. It was felt that although the procedures for executing normality tests and other statistical methods are well understood, engineers are generally not clear about the reasoning or theory behind them. It is the author's hope that this short write-up can help bridge that gap to some extent.


References

[1] "Variable (mathematics)", Wikipedia.
[2] "Normal distribution", Wikipedia.
[3] "Central limit theorem", Wikipedia.