In the life of a typical design engineer, the normal distribution sneaks in when designs are produced or manufactured through repeated runs of a process. Suddenly, talk of process capability, normality tests, sample sizes and so on raises issues that the designer never imagined while designing the product. Unless, of course, the product was designed using what is called DFM (Design for Manufacturability). This short note tries to give the perspective, and small doses of the theory, behind process analysis and acceptance activities; it is hoped that it will be of use to design engineers.
Fundamentals – Stochastic Variables and Probability Distributions
The natural and artificial phenomena that govern our lives need systematic frameworks to study and explain them. Mathematics, including its branches such as statistics, is one such framework.
Fundamental to mathematics is the notion of a variable. Wikipedia describes a variable [1] as "a measurable factor, characteristic, or attribute of an individual or a system—in other words, something that might be expected to vary over time or between individuals". A variable typically has a range (R) of values associated with it. Further, variables can be discrete or continuous.
Stochastic Variables
Let there be a variable with range R = {x1, x2, …, xn}. If it is a stochastic variable (say X), it associates a probability with each value from its range, so that it can be specified as a set of ordered pairs X = {(x1, p1), (x2, p2), …, (xn, pn)} defined over the range R = {x1, x2, …, xn} with a probability set P = {p1, p2, …, pn}.
X thus defines a function from the set R to the set P. A stochastic variable always has a probability density function (pdf) associated with it (for a discrete variable this is, strictly speaking, a probability mass function, but "pdf" is used throughout this note). This function may be known or unknown. Typically, the probability distribution is represented graphically, with the range of the variable on the x-axis and the probability values on the y-axis. There are some standard probability distributions such as the binomial, gamma, exponential and Poisson distributions.
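As a minimal sketch (assuming a Python environment with scipy, which is not part of the original note), these standard distributions can be instantiated and evaluated as follows; the parameter values are arbitrary illustrations.

```python
# Illustrative only: arbitrary parameters for the standard distributions named above.
from scipy import stats

binom = stats.binom(n=10, p=0.3)       # binomial: 10 trials, success probability 0.3
poisson = stats.poisson(mu=4.0)        # Poisson with mean 4
expon = stats.expon(scale=2.0)         # exponential with mean 2
gamma = stats.gamma(a=2.0, scale=1.5)  # gamma with shape 2, scale 1.5

print(binom.pmf(3))    # P(X = 3) for the discrete binomial
print(poisson.pmf(2))  # P(X = 2) for the Poisson
print(expon.pdf(1.0))  # density at x = 1 for the exponential
print(gamma.cdf(3.0))  # P(X <= 3) for the gamma
```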
An example
Let X be defined as in the table below, with the range of values R = {25, 26, 27, 28, 29, 30, 31}. The corresponding pdf is plotted in the figure.
S. No. | X values from Range (R) | Probability P
1      | 25                      | 0.15
2      | 26                      | 0.15
3      | 27                      | 0.10
4      | 28                      | 0.20
5      | 29                      | 0.20
6      | 30                      | 0.15
7      | 31                      | 0.05
SUM    |                         | 1.00
The pdf shown in the figure above does not have a clear functional form such as P(x). However, there exist many phenomena that can be approximated by simple, well-defined probability distributions which can be expressed as a mathematical relation. One of the standard and most prevalent of these is the Normal, or Gaussian, probability distribution.
This is generally expressed as N(μ, σ), where the parameter μ represents the mean of the distribution and σ the standard deviation. N(0, 1) is called the standard normal variable (shown as the green line in the figure). The normal distribution has very interesting properties, such as symmetry around the mean, which make it easy to analyze for various parameters of interest regarding the system [2]. The cumulative distribution function of a normal distribution follows an S-curve (shown in the adjacent figure), which readers involved in technology forecasting can easily relate to.
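A small sketch of these properties, again assuming Python with scipy; the mean and standard deviation used below (10 and 0.5) are arbitrary and not taken from this note.

```python
import numpy as np
from scipy import stats

mu, sigma = 10.0, 0.5          # illustrative N(mu, sigma)
z = stats.norm(0, 1)           # standard normal N(0, 1)
x = stats.norm(mu, sigma)

# Symmetry around the mean: half the probability mass lies on either side
print(x.cdf(mu))               # 0.5

# The cumulative distribution follows the S-curve described above
grid = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)
print(x.cdf(grid))             # rises slowly, steeply through the mean, then flattens

# The familiar "68-95-99.7" property of the normal distribution
print(x.cdf(mu + sigma) - x.cdf(mu - sigma))           # ~0.683
print(x.cdf(mu + 3 * sigma) - x.cdf(mu - 3 * sigma))   # ~0.997
print(z.cdf(1) - z.cdf(-1))    # same value for the standard normal
```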
The problem is that for many physical and artificial phenomena, e.g., manufacturing a complex product through repeated applications of a process, it is extremely difficult to find the probability density of the random variables that affect the (production) process. Luckily, there is a remarkable result known as the Central Limit Theorem (CLT) that comes into the picture in such scenarios.
Central Limit Theorem
Once again we look at Wikipedia [3], which states the CLT as: "if the sum of the variables has a finite variance, then it will be approximately normally distributed (i.e. following a normal or Gaussian distribution)". However, there is a pre-condition: the random variables should be iid (independent and identically distributed). Many of the courses taught in the commercial world on statistical process control (SPC) either miss this important condition or do not give it its due importance.
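The CLT and the role of the iid condition can be seen in a small simulation; the following sketch (assuming numpy and scipy) draws iid samples from a clearly non-normal distribution and shows that their means behave approximately normally.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# iid draws from a clearly non-normal distribution (exponential)
n, repeats = 50, 10_000
samples = rng.exponential(scale=2.0, size=(repeats, n))

# The mean of each group of n iid variables: by the CLT its distribution
# is approximately normal even though the individual draws are not.
means = samples.mean(axis=1)

print(stats.shapiro(rng.exponential(scale=2.0, size=500)))  # raw data: clearly non-normal
print(stats.shapiro(means[:500]))                           # sample means: close to normal
```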
We will try to explain the application through an example. Let us assume there is a design specification stating that the length of a tube should be 10 ± 1 mm. Further, we are told that the supplier produces it on a single line, but there are three shifts, many different specifications for other customers, and different operators. In addition, two different machines of the same type are used in different shifts. Let us assume we get 5 samples each from three lots, i.e., a total of 15 components.
This is a small sample (15 observations) from the universe of all parts produced by the process. Typically, the number of parts produced by the process will be large, and for the purpose of analysis and extracting meaningful information about the population one cannot rely on such a small sample alone. Can one tell what the underlying probability density function is just by observing this data? Can one say what the probability is that the next tube produced by the process will be 10.5 mm long? Can one say with confidence that the process will produce tubes between 9 mm and 11 mm only? If not, can one say that y% of the time the process will produce tubes between 9 mm and 11 mm? Since we do not know the underlying probability distribution, we cannot make any such statement with confidence.
Each of these observations, however, is a specific value of a stochastic variable. If we know that the underlying probability density function of this family of stochastic variables is the same and that the variables are independent of each other, then the Central Limit Theorem holds. The sum of these variables, and in turn their mean (which is nothing but a normalized sum), is therefore drawn from a stochastic variable whose probability density function is close to a normal distribution. Once this is established, one can carry out process capability analysis and make qualified statements about the process.
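If, and only if, the data can be treated as approximately normal, a first-cut process capability computation looks roughly like the sketch below; Cp and Cpk are the standard textbook formulas, and the data array is the hypothetical one from the previous sketch.

```python
import numpy as np

def process_capability(data, lsl, usl):
    """Standard first-order Cp / Cpk estimates, valid only under (approximate) normality."""
    mu = data.mean()
    sigma = data.std(ddof=1)
    cp = (usl - lsl) / (6 * sigma)                     # spread of spec vs spread of process
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)        # also penalizes an off-center mean
    return cp, cpk

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.7, 10.1, 10.0,
                 9.9, 10.2, 10.3, 10.1, 9.8])          # hypothetical measurements
print(process_capability(data, lsl=9.0, usl=11.0))     # spec limits from the 10 ± 1 mm example
```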
However, it is difficult to check each observation from the process to see whether the process varies from one part to the next, so the iid conditions cannot be evaluated directly. Another way to check is as follows: if the data set conforms to a normal distribution, then we can reasonably assume that the variables are iid, and hence one can use process capability analysis on sampled and tested parts to make confident statements about the process.
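In practice this check is usually done with a standard normality test such as Shapiro-Wilk or Anderson-Darling; a minimal sketch, again with the hypothetical data:

```python
import numpy as np
from scipy import stats

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.7, 10.1, 10.0,
                 9.9, 10.2, 10.3, 10.1, 9.8])  # hypothetical measurements

stat, p_value = stats.shapiro(data)
print(stat, p_value)
# A large p-value (e.g. > 0.05) means the data show no significant departure from
# normality; it does not by itself prove the observations are iid.

print(stats.anderson(data, dist='norm'))  # Anderson-Darling as an alternative test
```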
There are some caveats in this line of thought. The data may pass normality tests, yet an inspection of the plots may reveal multiple modes or clusters in the data. In such scenarios the reasons need to be investigated further.
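One quick way to see such clusters, even when a formal test passes, is a simple histogram; a sketch with the hypothetical data again:

```python
import numpy as np

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.7, 10.1, 10.0,
                 9.9, 10.2, 10.3, 10.1, 9.8])  # hypothetical measurements

# A coarse text histogram: two separated groups of '#' would suggest multiple
# modes (e.g. two machines or two shifts behaving differently).
counts, edges = np.histogram(data, bins=6)
for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:5.2f}-{hi:5.2f} | {'#' * c}")
```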
Conclusions
The objective of this short note is to explain the fundamentals and reasoning behind normality tests and process capability computations. Starting from the basic definition of a stochastic variable, we have tried to cover the normal distribution, the Central Limit Theorem and the rationale underlying normality tests. The reader can refer to the links given below for more information on these topics. It was felt that although the procedures for executing normality tests and other statistical methods are well understood, engineers are generally not clear about the reasoning or theory behind them. It is the author's hope that this short write-up helps bridge that gap to some extent.
References