Chapter 1

Statistics: a science that attempts to make sense of data and to provide information so you can make informed decisions

Two types – descriptive and inferential

Descriptive statistical methods are used to describe your data through graphical methods and numerical summaries.

Inferential statistical methods use your data to make generalizations about a large group based on a subset of that group and to assess the reliability of your inferences.

Data: the pieces of information that you gather; usually organized into rows and columns. Rows represent the individuals or experimental units being examined. Columns represent the variables or characteristics that are recorded about each unit.

Data can be split into two major types: quantitative and qualitative (categorical).

Quantitative data are numeric and can be:

• Nominal – numbers that represent codes
• Ordinal – numbers that convey ranking
• Interval – numbers for which it makes sense to do mathematical calculations

Qualitative data are non-numeric and specify which of a finite number of discrete categories a unit belongs to.

When you receive a data set, you should ask yourself the following questions:

• Where did the data come from? For what purpose was it collected?
• What experimental units do the data describe? How many are there?
• What variables are in the data set? What are the exact definitions of those variables? What are the units of measure for each variable?

Population: the larger universe in which you’re interested; all possible individuals to which you wish your conclusions to apply. You may or may not be able to enumerate the entire population.

It’s often impossible or at least impractical to collect data from the entire population, so it’s common practice to select a sample.

Sample: a representative subset selected from the population.

To help ensure that the sample is representative, probability theory is used to select a random sample. With a random sample, good results can be obtained from a small subset of the population, and the amount of error in the results can be quantified.
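
As a rough illustration of drawing a simple random sample, here is a minimal Python sketch. The population (ID numbers 1 to 10,000), the sample size of 100, and the seed are all assumptions made up for this example; they are not from the course material.

```python
import random

# Hypothetical population: ID numbers for 10,000 units (invented for illustration).
population = list(range(1, 10001))

rng = random.Random(42)                  # fixed seed so the sketch is reproducible
sample = rng.sample(population, k=100)   # simple random sample, without replacement

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean = {population_mean:.1f}")
print(f"sample mean     = {sample_mean:.1f}")
```

The sample mean will usually land near the population mean even though only 1% of the units were examined; a different seed gives a different sample and a slightly different estimate.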

Reasons for Sampling

• More cost effective than taking a census
• Faster than taking a census
• Population may simply be too large
• Nature of the observations may be destructive
• Population may not be easily accessible

Engineering Applications

• Statistical process control
• Quality assessment
• Model building and prediction
• Statistical inference based on experimental results
• Reliability
• Experimental design

Chapter 2

Goal: to understand the data – describe it, summarize it, answer questions about it.

• Begin by examining each variable by itself
• Examine the data graphically
• Produce numerical summaries
• Look at relationships between the variables

Frequency Distribution: lists the values that occur in the data set and tells you how often each value occurs
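
A frequency distribution can be tallied in a few lines of Python. This is a minimal sketch; the list of paint colors is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical categorical data (paint colors), invented for illustration.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(colors)

# Print the table from most to least frequent, with a crude text bar.
for value, count in frequency.most_common():
    print(f"{value:<6} {count}  {'*' * count}")
```

The row of asterisks is a text stand-in for a bar chart: the length of each bar is the frequency of that category.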

Qualitative Variables

Use bar charts or pie charts to illustrate the frequency distribution graphically
The heights of the bars in the bar chart represent the frequency of occurrence of the category.
The angles (and therefore the areas) of the slices in the pie chart represent the relative frequency of occurrence of the category.

Quantitative Data

Use histograms to illustrate the frequency distribution graphically

Goals:

• Look for overall patterns in the data
• Determine the center or location of the data
• Determine the general spread of the data
• Look at the shape of the distribution

Shapes:

• Symmetric – mirror image
• Skewed to the right (positively skewed) – long tail pointing to the right
• Skewed to the left (negatively skewed) – long tail pointing to the left
• Normal – bell-shaped and symmetric
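• Exponential – strongly skewed to the right; frequencies are highest at the smallest values and decay steadily

For smaller data sets, stem and leaf plots are useful for describing distributions. They give more information than a histogram because the individual data values can still be read off the plot. To construct a stem and leaf plot: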

• Sort the data from smallest to largest
• Select the leading digits (stems) from the data
• List those digits in numeric order in a column, listing each unique stem only once
• Draw a vertical line to the right of the stems
• Write the final digit (leaf) of each number to the right of the appropriate stem

Example:

Drying times for different formulations of paint

2.5  3.0  3.3  4.0  6.0  2.8  4.2  4.4  5.0  5.0  3.6  5.6  4.8  4.9  6.1  3.5  4.5  5.2  4.5  6.5

Stem and Leaf Plot for drying time data

2 | 5 8
3 | 0 3 5 6
4 | 0 2 4 5 5 8 9
5 | 0 0 2 6
6 | 0 1 5
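
The construction steps can be sketched in Python as follows. This is one possible implementation, assuming every value has exactly one decimal place (as the drying times do); the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

drying_times = [2.5, 3.0, 3.3, 4.0, 6.0, 2.8, 4.2, 4.4, 5.0, 5.0,
                3.6, 5.6, 4.8, 4.9, 6.1, 3.5, 4.5, 5.2, 4.5, 6.5]

def stem_and_leaf(values):
    """Return {stem: [leaves]} for values with one decimal place."""
    plot = defaultdict(list)
    for value in sorted(values):           # sort smallest to largest
        tenths = round(value * 10)         # 2.5 -> 25; rounding avoids float noise
        stem, leaf = divmod(tenths, 10)    # 25 -> stem 2, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(drying_times)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Note that this sketch only prints stems that actually occur in the data; a stem with no leaves would be skipped rather than shown as an empty row.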

Once you have examined the data graphically, you may want to calculate some numerical summary measures to describe the center and spread of the data.

Sample mean: simple average of all of the observations

Example:

Mark McGwire’s home run record

Year HR     Year HR
1987 49     1993 9
1988 32     1994 9
1989 33     1995 39
1990 39     1996 52
1991 22     1997 58
1992 42     1998 70

The mean is 37.8.
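
The calculation is the sum of the observations divided by the number of observations. A short Python check, using the home run counts from the table (the variable names are chosen for this example):

```python
# Home runs by season, copied from the table.
home_runs = {1987: 49, 1988: 32, 1989: 33, 1990: 39, 1991: 22, 1992: 42,
             1993: 9, 1994: 9, 1995: 39, 1996: 52, 1997: 58, 1998: 70}

total = sum(home_runs.values())   # 454 home runs
n = len(home_runs)                # 12 seasons
sample_mean = total / n

print(round(sample_mean, 1))      # 37.8
```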

** The mean is very sensitive to outliers **

So, we need another measure of center that is not so sensitive to extreme values.

Median: the midpoint of the distribution. About half of the values are at or below the median, and about half are at or above it.

To calculate the median:

• Sort the observations from smallest to largest
• If you have an odd number of observations, the median is the middle observation
• If you have an even number of observations, the median is the average of the two middle observations

Example using the McGwire home run data, sorted from smallest to largest:

9  9  22  32  33  39  39  42  49  52  58  70

There are 12 observations, so the median is the average of the two middle values (the 6th and 7th): (39 + 39) / 2 = 39
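
The same rule written as a small Python function. This is a sketch for illustration; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

home_runs = [49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70]
print(median(home_runs))   # 39.0
```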

Rules:

• The median is the preferred measure of center when there are outliers present in the data.
• The median is the preferred measure of center when the distribution is skewed.
• The mean is the preferred measure of center when the distribution is fairly symmetric and there are no outliers in the data.
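
To see why these rules favor the median when outliers are present, compare both measures before and after adding one extreme value. The value 700 below is a hypothetical outlier invented for illustration; it is not part of the home run record.

```python
from statistics import mean, median

home_runs = [49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70]

# Hypothetical extreme value, added only to show its effect on each measure.
with_outlier = home_runs + [700]

print(f"original:     mean = {mean(home_runs):.1f}, median = {median(home_runs)}")
print(f"with outlier: mean = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The single extreme value pulls the mean from about 37.8 up to about 88.8, while the median stays at 39.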