Statistics: a science that attempts to make sense of data and to provide information so you can make informed decisions
Two types – descriptive and inferential
Descriptive statistical methods are used to describe your data through graphical methods and numerical summaries.
Inferential statistical methods use your data to make generalizations about a large group based on a subset of that group and to assess the reliability of your inferences.
Data: the pieces of information that you gather; usually organized into rows and columns. Rows represent the individuals or experimental units being examined. Columns represent the variables or characteristics that are recorded about each unit.
Data can be split into two major types: quantitative and qualitative (categorical).
Quantitative data are numeric and can be:
When you receive a data set, you should ask yourself the following questions:
It’s often impossible or at least impractical to collect data from the entire population, so it’s common practice to select a sample.
Sample: a representative subset selected from the population.
To ensure that the sample is representative, probability theory is used to select a random sample from the population. With a random sample, good results can be obtained from a small subset of the population, and the amount of error in the results can be quantified.
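As a rough illustration of drawing a simple random sample, the sketch below uses Python's standard `random` module. The population of 1,000 ID numbers, the sample size of 50, and the seed are all assumptions made up for this example; they do not come from the notes.

```python
import random

# Hypothetical population: ID numbers 1..1000 (invented for illustration).
population = list(range(1, 1001))

# A fixed seed is used only so the sketch gives the same sample on every run.
rng = random.Random(42)

# Simple random sample without replacement: every unit is equally likely to be chosen.
sample = rng.sample(population, k=50)

print(len(sample), "units selected; smallest five IDs:", sorted(sample)[:5])
```

Because each unit has the same chance of selection, no part of the population is systematically favored, which is what allows the sampling error to be quantified.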
Reasons for Sampling
Goal: to understand the data – describe it, summarize it, answer questions about it.
Begin by examining each variable by itself
Examine the data graphically
Produce numerical summaries
Look at relationships between the variables
Frequency Distribution: lists the values that occur in the data set and tells you how often each value occurs
Use bar charts or pie charts to illustrate the frequency distribution
The heights of the bars in the bar chart represent the frequency of occurrence of each category.
The sizes (angles) of the slices in the pie chart represent the relative frequency of occurrence of each category.
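A frequency distribution for categorical data can be tabulated with a few lines of Python. This is only a sketch: the list of paint finishes is hypothetical data invented for the example, and the text "bars" stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical categorical data (finish of 10 paint samples); not from the notes.
finishes = ["gloss", "matte", "gloss", "satin", "matte",
            "gloss", "satin", "gloss", "matte", "gloss"]

freq = Counter(finishes)   # category -> number of occurrences
n = len(finishes)

for category, count in freq.most_common():
    # Bar length shows the frequency; the percentage is the share a pie slice would get.
    print(f"{category:<6} {'#' * count:<6} {count:>2}  ({100 * count / n:.0f}%)")
```

The counts drive the bar heights, while the percentages (counts divided by the total) determine the slice sizes in a pie chart.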
Use histograms to illustrate the frequency distribution graphically
To construct a stem-and-leaf plot:
Sort the data from smallest to largest
Select the leading digits (stems) from the data
List those digits in numeric order in a column, listing each unique stem only once
Draw a vertical line
Write each of the final digits (leaves) for each number to the right of the appropriate stem.
Drying times for different formulations of paint
2.5 3.0 3.3 4.0 6.0 2.8 4.2 4.4 5.0 5.0 3.6 5.6 4.8 4.9 6.1 3.5 4.5 5.2 4.5 6.5
Stem and Leaf Plot for drying time data
2 | 5 8
3 | 0 3 5 6
4 | 0 2 4 5 5 8 9
5 | 0 0 2 6
6 | 0 1 5
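The plot above can be reproduced programmatically. The sketch below assumes every value has exactly one decimal place, so the whole-number part is the stem and the tenths digit is the leaf; the function name `stem_and_leaf` is just a label chosen for this example.

```python
from collections import defaultdict

drying_times = [2.5, 3.0, 3.3, 4.0, 6.0, 2.8, 4.2, 4.4, 5.0, 5.0,
                3.6, 5.6, 4.8, 4.9, 6.1, 3.5, 4.5, 5.2, 4.5, 6.5]

def stem_and_leaf(values):
    """Return {stem: [leaves]} for values with one decimal place."""
    plot = defaultdict(list)
    for v in sorted(values):                # sort so leaves come out in order
        tenths = round(v * 10)              # 2.5 -> 25; round() guards against float error
        stem, leaf = divmod(tenths, 10)     # 25 -> stem 2, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(drying_times)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Note that this sketch prints only stems that have at least one leaf; a data set with gaps would need the empty stems added by hand.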
Once you have examined the data graphically, you may want to calculate some numerical summary measures to describe the center and spread of the data.
Sample mean: simple average of all of the observations
Mark McGwire’s home run record
Year  Home runs    Year  Home runs
1987  49           1993  9
1988  32           1994  9
1989  33           1995  39
1990  39           1996  52
1991  22           1997  58
1992  42           1998  70
The mean is 454 / 12 ≈ 37.8.
** The mean is very sensitive to outliers **
So, we need another measure of center that is not so sensitive to extreme values.
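The mean reported above, and its sensitivity to extreme values, can be checked with Python's standard `statistics` module. Treating the two 9-home-run seasons as the extreme values is an assumption made for this illustration; the notes do not say which seasons count as outliers.

```python
from statistics import mean

home_runs = {1987: 49, 1988: 32, 1989: 33, 1990: 39, 1991: 22, 1992: 42,
             1993: 9, 1994: 9, 1995: 39, 1996: 52, 1997: 58, 1998: 70}

# Mean of all 12 seasons: sum of the values divided by how many there are.
overall = mean(home_runs.values())
print(f"mean of all 12 seasons:  {overall:.1f}")

# Illustration only: drop the two 9-home-run seasons (1993, 1994) and recompute.
without_low = mean(hr for hr in home_runs.values() if hr != 9)
print(f"mean without the two 9s: {without_low:.1f}")
```

Removing just two of the twelve seasons moves the mean by almost six home runs, which is the kind of sensitivity that motivates a more resistant measure of center.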
Median: the midpoint of the distribution. About half of the values are at or below the median, and about half are at or above it.
To calculate the median:
Sort the observations from smallest to largest
If you have an odd number of observations, then the median is the middle observation
If you have an even number of observations, the median is the average of the 2 middle observations.
Example using the McGwire home run data (sorted):
9 9 22 32 33 39 39 42 49 52 58 70
The median is the average of the 2 middle numbers (39 and 39): (39 + 39) / 2 = 39
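The steps for calculating the median translate directly into code. This is a minimal sketch written out by hand to mirror those steps; in practice `statistics.median` from the standard library does the same job.

```python
def median(values):
    """Median by the sort-and-pick-the-middle procedure."""
    ordered = sorted(values)            # step 1: sort smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd count: the single middle observation
        return ordered[mid]
    # even count: average of the 2 middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

home_runs = [49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70]
print(median(home_runs))  # 39.0
```

With 12 observations the two middle values are the 6th and 7th in sorted order, both 39, so the extreme seasons at either end have no effect on the result.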