Chapters 1 & 2:
The following data relating to education were collected for each state.
A subset of the data is listed below. The REGION is the US Census
region, POP is the population of the state in thousands of people, SATV
is the average verbal SAT score for the state, and TPAY is the average
teacher pay for the state in thousands of dollars.
OBS  STATE  REGION    POP  SATV  TPAY
  2  AL     ESC      4273   565  31.3
  3  AR     WSC      2510   566  29.3
  4  AZ     MTN      4428   525  32.5
  5  CA     PAC     31878   495  43.1
  6  CO     MTN      3823   536  35.4
  7  CT     NE       3274   507  50.3
  8  DC     SA        543   489  43.7
  9  DE     SA        725   508  40.5
 10  FL     SA      14400   498  33.3
...  ...    ...       ...   ...   ...
Is REGION a quantitative or qualitative variable?
What type of variable is TPAY?
What is the shape of this distribution?
Skewed to the right
Do there appear to be any outliers in the distribution?
Yes – the point around 30,000
What measure of center should be used for this distribution?
What measure of spread would be appropriate for this distribution?
In the following side-by-side box plots illustrating the distribution of verbal SAT scores, the census regions have further been collapsed into North, South, Midwest, and West.
Based on the box plot, which region (N, S, MW, or W) tends to have the
highest SAT scores?
Which region has the most consistent verbal scores on the SAT?
For the southern region, we can see that the middle 50% of SAT verbal
scores is between 495 and 565.
Continuing with the education data presented above, we want to examine
the relationship between verbal scores on the SAT and teacher pay in each
state. The following scatterplot illustrates the relationship between
the two variables:
The correlation between teacher pay and verbal SAT score is –0.47. Based on this information and the scatterplot, what can you say about the relationship between teacher pay and verbal SAT score (shape, strength, direction)?
Shape: slightly linear
Strength: fairly weak
Direction: decreasing trend
From the scatterplot, does it appear that there are any outliers or
Yes – there are some outliers with TPAY between 40 and 50 and SATV around 560, and others with TPAY between 30 and 35 and SATV between 480 and 500.
What does the intercept tell us for this regression model?
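It is the predicted average verbal SAT score, about 624, for a state where average teacher pay is 0. No state in the data has pay anywhere near 0, so this is an extrapolation rather than a practical prediction.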
How would we interpret the slope for this model?
For each 1-unit ($1,000) increase in average teacher pay, the predicted average verbal SAT score decreases by 2.56 points.
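The intercept and slope answers above can be illustrated with a minimal sketch. It assumes the fitted line is SATV = 624 – 2.56 * TPAY, using the rounded coefficients quoted in the answers; the function name `predict_satv` is hypothetical.

```python
# Fitted line read off the answers above: SATV_hat = 624 - 2.56 * TPAY.
# Both coefficients are the rounded values quoted in the text (an assumption).
INTERCEPT = 624.0   # predicted SATV when TPAY = 0 (an extrapolation)
SLOPE = -2.56       # change in predicted SATV per $1,000 of teacher pay


def predict_satv(tpay_thousands: float) -> float:
    """Predicted average verbal SAT score for a given average teacher pay."""
    return INTERCEPT + SLOPE * tpay_thousands


# Raising pay by one unit ($1,000) changes the prediction by exactly the slope.
diff = predict_satv(41.0) - predict_satv(40.0)
print(round(diff, 2))
```

The difference between the two predictions equals the slope regardless of the starting pay level, which is what the slope interpretation states.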
What percentage of the variability in verbal SAT scores can be explained
by differences in teacher pay?
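For a regression with a single predictor, the proportion of variability explained equals the square of the correlation. A short check, assuming the model here is a simple linear regression of SATV on TPAY:

```python
# Coefficient of determination for simple linear regression: R^2 = r^2.
r = -0.47            # correlation quoted in the text
r_squared = r ** 2   # share of variability in SATV explained by TPAY
print(f"{r_squared:.4f}")
```

Under that assumption, roughly 22% of the variability in verbal SAT scores is explained by differences in teacher pay.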
Can we say that increasing teacher pay causes verbal SAT scores to decrease?
No – association does not imply causation.
1. Choose an employed person at random. Government data tell us that the probability that the worker is a woman is 0.46. The probability that any given woman will hold a managerial or professional job is 0.32. What is the probability that a randomly selected employed person will be a woman who holds a managerial or professional job?
General Multiplication Law:
Pr(A and B) = Pr(A)*Pr(B|A) = Pr(B)*Pr(A|B)
So, Pr(woman and professional) = Pr(woman)*Pr(professional|woman)
= (0.46)(0.32) = 0.1472
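The multiplication law above can be checked in a couple of lines; the variable names are illustrative only.

```python
# General multiplication law: Pr(A and B) = Pr(A) * Pr(B | A).
p_woman = 0.46              # Pr(worker is a woman)
p_prof_given_woman = 0.32   # Pr(managerial/professional job | woman)

p_woman_and_prof = p_woman * p_prof_given_woman
print(round(p_woman_and_prof, 4))
```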
3. Consolidated Builders has bid on two large construction projects. The company president believes that the probability of winning the first contract (A) is 0.6, that the probability of winning the second contract (B) is 0.4, and that the probability of winning both jobs (A and B) is 0.2.
a. Are the events A and B independent?
No, because Pr(A and B) = 0.2 is not equal to Pr(A)*Pr(B) = (0.6)(0.4) = 0.24.
b. What is the probability that Consolidated Builders will win either
contract A OR contract B?
Pr(A or B) = Pr(A) + Pr(B) – Pr(A and B)
= 0.6 + 0.4 – 0.2 = 0.8
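Both parts of this problem reduce to simple arithmetic on the three given probabilities. A minimal sketch (variable names are illustrative):

```python
import math

p_a = 0.6        # Pr(win first contract)
p_b = 0.4        # Pr(win second contract)
p_a_and_b = 0.2  # Pr(win both)

# Independence requires Pr(A and B) == Pr(A) * Pr(B).
independent = math.isclose(p_a_and_b, p_a * p_b)

# Addition rule: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B).
p_a_or_b = p_a + p_b - p_a_and_b

print(independent, round(p_a_or_b, 2))
```

`math.isclose` is used instead of `==` so that floating-point rounding does not affect the independence check.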
1. Why is sampling often preferable to conducting a census for the purpose
of obtaining information about a population?
It is often not practical or possible to conduct a census. Taking a sample is quicker and more cost-efficient, and a properly selected random sample can still provide good information about the population.
2. Why do we generally expect some error when estimating a parameter
(such as a population mean) by a statistic (such as a sample mean)?
Sampling variability – the value of the statistic will vary from sample to sample.
3. Explain why increasing the sample size results in a tendency for
smaller sampling error when using a sample mean to estimate a population mean.
The variance of the sampling distribution of the mean is inversely proportional to the sample size.
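To make the answer concrete: the standard deviation of the sample mean (the standard error) is sigma/sqrt(n), so it shrinks as n grows. The sketch below uses a hypothetical sigma of 10 purely for illustration; it is not taken from any problem in this review.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)


sigma = 10.0  # hypothetical population standard deviation
for n in (25, 100, 400):
    # Quadrupling n halves the standard error.
    print(n, standard_error(sigma, n))
```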
4. Define sampling error (also called sampling variability).
Sampling error refers to the fact that the value of a statistic will vary from sample to sample.
1. What exactly is a confidence interval and why is it better to report
a confidence interval instead of a single number for estimating the population
A confidence interval is an estimate of the population parameter plus or minus a margin of error. It is better to report a confidence interval than a point estimate because the interval accounts for sampling variability.
2. Suppose that we have obtained data by taking a simple random sample
from a population. What should we do (e.g., what assumptions should
we verify) before we construct a confidence interval from the sample?
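Verify that the sample mean has an approximately normal sampling distribution: either the population itself is normally distributed or the sample size is large enough for the Central Limit Theorem to apply.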
3. Suppose that we have obtained data by taking a simple random sample
from a population and we intend to find a confidence interval for the population
mean. We will either use a 95% confidence interval or a 99% confidence
interval. Which confidence level will give us a narrower interval
for estimating the mean?
The 95% CI will give a narrower interval for estimating the mean.
4. The Gallup Organization conducts annual national surveys on home
gardening. Results are published by the national Association for
Gardening. A random sample is taken of 25 households with vegetable
gardens. The size of vegetable gardens is normally distributed.
The mean size of the vegetable gardens from the sample was 643 sq ft.
a. Find a 90% confidence interval for the mean size of all household vegetable gardens in the United States. Assume that sigma=247 sq ft.
xbar +/- z * sigma/sqrt(n)
643 +/- (1.645)(247)/sqrt(25)
b. Explain in words what the confidence interval from part (a) means.
We are 90% confident that the true mean size of all household vegetable gardens in the US is contained in this interval.
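The interval in part (a) is left unevaluated above. A minimal sketch that finishes the arithmetic, using Python's standard-library `statistics.NormalDist` for the z critical value; the helper name `z_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def z_interval(xbar: float, sigma: float, n: int, confidence: float):
    """Confidence interval for a mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.645 for 90%
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


# Problem 4(a): xbar = 643, sigma = 247, n = 25, 90% confidence.
low, high = z_interval(643, 247, 25, 0.90)
print(f"({low:.1f}, {high:.1f})")
```

The margin of error works out to roughly 81 sq ft, so the interval runs from about 562 to about 724 sq ft.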
5. A quality-control engineer in a bakery goods plant needs to estimate
the mean weight of bags of potato chips that are packed by a machine.
He knows from experience that sigma=0.1 oz for this machine. Weights
of bags are normally distributed. A random sample of 12 bags has
a mean weight of 16.01 oz.
a. Find a 99% confidence interval for the mean weight of bags of potato chips.
xbar +/- z * sigma/sqrt(n)
16.01 +/- (2.576)(0.1)/sqrt(12)
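The expression above can be evaluated directly. This sketch uses the table value z = 2.576 shown in the working rather than computing it:

```python
import math

xbar, sigma, n = 16.01, 0.1, 12
z_99 = 2.576  # table value for 99% confidence, as used in the working above

margin = z_99 * sigma / math.sqrt(n)
interval = (xbar - margin, xbar + margin)
print(f"margin = {margin:.4f}, interval = ({interval[0]:.3f}, {interval[1]:.3f})")
```

The margin of error is a little over 0.07 oz, giving an interval of roughly 15.94 to 16.08 oz.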
2. Radio Advertising Bureau of New York reports in Radio Facts that in 1994 the mean number of radios per U.S. household was 5.6. A random sample of 45 U.S. households taken this year showed that the average number of radios owned is xbar = 5.9. Do the data provide sufficient evidence to conclude that this year’s mean number of radios per U.S. household has changed from the 1994 mean of 5.6? Assume that the standard deviation of this year’s number of radios per U.S. household is 1.9. Use the following steps to answer the question.
a. State the null and alternative hypotheses.
Ho: mu = 5.6
Ha : mu not equal to 5.6
b. Discuss the logic of conducting the hypothesis test (e.g., how will
you determine whether you have enough evidence to reject the null hypothesis).
Calculate the test statistic and the p-value. Compare the p-value to the significance level. If the p-value is less than the significance level, then reject the null hypothesis. Otherwise, fail to reject the null hypothesis.
c. Identify the distribution of the variable xbar; that is, the sampling
distribution of the mean for samples of size 45.
Because the sample size (n = 45) is large, the Central Limit Theorem says the sampling distribution of the mean is approximately normal.
d. Obtain a precise criterion for deciding whether to reject the null
hypothesis in favor of the alternative hypothesis (e.g., pick out a significance
level, alpha, that you will use for conducting the test. This can be
any value that you would like to use, but most of the time, a 5% significance
level is used).
We’ll use 5% for this example.
e. Calculate your p-value.
Test Statistic: z = (5.9 - 5.6) / (1.9/sqrt(45)) = 0.3/0.28 = 1.07
p-value = 2*Pr(Z>=|z|) = 2 * Pr(Z >= 1.07) = 2 * (1 – Pr(Z<=1.07))
= 2 * (1 – 0.8577) = 0.2846
f. Apply the criterion in part (d) to the problem and state your conclusion.
Since 0.2846 > 0.05, fail to reject the null hypothesis. The data do not provide sufficient evidence to conclude that the mean number of radios per household has changed from the 1994 value of 5.6.
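Steps (e) and (f) can be reproduced with a short two-sided z-test. Note that the hand calculation rounds the standard error to 0.28 before dividing; without that rounding the statistic is about 1.06 and the p-value about 0.29, so the numbers differ slightly from those above while the decision is the same.

```python
import math
from statistics import NormalDist

xbar, mu0, sigma, n = 5.9, 5.6, 1.9, 45
alpha = 0.05  # significance level chosen in part (d)

se = sigma / math.sqrt(n)           # standard error of the mean
z = (xbar - mu0) / se               # test statistic
# Two-sided p-value: 2 * Pr(Z >= |z|)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```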
xbar +/- z * sigma/sqrt(n)
10.1 +/- (1.96)(6.0)/sqrt(500)
10.1 +/- (1.96)(0.27)
10.1 +/- 0.5292
b. Does the value of 10.3 thousand miles fall within your confidence interval from (a)? Yes
c. Use the information from (a) and (b) to determine whether the average
distance driven last year is different from the average distance driven in 1990?
Since 10.3 is contained in the confidence interval, we conclude that the average distance driven last year is not significantly different from the average distance driven in 1990.
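The confidence-interval approach to this test can be sketched as follows, taking xbar = 10.1, sigma = 6.0, n = 500, and z = 1.96 from the working above and 10.3 as the 1990 value (all in thousands of miles). The hand calculation rounds the standard error to 0.27, so its margin (0.5292) is slightly larger than the unrounded one computed here.

```python
import math

xbar, sigma, n = 10.1, 6.0, 500
z_95 = 1.96   # table value for 95% confidence
mu_1990 = 10.3

margin = z_95 * sigma / math.sqrt(n)
low, high = xbar - margin, xbar + margin

# If the 1990 mean lies inside the interval, the two-sided test at the
# matching significance level (5%) fails to reject "no change".
contains = low <= mu_1990 <= high
print(f"({low:.2f}, {high:.2f}) contains {mu_1990}: {contains}")
```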
1. Each year, manufacturers perform mileage tests on new car models
and submit the results to the EPA. The EPA then tests the vehicles
to determine whether the manufacturers’ claims are correct. In 1998,
one company reported that a particular model equipped with a four-speed
manual transmission averaged 29 mpg on the highway. Gas mileage is
normally distributed. Suppose the EPA tested 15 of the cars.
What decision would you make regarding the gas mileage of the car?
Perform the required hypothesis test at the 5% significance level.
(NOTE: For this sample, xbar = 28.753 and s = 1.595).
a. State the hypotheses for the test.
Ho: mu = 29
Ha : mu not equal to 29
b. Calculate the test statistic. What is the distribution of the test statistic?
t = (28.753 – 29) / (1.595 / sqrt(15)) = -0.60
(It has a t-distribution with n – 1 = 14 degrees of freedom.)
c. If the p-value for this hypothesis test is 0.5400, what conclusion would you draw?
Since the p-value is greater than 0.05, you would fail to reject the null hypothesis. There is not sufficient evidence to conclude that the mean highway mileage differs from the 29 mpg the company reported.
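The test statistic in part (b) can be checked with the summary values from the NOTE. Python's standard library has no t distribution, so this sketch compares the statistic with the two-sided 5% critical value for 14 degrees of freedom (2.145, taken from a standard t table, which is an assumption of the sketch) instead of computing a p-value.

```python
import math

xbar, mu0, s, n = 28.753, 29.0, 1.595, 15

t_stat = (xbar - mu0) / (s / math.sqrt(n))
df = n - 1

t_crit = 2.145  # t-table value: upper 2.5% point with 14 degrees of freedom
reject = abs(t_stat) > t_crit

print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject}")
```

Rejecting when |t| exceeds the critical value is equivalent to rejecting when the two-sided p-value is below 0.05, so this agrees with the conclusion in part (c).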
State the hypotheses:
Ho: p = 0.3
Ha : p > 0.3
Calculate the sample proportion:
phat = 32/80 = 0.4
Calculate the test statistic:
z = (0.4 – 0.3) / sqrt((0.3*0.7)/80) = 0.1/0.0512 = 1.95
Calculate the p-value:
p-value = Pr(Z>=1.95) = 1 – Pr(Z<=1.95) = 1 – 0.9744 = 0.0256
Since 0.0256 > 0.01, fail to reject the null hypothesis. There is not sufficient evidence to conclude that the new process has a higher success rate than the current process.
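The one-sided proportion test above can be reproduced as a sketch, taking 32 successes out of 80, p0 = 0.3, and a 1% significance level from the working. Computed without rounding, the test statistic comes out a little below 2.

```python
import math
from statistics import NormalDist

successes, n, p0 = 32, 80, 0.3
alpha = 0.01  # significance level used in the conclusion above

p_hat = successes / n
se0 = math.sqrt(p0 * (1 - p0) / n)     # standard error under H0
z = (p_hat - p0) / se0
p_value = 1 - NormalDist().cdf(z)      # upper-tail test, Ha: p > 0.3
reject = p_value < alpha

print(f"phat = {p_hat}, z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```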