Intro
to Probability and Statistics
Sample
Final #1 Questions Only
Professor Brian Shydlo
Instructions:
1) Please
write your name: _____________________________________
2) There
are 10 questions totaling 100 points. Please be careful to answer all
questions. Partial credit will be given.
Also, there are two questions for extra credit (one point each). There is no partial credit on the Extra
Credit questions.
Question 1)
8 Points (Correlation and Covariance)
Question 2)
15 Points (Expected Value and Standard
Deviation of a Portfolio of Two Assets)
Question 3)
6 Points (Basic Probability and Binomial Distribution)
Question 4)
10 Points (Linear Regression, theoretical concepts)
Question 5)
23 Points (Linear Regression)
Question 6)
9 Points (Multiple Linear Regression)
Question 7)
8 Points (Testing Two Sample Means)
Question 8)
6 Points (Hypothesis Testing)
Question 9)
5 Points (ANOVA Comparison Of Means)
Question
10) 10 Points (Sample Means and Confidence Intervals)
Total
100 Points
Question
1) (8 points in Total)
You have
the following table of X and Y values.
(For example, there is a 10% chance that X will be 4 and Y will be 2,
and so on
)
X |
Y |
Probability(X,Y) |
4 |
2 |
10% |
6 |
4 |
20% |
7 |
7 |
20% |
10 |
14 |
20% |
12 |
15 |
30% |
To help you
out I have calculated the Standard Deviation and Mean (or Expected Value) of
each.
μx = 8.6
μy = 9.7
sx
= 2.8
sy
= 5.1
Question
1a) (5 Points)
What is
Covariance(X,Y)?
Answer: _______ญญญญญญญญญญญญญ____________________
Question
1b) (3 Points)
What is the
Correlation Coefficient of X,Y?
Answer: _______ญญญญญญญญญญญญญ____________________
Question
2) (15 points in total)
A certain
stock, X, has an expected return of 15% per year and a standard deviation of
25%. The Stock is Normally Distributed.
A certain bond,
Y, has an expected return of 5% per year and a standard deviation of 9%. The Bond is Normally Distributed.
They have a
correlation of -0.2.
You could
write this as:
mx
= 15%, my
= 5%, sx
= 25%, sy
= 9%, and rxy = -0.2.
Question
2a) (3 Points)
You decide
to invest $100 dollars in either X or Y or some combination of both. How do you allocate your $100 to maximize
your expected return?
Answer: _______ญญญญญญญญญญญญญ____________________
Question
2b) (3 Points)
You decide
to split your money and invest $50 in X and $50 in Y. How much money do you expect to have after
one year (your initial investment of $100 + the expected return of your
portfolio of X and Y). (The correct
answer is some number over $100
I am not asking how much more money you would
have.)
Answer: _______ญญญญญญญญญญญญญ____________________
Question
2c) (5 Points)
What is the
standard deviation and variance of the portfolio from part b?
Answer: _______ญญญญญญญญญญญญญ____________________
Question
2d) (4 Points)
What is a
95% (1.96 standard deviation) confidence interval for
your return? That is, give me a
confidence interval for 50% in X and 50% in Y.
Answer: _______ญญญญญญญญญญญญญ____________________
Question
3 (6 points in Total)
Question
3a) (3 Points)
You have a
deck of card with 52 cards. It is a
standard bridge deck, which means it has 4 suits (hearts, clubs, spades and diamonds). For each suit you have 13 cards: Numbered
cards from 2 to 10, a Jack, Queen, King and an Ace.
Question
(3a): You pick 20 out of the 52 cards at random. What are the odds you'll see the Ace of
Spades?
Answer:
_____________________________________________
Question
3b) (3 Points)
With the
deck above you number each card from 1 to 52 so that
the Ace of Spades gets a Number of 1 and so on down to the lowest card, which
gets a number of 52. Furthermore, you
designate 26 (50%) of the cards to be in the top half and 26 to be in the
bottom half.
You observe
that for the 20 cards you picked out (one or a few at a time from the desk), 5
of them are in the top half and 15 of them are from the bottom half (in terms
of the ranking). Assuming that everything
is totally random, you decide to calculate the odds of getting 5 OR FEWER cards
in the top half when you draw 20 cards randomly from the deck as described
above.
You decide
to model this as a Binomial Distribution with a p, probability of success at
50% (since half of the cards are in the top half and half in the bottom
half). The formula you come up with is:
Sum from x
= 0 to 5 this:
Where p =
.5 and n = 20
Using this
formula (six times and taking the sum), you get a probability of 2.07% or about
1 in 50.
Question
(3b): Was your analysis/modeling of the problem above correct? If not, what is the problem with it?
Answer:
_____________________________________________
Question
4) (10 points in Total)
Question
4a) (5 Points)
You have
this data (exactly two datapoints)
X |
Y |
1 |
2 |
4 |
6 |
You decide
to run a regression and you get an R-squared of 100%. Here is a chart:
Question
(4a): Why is it a bad idea to do a regression and especially make predictions
and especially create confidence intervals around those predictions when you
only have two datapoints?
Answer:
_____________________________________________
Question
4b) (5 Points)
The Random
Walk theory of the Stock Market says that you can't predict what will happen
with the stock market. You tried to disprove
this, but so far no success.
One of the
models you used was to take the log of monthly returns for the S&P 500 from
Jan 1990 to April 2003 and then shift it by one month. You ran a regression of the shifted returns
(as the X) versus the unshifted returns (as the Y)
and found there was no relation. The
R-squared was zero.
You decide
it is pointless to try to predict future market direction based on historic
data, so you decide to try to predict market volatility based on historic
volatility. You set up a model where
you calculate volatility using the previous 12 months. So you calculate the February 1991 volatility
of the stock market as the standard deviation of the log of the stock market
returns from February 1990 to January 1991 (12 Months). You calculate the March 1991 volatility of
the stock market as the standard deviation of the log of the stock market
returns from March 1990 to February 1991 (12 Months). And so on.
As before
you decide to shift the volatility by one month and try to predict the next
month's volatility based on the previous months.
This is an
excerpt of the Raw Data:
Month |
Montly_Volatility (Y) |
Montly_Volatility_Previous_Month's (X) |
Feb-91 |
4.986% |
5.294% |
Mar-91 |
5.294% |
5.289% |
Apr-91 |
5.289% |
5.179% |
May-91 |
5.179% |
4.675% |
Jun-91 |
4.675% |
4.931% |
Jul-91 |
4.931% |
5.059% |
ect
|
ect
|
ect
|
Here is
graph of the Data:
Here is the
Minitab output of the Data:
The regression equation is
Montly_Volatility = 0.00183 + 0.952
Montly_Volatility_t-1
Predictor Coef StDev T P
Constant 0.001825
0.001016 1.80
0.074
Montly_V 0.95201 0.02410 39.50
0.000
S = 0.004233 R-Sq = 91.6%
Analysis of Variance
Source DF SS MS F P
Regression 1
0.027958 0.027958 1560.14
0.000
Residual Error 143
0.002563 0.000018
Total 144 0.030520
Question
(4b): At first you are thrilled with the 91.6% R-squared. Then you realize that a
R-squared that high is to be expected
meaning that the result is trivial.
In other words,
everything about your regression was totally accurate from a technical point of
view
low p-score, linear relationship, ect
, but the
high R-squared is a trivial result.
Why?
Answer:
_____________________________________________
Question
5 (23 points in Total)
You do a
regression of X against Y. You get these
results:
Regression Analysis
The regression equation is
Y = 12.2 + 0.533 X
Predictor Coef StDev T
P
Constant 12.176 4.002 3.04
0.005
X 0.5334 0.1372 AAAA
0.001
S = 10.65 R-Sq =
BBBB
Analysis of Variance
Source DF SS MS F P
Regression 1 DDDD CCCC
15.11 0.001
Residual Error EEEE
3288.7 113.4
Total 30 5002.2
Question
5a) (3 Points)
Predict Y
when X = 20
Answer:
_____________________________________________
Question
5b) (3 Points)
Create a
95% Confidence Interval for your prediction.
Assume that 2.04 is the correct number for the t-distribution such that
95% of the data is between -2.04 +2.04 for the appropriate number of degrees of
freedom (in other words, use 2.04 when building your confidence interval).
Answer:
_____________________________________________
Question
5c) (5 Points)
List 3
possible reasons why your confidence interval may be off or inappropriate.
Answer:
_____________________________________________
Question
5d) (5 Points)
Fill in the
missing values from the Regression Output Table (1 point each).
AAAA:
_____________________________________________
BBBB:
_____________________________________________
CCCC:
_____________________________________________
DDDD:
_____________________________________________
EEEE:
_____________________________________________
Question
5e) (4 Points)
You do
another (unrelated) regression of a new X against a new Y. You have 10 datapoints. You get an R-squared of 80%, an F-score of 32
and a Standard Error of the Regression of 4.68188.
What is the
Standard Deviation of the variable Y?
Hint:
Make an ANOVA table based on the above information.
Answer:
_____________________________________________
Question
5f) (3 Points)
Your friend
does another (unrelated) regression of a new X against a new Y. Your friend get a
low p-score of .00001 and an R-squared of 4.6%.
Everything else with the regression checks out as being valid.
Your friend
decides to not use the Linear Regression model since it has such a low
R-squared. What do you tell your
friend about your friend's decision to reject the Linear Regression model due
to the low R-squared?
Answer:
_____________________________________________
Question
6 (9 points in Total)
You get
some data regarding a batter. This is
it:
Hits |
Misses |
Total at Bats |
6 |
14 |
20 |
7 |
14 |
21 |
8 |
15 |
23 |
7 |
17 |
24 |
8 |
13 |
21 |
7 |
16 |
23 |
6 |
18 |
24 |
7 |
15 |
22 |
6 |
17 |
23 |
7 |
17 |
24 |
You decide
to run a Multiple Linear Regression to use Hits and Misses to predict Total at
Bats.
Question
6a) (4 Points)
What is the
Regression Equation (2 points of credit) and R-squared (2 points of credit)?
Answer:
_____________________________________________
Question
6b) (5 Points)
You do
another (totally unrelated) Multiple Linear Regression. You have two variables, X1 and X2, which you
use to predict a third variable called Y.
This is the
output from Minitab:
Regression Analysis
The regression equation is
Y = 10.2 + 0.0605 X1 + 0.757
X2
Predictor Coef StDev T P
Constant 10.166 2.147
4.74 0.002
X1 0.06050 0.05639 1.07
0.319
X2 0.7569 0.1380 5.48
0.001
S = 0.6742 R-Sq =
82.8% R-Sq(adj) = 77.9%
Analysis of Variance
Source DF SS MS
F P
Regression 2
15.3183 7.6591 16.85
0.002
Residual Error 7
3.1817 0.4545
Total 9 18.5000
Question (6b): Comment on the table
above, specifically commenting on the appropriateness of the Multiple Linear
Regression Model based solely on the above table of data. If the model is not appropriate in this
case, what might you do to make it better?
Answer:
_____________________________________________
Question
7 (8 points in Total)
Question
7a) (5 Points)
An
interesting article was published recently that talked about why many of us are
feeling that things are getting more expensive even though the Consumer Price
Index is only 3%.
The article
describes the CPI (Consumer Price Index) as a basket of commonly purchased
items (e.g., Computer, TV, Mortgage, Clothing, furniture, cars, ect...). It goes on
to say how items relating to the maintenance of purchases (and one's self) are
not included in the CPI (e.g., Health Care, Cable TV, Gasoline, Tuition, Car
Insurance, Home Heating Oil, Train Fare, ect
) are
going up at a higher rate.
Below is
data meant to illustrate the kind of data from the article:
|
Consumer Price
Index |
Cost of Maintaining
Items |
Item 1 |
1.60% |
4.30% |
Item 2 |
2.40% |
4.60% |
Item 3 |
4% |
5.20% |
Item 4 |
2.80% |
6.90% |
Item 5 |
3.00% |
7.30% |
Item 6 |
1.00% |
8.90% |
Item 7 |
-4% |
|
Item 8 |
10% |
|
Item 9 |
6% |
|
You realize
that the average of the Non-CPI items is higher, yet you also realize that
there is a chance that all of these items may come from the same
population. You decide to run a T-test
for comparing two sample means. Here is
what you get:
t-Test:
Two-Sample Assuming Unequal Variances |
|
|
|
Consumer Price
Index |
Cost of Maintaining
Items |
Mean |
2.98% |
6.2% |
Variance |
0.001429444 |
0.0003232 |
Observations |
9 |
6 |
Hypothesized
Mean Difference |
0 |
|
df |
12 |
|
t
Stat |
-2.209418802 |
|
P(T<=t)
one-tail |
0.023664899 |
|
It is true
that the means are not equal? Please
formally state your conclusion and give reasons why.
Answer:
_____________________________________________
Question
7b) (3 Points)
Assuming
Unequal Variances is the more conservative approach: True or False?
Answer:
_____________________________________________
Question
8 (6 points in Total)
Write the
null and alternate hypothesis for the following items. This can be an English description or using
symbols.
Question
8a) (3 Points)
The ANOVA
test for comparing means:
Answer:
_____________________________________________
Question
8b) (3 Points)
Multiple
Linear Regression:
Answer:
_____________________________________________
Question
9 (5 points in Total)
Imagine you
run a customer service desk for a product with 5 customers. You record the number of Service Requests
each month and have these averages for year-to-date 2003:
Customer A |
Customer B |
Customer C |
Customer D |
Customer E |
14 |
9.4 |
8.8 |
16.2 |
10.6 |
The raw
data is:
|
Customer A |
Customer B |
Customer C |
Customer D |
Customer E |
Jan |
13 |
16 |
7 |
27 |
13 |
Feb |
10 |
11 |
5 |
21 |
6 |
Mar |
20 |
9 |
7 |
23 |
16 |
April |
18 |
5 |
18 |
7 |
14 |
May |
9 |
6 |
7 |
3 |
4 |
You run an
ANOVA test to try to determine if the means could really be the same. You get these results:
Analysis of Variance
Source DF
SS MS F
P
Factor 4
202.0 50.5 1.21
0.338
Error 20
836.0 41.8
Total 24
1038.0
Individual
95% CIs For Mean
Based on
Pooled StDev
Level N
Mean StDev ------+---------+---------+---------+
Customer 5
14.000 4.848 (---------*---------)
Customer 5
9.400 4.393 (---------*---------)
Customer 5
8.800 5.215 (---------*---------)
Customer 5
16.200 10.545 (---------*---------)
Customer 5
10.600 5.273 (---------*---------)
------+---------+---------+---------+
Pooled StDev
= 6.465 6.0 12.0
18.0 24.0
What does
the test above say about the possibility that all the means are equal?
Answer:
_____________________________________________
Question
10 (10 points in Total)
You conduct
a sample of 64 items. Your sample mean is
20 and you get a sample standard deviation of 10.
Question
10a) (3 Points)
Write out
the formula you would have used to calculate the Sample Standard Deviation?
Answer:
_____________________________________________
Question
10b) (2 Points)
What is
your point estimate for the true population mean?
Answer:
_____________________________________________
Question
10c) (5 Points)
Write out a
95% Confidence Interval for the True Population Mean. Assume that +-1.96 standard deviations hold 95%
of the data.
Answer:
_____________________________________________
Extra
Credit 1 (1 Point)
The
Standard Normal Distribution has a mean of 0, a Variance and Std Dev of 1
The
T-Distribution has a mean of zero. Could it have a Standard Deviation of Sqrt(df / df+2)?
Sqrt = Square Root
df = Degrees of Freedom
Don't just
give a True/False, explain why.
Answer:
_____________________________________________
Extra
Credit 2 (1 Point)
Half of the
numbers (meaning area under the curve) in a Chi-Squared Distribution are above
the Expected Value of the distribution.
Don't just
give a True/False, give the reasons.
Answer:
_____________________________________________