Statistics Final, Sample #1, Questions Only

Intro to Probability and Statistics

Sample Final #1 – Questions Only

Professor Brian Shydlo

Instructions:

1) Please write your name: _____________________________________

2) There are 10 questions totaling 100 points. Please be careful to answer all questions. Partial credit will be given. Also, there are two questions for extra credit (one point each). There is no partial credit on the Extra Credit questions.

Question 1) 8 Points (Correlation and Covariance)

Question 2) 15 Points (Expected Value and Standard Deviation of a Portfolio of Two Assets)

Question 3) 6 Points (Basic Probability and Binomial Distribution)

Question 4) 10 Points (Linear Regression, theoretical concepts)

Question 5) 23 Points (Linear Regression)

Question 6) 9 Points (Multiple Linear Regression)

Question 7) 8 Points (Testing Two Sample Means)

Question 8) 6 Points (Hypothesis Testing)

Question 9) 5 Points (ANOVA Comparison Of Means)

Question 10) 10 Points (Sample Means and Confidence Intervals)

Total 100 Points

Question 1) (8 points in Total)

You have the following table of X and Y values. (For example, there is a 10% chance that X will be 4 and Y will be 2, and so on…)

X	Y	Probability(X,Y)
4	2	10%
6	4	20%
7	7	20%
10	14	20%
12	15	30%

To help you out I have calculated the Standard Deviation and Mean (or Expected Value) of each.

μ_x = 8.6

μ_y = 9.7

s_x = 2.8

s_y = 5.1

Question 1a) (5 Points)

What is Covariance(X,Y)?

Answer: ___________________________

Question 1b) (3 Points)

What is the Correlation Coefficient of X,Y?

Answer: ___________________________

Question 2) (15 points in total)

A certain stock, X, has an expected return of 15% per year and a standard deviation of 25%. The Stock is Normally Distributed.

A certain bond, Y, has an expected return of 5% per year and a standard deviation of 9%. The Bond is Normally Distributed.

They have a correlation of -0.2.

You could write this as:

m_x = 15%, m_y = 5%, s_x = 25%, s_y = 9%, and r_xy = -0.2.

Question 2a) (3 Points)

You decide to invest $100 dollars in either X or Y or some combination of both. How do you allocate your $100 to maximize your expected return?

Answer: ___________________________

Question 2b) (3 Points)

You decide to split your money and invest $50 in X and $50 in Y. How much money do you expect to have after one year (your initial investment of $100 + the expected return of your portfolio of X and Y). (The correct answer is some number over $100… I am not asking how much more money you would have.)

Answer: ___________________________

Question 2c) (5 Points)

What is the standard deviation and variance of the portfolio from part b?

Answer: ___________________________

Question 2d) (4 Points)

What is a 95% (1.96 standard deviation) confidence interval for your return? That is, give me a confidence interval for 50% in X and 50% in Y.

Answer: ___________________________

Question 3 (6 points in Total)

Question 3a) (3 Points)

You have a deck of card with 52 cards. It is a standard bridge deck, which means it has 4 suits (hearts, clubs, spades and diamonds). For each suit you have 13 cards: Numbered cards from 2 to 10, a Jack, Queen, King and an Ace.

Question (3a): You pick 20 out of the 52 cards at random. What are the odds you'll see the Ace of Spades?

Answer: _____________________________________________

Question 3b) (3 Points)

With the deck above you number each card from 1 to 52 so that the Ace of Spades gets a Number of 1 and so on down to the lowest card, which gets a number of 52. Furthermore, you designate 26 (50%) of the cards to be in the top half and 26 to be in the bottom half.

You observe that for the 20 cards you picked out (one or a few at a time from the desk), 5 of them are in the top half and 15 of them are from the bottom half (in terms of the ranking). Assuming that everything is totally random, you decide to calculate the odds of getting 5 OR FEWER cards in the top half when you draw 20 cards randomly from the deck as described above.

You decide to model this as a Binomial Distribution with a p, probability of success at 50% (since half of the cards are in the top half and half in the bottom half). The formula you come up with is:

Sum from x = 0 to 5 this:

Where p = .5 and n = 20

Using this formula (six times and taking the sum), you get a probability of 2.07% or about 1 in 50.

Question (3b): Was your analysis/modeling of the problem above correct? If not, what is the problem with it?

Answer: _____________________________________________

Question 4) (10 points in Total)

Question 4a) (5 Points)

You have this data (exactly two datapoints)

X	Y
1	2
4	6

You decide to run a regression and you get an R-squared of 100%. Here is a chart:

Question (4a): Why is it a bad idea to do a regression and especially make predictions and especially create confidence intervals around those predictions when you only have two datapoints?

Answer: _____________________________________________

Question 4b) (5 Points)

The Random Walk theory of the Stock Market says that you can't predict what will happen with the stock market. You tried to disprove this, but so far no success.

One of the models you used was to take the log of monthly returns for the S&P 500 from Jan 1990 to April 2003 and then shift it by one month. You ran a regression of the shifted returns (as the X) versus the unshifted returns (as the Y) and found there was no relation. The R-squared was zero.

You decide it is pointless to try to predict future market direction based on historic data, so you decide to try to predict market volatility based on historic volatility. You set up a model where you calculate volatility using the previous 12 months. So you calculate the February 1991 volatility of the stock market as the standard deviation of the log of the stock market returns from February 1990 to January 1991 (12 Months). You calculate the March 1991 volatility of the stock market as the standard deviation of the log of the stock market returns from March 1990 to February 1991 (12 Months). And so on.

As before you decide to shift the volatility by one month and try to predict the next month's volatility based on the previous months.

This is an excerpt of the Raw Data:

Month	Montly_Volatility (Y)	Montly_Volatility_Previous_Month's (X)
Feb-91	4.986%	5.294%
Mar-91	5.294%	5.289%
Apr-91	5.289%	5.179%
May-91	5.179%	4.675%
Jun-91	4.675%	4.931%
Jul-91	4.931%	5.059%
ect…	ect…	ect…

Here is graph of the Data:

Here is the Minitab output of the Data:

The regression equation is

Montly_Volatility = 0.00183 + 0.952 Montly_Volatility_t-1

Predictor Coef StDev T P

Constant 0.001825 0.001016 1.80 0.074

Montly_V 0.95201 0.02410 39.50 0.000

S = 0.004233 R-Sq = 91.6%

Analysis of Variance

Source DF SS MS F P

Regression 1 0.027958 0.027958 1560.14 0.000

Residual Error 143 0.002563 0.000018

Total 144 0.030520

Question (4b): At first you are thrilled with the 91.6% R-squared. Then you realize that a R-squared that high is to be expected… meaning that the result is trivial.

In other words, everything about your regression was totally accurate from a technical point of view… low p-score, linear relationship, ect…, but the high R-squared is a trivial result. Why?

Answer: _____________________________________________

Question 5 (23 points in Total)

You do a regression of X against Y. You get these results:

Regression Analysis

The regression equation is

Y = 12.2 + 0.533 X

Predictor Coef StDev T P

Constant 12.176 4.002 3.04 0.005

X 0.5334 0.1372 AAAA 0.001

S = 10.65 R-Sq = BBBB

Analysis of Variance

Source DF SS MS F P

Regression 1 DDDD CCCC 15.11 0.001

Residual Error EEEE 3288.7 113.4

Total 30 5002.2

Question 5a) (3 Points)

Predict Y when X = 20

Answer: _____________________________________________

Question 5b) (3 Points)

Create a 95% Confidence Interval for your prediction. Assume that 2.04 is the correct number for the t-distribution such that 95% of the data is between -2.04 +2.04 for the appropriate number of degrees of freedom (in other words, use 2.04 when building your confidence interval).

Answer: _____________________________________________

Question 5c) (5 Points)

List 3 possible reasons why your confidence interval may be off or inappropriate.

Answer: _____________________________________________

Question 5d) (5 Points)

Fill in the missing values from the Regression Output Table (1 point each).

AAAA: _____________________________________________

BBBB: _____________________________________________

CCCC: _____________________________________________

DDDD: _____________________________________________

EEEE: _____________________________________________

Question 5e) (4 Points)

You do another (unrelated) regression of a new X against a new Y. You have 10 datapoints. You get an R-squared of 80%, an F-score of 32 and a Standard Error of the Regression of 4.68188.

What is the Standard Deviation of the variable Y?

Hint: Make an ANOVA table based on the above information.

Answer: _____________________________________________

Question 5f) (3 Points)

Your friend does another (unrelated) regression of a new X against a new Y. Your friend get a low p-score of .00001 and an R-squared of 4.6%. Everything else with the regression checks out as being valid.

Your friend decides to not use the Linear Regression model since it has such a low R-squared. What do you tell your friend about your friend's decision to reject the Linear Regression model due to the low R-squared?

Answer: _____________________________________________

Question 6 (9 points in Total)

You get some data regarding a batter. This is it:

Hits	Misses	Total at Bats
6	14	20
7	14	21
8	15	23
7	17	24
8	13	21
7	16	23
6	18	24
7	15	22
6	17	23
7	17	24

You decide to run a Multiple Linear Regression to use Hits and Misses to predict Total at Bats.

Question 6a) (4 Points)

What is the Regression Equation (2 points of credit) and R-squared (2 points of credit)?

Answer: _____________________________________________

Question 6b) (5 Points)

You do another (totally unrelated) Multiple Linear Regression. You have two variables, X1 and X2, which you use to predict a third variable called Y.

This is the output from Minitab:

Regression Analysis

The regression equation is

Y = 10.2 + 0.0605 X1 + 0.757 X2

Predictor Coef StDev T P

Constant 10.166 2.147 4.74 0.002

X1 0.06050 0.05639 1.07 0.319

X2 0.7569 0.1380 5.48 0.001

S = 0.6742 R-Sq = 82.8% R-Sq(adj) = 77.9%

Analysis of Variance

Source DF SS MS F P

Regression 2 15.3183 7.6591 16.85 0.002

Residual Error 7 3.1817 0.4545

Total 9 18.5000

Question (6b): Comment on the table above, specifically commenting on the appropriateness of the Multiple Linear Regression Model based solely on the above table of data. If the model is not appropriate in this case, what might you do to make it better?

Answer: _____________________________________________

Question 7 (8 points in Total)

Question 7a) (5 Points)

An interesting article was published recently that talked about why many of us are feeling that things are getting more expensive even though the Consumer Price Index is only 3%.

The article describes the CPI (Consumer Price Index) as a basket of commonly purchased items (e.g., Computer, TV, Mortgage, Clothing, furniture, cars, ect...). It goes on to say how items relating to the maintenance of purchases (and one's self) are not included in the CPI (e.g., Health Care, Cable TV, Gasoline, Tuition, Car Insurance, Home Heating Oil, Train Fare, ect…) are going up at a higher rate.

Below is data meant to illustrate the kind of data from the article:

	Consumer Price Index	Cost of Maintaining Items
Item 1	1.60%	4.30%
Item 2	2.40%	4.60%
Item 3	4%	5.20%
Item 4	2.80%	6.90%
Item 5	3.00%	7.30%
Item 6	1.00%	8.90%
Item 7	-4%
Item 8	10%
Item 9	6%

You realize that the average of the Non-CPI items is higher, yet you also realize that there is a chance that all of these items may come from the same population. You decide to run a T-test for comparing two sample means. Here is what you get:

t-Test: Two-Sample Assuming Unequal Variances
	Consumer Price Index	Cost of Maintaining Items
Mean	2.98%	6.2%
Variance	0.001429444	0.0003232
Observations	9	6
Hypothesized Mean Difference	0
df	12
t Stat	-2.209418802
P(T<=t) one-tail	0.023664899

It is true that the means are not equal? Please formally state your conclusion and give reasons why.

Answer: _____________________________________________

Question 7b) (3 Points)

Assuming Unequal Variances is the more conservative approach: True or False?

Answer: _____________________________________________

Question 8 (6 points in Total)

Write the null and alternate hypothesis for the following items. This can be an English description or using symbols.

Question 8a) (3 Points)

The ANOVA test for comparing means:

Answer: _____________________________________________

Question 8b) (3 Points)

Multiple Linear Regression:

Answer: _____________________________________________

Question 9 (5 points in Total)

Imagine you run a customer service desk for a product with 5 customers. You record the number of Service Requests each month and have these averages for year-to-date 2003:

Customer A	Customer B	Customer C	Customer D	Customer E
14	9.4	8.8	16.2	10.6

The raw data is:

	Customer A	Customer B	Customer C	Customer D	Customer E
Jan	13	16	7	27	13
Feb	10	11	5	21	6
Mar	20	9	7	23	16
April	18	5	18	7	14
May	9	6	7	3	4

You run an ANOVA test to try to determine if the means could really be the same. You get these results:

Analysis of Variance

Source DF SS MS F P

Factor 4 202.0 50.5 1.21 0.338

Error 20 836.0 41.8

Total 24 1038.0

Individual 95% CIs For Mean

Based on Pooled StDev

Level N Mean StDev ------+---------+---------+---------+

Customer 5 14.000 4.848 (---------*---------)

Customer 5 9.400 4.393 (---------*---------)

Customer 5 8.800 5.215 (---------*---------)

Customer 5 16.200 10.545 (---------*---------)

Customer 5 10.600 5.273 (---------*---------)

------+---------+---------+---------+

Pooled StDev = 6.465 6.0 12.0 18.0 24.0

What does the test above say about the possibility that all the means are equal?

Answer: _____________________________________________

Question 10 (10 points in Total)

You conduct a sample of 64 items. Your sample mean is 20 and you get a sample standard deviation of 10.

Question 10a) (3 Points)

Write out the formula you would have used to calculate the Sample Standard Deviation?

Answer: _____________________________________________

Question 10b) (2 Points)

What is your point estimate for the true population mean?

Answer: _____________________________________________

Question 10c) (5 Points)

Write out a 95% Confidence Interval for the True Population Mean. Assume that +-1.96 standard deviations hold 95% of the data.

Answer: _____________________________________________

Extra Credit 1 (1 Point)

The Standard Normal Distribution has a mean of 0, a Variance and Std Dev of 1

The T-Distribution has a mean of zero. Could it have a Standard Deviation of Sqrt(df / df+2)?

Sqrt = Square Root

df = Degrees of Freedom

Don't just give a True/False, explain why.

Answer: _____________________________________________

Extra Credit 2 (1 Point)

Half of the numbers (meaning area under the curve) in a Chi-Squared Distribution are above the Expected Value of the distribution.

Don't just give a True/False, give the reasons.

Answer: _____________________________________________