Intro to Probability and Statistics

 

Sample Final #1 – Questions Only

Professor Brian Shydlo

brian@shydlo.com

 

 

 

 

 

 

 

Instructions:

1) Please write your name: _____________________________________

 

2) There are 10 questions totaling 100 points. Please be careful to answer all questions. Partial credit will be given.  Also, there are two questions for extra credit (one point each).   There is no partial credit on the Extra Credit questions.

 

Question 1) 8 Points  (Correlation and Covariance)

Question 2) 15 Points  (Expected Value and Standard Deviation of a Portfolio of Two Assets)

Question 3) 6 Points (Basic Probability and Binomial Distribution)

Question 4) 10 Points (Linear Regression, theoretical concepts)

Question 5) 23 Points (Linear Regression)

Question 6) 9 Points (Multiple Linear Regression)

Question 7) 8 Points (Testing Two Sample Means)

Question 8) 6 Points (Hypothesis Testing)

Question 9) 5 Points (ANOVA Comparison Of Means)

Question 10) 10 Points (Sample Means and Confidence Intervals)

Total         100 Points

 


Question 1) (8 points in Total)

 

You have the following table of X and Y values.  (For example, there is a 10% chance that X will be 4 and Y will be 2, and so on…)

 

X       Y       Probability(X,Y)
4       2       10%
6       4       20%
7       7       20%
10      14      20%
12      15      30%

 

To help you out I have calculated the Standard Deviation and Mean (or Expected Value) of each.

μx = 8.6

μy = 9.7

σx = 2.8

σy = 5.1
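As a check on the helper values above, here is a minimal Python sketch that recomputes the means and standard deviations from the probability table. It assumes the standard deviations are population values weighted by the stated probabilities; the names `table`, `mean` and `std_dev` are illustrative only.

```python
import math

# Joint distribution from the table above: (x, y, probability)
table = [(4, 2, 0.10), (6, 4, 0.20), (7, 7, 0.20), (10, 14, 0.20), (12, 15, 0.30)]

def mean(values_probs):
    """Expected value: sum of value * probability."""
    return sum(v * p for v, p in values_probs)

def std_dev(values_probs):
    """Probability-weighted standard deviation: sqrt of E[(V - mu)^2]."""
    mu = mean(values_probs)
    return math.sqrt(sum(p * (v - mu) ** 2 for v, p in values_probs))

# Marginal distributions of X and Y (each row keeps its joint probability)
x_dist = [(x, p) for x, _, p in table]
y_dist = [(y, p) for _, y, p in table]

mu_x, mu_y = mean(x_dist), mean(y_dist)
sd_x, sd_y = std_dev(x_dist), std_dev(y_dist)

print(round(mu_x, 1), round(mu_y, 1), round(sd_x, 1), round(sd_y, 1))
# 8.6 9.7 2.8 5.1
```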

 

 

Question 1a)  (5 Points)

 

What is Covariance(X,Y)?

 

 

 

 

 

 

 

 

 

                            Answer: ________________________________________

 

Question 1b)  (3 Points)

What is the Correlation Coefficient of X,Y?

 

 

 

 

 

 

 

 

                            Answer: ________________________________________

Question 2) (15 points in total)

 

A certain stock, X, has an expected return of 15% per year and a standard deviation of 25%.  The stock's returns are Normally Distributed.

 

A certain bond, Y, has an expected return of 5% per year and a standard deviation of 9%.  The bond's returns are Normally Distributed.

 

They have a correlation of -0.2.

 

You could write this as:

μx = 15%, μy = 5%, σx = 25%, σy = 9%, and ρxy = -0.2.

 

 

Question 2a)  (3 Points)

You decide to invest $100 in either X or Y or some combination of both.  How do you allocate your $100 to maximize your expected return?

 

 

 

 

 

 

 

                            Answer: ________________________________________

 

Question 2b)  (3 Points)

You decide to split your money and invest $50 in X and $50 in Y.  How much money do you expect to have after one year (your initial investment of $100 + the expected return of your portfolio of X and Y)?  (The correct answer is some number over $100; I am not asking how much more money you would have.)

 

 

 

 

 

 

 

 

                            Answer: ________________________________________

 


Question 2c)  (5 Points)

What are the standard deviation and variance of the portfolio from part b?

 

 

 

 

 

 

 

 

 

 

 

                            Answer: ________________________________________

 

 

Question 2d)  (4 Points)

What is a 95% (1.96 standard deviation) confidence interval for your return?   That is, give me a confidence interval for the portfolio with 50% in X and 50% in Y.

 

 

 

 

 

 

 

 

 

 

                            Answer: ________________________________________

 


Question 3 (6 points in Total)

Question 3a) (3 Points)

You have a deck of 52 cards.  It is a standard bridge deck, which means it has 4 suits (hearts, clubs, spades and diamonds).  For each suit you have 13 cards: Numbered cards from 2 to 10, a Jack, Queen, King and an Ace.

 

Question (3a): You pick 20 out of the 52 cards at random.  What are the odds you'll see the Ace of Spades?

 

 

 

Answer: _____________________________________________

 

Question 3b) (3 Points)

With the deck above you number each card from 1 to 52 so that the Ace of Spades gets a Number of 1 and so on down to the lowest card, which gets a number of 52.  Furthermore, you designate 26 (50%) of the cards to be in the top half and 26 to be in the bottom half.

You observe that for the 20 cards you picked out (one or a few at a time from the deck), 5 of them are in the top half and 15 of them are in the bottom half (in terms of the ranking).   Assuming that everything is totally random, you decide to calculate the odds of getting 5 OR FEWER cards in the top half when you draw 20 cards randomly from the deck as described above.  

You decide to model this as a Binomial Distribution with p, the probability of success, at 50% (since half of the cards are in the top half and half in the bottom half).   The formula you come up with is:

$$P(X \le 5) = \sum_{x=0}^{5} \binom{n}{x} p^{x} (1-p)^{n-x}$$

where p = 0.5 and n = 20.

Using this formula (six times and taking the sum), you get a probability of 2.07% or about 1 in 50.
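The arithmetic behind the 2.07% figure can be reproduced with a short Python sketch. It evaluates only the binomial sum described above (x from 0 to 5, n = 20, p = 0.5) and takes no position on whether the binomial model is the right one. The helper name `binom_pmf` is illustrative.

```python
from math import comb

n, p = 20, 0.5

def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Add up the six terms for x = 0, 1, ..., 5
prob_at_most_5 = sum(binom_pmf(x, n, p) for x in range(6))

print(f"{prob_at_most_5:.2%}")  # about 2.07%, roughly 1 in 50
```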

 

 

Question (3b): Was your analysis/modeling of the problem above correct?   If not, what is the problem with it?

 

 

 

 

 

 

 

 

Answer: _____________________________________________


Question 4) (10 points in Total)

 

Question 4a) (5 Points)

You have this data (exactly two datapoints):

X       Y
1       2
4       6

 

 

You decide to run a regression and you get an R-squared of 100%.  Here is a chart:

 

 

Question (4a): Why is it a bad idea to run a regression, and especially to make predictions and create confidence intervals around those predictions, when you only have two datapoints?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________


Question 4b) (5 Points)

The Random Walk theory of the Stock Market says that you can't predict what will happen with the stock market.  You have tried to disprove this, but so far without success. 

 

One of the models you used was to take the log of monthly returns for the S&P 500 from Jan 1990 to April 2003 and then shift it by one month.  You ran a regression of the shifted returns (as the X) versus the unshifted returns (as the Y) and found there was no relation.  The R-squared was zero. 

 

You decide it is pointless to try to predict future market direction based on historic data, so you decide to try to predict market volatility based on historic volatility.   You set up a model where you calculate volatility using the previous 12 months.  So you calculate the February 1991 volatility of the stock market as the standard deviation of the log of the stock market returns from February 1990 to January 1991 (12 Months).   You calculate the March 1991 volatility of the stock market as the standard deviation of the log of the stock market returns from March 1990 to February 1991 (12 Months).  And so on.

 

As before, you decide to shift the volatility by one month and try to predict the next month's volatility based on the previous month's. 
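The construction described above can be sketched in Python as follows. This is only an illustration of the mechanics: the price series is synthetic (randomly generated, not S&P 500 data), the sample standard deviation is an assumption since the text does not say which form was used, and the names `trailing_volatility`, `x` and `y` are hypothetical.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic month-end index levels, for illustration only (NOT S&P 500 data).
prices = [100.0]
for _ in range(36):
    prices.append(prices[-1] * math.exp(random.gauss(0.005, 0.04)))

# Log of monthly returns: ln(P_t / P_{t-1})
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]

WINDOW = 12

def trailing_volatility(returns, window=WINDOW):
    """Volatility for each month = std dev of the `window` returns before it.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    return [
        statistics.stdev(returns[t - window:t])
        for t in range(window, len(returns) + 1)
    ]

vol = trailing_volatility(log_returns)

# Shift by one month: pair each month's volatility (Y)
# with the previous month's volatility (X).
y = vol[1:]
x = vol[:-1]

for prev, cur in list(zip(x, y))[:3]:
    print(f"X = {prev:.3%}   Y = {cur:.3%}")
```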

This is an excerpt of the Raw Data:

Month       Montly_Volatility (Y)     Montly_Volatility_Previous_Month's (X)
Feb-91      4.986%                    5.294%
Mar-91      5.294%                    5.289%
Apr-91      5.289%                    5.179%
May-91      5.179%                    4.675%
Jun-91      4.675%                    4.931%
Jul-91      4.931%                    5.059%
etc…        etc…                      etc…

 

Here is a graph of the Data:


Here is the Minitab output of the Data:

 

The regression equation is

Montly_Volatility = 0.00183 + 0.952 Montly_Volatility_t-1

 

Predictor        Coef       StDev          T        P

Constant     0.001825    0.001016       1.80    0.074

Montly_V      0.95201     0.02410      39.50    0.000

 

S = 0.004233    R-Sq = 91.6%    

 

Analysis of Variance

Source            DF          SS          MS         F        P

Regression         1    0.027958    0.027958   1560.14    0.000

Residual Error   143    0.002563    0.000018

Total            144    0.030520

 

Question (4b): At first you are thrilled with the 91.6% R-squared.  Then you realize that an R-squared that high is to be expected, meaning that the result is trivial. 

In other words, everything about your regression was totally accurate from a technical point of view (low p-value, linear relationship, etc.), but the high R-squared is a trivial result.   Why?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________


Question 5 (23 points in Total)

 

You run a regression of Y on X.  You get these results:

 

Regression Analysis

 

The regression equation is

Y = 12.2 + 0.533 X

 

Predictor        Coef       StDev          T        P

Constant       12.176       4.002       3.04    0.005

X              0.5334      0.1372       AAAA    0.001

 

S = 10.65       R-Sq = BBBB    

 

Analysis of Variance

 

Source            DF          SS          MS         F        P

Regression         1        DDDD        CCCC     15.11    0.001

Residual Error   EEEE     3288.7       113.4

Total             30      5002.2

 

 

Question 5a) (3 Points)

Predict Y when X = 20

 

 

 

 

 

 

Answer: _____________________________________________

 

 

Question 5b) (3 Points)

Create a 95% Confidence Interval for your prediction.  Assume that 2.04 is the correct number for the t-distribution such that 95% of the data is between -2.04 and +2.04 for the appropriate number of degrees of freedom (in other words, use 2.04 when building your confidence interval).

 

 

 

 

 

 

 

Answer: _____________________________________________


Question 5c) (5 Points)

List 3 possible reasons why your confidence interval may be off or inappropriate.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________

 

Question 5d) (5 Points)

Fill in the missing values from the Regression Output Table (1 point each).

 

 

 

AAAA: _____________________________________________

 

BBBB: _____________________________________________

 

CCCC: _____________________________________________

 

DDDD: _____________________________________________

 

EEEE: _____________________________________________

 

 


Question 5e) (4 Points)

You do another (unrelated) regression of a new X against a new Y.  You have 10 datapoints.  You get an R-squared of 80%, an F-score of 32 and a Standard Error of the Regression of 4.68188.

 

What is the Standard Deviation of the variable Y?  

Hint: Make an ANOVA table based on the above information.

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________

 

Question 5f) (3 Points)

Your friend does another (unrelated) regression of a new X against a new Y.  Your friend gets a low p-value of 0.00001 and an R-squared of 4.6%.  Everything else with the regression checks out as being valid.  

Your friend decides to not use the Linear Regression model since it has such a low R-squared.    What do you tell your friend about your friend's decision to reject the Linear Regression model due to the low R-squared?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________

Question 6 (9 points in Total)

 

You get some data regarding a batter.  This is it: 

 

Hits    Misses    Total at Bats
6       14        20
7       14        21
8       15        23
7       17        24
8       13        21
7       16        23
6       18        24
7       15        22
6       17        23
7       17        24

 

You decide to run a Multiple Linear Regression to use Hits and Misses to predict Total at Bats.

 

Question 6a) (4 Points)

What is the Regression Equation (2 points of credit) and R-squared (2 points of credit)?

 

 

 

 

 

 

 

Answer: _____________________________________________

 

 


Question 6b) (5 Points)

You do another (totally unrelated) Multiple Linear Regression.  You have two variables, X1 and X2, which you use to predict a third variable called Y.

This is the output from Minitab:

 

Regression Analysis

 

The regression equation is

Y = 10.2 + 0.0605 X1 + 0.757 X2

 

Predictor        Coef       StDev          T        P

Constant       10.166       2.147       4.74    0.002

X1            0.06050     0.05639       1.07    0.319

X2             0.7569      0.1380       5.48    0.001

 

S = 0.6742      R-Sq = 82.8%     R-Sq(adj) = 77.9%

 

Analysis of Variance

 

Source            DF          SS          MS         F        P

Regression         2     15.3183      7.6591     16.85    0.002

Residual Error     7      3.1817      0.4545

Total              9     18.5000
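When reading a table like this, it can help to see how the summary numbers relate to one another. The sketch below recomputes several of them from the sums of squares and coefficients printed above, using the usual regression identities. It is a consistency check on the printed output only, and the variable names are illustrative.

```python
# Values copied from the Minitab output above.
ss_regression, df_regression = 15.3183, 2
ss_error, df_error = 3.1817, 7
ss_total, df_total = 18.5000, 9

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

f_stat = ms_regression / ms_error      # F = MS regression / MS error
s = ms_error ** 0.5                    # S = sqrt(MS error)
r_sq = ss_regression / ss_total        # R-Sq = SS regression / SS total
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)  # R-Sq(adj)

# Each predictor's T is its Coef divided by its StDev.
t_x1 = 0.06050 / 0.05639
t_x2 = 0.7569 / 0.1380

print(f"F = {f_stat:.2f}, S = {s:.4f}")
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
print(f"T(X1) = {t_x1:.2f}, T(X2) = {t_x2:.2f}")
```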

 

Question (6b): Comment on the table above, specifically commenting on the appropriateness of the Multiple Linear Regression Model based solely on the above table of data.   If the model is not appropriate in this case, what might you do to make it better?

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________

 

 

 

 

 

 


Question 7 (8 points in Total)

Question 7a) (5 Points)

An interesting article was published recently that talked about why many of us are feeling that things are getting more expensive even though the Consumer Price Index is rising by only 3%.  

 

The article describes the CPI (Consumer Price Index) as a basket of commonly purchased items (e.g., Computer, TV, Mortgage, Clothing, furniture, cars, etc.).   It goes on to say that items relating to the maintenance of purchases (and one's self), which are not included in the CPI (e.g., Health Care, Cable TV, Gasoline, Tuition, Car Insurance, Home Heating Oil, Train Fare, etc.), are going up at a higher rate.

 

Below is data meant to illustrate the kind of data from the article:

 

            Consumer Price Index    Cost of Maintaining Items
Item 1      1.60%                   4.30%
Item 2      2.40%                   4.60%
Item 3      4%                      5.20%
Item 4      2.80%                   6.90%
Item 5      3.00%                   7.30%
Item 6      1.00%                   8.90%
Item 7      -4%
Item 8      10%
Item 9      6%

 

You realize that the average of the Non-CPI items is higher, yet you also realize that there is a chance that all of these items may come from the same population.  You decide to run a T-test for comparing two sample means.  Here is what you get:

t-Test: Two-Sample Assuming Unequal Variances

                                Consumer Price Index    Cost of Maintaining Items
Mean                            2.98%                   6.2%
Variance                        0.001429444             0.0003232
Observations                    9                       6
Hypothesized Mean Difference    0
df                              12
t Stat                          -2.209418802
P(T<=t) one-tail                0.023664899
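For reference, the t Stat and df in this output can be reproduced from the item data with a short Python sketch of the unequal-variance (Welch) calculation. It works in percent units, which does not change the t statistic. The function name `welch_t` is illustrative, and the sketch stops at the test statistic without drawing a conclusion.

```python
import math
import statistics

# Item data from the table above, in percent
cpi = [1.60, 2.40, 4.0, 2.80, 3.00, 1.00, -4.0, 10.0, 6.0]
maintenance = [4.30, 4.60, 5.20, 6.90, 7.30, 8.90]

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = statistics.variance(a) / len(a)   # sample variance / n
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(cpi, maintenance)

# t is about -2.209; df is about 12.1, which the output reports as 12
print(round(t_stat, 3), round(df, 1))
```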

 

 


Is it true that the means are not equal?  Please formally state your conclusion and give reasons why.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________

 

 

Question 7b) (3 Points)

 

Assuming Unequal Variances is the more conservative approach: True or False?

 

 

 

 

 

Answer: _____________________________________________

 

 

 

 

 

 


Question 8 (6 points in Total)

 

Write the null and alternative hypotheses for the following items.  This can be an English description or using symbols.

 

Question 8a) (3 Points)

 

 

The ANOVA test for comparing means:

 

 

 

 

 

 

Answer: _____________________________________________

 

Question 8b) (3 Points)

 

Multiple Linear Regression:

 

 

 

 

 

 

Answer: _____________________________________________

 

 


Question 9 (5 points in Total)

 

Imagine you run a customer service desk for a product with 5 customers.  You record the number of Service Requests each month and have these averages for year-to-date 2003:

 

Customer A    Customer B    Customer C    Customer D    Customer E
14            9.4           8.8           16.2          10.6

 

The raw data is:

 

         Customer A    Customer B    Customer C    Customer D    Customer E
Jan      13            16            7             27            13
Feb      10            11            5             21            6
Mar      20            9             7             23            16
April    18            5             18            7             14
May      9             6             7             3             4

 

You run an ANOVA test to try to determine if the means could really be the same.  You get these results:

 

Analysis of Variance

Source     DF        SS        MS        F        P

Factor      4     202.0      50.5     1.21    0.338

Error      20     836.0      41.8

Total      24    1038.0

                                   Individual 95% CIs For Mean

                                   Based on Pooled StDev

Level       N      Mean     StDev  ------+---------+---------+---------+

Customer    5    14.000     4.848           (---------*---------)

Customer    5     9.400     4.393    (---------*---------)

Customer    5     8.800     5.215   (---------*---------)

Customer    5    16.200    10.545               (---------*---------)

Customer    5    10.600     5.273      (---------*---------)

                                   ------+---------+---------+---------+

Pooled StDev =    6.465                6.0      12.0      18.0      24.0
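The sums of squares, F statistic and pooled standard deviation in this output can be recomputed from the raw data with a minimal one-way ANOVA sketch in Python. The dictionary `requests` and the other variable names are illustrative. The sketch reproduces the table only and leaves the interpretation to the reader.

```python
# Monthly service requests (Jan to May) for each customer, from the raw data
requests = {
    "A": [13, 10, 20, 18, 9],
    "B": [16, 11, 9, 5, 6],
    "C": [7, 5, 7, 18, 7],
    "D": [27, 21, 23, 7, 3],
    "E": [13, 6, 16, 14, 4],
}

all_values = [v for group in requests.values() for v in group]
grand_mean = sum(all_values) / len(all_values)

ss_factor = 0.0  # variation of the group means around the grand mean
ss_error = 0.0   # variation of observations around their own group mean
for group in requests.values():
    group_mean = sum(group) / len(group)
    ss_factor += len(group) * (group_mean - grand_mean) ** 2
    ss_error += sum((v - group_mean) ** 2 for v in group)

df_factor = len(requests) - 1                # 5 groups - 1
df_error = len(all_values) - len(requests)   # 25 observations - 5 groups
ms_factor = ss_factor / df_factor
ms_error = ss_error / df_error
f_stat = ms_factor / ms_error
pooled_sd = ms_error ** 0.5

print(f"SS factor = {ss_factor:.1f}, SS error = {ss_error:.1f}")
print(f"F = {f_stat:.2f}, pooled StDev = {pooled_sd:.3f}")
```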

 

 

What does the test above say about the possibility that all the means are equal?

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________


Question 10 (10 points in Total)

 

You take a sample of 64 items.  Your sample mean is 20 and you get a sample standard deviation of 10.

 

Question 10a) (3 Points)

 

Write out the formula you would have used to calculate the Sample Standard Deviation.

 

 

 

 

 

 

 

Answer: _____________________________________________

 

Question 10b) (2 Points)

 

What is your point estimate for the true population mean?

 

 

 

 

Answer: _____________________________________________

 

 

Question 10c) (5 Points)

 

Write out a 95% Confidence Interval for the True Population Mean.  Assume that ±1.96 standard deviations hold 95% of the data.

 

 

 

 

 

Answer: _____________________________________________

 

 

 


Extra Credit 1  (1 Point)

 

The Standard Normal Distribution has a mean of 0 and a Variance and Std Dev of 1.

The T-Distribution has a mean of zero.  Could it have a Standard Deviation of Sqrt(df / (df + 2))? 

Sqrt = Square Root

df = Degrees of Freedom

 

Don't just give a True/False, explain why.

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________

 

 

 

 

 

Extra Credit 2  (1 Point)

 

True or False: Half of the numbers (meaning area under the curve) in a Chi-Squared Distribution are above the Expected Value of the distribution.

Don't just give a True/False, give the reasons.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Answer: _____________________________________________