MicrosoftВ® Office ExcelВ® 2007: Data Analysis and Business Modeling (Bpg -- Other)
-
Our factory manufactures three products. How can I predict the cost of running the factory based on the number of units produced?
-
How accurate are my forecasts for predicting monthly cost based on units produced?
-
I know how to use the Data Analysis command to run a multiple regression. Is there a way to run the regression without using this command and place the regression’s results in the same worksheet as the data?
-
Our factory manufactures three products. How can I predict the cost of running the factory based on the number of units produced?
-
In Chapters 43 through 45, I described how to use the trend curve in Microsoft Office Excel 2007 to predict one variable (called y, or the dependent variable) from another variable (called x, or the independent variable). However, we often want to use more than one independent variable (called the independent variables x1, x2, … xn) to predict the value of a dependent variable. In these cases, we can use either the multiple regression option in the Excel Data Analysis feature or the LINEST function to estimate the relationship we want.
-
Multiple regression assumes that the relationship between y and x1, x2, … xn has the form
Y=Constant+B1X1+B2X2+...BnXn
-
Excel calculates the values of Constant, B1, B2, … Bn to make the predictions from this equation as accurate (in the sense of minimizing the sum of squared errors) as possible. The following example illustrates how multiple regression works.
-
The Data worksheet in in the file Mrcostest.xlsx (see Figure 47-1 on the next page) contains the cost of running a plant over 19 months as well as the number of units of Product A, Product B, and Product C produced during each month.
Figure 47-1: Data for predicting monthly operating costs -
We would like to find the best forecast for monthly operating cost that has the form (which I’ll refer to as Form 1)
Monthly operating cost=Constant+B1*(Units A produced)+B2*(Units B produced)+B3*(Units C produced)
-
The Excel Data Analysis feature can find the equation for this form that best fits our data. Click Data Analysis in the Analysis group on the Data tab, and then select Regression. Fill in the Regression dialog box as shown in Figure 47-2.
Figure 47-2: Regression dialog box Note If you haven’t previously installed the Analysis Toolpak, click the Microsoft Office Button, click the Excel Options button, and select Add-Ins. With Excel Add-ins in the Manage box, click Go, check the Analysis ToolPak box, and click OK.
-
The Input Y range, B3:B22, contains the dependent variable or data (including the Cost label) that we want to predict.
-
The Input X range, C3:E22, contains the data or independent variables (including the labels A Made, B Made, and C Made) that we want to use in the prediction. Excel has a limit of 15 independent variables, which must be in adjacent columns.
-
Because both the Input X and Input Y ranges include labels, I’ve checked the Labels box.
-
I chose to place the output in a new worksheet titled Regression.
-
Checking the Residuals box causes Excel to list, for each observation, the prediction from Form 1 and the residual, which equals the observed cost minus the predicted cost.
-
-
After clicking OK in the Regression dialog box, we obtain the output shown in Figures 47-3 and 47-4.
Figure 47-3: Original multiple regression output Figure 47-4: Original multiple regression residual output
What is the best prediction equation? We find in the Coefficients column (column B of the summary output) that the best equation of Form 1 that can be used to predict monthly cost is
Predicted monthly cost=35,102.90+2.07(AMade)+4.18(BMade)+4.79(CMade)
A natural question is, which of our independent variables are useful for predicting monthly cost? After all, if we had chosen the number of games won by the Seattle Mariners during a one month period as an independent variable, we would expect that this variable would have little effect on predicting monthly operating cost. When you run a regression, each independent variable has a p-value between 0 and 1. Any independent variable with a p-value (see column E) of less than or equal to 0.15 is considered to be useful for predicting the dependent variable. Thus, the smaller the p-value, the higher the predictive power of the independent variable. Our three independent variables have p-values of 0.23 (for A Made), 0.025 (for B Made), and 0.017 (for C Made). These p-values may be interpreted as follows:
-
When we use B Made and C Made to predict monthly operating cost, we have a 77 percent chance (1–0.23) that A Made adds predictive power.
-
When we use A Made and C Made to predict monthly operating cost, there is a 97.5 percent chance (1–0.025) that B Made adds predictive power.
-
When we use A Made and B Made to predict monthly operating cost, there is a 98.3 percent chance (1–0.017) that C Made adds predictive power.
Our p-values indicate that A Made does not add much predictive power to B Made and C Made, which means that if we know B Made and C Made, we can predict monthly operating cost about as well as we can if we include A Made as an independent variable. Therefore, we can opt to delete A Made as an independent variable and use just B Made and C Made for our prediction. I copied our data to the worksheet titled A Removed and deleted the A Made column (column C). I then adjusted the Input X range to be C3:D22. In the worksheet titled NoA, you can see the regression output shown in Figure 47-5 and Figure 47-6.
We see that A Made and B Made each have very low p-values (0.002 and 0.007, respectively). These values indicate that both these independent variables have useful predictive power. Using the new coefficients in column B, we can now predict monthly operating cost using the equation
Predicted monthly operating cost=35,475.3+5.32(BMade)+5.42(CMade)
-
How accurate are my forecasts for predicting monthly cost based on units produced?
-
In the regression output in cell B5 of the NoA worksheet (see Figure 47-5), we find that R2 equals 0.61. An R2 value such as this one means that together, B Made and C Made explain 61 percent of the variation in monthly operating costs. Notice that in our original regression, which included A Made as an independent variable, R2 equals 0.65. This indicates that the addition of A Made as an independent variable explains only 4 percent more variation in monthly operating costs. Having such a minor difference is consistent with the decision to delete A Made as an independent variable.
-
In the regression output in cell B7 in the NoA worksheet, we find that the standard error for the regression with B Made and C Made as independent variables is 1274. We expect about 68 percent of our multiple regression forecasts to be accurate within one standard error, and 95 percent of our multiple regression forecasts to be accurate within two standard errors. Any forecast that differs from the actual value by more than two standard errors is considered an outlier. Thus, if our forecasted operating cost is in error by more than $2,548 (2*1274), we consider that observation to be an outlier.
-
In the Residual portion of our output, shown earlier in Figure 47-6, we are given for each observation the predicted cost and the residual, which equals the actual cost less the predicted cost. For the first observation, for example, we predict a cost of $43,381.10. Our residual of $1,057.95 indicates that our prediction of actual cost was too low by $1,057.95.
-
I know how to use the Data Analysis command to run a multiple regression. Is there a way to run the regression without using this command and place the regression’s results in the same worksheet as the data?
-
The Excel LINEST function can be used to insert the results of a regression analysis directly into a workbook. To use the LINEST function when there are m independent variables, begin by selecting a blank cell range consisting of five rows and m+1 columns, where you want LINEST to deposit the results. In the A Removed worksheet, I used the range F5:H9. The syntax of the LINEST function is
LINEST(KnownYs,KnownXs,True,True)
-
If the third argument is changed to False, Excel will estimate the equation without a constant term. Changing the fourth argument to False causes the LINEST function to omit many regression computations and return only the multiple regression equation.
-
With the upper-left cell of the target range selected (F5 in this example), select the range of desired size (in our case, the cell range F5:H9), and then enter the formula =LINEST(B4:B22,C4:D22,True,True). At this point, do not press Enter! LINEST is an array function (see Chapter 74, “Array Functions and Formulas,” for further discussion of array functions), so you must hold down Ctrl+Shift and then press Enter for the function to work correctly. After using this key combination, we obtain the results shown in Figure 47-7.
Figure 47-7: Using the LINEST function to calculate a multiple regression -
In row 5, we find our prediction equation (coefficients read right to left, starting with the intercept) of Predicted monthly cost=35,475.3+5.32(B Made)+5.43(C Made). Row 6 contains standard errors for each coefficient estimate, but these are not too relevant. Cell F7 contains our R2 value of 0.61, and cell G7 contains the regression standard error of 1274. Rows 8 and 9 contain information (F statistic, degrees of freedom, sum of squares regression, and sum of squares residual) that is also not very relevant.
Note Problems that you can work with to learn more about multiple regression are available at the end of Chapter 49.
Категории