What values to look at in cross validated linear regression in DAAG package
I performed the following on a data set that contains 151 variables with 161 observations:-
    > library(DAAG)
    > fit <- lm(RT..seconds. ~ ., data = cadets)
    > cv.lm(df = cadets, fit, m = 10)
And got the following results:-
    fold 1
    Observations in test set: 16
                       7      11      12      24      33      38      52      67      72
    Predicted       49.6    44.1    26.4    39.8    53.3   40.33    47.8    56.7    58.5
    cvpred         575.0  -113.2   640.7 -1045.8   876.7   -5.93  2183.0  -129.7   212.6
    RT..seconds.    42.0    44.0    44.0    45.0    45.0   46.00    49.0    56.0    58.0
    CV residual   -533.0   157.2  -596.7  1090.8  -831.7   51.93 -2134.0   185.7  -154.6
What I want to do is compare the predicted results to the actual experimental results, so I can plot the two against each other to show how similar they are. Am I right in assuming I would do this by using the values in the Predicted row as my predicted results, and not the cvpred row?
I only ask this as when I performed the very same thing in the caret package, the predicted and the observed values came out to be far more different from one another:-
    library(caret)
    ctrl <- trainControl(method = "cv", savePred = T, classProb = T)
    mod <- train(RT..seconds. ~ ., data = cadets, method = "lm", trControl = ctrl)
    mod$pred
         pred obs rowIndex .parameter Resample
    1   141.2  42        6       none   Fold01
    2  -504.0  42        7       none   Fold01
    3  1196.1  44       16       none   Fold01
    4    45.0  45       27       none   Fold01
    5   262.2  45       35       none   Fold01
    6   570.9  52       58       none   Fold01
    7  -166.3  53       61       none   Fold01
    8 -1579.1  59       77       none   Fold01
    9  2699.0  60       79       none   Fold01
The model shouldn't be this inaccurate, as I originally started from 1664 variables and reduced them using random forest so that only variables with a variable importance greater than 1 were used, which massively reduced my dataset from 162 * 1664 to 162 * 151.
If someone could explain this to me I'd be grateful, thanks
I think there are a few areas of confusion here; let me try to clear them up for you.
The "Predicted" section from cv.lm does not correspond to results from cross-validation. If you're interested in cross-validation, you need to look at your "cvpred" results; "Predicted" corresponds to predictions from the model fit using all of your data.
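To make the distinction concrete, here is a minimal base-R sketch of 10-fold cross-validation done by hand on the built-in iris data. It is not how cv.lm is implemented; the variable names `predicted` and `cvpred` are only chosen to mirror the two rows of the cv.lm output, and the fold assignment is my own assumption.

```r
# Manual 10-fold cross-validation in base R (illustrative sketch, not cv.lm's code)
set.seed(1)
k <- 10
n <- nrow(iris)

# Randomly assign every row to one of k folds of nearly equal size
fold <- sample(rep(seq_len(k), length.out = n))

# In-sample predictions: one model fitted on all rows (analogous to "Predicted")
full_fit <- lm(Sepal.Length ~ ., data = iris)
predicted <- fitted(full_fit)

# Out-of-fold predictions: each row is predicted by a model that never saw it
# (analogous to "cvpred")
cvpred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fold_fit <- lm(Sepal.Length ~ ., data = iris[!test, ])
  cvpred[test] <- predict(fold_fit, newdata = iris[test, ])
}

# Root mean squared error against the observed response
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse_in_sample <- rmse(predicted, iris$Sepal.Length)
rmse_cv <- rmse(cvpred, iris$Sepal.Length)
```

Only `cvpred` tells you how the model behaves on data it was not trained on, so that is the column to plot against the observed values.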
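The reason there is such a large difference between your predictions and your cross-validated predictions is likely that your final model is overfitting, which illustrates why cross-validation is so important. With roughly 150 predictors and only 161 observations, the model has nearly as many coefficients as data points, so it can match the data it was fit on very closely while generalizing poorly.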
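The effect is easy to reproduce on simulated data. The sketch below is purely illustrative: the data are random, only the first predictor carries any signal, and the choice of 120 predictors is an assumption made so that every fold's training set still has more rows than coefficients.

```r
# Simulated illustration of overfitting when predictors are nearly as numerous as rows
set.seed(42)
n <- 161   # number of observations, as in the question
p <- 120   # hypothetical predictor count, kept below the fold training size

X <- matrix(rnorm(n * p), nrow = n)
colnames(X) <- paste0("x", seq_len(p))
sim <- data.frame(y = X[, 1] + rnorm(n), X)   # only x1 is related to y

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Error of the model on the same data it was fitted to
full_fit <- lm(y ~ ., data = sim)
rmse_in_sample <- rmse(fitted(full_fit), sim$y)

# Error on held-out rows from 10-fold cross-validation
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))
cvpred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  fold_fit <- lm(y ~ ., data = sim[!test, ])
  cvpred[test] <- predict(fold_fit, newdata = sim[test, ])
}
rmse_cv <- rmse(cvpred, sim$y)

c(in_sample = rmse_in_sample, cross_validated = rmse_cv)
```

The in-sample error looks reassuring while the cross-validated error is several times larger, which is the same pattern as the gap between your Predicted and cvpred rows.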
I believe you are calling cv.lm incorrectly. I've never used the package, but I think you want to pass in something like cv.lm(df = cadets, RT..seconds.~., m = 10) rather than your fit object. I'm not sure why you see such a large difference between your cvpred and Predicted values in the example above, but the results below tell me that passing in a fitted model leads to that model, fit on all of the data, being used for each CV fold:
    library(DAAG)
    fit <- lm(Sepal.Length ~ ., data = iris)
    mod1 <- cv.lm(df = iris, fit, m = 10)
    mod2 <- cv.lm(df = iris, Sepal.Length ~ ., m = 10)
    > sqrt(mean((mod1$cvpred - mod1$Sepal.Length)^2))
    0.318
    > sqrt(mean((mod2$cvpred - mod2$Sepal.Length)^2))
    5.94
    > sqrt(mean((mod1$cvpred - mod1$Predicted)^2))
    0.0311
    > sqrt(mean((mod2$cvpred - mod2$Predicted)^2))
    5.94
The reason your DAAG results differ so much from your caret results is that you were looking at the "Predicted" section. "cvpred" should line up closely with caret's predictions (although make sure to match the row indices of your CV results), and if you want to line up the "Predicted" results with caret, you will need to get your predictions using something like predict(mod, cadets).
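On matching indices: held-out predictions come back grouped by fold, not in the original row order, so they need to be sorted by row index before being compared with anything else. The sketch below uses base R only and builds a stand-in table whose column names (pred, obs, rowIndex, Resample) mimic the mod$pred output shown in the question; it is not caret output, and `pred_tbl`, `aligned` and `comparison` are hypothetical names.

```r
# Aligning fold-ordered predictions with the original row order (illustrative sketch)
set.seed(7)
n <- nrow(iris)
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Stand-in for a table of held-out predictions, stacked fold by fold
pred_tbl <- do.call(rbind, lapply(seq_len(k), function(i) {
  test_rows <- which(fold == i)
  fold_fit <- lm(Sepal.Length ~ ., data = iris[-test_rows, ])
  data.frame(
    pred = unname(predict(fold_fit, newdata = iris[test_rows, ])),
    obs = iris$Sepal.Length[test_rows],
    rowIndex = test_rows,
    Resample = sprintf("Fold%02d", i)
  )
}))

# Restore the original row order so row j refers to row j of the data
aligned <- pred_tbl[order(pred_tbl$rowIndex), ]

# Predictions from a model fitted on all rows, already in original order
full_pred <- unname(predict(lm(Sepal.Length ~ ., data = iris), iris))

# One row per observation: observed value, held-out prediction, in-sample prediction
comparison <- data.frame(
  obs = aligned$obs,
  cvpred = aligned$pred,
  predicted = full_pred
)
```

From there, plot(comparison$obs, comparison$cvpred) followed by abline(0, 1) gives the predicted-versus-observed graph you were after, using the held-out predictions.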