Read Table and Random Forest in R
I'm trying to use the Random Forest method in R. I need to read a txt file (training set).
dataset<- read.table(path1,header=TRUE,sep=",")
The column names start with a digit (e.g. 1005_at), so R automatically prepends an X to them (e.g. X1005_at). To undo this, I ran:
colnames(dataset)<-gsub("^[X](.*)","\\1",colnames(dataset))
Now the names are ok, but when I run the Random Forest:
model.rf <- randomForest(class ~ ., data=dataset, importance=TRUE,keep.forest=T, ntree=5, do.trace=T)
I get this error:
Error in eval(expr, envir, enclos) : object '1005_at' not found
If I run the Random Forest on the original dataset instead (without modifying the names, so using X1005_at), this error doesn't occur. Why? How can I fix it?
Answers
Use read.csv, which already has the appropriate defaults for header and sep, and pass check.names=FALSE so that R does not mangle the column names.
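To see what check.names does, here is a small sketch. The CSV content and the probe IDs in it are made up for illustration and are passed inline through the text= argument rather than read from path1. With the default check.names=TRUE, names are run through make.names, which is where the leading X comes from.

```r
# Hypothetical CSV content, supplied inline instead of reading from a file
csv_text <- "1005_at,1007_s_at,class\n1.5,2.5,a\n3.5,4.5,b\n"

# Default behaviour: check.names = TRUE, so non-syntactic names are repaired
mangled <- read.csv(text = csv_text)

# check.names = FALSE leaves the header exactly as written in the file
kept <- read.csv(text = csv_text, check.names = FALSE)

print(names(mangled))
print(names(kept))

# make.names() is the base R function that performs the same repair
print(make.names("1005_at"))
```

Columns with non-syntactic names can still be accessed with kept[["1005_at"]] or with backticks, as in kept$`1005_at`.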
The formula method of randomForest will not accept non-syntactic names in the input data frame. Use the default method instead, passing the predictors and the response as separate arguments.
Thus we have:
> # dataset <- read.csv(path1, check.names = FALSE)
>
> # next few lines are to make example similar to the one in the question
> dataset <- CO2
> names(dataset) <- c(paste(1:4, names(dataset[1:4]), sep = "_"), "class")
> names(dataset)
[1] "1_Plant"     "2_Type"      "3_Treatment" "4_conc"      "class"
>
> i <- match("class", names(dataset)) # i is index of class column
> fm <- randomForest(dataset[-i], dataset[[i]]
+ # other arguments - in this example none
+ )
> fm

Call:
 randomForest(x = dataset[-i], y = dataset[[i]])
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 26.43385
                    % Var explained: 77.13
> fm$importance
            IncNodePurity
1_Plant          2105.779
2_Type           1529.527
3_Treatment       557.300
4_conc           2265.724
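If you would rather keep the formula interface, another option (not what the transcript above does) is to fit with the syntactic X-prefixed names and translate them back only when reporting. A minimal sketch, assuming made-up probe IDs and a made-up vector standing in for the row names of an importance matrix:

```r
# Hypothetical original column names, as they appear in the file header
orig_names <- c("1005_at", "1007_s_at", "class")

# Syntactic versions, i.e. what read.table produces with check.names = TRUE
safe_names <- make.names(orig_names, unique = TRUE)

# Named vector used as a lookup table: safe name -> original name
lookup <- setNames(orig_names, safe_names)

# Stand-in for rownames(model$importance) from a model fitted on safe names
importance_rows <- c("X1005_at", "X1007_s_at")

# Translate back to the original identifiers for display
restored <- unname(lookup[importance_rows])
print(restored)
```

This avoids the gsub approach from the question, which would also strip the X from any column whose real name happens to begin with X.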