R: for loop operation with nested indices runs super slow

I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to change depends on information that varies by row: depending on the value in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.

n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n))
system.time(
  for (i in 1:nrow(t.df)) {
    t.df[i, (t.df[i,1] %% 10 + 1)] <- 99
  }
)

This takes 70 seconds with ten columns, and 360 seconds when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?

I already tried initializing the nested term (t.df[i,1] %% 10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but it's not the solution.
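
For reference, a minimal sketch of that pre-computation idea on the test case above (here the indices are stored in a plain vector rather than a list):

idx <- t.df[, 1] %% 10 + 1          # pre-compute the nested column index once
for (i in 1:nrow(t.df)) {
  t.df[i, idx[i]] <- 99
}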

My current best idea came while preparing this test case. For me, only about 10 of the 75 columns are relevant to this operation; the rest are irrelevant. Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes the irrelevant columns. That will get me down to just over a minute. But is "for loop with nested indices" even the best way to think about my problem?
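
A rough sketch of that subsetting idea, using the 10-column test case as a stand-in (the column positions in rel are hypothetical; in the real data they would be whichever columns actually matter):

rel    <- 1:10                   # hypothetical: positions of the relevant columns
sub.df <- t.df[, rel]
for (i in 1:nrow(sub.df)) {
  sub.df[i, sub.df[i, 1] %% 10 + 1] <- 99
}
t.df[, rel] <- sub.df            # copy the modified columns back into the full frame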

Answers


Using row and col seems less complicated to me:

t.df[col(t.df) == (row(t.df) %% 10) + 1]  <- 99

I think Tommy's is still faster, but using row and col might be easier to understand.
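
With this particular test data the row number and the value in column one give the same target column; if that does not hold for the real data, a value-based variant of the same idea should work (a sketch, not benchmarked):

# pick the target column from the value in column 1 rather than from the row number
t.df[col(t.df) == (t.df[, 1] %% 10 + 1)] <- 99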


It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame. Is there any way to convert your data so that it can stay in a matrix?

By the way, great question and a very good example.

Here's an illustration of how much faster loops are on matrices than on data.frames:

> n <- 20000
> t.df <- (matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+   for (i in 1:nrow(t.df)) {
+     t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+   }
+ )
   user  system elapsed 
  0.084   0.001   0.084 
> 
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+   for (i in 1:nrow(t.df)) {
+     t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+   }
+   )
   user  system elapsed 
 31.543  57.664  89.224 

@JD Long is right that if t.df can be represented as a matrix, things will be much faster.

...And then you can actually vectorize the whole thing so that it is lightning fast:

n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
  m <- as.matrix(t.df)
  m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
  t2.df <- as.data.frame(m)
}) # 0.00 secs

Unfortunately, the matrix indexing I use here does not seem to work on a data.frame.

EDIT: A variant that creates a logical matrix for indexing does work on a data.frame, and is almost as fast:

n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
  t2.df <- t.df

  # Create a logical matrix with TRUE wherever the replacement should happen
  m <- array(FALSE, dim=dim(t2.df))
  m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE

  t2.df[m] <- 99
}) # 0.01 secs

UPDATE: Added the matrix version of Tommy's solution to the benchmarking exercise.

You can vectorize it. Here is my solution and a comparison with the loop and the other answers:

n <- 20000
t.df <- (matrix(1:5000, ncol=10, nrow=n))

f_ramnath <- function(x){
  idx <- x[,1] %% 10 + 1
  x[cbind(1:NROW(x), idx)] <- 99  
  return(x)
}

f_long <- function(t.df){
  for (i in 1:nrow(t.df)) {
    t.df[i,(t.df[i,1]%%10 + 1)] <- 99
  }
  return(t.df)
}

f_joran <- function(t.df){
  t.df[col(t.df) == (row(t.df) %% 10) + 1]  <- 99
  return(t.df)
}

f_tommy <- function(t.df){
  t2.df <- t.df
  # Create a logical matrix with TRUE wherever the replacement should happen
  m <- array(FALSE, dim=dim(t2.df))
  m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
  t2.df[m] <- 99
  return(t2.df)
}

f_tommy_mat <- function(m){
  m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
  return(m)
}
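
Before timing them, a quick check that the variants agree on the test matrix is reassuring (a small sketch using the matrix t.df defined above; with this test data the row-based indexing in f_joran happens to coincide with the value-based indexing of the other functions):

ref <- f_long(t.df)
stopifnot(all.equal(ref, f_ramnath(t.df)),
          all.equal(ref, f_joran(t.df)),
          all.equal(ref, f_tommy(t.df)),
          all.equal(ref, f_tommy_mat(t.df)))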

To compare the performance of the different approaches, we can use rbenchmark.

library(rbenchmark)
benchmark(f_long(t.df), f_ramnath(t.df), f_joran(t.df), f_tommy(t.df),
  f_tommy_mat(t.df), replications = 20, order = 'relative',
  columns = c('test', 'elapsed', 'relative'))

               test elapsed  relative
5 f_tommy_mat(t.df)   0.135  1.000000
2   f_ramnath(t.df)   0.172  1.274074
4     f_tommy(t.df)   0.311  2.303704
3     f_joran(t.df)   0.705  5.222222
1      f_long(t.df)   2.411 17.859259
