Is there a superfast way to transform data frame rows to list elements?

Suppose a data frame like this:

> n <- 3
> a <- data.frame(x=1:n,y=sample(letters,n,replace = T),stringsAsFactors = F)
> rownames(a) <- paste0("p",1:n)
> a
   x y
p1 1 a
p2 2 e
p3 3 b

I want to transform the data frame to a list like this:

$p1
$p1$x
[1] 1

$p1$y
[1] "a"


$p2
$p2$x
[1] 2

$p2$y
[1] "e"


$p3
$p3$x
[1] 3

$p3$y
[1] "b"

One intuitive way to perform this transformation is to use lapply to iterate over all rows, but that is really slow. If it were a matrix, another way would be apply(a,1,as.list). My benchmarks show that the apply approach is 5 times faster than the lapply approach. Furthermore, I also tested apply(a,1,as.vector,mode="list"), which is 4 times faster than the as.list approach. Unfortunately, this is a data frame with heterogeneous column types.
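For concreteness, the row-wise lapply baseline might look like the sketch below. The name by_row is chosen here for illustration only, and the letters in y are fixed rather than sampled so that the result is reproducible.

```r
a <- data.frame(x = 1:3, y = c("a", "e", "b"), stringsAsFactors = FALSE)
rownames(a) <- paste0("p", 1:3)

# One data-frame subset per row: simple, but each a[i, ] call is costly,
# which is why this gets slow as the number of rows grows.
by_row <- lapply(seq_len(nrow(a)), function(i) as.list(a[i, ]))
names(by_row) <- rownames(a)

str(by_row$p2)
```

Column types are preserved here (x stays integer, y stays character), which is the property the faster approaches also need to keep.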

When the data frame has many rows, all of these methods are slow. Is there a way to do this even faster? (With Rcpp, perhaps? If so, how?)

Answers


For the record (and since you mentioned Rcpp), here is an approach at the C level, compiled through the inline package. The speedup is about 7x. There may be better or faster solutions, but, agreeing with the comments, it may be wiser to plan a different overall approach than to make one specific step as fast as possible, especially when significant speedups are hard to get.

library(inline)

ff <- cfunction(sig = c(R_df = "data.frame"), body = '
    R_len_t nr = LENGTH(VECTOR_ELT(R_df, 0)), nc = LENGTH(R_df);

    SEXP ans;
    PROTECT(ans = allocVector(VECSXP, nr));
    for(int i = 0; i < nr; i++) {
        SET_VECTOR_ELT(ans, i, allocVector(VECSXP, nc));
        setAttrib(VECTOR_ELT(ans, i), R_NamesSymbol, 
                  getAttrib(R_df, R_NamesSymbol));
    }
    setAttrib(ans, R_NamesSymbol, getAttrib(R_df, R_RowNamesSymbol)); 

    for(int i = 0; i < nc; i++) {
        SEXP tmp;
        PROTECT(tmp = coerceVector(VECTOR_ELT(R_df, i), 
                                   TYPEOF(VECTOR_ELT(R_df, i))));
        switch(TYPEOF(tmp)) {
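            case LGLSXP: {
                /* keep logical columns logical instead of turning them into integers */
                int *ptmp = LOGICAL(tmp);
                for(int j = 0; j < nr; j++) 
                    SET_VECTOR_ELT(VECTOR_ELT(ans, j), i, 
                                   ScalarLogical(ptmp[j]));
                break;              
            }
            case INTSXP: {
                int *ptmp = INTEGER(tmp);
                for(int j = 0; j < nr; j++) 
                    SET_VECTOR_ELT(VECTOR_ELT(ans, j), i, 
                                   ScalarInteger(ptmp[j]));
                break;              
            }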
            case REALSXP: {
                double *ptmp = REAL(tmp);
                for(int j = 0; j < nr; j++) 
                    SET_VECTOR_ELT(VECTOR_ELT(ans, j), i, 
                                   ScalarReal(ptmp[j]));
                break;              
            }
            case STRSXP: {
                for(int j = 0; j < nr; j++) 
                    SET_VECTOR_ELT(VECTOR_ELT(ans, j), i, 
                                   ScalarString(STRING_ELT(tmp, j)));
                break;              
            }
        }
        UNPROTECT(1);
    }

    UNPROTECT(1);
    return(ans);
')

ff(a) 
#$p1
#$p1$x
#[1] 1
#
#$p1$y
#[1] "k"
#
#
#$p2
#$p2$x
#[1] 2
#
#$p2$y
#[1] "o"
#
#
#$p3
#$p3$x
#[1] 3
#
#$p3$y
#[1] "l"

Comparing with your own approach (mentioned in the comments), which proved to be fast:

identical(setNames(do.call(Map, 
                           c(function(...) 
                                "names<-"(list(...), colnames(a)), a)), 
                   row.names(a)), 
           ff(a))
#[1] TRUE 
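
For readers unpacking that one-liner: do.call(Map, c(f, a)) amounts to Map(f, x = a$x, y = a$y), so f is called once per row with one value from each column. A more spelled-out sketch of the same idea follows. The name rows_to_list is made up here and does not come from the comments, and the sketch assumes no column is called f, FUN, MoreArgs, SIMPLIFY or USE.NAMES, since those would clash with arguments of Map and mapply.

```r
# Hypothetical helper illustrating the Map-based approach.
rows_to_list <- function(df) {
  # Equivalent to Map(f = list, col1 = df[[1]], col2 = df[[2]], ...):
  # `list` is called once per row with one named value per column.
  out <- do.call(Map, c(list(f = list), df))
  names(out) <- rownames(df)
  out
}

a <- data.frame(x = 1:3, y = c("a", "e", "b"), stringsAsFactors = FALSE)
rownames(a) <- paste0("p", 1:3)

res <- rows_to_list(a)
str(res$p2)
```

Because the column names are passed along as argument names, each inner list is already named, so no separate "names<-" step is needed in this variant.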

And on a larger "data.frame":

set.seed(101)
DF = do.call(cbind.data.frame, 
             replicate(4, cbind.data.frame(x = I(sample(letters, 1e5, T)), 
                                           y = runif(1e5), 
                                           z = sample(1e5)), simplify = F))
names(DF) = make.unique(names(DF), "")


identical(setNames(do.call(Map, 
                           c(function(...) 
                               "names<-"(list(...), colnames(DF)), DF)), 
                   row.names(DF)), 
          ff(DF))   
#[1] TRUE
library(microbenchmark)
microbenchmark(ans1 = setNames(do.call(Map, 
                                       c(function(...) 
                                           "names<-"(list(...), colnames(DF)), 
                                         DF)), 
                               row.names(DF)), 
               ff(DF), 
               times = 10)
#Unit: milliseconds
#   expr       min        lq    median       uq       max neval
#   ans1 3504.1825 3862.4333 3931.0853 4063.691 4162.9370    10
# ff(DF)  143.0398  340.6897  365.5144  404.475  498.3854    10

It looks like you want to split the rows into a list and then, within each row, split the values into a list of elements. Here's an approach that matches the OP's output, though I think @Roland's is more useful. The sprintf call zero-pads the row numbers to counter the reordering done by split, which sorts the group labels alphabetically. This has the advantage over the apply(a, 1, as.list) solution that the individual elements of the nested lists stay numeric and character, whereas apply coerces everything to character (it first converts the data frame to a matrix).

rows <- 1:nrow(a)
breaks <- paste0("p", sprintf(paste0("%0", nchar(max(rows)), "d"), rows))
lapply(split(a, breaks), as.list)

## $p1
## $p1$x
## [1] 1
## 
## $p1$y
## [1] "g"
## 
## 
## $p2
## $p2$x
## [1] 2
## 
## $p2$y
## [1] "c"
## 
## 
## $p3
## $p3$x
## [1] 3
## 
## $p3$y
## [1] "t"

From your comments, I'd suggest either using a real database or the data.table package:

library(data.table)

DT <- data.table(name=c("Ken","Ashley"),type=c("A","B"),score=c(9,8)) 
setkey(DT, name)
interests <- data.table(name=c("Ken", "Ashley"), 
               interests=list(c("reading","music"), c("dancing","swimming")))

DT[interests]
#     name type score        interests
#1:    Ken    A     9    reading,music
#2: Ashley    B     8 dancing,swimming

Note that at its core this is a list:

unclass(DT[interests])
$name
[1] "Ken"    "Ashley"

$type
[1] "A" "B"

$score
[1] 9 8

$interests
$interests[[1]]
[1] "reading" "music"  

$interests[[2]]
[1] "dancing"  "swimming"


attr(,"row.names")
[1] 1 2
attr(,".internal.selfref")
<pointer: 0x7fc7c4007978>
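
The same holds for a plain data.frame, which suggests keeping the column-wise storage and building the per-row list only when a row is actually needed. A base-R sketch of that idea, assuming the example data from the question with hand-picked letters; get_row is a hypothetical helper, not part of any package:

```r
a <- data.frame(x = 1:3, y = c("a", "e", "b"), stringsAsFactors = FALSE)
rownames(a) <- paste0("p", 1:3)

# A data frame is already a list of equal-length columns.
is.list(a)

# Hypothetical helper: take element `key` from every column, on demand.
get_row <- function(df, key) {
  i <- match(key, rownames(df))
  lapply(df, `[`, i)
}

p2 <- get_row(a, "p2")
str(p2)
```

This avoids materialising one small list per row up front, which is where the time goes in the conversions above.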
