# Column of lists inside a dataframe in R

Let's take the following data frame in R:

    df <- data.frame(sample=rnorm(1,0,1), params=I(list(list(mean=0,sd=1,dist="Normal"))))
    df <- rbind(df, data.frame(sample=rgamma(1,5,5), params=I(list(list(shape=5,rate=5,dist="Gamma")))))
    df <- rbind(df, data.frame(sample=rbinom(1,7,0.7), params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
    df <- rbind(df, data.frame(sample=rnorm(1,2,3), params=I(list(list(mean=2,sd=3,dist="Normal")))))
    df <- rbind(df, data.frame(sample=rt(1,3), params=I(list(list(df=3,dist="Student-T")))))

The first column contains a random draw from a probability distribution, and the second column stores a list with that distribution's parameters and name. The data frame df looks like:

          sample       params
    1 0.85102972 0, 1, Normal
    2 0.67313218  5, 5, Gamma
    3 3.00000000 7, 0.7, ....
    4 0.08488487 2, 3, Normal
    5 0.95025523 3, Student-T

Q1: How can I get the distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist

Q2: Is there an alternative way of storing data like this, something like a multi-dimensional data frame? I do not want to add a column for each parameter because that would litter the data frame with missing values.

## Answers

It's probably more natural to store information like this in a pure list structure than in a data frame:

    distList <- list(
      normal  = list(sample=rnorm(1,0,1),    params=list(mean=0,sd=1,dist="Normal")),
      gamma   = list(sample=rgamma(1,5,5),   params=list(shape=5,rate=5,dist="Gamma")),
      binom   = list(sample=rbinom(1,7,0.7), params=list(size=7,prob=0.7,dist="Binomial")),
      normal2 = list(sample=rnorm(1,2,3),    params=list(mean=2,sd=3,dist="Normal")),
      tdist   = list(sample=rt(1,3),         params=list(df=3,dist="Student-T"))
    )

Then, to extract just the distribution name from each element, you can use sapply to loop over the list and pull out that piece:

    sapply(distList, function(x) x[[2]]$dist)
         normal       gamma       binom     normal2       tdist 
       "Normal"     "Gamma"  "Binomial"    "Normal" "Student-T" 
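The same looping idea can be applied directly to the list column of the data frame from the question, which addresses Q1 without restructuring the data. The following is only a minimal sketch: it assumes a cut-down, three-row version of df with fixed placeholder values in sample (not the original random draws), and it uses vapply rather than sapply purely to make the expected return type explicit.

```r
# Cut-down version of the question's data frame; the sample values are
# fixed placeholders rather than real random draws.
df <- data.frame(
  sample = c(0.85, 0.67, 0.95),
  params = I(list(
    list(mean = 0, sd = 1, dist = "Normal"),
    list(shape = 5, rate = 5, dist = "Gamma"),
    list(df = 3, dist = "Student-T")
  ))
)

# Loop over the list column and pull the "dist" element out of each row.
# character(1) declares that every row must yield exactly one string.
dists <- vapply(df$params, function(p) p$dist, character(1))
print(dists)
```

df$params$dist fails because df$params is a list of lists with no element named dist at the top level; the name only exists inside each row's inner list, so some form of per-row loop is needed.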

If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the *maximum* number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.

For instance, any row with df$distribution == 'Normal' should have df$param1 = mean and df$param2 = sd. A row with df$distribution == 'Student' should have df$param1 = degrees of freedom and df$param2 = NA. Something like the following:

    dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal', param1=0, param2=1)
    dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5), distribution='Gamma', param1=5, param2=5))
    dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student', param1=3, param2=NA))

It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
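As a rough illustration of the two tools just mentioned, here is a small sketch on a data frame shaped like dg. The values in sample are fixed placeholders chosen for reproducibility, and the variable names full_rows, two_param and mean_param2 are made up for this example.

```r
# Fixed placeholder values standing in for the random draws.
dg <- data.frame(
  sample = c(0.85, 0.67, 0.95),
  distribution = c("Normal", "Gamma", "Student"),
  param1 = c(0, 5, 3),
  param2 = c(1, 5, NA)
)

# complete.cases() flags the rows that have no missing value in any column.
full_rows <- complete.cases(dg)

# Keep only the distributions that use both parameter columns.
two_param <- dg[full_rows, ]

# na.rm = TRUE makes summary functions skip the NA entries.
mean_param2 <- mean(dg$param2, na.rm = TRUE)
```

Here the Student row is the only one dropped by complete.cases(), since it is the only row with an NA in param2.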