R - Compare all values in a vector/dataframe against values in another dataframe for filtering
I'm new to R so I apologize for my novice question.
I have a dataframe of two variables that I have sorted to give me all my top performers in a short list. I now want to take a much larger dataframe of 4 variables, and remove all rows that do not have the performer string in the smaller list.
I have tried the following:
clean_df <- df[match(best$retailer, all$retailer), ]
But this gives me a df with all my column names and NAs only.
I have also tried a few logical comparisons based on string values, but nothing has worked so far. Any help is greatly appreciated.
Suppose we have two dataframes ('all', 'best') and want to keep the rows in 'all' whose 'retailer' value is not in 'best'. In that case we can use anti_join. It is not clear from the question how 'df' relates to these two, so check the output against the 'df' dataset.
library(dplyr)
anti_join(all, best, by='retailer')
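Since the question asks to keep only the top performers, it may help to see anti_join next to its counterpart semi_join, which keeps the rows that do have a match. This is a minimal sketch on made-up data: the values, the 'sales' and 'score' columns, and the names 'dropped' and 'kept' are assumptions for illustration, not taken from the question.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
all <- data.frame(
  retailer = c("a", "b", "c", "a", "d", "e"),
  sales    = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)
best <- data.frame(
  retailer = c("a", "d"),
  score    = c(9.5, 8.7),
  stringsAsFactors = FALSE
)

# anti_join: rows of `all` whose retailer does NOT appear in `best`
dropped <- anti_join(all, best, by = "retailer")   # retailers b, c, e

# semi_join: rows of `all` whose retailer DOES appear in `best`
kept <- semi_join(all, best, by = "retailer")      # retailers a, a, d
```

Both are filtering joins, so the result contains only the columns of 'all'; nothing from 'best' is added.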
Or we can use %in% to find the elements of the 'retailer' column in 'all' that are also in 'best'. This gives a logical index that we can use to subset 'df'.
df[all$retailer %in% best$retailer,]
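Here is a small sketch of the %in% approach on made-up data. It assumes that 'df' and 'all' are the same data frame (the question does not say), so the index is applied to 'all' itself. The 'sales' column and the name 'keep' are hypothetical.

```r
# Toy data (hypothetical values, for illustration only)
all <- data.frame(
  retailer = c("a", "b", "c", "a", "d", "e"),
  sales    = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)
best <- data.frame(retailer = c("a", "d"), stringsAsFactors = FALSE)

# One TRUE/FALSE per row of `all`: is this retailer in the short list?
keep <- all$retailer %in% best$retailer
# keep is TRUE FALSE FALSE TRUE TRUE FALSE

# Subset the same data frame the index was computed from
clean_df <- all[keep, ]   # rows 1, 4 and 5
```

Because 'keep' has one entry per row of 'all', it should only be used to index a data frame with the same rows in the same order.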
Or, using match, we can set nomatch=0 so that the NA values we got previously are returned as 0 instead. As indexing in R starts from 1, an index of 0 selects nothing, so those entries are dropped when subsetting.
df[match(best$retailer, all$retailer, nomatch=0),]
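One thing worth checking with this form: match returns only the first position in its second argument for each element of the first, so a retailer that occurs several times in 'all' is picked up once. The sketch below illustrates that on made-up data and, for comparison, the reversed call, which tests every row of 'all'. The values, the 'sales' column and the names 'idx_first', 'first_only', 'keep' and 'every_match' are assumptions for illustration.

```r
# Toy data (hypothetical values, for illustration only)
all <- data.frame(
  retailer = c("a", "b", "c", "a", "d", "e"),
  sales    = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)
# "z" is in the short list but never occurs in `all`
best <- data.frame(retailer = c("a", "d", "z"), stringsAsFactors = FALSE)

# Position in `all` of the FIRST occurrence of each best retailer; 0 = no match
idx_first <- match(best$retailer, all$retailer, nomatch = 0)   # 1 5 0
first_only <- all[idx_first, ]                                 # rows 1 and 5

# Reversed: for every row of `all`, is its retailer found in `best`?
keep <- match(all$retailer, best$retailer, nomatch = 0) > 0
every_match <- all[keep, ]                                     # rows 1, 4 and 5
```

Here the second "a" row (row 4) is missing from 'first_only' but present in 'every_match'.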
You just need to set the nomatch argument of match to 0 (FALSE also works, since it is coerced to 0), and then check that the matched index is greater than 0, so that you get a logical vector that indexes correctly.
set.seed(0)
best <- letters[1:4]
all <- data.frame(retailer=sample(letters, 30, replace=TRUE), x=runif(30))
all[match(all$retailer, best, 0L) > 0L, ]
#    retailer         x
# 11        b 0.4112744
# 25        d 0.2447973
# 28        a 0.3162717