R as a general purpose programming language

I liked Python before because Python has rich built-in types like sets, dicts, lists, tuples. These structures help write short scripts to process data.

R, on the other hand, is like Matlab: its data types are scalar, vector, data frame, array and list, but it lacks sets, dicts, tuples, etc. I know that the list type is powerful and that a lot of operations can be thought of as list processing. But the idea of using R as a general purpose language is still vague to me.

(The following is just an example. It does not mean that I focus on text processing/mining.)

For example, I need to do TF-IDF counting for a set of news articles (say 200,000 articles in a folder and its subfolders).

After I read the files, I need to do word-to-ID mapping and other counting tasks. These tasks involve string manipulation and need containers like set or map.

I know I can use another language to do this processing and then load the data into R. But maybe (for small things) putting all the preprocessing into a single R script is better.

So my question is: does R offer enough of these rich data structures at the language level? If not, are there packages that provide a good extension to the R language?

Answers


I think that R's data pre-processing capability (everything from extracting data from its source up to just before the analytics steps) has improved substantially in the past three years (the length of time I have been using R). I use Python daily and have for the past seven years or so, and its text-processing capabilities are superb; still, I wouldn't hesitate for a moment to use R for the type of task you mention.

A couple of provisos, though. First, I would suggest looking very closely at a couple of the external packages for the set of tasks in your question, in particular hash (a Python-like key-value data structure) and stringr (which consists mostly of wrappers over the less user-friendly string manipulation functions in the base library).

Both stringr and hash are available on CRAN.

> library(hash)
> dx = hash(k1=453, k2=67, k3=913)
> dx$k1
  [1] 453
> dx = hash(keys=letters[1:5], values=1:5)
> dx
  <hash> containing 5 key-value pair(s).
   a : 1
   b : 2
   c : 3
   d : 4
   e : 5

> dx["a"]
  <hash> containing 1 key-value pair(s).
  a : 1

> library(stringr)
> astring = 'onetwothree456seveneight'
> ptn = '[0-9]{3,}'
> a = str_extract_all(astring, ptn)
> a
  [[1]]
  [1] "456"
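
If you would rather not add a dependency, base R can cover the set and map needs of a word-to-ID mapping as well. The following is only a rough sketch under my own assumptions: the two toy documents and the names docs, vocab, word_id and counts are made up for illustration, and a real tokenizer would need more than a whitespace split.

```r
# Toy documents standing in for the news articles (hypothetical data).
docs <- c("the cat sat", "the dog sat down")

# Tokenize on whitespace; strsplit() returns one character vector per document.
tokens <- strsplit(tolower(docs), "\\s+")

# Set semantics: unique() deduplicates; union(), intersect() and setdiff()
# are also available in base R for set operations on vectors.
vocab <- unique(unlist(tokens))

# Map semantics: a named integer vector gives word -> ID lookup by name.
word_id <- setNames(seq_along(vocab), vocab)

# Encode each document as a vector of word IDs.
doc_ids <- lapply(tokens, function(w) unname(word_id[w]))

# For a mutable hash map, an environment also works (missing keys give NULL).
counts <- new.env(hash = TRUE)
for (w in unlist(tokens)) {
  old <- counts[[w]]
  counts[[w]] <- (if (is.null(old)) 0L else old) + 1L
}
```

Here word_id[["sat"]] looks up the ID of a single word and counts[["the"]] gives its total count. Named vectors are copied on modification, so for a vocabulary that grows incrementally the environment (or the hash package above) is probably the better fit.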

It also seems that there is a large subset of R users for whom text processing and text analytics comprise a significant portion of their day-to-day work, as evidenced by CRAN's Natural Language Processing Task View (one of about 20 such informal domain-oriented package collections). Within that Task View is tm, a package dedicated to functions for text mining. Included in tm are optimized functions for processing tasks such as the one mentioned in your question.
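
To make the TF-IDF step concrete without depending on any package, here is a minimal base R sketch. It is not how tm does it; it assumes raw counts for term frequency and log(N / df) for the inverse document frequency, which is only one of several common weighting variants, and the three toy documents are invented for the example.

```r
# Three toy documents (hypothetical data), named so the rows get labels.
docs <- c(d1 = "apple banana apple",
          d2 = "banana cherry",
          d3 = "banana banana durian")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term.
tf <- t(vapply(tokens,
               function(w) tabulate(match(w, vocab), nbins = length(vocab)),
               integer(length(vocab))))
colnames(tf) <- vocab

# Document frequency: number of documents containing each term.
df  <- colSums(tf > 0)

# Inverse document frequency (assumed variant: natural log of N / df).
idf <- log(nrow(tf) / df)

# Scale each column of tf by the idf of its term.
tfidf <- sweep(tf, 2, idf, `*`)
```

A term that occurs in every document ("banana" here) gets an idf of zero and so drops out of the weighted matrix. For 200,000 articles a dense matrix like this would be too large, so a sparse representation (which tm uses for its document-term matrices) would be the practical choice.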

In addition, R has an excellent selection of packages for working interactively on reasonably large datasets (e.g., > 1 GB), often without the need to set up a parallel processing infrastructure (though they can certainly exploit a cluster if one is available). The most impressive of these, in my opinion, is the set of packages under the rubric "The Bigmemory Project" (CRAN) by Michael Kane and John Emerson at Yale; this project subsumes bigmemory, biganalytics, synchronicity, bigtabulate, and bigalgebra. In sum, the techniques behind these packages include: (i) allocating the data to shared memory, which enables separate concurrent processes to coordinate access to a single copy of the data; and (ii) file-backed data structures (which I believe, but am not certain, are synonymous with memory-mapped files), which allow very fast access from disk using pointers and so avoid the RAM limit on data size.

Still, quite a few functions and data structures in R's standard library make it easier to work interactively with data approaching ordinary RAM limits. For instance, .RData, a native binary format, is about as simple as possible to use (the commands are save and load) and it has excellent compression:

> library(ElemStatLearn)
> data(spam)
> format(object.size(spam), big.mark=',')
  [1] "2,344,384" # a 2.34 MB data file
> save(spam, file='test.RData')

This file, 'test.RData', is only 176 KB, a greater than 10-fold compression.
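
If you want to check the round trip and the size difference on your own data, something along these lines should work with base R alone. The data frame below is made up and deliberately repetitive, so the ratio you see on real data will differ; the names path and restored are just for illustration.

```r
# A hypothetical data frame with repetitive content, which compresses well.
df <- data.frame(id = 1:100000,
                 group = rep(c("a", "b", "c", "d"), times = 25000))

path <- tempfile(fileext = ".RData")
save(df, file = path)                      # compressed by default

in_memory <- as.numeric(object.size(df))   # approximate bytes in RAM
on_disk   <- file.size(path)               # bytes on disk

# Load into a separate environment so nothing in the workspace is overwritten.
restored <- new.env()
load(path, envir = restored)
same <- identical(restored$df, df)
```

Loading into its own environment is a design choice on my part: a plain load(path) restores the objects under their saved names and silently replaces anything in the workspace with the same name.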

