Big Data Process and Analysis in R

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance as I have no formal training in Computer Science and am entirely self taught.

Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:

  1. Read and process the data into a data frame
  2. Basic descriptive analysis, including text mining (frequent terms, etc.)
  3. Plotting

Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R.

Simply, any tips or pointers that you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a 3rd grade level either.

Thanks in advance.

Answers


If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

(The Twitter streaming API returns a pretty rich object: a single 140-character tweet could weigh a couple kb of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as author name and tweet text.)

On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.

Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.

A couple of pointers:

  • an example in chapter 7 of Parallel R shows how to setup R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.

  • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.


Need Your Help

how to get UserID from Google plus Oauth2.0 for IOS

ios objective-c oauth-2.0 google-plus

I used Google Developers Documentation but I can't take UserID

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.