
R, parallelization and large datasets

When a task would take a long time, we usually think about parallelizing it. In this post I will show how to deal with a large shared data set, one that is big enough to slow things down but not so big that you would need MapReduce.

Let's first start with how to set up a cluster in R:

Cluster set-up using doSNOW
Revolution Analytics pulled doMC, so I am using doSNOW instead.


library(foreach)
library(doSNOW)

numberofcores <- 4

# create a SOCK cluster and register it as the foreach backend
cl <- makeCluster(numberofcores)
registerDoSNOW(cl)

# naive version: bigdata travels to the workers with every task
foreach (ind = 1:1000) %dopar% foo_with(bigdata)

stopCluster(cl)

There are two issues here. First, this code throws an error because the function foo_with is not available on the worker nodes. Second, bigdata is shipped to the workers with every task, which slows everything down.

Solution for both problems
Push the data onto the workers by name (clusterExport takes a character vector of object names):

clusterExport(cl, "bigdata")

The function can either be pushed with clusterExport as well, or passed directly to clusterApply or clusterApplyLB, which ship it to the workers for us:

clusterApplyLB(cl, array, foo_with_rewritten, ...)
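
Putting the pieces together, here is a minimal sketch of the full workflow. It assumes foo_with(ind, bigdata) takes a task index and the shared data set; foo_with_rewritten and the exact argument names are illustrative, not from the original post. It uses the snow functions that doSNOW is built on, so no foreach loop is needed:

# assumes foo_with() and bigdata already exist in the global environment
library(doSNOW)   # loads snow, which provides the cluster* functions

cl <- makeCluster(4)

# ship the large data set and the worker function to every node once
clusterExport(cl, c("bigdata", "foo_with"))

# the rewritten function takes only the task index and finds bigdata in the
# worker's global environment instead of receiving it as an argument
foo_with_rewritten <- function(ind) {
  foo_with(ind, bigdata)
}

# load-balanced apply: each worker gets a new index as soon as it is free;
# only the small index travels over the wire, not bigdata
results <- clusterApplyLB(cl, 1:1000, foo_with_rewritten)

stopCluster(cl)

Because bigdata already lives on the workers after clusterExport, each clusterApplyLB task only has to move an integer there and the (small) result back.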

This blog post shows a solution that sits in between simple SNOW (or similar) cluster or plain multicore computing on one side, and a cluster that needs MapReduce on the other.
