I've been programming in R for a while now. Whenever I've had performance problems, they have almost always been due to `data.frame` usage. To check what's slowing down your R code, just use the `Rprof` command around the code in question. I wrote the example for this post to be particularly slow due to its `data.frame` usage. You can view the results of the profiling by running `R CMD Rprof summ.prof`, which produced the summary below. As the total run time shows, it's very slow. After profiling your own code, look at the calls at the top of the summary.
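A minimal sketch of this workflow, assuming a hypothetical data frame `d` with columns `x` and `y` and a deliberately slow element-by-element loop. The data, sizes, and loop body are illustrative only, so its numbers will differ from the summary that follows. `Rprof("summ.prof")` starts sampling, `Rprof(NULL)` stops it, and `summaryRprof` is an in-session counterpart to `R CMD Rprof`.

```r
# Hypothetical slow example: the data frame `d`, its columns, and the
# loop body are assumptions made for illustration.
n <- 20000
d <- data.frame(x = seq_len(n) / n, y = numeric(n))

Rprof("summ.prof")   # start sampling; samples are written to summ.prof
for (i in seq_len(n)) {
  # Element-by-element access by column name goes through
  # `[.data.frame` and `[<-.data.frame` on every iteration.
  d[i, "y"] <- cos(d[i, "x"])
}
Rprof(NULL)          # stop profiling

# Summarize the samples from within R instead of the command line.
prof <- summaryRprof("summ.prof")
print(head(prof$by.self))
```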
Each sample represents 0.02 seconds.
Total run time: 2.98 seconds.

Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.

   %       total       %        self
 total    seconds     self    seconds    name
  81.2      2.42       1.3      0.04     "[<-"
  79.9      2.38      65.8      1.96     "[<-.data.frame"
  18.1      0.54      10.1      0.30     "[.data.frame"
  18.1      0.54       0.0      0.00     "["
  12.1      0.36       1.3      0.04     "%in%"
  11.4      0.34       9.4      0.28     "match"
   4.7      0.14       4.0      0.12     "anyDuplicated"
   2.0      0.06       2.0      0.06     "names"
   2.0      0.06       2.0      0.06     "sys.call"
   1.3      0.04       1.3      0.04     "=="
   0.7      0.02       0.7      0.02     ".row_names_info"
   0.7      0.02       0.7      0.02     "NROW"
   0.7      0.02       0.7      0.02     "anyDuplicated.default"
   0.7      0.02       0.7      0.02     "cos"

   %        self       %       total
  self    seconds    total    seconds    name
  65.8      1.96      79.9      2.38     "[<-.data.frame"
  10.1      0.30      18.1      0.54     "[.data.frame"
   9.4      0.28      11.4      0.34     "match"
   4.0      0.12       4.7      0.14     "anyDuplicated"
   2.0      0.06       2.0      0.06     "names"
   2.0      0.06       2.0      0.06     "sys.call"
   1.3      0.04      81.2      2.42     "[<-"
   1.3      0.04      12.1      0.36     "%in%"
   1.3      0.04       1.3      0.04     "=="
   0.7      0.02       0.7      0.02     ".row_names_info"
   0.7      0.02       0.7      0.02     "NROW"
   0.7      0.02       0.7      0.02     "anyDuplicated.default"
   0.7      0.02       0.7      0.02     "cos"
If the top calls are `[.data.frame` or `[<-.data.frame`, as they are here, then you have a `data.frame` problem. Here's how I solve this, in the order I try things:

- Avoid loops and use vectorized code (no `for` loops, no `apply`, no `sapply`). In the example, use `d[,1] <-` to assign an entire column at once.
- Use numeric indices when possible. In our example, that means using `d[i, 1]` instead of `d[i, "x"]`.
- Get rid of the `data.frame` for heavy calculations by using the `data.matrix` command. In the example above, just use `d.matrix <- data.matrix(d)`.
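
The three steps above can be sketched roughly as follows. This is only an illustration: the data frame `d`, its columns `x` and `y`, and the computation are assumptions rather than the original example's code.

```r
# Hypothetical data; the column names and sizes are illustrative.
n <- 20000
d <- data.frame(x = seq_len(n) / n, y = numeric(n))

# 1. Vectorized: assign the entire column in one call, no loop.
d[, 2] <- cos(d[, 1])

# 2. If a loop cannot be avoided, index by position, e.g. d[i, 2]
#    rather than d[i, "y"].

# 3. For heavy calculations, convert to a numeric matrix first.
#    Element-wise access then avoids the data.frame methods entirely.
d.matrix <- data.matrix(d)
for (i in seq_len(nrow(d.matrix))) {
  d.matrix[i, 2] <- cos(d.matrix[i, 1])
}
```

Both the vectorized column assignment and the matrix loop compute the same values here; which one fits depends on whether the calculation can be expressed without a loop at all.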
After rewriting the silly example above along these lines, it runs too quickly for the profiler to record any events. The execution time is 0.192 seconds, whereas the first version took 3.48 seconds, roughly an 18-fold speed-up.
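
When a run is too short for the profiler to catch any samples, wall-clock timing is one way to still compare versions. The sketch below assumes `system.time` as the measuring tool (the post does not say how its timings were taken) and uses hypothetical `slow` and `fast` functions on made-up data.

```r
# Hypothetical data and functions, for illustration only.
n <- 5000
make_d <- function() data.frame(x = seq_len(n) / n, y = numeric(n))

# Loop version: one data.frame read and write per row.
slow <- function(d) {
  for (i in seq_len(nrow(d))) d[i, "y"] <- cos(d[i, "x"])
  d
}

# Vectorized version: a single whole-column assignment.
fast <- function(d) {
  d[, 2] <- cos(d[, 1])
  d
}

t_slow <- system.time(r_slow <- slow(make_d()))
t_fast <- system.time(r_fast <- fast(make_d()))

print(t_slow["elapsed"])
print(t_fast["elapsed"])

# Both versions should produce the same data frame.
stopifnot(isTRUE(all.equal(r_slow, r_fast)))
```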