![Page 1: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/1.jpg)
Fast lookups in R
Joseph Adler
April 13 2010
![Page 2: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/2.jpg)
About me
Relevant work
• Tasks
– Computer security research
– Credit risk modeling
– Pricing strategy
– Direct marketing
• Places
– American Express
– Johnson and Johnson
– DoubleClick
– VeriSign
– LinkedIn (now)
![Page 3: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/3.jpg)
About me
Books
![Page 4: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/4.jpg)
Today’s talk
What I wrote
If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value in a list with n elements can take O(n) time. Looking up the value in an environment object takes O(1) time on average.
![Page 5: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/5.jpg)
Today’s talk
What I read after the book was printed
Re: [R] beginner Q: hashtable or dictionary?
From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon 30 Jan 2006 - 18:37:00 EST
On Sun, 29 Jan 2006, hadley wickham wrote:
>> use a 'list':
>
> Is a list O(1) for setting and getting?
Can you elaborate? R is a vector language, and normally you create
a list in one pass, and you can retrieve multiple elements at once.
Retrieving elements by name from a long vector (including a
list) is very fast, as an internal hash table is used. Does the
following item from ONEWS answer your question?
Indexing a vector by a character vector was slow if both
the vector and index were long (say 10,000). Now
hashing is used and the time should be linear in the
longer of the lengths (but more memory is used).
Indexing by number is O(1) except where replacement causes the
list vector to be copied. There is always the option to use match() to
convert to numeric indexing.
-- Brian D. Ripley,
Professor of Applied Statistics,
University of Oxford
Retrieving elements by name from a
long vector (including a list) is very
fast, as an internal hash table is used.
Professor Brian D. Ripley
![Page 6: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/6.jpg)
Today’s talk
• A short introduction to objects in R
• Looking up values in R
– How lookup tables are implemented in R
– Measuring lookup speed
– Optimizing lookup speed
![Page 7: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/7.jpg)
Objects in R
Everything in R is an object. Here are some
examples of objects.
Numeric Vector:
> onehalf <- 1/2
> class(onehalf)
[1] "numeric"
![Page 8: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/8.jpg)
Objects in R
Integer Vector:
> four <- as.integer(4)
> four
[1] 4
> class(four)
[1] "integer"
![Page 9: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/9.jpg)
Objects in R
Character vector:
> zero <- "zero"
> class(zero)
[1] "character"
![Page 10: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/10.jpg)
Objects in R
Logical vector:
> this.is.interesting <- FALSE
> class(this.is.interesting)
[1] "logical"
![Page 11: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/11.jpg)
Objects in R
Vectors can have multiple elements
> one.to.five <- 1:5
> class(one.to.five)
[1] "integer"
> six.to.ten <- c(6, 7, 8, 9, 10)
> class(six.to.ten)
[1] "numeric"
![Page 12: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/12.jpg)
Objects in R
Lists contain heterogeneous collections of objects:
> stuff <- list(3.14, "hat", FALSE)
> class(stuff)
[1] "list"
![Page 13: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/13.jpg)
Objects in R
Functions are also objects in R:
> f <- function(x, y) {
+   x + y
+ }
> f
function(x, y) {
  x + y
}
> class(f)
[1] "function"
![Page 14: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/14.jpg)
Objects in R
Environments map names to objects. They are
used within R itself to map variable names to
objects. You can access these environment
objects, or create your own.
> one <- 1
> two <- 2
> three <- 3
> objects()
[1] "one" "three" "two"
> e <- .GlobalEnv
> class(e)
[1] "environment"
> objects(e)
[1] "e" "one" "three" "two"
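As a minimal sketch of creating your own environment (the variable names `lookup` and `alias` are illustrative and not from the talk), `new.env` makes an empty environment, and `assign` and `get` store and fetch values by name:

```r
# Create a fresh environment; it starts out empty.
lookup <- new.env()

# Store values under names, then list the names.
assign("one", 1, envir = lookup)
assign("two", 2, envir = lookup)
objects(lookup)                # "one" "two"

# Fetch a value by name.
get("two", envir = lookup)     # 2

# Environments are not copied on assignment: both names refer to the same table.
alias <- lookup
assign("three", 3, envir = alias)
exists("three", envir = lookup, inherits = FALSE)   # TRUE
```

The last step is the main practical difference from vectors and lists: changing the environment through one name is visible through every other name that refers to it.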
![Page 15: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/15.jpg)
Lookups
You can look up an item in a vector, list, or array
within R
– Let’s define a vector:
> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> a
[1] 1 2 3 4 5 6 7 8 9 10
– You can refer to elements by index:
> a[3]
[1] 3
![Page 16: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/16.jpg)
Lookups
It's also possible to name elements in a vector, then refer to
them by name:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2
This can be very convenient: you can use every vector in R
as a table. You can access the name vector through the
names function:
> names(b)
[1] "Joe" "Bob" "Jim"
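A short sketch of using a named vector as a table, reusing the `b` vector from the slide (the name "Sue" is a made-up key that is deliberately absent):

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# Several names can be looked up in a single call.
b[c("Jim", "Joe")]
# Jim Joe
#   3   1

# A name that is not in the table yields NA rather than an error.
b["Sue"]
# <NA>
#   NA

# unname() drops the label when only the value is wanted.
unname(b["Bob"])
# [1] 2
```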
![Page 17: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/17.jpg)
Lookups
Named vectors in R are implemented using two
different arrays:
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
![Page 18: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/18.jpg)
Lookups
The name lookup algorithm works roughly like this:
function(vector, name) {
  # Scan the names one at a time until one matches
  for (i in seq_along(vector)) {
    if (names(vector)[i] == name)
      return(vector[i])
  }
  # No name matched
  return(NA)
}
![Page 19: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/19.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
![Page 20: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/20.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Current position: names(a.20)[1]
![Page 21: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/21.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Current position: names(a.20)[2]
![Page 22: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/22.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Current position: names(a.20)[4]
![Page 23: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/23.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Current position: names(a.20)[4]
![Page 24: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/24.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Current position: names(a.20)[5]
![Page 25: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/25.jpg)
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Current position: names(a.20)[5]
![Page 26: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/26.jpg)
Lookups
In vectors,
– Looking up a value by index takes a constant amount
of time.
– Looking up a value by name (potentially) requires
looking at every name in the names array. (This
means that lookup times scale linearly with the
number of items in the table.)
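The email quoted earlier mentions `match()` as a way to convert names to numeric indices. A minimal sketch of that idea, assuming the same names are looked up repeatedly (the `prices` vector and its labels are made up for this example): resolve the names to positions once, then do the repeated lookups by position.

```r
# A small labeled vector (the labels are made up for this example).
prices <- c(apple = 0.5, pear = 0.75, plum = 0.2, fig = 1.25)

# Resolve the names to positions once...
wanted <- c("plum", "apple")
pos <- match(wanted, names(prices))
pos
# [1] 3 1

# ...then every later lookup is by position, which takes constant time.
prices[[pos[1]]]
# [1] 0.2
total <- 0
for (k in 1:1000) total <- total + prices[[pos[2]]]
total
# [1] 500
```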
![Page 27: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/27.jpg)
Lookups
Environments store (and fetch) data using a
different structure. They use hash tables.
Hash tables rely on a hash function to map labels
to indices.
![Page 28: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/28.jpg)
Lookups
Simple hash table implementation
Example: store 15 ¾ for “Joe”
1. Calculate h(“Joe”)
2. Store 15 ¾ in the table in slot h(“Joe”)
h(“Joe”) = 4
Slot 1: (empty)
Slot 2: (empty)
Slot 3: (empty)
Slot 4: 15 ¾
Slot 5: (empty)
Slot 6: (empty)
![Page 29: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/29.jpg)
Lookups
If you carefully choose the size of the hash table
and the hash function, you can store and lookup
values in constant time (on average) in hash
tables.
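To make the idea concrete, here is a toy hash table written in R. It is only a sketch under stated assumptions: the hash function `h`, the helper names `store` and `fetch`, and the six-slot table are all illustrative, and this is not how R implements environments internally. The hash function was chosen so that "Joe" happens to land in slot 4, as on the previous slide.

```r
# A toy hash table with a fixed number of slots; keys that hash to the same
# slot share a small named list (chaining).
n.slots <- 6

# Illustrative hash function: sum of character codes, mapped onto 1..n.slots.
h <- function(key) (sum(utf8ToInt(key)) - 1) %% n.slots + 1

# Each slot starts out as an empty list.
slots <- replicate(n.slots, list(), simplify = FALSE)

store <- function(slots, key, value) {
  slots[[h(key)]][[key]] <- value
  slots
}

fetch <- function(slots, key) {
  # Only the one slot picked by h() is searched, not the whole table.
  slots[[h(key)]][[key]]
}

slots <- store(slots, "Joe", 15.75)
slots <- store(slots, "Bob", 2)

h("Joe")              # 4
fetch(slots, "Joe")   # 15.75
fetch(slots, "Sue")   # NULL, because nothing was stored under "Sue"
```

With only six slots, collisions become common as the table fills up, which is why the size of the table matters for keeping lookups close to constant time.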
![Page 30: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/30.jpg)
Measuring Lookup Speed
In theory, looking up values in environments
should be faster than looking up values in vectors.
In practice, how much difference does this make?
Let’s measure how much time it takes to look up
values in vectors and environments, using different
lookup methods
![Page 31: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/31.jpg)
Measuring Lookup Speed
Let's build a large, labeled vector for testing:
labeled.array <- function(n) {
  a <- 1:n
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    names(a)[i] <- chartr(from, to, i)
  }
  a
}
Here's an example of the output of this function:
> a.20 <- labeled.array(20)
> a.20
A B C D E F G H I AJ AA AB AC AD AE AF AG AH AI BJ
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
![Page 32: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/32.jpg)
Measuring Lookup Speed
Let's also create environment objects for testing:
labeled.environment <- function(n) {
  e <- new.env(hash=TRUE, size=n)
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    assign(x=chartr(from, to, i), value=i, envir=e)
  }
  e
}
Here’s an example of the output of this function:
> e.20 <- labeled.environment(20)
> e.20
<environment: 0x143756c>
![Page 33: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/33.jpg)
Measuring Lookup Speed
You can fetch values from an environment object
with the get function:
> get("A", envir=e.20)
[1] 1
> get("BJ", envir=e.20)
[1] 20
You can also fetch values from an environment
with the double bracket operator:
> e.20[["A"]]
[1] 1
> e.20[["BJ"]]
[1] 20
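The two fetch styles differ in how they treat names that are not in the table. The sketch below uses a small stand-in environment `e` (not the talk's `e.20`) to show the difference, along with `exists` and `mget` from base R:

```r
e <- new.env(hash = TRUE)
assign("A", 1, envir = e)
assign("B", 2, envir = e)

# [[ ]] returns NULL for a name that is not in the environment...
is.null(e[["missing.key"]])                   # TRUE

# ...whereas get() also searches enclosing environments by default, so it can
# find objects that were never stored in the table (here, base R's `sum`).
is.function(get("sum", envir = e))            # TRUE

# inherits = FALSE restricts the search to the environment itself.
exists("sum", envir = e, inherits = FALSE)    # FALSE
exists("A", envir = e, inherits = FALSE)      # TRUE

# mget() fetches several names at once and returns a named list.
mget(c("A", "B"), envir = e)
```

When an environment is used purely as a lookup table, passing `inherits = FALSE` to `get` and `exists` avoids accidental hits on unrelated objects.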
![Page 34: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/34.jpg)
Measuring Lookup Speed
• Creating examples for testing
arrays <- list()
for (i in 10:15) {
  arrays[[as.character(2 ** i)]] <-
    labeled.array(2 ** i)
}
environments <- list()
for (i in 10:15) {
  environments[[as.character(2 ** i)]] <-
    labeled.environment(2 ** i)
}
![Page 35: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/35.jpg)
Measuring Lookup Speed
• Using the test function:
test_expressions("first element, by index:",
  function(d, l, r) {
    s <- 0
    for (v in 1:r) {
      s <- s + d[1]
    }
  },
  arrays, 1024)
• Output:
first element, by index:
 1024  2048  4096  8192 16384 32768
0.010 0.003 0.004 0.003 0.005 0.004
![Page 36: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/36.jpg)
Measuring Lookup Speed
• Results for 1024 lookups (user time in seconds; columns give the number of elements):

| Lookup method | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|---|---|---|---|---|---|---|
| Array index First | 0.01 | 0.003 | 0.004 | 0.003 | 0.005 | 0.004 |
| Array index Last | 0.01 | 0.004 | 0.004 | 0.004 | 0.003 | 0.004 |
| Array Label Single Bracket First | 0.268 | 0.282 | 0.588 | 1.439 | 2.728 | 5.397 |
| Array Label Single Bracket Last | 0.173 | 0.278 | 0.582 | 1.517 | 2.713 | 5.266 |
| Array Label Double Bracket Exact First | 0.002 | 0.002 | 0.002 | 0.002 | 0.003 | 0.002 |
| Array Label Double Bracket Exact Last | 0.036 | 0.07 | 0.136 | 0.273 | 0.549 | 1.107 |
| Array Label Double Bracket Not exact First | 0.01 | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 |
| Array Label Double Bracket Not exact Last | 0.042 | 0.069 | 0.137 | 0.275 | 0.551 | 1.112 |
| Environment Label First | 0.012 | 0.005 | 0.006 | 0.006 | 0.005 | 0.005 |
| Environment Label Last | 0.012 | 0.005 | 0.006 | 0.005 | 0.006 | 0.005 |
![Page 37: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/37.jpg)
Measuring Lookup Speed
• Results for 1024 lookups: notice that the times for single-bracket label lookups, and for double-bracket label lookups of the last element, increase roughly linearly with the number of elements in the array.
![Page 38: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/38.jpg)
Measuring Lookup Speed
• Results for 1024 lookups: let's focus on the results for the largest arrays (32768 elements), which are the most precise.
![Page 39: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/39.jpg)
Measuring Lookup Speed
• Results for 1024 lookups, 32768 elements (user time in seconds):

| Lookup method | Time |
|---|---|
| Array index First | 0.004 |
| Array index Last | 0.004 |
| Array Label Single Bracket First | 5.397 |
| Array Label Single Bracket Last | 5.266 |
| Array Label Double Bracket Exact First | 0.002 |
| Array Label Double Bracket Exact Last | 1.107 |
| Array Label Double Bracket Not exact First | 0.003 |
| Array Label Double Bracket Not exact Last | 1.112 |
| Environment Label First | 0.005 |
| Environment Label Last | 0.005 |
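To try a comparison like this on your own machine, a self-contained sketch might look as follows. It is not the talk's benchmark: the labels ("k1", "k2", ...) and variable names are made up for brevity, and only the double-bracket lookup of the last label is timed. Absolute times will vary with the machine and the R version, so none are shown.

```r
n <- 32768
reps <- 1024

# Build the vector and the environment with the same (illustrative) labels.
keys <- paste0("k", seq_len(n))
v <- setNames(seq_len(n), keys)
e <- new.env(hash = TRUE, size = n)
for (i in seq_len(n)) assign(keys[i], i, envir = e)

last <- keys[n]

# Time `reps` lookups of the last label in each structure.
vector.time <- system.time(for (r in seq_len(reps)) v[[last]])[["user.self"]]
env.time <- system.time(for (r in seq_len(reps)) e[[last]])[["user.self"]]

# Both structures return the same value; only the time taken differs.
c(vector = vector.time, environment = env.time)
```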
![Page 40: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/40.jpg)
Optimizing Lookup Speed
How to write efficient code:
1. Write code for clarity, not speed
2. Check to see if the code is fast enough. If it is
fast enough, stop.
3. Test your code to find where time is being spent
4. Fix the parts of your code that are taking
the most time.
5. Go to step 2
![Page 41: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/41.jpg)
Optimizing Lookup Speed
• How do you make lookups fast?
– Lookups by position are fastest
– If you have to look up single values by name, write
your code with double-brackets
• Double-bracket lookups are a little faster than single bracket
lookups
• If you discover that your code is too slow, you can easily
change from vectors to environments
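As a sketch of why the switch is easy: code written with double brackets reads the same whether the table is a vector or an environment. The example assumes a reasonably recent R, since it uses `list2env` to copy the named elements across; the data is the small `b` vector used elsewhere in the talk.

```r
# A labeled vector used as a lookup table.
b <- c(Joe = 1, Bob = 2, Jim = 3)

# Copy each named element into a hashed environment.
e <- list2env(as.list(b), envir = new.env(hash = TRUE))

# Double-bracket lookups read the same for both structures.
b[["Bob"]]   # 2
e[["Bob"]]   # 2
```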
![Page 42: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/42.jpg)
Optimizing Lookup Speed
• What if
– Your code is too slow
– You need to look up values by name
– It would be hard to change your code to use double-
bracket notation
• Define a bracket operator for environments!
![Page 43: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/43.jpg)
Optimizing Lookup Speed
Remember that every operation in R is a function call, even
lookup operators.
Example code:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
2
![Page 44: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/44.jpg)
Optimizing Lookup Speed
Translation of the example code:
> b["Bob"]
Bob
2
> as.list(quote(b["Bob"]))
[[1]]
`[`
[[2]]
b
[[3]]
[1] "Bob"
![Page 45: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/45.jpg)
Optimizing Lookup Speed
R translates
b["B"]
to
`[`(b, "B")
![Page 46: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/46.jpg)
Optimizing Lookup Speed
Here is the code for our new subset function
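`[` <- function(x, i, j, ..., drop=TRUE) {
  # Plain environments are looked up by name; everything
  # else falls through to the built-in operator.
  if (identical(class(x), "environment")) {
    get(x=i, envir=x)
  } else {
    .Primitive("[")(x, i, j, ..., drop=TRUE)
  }
}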
![Page 47: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/47.jpg)
Optimizing Lookup Speed
Assignments through bracket notation are a little
funny. For example, R evaluates
x[3:5] <- 13:15
as if this code had been executed:
`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
![Page 48: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/48.jpg)
Optimizing Lookup Speed
Here is the code for our new subset assignment
function
`[<-` <- function(x, i, j, ..., value) {
  if (identical(class(x), "environment")) {
    assign(x=i, value=value, envir=x)
    # the assign statement returns value,
    # but we want to return the environment:
    x
  } else {
    .Primitive("[<-")(x, i, j, ..., value)
  }
}
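Here is a sketch of how such overrides behave in use. To keep the example short and self-contained it uses simplified variants that forward all index arguments through `...`; that forwarding is an assumption of this sketch rather than the definitions shown on the slides.

```r
# Simplified variants of the two operators. Forwarding the index arguments
# through `...` is a choice made for this sketch, not the talk's definition.
`[` <- function(x, ...) {
  if (identical(class(x), "environment")) {
    get(..1, envir = x)
  } else {
    .Primitive("[")(x, ...)
  }
}
`[<-` <- function(x, ..., value) {
  if (identical(class(x), "environment")) {
    assign(..1, value, envir = x)
    x   # return the environment itself, as on the slide
  } else {
    .Primitive("[<-")(x, ..., value = value)
  }
}

# Environments can now be read and written with single brackets...
e <- new.env(hash = TRUE)
e["Joe"] <- 15.75
joe <- e["Joe"]       # 15.75

# ...and ordinary vectors behave as before.
b <- c(Joe = 1, Bob = 2, Jim = 3)
b["Bob"] <- 5
bob <- b["Bob"]       # named value: Bob = 5

# Remove the overrides to restore the built-in operators.
rm(list = c("[", "[<-"))
```

Masking `[` in the global environment affects every single-bracket call made from that environment, so it is worth removing the overrides once they are no longer needed.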
![Page 50: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/50.jpg)
Backup Slides
![Page 51: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/51.jpg)
• A function to test the performance of a lookup
function on an object:
test_expressions <-
  function(description, fun, data, reps) {
    cat(paste(description, "\n"))
    results <- vector()
    for (n in names(data)) {
      results[[n]] <- system.time(
        fun(data[[n]], as.integer(n), reps)
      )[["user.self"]]
    }
    print(results)
  }
![Page 52: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/52.jpg)
To figure out the full argument list for the bracket
operator, use the getGeneric function:
> getGeneric("[")
standardGeneric for "[" defined from package "base"
function (x, i, j, ..., drop = TRUE)
standardGeneric("[", .Primitive("["))
<environment: 0x11a6828>
Methods may be defined for arguments: x, i, j, drop
Use showMethods("[") for currently available ones.
![Page 53: R meetup talk](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a9339a1a28ab30368b4822/html5/thumbnails/53.jpg)
In general, you should set new methods with the setMethod function. Example:
setClass("myenv", representation(e="environment"))
setMethod("[",
  signature(x="myenv", i="character", j="missing"),
  function(x, i, j, ..., drop=TRUE) {
    get(x=i, envir=x@e)
  }
)
Unfortunately, R doesn’t let you redefine these operators for environments, so we have to do something trickier.