Dataset: IMDB
Predict the rating of a movie from the text of its review.
Training dataset: 25,000 reviews.
Binary response: positive or negative.
## [1] 1
## [1] "kurosawa is a proved humanitarian this movie is totally about people living in"
## [2] "poverty you will see nothing but angry in this movie it makes you feel bad but"
## [3] "still worth all those who s too comfortable with materialization should spend 2"
## [4] "5 hours with this movie"
Word Segmentation
## [1] "" "kurosawa" "is"
## [4] "a" "proved" "humanitarian"
## [7] "" "this" "movie"
## [10] "is" "totally" "about"
## [13] "people" "living" "in"
## [16] "poverty" "" "you"
## [19] "will" "see" "nothing"
## [22] "but" "angry" "in"
## [25] "this" "movie" ""
## [28] "it" "makes" "you"
## [31] "feel" "bad" "but"
## [34] "still" "worth" ""
## [37] "all" "those" "who"
## [40] "s" "too" "comfortable"
## [43] "with" "materialization" "should"
## [46] "spend" "2" "5"
## [49] "hours" "with" "this"
## [52] "movie" ""
FeatureHashing and its Formula Interface
FeatureHashing provides split in the formula to tokenize a character column.
hash_size <- 2^16
m.train <- hashed.model.matrix(~ split(review, delim = " ", type = "existence"),
                               data = imdb.train,
                               hash.size = hash_size,
                               signed.hash = FALSE)
m.valid <- hashed.model.matrix(~ split(review, delim = " ", type = "existence"),
                               data = imdb.valid,
                               hash.size = hash_size,
                               signed.hash = FALSE)
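A quick check, not part of the original deck: both hashed matrices have exactly hash_size columns, no matter which words occur in each split.

c(ncol(m.train), ncol(m.valid))  # both equal hash_size = 2^16 = 65536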
Document-term Matrix and tm
suppressPackageStartupMessages(library(tm))
corpus.train <- VCorpus(VectorSource(imdb.train$review))
dtm.train <- DocumentTermMatrix(corpus.train)
dim(dtm.train)
## [1] 20000 67678
corpus.valid <- VCorpus(VectorSource(imdb.valid$review))
dtm.valid <- DocumentTermMatrix(corpus.valid)
dim(dtm.valid)
## [1] 5000 38155
Note that the two document-term matrices do not share a column space: the training vocabulary has 67,678 terms while the validation vocabulary has only 38,155.
Efficiency
  test replications elapsed relative user.self sys.self user.child sys.child
2   fh           10  73.316    1.000    70.083    1.898      0.000     0.000
1   tm           10 203.663    2.778    66.950    4.440    239.018    12.413
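A minimal sketch of how such a timing table could be produced with the rbenchmark package; the exact expressions benchmarked in the original run are an assumption here.

library(rbenchmark)
benchmark(
  fh = hashed.model.matrix(~ split(review, delim = " ", type = "existence"),
                           data = imdb.train, hash.size = hash_size,
                           signed.hash = FALSE),
  tm = DocumentTermMatrix(VCorpus(VectorSource(imdb.train$review))),
  replications = 10)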
Hashed Matrix and xgboost
dtrain <- xgb.DMatrix(m.train, label = imdb.train$sentiment)
dvalid <- xgb.DMatrix(m.valid, label = imdb.valid$sentiment)
watch <- list(train = dtrain, valid = dvalid)
g <- xgb.train(booster = "gblinear", nrounds = 100, eta = 0.0001, max.depth = 2,
               data = dtrain, objective = "binary:logistic",
               watchlist = watch, eval_metric = "auc")
[0] train-auc:0.969895 valid-auc:0.914488
[1] train-auc:0.969982 valid-auc:0.914621
[2] train-auc:0.970069 valid-auc:0.914766
...
[97] train-auc:0.975616 valid-auc:0.922895
[98] train-auc:0.975658 valid-auc:0.922952
[99] train-auc:0.975700 valid-auc:0.923014
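To use the fitted booster, predictions on the hashed validation matrix can be obtained directly (a small usage sketch, not from the original deck):

# Probabilities of the positive class, since the objective is binary:logistic.
p <- predict(g, dvalid)
head(p)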
Algorithm in Textbooks
Regression
\[y = X \beta + \varepsilon\]
Real Data
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 1   | 5.1          | 3.5         | 1.4          | 0.2         | setosa     |
| 2   | 4.9          | 3.0         | 1.4          | 0.2         | setosa     |
| 51  | 7.0          | 3.2         | 4.7          | 1.4         | versicolor |
| 52  | 6.4          | 3.2         | 4.5          | 1.5         | versicolor |
| 101 | 6.3          | 3.3         | 6.0          | 2.5         | virginica  |
| 102 | 5.8          | 2.7         | 5.1          | 1.9         | virginica  |
- How to convert the real data to \(X\)?
Feature Vectorization and Formula Interface
- \(X\) is usually constructed via model.matrix in R.
model.matrix(~ ., iris.demo)
|     | (Intercept) | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Speciesversicolor | Speciesvirginica |
|-----|-------------|--------------|-------------|--------------|-------------|-------------------|------------------|
| 1   | 1           | 5.1          | 3.5         | 1.4          | 0.2         | 0                 | 0                |
| 2   | 1           | 4.9          | 3.0         | 1.4          | 0.2         | 0                 | 0                |
| 51  | 1           | 7.0          | 3.2         | 4.7          | 1.4         | 1                 | 0                |
| 52  | 1           | 6.4          | 3.2         | 4.5          | 1.5         | 1                 | 0                |
| 101 | 1           | 6.3          | 3.3         | 6.0          | 2.5         | 0                 | 1                |
| 102 | 1           | 5.8          | 2.7         | 5.1          | 1.9         | 0                 | 1                |
Categorical Feature in R
- A categorical variable with \(K\) categories is transformed into a \((K-1)\)-dimensional vector.
- There are many coding systems; the most commonly used is Dummy Coding.
- The first category is transformed to \(\vec{0}\).
Dummy Coding
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1
Categorical Feature in Machine Learning
- Predictive analysis
- Regularization
- A categorical variable with \(K\) categories is transformed into a \(K\)-dimensional vector.
- Missing data are transformed to \(\vec{0}\).
contr.treatment(levels(iris.demo$Species), contrasts = FALSE)
## setosa versicolor virginica
## setosa 1 0 0
## versicolor 0 1 0
## virginica 0 0 1
Limitation of model.matrix
- The vectorization requires knowing all the categories in advance.
contr.treatment(levels(iris$Species))
## versicolor virginica
## setosa 0 0
## versicolor 1 0
## virginica 0 1
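By contrast, hashed.model.matrix fixes the column space with hash.size, so the matrix layout does not depend on observing every category. A minimal sketch (not from the original deck):

library(FeatureHashing)
# The layout of the hashed matrix depends only on hash.size,
# not on which categories happen to appear in each subset.
m1 <- hashed.model.matrix(~ Species, iris[1:50, ],   hash.size = 2^4)
m2 <- hashed.model.matrix(~ Species, iris[51:150, ], hash.size = 2^4)
ncol(m1) == ncol(m2)  # TRUE with the package defaults used throughout this deck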
Observations
- Mapping features to \(\{0, 1, 2, ..., K\}\) is one way to vectorize a feature.
- setosa => \(\vec{e_1}\)
- versicolor => \(\vec{e_2}\)
- virginica => \(\vec{e_3}\)
## setosa versicolor virginica
## setosa 1 0 0
## versicolor 0 1 0
## virginica 0 0 1
Pros of FeatureHashing
A Good Companion for Online Algorithms
library(FeatureHashing)
library(data.table)  # for fread
hash_size <- 2^16
w <- numeric(hash_size)
# Process the data chunk by chunk: the hashed feature space is fixed,
# so the weight vector w can be updated online.
for(i in 1:1000) {
  data <- fread(paste0("criteo", i))
  X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size)
  y <- data$V1
  update_w(w, X, y)  # update_w: a placeholder for any online learner's update step
}
Pros of FeatureHashing
A Good Companion for Distributed Algorithms
library(pbdMPI)
library(FeatureHashing)
library(data.table)  # for fread
hash_size <- 2^16
w <- numeric(hash_size)
# Each MPI rank reads and hashes its own shard of the data; the hashed
# feature space is identical across ranks because hash_size is shared.
i <- comm.rank()
data <- fread(paste0("criteo", i))
X <- hashed.model.matrix(V1 ~ ., data, hash.size = hash_size)
y <- data$V1
# ...
Pros of FeatureHashing
Simple Training and Testing
library(FeatureHashing)
model <- is_click ~ ad * (url + ip)
m_train <- hashed.model.matrix(model, data_train, hash.size = hash_size)
m_test <- hashed.model.matrix(model, data_test, hash.size = hash_size)
Cons of FeatureHashing
Loss of Interpretation
- Collisions make interpretation harder.
- It is inconvenient to map indices back to features.
m <- hashed.model.matrix(~ Species, iris, hash.size = 2^4, create.mapping = TRUE)
hash.mapping(m) %% 2^4
## Speciessetosa Speciesvirginica Speciesversicolor
## 7 13 8
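One way to go back from a hashed bucket to the feature names that map to it, reusing the mapping created above (a small sketch, not from the original deck):

mapping <- hash.mapping(m) %% 2^4  # feature name -> hashed bucket, as above
names(mapping)[mapping == 7]       # "Speciessetosa"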
Formula Interface: +
+ is the operator for combining linear predictors
model.matrix(~ a + b, data.demo)
|     | (Intercept) | a   | b   |
|-----|-------------|-----|-----|
| 1   | 1           | 5.1 | 3.5 |
| 2   | 1           | 4.9 | 3.0 |
| 51  | 1           | 7.0 | 3.2 |
| 52  | 1           | 6.4 | 3.2 |
| 101 | 1           | 6.3 | 3.3 |
| 102 | 1           | 5.8 | 2.7 |
Formula Interface: :
: is the interaction operator
model.matrix(~ a + b + a:b, data.demo)
|     | (Intercept) | a   | b   | a:b   |
|-----|-------------|-----|-----|-------|
| 1   | 1           | 5.1 | 3.5 | 17.85 |
| 2   | 1           | 4.9 | 3.0 | 14.70 |
| 51  | 1           | 7.0 | 3.2 | 22.40 |
| 52  | 1           | 6.4 | 3.2 | 20.48 |
| 101 | 1           | 6.3 | 3.3 | 20.79 |
| 102 | 1           | 5.8 | 2.7 | 15.66 |
Formula Interface: *
* is the cross product operator
# a + b + a:b
model.matrix(~ a * b, data.demo)
|     | (Intercept) | a   | b   | a:b   |
|-----|-------------|-----|-----|-------|
| 1   | 1           | 5.1 | 3.5 | 17.85 |
| 2   | 1           | 4.9 | 3.0 | 14.70 |
| 51  | 1           | 7.0 | 3.2 | 22.40 |
| 52  | 1           | 6.4 | 3.2 | 20.48 |
| 101 | 1           | 6.3 | 3.3 | 20.79 |
| 102 | 1           | 5.8 | 2.7 | 15.66 |
Formula Interface: (
: and * are distributive over +
# a:c + b:c
model.matrix(~ (a + b):c, data.demo)
|     | (Intercept) | a:c   | b:c   |
|-----|-------------|-------|-------|
| 1   | 1           | 7.14  | 4.90  |
| 2   | 1           | 6.86  | 4.20  |
| 51  | 1           | 32.90 | 15.04 |
| 52  | 1           | 28.80 | 14.40 |
| 101 | 1           | 37.80 | 19.80 |
| 102 | 1           | 29.58 | 13.77 |
Formula Interface: .
. means all columns of the data.
# ~ Sepal.Length + Sepal.Width + Petal.Length +
# Petal.Width + Species
model.matrix(~ ., iris.demo)
|     | (Intercept) | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Speciesversicolor | Speciesvirginica |
|-----|-------------|--------------|-------------|--------------|-------------|-------------------|------------------|
| 1   | 1           | 5.1          | 3.5         | 1.4          | 0.2         | 0                 | 0                |
| 2   | 1           | 4.9          | 3.0         | 1.4          | 0.2         | 0                 | 0                |
| 51  | 1           | 7.0          | 3.2         | 4.7          | 1.4         | 1                 | 0                |
| 52  | 1           | 6.4          | 3.2         | 4.5          | 1.5         | 1                 | 0                |
| 101 | 1           | 6.3          | 3.3         | 6.0          | 2.5         | 0                 | 1                |
| 102 | 1           | 5.8          | 2.7         | 5.1          | 1.9         | 0                 | 1                |
Tips
terms.formula and its argument specials
tf <- terms.formula(~ Plant * Type + conc * split(Treatment), specials = "split",
                    data = CO2)
attr(tf, "factors")
## Plant Type conc split(Treatment) Plant:Type
## Plant 1 0 0 0 1
## Type 0 1 0 0 1
## conc 0 0 1 0 0
## split(Treatment) 0 0 0 1 0
## conc:split(Treatment)
## Plant 0
## Type 0
## conc 1
## split(Treatment) 1
Specials
attr(tf, "specials")
tells which rows of attr(tf, "factors")
need to be parsed further
rownames(attr(tf, "factors"))
## [1] "Plant" "Type" "conc"
## [4] "split(Treatment)"
attr(tf, "specials")
## $split
## [1] 4
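Putting the two attributes together, the special rows can be picked out programmatically (a small sketch using the objects above):

special_rows <- attr(tf, "specials")$split
rownames(attr(tf, "factors"))[special_rows]
## [1] "split(Treatment)"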
Parse
parse and getParseData extract the information from the special terms.
options(keep.source = TRUE)
p <- parse(text = rownames(attr(tf, "factors"))[4])
getParseData(p)
## line1 col1 line2 col2 id parent token terminal text
## 9 1 1 1 16 9 0 expr FALSE
## 1 1 1 1 5 1 3 SYMBOL_FUNCTION_CALL TRUE split
## 3 1 1 1 5 3 9 expr FALSE
## 2 1 6 1 6 2 9 '(' TRUE (
## 4 1 7 1 15 4 6 SYMBOL TRUE Treatment
## 6 1 7 1 15 6 9 expr FALSE
## 5 1 16 1 16 5 9 ')' TRUE )
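From the parse data, the variable wrapped by split() can then be recovered (a small sketch):

pd <- getParseData(p)
pd$text[pd$token == "SYMBOL"]
## [1] "Treatment"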
Summary
- Pros of FeatureHashing
    - Makes feature vectorization for predictive analysis easier.
    - Makes it easier to tokenize text data.
- Cons of FeatureHashing
    - Decreases the prediction accuracy if the hash size is too small.
    - Interpretation becomes harder.
When should I use FeatureHashing?
Short answer: predictive analysis with a large number of categories.