Wush's Notes

Notes on the bits and pieces I learn along the way

Chinese Font on EC2 Instance


Resolve Chinese font issue on AWS EC2

These are the steps I used to resolve a Chinese font issue on AWS EC2.

Download a Chinese ttf font

  • Download a Chinese .ttf font, for example DFLiYuanXBold1B. Remember to rename the file extension from TTF to ttf.

Install R package extrafont

install.packages('extrafont')

Import ttf font

library(extrafont)
font_import("<path to DFLIYX1B.ttf>")

Summary

I have not tried these steps a second time, so please let me know whether they work for you.

Error Handling in R


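The official R documentation introduces R's exception handling mechanism under Exception handling.

Here I briefly show how to write, in R, something like the try-catch mechanism used by mainstream languages such as Java, C++, and Python.

Everything here is based on R 2.15.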

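Error-related functions

  • warning(...): raise a warning
  • stop(...): raise an exception (an error, in R terms)
  • suppressWarnings(expr): ignore warnings raised while evaluating expr
  • try(expr): try to evaluate expr
  • tryCatch: the approach closest to exception handling in mainstream languages
  • conditionMessage: extract the message from an error or warning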
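As a rough illustration of the first three functions in the list, here is a minimal sketch. The functions half_root and double_it are made up for this example and are not part of any package.

```r
# half_root() is a hypothetical function: it warns on negative input
# but still returns a value, because warning() does not stop evaluation.
half_root <- function(x) {
  if (x < 0) warning("x is negative, using abs(x)")
  sqrt(abs(x)) / 2
}

# suppressWarnings() discards the warning and keeps the result.
quiet <- suppressWarnings(half_root(-16))

# double_it() is a hypothetical function: stop() aborts it with an error.
double_it <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  x * 2
}

# Wrapping the call in try() keeps the script running after the error.
failed <- try(double_it("a"), silent = TRUE)
ok <- double_it(21)
```

Here quiet holds the computed value with no warning shown, failed holds an object describing the error, and ok holds an ordinary result.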

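How R differs from other mainstream languages

R handles exceptions through functions rather than through syntax such as try … catch …, as other mainstream languages do. This is because almost every feature of R is implemented as a function. See Every operation is a function call.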
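A small sketch of what "every operation is a function call" means in practice: operators and even control flow can be called by name using backticks. The variable names are only for illustration.

```r
# The + operator is an ordinary function that can be called by name.
sum_result <- `+`(1, 2)

# Even `if` is a function: condition, value if TRUE, value if FALSE.
branch <- `if`(TRUE, "yes", "no")

# Because operators are functions, they can be passed to other functions.
negated <- sapply(1:3, `-`)
```

Seen this way, it is less surprising that try and tryCatch are functions too rather than special syntax.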

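A try example

The first function I discovered was try. Roughly speaking, try returns either the result of expr or the error raised while evaluating it.

result <- try(expr, silent = TRUE)  # expr stands for the code that may fail
if (inherits(result, "try-error")) {
  # handle the error
}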
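To make the pattern concrete, here is a minimal runnable sketch that uses try to keep a loop going when one input fails. The inputs are made up for the example, and log() is used only because it fails on non-numeric input.

```r
inputs <- list(1, "a", 3)              # the second element makes log() fail
results <- vector("list", length(inputs))

for (i in seq_along(inputs)) {
  r <- try(log(inputs[[i]]), silent = TRUE)
  if (inherits(r, "try-error")) {
    results[[i]] <- NA_real_           # record a missing value and continue
  } else {
    results[[i]] <- r
  }
}
```

inherits() is used for the check because class() can return more than one value, which makes a plain == comparison fragile inside if.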

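Since R was my first language, I simply accepted this. Only later, when I came across the try-catch mechanism of mainstream languages, did it start to feel odd.

A tryCatch example

Later I found that the tryCatch function offers error handling that is closer to the try-catch mechanism.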

tryCatch({
  result <- expr
}, warning = function(w) {
  # handle the warning
}, error = function(e) {
  # handle the error
}, finally = {
  # clean up
})

This syntax is much closer to the mechanism in other mainstream languages.
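A minimal runnable sketch of the three branches. safe_log is a hypothetical wrapper written for this example; it returns NaN when log() warns and NA when log() fails.

```r
# safe_log() is a hypothetical helper, not part of base R.
safe_log <- function(x) {
  tryCatch({
    log(x)
  }, warning = function(w) {
    NaN                      # log(-1) warns, so this branch is taken
  }, error = function(e) {
    NA_real_                 # log("a") fails, so this branch is taken
  }, finally = {
    message("safe_log finished")   # runs in every case
  })
}

normal  <- safe_log(10)
warned  <- safe_log(-1)
errored <- safe_log("a")
```

Note that finally is a named argument holding an expression, not a handler function, so it is written as finally = { ... }.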

conditionMessage

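Sometimes, when an error occurs that I cannot handle, I need to return the error message directly to the user or log it. In that case we can use conditionMessage inside tryCatch to extract the error message.

tryCatch({
  stop("demo error")
}, error = function(e) {
  conditionMessage(e) # this is "demo error"
})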

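Issues related to error handling

In my experience, writing robust code is far from easy. Software engineering has many articles on how to write such code.

Most R scripts are written to be used only once, so whether the program is robust is beside the point, and as a result people rarely use R's exception handling mechanism.

Some R users need to write automated scripts. Only then, in order to keep a loop from being interrupted, do they start using exception handling.

Once you are writing a package, however, exception handling becomes important. The users of your functions are no longer just you, but also other users and even other developers. At that point exception handling becomes something of a philosophy. I only have a superficial understanding of it myself, so below I list just the few issues I know of:

  • Releasing resources: when an error occurs, the function exits at an abnormal point, so any resources acquired inside it (a database connection that must be closed, for example) need to be released. In C++, techniques such as RAII guarantee that resources are not left unreleased. In R, I do not yet know of a comparable safety mechanism.
  • Exception safety guarantees: when users build complex programs on top of some functions, they usually want those functions not to fail. Exception safety is about exactly this issue. After all, if the functions you call can raise exceptions, so can your own function, just as floors built on top of an unsafe building are bound to be unsafe too. I have not yet seen R features in this area either.
  • Error messages: when an error occurs, does the message help the user find its cause? R is also weak here, which means that without considerable experience you cannot make sense of the error messages when debugging R programs.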
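On the resource-release item, one base R tool worth looking at is on.exit(), which registers code to run when a function exits, whether normally or through an error. The sketch below is only an illustration, not a full answer to the RAII question; read_first_line is a hypothetical helper that uses a file connection as the resource.

```r
# read_first_line() is a hypothetical helper written for this example.
read_first_line <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)   # runs on normal return and on error
  line <- readLines(con, n = 1)
  if (length(line) == 0) stop("file is empty: ", path)
  line
}

# Usage with temporary files: one with content, one empty.
tmp <- tempfile()
writeLines(c("hello", "world"), tmp)
first <- read_first_line(tmp)

empty <- tempfile()
file.create(empty)
failure <- try(read_first_line(empty), silent = TRUE)
```

In both calls the connection is closed by the registered handler, including the second one, where the function exits through stop().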


Slidy and Scianimator


In knitr, there is a hook for creating animation with javascript:

hook_scianimator

However, if you use it directly with pandoc and slidy, the animation is not rendered correctly, because the .html file created by pandoc does not include the script sources that scianimator requires.

Yesterday I successfully integrated scianimator into slidy.

Environment

  • Ubuntu 12.04 and Ubuntu 12.10
  • pandoc 1.10.0.4
  • R 2.15.2
  • knitr 1.0.5

Hacks

  • Download the zip file from Scianimator.
  • Copy its assets subdirectory into your project.
  • Copy the original template used by pandoc, /<path to pandoc>/data/templates/default.slidy, to slidy/slidy.scianimator.
  • Add the following lines to the copied template:

from:

  ...
  <script src="$slidy-url$/scripts/slidy.js.gz"
    charset="utf-8" type="text/javascript"></script>
  ...

to:

  ...
  <script src="$slidy-url$/scripts/slidy.js.gz"
    charset="utf-8" type="text/javascript"></script>
  <script src="assets/js/jquery-1.4.4.min.js"></script>
  <script src="assets/js/jquery.scianimator.pack.js"></script>
  <script src="assets/js/jquery.scianimator.js"></script>
  <script src="assets/js/index.js"></script>
  ...
  • Use the following pandoc arguments:
pandoc -s -S -i -t slidy --template=slidy/slidy.scianimator --mathjax src.md -o target.html
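If you would rather not edit the template by hand, the template change could be scripted in R. This is only a sketch: add_scianimator is a hypothetical helper, and it assumes the template contains the slidy.js script element in the form shown in the listing.

```r
# Script tags copied from the modified template listing.
scianimator_tags <- c(
  '  <script src="assets/js/jquery-1.4.4.min.js"></script>',
  '  <script src="assets/js/jquery.scianimator.pack.js"></script>',
  '  <script src="assets/js/jquery.scianimator.js"></script>',
  '  <script src="assets/js/index.js"></script>'
)

# Hypothetical helper: returns the template lines with the tags inserted
# right after the line that closes the slidy.js <script> element.
add_scianimator <- function(template_lines, tags = scianimator_tags) {
  start <- grep("slidy.js", template_lines, fixed = TRUE)[1]
  if (is.na(start)) stop("slidy.js script tag not found in template")
  rest <- template_lines[start:length(template_lines)]
  offset <- grep("</script>", rest, fixed = TRUE)[1]
  if (is.na(offset)) stop("slidy.js script tag is never closed")
  append(template_lines, tags, after = start + offset - 1)
}

# Possible usage, with the template path from the steps above:
# lines <- readLines("slidy/slidy.scianimator")
# writeLines(add_scianimator(lines), "slidy/slidy.scianimator")
```

The helper works on a character vector so that it can be tried on a few lines before touching the real template file.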

That’s it!

Using Eclipse CDT to Develop Rcpp Package


RStudio is great, but it lacks some useful C/C++ features that modern IDEs provide, such as tracing. Eclipse CDT is a good choice, but setting up the project correctly is complicated.

I wrote a CMake script that generates an Eclipse CDT project for developing an Rcpp package.

Environment

  • CMake >= 2.8.7
  • Eclipse >= 3.7
  • Eclipse CDT >= 1.4.2
  • R >= 2.15
  • Rcpp >= 0.10

Configuration

  • Download FindLibR.cmake from the RStudio repository on GitHub.

  • Generate an Rcpp package, for example:

library(Rcpp)
Rcpp.package.skeleton("RcppPackage")
  • Put the following file, named CMakeLists.txt, in the generated folder (RcppPackage in the example above):
CMakeLists.txt
cmake_minimum_required(VERSION 2.8)
project(RcppPackage)
find_package(LibR)
if(NOT LIBR_FOUND)
  message(FATAL_ERROR "No R...")
endif()
message(STATUS ${CMAKE_SOURCE_DIR})
execute_process(
    COMMAND ${LIBR_EXECUTABLE} "--slave" "-e" "stopifnot(require('Rcpp'));cat(Rcpp:::Rcpp.system.file('include'))"
    OUTPUT_VARIABLE LIBRCPP_INCLUDE_DIRS
    )
include_directories(BEFORE ${LIBR_INCLUDE_DIRS})
message(STATUS ${LIBR_INCLUDE_DIRS})
include_directories(BEFORE ${LIBRCPP_INCLUDE_DIRS})
message(STATUS ${LIBRCPP_INCLUDE_DIRS})
add_custom_target(RcppPackage ALL
  COMMAND find ${CMAKE_SOURCE_DIR} -name "*.o" -exec rm "{}" "\;"
  COMMAND find ${CMAKE_SOURCE_DIR} -name "*.so" -exec rm "{}" "\;"
  COMMAND ${LIBR_EXECUTABLE} "--slave" "-e" "\"stopifnot(require(roxygen2));roxygenize('${CMAKE_SOURCE_DIR}',roclets=c('rd','collate','namespace'))\""
  COMMAND ${LIBR_EXECUTABLE} CMD INSTALL "${CMAKE_SOURCE_DIR}")
  • Customize CMakeLists.txt as needed, for example the roxygenize and R CMD INSTALL steps.

  • Generate the project with cmake:

mkdir build # create this outside RcppPackage, not as a subdirectory of it
cd build
cmake -G "Eclipse CDT4 - Unix Makefiles" <path to RcppPackage> -DCMAKE_ECLIPSE_GENERATE_SOURCE_PROJECT=TRUE
  • Open Eclipse and import the project from build (see cmake-eclipse-cdt for an example). After indexing, you can use the features Eclipse CDT provides, such as tracing and autocomplete.

  • Building the project in Eclipse runs R CMD INSTALL, or whatever else CMakeLists.txt specifies.

Xts and Rcpp


Here is my guide to integrating xts with Rcpp in an R package.

Because the xts API is written for C, we need a few hacks to make it work with C++.

Modify DESCRIPTION

Depends: xts, Rcpp
LinkingTo: xts, Rcpp

Create files in src directory

xts_api.c
#include <xts.h>
#include <xts_stubs.c>
xts_api.h
extern "C" {
#define class xts_class
#include <xts.h>
#undef class
}


inline SEXP install(const char* x) {
  return Rf_install(x);
}

inline SEXP getAttrib(SEXP a, SEXP b) {
  return Rf_getAttrib(a, b);
}


inline SEXP setAttrib(SEXP a, SEXP b, SEXP c) {
  return Rf_setAttrib(a, b, c);
}

Without the #define class xts_class macro, compiling xts_api.h fails with:

error: expected identifier before ) token

because xts.h uses class, which is a keyword in C++, as an identifier.

Without the inline functions, compiling xts_api.h fails with errors such as:

error: install was not declared in this scope
error: getAttrib was not declared in this scope

Now almost all of the API can be called from C++:

rcpp_test.cpp
#include <Rcpp.h>

#include "xts_api.h"

using namespace Rcpp;

RcppExport SEXP get_xts_index(SEXP x) {
  BEGIN_RCPP

  return GET_xtsIndex(x);

  END_RCPP
}

The exception is SET_xtsIndexClass(x, value), which fails to compile:

error: ‘xts_IndexvalueSymbol’ was not declared in this scope

My guess is that xts_IndexvalueSymbol should be replaced with xts_IndexClassSymbol.

Reference

  • file.show(system.file('api_example/README', package="xts"))

Rtwmap


source: https://github.com/wush978/Rtwmap

Plot Data

library(Rtwmap)
## Loading required package: sp
data(village2010)
plot(village2010)

plot of chunk village2010

data(county1984)
plot(county1984)

plot of chunk county1984

data(county2010)
plot(county2010)

plot of chunk county2010

data(town1984)
plot(town1984)

plot of chunk town1984

data(town2010)
plot(town2010)

plot of chunk town2010

Coloring

Random colors

data(county1984)
random.color <- as.factor(sample(1:3, length(county1984), TRUE))
color <- rainbow(3)
county1984$random.color <- random.color
spplot(county1984, "random.color", col.regions = color, main = "Taiwan Random Color")

plot of chunk county1984-color

Population

population <- read.csv("population.csv", sep = "\t", header = FALSE)
data(county2010)
rownames(population) <- as.character(population$V2)
population <- population[as.character(county2010$county), "V4"]
col <- heat.colors(max(population))[max(population):1]
county2010$population <- population
spplot(county2010, "population", col.regions = col, main = "Population of Taiwan")

plot of chunk population

Unicode Escape in R


Introduction

I recently needed to analyze Chinese text and ran into the problem of unicode escapes.

Besides appearing in the data I downloaded, they also show up when converting to JSON:

> library(rjson)
> toJSON("測試")
[1] "\"\\u6e2c\\u8a66\""
> toJSON("測試", "R")
[1] "\"測試\""

The \u6e2c\u8a66 in the middle is a unicode escape.

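How the fix works

In \u6e2c above, \u is the header and 6e2c is the hex code of the character in UTF-16BE encoding.

Once you understand this, it is easy to build your own fix:

  • Use a regular expression (with gregexpr, for example) to locate \\u[0-9a-f]{4,4}
  • Use iconv to convert the two bytes that follow from UTF-16BE back to UTF-8

A very weak implementation

However, I could not find a native hex-to-string function in R, so I ended up writing two functions of my own, and their performance is poor.

Now that I know the principle, though, I may write a C++ solution when I have time.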
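The two steps (locate the escapes, then convert the two bytes from UTF-16BE to UTF-8) could be sketched in base R as follows. This is not the implementation mentioned in this post. unescape_unicode is a hypothetical name, the sketch relies on strtoi() and as.raw() to turn the hex digits into bytes, and it handles one string at a time without special treatment of surrogate pairs.

```r
# Hypothetical helper: replace \uXXXX escapes in a single string
# with the characters they stand for.
unescape_unicode <- function(x) {
  # Step 1: locate every backslash-u followed by four hex digits.
  m <- gregexpr("\\\\u[0-9a-fA-F]{4}", x)
  escapes <- regmatches(x, m)[[1]]
  if (length(escapes) == 0) return(x)

  # Step 2: turn the four hex digits into two bytes and let iconv
  # convert them from UTF-16BE to UTF-8.
  decoded <- vapply(escapes, function(esc) {
    hex <- substring(esc, 3)                      # drop the leading \u
    bytes <- as.raw(strtoi(c(substr(hex, 1, 2), substr(hex, 3, 4)), base = 16L))
    iconv(list(bytes), from = "UTF-16BE", to = "UTF-8")
  }, character(1), USE.NAMES = FALSE)

  # Put the decoded characters back in place of the escapes.
  regmatches(x, m) <- list(decoded)
  x
}

# "\\u6e2c\\u8a66" is the R literal for the text \u6e2c\u8a66
decoded_text <- unescape_unicode("\\u6e2c\\u8a66")
plain_text <- unescape_unicode("no escapes here")
```

Calling iconv once per escape is simple but not fast, so for large inputs a vectorized or compiled version would still be worth writing.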

R Package Installation Tips on Ubuntu


rgl

sudo apt-get install r-cran-rgl

RBGL, R interface to the Boost Graph Library

#! /usr/bin/R
source("http://bioconductor.org/biocLite.R")
biocLite("RBGL")

Benchmark of Saving and Loading R Objects


Introduction

This post compares the speed of saving R objects to MongoDB and loading them back, with and without serialization.

Environment

  • OpenVZ with Ubuntu 12.04, i7-2600 CPU @ 3.4GHz, 2 processors, 4G RAM
  • Local MongoDB
  • Local PostgreSQL
  • R 2.14.1
  • rmongodb 1.0.3
  • RPostgreSQL 0.3-2

Initialize

sudo apt-get install mongodb

R

Install libpq-dev:
sudo apt-get install libpq-dev
Install the R packages:
install.packages("rmongodb")
install.packages("RPostgreSQL")

Benchmark

Test saving object serialized or not
{ # loading package
  library(rmongodb)
  mongo <- mongo.create()
  if (!mongo.is.connected(mongo)) {
    stop("disconnected")
  }
}

save1 <- function(a) {
  for(i in 1:repeat.time) {
    b <- mongo.bson.from.list(list(Rdata = a))
    mongo.insert(mongo, "test.save1", b)
  }
}

load1 <- function() {
  result <- list()
  length(result) <- repeat.time
  cursor <- mongo.find(mongo, "test.save1")
  index <- 1
  while(mongo.cursor.next(cursor)) {
    result[[index]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
    index <- index + 1
  }
  result
}

save2 <- function(a) {
  for(i in 1:repeat.time) {
    buf <- mongo.bson.buffer.create()
    mongo.bson.buffer.append(buf, "Rdata", serialize(a, NULL, FALSE))
    mongo.insert(mongo, "test.save2", mongo.bson.from.buffer(buf))
  }
}

load2 <- function() {
  result <- list()
  length(result) <- repeat.time
  cursor <- mongo.find(mongo, "test.save2")
  index <- 1
  while(mongo.cursor.next(cursor)) {
    result[[index]] <- unserialize(mongo.bson.value(mongo.cursor.value(cursor), "Rdata"))
    index <- index + 1
  }
  result
}

repeat.time <- 1000
mongo.drop.database(mongo, "test")
a <- matrix(rnorm(100^2), 100, 100)
system.time({ #direct way
  print("directly save and load")
  save1(a)
  a.result <- load1()
})
system.time({ #serialized way
  print("serialized before save and load")
  save2(a)
  a.result2 <- load2()
})
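The serialized path relies on serialize() turning an R object into a raw vector and unserialize() restoring it. The round trip can be checked on its own, without MongoDB, with a small base R sketch. The seed and variable names are only for illustration.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
a <- matrix(rnorm(100^2), 100, 100)

# connection = NULL returns a raw vector; ascii = FALSE selects binary format.
raw_bytes <- serialize(a, NULL, FALSE)
restored <- unserialize(raw_bytes)

round_trip_ok <- identical(a, restored)
```

The raw vector is what gets stored under the "Rdata" key in the serialized variant of the benchmark.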

I ran the test many times and noticed that the results are quite unstable, but the serialized approach appears to be slightly faster.

Some of the results:

Test saving object serialized or not
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.226   0.083   4.221
[1] "serialized before save and load"
   user  system elapsed
  0.746   0.095   3.578
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.227   0.090   3.981
[1] "serialized before save and load"
   user  system elapsed
  0.771   0.106   3.327
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.232   0.104   3.808
[1] "serialized before save and load"
   user  system elapsed
  0.760   0.110   3.289
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.303   0.078   3.827
[1] "serialized before save and load"
   user  system elapsed
  0.763   0.109   3.413
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.237   0.089   3.834
[1] "serialized before save and load"
   user  system elapsed
  0.773   0.091   3.458
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.247   0.114   3.970
[1] "serialized before save and load"
   user  system elapsed
  0.781   0.110   3.738
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.331   0.142   4.329
[1] "serialized before save and load"
   user  system elapsed
  0.753   0.098   3.202
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.217   0.090   3.766
[1] "serialized before save and load"
   user  system elapsed
  0.737   0.097   5.339
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.231   0.103   3.875
[1] "serialized before save and load"
   user  system elapsed
  0.751   0.105   3.377
rmongodb package (mongo-r-driver) loaded
Use 'help("mongo")' to get started.

[1] TRUE
[1] "directly save and load"
   user  system elapsed
  1.202   0.085   6.935
[1] "serialized before save and load"
   user  system elapsed
  0.752   0.082   3.996
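Since the elapsed times vary a lot between runs, a quick summary helps. The sketch below copies the elapsed column from the ten runs listed above and compares medians, which are less sensitive to the occasional slow run than means.

```r
# Elapsed seconds copied from the ten runs above, in order.
direct     <- c(4.221, 3.981, 3.808, 3.827, 3.834, 3.970, 4.329, 3.766, 3.875, 6.935)
serialized <- c(3.578, 3.327, 3.289, 3.413, 3.458, 3.738, 3.202, 5.339, 3.377, 3.996)

median_direct     <- median(direct)
median_serialized <- median(serialized)

# Number of runs in which the serialized variant finished first.
wins_serialized <- sum(serialized < direct)

cat("median elapsed, direct:    ", median_direct, "\n")
cat("median elapsed, serialized:", median_serialized, "\n")
cat("serialized faster in", wins_serialized, "of", length(direct), "runs\n")
```

With only ten runs on a shared OpenVZ host this is an indication rather than a firm result, which matches the impression that the serialized way is a little faster.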