BUSINESS DATA SCIENCEの続き。

データなどは作者のgitにある。

Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions

作者:Taddy, Matt
発売日: 2019/08/21
メディア: ハードカバー

最近「ビジネスデータサイエンスの教科書」という邦題で日本語版も出たみたいです

ビジネスデータサイエンスの教科書

作者:マット・タディ
発売日: 2020/07/22
メディア: 単行本

この章はなにか

分類問題について。

この節はなにか

k-NNとlassoロジスティクス回帰での分類を通して分類の基本的な部分を学ぶ。

Nearest Neighbors

1,0の2分類問題について考える。

k-NNという手法は、多次元空間を用いて最も近いk個の正解データをもとに各分類の確率を求める。例えば、赤白黒が分類としてあり、近い5つの正解データが白白赤黒赤の場合、白である確率は2/5、赤は2/5、黒は1/5となる。なお、この際に各次元は等価だしk個内での近さも等価となる。

実データ

ガラスについてのデータ。

github.com

#### ******* Forensic Glass ****** ####

library(MASS) ## a library of example datasets
data(fgl) ## loads the data into R; see help(fgl)
head(fgl)
# RI    Na   Mg   Al    Si    K   Ca Ba   Fe type
# 1  3.01 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00 WinF
# 2 -0.39 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00 WinF
# 3 -1.82 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00 WinF
# 4 -0.34 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00 WinF
# 5 -0.58 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00 WinF
# 6 -2.04 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26 WinF

タイプ毎の各変数の分布は以下のようになっている(図は4つのみ表示)。

par(mfrow=c(2,3))
plot(RI ~ type, data=fgl, col=c(grey(.2),2:6))
plot(Al ~ type, data=fgl, col=c(grey(.2),2:6))
plot(Na ~ type, data=fgl, col=c(grey(.2),2:6))
plot(Mg ~ type, data=fgl, col=c(grey(.2),2:6))
plot(Ba ~ type, data=fgl, col=c(grey(.2),2:6))
plot(Si ~ type, data=fgl, col=c(grey(.2),2:6))

f:id:chito_ng:20200324102425p:plain

Baを見るとタイプHead以外はほぼ0であることや、Siをみるとどのタイプでもほぼ同じ値であることがわかる。

テキトーにデータを10個選び、それぞれでの正解データ・k=1、k=5での予測結果は以下のようになる。

## make numerica and scale
## convert columns to mean-zero sd-one

x <- scale(fgl[,1:9]) # column 10 is the class label
apply(x,2,sd) # see ?apply


## use 200 training points to find nearest neighbors for 14

library(class)
test <- sample(1:214,10)
nearest1 <- knn(train=x[-test,], test=x[test,], cl=fgl$type[-test], k=1)
nearest5 <- knn(train=x[-test,], test=x[test,], cl=fgl$type[-test], k=5)
data.frame(fgl$type[test],nearest1,nearest5)
# fgl.type.test. nearest1 nearest5
# 1             Con      Con      Con
# 2            WinF      Veh     WinF
# 3            WinF     WinF     WinF
# 4            Head     Head     Head
# 5           WinNF    WinNF    WinNF
# 6           WinNF     WinF    WinNF
# 7            WinF      Veh     WinF
# 8            WinF    WinNF    WinNF
# 9             Veh      Veh     WinF
# 10          WinNF    WinNF     WinF

このとき、Conのk=1での正解率は5/10、k=5の正解率は7/10となる。

k-nnの問題点として、kを増やすほど良い、といったような一般性がなく、kの数のうちどれが適切かが定まっていないしデータによっても変わる。そのため、CVはk-NNで使用することはできない。

Probability, Cost, and Classification

分類問題でのコストは以下で表される。

$E(loss(a)) = \sum_{k} p_k c(a,k)$

aはそのactionで、kは分類で、pはkとなる確率。例えば、0/1分類で0:60%、1:40%となり、0のとき100円がかかり1のとき-100円がかかる(100円もらえる)actionの場合、0.6100 + 0.4(-100) = 20となる。以降ではこのloss(a)を最小化するようにモデルを調整する。

github.com

実データ

諸々処理をした以下のcreditヒストリーデータを使う

head(credit)
# Default duration amount installment age  history      purpose foreign  rent
# 1       0        6   1169           4  67 terrible goods/repair foreign FALSE
# 2       1       48   5951           2  22     poor goods/repair foreign FALSE
# 3       0       12   2096           2  49 terrible          edu foreign FALSE
# 4       0       42   7882           2  45     poor goods/repair foreign FALSE
# 5       1       24   4870           3  53     poor       newcar foreign FALSE
# 6       0       36   9055           2  35     poor          edu foreign FALSE
dim(credit)
#  1000    9

以下のnaref関数を用いて、sparseに扱えるようにfactorを交互作用付きでダミー化してからCV lassoを行う。

github.com

## build a design matrix 
library(gamlr)
source("naref.R")
credx <- sparse.model.matrix( Default ~ .^2, data=naref(credit))[,-1]
default <- credit$Default
credscore <- cv.gamlr(credx, default, family="binomial", verb=TRUE)

par(mfrow=c(1,2))
plot(credscore$gamlr)
plot(credscore)

f:id:chito_ng:20200324191603p:plain

このモデルを用いてdefaultしたかの0/1となる確率を予測した結果は以下となる。

## What are the underlying default probabilities
## In sample probability estimates
pred <- predict(credscore$gamlr, credx, type="response")
pred <- drop(pred) # remove the sparse Matrix formatting
boxplot(pred ~ default, xlab="default", ylab="prob of default", col=c("pink","dodgerblue"))

f:id:chito_ng:20200324191842p:plain

このとき、データがpによって完全に分割できずに被っていることから、完全に分類することはできない。そのときに、いい感じのpを考えるために、以下のような0/1と予測したうち真に0/1を予測できた率となる指標を考える。

$False Positive Rate = \frac{False Positive}{True Positive + False Positive}$
$False Negative Rate = \frac{False Negative}{True Negative + False Negative}$

また、0/1と予測したうち実際0/1だった率を用いる場合もある。

$Sensitivity = \frac{True Positive}{True Positive + False Negative}$
$Specificity = \frac{True Negative}{True Negative + False Positive}$

以下では閾値としてpが0.2以上を1とする。

## what are our misclassification rates?
rule <- 1/5 # move this around to see how these change

sum( (pred>rule)[default==0] )/sum(pred>rule) ## false positive rate
#  0.6059744
sum( (pred<rule)[default==1] )/sum(pred<rule) ## false negative rate
# 0.07744108

sum( (pred>rule)[default==1] )/sum(default==1) ## sensitivity
# 0.9233333
sum( (pred<rule)[default==0] )/sum(default==0) ## specificity
# 0.3914286

sensitivityは式からわかるように、全てを1とする閾値を置いた場合では100%になるし、逆にspecificityは全てを0となる閾値を置いた場合では100%になる。そのため、これらのバランスが良い閾値を探索する必要がある。
そのためには、それぞれの閾値を動かしたときの挙動を確認するROC曲線を用いる。

以下では、ISとOOSそれぞれに対してROC曲線を見ている。

# OOS ROC curve
# refit the model using only 1/2 of data
test <- sample.int(1000,500)
credhalf <- gamlr(credx[-test,], default[-test], family="binomial")
predoos <- predict(credhalf, credx[test,], type="response")
defaultoos <- default[test]

## roc curve and fitted distributions
source("roc.R")

par(mai=c(.9,.9,.2,.1), mfrow=c(1,2))
roc(p=pred, y=default, bty="n", main="in-sample")
## our 1/5 rule cutoff
points(x= 1-mean((pred<.2)[default==0]), 
       y=mean((pred>.2)[default==1]), 
       cex=1.5, pch=20, col='red') 
## a standard `max prob' (p=.5) rule
points(x= 1-mean((pred<.5)[default==0]), 
       y=mean((pred>.5)[default==1]), 
       cex=1.5, pch=20, col='blue') 
legend("bottomright",fill=c("red","blue"),
       legend=c("p=1/5","p=1/2"),bty="n",title="cutoff")

roc(p=predoos, y=defaultoos, bty="n", main="out-of-sample")
## our 1/5 rule cutoff
points(x= 1-mean((predoos<.2)[defaultoos==0]), 
       y=mean((predoos>.2)[defaultoos==1]), 
       cex=1.5, pch=20, col='red') 
## a standard `max prob' (p=.5) rule
points(x= 1-mean((predoos<.5)[defaultoos==0]), 
       y=mean((predoos>.5)[defaultoos==1]), 
       cex=1.5, pch=20, col='blue')