Not "obviously" at all……

Addendum:
I had it wrong. The title is mistaken, and the body also does something different from the argument in the posts it references.
(It's not meaningless, so I'll leave it up as its own thing.)
((Though……))
((It is "different" in the sense that the approach differs, i.e. it is mathematically in error, but))
((well, in the sense that it arrives at the same result, I still think k-means is quick and handy.))

# from sklearn's AffinityPropagation.fit
if self.affinity == "precomputed":
    self.affinity_matrix_ = X
elif self.affinity == "euclidean":
    self.affinity_matrix_ = -euclidean_distances(X, squared=True)

So "precomputed" literally means the case where the affinity_matrix has already been computed, and in the first question it really was obvious from the similarity matrix that the points split into {1,2,3} and {4,5}.
(Though the matrix in the second question is asymmetric, and doesn't look like a similarity matrix.)
(It could be one if the metric isn't Euclidean, I suppose.)

The first one is probably just a damping value set too high.
(The default is 0.5.)
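
A quick way to check (a sketch; it reuses the affinities matrix quoted further down, and scikit-learn may warn about non-convergence for some values):

import sklearn.cluster

# hypothetical sweep: watch the labels collapse as damping rises
for d in (0.5, 0.7, 0.9):
    ap = sklearn.cluster.AffinityPropagation(affinity='precomputed', damping=d).fit(affinities)
    print(d, ap.labels_)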

import sklearn.cluster

# affinities is the 5x5 matrix from the first question (defined in the original post below)
clusterer = sklearn.cluster.AffinityPropagation(affinity='precomputed')
result = clusterer.fit(affinities)

from pprint import pprint
pprint(vars(result))

# pull everything the plot needs off the fitted estimator
X = np.array(affinities)
labels = result.labels_
cluster_centers_indices = result.cluster_centers_indices_
n_clusters_ = len(cluster_centers_indices)
colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue']]

for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], '.', color=col)
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], color=col)

{'affinity': 'precomputed',
 'affinity_matrix_': array([[ 0., -1., -2., -6., -7.],
       [-1.,  0., -1., -7., -8.],
       [-2., -1.,  0., -8., -9.],
       [-6., -7., -8.,  0., -1.],
       [-7., -8., -9., -1.,  0.]]),
 'cluster_centers_indices_': array([1, 3], dtype=int64),
 'convergence_iter': 15,
 'copy': True,
 'damping': 0.5,
 'labels_': array([0, 0, 0, 1, 1], dtype=int64),
 'max_iter': 200,
 'n_iter_': 60,
 'preference': None,
 'verbose': False}

[Figure: cluster plot of the AffinityPropagation result]

If you want a visualization, the easiest thing is a heatmap.

import seaborn as sns

X = np.array(affinities)
sns.heatmap(X, annot=True)

[Figure: annotated heatmap of the affinity matrix]

This time it is indeed obvious.

Another good option is showing the network structure with networkx.
Very intuitive and easy to read.
There's the overall picture ("the relational structure should come out like this") and the exploratory work toward it, though.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx

G = nx.complete_graph(5, nx.DiGraph())

X = np.array(affinities)
for i, j in G.edges():
    G.edges[i, j]['weight'] = X[i, j]   # networkx 2.x API (1.x was G.edge[i][j])
weights = [w * 0.2 for _, _, w in G.edges(data='weight', default=0.2)]

pos = nx.spring_layout(G)
nx.draw_networkx(G, pos=pos, width=weights)
plt.axis('off')

[Figures: the directed affinity graph, three spring_layout runs]

If the similarity matrix is symmetric, an undirected graph does just as well as a directed one.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx

G = nx.complete_graph(5)

X = np.array(affinities)
for i, j in G.edges():
    G.edges[i, j]['weight'] = X[i, j]
weights = [w * 0.5 for _, _, w in G.edges(data='weight', default=0.5)]

pos = nx.spring_layout(G)
nx.draw_networkx(G, pos=pos, width=weights)
plt.axis('off')

[Figure: the undirected affinity graph]

# sims is the 5x5 matrix from the second question (defined further down)
sns.heatmap(sims, annot=True)

[Figure: annotated heatmap of sims]

It reads as {1,2,3} and {4,5}, but also as 1 and {2,3}, with 4 and 5 each on their own.

X =  np.array(
    [
        [0, 17, 10, 32, 32], 
        [18, 0, 6, 20, 15], 
        [10, 8, 0, 20, 21], 
        [30, 16, 20, 0, 17], 
        [30, 15, 21, 17, 0]
    ]
)

G = nx.complete_graph(5, nx.DiGraph())
for i, j in G.edges():
    G.edges[i, j]['weight'] = X[i, j]
weights = [w * 0.1 for _, _, w in G.edges(data='weight', default=0.1)]

pos = nx.spring_layout(G, k=1)
nx.draw_networkx(G, pos=pos, width=weights)
plt.axis('off')

[Figures: the directed sims graph, two spring_layout runs]

Could be {1,2,3} and {4,5}; could be that, within {1,2,3}, there is some distance between 1 and 2; and 4 and 5 could be together or separate.
You can see quite clearly how borderline it all is.

To make it clearer, removing the edges with weight > 20:
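
The filtering itself is a one-liner with networkx. A sketch, assuming the G and X built in the cell above (edge access in the networkx 2.x style):

# drop the weak edges (large distance = weak similarity) and redraw
threshold = 20   # 15 for the second figure below
G.remove_edges_from([(i, j) for i, j, w in G.edges(data='weight') if w > threshold])
pos = nx.spring_layout(G, k=1)
nx.draw_networkx(G, pos=pos)
plt.axis('off')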

[Figure: sims graph with weight > 20 edges removed]

Removing the edges with weight > 15 instead:

[Figure: sims graph with weight > 15 edges removed]

Something like 1 and {2,3}, with 4 and 5 each on their own.

Clustering from an Affinity matrix with Affinity propagation in scikit-learn – StackOverflow

First off, I don't really understand the data structure, but anyway.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style('darkgrid')

affinities = [
        [ 0., -1., -2., -6., -7.],
        [-1.,  0., -1., -7., -8.],
        [-2., -1.,  0., -8., -9.],
        [-6., -7., -8.,  0., -1.],
        [-7., -8., -9., -1.,  0.]
]

for subli in affinities:
    plt.plot(subli, 'o')

[Figure: each row of affinities plotted as points]

Clearly not separable in this plane.
(Though bring in one more dimension and it becomes linearly separable.)

Looking at each row's mean:

X = np.array(affinities)
plt.plot(2, X.mean(), 'o', markersize=15)  # overall mean, dropped at x=2 as a reference point
plt.plot(X.mean(axis=1), 'o')              # mean of each row

[Figure: per-row means of affinities]

Well, it does look like it could be clustered into two.

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

X = np.array(affinities)
Y = np.ones((5, 5))*np.arange(5)
n_clusters = 2
k_means = KMeans(n_clusters=n_clusters)
k_means.fit(affinities)
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)

colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue']]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.scatter(X[my_members, 0], X[my_members, 1], c=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', label=k,
            markerfacecolor=col, markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.legend()

[Figure: KMeans clusters over columns 0 and 1 of X]

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

X = np.array(affinities)
Y = np.ones((5, 5))*np.arange(5)
n_clusters = 2
k_means = KMeans(n_clusters=n_clusters)
k_means.fit(affinities)
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)

colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue']]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.scatter(Y[my_members], X[my_members], c=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', label=k,
            markerfacecolor=col, markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.legend()

[Figure: KMeans clusters, rows plotted against index]

import sklearn.cluster
clusterer = sklearn.cluster.AffinityPropagation(affinity='precomputed', damping=0.9, verbose=True)
result = clusterer.fit(affinities)

from pprint import pprint
pprint(vars(result))

Converged after 23 iterations.
{'affinity': 'precomputed',
 'affinity_matrix_': array([[ 0., -1., -2., -6., -7.],
       [-1.,  0., -1., -7., -8.],
       [-2., -1.,  0., -8., -9.],
       [-6., -7., -8.,  0., -1.],
       [-7., -8., -9., -1.,  0.]]),
 'cluster_centers_indices_': array([0], dtype=int64),
 'convergence_iter': 15,
 'copy': True,
 'damping': 0.9,
 'labels_': array([0, 0, 0, 0, 0], dtype=int64),
 'max_iter': 200,
 'n_iter_': 24,
 'preference': None,
 'verbose': True}

For what it's worth, change affinity='precomputed' (back to the euclidean default) and it does split into two clusters.

# labels, cluster_centers_indices, n_clusters_ and colors here come from a fitted
# AffinityPropagation (affinity left at the euclidean default), as in the cell below
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], '.', color=col)
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

[Figure: AffinityPropagation clusters]

from sklearn.cluster import AffinityPropagation
from sklearn import metrics   # for silhouette_score below
from itertools import cycle

af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)
plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(Y[class_members], X[class_members], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

plt.title(f'Estimated number of clusters: {n_clusters_}')

print(f'Estimated number of clusters: {n_clusters_}')
print(f'Silhouette Coefficient: {metrics.silhouette_score(X, labels, metric="sqeuclidean")}')

Estimated number of clusters: 2
Silhouette Coefficient: 0.9689156980050294

[Figure: AffinityPropagation clusters, preference=-50]

But unsupervised learning is an exploratory method to begin with;
look at the plot first, set up the hypothesis and the model properly,
and I'd say k-means is the more reasonable option.

A similar question, one more.

Plot the sklearn clusters in python – StackOverflow

import sklearn.cluster

import numpy as np

sims = np.array(
    [
        [0, 17, 10, 32, 32],
        [18, 0, 6, 20, 15],
        [10, 8, 0, 20, 21],
        [30, 16, 20, 0, 17],
        [30, 15, 21, 17, 0]
    ]
)

affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = sims[cluster_centers_indices[k]]
    plt.plot(sims[class_members, 0], sims[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in sims[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)

[Figure: AffinityPropagation clusters on sims]

plt.plot(sims, 'o')

[Figure: each column of sims plotted as points]

plt.plot(2, sims.mean(), 'o', markersize=15)
plt.plot(sims.mean(axis=1), 'o')

[Figure: per-row means of sims]

Well, if you take each row vector as representing some piece of data,
then (looking at the core of it) splitting into {1,4,5} and {2,3} seems about right.
(Or you could read it as three clusters.)

affprop = sklearn.cluster.AffinityPropagation(preference=-50)
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = sims[cluster_centers_indices[k]]
    plt.plot(sims[class_members, 0], sims[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in sims[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)

[Figure: AffinityPropagation on sims, preference=-50]

affprop = sklearn.cluster.AffinityPropagation()
affprop.fit(sims)

cluster_centers_indices = affprop.cluster_centers_indices_
print(cluster_centers_indices)
labels = affprop.labels_
n_clusters_ = len(cluster_centers_indices)
print(n_clusters_)

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = sims[cluster_centers_indices[k]]
    plt.plot(sims[class_members, 0], sims[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in sims[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)

[Figure: AffinityPropagation on sims, default parameters]

Hmm……

affprop.affinity_matrix_

array([[   -0., -1062.,  -546., -2250., -2274.],
       [-1062.,    -0.,  -200., -1000.,  -828.],
       [ -546.,  -200.,    -0., -1280., -1340.],
       [-2250., -1000., -1280.,    -0.,  -580.],
       [-2274.,  -828., -1340.,  -580.,    -0.]])

In the end, affinity propagation (with the default affinity) derives its similarity matrix
from the pairwise distances between the data rows themselves, so when the data structure
isn't clear, you end up not knowing what you are even trying to do.
(Is this fine as-is, or do we want to look at the core of the data and expose the overall relational structure?)
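
Easy to confirm by hand: with the default affinity='euclidean', fit just stores the negative squared pairwise distances between the rows (the same branch as the source excerpt at the top).

from sklearn.metrics.pairwise import euclidean_distances

# e.g. rows 0 and 1 of sims differ by (-18, 17, 4, 12, 17); the squares sum to 1062
np.allclose(affprop.affinity_matrix_, -euclidean_distances(sims, squared=True))
# -> True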

In the end, weighing the need (and cost) of groping around with parameter tuning,
I think k-means is the more reasonable choice after all.

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

X =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

n_clusters = 2
k_means = KMeans(n_clusters=n_clusters)
k_means.fit(X)
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)

colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue']]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.scatter(X[my_members, 0], X[my_members, 1], c=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', label=k,
            markerfacecolor=col, markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.legend()

[Figure: KMeans on sims, columns 0 and 1]

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

X =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])
Y = np.ones((5, 5))*np.arange(5).reshape(-1, 1)

n_clusters = 2
k_means = KMeans(n_clusters=n_clusters)
k_means.fit(X)
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)

colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue']]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.scatter(Y[my_members], X[my_members], c=col, marker='.', label=k)
ax.set_title('KMeans')
ax.legend()

[Figure: KMeans on sims, rows plotted against index]

It splits cleanly in two, and the geometry is exactly as imagined, so it's easy to read.

Related:
scatter plot labels

・AffinityPropagation

First build a similarity matrix (negative Euclidean distance and the like), then alternately
update two matrices, "responsibility" (how plausible a point is as an exemplar) and
"availability" (how strongly a point should attach itself to a given exemplar),
until the whole thing settles.
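
As a rough NumPy sketch of those two updates (the textbook Frey and Dueck form with damping; a toy for illustration, not the scikit-learn implementation):

import numpy as np

def ap_sketch(S, damping=0.5, n_iter=200):
    """S: (n, n) similarity matrix, preferences on the diagonal."""
    n = S.shape[0]
    idx = np.arange(n)
    R = np.zeros((n, n))  # responsibility r(i, k)
    A = np.zeros((n, n))  # availability  a(i, k)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[idx, top]
        AS[idx, top] = -np.inf            # mask the max to get the runner-up
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, top] = S[idx, top] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]        # keep r(k,k) itself in the column sums
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[idx, idx].copy()     # a(k,k) = sum_{i' != k} max(0, r(i',k))
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    return np.flatnonzero(np.diag(A + R) > 0)  # indices picked as exemplars

Points whose r(k,k) + a(k,k) end up positive are the exemplars; when preference=None, scikit-learn fills the diagonal with the median similarity before iterating.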

Computational cost is O(N^2 T), number of samples squared times iterations (k-means is O(NkT), linear in N).

The time and space costs are heavy, so it's for small-to-medium data.

With PCA + k-means, t-SNE + k-means, and DBSCAN around, it doesn't look like it gets many chances to shine.

In the end, the parameter tuning is fiddly.
I think I prefer k-means.

・Treating it as three clusters

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

X =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

n_clusters = 3
k_means = KMeans(n_clusters=n_clusters)
k_means.fit(X)

colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue'], sns.xkcd_rgb['medium green']]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means.labels_ == k
    # note: the old k_means_cluster_centers from the 2-cluster cell no longer
    # applies here (indexing it with k=2 would raise IndexError), so skip centers
    ax.scatter(X[my_members, 0], X[my_members, 1], c=col, marker='.', label=k)
ax.set_title('KMeans')
ax.legend()

[Figure: KMeans with 3 clusters]

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin

X =  np.array([[0, 17, 10, 32, 32], [18, 0, 6, 20, 15], [10, 8, 0, 20, 21], [30, 16, 20, 0, 17], [30, 15, 21, 17, 0]])

n_clusters = 3
k_means = KMeans(n_clusters=n_clusters)
k_means.fit(X)
k_means_cluster_centers = np.sort(k_means.cluster_centers_, axis=0)
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)

colors = [sns.xkcd_rgb['pale red'], sns.xkcd_rgb['denim blue'], sns.xkcd_rgb['medium green']]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    ax.scatter(X[my_members, 0], X[my_members, 1], c=col, marker='.')
    ax.plot(cluster_center[0], cluster_center[1], 'o', label=k,
            markerfacecolor=col, markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.legend()

[Figure: KMeans with 3 clusters, with centers]

The reason the pairwise_distances_argmin result turns into nonsense is that, for data
without flat geometry, it assigns clusters purely by shortest distance to a centroid;
but the k-means result itself, as long as you give it a plausible number of clusters,
comes out nicely, just as expected (assumed, imagined, intended).
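
Spelled out, what pairwise_distances_argmin does here (a hand-rolled equivalent, for illustration):

# each row of X gets the index of the nearest row of k_means_cluster_centers:
# a plain nearest-centroid lookup in the 5-dimensional row space
d2 = ((X[:, None, :] - k_means_cluster_centers[None, :, :]) ** 2).sum(axis=2)
labels_by_hand = d2.argmin(axis=1)  # == pairwise_distances_argmin(X, k_means_cluster_centers)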
