11. 聚类

中国谚语

磨刀不误砍柴工——中国古老的谚语。

images/clustering_logo.png

上述数字由以下代码生成: Python Data Science Handbook .

11.1. k-均值模型

11.1.1. 介绍

k均值聚类是一种矢量量化的方法,最初是从信号处理开始的,在数据挖掘中广泛应用于聚类分析。方法k-均值如下,以解决这个问题被称为 Expectation-Maximization . 具体描述如下:

  1. 分配一些杂波中心

  2. 重复直到收敛

    • E-STEP:将点指定给最近的中心

    • M-STEP:将群集中心设置为平均值

给出一组观察结果

System Message: WARNING/2 ((x_1, x_2, \cdots, x_m))

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

. 目标功能是

System Message: WARNING/2 (j=sum_i=1 m sum k=1 k w ik x i-c u k ^2)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

在哪里?

System Message: WARNING/2 (w_{{ik}}=1)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

如果

System Message: WARNING/2 (x_i)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

在群集中

System Message: WARNING/2 (k)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

否则

System Message: WARNING/2 (w_{{ik}}=0)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

System Message: WARNING/2 (c_k)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

是的质心

System Message: WARNING/2 (x_i)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

的群集。

在数学上,k-均值是一个有两部分的最小化问题:第一,我们最小化

System Message: WARNING/2 (J)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

w.r.t

System Message: WARNING/2 (w_{{ik}})

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

具有

System Message: WARNING/2 (c_k)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

固定;然后最小化

System Message: WARNING/2 (J)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

w.r.t

System Message: WARNING/2 (c_k)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

具有

System Message: WARNING/2 (w_{{ik}})

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

固定的。即

E-step

System Message: WARNING/2 (\frac{\partial J}{\partial w_{ik}} = \sum_{i=1}^{m}\sum_{k=1}^{K} ||x_i-c_k||^2\\ \Rightarrow w_{ik} =\left\{ \begin{array}{ll} 1, & \text{ if }{ k = argmin_{j} ||x_i-c_j||^2} \\ 0, & \text{ otherwise } \end{array} \right.)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

M-step

System Message: WARNING/2 (\frac{\partial J}{\partial c_k} = 2\sum_{i=1}{m} w_{ik}(x_i-c_k) =0 \Rightarrow c_k = \frac{\sum_{i=1}^{m}w_{ik}x_i}{\sum_{i=1}^{m}w_{ik}})

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

11.1.2. 演示

  1. 设置Spark上下文和SparkSession

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark K-means example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
  1. 负载数据集

df = spark.read.format('com.databricks.spark.csv').\
                       options(header='true', \
                       inferschema='true').\
            load("../data/iris.csv",header=True);

检查数据集

df.show(5,True)
df.printSchema()

然后你会得到

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows

root
 |-- sepal_length: double (nullable = true)
 |-- sepal_width: double (nullable = true)
 |-- petal_length: double (nullable = true)
 |-- petal_width: double (nullable = true)
 |-- species: string (nullable = true)

您还可以从数据帧中获得统计结果(不幸的是,它只适用于数字)。

df.describe().show()

然后你会得到

+-------+------------------+-------------------+------------------+------------------+---------+
|summary|      sepal_length|        sepal_width|      petal_length|       petal_width|  species|
+-------+------------------+-------------------+------------------+------------------+---------+
|  count|               150|                150|               150|               150|      150|
|   mean| 5.843333333333335| 3.0540000000000007|3.7586666666666693|1.1986666666666672|     null|
| stddev|0.8280661279778637|0.43359431136217375| 1.764420419952262|0.7631607417008414|     null|
|    min|               4.3|                2.0|               1.0|               0.1|   setosa|
|    max|               7.9|                4.4|               6.9|               2.5|virginica|
+-------+------------------+-------------------+------------------+------------------+---------+
  1. 将数据转换为密集向量( 特征

# convert the data to dense vector
def transData(data):
    return data.rdd.map(lambda r: [Vectors.dense(r[:-1])]).toDF(['features'])

注解

强烈建议您尝试我的 get_dummy 用于处理复杂数据集中的分类数据的函数。

监督学习版本:

def get_dummy(df,indexCol,categoricalCols,continuousCols,labelCol):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.sql.functions import col

    indexers = [ StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
                 for c in categoricalCols ]

    # default setting: dropLast=True
    encoders = [ OneHotEncoder(inputCol=indexer.getOutputCol(),
                 outputCol="{0}_encoded".format(indexer.getOutputCol()))
                 for indexer in indexers ]

    assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders]
                                + continuousCols, outputCol="features")

    pipeline = Pipeline(stages=indexers + encoders + [assembler])

    model=pipeline.fit(df)
    data = model.transform(df)

    data = data.withColumn('label',col(labelCol))

    return data.select(indexCol,'features','label')

无监督学习版本:

def get_dummy(df,indexCol,categoricalCols,continuousCols):
    '''
    Get dummy variables and concat with continuous variables for unsupervised learning.
    :param df: the dataframe
    :param categoricalCols: the name list of the categorical data
    :param continuousCols:  the name list of the numerical data
    :return k: feature matrix

    :author: Wenqiang Feng
    :email:  von198@gmail.com
    '''

    indexers = [ StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
                 for c in categoricalCols ]

    # default setting: dropLast=True
    encoders = [ OneHotEncoder(inputCol=indexer.getOutputCol(),
                 outputCol="{0}_encoded".format(indexer.getOutputCol()))
                 for indexer in indexers ]

    assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders]
                                + continuousCols, outputCol="features")

    pipeline = Pipeline(stages=indexers + encoders + [assembler])

    model=pipeline.fit(df)
    data = model.transform(df)

    return data.select(indexCol,'features')
  1. 将数据集转换为数据帧

transformed= transData(df)
transformed.show(5, False)
+-----------------+
|features         |
+-----------------+
|[5.1,3.5,1.4,0.2]|
|[4.9,3.0,1.4,0.2]|
|[4.7,3.2,1.3,0.2]|
|[4.6,3.1,1.5,0.2]|
|[5.0,3.6,1.4,0.2]|
+-----------------+
only showing top 5 rows
  1. 处理分类变量

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.

featureIndexer = VectorIndexer(inputCol="features", \
                               outputCol="indexedFeatures",\
                               maxCategories=4).fit(transformed)

data = featureIndexer.transform(transformed)

现在,您可以使用

data.show(5,True)

你会得到

+-----------------+-----------------+
|         features|  indexedFeatures|
+-----------------+-----------------+
|[5.1,3.5,1.4,0.2]|[5.1,3.5,1.4,0.2]|
|[4.9,3.0,1.4,0.2]|[4.9,3.0,1.4,0.2]|
|[4.7,3.2,1.3,0.2]|[4.7,3.2,1.3,0.2]|
|[4.6,3.1,1.5,0.2]|[4.6,3.1,1.5,0.2]|
|[5.0,3.6,1.4,0.2]|[5.0,3.6,1.4,0.2]|
+-----------------+-----------------+
only showing top 5 rows

注解

由于包括k均值在内的聚类算法使用基于距离的测量来确定数据点之间的相似性,因此强烈建议将数据标准化,使平均值为零,标准偏差为1。

  1. 确定k-均值聚类最佳聚类数的肘形法

import numpy as np
cost = np.zeros(20)
for k in range(2,20):
    kmeans = KMeans()\
            .setK(k)\
            .setSeed(1) \
            .setFeaturesCol("indexedFeatures")\
            .setPredictionCol("cluster")

    model = kmeans.fit(data)
    cost[k] = model.computeCost(data) # requires Spark 2.0 or later
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sbs
from matplotlib.ticker import MaxNLocator

fig, ax = plt.subplots(1,1, figsize =(8,6))
ax.plot(range(2,20),cost[2:20])
ax.set_xlabel('k')
ax.set_ylabel('cost')
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.show()
images/elbow.png

在我看来,有时很难通过使用 elbow method . 如下图所示,您可以选择3、5甚至8。我会选择 3 在这个演示中。

images/elbow_rfm.png
  • 轮廓分析

#PySpark libraries
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.functions import col, percent_rank, lit
from pyspark.sql.window import Window
from pyspark.sql import DataFrame, Row
from pyspark.sql.types import StructType
from functools import reduce  # For Python 3.x

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

def optimal_k(df_in,index_col,k_min, k_max,num_runs):
    '''
    Determine optimal number of clusters by using Silhoutte Score Analysis.
    :param df_in: the input dataframe
    :param index_col: the name of the index column
    :param k_min: the train dataset
    :param k_min: the minmum number of the clusters
    :param k_max: the maxmum number of the clusters
    :param num_runs: the number of runs for each fixed clusters

    :return k: optimal number of the clusters
    :return silh_lst: Silhouette score
    :return r_table: the running results table

    :author: Wenqiang Feng
    :email:  von198@gmail.com
    '''

    start = time.time()
    silh_lst = []
    k_lst = np.arange(k_min, k_max+1)

    r_table = df_in.select(index_col).toPandas()
    r_table = r_table.set_index(index_col)
    centers = pd.DataFrame()

    for k in k_lst:
        silh_val = []
        for run in np.arange(1, num_runs+1):

            # Trains a k-means model.
            kmeans = KMeans()\
                    .setK(k)\
                    .setSeed(int(np.random.randint(100, size=1)))
            model = kmeans.fit(df_in)

            # Make predictions
            predictions = model.transform(df_in)
            r_table['cluster_{k}_{run}'.format(k=k, run=run)]= predictions.select('prediction').toPandas()

            # Evaluate clustering by computing Silhouette score
            evaluator = ClusteringEvaluator()
            silhouette = evaluator.evaluate(predictions)
            silh_val.append(silhouette)

        silh_array=np.asanyarray(silh_val)
        silh_lst.append(silh_array.mean())

    elapsed =  time.time() - start

    silhouette = pd.DataFrame(list(zip(k_lst,silh_lst)),columns = ['k', 'silhouette'])

    print('+------------------------------------------------------------+')
    print("|         The finding optimal k phase took %8.0f s.       |" %(elapsed))
    print('+------------------------------------------------------------+')


    return k_lst[np.argmax(silh_lst, axis=0)], silhouette , r_table
k, silh_lst, r_table = optimal_k(scaledData,index_col,k_min, k_max,num_runs)

+------------------------------------------------------------+
|         The finding optimal k phase took     1783 s.       |
+------------------------------------------------------------+
spark.createDataFrame(silh_lst).show()

+---+------------------+
|  k|        silhouette|
+---+------------------+
|  3|0.8045154385557953|
|  4|0.6993528775512052|
|  5|0.6689286654221447|
|  6|0.6356184024841809|
|  7|0.7174102265711756|
|  8|0.6720861758298997|
|  9| 0.601771359881241|
| 10|0.6292447334578428|
+---+------------------+

从剪影列表中,我们可以选择 3 作为最佳簇数。

警告

ClusteringEvaluator 在里面 pyspark.ml.evaluation 需要Spark 2.4或更高版本!!

  1. 管道体系结构

from pyspark.ml.clustering import KMeans, KMeansModel

kmeans = KMeans() \
          .setK(3) \
          .setFeaturesCol("indexedFeatures")\
          .setPredictionCol("cluster")

# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, kmeans])

model = pipeline.fit(transformed)

cluster = model.transform(transformed)
  1. k-均值聚类

cluster = model.transform(transformed)
+-----------------+-----------------+-------+
|         features|  indexedFeatures|cluster|
+-----------------+-----------------+-------+
|[5.1,3.5,1.4,0.2]|[5.1,3.5,1.4,0.2]|      1|
|[4.9,3.0,1.4,0.2]|[4.9,3.0,1.4,0.2]|      1|
|[4.7,3.2,1.3,0.2]|[4.7,3.2,1.3,0.2]|      1|
|[4.6,3.1,1.5,0.2]|[4.6,3.1,1.5,0.2]|      1|
|[5.0,3.6,1.4,0.2]|[5.0,3.6,1.4,0.2]|      1|
|[5.4,3.9,1.7,0.4]|[5.4,3.9,1.7,0.4]|      1|
|[4.6,3.4,1.4,0.3]|[4.6,3.4,1.4,0.3]|      1|
|[5.0,3.4,1.5,0.2]|[5.0,3.4,1.5,0.2]|      1|
|[4.4,2.9,1.4,0.2]|[4.4,2.9,1.4,0.2]|      1|
|[4.9,3.1,1.5,0.1]|[4.9,3.1,1.5,0.1]|      1|
|[5.4,3.7,1.5,0.2]|[5.4,3.7,1.5,0.2]|      1|
|[4.8,3.4,1.6,0.2]|[4.8,3.4,1.6,0.2]|      1|
|[4.8,3.0,1.4,0.1]|[4.8,3.0,1.4,0.1]|      1|
|[4.3,3.0,1.1,0.1]|[4.3,3.0,1.1,0.1]|      1|
|[5.8,4.0,1.2,0.2]|[5.8,4.0,1.2,0.2]|      1|
|[5.7,4.4,1.5,0.4]|[5.7,4.4,1.5,0.4]|      1|
|[5.4,3.9,1.3,0.4]|[5.4,3.9,1.3,0.4]|      1|
|[5.1,3.5,1.4,0.3]|[5.1,3.5,1.4,0.3]|      1|
|[5.7,3.8,1.7,0.3]|[5.7,3.8,1.7,0.3]|      1|
|[5.1,3.8,1.5,0.3]|[5.1,3.8,1.5,0.3]|      1|
+-----------------+-----------------+-------+
only showing top 20 rows