14. 社会网络分析¶

中国谚语

一块布，以无数种方式相连。 --中国古语

14.1. 介绍¶

14.2. Co-occurrence网络¶

Co-occurrence networks 通常用于提供在书面材料中表示的人员、组织、概念或其他实体之间潜在关系的图形化可视化。随着适于文本挖掘的电子存储文本的出现，Co-occurrence网络的生成和可视化已经成为现实。

14.2.1. 方法论¶

创建Corpus C
基于Corpus C构建文档术语矩阵D
计算术语文档矩阵

System Message: WARNING/2 (D^T)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.
邻接矩阵

System Message: WARNING/2 (A =D^T\cdot D)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

该算法主要由四部分组成：Corpus C、文档项矩阵D、文档项矩阵

System Message: WARNING/2 (D^T)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

以及邻接矩阵A。在这个演示部分，我将展示如何构建这四个主要组件。

考虑到我们有三组朋友，他们是

+-------------------------------------+
|words                                |
+-------------------------------------+
|[[george] [jimmy] [john] [peter]]    |
|[[vincent] [george] [stefan] [james]]|
|[[emma] [james] [olivia] [george]]   |
+-------------------------------------+

Corpus C

然后，我们可以基于给定组数据中的唯一元素构建以下语料库：

[u'george', u'james', u'jimmy', u'peter', u'stefan', u'vincent', u'olivia', u'john', u'emma']

相应的元素频率：

基于Corpus C（countvectorizer）的文献术语矩阵D

from pyspark.ml.feature import CountVectorizer
count_vectorizer_wo = CountVectorizer(inputCol='term', outputCol='features')
# with total unique vocabulary
countVectorizer_mod_wo = count_vectorizer_wo.fit(df)
countVectorizer_twitter_wo = countVectorizer_mod_wo.transform(df)
# with truncated unique vocabulary (99%)
count_vectorizer = CountVectorizer(vocabSize=48,inputCol='term',outputCol='features')
countVectorizer_mod = count_vectorizer.fit(df)
countVectorizer_twitter = countVectorizer_mod.transform(df)

+-------------------------------+
|features                       |
+-------------------------------+
|(9,[0,2,3,7],[1.0,1.0,1.0,1.0])|
|(9,[0,1,4,5],[1.0,1.0,1.0,1.0])|
|(9,[0,1,6,8],[1.0,1.0,1.0,1.0])|
+-------------------------------+

术语文档矩阵

System Message: WARNING/2 (D^T)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

RDD：

[array([ 1.,  1.,  1.]), array([ 0.,  1.,  1.]), array([ 1.,  0.,  0.]),
 array([ 1.,  0.,  0.]), array([ 0.,  1.,  0.]), array([ 0.,  1.,  0.]),
 array([ 0.,  0.,  1.]), array([ 1.,  0.,  0.]), array([ 0.,  0.,  1.])]

矩阵：

array([[ 1.,  1.,  1.],
       [ 0.,  1.,  1.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]])

邻接矩阵

System Message: WARNING/2 (A =D^T\cdot D)

latex exited with error [stdout] This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016/Debian) (preloaded format=latex) restricted \write18 enabled. entering extended mode (./math.tex LaTeX2e <2017/01/01> patch level 3 Babel <3.9r> and hyphenation patterns for 12 language(s) loaded. (/usr/share/texlive/texmf-dist/tex/latex/base/article.cls Document Class: article 2014/09/29 v1.4h Standard LaTeX document class (/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo)) (/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/utf8x.def)) (/usr/share/texlive/texmf-dist/tex/latex/ucs/ucs.sty (/usr/share/texlive/texmf-dist/tex/latex/ucs/data/uni-global.def)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty For additional information on amsmath, use the `?' option. (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty)) (/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty) (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty (/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty)) (/usr/share/texlive/texmf-dist/tex/latex/anyfontsize/anyfontsize.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/bm.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mathtools.sty (/usr/share/texlive/texmf-dist/tex/latex/graphics/keyval.sty) (/usr/share/texlive/texmf-dist/tex/latex/tools/calc.sty) (/usr/share/texlive/texmf-dist/tex/latex/mathtools/mhsetup.sty)) ! LaTeX Error: File `dsfont.sty' not found. Type X to quit or <RETURN> to proceed, or enter new name. (Default extension: sty) Enter file name: ! Emergency stop. <read *> l.16 \def \Z{\mathbb{Z}}^^M No pages of output. Transcript written on math.log.

RDD：

[array([ 1.,  1.,  1.]), array([ 0.,  1.,  1.]), array([ 1.,  0.,  0.]),
 array([ 1.,  0.,  0.]), array([ 0.,  1.,  0.]), array([ 0.,  1.,  0.]),
 array([ 0.,  0.,  1.]), array([ 1.,  0.,  0.]), array([ 0.,  0.,  1.])]

矩阵：

array([[ 3.,  2.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  0.,  0.,  1.,  1.,  1.,  0.,  1.],
       [ 1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.],
       [ 1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.]])

14.2.2. 从我的采访中编码拼图¶

问题

随附的UTF-8编码文本文件包含与在线生物医学科学文章相关的标签，格式如下（大小：100000）。每一篇科学文章都由文件中以回车分隔的一行来表示。

+--------------------+
|               words|
+--------------------+
|[ACTH Syndrome, E...|
|[Antibody Formati...|
|[Adaptation, Phys...|
|[Aerosol Propella...|
+--------------------+
only showing top 4 rows

编写一个程序，使用这个文件作为输入，生成一组成对的标签，这些标签以任意顺序和位置出现在至少50篇不同的科学文章中。例如，在上面的示例中， [女性] 和 [人类] 一起出现两次，但每隔一对只出现一次。您的程序应该以与输入相同的格式（例如标记1、标记2）将对列表输出到stdout。

我的解决方案

对应词频率：

词频¶

输出：

+----------+------+-------+
|    term.x|term.y|   freq|
+----------+------+-------+
|    Female|Humans|16741.0|
|      Male|Humans|13883.0|
|     Adult|Humans|10391.0|
|      Male|Female| 9806.0|
|MiddleAged|Humans| 8181.0|
|     Adult|Female| 7411.0|
|     Adult|  Male| 7240.0|
|MiddleAged|  Male| 6328.0|
|MiddleAged|Female| 6002.0|
|MiddleAged| Adult| 5944.0|
+----------+------+-------+
only showing top 10 rows

对应的Co-occurrence网络：

Co-occurrence网络¶

然后你会得到数字 Co-occurrence网络

14.3. 附录：PySpark中的矩阵乘法¶

负载测试矩阵

df = spark.read.csv("matrix1.txt",sep=",",inferSchema=True)
df.show()

+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|1.2|3.4|2.3|1.1|
|2.3|1.1|1.5|2.2|
|3.3|1.8|4.5|3.3|
|5.3|2.2|4.5|4.4|
|9.3|8.1|0.3|5.5|
|4.5|4.3|2.1|6.6|
+---+---+---+---+

PySpark中矩阵乘法的主函数

from pyspark.sql import functions as F
from functools import reduce
# reference: https://stackoverflow.com/questions/44348527/matrix-multiplication-at-a-in-pyspark
# do the sum of the multiplication that we want, and get
# one data frame for each column
colDFs = []
for c2 in df.columns:
    colDFs.append( df.select( [ F.sum(df[c1]*df[c2]).alias("op_{0}".format(i)) for i,c1 in enumerate(df.columns) ] ) )
# now union those separate data frames to build the "matrix"
mtxDF = reduce(lambda a,b: a.select(a.columns).union(b.select(a.columns)), colDFs )
mtxDF.show()

+------------------+------------------+------------------+------------------+
|              op_0|              op_1|              op_2|              op_3|
+------------------+------------------+------------------+------------------+
|            152.45|118.88999999999999|             57.15|121.44000000000001|
|118.88999999999999|104.94999999999999|             38.93|             94.71|
|             57.15|             38.93|52.540000000000006|             55.99|
|121.44000000000001|             94.71|             55.99|110.10999999999999|
+------------------+------------------+------------------+------------------+

使用python版本验证

import numpy as np
a = np.genfromtxt("matrix1.txt",delimiter=",")
np.dot(a.T, a)

array([[152.45, 118.89,  57.15, 121.44],
       [118.89, 104.95,  38.93,  94.71],
       [ 57.15,  38.93,  52.54,  55.99],
       [121.44,  94.71,  55.99, 110.11]])

14.4. 相关网络¶

待办事项