7. 数据探索¶

中国谚语

千里之行始于足下。 ——老子成语。

我不会说理解您的数据集是数据科学中最困难的事情，但它确实很重要而且非常耗时。数据探索是指利用统计和可视化技术来描述数据。我们研究数据是为了了解这些特性，并为我们的模型带来重要的特性。

7.1. 单变量分析¶

在数学中，单变量是指只有一个变量的表达式、方程、函数或多项式。“uni”的意思是“one”，所以换句话说，您的数据只有一个变量。因此，您不需要在这一步中处理原因或关系。单变量分析获取数据，逐个总结变量（属性），并在数据中找到模式。

单变量数据中发现的模式有多种描述方法，包括集中趋势（平均值、模式和中位数）和离散：范围、方差、最大值、最小值、四分位数（包括四分位数间距）、变异系数和标准差。您还可以使用多个选项来可视化和描述具有单变量数据的数据。如 frequency Distribution Tables ， bar Charts ， histograms ， frequency Polygons ， pie Charts .

变量可以是分类的，也可以是数值的，我将演示不同的统计和可视化技术来研究每种类型的变量。

Jupyter notebook可从 Data Exploration .
数据可以从下载 German Credit .

7.1.1. 数值变量¶

描述

这个 desctibe 中的函数 pandas 和 spark 会给我们大部分的统计结果，比如 min ， median ， max ， quartiles 和 standard deviation . 借助用户定义的函数，您可以得到更多的统计结果。

# selected varables for the demonstration
num_cols = ['Account Balance','No of dependents']
df.select(num_cols).describe().show()

+-------+------------------+-------------------+
|summary|   Account Balance|   No of dependents|
+-------+------------------+-------------------+
|  count|              1000|               1000|
|   mean|             2.577|              1.155|
| stddev|1.2576377271108936|0.36208577175319395|
|    min|                 1|                  1|
|    max|                 4|                  2|
+-------+------------------+-------------------+

您可能会发现pyspark中的默认函数不包括四分位数。以下功能将帮助您在熊猫中获得相同的结果

def describe_pd(df_in, columns, deciles=False):
    '''
    Function to union the basic stats results and deciles
    :param df_in: the input dataframe
    :param columns: the cloumn name list of the numerical variable
    :param deciles: the deciles output

    :return : the numerical describe info. of the input dataframe

    :author: Ming Chen and Wenqiang Feng
    :email:  von198@gmail.com
    '''

    if deciles:
        percentiles = np.array(range(0, 110, 10))
    else:
        percentiles = [25, 50, 75]

    percs = np.transpose([np.percentile(df_in.select(x).collect(), percentiles) for x in columns])
    percs = pd.DataFrame(percs, columns=columns)
    percs['summary'] = [str(p) + '%' for p in percentiles]

    spark_describe = df_in.describe().toPandas()
    new_df = pd.concat([spark_describe, percs],ignore_index=True)
    new_df = new_df.round(2)
    return new_df[['summary'] + columns]

describe_pd(df,num_cols)

+-------+------------------+-----------------+
|summary|   Account Balance| No of dependents|
+-------+------------------+-----------------+
|  count|            1000.0|           1000.0|
|   mean|             2.577|            1.155|
| stddev|1.2576377271108936|0.362085771753194|
|    min|               1.0|              1.0|
|    max|               4.0|              2.0|
|    25%|               1.0|              1.0|
|    50%|               2.0|              1.0|
|    75%|               4.0|              1.0|
+-------+------------------+-----------------+

有时，由于机密数据问题，您无法提供真实数据，您的客户可能会要求更多的统计结果，例如 deciles . 您可以应用以下功能来实现它。

describe_pd(df,num_cols,deciles=True)

+-------+------------------+-----------------+
|summary|   Account Balance| No of dependents|
+-------+------------------+-----------------+
|  count|            1000.0|           1000.0|
|   mean|             2.577|            1.155|
| stddev|1.2576377271108936|0.362085771753194|
|    min|               1.0|              1.0|
|    max|               4.0|              2.0|
|     0%|               1.0|              1.0|
|    10%|               1.0|              1.0|
|    20%|               1.0|              1.0|
|    30%|               2.0|              1.0|
|    40%|               2.0|              1.0|
|    50%|               2.0|              1.0|
|    60%|               3.0|              1.0|
|    70%|               4.0|              1.0|
|    80%|               4.0|              1.0|
|    90%|               4.0|              2.0|
|   100%|               4.0|              2.0|
+-------+------------------+-----------------+

偏度和峰度

这个小节来自维基百科 Skewness .

在概率论和统计学中，偏态是对实值随机变量平均值概率分布不对称性的度量。偏度值可以是正的、负的或未定义的。对于单峰分布，负偏度通常表示尾部在分布的左侧，正偏度表示尾部在右侧。

考虑下图中的两个分布。在每个图中，分布锥形右侧的值与左侧的值不同。这些逐渐变细的边称为尾巴，它们提供了一种视觉手段来确定一个分布有两种偏斜：
1. 负偏斜：左尾较长，分布质量集中在图的右侧。分布被称为左偏斜、左尾或左偏斜，尽管曲线本身看起来是偏斜或向右倾斜；相反，左指的是正在绘制的左尾，并且通常是指平均值偏斜到典型数据中心的左侧。左偏分布通常表现为右偏曲线。
2. 正偏斜：右尾较长，分布质量集中在图的左侧。分布被称为右偏斜、右尾或向右偏斜，尽管曲线本身似乎向左偏斜或向左倾斜；相反，右指的是被拉出的右尾，并且通常是指平均值偏斜到典型数据中心的右侧。右偏分布通常表现为左偏曲线。
这个小节来自维基百科 Kurtosis .

在概率论和统计学中，峰度（Kyrtos或Kurtos，意思是“曲线，拱”）是对实值随机变量概率分布“尾性”的度量。与偏态的概念类似，峰度是概率分布形状的描述符，正如偏态一样，对于理论分布，有不同的方法对其进行量化，并相应地从群体的样本中对其进行估计。

from pyspark.sql.functions import col, skewness, kurtosis
df.select(skewness(var),kurtosis(var)).show()

+---------------------+---------------------+
|skewness(Age (years))|kurtosis(Age (years))|
+---------------------+---------------------+
|   1.0231743160548064|   0.6114371688367672|
+---------------------+---------------------+

警告

有时统计数据可能会产生误导！

F.J.Anscombe曾经说过，计算和图表都可以。应研究这两种输出；每种输出都有助于理解。图中这13个数据集相同的数据，不同的图表（datasaurus，加上其他12个）每个都有相同的汇总统计数据（x/y平均值、x/y标准差和皮尔逊相关性），精确到两个小数位，但外观却大不相同。这项工作描述了我们为创建这个数据集而开发的技术，以及其他类似的技术。更多的细节和有趣的结果可以在 Same Stats Different Graphs .

相同的数据，不同的图表¶

直方图

警告

柱状图经常与条形图混淆！

柱状图和柱状图的根本区别在于柱状图中的条形之间存在间隙，但柱状图中的条形彼此相邻。感兴趣的读者被介绍给 Difference Between Histogram and Bar Graph .

var = 'Age (years)'
x = data1[var]
bins = np.arange(0, 100, 5.0)

plt.figure(figsize=(10,8))
# the histogram of the data
plt.hist(x, bins, alpha=0.8, histtype='bar', color='gold',
         ec='black',weights=np.zeros_like(x) + 100. / x.size)

plt.xlabel(var)
plt.ylabel('percentage')
plt.xticks(bins)
plt.show()

fig.savefig(var+".pdf", bbox_inches='tight')

var = 'Age (years)'
x = data1[var]
bins = np.arange(0, 100, 5.0)


########################################################################
hist, bin_edges = np.histogram(x,bins,
                               weights=np.zeros_like(x) + 100. / x.size)
# make the histogram

fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 2, 1)

# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black', color='gold')
# # Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])
# Set the xticklabels to a string that tells us what the bin edges were
labels =['{}'.format(int(bins[i+1])) for i,j in enumerate(hist)]
labels.insert(0,'0')
ax.set_xticklabels(labels)
plt.xlabel(var)
plt.ylabel('percentage')

########################################################################

hist, bin_edges = np.histogram(x,bins) # make the histogram

ax = fig.add_subplot(1, 2, 2)
# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black', color='gold')

# # Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])

# Set the xticklabels to a string that tells us what the bin edges were
labels =['{}'.format(int(bins[i+1])) for i,j in enumerate(hist)]
labels.insert(0,'0')
ax.set_xticklabels(labels)
plt.xlabel(var)
plt.ylabel('count')
plt.suptitle('Histogram of {}: Left with percentage output;Right with count output'
             .format(var), size=16)
plt.show()

fig.savefig(var+".pdf", bbox_inches='tight')

有时，有些人会要求您绘制条形图的不等宽度（柱状图的无效参数）。你仍然可以通过以下的技巧来实现它。

var = 'Credit Amount'
plot_data = df.select(var).toPandas()
x= plot_data[var]

bins =[0,200,400,600,700,800,900,1000,2000,3000,4000,5000,6000,10000,25000]

hist, bin_edges = np.histogram(x,bins,weights=np.zeros_like(x) + 100. / x.size) # make the histogram

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1, 1, 1)
# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1,alpha=0.8,ec ='black',color = 'gold')

# # Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])

# Set the xticklabels to a string that tells us what the bin edges were
#labels =['{}k'.format(int(bins[i+1]/1000)) for i,j in enumerate(hist)]
labels =['{}'.format(bins[i+1]) for i,j in enumerate(hist)]
labels.insert(0,'0')
ax.set_xticklabels(labels)
#plt.text(-0.6, -1.4,'0')
plt.xlabel(var)
plt.ylabel('percentage')
plt.show()

盒积与小提琴积

请注意，尽管小提琴情节与图基（1977）的盒情节密切相关，但小提琴情节可以显示比盒情节更多的信息。当我们进行探索性分析时，对样本一无所知。因此，样本的分布不能假定为正态分布，通常当你得到一个大数据时，正态分布会在箱线图中显示出一些骗子。

然而，对于较小的样本尺寸，小提琴图有潜在的误导性，其中密度图可能显示出有趣的特征（以及其中的群体差异），即使是为标准正常数据制作的。一些海报建议样品尺寸应大于250。样本大小（例如n>250或更大，理想情况下甚至更大），其中，核密度图提供了分布的合理准确表示，可能会显示一些细微差别，例如双峰或其他形式的非正态性，在盒形图中不可见或不明显。有关更多详细信息，请参阅 A simple comparison of box plots and violin plots .

x = df.select(var).toPandas()

fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(1, 2, 1)
ax = sns.boxplot(data=x)

ax = fig.add_subplot(1, 2, 2)
ax = sns.violinplot(data=x)

7.1.2. 分类变量¶

与数值变量相比，分类变量更容易进行探索。

频率表

from pyspark.sql import functions as F
from pyspark.sql.functions import rank,sum,col
from pyspark.sql import Window

window = Window.rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
# withColumn('Percent %',F.format_string("%5.0f%%\n",col('Credit_num')*100/col('total'))).\
tab = df.select(['age_class','Credit Amount']).\
   groupBy('age_class').\
   agg(F.count('Credit Amount').alias('Credit_num'),
       F.mean('Credit Amount').alias('Credit_avg'),
       F.min('Credit Amount').alias('Credit_min'),
       F.max('Credit Amount').alias('Credit_max')).\
   withColumn('total',sum(col('Credit_num')).over(window)).\
   withColumn('Percent',col('Credit_num')*100/col('total')).\
   drop(col('total'))

+---------+----------+------------------+----------+----------+-------+
|age_class|Credit_num|        Credit_avg|Credit_min|Credit_max|Percent|
+---------+----------+------------------+----------+----------+-------+
|    45-54|       120|3183.0666666666666|       338|     12612|   12.0|
|      <25|       150| 2970.733333333333|       276|     15672|   15.0|
|    55-64|        56| 3493.660714285714|       385|     15945|    5.6|
|    35-44|       254| 3403.771653543307|       250|     15857|   25.4|
|    25-34|       397| 3298.823677581864|       343|     18424|   39.7|
|      65+|        23|3210.1739130434785|       571|     14896|    2.3|
+---------+----------+------------------+----------+----------+-------+

馅饼图

# Data to plot
labels = plot_data.age_class
sizes =  plot_data.Percent
colors = ['gold', 'yellowgreen', 'lightcoral','blue', 'lightskyblue','green','red']
explode = (0, 0.1, 0, 0,0,0)  # explode 1st slice

# Plot
plt.figure(figsize=(10,8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.show()

条形图

labels = plot_data.age_class
missing = plot_data.Percent
ind = [x for x, _ in enumerate(labels)]

plt.figure(figsize=(10,8))
plt.bar(ind, missing, width=0.8, label='missing', color='gold')

plt.xticks(ind, labels)
plt.ylabel("percentage")

plt.show()

labels = ['missing', '<25', '25-34', '35-44', '45-54','55-64','65+']
missing = np.array([0.000095, 0.024830, 0.028665, 0.029477, 0.031918,0.037073,0.026699])
man = np.array([0.000147, 0.036311, 0.038684, 0.044761, 0.051269, 0.059542, 0.054259])
women = np.array([0.004035, 0.032935, 0.035351, 0.041778, 0.048437, 0.056236,0.048091])
ind = [x for x, _ in enumerate(labels)]

plt.figure(figsize=(10,8))
plt.bar(ind, women, width=0.8, label='women', color='gold', bottom=man+missing)
plt.bar(ind, man, width=0.8, label='man', color='silver', bottom=missing)
plt.bar(ind, missing, width=0.8, label='missing', color='#CD853F')

plt.xticks(ind, labels)
plt.ylabel("percentage")
plt.legend(loc="upper left")
plt.title("demo")

plt.show()

7.2. 多元分析¶

在本节中，我将只演示双变量分析。因为多元分析是二元的产生。

7.2.1. 数值V.S.数值¶

相关矩阵

from pyspark.mllib.stat import Statistics
import pandas as pd

corr_data = df.select(num_cols)

col_names = corr_data.columns
features = corr_data.rdd.map(lambda row: row[0:])
corr_mat=Statistics.corr(features, method="pearson")
corr_df = pd.DataFrame(corr_mat)
corr_df.index, corr_df.columns = col_names, col_names

print(corr_df.to_string())

+--------------------+--------------------+
|     Account Balance|    No of dependents|
+--------------------+--------------------+
|                 1.0|-0.01414542650320914|
|-0.01414542650320914|                 1.0|
+--------------------+--------------------+

散点图

import seaborn as sns
sns.set(style="ticks")

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")
plt.show()

7.2.2. 分类V.S.分类¶

皮尔逊卡方检验

警告

pyspark.ml.stat 仅在Spark 2.4.0中可用。

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))

pValues: [0.687289278791,0.682270330336]
degreesOfFreedom: [2, 3]
statistics: [0.75,1.5]

十字桌

df.stat.crosstab("age_class", "Occupation").show()

+--------------------+---+---+---+---+
|age_class_Occupation|  1|  2|  3|  4|
+--------------------+---+---+---+---+
|                 <25|  4| 34|108|  4|
|               55-64|  1| 15| 31|  9|
|               25-34|  7| 61|269| 60|
|               35-44|  4| 58|143| 49|
|                 65+|  5|  3|  6|  9|
|               45-54|  1| 29| 73| 17|
+--------------------+---+---+---+---+

叠加图

labels = ['missing', '<25', '25-34', '35-44', '45-54','55-64','65+']
missing = np.array([0.000095, 0.024830, 0.028665, 0.029477, 0.031918,0.037073,0.026699])
man = np.array([0.000147, 0.036311, 0.038684, 0.044761, 0.051269, 0.059542, 0.054259])
women = np.array([0.004035, 0.032935, 0.035351, 0.041778, 0.048437, 0.056236,0.048091])
ind = [x for x, _ in enumerate(labels)]

plt.figure(figsize=(10,8))
plt.bar(ind, women, width=0.8, label='women', color='gold', bottom=man+missing)
plt.bar(ind, man, width=0.8, label='man', color='silver', bottom=missing)
plt.bar(ind, missing, width=0.8, label='missing', color='#CD853F')

plt.xticks(ind, labels)
plt.ylabel("percentage")
plt.legend(loc="upper left")
plt.title("demo")

plt.show()