>>> from helper import info; info()
页面更新时间： 2020-03-18 16:07:38
操作系统/OS: Linux-4.19.0-8-amd64-x86_64-with-debian-10.3 ；Python: 3.7.3

项目：从电子表格中读取数据¶

假定你有一张电子表格的数据，来自于2010年美国人口普查。你有一个无聊的任务，要遍历表中的几千行，计算总的人口，以及每个县的普查区的数目（普查区就是一个地理区域，是为人口普查而定义的)。每行表示一个人口普查区。我们将这个电子表格文件命名为 censuspopdata.xlsx，可以从 http://nostarch.com/automatestuff/ 下载它。它的内容如图12-2所示。

尽管Excel是要能够计算多个选中单元格的和，你仍然需要选中3000个以上县的单元格。即使手工计算一个县的人口只需要几秒钟，整张电子表格也需要几个小时时间。

图 12-2 censuspopdata.xlsx 电子表格

在这个项目中，你要编写一个脚本，从人口普查电子表格文件中读取数据，并在几秒钟内计算出每个县的统计值。

下面是程序要做的事：

从Excel电子表格中读取数据。
计算每个县中普查区的数目。
计算每个县的总人口。
打印结果。

这意味着代码需要完成下列任务：

用 openpyxl 模块打开Excel文档并读取单元格。
计算所有普查区和人口数据，将它保存到一个数据结构中。
利用 pprint 模块，将该数据结构写入一个扩展名为 .py 的文本文件。

第1步：读取电子表格数据¶

censuspopdata.xlsx 电子表格中只有一张表，名为 'Population by Census Tract'。每一行都保存了一个普查区的数据。列分别是普查区的编号（A），州的简称（B），县的名称（C），普查区的人口（D）。

打开一个新的文件编辑器窗口，输入以下代码。将文件保存为 readCensusExcel.py 。

首先是打开电子表格：

>>> import openpyxl, pprint
>>> wb = openpyxl.load_workbook('censuspopdata.xlsx')

>>> sheet = wb.get_sheet_by_name('Population by Census Tract')
>>> countyData = {}
>>>
>>> # TODO: Fill in countyData with each county's population and tracts,

开始读取各行：

>>> for row in range(2, sheet.max_row + 1):
>>>     # Each row in the spreadsheet has data for one census tract.
>>>     state  = sheet['B' + str(row)].value
>>>     county = sheet['C' + str(row)].value
>>>     pop    = sheet['D' + str(row)].value
>>>
>>> # TODO: Open a new text file and write the contents of countyData to it.

这段代码导入了 openpyxl 模块，也导入了 pprint 模块，你用后者来打印最终的县的数据。然后代码打开了 censuspopdata.xlsx 文件，取得了包含人口普查数据的工作表，开始迭代它的行。

请注意，你也创建了一个countyData变量，它将包含你计算的每个县的人口和普查区数目。但在它里面存储任何东西之前，你应该确定它内部的数据结构。

第2步：填充数据结构¶

保存在 countyData 中的数据结构将是一个字典，以州的简称作为键。每个州的简称将映射到另一个字典，其中的键是该州的县的名称。每个县的名称又映射到一个字典，该字典只有两个键，'tracts' 和 'pop' 。这些键映射到普查区数目和该县的人口。例如，该字典可能类似于：

>>> countyData = {'AK': {'Aleutians East': {'pop': 3141, 'tracts': 1},
>>>         'Aleutians West': {'pop': 5561, 'tracts': 2},
>>>         'Anchorage': {'pop': 291826, 'tracts': 55},
>>>         'Bethel': {'pop': 17013, 'tracts': 3},
>>>         'Bristol Bay': {'pop': 997, 'tracts': 1} } }

如果前面的字典保存在 countyData 中，下面的表达式求值结果如下：

>>> countyData['AK']['Anchorage']['pop']
291826

>>> countyData['AK' ] [ 'Anchorage' ] [ 'tracts']
55

一般来说， countyData 字典中的键看起来像这样：

countyData[state abbrev][county]['tracts']
countyData[state abbrev][county]['pop']

既然知道了 countyData 的结构，就可以编写代码，用县的数据填充它。将下面的代码添加到程序的末尾：

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.

--snip--

>>> for row in range(2, sheet.max_row + 1 ):
>>>
>>>     # Each row in the spreadsheet has data for one census tract.
>>>
>>>     State  = sheet['B' + str(row)].value
>>>     county = sheet['C' + str(row)].value
>>>     pop    = sheet['D'+ str(row)].value
>>>
>>>     # Make sure the key for this state exists.
>>>
>>>     countyData.setdefault(state,{})
>>>
>>>     # Make sure the key for this county in this state exists.
>>>
>>>     countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})
>>>
>>>     # Each row represents one census tract, so increment by one.
>>>     countyData[state][county]['tracts'] += 1
>>>
>>>     # Increase the county pop by the pop in this census tract.
>>>
>>>     countyData[state][county]['pop'] += int(pop)
# TODO: Open a new text file and write the contents of countyData to it.

最后的两行代码执行实际的计算工作，在 for 循环的每次迭代中，针对当前的县，增加 tracts 的值，并增加 pop 的值。

其他代码存在是因为，只有 countyData 中存在的键，你才能引用它的值。（也就是说，如果 'AK' 键不存在， countyData['AK']['Anchorage']['tracts']+= 1 将导致一个错误）。为了确保州简称的键存在，你需要调用 setdefault() 方法，在 state 还不存在时设置一个默认值。

正如 countyData 字典需要一个字典作为每个州缩写的值，这样的字典又需要一个字典，作为每个县的键的值。这样的每个字典又需要键 'tracts' 和 'pop' ，它们的初始值为整数0 (如果这个字典的结构令你混淆，回去看看本节开始处字典的例子)。

如果键已经存在， setdefault() 不会做任何事情，因此在 for 循环的每次迭代中调用它不会有问题。

第3步：将结果写入文件¶

for 循环结束后， countyData 字典将包含所有的人口和普查区信息，以县和州为键。这时，你可以编写更多代码，将数据写入文本文件或另一个Excel电子表格。目前，我们只是使用 pprint.pformat() 函数，将变量字典的值作为一个巨大的字符串，写入文件 census2010.py 。在程序的末尾加上以下代码（确保它没有缩进，这样它就在 for 循环之外)：

>>> #! python3
>>> # readCensusExcel.py - Tabulates population and number of census tracts for
>>> # each county.
>>>
>>> import openpyxl, pprint
>>> print('Opening workbook...')
>>> wb = openpyxl.load_workbook('censuspopdata.xlsx')
>>> sheet = wb.get_sheet_by_name('Population by Census Tract')
>>> countyData = {}
>>> # Fill in countyData with each county's population and tracts.
>>> print('Reading rows...')
>>> for row in range(2, sheet.max_row + 1):
>>>     # Each row in the spreadsheet has data for one census tract.
>>>     state  = sheet['B' + str(row)].value
>>>     county = sheet['C' + str(row)].value
>>>     pop    = sheet['D' + str(row)].value
>>>
>>>     # Make sure the key for this state exists.
>>>     countyData.setdefault(state, {})
>>>     # Make sure the key for this county in this state exists.
>>>     countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})
>>>
>>>     # Each row represents one census tract, so increment by one.
>>>     countyData[state][county]['tracts'] += 1
>>>     # Increase the county pop by the pop in this census tract.
>>>     countyData[state][county]['pop'] += int(pop)
>>>
>>> # Open a new text file and write the contents of countyData to it.
>>> print('Writing results...')
>>> resultFile = open('census2010.py', 'w')
>>> resultFile.write('allData = ' + pprint.pformat(countyData))
>>> resultFile.close()
>>> print('Done.')
Opening workbook...
Reading rows...
Writing results...
Done.

pprint.pformat() 函数产生一个字符串，它本身就是格式化好的、有效的Python代码。将它输出到文本文件 census2010.py ，你就通过Python程序生成了一个 Python程序！这可能看起来有点复杂，但好处是你现在可以导入 census2010.py ，就像任何其他 Python 模块一样。在交互式环境中，将当前工作目录变更到新创建的文件所在的文件夹（根据不同的环境，路径可能不一样)，然后导入它：

>>> import os
>>> # os.chdir('Python34')
>>> import census2010
>>> census2010.allData['AK']['Anchorage']
{'pop': 291826, 'tracts': 55}

>>> anchoragePop = census2010.allData['AK']['Anchorage']['pop']
>>> print('The 2010 population of Anchorage was ' + str(anchoragePop))
The 2010 population of Anchorage was 291826

readCensusExcel.py 程序是可以扔掉的代码：当你把它的结果保存为 census2010.py之后，就不需要再次运行该程序了。任何时候，只要需要县的数据，就可以执行 import census2010 。

手工计算这些数据可能需要数小时，这个程序只要几秒钟。利用 OpenPyXL ，可以毫无困难地提取保存在Excel电子表格中的信息，并对它进行计算。从 http://nostarch.com/automatestuff/ 可以下载这个完整的程序。

第4步：类似程序的思想¶

许多公司和组织机构使用Excel来保存各种类型的数据，电子表格会变得庞大，这并不少见。解析Excel电子表格的程序都有类似的结构：它加载电子表格文件，准备一些变量或数据结构，然后循环遍历电子表格中的每一行。这样的程序可以做下列事情：

比较一个电子表格中多行的数据。
打开多个Excel文件，跨电子表格比较数据。
检查电子表格是否有空行或无效的数据，如果有就警告。
从电子表格中读取数据，将它作为Python程序的输入。