In [1]: import pandas as pd

Data used for this tutorial:

Titanic data

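This tutorial uses the Titanic data set, stored as CSV. The data consists of the following columns:

PassengerId: ID of each passenger.
Survived: Indicates whether the passenger survived. 0 for no and 1 for yes.
Pclass: One of the 3 ticket classes: Class 1, Class 2 and Class 3.
Name: Name of the passenger.
Sex: Gender of the passenger.
Age: Age of the passenger in years.
SibSp: Number of siblings or spouses aboard.
Parch: Number of parents or children aboard.
Ticket: Ticket number of the passenger.
Fare: Fare paid for the ticket.
Cabin: Cabin number of the passenger.
Embarked: Port of embarkation.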

In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]: 
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

How to manipulate textual data?

Make all name characters lowercase.

In [4]: titanic["Name"].str.lower()
Out[4]: 
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

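To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of data), add the str accessor and apply the lower method. Each string is then converted element-wise.

Just as datetime objects have a dt accessor (see the time series tutorial), a number of specialized string methods are available through the str accessor. These methods generally have the same names as the equivalent built-in string methods for single elements, but they are applied element-wise (remember element-wise calculations?) to each of the values of the column.

Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.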
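The relationship between a built-in string method and its str accessor counterpart can be illustrated with a minimal sketch. The two-row Series below is an assumption made for illustration (the values are copied from the first rows shown earlier), not a load of the full CSV file.

```python
import pandas as pd

# Illustrative two-row Series; values copied from the table head shown above.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Built-in str.lower() works on one Python string at a time.
single = names[0].lower()

# Series.str.lower() has the same name but is applied to every value.
lowered = names.str.lower()

print(single)
print(lowered.tolist())
```

The accessor version returns a new Series of the same length, so it can be assigned straight back to a column.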

In [5]: titanic["Name"].str.split(",")
Out[5]: 
0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

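Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma.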

In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)

In [7]: titanic["Surname"]
Out[7]: 
0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Surname, Length: 891, dtype: object

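As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and apply Series.str.get() to extract the relevant part. Indeed, these string functions can be chained to combine multiple operations at once!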
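As a small sketch of such chaining, the example below splits on the comma, picks each element, and strips the leading space that the second element keeps. The sample Series and the variable names are assumptions made for illustration. The expand=True variant is shown as an alternative that returns the parts as DataFrame columns instead of lists.

```python
import pandas as pd

# Illustrative sample; not the full Titanic data set.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

parts = names.str.split(",")

# Element 0 is the part before the comma.
surname = parts.str.get(0)

# Element 1 starts with a space, so chain str.strip() to remove it.
rest = parts.str.get(1).str.strip()

# Alternative: expand=True puts each part in its own column (0 and 1).
as_columns = names.str.split(",", expand=True)

print(surname.tolist())
print(rest.tolist())
print(as_columns.shape)
```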

To user guide

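More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.

Extract the passenger data about the countess on board the Titanic.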

In [8]: titanic["Name"].str.contains("Countess")
Out[8]: 
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Name, Length: 891, dtype: bool

In [9]: titanic[titanic["Name"].str.contains("Countess")]
Out[9]: 
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked Surname
759          760         1       1  Rothes, the Countess. of (Lucy Noel Martha Dye...  female  33.0      0      0  110152  86.5   B77        S  Rothes

(Interested in her story? See Wikipedia!)

The string method Series.str.contains() checks, for each value in the Name column, whether the string contains the word Countess, and returns True (Countess is part of the name) or False (Countess is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing, introduced in the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.
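The same filter pattern can be sketched on a tiny, hypothetical table. The three rows, the shortened name and the Pclass values are assumptions made for illustration. The na=False argument is an optional extra not used in the tutorial: it makes a missing name count as "no match" so the mask stays purely boolean.

```python
import pandas as pd

# Hypothetical three-row table; None stands in for a missing name.
passengers = pd.DataFrame(
    {
        "Name": ["Allen, Mr. William Henry", "Rothes, the Countess. of", None],
        "Pclass": [3, 1, 2],
    }
)

# Substring test per value; missing names are treated as False.
mask = passengers["Name"].str.contains("Countess", na=False)

# Boolean indexing keeps only the rows where the mask is True.
countess_rows = passengers[mask]

print(mask.tolist())
print(countess_rows)
```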

Note

More powerful extractions on strings are supported, as the Series.str.contains() and Series.str.extract() methods accept regular expressions, but these are out of the scope of this tutorial.
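For readers who want a first taste anyway, here is a minimal sketch of both methods with a regular expression. The sample names are copied from the table head. The pattern and the variable names are assumptions chosen for this illustration, and the pattern only handles titles that directly follow the comma.

```python
import pandas as pd

# Illustrative sample; values copied from the table head shown earlier.
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ]
)

# Capture the word between the comma and the following period.
# expand=False returns a Series because there is a single capture group.
title = names.str.extract(r",\s*(\w+)\.", expand=False)

# A regular expression with alternation: match either "Mrs." or "Miss.".
has_female_title = names.str.contains(r"Mrs\.|Miss\.")

print(title.tolist())
print(has_female_title.tolist())
```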

To user guide

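More information on extracting parts of strings is available in the user guide section on string matching and extracting.

Which passenger of the Titanic has the longest name?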
```
In [10]: titanic["Name"].str.len()
Out[10]: 
0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64
```
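To get the longest name, we first have to get the length of each of the names in the Name column. By using pandas string methods, the Series.str.len() function is applied to each of the names individually (element-wise).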
```
In [11]: titanic["Name"].str.len().idxmax()
Out[11]: 307
```
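Next, we need to get the corresponding location, preferably the index label, of the row in the table where the name length is the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str accessor is used.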
```
In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Out[12]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'
```
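Based on the index label of the row (307) and the column name (Name), we can do a selection using the loc operator, introduced in the tutorial on subsetting.

In the "Sex" column, replace values of "male" by "M" and values of "female" by "F".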
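The three steps (lengths, idxmax(), loc) can be sketched on a small table. The three names and their index labels are copied from the output above, but the deliberately shuffled index order is an assumption made to show that idxmax() returns a label rather than a position.

```python
import pandas as pd

# Three names from the output above, with their labels in shuffled order.
df = pd.DataFrame(
    {
        "Name": [
            "Behr, Mr. Karl Howell",
            "Graham, Miss. Margaret Edith",
            "Dooley, Mr. Patrick",
        ]
    },
    index=[889, 887, 890],
)

# Step 1: element-wise string lengths (an integer Series).
lengths = df["Name"].str.len()

# Step 2: index label of the largest length, not its position.
longest_label = lengths.idxmax()

# Step 3: label-based selection of a single cell.
longest_name = df.loc[longest_label, "Name"]

print(longest_label, longest_name)
```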

In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

In [14]: titanic["Sex_short"]
Out[14]: 
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex_short, Length: 891, dtype: object

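Although replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from: to}.

Warning

There is also a Series.str.replace() method available to replace a specific set of characters. However, with a mapping of multiple values, this would become: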

titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

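This would become cumbersome and easily lead to mistakes. Just think (or try it out yourself) what would happen if those two statements are applied in the opposite order…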
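A minimal sketch of that pitfall, using a hypothetical four-value Series rather than the real column: because "male" is a substring of "female", replacing "male" first corrupts the "female" values, whereas the dictionary-based replace() matches whole values only.

```python
import pandas as pd

# Hypothetical stand-in for the Sex column.
sex = pd.Series(["male", "female", "female", "male"])

# Opposite order: "male" inside "female" is replaced too, giving "feM",
# and the second call no longer finds any "female" to replace.
wrong = sex.str.replace("male", "M").str.replace("female", "F")

# Dictionary mapping: each whole value is looked up once.
right = sex.replace({"male": "M", "female": "F"})

print(wrong.tolist())
print(right.tolist())
```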

REMEMBER

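String methods are available using the str accessor.
String methods work element-wise and can be used for conditional indexing.
The replace method is a convenient way to convert values according to a given dictionary.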

To user guide

A full overview is provided in the user guide pages on working with text data.
