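Data Processing with Pandas

This article covers the Pandas methods you need to master for data processing.

Computation Methods

We have already covered how to select data efficiently with Pandas, along with some basic statistical operations. This section focuses on how to perform computations in Pandas.

Selecting, Assigning, and Computing

In the earlier tutorial on selecting data, we learned how to pick out part of a dataset. Once a part has been selected, we can also operate on it. Below we first generate some data and then perform computations on selections of it.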

import pandas as pd
import numpy as np
data = np.arange(-12, 12).reshape((6,4))
df = pd.DataFrame(
  data,
  index=list("abcdef"),
  columns=list("ABCD"))
df

Output:

    A   B   C   D
a -12 -11 -10  -9
b  -8  -7  -6  -5
c  -4  -3  -2  -1
d   0   1   2   3
e   4   5   6   7
f   8   9  10  11

Select column A and multiply its contents by 0.

df["A"] *= 0
df

Output:

   A   B   C   D
a  0 -11 -10  -9
b  0  -7  -6  -5
c  0  -3  -2  -1
d  0   1   2   3
e  0   5   6   7
f  0   9  10  11

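Likewise, the iloc and loc indexers from the selection tutorial can also be used to operate on specific data. iloc selects by integer position, while loc selects by label.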

df.loc["a", "A"] = 100
df.iloc[1, 0] = 200
df

Output:

     A   B   C   D
a  100 -11 -10  -9
b  200  -7  -6  -5
c    0  -3  -2  -1
d    0   1   2   3
e    0   5   6   7
f    0   9  10  11
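
The difference between label-based and position-based selection is easiest to see when the index labels are integers that do not coincide with positions. A minimal sketch, using a small hypothetical Series s that is not part of the data above:

```python
import pandas as pd

# Hypothetical Series whose integer labels run in the opposite order to positions
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the element at position 0 (the first one)

print("loc[0]:", by_label)
print("iloc[0]:", by_position)
```

Here loc[0] and iloc[0] return different elements, because the label 0 sits at the last position.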

That was plain assignment. Now try using the same indexing to perform a computation:

df.loc["a", :] = df.loc["a",:] * 2
df

Output:

     A   B   C   D
a  200 -22 -20 -18
b  200  -7  -6  -5
c    0  -3  -2  -1
d    0   1   2   3
e    0   5   6   7
f    0   9  10  11

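Now let's try a conditional operation: find the values in df["A"] that equal 0 and set them to -1. Doing the selection and the assignment in a single .loc call avoids chained indexing (df["A"][df["A"] == 0] = -1), which may leave df unchanged in recent pandas versions.

df.loc[df["A"] == 0, "A"] = -1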
df

Output:

     A   B   C   D
a  200 -22 -20 -18
b  200  -7  -6  -5
c   -1  -3  -2  -1
d   -1   1   2   3
e   -1   5   6   7
f   -1   9  10  11

In general, any method that selects data in pandas can also be used to assign new values to the selection.
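
As a sketch of the same idea, a boolean mask can drive the assignment through .loc, or through mask() when a new object is preferred over in-place modification. The frame demo and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({"A": [0, 5, 0], "B": [1, -2, 3]})

# In place: rows where A == 0 get A set to -1, in a single .loc call
demo.loc[demo["A"] == 0, "A"] = -1

# Not in place: mask() returns a new Series with values replaced where the condition holds
b_clipped = demo["B"].mask(demo["B"] < 0, 0)

print(demo)
print(b_clipped)
```

The .loc form changes demo itself, while mask() leaves demo["B"] untouched and returns the modified copy.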

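The apply Method

Another convenient way to process data in bulk, and one I like to use, is apply. It runs a custom function over the data, which helps organize complex computations. The selection-based operations above are a simple form of computation. When the computation gets complicated, or needs several local variables to hold intermediate results, we can move the logic into a function to keep it modular.

For example, let's define the following data: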

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

Output:

   A  B
0  4  9
1  4  9
2  4  9

To take the square root of every value in df, the usual way is:

np.sqrt(df)

With apply, it becomes:

df.apply(np.sqrt)

Output:

     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

We passed the function np.sqrt to apply as an argument. That may not look useful, and for this case np.sqrt(df) is indeed simpler. But what about the following case?

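def func(x):
    # x is one row (a Series labeled "A" and "B"), so use positional access via iloc
    return x.iloc[0] * 2, x.iloc[1] * -1

df.apply(func, axis=1, result_type='expand')

Output:

   0  1
0  8 -9
1  8 -9
2  8 -9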

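In this custom function, for each row of df, the element at position 0 is multiplied by 2 and the element at position 1 by -1, so the original column A is doubled and column B is negated. Note that apply has several other options. Here I used result_type="expand" so that the returned tuple is spread over multiple columns; without it, the result is a single column holding one tuple per row. Try removing result_type and see how the output changes.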

#df.apply(func, axis=1)
0    (8, -9)
1    (8, -9)
2    (8, -9)
dtype: object

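Incidentally, with result_type="broadcast", the result keeps the original column names and index. Compare this run with the previous ones and you will see the difference.

def func(x):
    # positional access via iloc, as before
    return x.iloc[0] * 2, x.iloc[1] * -1

df.apply(func, axis=1, result_type='broadcast')

Output:

   A  B
0  8 -9
1  8 -9
2  8 -9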

If you only want to transform a single column:

def func(x):
    return x["A"] * 4
  
df.apply(func, axis=1)

Output:

0    16
1    16
2    16
dtype: int64

To keep the original df but modify only one column:

def func(x):
    return x["A"] * 4

df["A"] = df.apply(func, axis=1)
df

Output:

    A  B
0  16  9
1  16  9
2  16  9

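To modify a row instead, set axis=0 so that func receives each column in turn, and adjust the logic inside func accordingly. Here r[2] is the value at row label 2 of each column, so the function computes new values for the last row: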

def func(r):
    return r[2] * 4

last_row = df.apply(func, axis=0)
print("last_row:\n", last_row)

df.iloc[2, :] = last_row
print("\ndf:\n", df)

Output:

last_row:
 A    64
B    36
dtype: int64

df:
     A   B
0  16   9
1  16   9
2  64  36
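
When a row-wise function needs intermediate variables and several named outputs, one option is to return a pd.Series from the function so that apply builds named columns. This is only a sketch; the frame scores, the function summarize, and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({"math": [90, 60, 75], "art": [70, 80, 75]})

def summarize(row):
    # Local variables hold intermediate results for one row
    total = row["math"] + row["art"]
    gap = abs(row["math"] - row["art"])
    # Returning a Series makes apply produce one named column per entry
    return pd.Series({"total": total, "gap": gap})

summary = scores.apply(summarize, axis=1)
print(summary)
```

For arithmetic this simple, plain column operations such as scores["math"] + scores["art"] are usually faster; apply pays off when the per-row logic is harder to express that way.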

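Summary

To apply special computations or custom functions to data in bulk, we covered two broad approaches: selecting by index and computing directly, and using pandas' apply for richer computation patterns.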

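Text Processing

Compared with NumPy, Python's workhorse for scientific computing, Pandas has a particular strength: handling text in datasets. NumPy is a purely numerical library and is generally faster than Pandas at numerical processing. But when it comes to richer data types, such as text or dates, Pandas has a big advantage. Let's look at how Pandas can be used to process text data.

Formatting Strings

str.upper(); str.lower(); str.len()

Let's compare against Python's built-in string handling. Python itself comes with many string methods, such as strip() and upper():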

import pandas as pd
py_s = "A,B,C,Aaba,Baca,CABA,dog,cat"
pd_s = pd.Series(
  ["A","B","C","Aaba","Baca","CABA","dog","cat"],
  dtype="string")
print("python:\n", py_s.upper())
print("\npandas:\n", pd_s.str.upper())

Output:

python:
 A,B,C,AABA,BACA,CABA,DOG,CAT

pandas:
 0       A
1       B
2       C
3    AABA
4    BACA
5    CABA
6     DOG
7     CAT
dtype: string

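**Note: to get the most consistent behavior from Pandas' text processing, it is best to make sure the Series or DataFrame column has dtype="string".** The .str accessor also works on object-dtype data, but the results use different dtypes and missing-value markers. If the data is not string dtype, for example because it was just read from an Excel file and was not parsed as string, how do we convert it? It's simple.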

pd_not_s = pd.Series(
  ["A", "B", "C", "Aaba", "Baca", "CABA", "dog", "cat"],
)
print("pd_not_s type:", pd_not_s.dtype)
#pd_not_s type: object
pd_s = pd_not_s.astype("string")
print("pd_s type:", pd_s.dtype)
#pd_s type: string
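
The practical difference between the two dtypes shows up mostly with missing values. A small sketch, assuming a hypothetical list raw that contains a None:

```python
import pandas as pd

# Hypothetical data with one missing value
raw = ["dog", None, "cat"]

obj_s = pd.Series(raw, dtype="object")
str_s = pd.Series(raw, dtype="string")

# .str works on both, but the result dtypes differ
obj_len = obj_s.str.len()  # float64, missing value shown as NaN
str_len = str_s.str.len()  # nullable Int64, missing value shown as <NA>

print(obj_len)
print(str_len)
```

With the string dtype the lengths stay integers even though one value is missing, which is one reason to convert early.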

Good, keep that in mind. Let's continue comparing with native Python.

print("python lower:\n", py_s.lower())
print("\npandas lower:\n", pd_s.str.lower())
print("python len:\n", [len(s) for s in py_s.split(",")])
print("\npandas len:\n", pd_s.str.len())

Output:

python lower:
 a,b,c,aaba,baca,caba,dog,cat

pandas lower:
 0       a
1       b
2       c
3    aaba
4    baca
5    caba
6     dog
7     cat
dtype: string
python len:
 [1, 1, 1, 4, 4, 4, 3, 3]

pandas len:
 0    1
1    1
2    1
3    4
4    4
5    4
6    3
7    3
dtype: Int64

str.strip(); str.lstrip(); str.rstrip()

Next, let's compare trimming whitespace:

py_s = ["   jack", "jill ", "    jesse    ", "frank"]
pd_s = pd.Series(py_s, dtype="string")
print("python strip:\n", [s.strip() for s in py_s])
print("\npandas strip:\n", pd_s.str.strip())

print("\n\npython lstrip:\n", [s.lstrip() for s in py_s])
print("\npandas lstrip:\n", pd_s.str.lstrip())

print("\n\npython rstrip:\n", [s.rstrip() for s in py_s])
print("\npandas rstrip:\n", pd_s.str.rstrip())

Output:

python strip:
 ['jack', 'jill', 'jesse', 'frank']

pandas strip:
 0     jack
1     jill
2    jesse
3    frank
dtype: string


python lstrip:
 ['jack', 'jill ', 'jesse    ', 'frank']

pandas lstrip:
 0         jack
1        jill 
2    jesse    
3        frank
dtype: string


python rstrip:
 ['   jack', 'jill', '    jesse', 'frank']

pandas rstrip:
 0         jack
1         jill
2        jesse
3        frank
dtype: string

It may be hard to tell from the output how much whitespace there was, but it has in fact all been removed.

str.split()

Next, let's compare the split method.

py_s = ["a_b_c", "jill_jesse", "frank"]
pd_s = pd.Series(py_s, dtype="string")
print("python split:\n", [s.split("_") for s in py_s])
print("\npandas split:\n", pd_s.str.split("_"))

Output:

python split:
 [['a', 'b', 'c'], ['jill', 'jesse'], ['frank']]

pandas split:
 0        [a, b, c]
1    [jill, jesse]
2          [frank]
dtype: object

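Hmm, the pandas result looks a bit odd: everything ends up in a single column. Recall that in the previous section on apply() we could pass result_type="expand". split has a similar option, expand=True, which puts the split parts into separate columns.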

pd_s.str.split("_", expand=True)

Output:

       0      1     2
0      a      b     c
1   jill  jesse  <NA>
2  frank   <NA>  <NA>

As you can see, three columns were produced, but some rows did not split into that many parts, so the missing positions are shown as <NA> (pd.NA).
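
Two related options are often handy: the n parameter limits the number of splits, and the .str[...] indexer picks one element from each split list. A brief sketch on a hypothetical Series names:

```python
import pandas as pd

names = pd.Series(["a_b_c", "jill_jesse", "frank"], dtype="string")

# Split at most once, so there are at most two columns
first_split = names.str.split("_", n=1, expand=True)

# Take only the first piece of every split result
heads = names.str.split("_").str[0]

print(first_split)
print(heads)
```

With n=1, "a_b_c" becomes "a" and "b_c" instead of three separate parts.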

One more point: all the experiments above use a Series, but the same works with a DataFrame. Just select a column or a row first to get a Series, then apply the str methods.

pd_df = pd.DataFrame([["a", "b"], ["C", "D"]])
pd_df.iloc[0, :].str.upper()

Output:

0    A
1    B
Name: 0, dtype: object
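
To apply a string method to every column of a DataFrame rather than to a single Series, one possible approach is to combine apply with the .str accessor, column by column. A sketch, reusing the same small frame:

```python
import pandas as pd

pd_df = pd.DataFrame([["a", "b"], ["C", "D"]])

# One column: select it to get a Series, then use .str
col0_upper = pd_df[0].str.upper()

# All columns: apply passes each column (a Series) to the lambda
all_upper = pd_df.apply(lambda col: col.str.upper())

print(col0_upper)
print(all_upper)
```

This assumes every column holds text; columns of other types would need to be skipped or converted first.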

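Regular Expressions

str.contains(); str.match();

Regular expressions are very useful, and we spent considerable effort on them in the Python basics tutorial: they use special rules to capture specific text. The first thing to demonstrate is whether a pattern actually finds anything. We use str.contains() or str.match() to confirm that a match was found.

Note: if you are not yet familiar with regular expressions, I strongly recommend reading my regex tutorial first; otherwise the patterns here will be hard to follow. For example, [0-9][a-z] matches any digit from 0 to 9 followed by any letter from a to z.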

pattern = r"[0-9][a-z]"
s = pd.Series(["1", "1a", "11c", "abc"], dtype="string")
s.str.contains(pattern)

Output:

0    False
1     True
2     True
3    False
dtype: boolean

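Now replace str.contains() with str.match() and see whether the result changes. If you look closely, you will notice that 11c matches with contains() but not with match(). That is because contains() returns True if the pattern is found anywhere in the string, whereas match() only returns True if the pattern matches starting at the beginning of the string (like Python's re.match). In 11c, the first character 1 is followed by another 1 rather than a letter, so the match fails.

So to make match() accept 11c, we need to change the pattern to r"[0-9]+?[a-z]", which allows one or more digits before the letter. For more on why, see my regex tutorial.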

pattern = r"[0-9]+?[a-z]"
s.str.match(pattern)

Output:

0    False
1     True
2     True
3    False
dtype: boolean
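
To see how the checks differ side by side, the sketch below runs the original pattern through str.contains(), str.match(), and str.fullmatch() (the last one is available in pandas 1.1 and later). The extra value "1ab" is added here for illustration:

```python
import pandas as pd

pattern = r"[0-9][a-z]"
s = pd.Series(["1", "1a", "11c", "1ab"], dtype="string")

found_anywhere = s.str.contains(pattern)  # pattern occurs somewhere in the string
found_at_start = s.str.match(pattern)     # pattern matches starting at position 0
found_whole = s.str.fullmatch(pattern)    # pattern matches the entire string

print(pd.DataFrame({
    "value": s,
    "contains": found_anywhere,
    "match": found_at_start,
    "fullmatch": found_whole,
}))
```

"11c" passes only contains(), while "1ab" passes contains() and match() but not fullmatch(), since the trailing b is left over.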

str.startswith(); str.endswith()

Next, let's compare startswith and endswith, two prefix and suffix checks I often use in native Python.

py_s = ["1", "1a", "21c", "abc"]
pd_s = pd.Series(py_s, dtype="string")
print("py_s startswith '1':\n", [s.startswith("1") for s in py_s])
print("\npy_s endswith 'c':\n", [s.endswith("c") for s in py_s])

print("\n\npd_s startswith '1':\n", pd_s.str.startswith("1"))
print("\npd_s endswith 'c':\n", pd_s.str.endswith("c"))

Output:

py_s startswith '1':
 [True, True, False, False]

py_s endswith 'c':
 [False, False, True, True]


pd_s startswith '1':
 0     True
1     True
2    False
3    False
dtype: boolean

pd_s endswith 'c':
 0    False
1    False
2     True
3     True
dtype: boolean

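Note that, unlike str.match() and str.contains(), pandas' str.startswith() and str.endswith() do not accept regular expressions; they test for literal strings only. For pattern-based prefix or suffix checks, use str.match() or str.contains() with an anchored pattern.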
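
A pattern-based prefix or suffix check can be sketched by anchoring a regular expression with ^ or $ and passing it to str.contains(). The patterns below are illustrative choices, not the only way to write them:

```python
import pandas as pd

s = pd.Series(["1", "1a", "21c", "abc"], dtype="string")

# "^" anchors the pattern to the start of each string
starts_with_digit = s.str.contains(r"^[0-9]")

# "$" anchors the pattern to the end of each string
ends_with_letter = s.str.contains(r"[a-z]$")

print(starts_with_digit)
print(ends_with_letter)
```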

str.replace()

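Another very useful method, and I think the most important one, is replace, because it saves a lot of copy-and-paste work, such as manually editing an Excel sheet according to some rule for a new task from your boss. As before, let's compare it with Python's native replace.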

py_s = ["1", "1a", "21c", "abc"]
pd_s = pd.Series(py_s, dtype="string")
print("py_s replace '1' -> '9':\n", [s.replace("1", "9") for s in py_s])

print("\n\npd_s replace '1' -> '9':\n", pd_s.str.replace("1", "9"))

Output:

py_s replace '1' -> '9':
 ['9', '9a', '29c', 'abc']


pd_s replace '1' -> '9':
 0      9
1     9a
2    29c
3    abc
dtype: string

What makes it more powerful than Python's native str.replace is that this replace supports regular expressions. Let's replace every digit with NUM.

print("pd_s replace -> 'NUM':")
pd_s.str.replace(r"[0-9]", "NUM", regex=True)

Output:

pd_s replace -> 'NUM':
0        NUM
1       NUMa
2    NUMNUMc
3        abc
dtype: string
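
With regex=True, the replacement can also refer to capture groups, or be a function that receives each match object. A short sketch on the same values; the replacement formats are arbitrary examples:

```python
import pandas as pd

pd_s = pd.Series(["1", "1a", "21c", "abc"], dtype="string")

# \1 refers to the first capture group, here a whole run of digits
wrapped = pd_s.str.replace(r"([0-9]+)", r"<\1>", regex=True)

# A callable receives a re.Match object and returns the replacement text
doubled = pd_s.str.replace(
    r"[0-9]+", lambda m: str(int(m.group(0)) * 2), regex=True
)

print(wrapped)
print(doubled)
```

Because the pattern uses +, the digits in "21c" are treated as one number rather than replaced one character at a time.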

str.extract(); str.extractall()

Besides replacing parts of the text, we can also pull specific text out of it, somewhat like the findall function in regex.

s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r"([ab])(\d)")

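A quick explanation of the pattern r"([ab])(\d)": it contains two pairs of parentheses (capture groups). The first captures a or b, and the second captures a digit. The result has two columns, one per capture group. The last row contains two NaN values because the third entry, c3, does not match: c is not in [ab], so nothing is extracted.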

Output:

     0    1
0    a    1
1    b    2
2  NaN  NaN

Alongside str.extract() there is also str.extractall(), which returns all matches rather than only the first one found.
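
The difference between the two can be sketched on a hypothetical Series in which one entry contains more than one match. Named groups, written (?P<name>...), are used here so that the result columns get readable names:

```python
import pandas as pd

s = pd.Series(["a1b2", "c3", "xyz"])
pattern = r"(?P<letter>[a-z])(?P<digit>\d)"

# extract: only the first match per row; rows without a match become NaN
first = s.str.extract(pattern)

# extractall: every match; the index is (original row, match number)
every = s.str.extractall(pattern)

print(first)
print(every)
```

extractall() drops rows that have no match at all, so "xyz" appears in first (as NaN) but not in every.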

Concatenation

str.cat()

There are many ways to join two text Series. Most of the time we want to combine two Series element by element into a new Series, as below.

s1 = pd.Series(["A", "B", "C", "D"], dtype="string")
s2 = pd.Series(["1", "2", "3", "4"], dtype="string")
s1.str.cat(s2)
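
For reference, the sketch below shows what the call above produces, plus two variations that use the sep parameter:

```python
import pandas as pd

s1 = pd.Series(["A", "B", "C", "D"], dtype="string")
s2 = pd.Series(["1", "2", "3", "4"], dtype="string")

pairwise = s1.str.cat(s2)           # element-wise: A1, B2, C3, D4
with_sep = s1.str.cat(s2, sep="-")  # element-wise with a separator
collapsed = s1.str.cat(sep=",")     # no second Series: join everything into one string

print(pairwise)
print(with_sep)
print(collapsed)
```

Note that calling str.cat() without another Series returns a single Python string rather than a Series.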

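The above joins pieces of text into new text. If you want to learn how to concatenate DataFrames in pandas, such as horizontally joining a 2-column df with a 3-column df, we cover that separately in the section on Pandas concatenation, because there are too many methods to go through here.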

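Summary

As you can see, text processing covers a lot of ground and offers many methods. We picked the important and useful ones. If these are not enough for you, refer to the official documentation for more.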

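Handling Abnormal Data

By abnormal data I usually mean dirty data in machine learning or statistical analysis. Why is it abnormal or dirty? Because it does not follow the patterns you expect, and it causes trouble for you or your model. It often arises during data collection, from human error or faulty sensors. Or the data for some sample was never collected, which also causes problems in batch processing.

Since abnormal data is common and unavoidable, let's look at how to find suitable ways to handle it.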