2. pandas: Handling Missing Data

Handling Missing Data in pandas

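  • For numeric data, pandas represents missing data with the floating-point value NaN. The same NaN also marks the missing entry in the object-dtype Series below: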
In [1]: import pandas as pd
In [2]: import numpy as np

In [3]: string_data = pd.Series(['python','pandas',np.nan,'numpy'])
In [4]: string_data
Out[4]:
0 python
1 pandas
2 NaN
3 numpy
dtype: object
# Use the isnull() method to detect missing values
In [5]: string_data.isnull()
Out[5]:
0 False
1 False
2 True
3 False
dtype: bool
  • Python's built-in None value is also treated as NA in object arrays:
In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]:
0 True
1 False
2 True
3 False
dtype: bool
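A minimal sketch of how None behaves in a numeric Series compared with an object Series. The variable names `numeric` and `mixed` are illustrative and do not come from the session above.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is coerced to NaN and the dtype becomes float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)
print(numeric.isnull().tolist())

# In an object Series, None and NaN are both reported as missing.
mixed = pd.Series(['python', None, np.nan], dtype=object)
print(mixed.isnull().tolist())
```

In both cases isnull() reports the entry as missing, so later filtering and filling do not need to tell None and NaN apart.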
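  • The following functions handle missing data:
Method    Description
dropna    Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated
fillna    Fills missing values with a specified value or with an interpolation method
isnull    Returns an object of Boolean values: True where a value is missing, False otherwise
notnull   The negation of isnull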
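As a rough illustration of the isnull and notnull entries in the table above, the sketch below counts missing values per column and per row. The frame `sample` and the other variable names are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame, used only for illustration.
sample = pd.DataFrame([[1., 5., 3.],
                       [2., np.nan, np.nan],
                       [np.nan, np.nan, np.nan],
                       [np.nan, 6., 3.]])

# isnull() returns a Boolean frame; summing it counts missing values.
missing_per_column = sample.isnull().sum()
missing_per_row = sample.isnull().sum(axis=1)

# notnull() is the elementwise negation of isnull().
is_negation = sample.notnull().equals(~sample.isnull())

print(missing_per_column.tolist())
print(missing_per_row.tolist())
print(is_negation)
```

Counting missing values this way is a common first step before deciding whether to drop or fill them.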

Filtering Out Missing Data

  • The functions in the table above can be used to filter out missing data
  • For a Series:
In [8]: string_data
Out[8]:
0 None
1 pandas
2 NaN
3 numpy
dtype: object
# Filter out the missing values with dropna()
In [9]: string_data.dropna()
Out[9]:
1 pandas
3 numpy
dtype: object
# Equivalent to
In [10]: string_data[string_data.notnull()]
Out[10]:
1 pandas
3 numpy
dtype: object
  • For a DataFrame:
In [15]: from numpy import nan as NA
In [21]: data = pd.DataFrame([[1.,5.,3.],[2.,NA,NA],
...: [NA,NA,NA],[NA,6.,3.]])
...:

In [22]: data
Out[22]:
0 1 2
0 1.0 5.0 3.0
1 2.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.0 3.0

In [23]: cleand = data.dropna()

In [24]: data
Out[24]:
0 1 2
0 1.0 5.0 3.0
1 2.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.0 3.0
# For a DataFrame, dropna() by default drops every row that contains any missing value
In [25]: cleand
Out[25]:
0 1 2
0 1.0 5.0 3.0
# Passing how='all' drops only the rows in which every value is NA
In [26]: cleand = data.dropna(how='all')

In [27]: cleand
Out[27]:
0 1 2
0 1.0 5.0 3.0
1 2.0 NaN NaN
3 NaN 6.0 3.0
# To drop columns instead, pass axis=1
In [28]: data[4] = NA

In [29]: data
Out[29]:
0 1 2 4
0 1.0 5.0 3.0 NaN
1 2.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.0 3.0 NaN

In [30]: data.dropna(axis=1,how='all')
Out[30]:
0 1 2
0 1.0 5.0 3.0
1 2.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.0 3.0
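The table earlier mentions that a threshold can adjust how much missing data dropna tolerates, but the session above does not show it. The sketch below uses the thresh parameter of dropna for that purpose. The frame `sample` is a fresh, hypothetical copy of the three-column data and `kept` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same values as the three-column data above.
sample = pd.DataFrame([[1., 5., 3.],
                       [2., np.nan, np.nan],
                       [np.nan, np.nan, np.nan],
                       [np.nan, 6., 3.]])

# thresh=2 keeps only the rows that have at least 2 non-missing values.
kept = sample.dropna(thresh=2)

print(kept.index.tolist())
```

Row 1 has a single non-missing value and row 2 has none, so both fall below the threshold and are dropped. Like the other calls here, dropna returns a new object and leaves `sample` unchanged.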

Filling In Missing Data

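  • When you would rather not filter out missing data, you can fill it in instead; in most cases the fillna function is the tool for this. Calling fillna with a constant replaces every missing value with that constant.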
In [31]: df = pd.DataFrame(np.random.randn(7,3))      

In [32]: df.iloc[:4,1] = NA

In [33]: df.iloc[:2,2] = NA

In [34]: df
Out[34]:
0 1 2
0 0.865682 NaN NaN
1 -0.204699 NaN NaN
2 0.130211 NaN -0.848007
3 -0.426136 NaN 1.704451
4 0.702581 -1.478580 0.159704
5 -0.693145 -0.334965 -0.187303
6 1.181460 1.289157 0.128563

In [35]: df.fillna(0)
Out[35]:
0 1 2
0 0.865682 0.000000 0.000000
1 -0.204699 0.000000 0.000000
2 0.130211 0.000000 -0.848007
3 -0.426136 0.000000 1.704451
4 0.702581 -1.478580 0.159704
5 -0.693145 -0.334965 -0.187303
6 1.181460 1.289157 0.128563
# Call fillna with a dict to fill each column with a different value:
In [37]: df.fillna({1:0.5,2:0})
Out[37]:
0 1 2
0 0.865682 0.500000 0.000000
1 -0.204699 0.500000 0.000000
2 0.130211 0.500000 -0.848007
3 -0.426136 0.500000 1.704451
4 0.702581 -1.478580 0.159704
5 -0.693145 -0.334965 -0.187303
6 1.181460 1.289157 0.128563
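Besides constants, missing values are often filled from neighbouring values or from a summary statistic. The sketch below is one possible illustration, assuming a small hypothetical frame `df_demo` with columns `a` and `b`; it uses DataFrame.ffill() for forward filling, which recent pandas versions favour over passing a method argument to fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values so the results are reproducible.
df_demo = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0],
                        'b': [np.nan, 2.0, np.nan, 6.0]})

# Forward fill: propagate the last valid value down each column.
forward = df_demo.ffill()

# limit=1 fills at most one consecutive missing value per gap.
limited = df_demo.ffill(limit=1)

# Fill each column with its own mean (NaN is skipped when computing the mean).
mean_filled = df_demo.fillna(df_demo.mean())

print(forward)
print(limited)
print(mean_filled)
```

A leading missing value, such as the first entry of column `b`, has no earlier value to copy and therefore stays NaN after a forward fill.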