4、pandas-数据合并

0. 回顾numpy的级联

  • 1.生成2各3*3的矩阵,对其分别进行两个维度上的级联
1
2
3
4
5
6
7
In [3]: nd = np.random.randint(0,10,size=(3,3))

In [4]: nd
Out[4]:
array([[0, 9, 0],
[7, 5, 8],
[7, 7, 1]])
1
2
3
4
5
6
7
In [5]: nd1 = np.random.randint(0,10,size=(3,3))

In [6]: nd1
Out[6]:
array([[1, 3, 0],
[7, 0, 2],
[0, 7, 1]])
1
2
3
4
5
6
7
8
9
#0是第一维的方向:行
In [7]: np.concatenate((nd,nd1),axis=0)
Out[7]:
array([[0, 9, 0],
[7, 5, 8],
[7, 7, 1],
[1, 3, 0],
[7, 0, 2],
[0, 7, 1]])
1
2
3
4
5
In [8]: np.concatenate((nd,nd1),axis=1)
Out[8]:
array([[0, 9, 0, 1, 3, 0],
[7, 5, 8, 7, 0, 2],
[7, 7, 1, 0, 7, 1]])
  • 定义生成一个DataFrame的函数:
1
2
3
4
In [9]: def make_df(cols,inds):
...: data = {c:[c+str(i) for i in inds] for c in cols}
...: print(data)
...: return pd.DataFrame(data,index = inds,columns = cols)
1
2
3
4
5
6
In [10]: make_df(['A','B'],[1,2])
{'A': ['A1', 'A2'], 'B': ['B1', 'B2']}
Out[10]:
A B
1 A1 B1
2 A2 B2

1. 使用pd.concat()级联

  • Pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:
1
2
3
Signature: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Docstring:
Concatenate pandas objects along a particular axis with optional set logic

1.1 简单级联

  • 和np.concatenate一样,优先增加行数(默认axis=0)
1
2
3
4
5
6
7
8
9
10
11
12
13
In [11]: df1 = make_df(list('AB'),[0,1])
{'A': ['A0', 'A1'], 'B': ['B0', 'B1']}

In [12]: df2 = make_df(list('AB'),[2,3])
{'A': ['A2', 'A3'], 'B': ['B2', 'B3']}

In [13]: pd.concat([df1,df2])
Out[13]:
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
1
2
3
4
5
6
7
In [14]: pd.concat((df1,df2))
Out[14]:
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
1
2
3
4
5
6
7
8
#可以通过设置axis来改变级联方向
In [15]: pd.concat((df1,df2),axis=1)
Out[15]:
A B A B
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
2 NaN NaN A2 B2
3 NaN NaN A3 B3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# 注意,index在级联时是可以重复的
In [16]: df3 = make_df(list('AB'),[0,1,2])
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}

In [17]: df4 = make_df(list('AB'),[1,2,3])
{'A': ['A1', 'A2', 'A3'], 'B': ['B1', 'B2', 'B3']}

In [18]: df5 = pd.concat((df3,df4))

In [19]: df5
Out[19]:
A B
0 A0 B0
1 A1 B1
2 A2 B2
1 A1 B1
2 A2 B2
3 A3 B3
1
2
3
4
5
6
7
In [20]: df5.loc[[1,2]]
Out[20]:
A B
1 A1 B1
1 A1 B1
2 A2 B2
2 A2 B2
1
2
3
4
5
6
7
8
9
10
#可以选择忽略ignore_index,重新索引
In [21]: pd.concat((df3,df4),ignore_index=True)
Out[21]:
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A1 B1
4 A2 B2
5 A3 B3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
#或者使用多层索引keys
In [22]: x = make_df(list('XY'),['a','b'])
{'X': ['Xa', 'Xb'], 'Y': ['Ya', 'Yb']}

In [23]: y = make_df(list('XY'),['A','B'])
{'X': ['XA', 'XB'], 'Y': ['YA', 'YB']}

In [24]: pd.concat((x,y))
Out[24]:
X Y
a Xa Ya
b Xb Yb
A XA YA
B XB YB
1
2
3
4
5
6
7
8
#keys可以增加索引,使得合并后的数据更清晰
In [25]: pd.concat((x,y),keys=['x','y'])
Out[25]:
X Y
x a Xa Ya
b Xb Yb
y A XA YA
B XB YB

1.2 不匹配级联

  • 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时航索引不一致
  • 有3种连接方式:
  • 外连接:补NaN(默认模式)
1
2
3
4
5
In [26]: df1 = make_df(['A','B'],[1,3])
{'A': ['A1', 'A3'], 'B': ['B1', 'B3']}

In [27]: df2 = make_df(['B','C'],[2,4])
{'B': ['B2', 'B4'], 'C': ['C2', 'C4']}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
In [28]: pd.concat((df1,df2),join='outer')
F:\soft\Anaconda3\Scripts\ipython:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

Out[28]:
A B C
1 A1 B1 NaN
3 A3 B3 NaN
2 NaN B2 C2
4 NaN B4 C4
  • 内连接:只连接匹配的项
1
2
3
4
5
6
7
In [29]: pd.concat((df1,df2),join='inner')
Out[29]:
B
1 B1
3 B3
2 B2
4 B4
  • 连接指定轴join_axes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
In [30]: df3 = make_df(list('ACD'),[0,1,2])
{'A': ['A0', 'A1', 'A2'], 'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}

In [31]: df4 = make_df(list('CDF'),[3,4,5])
{'C': ['C3', 'C4', 'C5'], 'D': ['D3', 'D4', 'D5'], 'F': ['F3', 'F4', 'F5']}

In [32]: pd.concat((df3,df4))
F:\soft\Anaconda3\Scripts\ipython:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

Out[32]:
A C D F
0 A0 C0 D0 NaN
1 A1 C1 D1 NaN
2 A2 C2 D2 NaN
3 NaN C3 D3 F3
4 NaN C4 D4 F4
5 NaN C5 D5 F5
1
2
3
4
5
6
7
8
9
10
#join_axes 以某一个DataFrame列索引为新的索引值
In [33]: pd.concat((df3,df4),join_axes=[df4.columns])
Out[33]:
C D F
0 C0 D0 NaN
1 C1 D1 NaN
2 C2 D2 NaN
3 C3 D3 F3
4 C4 D4 F4
5 C5 D5 F5

1.3 使用append()函数级联

  • 由于级联的使用非常普遍,因此有一个函数append专门用于在后面添加
1
2
3
4
5
In [34]: df1 = make_df(['大众','雷克萨斯'],[0,1,2,3,4])
{'大众': ['大众0', '大众1', '大众2', '大众3', '大众4'], '雷克萨斯': ['雷克萨斯0', '雷克萨斯1', '雷克萨斯2', '雷克萨斯3', '雷克萨斯4']}

In [35]: df2 = make_df(['大众','雷克萨斯'],[5,6,7,8,9])
{'大众': ['大众5', '大众6', '大众7', '大众8', '大众9'], '雷克萨斯': ['雷克萨斯5', '雷克萨斯6', '雷克萨斯7', '雷克萨斯8', '雷克萨斯9']}
1
2
3
4
5
6
7
8
9
10
11
12
13
In [36]: df1.append(df2)
Out[36]:
大众 雷克萨斯
0 大众0 雷克萨斯0
1 大众1 雷克萨斯1
2 大众2 雷克萨斯2
3 大众3 雷克萨斯3
4 大众4 雷克萨斯4
5 大众5 雷克萨斯5
6 大众6 雷克萨斯6
7 大众7 雷克萨斯7
8 大众8 雷克萨斯8
9 大众9 雷克萨斯9

2. 使用pd.merge()合并

  • merge与concat的区别在于,merge需要依据某一共同的行或列来进行合并
  • 使用pd.merge()合并时,除自动根除两者相同columns名称的那一列,作为key来进行合并
  • 注意每一列元素的吮吸不要求一致

2.1 一对一合并

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
In [37]: df1 = pd.DataFrame({'employee':['Po','Sara','Danis'],
...: 'group':['sail','counting','marcketing']})

In [38]: df2 = pd.DataFrame({'employee':['Po','Sara','Danis'],
...: 'work_time':[2,3,1]})

In [39]: display(df1,df2)
employee group
0 Po sail
1 Sara counting
2 Danis marcketing

employee work_time
0 Po 2
1 Sara 3
2 Danis 1
1
2
3
4
5
6
In [40]: pd.merge(df1,df2)
Out[40]:
employee group work_time
0 Po sail 2
1 Sara counting 3
2 Danis marcketing 1
1
2
3
4
5
6
7
In [41]: df3 = pd.DataFrame({'employee':['Po','Sara','Bush'],'work_time':[2,3,1]})

In [42]: pd.merge(df1,df3)
Out[42]:
employee group work_time
0 Po sail 2
1 Sara counting 3
1
2
3
4
5
6
In [43]: pd.concat((df1,df2),axis=1)
Out[43]:
employee group employee work_time
0 Po sail Po 2
1 Sara counting Sara 3
2 Danis marcketing Danis 1

2.2 多对一合并

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
In [44]: df1 = pd.DataFrame({'employee':['Po','Sara','Danis'],
...: 'group':['sail','counting','marcketing']})

In [45]: df2 = pd.DataFrame({'employee':['Po','Po','Danis'],
...: 'work_time':[2,3,1]})

In [46]: display(df1,df2)
employee group
0 Po sail
1 Sara counting
2 Danis marcketing

employee work_time
0 Po 2
1 Po 3
2 Danis 1
1
2
3
4
5
6
In [47]: pd.merge(df1,df2)
Out[47]:
employee group work_time
0 Po sail 2
1 Po sail 3
2 Danis marcketing 1

2.3 多对多合并

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
In [48]: df1 = pd.DataFrame({'employee':['Po','Po','Danis'],
...: 'group':['sail','counting','marcketing']})

In [49]: df2 = pd.DataFrame({'employee':['Po','Po','Danis'],
...: 'work_time':[2,3,1]})

In [50]: display(df1,df2)
employee group
0 Po sail
1 Po counting
2 Danis marcketing

employee work_time
0 Po 2
1 Po 3
2 Danis 1
1
2
3
4
5
6
7
8
9
#在进行多对多合并时,每一个数据都没有放过
In [51]: pd.merge(df1,df2)
Out[51]:
employee group work_time
0 Po sail 2
1 Po sail 3
2 Po counting 2
3 Po counting 3
4 Danis marcketing 1

2.4 key的规范化

  • 使用on=显示指定哪一列位key,当有多个key相同时使用
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
In [52]: df3 = pd.DataFrame({'employee':['Po','Summer','Flower'],
...: 'group':['sail','marketing','serch'],
...: 'salary':[12000,10000,8000]})

In [53]: df4 = pd.DataFrame({'employee':['Po','Winter','Flower'],
...: 'group':['marketing','marketing','serch'],
...: 'work_time':[2,1,5]})

In [54]: display(df3,df4)
employee group salary
0 Po sail 12000
1 Summer marketing 10000
2 Flower serch 8000

employee group work_time
0 Po marketing 2
1 Winter marketing 1
2 Flower serch 5
1
2
3
4
In [55]: pd.merge(df3,df4)
Out[55]:
employee group salary work_time
0 Flower serch 8000 5
1
2
3
4
5
In [56]: pd.merge(df3,df4,on='employee')
Out[56]:
employee group_x salary group_y work_time
0 Po sail 12000 marketing 2
1 Flower serch 8000 serch 5
1
2
3
4
5
6
In [57]: pd.merge(df3,df4,on='group',suffixes=['_A','_B'])
Out[57]:
employee_A group salary employee_B work_time
0 Summer marketing 10000 Po 2
1 Summer marketing 10000 Winter 1
2 Flower serch 8000 Flower 5
  • 使用left_on和right_on指定左右两边的列作为key,当左右两个的key都不相等时使用
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
In [58]: df5 = pd.DataFrame({'employer':['Po','Summer','Flower'],
...: 'Team':['sail','marketing','serch'],
...: 'salary':[12000,10000,8000]})

In [59]: df6 = pd.DataFrame({'employee':['Po','Winter','Flower'],
...: 'group':['marketing','marketing','serch'],
...: 'work_time':[2,1,5]})

In [60]: display(df5,df6)
employer Team salary
0 Po sail 12000
1 Summer marketing 10000
2 Flower serch 8000

employee group work_time
0 Po marketing 2
1 Winter marketing 1
2 Flower serch 5
1
2
3
4
5
In [61]: pd.merge(df5,df6,left_on='employer',right_on='employee')
Out[61]:
employer Team salary employee group work_time
0 Po sail 12000 Po marketing 2
1 Flower serch 8000 Flower serch 5
1
2
3
4
5
6
In [62]: pd.merge(df5,df6,left_on='Team',right_on='group')
Out[62]:
employer Team salary employee group work_time
0 Summer marketing 10000 Po marketing 2
1 Summer marketing 10000 Winter marketing 1
2 Flower serch 8000 Flower serch 5

2.5 内合并与外合并

  • 内合并:只保留两者都有的key(默认模式)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
In [63]: df1 = pd.DataFrame({'age':[18,19,22,25],'height':[170,165,180,170]})

In [64]: df2 = pd.DataFrame({'age':[18,20,22,30],'weight':[55,50,60,65]})

In [65]: display(df1,df2)
age height
0 18 170
1 19 165
2 22 180
3 25 170

age weight
0 18 55
1 20 50
2 22 60
3 30 65
1
2
3
4
5
In [66]: pd.merge(df1,df2)
Out[66]:
age height weight
0 18 170 55
1 22 180 60
  • 外合并how=’outer’:补NaN
1
2
3
4
5
6
7
8
9
In [67]: pd.merge(df1,df2,how='outer')
Out[67]:
age height weight
0 18 170.0 55.0
1 19 165.0 NaN
2 22 180.0 60.0
3 25 170.0 NaN
4 20 NaN 50.0
5 30 NaN 65.0
  • 左合并、右合并:how=’left’,how=’right’
1
2
3
4
5
6
7
In [68]: pd.merge(df1,df2,how='left')
Out[68]:
age height weight
0 18 170 55.0
1 19 165 NaN
2 22 180 60.0
3 25 170 NaN
1
2
3
4
5
6
7
In [69]: pd.merge(df1,df2,how='right')
Out[69]:
age height weight
0 18 170.0 55
1 22 180.0 60
2 20 NaN 50
3 30 NaN 65