Pandas Practice Exercises

[TOC]

Getting to Know the Data

Exploring the Chipotle Fast-Food Data

  • Load the dataset into a DataFrame named chipo

import pandas as pd
import numpy as np

chipo = pd.read_csv('../pydata/chipotle.tsv', sep='\t')
  • View the first 10 rows

chipo.head(10)
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans… $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou… $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables… $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch… $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto… $9.25
  • How many columns are in the dataset?

chipo.shape[1]

5
  • Print all the column names

chipo.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
  • What is the most-ordered item?

c = chipo[['item_name','quantity']].groupby(by='item_name', as_index=False).agg({'quantity':'sum'})
c.sort_values(by='quantity', ascending=False).head()
item_name quantity
17 Chicken Bowl 761
18 Chicken Burrito 591
25 Chips and Guacamole 506
39 Steak Burrito 386
10 Canned Soft Drink 351
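When only the top item is needed, `groupby().sum()` plus `idxmax` gives it in one step. A minimal sketch on a toy frame (the real Chipotle file is assumed absent here):

```python
import pandas as pd

# Toy stand-in for the Chipotle data (item_name / quantity columns only)
chipo = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Chicken Burrito', 'Chicken Bowl', 'Izze'],
    'quantity': [2, 1, 3, 1],
})

# Sum quantities per item, then take the index label of the largest sum
totals = chipo.groupby('item_name')['quantity'].sum()
print(totals.idxmax())  # → Chicken Bowl
```

On the real dataset this returns 'Chicken Bowl', matching the sorted table above.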
  • How many different items were ordered (unique values of item_name)?

# chipo['item_name'].drop_duplicates().count()
# nunique() returns the number of unique values
chipo['item_name'].nunique()

50
  • What is the most-ordered item in choice_description?

c = chipo[['choice_description','quantity']].groupby(by='choice_description', as_index=False).agg({'quantity':'sum'})
c.sort_values(by='quantity', ascending=False).head()
choice_description quantity
15 [Diet Coke] 159
14 [Coke] 143
583 [Sprite] 89
256 [Fresh Tomato Salsa, [Rice, Black Beans, Chees… 49
257 [Fresh Tomato Salsa, [Rice, Black Beans, Chees… 42
  • How many items were ordered in total?

chipo['quantity'].sum()

4972
  • Convert item_price to a float

trans = lambda x: float(x[1:-1])  # strip the leading '$' and the trailing space
chipo['item_price'] = chipo['item_price'].apply(trans)
chipo.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39
1 1 1 Izze [Clementine] 3.39
2 1 1 Nantucket Nectar [Apple] 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans… 16.98
  • What was the revenue over the period covered by the dataset?

chipo['sum_total'] = chipo['item_price'] * chipo['quantity']
chipo['sum_total'].sum()

39237.02
  • How many orders were placed over the period?

chipo['order_id'].nunique()

1834
  • What is the average total price per order?

chipo[['order_id','sum_total']].groupby(by='order_id', as_index=False).agg({'sum_total':'sum'})['sum_total'].mean()
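The per-order average is just total revenue divided by the number of orders (39237.02 / 1834 ≈ 21.39 from the answers above). A sketch of the groupby route on a toy frame, assuming the sum_total column has already been created:

```python
import pandas as pd

# Toy order data standing in for the Chipotle frame
chipo = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'sum_total': [2.39, 3.39, 16.98, 10.98],
})

# Total per order, then the mean across orders
order_totals = chipo.groupby('order_id')['sum_total'].sum()
print(order_totals.mean())
```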
  • How many different items were sold?

chipo['item_name'].nunique()

50

Filtering and Sorting

Exploring the Euro 2012 Data

  • Name the dataset euro12

import pandas as pd
import numpy as np

euro12 = pd.read_csv('../pydata/Euro2012.csv')
euro12.head()
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 6 54.6% 36 51 5 6 0 11 11 19
  • Select only the Goals column

euro12['Goals'].head()

0 4
1 4
2 4
3 5
4 3
Name: Goals, dtype: int64
  • How many teams participated in Euro 2012?

    euro12.shape[0]

    16
  • How many columns are in the dataset?

    euro12.shape[1]

    35
  • Store the columns Team, Yellow Cards and Red Cards in a DataFrame named discipline

    discipline = euro12[['Team','Yellow Cards','Red Cards']]
    discipline
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0
  • Sort discipline by Red Cards first, then Yellow Cards

    discipline.sort_values(['Red Cards','Yellow Cards'], ascending=False)
Team Yellow Cards Red Cards
6 Greece 9 1
9 Poland 7 1
11 Republic of Ireland 6 1
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
0 Croatia 9 0
1 Czech Republic 7 0
14 Sweden 7 0
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0
  • Compute the mean number of yellow cards per team

    discipline['Yellow Cards'].mean()

    7.4375
  • Find the teams that scored more than 6 goals

    euro12[euro12['Goals'] > 6]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.80% 15.60% 80 2 1 0 10 62.60% 63 49 12 4 0 15 15 17
13 Spain 12 42 33 55.90% 16.00% 100 0 1 0 15 93.80% 102 83 19 11 0 17 17 18
  • Select the teams whose name starts with the letter G

    euro12[euro12.Team.str.startswith('G')]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.80% 15.60% 80 2 1 0 10 62.60% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.70% 19.20% 32 1 1 1 13 65.10% 67 48 12 9 1 12 12 20
  • Select the first 7 columns

    # transposing twice (euro12.T[:7].T) also works, but iloc is the idiomatic way
    euro12.iloc[:, :7]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.90% 16.00% 32
1 Czech Republic 4 13 18 41.90% 12.90% 39
2 Denmark 4 10 10 50.00% 20.00% 27
3 England 5 11 18 50.00% 17.20% 40
4 France 3 22 24 37.90% 6.50% 65
5 Germany 10 32 32 47.80% 15.60% 80
6 Greece 5 8 18 30.70% 19.20% 32
7 Italy 6 34 45 43.00% 7.50% 110
8 Netherlands 2 12 36 25.00% 4.10% 60
9 Poland 2 15 23 39.40% 5.20% 48
10 Portugal 6 22 42 34.30% 9.30% 82
11 Republic of Ireland 1 7 12 36.80% 5.20% 28
12 Russia 5 9 31 22.50% 12.50% 59
13 Spain 12 42 33 55.90% 16.00% 100
14 Sweden 5 17 19 47.20% 13.80% 39
15 Ukraine 2 7 26 21.20% 6.00% 38
  • Select all columns except the last 3

    euro12.iloc[:, :-3]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.90% 16.00% 32 0 0 0 0 10 3 13 81.30% 41 62 2 9 0
1 Czech Republic 4 13 18 41.90% 12.90% 39 0 0 0 1 10 6 9 60.10% 53 73 8 7 0
2 Denmark 4 10 10 50.00% 20.00% 27 1 0 0 1 10 5 10 66.70% 25 38 8 4 0
3 England 5 11 18 50.00% 17.20% 40 0 0 0 2 29 3 22 88.10% 43 45 6 5 0
4 France 3 22 24 37.90% 6.50% 65 1 0 0 1 7 5 6 54.60% 36 51 5 6 0
5 Germany 10 32 32 47.80% 15.60% 80 2 1 0 1 11 6 10 62.60% 63 49 12 4 0
6 Greece 5 8 18 30.70% 19.20% 32 1 1 1 1 23 7 13 65.10% 67 48 12 9 1
7 Italy 6 34 45 43.00% 7.50% 110 2 0 0 2 18 7 20 74.10% 101 89 16 16 0
8 Netherlands 2 12 36 25.00% 4.10% 60 2 0 0 0 9 5 12 70.60% 35 30 3 5 0
9 Poland 2 15 23 39.40% 5.20% 48 0 0 0 0 8 3 6 66.70% 48 56 3 7 1
10 Portugal 6 22 42 34.30% 9.30% 82 6 0 0 2 11 4 10 71.50% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.80% 5.20% 28 0 0 0 0 23 9 17 65.40% 43 51 11 6 1
12 Russia 5 9 31 22.50% 12.50% 59 2 0 0 0 8 3 10 77.00% 34 43 4 6 0
13 Spain 12 42 33 55.90% 16.00% 100 0 1 0 5 8 1 15 93.80% 102 83 19 11 0
14 Sweden 5 17 19 47.20% 13.80% 39 3 0 0 1 12 5 8 61.60% 35 51 7 7 0
15 Ukraine 2 7 26 21.20% 6.00% 38 0 0 0 0 4 4 13 76.50% 48
  • Find the Shooting Accuracy of England, Italy and Russia

    euro12.loc[euro12.Team.isin(['England','Italy','Russia']), ['Team','Shooting Accuracy']]
Team Shooting Accuracy
3 England 50.00%
7 Italy 43.00%
12 Russia 22.50%

Grouping

Exploring the Alcohol Consumption Data

  • Name the DataFrame drinks

import pandas as pd

drinks = pd.read_csv('../pydata/drinks.csv')
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
  • Which continent drinks more beer on average?

    drinks.groupby('continent').beer_servings.mean().sort_values(ascending=False)

    continent
    EU 193.777778
    SA 175.083333
    OC 89.687500
    AF 61.471698
    AS 37.045455
    Name: beer_servings, dtype: float64
  • Print the descriptive statistics of wine_servings for each continent

    drinks.groupby('continent').wine_servings.describe()
count mean std min 25% 50% 75% max
continent
AF 53 16.26415 38.84642 0 1 2 13 233
AS 44 9.068182 21.66703 0 0 1 8 123
EU 45 142.2222 97.42174 0 59 128 195 370
OC 16 35.625 64.55579 0 1 8.5 23.25 212
SA 12 62.41667 88.62019 1 3 12 98.5 221
  • Print the mean alcohol consumption per drink type for each continent

    drinks.groupby('continent').mean(numeric_only=True)  # numeric_only is required in pandas >= 2.0
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.4717 16.33962 16.26415 3.007547
AS 37.04546 60.84091 9.068182 2.170455
EU 193.7778 132.5556 142.2222 8.617778
OC 89.6875 58.4375 35.625 3.38125
SA 175.0833 114.75 62.41667 6.308333
  • Print the median alcohol consumption per drink type for each continent

    drinks.groupby('continent').median(numeric_only=True)  # numeric_only is required in pandas >= 2.0
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85
  • Print the mean, max and min spirit_servings for each continent

    drinks.groupby('continent').agg({'spirit_servings':['mean','max','min']})
spirit_servings
mean max min
continent
AF 16.33962 152 0
AS 60.84091 326 0
EU 132.5556 373 0
OC 58.4375 254 0
SA 114.75 302 25
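The dict-of-lists form of agg produces the MultiIndex column header seen above. Named aggregation (pandas 0.25+) gives flat column names instead; a sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for the drinks data
drinks = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF'],
    'spirit_servings': [132, 373, 0, 152],
})

# Named aggregation: output_column=(input_column, function)
stats = drinks.groupby('continent').agg(
    mean=('spirit_servings', 'mean'),
    max=('spirit_servings', 'max'),
    min=('spirit_servings', 'min'),
)
print(stats)
```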

The Apply Function

Exploring US Crime Data, 1960-2014

  • Name the DataFrame crime

import pandas as pd
import numpy as np

crime = pd.read_csv('../pydata/US_Crime_Rates_1960_2014.csv')
crime.head()
Unnamed: 0 Year Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
0 0 1960 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1 1 1961 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
2 2 1962 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
3 3 1963 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
4 4 1964 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
  • What is the data type of each column?

    crime.dtypes

    Unnamed: 0 int64
    Year int64
    Population int64
    Total int64
    Violent int64
    Property int64
    Murder int64
    Forcible_Rape int64
    Robbery int64
    Aggravated_assault int64
    Burglary int64
    Larceny_Theft int64
    Vehicle_Theft int64
    dtype: object
  • Convert the type of Year to datetime64

    crime.Year = pd.to_datetime(crime.Year, format='%Y')
    crime.dtypes

    Unnamed: 0 int64
    Year datetime64[ns]
    Population int64
    Total int64
    Violent int64
    Property int64
    Murder int64
    Forcible_Rape int64
    Robbery int64
    Aggravated_assault int64
    Burglary int64
    Larceny_Theft int64
    Vehicle_Theft int64
    dtype: object
  • Set the column Year as the index of the DataFrame

    crime = crime.set_index('Year', drop=True)
    crime.head()
Unnamed: 0 Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960/1/1 0 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961/1/1 1 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962/1/1 2 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963/1/1 3 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964/1/1 4 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
  • Delete the column Total

    del crime['Total']
    crime.head()
Unnamed: 0 Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960/1/1 0 179323175 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961/1/1 1 182992000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962/1/1 2 185771000 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963/1/1 3 188483000 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964/1/1 4 191141000 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
  • Group the DataFrame by decade and sum

    crimes = crime.resample('10AS').sum()  # resample the time series by decade (use '10YS' in pandas >= 2.2)
    # use resample to get the maximum of the Population column
    population = crime.Population.resample('10AS').max()
    # Population is a stock, not a flow, so summing it makes no sense; keep the decade maximum instead
    crimes['Population'] = population
    crimes
Unnamed: 0 Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960/1/1 45 201385000 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100
1970/1/1 145 220099000 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900
1980/1/1 245 248239000 14074328 1.17E+08 206439 865639 5383109 7619130 33073494 72040253 11935411
1990/1/1 345 272690813 17527048 1.19E+08 211664 998827 5748930 10568963 26750015 77679366 14624418
2000/1/1 445 307006550 13968056 1.01E+08 163068 922499 4230366 8652124 21565176 67970291 11412834
2010/1/1 260 318857056 6072017 44095950 72867 421059 1749809 3764142 10125170 30401698 3569080
2020/1/1 0 NaN 0 0 0 0 0 0 0 0 0
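The '10AS' alias has been renamed across pandas versions, so a version-agnostic alternative is to floor each year to its decade and group on that (a sketch on synthetic yearly data, not the real crime file):

```python
import pandas as pd

# Synthetic yearly counts standing in for the crime data
idx = pd.to_datetime([f'{y}-01-01' for y in range(1960, 1966)])
crime = pd.DataFrame({'Violent': [1, 2, 3, 4, 5, 6]}, index=idx)

# Floor each year to its decade (1963 -> 1960), then sum within each bucket
decade = crime.index.year // 10 * 10
by_decade = crime.groupby(decade).sum()
print(by_decade)
```

This also avoids the empty all-zeros 2020 bin that resample produces above.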
  • What was the most dangerous time to live in the US?

    # idxmax returns, for each column, the year in which that crime category peaked
    crime.idxmax(0)

    Unnamed: 0 2014-01-01
    Population 2014-01-01
    Violent 1992-01-01
    Property 1991-01-01
    Murder 1991-01-01
    Forcible_Rape 1992-01-01
    Robbery 1991-01-01
    Aggravated_assault 1993-01-01
    Burglary 1980-01-01
    Larceny_Theft 1991-01-01
    Vehicle_Theft 1991-01-01
    dtype: datetime64[ns]
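idxmax scans each column independently and returns the index label where that column reaches its maximum, which is why the output above is one peak year per crime category. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy yearly data: each column peaks in a different year
crime = pd.DataFrame(
    {'Violent': [100, 300, 200], 'Murder': [9, 7, 8]},
    index=pd.to_datetime(['1990-01-01', '1991-01-01', '1992-01-01']),
)

# One index label (the peak year) per column
peaks = crime.idxmax()
print(peaks)
```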

Merging

Exploring Fictitious Names Data

  • Create DataFrames from the following raw data
import numpy as np
import pandas as pd

raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
  • Name the DataFrames data1, data2 and data3 respectively

    data1 = pd.DataFrame(raw_data_1)
    data2 = pd.DataFrame(raw_data_2)
    data3 = pd.DataFrame(raw_data_3)
  • Concatenate data1 and data2 along the row axis and name the result all_data

    all_data = pd.concat([data1, data2])
    all_data
    all_data
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
  • Concatenate data1 and data2 along the column axis and name the result all_data_col

    all_data_col = pd.concat([data1, data2], axis=1)
    all_data_col
    all_data_col
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
  • Print data3

data3
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
  • Merge all_data and data3 on subject_id

    pd.merge(all_data, data3, on='subject_id', how='outer')
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 6 Bran Balwner NaN
8 7 Bryce Brice 14
9 8 Betty Btisan 15
10 9 NaN NaN 1
11 10 NaN NaN 61
12 11 NaN NaN 16
  • Join data1 and data2 on subject_id (inner join)

    pd.merge(data1, data2, on='subject_id')
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
  • Find all rows from merging data1 and data2 (outer join)

    pd.merge(data1, data2, on='subject_id', how='outer')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan
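When inspecting an outer join like the one above, `indicator=True` adds a `_merge` column that records where each row came from. A sketch with small toy frames:

```python
import pandas as pd

left = pd.DataFrame({'subject_id': ['1', '4'], 'first_name': ['Alex', 'Alice']})
right = pd.DataFrame({'subject_id': ['4', '6'], 'first_name': ['Billy', 'Bran']})

# indicator=True tags each row as 'left_only', 'right_only' or 'both'
merged = pd.merge(left, right, on='subject_id', how='outer', indicator=True)
print(merged[['subject_id', '_merge']])
```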

Statistics

Exploring the Wind Speed Data

  • Load the data and combine the first three columns into an appropriate date column

import numpy as np
import pandas as pd
import datetime

# combine columns 0-2 (year, month, day) into a single date column
data = pd.read_csv('../pydata/wind.csv', sep=r'\s+', parse_dates=[[0, 1, 2]])
data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061/1/1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.5 15.04
1 2061/1/2 14.71 NaN 10.83 6.5 12.62 7.67 11.5 10.04 9.79 9.67 17.54 13.83
2 2061/1/3 18.5 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.5 7.67 12.75 12.71
3 2061/1/4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061/1/5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
  • The year 2061? Do we really have data from that year? Write a function to fix this bug

    def fix_century(x):
        if x.year > 1989:
            year = x.year - 100
        else:
            year = x.year
        return datetime.date(year, x.month, x.day)

    data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
    data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 1961/1/1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.5 15.04
1 1961/1/2 14.71 NaN 10.83 6.5 12.62 7.67 11.5 10.04 9.79 9.67 17.54 13.83
2 1961/1/3 18.5 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.5 7.67 12.75 12.71
3 1961/1/4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 1961/1/5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
  • Set the date as the index; note the data type, which should be datetime64[ns]

    data['Yr_Mo_Dy'] = pd.to_datetime(data.Yr_Mo_Dy)
    data = data.set_index('Yr_Mo_Dy')
    data.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961/1/1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.5 15.04
1961/1/2 14.71 NaN 10.83 6.5 12.62 7.67 11.5 10.04 9.79 9.67 17.54 13.83
1961/1/3 18.5 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.5 7.67 12.75 12.71
1961/1/4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961/1/5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
  • For each location, how many values are missing?

    data.isnull().sum()

    RPT 6
    VAL 3
    ROS 2
    KIL 5
    SHA 2
    BIR 0
    DUB 3
    CLA 2
    MUL 3
    CLO 1
    BEL 0
    MAL 4
    dtype: int64
  • For each location, how many complete (non-missing) values are there?

    data.notnull().sum()

    RPT 6568
    VAL 6571
    ROS 6572
    KIL 6569
    SHA 6572
    BIR 6574
    DUB 6571
    CLA 6572
    MUL 6571
    CLO 6573
    BEL 6574
    MAL 6570
    dtype: int64
  • Compute the mean wind speed over the entire dataset

    data.mean().mean()

    10.227982360836924
  • Create a DataFrame named loc_stats that stores the min, max, mean and standard deviation of the wind speed at each location

    loc_stats = pd.DataFrame()
    loc_stats['max'] = data.max()
    loc_stats['min'] = data.min()
    loc_stats['mean'] = data.mean()
    loc_stats['std'] = data.std()
    loc_stats
max min mean std
RPT 35.8 0.67 12.362987 5.618413
VAL 33.37 0.21 10.644314 5.267356
ROS 33.84 1.5 11.660526 5.00845
KIL 28.46 0 6.306468 3.605811
SHA 37.54 0.13 10.455834 4.936125
BIR 26.16 0 7.092254 3.968683
DUB 30.37 0 9.797343 4.977555
CLA 31.08 0 8.495053 4.499449
MUL 25.88 0 8.49359 4.166872
CLO 28.21 0.04 8.707332 4.503954
BEL 42.38 0.13 13.121007 5.835037
MAL 42.54 0.67 15.599079 6.699794
  • Create a DataFrame named day_stats that stores the min, max, mean and standard deviation of the wind speed across all locations for each day

    day_stats = pd.DataFrame()
    day_stats['min'] = data.min(axis=1)
    day_stats['max'] = data.max(axis=1)
    day_stats['mean'] = data.mean(axis=1)
    day_stats['std'] = data.std(axis=1)
    day_stats.head()
min max mean std
Yr_Mo_Dy
1961/1/1 9.29 18.5 13.018182 2.808875
1961/1/2 6.5 17.54 11.336364 3.188994
1961/1/3 6.17 18.5 11.641818 3.681912
1961/1/4 1.79 11.75 6.619167 3.198126
1961/1/5 6.17 13.33 10.63 2.445356
  • For each location, compute the mean wind speed in January

    data['date'] = data.index
    data['year'] = data.date.apply(lambda x: x.year)
    data['month'] = data.date.apply(lambda x: x.month)
    data['day'] = data.date.apply(lambda x: x.day)
    january_winds = data.query('month == 1')
    january_winds.loc[:, 'RPT':'MAL'].mean()

    RPT 14.847325
    VAL 12.914560
    ROS 13.299624
    KIL 7.199498
    SHA 11.667734
    BIR 8.054839
    DUB 11.819355
    CLA 9.512047
    MUL 9.543208
    CLO 10.053566
    BEL 14.550520
    MAL 18.028763
    dtype: float64
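The helper year/month/day columns above are not strictly needed: a DatetimeIndex exposes a .month attribute that can filter rows directly. A sketch on synthetic data (the wind file is assumed absent):

```python
import pandas as pd

# Synthetic daily wind speeds for two stations, spanning January and February
idx = pd.date_range('1961-01-30', periods=4, freq='D')
data = pd.DataFrame({'RPT': [10.0, 12.0, 8.0, 6.0],
                     'VAL': [5.0, 7.0, 3.0, 1.0]}, index=idx)

# Boolean mask straight off the DatetimeIndex: keep January rows only
january_mean = data[data.index.month == 1].mean()
print(january_mean)
```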
  • Downsample the records to a yearly frequency

    data_year = data.query('month == 1 and day == 1')
    data_year
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date year month day
Yr_Mo_Dy
1961/1/1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.5 15.04 1961/1/1 1961 1 1
1962/1/1 9.29 3.42 11.54 3.5 2.21 1.96 10.41 2.79 3.54 5.17 4.38 7.92 1962/1/1 1962 1 1
1963/1/1 15.59 13.62 19.79 8.38 12.25 10 23.45 15.71 13.59 14.37 17.58 34.13 1963/1/1 1963 1 1
1964/1/1 25.8 22.13 18.21 13.25 21.29 14.79 14.12 19.58 13.25 16.75 28.96 21 1964/1/1 1964 1 1
1965/1/1 9.54 11.92 9 4.38 6.08 5.21 10.25 6.08 5.71 8.63 12.04 17.41 1965/1/1 1965 1 1
1966/1/1 22.04 21.5 17.08 12.75 22.17 15.59 21.79 18.12 16.66 17.83 28.33 23.79 1966/1/1 1966 1 1
1967/1/1 6.46 4.46 6.5 3.21 6.67 3.79 11.38 3.83 7.71 9.08 10.67 20.91 1967/1/1 1967 1 1
1968/1/1 30.04 17.88 16.25 16.25 21.79 12.54 18.16 16.62 18.75 17.62 22.25 27.29 1968/1/1 1968 1 1
1969/1/1 6.13 1.63 5.41 1.08 2.54 1 8.5 2.42 4.58 6.34 9.17 16.71 1969/1/1 1969 1 1
1970/1/1 9.59 2.96 11.79 3.42 6.13 4.08 9 4.46 7.29 3.5 7.33 13 1970/1/1 1970 1 1
1971/1/1 3.71 0.79 4.71 0.17 1.42 1.04 4.63 0.75 1.54 1.08 4.21 9.54 1971/1/1 1971 1 1
1972/1/1 9.29 3.63 14.54 4.25 6.75 4.42 13 5.33 10.04 8.54 8.71 19.17 1972/1/1 1972 1 1
1973/1/1 16.5 15.92 14.62 7.41 8.29 11.21 13.54 7.79 10.46 10.79 13.37 9.71 1973/1/1 1973 1 1
1974/1/1 23.21 16.54 16.08 9.75 15.83 11.46 9.54 13.54 13.83 16.66 17.21 25.29 1974/1/1 1974 1 1
1975/1/1 14.04 13.54 11.29 5.46 12.58 5.58 8.12 8.96 9.29 5.17 7.71 11.63 1975/1/1 1975 1 1
1976/1/1 18.34 17.67 14.83 8 16.62 10.13 13.17 9.04 13.13 5.75 11.38 14.96 1976/1/1 1976 1 1
1977/1/1 20.04 11.92 20.25 9.13 9.29 8.04 10.75 5.88 9 9 14.88 25.7 1977/1/1 1977 1 1
1978/1/1 8.33 7.12 7.71 3.54 8.5 7.5 14.71 10 11.83 10 15.09 20.46 1978/1/1 1978 1 1
  • Downsample the records to a monthly frequency

    data.query('day == 1')
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date year month day
Yr_Mo_Dy
1961/1/1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.5 15.04 1961/1/1 1961 1 1
1961/2/1 14.25 15.12 9.04 5.88 12.08 7.17 10.17 3.63 6.5 5.5 9.17 8 1961/2/1 1961 2 1
1961/3/1 12.67 13.13 11.79 6.42 9.79 8.54 10.25 13.29 NaN 12.21 20.62 NaN 1961/3/1 1961 3 1
1961/4/1 8.38 6.34 8.33 6.75 9.33 9.54 11.67 8.21 11.21 6.46 11.96 7.17 1961/4/1 1961 4 1
1961/5/1 15.87 13.88 15.37 9.79 13.46 10.17 9.96 14.04 9.75 9.92 18.63 11.12 1961/5/1 1961 5 1
1961/6/1 15.92 9.59 12.04 8.79 11.54 6.04 9.75 8.29 9.33 10.34 10.67 12.12 1961/6/1 1961 6 1
1961/7/1 7.21 6.83 7.71 4.42 8.46 4.79 6.71 6 5.79 7.96 6.96 8.71 1961/7/1 1961 7 1
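query('day == 1') picks the first calendar day of each month; resample('MS') does the same grouping without needing the helper columns at all. A sketch on synthetic data:

```python
import pandas as pd

# 60 consecutive days starting Jan 1, 1961
idx = pd.date_range('1961-01-01', periods=60, freq='D')
data = pd.DataFrame({'RPT': range(60)}, index=idx)

# 'MS' = month start; .first() keeps the first observation of each month
monthly = data.resample('MS').first()
print(monthly)
```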

Visualization

Exploring the Titanic Disaster Data

  • Name the DataFrame titanic

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

titanic = pd.read_csv('../pydata/train.csv')
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 NaN S
  • Set PassengerId as the index

titanic = titanic.set_index('PassengerId')
titanic.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 NaN S
  • Draw a pie chart showing the proportion of male and female passengers

    males = (titanic['Sex'] == 'male').sum()
    females = (titanic['Sex'] == 'female').sum()
    proportions = [males, females]
    plt.pie(proportions,
            labels=['Males', 'Females'],
            autopct='%1.1f%%')
    plt.axis('equal')
    plt.title('Sex Proportion')
    plt.show()

  • Draw a scatter plot of the ticket Fare against passenger age, colored by sex

    lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
    lm.set(title='Fare x Age')
    axes = lm.axes
    axes[0, 0].set_ylim(-5,)
    axes[0, 0].set_xlim(-5, 85)

  • How many people survived?

    titanic['Survived'].sum()

    342
  • Draw a histogram of ticket fares

    df = titanic.Fare.sort_values(ascending=False)  # sorting is not required for a histogram, but harmless
    binsVal = np.arange(0, 600, 10)
    plt.hist(df, bins=binsVal)
    plt.xlabel('Fare')
    plt.ylabel('Frequency')
    plt.title('Fare Paid Histogram')
    plt.show()


Creating DataFrames

Exploring the Pokemon Data

  • Create a data dictionary

import pandas as pd

data = {"name": ['Bulbasaur', 'Charmander', 'Squirtle', 'Caterpie'],
        "evolution": ['Ivysaur', 'Charmeleon', 'Wartortle', 'Metapod'],
        "type": ['grass', 'fire', 'water', 'bug'],
        "hp": [45, 39, 44, 45],
        "pokedex": ['yes', 'no', 'yes', 'no']
        }
  • Store the dictionary in a DataFrame named pokemon

pokemon = pd.DataFrame(data)
pokemon
pokemon
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no
  • The columns are in alphabetical order; reorder them to name, type, hp, evolution, pokedex

    pokemon = pokemon[['name', 'type', 'hp', 'evolution', 'pokedex']]
    pokemon
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
  • Add a column place

    pokemon['place'] = ['park', 'street', 'lake', 'forest']
    pokemon
    pokemon
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest
  • View the data type of each column

    pokemon.dtypes

    name object
    type object
    hp int64
    evolution object
    pokedex object
    place object
    dtype: object

Time Series

Exploring Apple Stock Price Data

  • Read the data into a DataFrame named apple

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    apple = pd.read_csv('../pydata/appl_1980_2014.csv')
    apple.head()
Date Open High Low Close Volume Adj Close
0 2014/7/8 96.27 96.8 93.92 95.35 65130000 95.35
1 2014/7/7 94.14 95.99 94.1 95.97 56305400 95.97
2 2014/7/3 93.67 94.1 93.2 94.03 22891800 94.03
3 2014/7/2 93.87 94.06 93.09 93.48 28420900 93.48
4 2014/7/1 93.52 94.07 93.13 93.52 38170200 93.52
  • View the data type of each column

    apple.dtypes

    Date object
    Open float64
    High float64
    Low float64
    Close float64
    Volume int64
    Adj Close float64
    dtype: object
  • Convert the Date column to datetime

    apple.Date = pd.to_datetime(apple.Date)
    apple.head()
Date Open High Low Close Volume Adj Close
0 2014/7/8 96.27 96.8 93.92 95.35 65130000 95.35
1 2014/7/7 94.14 95.99 94.1 95.97 56305400 95.97
2 2014/7/3 93.67 94.1 93.2 94.03 22891800 94.03
3 2014/7/2 93.87 94.06 93.09 93.48 28420900 93.48
4 2014/7/1 93.52 94.07 93.13 93.52 38170200 93.52
  • Set Date as the index

    apple = apple.set_index('Date')
    apple.head()
Open High Low Close Volume Adj Close
Date
2014/7/8 96.27 96.8 93.92 95.35 65130000 95.35
2014/7/7 94.14 95.99 94.1 95.97 56305400 95.97
2014/7/3 93.67 94.1 93.2 94.03 22891800 94.03
2014/7/2 93.87 94.06 93.09 93.48 28420900 93.48
2014/7/1 93.52 94.07 93.13 93.52 38170200 93.52
  • Are there any duplicate dates?

    apple.index.is_unique

    True
  • Sort the index in ascending order

    apple = apple.sort_index(ascending=True)  # sort_index does not sort in place, so assign the result back
    apple.head()

(after sorting, head() shows the earliest rows, which start in December 1980)
  • Find the last business day of each month

    apple['year'] = apple.index.year
    apple['month'] = apple.index.month
    apple['day'] = apple.index.day
    apple['year_month'] = apple['year'].map(str) + '_' + apple['month'].map(str)
    apple.groupby('year_month').agg({'day':'max'}).sort_index(ascending=False).head(10)
day
year_month
2014_7 8
2014_6 30
2014_5 30
2014_4 30
2014_3 31
2014_2 28
2014_1 31
2013_9 30
2013_8 30
2013_7 31
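One caveat with the string year_month key: it sorts lexicographically, so '2013_10' lands before '2013_2' (which is why months 10-12 of 2013 are missing from the head above). Grouping on the datetime index avoids that. A sketch on synthetic trading days:

```python
import pandas as pd

# Synthetic business days spanning June and early July 2014 (freq='B' skips weekends)
idx = pd.date_range('2014-06-01', '2014-07-08', freq='B')
apple = pd.DataFrame({'Close': range(len(idx))}, index=idx)

# Sort chronologically, then keep the last row in each (year, month) group
last_bday = (apple.sort_index()
                  .groupby([apple.index.year, apple.index.month])
                  .tail(1))
print(last_bday.index)
```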
  • How many days apart are the earliest and the latest dates in the dataset?

    (apple.index.max() - apple.index.min()).days

    12261
  • How many months are in the data?

    apple.year_month.nunique()

    404
  • Plot the Adj Close values in chronological order

    appl_open = apple['Adj Close'].plot(title='Apple Stock')
    fig = appl_open.get_figure()
    fig.set_size_inches(13.5, 9)


Deleting Data

Exploring the Iris Flower Data

  • Load the dataset into a variable named iris

import pandas as pd
import numpy as np

iris = pd.read_csv('../pydata/iris.data')
iris.head()
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
  • Create column names for the DataFrame

    iris = pd.read_csv('../pydata/iris.data', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
    iris.head()
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5 3.6 1.4 0.2 Iris-setosa
  • Are there any missing values in the DataFrame?

    pd.isnull(iris).sum()

    sepal_length 0
    sepal_width 0
    petal_length 0
    petal_width 0
    class 0
    dtype: int64
  • Set rows 10 to 19 of the column petal_length to missing

    # iris.petal_length[10:20] = np.nan   # chained assignment (avoid)
    iris.iloc[10:20, 2:3] = np.nan
    iris.head(20)
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 NaN 0.2 Iris-setosa
11 4.8 3.4 NaN 0.2 Iris-setosa
12 4.8 3 NaN 0.1 Iris-setosa
13 4.3 3 NaN 0.1 Iris-setosa
14 5.8 4 NaN 0.2 Iris-setosa
15 5.7 4.4 NaN 0.4 Iris-setosa
16 5.4 3.9 NaN 0.4 Iris-setosa
17 5.1 3.5 NaN 0.3 Iris-setosa
18 5.7 3.8 NaN 0.3 Iris-setosa
19 5.1 3.8 NaN 0.3 Iris-setosa
  • Replace all missing values with 1.0

    iris.fillna(1.0, inplace=True)
    iris.iloc[10:20, :]
sepal_length sepal_width petal_length petal_width class
10 5.4 3.7 1 0.2 Iris-setosa
11 4.8 3.4 1 0.2 Iris-setosa
12 4.8 3 1 0.1 Iris-setosa
13 4.3 3 1 0.1 Iris-setosa
14 5.8 4 1 0.2 Iris-setosa
15 5.7 4.4 1 0.4 Iris-setosa
16 5.4 3.9 1 0.4 Iris-setosa
17 5.1 3.5 1 0.3 Iris-setosa
18 5.7 3.8 1 0.3 Iris-setosa
19 5.1 3.8 1 0.3 Iris-setosa
  • Delete the column class

    del iris['class']
    iris.head()
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5 3.6 1.4 0.2
  • Set the first three rows of the DataFrame to missing

    iris.iloc[:3, :] = np.nan
    iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 4.6 3.1 1.5 0.2
4 5 3.6 1.4 0.2
  • Drop the rows that have missing values

    iris = iris.dropna(how='any')
    iris.head()
sepal_length sepal_width petal_length petal_width
3 4.6 3.1 1.5 0.2
4 5 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5 3.4 1.5 0.2
  • Reset the index

    iris = iris.reset_index(drop=True)
    iris.head()
sepal_length sepal_width petal_length petal_width
0 4.6 3.1 1.5 0.2
1 5 3.6 1.4 0.2
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5 3.4 1.5 0.2