分組號碼
對於以下 DataFrame:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100),
'Sex': np.random.choice(['Male', 'Female'], 100),
'number_of_foo': np.random.randint(1, 20, 100)})
df.head()
# Output:
# Age Sex number_of_foo
# 0 64 Female 14
# 1 67 Female 14
# 2 20 Female 12
# 3 23 Male 17
# 4 23 Female 15
將 Age
分為三類(或箱子)。箱子可以作為
- 一個整數
n
,表示 bin 的數量 - 在這種情況下,資料幀的資料被分成相同大小的n
間隔 - 一系列整數,表示資料被分成的左開區間的終點 - 例如
bins=[19, 40, 65, np.inf]
建立了三個年齡組(19, 40]
,(40, 65]
和(65, np.inf]
。
Pandas 自動將間隔的字串版本指定為標籤。也可以通過將 labels
引數定義為字串列表來定義自己的標籤。
pd.cut(df['Age'], bins=4)
# this creates four age groups: (19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69]
Name: Age, dtype: category
Categories (4, object): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69]]
pd.cut(df['Age'], bins=[19, 40, 65, np.inf])
# this creates three age groups: (19, 40], (40, 65] and (65, infinity)
Name: Age, dtype: category
Categories (3, object): [(19, 40] < (40, 65] < (65, inf]]
在 groupby
中使用它來獲得 foo 的平均數:
age_groups = pd.cut(df['Age'], bins=[19, 40, 65, np.inf])
df.groupby(age_groups)['number_of_foo'].mean()
# Output:
# Age
# (19, 40] 9.880000
# (40, 65] 9.452381
# (65, inf] 9.250000
# Name: number_of_foo, dtype: float64
交叉製表年齡組和性別:
pd.crosstab(age_groups, df['Sex'])
# Output:
# Sex Female Male
# Age
# (19, 40] 22 28
# (40, 65] 18 24
# (65, inf] 3 5