Basic verbs
library(dplyr)
library(nycflights13)
The most commonly used dplyr verbs select, filter, reorder, and modify the contents of a data set.
select
Select the variables tailnum, type, and model from the data frame planes:
select(planes, tailnum, type, model)
## # A tibble: 3,322 × 3
## tailnum type model
## <chr> <chr> <chr>
## 1 N10156 Fixed wing multi engine EMB-145XR
## 2 N102UW Fixed wing multi engine A320-214
## 3 N103US Fixed wing multi engine A320-214
## 4 N104UW Fixed wing multi engine A320-214
## 5 N10575 Fixed wing multi engine EMB-145LR
## 6 N105UW Fixed wing multi engine A320-214
## 7 N107US Fixed wing multi engine A320-214
## 8 N108UW Fixed wing multi engine A320-214
## 9 N109UW Fixed wing multi engine A320-214
## 10 N110UW Fixed wing multi engine A320-214
## # ... with 3,312 more rows
Rewrite the statement above using the forward-pipe operator (%>%) from the magrittr package:
planes %>% select(tailnum, type, model)
## # A tibble: 3,322 × 3
## tailnum type model
## <chr> <chr> <chr>
## 1 N10156 Fixed wing multi engine EMB-145XR
## 2 N102UW Fixed wing multi engine A320-214
## 3 N103US Fixed wing multi engine A320-214
## 4 N104UW Fixed wing multi engine A320-214
## 5 N10575 Fixed wing multi engine EMB-145LR
## 6 N105UW Fixed wing multi engine A320-214
## 7 N107US Fixed wing multi engine A320-214
## 8 N108UW Fixed wing multi engine A320-214
## 9 N109UW Fixed wing multi engine A320-214
## 10 N110UW Fixed wing multi engine A320-214
## # ... with 3,312 more rows
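The equivalence of the two calling styles can be checked directly. The sketch below uses a small hypothetical data frame named toy (not part of nycflights13) so that it runs with only dplyr installed:

```r
library(dplyr)

# Hypothetical toy data standing in for planes
toy <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  type    = c("a", "b", "a"),
  model   = c("X", "Y", "Z"),
  seats   = c(4, 55, 182),
  stringsAsFactors = FALSE
)

# The pipe passes its left-hand side as the first argument of select()
direct <- select(toy, tailnum, type, model)
piped  <- toy %>% select(tailnum, type, model)

identical(direct, piped)  # both forms give the same result
names(piped)
```

The piped form becomes more useful as more verbs are chained, because each step reads left to right.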
filter
filter returns the rows that match the given criteria.
Return the rows where manufacturer is EMBRAER:
planes %>% filter(manufacturer == "EMBRAER")
## # A tibble: 299 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## 3 N11106 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 4 N11107 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 5 N11109 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 6 N11113 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 7 N11119 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 8 N11121 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 9 N11127 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 10 N11137 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## # ... with 289 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
Return the rows where manufacturer is EMBRAER and model is "EMB-145XR":
planes %>%
  filter(manufacturer == "EMBRAER", model == "EMB-145XR")
## # A tibble: 104 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N11106 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 3 N11107 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 4 N11109 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 5 N11113 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 6 N11119 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 7 N11121 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 8 N11127 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 9 N11137 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 10 N11140 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## # ... with 94 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
Separating conditions with a comma is equivalent to combining them into a single AND condition with &:
planes %>% filter(manufacturer == "EMBRAER" & model == "EMB-145XR")
## # A tibble: 104 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N11106 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 3 N11107 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 4 N11109 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 5 N11113 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 6 N11119 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 7 N11121 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 8 N11127 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 9 N11137 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 10 N11140 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## # ... with 94 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
Use the pipe character (|) for OR conditions:
planes %>% filter(manufacturer == "EMBRAER" | model == "EMB-145XR")
## # A tibble: 299 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## 3 N11106 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 4 N11107 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 5 N11109 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 6 N11113 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 7 N11119 2002 Fixed wing multi engine EMBRAER EMB-145XR 2
## 8 N11121 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 9 N11127 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## 10 N11137 2003 Fixed wing multi engine EMBRAER EMB-145XR 2
## # ... with 289 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
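As a minimal sketch of how the comma, &, and | forms relate, the following uses a hypothetical three-row data frame named toy rather than planes:

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  manufacturer = c("EMBRAER", "EMBRAER", "BOEING"),
  model        = c("EMB-145XR", "EMB-145LR", "737-524"),
  stringsAsFactors = FALSE
)

by_comma <- toy %>% filter(manufacturer == "EMBRAER", model == "EMB-145XR")
by_and   <- toy %>% filter(manufacturer == "EMBRAER" & model == "EMB-145XR")
by_or    <- toy %>% filter(manufacturer == "EMBRAER" | model == "EMB-145XR")

identical(by_comma, by_and)  # comma and & select the same rows
nrow(by_and)                 # rows matching both conditions
nrow(by_or)                  # rows matching at least one condition
```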
Combine grepl with filter for pattern-matching conditions:
planes %>% filter(grepl("^172.", model))
## # A tibble: 3 × 9
## tailnum year type manufacturer model engines seats
## <chr> <int> <chr> <chr> <chr> <int> <int>
## 1 N378AA 1963 Fixed wing single engine CESSNA 172E 1 4
## 2 N621AA 1975 Fixed wing single engine CESSNA 172M 1 4
## 3 N737MQ 1977 Fixed wing single engine CESSNA 172N 1 4
## # ... with 2 more variables: speed <int>, engine <chr>
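Note that in a regular expression an unescaped dot matches any single character, so "^172." accepts any character after 172. The sketch below, on hypothetical model strings, contrasts that pattern with a stricter one; the stricter pattern is an illustration, not a recommendation from the original text:

```r
library(dplyr)

# Hypothetical model names chosen to show the difference between patterns
toy <- data.frame(
  model = c("172E", "172M", "172", "1725", "A172N"),
  stringsAsFactors = FALSE
)

# "^172." : starts with 172 followed by any one character
loose <- toy %>% filter(grepl("^172.", model))

# "^172[A-Z]$" : exactly 172 followed by a single uppercase letter
strict <- toy %>% filter(grepl("^172[A-Z]$", model))

loose$model
strict$model
```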
between
Return all rows where year is between 2004 and 2005, inclusive:
planes %>% filter(between(year, 2004, 2005))
## # A tibble: 354 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N11155 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 3 N11164 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 4 N11165 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 5 N11176 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 6 N11181 2005 Fixed wing multi engine EMBRAER EMB-145XR 2
## 7 N11184 2005 Fixed wing multi engine EMBRAER EMB-145XR 2
## 8 N11187 2005 Fixed wing multi engine EMBRAER EMB-145XR 2
## 9 N11189 2005 Fixed wing multi engine EMBRAER EMB-145XR 2
## 10 N11191 2005 Fixed wing multi engine EMBRAER EMB-145XR 2
## # ... with 344 more rows, and 3 more variables: seats <int>, speed <int>,
## # engine <chr>
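between is a shortcut for a pair of inclusive comparisons. A minimal sketch on a hypothetical year column, including a missing value to show that filter drops rows where the condition is NA:

```r
library(dplyr)

# Hypothetical years, with one missing value
toy <- data.frame(year = c(2003L, 2004L, 2005L, 2006L, NA))

with_between <- toy %>% filter(between(year, 2004, 2005))
with_compare <- toy %>% filter(year >= 2004, year <= 2005)

with_between$year
with_compare$year
```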
slice
slice returns only the rows at the given indices.
Return the first five rows (the same rows as base R's head(planes, 5)):
planes %>% slice(1:5)
## # A tibble: 5 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## # ... with 3 more variables: seats <int>, speed <int>, engine <chr>
Return rows 1, 3, and 5:
planes %>% slice(c(1, 3, 5))
## # A tibble: 3 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## # ... with 3 more variables: seats <int>, speed <int>, engine <chr>
Return the first and last rows:
planes %>% slice(c(1, nrow(planes)))
## # A tibble: 2 × 9
## tailnum year type manufacturer
## <chr> <int> <chr> <chr>
## 1 N10156 2004 Fixed wing multi engine EMBRAER
## 2 N999DN 1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
## # ... with 5 more variables: model <chr>, engines <int>, seats <int>,
## # speed <int>, engine <chr>
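Inside slice, n() gives the number of rows in the current data, which avoids repeating the data frame's name; negative indices drop rows. A small sketch on a hypothetical data frame:

```r
library(dplyr)

# Hypothetical six-row data frame
toy <- data.frame(id = 1:6, label = letters[1:6], stringsAsFactors = FALSE)

# First and last row, without referring to nrow(toy)
first_last <- toy %>% slice(c(1, n()))

# Everything except the first row
without_first <- toy %>% slice(-1)

first_last$id
nrow(without_first)
```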
mutate
mutate can add new variables or modify existing ones.
Add a dummy variable engine.dummy with a default value of 0:
planes %>%
  mutate(engine.dummy = 0) %>%
  select(engine, engine.dummy)
## # A tibble: 3,322 × 2
## engine engine.dummy
## <chr> <dbl>
## 1 Turbo-fan 0
## 2 Turbo-fan 0
## 3 Turbo-fan 0
## 4 Turbo-fan 0
## 5 Turbo-fan 0
## 6 Turbo-fan 0
## 7 Turbo-fan 0
## 8 Turbo-fan 0
## 9 Turbo-fan 0
## 10 Turbo-fan 0
## # ... with 3,312 more rows
Using dplyr::if_else, set engine.dummy to 1 if engine == "Turbo-fan" and to 0 otherwise:
planes %>%
  mutate(engine.dummy = if_else(engine == "Turbo-fan", 1, 0)) %>%
  select(engine, engine.dummy)
## # A tibble: 3,322 × 2
## engine engine.dummy
## <chr> <dbl>
## 1 Turbo-fan 1
## 2 Turbo-fan 1
## 3 Turbo-fan 1
## 4 Turbo-fan 1
## 5 Turbo-fan 1
## 6 Turbo-fan 1
## 7 Turbo-fan 1
## 8 Turbo-fan 1
## 9 Turbo-fan 1
## 10 Turbo-fan 1
## # ... with 3,312 more rows
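When the condition is NA, if_else returns NA unless a value is supplied through its missing argument. The sketch below uses a hypothetical engine column containing a missing value; treating missing engines as 0 is an assumption made only for illustration:

```r
library(dplyr)

# Hypothetical engine types, one of them missing
toy <- data.frame(
  engine = c("Turbo-fan", "Reciprocating", NA),
  stringsAsFactors = FALSE
)

result <- toy %>%
  mutate(
    dummy.default = if_else(engine == "Turbo-fan", 1, 0),              # NA stays NA
    dummy.filled  = if_else(engine == "Turbo-fan", 1, 0, missing = 0)  # NA becomes 0
  )

result$dummy.default
result$dummy.filled
```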
Convert planes$engine to a factor:
planes %>%
  mutate(engine = as.factor(engine)) %>%
  select(engine)
## # A tibble: 3,322 × 1
## engine
## <fctr>
## 1 Turbo-fan
## 2 Turbo-fan
## 3 Turbo-fan
## 4 Turbo-fan
## 5 Turbo-fan
## 6 Turbo-fan
## 7 Turbo-fan
## 8 Turbo-fan
## 9 Turbo-fan
## 10 Turbo-fan
## # ... with 3,312 more rows
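After such a conversion, the factor's levels can be inspected with levels. A minimal sketch on hypothetical engine values:

```r
library(dplyr)

# Hypothetical engine types
toy <- data.frame(
  engine = c("Turbo-fan", "Reciprocating", "Turbo-fan"),
  stringsAsFactors = FALSE
)

converted <- toy %>% mutate(engine = as.factor(engine))

is.factor(converted$engine)
levels(converted$engine)  # one level per distinct value
```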
arrange
Use arrange to sort a data frame.
Sort planes by year:
planes %>% arrange(year)
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N381AA 1956 Fixed wing multi engine DOUGLAS DC-7BF 4
## 2 N201AA 1959 Fixed wing single engine CESSNA 150 1
## 3 N567AA 1959 Fixed wing single engine DEHAVILLAND OTTER DHC-3 1
## 4 N378AA 1963 Fixed wing single engine CESSNA 172E 1
## 5 N575AA 1963 Fixed wing single engine CESSNA 210-5(205) 1
## 6 N14629 1965 Fixed wing multi engine BOEING 737-524 2
## 7 N615AA 1967 Fixed wing multi engine BEECH 65-A90 2
## 8 N425AA 1968 Fixed wing single engine PIPER PA-28-180 1
## 9 N383AA 1972 Fixed wing multi engine BEECH E-90 2
## 10 N364AA 1973 Fixed wing multi engine CESSNA 310Q 2
## # ... with 3,312 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
Sort planes by year in descending order, using desc:
planes %>% arrange(desc(year))
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines
## <chr> <int> <chr> <chr> <chr> <int>
## 1 N150UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 2 N151UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 3 N152UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 4 N153UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 5 N154UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 6 N155UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 7 N156UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 8 N157UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 9 N198UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## 10 N199UW 2013 Fixed wing multi engine AIRBUS A321-211 2
## # ... with 3,312 more rows, and 3 more variables: seats <int>,
## # speed <int>, engine <chr>
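arrange also accepts several sort keys, with later keys breaking ties in earlier ones, and it places missing values last. A sketch on a hypothetical data frame:

```r
library(dplyr)

# Hypothetical data with a tie in year and one missing year
toy <- data.frame(
  tailnum = c("N1", "N2", "N3", "N4"),
  year    = c(2004L, 2013L, 2004L, NA),
  seats   = c(55L, 199L, 20L, 4L),
  stringsAsFactors = FALSE
)

# Newest first; within the same year, fewest seats first
sorted <- toy %>% arrange(desc(year), seats)

sorted$tailnum
```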
group_by
group_by lets you perform operations on subsets of a data frame without extracting the subsets.
df <- planes %>% group_by(manufacturer, model)
The returned data frame may not look grouped when printed. However, its class and attributes confirm that it is:
class(df)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
attributes(df)$vars
## [[1]]
## manufacturer
##
## [[2]]
## model
head(attributes(df)$labels, n = 5L)
## manufacturer model
## 1 AGUSTA SPA A109E
## 2 AIRBUS A319-112
## 3 AIRBUS A319-114
## 4 AIRBUS A319-115
## 5 AIRBUS A319-131
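To add grouping variables to a data frame without removing the existing ones, set the add argument to TRUE (it defaults to FALSE). In dplyr 1.0.0 and later this argument is named .add: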
df <- df %>% group_by(type, year, add = TRUE)
class(df)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
attributes(df)$vars
## [[1]]
## manufacturer
##
## [[2]]
## model
##
## [[3]]
## type
##
## [[4]]
## year
head(attributes(df)$labels, n = 5L)
## manufacturer model type year
## 1 AGUSTA SPA A109E Rotorcraft 2001
## 2 AIRBUS A319-112 Fixed wing multi engine 2002
## 3 AIRBUS A319-112 Fixed wing multi engine 2005
## 4 AIRBUS A319-112 Fixed wing multi engine 2006
## 5 AIRBUS A319-112 Fixed wing multi engine 2007
To remove the grouping, use ungroup:
df <- df %>% ungroup()
class(df)
## [1] "tbl_df" "tbl" "data.frame"
attributes(df)$vars
## NULL
attributes(df)$labels
## NULL
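The attribute names inspected above are internal details and may differ between dplyr versions. As a sketch that does not depend on them, group_vars reports the grouping variables directly. The example assumes dplyr 1.0.0 or later (for the .add argument) and uses a hypothetical data frame:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  manufacturer = c("A", "A", "B"),
  model        = c("x", "y", "x"),
  year         = c(2001L, 2002L, 2001L),
  stringsAsFactors = FALSE
)

grouped <- toy %>% group_by(manufacturer, model)
group_vars(grouped)

# Add a grouping variable while keeping the existing ones (dplyr >= 1.0.0)
regrouped <- grouped %>% group_by(year, .add = TRUE)
group_vars(regrouped)

# After ungroup() there are no grouping variables left
group_vars(ungroup(regrouped))
```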
summarise
summarise computes summary values over the whole data set or per group.
Find the mean number of seats per manufacturer:
planes %>%
  group_by(manufacturer) %>%
  summarise(Mean = mean(seats))
## # A tibble: 35 × 2
## manufacturer Mean
## <chr> <dbl>
## 1 AGUSTA SPA 8.0000
## 2 AIRBUS 221.2024
## 3 AIRBUS INDUSTRIE 187.4025
## 4 AMERICAN AIRCRAFT INC 2.0000
## 5 AVIAT AIRCRAFT INC 2.0000
## 6 AVIONS MARCEL DASSAULT 12.0000
## 7 BARKER JACK L 2.0000
## 8 BEECH 9.5000
## 9 BELL 8.0000
## 10 BOEING 175.1877
## # ... with 25 more rows
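summarise does not return variables that are neither grouping variables nor the result of a summary function. To include another variable, pass it as an argument to group_by or summarise:
planes %>%
  group_by(year, manufacturer) %>%
  summarise(Mean = mean(seats))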
## Source: local data frame [164 x 3]
## Groups: year [?]
##
## year manufacturer Mean
## <int> <chr> <dbl>
## 1 1956 DOUGLAS 102
## 2 1959 CESSNA 2
## 3 1959 DEHAVILLAND 16
## 4 1963 CESSNA 5
## 5 1965 BOEING 149
## 6 1967 BEECH 9
## 7 1968 PIPER 4
## 8 1972 BEECH 10
## 9 1973 CESSNA 6
## 10 1974 CANADAIR LTD 2
## # ... with 154 more rows
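Several summaries can be computed in one summarise call, and n() counts the rows in each group. A minimal sketch on a hypothetical data frame, showing both a grouped and an ungrouped summary:

```r
library(dplyr)

# Hypothetical seat counts
toy <- data.frame(
  manufacturer = c("A", "A", "B"),
  seats        = c(100, 200, 4),
  stringsAsFactors = FALSE
)

# One row per manufacturer, with three summary columns
per_group <- toy %>%
  group_by(manufacturer) %>%
  summarise(Mean = mean(seats), Max = max(seats), Planes = n())

# Without group_by, summarise collapses the whole data set to one row
overall <- toy %>% summarise(Mean = mean(seats))

per_group
overall
```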
rename
Rename a variable, using the form new_name = old_name:
planes %>%
  rename(Mfr = manufacturer) %>%
  names()
## [1] "tailnum" "year" "type" "Mfr" "model" "engines" "seats"
## [8] "speed" "engine"
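Several variables can be renamed in one call, and the variables that are not mentioned are kept unchanged. A sketch on a hypothetical one-row data frame; the new names Mfr and Tail are arbitrary:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  manufacturer = "BOEING",
  tailnum      = "N1",
  seats        = 149L,
  stringsAsFactors = FALSE
)

# Each argument has the form new_name = old_name
renamed <- toy %>% rename(Mfr = manufacturer, Tail = tailnum)

names(renamed)
```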