Subsetting rows and columns from a data frame

Created: November-22, 2018

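Syntax for accessing rows and columns: `[`, `[[`, and `$`

This topic covers the most common syntax for accessing specific rows and columns of a data frame:

Like a matrix, with single brackets data[rows, columns]:
- using row and column numbers
- using column (and row) names
Like a list:
- with single brackets data[columns] to get a data frame
- with double brackets data[[one_column]] to get a vector
With $ for a single column: data$column_name

We will use the built-in mtcars data frame to illustrate.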

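Like a matrix: `data[rows, columns]`

With numeric indexes

Using the built-in data frame mtcars, we can extract rows and columns using [] brackets with a comma. Indexes before the comma are rows: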

# get the first row
mtcars[1, ]
# get the first five rows
mtcars[1:5, ]

Similarly, indexes after the comma are columns:

# get the first column
mtcars[, 1]
# get the first, third and fifth columns:
mtcars[, c(1, 3, 5)]

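As shown above, if either the rows or the columns are left blank, all of them are selected. mtcars[1, ] means the first row with all the columns.

Using column (and row) names

So far, this is identical to how rows and columns of matrices are accessed. With data.frames, it is usually preferable to use a column name rather than a column index. This is done by using a character with the column name instead of a numeric with the column number: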

# get the mpg column
mtcars[, "mpg"]
# get the mpg, cyl, and disp columns
mtcars[, c("mpg", "cyl", "disp")]

Though less common, row names can also be used:

mtcars["Mazda RX4", ]
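
Row-name lookup is case-sensitive, and a name that does not match returns a row of NA values instead of raising an error. The following sketch illustrates this on mtcars; the misspelled name "Mazda Rx4" is used deliberately, and the variable name wrong is only for illustration.

```r
# Row names of the built-in mtcars data are car models
head(rownames(mtcars), 3)

# Select several rows by name, together with two columns
mtcars[c("Mazda RX4", "Datsun 710"), c("mpg", "cyl")]

# A name with the wrong capitalisation matches nothing:
# the result is a single row filled with NA, not an error
wrong <- mtcars["Mazda Rx4", ]
all(is.na(wrong))   # TRUE
```

Checking rownames(data) first is a cheap way to avoid silently getting NA rows.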

Rows and columns together

The row and column arguments can be used together:

# first four rows of the mpg column
mtcars[1:4, "mpg"]

# 2nd and 5th row of the mpg, cyl, and disp columns
mtcars[c(2, 5), c("mpg", "cyl", "disp")]

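A warning about dimensions:

When using these methods, extracting multiple columns returns a data frame. However, extracting a single column returns a vector, not a data frame, under the default options.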

## multiple columns returns a data frame
class(mtcars[, c("mpg", "cyl")])
# [1] "data.frame"
## single column returns a vector
class(mtcars[, "mpg"])
# [1] "numeric"

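There are two ways around this. One is to treat the data frame as a list (see below); the other is to add the drop = FALSE argument. This tells R not to drop the unused dimensions: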

class(mtcars[, "mpg", drop = FALSE])
# [1] "data.frame"

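Note that matrices work the same way: by default a single column or row is returned as a vector, but if you specify drop = FALSE you can keep it as a one-column or one-row matrix.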
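
The matrix behaviour can be checked with a small sketch. The matrix m below is made-up example data, not part of mtcars.

```r
# A small 2 x 3 matrix for illustration (hypothetical example data)
m <- matrix(1:6, nrow = 2)

# By default a single column is simplified to a plain vector
is.matrix(m[, 1])            # FALSE
dim(m[, 1])                  # NULL

# drop = FALSE keeps the result as a 2 x 1 matrix
dim(m[, 1, drop = FALSE])    # 2 1

# The same applies to a single row: a 1 x 3 matrix
dim(m[1, , drop = FALSE])    # 1 3
```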

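Like a list

Data frames are essentially lists: they are lists of column vectors, all of which must have the same length. Lists can be subset with single brackets [ to get a sub-list, or with double brackets [[ to get a single element.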
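
A quick way to see the list nature of a data frame is to apply ordinary list tools to it. This is a minimal sketch on mtcars; the variable name lens is only for illustration.

```r
# A data frame is a list whose elements are its columns
is.list(mtcars)    # TRUE
length(mtcars)     # 11, the number of columns

# Every column has one value per row
lens <- vapply(mtcars, length, integer(1))
all(lens == nrow(mtcars))   # TRUE

# List-oriented functions therefore work column by column
sapply(mtcars[c("mpg", "hp")], mean)
```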

With single brackets `data[columns]`

When you use single brackets with no comma, you get columns back, because a data frame is a list of columns.

mtcars["mpg"]
mtcars[c("mpg", "cyl", "disp")]
my_columns <- c("mpg", "cyl", "hp")
mtcars[my_columns]

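Single brackets like a list vs. single brackets like a matrix

The difference between data[columns] and data[, columns] is that when the data.frame is treated as a list (no comma in the brackets), the returned object is always a data.frame. If you use a comma to treat the data.frame like a matrix, selecting a single column returns a vector, while selecting multiple columns returns a data.frame.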

## When selecting a single column
## like a list will return a data frame
class(mtcars["mpg"])
# [1] "data.frame"
## like a matrix will return a vector
class(mtcars[, "mpg"])
# [1] "numeric"

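With double brackets `data[[one_column]]`

To extract a single column as a vector when treating the data.frame as a list, use double brackets [[. This works for only one column at a time.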

# extract a single column by name as a vector 
mtcars[["mpg"]]

# extract a single column by name as a data frame (as above)
mtcars["mpg"]
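
Double brackets also accept a column position, and asking for several names at once fails. The sketch below wraps the failing call in tryCatch so the script keeps running; the variable name failed is only for illustration.

```r
# By position: the first column of mtcars is mpg
identical(mtcars[[1]], mtcars[["mpg"]])   # TRUE

# The result is a plain numeric vector with one value per row
is.numeric(mtcars[["mpg"]])
length(mtcars[["mpg"]]) == nrow(mtcars)

# Two column names at once are not a valid `[[` index for a data frame
failed <- tryCatch({
  mtcars[[c("mpg", "cyl")]]
  FALSE
}, error = function(e) TRUE)
failed   # TRUE
```

For several columns, use single brackets instead, as in mtcars[c("mpg", "cyl")].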

Using `$` to access columns

The $ shortcut extracts a single column without requiring a quoted column name:

# get the column "mpg"
mtcars$mpg

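Columns accessed with $ are always vectors, never data frames.

Drawbacks of `$` for accessing columns

$ can be a convenient shortcut, especially in an environment such as RStudio that auto-completes column names. However, $ has drawbacks too: it uses non-standard evaluation to avoid the need for quotes, which means it will not work if the column name is stored in a variable.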

my_column <- "mpg"
# this will not work: it looks for a column literally named "my_column" and returns NULL
mtcars$my_column
# but these will work
mtcars[, my_column]  # vector
mtcars[my_column]    # one-column data frame
mtcars[[my_column]]  # vector

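Because of these concerns, $ is best used in interactive R sessions where the column names are constant. For programmatic use, such as writing a general function that will be applied to different data sets with different column names, $ should be avoided.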
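
As an illustration of programmatic use, a function can take the column name as a string and use [[ internally. The helper column_mean below is hypothetical, written only for this example.

```r
# Hypothetical helper: mean of a column whose name is passed as a string
column_mean <- function(data, column) {
  stopifnot(is.data.frame(data), column %in% names(data))
  mean(data[[column]], na.rm = TRUE)
}

# Works for any data frame and any column name
column_mean(mtcars, "mpg")
column_mean(iris, "Sepal.Length")

# The column name can also come from a loop or an apply call
sapply(c("mpg", "hp", "wt"), function(col) column_mean(mtcars, col))
```

Writing data$column inside the function would instead look for a column literally named "column".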

Also note that, by default, partial matching of names is used only when extracting from recursive objects (except environments) with $:

# returns the values of the "mpg" column,
# because "mtcars" has only one column whose name starts with "m"
mtcars$m
# returns NULL,
# because "mtcars" has more than one column whose name starts with "d"
mtcars$d
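
By contrast, [[ matches names exactly unless partial matching is requested through its exact argument. A minimal sketch of the difference:

```r
# There is no column literally named "m"
"m" %in% names(mtcars)      # FALSE

# `[[` matches exactly by default, so the abbreviation finds nothing
is.null(mtcars[["m"]])      # TRUE

# Partial matching has to be requested explicitly with exact = FALSE
identical(mtcars[["m", exact = FALSE]], mtcars$mpg)   # TRUE
```

Exact matching is one more reason to prefer [[ over $ in scripts, since an abbreviated name cannot silently pick up a different column after the data changes.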

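Advanced indexing: negative and logical indexes

Whenever we have the option to use numbers as an index, we can also use negative numbers to omit certain indexes, or a boolean (logical) vector to indicate exactly which items to keep.

Negative indexes omit elements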

mtcars[1, ]   # first row
mtcars[ -1, ] # everything but the first row
mtcars[-(1:10), ] # everything except the first 10 rows
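
Negative indexes work for columns as well, but only with numbers: a minus sign cannot be applied to a column name. The sketch below shows one common way to drop columns by name with setdiff; the variable name keep is only for illustration.

```r
# Drop the first column by position
ncol(mtcars[, -1])        # 10

# List-style indexing accepts negative positions too
names(mtcars[-(1:3)])     # all columns except mpg, cyl and disp

# To drop columns by name, build the set of names to keep
keep <- setdiff(names(mtcars), c("mpg", "cyl"))
names(mtcars[keep])
```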

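Logical vectors indicate specific elements to keep

We can use a condition such as < to generate a logical vector, and extract only the rows that meet the condition: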

# logical vector indicating TRUE when a row has mpg less than 15
# FALSE when a row has mpg >= 15
test <- mtcars$mpg < 15 

# extract these rows from the data frame 
mtcars[test, ]

We can also skip the step of saving an intermediate variable:

# extract all columns for rows where the value of cyl is 4.
mtcars[mtcars$cyl == 4, ]
# extract the cyl, mpg, and hp columns where the value of cyl is 4
mtcars[mtcars$cyl == 4, c("cyl", "mpg", "hp")]
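
Conditions can be combined with & (and) and | (or). One caveat is that an NA in the logical vector produces a row of NA values; wrapping the condition in which() is a common way to drop such rows. The sketch below uses a modified copy of mtcars named cars, created only for this example.

```r
# Combine conditions: four-cylinder cars with mpg above 30
efficient <- mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, c("mpg", "cyl")]
nrow(efficient)   # 4

# Copy of mtcars with one missing mpg value (illustration only)
cars <- mtcars
cars$mpg[1] <- NA

# The NA in the condition adds an all-NA row to the result
nrow(cars[cars$mpg < 15, ])          # 6: five matches plus one NA row

# which() keeps only the positions where the condition is TRUE
nrow(cars[which(cars$mpg < 15), ])   # 5
```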
