Writing code compatible with both data.frame and data.table

Created: November-22, 2018

Differences in subsetting syntax

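Besides data.frame, matrix and (2D) array, data.table is one of several two-dimensional data structures available in R. All of these classes are subset with a very similar, but not identical, syntax: the A[rows, cols] pattern.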

Consider the following data stored in a matrix, a data.frame and a data.table:

library(data.table)  # needed for as.data.table()
ma <- matrix(rnorm(12), nrow=4, dimnames=list(letters[1:4], c('X', 'Y', 'Z')))
df <- as.data.frame(ma)
dt <- as.data.table(ma)

ma[2:3]  #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!)
df[2:3]  #---> returns the 2nd and 3rd columns
dt[2:3]  #---> returns the 2nd and 3rd rows!

If you want to be sure of what will be returned, it is better to be explicit.
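The difference is easy to check by looking at the shape of each result. The following is a small sketch that rebuilds the same three objects (the seed is arbitrary and only makes the run repeatable) and inspects what a single index returns for each class.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the example is repeatable
ma <- matrix(rnorm(12), nrow = 4,
             dimnames = list(letters[1:4], c("X", "Y", "Z")))
df <- as.data.frame(ma)
dt <- as.data.table(ma)

res_ma <- ma[2:3]  # plain vector: 2nd and 3rd elements
res_df <- df[2:3]  # data.frame: columns Y and Z, all 4 rows
res_dt <- dt[2:3]  # data.table: rows 2 and 3, all 3 columns

is.null(dim(res_ma))  # the matrix result has no dimensions left
dim(res_df)           # 4 rows, 2 columns
dim(res_dt)           # 2 rows, 3 columns
```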

To get specific rows, just add a comma after the range:

ma[2:3, ]  # \
df[2:3, ]  #  }---> returns the 2nd and 3rd rows
dt[2:3, ]  # /

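However, when subsetting columns, some cases are interpreted differently. All three can be subset in the same way as long as the integer or character indices are written directly rather than stored in a variable.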

ma[, 2:3]          #  \
df[, 2:3]          #   \
dt[, 2:3]          #    }---> returns the 2nd and 3rd columns
ma[, c("Y", "Z")]  #   /
df[, c("Y", "Z")]  #  /
dt[, c("Y", "Z")]  # /

They differ, however, when the indices are stored in a variable that is passed by its unquoted name:

mycols <- 2:3
ma[, mycols]                # \
df[, mycols]                #  }---> returns the 2nd and 3rd columns
dt[, mycols, with = FALSE]  # /

dt[, mycols]                # ---> Raises an error

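In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an error is raised.

Note: For versions of the data.table package prior to 1.9.8, this behavior was slightly different. Anything in the column index was evaluated using dt as an environment, so both dt[, 2:3] and dt[, mycols] returned the vector 2:3. The second case did not raise an error, because the variable mycols does exist in the parent environment.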
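For current versions, the sketch below shows two ways of selecting columns through a variable and captures the error from the bare form instead of stopping. The table contents are made up for illustration, and the `..` prefix assumes a reasonably recent data.table release (it is not available in old ones).

```r
library(data.table)

dt <- data.table(X = 1:4, Y = 5:8, Z = 9:12)  # made-up example data
mycols <- 2:3

by_with   <- dt[, mycols, with = FALSE]  # columns Y and Z
by_prefix <- dt[, ..mycols]              # same columns via the .. prefix

# Capture the error message instead of halting the script
err <- tryCatch(dt[, mycols], error = function(e) conditionMessage(e))
is.character(err)  # TRUE when the bare form raised an error
```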

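Strategies for maintaining compatibility with data.frame and data.table

There are many reasons to write code that is guaranteed to work with both data.frame and data.table. Maybe you are forced to use data.frame, or you need to share code without knowing how it will be used. The main strategies for achieving this, in order of convenience, are: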

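Use syntax that behaves the same for both classes.
Use a common function that does the same thing as the shortest syntax.
Force data.table to behave as data.frame (e.g. call the specific method print.data.frame).
Treat them as lists, which they ultimately are.
Convert the table to a data.frame before doing anything (a bad idea if it is a huge table).
Convert the table to a data.table, if dependencies are not a concern.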
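Two of these strategies can be sketched briefly. The helper `col_means` below is hypothetical and only illustrates the "treat them as lists" idea; the last two lines show the conversion strategies, both of which copy the data.

```r
library(data.table)

# Hypothetical helper: it only iterates over columns as list elements,
# which works the same way for data.frame and data.table.
col_means <- function(tbl) {
  vapply(tbl, function(col) mean(col), numeric(1))
}

df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))  # made-up data
dt <- as.data.table(df)

col_means(df)
col_means(dt)

# Conversion strategies: each call makes a copy, which is costly for big tables
df_from_dt <- as.data.frame(dt)
dt_from_df <- as.data.table(df)
```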

**Subset rows.** This is simple: just use the [, ] selector, with the comma:

A[1:10, ]
A[A$var > 17, ]  # the shorter A[var > 17, ] works only for data.table
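As a quick check that the shared syntax gives the same rows for both classes, here is a minimal sketch. The data and the helper name `rows_above` are invented for the example.

```r
library(data.table)

df <- data.frame(id = 1:5, var = c(5, 20, 17, 30, 18))  # made-up data
dt <- as.data.table(df)

# Hypothetical helper that uses only the syntax shared by both classes
rows_above <- function(tbl, threshold) {
  tbl[tbl$var > threshold, ]
}

rows_above(df, 17)$id
rows_above(dt, 17)$id
```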

**Subset columns.** If you want a single column, use the $ or the [[ ]] selector:

A$var
colname <- 'var'
A[[colname]]
A[[1]]

If you want a uniform way to grab more than one column, a small hack is needed:

B <- `[.data.frame`(A, 2:4)

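# We can give it a better name. A wrapper that calls the method by its full
# name is used instead of a plain alias, so NextMethod() inside `[.data.frame`
# does not have to resolve the alias name.
select <- function(x, cols) `[.data.frame`(x, cols)
B <- select(A, 2:4)
C <- select(A, c('foo', 'bar'))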
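As an alternative to borrowing the data.frame method, a helper can branch on the class explicitly. The sketch below is one possible way to do it; the name `select_cols` and the sample data are assumptions made for the example.

```r
library(data.table)

# Hypothetical helper: choose the column syntax according to the class
select_cols <- function(tbl, cols) {
  if (inherits(tbl, "data.table")) {
    tbl[, cols, with = FALSE]
  } else {
    tbl[, cols, drop = FALSE]  # drop = FALSE keeps a one-column result a data.frame
  }
}

df <- data.frame(id = 1:3, foo = c("a", "b", "c"), bar = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)  # made-up data
dt <- as.data.table(df)

names(select_cols(df, c("foo", "bar")))
names(select_cols(dt, 2:3))
```

Branching on the class keeps each table in its own class, at the cost of one extra function to maintain.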

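**Subset 'indexed' rows.** While data.frame has row.names, data.table has its own key feature. The best approach is to avoid row.names entirely and to take advantage of data.table's existing optimizations where possible.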

B <- A[A$var != 0, ]
# or...
B <- with(A, A[var != 0, ])  # works for both; as far as documented, data.table's auto-indexing covers == and %in% filters, not !=

stuff <- c('a', 'c', 'f')
C <- A[match(stuff, A$name), ]  # much slower than the data.table way: setkey(A, name); A[stuff, ]
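To compare the portable lookup with the keyed one, here is a minimal sketch with invented data. It assumes the lookup column is called `name`, as in the snippet above, and works on a copy so that setting the key does not reorder the original table.

```r
library(data.table)

df <- data.frame(name = c("a", "b", "c", "d", "e", "f"), value = 1:6,
                 stringsAsFactors = FALSE)  # made-up data
dt <- as.data.table(df)
stuff <- c("a", "c", "f")

# Portable: the same line works for both classes
from_df <- df[match(stuff, df$name), ]
from_dt <- dt[match(stuff, dt$name), ]

# data.table only: key the table by 'name', then look the values up
keyed <- copy(dt)    # copy() so that setkey() does not modify dt itself
setkey(keyed, name)
from_key <- keyed[stuff, ]
```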

**Get a 1-column table, get a row as a vector.** These are easy with what we have seen so far:

B <- select(A, 2)    #---> a table with just the second column
C <- unlist(A[1, ])  #---> the first row as a vector (coerced if necessary)
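The "coerced if necessary" part deserves a short illustration: when the columns have different types, unlist() converts every value to a common type. The sketch below uses made-up data with one integer and one character column.

```r
library(data.table)

df <- data.frame(x = 1:2, label = c("a", "b"),
                 stringsAsFactors = FALSE)  # made-up data
dt <- as.data.table(df)

row_df <- unlist(df[1, ])  # the integer 1 becomes the string "1"
row_dt <- unlist(dt[1, ])  # same result for the data.table

row_df
row_dt
```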