Suboptimal use of nested for loops in R. Options for vectorizing/optimizing?

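I have a dataset that stores participants' records vertically over time (long format). Participants can have essentially any number of follow-ups; the number of rows per participant currently ranges from 1 to 14, but more are expected to be added over time.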

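I have a list of variables, var, that participants may have reported at each follow-up, and I want to create a new set of "ever" variables, vare, describing whether the participant has reported "yes" for the corresponding variable at any time up to that follow-up (inclusive, as the example shows).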

Here is an example of the desired input/output:

var  = c("var1","var2")
vare = paste0(var,"_ever")

data = data.frame(idno         = c(123,123,123,123,123,123,123),
                  followup_num = c(0,1,2,3,4,5,6),
                  var1         = c(0,NA,0,1,0,NA,1),
                  var2         = c(1,NA,NA,0,0,0,1)
                 )
data$var1_ever = c(0,0,0,1,1,1,1)
data$var2_ever = c(1,1,1,1,1,1,1)

idno	followup_num	var1	var1_ever	var2	var2_ever
123	0	0	0	1	1
123	1	NA	0	NA	1
123	2	0	0	NA	1
123	3	1	1	0	1
123	4	0	1	0	1
123	5	NA	1	0	1
123	6	1	1	1	1

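Here is the code I am currently using. Obviously, nested for loops are not ideal in R, and this code is especially slow once the data reaches a few thousand rows.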

#For each ID
for (i in unique(data$idno)) {

  id  = data$idno%in%i              #Get the relevant lines for this ID
  fus = sort(data$followup_num[id]) #Get the follow-up numbers
  
  #For each variable in the list
  for (v in seq_along(var)) {

    #Loop through the follow-ups. If you see that the variable reports "yes", mark 
    #  this and every subsequent follow-up as having reported that variable ever 
    #  Otherwise, mark the opposite at that line and move to the next follow-up
    for (f in fus) {
      if (t(data[id & data$followup_num%in%f,var[v]])%in%1) {
        data[id & data$followup_num >= f,vare[v]] = 1
        break
      } else {
        data[id & data$followup_num%in%f,vare[v]] = 0
      }
    }    
  }
}

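Is the problem inherent to this approach, or is there a way to optimize or simplify it? Is there some use of the apply/sapply/etc. functions that I have neglected to try?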

A uj5u.com user replied:

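The core of the solution is the base function cummax(). We need to account for NA, so I added replace_na(). We also need to handle more than one idno, which is done with group_by().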

The minimal vectorized solution (for a single participant, using the df defined below) is

df$var1_test <- cummax(tidyr::replace_na(df$var1, 0))
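
As a quick illustration of why the NA replacement matters, here is a minimal base R sketch on the var1 values from the question. cummax() propagates NA to every later element, so the NAs are set to 0 first. Treating a missing report as "no report of yes" is an assumption, not something the question states outright.

```r
# var1 values for participant 123, in follow-up order
x <- c(0, NA, 0, 1, 0, NA, 1)

# Without handling NA, everything from the first NA onward becomes NA
naive <- cummax(x)

# Assumption: a missing report counts as 0, then take the running maximum
x0   <- ifelse(is.na(x), 0, x)
ever <- cummax(x0)

print(naive)  # values: 0 NA NA NA NA NA NA
print(ever)   # values: 0 0 0 1 1 1 1
```

The second result matches the var1_ever column in the desired output.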

This is a great problem to solve with the tidyverse functions mutate() and across()!

df = data.frame(idno         = c(123,123,123,123,123,123,123),
                  followup_num = c(0,1,2,3,4,5,6),
                  var1         = c(0,NA,0,1,0,NA,1),
                  var2         = c(1,NA,NA,0,0,0,1))

library(dplyr)  # provides %>%, group_by(), arrange(), mutate(), across()

df %>% group_by(idno) %>%  
       arrange(idno, followup_num) %>% 
       mutate(across(.cols=starts_with("var"), 
                     .fns= ~ cummax(tidyr::replace_na(.x, 0)), 
                     .names="{.col}_ever2"))

   idno followup_num  var1  var2 var1_ever2 var2_ever2
1   123            0     0     1          0          1
2   123            1    NA    NA          0          1
3   123            2     0    NA          0          1
4   123            3     1     0          1          1
5   123            4     0     0          1          1
6   123            5    NA     0          1          1
7   123            6     1     1          1          1

Alternatively, if you want to summarise the data into one row per participant, a grouped maximum works:

df %>%
  group_by(idno) %>%
  summarise(across(.cols=starts_with("var"), 
                   .fns= ~ max(.x, na.rm=TRUE), 
                   .names="{.col}_ever3"))

   idno var1_ever3 var2_ever3
1   123          1          1
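
One caveat with the grouped maximum, sketched below in base R: if a participant has only NA for a variable, max(..., na.rm = TRUE) returns -Inf with a warning. The ever_flag() helper is a hypothetical name introduced here for illustration. It returns 0 in that case, which assumes that missing means "never reported".

```r
# Hypothetical helper: 1 if any non-missing value equals 1, otherwise 0
ever_flag <- function(x) as.numeric(any(x == 1, na.rm = TRUE))

all_missing <- c(NA_real_, NA_real_)

# max() on an all-NA vector with na.rm = TRUE gives -Inf (plus a warning)
m <- suppressWarnings(max(all_missing, na.rm = TRUE))

print(m)                          # value: -Inf
print(ever_flag(all_missing))     # value: 0
print(ever_flag(c(0, NA, 1, 0)))  # value: 1
```

If that behaviour is wanted, ~ ever_flag(.x) could stand in for ~ max(.x, na.rm=TRUE) in the summarise() call.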

P.S. data is a built-in function, so it is better to call the variable df.

A uj5u.com user replied:

Consider ave with cummax (and ifelse to handle NA):

data <- within(
  data, {
    var2_ever <- ave(var2, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
    var1_ever <- ave(var1, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
  }
)
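
For readers unfamiliar with ave(): it applies FUN within each group and returns a vector the same length as its input, which makes it a natural fit for per-participant running calculations. Here is a small sketch with two made-up participants (the IDs and values are illustrative, not from the question), comparing it with an ungrouped cummax():

```r
# Two hypothetical participants, rows already in follow-up order
toy <- data.frame(
  idno = c(1, 1, 1, 2, 2, 2),
  var1 = c(0, 1, 0, 0, NA, 1)
)

# Treat NA as 0 before accumulating, as the ifelse() in the answer does
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

# Ungrouped: participant 1's "yes" leaks into participant 2's rows
leaky <- cummax(na_to_zero(toy$var1))

# Grouped with ave(): the running maximum restarts for each idno
grouped <- ave(toy$var1, toy$idno, FUN = function(x) cummax(na_to_zero(x)))

print(leaky)    # values: 0 1 1 1 1 1
print(grouped)  # values: 0 1 1 0 0 1
```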

For many columns:

# Match only the raw report columns (var1, var2, ...), not existing *_ever columns
vars <- grep("^var[0-9]+$", names(data), value = TRUE)

data[paste0(vars, "_ever")] <- sapply(
  vars, \(var) ave(data[[var]], data$idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
)
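
Both cummax()-based answers rely on the rows being in follow-up order within each participant. Below is a hedged sketch of a base R wrapper that sorts first. add_ever() and its arguments are names invented here for illustration, and the two-participant data is made up.

```r
# Hypothetical wrapper: sort, then add one <var>_ever column per variable
add_ever <- function(d, vars, id = "idno", time = "followup_num") {
  # Sort by participant, then follow-up, so the running max follows time
  d <- d[order(d[[id]], d[[time]]), , drop = FALSE]
  for (v in vars) {
    d[[paste0(v, "_ever")]] <- ave(
      d[[v]], d[[id]],
      FUN = function(x) cummax(ifelse(is.na(x), 0, x))  # NA treated as 0
    )
  }
  rownames(d) <- NULL
  d
}

# Deliberately unsorted rows for two hypothetical participants
d <- data.frame(
  idno         = c(2, 1, 1, 2, 1),
  followup_num = c(1, 2, 0, 0, 1),
  var1         = c(1, 0, 0, NA, 1)
)

res <- add_ever(d, "var1")
print(res)
```

The remaining for loop runs once per variable rather than once per row, so it should not be the bottleneck that the original nested loops were.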
