I have two data frames. One looks like this:
>df1
SNP Symbols
1 rs11807834 GRIN1,SETD1A
2 rs3729986 MADD,STAC3,SPI1
3 rs61937595 NDUFA4L2,STAC3,CAMK2N1
The other looks like this:
>df2
Symbol Score
1 GRIN1 167
2 SETD1A 160
3 MADD 164
4 STAC3 12
5 CAMK2N1 3
6 NDUFA4L2 0
7 SPI1 0
For each SNP row, I want the symbol with the highest score, so the result would look like this:
>result
SNP Symbols Highest.Score
rs11807834 GRIN1,SETD1A GRIN1
rs3729986 MADD,STAC3,SPI1 MADD
rs61937595 NDUFA4L2,STAC3,CAMK2N1 STAC3
Any suggestions on how to achieve this?
df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"), "Score" = c(167, 160, 164,12,3,0,0))
uj5u.com user reply:
We can combine separate_rows, a left_join and slice_max after grouping:
library(dplyr)
library(tidyr)
df1 %>%
  separate_rows(Symbols, sep = ",") %>%
  left_join(df2, by = c("Symbols" = "Symbol")) %>%
  group_by(SNP) %>%
  slice_max(Score)
SNP Symbols Score
<chr> <chr> <int>
1 rs11807834 GRIN1 167
2 rs3729986 MADD 164
3 rs61937595 STAC3 12
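The pipeline above returns one row per SNP but no longer carries the original comma-separated Symbols column. A minimal sketch of getting the layout asked for in the question, assuming the df1 and df2 defined there. The intermediate name best and the use of with_ties = FALSE (to keep a single symbol when scores tie) are my own choices, not part of the answer:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(SNP = c("rs11807834", "rs3729986", "rs61937595"),
                  Symbols = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Symbol = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"),
                  Score = c(167, 160, 164, 12, 3, 0, 0),
                  stringsAsFactors = FALSE)

# One row per SNP holding only the top-scoring symbol (hypothetical helper table)
best <- df1 %>%
  separate_rows(Symbols, sep = ",") %>%
  left_join(df2, by = c("Symbols" = "Symbol")) %>%
  group_by(SNP) %>%
  slice_max(Score, n = 1, with_ties = FALSE) %>%  # exactly one row per SNP
  ungroup() %>%
  select(SNP, Highest.Score = Symbols)

# Attach the winning symbol back onto the untouched df1
result <- left_join(df1, best, by = "SNP")
print(result)
```

Joining back on SNP keeps df1 in its original row order, so result has the columns SNP, Symbols and Highest.Score.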
uj5u.com user reply:
Here is an approach using dplyr and grepl:
df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"),
"Score" = c(167,160,164,12,3,0,0))
library(dplyr)
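result <- df1 %>%
  rowwise() %>%
  mutate(Highest.Score = {
    # Anchored pattern, so each symbol must match a df2 entry in full
    pattern <- paste0("^(", paste(strsplit(Symbols, ",")[[1]], collapse = "|"), ")$")
    hits <- df2[grepl(pattern, df2$Symbol), ]
    # Pick within the matched rows only; which.max takes the first row on ties
    hits$Symbol[which.max(hits$Score)]
  })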
# > result
# # A tibble: 3 x 3
# # Rowwise:
# SNP Symbols Highest.Score
# <chr> <chr> <chr>
# 1 rs11807834 GRIN1,SETD1A GRIN1
# 2 rs3729986 MADD,STAC3,SPI1 MADD
# 3 rs61937595 NDUFA4L2,STAC3,CAMK2N1 STAC3
uj5u.com user reply:
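I fixed df2. I'm sure my approach isn't the most "programmatic" one, and there is probably a better way to do this with pivot, join and mutate functions, but this code works:
For each row of Highest.Score (the column created on the line just above the loop), the loop filters df2 down to the symbols listed in that row's df1$Symbols. The unlist(str_split()) call splits the symbol string into a vector of separate strings by detecting the "," pattern. Only the first matching row is then retrieved. This relies on df2 already being ordered from the highest score down; otherwise an arrange(desc(Score)) step can be added to sort explicitly.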
library(dplyr)    # filter, slice_head, pull, %>%
library(stringr)  # str_split
df1 <- data.frame(SNP = c("rs11807834", "rs3729986", "rs61937595"),
                  "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1",
                                "NDUFA4L2,STAC3,CAMK2N1"))
# This code was corrected to match the example
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3",
                               "CAMK2N1", "NDUFA4L2", "SPI1"),
                  Score = c(167, 160, 164, 12, 3, 0, 0))
df3 <- df1
df3$Highest.Score <- NA
for (i in seq_len(nrow(df1))) {
  # Keep the df2 rows whose Symbol appears in row i, then take the first one
  df3$Highest.Score[i] <- df2 %>%
    filter(Symbol %in% unlist(str_split(df1$Symbols[i], ","))) %>%
    slice_head(n = 1) %>%
    pull(Symbol)
}
print(df3)
Output:
> print(df3)
SNP Symbols Highest.Score
1 rs11807834 GRIN1,SETD1A GRIN1
2 rs3729986 MADD,STAC3,SPI1 MADD
3 rs61937595 NDUFA4L2,STAC3,CAMK2N1 STAC3
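The loop returns the right symbol only because the df2 rows matching each SNP happen to appear in descending score order. A sketch of the explicit sort mentioned in the explanation, assuming dplyr and stringr are installed. df2_shuffled is a hypothetical reordering of df2, used only to show that the result no longer depends on row order:

```r
library(dplyr)
library(stringr)

df1 <- data.frame(SNP = c("rs11807834", "rs3729986", "rs61937595"),
                  Symbols = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"),
                  stringsAsFactors = FALSE)

# Hypothetical: the same scores as df2, deliberately listed in ascending order
df2_shuffled <- data.frame(
  Symbol = c("SPI1", "NDUFA4L2", "CAMK2N1", "STAC3", "SETD1A", "MADD", "GRIN1"),
  Score = c(0, 0, 3, 12, 160, 164, 167),
  stringsAsFactors = FALSE
)

df3 <- df1
df3$Highest.Score <- NA_character_
for (i in seq_len(nrow(df1))) {
  df3$Highest.Score[i] <- df2_shuffled %>%
    filter(Symbol %in% str_split(df1$Symbols[i], ",")[[1]]) %>%
    arrange(desc(Score)) %>%  # sort explicitly instead of relying on row order
    slice_head(n = 1) %>%
    pull(Symbol)
}
print(df3)
```

Without the arrange() step, the same loop run against df2_shuffled would pick the lowest-scoring symbol for each SNP.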
uj5u.com熱心網友回復:
用戶定義函式中的str_split
另一個選項:left_join
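library(dplyr)    # left_join, pull, rowwise, mutate, %>%
library(stringr)  # str_split
# Returns the highest Score among the comma-separated symbols in x
maxScore <- function(x) {
  data.frame(Symbol = unlist(str_split(x, ","))) %>%
    left_join(df2, by = "Symbol") %>%
    pull(Score) %>%
    max()
}
# MS holds the maximum score for each SNP (the score, not the symbol)
df1 %>% rowwise() %>% mutate(MS = maxScore(Symbols))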
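Note that maxScore() yields the highest score rather than the symbol carrying it, which is what the question asked for. A sketch of a symbol-returning variant, assuming the df1 and df2 from the question. The name maxSymbol and its scores argument are hypothetical, and match() stands in for the join purely for brevity:

```r
library(dplyr)
library(stringr)

df1 <- data.frame(SNP = c("rs11807834", "rs3729986", "rs61937595"),
                  Symbols = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Symbol = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"),
                  Score = c(167, 160, 164, 12, 3, 0, 0),
                  stringsAsFactors = FALSE)

# Hypothetical variant of maxScore(): returns the top-scoring symbol
maxSymbol <- function(x, scores = df2) {
  symbols <- str_split(x, ",")[[1]]
  # Look up each symbol's score; symbols missing from `scores` become NA
  s <- scores$Score[match(symbols, scores$Symbol)]
  # which.max skips NA values and takes the first maximum on ties
  symbols[which.max(s)]
}

result <- df1 %>%
  rowwise() %>%
  mutate(Highest.Score = maxSymbol(Symbols)) %>%
  ungroup()
print(result)
```

If none of a row's symbols appear in df2, which.max() returns an empty result and the mutate() call fails, so real data may need a guard for that case.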