I have two DataFrames that I am trying to join, and I want to assign a flag column based on the joined result.
demandDF1
+--------+-----------+---------+
|rgn_nm  |file_crt_dt|file_vrsn|
+--------+-----------+---------+
|DAO     |2022-06-30 |1        |
|DAO     |2022-06-30 |1        |
|CCC     |2022-06-30 |1        |
|APCC    |2022-06-30 |1        |
|ODM     |2022-06-29 |3        |
|EMF     |2022-06-30 |1        |
|T2Region|2022-06-29 |4        |
|BCC     |2022-06-30 |1        |
|EMF     |2022-07-01 |1        |
+--------+-----------+---------+
outputDistinctDF
+------+-----------+---------+
|region|file_crt_dt|file_vrsn|
+------+-----------+---------+
|DAO   |2022-06-30 |1        |
|CCC   |2022-06-29 |1        |
|APCC  |2022-06-30 |1        |
|ODM   |2022-06-29 |2        |
|EMF   |2022-06-30 |1        |
|BCC   |2022-06-30 |1        |
+------+-----------+---------+
I am trying to produce a result like the following.
+------------+-----------------+---------------+-------------+------------------+----------------+----+
|input_region|input_file_crt_dt|input_file_vrsn|output_region|output_file_crt_dt|output_file_vrsn|flag|
+------------+-----------------+---------------+-------------+------------------+----------------+----+
|DAO         |2022-06-30       |1              |DAO          |2022-06-30        |1               |0   |
|CCC         |2022-06-30       |1              |CCC          |2022-06-29        |1               |1   |
|T2Region    |2022-06-29       |4              |null         |null              |null            |1   |
|ODM         |2022-06-29       |3              |ODM          |2022-06-29        |2               |1   |
|APCC        |2022-06-30       |1              |APCC         |2022-06-30        |1               |0   |
|EMF         |2022-07-01       |1              |EMF          |2022-06-30        |1               |1   |
|EMF         |2022-06-30       |1              |EMF          |2022-06-30        |1               |0   |
|BCC         |2022-06-30       |1              |BCC          |2022-06-30        |1               |0   |
+------------+-----------------+---------------+-------------+------------------+----------------+----+
Logic:
if (input_file_crt_dt > output_file_crt_dt) OR
(input_file_crt_dt = output_file_crt_dt AND input_file_vrsn > output_file_vrsn) OR
(output_region is null)
then flag = 1, otherwise flag = 0
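The rule above can be sketched in plain Scala, without Spark, to make the three cases explicit. This is only an illustration: the `FlagRuleSketch` object, the `FileInfo` type, and the `flag` and `day` helpers are hypothetical names that do not appear in the original code, and a missing join partner is modeled as `None`.

```scala
import java.time.LocalDate

object FlagRuleSketch {
  // Hypothetical record type, introduced only for this illustration.
  final case class FileInfo(region: String, createdOn: LocalDate, version: Int)

  // Small helper to keep the examples short.
  def day(s: String): LocalDate = LocalDate.parse(s)

  // Returns 1 when the input row should be flagged, 0 otherwise.
  // `output` is None when the left join found no matching region.
  def flag(input: FileInfo, output: Option[FileInfo]): Int =
    output match {
      case None => 1
      case Some(out) =>
        val newerDate = input.createdOn.isAfter(out.createdOn)
        val sameDateNewerVersion =
          input.createdOn.isEqual(out.createdOn) && input.version > out.version
        if (newerDate || sameDateNewerVersion) 1 else 0
    }

  def main(args: Array[String]): Unit = {
    val dao = FileInfo("DAO", day("2022-06-30"), 1)
    val odmIn = FileInfo("ODM", day("2022-06-29"), 3)
    val odmOut = FileInfo("ODM", day("2022-06-29"), 2)
    val t2 = FileInfo("T2Region", day("2022-06-29"), 4)

    println(flag(dao, Some(dao)))      // prints 0: same date, same version
    println(flag(odmIn, Some(odmOut))) // prints 1: same date, higher version
    println(flag(t2, None))            // prints 1: no matching output region
  }
}
```

These three calls correspond to the DAO, ODM, and T2Region rows of the desired result table.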
I tried the following code, but it ended up giving an error.
The steps I followed:
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

val demandDF1 = Seq(
  ("DAO", "2022-06-30", "1"),
  ("DAO", "2022-06-30", "1"),
  ("CCC", "2022-06-30", "1"),
  ("APCC", "2022-06-30", "1"),
  ("ODM", "2022-06-29", "3"),
  ("EMF", "2022-06-30", "1"),
  ("T2Region", "2022-06-29", "4"),
  ("BCC", "2022-06-30", "1"),
  ("EMF", "2022-07-01", "1")
).toDF("rgn_nm", "file_crt_dt", "file_vrsn")
  .withColumn("file_crt_dt", col("file_crt_dt").cast("date"))
  .withColumn("file_vrsn", col("file_vrsn").cast("int"))

val outputDistinctDF = Seq(
  ("DAO", "2022-06-30", "1"),
  ("CCC", "2022-06-29", "1"),
  ("APCC", "2022-06-30", "1"),
  ("ODM", "2022-06-29", "2"),
  ("EMF", "2022-06-30", "1"),
  ("BCC", "2022-06-30", "1")
).toDF("region", "file_crt_dt", "file_vrsn")
  .withColumn("file_crt_dt", col("file_crt_dt").cast("date"))
  .withColumn("file_vrsn", col("file_vrsn").cast("int"))

// Drop the duplicate DAO row before joining.
val inputDistinctDF = demandDF1.select(col("rgn_nm"), col("file_crt_dt"), col("file_vrsn")).distinct()

val resultantDF = inputDistinctDF
  .join(outputDistinctDF,
    inputDistinctDF.col("rgn_nm") === outputDistinctDF.col("region"),
    "left_outer")
  .select(
    inputDistinctDF.col("rgn_nm") as "input_region",
    inputDistinctDF.col("file_crt_dt") as "input_file_crt_dt",
    inputDistinctDF.col("file_vrsn") as "input_file_vrsn",
    outputDistinctDF.col("region") as "output_region",
    outputDistinctDF.col("file_crt_dt") as "output_file_crt_dt",
    outputDistinctDF.col("file_vrsn") as "output_file_vrsn")
  .withColumn("flag",
    when(
      col("output_region").isNull ||
        col("input_file_crt_dt").gt(col("output_file_crt_dt")) ||
        (col("input_file_crt_dt").eq(col("output_file_crt_dt")) &&
          col("input_file_vrsn").gt(col("output_file_vrsn"))),
      lit("1")).otherwise(lit("0")))
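Reply from a uj5u.com user:
I went through your condition carefully and pulled it out into a separate variable in case something was wrong with it. But everything worked fine.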
import org.apache.spark.sql.functions.when
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named `spark`

// Flag a row when the input file is newer than the output file,
// or when the left join found no matching output region.
val cond = (
  ($"input_file_crt_dt" > $"output_file_crt_dt") ||
  (($"input_file_crt_dt" === $"output_file_crt_dt") &&
    ($"input_file_vrsn" > $"output_file_vrsn")) ||
  $"output_region".isNull
)

val resultantDF = inputDistinctDF
  .join(
    outputDistinctDF,
    inputDistinctDF.col("rgn_nm") === outputDistinctDF.col("region"),
    "left_outer")
  .select(
    inputDistinctDF.col("rgn_nm") as "input_region",
    inputDistinctDF.col("file_crt_dt") as "input_file_crt_dt",
    inputDistinctDF.col("file_vrsn") as "input_file_vrsn",
    outputDistinctDF.col("region") as "output_region",
    outputDistinctDF.col("file_crt_dt") as "output_file_crt_dt",
    outputDistinctDF.col("file_vrsn") as "output_file_vrsn")
  .withColumn("flag", when(cond, "1").otherwise("0"))

resultantDF.show()
// +------------+-----------------+---------------+-------------+------------------+----------------+----+
// |input_region|input_file_crt_dt|input_file_vrsn|output_region|output_file_crt_dt|output_file_vrsn|flag|
// +------------+-----------------+---------------+-------------+------------------+----------------+----+
// |         BCC|       2022-06-30|              1|          BCC|        2022-06-30|               1|   0|
// |        APCC|       2022-06-30|              1|         APCC|        2022-06-30|               1|   0|
// |    T2Region|       2022-06-29|              4|         null|              null|            null|   1|
// |         EMF|       2022-06-30|              1|          EMF|        2022-06-30|               1|   0|
// |         ODM|       2022-06-29|              3|          ODM|        2022-06-29|               2|   1|
// |         CCC|       2022-06-30|              1|          CCC|        2022-06-29|               1|   1|
// |         EMF|       2022-07-01|              1|          EMF|        2022-06-30|               1|   1|
// |         DAO|       2022-06-30|              1|          DAO|        2022-06-30|               1|   0|
// +------------+-----------------+---------------+-------------+------------------+----------------+----+
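One possible explanation for why this version works while the question's version fails, offered as an assumption because the question does not quote the error message: the question compares the two date columns with `.eq(...)`, and the code above uses `===`. In Scala, `eq` is the reference-equality method that every object inherits from `AnyRef`, and it returns a plain `Boolean` immediately. `===` is the method `Column` defines to build an equality expression. A `Boolean` on the left of `&&` does not accept a `Column` on the right, which would produce a type mismatch at compile time. The sketch below shows the types involved; the object name and the column names `x` and `y` are placeholders, and it only needs spark-sql on the classpath.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

object EqVersusTripleEquals {
  val x: Column = col("x")
  val y: Column = col("y")

  // AnyRef.eq compares object references right away and yields a Boolean.
  // x and y are two distinct Column objects, so this is false.
  val referenceEquality: Boolean = x.eq(y)

  // Column.=== builds an equality expression that Spark evaluates per row,
  // which is the kind of value when(...) expects.
  val columnEquality: Column = x === y

  // Column.equalTo is the method-style spelling of the same comparison.
  val columnEqualTo: Column = x.equalTo(y)
}
```

If that is indeed the cause, replacing `.eq(` with `===` or `.equalTo(` in the question's `when(...)` condition should be enough, and the rest of the original code can stay as it is.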