I am trying to add both the Kafka and MongoDB packages when submitting a Dataproc PySpark job, but this fails. So far I have been using only the Kafka package and the job runs fine, but when I try to add the MongoDB package to the command below, it throws an error.
The command works fine with only the Kafka package:
gcloud dataproc jobs submit pyspark main.py \
--cluster versa-structured-stream \
--properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true
I have tried several options to add both packages, but none of them work, e.g.:
--properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
--region us-east1 \
--py-files streams.zip,utils.zip
Error:
Traceback (most recent call last):
File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 303, in <module>
sys.exit(main())
File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 260, in main
df_stream = spark.readStream.format('kafka') \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 482, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming Kafka Integration Guide".
How can I make this work?
TIA!
Answer:
In your --properties, you have defined ^#^ as the delimiter. To use it correctly, you need to change the separator between properties from , to #; you only use , when defining multiple values within a single key. You also need to remove the spark: prefix from the property names. See the sample command below:
gcloud dataproc jobs submit pyspark main.py \
--cluster=cluster-3069 \
--region=us-central1 \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.executor.memory=20g#spark.driver.memory=5g#spark.executor.cores=2
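To verify what actually got applied, you can describe the submitted job; the job ID below is a placeholder:
gcloud dataproc jobs describe <job-id> --region=us-central1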
Checking the job configuration this way shows both packages and all of the Spark properties applied as expected.
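For context, here is a minimal sketch of the kind of Kafka-to-MongoDB stream main.py might run once both packages resolve. The brokers, topic, Mongo URI, and checkpoint path are placeholders, not values from the original job, and because mongo-spark-connector 3.0.x has no native streaming sink, each micro-batch is written with the batch writer via foreachBatch:

# Hedged sketch; all endpoints below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("versa-structured-stream")
         # Placeholder output URI read by mongo-spark-connector 3.0.x.
         .config("spark.mongodb.output.uri", "mongodb://host:27017/db.collection")
         .getOrCreate())

# Requires spark-sql-kafka-0-10 on the classpath, which is exactly what
# the spark.jars.packages property above provides.
df_stream = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
             .option("subscribe", "in-topic")                   # placeholder
             .load())

# Connector 3.0.x only has a batch writer, so flush each micro-batch
# to MongoDB from inside foreachBatch.
def write_batch(batch_df, batch_id):
    (batch_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
     .write.format("mongo").mode("append").save())

(df_stream.writeStream
 .foreachBatch(write_batch)
 .option("checkpointLocation", "/tmp/kafka-mongo-ckpt")  # placeholder
 .start()
 .awaitTermination())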
Tags: apache-spark google-cloud-platform dependency-management spark-structured-streaming data-processing