
Some hints on Dataproc

ROBIN DONG, posted on September 3, 2021, 11:54 | Hits: 146
Tag: bigdata | GCP | Hive | PySpark
1. When running a job on a Dataproc cluster, it reported:
java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY.

The reason is that I hadn't added the jar file for the BigQuery connector. After adding the jar under properties in the cluster-creation template:

properties:
          spark:spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.18.1.jar

the job starts to read data from BigQuery tables.

Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar, because it will hang your job when you are reading BigQuery tables. It seems even Google makes significant mistakes in their cloud platform :p
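Once the connector jar is on the classpath, reading a table is straightforward. Here is a minimal sketch of what the PySpark side looks like; the project, dataset, and table names are placeholders of my own, and the helper function is just for illustration:

```python
def bigquery_table_ref(project, dataset, table):
    """Build the fully qualified table id the BigQuery connector expects."""
    return f"{project}.{dataset}.{table}"

# On a Dataproc cluster with the connector jar configured, a job
# would then read the table roughly like this:
#
#   from pyspark.sql import SparkSession
#
#   spark = SparkSession.builder.appName("bq-read").getOrCreate()
#   df = (spark.read.format("bigquery")
#         .option("table", bigquery_table_ref("my-project", "my_dataset", "my_table"))
#         .load())
#   df.show()
```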

2. What should we do if a PySpark job needs additional Python packages on the Dataproc cluster?

We still need to add a few more items to the template so the initialization action installs the pip packages:

    clusterName: robin
    config:
      gceClusterConfig:
        metadata:
          enable-cloud-sql-proxy-on-workers: 'false'
          use-cloud-sql-private-ip: 'false'
          PIP_PACKAGES: 'google-cloud-storage google-api-python-client google-auth'
      initializationActions:
      - executableFile: gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh
        executionTimeout: 600s
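The pip-install.sh initialization action reads the PIP_PACKAGES metadata value as a space-separated list of package names. A tiny sketch of that mapping (the helper is my own, not part of the init action script):

```python
def parse_pip_packages(metadata_value):
    """Split the space-separated PIP_PACKAGES metadata into package names."""
    return metadata_value.split()

packages = parse_pip_packages(
    "google-cloud-storage google-api-python-client google-auth")
# → ['google-cloud-storage', 'google-api-python-client', 'google-auth']
```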

3. To see how a Hive table was created:

show create table <table>;
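The same statement can also be issued from a PySpark job against the Hive metastore. A minimal sketch, where the table name is a placeholder and the helper is mine:

```python
def show_create_statement(table):
    """Build the HiveQL statement that prints a table's DDL."""
    return f"SHOW CREATE TABLE {table}"

# On a cluster with Hive support enabled:
#   spark.sql(show_create_statement("my_db.my_table")).show(truncate=False)
```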

Original link: http://blog.donghao.org/2021/09/03/some-hints-on-dataproc/
