Dataproc job using PySpark cluster mode fails

Problem

When submitting a PySpark job to Dataproc in cluster mode, the job fails with the following error:

'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'



Traceback (most recent call last):
  File "smart_reorder_invite_main.py", line 277, in <module>
    main.spark.sql(data_storage_sql.format(param=reminder_tbl))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

Environment

  • Dataproc cluster on image version 1.5
  • PySpark cluster mode
  • Spark BigQuery connector version prior to 0.19.1

Solution

You must upgrade to Spark BigQuery connector version 0.19.1.
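One way to pick up the fixed connector is to pass its jar explicitly at submit time, from the public gs://spark-lib bucket where Google hosts the connector builds. A sketch, assuming the Scala 2.12 build (the one matching Dataproc 1.5) and placeholder cluster name and region:

gcloud dataproc jobs submit pyspark smart_reorder_invite_main.py \
  --cluster=example-cluster \
  --region=us-central1 \
  --jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.19.1.jar \
  --properties=spark.submit.deployMode=cluster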

Workaround:
Log in to a cluster VM and run the PySpark job from there instead of submitting it in cluster mode.
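A sketch of that workaround, assuming the default Dataproc master node name (<cluster-name>-m) and a placeholder zone:

# SSH to the cluster's master node
gcloud compute ssh example-cluster-m --zone=us-central1-a
# On the VM, run the job directly (spark-submit defaults to client mode)
spark-submit smart_reorder_invite_main.py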

Cause

Some newer dependencies introduced in version 0.18.0 of the Spark BigQuery connector caused networking issues, including the reported "Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient" error. Those dependencies were rolled back in version 0.19.1, which resolves the issue. See the comments in the documentation below.