Problem
When submitting a job to Dataproc in PySpark cluster mode, you get the following error:

```
Traceback (most recent call last):
  File "smart_reorder_invite_main.py", line 277, in <module>
    main.spark.sql(data_storage_sql.format(param=reminder_tbl))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
```
Environment
- Dataproc image version 1.5 cluster receiving submitted jobs
- Jobs submitted in PySpark cluster mode
- Spark BigQuery connector version earlier than 0.19.1
Solution
Upgrade the Spark BigQuery connector to version 0.19.1 or later, which rolls back the dependencies that cause this error.
Workaround:
If you cannot upgrade the connector, log into a cluster VM over SSH and execute the PySpark job there directly.
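As a sketch, pinning the fixed connector version at submit time might look like the following. The cluster name, region, and script name are placeholders; the connector is pulled by its Maven coordinates (the `_2.12` artifact matches the Scala version shipped with Dataproc 1.5):

```shell
# Submit the PySpark job with the Spark BigQuery connector pinned to 0.19.1,
# the release that rolled back the problematic dependencies.
# my-cluster, us-central1, and the script name are placeholders.
gcloud dataproc jobs submit pyspark smart_reorder_invite_main.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties=spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.19.1
```

Pinning an explicit version in `spark.jars.packages` keeps the job from picking up whatever connector release happens to be pre-installed on the cluster image.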
Cause
Version 0.18.0 of the Spark BigQuery connector introduced newer dependencies that caused networking issues, including the reported `Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient` error. These dependencies were rolled back in version 0.19.1, which resolves the issue. See the comments in the documentation below.