Problem
Spark SQL queries that read Hive ORC tables fail after migrating an application from Spark 2.3 to Spark 2.4 on Dataproc.
Environment
- Dataproc 1.5 and higher
- Spark SQL 2.4 or higher
Solution
- The Spark log and YARN application log in the diagnostic tar file don't show errors because the application itself doesn't fail; only aggregation operations such as joins do.
- Collect the Spark event log for the application. In the diagnostic TAR file, the event log is stored in the folder location specified by the spark.eventLog.dir property in spark-defaults.conf.
- Import the collected event log into the spark.eventLog.dir location of an in-house cluster. The Spark application event log then appears in the Spark History Server, which lets you analyze the DAG (see the example after this list).
- Alternatively, you can analyze the DAG directly in the customer environment using the Spark History Server interface.
- From the Spark SQL query DAG, you should be able to identify the tables Spark fails to read.
- The issue can also be resolved by manually setting spark.sql.hive.convertMetastoreOrc to false before running the Spark SQL query (see the session-level example after this list).
- This property defaults to true in Spark 2.4 and higher for better performance, while in Spark 2.3 it defaults to false.
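For example, assuming the event logs live in a Cloud Storage bucket (the bucket names and application ID below are hypothetical), you could locate and import the event log like this:
# Find the configured event log location (on a Dataproc node):
grep spark.eventLog.dir /etc/spark/conf/spark-defaults.conf
# Hypothetical output: spark.eventLog.dir=gs://source-bucket/spark-events

# Copy the failing application's event log into the in-house cluster's
# spark.eventLog.dir location so the Spark History Server can load it:
gsutil cp gs://source-bucket/spark-events/application_1234567890123_0001 \
    gs://inhouse-bucket/spark-events/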
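To set the property for a single session, you can pass it when launching the Spark SQL shell (a minimal sketch; the table name my_orc_table is hypothetical):
spark-sql \
    --conf spark.sql.hive.convertMetastoreOrc=false \
    -e "SELECT * FROM my_orc_table LIMIT 10"
The same --conf flag also works with spark-shell and spark-submit.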
Or you can set spark.sql.hive.convertMetastoreOrc to false as part of the cluster build.
Example using Google Cloud CLI:
gcloud dataproc clusters create my-cluster \
    --region=us-west1 \
    --properties=spark:spark.sql.hive.convertMetastoreOrc=false \
    other args ...
Cause
This type of issue often occurs when migrating an application from Spark 2.3 to 2.4. The table scan failure causes the SELECT query to fail.
Starting from Spark version 2.4, the default value of spark.sql.hive.convertMetastoreOrc was changed to true. This property controls whether Spark uses its built-in ORC reader and writer for Hive tables stored in the ORC format instead of the Hive SerDe; with the built-in reader, data for some column types can be read back as string values.
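To check which value is in effect on a given cluster, a SET statement with no value echoes the property and its current setting:
spark-sql -e "SET spark.sql.hive.convertMetastoreOrc"
# Expected output on Spark 2.4 and higher (default):
# spark.sql.hive.convertMetastoreOrc	true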