Data Analytics

Managing Java dependencies for Apache Spark applications on Cloud Dataproc

August 14, 2018

Julien Phalip

Solutions Architect, Google Cloud

It is common for Apache Spark applications to depend on third-party Java or Scala libraries. When you submit a Spark job to a Cloud Dataproc cluster, the simplest method you can use to include these dependencies is to list them in the following ways:

If you submit a job via the Cloud Dataproc jobs command in the Cloud SDK, you can provide the --properties spark.jars.packages=[DEPENDENCIES] parameter to the Cloud Dataproc submit command. For example:

gcloud dataproc jobs submit spark \ --cluster my-cluster \ --properties spark.jars.packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'
If you directly submit a job from inside a Cloud Dataproc master instance, you can provide the --packages=[DEPENDENCIES] parameter to the spark-submit command. For example:

spark-submit --packages='com.google.cloud:google-cloud-translate:1.35.0,org.apache.bahir:spark-streaming-pubsub_2.11:2.2.0'

However, the method above may not work in situations where the Spark application’s dependencies conflict with Hadoop’s own dependencies. This conflict results from the fact that Hadoop injects its dependencies into the application’s classpath, and therefore Hadoop’s dependencies take precedence over the application’s dependencies. This prioritization in turn can cause some errors such as NoSuchMethodError.

One common example of such conflicts is with Guava, the Google core library for Java, which is used by many libraries and frameworks, including Hadoop itself. This can be a problem if your job or its dependencies require a version of Guava that is newer than the one used by Hadoop.

This issue is resolved in Hadoop v3.0, but applications that rely on older versions of Hadoop must use a workaround to avoid dependency conflicts.

The workaround consists of two parts:

Create an “uber” JAR, also commonly referred to as a “fat” JAR, that is, a single JAR file that contains not only the application’s package but also all of its dependencies.
Relocate the conflicting dependency packages within the uber JAR to prevent their path names from conflicting with those of Hadoop’s dependency packages. Some plugins are available to automatically operate this relocation during the packaging process so that you don’t have to modify your code. This relocation process is often referred to as “shading”.

In the next sections you learn different methods for creating shaded uber JARs.

Creating a shaded uber JAR with Maven

Apache Maven is a popular package management tool for building Java applications. Maven can also be used for building applications written in Scala, which is the language used by Spark applications. For that you must use a plugin such as the Maven scala plugin. To create a shaded JAR, you also must use another plugin such as the Maven shade plugin.

Below is an example of a pom.xml configuration file to shade the Guava library, which is located in the com.google.common package. This configuration instructs Maven to rename the com.google.common package to repackaged.com.google.common and to automatically update all references to the classes from the original package.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>
  <groupId>YOUR_GROUP_ID</groupId>
  <artifactId>YOUR_ARTIFACT_ID</artifactId>
  <version>YOUR_PACKAGE_VERSION</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>YOUR_SPARK_VERSION</version>
      <scope>provided</scope>
    </dependency>

</dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>YOUR_SCALA_VERSION</scalaVersion>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
              <mainClass>YOUR_APPLICATION_MAIN_CLASS</mainClass>
                </transformer>
              </transformers>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/maven/**</exclude>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <relocations>
                <relocation>
                  <pattern>com</pattern>   <shadedPattern>repackaged.com.google.common</shadedPattern>
                  <includes>
                    <include>com.google.common.**</include>
                  </includes>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Execute the following command to run the build:

mvn package

Notes:

The ManifestResourceTransformer is a resource processor that allows the addition of attributes in the MANIFEST.MF of the created uber JAR. It also allows to specify the entry point for your application.
It is recommended to use the provided scope for Spark, since Spark is already installed on Cloud Dataproc.
It is recommended to specify the same version of Spark as the one that is already installed on your Cloud Dataproc cluster. Refer to the documentation to see the list of available versions. If your application requires a different version of Spark than the one that is pre-installed on Cloud Dataproc, then you should consider writing a custom initialization action or a custom image to install the required version.
The <filters> entries exclude signature files from your dependencies’ META-INF directories. Without these filters, you might see an exception “java.lang.SecurityException: Invalid signature file digest for Manifest main attributes” at run time, as those signature files would be invalid in the context of your uber JAR.
In practice, you might find that you have to shade multiple libraries. For that, you may include multiple paths, for example below with both Guava and Protobuf:

Creating a shaded uber JAR with SBT

SBT is a popular tool for building Scala applications. To create a shaded JAR with SBT, you must use the sbt-assembly plugin by adding the following line in a file located in project/plugins.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")

Below is an example of a build.sbt file to shade the Guava library, which is located in the com.google.common package:

Execute the following command to run the build:

sbt assembly

You might find in some cases, however, that the shade rule in the above example is not enough to solve all dependency conflicts. This is because SBT uses strict conflict resolution strategies. Therefore, you might have to provide more granular rules, where each rule explicitly merges specific types of conflicting files by using one of the available strategies (most typically MergeStrategy.first, last, concat, filterDistinctLines, rename, or discard).

Finally, as mentioned earlier, in practice you might have to shade multiple libraries, for example below with both Guava and Protobuf:

Submitting the uber JAR to Cloud Dataproc

Once you have created a shaded uber JAR that contains your Spark applications and all its dependencies, you are ready to submit a job to Cloud Dataproc:

Using the jobs command

The recommended way to run Spark jobs on Cloud Dataproc is through the jobs command in the Cloud SDK. To submit a new job, run the following command:

gcloud dataproc jobs submit spark \ --cluster [CLUSTER_NAME] \ --jar [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Using spark-submit

You can also submit Spark jobs on Cloud Dataproc by first establishing an SSH connection to a master instance and then running the spark-submit command:

spark-submit [PATH_TO_YOUR_UBER_JAR]/[YOUR_UBER_JAR].jar

Conclusion

We hope that the information in this article will help you manage Java dependencies for Spark applications and resolve any conflicts that you might run into! For a concrete working example, check out spark-translate, a sample Spark application that contains configuration files for both Maven and SBT.

Posted in