Following are the key points described later in this article:
Read the Spark cluster mode overview page to understand the fundamentals of how Spark runs on clusters. Note that this article demonstrates how to run a Spark cluster using Spark’s standalone cluster manager. The following diagram depicts the setup described in this article.
The diagram shows three Docker containers: one for the driver program, one hosting the cluster manager (master), and one for the worker program. One could also run and test the cluster setup with just two containers, one for the master and another for the worker node. In that case, one could start the driver program (SparkContext) in either the master or the worker node with a command such as spark-shell --master spark://192.168.99.100:7077. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it has submitted the application, without waiting for the application to finish.
To access the web UI of a driver program (spark-shell) running in the worker node, one may need to add a port mapping such as “4041:4040” under the worker node entry in the docker-compose file and then open the URL 192.168.99.100:4041.
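The difference between the two deploy modes shows up on the spark-submit command line. The sketch below only assembles and prints the command for each mode; it does not submit anything. The class name com.example.HelloSpark and the jar path /tmp/data/hello-spark.jar are hypothetical placeholders, and the master URL is the one used elsewhere in this article.

```shell
#!/bin/sh
# Master URL used throughout this article.
MASTER_URL="spark://192.168.99.100:7077"
# Hypothetical application class and jar, for illustration only.
APP_CLASS="com.example.HelloSpark"
APP_JAR="/tmp/data/hello-spark.jar"

# Print the spark-submit command for the given deploy mode (client or cluster).
build_submit_cmd() {
  case "$1" in
    client|cluster) ;;
    *) echo "unknown deploy mode: $1" >&2; return 1 ;;
  esac
  echo "spark-submit --master $MASTER_URL --deploy-mode $1 --class $APP_CLASS $APP_JAR"
}

build_submit_cmd client   # driver runs inside the submitting process
build_submit_cmd cluster  # driver is launched on a worker in the cluster
```

Note that spark-shell itself always runs in client mode, so the shell-based steps in this article place the driver in whichever container the shell is started from.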
That said, the following cluster managers are supported:
The following steps need to be taken to test your first Spark program using the Spark shell as the driver program.
Note the following points when setting up the image.
FROM ubuntu:14.04

RUN apt-get -y update
RUN apt-get -y install curl
RUN apt-get -y install software-properties-common

# JAVA
RUN \
  echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
  add-apt-repository -y ppa:webupd8team/java && \
  apt-get update && \
  apt-get install -y oracle-java8-installer && \
  rm -rf /var/lib/apt/lists/* && \
  rm -rf /var/cache/oracle-jdk8-installer

ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
ENV PATH $PATH:$JAVA_HOME/bin

# RUN curl -s --insecure \
#   --header "Cookie: oraclelicense=accept-securebackup-cookie;" ${JAVA_ARCHIVE} \
#   | tar -xz -C /usr/local/ && ln -s $JAVA_HOME /usr/local/java

# SPARK
ARG SPARK_ARCHIVE=http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
ENV SPARK_HOME /usr/local/spark-2.0.0-bin-hadoop2.7
ENV PATH $PATH:${SPARK_HOME}/bin
RUN curl -s ${SPARK_ARCHIVE} | tar -xz -C /usr/local/

WORKDIR $SPARK_HOME
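Building the image from this Dockerfile could look like the sketch below. The tag "spark" matches the image name that the docker-compose file refers to. The helper name build_spark_image and the SPARK_ARCHIVE_URL variable are assumptions introduced here for illustration; the variable simply feeds the Dockerfile's SPARK_ARCHIVE build argument in case the default download URL is unavailable.

```shell
#!/bin/sh
# Build the image from the Dockerfile in the given directory (default: the
# current directory) and tag it "spark".
# SPARK_ARCHIVE_URL is a hypothetical, optional override for the Dockerfile's
# SPARK_ARCHIVE build argument. The archive must still unpack to
# spark-2.0.0-bin-hadoop2.7, because SPARK_HOME is fixed in the Dockerfile.
build_spark_image() {
  context="${1:-.}"
  if [ -n "${SPARK_ARCHIVE_URL:-}" ]; then
    docker build --build-arg "SPARK_ARCHIVE=$SPARK_ARCHIVE_URL" -t spark "$context"
  else
    docker build -t spark "$context"
  fi
}
```

Calling build_spark_image from the folder that contains the Dockerfile should leave an image named spark in the local image list, which can be checked with docker images spark.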
Make a note of the following:
The following is the docker-compose file.
spark-master:
  image: spark
  command: bin/spark-class org.apache.spark.deploy.master.Master -h spark-master
  hostname: spark-master
  environment:
    MASTER: spark://spark-master:7077
    SPARK_CONF_DIR: /conf
    SPARK_PUBLIC_DNS: 192.168.99.100
  expose:
    - 7001
    - 7002
    - 7003
    - 7004
    - 7005
    - 7006
    - 7077
    - 6066
  ports:
    - 4040:4040
    - 6066:6066
    - 7077:7077
    - 8080:8080
  volumes:
    - ./conf/spark-master:/conf
    - ./data:/tmp/data

spark-worker-1:
  image: spark
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
  hostname: spark-worker-1
  environment:
    SPARK_CONF_DIR: /conf
    SPARK_PUBLIC_DNS: 192.168.99.100
    SPARK_WORKER_CORES: 2
    SPARK_WORKER_MEMORY: 2g
    SPARK_WORKER_PORT: 8881
    SPARK_WORKER_WEBUI_PORT: 8081
  links:
    - spark-master
  expose:
    - 7012
    - 7013
    - 7014
    - 7015
    - 7016
    - 8881
  ports:
    - 8081:8081
  volumes:
    - ./conf/spark-worker-1:/conf
    - ./data:/tmp/data
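With the compose file in place, bringing the master and worker up might look like the sketch below. The helper names start_cluster and cluster_logs are made up for this example; they only wrap the standard docker-compose subcommands and assume they are run from the folder containing docker-compose.yml.

```shell
#!/bin/sh
# Start the master and worker containers in the background,
# then list their state.
start_cluster() {
  docker-compose up -d && docker-compose ps
}

# Print the logs of one service (spark-master by default,
# or spark-worker-1 when passed as the first argument).
cluster_logs() {
  docker-compose logs "${1:-spark-master}"
}
```

Given the port mappings above, once start_cluster succeeds the master web UI should be reachable at 192.168.99.100:8080 and the worker web UI at 192.168.99.100:8081.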
NOTE: While executing the “docker-compose up” command, you may get an error such as “Invalid Volume Specification”. To solve the problem, create a file named “.env” in the same folder as docker-compose.yml and put the following configuration in it (mind you, no whitespace around =): COMPOSE_CONVERT_WINDOWS_PATHS=1
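The fix described in the note can be scripted as follows. This is a minimal sketch; the function name write_compose_env is made up for illustration, and it overwrites any existing .env file in the target folder.

```shell
#!/bin/sh
# Write the .env file into the folder that holds docker-compose.yml
# (defaults to the current directory). Overwrites an existing .env.
# Note: there must be no whitespace around "=".
write_compose_env() {
  printf 'COMPOSE_CONVERT_WINDOWS_PATHS=1\n' > "${1:-.}/.env"
}
```

Run write_compose_env in the project folder and then retry docker-compose up.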
docker run -it -p 8088:8088 -p 8042:8042 -p 4041:4040 --name driver -h driver spark:latest bash
spark-shell --master spark://192.168.99.100:7077
You should see the following in the terminal.
Follow the steps below to run a “textFile” analysis in the cluster setup above, where the driver program runs in a separate container (call it the client or driver node) alongside the two containers running the master and the worker node.
spark-shell --master spark://192.168.99.100:7077
This starts a Spark application, registers the app with the master, and has the cluster manager (master) ask the worker node to start an executor.
val textFile = sc.textFile("/home/hellospark.txt")
textFile.count()
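For the count to work, the input file has to exist. Because /home/hellospark.txt is a local filesystem path, Spark expects it to be readable at that same path on the worker as well as on the driver. The sketch below creates a small sample file and prints its line count, which is the number textFile.count() should report. The helper names and the file contents are invented for this example.

```shell
#!/bin/sh
# Write a small three-line sample file to the given path
# (the contents are made up for illustration).
make_sample_file() {
  printf 'hello spark\nhello docker\nhello cluster\n' > "$1"
}

# Print the number of lines in the file; textFile.count() should match it.
expected_count() {
  wc -l < "$1" | tr -d ' '
}
```

For example, run make_sample_file /home/hellospark.txt in both the driver and the worker container, then compare the output of expected_count /home/hellospark.txt with the value returned in the Spark shell.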
The following links are useful for getting started: