Author: Sotirios Kontogiannis

MCSL Spark Cluster setup info

This Readme explains how to set up a Spark Worker for the hypatia4.math.uoi.gr master service. The hypatia4 master service offers the following capabilities:

  1. Main master service at: hypatia4.math.uoi.gr:7077 for executors
  2. Hadoop hdfs Standalone service at hypatia4.math.uoi.gr:9000
  3. Yarn scheduler service at hypatia4.math.uoi.gr:9870 for joining datanode workers
  4. Spark master service UI http service for the spark workers and executors' tasks at: http://hypatia4.math.uoi.gr:8080
  5. Spark local (hypatia4 only) historical service and UI at: http://hypatia4.math.uoi.gr:18080
  6. Spark local jobs (hypatia4 only) UI at: http://hypatia4.math.uoi.gr:4040

  • Connected Workers need to be configured with 1 CPU core and 600MB of RAM available per worker, as required by spark-shell and pyspark to work flawlessly

  • Executors can have variable CPU core requirements as well as memory requirements

  • Workers are Linux servers and Executors may be both Linux or Windows systems

This Readme includes the following setup information for Workers/Executors:

Table of Contents:

Versioning


** The master SPARK software version used is SPARK 3.3.1-bin-hadoop2 -for x64 and x32

** The Hadoop service version is hadoop-3.2.4 for x64 and x32

The version for the Windows executors is SPARK 3.2.2-bin-hadoop2.7, included in the executors/Windows folder

The version for the Linux executors is SPARK 3.3.1-bin-hadoop2, included in the executors/Linux folder


System Requirements (Windows/Linux)

  1. Installation of python3 (at least 3.8)

  2. Installation of at least Java openjdk 11 or Java 8 and creation of the JAVA_HOME environmental variable (Done automatically on setup)

  3. For Windows only: Add the python.exe executable path and the java binaries to the %PATH% of Windows (My Computer->Right click->Advanced System Settings->Environmental Variables)

  4. For windows only: Create a soft link for python.exe to python3.exe as follows (start->cmd.exe):

mklink "c:\Program Files\Python38\python3.exe" "c:\Program Files\Python38\python.exe"
  5. For Windows only: Check that you can execute the javac compiler and jre from the command line. If not, add the java\bin folder to the %PATH% and create the JAVA_HOME environmental variable pointing to the java root folder:
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;C:\Program Files\Java\jdk1.8.0_201\bin
  6. For Linux only: In the folder /etc/profile.d create a file called java.sh and add the following (JAVA_HOME):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
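To sanity-check a JAVA_HOME value before setting it, a small helper like the following can be used (a hypothetical illustration, not part of this repo; the candidate paths are examples from this guide):

```python
#!/usr/bin/env python3
# Hypothetical helper: locate a usable JAVA_HOME by checking candidate
# directories for a bin/java executable.
import os

def find_java_home(candidates):
    """Return the first candidate directory containing bin/java, else None."""
    for d in candidates:
        if os.path.isfile(os.path.join(d, "bin", "java")):
            return d
    return None

if __name__ == "__main__":
    print(find_java_home(["/usr/lib/jvm/java-11-openjdk-amd64",
                          r"C:\Program Files\Java\jdk1.8.0_201"]))
```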

Worker setup in Linux

** Workers must have either a static IP, an IP on the MCSL 192.168.0.x network, or a 10.7.0.9 IP on the MCSL VPN network **

Step 1: Download the prerequisites:

sudo apt install default-jdk scala git -y

Verify the installed dependencies by running these commands:

java -version; javac -version; scala -version; git --version

Step 2: Download from this repository the spark-3.3.1-bin-hadoop2.tgz file and uncompress it in the /opt folder:

mv spark-3.3.1-bin-hadoop2.tgz /opt
cd /opt;tar zxvf spark-3.3.1-bin-hadoop2.tgz
ln -s spark-3.3.1-bin-hadoop2 spark

Download the hadoop binary file and extract it also in the /opt folder (only the libraries need to be available):

mv hadoop-3.1.3.tar.gz /opt
cd /opt; tar zxvf hadoop-3.1.3.tar.gz
ln -s hadoop-3.1.3 hadoop

Step 3: Create a non root hadoop user (User named hadoop):

sudo adduser hadoop

Step 4: Enable passwordless SSH for the hadoop user:

su - hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost

Step 5: Give ownership of the /opt/spark folder to the hadoop user:

sudo chown -R hadoop:hadoop /opt/spark/*
sudo chown -R hadoop:hadoop /opt/spark /opt/spark-3.3.1-bin-hadoop2

and for hadoop also:

sudo chown -R hadoop:hadoop /opt/hadoop/*
sudo chown -R hadoop:hadoop /opt/hadoop /opt/hadoop-3.1.3

Step 6: Create the following scripts in the /etc/profile.d folder with the following contents, respectively:

-/etc/profile.d/spark.sh

export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_OPTS=-Djava.library.path=/opt/hadoop/lib/native

-/etc/profile.d/hadoop.sh

export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hadoop/lib/native

Step 7: Configure the spark worker in the /opt/spark/conf folder; as the hadoop user do:

cd /opt/spark/conf
cp spark-env.sh.template spark-env.sh

Add the following to the spark-env.sh file:

export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HDFS_URL="hdfs://hypatia4.math.uoi.gr:9000"
export SPARK_DRIVER_MEMORY=1G
export SPARK_WORKER_INSTANCES=24
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_WORKER_MEMORY=600M
export SPARK_EXECUTOR_MEMORY=500M
export SPARK_WORKER_CORES=1
export SPARK_EXECUTOR_CORES=1
export SPARK_MASTER_IP=195.130.112.214
export SPARK_PUBLIC_DNS=hypatia4.math.uoi.gr

*** Note: instead of 24, you can set SPARK_WORKER_INSTANCES=X to as many worker instances as you can afford ***
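A rough way to pick X is to divide the machine's free resources by the per-worker requirements (1 core, 600MB). The sketch below is an illustration only; the 1GB OS reserve is an assumption, not a project recommendation:

```python
# Rough sizing sketch: estimate how many 1-core / 600 MB workers a machine
# can host, leaving some headroom for the OS (os_reserve_mb is an assumption).
def max_worker_instances(total_ram_mb, total_cores,
                         worker_ram_mb=600, os_reserve_mb=1024):
    by_ram = (total_ram_mb - os_reserve_mb) // worker_ram_mb
    by_cpu = total_cores  # SPARK_WORKER_CORES=1, so one worker per core
    return max(0, min(by_ram, by_cpu))

# e.g. a 32 GB / 24-core server:
print(max_worker_instances(32 * 1024, 24))  # -> 24 (CPU-bound)
```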

Step 8: Install the pyspark python module

sudo pip3 install pyspark

Step 9: As root create a /etc/systemd/system/spark.service

sudo touch /etc/systemd/system/spark.service
vi /etc/systemd/system/spark.service

and add the following:

[Unit]
Description=Spark Worker service
After=syslog.target network-online.target

[Service]
User=hadoop
Group=users
Type=oneshot
#ExecStartPre=/opt/spark/sbin/start-master.sh
ExecStart=/opt/spark/sbin/start-worker.sh spark://195.130.112.214:7077
#ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-worker.sh
#ExecStop=/opt/spark/sbin/stop-history-server.sh
#ExecStopPost=/opt/spark/sbin/stop-master.sh
WorkingDirectory=/opt/spark
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
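If you set up many workers, the unit file above can be generated per machine. A minimal sketch (a hypothetical helper, not repo tooling; it omits the commented-out master/history-server lines):

```python
# Hypothetical generator for the spark.service unit file: fill in the
# master URL and Spark home per machine.
UNIT_TEMPLATE = """[Unit]
Description=Spark Worker service
After=syslog.target network-online.target

[Service]
User=hadoop
Group=users
Type=oneshot
ExecStart={spark_home}/sbin/start-worker.sh {master_url}
ExecStop={spark_home}/sbin/stop-worker.sh
WorkingDirectory={spark_home}
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
"""

def render_unit(master_url, spark_home="/opt/spark"):
    return UNIT_TEMPLATE.format(spark_home=spark_home, master_url=master_url)

print(render_unit("spark://195.130.112.214:7077"))
```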

Step 10: As root enable and start the spark service:

systemctl daemon-reload
systemctl enable spark
service spark start

Step 11: As any user, verify that the Scala spark-shell and the Python pyspark environments are working:

spark-shell
:q
pyspark
exit()

You will now see your workers appear in the hypatia4 web UI. The system can also operate as an executor by executing the following python script (pyspark):

#!/usr/bin/python3.8
from pyspark.sql import SparkSession  # SparkSession is the entry point

spark = SparkSession.builder.appName("example-alpp").master("spark://hypatia4.math.uoi.gr:7077").getOrCreate()

data = [{'name': 'Ram', 'class': 'first', 'city': 'Mumbai'},
        {'name': 'Sham', 'class': 'first', 'city': 'Mumbai'}]
dataf = spark.createDataFrame(data)
dataf.write.csv("hdfs://195.130.112.214:9000/users/sampledata41.csv")

Executor setup in Windows

Step 1: Verify the installed dependencies by running these commands:

java -version; javac -version; scala -version; git --version

Step 2: Download from this repository (executor/Windows folder) the spark-3.2.2-bin-hadoop2.7.tgz file and uncompress it in the c:\app folder:

mkdir c:\app
copy c:\users\User\downloads\spark-3.2.2-bin-hadoop2.7.tgz c:\app 

Extract it using WinZip to c:\app\spark-3.2.2-bin-hadoop2.7\

Step 3: Add the following environmental variables (My Computer->Right click->Advanced System Settings->Environmental Variables):

SPARK_HOME=C:\app\spark-3.2.2-bin-hadoop2.7
HADOOP_HOME=C:\app\spark-3.2.2-bin-hadoop2.7

JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
Add to %PATH%:
C:\Program Files\Java\jdk1.8.0_201\bin
C:\app\spark-3.2.2-bin-hadoop2.7\bin
C:\app\spark-3.2.2-bin-hadoop2.7\sbin

Step 4: As administrator copy the git executor\Windows\winutils.exe to c:\windows\system32 and to the spark bin folder:

copy winutils.exe c:\windows\system32
copy winutils.exe c:\app\spark-3.2.2-bin-hadoop2.7\bin

Step 5: Install pyspark:

pip3 install pyspark==3.2.2

Step 6: Configure the spark executor in the c:\app\spark-3.2.2-bin-hadoop2.7\conf folder; as the user do:

cd c:\app\spark-3.2.2-bin-hadoop2.7\conf
copy spark-env.sh.template spark-env.sh

Add the following to the spark-env.sh file:

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
#export HADOOP_HOME="/opt/hadoop"
#export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HDFS_URL="hdfs://hypatia4.math.uoi.gr:9000"
export SPARK_DRIVER_MEMORY=1G
#export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=1
#export SPARK_WORKER_MEMORY=600M
export SPARK_EXECUTOR_MEMORY=500M
#export SPARK_WORKER_CORES=1
export SPARK_EXECUTOR_CORES=1
export SPARK_MASTER_IP=195.130.112.214
export SPARK_PUBLIC_DNS=hypatia4.math.uoi.gr

*** Note: you can set SPARK_EXECUTOR_INSTANCES=X to as many executor instances as you can afford ***

Step 7: As user, verify that the Scala spark-shell and the Python pyspark environments are working:

spark-shell
:q
pyspark
exit()

That's all!

Step 8: For Hadoop command line interaction (hadoop fs -ls /), download hadoop-3.1.3.tar.gz from the repo (executors/Linux/hadoop-3.1.3.tar.gz) and extract it in the c:\app folder

Then go to the c:\app\hadoop-3.1.3\etc\hadoop folder and modify the hadoop-env.cmd file at the line where %JAVA_HOME% is mentioned, as follows:

set JAVA_HOME=c:\progfiles\Java\jdk1.8.0_202\

Also, as Administrator, create the appropriate folder link:

mklink /D c:\progfiles "c:\Program Files"

*** This is important since hadoop has problems with spaces in paths. ***

Alter the HADOOP_HOME environment variable to point to the new location (My_Computer->Settings->Advanced Settings->Set Environment variables):

HADOOP_HOME=C:\app\hadoop-3.1.3

Add also the following folders to the %PATH% environment variable:

C:\app\hadoop-3.1.3\bin
C:\app\hadoop-3.1.3\sbin
C:\app\hadoop-3.1.3\lib

To set the default HADOOP_URI edit the c:\app\hadoop-3.1.3\etc\hadoop\core-site.xml and add the following:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.7.0.86:9000</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.7.0.86:9000</value>
  </property>
</configuration>
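To confirm the HDFS URI was set correctly, the value can be read back out of core-site.xml with the standard library alone (a sketch; the file path and property name follow the configuration above):

```python
# Sketch: extract fs.defaultFS from a core-site.xml document.
import xml.etree.ElementTree as ET

def default_fs(core_site_xml):
    """Return the fs.defaultFS value from core-site.xml text, else None."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    return None

sample = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://10.7.0.86:9000</value></property>
</configuration>"""
print(default_fs(sample))  # -> hdfs://10.7.0.86:9000
```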

Copy winutils to hadoop folder:

copy winutils.exe c:\app\hadoop-3.1.3\bin

Then open a new terminal (cmd) and issue the following command

hadoop fs -ls /
drwxrwxrwt   - hdoop supergroup          0 2022-11-30 23:33 /users
-rw-r--r--   1 hdoop supergroup        615 2022-11-25 14:41 /users.parquet

The /users folder has rwx permissions for everyone, with the sticky bit enabled. You can get or put files from the hadoop fs drive using:

hadoop fs -put file [/....path]
hadoop fs -get [/..path/file]
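These commands can also be driven from a script. A thin wrapper sketch (assumes `hadoop` is on the PATH; building the argument list separately keeps the logic testable):

```python
# Sketch: build and run `hadoop fs` commands from Python via subprocess.
import subprocess

def hdfs_cmd(action, *args):
    """Build the argv for `hadoop fs -put/-get/-ls ...`."""
    return ["hadoop", "fs", "-" + action, *args]

def run_hdfs(action, *args):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(hdfs_cmd(action, *args), check=True)

print(hdfs_cmd("put", "file.txt", "/users/"))
```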

Executor Testing - Windows

The system can operate as an executor. Please test using the following python test script (pyspark):

#!/usr/bin/python3.8
from pyspark.sql import SparkSession  # SparkSession is the entry point

spark = SparkSession.builder.appName("example-alpp").master("spark://hypatia4.math.uoi.gr:7077").getOrCreate()

data = [{'name': 'Ram', 'class': 'first', 'city': 'Mumbai'},
        {'name': 'Sham', 'class': 'first', 'city': 'Mumbai'}]
dataf = spark.createDataFrame(data)
dataf.write.csv("hdfs://195.130.112.214:9000/users/sampledata41.csv")

Save the script as test.py and run:

python test.py

Metrics installation on the Manager: /spark/tmp and Prometheus logging

This part applies only to the scala standalone Manager

  1. First you have to configure metrics template on the spark server (as hadoop user):
cd /opt/spark/conf
mv metrics.properties.template metrics.properties
chmod 755 metrics.properties
mkdir /opt/spark/tmp
chmod 777 /opt/spark/tmp
  2. Edit the metrics.properties file and add the following for /opt/spark/tmp logging:
# Enable CsvSink for all instances by class name
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink

# Polling period for the CsvSink
*.sink.csv.period=10
# Unit of the polling period for the CsvSink
*.sink.csv.unit=seconds

# Polling directory for CsvSink
*.sink.csv.directory=/opt/spark/tmp/

# Polling period for the CsvSink specific for the worker instance
worker.sink.csv.period=10
# Unit of the polling period for the CsvSink specific for the worker instance
#worker.sink.csv.unit=minutes
worker.sink.csv.unit=seconds
  3. The Spark UI at port 8080 includes metrics for the master and applications by default, for example:
"master.aliveWorkers":{"value":24}
"master.apps":{"value":0}
"master.waitingApps":{"value":0}

application.albl_10.7.0.87.1669731276026.cores:{"value":1}
application.albl_10.7.0.87.1669731276026.runtime_ms:{"value":24753}
application.albl_10.7.0.87.1669731276026.status:{"value":"FINISHED"} 
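Values like these can be pulled out of the master's JSON metrics programmatically. A sketch (the exact metrics endpoint and JSON layout are assumptions about this Spark setup):

```python
# Sketch: extract a gauge value (e.g. master.aliveWorkers) from the
# Spark master's metrics JSON.
import json

def gauge_value(metrics_json, name):
    """Return the value of the named gauge, or None if absent."""
    doc = json.loads(metrics_json)
    return doc.get("gauges", {}).get(name, {}).get("value")

sample = '{"gauges": {"master.aliveWorkers": {"value": 24}}}'
print(gauge_value(sample, "master.aliveWorkers"))  # -> 24
```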

For prometheus time series interaction add:

# Example configuration for PrometheusServlet
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
  4. Prometheus installation and configuration (Todo).

ALBL algorithm description

Todo

ALBL agent installation

Todo

Extra! Build Hadoop from source and mount dfs

Mount HDFS as a local file system.

Beware, it will support only basic file operations at a slower pace, installation is cumbersome, but it works.

I will perform the installation on a client machine, but the compilation process can be executed only once elsewhere.

Prerequisites:

You need docker to create Hadoop Developer build environment, Java, and Hadoop libraries to run the fuse_dfs application.

Ownership works in a slightly different way as it uses the user name, not the user id, so change ownership of a directory using hdfs utilities before you try to write anything there using fuse.

Create workspace directory for compilation purposes.

$ sudo mkdir /opt/workspace
$ sudo chown $(whoami) /opt/workspace
$ cd /opt/workspace

Clone hadoop source code. I will stick with branch-3.2 as I am using Hadoop 3.2.2, but you can skip the branch part.

$ git clone --depth 1 --branch branch-3.2 https://github.com/apache/hadoop.git
$ cd hadoop
$ ./start-build-env.sh
$ exit

You will be left inside the Hadoop Developer build environment. Build the fuse-dfs executable.

$ mvn package -Pnative -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true

Exit Hadoop Developer build environment and Copy built fuse_dfs binary.

$ exit
$ sudo cp /opt/workspace/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs /usr/bin/

Create a simple shell wrapper (saved as /usr/bin/fuse_dfs_wrapper.sh) to define the used variables.

#!/bin/bash
# naive fuse_dfs wrapper

# define HADOOP_HOME & JAVA_HOME
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# define CLASSPATH
while IFS= read -r -d '' file
do
  export CLASSPATH=$CLASSPATH:$file
done < <(find ${HADOOP_HOME}/share/hadoop/{common,hdfs} -name "*.jar" -print0)

# define LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native/:${JAVA_HOME}/lib/server/
#export LD_LIBRARY_PATH=\$(find \${HADOOP_HOME} -name libhdfs.so.0.0.0 -exec dirname {} \;):\$(find \${JAVA_HOME} -name libjvm.so -exec dirname {} \;)

fuse_dfs "$@"

Set executable bit.

$ sudo chmod +x /usr/bin/fuse_dfs
$ sudo chmod +x /usr/bin/fuse_dfs_wrapper.sh

Create a mount directory.

$ mkdir /opt/workspace/hadoop_data

Inspect fuse_dfs options.

$ fuse_dfs_wrapper.sh  

Mount HDFS as a local file system

$ sudo fuse_dfs_wrapper.sh dfs://namenode.example.org:9000 /opt/workspace/hadoop_data -oinitchecks 

You can use fstab to store mount configuration.

fuse_dfs_wrapper.sh#dfs://namenode.example.org:9000 /opt/workspace/hadoop_data fuse rw,initchecks 0 0
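For reference, the fields of such a fuse-style fstab entry can be decoded as follows (a sketch; the '#' in the device field separates the helper script from the dfs URI):

```python
# Sketch: split a fuse_dfs fstab line into its six fields and decode the
# helper#uri device spec.
def parse_fstab_line(line):
    device, mountpoint, fstype, options, dump, passno = line.split()
    helper, _, dfs_uri = device.partition("#")
    return {"helper": helper, "uri": dfs_uri, "mountpoint": mountpoint,
            "fstype": fstype, "options": options.split(",")}

entry = ("fuse_dfs_wrapper.sh#dfs://namenode.example.org:9000 "
         "/opt/workspace/hadoop_data fuse rw,initchecks 0 0")
print(parse_fstab_line(entry))
```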