This Readme includes information on how to set up a Spark Worker for the hypatia4.math.uoi.gr master service. The hypatia4 master service offers the following capabilities:
Connected Workers need to be configured with 1 CPU core and 600MB of RAM available per worker, required by spark-shell and pyspark to work flawlessly
Executors can have variable CPU core requirements as well as memory requirements
Workers are Linux servers; Executors may be either Linux or Windows systems
This Readme includes the following setup information for Workers/Executors:
The version of the master Spark software is Spark 3.3.1-bin-hadoop2, for x64 and x32
The version of the Hadoop service is hadoop-3.2.4, for x64 and x32
The version for the Windows executors is Spark 3.2.2-bin-hadoop2.7, included in the executors/Windows folder
The version for the Linux executors is Spark 3.3.1-bin-hadoop2, included in the executors/Linux folder
Installation of python3 (at least 3.8)
Installation of at least Java OpenJDK 11 or Java 8 and creation of the JAVA_HOME environment variable (done automatically on setup)
For Windows only: Add the python.exe executable path and the Java binaries to the %PATH% of Windows (My Computer->Right click->Advanced System Settings->Environment Variables)
For Windows only: Create a soft link for python.exe to python3.exe as follows (Start->cmd.exe):
mklink "c:\Program Files\Python38\python3.exe" "c:\Program Files\Python38\python.exe"
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
PATH = %PATH%;C:\Program Files\Java\jdk1.8.0_201\bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
** Workers must have either a static IP, an IP on the MCSL 192.168.0.x network, or a 10.7.0.9 IP on the MCSL VPN network **
Step 1: Download the prerequisites:
sudo apt install default-jdk scala git -y
Verify the installed dependencies by running these commands:
java -version; javac -version; scala -version; git --version
Step 2: Download from this repository the spark-3.3.1-bin-hadoop2.tgz file and uncompress it in the /opt folder:
mv spark-3.3.1-bin-hadoop2.tgz /opt
cd /opt; tar zxvf spark-3.3.1-bin-hadoop2.tgz
ln -s spark-3.3.1-bin-hadoop2 spark
Download the Hadoop binary and extract it also in the /opt folder (only for its libraries to be available):
mv hadoop-3.1.3.tar.gz /opt
cd /opt; tar zxvf hadoop-3.1.3.tar.gz
ln -s hadoop-3.1.3 hadoop
Step 3: Create a non-root hadoop user (user named hadoop):
sudo adduser hadoop
Step 4: Enable passwordless SSH for the hadoop user:
su - hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
Step 5: Give ownership of the /opt/spark folder to the hadoop user:
sudo chown -R hadoop:hadoop /opt/spark/*
sudo chown -R hadoop:hadoop /opt/spark /opt/spark-3.3.1-bin-hadoop2
and likewise for Hadoop:
sudo chown -R hadoop:hadoop /opt/hadoop/*
sudo chown -R hadoop:hadoop /opt/hadoop /opt/hadoop-3.1.3
Step 6: Create the following scripts in the /etc/profile.d folder, with the respective content:
-/etc/profile.d/spark.sh
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_OPTS=-Djava.library.path=/opt/hadoop/lib/native
-/etc/profile.d/hadoop.sh
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hadoop/lib/native
Step 7: Configure the Spark worker in the /opt/spark/conf folder; as the hadoop user run:
cd /opt/spark/conf
cp spark-env.sh.template spark-env.sh
Add the following to the spark-env.sh file:
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HDFS_URL="hdfs://hypatia4.math.uoi.gr:9000"
export SPARK_DRIVER_MEMORY=1G
export SPARK_WORKER_INSTANCES=24
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_WORKER_MEMORY=600M
export SPARK_EXECUTOR_MEMORY=500M
export SPARK_WORKER_CORES=1
export SPARK_EXECUTOR_CORES=1
export SPARK_MASTER_IP=195.130.112.214
export SPARK_PUBLIC_DNS=hypatia4.math.uoi.gr
*** Note: instead of 24 you can set SPARK_WORKER_INSTANCES=X to as many worker instances as you can afford ***
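When choosing X, the arithmetic is simply instances times the per-worker budget from the requirements above (1 core and 600MB each). A small Python sketch of that sizing calculation (the helper name is ours, not part of Spark):

```python
def worker_budget(instances, mem_mb_per_worker=600, cores_per_worker=1):
    """Total RAM (MB) and CPU cores consumed by a set of Spark workers."""
    return instances * mem_mb_per_worker, instances * cores_per_worker

# The 24-instance example from spark-env.sh above:
total_mem, total_cores = worker_budget(24)
print(f"{total_mem} MB, {total_cores} cores")  # 14400 MB, 24 cores
```

Pick the largest X whose total stays below the machine's free RAM and core count.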
Step 8: Install the pyspark python module
sudo pip3 install pyspark
Step 9: As root create a /etc/systemd/system/spark.service
sudo touch /etc/systemd/system/spark.service
vi /etc/systemd/system/spark.service
and add the following:
[Unit]
Description=Spark Worker service
After=syslog.target network-online.target
[Service]
User=hadoop
Group=users
Type=oneshot
#ExecStartPre=/opt/spark/sbin/start-master.sh
ExecStart=/opt/spark/sbin/start-worker.sh spark://195.130.112.214:7077
#ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-worker.sh
#ExecStop=/opt/spark/sbin/stop-history-server.sh
#ExecStopPost=/opt/spark/sbin/stop-master.sh
WorkingDirectory=/opt/spark
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
Step 10: As root enable and start the spark service:
systemctl daemon-reload
systemctl enable spark
service spark start
Step 11: As any user, test that the Scala spark-shell and the Python pyspark environments are working:
spark-shell
:q
pyspark
exit()
You will now see your workers instantiating in the hypatia4 web UI. The system can also operate as an executor by running the following Python script (pyspark):
#!/usr/bin/python3.8
from pyspark.sql import SparkSession  # the only import the script needs

# Connect to the hypatia4 standalone master
spark = SparkSession.builder.appName("example-alpp").master("spark://hypatia4.math.uoi.gr:7077").getOrCreate()
data = [{'name': 'Ram', 'class': 'first', 'city': 'Mumbai'},
        {'name': 'Sham', 'class': 'first', 'city': 'Mumbai'}]
dataf = spark.createDataFrame(data)
dataf.write.csv("hdfs://195.130.112.214:9000/users/sampledata41.csv")
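Before submitting to the cluster you can sanity-check the records locally. This pure-Python sketch (no Spark needed) serializes the same rows with the standard csv module, roughly what write.csv will emit; the alphabetical column order is our assumption about how createDataFrame infers fields from dict keys:

```python
import csv
import io

data = [{'name': 'Ram', 'class': 'first', 'city': 'Mumbai'},
        {'name': 'Sham', 'class': 'first', 'city': 'Mumbai'}]

buf = io.StringIO()
# Spark's write.csv omits the header row by default
writer = csv.DictWriter(buf, fieldnames=['city', 'class', 'name'])
writer.writerows(data)
print(buf.getvalue())
```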
Step 1: Verify the installed dependencies by running these commands:
java -version; javac -version; scala -version; git --version
Step 2: Download from this repository (executor/Windows folder) the spark-3.2.2-bin-hadoop2.7.tgz file and uncompress it in the c:\app folder:
mkdir c:\app
copy c:\users\User\downloads\spark-3.2.2-bin-hadoop2.7.tgz c:\app
Extract it using WinZip to c:\app\spark-3.2.2-bin-hadoop2.7\
Step 3: Add the following environment variables (My Computer->Right click->Advanced System Settings->Environment Variables):
SPARK_HOME=C:\app\spark-3.2.2-bin-hadoop2.7
HADOOP_HOME=C:\app\spark-3.2.2-bin-hadoop2.7
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
Add the following folders to %PATH%:
C:\Program Files\Java\jdk1.8.0_201\bin
C:\app\spark-3.2.2-bin-hadoop2.7\bin
C:\app\spark-3.2.2-bin-hadoop2.7\sbin
Step 4: As administrator, copy winutils.exe from the repository's executors\Windows folder to c:\windows\system32:
copy winutils.exe c:\windows\system32
copy winutils.exe c:\app\spark-3.2.2-bin-hadoop2.7\bin
Step 5: Install pyspark:
pip3 install pyspark==3.2.2
Step 6: Configure the Spark executor in the c:\app\spark-3.2.2-bin-hadoop2.7\conf folder; as a regular user run:
cd c:\app\spark-3.2.2-bin-hadoop2.7\conf
copy spark-env.sh.template spark-env.sh
Add the following to the spark-env.sh file:
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
#export HADOOP_HOME="/opt/hadoop"
#export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HDFS_URL="hdfs://hypatia4.math.uoi.gr:9000"
export SPARK_DRIVER_MEMORY=1G
#export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=1
#export SPARK_WORKER_MEMORY=600M
export SPARK_EXECUTOR_MEMORY=500M
#export SPARK_WORKER_CORES=1
export SPARK_EXECUTOR_CORES=1
export SPARK_MASTER_IP=195.130.112.214
export SPARK_PUBLIC_DNS=hypatia4.math.uoi.gr
*** Note: you can set SPARK_EXECUTOR_INSTANCES=X to as many executor instances as you can afford ***
Step 7: As a regular user, test that the Scala spark-shell and the Python pyspark environments are working:
spark-shell
:q
pyspark
exit()
That's all!
Step 8: For Hadoop command line interaction (hadoop fs -ls /), download hadoop-3.1.3.tar.gz from the repo (executors/Linux/hadoop-3.1.3.tar.gz) and extract it in the c:\app folder
Then go to the c:\app\hadoop-3.1.3\etc\hadoop folder and modify the hadoop-env.cmd file at the line where %JAVA_HOME% is mentioned, as follows:
set JAVA_HOME=c:\progfiles\Java\jdk1.8.0_202\
Also create, as Administrator, the appropriate folder link:
mklink /D c:\progfiles "c:\Program Files"
*** This is important since Hadoop has a slight problem with spaces in paths ***
Alter the HADOOP_HOME environment variable to point to the new location (My Computer->Right click->Advanced System Settings->Environment Variables):
HADOOP_HOME=C:\app\hadoop-3.1.3
Add also to the %PATH% environment variable the following folders:
C:\app\hadoop-3.1.3\bin
C:\app\hadoop-3.1.3\sbin
C:\app\hadoop-3.1.3\lib
To set the default HDFS URI, edit the c:\app\hadoop-3.1.3\etc\hadoop\core-site.xml file and add the following:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://10.7.0.86:9000</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://10.7.0.86:9000</value>
</property>
</configuration>
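If you want to verify this setting programmatically, the XML above can be parsed with Python's standard library. A minimal sketch (the inline string mirrors the file contents; in practice you would parse the core-site.xml path from your installation):

```python
import xml.etree.ElementTree as ET

core_site = """<configuration>
<property><name>fs.defaultFS</name><value>hdfs://10.7.0.86:9000</value></property>
<property><name>fs.default.name</name><value>hdfs://10.7.0.86:9000</value></property>
</configuration>"""

# In practice: root = ET.parse(r'c:\app\hadoop-3.1.3\etc\hadoop\core-site.xml').getroot()
root = ET.fromstring(core_site)
props = {p.findtext('name'): p.findtext('value') for p in root.iter('property')}
print(props['fs.defaultFS'])  # hdfs://10.7.0.86:9000
```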
Copy winutils to hadoop folder:
copy winutils.exe c:\app\hadoop-3.1.3\bin
Then open a new terminal (cmd) and issue the following command
hadoop fs -ls /
You should see output similar to:
drwxrwxrwt - hdoop supergroup 0 2022-11-30 23:33 /users
-rw-r--r-- 1 hdoop supergroup 615 2022-11-25 14:41 /users.parquet
The /users folder has rwx permissions for everyone with the sticky bit enabled. You can put or get files to/from HDFS using:
hadoop fs -put file [/....path]
hadoop fs -get [/..path/file]
The system can now operate as an executor. Please test it using the following Python script (pyspark):
#!/usr/bin/python3.8
from pyspark.sql import SparkSession  # the only import the script needs

# Connect to the hypatia4 standalone master
spark = SparkSession.builder.appName("example-alpp").master("spark://hypatia4.math.uoi.gr:7077").getOrCreate()
data = [{'name': 'Ram', 'class': 'first', 'city': 'Mumbai'},
        {'name': 'Sham', 'class': 'first', 'city': 'Mumbai'}]
dataf = spark.createDataFrame(data)
dataf.write.csv("hdfs://195.130.112.214:9000/users/sampledata41.csv")
Save the script as test.py and run it:
python test.py
This part applies only to the Spark standalone cluster manager
cd /opt/spark/conf
mv metrics.properties.template metrics.properties
chmod 755 metrics.properties
mkdir /opt/spark/tmp
chmod 777 /opt/spark/tmp
Add the following to the metrics.properties file:
# Enable CsvSink for all instances by class name
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for the CsvSink
*.sink.csv.period=10
# Unit of the polling period for the CsvSink
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=/opt/spark/tmp/
# Polling period for the CsvSink specific for the worker instance
worker.sink.csv.period=10
# Unit of the polling period for the CsvSink specific for the worker instance
#worker.sink.csv.unit=minutes
worker.sink.csv.unit=seconds
"master.aliveWorkers":{"value":24}
"master.apps":{"value":0}
"master.waitingApps":{"value":0}
application.albl_10.7.0.87.1669731276026.cores:{"value":1}
application.albl_10.7.0.87.1669731276026.runtime_ms:{"value":24753}
application.albl_10.7.0.87.1669731276026.status:{"value":"FINISHED"}
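Gauge fragments like the ones above are JSON once wrapped in braces, so they can be consumed with Python's json module. A hedged sketch (the sample string is taken verbatim from the output above):

```python
import json

# One gauge line as shown above, wrapped in braces to form a valid JSON object
line = '"master.aliveWorkers":{"value":24}'
gauge = json.loads('{' + line + '}')
name, payload = next(iter(gauge.items()))
print(name, payload['value'])  # master.aliveWorkers 24
```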
For Prometheus time-series interaction, add:
# Example configuration for PrometheusServlet
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
Todo
Mount HDFS as a local file system.
Beware: it supports only basic file operations, at a slower pace, and the installation is cumbersome, but it works.
I will perform the installation on a client machine, but the compilation process only needs to be executed once and can happen elsewhere.
You need Docker to create the Hadoop developer build environment, plus Java and the Hadoop libraries to run the fuse_dfs application.
Ownership works in a slightly different way, as it uses the user name rather than the user ID, so change the ownership of a directory with the HDFS utilities before you try to write anything there through FUSE.
Create workspace directory for compilation purposes.
$ sudo mkdir /opt/workspace
$ sudo chown $(whoami) /opt/workspace
$ cd /opt/workspace
Clone the Hadoop source code. I will stick with branch-3.2 as I am using Hadoop 3.2.2, but you can skip the branch part.
$ git clone --depth 1 --branch branch-3.2 https://github.com/apache/hadoop.git
$ cd hadoop
$ ./start-build-env.sh
You will be left inside the Hadoop developer build environment. Build the fuse_dfs executable:
$ mvn package -Pnative -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true
Exit the Hadoop developer build environment and copy the built fuse_dfs binary:
$ exit
$ sudo cp /opt/workspace/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs /usr/bin/
Create a simple shell wrapper, /usr/bin/fuse_dfs_wrapper.sh, to define the required variables.
#!/bin/bash
# naive fuse_dfs wrapper
# define HADOOP_HOME & JAVA_HOME
export HADOOP_HOME=/opt/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
# define CLASSPATH
while IFS= read -r -d '' file
do
export CLASSPATH=$CLASSPATH:$file
done < <(find ${HADOOP_HOME}/share/hadoop/{common,hdfs} -name "*.jar" -print0)
# define LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native/:${JAVA_HOME}/lib/server/
#export LD_LIBRARY_PATH=\$(find \${HADOOP_HOME} -name libhdfs.so.0.0.0 -exec dirname {} \;):\$(find \${JAVA_HOME} -name libjvm.so -exec dirname {} \;)
fuse_dfs "$@"
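The while-read loop in the wrapper simply concatenates every Hadoop jar under share/hadoop/{common,hdfs} onto CLASSPATH. The same logic as a Python sketch, assuming the /opt/hadoop layout used throughout this guide:

```python
import glob
import os

def build_classpath(hadoop_home):
    """Collect every jar under share/hadoop/{common,hdfs}, like the shell loop."""
    jars = []
    for sub in ('common', 'hdfs'):
        pattern = os.path.join(hadoop_home, 'share', 'hadoop', sub, '**', '*.jar')
        jars.extend(sorted(glob.glob(pattern, recursive=True)))
    return ':'.join(jars)

# e.g. os.environ['CLASSPATH'] = build_classpath('/opt/hadoop')
```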
Set executable bit.
$ sudo chmod +x /usr/bin/fuse_dfs
$ sudo chmod +x /usr/bin/fuse_dfs_wrapper.sh
Create a mount directory.
$ mkdir /opt/workspace/hadoop_data
Inspect fuse_dfs options.
$ fuse_dfs_wrapper.sh
Mount HDFS as a local file system:
$ sudo fuse_dfs_wrapper.sh dfs://namenode.example.org:9000 /opt/workspace/hadoop_data -oinitchecks
You can use fstab to store mount configuration.
fuse_dfs_wrapper.sh#dfs://namenode.example.org:9000 /opt/workspace/hadoop_data fuse rw,initchecks 0 0