While I was trying to install Apache Spark and use its Python API on my laptop, I ran into several issues. After searching through several online resources to solve them, I decided to share my insights on how to properly set up Apache Spark on your machine.
Some of the components I used are:
- Apache Spark v2.4.7
- Apache Hadoop v2.7
- Python v3.7.1
- Java Development Kit (JDK) v8
Install Java Development Kit (JDK)
First of all, you'll need to install the Java Development Kit (JDK). Note that you most likely already have the Java Runtime Environment (JRE) installed on your machine, since lots of software needs it to run Java code. However, if you want to develop, you need the JDK. Simply put, the JDK is the JRE with additional developer features.
The version of Java that I'm going to use is 8. However, you need to do a small trick if you want to download the JDK from Oracle's official website without creating an account. So, just go to Oracle's website and click on jdk-8u271-windows-x64.exe to download it. A pop-up window will appear. Just click on the checkbox to accept the license, then right-click on the download button and click on "Copy Link Location".
Paste the URL either directly into your web browser or into a text editor (for convenience). The URL should look like this:
https://www.oracle.com/webapps/redirect/signon?nexturl=https://download.oracle.com/otn/java/jdk/8u271-b09/61ae65e088624f5aaa0b1d2d801acb16/jdk-8u271-windows-x64.exe
From that, keep only the part of the URL that comes after ?nexturl=. After that, you might need to change otn to otn-pub. By the end of the above steps, the download URL should look like this:
https://download.oracle.com/otn-pub/java/jdk/8u271-b09/61ae65e088624f5aaa0b1d2d801acb16/jdk-8u271-windows-x64.exe
Okay! Just click it, download JDK8 and install it.
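If you'd rather script the URL tweak described above instead of editing it by hand, here is a minimal Python sketch (purely illustrative; the URL below is just the example from this post):
# Turn the Oracle sign-on URL into a direct download URL (illustrative sketch)
signon_url = "https://www.oracle.com/webapps/redirect/signon?nexturl=https://download.oracle.com/otn/java/jdk/8u271-b09/61ae65e088624f5aaa0b1d2d801acb16/jdk-8u271-windows-x64.exe"
# Keep only the part after "?nexturl=" and swap "otn" for "otn-pub"
download_url = signon_url.split("?nexturl=", 1)[1].replace("/otn/", "/otn-pub/")
print(download_url)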
The last thing you should do is set the environment (system) variables. In other words, you need to tell your system which directories to search every time it needs to use JDK components. You can easily do this via the command line, but I will stick to the graphical interface. So, right-click on "My Computer", select "Properties" and, on the left of your screen, click on "Advanced system settings". Alternatively, search for "SystemPropertiesAdvanced.exe". Then click on the "Environment Variables" button.
What we want to modify here are the JAVA_HOME and Path variables. If there is no JAVA_HOME variable, we need to create one and set its value to C:\Program Files\Java\jdk1.8.0_271 (or whichever directory we installed the JDK in). We also add a new entry (assuming it's not already there) to the Path variable: C:\Program Files\Java\jdk1.8.0_271\bin.
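To double-check that the variables were picked up, open a fresh command prompt and Python session; a small sketch like the following (assuming the paths above) should confirm the setup:
import os
import subprocess

# JAVA_HOME should point to the JDK installation directory
print(os.environ.get("JAVA_HOME"))    # e.g. C:\Program Files\Java\jdk1.8.0_271
# java should be reachable through the Path entry we just added
subprocess.run(["java", "-version"])  # prints the installed Java version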
Install Apache Spark (+Hadoop)
Moving on, let us download and install Apache Spark (+Hadoop) from its official website. What I used is Spark version 2.4.7 and Hadoop 2.7. Click on the relevant hyperlink, download, and decompress the file using, for example, 7-Zip. It doesn't matter in which directory you decompress the files; for convenience I created a folder named "Spark" on my C: drive. Therefore, by now I have a directory which looks like this: C:\Spark\spark-2.4.7-bin-hadoop2.7
The next steps are similar to what we did with Java's environment variables. Go to the system variables and create two new variables named SPARK_HOME and HADOOP_HOME, both with the value C:\Spark\spark-2.4.7-bin-hadoop2.7. Finally, add a new entry to the Path variable: C:\Spark\spark-2.4.7-bin-hadoop2.7\bin.
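As with the Java variables, you can sanity-check these from Python (a quick sketch, assuming the directory layout above):
import os

# Both variables should point to the extracted Spark folder
spark_home = os.environ.get("SPARK_HOME")
print(spark_home)                      # C:\Spark\spark-2.4.7-bin-hadoop2.7
print(os.environ.get("HADOOP_HOME"))   # same value in this set-up
# spark-submit.cmd lives in the bin folder we just added to Path
print(os.path.exists(os.path.join(spark_home, "bin", "spark-submit.cmd")))  # should be True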
The final step to make your Spark/Hadoop set-up work properly is to add winutils.exe, a Hadoop component for Windows, to your Spark directory. Simply go here, choose your Hadoop version (in my case it's hadoop-2.7.1), go to the bin directory, click on winutils.exe and, on the right of the GitHub interface, click on the "Download" button. You need to place the executable in this directory: C:\Spark\spark-2.4.7-bin-hadoop2.7\bin. Now create a dummy Hive directory such as C:\tmp\hive and grant full permissions on it by opening the command prompt as administrator and running the command below:
winutils.exe chmod -R 777 C:\tmp\hive
You can verify the permissions by running:
winutils.exe ls C:\tmp\hive
The first line of the output should start like this: drwxrwxrwx
In case you're not familiar with Unix/Linux, chmod and ls are Unix commands which, on Windows, run via winutils.exe. For further info about winutils, check out the hyperlinks at the bottom of this blog post.
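If you prefer to script this check too, here is a minimal sketch (assuming winutils.exe ended up in the Spark bin directory as described above):
import os
import subprocess

# winutils.exe should sit next to the other Spark binaries
winutils = os.path.join(os.environ["SPARK_HOME"], "bin", "winutils.exe")
print(os.path.exists(winutils))  # should be True
# List the dummy Hive directory; the permissions column should read drwxrwxrwx
subprocess.run([winutils, "ls", r"C:\tmp\hive"])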
At this point, you should be able to open a command prompt and launch the Scala or Python Spark shell by typing spark-shell or pyspark, respectively.
Apache Spark in Python
Closing this post, I'd like to show you how to run the Apache Spark Python API (pyspark). You could simply install pyspark using pip in your Python (venv) or Anaconda virtual environment. However, it seems handier to use the package that you have already downloaded from the Apache Spark website.
In order to use pyspark, you need to tell Python where to find the package. You can do this either manually or automatically. Let's see the manual way first.
Manually add spark in your python path
import sys

# Point Python to the pyspark package and its bundled py4j dependency
sys.path.append(r'C:\Spark\spark-2.4.7-bin-hadoop2.7\python')
sys.path.append(r'C:\Spark\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip')
You can run print(sys.path) before and after appending the two directories to get a glimpse of what has changed.
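Once the two directories are appended, a quick sanity check (assuming the paths above match your installation) is to import pyspark and print its version:
import pyspark

print(pyspark.__version__)  # should print 2.4.7 for this set-up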
Automatically add spark in your python path
The second way is to use the findspark Python package, which does the above job automatically. Just install findspark using pip install findspark or via the Anaconda repo. Then simply execute the following commands:
import findspark

findspark.init()   # uses SPARK_HOME to locate Spark and adds pyspark to sys.path
findspark.find()   # returns the Spark installation directory it found
You can see the Spark installation directory returned by the findspark.find() command, and verify that the relevant Apache Spark Python directories have been added to the Python path by using print(sys.path).
Test pySpark
All in all, you can give your pySpark installation a test drive by executing the following commands, either in a script or directly in the Python shell:
from pyspark import SparkContext

sc = SparkContext()              # start a local Spark context
x = sc.parallelize(range(10))    # distribute a small dataset
x.take(5)                        # first 5 elements
x.glom().collect()               # how the elements are distributed across the partitions (cores) of your machine
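As a small, purely illustrative follow-up, you can run a simple computation on the same distributed data and shut the context down when you are finished:
print(x.map(lambda n: n * n).reduce(lambda a, b: a + b))  # sum of the squares of 0..9, i.e. 285
sc.stop()  # release the local Spark context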
Well done! By now you should have Apache Spark and pyspark properly configured on your machine, which means you are ready to start exploring…
Cheers!