How to set up Apache Spark environment in Windows

While trying to install Apache Spark and use its Python API on my laptop, I ran into several issues. After searching through several online resources to solve them, I decided to share my insights on how to properly set up Apache Spark on your machine.

Some of the components I used are:

  • Apache Spark v2.4.7
  • Apache Hadoop v2.7
  • Python v3.7.1
  • Java Development Kit (JDK) v8

Install Java Development Kit (JDK)

First of all, you’ll need to install the Java Development Kit (JDK). Note that you most likely already have the Java Runtime Environment (JRE) installed on your machine, since lots of software needs it to run Java code. However, if you want to develop, you need the JDK. Simply put, the JDK is the JRE plus additional developer tools.

The Java version I’m going to use is 8. However, there is a small trick if you want to download the JDK from Oracle’s official website without creating an account. Go to Oracle’s website and click on jdk-8u271-windows-x64.exe. A pop-up window will appear: tick the checkbox to accept the license, then right-click on the download button and select “Copy Link Location”.

Paste the URL either directly into your web browser or into a text editor (more comfortable). The URL should look like this:

https://www.oracle.com/webapps/redirect/signon?nexturl=https://download.oracle.com/otn/java/jdk/8u271-b09/61ae65e088624f5aaa0b1d2d801acb16/jdk-8u271-windows-x64.exe

From that URL, keep only the part that comes after ?nexturl=. You might also need to change otn to otn-pub. After these steps, the download URL should look like this:

https://download.oracle.com/otn-pub/java/jdk/8u271-b09/61ae65e088624f5aaa0b1d2d801acb16/jdk-8u271-windows-x64.exe

Okay! Just click it, download JDK8 and install it.

The last thing you should do is change the environment (system) variables. In other words, you need to tell your system which directories to search whenever it needs JDK components. You can do this via the command line, but I will stick to the graphical interface. Right-click on “My Computer”, select “Properties” and, on the left of your screen, click on “Advanced System Settings”. Alternatively, search for “SystemPropertiesAdvanced.exe”. Then click on the “Environment Variables” button.

What we want to modify here are the JAVA_HOME and Path variables. If there is no JAVA_HOME variable, create one and set its value to C:\Program Files\Java\jdk1.8.0_271 (or whichever directory you installed the JDK in). Then add a new entry to the Path variable (assuming it’s not already there): C:\Program Files\Java\jdk1.8.0_271\bin
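If you want to double-check the result from Python, here is a minimal sketch. It assumes the default install directory used in this post, so adjust the path if yours differs, and remember to open a new command prompt first, since shells that were already open do not see newly added variables.

import os
import subprocess

# Assumed install directory from this post; adjust if you installed JDK 8 elsewhere
jdk_dir = r"C:\Program Files\Java\jdk1.8.0_271"

print("JAVA_HOME =", os.environ.get("JAVA_HOME"))  # should print the JDK directory
print("java.exe present:", os.path.exists(os.path.join(jdk_dir, "bin", "java.exe")))

# Ask the JDK itself for its version (prints something like java version "1.8.0_271")
subprocess.run(["java", "-version"])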

Install Apache Spark (+Hadoop)

Moving on, let’s download Apache Spark (+Hadoop) from its official website. I used Spark 2.4.7 pre-built for Hadoop 2.7. Click on the relevant hyperlink, download the archive and decompress it using, for example, 7-Zip. It doesn’t matter in which directory you decompress the files; for convenience I created a folder named “Spark” on my C: drive. By now I have a directory that looks like this: C:\Spark\spark-2.4.7-bin-hadoop2.7

The next steps are similar to what we did with the Java environment variables. Go to the system variables and create two new variables named SPARK_HOME and HADOOP_HOME, both set to C:\Spark\spark-2.4.7-bin-hadoop2.7. Finally, add a new entry to the Path variable: C:\Spark\spark-2.4.7-bin-hadoop2.7\bin.
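As a quick sanity check, the following Python sketch (again assuming the folder layout above) confirms that the new variables are visible and that spark-submit sits where Spark expects it:

import os

for name in ("SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name))  # both should point to the Spark directory

# spark-submit.cmd should be inside %SPARK_HOME%\bin after the steps above
spark_home = os.environ.get("SPARK_HOME", r"C:\Spark\spark-2.4.7-bin-hadoop2.7")
print("spark-submit found:", os.path.exists(os.path.join(spark_home, "bin", "spark-submit.cmd")))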

The final step to make your Spark/Hadoop set-up work properly is to add winutils.exe, a Hadoop helper for Windows, to your Spark directory. Simply go here, choose your Hadoop version (in my case hadoop-2.7.1), go to the bin directory, click on winutils.exe and, on the right of the GitHub interface, click the “Download” button. Place the executable in this directory: C:\Spark\spark-2.4.7-bin-hadoop2.7\bin. Now create a dummy Hive directory such as this:

C:\tmp\hive

and elevate permissions by opening the command prompt as administrator and running the command below:

winutils.exe chmod -R 777 C:\tmp\hive

You can verify the elevated permissions by running:

winutils.exe ls C:\tmp\hive

The beginning of the output should look like this:

drwxrwxrwx

In case you’re not familiar with Unix/Linux, chmod and ls are standard Unix commands that winutils.exe makes available on Windows. For further info about winutils, check out the hyperlinks at the bottom of this blog post.
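If you would rather script the Hive directory step than type it by hand, here is a rough Python sketch doing the same thing as the two commands above; the winutils.exe path is assumed from the previous step, and the script still needs to run from an administrator prompt:

import os
import subprocess

winutils = r"C:\Spark\spark-2.4.7-bin-hadoop2.7\bin\winutils.exe"  # path assumed from the step above
hive_dir = r"C:\tmp\hive"

os.makedirs(hive_dir, exist_ok=True)                        # create the dummy Hive directory
subprocess.run([winutils, "chmod", "-R", "777", hive_dir])  # elevate permissions
subprocess.run([winutils, "ls", hive_dir])                  # the listing should start with drwxrwxrwx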

If everything is correctly configured, you should be able to start the Spark interactive shell in either Scala or Python by running spark-shell or pyspark, respectively.

Apache Spark in Python

To close this post, I’d like to show you how to run the Apache Spark Python API (pyspark). You could simply install pyspark with pip in your Python (venv) or Anaconda virtual environment. However, it is handier to use the package you have already downloaded from the Apache Spark website.

So, in order to be able to import pyspark you need to tell Python where to find the package. You can do this either manually or automatically. Let’s see the manual way first.

Manually add spark in your python path

First of all, run Python in your desired virtual environment. You can start a kernel through Jupyter, Spyder or whichever IDE you prefer; in my case I will just run the Python shell in the command prompt. Then import the sys module and add the Python-related Apache Spark directories to the Python path as follows:
import sys

# use raw strings so the backslashes are not treated as escape sequences
sys.path.append(r'C:\Spark\spark-2.4.7-bin-hadoop2.7\python')
sys.path.append(r'C:\Spark\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip')

You can run print(sys.path) to get a glimpse of what happened before and after appending the two directories.
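After appending the two directories, a minimal check that the package is importable (assuming the Spark 2.4.7 bundle from above) looks like this:

import pyspark

print(pyspark.__version__)  # should print 2.4.7 if the paths above were picked up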

Automatically add spark in your python path

The second way is to use the findspark Python package, which does the above job automatically. Install it with pip install findspark or from the Anaconda repository, then simply execute the following commands:

import findspark

findspark.init()

You can verify that the correct Apache Spark path has been detected with findspark.find(), and that the relevant Apache Spark directories have been added to the Python path with print(sys.path) (after import sys).
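In case SPARK_HOME is not visible to the Python process you launched, findspark also accepts an explicit installation path; a small sketch, using the directory from the installation step above:

import sys
import findspark

findspark.init(r"C:\Spark\spark-2.4.7-bin-hadoop2.7")  # explicit path instead of relying on SPARK_HOME

print(findspark.find())  # the Spark installation that findspark located
print(sys.path)          # the Spark python directories should now appear here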

Test pySpark

Finally, you can take your pyspark installation for a test drive by executing the following commands, either in a script or directly in the Python shell:

from pyspark import SparkContext

sc = SparkContext()
x = sc.parallelize(range(10))  # distribute the numbers 0..9 as an RDD
x.take(5)                      # first 5 elements
x.glom().collect()             # how the elements are distributed across the cores of your machine
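If you prefer the higher-level DataFrame API, a similarly small smoke test using the standard SparkSession entry point (available since Spark 2.x) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()

df = spark.range(10)  # a single-column DataFrame with ids 0..9
df.show(5)            # print the first 5 rows
print(df.count())     # should print 10

spark.stop()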

Well done! By now you should have Apache Spark and pyspark properly configured on your machine, which means you are ready to start exploring…

Cheers!
