While I was trying to install Apache Spark and use its Python API on my laptop, I ran into several issues. After consulting several online resources to solve them, I decided to share my insights on how to properly set up Apache Spark on your machine.
Some of the components I used are:
- Apache Spark v2.4.7
- Apache Hadoop v2.7
- Python v3.7.1
- Java Development Kit (JDK) v8
Install Java Development Kit (JDK)
First of all, you'll need to install the Java Development Kit (JDK). Note that you will most likely already have the Java Runtime Environment (JRE) installed on your machine, since lots of software needs it to run Java code. However, if you want to develop, you need the JDK. Simply put, the JDK is the JRE with additional developer features.
The version of Java that I’m going to use is 8. However, you need to do a small trick if you want to download JDK from Oracle’s official website without creating an account. So, just go to Oracle’s website and click on
jdk-8u271-windows-x64.exe to download it. A pop-up window will appear. Just click on the checkbox to accept the license, then right-click on the download button and click on “Copy Link Location”.
Paste the URL either directly into your web browser or into a text editor (more comfortable for editing). The URL should look like this:
From that, you should keep the part of the URL that comes after `?nexturl=`. After that, you might need to change the `otn` part of the path to `otn-pub`. By the end of the above steps, the download URL should look like this:
Okay! Just click it, download JDK8 and install it.
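The URL surgery above can be sketched in Python. The URL below is a made-up example with the right shape; the real one is whatever you copied from the download button:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical example of the shape the copied link takes; the real URL
# comes from the "Copy Link Location" step above.
copied = ("https://www.oracle.com/webapps/redirect/signon"
          "?nexturl=https://download.oracle.com/otn/java/jdk/jdk-8u271-windows-x64.exe")

# Keep only the part after ?nexturl=
nexturl = parse_qs(urlparse(copied).query)["nexturl"][0]

# Swap otn for otn-pub to skip the account requirement
direct = nexturl.replace("/otn/", "/otn-pub/")
print(direct)
```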
The last thing you should do is change the Environment (system) variables. In other words, you need to tell your system which directories to search every time it needs to use JDK components. You can achieve that via the command line, but I will stick to the graphical interface. So, right-click on “My Computer”, select “Properties” and at the left of your screen click on “Advanced system settings”. Alternatively, search for “SystemPropertiesAdvanced.exe”. Then click on the “Environment Variables” button.
What we want to modify here are the `JAVA_HOME` and `Path` variables. If there is no `JAVA_HOME` variable, create one and set its value to `C:\Program Files\Java\jdk1.8.0_271` (or whichever directory you installed the JDK in). Then, add a new value (assuming it’s not already there) to the `Path` variable, pointing to the JDK’s `bin` directory (typically `%JAVA_HOME%\bin`).
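To sanity-check the result without reopening the dialog, you can query the variables from Python; the small helper below is just an illustration:

```python
import os

def missing_vars(env, names):
    """Return which of the expected environment variable names are unset or empty."""
    return [name for name in names if not env.get(name)]

# After a successful JDK set-up this should print an empty list.
print(missing_vars(os.environ, ["JAVA_HOME"]))
```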
Install Apache Spark (+Hadoop)
Moving on, let us download and install Apache Spark (+Hadoop) from its official website. What I used is Spark version 2.4.7 and Hadoop 2.7. Click on the relevant hyperlink, download and decompress the file using, for example, 7zip. It doesn’t matter in which directory you decompress the files. For convenience I created a folder named “Spark” in my C: drive. Therefore, by now I have a directory which looks like this:
The next steps are similar to what we did with Java’s environment variables. Go to the system variables and create two new variables named `SPARK_HOME` and `HADOOP_HOME`, both with the value `C:\Spark\spark-2.4.7-bin-hadoop2.7`. Finally, add a new value to the `Path` variable, pointing to Spark’s `bin` directory (typically `%SPARK_HOME%\bin`).
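Assuming the layout above, a quick way to confirm from Python that Spark is where the new variables say it is (the path is the one used in this guide; adjust it if you decompressed elsewhere):

```python
from pathlib import Path

# The directory used throughout this guide as SPARK_HOME.
spark_home = Path(r"C:\Spark\spark-2.4.7-bin-hadoop2.7")

# Subdirectories the rest of the set-up relies on.
for sub in ("bin", "python"):
    print(sub, "present:", (spark_home / sub).is_dir())
```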
The final step to make your Spark/Hadoop set-up work properly is to add `winutils.exe`, a Hadoop component for Windows, to your Spark directory. Simply go here, choose your Hadoop version (in my case it’s `hadoop-2.7.1`), go to the `bin` directory, click on `winutils.exe`, and at the right of the GitHub interface click on the “Download” button. Place the executable in this directory: `C:\Spark\spark-2.4.7-bin-hadoop2.7\bin`. Now create a dummy Hive directory, `C:\tmp\hive`,
and elevate permissions by opening the command prompt as administrator and running the command below:
winutils.exe chmod -R 777 C:\tmp\hive.
You can verify the elevated permissions by running:
winutils.exe ls C:\tmp\hive.
The first lines of the output should look like this:
In case you’re not familiar with unix/linux, `chmod` and `ls` are standard shell commands which here run via
winutils.exe. For further info about winutils, check out the hyperlinks at the bottom of this blog post.
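If the 777 mask looks cryptic, Python’s stat module spells out what it grants: read, write, and execute for owner, group, and others.

```python
import stat

# 0o777 == rwx for user, group, and others combined.
full_access = stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO
print(oct(full_access))        # 0o777
print(stat.filemode(0o40777))  # drwxrwxrwx, as in the winutils ls output
```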
Apache Spark in Python
Closing this post, I’d like to show you how to run the Apache Spark Python API (`pyspark`). You could simply install `pyspark` with `pip` in your Python (venv) or Anaconda virtual environment. However, it seems handier to use the package that you have already downloaded from the Apache Spark website. To use that copy of `pyspark`, you need to tell Python where to find the package. You can do this either manually or automatically. Let’s see the manual way first.
Manually add Spark to your Python path
```python
import sys
sys.path.append(r'C:\Spark\spark-2.4.7-bin-hadoop2.7\python')
sys.path.append(r'C:\Spark\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip')
```
You can run `print(sys.path)` to get a glimpse of what’s happened before and after appending the two directories.
Automatically add Spark to your Python path
```python
import findspark  # installable with pip
findspark.init()  # locates Spark via the SPARK_HOME variable we set earlier
```

You can check which Spark installation was picked up with the `findspark.find()` command; the relevant Apache Spark python directories were added to the python path by `findspark.init()`.
All in all, you can take your pyspark installation for a test drive by executing the following commands, either in a script or directly in the python shell:

```python
from pyspark import SparkContext

sc = SparkContext()
x = sc.parallelize(range(10))
x.take(5)           # first 5 elements
x.glom().collect()  # how the elements are distributed across the cores of your machine
```
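If you want a feel for what those calls compute before Spark is even running, here is a plain-Python sketch; the chunking helper below is an illustration only, and Spark’s actual partition boundaries may differ:

```python
# Plain-Python sketch of the RDD operations above (no Spark required).
data = list(range(10))

def glom(seq, partitions):
    """Split seq into roughly equal chunks, mimicking RDD.glom().collect().
    Spark's exact partition boundaries may differ."""
    k, m = divmod(len(seq), partitions)
    return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(partitions)]

print(data[:5])       # what x.take(5) returns: [0, 1, 2, 3, 4]
print(glom(data, 4))  # one list per partition
```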
Well done! By now you should have Apache Spark and pyspark properly configured on your machine, which means you’re ready to start exploring…