Access Hive & HDFS via PySpark


The goal of this article is to provide a quick tutorial that will let you easily access your Hadoop cluster via Hive and HDFS. A quick look on the internet shows a plethora of solutions and libraries for this. In particular, there is a Python library called PyDev which is rather efficient … if you manage to install it correctly.

PySpark

In this tutorial we will use PySpark, which, as its name suggests, relies on the Spark framework. Keep in mind that Spark is not a programming language but a distributed computing framework. It is therefore naturally widely used with Hadoop.

We will read and write data to Hadoop. Keep in mind that Spark does not handle plain files but Resilient Distributed Datasets (RDDs).

These RDDs have the following characteristics:

  1. They are organized in rows: be careful, because a row cannot exceed 2 GB. In practice it is even advisable not to go beyond a few MB per row.
  2. It is almost impossible to access a specific part of the dataset directly: RDDs have no index, so you have to traverse them in their entirety.
  3. They work like streams or cursors (read or write): you therefore cannot modify an RDD in place!
  4. They are distributed: the order of the rows of the dataset is not known in advance (as a corollary, an RDD does not store column names). See the short sketch after this list.
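
To make these properties concrete, here is a minimal sketch (the sample values and the rdd_basics app name are my own, not part of the original tutorial) that builds a small RDD, traverses it, and transforms it without ever modifying the original:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd_basics").getOrCreate()
# A tiny RDD built from an in-memory list (hypothetical sample values)
rdd = spark.sparkContext.parallelize([("data 1", 1), ("data 2", 2), ("data 3", 3)])
# No index: to inspect the content you traverse the whole dataset
print(rdd.collect())
# Immutable: a transformation returns a new RDD, the original is untouched
doubled = rdd.map(lambda row: (row[0], row[1] * 2))
print(doubled.collect())
# Distributed: the rows are spread over one or more partitions
print(rdd.getNumPartitions())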

So much for the big drawbacks; thanks to the Pandas library, we will see how to overcome them easily.

NB: install PySpark via the command $ pip install pyspark
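
Once installed, a quick sanity check (a minimal sketch, assuming pyspark was installed into your current Python environment) is to start a local SparkSession and print its version:

from pyspark.sql import SparkSession
# Create (or reuse) a local SparkSession and display the Spark version
spark = SparkSession.builder.appName("install_check").getOrCreate()
print(spark.version)
spark.stop()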

Spark & Python

Let’s say it: Spark is implemented on the JVM (mostly in Scala)!

The PySpark API is quite efficient, but it will never be as fast as the native Scala (or Java) API. However, for most Machine Learning projects, PySpark will do just fine, and as you will see, it is rather easy to use.

Hadoop cluster

On the Hadoop side, I will use the HortonWorks distribution (HDP 2.6.4) for this tutorial.

You can download the sandbox here.

Once the cluster is installed and configured, you can access the Ambari administration console via a URL of the form http://<cluster-host>:8080

Log in as admin and verify that the HDFS and Hive services are operational:

Then retrieve the hive.metastore.uris parameter in the Ambari console. To do this, select Hive from the left menu, then the Configs and Advanced tabs in the right pane:

In my case I noted the value: thrift://hdp264.local:9083

Hadoop HDFS

Let’s get to the heart of the matter and see how to write and read a file in a Hadoop HDFS cluster with Python.

Some useful commands

Interacting with HDFS is pretty straightforward from the command line. Here is a little memento:

List a directory:

$ hadoop fs -ls

Create a directory:

$ hadoop fs -mkdir /user/input

Change permissions:

$ hadoop fs -chmod 777 /user

Send a file:

$ hadoop fs -put /home/file.txt /user/input

Read a file:

$ hadoop fs -cat /user/infa/Exp_Clients.csv

Get a file:

$ hadoop fs -get /user/output/ /home/hadoop_tp/

Writing a file in HDFS with PySpark

Now that you know how to interact with HDFS from the command line, let’s see how to write a file with Python (PySpark). In the example below we will create a DataFrame with 4 rows and two columns (data), then write it to HDFS (URI: hdfs://hdp.local/user/hdfs/example.csv):

import os
from pyspark.sql import SparkSession
import pandas as pd  # not used here, but handy later for toPandas()
# Impersonate the hdfs user so we are allowed to write under /user/hdfs
os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.6"
# Create (or reuse) a Spark session
sparkSession = SparkSession.builder.appName("pyspark_test").getOrCreate()
# Build a DataFrame with 4 rows and 2 columns, then write it to HDFS as CSV
data = [('data 1', 1), ('data 2', 2), ('data 3', 3), ('data 4', 4)]
df = sparkSession.createDataFrame(data)
df.write.csv("hdfs://hdp.local/user/hdfs/example.csv")
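
As noted earlier, a dataset written this way carries no column names. Here is a hedged variant (the column names label and value and the file name example_named.csv are my own choices, not from the original tutorial) that names the columns and writes a CSV header so the file is self-describing:

# Same data, but with explicit column names (hypothetical: "label" and "value")
df_named = sparkSession.createDataFrame(data, ["label", "value"])
# mode("overwrite") replaces a previous run; header=True writes the column names
df_named.write.mode("overwrite").csv("hdfs://hdp.local/user/hdfs/example_named.csv", header=True)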

Let’s check that the file has been written correctly. To do this in the Ambari console, select the “Files View” (matrix icon at the top right). Navigate to /user/hdfs as below:

Good news: the example.csv file is present.

Reading a file in HDFS with PySpark

Reading is just as easy as writing with the sparkSession.read.csv command:

import os
from pyspark.sql import SparkSession
import pandas as pd  # handy for converting the result with toPandas()
# Impersonate the hdfs user, as in the write example
os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.6"
sparkSession = SparkSession.builder.appName("pyspark_test").getOrCreate()
# Read the CSV file back from HDFS into a Spark DataFrame and display it
df = sparkSession.read.csv('hdfs://hdp.local/user/hdfs/example.csv')
df.show()

The show() method displays the contents of the file.
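
By default, the columns come back as plain strings named _c0, _c1, and so on. If the file was written with a header (as in the earlier example_named.csv sketch, which is my own example rather than the tutorial’s), you can ask Spark to use it and to infer the column types:

# Use the header row for column names and let Spark infer the column types
df_typed = sparkSession.read.csv("hdfs://hdp.local/user/hdfs/example_named.csv", header=True, inferSchema=True)
df_typed.printSchema()
df_typed.show()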

Hive

We have just seen how to write or read a file in HDFS. Now let’s see how we can interact with Hive with PySpark.

Some useful Hive commands

You run Hive from the command line simply by typing $ hive. Once the Hive client is up, it offers a hive> prompt with which you can interact:

List all tables:

hive> SHOW TABLES;

List all tables whose names end with e (the pattern is a regular expression):

hive> SHOW TABLES '.*e';

Description of a table (columns):

hive> DESCRIBE INFA.EMPLOYEE;

Databases:

hive> CREATE DATABASE [IF NOT EXISTS] userdb;
hive> SHOW DATABASES;
hive> DROP DATABASE IF EXISTS userdb;

You can of course write full HiveQL queries:

SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

hive> SELECT * FROM employee WHERE salary>30000;
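
For scripting, the hive client can also run a query non-interactively with its -e option (same employee table as above):

$ hive -e "SELECT * FROM employee WHERE salary > 30000;"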

Querying Hive with PySpark

For this I created a customers_prov table in Hive that we are going to query. But first of all, you will have to specify your Hive URI (i.e. the hive.metastore.uris parameter that you retrieved above). Replace the value “thrift://hdp.local:9083” in the code below with yours:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
import pandas as pd

# Point Spark at the Hive metastore (replace with your own hive.metastore.uris value)
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://hdp.local:9083")
# Build a Spark session with Hive support enabled
sparkSession = (SparkSession
                .builder
                .appName('pyspark_test')
                .enableHiveSupport()
                .getOrCreate())
# Run a HiveQL query and get the result back as a Spark DataFrame
df_spark = sparkSession.sql('select * from customers_prov')
df_spark.show()
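
The same session can run any of the HiveQL commands shown earlier; for example, the sketch below (my own addition) lists the tables known to the metastore:

# List the Hive tables visible through the metastore
sparkSession.sql("SHOW TABLES").show()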

Unfortunately, these Spark DataFrames are not very convenient to work with, for my taste. Of course, we do have some useful methods:

  • df_spark.schema, which returns the schema
  • df_spark.printSchema(), which prints the same schema as a tree
  • etc.

But personally I love the magic method that turns these Spark DataFrames into Pandas DataFrames: toPandas()

s = df_spark.toPandas()

And there you have it: you can now design and work on your Machine Learning model.
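
From there, everything is standard Pandas. A minimal sketch of what you might do next (the actual columns obviously depend on your customers_prov table):

# Quick inspection of the Pandas DataFrame returned by toPandas()
print(s.shape)       # number of rows and columns
print(s.dtypes)      # column types as seen by Pandas
print(s.head())      # first rows
print(s.describe())  # basic statistics on the numeric columns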

Find the sources of this tutorial as Jupyter notebooks on GitHub.


Benoit Cayla

