Web Scraping


What is Web Scraping?

According to Wikipedia, web scraping “allows you to retrieve the content of a web page in order to reuse it”. In other words, web scraping consists of extracting data from the web via the content of HTML pages. This is especially useful when you have been browsing the internet and found the information you were looking for. But ouch! How do you retrieve that information if the site does not publish it (as an API or a CSV export)?

Not every site is, or aspires to be, “Open Data”!

As far as terminology goes, just know that web scraping is also called data harvesting; the two terms mean the same thing.

If you take a look around the web, you will find several tools (free or paid, open source or not) that can help you do this quickly.

In this article we will see how to retrieve data published on a website and then write it out in a usable form, that is, in tabular form, in a CSV file. And obviously we will not use any of those tools… that would be too simple!

Your mission, should you choose to accept it

Your target will be the jeuxvideo.com site, and more particularly its list of the best video games: http://www.jeuxvideo.com/meilleurs/

The idea is just as simple: retrieve the list of the best video games (the top 200, for example) and write this list to a CSV file.

We will of course collect, at a minimum:

  • the name of the video game
  • its description
  • its release date
  • its test score
  • its player rating

Simple implementation with Python

As I said above, there are several tools that could do this for you automatically, but I chose to show you how to do it in Python. You will see that it is really very simple and completely accessible.

To start, you of course have to go to the page you want to scrape and look for structure in its CSS styles. If you are not familiar with style sheets this step will be harder… but not impossible, because modern browsers will help you.

Retrieving the CSS style structure

In my example I am using Firefox. Simply select the part of the page you want to retrieve, right-click on it, and choose “Inspect Element” to display the HTML content of the page. A new panel should appear below with that HTML content. In this panel you will see a highlighted line indicating the HTML code corresponding to the zone you selected.

In our example we selected the game title (God Of War), and we can see that it is a hyperlink (an <a> tag) with the CSS class gameTitleLink__196nPy. Very well, we keep that handy for later, and we retrieve the other desired information (description, ratings, release date) the same way.

Why are we collecting these CSS classes? Quite simply because they will let our Python code point directly at the elements we want, thanks to the XPath query language. In our example above, the title (well, all the titles) can be found via the XPath expression //a[@class="gameTitleLink__196nPy"].

Let’s scrape!

We now have the information necessary to retrieve the list of the best video games according to the site jeuxvideo.com. Let’s start by opening a Jupyter notebook to code the scraping in Python.

import requests
import lxml.html as lh

# Download the page and parse its HTML content
page = requests.get('http://www.jeuxvideo.com/meilleurs/')
doc = lh.fromstring(page.content)

The code above fetches the page http://www.jeuxvideo.com/meilleurs/ and parses its HTML content.
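In passing, it can be worth checking that the request actually succeeded before parsing anything. A minimal sketch:

# Optional sanity check: make sure the HTTP request succeeded
if page.status_code != 200:
    raise RuntimeError('Request failed with status ' + str(page.status_code))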

# One XPath query per piece of information, using the CSS classes spotted earlier
nomJeux = doc.xpath('//a[@class="gameTitleLink__196nPy"]')
desc = doc.xpath('//p[@class="description__1-Pqha"]')
sortie = doc.xpath('//span[@class="releaseDate__1RvUmc"]')
test = doc.xpath('//span[@class="editorialRating__1tYu_r"]')
avis = doc.xpath('//span[@class="userRating__1y96su"]')

# Walk through the five result lists in parallel and print each game's fields
for i in range(len(nomJeux)):
    print(nomJeux[i].text_content().strip() + "\n" +
          desc[i].text_content().strip() + "\n" +
          sortie[i].text_content().strip() + "\n" +
          test[i].text_content().strip() + "\n" +
          avis[i].text_content().strip() + "\n")
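Note that indexing five separate lists in parallel assumes every game exposes all five fields; if one element were missing on the page, the lists would get out of step. A slightly more defensive sketch, with the same output, zips the lists together so the loop stops at the shortest one:

# Defensive variant: zip() stops at the shortest list, avoiding an IndexError
for nom, d, s, t, a in zip(nomJeux, desc, sortie, test, avis):
    print("\n".join(e.text_content().strip() for e in (nom, d, s, t, a)) + "\n")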

Either way, we retrieve the elements matched by our XPath expressions and display them. And there it is… magic, isn’t it?
Magic, but not all that usable in fact, so I suggest putting this data into a Pandas DataFrame. To do this, we will create a function that retrieves the data (via XPath) and builds a DataFrame with one column per XPath expression, so that each row ends up holding all the information for one game:

import pandas as pd

# XPath queries (one per field) and the matching DataFrame column names
# (the column labels are arbitrary; pick whatever suits you)
tags = ['//a[@class="gameTitleLink__196nPy"]',
        '//p[@class="description__1-Pqha"]',
        '//span[@class="releaseDate__1RvUmc"]',
        '//span[@class="editorialRating__1tYu_r"]',
        '//span[@class="userRating__1y96su"]']
cols = ['Name', 'Description', 'Release', 'TestScore', 'UserRating']

def getPage(url):
    page = requests.get(url)
    doc = lh.fromstring(page.content)

    # Get the web data via XPath
    content = []
    for i in range(len(tags)):
        content.append(doc.xpath(tags[i]))

    # Gather the data into one small DataFrame per tag
    df_liste = []
    for j in range(len(tags)):
        tmp = pd.DataFrame([content[j][i].text_content().strip() for i in range(len(content[j]))], columns=[cols[j]])
        tmp['key'] = tmp.index
        df_liste.append(tmp)

    # Build a single DataFrame with one tag (XPath) content per column
    liste = df_liste[0]
    for j in range(len(tags) - 1):
        liste = liste.join(df_liste[j + 1], on='key', how='left', lsuffix='_l', rsuffix='_r')
        liste['key'] = liste.index
        del liste['key_l']
        del liste['key_r']

    return liste

Calling the function on a page gives a DataFrame with one row per game and one column per field.
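For instance, here is a quick way to call it on the first page and check the output (using the tags and cols defined above):

df = getPage('http://www.jeuxvideo.com/meilleurs/')
print(df.shape)   # 20 rows: the site lists 20 games per page
print(df.head())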

What about pagination?

You will have noticed that we only collected the 20 best video games. This is quite logical, given that jeuxvideo.com displays its data in pages of 20! If we want to retrieve more, we will have to handle the site’s pagination ourselves. Simply put, we will loop over the URL parameter that controls pagination, namely http://www.jeuxvideo.com/meilleurs/?p=X to get page X.

Here is the function that loops over the pages to retrieve the _nbPages × 20 best video games:

# Pagination settings: page X of the ranking is at url + uri_pages + X
url = 'http://www.jeuxvideo.com/meilleurs/'
uri_pages = '?p='
nbPages = 10   # 10 pages of 20 games = the top 200

def getPages(_nbPages, _url):
    liste_finale = pd.DataFrame()
    for i in range(_nbPages):
        liste = getPage(_url + uri_pages + str(i + 1))
        liste_finale = pd.concat([liste_finale, liste], ignore_index=True)
    return liste_finale

liste_totale = getPages(nbPages, url)

To finish, you just have to save your data, for example to a CSV file:

liste_totale.to_csv('meilleursjeuvideo.csv', index=False)
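As a final sanity check, you can load the file back with pandas and make sure the row count matches what you expect:

# Reload the CSV to verify the export
check = pd.read_csv('meilleursjeuvideo.csv')
print(len(check))   # should be nbPages * 20, i.e. 200 here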

The idea of this article was to show you that scraping can be a very simple task (just as it can be a very complex one, in fact) and, above all, that this technique can be very useful for recovering data that is not necessarily made available otherwise.

Benoit Cayla