Covid-19: Recover French official raw data

Confined at home, we are all following the evolution of this damn virus, aren't we? What I propose in this article is to collect for yourself the official data from Santé Publique France and from the Ministry of Health, so that you can run your own analyses. These data are published on GitHub: https://github.com/opencovid19-fr/data but they are in YAML format and require a few transformations before they are usable.

YAML data format

YAML is a fairly easy format to work with: a kind of very simplified JSON that expresses structured data through indentation. Here is an example file:

date: "2020-01-24"
source:
  nom: Ministère des Solidarités et de la Santé
  url: https://solidarites-sante.gouv.fr/IMG/pdf/200124-cp_coronavirus.pdf
  archive: https://web.archive.org/web/20200322183046/https://solidarites-sante.gouv.fr/IMG/pdf/200124-cp_coronavirus.pdf
donneesNationales:
  casConfirmes: 3
  hospitalises: 3
  deces: 0
donneesRegionales:
  - nom: Île-de-France
    code: REG-11
    casConfirmes: 2
    hospitalises: 2
  - nom: Nouvelle-Aquitaine
    code: REG-75
    casConfirmes: 1
    hospitalises: 1

Fortunately, and as is often the case, Python provides a library that handles this format automatically and converts it into something directly usable: a JSON-like Python dictionary.

To do this, you must install the PyYAML package (which is imported as yaml):

pip install pyyaml

Then you can read the file with the load() method:

json_data = yaml.load(<file contents>, Loader=yaml.FullLoader)

The value returned in json_data is a Python dictionary that can be used exactly like JSON.
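For instance, here is a minimal sketch that parses the example file shown above and reads a few fields from the resulting dictionary (the file name covid_sample.yaml is only an illustration):

import yaml

# Parse the example YAML document shown above
with open("covid_sample.yaml", encoding="utf-8") as f:
    json_data = yaml.load(f, Loader=yaml.FullLoader)

# The result is a plain Python dictionary
print(json_data["date"])                               # "2020-01-24"
print(json_data["donneesNationales"]["casConfirmes"])  # 3
print(json_data["donneesRegionales"][0]["nom"])        # Île-de-France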

Raw data recovery

We will retrieve the data by querying GitHub directly for the raw files at: https://raw.githubusercontent.com/opencovid19-fr/data/master/sante-publique-france/

Each file is named in the YYYY-MM-DD.yaml format, so we will have to scan every date from the beginning of the publication up to today.

import yaml
import requests

r = requests.get('https://raw.githubusercontent.com/opencovid19-fr/data/master/sante-publique-france/2020-03-24.yaml')
if (r.status_code==200):
    print (yaml.load(r.text, Loader=yaml.FullLoader))

Building the name of the file to retrieve is not very complex: you just have to go through the dates one after the other and compose the corresponding GitHub file name:

from datetime import date, timedelta

hier = date.today() + timedelta(-1)
shier = hier.strftime("%Y-%m-%d")
fichier = 'https://raw.githubusercontent.com/opencovid19-fr/data/master/sante-publique-france/' + shier + '.yaml'

If we want to retrieve yesterday's file, here is a small function that does it in a few lines:

def RecupCovid19Hier():
    hier = date.today() + timedelta(-1)
    shier = hier.strftime("%Y-%m-%d")
    fichier = 'https://raw.githubusercontent.com/opencovid19-fr/data/master/sante-publique-france/' + shier + '.yaml'
    out = ""
    r = requests.get(fichier)
    if (r.status_code==200):
        out = yaml.load(r.text, Loader=yaml.FullLoader)
    return out
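For example, a quick usage sketch (it assumes yesterday's file has already been published in the repository):

# Fetch yesterday's file and display the national figures
hier_data = RecupCovid19Hier()
if hier_data:
    print(hier_data['date'], hier_data.get('donneesNationales'))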

Let's retrieve the data from Santé Publique France

The data has been published since January 23, 2020.

Here is the function that retrieves the data from the YAML files (one per day), using the principles explained above.

import yaml
import requests
import time
from datetime import datetime, timedelta, date
import pandas as pd

# INITIALISATION DE VARIABLES GLOBALES
sDateDebut = "2020-01-23"
sAujourdhui = date.today().strftime("%Y-%m-%d")
dDemarrageDate = datetime.fromisoformat(sDateDebut) 
cols = ['Date', 
        'FR Tot Cas Confirmés', 
        'FR Tot Décès', 
        'WW Tot Cas Confirmés', 
        'WW Tot Décès', 
        'Source',
        'FR Tot Guéris',
        'FR Tot Hospitalisés'
       ]

# FONCTIONS
def recupDataWithException2(_jsonout, _index1, _index2):
    # Return 0 when the requested key is missing from the YAML document
    try:
        data = _jsonout[_index1][_index2]
    except (KeyError, TypeError):
        data = 0
    return data

def AjouteSantePubliqueFrLigne(jsonout, df):
    df = df.append({
                    cols[0] : jsonout['date'] , 
                    cols[1] : recupDataWithException2(jsonout, 'donneesNationales', 'casConfirmes') , 
                    cols[2] : recupDataWithException2(jsonout, 'donneesNationales', 'deces'), 
                    cols[3] : recupDataWithException2(jsonout,'donneesMondiales', 'casConfirmes'),
                    cols[4] : recupDataWithException2(jsonout,'donneesMondiales', 'deces'),
                    cols[5] : 'sante-publique-france',
                    cols[6] : 0,
                    cols[7] : 0
                 } , 
                 ignore_index=True)
    return df

def ajouteLigneVide(df, madate):
    df = df.append({
                    cols[0] : madate , 
                    cols[1] : 0,
                    cols[2] : 0, 
                    cols[3] : 0,
                    cols[4] : 0,
                    cols[5] : 'no-data',
                    cols[6] : 0,
                    cols[7] : 0
                 } , 
                 ignore_index=True)
    return df

# RECUPERATION DONNEES SANTE PUBLIQUE FRANCE
def RecupereDonneesSantePubliqueFrance():
    print ('--> Démarrage du process')
    sDateParcours = sDateDebut
    i=1
    requesReturnCode=200
    yamls = pd.DataFrame()
    while (sDateParcours != sAujourdhui):
        myDate = dDemarrageDate + timedelta(days=i)
        sDateParcours = myDate.strftime("%Y-%m-%d")
        fichier = 'https://raw.githubusercontent.com/opencovid19-fr/data/master/sante-publique-france/' + sDateParcours + '.yaml'
        #print ('--> sante-publique-france / ', sDateParcours)
        req = requests.get(fichier)
        requesReturnCode = req.status_code
        if (requesReturnCode==200):
            yamlout = yaml.load(req.text, Loader=yaml.FullLoader)
            yamls = AjouteSantePubliqueFrLigne(yamlout, yamls)
        else:
            print ("(*) Pas de données pour ", myDate.strftime("%Y-%m-%d"))
            # ajoute une ligne vide
            yamls = ajouteLigneVide(yamls, myDate.strftime("%Y-%m-%d"))
        i=i+1
    print ('--> Fin du process')
    return yamls
    
data = RecupereDonneesSantePubliqueFrance()
data

This function returns a Pandas dataframe containing the Santé Publique France data.

Date 	FR Tot Cas Confirmés 	FR Tot Décès 	FR Tot Guéris 	FR Tot Hospitalisés 	Source 	WW Tot Cas Confirmés 	WW Tot Décès
0 	2020-01-24 	3.0 	0.0 	0.0 	0.0 	sante-publique-france 	897.0 	26.0
1 	2020-01-25 	3.0 	0.0 	0.0 	0.0 	sante-publique-france 	1329.0 	41.0
2 	2020-01-26 	3.0 	0.0 	0.0 	0.0 	sante-publique-france 	2026.0 	56.0
3 	2020-01-27 	3.0 	0.0 	0.0 	0.0 	sante-publique-france 	2820.0 	81.0
4 	2020-01-28 	4.0 	0.0 	0.0 	0.0 	sante-publique-france 	0.0 	0.0
... 	... 	... 	... 	... 	... 	... 	... 	...
64 	2020-03-28 	37575.0 	2314.0 	0.0 	0.0 	sante-publique-france 	591971.0 	27090.0
65 	2020-03-29 	0.0 	0.0 	0.0 	0.0 	no-data 	0.0 	0.0
66 	2020-03-30 	0.0 	0.0 	0.0 	0.0 	no-data 	0.0 	0.0
67 	2020-03-31 	0.0 	0.0 	0.0 	0.0 	no-data 	0.0 	0.0
68 	2020-04-01 	0.0 	0.0 	0.0 	0.0 	no-data 	0.0 	0.0

69 rows × 8 columns

Data recovery from the Ministry of Health

The retrieval follows the same principle. However, this time we are not going to create a new Pandas dataframe: we will instead enrich the data already retrieved with the ministry's figures (in particular some additional information such as the number of recoveries and the number of hospitalizations).

Here is the function that retrieves and enriches the previous dataframe:

# RECUPERATION DONNEES MINISTERE DE LA SANTE
def RecupereDonneesMinistereSante(yamls):
    i=1
    sDateParcours = sDateDebut
    print ('--> Démarrage du process')
    while (sDateParcours != sAujourdhui):
        myDate = dDemarrageDate + timedelta(days=i)
        sDateParcours = myDate.strftime("%Y-%m-%d")
        #print("Traintement date: ",  dateParcours)
        fichier = 'https://raw.githubusercontent.com/opencovid19-fr/data/master/ministere-sante/' + sDateParcours + '.yaml'
        req = requests.get(fichier)
        requesReturnCode = req.status_code
        if (requesReturnCode==200):
            yamlout = yaml.load(req.text, Loader=yaml.FullLoader)
            if (yamls[cols[5]][i-1] == "no-data"):
                print ("(*) Modifie les données manquantes pour la date du ", myDate.strftime("%Y-%m-%d"))
                # Modifie toute la ligne
                yamls.loc[i-1, cols[1]] = recupDataWithException2(yamlout, 'donneesNationales', 'casConfirmes')
                yamls.loc[i-1, cols[2]]  = recupDataWithException2(yamlout, 'donneesNationales', 'deces')
                yamls.loc[i-1, cols[3]]  = recupDataWithException2(yamlout, 'donneesMondiales', 'casConfirmes')
                yamls.loc[i-1, cols[4]]  = recupDataWithException2(yamlout, 'donneesMondiales', 'deces')
                yamls.loc[i-1, cols[5]]  = "ministere-sante"

            # Ajoute les nouveaux champs
            yamls.loc[i-1, cols[6]] = recupDataWithException2(yamlout, 'donneesNationales', 'gueris')
            yamls.loc[i-1, cols[7]] = recupDataWithException2(yamlout, 'donneesNationales', 'hospitalises')
        i=i+1
    print ('--> Fin du process')
    return yamls
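To use it, simply pass in the dataframe built in the previous step and keep the enriched result:

# Enrich the Santé Publique France data with the Ministry of Health figures
data = RecupereDonneesMinistereSante(data)
data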

Some finishing touches on the data

The raw data retrieved are cumulative figures. In addition, we can see that some days were never officially published (the no-data rows above). As a consequence, the series contains "holes", which is rather annoying for cumulative data. We will fill these holes by carrying the previous day's values forward:

def Bouchetrous(yamls):
    # On reparcourt toutes les lignes déjà récupérées précédemment
    i=1
    for index, row in yamls.iterrows():
        if (yamls[cols[5]][i-1] == "no-data"):
            for j in [1, 2, 3, 4, 6, 7]:
                yamls.loc[i-1, cols[j]] = yamls[cols[j]][i-2]
        i += 1
    return yamls
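Applied to the dataframe we have been building:

# Fill the gaps left by the days that were never published
data = Bouchetrous(data)
data

A more idiomatic pandas alternative would be to turn the values of the no-data rows into NaN and then call ffill(), but the explicit loop above keeps the logic easy to follow.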

For those who want to go further, it might also be interesting to "disaggregate" these cumulative data (for example to derive daily figures), or to perform other operations.
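As a hint, here is a minimal sketch of such a disaggregation, deriving daily figures from the cumulative totals (the new column names are only an illustration):

# Daily new cases and deaths, derived from the cumulative columns
data['FR Nouveaux Cas'] = data['FR Tot Cas Confirmés'].diff().fillna(0)
data['FR Nouveaux Décès'] = data['FR Tot Décès'].diff().fillna(0)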

The result?

Here is an overview of the data, which unfortunately shows that the peak of the epidemic has not yet been reached.
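If you want to plot this kind of overview yourself, here is a minimal sketch using matplotlib (an extra dependency, not required by the rest of this article):

import matplotlib.pyplot as plt

# Plot the cumulative confirmed cases and deaths over time
ax = data.plot(x='Date', y=['FR Tot Cas Confirmés', 'FR Tot Décès'], figsize=(12, 6))
ax.set_title('Covid-19 France - cumulative figures')
plt.show()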

You can grab the full code from GitHub

