Introduction to XML

XML is a widely used data encoding format that can be read by humans as well as by programs. It is quite verbose and poorly suited to storing very large data sets, but it is common enough that you will have to work with it sooner or later. Python's xml library simplifies the process.

The goal here is not to explain XML but to extract the data we need from an XML file. For this we will use an XML file of fuel prices at the pump (see https://www.prix-carburants.gouv.fr/rubrique/opendata/ ).

In [1]:
# you don't need to understand this cell, it just downloads a zip file from a URL and reads the first file it contains
import requests
import zipfile 
import io
import xmltodict

response = requests.get("https://donnees.roulez-eco.fr/opendata/jour", stream=True)
z = zipfile.ZipFile(io.BytesIO(response.content))
xmlfile = z.read(z.filelist[0].filename)

Let's see what an XML file looks like (only the beginning because it's long):

In [2]:
xmlfile[:1000]
Out[2]:
b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n<pdv_liste>\n  <pdv id="1000001" latitude="4620114" longitude="519791" cp="01000" pop="R">\n    <adresse>596 AVENUE DE TREVOUX</adresse>\n    <ville>SAINT-DENIS-L\xe8S-BOURG</ville>\n    <services>\n      <service>Station de gonflage</service>\n      <service>Vente de gaz domestique (Butane, Propane)</service>\n      <service>DAB (Distributeur automatique de billets)</service>\n    </services>\n    <prix nom="Gazole" id="1" maj="2022-02-25T10:32:40" valeur="1710"/>\n    <prix nom="SP95" id="2" maj="2022-02-25T10:08:42" valeur="1809"/>\n    <prix nom="SP98" id="6" maj="2022-02-25T10:08:43" valeur="1842"/>\n    <rupture id="3" nom="E85" debut="2017-09-16T09:50:23" fin=""/>\n    <rupture id="4" nom="GPLc" debut="2017-09-16T09:50:23" fin=""/>\n    <rupture id="5" nom="E10" debut="2018-12-13T09:49:49" fin=""/>\n  </pdv>\n  <pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">\n    <adresse>16 Avenue de Marboz</adresse>\n    <vill'

Let's say we want to retrieve the mailing address of every station that sells unleaded 95 gas (SP95 or E10), together with the price.

We see that the information is stored in a tree whose root is pdv_liste. The information that interests us is in

  • each pdv element, which is a point of sale,
  • its adresse (address), ville (city) and prix (price) sub-elements,
  • the cp attribute (postal code) of the pdv element and the valeur attribute (value) of prix.

Note that there are several prix sub-elements per point of sale, so we have to pick the right one.

Our goal is to store all this information in a table where each row has the fields adresse, cp, ville and prix (prix here being the valeur attribute of the prix element).

Using the xml library

In [3]:
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlfile)  

If you want to read an XML file from disk rather than bytes already in memory, then you have to do:

tree = ET.parse('data.xml')
root = tree.getroot()
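
As a minimal sketch, the two entry points can be compared on a tiny document. The sample bytes and the temporary file are assumptions made for illustration; they are not part of the fuel price data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical two-station sample, far smaller than the real file
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<pdv_liste><pdv id="1" cp="01000"/><pdv id="2" cp="01000"/></pdv_liste>'
)

# Parse bytes that are already in memory
root_from_bytes = ET.fromstring(sample)

# Parse the same content from a file on disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.xml")
    with open(path, "wb") as f:
        f.write(sample)
    root_from_file = ET.parse(path).getroot()

print(root_from_bytes.tag, len(root_from_bytes))
print(root_from_file.tag, len(root_from_file))
```

Both calls should give an equivalent root element; only the source of the data differs.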

The root is the first element.

Let's check that our root is called pdv_liste and that there are no attributes for this element:

In [4]:
print(root.tag, root.attrib)
pdv_liste {}

In addition to its tag name (tag) and its attributes (attrib), each element falls into one of three cases:

  1. the element has sub-elements,
  2. the element has no sub-elements: it is a leaf and has content (a text, a number),
  3. the element has sub-elements and also content, which can come before or after the sub-elements.

The possible operations for each case are as follows:

  1. we can
    • iterate over the element with for x in element:
    • find a sub-element with find or several sub-elements with findall
  2. we can retrieve the value of the element with text.
  3. we can
    • iterate or search for one or more sub-elements
    • retrieve the value before the first sub-element with text
    • retrieve the value after a sub-element with the tail of that sub-element
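
The third case is the least intuitive, so here is a small sketch of how text and tail behave. The `<p>` element below is a made-up example of mixed content, not something found in the fuel price file.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content element: text, a sub-element, then more text
p = ET.fromstring("<p>before <b>bold</b> after</p>")
b = p.find("b")

print(repr(p.text))  # text of p located before its first sub-element
print(repr(b.text))  # content of the leaf b
print(repr(b.tail))  # text that follows </b>; it is stored on b, not on p
print(repr(p.tail))  # nothing follows </p>

# itertext() walks text and tail in document order
full_text = "".join(p.itertext())
print(full_text)
```

Note that the text after a sub-element belongs to that sub-element's tail, which is why the parent's own tail is empty here.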
In [5]:
for element in root:
    print(element.tag)
    print(element.attrib)
    print(element.find('ville').text)
    break # it would be too long
len(root)
pdv
{'id': '1000001', 'latitude': '4620114', 'longitude': '519791', 'cp': '01000', 'pop': 'R'}
SAINT-DENIS-LèS-BOURG
Out[5]:
13325

So we can write our program.

In [6]:
result = []

for element in root:
    cp = element.attrib['cp']
    adresse = element.find('adresse').text
    ville = element.find('ville').text
    for p in element.findall('prix'):
        if p.attrib['nom'] == 'SP95' or p.attrib['nom'] == 'E10':
            prix = int(p.attrib['valeur']) / 1000
            result.append([adresse, cp, ville, prix])
In [7]:
result[0]
Out[7]:
['596 AVENUE DE TREVOUX', '01000', 'SAINT-DENIS-LèS-BOURG', 1.809]
In [8]:
len(result)
Out[8]:
11097
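
As a possible variant, the test on the nom attribute can be pushed into the search path, since find and findall accept a limited XPath syntax with attribute predicates. The sketch below runs on a one-station sample copied from the beginning of the file shown earlier (trimmed for brevity); the names sample, sample_root and rows are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# One station copied from the start of the file, with services and rupture elements removed
sample = """<pdv_liste>
  <pdv id="1000001" latitude="4620114" longitude="519791" cp="01000" pop="R">
    <adresse>596 AVENUE DE TREVOUX</adresse>
    <ville>SAINT-DENIS-LèS-BOURG</ville>
    <prix nom="Gazole" id="1" maj="2022-02-25T10:32:40" valeur="1710"/>
    <prix nom="SP95" id="2" maj="2022-02-25T10:08:42" valeur="1809"/>
    <prix nom="SP98" id="6" maj="2022-02-25T10:08:43" valeur="1842"/>
  </pdv>
</pdv_liste>"""
sample_root = ET.fromstring(sample)

rows = []
for pdv in sample_root.findall("pdv"):
    for nom in ("SP95", "E10"):
        # The predicate [@nom='...'] keeps only the prix elements with that nom
        for p in pdv.findall(f"prix[@nom='{nom}']"):
            rows.append([
                pdv.findtext("adresse"),      # text of the sub-element, or None if absent
                pdv.get("cp"),                # attribute value, or None if absent
                pdv.findtext("ville"),
                int(p.get("valeur")) / 1000,  # same conversion as in the program above
            ])

print(rows)
```

Unlike element.find('adresse').text, findtext returns None instead of raising an exception when the sub-element is missing, which may be convenient if some stations are incomplete.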

More

This presentation is only an introduction to XML in Python. To learn how to

  • avoid security vulnerabilities when reading an XML file retrieved from the web,
  • write an XML file,
  • use a schema that describes the XML,
  • and other tricks,

see

  • the Documentation of Python's xml library,
  • the xmltodict library to make life simpler.
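
To give a first idea of the writing side, here is a minimal sketch that builds a small tree and serializes it with the same ElementTree module. The element and attribute names reuse those of the fuel price file, but the tree itself and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree by hand (illustrative content only)
root_out = ET.Element("pdv_liste")
pdv = ET.SubElement(root_out, "pdv", id="1000001", cp="01000")
ET.SubElement(pdv, "ville").text = "SAINT-DENIS-LèS-BOURG"
ET.SubElement(pdv, "prix", nom="SP95", valeur="1809")

# Serialize to bytes; to write a file instead, one could use
# ET.ElementTree(root_out).write("out.xml", encoding="utf-8", xml_declaration=True)
xml_bytes = ET.tostring(root_out, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Reading the bytes back should give the same values
check = ET.fromstring(xml_bytes)
print(check.find("pdv/prix").get("valeur"))
```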
