Introduction to XML
XML is a widely used data encoding format that can be read by a human as well as by a program. It is quite verbose and certainly not suitable for storing very large data sets; however, it is common enough that you will have to use it sooner or later. Python's xml library simplifies the process.
The goal here is not to explain XML but to extract from an XML file the data that interests us. For this we will use an XML file of fuel prices at the pump (see https://www.prix-carburants.gouv.fr/rubrique/opendata/).
# You don't need to understand this cell: it just downloads a zip file from a URL and opens it.
import requests
import zipfile
import io
import xmltodict
# download the archive
response = requests.get("https://donnees.roulez-eco.fr/opendata/jour", stream=True)
# open the downloaded bytes as a zip archive, without writing to disk
z = zipfile.ZipFile(io.BytesIO(response.content))
# read the first file of the archive as bytes
xmlfile = z.read(z.filelist[0].filename)
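To see what the download cell does without touching the network, here is a minimal offline sketch. It builds a small zip archive in memory (the file name sample.xml and its content are made up for illustration) and then reads the first member back with the same steps as the cell above.

```python
import io
import zipfile

# Hypothetical stand-in for response.content: a zip archive built in memory.
sample_xml = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<pdv_liste/>'
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.xml", sample_xml)
zip_bytes = buffer.getvalue()

# Same steps as the download cell: wrap the bytes in a file-like object,
# open it as a zip archive, and read the first member as bytes.
z = zipfile.ZipFile(io.BytesIO(zip_bytes))
first_name = z.namelist()[0]
xml_bytes = z.read(first_name)
print(first_name, xml_bytes[:20])
```

The result is a bytes object, which is why the XML shown later starts with b'...'.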
Let's see what an XML file looks like (only the beginning because it's long):
xmlfile[:1000]
b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n<pdv_liste>\n <pdv id="1000001" latitude="4620114" longitude="519791" cp="01000" pop="R">\n <adresse>596 AVENUE DE TREVOUX</adresse>\n <ville>SAINT-DENIS-L\xe8S-BOURG</ville>\n <services>\n <service>Station de gonflage</service>\n <service>Vente de gaz domestique (Butane, Propane)</service>\n <service>DAB (Distributeur automatique de billets)</service>\n </services>\n <prix nom="Gazole" id="1" maj="2022-02-25T10:32:40" valeur="1710"/>\n <prix nom="SP95" id="2" maj="2022-02-25T10:08:42" valeur="1809"/>\n <prix nom="SP98" id="6" maj="2022-02-25T10:08:43" valeur="1842"/>\n <rupture id="3" nom="E85" debut="2017-09-16T09:50:23" fin=""/>\n <rupture id="4" nom="GPLc" debut="2017-09-16T09:50:23" fin=""/>\n <rupture id="5" nom="E10" debut="2018-12-13T09:49:49" fin=""/>\n </pdv>\n <pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">\n <adresse>16 Avenue de Marboz</adresse>\n <vill'
Let's say we want to retrieve the postal address of every station that sells unleaded 95 fuel (SP95 or E10), together with its price.
We see that the information is stored in a tree whose root is pdv_liste.
The information that interests us is in
- each pdv element, which is a point of sale,
- the adresse, ville and prix sub-elements,
- the cp attribute of the pdv element and the valeur attribute of prix.
Note that there are several prix sub-elements per point of sale, so we have to pick the right one.
Our goal is to store all this information in a table in which each row has the fields adresse, cp, ville, prix (prix here being the value of the valeur attribute).
Using the xml library
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlfile)
If you want to read an XML file from disk rather than data already downloaded into memory, do this instead:
tree = ET.parse('data.xml')
root = tree.getroot()
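ET.parse also accepts an open file object, so the two entry points can be compared on a small sample. The sketch below uses a hand-written sample (illustrative values, not taken from the real file), and io.BytesIO stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the fuel-price data (values are illustrative).
sample = b'<pdv_liste><pdv id="1" cp="01000"><ville>BOURG</ville></pdv></pdv_liste>'

# From bytes already in memory: fromstring returns the root element directly.
root_from_bytes = ET.fromstring(sample)

# From a file name or file object: parse returns an ElementTree,
# and getroot() gives its root element.
tree = ET.parse(io.BytesIO(sample))
root_from_file = tree.getroot()

print(root_from_bytes.tag, root_from_file.tag)
```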
The root is the first element.
Let's check that our root is called pdv_liste and that there are no attributes for this element:
print(root.tag, root.attrib)
pdv_liste {}
In addition to its name (tag) and its attributes (attrib), each element falls into one of three cases:
- the element has sub-elements,
- the element has no sub-elements: it is a leaf and has content (a text, a number),
- the element has both sub-elements and content, which can come before or after the sub-elements.
The possible operations in each case are as follows:
- with sub-elements only, we can
  - iterate over the element with for x in element:
  - find one sub-element with find, or all matching sub-elements with findall
- for a leaf, we can retrieve the value of the element with text
- with both sub-elements and content, we can
  - iterate or search for one or more sub-elements
  - retrieve the text before the first sub-element with text
  - retrieve the text after a sub-element with the tail of that sub-element
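The three cases can be tried on a small hand-written sample. This is only a sketch: the element names a, b, c and d are made up and do not come from the fuel-price file.

```python
import xml.etree.ElementTree as ET

# Hand-written sample covering the three cases.
sample = "<a><b>leaf</b><c>before<d/>after</c></a>"
a = ET.fromstring(sample)

# Case 1: <a> only has sub-elements, so we iterate or search.
child_tags = [child.tag for child in a]
b = a.find("b")          # first matching sub-element (or None)
all_b = a.findall("b")   # list of all matching sub-elements

# Case 2: <b> is a leaf, its content is in text.
leaf_value = b.text

# Case 3: <c> mixes content and a sub-element.
c = a.find("c")
d = c.find("d")
text_before = c.text     # text before the first sub-element of <c>
text_after = d.tail      # text after <d/> is stored on <d> itself

print(child_tags, leaf_value, text_before, text_after)
```

Note that the text following a sub-element is attached to that sub-element, not to its parent, which is why the last value is read from d and not from c.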
for element in root:
    print(element.tag)
    print(element.attrib)
    print(element.find('ville').text)
    break  # it would be too long
len(root)
pdv
{'id': '1000001', 'latitude': '4620114', 'longitude': '519791', 'cp': '01000', 'pop': 'R'}
SAINT-DENIS-LèS-BOURG
13325
So we can write our program.
result = []
for element in root:  # each element is a pdv (point of sale)
    cp = element.attrib['cp']
    adresse = element.find('adresse').text
    ville = element.find('ville').text
    # a pdv has several prix sub-elements: keep only SP95 and E10
    for p in element.findall('prix'):
        if p.attrib['nom'] in ('SP95', 'E10'):
            # convert valeur (e.g. '1809') to 1.809
            prix = int(p.attrib['valeur']) / 1000
            result.append([adresse, cp, ville, prix])
result[0]
['596 AVENUE DE TREVOUX', '01000', 'SAINT-DENIS-LèS-BOURG', 1.809]
len(result)
11097
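The find and findall methods of ElementTree accept a limited XPath syntax, including attribute predicates, which offers an alternative to testing the nom attribute by hand. Here is a minimal sketch on a hand-written sample; the helper name fuel_rows and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the fuel-price file (illustrative values).
sample = """<pdv_liste>
  <pdv id="1" cp="01000">
    <adresse>596 AVENUE DE TREVOUX</adresse>
    <ville>BOURG</ville>
    <prix nom="Gazole" valeur="1710"/>
    <prix nom="SP95" valeur="1809"/>
  </pdv>
  <pdv id="2" cp="01000">
    <adresse>16 Avenue de Marboz</adresse>
    <ville>BOURG</ville>
    <prix nom="Gazole" valeur="1700"/>
  </pdv>
</pdv_liste>"""


def fuel_rows(root, fuels=("SP95", "E10")):
    """Return [adresse, cp, ville, prix] rows for the requested fuels."""
    rows = []
    for pdv in root:
        for fuel in fuels:
            # [@nom='...'] keeps only the prix elements with that attribute value.
            for p in pdv.findall(f"prix[@nom='{fuel}']"):
                rows.append([
                    pdv.findtext("adresse"),
                    pdv.attrib["cp"],
                    pdv.findtext("ville"),
                    int(p.attrib["valeur"]) / 1000,
                ])
    return rows


rows = fuel_rows(ET.fromstring(sample))
print(rows)
```

One difference from the loop above: for a station that sells both fuels, the rows come out in the order of the fuels tuple rather than in document order.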
More
This presentation is only an introduction to XML in Python. To learn more, for example how to
- avoid security vulnerabilities when reading an XML file retrieved from the web,
- write an XML file,
- use a schema that describes the XML,
- and other tricks,
see
- the documentation of Python's xml library,
- the xmltodict library, which makes life simpler.
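As a small taste of one of the topics listed above, writing XML, here is a sketch that uses only the standard library. The element names mirror the fuel-price file, but the values are illustrative.

```python
import xml.etree.ElementTree as ET

# Build a tiny tree by hand.
root = ET.Element("pdv_liste")
pdv = ET.SubElement(root, "pdv", {"id": "1", "cp": "01000"})
ET.SubElement(pdv, "ville").text = "BOURG"
ET.SubElement(pdv, "prix", {"nom": "SP95", "valeur": "1809"})

# Serialize to bytes. To write a file instead, one could use
# ET.ElementTree(root).write(path, encoding="utf-8").
xml_bytes = ET.tostring(root, encoding="utf-8")
print(xml_bytes)

# Round trip: parsing the bytes gives back the same structure.
again = ET.fromstring(xml_bytes)
print(again.find("pdv/prix").attrib)
```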