XML is a widely used data format that can be read by a human as well as by a program. It is quite verbose and poorly suited to storing very large datasets, but it is common enough that you will have to use it sooner or later. Python's xml library simplifies the process.
The goal here is not to explain XML but to extract the data that interests us from an XML file. For this we will use an XML file of fuel prices at the pump (see https://www.prix-carburants.gouv.fr/rubrique/opendata/ ).
# you don't need to understand this cell, it just downloads and opens a zip file from a URL
import requests
import zipfile
import io
import xmltodict
response = requests.get("https://donnees.roulez-eco.fr/opendata/jour", stream=True)
z = zipfile.ZipFile(io.BytesIO(response.content))
xmlfile = z.read(z.filelist[0].filename)
Let's see what an XML file looks like (only the beginning because it's long):
xmlfile[:1000]
b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n<pdv_liste>\n <pdv id="1000001" latitude="4620114" longitude="519791" cp="01000" pop="R">\n <adresse>596 AVENUE DE TREVOUX</adresse>\n <ville>SAINT-DENIS-L\xe8S-BOURG</ville>\n <services>\n <service>Station de gonflage</service>\n <service>Vente de gaz domestique (Butane, Propane)</service>\n <service>DAB (Distributeur automatique de billets)</service>\n </services>\n <prix nom="Gazole" id="1" maj="2022-02-25T10:32:40" valeur="1710"/>\n <prix nom="SP95" id="2" maj="2022-02-25T10:08:42" valeur="1809"/>\n <prix nom="SP98" id="6" maj="2022-02-25T10:08:43" valeur="1842"/>\n <rupture id="3" nom="E85" debut="2017-09-16T09:50:23" fin=""/>\n <rupture id="4" nom="GPLc" debut="2017-09-16T09:50:23" fin=""/>\n <rupture id="5" nom="E10" debut="2018-12-13T09:49:49" fin=""/>\n </pdv>\n <pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">\n <adresse>16 Avenue de Marboz</adresse>\n <vill'
Let's say you want to retrieve the postal address of every station that sells unleaded 95 (SP95 or E10), together with the price.
We see that the information is stored in a tree whose root is pdv_liste.
The information that interests us is in:
- the pdv element, which is a point of sale,
- its adresse, ville and prix sub-elements,
- the cp attribute of the pdv element and the valeur attribute of prix.
Note that there are several prix sub-elements per point of sale and that it is therefore necessary to choose the correct one.
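As a minimal sketch of choosing the correct prix sub-element, the snippet below parses a small hand-written sample (shortened from the extract above, not the real file) and selects one price by its nom attribute. It relies on the limited XPath support in ElementTree, where a predicate such as [@nom='SP95'] filters on an attribute value. The names sample, sample_root and sp95_price are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the extract above (illustrative, not the real file)
sample = """<pdv_liste>
  <pdv id="1000001" cp="01000">
    <adresse>596 AVENUE DE TREVOUX</adresse>
    <prix nom="Gazole" id="1" valeur="1710"/>
    <prix nom="SP95" id="2" valeur="1809"/>
    <prix nom="SP98" id="6" valeur="1842"/>
  </pdv>
</pdv_liste>"""

sample_root = ET.fromstring(sample)
pdv = sample_root.find("pdv")

# Names of all the prix sub-elements of this point of sale
all_names = [p.attrib["nom"] for p in pdv.findall("prix")]

# Only the prix whose nom attribute is SP95 (find returns None if there is none)
sp95 = pdv.find("prix[@nom='SP95']")
sp95_price = int(sp95.attrib["valeur"]) / 1000

print(all_names)
print(sp95_price)
```

The same selection can be written as a loop with an if test on p.attrib['nom']; the predicate form is simply shorter when only one fuel is wanted.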
Our goal is to store all this information in a table in which each row has the fields adresse, cp, ville and prix (prix here being the value of the price).
The xml library
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlfile)
Here the XML is already in memory as bytes. If you want to read it from a file on disk instead, you have to do:
tree = ET.parse('data.xml')
root = tree.getroot()
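To compare the two ways of loading, here is a small sketch that writes a tiny illustrative document to a temporary file (a stand-in for data.xml, not the real data), parses it with ET.parse, and parses the same content from bytes with ET.fromstring. The variable names are assumptions made for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Tiny illustrative document, declared with the same encoding as the real file
content = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<pdv_liste><pdv id="1" cp="01000"/></pdv_liste>'
)

# Write it to a temporary file that plays the role of data.xml
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="ISO-8859-1"
) as f:
    f.write(content)
    path = f.name

try:
    file_root = ET.parse(path).getroot()                       # from a file on disk
    bytes_root = ET.fromstring(content.encode("ISO-8859-1"))   # from bytes in memory
finally:
    os.remove(path)

print(file_root.tag, bytes_root.tag)
```

Both calls give an Element for the root, so the rest of the code is the same whichever way the data was loaded.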
The root is the first element.
Let's check that our root is called pdv_liste and that there are no attributes for this element:
print(root.tag, root.attrib)
pdv_liste {}
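In addition to its tag name and its attrib attributes, an element may have sub-elements, text of its own, and text that follows it.
The possible operations are as follows:
- iterate over the sub-elements with for x in element:
- retrieve the first matching sub-element with find, or all matching sub-elements with findall
- retrieve the text of the element with .text
- retrieve the text that follows the element's closing tag with .tail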
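These operations can be tried on a small hand-written element before running them on the full file. The sketch below is illustrative only: the element content and the names doc, children and all_prix are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written element: one ville followed by tail text, then two prix
doc = ET.fromstring(
    "<pdv cp='01000'><ville>BOURG</ville> after ville"
    "<prix nom='SP95'/><prix nom='E10'/></pdv>"
)

children = [child.tag for child in doc]   # iterate over the direct sub-elements
first_prix = doc.find("prix")             # first matching sub-element
all_prix = doc.findall("prix")            # list of all matching sub-elements
ville = doc.find("ville")
missing = doc.find("adresse")             # no such sub-element here

print(children)
print(first_prix.attrib["nom"], len(all_prix))
print(ville.text)         # text inside <ville>...</ville>
print(repr(ville.tail))   # text between </ville> and the next tag
print(missing)
```

Note that find returns None when nothing matches, so calling .text on its result fails if the sub-element is absent.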
for element in root:
    print(element.tag)
    print(element.attrib)
    print(element.find('ville').text)
    break  # it would be too long
len(root)
pdv
{'id': '1000001', 'latitude': '4620114', 'longitude': '519791', 'cp': '01000', 'pop': 'R'}
SAINT-DENIS-LèS-BOURG
13325
So we can write our program.
result = []
for element in root:  # each element is a pdv (point of sale)
    cp = element.attrib['cp']
    adresse = element.find('adresse').text
    ville = element.find('ville').text
    for p in element.findall('prix'):
        # keep only the unleaded 95 prices
        if p.attrib['nom'] in ('SP95', 'E10'):
            # valeur is stored as an integer, e.g. 1809 for 1.809
            prix = int(p.attrib['valeur']) / 1000
            result.append([adresse, cp, ville, prix])
result[0]
['596 AVENUE DE TREVOUX', '01000', 'SAINT-DENIS-LèS-BOURG', 1.809]
len(result)
11097
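Since the goal was a table with the fields adresse, cp, ville and prix, one possible next step is to write the rows out as CSV. This is only a sketch using the standard csv module, not something the original notebook does. It uses a single row copied from the output above, and in the notebook you would pass result instead of rows.

```python
import csv
import io

# One row copied from the output above (stand-in for the full result list)
rows = [["596 AVENUE DE TREVOUX", "01000", "SAINT-DENIS-LèS-BOURG", 1.809]]

# Write to an in-memory buffer; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["adresse", "cp", "ville", "prix"])  # header line
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```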
This presentation is only an introduction to XML in Python. To go further, look at the xmltodict library, which can make life simpler.