.. _chshhtmlrst:
===================
Cheat Sheet on HTML
===================
Parse HTML and extract information.
.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()
.. contents::
    :local:
Parse with Python and convert it into JSON
------------------------------------------
Inspired by `Convert HTML into
JSON `__, we convert
a page into JSON.
.. code:: ipython3

    import html.parser


    class HTMLtoJSONParser(html.parser.HTMLParser):

        def __init__(self, raise_exception=True):
            html.parser.HTMLParser.__init__(self)
            self.doc = {}
            self.path = []
            self.cur = self.doc
            self.line = 0
            self.raise_exception = raise_exception

        @property
        def json(self):
            return self.doc

        @staticmethod
        def to_json(content, raise_exception=True):
            parser = HTMLtoJSONParser(raise_exception=raise_exception)
            parser.feed(content)
            return parser.json

        def handle_starttag(self, tag, attrs):
            self.path.append(tag)
            attrs = {k: v for k, v in attrs}
            if tag in self.cur:
                if isinstance(self.cur[tag], list):
                    self.cur[tag].append({"__parent__": self.cur})
                    self.cur = self.cur[tag][-1]
                else:
                    self.cur[tag] = [self.cur[tag]]
                    self.cur[tag].append({"__parent__": self.cur})
                    self.cur = self.cur[tag][-1]
            else:
                self.cur[tag] = {"__parent__": self.cur}
                self.cur = self.cur[tag]
            for a, v in attrs.items():
                self.cur["#" + a] = v
            self.cur[""] = ""

        def handle_endtag(self, tag):
            if tag != self.path[-1] and self.raise_exception:
                raise Exception("html is malformed around line: {0} (it might be "
                                "because of a tag not closed)".format(self.line))
            del self.path[-1]
            memo = self.cur
            self.cur = self.cur["__parent__"]
            self.clean(memo)

        def handle_data(self, data):
            self.line += data.count("\n")
            if "" in self.cur:
                self.cur[""] += data

        def clean(self, values):
            # strip whitespace-only text and drop the parent back-reference
            keys = list(values.keys())
            for k in keys:
                v = values[k]
                if isinstance(v, str):
                    c = v.strip(" \n\r\t")
                    if c != v:
                        if len(c) > 0:
                            values[k] = c
                        else:
                            del values[k]
            del values["__parent__"]
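As a quick sanity check on the callbacks the class relies on, here is a sketch with a hypothetical ``EventLogger``: it shows what ``handle_starttag``, ``handle_data`` and ``handle_endtag`` receive, in particular that ``attrs`` arrives as a list of ``(name, value)`` pairs, which the class above turns into a dict.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every parsing event in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, converted here to a dict
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<a href="/doc">text</a>')
print(logger.events)
# [('start', 'a', {'href': '/doc'}), ('data', 'text'), ('end', 'a')]
```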
The following page, `Informations surfaciques du PLU (doc. du 10.09.2010) de
la commune de
Bannay `__,
contains some links we need to extract. We cache the page to avoid
downloading it again every time we run the script.
.. code:: ipython3

    import os

    cache = "cache_content.html.bytes"
    if not os.path.exists(cache):
        import urllib.request
        url = "https://www.data.gouv.fr/fr/datasets/informations-surfaciques-du-plu-doc-du-10-09-2010-de-la-commune-de-bannay/"
        with urllib.request.urlopen(url) as response:
            content = response.read()
        with open(cache, "wb") as f:
            f.write(content)
    else:
        with open(cache, "rb") as f:
            content = f.read()
.. code:: ipython3

    type(content)

.. parsed-literal::

    bytes
We need to convert it into ``str``. The encoding should be utf-8.
.. code:: ipython3

    page = content.decode("utf-8")
    type(page)

.. parsed-literal::

    str
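A minimal sketch of that conversion; the byte strings below are made-up examples, not taken from the downloaded page.

```python
# The cached page arrives as bytes; decoding with utf-8 yields str.
raw = b"d\xc3\xa9cembre 2010"      # utf-8 encoded bytes
text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__)  # bytes str

# If the bytes are not valid utf-8 (e.g. latin-1), a strict decode raises
# UnicodeDecodeError; errors="replace" keeps going by inserting the
# replacement character instead.
legacy = b"\xe9t\xe9"              # "été" encoded in latin-1
print(legacy.decode("latin-1"))
print(legacy.decode("utf-8", errors="replace"))
```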
We catch the exception if the parser raises one.
.. code:: ipython3

    try:
        js = HTMLtoJSONParser.to_json(page)
        error = False
    except Exception as e:
        error = True
        print(e)

.. parsed-literal::

    html is malformed around line: 66 (it might be because of a tag not closed)
Let’s see:
.. code:: ipython3

    if error:
        lines = page.split("\n")
        line = 42
        around = 2
        begin = max(0, line - around)
        end = min(len(lines), line + around)
        for i in range(begin, end):
            print("%03d %s" % (i, lines[i]))
    else:
        print("No error.")

.. parsed-literal::

    040
    041
    042
    043
HTML is very often malformed, and browsers are built to tolerate it.
That's why modules such as
`beautifulsoup `__ exist.
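A small sketch of that tolerance, with a hypothetical ``TagCounter``: the standard ``html.parser`` simply reports the events it sees and never complains about the unclosed ``<p>`` tags.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Keeps separate lists of opened and closed tags."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


counter = TagCounter()
# both <p> elements are left unclosed: no exception is raised
counter.feed("<div><p>first<p>second</div>")
print(counter.opened, counter.closed)  # ['div', 'p', 'p'] ['div']
```

Detecting the mismatch, as ``HTMLtoJSONParser`` does above, is the caller's job.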
With beautifulsoup
------------------
It is very easy to extract all URLs.
.. code:: ipython3

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(content, 'html.parser')
.. code:: ipython3

    url = list(soup.find_all("a"))
    url[:4]

.. parsed-literal::

    [Data.gouv.fr,
     Découvrez l'OpenData,
     En tant que citoyen,
     En tant que producteur]
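For comparison, the same extraction can be done with the standard library alone; the ``LinkCollector`` class and the sample markup below are illustrative, not taken from the downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="/fr/">home</a> <a href="https://example.com">ext</a></p>')
print(collector.links)  # ['/fr/', 'https://example.com']
```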
To get back to JSON without changing my code too much, I call
`prettify `__
on the page before running the parser above.
.. code:: ipython3

    with open("clean_content.html", "w", encoding="utf-8") as f:
        f.write(soup.prettify())
    js = HTMLtoJSONParser.to_json(soup.prettify())
Now, I use javascript to go through it.
.. code:: ipython3

    from jyquickhelper import JSONJS
    JSONJS(js)
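The same nested dictionary can also be walked in plain Python. The sketch below assumes the ``__parent__`` back-references were removed by ``clean`` (as the parser above does); ``iter_tags`` and the sample dict are illustrative.

```python
def iter_tags(node, path=()):
    """Yield (path, text) for every element of the nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "":
                # "" holds the accumulated text of the current element
                yield path, value
            elif key != "__parent__" and not key.startswith("#"):
                # "#name" keys are attributes; everything else is a child tag
                yield from iter_tags(value, path + (key,))
    elif isinstance(node, list):
        # repeated sibling tags are stored as a list
        for item in node:
            yield from iter_tags(item, path)


# illustrative sample mimicking the parser's output structure
sample = {"html": {"body": {"": "hi", "a": [{"": "x", "#href": "/x"}, {"": "y"}]}}}
for path, text in iter_tags(sample):
    print("/".join(path), "->", text)
```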