module data.wikipedia
Short summary#
module mlstatpy.data.wikipedia
Functions to retrieve data from Wikipedia
Functions#
function | truncated documentation
---|---
download_dump | Downloads Wikipedia dumps from dumps.wikimedia.org/frwiki/latest/. …
download_pageviews | Downloads Wikipedia page counts for a precise date (down to the hour), the URL follows the pattern
download_titles | Downloads Wikipedia titles from dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz. …
enumerate_titles | Enumerates titles from a file.
normalize_wiki_text | Normalizes a text such as a Wikipedia title.
Documentation#
Functions to retrieve data from Wikipedia
- mlstatpy.data.wikipedia.download_dump(country, name, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)#
Downloads Wikipedia dumps from dumps.wikimedia.org/frwiki/latest/.
- Parameters:
country – country
name – name of the stream to download
folder – where to download
unzip – unzip the file
timeout – timeout
overwrite – overwrite
fLOG – logging function
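The `unzip` and `overwrite` flags suggest a decompress-after-download step that can skip work when the output already exists. The sketch below illustrates that idea with the standard library only. `gunzip_file` is a hypothetical helper written for this example, not part of mlstatpy, and the library's actual behaviour may differ.

```python
import gzip
import os
import shutil


def gunzip_file(gz_path, overwrite=False):
    """Hypothetical helper: decompress ``gz_path`` next to itself.

    Returns the path of the decompressed file. An existing output
    file is kept as-is unless ``overwrite`` is True.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError(f"expected a .gz file, got {gz_path!r}")
    out_path = gz_path[:-3]  # drop the ".gz" suffix
    if os.path.exists(out_path) and not overwrite:
        # Nothing to do: reuse the file from a previous run.
        return out_path
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, do not load in memory
    return out_path
```

Streaming with `shutil.copyfileobj` matters here because Wikipedia dumps can be far larger than available memory.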
- mlstatpy.data.wikipedia.download_pageviews(dt, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)#
Downloads Wikipedia page counts for a precise date (down to the hour). The URL follows the pattern:
https://dumps.wikimedia.org/other/pageviews/%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz
- Parameters:
dt – datetime
folder – where to download
unzip – unzip the file
timeout – timeout
overwrite – overwrite
fLOG – logging function
- Returns:
filename
More information is available on the pageviews page.
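The URL pattern quoted for `download_pageviews` is a `strftime` format string, so the address for a given hour can be derived from a `datetime` directly. The sketch below shows only that step. `pageviews_url` and `PAGEVIEWS_PATTERN` are hypothetical names chosen for this example, and the pattern is copied from the documentation above rather than checked against the live server.

```python
from datetime import datetime

# Pattern taken verbatim from the documentation above.
PAGEVIEWS_PATTERN = (
    "https://dumps.wikimedia.org/other/pageviews/"
    "%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz"
)


def pageviews_url(dt):
    """Hypothetical helper: format the pattern for the hour of ``dt``."""
    # Minutes and seconds of ``dt`` are ignored: the pattern
    # hard-codes "0000" after the hour.
    return dt.strftime(PAGEVIEWS_PATTERN)


url = pageviews_url(datetime(2016, 5, 1, 13))
print(url)
```

Because the pattern stops at the hour, two datetimes within the same hour map to the same file.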
- mlstatpy.data.wikipedia.download_titles(country, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)#
Downloads Wikipedia titles from dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz.
- Parameters:
country – country
folder – where to download
unzip – unzip the file
timeout – timeout
overwrite – overwrite
fLOG – logging function
- mlstatpy.data.wikipedia.enumerate_titles(filename, norm=True, encoding='utf8')#
Enumerates titles from a file.
- Parameters:
filename – filename
norm – normalize in the function
encoding – encoding
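To make the role of `norm` and `encoding` concrete, here is a minimal sketch of a title enumerator. It assumes the file holds one title per line, which is not stated above. `iter_titles` and its `normalizer` argument are hypothetical stand-ins for illustration and do not reproduce the library's implementation.

```python
def iter_titles(filename, normalizer=None, encoding="utf8"):
    """Hypothetical sketch: yield one title per non-empty line.

    ``normalizer`` is an optional callable applied to each title,
    playing the role of the ``norm`` flag described above.
    """
    with open(filename, "r", encoding=encoding) as f:
        for line in f:
            title = line.rstrip("\r\n")
            if not title:
                continue  # skip blank lines
            yield normalizer(title) if normalizer else title
```

Writing it as a generator keeps memory use flat, which suits title files with millions of entries.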
- mlstatpy.data.wikipedia.normalize_wiki_text(text)#
Normalizes a text such as a Wikipedia title.
- Parameters:
text – text to normalize
- Returns:
normalized text
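The documentation does not say which transformations the normalization applies. As an illustration only, the sketch below handles the one convention that is certain for Wikipedia dump titles: spaces are stored as underscores. `normalize_title` is a hypothetical function for this example, and `normalize_wiki_text` may do more or something different.

```python
def normalize_title(text):
    """Hypothetical normalizer: turn dump-style underscores into spaces."""
    # Assumption: only underscores need handling; the real
    # normalize_wiki_text may apply further rules.
    return text.replace("_", " ").strip()


print(normalize_title("Tour_Eiffel"))
```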