module data.wikipedia

Short summary

module mlstatpy.data.wikipedia

Functions to retrieve data from Wikipedia

Functions

  • download_dump – Downloads Wikipedia dumps from https://dumps.wikimedia.org/frwiki/latest/.
  • download_pageviews – Downloads Wikipedia page counts for a given date, down to the hour.
  • download_titles – Downloads Wikipedia titles from https://dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz.
  • enumerate_titles – Enumerates titles from a file.
  • normalize_wiki_text – Normalizes a text such as a Wikipedia title.

Documentation

mlstatpy.data.wikipedia.download_dump(country, name, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)

Downloads Wikipedia dumps from https://dumps.wikimedia.org/frwiki/latest/.

Parameters
  • country – country code of the wiki (fr in the URL above)
  • name – name of the stream to download
  • folder – folder where the file is downloaded
  • unzip – if True, unzip the downloaded file
  • timeout – download timeout
  • overwrite – if True, overwrite an existing file
  • fLOG – logging function
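
How unzip and overwrite might interact can be pictured with a small standard-library sketch. This is not the mlstatpy implementation: the helper name gunzip_once and its skip-if-present behaviour are assumptions made for illustration, and it works on a local .gz file rather than downloading anything.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_once(gz_path, overwrite=False):
    """Decompress ``gz_path`` next to itself and return the output path.

    Hypothetical helper: the work is skipped when the output file
    already exists and ``overwrite`` is False.
    """
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # "dump.xml.gz" -> "dump.xml"
    if out_path.exists() and not overwrite:
        return out_path
    # Stream the decompressed bytes so large dumps never sit fully in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Streaming with shutil.copyfileobj matters here because Wikipedia dumps can be far larger than available memory.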

mlstatpy.data.wikipedia.download_pageviews(dt, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)

Downloads Wikipedia page counts for a given date, down to the hour. The URL follows the pattern:

https://dumps.wikimedia.org/other/pageviews/%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz

Parameters
  • dt – datetime selecting the date and hour
  • folder – folder where the file is downloaded
  • unzip – if True, unzip the downloaded file
  • timeout – download timeout
  • overwrite – if True, overwrite an existing file
  • fLOG – logging function

Returns

filename

More information is available on the pageviews page.
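
The URL pattern above uses strftime directives, so one plausible way to expand it for a given hour is shown below. The function name pageviews_url is hypothetical; only the pattern itself comes from the documentation, and the sketch builds the URL without downloading anything.

```python
from datetime import datetime

# Pattern copied verbatim from the documentation above.
PAGEVIEWS_PATTERN = (
    "https://dumps.wikimedia.org/other/pageviews/"
    "%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz"
)


def pageviews_url(dt):
    """Expand the documented pattern for the hour containing ``dt``.

    Minutes and seconds are ignored because the pattern hard-codes "0000".
    """
    return dt.strftime(PAGEVIEWS_PATTERN)


url = pageviews_url(datetime(2016, 5, 3, 14, 25))
print(url)
```

Any two datetimes within the same hour map to the same URL, which is why dt only needs to be precise to the hour.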

mlstatpy.data.wikipedia.download_titles(country, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)

Downloads Wikipedia titles from https://dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz.

Parameters
  • country – country code of the wiki (fr in the URL above)
  • folder – folder where the file is downloaded
  • unzip – if True, unzip the downloaded file
  • timeout – download timeout
  • overwrite – if True, overwrite an existing file
  • fLOG – logging function
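
As a rough illustration of how country could select the wiki, the sketch below substitutes the code into the documented URL. Both the helper name titles_url and the substitution rule are assumptions: the documentation only shows the fr case, and the file name is copied from it without being checked against the server.

```python
# Template derived from the documented URL, with "fr" replaced by a placeholder.
# Assumption: other wikis follow the same "<code>wiki" directory naming.
TITLES_URL = (
    "https://dumps.wikimedia.org/{country}wiki/latest/"
    "latest-all-titles-in-ns0.gz"
)


def titles_url(country):
    """Return the titles URL for a wiki code such as "fr" (hypothetical helper)."""
    return TITLES_URL.format(country=country)


print(titles_url("fr"))
```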

mlstatpy.data.wikipedia.enumerate_titles(filename, norm=True, encoding='utf8')

Enumerates titles from a file.

Parameters
  • filename – name of the file containing the titles
  • norm – if True, normalize each title inside the function
  • encoding – encoding of the file
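
A lazy enumeration of this kind might look like the generator below. It is a sketch under stated assumptions, not the library's code: it assumes one title per line, skips empty lines, and takes the normalization as an optional callable (the name iter_titles and the normalize parameter are hypothetical).

```python
def iter_titles(filename, normalize=None, encoding="utf8"):
    """Yield titles one by one from a text file (hypothetical helper).

    Assumes one title per line; empty lines are skipped.
    """
    with open(filename, "r", encoding=encoding) as f:
        for line in f:
            title = line.rstrip("\r\n")
            if not title:
                continue
            # Apply the caller-supplied normalization only when one is given.
            yield normalize(title) if normalize else title
```

Because the function is a generator, a titles file with millions of lines can be processed without loading it entirely into memory.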

mlstatpy.data.wikipedia.normalize_wiki_text(text)

Normalizes a text such as a Wikipedia title.

Parameters
  • text – text to normalize

Returns

normalized text
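
The documentation does not say which transformations are applied. Since Wikipedia titles in dumps and URLs use underscores where the displayed title has spaces, one plausible normalization is sketched below. The function name normalize_title_sketch and the exact rules are assumptions and may differ from what normalize_wiki_text actually does.

```python
def normalize_title_sketch(text):
    """Turn a raw title such as "Tour_Eiffel" into display form.

    Assumed rules (not taken from mlstatpy): underscores become spaces
    and surrounding whitespace is removed.
    """
    return text.replace("_", " ").strip()


print(normalize_title_sketch("Tour_Eiffel"))
```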
