module data.wikipedia

Short summary

module mlstatpy.data.wikipedia

Functions to retrieve data from Wikipedia

source on GitHub

Functions

Function – truncated documentation

  • download_dump – Downloads Wikipedia dumps from dumps.wikimedia.org/frwiki/latest/. …

  • download_pageviews – Downloads Wikipedia page counts for a precise date, down to the hour; the URL follows the pattern …

  • download_titles – Downloads Wikipedia titles from dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz. …

  • enumerate_titles – Enumerates titles from a file.

  • normalize_wiki_text – Normalizes a text such as a Wikipedia title.

Documentation



mlstatpy.data.wikipedia.download_dump(country, name, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)

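Downloads Wikipedia dumps from dumps.wikimedia.org/frwiki/latest/.

Parameters:
  • country – country (language edition) of the wiki; the URL above is the French one (frwiki)

  • name – name of the stream to download

  • folder – folder where the file is saved

  • unzip – if True, unzips the downloaded file

  • timeout – timeout for the download

  • overwrite – if True, overwrites an existing file

  • fLOG – logging function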
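The folder, unzip, timeout and overwrite parameters suggest a download-then-decompress workflow. The sketch below illustrates that workflow with the standard library only. It is not the mlstatpy implementation: the helper name `fetch` is hypothetical, and treating a negative timeout as "no timeout" is an assumption made for this example.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch(url, folder=".", unzip=True, timeout=-1, overwrite=False):
    """Download `url` into `folder` and optionally decompress a .gz file.

    Hypothetical helper written for illustration, not the mlstatpy code.
    Returns the path of the decompressed file, or of the download itself
    when nothing was decompressed.
    """
    # The local file keeps the last component of the URL as its name.
    target = Path(folder) / url.rsplit("/", 1)[-1]

    if overwrite or not target.exists():
        # Assumption: a negative timeout means "wait indefinitely".
        limit = None if timeout < 0 else timeout
        with urllib.request.urlopen(url, timeout=limit) as response:
            with open(target, "wb") as out:
                shutil.copyfileobj(response, out)

    if unzip and target.suffix == ".gz":
        unzipped = target.with_suffix("")  # "x.xml.gz" -> "x.xml"
        if overwrite or not unzipped.exists():
            with gzip.open(target, "rb") as src, open(unzipped, "wb") as dst:
                shutil.copyfileobj(src, dst)
        return unzipped

    return target
```

Skipping the download when the file already exists keeps repeated runs cheap, which matters for dumps that can be large. Passing overwrite=True forces a fresh copy.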


mlstatpy.data.wikipedia.download_pageviews(dt, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)

Downloads Wikipedia page counts for a precise date, down to the hour. The URL follows the pattern:

https://dumps.wikimedia.org/other/pageviews/%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz

Parameters:
  • dt – date and hour of the file to download, as a datetime

  • folder – folder where the file is saved

  • unzip – if True, unzips the downloaded file

  • timeout – timeout for the download

  • overwrite – if True, overwrites an existing file

  • fLOG – logging function

Returns:

filename

More information is available on the pageviews page.
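The URL pattern documented for download_pageviews uses strftime directives, so a datetime can be expanded into a concrete URL directly. The following is a minimal sketch of that expansion. The helper name `pageviews_url` is hypothetical, and the pattern is copied as documented here without checking it against the live server.

```python
from datetime import datetime

# Pattern copied verbatim from the documentation of download_pageviews.
PATTERN = (
    "https://dumps.wikimedia.org/other/pageviews/"
    "%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz"
)


def pageviews_url(dt):
    """Expand the documented pattern for a given datetime (hypothetical helper)."""
    # Minutes and seconds are ignored: the pattern hard-codes "0000" after the hour.
    return dt.strftime(PATTERN)


print(pageviews_url(datetime(2016, 5, 1, 13)))
# https://dumps.wikimedia.org/other/pageviews/2016/2016-05/pagecounts-20160501-130000.gz
```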


mlstatpy.data.wikipedia.download_titles(country, folder='.', unzip=True, timeout=-1, overwrite=False, fLOG=<function noLOG>)

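Downloads Wikipedia titles from dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz.

Parameters:
  • country – country (language edition) of the wiki; the URL above is the French one (frwiki)

  • folder – folder where the file is saved

  • unzip – if True, unzips the downloaded file

  • timeout – timeout for the download

  • overwrite – if True, overwrites an existing file

  • fLOG – logging function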


mlstatpy.data.wikipedia.enumerate_titles(filename, norm=True, encoding='utf8')

Enumerates titles from a file.

Parameters:
  • filename – name of the file containing the titles

  • norm – if True, normalizes the titles inside the function

  • encoding – encoding of the file
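As an illustration of what enumerating titles can look like, the sketch below reads a file with one title per line and yields the titles lazily. It is not the mlstatpy implementation. The names `iter_titles` and `simple_normalize` are hypothetical, and the normalization rule (underscores become spaces) is an assumption that may differ from what normalize_wiki_text actually does.

```python
def simple_normalize(title):
    """Hypothetical normalizer: the underscore-to-space rule is an assumption."""
    return title.replace("_", " ").strip()


def iter_titles(filename, norm=True, encoding="utf8"):
    """Yield one title per non-empty line of `filename` (illustrative sketch)."""
    with open(filename, "r", encoding=encoding) as f:
        for line in f:
            title = line.rstrip("\n")
            if not title:
                continue  # skip blank lines
            yield simple_normalize(title) if norm else title
```

A generator keeps memory use flat even when the titles file holds millions of lines, because only one line is held at a time.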


mlstatpy.data.wikipedia.normalize_wiki_text(text)

Normalizes a text such as a Wikipedia title.

Parameters:

text – text to normalize

Returns:

normalized text
